AI specialist Jürgen Schmidhuber on the first deep networks, backpropagation and whether you can train a network without unsupervised pre-training
This lecture is part of the collaboration between Serious Science and the Technology Contests Up Great READ//ABLE.
Our second lecture is going to focus on deep feedforward neural networks without recurrent connections. Early supervised feedforward networks in the 1950s had only a single layer: one input layer connected directly to one output layer. They were essentially variants of linear regressors dating back at least two centuries, to Gauss and Legendre around 1800. In fact, around 1800 Gauss performed the first famous example of pattern recognition: he had data points from the asteroid Ceres, which disappeared behind the Sun, and the big problem was to find it again, to predict where it was going to reappear. Gauss put all the available knowledge about celestial mechanics into a prediction machine with adjustable parameters, and then he used the method of least squares, which is attributed to both Legendre and Gauss, to learn these parameters from the observations so that he could predict the new observations of Ceres. That is what made him famous at a very young age. The perceptron of Rosenblatt in the 1950s was very similar to these linear regressors.
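As a rough illustration of the idea only, not of Gauss's actual celestial-mechanics model, here is a minimal least-squares sketch in Python; all data and variable names are made up for the example. Parameters of a linear model are adjusted from noisy observations and then used to predict new ones.

```python
import numpy as np

# Illustrative least-squares fit (toy data, not Gauss's actual model):
# adjust parameters w of a linear model X @ w so that it best explains the
# observed targets y, then use w to predict new observations.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))                 # 50 observations, 3 input features
true_w = np.array([2.0, -1.0, 0.5])
y = X @ true_w + 0.1 * rng.normal(size=50)   # noisy observations

# Method of least squares: w minimizes ||X w - y||^2.
w, *_ = np.linalg.lstsq(X, y, rcond=None)

X_new = rng.normal(size=(5, 3))              # predict unseen observations
y_pred = X_new @ w
```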
Deep learning started in 1965 in Ukraine, which was then part of the USSR, a country that at the time was leading many fields of science and technology: it had just started the space age and had the first human-made object on the Moon, the biggest bomb of all time, the first person in space and the first woman in space. But perhaps more importantly, it had many of the best mathematicians, and two of them were Ivakhnenko and Lapa, who really had the first deep networks that reliably learned.
So in 1965 Ivakhnenko and Lapa published the first general working learning algorithm for supervised deep feedforward multilayer perceptrons. Their method was still widely used in the new millennium.
The activations of their units were polynomial: they had polynomial activation functions combining additions and multiplications in so-called Kolmogorov-Gabor polynomials, and they used regression to adjust the parameters; the outputs of the first layer became the inputs of the second layer, and so on. In 1971 Ivakhnenko already described a deep network with eight layers (which is deep even by the standards of the new millennium) trained by this method, which was still used in the 2000s. How does it work? Given a training set of input vectors with corresponding target output vectors, layers are grown incrementally: first the first layer, then the next layer, and so on. Each layer is trained by regression analysis and then pruned with the help of a separate validation set, where regularization is used to weed out those feature detectors in the current layer that are superfluous. So the validation set is used to identify superfluous units, which are removed. The number of layers and the number of units per layer can thus be learned in a problem-dependent fashion.
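For illustration, here is a heavily simplified Python sketch in the spirit of this layer-wise scheme; it is my own toy construction, not Ivakhnenko and Lapa's actual group method of data handling. Quadratic Kolmogorov-Gabor units are fitted pairwise by least squares, and a separate validation set is used to prune the superfluous ones before the surviving outputs feed the next layer.

```python
import itertools
import numpy as np

def fit_quadratic_unit(a, b, y):
    """Least-squares fit of y ~ c0 + c1*a + c2*b + c3*a*b + c4*a^2 + c5*b^2."""
    F = np.column_stack([np.ones_like(a), a, b, a * b, a * a, b * b])
    coef, *_ = np.linalg.lstsq(F, y, rcond=None)
    return coef

def unit_output(a, b, coef):
    F = np.column_stack([np.ones_like(a), a, b, a * b, a * a, b * b])
    return F @ coef

def gmdh_layer(X_train, y_train, X_val, y_val, keep=4):
    """Grow one layer: fit a quadratic unit for every pair of inputs,
    keep the `keep` units with the lowest validation error (pruning)."""
    candidates = []
    for i, j in itertools.combinations(range(X_train.shape[1]), 2):
        coef = fit_quadratic_unit(X_train[:, i], X_train[:, j], y_train)
        val_err = np.mean((unit_output(X_val[:, i], X_val[:, j], coef) - y_val) ** 2)
        candidates.append((val_err, i, j, coef))
    candidates.sort(key=lambda c: c[0])
    kept = candidates[:keep]
    new_train = np.column_stack([unit_output(X_train[:, i], X_train[:, j], c)
                                 for _, i, j, c in kept])
    new_val = np.column_stack([unit_output(X_val[:, i], X_val[:, j], c)
                               for _, i, j, c in kept])
    return new_train, new_val

# Usage on toy data: grow one layer whose outputs could feed the next layer.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 6))
y = X[:, 0] * X[:, 1] + 0.1 * rng.normal(size=200)
H_train, H_val = gmdh_layer(X[:150], y[:150], X[150:], y[150:])
```

Stacking such layers until the validation error stops improving is what gives the problem-dependent number of layers and units mentioned above.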
The modern efficient version of backpropagation for sparse networks, including Fortran code, is due to Linnainmaa in 1970. Interestingly, Finland borders the former Soviet Union, where the first deep learning emerged in 1965. Linnainmaa extended Kelley's early work from the 1960s, which already used basic concepts of backpropagation. The nice thing is that the complexity of computing the derivatives of the output error with respect to each neural network weight in the system is proportional to the number of weights. So you have the forward pass, whose complexity is proportional to the number of weights, and you have the backward pass, whose complexity is also proportional to the number of weights, and that is the method still used today. Werbos in 1982 was the first to apply the method specifically to neural networks; it is a general method that you can use for all kinds of applications, and it is at the heart of Google's TensorFlow and similar software packages.
This method, the reverse mode of automatic differentiation or backpropagation, is used to adjust the adaptable parameters in any computational graph with differentiable nodes.
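As a concrete toy illustration of the reverse mode, here is a minimal hand-written forward and backward pass for a two-layer network in Python; the network sizes and data are arbitrary. Note that both passes visit each weight a constant number of times, so their cost is proportional to the number of weights.

```python
import numpy as np

def forward_backward(x, target, W1, W2):
    """One forward and one backward pass through a tiny two-layer network.
    Both passes cost O(number of weights)."""
    # Forward pass
    h_pre = W1 @ x                  # hidden pre-activations
    h = np.tanh(h_pre)              # hidden activations
    y = W2 @ h                      # network output
    err = y - target                # output error
    loss = 0.5 * np.sum(err ** 2)

    # Backward pass (reverse mode): propagate the error from the output
    # back towards the input, reusing quantities from the forward pass.
    dW2 = np.outer(err, h)                      # dLoss/dW2
    dh = W2.T @ err                             # gradient w.r.t. hidden activations
    dh_pre = dh * (1.0 - np.tanh(h_pre) ** 2)   # through the tanh nonlinearity
    dW1 = np.outer(dh_pre, x)                   # dLoss/dW1
    return loss, dW1, dW2

# One gradient-descent step on random data (illustrative only).
rng = np.random.default_rng(0)
W1 = 0.1 * rng.normal(size=(8, 4))
W2 = 0.1 * rng.normal(size=(2, 8))
x, target = rng.normal(size=4), rng.normal(size=2)
loss, dW1, dW2 = forward_backward(x, target, W1, W2)
W1 -= 0.01 * dW1
W2 -= 0.01 * dW2
```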
Between 1980 and 1990 computers became 10,000 times faster per dollar than those of the 1960s and 1970s, when backpropagation was invented and developed. That was good enough for first experiments with backpropagation on relatively cheap desktop computers, which appeared around the mid-1980s. Rumelhart and colleagues then showed that this method really can learn internal representations in hidden layers. By 2003, deep backpropagation-based standard feedforward neural networks with up to seven layers were already used to successfully classify high-dimensional data; there is a reference by Vieira and Barradas of 2003 on that.
In convolutional architectures such as Fukushima's, due to massive weight replication (the weights of each filter are copied again and again to the next instance of the same filter), relatively few parameters may be necessary to describe the behavior of such convolutional layers. They typically feed into so-called downsampling layers: a big previous layer feeds into a smaller layer that carries essentially the same information at lower resolution. The downsampling layer has fixed-weight connections originating from physical neighbors in the convolutional layer below, so it consists of neighborhood-preserving units. Downsampling units use spatial averaging to become active if at least one of their inputs is active, and their responses are then insensitive to certain small image shifts, which is very useful in many vision applications.
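Here is a minimal Python sketch of these two ingredients, weight replication and spatial-averaging downsampling, for a single filter on a single grey-scale image; it is an illustrative toy, not Fukushima's actual architecture.

```python
import numpy as np

def conv2d_single_filter(image, kernel):
    """Slide one shared 2-D filter over the image (weight replication):
    the same few kernel parameters describe the whole convolutional layer."""
    kh, kw = kernel.shape
    H, W = image.shape
    out = np.zeros((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

def average_downsample(fmap, size=2):
    """Downsample by spatial averaging over non-overlapping neighborhoods:
    the result carries roughly the same information at lower resolution
    and is insensitive to small shifts of the input."""
    H, W = fmap.shape
    H, W = H - H % size, W - W % size            # drop any ragged border
    blocks = fmap[:H, :W].reshape(H // size, size, W // size, size)
    return blocks.mean(axis=(1, 3))

image = np.random.rand(8, 8)
kernel = np.array([[1.0, 0.0], [0.0, -1.0]])     # one shared filter, 4 parameters
pooled = average_downsample(conv2d_single_filter(image, kernel))
```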
Weng in 1993 later replaced the spatial averaging of Fukushima by something which is now widely used, called max pooling, a central ingredient of many CNNs. Here a two-dimensional layer or array of unit activations is partitioned into small rectangular sub-arrays, and each sub-array is simply replaced in a downsampling layer by the activation of its most active unit. In 1987, Waibel combined neural networks with convolutions, weight sharing and backpropagation, because Fukushima didn't use backpropagation: he used other ways of adapting the parameters of his system. Waibel was the one who combined these two concepts, gradient descent through backpropagation and convolutions. That also happened in Japan.
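Max pooling itself is simple enough to sketch in a few lines of Python; this is an illustrative toy implementation for a single 2-D activation array, not any particular library's version.

```python
import numpy as np

def max_pool(fmap, size=2):
    """Partition the 2-D activation array into non-overlapping size x size
    sub-arrays and replace each by the activation of its most active unit."""
    H, W = fmap.shape
    H, W = H - H % size, W - W % size            # drop any ragged border
    blocks = fmap[:H, :W].reshape(H // size, size, W // size, size)
    return blocks.max(axis=(1, 3))

activations = np.arange(16.0).reshape(4, 4)
print(max_pool(activations))   # [[ 5.  7.]
                               #  [13. 15.]]
```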
But then, in 2010, my team with my outstanding postdoc Dan Cireșan was able to show that it is indeed possible: you can train really deep networks by backpropagation without any unsupervised pre-training. Back then our team broke a famous benchmark record, on a benchmark that had been used for decades.
We achieved this by greatly accelerating traditional multilayer perceptrons on highly parallel graphics processing units, or GPUs, going beyond the important work on GPUs by Jung and Oh, who in 2004 apparently were the first to have working implementations of neural networks on graphics processing units.
By 2010 this was fast enough to train those deep networks which had seemed untrainable before; a reviewer called this a wake-up call to the machine learning community, and then everybody started doing this.
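To make the point concrete, here is a toy Python sketch of the idea of training a deep plain multilayer perceptron purely by supervised backpropagation, with random initialization and no unsupervised pre-training phase. The actual 2010 result relied on a fast GPU implementation and much larger networks and datasets, none of which are shown here; the layer sizes and activation functions below are arbitrary choices for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
layer_sizes = [20, 64, 64, 64, 64, 64, 10]   # a deep plain MLP, no pre-training

# Initialize all weights randomly; there is no layer-wise unsupervised phase.
Ws = [rng.normal(scale=np.sqrt(2.0 / m), size=(n, m))
      for m, n in zip(layer_sizes[:-1], layer_sizes[1:])]

def forward(x):
    acts = [x]
    for W in Ws[:-1]:
        acts.append(np.maximum(0.0, W @ acts[-1]))   # ReLU hidden layers
    acts.append(Ws[-1] @ acts[-1])                   # linear output
    return acts

def backward(acts, err):
    grads, delta = [], err
    for k in range(len(Ws) - 1, -1, -1):
        grads.append(np.outer(delta, acts[k]))       # gradient for Ws[k]
        if k > 0:
            delta = (Ws[k].T @ delta) * (acts[k] > 0)  # back through ReLU
    return grads[::-1]

# One plain supervised gradient-descent step on a random example.
x, target = rng.normal(size=20), rng.normal(size=10)
acts = forward(x)
grads = backward(acts, acts[-1] - target)
for W, g in zip(Ws, grads):
    W -= 0.01 * g
```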
In the 2010s this little supervised deep learning revolution quickly spread from Europe to North America and Asia. Our results set the stage for the recent decade of deep learning. In February 2011 our team extended the approach to deep convolutional neural networks, the CNNs I mentioned before, and this greatly improved on earlier work. Our so-called DanNet, named after Dan Cireșan, who was the first author on these publications, broke one record after another. In May 2011 DanNet was the first deep CNN to win a computer vision competition; in August 2011 it was the first CNN to win a vision contest with superhuman performance, and our team with Dan Cireșan, Ueli Meier and others kept winning computer vision contests in 2012, in medical imaging and other fields.
Subsequently, many researchers adopted this technique as well. By May 2015 we had the first extremely deep feedforward networks with not just 10 or 20 or 30 layers but more than 100 layers: the highway networks, made possible by my students Rupesh Srivastava and Klaus Greff. A special case of the highway network is the ResNet, which has become very popular. The original successes required a precise understanding of the inner workings of GPUs; today, however, there are convenient software packages which shield the user from such details. Compute is now roughly 100 times cheaper than in 2010, when we were lucky enough to start this whole development.
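As a sketch of the core idea (my own minimal NumPy version, not the original implementation), a highway layer mixes a transformed signal H(x) with the unchanged input x through a learned transform gate; a residual, ResNet-style layer is the special case where the gates are effectively fixed open.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def highway_layer(x, W_h, b_h, W_t, b_t):
    """One highway layer: a transform gate T decides, per unit, how much of
    the transformed signal H(x) versus the unchanged input x to pass on."""
    H = np.tanh(W_h @ x + b_h)        # candidate transformation
    T = sigmoid(W_t @ x + b_t)        # transform gate in (0, 1)
    return H * T + x * (1.0 - T)      # carry gate = 1 - T

def residual_layer(x, W_h, b_h):
    """Special case with the gates effectively fixed open: y = H(x) + x."""
    return np.tanh(W_h @ x + b_h) + x

rng = np.random.default_rng(0)
n = 32
x = rng.normal(size=n)
W_h, b_h = 0.1 * rng.normal(size=(n, n)), np.zeros(n)
W_t, b_t = 0.1 * rng.normal(size=(n, n)), np.full(n, -2.0)  # bias gates towards carry
y = highway_layer(x, W_h, b_h, W_t, b_t)
```

The negative gate bias in this example simply nudges each layer towards carrying its input unchanged at the start of training, which is what makes it plausible to stack very many such layers.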