Long Short-Term Memory
AI specialist Jürgen Schmidhuber on backpropagation, vanishing and exploding gradient problems and how LSTM he...
This lecture is part of the collaboration between Serious Science and the Technology Contests Up Great READ//ABLE.
Our second lecture is going to focus on deep feedforward neural networks without recurrent connections. Early supervised feedforward networks in the 1950s had just a single layer: one input layer directly connected to one output layer, and they were essentially variants of linear regressors dating back at least two centuries to Gauss and Legendre around 1800. Actually, around 1800, Gauss performed the first famous example of pattern recognition when he got data points from an asteroid called Ceres, which had disappeared behind the Sun, and the big problem was then to find it again, to predict where it was going to reappear. Gauss put all the knowledge about celestial mechanics into a prediction machine which had parameters, and then he used the method of least squares, which is attributed to both Legendre and Gauss, to adjust these parameters, to learn them from the observations such that he could predict the new observations of Ceres. That’s what made him famous at a very young age. The perceptron of Rosenblatt in the 1950s was very similar to these linear regressors.
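To make the method of least squares concrete, here is a minimal sketch (not Gauss’s actual calculation; the data and the simple linear model are invented for illustration) of fitting a model’s parameters to noisy observations and then predicting an unseen point:

```python
import numpy as np

# Illustrative only: fit a linear model y = a*x + b to noisy observations
# by the method of least squares, then predict an unseen point.
rng = np.random.default_rng(0)
x = np.linspace(0.0, 10.0, 20)
y = 2.5 * x + 1.0 + rng.normal(scale=0.5, size=x.shape)  # synthetic "observations"

# Design matrix with a column of ones for the intercept.
A = np.column_stack([x, np.ones_like(x)])

# Least squares: choose parameters minimizing the sum of squared errors.
params, *_ = np.linalg.lstsq(A, y, rcond=None)
a, b = params

x_new = 12.0
print(f"learned a={a:.2f}, b={b:.2f}, predicted y({x_new}) = {a * x_new + b:.2f}")
```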
Deep learning started in 1965 in Ukraine, which was then part of the USSR, a country that at the time was leading in many fields of science and technology. They had just started the space age: they had the first human-made object on the Moon, the biggest bomb of all time, the first person in space, and the first woman in space. But maybe more importantly, they had many of the best mathematicians, and two of them were Ivakhnenko and Lapa, who built the first deep networks that reliably learned.
So in 1965 Ivakhnenko and Lapa published the first general working learning algorithm for supervised deep feedforward multilayer perceptrons. Their method was still widely used in the new millennium.
Their units had polynomial activation functions combining additions and multiplications in so-called Kolmogorov-Gabor polynomials, they used regression to adjust the parameters, and the outputs of the first layer became the inputs of the second layer, and so on. In 1971, Ivakhnenko already described a deep network with eight layers (which is deep even by the standards of the new millennium) trained by this method, which was still used in the 2000s. How does it work? Given a training set of input vectors with corresponding target output vectors, layers are grown incrementally: first one layer, then the next, and so on. They are trained by regression analysis and then pruned with the help of a separate validation set, where regularization is used to weed out all those feature detectors in the current layer which are superfluous. So, the validation set is used to identify superfluous units, which are removed. The number of layers and the number of units per layer can be learned in a problem-dependent fashion.
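A rough sketch of this layer-wise scheme might look like the following; it is a simplified, hypothetical version of such a method (all function names and the toy data are invented here), using second-order polynomial units fitted by least squares, validation-based pruning of superfluous units, and growth that stops once the validation error no longer improves:

```python
import numpy as np

def fit_quadratic_unit(x1, x2, target):
    """Least-squares fit of a second-order (Kolmogorov-Gabor style) unit:
    z = w0 + w1*x1 + w2*x2 + w3*x1*x2 + w4*x1**2 + w5*x2**2."""
    A = np.column_stack([np.ones_like(x1), x1, x2, x1 * x2, x1**2, x2**2])
    w, *_ = np.linalg.lstsq(A, target, rcond=None)
    return w

def unit_output(w, x1, x2):
    A = np.column_stack([np.ones_like(x1), x1, x2, x1 * x2, x1**2, x2**2])
    return A @ w

def grow_layer(train_X, val_X, train_y, val_y, keep=4):
    """Fit one candidate unit per pair of current features, then prune:
    keep only the units that generalize best on the validation set."""
    candidates = []
    n = train_X.shape[1]
    for i in range(n):
        for j in range(i + 1, n):
            w = fit_quadratic_unit(train_X[:, i], train_X[:, j], train_y)
            val_pred = unit_output(w, val_X[:, i], val_X[:, j])
            val_err = np.mean((val_pred - val_y) ** 2)
            candidates.append((val_err, i, j, w))
    candidates.sort(key=lambda c: c[0])          # superfluous units are dropped
    best = candidates[:keep]
    new_train = np.column_stack([unit_output(w, train_X[:, i], train_X[:, j])
                                 for _, i, j, w in best])
    new_val = np.column_stack([unit_output(w, val_X[:, i], val_X[:, j])
                               for _, i, j, w in best])
    return new_train, new_val, best[0][0]        # best validation error so far

# Toy data: 4 input features, target is a nonlinear function of them.
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 4))
y = np.sin(X[:, 0]) + X[:, 1] * X[:, 2] + 0.1 * rng.normal(size=200)
train_X, val_X, train_y, val_y = X[:150], X[150:], y[:150], y[150:]

# Grow layers until the validation error stops improving.
prev_err = np.inf
for depth in range(1, 9):
    train_X, val_X, err = grow_layer(train_X, val_X, train_y, val_y)
    print(f"layer {depth}: best validation MSE = {err:.4f}")
    if err >= prev_err:
        break
    prev_err = err
```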
Like later deep neural networks, Ivakhnenko’s and Lapa’s nets also learned to create hierarchically distributed internal representations of the incoming data. They did not yet use the technique now known as backpropagation; they did not do supervised learning through pure gradient descent, because what Ivakhnenko had was incremental regression analysis to train the weights. The alternative was also described around this time. If you want to do gradient descent on an objective function, such as the total classification error on a given training set of input patterns and the corresponding labels, then what you want to use today is the method called backpropagation, or the reverse mode of automatic differentiation, which was published in 1970 by a Finnish master’s student, Seppo Linnainmaa, in Helsinki.
So the modern, efficient version of backpropagation for sparse networks, including Fortran code, is due to Linnainmaa in 1970. Interestingly, Finland was a border state of the Soviet Union, where the first deep learning had emerged in 1965. What Linnainmaa did was extend Kelley’s early work from the 1960s, which already used basic concepts of backpropagation. The nice thing is that the complexity of computing the derivatives of the output error with respect to each neural network weight in the system is proportional to the number of weights. So you have the forward pass, whose complexity is proportional to the number of weights, and you have the backward pass, whose complexity is also proportional to the number of weights, and that’s the method still used today. Werbos, in 1982, was the first to apply that method to neural networks; it’s a general method that you can use for all kinds of applications, for example, in Google’s TensorFlow and similar software packages.
This method, the reverse mode of automatic differentiation or backpropagation, is used to adjust the adaptable parameters in any computational graph with differentiable nodes.
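As a rough illustration of the idea (not Linnainmaa’s original code; the tiny network, task, and names are invented), here is a minimal backpropagation sketch for a one-hidden-layer network, where the forward pass and the backward pass each take time proportional to the number of weights:

```python
import numpy as np

# Toy network: x -> hidden (tanh) -> output, trained on a tiny synthetic task.
rng = np.random.default_rng(0)
X = rng.normal(size=(32, 3))                  # 32 input vectors
Y = (X.sum(axis=1, keepdims=True) > 0) * 1.0  # target outputs

W1 = rng.normal(scale=0.5, size=(3, 5))
W2 = rng.normal(scale=0.5, size=(5, 1))

for step in range(200):
    # Forward pass: cost proportional to the number of weights.
    h = np.tanh(X @ W1)
    out = h @ W2
    err = out - Y
    loss = 0.5 * np.mean(err ** 2)

    # Backward pass (reverse mode): propagate derivatives of the loss
    # from the output back toward the inputs, again linear in the weights.
    d_out = err / len(X)
    dW2 = h.T @ d_out
    d_h = d_out @ W2.T * (1.0 - h ** 2)       # tanh'(a) = 1 - tanh(a)^2
    dW1 = X.T @ d_h

    # Plain gradient descent on the adaptable parameters.
    W1 -= 0.5 * dW1
    W2 -= 0.5 * dW2

print(f"final loss: {loss:.4f}")
```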
Between 1980 and 1990, computers became 10,000 times faster per dollar than those of 1960-1970, when backpropagation was invented and developed. That was good enough for the first experiments with backpropagation on relatively cheap desktop computers, which appeared around the mid-1980s. Rumelhart and colleagues then showed that this method really can learn internal representations in hidden layers. By 2003, deep backpropagation-based standard feedforward neural networks with up to seven layers were already used to successfully classify high-dimensional data; there is a 2003 reference by Vieira and Barradas on that.
The 1970s also saw the birth of the convolutional neural network architecture, the CNN architecture. That happened in Japan. The CNN architecture was introduced by Fukushima, who called it the Neocognitron, in 1979. It was inspired by the neurophysiological insights of Hubel and Wiesel, and today such architectures are widely used for computer vision. What’s happening? Each unit in the first layers of such a CNN has a typically rectangular receptive field and a weight vector connected to that field. The field acts as a filter which is shifted step by step across an image, that is, across a two-dimensional array of input values, such that the network perceives all the pixels of the image in a systematic fashion. Usually, you have not just one such filter but many. The resulting array of subsequent activation events of such a unit can then provide inputs to higher-level units, and so on.
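To make the sliding-filter idea concrete, here is a minimal sketch (illustrative only, with a made-up image and filter) of one small filter being shifted step by step across a two-dimensional input array, reusing the same weights at every position:

```python
import numpy as np

def convolve2d_valid(image, kernel):
    """Shift a small filter across the image, computing one activation
    per position (a 'valid' convolution: no padding, stride 1)."""
    H, W = image.shape
    kh, kw = kernel.shape
    out = np.zeros((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            patch = image[i:i + kh, j:j + kw]   # receptive field at this position
            out[i, j] = np.sum(patch * kernel)  # same weights reused everywhere
    return out

image = np.arange(36, dtype=float).reshape(6, 6)   # toy 6x6 "image"
edge_filter = np.array([[1.0, -1.0],               # toy 2x2 filter
                        [1.0, -1.0]])
print(convolve2d_valid(image, edge_filter))        # 5x5 map of activations
```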
Due to massive weight replication (you copy the weights of these filters again and again to the next instance of the same filter), relatively few parameters may be necessary to describe the behaviour of such convolutional layers. These typically feed into so-called downsampling layers: a big previous layer feeds into a smaller layer, which is a downsampling layer in the sense that it carries, at lower resolution, the same information as the previous layer. There are fixed-weight connections originating from physical neighbours in the convolutional layers below, so within this downsampling layer you find units that preserve physical neighbourhoods. Downsampling units use spatial averaging to become active if at least one of their inputs is active, and their responses are then insensitive to certain small image shifts, which is very useful in many vision applications.
Weng, in 1993, later replaced the spatial averaging of Fukushima with something which is now widely used, called max pooling, a central ingredient of many CNNs. Here, a two-dimensional layer or array of unit activations is partitioned into small rectangular sub-arrays, and each of them is treated very simply: each is replaced in a downsampling layer by the activation of its most active unit. In 1987, Waibel combined neural networks with convolutions, weight sharing, and backpropagation, because Fukushima didn’t use backpropagation; he used other ways of adapting the parameters of his system. Waibel was the one who combined these two concepts: gradient descent through backpropagation and convolutions. That also happened in Japan.
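Here is a minimal sketch (illustrative only, with made-up activations) contrasting the two downsampling operations just described, spatial averaging and max pooling, over non-overlapping 2x2 sub-arrays of an activation map:

```python
import numpy as np

def pool2x2(activations, mode="max"):
    """Partition the activation map into 2x2 sub-arrays and replace each
    by either its average (spatial averaging) or its most active unit (max pooling)."""
    H, W = activations.shape
    blocks = activations.reshape(H // 2, 2, W // 2, 2)
    if mode == "max":
        return blocks.max(axis=(1, 3))
    return blocks.mean(axis=(1, 3))

activations = np.array([[1.0, 2.0, 0.0, 1.0],
                        [3.0, 4.0, 1.0, 0.0],
                        [0.0, 1.0, 5.0, 6.0],
                        [1.0, 0.0, 7.0, 8.0]])

print(pool2x2(activations, mode="mean"))  # spatial averaging (Fukushima-style)
print(pool2x2(activations, mode="max"))   # max pooling (now widely used)
```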
Just one decade ago, in 2010, many people thought that deep neural networks could not learn much without unsupervised pre-training, a technique which I introduced myself in 1991 and which was later also championed by others. Unsupervised pre-training just means you pre-process the data such that it becomes more compact, and then downstream supervised learning becomes easier. In fact, around 2007, one well-known researcher said that nobody in their right mind would ever suggest using plain gradient descent through backpropagation to train a deep neural network. I won’t mention this well-known researcher by name except to say that he is Doctor Hinton.
But then my team, with my outstanding postdoc Dan Cireșan, was able to show in 2010 that indeed it’s possible: you can train really deep networks by backpropagation without any unsupervised pre-training. Back then, our team broke a famous benchmark record on a benchmark that had been used for decades.
We achieved this by greatly accelerating traditional multilayer perceptrons on highly parallel graphics processing units, or GPUs, going beyond the important GPU work of Jung and Oh, who in 2004 apparently were the first to have working implementations of neural networks on graphics processing units.
But then, in 2010, this was really fast enough to train these deep networks, which had seemed untrainable before. A reviewer called this a wake-up call to the machine learning community, and then everybody started doing this.
In the 2010s, this little supervised deep learning revolution quickly spread from Europe to North America and Asia. Our results set the stage for the recent decade of deep learning. In February 2011, our team extended the approach to deep convolutional neural networks, the CNNs that I mentioned before, and this greatly improved earlier work. Our so-called DanNet, named after Dan Cireșan, who was the first author of these publications, broke one record after another. In May 2011, DanNet was the first deep CNN to win a computer vision competition; in August 2011, it was the first CNN to win a vision contest with superhuman performance, and our team with Dan Cireșan, Ueli Meier and others kept winning computer vision contests in 2012 in medical imaging and other fields.
Subsequently, many researchers adopted this technique as well. By May 2015, we had the first extremely deep feedforward networks with not only 10, 20, or 30 layers but with more than 100 layers. That was the highway network, made possible by my students Rupesh Srivastava and Klaus Greff. A special case of the highway network is the ResNet, which has become very popular. The original successes required a precise understanding of the inner workings of GPUs; today, however, there are convenient software packages which shield the user from such details. Compute is now roughly 100 times cheaper than in 2010, when we were lucky enough to start this whole development.
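As a rough sketch of the idea behind highway layers (not the authors’ implementation; the toy weights, dimensions, and names are invented), each layer mixes a transformed signal with the unchanged input through a learned gate, and a residual (ResNet-style) layer corresponds to the special case where the carry path is fixed:

```python
import numpy as np

def highway_layer(x, W_h, b_h, W_t, b_t):
    """One highway layer: a learned transform gate T decides, per unit, how much
    of the transformed signal H(x) versus the unchanged input x passes through.
    Fixing the gate so that the input is simply carried and H(x) added on top
    gives a residual (ResNet-style) layer."""
    sigmoid = lambda a: 1.0 / (1.0 + np.exp(-a))
    H = np.tanh(x @ W_h + b_h)        # candidate transformation
    T = sigmoid(x @ W_t + b_t)        # transform gate in (0, 1)
    return T * H + (1.0 - T) * x      # gated mix of transform and carry

# Toy usage: stack many such layers; information can flow through the carry path.
rng = np.random.default_rng(0)
d = 8
x = rng.normal(size=(4, d))
for _ in range(100):                  # very deep stack of identical-width layers
    W_h = rng.normal(scale=0.1, size=(d, d))
    W_t = rng.normal(scale=0.1, size=(d, d))
    b_h = np.zeros(d)
    b_t = -2.0 * np.ones(d)           # negative gate bias initially favors carrying x
    x = highway_layer(x, W_h, b_h, W_t, b_t)
print(x.shape)
```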