Long Short-Term Memory

AI specialist Jürgen Schmidhuber on backpropagation, the vanishing and exploding gradient problems, and how LSTM helps to improve speech recognition

November 23, 2020

This lecture is part of the collaboration between Serious Science and the Technology Contests Up Great READ//ABLE.

Our next little lecture is about long short-term memory (LSTM). There are extensions of backpropagation (the method first published by Seppo Linnainmaa in 1970) for supervised recurrent networks: not just feedforward networks, but recurrent networks. They are known as backpropagation through time. There, the recurrent network is unfolded into an FNN, a feedforward network that has essentially as many layers as there are time steps in the observed sequence of input vectors, which could be a speech signal. In speech, for example, you get roughly 100 input vectors per second (every 10 milliseconds, a new input vector) from the microphone.
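As a rough illustration of this unfolding, here is a minimal numpy sketch of backpropagation through time for a vanilla recurrent network; the sizes and the toy loss are placeholders chosen for brevity, not part of the lecture:

```python
import numpy as np

# Minimal sketch of backpropagation through time (BPTT) for a vanilla RNN.
# The recurrent net is "unfolded" over T time steps, so gradients flow
# through one layer per step, exactly as in a T-layer feedforward net.

rng = np.random.default_rng(0)
T, n_in, n_hid = 100, 13, 32          # e.g. ~100 frames = 1 second of speech
W_in = rng.normal(0, 0.1, (n_hid, n_in))
W_rec = rng.normal(0, 0.1, (n_hid, n_hid))

x = rng.normal(size=(T, n_in))        # input sequence (one vector per 10 ms)
h = np.zeros((T + 1, n_hid))          # hidden states; h[0] is the initial state

# Forward pass: unfold the recurrence over time.
for t in range(T):
    h[t + 1] = np.tanh(W_in @ x[t] + W_rec @ h[t])

# Backward pass: suppose the (toy) loss is the sum of the final hidden
# state, so dL/dh[T] is a vector of ones; propagate it back through all T steps.
dh = np.ones(n_hid)
dW_in = np.zeros_like(W_in)
dW_rec = np.zeros_like(W_rec)
for t in reversed(range(T)):
    da = dh * (1.0 - h[t + 1] ** 2)   # backprop through tanh
    dW_in += np.outer(da, x[t])
    dW_rec += np.outer(da, h[t])
    dh = W_rec.T @ da                 # pass the error one time step further back
```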

The recurrent networks are general computers. The proof is very simple: a few neurons can implement a NAND gate, and a net of NAND gates can emulate the microchip in your laptop, Q.E.D. However, early recurrent networks couldn't learn deep problems: problems with long input sequences and long time intervals between relevant input events. In 1991, I first used unsupervised pre-training to overcome this problem. My neural history compressor is a stack of recurrent networks, and it works like this: a first recurrent network uses unsupervised learning just to predict its next input (for example, a sequence of letters is coming in, and it just tries to predict the next letter given the previous letters). Each higher-level recurrent network tries to learn a compressed representation of the information in the recurrent network below; it is trying to minimize the description length or, in other words, the negative log probability of the data. It does this through predictive coding: the higher-level network gets only the letters that were not predicted by the lower network, and so on. The top recurrent network may then find it easy to classify the data by downstream supervised learning.
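The data flow of this predictive coding can be sketched in a few lines of Python. This is a schematic illustration only: `predict_next` is a hypothetical stand-in for a trained next-step predictor (a real history compressor learns it with a recurrent network); the point is just that only unpredicted inputs are passed up the stack.

```python
# Schematic sketch of the neural history compressor's predictive coding.
# Each level tries to predict its next input; only the *unpredicted*
# (surprising) inputs are passed up, so higher levels tick more slowly.

def compress(stream, predict_next):
    """Return the subsequence of inputs the predictor failed to predict."""
    surprises, history = [], []
    for symbol in stream:
        if predict_next(history) != symbol:   # prediction error -> pass it up
            surprises.append(symbol)
        history.append(symbol)
    return surprises

# Stack two levels: level 2 sees only what level 1 could not predict.
# Here a toy alternation predictor stands in for the trained level-1 RNN.
text = list("abababababcabababab")
level1_out = compress(text, lambda h: "b" if h and h[-1] == "a" else "a")
print(level1_out)   # only the surprising 'c' survives: ['c']
```

The surviving symbols form a much shorter, compressed description of the original sequence, which is exactly what makes the higher level's job easier.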

One can also distil the knowledge of the higher recurrent network (the 'teacher') into a lower recurrent network (the 'student') by forcing the lower RNN to predict the hidden units of the higher one, which clocks on a slower time scale because it sees only the inputs that the lower-level recurrent network was not able to predict. By 1993, such systems could solve previously unsolvable very deep learning tasks involving over 1,000 subsequent computational stages, that is, problems of depth greater than one thousand.
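As a rough illustration, the extra distillation objective might look like the following sketch; the hidden-state arrays and the `distillation_loss` helper are hypothetical placeholders, not the original formulation.

```python
import numpy as np

# Sketch of "distilling" a slow higher-level RNN (teacher) into a faster
# lower-level RNN (student): the student is additionally trained to
# reproduce the teacher's hidden units at the time steps where the
# teacher ticks, i.e. where the student's own predictions failed.

def distillation_loss(student_h, teacher_h, tick_steps):
    """Mean squared error between student and teacher hidden states,
    measured only at the teacher's (slower) clock ticks."""
    diffs = [student_h[t] - teacher_h[k] for k, t in enumerate(tick_steps)]
    return np.mean([np.sum(d ** 2) for d in diffs])

student_h = np.random.randn(100, 32)   # 100 fast steps, 32 hidden units
teacher_h = np.random.randn(7, 32)     # the teacher ticked only 7 times
ticks = [3, 18, 30, 47, 61, 80, 95]    # fast-clock positions of those ticks
print(distillation_loss(student_h, teacher_h, ticks))
```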

This worked great, but then came something even better. It worked without any unsupervised pre-training, and it has revolutionized sequence processing: the long short-term memory, or LSTM. Recently, I learned that by the end of the 2010s, our 1997 LSTM paper got more citations per year than any other computer science paper of the 20th century, and that is worth being proud of.

LSTM, or long short-term memory, overcomes the vanishing or exploding gradient problem (or the fundamental deep learning problem, as I like to call it), which was identified and analyzed by my first student ever, Sepp Hochreiter, in his 1991 diploma thesis.

He realized that with standard activation functions, cumulative backpropagated error signals either shrink exponentially in the number of layers or time steps, getting smaller and smaller, or (which is just as bad) grow out of bounds; in both cases, learning fails. The problem is most apparent in recurrent neural networks, which are the deepest of all neural networks.
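The effect is easy to reproduce numerically. In the following sketch (a linear simplification; real networks add squashing functions, which make the shrinking even worse), the backpropagated error is multiplied by the recurrent weight matrix once per time step, so its norm scales roughly like a constant factor to the power of the number of steps:

```python
import numpy as np

# Numeric illustration of the vanishing/exploding gradient problem.
rng = np.random.default_rng(1)
T, n = 100, 32
err = np.ones(n)                        # error signal at the final step

for scale, label in [(0.05, "small weights"), (0.5, "large weights")]:
    W = rng.normal(0, scale, (n, n))
    g = err.copy()
    for _ in range(T):
        g = W.T @ g                     # one backward step (linear case)
    print(f"{label}: |gradient| after {T} steps = {np.linalg.norm(g):.3e}")

# Typical output: the first norm collapses toward 0 (vanishing), the
# second blows up astronomically (exploding); either way, learning
# across long time lags fails.
```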

LSTM is designed to overcome this problem. The first idea is already present in Sepp's 1991 thesis. I don't have time to explain LSTM in detail, but at least I can mention the brilliant students in my lab who made it possible: first of all Sepp, but also Felix Gers, with important contributions such as the forget gate, which is now an essential ingredient of the vanilla LSTM that everybody is using; then Alex Graves, who also made important contributions, and Daan Wierstra and others.
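For readers who want the gist anyway, here is a minimal numpy sketch of one step of a vanilla LSTM cell with a forget gate. This is a simplified textbook form, not the exact 1997 formulation (which had no forget gate), and refinements such as peephole connections are omitted:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# One step of a vanilla LSTM cell (with the forget gate), as a sketch.
# Each W_* maps the concatenation [input; previous hidden] to cell size.
def lstm_step(x, h_prev, c_prev, W_f, W_i, W_o, W_g, b_f, b_i, b_o, b_g):
    z = np.concatenate([x, h_prev])
    f = sigmoid(W_f @ z + b_f)       # forget gate: what to erase from the cell
    i = sigmoid(W_i @ z + b_i)       # input gate: what to write
    o = sigmoid(W_o @ z + b_o)       # output gate: what to reveal
    g = np.tanh(W_g @ z + b_g)       # candidate cell update
    c = f * c_prev + i * g           # additive cell update ("error carousel")
    h = o * np.tanh(c)               # new hidden state
    return h, c

# Toy usage: run five time steps with random weights.
n_in, n_hid = 8, 16
rng = np.random.default_rng(0)
Ws = [rng.normal(0, 0.1, (n_hid, n_in + n_hid)) for _ in range(4)]
bs = [np.zeros(n_hid) for _ in range(4)]
h, c = np.zeros(n_hid), np.zeros(n_hid)
for x in rng.normal(size=(5, n_in)):
    h, c = lstm_step(x, h, c, *Ws, *bs)
```

The key design choice is the additive cell update `c = f * c_prev + i * g`: because errors flow back through this sum largely unchanged, they neither vanish nor explode over long time lags.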

In 1997, compute was 100,000 times more expensive than in 2020. Since 1941, when Konrad Zuse built the first working program-controlled general computer, compute got 10 times cheaper every five years, and by 2009, compute was so cheap that my student Alex Graves was able, for the first time, to win competitions through deep learning, through long short-term memory. That was about handwriting recognition back then, with long time lags and deep credit assignment paths, but it was possible to outperform all the competition.

By the 2010s, computing was cheap enough to spread LSTM all over the planet on billions of smartphones. For example, since 2015, LSTMs trained by our method called connectionist temporal classification, or CTC (published in 2006 with Alex Graves as first author), have powered Google's greatly improved speech recognition on billions of Android phones. Our LSTM was also the core of the greatly improved Google Translate in 2016. Before 2016, Chinese users laughed at its translations from English to Chinese and back, but not any longer afterwards. In fact, by 2016, over a quarter of the awesome computational power for inference in all those Google data centres was used for LSTM.
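As an illustration of how this combination fits together, here is a minimal sketch using PyTorch's off-the-shelf nn.LSTM and nn.CTCLoss; the sizes and random tensors are placeholders, not Google's actual system:

```python
import torch
import torch.nn as nn

# Minimal sketch of the CTC + LSTM combination for speech recognition.
T, N, C = 100, 4, 30            # 100 frames, batch of 4, 29 labels + blank
lstm = nn.LSTM(input_size=13, hidden_size=64)      # 13 acoustic features
proj = nn.Linear(64, C)
ctc = nn.CTCLoss(blank=0)       # CTC aligns label sequences to frames

x = torch.randn(T, N, 13)                          # (time, batch, features)
h, _ = lstm(x)
log_probs = proj(h).log_softmax(dim=-1)            # (T, N, C)

targets = torch.randint(1, C, (N, 12))             # label sequences (no blank)
input_lengths = torch.full((N,), T, dtype=torch.long)
target_lengths = torch.full((N,), 12, dtype=torch.long)
loss = ctc(log_probs, targets, input_lengths, target_lengths)
loss.backward()                 # gradients flow through the LSTM via BPTT
```

The point of CTC is that no frame-by-frame alignment between audio and transcript is needed; the loss sums over all possible alignments.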

Facebook announced in 2017 that they were using LSTM to translate 30 billion messages per week: that's over 50,000 per second. LSTM also learned to improve Microsoft's software in several ways, as well as Apple's Siri and QuickType on a billion iPhones. It also learned to create the answers of Amazon's Alexa in 2016: that's not a recording; it's a voice generated anew for every single query. Great companies from Asia, like Samsung, Alibaba, Tencent and many others, are also using LSTM a lot.

It is important to realize that LSTM can be trained not only by gradient descent but also by reinforcement learning, without a teacher who shows on a training set what should be done: it can be trained by policy gradients to maximize rewards, as shown in 2007-2010 with my collaborators, including my PhD students Daan Wierstra, Jan Peters and Alexander Förster. Daan Wierstra later became employee number one of DeepMind, the company co-founded by Shane Legg, another PhD student from my lab, from here, from IDSIA. In fact, Shane and Daan were the first persons at DeepMind who had publications in AI and a PhD in computer science.
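A hedged sketch of the idea, using a simple REINFORCE-style policy gradient on an LSTM policy in PyTorch (not the exact algorithms from those papers; the observations and rewards below are placeholders for a real environment):

```python
import torch
import torch.nn as nn

# Sketch: train an LSTM policy with a REINFORCE-style policy gradient.
n_obs, n_act, T = 8, 4, 20
policy = nn.LSTM(n_obs, 32)
head = nn.Linear(32, n_act)
opt = torch.optim.SGD(list(policy.parameters()) + list(head.parameters()),
                      lr=1e-2)

obs = torch.randn(T, 1, n_obs)            # placeholder observation sequence
h, _ = policy(obs)
logits = head(h).squeeze(1)               # (T, n_act) action preferences
dist = torch.distributions.Categorical(logits=logits)
actions = dist.sample()                   # one sampled action per time step

rewards = torch.randn(T)                  # placeholder environment rewards
ret = rewards.sum()                       # episode return (no discounting)

# REINFORCE: reinforce log-probabilities of the taken actions in
# proportion to the return; no supervised target sequence is needed.
loss = -(dist.log_prob(actions) * ret).sum()
opt.zero_grad()
loss.backward()                           # credit flows back through the LSTM
opt.step()
```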

Policy gradients for LSTM have become important. For example, in 2019, DeepMind beat a pro player in the video game StarCraft, which is much harder than chess in many ways. DeepMind used a program called AlphaStar, whose brain is essentially a deep LSTM core trained by policy gradient methods. The famous OpenAI Five program also learned to defeat human experts in the video game Dota 2. That was in 2018, and again, the core of that system was a policy-gradient-trained LSTM: 84% of the model's total parameter count belonged to the LSTM. Bill Gates himself called this a huge milestone in advancing AI.

Jürgen Schmidhuber, Scientific Director, Swiss AI Lab IDSIA