Artificial Neural Networks
How artificial neural networks are trained and used in industry
This lecture is part of the collaboration between Serious Science and the Technology Contests Up Great READ//ABLE.
Since the mid-20th century, scientists have been trying to establish the principles by which a brain-like machine could solve very complex problems. In 1943, neurophysiologist Warren McCulloch and mathematician Walter Pitts proposed a mathematical model of an artificial neuron.
The model turned out to be identical to the already-known linear classifier. Let’s say there are a number of objects which can be distributed into two classes: in the medical field, it might be patients and healthy people, and in the banking field, it may be creditworthy and non-creditworthy clients. We may describe the objects using features, i.e., their characteristics, both quantitative and qualitative. For instance, for a patient, these may be the age, gender, test results, complaints, clinical records, reaction to drug intake, etc. For the bank clients, it may be their socioeconomic status, their salary, education, profession, etc. The collection of features is called a vector. All possible vectors form an n-dimensional space where n equals the number of features.
A linear classifier cuts this space in two with a hyperplane so that, in the medical example, the vectors of healthy people end up on one side and the vectors of patients on the other. The same goes for the bank clients: vectors of creditworthy clients should lie on one side and vectors of those who usually fail to repay their debts on the other. For this to work, the hyperplane must be oriented and positioned so that it makes as few errors as possible on the vectors whose classes we already know. This process is called training, and the set of vectors whose classes are known beforehand is called a training set. Once trained, the hyperplane can classify new objects, that is, automatically decide whether a patient is healthy or whether a client will be able to repay a debt to the bank. We call it a “hyperplane” because ordinary planes exist in three-dimensional space, whereas here we have an n-dimensional space; still, when we turn to geometric intuition to picture this process, ordinary planes are all we can imagine.
From a programmer’s perspective, this is fairly simple: a linear neuron (i.e., a linear classifier) receives n features as input, multiplies each of them by its own weight, sums the results, and then applies an activation function to the sum. The activation function can be very simple: for example, if the sum is greater than zero, the object belongs to the first class, otherwise to the second. Finding the optimal position of the hyperplane essentially means finding the weights that classify the training set best, i.e., with the minimal number of errors.
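As a rough sketch (with invented feature values, weights, and bias chosen purely for illustration), such a linear neuron can be written in a few lines of Python:

```python
import numpy as np

# A minimal sketch of a linear neuron (linear classifier).
def linear_neuron(features, weights, bias):
    # Weighted sum of the input features.
    s = np.dot(features, weights) + bias
    # Simple threshold activation: class 1 if the sum is positive, otherwise class 2.
    return 1 if s > 0 else 2

features = np.array([0.5, -1.2, 3.0])   # hypothetical feature vector (n = 3)
weights = np.array([0.8, 0.1, -0.4])    # hypothetical trained weights
print(linear_neuron(features, weights, bias=0.2))
```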
A neural network is a function composition of many linear neurons. Function composition in mathematics is a function of a function; in our case, each function is a neuron. The neurons that receive the n features of an object as their input form the first layer of the network. Each of these neurons outputs a certain value, so together they form a vector of new features of the object, whose dimension equals the number of neurons in the first layer. Similarly, we can build a second layer of neurons that take as their input the features formed by the first layer rather than the initial data. We can decide how many layers we need to solve a given task; usually, the more complex the task, the more layers we need, but a complex task will also require more data to train all the weights in all the neurons.
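In code, such a composition is simply one layer’s output fed into the next layer. Here is a sketch with arbitrary layer sizes, random weights, and a simple threshold activation, all chosen only for illustration:

```python
import numpy as np

# A minimal sketch of a two-layer network as a composition of functions.
def layer(x, W, b):
    # Each row of W holds the weights of one neuron in the layer.
    return np.where(W @ x + b > 0, 1.0, 0.0)  # simple threshold activation

n = 4                       # number of input features
x = np.random.randn(n)      # hypothetical feature vector

W1, b1 = np.random.randn(5, n), np.zeros(5)   # first layer: 5 neurons
W2, b2 = np.random.randn(3, 5), np.zeros(3)   # second layer: 3 neurons

h = layer(x, W1, b1)        # new 5-dimensional feature vector
y = layer(h, W2, b2)        # output of the composition: 3-dimensional vector
print(y)
```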
Machine learning is based on solving optimization problems. We set the criterion that we want to minimize (for instance, the loss function on the vectors of the training examples) and run a sophisticated algorithm that changes the coefficients throughout the network, gradually turning the hyperplane of each neuron to the best position.
This is usually done using the gradient descent method, which can be illustrated with a simple example.
We all solved simple optimization problems at school. To find the extrema of a function f(x), we find its derivative, set it equal to zero, and, having solved the resulting equation, obtain the local extrema (there may be several of them, so we then have to figure out which are minima and which are maxima). When training a neural network, we are looking for a minimum of a very complex function in a multidimensional space. The school trick no longer works here, but the idea is very similar: we find the derivative of the loss function with respect to each weight. A derivative shows how quickly the function grows along a given argument, and we collect all these derivatives into a vector called the gradient, which points in the direction of the fastest increase of the loss function in the weight space. Since we want to minimize the loss function, we go in the opposite direction. It is important to adjust the step size, which determines how quickly we descend: it is controlled by the learning rate. Over the past decades, many methods have been invented to help choose the step size and adapt the steps, so neural networks can now be trained very quickly.
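As a toy example, here is gradient descent minimizing a simple quadratic loss; the target weights are invented for illustration, and a real network applies the same update to millions of weights, with the gradients computed by backpropagation:

```python
import numpy as np

# A minimal sketch of gradient descent on the loss L(w) = ||w - w_true||^2.
w_true = np.array([2.0, -3.0])      # hypothetical "ideal" weights
w = np.zeros(2)                     # starting point
learning_rate = 0.1

for step in range(100):
    gradient = 2 * (w - w_true)     # derivative of the loss w.r.t. each weight
    w -= learning_rate * gradient   # move against the gradient

print(w)  # converges towards w_true
```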
Each layer of neurons transforms the input vector into an output vector whose dimension equals the number of neurons in that layer. We can perform many such transformations one after another: this is called a multilayer, or deep, neural network. Why is it so complicated, and why do we need many layers? Each layer is a set of basic linear classifiers that performs one relatively simple transformation of the vector, but together the layers form something like a conveyor belt capable of quite complicated transformations.
Usually the task cannot be solved directly with the vectors that enter the conveyor belt: they are difficult to work with. After a series of transformations, however, you can get an output vector from which the required result can be obtained. Why? Because when you have a criterion for evaluating the result, you train all the layers against it so that they adjust to one another and form the required transformational conveyor belt. If the task is complex, all the necessary transformations may simply not fit into a short conveyor belt, i.e., a small number of layers.
Convolutional networks, which are used for image processing, are based on a similar idea. A convolution produces a weighted average of brightness over adjacent pixels. If we apply it to all the pixels in an image, we get a slightly blurred image or an image with certain color transitions highlighted. Typically, we apply dozens or hundreds of different convolutions to the same image, so in the end we get many slightly differently processed copies of it. Layers of convolutional neurons alternate with pooling layers, which coarsen the image by reducing its size.
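The convolution itself is a simple operation. Here is a sketch of one averaging (blurring) convolution applied to a small made-up image, where every output pixel is a weighted average of a 3×3 neighbourhood:

```python
import numpy as np

image = np.random.rand(8, 8)                 # hypothetical 8x8 grayscale image
kernel = np.full((3, 3), 1.0 / 9.0)          # averaging (blur) kernel

out = np.zeros((6, 6))                       # output shrinks without padding
for i in range(6):
    for j in range(6):
        out[i, j] = np.sum(image[i:i+3, j:j+3] * kernel)

print(out.shape)  # (6, 6)
```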
How does it work? Each pair of convolution and pooling layers reduces the image size, increases the vector dimension at each pixel, and allows us to detect larger and larger image elements. At the output, the image size shrinks to 1 × 1, but the vector representing it may have a dimension of, say, 4,000: rather impressive. This vector encodes information about all the objects in the image that might be of interest. The training mechanism is the same: a convolutional network is usually trained to recognize thousands of different objects in millions of images, and during gradient optimization the convolutional layers adjust to one another so that the final image vector is the best one for correct recognition.
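A sketch of such a conveyor belt in PyTorch is shown below; the number of layers, channel sizes, and the 1,000 output classes are arbitrary choices for illustration, not any particular published architecture:

```python
import torch
import torch.nn as nn

# A sketch of the convolution/pooling conveyor belt described above.
net = nn.Sequential(
    nn.Conv2d(3, 32, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
    nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
    nn.Conv2d(64, 128, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
    nn.AdaptiveAvgPool2d(1),      # shrink the spatial size to 1x1
    nn.Flatten(),                 # the whole image is now a single vector
    nn.Linear(128, 1000),         # scores for, say, 1000 object classes
)

x = torch.randn(1, 3, 224, 224)   # one hypothetical RGB image
print(net(x).shape)               # torch.Size([1, 1000])
```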
Recurrent neural networks are very popular for signal and text processing. These networks process a sequence of input vectors one by one: having received the first vector, the network produces a state vector at its output, which is then fed back to it when processing the second input vector, and so on. Such a recurrent structure allows the network to store important information about the history of processing the sequence. Recurrent networks can correctly classify not only entire sequences but also their fragments, and they can transform one sequence into another. These networks can be used, for example, for automatic translation, speech recognition, and speech synthesis.
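A single recurrent step can be sketched as follows; the dimensions, the random weights, and the tanh activation are assumptions made only for illustration:

```python
import numpy as np

# A minimal sketch of a recurrent step: the state vector from the previous
# step is fed back in together with the next input vector.
input_dim, state_dim = 4, 8
W_in = np.random.randn(state_dim, input_dim)
W_state = np.random.randn(state_dim, state_dim)

def rnn_step(x, h):
    # The new state depends on the current input and the previous state.
    return np.tanh(W_in @ x + W_state @ h)

sequence = [np.random.randn(input_dim) for _ in range(5)]  # hypothetical inputs
h = np.zeros(state_dim)                                    # initial state
for x in sequence:
    h = rnn_step(x, h)      # the state accumulates the sequence's history
print(h.shape)              # (8,)
```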
Another important type of network architecture is the autoencoder. This is an ordinary deep network in which the vector dimension gradually decreases, layer by layer, and is then restored back to the initial size. The goal is to make the reconstructed vector match the original as closely as possible. This learning principle allows the network to learn how to compress (encode) a vector so that it retains all the necessary information and how to then decompress (decode) it with reasonable accuracy. In other words, the network learns to form compressed vector representations of objects that contain as much information as possible. Autoencoders are used as part of many neural network architectures in order to vectorize objects in the most effective way.
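A minimal autoencoder sketch in PyTorch; the 784-dimensional input (e.g., a flattened 28×28 image) and the 32-dimensional code are illustrative sizes, not a prescribed design:

```python
import torch
import torch.nn as nn

# Encoder squeezes the vector down; decoder tries to reconstruct the input.
encoder = nn.Sequential(nn.Linear(784, 128), nn.ReLU(), nn.Linear(128, 32))
decoder = nn.Sequential(nn.Linear(32, 128), nn.ReLU(), nn.Linear(128, 784))

x = torch.randn(16, 784)                 # a hypothetical batch of flattened images
code = encoder(x)                        # compressed 32-dimensional representation
reconstruction = decoder(code)
loss = nn.MSELoss()(reconstruction, x)   # training minimizes reconstruction error
print(code.shape, loss.item())
```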
The real blossoming of neural networks that we see today started with ImageNet, an international competition for object recognition systems. In 2012, Geoffrey Hinton and two of his graduate students, Alex Krizhevsky and Ilya Sutskever, created a deep convolutional network called AlexNet for this competition. The ImageNet database back then included about a million images collected on the Internet and labeled manually with the help of Amazon’s Mechanical Turk crowdsourcing platform. At the time, this was an unusually large amount of training data, since recognition algorithms had previously been trained on datasets of tens of thousands of images at best. ImageNet now contains approximately 15 million images divided into 22,000 categories.
At the beginning of the ImageNet competition, object recognition algorithms had an error rate of about 30%; AlexNet reduced it to 16%. Since then, deep neural networks alone have produced the best results in the competition. In 2015, the error rate dropped below 5% (the level of human performance) and eventually settled at about 2%. This means we have created image recognition technologies that can automatically solve a huge number of tasks: from recognizing faces and license plates to monitoring the environment for self-driving cars. There is another type of task that involves large training samples and can be solved with great success by deep neural networks: natural language processing. In computational linguistics, people have been building probabilistic language models for quite a long time. A language model predicts the probability of a certain word appearing in a text given its context, which may be the previous ten words, the previous three sentences, or the whole text from the very beginning.
There have been experiments in which participants were asked to guess which word was covered in a sentence, and it turned out that human-level perplexity is about 10. Perplexity is a measure of a language model’s quality: it shows how well the model can predict words. If the perplexity is 1,000, it means that any one of a thousand words could appear at a given position in a sentence; if the perplexity is 10, then we are choosing among only ten words. The lower the perplexity, the higher the quality of the model.
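Concretely, perplexity can be computed from the probabilities the model assigned to the words that actually occurred; the probabilities below are invented for illustration:

```python
import numpy as np

# Probabilities the model assigned to each actual word, given its context.
predicted_probs = np.array([0.2, 0.05, 0.5, 0.1])

# Perplexity is the exponential of the average negative log-probability.
perplexity = np.exp(-np.mean(np.log(predicted_probs)))
print(perplexity)   # roughly how many words the model is "choosing among"
```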
Recently developed neural network language models with complex architectures (for instance, BERT, GPT-1, GPT-2, GPT-3) have learned to predict words in sentences with the same accuracy as humans. This progress became possible thanks to a huge training set of texts, half a terabyte in size.
However, an artificial neural network doesn’t understand the meaning of the texts it works with: we have no models of understanding there. Nevertheless, neural networks are capable of generating fake news and even jokes that look very natural.
This is possible because the network has seen how the language works, having learned from such a huge body of texts. The scientific community is now trying to combine such models with extralinguistic data (i.e., knowledge about the world), algorithms for logical reasoning, associative thinking, and other elements necessary for a true understanding of language. So far, neural networks only imitate true intelligence.
Deep neural networks enabled a major breakthrough in machine learning by automating the vectorization of complex objects. In the past, a whole engineering team was needed to process each kind of image, signal, or text: this work, called feature engineering, consumed an unreasonable amount of resources. Deep neural networks have truly automated this type of work. Now we may expect another similar breakthrough: methods such as NAS (Neural Architecture Search) are beginning to automate the selection of architectures and model hyperparameters for other neural networks. In other words, neural networks are already learning how to build neural networks.
At the laboratory of machine intelligence of the Moscow Institute of Physics and Technology and at the Center for Analysis and Big Data Storage at Moscow State University, a project aimed at creating a kind of hybrid intelligence is now underway: a search and recommendation system that helps collect scientific papers on a certain subject. The algorithm examines the papers that the user has already collected and suggests similar ones, which helps to assemble a selection of hundreds of relevant articles in an hour, whereas a human might have needed several days.
We have now set ourselves a second task: to help scientists write reviews of the selected articles. This is a creative task: we cannot and will not exclude the author from it; full automation is simply inappropriate here. The scientist is the one who decides on the general idea, the goals, and the style of the review. Still, the system can look for useful phrases and rank them, and it can suggest which articles are worth mentioning first. We called this system ‘a prompter’: while the author is editing the text, one prompter suggests how to describe the main idea of the cited article; a second notes how the same thing was written in fifteen other articles; a third collects related studies. All prompters display ranked lists of phrases that the user can read not only as a source of ideas for wording but also as a source of information for a deeper understanding of the problem while working with dozens of sources at the same time. This is a new way of non-linear reading and understanding of large amounts of scientific and technical information. The user can search, understand, and process lots of information and create their own product based on the results.
The third project is aimed at finding contradictions in mass media as well as at highlighting different perspectives and methods of manipulating public opinion. The technologies that we have been developing since 2012 already allow us to divide the newsfeed into topics and events. The next step was to develop methods for highlighting different perspectives on the same topic. Different media may describe the same events in completely different ways, and a special search engine can find all the differences and show us which events and perspectives are hushed up and by whom (that is, it makes the main methods of propaganda visible). The next task will be to identify the constructs of the mythologized view of the world that supporters of any given ideology have created in the media. A text under analysis may not even look dangerous: it may contain no direct calls to violence or emotionally charged language, yet it can still implicitly promote a certain ideology and affect the target audience. Identifying such details in media texts is an interesting interdisciplinary project and a challenge for artificial intelligence and natural language processing technologies.