This lecture is part of the collaboration between Serious Science and the Technology Contests Up Great READ//ABLE.
A deep neural network is a complicated program. Structurally, it is a huge number of artificial neurons forming lots of hidden layers where we can set up the weights of the neurons. The input layer, i.e. the first one, gets a vector of features that describe an object, and the vector is processed in the hidden layers: we multiply the input vector by the weight matrix, and the resulting vector is transferred to the next layer and so on. The final vector is sent to the output layer of the network.
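The layer-by-layer processing described above can be sketched in a few lines of plain Python. This is a minimal illustration, not a real network: the weights are made-up numbers, and the ReLU step between layers is an assumption added here (the text only mentions the matrix multiplication).

```python
def matvec(weights, vector):
    """Multiply a weight matrix (a list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in weights]

def relu(vector):
    """Assumed nonlinearity: replace negative values with zero."""
    return [max(0.0, v) for v in vector]

def forward(layers, features):
    """Pass a feature vector through each layer in turn."""
    activation = features
    for index, weights in enumerate(layers):
        activation = matvec(weights, activation)
        if index < len(layers) - 1:  # hidden layers only; the output stays linear
            activation = relu(activation)
    return activation

# Hypothetical weights: one hidden layer (3 inputs -> 2 neurons), one output neuron.
layers = [
    [[0.5, -1.0, 0.25], [1.0, 0.0, -0.5]],
    [[1.0, 2.0]],
]
output = forward(layers, [2.0, 1.0, 4.0])
print(output)  # [1.0]
```

Training, discussed next, is the process of finding good values for the numbers stored in `layers`.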
For each neural network, we have to set up many parameters, too many to be adjusted manually, so we train the network on a special training set of data. During this process, the network changes the weights so that all the calculations and signal processing result in the desired outcome. To adjust the weights, we use a simple optimization algorithm based on the gradient descent method: it tracks how the result of signal processing changes when each weight is changed slightly, and then adjusts the weights step by step in the direction that improves the result. The structure is called a ‘deep’ neural network since we adjust the weights in each of the many layers of the network.
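As a rough sketch of this idea, the snippet below nudges each weight slightly, measures how a toy loss changes, and steps in the direction that lowers it. The loss function, learning rate, and step count are arbitrary choices for illustration; real networks compute gradients with backpropagation rather than by nudging weights one at a time.

```python
def loss(weights):
    """Toy loss whose minimum is at weights = [3.0, -1.0] (an arbitrary choice)."""
    return (weights[0] - 3.0) ** 2 + (weights[1] + 1.0) ** 2

def numerical_gradient(f, weights, eps=1e-6):
    """Estimate the slope of f for each weight by changing that weight slightly."""
    gradient = []
    for i in range(len(weights)):
        up = list(weights)
        down = list(weights)
        up[i] += eps
        down[i] -= eps
        gradient.append((f(up) - f(down)) / (2 * eps))
    return gradient

def gradient_descent(f, weights, learning_rate=0.1, steps=200):
    """Repeatedly move every weight a small step against its gradient."""
    for _ in range(steps):
        gradient = numerical_gradient(f, weights)
        weights = [w - learning_rate * g for w, g in zip(weights, gradient)]
    return weights

trained = gradient_descent(loss, [0.0, 0.0])
print(trained)  # values close to 3.0 and -1.0
```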
Usually, the more complex a neural network is (i.e. the more layers and neurons it has and the more computational operations it performs), the better results it gives, but also, the more difficult it is for us to understand what is happening in its hidden layers. However, thanks to the new algorithms that help to visualize all the processes in the inner layers of the networks, we have recently begun to better understand them.
If you’re working on a task that no one has worked on before, then in order to train a neural network, you’ll have to get a training set ready, and this process takes time (usually most of your time). In computer vision, we know how to approach most of the tasks, and for almost all of them, there is a training set to help you start working. But in any case, if you want to solve the task more efficiently, you have to collect new data (pictures, videos, and so on) and either label it or leave certain instructions for the neural network on how to process it.
It is important to collect a large training set because even a simple neural network trained on a good set will give you better results than a network with a complex architecture trained on a smaller set. However, mixed training sets are becoming more and more important now. For example, a network may be trained on two datasets, one of which is large and diverse but has wrong labels or doesn’t contain any labels whatsoever, while the other is correctly labelled but isn’t large and diverse enough. In this case, it is necessary to come up with a new algorithm that makes it possible to train the neural network correctly on both training sets.
In order to train a network, we can use supervised or unsupervised learning. Supervised learning means that we have a training set which consists of pairs of input and target vectors which we give to the neural network, i.e. we already know the task and its solution. Each training example is fed to the network; then it is processed in the inner layers, then the network calculates the output and compares it with the target vector, i.e. the expected result. This allows the network to calculate the error rate, which helps to further adjust the weights. This process repeats until the error rate over the entire set of input vectors of the training set becomes the lowest possible. In this way, the network learns to label the data following the example and make reasonable predictions for the new data.
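The loop described above can be sketched with the smallest possible "network": a single neuron computing `w * x + b`. The training pairs, learning rate, and number of epochs are made up for illustration, and the mean squared error is an assumed choice of error measure.

```python
def train_supervised(pairs, learning_rate=0.05, epochs=2000):
    """Fit output = w * x + b to (input, target) pairs by gradient descent."""
    w, b = 0.0, 0.0
    for _ in range(epochs):
        grad_w = grad_b = 0.0
        for x, target in pairs:
            output = w * x + b           # the network calculates its output
            error = output - target     # and compares it with the target
            grad_w += 2 * error * x / len(pairs)
            grad_b += 2 * error / len(pairs)
        w -= learning_rate * grad_w      # the error is used to adjust the weights
        b -= learning_rate * grad_b
    return w, b

def mean_squared_error(pairs, w, b):
    """Average squared difference between outputs and targets."""
    return sum((w * x + b - t) ** 2 for x, t in pairs) / len(pairs)

# Hypothetical training set generated by the rule target = 2 * x + 1.
pairs = [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0), (3.0, 7.0)]
w, b = train_supervised(pairs)
prediction = w * 4.0 + b  # a new input the neuron has never seen
print(round(w, 3), round(b, 3), round(prediction, 3))
```

After training, the prediction for the unseen input 4.0 is close to 9, which is what the underlying rule would give.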
Unsupervised learning means that we only know the input vectors and the neural network looks for patterns in the data and gives the best value for the output. The type of vectors generated at the output depends on the specific type of unsupervised learning.
The most efficient way to organize the work with a neural network is to reduce unsupervised learning to supervised learning. This has allowed for a lot of major breakthroughs in the last five years. For instance, we give the neural network corrupted data (it may be phrases with some words missing), and then the network will try to fill in the gaps. Another example: we can feed some black and white pictures to the network and ask it to restore the colour. We do not need to prepare the data beforehand and outline the desired outcome: we already have a huge number of similar pictures that can be downloaded from the Internet. For a neural network, this training resembles supervised training: it gets a black-and-white picture or a phrase with a word missing, and then it has to restore the data following the examples.
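To make the "phrases with some words missing" idea concrete, here is a small sketch that turns unlabelled sentences into (input, target) pairs without any manual labelling. The function name, the `[MASK]` placeholder, and the example sentences are all illustrative assumptions.

```python
import random

def make_masked_examples(sentences, mask_token="[MASK]", seed=0):
    """Hide one word per sentence; the hidden word becomes the training target."""
    rng = random.Random(seed)
    examples = []
    for sentence in sentences:
        words = sentence.split()
        position = rng.randrange(len(words))
        target = words[position]
        corrupted = words[:position] + [mask_token] + words[position + 1:]
        examples.append((" ".join(corrupted), target))
    return examples

sentences = [
    "the cat sat on the mat",
    "neural networks learn from data",
]
examples = make_masked_examples(sentences)
for corrupted, target in examples:
    print(corrupted, "->", target)
```

Each pair can then be used exactly like a labelled example in supervised learning, even though no person wrote the labels.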
There is a third way of training a neural network, which is a mix of the two that we’ve described: supervised learning (when a network is being told whether it made a mistake or not) and unsupervised learning (when a network doesn’t get any instructions on the labelling). This third way of training is called reinforcement learning. It allows you to train a neural network not to predict the desired vectors but to behave in a certain way, for example, to create texts of a certain genre. This is very similar to how a human learns: a neural network changes its behaviour model based on its own experience and the consequences of its actions.
Supervised learning is the easiest to work with, but only if there is a good source of correctly labelled data or when we can collect and label the data ourselves. Unfortunately, the training set is usually too large to be processed by hand. In this case, people usually use unsupervised learning first, and then the training is completed on a small labelled dataset.
One of the methods in machine learning is transfer learning: a neural network trained to solve a specific problem is asked to solve a new problem with a limited amount of new data. Say a neural network has learned to identify dog breeds, and now we want it to identify cat breeds as well. Since dogs and cats are somewhat similar, the neural network’s experience from identifying dogs can be used to solve this new problem.
However, we are limited by a fundamental problem that has not been overcome yet: algorithms can only transfer knowledge between similar tasks, even though the computing power and the number of the networks’ adjustable parameters are growing. For instance, computer vision allows us to solve more and more tasks at the same time, but in the end, any network at some point would need more computational resources than it can get, and the efficiency of its work on each specific task starts to decrease. This probably means that our algorithms are not perfect: so far, neural networks are inferior to the human brain in the ability to transfer knowledge when solving various problems.
In some tasks, neural networks perform as well as humans, sometimes even better: for example, they recognize faces if the light and the quality of the picture are good, they predict protein structures more accurately, and they beat people in chess and go. They can already monitor data from CCTV and respond to potentially dangerous situations.
Nevertheless, neural networks are worse than humans in assessing the accuracy of their predictions, and while a person can admit that he or she is not certain about something, a neural network can never do that. For example, if a neural network was trained to identify cat breeds and we show it a sheep, then it will classify it as a cat breed with complete confidence. Therefore, if there are errors in labelling, a neural network will not stop to think about it and simply learn them as true.
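One common reason for this behaviour can be sketched as follows, assuming the classifier ends in a softmax layer (an assumption; the text does not name the output function). Softmax turns raw scores into probabilities that always sum to one over the known classes, so there is no way to answer "none of these". The breed names and scores below are made up.

```python
import math

def softmax(scores):
    """Convert raw scores into probabilities that sum to 1."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

breeds = ["Siamese", "Persian", "Maine Coon"]
sheep_scores = [4.0, 0.5, 1.0]  # hypothetical raw outputs for a photo of a sheep

probabilities = softmax(sheep_scores)
best = breeds[probabilities.index(max(probabilities))]
print(best, round(max(probabilities), 3))
```

Even for an input from outside the training classes, one of the cat breeds receives most of the probability mass.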
Still, the most efficient way to solve most problems is to arrange the cooperation of a person and a neural network. For example, the best players in go or chess are “centaurs”, i.e. teams of a neural network and a human player, although lately some neural networks may be benefiting less from a partnership with humans. However, chess or go is an artificial situation: in the real world, when some common sense and understanding of other people is needed, humans perform much better than neural networks.
The best way to solve almost all computer vision problems is to use convolutional neural networks: a flexible architecture that processes different types of signals, including one-dimensional (voice), two-dimensional (image) or three-dimensional (3D objects). This type of network is based on convolutional layers in which the information coming from the previous layer is processed by fragments. The neural network itself determines the parameters of the convolutional layers (‘kernels’) during the learning process.
There are two special types of layers in a convolutional neural network. Convolutional layers calculate a linear combination of activations of the neurons on the previous layer, which allows it to form a feature map: it shows whether a specific feature is present in the layer. Each convolutional layer creates a new description of the object (therefore, at the output, we have many images). Pooling layers make the image several times smaller: they replace the activation of neurons located next to each other by their maximum or average value. Convolutional and pooling layers alternate with each other. In the end, we get an image of a minimal size but a large dimension, in which all the information of interest about the objects on the image is encoded. Convolutional networks allow you to extract meaningful information from images and then analyze the data and recognize the objects. This technology is used, among other things, in medicine in order to recognize and process biomedical images (for example, MRI scans) for more accurate diagnostics.
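The two layer types can be illustrated on a tiny 5×5 "image" in plain Python. The kernel below is a hand-written vertical-edge detector chosen for illustration; in a real network the kernel values are learned. As in most deep learning libraries, the "convolution" here slides the kernel without flipping it.

```python
def convolve2d(image, kernel):
    """Slide the kernel over the image (no padding, stride 1) and take a
    linear combination of the covered values at each position."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(image[r + i][c + j] * kernel[i][j]
                for i in range(kh) for j in range(kw))
            for c in range(out_w)
        ]
        for r in range(out_h)
    ]

def max_pool(feature_map, size=2):
    """Replace each size x size block of neighbouring values by their maximum."""
    return [
        [
            max(feature_map[r + i][c + j]
                for i in range(size) for j in range(size))
            for c in range(0, len(feature_map[0]) - size + 1, size)
        ]
        for r in range(0, len(feature_map) - size + 1, size)
    ]

# Dark left half, bright right half: a single vertical edge.
image = [[0, 0, 1, 1, 1] for _ in range(5)]
vertical_edge = [[-1, 1], [-1, 1]]

feature_map = convolve2d(image, vertical_edge)
pooled = max_pool(feature_map)
print(feature_map)  # [[0, 2, 0, 0], [0, 2, 0, 0], [0, 2, 0, 0], [0, 2, 0, 0]]
print(pooled)       # [[2, 0], [2, 0]]
```

The feature map is non-zero only where the edge is, and pooling halves its size while keeping that information.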
Neural networks are also widely used in self-driving vehicles. The main problem here is to make them safe: so far, they are not safer than human drivers, and it has turned out to be very difficult to increase safety by means of machine learning alone, so society is not ready to accept them yet. Perhaps the main difficulty is to model the behaviour of people, both pedestrians and drivers, so that the car is able to predict the actions of all the road users. Self-driving cars are equipped with various recognition systems that help them get a certain amount of information about the environment and about the location and direction of movement of objects, and this allows them to adjust their movement: their speed, trajectory, and so on. However, they still make mistakes. For instance, there was a story of a Tesla driving along the road, getting lots of information from cameras, and then crashing at full speed into a white truck that was blocking its path. It turned out that the recognition system had never seen such a training example, so it mistook the white trailer for a cloud and decided it should not slow down, while the truck driver hoped that the car on the main road would stop (which was a violation of the traffic rules). A human would’ve quickly analyzed the situation and stopped the Tesla.
How do we avoid such situations? That’s a difficult question. Technically, we can show neural networks a lot of examples in which people violate traffic rules. For instance, we could build a simulation that would generate such situations or get them from the real world (say, from companies that develop and test self-driving systems). However, there is still a possibility that we will not take into account some factors or a combination of factors that can lead to an accident, so the neural network will not learn how to behave.
Recently, we’ve found out that we can train neural networks not only to recognize but also to create images, although this is more difficult: generating a face, for example, requires a better understanding of facial features than regular face recognition does. The first works in this area were published in the early 2010s: in particular, in 2014, Alexey Dosovitskiy published an article dedicated to image generation for chairs, which are a very diverse class of objects. This paper attracted great interest among researchers, and they started to develop new methods of image generation. Now, a large class of algorithms is being developed: they are based both on classical computer graphics, which is not related to machine learning, and on deep learning itself. This combination gives us quite interesting results.
One of the applications of such image generators is telepresence systems, which will help people to communicate at a distance, getting a three-dimensional image of their interlocutor thanks to various technical means: three-dimensional monitors, virtual and augmented reality glasses, and so on. The question is how to make the image as realistic as possible and to correctly capture people’s appearance, facial expressions, movements, gestures, etc, which are necessary to preserve the non-verbal part of the communication process. All of this will make the conversation more interesting and help to create an impression of another person actually being there. In order to solve this problem, we’ll have to train the neural network on a huge number of videos and pictures.
Another promising area of application for this technology is the animation and film industry. Today, neural networks can generate cartoons based on textual descriptions. They do make mistakes, but still, they make the animators’ work easier. They also help to write music and even scripts for films. In addition, they make it cheaper and easier to create visual effects in films where traditional computer graphics are usually used.
Machine translation boils down to working with sequences: we have an input sequence of words in one language, for example, a phrase or a whole paragraph, and we need to get an equivalent sequence of words in another language. A phrase usually contains fewer than a hundred words, which is far fewer than the number of pixels in one picture; however, each of the elements is much more complex than a pixel. Most algorithms now use a large dictionary for this type of task.
The best solution here is a transformer, which is an architecture introduced in 2017 that proved to give the best results in text analysis and machine translation. Each element of the text is processed individually as a vector, and its position in the sequence is memorized. At each stage of processing, a so-called attention mechanism is activated: when the vector describing a specific word is updated, the transformer focuses its attention on other words of the phrase. The degree of attention is predicted by the transformer itself based on the vectors calculated in the previous steps. The vectors of words that attract the transformer’s attention are summed up, weighted by the degree of attention, and multiplied by a matrix, and the resulting vector (the context vector) is used to update the representation of the word. This process is not sequential, as in recurrent networks, but parallel.
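A heavily simplified sketch of one attention step is shown below. It assumes that the degree of attention is a softmax over scaled dot products between word vectors and that the same vectors serve as queries, keys, and values; a real transformer uses separate learned projections, several attention heads, and further layers. The word vectors and the matrix are made-up numbers.

```python
import math

def softmax(scores):
    """Turn raw scores into attention weights that sum to 1."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def attention_step(vectors, output_matrix):
    """Compute a context vector for every word in the phrase."""
    dim = len(vectors[0])
    contexts = []
    for query in vectors:
        # Degree of attention this word pays to every word of the phrase.
        scores = [dot(query, key) / math.sqrt(dim) for key in vectors]
        weights = softmax(scores)
        # Weighted sum of the word vectors...
        summed = [sum(w * v[k] for w, v in zip(weights, vectors))
                  for k in range(dim)]
        # ...multiplied by a matrix gives the context vector.
        contexts.append([dot(row, summed) for row in output_matrix])
    return contexts

word_vectors = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # three hypothetical words
identity = [[1.0, 0.0], [0.0, 1.0]]                   # placeholder for a learned matrix
contexts = attention_step(word_vectors, identity)
for context in contexts:
    print([round(x, 3) for x in context])
```

Each word's context vector depends only on the vectors from the previous step, not on the other words' new context vectors, which is why all words can be processed in parallel.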
Despite the advantages of the transformer over other neural networks (for example, recurrent and convolutional networks), it is still difficult for it to create a high-quality literary translation since it does not yet grasp the subtleties that are the essence of such texts. It works best with legal literature and similar types of text where standard wording is used; it is also much easier to collect training examples for these types of text.