Computer Vision

Computer scientist Cordelia Schmid on computers’ ability to recognize and produce images, machine learning, and artificial intelligence

faq | September 6, 2017

Computer vision is part of the larger field of computer science. It rests on two main components: images and data. Computer vision deals with both. It aims to recognize what images contain and to reconstruct it in the 3D world or, alternatively, to synthesize images by reconstructing the process of image generation.

Related areas

Within computer science, computer vision is tightly connected with two other fields: computer graphics and machine learning. They have to be distinguished. Machine learning is very important for computer vision, but it is not computer vision per se. Machine learning is about learning algorithms: how to find relations between the units in a training dataset and how to generalize the findings. It is a separate area: although computer vision relies heavily on machine learning approaches, machine learning has many other applications.

The aim of computer graphics is to create visual data. For example, if you wanted an animation of sand, you would be working on how to create sand. All the techniques used in movies, such as generating and animating characters, are also fields of focus for computer graphics. Computer graphics and computer vision intersect at several points. One intersection is that people in computer graphics use existing images to make their models more realistic, so that they can animate realistic models. People in computer vision now use these graphics models to train algorithms, to produce things like interfaces, and to teach a machine how to create these images.

Computer graphics, like computer vision, is obviously connected with machine learning. However, that doesn’t mean that all algorithms are based on it. Many are not. If you classify these areas, you will find that they are separate areas of research: they have their own conferences, their own publications, etc. But, of course, they are now gradually moving closer together. Today there are a lot of projects that actually have computer vision as the main application. So, the borders are not so clear anymore.

Machine learning and computer vision

There are some research centers that study the problems of computer vision. One of the major players specializing in this area is INRIA in France. It was one of the first institutes to study machine learning and computer vision. In 2002 we created a team there that relied on machine learning techniques to create models for computer vision. Basically, the goal of the team is to find out which machine learning techniques can be adapted to computer vision and how exactly they can be applied. One of the focus areas is the recognition of objects in images and, more importantly, actions in videos. The question is how to detect actions and objects and how to recognize and classify them. For training these recognition algorithms, synthetic data can be used.

If you take a camera and produce photos or video material, you get real images. Synthetic data is something different: it is generated by a computer. If you want to create synthetic data, you need images that look real but are not. A recent experiment at INRIA works with human body models and uses them to generate new bodies with varying appearances, poses, and so forth. Basically, what we have are algorithms that generate this data on the basis of existing models. These are advanced techniques that people in computer graphics introduce when they want to render things, create new movies, and so on.

Another interesting example is human action recognition in videos. Imagine that you have people’s actions recorded, and you want to know what exactly they are doing. Several things are important here: how to recognize people and their actions, how to follow them through the video, how to model their structure and the objects they are interacting with, and how to define the possible actions and classify them. So, how do we set up the whole thing?

Another quite obvious issue, which is very popular these days, is related to self-driving cars: how to predict scenarios in which a car will move in a certain way, taking semantic segmentation into account. You have an image that you are working on, and you want to classify all the pixels: what label each one gets and what kind of object it belongs to. It is interesting to know how segmentation and detection interplay, how to distinguish objects, and how to segment out the structure in the image.
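
To make the idea of classifying every pixel concrete, here is a minimal sketch in Python. It is only an illustration: the class names and prototype colours are hypothetical, and the nearest-colour rule stands in for the learned model a real system would use. What it shows is the shape of the task: an image goes in, and a label map of the same size comes out.

```python
# Toy per-pixel labelling: each pixel gets the class whose prototype colour is closest.
# The classes and RGB prototypes are hypothetical; real systems learn this mapping from data.
PROTOTYPES = {
    "road": (90, 90, 90),
    "sky": (120, 180, 255),
    "car": (200, 30, 30),
}


def nearest_label(pixel):
    """Return the class whose prototype colour is closest to this RGB pixel."""
    def sq_dist(name):
        return sum((p - q) ** 2 for p, q in zip(pixel, PROTOTYPES[name]))
    return min(PROTOTYPES, key=sq_dist)


def segment(image):
    """Map an image (rows of RGB tuples) to a label map of the same shape."""
    return [[nearest_label(px) for px in row] for row in image]


# A tiny 2x3 "image": a strip of sky above a road with a car on it.
image = [
    [(118, 175, 250), (125, 185, 255), (121, 179, 252)],
    [(95, 92, 88), (205, 35, 28), (85, 91, 93)],
]
label_map = segment(image)
for row in label_map:
    print(row)
```

Detection would then group neighbouring pixels with the same label into individual objects, which is where segmentation and detection start to interplay.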

Technically, what matters is how to set up your machine learning approach. That’s one of the main things people are working on nowadays. There are two problems in this field. The first is how to select the right learning approach for a given problem. The second is where to get the data from. An illustrative example is the COCO challenge, for which people marked all the objects in the images; these annotations can now be used for training, so all the data are available.
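
As a rough illustration of what such hand-made annotations look like, the sketch below builds a miniature dataset in the layout used by COCO-style annotation files (lists of images, categories, and annotations with bounding boxes) and groups the labeled objects by image. The file names, ids, and box coordinates are invented for the example and are not taken from the real dataset.

```python
from collections import defaultdict

# Hypothetical miniature dataset in a COCO-style layout; all values are made up.
dataset = {
    "images": [
        {"id": 1, "file_name": "street.jpg", "width": 640, "height": 480},
        {"id": 2, "file_name": "kitchen.jpg", "width": 640, "height": 480},
    ],
    "categories": [
        {"id": 1, "name": "person"},
        {"id": 3, "name": "car"},
    ],
    "annotations": [
        # bbox is [x, y, width, height] in pixels
        {"id": 10, "image_id": 1, "category_id": 1, "bbox": [50, 120, 80, 200]},
        {"id": 11, "image_id": 1, "category_id": 3, "bbox": [300, 200, 250, 150]},
        {"id": 12, "image_id": 2, "category_id": 1, "bbox": [200, 60, 120, 300]},
    ],
}


def objects_per_image(data):
    """Group (category name, bounding box) pairs by image file name."""
    names = {c["id"]: c["name"] for c in data["categories"]}
    files = {im["id"]: im["file_name"] for im in data["images"]}
    grouped = defaultdict(list)
    for ann in data["annotations"]:
        grouped[files[ann["image_id"]]].append((names[ann["category_id"]], ann["bbox"]))
    return dict(grouped)


training_targets = objects_per_image(dataset)
for file_name, objects in training_targets.items():
    print(file_name, objects)
```

Each image ends up paired with the list of objects a person marked in it, which is exactly the supervision a detection algorithm is trained on.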

Then there is another approach, weakly supervised learning. Instead of labeling everything by hand or synthesizing the data from which you want to learn, you can have images that come with tags, and you want to use them to find where exactly in an image those tags apply. This would be weakly supervised or semi-supervised learning. For instance, we can see an actress talking in a video with subtitles. We can use this information to localize where the action is happening.

One illustrative application of this technology is license plate recognition. It is already something that works very well. A vehicle just drives through a station, and it is possible to recognize the registration number and all the available information simply by scanning the license plate. It is impressive. It’s all about very practical things that make your life easier.

From image reconstruction to recognition

Until 2000, people still worked mostly on reconstruction; it was all about deriving geometric results. At that time, computers were not very sophisticated, and the algorithms were not very good for these tasks. So, it was quite hard to recognize objects in an image, and it was even fascinating how badly it worked. A human observer is able to see things and actions and immediately understand and classify them, but the computer at that time couldn’t recognize simple objects, such as a cube or a bottle.

Around 2000, people started to apply machine learning to recognition in computer vision. Basically, machine learning techniques were used to model the space of possible appearances. It was quickly understood that a direct methodology was too complex: you could not just handcraft the algorithms. The idea was then to create databases of images and learn their content. This has been a mainstream theme since the early 2000s. A lot of developments have been made recently, but even before, there was much progress.

What is also important is which data to use. At the time, we thought a lot of effort had gone into data collection, but recently people have started to use synthetic data to generate examples for training algorithms. They have more sophisticated ways of generating data, and they are then able to decide which data are missing for the learning algorithms.

Practical applications

There are a lot of examples of things that already work. One is Google image search, which covers operations such as creating and organizing collections of images and searching them, possibly on a large scale. It allows the creation of linked structures that can be used together.

Also, there are robots that help people clean their houses in everyday life. Every such robot has some capacity to see; it moves around the house, picking up things. Then we have some applications for blind people (helmets or glasses), which help them navigate. Examples of such head-mounted devices are HoloLens or Oculus Rift. These devices are already being produced.

Obviously, security relies heavily on computer vision. There are already a lot of functioning and developing projects that use a camera to detect certain motions or objects and to send out signals (security alerts, etc.). However, research is still going on in various directions. Cameras can recognize something, but the process is still not completely automatic. There has to be a person who sits behind the screens, watches all the panels, analyzes the information, and makes a decision. The idea now is to make it more and more automatic.

There is also fingerprint identification. It’s not really a camera, but it’s very similar. It is essentially something that captures the fingerprint and then analyzes it. It’s already on smartphones. You can do the same thing with face recognition: you take a picture and run a recognition algorithm on it. There are already such applications at border control, in the US, for example.

Artificial mind project

To begin with, it’s necessary to say that right now, we are not there at all. Basically, there is not even a robot that can walk around without any external support. It does not exist, and we are very far from it. The question is whether it is possible to have some intelligence that is programmed to do something without any assistance or control. All you can have today is a car that drives around on its own until it crashes into something. That is already possible.

Whether you can really reproduce human intelligence is not clear. In order to do so, you also have to model sentiments and feelings and build an integrated system. Of course, we could possibly create a machine that might be dangerous. You can just take a self-driving car, set it running, and it might cause some danger. This is not the problem. The problem is whether it has the same capabilities as human beings do: whether it can really make decisions, follow rules, norms, etc.

You know how to teach a computer something, but you don’t know how to teach it to learn something completely automatically, to decide which way to go and which task is important, and then to come back when it has figured it out. It is not possible.

Besides, there is still a problem in artificial neural networks called ‘catastrophic forgetting’. It happens when you keep adding new things for the network to learn: the network forgets a lot of what it has learned before. A similar process occurs in the human mind. If you train several skills one after another, you start to forget things. So we are very far from learning things automatically, assembling facts. We have no clue what the structure is, especially in some high-level modules.
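
The forgetting effect can be reproduced in miniature. The Python sketch below is a deliberately extreme illustration, not a model of a real network: it assumes a single shared weight and two hypothetical tasks that directly conflict (y = 2x and y = -2x). The model is trained on the first task, then on the second, and its error on the first task is measured before and after.

```python
# A one-weight linear model y = w * x, trained by plain gradient descent.
def train(w, data, lr=0.05, epochs=200):
    """Fit w to the (x, y) pairs by minimising squared error, one sample at a time."""
    for _ in range(epochs):
        for x, y in data:
            w -= lr * 2 * (w * x - y) * x  # gradient of (w*x - y)^2 with respect to w
    return w


def mse(w, data):
    """Mean squared error of the model on a dataset."""
    return sum((w * x - y) ** 2 for x, y in data) / len(data)


xs = [-1.0, -0.5, 0.5, 1.0]
task_a = [(x, 2.0 * x) for x in xs]   # hypothetical task A: y = 2x
task_b = [(x, -2.0 * x) for x in xs]  # hypothetical task B: y = -2x

w = train(0.0, task_a)
error_a_before = mse(w, task_a)   # small: task A has been learned

w = train(w, task_b)              # continue training on task B only
error_a_after = mse(w, task_a)    # large: task A has been overwritten
error_b_after = mse(w, task_b)    # small: task B has been learned

print("error on A after learning A:", error_a_before)
print("error on A after learning B:", error_a_after)
print("error on B after learning B:", error_b_after)
```

Because both tasks compete for the same weight, learning the second one erases the first. Real networks have many weights and tasks that overlap only partly, so the effect is less total, but the mechanism is the same.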

Unresolved issues

As a scientific discipline, computer vision still faces a lot of challenges. There are mathematical problems with how to train these algorithms and how to set them up. You also have to think about how to set up the data and how to set up more complicated tasks like recognition. Surely, recognition will feed into more complex modules, which, for example, would interact with a robot or train a robot. And then the open question would be: what kind of models, still unfamiliar to us, would be appropriate there?

Computer Scientist, INRIA Research Director, Head of the THOTH project-team