Speech Processing

Neuroscientist Sophie Scott on humans’ ability to distinguish sounds, bilingualism, and the Japanese language

videos | August 11, 2017

The video is part of the British Scientists project, produced in collaboration between Serious Science and the British Council.

Speech, human speech, is unique in nature. There really is nothing else out there in terms of a sound that’s like speech. The way that we produce the sounds of speech is unparalleled in complexity. And sometimes we can forget how hard that makes speech perception and speech production as tasks, because by the time you’re an adult you’ve spent a lot of your life learning how to speak, and you don’t really notice it being difficult or effortful to do. But actually there’s a lot of work that the brain has to do, both to control the sounds of speech when you’re speaking and to decode the sounds that other people are making. And there’s been a lot of interest in this historically, partly because, of course, it can go wrong. So the first work that systematically looked at speech in the brain was done by French and German neurologists trying to understand what was happening with their patients. They had patients who had problems with their brains, and they wanted to understand what was going wrong. So we knew from that, through the work of Paul Broca and Carl Wernicke, that there were brain areas that seemed to be particularly important in controlling how we talk and also in perceiving speech.

That’s been very important and very influential, but one of the things that we’ve been able to do more recently is use functional imaging techniques (literally taking photographs of the brain in action) to go into this in a lot more detail. So there are some things that we do know about speech. The language, or languages, that you learn as a child really shape how you can hear sounds as an adult. Babies are actually able to hear differences between speech sounds that adults can’t hear, because by the time you’re an adult you’ve narrowed in on the sound categories of your own language.

One example would be the speech sounds at the start of the words “red” and “led”. Those are the [r] and [l] sounds for English speakers. If you are Japanese, those are allophones: versions of the same speech sound. So the [r]/[l] distinction is very difficult for Japanese speakers to hear and also to produce accurately. But all languages have this. In English, the sound at the start of the word “leaf” is the same as the sound at the end of the word “bell”. Those are both [l]-sounds for English talkers. And there are languages, I believe Russian is one, where those would in fact be two different speech sounds. So even when an English speaker goes off as an adult to learn Russian, they can struggle with these sounds, which are clearly different to a Russian speaker but sound like the same speech sound to an English talker.

So, you’ve got this quite interesting kind of narrowing in, and that does seem to result in speech perception having some different characteristics from how we perceive other things in our environment. We group sounds together if they fit into particular speech-sound categories. It’s very interesting to think about how that is actually implemented in the human brain, because you are still talking about a sound. And what we’ve been able to do with the functional imaging techniques that we now have is actually go into the patterns of neural activation: how the brain is involved in teasing apart the sounds of speech. It is very interesting, because you see almost a duality in how speech is processed in the brain: on one hand, it’s a sound, and it activates parts of the brain that care about sound; but it’s also a sound that you treat differently, because you can do something different with it. You can deal with it as speech. And you can actually see that happening as well. You can see shaping of acoustic processing really quite early on in the brain, shaped by the language you’ve learned to speak.

That is because languages differ in which sounds they treat as what we call phonetically relevant: which sound properties matter and affect what the speech sounds are. Different languages do that differently. So, for example, in English vowel duration isn’t phonetically relevant. Whether I say “cat” with a short vowel or with a long one, I’m still saying the word cat. But if you were Japanese, vowel duration changes the meaning of speech sounds: “Tokyo” said with short vowels and said with long vowels would be two different words. You can actually see in the brain that Japanese and English talkers process vowel duration differently, even in meaningless words, really early on in the auditory system. You see the shaping of the auditory system towards the languages you speak.

Then, as you get further into the auditory system, you see the acoustic stuff drop away, and you start to see a signal that is more and more specifically linguistic. As you get further into the speech perception network, you see brain areas that care less and less about what that speech sounded like and more and more about what that word means, what that linguistic signal is. Interestingly, the further you go into the network, the more you see speech perception and speech production looking very similar. So if I look at the brain areas that are involved in meaning and thinking about meaning, they are pretty much exactly the same whether you are thinking about the meaning of something you’ve heard or the meaning of what you’re about to say. You can go from sound through to meaning in the brain. And then, of course, when you are actually speaking, we see other brain areas recruited, which help you do the actual business of producing speech, and you see those being driven in turn.

So, there’s a very interesting and complex network where you’ve got perception and control of output, and in between, the really big mediating factor is language: linguistic representations that are independent of the input. Those representations don’t care whether you’ve heard a word or read it, because they care about meaning. And what we’re very interested in is understanding how that works, because being able to see it as a network doesn’t tell us computationally what’s going on. It also doesn’t tell us about interesting things like syntax: in languages you have semantics, but of course you also have syntax. You’ve got grammar, and that drives what sentences mean just as strongly as the semantic content does. And although languages vary a great deal in how they implement semantics and how they implement syntax, you’ve got this commonality: you always have both there. So what I’d be very interested in knowing is how much this network, which seems to be generally involved in higher-order representations of language, is being driven by semantics, and how much by syntax, the rules of how we put language together.

There are a number of different ways you can study speech. One whole area is phonetics, and those are the people who deal with the linguistic end of understanding speech. They’re very interested in describing, classifying, and theorizing about the sounds of speech. There’s also the world of engineering, because if you want to build a computer that can understand speech, you have to think about many of the same problems that your brain deals with. And actually, one of the very interesting things that happens when you want to build a computer to understand speech is that you don’t go out looking for individual speech sounds at all; you look for sequences. Because, of course, the way that we produce speech sounds isn’t just as a series of separate segments: you do what’s called co-articulating them, you run them together.

So, if I say the word “sue” or the word “see”, those both have [s]-sounds at the start, but the [s] in “sue”, where I’m anticipating the [u] vowel, is actually very different from the [s] in “see”, where I’m anticipating the [i] vowel. And that makes it sound different. Now, both are an [s], and you hear an [s], but the signal is very different. And there’s information there: the way that [s] sounds is telling you what vowel is coming up. You want that information. Very interestingly, the engineering and computer approaches to decoding speech have picked up on a lot of really important stuff that we can be fairly confident the brain is using. The brain isn’t breaking speech down by listening for individual speech sounds and reassembling them into words. It seems to do something much more active.

So, you’ve got the world of phonetics and you’ve got the world of engineering, and then, of course, you’ve also got the world of clinical issues, because you can look at what happens when speech goes wrong. You can look at what happens in the brain of somebody who has difficulty decoding speech sounds, or somebody who has difficulty with semantic representations or syntactic representations. Because if damage can selectively affect particular kinds of speech sounds, or particular kinds of words, that tells us something about how the underlying system that supports them works. And then, of course, we’ve also got psychology, which tells us a lot about how perception and production work in the human brain, studying them in a behavioral way. What I find very interesting about speech is that you don’t get the answer from any one of those. You need to know about all of it: you need to know about the phonetics, you need to think about the computational issues, and you need to think about what you can learn from patients, because none of it on its own is going to help you understand these extremely complex phenomena.

There’s quite a lot of controversy at the moment about bilingualism, because some people argue that bilinguals have quite different brains: you train your brain differently if you are bilingual than if you’re monolingual. And that is controversial. Some people are making strong claims for it: for example, that there might be knock-on effects, such as being protected from dementia by being bilingual, because you have this stronger language system. Other people are arguing that that’s not the case. What I think is very interesting, if you actually look at the world, is that the norm is to be multilingual. Monolingual people, like those you find a lot of in the UK, are the unusual ones.

So, actually the human brain is very good at dealing with multiple languages. If you have the opportunity to learn them when you’re young, then you can cope really pretty well with multiple languages. There is a great deal of interest in this, because you could still ask: is all of that language in one place, is it all stored together, or is it separated? So a lot of the questions around bilingualism and the brain are also trying to understand how it’s actually implemented and how it works. Would it be possible, for example, for one language to be damaged and the other to remain intact? It does seem that in bilingualism there is more commonality than you might think. So it doesn’t seem that the different languages are separated out in the brain. You might be more familiar with one language, or you might be in a particular environment that makes it easier to speak one language than another, because the people around you, say, are more fluent in that language, but it’s not because your brain is somehow treating them separately.

It’s very interesting to think about the examples of people who are particularly skilled in languages, because you do meet people who are very, very good: they’ve learnt how to learn languages, they’ve learnt what to listen out for. I suppose there’s the possibility that anybody could do that, and one of the limits on language learning tends to be time. You spend a lot of time learning languages when you’re a baby, and you have a lot less time when you’re an adult. And once you’ve separated that out, there do seem to be variations: some people do seem to be better than others, and there’s actually quite a lot of interest in the underlying genetics of that. Because there is the possibility that what we see in the population is variation in linguistic ability that’s operating at this level: learning to learn, being able to get to grips with linguistic structure, being able to produce new speech sounds and acquire those sequences. There’s a lot of interest in how variation in that might be underpinned by genetic variation. So there has been a lot of interest in the end of the distribution at which people struggle to learn to speak: they struggle to acquire languages, they do it with difficulty, they find it hard to get it right. And there’s a lot of interest in what might be going on at the other end of that distribution: whether we would find people who are particularly able and have a facility with language and language learning that means they are naturally polyglots.

I think the future of the field of speech processing is going to be: can we ever really quantify what’s different about speech? Or is it just one of a set of things humans do? Face processing is another example where we seem to show a tremendous facility, and we seem to deal with face information in a way that’s different from other things that we look at. Is that something about faces, or something about their social meaning, or is it just that we happen to have learned to deal with them this way, because we have to? The same is true for speech. So there’s this tension: are we looking at a specialized skill, maybe something we are predetermined to be good at, or is it something we just happen to get really good at and can use in this way? For example, people have argued that, if you look for it, maybe a lot of what we see in the brain when we look at speech perception networks actually has nothing to do with speech and is to do with an expert recognition system. Suppose we took somebody who’s an expert in some other auditory domain, maybe identifying musical notes or recognizing birdsong. Would that look the same? Are we seeing anything speech-specific at all, or is it all auditory expertise? So, in short, I think one of the big questions for the future is going to be: is any of this specific to speech and language, or is it just brain areas that do expert processing?

Wellcome Senior Research Fellow in Basic Biomedical Science; Professor of Cognitive Neuroscience, Institute of Cognitive Neuroscience, University College London