For over 50 years, attempts have been made to determine whether humans possess a ‘special,’ dedicated neural system for processing speech or whether we use a general mechanism for all sounds, speech included. The theory of a dedicated speech-processing substrate rose to major influence, fell into unpopularity, and, remarkably, is now regaining consideration. One property of the theory, however, has been consistent: controversy.
Listening to and understanding someone who is speaking could hardly seem more natural and effortless. We do it even when we don’t want to – think of noisy neighbours. Yet, compared with other sounds that are meaningful to us, human speech is a formidably difficult stimulus to extract information from. Unlike written language, speech is a continuous onslaught of many varying sound properties, each contributing to the intended message. It advances rapidly through time, mostly without pauses between words. And if slight variations in its sonic attributes are not correctly perceived, the consequences can be dire.
In fact, the idea that speech processing is in some way unique is rooted in early attempts at text-to-speech technology. It was in the 1950s, long before today’s computing capacity, that Alvin Liberman began designing and testing speech readers for the blind. Little did he know that the endeavour would lead him to develop one of the most widely known theories of how we perceive speech – the Motor Theory of Speech Perception. Remarkably, this theory holds that when we perceive speech we are actually perceiving vocal tract gestures – the physical motor articulations that gave rise to the sounds we hear – rather than extracting meaning directly from the sounds emitted by the speaker. The evidence for this was striking enough to carry a major implication: that we possess a ‘special,’ innate module for processing speech.
The Motor Theory of Speech Perception
When Alvin Liberman tested his text-to-speech reader, he was met with disappointing results. At the time, the idea was to convert text into a series of tones; with some practice, one could recognize words coded as unique sequences of these tones. Despite practice, however, listeners could not keep up. The tones just sounded like a very rapid series of bleeps and blurps, too fast to be intelligible. The task would perhaps have been like trying to understand R2D2 if he were somehow reading these very lines. Participants simply did not have the temporal resolving power to decipher the sounds when they were presented at a practical rate.
How could this happen? Speech, even at relatively slow rates, involves more complex sound patterning than the tones from Liberman’s text reader. When he investigated the acoustic patterns of speech, what he found was surprising. He discovered a phenomenon now called ‘coarticulation’: distinctive speech sounds overlap in time – we articulate multiple unique speech components simultaneously, lengthening their duration and thereby making them easier to hear. This phenomenon underlies much of the complexity of the auditory speech signal and forms a main root of the Motor Theory.
Coarticulation lengthens sounds, but it also causes variability – the kind of variability that creates ambiguity. That is, there are multiple acoustic cues for a given speech sound, and likewise, a single acoustic cue can be perceived as different speech sounds. Take the word “say,” whose component sounds can also be perceived as “day” or “stay.” How, then, do we disambiguate the speech stream? Critically, Liberman realized that it is the vocal gestures creating speech sounds that are the reliable cues to what the speaker intended to communicate. That is, if we can’t rely on the acoustic cues, we can rely on perceiving the movements, or ‘articulations,’ that give rise to the acoustics of speech. If we can understand someone talking, our perception must be tracking the vocal gestures and not the ambiguous acoustic cues. And hence the Motor Theory of Speech Perception came into being in 1957.
Speech is special: Evidential highlights
On the surface, the theory admittedly seems odd and unlikely, but it has garnered some intriguing evidence. One line of support, discovered by Liberman himself, was that elementary speech sounds (which can be shorter than single words) seem to be perceived categorically rather than continuously. This means the threshold between perceiving one speech sound rather than another is discretized – black and white. This aspect of our perception of speech thus runs contrary to the very basic acoustic properties that actually comprise it. Take loudness and pitch, for example: these properties are perceived on continuous levels, as all the shades of grey between black and white are. This is ironic, since speech is produced as a continuous sound. This special way of perceiving speech became known as ‘Categorical Perception,’ and it was implied that only a speech-specific module could underlie it.
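The contrast between continuous and categorical perception can be sketched numerically. Below, a hypothetical identification function maps a continuously varying acoustic cue – voice onset time (VOT), a cue distinguishing /d/ from /t/ – onto category reports. The steep logistic mimics the abrupt category boundary seen in identification curves; the boundary location and steepness here are illustrative values, not fitted data:

```python
import math

def identification_prob(vot_ms, boundary_ms=30.0, steepness=1.5):
    """Probability of reporting /t/ rather than /d/ for a given VOT.
    The boundary and steepness are made-up illustrative values."""
    return 1.0 / (1.0 + math.exp(-steepness * (vot_ms - boundary_ms)))

# The acoustic cue varies continuously, but the reported category
# flips abruptly near the boundary:
for vot in (10, 25, 28, 32, 35, 50):
    print(f"VOT {vot:2d} ms -> P(/t/) = {identification_prob(vot):.3f}")
```

Far from the boundary the function saturates at 0 or 1 – listeners report the same category every time – which is what gives identification curves their black-and-white character.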
‘Duplex Perception,’ discovered by Timothy Rand in 1974, is another phenomenon thought to be special to perceiving speech. It occurs when listening through headphones: the first part of a syllable is played to one ear and the rest is presented to the other. Listeners perceive the second part as a continuation of the first – a complete syllable – while at the same time hearing it in the other ear as a simultaneous, nonspeech ‘chirp.’ How can we perceive the same sound as speech and nonspeech at the same time? It was thought that there must be both a speech module and an otherwise general module.
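The dichotic presentation behind duplex perception is easy to sketch in code. The fragment below routes a syllable ‘base’ to the left channel and the isolated remainder (heard on its own as a chirp) to the right channel of a stereo buffer, time-aligned at onset. The steady tone and falling sweep are toy stand-ins – Rand’s experiments of course used real formant material:

```python
import numpy as np

SR = 16000  # sample rate in Hz

def make_dichotic_stimulus(base, transition):
    """Left channel gets the syllable base; right channel gets the
    isolated transition, aligned to the syllable onset. The signals
    here are placeholders, not Rand's actual stimuli."""
    n = max(len(base), len(transition))
    stereo = np.zeros((n, 2))
    stereo[:len(base), 0] = base              # left ear: syllable base
    stereo[:len(transition), 1] = transition  # right ear: 'chirp'
    return stereo

# Toy signals: a steady 500 Hz tone as the 'base', a falling
# frequency sweep as the 'chirp' (50 ms each).
t = np.arange(int(0.05 * SR)) / SR
base = np.sin(2 * np.pi * 500 * t)
chirp = np.sin(2 * np.pi * (1500 * t - 4000 * t**2))  # sweeps downward
stimulus = make_dichotic_stimulus(base, chirp)
print(stimulus.shape)  # → (800, 2)
```

Written to a stereo file, each ear receives only its own fragment, yet the auditory system fuses them into one syllable while still hearing the chirp separately.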
A similar phenomenon centers on what is called sine wave speech (SWS). SWS is artificially derived from real speech and consists simply of pure tones varying in frequency – it contains none of the acoustic cues thought to be central to speech perception. This was a field-changing discovery by Roger Remez et al. in 1981. Incredibly, SWS shows that what is perceived as speech need carry no traditional acoustic cues at all – only sine wave tones compatible with the resonances of the vocal tract (i.e., the motor articulations that created them).
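The construction can be sketched as follows: replace each formant with a single sine tone whose frequency follows that formant’s track over time. This is a crude, hypothetical version of the procedure – Remez et al. used formant tracks measured from a real spoken sentence, whereas the tracks below are invented numbers:

```python
import numpy as np

SR = 16000  # sample rate in Hz

def sine_wave_speech(formant_tracks, dur):
    """Sum one pure tone per formant, each tone's frequency following
    its (coarsely sampled) frequency track over the duration."""
    n = int(dur * SR)
    t = np.arange(n) / SR
    signal = np.zeros(n)
    for track in formant_tracks:
        # Interpolate the coarse track to one frequency per sample,
        # then integrate frequency over time to obtain the phase.
        freqs = np.interp(t, np.linspace(0, dur, len(track)), track)
        phase = 2 * np.pi * np.cumsum(freqs) / SR
        signal += np.sin(phase)
    return signal / len(formant_tracks)

# Three tones tracking made-up F1/F2/F3-like trajectories (Hz):
tracks = [
    [300, 500, 700, 400],
    [1200, 1700, 1100, 1500],
    [2500, 2400, 2600, 2500],
]
sws = sine_wave_speech(tracks, dur=0.5)
```

The result contains no harmonics, no formant bandwidths, no aperiodic noise – yet, when the tracks come from real speech, listeners who are told to hear it as speech typically can.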
More evidence supporting a speech-specific, motor-integrated mechanism comes from multisensory research – the famed ‘McGurk’ illusion. Try it out yourself. Play the video at this link while watching the speaker’s face and note the syllable you hear. Now close your eyes and play it a few times – you’ll hear “ba”. How could the same audio sound different?
From a motor-theory perspective, what you hear is influenced by witnessing a different motor articulation from the one that produced the sound – that of /ga/. The idea is that your brain resolves the incompatibility by guessing that the syllable was neither /ba/ nor /ga/ but must have been /da/, a plausible compromise between the audio and visual inputs. The key is that the illusion only happens when you witness the motor activity at the same time, implying a direct link between seeing vocal gestures and perceiving speech.
What’s more, it’s possible to change what people hear by externally moving their face muscles. This seemingly unlikely finding was made by Takayuki Ito et al. in 2009. Participants were connected to a device that could physically move their lips. While they watched videos of a person saying words, the device was activated and their lips were moved. Amazingly, they misheard the audiovisually presented words in a systematic manner, more compatible with their own artificially created lip movement. Thus, their own speech-related gestures influenced the words they heard.
[Figure: experimental setup of Ito et al., 2009]
These studies comprise some of the strangest, most intriguing and most influential support for the existence of a speech-specific mechanism in our brains. They are not, however, the only studies. For further reading, I suggest the reviews by Galantucci et al. (2006) and Carbonell and Lotto (2014).
Speech perception: Not so special
The intrigue raised by the proposition of a special speech module is as phenomenal as the evidence raised to support it. As with all lofty claims, however, it did not proceed without detractors.
One of the first major blows to the ‘speech is special’ theory was delivered by a team of four chinchillas. Patricia Kuhl and James D. Miller (1975) were able to train them to distinguish a /d/ sound from a /t/ sound. The acoustic properties of these sounds vary continuously, yet we normally perceive them categorically, consistently as a “t” or a “d” in their normal speech context. The unlikelihood of chinchillas having categorical perception is underscored in a statement by Liberman himself: “Unfortunately, nothing is known about the way that non-human animals perceive speech… however, we should suppose that lacking the speech-sound decoder, animals would not perceive speech as we do, even at the phonetic level.” Unfortunately for the uniquely human “speech-sound decoder” idea, the chinchillas did perceive the /t/ and /d/ sounds categorically. The findings raise the question: why would chinchillas have a speech-decoding module if they can’t talk?
Still, categorical perception of sound remained a property associated with speech until 2005. That year, Travis Wade and Lori Holt demonstrated that it also occurs with nonspeech sounds. Cleverly, they devised a unique video-game context in which sounds helped gamers identify targets in a maze. When the targets appeared, they were accompanied by the sounds. As the game progressed, the targets became less visible, so the sounds incidentally aided their identification. After playing the game, participants completed a categorization task involving the sounds they had heard. Amazingly, the results were consistent with Categorical Perception. Moreover, the categories were learned incidentally: participants didn’t need them to play the game and were not instructed to pay attention to them. In sum, a learned categorical perception of nonspeech sounds was not good news for advocates of an inborn, speech-specific brain module.
What about going beyond acoustic perception and directly testing the link between the observation of action and the perception of sound, as in the McGurk illusion? Can seeing human motor action influence our perception of nonspeech sounds? It turns out that yes, it can. In 1993, Helena Saldaña and Lawrence Rosenblum used videos of cellos being either plucked or bowed to see whether watching these actions could influence judgments of plucking and bowing sounds. Indeed, they found that seeing a cello plucked made the auditory perception of plucking more likely than that of bowing, and vice versa – a finding that supports a generalized multisensory theory of perception, not one solely tied to speech.
In 2010, Joseph Stephens and Lori Holt revisited Liberman’s attempt to convert speech from one modality to another. They invented an auditory-to-visual speech reader – in some respects, the opposite of Alvin Liberman’s visual text-to-speech reader. Stephens and Holt’s ‘robot’ turned speech sounds into coded changes in dials and lighted bars. They found that, with practice, the visual signals could be used to enhance the intelligibility of speech in constant obstructing noise. This demonstrates that arbitrary visual cues can influence the perception of speech, and that such influence is not limited to witnessing a speaker’s vocal tract movements, as in the McGurk illusion.
These and a host of other experiments cracked the foundations of the Motor Theory and the notion that speech perception is accomplished only by an innate, dedicated speech module in the brain. What’s important is that these studies demonstrated, in unique ways, that the basic properties used to implicate such a mechanism are found in animals that don’t talk, can be learned, and are present in the perception of nonspeech sounds. It seems that only a ‘general mechanism’ can explain these findings.
The rebirth of ‘Speech is Special’?
In a study published in 2015 in Nature Neuroscience, Tobias Overath and David Poeppel used functional magnetic resonance imaging to isolate a neural substrate that responds selectively to the temporal attributes of speech. Arguably, this finding brings the quest for a specialized speech mechanism full circle. It provides some answer to Liberman’s original question of why we can perceive speech at the rate it occurs, but not text converted into synthetic sounds played at the slowest practical rate.
Overath and Poeppel took the rapidity of acoustic speech into consideration, setting out to determine whether brain areas are tuned to the temporal structure of natural speech. To do this, they created what they called “sound quilts.” These stimuli consisted of recorded speech segments in a foreign language that were broken up into smaller ‘patches,’ which were then reshuffled to make a new segment. Crucially, they eliminated the abruptness that would otherwise occur at the end of one patch and the beginning of the next. The more the stimuli were quilted in this way, the more they disrupted the regular temporal patterning of speech. Remarkably, they found a part of the Superior Temporal Sulcus (STS) whose responses varied with the amount of quilting: greater responses to less quilted, less “patchy,” more natural speech. This finding implies that the area is ‘tuned’ to the temporal properties of speech.
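A minimal sketch of the quilting idea: chop a recording into equal patches, shuffle their order, and rejoin them with short linear crossfades so there are no abrupt joins. Shorter patches disrupt the natural temporal structure more. This is a simplification – Overath and colleagues’ algorithm matched segment boundaries for acoustic continuity much more carefully – and the patch lengths below are just examples:

```python
import numpy as np

SR = 16000  # sample rate in Hz

def quilt(signal, patch_ms, crossfade_ms=5, rng=None):
    """Shuffle equal-length patches of `signal`, rejoining them with
    short linear crossfades. A simplified stand-in for the sound-quilt
    procedure, not Overath et al.'s exact algorithm."""
    if rng is None:
        rng = np.random.default_rng(0)
    patch = int(SR * patch_ms / 1000)
    fade = int(SR * crossfade_ms / 1000)
    pieces = [signal[i:i + patch]
              for i in range(0, len(signal) - patch + 1, patch)]
    rng.shuffle(pieces)
    out = pieces[0].copy()
    ramp = np.linspace(0, 1, fade)
    for p in pieces[1:]:
        # Overlap-add a short crossfade between consecutive patches
        # so no abrupt discontinuity marks the join.
        out[-fade:] = out[-fade:] * (1 - ramp) + p[:fade] * ramp
        out = np.concatenate([out, p[fade:]])
    return out

noise = np.random.default_rng(1).standard_normal(SR)  # 1 s stand-in signal
q30 = quilt(noise, patch_ms=30)    # heavily quilted: fine-grained scrambling
q480 = quilt(noise, patch_ms=480)  # mildly quilted: long stretches intact
```

Varying `patch_ms` parametrically is what let the authors ask which brain responses track the amount of temporal disruption rather than any single acoustic feature.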
This original experiment was followed by an onslaught of control conditions. Importantly, these ruled out alternative explanations based on acoustic properties such as pitch and amplitude. In a final experiment, they produced quilts using nonspeech sounds such as bird song and footsteps. These nonspeech stimuli did not elicit the same, unique response observed with the speech quilts. Ultimately, their evidence for a speech-specific neural substrate is considerable.
Although Alvin Liberman passed away in 2000, he would likely be amazed at the tumult his original ideas have stirred over the better part of the last two decades. What does the future hold? Will the Overath and Poeppel findings be replicated? Will brain imaging produce further evidence? Regardless, investigations surrounding Liberman’s Motor Theory of Speech reveal remarkable connections between our perceptions of auditory and visual information and our implicit knowledge of how to generate the actions that create the sounds of speech.
Carbonell, K. M., & Lotto, A. J. (2014). Speech is not special… again. Frontiers in Psychology, 5, 427.
Galantucci, B., Fowler, C. A., & Turvey, M. T. (2006). The motor theory of speech perception reviewed. Psychonomic Bulletin & Review, 13(3), 361–377.
Ito, T., Tiede, M., & Ostry, D. J. (2009). Somatosensory function in speech perception. Proceedings of the National Academy of Sciences, 106, 1245–1248. doi: 10.1073/pnas.0810063106
Kuhl, P. K., & Miller, J. D. (1975). Speech perception by the chinchilla: Voiced-voiceless distinction in alveolar plosive consonants. Science, 190(4209), 69–72.
Liberman, A. M., Delattre, P., & Cooper, F. S. (1952). The role of selected stimulus-variables in the perception of the unvoiced stop consonants. American Journal of Psychology, 65, 497–516. doi: 10.2307/1418032
Liberman, A. M., Harris, K. S., Hoffman, H. S., & Griffith, B. C. (1957). The discrimination of speech sounds within and across phoneme boundaries. Journal of Experimental Psychology, 54, 358. doi: 10.1037/h0044417
McGurk, H., & MacDonald, J. (1976). Hearing lips and seeing voices. Nature, 264, 746–748. doi: 10.1038/264746a0
Overath, T., McDermott, J. H., Zarate, J. M., & Poeppel, D. (2015). The cortical analysis of speech-specific temporal structure revealed by responses to sound quilts. Nature Neuroscience, 18(6), 903–911. doi: 10.1038/nn.4021
Rand, T. C. (1974). Letter: Dichotic release from masking for speech. Journal of the Acoustical Society of America, 55(3), 678–680. doi: 10.1121/1.1914584
Remez, R. E., Rubin, P. E., Pisoni, D. B., & Carrell, T. D. (1981). Speech perception without traditional speech cues. Science, 212, 947–950.
Saldaña, H. M., & Rosenblum, L. D. (1993). Visual influences on auditory pluck and bow judgments. Perception & Psychophysics, 54, 406–416.
Stephens, J. D. W., & Holt, L. L. (2010). Learning novel artificial visual cues for use in speech identification. Journal of the Acoustical Society of America, 128, 2138–2149. doi: 10.1121/1.3479537
Wade, T., & Holt, L. L. (2005). Incidental categorization of spectrally complex non-invariant auditory stimuli in a computer game task. Journal of the Acoustical Society of America, 118, 2618–2633. doi: 10.1121/1.2011156