Hello. The hearing system cannot react to all features present in a sound wave. Thus, it is essential to determine what we perceive and how we perceive it. This enormously complex field is referred to as speech perception. The following questions have dominated research. First, does the speech signal contain specific perceptual or acoustic cues? Second, are there any perceptual units, any fundamental units of speech perception? And last but not least, how can we model the process of speech perception, that is, are there any theories of speech perception? A further important issue in speech perception, which is also the province of experimental psychology, is whether speech perception is a continuous or, as often assumed, a categorical process. These and some minor questions will be discussed in the following. The speech signal presents us with far more information than we need in order to recognize what is being said. Still, our auditory system is able to focus our attention on just the relevant auditory features of the speech signal, features that have come to be known as perceptual or acoustic cues. Now the importance of these small auditory events has led to the assumption that speech perception is by and large not a continuous process, but rather a phenomenon that can be described as discontinuous or categorical. So the entire process can be characterized as categorical perception. In other words, these cues are not perceived along a continuum but as fixed categories. Let us exemplify such categories. The first one is called voice onset time. Voice onset time, or VOT for short, is the point at which vocal fold vibration starts relative to the release of the closure. It is crucial for us to discriminate between syllables such as /ba/ and /pa/. And it is a well-established fact that a gradual change of voice onset time does not lead to a gradual differentiation between the voiced and voiceless consonants, but to an abrupt switch from one category to the other. Let us illustrate this. 
Now if voice onset time is long, let's say 250 milliseconds, what do we perceive? Clearly the voiceless variant /pa/. If by contrast we make it extremely short, let's say 10 milliseconds, we perceive /ba/. So 10 milliseconds, and the result is /ba/. What about 50 milliseconds? It is still the voiceless variant. And 20 milliseconds? It's the voiced one. Now quite interestingly, if we generate a voice onset time value of 30 milliseconds, then we have trouble identifying what we hear. In other words, if VOT is longer than 30 milliseconds, we hear /pa/. If it is shorter, the perceptual result is /ba/. The voice onset time value of 30 milliseconds serves as a category boundary, as some sort of acoustic cue. Here is another acoustic cue, the second formant or, in short, F2 transition. Now the formant pattern of vowels in isolation differs enormously from that of vowels embedded in a consonantal context. If a consonant precedes a vowel, then the second formant F2 seems to emerge from a particular point. This point is very high for /ka/, intermediate for /ta/, and very low for /pa/. The frequency region from which F2 emerges is referred to as the F2 locus. And it may be assumed that a gradual change of the locus from high to low results in a gradual change from /ka/ to /pa/ if we generate the consonant in a vocalic context. Let's listen. Here is a high locus. Clearly the result is /ka/. Now if we contrast this with a very low one, the result is /pa/. And in the middle, we clearly hear /ta/. But what about the intermediate values? If the locus is higher than that for /ta/ but lower than that for /ka/, we cannot identify the respective consonant. Thus it seems that speech perception is sensitive to the locus of F2 and that the transition of F2 from the locus to the vowel is an important cue in the perception of speech. Another cue that we rely on in the perception of speech is frequency patterns. 
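The category effects just demonstrated can be sketched in a few lines of code. This is a toy illustration, not a perceptual model: only the roughly 30-millisecond VOT boundary comes from the demonstration above, while the F2 locus bands in hertz and the ambiguity margins are invented values chosen purely for illustration.

```python
# Toy sketch of categorical perception. The ~30 ms VOT boundary follows
# the lecture demonstration; all other numbers are illustrative guesses.

def perceive_vot(vot_ms, boundary_ms=30.0, margin_ms=5.0):
    """Map a voice onset time (ms) onto /ba/ or /pa/.

    Values close to the boundary count as ambiguous, mirroring the
    listeners' trouble at 30 ms."""
    if abs(vot_ms - boundary_ms) <= margin_ms:
        return "ambiguous"
    return "pa" if vot_ms > boundary_ms else "ba"

def perceive_f2_locus(locus_hz):
    """Map an F2 locus (Hz) onto /pa/, /ta/ or /ka/ (invented bands)."""
    bands = [("pa", 0, 900), ("ta", 1200, 2000), ("ka", 2300, 4000)]
    for label, lo, hi in bands:
        if lo <= locus_hz <= hi:
            return label
    return "ambiguous"  # locus falls between two category bands

for vot in (250, 50, 30, 20, 10):
    print(f"VOT {vot:3d} ms -> {perceive_vot(vot)}")
for locus in (3000, 1800, 700, 2150):
    print(f"F2 locus {locus} Hz -> {perceive_f2_locus(locus)}")
```

The point of the step functions is exactly the categorical claim: the output does not change gradually with the input, it jumps from one category to the other, with a narrow ambiguous zone in between.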
Now the frequency of certain parts of the sound wave helps to identify a large number of speech sounds, fricatives for example. Fricatives involve a partial closure which produces a turbulence in the airflow and results in a very noisy sound spreading over a broad frequency range. This friction noise is relatively unaffected by the context in which the fricative occurs and may thus serve as a nearly invariant cue for its identification. Having discussed the three central acoustic or perceptual cues, voice onset time, the F2 transition (that is, the locus of F2), and frequency patterns, let us now see whether there is a central unit on the basis of which we segment the incoming signal. Studies into language acquisition, especially into infant speech processing, suggest that the fundamental unit of speech perception corresponds roughly to the syllable. The central argument is the unavailability of obvious cues that facilitate the segmentation process. Despite the absence of such cues, children are capable of acquiring their lexicon, even though they have little or no information about the phonological properties of the words. Hence they must process some sort of information, perhaps innate, about the properties that distinguish one word from another. This information seems to be based on the syllable. The syllable, as you know, consists of an onset, a peak, and a coda. This is the standard notation, and here is an abbreviated notation. And here are some examples, starting with a very simple syllable, 'man'. 'Man' consists of an onset, the bilabial nasal /m/, a vowel in the peak, and another nasal consonant, /n/, in the coda. If we take 'strength', we have three consonants in the onset, one vowel in the peak, and two consonants in the coda. So this is the syllable. Now, how can all these findings be modeled? Well, there are several theoretical approaches. They are subsumed under the heading of speech perception theories. 
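The onset-peak-coda structure just described can be made concrete with a short sketch. The phoneme lists for 'man' and 'strength' are hand-coded broad transcriptions (an assumption of this example), and the splitter assumes exactly one vowel peak per syllable.

```python
# Minimal onset-peak-coda split for a one-vowel syllable.
# Phoneme symbols and the vowel set are simplified assumptions.

VOWELS = {"a", "e", "i", "o", "u", "ae", "eh"}

def split_syllable(phonemes):
    """Return (onset, peak, coda) around the single vowel peak."""
    for i, p in enumerate(phonemes):
        if p in VOWELS:
            return phonemes[:i], [p], phonemes[i + 1:]
    raise ValueError("no vowel peak found")

# 'man' /m ae n/: one-consonant onset, vowel peak, one-consonant coda
print(split_syllable(["m", "ae", "n"]))
# 'strength' /s t r eh ng th/: three consonants in the onset, two in the coda
print(split_syllable(["s", "t", "r", "eh", "ng", "th"]))
```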
All these theories have in common that the ear amplifies the incoming signal and transmits it to the auditory nerve, where a primary auditory analysis takes place, that is, a filtering out of non-speech aspects and so on. Then an auditory pattern is generated, formant patterns and so on, using some sort of mental recognition device. And it is this device about which the theories make different claims. The output of all these theories is, in all cases, some sort of phonological form, either in terms of phonemes, we might argue, or in terms of features, that contains the relevant information. Let us now compare the different types of theories. There are essentially two types, active and passive theories. Let's start with active theories. Active theories assume that the process of speech perception involves some sort of internal speech production. That is, the listener applies his articulatory knowledge when he analyses the incoming signal. In other words, the listener acts not only when he produces speech, but also when he receives it. Two influential active theories have emerged. Here on the left, you see the motor theory of speech perception. According to the motor theory, reference to one's own articulatory knowledge is manifested via a direct comparison with one's own articulatory patterns. An alternative is the analysis-by-synthesis theory. This theory postulates that the reference to our own articulation is established via newly generated auditory patterns. Now, active theories can be contrasted with passive theories. The passive group of theories of speech perception emphasizes the sensory side of the perceptual process and relegates the process of speech production to a minor role. 
They postulate the use of stored neural patterns, which may even be innate. Here are two influential passive theories. To the left, you find the theory of template matching. Templates are innate recognition devices that are rudimentary at birth and are tuned as language is acquired. The alternative is the feature detector theory, where feature detectors are specialized neural receptors necessary for the generation of auditory patterns. Let us summarize our observations. There is evidence that the analysis of the spoken input signal is by and large sensory and that a system of sub-segmental feature detectors is central to any theory of perception. These feature detectors cope with specific acoustic cues in the signal. So, that's it. Thank you very much. Thanks for your attention, and see you again.