Speech perception begins with a highly complex, continuously varying acoustic signal and ends with a representation of the phonological features and segments encoded in that signal. This e-lecture discusses the main principles according to which the acoustic input is converted into such a code. In particular, we will discuss three central questions. Does the input signal contain specific perceptual or acoustic cues? What are the perceptual or linguistic units of speech perception? And how can we model the process of speech perception?

The speech signal presents us with far more information than we need in order to recognize what is being said. Still, our auditory system is able to focus our attention on just the relevant auditory features of the speech signal, features that have come to be known as perceptual or acoustic cues. The importance of these small auditory events has led to the assumption that speech perception is by and large not a continuous process but rather a discontinuous, categorical one. The entire process can thus be characterized as categorical perception. In other words, these cues are not perceived along a continuum but as fixed categories.

Let us exemplify such categories. The first one is called voice onset time. Voice onset time, or VOT for short, is the point at which vocal fold vibration starts relative to the release of the closure. It is crucial for discriminating between minimal pairs such as bar and par. And it is a well-established fact that a gradual change in voice onset time does not lead to a gradual transition between the voiced and the voiceless consonant; instead, perception switches abruptly from one category to the other. Let us illustrate this. If voice onset time is long, say 250 milliseconds, what do we perceive? Par. Clearly the voiceless variant, par. If, by contrast, we make it extremely short, say 10 milliseconds, we perceive? Bar. Again? Bar. So 10 milliseconds, and the result is bar. What about 50 milliseconds? Par. Par. It is still the voiceless variant. And 20 milliseconds? Bar. Bar. It is the voiced one. Now, quite interestingly, if we generate a voice onset time value of 30 milliseconds, we have trouble identifying what we hear. In other words, if VOT is longer than about 30 milliseconds we hear par; if it is shorter, the perceptual result is bar. So the voice onset time value of 30 milliseconds serves as a key factor, as a category boundary and acoustic cue.

Here is another acoustic cue: the formant 2, or F2, transition. The formant pattern of vowels in isolation differs enormously from that of vowels embedded in a consonantal context. If a consonant precedes a vowel, the second formant F2 seems to emerge from a particular point. This point is very high for kar, intermediate for tar, and very low for par. The frequency region from which F2 emerges is referred to as the F2 locus. One might assume that a gradual change of the locus from high to low would result in a gradual perceptual change from kar to par if we generate the consonant in a vocalic context. Let's listen. Here is a high locus. Kar. Clearly, the result is kar. Now contrast this with a very low one. Par. The result is par. And in the middle? Tar. We clearly hear tar. But what about intermediate values? If the locus is higher than that for t but lower than that for k, we cannot identify the respective consonant. Thus it seems that speech perception is sensitive to the locus of F2, and that the transition of F2 from the locus to the vowel is an important cue in the perception of speech.
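To make the idea of categorical perception concrete, here is a minimal sketch in Python. The 30-millisecond VOT boundary comes from the lecture; the F2 locus frequencies, the ambiguity margins, and all function names are assumptions made purely for illustration.

```python
# Illustrative sketch of categorical perception: continuous cue values
# are mapped onto discrete phonological categories at fixed boundaries.
# The 30 ms VOT boundary is taken from the lecture; the F2 locus
# frequencies and the ambiguity margins are assumed values for the demo.

VOT_BOUNDARY_MS = 30  # assumed category boundary between voiced and voiceless stops

def perceive_voicing(vot_ms: float, margin_ms: float = 5.0) -> str:
    """Map a voice onset time value onto a voicing category."""
    if abs(vot_ms - VOT_BOUNDARY_MS) <= margin_ms:
        return "ambiguous"            # near the boundary, identification fails
    return "par (voiceless)" if vot_ms > VOT_BOUNDARY_MS else "bar (voiced)"

# Assumed F2 locus values (Hz) for the three places of articulation.
F2_LOCI_HZ = {"par (bilabial)": 700, "tar (alveolar)": 1800, "kar (velar)": 3000}

def perceive_place(f2_locus_hz: float, margin_hz: float = 250.0) -> str:
    """Map an F2 locus frequency onto the nearest place-of-articulation category."""
    label, locus = min(F2_LOCI_HZ.items(), key=lambda kv: abs(kv[1] - f2_locus_hz))
    if abs(locus - f2_locus_hz) > margin_hz:
        return "ambiguous"            # between two loci, identification fails
    return label

if __name__ == "__main__":
    for vot in (10, 20, 30, 50, 250):
        print(f"VOT {vot:3d} ms -> {perceive_voicing(vot)}")
    for f2 in (700, 1800, 2400, 3000):
        print(f"F2 locus {f2:4d} Hz -> {perceive_place(f2)}")
```

Note how the boundary, not the raw cue value, does the work: 20 ms and 10 ms yield the same percept, while 20 ms and 50 ms, an only slightly larger step, yield different ones.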
Another cue that we rely on in the perception of speech is frequency patterns. The frequency composition of certain parts of the sound wave helps to identify a large number of speech sounds, fricatives for example. Fricatives involve a partial closure which produces a turbulence in the airflow and results in a very noisy sound, with energy spread over a broad frequency range. This friction noise is relatively unaffected by the context in which the fricative occurs and may thus serve as a nearly invariant cue for its identification.

Having discussed three central acoustic or perceptual cues, namely voice onset time, the F2 transition from its locus, and frequency patterns, let us now see whether there is a central unit on the basis of which we segment the incoming signal. Studies of language acquisition, especially of infant speech processing, suggest that the fundamental unit of speech perception corresponds roughly to the syllable. The central argument is the unavailability of obvious cues that facilitate the segmentation process. Despite the absence of such cues, children are capable of acquiring their lexicon even though they have little or no information about the phonological properties of the words. Hence they must process some sort of information, perhaps innate, about the properties that distinguish one word from another, and this information seems to be based on the syllable. The syllable, as you know, consists of an onset, a peak, and a coda. This is the standard notation; here is an abbreviated notation. And here are some examples. A very simple syllable: man. Man consists of a bilabial nasal in the onset, a vowel in the peak, and another nasal consonant in the coda. If we take strength, we have three consonants in the onset, one vowel in the peak, and two consonants in the coda. So much for the syllable.
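As a small illustration, the onset-peak-coda division can be written down as a data structure. This is only a sketch; the class layout and the rough segment labels for man and strength are assumptions for the demo, not a claim about how listeners mentally represent syllables.

```python
from dataclasses import dataclass

# Minimal illustrative representation of syllable structure.
# The onset/peak/coda division follows the lecture; everything else
# (class layout, example segmentations) is assumed for the demo.

@dataclass
class Syllable:
    onset: list[str]   # zero or more consonants before the peak
    peak: list[str]    # the vocalic nucleus
    coda: list[str]    # zero or more consonants after the peak

    def __str__(self) -> str:
        return f"onset={self.onset} peak={self.peak} coda={self.coda}"

# "man": bilabial nasal onset, vowel peak, nasal coda.
man = Syllable(onset=["m"], peak=["ae"], coda=["n"])

# "strength": three onset consonants, one vowel, two coda consonants.
strength = Syllable(onset=["s", "t", "r"], peak=["e"], coda=["ng", "th"])

for word, syl in (("man", man), ("strength", strength)):
    print(f"{word}: {syl}")
```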
Now, how can all these findings be modeled? There are several theoretical approaches, subsumed under the heading of speech perception theories. All these theories agree that the ear amplifies the incoming signal and transmits it to the auditory nerve, where a primary auditory analysis takes place, that is, a filtering out of non-speech aspects and so on. An auditory pattern, formant patterns and the like, is then generated using some sort of mental recognition device. It is at this recognition device that the theories make different claims. The output, in all cases, is some sort of phonological form, either in terms of phonemes or in terms of features, that contains the relevant information.

Let us now compare the different types of theories. There are essentially two types, active and passive theories. Let's start with active theories. Active theories assume that the process of speech perception involves some sort of internal speech production; that is, the listener applies his articulatory knowledge when he analyses the incoming signal. In other words, the listener is active not only when he produces speech but also when he receives it. Two influential active theories have emerged. On the left, you see the motor theory of speech perception. According to the motor theory, reference to one's own articulatory knowledge is manifested via a direct comparison with one's own articulatory patterns. An alternative is the analysis-by-synthesis theory, which postulates that the reference to our own articulation is established via newly generated auditory patterns.

Active theories can be contrasted with passive theories. Passive theories of speech perception emphasize the sensory side of the perceptual process and relegate the process of speech production to a minor role. They postulate the use of stored neural patterns, which may even be innate. Here are two influential passive theories. On the left, you find the theory of template matching: templates are innate recognition devices that are rudimentary at birth and are tuned as the language is acquired. The alternative is the feature detector theory, in which feature detectors are specialized neural receptors necessary for the generation of auditory patterns.

Let us summarize our observations. There is evidence that the analysis of the spoken input signal is by and large sensory, and that a system of sub-segmental feature detectors is central to any theory of perception. These feature detectors cope with specific acoustic cues in the signal. In our e-lecture Pre-lexical Processing, Part 2, we will compare the analysis of the speech signal with the analysis of written input and provide a unified model. So, stay tuned.
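As a closing illustration, here is a rough sketch of the shared pipeline the theories assume, with a passive, feature-detector front end of the kind the summary favours. Every stage, name, and threshold below is an assumption made for the purpose of illustration, not a claim about any particular model in the literature.

```python
# Rough sketch of the pipeline shared by the theories discussed:
# signal -> primary auditory analysis -> auditory pattern ->
# recognition device -> phonological form. The passive, feature-detector
# variant is sketched; all stages, names, and values are assumptions.

def primary_auditory_analysis(signal: dict) -> dict:
    """Filter out non-speech aspects; keep only cue-bearing measurements."""
    speech_cues = {"vot_ms", "f2_locus_hz", "friction_noise"}
    return {k: v for k, v in signal.items() if k in speech_cues}

def feature_detectors(pattern: dict) -> set:
    """Specialized detectors, each coping with one acoustic cue (the passive view)."""
    features = set()
    if "vot_ms" in pattern:
        features.add("+voice" if pattern["vot_ms"] < 30 else "-voice")
    if pattern.get("friction_noise"):
        features.add("+continuant")  # broadband friction noise signals a fricative
    return features

def recognition_device(features: set) -> str:
    """Map the detected sub-segmental features onto a phonological form."""
    return "/b/-like" if "+voice" in features else "/p/-like"

# A toy input: short VOT, no friction noise, plus an irrelevant mains hum.
signal = {"vot_ms": 10, "f2_locus_hz": 700, "friction_noise": False, "hum_50hz": True}
pattern = primary_auditory_analysis(signal)            # "hum_50hz" is filtered out
print(recognition_device(feature_detectors(pattern)))  # -> /b/-like (short VOT = voiced)
```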