Yeah, thanks for that introduction. As Narees said, I haven't been on the conference speaking circuit in a big way. But what I hope to do here is talk about this topic of deep learning in speech recognition and hopefully give you an idea of how deep learning has really changed the world here for speech recognition. So let's start off with maybe just asking some questions. How many of you have used a speech recognition system on your cell phone? Raise your hands. I would say it's about 60, 70 percent. And how many of you have used Alexa or Google Home or something like that? Almost the same number. And how many of you think speech recognition works? A reasonable fraction. How many of you think speech recognition sucks? A smaller number. I think speech recognition has come a long way, and it used to be the case that the answer to the last question would have resulted in many more raised hands than I saw.

So let me just start with a brief personal, I wouldn't call it a story, but a personal path about how I came into speech recognition. I finished my PhD in 1991 at Rutgers University and then went over to Bell Labs to work in the area of speech recognition. At the time, Bell Labs was one of the premier research laboratories in the area of speech recognition. Larry Rabiner and company didn't invent HMMs, but they did a lot of the pioneering work in applying them to speech recognition. At the time, we had already moved from connected word recognition, where you had to speak like this, to continuous speech recognition, where you could speak like I'm speaking now. We'd already started doing large vocabulary, that is maybe tens of thousands of words. Today's systems can recognize hundreds of thousands of words, maybe millions of words, so we've come a long way since those days. I then went to Stanford Research Institute in 1994 to continue the work in speech recognition. Stanford Research Institute is a contract research organization based in Menlo Park that works on government-sponsored research. In 1994, this was after the previous wave of neural networks, which happened in the 80s, there was already some work going on in trying to get neural networks into speech and make them work for speech recognition. It had some limited success, but it didn't work very well, and I'll talk about that later in the talk. Fast forward another 15 years to 2010, and neural networks made a huge impact in speech recognition. Since then it has basically changed the equation completely, and now that's what everybody uses for speech and other pattern recognition tasks, as you know.

So before we get into all the details, maybe we should talk about what a speech recognition system looks like. This is going to be one level lower, just opening up the hood and taking a look at the stuff. I'm hoping that at the end of this talk you will be able to talk somewhat intelligently about how a speech recognition system works without getting into all the gory maths and stuff like that. The input to a speech recognition system is naturally speech. Here is a recording of myself saying the sentence "this is speech." On the top you see the waveform of this utterance. In the center panel, what you see is what we call a spectrogram. A spectrogram is simply a time-frequency representation of the signal. And at the bottom you see the transcription of the sentence, which is "this is speech."
So let's look at the spectrogram. To create the spectrogram, what we do is divide the time axis into multiple frames, what we call frames of speech. Each frame is about 25 milliseconds, and we shift these frames 10 milliseconds at a time. So we have 100 frames per second, because with 10-millisecond shifts, every 10 milliseconds you compute a new frame. For each frame of speech, which is about 25 milliseconds, you compute a fast Fourier transform to get the spectral energies at different frequency values. By doing this we can compute a sequence of spectral features. I'm not going to go into the details, but suffice it to say that the features are spectral representations: they relate to the energies at different frequency bands at a specific time frame. Now if you go back and look at the spectrogram that I showed you in the first picture, you can clearly see that the different sounds have clearly different spectra. The 's' sound in "this" has energy at very high frequencies, around 5000 hertz, while the other sounds look quite different. I don't know if there's a pointer here, but you can see the spectrogram clearly enough, and you can imagine why these spectra would actually represent features of these different sounds. So the input to a speech recognition system is these 40-dimensional spectral vectors; 40 is the typical dimensionality. Again, it's not critical to get into that.

The speech decoder takes the sequence of speech features and outputs a sequence of words. So in this case we started off with 16,000 samples per second, because we sampled the speech at 16 kilohertz, reduced that to 100 frames per second, and the output is only three words, because you can say "this is speech" in about a second. So let's dig down a little bit more into the speech decoder itself. The input, as I said, is the sequence of spectral features. Each feature is a vector; x1, x2, x3 are vectors representing the features at particular times, and the output is the word sequence. So how do we compute that output? We use standard statistical pattern recognition, where we look over all possible word sequences and pick the word sequence that has the maximum posterior probability given the input feature sequence. That is, we maximize the probability of W, the word sequence, given X, the feature sequence. That's it. This is standard pattern recognition, nothing fancy there. In order to do this, we break the posterior probability into two parts: the class conditional probability, the probability of the feature sequence X given the word sequence W, multiplied by the prior probability of the word sequence W. Again, standard stuff. If you had a very good, accurate model for these class conditional probabilities and for the prior probability of the word sequence, then you would get the best possible result that nobody could possibly beat for this task. The problem is that we don't know what these probability distributions are, and even if we knew what they were, we don't have the data to train those models, and so on. So that is the standard problem that people face in pattern recognition, and we'll go into some of that. In order to compute these probabilities, we use an acoustic model, a lexicon, and a language model. The acoustic model and the lexicon together are used to compute the probability of X, the feature sequence, given W, the word sequence, and the language model is used to compute the probability of the word sequence.
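Going back to the front end for a moment, here is a rough NumPy sketch of the framing and FFT step described above. The Hamming window, the crude way the FFT bins are collapsed into 40 bands, and all the numbers are simplifying assumptions; real front ends use mel-spaced filterbanks and a few more steps.

```python
import numpy as np

def spectral_features(signal, sample_rate=16000, frame_ms=25, shift_ms=10, n_bands=40):
    """Turn a raw waveform into a sequence of log-spectral feature vectors.

    Roughly the pipeline described above: 25 ms frames, shifted 10 ms at a time
    (so about 100 frames per second), an FFT per frame, and the spectral
    energies collapsed into a small number of frequency bands.
    """
    frame_len = int(sample_rate * frame_ms / 1000)   # 400 samples at 16 kHz
    shift = int(sample_rate * shift_ms / 1000)       # 160 samples at 16 kHz
    frames = []
    for start in range(0, len(signal) - frame_len + 1, shift):
        frame = signal[start:start + frame_len] * np.hamming(frame_len)
        power = np.abs(np.fft.rfft(frame)) ** 2      # spectral energies
        # Collapse the FFT bins into n_bands coarse frequency bands.
        bands = np.array_split(power, n_bands)
        frames.append(np.log([b.sum() + 1e-10 for b in bands]))
    return np.array(frames)   # shape: (num_frames, n_bands)

# One second of speech at 16 kHz -> roughly 100 feature vectors of dimension 40.
features = spectral_features(np.random.randn(16000))
print(features.shape)
```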
So now let me go into how these models are actually used in a speech recognition system. This slide is somewhat important and will lay out how things work. We're going to look at how we can go from a sequence of words to the spectral features that come into the speech decoder.

So let's start with the language model. In this case, I have a simple language model with three words: this, is, speech. An arrow between two words represents the probability that I can go from one word to the next word. So for example, the arrow between "this" and "is" represents the probability of the word "is" given that the previous word was "this". This is what is called a bigram language model, and you are probably familiar with this. There are also trigram language models, where we would have the probability of the word "is" given the previous two words, and so forth, but here I'm only showing a bigram language model. The solid arrows represent what might be high probabilities, so "this is speech" is a very likely word sequence. But as you can see, the word "this" can follow itself, so this graph can produce the sentence "this this this..." any number of times, followed by the word "speech", and so forth. Hopefully those sentences would have lower probabilities, because the dashed arrows represent lower probabilities here. We would train these models on large amounts of text and estimate the bigram and trigram probabilities from counts to get this model.

Okay, the next step is to move from words to phones, or phonemes. I'm not going to distinguish too much between phonemes and phones; let's just call them phones for the purpose of this talk. A phone is a basic sound in the language. In English we have about 40 phones. So the word "this" is the sequence of phones 'dh', 'ih', and 's'. We have a lexicon, which basically has a set of entries for all the words in the vocabulary, and each word is followed by its corresponding phone sequence. Why do we do this? Well, like I told you before, we have hundreds of thousands of words. If we tried to model 100,000 words and the acoustics that are produced by those 100,000 words, we probably wouldn't have enough data in our training data to estimate those models. So instead of that, we model only 40 phones, which is much easier to do. I might see the word "this" in training but not the word "that"; but if I've seen "this", then the 'dh' sound has already been modeled, okay?
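Here is a toy sketch, in Python, of the two pieces just described, a bigram language model and a lexicon. The probabilities and the pronunciations are made up purely for illustration, not taken from any trained system.

```python
import math

# Toy lexicon: each word maps to its phone sequence (ARPAbet-style symbols).
LEXICON = {
    "this":   ["DH", "IH", "S"],
    "is":     ["IH", "Z"],
    "speech": ["S", "P", "IY", "CH"],
}

# Toy bigram language model: P(next word | previous word). "<s>" marks the
# sentence start. The numbers are invented for this example.
BIGRAMS = {
    ("<s>", "this"):  0.8,
    ("this", "is"):   0.7,
    ("this", "this"): 0.05,   # the dashed, low-probability self-loop
    ("is", "speech"): 0.6,
    ("is", "this"):   0.1,
}

def sentence_log_prob(words):
    """Score a word sequence under the bigram model (log domain)."""
    logp, prev = 0.0, "<s>"
    for w in words:
        logp += math.log(BIGRAMS.get((prev, w), 1e-6))  # tiny floor for unseen bigrams
        prev = w
    return logp

def phones(words):
    """Expand a word sequence into its phone sequence via the lexicon."""
    return [p for w in words for p in LEXICON[w]]

print(sentence_log_prob(["this", "is", "speech"]))    # likely sentence
print(sentence_log_prob(["this", "this", "speech"]))  # much less likely
print(phones(["this", "is", "speech"]))
```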
Now it turns out that speech recognition researchers use what are called triphones instead of phones. The reason for this is that the human articulatory system cannot move instantaneously from one sound to the next. Let's take the example of the word "cat" and the word "bat", okay? The vowel in "cat" will sound slightly different from the vowel in "bat", merely because in "cat" I'm coming from a 'k', which is produced at the back of the vocal tract, whereas in "bat" I'm coming from a 'b', which is produced at the lips. So the vowel is influenced by the previous sound. Therefore, if you model only context-independent phones, the 40 context-independent phones, you will not get as good models as you could get if you model triphones. And this is a standard thing: people have found triphones work very well, so that's what we use.

If you have 40 phones, you'll have 40 to the power of three possible triphones, which is a big number, but obviously many of those triphone combinations never occur in any phone sequence in the language. So we have a manageable number of these things, far fewer than the number of words. Each triphone is then modeled by what we call a hidden Markov model. Each of these hidden Markov models has three states going from left to right. Each state can go back to itself at the next time instant, or it can go to the next state on its right. The first state typically is supposed to model the beginning part of the triphone, the second the steady part of the triphone, and the last the exit part of the triphone. So this is the acoustic model that actually produces the feature sequence X, the spectral features that I showed you, given a triphone. So now we have gone from words to triphones to the observed spectral features. This is like a production model, a generative model. And using this model, if you observe one of these sequences of spectral features, you can somehow go back to the words, and we'll talk about that.

So putting all these models together, we get this composed search graph. As you can see, I have connected the triphones together to form words, and I've connected the words together according to my language model to form sentences. Any path through this composed search graph corresponds to a word sequence by construction, because I started off with the word sequences. So clearly any path through this graph has to be a valid word sequence in English. And our task is to find the best possible word sequence through this graph, given the input feature sequence that we observed. If we can do that, we've solved the problem, okay? Naturally, the easiest way to think about this is: let me go through all the possible paths in that graph, which are countless, evaluate each one of them, and pick the one that has the maximum posterior probability that I told you about in the first slide, the P of W given X. Obviously that's an impossible task, right? So instead we use what is called the Viterbi algorithm, which is an efficient way to do this task. You can picture the Viterbi algorithm like this: on the x-axis we have the sequence of feature vectors, the 40-dimensional spectral feature vectors I talked about. On the y-axis you can sort of picture the composed state graph, the search graph I showed you before. And the Viterbi algorithm is trying to find the alignment of these states to the feature vector sequence, the best alignment that gives you the best posterior probability. Whatever that is, that is going to be the thing that the system outputs. In order to do this, what do we need? At any particular point in time, at any frame, I need to be able to compute the score for that feature vector for that state: what is the score for the feature vector x given the state s? If I can do this, I'm home free; the Viterbi algorithm takes care of the rest, okay?
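To make the decoding step concrete, here is a minimal sketch of the Viterbi algorithm over a small left-to-right HMM, assuming we are already given the per-frame acoustic scores log P(x_t | s) for each state. The transition matrix and the scores in the example are made up; a real decoder works over the full composed search graph, not three states.

```python
import numpy as np

def viterbi(log_obs, log_trans, log_init):
    """Find the best state sequence through an HMM.

    log_obs[t, s]   : log P(x_t | s), the per-frame acoustic score for state s
                      (from a GMM in the old systems, from a DNN later on)
    log_trans[s, s']: log probability of moving from state s to state s'
    log_init[s]     : log probability of starting in state s
    Returns the best state path and its total log score.
    """
    T, S = log_obs.shape
    delta = np.full((T, S), -np.inf)      # best score ending in state s at time t
    back = np.zeros((T, S), dtype=int)    # backpointers for recovering the path
    delta[0] = log_init + log_obs[0]
    for t in range(1, T):
        for s in range(S):
            scores = delta[t - 1] + log_trans[:, s]
            back[t, s] = np.argmax(scores)
            delta[t, s] = scores[back[t, s]] + log_obs[t, s]
    # Trace back from the best final state.
    path = [int(np.argmax(delta[-1]))]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1], float(np.max(delta[-1]))

# Tiny 3-state left-to-right example with made-up scores.
rng = np.random.default_rng(0)
log_obs = np.log(rng.uniform(0.1, 1.0, size=(10, 3)))
log_trans = np.log(np.array([[0.6, 0.4, 0.0],
                             [0.0, 0.6, 0.4],
                             [0.0, 0.0, 1.0]]) + 1e-12)
log_init = np.log(np.array([1.0, 1e-12, 1e-12]))
print(viterbi(log_obs, log_trans, log_init))
```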
So how do we do this? The P of x given s is what is computed by our acoustic model, okay? In the past, speech recognition systems used Gaussian mixture models for their acoustic models. A Gaussian mixture model looks like this; at the bottom right of the slide is the equation. It's a very simple thing: a Gaussian mixture model is simply a weighted sum of normal distributions. The x's here are the spectral vectors that we talked about, the 40-dimensional spectral vectors, the N represents a Gaussian distribution, and we're just taking a weighted sum. So if you had a single Gaussian, then you just have the normal Gaussian shape, and that's all you have. But if you have two Gaussians, you can imagine that you have one like that and another like this, and by weighting, by making one shorter and the other taller, one wider and the other narrower, you can get a distribution that now looks like that. And if you have 50 of these, you can make an even fancier distribution. The idea was that by throwing in multiple Gaussians, and by estimating the means and the variances and the weights from the data, we can approximate fairly arbitrary distributions.

Each HMM state has its own Gaussian mixture model. This is a critical point. We typically model about 10,000 triphone states. Remember we had 40 phones, so 40 to the power of three possible triphones, though typically many fewer than that actually occur; each of those triphones has three states, and we typically ended up with about 10,000 states. That's still the order of magnitude now, 10,000, 20,000, something like that, okay? So each of these 10,000 HMM triphone states has a Gaussian mixture model with, let's say, 20 Gaussians. So that's it. The model for a state is not very expressive: it's got a very small number of Gaussians, and that's all you can do. Each state is trained separately, with data from that state only. So that is another issue with these models, okay? And typically these are the kinds of models we used to have: 10,000 states, 30 million parameters.

How would we train such a thing? You start off with your current estimate of the HMMs for all those 10,000 states, all our triphones rather; we have some HMM model at iteration one. We use that model and the Viterbi algorithm, or an algorithm similar to it called the Baum-Welch algorithm, to align our training data. Our training data is essentially a sequence of feature vectors and the corresponding transcript: we have waveforms, with humans having transcribed those waveforms into words. The words are converted to phones, the phones are converted to HMM states, and so for each training example we have a feature sequence, the spectral feature vectors, and a state sequence. We have to now align each of those states with the corresponding feature vectors. That is done by the Viterbi algorithm or by the Baum-Welch algorithm. We then take all the feature vectors that ended up in state number one and estimate the Gaussian mixture model for state number one, similarly for state number two, and so forth. Having done that, we now have a new HMM. We use that to realign the same data against the transcriptions and re-estimate the parameters of the models. So this is an iterative re-estimation algorithm. It is in fact the expectation-maximization algorithm, the EM algorithm, that you might have heard of. It's doing clustering in space and time. You can think of the time clustering as the alignment of the frames to the states, and the space clustering as figuring out which of the 20 Gaussians per state produced that particular feature vector. Okay, so that's how HMMs used to work with Gaussian mixture models.
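Here is a rough sketch of that per-state score, the log likelihood of one feature vector under a diagonal-covariance Gaussian mixture model. The 20 Gaussians over 40 dimensions match the ballpark figures above; the parameter values themselves are random placeholders.

```python
import numpy as np

def gmm_log_likelihood(x, weights, means, variances):
    """Log P(x | state) under a diagonal-covariance Gaussian mixture model.

    x         : (D,) feature vector, e.g. the 40-dimensional spectral vector
    weights   : (K,) mixture weights, summing to 1
    means     : (K, D) per-Gaussian means
    variances : (K, D) per-Gaussian diagonal variances
    """
    D = x.shape[0]
    # Per-component log N(x; mean_k, diag(var_k)).
    log_norm = -0.5 * (D * np.log(2 * np.pi) + np.sum(np.log(variances), axis=1))
    log_exp = -0.5 * np.sum((x - means) ** 2 / variances, axis=1)
    log_comp = np.log(weights) + log_norm + log_exp
    # Log-sum-exp over the K components of the mixture.
    m = np.max(log_comp)
    return m + np.log(np.sum(np.exp(log_comp - m)))

# One HMM state with a 20-Gaussian mixture over 40-dimensional features.
rng = np.random.default_rng(0)
K, D = 20, 40
w = np.full(K, 1.0 / K)
mu = rng.normal(size=(K, D))
var = rng.uniform(0.5, 2.0, size=(K, D))
print(gmm_log_likelihood(rng.normal(size=D), w, mu, var))
```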
So what were the knobs we had to play with? Our two tweak factors here are the number of states and the number of Gaussians per state. Increase the number of states and you get more resolution; increase the number of Gaussians per state and you get even more resolution. So if you want finer and finer resolution, throw in more parameters by increasing the number of states and increasing the number of Gaussians, similar to what we would do for neural networks by throwing in more layers and more nodes per layer, right? But when you do that, you need to be able to train it, and typically we would run out of training data to train this large number of parameters. In an HMM it's even worse, because each HMM state contains maybe 20 Gaussians, and if I keep increasing the number of states, I may not even observe data in a specific state. So a lot of work was done in robust estimation, trying to estimate these parameters with limited amounts of data. Next, HMMs with Gaussian mixture models are a generative model trained with maximum likelihood estimation; that is, we train each state with only the data that lands in that state. But we can do better if we take the data that lands in all the states and do discriminative training, trying to separate the states from each other. So discriminative training was tried, and we got better results doing that. For the feature vectors that I talked about, people typically assumed that the components of the feature vector were independent of each other. It's a hacky assumption, made purely because we wanted the math to be easy and didn't want the huge number of parameters that come with full covariance matrices. So people did things like transform-based Gaussian covariance modeling, which gave us pretty good results as well. And finally, speaker normalization. If you say the word "cat" and I say the word "cat" and any of the 100 people in this room say the word "cat", it'll all sound slightly different, because we have different accents, different genders, different heights; all our voices are different. So the word "cat" is all over this acoustic space, and we want to bring those words down into a canonical space, because in that canonical space we can model much more efficiently.
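To give a feel for how those two tweak factors drive the parameter count, and why the diagonal-covariance assumption was so tempting, here is a back-of-the-envelope calculation in Python. The specific numbers (10,000 states, 20 Gaussians, 40-dimensional features) are illustrative, roughly in line with the figures mentioned in the talk, not from any particular system.

```python
def gmm_hmm_parameter_count(num_states, gaussians_per_state, feat_dim, full_covariance=False):
    """Rough parameter count for a GMM-HMM acoustic model.

    Each Gaussian has a mean vector, a covariance (diagonal or full), and a
    mixture weight; transition probabilities are ignored since they are
    negligible by comparison.
    """
    cov = feat_dim * feat_dim if full_covariance else feat_dim
    per_gaussian = feat_dim + cov + 1          # mean + covariance + weight
    return num_states * gaussians_per_state * per_gaussian

# The two tweak factors: number of states and Gaussians per state.
print(gmm_hmm_parameter_count(10000, 20, 40))                        # ~16 million, diagonal covariance
print(gmm_hmm_parameter_count(10000, 40, 40))                        # ~32 million: double the Gaussians
print(gmm_hmm_parameter_count(10000, 20, 40, full_covariance=True))  # ~330 million with full covariances
```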
Using techniques like this, people worked on several speech recognition tasks. I just want to show you the Switchboard example here. The US government, DARPA and NSA, used to fund speech recognition research in the 70s, 80s and 90s, probably even now; I don't work on those funds anymore, so I don't know what's going on now. The way it worked is they would define specific tasks, like Switchboard and Broadcast News. All the sponsored research sites would be given the data, and they'd be given some money to work on the data as well. At the end of the year, they would all have to evaluate their systems on a given, fixed test set, and you'd get one week to run your models on the test set. Then NIST would take this, score the results, and basically report how well each site did. It was kind of a friendly competition that over time really spurred speech recognition research. Switchboard is a specific example of this data that started off in the early 90s. It is a conversational speech recognition task. Let's see if I can play it. Is it coming through? Can you hear anything? No, let's skip it. So what this is, basically, is a task where pairs of speakers were given a topic and asked to talk about it over the telephone. For example, the topic these two people were talking about was gardening. It's a conversation between two people; each channel was recorded separately, transcribed by human transcribers, and used for training and testing. Well, let's see... "I put in pepper plants this weekend." "Oh, wish I could be doing that." "Yeah, I got all my little seedlings coming up in the kitchen, and I enjoy tinkering with it, you know what I'm saying?" "Okay." I don't know how to stop this. "Do you have a lawn?" Okay, so you get an idea, right? It's a fairly complicated task to transcribe that. Normally, with the kind of speech recognition systems we use, you're talking to a machine. When you talk to a machine, you kind of know that the machine is not as smart as you, so you're going to put in some effort to talk somewhat reasonably. But when people are talking to each other, this is what they do, right? And you have a machine eavesdropping on that and trying to recognize it. So human-to-human transcription is much harder than human-to-machine transcription, right?

This task started off in the 1990s. As you can see, we started off with a very high word error rate. In speech recognition, we use word error rate as our metric: a high number is bad, a low number is good. The word error rate was about 48% in 1995, and over time, using the techniques I talked about in the previous slide, it was pushed down to the 19.5% range. But then it flattened out in the early 2000s, and we didn't get far beyond that with the techniques people were using. The models used there had about 20 million parameters and actually used a combination of systems: people trained HMM systems with slightly different tweaks and combined them using some voting mechanism. That is what this model is, okay?
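Since word error rate is the metric behind all the numbers that follow, here is a minimal sketch of how it's computed, a standard edit distance over words. The example sentences are made up.

```python
def word_error_rate(reference, hypothesis):
    """Word error rate: (substitutions + insertions + deletions) / reference length,
    computed with a standard edit-distance dynamic program over words."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = edit distance between the first i reference words and first j hypothesis words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(ref)][len(hyp)] / len(ref)

print(word_error_rate("i put in pepper plants this weekend",
                      "i put pepper plans this weekend"))  # 1 deletion + 1 substitution -> 2/7
```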
Meanwhile, what was happening in neural networks? In the previous wave of neural networks, in the 80s, kind of when I did my PhD in that area, backprop was invented and was supposed to work for deep networks, but in practice it only worked for shallow networks, because if you go deep into a network with these activation functions that squash everything between zero and one, the gradients get smaller and smaller and smaller, and training becomes very, very slow. So in the 1990s this resulted in a sort of hiatus in neural network research. But in the 2000s, Geoffrey Hinton and his students came up with the idea of deep learning through unsupervised deep belief network training, and that resulted in a resurgence of neural networks, and it got applied to speech. This is roughly the history of neural networks in speech recognition. As I said, in 1994 there was an effort to do neural networks for speech. It actually worked, but it used a very shallow network and a very limited speech recognition task, nothing like the Switchboard task I just showed you the example of. Then in the 2000s, Hinton and his students came up with deep belief networks and applied them to phone recognition, which worked pretty well on the TIMIT corpus. Then Hinton's students, Dahl, Mohamed and Jaitly, interned at different companies, IBM, Google and Microsoft, tried to apply this to large vocabulary speech recognition, and showed some improvements. And finally, in 2011, Frank Seide and his colleagues at Microsoft came out with the first paper where deep neural networks were used for large vocabulary speech recognition and showed a dramatic gain, the kind of gain that nobody had ever seen before in the area of speech recognition.

I should say here that in the past, in the 1990s and early 2000s, any specific idea that somebody might come up with would give at best a 10% relative improvement in word error rate. So you might go from 20% to 18%. That would be considered a huge win, and you would get a lot of pats on your back if you got that sort of 10% relative improvement. That was huge; 3% was more normal, or even 2%. So people would stack up these two percent, three percent, even one percent improvements to get to that 19.5% number that I talked about. With deep neural networks, they did much, much better.

This is the difference between a deep neural network and a Gaussian mixture model, and I think it's kind of interesting to see what the differences are. Remember, in a Gaussian mixture model system, each HMM state has its own Gaussian mixture model; that's shown on the left-hand side. Each state is modeled independently, by itself: the data for that state is used to train that state and no other state. On the right side, you have the equivalent deep neural network. Same number of states, 10,000 states in both cases, but in the deep neural network, all the states share the same deep neural network. All the states are trained with the data from all states, in a discriminative fashion. The model is much more expressive than a Gaussian mixture model, and that's why it tends to work much better. So Frank Seide in 2011 did this work where he took a state-of-the-art Gaussian mixture model system for Switchboard and simply replaced the GMM with a DNN, and went from a 23.6% word error rate to a 16.1% word error rate. Just an amazing thing. A 32% relative improvement in word error rate was never heard of before. And I remember when I saw this thing, I was like, yeah, there must be some bug. Everybody thought that. But then other people repeated the same thing in different labs, at IBM and at Google and at different universities, and the rest is history, okay? So this was a really very special paper. Very simple in a sense, because at a high level all he did was take out the GMM and put in a DNN, but none of these things are simple to do, as you know, right?

What is the difference between Steve Renals's work in 1994 and Frank Seide's work in 2011? Steve Renals had the same idea, but instead of modeling 9,000 triphone states, he modeled 70 context-independent phone states. He was only modeling context-independent phones like 'dh', 'ih' and 's', whereas Frank Seide was modeling the 'ih' with a left context of 'dh' and a right context of 's': a finer resolution. One hidden layer in 1994 with 1,000 nodes, as compared to seven hidden layers with 2,000 nodes per layer. That's it. That's really the difference, and that's what made the difference and got Frank Seide all the credit for doing this stuff. But as you can imagine, the devil is in the details. The idea was there in '94, but it only came to fruition in 2011. Why did this happen? Well, the advent of GPUs and compute power is what made this possible in 2011. And also Hinton's work on training a deep neural network starting off with unsupervised deep belief pretraining. But it turned out that over time, even this deep belief stuff was not necessary. People today use only standard weight initialization for deep neural networks and standard backpropagation. They don't even have to use the unsupervised deep belief training, because we have much more data and much more compute power.
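Here is a rough PyTorch sketch of the kind of feed-forward acoustic model being described: a stack of fully connected layers mapping a window of spectral frames to a softmax over roughly 10,000 tied triphone states. The layer sizes and the context window are illustrative assumptions, not the exact configuration from the 2011 paper.

```python
import torch
import torch.nn as nn

# Seven hidden layers of 2,000 units; input is a small window of spectral
# frames stacked together; output is a distribution over ~10,000 triphone states.
context_frames, feat_dim, hidden, num_states = 11, 40, 2000, 10000

layers = [nn.Linear(context_frames * feat_dim, hidden), nn.ReLU()]
for _ in range(6):
    layers += [nn.Linear(hidden, hidden), nn.ReLU()]
layers += [nn.Linear(hidden, num_states)]
dnn = nn.Sequential(*layers)

# One batch of stacked frames -> per-frame state scores.
x = torch.randn(32, context_frames * feat_dim)
log_posteriors = torch.log_softmax(dnn(x), dim=-1)
print(log_posteriors.shape)  # (32, 10000)

# Trained discriminatively with cross-entropy against the aligned state labels;
# at decode time log P(x|s) is approximated by log P(s|x) - log P(s).
loss = nn.CrossEntropyLoss()(dnn(x), torch.randint(0, num_states, (32,)))
```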
So the slowness of the training sort of went away as a problem. What's next? The next thing one might imagine is using recurrent neural networks. Why would we use recurrent neural networks for speech? Well, speech is a sequence, a time sequence. So if you want to model something at time t, you might want to know something about time t minus one and t minus two. A feed-forward neural network, on the left, doesn't really allow you to do that, because at any particular point in time its input is only from time t. The RNN, on the other hand, feeds back its hidden representation at time t to time t plus one, and so you can theoretically learn infinite history using an RNN. So replacing the DNN with an RNN for acoustic models was the next idea. This slide basically shows that although an RNN can theoretically have infinite history, what actually happens is that because of vanishing gradients over time, you lose history fairly quickly. This was a problem recognized by Jürgen Schmidhuber and his student Sepp Hochreiter, and they came up with the idea of long short-term memory neural networks, LSTMs, to fix this using memory and gating cells. What they do is basically use these gating cells to decide how much of the previous cell memory should be kept or removed, what should be input, what should be output, and so forth. By doing this, they were able to control how much the network remembers based upon the values of these gating cells, and all of these things are trained: the parameters of the LSTM cells are trained from data. So depending upon the input, it might remember more or less. But the point is that the network is able to do that, and so as you throw data at it, it is able to learn what it needs to.

So in 2015, Mohamed, who is one of Geoffrey Hinton's students and who at the time was at Microsoft, I think, did this work where they replaced the DNN with an LSTM. So we had already gotten this 32% relative improvement going from Gaussian mixture models to DNNs, and now we're going from DNNs to LSTMs, right? They got a 15% relative improvement by doing this. 15% is nothing compared to 32%, but it's still pretty darn good compared to the sort of gains we used to get with Gaussian mixture model approaches, where, like I said, the best gains we typically got from any particular idea were around 10%. Obviously, when we moved from context-independent phones to context-dependent phones, we probably got a huge gain, but I'm talking about more recent stuff, all right? So this was a pretty large gain as well. And I think these two, deep neural networks and LSTM networks, are the main ways in which neural networks have improved the standard HMM-GMM systems as far as acoustic modeling is concerned. We've used other things like convolutional neural networks, and we've used other cost functions like CTC instead of cross entropy, and overall they all work very well. Sometimes they work better for some teams than for others, because a lot of this stuff is in the details: you might try something, somebody else may try the same thing, and one of you may get a win while the other doesn't. I know that at Microsoft they do use convolutional neural networks; at Google they used to use them too, but LSTMs work very, very well by themselves.
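A rough PyTorch sketch of the corresponding LSTM acoustic model: the same kind of per-frame state scores, but computed by a stacked recurrent network that carries history along the frame sequence. The layer sizes are again illustrative, and real systems may use bidirectional layers.

```python
import torch
import torch.nn as nn

class LSTMAcousticModel(nn.Module):
    """Stacked LSTM acoustic model: consumes the frame sequence in order and
    emits per-frame scores over the tied triphone states."""
    def __init__(self, feat_dim=40, hidden=512, layers=3, num_states=10000):
        super().__init__()
        self.lstm = nn.LSTM(feat_dim, hidden, num_layers=layers, batch_first=True)
        self.output = nn.Linear(hidden, num_states)

    def forward(self, frames):           # frames: (batch, time, feat_dim)
        hidden_seq, _ = self.lstm(frames)
        return self.output(hidden_seq)   # (batch, time, num_states)

model = LSTMAcousticModel()
scores = model(torch.randn(8, 100, 40))  # 8 utterances of about 1 second each
print(scores.shape)                      # (8, 100, 10000)
```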
What's next? So far I talked about the acoustic models and how deep neural networks transformed the way acoustic modeling is done, going from Gaussian mixture models to deep neural networks. But there's this other model, the language model. Remember, the language model is the one that tells you what word sequences are possible; it basically tells you, hey, you can't say things like "this this this", because that's not a sentence. You're only allowed to say reasonable things. One example of this is that if you try saying something like "play music by U2", the band U2, you might actually get "play music by YouTube", because "U2" and "YouTube" sound very similar acoustically, but "YouTube" might have a higher probability as far as the language model is concerned. So try things like that. You can break some of these systems, but they'll learn fairly quickly and work after that.

So here what we're doing is trying to use a neural network for language modeling. A language model also calls for a recurrent neural network, because a sentence is a word sequence; there is a temporal aspect to it, from left to right. And so what we're doing here is using a deep LSTM network to represent the language model. The input to this language model is a one-hot feature vector whose dimensionality is equal to the vocabulary size of the system. So it might be a 100,000-dimensional vector with only one unit on, representing the specific word presented to the system. The network is trying to predict the next word. So for "this is speech", I give it the word "this", and the network has to predict "is". Then I give the network "is", and it has to predict "speech". That's the way you train this network. The output layer is also a one-hot word vector, where you turn on the word you're trying to predict. The internal layers are LSTM layers with recurrence, and the first layer is a projection layer that brings this 100,000-dimensional feature vector down to some lower dimension that can be more easily managed. You train the system using the standard backpropagation training algorithm for recurrent networks and apply it to speech recognition.

Now, this language model cannot be applied in a standard first-pass speech recognition, because you cannot construct the state graph that I showed you using an LSTM language model. The reason is that this thing has infinite history, so there's no way for me to construct that state graph. Instead, what we do is take an n-gram language model; typically in speech recognition people use four-gram language models. The example I showed you was a bigram language model, but people actually use four-grams because that gives better performance. So you use a four-gram language model, and instead of producing just the one-best result, the single answer the system thinks is correct, you produce the 500 best results. It's called an n-best list, with n equal to 500. You can make n equal to 1,000 or whatever; those are all tweak factors. In this particular paper from Microsoft, they used n equal to 500. So you have 500 answers, and then these 500 sequences are re-scored using the LSTM language model. That's easy to do, because you just give each of these 500 hypotheses to the LSTM and it gives you a score for each of them. The list is then reordered based upon the score from the LSTM combined with the scores from your previous models.
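To make this concrete, here is a rough PyTorch sketch of such an LSTM language model and the n-best rescoring step. The layer sizes, the word ids, and the interpolation weight are all illustrative assumptions, not the actual values from the Microsoft paper.

```python
import torch
import torch.nn as nn

class LSTMLanguageModel(nn.Module):
    """Word-level LSTM language model: the huge one-hot word input is projected
    down to a manageable dimension, passed through recurrent LSTM layers, and
    mapped to a softmax over the vocabulary to predict the next word."""
    def __init__(self, vocab_size, proj=256, hidden=1024, layers=2):
        super().__init__()
        self.project = nn.Embedding(vocab_size, proj)  # projection of the one-hot input
        self.lstm = nn.LSTM(proj, hidden, num_layers=layers, batch_first=True)
        self.output = nn.Linear(hidden, vocab_size)

    def forward(self, word_ids):                  # (batch, time) word indices
        h, _ = self.lstm(self.project(word_ids))
        return self.output(h)                     # (batch, time, vocab_size)

    def sentence_log_prob(self, word_ids):
        """Sum of log P(next word | history) over one sentence."""
        log_probs = torch.log_softmax(self.forward(word_ids[:, :-1]), dim=-1)
        targets = word_ids[:, 1:]
        return log_probs.gather(2, targets.unsqueeze(-1)).sum().item()

def rescore_nbest(nbest, lm, lm_weight=0.5):
    """Re-rank an n-best list of (word_ids, first_pass_score) pairs by
    interpolating the first-pass score with the LSTM LM score."""
    scored = [((1 - lm_weight) * score + lm_weight * lm.sentence_log_prob(ids), ids)
              for ids, score in nbest]
    return [ids for _, ids in sorted(scored, key=lambda p: p[0], reverse=True)]

# Tiny made-up example: vocabulary of 1,000 words, two hypotheses from the first pass.
lm = LSTMLanguageModel(vocab_size=1000)
hyp1 = torch.tensor([[1, 3, 17, 42]])   # e.g. <s> this is speech
hyp2 = torch.tensor([[1, 3, 17, 99]])   # e.g. <s> this is peach
best = rescore_nbest([(hyp1, -12.3), (hyp2, -11.9)], lm)
```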
Doing that, they were able to reorder the answers and get a 27% relative improvement on Switchboard, which is a huge gain also, right? So these are the kinds of things people have done with neural networks: deep neural networks for acoustic modeling, deep LSTMs for acoustic modeling, deep LSTMs for language modeling. By doing all this, on the Switchboard task that I showed you before, this is where things have gone. Between the 1990s and early 2000s we got these gains using techniques like robust estimation for Gaussian mixture models, covariance modeling, discriminative training, and speaker normalization, and things leveled out in the 19.5% to 19.8% word error rate range. There was a period where we didn't get very far beyond that, till Frank Seide's paper brought the word error rate down to 16.1%, okay? I did say, if you recall, that his baseline number was 23.6 or something, whereas we see 19.8 over here. The reason for that is that these 19.8 numbers are combinations of multiple systems, each of them probably producing around 23.6 on its own. So Frank Seide's 16.1 number is actually a single system; if he had combined three of these things, he would have been down to 13-something. That's the way to compare this, but this graph is just giving an idea; it's not exactly the same data and stuff, right? It's the same data set, but every year the actual test set was slightly different. In this graph I think from 2000 onwards we're using the same test set. So from 16.1 we went to 9.9, which is the LSTM improvement. And finally, Microsoft and IBM have recently shown work on this task where they claim better-than-human speech recognition on the Switchboard conversational speech recognition task. When they say better than human, the way they measure human performance is, IBM did this thing where they have four humans transcribe the utterance and a fifth human look over those four and pick the best one, right? So it's obviously not a real-time system; neither are these automatic systems real time. But humans got to 5.1%, and now the machine can get to the same number, or maybe the human was slightly worse, right? So the claim is that the machine is almost as good as a human for conversational speech recognition. It's an existence proof, because the 5.1 system actually combines eight or nine different acoustic models, all trained with deep neural networks in slightly different ways, different types of recurrent neural networks, bidirectional LSTMs, and so forth. It also has multiple language models, LSTM language models and four-gram language models, combined in a very special manner to get to 5.1. But if you have time and you have compute, and it doesn't have to be a real-time system, you can get to 5.1, right? Which is pretty cool. So if you look at these two ellipses, we used to be up there and now we've come down here, which I think is really cool, really exciting. Over the last 10 years I've seen this happen, and it's really fun to see. Nowadays in speech recognition, it's all about using deep neural nets for this sort of work. Okay, I think I'm done. What's next? There are sequence-to-sequence models, where people are using a single neural network to model the language model, the lexicon, and the acoustic model all in one. The input is speech, the output is character sequences.
That's pretty cool, and the results are almost as good as a traditional speech recognition system; I think it's a matter of time before that becomes mainstream. The other area I would point to is applications. I think speech recognition is at a point where we can think in terms of interesting applications. When I was at Google we worked on an application for doctor-patient conversations: we tried to recognize the conversation and extract a doctor's note automatically. Another thing might be trying to recognize what I'm saying here and then get little nuggets out of it so that we can later search through this meeting. People talking over each other: today, if two people are talking over each other, we just throw up our hands and don't recognize that part. But if you can transcribe those parts and get training data for that, maybe a neural network can learn to separate out two people talking over each other. So there are lots of new areas that we can work on. I hope this talk gave you some idea of how speech recognition works and showed you how neural networks have made a change in the last 10 years. I have maybe a few minutes to take a couple of questions. Thank you so much.

All right, thank you. Extremely interesting talk, my compliments to you. Now another question about what's next. What is the compute power which will eventually be required, or what are we looking at? What kind of smaller and smaller devices can we put this into? I'm not an expert on compute power, but today we already have deep neural networks working in our cell phones at Google. So I think as compute power increases, naturally we'll do even better. With this 5.1% number, which is a combination of multiple systems, one can imagine that if you have significantly more compute power, a system like that could become more and more real-time. It probably wouldn't be streaming; it would be real-time. But for streaming systems, to get to 5.1, I still think there's a modeling aspect to it; it's not merely computing power. If you're willing to wait for the entire utterance and combine multiple systems, yes, we're there, and then it's a compute issue. But I think there are still modeling aspects to getting a streaming system to recognize with a single model as you stream speech in.

Is any research going on into what animals are trying to say, and how we can recognize that? Could you repeat the question? What animals are trying to say. Like humans, what humans speak we can understand. Is there any research going on into what animals are trying to say in their language? Research on what anyone is trying to say? Animals. Animals, I see. I don't know. I think there is some work. I'm familiar with some work on bird song recognition, or at least with the fact that such work exists; I'm not familiar with the details of it. But yes, there is work on recognizing bird songs: you put these microphones all over the forest and try to count the number of birds of different species that exist based upon their calls. So yeah, there is work on that as well.

All right, thank you very much. Dr. Anand, that was really fantastic. Appreciate it. Thank you so much for having me. I am hoping you'll be around so people can spend some time. Yeah, I'll be here for a little bit more time in the morning. All right, thank you so much. Thank you so much.