Okay, good morning everyone. Today I'm going to talk about ASR. ASR stands for automatic speech recognition. First I'm going to talk about the need for ASR, and then about what goes inside a typical speech recognition system.

So let me start with Bixby. Bixby is Samsung's intelligent assistant; it is what is going to power all our devices, the brain behind all the devices. If you look back, the 1990s were the era of the web: systems connecting to the internet, client-server kinds of communication. Then the 2000s were the era of the mobile phone, the smartphone, which is still here today; people have one, two, three smartphones, and smartphones are really proliferating, all powered by apps. But friends, very soon we are moving to a world where there will be no apps at all. We are moving there slowly. Siri, Cortana, all these systems have been introduced: Siri came out a while back, then Google's assistant, and Bixby was launched in 2017. And 2017 was also the year when smart speakers took off: around 47 million smart speakers were sold in 2017, and last year more than 100 million were sold. So with smart speakers we are seeing a big revolution, and things are changing. (Where is that sound coming from? I don't know. It's probably not very smart.)

So what I'm telling you is that the world is changing. You're going to have a world where everything around you has the capability to listen to you and act. Your curtains will become smart; they already are becoming smart. You have smart locks, you have smart sensors; everything is becoming smart. And these assistants are in all devices now: even cars have assistants, and you have robots. A world will come where robots are very popular; today they are not really there yet. To make all this happen, you need the system to be a lot more intelligent than it is today; today it's somewhat intelligent, but not enough. That's the goal of Bixby: to make it very intelligent and give people one experience, so that no matter which device they go to, they experience the same intelligence. Bixby has voice, Bixby has vision, Bixby has data, all kinds of intelligence. Bixby now has about 50 million registered users, it's available in about 200 countries, and it runs on about 500 million devices. Samsung's strength is devices; we ship more than 500 million devices a year, and Bixby is running in all of those.

But today I'm going to talk about what is critical to make these voice systems work: you need a speech recognition system. So I'm going to go into the details of what goes inside a typical speech recognition system. Today we have all these languages; Bixby supports all these languages. But what are the challenges today? Why is voice not taking off on mobiles? On mobiles you already have touch, but there are devices such as speakers and some others where there is no touch; the only way to interact with them is voice. And for voice to really take off, first of all, it should be accurate.
No matter how users speak, in whatever way, it should understand them. That's the first thing. Second, people can speak in multiple languages and switch between them; for example, we can speak in English with a few words of Hindi mixed in. There's code mixing happening all the time. And the system should work in a noisy environment. A home may be very noisy; a car may be very noisy. But that's where people use voice the most: people don't use voice when they're at work, they use it when they're at home. And you can have lots of noises there, a dog barking, all kinds of noises, so the system should be tolerant. Then it should understand who spoke what when multiple people are speaking. There's a whole field called speaker diarization that tries to figure out who is speaking what; if people talk over each other and the speech gets overlapped, it tries to sort that out. Today I'm not going to go too deep into those areas; I'm going to focus on the fundamentals and some of the research areas we are pursuing. I'm going to talk about basically two areas: one is wake-up, and the other is ASR.

So let me describe how the system typically works. Let's say a user utters something. The first thing that happens is that these acoustic signals get converted into some representation; this could be MFCC or some similar encoding. Then there is some preprocessing that cancels echo and suppresses noise. After preprocessing, the first thing that happens (and it's not only ASR, there are many things working together) is wake-up. When the user starts uttering, the first thing people say is "Hi Bixby," and the system has to wake up. That's called wake-up, and wake-up itself is quite complicated; I'll explain how. The second piece is the ASR, which converts the speech into text. When the system is trained, the training text can contain things like 7:00; that has to be normalized into its spoken form, so there's a text normalization step at the beginning. And when the ASR outputs, it just says "seven AM" as words, and it can spell things in different ways; it can spell John as J-O-H-N or J-O-N. So you need something called personal data, or a PLM, a personal language model, which takes whatever is on your phone (your contacts, your file names, your app names, everything on your phone or other devices) and tunes the model so that it outputs what makes more sense. Then, after the ASR gives its output, there is something called inverse text normalization that converts the "seven AM" back to 7:00 AM. And there's also SPD and EPD, start point detection and end point detection: EPD stands for end point detection, SPD stands for start point detection.
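To make that flow concrete, here is a minimal sketch of the pipeline just described. Every function, return value, and example string below is a made-up placeholder for illustration, not a real Bixby API:

```python
# Sketch of the voice pipeline: preprocess -> wake-up -> SPD/EPD -> ASR -> ITN.
# All of these are stubbed placeholders, only the control flow is the point.

def preprocess(audio):              # echo cancellation, noise suppression
    return audio

def wake_up(audio):                 # "Hi Bixby" keyword detection (stubbed)
    return True

def start_point(audio):             # SPD: trim leading non-speech
    return 0

def end_point(audio):               # EPD: find where the user stopped speaking
    return len(audio)

def asr(speech, plm=None):          # speech-to-text, biased by personal data
    return "set an alarm for seven AM"

def inverse_text_normalize(text):   # ITN: "seven AM" -> "7:00 AM"
    return text.replace("seven AM", "7:00 AM")

def handle_audio(raw_audio, personal_data=None):
    audio = preprocess(raw_audio)
    if not wake_up(audio):
        return None                 # stay dormant, send nothing to the server
    speech = audio[start_point(audio):end_point(audio)]
    return inverse_text_normalize(asr(speech, plm=personal_data))

print(handle_audio([0.0] * 16000))  # -> set an alarm for 7:00 AM
```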
The system has to understand when the user starts to speak. Wake-up typically involves a small amount of processing on the device: there is a DSP on the device that is always listening, so whenever you say "Hi Bixby," it wakes up, and then another model runs to verify that you really said "Hi Bixby" or not. I'll explain that more later. After the user says "Hi Bixby," there could be a pause, and then the user starts speaking. Start point detection is critical, because if you don't do it accurately, all kinds of junk and noisy data can go to the server and just increase your server processing. And end point detection is also very important, because you have to understand when the user has stopped speaking: if you cut off the sentence before the user stops speaking, there will be a big drop in accuracy, and if you don't cut it off in time, there can be latency issues. I'll talk about that. So basically this is how the whole system works: it's not only ASR, but a complex system with many components all working together.

So why is speech so difficult? Why is ASR so difficult? Why, after 30 or 40 years of people working on speech, is the real accuracy of any commercial system, if you actually measure it, still not more than 75%, maybe 80% at most? Because human speech is so difficult; there are so many issues. Every speaker speaks differently: some speak slowly, some fast, some have different accents, some are high-pitched, some low-pitched. There is a huge number of words, a huge vocabulary that keeps changing. And people murmur, people whisper, all kinds of things. There can be noisy environments; you can have multiple people speaking together. So there are many, many reasons why speech recognition is inherently a difficult job.

However, deep learning has been used a lot in speech, and I'll explain how. Initially people used GMMs and HMMs, mostly working together. But around 2011 to 2013 is when DNNs really took over, and since then speech recognition accuracies have improved a lot. So let me attempt to explain exactly how a typical ASR works. As I mentioned, there is audio when the user is speaking. Assuming wake-up is over, assuming "Hi Bixby" has been said and wake-up is done, now the user is speaking. Audio comes in, feature extraction happens and speech features are produced, and of course some preprocessing happens. After that, there is an acoustic model.
The acoustic model basically links your acoustics to phonemes: it links the acoustics to a word or phoneme sequence and estimates the likelihood of an acoustic sequence given a word or phoneme sequence. Traditionally GMMs were used for this purpose, but now they are being replaced by DNNs, typically LSTMs; we use multi-layer LSTMs for this. To train these acoustic models, we need a lot of data, thousands of hours, 10,000 or 20,000 hours, and it has to cover different accents, different speaking styles, different demographics, different noise conditions. It's complicated by the fact that a device like a phone is a near-field device, but the moment you have a speaker, it's a far-field device, and things like reverberation and noise play a big role. So we have to train the model with all that kind of data. Some of that data we generate synthetically, some of it we get through other tricks we play, and some of it we really have to create ourselves. That is the acoustic model; in simple terms, it converts the audio into a phoneme sequence.

Then there is something called the language model. A language model looks at all the words that have been spoken so far and predicts the next word: it basically gives the probability of a word given the preceding words. People have used n-grams, but these days people are also using RNNs for the language model. Then there is the lexicon, which is basically a pronunciation model: linguists sit and build it, and it has typically been used together with HMMs. For any specific word, it lists the ways of pronouncing it; some words are pronounced differently by different people, so you can imagine the lexicon as capturing the typical ways of pronouncing things. And the language model is more like a dictionary, or you could say a grammar: all kinds of words in the language. The language model has billions of entries; it's really huge. So you can see the scale at play here: thousands of hours of acoustic data and billions of entries in the language model, all working together to make it happen. And finally there is a decoder, which finds the best sequence: it searches for the lowest-cost path in a graph, usually based on dynamic programming (a toy sketch follows). This is just another way of representing it: there is feature extraction, which gives you MFCC features; there is a Gaussian acoustic model, which can also be a deep learning model; and then there is a Viterbi decoder, which takes all these things together, optimizes, and gives the best possible output.
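As a toy illustration of that dynamic-programming search, here is a tiny Viterbi decoder over a made-up two-state HMM. A production decoding graph is vastly larger and differently weighted, so treat this purely as the core idea:

```python
import numpy as np

def viterbi(log_trans, log_emit, observations):
    """log_trans[i, j]: log P(state j | state i);
    log_emit[i, k]: log P(obs k | state i). Uniform start distribution assumed."""
    score = log_emit[:, observations[0]].copy()   # best log-prob ending in each state
    back = []
    for obs in observations[1:]:
        cand = score[:, None] + log_trans         # (S, S): prev state -> next state
        back.append(cand.argmax(axis=0))          # remember the best predecessor
        score = cand.max(axis=0) + log_emit[:, obs]
    path = [int(score.argmax())]                  # trace the best path backwards
    for ptr in reversed(back):
        path.append(int(ptr[path[-1]]))
    return path[::-1], float(score.max())

# Two states, two observation symbols; all probabilities are made up.
log_trans = np.log([[0.7, 0.3], [0.4, 0.6]])
log_emit = np.log([[0.9, 0.1], [0.2, 0.8]])
print(viterbi(log_trans, log_emit, [0, 0, 1, 1]))
```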
Now, a little theory behind it. Take the acoustics O as a sequence of individual observations, and define W as a sequence of words. The speech recognition problem is then: given an acoustic input O, pick the most probable word sequence W-hat. To do this, you rewrite it using Bayes' rule (all of you know Bayes' rule, I assume). Once you rewrite it that way, you can ignore the denominator because it's the same for every candidate, and you end up with this equation (written out below): conceptually, the first term is the acoustic model and the second is the language model, and the argmax is more like the decoder, which looks at all of this and finally tries to find the best fit.

So typically an acoustic model works with phonemes: it converts the acoustics to phonemes, and in English there are about 50 phonemes. It's trained using labelled audio data, as I mentioned: those thousands of hours all have to be labelled properly. Labelling is a very expensive job, and we are trying to see if we can do it automatically using some kind of unsupervised learning. But labelling is a very tough job, and you typically need thousands of hours of data balanced across different phonetic sequences, speaker styles, accents, gender, and so on; it's a lot of data. The lexicon, as I mentioned, is a pronunciation model: it says, for example, how "house" is typically pronounced. And then there is the language model, which is typically an n-gram model, though these days people use RNNs as well. It predicts the next word: it models the probability of occurrence of a word based on the n minus 1 previous words.

There are different types of language models. You can have a generic, open-domain language model, and you can also have a specific one, for example a language model for the medical domain or for the music domain. And the language model has to keep up with change: in the music domain, so many new music titles and movie titles come out every month, so the language model has to keep up and have the latest information. Then of course there is data specific to the user, which I'll come to next. There are also different domains within the language model: a voice-assistant domain could cover specific things like calling a contact or setting an alarm. Sometimes you have multiple types of language models and a multi-pass decoder: you have a first-pass decoder, then you re-score and run a second pass. And sometimes you need to understand the domain on the fly. For example, when you say "play," you could be saying "play a Michael Jackson song" or "play the latest Shakira song." When you hear something like that, you have to understand that it's the play domain and quickly be able to switch your language model, or bring in another language model.
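Writing out that equation in its standard textbook form:

```latex
\hat{W} = \arg\max_{W} P(W \mid O)
        = \arg\max_{W} \frac{P(O \mid W)\, P(W)}{P(O)}
        = \arg\max_{W} \underbrace{P(O \mid W)}_{\text{acoustic model}} \;
                       \underbrace{P(W)}_{\text{language model}}
```

And the n-gram language model mentioned here approximates the probability of a word using only the previous n minus 1 words:

```latex
P(w_i \mid w_1, \dots, w_{i-1}) \approx P(w_i \mid w_{i-n+1}, \dots, w_{i-1})
```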
So sometimes you have to understand the context even before you've gone to the NLU: you have to guess what the user is trying to do and then load the correct language model. And I already told you about inverse text normalization: it can clean up the output and convert words to numbers based on the context. For example, "six thirty a.m." it can make 6:30 a.m., and "one hundred dollars" it can convert to $100. This is all post-processing that happens after the ASR has done its job, and it can also correct ASR errors. For example, "where is Johns Hopkins" has no apostrophe, but "where is John's home number" has one; it has to make these kinds of corrections in the output. And many times there are homophones, words that sound the same. "How many carats is the Kohinoor diamond?" Here it's carats, C-A-R-A-T-S. "Add six carrots to my grocery list." Here it's carrots, so the spelling has to change. The ASR doesn't know this; this is something the ITN can do.

Now, as I told you earlier, there's also the personal data modeling, the personal language model. This is all about your contacts, your file names; all these things are important and play a role in optimizing the output, and this personal data model is used along with the language model. For example, let's say in the LM corpus one spelling of my name, Sandeep, occurs more often than another spelling, say Sandip, but on my wife's phone or my mom's phone my name is stored with the other spelling. When they say "call Sandeep," the system should resolve it to the contact as it is stored on the phone; it has to look at how the contact is actually stored. So this is all very complex and very important, and a typical ASR system has to take care of all of these things.

So what are the major challenges in ASR? As I mentioned earlier, for acoustic modeling, huge amounts of data are needed, both far-field and near-field. Near-field means near the phone; far-field is when the device is away from you. There is a lot of variation, and a lot of preprocessing: when you're on a speaker or a TV, depending on where the speaker is, there can be all kinds of noise and reverberation, and all of that has to be taken care of. For the language model, a large amount of vocabulary cleaning and preparation is needed, because as I mentioned, there are billions of words in the vocabulary. For the decoder, there's multi-pass rescoring: when you go from one decoding pass to the next, you need to make sure latency is not impacted. And context is very important: as I said, "play" determines the context, or "call" determines the context, and based on the context the system can switch the language model. All of this is very, very important. And there are also multiple variants of a language: in English, whether you're speaking with a US accent, a UK accent, or an Indian accent, it's different.
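To make the inverse text normalization step described above concrete, here is a tiny rule-based sketch. Real ITN uses weighted transducers and context models, so these regexes and the little number table are illustrative assumptions only:

```python
# Toy inverse text normalization: spoken-form words -> written form.
import re

SMALL = {"one": 1, "two": 2, "six": 6, "thirty": 30}

def itn(text):
    # "six thirty a.m." -> "6:30 a.m." (only when both words are known numbers)
    def to_time(m):
        hour, minute = m.group(1), m.group(2)
        if hour in SMALL and minute in SMALL:
            return f"{SMALL[hour]}:{SMALL[minute]:02d} {m.group(3)}"
        return m.group(0)
    text = re.sub(r"\b(\w+) (\w+) (a\.m\.|p\.m\.)", to_time, text)
    # "one hundred dollars" -> "$100" (a single hard-coded rule for the demo)
    text = re.sub(r"\bone hundred dollars\b", "$100", text)
    return text

print(itn("set an alarm for six thirty a.m."))  # -> set an alarm for 6:30 a.m.
print(itn("send one hundred dollars to mom"))   # -> send $100 to mom
```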
So all these variations have to be added in the acoustic model, in the language model, everywhere, basically. These are the challenges we have seen, and as a result of them, accuracies have still not gone past 75% or 80%. So how do you get to the next level? What do you do? How many of you here attended the previous talk? End-to-end, okay, only some of you.

Well, end-to-end modeling is one approach: instead of having an acoustic model, a lexicon, and a language model, you don't build any of that separately. You just have one model, which uses sequence-to-sequence learning: after feature extraction, you simply do sequence-to-sequence. There are many different variants of this. It basically started with the Listen, Attend and Spell (LAS) model; there is also the RNN transducer model, and various other variants; and there is CTC, which came earlier. All of these models have been around, people are experimenting with them, and some have worked very well.

Let me go into a little more detail. This is an example of a Listen, Attend and Spell architecture, where there is an encoder. Keep in mind that in this model there is no language model; you're not feeding it those billions of words. All of that is absent, which means everything has to come from the input, so your input data has to be much richer. I'll get to that later; I have a slide on it. So the inputs to this end-to-end ASR have to be a lot richer. The way it works is like any end-to-end system, and there is attention in there. The listener takes the speech inputs; it's analogous to the acoustic model, converting the input into an intermediate representation. Then there's an attention module, which is basically for alignment: it figures out which parts of the input are most relevant. Any of you who know attention know how this works, so I'm going to skip it, but you can read about attention; it's used in pretty much every end-to-end system. Then there's a decoder, the speller, which takes that plus whatever the previous outputs have been, and predicts each output token as a function of the previous predictions. That's how a typical Listen, Attend and Spell architecture works; a sketch follows.

People have also experimented with other kinds of architectures such as the RNN transducer model, which builds on a CTC model. CTC is, you could say, a simple network, but it assumes conditional independence between the output labels at different time steps, and CTC alone doesn't work very well, so they augment it with a prediction network, and the two are learned jointly. That's the RNN transducer model. So there are different types of networks out there, a lot of research is happening these days, and the LAS kind of architecture is the one that was used earlier.
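Here is a minimal sketch of the Listen, Attend and Spell idea in PyTorch. The layer sizes, the single-layer listener (real LAS uses a pyramidal BiLSTM), and all names are illustrative assumptions, not the production model:

```python
import torch
import torch.nn as nn

class TinyLAS(nn.Module):
    def __init__(self, n_mels=80, hidden=256, vocab=64):
        super().__init__()
        self.hidden = hidden
        # Listener: encodes audio frames into an intermediate representation.
        self.listener = nn.LSTM(n_mels, hidden, batch_first=True,
                                bidirectional=True)
        self.embed = nn.Embedding(vocab, hidden)
        # Speller: decodes one token at a time, conditioned on attention context.
        self.speller = nn.LSTMCell(hidden + 2 * hidden, hidden)
        self.attn_query = nn.Linear(hidden, 2 * hidden)
        self.out = nn.Linear(hidden + 2 * hidden, vocab)

    def forward(self, feats, targets):
        # feats: (B, T, n_mels) audio features; targets: (B, U) token ids.
        enc, _ = self.listener(feats)                 # (B, T, 2H)
        B, U = targets.shape
        h = feats.new_zeros(B, self.hidden)
        c = feats.new_zeros(B, self.hidden)
        context = feats.new_zeros(B, 2 * self.hidden)
        logits = []
        for u in range(U):
            emb = self.embed(targets[:, u])           # previous token (teacher forcing)
            h, c = self.speller(torch.cat([emb, context], dim=-1), (h, c))
            # Attention: align the current decoder state with encoder frames.
            q = self.attn_query(h).unsqueeze(1)                          # (B, 1, 2H)
            weights = torch.bmm(q, enc.transpose(1, 2)).softmax(dim=-1)  # (B, 1, T)
            context = torch.bmm(weights, enc).squeeze(1)                 # (B, 2H)
            logits.append(self.out(torch.cat([h, context], dim=-1)))
        return torch.stack(logits, dim=1)             # (B, U, vocab)
```

Note how each output token depends on the previous token and on an attention-weighted summary of the encoded speech, which is exactly the "predict each output as a function of the previous predictions" behavior described above.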
Attention has also been used for translation, for language translation, so there are many problems that are similar in nature. For end-to-end ASR, I think the biggest issue is the input: there has to be enough data to train it. But assuming you do that, what is the main advantage? Because there is no language model, there is no out-of-vocabulary problem. It can take care of anything, and it can take care of code switching, for example: people speaking multiple languages at the same time. And it's very robust. It's very good for dictation tasks, very good for open vocabulary. It can also reduce your memory a lot, because there is no language model; a language model is typically very bulky, and because this shrinks the model, it can give a big reduction in ROM. As I mentioned earlier, there will be a time when this speech system is in many devices, and it may not necessarily be in the cloud. This can really help bring that up: once you have an end-to-end ASR that works well, you can actually run it on device. And it's much simpler to build; it just makes the whole thing much simpler.

But there are challenges. One is data: it requires a huge amount of data. The second is streaming: typically it looks at the whole sentence and then produces the output, so the base architecture requires full context, and even with attention it requires full context. But there are approaches like MoChA, monotonic chunkwise attention, which breaks the attention down into chunks and processes them piece by piece; that's one possibility. Also, as I mentioned earlier, there is the personal language model, which takes your personal data and improves accuracy. Here, there is nothing like that; there is no personal language model. So what do you do? You can feed some context into the model: you can have an auxiliary network alongside it and learn with it. There is a bias encoder here, as you can see, and it learns along with the rest. There are techniques for this such as CLAS, the contextual Listen, Attend and Spell architecture, which we are working on.

There is also another approach: fusing in an LM. Even though end-to-end is very good for open vocabulary, for specific things like music it will not work by itself, so you do need a language model; but it can be a lightweight language model, and you fuse the two outputs. There are various ways of fusing the outputs: shallow fusion, deep fusion, cold fusion. I'm not going to go into too much detail; you can read it up. But there are many techniques where a language model is fused with the end-to-end ASR, and we are working on shallow fusion, which is being used a lot these days.
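A minimal sketch of the shallow fusion idea: during beam search, the end-to-end model's token score is interpolated with a lightweight external LM in log space. The weight and the toy distributions below are made up, and real systems score whole beam hypotheses rather than a single step:

```python
import math

def shallow_fusion_score(asr_log_probs, lm_log_probs, lam=0.3):
    """Combine per-token scores: log P_asr(y|x) + lam * log P_lm(y)."""
    floor = math.log(1e-9)  # back-off for tokens the LM has never seen
    return {tok: asr_log_probs[tok] + lam * lm_log_probs.get(tok, floor)
            for tok in asr_log_probs}

# Toy next-token distributions for a hypothesis like "call san...".
asr = {"sandeep": math.log(0.40), "sundeep": math.log(0.45), "sandy": math.log(0.15)}
lm  = {"sandeep": math.log(0.70), "sundeep": math.log(0.05), "sandy": math.log(0.25)}
fused = shallow_fusion_score(asr, lm)
print(max(fused, key=fused.get))  # the lightweight LM re-ranks toward "sandeep"
```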
Earlier I touched on on-device, and on-device ASR is another area we are working on. It's becoming very important because device capability keeps going up, and with the advancements in hardware you can run this entire model on the device instead of in the cloud. It also gives you the advantage of latency, so things get a lot faster; privacy, because the data doesn't have to leave the device; and reliability, because you don't depend on the network. You're in a car, moving around, and network conditions keep changing; if things are on-device, it works regardless. Typically this has been applied to dictation and things like that, but these days we are looking at applying it to everything, and eventually the vision is that on-device ASR can replace the cloud ASR; everything can run on the device. So a lot of research is happening on model compression, and on hybrid architectures to start with: something on the device and something in the cloud, working together.

The next area I'm going to talk about is whisper ASR. Research has shown that people use voice, as I mentioned, in the car or at home, but in a public place they don't like to speak loudly; they want to whisper to their phone. You're in a meeting or in a movie hall, you want to do something with your phone, and you'd like to whisper to it. So there is a whisper ASR that we're working on. The first step is to detect whether the input is normal speech or whispered speech; there's a detection stage, which includes wake-up and endpointing, and then whisper detection. After whisper is detected, we have to do the whisper ASR, and there are different approaches to that, which I'll explain.

Whispered speech has different characteristics; its spectral characteristics themselves are different. As you can see, whispered speech has less energy in the lower frequency bands compared to normal speech, and there is a consistent increase in the F1 formant frequency compared to normal speech. So we are working on detecting whispered speech using this, and we have a model that can detect whether speech is whispered or not. Once we know it is whispered, there are multiple approaches. One is to convert the whispered speech into normal speech: we are using a deep denoising autoencoder to convert whispered speech into normal speech, and then we run a normal ASR. The benefits are that you don't need a lot of data and you can use a normal ASR, but the accuracy is not as good; it can be good to start with. Eventually, approach B is better: you take a full-fledged ASR, and once you have enough data, you do AM adaptation, acoustic model adaptation, and you can actually build a proper whispered ASR. So both approaches are important; both need to be done, depending on your situation.
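As a crude illustration of the spectral cue described above (less low-frequency energy in whispered speech), here is a sketch. The 1.5 kHz band split and the 0.25 threshold are illustrative assumptions, and the real system uses a trained model rather than a single ratio:

```python
import numpy as np

def low_band_energy_ratio(audio, sr=16000, split_hz=1500):
    """Fraction of spectral energy below split_hz for one audio clip."""
    spectrum = np.abs(np.fft.rfft(audio)) ** 2
    freqs = np.fft.rfftfreq(len(audio), d=1.0 / sr)
    low = spectrum[freqs < split_hz].sum()
    return low / (spectrum.sum() + 1e-12)

def looks_whispered(audio, sr=16000, threshold=0.25):
    # Whispered speech tends to carry less energy in the low bands,
    # so a low ratio is a (crude) cue for whisper vs. normal speech.
    return low_band_energy_ratio(audio, sr) < threshold
```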
So that is the typical approach for whispered ASR, and we're doing a lot of work in that area. One of you asked about preprocessing. Preprocessing is very important because there can be a lot of different noisy conditions when you're speaking. Let's say you have a receiver, a microphone, in your TV: you can have high echo, because the TV's own audio output is already playing. Similarly, at home you can have high ambient noise from a vacuum cleaner and so on. When you have an AI speaker, you have lots of noisy environments, and similarly in a car. And these days speakers themselves may have multiple microphones; for example, one of our speakers has eight microphones. So there are a lot of noisy conditions, and we do need a lot of preprocessing: there's acoustic echo cancellation, there's noise suppression, and there's some more front-end processing that happens.

One of the areas we are working on here is neural beamforming: when you have multiple microphones, how do you take all those outputs and channel them into one output, using deep learning techniques? The second is direction-of-arrival estimation: if you have multiple microphones, can you detect which direction the sound is coming from? In a robot environment this is very important: the robot is in a corner of the room, you're speaking from over here, and the robot needs to understand where the sound is coming from. So this feature has become very important. (Yes; that is an area of research I won't be able to go into here, but we have techniques such as acoustic beamforming, which I talked about; direction of arrival I haven't covered here, but there are a lot of papers in that area you can read.) So acoustic echo cancellation is very standard, noise suppression is very standard and has been done for years, and acoustic beamforming we are trying to do using neural networks, sort of a three-layer network, nothing very big. So preprocessing is another very important area; sometimes it gets ignored, but it's extremely important.

The next topic is wake-up. As I mentioned, wake-up is usually multi-stage. There is something called a keyword spotter that always runs on the device, because even if the screen is off, the device has to keep listening; whenever you say "Hi Bixby," it still has to wake up. The keyword spotter is a very low-end model, and the good thing is that it always wakes up, but there can be a lot of false alarms. So to handle that, we need another stage of wake-up verification, and then maybe another stage of verification on the server. Typically the keyword spotter and verifier one run on the device, and verifier two runs on the server; a small sketch of such a cascade follows.
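A minimal sketch of that cascade; the scorer interfaces and thresholds are placeholders, with the real stages being an always-on DSP keyword spotter, a larger on-device verifier, and a server-side verifier:

```python
# Multi-stage wake-up: each stage is a callable returning a confidence in [0, 1].
# Thresholds below are made-up; tuning them trades false alarms vs. misses.
def wake_up(frame, spotter, verifier_1, verifier_2=None):
    if spotter(frame) < 0.5:            # stage 1: cheap, always-on, high recall
        return False
    if verifier_1(frame) < 0.8:         # stage 2: bigger on-device model,
        return False                    # rejects most false alarms locally
    if verifier_2 is not None:          # stage 3: server-side check, so audio
        return verifier_2(frame) > 0.9  # leaves the device only rarely
    return True
```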
This multi-stage wake-up achieves the best accuracy at a low power consumption. So that's the wake-up technology we are working on. We are also looking at end-to-end models: just like end-to-end ASR, you can have an end-to-end wake-up system, and we are looking at that. We're also looking at combining wake-up with preprocessing, because when the two are combined they sometimes give better output. And of course user-defined wake-up words: instead of saying "Hi Bixby," you could register your own wake-up word and wake the device with that. And simultaneous wake-up in a multi-device environment: you can have many different types of devices, and you want only one of them to wake up, not all of them. Things like that are very important and have to be looked at. As I mentioned, far-field is very important too: the device could be 10 meters away and it should still wake up, even with ambient noise and so on. And you don't want false alarms, because a false alarm means the device wakes up when you don't want it to. That's bad, because it means your data is all going to the server; it's a loss of privacy. You really want the device to wake up only when the user wants it to. This is a very hard problem; you may have seen a lot of news about voice accidentally going to the server and issues like that. So this is a big area of research for us. And it's a multi-stage process: there is a first level of wake-up that is always on, then a second level that is more accurate, and finally a third stage on the server.

As I mentioned in the beginning, there is start point detection and end point detection. Start point detection simply detects when you start speaking; end point detection detects when you stop speaking. End point detection is also very important because when users are speaking, they may pause in between. Let's say you're trying to book tickets: you say, "get me movie tickets for..." and then stop to think of the movie name. So you may pause in between, and contextually, based on the situation, the system has to understand that. There are machine-learning-based techniques to figure out when the user has actually stopped speaking; this is called contextual end point detection, and it can be based on acoustics or on other signals. We have to differentiate an in-sentence pause from an end-of-sentence pause: the user may pause without really intending to stop. It's very important because late detection increases latency, and early or late detection can also increase the word error rate. So end point detection has to work properly. There are a lot of techniques for it, and I'm not going to go too deep into them, but this is another area where we are working, using acoustics as well as other signals; a small sketch of the pause logic follows. Partial hypotheses are another input: from the partial hypothesis we can figure out which domain it is.
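A crude sketch of the endpointing decision just described, using trailing silence plus the partial hypothesis as context; the timeouts and the heuristic rule are illustrative assumptions only:

```python
# Contextual endpointing: allow a longer pause when the partial hypothesis
# suggests the user is mid-utterance (e.g., still naming the movie).
def should_endpoint(trailing_silence_ms, partial_hypothesis):
    if partial_hypothesis.rstrip().endswith(("for", "to", "and", "called")):
        timeout = 1500  # ms: likely an in-sentence pause, wait longer
    else:
        timeout = 700   # ms: likely end of sentence, close early for latency
    return trailing_silence_ms >= timeout

print(should_endpoint(900, "book movie tickets for"))   # False: keep listening
print(should_endpoint(900, "what's the weather today")) # True: endpoint now
```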
So these are the research areas we at Samsung are looking at. First is end-to-end ASR, end-to-end ASR with a very low computational cost so that it can run on the device. Second is far-field ASR, which is very important because, as I mentioned, you have speakers and other devices at a distance, with reverberation effects and noise artifacts; how do we make that work? And you have multiple devices that can wake up: sometimes only one device should wake up, but the other devices can still listen, so they can help each other, and you have to see how that all works. Neural beamforming I talked to you about, and direction-of-arrival estimation, and whisper ASR, whisper detection and whispered speech recognition. On-device ASR, which is basically ASR running on the device, and contextual EPD. So these are typically the areas that we are working on, but we are also looking at speaker diarization, speaker identification, spoofing, and some other areas. That's all I have, and I have some references here. I'll be putting these slides on the website, and you can go through them. Any questions? Yeah, go ahead.

Q: As of now, most of the audio interactive applications available, Siri or whatever, listen, do speech-to-text conversion, then apply NLP; they apply HMMs, and with the resulting text they do text-to-speech conversion. But maybe I was not able to figure out whether you are also following the same.
A: Yes, we follow the same. I talked only about the speech-to-text part; I didn't talk about the rest. Typically the system will have NLU in between, then action planning, and then finally there is NLG, natural language generation, and then text-to-speech. I didn't talk about all the rest; I only talked about ASR.

Q: Okay, thank you for that. And one more: you talked about wake-up words. After waking up, the service processes whatever audio comes in as text and gives the appropriate output. But whether it's Alexa, Google, or, let's say, Bixby, why are you forcing the user to use a specific wake-up word? Why can't the user configure one?
A: User-configured wake-up is happening, but the accuracy is not very good yet. That's the reason: we don't want it to get confused with some other word. If we give you the choice, you will put in anything you want, and it can conflict with something else, and we don't want false wake-ups. It's being researched, and when the accuracy improves, we can launch it. But the point is that wake-up always has to work: remember, even when the device is in a low-power or dormant state, the wake-up always has to work. And that's the reason we use a wake-up word: we don't want to send all the audio to the cloud all the time; we only want to send it when the user intends to. That's the need for a wake-up word.

Q: In the case of images, all the feature detection is done in the spatial domain. In the case of speech, are you detecting the features in the time domain or the frequency domain?
A: It's the frequency domain.
Q: So all the features come from the frequency domain. Okay, thank you for the details. Now, these are general-purpose applications, right? Google, Microsoft, Apple, Samsung. What are the challenges in getting together and coming up with one system rather than reinventing the wheel? I'm sure all four are reinventing the wheel. What are the challenges in coming together?
A: It's very hard; that's how the industry works. Look at databases, for example: you had Oracle, you had Informix, you had IBM, and at some point there were mergers; Informix, where I used to work, got acquired by IBM. That's how things are. We were all working on the same thing, all relational databases, and each one was different: there was IBM DB2, there was Sybase. I'm talking about 20 or 25 years back. It's very difficult in industry. There is research happening: when we have research, we all publish papers, and that is all open, so there is a lot of work where we refer to Google's research or Google refers to ours. At that level, in the academic community, all the efforts are aligned. But in the industry, it's not like that.

Q: Hi, I have a couple of small questions. The first one is more technical: you mentioned using attention in the models you are building. In your experience, is the transformer architecture better, or does a self-attention sort of thing with the PGN work better?
A: We found self-attention better.

Q: And the second question: there's been a lot of talk about how Anki is doing some really good work with the speech recognition in their bots (Anki, A-N-K-I; they make these small bots), but they've refused to release any of their work in the public domain as research. At the point where we are in NLP and speech recognition, do you believe this is something all companies should come together on and release research?
A: Yes, I believe so. We should, but it's very hard to force somebody.

Q: More as a community sort of thing.
A: As a community, yes, we are encouraging that. There are already communities in speech trying to promote this; ISCA is one such community, and we are organizing many seminars and conferences to promote it, with industry participation as well. And speech research is happening a lot in India itself: Interspeech, the leading speech conference, was held in India in 2018, and we were the diamond sponsors.

Q: I was thinking more in the sense of collaboration across companies on industrial research, something like that.
A: I think there is still a need. For example, Indian languages: that's one area where all the companies could collaborate, because there is a shortage of data. And these days there is a big need in India for speech, because it's not enough to have English only; English can cover only a small part of the population.
You need all the regional languages, and people who are not literate can use speech; that's very, very important. So I think the companies should come together there. There are some efforts in that direction, but it's not easy.

Q: Thanks. Thanks so much for the talk. Hello, over here. We have been using Alexa at home, and the primary problem we face is that whenever music is playing and you try to pause it and switch to something else, you say "Alexa, stop," and it hardly works properly.
A: Yes, that's exactly because of the reason I told you: there is noise happening. When the music is playing and you say something, the signals get mixed. That's where things like beamforming can help, because the system can understand which direction your voice is coming from versus where the music is coming from, along with a lot of noise suppression and those kinds of techniques.

Q: Even if it is playing on the same device, the same Alexa?
A: Same device; then it is an even bigger problem. That's what I told you in the beginning: if you have a TV, for example, or a speaker, you have high echo, and this echo is a problem.

Q: Hi, over here. You talked about speech-to-text. There's a second part to these systems, where there is an intent and the system takes an action from the intent. In Bixby, what is the interaction between that and the intent system? For each intent, are you doing speech-to-text separately? Because it's slightly different, right?
A: We have an NLU system. In Bixby 1.0 we used deep learning: a mixture of CNNs for, you could say, domain identification and intent detection, and RNNs for the slot extraction; that was deep-learning based. In 2.0 we are not using deep learning; we are using some other techniques. Basically, we have an NLU in place, and after NLU you need some kind of action planning as well, and there is also a kind of knowledge graph; all of those things work together, but today I focused only on speech recognition. Right now speech recognition is separate: the speech system just outputs text, and the text is given to the NLU. That's how things work; there are very few cases where speech and text are all fused together. It really doesn't work that way. And of course, TTS happens at the end: once the NLU does its job and figures out what to tell the user back, there is natural language generation, and then TTS takes that and produces speech.

Q: Okay, thank you.
A: I'll be here so you can.