And we are actually one minute ahead of schedule, right? This never happens, so the organizers are doing a great job. And I think they've also scheduled the talks well, because you heard Dr. Anand's talk in the morning, where he did a deep dive into how speech recognition happens, and I just spoke a little bit more about some algorithms for image, video and audio processing. I'm sort of continuing that conversation, where I'll talk a little bit about audio processing, but also about how we understand what the user is saying and what the intention is. So I think there's a nice flow between those three talks.

Before that, just to quickly set context: I spoke a little bit about how humans have evolved over the last millions of years, but let's talk a little bit about how our interactions with tech have evolved. If you really look at it, this is how we have interacted with personal tech. Personal tech sort of started in the mid-70s, where you had small, I don't know if you can really call them personal computers, but small terminals, right? Black and white monitors with just a keyboard, and you interacted via character mode, not unlike the terminal we use now while coding, but that was all there was. Some ten years later, in the mid-80s, things changed. We moved to a graphical user interface. Companies like Microsoft and Apple and Xerox came up with visual operating systems, so you could actually see a file, click on a file, and move this piece of hardware whose movement translated onto the screen, the computer mouse, which didn't exist before. Sometime in the mid-90s, things changed again. We moved to the web paradigm, and how we interacted with tech changed again, because you had early versions of browsers. It's a fairly young crowd, but I'm sure a lot of you remember early versions of Netscape Navigator and Internet Explorer, where it took forever to load a static website, but there was a thrill in actually seeing that. And how we interacted with tech changed: you had radio buttons, drop-down menus, browser components that came into the picture. In the mid-2000s came the era of mobile, and this is something I think we are all familiar with. Even in the 15 or 20 years that we have had the mobile phone, things have changed so much. We have gone from really clunky, basic phones with small keypads and black and white screens to the advanced smartphones we have right now. Just think of the sort of interactions you have with your smartphone right now: how do you zoom in on a webpage? You pinch to zoom, or you swipe right, or you pull down to refresh. These interactions didn't exist before. So as tech evolves, so does how we interact with it. As you can see, almost every ten years or so, there's a change in how we interact with tech.

So, as if right on schedule, sometime in the mid-2010s things changed again, and the paradigm shifted to voice user interfaces. Now, we as humans have been using voice as a means of communication for a really long time, about 100,000 years. It is an integral part of how we communicate. But surprisingly, our tech hadn't reflected that up until then. People had tried, but it was simply not technologically possible.
Now, we finally have access to the computing power, we have access to the networking power, and we finally have access to the data that allows us to talk to machines via voice. So we at Amazon truly believe that voice represents the next major disruption in computing. You probably hear this word a lot, disruption: when something changes the status quo, or a new platform or product really changes how we operate or how we behave. And we believe that in computing, voice is the next major disruption. We of course saw this disruption happen, and we came up with this beautiful device here called the Amazon Echo. Essentially there's a line of hardware products called the Echo, there's the Echo Dot, Echo Plus, et cetera. There are also third-party devices, and all these devices are really powered by the Alexa cloud. These devices, just in case there are people who don't know, let you voice-control your world. Everything from switching on your smart light to getting the news to playing music, all the things you typically do on your phone, you can now do just via your voice; they're completely voice controlled. The great thing about these devices is that most of them come with either a five-microphone or a seven-microphone array, which enables something called far-field recognition. So I don't have to have the device right next to me when I talk; I can stand across the room with ambient noise and still interact with these devices.

Similar to how your phone can do a few things out of the box when you buy it, like make calls, send SMSs and take photos, what's the first thing that you do? You download apps. Similarly, you can enable skills. A skill on Alexa is like an app on your phone: it enhances the capabilities of Alexa. There's a lot you can do with an Alexa device out of the box, but you can really extend its capabilities via skills. You can order an Uber cab, for instance. You can play a song via Saavn. You can get a restaurant recommendation via Zomato, and so on. And similar to the mobile app ecosystem, the skill ecosystem is completely open, so third-party developers can build skills on their own, and there's a skill store where you can publish your skills and anyone else can actually use them. By the way, I'm doing a workshop on how to build skills this Sunday as part of ODSC, so if anyone's interested, please sign up; quick plug there.

But that was just to set context. Today we're going to talk a little bit about how Alexa actually works and how machine learning and data science play a huge role; they are vital to how Alexa actually works. And I'm going to talk a little bit about the sort of customization we had to do for the Indian context. Alexa launched here earlier this year, and there are so many things unique to India and how we speak that we had to do a few customizations for the Indian market. Before that, though, like I said, machine learning plays a huge role in how Alexa works. I do a lot of these events, and recently someone asked me this question: if I buy this device now, when will it get outdated? When do I have to buy the next version? And at the risk of sounding a little markety: you don't, because all these devices are cloud based. As long as you're happy with the quality of the speaker, all the enhancements are happening in the cloud.
So soon Alexa will be able to understand more and more: more accents, more words, more context, more languages, while the device just serves as the conduit for input and output. The cloud really is getting smarter, thanks to all the machine learning and the breakthroughs that are happening in speech recognition and natural language understanding.

Cool, so let's dive into how some of these things work, and you will see this slide quite a bit today. Essentially, this is how we use data in machine learning. It's highly simplified, of course, but we split it into two phases: a training phase and an inference phase. In the training phase, and this is, say, before the Alexa device was launched, we put a lot of training data for speech recognition, for natural language understanding and so on, and we'd have a ground truth. A ground truth is basically something empirical that we know to be the truth. We'd put all of that through a trainer and get certain models. Then in the inference phase, when a user comes and interacts with Alexa, that's the input, and the decoder infers the output. And if your training data is very good, obviously your decoder is going to be good, and you're going to get really good results from the speech recognition and the natural language understanding. I'll get back to this slide a little later with some specifics.

Typically, though, this is how Alexa actually works, if you really break it down. On the left is the device, and those are all the components on the right that you see. You have ASR, which is automatic speech recognition. You have NLU, which is natural language understanding. You have skills, I just told you what a skill is, and you have the text-to-speech component. Now, each of these four components has a lot of machine learning and data science that goes into it to make sure it works well. So suppose a user says something like "Alexa, play music", a very generic statement. First, the device listens for something called the wake word, and here's where I really want to stress that almost all of the magic happens in the cloud. The hardware device in itself doesn't do too much; it just listens for the wake word, which in most cases is the word Alexa. You'd be surprised how much data science and machine learning goes just into the wake word engine, because it has to accurately pick up "Alexa", and someone in India versus someone in the US is going to say the word Alexa differently. It's something I didn't expect, but there's actually a lot of science that goes into that one small thing. So there's a wake word engine that listens for the word Alexa, and once the user says Alexa followed by an utterance, whatever the user said is pushed to the cloud. In the cloud, a few things happen. First, you obviously need your speech to be converted to text, which is where ASR comes in; that's the first step right there. Once that's done, you need to understand the intent behind what the user said, which is basically what natural language understanding is. So that happens, you get the intent. Then you need to push that to a skill, which could be an inbuilt skill or a third-party skill, for you to actually execute the command or respond to the user. And then finally, that text has to be converted back to speech, because we are in a voice-first world, so the user needs to hear the answer, or maybe the song; in this case, when they say play music, they need to hear a certain song.
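To make that round trip concrete, here is a minimal, purely illustrative sketch of the flow just described. Everything in it, the function names, the keyword check standing in for the wake word engine, the hard-coded intent logic, is a stand-in of my own, not the actual Alexa components.

```python
# Toy sketch of the request flow described above: wake word on the device,
# then ASR -> NLU -> skill -> TTS in the cloud. All names and logic here are
# illustrative stand-ins, not the real services.

def detect_wake_word(audio_text: str, wake_word: str = "alexa") -> bool:
    # The real wake-word engine runs on-device over raw audio; this toy
    # version just checks the transcript for the wake word.
    return audio_text.lower().startswith(wake_word)

def asr(audio_text: str) -> str:
    # Stand-in for automatic speech recognition: audio in, text out.
    return audio_text

def nlu(text: str) -> dict:
    # Stand-in for natural language understanding: text in, intent + slots out.
    if "play" in text.lower():
        query = text.lower().split("play", 1)[1].strip()
        return {"intent": "PlayMusicIntent", "slots": {"query": query}}
    return {"intent": "FallbackIntent", "slots": {}}

def route_to_skill(interpretation: dict) -> str:
    # Stand-in for skill routing/execution: intent in, response text out.
    if interpretation["intent"] == "PlayMusicIntent":
        return "Playing " + interpretation["slots"]["query"]
    return "Sorry, I didn't get that."

def tts(response_text: str) -> bytes:
    # Stand-in for text-to-speech: response text in, audio bytes out.
    return response_text.encode("utf-8")

def handle_utterance(audio_text: str) -> bytes:
    if not detect_wake_word(audio_text):
        return b""                                 # device keeps listening
    text = asr(audio_text)                         # cloud: speech -> text
    interpretation = nlu(text)                     # cloud: text -> intent + slots
    response = route_to_skill(interpretation)      # skill produces a response
    return tts(response)                           # cloud: response text -> speech

print(handle_utterance("Alexa play music"))
```

In the real system each of these stubs is a large machine-learned component running in the cloud; the point of the sketch is only the order in which they are chained.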
So let's take each of these steps, and let me tell you how data science and machine learning play an important role in them. I'll largely be concentrating on the NLU and the skills parts of it today, though.

With ASR, again, I spoke a little bit about automatic speech recognition; there are two phases to it. One is before launching the device: how do you really make it accurate, given that there are so many different accents, so many different dialects and so many different ways of saying things? So we did something called active learning for spoken language understanding, which is a sort of semi-supervised way of learning. Essentially, we had some existing speech-to-text models, and before the device was launched, we put a lot of data into them. Because it was semi-supervised, the model essentially looked at confidence scores, and if the confidence score was below a certain number, it would send that utterance for human intervention; humans would listen to it and actually transcribe the text, doing speech-to-text for a large amount of data. This was, of course, done before we launched the device in the US, which was a few years ago, but before we launched the device here we had to follow a similar exercise as well. The thing is, the model itself figured out which utterances were important to get human transcriptions for: if the confidence score was below a threshold, meaning the model wasn't able to accurately get that speech-to-text, a human would listen to it and transcribe it. This way, and remember I said there's a training and an inference phase, we were able to train on a large set of data without having to do too much manual work.

Once that was done and everything was set, we have to rely on ASR to get accurate speech-to-text at run time. Speech-to-text is a mostly solved problem, but there are times when the phonemes might look like this: someone says "forty times". So what could it mean? How do you accurately get what the user said? It could mean the user said "forty times", as in doing something the number forty times. There's a chance the user meant "four tea times", the phonemes are the same, and this is a reference to chai. Maybe the user meant "four tee times", to tee off, as in playing golf. Or there's a chance the user meant something else entirely that sounds the same. There are so many cases like this, especially in spoken language, when people are trying to be conversational, where a phrase or a word might sound very similar to another. So how do you differentiate what the user said? What happens in this case is Alexa actually looks at the training data that a developer provides in reference to a skill. For example, if I'm a developer and I've built something to do with tea and coffee ordering, and a user says "Alexa, open the skill" and then says "forty times", there's a good chance they meant "four tea times".
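As an illustration of that idea, here is a toy sketch of re-ranking ambiguous ASR hypotheses against a skill's sample utterances. The sample phrases, confidences and the token-overlap boost are all invented for the example; the production system uses far more sophisticated models.

```python
# Illustrative only: a toy re-ranking of ambiguous ASR hypotheses using a
# skill's sample utterances as a prior. Scores and sample phrases are made up.

def token_overlap(a: str, b: str) -> int:
    return len(set(a.lower().split()) & set(b.lower().split()))

def rescore(hypotheses, skill_samples, boost=0.2):
    """hypotheses: list of (text, asr_confidence). Returns the best text after
    boosting hypotheses that overlap with the active skill's sample utterances."""
    rescored = []
    for text, conf in hypotheses:
        best_overlap = max(token_overlap(text, sample) for sample in skill_samples)
        rescored.append((conf + boost * best_overlap, text))
    return max(rescored)[1]

# The three readings of the same phonemes from the talk:
hypotheses = [("forty times", 0.34), ("four tea times", 0.33), ("four tee times", 0.33)]

# Hypothetical sample utterances for a tea-and-coffee-ordering skill:
tea_skill_samples = ["order tea", "four tea times", "one coffee please"]

print(rescore(hypotheses, tea_skill_samples))   # -> "four tea times"
```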
So that is the one extra step we do to accurately match, especially in cases like this, where a set of phonemes could mean two or three different things. Once that's done, you need to actually do the natural language understanding. But before that, let's talk a little bit about some of the things we had to do specifically for the Indian market, because there are so many words and phrases that are local to India. For instance, back in the US, not too many people actually ask who the prime minister is, because that designation doesn't exist there, so we had to feed in that data; and there's a tendency for an Indian user to pronounce a phrase like "prime minister" quite differently, so all of these variations actually had to be fed into the system. The same goes for food names and festival names, and locality names, Malleshwaram for instance, were very difficult, but thanks to all the machine learning and data we've fed in, it's now able to recognize things like that. There are a few more complicated things. People say things like "remind me to go to Mookambika Devasthanam". That's not going to get matched 100% just yet, but luckily, with some of the algorithms that are there and with repeated usage, with users saying things like this, we're going to get to a place where we can recognize that accurately.

Cool, so that's done. The speech is converted to text, but now how do you really understand what the user is saying? Yeah, so there was a question from the audience: his point was that intent can be expressed not just via voice but also via gestures and things like that. For this talk, I'm going to be talking only about voice. Amazon does have something called the Echo Look, which is a camera that helps you with what to wear, and that is basically going towards the whole multimodal, ambient computing thing, so in the future it'll definitely be a part of it. For this talk, though, I'm just going to stick to the voice aspect.

Cool, so I'm going to use one specific example. Does anyone know where this is from? Thriller, yes, thank you. This is Michael Jackson's Thriller, and I'm going to use this specific example for the next few slides. One of the greatest music videos ever. So suppose someone says something like "Alexa, play Thriller by Michael Jackson". The ASR does its work fairly easily; it picks up what the user said, and now it goes to the NLU, the natural language understanding. So what actually happens? Right now, it works like this: if someone says "Alexa, play Thriller by Michael Jackson from my 1980s playlist", there are a few things happening. Alexa basically picks out three types of things. The first one is the domain. This is clearly to do with music, so the domain is music. The second thing it picks out is an intent. For those of you who might not be familiar, think of an intent as the meaning or the intention behind what the user said; the intention here is to play a certain track, so Alexa picks out the play music intent. And then it picks out the third thing, which is a slot. A slot is like an entity; think of it as a variable in the utterance.
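Written out, that flat representation for the Thriller utterance looks roughly like this; the exact domain, intent and slot names are illustrative rather than the precise internal labels.

```python
# A rough sketch of the flat domain/intent/slot representation described above,
# for "Alexa, play Thriller by Michael Jackson from my 1980s playlist".
# The label names are illustrative.

interpretation = {
    "domain": "Music",
    "intent": "PlayMusicIntent",
    "slots": {
        "MusicRecording": "Thriller",
        "Musician": "Michael Jackson",
        "MusicPlaylist": "my 1980s playlist",
    },
}
```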
Because this could easily be something like "play Parachutes by Coldplay from my 2000s playlist", right? There are three variables in this case: one is the name of the song, which is a music recording, another is the musician's name, and the third is the playlist. So essentially you need your domain, your intent and your slots, and that's what the Alexa cloud picks out. This is a flat intent and slot schema. It's fairly easy to represent, and it has worked well for us so far; I'm guessing a few of you have Alexa devices, and for statements like this it works very well. But there are certain problems with it.

For instance, the exact same utterance could mean something else, because there is a video of that same song. So if someone says "play Thriller by Michael Jackson from my 1980s playlist", there's also a play video intent that could possibly be activated, and of course the slot values also change accordingly, because Thriller now goes from music recording to music video, Michael Jackson goes from musician to actor, and the 1980s playlist goes from a music playlist to a video playlist. That might be manageable when you have very few domains, say music, smart home, weather, games, maybe eight or nine domains. But eventually we want to be able to scale up to a lot more domains, where users can speak naturally to Alexa and Alexa can actually respond.

There are a few more challenges as well, and those of you who have used Alexa a lot might have seen them. One is that a schema like this doesn't have any shared semantics: there are no shared semantics between the play music intent and the play video intent, which can be a problem once you're really scaling up. There are also problems from a usability point of view. For instance, if someone says something like "find me restaurants near the IPL game", the intent here is to find restaurants, but it's really two levels of intents: first you need to find where the IPL game is, which is a location sort of intent, get that, and then push it into the find-restaurants intent. So there are two intents right there. The second one is a little more straightforward: "play Harry Potter and turn the lights to dim". That's straight up two unrelated intents, but the user has said them in one line. Now, if someone says "Alexa, dim the lights", the lights are dimmed; if someone says "Alexa, play Harry Potter", the Harry Potter movie is played; but if someone says "play Harry Potter and turn the lights to dim", it gets a little complicated. The third one is a sort of holy grail, I think, in conversational interfaces, which is negation. If someone says "find me running shoes, not by Nike", it's a little tricky for a conversational interface, for the machine learning, to go and look for, say, Reebok, Adidas and Puma, because what it's going to do is see "find me running shoes", so that's a shopping intent, look for a slot value, which is Nike, and then show you Nike shoes. So it's very difficult, and maybe for this specific example you can define utterances that pick it up, but as a human you can really fool a system by using negation and things like that. And lastly, there's something as simple as "play Thriller".
Because Thriller can be a song, Thriller can be a video, maybe it's the name of a book, an audiobook, so many different things, and then you're sort of hoping the system happens to get it right rather than smartly figuring out what you meant. So to change that, we came up with something called the Alexa Meaning Representation Language. This was presented just a couple of months ago at NAACL 2018; it's a paper that has just been presented, and you can see the link down there. The big difference is that it doesn't really represent things in terms of flat intents. It's more about actions, and about creating a factored graph between all these things. So let me show you how it works.

When someone says "play Thriller by Michael Jackson from my 80s playlist", the first step is to pull out the slot values; that still remains the same. So you're going to pull out a music recording, a musician and a music playlist as the slot values. But here you're replacing your intent with something called an action, like a playback action. Note that we haven't defined whether it's playback music or playback video, et cetera; we've just defined a playback action. Now, each action can take an argument, and in this case the music recording is the argument that the action is taking. Similarly, the slot values can associate themselves with properties as well: the music recording slot value is associated with the musician and the music playlist via properties called byArtist and inPlaylist. So as you can see, you're essentially drawing a factored graph, and you use this factored graph as the representation of what the user said, as opposed to a flat intent schema. The difference from the earlier approach is that earlier we used very fine-grained types, where it was specifically a play music intent; here you're using very coarse-grained types, it's just a playback action, and we don't know if it's audio or video just yet.

The thing is, this is rooted in something called the Alexa ontology, which essentially extends schema.org, so the great thing is that it's compatible with all the other websites and systems that use schema.org. And this is a small representation of how it looks: you have a thing, which can be an action, a person, a place, or a creative work; under creative work you can have a movie, a music recording, a book, a comic, whatever. So that's the sort of ontology that is followed, and it takes care of a lot of things, including type ambiguity. Remember earlier when I said "play Thriller" could be problematic, given that Thriller can be a video or a book or a song? If someone says "play Thriller", the first thing is the action, which is a playback action, and it's associated with a creative work whose name is Thriller. There could be an ambiguity there in terms of whether the creative work is a movie or a music recording. That's fine; I'll show you later how that's actually handled. The good thing about this AMRL, the Alexa Meaning Representation Language, is that you can reuse it across domains as well.
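Here is a rough sketch of what such a factored-graph representation might look like for the same Thriller utterance, written as plain Python data. The class and property names approximate the ones described above (playback action, byArtist, inPlaylist) but should not be read as the published schema.

```python
# A sketch of the factored-graph (AMRL-style) representation described above,
# for "play Thriller by Michael Jackson from my 1980s playlist".
# Coarse-grained action plus typed objects linked by properties; the exact
# class and property names here are approximations, not the published schema.

graph = {
    "action": {
        "type": "PlaybackAction",      # coarse-grained: not yet audio vs. video
        "object": "node1",             # the action takes one argument
    },
    "nodes": {
        "node1": {
            "type": "MusicRecording",
            "name": "Thriller",
            "byArtist": "node2",       # property linking to the musician
            "inPlaylist": "node3",     # property linking to the playlist
        },
        "node2": {"type": "Musician", "name": "Michael Jackson"},
        "node3": {"type": "MusicPlaylist", "name": "my 1980s playlist"},
    },
}
```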
So here's another example that is tricky in the current scenario: a user says "turn on the song Thriller by Michael Jackson", versus the user saying "turn on the living room lights". As you can see, the action associated with both, because the utterances start with "turn on the", is an activate action. It's just that they take arguments that are different objects. In the first example, Thriller by Michael Jackson, the object is a music recording, and of course it gets the name of the song, the name of the artist, et cetera. In the second case, the object is lighting, so you get the room, what type of lights, and so on.

Cool. So that's been done: we now have a way to represent what the user said. How do you actually, accurately execute what the user said? We do something, and again, this was just presented at NAACL very recently, called HypRank, or the hypothesis reranker. So assuming someone said just the phrase "play Michael Jackson": since we said play, it's a playback action, that's the action, and I've drawn a dotted line there after playback action because it creates the rest of the ontology. Once that's done, though, you need to get the best skill to actually respond to this. How do you do that? The hypothesis reranker does two things. The first is called shortlisting, where it finds, in the most efficient way, the k best skills that could actually execute this; there are probably a lot of skills that would be able to execute the command "play Michael Jackson", so it finds the k best ones. And then, using some contextual signals, which I'll talk about, it picks with high precision the one that should actually execute this command.

How the hypothesis reranker works is quite cool. It's a neural net, an RNN. Dr. Anand spoke earlier about LSTMs, long short-term memory networks, and that's what it uses. Essentially there's a set of hypotheses for the user's utterance, and with user utterances you need bidirectional LSTMs, because you need context on both sides: if a sentence is six words long, just knowing what the fifth word is doesn't make sense; it has to be in the context of the four words before it and the one word after it, and with bidirectional LSTMs you can do that. So the hypothesis reranker uses these LSTMs over each hypothesis to say, hey, this skill is the best one to execute what the user said. For instance, when someone says "play Michael Jackson", there could be three skills that could execute this. We know Michael Jackson as a pop music star, so maybe it's that. There are a bunch of Michael Jackson videos as well. There's also a classical musician, not related to the famous person we know, named Michael Jackson. Using these contextual signals, which could be everything from your usage over the last seven days, to the domain, to popularity, to the device you own, it'll pick the skill that best serves what the user said. For instance, if I have an Echo without a screen, the simple smart speaker, it'll probably pick the pop music one based on all these contextual signals. There are also Amazon Echo devices with a screen, like the Echo Spot and a couple of others, so if your device is one with a screen, it'll probably pick the video one, because that's the contextual signal there. But essentially, what the hypothesis reranker does is use all these contextual signals to pick out the best skill for you at that moment in time.
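Here is a small, self-contained sketch of the idea, assuming PyTorch is available: encode each candidate interpretation with a bidirectional LSTM, append a handful of contextual features, and score them. The vocabulary, features and hypotheses are toy stand-ins, and the model is untrained, so it only illustrates the architecture, not the real HypRank model.

```python
# Toy bidirectional-LSTM reranker in the spirit of the hypothesis reranking
# described above. All data and feature names are invented for illustration.

import torch
import torch.nn as nn

class ToyHypothesisReranker(nn.Module):
    def __init__(self, vocab_size, embed_dim=32, hidden_dim=32, num_context_features=3):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.encoder = nn.LSTM(embed_dim, hidden_dim, batch_first=True, bidirectional=True)
        # 2 * hidden_dim from the BiLSTM final states, plus the context features.
        self.scorer = nn.Linear(2 * hidden_dim + num_context_features, 1)

    def forward(self, token_ids, context_features):
        # token_ids: (batch, seq_len); context_features: (batch, num_context_features)
        embedded = self.embed(token_ids)
        _, (h_n, _) = self.encoder(embedded)                   # h_n: (2, batch, hidden_dim)
        sentence_repr = torch.cat([h_n[0], h_n[1]], dim=-1)    # forward and backward states
        features = torch.cat([sentence_repr, context_features], dim=-1)
        return self.scorer(features).squeeze(-1)               # one score per hypothesis

# Three candidate interpretations of "play Michael Jackson", encoded with a toy
# vocabulary, plus made-up context features [has_screen, recent_music_usage, popularity].
vocab = {"<pad>": 0, "play": 1, "michael": 2, "jackson": 3, "pop": 4, "video": 5, "classical": 6}
hypotheses = torch.tensor([
    [1, 2, 3, 4],   # pop-music interpretation
    [1, 2, 3, 5],   # video interpretation
    [1, 2, 3, 6],   # classical-musician interpretation
])
context = torch.tensor([
    [0.0, 0.9, 0.8],
    [0.0, 0.1, 0.5],
    [0.0, 0.0, 0.1],
])

model = ToyHypothesisReranker(vocab_size=len(vocab))
scores = model(hypotheses, context)                  # untrained, so scores are arbitrary
print("best hypothesis:", int(torch.argmax(scores)))
```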
Of course, with India, we had to make some customizations to the NLU as well, and I think this was very tricky. For instance, in the US a typical user would say something like "Alexa, play songs by Coldplay", or "Alexa, play songs from Parachutes by Coldplay", where you have an album name and a band name. But here in India, people tend to use movie names and actor or actress names to listen to songs. You typically won't hear American users saying "play Tom Cruise songs", for instance, because it just doesn't happen that way there; but that is how we listen to music here. So we had to change the contextual signals to prioritize accordingly: say, when someone says "Alexa, play Aamir Khan songs", to look for the data associated with Aamir Khan songs and actually play them. There are a few other pretty cool examples, I think. For instance, I think India is the only country where we ask for multiplication by saying "two twos are", yeah? So a lot of this data actually had to go into the systems, so that when Indian kids ask questions like "Alexa, what are two twos", it can accurately answer. A few other things we had to do: in India, when someone says "Alexa, what's the score?", there's a very, very high probability that they mean cricket. So again, you're improving the contextual signals there so that it picks up the skill that can actually give the cricket score instantly. Or during the IPL, we get a lot of questions like "Alexa, how much did Kohli score?" or "what's the Mumbai Indians' score right now?", and we basically had to customize the NLU to really understand this and respond with context.

So far I've spoken a little bit about how the NLU works when a user says something, but what about a skill developer? A lot of you here are developers; you probably want to develop, or have already developed, for Alexa. How does the data that you provide actually help with understanding what the user is saying? So I'm going back to this diagram, where we have two phases, the training phase and the inference phase. Let's look at how you as a developer can provide training data that gets used when the user says something. And again, we're going back to intents, because that's the system we're using right now; AMRL, like I said, is something new that we'll be looking to implement going forward. The training data that you provide as a skill developer is a few things: utterances, a few sample utterances showing how a user would typically interact with your skill. So suppose you're building, say, a weather skill; you have to predict what the user will say. A user might typically say, "Alexa, what's the weather?", "Alexa, how's the weather?", or even something like "Alexa, is it cold outside?", "Is it hot outside?", or maybe "Alexa, do I need an umbrella today?". All of these are essentially different ways of asking for the weather, so you need to give those utterances and map them to the right intents and slots.
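As a sketch, the training data for such a weather skill might look roughly like this, written here as a Python dict; the invocation name, intent name and samples are my own illustration rather than a real skill's interaction model.

```python
# Roughly the shape of the training data a skill developer provides, written
# as a Python dict. The intent name and samples are invented for the
# hypothetical weather skill above.

weather_interaction_model = {
    "languageModel": {
        "invocationName": "my weather buddy",
        "intents": [
            {
                "name": "GetWeatherIntent",
                "slots": [],
                "samples": [
                    "what's the weather",
                    "how's the weather",
                    "is it cold outside",
                    "is it hot outside",
                    "do I need an umbrella today",
                ],
            }
        ],
    }
}
```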
So here's how you can do it. We're taking the example of a skill that helps you find different activities. Looking at the bottom right there, you can say something like "Alexa, ask Travel Buddy", that's the name of my skill, "about surfing in Sydney". Or someone can say "Alexa, open Travel Buddy" and then say "is there good surfing in Sydney?". The intent here is to find information about an activity, so we're going to call it the activity info intent. And there are two slots; remember, I said a slot is a variable. The slots here are surfing and Sydney, which are the type of activity and the city in which you want to find the activity, because that sentence could easily be "Alexa, ask Travel Buddy about bungee jumping in Bangalore", for instance. So you have an activity and a city. Now, typically as a developer, you want to provide some training data like this, and wherever there are curly braces, that means it's a slot value. You would provide, say, five to seven different utterances that a user might say to your skill: something like "{activity} in {city}", "{activity} is in {city}", "good {activity} in {city}", "how is {activity} in {city}" and "is there {activity} in {city}". You're essentially saying, okay, these are the ways my users will talk to the skill.

What happens is, when a user says something, there's a sort of statistical match that takes place. Suppose you don't give enough data; you're being lazy. You say, I have one intent, the activity info intent, where I give "{activity} in {city}" as the only utterance, and you give another intent with the exact same utterance. Basically, one utterance is matching two different intents, as you can see on the top left. So if a user comes and says "Alexa, ask Travel Buddy about surfing in Sydney", it may well match to the wrong intent, because there's a 50% chance it could match either intent; the training data you provided is too little. But if you've done a good job and you've actually provided data like this, just take a look at the examples: the third one says "good {activity} in {city}", the fourth one says "how is {activity} in {city}", and the fifth one says "is there {activity} in {city}". The user has come and said "is there good surfing in Sydney?". Now, as the skill developer, I haven't actually given that exact example in my sample utterances. But because it's not just a deterministic match, it's also a statistical match that looks for the same meaning, it'll accurately pick up what the intent is and what the slots are. So as a skill developer, this is how the data you provide shapes how users can actually interact with your skill.
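Here is a toy illustration of that statistical, rather than exact, matching: score an unseen utterance against each intent's sample utterances by carrier-word overlap. The real NLU is a trained statistical model, not this overlap count; this only mimics the behaviour described above, and the intent names are made up.

```python
# Illustrative only: match a user utterance against the sample utterances of
# two hypothetical intents, rather than requiring an exact string match.

def score(utterance: str, sample: str) -> int:
    # Count overlapping carrier words, ignoring the {slot} placeholders.
    sample_words = {w for w in sample.lower().split() if not w.startswith("{")}
    return len(set(utterance.lower().split()) & sample_words)

intents = {
    "ActivityInfoIntent": [
        "{activity} in {city}",
        "good {activity} in {city}",
        "how is {activity} in {city}",
        "is there {activity} in {city}",
    ],
    "BookActivityIntent": [
        "book {activity} in {city}",
        "reserve a slot for {activity} in {city}",
    ],
}

utterance = "is there good surfing in Sydney"
best_intent = max(intents, key=lambda name: max(score(utterance, s) for s in intents[name]))
print(best_intent)   # -> ActivityInfoIntent, even though that exact sentence was never listed
```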
So we've spoken a little bit about how the ASR works, how the NLU works and how a skill uses data. I'll briefly touch upon the text to speech. Again, it's a little more on the hardware side, not really my domain, and a little bit out of the scope of this conference, but essentially even the text to speech uses some amount of machine learning and data. During text to speech there is some speaker adaptation that happens. Remember, it's not just the Echo devices that use Alexa; Alexa is a cloud service, so there can be third-party devices that use Alexa as well. I know Sonos has a speaker that uses Alexa, and I think Harman has one; there are a bunch of third-party manufacturers that use Alexa. So there is some unsupervised speaker adaptation that happens: given data like device characteristics, speaker accents and things like that, it automatically learns and adapts to be able to do the text to speech as well.

So just to recap: when a user says something like "Alexa, is it hot outside?", like I said, the device does a few things. The signal is processed and it listens for the wake word; all of this happens on the device. Then it pushes the utterance to the cloud, where first ASR, automatic speech recognition, happens. Once that speech is converted to text, natural language understanding takes place. Once you get the intent, the intention behind what the user said, it goes to the skill, where you might actually hit an open weather API, get the weather, figure out the location, et cetera. And then it goes back to the text to speech, and Alexa reads out whatever the answer is. So that was Alexa in a nutshell.

I'd like to cover one more thing; I have ten more minutes, okay. I wasn't sure if I'd have enough time, so I didn't add it to my deck, but I do. There is something pretty cool that we're trying, it's not in production yet, around how you really learn customers' preferences. Let's take the music example: how do you really learn customers' preferences via voice? On, say, Amazon.com, you can learn customer preferences in a variety of ways, by what they buy, what they click on, what they add to their cart, what they've reviewed, and you can weight those signals differently. But this is not a visual medium, it's voice first; typically you're presented with one option, or you ask for one option. So how do you really learn what the customer likes? There's a very cool paper about trying to learn customers' preferences from song duration. Suppose a customer says "play Hello". There are two famous songs with that name: there's the recent famous song by Adele, and for the older people in the crowd, there's the Lionel Richie song from the 80s, I can see a few people nodding, cool, which has the same title. So how do you know which one the customer really wants? Again, this was presented recently with really good data to back it up, but essentially you try to learn the customer's preferences using song duration. What happens is you take this really long, huge matrix where every row represents a customer and every column represents a song, so each element in that matrix represents the customer's affinity to that song. Essentially, you're saying that if the customer listens to the full song, that was probably a song they liked. You make that hypothesis and you apply some sort of weighting. So you say, if a customer has listened beyond, say, 30 seconds of the song, you add something like a plus one. If it's less than 30 seconds, if they've said "Alexa, stop", that's probably not what they meant; that's not the song they wanted to listen to at that point in time, so you add a minus one. Even from the 30-second mark to the end of the song, you can't have a flat line either, because maybe someone said stop after 50 seconds, whereas if someone listened to the song all the way to the end, that probably means you got it right; so you have a sort of weighted curve for that. Essentially you have a hypothesis for every song and customer, but that makes your matrix really, really big, a theoretically almost infinitely big matrix. So what you do is factorize the matrix, bringing it down to something like a 50-dimensional representation. Then, when a customer actually asks for a song and listens to it, you fold that back into the matrix, and that's how you learn the customer's preferences based on the durations of the songs they've been listening to. This was presented very recently, and I thought it was very interesting, because there is no way for a customer to explicitly say, yes, this is the song, or to rate that song at that point in time. I mean, there are ways of looking into your library and so on, but when the customer hasn't done any of that and you have zero explicit data points, how do you infer something like this? So again, this was presented recently and I just thought I'd share it with you all.
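Here is a small sketch of that idea, with made-up data: turn listen durations into implicit plus/minus signals, factorize the customer-by-song matrix into a low-rank approximation, and use the reconstructed scores to break a tie like "play Hello". The 30-second rule and the weighting here are simplifications of what the talk describes, not the published method.

```python
# Toy version of learning preferences from song duration: implicit +1/-1-style
# signals, then a low-rank factorization. All numbers are invented.

import numpy as np

songs = ["Hello (Adele)", "Hello (Lionel Richie)", "Thriller", "Parachutes"]

def signal(listened_seconds, song_length_seconds):
    # Crude weighting: an early stop counts against the song; the further past
    # 30 seconds the customer listened, the stronger the positive signal.
    if listened_seconds < 30:
        return -1.0
    return min(1.0, listened_seconds / song_length_seconds)

# Rows are customers, columns are songs; 0 means "never asked for it".
ratings = np.array([
    [signal(210, 215), signal(12, 240),  signal(200, 222), 0.0],               # customer 0
    [signal(10, 215),  signal(230, 240), 0.0,              signal(150, 160)],  # customer 1
    [signal(215, 215), 0.0,              signal(100, 222), 0.0],               # customer 2
])

# Low-rank factorization via truncated SVD (a stand-in for a real recommender).
U, s, Vt = np.linalg.svd(ratings, full_matrices=False)
k = 2
reconstructed = (U[:, :k] * s[:k]) @ Vt[:k, :]

# Customer 1 says "play Hello": compare the two candidates' inferred scores.
adele, richie = reconstructed[1, 0], reconstructed[1, 1]
print("Adele" if adele > richie else "Lionel Richie")
```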
Cool, so we're almost at the end of the talk. If you want to learn how to build skills for Alexa, I have a workshop on Sunday, but if you want to learn online, there's a great course on Codecademy; just go to alexa.design slash codecademy and you can learn it there. We do a lot of events and meetups and we have meetup groups, so just go to alexa.design slash India. We have Dev Days coming up in Delhi, and the last link is a link to a blog where a lot of these things are explained, the Alexa Science blog specifically. So take a look, for those of you who want a little more depth; the papers are linked there, where you can actually get a lot more detail about all of this.