Some of you know me from my previous session, which was about the difference between prototyping and production. This session is going to be very different: we're going to talk about voice assistants like Alexa and Google Assistant on microcontrollers. Typically, any commercially available voice assistant runs on a Linux-based system, some sort of heavyweight non-RTOS setup. What we've been experimenting with at Espressif, quite successfully, is making this run on $3 and $4 microcontrollers with a couple of megabytes of RAM, a couple of megabytes of storage, and that's about it. The idea is that it becomes so cheap that it gets built into typically dumb things as well. Take a coffee machine: the barrier to adding a voice assistant to it becomes very low, because now it's only a few bucks.

I'm from Espressif, so unlike the previous talk, all the stuff I'm about to show works only on Espressif microcontrollers. The project is on our GitHub, so you can follow along there. About the company: we are a ten-year-old fabless semiconductor company. The 230-employee figure is a little outdated; we are now 260. Most of you will have heard of us because of the ESP8266, one of our most popular chips. The ESP32 is its successor, and it comes with Bluetooth, both Bluetooth Classic and Bluetooth Low Energy.

This is the architecture of the ESP32, which is the chip I'll be talking about. Essentially, what we have is 802.11 b/g/n Wi-Fi and one microcontroller with two cores. There is also a smaller core, the ULP coprocessor, which we use for ultra-low-power work, but we are not using it in this scenario.

The Bluetooth part is interesting because we are building low-cost speakers, which means we want Bluetooth pairing: connecting out to another Bluetooth speaker as well as connecting your phone to this speaker. We have an open-source project on GitHub, the ESP-ADF audio SDK. It sits on top of ESP-IDF, our generic, domain-agnostic SDK with all the low-level drivers, and adds a set of components that bring it roughly to par with a Linux-based system. Of course, we will never have that breadth of functionality, but whatever is required to run audio applications, we can do: the codecs listed here, some of which are easier said than done; Bluetooth, since we already have the radio; AirPlay, DLNA, and more.

Why is this interesting? If you want to build an Alexa or a personalized speaker, as in the previous talk, you would typically go and buy a Raspberry Pi, which goes for $35. But if you really want to go to production and think about cost, a microcontroller is always going to cost less than a microprocessor, because it has its RAM built in.
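To make the component idea concrete, here is roughly what assembling a playback pipeline looks like with ESP-ADF, modeled on the public examples in the GitHub repository: an HTTP reader feeding an MP3 decoder feeding the I2S output. This is a minimal sketch, not production code; the exact config macros and names can differ between ESP-ADF releases, and error handling is omitted.

```c
#include "audio_element.h"
#include "audio_pipeline.h"
#include "http_stream.h"
#include "i2s_stream.h"
#include "mp3_decoder.h"

// Sketch of an ESP-ADF playback pipeline: HTTP reader -> MP3 decoder -> I2S writer.
void start_playback(const char *url)
{
    audio_pipeline_cfg_t pipeline_cfg = DEFAULT_AUDIO_PIPELINE_CONFIG();
    audio_pipeline_handle_t pipeline = audio_pipeline_init(&pipeline_cfg);

    http_stream_cfg_t http_cfg = HTTP_STREAM_CFG_DEFAULT();
    audio_element_handle_t http_reader = http_stream_init(&http_cfg);

    mp3_decoder_cfg_t mp3_cfg = DEFAULT_MP3_DECODER_CONFIG();
    audio_element_handle_t mp3 = mp3_decoder_init(&mp3_cfg);

    i2s_stream_cfg_t i2s_cfg = I2S_STREAM_CFG_DEFAULT();
    i2s_cfg.type = AUDIO_STREAM_WRITER;  // data flows out to the codec/amplifier
    audio_element_handle_t i2s_writer = i2s_stream_init(&i2s_cfg);

    // Register the three elements and wire them up in order.
    audio_pipeline_register(pipeline, http_reader, "http");
    audio_pipeline_register(pipeline, mp3, "mp3");
    audio_pipeline_register(pipeline, i2s_writer, "i2s");
    const char *link_tag[3] = {"http", "mp3", "i2s"};
    audio_pipeline_link(pipeline, &link_tag[0], 3);

    audio_element_set_uri(http_reader, url);
    audio_pipeline_run(pipeline);
}
```

The same element-and-pipeline structure is how the SDK's other features (recording, resampling, Bluetooth sources) compose, which is part of what keeps the RAM footprint small enough for a microcontroller.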
The flash is typically an external component, but it's pretty cheap because you don't need much of it when you're not running Linux. It's a small, constrained system that requires very few passives or other supporting components, which is never the case when you're building a Linux-based system.

Because we have Wi-Fi, and therefore internet connectivity, we can also do a lot of music-related things. All of these music services have their own standards, but at the end of the day it's essentially HLS, HTTP Live Streaming, in some version or another. Not everybody follows standards, which is something we are used to. But essentially, we are fetching music remotely and playing it on our speakers; a sketch of the HLS side follows below.

As for the typical target products: some of them you might already have in your house, and some of them might not be smart yet, but making this cheap makes it a no-brainer to put it inside, say, your alarm clock. If an alarm clock with a digital display costs you $10 and this chip adds $2 or so, you're getting a smart speaker for just a few dollars extra. Would you get one? No, because privacy. But the argument is to reduce the cost of adding smartness, because right now you often say something and no device is within audible range. All of these devices working together, creating a unified experience: that's the goal, and that's honestly what science fiction has taught us. You say something, and something happens. That's how it worked in Star Trek.

Just to give an idea of how Alexa has grown: on the y-axis I have millions of units sold, and on the x-axis the year. Amazon said that in 2018 they had sold over 100 million devices, which is interesting because they started only four years earlier; that's very accelerated growth. They are putting Alexa-enabled devices inside TVs, cars, fridges, and ovens. Beyond the standalone Echo Dot devices, they have managed to put it everywhere, and they are growing aggressively. This is not exactly a tech metric, but I recently read that Google is hiring as many engineers for this as Amazon's Alexa division has. That's the level of aggressive growth happening, and Amazon is honestly not the only one working on this. As more devices come into this field, 100 million is a big number, but to get to the next 100 million there will have to be a lot more integration with other services. For example, something almost unthinkable happened: Amazon and Apple announced AirPlay 2 support in upcoming devices, which is very interesting because Apple has always been cagey about its protocols, and suddenly they are opening up because they see the opportunity.

Okay, now, can I get the mic? What I'm going to show is a quick demo. Hopefully this works, because I'm having issues with the Wi-Fi connection.
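On the HLS point: since these streaming services mostly reduce to HLS in some version, the HTTP reader from the pipeline sketched earlier mainly needs playlist support. A hedged sketch, assuming esp-adf's enable_playlist_parser option (check the current http_stream.h for the exact field name):

```c
#include "audio_element.h"
#include "http_stream.h"

// With the playlist parser enabled, the HTTP stream element can be
// pointed at an HLS/M3U playlist URI and will fetch segments in order.
audio_element_handle_t make_hls_reader(void)
{
    http_stream_cfg_t http_cfg = HTTP_STREAM_CFG_DEFAULT();
    http_cfg.enable_playlist_parser = true;

    audio_element_handle_t reader = http_stream_init(&http_cfg);
    audio_element_set_uri(reader, "https://example.com/live/playlist.m3u8");
    return reader;
}
```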
What I have here is a small, pretty low-cost development board; I think it costs $15 or so, and you can buy it off Mouser and a couple of other websites. I've configured my Amazon account on it and set up a Wi-Fi network. So now what I'm going to do is... hello... okay, I'm definitely having connectivity issues, so I'm going to show a video instead.

To give you an idea of what is possible: this is the module, the same one that's on this board. We've put a speaker behind it and a small battery, and we've managed to shrink the whole thing to the size of an Oreo biscuit, completely self-contained and battery operated. (Can you increase the volume? Okay, the audio is coming from over there.) As you can see, we have a voice assistant recording audio, sending it to the cloud, getting the response back, and playing it on the speaker. And that's not all, because we still have a lot of memory and processing power left after all of that.

Here's one more example; I'll pass the board around later. Along with recording audio, sending it to the cloud, and playing the response, we also do wake-word recognition, which happens completely offline. This is interesting if you're privacy-conscious. You might be thinking, 'I can't see the source code for what is happening inside an Echo.' Well, why don't you get an ESP32 and flash the device with your own firmware? We have the SDK out there, and then it's easier for you to sleep at night.

If you're still more adventurous and want to try something further, in this scenario we have face detection, face recognition, and a voice assistant, all on one ESP32. I'll quickly show the demo. This is a colleague of mine, Amai, and that's me in the background. Both of our faces have been pre-configured, and the training also happens on the device; there's no cloud connectivity involved in that at all. In this case I'm using a Google Dialogflow agent rather than Alexa or Google Assistant, because those are personalities, and I'll come back to that in my slides. If you want to make your own personality, there are services for that; Dialogflow is Google's service for chatbots, covering text as well as audio. So the device recognizes that this is Amai instead of Anuj, and asks Amai what kind of coffee he wants; he said latte. But when I come in front of it, it recognizes that it's me. These are the kinds of things we are enabling by putting face detection and face recognition on the device.
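Going back to the offline wake-word point for a moment: the shape of that loop is simple even though the model itself is not. Below is a minimal sketch; i2s_read is the standard ESP-IDF I2S driver call, while wakeword_detect and start_streaming_to_cloud are hypothetical stand-ins for the esp-sr wake-word engine and the cloud uplink, since the real interfaces live in our GitHub repositories.

```c
#include <stdbool.h>
#include <stdint.h>
#include "freertos/FreeRTOS.h"
#include "freertos/task.h"
#include "driver/i2s.h"

// Hypothetical stand-ins: the real wake-word engine ships in esp-sr,
// and the uplink is the assistant client; these names are illustrative.
extern bool wakeword_detect(const int16_t *samples, size_t count);
extern void start_streaming_to_cloud(void);

#define FRAME_SAMPLES 320  // 20 ms of 16 kHz mono audio

// Assumes the I2S driver was already installed and configured for the mic.
void wake_word_task(void *arg)
{
    int16_t frame[FRAME_SAMPLES];
    size_t bytes_read = 0;

    while (true) {
        // Pull one frame of microphone audio from the I2S peripheral.
        i2s_read(I2S_NUM_0, frame, sizeof(frame), &bytes_read, portMAX_DELAY);

        // Run the offline model; nothing leaves the device until the
        // wake word is actually heard.
        if (wakeword_detect(frame, bytes_read / sizeof(int16_t))) {
            start_streaming_to_cloud();
        }
    }
}
```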
As for speed: face detection happens at six frames per second, and recognition at three frames per second, which is pretty fast, pretty snappy. And again, all of this is open source. If you want to make your own products, you can go to our GitHub, download all of this, get one of our boards, and build it. That's the idea: I don't want to have to say what coffee I want every day, and if there's a product that fixes that, great. I have a couple more demos, but I'll come back to them.

So, the difference between Alexa, Google Assistant, and the others is the personality. Alexa is known to tell jokes. It has skills. It talks in a certain way; it has that sass. But when you're building your own product, you don't necessarily want to be limited by that, or be part of someone else's branding. Plus, privacy is a major concern here. A lot of people I've spoken to have said, 'we think this thing is always listening to us.' That might not be true, but it is very difficult to convince people otherwise, and the best way to do it is to give them something that puts them at ease: something open source that you can flash yourself, which uses the same APIs in the background. That's a good sell.

So yes, Alexa is a personality and Google Assistant is a personality, but the underlying technology is Amazon's Lex service on AWS and Google's Dialogflow. We've managed to add support for both. We haven't released the Lex bit, because it wasn't meant to be run the way we are running it right now, but the Google Dialogflow support is on GitHub already. You can give these devices your own personality, create your own responses, write your own sass, because that's what makes them fun. When you watch Iron Man, you like Jarvis because Jarvis is sarcastic; he talks back to Iron Man. He's not some 'okay sir, thank you sir' type. That's what you want to build into your products, because that's honestly fun.

There are quite a few more reasons why this needs to catch on. Essentially, we started with buttons and displays, then graduated to screens with touch built into them. I was talking to Bunny on Thursday, and he said one of the things he wants in his new product is the ability to record and listen to audio, because not everyone can read and write; and even those who can may be permanently or temporarily unable to. Accessibility is a big thing when it comes to these technologies. I've messed up the sequence here, because accessibility was supposed to come later, but anyway.

Coming back: I wanted to show this because it's one of the early washing machines, with no buttons and no displays. This next one, and I think my parents have this exact model, has no display, just buttons. The newer versions come with displays; most of us probably have one of those at home. That's how HCI has evolved, because a washing machine is essentially a computer in one way or another, and the way we interact with it has changed.
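Before moving on, to make the "same APIs in the background" point concrete: here is a rough sketch of what a single detect-intent call to Dialogflow's v2 REST API could look like from ESP-IDF, using the esp_http_client component. The project ID, session ID, OAuth2 token, and base64 audio are placeholders, and our actual firmware streams audio rather than posting one buffer, so treat this purely as an illustration of the request shape.

```c
#include <stdio.h>
#include <string.h>
#include "esp_http_client.h"

// Placeholders: a real client needs a valid OAuth2 access token, your
// Dialogflow project/session IDs, and base64-encoded 16 kHz audio.
#define DF_URL  "https://dialogflow.googleapis.com/v2/projects/my-project/" \
                "agent/sessions/device-1:detectIntent"

static const char *body =
    "{\"queryInput\":{\"audioConfig\":{"
    "\"audioEncoding\":\"AUDIO_ENCODING_LINEAR_16\","
    "\"sampleRateHertz\":16000,\"languageCode\":\"en\"}},"
    "\"inputAudio\":\"<base64-audio>\"}";

void dialogflow_detect_intent(void)
{
    esp_http_client_config_t config = {
        .url = DF_URL,
        .method = HTTP_METHOD_POST,
        // Everything is TLS; a root CA (cert_pem) must be configured
        // for the connection to Google's endpoint to come up.
    };
    esp_http_client_handle_t client = esp_http_client_init(&config);
    esp_http_client_set_header(client, "Authorization",
                               "Bearer <oauth2-access-token>");
    esp_http_client_set_header(client, "Content-Type", "application/json");
    esp_http_client_set_post_field(client, body, strlen(body));

    if (esp_http_client_perform(client) == ESP_OK) {
        // The JSON response carries the matched intent, fulfillment text,
        // and optionally synthesized audio to play back over the pipeline.
        printf("status = %d\n", esp_http_client_get_status_code(client));
    }
    esp_http_client_cleanup(client);
}
```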
And now we are at a point where voice interaction is so cheap and so accessible that we can add it at virtually no cost to a product. That's the interesting bit, because all of those buttons and displays can go away if I can just talk to my product. I have a demo: in this case, instead of Alexa, we're talking to a Google Dialogflow bot.

What's also interesting is that it gets better with time. You don't have to set the interaction in stone, because all the language understanding happens in the cloud. You can evolve the way things are said: instead of a back-and-forth conversation, it could become a one-way instruction, 'do x, y, and z.' That matters because asking people to constantly update their firmware is tricky; not everyone likes it, it's invasive, and people ask why they have to do it again and again. Once you build a natural-language interface, all of that goes away, because you don't have to ask the user to do anything; the device is just a dumb terminal.

That was supposed to be the last video, but I actually showed it earlier. So that's about it from my end. I think putting voice assistants on microcontrollers is a pretty big deal, and we are not the only ones working on it; others like NXP are working on it too. Accessibility is one reason it matters; no need for updates is another. And honestly, for me, the coolest part is the science-fiction aspect of it. 2001: A Space Odyssey, anyone? It has always been the dream in science fiction to have a natural-language interface to technology. Any questions?

Q: What do you use as the input source?
A: There's a mic on the development board, with a driver IC that handles it. We also support multiple wake-word engines; some are software, some are hardware. In the hardware case, the mic has to be routed to them because they do the processing: essentially, they are DSPs. We work with a few DSP vendors; DSP Group is one, Microsemi is another, and I think we have five or six in total, including Intel. We can also do the DSP bits on the ESP32 itself, which takes up more application code space, but it's doable. So when you say 'Alexa', the wake-word detection happens on the ESP32 itself.

Q: Is the face recognition done inside the ESP32?
A: Yes. There is no communication with anything other than the ESP32, and we actually have headroom left after that, even while doing all the voice-related work; we could still do more. If you're interested, I have a small development board for that kind of demo, and the source code is on GitHub, so you're free to try it out. The machine-learning bits are there, but everything is local; nothing is communicated to anyone, on the same network or outside it. We have enough muscle locally.

Q: In the video, how do you interface the camera to the ESP32?
A: I think it's SPI, a SPI camera, but I'm not exactly sure; I'd have to check the schematic.
The whole thing is up on GitHub, though, so you can check as well.

Q: And the wake-word detection happens on the ESP32, and then you just send the entire audio up to the cloud?
A: Yes. Essentially it's all streaming; we don't store the audio locally, which is one of the reasons we've been able to do this on a microcontroller, because we don't need a lot of RAM. The engine tells us to start recording, and when it detects that the sentence has ended, it gives us another indication saying we're done.

Q: One last question: the list of codecs you showed earlier, are those hardware audio codecs, or is it all software?
A: Some are soft, some are hard.

Q: So for some of them you do the decoding or encoding in software, and for the others?
A: Those go over I2S.

Q: If I have an IP camera connected to the same Wi-Fi network, could its video be streamed to the ESP32 for facial recognition? In theory it seems possible.
A: I think so; I'm not sure. I'm trying to think of what the bottleneck would be; frame buffers, probably. The camera is already on the Wi-Fi network, but IP cameras are typically... actually, that's an interesting question. I'll ask the right people about it. And can you have many cameras with the ESP32 doing the recognition? Not yet, not on GitHub at least, but we do have it working; it's actually one of the PRs assigned to me. Essentially it works, with some edge cases still to be handled. One of the good things about the ESP32 is that we have BLE as well as classic BT on board, so we use BLE to provision the device with your login credentials and Wi-Fi details, and then we use classic BT for the audio.

Q: Does the communication support encryption?
A: Yes, all of this is TLS. All the communication between the microcontroller and Amazon's or Google's APIs is TLS, because those guys won't have it any other way, and neither would we; this is pretty sensitive information. There are no unencrypted bits anywhere, so the only attack vectors are probably physical. I'm going to be around if anybody has more questions.

Q: Could this be used for localization of a person or an object, since you now have the object?
A: There's also scope for doing that using Wi-Fi. There are multiple ways to do it, and some work has already been done on localization.

Q: I just understood what you meant about the audio codecs; you were saying all of that is soft?
A: Yes, all of that is soft. I thought you were asking about how we drive the speakers, so I misunderstood.

Okay, very interesting. I think we can continue the conversation afterwards, because the next session starts in two minutes. So that's the end.
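A footnote on the BLE provisioning flow from the Q&A: in current ESP-IDF this is packaged as the wifi_provisioning manager. The sketch below assumes ESP-IDF v4 or later (the API at the time of this talk differed, and a product may well use its own scheme), and it assumes NVS and the Wi-Fi stack are already initialized.

```c
#include <stdbool.h>
#include "esp_err.h"
#include "wifi_provisioning/manager.h"
#include "wifi_provisioning/scheme_ble.h"

// Sketch (ESP-IDF v4+): advertise over BLE and accept Wi-Fi credentials
// from the companion app if none are stored yet.
void start_ble_provisioning(void)
{
    wifi_prov_mgr_config_t config = {
        .scheme = wifi_prov_scheme_ble,
        // Free only the BLE memory afterwards; classic BT stays up
        // because the speaker still uses it for audio.
        .scheme_event_handler = WIFI_PROV_SCHEME_BLE_EVENT_HANDLER_FREE_BLE,
    };
    ESP_ERROR_CHECK(wifi_prov_mgr_init(config));

    bool provisioned = false;
    ESP_ERROR_CHECK(wifi_prov_mgr_is_provisioned(&provisioned));
    if (!provisioned) {
        // Security 1: key exchange plus a proof-of-possession string the
        // user enters in the app, so credentials never travel in the clear.
        ESP_ERROR_CHECK(wifi_prov_mgr_start_provisioning(
            WIFI_PROV_SECURITY_1, "abcd1234", "PROV_SPEAKER", NULL));
    } else {
        wifi_prov_mgr_deinit();
    }
}
```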