Hi everyone. This is not really a teaching session; I'm here to share something interesting that I built last year. But before I go any further, a quick introduction about myself. My name is Crystal. I'm from AWS, on the Professional Services team. If you're not familiar with Professional Services, we're the consulting arm of AWS: your company might want something built, but you don't have the time, manpower, or skill sets in-house, so we come in and help implement it. That's my day job. I'm a data and analytics consultant with ProServe, and I do everything related to data. But that's not what I'm here to talk to you about today.

So, storytelling for children with AI technology. What is that? It sounds so lofty. Basically, last year I had an opportunity to build something for re:Invent, and this is what I presented at the Builders Fair. It's for fun, to show you what you can build with AWS services. So let's dive right in.

Are you bad at telling stories to your kids? Do they sometimes wish for the other parent instead, because they can make better noises than you? That's what I was thinking about when I came up with the idea. I was talking to some of my colleagues. They work super long hours, and at night they still have to go home and read stories to their kids, eight years old, five years old, over and over again. It's always the same stories, and you have to make the same noises at every single interval. That's what they were tired of. After a whole day of writing code and building pipelines for your DevOps, you don't want to make the same dog noises again every night.

So I built this application called Immersive Storytelling. It's a web app that accompanies you while you're reading: it recognises what you say, understands it, and plays back relevant sound effects in real time. It uses out-of-the-box natural language processing technologies, plus a bunch of serverless technologies, so the whole thing is hosted online and you can use it whenever. We wanted something you don't have to download onto your mobile, no whole application to install. That's why we built it with vanilla HTML, CSS, and JavaScript. You just go to the browser, click, start reading, and it plays sound effects back for you in real time. If you were at re:Invent last year, you would have seen me at one of the booths; this is what I presented for a few hours on each of those days.

Maybe it's easier to show you a demo rather than talking about it. So this is the application, very simple. It's supposed to look at you, not the screen. Over here there are just a few settings you can choose. It's been up since last year and I'm still set to us-west, so it might be a little slow. One thing you can set here is max sounds. Let me make it a little bigger. Maybe too big. Okay. Max sounds tells the application how many sounds you want at maximum from a single sentence. You talk about three different animals, it plays three sounds; you talk about five animals, it plays only the first three. You don't want the sounds to keep playing, so you control it like that.
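To make that cap concrete, here's a minimal sketch of the idea in Python (the real app is vanilla JavaScript; the function and variable names here are hypothetical):

```python
# Hypothetical sketch of the "max sounds" setting: out of all the words
# in one sentence that have a sound in the library, play at most N.
def pick_sounds(words, sound_library, max_sounds=3):
    hits = [w for w in words if w in sound_library]
    return hits[:max_sounds]

# Five animals mentioned, only the first three play.
print(pick_sounds(
    ["the", "cat", "dog", "cow", "pig", "horse", "ran"],
    sound_library={"cat", "dog", "cow", "pig", "horse"},
))  # ['cat', 'dog', 'cow']
```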
The plus one doesn't do anything; it's just there for me to monitor how many people use it. Okay, let's start. Hey... oops, let me refresh this. "Hey diddle diddle, the cat and the fiddle, the cow jumped over the moon. The little dog laughed to see such fun, and the dish ran away with the spoon." So that's the sound of running. That's basically the entire application: something simple to accompany you while you read, adding a little bit of ambience to your reading sessions at night.

Now, some design considerations I was thinking about when building this. First, it has to be portable, convenient, and quick to set up. Obviously this is not my full-time job, so I only had about one hour a week to build it, for a few weeks at maximum, and I wanted to build it very, very quickly. That's why I built the web app with vanilla HTML and CSS. You saw the UI; it's not meant to be anything fancy, it's meant to be quick. The next thing was very simple, quick hosting and CI/CD setup, so that every time I make a change to my code I can deploy it easily. For that I used AWS Amplify. It's a set of tools that helps provision your hosting, sets it up as a static website with a CloudFront CDN on top, and has a CI/CD pipeline built in. The code is hosted on AWS CodeCommit; every time I make a change, I push it and I see it deployed on the website already. It really took away a lot of the undifferentiated heavy lifting for me.

The next thing I had to consider was fast playback of sound. While you're reading, you don't want to wait multiple seconds before the sound comes back; it has to be as close to real time as possible. To do that, I had to make a few design decisions. When you're speaking, the first thing you need is speech-to-text. Then you need to understand what is being said: you need to pick out insights, the nouns and the verbs. Once you have the nouns and verbs, the sounds you want to play, you then have to fetch those sounds from the backend, wherever you're storing them. My yardstick, after a little bit of research, was that all of this has to happen within two seconds or it doesn't make sense anymore; people lose interest.

The first part uses Transcribe streaming. Amazon Transcribe is speech-to-text, but normally how it works is you have a large batch of audio files, which could be in the gigabyte range: you create a job and it transcribes everything for you. Instead of doing that, this time it needed to be real time, so I used streaming. You open a WebSocket, stream in whatever you're getting from the microphone, and it comes back to you every single time it has an update on what you're trying to say. That shortened the time considerably; the speech-to-text part took barely any time at all.

Now to focus on the other side of things: picking up sounds from the backend. When I look at children's storybooks, they normally have a few main characters, and those characters keep recurring. So one thing I looked at was caching on the client end. At the start of the book you've introduced a character, a pig, a dog, whatever, and that sound is going to be heard very frequently in the rest of the book, so we do some caching on the client end.
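Here's a minimal sketch of that cache-first lookup in Python (the real app does this in browser JavaScript; the table name, key, and attribute names here are assumptions):

```python
import base64

import boto3

# Hypothetical table: partition key "word", attribute "sound" holding
# the base64-encoded MP3.
table = boto3.resource("dynamodb").Table("sound-store")

sound_cache = {}  # recurring characters hit this after their first mention

def get_sound(word):
    """Return raw MP3 bytes for a word, caching on first fetch."""
    if word in sound_cache:                      # cache hit: no round trip
        return sound_cache[word]
    item = table.get_item(Key={"word": word}).get("Item")
    if item is None:                             # no sound for this word
        return None
    audio = base64.b64decode(item["sound"])      # base64 string -> MP3 bytes
    sound_cache[word] = audio
    return audio
```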
And the last thing is the storage of the sounds. The first time I thought about it, I thought, okay, maybe I'll compress the sounds, store them on S3, and retrieve them; the smaller the MP3 files, the shorter the download. But eventually that took a lot of time as well. The download plus the playback took more than three to five seconds depending on the size of the file, even though these were sound bites less than two seconds long. So what I changed was to take the MP3 file, encode it in base64, and store it in a DynamoDB table. Storing sounds in a table is maybe not the first thing you'd think of, but it's just a binary string, and DynamoDB is a millisecond-latency NoSQL store. All you have to do is search for the word and you get the sound back. Since it's base64 encoded, you just feed it into your audio buffer and it plays out of your speaker directly; there's actually not much processing to be done. That shortened the time from five seconds to one point something seconds, and that's what I needed.

I did a lot of benchmark testing on how to make it faster, a lot of console.log and timestamps. What I realised is that the main thing causing lag in the entire application now is actually waiting for Transcribe to tell me when the speaker has paused in the speech. Retrieving the sounds and finding out which nouns and which sounds to play takes a much shorter time. I'll dive into that a little deeper when we get to the code, but the thing to understand is that you don't want to play the sound while you're still talking. You speak in phrases, you speak slowly, and you pause; earlier, when I was reading, I was pausing between each line. Transcribe streaming lets me know when the speaker has paused and the sentence is complete, so it's no longer partial, and then I can start retrieving the sounds. You don't want to just read on with sounds playing over you; it's a bit distracting. So waiting for that pause, for Transcribe streaming to tell me this is a full phrase, is the part that takes the longest now.

The last design consideration was the adding of sounds. It means I could focus first on the main sound pack. For children's books, I focused on animal noises, nature sounds, and human noises; if you do crying or laughing, I have those. But that part is still very much a manual process, because you have to find the sounds and load them into the library. So instead of finding as many sounds as I could, I thought, why not make it modular? This application is for children's books, but I've also heard feedback telling me this would be amazing for Dungeons & Dragons, role-playing games: you say a dragon swoops in and suddenly you hear a roar, you hear fire, you hear the crackling. You can expand it to different target audiences, and they can have their own modular sound packs that they install, process, and add to the library, whatever you need. Somebody also told me, if I could have this during my business presentations, I'd go to my customers, say the name of my company, and immediately the jingle starts playing. I mean, you can do that, right? It's up to you how you want to use this.
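To show what "waiting for the pause" looks like, here's a sketch of how a client might gate on the partial flag in a Transcribe streaming result. The JSON shape follows the streaming API's transcript events; the handler names are hypothetical, and the real app does this in JavaScript:

```python
import json

def handle_transcript_event(message, on_full_sentence):
    """Process one Transcribe streaming message; act only on full phrases."""
    payload = json.loads(message)
    for result in payload["Transcript"]["Results"]:
        if result["IsPartial"]:
            continue  # speaker is mid-phrase; wait for the pause
        # The phrase is complete: safe to look up and play sounds now.
        on_full_sentence(result["Alternatives"][0]["Transcript"])
```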
So how do we make adding sounds a bit simpler? There's a bit of automated processing and expansion-pack logic behind it. Later in the architecture you'll see: you drop the sounds into an S3 bucket, a Lambda picks them up, and it adds each sound into DynamoDB for you, doing all the base64 encoding and filing it under the correct word.

The next part is synonyms. I don't want to find a different sound for jumping, hopping, and leaping, for example; sometimes they just sound the same. Running and walking are quite similar. Or "a walk" and "walking", very similar words, even though one's a noun and one's a verb. So what I did was use the NLTK WordNet library, which finds synonyms for me. I put in "dog" and it comes up with "canine", it comes up with "puppy", for example, and all those words map to the same sound, because nowadays computation is more expensive than storage; that base64-encoded string isn't much storage anyway. So I find the synonyms and store the same sound for all those words, and when I'm using the application, it finds the sounds and plays them back very quickly. I get more hits per sentence.

So this is the architecture. Very simply, you see Amplify, the set of tools that helps me with the hosting and the CI/CD. You see the blue line from CodeCommit: every time the maintainer, which is me, makes a change, I commit and push, and the deployment gets triggered immediately. I never even had to set up that CodeCommit repo; Amplify did it for me. Above that, you see Amazon Cognito. That's needed to get temporary credentials. Earlier, when I was doing the demo and you saw that red line saying the security credentials had expired, it's because of that: whenever you first go onto the page, it gets temporary access keys, and those give you the permissions to use the rest of the application.

Then the orange lines at the bottom. The first thing: once you go on the app, you press the start button and start your WebSocket transcription. It goes to Transcribe streaming, so whatever is being said comes back in the form of text. Once I get the text, I send it to Amazon Comprehend. This is where I detect the syntax of the words being said. What I want to find are the nouns and the verbs, which usually carry the most sounds. The cat, the dog, the cow, for example: I want to pick up those words. Once I have those nouns and verbs, I hit my DynamoDB table, which is my sound metadata store. Basically I say GetItem: find "cow" and retrieve the sound bite for cow. My table is very simple; if you look at it, it only has two columns. Once I get it back, I play it on my web app through the speakers, and that's it. That's the entire application.

On the right-hand side, you see the red lines. That is the pre-processing architecture, which is how we can add more sounds to the application, your expansion packs. Maintainers, again, upload the sound bites as MP3 files. You just drop them into the S3 bucket, which is an object store. There's an event notification: every time something gets dropped into the bucket, the event triggers a Lambda, which processes the sound for you. First it takes the MP3 file and gets the base64-encoded string, and then it stores that same string for all the different synonyms of the same word. That's pretty much it.
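Here's a hedged sketch of what that preprocessing Lambda could look like in Python, using NLTK WordNet for the synonym expansion described above. The bucket layout, table name, attribute names, and the word-from-filename convention are all my assumptions, and the WordNet corpus would need to be bundled with the function:

```python
import base64
import os

import boto3
from nltk.corpus import wordnet  # requires the WordNet corpus packaged with the Lambda

s3 = boto3.client("s3")
table = boto3.resource("dynamodb").Table("sound-store")  # hypothetical name

def synonyms(word):
    """Collect WordNet lemma names for a word, e.g. dog -> {dog, canine, puppy, ...}."""
    names = {word}
    for synset in wordnet.synsets(word):
        for lemma in synset.lemmas():
            names.add(lemma.name().lower().replace("_", " "))
    return names

def handler(event, context):
    # Triggered by an S3 event notification when an MP3 is dropped in.
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        key = record["s3"]["object"]["key"]          # e.g. "sounds/dog.mp3" (assumption)
        word = os.path.splitext(os.path.basename(key))[0]

        mp3 = s3.get_object(Bucket=bucket, Key=key)["Body"].read()
        encoded = base64.b64encode(mp3).decode("ascii")

        # Storage is cheaper than computing synonyms at read time, so the
        # same encoded string is written once per synonym.
        for name in synonyms(word):
            table.put_item(Item={"word": name, "sound": encoded})
```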
So if we have some time, let's dig into the code. Again, I'm a data architect by training, so be kind to the code you're about to see.

This first part is about the start button. You click, and the most important thing on this page is that once you click start, it starts streaming whatever you're saying to the WebSocket. You're creating a WebSocket and sending all the binary audio to it. And this is that particular function: it creates your pre-signed URL, takes the credentials you need, and gives you those temporary permissions. Every time you're sending audio chunks, whenever it detects that something is being said, it hands off to our next function, the wireSocketEvents function. For this, we're basically taking whatever is being said and handing it to Amazon Transcribe. On the right-hand side, you see we check whether there is a response coming back from Amazon Transcribe; if there's more than zero words, we start processing it.

The interesting part is at the bottom: we're checking this flag that says whether the result is partial. Remember, we talked about waiting for the pause before playing the sounds. When you look at what the API returns, there's a Boolean that tells you whether or not the phrase is partial, because this is a streaming API; the results are coming back all the time. It gives you a streaming response, and I don't want to process every single part of it. I just wait until the Boolean turns to false, meaning it's no longer partial, it's a full sentence, and then I start processing what's in that sentence.

This is how it looks. To give you an idea of what Transcribe streaming is, this is me speaking into the microphone yesterday, and you can see that the words are changing all the time. It says "I'm Crystal, nice too", t-o-o, because at the start, before the phrase was fully completed, it thought that word was "too" as in t-o-o. But when I finished the sentence, it got more context from the rest of the words and corrected it to "to" as in t-o. So waiting until the whole sentence is complete actually gives it more context for a better, more reliable transcription.

So I keep getting all these streams and checking whether or not the phrases are partial. Once a phrase is not partial, I pass it over to Comprehend to pick out the parts of speech. The exact API I'm using here is called DetectSyntax. What I'm trying to find out is: is this a noun? Is this a verb? If it is, let's do something with it. One other interesting thing done here is the lemmatising of the word. Lemmatising means getting the root word. If you say something like "running", the root word is "run". If I search "running" in my sound database, it won't come up with a sound, and you don't want to store the same sound for running, runs, ran, and whatever other form of the word there could be. So you lemmatise the word, get the root word, and store the sound only once. You first lemmatise it, then you find the part of speech it is, whether it's a noun or a verb.
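A sketch of that step using boto3's detect_syntax together with a WordNet lemmatiser (the app itself calls Comprehend from JavaScript, and the choice of lemmatiser here is an assumption; `nltk.download("wordnet")` is needed once beforehand):

```python
import boto3
from nltk.stem import WordNetLemmatizer

comprehend = boto3.client("comprehend")
lemmatizer = WordNetLemmatizer()

def nouns_and_verbs(sentence):
    """Return lemmatised nouns and verbs from one completed phrase."""
    resp = comprehend.detect_syntax(Text=sentence, LanguageCode="en")
    words = []
    for token in resp["SyntaxTokens"]:
        tag = token["PartOfSpeech"]["Tag"]        # e.g. NOUN, VERB, DET, ADP...
        if tag == "NOUN":
            words.append(lemmatizer.lemmatize(token["Text"].lower(), pos="n"))
        elif tag == "VERB":
            words.append(lemmatizer.lemmatize(token["Text"].lower(), pos="v"))
    return words

# "The cow jumped over the moon" -> ['cow', 'jump', 'moon']
print(nouns_and_verbs("The cow jumped over the moon"))
```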
Then there is some logic here to limit it to the maximum number of sounds we want: we only want to play three sounds, we don't want to overwhelm the audience, and then we get the sound from our database. Here, the first thing I do when I try to find a sound is check whether it's available in my local storage, my cache. If it is, play it back. If it's not, I go to my DynamoDB table and get the sound. It comes back as base64, I decode it, push it into the buffer, and play it back. And that's it, we're done.

So that's all there is to this application. It's not difficult. I built it in a matter of weeks and it's still up, so we can probably share the URL. Tonight, if you go home and read Hey Diddle Diddle, or, I have another story too, The Woman Who Wanted Noise, I think, the sounds are in the database and you can try it out. And that's all for Immersive Storytelling. Thank you.

Sure. Somebody speaks a sentence, right? "The cow jumped over the moon." I send the entire sentence to Comprehend, and Comprehend comes back with a JSON that tells me, for every single word, what kind of word it is. I can't remember what "the" is, but it gives you the part of speech: "cow" will come back as a noun, "jumped" as a verb. I pick up only the nouns and the verbs, because those make noises, and I lemmatise each word to get the root word. Afterwards, I just try DynamoDB to see whether the sound exists or not. One thing I forgot to mention is that it's a lot faster to just GetItem from DynamoDB than to check whether it's there and then pull back the item. So if nothing comes back, I get an error and I know the sound isn't there. I log that somewhere else, so I know a lot of people are requesting this sound and I can probably add it next time. And if it's there, I just play it back. Thank you for the question.

Which message? Oh, every time I get the error message, there's just one item in my DynamoDB table that counts how many times it happened. Yeah. Thank you.

What effect do accents have on the results of the program and Transcribe? That's a very good question. Amazon Transcribe is trained on different accents, and I think it now supports several different localised accents as well, and they keep adding more. But accent does play a big factor. When I get some of my friends with certain accents to try it, it might not work. Sometimes it's about choosing the right accent, if it exists; other times you just have to speak in a clearer, more enunciated way.

Oh, so Transcribe streaming sends all that information back within the result of what is being said. It gives you the transcribed text phrase, and there's just one part of the response that tells you whether it's full or whether it's partial, so true or false. That's done on the server side, so it's very, very light on the client. Oh, that's very cool. I'll try that out. Sure.

Yeah, so the limit for DynamoDB is about 400 KB per item. My sounds are generally two-second sound bites, so even the MP3 file isn't that big, and when I get the base64-encoded string it usually fits. It's really because of the use case. If you want to put very long songs inside, it might not work, but in this particular case I just wanted short sound bites.
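To make the GetItem-and-count-misses pattern from that answer concrete, here's a small Python sketch (her app does this from JavaScript; the counter item and attribute names are my assumptions, and in boto3 a missing key simply returns no Item rather than raising):

```python
import boto3

table = boto3.resource("dynamodb").Table("sound-store")  # hypothetical name

def fetch_sound_item(word):
    """GetItem directly instead of existence-check-then-get: one round trip."""
    item = table.get_item(Key={"word": word}).get("Item")
    if item is None:
        # Miss: bump a single counter item so frequently requested missing
        # sounds can be added to the library later.
        table.update_item(
            Key={"word": "_missed_requests"},    # hypothetical counter item
            UpdateExpression="ADD miss_count :one",
            ExpressionAttributeValues={":one": 1},
        )
        return None
    return item
```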
I wanted to get it back very fast, and it fits. Yeah. All right. Thanks so much.