Okay, hi everyone. Apart from Ben, who needs to be here, this is a noob's presentation, so if anyone is looking for something deeper, this may not be the right venue. Otherwise, we'll just get started.

Now that I'm about to begin, does anyone want to volunteer for a little magic trick? No? Okay, then I'll be my own volunteer. Oh, you're a volunteer? Perfect. Does anyone have anything you want Ben to say? I hope it gets recorded. Something like "make America great again" - short and simple, less than ten seconds, or my computer will just die. Anything? No? Okay. Let me just... what's the command again? Yeah, okay, it should run. Go ahead.

I'll do a few recordings, so just bear with me for a moment so that I don't need to ask for volunteers again; I'll explain all of this later as well. Sorry, round two: "Let's make FOSSASIA greater." And then: "Let's make FOSSASIA the greatest." Cool. Hopefully I now have everything inside my folder - I might have overwritten something, but that's fine. If you look here, I've recorded two .wav files, live, no tricks up my sleeve.

Now, if I run the code - sorry, let me see; I'm cheating a little here because I don't want to memorize the whole script - this should work, hopefully. What this does is call the DeepSpeech library, which has TensorFlow underneath it. And it says... ah, crap, this doesn't work. This is why I got Ben to record a few times, so we can see the errors. What it's complaining about here is the encoding: it wants signed integers and the file is unsigned. If I run the second one, this should work. I think Ben said something along the lines of "make FOSSASIA greatest", so that's the output I should see, but it won't be very accurate. Let's give it a while. "Incomplete wav chunk." Okay, let me see - the clip could have been too short. Let me just record something myself. One more round, sorry, guys.

While I'm typing this, maybe I'll explain what I'm doing. Are you familiar with the SoX toolkit for recording and manipulating audio? No? It's a pretty nifty tool for anything you want to do with voice recordings, and when you're running demos you do everything in the terminal - it just looks cooler. What I'm doing here is calling the rec command. I'm setting the channels, the -c flag, to one, because DeepSpeech as of now can only handle mono, not stereo. The sample rate has to be 16,000 Hz; that's in the documentation, you can't use any other sampling rate. The error you saw just now about signed versus unsigned integers is the encoding, so I'm setting that explicitly as well. And then, of course, the file name: testing1.wav.

Okay, hopefully this works: "Let's make FOSSASIA the greatest." And let's run it again. If this still doesn't work, then I'm sorry, it's really the demo gods, and we'll go back to the presentation slides. Let me explain a bit here as well: I think what's happening is that the clip is a bit too short, so it's not really running. I'll go with something that I know works. This particular LDC file is the standard test file, a short scripted sentence read aloud. Okay, let me just get this to run. Sorry, it's really slow.
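While that's running: for reference, the recording command I was typing looks roughly like this. It's a sketch from memory - rec and sox are the SoX tools I mentioned, the flags are how I remember SoX spelling its options, and testing1.wav is just the file name from the demo.

    # Record mono (-c 1), at 16 kHz (-r 16000), as 16-bit signed PCM (-b 16 -e signed-integer),
    # which is the input format DeepSpeech expects
    rec -c 1 -r 16000 -b 16 -e signed-integer testing1.wav

    # Or convert an existing recording into the same format
    sox input.wav -c 1 -r 16000 -b 16 -e signed-integer testing1.wav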
There's no volume - sorry, is there any way to get the volume up on this? Oh, I could just take that off. Okay, I'm really sorry, you're just going to have to take my word for it. This particular file - I can show it to you later, standalone - is a sound file that comes from the TIMIT corpus. It's a very short text that I think over 600 Americans read out, and this is one of the test files. When you run this, it outputs the speech as text.

So what am I doing here? If you look at the command, it says deepspeech - that's the library. Then output_graph.pb: if you use TensorFlow, that's the model exported from TensorFlow, and you need to load it. Then the wav file. And alphabet.txt is really all the consonants and sounds you have in your alphabet. Part of what we're doing is training for Bengali, which, if I understand correctly, actually has 40 to 50 sounds, but we didn't care - we just went with the English alphabet.

Okay, I'm sorry guys, I'm guessing I didn't pray to the demo gods today - it's not working. I tried it before I came here, so I'm really sorry; I'll look at it later. Maybe I'll go back to the slides first. So I failed there; yes, the demo didn't quite work. And just to give credit to the folks: Mozilla, it was their code. The actual theory came from Baidu - I think Andrew Ng was the one who came up with Deep Speech, and there are three variations of it, really cool papers. And that's my team sitting over there; maybe they'll be able to get the demo to work. I'm just the guy copying code from them.

Why does this matter? I actually gave a talk about this at GeekCamp, I think earlier this year or last year. As we move forward you see Alexa, the HomePod and so on, and SEO is going to change in a very different direction: when you query for something, you only get the top result - it's not even a first page anymore. So I think voice user interfaces are the future. And the project we're running is based in Bangladesh. Some context: there's high illiteracy, so a lot of people don't really read or write, which is why the concept was a voice user interface for accounting.

So what do you need? Python - I don't know why they haven't updated this to 3.0, it's still 2.7, so when you're going through the code, confirm it's there. We set things up in a virtual environment so we don't muck things up if you have multiple versions of Python. I think I might have missed a step on the slide, but anyway: you create the virtual environment, activate it, and deactivate it when you're done. And of course you install DeepSpeech - the download is ginormous, around 1.4 GB, so have really good bandwidth.

These are the four files I was telling you about: the TensorFlow output graph, the alphabet, the language model in binary format - I have no clue what this is, so if you know, please tell me - and the trie, which is, I guess, a computer scientist's way of playing with the word "tree": t-r-i-e. It's also another part of the model. Maybe if we have a bit of time, I'll show you the demo again; I believe I know what's wrong now.
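Before going into the models, here is the setup in one place, roughly. It's a sketch: the environment name is arbitrary, the model archive URL below is just a placeholder rather than the real download link, and the package version you get will differ.

    # Isolate everything in a virtual environment (DeepSpeech still wants Python 2.7 here)
    virtualenv deepspeech-venv
    source deepspeech-venv/bin/activate

    # Install the DeepSpeech client library
    pip install deepspeech

    # Fetch and unpack the pre-trained model files (the ~1.4 GB download; placeholder URL)
    wget https://example.com/deepspeech-pretrained-models.tar.gz
    tar xzf deepspeech-pretrained-models.tar.gz

    # When you are done working
    deactivate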
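And the inference command that ties those four files together looks roughly like this. Again, a sketch: this is the 0.1-era command line, which took positional arguments, and later releases changed the interface, so check the README of whatever version you install. The models/ paths are just where I unpacked the archive.

    # model graph, audio clip, alphabet, then optionally the language model and trie
    deepspeech models/output_graph.pb testing1.wav models/alphabet.txt models/lm.binary models/trie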
In terms of the models, DeepSpeech was trained on these three corpora: Fisher, LibriSpeech and Switchboard. So it accounts for different accents and different people, and I think there are around 3,500 hours of training data. And it's still really, really crap - if I get the demo to work you'll see it's around 50% to 60% accurate. Right now our team has maybe 50 to 60 hours for the Bengali model, and because we're not using the Bengali alphabet it's really crap, but we're working on it.

This is really the part where you've already trained something and you want to fine-tune the model. What I understand is that if you raise the learning rate, things can go a bit faster but may not be as accurate. The checkpoint steps and the checkpoint directory come from TensorFlow: if you're training and your computer suddenly loses power, it's fine, it just picks up from the last checkpoint. Those are cool things that are built in, and I'll sketch what an actual fine-tuning run looks like at the end of this part.

Maybe I'll explain some of the theory. Who's familiar with sampling and how it works? Yeah? I think this is an important first step of how we convert voice to text. Why is it run at 16,000? Imagine each bar as one sample, and you take 16,000 of them across every second of your voice. Human hearing runs roughly from 20 Hz up to 20 kHz, and most of the energy in speech sits well below 8 kHz, so sampling at 16 kHz is enough: you need to sample at at least twice the highest frequency you want to be able to reconstruct.

The next one was a slightly more difficult concept that took me a long time to figure out from the Deep Speech papers. Is anyone familiar with CTC? No? Okay, I'll try my best. I've got two videos: one is Under Pressure by Queen, and the other one is Vanilla Ice, so Ice Ice Baby. Have all of you heard the songs? Yeah? I just played the first three or four seconds, right? You hear that it's... okay, now we listen to the Vanilla Ice one. You hear it's really about the same, right? So if I only listen to three or four seconds of either clip, I cannot differentiate: is it Under Pressure, or is it Ice Ice Baby? But if I have one more second, then I can classify it and say, hey, this is the crappier version, and these are the classics. Essentially that's my understanding of what CTC does: the network emits a character (or blank) probability at every small time slice, and CTC collapses the repeats and blanks, so you never have to hand-align which slice of audio belongs to which letter. If you figure out something else, let me know, because this is how I interpreted the paper.

And recurrent neural networks - I believe some of you may be better experts in the field; I know Ben is. My understanding is that this is really just adding memory to your neural network, so that, especially for sequences, it does better because it remembers what came before. And if you look at CTC, you need that kind of memory to say: what has stayed the same, what has changed? I'll share some links to the blog I picked this up from, which had really simple explanations that I thought were pretty awesome. So I'm going to skip ahead.

In essence, this is how the DeepSpeech model works - there's also a blog link that explains it in more detail. The h1, h2, h3 boxes are the layers within the neural network that the extracted features are pushed through. And beyond this fine-tuning stuff, there are still a lot of things I don't understand.
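To make that fine-tuning step concrete, a training run looks roughly like this. It's a sketch: the flag names are from my memory of Mozilla's DeepSpeech.py training script and may differ between versions, and the CSV paths are placeholders for wherever your own data lives.

    # Continue training from an existing checkpoint with our own (small) Bengali dataset
    python DeepSpeech.py \
        --checkpoint_dir ./checkpoints \
        --train_files data/bn/train.csv \
        --dev_files data/bn/dev.csv \
        --test_files data/bn/test.csv \
        --alphabet_config_path models/alphabet.txt \
        --learning_rate 0.0001 \
        --epoch 10

A lower learning rate trains more slowly but is less likely to wreck what the pre-trained model already knows, which is the trade-off I mentioned above.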
So as of now, I'm just going to use a creative writer's license and say: imagine it to be vibranium technology. I don't get it at this point in time, because there's still a lot to go through.

As for the challenges we faced when actually training on the data: a lot of noise. How many of you have been to Bangladesh, apart from my team? No? It's really noisy - a lot of cars honking, a lot of people talking, it's just a lot of people. So when you're capturing your data, even in a room, you still get a lot of noise coming in. We've tried pre-processing steps to remove the noise and filter out frequencies, but it's still proving to be a challenge. At the same time, like I said, we roughly need around four to five thousand hours of data; right now we have 50 to 60, so we just don't have a big enough dataset. I was really keen to go to the talk on Deep Voice - I think they have a really cool platform for voice collection, but as of now it's still English only.

This is the short link to the presentation if you're interested, and these references are worth a look if you'd like to learn more. The first one explains how RNNs work in simple terms - it's not the fancy TensorFlow one, which is extremely complicated and extremely long, so I found this a lot better. The Mozilla Hacks link is by the person who actually built DeepSpeech at Mozilla, so he explains it in a lot more detail.

Before we go on to questions, I just want to try one more time; I think I know what I'm missing. Okay, let's see. Oh, sorry, this is the one with the two wav files. Yeah, this should work. Ah, I'm sorry. Right, that's it. Any questions? Chetan?

"How do you compare WaveNet and DeepSpeech? I tried WaveNet for speech-to-text - there's an open-source WaveNet model - but I never tried DeepSpeech. What do you think in terms of accuracy?"

I haven't tried WaveNet personally, but I don't think it runs on a neural network, right? It runs on...

"It does run on one - the demo model is TensorFlow and Keras, and it's been published on GitHub as well. I just ran the inference with my own data files and found something like 70% accuracy on a good wav file. But I never tried DeepSpeech, so what do you think the accuracy is with it?"

Right now, I don't know - it really depends on the data you have. The larger the dataset, and the more diverse, the better it gets. So I don't have an answer, I'm sorry. Thank you.

"So what you're doing is taking a pre-built TensorFlow model and then using transfer learning to train it further with your own data?"

We don't transfer - we take it as is. It's built for English, so we take the consonants as is. The only Bengali word I know, the one my team is so sick of, is potatoes: aloo. When you put it into Romanized English form it can be spelled alu or aloo. So in essence we collect all of these, tag them to say, okay, this is aloo, and we have a dictionary that maps back to the intent we wanted. It's a stupid step at the moment, but it works. Thank you. Sorry - thanks.
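As a footnote to that last answer: the spelling-to-intent mapping is conceptually just a small lookup table over Romanized spellings. A toy sketch of the idea, with made-up intent names, would be something like this.

    # Map spelling variants of the same Bengali word to one canonical intent
    normalize() {
        case "$1" in
            alu|aloo) echo "POTATO"  ;;
            *)        echo "UNKNOWN" ;;
        esac
    }

    normalize "aloo"   # prints POTATO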