So we're here at the NVIDIA event. Who are you?

My name is Ian Lane. I'm a professor at Carnegie Mellon working with NVIDIA on speech recognition.

So are you showing the fastest, the best speech recognition on a mobile chipset right now?

The fastest and most accurate speech recognition on an embedded solution.

So it's the fastest ARM-powered device, but you're using the GPU? What are you doing?

Here we're really just using the GPU. We've got an embedded solution on a Shield Tablet, which has a quad-core ARM CPU and a Tegra K1 GPU, and we're doing speech recognition leveraging just the GPU. I can show a demo later comparing the GPU against the ARM-based CPU.

So are you hooking into the Google voice recognition app somehow?

What we're showing here is an example case where we're looking at speech recognition for automotive solutions, comparing speech recognition running locally on an embedded solution without a network against a cloud-based solution with a good network connection.

So you're talking about offline voice recognition?

Offline versus online, yes.

All right. So what can you show?

We've got our offline and our online system, and here we've got a wired connection, so the online system should be quite fast as well. "How far is it to San Jose?" Let me try that again. "How far is it to San Jose?" "Are there any good Japanese restaurants nearby?" "What's the weather like in San Francisco?" What we can show here is that with the GPU enabled, we're able to perform speech recognition offline with similar accuracy and the same speed as a cloud-based solution with a really good network connection. Next I'll show what happens when we have some connection but not very high bandwidth. I'm throttling the network down to only 10 kilobits per second, and then we'll do some more testing. One of the issues when you're driving around in a vehicle is that the network changes dramatically, and speed is one of the factors here. So let's try again with the offline and the online speech recognition systems. "423 El Camino Real, Mountain View." Okay, we've got a network connection, but the bandwidth is low. You can see it's slowly coming up now. "How far is it to the airport?" "How do I get to Caesar's Palace?"

This is really awesome. A few Android versions ago, Google announced some kind of offline support, right? And the phones that have it do it on ARM only?

I believe so. You can do speech recognition offline, and Google does that as well. But what we're able to do here is take the cloud-based solution and, without making any optimizations, without shrinking it down so it can run locally, run it on the device with no changes at all.

The cloud solution, I would expect it to be on millions of servers, I don't know, something like that. Thousands of servers?

Speech recognition is, yeah, thousands of servers.

I would guess. But how can you fit that on one device, like a tablet?

It comes down to speed and memory. The memory footprint of a cloud-based solution can be 32 or 64 gigabytes, and with an SD card we can store that locally, and with the GPU we have enough compute power to match a Xeon processor in the cloud.
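As a rough illustration of that last point: a recognizer can treat a model file far larger than RAM as ordinary memory by memory-mapping it from the SD card, so the operating system pages in only the regions the decoder actually touches. The sketch below is a minimal Python illustration of the idea, not the actual CMU/NVIDIA implementation; the file path and binary layout are invented for the example.

```python
import mmap
import struct

class MappedModel:
    """Memory-map a model file far larger than RAM, so the OS pages in
    only the regions the decoder actually touches (hypothetical format)."""

    def __init__(self, path: str = "/sdcard/lm/ngram.bin"):
        self._file = open(path, "rb")
        # Length 0 maps the entire file; nothing is read until accessed.
        self._mm = mmap.mmap(self._file.fileno(), 0, access=mmap.ACCESS_READ)

    def prob_at(self, offset: int) -> float:
        # Read one little-endian float32 probability at a byte offset.
        return struct.unpack_from("<f", self._mm, offset)[0]

    def close(self) -> None:
        self._mm.close()
        self._file.close()
```

A 64 GB file mapped this way costs almost no RAM up front; only the pages the active search path touches get faulted in, which is what makes a cloud-sized model feasible on a tablet with fast flash storage.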
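For the low-bandwidth part of the demo above, the network was throttled to 10 kilobits per second. How that was done isn't stated in the interview; one common way to reproduce the condition on a Linux host is the tc traffic-control tool, sketched here from Python. The interface name and burst/latency parameters are assumptions.

```python
import subprocess

def throttle(interface: str = "eth0", rate: str = "10kbit") -> None:
    # Token-bucket filter: cap the interface's outgoing bandwidth.
    subprocess.run(
        ["tc", "qdisc", "add", "dev", interface, "root",
         "tbf", "rate", rate, "burst", "1540", "latency", "400ms"],
        check=True,
    )

def unthrottle(interface: str = "eth0") -> None:
    # Remove the shaping qdisc to restore full bandwidth.
    subprocess.run(
        ["tc", "qdisc", "del", "dev", interface, "root"],
        check=True,
    )
```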
So when Google does it in the cloud, they just use one Xeon processor over there?

They use one Xeon processor plus an extra thousand Xeons for doing...

To do what?

For doing language model rescoring. They do a two-pass process where they first try to recognize what you're saying and generate a number of hypotheses, and then, given the amount of data they have, they rescore all of those to give the final result.

So how about combining? You're going to combine them, right?

Yes, and I think everyone will. Automotive is intriguing because, similar to mobile, you typically have one or maybe a couple of users, and you want a solution in the car that's responsive. Responsiveness and accuracy are the key issues, and if it's only 20,000 or 30,000 words, but they are your words, the words that you use, then that's what you want to have. Typically today we have a small system in the car and a large system with many millions of words in the cloud, but no one has quite implemented a learning system that optimizes the system in the car for the vocabulary that you use. That's something we're working on.

So basically it listens to what you say, and it keeps only the words that you like to use?

Exactly.

And the other ones, you can keep in the cloud.

You can keep them in the cloud. Once in a while the system might say, you don't usually say this, I need to fall back to the cloud. But it doesn't need to do that often, because humans are intriguing: they say the same thing again and again and again. The first time you might need to fall back, but after that you have it there locally.

Nice. So it's quite awesome to be able to do all this on the device, offline, and combine it with the cloud. It sounds crazy that you can do it offline.

I think it's going to change how we interact with machines in the near future, hopefully moving away from single query-and-answer toward true conversational interaction.

Accents?

Okay. Accents.

My French accent.

It comes down to data. It's just more data.

More data. So accents need gigabytes more?

Gigabytes more, or something like that. If you have 2,000 speakers from around the US, you can probably do quite well. If you have a million users, then you do really well. Similar to image processing, speech and natural language understanding today are machine learning; it's all driven by data. So if you have systems that people use, you can learn from them and get better and better over time.

You were showing the slide before about the TK1 and ARM.

This is showing, on the TK1 platform, which has a four-core ARM-based processor and a Kepler GPU, speech recognition across different systems. Here we're increasing the vocabulary from 50,000 words to 500,000 words to a million words, and showing the performance in terms of speed, or real-time factor, with just a single ARM core versus the TK1 GPU.

So it doesn't take more time to...

On the TK1 we are six to 13 times faster, and most critically, we're always faster than real time. As fast as you can speak, we can do speech recognition.

And how about the X1?

The X1 we haven't played with yet.

That's going to be double?

It should be double, if not a little more. Some of the newer architectures are really optimized for neural network processing, and we expect perhaps even a three to four times speedup.

This is awesome stuff. So there's speech here, there's images over there. This is kind of awesome. But we need to see this in products right now.

Yes, everyone agrees.
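The two-pass process Lane describes at the top of this exchange is straightforward to sketch: a fast first pass produces an n-best list, and a much larger language model rescores it. In this minimal Python sketch, first_pass and big_lm_score are hypothetical stand-ins for the two models, and the interpolation weight alpha is an invented parameter.

```python
def recognize_two_pass(audio, first_pass, big_lm_score,
                       n_best=100, alpha=0.7):
    # Pass 1: a fast decoder returns (text, acoustic_score) pairs.
    hypotheses = first_pass(audio, n_best=n_best)
    # Pass 2: rescore each hypothesis with the large language model
    # and keep the best combined score; alpha weights the two models.
    best_text, _ = max(
        hypotheses,
        key=lambda h: alpha * h[1] + (1 - alpha) * big_lm_score(h[0]),
    )
    return best_text
```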
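The learning system sketched in the conversation, a small on-device vocabulary of the user's own words with a cloud fallback, could look something like the following. This is a speculative sketch of the idea, not the system under development; the vocabulary size and out-of-vocabulary threshold are invented parameters.

```python
from collections import Counter

class PersonalVocabulary:
    """Track the words a user actually says and decide when a query
    falls outside the local vocabulary and should go to the cloud."""

    def __init__(self, max_words: int = 30_000):
        self.counts = Counter()
        self.max_words = max_words

    def update(self, transcript: str) -> None:
        # Learn from every final transcript the user produces.
        self.counts.update(transcript.lower().split())

    def should_use_cloud(self, hypothesis: str,
                         oov_threshold: float = 0.2) -> bool:
        # Fall back when too many words lie outside the local vocabulary.
        words = hypothesis.lower().split()
        if not words:
            return False
        vocab = {w for w, _ in self.counts.most_common(self.max_words)}
        oov = sum(w not in vocab for w in words)
        return oov / len(words) > oov_threshold
```

As in the conversation, the fallback becomes rarer over time: because people repeat themselves, the counts converge on the words they actually use, and most queries stay on the device.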
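The real-time factor quoted in the TK1 benchmark discussion is simply decoding time divided by audio duration, so "always faster than real time" means an RTF below 1.0. A minimal sketch of the measurement, where decode_fn and audio are placeholders:

```python
import time

def real_time_factor(decode_fn, audio, audio_seconds: float) -> float:
    """RTF = decoding time / audio duration.
    RTF < 1.0 means the recognizer keeps up with live speech."""
    start = time.perf_counter()
    decode_fn(audio)
    return (time.perf_counter() - start) / audio_seconds
```

An RTF of 0.5, for example, means ten seconds of speech decode in five seconds; the six to 13 times figures above are the ratio between the single ARM core's RTF and the GPU's.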