OK, that's all the announcements. I would like to introduce Norman Casagrande from DeepMind, who will talk about WaveNet, the technology behind Google's voice.

Hello, everybody, and welcome to my presentation. I would like to start with a little bit of a teaser, all right? So this is a recording. Well, I'll play it first, and you be the judge. "This is boring," Dudley moaned. He shuffled away. Harry moved in front of the tank and looked intently at the snake. I'll play it again, just because it didn't catch the first part. "This is boring," Dudley moaned. He shuffled away. Harry moved in front of the tank and looked intently at the snake.

So this is a voice which has been entirely generated by a computer. There was no recording of a person saying that sentence. There weren't even parts of a recording that were stitched together to produce it. This is a voice entirely generated by a technique called deep learning. Let's have a look at what's behind this technology and what drives it.

OK, well, first of all, a little bit of history of speech synthesis. It has a long, long history. We can go all the way back to the Middle Ages, when people built these brazen heads, which made some sort of sound that was reminiscent of human speech. This picture basically shows how one could be used to scare people. The desire to build machines that can mimic the human voice goes a long, long way back.

In the 20th century, we started using slightly more advanced technology. This is a schematic of what was called the Voder, a sort of piano-like instrument which could be used to generate spoken audio. It was only mildly intelligible, and it wasn't really that useful at the time, beyond entertaining people and making them feel like this machine was spookily generating speech. More recently, we had voices which became very famous, like the one Hawking used until his death.

Now, from a software perspective, there have traditionally been two main approaches. One is called unit selection. Basically, the idea is that you have somebody sitting in a studio recording a long, long list of sentences. It can be very boring; we're talking about tens and tens of hours, all the way up to 100 or 200 hours. Then you take software, slice those recordings into bits, and try to stitch them back together. It does sound natural, especially for the parts that were in the original recording, but if you try to say something that is not part of the original database, it can sound a little bit glitchy. I have an example here. The avocado is a pear-shaped fruit with leathery skin, smooth edible flesh, and a large stone. You can really hear that where it's saying something like a whole word, it sounds good, but in between words, or sometimes in between parts of a word that weren't in the database, it sounds really, really bad.

An alternative, which was also quite popular, is called parametric. The idea is that you try to simulate the vocal tract of a human being with a formula, essentially. You then have a set of parameters with which you manipulate this formula and generate audio, and you can also try to adapt those parameters based on speech for which you have recordings.
Now, this works when you want to generate pretty much anything, but it sounds quite unnatural. I have another example here. The avocado is a pear-shaped fruit with leathery skin, smooth edible flesh, and a large stone. It might not sound too bad, but I assure you that if you listen with headphones, it sounds really, really bad. And in particular, if you don't have a lot of recordings, a parametric system doesn't sound good at all.

Only recently has deep learning come to prominence as a technique that takes, to a certain extent, the best of both worlds, because we learn a system end to end to generate the audio from the samples themselves. Here is the same sentence generated with WaveNet, and we're going to look at what WaveNet is. The avocado is a pear-shaped fruit with leathery skin, smooth edible flesh, and a large stone. All right, then the next one. The avocado is a pear-shaped fruit with leathery skin, smooth edible flesh, and a large stone. So it's way more natural.

Now, what is WaveNet? Well, the actual definition of WaveNet is an autoregressive neural network with dilated convolutions, which can be summarized as: magic. Which is basically what deep learning is most of the time. I try to keep the memes to a minimum. But anyway, this is EMF. So we want to dive in, open the hood, and look at what exactly "autoregressive neural network with dilated convolutions" means.

So, first of all, the neural network part. I'm not going to go too much into detail about what a neural network is, because I only have half an hour, and there are lots and lots of really good tutorials out on the web. But in a nutshell, you have a bunch of inputs. Let me take the pointer here. So you have a bunch of inputs, and you make them go through what we call the hidden layers, which basically means you take the input, multiply it by a bunch of matrices, which are the parameters that you need to adjust, apply some non-linearities, pass it on to the next layer, and so on, all the way up to the top. At the top is what you're trying to predict. And depending on how good your prediction is, you adjust your weights, which are these hidden parts here, so that the next time you run the same example, you're actually closer to the target output.

Now, what are those outputs for WaveNet? Well, those outputs are the samples. This is really part of what made WaveNet so revolutionary: we're trying to model the samples themselves. The samples are the values that make up the raw representation of the audio, the waveform. I have an animation here that shows that, for good quality audio, there are about 24,000 of them per second. So you need a model that is able to predict 24,000 of those tiny little samples per second, in a way that makes the result sound like natural audio.
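To make the "inputs, hidden layers, outputs" description above a little more concrete, here is a minimal toy sketch in Python/NumPy of a network that takes a window of past audio samples and predicts a distribution over the 256 possible values of the next sample, which is roughly the shape of the problem WaveNet solves. All of the sizes and names here are illustrative assumptions, not the real model.

```python
import numpy as np

rng = np.random.default_rng(0)

WINDOW = 64          # how many past samples this toy network looks at (illustrative)
HIDDEN = 128         # size of the hidden layer (illustrative)
QUANT_LEVELS = 256   # WaveNet quantises each audio sample into 256 possible values

# The "parameters that you need to adjust": two weight matrices for a tiny network.
W1 = rng.normal(scale=0.01, size=(HIDDEN, WINDOW))
W2 = rng.normal(scale=0.01, size=(QUANT_LEVELS, HIDDEN))

def predict_next_sample(past_samples):
    """past_samples: the last WINDOW audio samples, values in [-1, 1].
    Returns a probability distribution over the 256 possible next values."""
    h = np.tanh(W1 @ past_samples)        # hidden layer: matrix multiply + non-linearity
    logits = W2 @ h                       # output layer: one score per quantisation level
    p = np.exp(logits - logits.max())     # softmax turns scores into probabilities
    return p / p.sum()

probs = predict_next_sample(np.zeros(WINDOW))   # e.g. starting from silence
print(probs.shape, probs.sum())                 # (256,) 1.0
```

Training would then adjust W1 and W2 so that the probability assigned to the true next sample in the recordings goes up, which is the "adjust your weights" step described above.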
So, how does it work? Well, we've seen the neural network part. The next bit is the autoregressive part, because we actually proceed one sample at a time. We start with the one at the very beginning here, at the bottom. When you're training, you have the ground truth, the original audio. When you're not training, you might want to start with random values or zeros, depending on your setup.

Anyway, you make it go through the hidden layers, the multiplications and so forth, and then you get the output. Now, as you remember, I said that in neural networks you have the backpropagation stage where you adjust the weights, but long story short, you get a number at the end, which is your sample at that time step. Then you pick up that sample and use it as the next input. And again you go through the network, which, by the way, is the same network, just to be clear: in this case those two samples are different, but the network in the middle is the same. So you repeat, you repeat, you repeat, and you end up generating a lot of those samples.

Now, the problem is that you need a lot of them, right? Previously I showed you that there are about 24,000 of them per second, and generally, when you want to say something, it's more than a second, it's five seconds or something like that. So an additional critical contribution of WaveNet is this concept of a dilated convolution. Instead of just taking your input, sticking it into the hidden layer and having it go all the way up to the output in a linear fashion, you also pick up the one that comes before, at least for the first layer. Then for the layer above, you have this concept of dilation, which means it picks up the output of the previous layer at an offset in time: a factor of two here, then a factor of four, then a factor of eight, and so forth. Because those hidden layers are, to a certain extent, a summary of the state of the knowledge at that point, the sample that you're generating at the end actually encapsulates knowledge that goes not just one step back, but all the way back by a power of two, depending on the number of layers that you have in between.

OK, so in summary, you've got this fancy animation, which was created when the blog post about WaveNet came out, and it shows the whole process. You see here, you start with the input, it goes up, it generates the output, you use it again, and so forth, and so forth.
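Continuing the same toy setup, here is a rough sketch of the two ideas just described: the receptive-field arithmetic for a stack of dilated causal convolutions (kernel size 2, dilations 1, 2, 4, 8, ...), and the one-sample-at-a-time autoregressive loop that feeds each generated sample back in as input. The predictor here is a deliberately dumb placeholder, so this only illustrates the mechanics, not the real WaveNet.

```python
import numpy as np

def receptive_field(num_layers, kernel_size=2):
    """How far back the top of the stack can 'see' when layer i uses dilation 2**i."""
    field = 1
    for i in range(num_layers):
        field += (kernel_size - 1) * (2 ** i)   # each layer adds a doubling jump into the past
    return field

print(receptive_field(6), receptive_field(10))  # 64 and 1024 samples of context
# (The published model repeats the dilation pattern 1, 2, ..., 512 in several stacks,
#  reaching a receptive field of a few hundred milliseconds of audio.)

QUANT_LEVELS = 256
rng = np.random.default_rng(1)

def predict_next_sample(past_samples):
    """Placeholder for the trained network: any function mapping context -> distribution."""
    p = np.ones(QUANT_LEVELS)                    # uniform, just to keep the sketch runnable
    return p / p.sum()

def generate(n_samples, window=receptive_field(6)):
    """The autoregressive loop: predict one sample, append it, feed it back in."""
    audio = np.zeros(window)                             # start from silence (or noise)
    for _ in range(n_samples):
        probs = predict_next_sample(audio[-window:])     # condition on what was just generated
        level = rng.choice(QUANT_LEVELS, p=probs)        # pick one of the 256 levels
        audio = np.append(audio, level / (QUANT_LEVELS - 1) * 2 - 1)  # map back to [-1, 1]
    return audio[window:]

clip = generate(24000)   # 24,000 strictly sequential steps for one second of audio
```

The last line is the crux of the speed problem discussed later: every one of those 24,000 steps has to wait for the previous one.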
Now, the thing is that until now we've been talking about samples, right? And I hinted at the fact that when you're not training and you actually want to generate audio, you might start at the very beginning with silence or a random number. If you trained on a lot of spoken audio, what will happen is that the network will learn to say something, but just anything, nothing in particular, something that kind of sounds like a voice but doesn't mean anything. And I have a bunch of examples here that are quite eerie. [plays a babble sample] So this is a WaveNet that has been trained on tons and tons of spoken audio, all right? Then you feed it a random number at the very beginning, run it through, and just tell it to generate sample after sample after sample, and you get something that sounds like Swedish. [plays another babble sample] Here's another example. [plays another babble sample] It's gibberish, but it's gibberish that is very believable with respect to what a human being would say, or the way a human being would pronounce things.

So, how do we tell WaveNet what to say? [plays the babble again] Not this, I don't want it to say that. Well, there is this other part, which we call the conditioning stack. In this conditioning stack, you have at the very beginning what we call linguistic features, and we stick those in as inputs to the network. These linguistic features are usually things like phonemes and intonation, so we want to tell the network, look, here you have to put a stress; and punctuation, question marks, full stops, so the network learns that when there is a full stop, it has to pause for a moment, and so forth. And importantly, every language obviously has its own, and for that matter, every voice has its own. I would challenge anybody to try and replicate what David Attenborough is doing; he has his own unique way of saying things, right? So you need a set of those linguistic features that you want to match for a specific speaker and for a specific language.

Now, anybody who knows anything about neural networks will understand that this is not as easy as the previous case, because for a sample you have a number, whereas here we have phonemes, so you have to map those phonemes somehow. Long story short, there are deep learning techniques for that too, and we can discuss them later on, but anyway, they get mapped into neural network stuff, OK? Then they're squeezed and rearranged, et cetera, so that they fit the existing hidden layers. They're literally added to the hidden layers. So you can imagine that at this stage, this mishmash of linguistic features ends up being a vector of numbers that gets summed with the vector represented by those hidden layers.

And this is the WaveNet magic. It's really, really cool. It made a huge splash, and nobody was really expecting something like that. Oh, yes, and the results speak for themselves. I think we've already had a bunch of examples, but maybe I'll play just one more to give you a sense. So, this is... A single WaveNet can capture the characteristics of many different speakers with equal fidelity. That was unit selection, stitching things together. This is WaveNet: A single WaveNet can capture the characteristics of many different speakers with equal fidelity. All right, and then the parametric version: A single WaveNet can capture the characteristics of many different speakers with equal fidelity.
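Going back to the conditioning stack for a moment, here is a minimal sketch of the idea that the linguistic features are turned into vectors and literally added to the hidden layers. The phoneme IDs, sizes, and the single addition point are all illustrative assumptions; in the real system the features are richer and are injected throughout the network.

```python
import numpy as np

rng = np.random.default_rng(0)
NUM_PHONEMES = 50     # assumed size of the phoneme inventory (illustrative)
HIDDEN = 128          # assumed size of the hidden activations (illustrative)

# Each phoneme gets a learned vector; this table is trained along with the rest.
phoneme_embedding = rng.normal(scale=0.1, size=(NUM_PHONEMES, HIDDEN))

def condition(hidden_activations, phoneme_ids):
    """hidden_activations: (T, HIDDEN) audio-side features, one row per timestep.
    phoneme_ids: (T,) which phoneme is being spoken at each timestep
    (in practice the linguistic features are upsampled to audio rate first)."""
    cond = phoneme_embedding[phoneme_ids]      # (T, HIDDEN) conditioning vectors
    return hidden_activations + cond           # literally summed into the hidden layers

T = 100
h = rng.normal(size=(T, HIDDEN))
ids = rng.integers(0, NUM_PHONEMES, size=T)
h_conditioned = condition(h, ids)              # same shape, now "knows" what to say
```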
All right, now, this was all well and good. In 2016, however, WaveNet was unfortunately a little bit slow. Something I didn't mention is that this process of going back and forth, back and forth, through all of those hidden layers involves a lot of matrix multiplications. So it was great, but it was really, really, really slow. However, people recognized it was an extremely exciting piece of technology, so there was a huge push to make it real time, so that we could use it to power the Google Assistant and Google Home.

So in 2017, we came out with a different piece of technology which sits on top of WaveNet. Just to give you a reference point, the original WaveNet was taking about one second to generate 0.02 seconds of audio. Bear that in mind for a moment. Oh, and by the way, around late August Apple was saying this is really cool technology, but it's not feasible, and then we launched it a couple of months later, which was pretty sweet if you ask me.

So anyway, how does it work? Unfortunately, I don't have a lot of time to go into details, because we're also going to talk about version three. But in a nutshell, remember this is the autoregressive part. Well, there is a technique in machine learning called distillation, and it's kind of magical to a certain extent, because it means that you first train this model on a bunch of data, and then you have a model which is pretty good, but really, really slow. What you can do is take another model and train it on the output of the first model. In a way, it's like two computers talking to each other, one teaching the other "you should do this, you should do that", without any oversight from human beings or any additional data. And this is what the feedforward model is about. The funny thing is that the actual feedforward model starts with noise, goes through a bunch of dilation layers, et cetera, and then outputs the result in one go, instead of going through that sample-by-sample, back-and-forth process. Unfortunately, I don't have a lot of time to discuss this; it's quite interesting, and there are a bunch of papers.

But anyway, more interesting, I think: whereas it used to take one second to generate 0.02 seconds of audio, this thing can produce 20 seconds of audio in just one second. So it's really, really, really fast. This was 2017, and it's actually what is currently in production within Google. The only problem is that it's a bit complicated: it takes some time to train the first model, the one that teaches, and you have to train it with a lot of data from the speaker. Then you have to take the other model, the student, and train it on the teacher model. So it's kind of a complicated process.
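Here is a very rough sketch of the teacher-student idea just described, with toy linear models standing in for the two WaveNets: the slow "teacher" produces a distribution, and the fast "student" is nudged to match it, with no extra human-labelled data involved. The real parallel WaveNet training is considerably more sophisticated than this; this only shows the basic mechanic of one model learning from another's outputs.

```python
import numpy as np

rng = np.random.default_rng(0)
DIM_IN, DIM_OUT = 32, 256

W_teacher = rng.normal(scale=0.1, size=(DIM_OUT, DIM_IN))  # pretend this was trained on audio
W_student = np.zeros((DIM_OUT, DIM_IN))                    # fast model, starts untrained

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

lr = 0.1
for step in range(2000):
    x = rng.normal(size=DIM_IN)            # e.g. a noise input, as in the feedforward model
    p_teacher = softmax(W_teacher @ x)     # what the slow model would say
    p_student = softmax(W_student @ x)
    # gradient of the cross-entropy between the student and the teacher distributions
    grad = np.outer(p_student - p_teacher, x)
    W_student -= lr * grad                 # nudge the student towards the teacher

# After training, only the cheap student is needed to generate.
```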
So in 2018, we advanced the research further: it was slow and complicated, so we wanted to simplify it and make it even faster. Remember, this was the original schema of WaveNet. Well, this part is gone, and it has been replaced by this bit, which is a GRU cell. Now bear with me for a second; we're not going to go too much into detail about this. Just so that you know, this is a quite common technique in machine learning for recurrent processes, where you have sequences.

But just to give you an idea, the input goes in here, goes through a bunch of mathematical operations that basically decide what to remember and what not to remember, et cetera, and then it goes out there. There is also a bunch of stuff in the connections between those cells, so they can talk to each other. The beauty of this new system is that it uses far fewer parameters in this area, the cells can talk to each other across timesteps, and it doesn't need dilation, not in this part at least. And, oh yeah, by the way, the conditioning goes somewhere here; this is very scientific notation, by the way. So it's a very compact model.

On top of that, it can also be sparsified. That's another technique; I won't go too much into detail, because this is really a Pandora's box and I could talk for hours about it. But in a nutshell, the idea is that the parameters the model adjusts during the training process are just a bunch of values; they're called weights. And not all of them are really that important for actually generating the output we care about. During training, you can decide to say: OK, those weights that are not that large, that are not really contributing much to the result, just zero them out. Then you keep training for a while, you get another bunch of weights that are also tiny, and you say, I don't care too much about those either. Because this is a training process, the model has to cope with that situation, and eventually you end up with a model that is extremely sparse. Most of the weights are zeros, which means that after you train the model you can just ignore those zeros, because they don't contribute to the process, to the point where you can even run this thing on a device, on a device like a phone. (A small sketch of this pruning idea follows at the end of this section.)

Now, what about the future? Well, we want to make it even faster, obviously. We've got a bunch of ideas; I can't go too much into detail. But I think even more important than that is making it train faster. It is true that this third model I was talking about doesn't need the teacher-student approach, so you can train it in one go and then deploy it as it is, but it still needs a lot of data. However, there are different techniques we are looking into so that, instead of using tens or hundreds of hours of recordings, we can maybe train it on 10 minutes or so. And we want to bring it to the whole world, because right now, since we're somewhat data constrained, the Assistant is only available in English, German, French, Japanese, and a bunch of other languages. But we would really like to bring it to the entire world, to all different types of languages.

We also have a bunch of additional ideas for architectures, which I'm pretty sure would be really cool. But I think more important than that is that you could use this technology for things outside of text-to-speech, because WaveNet can replicate signals really well. And I have an example here. This is WaveNet trained on music rather than voice, and this is what it generates. Now, this is like the babbling that we were hearing before: there is no conditioning, we're not telling WaveNet to play a specific set of notes. We just tell WaveNet, look, you trained on music, now figure something out, start from random. And this is what it generates. And here is another example. I think it's really cool.
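As promised, here is a minimal sketch of the magnitude-based sparsification described above: during training, the smallest weights are zeroed out, the model keeps training and learns to cope, and the threshold is ramped up until most weights are zero. The schedule and sizes here are illustrative; the production recipe is more gradual and interleaved with real training steps.

```python
import numpy as np

def sparsify(weights, target_sparsity):
    """Zero out the smallest-magnitude weights so that `target_sparsity` of them are zero."""
    magnitudes = np.abs(weights).ravel()
    k = int(target_sparsity * magnitudes.size)
    if k == 0:
        return weights
    threshold = np.partition(magnitudes, k - 1)[k - 1]   # k-th smallest magnitude
    return weights * (np.abs(weights) > threshold)       # keep only the weights above it

rng = np.random.default_rng(0)
W = rng.normal(size=(256, 256))          # one layer's worth of weights

for sparsity in (0.5, 0.9, 0.95):        # ramp the sparsity up over the course of training
    # ... in a real run you would do many normal training steps here ...
    W = sparsify(W, sparsity)
    print(f"target {sparsity:.2f} -> fraction of zeros {np.mean(W == 0):.2f}")

# With ~95% zeros, the zero weights can simply be skipped at inference time,
# which is what makes the model small and fast enough to run on a phone.
```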
I don't know if you're... Anyway, that's the end of my presentation. Whoops. If you have any questions, do we have time for questions? Ah! Well...