I'm going to talk about WaveNet. If you've been here before you probably know something about me: I'm originally English, and I moved from New York City in September 2013. 2014 was kind of a sabbatical year when I just focused on machine learning, deep learning and natural language processing, and did some robots and drones. But since 2015 I've been in serious mode, doing natural language processing and deep learning, and I've managed to write a few papers along the way. Recently Sam and I went to NIPS, which was fantastic, and we've also been teaching a developer course. So things are beginning to turn into something now.

The outline: I'm going to talk a little bit about WaveNet version one and what it's made of, then some enhancements. First there was Fast WaveNet, which is a necessary optimisation, but then we get to Parallel WaveNet, which is a whole new ball game, and that is what now ships on people's Google Homes in America and in Japan. Why is it exciting, and how did it work?

WaveNet version one came from DeepMind; Google acquired DeepMind a while ago, and this is one concrete product that came out of it. In the splash blog post they announced the paper, and the post also has some nice clickable audio examples. I'll put links up on the meetup page, so you can click through to the web page. The key thing, the reason it's exciting, is this: if this is human speech with this mean opinion score, and these two are the previous ways of doing it, then WaveNet is suddenly a whole lot closer to being human-like. It was dramatically better.

So this was the blog post, and in it they explain that they are actually predicting raw audio samples. That's the clever thing, and they have nice little diagrams. Let me just play some audio; hopefully this is going to come out of my speakers. "The Blue Lagoon is a 1980 American romance and adventure film directed by Randal Kleiser." Can anyone hear that? "Aspects of the Sublime in English Poetry and Painting, 1770-1850." Is it coming out at all? How about this one? "Aspects of the Sublime in English Poetry and Painting, 1770-1850." Try again. "Aspects of the Sublime in English Poetry and Painting, 1770-1850."

The way the first one is done is with a parametric speech synthesiser: something where they're moving the synthesiser's parameters around very quickly, trying to make it output speech. What happens is it tends to feel a bit sing-song-y, a bit weird, a bit lumpy. "Aspects of the Sublime in English Poetry and Painting, 1770-1850." The next method, which is commonly used (Apple uses it, say), is concatenative. For this, they know what phonemes you're going to be producing, they pick out of a large corpus the closest match to each phoneme, output that wave signal, and blend them all together. So basically this is taking, from a corpus of speech, the closest thing to what you're trying to say and just hitting the speaker with it. "Aspects of the Sublime in English Poetry and Painting, 1770-1850." This used to be the winner in the quality war, just because each individual segment is pretty high quality.
But the problem is, if you listen to it carefully (and now I'll tell you what to listen for), you'll hear the pitch going down in the wrong places: the melody is all wrong, because each piece is taken from a completely different part of the corpus, so it jumps around all over the place. "Aspects of the Sublime in English Poetry and Painting, 1770-1850." So it's got the individual sounds right but the song of English wrong, whereas the parametric approach gets the song of English correct but the sounds are a bit off. "Aspects of the Sublime in English Poetry and Painting, 1770-1850."

So along comes WaveNet. "Aspects of the Sublime in English Poetry and Painting, 1770-1850." Can people hear that that's markedly better? Well, that's the Google demo. So suddenly DeepMind comes along and they've done something really pretty good. If we look at Mandarin Chinese, which I know nothing about, you might be able to hear that the old one sounds more like a Casio version of someone; that's very deliberate. Then WaveNet. I'm not one to judge, but maybe that's quite good; I don't know. The other thing they can do is let the network loose and have it try to say stuff without being driven by text at all. That gives you a model of human speech which is, clearly... it's kind of disturbing. Basically WaveNet has learned both how to form the individual sounds and how to join them together in a language-like way. They can also do multiple speakers just by changing some conditioning parameters of the network, and they can play games with music, which we have no time for. So this is why this is exciting: it's markedly different from what came before.

I'll explain the key elements one by one. One important thing is that they're producing audio samples directly as the output of a neural network, which is kind of remarkable. In the past people have been emitting words, or maybe changing pictures a bit, but here they're predicting individual samples at 16 kHz, which is kind of crazy, and they're doing it at 8-bit resolution; I'll say more about how that works shortly. One of the surprising things is that a normal recurrent neural network, which is how you would normally think of producing or analysing a time series, maxes out at around 50 steps of look-back, but these waveforms span thousands of steps in time. So how are they analysing this? That's one of the remarkable things.

Since you can't use standard recurrent neural networks, they used convolutional neural networks, which are familiar from computer vision. One problem with those is that if I have a network four layers deep, I only go four time steps back in time; to go a thousand steps back I'd need a thousand layers. The look-back has a linear footprint, and that's a problem. So one innovation here is dilated CNNs: at the bottom you have a one-step convolution, but the next layer up has a skip of two, the one above that a skip of four, then a skip of eight. If you join these up, you get an exponentially long span of time that you can reference.
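To make the arithmetic concrete, here's a minimal sketch (layer and filter counts are my own choices, not DeepMind's configuration) of a stack of dilated causal convolutions in Keras, where ten layers give a receptive field of about a thousand samples:

```python
import tensorflow as tf

def dilated_stack(n_layers=10, filters=32, kernel_size=2):
    # Raw audio in, one feature map out; the dilation doubles at every layer
    inputs = tf.keras.layers.Input(shape=(None, 1))
    x = inputs
    for i in range(n_layers):
        # dilation 1, 2, 4, ..., 512: each layer reaches twice as far back
        x = tf.keras.layers.Conv1D(filters, kernel_size,
                                   dilation_rate=2 ** i,
                                   padding='causal',
                                   activation='relu')(x)
    return tf.keras.Model(inputs, x)

# Receptive field: 1 + (kernel_size - 1) * (1 + 2 + ... + 512) samples
print(1 + sum((2 - 1) * 2 ** i for i in range(10)))  # -> 1024
```

With a kernel of two and dilations 1, 2, 4, ..., 512, the receptive field is 1024 samples, versus just 11 samples for ten undilated layers of the same size.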
Clearly you're not referencing every sample in that span, only bits of the past, but at least you have some way of looking much further back: with a ten-layer stack you can go back about a thousand steps.

With CNNs you've got pros and cons. The advantage is that you get this very long look-back and you can train it faster. All you need is the time steps of a long piece of speech as input; what I want to predict is the next sample at every position, and I know what that is just by using the same sequence shifted by one. That lets me align input to output for every single time step, in just one training example, across thousands of positions. There's a kind of time invariance going on: it doesn't matter at which second you speak a particular word, it's all the same, but within that second you need to capture the whole variation. The disadvantage is generation: how do you know what the next sample is? Suppose I've predicted a sample at time one; I now need to carry that sample back to the bottom of my CNN and evaluate this massive thing thousands upon thousands of times to get a sequence of speech. Whereas with an RNN I could just go one sample to the next, to the next, to the next.

Another thing they do: each of those dots with the gathering mechanism is not just a simple multiply-and-add. They divide the convolution into two pieces, apply a tanh on one side and a sigmoid on the other, multiply them together, do a one-by-one convolution, and then add the result back to what you first thought of, as a residual connection. That's what one of those dots is. This extra complexity, this gated unit, clearly adds something; otherwise they wouldn't have put it in, because I don't think they want to just burn money (though the outcome is that they do burn money). They also have side chains: each of those individual circles also feeds into a chain of essentially one-by-one convolutional layers leading to a softmax, which is the output. So the output is fed out sideways: you've got this vertical breakdown of time, and sideways is the sum over all those skip connections for each sample.

The other thing is that they're not producing a single number at each time step: they produce a whole distribution over what the next sample should be. Instead of a network trying to track the exact number (you could do a regression: here is the perfect value to come out with), it's producing something like a scatter diagram of guesses about where the number should be, and clearly they found this is the better way to do it. It does seem crazy that you're doing 256 times as much work as if you were just producing a single number, but the results prove it was worth doing.
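Where do those 256 classes come from? The paper gets its 8-bit resolution by mu-law companding the audio before quantising, so quiet sounds get finer resolution than loud ones. Here's a rough numpy sketch of that encode/decode step (the formulas are the standard mu-law ones; the exact scaling details are my reading of them):

```python
import numpy as np

def mu_law_encode(audio, mu=255):
    # audio in [-1, 1]: compress amplitudes logarithmically, then
    # map [-1, 1] onto the integer classes 0..255
    compressed = np.sign(audio) * np.log1p(mu * np.abs(audio)) / np.log1p(mu)
    return ((compressed + 1) / 2 * mu + 0.5).astype(np.int32)

def mu_law_decode(classes, mu=255):
    # invert: class index back to a waveform sample in [-1, 1]
    compressed = 2 * (classes.astype(np.float32) / mu) - 1
    return np.sign(compressed) * np.expm1(np.abs(compressed) * np.log1p(mu)) / mu

quiet, loud = mu_law_encode(np.array([0.01, 0.5]))
print(quiet, loud)  # quiet sounds use many more of the 256 levels than loud ones
```

The network then emits a 256-way softmax over these classes at every time step, which is the "scatter of guesses" rather than a single regressed value.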
As I mentioned before, there's a computational burden. Training is quick: at every time step you know what the next sample should be, so you can do them all simultaneously, in parallel. But actually running this to generate output is extremely slow. One second of audio output, using DeepMind's resources, was I think one or two minutes of GPU time. That's why all the samples they have are really short: it took them hours and hours to produce these things. This is clearly impractical; in particular, if I ask my Google Home what the weather is like, it can't think for several hours before outputting the answer. Anyway, they carried forwards. They also kept themselves busy with AlphaGo, so they had stuff to do.

OK, so that's WaveNet version one. Now let's talk a bit about the implementations. There's one in TensorFlow by ibab, which most people regard as the reference implementation. Surprisingly there aren't hundreds of implementations of this, but I guess it's so time-consuming to train: people talk about getting sample results after a week on their Titan X. So this thing takes a long time to do something that's half OK; DeepMind clearly had a lot of processing power to play with.

There's also a project by Google called Magenta, which is really cool, and I can give you a quick aside on it. They actually have code on GitHub because they've implemented WaveNet themselves, so from what I can see this is the most official Google implementation. Let me just show you. This is the Magenta project: they've got a whole bunch of interesting games to play. People were doing Quick, Draw!; if you haven't seen it, it's super fun to play: you draw a little bit and then it comes up with more drawings. Or they play this music. OK, so this is random Chopin: they've trained an LSTM, just like producing the works of Shakespeare by reading the works of Shakespeare, but for expressive piano playing. It seems this is a place in Google where people have fun. They've also come up with NSynth, which does neural audio synthesis and includes a WaveNet decoder. On the other hand, when you play those samples, be prepared to be disappointed compared to the DeepMind voices. Anyway, you may find it fun.

OK, now it's time to show the actual Singapore implementation of this thing. I have to preface this by saying it doesn't train anything very well; I just wanted to get the code out there in a working state so that it does something. The hotness here (so there's at least some TensorFlow takeaway) is that it uses the new Dataset API, which I'm sure the Googlers will have wittered on about in December, and it produces TFRecords and streams them from disk. The TensorFlow examples just read everything from memory, which is frustrating; I see some nodding heads. They just have MNIST read from memory, and you're like, well, the whole point of this is to read it from disk, because I want to do it asynchronously. Another thing is that this is a Keras model which I've exported to an Estimator, and the Estimator route then allows you to run it on TPUs, or on mobile, or all sorts of other interesting places. And it's all in one notebook which works end to end. Let's have a look.
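Here's a minimal sketch of that pattern; the feature names, shapes and filenames are stand-ins I've made up, not the ones in my notebook, and this uses the TF 1.x-era API:

```python
import tensorflow as tf

def write_example(writer, mel, spectrum):
    # Serialise one (mel, spectrum) pair of numpy arrays into a tf.train.Example;
    # `writer` is a tf.python_io.TFRecordWriter opened on the output file
    feature = {
        'mel': tf.train.Feature(float_list=tf.train.FloatList(value=mel.ravel())),
        'spectrum': tf.train.Feature(float_list=tf.train.FloatList(value=spectrum.ravel())),
    }
    writer.write(tf.train.Example(
        features=tf.train.Features(feature=feature)).SerializeToString())

def parse_example(serialized, frames=100, mel_bins=80, fft_bins=513):
    # Invert the serialisation: fixed-length float lists back into 2-D arrays
    parsed = tf.parse_single_example(serialized, {
        'mel': tf.FixedLenFeature([frames * mel_bins], tf.float32),
        'spectrum': tf.FixedLenFeature([frames * fft_bins], tf.float32),
    })
    mel = tf.reshape(parsed['mel'], [frames, mel_bins])
    spectrum = tf.reshape(parsed['spectrum'], [frames, fft_bins])
    return mel, spectrum

# Stream from disk: parse, shuffle, batch, repeat, all asynchronously
dataset = (tf.data.TFRecordDataset(['speech.tfrecords'])
           .map(parse_example)
           .shuffle(1000)
           .batch(8)
           .repeat())
```

Because the records are already preprocessed, the map step is just parsing and reshaping; the shuffle/batch/repeat chain then streams off disk while the model trains.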
I don't think I want to execute this live, but the flow is this. The task I'm trying to do is take speech, reduce it to its mel spectrogram (which is like a compressed spectrogram), and then expand it back out into good speech again, which is roughly what WaveNet is doing, but I just wanted to have my own little go. I've got some LibriVox books which are quite well spoken, and I read those in as mp3. So here are some FFT things: I read in the samples, do an FFT, and then start writing the TensorFlow records out. This is something you'd probably need to download and look at the code for if you care, because it took a while to get right.

One of the other lessons here is that I preprocess everything before writing it to disk. The TensorFlow people seem to want you to preprocess it as it flies off the disk, but doing even simple processing that way proved to be such a documentation nightmare that it's far easier to just do it in numpy and save it to disk. Perhaps it is easy to do in TensorFlow; there's an example in there of how I did do it in TensorFlow and then gave up, because manipulating the complex numbers was no fun at all, whereas in numpy I can just do it. So basically I take these spectra and these mel spectra, make some angles, make some shapes, and store it all to disk, so that afterwards I can take this dataset and parse it, and then shuffle it, or batch it, or make it into an iterator. That last bit is the magic working for me, but it took an awful long time to set up the trick before the magic would work. And I have to say it also feels very backwards, in that you're writing all this code that you won't be able to run for two days, because you have to write the model and everything else before you'll even try to suck the data in. So anyway, that's the experience.

Now I'm set up: I know I have some data which could be read in, if I had something to pull the data. So I'm now going to write a Keras model. I've got two functions here, one of which is a WaveNet layer. It does two of these 1D convolutions (there's a tanh branch and a sigmoid branch, each with the dilation rate which does the skipping), multiplies them together, pads things out a bit, then produces the skip output, and here's the residual connection. So basically this is a single one of those nodes on the WaveNet graph, including the dilation. Then for the model, which maps the mel spectrum to the full spectrum, I just take my inputs, apply five of these WaveNet layers, and make stuff happen. So this is the WaveNet layer being used.
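The layer looks roughly like this; this is a tf.keras sketch with assumed filter counts, not a verbatim copy of the notebook, and the residual add assumes the input already has `filters` channels:

```python
import tensorflow as tf

def wavenet_layer(x, filters=64, kernel_size=2, dilation_rate=1):
    # Gated activation: a tanh branch times a sigmoid branch, both dilated
    tanh_out = tf.keras.layers.Conv1D(filters, kernel_size, padding='causal',
                                      dilation_rate=dilation_rate,
                                      activation='tanh')(x)
    sigm_out = tf.keras.layers.Conv1D(filters, kernel_size, padding='causal',
                                      dilation_rate=dilation_rate,
                                      activation='sigmoid')(x)
    gated = tf.keras.layers.Multiply()([tanh_out, sigm_out])
    # 1x1 convolution gives the skip output; adding it back to the
    # input gives the residual path to the next layer
    skip = tf.keras.layers.Conv1D(filters, 1, padding='same')(gated)
    residual = tf.keras.layers.Add()([x, skip])
    return residual, skip
```

Calling this five times with increasing `dilation_rate` and combining the skip outputs gives the small model described above.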
Now, one of the issues I found is that the Keras internal to TensorFlow is different from the Keras external to TensorFlow, in that the external Keras works and the one inside TensorFlow doesn't behave according to the docs: there are parameters where the docs don't match what happens in the code, and when François Chollet is asked about this he says "submit a pull request to TensorFlow" and closes the issue. This is not helpful. The reason I want to use the internal TensorFlow version is that it's the only one which has the Keras-to-Estimator function, and that function doesn't work on the external Keras, for some mysterious incompatibility reason. Anyway.

So what comes out of this? Keras is nice: you build a model and out comes a nice summary which Keras can explain. Keras does all this nice stuff, which is why it's quite nice to use. I can build a custom loss, set up a training specification and an evaluation specification, and now I can run the magic. That was setting up the magic trick, and this is the magic: I can just run train_and_evaluate and it does all the right stuff, because the input function reads from the datasets and everything feeds into itself and trains the model with the right training settings. And if I look on TensorBoard it's produced a beautiful diagram, and some graphs of the training performance. All of this is beautifully done once it's connected up, but I can tell you it was not easy. Anyway.

So this thing produces beautiful graphs, and I can stop it and make predictions. This is not really a very good demo because it doesn't sound very good, but here is the original: "Bit of Pope and Catullus. Lanza, Voltaire, Rousseau and Wilde. That is what Rapau must have had in mind." So clearly this is a good English speaker reading a book (it's about bachelors, in English, on LibriVox). The text is free because it's Project Gutenberg, and all the speech files are free, free of any obligation, or maybe there's an attribution hit; but it's very nicely done.

So one thing I do is convert that to mel spectra, and then say, well, let's just convert this straight back into the real spectrum and render that back to audio, and it sounds like... sorry, let me just make sure that works. "Bit of Pope and Catullus. Lanza, Voltaire, Rousseau and Wilde. That is what Rapau must have had in mind." This shows the importance of the phase information: the complex spectrum consists of a real magnitude and a phase for each bin, and the reason it sounds like it's gone through a phaser, horribly phasing, is that I've set all the phases to exactly zero here and used just the magnitude, the absolute part. Now, I can prove to myself that the mel conversion bit worked OK by using my predictions plus the original phases: "Bit of Pope and Catullus. Lanza, Voltaire, Rousseau and Wilde." That kind of proves that if I could get the phase stuff right, I'd be in business. And if I now take those out (this is why it's a non-perfect demonstration), we finally get some actual prediction: "Pope and Catullus. Lanza, Voltaire, Rousseau and Wilde." It's about 50-50; I'd say five out of ten on the scale. It would have been great if it had produced perfect speech. On the other hand, if it had produced perfect speech, that would probably be a product, and I would probably not be open-sourcing it straight away. So in a way it's good that the version here doesn't work so well: it's a complete worked example of Keras and the Estimators and everything, it just doesn't do exactly what we need it to do. Sorry about that. So that was the demo.
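For the curious, the phase experiment above looks something like this; a librosa sketch with assumed FFT parameters, not the notebook's exact code:

```python
import numpy as np
import librosa

# Load speech and split the complex spectrum into magnitude and phase
y, sr = librosa.load('speech.wav', sr=16000)
stft = librosa.stft(y, n_fft=1024, hop_length=256)
magnitude, phase = np.abs(stft), np.angle(stft)

# Mel compression, and a crude pseudo-inverse back to the linear spectrum
mel_basis = librosa.filters.mel(sr=sr, n_fft=1024, n_mels=80)
mel_spec = mel_basis @ magnitude
approx_magnitude = np.linalg.pinv(mel_basis) @ mel_spec

# All phases forced to zero: sounds like it went through a phaser
zero_phase = librosa.istft(approx_magnitude.astype(np.complex64),
                           hop_length=256)

# Original phases restored: proves the mel round trip itself is OK
with_phase = librosa.istft(approx_magnitude * np.exp(1j * phase),
                           hop_length=256)
```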
One of the problems with this one-minute-of-GPU-per-second-of-audio generation is that it's not very fast, so people tried to speed it up. One semi-obvious observation is that a lot of the layer computation is repeated from one sample to the next. So there's a thing called Fast WaveNet, which involves queues: as long as you remember just the right number of things which you precalculated, you can generate a lot faster. This is several hundred times faster than it used to be, but it's still not fast. There's a reference to the code. That was too fast.

The next big splash came in October 2017, basically a year later. Well, it's a semi-big splash in that there's a blog post which announced that Google Assistant now uses WaveNet because they've done something really smart, but they didn't really explain what they did. Now, I've got a Google Home here, which is an English version that doesn't use WaveNet. "Hey Google, can you tell me a limerick?" "Here's a limerick for you. 'There was a Young Lady of Sweden' by Edward Lear. There was a young lady of Sweden, who went by the slow train to Weedon; when they cried 'Weedon Station!', she made no observation, but thought she should go back to Sweden." OK, so you can hear that it's going up and down in the wrong way. This is not WaveNet, whereas the US ones sound much better. Now that you've heard computer-generated voices dissected, you'll be able to hear what the problems are, right? So I'm sorry to have ruined it for you, but the WaveNet stuff works really well. Clearly other people are coming up with good stuff too, but the English Google Home is not quite there yet.

What happened afterwards, fortunately, is that DeepMind came out with a November post which actually explained what they were doing, and had a paper, presumably because of some conference release date or something. The trick is basically this: you want to feed in just noise, or some kind of audio, and turn it into the nice speech in one go. If you could do it all in parallel rather than one sample at a time, you'd be done, because the GPU just does layer, layer, layer, layer, and out comes ten seconds of speech, like the ten-second batches I was using in my example. So the question is: how do we make this noise input straighten out into this beautiful output?

What they do is make two synthesisers. One is the teacher network, which they've already trained to do what the original WaveNet did: it takes good samples and predicts the next sample. So it does the right thing, and as a by-product it also produces the distribution over every sample along the way. The student takes in just noise, and produces a sample at the end, plus a distribution. You can then put the student's samples into the teacher WaveNet, and it comes out with its own distributions for them. Now, you can't demand that the actual samples be identical, that's too much of a goal, but you can demand that the distributions of the samples be the same. By pulling those two distributions towards each other, the teacher forces the student to learn what the distributions should be, which is to say, to be a good WaveNet. They've got a little animation of this: basically they've sped it up, so that instead of 0.2 seconds' worth they can produce a lot more.
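In code, the probability density distillation idea might look something like this very rough sketch. Here `student` and `teacher` are stand-in callables, not real APIs (the student being the paper's inverse-autoregressive flow), and the real loss adds further power and perceptual terms:

```python
import tensorflow as tf

def distillation_loss(student, teacher, noise):
    # The student turns noise into audio in one parallel pass, and can
    # report its own log-probability for the audio it just generated
    audio, student_log_prob = student(noise)
    # The frozen teacher (the original autoregressive WaveNet) scores the
    # student's audio under its per-sample distributions; this is also one
    # parallel pass, because the whole waveform is already given
    teacher_log_prob = teacher(audio)
    # KL(student || teacher): large when the student puts probability
    # on audio the teacher considers unlikely
    return tf.reduce_mean(student_log_prob - teacher_log_prob)
```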
So suddenly it's now a commercially practical application, and they got there with this teacher-and-student scheme. Now, the problem with teacher and student is that if the teacher is only semi-good after a week on your GPU, it will only train a really pathetic student, right? You need to train the teacher all the way before it can train the student. So, as far as I know, there are no implementations of this in the wild, just because no one else can run it. And this is why I want to play around with the spectral thing.

OK, so to wrap up. WaveNet started out very good but super expensive. But because it was so good, it proved that it was worth optimising, and there are a ton of smart people at these companies, DeepMind and Google in particular, who have actually done it. Now it's a viable product, whereas a year ago, in October 2016, it was a nice state-of-the-art result but it wasn't practical. One of the things Google seems to be doing with their excessive amounts of hardware is proving that things are possible, and that allows other people to come along and say: well, now I know it's possible, I can actually train a network to do it, because I know it can be done. Before, the academics were pushing the envelope upwards, but not reaching for the sky, only just above where they previously were. Google coming out with this super-good result means that people say: well, I don't want to do the little thing, I should do the big thing. And with bigger ideas, there's a huge amount of innovation possible in this stuff.

OK, I've got some ads; we could do these later, but let's do some quick ones now. Clearly you know about the deep learning meetup group, because you're here. The next one is the 22nd of February. What we typically aim for is a talk for people starting out, something from the bleeding edge, and lightning talks. Now, it may not be that this one has much for beginners, so we'll see, but we're trying at least. Here's a bonus piece of news: you can now get from Google, via Kaggle, or it's called Colab now, or something, a free GPU for up to 12 hours at a time. So basically you have an IPython notebook and you can use a GPU for up to 12 hours; it might throw you off occasionally, but my guess is they're just soaking up spare capacity on older GPUs. There's also a blog post there, but free GPU sounds great. Is great.

So can I have a show of hands? Can everyone raise their hand? Everyone. This is just a rejection test. Can everyone lower their hand now? Can everyone who didn't raise their hand before raise it now? I tried, right? OK, who wants more in-depth stuff, like more eager mode? OK, these are actually the eager people. Who would like to hear about the text-to-speech race? This WaveNet thing is part of that, but there's also Tacotron and Deep Voice and Loop; there's a whole race going on in text-to-speech. Anyone interested in that? Some, OK. What about speech-to-text? I'm seeing a lot of the same hands... OK, some more, OK. Who wants to hear more about Google Cloud ML? Ooh, how interesting. OK, thank you. Who wants to hear about the latent-space tricks which are being played? OK, OK. Who wants to hear about things which are being done with knowledge bases?
Maybe knowledge bases and assistants and that kind of thing. OK. Thank you very much. Another thing we're doing, since there may not be that much beginner content today, is a kind of deep learning back-to-basics series, which will be held at SGInnovate, which is in Carpenter Street near Clarke Quay. There's one on the 6th of March. If you came basically a year ago, it's going to be similar content to the beginner stuff we were doing then, so we're catching up with people who didn't attend the meetup then. But it's hopefully not a typical beginner thing, and we welcome questions. We're also conscious that there are quite a few beginner-ish events around and we don't want to duplicate them, so we're going to play it by ear and find out.

So here's the eight-week deep learning developer course that we talked about a lot last year. This happened, from the 25th of September to the 25th of November, and it consisted of twice-weekly three-hour sessions, which is a lot, plus homework with instruction, and everyone had individual projects. People who needed it, Singaporeans and PRs, got 70% of the cost funded by WSG. Who here attended the course? Some hands, and that guy did too. That guy. Yeah, OK, great. It was held at SGInnovate; it's over now, but just to say that when we said it would happen, it did happen. There should be another deep learning developer course, of some number of weeks, starting sometime soon; it kind of depends on WSG getting their act together. We're not exactly sure how time-intensive it should be, because it was a real struggle, so maybe we'll do a mix of in-person and online, or leave bigger gaps, or something.

We'll also probably be doing a deep learning beginner course, but this will be slightly more than an AI Saturday or a one-day event or an Nvidia thing. The idea is we'd have a full day of stuff to play with real models (we've done this before), but also a list of projects to actually go away and do, with a midway point to ask people what problems they've hit and what's going on, and then a wrap-up session. The idea is we have some kind of one-on-one thing, but also regroup. What we found from the deep learning developer course is that actually doing a project teaches you hugely more than just sitting there or watching a Udacity thing: having your own project, something you can actually do, forces the learning a lot more. There's a link at the bottom; we'll post the link in the meetup page.

So that's me. I'll take questions now and we'll switch over.