OK, let's start. It's a really big pleasure to have you here. We have known each other for quite a while, but it's the first time he has come to the university. He was a big plastic surgeon at the time of my game. Now they are participating at Sónar; in fact, I see that Google has quite a big presence at Sónar. Magenta is one of the research projects from Google that he directs, and it has recently been paying a lot of attention to the use of deep learning for music. So it's great to have the leader of the team here. I guess quite a number of people also heard a similar talk at Sónar, but I think it's cool that he can repeat it here.

Yeah, thanks so much for the invitation. I didn't know how beautiful the university is here. I mean, Barcelona is the best city in Europe for sure, so somehow we're going to find a way to spend a lot of time here. It's really great. So the project I'm going to talk about today is called Magenta, and it's just an acronym: Music and Art Generation. If you're in this field, you may have seen articles and PR and press around this, and maybe there's a little bit too much. Our goals are really pretty modest, and I think very well aligned with what you're all doing here, which is simply to try to understand what we can do with generative models in machine learning to generate media. We talk about it in terms of music and art. But another way to think about this: I have two kids, 13 and 18, and I watch what they do with their phones. Largely, they're information seeking, they're communicating, and they're looking for entertainment. So even though we're not talking on phones, these have remained fundamentally communication devices; they're about communicating. And if you think about what we're doing as musicians and as artists, we're also communicating. So I think we have an opportunity with machine learning, especially the same kinds of models that we use for translation and for speech recognition, that same family of models, to also provide ourselves with new tools for communication. In this Venn diagram, there remains this very important component, which is human creativity and feedback. I personally don't think it's very interesting to talk about pushing a button and watching a computer make art. I mean, I think we want machine learning models to be strong and to do interesting things; that's what would make this technology interesting. But fundamentally, I hope we all agree that art and music are about communicating with other people, maybe convincing them of something, maybe showing them something new. And so the idea of having one computer playing music for another computer is fine as an experiment, but it's not what interests us on Magenta. Furthermore, let's be really honest: we don't know what makes good art and what makes good music. That's true for us as musicians and as artists. You kind of wait to see if you suck or not, and if people like you, maybe you don't. So there's this longer-term goal, which we haven't really reached, which is: once we have some models that do OK, can we use reinforcement learning or other kinds of critical feedback to make them better? And then the combinations of people and technology that work are the ones that other people like. Not everybody, right?
But you hope that you have some small slice of people that love what's happening and that they continue to come back for more. And that's really long-term. I think that's the only kind of success you can hope for: that there's engagement with other people and some sort of conversation that happens over time. We are open source, and we're always very actively looking for collaborators. We have a blog, we've had lots of guest posts; we're not trying to keep this as a closed community at all. And I think there's a future, one we haven't reached yet, that is not really just Magenta but a number of projects happening in parallel, where I strongly believe we'll continue to see the growth of coding as a creative tool itself. So it's not that the engineers create stuff with code and then throw it over a wall, and then musicians grab that code and do something creative with it. Obviously creative coding has been around for a long time, and this is a hotbed of that, and all the MIR stuff, so I'm speaking hopefully to people who agree. But this idea that we might start to enlarge open source to be part of the toolkit of musicians and artists who actually want to code a little bit is something I'd love to see happen.

I want to talk about two projects. They're not both music, actually; the first one is drawing. But I think it still tells a story about what we might be able to do with generative models. This is a paper whose first author is David Ha. He's a Brain resident, which is a one-year program in Google Brain in Mountain View, where I work. I highly recommend that all of you consider the next round of the Brain residency; it's basically a one-year postdoc working with a bunch of deep learning researchers. Some convert to staying at Google, some move on to grad school. You don't have to have a PhD to do this. Anyway, the basic idea of Sketch-RNN was to explore the space of drawings, and specifically the space of sketches, not pixels. The model is trained on the delta x and delta y of a bunch of sketches, and then it learns to generate new instances. So these are all instances of cats; actually, it's moving through the latent space, moving from one to another, for cats and yoga poses and fire trucks and mosquitoes. And maybe some of you are familiar with this: how many of you have seen Quick, Draw!? Okay, that's cool, a few of you. So the data that we used came from this kind of cute game that Google's Creative Lab put out, where you're basically playing Pictionary against a computer. The computer is actually an image classifier, a trained neural network. And there's a really clever trick to this. You're drawing, you're told to draw a bear, you draw something, and sometimes you get it right and you get points. You might think that to get those points, the classifier's number-one answer had to be bear. But it turns out that if bear shows up in the top five, you run with it, and people love that, right? So we don't really have to care about retrieval here and balancing precision; we just grab something from the top five and people love it. So we started to collect millions of drawings, and as you may know, we just released these drawings as open data. So there are hundreds of millions of drawings. They have the constraint that they were drawn in less than 20 seconds, by anybody, so the artistic quality may be low. But they're still interesting.
And so I want to talk a little bit about this generative model for vector images, because I think it tells an interesting story. For some of you, this will go too slowly in terms of the machine learning; for others, maybe it will be useful. The basic idea of a generative model is that we're going to come up with some representation of the data that will allow us to later sample from that representation, numerically and stochastically. In this case, we're going to talk about a generative model that functions as an autoencoder. Oh, hold on, sorry. I didn't get a chance to switch slide decks, so I'm going to skip some of this stuff; the chunks are in a different order for different talks. I apologize; you're just going to see the dirtiness of going back and forth. It's relatively informal now. I do know what the slide deck is supposed to do.

So, we have some input and some output. The first time the user touches the screen on a phone, or uses their mouse to start the drawing, that's (0, 0), that's the origin, and all we're storing are the delta x, delta y offsets that create the strokes. There's something clever and interesting about even this particular image. Notice that the cat input, a real input drawn by David, has actually been run through a trained model and decoded, and notice the difference between the two drawings: the latent space actually added a whisker that wasn't there before, which we'll come back to. Our encoder, which is just the technical term for taking the data in the space of the raw data and projecting it into our latent space, is a bidirectional recurrent neural network. So it's an LSTM that moves both ways through the sequence of delta x, delta y points: starting at the end and coming through to the beginning, and starting at the beginning and going through to the end. This is a common tactic for doing non-causal processing over sequences; obviously you need the whole drawing, because we're going both ways through it. And this is similar to work done by Alex Graves, now at DeepMind, on handwriting generation, so this basic idea was already there. We're then creating a rather restricted latent space z, into which we're injecting some noise to avoid overfitting. And the cost that we're training on is a combination of L2 and KL divergence, so we're trying to model the entire space of possible drawings, not just the points where real drawings live; that will allow us to generalize later. And our decoder is a unidirectional LSTM, a recurrent neural network that unfolds in time and tries to predict the strokes of the drawings. But crucially, it has a small mixture-of-Gaussians model tied onto it, which, if any of you paid attention to the opening slide (maybe you didn't), looks like little rose petals; I'm not going to go back through 50 slides to get there. The basic idea is that this RNN is also propagating through this mixture of Gaussians, which allows it, given the number of Gaussians in the mixture, to model really nicely the multimodal behavior that you see in these drawings. And of course, this is unidirectional, right? So now we can actually spin out these drawings in real time.
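For those who want to see the shape of this, here is a minimal sketch of that kind of architecture in TensorFlow/Keras. It is only a sketch under assumptions: strokes already preprocessed into fixed-length (dx, dy, pen) rows, placeholder layer sizes, a 128-dimensional latent, and a 20-component mixture, rather than the exact hyperparameters from the paper.

```python
# A minimal sketch, not the exact Sketch-RNN hyperparameters: a bidirectional LSTM
# encoder projects a stroke sequence (dx, dy, pen) into a noisy latent z, and a
# unidirectional LSTM decoder conditioned on z parameterizes a mixture of Gaussians
# over the next stroke offset. Training combines reconstruction and KL losses.
import tensorflow as tf

SEQ_LEN, N_MIX, Z_DIM = 100, 20, 128

class Sampling(tf.keras.layers.Layer):
    """Reparameterization trick: z = mu + sigma * eps, with eps ~ N(0, I)."""
    def call(self, inputs):
        mu, log_sigma = inputs
        eps = tf.random.normal(shape=tf.shape(mu))
        return mu + tf.exp(0.5 * log_sigma) * eps

strokes = tf.keras.Input(shape=(SEQ_LEN, 3))              # (dx, dy, pen) per step

# Encoder: non-causal, reads the whole drawing forwards and backwards.
h = tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(256))(strokes)
mu = tf.keras.layers.Dense(Z_DIM)(h)
log_sigma = tf.keras.layers.Dense(Z_DIM)(h)
z = Sampling()([mu, log_sigma])                           # noisy, restricted latent

# Decoder: causal LSTM that sees z at every timestep (teacher-forced on the strokes;
# in practice the stroke inputs would be shifted by one step).
z_tiled = tf.keras.layers.RepeatVector(SEQ_LEN)(z)
dec_in = tf.keras.layers.Concatenate()([strokes, z_tiled])
dec = tf.keras.layers.LSTM(512, return_sequences=True)(dec_in)

# Mixture-of-Gaussians head: per step, N_MIX weights plus 2D means, scales and a
# correlation, plus one pen logit (a simplification of the paper's pen states).
mdn_params = tf.keras.layers.Dense(N_MIX * 6 + 1)(dec)

model = tf.keras.Model(strokes, [mdn_params, mu, log_sigma])
model.summary()

# Training would minimize the mixture negative log-likelihood of the true offsets
# plus a KL term such as: -0.5 * mean(1 + log_sigma - mu**2 - exp(log_sigma))
```

At generation time you would drop the encoder, take a z (sampled, or encoded from a partial drawing), and sample each next offset from the mixture, one step at a time.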
And again, we've already seen some interesting work being done artistically with Quick, Draw!, just the raw data, and we have the code for training these models and playing around with them in open source. I want to do a quick demo of this for you; let's see how we're doing on our zoom here. So what I'm going to do is start to draw, and that's going to condition the model, and then the model's going to continue drawing, but it's going to sample nine different times, so you get an idea of the variance of the model. So this model was trained only on cats, and if I draw some cat ears, then... oh, let's see, I get rain. Never mind, okay, this is a bug; it's a janky JavaScript demo that we're cleaning up. So let's just go with rain. If I draw clouds, I get rain, because it turns out most people, when they want to win at this game, draw the cloud first. If I draw rain first, I almost never get a cloud. So if I draw big raindrops like this, the model's going to... that's kind of cool, I drew a raindrop backwards for the first time ever, and now I'll do the raindrop this way. The direction matters, right? Draw the raindrops that way and I get more raindrops, and sometimes a cloud. And if I add raindrops that are just coming down like that, the model tends to add in similar raindrops. So it's just toy data, right? These things were made in 20 seconds. But I think there's something here; I'm drawn intellectually to this idea of what kind of expressivity we get when we live in the space of strokes instead of pixels, and what kind of expressivity we get when we live in the space of sequences of strokes, so that the order starts to matter. We're also going to release some of this to play with very soon. Like, I can't really draw a cruise ship, so I'll just draw some water, and the model will draw in the cruise ships for me, which is so cool. This also might form, in your mind, the kernel of directions for having machine learning complete your thoughts for you. It's not very interesting if all you do is push a button and see a bunch of cruise ships, but it's really interesting if you can kind of tilt the generation, or see the possible futures from what you've already drawn: you start something and then it gives you some directions forward.

Which talk do I pick? There are so many. Let's go back to the Zurich one, whatever; right now we're in Barcelona, but I was recently in Zurich. We also have this idea of temperature, which is common in these models. For those of you who aren't familiar: when you sample from the softmax probability distribution, if it's very spiky, you tend to land only on that one mode of the distribution the model has learned. So you can numerically flatten out the distribution, changing its entropy, and as you go hotter, the entropy goes up. I really love these yoga drawings: the low-temperature yoga, the really predictable poses, the ones we can all kind of do; but as hot yoga comes in, it gets much more difficult to imagine, and they're actually quite fun. These were just blown up from the ones down here.
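As an aside, the temperature trick itself is tiny; here is a sketch of it in plain numpy, with made-up logits just for illustration.

```python
# Temperature sampling: divide the logits by a temperature before the softmax.
# Low temperature -> spiky distribution (predictable output); high temperature ->
# flatter distribution, higher entropy (the weirder "hot yoga" samples).
import numpy as np

def sample_with_temperature(logits, temperature=1.0):
    scaled = np.asarray(logits, dtype=np.float64) / temperature
    probs = np.exp(scaled - scaled.max())      # subtract max for numerical stability
    probs /= probs.sum()
    return np.random.choice(len(probs), p=probs)

logits = [2.0, 1.0, 0.2, -1.0]                 # made-up logits for illustration
print(sample_with_temperature(logits, temperature=0.1))   # almost always index 0
print(sample_with_temperature(logits, temperature=2.0))   # much more varied
```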
Also, those were unconditional generations, which means we only use the decoder; we just decode from our latent representation. But we can also prime our latent representation by encoding something. So in this case, we've encoded four faces, and, just so you can see them, those are the reconstructions from the decoder as we sample from these corners; each of the four faces is shown in a corner. And now we're moving through our latent space. What you'll see is that we don't have any points in this particular area of the latent space that are broken. If we only train on L2, on Euclidean distance, we actually do have lots of points in the space that are broken, because the model only learns where the data is. So we're also training on KL divergence, forcing the model to also understand the overall space, and that's crucial for getting this kind of smooth behavior in latent space. Also, I think this is interesting: these models don't memorize. We've purposely kept the capacity of the generative model small; otherwise it's kind of boring, right? It just memorizes. And we also inject a little bit of noise, because these vector drawings are really low-dimensional: you basically have a column of x, y deltas, and you might have just 50 or 100 of them, and that's it, so it's really easy to overfit. If you look at the model on the left, the left-hand column was trained on cats. You see that if you encode something, the model decodes more or less what you've encoded; obviously not exactly, because it doesn't have that much capacity. Here's a model trained only on pigs. Notice that if you give it a pig with eight legs, the decoder only reproduces four of them, which I call a feature, not a bug. I think it's kind of nice that it has this really strong prior towards this particular geometry. And I love this: if you take a truck and you run it through the pig model, you get a pig truck. I'm sorry if you've heard me repeat this, because I used this example at Sónar, but I defy you to draw a better pig truck than that. If someone comes to you and says, draw a pig that looks like a truck, or a truck that looks like a pig, it's pretty good, right? It's got truckness in the front and pigness in the back. There are no three-eyed cats in the model's mind, and, more crucially in terms of memorization, if you give it something that we recognize, like a nice iconic view of a toothbrush, it's so far away from cat that the model just can't reproduce it at all. So I definitely think that's good: this particular model, the cat model, has learned about cats. And now we're working towards conditional models that can learn thousands of classes at the same time; what's in our paper now is learning a few classes at the same time and having multiple single-class models. Finally, for those of you who like this kind of thing, this is vector algebra in the latent space, and it does what it's supposed to: the algebra of pig and cat bodies and faces holds in this model. So I think that shows some real smoothness, some sense that the model has learned a kind of smoothness in this space; a rough sketch of that interpolation and algebra follows below.
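Here is that rough sketch of the latent-space interpolation and vector algebra. Random vectors stand in for real encoder outputs, since the point here is just the arithmetic; in practice each z would come from encoding a drawing, and each result would be decoded back into strokes.

```python
# Latent-space interpolation and vector algebra. Random vectors stand in for the
# outputs of a trained encoder; in practice each z comes from encoding a drawing
# and each result would be passed through the decoder to get a new drawing.
import numpy as np

rng = np.random.default_rng(0)
Z_DIM = 128

z_cat, z_pig, z_truck = (rng.normal(size=Z_DIM) for _ in range(3))

def lerp(z_a, z_b, t):
    """Linear interpolation between two latent vectors."""
    return (1.0 - t) * z_a + t * z_b

# Morph from cat to pig in nine steps; the KL term is what makes the in-between
# points decode to sensible drawings instead of broken ones.
morph_path = [lerp(z_cat, z_pig, t) for t in np.linspace(0.0, 1.0, 9)]

# Vector algebra in the latent space (illustrative, not the paper's exact example):
# start from a truck and push it in the pig-minus-cat direction.
z_pig_truck = z_truck + (z_pig - z_cat)

print(len(morph_path), z_pig_truck.shape)
```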
Okay, I'm going to move on to music. This is definitely a music group, but I thought the drawing stuff was interesting, and I think it overlaps enough with what we're trying to do with music sequences that it's worth talking about. What I'm going to do is go to NSynth. Sorry, this is me going in the wrong order with the slides; don't look at those. Don't look. Okay, now look. So the next thing I want to talk about is a project that learns a music synthesizer using a deep neural network. This is about timbre, about sound, and we also published some data to go with it that people can play with. This was a nice collaboration between Brain in Mountain View and the DeepMind team in London, and I can say that even within a company like Google, it's really hard to collaborate over nine time zones. It's easy enough for manager types to say, hey, you shall collaborate (I just looked at you to say that), but it's really hard to actually make that happen grassroots, and it did in this case. The three in front, Jesse and Cinjon and Adam, are from Brain, on the Magenta team, and Sander was a crucial contributor from DeepMind. And then all the last authors are manager types, like me, at least on this paper. We have a couple of blog posts: if you want something to follow up on later, please have a look at magenta.tensorflow.org. There are a number of blog posts talking about all of this work, including the drawing work, and we keep putting more up there. It also provides links to the GitHub and everything else.

So this work relies on previously published work from DeepMind, all credit to DeepMind on this: a model called WaveNet. Anybody here already familiar with WaveNet? A few people? The basic idea is that this is a model that predicts the raw waveform. I come from a tradition that strongly believes this is crazy, and hopefully some of us in the room share that intuition: what are you doing? It's not going to work. But this model actually showed that with a lot of engineering and a lot of electricity on GPUs, you can train a convolutional model to predict the next sample of audio sampled at 16,000 Hertz. So that's predicting the position of a speaker cone, in a way; I mean, it's crazy. What makes this model able to function is that it's fundamentally a model called PixelCNN, the pixel convolutional network that was already published and renamed when they pushed it onto audio. The idea is that it's doing convolution at the level of the waveform, at the level of the sample. But of course, if you're doing gradient descent over 16,000 timesteps, eventually your gradient just vanishes to zero; there's just no signal left. So instead they're using something that already existed, called dilated convolution: different layers skip samples while also being informed by the lower layers. So you'll have layers in the network that skip many, many samples, maybe 1,000, but they're also being fed some neural-network representation of what happened in between, all to condition the next sample. I think the best-performing input was the delta of the previous sample, and it's also mu-law encoded so that you have more dynamic range where it matters; it's a logarithmic compression. Anyway, this works really well. However, it has one really crucial problem, which shouldn't surprise you if you understand roughly how it works: it's completely incoherent beyond the, say, 200 to 500 millisecond range. It's interesting; it can grab chunks, it stays coherent for a while, and then it just decoheres, basically becomes unconditional, and continues generating.
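Two of the ingredients just mentioned are easy to show concretely: mu-law companding of the waveform into 256 discrete levels, and how stacking dilated convolutions grows the receptive field. A rough numpy sketch follows; the dilation schedule below is illustrative, not WaveNet's exact configuration.

```python
import numpy as np

def mu_law_encode(x, mu=255):
    """Compand a waveform in [-1, 1] to 256 levels (more resolution near zero)."""
    x = np.clip(x, -1.0, 1.0)
    companded = np.sign(x) * np.log1p(mu * np.abs(x)) / np.log1p(mu)
    return ((companded + 1.0) / 2.0 * mu + 0.5).astype(np.int32)   # 0..255

def mu_law_decode(q, mu=255):
    """Invert the companding back to a waveform in [-1, 1]."""
    companded = 2.0 * (q.astype(np.float64) / mu) - 1.0
    return np.sign(companded) * ((1.0 + mu) ** np.abs(companded) - 1.0) / mu

# Receptive field of a stack of dilated causal convolutions (kernel size 2),
# with dilations 1, 2, 4, ..., 512, repeated a few times (illustrative numbers).
def receptive_field(dilations, kernel_size=2):
    return 1 + (kernel_size - 1) * sum(dilations)

dilations = [2 ** i for i in range(10)] * 3
samples = receptive_field(dilations)
# Roughly the couple-hundred-millisecond horizon mentioned above.
print(samples, "samples, about", samples / 16000.0, "seconds at 16 kHz")
```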
So to make that more concrete, this is what it's like to train a WaveNet only on Dizzy Gillespie music and then generate from it. Or not; this is going to be great, I promise. Did you guys get that? Miles Davis; it's like Miles on the radio, like you're tuning the radio, right? So then we trained a model. Our goal was to create a music synthesizer that lives in a latent space; that's the basic goal of NSynth. I like starting with the drawings first because you can see how you can do conditional generation on those drawings and explore the spaces that live between what we know about in the real world, so to speak. That's a little dreamy, a little poetic, but: I draw some drawings, I project them, and I can linearly interpolate in this vector space and get something new. So we built a large dataset of musical notes, individual musical notes. Every lab eventually does this; we did it back at the University of Montreal: I've got an idea, I'm going to make a huge dataset by sampling lots of stuff from a bunch of sample packs, and so we did that too. And then when we trained on it, if you sample unconditionally (this is not a conditional model, you really can only sample unconditionally from these models), this is what you get. Here's just one sample from a WaveNet trained on individual musical notes. It sort of held the pitch for a little while, and when it decohered, it decohered to some other super-low-frequency thing. Here's another one; this one's really pretty, actually. Notice how it lost its pitch coherence, then shifted down to another pitch, but then it found the harmonics for that pitch. So these WaveNet models are actually learning, using convolution, a bunch of Fourier-like features. There are lots of papers from lots of labs that analyze this basic behavior, and lots of wonderful discussion about whether you'd rather fight with generation by figuring out how to put phase back into spectrograms, or whether you should just work on the waveform in the first place. It's almost like a religious battle now, so I'm just going to talk about what we did.

What we did was say: well, let's take advantage of WaveNet, but let's condition it, and condition it in a way that makes sense. So now we have a very similar picture: we have some input and some output, and we're going to build some sort of autoencoder. In this case, our encoder is itself doing deep dilated convolutions, and pooling across those convolutions to build a temporal view of what's happening in that audio. By temporal I mean: instead of having one z that's big enough to try to capture the entire musical note, we're going to have a bunch of them that move in an autoregressive fashion over time, so that we have this small temporal representation. Then we use that to condition our WaveNet decoder. One thing I didn't copy in, just to keep the graph simpler, is that this WaveNet also has its own state: when it's trained, it's seeing the waveform itself, so fundamentally it's receiving the information from NSynth as conditioning and only as conditioning. It could choose to ignore it if it weren't useful.
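Here is a very rough sketch of that conditioning setup in Keras, under assumptions: placeholder layer counts and strides, a four-second note at 16 kHz, one 16-dimensional embedding per 512 samples, and a plain causal conv stack standing in for the real gated WaveNet decoder.

```python
# A rough sketch of NSynth-style temporal conditioning (not the paper's exact layers):
# a conv encoder turns the waveform into a coarse sequence of 16-dim embeddings,
# which are upsampled back to audio rate and concatenated onto the decoder's input.
import tensorflow as tf

SAMPLES = 64000          # four seconds of audio at 16 kHz
EMB_DIM = 16             # per-step embedding size mentioned in the talk
DOWNSAMPLE = 512         # one embedding per 512 samples (placeholder value)

audio = tf.keras.Input(shape=(SAMPLES, 1))

# Encoder: dilated convolutions, then pooling down to a temporal embedding sequence.
h = audio
for rate in (1, 2, 4, 8):
    h = tf.keras.layers.Conv1D(64, 3, dilation_rate=rate,
                               padding='same', activation='relu')(h)
h = tf.keras.layers.AveragePooling1D(pool_size=DOWNSAMPLE)(h)     # (125, 64)
embeddings = tf.keras.layers.Conv1D(EMB_DIM, 1)(h)                # (125, 16)

# Upsample the embeddings back to sample rate and condition the decoder on them.
cond = tf.keras.layers.UpSampling1D(size=DOWNSAMPLE)(embeddings)  # (64000, 16)
dec_in = tf.keras.layers.Concatenate()([audio, cond])

# Stand-in decoder: a causal conv stack predicting a softmax over 256 mu-law bins
# for the next sample (a real WaveNet decoder uses gated dilated convolutions).
x = dec_in
for rate in (1, 2, 4, 8, 16):
    x = tf.keras.layers.Conv1D(64, 2, dilation_rate=rate,
                               padding='causal', activation='relu')(x)
logits = tf.keras.layers.Conv1D(256, 1)(x)

model = tf.keras.Model(audio, logits)
model.summary()
```

The point is just the wiring: the decoder still predicts each next sample from its own past, but every prediction also sees the slowly varying embedding sequence, which is where the long-term structure that plain WaveNet loses is supposed to live.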
Another way to view this is with the spectrogram-like representations that we call rainbowgrams. Time is still the horizontal axis and frequency is still the vertical axis, down here and up here; the embeddings themselves are in the middle, where the color is just there to tell the dimensions apart. In the rainbowgrams, the color is a mapping of phase: if the color stays the same, that means the phase has been coherent at that frequency for a long time. So you see modulation really clearly in these, and that's the only thing they add. Somebody must have done this before; we looked around for someone else who used color to represent phase in a spectrogram, since it doesn't seem like a crazy thing to do if you care about things like modulation, but we couldn't find anything, so we slapped a name on it and spent five minutes explaining what the hell we're doing, because it's weird. But it does give you a nice view, and especially when you compare to spectral models as baselines, we see that these time-domain models can actually do a really nice job of capturing some of the modulation that you hear in some musical instruments. That's in our paper, and this group especially might find some of those details interesting.
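If you want to reproduce something rainbowgram-like yourself, here is an approximate sketch using librosa and matplotlib: CQT magnitude drives the brightness and the frame-to-frame phase derivative drives the hue. The file name, hop size, and bin count are placeholders, and this is only an approximation of the figures in the paper.

```python
# A rough "rainbowgram"-style plot: CQT magnitude as brightness, with the
# frame-to-frame phase derivative (instantaneous frequency) mapped to hue.
# 'audio.wav' is a placeholder; assumes librosa and matplotlib are installed.
import numpy as np
import librosa
import matplotlib.pyplot as plt
from matplotlib.colors import hsv_to_rgb

y, sr = librosa.load('audio.wav', sr=16000)
C = librosa.cqt(y, sr=sr, hop_length=256, n_bins=84)

mag = np.abs(C)
intensity = mag / (mag.max() + 1e-8)                       # brightness from magnitude

phase = np.unwrap(np.angle(C), axis=1)                     # unwrap phase along time
dphase = np.diff(phase, axis=1, prepend=phase[:, :1])
hue = (dphase % (2 * np.pi)) / (2 * np.pi)                 # phase derivative -> [0, 1)

hsv = np.stack([hue, np.ones_like(hue), intensity], axis=-1)
plt.imshow(hsv_to_rgb(hsv), origin='lower', aspect='auto')
plt.xlabel('time (frames)')
plt.ylabel('CQT bins')
plt.show()
```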
Here what you're seeing in the middle are the embeddings. Each one is only a 16-dimensional vector that unfolds over time in an autoregressive fashion, and you're seeing each of those 16 dimensions as a time series. So the way to think of it is that this sound was learned by the WaveNet and, when reconstructed, became this sound, but it benefited from this conditioning. It also means, by the way, that in future models, if we can figure out ways to generate these embeddings, then we can generate whatever we want, because the WaveNet has learned to rely on them, and that's kind of cool. We can also artificially extend them, so we can generate four-minute-long flugelhorn drones if we want to burn up a lot of electricity, things like that. So there's a bunch of stuff we can do with this, both artistically and in terms of research, to play around with what's possible now that we have this sort of temporal, autoregressive latent space. Most importantly, it allows us to generalize from the points that we have in the real world. Here, a sound is worth a thousand words. Here's one of the bass samples from the dataset; sounds like a bass. And here's the WaveNet reconstruction. It's not perfect; there are certainly tons of ways to generate bass sounds that are better than this, but this still has those other advantages. It's definitely missing a lot of the harmonics and missing the attack, but okay, maybe we can fix that later. Here's flute, and here's our reconstruction of flute. Where we see something interesting happening: certainly everybody in this room should be able to predict, in your mind's ear, what happens when you do a linear interpolation of flute and bass in audio; you get the superposition of flute and bass, right? But the take-home on this model, I think, is this: now imagine that in this latent space we take the temporal vectors for flute and for bass and we linearly interpolate in that space. Now we get something that sounds fused, including a little bit of warbliness at the end. Let's listen to that again. And so where we think the expressive power of something like this lies is in the ability to really explore these spaces for which we don't have obvious analogs in audio. Here's flute plus organ in audio; I can hear them both, they're not really fused. And here's flute plus organ in NSynth, and organ plus bass, in the interest of time. Did you hear that little high harmonic at the end?

There are a couple of ways you can play with this. One of them is a web demo that contains some samples and linear interpolations between them; it's available if you just Google for "sound maker," and you can take a slider and slide between sounds. Actually, how am I doing on time? Yeah, I can do this; let me do a reload. I'm just going to do one that shows really clearly what's going on. Here's a cow, and the volume's different here, isn't it? So you can hear the original sample. Here's a piccolo, and here's the reconstruction of that. If you click on these, you hear the original samples; if you play down here, or with a keyboard, this is our reconstruction of the piccolo. Obviously you don't come here because you want the most perfect reconstruction of a piccolo; go get yourself a piccolo. And then we can move over and linearly combine them, and we'll have some relatively ill-behaved spaces in the middle. But as we move closer, here's our cow reconstruction, and forgive the AI Experiments guys, they wanted to come up with fun samples. You notice it's basically built the cow sound out of the harmonic bases that it learned, because it was trained on music. And you can play with this for hours and listen to some of the intermediate spaces. What I've found most interesting is not this demo, which is cool, I think it's really cool, but we also have, on our GitHub, an Ableton plugin that has many, many more samples, and that's, I think, basically more fun to play with. And even then, this really is a first step: we have to do all the linear interpolations offline, because it's so slow to generate, even with hundreds of GPUs. So I think the real power will come when we find ways to generate in real time and have a latent space we can control in real time; then we'll really have something. In the meantime, it's really great to experiment with what we have now. Also, there's a kind of cool dataset. Given all the other datasets out there, the fun would be to link this to others, especially given Freesound. But it's not bad: it's about 300,000 sounds, synthesized or drawn from sample packs, with different instrument families and some quality annotations. So especially you guys here, if you see ways to link this in with the work on Freesound: we're not trying to present this as the thing everybody has to use, it's just another dataset to throw out there, and it would be much more fun to find ways to have overlap. We just built it. And now I'm going to talk about Sketch-RNN. No, I'm not; I already talked about that. I jumped through all these slides. So, questions? People can interrupt, it's totally fine. Yeah, oh, you were in the back first.

Just curious, your original demo, about Miles: I think it was all Miles; it sounded like it was searching for, like, early Miles, didn't it? Yeah. Honestly, these were done just for demos by Cinjon and Sander, and I think they just grabbed some Miles. I could find out. I do agree it sounds like early Miles, so I guess it probably is; I didn't hear Bitches Brew or anything, and I guess that's almost early too. Other questions? Yeah.
What I meant was that, for every one of these audio samples, it's nowhere near real time to generate them, so we had to pre-generate all the wave files. So fundamentally what's in this Ableton plugin is a sample pack, right? What's interesting is that we can visualize it, and the user can see: okay, I'm starting with this sound and I'm going to this sound, and we've done tons of linear interpolations between instrument pairs. In some sense, doing a grid search over a model and exhaustively generating is not new. I just wish we could do this in real time, at least on a fast machine, and we can't; it's not even close, it's something like 100x real time. And WaveNet is what it is: you're doing this conditional inference for every sample, and no matter how many corners you cut, you're still doing a lot of inference to make that next sample. I think so. I mean, I'm always surprised; I didn't think this model could be trained in the first place, and there's a lot of work going on, because WaveNet is so important for text-to-speech, on making it faster. Myself and almost everybody else who has looked at WaveNet is convinced that we don't need to do all of this inference. It's like the movie Amadeus from the 80s: too many notes, which notes do you get rid of? Maybe that's an old American joke that doesn't fly here. The chain of inference being done to generate the next sample is massive, and it seems clear to me that we don't need all of it, and that with the right representation and the right priors we can do something else. So one obvious thing to try is to make the model conditioned on some reasonable set of bases that it learns, say a sinusoidal set of bases, so that when you drive the model and condition it, you're driving it with something you can compute really fast over the thing you want to condition on, like the main sinusoidal components, and then force the WaveNet to pick up on all the residuals you can't handle, to pick up on the nonlinearities. These are the kinds of ideas that just need to be tried. And since these models are so hard to train, it's that black art of machine learning: you just have to do a lot of TensorFlow or Theano or PyTorch, and you've got to have a lot of GPU time sitting around to do it, but it's doable. Yeah, sure.

In the drawing example, you have this injected noise to avoid overfitting? That's for the drawing; we're not worried about overfitting in WaveNet, that's not a problem there. Yeah, yeah. Right. So this was treated as a hyperparameter in the model, basically just looking at being able to train the weights without overfitting on out-of-sample data. So it was a fairly general machine learning approach: injecting noise into some layer of a neural network is a pretty standard thing to do, and David did that; all of that work was done by David Ha. But fundamentally, you overfit so fast, right? You're playing capacity against overfitting.
As long as you're doing good machine learning, you're asking: how much noise do I inject such that I don't overfit and I can still pick up on both the KL divergence loss and the L2 loss? How much time do I have? Yeah, okay, so I'm actually going to grab a different thing; I think there's another, more important thing to point to here. I'm going to switch from audio and from strokes to musical scores and music. Let me back up. Music sequence generation is something I worked on embarrassingly long ago, from about 2000 to 2003, with LSTM, when I was a postdoc with Jürgen Schmidhuber, trying to get neural networks to make music. My postdoc was like 50,000 lines of crappy C++ code implementing LSTM one more time, and then you want to try something different. So thank goodness you all have things like TensorFlow and Theano and PyTorch; it makes life easier. My whole postdoc is like 20 lines of TensorFlow code now, which is really sad. But these issues of representation remain, I think, probably the most important future direction if we care about generative models that artists and musicians might actually want to use. It's a question of what question you're asking and what solution you're looking for. We have WaveNet and NSynth down here, caring about the waveform. We have any number of great intermediate representations for audio that we could play with. We obviously have piano roll, and we have scores that we want to care about. And I think we've shown that the same thing is true in the visual domain: by simply having some stroke-based data to train on instead of pixels, we were able to try some really different things.

I wanted to summarize a bit. First, there are two papers that I would talk about if I had infinite time. One of them is about generating musical scores non-causally, treating the score basically as an image, a piano-roll image, infilling bits of the image, and then re-synthesizing from that by turning it back into MIDI. A number of groups have had great luck with that. There was BachBot, which some of you may be familiar with. We had a paper called Coconet, by Anna Huang, and she did some nice four-part harmonizations of Bach; it turns out that unconditional generation from the network sounded more like Bach than Bach to untrained listeners, which makes me sad for untrained listeners, because it didn't sound that great to me. Here, I'm going to show you a couple of new things. The first thing we did in Magenta, just to get the ball rolling, was the simplest recurrent neural network trained on all the MIDI melodies we could get hold of, and we were able to generate kind of a melody. This is the melody right here, one of many, but this one ended up being grabbed by some reporter. There are mountains of papers and at least half a dozen different ways you can do this, with varying degrees of success, to generate new melodies. And interestingly, I think a lot of the really cool questions come from how you actually take a score or a piano roll and treat it with a machine learning model. In that particular model, we chose to do even time-step sampling of the piano roll.
So to generate a whole note, if you're sampling 16 times per measure, you actually need 16 successful predictions in a row, which is kind of silly; that was not the right approach. There have been other approaches before and after: Andrej Karpathy has char-rnn, which Bob Sturm used, where duration is explicitly encoded, and I think that's a better way to work. What we've recently done, and I mean recently, like this week, so we haven't even figured out how well it works, is to try to put time back into these networks, to allow for expressive timing and dynamics and also to allow for polyphonic generation. So I want to play a couple of samples for you. My purpose here is for you to use your ears and think: all right, what are you getting by adding these dimensions to the data? What the model outputs is just one big softmax representation, which is either a pitch event or an advance-the-clock event, and it can advance the clock basically at the millisecond level. So the model can produce polyphony by continuing to generate more pitches before advancing the clock, it advances the clock by deciding to advance the clock, and it also decides when to turn those pitches off. So it's a pretty granular representation of what's going on: basically note-on, note-off, and advance-the-clock. We trained only on real piano performances, lots of them, that came both from scraping the web and from pulling together things like the Yamaha dataset, filtering for whether there's any expressive timing so we don't just have scores. And I think this advance-the-clock representation gives us what sounds to me like somewhat more fluid timing and interesting polyphony, from a relatively raw representation. Let's listen to one version of this. Now let's listen to it again and try to find the pulse. I think what you'll find is that the model hasn't really nailed the timing, so it doesn't have that really robotic, grid-like sound to it; but I also don't think this model was trained long enough, or had the right data, to actually shape phrases the right way. Anyway, I want you to hear that the timing is definitely not uniform. I think that's still not bad for unconditional generation; it's reasonably good. I'd like to verify that these models are not terribly overfitting, so put a kind of wait-for-the-paper on it before you say you love these. And then here's another one: this is a slightly different model, trained with velocity as well. In this case, it's another unconditional generation, as far as I know trained on the same data; these were emailed to me just a couple of days ago. But listen: it's not over the top, but I think you'll hear that you pick up a whole other level of expressivity by having the model try to capture even the relative velocities on the piano. Still wandering, right? There's no there there. But could you hear it, a little bit? It shows up in the MIDI; it's definitely there. It's a question of whether you respond to it.
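To make that representation concrete, here is a small sketch of encoding a handful of notes into NOTE_ON / NOTE_OFF / TIME_SHIFT / VELOCITY events, the kind of single vocabulary one softmax can be trained over. The clock resolution and bin counts are made up for illustration, not the values used internally.

```python
# A sketch of an event-based performance encoding: note-on, note-off, advance-the-clock
# (time shift), and velocity events, so one softmax over a single vocabulary can emit
# polyphonic music with expressive timing. Bin sizes here are illustrative only.
from dataclasses import dataclass

TIME_STEP_MS = 10        # clock resolution (illustrative)
MAX_SHIFT_STEPS = 100    # one TIME_SHIFT event can advance up to 1 second
VELOCITY_BINS = 32

@dataclass
class Note:
    pitch: int        # MIDI pitch 0..127
    start_ms: int
    end_ms: int
    velocity: int     # 1..127

def encode_performance(notes):
    """Convert notes into a flat event sequence, sorted by time."""
    boundaries = []   # (time_ms, order, event); note-offs sort before note-ons
    for n in notes:
        vel_bin = (n.velocity * VELOCITY_BINS) // 128
        boundaries.append((n.start_ms, 1, ('VELOCITY', vel_bin)))
        boundaries.append((n.start_ms, 2, ('NOTE_ON', n.pitch)))
        boundaries.append((n.end_ms,   0, ('NOTE_OFF', n.pitch)))
    boundaries.sort()

    events, clock = [], 0
    for time_ms, _, event in boundaries:
        shift = (time_ms - clock) // TIME_STEP_MS
        while shift > 0:                      # advance the clock, possibly in chunks
            step = min(shift, MAX_SHIFT_STEPS)
            events.append(('TIME_SHIFT', step))
            shift -= step
        clock = time_ms - (time_ms - clock) % TIME_STEP_MS
        events.append(event)
    return events

chord = [Note(60, 0, 480, 80), Note(64, 0, 500, 70), Note(67, 10, 490, 90)]
for ev in encode_performance(chord):
    print(ev)
```

A trained model then just emits one event at a time from this vocabulary, choosing between playing another pitch and advancing the clock, which is what gives it both polyphony and non-grid timing.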
So the way to bring all of this together is, I think, a nice challenge. We're still living with where we sit in that representation hierarchy: some stuff happening at the audio level, some stuff happening in musical scores. And I think, like the rest of the community, we have no obvious way to bring all of this together; there's the "more Bach than Bach" stuff and everything in between. I wanted to get to a couple of conclusion slides here, and I want to leave at least ten minutes for questions. But I hope this gives you a rough idea of what we're trying to do. Really, what we're trying to do is publish papers, and we have the ability to do that. And, if it helps to understand the PR, I know we've done a lot of PR, and you're probably sick of seeing Magenta, especially if you're trying to do this stuff yourself, but we're trying to engage a community. We're actually trying to build out a kind of connection to artists and musicians, and the door is absolutely open for collaboration. Really, the success here is seeing machine learning move into the arts and into music; that's what success is. We're not really that interested in products. This is really just a part of Google Brain. I went to the guy that runs Google Brain, Jeff Dean, and said, hey, what about generating media, shouldn't we do that too? And he was like, yeah, sure, let's try this out. And then the second thought was: shouldn't this be open source? Isn't it kind of silly for four or five people in Mountain View to try to tackle art and music? So yeah, we made it open source, and that's where we are right now. I would really love to see collaboration with you guys; I have huge respect for this group. You are one of the first and one of the best groups; I know the PhDs that have graduated already, a bunch of people all over, lots of people out in Silicon Valley too. When I was at the University of Montreal, working with McGill and CIRMMT, and now at CCRMA at Stanford: you guys are awesome. So if anything about this talk made you interested in collaborating or writing some code, please contact us, contact me directly, and we can talk more.

I want to close with this quote, because I think it hints at where we all want to be with technology and art, and I just love it so much that I always read it: "Whatever you now find weird, ugly, uncomfortable and nasty about a new medium will surely become its signature. CD distortion, the jitteriness of digital video, the crap sound of 8-bit. The distorted guitar sound is the sound of something too loud for the medium supposed to carry it." That's Brian Eno, who I hear is here this week. I think that hints at this idea of trying really hard to understand how we can take these models and actually do something interesting with them, maybe move us into a place where we're actually spawning a new kind of art and a new kind of music. Time will tell if that will happen. If not, we'll be left with some papers, and we'll have moved signal processing and machine learning forward a little bit. But maybe we'll be as cool as Rickenbacker and actually do something like build the electric guitar for the next generation. And if any of you work in instrument building, you know that's just insanely hard, whether you're using computation or not. But that's what we're doing, and I thank you for your attention, especially with Sónar here to distract you. If you have questions, I'm happy to answer them. Questions? Yeah. Oh, good question. Do you know TensorFlow? How do you say it?
Okay. I mean, we're not secretive; it's as fast as we can get it out. Ian Simon did it, and we've just been trying different ways to handle time without things exploding. By the way, it's really hard to do regression: softmax works better than regression in almost every case, and the WaveNet stuff is actually doing softmax too. If you're not familiar with softmax, it's just picking one out of many possible answers. So you're basically saying: I'm going to bin time and then pick the winner, as opposed to trying to regress onto the milliseconds. He finally got, I think, roughly the right resolution; it may or may not be enough, but there's a bunch of potential clock-advancement values at the millisecond level. So basically what I'm saying is we've been playing around with different things, and we're going to write it up, do an arXiv paper, put out some code, and then figure out whether it's overfitting. We have to figure that out first, because, to be really honest, these sound pretty good to me, especially that last one, and we're machine learning researchers; you may have heard of overfitting. We don't know yet. The timing is real, though, so it's fun to hear the timing variance and the velocity variance; how far these were from instances of training data, time will tell. Like quickly, like this weekend, that kind of time. I think it would be interesting, and actually that's really nice: we already have a lot of that code written. We have the call-and-response code; we know how to shuttle data to and from a TensorFlow model, and we can now do it in JavaScript, in a Jupyter notebook, and in Ableton. So now it's a question of getting the right interaction mode, which, by the way, is so hard. It's just so hard to get that right. We've really struggled with it, and we've found that we've basically built one-off interaction modes for different musicians, because every musician kind of wants something different. Unfortunately we haven't figured out the guitar pedal of this, the thing that just kind of works. Other questions? Yeah.

With the sketching, the latent algebra, the cat-minus-pig thing: I was wondering if you've tried the same thing with music. Like, can you add Bach plus some other kind of music? So I think, if we can come up with what we need, which is a latent space representation, a conditioned latent space representation of what we're generating, then yes. Right now what we have is an autoregressive model, so that state is really hard to control; it's basically the recurrent state of the neural network. I think if we could move to convolutional models and have a latent space representation of what we're trying to generate, basically what we might consider doing is something almost exactly like NSynth but for musical scores: take that same strided, dilated convolution idea down to the score level. Then I think we could do exactly that. We would have these really nice, relatively low-dimensional time series that are there for generating Aphex Twin, and then we could play around. I'd totally love to see that. It's hard, but yeah, it's possible. Yeah.
I guess you're aware of all this prior work on other kinds of generative models and learning and simulation; it's a choice of the kind of model. Is there something beyond deep learning that you'd like to apply to the whole generative thing? No, I think we just haven't gotten there yet. With Magenta I chose to say: we're going to make Magenta part of TensorFlow, so the GitHub is tensorflow/magenta, we're part of TensorFlow, and we're also making the decision that we're going to see how far we can get with deep learning and reinforcement learning. That's not to say that's the only way to solve this problem; in fact, just the inverse. It's to say that there are so many ways to solve this problem that, if you're going to make any headway, you'd probably better pick something, right? Otherwise what we're saying is, hey, we're a group of five people that are going to solve art, and it's already pretty naive to be a group of five people trying to do what we're doing. So there's a ton of stuff we're ignoring; we're ignoring a lot having to do with higher-level structure and the intrinsic-motivation work. I think we owe a lot to Jürgen for having said a lot of really interesting things about the connections between encoding and entropy and compression and how that relates to what art is, so there's definitely a thread going back there. But that's as far as we've taken it. That's not to say it's not important; you've just got to cut somewhere. Yeah.

So here's what we did, and I checked, I double-checked on this. When we changed the Quick, Draw! release and decided we wanted to collect the data, we actually changed the webpage to say we're going to use your data to improve machine learning models, and we threw away all the data that came before that. So we only stored the data where there was a very clear message: hey, we're going to use this to try to build better machine learning models. Which I guess is what else can you do, right? We actually really care about user trust at Google. There's a running joke inside Google that if you really want to get fired at Google, just go try to read someone's Gmail. You will be fired at Google for doing any kind of personal violation of trust, and we actually have to go through all sorts of hoops. To give you a concrete example, I used to run personalization and recommendations for Google Play Music, which is one Google product, and we have YouTube sitting over there, which knows a lot about music. These products have finally been merged now, but you'd think, okay, maybe I want to know what users are doing with music on YouTube, and it was just an insane amount of work to get permission to analyze that, even though it's already covered in the terms of service. There's just all this internal control. So, if it helps: we really do try hard, and there was no question about throwing away the data that came before those changes in the wording, and we were asked about that many times. It's a totally reasonable question; in fact, I asked it myself: wait a minute, how are we going to use this data? I think one of the things when I first saw that it was released: there's sort of an "oh, of course Google would do that" reaction. Right.
I mean, there's a history of sort of collecting data for training that way: do what we did, turn it around, and it's like, well, this is a great game, but actually the goal was collecting data. It wasn't, though; the idea to collect the data came later, which is why we changed the wording on the webpage. Other questions? Yeah. I do have another one. That's fine. To me, it seems like we're almost getting to the point where you could model, like, a whole Ableton project, because with the Sketch-RNN stuff you can start to model form, with WaveNet you can model samples, and at some stage you could just sort of generate a whole, well-formed Ableton project. Right, yeah, we're putting together the building blocks for that. We have a pretty cool 808; the 808's really fun. I think that's a really cool project to work on. I don't know that we have the right people on our team internally to do it; there are people externally who want to, and if you're interested in doing that, I'd be happy to work with you. Actually Jesse, the first author on the NSynth paper, is pretty hardcore into Ableton. But I think we'd want to partner with Ableton; we'd want to understand how to actually generate proper Ableton projects, and we'd want some way to build a really cool thing out of it. So here's what I want to do, if you want to help: I'd love to build out a project where, especially if you focus on one thing like Ableton, you say, hey, give us your Ableton projects, knowing that we're going to train on them and try to build new Ableton projects. Just make people part of the experiment, and then, ideally, give them something back. Let them sort of overfit a model on their own project or something; give people something, like: you give us your Ableton project and we'll give you back a version of it that we think is either worse or better, but at least machine learning has added something. And you can just keep throwing your projects at it, and at the same time we start to collect a bunch of them and train on them, and everybody wins, because people are like, hey, that's cool, I helped train this big project. It would be really fun to do, and arguably better to do in collaboration with a group like this than Google alone, because there are a number of great grad student projects living there. Right. Yeah. About a year ago, I was like, I'm totally out of my depth, but I managed to get it done, so it's definitely the case that people are willing to give. Maybe we can chat about it offline; I'd be curious to know more about what you did. So thank you very much again. Of course. Thanks for your attention, thanks for all your questions too. I'm hanging out a little bit if people want to chat more.