Thanks for that nice introduction. We're going to be talking about the technical aspects of a project that we did last quarter. I did this with my colleague Philip, who also works at Square. Philip's been at Square for about a year now; he actually came from an AI startup that we acquired, and he helped set up a lot of the infrastructure related to this project. As I mentioned, this is closely related to a project that I did last fall, and I'm going to give you a little introduction to some of the musical aspects of it.

To give you a quick layout of what this talk covers: since this is a more technical audience we're going to focus on the technical material, but we're not going to make too many assumptions, because we're still talking about data, and sometimes the representation of the data makes a difference in how we think about it. And not everyone here has a background in music, although I imagine a lot of you do.

So first, a question: why are we even approaching this? Is it novel, or is it totally bizarre and out there? It's really not that novel at all. People have been doing computer-generated improvised music for quite a while using different methods, and of course with different resources. I have no idea what the code at the bottom of this slide does; I can tell it generates some sort of output every five to seven seconds, and the author didn't really share much more about it. But this is George Lewis. He's a member of the Association for the Advancement of Creative Musicians, a Chicago organization for improvising jazz musicians that's been around since the mid-1960s, and he's always been very interested in ideas of computer creativity and human-computer interaction in live improvised music. One key thing to note here, and I'm using this slide as a catch-all history of computer music up until recently, is MIDI-format input and output. You could have some sort of pitch-content analyzer that converts waveform audio into MIDI format, but essentially the mechanism works with a MIDI representation, and I'll talk about what that means in a couple of slides.

Now, in the context of how a lot of people have approached generating music algorithmically using stochastic models: I imagine everyone here is familiar with Markov chains, but for anyone who isn't, you can think of the process as a weight matrix where the rows represent your starting state, the current state of a variable, and the columns represent the probability of moving to some other state. For example, in a world where there are three possible musical events, say three notes A, B, and C, you can imagine that every one of these arrows carries a value between zero and one that corresponds to the probability of moving to that other state given that you are currently in the present state.
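To make the transition-matrix idea concrete, here is a minimal sketch (not anyone's actual code from the slides) of a first-order Markov chain over the three notes A, B, and C, with the matrix estimated by counting and then sampled from; the training sequence is made up:

```python
import numpy as np

notes = ["A", "B", "C"]
index = {n: i for i, n in enumerate(notes)}

# Hypothetical monophonic training sequence.
sequence = ["A", "B", "C", "A", "B", "A", "C", "C", "B", "A"]

counts = np.zeros((3, 3))
for current, nxt in zip(sequence, sequence[1:]):
    counts[index[current], index[nxt]] += 1

# Rows are the current state; columns are P(next state | current state).
transition = counts / counts.sum(axis=1, keepdims=True)

# Generate by repeatedly sampling the next state from the current row.
rng = np.random.default_rng(0)
state = "A"
generated = [state]
for _ in range(8):
    state = rng.choice(notes, p=transition[index[state]])
    generated.append(state)
print(generated)
```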
I'll pause the video while I describe it. What's going on in this one is that you can learn a Markov model's state transition matrix by analyzing data. Given a sequence of events, say some monophonic source material, you can estimate the probability of moving to state B given that you are currently in state A by counting the frequency of transitions from A to every other state accessible from A. Do that for every possible state, and you have a representation that can generate audio. The person who made this YouTube clip built a module that analyzes input data, again in MIDI format; in this case it appears to be trained on Debussy, which is why a lot of the harmonic material sounds like what you'd imagine from French impressionism. You can tell it's a second-order Markov chain. There are variable-order Markov chains: a first-order chain gives the probability of moving to some other state given your current state only, while a second-order chain conditions on your current state and your previous state. Essentially, this can't represent overarching musical structure unless that structure is somehow correlated with two-state movements, which is why you get this cute impressionistic sound but nothing with any real musical form.

Okay, now let me talk about different representations, and this is where I'll do some clarification. How many of you read music, just so I can get a sense? Wow, that's plenty of you, so I won't spend too much time on this. The staff is the traditional set of instructions, the traditional data representation, for human interpreters. But everything about a staff can be represented digitally: you can represent rhythm and you can represent frequency content, and that's usually done in three data formats: MIDI, LilyPond, and MusicXML. One thing that's happening here is that you can represent what's representable in a staff very well, but that doesn't cover things like texture, which are essential musical qualities. For example, two musicians can interpret the same piece of music: the staff is identical, so the staff preserves the identity of the music, but it doesn't necessarily preserve the quality of the performance. Glenn Gould performing something might be a lot more enjoyable than some student's performance.

Since so many of you read music I won't say much more, but essentially: vertically you have pitch representation, horizontally you have time representation, different types of symbols determine the duration of the pitch content, and in this particular case, using jazz as an example, there is a somewhat explicit description of the harmonic context at any point in time. That means the harmonic context for the melody you hear is also providing source material for an improvisation that could be performed on top of it. However, we wanted to work with something more interesting, and harder, and we were somewhat constrained by the aesthetic goals of the musicians with whom we were working. We're working with a waveform.
The waveform is probably the most physically real representation. This is a discrete waveform, but it's the most physically objective representation of music: discrete measurements of air pressure over time. The reason I mention the type of music created by these improvisers and this composer is simply that I imagine a lot of you haven't heard this kind of music. It's what some people would call non-idiomatic free improvisation, so the question of pitch classes doesn't actually come up, which makes this a very difficult problem, especially for evaluation: who's to say it's actually creating the right thing when there was no explicit pitch content to begin with? I'll make that caveat, and a hundred others, throughout the presentation.

So part of the objective here: since we're working with a waveform and we want to capture information that is less lossy than even a human transcription, how are we going to do that? It's really tricky. Hand-engineering features for anything is very difficult, especially for very high-resolution data like images or waveforms. The easy solution is to automate it, and Philip's going to talk about that, but let me quickly present our framework: essentially we're going to try to approximate some sort of dynamical system with machine learning. That graph is just the waveform over time, and the hope of this project is that if it works very well and we can predict music, then we'll eventually be very rich.

So now I'll have Philip talk about deep learning. Some of this might be pretty basic for some of you; we're anticipating a mixed audience, so we'll see.

Hi. I'm not sure how familiar you all are with the concept of deep learning. I'm sure you've heard about deep learning in articles in the news, where these models are painted as groundbreaking or as having incredible predictive performance. In some ways they are, but the real ingenuity in these models doesn't lie where you might think. Hopefully by the end of this section you'll appreciate the inherent abilities of these models and the reason Juan chose to use them in the context of music generation.

Just to get the very basic definition out of the way: even though the research surrounding deep learning is very involved, the main concept can be summarized in three basic points. First, deep learning can be thought of as a collection of machine learning algorithms that are based on statistical theory. Second, and this is probably the most important part, the algorithms attempt to learn feature hierarchies that can better solve complex problems. And finally, they're usually implemented with neural networks.

How many of you are familiar with neural networks and how they basically work? Okay, brilliant, so I'll go fairly fast through these slides. A lot of people like to say that artificial neural networks model the way the human brain works, but that's a very crude simplification. In the example on the slide we have a neuron, the one with the blue ring. It receives a bunch of inputs and it creates only one output.
In this case it creates an output value between zero and one. Each connection to the neuron is controlled by a weight. The neuron generates an output by multiplying its inputs by their respective weights and summing the products, giving z in the formula above, then passing that sum through an activation function. For demonstration purposes we chose the sigmoid function, which generates a value between zero and one: anything strongly negative maps close to zero, anything strongly positive maps close to one.

So let's say we're trying to predict whether a picture contains a dog, or some animal: one if the animal is present, zero if it isn't. We pass our picture data through the network and the output ends up being completely wrong; it doesn't represent what we were expecting at all. There are multiple ways to get the network to the point where we show it a picture and it returns the value we expect. We could update the weights randomly and hope it eventually gets there, but there's a better way, and it's usually done with calculus. More formally this is known as the backpropagation algorithm, and what it essentially does is iteratively update the weights based on a specific cost function. In this case we're trying to minimize the distance between the prediction and the target: whatever value the output generates, we want it to get closer to one, and we do that by taking derivatives and stepping in the direction that minimizes this distance. By doing this enough times, the network should get better at predicting the desired output. The whole process can be visualized as going down an error slope: if you plot the error for one weight w, you hope you get some sort of bowl shape, and you want to go as far down as possible, to the minimum error for that weight.

You're probably wondering what is so amazing about this process. Well, this is where it gets really interesting: when you start stacking layers of neurons on top of previous layers, you force subsequent layers to base their outputs on the outputs of previous neurons. The neurons in the second layer receive as input only the outputs of their predecessors. We can visualize this with the example on the right. Say we're doing facial recognition: maybe the first layer detects edges formed by differentials between pixels, the second layer composes facial features from those edges, and finally generalized facial structures are created from those features and used to formulate the prediction. As you can imagine, it's easier to predict whether a face is the one you're expecting if you have a much more solid representation of what a face is.
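As a concrete illustration of the single neuron and the weight updates just described, here is a minimal sketch of a sigmoid neuron trained by gradient descent on one made-up example; the inputs, learning rate, and squared-error cost are illustrative assumptions, not anything from the slides:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical input features and target label (1 = animal present).
x = np.array([0.5, -1.2, 3.0])
y = 1.0
w = np.zeros(3)             # connection weights
b = 0.0                     # bias
lr = 0.1                    # learning rate

for step in range(100):
    z = np.dot(w, x) + b            # weighted sum of inputs
    p = sigmoid(z)                  # activation: output in (0, 1)
    # Squared-error cost; its gradient w.r.t. w follows the chain rule:
    # dC/dw = (p - y) * sigmoid'(z) * x, where sigmoid'(z) = p * (1 - p).
    grad = (p - y) * p * (1 - p) * x
    w -= lr * grad                  # step down the error slope
    b -= lr * (p - y) * p * (1 - p)

print(sigmoid(np.dot(w, x) + b))    # now much closer to 1 than the initial 0.5
```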
Not all deep learning architectures perform the same in all applications. Up until roughly the 1980s the densely connected architecture was probably the most common one; convolutional layers came later, from Yann LeCun, and then there are recurrent layers, which is what our talk today is based on. These all have different applications: mainly, convolutional layers perform better at image recognition problems, and recurrent layers usually perform better when you're trying to model sequences or time-dependent information. All of these layers have been used over the years to solve various problems. Some of the cooler ones are in robotics, or in gaming, where an AI can learn to play Pong, for example, and more recently generative adversarial networks are able to generate faces basically from a description: you describe the face you want and it generates one based on what you give it.

Earlier I mentioned where recurrent neural networks usually perform better, and that is temporal information. Because we're trying to model sequences of notes, or amplitudes in this case, we'd like some way of modeling temporal data, and this is how RNNs do it: we take a sequence and feed each time step, in order, through the same neuron. So if we have a sequence from step zero to step T, at every step we feed a different value through the neuron and do the same thing over and over again. It turns out that when you do this you are essentially generating a feedforward network, except that instead of having different neurons at each step it's literally the same neuron at different times. This is important because the only parameters we really care about during training are U, V, and W, which greatly reduces the number of parameters we need to update during backpropagation.

Having said that, RNNs have been proven in theory to be able to model any sort of sequence, but in practice that hasn't been the case for longer sequences, due to what is now known as the vanishing gradient problem: if you apply the sigmoid function many times over and over, you end up with a very flat response, and the gradients you backpropagate shrink toward zero.
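To see that flattening numerically, here is a toy sketch: backpropagating through a chain of sigmoid activations multiplies in one sigmoid derivative per unrolled step, and since that derivative is at most 0.25, the gradient factor collapses quickly (weights are omitted for simplicity; this is only a simplified illustration):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# sigmoid'(z) = s * (1 - s) is at most 0.25, so the product shrinks fast.
z = 0.5
gradient = 1.0
for t in range(1, 21):
    s = sigmoid(z)
    gradient *= s * (1 - s)   # one more time step in the unrolled chain
    if t in (1, 5, 10, 20):
        print(f"after {t:2d} steps: gradient factor = {gradient:.2e}")
```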
This was the motivation for formulating the LSTM cell, the long short-term memory cell. To better understand the major differences between an LSTM cell and a simple neuron, we're going to visualize the simple neuron as a repeating module. This is the way Christopher Olah describes LSTM networks in his excellent blog, which I've linked at the bottom. As you can see, the only information passed from time step to time step is the current input and the previous output of the neuron. The major difference with LSTM cells is that they additionally incorporate a cell state: an internal characteristic that acts as an additional parameter for estimating what we want to keep in the sequence and what we don't. You can imagine it as an assembly line that runs through the neuron; at various points the cell decides, using gates, whether it wants to keep information, forget it, or add to it based on what it's given. These gates work with mechanics similar to logic gates on a silicon chip.

The first step: based on what you were given in the past and what you're given at the current time step, decide whether you want to keep the cell state as it is. That's what the forget gate decides. It generates a value between zero and one based on the previous output and the current input, and that value scales the magnitude of every entry of the cell-state vector; if the gate deems some piece of information negligible, it gets forgotten.

The next step: do we want to add any new information at this time step? This is done in two parts. First we pass the previous output and the current input through a tanh activation function, which creates a candidate update to the current cell state. Then we perform a function similar to the forget gate, but instead of deciding what to forget, we decide what to add: this generates a sparse gating vector that only passes through the values we actually want to add to the cell state.

The final step: what are we going to output at this time step? That's done very similarly to a simple neuron, but here it's the cell state that dictates what the output will be, and again we decide which part of the cell state, the most important part for this step, we want to output.

That was all fairly convoluted, but this example might be more intuitive. It's based on NLP problems, modeling sequences of words. We're going to set up our problem in a similar fashion: we give a sequence of amplitudes or notes and expect a sequence in return. In the NLP case the words fed in are "are you free tomorrow" and the expected output is "yes what's up". This is basically how we set up our problem too, but instead of words we'll be using amplitudes or notes.
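Putting those gates together, here is a minimal NumPy sketch of a single LSTM step in the spirit of the equations in the Olah blog; the sizes and random weights are placeholders, since in practice this all comes from a library:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
n_in, n_hid = 4, 3
Wf, Wi, Wc, Wo = (rng.normal(scale=0.1, size=(n_hid, n_in + n_hid))
                  for _ in range(4))
bf = bi = bc = bo = np.zeros(n_hid)

def lstm_step(x_t, h_prev, c_prev):
    v = np.concatenate([h_prev, x_t])        # previous output + current input
    f = sigmoid(Wf @ v + bf)                 # forget gate: what to keep
    i = sigmoid(Wi @ v + bi)                 # input gate: what to add
    c_tilde = np.tanh(Wc @ v + bc)           # candidate cell-state update
    c = f * c_prev + i * c_tilde             # the "assembly line" cell state
    o = sigmoid(Wo @ v + bo)                 # output gate: what to emit
    h = o * np.tanh(c)                       # output depends on cell state
    return h, c

h, c = np.zeros(n_hid), np.zeros(n_hid)
for x_t in rng.normal(size=(5, n_in)):       # feed a 5-step dummy sequence
    h, c = lstm_step(x_t, h, c)
print(h)
```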
Okay, cool. So specifically, what methods did we use? One of the things we learned pretty quickly was that our assumption that we could just use raw waveform data was false and disappointing, and actually pretty obvious in retrospect. It turns out that if you try to model even a periodic but fairly simple physical event, like movement, and you have some evaluative measure of accuracy, then a model that predicts that the value at the next time step is just the current value, or the current value plus some generic momentum, is going to score very high. If you try to backprop through that, it's not going to really improve. Intuitively, imagine I show you a car driving across a map and ask you: it's 5:01 p.m. and one second; what's the position of this car at 5:01 p.m. and two seconds? That's a very easy problem, which means you don't accumulate error, and your backpropagation won't learn anything. So we needed to create data that is more interesting and more meaningful, in a way that's also challenging enough that the model will actually learn through backpropagation.

The way we did this is pretty standard: the discrete Fourier transform, which lets us go from a time-domain signal to the frequency domain. As you can see, we go from a vector and map it to a matrix, so we're doing a kind of dimensionality expansion. The graph is probably the best explanation here: the red line represents the actual data, the actual shape of the waveform, and we decompose it into coefficients. This axis holds the coefficients of all the simple phasor components, and each coefficient corresponds to that phasor's contribution to the complex signal. Essentially we're creating a new basis for the signal, and it's actually really easy with linear algebra, just an inner product, so I won't go through it, but I will talk about the intuition of what this does.

We went from a waveform that is difficult even to understand visually (if you looked at the waveform of any meaningful stretch of music, it would just be a garbled mess) to something structured. If you've watched a visualization while listening to music in something like Windows Media Player, that's a kind of pointless feature because it doesn't really show you much, but a spectrogram is different. A spectrogram looks continuous but it isn't: for every period of time along the x-axis you actually have a discrete window of time, and each window holds a Fourier analysis of the simple phasor components of the signal in that window. So you're expanding your dimensionality from one value per sample to around 2,000 per window. That number is somewhat arbitrary; you can select as many frequency components as you want. But it means we have a problem that uses a lot of data, which means we had to get budget from Square and spin up some AWS nodes and do something cool, and that's what we did.

This lets us formulate the problem a little more richly than the setup I described previously, where you just map x at t+1 given x up to t. We're now looking at an array of data for each time point: the rows correspond to the discrete time windows and the columns correspond to the frequency components, or rather the coefficients associated with them. As for the phase components, which are the imaginary numbers, we basically truncated that information to artificially transpose everything into real numbers, because that helps with backprop.
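Here is a minimal sketch of that windowed transform, a bare-bones short-time Fourier transform over non-overlapping windows with the real-part truncation just described; the window size, sample rate, and test signal are illustrative, not the talk's actual parameters:

```python
import numpy as np

# A minimal STFT sketch, assuming a mono signal and non-overlapping windows.
def stft(signal, window_size=4096):
    n_windows = len(signal) // window_size
    frames = signal[: n_windows * window_size].reshape(n_windows, window_size)
    # One DFT per window: rows = time windows, columns = complex coefficients.
    return np.fft.rfft(frames, axis=1)

# Dummy input: one second of a 440 Hz sine at 44.1 kHz.
sr = 44100
t = np.arange(sr) / sr
spectra = stft(np.sin(2 * np.pi * 440 * t))

# Truncating to real numbers, as described above, discards phase detail.
X = spectra.real.astype(np.float32)
print(spectra.shape, X.shape)   # (10, 2049) for these parameters
```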
So essentially we have a spectrum and phase. If you think about how this is going to work, you can think in terms of X and Y matrices, like a standard supervised machine learning problem. Usually if you're doing something like a regression your Y is a vector; here it gets to be a matrix as well, and it's just the X matrix with one period of lag. That's our training setup, except in reality we're considering not just the current row of the X matrix but all preceding rows, because the memory of previous states is carried into the current state during backpropagation.

The implementation is totally unremarkable, because Keras makes things easy. We don't have a GitHub repo; this slide is our GitHub repo, so take screenshots if you want. You create a sequential model, do all your data processing to get into the format I just described, and train. We used a lot of dropout, which is probably just a function of the fact that we expanded into so many frequency components: most audible sound sits within a very small range of those components, and musical signals can be expressed with very few of them. I wish I had known this before spending several hours and several of Square's dollars, because the processing is something we could definitely optimize.

We did two LSTM layers. Let me state up front that we were somewhat constrained on resources; we weren't necessarily making the best case for why Square should invest in our infrastructure by saying we're going to make some music. As you can see, the batch size is only 32, so there were a lot of memory constraints, and some time constraints on the project as well, and the topology is obviously constrained too, but this is what we could do. It's basically two recurrent layers, which hopefully allow for some sort of hierarchical representation of the form of the music, followed by time-distributed dense layers, which are just dense networks applied at each time step; you can think of the last few components as a standard feedforward network. We used RMSprop, because essentially we're doing a regression at the output.

On the question of tempo: the frame rate is consistent, so you always have a uniform duration for each window. You could think of that as tempo, but not tempo in the sense of four-four; there's no meter to it, no explicit definition of meter, which makes things easier for the music we're working with, along with a million other caveats. So that works; we don't know yet if it sounds good, I'm just explaining the framework.

Then there's also a generator function, which is also completely unremarkable; really, the lesson here is that Keras has made all of this trivial. The way generation works is that we have a seed, say two windows of data, and the model uses those windows to predict the next step; it appends that next step to the seed and continues the same process forever. The problem is that because you're using an ever-larger seed, it's really inefficient. That's one item on a list of caveats surrounding this project. Oh, and it's really convenient that we went from waveform to spectrum, because we can go right back: the transform is invertible.
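Since the actual repo only exists as a slide, here is a hedged reconstruction of that kind of Keras model and generator loop based on the description above; the layer sizes, number of frequency bins, and dummy data are guesses, not the real configuration:

```python
import numpy as np
from keras.models import Sequential
from keras.layers import LSTM, Dropout, TimeDistributed, Dense

n_freq = 2049   # illustrative number of spectral coefficients per window

# Two recurrent layers, heavy dropout, time-distributed dense regression.
model = Sequential()
model.add(LSTM(512, return_sequences=True, input_shape=(None, n_freq)))
model.add(Dropout(0.5))
model.add(LSTM(512, return_sequences=True))
model.add(Dropout(0.5))
model.add(TimeDistributed(Dense(n_freq)))   # regress each next window
model.compile(loss="mean_squared_error", optimizer="rmsprop")

# Y is X with one period of lag: predict the next spectral window.
X = np.random.randn(100, 40, n_freq).astype(np.float32)  # dummy spectra
Y = np.roll(X, -1, axis=1)   # toy target; the final window wraps around
model.fit(X, Y, batch_size=32, epochs=1)

# Generation: append each predicted window to the seed and repeat, which
# is simple but inefficient, since the seed keeps growing.
seed = X[:1]
for _ in range(16):
    next_window = model.predict(seed)[:, -1:, :]
    seed = np.concatenate([seed, next_window], axis=1)
```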
So the pipeline is pretty straightforward: you take some waveform, you do an FFT on it, you train a model on it, and you generate audio, though that last arrow should really be a question mark, because the question is how it sounds. How are we doing on time? Any questions so far?

Okay, so these hashtags are kind of my evaluation metrics: SoundCloud comments and hashtags. I'll show you a bunch of different samples; some are directly related to this more constrained project, and some were more experimental. Before the first sample, for the people who walked in late, something about the training data: the music we're working with is, as I mentioned before, non-idiomatic free improvisation, historically contextualized as post-modern music or modernism, which discards the notions of pitch and tempo as requisites for musical form. That's the biggest caveat. So I'll give some examples that will not blow your mind, and then there are certain things that will.

This first one is actually very musical. This is, sorry, not a sine wave but a flute, which is acoustically most similar to a sine wave, created by the network. That one is unremarkable; it's just a first test: does it work? It works. The next one is trained on some other music, a guitar and a cornet with some electronic effects. At the beginning you'll hear a bit of seed that sounds very coherent, and then some more material that sounds not terribly incoherent, which probably explains the hashtag I got about form being encoded in it; you'd have to hear the original piece to know how that corresponds to the form of the song. It's also kind of a droney improv piece, but this one is a little more harmonically stable, making it easier to train on. What I find very interesting is that it does pick up little fragments of phrases: given that the time steps we're actually predicting are maybe a couple hundred milliseconds at most, those melodic phrases are actually generated very discretely, piece by piece, which I think is cool.

All right, now I'm going to play output from the model we were actually training this big thing on. You'll hear that the first couple are failures. Some of them require some narration: in this case we had one microphone for three audio sources, a trumpet, a piano, and a bassist, all playing extended technique. One thing that's really interesting is that sometimes the generated content is some sort of aggregate of the spectra of those three instruments, which is a very unique sound, so maybe that's an idea for people interested in audio synthesis.
At other times it actually makes pretty interesting separated sounds as well. So here we go; these are generated, this is output from the network. The seed here is not the training data: now that I have a trained network, I can pass it some data and have it predict the next thing, and it uses that prediction as part of its new seed and keeps concatenating.

[Question about the training corpus.] So, partially due to our resource constraints we weren't able to train on an enormous corpus. We actually gathered hours and hours of data, but it took several hours to train a network on maybe 30 seconds. So that's what this is: about 30 seconds of recorded audio. The network picks up on the patterns and the phrases and the behavior of the improvisers within that 30-second window, and then you give it new data, and it's basically being asked: complete the sentence for me, in the style of those 30 seconds you previously saw. Yes, it's trained offline, exactly; now we're just passing it data and having it make predictions. One way to think about it: say you have a linear regression or a time-series forecast trained on previous stock data, and today this happened in the news, so what do you think is going to happen to stock prices? It's doing exactly that, but using those predictions to generate the next prediction. If you allow it to go crazy and predict the next hundred stock prices, for example, that's the kind of thing this is doing.

All right, I'm going to play a couple of the more successful ones. The difference between these was the seeds, and for some of them I trained on different 30-second snippets, so both. And here's the thing: I'm not an expert in music information retrieval; I do stuff with neural networks in a completely different domain, so I'm at a loss for how to actually optimize this. With certain training data, certain seeds, and certain topologies we got lucky, and that's as much as I can say from a research-conclusion perspective. But it was fun.

[Question about sparsity and silence.] So the question was: what's the difference in the source material that resulted in the sparsity or silence? Did I get that right? There is some sparsity in the source material; it's music that is very freeform, so the silence itself is a statement, a musical statement. The thing is, when you have silence it's very difficult to make predictions from it. Sometimes when you hear what sounds like static or random noise, my assumption is that you're in a current frequency distribution from which the network doesn't know where to go next, because it's so different from anything it's seen in its training data. Frankly, I don't think it generalizes extremely well: I think if you passed it a seed of, say, pop music, it would still respond with free-jazz trumpet; it would have to. It's still kind of an open question when it works and when it doesn't, and some of it has to do with similarity.
[Question:] Related to that last question: you briefly displayed your code for generating the predictions; do you incorporate any kind of random sampling, a temperature kind of thing? I ran into a similar problem using more discrete sheet-music data, and without a random element it would often get stuck on the same note over and over again.

Yeah, I saw that early on. When I incorporated dropout into the network, you're basically incorporating random selection into the network itself. For anyone not familiar with dropout: say you have a network with maybe 2,000 inputs; you pass in a probability of retaining those inputs at each training step, so on training step 100 it might ignore, say, 10% of the inputs. What ends up happening is the network has to form several representations of the same problem, which actually helps it generalize better. But yes, I also had another generator function that inserts random noise, and that would help dislodge it, though sometimes it would derail the generation when it was going well.

Let me go through these last slides. I have a whole slide of future research and discussion, which I'll keep open while we take questions because I think it'll spark a lot of them. But before that, in case anyone has to go, I want to give thank-yous: to Bids, to Nick, to Stan (nobody here knows who Stan is; he's not in this room, he works at DeepMind, he's a good friend of mine), to Philip, and to Steve Joie, who is pretty amazing if anyone's interested in music information retrieval; to the musicians who helped gather data and performed, Darren Johnston, Myra Melford, Lisa Mezzacappa, and Scott Rubin; and to Square for letting us use their resources. Okay, back to non-thank-yous: questions?

[Question about the phase components.] For the phase components we discarded the imaginary parts, I believe; I'd have to go back through my code. Why? Because we're doing backprop, we want to backprop over real numbers. Frankly, I think you could discard the phase components entirely, re-synthesize, and just lose the phase information: I kept the phase components but discarded the imaginary numbers, but you could also do this discarding the phase entirely and looking only at the amplitude of the frequency spectrum. It doesn't sound that great, though, and this might also benefit from somehow preserving the imaginary components. If you have questions, use the mic, it's better.

[Question: how many frequency components did you use?] That's a good question; trying to remember all the parameters off the top of my head, I think we used somewhere around 2,000 frequency components. That actually brings us to a really good bullet point on here, which is: what's a better way of processing our data? That matters both for improving training and improving runtime, because we're essentially training a larger network than we really need. If we could intelligently reduce the number of frequency components we're looking at, it would make a lot of sense to do so: certain things with a high-pass and low-pass
filter, cutting out things that are outside the audible range, but then also doing some sort of low-rank approximation. [Comment: the high number of frequencies could also contribute to the silence.] Yes, I agree.

[Question:] Following that question, I'm not sure I understand how you can generate the data if you're training on just the real part of it; you have to have both when you generate, right? How do you generate the other half of the data from a network trained on one type? If I go back to my code, I know I did something weird in that function, and I don't know exactly what I did with the phase components, so I could give you a more straightforward answer if you want to discuss it afterward.

[Question:] I only saw the code briefly, but did you include a discriminator network, and if not, I'm curious why not: a discriminator network as a form of training the generative network. So you're referring to adversarial networks, right? Yes, this is something we want to try in the future. Right now generative adversarial networks are all the hype in generative problems, and the reason is, as you mentioned, that if you have a discriminator network training the generative network, your cost function becomes much more expressive. [Did you not include that because of constraints?] There are two parts to that answer: first, constraints in terms of infrastructure; and second, more importantly, as far as I know there aren't any good examples where generative adversarial networks have been applied to temporal problems. They work brilliantly for generating images, which have no temporal information, but up to now I don't know if anyone has done something like that, so if you can, that would be awesome.

[Question:] I have a beginner's question: what would happen if you had used as your training data something very episodic, like marches, something very predictable? So there's a paper that came out of a machine learning course a couple of years ago, I can't remember what it's called, but they used GRUs and LSTMs on house music ("house music" being non-pejorative here), and it wasn't too much of a challenge: you can definitely represent longer-scale periodicity pretty well. [So you didn't do it because it wouldn't be interesting or hard enough?] It's also a question of collaborators and the idea of the project: this was originally conceived as a digital humanities project in collaboration with the UC Berkeley music department and the Center for New Music and Audio Technologies at Berkeley.

Let me read through these bullet points, and you can go if you have no questions. Some things that may have been on your mind, related to questions that have already come up: definitely computing bottlenecks, which are related to how we process data; if we could reduce the dimensionality, that would be great. If we could automatically extract features: one thought is that if we could use convolutions to simulate the Fourier transform, that might reduce the dimensionality, and you could basically test different parameters on the dimensionality of that output and try to select the optimal one.
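As one hypothetical version of the filtering idea mentioned above, here is a sketch that keeps only the spectral bins inside an assumed band of interest, shrinking the number of columns the model has to handle; the cutoffs and shapes are illustrative, not anything we actually ran:

```python
import numpy as np

# Keep only spectral bins inside a plausible band and drop the rest,
# reducing the model's input dimensionality. Cutoffs are illustrative.
def band_limit(spectra, sample_rate=44100, low_hz=40.0, high_hz=8000.0):
    n_bins = spectra.shape[1]
    window_size = 2 * (n_bins - 1)            # invert the rfft bin count
    freqs = np.fft.rfftfreq(window_size, d=1.0 / sample_rate)
    keep = (freqs >= low_hz) & (freqs <= high_hz)
    return spectra[:, keep], keep             # keep the mask to invert later

spectra = np.random.randn(10, 2049)           # dummy (windows, bins) matrix
reduced, mask = band_limit(spectra)
print(spectra.shape, "->", reduced.shape)     # far fewer columns to model
```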
One question about frameworks that I think is less technical but very interesting: in this case we had a single microphone capturing a polyphonic audio source with three different voices, so the training problem and the prediction problem was: given the current state of this one audio source, predict the next state of this one audio source. If you think of it as building a band, you're training a new band member to do what the band is already doing, and then telling the band, play along with this thing that produces what you're already producing; that doesn't intuitively make much sense. Another framework would be: suppose you're working with a very small ensemble, a trio or a duo, and you predict the conditional next step of one member of the band, conditioned on the behavior of the other band members as well as that same member. That could be a very useful framework for this. And as I mentioned, I don't really have the domain expertise to have determined an appropriate sample rate and lag time; that would definitely have helped us.

Another point, kind of related to Nick's question: right now we're working on a pretty challenging problem, and what we've learned is that, given our resources and our lack of expertise, we can't really do things as cool as what Google is doing with WaveNet. We're not able to train something for 30,000 epochs on a corpus of five hours of data in a week; we don't have that option. But other ways of experimenting with this, which I think is also relevant for student researchers without similar resources, involve working with discrete pitch data. I know some members here have ideas on where to get some of that, and I'd be very interested in looking at it, because I think you could do something that is perhaps more consumable as music by a general audience. Any other questions?

[Last question:] Just to tie it back to the data science community at large, including Square and everyone here on the Berkeley campus: some of us can begin to imagine how we might apply this in different domains, and there are a lot of time-series applications; what are your thoughts on that? Yes, this is one application but not the only one. At Square we use LSTMs for a lot of different problems. One of the things Philip has worked on specifically: Juan and I both work on the Capital team at Square, which facilitates loans to Square merchants, and it's a very interesting question to know when someone is ready to accept an offer that we've sent them. Based on their temporal patterns, how they use the application, their transactions, and any other user information we can gather, we're able to build a very accurate model that predicts the probability of them accepting a loan offer at that specific time. So it's not totally useless. And I imagine there are other applications in NLP; in translation, for example, it's pretty useful to look at translation not simply as a word-for-word mapping but as a sequential mapping.
So I imagine a lot of you can think of problems you can frame as sequential prediction, taking into account a lot of information at each time step, and you can probably apply LSTMs to them. Essentially, neural networks are abstractions built on linear models, so any problem you can solve with a linear model can be approached with a neural network. Any last questions? Well, let's thank our data science team from Square. Thank you.