Take me to church. I'll worship like a dog at the shrine of your lies. I'll tell you my sins and you can sharpen your knife. Offer me that deathless death. Good God, let me give you my life. No masters or kings when the ritual begins. There's no sweeter innocence than our gentle sin. In the madness and soil of that sad earthly scene, only then I am human, only then I am clean.

There is this AI that can generate music in the style of famous singers. It can make a rapping Bruno Mars, a rock and roll Katy Perry, and so much more. But how does it do this? We're going to reconstruct this model intuitively and get into more detail as we go. No matter how much you know about AI, you should be able to walk away with some understanding of how this works, at whatever level of detail suits you. So let's get started.

In the first pass, we'll build this intuitively. We have an AI that takes in some lyrics, a genre, and an artist, and it generates a song. It does so in fixed-size chunks: the AI generates a chunk, then uses that chunk to generate the next chunk of audio, and repeats this until we get the entire song. This type of model takes in an audio sequence and generates the next audio sequence, so it is a sequence-to-sequence model. In the deep learning literature, transformer neural networks are the best at dealing with sequence data, at least for now, so we can replace the AI model with a transformer neural network. Pretty cool.

But we run into a problem when building this out: the raw audio waveform is huge. Songs on CDs have a sampling rate of 44.1 kHz, which means we use 44,100 numbers just to represent one second of audio. That's far too big for our model to handle, so we need to compress the waveform while retaining the key aspects of the music. This compression is done with a type of neural network architecture called an autoencoder. An autoencoder can take in a raw waveform and learn to compress it, and it can also learn to take the compressed audio and decompress it back to the original waveform.

Jukebox uses the autoencoder and the transformer together, and we build it in two phases: training and generation. During the training phase, we train the autoencoder to compress and decompress audio, and we train the transformer to take in some information about the song and generate a compressed vector representation of it, one chunk at a time. During the generation phase, we use both of them together: pass the lyrics, genre, and artist to the transformer, have it generate the compressed vectors one audio chunk at a time, pass each compressed vector through the autoencoder to get back the raw audio waveforms, stitch them together, and we have the generated song.

If we build our jukebox out this way, it might sound something like this. Hey Jukebox, have the anonymous celebrity rap to "Lose Yourself" by Eminem. Ah, music to my ears. It sounds like this because we built it based only on intuition; there are some key details and modifications we need to make before it's usable. Let's take a look at these in part two: building the real architecture.

We're using an autoencoder to compress and decompress a waveform. This job is split between the two parts of the autoencoder: its encoder for compression and its decoder for decompression. During training, a normal autoencoder seeks to minimize the reconstruction loss only.
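To make that concrete, here is a minimal sketch of a plain autoencoder trained only on reconstruction loss. It's an illustration in PyTorch with made-up layer sizes and chunk lengths, not the actual Jukebox code:

```python
import torch
import torch.nn as nn

class AutoEncoder(nn.Module):
    def __init__(self, chunk_len=4096, latent_dim=64):
        super().__init__()
        # Encoder: compress a raw audio chunk down to a small latent vector.
        self.encoder = nn.Sequential(
            nn.Linear(chunk_len, 512), nn.ReLU(),
            nn.Linear(512, latent_dim),
        )
        # Decoder: reconstruct the chunk from that latent vector.
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 512), nn.ReLU(),
            nn.Linear(512, chunk_len),
        )

    def forward(self, x):
        z = self.encoder(x)      # compressed representation
        return self.decoder(z)   # reconstructed audio chunk

model = AutoEncoder()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

audio_chunk = torch.randn(8, 4096)                 # a batch of fake waveform chunks
recon = model(audio_chunk)
loss = nn.functional.mse_loss(recon, audio_chunk)  # reconstruction loss, nothing else
loss.backward()
opt.step()
```

Notice that the only training signal here is how well the output matches the input.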
The autoencoder doesn't really care how it compresses the vector, as long as it can reconstruct the input as well as possible, so it learns some arbitrary function to compress and decompress the data. This is fine for training, but it's a problem when generating music. During generation, we don't have the encoder; we're working with just the decoder. Since we don't know what function our autoencoder learned, chances are that if we pass a compressed vector to the decoder, it's going to generate gibberish, particularly in the last leg of the flow. And that's why the anonymous rapping sounded so, let's say, beautiful.

This is where variational autoencoders are better. Like normal autoencoders, they minimize a reconstruction loss, but with the added constraint that the compressed vectors follow a specific distribution, typically a standard Gaussian. In other words, we know how the compression and decompression is happening. So during the music generation phase, where we don't have the encoder, we can sample a vector from the standard Gaussian distribution, pass it to the decoder, and it will give us some meaningful audio. Replacing our autoencoder with a variational autoencoder, let's see what we get with our jukebox. Hey Jukebox, have the anonymous celebrity rap to "Lose Yourself" by Eminem. These dreads, the tears, the throes, the very, the most, the rudeness, the unworthy, the spaghetti, the slurs, the slurs, the slurs, the clums, the rudeness, the slurs, the slurs, the clums, the slurs.

Okay, getting better. Our VAE learned to convert a continuous distribution of vectors into sound, but not all of those vectors produce meaningful, clear audio, and this is one reason some muddiness and unclarity is still there in our generated audio. To remedy this, we can have our VAE learn to map a specific set of discrete vectors to clear sounds. This is the basis for vector quantized variational autoencoders (VQ-VAEs). During training, we pass in raw audio, the encoder compresses it to a vector, we determine the closest discrete vector, and then we decode that into sound. During generation, the output of our transformer will be mapped to one of these discrete vectors, so we get meaningful audio output.

So vector quantized VAEs can give us clear sounds, but we can improve further by using multiple VQ-VAEs at different compression levels. This draws inspiration from a study on hierarchical vector quantized VAEs used in image compression. The goal there is to generate even more realistic images by using a hierarchy of two VQ-VAEs: a top VQ-VAE and a bottom VQ-VAE. Both of these learn different compressed representations of the image. The top-level VQ-VAE only learns the global information of the image, like the contours and large strokes, while the bottom-level VQ-VAE is larger and learns local pixel information like texture, shading, and color gradients. Both of these learned representations are fed into the bottom-level decoder to reconstruct the original image. The bottom representation is conditioned on the top representation, so it doesn't need to learn the entire representation itself; it makes the best use of its space, leading to stunning images during the generation phase.

Similarly, in our case we train three levels of VQ-VAEs: a top, a middle, and a bottom. The top is the most compressed and the bottom is the least compressed.
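Before going further, here is roughly what that "determine the closest discrete vector" step looks like in code. This is a sketch with assumed sizes (a codebook of 512 vectors of dimension 64), not the real Jukebox implementation:

```python
import torch

def quantize(latents, codebook):
    # latents:  (num_chunks, 64)  continuous encoder outputs
    # codebook: (512, 64)         the learned discrete code vectors
    dists = torch.cdist(latents, codebook)   # distance from each latent to every code
    codes = dists.argmin(dim=1)              # index of the closest code per chunk
    return codes, codebook[codes]            # indices plus the quantized vectors

codebook = torch.randn(512, 64)
latents = torch.randn(10, 64)                # e.g. 10 encoded chunks from one level
codes, quantized = quantize(latents, codebook)
print(codes.shape, quantized.shape)          # torch.Size([10]) torch.Size([10, 64])
```

The transformer we train later only ever has to pick one of these discrete codes, which is what makes its output decodable into clear audio.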
Unlike the image case, though, we don't feed the top to the middle and the middle to the bottom. This is because the researchers observed that the top and the middle were passing all the information to the bottom, which made them useless; it was similar to just having a single VQ-VAE. So they train the three VQ-VAEs separately and in parallel. It's interesting how we went from autoencoders to variational autoencoders to vector quantized variational autoencoders, and then to a hierarchical structure of the same, each step improving the generated audio.

Here's the initial structure that we built in part one. But now the autoencoder is actually three VQ-VAEs with different compression levels, and to have these representations interact with each other, we introduce three transformers. The top transformer takes in lyrics, genre, and artist information and produces the top-level compressed representation. The second transformer converts this to the middle compressed representation, and the third transformer converts that to the bottom compressed representation. We then pass this into the bottom VQ-VAE decoder that we trained previously to get the newly generated audio: the song. Hey Jukebox, have the anonymous celebrity rap to "Lose Yourself" by Eminem. His palms are sweaty, knees weak, arms are heavy, there's vomit on his sweater already, mom's spaghetti. He's nervous, but on the surface he looks calm and ready to drop bombs, but he keeps on forgetting. This is the final architecture and the end of part two. Things are looking a lot more concrete now. But how exactly are we training this thing?

In part three, let's start with the VQ-VAE training and then move on to transformer training. We start by training our three VQ-VAEs in parallel. Raw audio is a continuous stream, so we need to break it down into fixed-size chunks. Say we're dealing with a 20-second piece of audio. The top layer will break it down into 5 chunks, the mid layer into 10 chunks, and the bottom layer, let's say, into 20 chunks. We pass the chunks one at a time through the encoder to get these individual colored bars. Note that each colored bar has the same vector dimensions despite the widths being different: the top blue bar, for example, represents about four seconds of audio encoded into a 64-dimensional vector; the first purple bar in the middle is the first two seconds of audio encoded into a 64-dimensional vector; and the first bottom brown bar is just one second of audio encoded into a 64-dimensional vector. Passing all these chunks through the encoders gives us the compressed representations.

Now we perform vector quantization. For each of these colored bars, we determine the closest codebook vector; the codebook is a list of vectors. The blue one here in the top level is closest to vector three, the purple one is closest to five, and this magenta one is closest to the fourth. For the codebook lookup, we replace each of these numbers with the actual corresponding codebook vector. Then we decode each vector, one at a time, to get back the audio chunks, and stitch them together to get the original signal. During this training, we want to minimize the reconstruction loss, we want to learn the codebook vectors, and we also have a commitment loss to stabilize the encoder. That's the bulk of the hierarchical VQ-VAE training.
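Those three terms might look something like this. It's a rough sketch of the standard VQ-VAE objective for one level, not the exact Jukebox loss, which has additional details:

```python
import torch
import torch.nn.functional as F

def vqvae_loss(x, x_recon, z_e, z_q, beta=0.25):
    # x, x_recon: raw and reconstructed audio chunks
    # z_e: continuous encoder outputs; z_q: their nearest codebook vectors
    recon_loss = F.mse_loss(x_recon, x)            # reconstruct the waveform
    codebook_loss = F.mse_loss(z_q, z_e.detach())  # pull codebook vectors toward encoder outputs
    commit_loss = F.mse_loss(z_e, z_q.detach())    # keep the encoder committed to its chosen codes
    return recon_loss + codebook_loss + beta * commit_loss

# The codebook lookup itself isn't differentiable, so in the forward pass
# gradients are usually copied "straight through" from z_q back to z_e:
#   z_q = z_e + (z_q - z_e).detach()
```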
Once all three VQ-VAEs are trained, we train our transformers. We take a piece of audio and pass it through the three VQ-VAEs to get the top, middle, and bottom vector-quantized representations, which are then used to train the transformers. The transformers have an encoder-decoder architecture. The top-level transformer takes in the lyrics, artist, genre, and other conditioning information and generates some intermediate vectors. The decoder then takes these vectors and a start token and generates the first highly compressed vector, which represents a part of your generated song. This is compared to the top-level VQ-VAE vectors to minimize the difference during training. And it does this sequentially, the decoder generating one vector at a time.

That's the top-level transformer; moving on to the mid-level transformer. This takes one of the top-level VQ-VAE vectors, plus the lyrics, genre, and artist information, and encodes it all into some vectors. The decoder uses these to generate two vectors, one at a time. During training, these vectors are compared to the mid-level VQ-VAE representation to minimize the difference. The input vector represents two seconds of audio, but the transformer converts it into two vectors that represent one second of audio each. I say two, assuming that the compression rate of the mid-level VQ-VAE is two times that of the top level; in the actual implementation the blog says four times as much, but you get the idea.

The bottom-level transformer works in a similar way. It takes a vector from the mid-level VQ-VAE as input; this, along with the lyrics and genre, is encoded into vectors. These are passed into the decoder to sequentially generate two vectors, one at a time, which are compared to the bottom VQ-VAE vectors to minimize that difference. If the input corresponds to one second of audio, the output would be two vectors corresponding to half a second of audio each. In this way, we can pass a number of audio clips to our transformers to train them.

Sweet. So now, at generation time, we take the lyrics, genre, and artist information, encode them into vectors, and pass them to the top-level transformer to get the most compressed representations. We pass each of these to the mid-level transformer to decompress, pass each of those to the bottom-level transformer to decompress further, and then pass each of the resulting vectors to the bottom-level VQ-VAE decoder to generate audio chunks. Stitch them together and we get new music.

And that's it. I hope this covered different levels of understanding. There's still a lot of detail I left out, but if you made it this far, check the description below: I'll add some references there, including research papers and blog posts you can dig into. Thanks for watching, and I will see you soon. Bye-bye!