Over the last decade, deep learning has taken the field of AI by storm. Using neural networks, we can now solve a host of problems. One example is object detection: feed the network an image, and it will identify the locations of important objects in that image. Another is language translation: feed a neural network an English sentence, and it will spit out the equivalent in French. Yet another is audio classification: feed the neural network a sound wave, and it will determine what produced that sound. So if it hears a bark, it spits out dog, and if it hears a meow, it spits out cat. You can see these problems are quite different. They have completely different inputs and outputs. However, they all have one thing in common: in each case, the neural network processes the input sample and spits out a result that gives us some additional information about that input. Take object detection. We give the network an input image, and after processing it, we know what objects are present and where they are located in the image. That's additional information. In the language translation case, we give an input sentence in English, and after the network processes it, we know how to say the same sentence in another language, like French. That's additional information too. And in the audio classification case, we feed in an audio sample, and after processing it, the network tells us what animal made that sound. The identity of that animal is, again, additional information.

However, there is a category of networks that are a bit different, in the sense that they don't merely provide additional information about some input sample; they also try to create, or generate, a sample themselves, be it an image, audio, or text. This class of neural networks is called, appropriately, generative models. In this video, we're going to go through a particular type of generative model called a variational autoencoder, or VAE. The explanation will be twofold. I'll start with an easy-to-understand intuition on VAEs. Once we have a firm understanding of them, we'll compare them to another type of generative model that has been hogging the spotlight recently: generative adversarial networks, or GANs. Techie or not, you'll walk out with newfound knowledge in generative modeling and variational autoencoders. I'm also going to throw in some technical jargon for you extra curious viewers. This is Code Emporium, so let's get started.

Let's start with a broad concept: generative modeling. Generative models are also just neural networks themselves. Normal neural network models usually take some sample as input, where the sample is raw data like an image, text, or audio. Generative models, on the other hand, produce a sample as output. Because of this flip, I think you can see why this is so interesting; there is so much potential with this technology. For example, you can train a model to understand what dogs look like by feeding it hundreds of dog images. Then, during test time, we can just ask the model for an image, and it'll spit out a dog image. The cool thing is, every time we ask our model to generate a dog, it'll generate a different dog. So you can create an unlimited gallery of your favorite animal. Doggos! Sweet. But what does this generative model black box look like?
Let's take a look at the variational autoencoder as an example. As mentioned before, variational autoencoders are a type of generative model. They are based on another type of architecture called an autoencoder. Autoencoders consist of two parts: an encoder and a decoder. The encoder takes an input sample and compresses its information into some vector, basically a set of numbers, and the decoder takes this vector and expands it back out to reconstruct the input sample.

Now, you may be thinking: why are we doing this? What is the point of trying to generate an output that is the same as the input? And the answer is, there is no point. When using autoencoders, we don't tend to care about the output itself, but rather about the vector constructed in the middle. This vector is important because it is a representation of the input image or audio, in a form the computer understands. So, another question: what is so great about this vector? On its own, I'd say the vector has limited use, but we can feed it to more complex architectures to solve some really cool problems. Here's an example of a paper that uses autoencoders to infer the location of an individual based on his or her tweets. The architecture consists of three stacked autoencoders that build a representation of the input text from the tweet. This is then piped to two output layers: one determines the US state from which the tweet was made, and the other estimates the latitude and longitude of the user. I'll link the paper below in case you're extra curious. This is just one of many interesting examples of what you can actually do with autoencoders.

However, something we cannot do with autoencoders is generate data. Why is this the case? Let's go back to the autoencoder architecture: it consists of an encoder and a decoder. During training time, we feed images in as input and make the model learn the encoder and decoder parameters required to reconstruct each image. During testing time, we only need the decoder part, because that is the part that generates the image. To do this, we need to feed it some vector. However, we have no idea what this vector should look like. If we just give it random values, more likely than not we will end up with an image that looks like garbage, which is pointless. So we need some method to determine this hidden vector.

Here's some more intuition. The idea is to determine this vector by sampling from a distribution. I'll explain the basic concepts of sampling and distributions, but I'll also translate them into more technical terms for those of you who are more advanced in probability theory. So, distributions and sampling. Think of a distribution as a pool, a pool of vectors. Consider the case where we want to build a generative model that generates different animals. To accomplish this, our generative model needs to learn to create a pool for cats, a pool for dogs, and another pool for giraffes. When I say the dog pool, I don't mean a pool consisting of dog images; it consists of vector representations of those images, which only the computer understands. So in a nutshell, think of a distribution as a pool of vectors. Now on to sampling. Sampling just means closing your eyes, reaching into a pool, and picking one vector.
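For the extra curious, here is a tiny, purely illustrative sketch of the pool idea, assuming PyTorch and completely made-up numbers: treat the "dog pool" as a Gaussian distribution over latent vectors, and sampling as drawing one vector from it.

```python
import torch

# Purely illustrative: pretend the "dog pool" is a Gaussian distribution
# over 32-dimensional latent vectors, centred somewhere in vector space.
# All numbers here are made up for the sake of the example.
latent_dim = 32
dog_pool_mean = torch.full((latent_dim,), 2.0)  # where the dog pool sits
dog_pool_std = torch.full((latent_dim,), 0.5)   # how spread out the pool is

# "Sampling" = closing your eyes and picking one vector from that pool.
z_dog = dog_pool_mean + dog_pool_std * torch.randn(latent_dim)
print(z_dog.shape)  # torch.Size([32])
```

A real model would learn where each pool sits; the only point here is that a distribution is something you can draw vectors from.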
If you know where the pool is, you can go to it and randomly pick a vector. So when we say we sampled from the distribution of dog images, it's equivalent to saying we picked a random vector from the dog pool. Pretty simple. Now, the problem with regular autoencoders is that we, as human beings, don't know where these pools are. Imagine this box represents all possible values of the hidden vector. The cat pool could be here, the dog pool here, and the giraffe pool somewhere over here. Each of these pools is learned by the model during training: when we feed it hundreds of images of animals, the model finds patterns linking similar dogs, cats, and giraffes, and comes up with these pools. These pools, or more technically these distributions, are learned internally by the autoencoder, but there is no way for humans to know where they are in order to use them for generating images. During test time, we are basically sampling at random. In other words, it's equivalent to blindfolding ourselves and picking a value from this huge box, which contains valid vectors only in a few very specific locations and garbage vectors everywhere else. So there is a very high chance we'll pick a garbage vector, from which we get a garbage output. The big takeaway: we cannot generate dog images with a regular autoencoder because we don't know how to assign values to the vector during the generation phase.

We clearly have a problem here. But what if we did know where to pick these vectors from? That would solve our problem, right? Variational autoencoders do just that. We first constrain this universe, that is, we define a region from which we want to pick the vectors. Within this region, the goal of the variational autoencoder is to find the pools: the dog pool, the cat pool, and the giraffe pool. This is done during the training phase. During the testing phase, all we need to do to generate an image is randomly sample a vector from this known region and pass it to the decoder part of our variational autoencoder. This will generate an image. A neat property of this region is that it's continuous, so we can alter some values in the vector and still get valid-looking images. Say we train a variational autoencoder to generate handwritten digits from 0 through 9. The VAE will learn the pools such that they lie within a defined region. These pools represent the 10 digits from 0 to 9, so the model will learn 10 pools, and the region in which they are learned is continuous. So I can randomly sample a vector from this continuous region and change its values ever so slightly. Doing this repeatedly leads to very trippy, psychedelic-looking generated images when they're placed next to each other.

This is the simple intuition behind variational autoencoders. If you understood this, then congrats. Now let's review some differences between regular autoencoders and variational autoencoders, just to make sure you have a clear understanding of what each does, from a slightly more technical perspective. First of all, why does each exist? The goal of a regular autoencoder is to learn a hidden representation of the input. A variational autoencoder also learns a hidden representation of the input, but it can additionally be used to generate new data; regular autoencoders cannot.
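To picture that generation step, here is a minimal sketch, again with made-up sizes. The decoder below is just an untrained stand-in for the decoder a VAE would learn during training, and the "known region" is a standard normal distribution.

```python
import torch
import torch.nn as nn

# Stand-in for the decoder of a trained VAE. In practice this would be the
# decoder learned during training; the sizes here (32-dim latent vector,
# 28x28 images) are arbitrary choices for illustration.
latent_dim = 32
decoder = nn.Sequential(
    nn.Linear(latent_dim, 256), nn.ReLU(),
    nn.Linear(256, 28 * 28), nn.Sigmoid(),
)

# Generation: sample a vector from the known region (a standard normal
# distribution) and let the decoder expand it into an image.
z = torch.randn(1, latent_dim)
new_image = decoder(z).reshape(28, 28)

# Because the region is continuous, sliding from one sampled vector to
# another still decodes to valid-looking images -- this is where the
# trippy morphing sequences between digits come from.
z_a, z_b = torch.randn(1, latent_dim), torch.randn(1, latent_dim)
frames = [decoder((1 - t) * z_a + t * z_b).reshape(28, 28)
          for t in torch.linspace(0, 1, steps=10)]
```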
Here's another question: what are they optimizing? Regular autoencoders learn to transform an input into a vector by minimizing the reconstruction loss. During training, an autoencoder makes sure that what is thrown into it is also spit back out; in other words, it tries to minimize the difference between the original and the reconstructed images. Hence it seeks to minimize the reconstruction loss. Variational autoencoders, on the other hand, learn to generate images by minimizing the sum of a reconstruction loss and a latent loss. The reconstruction loss is the same as the one we defined for autoencoders. The latent loss ensures that all the pools learned by the network lie within the region we defined earlier. For more technical context, we assume the pools follow a normal, or Gaussian, distribution, so during testing time, vectors are effectively sampled from a mixture of these Gaussians.

Now that we have a clear understanding of VAEs, let's see how they compare with a more famous generative model: generative adversarial networks. First off, how do they learn to generate data? Variational autoencoders have two losses to optimize. The first is the reconstruction loss: what goes into the network should also come out, with as little difference as possible. The second is the latent loss: make sure the latent vector can only take a specific set of values, so that we know which region to sample it from. By optimizing these two losses, our variational autoencoder learns to generate images. Generative adversarial networks, or GANs, work a little differently. Just as the VAE has an encoder-decoder architecture, GANs also have two components: a generator and a discriminator. The generator is responsible for generating images, and the discriminator determines whether a given image is real or fake; by fake, I mean actually created by the generator. The generator and discriminator play a minimax game in which each tries to outperform the other. The generator tries to generate an image that fools the discriminator into thinking it's real, and the discriminator tries to correctly distinguish real images from fake ones, catching the generator out. Whenever one of them messes up, its weights are tweaked slightly to improve its performance. By looking at thousands of images during training, the generator and discriminator improve each other until the generator becomes proficient at generating animal images and the discriminator becomes proficient at telling real images from the generator's fakes. Then, during testing time, we can just use the generator to spit out the images we need.

Another aspect on which we can compare GANs and VAEs is training stability. Training a GAN involves finding something called a Nash equilibrium, a point in the game between the generator and discriminator where neither can improve any further, an end-of-game point. However, there is no concrete algorithm yet that is guaranteed to reach this equilibrium. VAEs, on the other hand, offer a closed-form objective. By closed form, I mean there is a nice little formula that we can write down and optimize directly, which makes it much easier to tell how training is going and when it is done.
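To give a flavour of that closed-form objective, here is roughly how the two VAE losses are commonly written in PyTorch, assuming the encoder outputs a mean and log-variance for a Gaussian over the latent vector (a standard modelling choice, not something spelled out in this video). The reparameterization trick, touched on near the end of the video, is sketched here too.

```python
import torch
import torch.nn.functional as F

def vae_loss(x, x_hat, mu, logvar):
    """Sum of the two losses discussed above, assuming the encoder outputs a
    mean `mu` and log-variance `logvar` describing each Gaussian "pool"."""
    # Reconstruction loss: what goes into the network should also come out.
    reconstruction = F.mse_loss(x_hat, x, reduction="sum")
    # Latent loss: closed-form KL divergence between the learned Gaussian
    # and a standard normal, keeping every pool inside the known region.
    latent = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return reconstruction + latent

def sample_latent(mu, logvar):
    """The reparameterization trick: sample the latent vector as
    mu + sigma * noise so gradients can flow through the sampling step."""
    std = torch.exp(0.5 * logvar)
    return mu + std * torch.randn_like(std)
```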
Now here's a third aspect on which we can compare GANs and VAEs: how good are the generated images? VAEs work very well in theory, but they tend to generate blurry images. You can mostly attribute this to the fact that VAEs are optimizing two objectives at once during training: the reconstruction loss, making sure the output is as close to the input as possible, and the latent loss, making sure the latent vector can only take a restricted range of values. These two objectives often counteract each other; there's a trade-off, and the middle ground usually leads to blurry generated images. GAN training, on the other hand, is more empirical and tuned by trial and error. They just work. You can write down the losses theoretically, but much of the intuition comes from the fact that we had the results before the theory. For simple spatial data, like images, GANs produce really high-quality results. I made a video on the evolution of GANs since their inception in 2014, so be sure to check that out after this one.

And that's a brief comparison with GANs. There are certainly deeper concepts I didn't cover, such as the need for the reparameterization trick in variational autoencoders, or explicitly deriving the two losses of a variational autoencoder, the reconstruction and latent losses. However, there are plenty of good blog posts out there covering these concepts, and I've linked some of those resources below. I hope you got the basic intuition behind variational autoencoders so that you can more easily understand any learning resource you pick up from here on. I may make a more mathy, technical video on variational autoencoders later if enough of you request it, but I'll leave it at this for now. Thank you guys so much for watching. Subscribe to Code Emporium and CS Dojo for more videos on machine learning, deep learning, and artificial intelligence. See you in the next one. Bye-bye!