Say I have an image and I want to transfer it over a network. Raw image data isn't really compact and can take a long time to traverse a network. In today's world of increased image processing and video streaming, such low transfer rates aren't going to cut it. So how do we overcome this network bandwidth bottleneck? Well, at the source, we can first compress the image, then send this compressed version across the network, and then reconstruct the image at the destination. It sounds like a plausible solution.

Consider another scenario. Self-driving cars have been the big talk for a while. How do they work, though? Just think big picture. We know that for them to function, they need to keep themselves on the road, they need to follow traffic rules, and they need to not run over pedestrians. To accomplish this, they need to detect and delineate the objects in their immediate field of vision, and this process is called semantic segmentation. These two applications are very crucial to our world in 2019, and a fundamental way we can implement them is through an architecture called autoencoders. In this video we're going to understand what they are, how they work, and how exactly they can be used in these cool applications. This is CodeEmporium. So let's get started.

Data around us, like images and documents, is very high dimensional, and autoencoders can learn a simpler representation of it. They are a class of unsupervised neural networks. Usually you wouldn't associate neural nets with being unsupervised, but today you will. The architecture consists of three parts: an encoder, a bottleneck, and a decoder. At the simplest level, we can have a layer of fully connected neurons for each part. The output represents the reconstructed input, hence it has the same dimensions. The objective is to learn a representation that minimizes the reconstruction loss. Learning these weights can be done with techniques like recirculation or backpropagation.

However, there is a problem. A trivial solution with zero loss exists if the network just copies the input to the output. But this would mean that the latent layer doesn't really learn anything, which is useless. One way to work around this is to constrain the properties of this hidden layer, these properties being its size and activation.

Consider size. If the number of neurons in the hidden layer H is less than that of the input layer, the autoencoder is said to be undercomplete. This is useful, as the network needs to learn a latent representation with only a small set of neurons, so we're sure that it's going to learn something.

The second property is activation. If we don't use an activation function in the decoder phase and the loss function is the mean squared error, then the results are similar to PCA, principal component analysis. And so a latent representation of K neurons will represent the top K principal components. Adding a nonlinear activation in the encoder and decoder parts allows us to perform a nonlinear version of PCA. But then we come back to the same problem: if H is unconstrained, the network will simply copy the input to the output, as that leads to zero loss. We need to make sure that H is constrained to an extent for learning. It is important to remember that we aren't too concerned about the output of the decoder, but more about the latent representation that the autoencoder learns.

Already, we can see a problem with this vanilla autoencoder. We need the latent feature vector space to be small, and the encoder and decoder parts need to be shallow. If not, the resulting latent representation becomes useless, as it simply copies the image. This means more complex data cannot be modeled accurately, and we may end up with underfitted models.
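To make this concrete, here's a minimal sketch of an undercomplete autoencoder in PyTorch. The 784-dimensional input (think flattened 28 by 28 images) and the 32-neuron bottleneck are just assumptions for illustration, not something the architecture fixes.

```python
import torch
import torch.nn as nn

class Autoencoder(nn.Module):
    """Undercomplete autoencoder: the bottleneck H is smaller than the input."""
    def __init__(self, input_dim=784, latent_dim=32):
        super().__init__()
        # Encoder: compress the input down to the latent (bottleneck) layer H
        self.encoder = nn.Sequential(nn.Linear(input_dim, latent_dim), nn.ReLU())
        # Decoder: reconstruct the input from the latent representation
        self.decoder = nn.Sequential(nn.Linear(latent_dim, input_dim), nn.Sigmoid())

    def forward(self, x):
        h = self.encoder(x)      # latent code
        return self.decoder(h)   # reconstruction, same dimensions as x

model = Autoencoder()
criterion = nn.MSELoss()                                   # reconstruction loss
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

x = torch.rand(64, 784)        # stand-in batch; in practice, flattened images
loss = criterion(model(x), x)  # minimize the reconstruction error
loss.backward()
optimizer.step()
```

Because the bottleneck is much smaller than the input, the network can't simply copy its input through and is forced to learn a compressed representation.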
Ideally, we should be able to have a latent representation that is at least the size of the input, and also use nonlinear activations in the encoder and decoder parts. This would allow more complex models to be fitted. We can do this using regularized autoencoders. They follow the same principles as vanilla autoencoders, but have a modified loss function.

When you hear regularization, you're probably thinking of L1 and L2 regularization, where the coefficients are penalized to mitigate overfitting. We have a similar intuition here for sparse autoencoders. The overall loss contains an additional sparsity penalty term on the code layer H. When thinking of sparsity, think of most neurons being turned off. You can relate this to the concept of dropout in neural networks, where neurons are randomly turned off. This forces the network to learn representations that generalize better beyond the training data.

Let's say the hidden layer has a sigmoid activation. This means that the output values will range from 0 to 1. We say a neuron is off if its output is close to 0 and on if it is close to 1. We impose a sparsity constraint by making the average activation of each hidden neuron some very low value. Call this average rho-hat j for the jth neuron; the target value it should stay near, rho, is the sparsity parameter. Now this notation may look a bit confusing. Basically, if we have n samples, each neuron in the hidden layer is activated to some extent for every sample, and we want the average of those activations to be some low value. I'll write the activation of the jth neuron, aj, as a function of the training example xn, to show that it depends on the sample. Note, it's just a function, not a multiplication with the input vector. We want this average to be a very low value, like 0.05, but we cannot just set it to 0.05; it's an aggregate, after all, not a simple value. So we set rho to 0.05 and penalize every neuron whose rho-hat j deviates from this value.

We can model two distributions, p and q, with the probability of success being rho and rho-hat j, respectively. The idea is to make the predicted distribution as close as possible to the target one, and we can measure this with the KL divergence. Now, this is just one loss function that could work; just remember we are considering a sigmoid activation. This is equivalent to the KL divergence between two Bernoulli distributions with probability of success rho and rho-hat j. By the way, KL divergence is a measure of the difference between two distributions, and ideally we want to minimize it. This graph shows the KL divergence when rho is equal to 0.2 for a sigmoid activation. Clearly, it's at its minimum when rho-hat j is equal to rho. We write the overall cost function as the cost incurred by the vanilla autoencoder plus a weighted sum of the KL divergences between rho and rho-hat j over all hidden neurons.

So how exactly does sparsity help? We can take a look at what the network learns, visually. Say we have an autoencoder whose input is a set of 10 by 10 images, so that's a 100-dimensional input, and say that we use a hidden layer of size 100 as well. After training, we can visualize what each hidden neuron has learned, as shown here. Consider the top left corner, which was learned by the first neuron. This neuron learned to detect diagonal edges. All neurons, like so, learn to detect edges of different orientations. So together, they learn a very effective representation of the input.
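Here's a rough sketch of how the sparsity penalty we just described can be computed in code, assuming a sigmoid hidden layer; the target of 0.05 and the weight beta are illustrative choices, not fixed values.

```python
import torch

def kl_sparsity_penalty(hidden_activations, rho=0.05):
    """KL divergence between a target Bernoulli(rho) and each hidden neuron's
    average activation rho_hat, assuming sigmoid activations in [0, 1]."""
    rho_hat = hidden_activations.mean(dim=0)   # average activation of each neuron over the batch
    rho_hat = rho_hat.clamp(1e-6, 1 - 1e-6)    # keep the logs well defined
    kl = rho * torch.log(rho / rho_hat) + (1 - rho) * torch.log((1 - rho) / (1 - rho_hat))
    return kl.sum()

# In a training loop (encoder/decoder as in the earlier sketch, but with a
# sigmoid on the hidden layer):
#   h = torch.sigmoid(encoder(x))
#   loss = reconstruction_loss + beta * kl_sparsity_penalty(h, rho=0.05)
h = torch.sigmoid(torch.randn(64, 100))        # stand-in batch of hidden activations
print(kl_sparsity_penalty(h))                  # grows as neurons drift away from the 0.05 target
```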
Another type of regularized autoencoder is the denoising autoencoder. Autoencoders learn a latent representation from the input in an unsupervised way; denoising autoencoders learn such a representation from a corrupted input, which allows for better generalization. So you take the input x and add some noise to it. The resulting corrupted input is x dash, or x prime. The hidden layer can no longer simply copy the input to the output, hence it is forced to learn some latent representation.

For the most part until now, we've only been talking about autoencoders with a single hidden layer for the code, a single input layer for the encoder, and a single output layer for the decoder. Such a shallow network can actually be quite useful. You can attribute this to the universal approximation theorem from neural net theory: a neural network with a single hidden layer and a finite number of neurons can approximate any continuous function, under some assumptions on its activation. However, there is nothing stopping us from creating deeper architectures. This allows for faster training times, as we don't need as much training data for good generalization.

Hope you can now see how autoencoders can be used for data transfer across a network, the example application that we talked about at the beginning of the video. They can also be used in semantic segmentation. By swapping out the fully connected layers for convolutional layers, autoencoders can take image inputs while preserving spatial structure, but instead of the original image, we aim to output a semantically segmented counterpart, outlining the contours of interest. For example, this tech can be used in self-driving cars to segment the different objects of interest. Xander on Arxiv Insights does a pretty good job of explaining this in his variational autoencoder video, so check that out.

Another application is neural inpainting. Instead of adding noise to the input, you can remove rectangular sections from an image and train the autoencoder to reconstruct the original. This allows you to do things like remove watermarks from images. And I can keep going. Semantic hashing: an autoencoder takes documents as input and outputs a compact 32-bit address. Documents mapped to nearby addresses are considered similar, and the distance between two documents is given by the Hamming distance, that is, the number of bit positions in which the two addresses differ.

The list of applications just goes on. Ever since the introduction of generative adversarial nets in 2014, generative models have been investigated in a new light. There are so many interesting applications coming out, and it's always fun to read about and play around with them. That's all I've got for you now, so be sure to like, subscribe, hit that bell, share, watch my videos until the end, and I'll be dishing out new content soon, so stick around.
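And if you want to play around with the denoising idea from earlier, here's a minimal sketch of the corruption step in PyTorch; the Gaussian noise level and the layer sizes are assumptions for illustration, not anything fixed by the method.

```python
import torch
import torch.nn as nn

def corrupt(x, noise_std=0.3):
    """Produce the corrupted input x' by adding Gaussian noise to the clean input x."""
    return (x + noise_std * torch.randn_like(x)).clamp(0.0, 1.0)

# Same kind of model as in the first sketch
model = nn.Sequential(nn.Linear(784, 32), nn.ReLU(), nn.Linear(32, 784), nn.Sigmoid())
criterion = nn.MSELoss()

x = torch.rand(64, 784)              # clean batch (stand-in for flattened images)
x_prime = corrupt(x)                 # corrupted version fed to the network
loss = criterion(model(x_prime), x)  # the target is the CLEAN input, not the corrupted one
```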