Take a look at this famous painting by Monet of the bank of the Seine near Argenteuil, France. Even without knowing what it is, we can all agree it's a painting of some place. If a photograph of this scene had been taken in 1873, the year this painting was made, what do you think it would have looked like? Perhaps something like this? This is an example of style transfer, where we synthesize a photo-styled image from a Monet-styled painting.

Style transfer works the other way too. Here's a photograph of the little harbor of Cassis in France. Clearly this photo was taken after Monet's time, but if he could see this scene, what do you think his rendition of it would look like? If you've seen any of his other works, then you may imagine something like this.

Now consider this picture of a horse galloping in a field. How common is it to see, I don't know, a zebra galloping in a field? Not as common, right? Oh look, we just made it happen by replacing the horse with a zebra. This is an example of object transfiguration. Now take a look at this gorgeous summer landscape. How do you think the same scene would look at the onset of winter? Perhaps something like this? That's an example of season transfer.

In all of these examples, we saw an image and imagined how it would look under different circumstances. In this video, we're going to take a look at exactly how we can implement this using a cycle-consistent adversarial network. I'm AJ Helfler and you're watching...

So we saw some cool examples of what we want to do. To solve this problem, though, we need to take a much broader perspective: we need to somehow map images from an input domain X to an output domain Y. This problem is called image-to-image translation. If you've been in computer vision for even a little while, you'll know this problem isn't a new one. Here are a dozen papers that have beaten it to death. But every single one of them uses paired image data: their models are trained on both the original image and the corresponding translated image. Creating such a dataset is a pain, and existing datasets are usually too small to be of any use.

Hence, we're looking for an algorithm that works on unpaired image data, where we have a set of photo-style images X and another set of Monet-style paintings Y, but we don't have a matching Monet painting for every single input photo. Such data is much easier to gather. We assume there exists a mapping from each image in X to a corresponding image in Y, and our goal is to train a model to learn this mapping G.

A typical objective used to train this GAN, or rather to learn the mapping G, is an adversarial loss. This forces the generated images to be indistinguishable from the real images in Y. So let's map this out. An image y-hat is sampled from the generator G, parameterized by theta G. The distribution of real images in Y is denoted by p(y). The goal of minimizing an adversarial loss, or really of optimizing any GAN, is to train the generator G so that the images sampled from it are indistinguishable from the actual distribution.

But matching distributions isn't enough here. Remember, we don't have access to paired data. There are many parameter settings theta G that could minimize the difference in distributions, so the chance of learning a mapping G that makes meaningless pairings between the input domain X and the output domain Y is very high, and that leads to completely meaningless results.
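As a reference point, here's the adversarial objective written out in the notation of the CycleGAN paper (Zhu et al., 2017), which the derivation later in the video arrives at. D_Y here is the discriminator on the output domain, introduced properly in a moment:

```latex
\min_{G}\ \max_{D_Y}\ \mathcal{L}_{\text{GAN}}(G, D_Y, X, Y)
  = \mathbb{E}_{y \sim p_{\text{data}}(y)}\big[\log D_Y(y)\big]
  + \mathbb{E}_{x \sim p_{\text{data}}(x)}\big[\log\big(1 - D_Y(G(x))\big)\big]
```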
In order to reduce the number of possible mappings G that can be learned, we introduce a second type of loss, called cycle consistency loss. Here's the idea: if we translate, for example, a sentence from English to French and then translate it back from French to English, we should arrive back at the original sentence. In our image-to-image translation problem, we introduce another mapping F, which is the inverse of G. That is, it maps an image in Y to some image in the X domain. So we not only need a mapping G that generates a similar distribution, but one that is also cycle consistent with respect to its inverse mapping F. This significantly shrinks the space of possible mappings G.

Now that you have a high-level intuition for these two types of losses, let's derive them mathematically. But before doing so, I'm going to introduce some notation. Since we have two mappings to learn, G and F, we have two GANs to train, each with a discriminator and a generator. The generators generate images for a given domain: G generates images in the Y domain, and F generates images in the X domain. The discriminators distinguish between real images and generated images. Let D_Y be the discriminator that distinguishes between real images in the Y domain and the ones generated as G(x). Let D_X be the discriminator that distinguishes between real images in the X domain and the ones generated by F. So you can say that GAN 1, for the X-to-Y mapping, is the G and D_Y pair, and GAN 2, for the Y-to-X mapping, is the F and D_X pair.

Now that we have some notation, let's start deriving the adversarial losses. We have two GANs, so there are two adversarial losses to compute. First, consider the G and D_Y pair. The discriminator has to classify each input sample as either real or generated. We'll find the parameters of this GAN, theta G, using maximum likelihood estimation. Each sample comes either from the original output space Y, in which case the corresponding label is real, or from the generated space, in which case the corresponding label is fake. Each sample is assumed to be independently and identically distributed, that is, IID, so we can write the likelihood as a product of probabilities. We can further break each factor down into a K-class classification, where t_n is a one-hot encoded vector corresponding to the true label of the input x_n.

Now consider the log likelihood, denoted by a little l, and expand the inner sum over k. Remember, this is a binary classification, where k can take two values: zero for generated data and one for real data. For any sample x_n, only one of these terms is non-zero. Why is that? Because t_n is one-hot encoded. Hence, we can separate the real data samples in Y from the generated data samples. Making a substitution with the discriminator notation, we get the following form, and we can approximate its value by taking the expectation over both terms. This is the likelihood that the discriminator D_Y seeks to maximize and the generator G seeks to minimize. Remember that theta G represents the parameters of GAN 1, so that's the parameters of both the generator G and the discriminator D_Y. Let's put that in there so you don't get confused.
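Before we derive the second GAN's loss, here's what this first likelihood looks like in code. Maximizing the likelihood is the same as minimizing a binary cross-entropy loss, since BCE is exactly the negative log likelihood of this binary classification. This is a minimal PyTorch sketch, assuming hypothetical G and D_Y networks where D_Y ends in a sigmoid so it outputs a probability; it's an illustration, not the paper's actual training code:

```python
import torch
import torch.nn as nn

bce = nn.BCELoss()  # binary cross-entropy is exactly the negative log likelihood above

def d_y_loss(D_Y, G, real_x, real_y):
    """D_Y should output 1 for real images y and 0 for generated images G(x)."""
    fake_y = G(real_x).detach()  # detach so the discriminator step doesn't update G
    pred_real = D_Y(real_y)
    pred_fake = D_Y(fake_y)
    return (bce(pred_real, torch.ones_like(pred_real))
            + bce(pred_fake, torch.zeros_like(pred_fake)))

def g_adversarial_loss(D_Y, G, real_x):
    """G tries to fool D_Y into outputting 1 on generated images."""
    pred_fake = D_Y(G(real_x))
    return bce(pred_fake, torch.ones_like(pred_fake))
```

Note that the generator term here uses the common non-saturating trick, maximizing log D_Y(G(x)) rather than literally minimizing the likelihood expression above; it has the same fixed point but gives stronger gradients early in training.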
We can derive a similar likelihood expression for the second GAN, the F and D_X pair, and determine its parameters. Let's do this real quick so you get the hang of the math. The likelihood estimation starts out the same as before. Before moving on, I want to point out that the x and y used in this part of the likelihood derivation are the sample inputs to our network: x is the input image and y is the output label, which is either real or fake. In other parts of this video, I use X and Y to represent the input and output image domains. I'm sticking to this notation because it's what you'll see in most other papers as well; I just want to point this out so there's no confusion.

Once again, we assume that the input samples from the image domain X and the images produced by the generator F are IID, so we can express the likelihood as a product. We break this down into a K-class classification, using t_n as a one-hot encoded vector to signify the true labels, like we did before. We then take the log likelihood to make the expression easier to work with, since a sum of sums is easier to compute than a product of products. This is a binary classification where images are either real (one) or generated (zero). We can now separate the real data, from the set X, from the generated data, which comes from F(y). Making the substitution with the discriminator notation, we get the following form, and we can approximate its value by taking the expectation over both terms. Theta F is the set of parameters of the second GAN, computed by maximizing this likelihood. Since the discriminator D_X seeks to maximize this expression and the generator F seeks to minimize it, let's write it in the form of a minimax objective.

Combining the objectives of these two GANs, we get the overall adversarial objective. The first term is computed when the X domain is the input and the Y domain is the output, while the reverse is true for the second term. Let me include that here to distinguish between the two. This is the adversarial objective, and the adversarial loss is just the negative of this value, that is, the negative log likelihood. I hope this derivation clears things up.

Let's talk about the second type of loss I mentioned before: cycle consistency loss. As with the adversarial losses, since we have two GANs to train, we have two cycle consistency losses, and we'll call them forward cycle consistency and backward cycle consistency. Forward cycle consistency is established when a source image in X matches its transformation after applying G followed by its inverse F. Similarly, backward cycle consistency is established when an image in the output space Y is retained after F and its inverse G are applied in succession. We can define both losses as a measure of the L1 distance. The overall loss is a linear combination of the adversarial losses and the cycle consistency loss, where lambda controls the relative importance of the two, that is, how much weight cycle consistency gets relative to the adversarial terms. Solving this objective, we learn the two mapping functions G and F.
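Written out in the CycleGAN paper's notation, the cycle consistency loss and the full objective look like this:

```latex
\mathcal{L}_{\text{cyc}}(G, F)
  = \mathbb{E}_{x \sim p_{\text{data}}(x)}\big[\lVert F(G(x)) - x \rVert_1\big]
  + \mathbb{E}_{y \sim p_{\text{data}}(y)}\big[\lVert G(F(y)) - y \rVert_1\big]

\mathcal{L}(G, F, D_X, D_Y)
  = \mathcal{L}_{\text{GAN}}(G, D_Y, X, Y)
  + \mathcal{L}_{\text{GAN}}(F, D_X, Y, X)
  + \lambda\, \mathcal{L}_{\text{cyc}}(G, F)
```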
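And here's a minimal PyTorch sketch of that combined objective, reusing the hypothetical g_adversarial_loss from the earlier sketch. The paper's experiments set lambda to 10, but treat the default here as an assumption for illustration:

```python
import torch.nn as nn

l1 = nn.L1Loss()

def cycle_consistency_loss(G, F, real_x, real_y):
    """Forward cycle: x -> G(x) -> F(G(x)) should reconstruct x.
       Backward cycle: y -> F(y) -> G(F(y)) should reconstruct y."""
    forward = l1(F(G(real_x)), real_x)
    backward = l1(G(F(real_y)), real_y)
    return forward + backward

def total_generator_loss(G, F, D_X, D_Y, real_x, real_y, lam=10.0):
    """Adversarial terms for both mappings plus the weighted cycle term."""
    adv = g_adversarial_loss(D_Y, G, real_x) + g_adversarial_loss(D_X, F, real_y)
    return adv + lam * cycle_consistency_loss(G, F, real_x, real_y)
```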
So now we know exactly how to compute the losses, but what exactly are the generator and the discriminator? What are their components? The generator follows an encoder-decoder architecture with three main parts: the encoder, the transformer, and the decoder. The encoder is a set of three convolution layers, so it takes an image as input and outputs a feature volume. The transformer takes the feature volume and passes it through six residual blocks. Each residual block is a set of two convolution layers with a bypass. Like I mentioned about the ResNet architecture in my video on various CNN architectures, this bypass allows transformations from earlier layers to be retained throughout the network; hence, we can build deeper networks. Effectively, you can think of the transformer as 12 convolution layers with bypasses. The decoder is the exact opposite of the encoder. It takes the transformer's output, which is another feature volume, and produces a generated image. This is done with two layers of deconvolution, or transposed convolution, to rebuild from the low-level extracted features, and then a final convolution layer is applied to get the generated image.

The discriminator is a simpler architecture. It takes an image as input and outputs the probability of whether it belongs to the real dataset or the fake, generated dataset. This architecture is a PatchGAN. It involves chopping the input image into 70 x 70 overlapping patches, running a regular discriminator over each patch, and averaging the results to determine overall whether the image is real or fake. But we can implement it as a ConvNet, more specifically a fully convolutional network, where the final convolution layer outputs a single value per patch. I'll show quick code sketches of both networks right after the recap.

Trained against the loss function we discussed, CycleGANs produce remarkable results on various translation problems. Let's first compare this to Pix2Pix, which was trained as a conditional GAN on a fully paired dataset. Not only can CycleGAN perform sketch-to-photo translation like Pix2Pix, it also does a decent job generating sketches from photos. We can perform style transfer, transforming a picture into a work of art in a particular artist's style, like Monet or Van Gogh. We can also perform object transfiguration: in these images, we've replaced all zebras with horses and all horses with zebras. We can perform season transfer: here, images of Yosemite in summer have been translated into winter images, and vice versa. And we can do photo enhancement: mapping iPhone camera pictures to DSLR-style images, so we get a depth-of-field effect for absolutely stunning images.

So what did we learn? Cycle-consistent adversarial networks are a type of GAN that can solve image-to-image translation problems without a paired dataset. We defined and derived the GAN's objective; the loss divides into two parts, adversarial losses and cycle consistency losses. The architecture of a CycleGAN consists of two generator networks that generate new images and two discriminator networks that distinguish between real and generated images. Each generator consists of three parts: an encoder, which is three convolution layers; a transformer, which is six residual blocks; and a decoder, which is two deconvolution layers followed by a convolution layer. The discriminators are PatchGANs, which can essentially be implemented as fully convolutional networks. Cycle-consistent adversarial networks can solve image-to-image translation problems like object transfiguration, photo enhancement, style transfer, and season transfer.
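As promised, here's a minimal PyTorch sketch of the generator: an encoder of three convolutions, a transformer of six residual blocks, and a decoder of two transposed convolutions plus a final convolution. The channel counts, kernel sizes, and instance normalization follow common CycleGAN implementations, but treat the specifics as assumptions rather than the paper's exact configuration:

```python
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Two convolution layers plus the bypass (skip connection) described above."""
    def __init__(self, channels):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
            nn.InstanceNorm2d(channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
            nn.InstanceNorm2d(channels),
        )

    def forward(self, x):
        return x + self.block(x)  # the bypass: the input is added back to the transformation

class Generator(nn.Module):
    """Encoder (3 convs) -> transformer (6 residual blocks) -> decoder (2 deconvs + 1 conv)."""
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(  # image -> feature volume
            nn.Conv2d(3, 64, 7, stride=1, padding=3), nn.InstanceNorm2d(64), nn.ReLU(True),
            nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.InstanceNorm2d(128), nn.ReLU(True),
            nn.Conv2d(128, 256, 3, stride=2, padding=1), nn.InstanceNorm2d(256), nn.ReLU(True),
        )
        self.transformer = nn.Sequential(*[ResidualBlock(256) for _ in range(6)])
        self.decoder = nn.Sequential(  # feature volume -> generated image
            nn.ConvTranspose2d(256, 128, 3, stride=2, padding=1, output_padding=1),
            nn.InstanceNorm2d(128), nn.ReLU(True),
            nn.ConvTranspose2d(128, 64, 3, stride=2, padding=1, output_padding=1),
            nn.InstanceNorm2d(64), nn.ReLU(True),
            nn.Conv2d(64, 3, 7, stride=1, padding=3), nn.Tanh(),  # final conv -> RGB image
        )

    def forward(self, x):
        return self.decoder(self.transformer(self.encoder(x)))
```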
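And a matching sketch of the PatchGAN discriminator as a fully convolutional network: the final convolution emits one value per patch, and averaging those scores gives the overall real-or-fake probability. Again, the layer specifics are assumptions in the spirit of the 70 x 70 PatchGAN, not its exact published configuration:

```python
import torch.nn as nn

class PatchDiscriminator(nn.Module):
    """Fully convolutional PatchGAN sketch: the last convolution emits one
    real/fake score per overlapping patch, and the scores are averaged."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 64, 4, stride=2, padding=1), nn.LeakyReLU(0.2, True),
            nn.Conv2d(64, 128, 4, stride=2, padding=1), nn.InstanceNorm2d(128), nn.LeakyReLU(0.2, True),
            nn.Conv2d(128, 256, 4, stride=2, padding=1), nn.InstanceNorm2d(256), nn.LeakyReLU(0.2, True),
            nn.Conv2d(256, 512, 4, stride=1, padding=1), nn.InstanceNorm2d(512), nn.LeakyReLU(0.2, True),
            nn.Conv2d(512, 1, 4, stride=1, padding=1),  # one value per patch
            nn.Sigmoid(),                               # probability that a patch is real
        )

    def forward(self, x):
        return self.net(x).mean(dim=(1, 2, 3))  # average patch scores: one probability per image
```

The sigmoid output keeps this consistent with the BCE-based losses sketched earlier.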
And that's all I have for you for now. If you liked the video, hit that like button. If you like content like this on AI, deep learning, machine learning, and data science, then hit that subscribe button, and for immediate notifications when I upload, ring that little bell. Links to the main paper and other sources are down in the description below, so check them out. Still haven't had your daily dose of AI? Click or tap one of the videos right there, and it'll take you to an awesome video. And I will see you in the next one. Bye.