So, welcome to the next session. Our speaker is Anmol Krishan Sachdeva, and he will talk about GANs. I first met Anmol at the GeoPython conference some years ago. Anmol is very active at conferences, real physical conferences back in the day; he has also spoken at PyCon Thailand, PyCon Malaysia, and many more. And, as I said, about last year I don't even remember whether it was GeoPython or EuroPython. I also have to mention that this year he is a EuroPython volunteer, so thank you very much for volunteering. And now I hand over to you; please start your slides.

Thanks, Martin, for the introduction. Hi, everyone. I am Anmol Sachdeva, and the title of today's talk is Painting with GANs. We will be talking about Neural Style Transfer and the technicalities and challenges of using it.

A brief introduction about myself: I am an international tech speaker and a distinguished guest lecturer, and I work at OLX Group. I did my Masters in Advanced Computing at the University of Bristol, specializing in Computational Neuroscience and Artificial Intelligence. I have represented India in various international hackathons, and I am also a researcher.

About OLX Group: it is a group of 20+ brands with around 45 offices spanning five continents, and we serve about 350 million people per month.

The flow of the talk will be as follows. First, we will look at an introduction to GANs. Then we will take a look at what style transfer is. After that, we will learn about the different neural style transfer networks that are available and popular at this time, and then we will dive into the actual NST implementation by looking at the loss functions: the content loss, the style loss, and the total variation loss. We will also do a kind of code walkthrough. The talk is supported by a few demos, which are adapted from the official TensorFlow and Keras repositories; I will push the code to GitHub and share the link in the breakout channel. After the talk, we can have the question-and-answer session in the talk-painting-with-gans channel on Discord.

The prerequisites for this talk are that you should be familiar with Python and Keras, especially with the TensorFlow backend, and experience with artificial neural networks is good to have. It also helps if you have experience with convolutional neural networks and generative adversarial networks, and you should be curious to learn about deep learning.

First, let's start by revisiting the fundamentals of generative adversarial networks; in short, I will refer to them as GANs. Discriminative and generative models are the two types of models that we use in a GAN. The discriminative model forms the discriminator network. It is essentially a supervised learning model which tries to classify the data that is fed into it, so it is just a classification model. It does not really care about the underlying distribution of the data; only the quality of the data matters to it, so that it can sort samples into categories properly. On the other hand, we have a generative model that forms the generator network, and instead of classifying data it is used to generate data.
It actually learns the underlying distribution of the data it is given, and on the basis of that it tries to generate samples that look near-real. Mostly this is unsupervised learning, but what if we want to do some conditional training? In that case we may supplement the training set with labeled data as well, so it becomes a mix of supervised and unsupervised learning. Conditional GANs use labeled data in exactly this way, so there is a bit of supervised learning implemented there too. The way a GAN learns the underlying data distribution is through implicit density estimation: we do not need to calculate any probabilities explicitly, everything is done internally by the network itself, which is why it is called implicit density estimation.

I will keep referring to these terms going forward, but just to give you a gist of what a vanilla GAN looks like as a schematic diagram: the goal is to generate near-real-looking samples of the underlying distribution represented by the training set. We have an input layer where we feed random noise; this random noise forms the latent space, and it can come from a uniform or a normal distribution. We pass this noise through the GAN, which is formed by two neural networks, a discriminator network and a generator network. We will cover what is hidden inside this box in the coming slides, but once we pass this input to the GAN, it produces an output of some other dimension, say n dimensions, and that might be an image formed out of random noise.

Training a GAN has two essential parts: training the discriminator and training the generator. Training the discriminator involves the following flow. We take a real sample, that is, a sample from the training set, pass it through the discriminator, and have the discriminator classify it. On the other side, we have a generator network to which random noise is fed; the generator produces a sample, which we call x-bar, a fake sample, and that fake sample is also fed to the discriminator. The discriminator should be able to classify it as fake, but our aim is for the generator to improve to the point where the discriminator starts failing to distinguish between the real sample and the fake sample. So there will be a point where the discriminator starts classifying fake samples as real.

Then there is the generator training path: in the second phase of training, the generator again takes random noise and generates a fake sample, the sample is fed to the discriminator, and the discriminator classifies it as real or fake. The essential thing is that during the generator training phase we backpropagate the errors to the generator instead of to the discriminator. In the discriminator phase, backpropagation is done on the discriminator network, whereas in the generator phase, backpropagation is done on the generator.
And here we make sure that the discriminator network's parameters are not trainable: in the second phase we set the discriminator's trainable flag to false, because we just want the generator to improve. In the schematic diagram it looks like this: the first phase trains the discriminator, the second phase trains the generator. We have x, the real sample that we provide to the discriminator, and we have the generator, to which we feed some random noise z from the latent space; it can be normal, uniform, or any other distribution. It produces a sample x-bar, which is also fed to the discriminator. The discriminator classifies x and x-bar into a category, real or fake, and the classification errors are backpropagated to the discriminator so that the discriminator learns. In the second phase, the only thing we change is that we remove the training-sample path: we just pass x-bar and backpropagate the error to the generator. That is the only difference. We repeat this cycle over iterations so that the network learns: the discriminator gets better at distinguishing real from fake data, while the generator gets better at generating data that looks near-real and adheres to the underlying distribution of the real training data set.
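Before moving on, here is a minimal sketch of that two-phase training loop in Keras. This is not code from the slides; the layer sizes, the MLP architecture, and the flattened 784-pixel images are made-up illustrations, and only the structure (train the discriminator on real and fake batches, then freeze it and train the generator through the combined model) reflects what was just described.

```python
import numpy as np
from tensorflow.keras import layers, models

latent_dim = 100  # size of the random-noise (latent) vector z

# Generator: noise z -> fake sample x_bar (a flattened 28x28 image, purely illustrative)
generator = models.Sequential([
    layers.Input(shape=(latent_dim,)),
    layers.Dense(128, activation="relu"),
    layers.Dense(784, activation="tanh"),
])

# Discriminator: sample -> probability that the sample is real
discriminator = models.Sequential([
    layers.Input(shape=(784,)),
    layers.Dense(128, activation="relu"),
    layers.Dense(1, activation="sigmoid"),
])
discriminator.compile(optimizer="adam", loss="binary_crossentropy")

# Combined model used only to train the generator: the discriminator is frozen here
discriminator.trainable = False
z = layers.Input(shape=(latent_dim,))
gan = models.Model(z, discriminator(generator(z)))
gan.compile(optimizer="adam", loss="binary_crossentropy")

def train_step(real_batch):
    batch_size = real_batch.shape[0]
    noise = np.random.normal(size=(batch_size, latent_dim))
    fake_batch = generator.predict(noise, verbose=0)

    # Phase 1: backpropagate classification errors into the discriminator,
    # using real samples labeled 1 and fake samples labeled 0
    discriminator.trainable = True
    d_loss_real = discriminator.train_on_batch(real_batch, np.ones((batch_size, 1)))
    d_loss_fake = discriminator.train_on_batch(fake_batch, np.zeros((batch_size, 1)))

    # Phase 2: freeze the discriminator and backpropagate into the generator only,
    # asking the frozen discriminator to label the fakes as real
    discriminator.trainable = False
    g_loss = gan.train_on_batch(noise, np.ones((batch_size, 1)))
    return d_loss_real, d_loss_fake, g_loss
```

In a real run you would call train_step over many batches and epochs; the point here is only that the discriminator's weights are updated in the first phase and held fixed in the second.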
These are all fake samples generated by an NVIDIA StyleGAN. No one can tell that these are fake; they look pretty much real. That is how far we have advanced in the five or so years since the inception of the GAN concept.

Next comes the main concept we are here for today. We have become quite good at generating near-real-looking, photorealistic imagery. But what if we now want to generate art? Instead of just generating images of static objects, we want to design new objects or build artistic artifacts. How can we do that using GANs? This is where style transfer comes in. Say we have an image which we call the content image, or base image, of a dog, and a style image, here an image of grass. What if we apply the style of the grass image onto the content image and generate this output? The style of one image has been imposed on the content image, so we get a unique piece of art that contains both the content and the style: the content is still dominant, and the style is also clearly reflected. This image is a combination of the content image and the style image. That is our aim: given a content image and a style image, transfer or embed the style from the style image into the content image to produce an image which is a blend of both and looks good, realistic, and original. That is the goal of the neural style transfer networks we will cover now.

One thing to note is that the model in this case did not learn an underlying distribution. Here comes the first difference with respect to vanilla GANs: vanilla GANs learn the underlying distribution, whereas here we are transferring styles.

So we extract the style from the style image and embed it into the content image, and the result should look like a blend of both images. Why not simply interpolate the pixels? Because if we interpolate one image with the other, we get a blurry, highly distorted image in which the style dominates the content. It will look muddy, it will not be clear, and both images will lose their identity. That is why simple interpolation is not the way to do this.

Style transfer networks have wide applications in gaming and in application development. A few years back there was an app called Prisma which made this sort of style transfer public, so people could apply styles from different images onto their selfies. That saw a real boom over the last two or three years; people have built new state-of-the-art networks that can do style transfer, and many more applications have appeared, which we will discuss in a few minutes. The last point is that in the resulting image, the style has simply been applied on top of the content image: you can play with the dimensions, but ultimately it looks as if someone has painted some art onto the content image to give us the generated image.

The popular style transfer networks are these three: Pix2Pix, CycleGAN, and Neural Style Transfer. We will say a bit about CycleGAN and Pix2Pix after we cover Neural Style Transfer, but we start with Neural Style Transfer. As I said earlier, Neural Style Transfer does not require a training set; we have only two images to deal with, and no network weights are being trained by backpropagation over a training set. We just have to transfer the style of one image onto the other. That is another unique thing about Neural Style Transfer: it picks up the features from the style image and applies them to the content image, and it creates some hyper-realistic imagery.

Say we have this base image and we apply the style image shown at the bottom; we get this kind of combined image. Just by playing around with the loss functions and the hyperparameters, we can get highly varying resultant images. This is one of them; likewise, we can control the degree to which the style features are transferred or embedded onto the base image, so we can generate multiple images from the same style and base image.

The core of Neural Style Transfer is a loss function which is made up of a content loss, a style loss, and a total variation loss, each of which I will say more about in a bit. First there is the content loss, where we would like the combined, or resultant, image to be as similar as possible to the content image, the base image. That is a loss between the base image and the generated image. Then comes the style loss, where the generated image is compared to the style image.
Based on the degree of correlation between the styles of the two images, we calculate that loss. Last is the total variation loss, also called the total variance loss, where we check whether the generated image is smooth or whether its pixels are distorted and blurry.

So how do we minimize this loss? Ultimately the goal of Neural Style Transfer is to minimize the combination of these losses, and we do it with a gradient descent technique in which we update each pixel over the iterations until we end up with the combined image you saw on the previous slide. There is a difference from a vanilla GAN, which I highlighted earlier: no training set is required here, and no network weights are trained by backpropagation; only the pixels of the generated image are updated.

Coming to the content loss, from now on I will also be referencing bits of code, but before moving forward I would like to show a quick demo of how we can use a pre-trained network to generate these art pieces. This is the IPython notebook I am using; I will increase the font size so you can see. Here we import the libraries: TensorFlow, matplotlib (and set its runtime configuration parameters), NumPy, the Python Imaging Library, and functools. Next comes a function, adopted from the official TensorFlow repository, that converts a tensor to an image. It does nothing special: it uses a few NumPy functions, checks the channels and dimensions of the tensor, and picks the primary image if there are more than three dimensions, since here we are working with four-dimensional tensors. So it converts the tensor to an image, and it is just a helper used below.

Then we have the content part. It uses the Keras utils function get_file, which fetches a file from a remote location; we fetch the image of the dog that I showed earlier. For demo purposes I have taken three style images, which I will run through in a bit: these are the images whose style we want to transfer. So this is the content image and this is the style image. Then we have a load-image function just to read and display the images; again it uses the usual NumPy and TensorFlow functions: we read the image, decode it, convert it, scale it, resize it, and return it. The imshow function is a typical plotting helper that just attaches a title to the image, and then we have the plot itself.

Up to this point I will run each of the cells so you can observe what is happening. First we will see a style image of bushes applied to the dog; in the slides I showed grass applied to the dog, now we will see bushes. Let me print the images we are using: this is the image of bushes, and we are applying it to the dog. Then we have the pre-trained model, VGG-19. I am taking a pre-trained VGG-19 which has all its weights set and can classify images into 1000 categories; the network has been trained on about 1 million images from the ImageNet dataset.
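This is not the exact demo cell, but a minimal sketch of what pulling in such a pre-trained VGG-19 looks like in Keras; the image path is a placeholder, and the classification step is only there to show that the ImageNet weights are loaded and working.

```python
import numpy as np
import tensorflow as tf
from tensorflow.keras.applications import vgg19

# VGG-19 with ImageNet weights; with the classifier head it predicts 1000 classes
model = vgg19.VGG19(weights="imagenet", include_top=True)

# Classify an arbitrary image (the path is a placeholder) to confirm the weights work
img = tf.keras.preprocessing.image.load_img("dog.jpg", target_size=(224, 224))
x = vgg19.preprocess_input(np.expand_dims(np.array(img, dtype="float32"), axis=0))
print(vgg19.decode_predictions(model.predict(x), top=3)[0])

# For style transfer the classifier head is not needed at all, only the
# convolutional feature maps, so later the model is loaded without it
feature_model = vgg19.VGG19(weights="imagenet", include_top=False)
```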
You can see that the combined image has picked up the style of the bushes while adhering to the content of the actual base image we passed in. I have not done much here: I have just used the pre-trained VGG-19 model, and this is the path to that model. We pass it the content image along with the style image, and then pass the result through the tensor-to-image helper, which processes the output and displays it. Now let me quickly uncomment this and transfer another style to the same image. I will show the style image we are referring to now; okay, let me change the name here, it actually loaded from cache. That is the style image we are trying to apply to the content image. I will apply it to the base image, and we get a result like this, which looks like a novel art piece. That is the demo of the first part.

Now, moving back to the content loss. The content loss is essentially an L2 difference, a mean squared error, between the content image and the generated image. If we compare apples to apples, the content is similar, the pixel information should also be similar to some degree, and the content loss will be low. But if we compare an ocean, or sharks, or say oranges with an apple, there will be a higher content loss. We will not do a pixel-by-pixel comparison, though; we will compare higher-level features. How do we get higher-level features? In a convolutional neural network there are blocks, or stages, and as the data progresses from one stage to the next, some of the lower-level features get dropped and we are left with higher-level features. At each layer of a convolutional network, the lower layers represent very minute details, while the layers towards the top contain the higher-level, broad features: this is a car, this is a building. So, to capture the higher-level features, we work with the top layers of a pre-trained model, in our case VGG-19. It is a 19-layer CNN architecture, and, as I said, it can classify images into 1000 categories and has been trained on about 1 million images from the ImageNet dataset.

The VGG-19 architecture looks like this: an input layer and then five blocks. Each block has a bunch of convolutional layers, conv1_1, conv1_2, and so on, up to conv5_1 through conv5_4, followed by the dense layers, which flatten the input and perform the classification. We will use this VGG-19 network and take the higher layers from block five to capture the higher-level features, the content, from which we will compute the content loss.

This is the code; I will zoom in. We use the Keras library, and VGG-19 is the pre-trained model we are using. We specify the path to the base image and the path to the style image, and then we specify some weights: the content weight, the total variation weight, and the style weight, which we will look at in a bit. The base image and style image are loaded and processed from the paths we provided, and then we create a placeholder where the combined image will go. We take these three images, concatenate them into a single tensor, and pass them through VGG-19 with its pre-trained weights. From the pre-trained model we take the third layer of block five as the reference for calculating the loss, because the higher-level features are captured there. We could take the second, third, or fourth layer; it totally depends on you, and taking different layers gives varying results. I have chosen the third layer; you might choose the second. We pass the base image and collect the output from this layer, and we also take the combination features of the combined image from the same layer. Then we apply the L2 norm, the mean squared error, between the two, the generated image features and the content image features, and the content loss is that error multiplied by the content weight we specified at the top.
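Pulling that walkthrough together, here is a compressed sketch of the content-loss computation. The layer name block5_conv3 follows the "third layer of block five" choice above, and the weight value is illustrative rather than the one used in the notebook.

```python
import tensorflow as tf
from tensorflow.keras.applications import vgg19

content_weight = 0.025  # illustrative value; the notebook's weight may differ

# Feature extractor exposing a high layer of block 5 of the pre-trained VGG-19
vgg = vgg19.VGG19(weights="imagenet", include_top=False)
feature_extractor = tf.keras.Model(
    inputs=vgg.input, outputs=vgg.get_layer("block5_conv3").output
)

def content_loss(base_image, combination_image):
    """Squared L2 difference between the high-level VGG features of the
    base (content) image and the generated (combination) image."""
    base_features = feature_extractor(base_image)
    combination_features = feature_extractor(combination_image)
    return content_weight * tf.reduce_sum(
        tf.square(combination_features - base_features)
    )
```

The two arguments are expected to be pre-processed image tensors of shape (1, height, width, 3); swapping block5_conv3 for the second or fourth layer of the block changes how strictly the content is preserved, exactly as mentioned above.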
Coming to the style loss, the second type of loss we want to cover. Some images are similar to each other; we call those correlated images. Other images are different, with different styles or different lower-level details that they do not share; those are considered to have a low degree of correlation. How do we calculate this correlation? The degree of correlation between two images can be computed from the degree of correlation between their feature maps. Here we want to capture the lower-level features, because the lower-level features represent the style; just note that the higher-level features represent the content, while the lower-level features, the outputs of the lower layers of the network, represent the style. So we take the lower-level layers of the convolutional network, fetch the feature maps from there, flatten the feature maps, and take the dot product of the feature maps between the two images. If the value of that dot product is greater than some value we have specified, we consider the pair to have a high degree of correlation, meaning that the styles of the two images match.

Suppose image A is an image of grass and image B is also an image of grass. The dark orange points you see on both images are where they overlap, the regions where the images have a similar, correlated style, so we can say these images are correlated to some extent. If block B were fully orange, we would say that the combination has an even higher degree of correlation.
In our case, when we calculate the style loss, image A will be the style image and image B will be the combined image, the resultant image we get after training. To calculate the style loss for different layers of our network, we use something called the Gram matrix. The Gram matrix is the dot product of all the feature maps against the feature maps: if you take a layer, you take the dot product of feature map a with a, then the dot product of a with b, and so on. The picture should make this clear: we have image A, image B, and image C. All of them are images of grass, but one contains only grass, one also contains bushes, and one contains some brown grass. On the right you see the Gram matrix we calculate: we map feature one against feature one and compute the correlation between them by taking the dot product of the feature maps; then we take the dot product of feature one with feature two of image A, and the result of that product is shown by a colour. We do this for all the features of a layer of an image: for image A we take the dot products across all of its feature maps against the feature maps of that same image, and we get this Gram matrix; likewise we generate the Gram matrices for image B and image C. If the Gram matrices of two images, in our case the style image and the resultant generated image, are highly comparable, we say the style actually held throughout the training, so the style was transferred to the generated image, and we consider that a success. So again we calculate the mean squared error, the L2 norm, between them, and we have to minimize that error.

Coming to the code, the style loss starts at zero, and we have the definition of the gram_matrix function here. As I showed you, it flattens the feature map, takes the dot product of it against itself, and returns the Gram matrix: each dot product of feature maps gives one cell, and doing this for all the feature maps gives the whole Gram matrix. Then there is the style_loss function, where we calculate the mean squared error between the Gram matrix of the combined image and the Gram matrix of the style image; the other arguments are just parameters we pass to the loss function. The feature_layers list shows the active layers we chose: we take block one through block five, and from each block we take its first convolutional layer, so conv1_1, conv2_1, conv3_1, conv4_1, and conv5_1.
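Here is a compact sketch of those two functions, written with the Keras backend in the same style as the walkthrough. The image dimensions are assumptions, and the normalisation constants follow the convention of the official Keras neural-style-transfer example rather than anything specific to this notebook.

```python
from tensorflow.keras import backend as K

img_nrows, img_ncols = 400, 400  # assumed size of the generated image

def gram_matrix(x):
    # Flatten each feature map of a (rows, cols, channels) tensor and take the
    # dot product of the features against themselves: entry (i, j) measures how
    # strongly feature maps i and j fire together, i.e. their correlation.
    features = K.batch_flatten(K.permute_dimensions(x, (2, 0, 1)))
    return K.dot(features, K.transpose(features))

def style_loss(style_features, combination_features):
    # Squared difference between the Gram matrices of the style image and the
    # combined image, normalised by channel count and image size.
    S = gram_matrix(style_features)
    C = gram_matrix(combination_features)
    channels = 3
    size = img_nrows * img_ncols
    return K.sum(K.square(S - C)) / (4.0 * (channels ** 2) * (size ** 2))
```

In the walkthrough this style_loss is evaluated for each of the chosen layers (conv1_1 through conv5_1) and the per-layer results are accumulated into the total style loss.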
Each of these lower-level layers, one from each block, is considered when calculating the style loss: we extract the feature maps from those layers and pass them to the style-loss function we defined.

Then comes the total variation loss. The total variation loss is simply a loss with respect to the quality of the resultant image: if the combined image is distorted and pixelated, we consider it noisy and the loss will be high. What we do is take the combined image and shift each of its pixels to the right by one, and in a second step take each pixel of the generated image and shift it downward by one. We store both of these results, in a and b respectively, take the sum of the two, and calculate the error. By shifting the pixels to the right and downward we can tell whether the image is highly distorted or not. That is the total variation loss. Once we have these three losses, we combine all three to get the resultant loss.

Then it is time to train the model. We have computed the Gram matrix, and based on it the style loss; we have computed the content loss by taking into account the content image and the generated image; and we have calculated the total variation loss. The network can now be trained by taking this loss into account, and we need to minimize it, so we use an optimization technique. Essentially, Neural Style Transfer is an optimization problem solved with a quasi-Newton numerical optimization technique called BFGS; the L in L-BFGS stands for limited memory, meaning we can constrain it based on the available resources. L-BFGS is a numerical optimization algorithm that finds a local minimum of an objective function based on the gradient of that function. So what we need to do is minimize the computed loss over iterations using this gradient-based method, updating the value of each pixel by an amount proportional to the negative of the gradient that comes from the loss function.

Let's dive into the code. It is pretty much the same as the snippets I showed you: we have the base image and style reference image paths, and the weights are defined here, the total variation weight, the style weight, and the content weight. We process the images, specify the dimensions of the generated image we want, and specify the number of iterations. For this demo I am using only 50 iterations, but in a real scenario you would use something like 4000 or 5000 iterations to get the result I showed you in the slides. Let me go back to the slides once and show you the end result: this is what we should see after about 4000 iterations. Before walking through the rest of the notebook, the sketch below pulls together the total variation loss and the L-BFGS optimization loop.
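Here is that sketch: total_variation_loss implements the shift-right / shift-down trick just described, and the Evaluator plus fmin_l_bfgs_b loop shows the usual way the combined loss is handed to SciPy's L-BFGS optimiser. The image size and weights are assumptions, and fetch_loss_and_grads is a stand-in for a Keras function returning the loss and its gradients for a candidate image, in the spirit of the official Keras example rather than a copy of this notebook.

```python
import numpy as np
from tensorflow.keras import backend as K
from scipy.optimize import fmin_l_bfgs_b

img_nrows, img_ncols = 400, 400            # assumed size of the generated image
content_weight, style_weight = 0.025, 1.0  # illustrative weights
total_variation_weight = 1.0

def total_variation_loss(x):
    # Difference between the image and itself shifted down by one pixel (a)
    # and shifted right by one pixel (b); large values mean a noisy, pixelated image.
    a = K.square(x[:, : img_nrows - 1, : img_ncols - 1, :] - x[:, 1:, : img_ncols - 1, :])
    b = K.square(x[:, : img_nrows - 1, : img_ncols - 1, :] - x[:, : img_nrows - 1, 1:, :])
    return K.sum(K.pow(a + b, 1.25))

# The total loss is the weighted sum of the three pieces, e.g.
# loss = content_weight * c_loss + style_weight * s_loss \
#        + total_variation_weight * total_variation_loss(combination_image)

class Evaluator:
    """Caches loss and gradients so L-BFGS can request them separately."""

    def __init__(self, fetch_loss_and_grads):
        self.fetch = fetch_loss_and_grads  # assumed: image -> [loss, gradients]
        self.loss_value = None
        self.grad_values = None

    def loss(self, x):
        outs = self.fetch([x.reshape((1, img_nrows, img_ncols, 3))])
        self.loss_value = outs[0]
        self.grad_values = np.array(outs[1]).flatten().astype("float64")
        return self.loss_value

    def grads(self, x):
        return self.grad_values

# evaluator = Evaluator(fetch_loss_and_grads)
# x = preprocess_image(base_image_path).flatten()  # hypothetical helper from the notebook
# for i in range(iterations):
#     x, min_val, info = fmin_l_bfgs_b(evaluator.loss, x,
#                                      fprime=evaluator.grads, maxfun=20)
```

Each call to fmin_l_bfgs_b nudges every pixel of x in the direction of the negative gradient of the combined loss, which is exactly the pixel update described above.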
Back in the notebook, the images are first pre-processed: the preprocess_image function just opens and resizes an image, applies the image-processing functions, and gives us a tensor. There is also a deprocess_image function, which converts a tensor back into an image; nothing much happens there either, a reshape is applied and the NumPy array is clipped to handle any overflow. We pass the base image and the style image through the preprocess function we created and get the tensor representations of the two images. Once we have those, we have all three images: the combined image, which for now we treat as a placeholder; the pre-processed style image; and the pre-processed base image. These three are concatenated into a single tensor and fed into the pre-trained VGG-19 network that we imported at the top. Then the model is loaded, and the key layers that we want to match against are collected into a dictionary; outputs represents exactly that.

Now it is time to compute the neural style loss we talked about earlier. This is essentially the same function I showed you before, slightly modified from the version in the slides. There is the gram_matrix function, which I have already explained: it calculates the Gram matrix for the tensor fed to it. Then there is the style loss, where the Gram matrices of the combined image and the style image are generated and we calculate the L2 error between them. The content loss is simply the MSE, the mean squared error, between the generated image and the base image. And the total variation loss, as I told you, shifts the image by one pixel to the right and by one pixel downward, takes the sum of both, and calculates the error from that. Then we take all three of these losses and add them to form the main loss.

The next thing you should notice is that we compute the gradients for this loss and feed them to the Evaluator class we have created; the Evaluator returns the loss value and the gradient value at each stage. Then come the iterations, and my network is training: over, say, 4000 iterations we take this loss, pass it to the Evaluator class, get the loss and gradients back, and update the value of each pixel by the negative of the gradient. Ultimately, what we see is this combined image. That is more or less all for the implementation; I will be posting the links to all of these cool things.

Next we have Pix2Pix; I will just talk about it briefly. It is used for image-to-image translation: you can have, say, a hand-drawn sketch or schematic diagram translated into a real-looking image, a blueprint-like representation translated into an actual building, silhouettes translated into photographic images, or a Google Maps street view translated into a map view, and so on. Then we have CycleGANs, which are again an advanced GAN approach to style transfer. Here we essentially have two GANs, one training on the first input and one training on the other, but both are cyclically dependent on each other.
That is another application, but discussing it in depth is out of the scope of this talk, since we restricted it to neural style transfer. I am just pointing you towards it: you can explore CycleGANs as well if you are more inclined towards GANs. These three are the popular networks out there for creating stylistic artifacts, and I have linked to each of them. Wherever I used external resources, I took the code references from David Foster's Generative Deep Learning and Jakub Langr's GANs in Action; those are the two reference books I consulted.

That concludes my talk. We are also hiring at OLX Group, so feel free to reach out to me or just drop by the careers section and apply for the roles. And don't forget to follow me on Twitter and LinkedIn so we can get connected; we can have your questions answered on Discord too, and later on we can connect on those platforms. Thanks a lot for listening.