Hello everybody, thanks for having me, and thanks everyone for making it out. To the people from Chennai, thanks for coming out on a Saturday; to the people from around India, thanks for coming out to PyCon. A bunch of you have been regulars, nice to meet you again, and a bunch of you are new. Welcome to the community. So this talk really should have been titled "My Experiments with Deepfakes", because what I'm going to do over the next 25 to 30 minutes is walk you through how I discovered what deepfakes are, how the engineering mind started to work, how we broke the problem down into different segments, and how we built a pipeline to actually generate deepfake-like images. Just to start off, a quick introduction: I'm Navin, I go online by the name Mad Max, and I work at tac.ai. I'm a Manchester United fan, which means I enjoy pain. At any point, before the talk, during the talk, or after the talk, feel free to tweet at me at @navinpai. And this is what we're going to be talking about over the next 25 minutes. We'll obviously talk about deepfakes, we'll talk a little bit about neural networks, and there's quite a bit of Python involved as well, specifically with the neural networks.
We'll talk about a kind of neural network called autoencoders. Then we'll talk about OpenCV, because honestly there's not much you can do with images in Python, or any other language, without using OpenCV. Then we'll touch a little on PyTorch. There's one very clever trick that was used, at least initially, when deepfakes came up, and that trick has formed the whole basis for how deepfakes are done. We'll keep it very light on math, fairly light on ethics, and we'll go through some code throughout. So let's get started. Before anything else: none of this is original research. All of it is based on different papers by the folks over at NVIDIA and Facebook, the original posts that came up on Reddit, and so on. It's mostly derivative work; like I said, it's just how I personally experimented with deepfakes and how I came up with a pipeline for it. So, what's all the fuss about? Deepfakes are an image-morphing technique that shot to fame somewhere in late 2017 or early 2018. Initially it came up on Reddit, which is where everything on the internet shows up, but it showed up in a very NSFW kind of way, so we won't really talk about that. The moment it hit mainstream media was when the true holders of our culture, so to speak, a small website called BuzzFeed, came up with an article and a video along with it.
So I'll just play the video that they came out with. [Video] "...our enemies can make it look like anyone is saying anything at any point in time, even if they would never say those things. So, for instance, they could have me say things like, I don't know, 'Killmonger was right', or 'Ben Carson is in the sunken place', or how about this: simply, 'President Trump is a total and complete dipshit'. Now, you see, I would never say these things, at least not in a public address, but someone else would. Someone like Jordan Peele. This is a dangerous time. Going forward, we need to be more vigilant about what we trust from the internet. It's a time when we need to rely on trusted news sources. It may sound basic, but how we move forward in the age of information is going to be the difference between whether we survive or whether we become some kind of fucked-up dystopia. Thank you. Stay woke." Yeah, so I think that kind of brought what deepfakes are to the mainstream in many ways. Then, obviously, there was a lot of conversation about how you come up with these deepfake videos, because as you could see, that video is extremely believable. If you had, for example, a president declaring war, as opposed to something obviously funny, it would be very hard to tell the difference between a joke and what is actually real, between a deepfake and an actual video. So, just to get a little more into the state of the art: we have a bunch of photos here, and I want to ask the audience which of these you think is computer generated, the one on the left or the one on the right. Those who say left, raise your hands. Those who say right? Okay, it seems fairly evenly split. Let's go to the next set. Computer generated on the left, hands up.
Okay, computer generated on the right? Okay, there's a massive majority saying the one on the right is computer generated. The third one is a little harder. Computer generated on the left, hands up. Computer generated on the right? Okay. Fun fact: every image that you saw was actually generated by a computer, by an artificial neural network. So everyone who believed any of these images were real was completely tricked by an artificial neural network. Now, if we take a step back, we need to talk about what this whole deep learning thing is. There's a lot of conversation about classical machine learning algorithms, which a bunch of people have spoken about today and I'm sure will tomorrow as well: what the pros and cons are, and how you choose between them. Primarily what I've noticed is that deep learning usually comes in when the problem is about data, when you're not able to hand-engineer the features, when feature extraction doesn't work perfectly. So if you can manage a dataset that's large enough, wide enough, and covers all the cases you want, deep learning is something you should always look at. Just before lunch we had a talk on machine learning bias as well; if anyone attended that, it did an amazing job of explaining how important the dataset is when you're talking about deep learning. As an aside, purely as an aside, simply adding deep learning to your LinkedIn means every one of the sponsors out here today will start contacting you starting tomorrow. That's a small added advantage of having worked with deep learning. So what exactly does deep learning do?
This is the traditional picture you see when people say deep learning: you have an input layer, an output layer, and a bunch of hidden layers in between, with weights and biases that decide how data propagates between them. So you have an image, it's put through the input layer, some things are tweaked in between, and it comes out of the output layer as a prediction. Depending on whether the prediction is right or wrong, you readjust, going backwards from the output layer through the hidden layers to the input layer. Then you take another image, and it goes on, back and forth, until the middle layers capture all the information you need. Now, for a lot of people deep learning is something that's come up very recently: 2015, 2016, 2012, maybe 2010 at most. But the foundations of deep learning are actually very, very old. Of the two papers mentioned at the bottom, the first is called "A Logical Calculus of the Ideas Immanent in Nervous Activity", published in 1943, which is even before India got independence. So the whole idea of neural networks has been around for 70 or 80 years now; it's only now, when we have computing power capable of actually processing it, that we've come to this point. Why is deep learning so important? Simply because you can approximate almost anything with an artificial neural network; it's mathematically proven that any continuous function can be approximated. I'm not saying you're going to get the right answer, but you'll get an answer that is approximately the right answer. What you see below is a convolutional neural network, and those are actual convolutions, but we'll skip that in the interest of time. So let's get to the actual problem at hand.
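Before moving on, the forward pass just described (input layer to hidden layer to output layer, connected by weights and biases) can be sketched in a few lines of NumPy. The layer sizes here are invented for illustration, the weights are random rather than trained, and backpropagation is left out.

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(x):
    # A common activation function: zero out all negative values.
    return np.maximum(0.0, x)

# Randomly initialised weights and biases: 3 inputs -> 4 hidden -> 2 outputs.
W1, b1 = rng.normal(size=(3, 4)), np.zeros(4)
W2, b2 = rng.normal(size=(4, 2)), np.zeros(2)

def forward(x):
    hidden = relu(x @ W1 + b1)   # input layer -> hidden layer
    return hidden @ W2 + b2      # hidden layer -> output layer

# One 3-dimensional input produces a 2-dimensional prediction vector.
prediction = forward(np.array([1.0, 0.5, -0.2]))
```

Training is the part this sketch omits: comparing `prediction` against a known answer and nudging `W1`, `b1`, `W2`, `b2` backwards through the layers until the predictions improve.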
Let's talk about deepfakes itself. To talk about deepfakes, we first need to talk about how this was done before deepfakes came along, because this is literally how faces have been swapped traditionally, and how computer-generated animation has worked. Here you see Mark Ruffalo, who plays the Hulk in the Avengers series. The dots all over his face are used by computers to figure out his expression. What you do is take a model of the Hulk, then adjust the facial expression on the model to match every single dot on the actor's face. Initially it was done manually; now it's done in a somewhat computer-assisted manner. But deepfakes take it to the next level altogether: totally, completely automated. One more thing: if you watch any of the Avengers movies, or the making of any superhero movie, you'll see a lot of motion-capture suits, all with white dots across them, used for exactly the same purpose. You have facial expression capture too, like Josh Brolin as Thanos in Avengers; for most of the movie he just had dots across his face, and the character was added later as CGI. So let's break down the process we just spoke about. To have Mark Ruffalo become the Hulk, and to match the Hulk's expressions to what Mark Ruffalo was doing in the scene, we had to go through the following steps. First, create a model with markers for every movable muscle on the Hulk.
So for the Hulk, you create a model that has dots for the mouth, dots for the nose, dots for everything, and so on. Then you get the face that's in the frame, which is Mark Ruffalo's. On that face you identify the location of every marker that corresponds to a marker on the model. Then you manually move every corresponding marker to the right location, and that matches Mark Ruffalo's expression onto the Hulk, for one frame. Now, given that movies are usually shot at 24 to 25 frames per second, and even higher these days, you can imagine how much work it is to do this manually for every frame. So, being engineers, we obviously look for ways to automate this, to create an actual pipeline. These are the steps. First, we want to generate a model for the face swapping itself. Then we need to identify the expression on the source, then map it to the target. Then we set up a pipeline where we take a video, break it down into frames, pass each frame through this process, and convert the result back into a video. That's the process we're going to follow today. First things first: we need to generate data, because nothing in deep learning works without data.
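Two of those pipeline steps, breaking a video into frames and later stitching processed frames back into a video, are single ffmpeg invocations. Here's a sketch, with hypothetical paths and frame rate, built as argument lists ready for `subprocess.run` (ffmpeg has to be on your PATH to actually run them):

```python
import subprocess  # needed only when you actually run the commands

def split_cmd(video_path, frame_dir):
    """ffmpeg command that breaks a video into numbered PNG frames."""
    return ["ffmpeg", "-i", video_path, f"{frame_dir}/frame_%05d.png"]

def stitch_cmd(frame_dir, fps, video_path):
    """ffmpeg command that stitches processed frames back into a video."""
    return ["ffmpeg", "-framerate", str(fps),
            "-i", f"{frame_dir}/frame_%05d.png",
            "-pix_fmt", "yuv420p", video_path]

cmd = split_cmd("interview.mp4", "frames")
# subprocess.run(cmd, check=True)   # uncomment to actually run ffmpeg
```

The `%05d` pattern is ffmpeg's standard numbered-image convention, and `-pix_fmt yuv420p` just keeps the output playable in common video players.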
There are multiple sources you can go to for data. Google Images is always a really good bet, because it provides a large variance of data; especially if you're trying to generate deepfakes of someone famous, you'll find a lot of images. You can also generate images from video itself: use ffmpeg or something like it to break the input down into individual frames. And then there's what is arguably any machine learning engineer's best friend: data augmentation. Data augmentation simply means you take an image and generate different permutations of it. This is just one example: you take an image, randomly decide whether to flip it, do a random zoom, randomly crop it to some extent, rotate it by some degrees, randomly adjust the brightness or contrast. Along with that you can do other things, like randomly adding noise; there are so many possibilities. I was chatting with someone just before this talk and telling them the same thing: if you do a good enough job of data augmentation, then even with, say, 500 to 1000 images, which is not really much data when you're talking about deep learning, you can augment that into a large enough dataset to get decent results. The rule of thumb is always the same: the more varied your dataset, the better the generalization, simply because if you only ever show a front view of a person, then when a video has the person turning to the side, the algorithm won't know how to fill in the gaps.
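The augmentations just described can be sketched in plain NumPy. The ranges below are made-up examples; in practice you would reach for a library like torchvision's transforms or albumentations.

```python
import numpy as np

rng = np.random.default_rng(42)

def augment(image):
    """Return a randomly perturbed copy of an H x W x 3 image array."""
    img = image.astype(np.float32)
    if rng.random() < 0.5:                           # random horizontal flip
        img = img[:, ::-1, :]
    if rng.random() < 0.5:                           # random crop off one corner
        h, w = img.shape[:2]
        top = rng.integers(0, h // 10 + 1)
        left = rng.integers(0, w // 10 + 1)
        img = img[top:, left:, :]
    img = img * rng.uniform(0.8, 1.2)                # random brightness scaling
    img = img + rng.normal(0, 5, size=img.shape)     # random pixel noise
    return np.clip(img, 0, 255).astype(np.uint8)

# Turn one face image into many slightly different training samples.
face = rng.integers(0, 256, size=(100, 100, 3), dtype=np.uint8)
samples = [augment(face) for _ in range(10)]
```

Every call produces a different variation, which is exactly how a few hundred photos get stretched into a usable training set.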
So ideally you want as much variance as possible: in lighting, in rotation, in zoom, in all cases. Now that we have the data, what do we do with it? We need to identify faces in it, because what we're talking about is face swapping, so you need to identify the face itself. For what I was doing, there's a brilliant Python library. OpenCV itself already provides face detection, but MTCNN is absolutely brilliant when it comes to identifying faces. This was a very famous photo that came out of the Oscars some years back, and you can see almost every face in it has been identified. The cool thing about MTCNN is that it's a Python package, so you can just pip install it, read an image, pass it through MTCNN, and it detects faces and gives you bounding boxes. It also gives you locations for where the eyes and the mouth are supposed to be, which is very useful later on. Here's an example of how you use MTCNN. For a photo (a different photo, not the same one), it tells you: the bounding box is this, the nose is here, mouth right, mouth left, left eye, right eye, everything, along with the confidence with which it's making the prediction. MTCNN itself uses a cascade of neural networks behind the scenes, but we won't get into that right now. So now we come to the crux of everything. For deepfakes, we use a specific kind of deep neural network called an autoencoder.
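Before getting to autoencoders, here's what that MTCNN step looks like in code. The `mtcnn` package's `detect_faces` call returns, per face, a dict with `box`, `confidence`, and `keypoints` (the format the package documents); the crop helper and the sample detection below are hypothetical, so the sketch runs without the package installed.

```python
import numpy as np

# Detecting faces with the `mtcnn` package looks roughly like this
# (shown as comments because it needs `pip install mtcnn` and a real photo):
#
#   from mtcnn import MTCNN
#   detector = MTCNN()
#   detections = detector.detect_faces(image)   # image: H x W x 3 RGB array
#
# Each detection is a dict with a 'box' ([x, y, width, height]),
# a 'confidence' score, and 'keypoints' for the eyes, nose, and mouth.

def crop_face(image, detection, min_confidence=0.9):
    """Crop the detected face out of the frame, or return None if unsure."""
    if detection["confidence"] < min_confidence:
        return None
    x, y, w, h = detection["box"]
    x, y = max(x, 0), max(y, 0)   # boxes can start slightly off-frame
    return image[y:y + h, x:x + w]

# A hypothetical detection, in the format described above:
frame = np.zeros((200, 200, 3), dtype=np.uint8)
det = {"box": [50, 40, 60, 80], "confidence": 0.97,
       "keypoints": {"left_eye": (70, 60), "right_eye": (95, 60),
                     "nose": (82, 75), "mouth_left": (68, 95),
                     "mouth_right": (96, 95)}}
face = crop_face(frame, det)   # an 80 x 60 x 3 crop
```

The crops produced here are exactly what gets fed into the autoencoder in the next step, and the box coordinates are kept around for pasting the swapped face back later.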
An autoencoder does a very simple job: it takes an input and passes it to an encoder, and the encoder generates what's called a compressed representation, or a latent-space representation. That's passed to a decoder, and the decoder uses the compressed representation to produce a reconstructed input. So what do we do? We take an image, pass it to the encoder, it creates a representation, the decoder creates a reconstruction, and then we compare the reconstructed input with the original. As you keep doing this over time, the encoder learns to generate as good an encoding as possible, while the decoder learns to use as little information as possible from the compressed representation to create a reconstruction that's as close as possible to the original input. That's the whole process, and when you hear people talk about training in deep learning, training is just that: you pass data through the encoder and decoder, compare the output with the input, and do a whole lot of backpropagation to adjust the weights of the encoder and decoder. You repeat the process until you have satisfactory results, and what "satisfactory" means varies from person to person. It seems very complex at first, but actually implementing an autoencoder in PyTorch is about 70 lines of code; that's it. Of that, about 30 lines is the encoder, about 30 lines is the decoder, and the remaining handful is everything else. So it's actually very, very simple to write your own encoder and decoder, and obviously you add data and GPUs to taste. For my training, I mostly use AWS machines: P2 or P3 instances, depending on the need, and on the dataset, of course. So now we talk about the training itself.
This is basically what you do for training: you have two encoder-decoder systems. For one autoencoder, you pass in a morphed face of person A and ask it to generate what the right face should look like. So you take the actual face, morph it in some way (we spoke about how you augment data), the network generates a face, and you compare it with the original. For autoencoder B, you follow the same process: you generate a morphed image, maybe add some noise, pass it through, and compare the output with the original. Over the course of training, the encoder at the top, encoder A, learns how to represent Jimmy Fallon's face in this case, whereas the second autoencoder figures out how to do John Oliver's face. You keep training until both are able to produce decent encodings, and decent decodings of the encoded representation. And remember, at the start of the talk I said there's one really, really clever trick, which was also published in the original paper that NVIDIA came up with, and which totally makes sense in retrospect, though I don't know who initially came up with it. So here's the really clever trick. The encoders and decoders are separable: the encoder works independently of the decoder.
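The two-autoencoder training just described can be shrunk to a toy NumPy version: linear layers only, with random 64-dimensional vectors standing in for flattened face crops. All the sizes, the learning rate, and the step counts are invented for illustration (a real version would be the convolutional PyTorch one mentioned above). The last line is the clever trick itself: person A's encoding pushed through person B's decoder.

```python
import numpy as np

rng = np.random.default_rng(0)

class TinyAutoencoder:
    """Toy linear autoencoder: 64-dim inputs squeezed through an 8-dim latent."""

    def __init__(self, dim=64, latent=8, lr=0.01):
        self.We = rng.normal(scale=0.1, size=(dim, latent))  # encoder weights
        self.Wd = rng.normal(scale=0.1, size=(latent, dim))  # decoder weights
        self.lr = lr

    def encode(self, x):
        return x @ self.We            # compressed / latent representation

    def decode(self, z):
        return z @ self.Wd            # reconstructed input

    def train_step(self, x):
        z = self.encode(x)
        err = self.decode(z) - x      # reconstruction error
        # Hand-rolled gradient descent on the mean squared error.
        self.Wd -= self.lr * z.T @ err / len(x)
        self.We -= self.lr * x.T @ (err @ self.Wd.T) / len(x)
        return float((err ** 2).mean())

# Random vectors standing in for face crops of person A and person B.
faces_a = rng.normal(size=(200, 64))
faces_b = rng.normal(size=(200, 64))

ae_a, ae_b = TinyAutoencoder(), TinyAutoencoder()
losses_a = [ae_a.train_step(faces_a) for _ in range(500)]
losses_b = [ae_b.train_step(faces_b) for _ in range(500)]

# The clever trick: encode person A's frame, decode with person B's decoder.
swapped = ae_b.decode(ae_a.encode(faces_a[:1]))   # shape (1, 64)
```

Each autoencoder's reconstruction loss falls as it learns its own person; the cross-wired `swapped` output is the face-swap idea in miniature.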
Because the encoder works independently of the decoder, you use the encoder that's been trained to create a compressed representation of person A's face, but then you use the decoder that's been trained to decode person B's face. The encoding in the middle is a tight, compressed representation of person A, but when person B's decoder works on top of it, it actually produces person B, with the facial expression of person A. In many ways, we are algorithmically tricking our autoencoder into transferring person A's expression onto person B. There's been a lot of work in deep learning on image transfer and style transfer, and this is an extension of that in many ways. So now we just put it all together. Given a frame with person A's face on it, what we've generated is person B's face with the exact same expression. Now it's a matter of assembling it. If you remember, MTCNN gives us the bounding box, so it tells me exactly where that face is supposed to be in the frame. So I can just swap the face in using something like PIL: you do an Image.open, open the other image, and paste it in at the x and y coordinates of the bounding box. Then you put it all together, which simply means taking all the frames you've generated and stitching them up into a movie. Again, I used ffmpeg for this; there are multiple tools you could use. ffmpeg also provides a bunch of Python bindings.
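Stepping back to that paste-in for a moment: with PIL it's an `Image.paste(face, (x, y))` at the box coordinates; the NumPy equivalent, with a hypothetical frame and box, is just array indexing.

```python
import numpy as np

def paste_face(frame, new_face, box):
    """Paste a generated face back into a frame at the detector's bounding box.

    `box` is the [x, y, width, height] list MTCNN returned for the
    original face in this frame.
    """
    x, y, w, h = box
    out = frame.copy()
    out[y:y + h, x:x + w] = new_face[:h, :w]   # naive rectangular paste
    return out

# Hypothetical frame, generated face, and bounding box:
frame = np.zeros((200, 200, 3), dtype=np.uint8)
generated = np.full((80, 60, 3), 255, dtype=np.uint8)  # stand-in for person B's face
result = paste_face(frame, generated, [50, 40, 60, 80])
```

Real pipelines usually blend the edges (feathered masks, color matching) instead of a hard rectangular paste, which is one reason good deepfakes look so seamless.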
The bindings are pretty great by themselves. I personally use ffmpeg from a shell script, but you can also use the Python bindings to have an entirely Python-based workflow. The cool thing is that when you have everything stitched together like this, it's very easy to get confused between the end product and what you started with, simply because the quality ends up being really, really good. I don't know if any of you have seen the movie Ankhon Dekhi, starring the amazing actor Sanjay Mishra, in which his character says he'll only believe things he sees with his own eyes. But as we've just seen, using a simple pipeline like this, where we're just breaking a problem down into steps and putting them together, we reach a stage where what you see is really not what's real. And that's really cool. Small side note: I actually had a demo for this, but just yesterday morning I got a copyright-strike notice from the world's biggest YouTube channel. T-Series copyright-struck my video because it used some of their music. So I don't have a demo right now, but I can probably put it up in the slides later, once I get back. Which is very weird; it's an unexpected problem, honestly. So yeah, I think that brings us to the end of the engineering part of the talk. The final bit, which Narayan also mentioned right at the start: the moment people see deepfakes, they automatically start talking about ethics. Is this ethical to do? Is it right? Is it wrong?
As I mentioned at the start of the talk, the most famous deepfake videos involve Scarlett Johansson. You can Google it if you want to, though I wouldn't really suggest it: extremely NSFW videos starring Scarlett Johansson came out right at the start. But again, I think as engineers it's very difficult to keep the engineering apart from the ethics. In many ways, what we as engineers do is be very proactive about this: if you know a technology exists out there, the question is how you deal with it, rather than whether it should be out there, because once it's on the internet, it's going to be out there forever, and we know there will be people using it. The second point I wanted to make is that six months is an eternity in deep learning research. A lot of what I've spoken about is maybe 12 to 16 months old; this is the most straightforward pipeline you can set up to generate deepfakes, and it's based largely on the original paper that NVIDIA came out with. Funny story about that: the original person who came up with a deepfakes pipeline was an anonymous user on Reddit, but he did not publish any of his work. So it just so happened that a bunch of researchers at NVIDIA saw that there was no published research on this method, came up with a paper using it, and took the credit. So when NVIDIA claims they came up with this methodology, they're really taking credit from some random user on Reddit. But that was back in early 2018, and six months is such a long time now: there have been so many improvements on top of the same system, and I'll show you a video of that as well. So, in those months we've had much better techniques.
Autoencoders themselves have seen a lot of further research. GANs, generative adversarial networks, are really, really big; I think 2018 had the maximum number of papers published on GANs. Everyone from Facebook onwards started talking about GANs: CycleGANs, DCGANs, blah-blah-GANs, every GAN in the world. There's a lot of research going on in expression modeling and in volumetric geometry, which has fed back into generating more convincing deepfakes, and as time goes on the lines blur more and more between what you see and what is real. Similarly, SIGGRAPH, CVPR, and ICCV, the big conferences in computer vision, have just been going on and on with papers that work along similar lines. What's actually important is the visibility that's come about because of deepfakes. The first video I showed you, from BuzzFeed, is where it all started getting visibility with the public. But now you have The Wall Street Journal talking about it, The Washington Post talking about it, Wired talking about it, us talking about it. So I think in many ways that's important, and it also leads to the second thing: how do you identify deepfakes? There's a lot of research going on there as well. DARPA has an open challenge right now for identifying deepfakes. Blink detection was usually seen as something you could use: because generation goes frame by frame, and a blink is short enough that it's very hard to model, deepfake videos usually had bad, stuttered blinking. But a lot of research has gone into fixing that too. There's also talk of mismatch CNNs, where you have an adversarial neural network.
So you take one neural network to identify what is real and what is fake. And just to end, I want to show off one video that came out last year. It's an amazing paper; I would strongly urge all of you to read it. It's titled "Everybody Dance Now", and it's one of my favorite papers from last year. If you read it, you'll see it uses a very, very similar pipeline to what we spoke about, with just one or two additional steps. So just watch this. As you can see, the actor in the frame has actually not performed any of these dances. You take a source video and follow a very similar process to what we spoke about, except you add pose detection in the middle, and that pose detection is then passed as an input to the autoencoder. They also have a GAN in the mix somewhere, but it's a very similar process to what we've spoken about, used to generate something that is, in my opinion at least, absolutely unbelievable. You'll still see it's a little glitchy here and there, and there are always improvements happening on that. So yeah, that's almost entirely my talk. Thank you so much for being here. The slides for all this are available at navinpai.github.io. Any questions, I can take them now or offline, or you can tweet me at @navinpai. Thank you so much. And please remember: after deepfakes, everybody lies. Thank you so much. [Host] Thanks, Navin. So we have time for questions. [Audience] Hi, I have a doubt. Autoencoders are mainly used for unsupervised learning, right? So in this case, if we give images mixed together without any labels, say some are real and some are fake, can autoencoders be used to detect that? How can autoencoders be used in deepfakes in such cases, without any labels? [Speaker] I'm not sure about the question you're asking.
I actually can't hear you well. [Audience] The thing is, autoencoders are unsupervised, right? So if we have unlabeled data, say some images are real and some are fake, and we try to build a deepfake system, will autoencoders be useful? How can the reconstruction error be used for detection in that case? [Speaker] Yeah, it depends entirely on how you're calculating your reconstruction error. If your reconstruction error doesn't care about whether an image is real or fake, and you're not using that as an input at all, it doesn't really matter; it will still work. In fact, when you're augmenting data, in many ways you're generating fake data. I've had friends who worked on something similar, where they used the output of an autoencoder, or the output of a GAN, fed back into training in a loop: you take the output of a GAN, assume it's real data, and pass it back as an input. You're not really worried about whether it's real or fake, or whether it's in a class or not; you assume it's all the same class, that a generated image of person X is the same as an actual image of person X. So for training it doesn't really matter as much. [Audience] Hello. Great talk, by the way. [Speaker] Thank you. [Audience] Can you talk a little about zero-shot approaches? The approaches you've talked about involve taking one specific face and training an entire model to map it to another specific one. Can you talk about zero-shot approaches, where the input to the model is... I mean, you get it, right? [Speaker] Yeah, actually, I've not really worked much with zero-shot approaches, so I wouldn't really be an expert to talk about them in any way.
Um, zero-shot approaches always end up being a very flaky area; when I've worked with them, they tend to be very hit or miss. Either it works and gives you a good paper you can publish, or, when you actually want a pipeline that works in a stable manner, it's very hard to use them. I think zero-shot approaches are usually used in collaboration with some other approach; you don't really use zero-shot by itself, at least personally and among the circle of researchers I work with. So it's very hard for me to comment on zero-shot approaches in terms of accuracy or how they'd work with autoencoders, but I'm open to learn. We should probably catch up right after the talk if you've worked with zero-shot methods. Any more questions? There, at the back, to your right. [Audience] This is all offline processing. How do you think we can approach this with Python on live data, say if we're doing something live? When I was doing just facial recognition, nothing else, C is faster, everyone knows that, and there are libraries like YOLO, You Only Look Once, which are real time. But when we switch to Python, the frame rate is very low, even for just facial recognition. So how do you do deepfakes in a live setting? Is there any possibility in the future, or something like that? [Speaker] Yeah, so very recently, about a month back or so, there's this app that came out called Zao, a Chinese app that became the world's most downloaded app or something like that, which promises to generate deepfakes in near real time from, I think, four images, or one image, something like that.
So there are techniques being discussed. Again, I haven't dissected what Zao does. In between there was also an app called FakeApp (not FaceApp) which helped you do something similar. I'm certain all of those used C++ bindings behind the scenes rather than Python itself, because they're Android apps. So I don't really have a yes or no answer for how I think it would happen, but as the tooling moves ahead, and PyTorch is primarily used for a lot of the research going on, I think over time we'll see more near-real-time solutions. I never talk about real time for any of this, because deep learning obviously involves a lot of processing, so real time is in many ways a pipe dream. But near real time is almost good enough, and almost good enough is good enough for us as engineers. So I think, yeah, there will be improvements. There are a couple of papers we can discuss offline as well, especially around what I said about volumetric geometry: people are using volumetric-geometry methods that are way, way faster than the approach we spoke about here. There's a bunch of papers that talk about it, but there's no one-size-fits-all method that's guaranteed to be way faster. So yeah.
So yeah Thank you So funnily enough that's one of the core reasons why you use mtcnn's for face facial detection mtcnn's work way faster than Open cv's facial identification or most facial identification in fact that was Touted as one of the core features of mtcnn Because it's a pre-prepared model that you're just you're just basing it off of so its its predictions are very very fast But yeah, I think every part of the pipeline is optimized in some way or the other to make sure that the entire pipeline gets optimized And that's kind of how we work either way right you like take small improvements across the table And then make changes