So, our next speaker is Siwei Lyu from SUNY Albany. The code that you saw, he's going to talk you through, but that deepfake, that was him and his grad student, Lee, and our volunteer Barton who did that. And just a short reminder, if you're going to drift in and out of talks, at five o'clock we have some really cool hands-on sessions; you can try some of this stuff yourself from five to seven today. Siwei?

Thank you. Thank you very much. All right. Good morning, everybody. This is my first DEF CON, but it has already been very exciting. So, I was given full credit for making this video of Tom Perez, but in doing so we actually generated quite a few videos. You have probably seen the video of the DNC chair apologizing that he cannot come to Las Vegas; the DNC did this to increase awareness. I just want to say we generated several videos with different reenactments, from Bob to Tom Perez, and this one is actually my personal favorite, because the face is not as stern, and it's quite nice.

But this showcases what we mean by deepfake. You see that here we basically transfer the facial expressions and facial movements from one person's face, in this case Bob's face, to another person's face, in this case Tom Perez's. This is what we call a deepfake, and it is all done by an algorithm, an AI algorithm. Now imagine what we can do with this kind of technology. Some of it is interesting, some of it is bad. For instance, we can make Nicolas Cage play whatever role he wants to play, be anybody he wants to be. More interesting, we can make President Obama sing a song that he may have wanted to sing but never got the chance to sing to us, or we can make Mark Zuckerberg say something that he would never tell us: "I wish I could keep telling you that our mission in life is connecting people, but it isn't. We just want to predict your future behaviors. Spectre showed me how to manipulate you into sharing intimate data about yourself and all of those you love for free. The more you express yourself, the more we own you."

So this is the kind of technology that, when we worked on this project, gave me a lot of sleepless nights, because on the one side it is very, very fascinating, and on the other side it is also very scary: we now have the ability to put any words into anybody's mouth, at any place, under any circumstance. This is what we generally call deepfake. But as a researcher, as a computer scientist, my interest lies not just in seeing this as an interesting technical innovation, but also in considering its negative social impact. What can we do about it?

We started working on this problem early in February last year, when there was only a little news coverage of this phenomenon. The first thing we did was implement our own version of the deepfake software, and we keep improving it. It is our belief that only after we understand how these things are generated can we develop effective defensive measures against them. So before I talk about our detection work, let me go very quickly, in a non-technical way, through how this deepfake works and some of the details under the hood.

The deepfake software starts with a video of one person, in this case, for instance, the video of Bob. Everything from this point on is done automatically by the software. What it does first is run an automatic face detection algorithm to locate the face.
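To make this first step concrete, here is a minimal sketch using OpenCV's bundled Haar-cascade detector as a stand-in; real deepfake tools typically use stronger, CNN-based detectors, and the detector parameters here are illustrative assumptions, not the speaker's actual pipeline.

```python
import cv2

# Load OpenCV's bundled frontal-face Haar cascade (a stand-in detector;
# real deepfake tools typically use stronger, CNN-based detectors).
detector = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

def face_crops(video_path):
    """Yield cropped face regions from every frame of a video."""
    cap = cv2.VideoCapture(video_path)
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        for (x, y, w, h) in detector.detectMultiScale(gray, 1.1, 5):
            yield frame[y:y + h, x:x + w]
    cap.release()
```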
Once the face is detected, it is rectified. That's a technical term that just means we put the face into a sort of normalized orientation. These rectified faces are the faces a deep neural network can use to generate a new face. Then all the magic happens within this black box of a deep neural network; we just call it the deepfake model. Within this model there are actually two corresponding neural networks: one is called the encoder, the other the decoder.

The way to understand this is by analogy with language translation. Think about having a sentence in English that you want to translate into another language, say Chinese. The way we do the translation is to understand the sentence first, separating the language itself from the message. Then we discard the language and re-express the message in the other language, in this case Chinese. The message is kept, but the language is discarded, is changed. The same thing happens here for deepfakes. We treat the faces, the identity of the person, as the language, and the facial expressions as the message we want to keep. This encoder and decoder do just that: the encoder strips the identity off the faces and leaves only the essential facial expression in something called the code, the thing in between. The decoder takes the code and interprets it using a different person's identity, a different language. Then you get the face of another person, but with the same expression. Once that's finished, we reverse the transform that was used to extract the original face and paste the new face back into the video. Then we get a deepfake video. This is the synthesis process.

But to get the neural network, we need to train it. For training we need a lot of images of both subjects, and then we train in a self-supervised way. The idea is that the two subjects share the same encoder, because the encoder's job is to extract the essential facial expression regardless of identity. We then use the encoder and decoder to try to recreate the original face and measure the differences. After rounds and rounds of this training, we get this pair of encoder and decoders: the two people share the same encoder, but each of them has an individual decoder.

Now, the most time-consuming part of this whole process is training. What I'm showing you here is a fast-forwarded view of training. As you'll see, initially the neural network does not produce a very good face. I'm actually speeding this up about 5,000 times, because even on a very powerful GPU server these days, it still takes about 24 to 36 hours to generate a decent model. But once the model is trained, you're basically done. You run the synthesis process: provide the video, run it through the neural network model to generate new faces, and make a new video.

Now, this kind of face manipulation technology for images and videos has existed for many years; we have seen it in a lot of Hollywood movies, like Forrest Gump. But what has changed with this new generation of AI is that everything I'm talking about on this slide can now be done completely automatically.
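Here is a minimal PyTorch sketch of this shared-encoder, dual-decoder idea. The layer sizes, 64x64 crop size, and training step are illustrative assumptions, not the architecture of any particular deepfake tool; real implementations use much deeper convolutional networks and far longer training.

```python
import torch
import torch.nn as nn

class DeepfakeAutoencoder(nn.Module):
    def __init__(self):
        super().__init__()
        # One shared encoder: strips identity, keeps expression ("the code").
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 4, stride=2, padding=1), nn.ReLU(),
        )
        # One decoder per person: re-renders the code as that identity.
        self.decoder_a = self._make_decoder()
        self.decoder_b = self._make_decoder()

    @staticmethod
    def _make_decoder():
        return nn.Sequential(
            nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(32, 3, 4, stride=2, padding=1), nn.Sigmoid(),
        )

    def forward(self, x, identity):
        code = self.encoder(x)
        dec = self.decoder_a if identity == "a" else self.decoder_b
        return dec(code)

# Self-supervised training: each decoder learns to reconstruct its own
# person's faces; no labels are needed beyond which person a face shows.
model = DeepfakeAutoencoder()
opt = torch.optim.Adam(model.parameters(), lr=2e-4)
faces_a = torch.rand(8, 3, 64, 64)   # stand-in batch of person A's faces
loss = nn.functional.l1_loss(model(faces_a, "a"), faces_a)
opt.zero_grad(); loss.backward(); opt.step()
# To swap at synthesis time: model(face_of_a, "b") renders A's expression
# with B's identity.
```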
So all you need to do as a user is buy a powerful computer with GPUs, have access to the internet, and get many images of the person whose face you want to swap. Then you can just let this algorithm run and go do something else: drink a cup of coffee, watch a movie. Once the training converges, you come back, harvest the model, and you can start making fake videos. We have been making some of the videos I showed before; most of them we made ourselves. We probably have one of the most carefully crafted deepfake generation codebases in the world, but we don't publicize it, because we work mostly on the detection side and we understand the potential negative impact. Nevertheless, I think this is quite accessible, and everybody can potentially use it, especially the crowd in this room.

This deepfake phenomenon has attracted a lot of media coverage ever since it was first spotted in December 2017, when the first round of fake videos, mostly pornographic videos, was seen on Reddit. Actually, the term deepfake comes from an anonymous user on Reddit. Even to this day we don't know his or her real identity, but the account was called "deepfakes," and ever since, that name has stuck to this line of technology.

I've been working in this area, called digital media forensics, for almost 20 years. So the first time I saw this phenomenon, the question I asked myself was: can we actually do something to stop this kind of spread and re-establish some trust in the online videos and media we have, using detection technologies? That is what we have been doing. We got our own copy of the deepfake code, we generated some deepfake videos, and then we tried out a bunch of different methods.

The first successful detection method, which also attracted a lot of media attention, is based on a very simple observation: the subjects in the first round of deepfake videos do not blink a lot. This is what Tom talked about in his video. The reason is that the first generation of deepfake neural network models was trained on images collected from the internet. If you want to swap in the face of Tom Perez, for instance, you go on Google, do an image search, and get thousands of images of Tom. But the problem is that those images are mostly official portraits, and in portrait photos you rarely see people with their eyes closed, because photographers discard those as bad photos. This kind of bias slips through into the training of the neural network. It's almost like teaching a baby to recognize apples all the time, and then suddenly you show him an orange, and he will have no idea that a fruit called an orange exists.

Based on this simple observation, we developed a detection algorithm: another deep neural network that just looks at the eyes of the subject, and each time the eye blinks, the network gives us a signal. It turns out that the deepfake videos we have seen since June 2018 no longer show this lack of blinking, and there are two messages we learned from this experience. Number one, the other side, the people doing fake video synthesis, learn very quickly. Ever since we put our paper on arXiv, we started getting videos and emails from hackers telling us they can synthesize blinking eyes. And it's actually not that hard, because the problem stems from the bias of the training data.
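As a toy illustration of the idea, here is a sketch that flags videos whose subject blinks implausibly rarely, assuming some per-frame eye-state classifier is available; the tiny CNN, the 0.5 threshold, and the blink-rate cutoff below are all illustrative assumptions, and the speaker's actual detector is more sophisticated than this.

```python
import torch
import torch.nn as nn

class EyeStateCNN(nn.Module):
    """Tiny stand-in CNN: probability that a 32x32 eye crop is open."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Flatten(), nn.Linear(32 * 8 * 8, 1), nn.Sigmoid(),
        )
    def forward(self, x):
        return self.net(x)

def looks_fake(eye_crops, model, fps=30.0, min_blinks_per_min=5.0):
    """Flag a video whose subject blinks implausibly rarely.

    eye_crops: (T, 3, 32, 32) tensor of per-frame eye regions.
    Humans blink every few seconds; first-generation deepfakes trained
    on portrait photos blinked far less often.
    """
    with torch.no_grad():
        open_prob = model(eye_crops).squeeze(1)       # (T,)
    closed = (open_prob < 0.5).float()
    # Count open-to-closed transitions as blinks.
    blinks = ((closed[1:] - closed[:-1]) > 0).sum().item()
    minutes = len(eye_crops) / fps / 60.0
    return blinks / max(minutes, 1e-6) < min_blinks_per_min
```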
You can simply fix this by including images with closed eyes, or by training the neural network on videos. But we also gained another important insight: those fake videos are detectable. It is possible to look for the Achilles' heels of the deepfake generation process and understand how to detect them. And that's what we did subsequently. We wanted to look for something more fundamental, harder to revert, more intrinsic to the generation pipeline.

So the first thing we did, again because we have this code, was revisit the whole synthesis pipeline. One important step there is that whenever a new face is generated, it needs to be transformed and put back into the original video to generate the fake one. This is because the neural network can only handle faces in a standard orientation, while the actual faces in videos and images come in different orientations and at different distances from the camera, so they appear at different sizes. This has to be adjusted, and that adjustment leaves traces that can be detected using another deep neural network. That is exactly what we did: we developed another deep neural network that looks only for that kind of artifact in the images. These artifacts are visible if the transform is severe; you may see deepfake videos with overly smooth skin or jagged areas near the boundary of the face. Those are artifacts of this transform. On the other hand, if the transform is done more subtly and gently, with some post-processing to remove the artifacts, they will not be easily noticeable to the naked eye, but this algorithm can still pick them up.

Another important type of flaw in deepfakes comes from geometry. We know that a real person's face moves with the head, of course, because the face is part of the head. But when we make a fake video, the face is grafted on top of the head, and it doesn't move in the same way. To see this better, we use an estimation of the head orientation. There is a technology in computer vision that lets us estimate, from a two-dimensional face image, where the head is pointing in three-dimensional space. To cut the story short, we use that technology to estimate the orientation of a face. For a real image, because, as I mentioned, the face moves with the head, we should expect that when we estimate the orientation using only the central face region or using the whole head, the two estimated directions will be fairly consistent. If it is a fake face, on the other hand, whenever the head turns away from the camera, the face was not generated by first doing a 3D transform and then projecting it onto the 2D camera plane. It was done only with a 2D transform on the image plane, pasted onto the head to mimic the 3D generation process. So when we estimate the head orientation, the two directions show much bigger differences. It turns out this difference is significant enough that whenever we have a head moving around, away from the camera, we can use it to tell whether the video is real or fake. Based on these kinds of artifacts, these problems in the deepfake generation pipeline, we developed a new detection algorithm.
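Here is a rough sketch of that consistency check, assuming 2D facial landmarks and matching 3D model points are already available, for instance from a landmark detector such as dlib; the camera intrinsics, the landmark sets, and the simple rotation-vector distance below are illustrative assumptions, not the exact formulation in the published work.

```python
import numpy as np
import cv2

def head_direction(points_2d, points_3d, cam):
    """Estimate a rotation vector from 2D-3D landmark correspondences (PnP).

    points_2d: (N, 2) float64 image coordinates; points_3d: (N, 3) float64.
    """
    ok, rvec, _tvec = cv2.solvePnP(points_3d, points_2d, cam, None)
    return rvec

def pose_inconsistency(face_2d, face_3d, head_2d, head_3d, frame_size):
    """Compare pose estimated from central-face landmarks vs. the whole
    head outline; real faces agree, spliced faces tend to diverge."""
    w, h = frame_size
    # Rough pinhole intrinsics: focal length ~ image width, centered
    # principal point.
    cam = np.array([[w, 0, w / 2],
                    [0, w, h / 2],
                    [0, 0, 1]], dtype=np.float64)
    r_face = head_direction(face_2d, face_3d, cam)
    r_head = head_direction(head_2d, head_3d, cam)
    # A large gap between the two rotations suggests a pasted-on face.
    return float(np.linalg.norm(r_face - r_head))
```

A simple threshold on this score, tuned on labeled videos, would then separate real from fake in the way described above.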
So we developed this new type of detection algorithm, and we tried it not only on deepfake videos we generated ourselves, but also on deepfake videos we collected online. We have our own collection of known deepfake videos downloaded from YouTube and elsewhere, and on those it seems to work more reliably. Even better, the kinds of artifacts I just talked about are more fundamental, more intrinsic to the production pipeline of deepfakes, and they are much harder to fix: they cannot be fixed easily by just augmenting the training data set.

Now, everything I have talked about so far concerns video deepfakes, and one remaining gap is the audio part. If we replaced Bob's voice in the original Tom Perez video with Tom's real voice, that would make it even more effective as a tool of disinformation. So we looked into this problem, and it turns out... "You need to improve the quality of your existence on Earth. You got to do the right things." ...it turns out that audio deepfakes have also made very fast strides in recent years. What I just played is a synthesized version of podcast host Joe Rogan's voice. Here... "Healthy local ingredients, farm fresh ingredients." ...is his real voice, right? If you do not pay a lot of attention, you probably will not be able to tell which one is real and which one is not. So this is one of the important directions we are looking at, going forward, for detection technology. I don't have time to go into the details, but one of our very recent works, presented last month at a conference, targets these audio deepfakes. We use a special type of statistical feature which turns out to be quite effective for detection, and using just a two-dimensional feature we can do an effective classification between real human voices and AI-synthesized voices.

All right. So far I have been talking about detecting deepfakes, but as Bob mentioned, even if we can do this detection in a timely manner, even if we can do it in a very reliable way, the damage has still been caused. Once a deepfake video is put on the internet, the effect is very hard to remove, and fact checking always lags behind. So we have to do something more active, I would say more proactive, to stop this whole problem.

This is one of our most recent works, on obstructing deepfakes, which means we want to make the generation of deepfake videos harder and more time-consuming. It shouldn't just be a matter of getting a big computer, getting a lot of images, and letting the computer run for a couple of hours to generate those videos. We want this to be slower; we want to sabotage the whole process. So we went back to check the whole production pipeline, trying to figure out if there is a place where we can do that. It turns out that one of the key steps of all deepfake generation, including some other types of deepfakes like GAN-generated images, relies on the fact that automatic face detection software can be used to harvest faces. When we realized this, it felt scary, because it's almost like we are voluntarily handing all those hackers our data, for free, for them to abuse. We upload our images and videos to YouTube, Instagram, Twitter, Facebook, without knowing that they could potentially be used to make fake videos of us.
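As a hedged illustration of that kind of approach, here is a sketch that reduces each clip to a two-dimensional statistical feature and fits a simple classifier. The two features below, spectral centroid and spectral flatness, are stand-ins I chose for illustration; the talk does not detail the actual statistics used in the published work.

```python
import numpy as np
from scipy.signal import stft
from sklearn.linear_model import LogisticRegression

def two_d_features(audio, sr=16000):
    """Reduce a clip to two summary statistics of its magnitude spectrogram."""
    _, _, Z = stft(audio, fs=sr, nperseg=512)
    mag = np.abs(Z)                              # (freq_bins, frames)
    spectrum = mag.mean(axis=1)                  # time-averaged spectrum
    bins = np.arange(len(spectrum))
    # Spectral centroid: where the energy is concentrated in frequency.
    centroid = (spectrum * bins).sum() / (spectrum.sum() + 1e-12)
    # Spectral flatness: geometric / arithmetic mean of the magnitudes.
    flatness = np.exp(np.log(mag + 1e-12).mean()) / (mag.mean() + 1e-12)
    return np.array([centroid, flatness])

# Usage sketch: featurize labeled clips, fit, then score new audio.
# X = np.stack([two_d_features(clip) for clip in clips])   # (N, 2)
# clf = LogisticRegression().fit(X, labels)                # 1 = synthesized
# is_fake = clf.predict(two_d_features(new_clip)[None, :])
```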
So we want to provide a level of protection for general users against this kind of attack, and the idea is to sabotage the face detection step. Face detection technology has become very sophisticated: take out your cell phone and point it at a scene full of people in a room like this, and you'll see all those little white boxes showing up. That's face detection actively working, detecting faces. But those algorithms are algorithms; they do not recognize faces the way humans do. I recognize this as a person: here are the eyes, there are the nose and mouth. An algorithm recognizes faces based on certain patterns in the image signal, and those patterns can be disturbed and distorted if we introduce a kind of invisible pattern noise into the images. We call these adversarial perturbations.

The idea is the following. We introduce this noise, generated by an algorithm we recently developed, into an image. Say you want to upload your image or video to YouTube or some other social platform. If this algorithm were implemented there, it could become another little checkbox under the user agreement, where you check it and say: protect my privacy from automatic face detection. Once you do that, this adversarial perturbation noise is introduced into the image. Now, the interesting thing is that it affects algorithms but not human viewers. The noise added is very subtle, so when you share your photo with your friends, they still recognize your face, and they won't see a lot of visual artifacts. But when this image is presented to a face detector, the detector has a lot of trouble finding the face: either it will not find the face at all, or it finds some places that look like a face but actually are not.

If we do this, imagine a hacker who wants to generate a deepfake: he collects 5,000 images of somebody from Facebook and runs an automatic face detection algorithm on them, and most of the faces come out wrong or missing from the detection. Then the only option is to go back and do this by hand. And I can tell you, cropping out faces by hand, which I have done a lot for annotation purposes, is really not a fun job. It's tedious, it's boring, and you get pretty much exhausted after 10 minutes of doing it. So by doing this, we want to provide one layer of protection for consumers, for users, against this kind of attack.

Here we have a prototype of this algorithm; let me see if I can play it. Right, so this is a demo showing the algorithm working. The images on top are the original face detection results. The images in the middle are the images with the adversarial perturbation. And the very bottom row shows the adversarial perturbations themselves; to make them visible at all, we enhanced them 30 times, otherwise you wouldn't even see them. This also works for videos: again, face detection fails there.

Now, I hope in this very short talk I have convinced you of something I think everybody here believes already: AI algorithms will cause a lot of problems down the road, particularly around disinformation, and deepfakes can cause real damage. We have talked a lot about the political consequences of maliciously made deepfake videos, but I also have personal experience working with victims of something called revenge pornography.
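To give a flavor of how such perturbations are typically crafted, here is a minimal sketch of a standard gradient-based attack, assuming a differentiable stand-in `det_score` that returns a face detector's confidence as a scalar; the step sizes, noise budget, and detector interface are all illustrative assumptions, not the algorithm the speaker's group actually developed.

```python
import torch

def perturb(image, det_score, eps=4 / 255, steps=40, lr=1 / 255):
    """Return image plus small noise that suppresses detector confidence.

    image: (3, H, W) tensor in [0, 1].
    det_score: differentiable function returning a scalar face confidence.
    eps: max per-pixel change (L-infinity bound), small enough to stay
         invisible to human viewers.
    """
    delta = torch.zeros_like(image, requires_grad=True)
    for _ in range(steps):
        score = det_score((image + delta).clamp(0, 1))
        score.backward()
        with torch.no_grad():
            # Step against the gradient to lower the face score, then
            # project the noise back into the invisible budget.
            delta -= lr * delta.grad.sign()
            delta.clamp_(-eps, eps)
        delta.grad.zero_()
    return (image + delta).clamp(0, 1).detach()
```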
Those are deepfake videos generated by pasting the faces of ex-spouses, or mostly ex-girlfriends, onto pornographic videos and distributing them. The psychological damage to the victims is just extreme. So I think there is a real danger here, and a lot of damage this kind of technology can cause, and we should take this problem seriously.

From the technical side, we have been working very hard to develop detection techniques, to give ourselves tools to detect these kinds of fake videos and also, hopefully, to slow down the process of making them. But I don't believe this problem can be solved with technology alone. As the audience in this room knows, this is a complicated problem, and like any complicated disease, it needs a combined solution. We need a solution from technology. We need a solution from government. We need responsibility from the social platforms. We need more user education and coverage from the media. And we also need, from the user's point of view, to be more aware of this problem and more vigilant whenever we want to retweet an interesting image or video to our friends. With that, that ends my talk. Thank you very much for your attention.

Do we have time for questions? OK. So yeah, if there's any question from the audience, go ahead.

The question is: can this perturbation technology be used in the physical world, in hardware? Potentially it can. It's possible to print out these patterns; we actually thought about this, as more of a sci-fi idea. You could print out some kind of pattern and paste it on your face. It might work, but we never tried it. I think it would be difficult, because with images you don't have all the distracting factors of lighting conditions and face orientation. In concept it's certainly doable, because all that matters is what signals are seen by the face detection algorithm.

There are two parts to this next question. The first is whether we use a GAN in this model. The answer is no, we don't use a GAN. GAN is the catchier word, but GANs are much harder to use and train. It turns out that this deepfake detection can all be done with a plain CNN, which is easier to work with. The second question is whether there are countermeasures to the detection. I think that will always be the case; it's a cat-and-mouse kind of game. Whenever we get better at detection, the forgers get better once they know about it. And it's not hard, because we publish our results: when we put our papers on arXiv or publish at conferences, they read the papers, and if they are technically capable enough, they can implement countermeasures. This is the very nature of this kind of work; there is always a sort of competition between the two sides.

On the other hand, I do have to say that, unfortunately, the forgers have the upper hand in many cases, because they are the side with resources, support, and incentives. For us, the only recognition we get is publishing a paper. Even a couple of years ago, this line of research had a lot of trouble getting federal funding, because it was not deemed a very important area. But everything has changed now; DARPA has a media forensics program, which we are part of.
So I think people, especially people like the audience in this room, are paying more attention to this problem, and the situation will change for the better.

That's a good question. The question is whether this will cause false positives when some non-deepfake operation is performed on the video, like digitally removing a small artifact or a skin imperfection. We haven't really tried that, but my belief is that it won't cause a huge problem, because we are looking for a particular kind of artifact, and we look for artifacts at a larger scale, not just in a local region. To put it another way: if somebody fakes only a small part of the face, this algorithm may not be very effective, because we look at the whole face. But that is always the case. That's why I think the solution should not come only from technology: there are cases we mis-detect, there are cases with false positives, so it's not perfect. It's an ongoing battle. Go ahead, please.

Yes, I think definitely. Let me repeat the question. The question is: there is a competition between forgers and forensic detectors; is there a similar competition between this perturbation, this obstruction technique, and the forgers? The answer is definitely yes. This only works against unaware forgers. If they do their work without knowing this technology exists, it stops them, but once they know about it, they can certainly do something to remove these adversarial perturbations, which are actually fragile; something like low-pass filtering can remove them. So we are just raising the bar here. We are not trying to eliminate the making of fake videos, because I think that's not possible: like Pandora's box, once it's opened, there's no way to go back. But we do want to raise the bar so that it is not as easy as somebody just running software on a computer.

One more question. I only have time for one more. Go ahead, please. Sorry.

Right. So the question is about authentication. What I have discussed here are mostly what I call passive detection techniques, and our protection scheme is also, in some sense, passive, which means we can only tell what is fake; we cannot certify what is authentic. To actually guarantee the authenticity of a piece of media, we can do what you suggested. This is a very active line of research called digital watermarking, and most recently people have taken advantage of blockchain technology, proposing that you can embed something into an image or video that authenticates the maker of the video, the make and model of the camera, the date and time, and all the important meta-information that can be used to establish authenticity. Now, this idea has been around for almost 30 years, ever since DVDs became a thing and people wanted to protect copyright. The problem is scale: we could do this for all new devices, and that probably is not a problem, but adding these kinds of watermarks to all existing devices is a huge amount of work. The other thing is that digital watermarks are also relatively easy to remove or tamper with, so it's not a bulletproof solution either.
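As a toy illustration of the embedding idea only, and not of any robust scheme mentioned here, this sketch hides authentication bits in the least-significant bits of pixel values; real watermarks are far more sophisticated, and, as just noted, even those remain removable.

```python
import numpy as np

def embed(image, bits):
    """Hide a 0/1 bit array in the least-significant bits of a uint8 image."""
    flat = image.flatten()                       # flatten() returns a copy
    flat[:len(bits)] = (flat[:len(bits)] & 0xFE) | bits
    return flat.reshape(image.shape)

def extract(image, n_bits):
    """Read the first n_bits back out, e.g. to verify a signature."""
    return image.flatten()[:n_bits] & 1

# Usage: embed a tag, then check it round-trips. Note the fragility: any
# re-encoding or filtering of the image destroys these bits.
img = np.random.randint(0, 256, (64, 64), dtype=np.uint8)
tag = np.random.randint(0, 2, 128, dtype=np.uint8)
assert np.array_equal(extract(embed(img, tag), 128), tag)
```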
But as I mentioned earlier, I think the solution must be a combination. We have digital watermarking. We have the detection technology. We have the obstruction technology. We have the deterrent measures. We have legislation. We have media coverage. All of these together work toward the goal of reducing, of keeping under control, this problem. Thank you very much. Yeah. Thank you.