Thank you guys. Good morning and welcome to my talk, or actually a quick guide on how to become your company's CEO with little effort.

First, I must start with a disclaimer. I believe the tech I'm presenting today is a game-changer when it comes to social engineering attacks, and it can cause harm very fast. So please remember that you can always be on the other side of the barrel. Please don't be evil, and don't use deepfakes without the approval of your target. Okay. No, seriously.

But this is my target. I'd like you to meet my company's former CEO and current chairman, Udi Mokady. And this is him on a podcast interview he did last year: "Absolutely. And great to talk to you here. I think the beauty of the CyberArk story, to fast-forward, is we pioneered the space but continue to innovate and market-lead it. A lot of companies don't." Yeah. So Udi was my target when I played around with deepfakes.

And now I'd like to go over the obligatory "who am I?" My name is Gal, and I'm a vulnerability research manager at CyberArk Labs. My favorite research fields are embedded devices, such as ISP equipment and mesh Wi-Fi. More importantly, who am I not? I'm not a machine learning expert whatsoever. This project started as a challenge to see how far I could go with creating a real-time avatar of my company's chairman. And I'm not going to use the A-word, because I bet you are all sick and tired of hearing about that buzzword and the robot apocalypse. And as you can all see, I am definitely not my company's chairman. But this is me with a buttoned shirt. And this is me with Udi's office as a background. And this is me with Udi's face: "Hi, I'm Udi's clone and I'm very excited to be here. And it's just great to talk to you all." Yeah, so you probably figured out that this video was made with a real-time video and audio deepfake.

Okay, so for today's agenda, I'd like to go over the bare minimum of machine learning terms, walk through the process of creating real-time video and audio deepfakes, and lastly combine it all together and see how this tech can be weaponized for social engineering attacks.

Okay, so unfortunately I don't have enough time today to get too technical about how this tech works, but I want to clear up some terms before I begin. So, machine learning 101. When I use the term machine learning, or ML, I'm actually referring to convolutional neural networks, or CNNs, which are a class of ML, but for the sake of simplicity let's just call it ML. An ML model is a program that has been trained to recognize certain patterns; the most classic example is image or audio recognition. You train a model over a set of data, providing it an algorithm it can use to reason over and learn from that data: for example, video frames and audio samples as the dataset, and image or audio mapping as the algorithm. Once you have trained your model, you can use it to reason over data it hasn't seen before and make predictions about that data, for example detecting a face in a frame and swapping it. Training is basically doing a lot of mathematical calculation, which takes time and is often measured in iterations, which give you an indication of how far along your model's training is. And again, please keep in mind this is a very high-level explanation; I left out some terms just to keep things simple.
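To make the "train a model over a dataset" idea a bit more concrete, here is a minimal, hypothetical sketch in PyTorch. The toy data, labels, and tiny network are stand-ins invented for illustration; they are not the actual deepfake models discussed in this talk.

```python
import torch
import torch.nn as nn

# Toy "dataset": 64 random 3x64x64 frames with binary labels
# (think "face" / "no face"); real training uses real media.
frames = torch.randn(64, 3, 64, 64)
labels = torch.randint(0, 2, (64,))

# A tiny convolutional neural network (CNN), the class of model
# the talk lumps under "ML".
model = nn.Sequential(
    nn.Conv2d(3, 8, kernel_size=3, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(),
    nn.Linear(8, 2),
)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

# Each loop pass is one "iteration": a batch of math that nudges
# the model's weights a little closer to the desired behavior.
for iteration in range(100):
    optimizer.zero_grad()
    loss = loss_fn(model(frames), labels)
    loss.backward()   # compute gradients (the heavy calculation)
    optimizer.step()  # update the model
```

Real deepfake models run this kind of loop for hundreds of thousands of iterations over far larger datasets, which is why the hardware question comes up next.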
Okay, so since training a model involves a lot of math, you need the right hardware to do it in a reasonable time. GPUs were actually designed to do exactly this kind of calculation, so if you've got a powerful gaming PC, you can use it to train a model. If you don't, you can pay around 50 bucks per week and get it done in the cloud. You can also use Google Colab, a Jupyter notebook service that can actually be used to train a model for free.

Okay, great. Now we can talk about video and audio deepfakes. The first thing you want to do when creating a deepfake avatar is get high-quality data of your target. This will later be used as the dataset to train your model. For video, I find YouTube and Google Videos to be a great place to start, but you can also find high-quality videos on the company's website or in virtual meetings. For audio, you should aim for high-quality recordings. Podcasts and media interviews usually have the best sound production, so they are the best source. But if your target is the CEO of a publicly traded company, you're also likely to find audio in quarterly earnings calls. And lastly, you might want a background. This is not part of the dataset, it's just a prop for staging. If you want to tell a certain story, for example that you got stuck outside of your office, you can use Google Maps or YouTube to get the right background. I got Udi's office background from a media interview.

Okay, so now we're ready to talk about how video deepfakes are done. In 2018, a talented guy called iperov created a project called DeepFaceLab, or DFL, for offline video deepfakes, and I bet you've probably seen a lot of memes made with this project. Starting out with DFL can be a little intimidating: you need to run different scripts, and there are around 32 different parameters to configure while you train your model, and they affect how your model trains. But the project has an active community called MrDeepFakes, and you can get a lot of help there.

Okay, so to understand real-time deepfakes, we must understand offline deepfakes first. Let's say I want to swap the face of the upper character with the bottom character. For that, I need to give DFL a source video with the face I wish to replace, and a destination video of the target I wish to replace it with. For good results, your dataset should include a variety of videos of your target with different angles and different lighting. DFL then creates and trains a model, and depending on your hardware, it should take a few days. At some point, your model reaches enough iterations to perform the face swap correctly. This model is then applied to the source video to change the source face into the destination face. At a high level, DFL splits the video into frames, and then for every frame it applies a two-stage model: the first stage detects the source face, and the second swaps the source face with the destination face. Once that's done for every frame, you've got yourself a deepfake video.

All right, and in 2021, iperov created a new project called DeepFaceLive. This software can run DFL models in real time, and it's very straightforward to use. Okay, so now let's talk about how to train a real-time DFL model. Same as before, you need a target dataset, again with a variety of angles and lighting, and you feed it to the same DFL project. But this time, since your model aims to work on any given face, your source dataset needs to be pictures of many different people.
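Going back to the data-collection step for a moment: since YouTube is the suggested starting point, here is a hedged sketch of grabbing target footage with the open-source yt-dlp library. The URL and output paths below are placeholders, not real sources.

```python
import yt_dlp

# Prefer 720p+ video so the face extractor has high-quality frames to work with.
options = {
    "format": "bestvideo[height>=720]+bestaudio/best",
    "outtmpl": "dataset/%(id)s.%(ext)s",  # save into a local dataset folder
}

with yt_dlp.YoutubeDL(options) as downloader:
    # Placeholder URL: substitute real interviews, keynotes, earnings calls.
    downloader.download(["https://www.youtube.com/watch?v=PLACEHOLDER"])
```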
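And to make the per-frame, two-stage idea concrete, here is a minimal sketch of that loop using OpenCV. The bundled Haar cascade stands in for DFL's real face detector, and swap_face() is a hypothetical placeholder for the trained swap model; DFL's actual pipeline is far more involved.

```python
import cv2

# Stage-1 stand-in: OpenCV's stock frontal-face detector.
detector = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

def swap_face(face_crop):
    # Stage-2 placeholder: a trained DFL model would map this crop
    # to the destination face; here it just passes through unchanged.
    return face_crop

video = cv2.VideoCapture("source.mp4")  # hypothetical source video path
while True:
    ok, frame = video.read()
    if not ok:
        break
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    for (x, y, w, h) in detector.detectMultiScale(gray, 1.3, 5):      # stage 1
        frame[y:y + h, x:x + w] = swap_face(frame[y:y + h, x:x + w])  # stage 2
    # ...write each processed frame back out; together they form the deepfake.
video.release()
```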
For that source dataset, I used a recommended celebrity dataset with around 10K pre-mapped faces. Then you let DFL work its magic again, and after a few days, depending on your hardware, you've got yourself a real-time model.

Okay, now to put it to use, you connect your web camera to the DeepFaceLive program. DeepFaceLive loads the model you created and runs it on the fly on every frame it gets, and then with OBS you can route the output anywhere you like, for example into a video conference. But wait, there's more. I actually realized that by using Snap Camera, I can apply additional effects to improve cosmetics, for example a beard or a hairstyle. And actually, this is how I managed not to shave my head when I did Udi's deepfake.

Okay, lastly, real-time deepfakes currently have some limitations. Facial expressions, like sticking your tongue out, may not look that good. Objects that cover your face, like a mask or a hand, can also make you look unreal. And lastly, compared to offline deepfakes, here you need a GPU at your endpoint, and the more powerful your GPU, the more frames per second you're going to get.

Okay, so now let's talk about audio deepfakes, and I'll start with text-to-speech first. This tech is very popular, and today you can see many companies providing this kind of service. Quite impressively, in some cases you only need one minute of clear audio samples. And full disclosure: I paid a few dollars to ElevenLabs and worked with them, but I believe you can also get pretty good results with an open-source project called Coqui. After you train your model, it takes text and turns it into speech that sounds like your target.

As you might expect, text-to-speech is not enough for a real-time deepfake, since it's text-based and not voice-based. So now for a voice-based solution, which is called voice conversion, or VC. In March 2023, the RVC project was created by Fumiyama, FTPS, and Deepomb, three anime-enthusiast developers. This project converts voice samples and supports just about any language. It also supports singing, which is cool. Configuring this project takes some time, but it's way less intimidating than DFL. All right, so for RVC, you need at least ten minutes of clear audio as a dataset, and the more the better. After a few hours, RVC creates a VC model. This model can take any voice as input and convert it to the target's voice. But wait, there's more. Remember that text-to-speech only needs one minute of audio? You can actually use a text-to-speech model to generate a bigger dataset for RVC, so basically you only need one minute of audio for RVC as well. (I'll sketch this trick below.)

Okay, and now for running the model on the fly. In August 2022, a developer called w-okada created a project called VoiceChanger. It supports RVC models as well as other VC models, and it's very, very straightforward to work with. So you can take your microphone, connect it to VoiceChanger with your RVC model, and then use a program called VB-Audio that redirects the audio to your favorite program. And then you can call someone using a voice deepfake. RVC models also have some limitations, mainly around sustained syllables: if you hold a sound for a period of time, it may sound unnatural. Also, VoiceChanger has to keep a buffer of your voice to convert it on the fly, and that creates latency. And same as DeepFaceLive, you also need a relatively powerful GPU at your endpoint.
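That buffering requirement is exactly where the latency comes from, and a minimal sketch makes it visible. This is not w-okada's actual VoiceChanger; it's a toy duplex audio loop using the sounddevice library, with a hypothetical convert() standing in for the RVC model.

```python
import sounddevice as sd

RATE = 16000   # sample rate in Hz
CHUNK = 8192   # bigger chunks give the model more context, but add delay

def convert(chunk):
    # Placeholder for an RVC model; here the audio just passes through.
    return chunk

def callback(indata, outdata, frames, time, status):
    # Each callback handles one buffered chunk. Even with an instant model,
    # roughly CHUNK / RATE seconds (about 0.5 s here) must be captured before
    # the converted audio can start playing: that's the inherent latency.
    outdata[:] = convert(indata)

with sd.Stream(samplerate=RATE, blocksize=CHUNK, channels=1, callback=callback):
    sd.sleep(10_000)  # run the live conversion loop for ten seconds
```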
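And here is the TTS bootstrap trick I mentioned, sketched with Coqui TTS since that's the open-source project named above. I'm assuming Coqui's XTTS voice-cloning model and its documented API; the sentences, file names, and paths are all made up for illustration.

```python
from TTS.api import TTS

# Ordinary sentences to synthesize; in practice you'd use hundreds of lines.
sentences = [
    "The quarterly results exceeded our expectations.",
    "Please review the attached document before our call.",
]

# XTTS can clone a voice from a short reference clip (model name assumed).
tts = TTS("tts_models/multilingual/multi-dataset/xtts_v2")

for i, text in enumerate(sentences):
    tts.tts_to_file(
        text=text,
        speaker_wav="target_one_minute.wav",   # the ~1 minute of real audio
        language="en",
        file_path=f"rvc_dataset/{i:04d}.wav",  # synthetic clips for RVC training
    )
```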
Okay, lastly, I want to talk about the obvious part: profit. I believe real-time deepfakes will dramatically change how social engineering attacks are executed. No more texts, no more emails; you can expect your supervisor calling you and asking you to do stuff. That, of course, can cause business damage: money transfers, job terminations, granting access, you name it. It can also lead to data leakage, such as confidential documents or credential leakage, and so on. But it can also create business fatigue. Think about it: if you need to perform some kind of multi-factor authentication for every phone call you make, that can affect the entire business operation.

Okay, and to actually use this tech for a social engineering attack, you need to collect data and stage a specific story. Then you'll probably want to automate everything, so that you can train the models relatively fast. You can also use bandwidth restriction if your model is not mature enough and you just want to get away with low quality or latency issues.

Okay, so for a conclusion: this is, I think, a social engineering game changer. Training a model is a little difficult, takes some time, and requires some reading beforehand. However, running the model on the endpoint is super easy and straightforward. So since training a model might be high effort, I came up with something called Defcon Video Art, which stands for deepfake conversion for video and audio in real time, or: a fancy bash script that automates everything I showed today. It uses pre-trained models to save time, it uses the best-practice configuration I managed to find that worked for me, and I packed in all the environment dependencies so I can use it on my machine or in the cloud, and hopefully in Colab one day. To use it, I just need three minutes of video and one minute of audio. It then uses DFL and RVC to train the models. After around one day of training I end up with reasonable results for video, and after around 30 minutes of training, reasonable results for audio. Again, it depends on your hardware. You end up with two models, one for audio and one for video, and you can then use them in the streaming setup I just demonstrated, however you wish. (There's a hypothetical sketch of what such an automation pipeline might look like at the end of this section.) And now for a live demo. Okay: "I'm Jeff Moss, founder of DEF CON. I hope you enjoyed this talk, because right after it, DEF CON is officially canceled." Thanks.

Okay, so what to do? Like with every social engineering attack, awareness is a key factor. Today everybody knows not to trust suspicious emails and texts, and it seems we've already reached the point where you must be suspicious of phone and video calls as well. If something feels fishy, it probably is, and you should always ask for additional information, preferably the kind of information that only you and the person on the other side of the line know. You can also use the current real-time models' limitations that I mentioned before, but please keep in mind that these limitations might be solved in the future. Okay, so for technological mitigations, well, you can ask for multi-factor authentication. There are some research papers and GitHub projects that aim to detect deepfakes using ML, and there are also commercial products, such as Intel's FakeCatcher, which looks at blood flow. And lastly, you can use other phishing mitigations, such as notifications and alerts that get triggered when you're under a deepfake attack, something like that.
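Here is the automation sketch promised above. To be clear, this is not the actual Defcon Video Art script; it's a hypothetical Python outline of the kind of orchestration it describes, and every command, script name, and path below is an assumption.

```python
import subprocess

def run(cmd):
    print("+", " ".join(cmd))
    subprocess.run(cmd, check=True)

# Inputs: roughly 3 minutes of target video and 1 minute of target audio.
VIDEO = "target_video.mp4"
AUDIO = "target_audio.wav"

# 1. Extract and align faces from the target footage (a DFL-style step).
run(["python", "dfl_extract.py", "--input", VIDEO, "--output", "faces/"])

# 2. Resume training from a pre-trained face model to save days of compute.
run(["python", "dfl_train.py", "--faces", "faces/", "--pretrained", "base_model/"])

# 3. Stretch the one minute of audio into a full dataset via TTS, then
#    fine-tune an RVC voice model on the synthetic clips.
run(["python", "tts_bootstrap.py", "--sample", AUDIO, "--out", "rvc_dataset/"])
run(["python", "rvc_train.py", "--dataset", "rvc_dataset/"])

# Output: one video model and one audio model, ready for the streaming setup.
```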
Okay. As for future work, I plan to try a full-head deepfake, including the hair and everything. I plan to play with different source datasets, for example just my own dataset, and see how it goes. And lastly, collaboration on, or integration of, the environment I demonstrated before.

Lastly, I want to say thanks to Mark, who helped me with the research, to the great tutorials I found online, and to my targets, who approved the use of their image. And lastly, thanks to the R2 TV show, which I think is starting a new season this month. Thank you.