How many of you flew here? All right. Now, how many of you flew here and needed your passport to get here? All right. Now, wouldn't it be great if you lost that and you could still fly home? I don't know if that's what this talk is about. AV, can we get their slide deck up behind us? I don't think my voice is going to get me anywhere. In a couple of minutes we've got Azim and John, who goes by delta zero. Give them a warm round of applause. You ready to go? Awesome. I know, why did the chicken cross the road? Oh, I don't know. Because I'm totally unoriginal. I don't know, that's a good one. Thanks, yeah, I try. Cool. And... silence. Yeah, I don't think I could tell a joke up here, because I'd alienate half of the population. We can just go ahead and start. Yeah, it's about time. All right everybody, if there's a seat open next to you, make a friend, there are still going to be people piling in, and give them another round of applause.

So welcome, everybody, to our talk, Your Voice Is My Passport. It has nothing to do with physical passports. My name is John Seymour, aka delta zero, and I'm Azim, and we both work at Salesforce doing detection and response engineering.

Okay, so let's get into it. These days we see that voice is starting to be used as a means to authenticate, and I'm using the word "authenticate" a little loosely here; we'll see why. Now, when I say voice authentication or speaker recognition, the thing that comes immediately to people's minds is maybe Apple's Siri or the Google Assistant. Both of these are services that are set up to unlock the device, or at least a subset of its features, based on whether the enrolled speaker says a specific phrase: "Hey Siri" for Apple, or "OK Google" for Google. Now, I do want to mention that neither Apple nor Google ever uses the word "authentication" to describe their service, at least we never came across that term. We suspect it's because they're aware that this is maybe brittle, but we'll see. Here you have an example of a financial institution, Schwab, that does indeed use "authentication": you can get into your account just based on your voice and have unfettered access to everything. The way it works is that after you've registered, you say the phrase "At Schwab, my voice is my password" to get into the account. The irony of that sentence, it seems, is completely lost on them. And then, finally, here's an example of Microsoft's Speaker Recognition API, which also claims to do authentication; this is speaker recognition as a service.

Okay, so as you may have inferred by now, our goal is to break voice authentication, and we want to do this with minimal effort. Let me be a little more specific. By breaking voice authentication, I mean that we want to be able to spoof a specific target and get into a service that's deployed today and set up to let that person in using their voice. By minimal effort we actually mean three things. First, whatever solution we come up with should not require tons and tons of compute. Voice authentication and speaker recognition are machine learning problems, and machine learning, and deep neural networks in particular, tend to require lots of compute, so I'm talking maybe a commodity server, not a server farm. Second, it should be realizable in some reasonable time, maybe days or weeks, not months. And finally, you should not require a Ph.D. in data science to be able to implement it.
All right. If you haven't seen the hacker movie Sneakers, you probably should; it's a hacker classic from the early 90s, and it's quite relevant to our talk. It's where the title actually comes from. In it, the heroes need to bypass a voice authentication system, and they do so by socially engineering their target into saying the specific words in the passphrase. So let's see if this actually works. (A clip from Sneakers plays: "... my voice is my passport ...")

So here's how you do that. Let's go back to the original idea in Sneakers: they record the words of the passphrase by using social engineering to get the victim to actually say those particular words. But in real life this is pretty difficult, for three reasons. First, the people you'd normally want to impersonate, say a CEO or a politician, are usually pretty busy and may not want to speak with, you know, normal people. Second, if you've ever tried this in a conversation, like "hey, say 'Hey Siri' to me, I want to record it", it's something that's going to make your target very suspicious: why do you want me to say the words "Hey Siri"? And even if you were able to do those two things, most voice authentication systems are pretty smart, and sometimes they change the passphrase and things like that, so the recording you made might be stale and useless by the time you actually go to authenticate.

However, luckily, there's a thing called text-to-speech, and it's actually pretty good. There's an entire area of research around it; it basically has a workshop at NIPS dedicated to it, and NIPS is a very prestigious machine learning conference. It's machine learning based: you give a system a bunch of audio and transcripts of that audio, and it produces new audio for you. It's made a ton of improvements lately, and it's a very active research area. Let's try this; let's see if this one works. (Clip plays: "This is a dangerous time. Moving forward, we need to be more vigilant with what we trust from the internet. It's a time when we need to rely on trusted news sources.") All right, the audio lagged a bit there just because of the network here, but that was actually Jordan Peele and BuzzFeed that made that video, and it should convince you that this technology is becoming pretty widespread. Just think, for example, what you could do with a huge AI research lab backing you. In our case we're going to focus exclusively on using it to bypass voice authentication. As such, we really don't care about the quality of the audio that's generated; we just care whether it bypasses the services or not. It could be complete and utter garbage.

Okay, so John has already mentioned that text-to-speech is generally a machine learning problem. The essential idea is that you give the algorithm some text, transcribed text to be specific, and it generates the equivalent audio representation of that text, for example mel spectrograms, which are just a time-frequency representation of the audio corresponding to that text. The model learns the mapping between the transcript and the audio, or to be more precise, between character sequences and the final output. The way it does this is that you give it labeled data (by "labeled" I just mean transcribed audio) and you feed it into a deep neural network.
After many, many iterations, the model learns this association I've been talking about, the association between character sequences and the final audio output. Now, a couple of things I want you to note here. Generally, deep learning models that are focused on voice are trained on a single person's voice. This is starting to change, and you'll see later in the talk why, but it's still a good thing to keep in mind. The second important thing is that deep learning models in particular, and ones that have to do with voice especially, require lots and lots of data to do any kind of good work. The general consensus in the academic community is that these models require around 24 hours of high-quality labeled audio to do well. Now, there are two very high-quality open-source datasets available, both of which have over 24 hours of data: the first is Blizzard, the second is LJ Speech. The main difference between the two is that one is a recording of a male speaker, the other of a female. You'll see later why this is important.

So, there's this company called Lyrebird, founded by several of the pioneers in text-to-speech research, and one of their goals is to bring awareness to what this technology can do. They host a lot of videos similar to the Jordan Peele video we showed you earlier. As a demonstration to the general public, they've set up a service where you can record your own voice and generate speech from it. The steps are pretty easy: you create an account, you record 30 sentences, which are chosen by Lyrebird in advance and are the same for all users, and after that Lyrebird trains its model; you then provide a target sentence for Lyrebird to generate. It's pretty simple and only takes a few minutes to generate audio, but there's definitely degradation in quality, and it's also finicky with a lot of different accents.

We did a proof of concept with Siri and Microsoft's speaker recognition public beta; we didn't test against Schwab or Google Voice. First we trained Siri or MSR to recognize our own voices, then we generated the target passphrases using Lyrebird and tested the audio against the speaker recognition software. ("My voice is stronger than passwords.") So this is us actually training the service in the first place. ("My voice is stronger than passwords." "My voice is stronger than passwords.") Okay, so now Microsoft accepts our speaker. ("My voice is stronger than passwords.") And notice that was accepted. ("This is a test and should be rejected.") Rejected, as expected. ("My voice is stronger than passwords.") It rejects Azim as well. ("My voice is stronger than passwords.") And look, it accepts the generated audio that we took from Lyrebird.

So there are some limitations to using Lyrebird as a service. For example, its effectiveness varies greatly based on the speaker: it worked very well for me, but it didn't work for Azim. But aside from general finickiness, Lyrebird requires specific utterances, and so it falls back to a lot of the same issues as the Sneakers approach we showed before: it's simply unlikely that an attacker could obtain specific recordings of a target. This does mean, though, that the Lyrebird database, as well as voice authentication databases in general, might be a valuable target for attackers.
To demonstrate how a real attack might work, we turned to the state of the art in text-to-speech generation. When I started this out I mentioned that one of our goals is to make this as easy as possible: you should not require data science expertise to implement the solution. So naturally we turned to open-source models that are widely available. There are several; two of the most popular are Tacotron, which is by Google, and WaveNet. WaveNet is perhaps better known, and it generates very, very realistic human-sounding output. The problem with WaveNet, however, is that it needs to be tuned significantly. What I mean by that is that WaveNet has lots of input parameters, for example the log fundamental frequency, phoneme durations, and linguistic features, and all of these would need to be tuned by a domain expert. That requires domain expertise and strays away from our original goal of making this as easy as possible. Tacotron simplifies this entire process; it takes the guessing out of it. You no longer need to individually tune features: you can basically just give Tacotron the audio as direct input, and it will figure out what the best feature set is.

So this is an example of Tacotron 2, which is Google's latest and greatest text-to-speech system. Tacotron 2 is composed of two parts, the one at the bottom and the one at the top. The one at the bottom is a recurrent sequence-to-sequence feature prediction network that outputs mel spectrograms, and the one on top is a modified WaveNet, which is conditioned on the predicted mel spectrogram frames and generates the final audio. An easier way to think about this is that the first network determines what the ideal feature set for WaveNet should be, which you can think of as a visual representation of sound frequencies, and WaveNet then takes those as inputs and finally gives you an output (there's a small illustrative sketch of this data flow at the end of this section). The good news is that you don't really need to know any of the internals of Tacotron to make it work. It's available open source, and you can basically just run it and give it the character sequences. There are some parameters you can tweak to make it better, but we did not; if you just leave them at their defaults, it works quite well.

So we just have a few audio comparisons for you. This should be the audio generated from Tacotron version one, which Google published in April of 2017 and for which there's a completely open-source implementation. ("Scientists at the CERN laboratories say they have discovered a new particle.") You can kind of tell that that was generated. (Another clip plays, a line from Sneakers.) This is fun. Yeah, there it goes. We just really love Sneakers here. All right. ("Generative adversarial network or variational auto-encoder." "Generative adversarial... generative ad...") All right, that's audio generated by Tacotron version two, which Google released in December of 2017. So we're talking April of 2017 to December of 2017: a huge increase in quality in a very short period of time. For completeness, here's the audio generated by WaveNet; we really did just use the defaults for this. ("Man, dying of liver complaint, lay on the cold stones without a bed or food to eat.") All right. Cool.
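To make that two-stage structure concrete, here is a minimal, purely illustrative sketch of the data flow just described. The two "networks" below are stand-in functions that return random arrays rather than real Tacotron or WaveNet models, and the numbers (80 mel bands, a hop length of 256 samples, 22.05 kHz audio) are common defaults we are assuming, not values from the talk.

```python
# Sketch of the two-stage text-to-speech pipeline described above.
# Illustrative only: the two "networks" are stand-ins, not real models.

import numpy as np

CHARSET = "abcdefghijklmnopqrstuvwxyz '"
char_to_id = {c: i for i, c in enumerate(CHARSET)}

def text_to_ids(text):
    """Stage 0: turn the transcript into the character-ID sequence the model consumes."""
    return np.array([char_to_id[c] for c in text.lower() if c in char_to_id])

def feature_prediction_network(char_ids, n_mels=80, frames_per_char=5):
    """Stand-in for the recurrent seq2seq network: character IDs -> mel spectrogram.
    A real Tacotron 2 predicts the mel frames with attention over the characters."""
    n_frames = len(char_ids) * frames_per_char
    return np.random.rand(n_frames, n_mels)          # placeholder mel frames

def wavenet_vocoder(mel, hop_length=256):
    """Stand-in for the modified WaveNet: mel frames -> waveform samples."""
    return np.random.rand(mel.shape[0] * hop_length)  # placeholder audio

ids = text_to_ids("my voice is my passport verify me")
mel = feature_prediction_network(ids)   # e.g. (165, 80) mel spectrogram
audio = wavenet_vocoder(mel)            # e.g. ~1.9 s of samples at 22.05 kHz
print(ids.shape, mel.shape, audio.shape)
```

In the real system both stages are large neural networks trained on the labeled audio described earlier; the sketch only shows how the pieces hand data to each other.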
So that's all well and good, but in order to actually spoof somebody's voice, or train any kind of model, you need data, and you need lots of it. Given that we want to impersonate a specific target, where might you get this data from? If your target is somebody who does lots of public speaking, like, say, John, you can probably grab that audio from YouTube or some other public source, but remember that both the quality and the quantity of the audio matter. Then you need to transcribe this data, because as I mentioned earlier these models require labeled data, and labeled in this sense just means transcribed. And finally you need to chunk it up, because these models expect sentences: they expect you to give them chunks of audio, and that's how the network trains.

When we started this out, we thought we could kill two birds with one stone and use the Google Speech API. What the Speech API is supposed to do is take some audio and give you both the transcript and the start and end time of each word in that audio. But for whatever reason we could never get it to work well enough. We suspect it's because when you get audio from a public source there's going to be lots of noise in it, and the API doesn't tend to do very well with that. It also does not handle the natural pauses in human speech, the "um"s and "uh"s; it just tends to think that's some word. This is not to ding Google: the Speech API does work very well when you give it good-quality audio, but we think that's an unreasonable expectation if you're going to impersonate a specific target. So what we ended up doing is manually transcribing our data. Remember, John is the target here, and it's not so bad; it took us, what, an hour to transcribe that data. Chunking that data up turned out to be very easy: you just use FFmpeg and split your audio by silence, and that conveniently chunks it up roughly by sentence.

I've mentioned that both the quality and the quantity of data are important. When you get this data from a public source like a YouTube talk, a lot of the sentences in that talk are not very usable. If your target says lots of "um"s and "uh"s, that's not very useful; the model is not going to learn anything from that. There are also times when there's applause, and that again will mess up your samples. So what you need to do is sub-sample: select the highest-quality audio samples and use those. What we ended up with was around five to ten minutes of really good quality audio, and if you remember, I mentioned that you need about 24 hours of data, so this is just not nearly enough to do any kind of good training. Now, the solution to that problem of very limited data is something called data augmentation. One side effect of slowing down and speeding up audio is that the pitch changes, and you can abuse this to generate new examples and add them to your training set. There are tons of libraries available for this; we used PyDub.
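Roughly, the chunking and pitch-shift augmentation just described might look like the following sketch. The talk splits on silence with FFmpeg; to keep this a self-contained Python example we use PyDub's split_on_silence instead, and the file names, silence thresholds, and speed factors are assumptions rather than the exact values the speakers used.

```python
# Sketch: chunk a scraped talk into sentence-sized clips by splitting on
# silence, then multiply the data by replaying each clip at several speeds
# (which also shifts the pitch). File names and thresholds are illustrative.

from pydub import AudioSegment
from pydub.silence import split_on_silence

talk = AudioSegment.from_file("target_talk.wav")

# Split wherever there is >= 700 ms quieter than -40 dBFS (tune per recording).
chunks = split_on_silence(
    talk,
    min_silence_len=700,
    silence_thresh=-40,
    keep_silence=200,
)

def change_speed(clip, speed):
    """Replay the clip at a different frame rate (changes speed *and* pitch),
    then re-tag it with the original rate so players treat it normally."""
    resampled = clip._spawn(
        clip.raw_data,
        overrides={"frame_rate": int(clip.frame_rate * speed)},
    )
    return resampled.set_frame_rate(clip.frame_rate)

# Keep only clips long enough to hold a sentence, then write out several
# speed/pitch variants of each as new training examples.
speeds = [0.90, 0.95, 1.00, 1.05, 1.10]
for i, clip in enumerate(chunks):
    if len(clip) < 2000:   # shorter than 2 s: probably not a usable sentence
        continue
    for speed in speeds:
        change_speed(clip, speed).export(f"clip_{i:04d}_{speed:.2f}.wav", format="wav")
```

With only five speed factors this multiplies the data fivefold; the speakers describe next how far the speed range can be pushed before Siri stops recognizing the voice.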
But to make this a little rigorous, what we did was take an original recording of me saying "Hey Siri", slow it down and speed it up, and see how far we could go while Siri still recognized my voice and unlocked the phone. In our case we were able to slow it down to about 0.88 times and speed it up to about 1.21 times, and Siri would still recognize that it was me speaking. Obviously your mileage may vary on the exact parameters; it's probably different for every single person. Notice that this actually fixes both of our original issues: it multiplies our training data by about 30 times, and you only need to transcribe about a thirtieth of the original training data. But there is an issue introduced by this, and that's overfitting. If you're only choosing some subset of what the target actually speaks, then you're not getting a fully representative sample of all the different phonemes and things they might say, so you still have to be careful. In other words, the model is being trained on a small subset of what the target might say, so there might be some sounds it can't generate very well.

But even with that 30 times, it's still not enough to generate really good audio. If you do the math, five to ten minutes times about 30 is still nowhere near the 24 hours we originally needed, so shifting pitch ended up not getting us all the way there. If we calculate it out, we'd need at least an hour of high-quality data, and that still takes forever to transcribe, and this is not even considering the issue of limited vocabulary. So we turned to this idea of domain adaptation, or transfer learning. How this works is that you initially train on a large open-source dataset such as Blizzard or LJ Speech until you get a decent model, and then you stop training there. You simply swap your target's data in for the original training data and continue training the model, and eventually you get a model that sounds more like the original target. What we think is happening is that the model initially learns how to speak using the Blizzard and LJ Speech data, and then it adjusts pitch and accent based on the target's data. We think this is because of the layered nature of neural nets: the lower layers are more useful for understanding the basics of language, translating characters and words into sentences and into audio, and the higher layers determine pitch and accent and things like that. Furthermore, there's still a lot of variance in effectiveness here; it's very finicky. Sometimes it converges within one epoch, which is just one iteration over all of the training data from your target, and sometimes it takes a couple of days to train.

So we have a simple demo here. We trained our Blizzard model, not for very long, so it's not great audio quality. ("I'm going to make him an offer he cannot refuse.") That still sounds a lot like the Blizzard speaker, and it's still sort of choppy; that's an artifact of us using Tacotron v1, and we expect the quality to get better. But then we actually used transfer learning (a rough sketch of that fine-tuning step is below, followed by the demo).
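In code, the fine-tuning recipe just described (train a general model on Blizzard or LJ Speech, then swap in the target's data and keep training) might look roughly like this. TinyTTSModel, the random tensors, and the file names are stand-ins assumed for illustration; this shows the shape of the recipe, not any particular Tacotron implementation's API.

```python
# Sketch of the transfer-learning recipe: pretrain on a large open-source
# corpus, checkpoint, then continue training the same weights on the small
# target dataset. The model and data below are toy placeholders.

import torch
from torch import nn, optim
from torch.utils.data import DataLoader, TensorDataset

class TinyTTSModel(nn.Module):
    """Stand-in for a Tacotron-style network: character IDs -> mel frames."""
    def __init__(self, vocab=30, n_mels=80):
        super().__init__()
        self.embed = nn.Embedding(vocab, 64)        # lower layers: "how to speak"
        self.rnn = nn.GRU(64, 128, batch_first=True)
        self.head = nn.Linear(128, n_mels)          # higher layers: pitch/accent

    def forward(self, char_ids):
        x, _ = self.rnn(self.embed(char_ids))
        return self.head(x)

def train(model, loader, epochs):
    opt = optim.Adam(model.parameters(), lr=1e-3)
    loss_fn = nn.MSELoss()
    for _ in range(epochs):
        for chars, mels in loader:
            opt.zero_grad()
            loss_fn(model(chars), mels).backward()
            opt.step()

def fake_dataset(n_clips, seq_len=40, n_mels=80):
    chars = torch.randint(0, 30, (n_clips, seq_len))
    mels = torch.rand(n_clips, seq_len, n_mels)
    return DataLoader(TensorDataset(chars, mels), batch_size=8)

model = TinyTTSModel()
train(model, fake_dataset(1000), epochs=5)        # stands in for the ~24 h corpus
torch.save(model.state_dict(), "general.pt")      # the model that "knows how to speak"

model.load_state_dict(torch.load("general.pt"))   # resume from that checkpoint...
train(model, fake_dataset(50), epochs=5)          # ...on the small, augmented target data
```

The point of the two train() calls is just the order of operations: the same model keeps its weights from the general corpus and is then nudged toward the target's voice.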
("I'm going to make him an offer he cannot refuse.") That sounds a lot more like my voice, and it was completely generated. So epochs vary; this one took about two days to train and then an overnight run to actually do the transfer. This is good enough to start breaking APIs. The approach works. It's not as consistent as Lyrebird, but it doesn't require any specific utterances at all; what we did was scrape audio from YouTube to generate that. The overall effort here is also very, very low: it took us about a month from conception to completion. More effort would obviously make the audio quality much better, or give a much higher probability of being accepted by the two APIs we demonstrated earlier; there are so many more parameters we could have tweaked, and so much more data we could have transcribed, for example. But the fact that the overall effort is so low should be pretty scary.

Okay, we may have thrown quite a bit of information your way, so let me take a step back and put everything back together. The steps you would need to take in order to spoof somebody's voice really aren't that many. You start off by scraping data of the target from some public source, maybe YouTube. You sub-sample, selecting only the high-quality samples from your audio. You then need to transcribe and chunk that audio; at this point you need to do it manually, but there's no reason to believe the speech APIs aren't going to get very good very quickly. Then you augment your audio by shifting pitch. The second part, transfer learning, is two steps: first you train a general text-to-speech model on any open-source dataset, then you replace the general model's training data with the target's data and finish training. At this point you should be able to successfully synthesize your target's voice.

Okay, so I kind of want to put our work in perspective and give people a flavor of everything machine-learning-for-offense related. What we've done here is group prior work into two arguably very broad buckets: attacks on machine learning systems, and attacks using machine learning systems. Our work sits kind of squarely in the middle. Let's first start with attacks on machine learning systems. Adversarial attacks are one of the hottest topics in machine learning security research right now; in fact, the two terms are sometimes used synonymously. The basic idea behind adversarial attacks is that you carefully craft the input to a machine learning model in such a way that the model ends up misclassifying it. As an example, think of an image recognition system and a picture of a dog: you would carefully tweak some pixels in that picture in such a way that the model then misclassifies it as a giraffe or a panda or something. This might sound cute, but there are real security implications here. The canonical example people give is a self-driving system and a stop sign: if somebody does something to that stop sign, where the stop sign still looks very much like a stop sign to a human, and the altered and unaltered pictures are indistinguishable to the human eye, the system might misclassify it as a yield sign or something.
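As a concrete illustration of that "carefully tweak some pixels" idea, here is a minimal sketch in the style of the fast gradient sign method, one standard way adversarial examples are generated. The toy linear "classifier" and random image are our own stand-ins, not anything from the talk.

```python
# Minimal FGSM-style sketch: nudge every pixel slightly in the direction
# that increases the model's loss, so the prediction may flip while the
# image still looks unchanged to a human.

import torch
from torch import nn

model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 10))  # toy classifier
image = torch.rand(1, 3, 32, 32, requires_grad=True)             # toy "stop sign"
true_label = torch.tensor([0])

loss = nn.CrossEntropyLoss()(model(image), true_label)
loss.backward()

epsilon = 0.01                                      # keep the change imperceptible
adversarial = image + epsilon * image.grad.sign()   # worst-case nudge per pixel
print((adversarial - image).abs().max())            # every pixel moved by at most epsilon
```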
Now, most of the prior work on adversarial attacks for voice systems has focused on hiding commands in benign-sounding audio. Prior work basically showed how you can have a benign-sounding sentence like "OK Google, turn on the lights" that Google would in fact interpret as something like "send an email" or some such thing. This method is pretty cool, but the con is that it's currently very brittle. Then there's the idea of poisoning the well: similar to adversarial attacks, you carefully craft your input, but your aim now is to corrupt the model itself. And there's the inverse of that, the kind of attack differential privacy is meant to defend against, where you carefully observe the outputs of a model in the hope that they will tell you something about the data that was used to train it. Cool. And again, we've bucketed these things into two categories just to make them simpler to understand.

Then we have this idea of attacks using machine learning systems. For example, earlier this year we saw what we consider the first widespread machine-learning-based attack, in the form of deepfakes. If you don't know about deepfakes, it's basically an app where you can transplant the face of one person onto the body of another in a picture or video, and what we've seen is that it mostly ends up being used for pornographic purposes. There's also a whole host of other ways machine learning systems can be used to attack people. One primary example is phishing: you can scrape data about a target off of YouTube or Twitter or something like that and generate a phishing post specifically tailored to their own interests. The final thing we want to call out in this space is robotics and social engineering; if you haven't seen it, there's a really cool talk by Sarah Jane Terp, Straithe, and Wendy Knox on that.

Okay, so we're hoping at this point we've convinced you how relatively easy it is to spoof somebody's voice. There are other issues with voice as a means to authenticate, too. You could have some kind of passphrase that you use, but the problem is that it's difficult to keep passphrases secret if you have to say them out loud. Alternatively, you could require an unknown vocabulary, and John talked about this earlier, but speaker recognition with an unknown vocabulary is a harder problem than speaker recognition with a known vocabulary. So what we want to stress here is that speaker recognition and speaker authentication are two separate problems, and they should be treated as such. What we suggest is that you use speaker recognition as a weak signal on top of a multi-factor authentication system. Think of an MFA system that requires tokens: what you would do is say those tokens to the system instead of typing them out, and that does indeed provide another weak signal on top.

So let's talk about detection. Here we've thrown together two examples of things you could use to detect this. On the left is an example of something that attempts to detect computer-generated audio. On the right-hand side you have the inverse, a device which tries to detect certain neuromuscular features, the idea being that if it detects something, the sound must have come from a human.
Now, treat these with skepticism, because we expect this to be an arms race. Cool. So, just to reiterate what we're trying to raise awareness of and what we think based on our own experiments. First off, what we'd like you to take away is that speaker authentication and speaker recognition are two completely different problems, and recognition should only be treated as a weak signal for authentication. The second takeaway is that speaker authentication can easily be broken if the attacker has speech data of the target and knows the authentication prompt. And third, although most text-to-speech systems require about 24 hours of speech to train, transfer learning is a very effective method to reduce that to an amount realistic for an attacker to abuse today. In fact, transfer learning is an effective technique across a very large number of machine learning use cases. In conclusion, it's relatively easy at this time to spoof someone's voice, and it's only going to get easier over time. And just as a final note: even after we submitted this to DEF CON, some researchers at Google published a paper back in June of this year on transfer learning from speaker verification to multi-speaker text-to-speech synthesis. We just want to note that this is a very active area of research generally, and we're not the only ones looking into this. This entire field is going to grow at an alarming rate, and we should figure out how to deal with it now. And with that, that's the end of our talk. If anyone has any questions, definitely feel free.