 Hi everyone. My name is Ganesh Devarajan. I work for GoDaddy as a security researcher. And this is Don. He also works in GoDaddy as a security engineer. And we'll let him talk about himself. Yes. So my name is Don Lombard. I work at GoDaddy. I'm a security engineer there. So my job is pretty varied. This doesn't exactly pertain to what I do on the daily basis. But because they're also nice and so kind, they let us kind of work on a project that we had. So my day to days are anything from setting up new systems. Prior to working at GoDaddy, I was basically worked at crappy ISPs and service providers. So I got hired on and ever since then I've been happy to actually do research every now and again. So our talk is VDLDS, which also stands for voice data leakage detection system. So the whole idea behind this was like we're like sitting over there in our office and then we have a huge call center at GoDaddy. And what we were thinking about it was like how cool would it be if we can just like walk by the call center and start getting all the credit card numbers. Because our call center is like really strict where like you can't take your phones in there, like you can't talk on the phone when you're walking through it. Or like you can have a piece of paper or pen where you can write stuff down. So they're like pretty strict that way. But what we were trying to figure out was like how can I manipulate the system? Like how can I steal information from like a phone conversation or like any other, like most people like whenever like you get a call from one of your customer service representative or like something that you're calling to pay your bills or whatever, you feel more confident to give out your personal information and your credit card information over there. But so we thought like it'd be interesting if we can decode that voice stream and then convert it and like figure out the numbers in there and pull it out for our own personal gains kind of thing. 
So that's how the whole thing started. First disclaimer, if you get caught doing something like this and we are not responsible. So it's all on you. So the agenda is like basically like we want to go through like what's the problem that we actually came across? Like what is it that we want to go, what is it that we want to cover? And what are some of the background stuff that's out there and the different scenarios that we could think of and there are like million other scenarios out there and the different deployment architecture model that we have for our tool and like a demo and the future work. We'll probably do the future work before the demo. So the problem. So as we know when you're on your phone, more people tend to let things slip and they do that for multiple different reasons. Even when you're in an enterprise or just you know on your cell phone you tend to say things thinking oh well you know it's either encrypted by the cell phone provider or maybe it's encrypted you know by my enterprise. They're always willing or you or most people are willing to give out their credit card number. I mean how many people have sat there even at work on your Cisco phone and said okay I'm going to call the pool guy because stuff just blew up and I totally need to have him get out there today. I'm going to give him the credit card number over the phone so he's at my house by the time I'm home. Well yeah that's great but it's not as safe as you think it is. So you give out your credit card number. Sometimes you're trying to set up a new maybe you're applying for schools so you've got to drop your social security number. Well they're not going to wait for you to walk over to their station or send it through the internet. You're going to give it over the phone. So a lot of the times people also try to use that phone system which they don't think they are going to get caught on and they'll try to slip out insider information. 
They'll try to talk about your company's acquisitions before it hits the streets at all. The problem with that is, obviously, as an enterprise we're always trying to figure out how we can protect our data up to the point when we're ready to release it, because the rumor mills really cause a lot of damage, and if you can control that, or at least know where it's coming out from, then you're going to be a lot better off. A lot of the time in enterprises you have hundreds of people that are on the phone all the time, and there are systems in place that do monitor these phone calls. But just because you can monitor them and archive them and maybe have that audio for later doesn't mean you're going to be able to go through a hundred phone calls an hour, or sit down and listen to every twenty-minute or hour-long phone call. There's just not enough manpower. So some of the background again: who does this kind of thing? The government is already doing it. Whenever they want a wiretap on someone's phone, they can go through the service provider and tap all the calls. But how efficient is it if the person talks on the phone for 20 hours a day? You don't want to sit there and listen to every single conversation. If you can figure out the keywords that you want, then you can trigger on those. So one thing that we thought of was words like "jihad" and "terrorist"; the government's already tracking those kinds of calls, and they do all the voice matching and recognition. But those are all done at the service provider level, and we wanted to bring something which is, firstly, not that expensive and, secondly, much easier to do. I mean, you can also go plant a physical bug into 
the phone and then do the recording and transmit it out simultaneously, but we thought, let's just make it simpler and do it on the wire. That way I don't have to be physically present over there to do this kind of work. And there is commercial software out there which does this recording for you, but the ones that we actually looked into didn't have the capability of detecting stuff on the fly, which would trigger further investigation. We, at least as a provider, have a huge call center, and over there we get tons of calls. I can't have somebody sit and listen to every single call. So if I could trigger on an angry customer, or somebody who's just abusive, or somebody phishing for credit card numbers or things like that, I can trigger on that, log it, and figure out who's doing what. So what is DLP? DLP is data loss prevention, and it's something that every enterprise deals with every day. Most of the time DLP is targeted at things leaving the network, but it's also targeted at data at rest. Well, the two places you're dealing with there are either on a server or on a client workstation. So they're digital documents, they're text documents, they're emails. They're not voice, they're not audio, they're not video. And a lot of the time you run into the situation where the data becomes encrypted and it comes out of your network anyway, because you don't necessarily have everybody's private key. Anybody can go and get GnuPG and encrypt a document and send it out using somebody's public key. It's a problem, but we can at least try to limit the attacks, or at least the vectors for getting this information out, which is why voice is a great thing to start with. So the scenarios that we're trying to deal with obviously cover a broad spectrum. 
One of the things that we want to attempt to do with this technology is deal with social engineering we want to be able to detect that same kind of talk pattern that social engineers use when you're in a call center you can have one person you know make forty fifty calls and it's going to say the same thing every time until somebody either does what they want or get some of the information that they're really going after and it doesn't matter I mean it can be one person it can be three you're going to see it with this you should be able to detect that type of activity because we've made it in a way that you can write a rule that's going to parse through the text output in the manner that you need so if something new is out there that you need to pull out in a different manner that's possible. Another scenario is obviously your insider trading and leaks everybody has that issue I mean we can attest to the fact that when people invested into GoDaddy recently that it's very short after that information comes out internally that it's all over Twitter and then shortly after Twitter it's out on the internet and then there's like a blog document less than an hour later so it's is it you know people doing that on their cell phones in our case probably not so much I mean we require that they don't use cell phones in the call centers are on the floor so no is it possible that they're calling out over our phone systems absolutely and that's also the best way to get the most information out fast. 
You know, it's calls to bookies. We want to deal with our call centers. Most importantly, we want to stop, or at least know, when people are giving out our vendor account information. Obviously who you have infrastructurally is just as important as who you're going to use or where it's going to go in the future, so who you're maybe going to be buying products from. At the same time, we made this pretty modular. So if you're coming at it from one end, you're being a nice guy and you're trying to block that stuff or detect it leaving your network, you can use it that way. But it could also be used to spy on people. It can sit on their workstation at home. Maybe they have a softphone, so that's that VoIP traffic, and you can sit there and spy on your wife or your spouse or whatever. You can try to steal their personally identifying data. It all flies out over the phone; it all flies out off their system. You can try to snag all their credit cards. Or maybe we just want to know who the hell is cursing us out and interact with that: maybe blank it out, try to bleep out, if you will, that type of curse, or reset the call. So one of the things that I also actually heard was that one of my colleague's friends happened to record all his phone conversations, and when the bank came in for the foreclosure, he was able to play that back, get out of the foreclosure, sue the bank, and get money out of that. 
So it worked out to their advantage over there. And some of the other things in the news about data leakage, right: Sony lost so much credit card information, and Citibank and ADP. It's happening every other day, and these are the ones that we know of, the ones that go into the public news because it's a big company. Apart from this, you have all these smaller grocery stores or gas stations where people are skimming your credit card information and then selling it. And in the black market, how that works is they basically say, okay, I can provide you 100,000 credit card numbers, and here's a sample of that. Then they provide the first thousand of them, and people run some quick checks on those, and if it works out they immediately go purchase it. There are sites which are dedicated to buying and selling credit card information like this, and it's all on an invite-only basis. There are tons of places where it's happening right now, and most of it goes to the Eastern European countries and somewhere up there too. Some more background on what VoIP is: we don't want to go into more detail about what VoIP is and how it is done; there are a few other talks which are covering all that already. So what we are trying to do is this: we have phone conversations between people, and whatever is going through the internet is basically the SIP protocol and then the RTP protocol, mostly. We're not even looking at the SIP right now. All that we do is take the RTP payload from those sessions, split those sessions out, and then get the actual data out of that, the WAV file out of that. 
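As a rough illustration of "take the RTP payload and split the sessions out" (a sketch, not the speakers' code), the fixed 12-byte RTP header defined in RFC 3550 can be unpacked with Python's `struct` module, and packets can be grouped by their SSRC field. The function names `parse_rtp` and `split_sessions` are made up for this example, the input is assumed to be raw UDP payloads that are already known to be RTP, and sequence-number wraparound is ignored.

```python
import struct
from collections import defaultdict

# Fixed RTP header (RFC 3550): two flag bytes, 16-bit sequence number,
# 32-bit timestamp, 32-bit SSRC, all in network byte order.
RTP_HEADER = struct.Struct("!BBHII")


def parse_rtp(datagram: bytes) -> dict:
    """Unpack the fixed RTP header and return its fields plus the payload."""
    if len(datagram) < RTP_HEADER.size:
        raise ValueError("too short to be an RTP packet")
    b0, b1, seq, timestamp, ssrc = RTP_HEADER.unpack_from(datagram)
    if b0 >> 6 != 2:                      # version lives in the top two bits
        raise ValueError("not RTP version 2")
    padding = bool(b0 & 0x20)             # P bit
    extension = bool(b0 & 0x10)           # X bit
    csrc_count = b0 & 0x0F                # CC field
    offset = RTP_HEADER.size + 4 * csrc_count
    if extension:
        # Extension header: 16-bit profile id, 16-bit length in 32-bit words.
        _, ext_words = struct.unpack_from("!HH", datagram, offset)
        offset += 4 + 4 * ext_words
    end = len(datagram)
    if padding:
        end -= datagram[-1]               # last byte holds the padding length
    return {
        "marker": bool(b1 & 0x80),
        "payload_type": b1 & 0x7F,        # 0 = G.711 mu-law, 8 = G.711 A-law
        "seq": seq,
        "timestamp": timestamp,
        "ssrc": ssrc,
        "payload": datagram[offset:end],
    }


def split_sessions(datagrams) -> dict:
    """Group packets by SSRC and join each stream's payloads in sequence order."""
    streams = defaultdict(list)
    for datagram in datagrams:
        packet = parse_rtp(datagram)
        streams[packet["ssrc"]].append(packet)
    return {
        ssrc: b"".join(p["payload"] for p in sorted(packets, key=lambda p: p["seq"]))
        for ssrc, packets in streams.items()
    }
```

Grouping on SSRC alone is the simplest choice; a real deployment would probably also key on the IP addresses and UDP ports so that two calls that happen to pick the same SSRC are not merged.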
So a little background on the RTP protocol. This is the RTP header. The first two bits are the version number. Then we have the padding bit, which says whether there are padding bytes to follow at the end of the packet. Then we have the X, which is the extension bit: this is the actual fixed RTP header, but you can have an extension header following it, which is optional. Then we have the CC, which is the contributing source count, and it says how many CSRC records follow this header. And we have M, which is the marker. The marker is more specific to custom applications which are using RTP for their own data transfer, so those applications can have a unique marker identification over there to parse the data differently. Then the payload type. The payload type is much more important for us in this particular project, because that says how the payload is encoded: is it just plain raw, or is it G.711 mu-law encoded or A-law encoded? Mu-law encoding is what is used in North America mostly, and A-law is what is used on the European side in VoIP data transfer. Then we have the sequence number and the timestamp. The sequence number is again just to make sure it's all in order, counting basically, and the timestamp helps out a little bit more. When you are having a conversation, say for 10 minutes, there can be a lot of lag; if the packets reach you a little slower, there's a lag in your conversation. So with the timestamp in there, we can actually compress the audio file: a 10-minute conversation that you have on the phone could be compressed to eight minutes or four minutes or whatever. And we can even use that to help out a little bit with all the silence packets, pull those out, and keep things in sync. So the basic architecture that 
we have is pretty simple. We have two people talking on the phone, and a sniffer captures all the packets. If you are having a conference call, or if the VDLDS is deployed at a central position where we can monitor all the conversations going on in the organization, then we need to split it all out into different sessions. So what I started out doing was using dpkt, which is a Python library. It's there in the references. With the dpkt module, what I did was split the packets: first took the packet, stripped out the Ethernet headers, the IP headers, and the UDP headers, and then went into the RTP header, parsed that, and pulled the data out. Then, based on whatever the encoding type was that I showed on the previous slide, apply the decoding on the entire file, ordered by the sequence number. So I got almost to that point, and that's when I thought, wow, this is really painful, doing all this work. And then I realized I had to do all the session splitting and everything. I don't know if you guys have heard of this tool called Netdude; it's a Linux-based one, and it was a pretty outdated tool, so we kind of rewrote Netdude in Python too. So I said, okay, let's just start using that to split all the sessions in the packet capture. So I started doing that, and then finally it struck me: I should probably Google this and see if somebody else has done it. And sure enough, there are a whole bunch of other people who have already done the session splitting and splitting it out into audio files. So that's when I said, okay, I'm going to use one of the common ones which is spread across multiple platforms, and that's when we chose VoIPong and Oreka as the two tools that we 
can use to parse for this one. Going further into this, after the session breaker we have the audio. Currently, the way we deployed it, at least for the proof of concept, was that my laptop was sniffing the traffic from my desk phone, and whatever conversation was happening was getting split out onto the local machine. I had a script which would, at a regular interval, go through and check whether there are new files over there. If there are new files, then shove them to our web server, wherever the master is. The master could be running locally or on a remote server, so we can transfer all the files over there and do the further processing there. The master basically does the audio-to-text conversion, and then we have our own engines to detect the keywords, and all the regular expressions to detect the credit card numbers and stuff like that, and then dump it into the display portal. So a little more detail on the actual architecture of our tool. We have the payload converter: we have the payload, which is basically the VoIP packets with RTP packets in there, and we convert that into media and store it in a media database. Then we send it through our transcribing engine, where we generate the transcripts. Then we go to the detection logic engine, where we figure out if it's PII data, or insider leaks, or curse words; based on what kind of rule you want to apply, we run that against it. If an event is generated, then we send it to the display portal, and it gets parsed and displayed over there. And if you want to add your own custom signatures to it, because I'm pretty sure you'll want to add something which is more specific to your organization, that can also be done very easily. 
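The payload converter step can be pictured with a small sketch. It assumes the stream is G.711 mu-law (RTP payload type 0) at 8 kHz mono, which is only one of the encodings mentioned in the talk, and it uses the standard mu-law expansion formula plus Python's `wave` module. The names `ulaw_to_pcm16` and `payload_to_wav` are invented for this example and are not taken from the speakers' tool.

```python
import io
import struct
import wave

ULAW_BIAS = 0x84  # bias constant used by the G.711 mu-law companding formula


def ulaw_to_pcm16(byte: int) -> int:
    """Expand one G.711 mu-law byte to a signed 16-bit linear PCM sample."""
    u = ~byte & 0xFF                       # mu-law bytes are stored inverted
    mantissa = u & 0x0F
    exponent = (u >> 4) & 0x07
    magnitude = (((mantissa << 3) + ULAW_BIAS) << exponent) - ULAW_BIAS
    return -magnitude if u & 0x80 else magnitude


def payload_to_wav(payload: bytes, rate: int = 8000) -> bytes:
    """Decode a reassembled mu-law payload and wrap it in a mono 16-bit WAV."""
    samples = [ulaw_to_pcm16(b) for b in payload]
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav:
        wav.setnchannels(1)                # telephone audio is mono
        wav.setsampwidth(2)                # 16-bit samples
        wav.setframerate(rate)             # G.711 is sampled at 8 kHz
        wav.writeframes(struct.pack("<%dh" % len(samples), *samples))
    return buffer.getvalue()
```

An A-law stream (payload type 8) would need a different expansion function, and the GSM output mentioned later in the talk would need a separate decoder.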
So some of the other things we looked into: how can we get the audio data out of the packets? When we started looking at it, we first said, oh, Wireshark has this in its VoIP telephony piece; let's just use that. So we said, okay, detect the streams, dump them to a WAV file, and see how that goes. And then we had a few false positives in there. For example, if you are using Microsoft Office Messenger, which also uses SIP and RTP to communicate, those started mixing in. And we had other cases where it would be the same stream, but in the stream there are certain other control packets which go through the RTP session. So those are all the things that we wanted to take out. So dpkt, like I mentioned, was the Python library that we used. Then VoIPong: again, there's a live CD available from VoIPong if you don't want to compile the source code; you can just install it, and it starts dumping out all the audio files. Then we have vomit and Oreka and rtpbreak and rtpscan, rtpdump and rtpplay, and Cain & Abel. All these guys can dump voice audio files out of a packet capture. So obviously a good portion of this is actually getting the audio into some type of form that we can parse through and pull that data from. There were a couple of different architectures that we looked at, and a lot of work had gone into one or the other. Obviously we looked at the ones that are very specific to their platforms. So there's Microsoft's speech API, and OS X obviously has its own speech recognizer class built in. Really wanted to stay away from that. Anything that isn't that mobile really isn't that good. Obviously there are a lot of companies out there that just don't use Windows or OS X, and you would have to develop it completely separately for each. 
Also there are larger companies out there now that use Linux as their main desktop. So you really need to target at least all three if not more of those particular platforms. The one in particular that I kind of went for was Sphinx. So there are multiple versions of Sphinx out there, each for different reasons. There are Sphinx 3 which is a completely C compiled module that you can use to filter audio down to text. There's Sphinx 4 which is Java which you can do the exact same thing and it was built to replace Sphinx 3. I really started putting a lot of time into Sphinx 4. As a matter of fact, the demo was probably completed once using nothing but Java or Jython because that was the fastest way to actually interact with their system and it worked very well. The problem that we ran into is as we looked into this Java's really heavy. So it was much better idea to try something else than the same people that make Sphinx so decided to come out with Pocket Sphinx which is very light, it's for embedded systems. It gives you most of the same stuff. It's a little bit simpler but it gave us a lot more avenues in the future. We weren't necessarily going to be stuck using something that had to deal with Java. We could have a C library that worked on Android, it worked on iOS, it worked on the iPad, it worked on Windows. So it kind of covered all of our bases all in one spot and it works fairly well. As far as Sphinx 4 goes, like I said, I really got into this. I thought it was mobile but not as mobile as I needed and obviously it was easy to hide and it was easy to interact with. It was easy to interact with because I could use Jython and heck that's Python. Everybody can do that. It was really easy to hide because let's face it, everybody has seen Java just randomly take up everything on their system for no friggin reason. So why not? Well, before we're really going into it, it turned out that we really wanted to kind of hit the mobile area and that's where Pocket Sphinx is. 
I mean, it just really allowed us these future avenues to go down. I'm not really going to get into a lot of how voice-to-text works. It's based on phones, and basically you try to determine what is being said based upon what the standard phone is, or what is being said at that moment based upon what was previously said and what should be said next. So you have acoustic models, and you have dictionaries that break down words into, essentially, syllables that you try to match on, and then you base that upon the probability of what's going to be said next, so your language model. As we go further, you find that, yeah, it's possible to get a completely bad transcription of what was said, based upon who's speaking, what is actually being said, which language model you're using, and how the audio has been transcribed. So, the architecture that we use for the actual detection on what is output. As we put these WAV files in there, they've got to be in a specific format to get the best likelihood that it's going to transcribe what you want. So to combat the issues that come with something transcribing poorly, we had to make the detection engine really customizable. So it's script-based. You can program something in Python, and since it's in that text file, well, you can do anything you can do in Python. Just make sure that you know what you're doing and you'll be able to pull anything out. It takes a little bit more skill, but it's going to be custom and you're going to be able to format it accordingly. It's also rule-based, because it's script-based. It's easy enough to have a script there where all it does is pull a bunch of rules from a file, regular expressions that you can match on, and then just push that out there by default. So, you know, Joe Sixpack can take care of it. Nobody really has to be too much of a regex ninja. It does have lower customization, but that's what you sacrifice. 
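A rule-based pass over a transcript might look roughly like the sketch below. It is an assumption-heavy illustration, not the speakers' engine: the rule file format (`name: regex` per line), the rule names, and the functions `parse_rules`, `normalize`, and `detect` are all hypothetical. Because a speech recognizer typically emits digits as words ("one two three"), the sketch first turns spoken digits into digit characters and then applies the regular expressions.

```python
import re

# Hypothetical rule file contents: one "name: regular expression" per line.
RULES_TEXT = r"""
# name: regular expression
credit_card: \b\d{16}\b
ssn: \b\d{9}\b
profanity: \b(?:damn|hell)\b
"""

SPOKEN_DIGITS = {
    "zero": "0", "one": "1", "two": "2", "three": "3", "four": "4",
    "five": "5", "six": "6", "seven": "7", "eight": "8", "nine": "9",
}


def parse_rules(text: str) -> list:
    """Read (name, compiled regex) pairs, skipping blanks and # comments."""
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        name, pattern = line.split(":", 1)
        rules.append((name.strip(), re.compile(pattern.strip())))
    return rules


def normalize(transcript: str) -> str:
    """Lowercase, map spoken digits to characters, and join adjacent digits."""
    tokens = re.findall(r"[a-z0-9']+", transcript.lower())
    text = " ".join(SPOKEN_DIGITS.get(token, token) for token in tokens)
    # "1 2 3 4" becomes "1234" so that length-based patterns can match.
    return re.sub(r"(?<=\d) (?=\d)", "", text)


def detect(transcript: str, rules: list) -> list:
    """Return (rule name, matched text) for every rule hit in the transcript."""
    text = normalize(transcript)
    return [(name, m.group(0)) for name, rx in rules for m in rx.finditer(text)]
```

For example, `detect("my credit card number is one two three four five six seven eight one two three four five six seven eight", parse_rules(RULES_TEXT))` reports a single `credit_card` hit. A script-based rule, in the sense described in the talk, would replace the regular expression with arbitrary Python.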
So, one of the deployment architectures: like I said, we can put it on taps at the PBX system and then start dumping all the conversations. Basically anything that's going over the wire gets converted into a WAV file and dumped to the local hard disk, and then we can send it out for processing, review, and analysis. So that's one way of having it done, and you just have to deploy this at one location. The second option is having it as a distributed one. If you are trying to specifically target just one particular person in the organization, or just a few people, then you can deploy this agent on their box and have that agent report to the master, wherever you have it set up. It's all possible to set up via the config file: you just set the config to say this is the master, and then it reports to the master, which is on a remote box, and dumps all the conversations over there. The master first does the standardization using SoX, then does the transcription, then the detection logic, and then spits it out in a not-so-fancy portal at this point in time, but it will be fancy soon, hopefully. And the other option is that you've actually hacked into one of the boxes, or you have a node that you have compromised over there. Then you want to transfer this whole agent over there and have it report back to you. That was one of the reasons, like he said, we wanted to keep it very lightweight and small, so that we don't have to deal with heavy transfer of files both ways, up and down. We can even have the master run on their machine; it's very low on CPU usage. You run it over there, get the output, and then just send the parsed output to yourself. 
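The agent-to-master arrangement could be sketched as follows. Everything specific here is an assumption made for illustration: the `[agent]` section and its `master` key, the example URL, the `X-Call-Id` header, and the function names. The SoX flags used (`-r`, `-c`, `-b`, `-e signed-integer`) are standard SoX output-format options, but the 16 kHz target is a guess that depends on which acoustic model the transcriber uses.

```python
import configparser
import subprocess
import urllib.request

# Hypothetical config file contents; the talk only says the config names the master.
EXAMPLE_CONFIG = """
[agent]
master = http://master.example/upload
"""


def load_config(text: str) -> dict:
    """Parse the agent config and return the [agent] section as a dict."""
    parser = configparser.ConfigParser()
    parser.read_string(text)
    return dict(parser["agent"])


def sox_command(src: str, dst: str, rate: int = 16000) -> list:
    """Build a SoX command that rewrites src as mono 16-bit signed PCM at `rate`."""
    return ["sox", src, "-r", str(rate), "-c", "1", "-b", "16",
            "-e", "signed-integer", dst]


def standardize(src: str, dst: str, rate: int = 16000) -> None:
    """Run SoX (must be installed and on PATH) to standardize one recording."""
    subprocess.run(sox_command(src, dst, rate), check=True)


def build_upload(master_url: str, wav_bytes: bytes, call_id: str) -> urllib.request.Request:
    """Prepare (but do not send) an HTTP POST carrying one recording."""
    return urllib.request.Request(
        master_url,
        data=wav_bytes,
        method="POST",
        headers={"Content-Type": "audio/wav", "X-Call-Id": call_id},
    )
```

Sending the request is then a single `urllib.request.urlopen(request)` call. Keeping the agent down to "convert, then POST" matches the stated goal of a small footprint on the monitored box.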
So before we go into the demo, some of the future work that we have been thinking about is to disguise this whole thing as an app, maybe as a weather app or something, which says it needs internet access and needs the speech recognition software to put in the zip code or something like that, and basically listens in on every single phone conversation that you have on your cell phone and transfers just the key things that we need from there to our servers. Then the real-time obfuscators. The other thing that we were thinking about was this: right now the audio file is generated, dumped to the local disk, and then we do all the parsing. But we are thinking, if we can do this more in real time, then we can actually start beeping things out. Whenever you are giving out numbers or you're cursing or something, you just beep the whole thing out. Or a simpler thing would be just to drop those packets, or inject more packets in there, or filter out certain packets specific to that. Some of our references, and, yeah, wanna go? We'll do the demo for the other stuff first. So here's the first one, which is basically Wireshark. So here are the basic conversations that happen before the actual payload starts going out. We have exchanges like, okay, what's your phone number? What's my phone number? What's your name? All the SIP protocol data transfer, basically. Then these are all the actual RTP packets which are going between me and him when we are having this proof-of-concept call. And what you can do is, Wireshark does have this feature. So this is our conversation that we're having; this is the audio out. What type of conversation are you trying to pull here? Are you just trying to get a packet capture? 
So that's just the basic conversation that we were just pulling together, and Wireshark does this. So what we were thinking about first was to use tethereal to do the whole thing inline, using Wireshark to do the decoding for us. But you see over there, we have three conversations listed as voice calls, but it's just the central one which actually works. The other two are just false positives. And the other tool that we actually looked into was Endernex. So let me just bring that back here and... So basically what I'm doing right now is sending a few packets through, replaying a packet capture, and then looking at all the different voice conversations that are taking place. I can just show that one to you guys. So I did a tcpreplay of a pre-recorded packet capture and dumped that, and all the conversations which were recorded in that packet capture are what is displayed over there. And finally, do you want to start the other one? Oh, yes. So in order for our demo to work like we wanted it to, we basically pre-recorded the conversation that we had in the office and dumped it into one of the files that came out, and we thought we'd play the conversation, because we didn't want to bring two phones over here, set up the PBX thing, and have it all dump out and everything. So these are the two files that we have. Hello, my name is Brian. My credit card number is 1234-5678-1234-5678. My social security number is 012-345-6789. Hello? Hey, who's this? Second conversation. What the fuck is this? Why you just calling me fucking bastard? Kind of asshole just calls a guy and doesn't say anything. Frickin' dumbasses. What a bunch of dummies. I am not saying the words bitch or whore over the phone line. That is all I know. Fuck this thing. 
So we basically wanted to figure out how the curse words work in the audio transcription, and one of our coworkers was kind enough to volunteer to do that for us. The other thing that we wanted to show was part of the agent: we have to do the standardization of the audio file and everything. When Oreka gets installed, the default installation comes with the GSM format, so we have to convert that to PCM. So we do that. And this is what all our output logs basically look like. I can't see much, so I'm going to scroll across and show you guys what it is. It basically says what time the conversation happened, who's the originator, and who's the recipient. If it's internal, within the same organization, then it just shows the extension number, and if it is going outside the organization, then it shows the actual cell phone number or landline number that you're reaching out to. And what was the duration of the call, and the direction: was it you who called out, or was it an incoming call? Those are all the things we have in the log, and we basically parse those out and then do a POST to the master web server, and Don will show you that. Basically everything was done previously. So we set up a file that we're going to use to post this the way it would typically happen on your system, if you had it set up in one spot or the other. And I'll show you just the POST first, which is going to seem normal enough. And of course, yeah, it would fall off the screen, but it'll scroll, I promise. At least I thought it would. So all this is going to do is, we have those two sample files, we're going to post them up really quickly, and you'll see that's the first one going up, hopefully. And that's the second file being uploaded. 
And while that's being uploaded, the system on the other end is actually transcribing the data. It's running one of our rule sets that we've set up to get everything through there. Let me show you the end result as small as I possibly can make it. So I just needed something messy and quick. That's really what this is. Just quick reports that were generated based upon the two things that we uploaded. So you'll see that this is the ongoing call, the local IP, the remote IP, which evolved to sleep and kept in a packed format for the... And then you'll see what the number was, which are the actual extensions that were used. And then as you look down, you'll see the alert. So it pulled out, the credit card number is one, two, three, four, five, six, seven, eight, social security number. And then down here you have the transcription. So that is exactly what we said. That's enough for me to walk off with your credit card number and your social security number if I have it running on your system. The other one didn't work so well due to such things as language models and the actual language model that we're using isn't so good at dealing with profanity because typically when you're cursing, you're gonna have spaces around the words. Well, the way that our actual audio came out didn't. So as you can see, it detected absolutely nothing. And obviously there's not one curse word there. So it is not perfect on everything, but it does detect normal speech patterns. One of the things for future work is obviously to train it for different types of dialects. So it works somewhat and then in other times it fails miserably. And there are like different dialects and like so basically even for English language model we have like the British English model, the US English model and like some of the dictionaries that we looked into also they also have like for like different languages like Spanish or German and was it French too? Yeah, they have French as well. 
So they have some of the major language models already defined in dictionaries, and that's what we used. And just for ease of use, we also made a form, so if you did manage to get Wireshark sniffing on somebody's box, then you could just quickly upload the capture and have it transcribed, and it would come out the other end automatically for you. That's about it that we had, and if you guys have any questions for us, we'll be happy to take them. Sweet. Thank you. All right.