So, wow, this is like my sixth year doing this stuff. We got more toys to show off. All right, for who here is this the first time at one of my talks? Awesome, thanks for coming, guys. So, I don't know if anyone else does this, I sure do. I like good questions, I like insightful thoughts, I like people who mess with my head, and I reward such behavior with cool alcoholic beverages. So, if you've got an idea of something going on, think of it after the talk. There's a mic; come up, ask a question, and if it's good, good things will happen. For now, let's talk about some pattern recognition. So who am I? Some guy. What are we here to do today? Oh, there are talks where I talk about one thing and just go really in depth. This isn't that talk. We're gonna do like six or seven things and we're gonna go pretty quick, because there's just not enough time to do everything, but that's never stopped me before. So, we are going to show off some mechanisms that enforce network neutrality. The telcos are up to some damn creepy stuff and well, what can I say, I like messing with them. I have me a nice little box where I cannot send packets faster than this network can route them. So, I can send lots of traffic and I do on a regular basis. So I looked at some SSL servers. I also looked at online banking. This is so embarrassing that someone has to do this work, but oh well, I'll fix it too. I kind of put a bug in OpenSSH a few years ago. Let's fix it. New stuff with pattern recognition and just entropy and, of course, pretty, pretty pictures. Now, I have a habit of showing off really cool, pretty pictures that, well, aren't necessarily useful. This time around, we are going to do, wow, that needs a little bit more light but whatever, we are going to compare two files visually at a size of 16,000 pixels by 13,000 pixels, and the similarities will be completely visually identifiable. But first off, let's talk about network neutrality.
Telcos. Telcos have basically said, you know, we've been looking at this network traffic. That's some pretty cool stuff you're sending on our networks. We should charge you more for that. It gets a little creepy. When you actually filter it all down, it devolves down to an old thing that we have in the crypto world. Alice and Bob are in prison and are attempting to communicate without the warden interfering. How do they do that? Now, I've never actually looked at my ISP as my prison warden, but hey, it's an interesting thought experiment. Don't believe me? Check out this quote from Comcast. This is great: "To accommodate the needs of our customers who do choose to operate VPN, Comcast offers the Comcast @Home Professional product. @Home Pro is designed to meet the needs of the ever-growing population of small office," blah, blah, $95 a month if you wanna check your email from home. Now, who here's had a minimum wage job? Funny thing, guys: so you guys were making a couple bucks an hour. YouTube's losing a million bucks a month. You guys made more than YouTube. Network neutrality has nothing to do with YouTube. Those guys just shovel cash into a fiery hole. No, no, no, no, no. It's all about freaking VPNs. It's all about $1,100 a year per telecommuter. Yeah, I'm a geek, but you know what? The price of oil has tripled. Pretty soon we're gonna see a heck of a lot more telecommuting. They want their $45 billion a year and they're gonna break my network to do it. I don't like that. So what am I gonna do about it? TCP bandwidth estimation. You guys didn't actually think I was done with crazy little TCP hacks? Oh no, no, no, no. I just get to keep playing with this stuff forever. So, TCP bandwidth estimation. An elegant weapon for a more civilized age. TCP, the standard protocol for doing communication between points on the internet, is very flexible, very well tuned, very auto-tuning to the capabilities of a network.
It is able to automatically determine the amount of available bandwidth between any two points. Now if you have two different clients that are using the exact same network, they actually do manage to figure out how much traffic they can split. But these boxes do not talk directly to one another. The entire process of figuring out how much speed to put on the wire, how many packets, is done by just looking for dropped packets. Oh wow, a packet dropped. I better slow the hell down. That's pretty much how TCP works. Now what this means is that dropped packets are an information channel. They're a source of useful entropy on the net. Well, TCP traditionally figures out that packets have dropped. Can we figure out why? So here's what we're gonna do. TCP has told us that you can only send packets at 5K a second. And if you send them faster than 5K a second, something bad's gonna happen. You're gonna drop traffic. So what do you do? You send packets faster than 5K a second. You send it one hop. So you have your stream going. It's 5K a second. It's working. It's going. And you start putting on more traffic than the network has told you it can do. But you only let it go one hop in the network. What happens? Most likely nothing. It's only gone one hop. It's probably just on your local LAN. You let it go two hops. You let it go three hops. You let it go four. Oh. Now you have gotten a total of 10K a second onto a router that has basically been dropping packets to keep you down at five. Now you go ahead and get packets dropping. Now you're able to identify the specific hop, and thus the specific internet service provider, that has its limitations. Now you say, that's kind of a neat little trick. And of course it's not difficult. You can do this with hping. I didn't even need to write any code to do this. You like run hping with a couple of straight up obvious commands and it goes.
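The hop-by-hop probing described above can be sketched as a small decision loop. This is a minimal sketch of the logic only; it sends no packets. `measure_throughput` is a hypothetical callback (not part of hping or any real library) that is assumed to inject extra traffic expiring after `ttl` hops and to report the rate the already-established TCP stream achieves meanwhile. The 50% threshold and the `simulated_path` helper are likewise assumptions made for illustration.

```python
from typing import Callable, Optional


def find_limiting_hop(
    measure_throughput: Callable[[int], float],
    baseline_rate: float,
    max_hops: int = 30,
    drop_fraction: float = 0.5,
) -> Optional[int]:
    """Return the first TTL at which TTL-limited interference slows the baseline stream.

    measure_throughput(ttl) is a hypothetical callback: it is assumed to inject
    interference that dies after `ttl` hops and to return the rate the
    established TCP stream achieves while that interference is running.
    """
    for ttl in range(1, max_hops + 1):
        observed = measure_throughput(ttl)
        # Interference that expires before the bottleneck should leave the stream alone.
        if observed < baseline_rate * drop_fraction:
            return ttl
    return None


def simulated_path(bottleneck_hop: int, baseline_rate: float) -> Callable[[int], float]:
    """Toy stand-in for a network: the stream collapses once interference reaches the bottleneck."""
    def measure(ttl: int) -> float:
        return baseline_rate * 0.1 if ttl >= bottleneck_hop else baseline_rate
    return measure


if __name__ == "__main__":
    path = simulated_path(bottleneck_hop=4, baseline_rate=5000.0)
    print(find_limiting_hop(path, baseline_rate=5000.0))  # prints 4
```

In a real measurement the callback would be noisy, so averaging several samples per TTL before comparing against the threshold would likely be needed.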
As long as we're injecting interference, you know, we can control exactly what that interference is. So you say, huh, if I send more traffic, packets get dropped. What if Viacom sent that traffic? What if NBC sent that traffic? What if my workplace sent that traffic? You can change the source IP. They go, oh, but Dan, how will you ever see any replies? We're not looking for replies from your spoofed source IP. You don't care about the actual interference. You're just injecting interference. And the back channel is in the TCP session that's been going at 5K a second and suddenly dropped down to zero. So you go ahead. You drop it. You change the source IP. You can also go ahead and mess with the content. Say, I'm gonna put in some encrypted packets and see if this causes any particular deviation of behavior. Now you say, well, why would an ISP ever do anything to affect encrypted traffic? Well, they did it, right? I mean, Comcast a couple of years back actually did go ahead and start blocking VPNs in Washington State. Washington needed to go ahead and actually buy DSL connections for some ungodly number of users because the cable modems stopped working. Great. We know that they wanna make a lot of money doing this. We know whenever you put firewalls in front of a lot of people, for some strange reason they start running tunnels. Possibly over DNS. But you see now, that's theft of service. And profit capture. Let's look at who actually legitimately uses encryption, from their perspective. A, workplaces who have people who are checking their email from home. And B, e-commerce sites that make money from consumers at home. In both of these scenarios there is a direct profit motive in keeping the encrypted link alive. The telcos really can take a look at encryption as, hey, if we shut that down we can go ahead and get people to pay a lot of money to bring it back. And I really do not wanna see that happen.
As security professionals, hackers, guys, it's a pain in the ass getting people to deploy things securely. I don't need to hear, yeah, but the telco's gonna block us. So, let's mess with them. What's up, Toby? Ah, so what Toby was asking me is if I'm expecting to see silent dropping versus getting an alert. The problem with an out and out ban is that it's really obvious that your cable modem is broken. What they would rather do is they would rather make the traffic really bad. That way instead of blaming the cable modem, you blame your provider. And you blame your company. And you say, you know, my company should be spending more on giving me a high speed link. With more being defined as giving my telco money. Telcos like things that give them money. So what I was asked is if I'd seen any redirection. I don't really have time to tell this story, but I'm totally gonna do it anyway. Check this out. So, the porn world has weird issues like you would not believe. It's almost like there's not much legal enforcement in that land. So, there's this spyware that the porn guys have that hits them. And it's got a EULA. Who here loves EULAs? Hey, oh, they're just wonderful things. Yeah, so the EULA actually said, somehow this spyware, how do you love? And the EULA said, we reserve the right to deem certain sites as unsafe, and to silently redirect you to sites we consider safe. So what's happening is that people would spend 30 bucks a month on a porn site. It would move their money to a different site. Then they just silently go, oh, you're trying to go to that porn site? Well, here's porn. And there's your money. And it just kind of went somewhere else. So that's the most I've seen in terms of a redirection. Speaking of secure solutions, let's talk a little bit of SSL for a moment. SSL, probably the standard way to do secure communication on networks. It's got a couple of basic rules for deployment. Don't put anything secret into an SSL cert.
It's called a public key for a reason. You'll see in a future slide just how embarrassing this can be. And for the love of God, do not put the same key on two different boxes, because if you do, in SSL, each box can read each other box's traffic. So I went ahead and scanned the world for SSL servers. Found lots and lots and lots of certs. What kind of stuff did I find? Okay, if you are the sort of site, and you know who you are, that does not want people knowing all your internal DNS names, be very careful what SSL certs you let the public scan for. And quite specifically, if you have a honeypot, and the purpose of this honeypot is to trick people into attacking, you might not want to call it honeypot.yourcompany.com. And if you do, you might not want to get an SSL cert for it. 'Cause you go to the site, it's like, hi, I'm the honeypot. Puts the pot in honeypot. So what did I actually see when I did the scan? Good times: 90% of keys are actually deployed correctly; they were only seen on one box. About 10% of keys were seen on multiple boxes. A lot, a hell of a lot. Only about one in three SSL servers I saw actually had a unique key because of this. You know, wouldn't it be depressing if there were a couple of companies that made an SSL cert, put it in the device, shipped it to 10,000 people? That'd be just horribly depressing. Yes, everywhere. Wouldn't it be even more depressing if a group that really should know better, and has people in this room, said, you know, we need to monitor SSL on our network. We better go ahead and take all 61,586 servers and give them the same SSL cert. Yeah, any of those boxes can read the traffic from any other of those boxes. I'm sure that's going to be real secure. So, but there is actually a bigger SSL flaw that we have to deal with in everyday behavior, and that is the world's most depressing Google search. If you search for the phrase, "why is this secure?"
You will get a lot of online banks that are very happy to say they take security seriously. They're very worried about phishing. They want to protect your money. Yeah, so they give you this form over HTTP, which is totally hackable. That's real nice. So yeah, you know, you've got these forms here, and then if you actually submit, you go to login, it submits to HTTPS, unless it's a bad guy. You know, there would never be a bad guy. It's almost like this whole form's existence is to protect against bad guys. But you know, it's okay because they have pictures of locks. Did you know if you put a picture of a lock on your site it is 83.2% more secure? I read it somewhere. 26% of the top 50 banks do this. This isn't like a little tiny problem. This is like multi-billion dollar sites completely ignoring the threat model of SSL. Now why would they do this? You know, it's actually interesting if you go ahead and you think about what's going on. We basically got three groups, right? You got the perf guys who are like, hell no, you are not putting our website on SSL entirely. Our site will melt down. The only online bank that actually has this set up is Wells Fargo. You go there, actually it's not the only one. They're the most prominent one. There are about five of the top 50. You go there, you move to SSL. It's just done. So the perf guys go, hell no. The UI guys say, if we don't show someone a username and login form on every single page that they see when they come to our site, they might not use online banking. And if they don't use online banking, they're gonna call us on the phone or come in person. And users are fucking annoying. I never wanna see them. I never wanna hear from them. Do you know how dumb they are? I don't care if it's insecure. I don't wanna speak to that bastard again. So yeah, what ends up happening is you end up with, "why is this secure?" I mean, look at this. Social security number or tax ID. Could there be a bigger "hack me" button?
The question: is it possible for you to actually get all three groups, for you to get perf and UI and security, to actually be happy? All right, check this out, guys. So, web pages aren't static. We got JavaScript. We got a little bit of an execution environment. What you do is when the user goes to start typing in their username, you go, oh crap, things are happening. I bet they're putting in their username now. I think they're gonna log in. Shocking, I know. And so what you do is you have an iframe open up. An iframe being a little tiny window that is a window into another page, and you have it open up the page in SSL. So while you're typing the username, a process that takes a couple of seconds, in the background the SSL page is opening up. You go to type in your password and it goes, oh, hang on, this would be horribly insecure if you gave me the password here. Let me move you over somewhere where it'll be safe. The screen flashes, the toolbar changes color. Everything looks good and you go, oh, but the UI guys will never accept this. You know how you make the UI guys accept it? Okay, you see all these stupid little locks? You turn them into animated GIFs and you let the lock close. Welcome to security in the weird world. So the code is simple, this is how you do it. Bug me later and I'll show you the demo site. And it ends up just looking like: username, you type in the username, the iframe shows up. Yahoo, come on, Yahoo should know better. And you get the redirect. Now as long as we're talking about bugs in crypto systems, I did one. This is a slide from my first talk here ever. Was anyone here actually at my SSH talk like five years ago? Fuck yeah, holy crap. Yeah, so that dynamic forwarding stunt I did had a bit of a problem. Well, you know how later on I started researching DNS? Well, I kinda looked at this stuff through the DNS lens and I'm like, ooh, you VPN over there but your DNS is from here.
So I SSH'd somewhere and I'm using dynamic forwarding from a malicious network, DEF CON. And yeah, all my connections may be going out over this encrypted link, but they're going to places that some jackass here is controlling, and that's not good. Now, I got this partially addressed a little while ago. I went ahead and got what they call SOCKS5 support put into OpenSSH so that applications could go ahead and say, hey, I want you, SSH, to handle my traffic, and here's the DNS name, and you do the lookup because this network's fucked. But you know what, there's a couple of clients out there that no matter how much I try, Internet Explorer, are just not gonna go ahead and support SOCKS5. Can I still bring them on board into the 21st century? Okay, guys, this is gonna be horrible. Why can't SSH move DNS? Well, DNS is a UDP protocol and SSH doesn't do the whole UDP thing, that's just not its gig. It's really a TCP-only thing. TCP being a nice consistent stream of data, UDP being individual messages. Now, I could put a huge translation layer into SSH that actually spoke all sorts of DNS and did all this work and moved it back and forth and had a custom server, and we'd put in thousands and thousands of lines of code and Theo de Raadt will kill me. Or, so it turns out with DNS, if you just tell it, yeah, I know how you gave me that request over UDP, but my response is just way too big for UDP. Why don't you retry that in TCP? DNS will go, yeah, there's a truncation bit. I better retry that in TCP. Works like a freaking charm, it's great. This is totally the way to do that. But it does lead to a problem. It means I'm going to hell. See, I had DNS over SSH working. I got SSH over DNS working. Which means DNS, SSH, DNS. Yes, guys, I finally did write DNS over DNS. Malkovich, Malkovich, Malkovich, Malkovich, Malkovich. See what I go through for you guys. Another bug with SSH. SSH, Danup Law, the authenticity of 1.2.3.4 can't be established, RSA key fingerprint is 09A9B1998.
Am I supposed to do something with that? Oh yeah, crypto people will look you very solemnly in the eye and go, yes, you're supposed to recognize that key. You should see that and go, oh yeah, yeah, that's totally blah, I see that all the time, no problem. No. So there's a group out there called ADM that basically has a great attack. It's called: generate 2 billion SSH keys, see the one that looks the closest, and host that one. Works like a charm. It's almost like the human brain wasn't built to look at strings of hex. I don't know. So, a kind of new field that I'm spawning, naming: it's called cryptomnemonics. It's basically, how the hell do we patch crypto into the human brain? I like saying that. So there are three classes, oh wait, guys, who here is still in college? Who here is still in college or taking comp sci classes? Okay guys, take some psych classes too. Just trust me. Psych classes are awesome and they will mess with other people's heads, which is even more awesome. Yes, it's computer engineering for girls. Hey, you know, we hack computers, they hack us. I can neither confirm nor deny that last comment, but please come up here and grab a beer. All right, so check this out, crew. We got three useful things we can pull out of the human brain for crypto use. We've got rejection: what the hell is that? I have never seen that before in my life. We got recognition: it's that one. And we got recollection: here, I'm gonna tell you exactly what it is. That last one, recollection, is all about passwords, you know, recollect your password. Needless to say, people's memory for passwords sucks ass. People's ability to recognize stuff, to pick it out of a lineup, is a bit more. And your ability to say "what the hell is that" is huge. It turns out the less you ask the human brain to do, the better it does. That's all right, okay, we can use that, we can use that. So, since hex is a clear and utter failure, what else can we use?
Well, there have been some other attempts. They go ahead and, instead of doing recollection for passwords, they try to do recognition, where you recognize something vaguely defined as artwork, or you recognize human faces that are automatically generated. Now if you actually look at this Passfaces system, where you recognize the face that you look at, you see we got five attempts, you gotta pick one of nine. Nine to the fifth is 59,000. Wow, that's not even 16 bits of entropy in that password, or passface. And so what Toby just said is, humans have the documented tendency to pick the most beautiful face in a series, which is why Passfaces doesn't let you choose which face to authenticate against. It just says, you're gonna recognize this one, this one, this one, this one, and this one. Precisely to get around that bug. But the bottom line is Passfaces can only get about 16 bits of entropy. Now this isn't a problem. Interesting point though, by the way, grab a beer if you want. So Passfaces can only do so much. That works for online, right? Like if Passfaces isn't noticing a brute force attempt that's going through 59,000 tries, they're not doing something right. But we don't have online when it comes to SSH keys. You're just handed something and told, here's your hash, you recognize it. And for that we're gonna need something else. And speaking of going to hell, I'm probably going to hell for this one too. So, check this out, crew. Bet you didn't think I could make a DNS reference, but you're gonna be totally wrong. So, humans do not remember arbitrary strings of characters effectively, that's just not how we work. What we remember are stories. Homer's epics are enormous. But you know, there's a funny little thing about stories, I don't know if you noticed: they kind of change over time. You know, things get embellished, things get changed, things get dropped. But the only consistent thing in storytelling is names.
And when it comes to names, those things stay stable. Homer's been Homer, and Homer for a long time. Simpson. So, we seem to have hardware acceleration for names. So, what if we went ahead and took that huge string of hex and turned it into a series of names? So, well, we take US Census data for names, available at census.gov. We know there are more unique female names than male names. We know that there are way more last names than either. We take 512 male names, that's nine bits. 1,024 female names, that's 10 bits. 8,192 last names, that's 13 bits. We try to get names that are as different as possible. We use something called the Levenshtein algorithm to come up with an edit distance: the minimum number of changes to go ahead and turn the first thing into the second. I'm mentioning this in detail because we're gonna abuse the hell out of it later. So, what we wanna do is we wanna make sure that two names don't look similar. And we can do that, because there's a fair amount of entropy around names. And we split our 160-bit big mess of hex into five married couples, 32 bits each. I gotta say, it's kinda weird, but Julio and Epiphany at the Zootie is a hell of a lot more readable than A1321651. The crazy thing. Okay, so this is really interesting. I actually got to tell Phil Zimmermann about this and I got to get his jaw to drop, which pretty much made my year. And Phil Zimmermann's like, yeah, we hired this linguist and he went ahead and he studied and did all these genetic algorithms that determine words that wouldn't be confused for one another. You know what else we're not optimized to handle? Random frickin' words. We can handle arbitrary names. We're damn good at that. Random words, not so much. It's almost like we suck at learning new languages. Three years of French, remember nothing. It is, first of all, shocking that this actually works. That's kind of a surprising thing, but it does.
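A minimal sketch of the name encoding described above, assuming a 160-bit fingerprint split into five 32-bit words of 9 + 10 + 13 bits. The name lists here are generated placeholders, not census data, and the bit order (male index in the high bits), the greedy `pick_distinct` filter, and its `min_distance` threshold are all assumptions made for illustration; the talk does not specify them.

```python
from typing import List

MALE_BITS, FEMALE_BITS, LAST_BITS = 9, 10, 13  # 512, 1,024 and 8,192 names

# Placeholder lists (assumption): a real tool would load filtered census names.
MALE = [f"M{i:03d}" for i in range(1 << MALE_BITS)]
FEMALE = [f"F{i:04d}" for i in range(1 << FEMALE_BITS)]
LAST = [f"L{i:04d}" for i in range(1 << LAST_BITS)]


def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character edits turning a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


def pick_distinct(candidates: List[str], count: int, min_distance: int = 3) -> List[str]:
    """Greedily keep names that sit at least min_distance edits from every kept name."""
    chosen: List[str] = []
    for name in candidates:
        if all(levenshtein(name, other) >= min_distance for other in chosen):
            chosen.append(name)
            if len(chosen) == count:
                break
    return chosen


def fingerprint_to_couples(digest: bytes) -> List[str]:
    """Render a 20-byte digest as five 'male and female lastname' couples."""
    if len(digest) != 20:
        raise ValueError("expected a 160-bit (20-byte) digest")
    couples = []
    for k in range(0, 20, 4):
        word = int.from_bytes(digest[k:k + 4], "big")
        male = word >> (FEMALE_BITS + LAST_BITS)
        female = (word >> LAST_BITS) & ((1 << FEMALE_BITS) - 1)
        last = word & ((1 << LAST_BITS) - 1)
        couples.append(f"{MALE[male]} and {FEMALE[female]} {LAST[last]}")
    return couples
```

With real name lists, `pick_distinct` would be run once per list to build the 512, 1,024 and 8,192 entries before any fingerprint is encoded.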
It is critical you actually show the user the key every time they log in. It can't just be like, oh hey, I guess you should probably learn to authenticate this guy. Hey, you recognize this? Like, I've never seen it before. Sure. No, you actually need to show them for every legitimate use. Then when there's a challenge, the memories have been created. So, speaking of broken representations of entropy: you're looking at a file, new file, never seen it before, and you're like, oh, let's look at a hex dump of it. And what do we see? 6A, AC, 06, 2D, again. Am I actually supposed to do something with this? Now what od tries to do, octal dump, is octal dump will actually go ahead and say, oh, here's a whole bunch of zeros. I'm just gonna shrink all these down. Well, gee, thanks, I appreciate that, but still, 6A, AC, 06, what the hell? People go, ah yes, for our hex dumper, we'll add ASCII. Okay, that's still not very helpful. We need a little better. Now why do we need better? Well, first of all, because if you're like me or, you know, Moskis, you like looking at random protocols and seeing what the heck they do. And second, fuzzing. Anyone notice fuzzing is breaking the crap out of everything? So I've been playing around with fuzzing stuff. And, you know, fuzzing, you get files. The files are just bits. I gotta figure out some way to find structure in it. How am I gonna do that? So we got two different ways of doing fuzzing right now. A, dumb fuzzing. I don't know what I'm doing, we're just gonna like flip a bunch of bits and see what happens, and that can be helpful. In fact, that destroys the crap out of a whole bunch of stuff if it's not specifically designed. It's a little depressing. There's also smart fuzzing, where you have total knowledge of what the system is and you just do this full analysis and then you fuzz in the structure and you go back and forth. It's a huge amount of development and you can do it. But, you know, you don't get to be a lazy bastard.
I like being a lazy bastard. I don't wanna have to understand all the things I break. That's hard. Can't a computer do it for me? Isn't that kind of the point of our entire industry? Can't someone else do it? Can't a computer... So, can we increase the intelligence of dumb fuzzing? Well, the first thing we're gonna have to do for that is get me beer. Because nothing increases intelligence like beer. Yes, things get very fuzzed. Things get very fuzzy. So, there's a freaking cool tool. Check this out, guys. It's called Sequitur, right? Sequitur is a linear time pattern finder. You hand it big old piles of crap and it doesn't care how much you throw into it. It just goes ahead and creates sequences and patterns out of it. Actually, the technical term is hierarchical context-free grammars. Now what the heck is that? Here's a switch statement, right? Switch C, case one, value two. Case two, value three. Case three, value four. You know, a little bit of redundancy in there, right? You would pass that to Sequitur and Sequitur would say, oh, okay, we'll represent that as switch C, A1, B2, A2, B3, A3, B4. So, it's actually a compression algorithm, but with a very nice trait: you can actually look under the covers and see how it's doing its work. That's kinda cool. This is a smart guy. He should end up somewhere like Google. Okay, he's the chief research scientist. His thesis is awesome. I suggest you check it out. So, here's what we're gonna do, right? Oh, wow, that's totally not visible at all. Oh, well, you'll have to find me later to see the actual live version of this. But check this out, guys. Sequitur goes ahead and creates a tree that you have to traverse to get back each byte because, you know, it's like, for example, for everything in "case", you have to go down one level in the grammar to reconstruct the word "case".
So, what we're gonna do is we're gonna take each byte and we are gonna render each byte by how deep in the tree we had to go to recover that particular byte. With this completely idiotically stupid little design, what you end up getting, and you totally can't see it on the projectors, unfortunately, is there any way someone can kill those lights up there or any other light on this? Probably not. But what you end up seeing is you end up seeing this. Oh, yeah, there we go. Sweet. All right, check this out, guys. Look, instead of this just being a random string of bytes with no idea what's going on, what's this huge blob right here? Why are these basic offset white lines right here? We ended up getting full-on visual identification of segments and sections. I have no, okay, actually this is the Windows NT kernel. And it's like, oh look, okay, repeated structure, repeated offset, it's just all visually aligned with absolutely no knowledge. So we have significant improvement on analysis. Ah, now we start up a new tool. We can bring lights back up now. I'll tell you later when we can shut them back. Oh, we can keep them on or off, it doesn't matter. So, you go ahead and you look in this guy's paper. He shows off that one of the things that make his stuff cool is, now that you go ahead and you have this symbol-level view of our data, we can go ahead and we can generate graphs out of it. The symbol switch leads to A1, B2, A2, B3. Switch leads to case, case leads to one or two or three or four, which always goes to value. This batching up is actually kind of meaningful. Now you go, but Dan, I can read this, I can see C, I can manage to handle it. Yeah, that's because C is designed for the human brain. This crap works on anything. So you can just take arbitrary stuff and it will find structure that you can run through this process. It will create symbols that you can manipulate and link together. And anytime we have a symbolic-level view, we can fuzz at that view.
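The depth-per-byte rendering described above can be sketched without a full Sequitur implementation. The sketch below swaps in a much simpler greedy pair-replacement scheme (in the spirit of Re-Pair, and not linear time) purely to obtain a hierarchical grammar; it is not Sequitur and not the speaker's tool. The function names and the convention that symbols 256 and up denote rules are assumptions for illustration.

```python
from collections import Counter
from typing import Dict, List, Tuple

Rules = Dict[int, Tuple[int, int]]


def infer_grammar(data: bytes) -> Tuple[List[int], Rules]:
    """Greedy pair replacement: symbols below 256 are bytes, 256 and up are rules."""
    seq = list(data)
    rules: Rules = {}
    next_symbol = 256
    while True:
        pairs = Counter(zip(seq, seq[1:]))
        if not pairs:
            break
        pair, count = pairs.most_common(1)[0]
        if count < 2:
            break  # nothing repeats any more
        out, i = [], 0
        while i < len(seq):
            if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
                out.append(next_symbol)  # replace the pair with the new rule symbol
                i += 2
            else:
                out.append(seq[i])
                i += 1
        rules[next_symbol] = pair
        seq = out
        next_symbol += 1
    return seq, rules


def byte_depths(seq: List[int], rules: Rules) -> List[int]:
    """For each original byte, how many rule levels were descended to recover it."""
    depths: List[int] = []
    stack = [(symbol, 0) for symbol in reversed(seq)]
    while stack:
        symbol, depth = stack.pop()
        if symbol < 256:
            depths.append(depth)
        else:
            left, right = rules[symbol]
            stack.append((right, depth + 1))
            stack.append((left, depth + 1))
    return depths


def expand(seq: List[int], rules: Rules) -> bytes:
    """Rebuild the original bytes from the top-level sequence and the rules."""
    out = bytearray()
    stack = list(reversed(seq))
    while stack:
        symbol = stack.pop()
        if symbol < 256:
            out.append(symbol)
        else:
            left, right = rules[symbol]
            stack.extend((right, left))
    return bytes(out)
```

Mapping each entry of `byte_depths` to a pixel brightness and wrapping the result at a fixed row width gives a picture of the kind described: heavily repeated regions come out deep, unique regions stay at depth zero.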
So I'm creating a new tool called the CFG9000: context-free grammar, 9,000 ways to fuzz. We reduce our input stream data to a stream of symbols. We fuzz the data at the symbol level rather than pure bytes. We do shuffling, we do dropping, we do repeating. Now, Sequitur is not necessarily the best algorithm to go ahead and generate our sequence of symbols. We got a kid. Wait, is this the newest spot the fed? I mean, are we starting a little young here? What's up? Wait, is this a lost and found? 'Cause that's gonna be really depressing. It wouldn't be a Dan Kaminsky talk if I didn't have to come in here and lecture you again on keeping the aisles clear. If you're in the aisles right now and sitting down, you need to get up and either find a seat, move to an area where you're not in an aisle, or you need to leave. Okay, there's some space right up here in front of the stage. Some of you can come up here. Everyone from over here, guys, come on in. Actually, all you over here, why don't you shift over to this side, 'cause there's a huge amount of free space and we can let all the people out from there. Exactly. Like along this wall over here, there's a lot of free space if you all can open it. And note, no exit, very visible to get us kicked the hell out. Yeah, please. Sorry to interrupt again. It wouldn't be a Dan Kaminsky talk if I didn't do a spot the fed. I had to interrupt to get a lecture. All right, it's all cool, man. But it's my son, by the way. Oh, very cool. Can I give him a beer? But it's stuff gone. All right, guys, we don't have a huge amount of time but we still have lots of toys to play with. So, what does CFG's output look like? This is just very, very basic fuzzing that I'm doing on the symbol layer. I'm not even fuzzing like the internal graph of how things all align, but still. So I went ahead and I threw Sequitur through its own, the actual code for Sequitur through the fuzzer.
What you end up seeing is, wow, this thing straight up identified calculate rule usage, calculate rule usage, calculate rule usage. This thing has identified a basic, what's it called? Stack depth attack. They might say, yeah, but syntactically, you know, it's missing the closing braces, but hey, with literally no work, this algorithm figured out interesting sections to repeat. So it's kind of cool. And yes, things blow up. Things to do: create recorder, which is a Sequitur implementation optimized for fuzzing use. Need to generate larger symbols. Need to eliminate redundant symbols using work by guys named Kieffer and Yang. Yeah, if you have two symbols that reduce down to the same message, how about we merge them? Sequitur is linear time. It, however, takes absolute crap tons of memory. So you can get about 150 megs through before your memory usage is about 900 megs, which is okay, but, you know, hey, what can I say? I wanna take the entire contents of my hard drive and run it through Sequitur. I'm gonna need something out of memory. And add the ability to compare files against foreign grammars. Now, Sequitur's really cool, but it's not where we need it yet. But there's another approach that I remembered. And it was called dot plots. There's this paper called "Visualizing Music and Audio Using Self-Similarity." It's a very brute force solution. You compare a song against itself, you know, little tiny chunks, and you see, you know, does this chunk sound like that chunk? There's a tool called Disassociated Audio that'll make graphs like this for you. It's actually a Beatles track. Actually, why don't we drop the lights again? So this went ahead and identified individual sections of this track, and they're very visually distinct. Now I said, could we do this possibly with something that we might wanna fuzz? Like the pirate baby MPEG at the beginning of this video.
You go ahead and run this exact process on something on your hard drive: you take it, you split it into chunks, and you do analysis. I'll do a deeper, in-depth description in a sec. You get really distinct and useful visual patterns. What the hell is going on here? Let me tell you.
Jonathan Helfman wrote this thing called Dotplot Patterns: A Literal Look at Pattern Languages. And if you look at this example from Shakespeare, "to be or not to be," we get this big diagonal line of similarity. "To" equals "to," "be" equals "be," "or" equals "or," et cetera. We also get a second diagonal line from the repeat of this "to be" and that "to be." So instead of having individual words with a single bit, whether it's the same word or a different word, we use a similarity metric. Well, do we have a similarity metric? Oh yeah, we had one when we were identifying how names were different. Well, instead of doing names, we're going to do chunks of individual files, 32-byte chunks. We're going to arrange them in the X dimension. We're going to arrange them in the Y dimension, and we're going to do a self-similar comparison.
So what does this end up looking like? Java class files, .NET assemblies, CNN's homepage. Of course, HTML is far more self-referential, with far smaller tags that repeat, and thus the density is much higher. Packets, SMB torture packets. Kernel32.dll, human chromosome 22. The Legend of freaking Zelda. I don't know crap about 6502, but I can take a look.
So why is this useful? Imagine I had a big old packet trace and I could not separate the file into individual packets. How would I go ahead and identify what was going on? Well, what this is doing is at least saying, with no prior knowledge, this is a different section than this, this is a different section than this, and I can fuzz them one at a time. Can I potentially do more? Well, what are my actual messages here?
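A minimal sketch of the dot plot construction just described, covering both the word-level Shakespeare example and the 32-byte file chunks. The talk does not spell out the similarity metric here, so `byte_similarity` (the fraction of positions holding the same byte) is an assumption, and all function names are hypothetical.

```python
def byte_similarity(a: bytes, b: bytes) -> float:
    """Fraction of positions where both chunks hold the same byte (assumed metric)."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    same = sum(x == y for x, y in zip(a, b))
    return same / longest


def chunks(data: bytes, size: int = 32) -> list[bytes]:
    """Split data into consecutive fixed-size chunks; the last one may be short."""
    return [data[i:i + size] for i in range(0, len(data), size)]


def dotplot(xs, ys, metric):
    """Build matrix[row][col] = metric(ys[row], xs[col])."""
    return [[metric(y, x) for x in xs] for y in ys]


def render(matrix, threshold=0.5):
    """Crude text rendering: '#' where similarity meets the threshold."""
    return "\n".join(
        "".join("#" if value >= threshold else "." for value in row)
        for row in matrix
    )


if __name__ == "__main__":
    # Word-level plot with a single bit per cell: same word or not.
    words = "to be or not to be".split()
    same_word = lambda a, b: 1.0 if a == b else 0.0
    print(render(dotplot(words, words, same_word)))
    print()

    # Chunk-level self-similarity plot of some bytes (here, this script itself).
    with open(__file__, "rb") as handle:
        pieces = chunks(handle.read(), 32)
    print(render(dotplot(pieces, pieces, byte_similarity), threshold=0.3))
```

The cost is quadratic in the number of chunks, so a real tool for multi-megabyte files would need a faster implementation and an image output rather than text. Swapping in a different `metric` changes what counts as "similar" without touching the rest.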
Well, from the paper, we do actually see that a diagonal is a full-on repeat, and squares are sections that are self-referential, where you change from one section to another. You kind of get an idea. There is actually a visual language going on here, but there is a lot of research to do because there is some weird-ass stuff in files. Like, I mean, what the hell is that? I mean, what? But you know, this is good, right? It's nice to have a tool that's showing you more than you can easily identify yourself, because, you know, hey, it was just a hex dump to me before, and random strings of bytes were just random strings of bytes. That's just how it was. This is showing me something different.
Now, there is more research to do. Figure out what the hell these things mean. Create some interactive tools for dot plot evaluation. Do a little, you know, data microscopy. Color: with different similarity metrics, we can have different colors. Better symbol selection. I am being an idiot when it comes to this stuff. I don't have a clue what's going on at the x86 layer. My similarity metric is raw bytes. I could have so much better stuff if I had a clue. But you know what? Right now I've got pretty pictures and I'll take that too.
You don't think I'm done? No, no, no, no. I've got one thing left. If autocorrelation is interesting, if it's cool to compare A to A, what if we didn't compare a file against itself? What if we compared it against a different file? Now, if it's a totally different file, you get this thing. It's called ass, and it looks like it. But what if you compare a file to an older version of itself? Well, normally when it's the same, you get this hard diagonal line, because of course a file is the same as itself: the first 32 bytes are the same as the first 32 bytes. The second 32 bytes are the same as the second 32 bytes. And it works, whatever. But what if it's two different versions? Well, let's ditch the slides for a second because I like megapixels.
Nothing says hacking like 16,000 by 13,000 pixels. Awesome. I can neither confirm nor deny. All right, let's check this out. We're gonna open a little image here. It's gonna take a little while because it's a couple hundred megapixels. This should not actually be useful. So yeah, I re-implemented all my code in C just because it might be a little faster. I got an 80x speed improvement, hell yeah. So I'm like, well, I'm going a lot faster here. I should go ahead and start trying to... come on, you can load any minute now. Oh, okay.
Let's go ahead and, just as a lark, see what happens if we allow two different files to compare against each other. Oh, well, check this out. Instead of having this hard diagonal line, we've got this little meandering thing. Oh look, new data over here, new data over here, back to normal, new data over here. Big, huge insertion of new code. We should probably look at this. Guys, it's a visual bindiff. I didn't think this would actually be useful, but you can just visually trace it out. Okay, you've got one third of your brain dedicated to visual processing. Believe you me, I'm willing to use it. So yeah, you can basically just go through files, find the differences, and figure out what is going on.
In summary, your VPN's under threat. If you've got one, you might want to tell your boss that the telcos are evil bastards. SSL's got some issues, especially online banking, and especially those of you who are putting the same cert on 60,000 boxes: please stop. There's some fun DNS games you can play. Oh, please, for the love of God, if you're writing code, fuzz your file formats. This is ridiculous; it should not be this easy to knock your crap over. And take a look at your data; you might be surprised at what you find. Now, I've talked too long. Get the lights on. I got some beers to give away.
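The visual diff of two file versions can be approximated without drawing anything: for each chunk of the new file, find the best-matching chunk of the old file and note the rows where nothing matches. This is only a sketch under assumptions. The metric (fraction of identical byte positions), the 0.9 threshold and the name `trace_matches` are all hypothetical, and it is not the C implementation mentioned in the talk.

```python
def chunks(data: bytes, size: int = 32) -> list[bytes]:
    """Split data into consecutive fixed-size chunks; the last one may be short."""
    return [data[i:i + size] for i in range(0, len(data), size)]


def byte_similarity(a: bytes, b: bytes) -> float:
    """Fraction of positions where both chunks hold the same byte (assumed metric)."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return sum(x == y for x, y in zip(a, b)) / longest


def trace_matches(old: bytes, new: bytes, size: int = 32, threshold: float = 0.9):
    """For each chunk of `new`, report the best-matching chunk index in `old`.

    Returns a list of (new_index, old_index_or_None). A run of None entries
    marks data with no counterpart in the old file; a jump in old_index marks
    the place where the bright line in the picture would step sideways.
    """
    old_chunks = chunks(old, size)
    trace = []
    for row, piece in enumerate(chunks(new, size)):
        best_col, best_score = None, 0.0
        for col, candidate in enumerate(old_chunks):
            score = byte_similarity(piece, candidate)
            if score > best_score:
                best_col, best_score = col, score
        trace.append((row, best_col if best_score >= threshold else None))
    return trace


if __name__ == "__main__":
    a, b, c = bytes([1]) * 32, bytes([2]) * 32, bytes([3]) * 32
    inserted = bytes([9]) * 32
    # The new version has one extra chunk between the first and second.
    for row, col in trace_matches(a + b + c, a + inserted + b + c):
        print(row, "new data" if col is None else f"matches old chunk {col}")
```

One limitation of this sketch is that it compares chunks at fixed 32-byte boundaries, so an insertion whose length is not a multiple of the chunk size shifts everything after it out of alignment and the position-by-position metric stops matching. A picture with a softer similarity metric can still show the meandering line in that case, which is part of why looking at the full image is useful.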