So, my name is Sergey Bratus, I work for Dartmouth College, and it's actually a lucky coincidence that I get to speak right after this ARP talk. Because, if you were here for the last talk, it was about a covert channel in ARP, or at least the part of it that I caught was. Well, in the tool that I'm going to demo today, such a covert channel would really, really stick out. I hope to explain why, and I hope to give some math to prop it up.

So, this is called Entropy-Based Data Organization for Browsing Logs and Packet Captures. And I've got this subtitle because every year I learn something new at Defcon. "FTW" is what I learned this year. So: for the win. What I'm going to do is discuss the state of actual log browsing that you get with free tools, and then see how we can improve it.

So, what is this all about? Excuse me, I'll sit so that I can operate the demo. We want to design a better interface for browsing logs and packet captures, a slightly smarter interface than what you get with Ethereal or Wireshark, grep, and so on. In particular, I would like to see anomalies, or whatever passes for interesting data in my dataset, first. And then I want to be made aware of correlations and of where they break; there are many data fields in a typical log record, and many of them are correlated. How am I proposing to do that? By designing the interface around decision trees, which is how people really think about classification in their minds, and around basic statistics for frequency distributions and correlation. And for that, I propose entropy and friends.

How did this all start? Well, my wife ran a Tor node at Dartmouth (thanks, Roger), and she kept getting frantic messages from the sysadmin: your machine is compromised, shut everything down, there is IRC traffic. And of course we all know that IRC is evil. But that started us thinking. Tor has a really interesting traffic mix: non-standard ports, encrypted traffic, all sorts of stuff. If we wanted to check that nothing but Tor, or most likely Tor, was leaving our machine or coming in, how would we do that? We'd classify things by session, cutting away until we're left with everything that's not Tor. And Ethereal isn't really that much help there, because those filters can get really, really long. I know people write really long tcpdump filters as well. So how many page-long filters can you juggle? What you want to do is really classification: sort the data out. This goes here, this goes there, this is no threat, this is well known, okay, this might be interesting. So the tool, I posited, should be built around classification as such.

And here we go. Before we get to the tool, here's a disclaimer. These are really simple tricks, so do not expect a survey of the research literature. I did put some research paper references on the last slides, if you're interested in how this is done in the IDSes and IPSes that try to use entropy. As far as I know, all of those are test systems, research systems; I'm not aware of any production system that uses this, but these things are promising, and who knows. I plan to give you enough math to at least get the basic intuition about this stuff and how to get around in it. And here we go to the standard log browsing moves. This is kind of like Hacker Jeopardy, right?
What does this do? Can you see the logs? I can enlarge them, right? So you just count to 11 to get the 11th token, then sort, uniq -c, sort numerically. And you get the answer, or rather the question: this is the number of successful logins via SSH, broken down by IP address. We look for this kind of information all the time when we admin our systems. This is grep, and this is the UNIX pipe length measuring contest.

Alternatively, if you've parsed your logs and they're sitting nicely in a database like this, then you can reach for your WHERE clause: SELECT COUNT(*) AS count, ip FROM log_data GROUP BY ip ORDER BY count. That gets you the same thing as all the awk-ing and uniq-ing, and then you've got your little table. And every time you need a useful statistic, you have to run a query or write a UNIX log pipe. Of course, you also have to parse syslogs. To build the tools I'm going to talk about, we built our own syslog parser and a pattern guesser that guessed those patterns; this is an excerpt of the actual pattern language that I'm using, but we can talk about that another time. I used to work in natural language processing, so I tried some tricks from there to have the pattern guesser guess those patterns.

But basically, this is what you do: filter the data, group the data, count across the groups, sort, and then repeat. This is the great cycle of log processing, the snake eating its tail, the ouroboros or whatever it's called. You can do it with pipes, you can do it with WHERE clauses, but you can't really escape it. And this is the case that I want to make: trees are better than tables and pipes.

So here is my log tree. This is the first tool; it views parsed syslog. The actual log record is down there, and the grouping is done by a template; I'll show later how that template is derived. You can see that it's split by IP, then by user. It's a pretty long list for that particular node, sorted by the number of records, the number of actual log lines that ended up down there. So we see that 2,300 logins out of 6,000 come from this single host in this data set, and the rest are an order of magnitude smaller. That's interesting information. You also get ranges of distinct values, so you can see how many distinct values for port you have, how many distinct values for user you have (26), and so on.

Well, having one classification is good, but we might take another classification and see how the two relate. So I'm exporting my data to another window, and in this window I choose to regroup it. Here it's grouped by user, and I would like to see where records from this user turn up. If I scroll down, I should be able to see this: seven here, one here, and so on and so forth. What we can then do is look at other attributes, say by record type. We see that there are that many public-key logins, and this user who's generating those 2,000 logins is a major user; actually, this user almost always uses a public key. What we also see is that there are some failed passwords and password logins from this user. I'm going by the blue number here, and this is "failed password for user 2". Well, this is approximately anonymized. Can you guess what actual user this is? A four-letter user? Yes. So it's clear why that user is using a public key: it's a maintenance script, and it runs from the administrator's console. In fact, if we split this by IP, we see that most of those logins, 1,300 out of 2,900, come from one and the same console. This one. So we know root's machine. But this is interesting: apparently that admin went around and tried his password from workstations, and failed. Anyway.
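For concreteness, the counting step above fits in a few lines of Python. This is a minimal sketch, where the file name "auth.log" and the 11th-field position are illustrative assumptions standing in for whatever your syslog format puts there:

```python
# Minimal sketch, equivalent in spirit to:
#   awk '{print $11}' auth.log | sort | uniq -c | sort -rn
# and to: SELECT COUNT(*) AS count, ip FROM log_data GROUP BY ip ORDER BY count;
# "auth.log" and the field position are assumptions for illustration.
from collections import Counter

counts = Counter()
with open("auth.log") as f:
    for line in f:
        fields = line.split()
        if len(fields) >= 11:
            counts[fields[10]] += 1    # the 11th token: the source IP

for ip, n in counts.most_common():     # sorted by count, descending
    print(f"{n:6d}  {ip}")
```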
So the claim is that trees are better for browsing logs, even if they're really, really simple trees. To back up this claim, I'd say that humans always use trees when they think about classifications. Here is your protocol hierarchy dumped out of Wireshark, and here is the visualization of that graph; you see it every time you look at those statistics. And firewall decision trees are what people deal with all the time: a record comes in and you apply a test. Is it from the DMZ? Yes, no. Is it from the trusted network? Yes, no. Is it from others? Yes, no. Then you sort by protocol and accept or deny or drop, and so on and so forth. iptables rules are living proof that this is what people do.

So in the demo you just saw, on your simple syslog, groups are nodes. These are the same groups that you would get with uniq and sort, or with SELECT ... GROUP BY. I have another demo, the Wireshark extra panel, which does this too, but that one actually does a lot more: it tries to find the optimal tree, and we'll get to that. The groups are sorted by the number of records that end up in them, and the records themselves become leaves.

Here is what's going on. When you want to run a query, say pick out a particular IP, you're going down a branch. When you refine that further, say by a user, you're going further down the branch into a subtree. So if you remember this table from the previous example, this is what one branch looks like. I claim that it's so much better to have all the pipes in hand at the same time, all the queries in hand at the same time, because these queries are now branches. Any time you open up a tree and select a node, that node is actually a branch, actually a query, and you can write it out. So you can preserve and log your whole process of log browsing. Queries pick out a leaf, a node, or a subtree, and you keep them in your mind and on your screen at the same time.

And this really works like a coin sorter. A record comes in, and you have a template. The template says: okay, first we sort everything by destination port; there you go, record. Then it gets into another coin sorter, which gets instantiated for it, say for port 80: now we sort by source IP, and here is your source IP. Here's another record; it travels down the same path, those coin sorters are already there, so they happily direct it into a different branch. When another record comes along, a new sorter may be created just for it: that's a new group, a new intermediate node of the tree. And so it goes. The good part is that you can save this template and apply it again and again. If you think you've found the right way to classify your records, you can save the template and send it to somebody else; if they have my tool, which is free and needs debugging, they apply the template and get the benefit of your decision tree for classifying their logs.

The question, then, is which tree to choose. Should you split first by user, or by IP and then by user? This is all about choosing the best grouping, and the grouping should be such that the interesting stuff jumps out at you.
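To make the coin sorter concrete, here is a minimal sketch of the idea; the field names and records are hypothetical, and the real tool of course parses syslog and saves and loads these templates:

```python
# A template is an ordered list of fields; each record travels down the
# tree, and a new "sorter" (group node) is created on demand.

def insert(tree, record, template):
    node = tree
    for field in template:
        node = node.setdefault(record.get(field), {})   # sorter made on demand
    node.setdefault("_records", []).append(record)      # record becomes a leaf

template = ["dst_port", "src_ip"]    # split by destination port, then source IP
tree = {}
for rec in [{"dst_port": 80, "src_ip": "10.0.0.1"},
            {"dst_port": 80, "src_ip": "10.0.0.2"},
            {"dst_port": 22, "src_ip": "10.0.0.1"}]:
    insert(tree, rec, template)

# Each branch is, in effect, a saved query: tree[80]["10.0.0.1"]["_records"]
# holds the records matching "dst_port == 80 and src_ip == 10.0.0.1".
```

Saving the template is then just saving that ordered list of field names.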
And that is what the rest of the talk is about: simple heuristics for choosing the best grouping, the best classification. But before we do that, why don't I actually demonstrate how it works? This is Ethereal, except that when you open a capture file, it asks you which set of features you would like it to investigate, which features should participate in the construction of the best template. Here you say: okay, this is a wireless capture, so give me 802.11 and IP. Clear enough; let's look at layers 2 and 3. It does its magic and gives you this tree representation of your data set. The point is to build the most conspicuous tree, one in which anomalies jump out.

And here you've got just two branches, split on the reserved bit of the IP flags. Can you actually see this? Excellent. I was afraid I'd have to use magnification, kmag, but it messes up my screen. So, the IP flags reserved bit. There are no evil bits here; remember the evil bit RFC. But in one of those packets the bit isn't even there. How is that? So we get to see this packet: source, destination, something. Let's see what that packet looks like; let's find it in the packet list. And sure enough, it's a bogus IP header, length zero. This is not IP. What is it? It looks like IP, the SNAP header says so, but it starts with version zero. So we've found an anomalous packet. We can delete it, which I am not going to do, and proceed.

Or we can look into the other group. And look at this, what do we have here? We have some IP fragmentation: there are three IP fragments in this set, and the rest is good, clean stuff. So we go in deeper and deeper. If we're tired of this: here are more fragments. Of course, everything associated with fragmentation will jump out at us, because these packets really are anomalous for this capture; the rest of the data is not like that.

We can also check the ranges of the other features that occur here, sorted by the number of unique values. We have 3,000 IP IDs and 2,900 WLAN sequence numbers, so apparently there is some retransmission going on, as we will see. So many lengths, and 21 distinct TTLs. Well, what are they? Let's reshape the tree and look. The largest group of packets in this capture, almost half, 2,300 out of 5,000, has TTL 255. That's something to know. What protocol is that? Let's look at the protocols; we have only five. 0x11 is 17, that's UDP; 6 is TCP, understandable. But 0x67, what is that? And 0x59? And one packet has its protocol undefined, but we know that's the malformed packet we've seen before. So let's see where these guys go: I'll mark them, and they'll jump out at us in the tree. This 0x67 with a TTL of 1, well, what is that guy? Let's find it in the packet list. Oh, it's a PIM packet. And what are those guys? There are five of them, and here they are; let's find them again. This is OSPF, which I'm sure many of you could have told me just by looking at the hex. Maybe. Is there a drinking game tonight? Or was it yesterday? Too bad.

Anyway, I'd like to direct your attention again to the fact that you can ask the algorithm to reshape the tree for you at any level. If we decide to look at these 53 packets with a TTL of 1, we can apply the algorithm under this node. And I'm not going to modify the feature choice.
And we see how these packets distribute; the feature chosen now is the WLAN frame control field from the beginning of the packet. These things split into frames destined for the distribution system and frames not destined for the distribution system. And what is that thing? That thing is this SSDP packet. So we go from one subtree to another, looking for anomalies. There are two more fields to show, which I hope we'll get to as we proceed. It's very useful to be able to summarize the distribution of unique values; say, you know that there are 318 unique sources. You can get this information from the statistics summary in Wireshark, but here it's correlated with this template view of the tree.

So how does this heuristic work; what is the magic? How does it pick the right tree, or the wrong tree, as it happens? We want to define the browsing problem mathematically. I'm a mathematician, a former mathematician if you like; I do low-level kernel and networking research these days. But it's nice to be able to define the problem mathematically, and it's good to find a proper statistic for it.

So we look at this picture, and what do we see? We see this little scroll bar thingy that's just not showing us the rest of the data. We have no idea what's down there. The lines you're interested in are maybe 20 pages away, each surrounded by a page of chaff, in a twisty little maze of messages, all alike (whoever played Adventure here), all the same, or not exactly the same, right? This is the face of the enemy: you're uncertain about what you've got in there. You can use summarization tools, but you'd like them to be integrated with your interface; you don't want to write that many pipes and queries. And of course you can sort: clicking on the tab sorts the values, and you can see the max and the min in alphabetic order. And you can drill down with filters to an interesting group; again, each filter takes you down one branch of the classification tree that you keep in your mind. So why not keep it on screen?

And all the while there is this key problem: where to start? Which column or protocol to pick? With Wireshark you can get as many columns up here as you like, but where do you start looking, and how do you group those things? Should you group them by IP? By source? By some obscure protocol feature? We'd like to automate guessing one and two. And it really is guessing; it's mere heuristics, but it's better than nothing, and it works in many cases.

So what we want to do is estimate uncertainty. The trivial observation here is that most lines in a large log will never be examined directly. One needs to convince oneself that one has seen everything interesting, and you'd like to jump right to the interesting stuff. This is what your logs look like: there is redundancy, there is repetition, and somewhere there is something that's not like the rest, and most of the time that's what you want to see. So we must weigh uncertainty against redundancy; we want to compress data. And there is actually a statistic that deals with that. That statistic is entropy. Entropy is, speaking informally, the number of bits needed to encode a data item under an optimal encoding, in a really, really large batch of data distributed the way yours is. So here's a stream of As, Bs, Cs and Ds, in which the As, Bs, Cs and Ds appear equiprobably.
So there is an equal chance; the probability is just the relative amount of As, Bs, Cs and Ds that you'll see in the stream. And since they're all equiprobable, you have to encode each one of them, and with four possibilities that's two bits per symbol: a number from one to four can be encoded in two bits. But if the distribution is non-uniform, if the As are a lot more frequent, then you can modify your encoding to take advantage of those long runs of As. You say: at this point there are so many As following, and you use a really short code for that. Your basic zip compressor will do it for you; it picks that sort of special encoding to take advantage of the long runs in the data.

English is like that. English is redundant. Depending on how you define the entropy of a natural language, you get 0.6 to 1.6 bits per character. And the really interesting thing is how you define the entropy of a natural language. In the sequence here, the assumption was that symbols are drawn according to the distribution, but randomly, without looking at the previous one. In English, when you've said a word, you have to continue in a restricted number of ways, otherwise you're speaking gibberish. The same applies to letters. So these are different models of English. If we just observe the distribution of characters, each letter chosen independently of the previous ones, as often as it occurs in the natural language, this is the English we get: "OCRO HLI" doesn't work. These are bigrams, where we look at the previous letter.

Here, my Klingon vocabulary is just about ten words. NuqneH, Qapla'. I can probably... oh, taH pagh taHbe': to be or not to be. That's just about all of it. I wrote a Klingon parser once to help me read the translation of Hamlet from the original Klingon. There is such a thing, it exists, and if you look at it, you see that the earthlings have really spoiled it. It's all about honour and revenge and all of those noble Klingon things, and everything else is just extra garbage thrown in.

Trigrams: here you see some words that start to resemble English words. And you can go further and look at words. Unigrams: these are words taken out of the bag randomly. These are words taken out of the bag looking at the previous word, and these are words taken from the bag looking at the two previous words. And this is actually readable: "The best film on television tonight is there is no one here who had a little bit of fluff." This is almost grammatical. Anyway, Shannon's experiment, which gave that 1.6-bits figure, is based on how likely humans are to be wrong when predicting the next letter or word in a text. You uncover it letter by letter, and you count how many times the guess is incorrect.

So entropy measures uncertainty given some previous knowledge, some previous model of the data. It may be your mental model of English. It may be your mental model of your log, which is a lot more interesting for our purposes. And people have already been doing this sort of thing, right? Every log browsing tutorial, every incident response tutorial, says something like: look at the most frequent and the least frequent values in the column or list. Now, what good is this advice if all those values are equiprobable? What good is it when there are thousands of those values? Maybe the thing is hiding somewhere in the middle.
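As a quick aside on the redundancy point: you can feel the difference between a skewed, low-entropy stream and a uniform one with any off-the-shelf compressor. A tiny sketch:

```python
# Low-entropy (skewed) data compresses well; uniform random data doesn't.
import os
import zlib

skewed = b"A" * 800 + b"BCD" * 67    # mostly As: long runs, low entropy
uniform = os.urandom(1000)           # every byte equally likely: high entropy
print(len(zlib.compress(skewed)))    # a few dozen bytes
print(len(zlib.compress(uniform)))   # about 1,000 bytes: no savings
```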
So it would be nice to begin with the easier-to-understand columns or features, the ones with the simpler distributions. And this is my suggestion, as implemented in that tool; you've seen the heuristic work. Start with the data summary based on the columns with the simplest frequency charts, also called frequency histograms; I'll show how one is derived. This means you start with the picture with more simplicity, less uncertainty, smaller entropy. You shine your little flashlight on your data where you're more likely to recognize an anomaly. It's really like looking under the street lamps for your lost keys: if you just start browsing in the dark, you're not really likely to find anything. You may find someone, but probably not your keys.

Entropy. The simplest summary of a column is a range, minimum to maximum. We can get this by sorting: destination IP starts here and ends there, destination port likewise, and so on and so forth. The frequency histogram generalizes the range: we take all the distinct values in the column and count how many times each occurs. The counts are on the y-axis; this is the label for port 445, this is port 80, this is port 21. This is a port scan, a real port scan from my network. Clearly everyone is looking for port 445, followed by port 80 and FTP and so on and so forth. So we can build a little histogram for each one of those columns. And I ask you: which one of these histograms is easiest to understand? I'm not going to wait for an answer from the audience.

Now, when we take the counts and add them all up to get the total count of packets, we can compute probabilities. A probability is really the fraction of things that ended up in one bin; here the bin for 445 is pretty large, the next one is lower, and so on. So we get a set of probabilities, and this is how we define entropy: we take the log base 2 of the inverse probability and average it over all the probabilities we have, H = sum of p_i * log2(1/p_i). The log base 2 is the reason entropy defined this way is related to the number of bits: how many bits do you need to record a number from 1 to n? Log base 2 of n. And it turns out that if you have a perfect encoding of a really large batch of data, the entropy is the best you can do for your average encoding length per symbol. This is a theorem proved by Shannon in 1948, and it revolutionized the entire data transmission business and science.

So why logarithms? The least number of bits needed to encode numbers between 1 and n is log2 of n. And why is this a measure of your uncertainty? Because in order to communicate your choice once you've made it, you need to spend some bits; you count the bits needed to communicate the missing information. Here is a little bit of math that shows just how it works. If you choose among four characters, four possibilities, four values, with equal probability, then it's a quarter, quarter, quarter, quarter chance to draw each one of them, and four possibilities give you two bits. Now, if you draw the first character with higher probability, if you need to transmit it more often because it occurs more often in your log (say here it occurs as often as all the rest of them together), then you are reasonably certain that you're going to get that first character, A, half the time. So you can decrease your uncertainty.
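That definition fits in a few lines of code. A minimal sketch, with the example streams from the slides approximated by counts:

```python
# Entropy of a histogram: turn counts into probabilities p, then average
# log2(1/p), i.e. H = sum of p_i * log2(1/p_i).
from math import log2

def entropy(counts):
    total = sum(counts)
    return sum((c / total) * log2(total / c) for c in counts if c)

print(entropy([1, 1, 1, 1]))    # four equiprobable values   -> 2.0 bits
print(entropy([2, 1, 1]))       # A as often as all the rest -> 1.5 bits
print(entropy([12, 1, 1, 1]))   # A 80% of the time          -> ~1.04 bits
print(entropy([10]))            # a single value             -> 0.0 bits
```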
You can decrease the amount of information you need to transmit those sequences of As. Let's bias this a little more: 80% of the time you draw A. Now your uncertainty is measured as one bit and a little bit more, because how uncertain are you about what you're going to draw? You're going to draw A almost always, right? And finally, if all you ever get is A, then your uncertainty is zero: A is what you're going to get. Thank you. For only one value, there is no entropy.

So if I superimpose the entropy measurement over a histogram, that measurement serves as the measure of the simplicity of the histogram. For this one it's one bit and a bit, here it's five bits, here it's almost six bits, and here it's six bits and a bit. So this is clearly the simplest histogram, and you are most likely to understand a split by this feature. And of course, if you're looking at a port scan, it's obvious how this goes: your targets get picked out by destination port.

So let's look at this in another tool. Can you see the screen? Can you see the letters? Very good. Now, these are port scan alerts. I press the magic key, it reworks the template, and it says: okay, destination port is the best you can do. Let's import some more data in here, say your next log from the next day, a larger batch. The first batch was 1,300; let's import this one, and we end up with 18,000, right? And you can see what sort of ranges are there: 37 destination ports now, 96 destination IPs. I reapply the magic. Now the statistical composition of my set has changed; I just added more stuff. Actually, I just added the wrong batch. I'm sorry. Oh, no, no, I added the right batch. Let me rerun this; this is an earlier version, and I get confused with the command line. So this is the loaded template that I saved from the previous time, and I now rebuild it. What I see is that the statistical properties did change: now the simplest histogram I have is a split by the type of the scan. Most of them are SYN scans, but seven of those alerts are SYN FIN scans. So this can serve as a measure of novelty.

And this is actually how the tool would pick up the ARP covert channel from the previous talk. There is only so much statistical bias in normal ARP; you don't get to see much variation in those fields. Once you start transmitting information, you increase the variation quite a lot. If I manage to get a packet capture from the previous speaker, I may be able to demonstrate this during the Q&A.

So, we went through this. Now, measuring co-dependence, correlation, is really interesting. You look for correlations, and you want to know whether a pair of fields is strongly correlated, and where that correlation breaks. So you would try to rank pairs of fields before looking, because this might be significant. There is a standard classical example: source IPs of user logins. Almost everyone comes in from a couple of machines; one user comes in from all over the place. Problem? We've actually seen that scenario. That user was root. Or: on a small network, source IP will be correlated with TTL. That's fine, right? The paths are really simple; why would the source vary the TTL? But what if a host sends packets with all sorts of TTLs, breaking the correlation? A user just discovered traceroute? Maybe. Or maybe that machine never ran traceroute before; maybe it's a printer or an appliance, and if you never observed that traffic from it before, that might be significant.
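Here is a rough sketch of both rankings: single columns scored by entropy (simplest histogram first), and pairs scored by mutual information, which is one standard "entropy and friends" measure of co-dependence. The talk skips the exact math, so take this as an illustration rather than the statistic the tool actually implements; it reuses the entropy() function from the earlier sketch, and the records are hypothetical.

```python
# Rank columns by entropy, and pairs of columns by mutual information
#   I(X;Y) = H(X) + H(Y) - H(X,Y),
# which is high when two fields are strongly co-dependent.
from collections import Counter

def column_entropy(records, field):
    return entropy(Counter(r[field] for r in records).values())

def mutual_information(records, f1, f2):
    joint = entropy(Counter((r[f1], r[f2]) for r in records).values())
    return column_entropy(records, f1) + column_entropy(records, f2) - joint

records = [{"src_ip": "10.0.0.1", "ttl": 64},     # hypothetical records
           {"src_ip": "10.0.0.1", "ttl": 64},
           {"src_ip": "10.0.0.2", "ttl": 255}]
print(column_entropy(records, "ttl"))               # ~0.92 bits
print(mutual_information(records, "src_ip", "ttl")) # equals H(ttl): fully correlated here
```

A host whose records suddenly stop fitting such a strong correlation, like the printer that starts running traceroute, is exactly what you'd want ranked to the top.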
In fact, I just spent some time last month cross-building libpcap and arp-sk and dsniff and a bunch of other neat tools for this platform, which turned out to be a wonderful box for doing ARP poisoning and traffic injection and just about any kind of packet deviltry you might think of. Because this thing runs a stock kernel, and BPF, the Berkeley Packet Filter, is not ripped out, and the raw sockets interface is not ripped out. So libnet works, libdnet works, libpcap works. Beautiful. And don't ask me about the hat.

Okay, so my other example of where correlation may be interesting is from a MUD, a multi-user text adventure. It's like World of Warcraft in ASCII text, only with better player-versus-player interaction. You can talk to my wife about it, because she runs that MUD and I'm not an expert, but having this tool, I helped her find cheaters. So here you see a syslog record and the format for parsing it: user gets object in room such-and-such. And we immediately found out, by looking at the correlations in that tree, that two rooms had by far the largest number of objects picked up. That's just like the gold transfers in World of Warcraft that I got to learn about from Greg Hoglund and Stoke, except that in this particular MUD it's always recorded. The major source of money was a camp of robbers: everyone would go in, kill them, and get out with the loot. That was how the economy of that game ran. Or cheating, right? Player A kills player B over and over and over again in the same room. Why? Why does he keep coming back? Well, it turns out that's how experience and war points can be gained. This is a non-commercial game, of course, so gold transfer is not a problem, but these are the things that jump out if you use entropy to look at correlations between things.

For a pair of features, instead of single histograms, you use a three-dimensional diagram. These are combinations of, in this case, source IP and destination IP, and the height of the histogram is the count, the number of hits. And I am out of time right now, so I'm going to skip the math here. Yes, yes, yes: I'm so totally going to skip the math, and I'm going to put up the last slide. Entropy is good; it's part of a complete analysis kit. I would like to thank you for listening, Defcon for having me and letting me go two minutes over time, and everyone who helped me code these tools and everyone who helped me with the data. You can get these things from our SVN repository. And please do contribute.