All right, DEF CON 13 is proud to present Marshall Beddoe with the Protocol Informatics project. Saturday morning at DEF CON. Why am I working Saturday morning at DEF CON? All right, Marshall, take it away.

All right. So this project is called the Protocol Informatics project. It's basically a play on words on bioinformatics, which is the main inspiration for this project. Should I talk closer? Okay, there we go. Thanks. So my name is Marshall Beddoe. I'm a research scientist at McAfee; however, this is pretty much all independent research. My company has never even seen this stuff, so I think that's what makes it worthwhile, as opposed to something that's out there trying to make money or something like that.

Let's start. The purpose of this project is basically to automate network protocol analysis. I added this slide the other night just to say: this is what the thing does. You input a pcap file, and the output is going to be a packet format. This is my little XML markup of this particular protocol format, which is just something made up: basically a two-byte length field followed by an ASCII string, something like that. My application goes through, does a series of comparisons and checks, and uses sequence alignment algorithms to understand the general structure of the format. If you were to analyze this, the markup is basically describing this protocol, saying that it operates on UDP port 35364, something I pretty much made up. It's a block-based protocol, so you open up a block, you insert the length of that particular block, which is going to be a short, big-endian, followed by a variable type, which is going to be a plain-text string; right now the default value for that is going to be "hello", and then I close it all out.

The general idea is you take a protocol you've never seen before in your life, you're trying to reverse engineer World of Warcraft for all I know, you input the pcap file and out comes the structure. That's the general idea behind it.

So the objective of protocol analysis is first to understand exactly the format of every single packet. That's a pretty simple concept in general, but it does get more complicated as you have to understand the actual state machine and handshake and things of that sort. There's really no good way of doing this at the current time; it's pretty much a manual process. People sit down with Ethereal or something like that and a piece of paper and just start writing down offsets, trying to use their human intuition to understand what a particular field does and where it's located in the packet.

This information is generally important for understanding proprietary protocols, which are actually becoming a bit more rare, in my opinion; of all the security products I've been looking at, everybody seems to be using SOAP over SSL nowadays.
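The actual markup isn't reproduced in the transcript, so here's a rough sketch of the same information expressed as a Python structure instead; the key names are invented for illustration (the real tool emits XML), and only the fields described above are represented.

```python
# Rough sketch only: mirrors the packet format described above as a Python
# structure. Key names are invented for illustration; the real tool emits XML.
example_format = {
    "transport": "udp",
    "port": 35364,                                        # made-up port from the example
    "blocks": [{
        "length": {"type": "short", "endian": "big"},     # two-byte block length
        "payload": {"type": "string", "default": "hello"} # variable-length text field
    }],
}
```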
So maybe this stuff isn't as applicable in those situations, but if you're going back and looking at legacy protocols on older networks and things like that, this stuff is going to be a lot more useful, because there's really not much documentation for that sort of thing. It's also important for extending the functionality of software: if you want to write an open-source client for a particular service, this project can help out in that regard. And lastly, my favorite: finding vulnerabilities in unknown or badly documented protocols using fuzzing. What I basically do is feed it my pcap file, it analyzes the structure, pops out the XML markup, and I feed that into my fuzzer and it starts going to work.

So what Protocol Informatics, which is just the name of my program, does is heuristically identify the fields in a data format. It's never going to be perfect, but it's as close to automated as you're ever going to get.

Nowadays the problem with protocol analysis in general is that it's a manual process. Like I was mentioning earlier, you pretty much do it by hand using Ethereal, and fields are identified largely using human intuition. I was talking to spoon about this a little earlier, and he was basically explaining to me that all these protocols are written by humans, so you can use your human intuition to understand how they were thinking. However, computers have never really been able to mimic human behavior that well; if you look at pattern recognition, people are moving toward neural networks and hidden Markov models and other things of that sort to try to mimic that intuition in a computer program. So I guess the main challenge with protocol analysis these days is understanding general structure across different types of protocols. Take, for example, a protocol that's all binary.
It's usually quite hard to analyze, because there's really no point of reference: you're just looking at the bytes, trying to find patterns, and typically that's going to be a tedious, painstaking process. This is really an attempt to automate that. I think the crux of protocol analysis in general is variable-length fields: long strings in the middle of packets and things like that. If it's not plain-text data, you don't necessarily know where a field starts, where it ends, or what each byte represents. Typically this is all differential analysis: you go through doing comparison after comparison, trying to build a profile as you go along.

Fortunately for us, biologists already solved this problem back in the 1970s, when they were trying to analyze genetic structure, DNA, amino acids, and all that sort of stuff. I'm not taking credit for any of the algorithms I'm presenting today; all I'm doing is reapplying them to our realm, which is computer security and computer science.

So, bioinformatics. What is bioinformatics? It's basically information theory applied to biology, so it's a lot of probability, Shannon entropy, all sorts of things like that, and it's mainly used to identify genes in the genetic structure of organisms. In the Human Genome Project, the majority of the algorithms used are sequence alignment algorithms: what they're trying to find is where these genes begin and end, what the function of the genes is, what their structure is, how they fold, and so on. The workflow for how biologists do this is they have large databases full of genetic data. The human genome is in there, mice are in there, every other species they've sequenced; they put it all in a database and start running queries on it, saying things like "I know this gene exists in humans," and it can go through and work out common ancestors based on mutation rates and things like that. If you want to put it in simple computational terms, it's just processing a large amount of structured but complex data. It's too much for the human eye to recognize, so you have to put it in a database and run these algorithms on it if you really want to understand the structure.

So yeah, the objective is to identify genes and determine their structure and function. Easier said than done, naturally, but there are a lot of analogies to protocol analysis here. One application of bioinformatics is letting biologists map phenotypes to genotypes: being able to say blue eyes maps to this particular gene, attached earlobes map to that particular gene, and so on. Protocol analysis and bioinformatics have a lot in common: they both operate on large sequences of data, and my favorite analogy is that biology's phenotype-to-genotype mapping is equivalent to protocol analysis mapping field functions to offsets. We're basically just trying to understand the function of a particular field in a byte stream. We don't know exactly where it lives, but
we do know it's between this offset and that offset. Fortunately for us, that actually makes our job easier: when it comes to genes, they don't all have a defined start and a defined end. Eye color, for instance, isn't just one single gene; it's spread out through the genome, so biologists have to do more complicated analysis to say, here's a block where it starts, then it skips this many symbols, then it picks up again over here. Fortunately for us it's really not like that. We have fields: here's your start offset, here's your end offset.

So what I did was build an application meant to understand the structured, complex data found in network protocols. The core algorithms used in bioinformatics are mainly sequence alignment algorithms, and two of the most classical are Needleman-Wunsch and Smith-Waterman. I'll get into each in detail as we go on, but basically Needleman-Wunsch is a global sequence alignment algorithm: it takes into account every single symbol in the sequence and tries to align it to another sequence; it's a pairwise alignment. Smith-Waterman, on the other hand, is a local sequence alignment algorithm: it finds the most similar common substring within the two sequences. There are ways to improve sequence alignment, basically by using similarity matrices, and I'll get into that a little later. All of these algorithms are operations on matrices, so there's a lot you can do with different weightings, with assigning scores, and with how the matrix is traversed; it's all dependent on the scores in the cells, and there are actually a lot of neat things people have figured out how to do with that.

Then we get into phylogenetic tree building. This is kind of cool, because it's how biologists are able to build a tree of life based not on the fossil record but on genetic information: how similar are these two sequences, and can we go all the way up the evolutionary tree to find common ancestors. I'll go over two tree-building algorithms, UPGMA and neighbor joining, which I'm sure many people have heard of. And lastly I get into multiple sequence alignment, where there are a lot of challenges involved in aligning more than two sequences, mostly due to computational problems; I'll go over progressive, tree-based multiple sequence alignment. This is a lot of information, so if I'm going too fast, just put your hands up or something.

So the fundamental purpose of sequence alignment is really simple: take two sequences, regardless of their length, and insert gaps into either sequence to make them the same length. Right here I have an example of a DNA alignment. The four bases in DNA are thymine, guanine, cytosine, and adenine, and this is just a really simple example of inserting gaps into the first sequence to align it to the second sequence.
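Since the slide itself isn't reproduced in the transcript, here is a made-up illustration of the same idea in Python form: a shorter first sequence gets three gaps ('-') inserted so that it lines up, column for column, with a longer second sequence.

```python
# Illustrative only: invented DNA sequences showing what "insert gaps to make
# them the same length" means. '-' marks an inserted gap.
seq1 = "GATTACA"
seq2 = "GATCTACAGG"
aligned_seq1 = "GAT-TACA--"   # three gaps inserted into sequence one
aligned_seq2 = "GATCTACAGG"   # sequence two is unchanged
assert len(aligned_seq1) == len(aligned_seq2)
```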
So as you can see, there are three gaps inserted into the first sequence, which evolutionarily would represent a deletion.

The Needleman-Wunsch algorithm, as I mentioned earlier, is a dynamic programming algorithm, meaning that it operates on matrices. It was basically the first sequence alignment algorithm ever written; I think it first came out in 1970. It performs a global alignment on a pair of sequences, and what that means is that every symbol that goes into the algorithm comes out of the algorithm. Gaps are inserted; there's no substring searching or anything like that. It inserts gaps anywhere it can to make the sequences line up with each other as well as possible, to maximize the score of the alignment. It's mainly used for analyzing closely related sequences, and the reason is that if you tried to align the human genome and a mouse genome with a global sequence alignment algorithm, there would be so many gaps inserted that the result just wouldn't be relevant. So it's mainly used for, say, comparing two different humans' DNA, to compare a brown-eyed person against a blue-eyed person, or an attached-earlobe person against a dangly-earlobe one.

So, dynamic programming. It's a common concept in computer science. It's not programming in the coding sense; it's just the terminology used for solving problems with matrices. The idea is to break the problem into subproblems: you have a matrix, you perform a computation on each cell, and as the algorithm progresses it uses information previously computed in the other cells, all the way to the end of the process. The results of earlier computations are saved and reused by the remaining subproblems. Needleman-Wunsch is a dynamic programming algorithm, as are most of the sequence alignment algorithms out there.

So, how Needleman-Wunsch works. It's actually a pretty simple little operation. The first sequence is placed in the topmost row and the second sequence is placed in the leftmost column, and there's a little equation that's used to assess each cell in the matrix, starting at position (1,1). There are three steps to these sequence alignment algorithms.
The first step is to assign the similarity scores: you go through and score the similarity between each character in sequence one and each character in sequence two. The second step is to assess all possible pathways through the matrix, from the left, from above, and from the diagonal of the particular cell you're in, assigning the current cell the value of the maximum-scoring pathway using the equation; I have a matrix on the next slide that will make it easier to understand. The third step is constructing a pathway from the highest-scoring cell back to the beginning of the matrix, and what this pathway represents is the insertion of gaps into each sequence.

Okay, actually, let me go back to the equation. When you're in cell (i,j), you look at your diagonal cell, which is (i-1, j-1), plus S(i,j), where S(i,j) is the similarity score between the symbol at offset i in one sequence and the symbol at offset j in the other. You also look at the cell to the left and the cell above, and to those you add a gap penalty, because going left or going up means inserting a gap into a sequence, and that's not necessarily a good thing; an alignment becomes fairly meaningless if there are too many gaps. Written out, the cell value is F(i,j) = max( F(i-1,j-1) + S(i,j), F(i,j-1) + gap, F(i-1,j) + gap ). The gap penalty docks the alignment score at the end, so you can tell how significant the alignment really is; if it ends up with a really low score, you know it's probably not that meaningful.

So here's the visualization of the concept. I use HTTP just because everybody understands it; it's a text-based protocol and it's really simple. You probably wouldn't ever need this program for text-based protocols, because they're so easy, but for example's sake I use it, since it's small enough to fit into a PowerPoint presentation. The first sequence, placed in the topmost row, is "GET /index.html HTTP/1.0", and the second sequence is "GET / HTTP/1.0". As mentioned earlier, you iterate through the matrix and score the similarity between each pair of characters; right here the G's match, so you score that with a one, and so on, and you build up the matrix. One thing to mention is that the default run of Needleman-Wunsch has a match score of one, a gap penalty of zero, and a mismatch score of zero. In some cases you can make this more complex with different scoring mechanisms — a particular character pair might represent a stronger alignment and get a higher score — and when I talk about similarity matrices a little later on you'll see what I mean; it's probably a little too early for that.

So what you do is start in cell (1,1) and look at your left cell, your diagonal cell, and your upper cell, and apply the equation, which takes the maximum of the diagonal, the left, or the upper value. For the diagonal cell you add the score of that position, so in this case it looks at the diagonal and adds one because the G's match. There's no match to the left, and there's no match above.
So the maximum is going to be the diagonal plus one, and it leaves a one there. These matrices are traversed left to right and then downward, so the next step is to move over one position. This time it looks left, looks diagonal, looks up, and takes the maximum; in this case the previous cell was zero and S(i,j) is zero, and it sees that the cell to the left is the maximum, so it places that value in the current cell. It's a really simple operation: it just goes cell by cell and builds up.

So this is the completed matrix, and I did it all in hex down here just to make it line up and look pretty on the screen. This is how it looks after you've performed all the scoring. The next step is to traverse back to the beginning of the matrix and work out what your alignment is going to be. Typically the maximum cell is going to be the bottom-right one, so we start in the bottom-right cell, because it has the highest value, look at the neighboring cells, choose the one with the maximum value, and traverse there. We start here at E and notice that our neighbors are all D's; basically that means we can go anywhere, because they're all equal: left, up, or diagonal. I personally like to just go diagonal, because it makes the most sense — it signifies that no gap is being inserted into either sequence, which may mean there's an actual match there.

So this is how the traversal evolves. Continuing down, we go diagonal to the D; at the D we look around, all our neighbors are C's, so we go diagonal again by default. Same deal here: all B's, so diagonal; all the neighbors are A's, back to the A; then the nines; et cetera, et cetera, until we reach this point up here at the five. We look at our neighboring cells again and see a five to our left, a four on the diagonal, and a four above. We want to maximize the alignment, so we take the path that results in the maximum score, and at this point we start going left.

This is the completed traversal, all the way back to the beginning of the matrix. At each step we look around: the five to our left is the maximum, so we go there; same thing again, surrounded by two fours and the five on the left, so we go left; et cetera, et cetera, until we get back here, where everything is equal again, so we go diagonal, and then it runs all the way back to the root.

So what this means is that gaps are going to be inserted into one of these sequences. What did we do? We computed a path through the matrix, and now we can apply the rules of Needleman-Wunsch to obtain the two aligned sequences. Here's the rule: any time you traverse left in the matrix, a gap is inserted into the second sequence at that particular offset, and any time you traverse up, a gap is inserted into the first sequence at that offset.
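To make the matrix fill and traceback concrete, here is a minimal Python sketch of this kind of Needleman-Wunsch global alignment, using the default scoring described above (match 1, mismatch 0, gap 0). It's a simplification for illustration, not the PI tool's actual code, and on ties it prefers the diagonal move, as just described.

```python
def needleman_wunsch(s1, s2, match=1, mismatch=0, gap=0):
    """Global alignment of s1 and s2; returns the two gapped sequences."""
    n, m = len(s1), len(s2)
    # score matrix: s1 along the columns (top row), s2 along the rows (left column)
    F = [[0] * (n + 1) for _ in range(m + 1)]
    for j in range(1, n + 1):
        F[0][j] = F[0][j - 1] + gap
    for i in range(1, m + 1):
        F[i][0] = F[i - 1][0] + gap
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            s = match if s1[j - 1] == s2[i - 1] else mismatch
            F[i][j] = max(F[i - 1][j - 1] + s,   # diagonal: align the two symbols
                          F[i][j - 1] + gap,     # left: gap inserted in sequence two
                          F[i - 1][j] + gap)     # up: gap inserted in sequence one
    # traceback from the bottom-right cell, preferring diagonal moves on ties
    a1, a2, i, j = [], [], m, n
    while i > 0 or j > 0:
        s = match if i > 0 and j > 0 and s1[j - 1] == s2[i - 1] else mismatch
        if i > 0 and j > 0 and F[i][j] == F[i - 1][j - 1] + s:
            a1.append(s1[j - 1]); a2.append(s2[i - 1]); i, j = i - 1, j - 1
        elif j > 0 and F[i][j] == F[i][j - 1] + gap:
            a1.append(s1[j - 1]); a2.append('-'); j -= 1      # gap in sequence two
        else:
            a1.append('-'); a2.append(s2[i - 1]); i -= 1      # gap in sequence one
    return ''.join(reversed(a1)), ''.join(reversed(a2))

print(needleman_wunsch("GET /index.html HTTP/1.0", "GET / HTTP/1.0"))
```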
So here are the results of that particular alignment, down at the bottom. Kind of a lame example with HTTP, I know, but it illustrates the point: gaps are inserted right here, under "index.html", because the second sequence did not have that file name in it, and everything else matches up. What this really lets us do is understand the boundaries of the fields in the packet, where they begin and end. If you wanted to analyze this really simple example, you could say: okay, these two sequences had "GET /" in common, so we can consider that a keyword, and "index.html" was a file name. It's a variable-length field — you can put any file name in there you want — and it was subject to deletion in the second sequence, so we can assume it's a variable-length field, followed by the keyword "HTTP/1.0". If you wanted to describe this protocol to a fuzzer, you could say: okay, send "GET", a space, a slash, then a whole bunch of random data, followed by "HTTP/1.0". Pretty simple.

That leads us into the local sequence alignment algorithm, which is used for a somewhat different purpose. This one is called Smith-Waterman; I think it was published in 1981. It's basically a derivative of Needleman-Wunsch, just with some different rules. The point of it is to find the maximum local alignment between the two sequences, which is a subsequence of the two: it finds the substring in each sequence that aligns best, and that may or may not involve gaps. This is especially useful for determining similarity between distant sequences: you could take a chimpanzee genome and a human genome and use a local sequence alignment algorithm, and you'd actually get a lot more out of it, because the alignments mean a lot more in the end. Local alignment has a strong scoring mechanism, whereas global alignment is weak because of the number of gaps that may be inserted, so local alignment algorithms are used a lot to search databases and figure out which species are most similar to which others.

So how Smith-Waterman works: it's a derivative of Needleman-Wunsch; the main difference is that you start with some different values. Again, this is the default run — you can also use similarity matrices and things like that to modify how it works. By default it has a match score of two, so if G and G match we assign a two; it has a gap penalty of negative two, so any time we move up or move left we subtract two from the alignment score (the lower the alignment score, the less significant it is); and any mismatch, say G against E, gets a mismatch score of negative one. The other difference is in the traceback: you start in the cell with the highest score in the matrix and traverse back, and as soon as the score hits zero you stop immediately. The offsets covered at that point are the ones saved and used as the representative local alignment. I didn't actually work up a full matrix example for this one; it was kind of short notice.
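To make that concrete, here is a minimal Smith-Waterman scoring sketch using the default scores just described (match +2, mismatch -1, gap -2, cells floored at zero). It only returns the best local-alignment score, and it's an illustration, not the tool's code.

```python
def smith_waterman_score(s1, s2, match=2, mismatch=-1, gap=-2):
    """Best local alignment score between s1 and s2 (Smith-Waterman)."""
    n, m = len(s1), len(s2)
    F = [[0] * (n + 1) for _ in range(m + 1)]
    best = 0
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            s = match if s1[j - 1] == s2[i - 1] else mismatch
            F[i][j] = max(0,                     # local alignment: never go negative
                          F[i - 1][j - 1] + s,   # diagonal
                          F[i][j - 1] + gap,     # left: gap
                          F[i - 1][j] + gap)     # up: gap
            best = max(best, F[i][j])            # remember the highest-scoring cell
    return best
```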
But basically, if you align these same two sequences, it's the same deal as before: the score ends up being 18, and it finds the substring " HTTP/1.0" as the best local alignment between those two sequences — the most similar stretch between the two.

Smith-Waterman is actually used a lot to determine the distance between two sequences. If you want the distance between sequence one and sequence two, you take the Smith-Waterman score of sequence one run against itself, minus the Smith-Waterman score of sequence one aligned with sequence two. In this case that's 48 minus 18, which is 30, so the distance is 30. Pretty simple.

Here's how it's used in my program. It's not used to determine the protocol fields; I use it just as a classifier, to build a kind of relationship matrix and understand which packets in the tcpdump file are most similar to each other. If you're looking at a complex protocol like ISAKMP or SMB, which have multiple different message types, this is really helpful, because it lets us cluster the different types of messages within that particular protocol. So it's used to calculate the relative distance between sequences, for clustering purposes and for the tree building we get into later. In bioinformatics, Smith-Waterman carries a lot of the alignment work because DNA has no defined starting point, whereas a packet does. That's another big advantage we have: we know where the packet starts, we know where it ends, and we have fields that start here and end there. There are no fields that span multiple blocks and jump over each other the way genes can in DNA.
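As a usage sketch, the distance metric and the all-pairs relationship matrix just described could look something like this, reusing the smith_waterman_score sketch from above (illustrative only, not the tool's code):

```python
def sw_distance(s1, s2):
    """Distance as described above: self-score of s1 minus its score against s2."""
    return smith_waterman_score(s1, s1) - smith_waterman_score(s1, s2)

# For the two HTTP requests above this gives 48 - 18 = 30.
print(sw_distance("GET /index.html HTTP/1.0", "GET / HTTP/1.0"))

def distance_matrix(packets):
    """All-pairs distances between packet payloads, for clustering and tree building."""
    return [[sw_distance(a, b) for b in packets] for a in packets]
```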
So, to improve these alignment algorithms, we use what are called similarity matrices. Basically, we replace the scoring mechanism: before, I was showing you that a match scores one in Needleman-Wunsch, a match scores two in Smith-Waterman, and so on. Similarity matrices, on the other hand, are tables based on the probability that one symbol mutates into another. In evolution it's possible for certain bases to mutate into certain others but not into the rest, so this lets biologists say, for instance, that adenine can mutate into thymine but not into cytosine — by the way, I'm not a biologist, I don't actually know if that particular example is true, it's just an example. Similarity matrices vary quite a bit, and it's partly a matter of preference: there are similarity matrices used only when comparing really distant sequences, and others used when comparing really close sequences.

The application to protocol analysis is really good, though, because we have data types: we have plain-text data, we have binary data, and so on. We can say it's acceptable for binary data to mutate into other binary data and for plain-text data to mutate into other plain-text data. Especially if a protocol usually carries English words, you can have a similarity matrix that accepts "i before e" and other common linguistic rules like that. For the Protocol Informatics similarity matrices, our alphabet is 256 symbols long, so we have a 256-by-256 matrix, and each cell contains a mutation probability between a pair of characters. We might say it's acceptable for "a" to mutate into "e" but not into, say, hex FF. This is actually somewhat outdated now — there are easier ways of doing this — but what it allows us to do is obtain more optimized alignments. If we use similarity matrices specially tailored for different types of network protocols — say one protocol is mixed, with plain text and binary data in it — then when gaps are inserted into the sequences, the alignment still holds: plain-text data only mutates into other plain-text data, binary data only mutates into other binary data, and you don't get mismatches in between.

The drawback of similarity matrices, though, is that they really have to be tweaked. You can use probabilistic models, you can do Markov chains and things like that to build them initially, but it's also somewhat subjective: I can say that I know a plain-text character is commonly followed by a null, so it's acceptable for it to mutate into a null character, and things like that. Bioinformatics scientists have spent years perfecting their versions of similarity matrices; I don't know how many revisions of the BLOSUM and PAM matrices there are, but there are quite a lot, and each one is used for a somewhat different purpose.

I actually had a problem trying to represent this in a way that lets other people follow what I'm talking about, and this isn't that great, but it's a quick example. The superscript B here just stands for binary — assume it's something like hex 01, hex 02, hex 03, et cetera. If you wanted to align these protocols, an alignment without proper scoring might insert gaps in incorrect places and assume that an "F" can mutate into, say, hex 01 or hex 04, which isn't really a correct alignment. If you use a similarity matrix with decent data about mutations between the different data types, you start getting better alignments, where all the ASCII data is aligned to ASCII data and all the binary data is aligned to binary data. So it's basically just a tool to help us make our alignments more meaningful.
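Here is a minimal sketch of how such a data-type-aware similarity matrix could be built, assuming a simple two-class split into printable ASCII versus everything else; the specific weights and the printable-ASCII test are made up for illustration, not taken from the tool.

```python
def build_similarity_matrix(match=2, same_class=1, cross_class=-3):
    """256x256 score table: reward exact matches, tolerate mutations within the
    same data type (text->text, binary->binary), penalize text<->binary."""
    def is_text(b):
        return 0x20 <= b <= 0x7e           # crude printable-ASCII test (illustrative)
    S = [[0] * 256 for _ in range(256)]
    for a in range(256):
        for b in range(256):
            if a == b:
                S[a][b] = match
            elif is_text(a) == is_text(b):
                S[a][b] = same_class        # mutation within the same data type
            else:
                S[a][b] = cross_class       # text mutating into binary, or vice versa
    return S

# During alignment, the per-position score would then be looked up as
# S[byte1][byte2] instead of using fixed match/mismatch constants.
```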
So far I've explained how to do a pairwise sequence alignment, which isn't all that useful on its own. You can look at two HTTP requests and say, okay, here's the variable-length field, this is the spot where you put file names — it's really not that big of a deal. You have to start getting into multiple sequence alignment if you want significant results.

Multiple sequence alignment is basically the act of aligning more than two sequences. It uses Needleman-Wunsch as the alignment algorithm and Smith-Waterman as the scoring algorithm. The problem is that there are computational issues. Needleman-Wunsch, as I showed before, operates on an n-by-m matrix, where n is the length of sequence one and m is the length of sequence two. If you want to align three or more sequences at once, you can start building hypercubes and traversing through hypercubes; the problem is that the operation is on the order of n to the m-th power, where n is the length of the sequences and m is the number of sequences. That isn't really feasible for us. I have my little illustration of a hypercube being traversed — it can be done, but I've been trying to align sequences that are up to 1500 bytes long, and it takes a really, really long time.

So what we use is heuristic sequence alignment. By heuristic I mean sacrificing accuracy for time: the objective is to align every sequence to every other in a reasonable amount of time, but the results are never going to be perfect. The way this is done is by building phylogenetic trees. Phylogenetic trees are basically what you'd think of as a tree of life: you start with common ancestors and move down, and you can start to see the similarities between certain species. It's the same with packets: we know that in SMB all the authentication packets are going to be more similar to each other than to a data-transfer packet, and so on. Typically these are just binary trees, which keeps things quite simple, and there's an interesting parallel in protocol analysis, because protocols mimic evolution by changing their fields. The kind of goofy example I have here is: which came first, "GET /index.html" or "GET /"? Obviously "GET /" is going to be the common ancestor. And here's my little rip-off of the tree of life for lizards: as you can see, the guys on the right-hand side look a lot more similar in structure to each other than to the C. tigris guy over here with the long tail. The same situation applies to packets.

To create phylogenetic trees there's an algorithm called UPGMA, the unweighted pair group method with arithmetic mean. It's pretty simple, actually. The little equation here just defines how the distance between two clusters is taken: the distance d(i,j) between clusters i and j is the mean of d(p,q) over every sequence p in cluster i and every sequence q in cluster j. In our case d(p,q) comes from the Smith-Waterman scoring, and d(i,j) is the distance between the two clusters; that's the operation used at each step of building the tree. To build the tree, each sequence is put into its own individual cluster — one packet per cluster, basically. Then you use the UPGMA distance to iterate through the clusters, doing comparisons between every pair; you take the two with the smallest distance between each other and create a new cluster k by uniting those two; you define a new node k with child nodes i and j to build the tree; you add cluster k to the universal set and remove clusters i and j. At the very end you should have only one cluster left, and a tree built at that point.
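A rough sketch of that UPGMA clustering, reusing the distance_matrix idea from the earlier snippet; the data structures are simplified for illustration, not the tool's implementation.

```python
def upgma(dist):
    """Build a binary guide tree from a symmetric distance matrix (list of lists).
    Leaves are sequence indices; internal nodes are (left, right) tuples."""
    # each cluster is (tree, member_indices)
    clusters = [(i, [i]) for i in range(len(dist))]

    def cluster_distance(a, b):
        # UPGMA distance: mean of all pairwise distances between the two clusters
        return sum(dist[p][q] for p in a[1] for q in b[1]) / (len(a[1]) * len(b[1]))

    while len(clusters) > 1:
        # find the pair of clusters with the smallest distance
        x, y = min(((i, j) for i in range(len(clusters))
                           for j in range(i + 1, len(clusters))),
                   key=lambda ij: cluster_distance(clusters[ij[0]], clusters[ij[1]]))
        ci, cj = clusters[x], clusters[y]
        merged = ((ci[0], cj[0]), ci[1] + cj[1])   # new node k with children i and j
        clusters = [c for k, c in enumerate(clusters) if k not in (x, y)] + [merged]
    return clusters[0][0]

# e.g. guide_tree = upgma(distance_matrix(packets))
```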
So here is a phylogenetic tree of the SMB protocol — this is just doing null sessions and things like that. As you can see, there's actually a pretty large disparity between the different types of packets; the edge length signifies the distance. It's a bit more than a pretty picture. I'm sorry I don't have a cool interface where I can zoom in and tell you exactly what those packets were, but from the analysis I did, each one of these clusters up here represented one of the SMB message types in the protocol. So this is going to help us: say we want to analyze the SMB authentication packet — okay, it's this cluster over here, for instance — so we start at this node and start doing our progressive alignment over here.

The tree is your guide. What the tree is used for is aligning the sequences to each other — the heuristic alignment I was talking about earlier. What we do is what's called progressive sequence alignment: we align two sequences to each other, pick one of the two as a representative sequence for that pair, then align another sequence to that representative sequence, and as gaps are inserted they are pushed down toward the leaf nodes of the tree. The problem with this is that if gaps are inserted in errant places, they are propagated all the way down to the leaf nodes, which can make the final alignment significantly worse. So tree building should probably be a bit more of a manual process: while you're doing the analysis, you look at a step and say, this looks acceptable to me, I'm going to accept this alignment as the representative sequence, so that gaps aren't inserted in errant places. As opposed to the exponential hypercube traversal, the tree-guided approach performs n comparisons, where n is the depth of the tree minus the root. That's very feasible for us; we can actually align sequences within our lifetime, which is very helpful, I think.
So, multiple sequence alignment. The rule, as I was explaining with progressive alignment, is: once a gap, always a gap. You align two sequences to each other and place a representative sequence in the parent node above them, and as you align further sequences to those representative sequences, the gaps translate themselves all the way down to the leaf nodes.

The rule for the tree traversal is: you start at the root, and if the root is null you go left and then you go right. If the left child is not null and the right child is not null — which signifies that there are sequences on those two leaf nodes — you align those two sequences, and you choose the sequence with the least number of gaps inserted as the representative. That's just what I got the best results out of; I don't really know if that's kosher in the bioinformatics world, but from my experience it produced decent alignments. In this case we align sequence one and sequence two; sequence one is chosen as the representative because fewer gaps were inserted into it, which lowers the margin of error from errant gaps propagating down through the alignment. So we place sequence one in the parent node of those two leaf nodes, and we keep track of the edits on the edges. The edits in this case — I don't have the list of offsets right in front of me — are the gaps that get inserted into sequence two.

Pretty tree? Not really. So you start in the root node, the parent node right up here. Like I was saying, if the root is null you go left and then right; the root is null in this case, there's no value stored there, so we go left. Same thing here: our child nodes are not null, so we align sequence one and sequence two, choose a representative sequence, which is going to be one — we call it one prime at this point — and keep track of the edits on the edges. The edits for sequence one are stored on the left edge, and the edits for sequence two are stored on the right edge. We traverse our way back up to the root and start moving right. We align the representative sequence, one prime, with sequence three and store those edits on the edge. Now what we need to do is propagate the gaps introduced by those edits all the way down to the leaf nodes. If we want all the sequences aligned to each other, we basically add up the edits: for sequence one, which is over here, we apply the edits from the (one prime, three) alignment on top of the edits from the (one, two) alignment. The edits in this case are the gaps, by the way. To propagate those gaps down the tree to sequence two, we take the (one prime, three) edits added to the (two, one) edits. And lastly, for sequence three, all we need to do is apply the edits between sequence three and one prime, as explained.
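Here is a sketch of that tree-guided progressive alignment, reusing the needleman_wunsch sketch from earlier. The fewest-gaps representative rule is the heuristic described above; the gap-propagation helper is my own simplified bookkeeping, not the tool's code.

```python
def insert_gaps_like(member, old_rep, new_rep):
    """Insert '-' into member wherever new_rep gained a gap relative to old_rep,
    so sequences already aligned to old_rep stay aligned ("once a gap, always a gap")."""
    out, k = [], 0
    for ch in new_rep:
        if k < len(old_rep) and ch == old_rep[k]:
            out.append(member[k]); k += 1    # existing column: keep the member's symbol
        else:
            out.append('-')                  # newly inserted gap column
    return ''.join(out)

def progressive_align(node, seqs):
    """Tree-guided progressive alignment. Leaves of the guide tree are sequence
    indices; returns (representative sequence, all aligned sequences in this subtree)."""
    if not isinstance(node, tuple):                      # leaf node: a single sequence
        return seqs[node], [seqs[node]]
    rep_l, group_l = progressive_align(node[0], seqs)    # go left first
    rep_r, group_r = progressive_align(node[1], seqs)    # then right
    a_l, a_r = needleman_wunsch(rep_l, rep_r)            # align the two representatives
    group_l = [insert_gaps_like(s, rep_l, a_l) for s in group_l]
    group_r = [insert_gaps_like(s, rep_r, a_r) for s in group_r]
    # the speaker's heuristic: the representative is the side with fewer gaps inserted
    rep = a_l if a_l.count('-') <= a_r.count('-') else a_r
    return rep, group_l + group_r
```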
So here's another kind of hokey example — three sequences being aligned this time, with a bit more verbose usage of HTTP. Down at the bottom is what's called our consensus sequence. What we do is start at offset zero and go down the column: okay, these are all G's, so we put a G down here, et cetera, et cetera. Once we get over here, we start noticing gaps being inserted, which might mean this is a variable-length field, so in this case I just represent it with the question marks down there. The rest is the same: you have a space here, this all matches up just fine, followed by "Host", followed by more unknowns, a dot, then three characters for com, net, and so on, followed by the User-Agent block, followed by the browser type, followed by the Accept header and a text type.

So when we analyze this consensus sequence, what we can say is: okay, here's the structure of this particular format. We have "GET", space, slash, followed by a variable-length string, followed by "HTTP/1.0" — I'm not counting newlines or anything at this point — followed by "Host", another variable-length sequence, a dot, another variable-length sequence, then User-Agent with a variable-length sequence, then Accept and a text type. What we could do at this point is feed this little structure into our fuzzer and just let it start going. This is a completely automated process; I didn't really have to do anything in this case. Unfortunately I don't have a good mixed protocol — half binary, half plain text — to show you right now, but I'll put some examples up on the website and walk you through using the tool; those might actually be useful.

One thing I want to say about this, though: if you look at these consensus sequences, I'm just assuming this part is variable-length and doesn't mean anything, so what's happening is I'm losing information — there was a C here and two gaps there. There has to be a better way to represent these consensus sequences. There's something called sequence logos, which is basically a stacked histogram for each offset. Let me find a good example: right here, after the host, you have two W's followed by a gap. When you stack that histogram up, usually it's a W, but sometimes it contains a gap. Represented that way, you don't really lose information, and you can use a bit more of your human intuition to understand the alignment, instead of ruling a column out just because of gaps or mismatches.

So here's how we analyze the consensus sequence in this case: we can do statistical analysis on each column by building histograms. We build the consensus sequence, like I showed you on the previous slide, and we look at the mutation rates, comparing offset by offset. So we can say, okay — let me find a good example — if you look at this block of question marks right here, almost every single one of them has a similar mutation rate; they have a rate of change that's pretty constant.
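A minimal sketch of building a consensus sequence and per-offset mutation rates from a set of already-aligned, equal-length packets; '?' marks a non-constant column, which loses information in exactly the way just described.

```python
def consensus(aligned):
    """Column-wise consensus of equal-length aligned sequences:
    keep the symbol if the column is constant, otherwise mark it '?'."""
    return ''.join(c[0] if len(set(c)) == 1 else '?' for c in zip(*aligned))

def mutation_rates(aligned):
    """Per offset: fraction of sequences that differ from the most common symbol."""
    rates = []
    for col in zip(*aligned):
        most_common = max(set(col), key=col.count)
        rates.append(1 - col.count(most_common) / len(col))
    return rates
```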
If you wanted to analyze a binary protocol this way, it's a lot more difficult to just eyeball it on the screen; we pretty much base our understanding of where fields begin and end on the mutation rate. So, for instance, we can look at the rate of mutation for each offset, and using that we can get an idea of what that particular field might be used for.

One thing I want to mention, though: beware of junk data. In the last example, if there were any POSTs mixed in, it would result in a pretty bad alignment, and it would be more difficult to understand what the general structure was. That's what I was talking about earlier with the clustering algorithms: if you cluster things correctly and choose the root nodes in each cluster well, you'll get decent alignments out of it, instead of trying to align really distant sequences where lots of gaps are going to be inserted. So yeah, classification is your friend: the better you cluster the packets, the better your sequence alignment is going to end up being.

This next part was an experimental phase about a year ago when I was working on it, but I've already implemented it in code, so it's really not that experimental anymore; I just haven't had a chance to update the slides. At first what I was doing was separating the dynamic data — data that was changing — from the static data. In the case of HTTP, that let me pick out the different blocks and the variable-length fields simply because the data was dynamic, it was changing. For understanding integer fields, this actually worked out pretty well: you build n-gram frequency tables for one-, two-, and four-byte window sizes, slide yourself along the consensus sequence, and use some logic in your program to try to discern what the function of a particular field is. For example, if two consecutive bytes that look like integers mutate at the same rate, chances are they're part of the same field — maybe a checksum or something like that, changing with every single packet. A second example: if, in two consecutive bytes, the least significant byte increments faster than the more significant byte, it might be a 16-bit sequence identifier field or something of that sort.

This is also all implemented, fortunately. What I was going to do next is build a protocol profile based on each consensus, aligning the consensus sequences to each other to break down the structure. For example, I'd take an SMB login authentication packet consensus sequence and an SMB data-transfer packet consensus sequence, and then align those two consensus sequences to understand the general structure of the protocol, the packet format in general. They're still going to be pretty divergent, but you'll be able to pick out the specific blocks in between. For block-based protocols like ISAKMP and SMB, that's going to be quite helpful, I think.
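A sketch of the two integer-field heuristics just described, assuming mutation rates computed as in the previous snippet and packets given as byte strings; the tolerance threshold here is made up for illustration.

```python
def guess_integer_fields(rates, tolerance=0.05):
    """Flag pairs of consecutive offsets whose mutation rates are nearly equal and
    nonzero; they are likely part of the same multi-byte field (e.g. a checksum)."""
    return [i for i in range(len(rates) - 1)
            if rates[i] > 0 and abs(rates[i] - rates[i + 1]) < tolerance]

def guess_16bit_counters(packets):
    """Flag offsets where the low byte takes more distinct values than the byte
    before it, which often indicates a 16-bit counter / sequence-number field."""
    hits = []
    for i in range(min(len(p) for p in packets) - 1):
        hi_values = len({p[i] for p in packets})       # distinct high-byte values
        lo_values = len({p[i + 1] for p in packets})   # distinct low-byte values
        if lo_values > hi_values:
            hits.append(i)
    return hits
```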
The biggest challenge for me thus far has been presenting the data in a readable format where the user can actually understand the results. I have all of this implemented as a command-line tool, which makes it kind of difficult, so I'm just using ANSI color and things like that to print it all out. And the last thing I want to say about this particular tool is that it will never be fully automated if accuracy is the goal. But if you're doing protocol fuzzing, you don't necessarily have to be accurate, so it's kind of okay to have that sort of entropy in each sequence.

So what can this be used for? Understanding unknown protocols — some worm, for all we know, communicating with its brothers and sisters all over the Internet using some binary format we've never seen before. It also lets us fuzz network protocols more efficiently, because we don't necessarily have to write protocol specs for them. If it's a common protocol, I'd probably say just go read the RFC and look at the Ethereal samples, because there are so many of them; but if you're looking at some black-box protocol you've never seen before in your life, this is a tool that would definitely be useful. Also, one thing I forgot to mention: this can also be used for file formats. There's really no difference between a packet format and a file format, so if you wanted to understand some proprietary file format, it's the same situation; the only thing you need is a lot of samples that differ from each other. And if any of you have ideas about how this stuff can be used, I'm interested in hearing them.

My conclusion is that this stuff will never be automated a hundred percent. Like I was describing with the tree building earlier, it's good to be involved in each step of the process, so you know which representative sequences were used and which gaps were inserted, because as gaps propagate down to the leaf nodes there can be errant gaps that make the alignment quite useless.

Right now I have a Python proof of concept for this, and I have some C++ code that I'm working on; it's actually going to be pretty good, and I should have it finished in about a month or two. The Python code uses Pyrex — I don't know if you've ever heard of it — to do the hard, computationally expensive parts, like the Needleman-Wunsch matrix operations. And I don't know if you've ever seen this tool called Orange — it's pretty cool. It's basically a widget-based visual programming interface that lets you play with different data mining algorithms and link everything together. So what you'd be able to do with Protocol Informatics is say: okay, I want to use this particular classification algorithm for clustering, this particular sequence alignment algorithm, this particular tree-building algorithm — because there are actually something like four different types of tree-building algorithms: there's UPGMA, there's neighbor joining,
there's maximum parsimony, things like that. So there's a multitude of opportunity to implement all sorts of things and figure out what works.

I guess the whole moral of the story is that I know it's kind of difficult for people in this industry to look outside the science they're working on and realize that somebody might have already solved the problem back in the day. This is a good example: this stuff was figured out in the 70s, it's 2005 now, and we've found a use for it. So I'm taking questions now; feel free to walk up to a microphone up here. Thanks for coming.

Questions? I can't hear you. Do you want to come up here and try to use a mic?

Yeah, I'm working on almost exactly the same thing with mutating web things, but I just walked in on the end of your lecture. I was talking to a guy who's also doing tree algorithms, but he's doing it with granularity based on the levels of complexity of the search. Have you looked at any of that for your sorting and tree methods?

Yeah, that's basically maximum parsimony tree building. That's where, at every step of the tree, you look at the complexity of what it would take for these sequences to mutate to reach this particular result, and usually evolution chooses the path of least resistance. So in maximum parsimony trees, when you have two different choices, you choose the one that's the most simple.

Anyone else? Okay. Well, this was a lot of information; I'm sorry I didn't have a clearer, more concise way to present it, but this is the website for the project, and I encourage you all to visit the site, download the tool, and give me feedback if you've got it. So, thanks.

Just want to make an announcement that Meet the Fed will be a