The next talk is going to be on network traffic analysis using deep packet inspection and data visualization. This is Bram; he's really, really good at staring at lots and lots of packets, and if you ever wanted an application of Tetris block stacking, he is your man.

I'm at the University of Technology, in the Computer Science and Mathematics department, and I'm here to tell you something about my PhD project, which is about network traffic analysis using deep packet inspection. I'm in the data visualization group, which means that we try to develop new types of systems that can help people gain insight into complex data structures, and in this lecture I will tell you something about Wireshark traffic. For the people who are interested, I will briefly give some demos of systems that we've designed; you can also download them from a website, and I will show you the URL in a minute.

So why are we doing this deep packet inspection? In my project I have the task of finding so-called advanced persistent threats, and these often work in three stages. First there is the infiltration phase, where the virus tries to enter the network using social engineering: think, for instance, of a malicious USB drive, or simply sending it by mail. Then there is the second phase, expansion, where the virus tries to locate the system that it wants to harm. Finally there is sabotage, which can either mean staying under the radar and leaking information to the outside world, or disrupting the services in the network. Although infiltration is nearly impossible to prevent, we can detect signs of expansion and sabotage by analyzing the network traffic inside these environments. So how can we actually find these advanced persistent threats? There are several ways to analyze network traffic, and one of the more common ones is analyzing it at the level of bytes.
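One classic trick at the byte level is measuring the Shannon entropy of a payload, since encrypted or compressed bytes look nearly uniform while plain protocol text does not. A minimal sketch of the idea (the payloads below are invented for illustration, not taken from the talk's data):

```python
import math
from collections import Counter

def byte_entropy(data: bytes) -> float:
    """Shannon entropy in bits per byte (0 = constant, 8 = uniform)."""
    counts = Counter(data)
    n = len(data)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

# A repetitive ASCII payload versus a stand-in for ciphertext.
plaintext = b"GET /index.html HTTP/1.1\r\nHost: example.org\r\n\r\n" * 20
randomish = bytes((i * 89 + 41) % 256 for i in range(1000))

print(round(byte_entropy(plaintext), 2))   # low: small ASCII alphabet
print(round(byte_entropy(randomish), 2))   # high: close to 8 bits/byte
```

A byte-level detector could flag flows whose payload entropy jumps toward 8 bits per byte, without ever dissecting what the packets mean.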
At that level we ignore everything inside the packets: we look at the byte structure and see whether there are patterns in there. The technique that is most popular nowadays is flow-based analysis, where people also incorporate IP and TCP information, telling us where the messages come from and over what protocol, but the payload inside these messages is still unknown. If we want to find advanced persistent threats we need to dig even deeper, down to the level of application-specific protocols, to figure out what a message actually means. In the case of file accesses in a network, for instance, there is a protocol called SMB (often referred to via its implementation, Samba) that can tell you what kind of files are being created in the network. But the deeper we go into this hierarchy, the more protocol fields we have to analyze, so from a scientific point of view there are still lots of open questions about which fields are actually interesting and how to deal with this large attribute space.

So how can we obtain this data? For this we use Wireshark: we take raw pcap traffic and run it through the Wireshark dissector. For the people who do not know it, Wireshark is an open-source tool for analyzing network traffic, and the only thing you have to know here is that we run the data through it and obtain a huge table, where every row corresponds to a network packet and every column corresponds to a protocol field that can be present in these packets. You can see that every packet has IP and TCP fields, but the further you go to the right, the more you get into application-specific protocols such as SMB. Besides the large number of columns, network traffic analysis also involves a large number of packets that have to be analyzed. So this is basically the summary of my PhD: I have four years to analyze this huge table to find APTs, and then I'm done.
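The pcap-to-table step can be sketched in a few lines. The export below is hypothetical (three hand-written packets and an arbitrary choice of fields), but the shape matches what a tshark field export produces: one row per packet, one column per protocol field, with application-specific columns empty for most packets.

```python
import csv
import io

# Hypothetical export; a real one could be produced with something like:
#   tshark -r capture.pcap -T fields -E header=y -E separator=, \
#          -e frame.number -e ip.src -e tcp.dstport -e smb2.filename
sample_export = """\
frame.number,ip.src,tcp.dstport,smb2.filename
1,10.0.0.5,445,report.docx
2,10.0.0.5,445,
3,10.0.0.9,80,
"""

def load_packet_table(text):
    """Parse the CSV export into per-packet dicts, dropping the
    protocol fields that are absent ('') in a given packet."""
    rows = csv.DictReader(io.StringIO(text))
    return [{k: v for k, v in row.items() if v} for row in rows]

packets = load_packet_table(sample_export)
print(len(packets))                     # 3 packets
print("smb2.filename" in packets[0])    # True: an SMB file access
print("smb2.filename" in packets[2])    # False: no application-layer field
```

The sparsity is the point: most columns are empty for most packets, and only the application-specific ones carry the meaning the speaker is after.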
A colleague of mine assists me in this by already marking packets that are anomalous using machine learning. So I wondered: OK, I am in data visualization, why should I care about this problem? We have machine learning, it can fix it for us, it can mark these packets, and all is fine. Then I started to realize that these machine learning algorithms generate a lot of alerts. Even with a false positive rate of 1%, for network traffic where a thousand packets are sent per second this still results in 10 alerts per second. So we can use data visualization to help users analyze these alerts and distinguish the true ones from the false ones. There are other things data visualization is good for as well, for instance exploiting human cognition. I have shown a figure here on the right where we apply machine learning, simple linear regression, to four different data sets: the blue line represents the model, and it is exactly the same for all four data sets, but if you look more closely, from a visualization perspective you can clearly see that the data sets are very different from each other. Last but not least, visualization can be used to help users incorporate domain knowledge. Network traffic can only tell you what has happened, what kind of packets were sent, but not why, and this is where the user comes in: the user has knowledge about the environment, he is aware of things that have happened in the past, and he might be able to relate what he is observing back to those phenomena. So we have been working on this project for quite a couple of years now, and we have looked at the problem from three different angles. One of these angles is data-driven: can we create an image of this network traffic and maybe find some patterns inside these images?
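The four-data-set regression example just mentioned is the classic Anscombe quartet, and the point can be reproduced in a few lines of plain Python: ordinary least squares gives essentially the same line for all four sets, even though a plot makes their differences obvious.

```python
# Anscombe's quartet: four data sets with (nearly) identical fitted
# lines, yet visibly different structure when plotted.
x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
quartet = [
    (x123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    (x123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    (x123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
     [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
]

def least_squares(xs, ys):
    """Ordinary least-squares fit y = a*x + b."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    a = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
        sum((x - mx) ** 2 for x in xs)
    return a, my - a * mx

fits = [least_squares(xs, ys) for xs, ys in quartet]
for a, b in fits:
    print(f"y = {a:.2f}x + {b:.2f}")   # the same line, four times
```

Every fit comes out as roughly y = 0.50x + 3.00, which is exactly why looking at the picture, not just the model, matters.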
The second angle is alert-driven: here we rely on machine learning to give us a large alert collection, and we have to find patterns in these alerts; maybe some alerts are all caused by the same user, or are generated at the same moment in time. Last but not least, and this will be the system I focus on in this presentation, is the knowledge-driven angle, where we simply use the domain knowledge of a user who has a feeling for what should and should not be happening in the system, and based on this knowledge try to find patterns. To give you a bit of a flavor of what data visualization is all about, here are two systems we designed in the area of data-driven and alert-driven visualization. The system on the left is a monitoring tool where we created a pixel image of the network traffic we were observing, and as soon as you go into the application-specific part of the traffic you can see that these network packets sometimes generate a fingerprint that is specific to a particular type of message. The system on the right focuses more on alerts: can we find protocol fields that are common in these alerts, and can we, for instance, study correlations between them? I'm not going to discuss these systems in detail; people who are interested in these prototypes and how they work can contact me afterwards and I can show them to you. Now we'll focus on knowledge-driven analysis, and for this we designed a system called EventPad. EventPad is essentially a Sublime Text editor for Wireshark: it relies on several concepts that are very similar to a text editor, and I will show you this in a few minutes. You can also download the application at www.bromkappels.nl. It's still a prototype, but I would really like to have your feedback on it. I'm working in academia, where everybody sits in their ivory tower and thinks they are doing the greatest things, but I believe that the true users are
here in this room, and I would like some feedback on whether the system could be useful for you and what you would really like to have for future investigations. So, the EventPad system: I will give a brief live demo on several data sets that we analyzed, including VoIP traffic and ransomware traces. The main view is an overview such as this one; it's a bit overwhelming, but I can explain the main ideas in a few slides. The central concept is the view in the middle with all the blocks: every rectangle in this view corresponds to a Wireshark packet, and we can use rules to color these packets according to the properties that these packets have. Let me illustrate this. We go back to our table, and what we're now going to do is map every packet in the table onto a gray square. In the case of VoIP traffic, people were interested in what type of phone conversations were actually happening in the data: can I find, for instance, phone conversations that do not conform to the standard? VoIP signaling is transferred over a protocol called SIP, and SIP is highly flexible, so you can do lots of things that are not specified in the standard but are still possible according to the protocol specification. What we did here is that every row corresponds to a phone conversation, and every gray rectangle corresponds to a packet that was sent inside that conversation. Applying this trick to all the data, we eventually end up with a huge stack of these gray squares; sometimes the sequences are very long and sometimes they are very short. In order to find patterns in there, there are two strategies we can apply: we can try to reduce the sequences in the horizontal direction to make them shorter, or we can try to find similarities between sequences. How do we do this in the horizontal direction? Here we go into the area of regular expressions. Is anybody here not familiar with Linux tooling such
as grep, or text searching with regular expressions, or anything related to that? No? OK. So what we do is we take a conversation, and if we hover over one of these conversations there is a lot of information associated with it, and we can use this information to color these gray squares. In the case of VoIP traffic I am interested in two aspects: I would like to know when a phone conversation starts and when a phone conversation ends, and as soon as a conversation does not begin or end properly, I think that is interesting to analyze. So we can use a visual query language, very similar to regular expressions, to color rectangles based on their properties. As soon as we have done this, we can also throw away information we are not interested in. Let's say that all the phone conversations that start with an INVITE and end with a BYE are fine by me, and I only want to see the ones that do not adhere to this rule; then we can create a visual regular expression stating that all sequences with an INVITE and a BYE should be grouped into a green rectangle, like this, and the ones that do not match the pattern are the ones of interest for our investigation. So we can use rules to color the traffic based on our needs, and it really depends on the type of traffic. And we can use simplification in the vertical direction to see if we can find patterns there: we can, for instance, stack the sequences on top of each other and count how many times each sequence occurs in the traffic. We can still see that there are slight variations between sequences that look very similar, so we can also apply a different strategy, where we let the computer align sequences based on certain properties using multiple sequence alignment, and this shows, for instance, that in the end these phone conversations look quite similar to each other, but some of them have an error
in between. The last aspect of this tool is selections, and this is very important in data visualization: the user should be able to play with these components interactively. Users can write regular expressions to find sequences they are interested in, but they can also, for instance, group the packets together and see whether there are patterns in there, for instance that all the erroneous phone calls come from a particular country. Well, let me just give you a live demo of the system so that you have an idea how this works. This is the prototype, and as you can see, initially all the squares are gray; if I hover over one of them I can inspect all the properties of that packet, and on the left we have histograms of all the protocol fields that are present in Wireshark. I can, for instance, filter on the SIP protocol methods, and then we see that in 50% of the cases we have INVITE messages; if I click on them, you can see all the squares that correspond to the invitations of these phone calls. We can also create a rule using a rule editor that is very similar to Word: we can start with a wildcard rectangle, and if we double-click on it we can attach a constraint to it, saying, OK, I want all packets that have the method INVITE in them, and this is very similar to the Wireshark interface. Let us now create a new rectangle where we specify a color, like this, "invite", and save. Applying this rule, we can see that all the matching rectangles are immediately colored, and we have a view on the very far left showing that 16% of the traffic is now matched by this rule. Of course, creating all these rules on the spot takes too much time, so let me just preload a rule set that we designed earlier; these were made by actual people who analyze this SIP protocol to the max, and you can see that they created several rules
to show the invitation of a phone call and also the acknowledgement of a phone call, and immediately things start popping up, like: why are there two invite-acknowledgement patterns after one another? We can also see an exclamation-mark rectangle, and if I zoom in on that one and inspect the details, it is an authentication error: apparently the first attempt in this phone conversation did not go well, and neither did the second one. OK, so let's have a look at the different types of sequences in the data. We can, for instance, inspect the different sequences by sorting them; simply sorting already reveals that quite a number of these sequences are repetitive, so maybe it is handy to stack these sequences on top of each other to count how many times each sequence occurs in the traffic. Stacking them and sorting by frequency now shows the most frequent sequences in the traffic. We can still see that there are similarities between them, but we can study those in a separate view. Let me check, hold on; unfortunately I cannot duplicate my screen, so I have to do this on the spot. All right. If I make a selection of the sequences that I am interested in, like this, I get a summary overview on the right, basically showing what the traffic looks like: all the conversations start with an INVITE, and sometimes there are these ringing events and acknowledgements. But we can let the computer apply alignment to discover patterns, and applying this alignment shows that the acknowledgement events occur at the end, and sometimes there are two INVITE messages in between. We can also select these blocks and study their commonalities and differences, and this reveals, for instance, that all the messages we have now selected correspond to BYE messages. So we can again use rules to assign these BYE messages to the
data. Here we specify: give me all BYE messages in the traffic, and let's assign a color to it, OK, "bye", like this, and here you can see that there were indeed BYE messages. Sometimes we see that these sequences do not end with a BYE, and I am only interested in the complete conversations, so let's first find all the sequences that start with an INVITE and are eventually followed by a BYE. Here we can say: I have an INVITE, I have a wildcard, and after this wildcard there should be a BYE, and I do not care how many wildcards there are in between; I simply want all of them. Applying this, we get a selection, and we can store the selection for future investigation, like this, and what will happen is that the summary overview over here shows what percentage of the traffic we have currently analyzed. So we can select this, focus only on the complete conversations, and continue our analysis on those. Something else we saw in the alignment view is that sometimes there are these nested patterns, an INVITE inside an INVITE; this is an example of a phone conversation where there is a proxy server between the sender and the recipient. If you want to study how many times these patterns occur, we can write a rule for this: give us all the INVITE messages that are eventually followed by an ACK, and when we see those, rewrite them into a new rectangle called "connection attempt". By stating that between the INVITE and the acknowledgement there is no other invitation, we also take care of the proxy servers in the middle; call it "connect", and applying this, we can compress the sequences, and you can see that the sequences have become shorter. We can also filter on just these connection attempts, like this, simply throw the rest away, and study how many times
certain conversations needed to make connection attempts. In most cases, three thousand of them, the phone conversation required only one connection attempt to succeed, but we can also see examples of sequences where many attempts were necessary. So that was the example on the VoIP traffic; we also analyzed ransomware. What we did is we installed a honeypot in the system, and we wanted to see what kind of encryption was happening inside our network traffic. In ransomware traffic everything operates over SMB, a protocol with which you can open, delete, read, and write files inside network environments. We created rules to represent the open, close, and find requests in the network, and what you can see are actually some very interesting patterns: a lot of opens and closes are being done, after which lots of read messages are sent, and if we look even closer we can also see write messages. Inspecting this traffic, it is actually highly repetitive; I can also show a minimap, and this will show you that there are some very blocky structures inside this traffic. So let's reopen this network traffic and inspect it file by file: we can decide here to group the traffic by file ID. All right, and now, applying the same rule set again, we can see what kind of requests are being sent per file. Let's find all the unique patterns in there, sorted by frequency, frequency descending, yes, and here we see some interesting things happening. We have 700 cases of the topmost sequence and of the bottommost one, and what we can actually see is that a file is being opened and closed, probably to check whether this file already exists; then we have an opening of a file, some content is read, after which it is closed; and after opening the file it is marked for deletion and it is
closed again. If we now inspect the type of files that are being accessed here in a tabular view, we can see that in this view all the AngularJS files are being touched. If we look at the topmost sequence, however, we see that some writing is also being done, and if we analyze the files accessed there, we can see the same AngularJS files, but with a .fun extension attached to the back. This was kind of strange, so we decided to investigate it more closely. Eventually, doing so, like this, let me go all the way to the bottom, this one, yes: we applied the open, find, and close events to the sequences, and by analyzing these rules we could find an interesting scheme here. It was always the case, for this particular virus, that a scan was performed and some files were opened and closed, but the interesting thing about this Jigsaw virus was that a file was first duplicated before it got encrypted, and the original file was simply deleted from the disk, but not securely wiped in any way. So if you simply applied hard-disk tooling to scrape off the files that were still available on the hard drive, it would still be possible to recover the original files. This was actually a bug inside the ransomware virus that we detected using these patterns. So these are some examples of patterns that we discovered throughout this exploration, and you can see that as soon as you have some knowledge about the things that you expect and do not expect, you can already find patterns inside this data. So, yes, that was it. If people have any questions, feel free to ask them, and if people are interested in the system, please contact me afterwards; we are still looking for people who would like to try the tooling and see whether it is useful, and maybe we can collaborate in some sort of way. So thank you very much.