Karen Benson, take it away.

Hi everyone, thank you so much for coming today to learn about examining the internet's pollution. As announced, I'm Karen Benson, and I'm really excited to be talking here today at my first DEF CON.

So to start off: a couple of years ago on Reddit, somebody asked the garbage men on there about the illegal, strange, and valuable things that they had seen while examining other people's trash. You can go find this thread and read what they found, but the main takeaway is that they found a number of interesting and valuable items. So today I'm going to talk about the analogous question, but for the internet. We're going to ask what sort of interesting and valuable information we can find by looking at packets and traffic that you may consider the internet's trash. And I feel that I'm pretty qualified to talk to you about this, not because I'm Oscar the Grouch, but because I just defended my PhD, in which I spent the last four years looking at this type of traffic. Prior to that I looked at not-so-trashy traffic while writing intrusion detection software. So I've looked at some packets.

So, a quick outline of the talk. I'm going to go a little more into depth on what this trash is and the various ways that you can collect it. I'll talk about the ways that we collect it and the ways that you could possibly collect it on your own networks. I'll go a little bit into the data that I use for the presentation, then the bulk of the talk is going to be about the interesting and valuable items that you can find in the trash, and then there will be a conclusion.

All right. So what is internet trash? This is a term I made up, so what am I calling this? Basically I mean any unsolicited packet. This means you're not going out trying to get people to send packets to you. You're just passively capturing everything that comes to your own IP addresses. And this has a name other than trash.
It's Internet background radiation, or IBR. People have studied this for a long time to look at worms and things like that, but I'll tell you more about the things that have happened in the past couple of years.

So probably the most obvious example of IBR is scanning. When you're searching for hosts that run a service, you're going to send packets to hosts that will respond to you, as well as to hosts that are behind firewalls and are not going to respond to you, and possibly to people like me who are just collecting the garbage of the internet.

We also get backscatter packets, which are any packets sent in response to a forged or spoofed packet. Typically you think of these in denial of service attacks. You have a victim and an attacker, and the attacker doesn't necessarily want everyone to know that they are the one launching the attack. So they may forge the source address, the "from" field of the packet. When they send it to the victim, the victim may have a hard time differentiating between forged and non-forged packets, and they may respond. But they're not going to respond to the attacker; instead they're going to, hopefully, respond to us.

Next we have misconfigurations, which is when you just erroneously believe that a machine is hosting a service. These can be small scale, like someone typing an IP address incorrectly, but they can also be pretty large scale and affect a lot of hosts, and we see this a lot in peer-to-peer networks.

Similar to misconfigurations are bugs. This is when you have some sort of software error that causes the packets to reach an unintended destination, such as a byte order bug. So even if you know your DNS server's address correctly, you may, because of some issue in software, send the packet to an unintended destination.

We also get a bunch of spoofed traffic, where for some reason people are using the wrong address. They typically aren't trying to attack me, but we still get some packets like this.
And then finally there's some traffic where we just don't know what it is. This can be TCP SYN packets to non-standard ports, or UDP packets where we don't understand what the payload is. One example of this is encrypted packets: it is difficult to understand what the intention of such a packet is. So this is a summary of the major classes of IBR.

So how can we collect this? You've probably heard of honeypots, where you purposely set up machines to be infected with malware. Maybe you run an old operating system or some sort of vulnerable service. With this you can get really in-depth information, because you're infected and you understand the attack vectors and the consequences. But if we don't want to do something so in-depth, we can use some other setups.

The first example of this is just collecting one-way traffic. So if this is your network and these are the used machines in your network, you announce some BGP prefix, and you probably have some sort of middlebox keeping state of the connections: which ones are bidirectional and which ones haven't received an acknowledgement yet. If they never receive an acknowledgement, this is probably some sort of unsolicited traffic, so you can store it as your collection of IBR.

Similar to this, you can have a graynet, where your state is the set of IP addresses that are used, and for all the other ones you just write the traffic to storage as it comes into your network.

Another related concept: if all of your addresses are in some small BGP prefix but you have a much larger one, you can announce the whole prefix that you have and then, based on the destination, decide which packets to route to the destination and which to write to storage.

And then finally, an extreme example of this is a network telescope, where you just don't use a BGP prefix that you have and you record all the traffic that comes in.
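
As a rough illustration of the graynet idea described above, here is a minimal Python sketch. The prefix, the set of used addresses, and the function name `classify` are all hypothetical example values, not anything from the talk; a real deployment would make this decision in a router or middlebox rather than in a script.

```python
import ipaddress

# Hypothetical example values: an announced prefix and the addresses in use.
ANNOUNCED = ipaddress.ip_network("192.0.2.0/24")
USED = {
    ipaddress.ip_address("192.0.2.10"),
    ipaddress.ip_address("192.0.2.20"),
}

def classify(dst: str) -> str:
    """Return 'forward', 'store', or 'ignore' for a destination address."""
    addr = ipaddress.ip_address(dst)
    if addr not in ANNOUNCED:
        return "ignore"    # not inside the prefix we announce
    if addr in USED:
        return "forward"   # live host: deliver normally
    return "store"         # unused address: nothing solicited this packet

if __name__ == "__main__":
    for dst in ("192.0.2.10", "192.0.2.99", "198.51.100.1"):
        print(dst, classify(dst))
```

With an empty `USED` set, every packet to the announced prefix is stored, which is the network telescope case.
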
And in the order that I presented these, they become easier to scale and implement, and there are normally fewer privacy concerns. But you lack the ability to do really in-depth analysis if you're not responding, and people can avoid your IP addresses.

For this talk I am going to use traffic collected at a number of network telescopes. We have multiple large academic network telescopes, and we receive a ton of data from these. We're currently capturing about five terabytes of compressed PCAP per week, and we have traffic going all the way back to 2008, so we can do some historical studies with this.

With this data we see traffic from all over the internet. In terms of countries, we see all countries except a few islands in the Pacific Ocean, and in terms of IP addresses, we're seeing about 5% of the IP addresses announced in BGP, so it's a pretty good sampling. I'm showing you data from July 2013, but if we look over time, we were almost always seeing data like this. I didn't extend this graph, but it has increased a lot recently too. There can also be events such as the Spamhaus attack, which was a really big DNS-based denial of service attack, and during this event we were able to see traffic from more hosts.

So now we get to the exciting part of the talk, where we talk about the interesting and valuable things found in the internet's trash. For this section I'm going to go through the major classes of traffic besides spoofed, and for each one I'm going to tell you about the thing that I think is the most exciting.

In terms of scanning, I'll talk about some trends and some relationships to vulnerability announcements. To collect this data we used the historical data that we have had since 2008, and we just applied Bro's parameters for determining whether an IP address is a scanner: if you send packets to 25 different IP addresses on the same protocol and port within five minutes, Bro would alert that you were being scanned.
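
The scanner definition just described (25 distinct destinations on the same protocol and port within five minutes) can be sketched as follows. This is a minimal illustration, not Bro's actual implementation: the function name `find_scanners`, the input tuple layout, and the sliding-window approach are assumptions made for the example.

```python
from collections import defaultdict, deque

WINDOW_SECONDS = 300   # five minutes, as stated in the talk
THRESHOLD = 25         # distinct destination addresses, as stated in the talk

def find_scanners(packets):
    """Flag sources that look like scanners under the definition above.

    packets: iterable of (timestamp, src, dst, proto, port), sorted by time.
    Returns the set of (src, proto, port) keys that reached THRESHOLD
    distinct destinations within some WINDOW_SECONDS span.
    """
    recent = defaultdict(deque)   # key -> deque of (timestamp, dst)
    scanners = set()
    for ts, src, dst, proto, port in packets:
        key = (src, proto, port)
        window = recent[key]
        window.append((ts, dst))
        # Drop entries that have fallen out of the time window.
        while window and ts - window[0][0] > WINDOW_SECONDS:
            window.popleft()
        if len({d for _, d in window}) >= THRESHOLD:
            scanners.add(key)
    return scanners

if __name__ == "__main__":
    # 25 probes in 250 seconds from one source to distinct destinations.
    fast = [(i * 10, "10.0.0.1", f"192.0.2.{i}", "tcp", 445) for i in range(25)]
    print(find_scanners(fast))
```

Recomputing the distinct set on every packet keeps the sketch short; at telescope scale a per-destination counter would be the more practical choice.
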
So this is maybe not the best definition of a scanner, because it obviously depends on how many IP addresses you have, and it's definitely not capturing slower scans. But it can give us a first look at the macroscopic scanning that's happening on the internet, or at least on our networks.

So I broke up the data, first into what was happening from 2008 to 2012. You can see that the colors correspond to ports, and in terms of both packets and IP addresses the purple port is very popular. This is TCP 445. We see that the first increase is right when the Conficker outbreak occurred, and then we see subsequent increases, often corresponding to new releases of Conficker.

But we can't say all of this is necessarily Conficker, because there are other scans of this port, though most of it happens to be from Conficker. So we can come up with some heuristics to determine which packets originate from Conficker. To do this we can exploit a bug that Conficker has in its pseudorandom number generator. For the most part, when it's randomly scanning the internet to propagate, it has a bug where it only targets IP addresses A.B.C.D where B is less than 128 and D is less than 128, so it's really only scanning a fourth of the internet.

And so we used a heuristic based on the birthday problem, which basically asks: given a random group of people, what is the probability that two people are going to share a birthday? Often the answer is surprising: with only about 34 people it's already pretty probable that two of them share a birthday. Another way of asking this question is: how many unique birthdays can we expect given N people and 365 birthdays? So, turning this into a way of identifying Conficker: if we have IP addresses A.B.C.D that are being scanned, we can look at the individual bytes of the IP address.
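
The "how many unique values can we expect" question above has a standard closed form: drawing n times uniformly from m possible values yields m(1 - (1 - 1/m)^n) distinct values in expectation. The Python sketch below applies it to a single address byte drawn from either 128 or 256 values. The function names and the nearest-expectation decision rule are assumptions made for illustration; the talk does not spell out the exact heuristic that was used.

```python
def expected_unique(n_draws: int, pool_size: int) -> float:
    """Expected number of distinct values after n_draws uniform draws
    from pool_size equally likely values."""
    return pool_size * (1.0 - (1.0 - 1.0 / pool_size) ** n_draws)

def closer_pool(observed_unique: int, n_draws: int) -> int:
    """Return 128 or 256, whichever pool size predicts a distinct-value
    count closer to what was observed (hypothetical decision rule)."""
    diff_128 = abs(observed_unique - expected_unique(n_draws, 128))
    diff_256 = abs(observed_unique - expected_unique(n_draws, 256))
    return 128 if diff_128 <= diff_256 else 256

if __name__ == "__main__":
    for n in (50, 100, 200):
        print(n, round(expected_unique(n, 128), 1), round(expected_unique(n, 256), 1))
```

The gap between the two expectations widens as a source sends more probes, so the distinction is more reliable for sources that send many packets.
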
So if we look at D, we can ask how many unique D values we can expect given that the scanner is drawing from either 128 or 256 possible values for D. You can repeat this for the other bytes, and you can then start to differentiate, in expectation, between randomly scanning a quarter of the internet and scanning the entire internet.

So if we look at the Conficker outbreak and the amount of scanning that happened around that time period (this graph is in log scale), we do have some missing data, but we do see an increase right when Conficker was discovered. What we would expect is that before that point we wouldn't see any hosts matching our Conficker heuristic. However, when we look at the number of IP addresses meeting the Conficker heuristic, this is what we see. Up until about August, no IP addresses met this heuristic, and then all of a sudden we started seeing some traffic. And this is well before Conficker was actually discovered. So this is evidence that someone was trying to test out Conficker prior to this. For the first couple of days, the IP addresses were all in the same province in China. So maybe this is helpful. As far as I know, nobody has claimed the Microsoft $250K bounty for the Conficker worm author, so perhaps this information could be useful for that.

So that was before 2012. If we look at what has been happening since 2012, not surprisingly, Conficker is dying out, but the most popular port has been replaced with port 23, which is Telnet. The best explanation I have for this is that people may be trying to scan for Internet of Things devices. If you have a better idea, let me know.

We can also see some other interesting things happening here. This spike that is in gray was on a variety of ports, and it corresponded to traffic from the Carna botnet.
Which was somebody who decided to create a botnet, scan the whole internet, and then publish all of the results anonymously. So we see this, and we can verify, based on their data, that the traffic was actually coming from the Carna botnet.

If we look at the IP addresses, we also notice periods of time where there's increased activity on a port. If we look at Heartbleed, right around here you can see in red where the Heartbleed vulnerability announcement occurred, and then a week or so later we see a lot of increased activity on the pink port, which is TCP 443, which is where Heartbleed was likely to be exploited.

Similarly, a little bit later we see a lot of scanning of TCP port 5000. Just Google searching for TCP port 5000 during that time, we found that Akamai had a report that they were seeing lots of Universal Plug and Play devices being used in denial of service attacks, and prior to that report we see evidence of scanning on that port. So we were potentially seeing activity before it was used in an attack.

All right, so that was scanning. Hopefully we will release our scanning data set pretty soon.

Going on to backscatter, I'm going to talk about an attack that we've been seeing on authoritative DNS servers. Just as a reminder, backscatter is a response to a spoofed packet. So let's suppose you have a web server that you want to perform a denial of service attack on. You could do the denial of service attack directly on the web server. However, there is also another weak point. All legitimate hosts who want to contact that web server need to find the IP address associated with the name, so they have to do a number of DNS queries. It turns out that you could also perform a denial of service attack on the authoritative name server.

One way that you can do this is with an open resolver. Typically with DNS, you should only resolve domains for machines that you administer.
So UCSD's DNS server should only resolve domain names for clients in UCSD. An open resolver, which resolves names for anyone, is typically considered bad, because it can be used in DDoS attacks.

So you could use an open resolver to pull off this attack on the authoritative name server. In particular, the attacker can spoof a packet, a DNS query, and send it to the open resolver. Since the open resolver resolves names for everyone, it's more than happy to ask the authoritative name server, and it gets a response. And since the original query was spoofed, the open resolver does not respond to the attacker; instead, there's a probability that it will respond to our network telescope.

So we're seeing a lot of traffic recently from open resolvers. This is 2014 data. Prior to pretty much the end of January 2014, we saw hardly any traffic from open resolvers: about 3,000 open resolvers per month. Then, starting in February 2014, we saw 1.5 million open resolvers per month. And we notice that once this attack took off, we are seeing traffic from the same open resolvers over and over again.

This is only a small fraction of the open resolvers on the internet. The Open Resolver Project, which was actively scanning at the same time, saw about 20 times the number of open resolvers that we did. So this means that this attack is only using a subset of the open resolvers.

We can also look at some other data that we have from the attack, which is the status code that comes back with the DNS response. If it's OK, everything's happy. But you can also get a number of failures, including a SERVFAIL, which indicates that there's a problem, most likely with the authoritative name server. In the month of data, we got SERVFAIL errors from nearly every open resolver that we saw, whereas in the Open Resolver Project's scan they see this error very seldom.
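
To make the status-code comparison above concrete, here is a minimal Python sketch that reads the response code out of raw DNS payloads and tallies it. The header layout (a 12-byte header with the response code in the low four bits of the flags field, SERVFAIL being code 2) comes from the DNS specification; the function names and the idea of feeding it a list of UDP payloads are assumptions for illustration, not the tooling used in the talk.

```python
import struct
from collections import Counter

# Common DNS response codes; anything else is reported by number.
RCODE_NAMES = {0: "NOERROR", 1: "FORMERR", 2: "SERVFAIL", 3: "NXDOMAIN", 5: "REFUSED"}

def dns_rcode(payload: bytes):
    """Return the RCODE of a DNS response payload, or None if the payload
    is too short or is a query rather than a response."""
    if len(payload) < 12:            # the DNS header is 12 bytes
        return None
    _ident, flags = struct.unpack("!HH", payload[:4])
    if not flags & 0x8000:           # QR bit clear means this is a query
        return None
    return flags & 0x000F            # low four bits hold the response code

def tally_rcodes(payloads):
    """Count response codes by name across an iterable of DNS payloads."""
    counts = Counter()
    for payload in payloads:
        rcode = dns_rcode(payload)
        if rcode is not None:
            counts[RCODE_NAMES.get(rcode, str(rcode))] += 1
    return counts

if __name__ == "__main__":
    servfail = struct.pack("!HHHHHH", 0x1234, 0x8182, 1, 0, 0, 0)
    print(tally_rcodes([servfail]))
```

Grouping these counts by source address would give the per-resolver view described above, where nearly every observed open resolver returned at least one SERVFAIL.
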
So this is evidence that this attack is actually overwhelming authoritative name servers.

One interesting thing is that we see some data on January 29th, and then the attack seems to really take off in the beginning of February. On this first day, the queries were all for baidu.com, which is a popular website. So this reflects a testing phase. Since then, the domains seem to be used for just a very short period of time, and most of them seem to have bogus registration information. All this analysis was from the first month of activity, and we're still observing this type of attack right now.

So that was backscatter. Now I'm going to go on to misconfigurations, and in particular I'm going to talk about BitTorrent misconfigurations. If you want to download a torrent through BitTorrent, you typically contact the BitTorrent distributed hash table, and it will tell you the location of the torrent, or some other BitTorrent node that is closer to the torrent that you want. However, there can be malicious nodes in the distributed hash table, and they can lie to you about the location of the torrent. If this happens over and over again, it's going to be a lot harder for you to actually find the torrent and get the latest episode of Game of Thrones or whatever you want to watch. This attack is called an index poisoning attack, where you're purposely inserting fake information about what's in the hash table.

What happens after you receive this false information is that you try to set up a connection. So when people send BitTorrent packets to the network telescope, we get an idea of what torrents they are trying to download. This is some data from July 2012, ranked by the most packets associated with a torrent, and you'll notice that a lot of them happen to have the word China in their name.
And a year later we see about the same thing. This attack doesn't seem to be going on right now, or if it is, it's at a much lower level. Oh, I'm sorry: in this China attack, the IP addresses that are asked for the torrents typically satisfy this equation, or fall in this set of IP addresses. Basically, they're in certain /13 blocks. So it seems that they're being generated programmatically with a buggy pseudorandom number generator. With this attack, sometimes we see a lot of packets from it and sometimes we don't see any, and currently we're not seeing very many.

But more recently, about a year ago, we saw a huge spike in the amount of BitTorrent traffic we see. We're getting traffic from about 250 times more IP addresses per hour. We don't really know everything that's going on, so we tried to investigate this.

Just as a recap: when you want a torrent, you ask a DHT node for the location of the torrent, it comes back with the locations, and then you potentially contact our network telescope. So we want to know who's spreading this false information. We can't really learn this by looking at the IBR. Instead, we can set up nodes to actually interact with the distributed hash table. So we set up two torrent clients and examined what happened for over two months, and they both contacted our network telescope fairly frequently.

So we looked at who was telling our clients to contact the network telescope. The most popular client string was a libtorrent one, but this only accounted for about 70% of the clients, and it's a pretty popular client string among legitimate hosts as well. Most of the IP addresses were in China, but they were in multiple ASes. So this wasn't too successful in identifying who is actually sending this false information.

But we did notice one really suspicious behavior. In the hash table, all the nodes have an ID.
So that means that they think that the IP addresses in our network telescope also have IDs. And the IDs that they request all have four as the third byte. So that's kind of weird. Typically, when you receive locations, you receive not just one location but multiple locations at a time, and this behavior is similar for a lot of the other IP addresses that we see. So we're receiving a lot of BitTorrent traffic as a result of a bug or a misconfiguration in a peer-to-peer network.

Peer-to-peer networks also caused a lot of traffic as a result of a bug in one of the systems. If we look at the number of sources sending us traffic over time, we notice some interesting things, like the Conficker outbreak and when we started seeing a lot of BitTorrent traffic. And then all of a sudden, in October 2010, the shape of the graph definitely changes: it's very diurnal, and we weren't really sure what was happening here.

We were able to identify the responsible payload. Certain bytes seemed fixed, and we could hypothesize about what the other ones were being used for. But we still had no idea what this was, and we weren't really sure about the most frequently used ports either. We did notice that, in terms of the sources sending the packets, a large number of them were located in China. In fact, in a month's time we received traffic from 30% of all BGP-announced IP addresses in China, so this is huge.

Also interestingly, in the USA category there were IP addresses that belonged to the UCSD computer science department, where I went to school. So we were able to coordinate with someone who could monitor the traffic going in and out of UCSD's network to capture traffic from these IP addresses. This ensured that this traffic wasn't spoofed and was actually happening.
So all of the CSE machines basically contacted a common IP address, and in response they got a pretty large packet. Based on this packet, they then sent about 40 more packets to different machines, whose addresses were all encoded in this original big packet. And it wasn't just one packet; they were exchanging a lot of packets. Eventually the UCSD machines would receive a packet like this. This packet is from 113704122, but instead they would respond to 12270113, immediately after receiving this packet. And this packet met the BPF filter that we had used to identify all of this traffic. So this is a byte order bug, and this is why we were receiving a lot of this traffic.

We identified that this software bug was in Qihoo 360, which is the most popular security software in China. If you look at their license agreement, you see that they will use peer-to-peer technology to update program modules, malware definition databases, and components of the software. So basically we were getting information about when people were getting software updates.

We contacted Qihoo and told them, hey, you have this bug, and so then we could see how long it took them to fix it. The traffic had one kind of weird feature: every four to five weeks there was a large spike, probably related to big update events. But there wasn't a big decrease following one of these. Instead, it decreased about a month later, and this date was about the same time a new version of Qihoo was available on their website. So we're still getting some traffic, but in general this bug has been fixed.

Now on to the last part, which is looking at some unknown traffic. The bug was also an example of unknown traffic, but I'll go through another one. Basically, if you investigate some of these packets a little bit more, you might be able to identify where they're from.
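
The byte order bug described above, where a reply goes to the byte-reversed form of the intended address, can be illustrated with a short Python sketch. The function name and the example address are hypothetical; this only shows the general effect of reading a four-byte IPv4 address in the wrong byte order, not the vendor's actual code.

```python
import ipaddress

def byte_swapped(ip: str) -> str:
    """Return the address a packet would be sent to if the 32-bit IPv4
    address were read in little-endian instead of network byte order."""
    packed = ipaddress.IPv4Address(ip).packed      # 4 bytes, network order
    wrong_int = int.from_bytes(packed, "little")   # misread the byte order
    return str(ipaddress.IPv4Address(wrong_int))

if __name__ == "__main__":
    # Documentation-range address used purely as an example.
    print(byte_swapped("192.0.2.1"))
```

A telescope that happens to cover the byte-reversed forms of many real client addresses would receive this misdirected traffic in bulk, which is consistent with the volume described above.
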
So in the beginning, when I explained the unknown category, I said here's a packet whose payload appears to be encrypted. Basically, this one IP address was getting a lot of traffic sent to it, and it all seemed to be encrypted, based on the entropy of the bytes. But we did a byte-wise analysis: what is the first byte, the second byte, the third byte, and so on. We found that this byte here always seemed to be somewhat related to the whole length of the packet itself. Then I read a bunch of white papers and found that, in the Sality botnet's encryption, these four bytes are an RC4 key used to encrypt or decrypt the entire rest of the message. When we decrypted almost all the packets to this one IP address, we found that they all started like this. So this confirmed that this is a Sality command and control packet.

This is kind of interesting, because you might say: OK, I understand why someone would have a bug, or why someone would purposely put false information into a BitTorrent DHT, and I understand how a byte order bug happens. But this also happens in peer-to-peer botnets, and that's why we received a lot of traffic. In fact, if we look at how many IP addresses were sending us traffic per month, basically to this one IP address, we see about the same number of infections as Symantec was seeing in the early part of this decade.

So, in conclusion, it's pretty likely that you are transmitting internet background radiation, and if you use network telescopes or other technologies you can find a whole bunch of interesting things. In addition to just looking at these security-related events, we can also learn about the networks and machines generating the traffic. For example, you can do outage detection with traffic reaching network telescopes.
This is a graph from a paper that analyzed events during the Arab Spring. As you can see, the number of packets coming from Libya went down to zero at certain periods of time, and these corresponded to known times that the Libyan government had pulled the plug on their country's internet.

We can also look at path changes. When you send a packet on the internet, there's a TTL field that is decremented by every intermediate router to prevent routing loops. Based on this, you can infer how many hops away the source is, so if this changes, you know that a path change occurred, and this can help you analyze outages and understand routing dynamics. So if you have traffic like this, where the TTL is about the same value over time, it's probably using the same path, but if it looks more like this, then you know that the path has changed.

As a final example, we can also look at DHCP lease durations. When you join a network using DHCP, you announce that you want to join the network, and you're given an IP address to use. Typically, at some point in time you no longer use that IP address, which means that at a future time someone else can use the same IP address. We can look at DHCP lease durations using any traffic that has some sort of ID associated with a client. So given the packets you receive over time, you know that the lease duration is at least this long and at most this long. As I noted before, BitTorrent has IDs as well, so we can use BitTorrent to identify how long lease durations are for various autonomous systems. In this autonomous system, almost everything has a minimum lease duration of less than 7 days. This is really useful for understanding the effectiveness of blacklisting, or whether people are not going to be able to access the internet because you have blacklisted their IP.

So hopefully you enjoyed the talk today, where we discussed some of the crazy
things that happen on the internet. And thank you.

Hello. Hi, very fascinating research and great presentation, thank you. Looking toward the future, I noticed this was all IPv4. Have you given any consideration to IPv6-based telescopes, and do you think it's practical with the sparseness of prefixes and v6 addresses?

So I haven't, but some people wrote a research paper where they basically were able to announce a covering IPv6 prefix and capture everything that other people weren't announcing in BGP. They didn't find as much, but I think as IPv6 evolves, this will evolve as well.

Thank you.

So thank you, that's very convincing that this is incredibly useful data. How can other security researchers get access to it?

So I know that the data that UCSD has is available to academic researchers. You might need to sign a bunch of things, but I don't know the whole process. But you can also start with your own network, if you have one.

Is there a question over there?

Thank you.