 DNS with an industrial-size digger. This has nothing to do with the CPUs of the same name. No, it doesn't. Our presenter here probably constructed something like a Binford 9000 DNS scraper, something like that. He will probably tell us a lot more about it. So a warm round of applause for Roland van Rijswijk-Deij. Oh, and please give a warm round of applause to our Herald here, who pronounced my name correctly, which never happens. Good morning, everyone. Thanks for showing up early in the morning. I really appreciate it, because when I drove here this morning, I thought, oh, I have to talk in the second slot in the morning at a hacker conference. That's going to be hard. But it's great to see quite a full room. I apologize for the white background. They didn't tell me the screen would be like this, so if it's a little bit blinding, it's entirely my fault. First, a little bit of an introduction about who I am, so you know who's talking to you. I work for SURFnet, the national research and education network in the Netherlands, where I'm mostly an R&D guy, but I'm also responsible for DNS operations. Anybody in the room who doesn't know what DNS is, please wave now, and you might want to leave. Great. I'm also a researcher at one of our universities here in the Netherlands. I work at the University of Twente, where I recently became not that kind of doctor. And I've worked in network security for the past 16 years, and that's been stuff to do with networking, but also applied crypto, stuff with smart cards, you name it. On the left, you see my big hobby, which is scuba diving, and this is kung fu diving, as you can see. So why am I wearing this loud shirt? Today, the biggest Pride parade in the Netherlands takes place, in Amsterdam. And I had to choose: go there or come here. And obviously, I chose to come here, right? Because this is great. But this is a hacker conference, people. So please celebrate diversity if you're here. 
Right, now to the content of the talk. So I'm going to get a little bit more serious, and hopefully you'll laugh somewhere along the way. We started a research project where we wanted to measure the DNS at a large scale. We started this project about two and a half years ago. And then the first obvious question is: why would you want to measure the DNS? Because the DNS is part of every network service, and it tells you a lot about the state of the internet, because the DNS has a very important function on the internet: it translates human-readable names into machine-readable information. And it also does some stuff in service discovery, or tells you where you need to deliver mail for a domain, things like that. So what is in the DNS over time, and the clue here is over time, can tell you something about the evolution of the internet, something about the security of the internet, the stability. And we wanted to make sure that we get as much data as we can. That's what this talk is going to be about, and what you can learn if you have that data. The first obvious thing to talk about is that people have been measuring the DNS for years. There are probably people in the room from the CSIRT and CERT community who are aware of passive DNS. Passive DNS was thought of by a guy called Florian Weimer in 2005. And basically what he said was: why don't we record the traffic between a DNS resolver and authoritative name servers on the internet, store that data, and aggregate it at a central spot? There are actually two huge deployments of this on the internet today. One is operated by a US company called Farsight Security. And the other, which is probably less well known but probably just about as big, is operated by the Austrian national CERT team. 
And we, as in SURFnet, contribute to the second one, because we don't want to contribute to something that is run by a commercial company and that's also used by certain agencies in the United States. But we did not go for passive DNS; we went for active measurements. Why did we do this? Well, passive DNS has one problem: it suffers from bias. And that makes it unsuitable for the kind of work that we want to do, which is to track the state of the DNS over time. The problem with passive DNS is that it will only see information for domains that the clients of the resolvers where you capture traffic are actually interested in. So a domain has to be used first before it is observed by a passive DNS sensor. And if you want to track the entire state of the DNS, including less popular domains or domains that haven't been actively used yet, you won't see those in a passive DNS setup. Another issue is that you have no control over the query frequency. If a domain is less popular, you might get one data point every week. If it's very popular, you might get tons of data points that you then have to deduplicate. So it's a bit of a management nightmare as well. So we decided to go a different route: active DNS measurements. What we do is send a comprehensive set of DNS queries for every name in a top-level domain once per day. And I'll tell you a little bit more about the specific queries we do, so you have an idea of the kind of information that we gather. Now, we do this at scale. Our current measurement covers around 60% of the global namespace. That includes all the large generic top-level domains, such as .com, .net, .org, and some others. But we also have quite a few country-code domains, for instance the Netherlands, Sweden, Canada, but also the Russian Federation. In addition to that, we look at all the new gTLDs, "useful" domains like .xxx or .berlin or .frl for Friesland in the Netherlands. 
And I can already tell you that most of those new gTLDs are full of crap. In total, we measure over 200 million domain names every single day. And our challenge was: how do we do this in a responsible way? Because we mustn't overburden the global DNS, right? We want to do a measurement, not a denial-of-service attack. And we need to store and analyze this data efficiently, because I can tell you, this generates quite a bit of data every day. To give you an overview of what our architecture looks like, it's built up of three components. On the left-hand side of the figure, you see our collection server that collects all the zone files for the top-level domains that we measure. We have contracts with the TLD operators to get their zone files every day, sometimes up to two times a day. We collect that stuff and put it in a database, and then we keep track of the daily changes and of the current state of the zone, so that we can actually control our measurement. The middle part is probably the most important, because that is where we do the actual measurement. We have set up a distributed measurement system where we have one central node per top-level domain that controls the measurement during the day. It hands out chunks of work to a cloud of workers. So we have a hypervisor with lots of little VMs running that do the measurement, and they simply ask: do you have some work for me to do? The central node then sends them a batch of domains to measure. Once a worker has finished a batch, it sends the results off to a central aggregation point at the university, where we do two things. We put the measurement results onto a large storage array for long-term storage, because we want to keep multiple years of data. And right now, for the domains that we've been measuring the longest, we have about two and a half years of data. 
But we also have a Hadoop cluster that I'll tell you a little bit about in the next couple of slides. So what do we actually query and store? On the left-hand side, you see all the record types that we ask for. Of course, we start with the SOA record, which should be in every DNS zone and tells us a little bit about how the zone is configured in terms of how often it is refreshed, when it was last changed, and things like that. Obviously, we want to capture the A and AAAA records that are in there, not just for the apex, so the domain name itself, but also for the www label. And then we ask for the name server set, the mail exchanger (MX) records, TXT records, and some DNSSEC-specific records. So we ask for the delegation signer (DS) record, to find out if there is a secure delegation, and we ask for the DNSKEY, so we can track, for instance, key changes in DNSSEC-signed domains. And we ask for authenticated denial-of-existence records by sending a query for a name we know will not exist. What we store is all the records that we get back in the answer section. On our worker VMs, we run a custom piece of software that we developed ourselves, and I'll give you a link at the end of the talk to a paper where you can read all about that. We send the query to the resolver, and whatever comes back in the answer section of the response is what we store. We discard all the stuff in the additional and authority sections, because we have no guarantee that the data in there makes sense. But what we want is what is in the answer section. Then, because we send our queries to a resolver that runs on the measurement host, we actually get all the CNAME expansions, so we store those as well. And we store all the DNSSEC material that's in there, so if we get signatures back, we store those. Finally, we record some metadata: we do a GeoIP mapping for all the IP addresses that we get back in A and AAAA records, and we try to map the IP addresses to an autonomous system. 
This makes it a little bit easier for us to analyze the data afterward. Since the beginning of this year, we also have a separate, what we call, infrastructure measurement. There we take all the NS and MX records that we get back from the main measurement, and then we do an A and a AAAA query for those names, and we're going to be extending that with additional query types later this year. Right, so we send about 11 to 12 queries per domain for 200 million domains, so you can do the math on how many queries that is. What does that mean in terms of traffic? To give you a little bit of an idea, and this is from early on in the measurement, the traffic is a little bit more now: we run this measurement from within one of the data centers that SURFnet has a point of presence in, and the dark red and light red is all traffic generated by all users on our network. We have over a million people on our network. The blue part is our measurement, so we generate more DNS traffic than all of our more than a million users combined. But as you can see, it's not a lot of traffic in terms of volume, right? It's maybe 250 megabits, a little bit more now, but it is a lot of very small packets. Most of our traffic goes to the TLD servers, because for every domain that we request, we have to find the delegation point, so we have to talk to the TLD servers. And obviously the biggest TLD out there is .com, which is operated by a company called Verisign. So we reached out to them and said: do you see our traffic, and is it a lot? And they said: well, we hadn't noticed a huge change, but can you tell us what your IP ranges are? I told them, and they said: yes, we can see your traffic. They said it was a non-trivial amount of traffic, but not disruptive, and they actually encouraged us to continue this type of work, so that was good. 
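To do that math, here's a quick sketch in Python of the per-domain query set described above and the resulting daily query volume. The query list and the helper function are illustrative, not the actual OpenINTEL worker code:

```python
# Sketch of the per-domain query set described above (illustrative, not the
# actual measurement tooling). Each domain gets roughly these queries per day.
QUERY_SET = [
    ("@", "SOA"),                   # zone config: refresh timers, last change
    ("@", "A"), ("@", "AAAA"),      # apex addresses
    ("www", "A"), ("www", "AAAA"),  # www label addresses
    ("@", "NS"), ("@", "MX"), ("@", "TXT"),
    ("@", "DS"), ("@", "DNSKEY"),   # DNSSEC: secure delegation, keys
    ("nxdomain-probe", "A"),        # authenticated denial of existence
]

def queries_for(domain):
    """Expand the query set into concrete (qname, qtype) pairs."""
    return [(domain if label == "@" else f"{label}.{domain}", qtype)
            for label, qtype in QUERY_SET]

# "You can do the math": ~11 queries for ~200 million domains per day.
total = len(QUERY_SET) * 200_000_000
print(len(QUERY_SET), total)  # 11 queries per domain, 2.2 billion queries/day
```

That 2.2 billion figure is why the traffic, while modest in bits per second, is a huge number of very small packets.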
That's one box ticked. Of course, we want to do a responsible measurement. And if you're ever considering doing something like this: what we did was set up a website with clear contact and abuse information, and we created sensible reverse DNS entries. And the IP block info in the RIR database actually tells you where to find us. Because what we don't want is people starting to block us; then our measurement suffers, but more importantly, it would mean we created something that other people suffer from, and we don't want that. And we have had one such instance this year where, due to a company suffering a denial-of-service attack, our queries to them were timing out. This was a very, very large domain name registrar, and because our queries were timing out, our servers were retrying and retrying and retrying and sending them lots of traffic, and they blocked us. And they didn't block us because we were bothering them before; this was just a bad coincidence. So we tried reaching out to them, which was a pain, and it took us two months to reach the right person in the company to get ourselves unblocked. Even though we set up all of this, they apparently didn't find our contact info and didn't think about reaching out, which is fair. I mean, what we're doing is a measurement; it's not like we're one of their customers. But what we learned from this is that even if you do all of this, people will still block you or not be able to find you. So keep that in mind when you set up these kinds of measurements. Of course I'm a researcher, so I need to get funding for my research, right? Otherwise I can't eat. So we decided to call our project a big data project, because big data gets you funding. But then we had to ask ourselves: is what we're doing actually big data? So we decided to compare it to something that is generally considered to be big data. 
The human genome is 3 times 10 to the power of 9 base pairs for an individual human like myself. Actually, if you're interested in DNA, there's going to be a talk by Bert Hubert of PowerDNS. It's going to be all about DNA; I think it's set for this afternoon. And we collect about two billion DNS records per day. So we're not quite the human genome, but close. And since February 2015, we have collected over 1.7 times 10 to the power of 12 records. That's 1.7 trillion in human money, or 563 human genomes. So that's more people than are in the room right now, but it's fewer people than are at SHA. So yeah, I'm working on big data, right? And I mean, seriously, big data, it's huge. So what do we use to run this project? Well, if you do big data, the tool chain you really want to go to is something like Hadoop. So together with a couple of partners, SURFnet, SIDN (the registry for .nl), and the university, we bought our own Hadoop cluster, and we use all of the tooling that's available on there. But in particular, we use a tool called Impala, which is a SQL query engine that I'll show you some examples of later on, and which we use to analyze our data. And because we are publicly funded, we try to make our data openly available. Unfortunately, we can't do that with all of the data we collect, but I'll tell you a little bit about the data that we do make public later on as well. And then of course, on top of that, if we want students to process the data, we run something like a Jupyter notebook, so they can run their Python code straight on top of the Hadoop cluster and analyze the data. In fact, we have collected so much data that we had to extend our little cluster, and in June this year, we added another eight nodes. So we now have our nice little cluster in the data center. I really enjoyed wiring that up. So all I've told you so far is that we collect data, tons of data, but what do you do with that? 
Can you do something useful with that? I'm going to talk to you about four examples of things that we have done with this data that I think might be of interest to this particular crowd. The first example is something called snowshoe spam. Who's familiar with the term snowshoe spam? Raise your hands, please. Very few people. Wow, two. That's very few people. So the TL;DR is that snowshoe spam is a particular variant of spam in which the people that try to send you the spam spread out the load of sending it over lots of hosts, lots of domain names, lots of IP addresses. You can liken it to a snowshoe, one of these funny tennis-racket-like things that you tie to your shoes if you want to walk in the snow. That's where the name comes from. The problem with this particular type of spam is that it's hard to blacklist, because they send it from lots of different IPs. They take, for instance, 40 or 50 IPs that are in a single prefix, but then don't use all of them. So you can't blacklist the whole prefix, and it's a pain blacklisting every individual IP. But what we found when we started looking at our data is that the domains that are set up for this type of spam are actually recognizable, in the sense that they are anomalous if you compare them to other domains. In particular, look at two examples: an anomalous number of A records, or an anomalous number of MX records. These are plots created by one of our students, where he took the long tail of our dataset, so domains that are already anomalous because they have more records than other domains. What he then did is take domains that were blacklisted and compare them to domains that are whitelisted. He took a couple of spam blacklists, such as the ones from Spamhaus but also from other organizations, and then he compared domains. And what you could see is that blacklisted domains were much more likely to have many A records. 
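A minimal sketch of that long-tail idea in Python. The percentile cutoff, the toy record counts, and the function names here are all invented for illustration; the student's actual method is more elaborate:

```python
# Hedged sketch: flag domains whose A/MX record counts are anomalous
# compared to the rest of the population. Thresholds and data are invented.
def percentile(sorted_vals, p):
    """Nearest-rank percentile over an ascending-sorted list."""
    idx = max(0, int(round(p / 100 * len(sorted_vals))) - 1)
    return sorted_vals[idx]

def flag_anomalous(counts_by_domain, p=90):
    """Return domains whose record count exceeds the p-th percentile."""
    vals = sorted(counts_by_domain.values())
    cutoff = percentile(vals, p)
    return {d for d, n in counts_by_domain.items() if n > cutoff}

# Toy population: most domains have 1-3 A records, two outliers have many.
a_counts = {f"normal{i}.example": 1 + i % 3 for i in range(20)}
a_counts["snowshoe1.example"] = 40   # anomalously many A records
a_counts["snowshoe2.example"] = 25
print(flag_anomalous(a_counts))      # the two snowshoe-style outliers
```

The real analysis compares the distribution of such counts between blacklisted and whitelisted domains rather than using a single hard cutoff.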
In fact, you can see that, for instance, there's a gap of seven records at the, what is it, 75th percentile? And there's a gap of 17 records at the 90th percentile. For MX records, it's more or less the same. You can see the red graph veering off towards the right way earlier than the blue one. And basically what that says is that domains that were blacklisted because they were marked as spam are much more likely to have many MX records in them. So what we currently have is a master's student who is almost done; he's finishing up writing a paper about this. His defense is actually at the end of this month. This was a collaboration between the university and SURFnet. Now, SURFnet has a mail filtering service for our constituency, which is higher education and research in the Netherlands. We filter about 10 to 15 million emails per day, so it's not a huge processing service, but we do see quite a bit of traffic. About 50 percent of that is considered spam; at least, those are the statistics that my colleagues gave me. And what we learned is that with this research, we can actually improve real-world email security. I'm going to give you some preliminary results here. They're going to be a little bit hard to read on the screen, but the takeaway is this: for every domain, you see two rows, and the top row is the spam score. Redder means spammier. Blue is "probably not spam" according to the mail filtering system, red means it has been marked as spam, and every block is a single day. Now, if a block is green, that means that that domain was already blacklisted. If it's purple, that means that the methodology that the student developed detected this domain as probably malicious. And what you can see is that there are, of course, control domains that the student's methodology finds, but that are already blacklisted. 
So the method that he developed doesn't add very much there, because the domain was already on a blacklist, so it would have already been filtered. But if you go further down, you can actually see that his method detects domains before they appear on a blacklist. And the bottom graph shows you how many days earlier he detects them than they appear on a blacklist. While the majority is only a few days, we have outliers going up to over 50 days, where he detects the domain as probably set up to send spam more than 50 days before it appears on a blacklist. So this is interesting, and he's writing this up as a paper. The second example I want to give you is something called crafted domains. Anybody familiar with the term? Oh, very few, okay. So does everybody know what DNS amplification is? Who doesn't? Good, I'm at the right conference. DNS amplification is still one of the most frequently used techniques to perform volumetric DDoS attacks. And if you're an attacker, basically you have two ways of doing this. You can either abuse a DNSSEC-signed domain, which is something that's been popular over the past, say, three to four years. Because a DNSSEC-signed domain, of course, has lots of signatures in it, and keys in it, your responses are going to be large, so you can use that to perform DNS amplification attacks. The other option is, of course, to craft a domain, where as an attacker you configure your domain such that you have an almost guaranteed bang for your buck. Typically what you will do is put in a few large TXT records, or cram in a lot of A records, which simply inflates the size of the responses when you send a query for this name. And we decided to look for these kinds of domains in our dataset. While we didn't find hundreds of domains, we did find tens of domains, and most of them were actually abused. There's one example of such a domain on this slide here. 
And what the graph shows you is that in March of 2015, all the way on the left-hand side of the graph, the domain is not large, not inflated yet, so it's not configured to perform attacks. But then, as you can see, as time progresses, they're adding lots of A records to it, which makes the response for this domain pretty large. And how can I tell that these are not legitimate A records? Well, they're all localhost addresses. They're all 127.0.0.1, 2, 3, 4, 5. Why are they all different? Because a resolver can't collapse them into a single record if they're all different. Then, with thanks to Christian Rossow and Johannes Krupp from Saarland University in Germany, we got some data from a project called AmpPot. If you don't know this project, look it up; it's pretty interesting. What they did was design something that functions as an amplifier for lots of protocols, including DNS and NTP and other UDP protocols, and they try to become part of attack swarms. They do this in order to figure out whether people are performing attacks, what kind of attacks they are performing, and what they are attacking. The area indicated by the arrow in the middle is a period over which attacks were observed that used this domain. So this domain was actually used for DNS amplification attacks. What are some other examples we found? Attackers are pretty creative. We found one that has, and I'll show you the content of that one on the next slide, parts of a speech by President Obama on net neutrality. I can't wait to see what crap from Mr. Trump they're going to put in there next. You find that they sometimes put random garbage in there, they'll put mildly offensive language in there. They will have high numbers of A records in odd prefixes; 1.0.0.0/8 is used for all sorts of research, and as far as I know it's not actually routable. And let's see, we had... oh, it's not on this one. 
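As a rough sketch of why those padded-out A record sets matter, here's an estimate of the amplification factor such a crafted domain yields. The byte counts are approximations assuming name compression, and this is not the measurement code; real packet sizes vary:

```python
# Rough sketch of how a crafted domain inflates responses. Byte counts are
# approximations (compressed names assumed); real packets vary.
HEADER = 12  # DNS header size in bytes

def qname_len(name):
    return len(name) + 2  # label length bytes plus root label

def query_size(name):
    return HEADER + qname_len(name) + 4  # + QTYPE and QCLASS

def response_size(name, num_a_records):
    # Each A record: 2 (compressed name pointer) + 2 type + 2 class
    # + 4 TTL + 2 rdlength + 4 rdata = 16 bytes.
    return query_size(name) + num_a_records * 16

def amplification(name, num_a_records):
    return response_size(name, num_a_records) / query_size(name)

# A crafted domain with 100 distinct 127.0.0.x A records (so a resolver
# cannot collapse them) yields roughly 50x amplification:
print(round(amplification("crafted.example", 100), 1))  # ≈ 49.5
```

This is also why the attackers make every localhost address distinct: identical records could be deduplicated, distinct ones all have to be carried in the response.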
But we had one that also had an excerpt from the US federal budget for 2016. I have no clue why they did that. So this first one, with the speech from President Obama: if you want to read it, look up the slides later on. As you see, it has lots of text. It's about net neutrality, this speech. I don't know what they attacked with it. Maybe companies that violate net neutrality; maybe they attacked Verizon with it. I don't know. But the takeaway here is that this domain was actually observed in over 8,000 attacks. So it was abused. The fact that we can find these domains in our dataset, and actually, as the graph showed you, we can find them before they are abused, gives us a window of opportunity to take action against these types of crafted domains. And it also gives you an opportunity to see if you can find the people that set up these domains. Because they'll typically have to register the domain through a commercial company, and typically they use one of these WHOIS anonymizers. But still, the more time you have in which you know that this has been set up, the more time you have to find the guys that were doing this. Now, as I mentioned earlier on: why not just use DNSSEC for DDoS, right? Because many people claim that DNSSEC is just an amplification nightmare. No need to craft domains, just use what is out there. And this is actually a graph from a bit of research that I conducted during my PhD, which shows you on the left-hand side, the gray area, a control group of normal domains; all the colored lines on the right are DNSSEC-signed domains; and on the x-axis is the amplification factor that you can achieve. So the takeaway here is that DNSSEC is a far stronger amplifier than regular DNS, so why not just use that? And actually there are people that claim that the amplification is a reason not to deploy DNSSEC. Now, I'm not going to go into detail. 
There are good reasons to deploy it anyway, and there are actually ways to do away with the amplification. There are ways to solve that. But hey, who needs DNSSEC if you have .tel? This is one of those new, well, it's not a new gTLD, but it's in the list of new gTLDs. It has something to do with telephony. I'm not sure what you're supposed to do with it, but people for some reason put lots of big TXT records in there. And this is a CDF plot for about 3,500, no, sorry, it's almost 5,000 domains that have TXT records in them. If you add all those TXT records together, you get a certain response size if you send a query for the name. There were 3,500 domains with over 1,000 bytes of TXT records, and that gives you a decent amplification, because you send a small query and get a big response. But we found domains with over 54,000 bytes of TXT records that you can get back in a single response. Why? That's like, what the fuck are these people doing? It doesn't make any sense, but it might not be so bad, because if it's over 4 kilobytes, the response is probably not going to be transmitted over UDP; it gets truncated and you have to retry over TCP. But hey, there's lots to pick from below 4 kilobytes that you can use for attacks. The notable thing about this one is that mostly, if people perform an amplification attack, they'll send an ANY query. And there is a draft circulating in the IETF to do away with ANY queries, because they're mostly abused for amplification attacks. But this was just a TXT query that I did, right? That's a legitimate query; you can't block that. So this doesn't make any sense. Oh, and it gets worse. I assume that many of you will be familiar with what is called Hanlon's razor, which is: never attribute to malice that which is adequately explained by stupidity. If you go through our dataset, you will find the weirdest stuff in TXT records. 
We find snippets of HTML, JavaScript, Windows PowerShell code that allows you to configure your built-in DNS server. Why is that in a TXT record? PEM-encoded X.509 certificates. Really, you want to configure that in your web server, not put it in the DNS, people. Snippets of DNS zone files, so it's sort of self-recursive. Really, you cannot make this shit up. But we have a winner. And the winner puts their RSA private key in the DNS. For ethical reasons, I didn't put the whole key up there. And they're not the only one. Why, people? Why? Why? Seriously? As a hobby project, at some point I'm going to see if I can figure out whether the public key that belongs to this pops up somewhere in a certificate or a PGP key or whatever. But this is just plain stupid. Right. The fun doesn't stop there, because I have two more examples, and the third one is CEO fraud. Who here knows what CEO fraud is? Ah, that's more people. Okay. Who here has received CEO fraud emails? Oh, a few people. Did you find them convincing? More or less. Yeah, that's the problem with these. So on August 30 last year, about a year ago, our CERT team reported an incident of CEO fraud that was targeting us, SURFnet. Somebody was pretending to be our CEO. And actually, this was quite a sophisticated campaign, because not only had they learned the name of our managing director, they were sending us emails in correct Dutch, which for most foreigners is quite challenging. And they registered domain names that looked like our domain name. So they went quite far. They registered something called surfnet-nl.net. And there was also one sent to my university, utwente-nl.net. The problem with CEO fraud is that these campaigns can be quite sophisticated. They learn quite a lot about your company. They try to figure out whether your CEO is on holiday. 
And they target your financial department and then ask them, very helpfully: maybe I need to transfer money to this foreign company and I can't figure it out, can you help me? Attackers wouldn't do this if it didn't work, right? But this is also quite costly for attackers, because as you'll see later on, in this campaign they had to register quite a few domain names, and they actually have to pay to do that. So later that day, August 30 last year, we started getting more reports that there had been others in the SURF community that were targeted with similar emails, and then our CERT team received, through one of their communication channels, a longer list of domains, including more names in our community, so we could reach out to these people and warn them. We have SCIRT, which is our security community, where all our constituents can talk in a private area about security concerns. But then we thought: could we find out more about this campaign using our OpenINTEL platform? So, and here I'm going to show you a little bit of how we use Impala: what can we find for these domains? The first thing I did was look at which record types we actually have in the dataset for the surfnet-nl.net domain. Who can tell me what is missing here? Wave your hand. Yeah, yeah, I heard it already: A records, AAAA records. So all I have is MX records, TXT records, and an SOA record, but no A and AAAA records. So this domain was probably set up just to send email. Okay, if it's set up to handle email, who's handling their email? Let's look up the MX record that we have in the dataset. Oh, it's outlook.com. So they are using Office 365. They probably had to pay for that, right? They have a paid account with Microsoft. Okay, so if they're using Office 365, they may have one of these Microsoft-specific tokens in there. That's what you see at the bottom: an MS ID and then some base64-encoded stuff. 
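The pivot on that token can be sketched as follows. This toy example uses SQLite with an invented table layout and made-up token values, whereas the real queries ran on Impala over the full measurement data:

```python
# Sketch of the Impala-style token pivot, on a toy SQLite table.
# Table/column names and token values are invented for illustration.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE txt_records (domain TEXT, txt TEXT)")
conn.executemany("INSERT INTO txt_records VALUES (?, ?)", [
    ("surfnet-nl.net",   "MS=ms11111111"),  # hypothetical campaign domain
    ("groupon-look.com", "MS=ms11111111"),  # same account token
    ("innocent.org",     "MS=ms99999999"),  # unrelated Office 365 user
])

# Given a token seen in one campaign domain, find all other domains
# sharing it -- i.e. domains tied to the same Microsoft account:
rows = conn.execute(
    "SELECT domain FROM txt_records WHERE txt = ? AND domain != ?",
    ("MS=ms11111111", "surfnet-nl.net"),
).fetchall()
print([d for (d,) in rows])
```

The point of the pivot is exactly this account-level link: one known bad domain leads you to every other domain registered under the same account.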
Microsoft puts that in TXT records if you use their managed DNS service. Oh, and they protected their domain against email forgery: they very helpfully included an SPF record so they wouldn't get marked as spam. But what's nice about this token is that it is linked to the account, not to the specific domain of the user. So what can we do? Well, we put this token in a SQL query and ask our system: do you know of any other domains that use the same token? In the .net dataset, we found another 17 domains that have the same token. Okay, not a huge hit. So let's look in .com. Oh, we find 199 domains in .com that have this same token. So basically, using our dataset, we could take domains that we'd seen before and find domains that were part of the same campaign. And we scripted that. We had an input list, collected from various sources, containing around 860 domains, and by looking them up in our dataset, we managed to find almost 1,400 additional domains, and that allowed us to warn people. The pattern that we'd seen before was where they take the ccTLD of the original domain, append it to the name, and then register it in .net or .com. But we found other ones, like Groupon, where they substitute O's with zeros, or overstappen.com, where they substitute another O with a zero, and there are other examples of these kinds of changes to domain names. We used this during the campaign to warn people, and it had direct operational applications. Eventually we shared this list with the National Cyber Security Centre in the Netherlands as well, so they could inform other people, because we figured out there were banks and notary offices in there. So the attackers were targeting quite a wide range of companies in Dutch society. Right, so now we get to the last example. Unless you've been hiding under a rock, you're aware that Dyn was attacked in October last year. Who was hiding under a rock? 
One person, two people, okay. So for those two people: Dyn is a company that provides managed DNS services. They suffered a massive DDoS attack. There are sources that claim it was over one terabit of traffic that they had to process, but that's alleged; there's no confirmation that that actually happened. The main problem is that their US East Coast operations went down. They were attacked using the Mirai botnet, the Internet of Shit. And this affected a number of large internet brands that were their customers. So Netflix was affected, Twitter, eBay, PayPal, LinkedIn; not everybody was affected as badly as some, but for instance PayPal went down completely on the US East Coast. This is an illustration of putting all your eggs in one basket: if you outsource your DNS to a company to protect yourself against DDoS attacks, and they suffer something that they can't process, you go down as well. So what does that do to people? Well, we looked at our dataset in the aftermath of the attack, and there is quite a dramatic change. The red line shows customers that exclusively use Dyn's services: if you look at their NS record set, they will exclusively have Dyn NS records. The blue line shows people that don't just have Dyn NS records, but also have NS records for other operators. The black line is the day of the attack, and as you can see, there is a dramatic change. If you read the scales on the graphs, you can see it's actually only a few percent change, because the scale is different on the left and on the right; but I wanted to put this in one figure so it's easier to see. In total it's about 5,000 domain names that changed from exclusive use to non-exclusive use. But those 5,000 domains contained lots of big internet names, such as Twitter, Netflix and PayPal, but also some other companies; and this talk was called Digging in the DNS, and a large manufacturer of earth-moving equipment is one of the companies that changed.
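The red/blue split just described rests on a simple classification of each domain's NS record set. A minimal sketch, assuming Dyn name servers are recognizable by a dynect.net suffix (an assumption about naming made for illustration, not a documented rule of the dataset):

```python
def classify(ns_set, operator_suffix="dynect.net."):
    """Label a domain by whether its NS set is entirely, partly, or not at
    all hosted by the given operator (suffix match on the NS names)."""
    at_operator = sum(1 for ns in ns_set if ns.endswith(operator_suffix))
    if not ns_set or at_operator == 0:
        return "other"
    return "exclusive" if at_operator == len(ns_set) else "non-exclusive"

print(classify({"ns1.p10.dynect.net.", "ns2.p10.dynect.net."}))  # exclusive
print(classify({"ns1.p10.dynect.net.", "a.ns.example.org."}))    # non-exclusive
print(classify({"a.ns.example.org."}))                           # other
```

Running this classification over every domain's daily NS measurement gives the exclusive and non-exclusive populations plotted in the graphs.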
We studied this a little more, because we wanted to learn what an attack like this does to a company that provides these kinds of services, such as Dyn, and whether it cost them any customers. The top graph covers the period from October 1st until December 31st of 2015. The bottom graph is the same period in 2016, so that's the period in which the attack takes place. At the top, you can see that very few people actually leave Dyn; their customers are quite loyal. In the same period the year before the attack, we saw only one big change, and that was a company called Zalando. They sell shoes and other stuff, leather bags. And they had lots of typosquatting domains, sort of defensive registrations of typosquatting domains. For some reason they were using Dyn for those, as if somebody was going to attack their typosquatting domains; I have no clue why. But they left Dyn. Now, as you can see, the day after the attack quite a few people decided to completely ditch Dyn and go away, and you see a little bit of an after-effect following the attack. But this is not a trend that continues, so it doesn't really hurt Dyn as a company. People still stay with them, and new customers still sign up. So let's look at new customers. We wanted to see whether new customers are exclusive users or non-exclusive users. The top graph is over the entire dataset, until the middle of last month. Orange is new customers that are non-exclusive; blue is new customers that are exclusive. And there is only one major event of a new customer choosing to be a non-exclusive customer, and that is the good adult content providers of Pornhub. Do I have to explain the irony of Pornhub going non-exclusive, or can you do the math yourself? The aftermath of the attack is in the bottom graph, and what you can see there is that we see no evidence of a significant change in the behavior of new customers.
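Tracking these customer populations over time amounts to diffing daily classifications of each domain; a domain that changes category between two snapshots shows up as a switch in the graphs. A sketch with hypothetical snapshots:

```python
# Hypothetical per-day classifications (domain -> label); in the real
# pipeline these would come from the daily NS measurements.
day_before = {"bigbrand.example": "exclusive", "shop.example": "exclusive"}
day_after  = {"bigbrand.example": "non-exclusive", "shop.example": "exclusive"}

def switches(before, after, frm, to):
    """Domains whose label changed from `frm` to `to` between snapshots."""
    return sorted(d for d, label in before.items()
                  if label == frm and after.get(d) == to)

print(switches(day_before, day_after, "exclusive", "non-exclusive"))
# -> ['bigbrand.example']
```

Counting these switches per day, in both directions, produces the time series discussed next.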
So new customers still almost always decide to use Dyn exclusively, rather than mixing Dyn with another operator. Okay, what about people switching from exclusive use to non-exclusive use? Blue means people switched to non-exclusive use; orange means people switched to exclusive use, so they were non-exclusive and went exclusive. It's kind of hard to read on this screen, but there are two takeaways. There are two major events of people switching from non-exclusive use to exclusive use, and almost all of these are people fixing an error in their DNS configuration, where they had a .local NS record in their NS set and the rest were all Dyn. Don't ask me why. They fixed that, in two separate events. Above the x-axis you see the blue, and in the top graph, which is the entire dataset, you can clearly see that after the attack in October last year there is a major change: lots of people switching to non-exclusive use. And if we zoom in a little, and this is the bottom graph, what you can see is that there is a real trend change. There are still people switching from exclusive to non-exclusive use today, in the aftermath of the attack. So somehow people did pick up on this, and they are changing their behavior. Whether or not that partly has to do with the fact that Dyn was acquired by Oracle, I don't know, but it might play a role. So what are the takeaways from this fourth example? Well, my goal is not to bash Dyn, because I know people that work there; they're good people, and it's a company that has existed for a long time. This can happen to even the largest providers. If you say you prevent DDoS attacks, you also paint a target on yourself, right? If you're a CloudFlare, or a Dyn, or a Prolexic, or an Akamai, or whatever, you paint a target on yourself; everybody can attack you. And we have other examples of big providers going down, either through mismanagement or attacks. I think it was early this year that
the Amazon S3 service went down. OVH, a very large infrastructure service provider, suffered over one terabit of attack traffic and went down. But a takeaway here is that the internet was, of course, designed to be distributed, and we break that assumption by putting all our eggs in one basket and going to a few of these large providers that we outsource our services to. And this is not just outsourcing DNS to Dyn; this is also outsourcing your email to Google or Microsoft. We are breaking that assumption, and as an internet community, this is something we need to think about, because many of the original assumptions on which the internet was built no longer hold in today's commercial internet market, and these kinds of things can happen if you break those assumptions. Right. So what are we doing with this data, and how are we continuing to work with it? Well, one of the things we want to do is more proactive threat detection, like we did with the snowshoe spam example and the crafted-domain example. We actually managed to get funding for a PhD position, and we've already hired a student who is starting in the third quarter of this year. We are also working on another project proposal; there is a joint call for cybersecurity proposals that closes at the end of August, we're submitting something to that, and we hope to fund another PhD position in the Netherlands with it. I have a student from the Eindhoven University of Technology finishing a project in which we want to measure the adoption of secure email practices, such as using STARTTLS and DANE. And later this year, we're going to do some cool visualizations based on this data that we hope to put up on our website. Because, of course, we're academics, so we have a really boring HTML-only website with text, and we want some nice figures on there so people actually get a feeling for the power of this kind of data. If you are interested in data access: we share data with other academic researchers.
We have to be a little bit careful, because we sign contracts to get some of these TLD zones, and while we are allowed to use the data for research, we are almost certainly not allowed to use it for commercial purposes. Some of the data we have is already open access, and you can get to it through our website. And we can always find a middle ground if we can't give you access to the data itself: we can give you access to public datasets, and then run queries on your behalf once you've tuned your queries. If you want to learn more about our project, please visit our website. We published a paper about the design of the system last year, which you can find through the URL, but you can also find a link on the website. If you're interested in how we designed the system, look up the paper and give it a read. Send me an email, find me on Twitter, find me on LinkedIn if you have any questions. That was my talk, and if you have any questions, now is the time to ask. Thank you very much, Roland. If you have questions, please line up at the microphone in just a second. If you're leaving the room, please be quiet; the acoustics in the tents are really awful. If you haven't noticed, outside it's raining, so I would suggest staying and asking questions. First question, please. Yeah. Do you have a list of top-level domains for which you do not have the possibility to get the entire zone file? For instance, do you also get the .mil domain, or any others? Which one? The .mil, US military. No, we don't get that one. But we do actually have most of .gov, because that is open data. So we have .fed.us and .gov, because those are published as open data, because Congress forced them to do that. .mil, obviously, we don't have. And you're grinning, so you have the data? What? No. For information: maybe there are other top-level domains that will not give you the entire top-level domain. That's true.
So actually, the hardest parties to get data from are the country-code top-level domains, the ccTLDs, because the generic TLDs have to have a means to share their data; this is a requirement from ICANN, so with most of them we can actually set up a contract. But the ccTLDs have their own bylaws, and especially the European ccTLDs are reluctant to release their data, because they say, well, it's privacy-sensitive and we can't release it under European privacy regulations. I would argue this is not true, because we do have data from a number of European ccTLDs. But it's the ccTLDs that are the hardest to get data from. So if you know people at a ccTLD, or you work at one and you want to give me your data, please, let's talk later. Okay. Next question, please. What's the reason you're not collecting PTR records, as in reverse DNS? Okay, yeah, that's a good question. There are basically two reasons for that. A, there are already projects that do this at a large scale. For instance, the people that develop BIND, so ISC, run a yearly PTR scan, and that data doesn't change very frequently, so a yearly scan that they publish is enough. And B, right at the beginning of my talk I told you I'm operationally responsible for SURFnet's DNS infrastructure. There are people running very, very bad PTR scans that show up in my dataset, because we have most of a /8 in IP space. These people are really annoying, and I don't want to be one of these annoying people. Okay, thank you. Next question. Thank you. In your graph showing your query volume, it seemed as if you were not doing any queries at midnight. So is there a reason why you don't do 24/7 scanning? That's a very good question; I didn't actually explain that. Well observed. So we want to do a measurement for every domain once every 24 hours. That means we start our queries at midnight UTC, but we also want to finish our entire measurement before the next midnight UTC.
So we leave some extra space where we're not sending queries, or we might be sending one or two batches because there were changes to the zones during the day. But our querying completely stops before midnight and then restarts at midnight, which also makes us very recognizable if you're receiving traffic from us, because we will start exactly at midnight UTC. Okay, thank you. Next one. Yeah, thank you. So if I understood it correctly, you look at domain names just below the top-level domain. So what happens if you have a hoster in between, which has clients that register domains with that hoster for some malicious content, or one malicious domain which only hosts its malicious content in a subdomain? Do you look at those as well, or do you think that's not an important thing to look at? No, that's a good question. So you understood correctly that we only go for the second-level domains under a TLD. And we ask for the www label for A and quad A records, but we don't go into the third or fourth or whatever level, unless there is a CNAME there, in which case we will expand it and follow it. What we also do is run some measurements where we take hit lists. These are, for instance, RBLs with longer domain names on them that we will also measure. But the trick there is that we have to add value with the measurement that we do; we want to measure something that somebody else isn't measuring already. And these third- or fourth-level domains often show up in passive DNS setups. We can use those as a hit list to do a measurement on our end, but it does have to make sense. So what we're doing now is doing this for a few RBLs to see if we observe changes, for instance when there's a takedown. If we observe changes, that is interesting. But of course, we have no way to work out what exists in the namespace if we don't know what to look for.
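The measurement rules just described, a fixed set of types at the second-level apex, address records only for the www label, and nothing deeper, could be sketched as a per-domain query builder. The exact list of apex types here is an assumption for illustration, not the project's actual configuration:

```python
# Assumed apex record types; the real measurement may query more or fewer.
APEX_TYPES = ("SOA", "NS", "MX", "TXT")

def query_list(sld):
    """Build the (name, type) queries for one second-level domain:
    a fixed type set at the apex, A/AAAA only at the www label."""
    queries = [(sld, t) for t in APEX_TYPES]
    queries += [("www." + sld, t) for t in ("A", "AAAA")]  # no deeper labels
    return queries

for name, rtype in query_list("example.net"):
    print(name, rtype)
```

Keeping the query set fixed per domain is what makes the daily traffic volume predictable enough to finish the whole TLD before the next midnight UTC.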
And what we definitely don't want to do is some sort of dictionary querying or brute force, because then we're going to generate traffic volumes that are just too big. Thanks. We still have a few minutes left, so if you have any further questions, please do come up to the microphone. And if you do, please move rather close to the microphone so we can understand you better. Closer, closer, closer. Yes. I'm not 100% familiar with all the intricacies of the top-level domain system, but what happens if you do the queries from inside China? Do you see different things? Is that possible? Because you only measure from the Netherlands, which is quite safe, hopefully. It should be, yes. Now, this is a good question, and I'm going to make it a little more general: why do we not measure from multiple vantage points? Two answers to that. Answer number one: we already generate quite a lot of traffic, so the more vantage points we set up, the more traffic we generate, and we want to be a little bit careful with that. There are people that claim that a large part of internet traffic is already scans, and we don't want to add more to that. We did actually set up a secondary vantage point in the US to do a measurement, together with CAIDA at the University of California, San Diego. And one of the things we want to look at is what differences we see. Now, obviously, you have things like CDNs that will give you a response based on your geolocation, but there is already a lot of research going into working out how these CDNs work, so that's not a goal of our measurement. And to answer your original question, what if you do this from China? Well, from China you'll hit, of course, the Great Firewall, which will block certain domains and return you whatever data the Chinese government puts in there.
There are ethical issues with those types of measurements, because where do you then do this measurement? Do we do this with our Chinese colleagues who operate the research network there? What do we expose them to if we start sending lots of queries for stuff that their government doesn't want them to query? So we try to steer clear of that. Actually, we did another measurement last year where one of the things we did was a scan for open resolvers. Our scan hit the Great Firewall, and we got some responses back from that, and they send back really weird stuff: you send them a quad A query, and they'll send you an A record back. So I have no clue what they're doing. I try to steer clear of that for ethical reasons. Thank you. Okay. He hasn't asked the question yet. If you observe records with a low TTL that change every day, do you want to scan them more than once a day to get better measurements? Yeah, that's a good question. The answer to that: we use a resolver in between, and the cache of the resolver ensures that if something is cached, it is cached for at most the TTL; once it expires, we can query it again. We don't store the TTL; what we want is a predictable dataset. So if the TTL is short and a record might change more often, then of course we could query more often, but how do you scale that? Because if there is one thing that is observable in the DNS, it's a trend of TTLs going down. There's a paper from two years ago with a study that looked at the TTLs observed in A and quad A records, and this was a repeat of a study done in the early 2000s. In the early 2000s, the median TTL for A and quad A records was about an hour; when they repeated the study two or three years ago, the median was 60 seconds. We can't feasibly send queries every 60 seconds; we have no way to make that scale, and I don't want to break the internet with my queries. We have time for one or two more quick questions, if there are quick answers.
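The resolver-cache behaviour described in that answer, an entry is served until its TTL runs out and only then re-fetched, can be illustrated with a tiny cache driven by a manual clock (all names and values hypothetical):

```python
class TTLCache:
    """Minimal TTL-bounded cache; `now` is an explicit clock in seconds."""
    def __init__(self):
        self.store = {}                       # key -> (value, expiry time)

    def put(self, key, value, ttl, now):
        self.store[key] = (value, now + ttl)

    def get(self, key, now):
        entry = self.store.get(key)
        if entry and now < entry[1]:
            return entry[0]                   # still fresh: served from cache
        return None                           # expired or absent: re-query

cache = TTLCache()
cache.put("example.net/A", "192.0.2.1", ttl=60, now=0)
print(cache.get("example.net/A", now=30))     # 192.0.2.1 (cached)
print(cache.get("example.net/A", now=90))     # None (TTL expired)
```

Putting a caching resolver between the prober and the internet gives exactly this effect for free: within a TTL window, repeated probes never reach the authoritative servers.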
Anyone? Any takers? Come on. And close to the microphone, please. Yes, you have to bend down. Recently, I stumbled upon a Twitter user called DNSStream, who regularly informs the community when a DNS record changes, or when there's a random-subdomain attack on some DNS domain. Do you know who that is? Or is it, by any chance, you and your project? Oh, okay. So it's not me. I follow that account as well, and I have a hunch who it might be, so I can tell you in person, but I'm not sure. So, last call for questions. We can fit another one in if anybody's interested. No? Then I would say thank you very much, Roland. Please give him a warm round of applause.