Hi, everyone. My name is Roy Arends. I work for ICANN. I'm going to talk about the root zone KSK rollover, and specifically the process we did on the 11th of January: revoking the DNSKEY, the root zone KSK, specifically the old one. So who has not heard about the key roll? Okay. Who's not Warren? So I'll go quickly through the first few slides, because the first parts are fairly old news.

It happened on the 11th of October at four o'clock exactly. This is the schedule we had during that day. You can ignore most of this; I highlighted the part that says four o'clock, Verisign approved, root zone file push. At that point we pushed the root zone file. Well, Verisign pushed the root zone file. Anyway, this is the team that helped; this is at NLnet Labs. Benno, thank you all. And Willem, thank you. And this is the team that did it, the Verisign and ICANN team.

I put these two pictures in there because this is Matt Larson. He worked at Verisign in 2010, when the root zone was signed for the first time. Keep note of the small pack of paper he has in his hand: they printed out the root zone, and it's a very small pack. Eight years later, this is Matt again. Time works well for him, except he doesn't change shirts. And he's holding the new root zone, which is about 450 pages. Now, before anyone says, oh, you're going to waste paper on this: no, we didn't. We just printed out the first two pages and had a pack of 400 blank ones under it. So, fake news.

So when this happened, we had some tools to look at what was going on. As you know, we have 13 root server letters, A through M; those are 12 organizations. Eleven organizations gave us data, so we had A, B, C, D, E and F, and then H, I, J, K, L and M root. And I'm not going to tell you the missing one. G, that's hard to guess.

So what you see here is, it's not really that visible, but the top left graph, note to self, just one graph will do. The top left graph shows you the accumulated DNSKEY traffic: queries for DNSKEYs for the root. Not queries for anything else; queries for DNSKEYs for the root. Now, why is this interesting? Well, from testing we know that if you haven't configured a resolver with the new KSK, and you don't have some automatic mechanism in place, like RFC 5011 or other tools that help you replace that key, to add the new key, then the resolver goes berserk. If it doesn't have the new key configured, it can't validate. If it can't validate, it will ask for the DNSKEY. It can't validate that DNSKEY either, so it can't cache anything. So in order to keep going, it will ask for the DNSKEY again, and so on and so on. In the old days this was called Rollover and Die. A couple of years ago, this was 2008, 2009, this was a big problem, and it has been solved since. But still, if a DNS validator can't validate this stuff properly, it will ask for more DNSKEYs. So it's kind of an oracle, a hint in the data to see what's going on.

This is just before the roll, on the 10th of October, and this is just after the roll. You can't really see it, but before, we had about 1,400 queries per second, 1,400 DNSKEY queries per second. After the key roll, on the 13th or 14th of October, 48 hours after the roll, we had 2,500 queries per second. Now, that is an uptick, so that means some resolvers were misconfigured. Now, how to find those?
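Before getting into that, a quick concrete look at what that oracle traffic is actually asking for. A minimal sketch, assuming dnspython is available; it fetches the root DNSKEY RRset, the thing a stuck validator requests over and over, and prints each key's tag and the flag bits that matter in this story (19036 is KSK-2010, 20326 is KSK-2017):

```python
# Minimal sketch (assuming dnspython): fetch the root DNSKEY RRset,
# the records a stuck validator re-queries, and print each key's tag
# and the relevant flag bits.
import dns.resolver
import dns.dnssec

answer = dns.resolver.resolve(".", "DNSKEY")
for key in answer:
    tag = dns.dnssec.key_id(key)        # RFC 4034 key tag, e.g. 20326 for KSK-2017
    sep = bool(key.flags & 0x0001)      # SEP bit: set on KSKs
    revoked = bool(key.flags & 0x0080)  # RFC 5011 REVOKE bit, used in January
    print(f"key tag {tag}: SEP={sep} REVOKE={revoked}")
```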
So what I did was, and a little backstory here: ICANN also operates L-Root. The actual name is now IMRS, the ICANN Managed Root Server, but folks still call it L-Root. I get all the data out of it, my team gets all the data out of it, all the traffic we receive. So we can look at interesting things.

So we looked at October the 10th and at October the 14th. On October the 10th, we took all of the resolvers that asked for DNSKEYs and we counted how many queries each sent. We did the same for October the 14th. Some numbers: about 1 million unique resolvers asked for DNSKEYs over four days. We saw about 115,000 that queried both on October the 10th and on October the 14th, but only 85,000 resolvers that we saw every day. The reason I'm telling you this is that not all resolvers talk to L-Root. There are 13 root servers, and what basically happens is: if you can't resolve here, if you can't get the right DNSKEY, then depending on your server selection algorithm you either go randomly to another root server or you take the next fastest one.

So we tracked each of those 115,000 and we get the following graph. A fairly nonsensical graph, so it takes a bit of explaining. X axis this way: October the 10th. Y axis that way: October the 14th. Each individual dot is a validator. You have a volume on October the 10th, you go this way; a volume on October the 14th, you go that way. So you basically have a coordinate. All of the dots on this green line, I'm not sure if you can see it, are on the diagonal. That basically means: if your resolver is on the diagonal, that's good, because the volume you had on October the 10th is the same as the volume you had on October the 14th. So that's all good.

The reason for these lines is that this is a logarithmic scale, a log scale. So if you happen to be here, this is 14, as you can see. If you take e to the power of 14, you get, I think, about 10 million. So if your dot were here, and not many are, you're asking 10 million times for the DNSKEY on that day. So this up here is a whole lot.

The interesting ones for me are the ones down here. These queried a lot on October the 10th, but hardly anymore on October the 14th. So they were either switched off or they were fixed; and to appear in this data at all, they must have asked at least once on October the 14th. And of course, if you take the log of one, you get zero. That's why they're on this line.

The problem space is basically that top triangle. If your resolver is up there, you're asking for more on October the 14th than on October the 10th, so that's likely a misconfiguration. I've highlighted this blob here. It's a fairly significant amount, and it signifies a two-order difference: here is six, up there is eight, so the difference is two orders. If you're two orders larger on this log scale, that's about 100 times, or 80 times if I'm not mistaken, the number of queries you sent.

There's a high concentration there, and we think we know what it is. Ireland is up there: Eircom. And what is now well known, although we haven't had positive confirmation of this, is that Eircom had misconfigured resolvers. Eircom is, well, I don't know the equivalent here in Belgium, but in England you have BT, the de facto national ISP. The same for Ireland, that's Eircom. So their national ISP, if you will, went offline.
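To make the method concrete, here is a minimal sketch of that two-day comparison. The file names are hypothetical; the real input was per-query source data out of L-Root:

```python
# Count DNSKEY queries per source on each day, then place every
# resolver at (log(volume on Oct 10), log(volume on Oct 14)).
# Dots on the diagonal behaved the same on both days.
import math
from collections import Counter

def volumes(path):
    # hypothetical input format: one line per DNSKEY query, source IP only
    with open(path) as f:
        return Counter(line.strip() for line in f)

oct10 = volumes("dnskey-sources-20181010.txt")
oct14 = volumes("dnskey-sources-20181014.txt")

both = oct10.keys() & oct14.keys()   # the ~115,000 seen on both days
points = {src: (math.log(oct10[src]), math.log(oct14[src])) for src in both}
# well above the diagonal: more queries after the roll, the problem triangle
```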
And if you followed Twitter at the time, and we did follow Twitter at the time, you got a lot of heartbreaking stories. My kid is trying to get online, he has autism, he needs his thing, it doesn't work, Eircom, what the hell is going on? It got very negative; heartbreaking stories there. I just talked to Warren Kumari before this talk, and he says there were a lot more problems. Geoff Huston at APNIC found a lot more issues than this. But we know of the Eircom one through our own data.

This is what I just explained, so I'm going to skip that. This, to me, is interesting: zero change between the 10th and the 14th. That's the bulk of the resolvers. They were there, and that's good. For those resolvers that are asking for the DNSKEY, they don't ask more on the 14th than on the 10th. So that's great. Minus one and plus one are also not that interesting; that's a one-order change. Now, if you take the log of nine and the log of 11, those land in two different orders, but nine versus 11 is within the error of the data, right? As I said, not all resolvers ask us all the time, so there can be some difference there.

The interesting part is here: minus two, minus three, minus four, minus five, minus six. They sent a lot of queries on the 10th and far fewer on the 14th. Which is kind of strange, because it means they had problems before the 10th, but not anymore on the 14th. So they fixed their implementation, somehow.

And I've heard of one very cute configuration. As you know, in resolv.conf you have, I hope, at least two resolvers in there. What one organization did, one ISP: they configured the first resolver with the old key and the second resolver with the new key. And in hindsight it's actually brilliant, right? Because the first one stops resolving, SERVFAIL, and whatever your stub implementation does, your laptop, your local code, it basically takes the second one. The first one gives a bad response, so it takes the second one, and that one works. So that was a cool hack, actually. I wouldn't promote doing this, but it worked for them.

And here you can see the real problem space. This is a two-order change, positive, positive meaning more. This is typically what we see when we configure BIND in the test environment without the new key and let the key roll: you see a two-order change in the amount of traffic. So this is what we expected for resolvers that don't behave. Here's some more information, but I want to get to the revoke part, because all of this was just the roll on October 11th.

Three months later, we decided to revoke the key. This is something else: this is RFC 8145 data. If you think this graph is nonsensical, you're absolutely right. It doesn't mean anything. We all looked at it like, oh, so many implementations report to us that they still have the old key. So what is RFC 8145? It's a trick where a resolver sends a query to the root servers, and in that query it basically says: I have these keys configured. And you might have the old key, you might have the new key. We see a whole lot of garbage in there. There are many implementations that just look at the root zone and grab all the keys, even the zone signing keys, so you get the key tags of those in there as well. But yeah, this really wasn't helpful for us.
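Since the RFC 8145 signal is just a specially formed query name, decoding it is straightforward. A minimal sketch, with the log-reading part left out as an assumption; the hex labels 4a5c and 4f66 are the key tags 19036 (the old KSK) and 20326 (the new one):

```python
# RFC 8145 key-tag signaling: a validator queries for a name like
# "_ta-4a5c-4f66." at the root, one 4-hex-digit label part per
# configured trust anchor.
OLD_KSK, NEW_KSK = 19036, 20326   # 0x4a5c and 0x4f66

def key_tags(qname):
    body = qname.rstrip(".").removeprefix("_ta-")
    return {int(part, 16) for part in body.split("-")}

# source address -> union of reported key tags over the day,
# filled from query logs (not shown here):
reports = {}

old_only = sum(1 for tags in reports.values() if tags == {OLD_KSK})
if reports:
    print(f"{100 * old_only / len(reports):.1f}% report only the old KSK")
```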
The problem is, if you are really concerned and you want to read something into this graph anyway: this black line says 5%, and the green and the red correspond to this axis, 50,000, 100,000. That's the number of sources, the number of unique sources that report. So it's not query volume; it's the number of unique sources that report stuff to us. The green is everyone who is sending information, all the resolvers. The red is the ones that only report the old KSK. And the black is basically the percentage of that. The idea was to get below 0.5%. Well, this is 5%, again an order larger. But yeah, we realized that this really doesn't mean anything, and luckily a lot of people understood that it was meaningless.

Oh, this was the KSK rollover design team recommendation. Yeah, so this was the design team recommendation: they said half a percent. When we made the implementation plan, we went by that design team, and the design team is not ICANN; the design team is the community who designed this stuff, and we at ICANN get to implement it. We didn't copy the 0.5%, because we didn't know where it came from. I think someone just put a finger in the air: oh, let's do half a percent. Ah, Geoff, okay. Anyway, so that's where the half percent came from.

At the time, and I say at the time because now we know more, but at the time, during those three or four days after the roll, we heard about basically two big outages: Consolidated Communications, an ISP in Vermont, and Eir. We had a lot of little issues. We contacted a lot of developers before this happened, and also during, when we saw issues. Thank you, developers, for stepping up to the plate and fixing everything that fast.

The second part of this talk is about the revoke. What is a revoke? Basically, we publish the DNSKEY with a bit set that says: revoked. RFC 5011, which dictates this, is fairly complex, but in short: if a resolver sees a key with that bit set, it must not use that key anymore. The revoked key is self-signed. That also means that the DNSKEY set, including the signatures that come back, that whole set, is now slightly larger. We went from, I think, 1,425 bytes to 1,515 bytes. Now, you might think: what, 90 bytes? That's not a big problem. And it's not, unless you do IPv6; then you have this magic threshold of 1,480 bytes that's going to be problematic.

Anyway, let's do the same thing I did before. You see the top left graph, and you see the red line starts at about 2,500 there. That's still scaled from last time, because it was kind of a nice baseline, 2,500 queries per second. But you see this little diagonal going up? That was the 48 hours after the revoke. That's actually a massive amount: it went from 2,500 to 15,000 queries per second. And we all thought, me included, that just setting this revoke bit would be harmless. I mean, if you don't do DNSSEC correctly, you would have known by now, because you're validating, right? You would have fixed your stuff, otherwise you wouldn't have been able to resolve anything since October 11th. But yeah: a significant amount of traffic.

So I did the same trick. Look at the dates here: this is 14 October, just after the key roll, to 14 January, which is the day after the 48 hours of the revoke. And you see, they have been fixed since. Eir is in there as well; they have been fixed since. Now, thinking about it, that's not really a fair comparison, right? Just after the key roll until just after the key revoke. So I looked at this instead: 11 October, just before the key roll, and the day after the key revoke. And all is fine, right? The bulk of the resolvers are on this middle line. And yes, we can still see a fair amount here and a fair amount there. I reckon those will still be there in 10 years, because no one knows whether these things are working or doing the wrong thing.
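By the way, that response-size jump is easy to check from the outside. A minimal sketch, assuming dnspython; 199.7.83.42 is l.root-servers.net:

```python
# Ask L-Root for the root DNSKEY RRset with DNSSEC records included,
# and print the wire-format response size, the number that grew from
# roughly 1,425 to 1,515 bytes when the revoked key was published.
import dns.message
import dns.query

query = dns.message.make_query(".", "DNSKEY", use_edns=0,
                               want_dnssec=True, payload=4096)
response = dns.query.udp(query, "199.7.83.42", timeout=5)
print(len(response.to_wire()), "bytes")
```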
But where is this massive amount of traffic then coming from, right? Where you go from 2,500 to 12,000, or 15,000. So then I took the day before the revoke and compared that with the day after the revoke. And there it is. You see this massive shift of resolvers in there. This is really an enormous amount. I highlighted it, in case you didn't notice.

So remember, this one goes up to 12,000; this was the day after the revoke. Now we're about three weeks further on, and this one goes up to 16,000. So after the key roll in October, sorry, during the key roll in October, we went from, let's say, 1,400 to 2,500, then flatlined a bit until the key revoke, and then went from 2,500 to 12,000 in two days. And now, over three weeks, it has gone to 16,000. I don't want to say it flatlined, but it only went up a little bit. Exactly.

This also kind of shows query volume per root server. If a root server operator has very little anycast, it won't get queried that much; that's just how it works today. I have a very similar graph of total traffic, because I've got those statistics as well, except for that one root server operator. And the more anycast you have, the closer you are to the community you want to reach, et cetera, et cetera. The bulk anycasters: ICANN happens to be one of them. This is L, IMRS. So this one is quite high up. A is quite high up, J is quite high up, K is quite high up. So it's not the same everywhere.

But if I go back here, to this graph: everyone has the same trend, right? But because we use a static scale everywhere, otherwise you can't really compare, every scale goes up to 2,500 here. If you have little traffic, the impact is harder to see. So you can see this one still going up; this is M-root. This one going up massively; this is K-root. H-root is the exception. H-root, we know it's very hard to contact them, the U.S. military. You don't want to bother them too much; they have guns. No, but seriously: we just don't get all the traffic there.

I'll really go quickly now, I just got the sign. That same report: we're now at basically 2%. 2% is good, even though it doesn't mean anything. Upcoming milestones: we do this again on the 22nd of March, when we take the revoked key out of the zone. It won't be there anymore. Hopefully everything stabilizes, but probably something goes berserk again. And these are ceremonies, this is maintenance: we still need to get the old key out of the HSMs, so we'll do that.

Then this is a small pitch: we need your input on the following. For the next 5 to 10 years, we need to know the frequency of key rollovers. I work at ICANN; I've been told I can't think for myself. We are heavily dependent on the community here. So I can't suggest what these numbers should be, but I do want to suggest that we need to think about it and come up with some policy development around this. How often are we going to do this? We also need a standby key; then we are compliant with RFC 5011. And we need to think about algorithm rollover. I'm not saying we should do an algorithm roll tomorrow. I know some folks want to do an algorithm roll tomorrow, but we should think about algorithm rolls in case RSA gets weak.

That's my talk. Thank you. Any questions? One question. Pause.
We have two competing theories. One is: could it be IPv6, because of the increase in response size? We now know that the bulk of these were IPv4, so no. And the other is that these are configurations where you have trusted-keys in BIND. Static trusted-keys, not managed-keys, but static trusted-keys in BIND. When you do the reverse lookup of these addresses, you get the host names of these resolvers, and a fair number have the number two in them. I think that's because when you have two resolvers in your resolv.conf, you name the second resolver with a two. Now, if the first resolver works, well, if it works, don't fix it, right? So if it works, it works; you don't have to do anything. And folks forget that they have a second resolver configured as a fallback that might not be working. So we think it might be due to that. But we honestly don't know. We need more data. We need to have a look at it. Thanks for the question. Thanks, Roy.
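For what it's worth, the reverse-lookup heuristic mentioned in that answer is simple to reproduce. A minimal sketch, with a hypothetical example address:

```python
# For each misbehaving source address, look up its PTR record and
# check whether the host name contains a "2", hinting at a forgotten
# secondary resolver (ns2, dns2, resolver2, ...).
import socket

def looks_like_secondary(addr):
    try:
        hostname = socket.gethostbyaddr(addr)[0]   # reverse (PTR) lookup
    except OSError:
        return False
    return "2" in hostname

# e.g. looks_like_secondary("192.0.2.53") might map to ns2.example.net
```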