Hi, folks. My name is David Benjamin. I work on TLS at Google, particularly our TLS library and also TLS's usage in Chrome. So this talk is not going to have any crypto in it. Rather, it is going to be about the troubles that come up when we try to add more crypto to things, and all kinds of bugs in implementations. Or: why your crypto isn't real world yet. I considered titling this "Why my life is hard," but I decided that might have been a little bit too much.
All right. So, as in the previous talk, suppose we either have a new protocol that's been analyzed, or we find some bug that we need to patch, and we would like to go and fix the problem in TLS or make it better or whatever. The first problem we run into is that the internet doesn't update atomically. We can't just wake up tomorrow and find that all of our mistakes have been erased. So, at least for a transition time, new clients with the fix need to account for old servers without the fix and vice versa, at least until we've gotten everything updated and then we can remove it.
And so we have four easy steps to change TLS. First, we go and add the new thing, but we carefully design it so that old connections still work, and we upgrade the ones we can. Then we wait, you know, a short while, remove the old thing, and rejoice. We go home and I can leave the stage now.
No. Everything on the slide is a lie. This is actually a huge pain. TLS is this vast ecosystem with lots of different implementations with varying degrees of quality and ability to update things. As a result, we have a very hard time removing things, and even adding things ends up being rather difficult, much as we would like to be able to make things better very quickly.
So, if you remember no other slides from this talk, remember this one. It's pink. There's another pink slide later on that's also important. There was a template and it made things pink.
The universal law of users is that the last thing that changed gets the blame. This is how every single user of any software works. And I don't mean this as a disparaging comment on users. I do this, too, and I'm sure everyone else does, too. If Chrome updates and suddenly everything is green, I blame Chrome and not my graphics driver, even though it might be the graphics driver's fault.
And this means that when we go and break things, either when we remove things or deal with some buggy server, even though the server is the one that has this 15-year-old cryptography or they just completely implemented the spec wrong, we're the ones that have to deal with the fallout. And fallout usually comes in the form of angry emails and angry users and possibly someone telling me to go revert the change, and lots of political capital lost. And so while we do manage to break things, it's rather expensive and much slower than basically anyone would like, except for the people who don't want us to break anything.
As a result, TLS parameters are extremely long lived. SSL 3.0 was 15 years old before we got rid of it. That's kind of ridiculous. And we still have a whole mess of TLS modes that we know are insecure, we know they're broken, and we still haven't managed to get rid of them, much as it would be nice to do.
And so the result is that adding things is much easier than removing them. If you're not careful, your complexity budget quickly goes to zero: we keep adding lots and lots of little patches, and we have to deal with this quadratic product of every single combination. So you often find that implementers are a little bit uneasy about adding new things. We would like to make things better, but if it's an epsilon improvement and there is a major improvement that's about as much work, we would rather just go straight for the major one and not have as many intermediate steps.
And the other consequence of this is that downgrade protection is absolutely critical. What do I mean by that? Because we're stuck with this swath of history that we have to support in clients and servers for a long time, TLS includes a negotiation step. The client advertises a list of parameters it supports: the version, the cipher suites, various other things. And then the server processes this message and responds with a selection.
For this to work, we care about two properties. First, we care that this thing is downgrade protected. We're going to be stuck with things that we know are insecure for way too long, and that means both client and server will be supporting these. We need to make sure the attacker can't activate those code paths and that we can rely on the newer thing actually getting negotiated. This is done by some variation of taking the handshake transcript and stuffing it into the protocol, in the Finished message or the signatures.
The second property we care about is extensibility. This doesn't work if we can't actually add the new parameters. The way this looks is that the first message, with the advertisements, is kept compatible from version to version, and the protocol tells you what you should do if you see an advertisement you don't understand. Typically the rules are that the server must ignore things it doesn't recognize, just skip over them, and the client must not advertise anything it doesn't understand.
This should be relatively straightforward. None of this is particularly subtle engineering. And unfortunately it keeps going wrong. So removing things is naturally hard, but adding them also can be hard. And then my life is hard.
So first, as an example, TLS has had roughly four major versions so far. There's also SSL 2.0, but it looks sufficiently different from SSL 3.0 that I think it's reasonable to consider it a different protocol.
Whereas SSL 3.0 through TLS 1.2 are exactly the same. And the negotiation works like this: the client includes a maximum supported version, and the server computes the min of that and its own maximum version and sends a ServerHello. Everyone can take the minimum of two 16-bit integers. No one can possibly get this wrong.
Unfortunately, when my predecessors went to deploy TLS 1.0, we found that a bunch of SSL 3.0 servers rejected the TLS 1.0 ClientHello. They should have just negotiated SSL 3.0, but they would break the connection. And that meant that deploying TLS 1.0 in clients would break those servers, and the universal law of users says that would be our fault, not theirs. Since there were too many of them, browsers ended up implementing a version fallback. On any connection failure whatsoever, we just turn off TLS 1.0 and any other features that might freak out this server and see if it works this time.
And then this problem kept happening every single time. It turns out TLS 1.1 breaks some TLS 1.0 servers. TLS 1.2 breaks some TLS 1.1 servers. By 2014, Chrome was trying four times to connect to a server before it would give up and just show an error. This was a huge complexity nightmare. When you have these kinds of retries, they turn out to mask other kinds of bugs, and so other things got into the ecosystem. Don't do this.
It also turns out to have security consequences. SSL 3.0 is completely broken: by triggering requests in certain ways and fiddling with the ciphertext, you can decrypt the traffic. Ideally, we designed this thing so that this would be fine. Around this time, there was still a small fraction of SSL 3.0 servers, so those we would need to show angry warnings about and gradually phase out, but the majority of traffic had already updated by then. But the problem is the downgrade protection broke.
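To make the negotiation rule and the fallback workaround concrete, here is a minimal Python sketch. It is only an illustration: the function names, the exception, and the retry order are hypothetical and not taken from any real TLS library. The only real detail is the version numbering, which follows the TLS wire encoding (0x0300 for SSL 3.0 up to 0x0303 for TLS 1.2).

```python
# Version numbers as they appear on the wire (SSL 3.0 through TLS 1.2).
SSL3, TLS10, TLS11, TLS12 = 0x0300, 0x0301, 0x0302, 0x0303


class HandshakeFailure(Exception):
    """Hypothetical stand-in for any connection failure."""


def correct_server(client_max, server_max):
    # The intended rule: pick the minimum of the two maximum versions.
    return min(client_max, server_max)


def intolerant_server(client_max, server_max):
    # The bug described in the talk: reject a ClientHello that advertises
    # a newer version instead of negotiating down to what we support.
    if client_max > server_max:
        raise HandshakeFailure("unrecognized version")
    return client_max


def connect_with_fallback(server, server_max,
                          versions=(TLS12, TLS11, TLS10, SSL3)):
    # The browser workaround: on any failure, retry with a lower maximum.
    # Returns (negotiated_version, number_of_attempts).
    for attempts, client_max in enumerate(versions, start=1):
        try:
            return server(client_max, server_max), attempts
        except HandshakeFailure:
            continue
    raise HandshakeFailure("all attempts failed")
```

With a correct server, a single attempt is enough whatever the server's maximum is. With the intolerant SSL 3.0 server, the sketched client needs all four attempts, which matches the four tries mentioned above.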
Because we had this external fallback, the attacker can just close every connection that has a version that's too new for it, and the browser would happily turn versions off until it arrived at SSL 3.0. Then we would trigger the SSL 3.0 code, even for those newer servers that should not have been vulnerable to these bugs. So this means that if your client has a version fallback, you cannot rely on the security properties of anything newer than SSL 3.0. This is a disaster.
Happily, this is gone. As of 2016, Chrome no longer performs TLS version fallbacks. I will note that at this time, the fallback was 16 or 17 years old, and this still took a year's worth of fixing up metrics, yelling at people whose sites were broken, dealing with bugs that the fallback had masked, and lots and lots of angry users.
We also learned that vendors, TLS server vendors, don't really understand version negotiation, which is kind of surprising. The conversation would usually go something like: "Hi, your product doesn't implement this correctly. You should be computing the minimum of the versions. You're TLS 1.2 intolerant, et cetera. Please fix this." And the response is, "Oh, okay, we'll implement TLS 1.2 in the next version." And yes, please do implement TLS 1.2, it's better, but that's not actually the bug. The bug is that you are not computing this minimum of 16-bit integers, and so you're just going to cause this problem again with TLS 1.3, and then you'll fix that by implementing 1.3, and then 1.4 will be painful, and this cycle never ends.
Additionally, it's not just the version field that goes wrong. Basically, everything else in the protocol also goes wrong. ALPN is the way we negotiate which protocol to speak next. That made the ClientHello too big, because now we had to list a bunch of protocol names, and we found that a particular load balancer would start hanging when the ClientHello was bigger than 256 bytes.
This was kind of problematic. We thought the problem was that they had a fixed-size buffer, and unfortunately, TLS 1.3 is going to stuff even more things in the ClientHello, so this was kind of bad. But we were very lucky that one of this vendor's engineers posted to the mailing list that, actually, it's not that we have a 256-byte buffer. If your length is in this range, then we think you're speaking SSL 2.0 and get very confused, because the two-byte length has a 1 here, and they were distinguishing them wrong. So now clients will actually produce the ClientHello and, if its length is in the bad range, add a padding extension which pads it up to 512. This is ridiculous, but it unblocked our extensibility.
At some point later, I think this was when we added Certificate Transparency, we got some reports that some users couldn't pay their taxes because their tax website was hanging. It turns out this vendor hangs when the final extension has an empty body. If it's elsewhere in the list, it's fine. It's just that the final one can't be empty. I still don't know what they were doing, but we strategically reordered extensions, and it was fine.
Then we added a new modern ECDH curve, and some other servers and proxies broke. They forgot the default case when looping over the curve list. Thankfully, that one was rare enough that we could just break it. And then this happened again for RSA-PSS, which is added in TLS 1.3. It was actually the same implementation. They forgot a default in their switch case. But fortunately, we were able to break them too.
So the problem is that when this happens, we don't really have easy options. Ideally, we would prefer to just break these servers and try to get them fixed. The ecosystem is much healthier when we manage this sort of thing. But many vendors don't ship updates.
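The padding workaround described above can be sketched roughly as follows. This is a simplified illustration in Python, not the code of any particular client. It assumes the "bad range" is 256 to 511 bytes and uses extension type 21, the padding extension from RFC 7685. The function name is hypothetical, and the sketch ignores bookkeeping such as updating the enclosing length fields. Keeping at least one byte of padding is a choice made in this sketch so that the padding extension, if it comes last, is not itself empty.

```python
import struct

PADDING_EXTENSION_TYPE = 21  # padding extension, RFC 7685
HEADER_LEN = 4               # 2-byte extension type + 2-byte body length


def padding_extension(hello_len):
    """Return the bytes of a padding extension to append to a ClientHello
    of hello_len bytes, or b"" if the length is outside the bad range."""
    if not 256 <= hello_len < 512:
        return b""
    # Pad up to 512 bytes where possible. If the 4-byte header alone would
    # overshoot, still send one byte of padding (an assumption of this sketch).
    body_len = max(512 - hello_len - HEADER_LEN, 1)
    header = struct.pack("!HH", PADDING_EXTENSION_TYPE, body_len)
    return header + bytes(body_len)
```

For example, a 300-byte ClientHello gets 212 bytes of extension and ends up at exactly 512, while a 200-byte or 600-byte ClientHello is left alone.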
They might be, for instance, some fancy hardware load balancer, and due to its excessive fanciness, to take an update you must bring it offline for two hours, and so nobody wants to take their updates. And anecdotally, the folks who mess up these switch cases also don't build auto-updaters. If the bug is too widespread, breaking it becomes prohibitively expensive very fast. It's very hard to make the case that I would like to break 25% of connections on the internet. I don't think anyone is going to allow me to do that. The universal law of users gets in the way here. So while breaking things is preferable if we can manage it, our metaphorical breakage budget is rather limited. We can only break so many things at a time before people start getting angry.
Alternatively, we could work around the bug. This is usually what non-TLS folks will ask me to do, and it keeps our users happy. It's nice in the short term, but then there are long-term costs in protocol complexity. TLS is already a nightmare, and we're going to make it even more of a nightmare. And as we saw with the fallback, sometimes workarounds have security consequences, which is extremely unfortunate when we have this nicely analyzed protocol and then we just break it all from the outside.
So ideally, we could prevent this from happening in the first place, and it makes sense to look at what is actually going on here. Specifications are written and consumed by humans, and humans are not very good at this. We make mistakes all the time. The only thing that keeps the internet working at all is some form of natural selection. If your implementation breaks with existing stuff, you will not last very long. The users won't want to use you. Hopefully, your QA folks will yell at you first, and possibly even when you're testing things on your machine, you'll notice, oh, hey, this doesn't work in insert-browser-here. Conversely, the general rule of thumb is: it works in my browser, ship it.
In standards circles, this is known as interoperability testing. This is all interop testing is. Interoperability testing does not catch all bugs. Problems with extension points are usually latent. They will work today and then break tomorrow. It doesn't matter today that you break on unrecognized curves, because you know all of the curves that exist today. It's that tomorrow we're going to define a new one, and then you break.
Additionally, TLS has far too many extension points. You're not meant to be able to read all of that. I'm sure this list is missing some. I just went to the page where IANA keeps its TLS registries and listed all of their titles. There are too many of these. And because there are so many of these, our changes are spread across them. Today, the change is a new curve. Tomorrow, we add a new cipher suite. Each individual extension point is very rarely used and frequently untested. When we added X25519 in Chrome, that was the first time in the history of TLS anyone ever deployed a new curve, as far as I know. It was just dumb luck this worked at all.
So there's a nice analogy here, which is that protocols should have only one joint and keep it well-oiled. If you have these extension points that you never exercise, bugs will gradually accumulate, and if you wait too long, they will rust shut and it takes a lot of effort to get them moving again.
So the solution is that we can apply a little GREASE, which, because I have far too much fun with acronyms, stands for Generate Random Extensions And Sustain Extensibility. In each of these extension points back here, we'll just reserve a handful of code points that we promise never to actually use, and have the clients pick randomly out of this list and send fake ones, just to get people used to the idea that there will be curves they don't recognize. And ideally we break the buggy implementations before they spread. Because breaking things when there's only one or two is not a big deal.
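As a rough illustration of the GREASE idea for the curve list, here is a small Python sketch. The reserved values follow the 0x0A0A, 0x1A1A, ... 0xFAFA pattern that the GREASE specification (RFC 8701) sets aside, and the two curve code points are the real registry values for X25519 and secp256r1. Everything else is hypothetical and only meant to show the behavior: the function names, placing the fake value first, and the shape of the buggy server.

```python
import random

# Reserved GREASE code points: 0x0A0A, 0x1A1A, ..., 0xFAFA.
GREASE_VALUES = [(n << 12) | 0x0A00 | (n << 4) | 0x0A for n in range(16)]

X25519, SECP256R1 = 0x001D, 0x0017  # real named-group code points


def client_curve_list(real_curves, rng=random):
    # Advertise one randomly chosen fake curve alongside the real ones.
    return [rng.choice(GREASE_VALUES)] + list(real_curves)


def tolerant_select(offered, supported):
    # Correct behavior: skip anything unrecognized, take the first match.
    for curve in offered:
        if curve in supported:
            return curve
    return None


def buggy_select(offered, supported):
    # The "forgot the default case" bug: an unknown curve is a fatal error.
    for curve in offered:
        if curve not in supported:
            raise ValueError(f"unknown curve 0x{curve:04X}")
    return offered[0] if offered else None
```

A tolerant server picks X25519 from the greased list and never notices the fake entry. The buggy server fails on the very first connection, which is the point: the bug shows up in testing rather than years later when a real new curve ships.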
It's when they went unchecked for five years and are suddenly one percent of the ecosystem that we have a problem. And in a way, this is sort of an inverse Postel's law. We already know that being liberal in what you accept doesn't work very well. We keep adding these workarounds and workarounds and security problems. But maybe we should invert the other half, too. Rather than being careful about not breaking random servers, we should go out of our way to break them while we have a chance.
So Chrome is doing this these days, and it appears to actually work. We prevented some bugs in some draft TLS 1.3 implementations. They supported all of the current curves, but because TLS 1.3 changed where the curves are listed, they forgot to skip over the unknown curves correctly. Our implementation will always send fake curves, so they found they couldn't talk to Chrome's draft implementation, and noticed the bug before TLS 1.3 even existed.
So speaking of TLS 1.3: TLS 1.3 has been plagued with a lot of these problems. I'm not going to talk about what's in TLS 1.3 either, but I think you heard from the previous talk that we're doing it a lot better this time, so we're very excited about getting it deployed. Unfortunately, TLS 1.3 used the same versioning scheme that worked so great the last three times. I think there's some saying somewhere about doing the same thing over and over again and expecting different results.
So we did a crawl of some list of top sites, and the folks who run SSL Labs did as well. We found that somewhere between 1 and 3% of top sites rejected a TLS 1.3 ClientHello. These weren't sites that you never heard of, either. These were popular newspapers and other things. So there was no chance we were going to be able to break that. Our breakage budget is several orders of magnitude below 1 to 3%.
Which means that clients would be forced to fall back or do some other workaround, because with these numbers, as much as we would like to just break them, other people will scream at us. Also, it's actually not that great for our users to have the website suddenly stop working. But a fallback would break downgrade protection again, and we spent all of that trouble getting rid of that 17-year-old disaster. So that wasn't very happy.
So we actually ended up just moving the version information somewhere else. The old version field is now frozen at TLS 1.2. Sending a maximum version and computing a minimum seems reasonable, but apparently it doesn't work. I don't know. Now the version is in an extension and, critically, we send a list of versions rather than a maximum version. Although that is actually slightly more work to process, it is more aligned with how other TLS parameters work, and so hopefully people will get the idea that a single ClientHello is good for multiple versions at once.
And more importantly, we can GREASE a version list. We can't GREASE a maximum version, because if we send a fake high number, that implies we support everything below it, including versions which don't actually exist yet. And this indeed successfully prevented TLS 1.4 intolerance in the draft implementations. There were three bugs we cleared: one of them was a version intolerance, another was in the curve list, and then there was one in an extension.
This also allowed us to safely deploy the draft versions. The old scheme requires that you support a contiguous range of versions, and that doesn't work well when you want to temporarily advertise, you know, TLS 1.2-and-a-half to see if it works and then get rid of it later.
So with that resolved, we were very excited to finally turn this thing on in February of 2016, and everything broke.
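A minimal sketch of the list-based scheme, in Python, may help show why a list can carry a fake entry or a temporary draft version while a single maximum cannot. The version numbers are real wire values: 0x0303 and 0x0304 for TLS 1.2 and 1.3, 0x7F17 for draft 23, and 0x7A7A as one of the reserved GREASE values. The function name and the server-preference ordering are assumptions made for this illustration only.

```python
TLS12, TLS13 = 0x0303, 0x0304
TLS13_DRAFT23 = 0x7F17   # draft versions are encoded as 0x7F00 | draft number
GREASE_VERSION = 0x7A7A  # a reserved value that will never be a real version


def select_version(client_versions, server_preference):
    """Pick the server's most preferred version that the client listed.
    Unknown entries in the client's list are simply never matched."""
    offered = set(client_versions)
    for version in server_preference:
        if version in offered:
            return version
    return None


# A client advertising a fake version, a draft, and TLS 1.2 at once.
client_hello_versions = [GREASE_VERSION, TLS13_DRAFT23, TLS12]
```

A server that only knows TLS 1.2 selects TLS 1.2 from that list, and a server running the draft selects the draft. Neither has to understand the other entries. The list also does not have to be a contiguous range, which is what allows a draft version to be advertised for a while and later withdrawn.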
So if you were wondering why TLS 1.3 isn't here yet, it's because of this disaster. We got a lot of user reports that folks' firewalls, proxies, antiviruses, random pieces of software or devices that sit in between the client and server were doing something. The impact was that we lost about six or so percentage points of handshake successes, and this was on the beta channel, where you would expect fewer people to be running these things, since they tend to be more enterprise-y features. And 6% is not within our breakage budget. We would need a fallback without any question.
The good news is that we didn't find any endpoint issues, at least not in that round. So the versioning hacks, while disgusting, worked. The bad news is that diagnosing middlebox issues is very hard. We can't just ask the user for a URL and poke at the server and see, ah, you don't like it when the final extension is empty.
Okay. So what did we do? Well, we don't have very many tools at our disposal. We did a small deployment with our beta channel that was large enough to trigger problems, but small enough that we didn't hit the universal law of users too much. And then we monitored metrics and waited for people to tell us what broke.
This is not a very good mechanism. Only a very small fraction of users will actually send reports. Once you get a report, it tends to be on the order of, "I can't visit Gmail. It doesn't work." That's not very useful. You need to go ask them what their setup is. This is probably some random employee of the company, not one of their IT folks. And so as a result, only a small fraction of users actually respond to these questions. Anecdotally, the half-life of a user report appears to be about 1.5 questions.
But we did eventually find a small, incomplete list of vendors. So we reached out to them: maybe they would be able to fix their bug, maybe. But now we have yet another problem. Not all the vendors are responsive.
Those who are responsive aren't necessarily willing to fix their bug, because their product works just fine for them. Those who are able to fix it are almost never able to reliably deploy fixes. We were already pretty dismayed at how rarely servers auto-update. These enterprise boxes will never update, ever.
And so we tried another tack, which was that we just purchased every broken device we heard of and attempted to reproduce the issues. At last count, we have four middleboxes, one printer, and there's a poor Windows laptop that has two antiviruses and some label maker software installed. It's been a fun year.
So what did we learn? This is an oversimplified picture, but roughly, these middleboxes either terminate TLS, so they have their own CA, which maybe they configure the client to trust or something, or they don't terminate it. For the ones that terminate it, I will make no comment on whether or not this is a good idea. I will merely say that they don't generally break TLS 1.3. A TLS terminator is in theory a client and a server. You connect them back to back and you move on with life. If the client is correct and the server is correct, you will at least not make me angry at you for breaking TLS 1.3. You might have other problems, but that's not the subject of this talk. Although some vendors do still manage to break this anyway.
The other middleboxes, which are the bigger problem, will process TLS without terminating it. The problem is that TLS does not stay the same. Officially, only the ClientHello is promised to remain unchanged. Everything else we're allowed to change in newer versions, and indeed we changed basically everything after the ClientHello in 1.3. But we didn't do it for 20 years. SSL 3.0 through TLS 1.2 and everything in between are the exact same protocol. And so although these things can change, they didn't. People got used to the fact that they didn't change and built products that did various things with this information.
Some of them are, you know, firewall-type filtering things, and some of them just wanted to check that the length prefixes worked out, because it seemed reasonable at the time. One of the vendors actually said, "We just wanted to sell the customers a box that did something," and didn't justify it beyond that. This is an oversimplified picture. A lot of middleboxes are mixes of the two, but this will give you a flavor of the kinds of things that we're having to deal with.
So we need a workaround of some kind, either a fallback or whatever, but we don't want a fallback. So enter TLS 1.3 compatibility mode. This is ugly beyond belief, but it does work. We made TLS 1.3 look like TLS 1.2 resumption, which actually only takes injecting a few dummy fields and some messages that the other side ignores but that, for whatever reason, the middleboxes think are critical. So this is ugly, but it's only ugly. The fallback has a security consequence, and this one just costs aesthetics and complexity. I would rather not pay that, but if I have to pick one, I'll pay that one.
I think this was an idea by someone at Facebook originally, and then we iterated over this with Chrome beta and Google servers until we got something that worked, and it does appear to relieve most of the compatibility problems. So TLS 1.3 draft 22 and later includes this disaster.
With that out of the way, in December we did a Chrome stable test. We turned it off intentionally before the holidays, because that seemed like bad manners to leave on. But we got a number of results that were consistent with Chrome beta, so it seems to basically work. There were two new confirmed bugs. There's a new buggy middlebox. I do not think we're going to work around them, because the workaround would actually have security consequences, so hopefully they can fix themselves.
Also, someone used extension number 40 for the extended random extension, which was never standardized. That collides with TLS 1.3, because the IETF really likes to number things consecutively. Pro tip: if you're making your own extension, pick a large number, within the limit. The limit is 65,000. There's plenty, and the IETF will never get to 65,000. So in draft 23, that extension has been renumbered and is now 51. Oh, well.
So hopefully this year we should have TLS 1.3, I hope. If not, I'm going to be a little disappointed, because I'm kind of sick of this protocol by now. I want to deploy.
The other thing to think about for the future is that the first version of GREASE was basically about endpoint bugs. It doesn't work against the middleboxes because it doesn't touch the server response. We need to think about how to keep our protocols from ossifying that way too, because the unfortunate reality of the network is that any observable property of network traffic that does not change over time will get stuck, and you need to be constantly doing something to deal with that.
Finally, as a parting thought, folks who are shipping things: please do not ship network software without an auto-updater. It causes lots of pain and suffering for you and everyone else. Just build one of these.
All right. We have time for a question.
So, great talk. I love the idea of GREASE. What happens if you say, oh yeah, I'm using curve grufflump, and the server calls your bluff?
We promise to never allocate those curves. So there's a GREASE draft that I actually haven't updated for a year.
No, no, but the server is buggy.
Oh.
A buggy server that says, oh yeah, I'll do that.
That would be impressive. I don't think we've seen a server that is confused about what curves it supports. So hopefully that doesn't happen. Although, because we've already deployed GREASE, for that server the universal law of users goes the other direction too.
Thanks.
I have a question regarding the TLS 1.3 compatibility mode. On the slide, you write that this is ugly, but only ugly. Do you have any formal reasons to back this claim? In particular, in light of the previous talk by Tyler?
It would be nice if someone could do it formally, but it's sort of obvious serialization silliness. Even before this, every record in TLS 1.3 has three bogus bytes in there that are completely useless, but if you remove them, middleboxes will hang because they try to parse out records. I don't know. There are a lot of dummy fields in TLS. If we are unable to add dummy fields and keep things compatible without breaking things, that would be something that we definitely want to find out about.
So, question. It sounds like the takeaway from your last point was that protocols should actually be hostile to middlebox analysis, just to ensure we can change them later, so we could just randomize things and always make it hard. Do you have any reason why we shouldn't be doing that?
I would maybe temper that. The difficulty is understanding what the invariants are. We promise the ClientHello stays fixed; for everything else, go to town. And this is about enforcing whether or not implementations follow those. These are invariants that every implementation that touches TLS needs to follow for the ecosystem to work. I think we should defend those invariants and make sure that we enforce them. I don't think that quite answered your question.
I was wondering about the parts that aren't invariants, but that people mistakenly assume are.
I don't have a complete answer for how to resolve this. This is something we'll probably be thinking about over the next year, and hopefully you will see some ideas. The obvious thing is to encrypt more things, because encryption is kind of nice, but TLS has a bootstrapping problem where the first few messages are the ones setting up the encryption.
The best idea I have is to mutate them a lot, but doing that carefully is a little tricky.
Thank you.
Hi. One potential strategy to tackle this would of course be vendor shaming, like saying your thing is broken, and you seem to be extremely reluctant to do that. I mean, you know my opinion on it, and we've discussed this before.
So this is not entirely a technical problem. I think we need technical measures, and I think also the IETF might need to have more of a conversation on whether this stuff is socially acceptable. But when it's six percentage points of connections, naming and shaming is not going to do a lot of good. Everyone's going to get all angry, the vendors won't talk to us anymore, and at the end of the day these folks don't have auto-updaters. They're not going to get fixed, at least not for 5 or 10 years, and I don't want to be waiting 5 or 10 years for TLS 1.3. When it's a smaller amount, Chrome has gone out of its way. Chrome has broken things before that aren't working right. But it's sort of a numbers and costs game.
All right, let's thank David again.