So, thank you, Tom, and the rest of the committee for the opportunity to talk to you today. As Tom said, I'm David, and I'll be presenting joint work with some collaborators: Blake Anderson and Scott Fluhrer are here today, and Chris Schenefiel is back at the office.

I'd like to start with some background on previous and related work. The most important is that Ristenpart and Yilek, in 2010, showed us that randomness can go terribly bad when a virtual machine reset is involved. I'll go over that in more detail later; it's really important work that we build on. There's also more recent work, an active scan of HTTPS servers by Böck and others, "Nonce-Disrespecting Adversaries," which looked at GCM. What they showed is that there are TLS servers out there that implement GCM-based cipher suites and don't meet the requirement for distinct nonces. In our work, we did passive monitoring of a network and then made inferences from that, not active scanning, but it's similar in spirit in that we find implementation flaws due to repeated values. And some related work that's really important: Halderman, Heninger, Valenta, and their collaborators have ongoing work with findings similar to our own. So, thank you, Nadia, for talking with us about that.

I want to take a step back and present the philosophical approach we have, to better explain how the work we're going to present in detail relates to the bigger picture. What we have is a threat-driven approach to cryptography. You've probably heard people in industry talk about a threat-centric approach to cybersecurity, and we very consciously mean the same thing here. So, for the rest of the presentation, I want you to pretend that you are an information security analyst, and you're responsible for protecting some set of assets.
So our InfoSec person is on the right, and there's a set of assets that you need to protect. There's a set of vulnerabilities in your information systems, and you have controls, like anti-malware, anti-virus, and Transport Layer Security, to protect communications. Your controls are your protections, and then there's a set of threats you need to worry about: some adversary that wants to get at your assets and steal your data, or cause your power grid to go down, or do some other damage. It's really important to think in terms of the adversary's capabilities and motivations, because that helps you understand and prioritize which vulnerabilities you should be paying attention to. You'll notice my threat icon here is intended to look like a warfighter, or a thug, or somebody who works for organized crime, which is the appropriate thing for the real world. And if you haven't read Phil Rogaway's excellent essay "The Moral Character of Cryptographic Work," which touches on the importance of appropriate symbols, I encourage you to read it.

So you're responsible for protecting information assets. Those might include a data center, at the bottom right there; a campus network with a bunch of wireless clients and other clients on it; cloud services, which are essentially somebody else's data center over the internet; and a set of mobile devices. They all hold information. The orange lines illustrate the encrypted connections that we're relying on to actually protect all of it. So communication security is critical, and it has to be done right, and as an information security person you should ask the question: is it being done right?
We have a way of talking about this, which is crypto visibility: as an information security person, do you actually understand where cryptography is being used and how correct it is? You want to be able to ask, and answer, questions like: Is encryption in use where it's needed? Is sensitive data being appropriately protected? Are there active attacks or exploits going on? Are there bad certificates in use, keys being trusted that shouldn't be trusted? The focus of today's talk is weak cryptography that's in use, and there are two major categories there. One is that obsolete cipher suites and inadequate key sizes are sometimes actually used, and it's very valuable to get a view of where: somebody using 1024-bit RSA may or may not be security-critical depending on where it's being used. The other is implementation flaws: there might be a cipher suite that is strong, but the implementation is incorrect, and that's causing the problem. In our work we focus on passive network monitoring that is aware of TLS sessions and aware of cryptography, as a way to provide this sort of visibility.

So let's move on to detecting flaws. We use what we call the multi-session monitoring model. Essentially, you can monitor the network at just one point, but you have to have an elephant's memory: if you want to look for something like a collision between pseudorandom number generator states in two different sessions, you need long-term memory, and that enables you to find sessions that might be separated by a great distance in time. Of course you can monitor at multiple places and multiple networks, which gives you more visibility, but a single monitoring point is adequate.
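The multi-session idea can be sketched in a few lines. This is a minimal illustration, not our actual tooling: it assumes TLS metadata has already been extracted into one JSON object per session, and the field names (`src_ip`, `start_time`, `tls.client_random`) are hypothetical.

```python
# Sketch of multi-session collision detection over extracted TLS metadata.
# Field names here are hypothetical; a real extractor's schema will differ.
import json
from collections import defaultdict

def find_random_collisions(session_lines):
    """Group sessions by the hello 'random' value and report any repeats."""
    seen = defaultdict(list)  # random value -> list of (src_ip, start_time)
    for line in session_lines:
        s = json.loads(line)
        rand = s.get("tls", {}).get("client_random")
        if rand:
            seen[rand].append((s.get("src_ip"), s.get("start_time")))
    # A repeat of a 256-bit nonce should essentially never happen by chance,
    # so any collision signals a PRNG or snapshot failure worth investigating.
    return {r: hits for r, hits in seen.items() if len(hits) > 1}

sessions = [
    '{"src_ip": "10.0.0.5", "start_time": 100.0, "tls": {"client_random": "aa11"}}',
    '{"src_ip": "10.0.0.9", "start_time": 7300.0, "tls": {"client_random": "aa11"}}',
    '{"src_ip": "10.0.0.7", "start_time": 105.0, "tls": {"client_random": "bb22"}}',
]
collisions = find_random_collisions(sessions)
```

The point of the long-term memory is visible in the example: the two colliding sessions are two hours apart, so a short-window monitor would miss them.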
So this is our simple model of TLS; I'll go into it more later, but the important thing is the PRNG. It should have some really large set of possible states, but if it has a small typical set, typical in the Shannon-entropy sense, then an attacker can potentially exploit that fact by finding sessions that actually collide, and that's feasible if the typical set is small. The monitoring tool we used is an open source package that Blake and I and some other people put together, called Joy, which can turn a PCAP or a live network capture into JSON descriptions of the network traffic. It has a flow-monitoring viewpoint, except much richer than flow: it's not per-packet, it's per-flow. And like all great open source, the documentation sucks, so if you're interested in it, please send us an email.

I'm going to touch on virtual machines a little, because they're really important for understanding some of the failure modes. A virtual machine snapshot is basically just a set of bits that can be stored and later cloned into other running images. A really important use case for this in practice is something called autoscaling: there's a server with a hypervisor running one of these images, and when the load on it increases, it spins up a new virtual machine with another copy of the same image, so that it can provide greater scale for the same service. It's important to realize the difference between a volume snapshot and a full snapshot. Ristenpart and Yilek introduced this, but let me cover it again to make sure we're all on the same page.
A volume snapshot is an image of a bootable disk; by contrast, a full snapshot is an image of the random access memory as well as the disk. If you start a VM from a volume snapshot you're going to have the latency of a boot, whereas with a full snapshot you won't: you basically copy the image into running memory, tweak a couple of things, and you're good to go. So a volume snapshot is not vulnerable to the attacks we're going to be describing in the VM-restart situation, but a full snapshot does have these vulnerabilities. A slight nuance, of course, is that there are operating systems that store a random seed on the volume snapshot, but that's a minor thing we didn't focus on.

In our VM experiment we worked to duplicate, in a lab setting, some of the failures we've seen in the wild, so that we could make sure we understood them. We also worked with several different enterprise virtual machine offerings to investigate which of them would actually have these types of failures. The first is a malware sandbox environment, ThreatGrid in particular, and it uses a full snapshot. To make sure everybody understands what I mean: a malware sandbox is a dynamic execution environment where you introduce an executable sample, which could be Visual Basic or JavaScript or an EXE or whatever, into an operating system; the operating system runs it, and after running it, checks whether the behavior was good or bad. It's used in malware detection. We also tested several other environments, Docker and VMware linked clones; those use volume snapshots, so we did not observe the failure there. The reason we took this approach and checked a bunch of things is that in software it's turtles all the way down.
There can be an application, a container under that, a virtual machine under that, and an operating system under that. We talked to a bunch of people who work with virtual machines, and what we heard was that they were very glad to hear about ways of testing to make sure the turtle below you isn't misbehaving in a way that undermines your security. I think that's a good way to think about it: you can have the best possible application, but one of the turtles below can subvert it.

So there are a number of different scenarios where TLS failures can occur. You're probably familiar with the TLS protocol, but just to review a few things quickly: there is a field in the initial handshake, in both the ClientHello and ServerHello messages, and it's called random. It serves as a random nonce in the sense that it provides anti-replay protection, and it's also used as a unique input for key derivation; those are its main functions. For the client, it's supposed to have the time in the initial 32 bits; for the server, it's not supposed to, but then again it's not really fully specified. The server is just supposed to make it distinct. The field is called random, but the RFC is a little loose on exactly how it's supposed to be formed, and in practice some people don't set the time when they should, and some people put a time in the server nonce even though they're not required to. What's important about these nonces is that if there's a PRNG state collision, you can observe a collision in the nonces. So we have the simple model of TLS that I showed before; let me explain it in a little more detail.
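To make the field layout concrete, here's a small sketch that splits a 32-byte hello random into the 4-byte gmt_unix_time word and the 28 following bytes, per the RFC 5246 definition. The function name is mine, and, as noted above, many implementations ignore the time convention, so the first word is only a hint.

```python
import struct

def split_hello_random(random_bytes):
    """Split a 32-byte TLS hello 'random' into (gmt_unix_time, 28 trailing bytes).

    RFC 5246 defines the field as a 4-byte big-endian gmt_unix_time followed
    by 28 bytes from a secure PRNG. Servers (and some clients) often put
    random data in the first word too, so treat it as advisory only.
    """
    assert len(random_bytes) == 32
    (gmt,) = struct.unpack(">I", random_bytes[:4])
    return gmt, random_bytes[4:]

# Build a synthetic random with a known timestamp and all-zero tail.
rand = struct.pack(">I", 1480000000) + bytes(28)
gmt, tail = split_hello_random(rand)
```

In monitoring, comparing the extracted gmt_unix_time against the packet capture time is one quick sanity check on an implementation.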
So essentially we have the observable fields: the time field, if it's present; the random; and a client key exchange, which contains either an RSA ciphertext or a Diffie-Hellman or elliptic curve Diffie-Hellman value. There's an entropy source that feeds into a PRNG; the PRNG is some piece of software, typically part of the operating system, and it feeds into a nonce generator, which is just our term for the logic that actually reads from the PRNG and writes into the nonce. Then there's the cryptographic component that generates the other field, and the clock is fairly separate from all of this. You can think of it as a directed acyclic graph, where a failure will propagate along the direction of the graph.

There are essentially four main categories of failure modes that we observed. Our catch-all, so to speak, is what we call an aberrant implementation: something that fails to conform to the spec and probably isn't really trying to conform to the TLS specification. These are interesting, and we observed a number of them. Malware does this sometimes, and some non-malware applications do it as well. It's not clear what the non-malware applications are trying to achieve; the malware is apparently trying to cut down on computational cost or something like that, although that's speculation. In an aberrant implementation the fields might be fixed or repeating; they're going to behave in some way that's unpredictable and doesn't correspond to the spec.

A PRNG flaw is much more interesting. In this case the PRNG output is going to be a repeating value; it might, for example, be something that updates no more than once a millisecond. That's one of the failure modes we observed: the PRNG state will be repeating, and the repeating values will feed into the nonce generator, and they may feed into the client key exchange.
The reason the random value might be identical while the client key exchange isn't is that the entropy source might reseed the PRNG between the calls. It's not completely deterministic when that reseeding happens, so sometimes it's possible to see a random that repeats with a client key exchange that doesn't.

Maybe the really interesting case is multiple running instances of a full snapshot. That's a big phrase, but I'm trying to be specific. Remember that the full snapshot copies RAM, and with multiple running instances, somebody has made these clones, and when each clone initially starts, its PRNG state is going to be identical, as shown here. Similar to the previous case, the random may repeat and the client key exchange may repeat, and if the client key exchange repeats, you'll definitely see a random that repeats as well.

The last case is the least interesting: an active network scan. There are network tools that do HTTPS scans; for example, if you want to find out what cipher suites servers are offering, you have to construct a ClientHello and send it to them, and these scan tools basically build one hello and send it to many different destinations. So they're pretty easy to recognize, and in fact if you do this kind of monitoring in the wild, you'll actually see internet-based commercial scan services in action.

So here's the field identification chart, which summarizes the interesting cases I just explained. In this chart, each line corresponds to a single session, with time increasing down the page. For the aberrant implementation, things might be repeating; they might even be fixed.
More interestingly, the PRNG failure is going to have the random values repeating, and possibly a client key exchange repeat; that's the failure mode I described earlier with the slow-to-update PRNG. It's going to be something that happens quickly: the sessions will be successive, with a very small time interval. In contrast, with multiple running instances of a full snapshot, there might be a very long time period between them, hours or days or more. In the PRNG failure case it's going to be the same internet address: when you're monitoring, you'll see that this one device is screwed up. But in the multiple-running-instances case, the addresses might be different, because there are systems that let you boot multiple running instances on different addresses. Of course, something like autoscaling is typically used with a load balancer, so there will be a single IP address, and when you contact that address, it farms out your session to something behind it. It's very complex, but in principle it is possible to see these different things at different addresses.
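The triage logic above can be sketched as a simple heuristic. The thresholds and function name here are hypothetical illustrations of the distinction, not rules from the talk: same host within a short window suggests a slow-to-update PRNG, while repeats spread across hosts or long time spans suggest full-snapshot clones.

```python
def classify_collision(hits, prng_window=1.0, clone_gap=3600.0):
    """Heuristically triage a repeated hello 'random' value (a sketch only).

    hits: list of (src_ip, timestamp) for the sessions sharing one random.
    prng_window / clone_gap are illustrative thresholds, not measured ones.
    """
    ips = {ip for ip, _ in hits}
    times = sorted(t for _, t in hits)
    span = times[-1] - times[0]
    if len(ips) == 1 and span <= prng_window:
        # Successive sessions from one address: slow-to-update PRNG.
        return "prng-flaw"
    if len(ips) > 1 or span > clone_gap:
        # Different addresses, or hours apart: full-snapshot clones.
        return "vm-snapshot-clones"
    return "unclassified"
```

As noted, a load balancer can hide the clones behind one address, so a real classifier would need more context than this sketch uses.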
So here's a really quick summary of some of our observations. We monitored two enterprise networks with about a thousand hosts. We actually have other observations beyond this, but I wanted to show a very bounded set, with about 30 million TLS sessions in it. We saw four instances of the aberrant case, one instance of the PRNG library failure, and one instance of the VM snapshot failure. By "instance" I don't mean a single session, but an implementation that had a flaw and was causing this; I think in every case it showed up across multiple sessions. So it seems like these failures are reasonably prevalent.

To give a sense of the work we did to verify some of the failure modes in the lab environment: this shows a sandbox run of the TeslaCrypt malware, which had six runs, and this random value repeated five times, and the client key exchange repeated five times as well. There were other instances, with different malware samples, where the random would repeat and the client key exchange wouldn't, which means something else is going on in the OS that's causing the entropy source to reseed the PRNG before the client key exchange is generated. It's also interesting that tor2web.org is being used by this malware, probably in an attempt to hide, although I'm not sure it's a good attempt to hide.

I want to quickly touch on cryptographic attacks and then head to some conclusions. So what bad things happen if you see something like this on your network? Obviously, replay attacks work, and cut-and-paste attacks will work in a session that has the same encryption and authentication keys. A really important problem would be a flawed server doing DSA or ECDSA signing, because private key recovery would be possible in that case; deterministic DSA and ECDSA look like an even better idea after you do this kind of analysis. With a flawed client, if you can act as a server to the client and get the client to actually complete a session with you, then you just learn the key, and that could be the key used in another session; that's an obvious sort of attack. A more subtle attack: if you can't convince the client to authenticate you, maybe you can get it to downgrade to an export-type cipher, break that, and gain access to keys that might be used in other sessions as well. RSA key transport is especially bad, because an attacker can replay a colliding session and cause symmetric keys to collide, and that does a lot of damage: plaintext leakage for a bunch of cipher suites, and potentially authentication key leakage as well.

Let me try to hit some conclusions real quick. These types of failures do happen, and they're worth studying. Especially for future work, the aberrant implementations look really interesting, because some people seem to be not conforming with the spec, and that's very interesting: if you're trying to make sure things are secure and somebody's not really following the protocol, it's nice to know. So passive network crypto visibility can help detect these types of failures. What I'm hoping is that through work like this, the good guys can have the same visibility that the bad guys do; the bad guys don't need our open source tool, they have something way better than what we have. Getting visibility into where the cryptography needs to be better, or needs to be used where it's not being used, is a good thing, and multi-session monitoring is a big part of that.

Our conclusions for people who implement cryptography: TLS shouldn't use RSA encrypted key transport, so thank you, TLS 1.3, for not doing that. For robustness, deterministic DSA or ECDSA is a good thing. There's an alternative if you're concerned about maintaining your FIPS 140 interoperability, described in a Cisco blog and implemented in some of our code: stir a pseudorandom function of the message into your entropy pool before you do a signature. That achieves the same security properties, and you can do it in a conformant way. For authenticated encryption, we suggest using something robust like AES-GCM-SIV, which is emerging work that's really attractive; I think it should be the default, and you should use it instead of GCM unless you absolutely can't for some real high-performance reason. And if you're using a method that's not robust, testing it is really important; there are other ways to test than hooking up a network monitor, and any testing is good.

Before I conclude: I noticed some other vendors had shameless plugs for the fact that they're hiring. We're hiring too, both researchers and coders. Thank you for your attention.

[Session chair] We have time for only one very quick question.

[Audience] In your slides you said that TLS doesn't specify that the server should include the GMT time, but it should. And you can't cut and paste records between sessions unless both sides have RNG failures, right? Because either side's nonce changing is enough to change all the encryption keys.

[David] You can cut and paste whenever the authentication key is reused, and if you replay a session, then yes. I think the logic flow on the slide was slightly off, and I think you caught that. Thank you. In the interest of time, I'll stop there.

[Session chair] All right, let's thank David again.