Good morning everybody. My name is Luke Hinds and I'm going to be presenting a relatively new project called Keylime. So let's get into this; let me get my clicker. Okay, so a quick personal introduction. I work at Red Hat in the CTO office. I have a background in security going back close to 20 years now, in many different capacities. I spent quite a bit of time working on tooling and vulnerability handling and research. My other gig is the Product Security Committee for Kubernetes. I've done various project team lead roles in different security groups as well: OpenStack, OpenDaylight. On a personal level, I live in the UK. [The speaker continues briefly in Welsh.] I'm not really going to deep dive into this, because you're a pretty smart audience. I'm guessing a lot of you probably know what TPMs are. You know about security and trust very well, so I'm really not going to teach the choir how to sing, as they say. But we'll do a quick refresher so that everybody's on the same page. So essentially, software trust without a hardware means of protection typically resides in memory or on disk. We're talking about private keys, secrets and so forth. And the problem with that is that you're then at the mercy of the lower levels of the stack. Effectively, we're talking about a chain of trust here. You have your firmware, your bootloader, your kernel, your modules. Am I straying out of the camera range? I'm sorry, I'll try to stick in this circle here. And so on, all the way up until we've got our userland and our runtime.
Perhaps, you know, a container runtime or virtual machines and so forth. And the problem here is that if a lower level of the stack is compromised, the higher up the stack you are, the more difficult it is to know about that. You're at risk, but it's very difficult to establish that there has been a compromise. And this is even more the case now that we have workloads such as containers and so forth, which have to trust the underlying host. So we have these things called Trusted Platform Modules, and this is the key attribute that Keylime is built on top of. So we're going to do a quick overview of those, but as I say, there are probably people in the audience that know TPMs even better than I do. I'm approaching them as an application developer. So a TPM is a specialised chip. They're almost ubiquitous; a lot of boards are coming with them already on. If not, they're pretty simple to retrofit. At the bottom right, that's a picture I snapped of a Raspberry Pi 3 that I have, and on the GPIO header you can see there's a TPM chip there. It cost me about, I can't remember, about 20 or 30 euros. So it's not expensive technology. And as I say, a lot of server providers already have them on the board; if not, they're pretty simple and cheap to buy and retrofit. And they're not only turning up in servers, they're in devices. They're used a lot in the automobile industry, to ensure that nobody's tampered with your braking system and so forth. So they're a pretty common commodity to find. And essentially the key attribute is that they have a private key pair. They have an RSA key pair, and the private part is not accessible by software. It's only a special bus connection that can access operations with a key that is secluded within the chip, created at manufacture time. Now the TPM itself, don't think of it as a crypto accelerator; it's a very simple crypto engine. So it can create keys, it can sign artifacts.
And it can do something that we call measure, which is to take a cryptographic hash of a particular object. So it can hash critical sections of firmware and boot. Some of you will know IMA, the Integrity Measurement Architecture. So it can take measurements, and it can use these extend operations to build a one-way hash chain, and you can end up with a complete cryptographic replay of, for example, a boot. The good thing is that hashes can be made public, and you can have a public counterpart of the private key. Using that, you can establish that this list of hashes you're looking at, which represents a system state, has not been tampered with, because you have the public key which allows you to attest, essentially. That's where you'll hear that word used quite a lot: attestation. And one of several things that we do in Keylime is provide this means to remotely attest a certain part of your system. So let's go into what Keylime actually is. Originally, the project idea and a white paper were devised by some folks at MIT. They have a security department there; they do work with the Department of Defense, protecting military systems and so forth. A couple of people there worked on this, and they came up with this idea of Keylime. From there they started to prototype it, they wrote some code, and they came up with a pretty good working system, a good prototype. Now essentially, as I say, we provide a remote trust framework. By framework, what I mean is that with Keylime we're trying to provide the means for the user to say what they wish to measure, and then what actions they would like to take if the integrity of that measurement is compromised. So we're trying not to be too opinionated about how you should use Keylime.
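The measure-and-extend construction described a moment ago can be modelled in a few lines of Python. This is a toy model of the one-way hash chain, not the real TPM interface; the stage names are invented for illustration:

```python
import hashlib

def extend(pcr: bytes, measurement: bytes) -> bytes:
    # Model of the TPM extend operation: the new PCR value is the hash
    # of the old value concatenated with the new measurement.
    return hashlib.sha256(pcr + measurement).digest()

# A PCR starts out as all zeros at boot.
pcr = b"\x00" * 32

# Each boot stage (firmware, bootloader, kernel, ...) is measured in order.
boot_log = [b"firmware", b"bootloader", b"kernel cmdline"]
for stage in boot_log:
    pcr = extend(pcr, hashlib.sha256(stage).digest())

# A verifier holding the same event log can replay it and must arrive at
# the same final PCR value; any tampering with any stage changes the result.
replayed = b"\x00" * 32
for stage in boot_log:
    replayed = extend(replayed, hashlib.sha256(stage).digest())

assert replayed == pcr
```

Because each value depends on every earlier measurement, the final PCR value is a compact commitment to the whole boot sequence.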
We're trying to make this a tool, a framework essentially, that you can then use to drive whatever particular use case you would like, based on the trust or non-trust of a system. Now, one of the good things is that from the onset the design is very scalable in nature; you'll see that when we start to look at the architecture. When I was looking around at different solutions within the open source ecosystem, one of the things I liked was that they had a nice, simple, scalable architecture from the beginning as part of the core design. We support TPM 2.0, which is the latest standard and the one being actively developed. There is TPM 1.2, for which we also have code that works and is supported, but going forward we're not focusing on 1.2. It is still common, you can still find it, but essentially all the new chips being released are based on TPM 2.0. So Keylime provides this ability to remotely attest, as I said, and this is essentially based around measurements. We get these measurements from different areas. One of them is measured boot; you would perhaps have heard the term trusted boot. Using a particular project called shim, what we can do is measure the bootloader, GRUB, kernel options (for example, on the kernel command line you can toggle SELinux on and off, which is something you wouldn't want somebody to change), the initramfs, the modules. What shim will do is measure these and extend them into the TPM. We can then query the TPM, we can take a TPM quote, and we can establish that nobody has changed any particular part of that boot cycle. The other thing we can do is measure secure boot. We don't set up secure boot, we're not providing a secure boot solution, but there are many attributes that are part of secure boot, such as the MOK list, the vendor db, the dbx.
There are various parts that are required for secure boot, certain certificates and so forth, and we can remotely attest that nobody has changed any of those. Then there is IMA, so we're getting into the runtime here. This is where the system is booted and now it's running. The Integrity Measurement Architecture hashes objects when they're executed and writes those hashes into securityfs. And then, if IMA finds a TPM present, it will extend cryptographic hashes into the TPM based on the measurements it's taking at runtime. It's part of the kernel subsystem. With this we can pick up executions; we can monitor SELinux labels, files, and so forth. And then we have this other piece. Like I said, we're trying not to do too much around features, but we have these sorts of cases that people can take and get running with, and one of them is an encrypted payload. Essentially, you, the user, would have a payload. It could be perhaps some TLS certificates required for a website, perhaps some config files that have database passwords, any sort of sensitive material essentially. And what we can do is measure whatever it is you wish to measure. It could be the boot; we can measure the current runtime environment. Then, if we establish that you can trust that, based on the cryptographic hashes that we're getting from the TPM and checking those against the public counterpart of the key that I described earlier (and we'll go a bit more into the different keys shortly), what you can then do is release that payload onto the remote machine. So you can essentially establish that nobody's tampered with the machine that you want to instantiate your sensitive workload on. If anybody's touched the machine, compromised it somehow, then the payload is not made available. This payload can be pushed over the air, or it could be something that's baked into the image.
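The encrypted-payload idea can be sketched as follows. The toy cipher below (a SHA-256 counter-mode keystream) is a stand-in for the real symmetric encryption Keylime uses, purely to illustrate that the payload travels encrypted and only becomes useful once the full bootstrap key exists on the node:

```python
import hashlib
import secrets

def keystream_xor(key: bytes, data: bytes) -> bytes:
    # Toy stream cipher: XOR the data with a SHA-256 counter-mode
    # keystream. Illustration only; do not use for anything real.
    stream = bytearray()
    counter = 0
    while len(stream) < len(data):
        stream += hashlib.sha256(key + counter.to_bytes(8, "big")).digest()
        counter += 1
    return bytes(a ^ b for a, b in zip(data, stream))

# The tenant's bootstrap key, which will later be split into U and V shares.
bootstrap_key = secrets.token_bytes(32)

# The sensitive payload: e.g. TLS keys or config files with passwords.
payload = b"db_password=s3cret\n"

# The encrypted payload can be pushed over the air or baked into the image;
# without the recombined bootstrap key it is just opaque bytes.
ciphertext = keystream_xor(bootstrap_key, payload)

# Once attestation succeeds and the agent holds the full key, it decrypts:
assert keystream_xor(bootstrap_key, ciphertext) == payload
```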
We have a script that executes and can do such things as move it into a certain location or run certain commands, but none of this happens unless the trust is there. Okay. And then last of all, we have something called a revocation framework. What this consists of is that when a node fails its integrity check, we can kick off certain actions. One of them is that we integrate with a certificate authority. We have CloudFlare's CFSSL at the moment; we're not opinionated about which CA we use, and it's something where we hope to have an open plug-in framework. We can also work with OpenSSL, so a local CA. What can happen is that when a node fails, we can revoke its certificate, because there's a certificate authority that we bring the node into. So what could happen then is that the node's TLS connections break because you revoked the certificate, IPsec connections break down, and so forth. And then we can also kick off what we call local actions. If you have multiple nodes that you're monitoring and one of those fails, you can tell the others to enact a local action. That could be removing its key from authorized_keys, changing a local firewall rule, anything you can think of essentially. It's pretty much an open, scriptable framework. So I've kind of glossed over these, but we're going to have a look at how they actually work; we'll go a little bit deeper into the use cases. So to give you a high-level view, I actually need to come back here to see a bit of our architecture. The key things to look at first: I should have a laser here, it doesn't really work too well. But to the left here, which is actually to your right, this is the Wild West. And to your left, this is on premise. This would be within your control, within your network, somewhere you trust essentially. And then again to the right, we have something called the agent.
Now this runs on the node that you're measuring, and this agent has a very simple job. It just makes requests into the TPM using the TPM software stack. So it's pretty simple: it makes a request to the TPM and then it hands the result back. It doesn't deal with any secrets. We can go into this a bit later, but we don't care if somebody hacks this; nothing is stored there. It's pretty dumb. It just makes requests into the TPM and then sends back signed measurements, and if somebody were to tamper with those measurements, they would break the signatures. So the attack surface is pretty small here. Now over on this side, we obviously have a bit more. The first one is the verifier. Do you know, I've forgotten... I've actually got some bubbles here. So we communicate with the TPM and we can do these local actions that we spoke about. And then we have the verifier. This is tenant-owned, as in the tenant being you, the user, and it checks the node's system integrity. It gets the measurements and verifies that the state is as it should be and is expected to be. You have the registrar. Here we store the public keys, the public counterparts of the keys that we spoke about that are embedded within the TPM chip itself, and there's a simple database that has the operational state of the node and its unique ID and so forth. We've got our revocation service, which we just spoke about. This drives the actions that should happen when a node fails its integrity check. And then last of all, we have the certificate authority integration. As I say, at the moment we work with CloudFlare's CFSSL, but we're making this open so you're effectively not tied to any particular CA. OK, so, as I said, we have quite a friendly, widely-distributable model. We could go from a single site with a single node, to a single site with multiple nodes, to multi-site with multiple nodes, so distributed data centres.
You could have many verifiers and then many different devices, and have the verifiers attest each other and so forth. And multiple verifiers can attest the same single node, so you could have a scenario where, if we use cloud as an example, you have a cloud consumer who wants to verify the node's trust before they instantiate their workload, and then you have the cloud provider, who also owns the infrastructure and wants to make sure it's in a good state before they release it, give it to another customer and so forth. So let's go a little more into the ins and outs of the key operations and how this works. The first thing is we need to set up our hardware root of trust. We need to establish that we're actually talking to a real TPM, and then we need to set up the mechanism so that we can start to attest these measurements. So first of all, let me come back: on a TPM you have an endorsement key and an attestation key, and what happens is the public counterparts of these are sent to the registrar. So you have the attestation key pub, the endorsement key pub, and an ID. The ID is nothing special; it doesn't have any cryptographic properties, it's just a unique ID string. Think of it like a UUID. What happens is our registrar, using the endorsement key pub, encrypts a hash of the attestation public key, and it also includes a challenge. We actually have a description of our different keys down here: you can see K_e, and it's an AES-256 key. This is an ephemeral challenge to certify the attestation key. So these are sent back. We know that, as the TPM has the private EK, it will be able to unwrap this challenge, prove that it knows it, and it will then respond with an HMAC of K_e and the ID, which is that UUID we spoke about. Once that happens, we have now tied the attestation key to an EK identity, and we're at the point where we can trust quotes that are signed with the AK.
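The challenge-response just described can be sketched like this. It is a simplification: the RSA wrapping of the challenge under the EK public key (and the bound hash of the AK) is elided, so only the HMAC response step is shown; the variable names follow the talk's terminology:

```python
import hashlib
import hmac
import secrets
import uuid

# --- Registrar side --------------------------------------------------------
# The agent has registered its EK pub, AK pub, and a plain unique ID.
agent_id = str(uuid.uuid4())

# K_e: an ephemeral AES-256-sized challenge key. In the real protocol this
# is encrypted under the EK public key, so only a TPM holding the private
# EK can ever recover it (that RSA step is elided here).
k_e = secrets.token_bytes(32)

# --- Agent side ------------------------------------------------------------
# The TPM unwraps the challenge with its private EK, proving possession,
# and the agent answers with an HMAC over its ID keyed with the challenge.
response = hmac.new(k_e, agent_id.encode(), hashlib.sha256).digest()

# --- Registrar verifies ----------------------------------------------------
expected = hmac.new(k_e, agent_id.encode(), hashlib.sha256).digest()
assert hmac.compare_digest(response, expected)
# At this point the AK is tied to a genuine EK identity, and quotes signed
# by that AK can be trusted.
```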
So that's our hardware root of trust. The next part is something that is quite unique, in fact unique to Keylime: we have this key derivation protocol that we run. So you, the user, you're this guy here, the Keylime tenant, and we provide a CLI application, or there are REST APIs, to drive this. You create a key which we call the bootstrap key, and this key will be split. When I say split, it's cryptographically split into two counterparts, which we call U and V. First of all, we send the V part to the Keylime verifier, along with some other data. We might have a whitelist, which is a set of golden hashes that we expect the IMA runtime state to compare well against. So we send the V counterpart up to the verifier. The verifier sends a nonce so that we can avoid replay attacks and so forth, and then a TPM quote is sent back to the verifier, signed by the attestation key. We also include this other key called the NK key, which is used to protect the split key shares in transit. So all of this can happen over HTTP; we're effectively encrypting the key counterparts that we have split over here, the U and the V. So this goes to the verifier. It's a TPM quote; for somebody that doesn't know quotes, they're effectively the measurement list, which can then be taken by a party and used to attest a state using the public counterpart key. What happens then is the verifier makes a query into the registrar, where we keep the public keys, to establish that the AK is valid and we can trust it. And then from there, using this public NK, it will encrypt the V counterpart and provide it to the Keylime agent over here. So the agent now has half a key; it still can't do much. Now the next part, and this is all part of a single operation. This isn't split into two stages, but I couldn't get it all on one slide, so effectively this is the same transaction continuing across two slides.
So the next part: the tenant themselves also gets a TPM quote, signed by the attestation key to prove identity, and they also get the agent's NK pub. What they do is make a call into the registrar to again establish that the attestation key can be trusted. If that checks out, then it will encrypt the U counterpart using NK, and it will also send an encrypted payload, if you're sending it over the air, that is to say. So what then happens is: it encrypts U and sends it to the node, the payload goes too if we're sending it over the air, and then the key shares are recombined. The agent now has the bootstrap key, and it has that key based on, like I say, this hardware cryptographic root of trust. So now that it has that key, it can use it to decrypt the payload. We could have a tarball, for example, and then you'll have a script, which we call autorun.sh, and that can then move files around and set up your web server, put them into your TLS directory and so forth, whatever it is you want to do. It's totally up to you; we're not opinionated. You can have anything you want in your payload and carry out the operations that are applicable to your particular application. So yeah, as I just mentioned, the payload could be baked into an ISO or a qcow2 or whatever image type it is that you have, and the good thing then is you can distribute that image with your secrets in there, protected, and you'll know that those secrets won't be unlocked unless the machine passes its trust state.
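One standard way to do the cryptographic split of the bootstrap key into U and V is XOR secret sharing. Whether or not this is byte-for-byte what Keylime does, it captures the key property: neither half alone reveals anything about the key, and both are needed to rebuild it:

```python
import secrets

def split(key: bytes):
    # Split a key into two shares, U and V. Each share on its own is
    # indistinguishable from random data.
    u = secrets.token_bytes(len(key))
    v = bytes(a ^ b for a, b in zip(key, u))
    return u, v

def combine(u: bytes, v: bytes) -> bytes:
    # XOR the two shares back together to recover the original key.
    return bytes(a ^ b for a, b in zip(u, v))

bootstrap_key = secrets.token_bytes(32)
u, v = split(bootstrap_key)

# V travels to the agent via the verifier, U via the tenant, each share
# encrypted under the agent's NK public key in transit (encryption elided).
# Only when both attestation checks pass does the agent hold both shares:
assert combine(u, v) == bootstrap_key
```

Because the two shares take separate paths, an attacker who compromises either the verifier channel or the tenant channel alone learns nothing about the bootstrap key.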
So now all of that's happened, our guy over here is having a cup of coffee, and we move into the next stage, which is continuous remote attestation. This is where we start to use IMA, the Integrity Measurement Architecture. Effectively we have this continuous polling that happens now: a nonce goes in, a TPM2 quote is returned, the verifier attests the state, and this happens continuously. It's very lightweight: it's just a simple GET request, and we're talking a mere few bytes here. This is configurable; you can tune the polling to your own requirements. I have a single machine here, but this could be thousands of machines all putting requests into the verifier, and the verifier uses a non-blocking I/O framework as well. We've done some scale tests where we've had a few thousand virtual machines all requesting quotes from a single verifier. I'm sure there's room for improvement, but it can handle quite a large grouping of agents. So as I say, we're in this continuous measurement phase now. One of the things that we would have sent earlier would be a whitelist. A whitelist is a list of hashes, SHA-256 hashes, in one column, and in the next column you have a POSIX path to a file. So essentially you've got a file and then its hash, a file and then its hash, and this is a golden state. It's something that you'd typically generate on an air-gapped machine, a kind of cryptographic state of what you expect your application to look like. IMA, as I said, populates the securityfs list, so it also has a list of hashes and a POSIX path to an object, but the difference is that it extends the measurement into the TPM. Keylime then attests the trust state using IMA against the golden state, the whitelist. You can also configure an IMA policy; some folks might know IMA, this is where you delegate
what it is you actually want to measure with IMA. Then what happens when an integrity failure occurs? Somebody runs a script that's not part of the whitelist, or they toggle an SELinux label, or they swap out a binary for a trojanised binary and it's called, and IMA intercepts it, and an integrity failure occurs. So this is where we get to the stage of taking the revocation actions which we described earlier. This is our revocation framework. What we do here is the nodes all connect with ZeroMQ, and we also have the existing connection to the verifier as well. The reason is that somebody could kick the verifier out, so we need to make sure that the nodes can let the others know about certain failed states and so forth. So what happens is our node fails here; we've got a host C that has failed. First of all, the verifier makes a certificate revocation request into the certificate authority, to say that a particular device has failed its attestation. As part of that CA, it could have a root certificate that your cluster then has all of its TLS certificates built on top of, so for a particular machine you could effectively strip down its connections by revoking the certificate. The other thing we do is send out a signed revocation event from the verifier. This is a list of actions to take, and as I say, it's signed by the verifier. So let's have a look at some of the actions that we could take; I've sort of already described these earlier. The first one is the verifier sends out a signed revocation event, so we can make sure it's not somebody pretending to be a verifier, because there's actually a signature there, and we send it out to the agents and we tell them to remove the failed node from authorized_keys. That's a very simple example, but it gives you an idea of what you could do. And then of course, as I alluded to earlier, we make a certificate
revocation, which invalidates all the TLS and IPsec connections and cuts off the failed node. The other thing, of course, is that you could integrate this into an existing system. You might have an alerting system or an incident management solution; you make an API call, whatever it is that you want to do. So, to move on in the interest of time, a little bit more about the project and where we are at the moment, because we're a young project. We've got a nice team that's building; we're attracting developers organically, which is really nice. People are finding us, they're coming along, they're trying it out, they're getting interested, they're making patches. For example, Amy recently joined us; she found us of her own accord, I think through a good first issue, and she wants to revamp our UI, which is really nice. And we're not tied to any particular vendor; we're a nice organic community. These stats were pulled from GitHub by another tool: as you can see, we've got increasing year-on-year commits. We're a young but established code base, and it considers us a large development team. I'd like us to be a lot larger, but all in good time, hopefully. You can see the first commit was made by MIT in October 2016, and it says the most recent one was an hour ago, but this was captured last week or something. Since I got involved in the project myself (originally it was just some code on GitHub), we ported it from Python 2 to Python 3, because obviously time is ticking down on Python 2, and we moved it from TPM 1.2 to 2.0. So there were some key changes that we wanted to get into place, and we've achieved those now. We also wanted to get it looking and smelling like a good open source project, so we have CI testing: you make a pull request, we instantiate a container which has a TPM emulator, and then we run a series of functional tests to mimic an attestation happening. We also use Codacy for lint checks and so forth. We
have some documentation. We need better documentation (I think every project could say that), but we have a good framework there, and we're starting to look at doing build testing as well. One of the things with Keylime is we've had people interested from the standard Linux distributions, of course, but the IoT people and the edge people are very interested in this. I've had a few people approach me, because essentially an IoT or edge device could be up in the ceiling there: physically it's not an easy place to access, but you get the idea, it's not protected, it's not in a data centre. We have a lot of concerns around people messing with different interfaces and physically accessing an edge device or an IoT device. So we meet weekly, and we have a channel that's open 24/7 where we all hang out. One of the things we wanted is that when people come along and try it, and they get a big exception and something's broken or it just doesn't work, we want to be really open and friendly for people, do you see what I mean?
So if people actually try our software, we're really going to try to support them, help them, get it working. There's no such thing as a stupid question; we just want to be a good, welcoming open community. That's one of the key things that we've tried to have from the beginning. As I say, we meet once a week, 15:00 UTC, and anybody is welcome. We track all of our meetings as GitHub issues; this allows us to reference pull requests, everything ties in quite nicely, and that works well for us. So, what's coming next? We have the agent, and remember, this is the part that runs on the remote machine that you're monitoring. That was originally developed in Python, and I'm porting it to Rust; I'm working on that at the moment. We've gone for Rust for the performance, the lack of garbage collection, and the security: memory safety and thread safety, plus a strict compiler to help keep the technical debt down a bit. So that's a work in progress, and eventually the Python agent will be deprecated. We're also working on something that we're calling vTPM support. A TPM is a hardware chip on a machine, and it doesn't scale very well. It works very well within the context of being in a machine and attesting that machine, but a hardware TPM can't handle a significant load; it's better at single operations. So somebody, Stefan Berger, has worked on some code for a vTPM, and a vTPM is great because you can instantiate it in a container or a virtual machine. But the problem is you're back to having your trust on disk and in memory, so we need to cryptographically marry the vTPM's trust to the hardware TPM's trust. This is something that was designed by Nabil, one of the guys from MIT, and he's working on it with some interns from Boston University that are part of the Mass Open Cloud project they have there. I'm not an expert on this, but effectively what they're going to do is
pull all of the quotes together, build them into a Merkle tree, and then there'll be a wholesale passing of those quotes to the hardware TPM, which can attest them. Then they go back to the verifier, and that will individually let the specific vTPMs, and the person measuring those vTPMs, know about their current state. So if anybody corners me about more details on this: I'm following the work, but it's not work that I'm actually doing myself; I'm working more on the core code base itself. But this will allow us mass scale, so we can have thousands of containers without trying to shoehorn them all into working with one hardware TPM, while still tying the hardware root of trust to the virtual TPM. There's a prototype being developed at the moment, and we're hoping we'll have something to show in perhaps three to four months. So, interoperability and feedback: we love having different TPM vendors come along and try to get it working, and if it doesn't work, we'll try to get it working for you. We're very much into testing this out on different software environments and different platforms. Generally we're pretty well abstracted away from the hardware, so a lot of the time we're not really heavily dependent on ARM or x86 or whatever. And we value feedback; that's the one thing. Tell us: this isn't so useful, you should be focusing on this, this is what we need. That's the sort of stuff that we really value. So to round up, we're looking for anybody to come and help here: engineers, users, architects, documentation writers, people new to writing code, people old to writing code. We're really not picky; we value any help that we can get. And last of all, like I said, we're a young project. We have a responsible disclosure system in place where anybody can report anything, so I'd ask, when you're asking questions (there are some experts here), if there's something that you're unsure about, consider whether we maybe need to discuss it as a potential
bug that we need to look at. So just be mindful of that, because we're a new project. And with that, it's two minutes to half past, I've done well, so we've got a little bit of time for some questions. I'm here for the rest of the day if anybody wants to grab me as well.

On one of your slides you mentioned you had a golden set of hash values, and I know on some operating systems, like Windows, these hash values can be fairly unstable; they can change between multiple boots, and Windows with the TCG introduced the event log. Have you looked at that? Have you had good results, bad results?

That's a really good question, because measurements have been the Achilles heel of the TPM. With Keylime, there are certain things that are going to be more noisy and change state more, so generally we get people focusing on the core parts of the system that should be monitored, rather than logs and so forth. There are other approaches that we'd recommend too, and there's interesting stuff happening with TEEs, trusted execution environments. But on the question of measurements: with the whitelist you can reprovision a machine, so if you had an OS upgrade you could send a new list of measurements around. How this would work with Windows, do forgive me, I should probably get more up to speed on what they're doing there. But one of the things that we're excited about in Keylime, when it comes to measurements, is that there's a movement towards immutable operating systems, where it's very much a fixed state, and that makes measurements very easy. What we can then do is measure a very static core OS, and then we have the container, where things are a bit more noisy, and we can take our own approach there. So we're going to be looking at ways to make it a lot easier to source a measurement list and not have to worry about this constant moving target, because for the ten-odd years or maybe more that the TPM's been around
that's always been the challenge really: how to manage these measurements. So, eureka, there is a good light at the end of the tunnel now. More questions?

Hi, thanks, very interesting. A quick question about the actual core library that you're using to access the TPM. There are several possibilities that I'm aware of: there's a TSS from IBM and also some work from Intel. Are you using any of these to do the actual data marshalling, or have you implemented something of your own?

Another really good question; I should have had that in my slides. We're using what people refer to as the Intel TPM software stack, so effectively you have the resource manager, the TSS, and then the tpm2-tools command set, and we call through those. We use that as our communication stack currently. That's not to say we're against using others; it's just what we've landed on at present. But we're trying to focus more on building on top of that rather than being too deeply entwined.

Good morning, and thank you for a wonderful presentation. The one thing I didn't follow is: how is the bare metal machine provisioned and imaged in the first place, so that you'll know what to expect as far as measurements are concerned?

Yes, so there are various bare metal provisioning systems around; for example, I know that OpenStack has one, Ironic, and Kubernetes has one, and I'm sure you have your own bare metal provisioning framework. For us, one of the current approaches is that we remotely push the agent onto the machine, so we don't really care at what stage we measure. We're not going to stop a boot from happening, so we don't need to be really early to prevent a boot. All we do is take the very end resulting measurement, and then that's there for the application to know it's a good environment, essentially. Having said that, we do want to look at whether we can be part of an early service start-up within the machine itself. But as
to the real ins and outs of how we'll interplay with a bare metal provisioning system, we've yet to really eke out the best practices there.

Thank you so much. So you've explained that you can verify the boot, and the integrity of the system once it's originally booted. What I was wondering is: which problems can you not detect with this solution? For example, if you have volatile changes in memory, can you verify the integrity of a running system?

No, we wouldn't be able to do that; we're not going to try to do everything. We very heavily harness IMA for runtime verification, so you can think of this as being like a tripwire, but with TPM-rooted trust, essentially. And a good thing about IMA, because it's in the Linux kernel, is that it creates those measurements before execution as well. There are actually other measures: part of IMA is EVM, the Extended Verification Module, which works with file labels and can block certain operations and so forth. But we don't have the ability to really dig into memory, and we don't really plan to either. I think it's a very valid area that needs addressing, but we're trying to do one thing well rather than lots of things not so well.

Sounds like a sane approach, thanks. Great, so yeah.