Hello and welcome to Cloud Native Rust Day. My name is Lily Sturman. I'm a software engineer at Red Hat, and today I'll tell you a bit about Keylime, a CNCF sandbox project; specifically, about cloud native security and Rust with Keylime, and about our ongoing experience using Rust as part of that project. First, I want to summarize a little context about the problem that Keylime is solving. Many of you may be familiar with computers, and by extension, you may be familiar with the modern tech stack. This is the modern tech stack as depicted by the talented Randall Munroe of XKCD. You might notice that it has a little problem: it's a complex stack with a lot of layers, and that means there is a large attack surface. It's hard to know that every single layer of the stack on a machine you're working with is going to be secure. Furthermore, due to the nature of the stack hierarchy, lower layers generally have access to higher layers of the stack. What this means is that if a lower stack layer is compromised, the entire stack is compromised. You can try to write or run a completely watertight application with all the security features you want on your system, but if there is a compromise lower down, in the bootloader, firmware, host OS, or hypervisor if you're running a VM, you will still be dealing with security problems. Now imagine that you have many machines with a similar stack, even remote machines in the cloud, and this problem compounds. Especially when you're running in the cloud or on a remote machine, you often have even less control over these various stack layers, because you may not be in charge of all of them. So if you are running a remote machine, it would be great to know whether you should trust these layers, both at boot time and at runtime.
At boot time, you want to know that you're booting the machine you expect, in the state you expect; and at runtime, you want to know that the code you're running is running as expected, in the environment you expect to run it in, and that no bad actor has executed some malicious script on the machine. This is where Keylime comes in. Keylime is a project that was started at MIT Lincoln Laboratory in 2016. Over time, the project and its community grew, and Red Hat got involved as well. By 2020, last year, the Keylime project had been accepted as a CNCF sandbox project. The code base began its life in Python. What you see here is a simple depiction of the Keylime setup. On the left, you see the remote machine, which is also called the node or agent. This is the machine that's going to prove its trustworthiness to the Keylime system, and you can imagine it running that huge tech stack you saw on the previous slide. Before this machine is allowed to run any tasks or be included, for example, in a cluster, it has to prove to the Keylime system, via attestation, that it is running in an expected state. There can be multiple agent machines; we're only showing one on the screen for simplicity. Until this attestation process happens and is checked by the components you see on the right-hand side, the agent is not trusted with sensitive data or payloads. There's also a revocation framework in place to run a set of predefined actions if the agent node fails attestation, even during runtime: maybe it was in a good state before, but something happened, some malicious script was executed, and it's no longer in a known good state. The agent machine will need a TPM, which serves as a hardware root of trust; the machines on the right-hand side don't need one.
Those are the components responsible for verifying the state of the machine on the left; they're called the verifier and the registrar. These components could be running on a machine on premises, for example, so they're communicating with the agent over an untrusted network, and that is completely fine. At a high level, what Keylime offers is a way to measure the state of a remote machine during boot, as well as at runtime, using something called the Integrity Measurement Architecture (IMA). It allows encrypted payloads to be delivered to that remote machine after the machine has attested its state through use of the TPM. And it offers a revocation framework: in the case of that malicious script I mentioned before, the node could, for example, be removed from the cluster automatically. There's a lot more to know about Keylime and how it works, and if you're curious, I encourage you to check out keylime.dev. I just wanted to give an overview of the architecture so that you have some context for which portion we chose to port to Rust. Spoiler alert: it's the agent component, the remote node. The reason this component was chosen is basically that it's a smaller, self-contained component. You can see there are multiple components on the right, and we're open to potentially porting more, but we chose to start with the agent. Keylime is actually already available as an RPM in developer preview in Fedora, which, if you don't know it, is a Linux distribution developed by the Fedora Project. At Red Hat we have a philosophy of upstream first: we want to contribute directly to open source upstream code bases and build community around those code bases, and Fedora is one example of working with an open source community in this way. So if you're looking for a Keylime RPM for Fedora, it's already available at the link you see on the slide.
That RPM uses the original Python version of Keylime. So that's great, but what about other use cases and packaging for other distributions? And was Python really the best choice for a security-oriented system like this? These questions were asked, and they eventually converged on a decision to begin porting at least a portion of the Python Keylime code base to a different language. As this is part of Rust Day, you may be able to guess which language was chosen. If you're curious about some of the considerations that led to this decision, there is the example of Fedora CoreOS. For those not familiar, this is another Linux distribution, known as an automatically updating, minimal operating system for running containerized workloads securely and at scale. It's often used as a container host, and it's known as an immutable operating system. At a very high level, it's meant to be a read-only operating system in the sense that all installed programs are defined by a single commit hash. This commit hash describes everything that is installed on the system, so the system can be atomically upgraded or rolled back to a particular hash, which can make it a lot easier to manage a large number of these machines. As you can imagine, you're not going to want to install programs that the system doesn't already have; you want to anchor it to a particular commit hash so that you know the state of the system. For this Fedora CoreOS use case, if you want to package a program for this type of distribution, you're going to want to know all the dependencies ahead of time. You're going to want the program to be very self-contained, because you're not going to want to be updating the OS to install new dependencies while the program is running. And Fedora CoreOS also tries to be very minimal in its dependency tree.
A Python program would actually add a lot of complication to that tree, so Python was not a good choice here; we didn't want to pull in the vast number of dependencies that a Python program would require for something like this. This is some of the thinking that led to the decision to port some portion of Keylime over to a different language that is more compatible with a distro like Fedora CoreOS. Now for the big reveal: the language that was chosen was Rust. In addition to being a compiled language, so that we can create a self-contained binary, it has multiple other advantages as well, and some of you Rust fans might already know some of these things about Rust. It has better performance than some other languages because it doesn't have garbage collection, for example. And for us, especially as a security project, we had a lot of security considerations. The ownership enforcement that Rust has is very useful in terms of memory and thread safety, and the rigorous type checking from the compiler avoids unexpected behavior at runtime, which is very important: nobody wants unexpected behavior at runtime. We also get to be somewhat less concerned about side-channel attacks, which is interesting; there are some papers about this that I didn't link here. The lack of garbage collection leads to more predictable performance, which can actually help against some timing attacks. Additionally, there's a helpful Rust ecosystem around security. For example, there's even a crate called timing_shield that can help you guard against some of these side-channel attacks. We haven't used this particular crate in our project, but I found it and thought it was interesting. Rust also gives us some helpful compilation targets: it has an ARM target, for example, for Internet of Things use cases that could be very useful down the line. And it's also got a very good foreign function interface.
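As a tiny, hand-written illustration of what calling into C from Rust looks like (this example declares libc's `getpid` by hand and is not from the Keylime code base):

```rust
// Hand-written FFI declaration for a function from the C standard
// library. The linker resolves it against libc on Unix systems.
extern "C" {
    fn getpid() -> i32;
}

fn current_pid() -> i32 {
    // Calling across the FFI boundary is `unsafe` because the compiler
    // cannot verify the foreign function's contract.
    unsafe { getpid() }
}

fn main() {
    println!("running as pid {}", current_pid());
}
```

Writing these `extern` declarations by hand for a large C API is tedious and error-prone, which is exactly the gap that tooling fills.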
Some of you may be familiar with bindgen, which automatically generates Rust foreign function interface bindings to C; that's been really useful for systems programming. So these are just some of the highlights of why Rust seemed like a good fit for our project, and it's been panning out pretty well so far. There's actually something else I wanted to highlight, which I think could be very exciting for other security projects. The TPM that I mentioned is critical not just to our project but, I imagine, to other security projects that like to have a hardware root of trust. So it's reasonable to think that Rust as a language choice for such a project would be accepted or rejected based on how easy it is to work with TPMs. As mentioned a moment ago, Rust already has pretty good FFI to C generally, which helps, as TPMs will often have a C interface. So it was already possible to use Rust to talk to a TPM, but this new library, the TSS 2.0 Enhanced System API (ESAPI) Rust wrapper, makes it a lot easier because it has those bindings already. It's basically a wrapper that binds to the tpm2-tss software that some people may already be familiar with: a pre-existing C library for communicating with the TPM. This is a new library, it's helped us a lot, and it's been undergoing active development over the past few months. We actually had some functionality in our project where we used interfaces from this crate as they became available, which meant pinning our dependency in Cargo.toml to the tss-esapi Git repo instead of the officially released version. That's a cool Rust feature: you can pin a dependency to very new code in a Git repo, and that was useful to us here as well. By using this library, we avoided having to call out to C ourselves.
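A sketch of what that kind of Git pin looks like in Cargo.toml (the repository URL and branch here are illustrative, not necessarily the exact lines from our manifest):

```toml
[dependencies]
# Pull the crate straight from its Git repository instead of a
# crates.io release; a `rev` key could pin an exact commit instead.
tss-esapi = { git = "https://github.com/parallaxsecond/rust-tss-esapi", branch = "main" }
```

Cargo records the exact commit it resolved in Cargo.lock, so builds stay reproducible even though the dependency points at a moving branch.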
We saved ourselves a little bit of time, and we also saved ourselves from having to issue shell commands to the TPM, which I believe the Python version of the Keylime code was doing. So we have more direct communication with the TPM this way, which is great. What about our other experiences with porting to Rust? Overall, it's had everything that we need in terms of cloud native security. We've been able to have the server-client interactions the agent node needs in order to communicate with the verifier and the registrar over an untrusted network. There are a lot of great crates out there for this type of communication already; I've just highlighted a few here that we've used, like reqwest as a client, actix-web for the server, and zmq. And we've been able to have this type of communication happen asynchronously between the components when we need it. On top of Rust's safe and speedy memory management, Rust has the helpful mutex primitive for shared memory where we need to restrict access to one thread at a time. For example, we have something called a TPM context that we only want one thread to be able to access at once. This has been very easy to use, even together with one of the other crates, actix-web. And of course, in a security project we're going to need cryptography, and there are helpful crypto libraries available for us as well; we've been using the openssl crate when we need to handle things like RSA key pairs in the code. As for cons, there haven't been too many, and some of these may come up in other languages as well. One thing we've run into is that there's still a lot of active development going on in a lot of Rust crates, and some of the libraries we've used have had changing APIs or outdated documentation just because of the rapid pace of development. So we've had to keep an eye on that, adapt to it, and do some revision when we hit any bumps.
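The mutex pattern described above can be sketched with the standard library alone; the `TpmContext` struct here is a hypothetical stand-in, not Keylime's actual type:

```rust
use std::sync::{Arc, Mutex};
use std::thread;

// Hypothetical stand-in for a TPM context: in the real agent this
// would wrap the TPM handle, which only one thread may use at a time.
struct TpmContext {
    operations: u32,
}

// Spawn several threads that all need the shared context; the Mutex
// guarantees only one of them holds it at any given moment.
fn run_serialized(n_threads: u32) -> u32 {
    let ctx = Arc::new(Mutex::new(TpmContext { operations: 0 }));
    let handles: Vec<_> = (0..n_threads)
        .map(|_| {
            let ctx = Arc::clone(&ctx);
            thread::spawn(move || {
                // lock() blocks until no other thread holds the guard;
                // the guard releases the lock when it goes out of scope.
                let mut guard = ctx.lock().unwrap();
                guard.operations += 1;
            })
        })
        .collect();
    for h in handles {
        h.join().unwrap();
    }
    let count = ctx.lock().unwrap().operations;
    count
}

fn main() {
    println!("operations performed: {}", run_serialized(4));
}
```

actix-web fits the same shape: application state is shared behind an Arc-like wrapper, with a Mutex inside when handlers need mutable access.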
But this hasn't been too bad. Secondly, there is a higher barrier to entry for getting participation in a Rust project, because Rust still isn't a language that a lot of people know. So there can be a bit of a learning curve, even for someone who is interested, as they try to learn Rust quickly. But hopefully the Rust community and ecosystem will keep growing, and eventually maybe this won't be a concern. As for the current status of the Rust port of Keylime, I'd say it's about 80% done. A lot of the functionality from the Python agent is already in place. The agent needs to be registered and activated as part of the Keylime system, and code is already in place to go through those steps. We also have the revocation framework in place; that's for the case where a malicious script is executed on the node and we want to run some actions to make sure we don't keep trusting that machine. We're still working on the attestation, both for boot and for runtime. And of course, we're going to want to do some extensive testing before we're able to package this. But we're getting there, and it's been an exciting and rewarding project that we hope will be useful to people. So if you're curious and want to find out more and get in touch, we really encourage you to do so. You can visit any of our resources; keylime.dev is the shortest one, and if you only want to type in one link, that'll get you to a lot of other information. We also have a page on the CNCF site, as we're a CNCF sandbox project, so you can visit us there. If you're curious about the code or interested in contributing, we've also got GitHub repos for both the Python and the Rust versions of the code base. And we've got a Slack channel, so if you have any thoughts or questions, we encourage you to join it. I believe there's also a bi-monthly text-based meeting on that same channel, which is kind of cool.
You don't have to turn on your video or anything like that; it's a purely text-based meeting. I also thought I would plug the TSS ESAPI library again, the library for working with the TPM, because it's so useful and worth checking out. I think they're also a CNCF project, and they have their own channel on the same CNCF Slack. So thanks very much for listening. If you have any thoughts or questions, please let me know or reach out on Slack. We look forward to talking with you.