All right. Hello. So thanks for coming along. My name's Matthew Garrett. I'm a security developer at Nebula. And this is? I'm Christian. Is the microphone on? Yes, OK. I'm not normally a security guy; I'm a cloud architect, and I come from the storage side, but I had a really interesting project with Intel trusted computing that I thought would make a good topic for you. So the way we're going to do this: I'm going to give a reasonably rapid overview of some of the issues that we have to face, and then Christian is going to talk about the use of Intel trusted computing as an example of a mitigation strategy for some of the attacks I'll be describing.

So let's get on with it. The cloud industry is fairly different to traditional hosting. We like to talk about the ways cloud hosting is better: it's more efficient, you get to use machines at a higher degree of utilization, you waste fewer resources, you're much more dynamic, and you can mix development and production environments without worrying too much about contamination between the two. But there are other differences from traditional hosting as well. In traditional hosting, we pretty much understand the security concerns: someone hacks into your system via some sort of security flaw, or your hosting provider backdoors your system, say by walking into the data center and physically modifying your machine, or alternatively some sort of powerful agency, which may or may not be a government, does so. We have a reasonably solid idea of how seriously to take these risks; we know what we can do about them, we know what we can't do about them, and we treat them accordingly.

In the cloud world, most of these still exist. Someone can hack your system. Someone in your hosting or data center can hack your system. An intelligence agency can do whatever they want. But you also potentially have other classes of risk, such as someone running a guest on your system being able to escalate their privileges somehow, either through the hypervisor or through some other form of information leakage, and perhaps compromise a much wider range of the cloud. And this is interesting, because while we understand this at a theoretical level, we don't have a great deal of real-world experience with it. We have not been exposed to many attacks along these lines, and we're not particularly confident about what we would do if we were hit by one. So I'm going to go over some of the things that could result in this kind of compromise, and some of the things we can do to deal with it.

When you boot a cloud system, you have to decide what you're trusting. When a guest comes up, it has to rely on a bunch of things. First of all, your guest has to assume that the operating system image it booted was trustworthy, which is not necessarily a given. How do you demonstrate that the operating system image is the one you expect to be running? Even if someone isn't able to gain full control of your cloud, if they're able to modify the image storage, your guests may no longer behave in a trustworthy manner. And from there we can start thinking about other things: if someone is able to gain access to a specific node within a cloud, what kind of damage can that person do?
If they're merely in control of a system on the same network segment, does that mean they're able to leverage that into taking control of other parts of your cloud? Brian actually ended up talking about more of this than I expected him to, which, given that I work for him, is probably something I should have anticipated. Anyway, here's a cloud. Clouds are beautiful, clouds are wonderful. They're a nice, happy blue color. One of your guests is compromised, and this is fine, because the hypervisor is sitting there isolating that guest. This is no worse a situation than having a single server in your data center hacked; it's still isolated from everything else.

Unfortunately, hypervisors are made of software, and software is written by people, and history kind of suggests that people aren't very good at writing software. So there's a non-zero chance that an attacker can go from this situation to this situation. And the problem is that once your hypervisor is compromised, the likely outcome is that the rest of your guests are going to be compromised. There's a large attack surface on hypervisors. They're abstracting away a lot of functionality that would otherwise be performed in hardware. We can say, okay, in some cases we deal with this by just passing hardware functionality through directly, but that's not a particularly scalable approach for most cloud hosting environments. So instead you have virtio, and that's a moderate amount of code to have to deal with. It's not always particularly well audited; there have been significant security issues discovered in multiple virtio components. And that's not necessarily even the worst of it. At least you can audit virtio code: you can look at it, and you can use existing code analysis tools. The problem with virtualization is that it's pretty complicated, and there are a lot of subtleties involved. There was a Xen security vulnerability that was introduced because Intel and AMD had subtly different semantics for one part of the AMD64 instruction set (the SYSRET instruction), which made it possible to break into the Xen hypervisor if you were running on Intel hardware, but not on AMD hardware. That's not typically the kind of thing you notice just by auditing, because not many people are going to be auditing with a solid knowledge of the intricacies of your CPU architecture.

And as an industry, that lack of auditing, that lack of confidence in our hypervisor security, perhaps results in some other unfortunate side effects. We tend not to think about this, and as a result we don't tend to describe what a competent cloud security policy looks like. We don't have great ways to update systems when we know there is a security issue. And we often have nothing that will actually tell us that there has been a hypervisor compromise; we don't necessarily even apply existing tooling to that case. So say someone breaks out. Obviously they can compromise all the other guests on a system, but can they go further than that? For instance, can they start faking PXE boots and perhaps launch modified software on other compute nodes in the same cloud cluster? That's probably possible. And if it is, are you able to detect it in any way? You can use TPM attestation for this, you can use TXT, and Christian is going to be describing some tools that already exist that you can deploy in a cloud environment to deal with this.
So the worst-case situation, I think, is that you have an insecure hypervisor and your entire cloud is as vulnerable as the weakest guest. You have guests running a variety of code, and if someone is able to break out of a guest into the hypervisor, then the only thing the attacker needs to compromise is a single guest, and from there they can compromise whichever guest is running the weakest, least secure code. Of course, if you're a public cloud provider, the weakest guest is one that the attacker has launched themselves. You took some credit card details which, given that they're about to engage in illegal behavior, probably weren't valid to begin with, and you've in effect sold them the tools they need to break into your cloud, which is unfortunate.

There's actually a more unfortunate situation. Say someone breaks out of your hypervisor and manages to take control of a significant part of your cloud. You come up with some sort of cover story, or perhaps you tell the truth; all your machines go down for a couple of days, and then they come back and everything's fine. Except this isn't necessarily the case. Most servers have a baseboard management controller (BMC) on them, a small separate computer that typically runs an embedded version of Linux that was probably last updated over three years ago, if you're lucky, and which is connected to the PCI bus so that it can act as a remote keyboard, video and mouse setup. As a result it's able to perform DMA, at least over the lower four gigabytes. I'm not planning on getting into the business of selling insurance to BMC vendors, let's put it that way. If someone's able to take control of a machine, there's a reasonable chance they'll then be able to compromise the BMC. If they're able to compromise the BMC, they can do things like disable the firmware update capability on the BMC, and then perhaps use the BMC to modify any operating system that's booted on that piece of hardware in the future. How do you know that that's happened? What do you do about it? We really haven't figured that out yet. It's an unfortunate situation to be in, and it perhaps means the worst-case scenario is that someone buys access to your cloud, runs a deliberately insecure guest, uses that to break into your hypervisor, and from there compromises the hardware itself in an almost irreversible way. You reinstall the machines, they come back, and the BMC just reinfects them. At that point you're probably going to have to think about retiring every server you own, which is pretty bad.

So knowing about these things in advance is helpful, and there are some mitigation strategies. As Brian mentioned, first of all you can reduce the impact of many hypervisor vulnerabilities using something like SELinux or AppArmor; if you're not doing that, you really, really need to be doing that. It reduces the probability that someone will be able to escalate from owning a guest to owning the entire compute node (there's a quick way to check that this confinement is in place, shown below). You can also perform boot attestation, which Christian will be describing in some more detail shortly.
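As a concrete illustration of that first mitigation (not from the talk itself): on a libvirt/KVM compute node with SELinux-based sVirt enabled, each QEMU process runs confined in the svirt_t domain with an MCS category pair unique to that guest, and the guest's disk image carries the matching pair. A minimal sanity check, assuming a typical Fedora/RHEL-style node with Nova's default instance path, looks something like this:

```
# Every guest's qemu process should be labelled svirt_t with its own
# category pair; the categories on its disk image should match, so a
# process escaping one guest cannot touch another guest's resources.
$ ps -eZ | grep qemu
system_u:system_r:svirt_t:s0:c123,c456  4231 ?  00:02:11 qemu-kvm
$ ls -Z /var/lib/nova/instances/<instance-uuid>/disk
system_u:object_r:svirt_image_t:s0:c123,c456  disk
```

If the processes show up unconfined (for example as plain qemu_t without categories, or with SELinux in permissive mode), a hypervisor breakout has much more room to move.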
You should probably also be providing a secure boot chain for your guests. Now, this won't necessarily help if the node is compromised, because people can just modify the guest after it's booted into the operating system. But UEFI Secure Boot is a thing which allows you to configure systems in such a way that they'll only boot operating systems that are signed with a trusted key. Red Hat have been doing work on integrating support for UEFI Secure Boot into QEMU and libvirt, and the long-term hope here is that users should be able to choose the keys that their guests will trust. Then, if somehow the operating system image is modified, the guest will refuse to boot that modified operating system.

The perhaps most interesting one, though probably also the furthest away from being something you can deploy today, is external validation. If you can read system memory, then you should theoretically be able to verify that the system is in a good state, that nothing has modified system memory. This is harder than it sounds. First of all, doing it on the local system is difficult, because a sufficiently advanced attacker can just modify whichever technology you're using to perform the memory accesses and feed you false results. But you can get around that in a couple of ways. Some work Intel are doing, the SGX extensions to x86, may allow you to do this in a secure way. Alternatively, there are companies that will sell you PCI cards that you can plug into a system and then access system memory over the network, which is a fairly straightforward way of handling that problem. That's a bit more expensive, but they're actually much cheaper than you'd think. But it's still difficult at a theoretical level. Even if you can read all memory, you can, without too much difficulty, verify that no executable code on the system has been modified. You can verify the kernel, and once you've verified the kernel, you can trust the kernel to verify everything else. The problem is that attackers have got good at circumventing this kind of thing. Say you have a function pointer, and the function pointer points at a function that's supposed to perform a security check. The attacker comes along and replaces that function pointer with one that points at an existing piece of code that always returns true. Now the fact that you've verified that none of your code has been modified is unhelpful; you also need a way of tracking down function pointers. That's a much harder problem. It's not one that there's a solid solution for yet; it's an active field of computer science research, and it would be nice if people were trying to actually turn it into working products.

And perhaps the last thing to think about is that, as an industry, we rely on the fact that people trust their hosting providers, and we're not necessarily doing a great job of convincing people to trust us in this respect. If you go to a public cloud provider's website today, it's in general almost impossible to get any insight into their security. They will not tell you what their processes are for ensuring that if there is a hypervisor vulnerability, they will update the system. They won't tell you how your data will be secured, or what mechanisms they have to detect that a system has been compromised and notify customers. We should probably start working out the best practices for this as an industry, and we should start reassuring customers that they can actually trust cloud providers. Otherwise the industry may not end up growing the way we were hoping it would. And OpenStack is probably a good forum for setting an example here.
We have a large number of companies working together trying to create a better overall ecosystem, and if we can figure out what the ideal best practices look like, I think that helps us all. So I'm going to hand over to Christian now.

Okay, let's see whether this works. Well, apparently Apple is throwing me for a loop here; I hope you can live with the image being what it is. We tried it earlier and it worked, and now it doesn't. That's technology for you. This is where security starts. Anyway, we're talking about security, and in a bare-metal environment we have an operating system, we have a BIOS, and we have a boot process, and there are certain ways to secure this. One of them is Intel trusted computing, Intel TXT (Trusted Execution Technology). And TXT, up to this point, is something that was originally limited to bare-metal infrastructure. So what I want to talk about today is using established server protection, meaning Intel TXT, to protect cloud infrastructure, in our case OpenStack, of course, with existing tools. There's actually a way to use Intel TXT to secure your hypervisors and have OpenStack refuse to deploy workloads that require security onto hardware that does not provide that security.

The 10,000-foot view, and this is very short: what can we do? Cloud infrastructure is vulnerable. Any computing infrastructure is going to be vulnerable, but cloud infrastructure is more so, because we have another layer of rather complicated, as Matthew already said, rather complex code underneath what we are normally used to protecting: our operating system, our firewalls, malware protection, whatever we do to protect our workloads. We have a whole layer underneath this, which is essentially the hypervisor, the operating system the hypervisor runs on, and the hardware and firmware we're working with. One thing we cannot do is simply detect from the guest OS that the cloud infrastructure is compromised; the whole way virtualization is designed, you're not supposed to be able to see what the operating system underneath is doing. So we need to protect this infrastructure somehow.

Intel trusted computing does a number of things. It's a combined hardware and software solution. The hardware is a TPM chip that sits on the motherboard of Intel server systems, plus some circuitry built into the Xeon processors that go with that TPM. Most of the Intel-based servers being sold right now have this technology, a lot of the servers already in data centers have it, and we want to make it usable for our environment. We have the BIOS or UEFI, depending on what you're using, we have the bootloader, and we have the OS starting up, and we want to measure the behavior of that environment and make sure it's compliant with the known good behavior we determined originally. These measurements cannot simply be stored in memory, because that would be too easy to manipulate for somebody who has already compromised the server, so the measurements are stored in a hardware device, and there is another layer that goes with that which provides verification against a remote server. (The way the TPM accumulates these measurements is sketched below.)
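To make the "stored in a hardware device" part concrete: a TPM 1.2 PCR cannot be written directly, only extended, where the new value is a hash of the old value concatenated with the new measurement. Here is a minimal sketch of that semantics in Python; it's illustrative only, since real measurements are hashes of the actual boot components and the TPM performs the extend in hardware:

```python
import hashlib

def pcr_extend(pcr: bytes, measurement: bytes) -> bytes:
    # TPM 1.2 semantics: new PCR = SHA-1(old PCR || measurement).
    # PCRs can only be extended, never set, so a compromised OS
    # cannot rewrite the record of what booted before it.
    return hashlib.sha1(pcr + measurement).digest()

# PCRs start out as 20 zero bytes; each boot stage is measured
# (hashed) and folded in, in order. Changing or reordering any
# stage changes the final value, which is exactly what the
# attestation server compares against its known good value.
pcr = bytes(20)
for stage in (b"bios", b"bootloader", b"kernel+initramfs"):
    pcr = pcr_extend(pcr, hashlib.sha1(stage).digest())
print(pcr.hex())
```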
So how does this apply to the cloud? Underneath our cloud, we just have bare-metal servers. So we can take existing trusted computing and attestation and protect this hardware the same way we would if the workload were running directly on bare-metal systems. What we need is a mechanism to make the cloud Intel TXT aware, and this does actually exist in OpenStack.

Let's talk about TXT for a few minutes. As prerequisites, we need Intel TXT capable hardware: CPU, chipset, and the TPM module that comes with the hardware. Take a server from Dell or HP; if you opt for Intel boxes, they pretty much all have this hardware built in. Then we need a TPM-capable BIOS or UEFI, which, again, you will find on pretty much every piece of hardware that is TXT capable. And you need a trusted boot module that wraps around the operating system, because you cannot trust the operating system to tell you that something's wrong with it if it's already been compromised. It's going to say "I'm good", but is it really? So we have to have something that wraps around it, which is tboot, a trusted boot module, and you can use Trusted GRUB as an intermediate step to have a secure bootloader, so that people cannot manipulate the Trusted GRUB and tboot modules in your environment.

So, the boot sequence: the BIOS is attested by the hardware. You can actually make tboot stop the boot if your system is compromised, meaning if the system does not behave the way tboot expects it to. You have the values from a known good boot in the TPM module, and you have the values from the boot sequence you're currently performing, and if they do not compare equal, then you can either stop the boot or boot the server as an unknown or untrusted compute node. The advantage of doing the latter is that you're not losing all of that compute capacity if, for instance, you've made a change and haven't updated your attestation server or the TPM values yet. You can still use that compute node; you just cannot deploy trusted compute loads onto it. So the bootloader loads tboot, which is a wrapper around the operating system; tboot watches the operating system boot and hands over to a module named SINIT that Intel provides, which allows the boot measurements to be written into the TPM. tboot and SINIT together attest the kernel and the initramfs that are loaded with the hypervisor.

If you look into that TPM module, you will find a set of platform configuration registers, PCRs, numbered PCR00 up to PCR23. They are populated while you're booting: 00 and 01 are BIOS and firmware, and so on. During the boot, each of these registers gets populated and gets compared with known good values for that step. If you want to look at your own system to see those platform configuration registers, on most of the CentOS and Red Hat systems you will find a path under /sys; it's actually just a cat-able file that's an interface to the TPM module. The values in this module can be used for local verification, meaning the system compares them with values it holds itself, and for remote attestation, meaning you have a server somewhere that is better secured than your cloud, which can be asked: do the values from this boot correspond to the values that I have put into that attestation server?
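For reference, the cat-able file Christian mentions typically lives at a path like the one below on TPM 1.2 systems; the exact sysfs location varies by kernel version, so treat this as the common RHEL/CentOS-era case rather than a guarantee, and the values shown here are made up:

```
$ cat /sys/class/misc/tpm0/device/pcrs
PCR-00: A8 5A 84 B7 ...   # BIOS core
PCR-01: 11 40 C1 7D ...   # BIOS configuration
PCR-04: 78 93 CF 58 ...   # boot loader
PCR-17: 5F EC EB 66 ...   # TXT/tboot measurements land in PCRs 17-19
```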
So, here's an attestation example from a conventional boot. You have the known good TXT boot metrics on your attestation server. You can set this up after the system is built, but you have to change it every time you change your boot environment: you upgrade the kernel, you upgrade this, you upgrade that, you upgrade the BIOS, and you have to put the new good TXT metrics into your attestation server. The attestation server then retrieves the actual state from the clients that have booted and compares those actual states with the known good states it has from your original configuration. And when you have a piece of software ask the attestation server for the state of a host, you get back either trusted or unknown.

So, on the OpenStack side, how does Nova allocate resources? Of course we have schedulers. We have a number of different schedulers, but the most popular nowadays is the filter scheduler, because it has a whole bunch of configurable items you can use. One of these configurable items is the trusted filter. The trusted filter uses TXT attestation to determine whether a node has passed its secure boot tests and checks and is considered secure. The way you use that on the operations side is that you set a flavor, or a number of flavors, that have an additional key set. Unfortunately you cannot do that from Horizon yet, at least not in Havana; I'm not really sure about Icehouse. In any case, you have this key that you can manually set from the command line. I'll show you the command later on, and there's a sketch of it below. If that key is set to trusted, then the trusted filter will not allow you to schedule the workload on a host that has not passed this TXT test.

Okay, so here the operator comes and says: we have this workload that runs with trust level equals trusted. So the API endpoint that we talked to, and told to build an environment with trust level equals trusted, talks to the scheduler. The scheduler has the filter scheduler module and the trusted filter loaded. And instead of just scheduling the workload onto any of those servers up here, it goes to my attestation servers, which can be hosted either in my cloud or outside of my cloud, depending on your security requirements. The only real connection between here and here has to be port 8443, and it's basically HTTPS requests and responses going back and forth. Meanwhile, the attestation server keeps track of which servers have passed their boot. For instance, we have a server here that has not passed a trusted boot and is not considered trusted; these ones have passed their trusted boot, and the numbers in their PCRs are the same as the attestation server expected. So the attestation server says: we have one, two, three servers here that are safe, and we have two that are not safe. Once my trust level key comes in, I ask the attestation service and I get back a set of servers that is considered secure, in this case these three servers. And then the scheduler will only schedule onto these hosts. Even if there's high load on these hosts, it will not schedule onto the others, regardless. And that's good, because the people who have potentially compromised those hosts cannot compromise the workload that we are putting on our cloud. So again: we have a known good state for all clients on the attestation server, and the attestation server pulls the actual state from all clients. This is all the same as attestation normally works.
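To make the operator side concrete before getting into the cache mechanics: the Havana-era configuration looks roughly like this. This is reconstructed from the trusted-compute-pools documentation of that era rather than taken from the slides, so treat hostnames and values as placeholders:

```
# /etc/nova/nova.conf on the controller
[DEFAULT]
scheduler_default_filters=AvailabilityZoneFilter,RamFilter,ComputeFilter,TrustedFilter

[trusted_computing]
attestation_server=oat.example.com          # your OAT appraiser host (placeholder)
attestation_port=8443                       # the only port that needs to be open
attestation_server_ca_file=/etc/nova/ssl.crt
attestation_api_url=/OpenAttestationWebServices/V1.0
attestation_auth_blob=i-am-openstack        # must match the OAT server's auth blob
```

And the key itself is set on a flavor from the command line, something like:

```
$ nova flavor-key m1.trusted set trust:trusted_host=trusted
$ nova boot --flavor m1.trusted --image my-image my-trusted-instance
```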
The only difference is actually here: the trusted filter has a cache of trusted nodes in it. The cache is time-based; after a short while, it times out. When a trust level key comes in, the cache is checked for whether it has timed out or not. If it has, the trusted filter goes to the attestation server and refreshes its cache, and then it selects a node from the trusted pool and launches the workload there. Now, what happens if we do not have any trusted servers anymore? For instance, somebody has accidentally updated all our operating systems and has forgotten to put that into the attestation server. Well, what would you guess happens? The workloads cannot be scheduled. And this is also the way we want it: we do not want a secure workload to be scheduled on a server that we do not trust.

Inside the trusted filter, you will find a TrustedFilter class, and you will find its methods in there. It's essentially just a constructor and a single do-attestation routine that runs compute attestation. Compute attestation calls the compute attestation cache and checks whether the cache is valid, in which case the trust level is just returned and goes back to the filter scheduler; otherwise it has to update the cache and go all the way out to the attestation server, and then the information the attestation server provides bubbles up through the request-attestation and update-cache steps and goes all the way back to the filter scheduler. In either case, you get back the set of hosts that is trusted, and the filter scheduler then decides which of those nodes to put the workload on. This is just the same thing in code. Essentially, the trusted filter, like every filter scheduler plugin, has the same kind of constructor, just with a different name, and it uses the base host filter as its base. It's actually a relatively short and straightforward piece of Python code; I encourage you to look at it. When I was working with attestation back in January, I found that in Grizzly there was an incompatibility between the attestation server and the trusted filter which would always report all the hosts as untrusted. It was essentially in here: the attestation cache would always be invalidated, because the timestamp that came back from the attestation server did not match the timestamp format required by the compute attestation cache. With a bit of logging I found it and patched it locally, but the bug has been fixed by now, so you should be able to run the trusted filter without any issue.
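As a rough sketch of the control flow just described: this is condensed and illustrative, not the actual Nova source (the real code lives in nova/scheduler/filters/trusted_filter.py), and the attestation_client here stands in for the HTTPS client that talks to the OAT appraiser on port 8443:

```python
import time

class ComputeAttestationCache:
    """Time-based cache of per-host trust levels (illustrative only)."""
    def __init__(self, attestation_client, ttl_seconds=60):
        self.client = attestation_client   # hypothetical OAT client wrapper
        self.ttl = ttl_seconds
        self.cache = {}                    # host -> (trust_level, fetched_at)

    def get_host_attestation(self, host):
        entry = self.cache.get(host)
        if entry and time.time() - entry[1] < self.ttl:
            return entry[0]                # cache still valid, no round trip
        # Cache miss or timeout: re-attest via the server and refresh
        # the entry for every host it reports on.
        for h, trust_level in self.client.do_attestation():
            self.cache[h] = (trust_level, time.time())
        entry = self.cache.get(host)
        return entry[0] if entry else 'unknown'

class TrustedFilter:
    """Pass a host only if the flavor's trust key matches its attested state."""
    def __init__(self, cache):
        self.cache = cache

    def host_passes(self, host, extra_specs):
        required = extra_specs.get('trust:trusted_host')
        if required is None:
            return True                    # flavor doesn't ask for trust
        return self.cache.get_host_attestation(host) == required
```

The important behaviour is exactly what Christian describes: an untrusted or unknown host simply never passes the filter, so if no hosts attest as trusted, a trusted-flavor workload does not get scheduled at all.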
Practical application. This is from that project in January. You can run an attestation server as a standalone server; in this case, that's what we did. Depending on how high your security requirements are, especially in the public cloud Matthew was talking about before, you do not want to have the attestation server on your controllers. You will have a separate attestation server that's firewalled off and protected from people manipulating it, because if they can manipulate the operating system and then also manipulate the attestation server, you essentially do not have any security: the attestation server will just repeat whatever your operating system tells it. If you run Red Hat or CentOS, you need to have the EPEL repository activated, otherwise you will not have the packages. The packages are named OAT Appraiser and OAT Client. And you also have to open up port 8443 for the traffic; this is the only port that is used.

TPM installation goes the same way you'd use a TPM on your regular hosts. You have two packages: TrouSerS, which provides the tcsd daemon that is able to talk to the TPM module on the system, and tpm-tools, which is the command-line interface for that tcsd daemon. Before you do anything, when you boot the system, you have to enable both TPM and Intel TXT in the BIOS setup. They're not the same thing; TPM is also used for other purposes, for instance by Microsoft in Windows, and by other manufacturers as well. So you have to activate both, and you have to reset both in case somebody used them before and has taken ownership. Once the operating system is booted, you can use tpm_takeownership to take ownership, and when you have ownership of the chip, you have the credentials for it, and then you can put values into it and read values from it.

For the tboot installation, you use tboot as a wrapper around the kernel that you're loading, and you have to download the appropriate SINIT module from Intel; this is not available everywhere. For the OAT installation, the client goes on every host that you want to potentially have as a secure host. The authentication is public-key based, so the keys have to be transferred from the attestation server to the client. And then finally, after the necessary entries in the TPM, you have a certificate, values that identify your host to the attestation server, and you copy that data to the attestation server.

This is the configuration that you need to do in Nova to get this to work. The filter scheduler must be the scheduler that's activated; the scheduler default filters must include the trusted filter; you also have to configure your attestation server's IP and port; and you have to have the API URL and the authentication that you set in the TPM. And then when you're actually launching an appliance or an instance, you have to set this in the flavor that you're launching the instance with, as in the flavor-key example shown earlier.

So what do we have? We have Intel TXT, we have attestation, and we have a way to use it from our OpenStack cloud. It's up to you to actually use it in your environment and make our world a little bit safer. I will put this online so you can read up on the details, and I'm also planning on putting up a blog post on how to do a step-by-step installation for TPM. I have not done that yet due to workload, but it is going to come. And, yeah, can I answer any questions? I think we may be out of time. We're probably out of time, yeah. So we'll be here for a few minutes; if anyone would like to ask anything, please come forward, and we'd be happy to answer whatever you may want to ask. I'm sorry we've run out of time. Thank you very much. Thanks. Thank you.