Okay, so my name is Brian Kelly and I work at Microsoft. I had the opportunity to speak here yesterday with Ron, Nate, and Elaine on open-source firmware, and in having that, I get a little bit of a retry today in this presentation. I realized that some of you may be unfamiliar with Open Compute and the efforts that are going on there. So I'm going to talk about Project Cerberus today, which is focused on hardware security, which is appropriate for this audience. But I also want to diverge a little bit and talk about Open Compute, and give you some backdrop that's relevant to why and how Project Cerberus came about.

So the Open Compute Project (OCP) was founded in April 2011 by Facebook, and it made some good contributions and great forward momentum in establishing Open Compute. Microsoft joined in 2014, and when we joined, we made a contribution of an Open CloudServer system, which was a 12U chassis; you can see it there in the top image. That was optimized for density, and when we made that contribution, the design of it was complete: it was a product that we had been using for some time and contributed to Open Compute. Then in October 2016, Microsoft announced Project Olympus as part of Open Compute, and that was a design which was not yet complete; it was about 75% complete. We tried to do a thing with this design whereby we wanted to foster community feedback, similar to an open-source project: you start it up, it's not yet finished, you get contributions, and you come out with a great product. Project Olympus was a try at that, and the community really helped to steer the design and take us that extra 25%. So the feedback was incorporated into the design. All the manufacturing collateral for the product was open sourced, and then later that was followed with open firmware: open UEFI, OpenBMC, part of the PDU and plug-strip firmware, and an open rack manager.
And that stuff is all out on GitHub, along with the board files, schematics, and manufacturing collateral. On the open-source momentum: that first Project Olympus design was an Intel x86. What quickly followed behind that was a lot of other manufacturers taking the same form factor and design and giving more building blocks, and then more building blocks followed that, including flash storage, hard drive storage, and GPU and PCIe expansion. So all of this hardware contribution and firmware contribution was great, but there was one thing that was missing: what about security? And if your Greek mythology is good, you could probably tell from the three-headed puppies what that led to. That was our establishment of Project Cerberus. And not only the announcement of Project Cerberus: around the same time, we established a security forum in OCP. The goal was to take typically proprietary hardware security implementations and open them up, open up the design and the architecture of how we do hardware security, and drive that forward with the community.

So I have a quote here from somebody you've probably read about, Alfred Charles Hobbs. In 1851, he was a locksmith and American inventor, and he came under great criticism for publicly demonstrating how to pick some of the most secure locks of the time. His response to all this criticism was that "rogues are very keen in their profession, and know already much more than we can teach them." This was against the attitude of the time, which was security through obscurity. He maintained that we can improve security by being open about our security and taking feedback from others. So the Open Compute Security Project was announced in February 2018, so it's relatively recent.
Microsoft and Google were selected as the co-chairs, but we have many, many companies that join weekly and contribute expertise and engineering time, making our lives a little bit easier. It is a community focused on advancing platform security as a whole. So that's really the intro to OCP, just to let you know a little bit about it. Now I want to switch gears and circle back to what I had originally intended on presenting, which was Project Cerberus specifically.

So in the cloud, or as a cloud provider, we have a different security threat model, or different threat vectors, compared to maybe traditional enterprise or client. In the cloud, you're essentially leasing VMs to companies but still maintaining ownership of the hardware. Those customers' VMs may be on different hardware at different stages or different times in their life cycle. So it's really: how do we protect the persistent security state of that physical device as it transitions through its life cycle? We have many different threat vectors. We've got customers who may be compromised, with malicious software running on their own systems that tries to spread to the cloud. We have people with malicious intent who pose as customers; that happens with both enterprises and regular consumers. There are insiders from within the company; every company is exposed to the same thing, a rogue technician or somebody with bad intentions who tries to penetrate. How do we add security in depth and ensure that the damage they can cause is minimized? We've also got supply chain threats, system integrator threats, and of course, manufacturing threats. Project Cerberus is focused primarily on firmware security and the attack surface of firmware. What is that attack surface?
Well, all driver and firmware interfaces; access to flash during boot; the OS firmware interfaces that are exposed; and in a hypervisor environment, if you're providing direct access to peripherals or the platform itself, there's exposure there. With firmware in particular, unlike upper-level software, you don't have a lot of malware detection for anything running at that low hardware level. Recovery, if there are compromises there, can be challenging, as compromised firmware can disable recovery interfaces or choose to completely ignore them. And of course, if you get compromised, it can result in bricking, loss of the asset, loss of data, and so on. But there is some good guidance and hope through a NIST standard that was published about a year ago initially and then ratified a little bit more recently: NIST SP 800-193. It focuses on three pillars: the protection of firmware, the detection of corruption or unauthorized access, and then, of course, the recovery. With these guiding principles, we align our security of platform firmware and the platform as a whole. And before we could do that, we had to take a look at where we were at and the current state of the industry.

So the typical enterprise server in the industry looks maybe a little bit like this. You've got your CPUs, you may or may not have a baseboard management controller, and then you've got a bunch of peripheral cards that can plug in; some of those peripheral cards may be more powerful than the host CPUs themselves at certain workloads. As for the security state: for the base firmware, UEFI, there's some limited protection there, secure-boot-like functionality, and there's measured boot through the TPM, but the detection and the recovery are not really all there. It doesn't have complete coverage, and it's very platform dependent how these things are implemented.
The BMC, of course, is typically not secured: no protection, detection, or recovery ability, and no attestation. And the rest of the peripherals that you'll find inside the enterprise platform follow the same suit. So that brings us to Project Cerberus. What is Project Cerberus? Well, it's a set of requirements around platform power sequencing and when and where and how to establish trust. It's also a set of requirements around firmware integrity: how to verify it, how to measure it. And then it's a chip that implements and enforces all of those things. So the Cerberus root of trust implements the guidelines from NIST SP 800-193. It's a microcontroller that enforces digital signatures on firmware for components that don't necessarily intrinsically have any, and we'll get to how it does that in a little bit. It provides protection to not only the platform firmware but also the peripherals that get plugged into the platform: so your BIOS, your BMC, and all your PCIe add-in cards and whatnot. Of course, it's CPU and vendor agnostic. When we go back and look across those vendors in the Project Olympus system architecture, there are multiple CPU providers and multiple hardware manufacturers with inconsistent or mismatched security. This puts everybody at a common standard.

So what is the Cerberus ASIC? It's a security microprocessor with internal secure memory and flash. It contains typical accelerator blocks: SHA, AES, a random number generator, and a public key engine for public-key acceleration and a lot of the functions and key derivation that you would typically do with public keys. It's got some eFuses for its own hash, or measurement, of the public key that we use to verify the firmware loaded onto it. It's got a physically unclonable function, or PUF, for some additional entropy. And it's got the Device Identifier Composition Engine, DICE, that is part of the TCG.
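To make the DICE mention concrete: the idea is that the device's identity is derived from both a unique device secret and a measurement of the first firmware it runs, so a firmware change produces a different identity. The exact derivation is defined by the TCG DICE specification; this HMAC-based form and the function name are illustrative assumptions, not the Cerberus implementation.

```python
import hashlib
import hmac


def derive_cdi(uds: bytes, first_mutable_code: bytes) -> bytes:
    """Illustrative DICE-style Compound Device Identifier derivation.

    uds: the Unique Device Secret burned into the part.
    first_mutable_code: the first firmware loaded after the ROM.

    Binding the identity to a measurement of the firmware means any
    firmware change yields a different device identity, which is what
    makes the derived keys attestable.
    """
    measurement = hashlib.sha256(first_mutable_code).digest()
    return hmac.new(uds, measurement, hashlib.sha256).digest()
```

Downstream keys (e.g. an attestation key pair) would then be derived from this CDI, so a tampered firmware image can never impersonate the identity of the good one.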
It also has a special interface that allows it to be coupled with a CPU or component that doesn't have any security intrinsically in it. That is an interface we designed to work with SPI and QSPI, allowing the microcontroller to be plugged in seamlessly without the host CPU even knowing it's there. And of course, it's got all the physical anti-tamper as well, to protect the secrets that it generates.

So, the interposer interface and how it actually works. Your typical processors have a boot ROM; they boot up and read in some additional instructions from flash. What happens with what's on that flash is really important: a lot of those processors will just read in whatever's there and go and execute it. So to ensure that what's on the flash is signed, is of good integrity, and is what we actually want to be there, we interpose this Cerberus microcontroller in between. I talked earlier about the properties of Project Cerberus: it's a bunch of specifications on what processors must meet in order to be considered secure, or Cerberus compliant. If they're not, then they get this microcontroller interposed between them and their firmware load store. What does this microcontroller do? All firmware that's on that flash is authenticated before the CPUs are taken out of reset. The Cerberus microcontroller stays in line: all firmware that's read in by the CPU is measured, and all SPI transactions are filtered. A common platform design is that you will take a SPI flash chip, or a NOR flash chip, of a given size: 32 megs or 16 megs or 8 megs. A firmware image might only be four, might only be two. So with a typical secure boot, it's only going to measure the firmware that it's reading in to load, but you've got a lot of flash that's like a little black box, a blind spot to your system.
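The in-line filtering idea can be sketched roughly as follows. This is a minimal illustration, not the Cerberus design: the class and field names are made up, and the real filter is a hardware peripheral working at bus timing, but the policy it enforces, allow reads only of verified firmware regions and deny everything else, looks like this:

```python
import hashlib


class SpiFilter:
    """Illustrative model of an interposer between host and flash.

    Reads are allowed only inside declared firmware regions, and only
    if that region's contents hash to a known-good digest. Everything
    else (the unused "blind spot" flash) is unreadable.
    """

    def __init__(self, flash: bytes, regions, expected_digests):
        self.flash = flash
        self.regions = regions            # list of (start, end) readable ranges
        self.expected = expected_digests  # (start, end) -> known-good SHA-256 hex

    def verify_region(self, region) -> bool:
        start, end = region
        digest = hashlib.sha256(self.flash[start:end]).hexdigest()
        return digest == self.expected.get(region)

    def read(self, addr: int, length: int):
        for region in self.regions:
            start, end = region
            if start <= addr and addr + length <= end:
                if not self.verify_region(region):
                    return None  # corrupt firmware: would trigger recovery
                return self.flash[addr:addr + length]
        return None  # address falls in unused flash: denied
```

The same policy applies in the write direction: firmware regions stay read-only unless an incoming update has been authenticated first.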
So what Cerberus will do is ensure that that unused flash is unreadable, and that the data portions or firmware regions of the flash that are readable are unwritable, unless the firmware that's been sent to it is authenticated. So, enforcing those NIST principles that we talked about (I'm going to circle back to that standard quite a bit): for protection, all flash accesses are filtered through this Cerberus ASIC. It stays in line when the platform is running. It ensures that any read or write accesses out to firmware are protected, and it authenticates any firmware that's coming in. And there's a feature, which we'll get to a little bit later on, called a platform firmware manifest. In the cloud, software and firmware are continuously updated. As you manage this large fleet, you have to be dynamic in rolling updates seamlessly throughout the fleet. So at any one point in time, a machine could be at the latest and greatest firmware, but the following day, when you spin a new firmware, or whenever your new firmware comes along, your fleet is at yesterday's version, essentially N minus 1. The platform firmware manifest allows us to give known-good firmware versions, or measurements of known-good firmware versions, to the Cerberus microcontroller, and only those versions can run on a platform. What that gives us is the ability to maintain a good state, and it's kind of like a soft anti-rollback and roll-forward feature without the need to go and blow OTP fuses. For the detection mechanism, the Cerberus ASIC of course has secure boot: it does its own secure boot and attests to its own measurements. It'll also go and measure the firmware that it's supposed to be protecting, for the device it's supposed to be protecting, and include that in its measurements. And then recovery, which was the other principle of NIST SP 800-193.
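The platform firmware manifest behavior just described, an allow-list of known-good measurements plus soft anti-rollback, can be sketched like this. It's a minimal model under assumed names (`PfmEnforcer`, `load_manifest`, `authorize` are illustrative, and the real manifest is a signed binary structure, not a Python object):

```python
import hashlib


class PfmEnforcer:
    """Illustrative allow-list of permitted firmware measurements.

    The manifest ID is monotonic: an older manifest can never be
    loaded again, which prevents replaying a stale allow-list and
    gives soft anti-rollback without burning OTP fuses.
    """

    def __init__(self):
        self.manifest_id = -1
        self.allowed = set()

    def load_manifest(self, manifest_id: int, allowed_digests) -> bool:
        if manifest_id <= self.manifest_id:
            return False  # replay of an old manifest: rejected
        self.manifest_id = manifest_id
        self.allowed = set(allowed_digests)
        return True

    def authorize(self, image: bytes) -> bool:
        # Only firmware whose measurement appears in the current
        # manifest may run on the platform.
        return hashlib.sha256(image).hexdigest() in self.allowed
```

Note that because the manifest is a list rather than a single version number, a fleet mid-rollout can legitimately run N and N-1 at the same time, and retiring a version is just publishing a new manifest without it.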
The recovery is policy-based in Cerberus. We have bare-metal recovery: if the system is off, we can do recovery. Irrespective of the power state of the platform, we're able to recover firmware images at any point in time to a known-good image. As I mentioned earlier, flash access is protected by Cerberus, and we have automatic recovery flows should any corruption occur, whether it's a bit flip or an attempted malicious attack. We can, of course, rectify that with an automatic workflow from within Cerberus.

Now, one of the reasons we went down the Cerberus path is that a lot of folks, or hardware suppliers, are very focused on an individual product. If you're a CPU manufacturer, you're worried about the security of the CPU, and that's it. But we take a whole system together, and we're worried about all of the components, how they interoperate, and the state of the platform when it's booted and when it's running. When we take into consideration typical secure boot, you're going to read in some option ROMs; anything in the boot path, you're probably going to measure. But what about stuff that's not in the boot path? Accelerators, GPUs, microcontrollers that are out on all of these different components running in the system, some of them more powerful than a host CPU. So with that, the Cerberus architecture was made hierarchical, in that we wanted a scalable architecture where we could have a single entity on a platform attest for all of the other components in that platform. We wanted to be able to access it before we initialized the platform and during initialization; an active root of trust should be available all the time. I should be able to ask it, hey, what's the state of firmware? Firmware gets updated at runtime, not only during boot, and when firmware gets updated at runtime, measurements change. I want to know the state of the platform at any point in time.
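That "ask it for the current state at any time" idea amounts to keeping a running measurement plus an event log, in the style of a TPM PCR extend. A rough sketch, with the caveat that the actual Cerberus measurement format is defined in its specification and this class is only an illustration:

```python
import hashlib


class MeasurementLog:
    """Illustrative running measurement plus event log.

    Each firmware load or update extends the composite value as
    new = SHA-256(old || digest), so the final value depends on every
    measurement and the order in which they occurred. An attestation
    agent can fetch the value and the log at any point in time.
    """

    def __init__(self):
        self.value = b"\x00" * 32
        self.events = []

    def extend(self, description: str, firmware: bytes):
        digest = hashlib.sha256(firmware).digest()
        self.value = hashlib.sha256(self.value + digest).digest()
        self.events.append((description, digest.hex()))

    def quote(self):
        # Current composite measurement plus the log explaining
        # how it was reached.
        return self.value.hex(), list(self.events)
```

A runtime firmware update simply extends again, so the composite value changes and the fabric can see that the platform's state moved, which is exactly why an always-on root of trust is more useful here than a boot-time-only check.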
So that brought us to this master-slave hierarchy throughout the system. Components that didn't meet those standards would have the ASIC interposed. Components that did intrinsically support them fitted in by default, because they would also conform to the attestation protocol. For the platform-level attestation, as I mentioned, the single measurement, we extend. Using some acronyms here: platform firmware manifests (PFM), component firmware manifests (CFM), and component device files are all acronyms defined in the specification, but what these stand for is essentially manifests of measurements of permitted firmware per device. There are measurement logs that come out of the Cerberus platform ASIC, and then, of course, there's certificate sealing for machines based on their ability to attest. So as Cerberus attests to the firmware, machines that are in a good state get essentially a cookie in the fabric.

So, a little more about the Cerberus security controller: it enforces the guidelines from the NIST standard; it is a small microcontroller; it's also a bunch of platform specifications; it's a hierarchical root of trust with a topology that provides attestation for all firmware; and it's an open design. We've opened up the specifications, and more of the collateral for Cerberus will follow. If you want to know more about it, if you want to participate in it, we encourage you to join the Security Project in OCP. And with that, I'll take maybe some questions, and I'll keep this on time.

Yeah, sorry, a couple of things. First one: you talked about recovery options, and then talked about an automatic recovery workflow on attestation failure. Is there any path to provide reporting of that if it takes place?

Yeah, absolutely. One of the things with it is that it's alive and running all the time. It doesn't boot and then go to sleep, or boot and load and go off and do something else. This TCB is active all the time.
So as we decide to go and measure, or ask it for measurements, it'll provide you the most current measurement. In addition to that, when anything changes, it will raise an alert. There's an interrupt that will come from the ASIC and go into the platform to raise an alert to the attestation agent, and that will play back into the fabric. So for any changes that are unexpected, we get notification.

And the PFMs: I did have some backup slides to go into a Cerberus deconstruct, but I was running a little over time. What I will do is jump down into a key element of it, which is the platform firmware manifest. Essentially, in our build system, when we build firmware, it spits out a manifest. The manifest, of course, comes out in this human-readable form, if you consider XML human readable, but it eventually gets built into a binary list that lists all firmware that's applicable for that platform. So you could have firmware versions 1, 2, 3, skip 4, then 5, 6, 7, and they might be fine to run on any given component. Those lists are monotonic: you flash one once, and it can't take a previous list, so there's no replay on the list. It allows us to keep the fabric at different states of firmware, because it's always transitioning; you've got a tail end that might not get updated yet, that might be on the last version of firmware, as you're going through the fleet to update everything to the newest and change those attestation policies. So by taking essentially part of the attestation policy and making this root of trust responsible for the enforcement of it, it makes that higher-level attestation a lot easier to manage.

And the other thing is, you used the word tamper-proof. Can we please never use the word tamper-proof? I mean, 800-193, and I checked, talks about resistance to tampering. And it may just be me, but I hate the word tamper-proof, because tamper-resistance is all good, but please, never tamper-proof.
Yeah, yeah, good point. I just hodge-podged that together yesterday after my talk, when I realized that I might have been starting at a level in Cerberus, going into the deconstruct, without too many people being familiar with Open Compute and what exactly Cerberus was. But point well taken.

Is it possible to use a certified secure element with Cerberus?

I'm sorry?

Is it possible to use an EAL-certified secure element?

Oh, a secure element inside Cerberus, or instead of Cerberus? I see. So, yes, the ASIC has a secure element, which gets back to my friend's comment here about tamper-proof compared to tamper-resistant, but it's got a secure element inside for storage of the keys and the entropy that it generates.

Okay, but it's not certified?

Yes, it's not certified. At this point in time, it's not certified, yeah.

And what kind of security protections do you have?

Pardon?

Do you have side-channel protection?

Yes, side-channel protection.

Glitching and so on?

Yes. So, it has side-channel protections. The thing about the certification, and why it's not pursued yet on the device: the certification actually takes quite a bit of time. We announced this in October of last year. The certification, too, depends on where you are in the world; different things are acceptable. In Europe, you've got Common Criteria. Over here, you have FIPS. In China, you've got a whole different ballgame. So certification is something we're looking at through Open Compute, but it's still really TBD what direction it will go. Great question.

I guess looking at smart card designs might be interesting.

Yes. Another question?

So, when your chip is reading the firmware of the other chips to make sure they're valid, is it just basically asking them, hey, can you tell me what's in me? Could another compromised chip lie about that?

No, it's not. It's actually mastering the bus at that point in time, so it stays in line on what gets read.
So, all SPI transactions don't complete as you would think. The ASIC interposes; if I go back, it might be easier to explain through the drawing that I rendered earlier. Here... where did we go? I went back a little too far. But the host processors cannot access that flash directly. They think they're accessing the flash directly, but they're actually accessing the ASIC. And flash, as you know, is like a memory interface; it's not transactional with a payload. So it has to happen in really tight time. The ASIC has to make a decision on whether it allows the host processor to access firmware from that device, and there's a couple of techniques that we use to achieve that. One, of course, is that we have a special hardware interface that provides that bitstream kind of filtering. And the other one is a little bit of a honeypot design. The SPI interfaces on some processors will go into a crazy control loop, and then assert and do all kinds of stuff, if they don't get a readback over SPI within a certain time, or if they don't read back the data they expect. So for that, the ASIC is able to essentially honeypot it: let it think it's doing one thing, but do another. I don't know if that answers your question or not.

An attack I've heard of in the past is that you can't really trust any device to tell you what's on itself. But it sounds like, in the case where there's flash separate from a processor, you can do that. But are there cases, like where a chip has its own internal storage, that you try to verify?

Yes, there are. For internal storage, you can't interpose the Cerberus ASIC. So it's a case of either working with the manufacturer to ROM that internal storage, make it not updatable, and read externally (usually it's pin-strapping to change that boot path), or to adopt the security requirements that we put inside Cerberus.
But every component has to be analyzed pretty much at face value. It's a lot easier, when the component doesn't have intrinsic security, to interpose an ASIC on the outside of it, as opposed to going through a development cycle and having that functionality built in. In other words, it's easier to change a PCBA than it is to change a substrate; the time is going to be a lot less. Time to market, that is, which is also key in the cloud.

Right at the back, we've got another question.

You briefly mentioned the use of a PUF for additional entropy. What kind of PUF is that? Is it delay based, RAM based?

It's SRAM based.

Is it also resilient against known PUF attacks? There are quite many of those PUF attacks.

Yes. So that memory that we have inside for the PUF, we actually have as part of the secure element. There's a mesh over it; we drive current through it, and if we detect a change in the state of the SRAM, we detect that there's a tamper there. We also have side-channel countermeasures in there too.

So you're interposing on the flash accesses at the moment. Have you looked at any other buses, say LPC?

We have considered that. The thing is, to hit the timing, it's not done in firmware or software at all; it's a new hardware peripheral, essentially, to interpose in there and keep the same timing on the bus. You can't achieve it with software and DMA chaining; your setup times and DMA just make you have to run that bus really, really slowly. So to keep up with performance, and boot performance, it's a hardware interface, but we are looking at other hardware interfaces and being able to interpose.

Any more? I guess it's lunch, right?

We've actually got one more session.

One more? All right. Which it's time for now. All right, thanks a lot, folks.