I'm Tim Pletcher. I was at Cray for a little over two years, and HPE acquired us during that time. I was the software security architect for the Shasta systems; there were a few of us who were systems architects there, and my remit was the access control framework that currently sits over the Shasta systems. I've since moved over to the security engineering team at HPE, where I work with Dan and a bunch of other folks from Scytale, and I'm still involved with the HPC side of the house in a few dimensions. While I'm the one up here, there are a whole bunch of other folks behind the scenes who worked on this particular dimension of the implementation, Zach Risler, Kevin Burns, and others, so it's not a one-man show by any stretch.

I'd like to start off by talking a little bit about the Shasta systems. If you're not familiar with HPC, one of these is basically a giant liquid-cooled computer at its highest end. While these used to be vertically integrated machines, they've evolved into commodity compute resources pulled together by a blazingly fast network interconnect. So you'll see a lot of commodity hardware in the compute nodes themselves, but our secret sauce at Cray is the ASIC and the software that drives the network, along with a few other things in the programming environment and the kernel-module network optimizations on the compute nodes. These machines are capable of over an exaflop of computation, and I always like to write that out as 10^18 because it makes me think about how fast, and how many, numbers these things are crunching at any one time. Historically these machines scale to tens of thousands of compute nodes, and different installations have different node counts.
Right now, the Slingshot network that runs a Shasta system is 400 gig, going to 800 gig with the next generation of switches and cards coming down the road in a few years. It's a Dragonfly network topology, which means the nodes are never more than a fixed number of hops away from each other, and that allows for a highly consistent and manageable high-speed network. There are truckloads of storage behind it. So, yeah, they're just crazy, right? They're data-center-sized computers. They consume power measured in megawatts, and it's kind of funny when you think about it, because there are scenarios where a machine will have its own substation, or you have to be careful about when you run it because it'll brown out the neighborhood, or the city. So they're really neat machines.

One thing I'd say about the Cray systems: our system management model is going to look fairly familiar to a lot of you, or all of you. We basically have a user access layer that deals with access into the compute plane itself, provided either through a container instance running on the management plane or via nodes that run off the Kubernetes management plane but still have access into the API plane to run administrative functions. There's a bunch of functionality around the compute nodes themselves: you need to manage images, boot orchestration, configuration management for the nodes, and so on. Then there's the network management side. There are obviously two networks, the high-speed network and the management network, which is a fairly standard configuration, and you've got all the interaction that goes on there. Hardware management is as you would see in a lot of large data center operations.
You've got to deal with power management, all the BMC endpoints, firmware, the whole nine yards, and then, of course, there's security: personas and non-person entities all have to be accommodated in the authorization, authentication, and key management context. So how does that work? The Cray Systems Management (CSM) software is, I would say, a fairly generic and vanilla CNCF-style implementation, and that was by design. We run on Kubernetes, we use Istio, we have Vault, we have cert-manager. You'd take a look at any of this and be like, oh yeah, of course, that makes perfect sense if you're going to run an API plane in front of a big machine like this. If you look around the side there, there are a whole bunch of different networks present in this system. They're VLANs, basically: they deal with the hardware management side to get to the BMCs, they deal with the API plane side to get to the services that need to run, and they deal with the customer-facing or user-facing side, whether that's for management or for high-speed network access. And we back it all with dedicated Ceph hardware and software.

So along the way we had a Gen 1. When you think about how you need to transit the API plane from a compute node, the side where you're going to have administrative traffic in and out is fairly straightforward. I say fairly straightforward because our access control framework is standalone: we ship Keycloak with this thing, and you can stand the system up and run it completely in an air-gapped environment by itself, without it talking to anything else. That was one of the original requirements, obviously, because of where these machines run. So our story around basic API interaction as an individual is pretty straightforward and good.
We use OPA to deal with the authorization topic, Keycloak issues standard OIDC tokens, and away you go. Where that breaks down is that there are applications running on the compute nodes that deal with platform operations. Our Gen 1 implementation for those was to use what Keycloak calls a service account, which is effectively just issuing a long-lived OIDC token and handing it out. And it was a horrible implementation. I'm still maybe a little bit embarrassed by it, but we had to start somewhere, and we knew we were going to have to build something specifically to accommodate this.

As it turned out, Scytale and Cray were acquired at roughly the same time. I started interacting with Sunil and Emiliano, and it became pretty clear pretty quickly that while we were very familiar with SPIFFE because of our use of Istio, we hadn't really considered SPIRE; it wasn't on our radar. After a little bit of internal discussion, the aha moment came, the dots connected, and we realized we didn't have to build anything. We really just needed to pick up SPIRE and get going. And so we did, and it was a great collaboration. It was probably one of the easier implementation cycles I've been through: it took less than 90 days and we were up and running in the system, and that included the whole nine yards of applications talking across the API plane.

So this is where we sit today with the Shasta CSM. As you can see, Keycloak is there for the individual users, and the non-person entities are covered by SPIRE. Today we use join tokens to attest the compute nodes, and I'm going to talk a little more about that as we go forward. Then the SVIDs get issued and away we go.
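To make the join-token idea concrete: a join token is a single-use, short-lived secret that bootstraps trust in a new agent. Here's a toy Python sketch of that flow; it's illustrative only, not SPIRE's actual code, and the SPIFFE ID is made up.

```python
import secrets
import time


class JoinTokenIssuer:
    """Toy model of join-token node attestation: the server mints a
    one-time token, the agent presents it, and the server maps it to
    a SPIFFE ID exactly once before it is burned."""

    def __init__(self, ttl_seconds: int = 600):
        self.ttl = ttl_seconds
        self._tokens = {}  # token -> (spiffe_id, expiry timestamp)

    def generate(self, agent_spiffe_id: str) -> str:
        """Mint a fresh single-use token bound to an agent identity."""
        token = secrets.token_hex(16)
        self._tokens[token] = (agent_spiffe_id, time.time() + self.ttl)
        return token

    def attest(self, token: str) -> str:
        """Redeem a token; pop() makes it single-use."""
        entry = self._tokens.pop(token, None)
        if entry is None:
            raise PermissionError("unknown or already-used join token")
        spiffe_id, expiry = entry
        if time.time() > expiry:
            raise PermissionError("expired join token")
        return spiffe_id
```

The weakness Tim alludes to is visible even in the toy: whatever distributes the token to the right node at the right time becomes the real root of trust, and building that distribution mechanism securely is the hard part.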
The applications are then issued an OIDC token from the SPIRE server, and as the API calls come through the gateway they're evaluated appropriately to point them at the right issuing server. So this is our world, and again, it's a fairly straightforward and standard implementation in most respects, with SPIRE making our lives a lot easier.

So where do we go from here? I'd say the state of the access control framework is good for a 1.0 implementation, but you can always improve, and we will seek to do that. The big place where we have a challenge is node attestation in the compute plane. When the original specifications were cut for these Shasta systems, TPMs were not specified on the compute blades, so we find ourselves in an awkward position: they will be there going forward, but we have fielded machines that do not have them. So objective number one is to get a better story than we have today with the join token.

We also really like the community work going on to get SPIRE and Istio talking together. We already have a central PKI issuer system running behind the platform, and we're in the process of putting an operator in place to roll the certs from that issuer for Istio. It would be nicer if we could just bolt SPIRE into that task and call it done. That would be awesome. The other thing we're looking forward to is API-driven workload registration: as platform components come online, the ability to register them automatically, as opposed to manually with a config file, which is what we do today, will be a good upgrade for the engineering teams. The other topic we're looking forward to is federation. There are certain components in the platform that are not baked into the CSM software itself.
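Whether registration entries arrive from a config file or an API call, the end state is the same: a mapping from selectors to a SPIFFE ID. A minimal sketch of that matching logic, with hypothetical service names that are not CSM's actual components:

```python
class WorkloadRegistry:
    """Toy selector-based workload registry in the spirit of SPIRE
    registration entries. A workload is granted an identity when all
    of an entry's selectors are satisfied by what the agent observed
    about that workload."""

    def __init__(self):
        self._entries = []  # list of (frozenset of selectors, spiffe_id)

    def create_entry(self, spiffe_id: str, selectors) -> None:
        """What an API-driven registration call would boil down to."""
        self._entries.append((frozenset(selectors), spiffe_id))

    def match(self, observed_selectors) -> list:
        """Return SPIFFE IDs whose selectors are a subset of what was
        observed for the calling workload."""
        observed = frozenset(observed_selectors)
        return [sid for sel, sid in self._entries if sel <= observed]
```

The upgrade described in the talk is only about how `create_entry` gets invoked, automatically as components come online rather than by hand-maintained config, so the matching side is unchanged.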
So one of those would be the fabric controller. The Slingshot software is a standalone system; it could run without CSM. But the ideal scenario is that its access control framework can be federated into ours, and we'd do that through SPIRE. That has been proposed; I don't know where that team is on it. It would take advantage of the federation capabilities in SPIRE, which are excellent and pretty straightforward to implement.

So, node attestation. The astute observer might have noticed the use of join tokens. Join tokens are not ideal: they require you to create an issuing mechanism, and it can be challenging to get that mechanism to the security level you'd really like. We knew going in that this wasn't going to be our ultimate end state; it just is what it is. So we started work on the next phases, and I think we know where we're going to end up in the short to medium term, and then in the longer term, while the TPMs aren't in place or aren't available.

So let's look at this. Cole was kind enough to steal my thunder on TPM attestation, so I'm going to skip through this one quickly; you'll look at this diagram and see something very similar to what he presented. It's basically the flow to do TPM-based attestation from the compute node, so we'll pass over it for now. The next step for us is probably going to be a move to x509-based node attestation. With that, we end up injecting a certificate into the compute node at boot time. Recall that these are diskless machines for the most part, so we have a process where we can inject that at boot, in the initrd phase, and we do that with other payload components today.
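The certificate-plus-nonce handshake this leads into is a classic challenge-response. Here's a deliberately simplified, stdlib-only sketch: real x509 node attestation verifies the injected certificate's chain and a signature made with its private key, whereas here an HMAC over a shared per-node secret stands in for that signature, an assumption made purely to keep the sketch self-contained.

```python
import hashlib
import hmac
import secrets


def issue_challenge() -> bytes:
    """Server side: generate a fresh nonce so a captured proof
    can't be replayed later."""
    return secrets.token_bytes(32)


def prove_possession(node_key: bytes, nonce: bytes) -> bytes:
    """Node side: prove possession of key material tied to the
    certificate injected at boot. (Real attestation signs the nonce
    with the cert's private key; HMAC is a stand-in here.)"""
    return hmac.new(node_key, nonce, hashlib.sha256).digest()


def verify(node_key: bytes, nonce: bytes, proof: bytes) -> bool:
    """Server side: recompute and compare in constant time before
    issuing the node its identity."""
    expected = hmac.new(node_key, nonce, hashlib.sha256).digest()
    return hmac.compare_digest(expected, proof)
```

Only after `verify` succeeds would the server mint the node's SVID from the intermediate it holds.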
And so this will just end up adding the xname certificate into the mix, and then we start down the process of cert verification: generate the nonce, go back and forth, and you end up with the SPIFFE ID issued using the intermediate that the SPIRE server holds, which it has acquired from our PKI-as-a-service platform. So that's up next for us.

We're also looking at ACME IP-based attestation. This is another avenue we can take. It's still not where we'd like to be relative to the TPM side, but while we do the best we can without TPMs, this is something we've started to look at. We're watching the step-ca ACME server project; we hope it gets there pretty quickly, and then we'll take a gander at it.

So that's SPIRE in a supercomputer, in a nutshell. We want to leave time for questions if anybody has any. We appreciate the chance to share the experience we've had. I'll say again: if you haven't started working with SPIRE, it truly is a Swiss Army knife, and it becomes more and more of one as the plugin universe gets bigger. I suspect it's going to play prominently in a lot of platform engineering going forward. We're excited about it, and we really appreciated the timing that allowed it to come into the picture for us. With that, I'll take questions. Fire away. Line up.

Have you considered other hardware-based attestation methods, like smart cards? Not really in this context. The scale at which you'd have to do that would, I'd say, be challenging. Oh, tens of thousands. The fleet management problem is real in the large footprints with any of this type of thing. So we want to find the way, in the short term, that is most manageable for the system administrators, because they have a big job and these machines run a lot, so we want to keep that overhead low.
With respect to the TPMs, have you seen any other ways to manage those at scale? We are working on some things internally at HPE around that. To me, that's the killer app right there. Historically, TPM operations, when you mix them in with anything related to fleet management or maintenance events, are just really painful, which is why a lot of people don't do it. In the context of security engineering, especially as we start to really focus on hardware-up attestation, that is going to have to be dealt with. When you look at some of the other things going on with platform certs and SPDM, the need is coming for every component in a box to be validated, so I suspect this is going to come to the forefront more and more, and we do have people looking at that problem internally.

We've got a couple of online questions here. I've got one over there too; can I come to you right after? This one is from Richard in our virtual audience. Richard asks: can you speak a bit to what types of scale you've seen required for large node boot events? I will say that it's improved. There's one scenario that dings us on the boot cycle. We've run up to, I want to say, close to 8,000 nodes at this point, give or take, and one of the applications that runs in our world is a heartbeat mechanism. What happens is that when you run through the boot cycle, all these things start heartbeating almost immediately, and it's a thundering herd problem. We see supercomputers have thundering herd problems in a few different areas, so we actually end up turning the heartbeat off during the boot cycle, and I think we're meeting most of our targets for boot timings today. It is a contractual requirement in the supercomputing world to boot at a certain speed, so it's prominent in the discussion.
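The mitigation described, suppressing heartbeats during the boot cycle, is often paired with jittering the first reports afterward so the herd doesn't stampede the moment boot completes. A hedged sketch; the function and its parameters are hypothetical illustrations, not CSM's actual heartbeat service.

```python
import random
from typing import Optional


def next_heartbeat_delay(boot_complete: bool,
                         base_interval: float = 30.0,
                         max_jitter: float = 10.0,
                         rng=random.random) -> Optional[float]:
    """Seconds until this node's next heartbeat, or None to stay quiet.

    Heartbeats are suppressed entirely while the boot cycle is still
    running (as described in the talk); afterward each node adds its
    own random jitter so thousands of nodes don't all report at the
    same instant. `rng` is injectable for testing."""
    if not boot_complete:
        return None
    return base_interval + rng() * max_jitter
```

Spreading the first post-boot heartbeats across a jitter window turns one synchronized spike into a roughly uniform trickle, which is the general shape of most thundering-herd fixes.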
Did that answer your question? It was an online question, so I believe so. We also have a comment from Anne, who I think might know you. Tell her I said hello. She would like to liaise with you following the talk. I'm passing that on now.

Another in-person question: I wonder if you're incorporating SPIRE into the user production workloads that you're running there, and how that may tie in. Not today. A good way to think about these machines is that they're big IaaS implementations. From an analogy perspective, it's just: boot up, here's your VPC, and then the workload manager software comes in and dispatches applications into the compute plane for runtime. That's not to say we haven't had requests from customers for PaaS-type services. As you see more modern approaches to running jobs in the compute plane itself, you'll start to see SPIRE make its way in there, but we don't provide that as a service in the core platform today. This is all platform-based; think of it as behind-the-scenes platform operations.

[Audience follow-up, partly inaudible, mentioning Slurm and other schedulers and interest in tying this into Lustre.] Yeah, absolutely. I'd be surprised if there wasn't interest from some of the labs community around this as well. Some of the labs expose these machines to a lot of researchers who come in and do their thing, so secrets management has been a topic of discussion, and we've said, all right, these are all viable; once we get a little more mature in the platform, we can start to look at the PaaS topic. There's also the topic of some type of Git service.
So they could do GitOps up in the compute plane with whatever they're going to run. I think it's only a matter of time before that starts to show up. And you're starting to see different approaches to workload management come into play, right? You have Singularity containers, and I think it's going to change across the board in HPC.

Any further questions? In person or virtually? Going once? Going twice? Sold. Thank you so much, Tim. That was awesome. Thank you. How cool was that?