Thanks for having us. We're with the NASA Center for Climate Simulation, based out of Goddard Space Flight Center in Greenbelt, Maryland. I'm the one in the middle of the slide, and this is the one at the bottom, though we actually both work remote. We're going to talk to you about our OpenStack cloud, a little bit about the science that we run on it, and then how we run that cloud. The idea to do this came from the OpenStack Scientific SIG, the Special Interest Group. The Scientific SIG is people like us who run OpenStack to support science. We got some science material from the real scientists, but full disclaimer, we're just the sysadmins, so please don't ask us questions about the science.

One of our premier customers that's getting a lot of attention lately is the Nancy Grace Roman Space Telescope. Its purpose is to explore dark energy and dark matter, to search for and image exoplanets, and to cover many topics in infrared astrophysics. Its primary instrument is the Wide Field Instrument, or WFI. It'll use 18 infrared detectors for a panoramic field of view that's 100 times wider than the Hubble Space Telescope's infrared instrument. It's supposed to launch in 2027. You're probably familiar with James Webb, which can image a narrow field at much deeper resolution; Roman will complement James Webb by providing a broad, wide field. The two together give you the ability to see a wide view and also zoom in deep.

We support this team in many different ways. We have Neutron provider networks that go back to shared POSIX file systems, where they access petabytes of data. We give them access to GPUs. With the flexibility of the cloud environment, we can build VMs in different ways to meet whatever their interests are: some of their VMs we put into SLURM for scheduling, others we hand to individual scientists for interactive testing, and so on. They use JupyterHub, which we can cloud-host pretty easily. We host databases for them, and fast data transfer nodes, so they can have a dedicated transfer node that doesn't get stomped on by other people. And we manage all of those VMs for them. We take care of the patching and the security aspects so the scientists don't have to deal with that. It's basically platform as a service, right? Our team has other sysadmins who help run those platform-as-a-service VMs.

Another one of our big customers, and again we do mostly climate science, is ICESat-2, the Ice, Cloud and land Elevation Satellite. Its purpose is to produce highly detailed measurements of the height of Earth's ice, water, and land surfaces to enable investigation of changes over time, particularly for Arctic and Antarctic ice. Its primary instrument is the Advanced Topographic Laser Altimeter System, or ATLAS, which is basically a lidar. It fires 10,000 times a second and precisely measures the time for individual photons to bounce off the surface and return to the satellite.

What we do for these folks is much the same as what we do for Roman. We are an HPC center and we have a huge supercomputer, but for those of you who work with big supercomputers, you know it's typically a really locked-down and inflexible environment, right? It's good for one kind of thing, which is huge MPI jobs. But the way scientists do their work now is so much more diverse than just MPI jobs. So we give ICESat-2 access to JupyterHub.
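As a rough illustration of the kind of per-scientist VM provisioning described above, here is a minimal openstacksdk sketch; the cloud entry, image, flavor, and provider-network names are all hypothetical.

```python
import openstack

# Connect using a clouds.yaml entry; "nccs-prod" is a hypothetical name.
conn = openstack.connect(cloud="nccs-prod")

image = conn.image.find_image("rocky8-science-base")        # hypothetical image
flavor = conn.compute.find_flavor("m5.2xlarge")             # AWS-style flavor name
network = conn.network.find_network("provider-gpfs-vlan")   # hypothetical provider net

# Boot an interactive VM for a scientist on the provider network that
# reaches the shared file systems.
server = conn.compute.create_server(
    name="roman-interactive-01",
    image_id=image.id,
    flavor_id=flavor.id,
    networks=[{"uuid": network.id}],
)
server = conn.compute.wait_for_server(server)
print(server.status)
```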
We give them access to huge data repositories, both our GPFS file systems, which are fast to write to, and a huge archive of data that's NFS-hosted and very fast for reads. That lets them do lots of analytics-type work: read from one, write to the other, and so forth. Plus data transfer nodes, JupyterHub, that kind of stuff. This is what we're being asked for, and the nice thing about the flexibility of the cloud is that we can easily reconfigure it to provide anything else, within reason, that they need, instead of the static environment of a supercomputer. Are these folks also using GPUs? We have a small number of GPUs in the cloud, yeah.

So now we're going to give you an overview of our cloud's architecture, starting with the base deployment. Like a lot of the other large OpenStack deployments, we pretty much rolled our own. We're not using anything like TripleO or Kolla or OpenStack-Ansible. We built this from the ground up, because at the time there were needs we had that weren't supported by the standard packaged deployments. Right now we're on OpenStack Wallaby, and we're using the RPMs from the RDO project from Red Hat, which is nice. We use xCAT for deploying what we call the undercloud, the bare metal and some of the VMs that host the OpenStack control plane, and then we use Ansible to push out the OpenStack deployment and the networking configs; the switches and the NetApp filers are also mostly controlled by Ansible.

Right now we're just focusing on the core microservices of OpenStack. These are largely what we need for our mission and what we've had time to get up and running. There are obviously other parts of OpenStack that could be used; we just need the time to play with them and see whether they fit our customers' needs, the scientists' needs. For the most part, we haven't been asked for anything beyond these core services. We listed the core services in case there are people who aren't super familiar with OpenStack: Keystone for authentication and authorization, Glance to manage all of the virtual machine images, Neutron for the SDN and provider networks, Nova for compute, and Horizon, the web GUI, for some of the IaaS tenants who interact directly with the VMs themselves.

All of these microservices are replicated, so we can patch OpenStack pretty much seamlessly and tolerate failures and downtime, and when we do upgrades we'll be able to do them in a rolling fashion. These services also span multiple buildings. That way, if we have some kind of network event or some other catastrophic failure, we can keep the core up and isolate the failure domain to one of the three buildings that we use.

One thing that was important to us, and that wasn't available back then in the mainstream deployment tools like TripleO and Kolla: we had a federal requirement that everything has to be encrypted. So all of our microservices are encrypted end to end, and all of our RabbitMQ buses are encrypted end to end. There isn't anything that's not encrypted, even on internal private networks that aren't exposed.
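For readers less familiar with OpenStack, here is a quick sketch of walking the Keystone catalog with the openstacksdk to confirm the core services are registered and that every endpoint is HTTPS; the cloud name is hypothetical, and listing the catalog this way generally needs an admin role.

```python
import openstack

conn = openstack.connect(cloud="nccs-prod")  # hypothetical clouds.yaml entry

# One catalog entry per core service (keystone, glance, neutron, nova, ...),
# and every endpoint URL should be https:// when TLS is enforced end to end.
services = {s.id: s.type for s in conn.identity.services()}
for ep in conn.identity.endpoints():
    svc_type = services.get(ep.service_id, "?")
    print(f"{svc_type:12s} {ep.interface:9s} {ep.url}")
    assert ep.url.startswith("https://"), f"unencrypted endpoint: {ep.url}"
```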
The use of xCAT very much comes from where our legacy is; our center was the partner that made the original Beowulf cluster available, and we've been using xCAT since 2006, I think. So when we approached OpenStack from this DIY standpoint, with tools we were familiar with and that made sense to us, we just ran with xCAT because it made sense, and then used our own configuration management to layer on the other pieces.

Here's a high-level resource overview. Our cloud is about 300 hypervisors, 8,500 cores, and 75 terabytes of RAM. We have three availability zones that map one-to-one to the buildings our cloud resides in. Each availability zone also has co-located storage, so all of a VM's storage is native to that building. That avoids the typical latency problems of reaching a NetApp filer in a different building. We run two MariaDB clusters to house all the database access: one runs the whole control plane and is shared by the two AZs, buildings A and B, which have the smaller number of computes; building C has the vast majority of our compute, so we set up a separate Galera cluster for it, just to keep things isolated.

Each of the two major zones has a different fiber drop to the NASA network, which in OpenStack terms is the floating IP network, the external network, so we can tolerate breakage. They share the same IP space, but they're tied to different buildings: lose a building, and you can fail over to the other building, which can accept the same floating range. That's what allows us to run HA services the way we do.

We modeled our compute flavors after AWS, for better or worse, to give people a familiar picture of what our flavors look like. It's not intuitive, and it leads to some of the caveats later on. We do have a heterogeneous compute environment, both Intel and AMD, and we have some GPUs, V100s, in one of the AZs for people to do prototyping instead of camping on an interactive SLURM node on our GPU cluster. The flavors are very much fractional increments of specific pieces of hardware, so that when it's all scheduled out there are no orphaned resources. It adds up; it's like Tetris, right? Yes, they largely coincide with a full node, a half node, a quarter node, and then some smaller fraction for little one-off VMs.

Okay, so our networking is a leaf-spine topology in a full Clos fabric, and every AZ is built like this. It's mostly Mellanox 100-gig networking, Spectrum-1 gear in one of the AZs, I believe. We run Cumulus Linux on the switches, which has been interesting; Bob wrote the Ansible code to configure the Cumulus stuff for us. It's all MLAG and LACP: MLAG between the switches, LACP down to the nodes, so every hypervisor has a bond0 that's either dual 10s or dual 25s, reasonably fast. We heavily segment all of our control-plane traffic into different VLANs: separate VLANs for the control plane, storage, ONTAP traffic to the NetApp filers, IPMI, PXE traffic, all of it. Even our out-of-band switch access is in a separate VLAN, so that's all broken out. Our tenants use VLAN segmentation IDs. We support VXLAN, but nobody's using it, and frankly we don't know anybody at NASA who uses VXLAN; we don't even know if the security folks would like it, so that's another thing we'd have to deal with. We do have very high-speed interconnects between our availability zones; you can see the 4x100G and 4x40G links.
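As a concrete sketch of the VLAN-backed provider networks just described, here is roughly what creating one looks like with the openstacksdk; the physical-network label, VLAN ID, and addresses are hypothetical, and this requires an admin role.

```python
import openstack

conn = openstack.connect(cloud="nccs-prod")  # hypothetical cloud name

# A provider network mapped to a specific data-center VLAN, the same pattern
# used to reach shared storage and other existing services. Values are made up.
net = conn.network.create_network(
    name="provider-gpfs",
    provider_network_type="vlan",
    provider_physical_network="datacentre",
    provider_segmentation_id=1234,
)
conn.network.create_subnet(
    network_id=net.id,
    ip_version=4,
    cidr="10.10.12.0/24",
    gateway_ip="10.10.12.1",
)
```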
Our Neutron ML2 plug-in is Linux bridge, though that's going to have to change, for those of you in the know. We like it because it's easy to use: it's really easy to debug with tcpdump, and there's nothing obfuscated about its internals. Our L3 agents live on the Neutron network nodes and run in dvr_snat mode, so those are DVR HA routers. Our compute nodes run dvr_no_external, so we don't waste floating IPs, external network IPs, on the compute nodes, and it's less of a potential security issue.

This is the topology of our cloud. The left corner represents the core data center, the NASA Center for Climate Simulation, and then you see how our other availability zones relate to it. In the middle is basically what we think of as our external network, the broader NASA networks, and how we connect to them from different places. An infrastructure-as-a-service VM would get a floating IP directly on that NASA network, but internally, of course, it has its own dedicated VLAN and its own separate IP space. Between our AZs is what you could call the cross-connect: about a kilometer or so of very fast links, 400 gig of bandwidth between those AZs. That mostly carries platform-as-a-service traffic and our high-speed storage going all the way out to the VMs.

Our storage is all NetApp filers. We have two AFFs that are all-flash, in redundant pairs at every site, and one hybrid in one of the AZs, for about 140 terabytes of total storage. We're really happy with our NetApps because they dedupe and compress at unbelievable levels. For VMs we've seen upwards of 20 to 1, which is remarkable. When you think how much storage we would have to pay for if we weren't deduping at that level and were using something like GPFS for this, our cost for storage would be way higher than it is. You can imagine how well a Linux base OS dedupes, right? You're just making copies of it forever. We also hold mirrors of all our repos on the NetApps, which helps them dedupe everything better, and it lets us keep our own internal mirrors for everything. The other cool thing with the NetApps is SnapMirror. We can set up SnapMirror relationships for important data between all the buildings, so at the NetApp level we can provide redundancy for things we know are incredibly critical, in case a disaster happens.

We use pretty standard NFS for the Glance and Nova back ends, and each of those has its own FlexVol. Cinder has a couple of different FlexVols and uses the NetApp ONTAP driver to do copy offloads and other tricks that make things faster. That's NFS version 4.2, I think, so we get the server-side copy support. Each of these also has its own Cinder volume type, which allows us, through OpenStack commands, to migrate a VM's data from one building to another. All of our VMs are backed by Cinder volumes, so when we need to move a VM, we tear the VM down, issue a Cinder retype, Cinder transfers the volume from one filer to the other, and then we bring the VM up in the other AZ. That lets us shuffle all kinds of things around. We also use enhanced instance creation, which is a feature of the ONTAP driver for Cinder: if your Glance storage is on the same filer as your Cinder storage and you're spawning a new Cinder volume from an image, it's basically a file clone, an almost instantaneous zero-copy operation, so you can boot VMs really, really fast.
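Here is a rough openstacksdk sketch of that retype-based move, assuming the VM boots from its Cinder volume; the server name, target volume type, and cloud entry are hypothetical, and bringing the instance back up in the other AZ involves more steps than shown.

```python
import openstack

conn = openstack.connect(cloud="nccs-prod")  # hypothetical cloud name

# Stop the instance before moving its boot volume.
server = conn.compute.find_server("roman-interactive-01")   # hypothetical VM
conn.compute.stop_server(server)
conn.compute.wait_for_server(server, status="SHUTOFF")

boot_volume_id = server.attached_volumes[0]["id"]

# Retype to the volume type backed by the other building's filer
# ("netapp-bldg-c" is a made-up type name). With migration_policy="on-demand",
# Cinder copies the data between back ends as part of the retype.
conn.block_storage.retype_volume(
    boot_volume_id, "netapp-bldg-c", migration_policy="on-demand"
)
```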
All of this NetApp configuration is also handled by Ansible. It controls access to the NetApps, setting up the role-based access that Cinder needs, the right scoped permissions without too much access to other pieces of the filers, which could be a security problem. It also handles FlexVol creation, SVM creation, the LIFs, most of those sorts of requirements. All the access that's needed for Cinder, yeah.

We've got eight minutes. We have a bare-metal GPU cluster with 88 V100s, and that's deployed from xCAT, but all the ancillary services for it are cloud-hosted: the login nodes, the slurmctld, the slurmdbd, MariaDB, all of that is actually Puppet-managed, because it aligns more with most of our PaaS; that's a legacy thing. So yeah, V100s, and we do have a DGX A100, which has been fun. We also have some GPU nodes in the cloud that you can schedule through Nova. Those are PCI passthrough, not vGPU. The reasons we avoided vGPUs: there was a performance penalty unless you did lots of clever stuff with CPU pinning; there were license costs for those drivers that we didn't have a budget for; the vGPU drivers are essentially deprecated, as I understand it, with NVIDIA moving to some new thing; and our power users would rather have the full GPU anyway. We have two nodes with full GPUs, and then we carve out one or two GPUs that can be requested specifically for testing, so people can camp on those before they're ready to move to Prism. And we are going to get some more of those GPU nodes; there's also talk of some Grace Hopper, Grace-type nodes that could be coming our way soon.

All right, discussion. We want to talk about a few of the things that we've done well, and then go over our challenges and what's next. The good stuff: it's been really awesome having OpenStack seamlessly integrate with our provider networks, all of our data center VLANs that reach the resources our users need to access anyway. We've done GPFS and NFS; we have something like 90 petabytes of storage in all, and a cloud would be useless if you couldn't access that stuff quickly, easily, and securely. But because of the root problems with shared parallel file systems, you can only access those kinds of file systems from platform as a service. We don't give people root on those. We give people root on infrastructure as a service, but then you can't mount the parallel file systems, because you can't be root on a shared parallel file system.

We have OpenID integration with NASA's IdP. That was hard to set up, but now that it's there it's hugely convenient. It means people can get access to our cloud through all the normal NASA identity tools, with authorization done in a way that makes sense to NASA security. That was a big confidence boost for getting people into our cloud, and that integration is huge. And the dedupe is mind-blowingly cool compared to, say, Ceph, which can triplicate and blow up your data; we're going the other way, right? Now, the downside is that we get people in infrastructure as a service who want to store lots of data, and right now we don't really have a good way to handle that. RBAC. Want to talk about the RBAC? Yeah, sure.
So, prior to Explore, we had two separate OpenStack clouds, one for handling the PaaS tenants and one for handling the IaaS tenants, largely for security reasons: you have to lock things down to keep PaaS users from downloading the images or doing certain things they shouldn't. With Explore, we merged the clouds together and set up fairly complicated OpenStack RBAC to give the IaaS tenants what they need and keep the PaaS tenants in a much more locked-down environment.

We've had some major challenges. I want to touch on the forced OS upgrades. As we mentioned, we use the packages from RDO, and when RDO says, well, we're not going to build for RHEL 8 anymore, never mind that RHEL 8 is supported for like seven more years, we have to go where RDO goes. So we have to ditch RHEL 8; I guess we have to do RHEL 9 now, right? Just in the last three and a half years we went from CentOS 7 to CentOS 8 to CentOS 8 Stream, and then, because of bugs in CentOS 8 Stream, we switched to Rocky 8, which was more stable. But now we have to go to 9. That is a lot of major OS changes in just three and a half years, a really unsustainable amount of change. It's got to settle down, right? Yeah, we hit big bugs with DVR on Stream and so on.

RabbitMQ has been, and I know we're about out of time, RabbitMQ classic mirrored queues, right? Everybody knows that was kind of a big problem, and it was a big problem for us, especially when we ran Ceilometer, because that put a lot of stress on the bus. We ended up writing a script that would count the number of computes that were down; if it was above a threshold, it would restart our RabbitMQ boxes and then the nova-conductors, in an order that allowed them to recover properly, so that we weren't up in the middle of the night. We finally gave up, went to single-instance RabbitMQ, and we just monitor it.

Okay, other challenges. Telemetry is a nightmare. Gnocchi is abandoned, or deprecated. Monasca was actually great, I loved Monasca, but it's leaderless and it's complex, and it doesn't work with CloudKitty. And Ceilometer, what do you even say about Ceilometer? Yeah, just forget it. DVR routers: when you restart everything, the routers pull all their information serially off the RabbitMQ bus, so if you have dozens and dozens of routers it takes a long time to stabilize, about 20 minutes for an L3 agent to fully recover all of its tenants. We've definitely run into a few snags with that.

Glance: how do you even deploy Glance correctly anymore? Are you supposed to use Python eventlet, which doesn't really support SSL very well? Or do you use mod_wsgi, or uWSGI, or Gunicorn, Green Unicorn, or whatever? I don't know; somebody has to figure out how to deploy Glance correctly. DB disconnects, right? The logs are filled with them; we see this a lot. It doesn't seem to cause problems, but it's a byproduct of HAProxy timeouts firing before the MariaDB client timeouts. We tried to tweak that away, and when we did, our recovery time from an HAProxy failover became a minute instead of milliseconds or so. So we're looking for some clarity at some point on how to clean up our logs. Okay.
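A minimal sketch of the kind of recovery script described above; the threshold, host names, and restart commands are all hypothetical, and the real ordering logic was more involved.

```python
import subprocess
import openstack

DOWN_THRESHOLD = 10  # hypothetical

conn = openstack.connect(cloud="nccs-prod")  # hypothetical cloud name

# Count nova-compute services that Nova currently reports as down.
down = [svc for svc in conn.compute.services()
        if svc.binary == "nova-compute" and svc.state == "down"]

if len(down) > DOWN_THRESHOLD:
    # Restart the message bus first, then the conductors, so the conductors
    # reconnect to a healthy bus. Host names and unit names are made up.
    for host in ("rabbit1", "rabbit2", "rabbit3"):
        subprocess.run(["ssh", host, "systemctl", "restart", "rabbitmq-server"],
                       check=True)
    for host in ("controller1", "controller2", "controller3"):
        subprocess.run(["ssh", host, "systemctl", "restart",
                        "openstack-nova-conductor"], check=True)
```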
Another big thing we run into is user education about which combinations of things will work: the flavors, the availability zones, which storage type, that sort of thing. A big problem for us is that Horizon is perfectly happy to let you try to boot an impossible combination; it doesn't do any sort of validation. I can try to boot an instance with storage that's in a different AZ and a flavor that doesn't even match the node, and it just takes it, the boot fails, and then we get a ticket. Yeah, and it's your classic OpenStack "No valid host was found," which always makes it look like an operator problem. The flavors aren't named intuitively, because the people we work for wanted them to look like AWS flavors, but to an end user trying to read them (and we only have one more slide) they don't make sense. I wish we had just named our flavors something mnemonic, something that inherently tells the user what it is.

And then, moving forward, I think this is our last slide. Yes. We need to upgrade OpenStack. We need to get to Zed or Antelope, or by the time we get there they'll probably be on the C release, who knows. EL8 support ends with Yoga, so again we have to go to Red Hat 9, or Rocky 9, or something, or Ubuntu, because frankly we're frustrated with the state of the whole RHEL landscape across all those generations. Then there's the Neutron driver change: it's going to be hard for us to go from Linux bridge to OVS. We do have some Open vSwitch experience; we actually ran it in the past, before we switched to Linux bridge. But now, bam, here's OVS, and we also have to go and learn OVN, right? It sounds like it's great, but it's also complicated; it's a whole new ball of wax to learn. We want to implement the RabbitMQ quorum queues; that's something else we don't know yet. Maybe load balancer as a service, maybe also do Cinder active-active and get Cinder backups rolling. I would really love to build some better SDN as our cloud grows: switch from Cumulus to SONiC, maybe do more VXLAN, maybe do BGP unnumbered, maybe a centralized SDN controller so that we're not logging into individual switches to change ports and so forth. That would be cool. And last but not least, I want to replace xCAT with Ironic. We don't know the best way to run Ironic for our use case yet, so that's something we're thinking about; we'll be testing it in our test and development system. And then, if we do Ironic, what else do we do with it? Do we let the users use Ironic, or do we keep it to ourselves as the cloud admins and reuse it to deploy HPC systems? There's a whole new set of possibilities we need to consider with that.
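Purely as a hedged sketch of what that Ironic route might look like, not something running today: enrolling a node with the openstacksdk, with the driver, credentials, and hardware properties made up for illustration.

```python
import openstack

conn = openstack.connect(cloud="nccs-dev")  # hypothetical test/dev cloud

# Enroll a bare-metal node, roughly the information an xCAT node definition
# carries today. All driver_info values here are fictitious.
node = conn.baremetal.create_node(
    name="gpu-bm-001",
    driver="ipmi",
    driver_info={
        "ipmi_address": "10.1.2.3",
        "ipmi_username": "admin",
        "ipmi_password": "not-a-real-password",
    },
    properties={"cpus": 40, "memory_mb": 393216, "local_gb": 960},
)

# Move the node through Ironic's state machine so Nova can schedule onto it.
conn.baremetal.set_node_provision_state(node, "manage")
conn.baremetal.set_node_provision_state(node, "provide")
```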
Thank you. We're into Q&A, but we've overrun our time by five minutes. We'll be around later if you want to get with us and ask us some questions about this, and I'm happy to share these slides publicly. I know we started a little bit late, so maybe we can take questions for the next four minutes; I can see people are here. Vincent, do you have a question?

That was super interesting, thank you so much for sharing this with us. Full disclosure, I'm an OpenStack consultant working for Red Hat. I work with telcos; some of them have thousands or tens of thousands of nodes. Your challenges are very, very interesting, and I can see where you're coming from. I have some friends at companies who work with RDO and CentOS, and it's been super disruptive for them. A couple of things I wanted to mention. You don't have to do Ironic to deploy: you can choose any deployer you want and then use pre-deployed nodes with OpenStack, and this is actually the way things are going with 17. For the Glance and Cinder stuff on NetApp filers, at one of the telcos I was with, we ran into an issue where they kept asking us to use this unsigned, unpackaged binary that was supposedly making those clone operations on the NetApp box faster. It turned out that if you have big filers, the cache doesn't replicate from head to head, so it would only work one time out of eight. You might want to check that if you're looking to expand your filers. One more thing: I don't know about your release, but with OpenStack 17 you can run mixed computes, so you can keep your computes on RHEL 8 and you don't have to go to RHEL 9 all the way, even if it provides additional benefits. And about the Linux bridge driver, it's actually a multi-step process, because the native OpenFlow firewall was introduced as early as Queens, and with the collapse of the networking layers you no longer have those bridges whose sole purpose is to run iptables, which makes everything very fast. So you don't have to go to OVN all the way: you can take an intermediary step where you switch from Linux bridge to OVS with OpenFlow, and then later on maybe you do some OVN, but you don't have to do OVN at the same time you're migrating off Linux bridge. But this is super interesting, I'm a big fan. Thank you so much; you are doing great stuff for science, and the quality of the work you do is recognized worldwide. So thank you very much for sharing that.

You mentioned you wanted to use BGP unnumbered a few slides back? Yes. We periodically find instability with MLAG, and I think it would be fascinating to get away from Layer 2 networking. I can second that; a lot of our bigger deployments are moving to VXLAN, EVPN, and Layer 3. I would love to not have to worry about stuff like spanning tree anymore, or my MLAGs flaking out. We lost a switch pair last week. We have a pair of switches per rack, so there's some isolation, but when the MLAG freaks out, the switch stays up without passing traffic, and then the NFS mount gets pulled out from under the running VMs. It looks like a SCSI timeout, the file system goes read-only, and we end up rebooting 100 VMs. In place of that, you just run BGP between your top-of-rack switches in a kind of classical spine-and-leaf environment, and augment that with BFD and ECMP to get link redundancy and fault tolerance. Don't get me wrong, I've never done it, but I've read about it, and it's awesome; in my opinion it's easier than troubleshooting a bunch of MLAG. All the Layer 2 stuff has certainly given us plenty to work on and worry about over the years. I know our customers are still afraid of it, but those who've tried it have been very happy. Well, NASA may be afraid of it too, so we may not get to do it, but we'd love to demo that new technology for them and maybe teach them something new. A certain big customer is doing a lot of this stuff, and it is something I'm prototyping right now.
We're looking at OpenStack with free-range routing (FRR) sitting on top of the compute nodes, using multiple Layer 3 links in favor of LACP and any kind of Layer 2 link aggregation. And even if I never went to BGP and Layer 3 networking everywhere, I would love to have Neutron directly controlling my switches, so it could dynamically do VLAN pruning based on where the VMs are, because that would be super neat. Well, unless you do EVPN on the switches, you'll have to go VXLAN to get VLANs to span across your racks.

Can I ask a short question? Do you support UEFI for the VMs? And secondly, are you supporting IPv6 addressing into the VMs, or do you have any plans to? Yes, we natively support IPv6, and we even have some publicly available websites hosted in our cloud that are IPv6. The federal government loves that, because of the IPv6 requirements they put in all the documentation, even though almost no one actually uses it. There's a certain other government agency that talked to me; they asked, can you do IPv6? I said yes, and they said, we were hoping the answer was no, because if you say yes, we have to say yes. And we don't fudge it or fake it or translate it or anything; we support real IPv6 right down to the VMs. Your other question was UEFI for the VMs? I'm actually not sure. I'm not sure either; I think we're just doing classic BIOS, but I'm sure we could support it. Sorry, it's something I've been really looking for. I assume you're not doing the Q35 machine type, which is nice because the virtual hardware shows up looking like modern devices, right? That requires OVMF, and you're probably not doing any of that. All right, thanks for watching, guys.