Great. Hello, everyone. Thanks for coming. My name is Ryan Nakashima. I am the Research Computing Manager for the Research Data Services Division of the San Diego Supercomputer Center. I know that's a lot of words to describe my position, but really all I do is help researchers find the research computing and data storage services they need to do their highly technical research. Presenting with me today are some of our amazing engineers, who do the really difficult work of creating these research computing services. With me I have Colby Walsworth and Danny Saba, the main engineers who work on our Nova cluster, as well as some other members of our team who are here too. The handshake deal I have with our engineers is that I get to do the talking here, and then they get to answer the questions that come up later. So today I'll give some somewhat complicated answers to three simple questions and explain how they relate to Nova and open source infrastructure. Those three questions are: where are we from, what do we do, and how do we use Nova?

The first question is: where in the world is the San Diego Supercomputer Center? As you might have guessed, we're in San Diego, California, in the United States, on the University of California San Diego campus, so we are affiliated with a public academic and research institution. And for some reason, our architectural designers decided that the perfect way to complement our very, very beautiful beaches was to make our buildings look like the spaceships of competing alien races. So one of these pictures is SDSC; show of hands. How many of you think it's the pyramid-looking thing at the top left? How many think it's the blocky-looking thing at the bottom left? How many of you think it's the cylindrical thing at the bottom right? Okay, and how many of you think that was a trick question and that we actually submerged our data center in the ocean, at the top right? If only we had the money that Microsoft has, right? Those of you who guessed the bottom left were correct: we are the blocky-looking one. So yeah, congratulations if you guessed it right.

So now that you know where we're from, on to the next question: what does SDSC do? The answer also sounds simple: we store a lot of data and we run a great many computers for researchers. The actual answer, of course, is much more complex. But at our core, we do what I assume those of you who are engineers here do. We take care of the infrastructure problems we find in research computing, because we want the researchers to be able to focus on solving things like cancer, other diseases, and natural disasters, you know, things that will kill us, and not be distracted by questions like, why is my instance not migrating to another hypervisor again? I should mention these researchers also solve other problems that don't involve things that will kill us, but I happen to think that helping us live longer is one of the more important things they work on. On this slide, you'll see some of the supercomputers we've hosted in our data center over the past few decades, including our latest and greatest supercomputer, Expanse, on the far left there. Despite our noble purpose of enabling research that helps humanity, somehow our systems look more and more sinister as time goes on in these pictures. Not sure why; we'll probably have to work on that public image.
Aside from the supercomputers, which are traditionally batch job processors, we have thousands of other systems in our data center that support many other types of computing, including the OpenStack Nova cluster that we'll be talking about today. Now, Nova is almost a teenager, if I've calculated the number of years it's been around correctly, and it's probably not a trending topic the way something like Kubernetes is, but we've found that Nova is still one of the most useful options for very quickly providing a powerful, versatile compute system for researchers when it's combined with the right stack of supporting technologies. I've been encouraged to see it still widely discussed here at the OpenInfra Summit in the other sessions, and I definitely have to extend my gratitude to all of the contributors working on Nova and the other components we all use to do our jobs. Those contributions make our lives better, because the researchers who do their work on the infrastructure that runs this software are successful, and we are very, very thankful.

You'll see that we have the core OpenStack technologies on the left, which form our Nova cluster; I'm pretty sure it's a fairly standard setup. We also use other supplementary open source software to maximize our capabilities and automation so we can work within research budgets. And again, we greatly appreciate the fact that the software is open and free for anyone to build upon. I definitely have to emphasize that "free" part, because of one of the biggest problems we all face, unless of course you're a billionaire. Anyone? No? No one wants to give me a million dollars? Yeah, okay. Everyone, unless again they're filthy rich, has to figure out what to do with their budget, right? And that's really the core of what we're talking about here: how does SDSC use Nova within those budgets? It seems like it should have a pretty straightforward answer, but at the risk of sounding like a relationship status on Facebook, I have to say it's complicated.

So, show of hands, how many of you support researchers? Okay, a few of you? Nice, that's good. I'm sure those of you are well familiar with the problems we face with research budgets. But for those of you to whom this is new, we'll have to look at how funding works in the research sector. One of the interesting things about research, and I should mention that I formerly worked in industry and now work in research, so I've seen both sides of how things work, is that in research, all of the work comes before the revenue starts. In industry it's a little different; there's usually an existing revenue stream or something else already happening. So the question is: without existing revenue, where does the funding for research computing come from? The answer is usually a grant, a donation, or seed funding from an investor. These money sources are pretty much always limited, because there is no ongoing flow of money supporting them. Open source infrastructure is therefore critical for our work, because it's really the only way we can provide powerful research computing options that fit within the limited research budgets we're talking about here. And if you work in research, you may have found that there is also a big difference in the scale of the budgets of different research groups.
We found that offering only one way of accessing our services ended up excluding either the smaller research budgets or the larger ones. So we created two models for researchers to access our Nova service: what we call our on-demand model and our reservation model. In the on-demand model, researchers can create instances whenever they want on a shared portion of a single Nova cluster, and they pay for each hour of use from their research budgets. Alternatively, in the reservation model, they can reserve an entire hypervisor long term, usually for their lab members to all split up and use. Sometimes reserving hypervisors means the researchers come in with their own hardware and give it to us; we plug it into the Nova cluster, it becomes something that can be virtualized, and all of a sudden it's a lot more shareable. Other times, whether because of budget or some other reason, the researchers prefer to have us acquire the hardware; we set it up, and they split it up for their lab members to use in the same way. Either way, the researchers are ultimately responsible for the cost of the hardware procurement; there are just a couple of different methods of accessing that reservation model.

Now you might ask: public cloud providers do similar things, right? Why are you talking about this? Well, we do have a couple of differences. First, unlike with cloud providers, the reservation refers to the hypervisor, not the instance. This allows labs with a lot of individual researchers to have their own instances and not interfere with each other's work, say with conflicting software stacks or people trying to change the same system at the same time, while still sharing an overall pool of computing resources that ultimately belongs to the group the researcher belongs to. Also, we don't charge based on how powerful the nodes are in the reservation model. Instead, we charge the researchers on more of a maintenance model, based on the number of nodes they reserve. That means researchers can access very powerful nodes for a relatively low ongoing maintenance cost compared to what they would otherwise pay on a public cloud like AWS or Google Cloud.

So now, thinking back on those different groups of researchers, the folks from public research institutions and those from industry, how are these models consumed by people with different funding amounts? Overwhelmingly, we find that the on-demand model, where people pay by the hour, is used by researchers with very small, limited budgets. They mostly do their work locally, and every once in a while, for a very brief, limited amount of time, they need a more powerful system. They can't afford the cost of reserving a whole hypervisor long term, and they might not be working with a lot of collaborators; they're more solo. That's the type of researcher that uses the on-demand model. On the other side, the reservation model is often preferred by researchers with larger budgets, whether from public institutions or from industry groups that are still involved in research. These people need a large amount of resources, and they need to be sure they have those resources when they need them.
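To make the difference between the two cost models concrete, here is a minimal sketch in Python; the rates are hypothetical, purely for illustration, and are not SDSC's actual pricing.

```python
# Hypothetical rates, for illustration only; not SDSC's actual pricing.
HOURLY_RATE = 0.60        # on-demand: charged per instance-hour actually used
FLAT_NODE_RATE = 400.00   # reservation: flat maintenance fee per node per month


def on_demand_monthly_cost(instance_hours: float) -> float:
    """On-demand model: pay only for the hours actually used."""
    return instance_hours * HOURLY_RATE


def reservation_monthly_cost(nodes: int) -> float:
    """Reservation model: flat per-node fee, independent of how powerful
    the node is or how heavily it is used."""
    return nodes * FLAT_NODE_RATE


# A solo researcher bursting for 60 hours pays very little on demand...
print(on_demand_monthly_cost(60))    # 36.0
# ...while a lab running one node around the clock (~730 hours/month)
# pays more per hour in total but gets no guaranteed capacity.
print(on_demand_monthly_cost(730))   # 438.0
print(reservation_monthly_cost(1))   # 400.0, with the whole hypervisor guaranteed
```

The crossover point depends entirely on the rates, but the shape of the trade-off is the same: occasional, short-lived use favors paying by the hour, while sustained use with hard deadlines favors the flat reservation.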
The most important part of that is when they need them. Having that guaranteed minimum availability is critical for hitting their deadlines and staying within their budgets.

On the technical side, which I'm guessing more of you are probably interested in, how do we support these two different models? The answer ended up being surprisingly straightforward: we use host aggregates. In the OpenStack documentation, the main suggestion for host aggregates is to align them with the physical characteristics of the hypervisors, like grouping instances that will use SSDs, or providing GPUs or virtual GPUs to researchers. However, we decided to break away from that connection between host aggregates and the physical characteristics of the hypervisors and focus instead on the use case. We use host aggregates to define the ownership of hypervisors in the reservation model. On setup, we configure the host aggregates and which projects can access them, and then the scheduler handles the rest, allowing the labs to use their hypervisors exclusively under the reservation model. The instances of on-demand users, the folks on the pay-by-the-hour model, are created on shared hypervisors that many different people use.

And then, I'm guessing no one saw this coming, but when I joked earlier about our building being very blocky, that was because I am a horrible artist, and whenever I need to draw something, I tend to make blocks. So here is a pretty terrible representation of which hypervisors each group claims in the cluster under this model. Each of those groups of people represents a different research group or lab. If you work at a large public institution like UC San Diego, you'll find there are often a lot of labs, with individual researchers within those labs, who all need compute instances of their own. So you can imagine that each of the research groups is like one of those labs at a research institution, with multiple users that need access. When they request to spin up an instance, then based on the host aggregate definitions and the scheduler filters we define, the research group's instances are created on the specific set of hypervisors that are considered theirs in the reservation model.

Please note that the shared nodes are still in the same cluster. We are able to support this really only because we work very, very closely with these people, and they are for the most part, I should really say all, trusted users. This is not something we would recommend for anyone working with highly sensitive data, of course. I do have to mention there are some other downsides to the reservation model. The main two are the lack of redundancy and the relative inefficiency of resource consumption. However, we found that the people who use the reservation model ultimately do not prioritize these factors at all, and that's fine for both them and us. The advantage of giving lab members resources when they need them outweighs the downsides.
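For the technically curious, here is a minimal sketch of how this kind of ownership aggregate might be set up with openstacksdk, assuming the AggregateMultiTenancyIsolation scheduler filter is enabled in nova.conf; the cloud name, aggregate name, project ID, and host names are all hypothetical, and our actual automation differs.

```python
# A sketch of reserving hypervisors for one lab via a host aggregate.
# Assumes nova.conf enables the multi-tenancy isolation filter, e.g.
#   [filter_scheduler]
#   enabled_filters = ...,AggregateMultiTenancyIsolation
import openstack

conn = openstack.connect(cloud="sdsc")  # cloud name from clouds.yaml (hypothetical)

# Create an aggregate representing one lab's reserved hardware.
agg = conn.compute.create_aggregate(name="lab-reserved-example")

# Restrict scheduling on this aggregate to the lab's project (tenant).
conn.compute.set_aggregate_metadata(
    agg, {"filter_tenant_id": "0123456789abcdef0123456789abcdef"}
)

# Add the hypervisors the lab has reserved (or brought in themselves).
for host in ["hv-101", "hv-102"]:
    conn.compute.add_host_to_aggregate(agg, host)
```

The shared, on-demand pool is just another aggregate set up the same way, which, as comes up in the Q&A below, is how instances are kept from wandering between pools: every project is pinned to some aggregate by its tenant ID.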
Lastly, if you're interested in this kind of thing, it's probably a separate presentation, because I can talk for hours on this since I'm more often on the business side of things, and I assume that no one else really has the strange fascination that I have with business processes. But we can explain how we automated our billing processes to support hundreds of researchers with a relatively small team at SDSC that works on multiple projects, not just this Nova cluster. I've included this slide as a preview in case you want the quick summary of what we do: basically, we feed variable hourly usage into the billing system for our on-demand users, and the reservations are charged a static flat rate. Out of all the technologies we use, the CRM system is the only one, at least right now, that is not open source. I'm guessing that will probably change within the next year or two, just because our license is probably going to expire and we'll have to move on to something else, but yeah, we really, really heavily use open source technologies to do all of this.

So we've answered three main questions today. Where are we from? We're the San Diego Supercomputer Center, an organization that's almost four decades old, based in San Diego, California, in the United States. What do we do? We support research computing. And how do we use Nova? We set up on-demand and reservation models for researchers so that they can use exactly what they need, when they need it. So now we wanted to leave some time for questions on any of this. What questions do you have for us?

If you're using host aggregates in that same Nova cluster, and you're setting the filter tenant ID on the host aggregate, how are you dealing with, do you guys have a public set of flavors that you're using, and then flavors for those dedicated aggregates? Because if you have no metadata set, those will wander over into your other environment. How are you addressing that?

Okay, and just to repeat the question for the recording: the question is how we deal with the flavors. The filter scheduler says, okay, this tenant is able to provision on this host aggregate only. However, if you have people that are not present in that host aggregate, like a regular standard shared user, you're going to have, let's say, a flavor with no metadata set, and if nothing's set, it'll wander its way in there. So how are we dealing with that?

We actually put everyone else into a host aggregate with the filter set as well. So everyone who doesn't go into the private pools goes into the shared one, and they're forced into that one too.

But are you setting something on the flavor?

No, we're not. They don't move, because everyone has a filter tenant ID; there's no one without one.

Okay.

Now, we do create custom flavors for the private pools, because we have different use cases and the labs want different sized instances and things like that. And we also provide GPU options, so we have custom flavors for the GPUs.

Sure. Okay, that makes sense. Thank you.

Oh, thank you for the question. And yeah, what other questions are there?

Do you use the CloudKitty project for the billing stuff, or are you thinking about it?

And the question is, do we use the cloud project ID for the billing? Is that it? Oh, CloudKitty. Okay, so, let's see.
So we had tried using CloudKitty. I forget what year that was; maybe back in 2018 or so. Yeah, thank you, Lisa. So yes, we had tried using CloudKitty. If I remember correctly, we had some issues trying to start it up, and it wasn't quite giving us what we wanted, so we decided to look at it again at a later date.

So when you're billing back your researchers, what telemetry information are you gathering? Is it bandwidth data? Are you billing based on a project, or on the instance itself?

It's hourly.

And what telemetry data are you collecting? How are you collecting it, and what are you putting it in? You said something about taking it and bringing it into Influx, but where's it coming from?

Most of it's coming from Nova, for the hourly billing. Yeah, the simple usage API. And we don't bill for bandwidth; part of our policy is that our network is supplied by public money, so we don't charge other people for it.

Sure.

The one challenge we do have is GPUs. For those we have to start looking at the instance ID or the flavor ID to figure out whether they're using a GPU or not, because Nova doesn't reveal any of that information.

Right. I mean, I imagine most operators have baked their own solution to deal with their billing. We've done ours, and I'm sure you guys have done yours; I was just curious to find out how.

Great. Yes?

The customers that you're renting hypervisors to, are they interacting with Horizon, or are they just sending you guys VMs?

Okay, yeah. So the question, to repeat it for the recording: are the customers interacting with Horizon, or are they just sending us VMs? I want to say that a lot of them are, yes, interacting with Horizon. I'm not sure if they're interacting with anything else.

Is there a lot of training the customer for that, or how are they picking it up?

That, I would say, varies greatly, where some groups of customers require a lot of hand-holding. But luckily, we had set up our own set of steps to go through in a wiki format, and a lot of the users we work with are able to go through that wiki and at least figure out the basics.

Do you guys expose any of the telemetry information from the instances to the user in Horizon, like a Grafana plug-in for Horizon or anything like that?

We've created some dashboards for some of the rental models so they can see what resources they're using on their hypervisors, so they know whether they have GPUs or how many CPUs are being used at the time. But nothing integrated or automated during the provisioning process right now.

A comment rather than a question: we run a research cloud in Australia, and we use host aggregates in a similar way, because we have some infrastructure that's funded nationally, but we also have a federated system across multiple institutions. Some of those institutions fund their own infrastructure that's just used by their own local researchers, so we put those in host aggregates to make sure they can be scheduled just for those researchers.

Oh, nice, that's great to hear. And yeah, I'd like to hear more about that, because when we were setting this up, we hadn't talked with a whole lot of people who had been doing this already. So yeah, it's great to hear that that's working over there.

Okay, and yeah, any more questions? Cool, well, that wraps up our presentation. Thank you for coming here and listening to what we do at the Supercomputer Center.
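As a technical footnote, the hourly usage pull mentioned in the Q&A could look roughly like the sketch below, using Nova's simple usage API (os-simple-tenant-usage) through openstacksdk; the cloud name, project ID, and time window are hypothetical, and this is not our actual billing code.

```python
# A sketch of pulling per-instance hours from Nova's simple usage API
# (os-simple-tenant-usage) for hourly billing; a real pipeline would
# then write these numbers to something like InfluxDB.
import openstack

conn = openstack.connect(cloud="sdsc")  # hypothetical cloud name

project_id = "0123456789abcdef0123456789abcdef"  # hypothetical
start, end = "2024-01-01T00:00:00", "2024-02-01T00:00:00"

# The compute proxy is a keystoneauth adapter, so a raw GET against the
# compute endpoint works even where the SDK has no dedicated helper.
resp = conn.compute.get(
    f"/os-simple-tenant-usage/{project_id}?start={start}&end={end}"
)
for server in resp.json()["tenant_usage"].get("server_usages", []):
    # "hours" is the figure an hourly on-demand charge would be based on.
    print(server["instance_id"], server["name"], round(server["hours"], 2))
```

As noted in the Q&A, the usage records don't say whether an instance has a GPU, so GPU pricing has to key off the instance's flavor instead.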
I just wanted to say one last thank you to the open infrastructure community, because the efforts performed here greatly improve the world we live in through the work these researchers are doing. So then again, thank you for coming. We'll end here a little bit early, which gives you some extra time to get to the marketplace mixer, and we look forward to the rest of the summit. Thank you.