Hello, and welcome to theCUBE's coverage of ISC High Performance 2023, where we're covering all things HPC: high performance computing, machine learning, of course AI, high performance analytics, quantum computing, and a lot more. Here to talk about the Omnia project, where operationalizing AI has never been easier, are Peter Nguyen, senior product manager for HPC at Dell, and John Lockman, distinguished engineer, artificial intelligence and high performance computing at Dell. Gentlemen, thanks for coming on to talk about the Omnia project.

Oh, thank you for having us.

All right, I have to ask you: I read all about it, love it, love the HPC-AI tie-in. I think this is really one of the most important trends of this generation of the computer industry, so I'm really, really super excited to talk about the Omnia project. But John, I've got to ask you, you've been called the godfather of the Omnia project. So what's the origin? How did the project come about? Was it an itch to scratch? Was it a mandate? Take that hill? How did it all come together?

The godfather, that's a great one. Yeah, the origin story, if you will, started back in the engineering lab at Dell. We were doing lots of proof-of-concept work, rapid deployments: build a small cluster to run this set of benchmarks, build this GPU cluster to start taking a look at these AI workloads. And one of the harder parts was that we were bringing in new data scientists who really didn't understand what a traditional HPC system was. They didn't use batch schedulers or Slurm or even the command line; they were familiar with JupyterHub, interactively playing with their data. We wanted to build a way to more easily bring those new data scientists onto our HPC hardware, where we needed them to actually be crunching that data. It was really: how can we change the way we think about building HPC clusters to be more accommodating to our modern workloads?

That's awesome. Peter, I want to get to you, and I want to comment on your title first: HPC software. When we talked to Jeff Clarke at Dell Tech World last year, we asked, does hardware matter? Everyone was talking about solutions; that's what Dell PR wanted to talk about. We're like, yeah, of course hardware matters. But then he also said, it's all software, right? This has been an ongoing thing in the industry for a decade: software is super critical, with AI being software-driven and more hardware power at the silicon level, the chips. The combination with high performance computing is kind of a renaissance now with AI. What's your take on where the Omnia project intersects with the AI piece? Because everyone's talking about this: high performance computing, more compute, more GPUs. AI just feeds off this.

Right. There is an intersection between HPC and AI, and it really boils down to the fact that AI is using more and more intensive hardware nowadays. Back in the day, AI was a lot of research, but it would be one or two nodes, one or two GPUs, a lot of test beds. Now, with everything going on with OpenAI, now that those things are almost near completion, you can start throwing some horsepower at it.
And I think that's where Dell really comes in: now that these things are mature, let's get those models, those applications, out into the field where everybody can use them and start tweaking them. At the early stage, not everybody's going to be working on this code, right? It's very deep, very math-oriented. Now that we're putting on the finishing touches, a lot more people can work on it, because that core baseline is there. And that's kind of where Omnia comes in: taking it to the next step, making your final changes to that code, and then rolling it out.

We're going to get into it; it's really relevant. Dave Vellante and I were talking about supercloud a year and a half ago at re:Invent. We saw the evolution of this new kind of cloud architecture: multi-cloud, integrated, and of course supercomputing's booming. Now "super apps" is what it's being called. So: supercomputing, supercloud, super edge, super applications. The Omnia project kind of hits all these things in a way. I want to get the definition out there, because this is super relevant and cool as well. What is the Omnia project, and how does it fit into the future of how we see distributed computing coming out? Because this is really the big architectural conversation. What is the project? How do you define it?

The project is a set of best practices for applying a DevOps mentality to building and maintaining our HPC systems and AI systems. What we've been able to do inside the company has been pretty amazing: we've completely open sourced the tooling. It's a set of Ansible playbooks, documentation, examples, and a group of developers here at Dell who are continuing to develop new features for it. Really, the heart of the project is to enable everybody to build a simple Slurm-based HPC simulation-and-modeling cluster, or get started in the Kubernetes space if you're doing AI. In the beginning, I'll say, a lot of my background comes from HPC, although I was always doing AI in HPC. One of the things we really wanted to see was: can we modernize how HPC does its work today? We're a lot of bare-metal houses, which is great, and that's where we've always been, but containerization has obviously taken off, and how can we push the HPC community forward in adopting those practices, not only in the DevOps space for building and maintaining your clusters, but also for running your simulations and applications?

Yeah, a couple of things there. You mentioned Kubernetes orchestration and containers really taking over with microservices, which isn't completely replacing virtual machines, but is certainly creating a new paradigm. So I have to ask you: what separates the software stack here? What's different and unique about it? Because basically it's infrastructure as code now; it's not just cloud, it's cloud operations, right? So you've got infrastructure as code; is it programming the low-level stacks? I guess what my question is: what separates this Omnia project from traditional HPC software stacks out there?
So this starts all the way at the bottom. Like your traditional HPC software stack, the Omnia project will handle your firmware updates, your BIOS configurations, your RAID configs, but it will also provision operating systems of your choosing, it can set up a very large variety of Dell switches and NVIDIA switches, including InfiniBand and high-speed Ethernet, and it can take all of that, what we call that pile of hardware, and turn it into a logical cluster.

So I have to ask you about the open source side. We'll be at Open Source Summit in Vancouver, covering a lot of the action there, and we were just at KubeCon and CloudNativeCon; you mentioned Kubernetes. Where's the challenge on HPC, and where's the opportunity, I should say, as well? Because when we brought up the AI conversation to that world, they're like, whoa, we don't want any hallucinations; we want more lockdown, more reliability. Everyone obviously gravitates toward automation because they love it and they can get it, so AI is being used in that kind of vein. I can see HPC fitting in, so how do you guys support Kubernetes in that world? Can you take us through that real quick? Because I think that's an important point: you're seeing a lot more automation there. Where do AI and HPC fit?

Yeah, I guess I'll take this one. As you know, Kubernetes is all open source, but it's microservice-based, right? There are a lot of moving pieces, and everything's always changing. For better or worse, there's what's called upstream Kubernetes, the version where they say: install these ten things and you'll have a Kubernetes cluster. The problem is you can make deviations, you can swap things out, and once you do, things don't work exactly right. Some companies have tried to solve that problem by creating distributions; two popular ones in the edge world are K3s and MicroK8s. They solve the problem by managing it, but then they also charge for it. Dell's taken a different approach, where we've created the open source stack. We've given you the ability to change out bits and pieces here and there, but it's still open source. And that leads into my next point: we're not against these managed Kubernetes distributions. If enough people ask for it, we will add K3s to Omnia. The important part is, once you have K3s installed, okay, let's connect up all your different storage devices, let's connect up all your different switches. That still needs to be done regardless of which Kubernetes flavor you're using, and that's where Omnia comes in. It's like a big wrapper that goes around everything. And that's in the name, right? Omnia: everything.

Yeah, a little more on that open source side: HPC has historically been a choose-your-own-adventure sort of play, right? I'm going to pick these pieces of hardware; we're going with this software stack. When you look at the choices that have been made across all the different research centers, government centers, and private industry, they've all made slightly different choices, which is great, because it's allowed new things to pop up to the top.
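To make that concrete: because the tooling is a set of Ansible playbooks, standing up a cluster amounts to describing your pile of hardware in an inventory and running the playbooks against it. A minimal sketch follows; the group names, host names, and the entry-point playbook are illustrative assumptions, so check the Omnia repo on GitHub for the real layout:

```yaml
# inventory.yml -- a hypothetical "pile of hardware" described as Ansible groups
all:
  children:
    manager:
      hosts:
        head01:            # cluster head / control node
    compute:
      hosts:
        gpu[01:04]:        # four GPU nodes, resolvable by hostname

# Then point the project's playbooks at it, roughly:
#   ansible-playbook omnia.yml -i inventory.yml
```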
What we wanted to do with Omnia was provide a prescribed best practice for what we feel is the set of open source tools to look at for doing state-of-the-art AI today, whether that's training or just inference, and state-of-the-art HPC and simulation. When you provide that in the open source fashion that we have with the Omnia project, it allows customers to pick and choose the pieces they want, rather than taking a large monolithic approach of "this is a cluster and it does AI." It gives a lot more choice back to the system administrator and the developer, and it shows them the way we see forward to a real DevOps world for AI and HPC.

I love the DevOps angle here, because I think that's going to be completely enabling for the super apps, as I call them, with AI. AI is going to be in every app, no doubt about it; data is going to be involved; more compute, more processing is very key. I want to ask you what hardware and software you guys support, but first, a quick follow-up on the Kubernetes thing: can Omnia support an HPC scheduler simultaneously with Kubernetes?

Yeah, actually, one of the weird use cases we came up with right at the beginning was running both Slurm and Kubernetes side by side on a single cluster. We did that with the hope of, and were actually able to demonstrate, running an inference server in your Kubernetes space, where I might want to replace a scientific library in my HPC simulation, or I might be using an AI inference server to generate input data for my simulation. In doing so, you're able to couple both schedulers on the same system. It does get a little hairy with who's in charge and who's on first, I guess. Those are some of the things we keep running into on this journey in computing: what is next, what are the next possibilities, what will people want to do? And that's what we try to allow: well, everything. (There's a rough sketch of that dual-scheduler pattern a little further down.)

What about geographically dispersed HPC clusters? Is there support there too?

Yeah, we're actually doing a lot of work right now on multi-site and multi-cloud. We've done some projects with Google Cloud; you can actually get an Omnia instance in GCP. But today we're doing a lot more work on letting customers connect their on-prem, colo, and cloud resources through that same DevOps mentality of everything as a service; it just depends on where you want to start it up.

Okay, so back to my next two questions, which are around hardware and software support, and then how you decide what features to add to the roadmap. The first one, just to get it out there: what hardware and software do you guys support with Omnia?

Sure. I'm not going to lie to you: we are Dell, so obviously we're going to support Dell hardware first and foremost. And the other thing is, we're part of the HPC and AI group, so we support the most popular gear for that space. We've got the C-series, which is the super-dense compute, lots of CPU power in a small space. We've got the XE-series, which packs lots of accelerators into a small space. And then you've got the high-end R-series, R being the traditional rack servers, where we're going to be supporting mostly the high-power CPU, high-power GPU configurations. Beyond that, we connect up all the different switches you could possibly use.
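Circling back to that Slurm-plus-Kubernetes point: in practice it means enrolling the same nodes in both schedulers. Here is a hedged sketch of what that could look like as an Ansible play; the package name, head-node variable, and join token are assumptions for illustration, not Omnia's actual implementation:

```yaml
# Hypothetical play: enroll one pool of compute nodes in both schedulers.
- hosts: compute
  become: true
  tasks:
    - name: Install the Slurm worker daemon (package name varies by distro)
      ansible.builtin.package:
        name: slurmd
        state: present

    - name: Join the same nodes to an existing K3s control plane as agents
      ansible.builtin.shell: >
        curl -sfL https://get.k3s.io |
        K3S_URL=https://{{ head_node }}:6443 K3S_TOKEN={{ join_token }} sh -
      args:
        creates: /usr/local/bin/k3s
```

Deciding which scheduler owns a node at any given moment is exactly the who's-in-charge question John mentions.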
On the switch side, you've got the PowerSwitch S-series, which is what you use at the top of each individual rack. You've got the N-series, which is what you use on the client side, for what I believe they call campus build-out, so not necessarily a school campus, but what you'd use in your office building to connect the clients to the compute nodes. And finally you've got the Z-series, which is the high-speed interconnect, kind of the competitor to InfiniBand, which we also support.

In terms of storage, we support the most popular storage solutions for HPC. The big one there is going to be PowerVault. PowerVault is kind of entry-level storage, really cheap and fast, and when you RAID a bunch of them together it's pretty quick. There's also an option called BeeGFS. BeeGFS is software that you put on top of those PowerVaults to make multiple PowerVaults behave as one super drive that's super fast. That's what most storage-intensive HPC is using: BeeGFS on top of those PowerVaults. And in terms of persistent storage, say you're running a simulation and you want to collect all that data and merge it before you send it back to the client side, you'd use something like PowerScale. PowerScale, most popularly, is basically one box with a ton of SSDs in it.

Can you give some use case examples? How can Omnia benefit edge use cases, for example?

Well, 2023 is the year AI really came onto the scene. You've got language models like ChatGPT, and you've got generative AI like DALL-E or Stable Diffusion, where you take just a prompt and generate some beautiful picture out of it. The tricky thing is that these models take tons and tons of data to train, and you're going to do that in the data center, right? Gigs and terabytes of data. But once the model is trained, you don't need that data anymore. Think of something like Excel: you take a bunch of data points, you draw the line, you get the equation out, and then you throw away the data points. You don't need them anymore, because you have the equation; you've got the line. In this case, once you have that model trained and you want to start using it, you can throw that model onto a single box somewhere, and it doesn't necessarily need to be in the data center; it can be in an office somewhere. And because it's going to be in an office somewhere, you can buy something a little lower-powered in terms of hardware. For example, look at Stable Diffusion: the recommendation is just a quad-core and 16 gigabytes of RAM. My computer at home can do that. In this case, you want to get your overhead as low as possible. You don't want a lot of garbage running on that box, because, hey, it's just doing one job, it's just chugging along; you don't want to have to check on it. So one thing we've done in Omnia, for example, is add the ability to make your cluster a little bit lighter.
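As a hedged illustration of that last step, serving an already-trained model from one small box, a single-replica Kubernetes Deployment is about all it takes. The image name below is a placeholder, and the resource figures just mirror the quad-core/16 GB recommendation Peter cites:

```yaml
# Hypothetical manifest: one trained model served from a single edge node.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: sd-inference
spec:
  replicas: 1                 # one box, one job, just chugging along
  selector:
    matchLabels:
      app: sd-inference
  template:
    metadata:
      labels:
        app: sd-inference
    spec:
      containers:
        - name: server
          image: registry.example.com/sd-server:latest   # placeholder image
          resources:
            requests:
              cpu: "4"        # the quad-core recommendation
              memory: 16Gi
```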
For instance, if you're just running that box out in the middle of nowhere, you can disable the telemetry, and if it fails, it fails, so what? I don't care. That's one thing: we've added the ability to skip the telemetry, to skip plotting everything that's running all the time, and that frees up a little bit of space. We also have a couple of things in the pipeline, things like Ubuntu. Ubuntu is the number one OS for AI, and the reason we want to support something like Ubuntu is that, theoretically, any container you create can run on any bare-metal OS; they say as long as it's running the Linux kernel, it's going to be compatible. But there are going to be little dependency things here and there. So we figure, okay, let's get the boxes running the exact same OS that was used to generate the container, and Ubuntu is that OS. There are other ways to make it lighter weight too, like changing your Kubernetes distribution. Earlier I mentioned that you could install something like MicroK8s or K3s, the two popular distributions, and we're considering those as well. If there's a use case where I want the absolute minimum footprint, then adding those things into Omnia might be the way to go.

I love how you guys have this DevOps focus with HPC and AI. I think that's a great strategy, the market's in need, and I love that it's open source; John, you mentioned that. The question I have for you guys: does Dell plan to offer a formal support plan for the open source project, and how do you balance that with the open source contribution? Can you take us through the support plan?

Yeah, it's important to remember that Omnia is both an open source project and a product. If you're an expert and you know what you're doing, you can use Omnia today: you go on GitHub, you download the code, you start using it. If you're already using Ansible, you don't even have to use the whole of Omnia; you can take bits and pieces of our code and copy and paste them into your setup, and that's great too. The product side is for when you're not an expert and you want some help: we sell ProSupport, where we take care of the installation, and if something goes wrong, you have access to our support team and our developers to get you back on your feet. So you can consume Omnia both ways.

All right, last couple of minutes we've got. I want to get into some of the questions around metering, features, and accounting. Are you guys planning to offer any of that? Will you implement, say, chargeback models?

Yeah, today we take a simple approach: it's your cluster and you do what you want. But our customers always want to know, hey, can I monitor more than just individual user usage, or are there automated things we can do? We've partnered with a lot of open source tools, like Kubecost, for example, where we can plug in some of these values. And as this is another thing people continue to ask for, that's the cool part about an open source project we're building on the fly: if enough people want it, we're happy to start building it into the tooling.
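For the curious, wiring something like Kubecost into a cluster is a small job. Here's a minimal sketch using Ansible's Helm module; the chart coordinates reflect Kubecost's public Helm repo, but treat the details as assumptions and verify them against Kubecost's current docs:

```yaml
# Hypothetical play: drop Kubecost onto an existing Kubernetes cluster.
- hosts: localhost
  tasks:
    - name: Install the Kubecost cost-analyzer chart
      kubernetes.core.helm:
        name: kubecost
        chart_ref: cost-analyzer
        chart_repo_url: https://kubecost.github.io/cost-analyzer/
        release_namespace: kubecost
        create_namespace: true
```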
And the other cool part about it being open source, for things like new features and accounting, is that if somebody's already written this and they want to add it in, they're free to. If they want to do it as their own product, their own project, they're free to do that as well. Omnia is distributed under a very permissive license, which allows us to make products from it, and allows our customers to build products from it too.

It's next gen for sure. I love the strategy; I think it's totally aligned with next-gen thinking all around. Open source growth is phenomenal. Obviously, with the scale and automation coming, humans can't do everything, but they can have a key hand in it. So, final question to wrap things up: what advanced features for Omnia are going to be interesting in the future? What do you guys see coming around the corner? Is it straight and narrow, 90 miles an hour straight ahead, or is there stuff around the corner? What should we look for?

The coolest stuff coming hot off the press is definitely the multi-site, multi-cloud work that's happening right now. There's a lot of work going on to partner with ISVs to provide more platform integrations, whether that's an MLOps tool or a tool for HPC simulation and modeling; we've got a lot going on there. The other really neat thing I'm excited about is composable computing. We've done work with Liqid and GigaIO, because Texas is apparently the center of composable computing, and it's a great place to be. We've worked with those folks to help build out Omnia, I'll say, environments on top of that type of hardware, and there are just more and more features to come.

Awesome. Peter, John, thanks for coming on and talking about Omnia. Congratulations on being the godfather and all, taking it to the open and bringing this into the DevOps world. You've got supercomputing, you've got supercloud, super applications, all part of the new wave, the next big thing, and AI is going to be in every application. So, appreciate your time.

Thanks for having us.

Okay, this is theCUBE's coverage of ISC High Performance 2023. I'm John Furrier, host of theCUBE. Thanks for watching.