Thank you. Hello everyone. So, I'm Camille, very happy to be here today. The idea behind this talk is essentially a project that the University of Alabama at Birmingham and Canonical did together last winter. At the university, they already had an HPC environment and an OpenStack-based environment for their users, but in terms of containers, all of the different environments were scattered: containers on laptops here and there, some Docker Swarm over there. The idea was to provide a dedicated Kubernetes platform to their users, and so we worked together to get this done. We're going to go today through the Kubernetes architecture that we chose and a few of the integrations that we've done, and then my colleague John-Paul will go a little bit more through what type of research they do and how they leverage that infrastructure for the research.

So, for the Kubernetes architecture: in this case, we're talking about a Kubernetes on bare metal deployment. When we talk about bare metal deployments, we want to have some infrastructure nodes, you can also call them management nodes. Essentially, those sit outside of the Kubernetes deployment, to be able to manage the Kubernetes environment.

The first technology I'll highlight here is MAAS, the Metal as a Service platform. MAAS is a bare-metal provisioning tool, super useful when you have a large estate of bare-metal servers. It provides you with your asset inventory, and the layout that you want when you PXE boot and deploy your servers, the storage layout, the networking, and all of that, can be automated through MAAS. So that was a really useful tool to use.

Then, to choose which machines we're deploying, to deploy the OS onto them, and to deploy the different applications, we're using Juju. Juju is also called an operator lifecycle manager. Essentially, one of the best aspects of it is that a lot of the day-two operations are built in. Once you deploy applications, you can relate them to each other. So let's say you want to relate your Kubernetes workers to the control plane: it's a simple relation in a model that you define in YAML, so it makes it simple. And when you want to scale your cluster, for example you want to add a Kubernetes node, you can simply do "juju add-unit kubernetes-worker" and it's going to scale for you. You can also scale down, and you can upgrade and update. So that's the main manager that we're using in this cloud.

In terms of observability, it's pretty standard. We're using a bunch of open source projects that I'm sure a lot of you are using as well: Elasticsearch, Grafana, Prometheus. Since the whole deployment is done on Ubuntu servers, we're also using Landscape. That's a tool that lets you see if you have security vulnerabilities in your environment, depending on the versions of the packages that you're using, so we are also using that. And then, for the secrets management tool for Kubernetes, we're using Vault, and since it's deployed on those three infrastructure nodes, it's also HA, and we're using HAProxy in front of it.

So now, for the actual Kubernetes deployment: we decided to put the control plane on three nodes, with standard components. All of these components are operators, or charms, that are deployed with Juju. So, Calico for the networking, and we've got etcd and the kube-api load balancer. A rough sketch of what that deployment model looks like as a Juju bundle is below.
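As a rough illustration of the Juju model just described, here is a minimal bundle sketch. The charm names follow Charmed Kubernetes, but the unit counts, relation endpoints, and omitted options are simplifications for illustration, not the deployment's actual bundle:

```yaml
# Minimal Juju bundle sketch (simplified; endpoint names vary by charm version).
applications:
  kubernetes-control-plane:
    charm: kubernetes-control-plane
    num_units: 3
  kubernetes-worker:
    charm: kubernetes-worker
    num_units: 2
  etcd:
    charm: etcd
    num_units: 3
  calico:
    charm: calico
  kubeapi-load-balancer:
    charm: kubeapi-load-balancer
    num_units: 1
relations:
  - [kubernetes-control-plane, etcd]             # control plane state store
  - [kubernetes-control-plane:cni, calico:cni]   # CNI for the control plane
  - [kubernetes-worker:cni, calico:cni]          # CNI for the workers
  - [kubernetes-worker:kube-control, kubernetes-control-plane:kube-control]
  - [kubernetes-control-plane, kubeapi-load-balancer]
```

Once a model like this is deployed, day-two operations are single commands, for example "juju add-unit kubernetes-worker" to scale out the workers as mentioned above.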
And then for the workers, we had two different sets of workers. We have generic workers, Dell machines, pretty standard and powerful, and then we have specific GPU nodes. So those are NVIDIA DGX A100s that we used in this case, and I'll go a bit more in detail into how we integrated that to provide GPUs for the pods in a few minutes.

And finally, for the storage aspect of the cluster, we used Ceph. In some architectures you may see Ceph deployed hyperconverged with Kubernetes, so you could have your Ceph OSDs directly on each Kubernetes worker. It's not the case in this case: we connected to an external, well, standalone Ceph cluster, separate from the Kubernetes cluster. The main reason for that is that that Ceph cluster is used by a few other clusters in the environment. So, for example, the OpenStack cloud connects to it, and some users can use the file share system from it separately from the K8s cluster. So Ceph provides persistent volumes for K8s: it can provide block storage, we have S3, and also, yeah, CephFS.

Let's go into the different integrations we've done. First, of course, the NVIDIA GPUs. One of the neat things about NVIDIA is that they publish the GPU Operator, which makes it pretty easy to set up. If you install the GPU Operator for your NVIDIA nodes, you don't have to set up your OS in a very specific way; we just had vanilla Ubuntu. Then you have the GPU Operator deployed in Kubernetes, and that's a little bit of what you see with the different pods here. There's a discovery pod that will go and scan your nodes and find which ones have GPUs, and it will install the proper drivers inside of pods, to really simplify the setup overall.

Then, one feature that we used from NVIDIA is the Multi-Instance GPU (MIG) profiles. Essentially, that lets you slice your GPU into smaller instances. One DGX node has eight GPU cards, and they're really powerful, and those nodes are also really expensive, so there's a limit to how many you want to purchase for your cluster. In this case, if you slice your GPUs into smaller slices, you're able to provide more GPUs for different types of workloads that may have different requirements, and it also makes it so that each GPU slice is independent from the others. So in our case, we deployed three nodes with no MIG profile, and then one node with the seven-slices-per-GPU profile. And that's also something that you can change later on, when you see that the need or the request for one type of GPU is greater than the other. So it makes it pretty easy to use. And then, to configure those profiles, it's simply a label on the nodes. Once you label your node with a specific type of profile, the GPU Operator picks up that configuration and configures it for you. So you see a count of 56 here, for the seven slices times the eight cards on the one node that we set up like that.
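To make the label mechanism concrete, here is a minimal sketch of how that looks with the GPU Operator's MIG manager and its mixed MIG strategy. The node name and image are placeholders, and the all-1g.5gb profile assumes 40 GB A100s, where that profile is what yields seven slices per card:

```yaml
# The GPU Operator's MIG manager watches the nvidia.com/mig.config label
# and reconfigures that node's GPUs to match, e.g.:
#   kubectl label node dgx-04 nvidia.com/mig.config=all-1g.5gb
# A pod can then request one MIG slice instead of a whole card:
apiVersion: v1
kind: Pod
metadata:
  name: mig-example
spec:
  containers:
    - name: cuda-check
      image: nvidia/cuda:11.4.3-base-ubuntu20.04   # placeholder CUDA image
      command: ["nvidia-smi"]
      resources:
        limits:
          nvidia.com/mig-1g.5gb: 1   # one 1g.5gb slice of an A100
```

And going back to the storage side for a moment: with the ceph-csi driver, consuming an external Ceph cluster from K8s comes down to a StorageClass plus a PersistentVolumeClaim. A rough sketch, with the cluster ID and pool as placeholders, and the CSI secrets and monitor configuration omitted for brevity:

```yaml
# Hypothetical StorageClass backed by the external Ceph cluster via ceph-csi (RBD).
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: ceph-rbd
provisioner: rbd.csi.ceph.com
parameters:
  clusterID: <ceph-cluster-id>       # placeholder
  pool: kubernetes                   # placeholder pool name
  csi.storage.k8s.io/fstype: ext4
reclaimPolicy: Delete
allowVolumeExpansion: true
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: data
spec:
  accessModes: ["ReadWriteOnce"]
  storageClassName: ceph-rbd
  resources:
    requests:
      storage: 10Gi
```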
On the networking side, we used MetalLB. So, for those of you who are familiar with Kubernetes on public clouds, well, you have access to application load balancers that you don't have to do magic to set up in your environment. But on bare metal, unless you purchase expensive appliances that do, you know, hardware load balancing, you don't have that many options. So MetalLB is a really useful open source project that can simplify your life for load balancing in K8s. Essentially, you give it a set of external or public IPs that you want MetalLB to be able to assign, and if a pod requests a service of type LoadBalancer, it will get an external IP. There are two different modes: layer 2 is preferred, unless you have switches that are able to do BGP. So in our case, we did a layer 2 setup. There's a sketch of that configuration below, at the end of this section.

Next, authentication. You can do, you know, local users for your K8s cluster, you can connect to LDAP, you can connect to OIDC providers. But in the university's environment, SAML is used extensively in a lot of different environments, so we decided to do SAML authentication. It might surprise you that we're using Keystone to make that possible. For those of you who are familiar with OpenStack, Keystone is an OpenStack project, but the way that we developed the operators, you can connect the Keystone operator to the Kubernetes control plane, and it will route its authentication through Keystone. And then we connected Keystone to the SAML backend. So that makes it a lot easier for UAB to manage their users, and you can define the access levels in your K8s cluster through policies and in the SAML backend. The SAML part was probably the trickiest part, a little bit because of the networking: you have to make sure that your pods talk to the right thing, that you have Keystone talking to the SAML backend, and all that. I'm not going to go in detail through this, but I wanted to give a shout-out to Gustavo Sanchez, a guy on my team who helped a lot with making this work.

And then finally, the last integration I want to highlight is the GitLab integration. So we're able to do an integration with GitLab, so that all of the runners from your CI/CD pipelines can run in Kubernetes, and you can save your containers into your GitLab registry. There's a sketch of that below as well.
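For the layer 2 setup just described, here is a minimal MetalLB configuration sketch. This uses the CRD style of newer MetalLB releases (older releases used a ConfigMap instead), and the address range is a placeholder:

```yaml
# Hypothetical MetalLB layer-2 configuration.
apiVersion: metallb.io/v1beta1
kind: IPAddressPool
metadata:
  name: campus-pool
  namespace: metallb-system
spec:
  addresses:
    - 192.0.2.240-192.0.2.250      # placeholder external range
---
apiVersion: metallb.io/v1beta1
kind: L2Advertisement
metadata:
  name: campus-l2
  namespace: metallb-system
spec:
  ipAddressPools: [campus-pool]
---
# Any Service of type LoadBalancer then gets an external IP from the pool.
apiVersion: v1
kind: Service
metadata:
  name: web
spec:
  type: LoadBalancer
  selector:
    app: web
  ports:
    - port: 80
      targetPort: 8080
```

And for the GitLab side, a sketch of what a pipeline job looks like once the runners execute in Kubernetes. This assumes a runner registered with GitLab's Kubernetes executor and privileged Docker-in-Docker enabled; the predefined GitLab CI variables are real, the rest is illustrative:

```yaml
# Hypothetical .gitlab-ci.yml job: build an image on a Kubernetes-hosted
# runner and push it to the project's GitLab container registry.
build-image:
  image: docker:20.10
  services:
    - docker:20.10-dind
  script:
    - docker login -u "$CI_REGISTRY_USER" -p "$CI_REGISTRY_PASSWORD" "$CI_REGISTRY"
    - docker build -t "$CI_REGISTRY_IMAGE:$CI_COMMIT_SHORT_SHA" .
    - docker push "$CI_REGISTRY_IMAGE:$CI_COMMIT_SHORT_SHA"
```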
So, on that, I'm going to leave it to John-Paul to continue with the research aspect of the project.

Thanks, Camille. So, what I thought I'd do is give you a little bit of an overview of the University of Alabama at Birmingham, just to kind of help you understand the scope and the place in which this deployment is occurring. So UAB is a large public institution located in Birmingham. The metro area has a little bit more than a million people in it; that represents about a fifth of the population of the state. We also are the largest employer in the state and have a very large economic impact on the region. So, academically, we have over 20,000 enrolled students, and a good third of those are in the graduate research space, the post-undergraduate space. And we generate about 600 million dollars annually in research funding from the different national funding agencies, heavily the NIH, and the NSF. Research Computing, the group that I'm with, is part of the IT organization at UAB that serves the campus. We have about 200 monthly users, and our researchers represent about 30% of the research revenue at the university. So, if you know anything about sports in Alabama, just to highlight: we're the University of Alabama at Birmingham. Our mascot is a dragon, not an elephant.

So anyway, what I'll do is talk a little bit about why we were interested in K8s to begin with, kind of talk about how it sits in our research infrastructure, and then kind of go over what some of the future use cases are that we see for K8s in the research environment.

A lot of the K8s use cases that we have are what you might consider, you know, standard fare for microservices. We have a self-service application that we're building for users at our university, for our researchers, to be able to kind of manage their lab environments, manage the resources that we provide for them, the IT services, specifically around computation. And we have, you know, an automatic user provisioning workflow, a group management workflow, that we're developing, and we've deployed that as a traditional, kind of message-based application on traditional infrastructure. We're very interested in moving that over to a K8s platform and making that a microservices experience.

We also are working with essentially the leading adopters of K8s on campus. We have a few users that have built their own K8s platform that they're running in the cloud. They're interested in bringing that onto campus, and we have some use cases that are kind of leading research use cases for applications that already exist and are K8s-oriented.

One of the primary spaces that we're working on right now, though, is that CI/CD integration. Basically, as we build these applications, we want to be able to make sure that we don't fall behind in our workloads. One of the problems that we had in our own development was that we would just essentially get a huge merge backlog in our workflow. And so we're able to use the CI/CD pipelines that we're now moving over to K8s to help clear that out, by having nightly builds and moving forward with that.

So the kind of leading edge of the use cases that kind of drove us to saying, hey, we really need to get a K8s capability on campus, is this machine learning ops workflow that's becoming more and more popular out there. One of the characteristics of MLOps, if you know Kubeflow or MLflow or Nextflow, is that machine learning applications tend to want a lot of control over the environment configuration that they have, and they want to be kind of deeply integrated with the workflow of the machine learning pipeline. And so that makes them a hard fit for what might be more traditional compute environments. So we wanted to make sure we had a K8s platform available; that's also why we kind of peppered it with some GPUs along the way.

I think it's also helpful to understand what we mean when we speak in terms of, like, research computing and high-performance computing. We tend to think of our platform as units of high-performance computing, and a high-performance computer in our environment is something that has a considerable amount of RAM; you know, three to a hundred gigabytes is very common, and we have a few that have above a terabyte.
Each node has a couple of CPUs, with, you know, 24 to 64 cores on them, and then we may have accelerators in those nodes. Each of those nodes definitely has a high-speed NIC on it, to let the data move on and off the node itself. And of course, in an HPC context, you think in terms of network speeds. So you want to keep your data close to your CPU at the highest speed, then your internal networks at the next highest speed, and then your data ingestion onto the node at the lowest speed, if you will.

So, we combine those compute resources into clusters. We just kind of buy a bunch of them at a time, we stick them together, we operate them on a connected network, and then we make it possible for them to pass data between each other and reach the global file system. Essentially, the global file system that we run is GPFS; that's what we use on our HPC side. And we have Ceph, as Camille mentioned, for various use cases, for block, object, and also some, essentially, leading-edge file system type solutions that we're exploring.

We combine those into a fabric that we then expose in different compute flavors, and our kind of bread-and-butter compute flavor is the HPC batch compute flavor. We run the Slurm scheduler, if you're familiar with that. What you do with Slurm is you ask for a certain amount of RAM, a certain number of cores and nodes, for a certain period of time, and then Slurm allocates that when it's available and ensures that nobody else is competing for that resource with you.

Typically, in a batch computing environment, you tend to think in terms of terminal-based access, SSH access, and that is certainly true in our case as well. But we've also deployed an Open OnDemand platform, which is a web GUI that you wrap around your cluster, and it provides a really easy-to-use resource for researchers, where they can do pretty much everything that they need to do with an HPC cluster inside their browser, including running, like, MATLAB and other traditional X11 applications in a VNC session inside their desktop. It also provides, essentially, a web proxy capability, so you can tie a web proxy into a job and you can run things that are kind of web-native, like Jupyter notebooks and RStudio and things of that nature. But a lot of our researchers just run their traditional SMP or MPI applications, or even their pleasantly parallel, single-core, independent-data workflows, on the HPC batch side, and like I said, that's our biggest workload engine right now today for research at UAB.

We also have an OpenStack cloud, and we provided that primarily so that we could have an easy-to-use platform when the impedance mismatch with our cluster is high, right? So if you have, for example, a science gateway, that's a very common tool that people want to run: it's some sort of a web application that hides the details of a complex HPC platform behind the back end and exposes to the end user something that they can get started with right away, kind of like our OnDemand platform, but often these are geared towards specific science domains. That lets a new graduate student or new researcher get on board using computation and analysis in their world much more quickly. So you can't readily deploy one of those, or easily deploy one of those, into an HPC batch cluster; there's a lot of things that don't work with that. So that's one of the reasons why we have OpenStack.
We also have it to help with our development workloads; we actually do development cluster builds on that OpenStack platform. And even when you want to use a container model in HPC batch with Singularity, you still need an environment where you can be root to build that container. So, for all of those reasons, we have an OpenStack cloud.

And then, as I mentioned, we were pursuing K8s because we know we need a container-based abstraction and kind of an orchestration workload manager for, you know, next-gen machine learning applications. We also are interested, as I mentioned, in moving our CI/CD workflows on there, and in starting to build more of a composable application environment for users. And as you can see, I mentioned some of the tooling: R Shiny, Streamlit, Snakemake, Nextflow. Those are a lot of the tools that we see as, you know, wanting container back ends, and up to now we didn't have a platform on which to make it easy to just consume the containerized components of those workflows.

So, just as a kind of reference related to Camille's slides, this is where we're currently using MAAS and Juju and charm deployments inside of our environment. We use it for our OpenStack cloud, we use it for our K8s container environment, and we're working actively to migrate our Ceph platform over to that; there's a rough sketch of what that looks like below. It's still going to be an independent Ceph cluster, for the same reasons that we needed to have it independent to begin with, but the provisioning is going to happen through MAAS and Juju. Our HPC batch is still done in a traditional HPC batch way, and the GPFS is an independent file system.
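For a flavor of what that Ceph-on-Juju migration looks like, here is a minimal bundle sketch using the Ceph charms. The unit counts and the OSD device list are placeholders, not UAB's actual configuration:

```yaml
# Hypothetical Juju bundle for a standalone Ceph cluster on MAAS machines.
applications:
  ceph-mon:
    charm: ceph-mon
    num_units: 3
  ceph-osd:
    charm: ceph-osd
    num_units: 3
    options:
      osd-devices: /dev/sdb        # placeholder device list
relations:
  - [ceph-osd:mon, ceph-mon:osd]   # OSDs register with the monitors
```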
Kind of in conclusion, I want to just talk a little bit about the next-gen use case that we really see for K8s, the one that I think is among the more exciting spaces that we can spend our time in over the coming years. As you can see, in the research environment we have a lot of different data sources. We have scanners, we have microscopes. We have telescopes... well, not so much at UAB, we don't use telescopes, we're a medical university, so we tend to think in terms of scanners and microscopes. But you also have other data sources, you know, gene banks and other things like that, that already have existing data sets that come in. And they come into your environment, and ideally those just kind of stream into your environment, and they have the associated workloads kind of just managed, and the customizations that you have to do to your data to publish that data automatically run. And then, on the right-hand side, we have, essentially, our analysts sitting there, composing analytical workflows through the K8s platform. So they can work with things like Jupyter; at the top is an example of a Jupyter application that's loading a lot of data up inside it and presenting, you know, a satellite image of the globe, obviously. But that needs to basically allow the users not just to consume the data coming off the platform, but to package up that computation that they did to produce this new data set back into the platform, so that it can be used by other researchers downstream. So they don't just become consumers of data sets; they also become, kind of instantaneously or immediately, producers of data sets as well. I think that that's a really powerful enabler that we're going to be seeing more of over time as we continue to adopt the K8s platform.

Another use case that I think is kind of critical for the reproducibility of research in higher education is to be able to encapsulate the HPC workloads: the work that they did on our HPC batch computer, to be able to capture that and then be able to reproduce it at some point in the future, and get out the same data that they had when they did their analysis and drew their conclusions on the cluster originally. And obviously, they could just go back and run it on the cluster again, but clusters, even though they're kind of a slow-moving animal, and their OSes are reasonably stable and don't change very often, the OS, the applications, the storage environments, they do change over time. And so, if you want to come back five years later and reproduce a run, it's going to be very difficult to do that. So being able to essentially capture all of that up into a container, that then completely reproduces that batch computing experience that they had, by loading modules and, you know, referencing their different data sources on the disk, would be a very, very helpful reproducibility tool for researchers over time. And that requires that, you know, this is not something that the researchers really want to sit around and design, so you want to make it possible for them to essentially say, okay, now capture this environment that I used and give me a container that allows me to come back and use it again in the future. So this is what I'm kind of terming, like, the next-gen science gateway: this platform where you essentially are able to move across the different areas of your workflow in research, and modify it, and reproduce it effectively.

So with that, I'd like to thank you all for attending our talk. You're welcome to reach out to us individually if you want to know more about either of these efforts, or stop by the booth, or just kind of keep the conversation going here, if you have any questions.

Yeah, thank you for the presentation. We have quite a few minutes for questions. Any questions? Just raise your hand.

Okay, thank you. That was very entertaining. So, on your traditional HPC batch cluster, you're obviously able to use tools like Torque and Slurm to handle resource contention and queuing. Do you already have an infrastructure like that in place for Kubernetes, to handle, say, you know, similar resource contention between researchers? So if one researcher wants all the GPUs,
is there another tool or middleware to kind of handle that, you know, batch queuing, so that not everything is allocated to a single or a specific workload?

Yeah, so that's a really great question, and the quick answer is no, we don't yet, but we know that that's an issue. Obviously, in batch computing, the reason why people do batch computing really kind of gets to two reasons. One is access to bare-metal performance, right? You don't get any kind of interference from abstraction layers. And two is that you can share the resources; batch schedulers are very good at disciplining that over time. Our K8s environment does not have anything to essentially stop a researcher from having all the resources. What I would like to see, and where I'm exploring with my team the capabilities of K8s and HPC to come together, is how we might be able to manifest some of the HPC capacity as a K8s worker platform. So that's one approach that you could potentially use, where you basically say, okay, well, I have this K8s workload, or this K8s demand for resources, and I can schedule that into my batch scheduling environment. Now, that has problems, obviously, for immediate reaction, the way K8s tends to have, you know, immediate responses. So there are going to have to be other ways that we look at that. And I've only really started scratching the surface on how the K8s containers can, you know, schedule, or let's say make reservations for, their cores and RAM. But one of the things that I haven't really come to understand, or come across a solution for, is how to stop them at a particular point in time, right? K8s tends to run forever, and batch tends to run to completion. So, but if you have any suggestions on it, then I'd be happy to follow up with you on that.

Thank you.

Okay, there's another gentleman there.

My suggestion is that you need to have a two-level scheduler. So, in theory that could be the NP-complete problem, but let's go to the next question.

Thank you. My question is regarding CUDA versions. So, we run in the cloud, and I have some ML engineers that want to use CUDA 10 and some that want to use, well, CUDA 11. Right now, I have to spin up two different node groups with manual AMIs; some of them have CUDA 10, some have CUDA 11. Is that a good way to manage that? How would you suggest maybe doing something like that?

Well, I mean, when we have that requirement, we just spin up different environments as well. We have typically been able to move forward with CUDA, so we tend to be on CUDA 11 on our cluster right now, and so we're pretty consistent across CUDA 11. But for something where we would have to go to CUDA 10, that would be a case where we would use our OpenStack environment, to use a GPU from that environment. So every one of our environments has GPUs available to it. But also, one of our kind of visions for starting to leverage MAAS more heavily is to be able to kind of say, well, we need more capacity for this kind of workflow in one environment over the other, and so we can potentially move that compute capacity over for a period of time. But right now, we're kind of in the same space you are. No great solution.
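As an aside on the node-group approach discussed in that answer: on the Kubernetes side, pinning workloads to a CUDA-version-specific node group is typically expressed with node labels and a nodeSelector. A minimal sketch, where the label key and image are hypothetical:

```yaml
# Nodes in each group would be labelled by CUDA version, e.g.:
#   kubectl label node gpu-node-7 example.com/cuda-version=cuda11
apiVersion: v1
kind: Pod
metadata:
  name: train-cuda11
spec:
  nodeSelector:
    example.com/cuda-version: cuda11   # schedules only onto the CUDA 11 group
  containers:
    - name: train
      image: nvidia/cuda:11.4.3-base-ubuntu20.04   # placeholder image
      command: ["nvidia-smi"]
      resources:
        limits:
          nvidia.com/gpu: 1
```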
Thank you. So now, the last question for today.

Thanks very much. Typically, when you are using containers for machine learning projects like this, the models, or the containers, can be very large. To incorporate these things, is there anything in your architecture specifically to handle those types of issues?

Do you mean to ensure that there are enough GPU memory resources for that, is that what you're kind of referring to?

Yeah, memory, or bandwidth on the network, or things like that.

I wouldn't say that there's something specific in the architecture to handle that. We have used nodes in our environment that are generously provisioned, so our A100s are the 40-gig A100s, so they have a lot of RAM on there, and then we have a lot of memory available on the nodes that are the worker nodes. So I guess we're kind of maybe cheating a little bit, in the sense that we know we have that capacity in our environment, and we don't have it, you know, oversubscribed yet, because we don't have a demand that goes beyond it yet. But in my mind, the way we would go is with that additional scheduling model, where we would basically say, okay, well, you need this much RAM, and so we'll just have to go and get that out of a resource pool, either by, you know, physically, well, virtually, moving one compute resource over into another fabric, or by doing one of those secondary scheduling layers, where we could basically bring in Slurm to help us allocate some resources for a particular compute workload.

Okay. So, thank you both for the research and for the Kubernetes presentation, and thank you, everyone, for participating. See you next year.