All right, good morning, everyone. Or afternoon, I guess; no idea what time it is, apparently. Can you all hear me? All right, good. Well, I'm excited to be here in Vancouver. This is actually my first OpenInfra, so I get to kick it off with a talk.

My name's Kevin Hannon. I'm an open source maintainer on a project called Armada, which is a sandbox project under the Cloud Native Computing Foundation, and I'm going to talk about building a research platform using OpenStack and Kubernetes. Now, a little caveat: I will not talk too much about OpenStack; I wouldn't be able to do it justice. Over there is Scott, who will be giving a talk on Thursday at 11 a.m. about how we actually provision our bare metal nodes for Kubernetes using Ironic. I'm excited to attend that talk, and I hope you all go; it's in room 8.15 at 11 a.m. on Thursday.

Without further ado, I wanted to give you a little background about who I am, what interests me, and why I'm here giving a talk. I started my career as a computational scientist, mostly focused on utilizing hardware to accelerate modeling and simulations. In my master's, I was focused on quantum chemistry, trying to explain chemical properties, and that's when I got interested in parallel computing and how you can use the really nice tools out there to speed up simulations. I got away from that for a few years, and then more recently I had a job at the National Institutes of Health, focusing on running scientific workflows on high-performance computing clusters and Kubernetes clusters.

Machine learning is the elephant in the room. If you've ever tried to provide any kind of platform, you know there are machine learning people who want to use whatever cloud tools exist for deep learning, and it's actually kind of difficult to do that on a high-performance computing cluster. It's also kind of difficult to run traditional modeling and simulation software on Kubernetes, when most of that software hasn't even been containerized yet. So there's this weird disconnect in the community. My first entry point into this field was just trying to say: I want to run a workflow, and I want to run it on either a Slurm cluster or a Kubernetes cluster with a similar API for both. And guess what? It doesn't exist, unfortunately. It was a tough task. But I think Kubernetes is a pretty powerful construct for having both: you can run Kubernetes on-prem and in the cloud, and I think there's a lot of work to be done to improve that.

So I'm now working at G-Research in the open-source division, where I'm focused on better enabling batch workloads for researchers on Kubernetes as part of Armada. I'm also a contributor to Kubernetes, so if you're interested in talking about that, please follow up with me.

Who is G-Research? They're a trading company based out of London, with a US location recently opened in Dallas, and we are heavy users of OpenStack and Kubernetes. So if you're interested in job postings, we have quite a few available in the Dallas metro area around OpenStack and infrastructure.
For the purposes of this talk: since G-Research is a trading company, you can imagine we're trying to predict trends in the stock market, and what we provide to quantitative analysts is a really large research cluster for figuring out those trends. What does that really mean? It can mean anything from data engineering to machine learning to trying to run simulations faster. We support a really wide range of use cases; every day when I talk to the researchers, I find out about new things they're doing, so I can't do justice to all the types of workloads we run.

Generally, G-Research is focused on running batch jobs. The definition of a batch job is really a finite-lifetime job: a job that will run to completion. The timelines are actually kind of interesting depending on your domain. In my master's, you could see calculations run for months, and that was considered a batch job. I've seen researchers whose jobs run for years if they don't have enough hardware to speed them up. It's kind of amazing. There are a lot of areas here: computational science, the one I know and love; machine learning, the elephant in the room, especially with large language models; and data engineering. Another area I like to talk about is genomics. The scientific computing community has been a little slower to adopt containers, but I think one shining light is the genomics community, where a lot of tools have been released free and open via Docker, and they're often coupled with workflows. One of their main focuses is reproducible science: being able to continuously run your workflows over time and reproduce the results. So they often have containers and workflows coupled together, and they run large numbers of jobs.

So what are we trying to do? There are really two kinds of parallel computing I like to think about. One is embarrassingly parallel: if you can split your task into a series of well-defined inputs and you require no communication, it's really easy to run on any kind of cluster. Kubernetes supported this essentially out of the box; day one, or maybe not quite day one, but very close to release, it had a simple Job API for representing batch jobs. What gets a lot harder is when you actually need to use all of the hardware on your entire node, or more. There are really two reasons people want parallel computing: one, they want to speed up their simulations and run faster; two, they want to expand the science or research they're doing by leveraging more compute to answer harder problems. Generally that means either you're using hardware to accelerate a problem, or you're splitting your problem across different nodes so you can actually run your simulation at all. And that gets really tricky, because you need a scheduler that can grab the nodes you need, and then communication between those nodes. That's usually done through a lower-level layer called the Message Passing Interface, or MPI.
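To make that concrete, here is a minimal sketch of MPI-style communication using mpi4py. This is my illustration, not code from the talk, and it assumes the mpi4py package plus an MPI runtime such as Open MPI are installed.

```python
# Minimal MPI sketch with mpi4py (illustrative).
# Launch with, e.g.: mpirun -n 4 python mpi_hello.py
from mpi4py import MPI

comm = MPI.COMM_WORLD   # communicator spanning all launched ranks
rank = comm.Get_rank()  # this process's ID within the communicator
size = comm.Get_size()  # total number of ranks

# Each rank computes a partial result; rank 0 gathers and sums them.
partial = rank * rank
total = comm.reduce(partial, op=MPI.SUM, root=0)

if rank == 0:
    print(f"sum of squares across {size} ranks: {total}")
```

The point is just that ranks are separate processes, possibly on separate nodes, and all coordination happens through explicit messages; this is the layer a scheduler has to gang-schedule around.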
There's also a lot of GPU-to-GPU communication that can happen in frameworks like PyTorch, but that's usually hidden under the hood. Generally, this is the harder problem to solve.

So what are our requirements? As I said, at G-Research we run large numbers of batch jobs; we're usually on the order of a couple of million jobs per day, and this varies widely with the types of jobs. Our target is around 100,000 nodes and a million cores, and we provide a lot of GPU accelerators for our quantitative analysts to speed up their simulations.

I think it helps to give a little context about where G-Research was and why we built what we did. We were already running a lot of workflows that were batch jobs, and we were using HTCondor, a scheduler popular in the grid computing space, primarily targeting Windows. Then we said, hey, Linux is great, let's migrate, and of course a lot of pain ensued. But we're closer now: I'd say most of our workflows are container-first, we run on Linux, and we're using Kubernetes for our container orchestration.

Unfortunately, there were some problems with that, and this was three years ago. If you try to run millions of jobs on a single Kubernetes cluster, you're going to be in for a bad day. A lot of things can happen: we've seen cases where the Kubernetes cluster goes down, and cases where the API server gets so slow that it loses jobs. Kubernetes is just not really set up to run huge numbers of jobs. So you need some kind of queueing and scheduling system in a layer above Kubernetes if you're using on-prem clusters like we are, because you have researchers who want to run more compute than you have nodes available. Maybe everyone has that problem, but that's one thing we've noticed: we give them more compute, they give us more jobs.

So what I like to think we're providing for our quantitative analysts is a serverless platform. Most of our quants don't know anything about Kubernetes; they barely know anything about containers. They just say: here's my task, I have a build, it's now a container, now I want to run it. They don't want to know about the hardware or any of that, which of course makes our job harder, because we have to implement all of it so it isn't something for them to worry about. We provide a UI for our researchers to interact with, so they can see what's running, tail logs, and track the status of their jobs without having to use Kubernetes directly.

Why do we even need Kubernetes? There are generally two avenues of thought for running these types of workloads: you can go all-in on Kubernetes, or you can look back at other options. One area I'm familiar with is high-performance computing, so a little audience interaction: how many of you have ever submitted a job to an HPC cluster? OK, about half the room. Generally, users have to SSH onto a login node and interface with the scheduler there.
The scheduler has hooks for saying how many nodes and how many resources you want, and if there isn't enough capacity, your job queues. That's great, and it works pretty well until you start trying to do things like cloud bursting, or you have a researcher who wants to submit a job to a Slurm cluster from a Jupyter notebook on their laptop. That gets much harder. This is where it starts to show that you might need an API for your cluster that users can interact with, for the people who say: I don't want to SSH into this thing, I just want to use the cluster. A lot of the ML and data science community is pushing this idea of: here's a Jupyter notebook, I don't care where it runs, just run it; give me all the GPUs I need, find them somewhere. And unfortunately, that makes the traditional HPC-style workflow a little trickier.

So now you have Kubernetes. It's great, right? How many of you have ever submitted a workload to Kubernetes? OK, so half and half. One of the nice things about Kubernetes is that once users have a kubeconfig that defines their cluster and they have a CLI, they're pretty much good to go, as long as they know about Kubernetes. I know that's a heavy caveat, but it is what it is. It's a pretty nice pattern for programmatic access. I remember first getting Kubernetes workflows running; once everything was set up, it was actually pretty easy to get things going. You have a lot of tools, and if you want to do multi-cluster, there are options for that too.

Unfortunately, it's not perfect. I would not be up here, or have a job, if I could say Kubernetes is perfect and I don't need to work anymore; it'd be a great day. Like I said, 100,000 nodes is our target, but it's actually difficult to do even 10,000 nodes with Kubernetes, so what are we doing? Running a lot of nodes with Kubernetes is genuinely hard. There are some good posts from OpenAI about everything you need to consider when running at that scale; they have one post about 2,500 nodes and a later one about 7,500. There is just a lot of engineering effort required to make that reliable and scalable, and if you're running your own infrastructure like we are, that's a lot of work for your engineers. So we cap our Kubernetes clusters at around 1,000 nodes.

Now you're asking: that many nodes, at 1,000 nodes per cluster, how are you doing that? We use multiple Kubernetes clusters. When I joined and started working on Armada, I thought, what the heck is multi-cluster Kubernetes? I thought it was kind of a neat thing, so I'm going to switch to a little demo. Oops, wrong slide. What I'm switching to here is just a terminal. I'm using kind, Kubernetes in Docker, which is a simple way of provisioning very small Kubernetes clusters. I have two clusters, an Armada test one and a demo one, and if I want to switch between them, I can just switch between the two contexts. I think that's actually a pretty powerful way to do multi-cluster. I won't submit anything to them here.
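For a programmatic flavor of the same idea, here is a small sketch using the official kubernetes Python client to target each cluster by kubeconfig context. This is my illustration rather than the demo's code, and the context names armada-test and demo are just stand-ins for the two clusters shown.

```python
# Sketch: talking to multiple clusters by switching kubeconfig contexts.
# Assumes the official `kubernetes` package and a kubeconfig defining
# contexts named "armada-test" and "demo" (illustrative names).
from kubernetes import client, config

for context in ("armada-test", "demo"):
    # Build an API client bound to one specific context in ~/.kube/config.
    api_client = config.new_client_from_config(context=context)
    core_v1 = client.CoreV1Api(api_client=api_client)

    # List the nodes in that cluster, just to show which one we reached.
    nodes = core_v1.list_node()
    print(context, "->", [n.metadata.name for n in nodes.items])
```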
The same goes for submitting work: you can take a YAML file and tell kubectl which context, and therefore which cluster, to target, and that's a pretty easy way to submit things programmatically. So that's multi-cluster Kubernetes, and it's great. One of the problems, though, is a lot of implementation leakage. If you go this route, your users need to know which clusters are available. What if you change the name of a cluster? What if they have a script they want to run every day, and it breaks? You're going to get a call at three in the morning; maybe you won't be doing a maintenance window at three in the morning, but maybe a cluster goes down and the script stops working. What we wanted was a way for our quants to interact with a stable API, while we do our maintenance and rebuild nodes on our Kubernetes clusters without taking down the entire farm and without disrupting our researchers.

So we created an open source project called Armada. At a high level, Armada is a hub-and-spoke architecture. The server is what users submit their jobs to, and the UI queries the server to get the state of the world. The bread and butter of execution lives in the Kubernetes worker clusters, in components called executors. These are the things that actually talk to the Kubernetes API to submit pods, doing the kubectl-create kind of work. Our executors are closely coupled to Kubernetes: we use the Kubernetes scheduler, and we generally stay close to upstream, because I like to say that if you fight the platform, the platform is going to win. If you drift away from it, upgrades get harder and harder; maybe after a couple of upgrades you'll be fine, but after a couple of years you'll be stuck on an old version, and no one wants that. Engineers like new and shiny things, and so do researchers.

We have an API, and we provide a series of clients. The API is defined in protocol buffers, so it's easy to generate other clients. We're a big .NET shop, so we have a .NET client, plus a Python client, a Go client, and a CLI. I'm happy to say we have our first Google Summer of Code student this summer, whose project is making the CLI available as a kubectl plugin; we're excited to have him working on that. The API itself is pretty simple. You create queues; I haven't said much about queues yet, but they're how we express how many resources a user is allowed to use, and you can easily create and delete them. And then you submit jobs.

So I have this Python client; what can I use it for? Oh, sorry, I got ahead of myself. First, what is the unit of work? We have a custom job spec plus a series of metadata, starting with the queue, which carries the role-based access control and resource limits for the job.
We also have this thing called a job set. Our workloads are usually coupled with a workflow engine, and typically our users submit all their related jobs under one job set so they can view the status of all of them without having to poll each individual job. And then we have a pod spec, which is an actual Kubernetes pod spec. For those of you familiar with Kubernetes, it's similar to the pod template in Deployments or StatefulSets. What's powerful about using the Kubernetes API is that we inherit a lot of the things people like to use, such as init containers (which bring challenges of their own) and a list of containers. And if your users want to target things like GPUs, or start using fancier Kubernetes features, we inherit that through our API too. For those not familiar, taints and tolerations are the way to target a GPU node from a pod spec; that's how users can say, I want to submit a pod and have it run on a GPU.

Our job spec enables more than that, though I won't cover it much in this talk. You have a pod spec, but say you want something like a Jupyter notebook. That's not just a container; you also need a Service and an Ingress. We provide that functionality: when users submit, Armada will also create the Service and Ingress, so users can interact with the notebook via a stable URL for the lifetime of their job.

Some use cases we have for Armada: one of the top ones is workflow engines. We use a lot of them at G-Research, and some, unfortunately, are in-house. I'm happy to say we were able to convince people to look at open source solutions, so we've been running Apache Airflow. For those not familiar, Airflow really pioneered the idea of workflows as code. A lot of people say the power of Kubernetes is its extensibility; I think that's true of Airflow too, with hundreds of providers you can use. The main gotcha with Airflow, which I think traps a lot of people, is that you really do not want to be doing compute in Airflow itself. You want to run your compute in a compute environment, in our case Armada or Kubernetes, and let Airflow do the orchestration. So we wanted to find ways to use Armada from Airflow: have Airflow orchestrate the workflow, and have Armada run the individual pods. We created an Armada Airflow operator using our Python client, and a pretty simple example script that runs a hello world followed by Armada running a simple job, as sequential tasks, one after another.

So I'll switch to two demos. The first thing I'll show is my Python client. Okay. Oops, not that. All right, you can see here a pretty simple pod spec. I'm running a container that just sleeps, because I don't trust myself to do anything more complicated during a demo. It's a simple Ubuntu container that sleeps; my script creates a job request and gives it a priority, and it also creates a queue and a job set.
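For readers following along, here is a minimal sketch of what such a submission script looks like. It is reconstructed from the demo description, and the armada_client import paths, class names, and signatures below are illustrative assumptions, not verbatim API; check the project's Python client docs for the real interface.

```python
# Sketch of an Armada submission script (names and signatures are
# illustrative assumptions reconstructed from the demo, not verbatim API).
import grpc
from armada_client.client import ArmadaClient  # assumed import path
from armada_client.k8s.io.api.core.v1 import generated_pb2 as core_v1

# Plain-text gRPC channel to a hypothetical local Armada server.
channel = grpc.insecure_channel("localhost:50051")
client = ArmadaClient(channel)

# A pod spec that just sleeps -- nothing risky for a live demo.
pod = core_v1.PodSpec(
    containers=[
        core_v1.Container(
            name="sleep",
            image="ubuntu:22.04",
            command=["sleep"],
            args=["60"],
        )
    ]
)

# Queue = unit of quota and RBAC; job set = label grouping related jobs.
client.create_queue(
    client.create_queue_request(name="demo-queue", priority_factor=1.0)
)
job_item = client.create_job_request_item(priority=1, pod_spec=pod)
client.submit_jobs(
    queue="demo-queue",
    job_set_id="demo-job-set",
    job_request_items=[job_item],
)
```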
Underneath, a gRPC channel hooks this up to a demo server that is available for anyone to check out, free and open to use. What I'm doing now is running this script, which will submit 1,000 jobs, and then I'll switch to my webpage. This is our website, armadaproject.io. We have a nice little demo page running an EKS cluster for Armada. You can click through to the overview page in our UI, which shows the queue, called openinfra, with 129 jobs queued. I'll switch to the jobs panel to walk through this more easily.

So, what does queued mean? Queued means I don't have enough capacity on my worker clusters to take the job, so it's waiting on a message queue for capacity; there are a lot of jobs queued because the cluster running this isn't very large. Pending means the job is in Kubernetes: it's been placed, it's pulling the container down, all of that. Running means the job is actually running, and succeeded means it finished. What's nice about this UI is that you can drill down into individual jobs and see more information about each one. By default we add things like IDs and queues, and this is where you find information like which cluster the job actually ran on. You can also look at the job YAML and see exactly what the user submitted. So that's that part of the demo.

The last thing I want to show, in the final minutes, is this little video of the Airflow integration. Airflow has its own UI for representing workflows. These are a series of example DAGs, and if you scroll down to the bottom you'll see Hello Armada, the DAG I created for this demo. I can view the code, and it shows that I'm using our Armada operator: it defines an Airflow workflow, submits a simple sleep job, and runs it as a DAG called Hello Armada. It's the same simple job I already showed you (there's a hedged sketch of what that DAG looks like just after this wrap-up). I can manually trigger the DAG, which just means run it now; usually Airflow runs things on a schedule, that's Airflow's bread and butter, running on a recurring basis. You can use Airflow to drill down into individual tasks, and then we leverage our own UI for more information about the progress of the job, like its state in Kubernetes. That way we can use Armada and Airflow together. If you click on the task and scroll down to the bottom, you'll see my Hello Armada DAG, exactly what Airflow submitted, and we tag the job so we know it was submitted by Airflow.

Not too much left. All right, oh, come on. So, the last piece. If you're interested in checking out the project, as an open source project we also have swag: socks, nice argyle socks. Please, I don't want to carry them home, so if anyone wants them, let me know.
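As promised, here is a hedged sketch of what a Hello Armada style DAG might look like. The ArmadaOperator import path and parameter names are assumptions for illustration, not the operator's verbatim interface, and make_sleep_job_request is a hypothetical helper standing in for the job request built in the earlier client sketch.

```python
# Sketch of an Airflow DAG: a hello-world task, then an Armada job.
# The ArmadaOperator import path and parameters are illustrative
# assumptions; consult the armada-airflow package for the real interface.
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator
from armada.operators.armada import ArmadaOperator  # assumed path


def make_sleep_job_request():
    """Hypothetical helper: build the sleep job request item, as in the
    earlier armada_client sketch."""
    ...


with DAG(
    dag_id="hello_armada",
    start_date=datetime(2023, 1, 1),
    schedule=None,   # triggered manually, as in the demo
    catchup=False,
) as dag:
    hello = BashOperator(task_id="hello_world", bash_command="echo hello")

    sleep_job = ArmadaOperator(
        task_id="armada_sleep",
        armada_queue="demo-queue",             # assumed parameter name
        job_request=make_sleep_job_request(),  # hypothetical helper
    )

    # Sequential tasks: hello world first, then the Armada job.
    hello >> sleep_job
```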
Generally, on the left-hand side there's a little contribute page showing how you can get involved, and it links to a Slack channel you can join, #armada in the CNCF Slack. That's a nice way to engage with the community, and we are actively growing. We have QR codes for our code and for our website, armadaproject.io; I love that Google puts cute little dinosaurs in these QR codes, it's my favorite. But anyway, that's pretty much it. If anyone has any questions, please feel free. And do go to Scott's talk, at 11 a.m. on Thursday; that's a good way to understand how G-Research is using OpenStack.

Audience: A very quick question. You said that HPC schedulers, like Slurm or PBS, are scalable but not so user-friendly, and that Kubernetes is more user-friendly but not so scalable, yet you're building on Kubernetes. Are there any plans to also tackle the HPC schedulers and enable their use?

Sorry, let me repeat the question. You're asking whether there is work in the HPC community to make it more user-friendly?

Audience: More whether there's a way for Armada to also support the HPC world.

Oh, that's a good question. We've actually been talking a lot lately with some of the traditional HPC scheduler communities, especially Slurm, about ways to mix and match different schedulers. I believe pretty strongly that every tool has its use case, and it's important to leverage all the tools you can. So we do have interest in looking into that, but it's really quite difficult. There's a lot of work being done in Kubernetes to bring HPC-like syntax to Kubernetes; I don't see as much going the other way, saying I want Kubernetes to run Slurm or vice versa, though some people are starting to look into that now. I think part of the challenge is the amount of legacy around the HPC schedulers: they're adding container support while also trying to add Kubernetes, which sits a layer above. I hope that answers your question.

Audience: Hi, I work at Data Machines. How do you deal with the scheduling overhead and latency as you scale up the number of jobs, and specifically the startup latency of these jobs? Does that become prohibitive when you're still using the Kubernetes scheduler as the job submission mechanism?

A big part of what we run into depends on the size of your containers, and there's a very wide range. Some of our users working with machine learning frameworks in .NET have containers that get really fat; we've seen containers that are gigabytes in size, and no matter what we try, pulling a container over the network can take a long time. So we build in some caching layers to make that faster, because our quants don't like being stuck more than a few minutes in that pending state, which is the Kubernetes state for saying: I'm pulling images, the kubelet is starting things up, all of that.
The job isn't running yet, but it's trying to get the container up. Those are the main problems. There's also, in Armada, the layer of sending our pods to different clusters, but that's usually at a much smaller scale than pulling images from our internal Docker registry.

Audience: Do you have something like a DaemonSet that can pre-stage images or do data scheduling for you, to speed that up?

I actually don't know the answer to that, but it's a good question; we can follow up afterwards if you really want to know.

The next question was how queues are different from namespaces. Generally, they're very close. We have some internal tooling where creating a queue actually creates a namespace with the right resources. We have some interesting usage patterns at G-Research where each quantitative analyst has their own queue and their own namespace, usually one-to-one, but you could imagine a world where a team shares a namespace or a queue, a more many-to-one kind of relationship. They're usually pretty close, though.

Audience: And the second part: would you advise running jobs with Armada on a cluster that is already running other typical production Kubernetes deployments?

Yeah, I would. It depends on your infrastructure, but I don't think there's too much overhead in running them together. We typically have our worker clusters fully dedicated to Armada, but I have seen cases internally where we share our hub cluster, the server, with other workloads. Since this is a multi-cluster Kubernetes setup, some parts may share their clusters with other workloads. Thanks. Of course.

All right, well, thank you all. And I promised socks, so if anyone wants socks, let me know; I told you, I have a lot.