Good afternoon, thanks for hanging in there. It's getting pretty close to the end of the day. I'm Anne Holler, and I'm happy to be here, with Travis joining via an MP4 that he recorded earlier, to present efficient deep learning training with Ludwig AutoML, Ray, and nodeless Kubernetes.

I wanted to start off with a shout-out to several recent articles that contributed material to this presentation, and to the people from the Ludwig, Ray, and Elotl communities who contributed to that material. The first is a recent CNCF blog from February on managing public cloud resources for deep learning training. The second is a Medium blog from that same month on Ludwig AutoML for deep learning, focused on tabular datasets. And the third is our presentation from Cloud Native Rejekts this past fall, where we created a POC for running Ray on public cloud Kubernetes. So without further ado, let's get on with it.

Deep learning has been applied to many fields, but it's well known to be complex to take from planning to development to production. The Ray and Ludwig open source systems greatly reduce the complexity barriers to training, scaling, deploying, and serving deep learning models. However, even when complexity barriers are reduced, the cost and operational overhead of deep learning present significant challenges. Deep learning intermittently needs substantial GPU resources. Public cloud vendors are perfectly happy to provide those, but at non-trivial prices. So managing GPU resources and operational overhead is critical to the practical use of deep learning.

Elotl's nodeless Kubernetes, nicknamed Luna, commoditizes compute for Kubernetes clusters. It provisions just-in-time, right-sized, cost-effective compute for Kubernetes applications when they start and removes those resources from the Kubernetes cluster when they end. Its purpose is to manage public cloud resources judiciously.

Bringing all of this together, this talk is about running Ray and Ludwig on cloud Kubernetes clusters using Luna as a smart cluster provisioner. We'll look at experiments using Ludwig AutoML deep learning training as the workload, and they show sizable improvements in efficiency and usability versus the way I was running Ludwig AutoML deep learning training before setting it up this way. Compared to my prior setup, elapsed time was decreased by 61%, computing cost by 54%, and idle Ray cluster cost by 66%. It also lowered my operational complexity and retained the AutoML performance results. With that, let me now turn it over to Travis.

Hi, everyone. Thanks for coming to our talk today. My name is Travis Addair. I'm the CTO of a company called Predibase, building an enterprise low-code machine learning platform on top of Ludwig. Today I'd like to tell you a little bit about the background behind the Ludwig project and how AutoML fits into the vision of what we're doing with the open source Ludwig project.

To start, I want to present the background on why we believe Ludwig is a valuable addition to the ML ecosystem. Our observation is that if you look at the way ML is done in industry today, there are essentially two incomplete options available to companies and organizations that want to operationalize ML. On the one hand, you have low-level APIs like TensorFlow and PyTorch that provide a great deal of flexibility. On the other hand, you have traditional AutoML systems that provide a lot of simplicity.
But neither of them ends up being ideal, because oftentimes the low-level APIs are difficult to get into production for non-expert users, while the AutoML systems end up being black boxes that you end up graduating out of, because they don't always solve the problem the first time around. So when we look at what we're doing with Ludwig, the core insight is that we believe there is a third option that needs to be explored, which is what we call declarative machine learning systems. With declarative, what we intend to do is provide a higher level of abstraction that offers the simplicity and automation of AutoML while still giving you the flexibility of lower-level tools like PyTorch, opening the door for non-experts to harness the power of ML without needing to resort to these more granular tools.

The way that Ludwig makes this declarative vision possible is similar to systems that provide infrastructure as code, which I'm sure people in the Kubernetes community are very familiar with. We provide YAML configurations that declaratively define the models you might wish to train. For example, it's very easy to get started in Ludwig: you just write a YAML config saying what your input features and their types are and what your output features and their types are, and everything else gets filled in automatically on your behalf. But at the same time, we provide a lot of expert-level control as well. If you want to use a specific type of model architecture to encode a particular feature, or a particular learning rate or regularization or dropout, all of those options are available to you, as are more advanced features like hyperparameter search over any of the parameters in the config.

What makes this all possible is the Ludwig architecture. Every input feature and output feature in your dataset passes through an architecture we call ECD, for encoder-combiner-decoder. Every feature is preprocessed according to preprocessing rules that you can configure in the YAML config and then encoded into a vector by an encoder, which can be a pre-trained machine learning model or one learned from scratch. All of the different features are combined in an embedding space, and the individual output features are then passed through a very similar decoding step where we get the final prediction. The benefit of this architecture is that it provides a great deal of task flexibility without a lot of additional complexity. If you want to do a regression problem, you can have any types of inputs and then just specify a numerical output. If you want to do speaker verification, you can have two different audio inputs and a binary output that tells you whether or not the audio streams are, for example, from the same speaker. And any number of other problems, including text, image, forecasting, and tabular data problems, are all possible with Ludwig.
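To make that declarative idea concrete, here is a minimal sketch of a Ludwig config and training call using the Python API with the config written inline. The feature names and dataset path are made up for illustration, and the exact config keys can vary across Ludwig versions (for example, older releases used a "training" section instead of "trainer").

```python
# Minimal sketch: declare input/output features and let Ludwig fill in
# the rest.  Feature names and the dataset path are illustrative.
from ludwig.api import LudwigModel

config = {
    "input_features": [
        {"name": "utterance", "type": "text"},      # encoded into a vector
        {"name": "channel", "type": "category"},
    ],
    "output_features": [
        {"name": "intent", "type": "category"},     # decoded into a prediction
    ],
    # Optional expert-level overrides; anything omitted gets a default.
    "trainer": {"learning_rate": 0.001, "epochs": 10},
}

model = LudwigModel(config)
results = model.train(dataset="train.csv")
```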
Another core component of Ludwig is scalability. Because we integrate heavily with Kubernetes, we also integrate heavily with other distributed systems that are built on top of Kubernetes, like Ray. All of the preprocessing can be distributed across a cluster of pods using Dask on Ray, and our training system uses a framework called Horovod that allows you to distribute training across multiple nodes and multiple GPUs. Model artifacts can then all be uploaded to a registry like MLflow, which we support integration with out of the box. Hyperparameter search is very similar and, again, very modular. We use Ray Tune, which sits at a level above the training process and can perturb different parts of the config. Every one of those config variants then becomes its own trial that goes through the same preprocessing, training, and evaluation steps as any other training process in Ludwig. At the end of the day, you get all of the different model trials that were explored and can choose the one you would like to use in production.

When we started to look at building an AutoML layer on top of this, our goal was for it to be a glass box, not a black box. One thing that is very nice about the AutoML system in Ludwig is that you can see it as a co-pilot that helps you generate an ideal Ludwig config for your dataset. You can start by saying something as simple as: suggest a config for my dataset, which can be a data frame or a CSV file or whatever, and I want to predict this particular column, in this case intent. It then gives you a config that you can do whatever you want with, modifying anything to your heart's content. How this works under the hood is that you provide those two parameters plus an optional time budget, and then Ludwig AutoML will infer the input and output feature types, choose the appropriate model architecture based on your task, select the parameters and hyperparameter ranges it wants to explore given the time and resource constraints, and then launch the hyperparameter search trials on Ray Tune using your GPU workers. The output will be the best tuned model along with the other models that were explored, and you can then take those results and deploy them into production as well.
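As a sketch of the AutoML entry points just described, here is roughly what the calls look like in Python. The module path and argument names follow the Ludwig 0.4/0.5-era ludwig.automl API and may differ in other releases; the dataset and target column are illustrative.

```python
# Sketch of Ludwig AutoML usage (argument names may vary by version).
import pandas as pd
from ludwig.automl import auto_train, create_auto_config

df = pd.read_csv("intents.csv")   # illustrative dataset

# Glass-box mode: just get the suggested config and edit it as you like.
suggested_config = create_auto_config(dataset=df, target="intent", time_limit_s=3600)

# Full AutoML loop: infer feature types, pick a model architecture, and
# run the Ray Tune hyperparameter search within the one-hour budget.
results = auto_train(dataset=df, target="intent", time_limit_s=3600)
best_model = results.best_model   # assumed attribute; the best trial's model
```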
Before I hand it back to Ann, the important thing I want to emphasize is that there's more to the story than just the Ludwig AutoML side, because when you're running this in production, or in a large distributed setting, there's also the question of how to run the process efficiently, optimizing the usage of commodity resources like GPUs so that you're using them judiciously and not wasting resources. And this is where Luna comes into the picture, particularly for running Kubernetes workloads. So now I'd like to hand it back to Ann to tell you more about Luna and the work she's done on combining AutoML and these other systems together.

Sorry about that. So Luna is a smart cluster provisioner that runs in standard Kubernetes clusters. It monitors for pending pod creation requests and adds additional compute to the Kubernetes cluster to satisfy those requests. That compute can be in the form of on-demand or spot VMs, or it can be in the form of serverless compute like AWS Fargate. Luna chooses the compute based on the current availability of that kind of compute, its cost, and other user-specific requirements: you may want a GPU, you may prefer not to have a certain kind of GPU, things like that. And on an ongoing basis, Luna monitors node usage in the cluster and will remove compute from the Kubernetes cluster when it's no longer needed.

So Luna is comparable to the Kubernetes cluster autoscaler, but it provides more flexible node selection without the need to create and maintain what can be hundreds of node groups to represent all the instance types. And it's similar to AWS Karpenter, but it works across cloud vendors, provides instance family exclusion, and supports deterministic application of rules.

With all that background, let's go into what we did to look at the different ways we could run Ludwig AutoML, how much they would cost, and how easy they would be. Just some background: the Ludwig AutoML heuristics we developed for tabular datasets were developed after analyzing thousands of hours of model training across twelve tabular datasets. After we had those heuristics, we ran them on that training set of twelve datasets to make sure they could produce good models in a short period of time, like an hour, rather than thousands of hours. And then we said, okay, that's running on the training set; now let's take an additional nine validation datasets we've never seen before, also tabular, run Ludwig AutoML on them, and compare the resulting models to highly tuned, publicly reported models. I did that on those nine datasets. The workload we'll look at here is three of those datasets running for one-hour, two-hour, and four-hour Ray Tune time budgets, both the way I originally ran them and the way I would run them on Kubernetes with Luna available.

In the remainder of this talk, we'll look at the baseline configuration and two other configurations, first at a high level and then in a deep dive on each one. The baseline is how I ran things originally when I did the original validation of AutoML on the nine datasets. I had a three-node Ray cluster, deployed directly on AWS VMs. All three of the nodes were GPU-enabled, meaning the Ray head itself, as well as the two workers, could run part of the workload during the Ray Tune process. I ran them on NVIDIA T4 GPU VMs, a kind of commodity GPU system that did a good job for these workloads.

The first alternative to that baseline configuration is, instead of deploying Ray directly onto VMs, to deploy Ray into a Kubernetes cluster that has Luna installed and to enable the Ray autoscaler. Ray is deployed with a GPU-enabled head and allowed to scale up to eight workers instead of just two; we'll talk about why in a minute. What happens here is that when the Ray autoscaler realizes more workers are needed, it asks for the workers not in terms of an instance type but in terms of the amount of resources that are needed. When Luna sees those resource requests, those pending pod requests that can't be satisfied, it goes out, finds an available instance type, and puts it into the Kubernetes cluster on demand. Later, when the Ray autoscaler no longer needs a worker, it gets rid of that worker, and Luna sees that the node is no longer needed and pulls it out of the Kubernetes cluster. In fact, in this case, even when the Ray head is originally deployed, Luna is the one that chooses a GPU-enabled node and puts it into the cluster.
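As an aside, here is a rough sketch of how that resource-based demand looks from the Ray side. Ludwig and Ray Tune normally generate this demand implicitly from pending trials; the explicit request_resources call and the bundle sizes below are purely illustrative and follow the Ray 1.x autoscaler SDK.

```python
# Illustrative only: expressing demand to the Ray autoscaler as resource
# amounts rather than instance types.  On Kubernetes this demand surfaces
# as pending worker pods, which a provisioner like Luna satisfies by
# adding a suitable node, removing it again once the demand goes away.
import ray
from ray.autoscaler.sdk import request_resources

ray.init(address="auto")  # connect to the running Ray cluster

# Ask the autoscaler for enough capacity to run three GPU trial workers.
request_resources(bundles=[{"CPU": 4, "GPU": 1}] * 3)
```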
Alternative two is just like alternative one, with one change: the head of the Ray cluster is CPU-only. This is good because GPUs are expensive, and in this case you can run an idle Ray cluster for less money.

All right, so how did I choose the baseline? The basic principle behind running the original experiments in this configuration, a fixed-size three-node Ray cluster with T4 GPU instances, was that I wanted a standardized amount of compute behind the AutoML time budget. If I say I'm going to run AutoML for one hour, or two hours, or four hours, there needs to be a standard amount of compute behind that time budget, and I cared a lot about that. I also cared a lot about operational complexity. I didn't want to have to reason about whether my experiment was good or not; I wanted confidence that all the compute was available for the entire time budget and that what I was getting was a legitimate result. And my final concern was controlling idle cost. When an experiment finished, I would log into the Ray head, make sure I didn't see anything bogus about the experiment, poke around, and record things. So I was pretty sensitive to idle cost, because I might leave the Ray cluster running for a non-trivial amount of time after the experiment was over.

G4dn instances have T4 GPUs, and I could have used any of the G4dn variants, because for this workload it's all about the GPU and the GPU memory. But I chose 4xlarge, a little bit spendy, because when I tried to get cheaper instances in my region they were often not available, and that meant operational complexity for me in repeatedly trying to redeploy the Ray cluster. So that was my choice. I knew three nodes would do a good job running the 10 hyperparameter search trials, which is the default number of search trials run by AutoML in Ludwig, in a reasonable amount of time. I used non-spot instances for the same reason I used a fixed-size cluster: I didn't want anything to go away during the run. And I ran a single job at a time, rather than using a six- or nine-node cluster, because of that idle-cost issue.

So, of course, the baseline for these three workloads running for the three time budgets matched our expectations for tuned accuracy of the models versus manually tuned models. The elapsed time for this run was 22.6 hours. You might be saying, well, why wasn't it 21 hours, that is, three times one plus three times two plus three times four? The extra 1.6 hours were for parts of the job that run only on the head: the data is loaded and preprocessed on the head, and when the auto-tuning is complete, the head also runs the evaluation of the best model from each trial. As for the cost of this workload, g4dn.4xlarge instances cost $1.204 an hour, so the overall workload cost was $81.631, and the idle cost, of course, is $3.612 an hour.
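For reference, here is the arithmetic behind those baseline figures as a small Python snippet; the prices and hours are the ones just quoted.

```python
# Baseline cost arithmetic: 3 always-on g4dn.4xlarge nodes at $1.204/hr
# over 22.6 elapsed hours (3x1h + 3x2h + 3x4h jobs plus ~1.6h of
# head-only preprocessing and evaluation).
nodes = 3
price_per_node_hr = 1.204           # USD/hr, on-demand g4dn.4xlarge
elapsed_hr = 22.6

workload_cost = nodes * price_per_node_hr * elapsed_hr
idle_cost_per_hr = nodes * price_per_node_hr

print(f"workload cost: ${workload_cost:.3f}")    # -> $81.631
print(f"idle cost: ${idle_cost_per_hr:.3f}/hr")  # -> $3.612/hr
```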
Some observations about these baseline runs. It would be nice to get the results quicker than 22.6 hours, and this is just three of the nine datasets. The obvious way to do that would be to run more than one job at a time: run the three one-hour jobs in parallel, the three two-hour jobs in parallel, and so on. Of course, I didn't do that originally because I was worried that the Ray autoscaler wouldn't be able to obtain the instance types it needed when it needed them. But that's where Luna comes in. If you combine the Ray autoscaler with Luna, the Ray autoscaler asks for resources and Luna satisfies them.

So basically that's the magic that reduces idle cost, because now I don't need all three nodes running at the end, and that reduces the elapsed time. But you might be thinking, okay, that's fine, but what about workload cost? Can you really reduce that? Sure, you could get rid of the 1.6 hours where only the head is needed, but what about the workers that are needed to run the hyperparameter search trials? Well, actually, those workers aren't always needed during the entire run either. Ludwig AutoML uses a search strategy called async hyperband, and what async hyperband does is discontinue trials that don't look promising compared to the trials it has already run. Depending on the dataset, a lot of trials may be discontinued quickly, and when there are fewer than three trials left, fewer than three workers are needed.

This picture shows the 22.6 hours on the x-axis, and on the y-axis the three datasets running for one hour each, then for two hours each, then for four hours each, with the 10 trials shown for each dataset. You can see that for the first dataset, only two trials really run until the end of the time budget; the others are discontinued as not promising. For the second dataset, only one trial really survives to the end of the hour. The third dataset's trials are more competitive, but hey, that's what autoscaling is all about. So there's a lot of opportunity to save resources even during the run.
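To give a flavor of that early-stopping behavior, here is a small Ray Tune sketch using the ASHA (async hyperband) scheduler with 10 trials and at most three running concurrently. The trainable and search space are placeholders, not Ludwig internals, and the code follows the Ray 1.x tune.run style; argument names such as max_concurrent_trials may differ in other Ray releases.

```python
# Illustrative Ray Tune sketch of async hyperband: unpromising trials are
# stopped early, so fewer GPU workers are needed late in the run.
from ray import tune
from ray.tune.schedulers import ASHAScheduler

def objective(config):
    for step in range(100):
        # placeholder metric; Ludwig AutoML supplies its own training loop
        tune.report(score=config["lr"] * step)

analysis = tune.run(
    objective,
    config={"lr": tune.loguniform(1e-4, 1e-1)},
    num_samples=10,                 # Ludwig AutoML's default trial count
    max_concurrent_trials=3,        # match the three GPU workers per job
    scheduler=ASHAScheduler(metric="score", mode="max", grace_period=5),
)
print(analysis.get_best_config(metric="score", mode="max"))
```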
That brings us to the configuration where Ray is running with its autoscaler, where Luna is managing the scaling of the Kubernetes cluster, and where we deploy Ray onto a Kubernetes cluster, EKS in this case. We actually used a control to make sure that Luna didn't use NVIDIA M60 GPUs, which are actually cheaper than T4s, because they didn't work well for these machine learning workloads; they're more of a graphics-oriented GPU. Now we run the three jobs of the same time budget in parallel, and we set max concurrent trials to three so that each job still only runs three trials at a time, just as it would have in the fixed-size cluster.

In this configuration, we got competitive accuracy results, so there was no compromise on the accuracy of the models that AutoML found using Ray Tune. But the elapsed time was greatly reduced: in this parallel run, it came down to 8.75 hours, which was a big difference. The idle cost was reduced by two thirds, because only the GPU head is running at the end. And the workload cost was reduced by 54%. Let's look at where that came from. More than half of it came from exploiting the autoscaling we talked about: scaling down workers not needed during the run, for the head-only parts or for the parts where there are fewer than three trials left. But about twenty-some percent also came from using cheaper instances. When Luna ran this job, it chose a g4dn.4xlarge for the head, because I had said the head needs more CPU and memory to handle the evaluations and the data processing, but I had said the workers need less CPU and memory, and Luna, patient and willing to look for available instances, was able to get 2xlarge instances, which are cheaper.

Okay, cool. So that's way better in elapsed time, idle cost, and workload cost. But we still have that GPU machine running whenever the Ray head is up, and I guess there are two things there. One is that you're going to feel a little guilty just leaving it running all the time, so you're probably going to spin it down once you've gotten all the data you needed off the head. And also, it's just a waste of money in general, because you don't need those GPUs once the job is finished. So in this case, with the CPU-only head, we can see how cheap we can get this, and possibly even leave the cluster up, if you're comfortable with the idle cost once you've switched to the CPU head. Now, in this kind of deployment, the head can't run a worker, so we need to bring up nine workers max to get three workers for each of the three jobs. Also, because of the way Ludwig AutoML checks for resources, it looks at the Ray head to see whether there are GPUs enabled in the cluster, so there's a slight option you need to add to Ludwig AutoML to run it this way.

Again, in this configuration we were able to match the accuracy of the models, so that's good. On elapsed time we suffered a little bit: instead of 8.75 hours it was nine, because the CPU head running the evaluation was a little less efficient, but that's still way better than 22.6 hours, so if you're willing to take a little hit there, that's fine. The idle cost was $0.452 an hour, 62% less than leaving the GPU head enabled and 87% less than the baseline. So basically it was a big savings, and I felt way less guilty leaving the Ray cluster up in this case. The workload cost was essentially the same as in the previous case.
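For reference, a quick check of how those idle-cost percentages fall out of the hourly figures quoted above.

```python
# Idle-cost comparison across the three configurations, using the
# hourly figures quoted in the talk.
baseline_idle = 3.612   # $/hr, three always-on g4dn.4xlarge nodes
gpu_head_idle = 1.204   # $/hr, single GPU-enabled Ray head left running
cpu_head_idle = 0.452   # $/hr, CPU-only Ray head left running

print(f"GPU head vs baseline: {1 - gpu_head_idle / baseline_idle:.0%} lower")  # ~67%
print(f"CPU head vs GPU head: {1 - cpu_head_idle / gpu_head_idle:.0%} lower")  # ~62%
print(f"CPU head vs baseline: {1 - cpu_head_idle / baseline_idle:.0%} lower")  # ~87%
```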
So overall, the lesson we learned here is that the efficiency, along with the operational ease of use, that you get from using Luna nodeless Kubernetes and running Ray on top of Kubernetes was really worthwhile. A GPU-enabled head was good; a CPU-only head was better. Having shown these benefits, in the future we want to continue to advance Luna to handle efficient scaling of all sorts of workloads; deep learning training is certainly not the only workload that can benefit from this, CI/CD and many other workloads can as well. And we want to continue to extend Ludwig AutoML to new domains to enable efficient development and scaling; in fact, just in the past two weeks we announced that Ludwig AutoML now works for text classification datasets and shows good savings for those as well. So that's it, thank you. Any questions?

Did I get it right that Luna doesn't actually add nodes to the cluster, but rather adds some compute and you're handing the workload off to that? And if so, what is the difference between that and just expanding the cluster on demand with nodes that fit your needs?

Sorry, just to make sure I understand the question. Right now, during the experiment, Luna is adding virtual machines to the Kubernetes cluster to satisfy the pending pod requests that couldn't be placed at the current size of the Kubernetes cluster. They are nodes in Kubernetes, yes.

You mentioned the Higgs dataset. I'm very curious, what is that?

It's pretty cool. There's a whole bunch of famous tabular datasets, and I wanted to choose famous ones for which people had disclosed a lot of best models, so that I could make sure AutoML was competitive with those. One of them is the Higgs boson dataset. It's a very large, very nice dataset, and a very challenging dataset to run. So yeah, it's an interesting dataset. It's out there; you can download it and run it. These datasets are also built into Ludwig, so they're available if you're using the Ludwig platform; there's a bunch of standard datasets, and each of the ones for AutoML is checked in.

Did you compare Luna with, for example, the GKE autoscaler or Karpenter? I think here you're comparing with yourself, right?

Yes, here I'm comparing with my own, you know, lame baseline. But yes, I mentioned in the first description of Luna that it is comparable to AWS Karpenter and to the cluster autoscaler. I would say the difference is that for the cluster autoscaler, you typically end up creating a whole bunch of node groups to cover every possible instance type you want, whereas you don't have to do that here. And as for Karpenter, I've actually run these experiments on clouds other than AWS, so one issue with Karpenter, at least right now, is that it's really AWS-only. Karpenter also doesn't allow you to do instance exclusion, and I really wanted instance exclusion here because I didn't want the crappy M60 GPUs. It also has this weird thing where it's not deterministic in what order it applies its rules. These are little picky things, but I feel like this setup is pretty robust for my use case. Thank you.