Good morning. Thank you very much. I appreciate the chance to talk this morning, along with Travis on a recorded MP4, about efficient AutoML with Ludwig, Ray, and nodeless Kubernetes. I want to give a brief shout-out to the community of people who contributed to the content of this talk, including people from the Ray, Ludwig, and Elotl communities. There are four main sources for the material: our February CNCF blog on managing public cloud resources for deep learning training; our February Medium blog on Ludwig AutoML for deep learning, which focused on tabular data sets; our more recent Medium blog on Ludwig AutoML for text classification data sets; and finally our Cloud Native Rejekts talk from this past fall, where we presented a POC of running Ray on public cloud Kubernetes.

We're going to discuss the efficiency of AutoML with respect to two aspects. One is the efficiency of using AutoML rather than doing your own untuned search. The second is, once you're using AutoML, how you can run it efficiently on public cloud Kubernetes versus deploying directly on EC2. First we'll give some background information about Ludwig, Ray, Ludwig AutoML, and the nodeless Kubernetes technology. This is where Travis comes in. Let's see if this can work.

Hi, everyone. Thanks for coming to our talk today. My name is Travis Addair. I'm the CTO of a company called Predibase, building an enterprise low-code machine learning platform on top of Ludwig. Today I'd like to tell you a little bit about the background behind the Ludwig project and how AutoML fits into the vision of what we're doing with the open source Ludwig project. To start, I want to present the background on why we believe Ludwig is a valuable addition to the ML ecosystem. Our observation is that if you look at the way ML is done in industry today, there are essentially two incomplete options available to companies and organizations that want to operationalize ML. On the one hand, you have low-level APIs like TensorFlow and PyTorch that provide a great deal of flexibility. On the other hand, you have traditional AutoML systems that provide a lot of simplicity. But neither ends up being ideal, because the low-level APIs are often difficult for non-expert users to get into production, while the AutoML systems end up being black boxes that you eventually graduate out of, since they don't always solve the problem the first time around. So when we look at what we're doing with Ludwig, the core insight is that we believe there is a third option that needs to be explored, which we call declarative machine learning systems. With declarative ML, we intend to provide a higher level of abstraction that gives you the simplicity and automation of AutoML while still giving you the flexibility of lower-level tools like PyTorch, opening the door for non-experts to harness the power of ML without needing to resort to these more granular tools. The way Ludwig makes this declarative approach possible is similar to systems that provide infrastructure as code, which I'm sure people in the Kubernetes community are very familiar with: we provide YAML configurations that declaratively define the models you want to train. So, for example, it's very easy to get started in Ludwig.
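For instance, here is a minimal sketch of such a config, with hypothetical column names, written as the Python dict form that Ludwig's API accepts alongside YAML:

```python
from ludwig.api import LudwigModel

# Minimal declarative config: name the input and output features and their
# types; Ludwig fills in everything else with defaults. The column names
# ("utterance", "intent") and the CSV file are hypothetical.
config = {
    "input_features": [{"name": "utterance", "type": "text"}],
    "output_features": [{"name": "intent", "type": "category"}],
}

model = LudwigModel(config)
train_stats, _, output_dir = model.train(dataset="train.csv")
```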
You just say: here's a config stating what my input features and their types are, and what my output features and their types are, and everything else gets filled in automatically on your behalf. But at the same time, we provide a lot of expert-level control as well. If you want to use a specific type of model architecture to encode a particular feature, or a particular learning rate or regularization or dropout, all of those options are available to you, as are more advanced features like hyperparameter search over any of the parameters within the config.

What makes this all possible is the Ludwig architecture. Every input feature and output feature in your data set passes through an architecture we call ECD, for encoder-combiner-decoder. Every feature is preprocessed according to preprocessing rules that you configure in the YAML config and then encoded into a vector by a machine learning model, pretrained or learned. All of the different features are combined in an embedding space, and individual output features then pass through a very similar decoding step to produce the final prediction. The benefit of this architecture is that it provides a great deal of task flexibility without a lot of additional complexity. If you want to do a regression problem, you can have any types of inputs and just specify a numerical output. If you want to do speaker verification, you can have two different audio inputs and a binary output that tells you whether or not the audio streams are, for example, from the same speaker. And any number of other problems, including text, image, forecasting, and tabular data problems, are all possible with Ludwig.

Another core component of Ludwig is scalability. Because we integrate heavily with Kubernetes, we also integrate heavily with the distributed systems that are built on top of Kubernetes, like Ray. All of the preprocessing can be distributed across a cluster of pods using Dask on Ray. Our training system uses a framework called Horovod that allows you to distribute training across multiple nodes and multiple GPUs. And all artifacts can then be uploaded to a registry like MLflow, which we support integration with out of the box. Hyperparameter search is very similar and, again, very modular. We use Ray Tune, which sits a level above the training process and can perturb different parts of the config. Every one of those config variants then becomes its own trial that goes through the same preprocessing, training, and evaluation steps as any other training process in Ludwig. At the end of the day, you get all the different model trials that were explored and choose the one that you would like to use in production.

When we started to look at building an AutoML layer on top of this, our goal was for it to be ultimately a glass box and not a black box. One thing that is very nice about the AutoML system in Ludwig is that you can see it as a co-pilot that helps you generate an ideal Ludwig config for your data set. You can start by saying something as simple as: create a configuration from my data set, which can be a data frame or a file, to predict this particular column, in this case intent. A rough sketch of that flow follows.
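A sketch of that co-pilot flow, hedged: this follows the experimental `ludwig.automl` module as of the Ludwig 0.5 era, argument names may differ across versions, and the file and column names are hypothetical.

```python
import pandas as pd
from ludwig.automl import create_auto_config

df = pd.read_csv("intents.csv")  # hypothetical data set

# Ask AutoML to propose a config: it infers feature types, picks a model
# architecture for the task, and chooses hyperparameter ranges to explore.
config = create_auto_config(
    dataset=df,
    target="intent",       # the column to predict
    time_limit_s=3600,     # optional time budget: one hour
    tune_for_memory=False,
)
print(config)  # a plain dict: inspect it, edit it, then train with it
```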
It gives you back a config that you can then do whatever you want with: modify anything to your heart's content. How this works under the hood is that you provide those two parameters plus an optional time budget, and Ludwig AutoML does some inference to determine the input and output feature types, chooses the appropriate model architecture based on your task, selects the parameters and hyperparameter ranges it wants to explore given the time and resource constraints, and then launches the hyperparameter search trials on Ray Tune using your GPU workers. The outputs are the best tuned model along with the other models that were explored, and you can then take those results and deploy them into production as well.

Before I hand it back, the important thing that I want to emphasize here is that there's more to the story than just the Ludwig AutoML side. When you're running this in production, or in a large distributed setting, there's also the question of how to run this process efficiently, to optimize the usage of commodity resources like GPUs so that you're using them judiciously and not wasting resources. This is where Luna fits into the picture, particularly for running Kubernetes workloads. So now I'd like to hand it back to Anne to tell you more about Luna and the work that she's done on combining AutoML and these other systems together.

Okay, great. Thanks, Travis. We'll go back to the slide set now and pick it up where he left off. The last piece of the puzzle in the background information for this talk is the nodeless Kubernetes Luna functionality. Luna is a smart cluster provisioner that runs in a standard Kubernetes cluster. It monitors the cluster for pod creation requests and adds additional compute to the Kubernetes cluster to satisfy those requests. That compute can be in the form of VMs, either on-demand or spot, or it can be in the form of serverless compute like, for example, AWS Fargate. Luna chooses that compute based on current availability in the cloud (we've already heard that sometimes you can't get the GPUs you want), on cost, and on other user requirements; the user may know that specific GPUs work well for their workloads and that others do not. Luna also keeps monitoring your cluster, and if node usage becomes low, it can remove unneeded nodes from your Kubernetes cluster. It's comparable to the Kubernetes cluster autoscaler, but it provides more flexible node selection without the need to maintain what can sometimes be hundreds of node groups to handle the number of instance types available. It's somewhat like AWS Karpenter, but it works across cloud vendors, provides instance family exclusions, and provides deterministic rule application.
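To make the trigger concrete, here is a generic sketch (using the standard Kubernetes Python client, not any Luna-specific API) of the kind of pod creation request a provisioner like Luna reacts to: if no existing node can host the GPU request, the pod sits Pending until a suitable node is added.

```python
from kubernetes import client, config

config.load_kube_config()

# A pod requesting one NVIDIA GPU. Names, image, and sizes are illustrative.
pod = client.V1Pod(
    metadata=client.V1ObjectMeta(name="ray-gpu-worker"),
    spec=client.V1PodSpec(
        restart_policy="Never",
        containers=[
            client.V1Container(
                name="worker",
                image="rayproject/ray:latest-gpu",
                resources=client.V1ResourceRequirements(
                    requests={"cpu": "8", "memory": "32Gi", "nvidia.com/gpu": "1"},
                    limits={"nvidia.com/gpu": "1"},
                ),
            )
        ],
    ),
)

# If this pod goes Pending for lack of GPU capacity, a smart provisioner can
# add a matching node, and later remove it when usage drops.
client.CoreV1Api().create_namespaced_pod(namespace="default", body=pod)
```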
So now that we have this background, let's go first into efficiency in Ludwig AutoML for tabular data sets, and then we'll move on to text classification data sets.

Looking at the tabular data sets, and at what AutoML can bring you versus doing an untuned search: the AutoML heuristics for tabular data sets were developed by analyzing thousands of hours of model training across 12 data sets that we used as the training data sets for the heuristics. We ran three model architectures across 24 hyperparameters, looked at all that data, and formed a set of heuristics. That set of heuristics includes a particular model architecture, TabNet, which we found gave a good trade-off: accurate models with good leverage of the training time involved. We drilled down on which search hyperparameters really made a difference and narrowed their ranges. We also found something interesting that we didn't expect, a kind of transfer learning for tabular data sets: if you've done a hyperparameter search on data set A and found the best model with the best hyperparameters, those hyperparameters will probably do a pretty good job on data set B if it's somewhat similar. So we can front-load the hyperparameter search with those settings and get a good model more quickly. And the fourth thing we did was to set things up so that AutoML uses async hyperband when it's using Ray Tune; async hyperband discontinues unpromising trials so resources aren't wasted on them.

Having created this set of heuristics, we then ran them on the original 12 data sets. That's somewhat like running on your training data, but we were able to find a competitive model in one hour, which gave us some confidence that we were on the right track. We then took an additional nine data sets we hadn't used for creating those heuristics and ran validation, and within a two-hour time budget we got models that were competitive with highly manually tuned models that were publicly reported. We'll use the way I ran that validation to show how difficult it was to do, and how much easier it would have been if I had used the Luna nodeless system on Kubernetes.

So we took three of those nine validation data sets and ran them through AutoML with one-hour, two-hour, and four-hour time budgets, and we compared two configurations running on public cloud Kubernetes with Luna against the way I originally ran it, the baseline of deploying Ray directly on EC2 VMs. The top configuration is the one I ran: a three-node Ray cluster to do the Ray Tune AutoML search, with the head and both workers GPU-enabled. I used NVIDIA T4 GPUs, an off-the-shelf GPU that gives a good ROI on cost; so, a fixed size and a fixed amount of compute. Alternative one is, instead of doing that, to deploy the Ray cluster into a Kubernetes cluster that's running Luna, the nodeless Kubernetes system, with Luna handling scaling when the Ray autoscaler asks for it. Basically, the Ray head is a GPU-enabled node; the Ray workers are spun up when the Ray autoscaler says more resources are needed; Luna sees those pods pending, goes out to the cloud, adds resources to the Kubernetes cluster, and then removes them when the Ray autoscaler releases them. Alternative two is just like alternative one with one difference: the head node is a CPU-only node. This is kind of nice because it means that when the Ray cluster is idle, you're not spending money on a GPU node. A sketch of what such a cluster shape looks like follows.
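Here is a sketch of that alternative-two shape in Ray cluster-launcher terms: a CPU-only head that stays up cheaply, plus GPU workers that scale from zero. This is normally written as a cluster YAML; it's shown here as a Python dict under an assumed schema, and the instance types are illustrative. On Kubernetes, the same shape is expressed through the Ray operator's pod types, with Luna brokering the actual nodes.

```python
# Ray cluster-launcher style config (assumed schema; usually a YAML file).
ray_cluster = {
    "cluster_name": "ludwig-automl",
    "max_workers": 3,
    "provider": {"type": "aws", "region": "us-west-2"},
    "head_node_type": "cpu_head",
    "available_node_types": {
        "cpu_head": {
            "node_config": {"InstanceType": "m5.2xlarge"},  # no GPU cost at idle
            "resources": {"CPU": 8},
            "max_workers": 0,
        },
        "gpu_worker": {
            "node_config": {"InstanceType": "g4dn.2xlarge"},  # NVIDIA T4
            "resources": {"CPU": 8, "GPU": 1},
            "min_workers": 0,  # scale to zero when the cluster is idle
            "max_workers": 3,
        },
    },
}
```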
Okay, so why did I choose this as my baseline? Why did I run those nine data sets in this way? I had three things I was trying to achieve. One, I wanted a standardized amount of compute for the time budget: if I run AutoML for one, two, or four hours, I want to know what that means in compute. And I knew that three T4 GPU-based machines would give a standard amount that would do a pretty good job within those time budgets for tabular data sets; the overall AutoML runs 10 trials, and these machines would give a reasonable result on those 10 trials. Two, I wanted reduced operational complexity. I didn't want to worry about whether I got a legitimate run or not, and that was a big concern for my time. And three, I wanted to limit idle cost. That's why I didn't run the jobs in parallel; I didn't run the three one-hour, three two-hour, and three four-hour runs at once, because I wanted to limit the idle cost. I was sensitive to idle cost because after a job was over, I would log in and make sure I really liked the results, in the sense that they were credible and had run properly, and so on; I wasn't going to spin down the cluster the minute the job completed. For all these reasons, I had this fixed size. I chose g4dn.4xlarge: g4dn means T4, and 4xlarge has to do with other aspects of the virtual machine that are less important to this workload. But the smaller instances were hard for me to get when I tried, and I didn't want to keep retrying things because that would just be tedious.

Of course, this baseline, three data sets each run at one-hour, two-hour, and four-hour budgets, produced models competitive with manually tuned models. The elapsed time was 22.6 hours. You might ask: why wasn't it 21 hours, three times one plus three times two plus three times four? The extra 1.6 hours was spent loading and preprocessing the data sets, and also running the evaluation of the best model from each search, which is done on the head node after the hyperparameter search completes. As for cost, g4dn.4xlarge is $1.204 an hour, so three nodes for 22.6 hours cost me $81.63 to run this job, and the idle cost was $3.612 per hour.

So what are some observations about this baseline? It would be nice to get the results quicker than 22.6 hours, and the obvious way to do that is to run the jobs in parallel. But I didn't want to run them in parallel, because of my sensitivity to idle cost and my concern that the Ray autoscaler wouldn't be able to get the specific instance types you have to ask for when you're running directly on AWS VMs. The marriage of the Ray autoscaler and Luna fixes that problem: the Ray autoscaler can instead ask for a certain amount of resources, and Luna can go broker those resources from whatever is available. That was really the key to being able to move to a more flexible system with autoscaling.

Now you might say: I can see how autoscaling saves money on idle time, and I can see how running in parallel reduces the elapsed time, but are you really going to save anything while the workload is running? Well, at minimum the workers don't have to run during that 1.6 hours when only the head is active. But actually there's more savings than that, because we're using the async hyperband scheduler, which discontinues unpromising trials. So if you're running 10 trials and you get down to fewer than three trials left, you don't need all of those workers to supply the compute that's needed. Here's a picture of what's going on during that 22.6 hours: time is on the x-axis, and the data sets and their trials are on the y-axis, one hour for each of the first three data sets, then two hours each, and so on. You can see that for the first data set, only two trials really survived to the end of the time, so you didn't need three workers the whole time it was running, mostly just two. Likewise, for the second data set you only needed one. The third one needed more, but that's what autoscaling is all about. So there's a lot of opportunity to spin down workers even during the run. A minimal sketch of this early-stopping behavior follows.
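As a minimal sketch of that early-stopping behavior, written against the Ray 1.x-era Tune API; the trial function is a toy stand-in for a Ludwig training trial.

```python
import random
from ray import tune
from ray.tune.schedulers import ASHAScheduler

def train_trial(config):
    # Stand-in for one Ludwig trial: report a validation metric per "epoch"
    # so the scheduler can compare trials while they are still running.
    score = 0.0
    for epoch in range(10):
        score += config["lr"] * random.random()  # toy learning curve
        tune.report(accuracy=score)

# ASHA (async hyperband) halts trials whose metrics lag their peers, so
# late in the run only the few promising trials keep holding workers.
tune.run(
    train_trial,
    config={"lr": tune.loguniform(1e-4, 1e-1)},
    num_samples=10,
    scheduler=ASHAScheduler(metric="accuracy", mode="max"),
)
```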
So that moves us to the GPU-head Ray configuration with Luna running on Kubernetes: deployed on an EKS cluster, with Luna there to bring nodes in when the Ray autoscaler asks for them. This time we can run the three data sets in parallel for each of the time budgets, and we set max concurrent trials to three for each of the three data sets running in parallel, so they would get that standard amount of compute when they ran. Again, there was no compromise on model accuracy; it was just like the baseline. But the elapsed time was much shorter: 8.75 hours instead of 22.6, a reduction of 61%. The idle cost was down by 66%, of course, because only the head is running, not the workers. And the workload cost was 54% lower than the baseline. To break down that 54%: more than half of it came from the autoscaling during the run that we just talked about, and the rest was obtained because Luna was able to get cheaper instances than I managed to get when I brought things up manually. Luna put the head node on a 4xlarge but put the workers on 2xlarge instances, based on the resource requirements.

Now moving to the CPU-only head. This is nice because, first, idle is cheaper, and second, if idle is cheap enough, you might not even bother spinning the Ray cluster down and bringing it back up, if you're willing to spend a little money to keep it around. Again, no compromise on model accuracy. The elapsed time was a little longer, because the CPU head wasn't as efficient as a GPU head node at running the evaluation at the end of the set of trials, but that's not too much of a compromise: an extra 15 minutes. The idle cost was significantly lower than with the GPU-based machine, at $0.452 per hour, and the workload cost was essentially the same.

So that's the story for tabular data sets. Let's look at text classification data sets: can you get an advantage running AutoML on text classification data sets versus your own untuned search? I actually thought this wouldn't be interesting. I was just naive, I suppose; I figured I could read a bunch of papers. BERT-base modeling is widely applied to text classification; here are the batch size ranges I should use, the learning rate ranges I should use, the optimizer I should use. I'd just do that, and it would just work; AutoML wouldn't really be doing anything other than stamping out what the literature tells you to do. That literature recipe looks roughly like the config sketched below.
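Sketched as a Ludwig config (0.5-era schema, which varies across versions; the column names and exact values are hypothetical stand-ins for the literature's recommendations):

```python
config = {
    "input_features": [
        # Pretrained BERT-base encoder for the text input.
        {"name": "text", "type": "text", "encoder": "bert"},
    ],
    "output_features": [{"name": "label", "type": "category"}],
    "trainer": {
        "batch_size": 16,
        "learning_rate": 0.00002,        # a typical fine-tuning learning rate
        "optimizer": {"type": "adamw"},
        "epochs": 6,
    },
}
```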
So I took the 10 data sets I was going to use to create these heuristics and ran that set of parameters to train the models and see what would happen. Well, three of them got great results within one hour, and the other seven crashed. They crashed because BERT-base is a transformer model: it's quadratic in the input text token length, and I was trying to run on commodity T4 GPUs, so the models wouldn't fit in memory. I then spent the next two months injecting a whole set of heuristics into AutoML so that these models could be trained on this kind of commodity GPU. So there really is value in AutoML even for pre-trained models that you're fine-tuning.

This work included adding a tune-for-memory option to AutoML that says: change the hyperparameter search strategy, and other things, if the model is projected not to fit in memory during training. There's a lot of stuff in there, including the finding that limiting the input token length is a good first trade-off, before you reduce batch size and so on; I'd highlight that particular choice. We also limited the number of trials needed: remember, for tabular data sets we were running 10 trials; here we run five by default, and if we look at the combinatorics and there are fewer than five possible combinations, we reduce the number of trials further. And we set the max epoch count to six, because with these pre-trained models, once you get past about six epochs of fine-tuning, you're moving into the range of catastrophic forgetting, which isn't interesting. With these heuristics, the original 10 data sets, the training set, worked well. The other thing we did here, because it's combinatoric, was to move into a regime where the time budget is set uniquely per data set; we couldn't stick with the kind of fixed time budget we used before. Anyway, that worked well for the 10. We then tried it on the additional five data sets that we hadn't seen when forming these heuristics, and got accuracy competitive with publicly reported models.

So again, I did the same experiment that I did for tabular data sets: choose three of the validation data sets and see how they would run on my baseline configuration versus nodeless Kubernetes, Luna running on a Kubernetes cluster. With the three data sets I chose, I was again able to save a lot of resources during the run, but it's kind of interesting where those resource savings came from. Remember, for tabular data sets the savings came from async hyperband discontinuing certain parts of the run. Here they came from the fact that we didn't need as many trials: all three of these data sets, because they used tune-for-memory and because it reduced the hyperparameter search range, only ran two trials at most. And several of the data sets also didn't have to run their full time budget, because they hit the six-epoch mark. Looking at the picture of the run time for these three data sets being validated: again, a big savings in compute cost, and in idle cluster cost versus the GPU head, but not as much savings in elapsed time this time, because elapsed time was dominated by that one big data set.
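Putting the text-classification pieces together, here is a rough sketch of the end-to-end call, again hedged against the Ludwig 0.5-era experimental automl API; the file, column name, and result attribute may differ by version.

```python
import pandas as pd
from ludwig.automl import auto_train

df = pd.read_csv("text_classification.csv")  # hypothetical data set

# End-to-end AutoML: infer types, build the narrowed search space, run the
# Ray Tune trials, and return the best model. tune_for_memory asks AutoML
# to adapt the config (e.g., capping input token length before shrinking
# batch size) when a model is projected not to fit in GPU memory.
results = auto_train(
    dataset=df,
    target="label",
    time_limit_s=7200,     # two-hour budget
    tune_for_memory=True,
)
print(results.best_model)
```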
So, in this talk, we've covered the savings you can get in your own efficiency by using AutoML rather than your own untuned search, and we've seen that once you are using AutoML, there's a lot of leverage in running it on top of a Kubernetes cluster with nodeless Luna technology. In the future, we intend to continue extending Ludwig AutoML to new domains, and to keep enhancing Luna as needed to handle efficient scaling of all sorts of workloads, including machine learning, across public cloud vendors. So thanks very much, I appreciate it, and if there are any questions, I'm happy to take them. Thank you.

Does anybody have any questions they'd like to ask? We've got one question here; let me get the microphone over.

So you were talking about the idle cost savings, but you seem to be assuming that you had basically a single user on the Kubernetes cluster. It seems like you could also get a lot of idle cost savings if you were able to segment things by namespace and have multiple users share the same cluster.

Yeah, that's a really good point. This was kind of a single-user view of the world: I was spinning up the Kubernetes cluster just for myself and then deploying Ray there. It's a very good point, and something we should think about more.

All right, well, thank you very much. Enjoy the rest of your day. Oh yes, Luna is from Elotl; the Elotl company makes Luna. What was the question again? Where can people get more information about Luna? At elotl.co/luna. Thanks. Thank you.