Hello, everyone. Thank you for coming to ArgoCon. For the next 20 minutes or so I'm going to talk about how we train machine learning models at Wolt, how we ensure their reliability, and how we use Argo CD, Flyte, and Argo Workflows for that.

Quick intro on my side. Nobody came here for me, but still: I'm Stefan, I'm a machine learning engineer. I'm also a YAML engineer, given that I use Argo a lot; that's basically what I do all day long. In a previous life I was a data scientist. I'm a founding member of the ML platform team and, if you want, a founding member of the Berlin MLOps community. So if you're Berlin based and you want to learn about MLOps, you can come to our meetups, conferences, everything. But that's all about me.

Second thing: I work for Wolt. Not HashiCorp Vault, because every time I say where I work, people ask. So, Wolt. And given that we're not in Amsterdam, not in the Netherlands, I'll give you a quick intro and then we'll get to the technical stuff. Wolt was created in Helsinki in 2016. We started as a food delivery company in Finland, and now we deliver basically everything you can think of, from food to Christmas trees and everything in between. We're in 23 countries now, from Norway to Japan, going also through the 'stans and a lot of countries in the middle. We have a lot of users, a lot of partners, a lot of courier partners. Those are just stats, though; what we really care about here is machine learning. So let me run through the different ML use cases at Wolt, and then we can start the talk.

The first one is supply and demand forecasting: trying to predict how many people are going to order next week. Do we need more courier partners next week? Do we need to buy more things for the supermarkets? Because we also run our own supermarkets. Say it's a public holiday next week, maybe people will buy specific things; that's what we're trying to forecast. We do it on a weekly basis and it's been working pretty well.

Then we have recommender systems. I assume a lot of people know about those: if you keep ordering the same dishes, we'll try to recommend different dishes you might like. But also, say you move to a new city and need to furnish your apartment. You can buy a lot of things on Wolt, so we'll be like: you bought a lamp, maybe you want a chair, maybe a desk. We're not trying to do it the Amazon way, where you buy a lamp and get recommended a thousand more lamps; we're trying to recommend you new things.

Then there's logistics optimization, which I think is one of the most important ones. You make an order on Wolt, you order a dish; we predict how long it will take to prepare the dish, and then how long it will take for us to deliver the food or the item you bought, which depends on traffic. That one is fully real time, so it's really important, and it's the one people usually complain about: "it was supposed to take 25 minutes, I've been waiting for 35.
What's going on?" Then there's fraud detection: keep the bad people out. And the last one is support prioritization. You order something and you're not happy about it: it's delayed, there's a problem, you got the wrong item. We try to prioritize those issues by order of importance. A dish that's running late is pretty urgent; other problems can maybe wait a day or two. So that's it: different use cases, and many different needs.

Our data scientists need data access. We want them to have access to production data in a simple and yet safe way. You don't want them to have access to the whole production database; you just want them to have access to specific tables. We make that, I would say, fairly easy for them. Then infrastructure access as well. A lot of data scientists need GPUs now, especially these days. If they need GPUs, they can request them themselves: they make a PR, and, as I'll explain later, Argo picks it up, so they don't even have to apply anything themselves. If you need a lot of RAM or a lot of CPUs, it's the same.

Another thing we want is to make deployments of models quick, reliable, and easy. You might have the best model on your laptop, but if it's not deployed, it's useless. So we really want to make that easy for people and increase the velocity of our data scientists. The last thing we want is standardized monitoring. We want to track data quality: you don't want to train a model if your data quality is bad, because then the model will probably not be good either. You want to track the metrics of your ML model. And you want to track production performance: make sure your P99 didn't go from 50 milliseconds to two seconds before you promote your model. Those are the needs we have.

We have an ML platform. I won't go into detail, because that's not the goal of the talk, but we use Flyte to train our models. Our whole stack is on Kubernetes; Flyte runs on Kubernetes, and it's what we use to train and orchestrate our workflows. Then we have MLflow: you track your metrics, you track your parameters, you can log the artifacts of your ML model, and you can compare experiments. Say you have multiple experiments; you can compare them, with graphs and a UI, so you're happy with it. We also use the MLflow model registry, which lets you track which model is running where: if you have a model running in staging and in production, you can track the different versions of your model and you know where they're running.

Then we have a lot of Python services to make the lives of our data scientists easier. And the last piece is Seldon Core, which is what we use to deploy models into production. You give it an S3 or GCS bucket, you say which library you used, scikit-learn, XGBoost, whatever, and it creates a microservice for you. Out of the box you get automatic logging and everything. We use Kafka a lot, so we log everything to Kafka, and then we push from Kafka to Snowflake.
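As a quick aside on the MLflow piece, here is a minimal sketch of what logging a run and registering a model typically looks like from a data scientist's side. The tracking URI, experiment name, metric values, and model name are all made up for illustration; only the MLflow calls themselves are standard.

```python
import mlflow
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Point the client at the tracking server (URL is hypothetical).
mlflow.set_tracking_uri("https://mlflow.example.internal")
mlflow.set_experiment("delivery-time-estimation")

# Tiny dummy dataset just so the example runs end to end.
X, y = np.random.rand(100, 4), np.random.rand(100)

with mlflow.start_run():
    mlflow.log_param("n_estimators", 200)      # hyperparameters for this run

    model = RandomForestRegressor(n_estimators=200).fit(X, y)

    mlflow.log_metric("rmse", 0.42)            # evaluation metric (hard-coded here)

    # Log the model artifact and register it in the model registry,
    # so we can track which version runs in staging vs. production.
    mlflow.sklearn.log_model(
        model,
        artifact_path="model",
        registered_model_name="delivery-time-estimation",
    )
```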
With everything logged to Snowflake, our data scientists can have dashboards and see that their model is doing well, compare it against the ground truth, and they don't have to write code for that; they get it automatically. Then you have A/B testing, canary deployments, a lot of things.

So we have an ML platform, but there was a variable in the title: we want an ML platform, but a reliable one. At first we deployed Flyte without Argo CD, because we weren't using Argo back then. Then we introduced Argo CD for Flyte; within the ML platform, that's the only thing we use it for. What's really nice is that even our data scientists can ask for resources and add new things to Flyte directly, without us doing it for them. Then around the whole ML platform we have Argo Workflows, which we use together with Gatling. Gatling is an open source load testing tool, and I'll talk more about it later, but basically every ML model that gets deployed goes through load testing, so we can make sure it actually supports the load we want it to.

So, Argo CD and Flyte: how do we use them? We do cluster bootstrapping with Argo CD. As I said, we first deployed Flyte without Argo CD, and I was not happy with it; some days you'd want to cry in a corner because you applied something wrong and everything's broken. Now we don't have that anymore. Flyte needs a lot of different apps, and honestly, without Argo CD you spend most of your day trying to apply things, going to the right namespace, and figuring out what's broken. We have no Kubernetes experts in the ML platform team, so Argo CD has been really helpful there too: we can apply things in an easy way without becoming experts and without always asking the infra team, "can you apply that for us, can you add that for us?"

We don't apply anything manually, so we don't need special permissions to go into a specific namespace or to apply specific resources; Argo does it for us. It also gives us rollbacks, because I break a lot of things: when you apply something bad, you can roll back easily. And Flyte supports plugins. Say you're happy with Flyte but you want to use Spark on Kubernetes: you create a PR, Argo CD picks it up and deploys it to Flyte, and then you can check the UI yourself and see that your plugin is working, that you now have Spark running, or the MPI operator, or whatever supported plugin you wanted to add. Data scientists can do that themselves, so I don't need to, and I can go on holiday more often, which makes me really happy. That's why we use Argo CD, and why we use it with Flyte.

Then there's our infrastructure setup: there are the upstream Terraform modules, and we have our own ones, which are mostly wrappers around them. Another thing I like to not do is write Terraform. That's what the infra team and the developer experience team do: they provide modules. Say you need to install Cilium on your cluster.
Then you can just call the module, Argo CD picks it up, and in our case it lands in the Flyte cluster. I don't have to write all the Terraform myself, which would be a lot of Terraform; I just call the module. So that's what we have with Argo CD and Flyte.

Maybe a word on Flyte for people who don't know it, because it's quite specific to ML. Flyte is what we use now instead of Airflow for ML workflows. Why? First, it's Kubernetes native, and we're pretty happy about that. It also supports automatic parallelization. Say you have different tasks that don't depend on each other, one fetching some data, another doing whatever it's doing; Flyte will parallelize them automatically. You don't have to think about it, you don't have to declare dependencies or tell it to run one task before another; it figures it all out itself.

You get reproducible pipelines, which is really important for ML. You don't want the pipeline that produced an amazing model to leave you wondering which version it was. That's also why we use it. It supports caching, and caching is really good: say your workflow takes six hours to finish, it runs for four hours and then crashes. You don't want to lose those hours and restart everything from scratch; with caching, Flyte picks up what's already done and you only wait two more hours instead of six.

It has different SDKs. We only use Python in the ML platform team, but we have other teams using Scala, and they're actually going to use Flyte soon; Flyte supports different SDKs, so if you want something other than Python, you can.

And the best thing, in my opinion: dynamic workflows. We're in 23 countries, and say we want to train a model per country. You don't want to hard-code "for i in 23", because when we add a new country you'd have to change your code, and you really don't want that. With Flyte you can say "please parallelize over this list"; it reads the list and creates the tasks depending on the size of the list. And 23 is just an example: we're in more than 400 cities. Imagine having to change the code every time we launch in a new city. That would be annoying. Flyte supports this out of the box, it's been pretty handy, and I have a use case I'll talk about at the end; I was even surprised we could do it, and our data scientists built it themselves.

So, Flyte workflows: what do they look like? I don't think you can see anything in the back because the slide is very dark, sorry. Basically, you write normal Python and you just add decorators: one decorator makes a workflow, another one makes a task, and Flyte recognizes them. You can also run everything locally; it's normal Python if you run it locally, and when you run it in the cluster, Flyte says "I know this" and runs it as workflows and tasks. And it gets even better, because I have another example, but again you can't see anything in the back.
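Since the slide is hard to read, here is a minimal sketch of what that first example roughly looks like with flytekit. The task names, resource sizes, cache version, and the fake metric are made up for illustration; the decorators and the Resources/cache options are standard flytekit features.

```python
import pandas as pd
from flytekit import Resources, task, workflow


@task(requests=Resources(cpu="1", mem="2Gi"), cache=True, cache_version="1.0")
def fetch_orders() -> pd.DataFrame:
    # In reality this would read from the data warehouse;
    # here it builds a tiny frame so the example runs anywhere.
    return pd.DataFrame({"city": ["Helsinki", "Berlin"], "orders": [120, 95]})


@task
def train_model(orders: pd.DataFrame) -> float:
    # Stand-in for real training: return a fake metric.
    return float(orders["orders"].mean())


@workflow
def training_pipeline() -> float:
    # Flyte builds the dependency graph from how outputs feed into inputs;
    # tasks that don't depend on each other run in parallel automatically.
    orders = fetch_orders()
    return train_model(orders=orders)


if __name__ == "__main__":
    # Runs as plain Python locally; on the cluster Flyte executes the same
    # code as containerized tasks.
    print(training_pipeline())
```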
The second example, the one you also can't see, is what we use for dynamic workflows, and it's what I was saying before: it does cross-validation for all the ML models you have. You want to cross-validate every model, but you don't know in advance how many there are; Flyte figures that out itself, runs the cross-validation, and returns the results. And basically that's it: you don't need to do anything beyond using the dynamic decorator, which is nice. Sorry again for the people in the back.

Then we use Argo Workflows with Gatling. I don't know how well known Gatling actually is, but it's an open source load testing solution. What it lets us do is script our testing scenarios and automate our tests. It produces visual reports that are actually nice to look at: you have your app, and you should be able to understand fairly quickly what's happening, what's wrong with it, and how it behaves when you add more load. I have examples of the reports. If you want to be really fancy, you can do continuous load testing. We don't use that in the ML platform, because the ML models don't need it, but for other apps, say one that's really important for your company, you might want it: every time you make a commit on GitHub, you run a small load test just to make sure you don't introduce regressions, just to check that your P99 is still around 20 milliseconds and you're happy. Everything runs on Argo Workflows for us, and that's been really nice.

What's also nice is that data scientists see Gatling and they don't see Argo Workflows, because they might otherwise be scared of using Argo Workflows and writing all the YAML. They see the end result, but not the machinery underneath. We have templates: you have your ML model, you just give the payload and the name of the model, and we run the load test for you. Then you see the results and you're either happy or not.

This is what it looks like, and this one is better for the back, right, because the slide is actually white. You have the template, and you give the maximum number of requests per second, how long you want the ramp-up to take, and how long you want the test to last; you don't see it here, but you also provide the payload and the name of the model, and then you click submit. If you have only one model, Argo Workflows runs one test; if you have 400, like we sometimes do, it runs 400 tests in parallel.

Here is one example of a visual report, response times: you quickly see whether it's good or bad. This one is just an example; we wouldn't deploy anything slower than 100 milliseconds, and we're usually below 25 milliseconds. But yes, we look at those. Then you have a summary you can scan quickly: my P99 looks fine, I have errors on this specific endpoint, and so on. You can also increase the number of active users, because your ML model might work perfectly well with one user, but what about 10, what about 20? So you can really look at how it behaves as the number of users grows. You can look at the response time distribution, and at the percentiles over time.
So you can really look at it and say: most of the time we're happy, but sometimes we have spikes, and then is that correlated with the number of users we added, or with something else? You can quickly get an idea. Then you have the number of requests per second alongside the number of active users. And then, sorry again for the back, you have assertions. Gatling assertions are what we use to make sure that when you run load tests, there are rules in place: for example, the run only passes if all requests are under 100 milliseconds, or if fewer than 5% of the requests fail. Everything gets reported back to GitHub as statuses, so you can only merge if you respect those assertions. That lets you redeploy in a reliable way without having to look at the reports every time you push something: everything is under 100 milliseconds, all my checks pass, I can merge. Those assertions have been really nice.

So why did we go with Argo Workflows and Gatling? Both can handle large-scale workloads, way more than any ML model we usually produce, so you can be really sure your ML model holds up under the load tests you run. And, my screen is gone; okay, it's back. Then there's parallelism: you can leverage Argo Workflows to manage multiple tests. We have one model that we train for 460 cities, and we load test it across those 460 deployments, in parallel. That way we can make sure everything works as expected for all the cities we have, deploy it, and be more confident. And automated testing, which is what I said before about the assertions: you can have everything in GitHub, Argo Workflows triggers everything and reports back to your CI/CD pipelines, so you get something like the visual reports right in GitHub, and you can make sure everything works under high load. It might still fail for some other reason, but at least on the load test side you're happy.

Now the use case I was talking about. We predict traffic for each city we're in, 463 cities or so, because traffic is different in each city. You really want to predict how long it will take us to deliver the food depending on the city, because traffic in Tel Aviv is very different from traffic in Helsinki. So, say you're a data scientist and you need GPUs. We have GPUs running already, but you might need different ones, with more memory, or special ones. You can do that: you create the PR, we approve it, Argo deploys everything, and then you can check the alerts and make sure everything is deployed correctly. That's the first step. Second step, you create a workflow in Flyte: that's your training workflow. Then you create a dynamic task based on cities.
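Roughly, that dynamic part could look like the sketch below, assuming flytekit on the training side. The city list, the train_city_model task, and the request_load_test helper are hypothetical stand-ins for our internal toolkit, not its actual API; only the @dynamic pattern itself is standard Flyte.

```python
from typing import List

from flytekit import dynamic, task, workflow


@task
def list_cities() -> List[str]:
    # In reality this would come from a data source; hard-coded for illustration.
    return ["helsinki", "berlin", "tel-aviv"]


@task
def train_city_model(city: str) -> str:
    # Stand-in for the real per-city training code; returns a model identifier.
    return f"traffic-model-{city}"


@task
def request_load_test(model_id: str) -> bool:
    # Hypothetical toolkit call that would create a Gatling load test
    # (run via Argo Workflows) for the freshly deployed model.
    print(f"submitting load test for {model_id}")
    return True


@dynamic
def train_all_cities(cities: List[str]) -> List[bool]:
    # The loop is unrolled at execution time, so the number of parallel tasks
    # follows the size of the city list; no code change when a city is added.
    results = []
    for city in cities:
        model_id = train_city_model(city=city)
        results.append(request_load_test(model_id=model_id))
    return results


@workflow
def per_city_pipeline() -> List[bool]:
    return train_all_cities(cities=list_cities())
```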
So the dynamic task says: for all the cities we have, please train a model. Then we have an ML toolkit that every data scientist uses, and in that toolkit we have Python code that says: please create a load test for this specific ML deployment. That way data scientists don't have to write it themselves, because it's always the same code anyway. Then you run the load tests per city, in parallel, so 460 cities running at once. That's also why the assertions are really important: you don't want to look at 460 reports. You just want to see that all your assertions are green and be reasonably sure things are good. If you pass the assertions, you can promote the models and be more confident they're going to work. What you can also do, once you pass the assertions, is make sure you're actually not worse than the previous models: maybe your new model is really good on the ML metrics, but if your P99 went from 20 milliseconds to 200, there's a balance to find. So that's our use case, and it uses everything we have: Argo CD, Flyte, Argo Workflows, and Gatling.

As for future work: first, we want to add support for Triton Inference Server. We're starting to use Triton more and more, and the way Triton works is a bit different from other serving libraries, so we want to add support for it to our load testing tooling. At some point, I have dreams that we can load test 100% of the ML models that go to production. That means a lot of advocacy work, and a lot of abstraction work for data scientists too, because they might say "what you're doing is cool, but I don't really want to spend time on it"; so the question is how we make it easy and possible for them. Then, if we can, I'd also love to promote ML models only if both the load test results and the ML metrics are good, which means comparing the two, and probably even more things I can't really think of right now. And that's going to be it. Thank you. And we're hiring, for God's sake.

Hi, thank you, Stefan, for the amazing talk. I've got a two-fold question. The first one is: when you parallelize those workflows in Flyte for 400 cities, do they spin up 400 pods, and if you're adding GPU support, how do you prevent those jobs from spinning up 400 GPUs? How do you ensure that they share the GPUs?

For the first one: Kubernetes will pick up the nodes, so it knows there's a node with whatever GPU you requested, and it will try to use that. But thankfully, for this use case we don't need GPUs, so we don't actually spin up 400 of them; otherwise we probably wouldn't be hiring. Anyone else? Okay, thanks, Stefan. Thank you.