So hi everyone, I'm Henrik, and with me on stage today is Christian. As I already said, we're going to talk about how we integrated a lot of Kubeflow components into an end-to-end workflow, and we brought a nice example: an anomaly detection model training pipeline that uses data from the International Space Station. So deep learning and rocket science, buzzword-wise we're covered.

But before we dive in, let me say a few words about myself. I'm Henrik, I call myself a data scientist, and I'm trying to finish my PhD on machine learning for cyber-physical systems, which is where the ISS also comes in. I used to be a data science consultant with a background in engineering, and now I've turned into a Kubeflow enthusiast, I would say. And with this, I hand over to Christian.

Thank you. I consider myself nowadays a data engineer, or a data jack of all trades and master of none. I used to be a physicist, and nowadays I'm mostly involved in building platforms based on Kubernetes, but also in the use case we're going to talk about a bit. We're actually here only as representatives of a somewhat larger group. We also wrote a paper about what we're presenting today, which got published this morning, I think, and the people shown here were all involved in that. But they in turn are only representatives of a somewhat larger project, which is called the KISS project. We also talked a bit about that at KubeCon last year, about the platform we built for it, but today is going to be especially about the anomaly detection.

Just to give you a first quick overview: this project, as mentioned several times already, uses data from the ISS, and on that we do anomaly detection, so detecting if something's wrong up there, and then diagnostics and even reconfiguration to bring the system back into a safe state. The project is funded by the German government and also the European Union, and you can see that two universities as well as Airbus, JUST ADD AI, and ProCube are involved.

So what will we be talking about today? First, I'm going to give you a higher-level overview of our use case, and then we're going to talk about which parts of Kubeflow, and also of the wider ecosystem, so not only Kubeflow but some other tools, we are using and for what purpose, and especially how they all fit together. Along the way we'll share some challenges, the solutions we found, and tips and tricks. But there are also some things we won't be talking about: if you're here because you're really interested in the ISS, I'm sorry, we won't really talk about that at all, and also not about the scientific methods, results, anomaly detection itself, or anything like that. The good news is that if you're interested in that part, you can check the references at the end of our talk.

So now on to the high-level overview of the use case. We have time series data coming in from the ISS and do anomaly detection on that. If we find anomalies, we do a root cause analysis, then a reconfiguration, and finally a supervision step, which just makes sure that any reconfiguration we calculate isn't doing anything dangerous. The general setup also has dashboards built in, which load the data from there, so that spacecraft engineers can look at it and decide whether a proposed action is a good thing to do or not.
We're using Kafka to tie all of that together, we have some Postgres databases and so on, but we're not talking about that today. Instead, we're mostly going to talk about this neural network machine learning model, which is only used for the anomaly detection, and especially how we end up there; all the other parts use symbolic AI. We also use KServe with custom transformers and predictors for that. But as said, we're now talking about how we get to that part.

And we face some pretty standard challenges here, I would say: we want to do distributed preprocessing because we have quite a lot of data, we want to train a model, do hyperparameter search, and deploy our model. We have multi-GPU model training that we want to do, we want to do inference on streaming data, and finally we want some pipelines to tie all of that together. Luckily, Kubeflow has us covered here. We have Kubeflow Pipelines, on the far right, for, well, tying it all together; we use KServe, the training operator, and Katib; and then, not really part of Kubeflow but we still like it, Dask for the distributed preprocessing.

So how does KFP allow doing all of this end to end? What we want to do here is distributed preprocessing, then hyperparameter tuning with multi-GPU model training, and finally deploy all of that. And well, since we're standing here, you already know what the answer is: yes, that's totally possible with Kubeflow Pipelines. Now we're going to dive into a bit more detail on how this looks for what we're doing here.

Here on the left you see the DAG we are using. All of that code you can have a look at under the GitHub address on the slide. A quick note, though: this is not really our production code. We cleaned it up quite a bit to make it more digestible and understandable, and easier to reuse as a template for similar setups. And as a first note, we use KFP version 2 only, which we think is a lot nicer than KFP version 1 but still has some issues, especially because we're using it with the open-source backend only, where things like parallelization, for example, don't work as nicely.

Some general notes: we use a mix of lightweight components and container components. Perhaps a quick note on that DAG; let me quickly get the laser pointer working. It's pretty standard, really, mostly what you'd expect: at the top we have data preprocessing, in the middle we have model training, and at the bottom we have evaluation somewhere here, with serving at the bottom left. At the top we also use parallelization for some preprocessing steps, and at the bottom we automatically deploy if certain metrics are better than a given threshold, and so on.

For those container components, we use one single image with multiple entry points, so you can see they're used more or less all over. We're going to see that in more detail in a minute. Then there's Dask, already mentioned: we're using the Dask operator and also a KubeCluster to make the best use of Dask, and Henrik is going to show you more of that in a minute. And finally, we use the Python Kubernetes API and MinIO to tie Katib, the PyTorch operator, and KServe into all of that, and we're going to have a look at that in detail now as well.

Thank you very much, Christian.
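As an illustrative aside on the Dask setup Christian just described: the following is a minimal sketch of what spinning up a cluster via the Dask operator's KubeCluster and reading Parquet partitions from MinIO can look like. This is not the project's actual code; the cluster name, image tag, worker count, endpoint, bucket, and column name are all placeholder assumptions.

```python
# Minimal sketch (assumptions, not the project's code): start a Dask cluster
# via the dask-kubernetes operator and read Parquet partitions from MinIO.
import dask.dataframe as dd
from dask_kubernetes.operator import KubeCluster
from distributed import Client

# Cluster name, image, and worker count are illustrative placeholders.
cluster = KubeCluster(
    name="preprocessing",
    image="registry.example.com/anomaly-detection:latest",  # the one shared image
    n_workers=4,
)
client = Client(cluster)

# Workers read the Parquet partitions directly from MinIO (S3-compatible),
# so no data passes through the launching pod. Endpoint and bucket are
# assumptions; S3 credentials are expected in the workers' environment.
df = dd.read_parquet(
    "s3://telemetry/partitions/",
    storage_options={"client_kwargs": {"endpoint_url": "http://minio-service.kubeflow:9000"}},
)
result = df.groupby("sensor").mean().compute()  # placeholder aggregation

client.close()
cluster.close()
```

The appeal of the operator-based KubeCluster is that scheduler and workers run as ordinary pods, so the preprocessing scales independently of the pipeline pod that launches it.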
Now that we know what our pipeline looks like at a high level, and what our core implementation features are, I'd like to use the rest of our talk, which is ten minutes or so, to highlight a few of those details.

The first one is that, as Christian said, we use only one image with multiple entry points, for example in all our container components for KFP. But we actually use it not only for the container components, but everywhere we need custom images: Katib training trials, those Dask workers, but also the distributed PyTorch trainings, and even for the KServe custom predictors and transformers. That means if you have a look at our repository, you will find only one single Dockerfile. And it looks like this; it's not very exciting. We start from a rather large base image, which holds Python and PyTorch, but also the CUDA libraries needed for deep learning. And since we only need one image, we can treat all the functionality we implemented in Python as one large Python project, where we manage our dependencies using Poetry. The rest is pretty standard.

The question then is: how do we make multiple entry points out of this? Somewhere in the image we have this main.py file, which is the main entry point, and there we have a lot of functions that we use to create the KFP components. This is one example of those functions, called run_dask_preprocessing, and it does exactly what its name promises. To create a CLI we use Click, and this is where the different entry points come in: we can pass different arguments to the Click command decorator, such that when we run python main.py and say run-dask-preprocessing, we actually enter this function.

Now, if we want to create a component out of this, as you might already know, we define all our components in a different file, in this case called components.py. Using this decorator, we just need to specify three things: the image, the command, and the arguments. As I said, the image is always the same, and we store it somewhere in our configuration. The command is python plus the path to the main.py file, and the first argument specifies which function within this main.py file to run.

This is pretty handy, but it also allows us, just like any other container component, to run the code locally. If I pull the repository and run poetry install, I can then, on my local machine, run something like poetry run python main.py with the corresponding arguments. I just need to download the corresponding output files of the previous tasks in the pipeline, so that I can debug locally. And it turns out this really speeds up our development process: as a data scientist, I can debug it like any other Python project, instead of building a new image, pushing it to the registry, and running the pipeline again.
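To make that entry-point pattern concrete, here is a minimal sketch of the two pieces just described: a Click-based main.py and a KFP v2 container component wrapping the same image. The option names, file paths, and image tag are illustrative assumptions, not the repository's exact code.

```python
# main.py -- the single entry-point module; each Click command is one
# "entry point" of the shared image. Function bodies are elided.
import click

@click.group()
def cli():
    pass

@cli.command(name="run-dask-preprocessing")
@click.option("--input-path", required=True)
@click.option("--output-path", required=True)
def run_dask_preprocessing(input_path: str, output_path: str):
    """Entered via: python main.py run-dask-preprocessing ..."""
    ...  # distributed preprocessing with Dask

@cli.command(name="run-training")
@click.option("--learning-rate", type=float, default=1e-3)
def run_training(learning_rate: float):
    ...  # distributed PyTorch training

if __name__ == "__main__":
    cli()


# components.py -- wrapping the same image as a KFP v2 container component.
from kfp import dsl

IMAGE = "registry.example.com/anomaly-detection:latest"  # placeholder

@dsl.container_component
def dask_preprocessing(input_path: str, output_path: str):
    return dsl.ContainerSpec(
        image=IMAGE,
        command=["python", "main.py"],  # always the same command
        args=["run-dask-preprocessing",  # first arg selects the entry point
              "--input-path", input_path,
              "--output-path", output_path],
    )
```

With this layout, the local debugging workflow described above is just, for example, `poetry run python main.py run-dask-preprocessing --input-path ./in --output-path ./out` after downloading the upstream task's outputs.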
So that's great, and the second thing I'd like to highlight is that, from within Kubeflow Pipelines lightweight components, we make use of the Python Kubernetes API to talk to external services such as KServe or Katib. How does this work? Let's have a look at the Katib example. This is our code; again we're looking at the components.py file, where we define a different component. In this case, it's the run-katib-experiment component, which, as you see from the decorator, is a lightweight component, which means we can install packages using the packages_to_install argument. You see we install kubernetes, which is the key point here; loguru is just for logging.

And if you've seen a Katib experiment manifest, you know there's a lot to specify, for example what code to run in each Katib trial. That's what we specify here, and this should look familiar to you already: we again run python main.py, but in this case not run-dask-preprocessing but run-training, and we use the same image again. There are a lot of other options we need to specify to get a complete Katib experiment, and instead of creating a YAML manifest, we use a Python dictionary; at the end of the day, we store every one of those configuration pieces in this dictionary. Then we create a Kubernetes client using the in-cluster configuration, and we eventually start the experiment with this create_namespaced_custom_object call, so that we have a Katib experiment running. We then check in a while loop, every minute, what the response looks like, and eventually it hopefully changes its status from Running to Succeeded. In that case, we can get the best hyperparameters from the response of the Kubernetes client, and that is then the output of the lightweight component: it returns a dictionary holding the optimal hyperparameters of our experiment. So far so good; that didn't look too difficult.
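To make that pattern concrete, here is a condensed sketch of what the body of such a lightweight component can look like, using only the Kubernetes Python client. The experiment name, namespace, metric name, search space, and image are illustrative assumptions; the CRD coordinates (kubeflow.org/v1beta1, plural experiments) and the status fields are Katib's.

```python
# Condensed sketch: launch a Katib experiment with the Kubernetes Python
# client and poll until it succeeds. Names and spec values are placeholders.
import time
from kubernetes import client, config

def run_katib_experiment(namespace: str = "kubeflow-user") -> dict:
    config.load_incluster_config()  # the component pod runs inside the cluster
    api = client.CustomObjectsApi()

    experiment = {
        "apiVersion": "kubeflow.org/v1beta1",
        "kind": "Experiment",
        "metadata": {"name": "anomaly-hpo", "namespace": namespace},
        "spec": {
            "objective": {"type": "minimize", "objectiveMetricName": "val_loss"},
            "algorithm": {"algorithmName": "random"},
            "maxTrialCount": 12,
            "parameters": [{
                "name": "lr",
                "parameterType": "double",
                "feasibleSpace": {"min": "0.0001", "max": "0.01"},
            }],
            "trialTemplate": {
                "primaryContainerName": "training",
                "trialParameters": [{"name": "lr", "reference": "lr"}],
                "trialSpec": {  # a plain Kubernetes Job, as in the talk
                    "apiVersion": "batch/v1",
                    "kind": "Job",
                    "spec": {"template": {"spec": {
                        "restartPolicy": "Never",
                        "containers": [{
                            "name": "training",
                            "image": "registry.example.com/anomaly-detection:latest",
                            "command": ["python", "main.py", "run-training",
                                        "--learning-rate", "${trialParameters.lr}"],
                        }],
                    }}},
                },
            },
        },
    }

    api.create_namespaced_custom_object(
        group="kubeflow.org", version="v1beta1", namespace=namespace,
        plural="experiments", body=experiment,
    )

    # Poll every minute until the experiment reports Succeeded.
    while True:
        resp = api.get_namespaced_custom_object(
            group="kubeflow.org", version="v1beta1", namespace=namespace,
            plural="experiments", name="anomaly-hpo",
        )
        conditions = resp.get("status", {}).get("conditions", [])
        if any(c["type"] == "Succeeded" and c.get("status") == "True"
               for c in conditions):
            break
        time.sleep(60)

    # The best hyperparameters live under status.currentOptimalTrial.
    assignments = resp["status"]["currentOptimalTrial"]["parameterAssignments"]
    return {p["name"]: p["value"] for p in assignments}
```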
So what's the big deal? Well, it turns out it's more challenging than we initially thought, because we are leaving the comfort of Kubeflow Pipelines' I/O features. Let's have a look at that. Usually, in a pipeline like the one visualized here on the left-hand side, we have different tasks in a DAG. Each task, as you probably know, is executed as a pod, and Kubeflow is great because it stores the outputs of the first task in MinIO and takes care of copying them into the second task, so we don't have to take care of that as data scientists. However, if we do something fancy like integrating a Dask operator that spins up a whole bunch of pods, worker pods and a scheduler pod, all of those workers won't have access to the data that Kubeflow usually copies into the pods. So that's bad news. The same holds for the Katib trials: the lightweight component we used earlier talks to the experiment controller and the trial controller, and each trial then uses a Kubernetes Job, in our case, or even a PyTorchJob or an Argo Workflow. In any case, you eventually end up with a pod that is supposed to run your training code, but here, too, Kubeflow doesn't take care of copying the data in. Lastly, that's also the case for the PyTorch cluster: it spins up a whole lot of workers, and in our case we use four GPUs for model training with PyTorch DistributedDataParallel, DDP, which internally uses ring all-reduce to average the gradients in an efficient way. But those worker pods obviously also need access to the training data and the validation data, and they wouldn't have it if we didn't take care of it.

So we need a workaround, and our workaround has three steps. Let's look at the Katib experiment again. The first thing we do is get the MinIO paths of the data artifacts. It's a bit hacky: inside the component we have these data frames for training and validation data, and those Python objects are of type Input[Dataset]. They have this .path attribute, which holds the path to the data inside the Kubeflow Pipelines task pod, and which happens to also encode where the artifact lives in MinIO. So we know where the data lives in MinIO. The second thing we do is inject the MinIO credentials from the corresponding secret into the environment of the running pod, whether it is a PyTorch training worker or a Dask worker or whatever it is. And lastly, we implement a custom function, which lives inside our image, that is actually capable of reading from and writing to MinIO. If you wonder why it looks so messy here: the path of an object in MinIO actually looks different depending on whether you copy it from MinIO directly, from the pipeline UI, or obtain it in the setup we just showed. A condensed sketch of this three-step workaround appears at the end of this transcript.

Right, so problem solved. Doing this, we were actually able to integrate all of the tools we mentioned into one single Kubeflow Pipelines workflow, and this leads me to our last slide. If you would like to remember anything from today, from what we told you, we would like you to remember two things. First, yes, it is actually possible to integrate all the tools mentioned, which is Dask, Katib, and PyTorchJobs, into one single Kubeflow pipeline; however, it does still require some hacks. And second, it really shortens the debug loop if you are able to run your component code locally; we would highly recommend that. We'd be really glad if you rated our talk; the QR code is listed here. And in case you are interested in more details of the implementation, check out our GitHub repository for this project, and maybe also the papers we wrote about related work; there you will also find information about the ISS and more on the machine learning side of things. So, questions?

Thank you for the great talk, very exciting. I'm just wondering: you've been using the pure Kubernetes Python client for your Kubeflow Pipelines components, for Katib or for Dask. Did you look at our upstream Python SDKs to simplify your components, so you don't have to maintain this part yourself?

So we did a few small commits and pull requests, actually, is it more than one? I'm not sure. But we did write issues for whatever we found, so we try to contribute as well as we can, and we will continue doing that.

That's great to hear, because I think we should work together to understand the downsides of using this as part of Kubeflow Pipelines; it's a sizable task to use this SDK inside exactly the flow you showed, right? So maybe we can collaborate with you to improve the upstream SDK.

We'd love to do that. Thank you.

Yeah, I do have one question myself, which is about the integration with Dask, because in the high-energy physics community Dask is quite popular, and the integration has been done at different levels, from JupyterLab directly with the extensions and everything. What's your experience with the integration of Dask with Kubeflow? You mentioned a few things, but yeah.

That was pretty seamless, I have to say. Once we found this KubeCluster object and the operator, it was pretty easy to use Dask from within notebooks for exploration, and also reading and writing from and to MinIO worked great.
The integration into pipelines was a bit more tricky, because of the issue I talked about, but other than that it was great. One issue that just popped into my mind: in the default Kubeflow version, it wasn't easy for us to get the Dask dashboard, I think it's also a Dask component, running inside Kubeflow notebooks. We didn't spend a lot of time on it, but we didn't get it running, so we always use port forwarding to see what's going on in Dask.

Okay, thank you. Yeah, sorry, about Dask again: I'm just wondering how you exchange data between your Dask workloads and the other parts of the pipeline today. I didn't get the first bit. How do you exchange your data, basically? Are you using S3 or some blob storage to exchange data between workloads?

Yeah, so we have MinIO running. And in the pipeline we actually do a lot of preprocessing to split up large Parquet files, such that we have nice chunks that we can work on with Dask, and then Dask reads those partitions directly. Does that answer the question?

Yeah, and then how do you use this data in your training workloads? Just fetching it from some blob storage, right?

So, yeah, well, we use Dask, and it is really the way to come up with a smaller, aggregated dataset that we then use for training, one that we can handle using pandas.

I see, so that's a lot of features, right? I see, yeah, thank you.

Yeah, regarding the Dask dashboard: you can definitely integrate that into the pipeline. You can also do it with TensorBoard and any other kind of web application. You can have a pipeline step where you click start and you get easily redirected within the Kubeflow UI to the Dask dashboard. So if you want, please reach out.

Yeah, I will definitely do that. That sounds exciting. I'll do so as well. All right, any other questions? Do we still have a bit of time? If not, let's thank them again. This was a really nice talk. Thank you very much.
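As referenced earlier, here is a condensed sketch of the three-step MinIO workaround. It is an illustration under assumptions, not the project's actual code: the helper names, path prefix, bucket, endpoint, and environment variable names are placeholders (the secret name mlpipeline-minio-artifact and the minio-service.kubeflow endpoint are the usual defaults in a standard Kubeflow install, but may differ).

```python
# Condensed sketch of the three-step workaround; names are assumptions.
import os
from minio import Minio

# (1) Inside the lightweight component: derive the MinIO object key from the
# KFP artifact path. dataset.path points at the artifact as mounted in the
# task pod, and the tail of that path matches the object key in MinIO.
def artifact_to_minio_key(local_path: str, prefix: str = "/minio/mlpipeline/") -> str:
    # e.g. "/minio/mlpipeline/v2/artifacts/.../train.parquet"
    #   -> key "v2/artifacts/.../train.parquet" in bucket "mlpipeline"
    return local_path.removeprefix(prefix)

# (2) When building the worker pod spec (Katib trial, PyTorchJob, Dask
# worker), inject the credentials from the MinIO secret into the container
# environment, e.g. via secretKeyRef entries in the pod spec:
#   - name: MINIO_ACCESS_KEY
#     valueFrom: {secretKeyRef: {name: mlpipeline-minio-artifact, key: accesskey}}
#   - name: MINIO_SECRET_KEY
#     valueFrom: {secretKeyRef: {name: mlpipeline-minio-artifact, key: secretkey}}

# (3) Inside the shared image: a helper that reads the object directly,
# so every worker pod can fetch its own data.
def download_from_minio(key: str, target: str, bucket: str = "mlpipeline") -> None:
    minio_client = Minio(
        "minio-service.kubeflow:9000",  # in-cluster MinIO endpoint (assumption)
        access_key=os.environ["MINIO_ACCESS_KEY"],
        secret_key=os.environ["MINIO_SECRET_KEY"],
        secure=False,
    )
    minio_client.fget_object(bucket, key, target)
```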