Nice to see you all, there are so many of you. Are you sure you're in the right talk? I'm not so sure about that. So, welcome to our session. Today we're going to walk you through our journey at DHL and how we adopted Kubeflow as our main platform for operationalizing machine learning at scale, meaning all the steps we had to go through, starting from batch scripts all the way to our current solution and setup. I'm Luca, and alongside me is Dennis. We're both part of the MLOps engineering team. I'm a lead MLOps engineer. I joined DHL less than a year ago, but I'm a long-time MLOps engineer: I started in 2018, and I also have a background as a data scientist. I have used Kubeflow in different setups, and this is the largest setup I've used it in. I'm also here not just to represent DHL, but to represent Kubeflow from a user perspective. I've used it from very small companies to bigger ones, and we'll also go through some of the journey I've gone through outside of DHL. So, alongside me is Dennis. Thank you, Luca. Yeah, a very warm welcome from my side as well. It's really crazy to see such a huge crowd. My name is Dennis, and I'm a senior MLOps engineer in the DHL Data & Analytics team. I started as a data scientist in 2020, so roughly four years in the company now, building machine learning applications for DHL Group. I transitioned to the whole domain of MLOps engineering, machine learning operations, roughly in 2021, when we started to scale our solutions and productionize them. Four years of Kubeflow, and that's pretty important, I think, because today we're also going to give you a rundown of our experience, from starting with a practical setup to now having a very sophisticated setup where we can leverage Kubeflow as our data science platform. I have been part of this journey for the last four years as well. I want to briefly give you a short overview of our company. We are DHL Data & Analytics, part of DHL Group. I guess all of you know who DHL Group is; they have probably delivered a package to you at some point, or maybe even Formula One cars, I don't know. We are a huge logistics partner for Formula One as well. Our team specifically, DHL Data & Analytics, has been doing machine learning projects for the whole group for roughly seven to eight years now. We started very small, and now we are a team of more than 100 people specifically building machine learning applications to solve real business problems and challenges in the group. That means, for example, for DHL Express, or Post & Parcel Germany, but also Global Forwarding. How our team is structured is very important. We have around 50 data scientists. They are the ones experimenting with data, building machine learning models, and seeing whether they can solve business problems with machine learning, or also without; that's not a necessity, right? Sometimes it's much easier without. We also have a team of 25 data engineers. They make sure that, across this very diverse landscape of projects, we have a common connection to the data we use. In our case, we have a very large data lake connected to our data science platform, but the business divisions don't really have a standardized data system, so we have to build some kind of connectors there.
Luca and I are both MLOps engineers. That's a newly formalized role for us, building on the whole idea of machine learning operations as a mindset; for us, it's an actual role. We currently have around 15 MLOps engineers in our team, and they are the ones bringing these solutions into production and making sure they keep solving problems over the long term. And then we also have platform engineers. We have a dedicated platform team, which is very important as we go through our journey today; two of them are here as well. They make sure that the platform we, as users, rely on for building machine learning products is robust and scalable. Now I want to give you a brief rundown of our legacy project setup. We started in 2016 building data science solutions for the group. Back then it was three people, I think, and they were working in the standard setup you might know if you're doing data science work. There was a business problem; there was a data scientist who connected with the business and received some kind of flat file with data; and they tried to solve the problem with all the tools they had, for example extrapolating data for some kind of capacity planning and so on. That's the first part of such a project, usually, and it still is. Back then, the result of this process would be some kind of model that solves a certain problem. And then, at the other end, there is the operationalization. Back then it was done by data engineers: they received some kind of model file, for example a pickle or joblib file, and somehow brought it into production. And as you can literally see here, there's a huge gap between those two. That gap is where the machine learning application is actually built, because machine learning applications are also just software applications that happen to include a machine learning model. Everything that is relevant for software development is also relevant for us; we just have some extra bells and whistles around data and model configuration. So we had this gap, and we realized it especially when we looked at how these projects were done back then. We started with PoCs in Jupyter notebooks, set up on local machines or even shared workbenches, meaning some kind of virtual machine that all the data scientists connected to so they could share their work and the data. Then some Python scripts were built, most of the time, not always. And I can tell you from personal experience: bringing Jupyter notebooks into production is not something you want to do. So Python scripts were the intermediate step, structured into some kind of components. For everyone who isn't deeply into machine learning, that's usually something like pre-processing, feature engineering, model training, and so on. The orchestration was usually done with batch scripts; at some point we also used makefiles. That's also a little bit where the title of this talk comes from, because that's basically what defined our process back then (you'll see a small sketch of that glue code below). In the end, there were pipelines running on some Linux virtual machines. Around 2018 or 2019 we also introduced containerization, but back then not on any kind of Kubernetes cluster. And machine learning workloads especially need some kind of scheduling; back then we did it with cron jobs and Airflow.
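To make that concrete, here is a purely illustrative sketch (not our actual code) of the kind of glue script that held such a legacy pipeline together: sequential steps, no experiment tracking, no reproducibility guarantees. All script and file names are made up.

```python
# Illustrative legacy-style orchestration: run each stage as a separate
# Python script, in order, and abort the whole run if any stage fails.
# A cron entry (or a makefile target) would re-run this on a schedule.
import subprocess

STEPS = [
    ["python", "preprocess.py", "--input", "/data/raw.csv"],
    ["python", "feature_engineering.py"],
    ["python", "train.py", "--model-out", "/models/model.pkl"],
]

for step in STEPS:
    subprocess.run(step, check=True)  # check=True aborts on a non-zero exit
```

The pain points described next, like having no record of which data or configuration produced which model, fall out of exactly this structure.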
Cron jobs are something we still use, in a sense, in Kubeflow, because it's very important for machine learning applications to be able to re-run: even if the code doesn't change, you have to update the model to make sure its performance is still there. With this setup we faced a lot of challenges. Above all, it encouraged a pretty messy work mode from the beginning. This whole idea of bringing notebooks into production is not something you want to do, because notebooks were designed for experimentation, for fast iteration, for trying things out and looking at your data. You don't want them in production. We also faced a lot of reproducibility issues: with this setup, especially the batch script setup, we were not really able to reproduce certain model training results, because we maybe knew which source code we used, but we didn't really know which data or which model configuration. And then there are the more technical aspects: instability and limited resources are things you always face when running on some virtual machine, because you cannot scale your workloads according to your needs. So we needed to find a platform that enhances our work practices, reduces the developer siloing we were seeing back then, improves reproducibility, and of course provides more flexible resources for all the machine learning workloads we were facing. Before we dive into the technical part, I also want to highlight that we changed how we do projects. We are still very project-based (we are currently transitioning a little more toward a product mindset), but projects have been done as you see here, and we moved from that process to a new, more end-to-end process. We still have mostly the same roles, but we added MLOps engineering. What usually happens now in a project is this: we have the business problem at hand, the data scientist works on it, and then pretty early, sometimes in the PoC phase and at the latest in the implementation phase, an MLOps engineer gets involved who actually builds the machine learning application. We build these applications with all the DevOps principles in mind, and we also run and support them. So we now have this end-to-end approach for all projects, which means we think about production from the beginning for all our machine learning applications. Data engineers remain very relevant, because they provide the data in a robust format from our data lake so that we can build our models. And then again there is the platform team, who manage our data science platform, which we'll talk about in a second. But to really use this process efficiently, we were lacking the actual tools to do it, and that's where we introduced Kubeflow. With that, I'll hand over to Luca. Thank you, Dennis. OK, let's pause this journey within DHL for a second, take a step back, and think about when people start to think about Kubeflow. They can come from very different areas, but what I've usually seen is that you approach Kubeflow when you want to experiment, when you want something that is really easy to use and really easy to set up at the beginning. And typically, notebooks are the first tool you start with.
And it's not just about having notebooks there, because of course you could use JupyterHub or whatever tool you want to start experimenting with notebooks. It's also a matter of enforcing the way you work: it's not just spinning up an instance through the UI, it's also the chance to actually build the Docker image that contains everything your notebook needs, and so to start enforcing good development practices from a data science perspective. And this flexibility stays with you the further you move along in the adoption. Take KServe, for example. When you have to start deploying an actual API, you'll find the same degree of flexibility that you experienced in the very early stages with notebooks. You can even deploy a KServe API directly from a notebook (you'll see a small sketch of this below). That's one of the key takeaways I'd like to give you on this slide: the degree of flexibility remains constant the further you move along in the adoption of Kubeflow, without losing the freedom and explorability and all those nice things that data scientists and developers always want. But it's not all unicorns and rainbows, of course; this comes at a cost. Because on the other side, as we said, you start very simple. You can use your minikube and install KServe in a matter of minutes. But the further you move in the adoption, the further you move in operationalizing your machine learning workloads, the more team support you will need: more people to support the platform, not just at a high level but also on the infrastructure side. As we already mentioned, the support we get from the platform team is really important, because otherwise it would be very difficult. But, as I said, you started easy, so sometimes it's really cumbersome to pitch this to your management. Because you started easy, from the upper layers of management it doesn't look like you have a problem. And then at some point you start asking for a team to support the platform and the infrastructure, and you start asking for budget, and that's sometimes really difficult. I have to admit that I found really technical management here, and that was extremely important, because they truly understood the business needs and the impact this might have on the technical side. But it's not guaranteed that you get this every time, and it can be really difficult to pitch. For the rest, if you compare Kubeflow with other platform-as-a-service solutions, it's really the good old story of high investments paying off. With Kubeflow you have to invest a lot, in people and quite a bit in infrastructure, but you keep the flexibility and you stay very fast in adapting. We'll see in a second that we were able to deliver a dozen PoCs in the generative AI field in six months or so with a team of four people, and that was extremely important. It's also worth mentioning that with Kubeflow we were able to standardize our projects in a very good way: how we structure the GitHub repos, how we structure the namespaces. With other vendors that can sometimes be cumbersome; it really depends on your use case and what you do.
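To make that "deploy from a notebook" point concrete, here is a minimal sketch using the KServe Python SDK. The model name, namespace, and storage URI are made-up placeholders, and we're assuming a scikit-learn model, for which KServe ships an out-of-the-box runtime.

```python
# Minimal sketch: create a KServe InferenceService from Python, e.g. inside
# a notebook running in the project namespace. Names and URIs are illustrative.
from kubernetes import client
from kserve import (
    KServeClient,
    V1beta1InferenceService,
    V1beta1InferenceServiceSpec,
    V1beta1PredictorSpec,
    V1beta1SKLearnSpec,
    constants,
)

isvc = V1beta1InferenceService(
    api_version=constants.KSERVE_GROUP + "/v1beta1",
    kind="InferenceService",
    metadata=client.V1ObjectMeta(name="demo-model", namespace="my-project"),
    spec=V1beta1InferenceServiceSpec(
        predictor=V1beta1PredictorSpec(
            sklearn=V1beta1SKLearnSpec(storage_uri="s3://models/demo-model")
        )
    ),
)

KServeClient().create(isvc)  # the resource is reconciled like any other CRD
```

And because the InferenceService is just a custom resource, the same definition can be exported as a manifest and tracked in Git, which is exactly the GitOps angle we come back to later.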
So, how does it look from a technical perspective? I hope our colleagues from the platform team will excuse me for this very high abstraction level. But just to give you an idea: of course, we start from Kubernetes clusters. We have both on-prem and cloud Kubernetes clusters. For every project, we start from a namespace. A set of users is assigned to the project, and they are assigned to the namespace and to the GitHub repo, of course. Here you can see two examples of projects using some of the Kubeflow tools; the combination can differ depending on the use case itself. Let's start with the first one, a typical use case where you have to develop something, usually a machine learning API that you will eventually deploy as an inference service. You start from a user notebook. Every user creates their own notebook, but we're not using the UI: in our case, we connect to the notebook via the VS Code remote extension. That was really good for developers, because they found a familiar place to work. It was nothing new, and it was really great to start working in VS Code. You develop there, in your own environment, very secure and very isolated, and then you start building the inference service. In those cases we also use volumes, for object storage, model versioning, you name it (a small sketch of requesting such a volume follows below). Those volumes are provided for us by a separate storage cluster that the platform team runs, and they can be dynamically or statically provisioned. Statically provisioned volumes are typically the ones used by data engineers as the point of contact with the data scientists and MLOps engineers. Say we have an ETL pipeline that processes some data used in a specific project: that data eventually lands in a statically provisioned volume managed by the platform team, and the volume is then available in the project namespace, used by an inference service or by a Kubeflow pipeline that might deliver predictions to a different data source. We run a standard stack for monitoring, meaning Grafana, Prometheus, and Kibana. And for CI/CD, as we'll see in a second, we can either go with standard CI/CD pipelines using GitHub Actions, or sometimes with Argo CD applications.
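As a small illustration of the volume side just mentioned, here is a sketch of requesting a dynamically provisioned volume in a project namespace with the Kubernetes Python client. The names and size are made up, and the real storage class and access modes depend on how the platform team has set up the storage cluster.

```python
# Minimal sketch: request a dynamically provisioned volume (PVC) in a
# project namespace. Namespace, name, and size are illustrative.
from kubernetes import client, config

config.load_kube_config()  # in-cluster code would use load_incluster_config()

pvc = client.V1PersistentVolumeClaim(
    metadata=client.V1ObjectMeta(name="project-data", namespace="my-project"),
    spec=client.V1PersistentVolumeClaimSpec(
        access_modes=["ReadWriteMany"],  # shared by notebooks and pipelines
        resources=client.V1ResourceRequirements(requests={"storage": "50Gi"}),
    ),
)

client.CoreV1Api().create_namespaced_persistent_volume_claim("my-project", pvc)
```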
So this setup, and I'm sorry if I rushed through it a bit, really allowed us to cover a very wide spectrum of projects, and I think that was really important. It was a journey, and it was not easy to get here, but I think we are now in a very good position. Think of a two-dimensional project space with project complexity on the x-axis and time to target on the y-axis, divided into four quadrants. The bottom right corner, complex projects delivered quickly, is where we would all like to be, but that's just utopia: the more complex a project is, the more time it will usually take to reach the target. But it's really easy to fall into the top left corner. Maybe you have something very simple, very fast, that you'd expect to deliver in a few days, and it doesn't happen. Maybe you have to wait for a platform upgrade because a Python library isn't available yet, or you have to work against the platform to make it behave the way you intended. We experienced some of these troubles, and it's really easy to fall into this quadrant, which I believe is the most dangerous one. What we've been able to do is move onto the optimal diagonal: staying very fast when the project is not complex and, of course, spending a bit more time as the project grows and matures. It's not exactly a diagonal, it sometimes bends toward the center, but I think you get the idea. To give you some examples: as I mentioned, we've been able to deliver generative AI use cases in a very short amount of time, and we're still doing it. It was really easy to start with simple projects and deliver a UI the customer can use to validate whether something is worth investigating further or not. And we've been able to move on to standard projects and very large solutions as well. Worth mentioning is the product classification tool, a very important use case where we assign a code to every parcel that comes in, based on its description and other features. It's really complicated, because it depends on the business unit and on the country. So we used a meta Kubeflow pipeline that orchestrates a dedicated Kubeflow pipeline for every specific use case (sketched below), and I think that's a really good example of what this setup enables. To sum up this technical landscape: what I think is really important about Kubeflow is that it's not, of course, just a UI. Most of all, it's a set of custom resource definitions, and that's the most important thing you get when you start using Kubeflow. For everything you can do, there is a manifest, and you can track that manifest in your Git repo. That enabled a different way for us to deliver applications: the GitOps way. Everything is tracked, so we know that every configuration and every deployment we want to apply to our cluster is in our Git repo, and we can roll it out either with GitHub Actions or, sometimes, with Argo CD.
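Here is a minimal sketch of that meta-pipeline pattern, assuming the KFP v1 SDK (matching the v1 setup discussed in the Q&A). The business units, countries, and the compiled pipeline package baked into the component image are all made-up placeholders, and in a real multi-user deployment the in-cluster client would also need proper authentication.

```python
# Minimal sketch: a meta-pipeline that fans out one run of a use-case-specific
# training pipeline per business unit and country. Illustrative names only.
from kfp import dsl
from kfp.components import create_component_from_func

def launch_training(business_unit: str, country: str):
    """Launch one training run via the KFP API from inside the cluster."""
    import kfp
    client = kfp.Client()  # assumes in-cluster endpoint and credentials
    client.create_run_from_pipeline_package(
        "training_pipeline.yaml",  # compiled package shipped in the image
        arguments={"business_unit": business_unit, "country": country},
    )

launch_op = create_component_from_func(
    launch_training, base_image="python:3.9", packages_to_install=["kfp"]
)

@dsl.pipeline(name="meta-training-pipeline")
def meta_pipeline():
    # The loop is unrolled at compile time: one component per use case.
    for unit, country in [("express", "DE"), ("express", "NL"), ("pnp", "DE")]:
        launch_op(unit, country)
```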
That said, I'll hand over to Dennis again to resume our journey at DHL. Thank you. All right, yeah. After Luca and I shared a bit about the technical landscape and the decision factors that led us to choose Kubeflow, I want to continue the story we started some minutes ago and look together at where we are today and how we do projects now. First and foremost, we decided, and this was a strategic decision, that Kubeflow is our main data science platform. All new machine learning projects in our team, for all divisions (and other divisions are now using our platform with their own data science teams as well), have to happen on Kubeflow, or at least should. And we've had very good experiences with that over the last few years. The second part, from an infrastructure perspective: we decided to isolate projects by namespace. We have namespaced environments on the clusters, bound to a specific project, which gives us isolation on the user level but also on the data level, because we need to make sure there is no data leakage between different projects. This is also very important, because we put a high value on developer experience, for data scientists as well as machine learning and MLOps engineers. Dev work is done directly on Kubeflow. Our usual modus operandi is that developers no longer work on their local machines; they attach to their notebook servers, which run in Kubeflow as their own pods. And that is really, really nice, because VS Code and its Kubernetes plugins give you a very seamless integration, so you can use your favorite IDE for remote development, with access to all the resources you need. Another important part is that we use Kubeflow Pipelines for all the machine learning training we do. We didn't dive too much into why Kubeflow Pipelines is specifically good for machine learning, so let me give you a short rundown. Most importantly, Kubeflow Pipelines lets you separate and compare experiments. You can always see which artifacts every run produced, which data you used, which configuration you used. That is very important for us, because we need to know exactly which setting produced a given model. One part that I personally find very, very important: every component of a pipeline is containerized. That means you don't have one huge workload running in a single container; you have small components, for example pre-processing, feature engineering, model training, prediction, evaluation, each running as its own containerized application. That not only lets you scale each component individually, which, as you can imagine, is very important in machine learning: a huge model might need multiple GPUs, while a little pre-processing step might only need two gigs of RAM, and you can size each one specifically (see the sketch below). It also pushes you to work in a decoupled way from the beginning, because the components communicate with each other only via HTTP, so you have a standard interface between them. And that reduces coupling, and the opportunities for coupling, in your production code.
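As a minimal sketch of that per-component pattern (again assuming the KFP v1 SDK, with toy steps standing in for real pre-processing and training), note how each step gets its own container image and its own resource footprint:

```python
# Minimal sketch: two containerized components with individual resource
# requests. The steps are toys; real ones would do actual ML work.
from kfp import dsl
from kfp.components import create_component_from_func

def preprocess(rows: int) -> str:
    """Lightweight step: runs in its own small container."""
    return f"prepared-{rows}-rows"

def train(dataset: str) -> str:
    """Heavy step: gets a larger footprint and a GPU below."""
    return f"model-trained-on-{dataset}"

preprocess_op = create_component_from_func(preprocess, base_image="python:3.9")
train_op = create_component_from_func(train, base_image="python:3.9")

@dsl.pipeline(name="demo-training-pipeline")
def training_pipeline(rows: int = 1000):
    pre = preprocess_op(rows)
    pre.container.set_memory_request("2G")   # small step, small request
    tr = train_op(pre.output)                # passing output defines the DAG edge
    tr.container.set_memory_request("8G")
    tr.container.set_gpu_limit(1)            # only the training step gets a GPU
```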
And the last important part: all model serving is usually done with KServe. For those who don't really know what KServe is: it's basically a framework that makes it easy to deploy your machine learning models. There are optimized runtimes for the most common model libraries, like PyTorch, scikit-learn, or TensorFlow, but you can also create custom inference services. They are very nicely optimized and already provide out-of-the-box features like queuing and batching. That is our de facto standard for deploying machine learning models, and it is, again, native to the Kubeflow platform. I didn't want to leave without showing you at least a very small demo of this platform. This is just a screenshot, but what's important is in the middle: you can see what these pipelines look like, and that each component is very much its own entity, with the arrows between them building a DAG that you can track. For each component you can also see what goes in, what comes out, and what the configuration was, and this is stored over a long period of time. On the left-hand side you can see all the tools we use on a day-to-day basis. Notebooks are usually the entry point for the developers. These are not Jupyter notebooks as such: we start notebook servers, but they are basically just pods that we attach to for remote development. We have volumes, as Luca already shared with you. We have endpoints: that's our entry point to track the inference services we deployed with KServe, to see the logs and the metrics they produce. We can easily integrate Prometheus metrics into Grafana, because all of these inference services already expose certain metrics out of the box, like throughput or latency, which is very convenient. Most important for data scientists is the experiments section, which belongs to Kubeflow Pipelines specifically. There we can track all the machine learning experiments we've run over time, and we can also cluster them into groups, for example hyperparameter optimization or production machine learning pipelines. Then we can quickly walk over runs and recurring runs: runs summarize all the machine learning workloads executed in our namespace, and recurring runs are the workloads that re-run every week, for example, because we need to retrain our models with new data. And at the bottom you can see monitoring, which in our case usually leads to a Grafana dashboard where we can monitor all our machine learning pipelines, all the running pods, and all the metrics they produce. What we want to tell you with this: an open-source framework provides us, basically out of the box, with everything we need to build, experiment with, and run our machine learning models in production.
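Those recurring runs are easy to set up programmatically as well. Here is a minimal sketch with the KFP v1 client; the experiment name, job name, schedule, and pipeline package are illustrative placeholders:

```python
# Minimal sketch: schedule a weekly retraining run so the model is refreshed
# even when the source code doesn't change. Names and schedule are made up.
import kfp

client = kfp.Client()  # assumes a reachable KFP endpoint and credentials
experiment = client.create_experiment("weekly-retraining")

client.create_recurring_run(
    experiment_id=experiment.id,
    job_name="retrain-demo-model",
    cron_expression="0 0 3 * * 1",  # every Monday at 03:00 (seconds field first)
    pipeline_package_path="training_pipeline.yaml",
)
```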
With that, we come to the summary and the key takeaways we really want to leave you with today. First and foremost: developer empowerment and reliability, enabled in this case through Kubeflow. We've really learned over the last few years how important it is to empower developers to do their job the way they need to do it, without forcing them into a particular work mode, while keeping everything stable. And that is exactly what Kubeflow allows us to do: developers move freely in their notebook servers and so on, but there is a defined gate for getting something into production. The next part is the seamless path from experimentation to production, the gap I showed you at the beginning. Kubeflow reduces this gap significantly, or even eliminates it completely, because the same platform you use for experimentation also runs your production pipelines and even your models on the same infrastructure. And the last part: for us, this whole idea of MLOps is very important and has helped us tremendously over the last few years, including formalizing it as a role, because it bridges the gap between data science and engineering. Data scientists are usually more experimental, more on the science side, and that's good, because they have to be. But someone needs to bridge that gap to productionization, and that's where we as MLOps engineers see ourselves. Before I close, I want to quickly share that we are not alone here today. We have Tim with us, also a senior MLOps engineer, and in the first row we have Usman and Julius, both platform engineers in our team. Again, very important for making sure that everything runs smoothly. Julius, specifically, is also a long-time contributor to the Kubeflow project, and even the leader of the Kubeflow Manifests and Security Working Group. He also gave a talk at the Kubeflow Summit on Tuesday, on security topics, so if you're interested, feel free to check it out. He was also at KubeCon 2023; we've provided some links here as well. And with that, we want to close. Thanks a lot for your attention, and we're open for questions. I think we have a few more minutes; if you have questions, you can simply approach the microphone and ask, or approach us later, whatever you prefer. Hello, ah, there. First of all, thank you so much for presenting. My name is Flaviano Christian Reyes, I work at Bloomberg. If I remember correctly, you're using a Git-based flow for your model registry. How does your team bridge the connection between Git concepts and your model promotion and deprecation flow? Yeah, so the question was how we make sure that all the model tracking is integrated into the whole Git-based approach. We also used things like MLflow, but MLflow was really not the best fit when you're in a Git-centric environment. At the moment we work with Git LFS: we push our models and track and version them with Git LFS, and then pull them as needed, because we need a new deployment not only when the source code changes but also when only the model changes, for example in a weekly deployment. Does that answer your question? Sort of. I was wondering about conventions and so on, but I can talk about it offline. OK, yeah, sure. Any other questions? We'll wrap this up then. Yeah? I just wanted to say that this gap should be properly covered in the future, because Red Hat, within Kubeflow, is working on something called the model registry. That might be interesting for you. Thank you. That was, by the way, Julius from the platform team. Hey, I have a question. The screenshot was still from Kubeflow Pipelines v1, right? Are you moving to v2, and if yes or no, why? That's also a question for the platform team. Julius, if you want to come up here, I'll just give you the microphone. The problem with v2 is that the feature parity is just not there yet. Do you have any specific questions regarding that? Ah, he's already gone. OK: v2 is officially released, but the feature parity is not there yet for pipelines, and we want to make sure, probably by the end of the year, that we support everything the data scientists and the MLOps engineers need. Thank you, Julius. Any other questions? I think time is up. If you want to approach us, we are here. So thanks again. Thanks a lot.