Hello, and welcome to our talk on MLOps for ground operations of the ISS. We are really excited to be here, and especially excited that so many of you made it to this talk at such a late hour and aren't partying yet down in the exhibition hall. Before we start talking about MLOps, we'll give you a brief introduction about ourselves and how we got here.

I'll start. My title nowadays is data scientist, but I haven't done any data scientisting in quite some time. Instead I've been a data engineer, data architect, something like that — basically building all of what we are going to talk about, as part of a larger team. I used to be a physicist, did a PhD there, and got deep into data analytics, running bigger clusters and so on. After that I was a data science consultant for some time; that's where I picked up wearing a jacket like this — I don't usually look like that. But I'm a long-time UNIX nerd, which really helped me out here, because I actually had no previous exposure to Kubernetes at all. Obviously I knew what it was, but I had never really touched it. Without those 20 years of UNIX experience, this would have been a much harder journey.

Hi, I'm Samo. Similar role as Christian, slightly different background. My background is in life science: I have a PhD in computer-aided drug design, so I spent quite some years working in pharma and after that in consulting before joining this company. I also have decades of experience working with open source and Linux. Before embarking on this project I had some experience with Kubernetes, but only from the convenience of the cloud.

In this talk we will just give you a high-level overview of the use case, but we'll then mostly focus on our approach towards designing and building this MLOps platform — on-prem, without the cloud. We'll talk about the major components we used and how we decided on them. We'll touch on automation, persistent storage, networking, logging and monitoring, and finally we'll give you some tips and tricks and some useful tools that we discovered during this journey. What we won't be talking about is any Kubeflow details, running pipelines, ML jobs, things like that. We also won't go into any details about anomaly detection, root cause analysis, and the other algorithms we're developing. And unfortunately we also don't have time to go into any details about the International Space Station itself.

The work here is sponsored as part of a project called KISS, which is about AI for the International Space Station. We're getting telemetry data from the International Space Station, and we're trying to develop algorithms for anomaly detection, diagnostics, and finally also reconfiguration — so if something goes wrong, the systems can be reconfigured to remedy the problem. There are four project partners: two German universities, Airbus, and us. The universities are focused on the models, while Airbus and we are mostly focused on the platform itself; Airbus also provides us with data. Just to give you a rough feeling, there are about five or six people working on this platform, and not full time.

Still, a brief bit of info about the International Space Station: the data we are getting is actually from the Columbus module. This is Europe's first permanent outpost in orbit, and it's actually a space laboratory.
It weighs 10 tons, takes up to nine tons of payload, and it's seven meters long. You can see it here within the International Space Station, which in contrast weighs 440 tons, so it's a really massive thing. Let's have a closer look. Being a laboratory, it has 16 racks for experiments and infrastructure, but primarily, as mentioned, we are interested in the sensor data, the telemetry data. There are thousands of parameters recorded at one hertz sampling, and this ends up at around 10 gigabytes of data per year. Why is this important for working with data, having models and so on? Because microgravity really introduces some unique challenges: without gravity there is no convection, air is not mixing, and you might get local pockets of CO2, which is obviously not good for the astronauts. That's why it's really crucial that ventilation always works, and that's why we have to monitor it.

Since this is sensitive data, it's not allowed to go into a public cloud, so we had to develop the platform so that it runs on all project partner sites. That means bare-metal clusters and corporate data centers — we're still using the cloud, but only for development. Another requirement was that we can only use open source components, and the end users are data scientists and aerospace engineers, not necessarily software engineers. We quickly realized that leaving the cloud behind means losing a lot of comfort: we suddenly had to take care of installing and configuring Kubernetes, of storage, IAM, and all those other things.

As you do nowadays, we started with a workshop and collected ideas about what this platform should actually offer — what kind of functionality, building blocks and so on. We decided quite soon that we were going to use Kubernetes, which also explains why we're here, because it covers a lot and is a really flexible platform. We also quickly decided to go with GitLab for our Git and CI/CD needs, and when it comes to the data science part, we quickly zeroed in on Kubeflow. Those are the three key pillars for us, and some other components followed quickly.

Just to guide you through those key pillars: GitLab — why? Because it's open source and it ticks a lot of boxes when it comes to source code, CI/CD, and so on. For Kubernetes we had to pick a distribution, and after taking a while to evaluate a few, we decided to go with MicroK8s. The reasons: it's really easy to deploy, it works on most Linux distributions, it supports single-node as well as multi-node deployments, and it's a CNCF-certified distribution. It was also very nice for us that it has a lot of add-ons, which get you started really fast, so we had things up and running quickly. And if you want — which we also did — you can replace those add-ons once you need more flexibility. Finally Kubeflow: it's a complete data science workbench, it's Kubernetes native, it's a mature and stable product, and it's actively developed, so it was an easy pick.

Very soon we also realized that we would have to do some automation. As it says in the Kubernetes community values, heroism is not sustainable, and it's especially important for small teams to automate as much as possible, since we have different targets for deployment and so on.
We also had to learn a lot, especially initially, so it really made sense to immediately encode that knowledge into code. Our platform can be completely deployed automatically, hitting all those different environments. Automation enables quick iterations: a complete deployment, from bare VMs to things running on top of Kubernetes including Kubeflow, takes about one hour. It's reproducible, and it's also scalable — it's easy to add new nodes, deploy new environments, and so on.

Our key components for deployment: first, Ansible, our tool of choice when it comes to infrastructure as code. You work with YAML, it's implemented in Python, it supports templating, and importantly it's agentless — you only need SSH access, no agents or daemons on the managed hosts. The next one is GitLab CI/CD: again you define everything in YAML, it's very flexible, you can build software packages and containers, and in the end also deploy to Kubernetes. And finally, we also decided to use Kustomize. When you start deploying things to Kubernetes, you will very likely start with Helm, because a lot of software is packaged with Helm and it's easy to deploy, but there are also heated online discussions about which is better, Helm or Kustomize. The two tools are actually quite different: one is imperative and supports templating, the other is declarative. Kustomize is built into kubectl, works with plain YAML, is very minimal, and you work with overlays and patches. We quickly realized that Helm was not flexible enough for us, so we decided to go with Kustomize, because we needed to customize more things than are normally exposed in values.yaml.

How do we do it? We have a separate repo for all the manifests. We try to get official manifests; if those are not available, we render Helm charts with helm template, and everything goes into this repo. On top of this we have Kustomize overlays to make the changes we need, and in the end we simply use GitLab CI/CD to deploy — kubectl apply -k — and things are deployed on our clusters. A lot of software also comes with an operator; if you have that option, go for it, it's going to make your operations easier. We also have a custom operator to handle dynamic changes — for example when there are new users, the operator kicks in and takes care of the necessary changes. For this we went with shell-operator, because it's really easy to implement things with it.

Okay, so the next big thing for us was storage. Storage, as you know, was initially not really well supported in Kubernetes — Kubernetes was not developed for stateful workloads — but nowadays it is supported, and you will need persistent storage. There are two big types: block storage and object storage. First we had to wrap our heads around how all of this works, so just to give you a quick rundown: you have persistent volume claims, persistent volumes, and storage classes. Persistent volume claims are the claims requested by pods, and they are namespaced. Based on such a claim, a persistent volume is created, either manually or dynamically; that's a cluster resource, and it's then mounted into a pod.
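Coming back to the Kustomize workflow for a moment, here is a rough sketch of what that repo and an overlay can look like. The directory names, patch files, and image name are purely illustrative — not our actual repo — and the patches simply stand in for whatever a given site needs to change.

```yaml
# Illustrative layout (not our actual repo):
#   manifests/base/             <- official manifests, or Helm charts rendered with `helm template`
#   manifests/overlays/site-a/  <- per-site changes
#
# manifests/overlays/site-a/kustomization.yaml
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  - ../../base                      # reuse the unmodified base manifests
namespace: mlops                    # force everything into one namespace
patches:
  - path: storage-class-patch.yaml  # e.g. point PVCs at the local storage class
  - path: hostname-patch.yaml       # e.g. site-specific hostnames / certificates
images:
  - name: registry.example.com/platform-component
    newTag: "1.4.2"                 # pin the version deployed at this site
# GitLab CI/CD then deploys this with: kubectl apply -k manifests/overlays/site-a
```

The nice part of this split is that the base stays untouched and each site only carries its own patches.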
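And to make the claim side of storage concrete: from a pod's perspective, a persistent volume claim is just a few lines of YAML. This is a generic sketch — names and sizes are made up, and the storage class is just an example of a hostpath-style class.

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: experiment-data                 # illustrative name
  namespace: team-a                     # PVCs are namespaced, unlike the PV they bind to
spec:
  accessModes:
    - ReadWriteOnce                     # must match what the provisioner supports
  storageClassName: openebs-hostpath    # tells Kubernetes which provisioner should create the PV
  resources:
    requests:
      storage: 20Gi
```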
Behind the scenes you have storage classes, which depend on the storage engine or provisioner, and which define things like replication and access modes. For our block storage we decided to go with the OpenEBS ecosystem: we are using OpenEBS hostpath and Mayastor. Mayastor is for high-availability deployments, because we also do single-node deployments where needed. Mayastor provides replication, but it needs a minimum of three nodes and it's quite picky when it comes to hardware performance. This is what our storage classes look like: two Mayastor storage classes and one OpenEBS hostpath. The final storage component is MinIO, which provides S3-compatible object storage for us. We went with it simply because it was already included with Kubeflow, though we did update it to a later version. Newer versions are under the AGPL license, which might be a problem for some organizations, but for us it's fine and works well. And with this, a handoff to you.

Thank you. Let's talk a bit about how we actually get users into the applications we host on Kubernetes. Here we use Istio. To be honest, we didn't actually choose it ourselves — it ships with Kubeflow. As a small team we can't do everything ourselves, and we can't rip it out of Kubeflow either, so we decided to go with Istio as our service mesh. You can see some examples here: we use it for routing, so that, for example, /minio gets you to the web interface of MinIO. We also use it for authentication, which is very nice: if you're not logged in, it automatically redirects you to our identity provider, which we'll come to in a second.

But this doesn't yet answer how we actually get users into the cluster, because so far everything is just running inside the cluster. Originally we looked at something called MetalLB, but that seemed, at least for now, a bit too complex for our small team. So we did something very simple: we just use NGINX as a proxy and have our users access NGINX. That obviously isn't very resilient — if that one host machine goes down, everything is down and our users can't access anything anymore. So what we're really doing at the moment is something more like this: we have several host machines, each with an NGINX proxy on top, and then we take two different approaches depending on the deployment. Either we really just give users three different URLs — in case something goes down, which doesn't happen very often, our users are at the moment technical enough to just select another one. Obviously this won't hold in the long run; when we really run models in production this won't be a solution, and then we might have to look into something like MetalLB again. But for the moment this works pretty well, especially together with DNS failover, and we don't need it on every single node — sometimes we have bigger deployments or add more nodes, which then don't come with NGINX on them.

While we're on the topic of networking, let's quickly talk about TLS. We can't go into too much detail here, but as probably all of you know, it's important, and important to get right. On the other hand it's also hard, so some people might be tempted to say: forget it, we trust our users, let's disable it. We would say: don't do that.
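To sketch what that /minio routing rule looks like in Istio terms: it's essentially a VirtualService like the one below. This is a rough, hedged example — the gateway and service names follow the Kubeflow defaults as we remember them, so check your own manifests.

```yaml
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: minio-web
  namespace: kubeflow
spec:
  hosts:
    - "*"
  gateways:
    - kubeflow/kubeflow-gateway          # the ingress gateway shipped with Kubeflow
  http:
    - match:
        - uri:
            prefix: /minio/              # what the user types after the cluster URL
      rewrite:
        uri: /                           # strip the prefix before forwarding
      route:
        - destination:
            host: minio-service.kubeflow.svc.cluster.local
            port:
              number: 9000               # MinIO's service port in the Kubeflow manifests
```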
And if you've been to any of the security talks here, that advice — don't disable TLS — will probably have been confirmed there. So what do you do instead? You need to learn about TLS; unfortunately there is really no way around that. But once you dive into it, it's really not that hard after all. Some topics you might want to pay special attention to: first, how certificate signing works, and how you distribute those signed certificates to your hosts. We use Ansible for that as well, and the certificates are needed in different places — on GitLab, but also on our Kubernetes nodes. Then you need to think about how, and by whom, your certificates get signed. You can self-sign, which we do for our development clusters, which we spin up daily if not hourly. But we also use Let's Encrypt, for example, or have certificates signed by the root certificate of the organization we are currently deploying to, which could be Airbus or one of the universities. Unfortunately that's about all the time we have for TLS, but as I said, it's an important topic.

What we also already somewhat mentioned is authentication. Here again we use Dex, because Dex comes with Kubeflow. I've seen some people on the internet who actually managed to rip Dex out of Kubeflow and use Keycloak instead, which we would have liked to do as well, because for Keycloak there's just a lot more information available online, more documentation, and it's easier to connect it to other things. We would have liked to use Keycloak so we could also connect it to Grafana, for example, or whatever other applications we're using, and on the other end use Azure AD or GitHub or whatever as an authentication provider. But luckily the two can be connected pretty easily: Kubeflow itself keeps using Dex, but Dex then talks to Keycloak, and that way you can hook into whatever authentication provider you want. That works pretty well out of the box. A note on Keycloak: at least the version we are currently using is not really Kubernetes native yet, so configuration is done via JSON, which is pretty ugly. Currently we configure it through the web interface, export those JSONs, and then load them back into Keycloak. There are newer versions available that use an operator with custom resources and custom resource definitions. We haven't tried that yet, but it looks like it will be a lot easier to handle in the near future.

Then let's talk quickly about how you find out what's going on in your cluster. I've been to that exhibition hall down there, and every second booth seems to offer some observability solution. We are currently not using anything as fancy as that: we went with Prometheus and Grafana, simply because it's easy, there's lots of documentation, lots of people use it, and there are resources available for almost everything. For example for KServe, which we would use for model serving, it's well integrated — you get pre-configured monitors and dashboards for almost everything you can think of. That is really very nice and we are pretty happy with it. We currently use the kube-prometheus project to also monitor the cluster itself, because we live on premises and don't get that from a cloud provider. That is also very nice because it comes more or less pre-configured.
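Coming back to the authentication chain for a second: wiring Dex to Keycloak boils down to an OIDC connector entry in the Dex configuration, roughly like this. Again just a sketch — the realm URL, client ID, and secret handling are placeholders for whatever your Keycloak setup uses.

```yaml
# Fragment of the Dex config (typically held in a ConfigMap in the auth namespace)
connectors:
  - type: oidc
    id: keycloak
    name: Keycloak
    config:
      issuer: https://keycloak.example.com/realms/mlops      # placeholder realm URL
      clientID: dex                                          # client registered in Keycloak
      clientSecret: $DEX_KEYCLOAK_CLIENT_SECRET              # injected from a secret / env var
      redirectURI: https://kubeflow.example.com/dex/callback
      insecureSkipEmailVerified: true                        # often needed with internal IdPs
```

Kubeflow keeps talking to Dex as before; Keycloak then decides whether that means Azure AD, GitHub, or something else entirely.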
With kube-prometheus you just need to make sure to use the right version, one that fits your Kubernetes version, and then we can highly recommend it as well. It also provides the metrics API that some software needs — Kubeflow, for example — so it's a full-featured solution and we're very happy with it. We use a second instance of Prometheus for application monitoring, just because we try to mess as little as possible with the Prometheus instance managed by kube-prometheus. We also installed Grafana separately, mostly because we wanted a newer version: later versions of Grafana come with a query builder, so if, like me, you don't really know the query language that Grafana uses, you still have pretty easy access to whatever you want. For us that looks like this.

Then logging — logging is also very important, perhaps even more so than monitoring. We use something else from Grafana there: Loki. We're also very happy with that, and the users here are not only cluster admins but also the end users. I'll quickly give you an example: in Kubeflow, if you start a job, that's also a pod, and if that pod gets cleaned up after some time, your logs are gone. But perhaps you only check on a Monday morning what happened to your job — the job failed, you would like to see the logs, they're no longer in the Kubeflow user interface by default, but you can very quickly find them in Loki. We use the bigger stack here, PLG, which is short for Promtail, Loki, Grafana — but we don't use that bundled Grafana, because we want one unified interface. So in total it looks like this for us. And if you were at the talk before: we also use Postgres, actually the same operator Karen talked about, but there as well we don't use the Grafana that comes with it — we use just one Grafana instance to get all the data we want in one place.

One big component we haven't talked about at all here is Kubeflow itself, and we're not going to go into detail. Obviously we really like Kubeflow, otherwise we wouldn't have decided to use it as the main machine learning ops component of our platform. Our users also really like it, especially the pipelines I just mentioned — perhaps after a bit of initial hesitation, because it's a bit more complicated than just using a Kubeflow notebook, they now really start seeing the value in it and are excited about it. What's also very good about Kubeflow is that it's very flexible: if it doesn't exactly fit your use case, you can very quickly make it fit. On the other hand, that flexibility also means the defaults may not be exactly where you need them to be. So Kubeflow can be modified rather easily, but perhaps it also should be modified to actually get out of it what you want. Some examples: limits for users are not there by default. On one of our very first deployments, a data scientist found a feature in the pipelines to start concurrent pods, tried 1,000 pods at the same time — worked fine — then tried 10,000 pods, and it turned out etcd didn't like that too much, and that brought the whole cluster down. Since then we set limits for users, not by hand but through that operator we wrote, which was mentioned earlier. Then there are quite a lot of default containers, but they also might not be exactly what you're looking for.
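On that limits point: what such an operator effectively ends up applying per user namespace is in the spirit of a plain ResourceQuota — something like the sketch below. The numbers are illustrative, not our actual settings.

```yaml
apiVersion: v1
kind: ResourceQuota
metadata:
  name: user-quota
  namespace: data-scientist-1        # one quota per Kubeflow profile namespace
spec:
  hard:
    pods: "200"                      # caps concurrent pods, e.g. from pipeline fan-out
    requests.cpu: "32"
    requests.memory: 128Gi
    persistentvolumeclaims: "20"
```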
As for those default containers, it's pretty straightforward to create your own and expose them to your users, so they get something that's a sensible default for your organization — we would recommend doing that. Access to the Kubeflow pipelines isn't there by default either; that goes through PodDefaults, which we also supply, and we would recommend you do the same. In general, the documentation around PodDefaults isn't the best; we could probably contribute more, and we've just started doing that. What is very nice is the community: there's a Slack channel and people are really very, very helpful, so if you have any problems, reach out there — I can only recommend it.

We're coming more or less to the end of the presentation, and before we go we would like to leave you with some tips and tricks. As you can see, we had quite a bit of fun with Midjourney — all of these pictures are generated with it. Some ideas on debugging: we use kubectl logs and kubectl exec a lot, but we also like k9s and Lens very much for having a look at what's currently going on in the cluster. We can highly recommend a YAML validator for your IDE; Samo and I are both using something from Red Hat called, I think, yaml-language-server, which is part of the LSP ecosystem, so more or less no matter which editor you're using, you can use it to get your YAML validated on the spot. We also really like jq and yq to pre-process YAML, sometimes even before we do deployments — to make something fit the environment variables we set, and so on.

A reminder: check your hardware requirements. Mayastor, especially, really wants low latencies, for example, and other requirements might also need to be checked — something we didn't do from the start. MicroK8s, for example, doesn't come with RBAC enabled by default; you need to enable it manually. We started without it and enabled it later, which led to some issues that would have been much easier to avoid if we had had it from the start. We already mentioned shell-operator, but another tool we'd like to recommend is Bats, the Bash Automated Testing System, which we use to test the platform itself. Finally, one of my least favorite bugs we ran into: if you have huge pages enabled on your host machines — which we needed where we run Mayastor — then Postgres might have issues if you don't explicitly give it huge pages, as you can see on the right. So if you run into issues like that, have a look; at least we didn't find the error messages too helpful. It's a bit of a discussion whose fault it is, Postgres's or the Linux kernel's, but it seems to be getting fixed.

So this brings us more or less to the end of our presentation. Let me quickly summarize. We showed you how we built an MLOps platform and which components we chose — not all of them, obviously; there are lots of topics we didn't touch at all, like secrets management or how we do upgrades, but not everything fits into the 30 minutes we have. Some key takeaways: we already mentioned the first one, automate as much as possible, especially if you're a small team — one day invested now might save you a week or more later. So we would definitely say do that.
Then, don't be afraid to ask for help on Slack channels and so on — but also don't forget to give back later if you figured something out or can help somebody else. And for me personally, I've always liked to just experiment with stuff, but if you dive into something new — at least new for me — like Kubernetes, which is really such a big ecosystem, it also makes sense to pick up a book or something and learn a bit about the basics.

To conclude: building something like this with a very small team, especially on premises, is hard and can't be done in four weeks. It took us something like six months, I guess, before people were really productively using it, and we haven't figured everything out completely yet — we are still iterating, improving and so on, and I think we will keep doing that for quite a while. But the good news is: if you really commit to it, it totally can be done, even with a really small team.

Before we come to the questions, I would like to remind you that this was a group effort, even though it was only the two of us talking. Lots more people were involved; some of them are also here — I've already spotted a few. So if you want to come by afterwards and chat about any of the topics we just mentioned, they'll probably also be hanging around. Also, some of those people really know a lot about space. So if you were here for the space part and were disappointed that we didn't go into it, we can also make that happen afterwards. Thank you for your time, and please don't forget to rate this talk. Thank you.