Hello everyone, and thank you for being here with us today at KubeCon 2023. Our talk is about hardening Kubeflow security for enterprises. I'm Diana Atanasova, a software engineer at VMware's open source program office, and I'm currently contributing to the Kubeflow project. Today I have the pleasure of co-presenting with Julius von Kohout, a freelancer working at DHL who has been contributing to the Kubeflow ecosystem for over two years and is also a founder of the Kubeflow Security Working Group. Do you want to add something? Okay. And how many of you are planning to use Kubeflow? Almost everybody's hands are up, amazing.

So here is the agenda. We will first do an overview of what the Kubeflow project is, and we will try to answer who is using Kubeflow and why. Then we will do a short introduction of the security working group and share its main goals and initiatives. To make sure we are on the same page, we will do an overview of the Kubeflow architecture, and then we will discuss the authentication flow of Kubeflow. Then follows the interesting part: a discussion of various Kubeflow security issues. And we will wrap it up with a conclusion.

So, what is Kubeflow? Kubeflow is an open source machine learning operations platform based on Kubernetes. It enables data scientists and machine learning engineers to build, scale, deploy and orchestrate their machine learning workflows, and it tries to standardize and automate the iterative nature of the machine learning workflow. We can see Kubeflow as an orchestrator of commonly used machine learning tools: Kubeflow Pipelines, which are reusable, modular and can be shared across teams; hyperparameter tuning, thanks to the Katib component; model serving right inside Kubeflow, thanks to the KServe component. Users can spin up their favorite IDEs, easily track the lineage of their pipelines and models, and so on. Kubeflow is comprised of different components, it integrates with different machine learning frameworks like TensorFlow and PyTorch, and, very importantly, it abstracts away the complexity of Kubernetes. To summarize: Kubeflow helps teams streamline their work, it improves collaboration, it has multi-tenancy support, and as a whole it speeds up model development. The project is developed by numerous companies, and it is now taking its next step towards joining the CNCF as an incubating project; we hope this will happen soon. I'd also like to announce that Kubeflow version 1.7 has just been released, so if you haven't tried it yet, please do.

So, who is using Kubeflow, and why? Companies from various industries are using Kubeflow: telecommunications, finance, healthcare, insurance, and especially the regulated sector. It is also used and developed by large companies like IBM, Google, VMware, DHL, Arrikto and others. Actually, Google offers Kubeflow as the basis of its managed machine learning platform on GCP, under the name Vertex AI. But why is Kubeflow the platform of choice for so many users? Simple: there is no comparable open source machine learning orchestration alternative available. Another reason for adopting Kubeflow, especially but not only in the regulated sector, is that it's vendor neutral, scalable, standardized and, we could say, fairly secure.
So, we have similar reasons for adopting Kubeflow in enterprises as for running Kubernetes in enterprises. Now, let's say a few words about the newly formed security working group, whose technical lead is Julius. Its primary goal is to define clear policies and procedures for how vulnerabilities should be reported and publicly disclosed. Another goal is to enforce the use of security best practices: authenticate every API call, use least-privilege RBAC, do periodic CVE scanning and, still in progress, integrate a software bill of materials, or SBOM. Of course, the group also provides a place for discussion, during our bi-weekly meetings and in Slack. And the main focus of the group is to tackle the different security-related issues; in the following slides we will discuss some of them.

Now we would like to share some insights we got from our first CVE scan, done against version 1.7. This table contains, for each working group, the number of images it owns and the number of CVEs divided by severity. Luckily, most of these CVEs originate from our external dependencies, and the majority of them can be addressed by simply upgrading to a newer version. Actually, Canonical has already worked on minimizing these numbers and has even achieved zero critical and zero high-severity CVEs, so we are expecting their work to be contributed upstream. The lesson: take care of your dependencies, because otherwise you automatically inherit all of their bugs and CVEs.

Still work in progress: the community is taking its next step towards securing the software supply chain by integrating a software bill of materials, or SBOM. An SBOM provides an exhaustive list of all the software components within the project, with their versions and their licenses, in a standardized format. Once this SBOM file is released, both sides, the user community and the contributor community, benefit from the transparency: everybody knows exactly what the ingredients of a specific version of the project are. It increases accountability, and, I forgot to say, it ensures that the project is license compliant. Based on these SBOM files you can do CVE scanning, which enables accurate identification and remediation of security vulnerabilities. Now, Julius.

Yes, thank you very much. Is this working? Yes. Let's start with the architecture here, and please follow closely, because this is important for the remaining part of the presentation. In general, you can see the authentication and authorization part within Kubeflow, consisting of an ingress gateway, Dex as an OIDC provider, the internal OIDC AuthService, and, in general, a service mesh. Actually, the service mesh, as well as cert-manager, is connected to most components within Kubeflow; I'm not going to draw that everywhere. Given this authentication and authorization part, let's first take a look from the user side, because usually you have quite a lot of users. In this case, user X owns a project, or namespace, or profile, however you want to call it. Why do I use these terms interchangeably? Because we have this profile controller, which picks up a Kubeflow Profile custom resource and transforms it into a Kubernetes namespace, adding a default service account, role bindings, secrets and so on.
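To make this concrete, a Profile is only a few lines of YAML. Here is a minimal sketch, with the profile name and owner email as placeholders:

```yaml
apiVersion: kubeflow.org/v1
kind: Profile
metadata:
  name: user-x                 # the profile controller creates a namespace with this name
spec:
  owner:
    kind: User
    name: user-x@example.com   # the identity reported by the OIDC provider for this user
```

Applying this single resource is what triggers the creation of the namespace, the default service account, the role bindings and so on.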
So, this is done for all of the users automatically; for example, I run some clusters that have several hundred users. Then we have the central dashboard, which is connected to most of the Kubeflow UI components; this is the main entry point for the user. Take, for example, the Volumes UI, which is quite simple: it just allows you to manage your own PVCs in your namespace, because, as I said before, a profile is really just a Kubernetes namespace with some additions, so you can do all the normal Kubernetes stuff there as well. Then comes probably the most used Kubeflow component, Kubeflow Pipelines, which adds quite some resources to the cluster in the kubeflow namespace and allows users to run the reusable, modular pipelines we talked about earlier in their own namespaces. This is heavily used; it's developed by Google and is their main product, also sold as Vertex AI on the GCP platform. Then we have Katib, as I said before, for hyperparameter tuning, here in green. Also quite heavily used are the Workbenches, formerly called Notebooks. As the name already says, you can start JupyterLab, VS Code or RStudio, and later on maybe also Label Studio and MLflow, in a self-service manner: you just point and click, choose how many GPUs and how much memory you want, and start your online IDEs. And last but not least, there's KServe, here in pink, which is used for model serving. I think there was a talk a few hours ago, exactly in this room, from Bloomberg about KServe and how they use it. It uses Knative under the hood, which means you get serverless inferencing, scale to zero and some nice additional features. And all of this, as you can see here, runs in the user namespaces; the actual workload is always running in the user namespaces, so you can add your quotas and whatever onto it. Then we have the kubeflow namespace, which hosts most of these components and handles the authorization for them. So this is the big picture of Kubeflow, but it's not complete, because, for example, I'm working with Anyscale on Ray integration (maybe you know Ray, the training platform behind ChatGPT), as well as MLflow integration, Label Studio integration and so on. There's a lot to add to Kubeflow; it's really a big orchestration platform for quite a few frameworks. Try to remember this picture, but I'm going to show it again. Now, for the authentication part, I'm giving back to Diana.

I'm not sure everybody can hear you. Do you hear him? Okay. So now, on this slide, we would like to show you the authentication flow of Kubeflow. It starts with a user, or a machine, who needs access to Kubeflow. The user hits the ingress gateway, which redirects the request and asks another service, the OIDC AuthService, to validate it. Currently we support two kinds of authentication: session-based, intended for humans, and, new in 1.7, service-account-based authentication. The OIDC AuthService validates the request; a request is valid if it contains headers with a valid token, or a valid session. If those are missing or invalid, the service redirects the user to Dex to begin a new OIDC authentication cycle and obtain a session. Once the request is valid, meaning it carries a valid session cookie or service account token, the OIDC AuthService also adds a new header, called the user ID header.
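To give you a feeling for how this header is consumed downstream, here is a sketch of the kind of Istio AuthorizationPolicy the profile controller places in each user namespace. The header key and the policy name follow the upstream defaults as far as we know; treat the exact names as assumptions:

```yaml
apiVersion: security.istio.io/v1beta1
kind: AuthorizationPolicy
metadata:
  name: ns-owner-access-istio                # assumed name; created per profile
  namespace: user-x
spec:
  rules:
  - when:
    - key: request.headers[kubeflow-userid]  # the user ID header set by the OIDC AuthService
      values: ["user-x@example.com"]         # only the profile owner gets through
```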
Then the request is redirected to the correct Kubeflow component, and exactly this user ID header is what the Kubeflow components later use for authorization, as in the policy above. Before 1.7, the OIDC AuthService supported only the interactive session flow intended for humans, so for machines we had to simulate a web browser. With the new 1.7 version, we have a programmatic way to authenticate machines using service account tokens. And now I will give the stage to Julius.

Actually, this was a quite heavily requested feature, because quite a lot of companies use GitHub workflows or GitLab pipelines to trigger Kubeflow pipelines in turn, and they definitely need proper long-term machine-to-machine authentication. So this is now in, as of 1.7.

Let's continue with what we've achieved so far, because I don't want to talk only about the bad security stuff, and it's actually quite a lot, because we had some problems over the last few years. It started with misconfigurations of Istio sidecars, services and so on, which allowed you to easily fake the user ID header and impersonate users towards a lot of the services you've seen before in the architecture overview. You could just pretend to be another user, hijack their namespaces and so on. Of course, completely unacceptable in an enterprise environment. And even the user management itself was completely unprotected a year ago, so you could just add yourself to any other user's namespace, hijack it, take it over. A complete disaster. But I spent some time on it, together with some other people, and we fixed these issues upstream; everything is contributed. We spent some time hardening Istio with security best practices, additional sidecars and so on. And in addition to the Istio authorization policies, I later added network policies as a second layer of defense, and they actually turned out to be quite valuable and caught some additional Istio misconfigurations. So please, if you use Kubeflow, consider using the network policies as well; they ship with the manifests. So that's Istio and network policies: quite some changes, quite some improvements.

Another area where we've made significant progress over the last two years is rootless containers. As probably all of you know, running containers as root is an unnecessary security risk; you should not run root containers anywhere, ever, if you can avoid it, although some people still do. Kubernetes user namespaces, which you may have heard of, improve the situation around root somewhat, but even despite such efforts in the Kubernetes community it's still a big problem, and it's forbidden by most company policies in sane enterprise environments: you don't want your users running root containers. It's quite simple. But what's the default situation within Kubeflow? If you go to the Kubeflow website, download the manifests and install them in your cluster, then by default almost all of your containers run as root, especially in the user-controlled namespaces, which makes it easier to exploit your cluster, escalate privileges and so on. This is something I wanted to eradicate almost two years ago, one of the first things I tackled, and we made it possible to run 99% rootless.
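To show what rootless means in practice, here is a minimal pod sketch that satisfies the restricted Pod Security Standard; the image is a placeholder, not an actual Kubeflow image:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: rootless-example
spec:
  securityContext:
    runAsNonRoot: true          # refuse to start if the image wants UID 0
    runAsUser: 1000
    seccompProfile:
      type: RuntimeDefault
  containers:
  - name: main
    image: registry.example.com/notebook:latest   # placeholder image
    securityContext:
      allowPrivilegeEscalation: false
      capabilities:
        drop: ["ALL"]           # no Linux capabilities at all
```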
Of course, you're asking: where did this 1% go? It went into the istio-cni daemonset, because if you want to run Istio rootless, that is the only supported option at the moment. So if you want rootless init containers, with no NET_ADMIN capabilities there, and you want to prevent root containers everywhere, you have to use istio-cni. That's where the 1% goes, and you will probably get an exception for it at your company.

And it's not enough to do this only for the default containers that are there at installation time, in the kubeflow namespace, the istio-system namespace and so on. You also have to enforce it for new containers that you do not know about at installation time, because your users can bring their own containers into their Kubeflow namespaces: their own containers for pipelines, for training, for inferencing, and so on. So you have to enforce this on the cluster level, on the Kubernetes level, using pod security policies; and nowadays, since Kubernetes 1.25, you can use Pod Security Standards, the successor of pod security policies.

So this solves the rootless container issue, but there is one limitation, because some people do not use GitHub workflows or GitLab pipelines or something else to build their containers; they actually want to build containers inside their own Kubeflow namespace. And this is slightly limited: of course there is no legacy, insecure Docker, but even with Podman or Kaniko it's still not possible to build OCI containers fully rootless. This is the only limitation so far of an otherwise significant security gain. Please use this at your company; I'm linking the GitHub pull request here, and I'm hoping to get all of it upstream in the near future. So that's the second topic, rootless containers, mostly solved.

Then let's go to the exploit part, starting with a simple example: just a simple RBAC issue. Most of you should be somewhat familiar with RBAC, and I hope you can all read YAML; if not, please raise your hand. Okay, then let's start. We have a namespace called Alice; in the security sector it's common to call your example users Alice and Bob, so I did that here as well. So we have Alice and Bob, and we have the kubeflow namespace. In the kubeflow namespace run the two controllers you've seen before in the architecture overview. First of all, the profile controller, which takes the Kubeflow Profile custom resources and transforms them into namespaces, adding role bindings, authorization policies and so on. But there is a second one: not only the normal profile controller, but also the pipeline profile controller from Google; they were developed independently. This one adds some additional resources, just for Kubeflow Pipelines, to the namespace, as you can see on this side.

So now that we have these controllers, what's the problem with them? As you might have expected, they both just run as cluster-admin, as you can see here. This should never be the case: never grant the cluster-admin role to any of your service accounts. Just don't do it, because then one exploit in the kubeflow namespace and an attacker can easily become cluster-admin. That's the problem, and in this case it's purely about hardening. So what can we do? It should, of course, be a reduced cluster role, and, if possible, we should talk to the Google developers about merging all of this into one profile controller, to reduce complexity and therefore attack surface.
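As a rough sketch of the direction, it could look like the following; the exact resource list would have to be derived from what the controllers actually touch, so take it as illustrative only:

```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: profile-controller-reduced   # hypothetical name
rules:
- apiGroups: ["kubeflow.org"]
  resources: ["profiles", "profiles/status", "profiles/finalizers"]
  verbs: ["get", "list", "watch", "update", "patch"]
- apiGroups: [""]
  resources: ["namespaces", "serviceaccounts"]
  verbs: ["get", "list", "watch", "create", "patch"]
- apiGroups: ["rbac.authorization.k8s.io"]
  resources: ["rolebindings"]
  verbs: ["get", "list", "watch", "create", "patch"]
# ...plus whatever else the controller really needs, but never '*' on '*'
```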
So this was a quite simple example, rather on the side, without that big an impact yet, but now we can get to a more sophisticated one that involves namespace sharing. If you have tried Kubeflow yourself, you will probably have noticed that you can share your namespace with collaborators. We start again with Alice and Bob; in this case a malicious Alice, who gets added to Bob's namespace as a collaborator. What does this mean? She also gets access to the service accounts in Bob's namespace. And this is problematic on several levels. First of all, sharing does work: yes, you can access Bob's JupyterLabs, Bob's pipelines and so on. Everything works on the surface, but it's completely broken on the Kubernetes level, at least from a security perspective. The service accounts are configured in such a way that even if you only have access to the default viewer, you can escalate to the default editor in several ways. Even with the Kubernetes 1.25 improvements, where secrets are no longer created by default for service accounts, it's still possible to escalate. So that's the first thing: even in your own namespace, you can escalate your privileges to another service account with higher privileges. That should not be the case.

The other, far more important problem: just imagine Alice is malicious at a big company. She simply steals Bob's default-editor service account token, leaves the company, and keeps impersonating Bob with that token; she can still log into the Kubernetes cluster. I actually demonstrated this myself several times. That should not be possible: if you leave the company, you should not have access to any Kubernetes cluster anymore just because you stole a service account token. And what can we do against this? For example, we can disable sharing altogether; this is what is done at some companies, and since I do consulting, I know about it. And then there is a second solution, which might be called the proper solution to some degree: actually deleting and regenerating all of the secrets, tokens and so on. None of this is implemented upstream yet, so please take care at your companies; this is a problem if you leave the default namespace sharing enabled. Sharing is caring, as they say, but be careful.

Now there is another issue with multi-tenancy, specifically the artifact storage within Kubeflow, which is implemented using MinIO as an S3 server. As you can see, it's used mostly by Kubeflow Pipelines, and to some degree also by KServe. We start again: we have Alice, we have Bob, and their namespaces. Let's add the Kubeflow Pipelines components, which you have seen on the architecture slide before, here in blue, as well as additional backend pieces such as the pipeline API server and MinIO. MinIO in particular is accessed by most Kubeflow Pipelines components. So what's the problem? It's completely shared. These secrets, MinIO secret 1 and MinIO secret 2, are actually the same. And not only are they the same, they are also the admin secret for MinIO by default. So any user of your Kubeflow installation can destroy your MinIO instance, hijack other users, and read all artifacts of all other users. There is just no default multi-tenancy: you share everything and can destroy everything.
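Just to make the difference concrete before I describe the fix: per-namespace credentials would mean something like the following, a randomized secret per profile instead of the shared admin one. The secret name mirrors what the pipeline components mount today, but treat the exact name and keys as illustrative assumptions:

```yaml
apiVersion: v1
kind: Secret
metadata:
  name: mlpipeline-minio-artifact   # assumed: the secret the KFP components read in user namespaces
  namespace: user-x
type: Opaque
stringData:
  accesskey: user-x-artifacts       # per-namespace identity, scoped by a MinIO IAM policy
  secretkey: CHANGE-ME-randomized   # generated per namespace, never the MinIO admin key
```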
So a year ago, this is why I spent some time on a POC to isolate the artifacts per user and, in the end, really have separate folders with proper IAM permissions, which you can set in MinIO, such that users cannot access each other's artifacts anymore. This is actually in place at some companies, so maybe it's already a real implementation by now and not only a POC. But in the long term I want to get even better, because we still have credentials there. Imagine we now have randomized, separate credentials per namespace instead of the admin secret: it's still annoying, because you still have to pick them up as a user. In the long term I actually want to get rid of those credentials altogether and use the namespace origin for authorization. That is not yet implemented, but for the POC with separate credentials there is a PR; you can click on it. Please use it at your company, and try to contribute back if possible.

And last but not least, the main developer of Kubeflow Pipelines, which is Google, is allergic to AGPL. What does this mean? They just don't allow it. And the problem is, as you may know, that MinIO switched its license around three years ago from Apache 2 to AGPL. So we have a three-year-old MinIO image with quite a lot of CVEs, I would guess, and we have to find a successor. I'm working on that, but it's not yet upstream. So: CVEs in the image, and the admin credentials shared across all users. Quite important; I think this is something you need to fix if you want to deploy Kubeflow.

Then, orthogonal to the artifact storage, there is also the metadata storage for lineage tracking of your artifacts, especially the metadata for pipeline artifacts. It's heavily used in the TensorFlow ecosystem, so the Google folks decided to use it within Kubeflow as well. We start again: we have Alice and Bob with their namespaces, and a lot of components for this ML Metadata stuff. I'm not going into detail about it, but the main problem is, again, that we don't have multi-tenancy support. Similar to the artifacts, it's shared across all users: Alice can access Bob's metadata. There is a workaround for it currently: you can just disable this component of Kubeflow, because Kubeflow is quite modular, and this is what I do at most companies. But the problem is that newer versions of Kubeflow Pipelines might require the ML metadata part, and then it becomes complicated. We will have to spend some time on isolating it per user for KFP version 2. I talked to the Google folks; they are focusing on it for the second half of the year. Let's hope they deliver, but let's see. And of course, we are looking for volunteers here, not just for this issue but for any issue. I'm really offering mentorship; I've done my third mentorship already. If you want to contribute, just contact me. I'm going to help for free, just to improve the project.

And then, to add some variety, we also have a frontend issue for you. As you can see here, imagine this is Bob's namespace in the UI. Imagine he did some TensorFlow training in a nice, shiny little pipeline, and the training produced some output artifacts, which you can see at the bottom: this is the S3 path, including a namespace parameter. Now, the problem is that if Alice somehow manages to spy out Bob's S3 artifact path, the UI allows Alice to actually read its content. How does this work?
You just remove the namespace parameter, enter the path in your web browser, and the UI will not check permissions; someone simply forgot to add the permission check. This is still the case in a default installation. And there is also some additional technical debt: for example, the UI skips the API server and accesses MinIO directly, and the artifact proxy, which you've seen before on the slides, is also rather obsolete. So there is some work to be done, especially in Kubeflow Pipelines, regarding the artifact storage as well as the metadata storage.

Then there is also a denial-of-service attack, and I know some companies who are really affected by it: you can block the usage of the database, and therefore of Kubeflow Pipelines, for most users. I'm not going to explain it in detail here. Please, if you want to help us fix this, it's just SQL stuff; join us and we will find a solution.

So, as a conclusion: we achieved quite a lot over the last two years. Authentication for most API calls, lower-privilege RBAC, we founded the security working group, including the just recently released image scanning numbers and the software bill of materials work, the new machine-to-machine authentication, the Istio improvements, network policies and, of course, rootless containers. On the other hand, we still have some open issues, which I've just presented here: from the profile controllers and namespace sharing, which are mostly RBAC related, to the multi-tenancy support within Kubeflow Pipelines, where the artifact storage and the metadata storage do not support it properly yet. And we have outlined solutions to some of those issues. That is what I wanted to summarize.

And of course, please join us; as I said several times before, we are looking for volunteers. You can join the community, there is a community calendar, and you can join us on Slack, where we are quite responsive; there is a security channel. We also publish the security working group meeting minutes and recordings, so you can even watch what we've done so far. This is our contact information; please don't hesitate to contact us. I will try to answer you as fast as possible, and we would really appreciate it if you rate our talk, so that we can improve based on your feedback. I know this was quite a lot; it's not an entry-level presentation, but maybe some people already have some knowledge of Kubeflow and wanted a more detailed explanation of some of the issues I've shown here. Any questions so far? Yes, please.

Is there any work going on around isolating, or managing the security of, the notebooks? Let's say I'm in a notebook and I authenticate in the notebook with my credentials. Then, within the team, any member can get my credentials. That is one of the security issues we are facing in our company.

Yes: if you share your namespace with other users, then of course they will have access to the same stuff. I wouldn't call that an exploit. If you really want to separate things, create separate namespaces per user plus one big shared one. It's not really an exploit if you explicitly share the namespace and then expect some things not to be shared. I mean, the escalations from default viewer to default editor, those are real exploits, I understand that concern.
What is clear is that most of the companies I know just have per-user namespaces, and if people then work as a department on a project, the department usually gets its own shared namespace where all of the department's users are members, and they know they are sharing things there.

Okay, thank you.

Yes, please hold the microphone closer. Thank you very much.

Hi. I was wondering, did you ever consider integrating with other open source tools like Gatekeeper and OPA policies? For instance, the way KFP v1 pipelines work is that you create a workflow and it gets turned into a pipeline execution. One way I solved multi-tenancy is that I built an OPA policy that prevents running other people's pipelines.

Can you get a bit closer to the mic?

Oh yeah, sure. The way I solved one of the multi-tenancy issues is that in previous versions you were able to run other people's pipelines, so I built a policy that prevents you from running other people's pipelines except in the namespaces you have whitelisted. So I was wondering if you have considered integrating with OPA policies and Gatekeeper.

No, not yet so far, because, if I understood your question correctly, you are talking about namespace isolation?

Pipeline execution: in previous versions you were able to execute other people's pipelines.

Ah, yes, okay, now I know what you're talking about. This is too much to cover here, but I can tell you a lot about it. Even in Kubeflow 1.7, you still just have the shared pipelines displayed in the user interface. They are also stored on the MinIO level, where you can access them, and you can probably access them in the database as well; that's still possible. But in Kubeflow 1.8, I think this is solved: we now have a proper UI that separates shared pipelines and private pipelines. If you come to me later, I can point you to the exact pull request; this is actually solved. The only thing that's missing is the KFP SDK support, but in the UI we really have a separation between shared, or global, public pipelines and private ones. I even had my own implementation some time ago, but nowadays it's really fixed upstream.

Okay, cool, I'll look forward to that version.

Yes, and if you want to help with the KFP SDK part, feel free to volunteer.

Sure, I will. Thanks.

Hi, thank you for doing all this work, it's very useful, and also for the presentation, of course. My question is regarding the CVEs. You showed a quite scary table of vulnerabilities, and you talked about how they are getting solved, which is very nice of course, but I was wondering: is this also going to be part of the release process, so that in future releases CVEs won't be introduced into new versions and we can use them directly?

Yes, this is exactly one of the reasons why the security working group was founded. The scanning was actually mostly done by Diana here. And we are especially working on fixing all the CVEs. I talked to Canonical, they have their booth here; they fixed most of them and want to upstream that work. And as I said before, it's mostly outdated base images, so I think we can really get the high and critical ones down to almost zero.

Yeah, that's very good.

The problem is that all of this is scattered across so many repositories.
You have kubeflow/manifests, which pulls from Kubeflow Pipelines, which pulls from kubeflow/kubeflow, KServe and so on. And it's quite difficult: you have to coordinate with all of the separate working groups and create separate pull requests per working group to update their base images. After that, most of the CVEs should be gone, but it will take some time.

Yeah, of course, I understand. But thanks anyway for the work.

By the way, I would like to know which companies you are from, you who are using Kubeflow, if you can tell us. Or are you using it privately?

Oh, sure. We're not using it yet; I'm from a local bank here in the Netherlands, the Volksbank. We're not using it yet, but we are looking at implementing it, and this was a very important talk for us, because, well, you can imagine why.

Yes, I can imagine why; it's especially the regulated sectors, as I said before. It's doable. I know quite a lot of big companies, not just the ones listed here but many more, insurances and so on. And you cannot imagine how many people are using Kubeflow without contributing back; some companies really do not allow their employees to contribute back. So maybe this is something you should check first, whether you are allowed to contribute to the project, if you want those issues fixed.

I don't think that would be a problem; having enough time for it would be the bigger problem, I think.

Okay, do we have any further questions? No? Okay. I guess we're also a bit over time. If you want to talk about anything in detail, just come up front; I will be here for another 10 or 15 minutes. Thank you very much for your patience. I hope this talk was enlightening for you. Or at least entertaining.