Hi, everyone, and welcome to my talk today at Kubernetes AI Day. We're going to talk about security best practices for AI on Kubernetes. My name is Guy Salton, and I work for a company called RunAI, which provides a platform for orchestrating and optimizing AI workloads on top of Kubernetes. A bit about myself: I'm the solution engineering lead at RunAI, I'm a big fan of Kubernetes and have been working with it for a few years now, and I live in Tel Aviv, Israel. Let's quickly review our agenda for today. We'll talk about Kubernetes and containers for data scientists and why data scientists are adopting these technologies. We'll then talk about what data scientists need in their day-to-day work, and about the security concerns that might arise for security teams when running these workloads on top of Kubernetes. Then we'll see the solution: how we can help the security team make sure that the jobs and pods that data scientists run on Kubernetes are running in a secure way, so there are no concerns. And we'll see a demo of all of these things in action. So let's start with a bit of an intro. Data scientists are widely adopting containers today, because many of the most popular tools used by data scientists are built and designed for containers. You can see a few of the names here. There's also, of course, the NVIDIA NGC library, which provides a lot of pre-trained models and containers, so researchers can get started very fast with their models. All of these things were built for containers, and that's why data scientists want to use containers. And when you're using containers, you know that the standard for orchestrating them today is Kubernetes. You can see a few stats here from a recent study that was done by Datadog.
You can see that almost 90% of containers are orchestrated, and the de facto container orchestration tool today is Kubernetes, used in more than half of containerized environments. The advantages have been talked about many times; what we're going to focus on today is what data scientists actually need, what kind of work they do, and how we make sure it's secure. So what do data scientists need? They need to run containers that have their deep learning framework inside, like TensorFlow, PyTorch, or Keras. They need to be able to install dependencies in this container based on what their model needs. And they also need to be able to mount their data into the container. They usually won't build an image with the data prepackaged inside it; because the data changes frequently, they prefer a container with only the framework, tools, and dependencies their model needs, and then they mount the data from some shared storage outside. Once they have all of these things, they can develop, debug, train, and deploy their models. So this is what they need. But on the other side, we have the security team, and they're thinking: OK, data scientists want to run their containers, and they want to use Kubernetes for that. But how do we make sure that researchers don't mess with other containers? We don't want a researcher to mess around with system components or with other researchers' containers. We don't want a researcher to have access to other researchers' data. Data is a big thing, and we don't want a researcher to be able to access somebody else's data, change it, or even just read it. And we also don't want researchers to have access to shared resources, meaning they should only be able to work inside their own container.
We don't want them to access the host running the container, because then they could potentially change things on the host that affect other users. So how do we make sure these things are not possible? Let's talk about a few best practices, and then we'll see a demo. The first practice: like we said, we don't want researchers to mess with other containers; we only want them to work on their own containers. The solution is to create a service account for each researcher, which is basically like a user definition in Kubernetes, and then to use namespaces for segregation. Each user gets their own namespace, can only do things inside that namespace, and can't touch other namespaces. The second best practice addresses the fact that we don't want researchers to access other researchers' data. To enforce this, we set the right permissions on each researcher's data directory, so that the directory is only accessible by a certain user. Then, in our pod definition in Kubernetes, we make sure we run as the right user and mount the right directory, and we shouldn't be allowed to mount somebody else's directory or impersonate somebody else; we'll see how this is enforced. And lastly, we said we don't want a researcher to access the shared resources, the host where the container is actually scheduled. To make sure this doesn't happen, we need to restrict the cluster, or the namespaces on the cluster, from running pods with the root user or with privilege escalation. We have to enforce that researchers run with their own user and can't use privilege escalation to access the host. We'll see how this can be enforced in Kubernetes. So let's go and see the demo.
I created a public GitHub repo with all of the files and examples I'm going to show today, so feel free to access it and follow along. Let's start by looking at our cluster. I have a Kubernetes cluster that I created; let me make sure I use the right config file. If I run kubectl get nodes, I can see I have a Kubernetes cluster with three nodes: a master node, a CPU node, and another node with GPUs. I said that I want different users to have different namespaces, so they won't be able to do things in other namespaces. Let's say I have two researchers on my team, one called Bob and one called Alice. The first thing I need to do is create a namespace for Bob and a namespace for Alice. I already did that, so if I run kubectl get ns, you'll see I have a namespace called alice and one called bob. Now I want to make sure that Bob will only be able to do things inside his namespace, and Alice only inside hers. To do that, we have to create a few resources in Kubernetes: a service account, then a role, and a role binding. We do this for each user: a service account, role, and role binding for Bob, and then another set for Alice. The service account is basically a user in Kubernetes. You see, we just create a service account, call it the Bob user, and map it to the namespace bob. Then we create a role for Bob, also mapped to the namespace bob, which will basically allow Bob to do everything he needs: run pods, jobs, deployments, everything in the namespace bob. And then a role binding is what binds the service account to the role. So you see, I create this thing called a role binding.
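As a sketch, those three resources for Bob might look roughly like this (this is my reconstruction of the demo, not the exact files from the repo; the names and the rule list are assumptions):

```yaml
# ServiceAccount: Bob's identity in the cluster
apiVersion: v1
kind: ServiceAccount
metadata:
  name: bob-user
  namespace: bob
---
# Role: broad permissions, but scoped to the bob namespace only
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: bob-role
  namespace: bob
rules:
  - apiGroups: ["", "apps", "batch"]
    resources: ["pods", "pods/exec", "pods/log", "jobs", "deployments", "persistentvolumeclaims"]
    verbs: ["*"]
---
# RoleBinding: ties the ServiceAccount to the Role
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: bob-rolebinding
  namespace: bob
subjects:
  - kind: ServiceAccount
    name: bob-user
    namespace: bob
roleRef:
  kind: Role
  name: bob-role
  apiGroup: rbac.authorization.k8s.io
```

Because a Role (unlike a ClusterRole) is namespaced, even a wildcard verb list here cannot grant Bob anything outside the bob namespace.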
It's also going to be created in the namespace bob, and it will bind the service account we created, the Bob user, with the role that we created for Bob. We did the same thing for Alice, and all of these are in the GitHub repo, so you can just run kubectl apply -f on the YAML file, and this will create all three resources for Bob and another three for Alice. If we run kubectl get role on the bob namespace, for example, we will see Bob's role, and we can also see his role binding and his service account. All of these were created, and similarly for Alice. Once we have these three things created for each user, we can create a Kubernetes config file. I won't go too deep into this, and there's an example in the GitHub repo, but this is the file that each user would use to access the cluster. It has the cluster information and then defines the user. So Bob gets a config file with the Bob user, and he will only be able to access the namespace bob inside the cluster using that user. Alice gets her own file, so she will only be able to access the same cluster with the Alice user in the alice namespace. Just to see that this works, we can now impersonate Bob by using Bob's Kubernetes config file, which we just saw. You'll see that if I try to get pods in the bob namespace, I'm allowed to do that; currently there are no pods running. But if I try, as Bob, to see pods running in the namespace alice, you see I'm forbidden: I don't have permission to list pods in the namespace alice, or in any other namespace, only in the namespace bob. Similarly for Alice. So the first best practice is now implemented. For the second best practice, we wanted to make sure that each researcher only has access to his or her own data. To do that, I created an NFS server and mounted it on one of the nodes in the cluster, the GPU node.
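Before we look at the storage side, here is roughly what the per-user config file described above could look like for Bob (a sketch only: the cluster name, server address, and token are placeholders, not the repo's actual values):

```yaml
# Bob's kubeconfig: one cluster, one user, one context pinned to the bob namespace
apiVersion: v1
kind: Config
clusters:
  - name: my-cluster
    cluster:
      server: https://<api-server-address>:6443
      certificate-authority-data: <base64-encoded-ca-cert>
users:
  - name: bob-user
    user:
      token: <bob-service-account-token>
contexts:
  - name: bob-context
    context:
      cluster: my-cluster
      user: bob-user
      namespace: bob
current-context: bob-context
```

Pointing KUBECONFIG at this file is what the demo means by "impersonating Bob": every kubectl call then authenticates as the bob-user service account and defaults to the bob namespace.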
And you can see I created a folder called alice and one called bob in my NFS directory, and each of these folders has different permissions: the folder alice can only be accessed by the user alice, and the folder bob only by the user bob. To see what's actually inside, I can show you that, for example, in the bob folder there is a file called data.txt containing one line, which is "Bob data". And if I look at Alice's directory, I'll see a similar data.txt file, but for Alice it contains a single line, "Alice data". So I have two folders, one for Alice and one for Bob; each is accessible only by its own user, and each has its own data file. That's the NFS side. Now, to be able to mount these NFS directories into my pods in Kubernetes, I have to create two things. The first is called a PV, a persistent volume, and then there's something called a persistent volume claim. The persistent volume defines the volume I'm going to use. This is the persistent volume for Bob; you see I called it nfs-pv-bob, and it's going to access my NFS server. This is my NFS server's IP, and it's going to get my bob folder. Similarly for Alice, we created another one, but that one gets the alice folder. Then we have to create a PVC, a persistent volume claim, to be able to use this volume in our pods; this is the resource we can actually refer to in the pod. So we created one for Alice in the alice namespace, using the same storage class name as the PV, and then one for Bob in the bob namespace. With that, we have our NFS, PV, and PVC configured. And now for the last best practice: we said we want to make sure that pods cannot run with the root user, and they can't run with privilege escalation.
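Sketched out, Bob's PV and PVC pair could look something like this (a reconstruction, not the repo files: the storage class name, capacity, and paths are assumptions; the NFS server IP is a placeholder). Note that the PV itself is cluster-scoped, while the PVC lives in Bob's namespace:

```yaml
# PersistentVolume pointing at Bob's folder on the NFS server
apiVersion: v1
kind: PersistentVolume
metadata:
  name: nfs-pv-bob
spec:
  capacity:
    storage: 10Gi
  accessModes:
    - ReadWriteMany
  storageClassName: nfs-bob       # matching class name binds this PV to the PVC below
  nfs:
    server: <nfs-server-ip>
    path: /nfs/bob
---
# PersistentVolumeClaim in the bob namespace; pods reference this claim
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: nfs-pvc-bob
  namespace: bob
spec:
  accessModes:
    - ReadWriteMany
  storageClassName: nfs-bob
  resources:
    requests:
      storage: 10Gi
```

Because the PVC is namespaced, Bob's RBAC role only lets him reference claims in the bob namespace; the Unix permissions on the NFS folder itself are what stop a pod running as the wrong UID from reading the data.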
To enforce something like that on our cluster, there is something called Pod Security Admission. There are actually a few options. There used to be something called Pod Security Policies, but Kubernetes decided to deprecate it and completely remove it in version 1.25, so you should be aware of that. The new recommended approach is Pod Security Admission, using an admission controller. It's a new feature, currently in alpha, available from Kubernetes 1.22. To enable this alpha feature, you can enable the PodSecurity feature gate in your cluster; make sure you do it in the kube-scheduler, the kube-apiserver, and the controller manager on your master node. Once these are applied, you can enforce different security standards on each namespace in your cluster. There are three built-in standards, and the one we want to use is called restricted. The restricted standard is a heavily restricted policy that prevents pods in the namespace from doing insecure things. If we look at some of the things it enforces, you'll see that once the restricted standard is applied to our namespace, it won't allow us to run pods with privilege escalation, it won't allow pods to run as root and will require a non-root user, as well as the other things you see here. So I enabled it on my cluster. To show you how, on my master node I can go to /etc/kubernetes/manifests and look at the files there. For example, if I look at the kube-apiserver manifest, you should see that under the container command I added the feature gate PodSecurity=true.
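On a kubeadm-style cluster like the one in the demo, that static-manifest edit might look roughly like this (an excerpt sketch, not the actual file; only the feature-gate line is the addition, and the same flag would be added to the kube-scheduler and kube-controller-manager manifests):

```yaml
# /etc/kubernetes/manifests/kube-apiserver.yaml (excerpt)
spec:
  containers:
    - command:
        - kube-apiserver
        - --feature-gates=PodSecurity=true   # enable the alpha Pod Security Admission plugin
        # ...the existing kube-apiserver flags remain unchanged...
```

The kubelet watches this directory, so saving the file restarts the API server static pod with the gate enabled; from that point the per-namespace labels shown next actually take effect.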
So it's enabled on my cluster, and now if I look at my namespace, for example with kubectl get ns bob -o yaml, I'll see that I added specific labels on this namespace, the pod-security.kubernetes.io/enforce label, so I went and actually enforced the restricted policy on the bob namespace. I did the same thing for the alice namespace. This means that, since the mode is enforce, if I try to run pods that don't comply with the security standard, the admission controller will just block them. Let's see this in action and try to run a pod. I created a pod definition here for Bob, so I'll try to run a pod in the bob namespace. It runs a Jupyter TensorFlow notebook image, so it contains Jupyter as well as TensorFlow, and it mounts my bob folder from NFS. Let's see what happens if I impersonate Bob and then try to run this pod. I'll run kubectl apply -f on the non-secure pod file, and you'll see that once I try that, I get an error.
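For reference, a pod spec that satisfies the restricted standard might look roughly like this (a sketch: the image, the UID 501, and the NFS mount follow the demo narration, while the pod name, PVC name, and mount path are my assumptions):

```yaml
# Sketch of a pod that passes the "restricted" Pod Security standard
apiVersion: v1
kind: Pod
metadata:
  name: bob-notebook
  namespace: bob
spec:
  securityContext:
    runAsNonRoot: true
    runAsUser: 501              # Bob's UID, matching the owner of the NFS folder
    seccompProfile:
      type: RuntimeDefault       # required by the restricted standard
  containers:
    - name: notebook
      image: jupyter/tensorflow-notebook
      securityContext:
        allowPrivilegeEscalation: false
        capabilities:
          drop: ["ALL"]          # drop all Linux capabilities
      volumeMounts:
        - name: data
          mountPath: /home/jovyan/work
  volumes:
    - name: data
      persistentVolumeClaim:
        claimName: nfs-pvc-bob
```

Every process in this container runs as UID 501 with no way to escalate, which is exactly what lets the NFS file permissions do their job.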
My Pod Security Admission gives me an error: it tells me that I'm trying to run a pod with privilege escalation and with a root user, and this is not allowed, so I'm blocked. This is a good thing to have in place. Now let's look at the proper pod, where I applied all the security standards. It's the same image, but this time I added a security context. I set runAsNonRoot to true, and I specified the actual user ID I'm going to run this pod with: 501 is the user ID of Bob. I added a few other things here, like RuntimeDefault as the type of my seccomp profile, which is also required. I also had to drop all of my Linux capabilities and set allowPrivilegeEscalation to false, so my pod won't be able to do privilege escalation. And then, again, I'm going to mount my bob folder from NFS to this path in my container. Let's see what happens when I try to run this pod; this should now be fine. I see that the pod is created, and if I run kubectl get pods on the bob namespace, I see that my pod is now running, for 10 seconds. I should now be able to exec into the pod, so kubectl exec -it with the pod name in the bob namespace, using bash for my session. Now I'm inside the pod, and you see that if I run the id command, I'm running as user ID 501; I'm not the root user. And if I try to run sudo su, for example, you see I'm not allowed to run sudo, as I don't have privilege escalation. So this is good. If I look at my file system here, I see my workspace, which is where I mounted my NFS folder, and I have my data file here that was mounted from NFS; I can see my data, Bob's data. And if I want to install some dependencies, I can run pip install to install Keras, for example; it would install Keras in my home directory. So everything is good. That was the demo. Going back to our slides, just to summarize, we covered three best practices. First, create a service account and a namespace for each researcher, so they can only do things in their own namespace. Second, for the data, set the right permissions on each researcher's data directory, in NFS in our case, and then make sure to use the specific user ID in the pod definition. And lastly, we used Pod Security Admission to restrict pods from running with things like the root user and privilege escalation. Thank you very much, I hope you enjoyed this session. Feel free to contact me on LinkedIn or by email, and thanks.