First of all, my name is Asir Gutierrez. I work for Huawei. I'm here today to talk about containers and security. Now, these topics were covered before in a previous talk on Tuesday by a guy called Marcus, and we will talk about very similar things, but from a completely different point of view. Over these three days we've heard a lot about regulations and about compliance; we know the US is looking into regulating security, so a lot of companies are looking into how to make sure things are secure. Okay, so compliance. Before I joined Huawei a few years ago, I worked for a big telecom company with tens of millions of subscribers, and then I moved to a small startup doing legal tech, and a lot of what we did there had to comply with many laws and regulations. Now, one of the things we had to do at that time was to make sure certain files were immutable — or rather, that if a file had been modified, we would know about it. Linux has this wonderful thing called the Integrity Measurement Architecture, IMA, which plugs into the LSM, the Linux Security Module framework, and it's very, very simple. As soon as the kernel services an open, mmap, or execve syscall, if IMA is enabled in the kernel with the correct settings and policies, certain hooks will be fired. What happens next depends on when the hook is executed. The first time we open a file, we record its hash — the "good" hash, so to speak, of the file in a known-good state. The next time we open the file, the LSM and IMA hooks check whether the previous hash and the new hash match. If they don't match, well, it depends: we may get notified, we may be denied access to that file, and so on. So, all of this actually works very nicely.
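The measure-then-appraise idea described above can be sketched in a few lines. This is only a toy model to illustrate the flow — the real checks happen inside kernel hooks, and IMA stores the known-good hash in a `security.ima` extended attribute rather than an in-memory dictionary.

```python
import hashlib

# Toy model of IMA's measure-and-appraise flow: the first open records
# a "good" hash; later opens compare the current hash against it.
# Illustration only -- the real logic lives in kernel LSM/IMA hooks.

measurements = {}  # path -> recorded hash (IMA keeps this in an xattr)

def file_hash(content: bytes) -> str:
    return hashlib.sha256(content).hexdigest()

def open_with_appraisal(path: str, content: bytes) -> bool:
    """Return True if access is allowed, False if the file changed."""
    h = file_hash(content)
    if path not in measurements:
        measurements[path] = h  # first open: record the known-good hash
        return True
    return measurements[path] == h  # later opens: hashes must match

assert open_with_appraisal("/etc/app.conf", b"key=1")      # first open, recorded
assert open_with_appraisal("/etc/app.conf", b"key=1")      # unchanged, allowed
assert not open_with_appraisal("/etc/app.conf", b"key=2")  # modified, denied
```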
All of this works along with the TPM chip, so all of this is anchored in real hardware inside a real server or computer. So, let's talk about the TPM chip. The TPM is nothing more than a cryptographic chip, soldered onto the board of most laptops and computers these days. It allows us to generate true random numbers. It lets us perform cryptographic key generation and provisioning. It gives us the possibility of remote attestation, which we'll look into a little bit later. And it has something called PCRs, which stands for Platform Configuration Registers. These are special registers that cannot be reset or deleted; you can only extend them. When you first boot the computer, these registers are all zero — well, not all of them, but many of them are zero. And whenever you want to put something into one, you need to extend it: the new extended result is stored in the PCR. If you want to put something new in, you extend the previous PCR value with the new one, you get a new value, and so on. So what you get is like a chain of hashes. Okay. Now, as I said, we used this in my previous companies to make sure files were not modified. But there is one thing: in those companies, we used physical servers. So what did we do? We thought, we want to have this in containers. We want people to be able to use these kinds of tools inside containers, seamlessly, very easily. This is how we came to the idea of a new namespace. We build on Linux container technology, which is based on namespaces. We wanted that because it gives us a lot of room for improvement in terms of resource management, and because it has better performance than hypervisors like KVM. So what we added, we thought, would be just a new namespace. We call it the IMA namespace.
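The extend operation just described can be written as a one-liner: the new register value is the hash of the old value concatenated with the measurement. A small sketch, assuming SHA-256 as the hash bank:

```python
import hashlib

def pcr_extend(pcr: bytes, measurement: bytes) -> bytes:
    """Extend a PCR: new value = H(old value || measurement)."""
    return hashlib.sha256(pcr + measurement).digest()

pcr = bytes(32)  # PCRs start at all zeros on boot
pcr = pcr_extend(pcr, hashlib.sha256(b"bootloader").digest())
pcr = pcr_extend(pcr, hashlib.sha256(b"kernel").digest())

# The final value depends on every measurement and on their order,
# so a different event log cannot reproduce the same register value.
replay = pcr_extend(bytes(32), hashlib.sha256(b"kernel").digest())
replay = pcr_extend(replay, hashlib.sha256(b"bootloader").digest())
assert pcr != replay
```

This ordering dependence is exactly why a PCR forms a chain of hashes: you can only append to it, never rewrite history.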
Obviously, since we work in the open source community and we wanted our work to be useful to everybody else as well, we thought maybe somebody had looked at this problem before. And we found these two individuals here, Stefan Berger and Christian Brauner. They are two IBM engineers who have been working on IMA namespaces for quite some time. They have a set of patches that they released to the LKML, the Linux kernel mailing list. So we took that as our baseline to improve it further. Obviously, all the improvements that we made, we also published to the Linux kernel community, so they are out there, available for you if you want them. Okay. So, in the previous work by these IBM engineers, the IMA namespace is based on the user namespace: you cannot create an IMA namespace without a user namespace. The reason they did that is that it's probably the least painful way to get this done. There are other ways to do it that are more flexible, but they are more complex and they may break things as well. We made some changes, though — those first IMA namespace patches didn't suit our goals. One of them is that we added vPCRs. vPCR stands for virtual PCR. These are virtual PCRs inside the container: to the container they look like real hardware PCRs, but they are emulated. Another thing we did is change the way these IMA namespaces are activated; we now activate them through procfs. And this approach of tying the IMA namespace to the user namespace has a number of advantages, like having separate keyrings: one user namespace, one container, can hold completely different keys from other containers. No shared keys, no shared values, nothing. Okay. So, this is roughly how it works, or how it looks: we have a user namespace tied to an IMA namespace. There is no way you can create an IMA namespace without a user namespace.
And then you have a number of containers. Each container has its own IMA policy, rules, appraisals, and so on — something very similar to what you have on real hardware, a real hardware server, but in containers. Okay. Now, here it is in slightly more detail, what we do and what we're trying to do. What we have here are three pods: pod one, two, and three. And then you have the IMA measurement list. As you can see, it says hash one, hash one, hash one — these are different files in each container; it's the hash of the first file in the first container, and so on. So, because we have all these virtual PCRs, we are able to do the same thing as with a hardware PCR: these are not resettable, we can only extend them. This gives us a complete chain — if one of the links in that chain fails, the resulting hash changes, and we will know about it. That's what you see there: you have the orange or brownish boxes for pod one; we extend those, then the ones for pod two, and for pod three as well. And what you end up with is a single hash that identifies that container, which we call the cPCR. cPCR is the container PCR, and it is the extension of all the file measurements for that container. And all these cPCRs go to the TPM chip, to the actual chip. Because we want containers to be secure, but we also want them to be linked to the actual cryptographic chip — this is very important to maintain the root of trust. One important thing here is those red boxes over there. We create a kind of fake container, which we call container zero. This is not a real container: it is linked to securityfs, and it contains the hashes — extensions of hashes — of some of the IMA policies.
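The two-level chain described above — per-pod file measurements folded into a cPCR, and all cPCRs folded into one physical PCR — can be sketched as follows. The pod names and file hashes are invented for illustration; only the extend-into-PCR-12 structure comes from the talk.

```python
import hashlib

def extend(acc: bytes, measurement: bytes) -> bytes:
    return hashlib.sha256(acc + measurement).digest()

def cpcr(file_hashes):
    """Fold a container's file measurements into one container PCR (cPCR)."""
    acc = bytes(32)
    for h in file_hashes:
        acc = extend(acc, h)
    return acc

pods = {
    "pod1": [hashlib.sha256(b"/bin/app").digest(), hashlib.sha256(b"/etc/cfg").digest()],
    "pod2": [hashlib.sha256(b"/bin/other").digest()],
}

# Each pod's cPCR is extended in turn into one physical register
# (PCR 12 in the talk), anchoring every container in the hardware
# root of trust.
pcr12 = bytes(32)
for name in sorted(pods):
    pcr12 = extend(pcr12, cpcr(pods[name]))

# Tampering with any file in any pod changes the final PCR 12 value.
tampered = dict(pods)
tampered["pod1"] = [hashlib.sha256(b"/bin/evil").digest(), hashlib.sha256(b"/etc/cfg").digest()]
pcr12_t = bytes(32)
for name in sorted(tampered):
    pcr12_t = extend(pcr12_t, cpcr(tampered[name]))
assert pcr12 != pcr12_t
```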
So, what we are trying to achieve this way is: if somebody with access to the root filesystem on the actual host changes some of the policies or some of the securityfs files, we will know about that as well. And this goes to the TPM chip too. So, now, this is how it works in reality. I captured this a couple of weeks ago from our test machine. Okay. What you have at the very beginning is the host. This is the normal IMA output that you can enable on any Linux machine nowadays; I just listed the first five entries — there are a lot of them. You can see all these files are measured into PCR 10, and we have the hashes for the files and the file names. Okay. Then we have the container — and when I say container, we are inside the container already. Okay. And we have the very same file, exposed as the ASCII runtime measurements, and we see exactly the same thing as on the host, right? We see PCR 10, hashes, and the files. But these files are not on the host; they are inside the container. Okay. Finally, back on the host, we have a new file that we expose with our Linux patches, with our kernel patches. It's called ascii_vpcr. And what it shows is the cPCRs, the container PCRs — a hash that identifies each container entirely. We have the container UUID; this is a UUID for future use, we don't use it at the moment, it's just a UUID created by the Linux kernel. We have the namespace ID as well, which identifies that container. And then we have PCR 12: all these cPCRs, the container PCRs, are stored in PCR 12 on the physical TPM chip. Okay. So, this is how it works — very, very simple. We extend all the vPCRs and get a final cPCR per container, and all these cPCRs are also extended one after another, so we end up with a hash that is put into PCR 12 on the actual TPM chip. Okay.
So, this way we can make sure that nobody modified any hash in the actual securityfs to try to get around it. So, okay. Now, everything I explained so far was just kernel stuff, and that's very nice, but on its own it's not very useful in practice — fine, I can build a container, but what do I do with it? Nowadays most people run cloud environments. What we have here is a very simple diagram: we have one or more control planes that are synced; we have at least one, but usually many, nodes with a kubelet, which takes care of that node in Kubernetes; then containerd; and then the container running on top of the kernel. Okay. Here it is in more detail. What we wanted was something that can be used in real life, and most people in real life use Kubernetes, so we tried to build something for Kubernetes that uses IMA. To achieve that, we need to change all these blue boxes: the API server in the control plane, the kubelet on the nodes, containerd and runc, and the interfaces as well — from the client to the control plane, from the control plane to the kubelet, and CRI and OCI too. So, let's see how it works. We talked about the kernel; now it's time to talk about user space. It's the component just above the kernel that creates the container — usually that's runc. This is a very simplified diagram of the runc algorithm for creating namespaces; I didn't put in all the steps. What you see is a red line: this is where we create the IMA namespace. We first need to create the user namespace, with all the ID mappings and all that, then — with the patches we added — we create the IMA namespace, and then we continue with the regular flow: we create the mount namespace, the cgroup setup, and so on and so forth.
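The ordering constraint in that flow — user namespace first, then the IMA namespace, then everything else — can be modeled in a few lines. This is only a sketch of the dependency rule, not runc's actual Go code; the comment about a procfs write reflects the talk's statement that activation happens through procfs, without naming the file.

```python
# Sketch of the namespace-creation order from the talk: the IMA
# namespace depends on the user namespace existing first, and is
# then activated via a write to a procfs file (path unspecified).

def plan_namespaces(want_ima: bool, want_userns: bool) -> list[str]:
    """Return the namespace creation order, enforcing the userns dependency."""
    if want_ima and not want_userns:
        # Mirrors the sanity check in the patched user-space stack:
        # an IMA namespace without a user namespace is rejected.
        raise ValueError("IMA namespace requires a user namespace")
    ns = ["user"] if want_userns else []
    if want_ima:
        ns.append("ima")  # activated through procfs in the patched kernel
    ns += ["mount", "pid", "net", "cgroup"]  # regular runc flow continues
    return ns

assert plan_namespaces(want_ima=True, want_userns=True)[:2] == ["user", "ima"]
try:
    plan_namespaces(want_ima=True, want_userns=False)
except ValueError:
    pass
else:
    raise AssertionError("expected rejection")
```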
An important point here: the creation of this IMA namespace only happens if we get something from the upper layer. If we are told from the upper layer that this container should have an IMA namespace, we create one — maybe not all containers need an IMA namespace, maybe only some of them. So, let me talk a little bit about the changes to the APIs. These are the changes that we made. In the OCI spec — in case you don't know, this is just a JSON file that describes how the container is created — there is a section called namespaces, and we simply added the IMA namespace there. The CRI, in case you don't know, is the protocol that links the kubelet with containerd, and it's essentially gRPC over Unix sockets. We just added a new boolean flag there, in the Linux sandbox security context. Sandbox, inside Kubernetes, means pod — so we are enabling IMA for the entire pod. And then in Kubernetes, in the actual YAML file that you use to deploy a pod, we have, in red, a new flag, ima, and in blue, hostUsers: false. hostUsers: false is how you tell Kubernetes to create a pod with user namespaces; otherwise it won't work. Obviously, we also made some minor changes to the kubelet, containerd, and the API server. I didn't put those here; they are mainly sanity checks, some format changes and so on — nothing major. For example, if you want to create an IMA namespace and you didn't specify a user namespace, it will fail. So, okay. When I talked about the TPM chip, I mentioned remote attestation. It's important for us to know that a container is in good health. So, again, with everything we've done so far, you are now able to deploy these containers in Kubernetes with IMA, and that's all good.
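Putting those two YAML flags together, the pod manifest from the slide might look something like this. Note that `hostUsers: false` is a real upstream Kubernetes field, but the name and exact placement of the `ima` flag here are my assumption based on the talk — the patches are not upstream.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: ima-demo
spec:
  hostUsers: false      # real Kubernetes field: run the pod in a user namespace
  securityContext:
    ima: true           # hypothetical placement of the talk's new IMA flag
  containers:
  - name: app
    image: nginx
```

Without `hostUsers: false`, the `ima` flag would be rejected by the sanity checks mentioned above, since the IMA namespace depends on the user namespace.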
But how does the DevOps engineer know that a particular container, a particular pod, has been compromised? So, we created a special controller for Kubernetes. We call it the integrity controller. And we also have these two red boxes there. We have the attester DaemonSet — a DaemonSet is a Kubernetes object: every time a host joins the cluster, it automatically gets the attester pod deployed. And then, on a completely different machine, we have a verifier pod. Okay. It runs as a cron job, so the DevOps engineer can set it up to run every 10 minutes, every hour, every day, whatever. Now, it's important to know that the attester and the verifier are linked directly to the TPM chip. Okay. We use a Kubernetes device plugin to expose the physical TPM chip to these pods, and they are privileged pods as well. So, roughly, this is how it works. We create a pod: the kube-apiserver receives the command to create a pod and tells the kubelet to create it. And our integrity controller is notified about this pod creation — it's in the watchers list. After that, the integrity controller goes through an attester pod, which has access to the TPM chip, and gets the golden hash for that container — the good hash, the hash of a container that hasn't been modified. Okay. And then we have a loop — every hour, every month, whatever — where a verifier pod is created. Okay. This is how cron jobs work in Kubernetes. The verifier pod asks the integrity controller, which is the one that keeps all these golden hashes for the containers: give me the golden hashes for all the containers on this machine. Okay. And then it goes to the attester pod on each of these nodes and says: get me the container hashes for all the IMA-enabled pods on your host. If everything matches, fine.
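The verifier's comparison step can be sketched like this. All the names and data here are illustrative, not the project's actual API — the point is just the golden-hash-versus-attested-hash check described above.

```python
# Sketch of the verification step from the talk: compare the golden
# hashes held by the integrity controller against the current hashes
# reported by a node's attester pod.

def verify_node(golden: dict[str, str], attested: dict[str, str]) -> list[str]:
    """Return the IDs of containers whose current hash differs from the
    golden hash recorded at pod creation (or which are missing entirely)."""
    compromised = []
    for container_id, good_hash in golden.items():
        if attested.get(container_id) != good_hash:
            compromised.append(container_id)
    return compromised

golden = {"pod1": "aa11", "pod2": "bb22"}
attested = {"pod1": "aa11", "pod2": "ffff"}  # pod2 was modified on disk

assert verify_node(golden, attested) == ["pod2"]
```

In the real system the `attested` values come from the attester pod, which reads them through the TPM device plugin, so a host-level attacker cannot forge them without also defeating the hardware chain described earlier.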
If something doesn't match, it is reported to the integrity controller, and from there back to Kubernetes. So, this way, Kubernetes has access to all this information about which pod actually failed. Okay. We are almost done, but we are not done yet. We found a lot of issues when we tried to build this whole project — we had some roadblocks. The first one is what happens when pods are deleted. It's an interesting question, because in Kubernetes the way you update a pod, basically, is to create a new pod with the new version and delete the old one. Now, if you delete a pod, then in the previous slide where I showed the cPCRs on the host, the cPCR for the deleted pod is gone. And if we replay that whole chain of hashes, we end up in a situation where it just doesn't work: PCR 12 doesn't match. We are thinking about different ideas for fixing this, but it doesn't work yet. Another problem is what happens with non-overlapping policies. Right now, the way we built it, all the pods on a node share the same policy — we just record all the files that can be read, or executed, or whatever. But what happens if we want one pod where we only want to make sure executable files are not changed, and another pod where we want to do that for all files? This can't be done at the moment; we're still thinking about what we can do about it. Shared storage and NFS — that's a big one as well, because a lot of pods use external storage: they use S3, they use NFS, and so on. It's a big issue because everything you saw here, all this IMA machinery, relies on a very simple thing: extended attributes on each file. The Linux kernel provides extended attributes, and IMA writes the file hashes into them. If those extended attributes are not there — as is the case for NFS, for example — there is nothing we can do about it, right? Stateful pods.
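The pod-deletion roadblock is easy to demonstrate with the same hash-chain model as before: once one pod's cPCR is gone, replaying the surviving measurements can never reproduce the value the TPM already holds. The pod names below are invented for the illustration.

```python
import hashlib

# Demonstration of the pod-deletion problem: PCR 12 is the ordered
# extension of every pod's cPCR, so after a pod is deleted its cPCR
# is missing and the chain cannot be replayed to the recorded value.

def extend_all(cpcrs: list[bytes]) -> bytes:
    acc = bytes(32)
    for c in cpcrs:
        acc = hashlib.sha256(acc + c).digest()
    return acc

cpcr_a = hashlib.sha256(b"pod-a").digest()
cpcr_b = hashlib.sha256(b"pod-b").digest()
cpcr_c = hashlib.sha256(b"pod-c").digest()

pcr12 = extend_all([cpcr_a, cpcr_b, cpcr_c])  # value recorded by the TPM

# pod-b is deleted during a rolling update; replaying only the
# surviving cPCRs yields a different value than the hardware PCR.
replayed = extend_all([cpcr_a, cpcr_c])
assert replayed != pcr12
```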
So, again, this is a hard one to fix. What happens if you want a stateful pod — a pod whose files can be changed? You may want, say, one folder that can be changed and another folder that cannot. As I said, the IMA namespaces that we built are tied to user namespaces, and currently user namespaces only work for stateless pods. So with the current implementation, there is no way we can get this working with stateful pods. This is one issue that we may need to fix. And what happens with multi-container pods? Init containers and ephemeral containers I'm not going to talk about, but sidecar containers are important, because a lot of people use them for service mesh architectures, and it's very important to get them working as well. So, we still have a lot of issues. We don't know how to fix all of them; we have some ideas, but there's still a lot of work to be done. In case you want to see our work, you have the GitHub links and so on. Again, what we are trying to do with this project is link the kernel side and the user space side. What you find in practice is: if you talk to the Kubernetes community, they will tell you, this feature looks very nice, but we cannot work on it because there is no kernel support; and if you talk to the kernel community, they say, we don't know what user space wants from us. So we are hoping to get contributions from many of you. That's my email address, so you can email me if you want. But, yeah, we want to actually work together — people from the kernel side and from the user space side — so we can get something really working in real-life environments. Thanks a lot to my colleagues — Tanisha Makin, Ilya Hanoff, and Sia Blodzabinski — everything you saw here today couldn't have been done without their hard work, so thanks a lot to them. Well, I'm done. Questions? Right. So, the question is about the overhead of all these cPCRs and so on.
No, unfortunately, we don't have any figures at the moment. We're first trying to build a functional prototype that we can use, with at least all the features that we need. After that, we can take performance numbers and optimize. If there are no more questions, thanks a lot for watching and for attending, and I hope to see some of you again maybe next year. Thanks a lot.