Hello everyone, my name is Laurent Bernaille and I'm here with Eric. We're very happy to be here today — it's our first in-person KubeCon in North America since 2019, so it's great to be here with you. Today we're going to discuss how we migrated the way we build container images to Kubernetes, and as you will see, it was a bit eventful and got pretty interesting.

Before we start: we both work for Datadog, and we have a few numbers about Datadog on the left-hand side of the slide. What matters is that we are an observability company and we do a lot of things in the observability space, but today we're not going to talk about the product. We're going to talk about Datadog's infrastructure, because we both work on the Datadog infrastructure teams. We essentially work on the Kubernetes environment, which is why we're going to talk about building containers in Kubernetes. As you can see from the numbers, we have tens of thousands of nodes and dozens of clusters, and with this come a lot of challenges around many things — in particular around building container images.

Before we dive into an interesting issue, let's do a quick overview of how we build things at Datadog. Quite a while back — up to four or five years ago — we were using a very simple setup to build our applications: GitLab runners pulling jobs from GitLab, using docker-machine to provision AWS instances, and running the jobs on them. Pretty standard, pretty simple.

When we started our migration to Kubernetes, in addition to building applications we also needed to build container images. If you're familiar with the way Docker images are built, it usually means having access to the Docker daemon, which basically means being root on the instance where you run. That's why we couldn't reuse the runners we were using for applications: there was no way we could run a workload that would end up being root on the machine because it was running Docker commands. So we ended up with a second set of runners — standalone machines running Docker, doing one job each; when the job was done, they were killed and replaced by new ones.

That worked fine until the company had grown a bit and we had many more engineers and many more builds every day. docker-machine was starting to hit limits — we were hitting AWS API rate limits — and it was getting tricky to keep up with the scale of the company. So we migrated our runners to Kubernetes.
This was easy, because by that time we had enough Kubernetes knowledge at Datadog to do it, and it was pretty successful. But as you can see, to build Docker images we were still using the dedicated runners.

The next step in this journey: some customers were starting to ask us for ARM binaries — we provide the Datadog Agent, for instance, that people use for monitoring, and people were starting to use ARM CPUs, so we needed to give them ARM binaries. Because we were running on Kubernetes, we set up Kubernetes nodes with ARM CPUs and ran builds on those nodes to get native builds.

Of course, to be able to do that we needed Docker images that supported the ARM architecture, so we wanted multi-arch images. To do this, we transitioned our runners from running simple docker builds to running docker buildx, which lets you do multi-arch builds by relying on emulation with QEMU. To be honest, this was magic — it just worked. We were able to produce images that ran on both x86 and ARM by using emulation. However, as you can imagine, some builds went from 10 minutes on native x86 to more than an hour on emulated ARM. So it wasn't ideal.
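To make that concrete, an emulated multi-arch build with buildx looks roughly like this — a generic sketch, not our exact pipeline, and the registry and tag are placeholders:

```
# One-time setup: register QEMU emulators through binfmt_misc
docker run --privileged --rm tonistiigi/binfmt --install arm64

# Build both architectures from one Dockerfile and push a single
# multi-arch manifest; the linux/arm64 half runs under QEMU emulation
docker buildx create --name emulated --use
docker buildx build \
  --platform linux/amd64,linux/arm64 \
  --tag registry.example.com/app:latest \
  --push .
```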
And today the talk is going to focus on this part: how we build images, and how we transitioned that to Kubernetes.

As I was hinting, this system was starting to show its limits. The way we were building Docker images was the last workload running outside of Kubernetes, which was becoming a pain for the team managing it: everything else was managed by a single team doing everything on Kubernetes, and yet that team had to manage dedicated instances outside of Kubernetes just for this workload. It also required investing in a legacy platform — since we wanted multi-arch builds and native ARM builds, we needed new runners, which was not something we wanted to invest in. And finally, as I said before, building images this way raised a lot of concerns in terms of security, because your builds are basically root on the machine they run on.

This gets us to the main topic of the presentation: what if we could actually build images inside Kubernetes? It's very attractive, and many people have tried to do it, so there are multiple ways. The first one is Docker-in-Docker: similar to what I was describing before, you create a Kubernetes pod in which you mount the Docker socket, and from there you run Docker builds. You can also use standalone builders — builders dedicated to building images inside Kubernetes; I gave a few examples. And finally, you can use a dedicated build daemon, which is what BuildKit is about.

So, can we use these options? As I'm sure you've guessed, the first one was a no-go from the start: there was no way we would give build pods root access on the Kubernetes nodes. The second one, standalone builders, actually worked pretty well — we tried them and they work — but they were a bit more complex to use, because if you want multi-arch images you have to distribute jobs: one job on x86 nodes, one job on ARM nodes, and one job to assemble the multi-arch image and push it to the registry. So it works, but it's more complex. buildkitd, the dedicated build daemon, was very attractive to us because the UX is great: you can use the same buildx commands to build locally on your laptop, to build on a dedicated instance, and to build with remote builders, allowing native builds. So this is what we're going to focus on today.

What would this look like? Let's get back to our build infra running in Kubernetes. What we wanted to achieve is simply this: when a job has to build a new image, one of the workers runs the build — it runs a docker buildx build command, contacts a buildkitd daemon, and the daemon does the build. What's nice is that because these are remote builds, and buildx supports them, we can have daemons running on x86 nodes and daemons running on ARM nodes, and buildx can use the remote builders to build a native image on an x86 node and a native image on an ARM node, assemble them, and create a multi-arch image.
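On the client side, wiring up remote builders for this looks roughly like the following — a sketch, with hypothetical daemon endpoints:

```
# Point buildx at two buildkitd daemons, one per architecture
# (the tcp:// endpoints here are hypothetical service addresses)
docker buildx create --name k8s-builders \
  --driver remote --platform linux/amd64 tcp://buildkitd-amd64:1234
docker buildx create --name k8s-builders --append \
  --driver remote --platform linux/arm64 tcp://buildkitd-arm64:1234

# Each half of the build runs natively on the matching node, and
# buildx assembles and pushes the multi-arch manifest
docker buildx build --builder k8s-builders \
  --platform linux/amd64,linux/arm64 \
  --tag registry.example.com/app:latest --push .
```

The nice part is that the same build command works unchanged against a local builder, so developers don't have to learn anything new.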
Of course, we wanted to make this safe. As you know, the main challenge with building images is that most of the time it requires being very privileged — you have to install packages, for instance, which requires being root. But what we wanted to achieve is a rootless daemon: we don't want containers running as root, because if there's a container escape you end up with workloads that can actually do things on the host.

So what do we mean by a rootless daemon? It's a daemon where the main process inside Kubernetes runs as non-root — as a normal user. However, as I was saying, we still need to be root to run some commands, and this is where user namespaces come into play. A user namespace is a way to simulate a root user: you're not root on the whole host, you're just root in this limited namespace, with limited capabilities. User namespaces are very powerful, and they interact with many low-level kernel mechanisms like capabilities, mounts, and security modules. If you're curious about how this works, there's a great presentation from Akihiro Suda at KubeCon 2019 where he explains it in detail.

To give you a very quick overview, this is what it looks like: in the BuildKit worker, PID 1 runs as UID 1000 — a non-privileged user, which is exactly what we want — and everything else runs in a user namespace where the UID is 0. I put a star on that zero because it's not really zero: it's zero inside the namespace, but it's not root on the host.

A very quick example — it's easy to simulate if you want to try it on a Linux machine. This command creates a user namespace, and in there you can see we're root in the namespace, but we can't touch a file that requires being root on the host. That's why the `touch /etc/x` fails: we're only root in this namespace, so we can't do something that requires root permission on the host. And if we touch a file we can modify, it appears to belong to root inside the namespace — but if we exit the namespace, the file actually belongs to UID 1000. So this is how the magic works.

That's the main intro; now we're going to dive into the interesting and fun things. To be fair, when we started working with BuildKit, everything mostly worked: more than 80% of the builds just worked out of the box. But today we're going to focus on the 20% that were interesting — because otherwise, where's the fun? We're going to walk you through three different issues, in increasing order of complexity.

The first one was actually simple enough — it took us only a few hours to understand what was happening. This issue started with this extremely complex Dockerfile... right. We're just pulling an image from a public registry and running a Go test. What could go wrong? This felt like something that should just work. Well, when we ran it inside our rootless BuildKit environment, here is what we got. That was of course very surprising, because this is about the most basic Dockerfile you can write. And something interesting: we got an "operation not permitted" message, but it says something about mounts — so something might be happening with the filesystem. Pretty surprising; let's look into it.

We wanted to understand what was happening, so we straced the buildkitd daemon, started the build, and looked for permission errors. And this one is pretty interesting: there's a syscall trying to set an extended attribute — and it's actually an SELinux attribute — and it's failing. I've made the joke that "it's always DNS" because I tend to talk about networking issues, and to be fair it often is DNS, but it's also quite often SELinux — and the only thing I know how to do with SELinux is disable it. I'm sorry.

So we had a very good idea of what might be happening. We downloaded the layers of the image, extracted the tarball, and looked at its contents, and as we suspected, there are actually SELinux labels on the files. It turns out that if you're root on the host, outside a user namespace, you can modify the security context of a file — that's the top of the slide. But if you're inside a user namespace — the second part of the slide, where we enter the user namespace used by buildkitd — you can't modify the SELinux attribute of a file. The kernel disallows it from inside the user namespace, which makes sense.
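If you want to see this for yourself, a rough sketch on an SELinux-enabled Linux host looks like this — the label value here is just an example:

```
touch /tmp/f

# As real root on the host, relabeling works
sudo setfattr -n security.selinux \
  -v "system_u:object_r:etc_t:s0" /tmp/f

# Inside a user namespace, where we are only "root" locally,
# the kernel refuses the same operation
unshare --user --map-root-user \
  setfattr -n security.selinux \
  -v "system_u:object_r:etc_t:s0" /tmp/f
# -> setfattr: /tmp/f: Operation not permitted
```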
There's nothing we can really do about it — we opened an issue upstream, but to be fair, there's no magic fix. It's pretty easy to mitigate, though: either you strip the SELinux attributes by pulling and re-pushing the image, or you use an image without any SELinux labels — which, to be fair, is what I would recommend. In our case it was an upstream image, and the release just after the one we were using didn't have the labels anymore.

Let's move on to issue number two. This one, as you're going to see, was slightly more complex and took us a few days to understand. The Dockerfile is a bit more involved, but once again, no rocket science: we're just downloading a .deb file and installing it. As before, this works perfectly fine with docker build, and it works perfectly fine with BuildKit in root mode. However, when you use BuildKit in rootless mode and do exactly this, for this specific deb, the build times out. Because we are very scientific, we retried — and this time it failed again, but in a very different way: with an error saying "address already in use". Very weird. So at that point we were like, let's try to understand what's happening.

We tried again; this time it was consistent — the same "address already in use" error. Very confusing. So we said, let's start from scratch: we wiped BuildKit's state and tried again. This time the build timed out — OK, we're back to what we had at the beginning. Let's try again: it fails with "address already in use" again. So at least we had a consistent way of provoking the failures — reproducible, even if it didn't make sense yet. Let's debug.

We used a pretty simple approach: we added a netstat at the beginning and at the end of the command to see what was happening. Starting with fresh BuildKit state, the first netstat shows no port bound, which is expected — but the second one shows that the port is bound by the daemon the package installed, and the build hangs. OK, so maybe the package installation is starting a daemon; that happens sometimes. On the second build, netstat shows the port already bound, and the package installation fails with "address already in use". So we're getting somewhere: it looks like the package installation starts a daemon, and that daemon is somehow still running when we do the second build — which makes little sense, because it's a completely separate build.

Can we reproduce this with a different method? We used a very simple reproducer: the Dockerfile is on the top left of the slide and the script on the top right. It's pretty simple: we start from a base image, add a script that launches a sleep in the background, run the script, and echo that the script is done. As you can see, everything works fine — except we never get to the last line of the Dockerfile, because the build hangs. Exactly what we were seeing before, so good: we've reproduced it. And if we look inside the container, we actually see the sleep still running, which matches what we suspected earlier — some daemon was still running, and that's why the port was bound. So it seems the process leaked and was never stopped.
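Reconstructed from the slide, the reproducer was roughly along these lines — the base image and timings are placeholders:

```
cat > script.sh <<'EOF'
#!/bin/sh
sleep 3600 &            # stands in for a daemon left running
echo "script is done"
EOF

cat > Dockerfile <<'EOF'
FROM ubuntu:22.04
COPY script.sh /script.sh
RUN chmod +x /script.sh && /script.sh
RUN echo "never reached"
EOF
# Under rootless BuildKit, the build hangs before the last RUN
```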
So let's look at exactly what's happening under the hood. This is the anatomy of the BuildKit worker: we have the buildkitd daemon; when we start the build, buildx execs into the pod and starts the build steps. Here, it runs bash, and bash runs sleep in the background. What's important — and we'll come back to this — is that there is no process sandbox: you can see all the processes in the container. If you exec into this pod and run ps, you see our build step, but you also see buildkitd and its children — you see everything. And when bash exits — so bash is gone — the build hangs, and the sleep process is still there, never cleaned up.

At that point we got curious: what happens if we actually kill the sleep inside this hanging build pod? Well, we were able to kill it, but it was never reaped — it became a zombie — which was also kind of interesting to us.

So we took a step back: how does this work usually? Build steps normally run in a process sandbox, and when the step finishes, all the processes inside the sandbox are killed. However, in rootless mode we run BuildKit with a flag (oci-worker-no-process-sandbox) that is very explicit: it says you don't get a process sandbox. And because there's no sandbox, BuildKit can't keep track of all the processes started during the build, and it can't clean them up.

That brought us to: why do we need this flag? The reason is the way procfs works in containers. When you create a container you get your own procfs, but for security reasons, every runtime gives you a limited procfs, where some of the directories in /proc are either masked — meaning you get an empty directory — or made read-only, so you can't modify things system-wide. That's on purpose: we don't want to expose too much to containers, and we don't want containers modifying things on the host. However, once you're in a container whose /proc is partially masked, if you want to mount a new procfs — which is what we would need to create a sandbox for our build step — you actually can't. There's a kernel check, called "mount too revealing", that verifies whether the procfs you have access to is partially masked, and if it is, it won't let you mount a fresh one. That's why our worker can't create a sandbox for the processes, and why all the processes end up visible in buildkitd's process namespace.
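You can poke at this check yourself — roughly like this, though the exact behavior depends on your runtime and kernel configuration (seccomp or AppArmor may get in the way before the kernel check does, hence the unconfined options):

```
# In a container whose /proc is partially masked, mounting a fresh
# procfs from new PID and mount namespaces is rejected by the kernel
docker run --rm --security-opt seccomp=unconfined \
  --security-opt apparmor=unconfined ubuntu:22.04 \
  unshare --user --map-root-user --pid --fork --mount \
  sh -c 'mount -t proc proc /proc'
# -> mount: /proc: permission denied ("mount too revealing")
```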
In conclusion, there's no real resolution to date — we chatted about it extensively with the maintainers in the upstream issue, and there's no real way to fix it. There are several possible mitigations, though. The one we're using for now is making sure our Dockerfiles don't start daemons in the background — or if they do, we explicitly stop them. That's easy enough.

Something that could be interesting in the future: Kubernetes exposes a notion called ProcMountType, where you can say that a specific container gets a procfs that is not masked — fully accessible. If you remember the "mount too revealing" check, that would then pass, and we would be able to create a new procfs for our build steps — a sandbox inside the container. That's extremely promising, but the feature has been in alpha since Kubernetes 1.12, so we're not exactly sure it's going to go GA at some point. We could also use one Job per build, where a buildkitd daemon serves a single build, which would avoid the issue. There are some security implications here too: what I didn't show before is that one buildkitd can run build steps from multiple builds at the same time, in the same set of processes — so a build could see, or kill, processes belonging to other builds. Not too big of an issue, but still not ideal.

And this gets us to the last and third issue, which is the most complex one.

OK, yeah — so our last KubeCon talk was "Ghosts in the Runtime". We like ghosts, so this time it's ghosts in the filesystem. We're just going to build a Go program — the local volume provisioner, fairly straightforward. We clone the repository, check out a specific tag, and run go build. What could possibly go wrong? Well, we get a compilation error: ConsistentRead redeclared. And indeed, it's declared in two files here, read.go and consistent_read.go. But the strange thing is, when you build this yourself on your laptop, or in a docker build, or in non-rootless BuildKit, it works perfectly fine. So what's happening?

Let's have a look at the directory contents. Indeed, in this vendor directory we have both files, consistent_read.go and read.go, so the compilation error is normal at this point — but how did we get into this state? In the master branch you have just read.go, and at the tag you have just consistent_read.go; we clearly shouldn't have both files at the same time. So we checked each layer: the git clone shows read.go, that's expected; the checkout of the tag shows only consistent_read.go, so that's fine too; but we've already seen that at the next step, we have both files for some reason. So something's wrong with the filesystem here. What's going on?
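To recap, the failing build was essentially this — the repository URL and tag here are my reconstruction, not the exact ones from the slide:

```
cat > Dockerfile <<'EOF'
FROM golang:1.19
RUN git clone \
  https://github.com/kubernetes-sigs/sig-storage-local-static-provisioner /src
WORKDIR /src
RUN git checkout v2.4.0
RUN go build ./...
EOF
# -> ConsistentRead redeclared ... (in rootless BuildKit only)
```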
So: we use overlayfs as our snapshotter for builds — a fairly common choice in this situation. Overlayfs is what's known as a union filesystem. The idea is that you have a set of directories that form the base — the lower layers — and you want to expose a mount point where changes can be made without affecting the original files. To do that, overlayfs keeps an intermediate directory, the upper layer, which records the changes made on the filesystem that's exposed to processes through the mount point.

For instance, a file that's deleted will be marked in the upper layer with a tombstone file — a special character device with major/minor numbers 0/0 — which serves to mask the file in the lower directories so that it no longer appears in the mount point. Directories in the upper and lower layers are merged, so the mount point exposes the combination of the original files and the changes, depending on which files have been masked, changed, and so on.

So far so good. Let's try to reproduce the steps BuildKit takes. First we unshare into a user namespace — that's what BuildKit does, since we're running rootless. We create some directories to provision our overlay filesystem: layer1 is the lower layer, where we do the git clone; layer2 is the mount point we expose; and layer2-diff records the changes made through the overlay — in particular, when we do the git checkout after mounting the filesystem, layer2-diff records the changes it makes.

We then look at what we end up with in each of the directories: layer1, from the clone, has read.go — fine; layer2-diff has consistent_read.go — expected, that's the new file; and the mount point shows only consistent_read.go, which is consistent with what we saw before. But we haven't reproduced the problem yet.

So now let's pile on step three of the build, where we list the directory contents — it's the equivalent of the step where we run the build and get the compiler error. We unmount the previous overlay, provision new directories — a diff directory for step three, and layer3, which will be the exposed mount point — and we mount. One thing to note here: in the lower directories we now have two directories, because we have first the changes made by the git checkout (layer2-diff) and then the base from the git clone (layer1). We list the files — and we've reproduced the problem: we see both files. So clearly something is wrong here.
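Condensed into commands, it looks roughly like this — paths simplified, just the two files instead of a real git checkout, and note this only reproduces on a kernel like the one we were running at the time (a pre-5.11 Ubuntu kernel with overlayfs-in-user-namespace patches, no userxattr):

```
unshare --user --map-root-user --mount   # enter a user+mount namespace
mkdir -p l1/d l2 l2-diff work2
touch l1/d/read.go                       # "step 1": the clone

mount -t overlay overlay \
  -o lowerdir=l1,upperdir=l2-diff,workdir=work2 l2
rm -rf l2/d && mkdir l2/d                # "step 2": the checkout replaces
touch l2/d/consistent_read.go            # the directory contents
ls l2/d                                  # -> consistent_read.go (looks fine)
umount l2

mkdir -p l3 l3-diff work3                # "step 3": previous diff becomes
mount -t overlay overlay \
  -o lowerdir=l2-diff:l1,upperdir=l3-diff,workdir=work3 l3
ls l3/d                                  # -> consistent_read.go AND read.go
```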
If we take a step back and look at what we've done: we have layer1, where we did the git clone; then an overlay where we did the git checkout, so the diff layer contains consistent_read.go and that's what the mount point exposes; then we piled on step three, with an extra diff layer that isn't too important here, and the exposed mount point shows both files — which is our problem. At this point we began to suspect that the problem was somewhere in the layer2-diff directory. And indeed, one thing to note: I mentioned tombstone files earlier — we removed read.go, and yet there is no tombstone file anywhere in the listings. So where is it?

Well, maybe we missed something — and yes, we had. In overlayfs there's actually an optimization for directories whose entire contents have been deleted while the directory itself still exists: an opaque flag. It basically saves overlayfs from having to recurse into the lower-layer directories when everything beneath has been removed. How does it work? In our case, the opaque flag has been set on the directory in layer2-diff, and that is why we no longer see read.go through the layer2 mount point. Concretely, overlayfs sets an extended attribute called trusted.overlay.opaque on the directory whose lower layers it wants to mask.

So, can we see this extended attribute? We run getfattr inside the user namespace — and we get nothing. That's interesting: how come read.go ends up masked in step two, then? If we can't see the extended attribute, it shouldn't work. Now, if we rerun the same command in the host's user namespace — the initial user namespace — we do see the extended attribute. So it is being set. But that's also a bit strange, because the trusted namespace of extended attributes is subject to permissions: you can only set a trusted.* attribute if you have CAP_SYS_ADMIN — the system administrator capability — in the host user namespace, which is not the case in our rootless overlayfs sequence. We shouldn't be able to set that attribute, and yet it's there.

So we have a bunch of mysteries: how come trusted.overlay.opaque is being set when we shouldn't have the permission to do so? And why is it that when the directory flagged as opaque is mounted as an upper directory the problem doesn't reproduce, but when it's a lower directory it does?

At this point we resorted to kernel function tracing. We reran the git checkout step with function tracing enabled and looked for the operation where the extended attribute is set — and there we saw that the function being called is __vfs_setxattr_noperm. That made us suspect that the credential checks were being bypassed in this scenario, for some reason we didn't yet understand. Looking at the source code and the git commits, we realized that on the kernel we were using — an Ubuntu kernel — there had been work to make overlayfs usable inside user namespaces, and the change was precisely that the credential checks for setting and removing the trusted.overlay.opaque attribute were bypassed. So there's no credential check on the write path. Interestingly, though, no change was made for the read path. So now we understand how the attribute gets set — but we still can't read it. That's why in step three we see both files: we can't read the extended attribute, so the opacity isn't honored. Fine. But then, in step two, why don't we have the same problem? We can't read the attribute there either.

Well, it turns out that, if you think about it, filesystems do a lot of caching — so maybe if we drop the caches, we'll see something different.
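Before we get to that, here's roughly what the two views of the attribute from a moment ago looked like as commands (getfattr from the attr package; paths as in the earlier sketch):

```
# Inside the build's user namespace: the attribute is invisible —
# reading trusted.* without CAP_SYS_ADMIN in the initial namespace
# just reports that it doesn't exist
getfattr -n trusted.overlay.opaque l2-diff/d
# -> l2-diff/d: trusted.overlay.opaque: No such attribute

# From the host's initial user namespace, as real root:
sudo getfattr -n trusted.overlay.opaque l2-diff/d
# -> trusted.overlay.opaque="y"
```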
So again we reproduce our case, up to step two: we reach the point where we list the files, and we see only consistent_read.go — the opacity is somehow being honored, even though we don't expect it to be. We then drop the caches (via /proc/sys/vm/drop_caches), and lo and behold, both files are there. What's actually happening is that, in addition to the extended attribute, an opacity flag gets set on the directory entries in the dentry cache — so as long as the dentry is in memory and the directory doesn't need to be re-read from disk, the opacity is honored.

In short: the kernel we were using carried a patch to make overlayfs work inside user namespaces, but it was a partial change — it only touched the setting and removal of the extended attribute — and thanks to caching, sometimes the opacity is honored and sometimes it isn't, which leads to some rather interesting behaviors.

Now, the nice thing is that kernel 5.11 introduced a new overlayfs mount option called userxattr. What it does is change the namespace used for the opacity flag and the other overlayfs extended attributes, making it possible for any user to set them — simply because they're no longer in the trusted namespace. The other nice thing is that BuildKit's overlayfs implementation adds userxattr support when it's available. So all we had to do was wait for kernel 5.11 to be available for our distribution, Ubuntu, then simply roll it out and recycle our nodes — and from that point, it just worked.

So at this point we'd solved pretty much all our problems with building images, and we have some pretty good results. As Laurent said, right from the get-go something like 80% of the images were building fine with BuildKit in rootless mode, and really it was just a matter of getting past the last few hurdles. Starting with a monorepo dedicated to building images was clearly very helpful here, because it surfaced all the problems very fast. This allowed us to decommission our dedicated Docker runners, giving us easier node lifecycle management — the developer experience team no longer needs to manage a dedicated set of Docker runners, and things like that — and it got us native multi-arch builds, because emulation is way too slow. Currently we have several hundred distinct images being built on Kubernetes — all our images are now built on Kubernetes, in fact — and the system is handling more than a thousand builds a day, perfectly reliably. We're really happy with it.

So our takeaways: BuildKit is really good — we've had a very good experience with it. It gives us remote builds, and it gives us multi-arch images really easily. Rootless is, or was, a little bit bleeding edge, but with the changes in kernel 5.11, overlayfs in user namespaces has become really very usable — it works fine. I think the only remaining problem that could still affect us is process sandboxing, so at some point we'll have to check out the procMount unmasked option that Laurent mentioned.

And that's it. Thank you.