Oh, time to start? OK. Hi, everybody. Hi, Tom. How's everyone doing today? My name is Nalin Dahyabhai, I work in the Containers group at Red Hat, and I'm here today to talk to you about how we build container images. Or more specifically, ways that you could build a container image that you probably shouldn't, but we're going to do it anyway as a learning exercise. Then I'm going to list some reasons why we shouldn't do it that way, and then we're going to look at some other ways to do it that are way better.

So let me back up a second and recap. For those of us who've been here, we've been talking about container runtimes. People who saw Scott McCarty's talk or Dan Walsh's talk or Sally and Urvashi's talk earlier today heard lots about container runtimes and engines and a little bit on orchestration. I missed some of those talks, but we have been talking about it here today. One of the things that we didn't go into in much depth is what goes into an image. The image is what you use as a template for launching a container. It is the initial state of the root file system, plus some additional information about how to run the stuff that you have inside of that file system. And it's a useful exercise to know how it works.

The purpose of this particular talk is to demystify the build process, and we're going to do it by actually walking through it all directly. By the end of the day, well, actually within the next half hour or so, you're probably going to walk out of here thinking, well, I know how to build that. And you might just go ahead and do it anyway. And I'm not going to stop you. So for an example case, we are going to use, well, probably find inside of a chroot. We're going to start with that. We're going to build a container image out of that. We're going to run it. Then we're going to refine that a little bit.
And, well, over three or four steps, we're going to refine it into something more full-featured and fully fleshed out that looks more like what you would expect from a container image that actually does things. I'm going to start with a simple case and work my way up.

So what are you going to supply when you're building a container image? Three things, really. You need to supply the root file system, which is what the processes running inside of the container are going to see. You need to supply a configuration that tells the container engine how to run it: things like the environment variables to set and the actual command to invoke when you want to start the container. And the third thing is the manifest. That's a detail of the image format and how things are moved around. I'm going to go through them in that particular order because each one of them successively contains information about the one that you created before it.

So let's start with, oops, I forgot about this one. One thing I'm not going to do is the work of pushing the image to a registry myself, because we already have skopeo for that, and it's a great tool. So the output of our particular build process will just be some files on the local disk that we're going to transfer over to the registry. Skopeo actually does quite a few other things for us, but that's what I'm going to be using it for here.

So let's get to it. Let's create a layer. This is going to take a little while. I'm going to use a directory named root. Oops. That's a subsequent version. OK. Let's make a directory named root. We're going to call that our root file system. Create a directory inside of it named bin. It's not really required, but let's do that. And I want to run find inside of my container, so I'll just grab a copy of find that I already have on my system and copy it in there. And let's see if we can run it under chroot. That's not going to work. All right.
My copy of find, there we go, is a dynamically linked object. So we need the runtime linker to be present in the chrooted environment in order to run it. So what was the name of the file that we needed? There it is. We need to make the directory first. Now let's try running it again. Right. It needs shared libraries. Well, which shared libraries does it need? bin/find. It needs several shared libraries. And at this point, I'm going to give up on this particular exercise because that's starting to look like a lot of work.

So let's find a statically linked binary to use instead. Let's search /usr/bin for statically linked binaries. None of those are particularly good examples for telling me what's going on inside of a container. Let's check /usr/sbin. Oh, busybox. Of course it was going to be busybox. Those of you who predicted it was going to be busybox, pat yourselves on the back. So, start over again. Now, the fact that I'm using a directory named bin to hold it is merely personal preference and isn't strictly required. OK, we can run that inside of a chroot.

So now let's turn it into a file system layer. File system layers are extracted relative to the current directory, so let's just create one: tar cf ../layer.tar, current directory. Now we have a layer.

So the next thing we need to do is create a configuration. This is an example configuration. Oh, oops. Hopefully it shows up well. I'm actually omitting some of the fields that can be left empty by default just so that it will fit on the screen here. I'm going to edit a proper one in a minute. As you can see, there is some bookkeeping information. The created field doesn't really tend to get used by anything. There's information about what architecture and OS you need in order to run the container, because again, it's not a fully virtualized environment, so it doesn't come with its own kernel.
Then there's the basic stuff you would expect: environment variables, the username, well, the user ID to run the command as, and which command to actually run. And the history is part of how we built it. The rootfs and the history match the set of layers that we're using. And since, for our example, we're only using one file system layer, each of these things, which can be an array, is really just one thing.

So let's go to the command line. I already created a template which contains most of the stuff in empty form, so I'll just edit that really quickly. Layers represented in the configuration file are identified using the SHA sum of the layer contents, so we need this information. OK, and this goes here. I already filled in a history entry, because, well, why not? I can leave this alone. I added a volume. We don't actually need a volume, so we'll just delete that. This is the command we're going to use. It's a little bit more elaborate than just busybox find, because I wanted it to show us all the information about the files that we're finding. This, well, we're not going to use that. We don't need that. OK. Now we have a working config. You'll just have to take my word on it until we actually try it for real. And like it says, it's not like filling out a survey.

The next thing we need is a manifest. The manifest needs to list both the configuration and every layer blob, again using their digests. And this time it also needs to include their sizes. This is actually a very simple one, which would work quite correctly for our image, except for the fact that these values are all made up: the SHA sums are truncated, and the sizes are completely wrong. So let's create one of our own. This actually all fits on the screen at once, so let's do that. Let's grab the configuration. It's 1071 bytes, and this is the digest. Paste the digest. That's that. We again need the SHA sum of our layer, which goes here.
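As a rough sketch of the two JSON documents described so far, the Python below builds a one-layer config and manifest. It assumes the OCI image format (the media type strings are the ones that format defines). The helper names, the `PATH` value, the `amd64` architecture, the history text, and the example command are illustrative assumptions, not what was typed in the demo.

```python
import hashlib
import io
import json
import tarfile


def digest(blob):
    """Return the sha256 digest string used to identify a blob."""
    return "sha256:" + hashlib.sha256(blob).hexdigest()


def make_layer(files):
    """Build an uncompressed tar layer from a {path: bytes} mapping."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as tar:
        for path, content in sorted(files.items()):
            info = tarfile.TarInfo(path)
            info.size = len(content)
            info.mode = 0o755
            tar.addfile(info, io.BytesIO(content))
    return buf.getvalue()


def make_config(layer, cmd):
    """Config blob: how to run the image, plus the digest of each layer."""
    config = {
        "architecture": "amd64",  # assumption: the build machine's platform
        "os": "linux",
        "config": {"Env": ["PATH=/bin"], "User": "0", "Cmd": cmd},
        # One layer, so each of these arrays has exactly one entry.
        "rootfs": {"type": "layers", "diff_ids": [digest(layer)]},
        "history": [{"created_by": "assembled by hand"}],
    }
    return json.dumps(config).encode()


def make_manifest(config_blob, layer):
    """Manifest: digest and size of the config and of every layer blob."""
    manifest = {
        "schemaVersion": 2,
        "config": {
            "mediaType": "application/vnd.oci.image.config.v1+json",
            "digest": digest(config_blob),
            "size": len(config_blob),
        },
        "layers": [
            {
                # ".tar" with no "+gzip" suffix marks an uncompressed layer.
                "mediaType": "application/vnd.oci.image.layer.v1.tar",
                "digest": digest(layer),
                "size": len(layer),
            }
        ],
    }
    return json.dumps(manifest).encode()


# Placeholder bytes stand in for the real busybox binary.
layer_blob = make_layer({"bin/busybox": b"placeholder"})
config_blob = make_config(layer_blob, ["/bin/busybox", "find", "/"])
manifest_blob = make_manifest(config_blob, layer_blob)
print(manifest_blob.decode())
```

Because the manifest embeds the config's digest and the config embeds the layer's digest, any change to the layer forces both documents to be regenerated.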
The media type is the type of data that it is, and this one just says that it's not a compressed tarball, because we can do compression; that just makes things a little bit more complicated for us. OK, I just need this again. I probably could have remembered that. Probably not. Wait, it's the size I needed. There we go. I'll paste that here. That's everything I needed in order to copy that up to a registry.

So one of the things we're going to do is tell skopeo to copy it. We're going to use the current directory as a source. And one of the things that skopeo does when you're using a directory is assume that everything except the manifest is named using the SHA sum of its contents. So I need to create a couple of symbolic links so that skopeo will be able to find things. Now, let's see, skopeo copy. What should I name this image? Anyone? What? OK. Busybox. But I might already have an image named busybox in my registry. You wouldn't know. So let's go in it in, which is my brother's name. I didn't see who suggested that, but let's go with that, because I know I didn't do that. Anyway, I'm running a registry on my local machine. Slash library slash. Right, skopeo does not like the fact that I didn't bother setting up SSL. Right, let's use --dest-tls-verify=false. OK, it got copied up to my registry.

Now let's do it this way, because we're familiar with this. Dan, I'm pretty sure you're giving me the stink eye right now, but let's not go there. Now add a tag. Do I need to add a tag? This was working yesterday. This is very annoying. OK, forget the registry. We'll just copy it directly into the daemon. The destination looks like this. OK, so let's run that. Right, I made a mistake here. Because my root file system doesn't contain an /etc/passwd file, the fact that I specified the user to run everything as by name meant that it couldn't be resolved to an ID. So let's change that. We need to update these things in the manifest.
And 65 is the new size. Everything else is still correct. Oh, right, I didn't create the symbolic link from the new digest to the name. Success: I'm now running a copy of busybox in a container image that I just built without much in the way of tooling.

Oh, I see. There's a problem here in that this file should be owned by root, but it belongs to me because the copy that I used belongs to me. So I'll digress for a second. Those of you who saw Sally and Urvashi's talk earlier today, they discussed user namespaces, and that's how I'm going to get around this one, actually. Yeah. Oh, no, I own the files. That's where I'm supposed to be. OK, so the problem we ran into is that when you tar something up as yourself, you own the files, usually. Well, it's usually content that you own. And I want the contents of the layer to look like they belong to root.

Now, one of the things you can do with unshare, which is a useful piece of container technology, is create a user namespace. It's essentially launching a new process inside of a new process tree. For everything inside of that process tree, there's a mapping: ranges of UIDs and GIDs inside the namespace map to different ranges of UIDs and GIDs outside of the namespace. One of the cool things about user namespaces is that you don't need much in the way of privileges to create a new one. An unprivileged user can create a new user namespace and then map their own UID to UID 0 inside of the namespace. They'll still have limited privileges on the system at large, but everything inside of it will think that that user is root. So we're going to do that. In fact, I'm going to do it the cheapest way possible. We need unshare -Ur. So let's take a look at what we've got here. /proc/self/uid_map actually shows us what's going on here.
This shows us that UID 0 in the namespace is the beginning of a range that's being mapped to a range starting with my UID, 2510, outside of the namespace. But the range only has one thing in it, which is fine because we only have one file that we want to map. So if I look at my root contents, it looks like it's owned by root. It's still actually owned by me, but the user namespace is causing everything inside the namespace to see things that belong to me as if they belong to root. Things that are not mapped, because, again, I only mapped my own ID, look a little weird there. Unmapped values are mapped to specific magical values set as sysctls that are configured in the kernel at runtime. But we'll get to that in a minute.

So I need to recreate my layer with the new contents. Now I need to update. This is going to take a while. The digest of my layer is different now, which means the copy of the digest I keep in the config needs to be updated, and the copy that I keep in the manifest also needs to be updated. How big is the file? The file's the same size, OK? And then, right, the digest of the configuration just changed too. Live demos, everybody. I know you love them. OK. Oh, I forgot to name files. Am I in the right shell? Yeah. Oh, that's why it didn't work. I forgot to exit my namespace. OK, there we go. Right, the blob is, I didn't create the symbolic links to my updated layer. Copy this. What did I miss? The one that is missing is the layer, which is no longer that. There we go. Did I name this thing? OK. Right. OK, now if I run it, ah, there we go. Everything in the image and the container that's based on it now appears to be owned by UID 0. Fun, yes? Yes, you can see you're all laughing.

OK, now one of the fun things about a user namespace is that when I'm UID 0, I don't have to settle for just copying one file at a time. I can go ahead and use the whole package manager. So let's lock them in later.
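A user namespace is one way to get root-owned entries into the layer. A different technique, not the one used in the demo, is to rewrite the ownership while the tarball is being written. The sketch below does that with Python's standard `tarfile` module. `root_owned` and `build_layer` are hypothetical helper names, and the file written into the rootfs is a placeholder rather than a real binary. Unlike the namespace approach, this flattens every entry to root, so it only suits a tree like the single-binary one above, where everything is supposed to belong to root.

```python
import hashlib
import os
import tarfile
import tempfile


def root_owned(info):
    """tarfile filter: record every entry as owned by root (0:0)."""
    info.uid = info.gid = 0
    info.uname = info.gname = "root"
    return info


def build_layer(rootfs_dir, layer_path):
    """Archive rootfs_dir so that paths are relative to the rootfs itself.

    Returns the sha256 digest of the resulting uncompressed tarball.
    """
    with tarfile.open(layer_path, "w") as tar:
        tar.add(rootfs_dir, arcname=".", filter=root_owned)
    with open(layer_path, "rb") as f:
        return "sha256:" + hashlib.sha256(f.read()).hexdigest()


with tempfile.TemporaryDirectory() as tmp:
    # Recreate the tiny rootfs from the demo with a stand-in file.
    rootfs = os.path.join(tmp, "root")
    os.makedirs(os.path.join(rootfs, "bin"))
    with open(os.path.join(rootfs, "bin", "busybox"), "wb") as f:
        f.write(b"placeholder, not a real binary")

    layer_path = os.path.join(tmp, "layer.tar")
    layer_digest = build_layer(rootfs, layer_path)

    # Read the archive back to see what ownership was recorded.
    with tarfile.open(layer_path) as tar:
        owners = [(m.name, m.uid, m.gid) for m in tar.getmembers()]

for name, uid, gid in owners:
    print(name, uid, gid)
print(layer_digest)
```

Either way, the tar headers change, so the layer digest changes, and the config and manifest have to be updated to match.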
So let's try and, yeah, use this as my shortcut cheat sheet. Actually, yeah, that'll be good enough. Oh, whatever, good. Oh, yes, releasever. Yes, I'm using nogpgcheck. This is, again, a talk about bad ways to do things. So that's some more about that. Oh, Wi-Fi is going much better today than it was yesterday. You're right. So I'm installing a simple Python script that actually depends on the Python interpreter, which depends on libc. And wow, this is taking longer than I expected. I just got the 10 minute warning. So actually, we're doing fine. Tick, tick, tick, tick, tick. Come on. Right, OK. So this is where I need to fill time. Anybody here from Indiana? Anyone in the audience? No? OK, I've just been obsessed with that since this morning.

One of the annoying things about installing things into the chroot is that we do need to pull down a fresh copy of the metadata, which gets stored in the chroot environment. We're not going to bother cleaning that out, because I'm going to use it a bit later, because I fully expect this to fail at some point. Now, I did have a /bin directory, and that's probably going to break something. But that's not what I expect to break. Right, langpacks, got it. That's a fairly large piece of it. OK, that's an error. That's more errors. So many errors. Transaction failed. That's a problem.

So I already know, but does anyone else know why the transaction failed? Well, that's one good reason. The other one, and the main problem, is that, unlike the simple case, when I created a namespace, the only UID and GID that I mapped, the only ones that are known to the system and allowed to own new files that you create, are 0. And not every file in the distribution is owned by root. So a lot of this is actually just that the user couldn't be given ownership of anything, in particular this one. Oh, yeah. So let's try that again. We can actually use unshare to not.
Well, we can tell unshare not to map the things, and then we can use newuidmap and newgidmap, which are tools that were introduced in the newer versions of shadow-utils, to go ahead and give us access to things that we didn't already have.

Let me back up a second. Give me that look. OK. Normally, when you run unshare as an unprivileged user, you're only allowed to map your own ID. That way, you can't map somebody else's ID into the namespace that you're creating and start fooling around with their stuff. Because you are UID 0 in there, you could start doing chown and start deleting things. That isn't allowed, because it's generally unsafe. If you could map the root ID into your space, then all kinds of crazy stuff would happen, and that's also not allowed.

Starting with shadow-utils, I'm going to say 4.2, one of the things that happens by default when you create a user with useradd is that it allocates an entire range of previously unused UIDs in the /etc/subuid file. Actually, it does that for every user that gets created, and mine is in here somewhere, yep, and there's one in subuid. The whole point of this is that it notionally sets aside a whole range of UIDs that are only going to be available for use by me, and that are authorized for use by me. It also includes a setuid tool that I can use to set up a UID map in a user namespace that lets me map in at least the things that are in this range.

So we're going to change things around a little bit. We're going to use unshare again, but we're not going to tell it to initialize the UID map. There is nothing there. So we'll go to a different shell and use newuidmap and newgidmap to set the mappings for my new shell. 22520, 22520. Start by mapping my ID to root, okay, a range of one, then let's map starting at one in the container to 2200544, and the whole thing, okay.
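To make the two-range mapping concrete, here is a small sketch of how an ID map with more than one entry translates IDs. Each entry has the three columns the kernel shows in `/proc/self/uid_map`: the first ID inside the namespace, the first ID outside, and the length of the range. The numbers (`1000`, `100000`, `65536`) are made-up examples, not the values from the demo, and `parse_id_map` and `to_host` are hypothetical helper names.

```python
def parse_id_map(text):
    """Parse uid_map/gid_map text into (inside, outside, length) tuples."""
    entries = []
    for line in text.splitlines():
        if line.strip():
            inside, outside, length = (int(field) for field in line.split())
            entries.append((inside, outside, length))
    return entries


def to_host(entries, inside_id):
    """Translate an ID seen inside the namespace to the ID outside it.

    Returns None when the ID falls in no mapped range.
    """
    for inside, outside, length in entries:
        if inside <= inside_id < inside + length:
            return outside + (inside_id - inside)
    return None


# Assumed example: the user's own UID is 1000 and their subordinate
# range starts at 100000 and holds 65536 IDs.
EXAMPLE_MAP = """
         0       1000          1
         1     100000      65536
"""

id_map = parse_id_map(EXAMPLE_MAP)
for inside_id in (0, 1, 999, 65536, 65537):
    print(inside_id, "->", to_host(id_map, inside_id))
```

An ID with no entry has no equivalent outside the namespace, which is why the first package install failed while only UID 0 was mapped.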
Now this should be, aha, I've set up mappings. Now let's double check the new GID. Yep, same, so newgidmap will set up that. Or that again, I really need to cut that down. Okay, that should be a little faster.

So what I'm doing here is the exact same thing I was doing earlier, installing an entire root file system to run one program. That's fine, but we're doing it inside of a namespace where the set of available and recognized IDs runs all the way from zero to 65,537, because the first one's actually me, followed by an entire range. And that should be enough for us to create this image. So far, in practice, I haven't seen images that had contents in them owned by UIDs higher than about 1000, so you could actually cut this down significantly, or take that 65K range and slice it up into about 64 of them, and that would still work. And they could be completely unrelated sets.

So hopefully this should go a little faster than the last run, if only because I'm scanning fewer repositories. Now, this is not the most optimal way to install it, because there are ways to do smaller installations. I'm still installing recommended package dependencies. I'm still installing all the langpacks for glibc. Those are things that you would want to not do. I'm also not gonna bother cleaning up the DNF metadata that I just downloaded about the Fedora repository. That's also gonna take up space in my image, but I don't care because it's local disk and local disk is free, sort of. That's just a warning, not an error. Keep going, come on, getting closer. All right, come on, come on. And the transaction succeeded.

So let's create a new layer out of this, same as we did before. Before I forget to get out of that shell, I should really do that. Okay, so the diff ID for the new layer is this. And in fact, I'm gonna change the command I invoke because I just installed a new one. There we go. Probably let me update the symbolic links. Configuration is much smaller.
That's about as big as the new layer. And the new layer is significantly bigger than it was. That's fine, it's not compressed. And also I didn't bother cleaning up a lot of space, so let's copy this. It'll be much slower to upload, but that's fine. And let's run it. There we go. Run the command in a container, everything's fine.

Now if you're thinking that was pretty easy, yeah, it is. I did purposely skip over some of the things that make this complicated, though, and now I'm gonna go over some of what those are. First things first: while I was installing packages, the post scripts being run by RPM underneath DNF all executed as UID zero in the namespace. Now if you remember, I set up myself as UID zero in the namespace, so those commands, if they had broken out of the chroot that DNF set up, would have had free rein over the system as my UID. That's generally a bad idea for me because I like having my stuff not messed around with by other people. If I were building it as root, that would be an even bigger problem. Normally the container build tools that you see out there in the world, well, image build tools, will use a proper container. They'll do seccomp filtering. They will set up control groups to limit the set of resources that can be consumed by this one process to avoid messing with the system. So what I just did back there was sort of a bad idea, but it kind of worked.

I also didn't have to deal with multiple image formats. While I was using the OCI format for configuration blobs and manifests, there are actually three different formats, two of which are very similar to each other. Each of them unfortunately includes information that the other ones don't. So if you need a field that is specific to a particular version, you need to be able to write that exact format of file. Skopeo, in addition to being awesome at copying things around, will also handle format conversions for you automatically.
Also, if the registry requires that you compress layers, it'll do that for you too. It's pretty sweet. And oh yeah, way back in the beginning of the talk I said I wasn't gonna be creating file system layers, and I didn't, because that's a much harder problem to do from the command line. We're not gonna be able to build up root file systems or generate the differences between them in a way that we can do for layers. So what I've essentially done here is the equivalent of a squashed image, and it's not gonna be able to do anything more complicated than that.

So how do you overcome these limitations? Oh yeah, I forgot repeatability. While a shell script is great for screwing around and messing around on your system and doing something ad hoc, it doesn't help you rebuild it again later unless you remember the exact sequence of steps you ran before. The big one there is that a shell script is very hard to document. The format that people tend to use for expressing how to build an image is the Dockerfile. So it's still highly desirable to be able to support Dockerfiles if you're gonna be creating a build tool. And I'm gonna do it on time. Oh, 30 minutes more. Right, okay, so time for a speed run.

So we use tools. This is a quick survey of tools for building images that are out there and that are known to me, at least; there are probably a lot more that I don't know about. Microsoft's Azure cloud platform includes a container registry which currently includes a container building feature that's in preview status. It handles Dockerfiles, as you can all read. Buildah is the one that I work on. I happen to like it. It also supports Dockerfiles. It currently runs in enclosed mode. Like a lot of other tools, it's going to use runc if it has to handle RUN instructions in a Dockerfile. BuildKit is the newer bit that was spun out of Docker. It's actually pretty cool.
It uses a lower level representation, and it handles Dockerfiles. Well, one of its examples is a front end that takes Dockerfiles and reworks them into a lower level syntax that is actually run by BuildKit. docker build is the OG builder. Everyone's pretty familiar with how it works. Google Cloud Build is a fun set of tools. I wasn't actually able to get all of it running on my local machine, but you can run it locally, which I think is a very positive thing. Jessie Frazelle's img is a pretty cool project which handles a lot of this. It fills in the blanks in BuildKit and also handles Dockerfiles. It also runs unprivileged.

Kaniko is a really interesting one that is built to run inside of a container. It makes certain assumptions because it is running inside of a container, and it's hard to start a container inside of a container. So one of the things it does is just use the container it's running in as its working space. If you're doing FROM an image, it actually unpacks that into the container that it's running in. So it doesn't have to launch a container to run the commands, because it already expects to be running in a container. It just executes them directly, which is pretty cool.

And these are some of the references I went through. Here's where you can find all of this stuff. Now, ooh, sorry about that. Didn't really time this well. Yeah, any questions? Yeah. Any comments or opinions disguised as questions? Okay, one last thing. I want you all to check under your seats and make sure you didn't drop anything. Okay. Thanks a lot, everyone. Thank you. Thank you.

Just a couple of words before you guys leave. Tomorrow at 9:30, there's a keynote. Please plan to attend that. The keynote speaker is Chris Wright, who is the CTO of Red Hat. And if you haven't got party tickets, I think there are some available for tonight's party. Stop by the front desk. Thank you. Oh, thanks.