I think you're all coming in. Cool. Can everyone hear me? I am so mic'd. All right. I feel like I'm ready to do a Super Bowl show. All right. So this talk is about the modern Linux server and containers. Really, what I'm trying to do with this talk is decompose the technologies that make containers possible. Containers have been, A, a very active point of discussion within the technical community recently, and B, completely misunderstood. So I thought if we decompose the parts that make it all possible, we might all gain a better understanding of what's actually going on under the hood and how it's useful. So who I think you are, as my audience, are engineers who want to learn a little bit about namespaces in the kernel and the low-level pieces of technology, like cgroups, that make containers possible. And you possibly like turtles. So let's walk through what the major pieces of this talk are going to be, and we'll dive right in. First, we'll do a high-level system design overview, going from hypervisor virtualization down to application containers. Then we'll talk about namespaces, which are the fundamental piece of isolation technology in containers. Then we'll talk about cgroups, which are the accounting piece of containers. And then we'll look at some of the cool tools you can use to introspect and build stuff. So this is a super 10,000-foot diagram of how you can think about these different types of virtualization and isolation of applications. At the very top, the heaviest-weight thing we have is a hypervisor. I'm sure all of you are familiar with hypervisors, so we won't go into that too much. Lighter weight is a container. And a container is like a virtual machine, only you have a shared kernel. So instead of emulating all the hardware (a block device, CPU, et cetera), you have a shared kernel and you just launch /sbin/init.
And then the lightest-weight piece of isolation that we now get with all this namespace and cgroup machinery is what's called an application container. There's a company in San Francisco called Pantheon that's been using these quite heavily. What they do is isolate each PHP web worker for their huge Drupal hosting product. They isolate each individual one so that it has an isolated environment, but it believes it's on a full working machine. So we're going to look at a few diagrams, and I want to give you a quick warning about two things. First, they're going to be a little dense, and second, they're hand drawn, because I got fed up with my vector diagramming tool. So this is about the level of skill that you're going to see. This is a turtle, and the thing is that these are going to be really recursive, terrible drawings. So it's going to be turtles on turtles. And in some cases it's going to get really bad later in the talk, because it's going to be a tree of turtles. So yeah, just bear with me; we'll try to get through this quickly. So, the system design. This is a classic hypervisor diagram. It's classic because I know you've all thought of it. It's hardware, then Linux, and then in this case KVM is the hypervisor, sort of. You run a KVM process and then Linux inside it. So you're running a full Linux kernel inside of a Linux kernel; you've got a stack of turtles going. What the hypervisor is providing is a full hardware environment: block devices, ethernet devices. That's what you're pushing into the hypervisor, and the guests are running a full kernel. A container looks a little bit different: we slice off one whole level of kernels and turtles. You end up with hardware, Linux, and then just regular processes that are containers. And inside each of these containers is /sbin/init and the full stack that you'd expect.
So you can have your cron daemon running and your syslog daemon running inside this container, next to another container that has syslog and cron and all the usual stuff going. In this world, the host provides the kernel to you. You don't get an option; you're sharing the kernel with everybody else. And instead of getting block devices and ethernet devices, like actual ethernet devices, you end up getting file systems and network interfaces, et cetera. They're already there for you. You don't have any device drivers. When you come up, /sbin/init comes up and goes: oh, I have root already mounted for me. How convenient. Oh, I already have a network interface. This is wonderful. And that's how it goes: you just start from /sbin/init. And then finally there's application isolation, which I mentioned Pantheon does, and a lot of other companies do, where they use these same fundamental pieces that make containers possible, but instead of launching /sbin/init, they launch /usr/bin/php with a bunch of arguments. This is a lot lighter weight, because you don't end up having cron and syslog and all your other stuff running. You just have your application, but it's running in its own root file system. So instead of launching all that extra stuff that's eating up memory and CPU time, you're just launching PHP, just that core piece of functionality you need. So that's what I'm calling an application container. And really, a thing that's made containers more complicated than they should be is that we don't have good vocabulary around this yet. So there's container, application container. Anyway, if you have better ideas, I'm sure the community at large would appreciate them. So in the application container, it's just like the other one: the host provides the kernel. And in most cases, you're not going to get a full ethernet device anymore.
You may just get a file descriptor or something, but I'll describe that later. And it starts from your application, not init. So the first big piece of technology that makes these Linux containers possible is namespaces. And earlier today I was having a really hard time getting to flickr.com; for some reason they're down. So imagine a really cool medieval castle photo here. Yeah, breathtaking, isn't it? Perhaps the fog's rolling in. Yeah? Okay, got you there. And so with that beautiful metaphor that I have on the screen here, what I'm thinking is that namespaces are sort of like a medieval castle, where you have the outside ring, which everyone's allowed access to, and then you create little compartments. So if an attacking aggressor comes in, you have these isolated little pieces of the castle to protect people. And so the first piece of namespaces is (hey, Eric, do you have a clock or something? Because my clock never started. Okay, thanks.) file systems. So obviously it's like a chroot. If you've ever used a chroot, the first step to isolating something is that it needs to not look like it's running on the same / as the host environment. So the file system namespaces are really important. The big pieces of namespacing of file systems that have happened over the last few years are the read-only bind mount, the shared bind mount, the slave bind mount, and the private bind mount. And building these tools together, you're able to build up a root file system for a container that is only able to look at particular pieces of the host file system. The first really important one is obviously the read-only one. And as an example of what this is used for: Fedora 19 has a bunch of container tools in it, and what they do is give you the option of mounting /usr, which is where all the libraries and utilities and everything live in Fedora now, just /usr.
And they allow you to bind mount that read-only into your container. So now, all of a sudden, in your container, in this isolated environment, just like a virtual machine, you have the exact same tools and utilities as your host, for free, inside of your container, because you made a read-only bind mount right in there. And so it's like sharing a read-only block device, which some hypervisors allow you to do, but it's really, really lightweight, and you're able to share stuff like the inode cache for free. The next piece is the private bind mount, which is pretty straightforward if you think about it. What the private bind mount allows you to do is say: I'm going to bind these two subtrees together, but if I mount new things under this subtree, they don't appear under that subtree, and vice versa; if I mount things here, they don't appear up there. And so that allows you to mount a piece of your file system into the container, but subsequent mounts under that bind don't show up willy-nilly within your other containers. So you'll see here that I set up a regular bind mount with --bind. Then, in the source of the bind mount, I create a tmpfs, and it doesn't show up under the target bind directory. And vice versa: in the target I create a mount, and it doesn't show up in the source. So we've completely isolated the ability to mount subtrees with these bind mounts. The next piece is the shared bind mount, which is sort of the opposite of that: everything's shared by default. And then the last is the slave bind mount, where only the source is allowed to share things to the target, and the target doesn't share anything back. So with these fundamental pieces, you're able to build a container file system. You're able to create a private, read-only mounted namespace and maybe optionally share things into it.
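The private bind mount experiment just described can be sketched as a small script. This is a hedged illustration, not the speaker's actual demo: the paths are invented, and the mount(8) calls need root (CAP_SYS_ADMIN), so without that the function just says so and returns.

```shell
# Sketch of the private bind mount demo; paths are invented for illustration.
demo_private_bind() {
  mkdir -p /tmp/bindsrc/sub /tmp/bindtgt
  if ! mount --bind /tmp/bindsrc /tmp/bindtgt 2>/dev/null; then
    echo "skipping: need root for mount(8)"
    return 0
  fi
  # On systemd hosts mounts default to shared propagation, so mark the
  # target private to get the isolation behavior described in the talk.
  mount --make-private /tmp/bindtgt
  mount -t tmpfs none /tmp/bindsrc/sub
  # The tmpfs shows up under the source...
  mount | grep /tmp/bindsrc/sub
  # ...but /tmp/bindtgt/sub is still just the original empty directory.
  ls /tmp/bindtgt/sub
  umount /tmp/bindsrc/sub
  umount /tmp/bindtgt
}
demo_private_bind
```

Swap `--make-private` for `--make-shared` or `--make-slave` on the target and you get the other two propagation behaviors the talk walks through.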
Those shared things may be user data or databases or that sort of thing. Yes? [Audience: If you're running in the host, I didn't quite follow what you're saying.] Yeah, yeah, that's true. Yes. OK, so some common patterns that come out of this functionality are things like mounting a read-only /usr inside a container, like I talked about Fedora doing; or giving a completely private /tmp directory to a container, so services don't end up sharing /tmp at all; or, like I said, sharing data across bind mounts. The next piece of the puzzle is networking. So you've laid out a root file system; now, how do you hook up networking to this thing? There are essentially three different ways this all gets used in practice. Either you share the root namespace into the container; or you set up a bridge, so the container has a private network interface but gets bridged to the host network and then usually NATed out, and if you need to expose some port, like port 80, you set up iptables to the bridge; or, finally, a private namespace, completely isolated, no bridge or anything, and you use socket activation, which I'll explain later, to actually get some sort of network file descriptor into the service. We can actually look at how that works; that might be illustrative. So one of the container management tools that's come out recently is called Docker, and it uses the bridge method. So if I get in here and type out a demo... OK, so inside the container (oops), inside the container, we have an eth0, like you'd expect from a virtual machine or a regular machine, and it has an address ending in .0.2. And then outside the container, it shows up as this veth device with some random numbers in its name. And if you look at the bridge it's attached to (you can think of the bridge as a layer 2 switch), you see that it's connected to a bridge that then routes you to the internet.
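The veth-and-bridge plumbing just shown can be sketched by hand with iproute2. Everything here is invented for illustration (the namespace name, bridge name, and address are not what Docker actually uses), it needs root and CAP_NET_ADMIN, and it cleans up after itself; otherwise it skips.

```shell
# Rough sketch of the bridge wiring a tool like Docker sets up.
demo_bridge() {
  if ! ip netns add demo-ns 2>/dev/null; then
    echo "skipping: need root and CAP_NET_ADMIN"
    return 0
  fi
  # A veth pair is a virtual patch cable: one end stays in the host,
  # the other is pushed into the container's network namespace.
  ip link add veth-host type veth peer name veth-cont
  ip link set veth-cont netns demo-ns
  ip netns exec demo-ns ip addr add 10.250.9.2/24 dev veth-cont
  ip netns exec demo-ns ip link set veth-cont up
  # The host end gets plugged into a bridge (the "layer 2 switch"),
  # which is what you'd then NAT out to the real network.
  ip link add br-demo type bridge
  ip link set veth-host master br-demo
  ip link set veth-host up
  ip link set br-demo up
  echo "inside the namespace:"
  ip netns exec demo-ns ip -o addr show dev veth-cont
  # Tear it down again (deleting the netns kills the veth pair too).
  ip netns del demo-ns
  ip link del br-demo
}
demo_bridge
```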
So yeah, you have a private network namespace, but it's given access via a bridge. And then the other way of doing it is the root namespace, where essentially you're just given access to all the devices. So by default, nspawn (nspawn is another container tool, connected to systemd, that lets you run sort of a machine container) gives you full access to the host's ethernet devices. So if you want to bind to port 80 or whatever, you can. The advantages of the root namespace are that it's really, really fast and it's easy to set up, because you don't have to set up bridging or anything. All the networking looks normal to the container; it's not fancy, there's no socket activation or anything. The disadvantages are that there's no separation of concerns. The container can just turn the ethernet devices on and off, change the MAC addresses, et cetera. The container is in full control, essentially. Comparing that to network bridges: the advantage is that the network looks normal to the container, and the disadvantages are that it's more complex to get set up; you lose some speed, because you're going through a bridge; you're possibly NATing to the internet, which can make configuring some services a lot more complicated (Hadoop and friends do not enjoy being NATed, for example); and you need to use iptables to expose a public socket. So you can't just open port 80 and then suddenly get nginx traffic; you need to set up iptables via the bridge to your container. And the piece of innovation that's come out of systemd is a thing called socket activation. Who's familiar with inetd, and inetd-style services? What socket activation allows you to do is inetd-style startup of services: when somebody connects to a port, a process is started.
But the difference is that you don't need to start the process for every single client that connects to that port. systemd starts it and hands off the socket to you before it actually accepts the connection, and you're then in charge of managing the connections after that. So socket activation gives you the advantage of the inetd startup style, and you're able to fully isolate yourself from the network, because you're not listening on anything; you're only accepting this file descriptor when you get started up. And there's a little protocol where essentially they open a file descriptor for you and, via some environment variables, tell you: hey, you have a file descriptor here that you were expecting; I hope you know what to do with it. So with socket activation, the advantage is that there are no network interfaces for you to listen on, so you don't have to worry about that. Sockets are passed in either via standard in, like I was mentioning, inetd style, or via the new systemd style. So I guess I've already walked through this; I always get ahead of myself. Here are the bullet points, if you're interested in bullet points. And then the last piece of the puzzle is the process namespace. So if I'm inside my container, what the process namespace gives me is that when I run ps ax, I see that PID 1 is /bin/sh. And outside the container, if I run ps ax, that is not the case. So that's the major use case of that namespace. And then, obviously, there's a ton of other pieces that you need to virtualize and namespace within the kernel that aren't famous enough, or in the front of anybody's mind, to get mentioned here. They're things like the UTS name, System V IPC, and lots and lots of other stuff. But I'm punting, because going through all of it is just a lot of work, and file systems and networking are mostly what we think about. So I appreciate all the work of those engineers, if any of them are in the room. But it's just a lot of stuff. OK.
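The environment-variable protocol mentioned above can be sketched in a few lines of shell. LISTEN_PID and LISTEN_FDS are the real variable names systemd uses, and the passed sockets really do start at file descriptor 3, but the function itself is just an illustration; a real daemon would use sd_listen_fds() from libsystemd.

```shell
# Sketch of the systemd socket-activation handshake: systemd sets
# LISTEN_PID to the service's own PID and LISTEN_FDS to how many
# sockets it passed; the sockets themselves start at fd 3.
check_socket_activation() {
  # $1 is the PID the variables should be addressed to (normally $$).
  if [ "${LISTEN_PID:-}" = "$1" ] && [ "${LISTEN_FDS:-0}" -ge 1 ]; then
    echo "activated: ${LISTEN_FDS} socket(s) starting at fd 3"
  else
    echo "not socket-activated"
  fi
}

# Run outside systemd, this reports that no sockets were passed.
check_socket_activation "$$"
```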
So namespaces give us isolation. They allow us to fence off, to create those castle walls that you all remember. And what cgroups give us is accounting. So we're able to... well, why don't we just turn to another photo. Imagine an accountant's desk overflowing with paperwork, his hands on his head in despair. And so cgroups are like the accountant. They track all these metrics that are happening on the machine inside the container. It's not necessarily just containers; cgroups map essentially to the process namespace. But they allow the kernel to track all these things that are happening in a process subtree, and then, in some cases, put limits on what can happen. So I can say things like: I only want this subtree of the process tree to run 25% of the time, those sorts of things. So it gives you some control over essentially nicing the processes along various metrics. So yeah, block I/O is obviously a big one. The blkio cgroup allows you to do weights, from (I don't know who selected these numbers) 10 to 1,000, and the default, I believe, is 500. And then you're also able to do arbitrary bandwidth limits, so you can say: I only want five megabits per second on this device for this process group, for reads, or writes, or whatever. So it's very high-level functionality it gives you. And then, out of the cgroup, you get all sorts of metrics. Probably the most important ones for most of us are going to be the I/Os that are serviced and the I/Os that are waiting and queued, that sort of thing. The next major cgroup is CPU. Very similar to the weighting system in blkio, the CPU cgroup has a share system, and the default is 1,024.
So if you weight something at 2,048 (as far as I can tell, it's like shares in a company), it gets twice as much as a 1,024 subtree. And then there are a bunch of useful metrics that come off of this cgroup. Well, there are actually not that many metrics off the CPU cgroup, but for a subtree of processes you're able to get the user and system time that was spent there. And then the last major cgroup that's used is the memory cgroup. You're able to limit, approximately, the total amount of memory that a subtree uses, and it actually dumps out a ton of useful metrics. But for most of us, the interesting ones are probably going to be the swap metrics, RSS, and maybe the number of page-ins and page-outs. To understand all of them (probably 80 metrics come out of there), you need to know the kernel's memory management system quite well. And so those are the major cgroup categories. That's what allows you to set limits and give good quality of service to some containers, like your production containers, and then relegate the non-paying customers or the developers to no memory and horrible CPU and no access to disks. And this is really where all this came from: companies like Google wanted to run production and development machines and long-running batch jobs all next to each other on one machine. So you can use these various dimensions to limit how things get packed onto the machine. So the last bit we'll talk about, and probably the most important bit, is some of the tooling that exists around this. And what I'm going to end up doing is just dropping down into the terminal. So yell out if you see typos, because it's irritating for you and it's irritating for me. So I'll start with the first one, which is Docker. We got a little look at Docker earlier. What Docker is, is a container management tool.
And they do a few things differently. So in the last few years we've had tools like LXC and nspawn, which are really convenient if you already have a container sitting on disk. But building a container and getting it on disk is actually a fairly high bar. Most people don't know the command-line flags to debootstrap, or whatever the Fedora RPM command-line invocation is. And so Docker gives you a few convenient things. The first is that it has an index where you can just pull down arbitrary containers. So if I come in here (you see what I'm running; is that font OK for everyone? OK), I'm running docker run, then an image name, busybox, and then the actual binary inside the container I want to run. And busybox is actually part of a URL that goes off to the internet, to index.docker.io. It downloads the busybox container if it doesn't exist, and then it starts running it for me. In this case it's already cached, so I can just press Enter and I get a shell. And it doesn't feel magical unless you don't have something cached. So if I come back in here and type in base, what's going to happen is: oh, OK, I don't have the image right now; I should go off and pull the image down from the internet. And Docker does a few other things. So a container is a unit of running processes that are isolated. But also, when you're doing containers, you need to build the containers up on top of each other. A lot of development shops aren't sophisticated enough to just say: out of our continuous integration process pops a root file system that will run our application perfectly the first time. And so Docker allows you to iteratively build these layers of your file system: you go from a base, like an Ubuntu or Fedora image, and layer on PHP or MySQL or Rails or whatever, and then you can layer on the latest code from your CI system. And so it gives you a nice abstraction for creating these layers iteratively.
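As a hedged illustration of that layering (the base image, package name, and paths here are all invented, not taken from the talk), a Dockerfile expressing those layers might look like:

```dockerfile
# Hypothetical Dockerfile: each instruction becomes a cached layer,
# so the base and the PHP layer are reused while only the app changes.
FROM ubuntu
RUN apt-get update && apt-get install -y php5
ADD ./build/app /srv/app
CMD ["/usr/bin/php", "/srv/app/index.php"]
```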
And so you can interact with it more as a developer, versus a system administrator who's just laying down a root file system somewhere. So that went over the internet, downloaded the Ubuntu image, and now it's on my box and it's running. So it's a very convenient abstraction for using all these tools together. Oops, sorry. So the next really convenient tool is nspawn. systemd-nspawn comes out of systemd, and it's a really, really thin wrapper around all of this namespace and cgroup stuff. It operates very similarly to Docker, only it doesn't have all the magic around pulling things down, and it uses some of the newer APIs a little bit better. So if I go in here and... I have a little bug in the latest kernel I built. Sorry, you're going to wait for five seconds. Oh, what happened? Oh, shoot. Well, anyway, I won't try to debug that right now, but needless to say, it works. Another really cool tool is called nsenter. Now, we have these namespaces, and they're all isolated, and that's fantastic. But you won't always know exactly what's going on inside the container. And unless you're running SSH or something, you're not going to be able to just jump in there really quick, get a shell, and look at the file system to see why your application isn't running, why it's just exiting. Normally I'd have to bust out strace or something and get really frustrated. Instead, nsenter is a really nice tool that allows you to enter an existing namespace. So I can just say: I want to jump into however the namespaces look for this process ID. I'm skeptical that that PID is correct, but we'll try it. Ah, hey. So in nsenter, all those flags are the different namespaces: the process namespace, the networking namespace, and the other ones. And then it's saying: jump into the namespaces of process 683, which is an nspawn process running on my host, and run /bin/sh.
And so now I have a shell where I can debug the things that are running inside that container. So you see that init is running and stuff in here, because process 682 is an actual container that's running busybox. I just jumped into that namespace. And so you don't actually need to run SSH or something like you would on a virtual machine, or, heaven forbid, use HVM on Xen or something and get an old virtualized TTY. Instead, you just jump into the process's namespace. Super cool. Very convenient. The other major tool is actually just the cgroup hierarchy itself. The cgroup tree is an actual virtual file system that is usually mounted at /sys/fs/cgroup. It's really not recommended these days to manipulate this file system directly anymore. There are lots of mailing list threads, and there's actually a talk going on right over there, right now (if you're interested, you should leave the room immediately, because you've about missed it), about how this file system is not really built for having lots and lots of different processes managing it, updating it, and adding and removing processes from cgroups. But that said, you can go in and look at the statistics and that sort of thing coming off the cgroups. So let's see. For example, I can look at the CPU accounting statistics for the cgroups, off the scheduler. And you'll notice here that these cgroups are being managed by systemd, and they've just introduced a new concept called the slice. Anyway, that's the talk that's going on next door, but a slice is essentially a piece of a flattened cgroup hierarchy that systemd is managing. So you'll probably be seeing a lot of this sort of stuff in Fedora 20 and 21. Yeah. So /sys/fs/cgroup is where all the metrics and all the management of the cgroups we were talking about earlier actually happen. And then, yeah, systemd units. Let's see, I'll spend two more minutes and then leave time for questions.
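One unprivileged way to do this kind of introspection, as a quick sketch: every process's cgroup membership is listed in /proc, so you can see which cgroups (and, on a systemd host, which slice) you are in without touching the hierarchy.

```shell
# Each line is hierarchy-id:controllers:path for the current process;
# on a systemd host the paths include slice and unit names.
cat /proc/self/cgroup
# And the cgroup virtual filesystem itself, if it's mounted:
ls /sys/fs/cgroup 2>/dev/null || echo "/sys/fs/cgroup not mounted"
```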
systemd units have a number of directives that allow you to easily manage the cgroups and set arbitrary limits on memory or disk I/O or whatever. So I have (what do I call it? one second while I go spelunking) OK, so I have these two things called CPU eater, large and small. And they're pretty straightforward: it's the classic while-true loop that does something and throws away all the work it did. And then I set the CPU shares to 1,500 on the large CPU eater, which means it gets a larger portion of the scheduler's attention, and the small CPU eater gets 100 CPU shares. And so using these two things, I start these services (calling them services is a little much, perhaps, but these two processes are useful in their own way and special). So if I do top right now, just plain top, you'll see that the one eating up 60% of my CPU is, I'm assuming, the one that's given a lot more CPU shares than the one eating up 4% of my CPU. And they're running the exact same while loop. So that's an illustration of how the cgroups actually work at limiting and pulling back the amount of resources an individual process is using on your machine. And the same thing can be done for disks, too. I have a disk eater task, and (oops) the disk eater task is just dd-ing files. Ironically, this doesn't illustrate anything very well, because I have, A, very little room on this virtual machine, and B, an SSD. So the processes exit really fast, and I didn't have time to write some C code for this. But the large one has a block I/O weight of 1,000, which means it gets a lot of access to the disks. And the small one, which I guarantee you is slower, but can't illustrate, so you're going to have to take my word for it, has a block I/O weight of 100, and so it goes much, much slower at writing to disk. So you can manage that.
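A hypothetical unit in the spirit of the demo's large CPU eater might look like the following; the name, command, and numbers are invented for illustration, and CPUShares= and BlockIOWeight= are the systemd directives of that era (later renamed CPUWeight= and IOWeight=).

```ini
# cpu-eater-large.service (illustrative, not the speaker's actual file)
[Unit]
Description=Large CPU eater

[Service]
# The classic busy loop that throws away all its work.
ExecStart=/bin/sh -c 'while true; do true; done'
# Relative to the default of 1024 shares, this subtree gets more CPU.
CPUShares=1500
# Relative blkio weight, range 10-1000.
BlockIOWeight=1000
```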
And then the last cool tool is an improved top that is cgroup-aware and ships with systemd, called systemd-cgtop. Right now it's sorted by CPU, and you can see it shows the cgroup hierarchy, the number of tasks in each cgroup, and which ones are taking up the most CPU. And you can also sort by memory and I/O, that sort of thing. So for all the cgroups we talked about, you can sort by them and actually figure out which tasks in the hierarchy are using up more of the machine's resources. So, recap. Yes, we made it; we have four minutes. All right, recap. Containers as they are today are built on namespaces and cgroups. Namespaces provide the isolation. It's similar to hypervisors, but it's the kernel that's doing the isolation instead of all this virtual hardware. cgroups provide the resource limiting and the accounting. So if you want to draw pretty graphs, or isolate prod from dev on the same machine, the cgroups mechanism is what does that. And these tools can be mixed together to create useful hybrids, like Pantheon has done with their application containers for all their PHP stuff. The future is happening next door; you just missed it, sorry. And that's it. Questions? Oh, OK, I'll hand it over to Brandon to choose people. Don't embarrass me here. Yes. I have a very specific question. So as far as I understand, with the namespaces and containers, you can take the local users that are defined in the passwd file and create multiple different sets of local users. But with centrally managed users that are resolved from a central directory, or directories of different sources, there is no way to partition different identity sources to different namespaces right now. Yeah, I believe you're correct. And I think a lot of it has to do with the fact that there's no way of namespacing that in the file system right now. OK, that makes sense. OK, yeah.
But that would be remappable, right, with the central users? So you'd have to rewrite /etc/passwd effectively, right? So, OK. That would be very interesting to actually implement, because it sounds like it could be implemented, right? Well, here's the problem that I'm trying to solve. The problem is I have different central identity sources, and I can give them to the box. How do I manage different sources for different subsets of the containers? How do I define which identity sources need to be exposed and remapped to different containers? Cool. All right, well, thanks.