[Pre-talk setup in Czech/Slovak: adjusting monitors, displays and SSH connectivity; largely unintelligible in the recording.]

You have probably noticed that at conferences these days something like 90% of the container talks are about Docker. So we thought we would do a bit of container archaeology instead and look at the technologies that came before it and around it, across the Unix world: BSD jails, Solaris Zones, and then what Linux has. We both work as software engineers, mostly on systemd and udev.

…so, let's keep to the agenda. First question: what is a container? Does anyone have a definition? I guess everyone has some definition. I'll take the stage and tell you what I assume a container means, as a baseline for the rest of this talk. My take is that it is an application running in a virtualized environment — an isolated world created by some virtualization technology. It is not virtualization in the full virtual-machine sense; in this sense it is lightweight.
It is not a precisely defined term, but roughly: an application runs in an isolated environment created by some virtualization technology, and unlike a full virtual machine, no hardware is emulated and no separate kernel is booted — which is why it is usually called lightweight virtualization. On the BSD side we have BSD jails, on Solaris there are Solaris Zones, and on Linux we have the kernel technologies — namespaces and cgroups — and the projects built on top of them, for example OpenVZ, LXC, and systemd-nspawn, which will be the focus of the second half of this talk.

Historically, it all started with chroot, back in 1979. At the time, Autotools didn't exist really. So the guy found out that using such a two-line hack would just be easier. Today you would simply build in a virtual directory — basically Autotools gives you this functionality by default, to build in whatever directory you want, without you rewriting any include paths or anything like that. But at the time Autotools didn't exist, so they just implemented chroot because it was easy, and there you have it. I mean, it all started with chroot.

Then for a couple of years — for almost 20 years — nothing happened in regard to containers. But in 2000 FreeBSD 4.0 was released, and it implemented a new syscall that was called jail. The problem they were trying to address: these were not the early days of the internet, but the internet had started to be used a lot more than before. There were new companies popping up, and the business model was to provide services for their clients, and most of these services were like: hey, here's a server, and you can run your PHP or early Ruby or early Python app on it. But we do not want to care about what version of PHP you use. We do not want to care about what version of Ruby you use. It should be up to you to install whatever software stack you want. The problem is: how do you do that? On traditional Unix, root can do everything and a normal user basically cannot do anything. So they tried to solve this problem of the omnipotent root — that the root user can do everything. Their take was to implement the jail syscall, and that syscall should allow you to create an environment in which other people can have root access but are not able to destroy your system.

So in the BSD kernel the patch for that syscall was very small, only 400 lines of code, and the small structure that is used in the BSD kernel to implement the underpinnings of the syscall is actually called prison, not jail. A process can be either in a jail or not, but once you put it in a jail, it cannot escape the jail anymore — you cannot migrate it out of the jail. Once it is in a jail, that's it, it's always in a jail. And then some limitations apply to it. Again, jails were a very small patch, 400 lines of code, so they didn't bother to come up with anything very sophisticated: they just used chroot for the file system virtualization, to provide a view of the file system to the processes in a jail. So they run them in a chroot environment, and some functionality is simply not available in a jail.
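Since chroot keeps coming up, here is a minimal, hedged sketch of what that original trick amounts to in shell. The paths and library-copying loop are purely illustrative and vary per system; run as root:

```bash
# Build a tiny root tree containing a shell plus its libraries, then chroot into it.
mkdir -p /tmp/newroot/bin
cp /bin/sh /tmp/newroot/bin/
# copy the shell's shared-library dependencies to the same paths inside the new root
for lib in $(ldd /bin/sh | grep -o '/[^ )]*'); do
    mkdir -p "/tmp/newroot$(dirname "$lib")"
    cp "$lib" "/tmp/newroot$lib"
done
chroot /tmp/newroot /bin/sh   # this shell now sees /tmp/newroot as /
```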
So those processes cannot mount file systems, they cannot create raw sockets, and things like that. Basically, over the kernel code they sprinkled additional checks wherever they thought a feature could be used to kill the system or to attack other systems on the network, and they just disallowed it. They added if-checks, like: if you are in a jail, you cannot do that; you cannot do this.

Since this is a developer conference, I will not hesitate to show you some code. Let's look at some code directly — this is FreeBSD, by the way, and this is today's implementation, not how it looked back then. So this is the struct prison that I was talking about, and here are all the members of that structure, all neatly arranged in columns. As you can see, we have some refcounts here. A prison has some internal IDs. Prisons can be nested, because one prison can have parents and siblings, so you can arrange prisons in a tree, basically. A prison can see a different number of CPUs than the machine actually has, so you can lie to the processes in a jail that there is only one CPU, for example. It has virtual networking for the most part — IP stacks, a couple of other things. It can tell you that it has a different kernel release than the host. It can have its own hostname, its own domain name. Here you can see the chroot path. A lot of — I mean, this structure contains a lot of the things that you'd expect need to be virtualized in order to provide an environment where you can give people root access, but not really.

OK, so that was a very quick tour of BSD jails. One more thing: each jail is characterized by these four main properties. A directory subtree that you chroot into. It can have its own hostname. It has its own IP address — usually an alias on some other interface, I mean a second IP address on some existing interface. And a command to run, the one you would like to run in the jail.

And I prepared a very quick demo, just to show you how it looks and how it's managed. So I will SSH to some BSD system — this is FreeBSD. And if I'm not mistaken, I should have some jails running. Yeah, as you can see, there is a jls command; it will list all the jails you have on the system. As you can see, I have two jails. The file system trees for those jails are under /usr/jails — demojail and demojail1. They do not have access to the network, basically; they only have access to the loopback device. I have a couple of IP aliases on the loopback device. So... Is it better? Yeah, a little bit. So if I want, I can create a new jail really easily with the ezjail-admin command. It's basically a shell wrapper they use in FreeBSD land to manage all these things. Just so I don't make too many mistakes — oops, I want to grep, actually. Yeah, so I can very easily create the next jail. I will give it a new IP address on that loopback device. And yeah, that's done quickly, because it's pre-populated: this is just the basic file system tree that I already have, called the base jail, and that base jail serves as a default template, basically, for all the other jails I create. That has, of course, the obvious drawback that I cannot have different versions of FreeBSD in those jails — though you can set up jails in a way that different jails, containers basically, use different bases. All my containers — these jails — are FreeBSD 10 based. So now I can start the jail.
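Roughly, the demo boils down to these commands. The jail name and loopback alias are from the demo and should be treated as placeholders; the ezjail command names are real, but take the exact arguments as a sketch:

```bash
jls                                            # list running jails
ezjail-admin create demojail2 'lo1|127.0.0.3'  # new jail, IP alias on a loopback interface
ezjail-admin start demojail2                   # start it
ezjail-admin console demojail2                 # get a shell inside the jail
```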
Yeah, starting prints a warning saying that I use some obsolete settings, but whatever. I can get a shell inside. Yeah, so I have a new shell. I can list the processes in the jail. As you can see, there are a couple of processes. I see only processes that are in the same jail as my shell; I do not see the entire system here. As you can also see, PID numbers are not virtualized: I see PID numbers from the real system, so I do not have a PID 1 here anywhere. I can try to ping something — it should fail, if I could remember what IP address I had there. So, as I said: "socket operation not permitted", because ping tries to create a raw socket. Again, I could not mount anything. But what I could do is just install some package — just install Vim. Yeah, so I didn't install anything before — "could not be found" — I would have to set up the packaging system in the jail first. But anyway, I could do it if I wanted: just install packages, Python for instance, then put my application in and actually run it. I am like a systems programmer, not really interested in Python or Ruby, so I didn't prepare any fancy demo with WordPress. And yeah, as you can tell, I didn't even know that WordPress is PHP.

OK, so let's move on to Solaris Zones, the second real containerization technology that appeared in Unix systems. Solaris Zones appeared in 2004; of course, the work at Sun was done in the early 2000s, late 1990s. It was inspired by the work the FreeBSD community was doing, but Sun was actually trying to solve a different problem. Sun back then didn't cater to internet providers or service providers — in the early 2000s those guys had already been running LAMP stacks on Linux, because it was cheap. Sun had enterprise customers; it was selling to enterprises. Sun was making boxes — machines — that were just too big and too expensive. And they were so expensive that even some governments were saying: yeah, we would buy this box, but we only run the database on it, and it has so much power — we could do so much more with it. Please give us a way to do that. So they tried to solve this problem: how to run more than one application on a single box, in a way that these applications do not really know they are running on the same box. This is the problem that Sun was solving, basically, with zones.

Also, their approach was much more methodical. They didn't just write a 400-line patch for their kernel to see whether it would work or not. They actually baked the idea of the zone — zone is the name for a container on Solaris, basically — right into the heart of the operating system. And they did it in a way that even when you boot a Solaris box, the Solaris operating system, you are already in a zone. So there is never a time when a process is not in a zone. In BSD, you can either be in a jail or not; on Solaris, you are always in a zone. The one zone that you are in by default is called the global zone, and then all the other zones — containers, if you will — are non-global zones. And they also provided tools, written in C, to provision those zones: how to install them, how to uninstall them, how to list them, and these kinds of things. So on BSD, as you saw, there was a syscall — basic support in the kernel — and then on the userspace side they provide this ezjail wrapper, but it's just a very complicated shell script. On Solaris, they baked the idea of a zone into the tools as well.
This is something that Lennart is trying to do with the systemd tools: that the systemd tools understand the concept of a container. So you can, for example, list services in a container — you can tell systemctl to list the services in a certain nspawn container. And that is actually what some people on Solaris did, too. You can list services via the svcs command, their command-line tool for interacting with the init system, in a zone. You can give it a -z argument and the name of the zone: you list services from your global zone, but it will go into the non-global zone, communicate somehow with the init system that is inside, ask it "please give me the list of services", and present it to you. So these are the main commands you use when working with zones: zoneadm and zonecfg. And yesterday I learned that it is quite easy to provision a zone on a Solaris box using Ansible — just a couple of lines of YAML with the Ansible module, and it takes care of creating the ZFS dataset for the zone, installing the zone, booting it up, and all that stuff.
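As a hedged sketch, the parallel drawn here looks roughly like this — the zone name is from the demo and the container name is a placeholder:

```bash
svcs -z zone0                                        # Solaris: list SMF services inside zone "zone0"
systemctl -M mycontainer list-units --type=service   # systemd: the same idea for an nspawn container
```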
So before I go to the demo, let's again look into the source code. This is actually not Solaris, because Solaris is closed source these days, as you know. It was open source for a brief period of time, but then Sun was acquired by Oracle, and you know the rest of the story. But we have an open source fork of OpenSolaris, basically, and that is illumos. So this is the illumos GitHub repo. It's basically the kernel and the userspace tools all in one repo, and it's actively being worked on — you will see commits in there, so it's not totally dead. I mean, there are still people using Solaris and building mostly appliances on top of it — illumos as a base — because Solaris has the most comprehensive and most stable ZFS implementation. There are kernel modules for Linux and they work, but people still use Solaris when they need proper ZFS.

So, as you can see, this doesn't fit on one page. The structure itself is very big — it has tens and tens of members, so there is a lot of stuff in it. Processes, mount tables... Auditing, for example, is interesting: you get a native audit context in a zone, so basically auditing and zones work together. I'm not sure what the status on Linux actually is — whether the audit subsystem in the kernel is even aware that people run such things as containers on Linux these days. They are working on it, so it's a work in progress. Then some memory-management things, pages. There was one thing that I found interesting — where was it? Here: restart init if it dies. That is funny, because on Linux this works completely differently. I will be talking about it in a bit, but basically, on Linux, if the init of a PID namespace dies, the namespace is basically hosed — you cannot create new processes in it. And here they have a flag, so you can tell a zone to restart init if it dies. I found it weird, but just because I know how Linux works. Again, the number of CPUs — so maybe you can tell a zone that it has fewer CPUs than the host system — scheduler stuff, some logs, not that much interesting. And there was some IPC. That is the one bit that is very different from BSD, because in BSD, by default, you share System V IPC objects with the host. You can turn that off with a sysctl, but by default you share System V IPC with the host. That is kind of surprising, and it can cause problems if you are running applications on the host and in a container that both use System V IPC — for example Postgres, or, I don't know, some databases; maybe IBM DB2 uses System V IPC. OK, so as you can see, the zone structure is much, much bigger — that is the main takeaway — and it has a lot more stuff in it than the FreeBSD implementation.

So let's return to the slides — actually, let me show you how the tools on Solaris look. I'll SSH into a Solaris box. I can do zoneadm list, and as you can see, I have a global zone — that one is there by default — and then I have 3 zones that I provisioned; they are basically all the same. I can log in inside a zone: zlogin zone0. Inside the zone, as you can see, I am of course presented with the information that I am root, but again, I cannot do all the things you would expect root to be able to do on a system — that is again similar to the FreeBSD implementation. Yeah, I can exit out. So zones can be in different states: first you create a zone, then you install the zone, and then you can boot it. And you have tools to do it: with the zonecfg command you create a zone and configure it — what the networking should look like, what interfaces should be inside, what IP addresses and stuff like that — and then you install it, so you basically create a ZFS dataset and install the software, and then you can boot it.
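A hedged sketch of that zone lifecycle — the zone name and path are placeholders, and the zonecfg sub-commands are abbreviated to the minimum:

```bash
zonecfg -z zone1 'create; set zonepath=/zones/zone1; commit'  # create and configure
zoneadm -z zone1 install      # populate the zonepath (ZFS dataset + software)
zoneadm -z zone1 boot         # boot the zone
zoneadm list -cv              # list all zones with their states
zlogin zone1                  # log in inside the zone
```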
OK, so now let's look at Linux containers. The first Linux containers appeared around 2005. Linux containers are mostly implemented on top of two technologies provided by the Linux kernel: namespaces and cgroups. The first namespace was implemented in the kernel back in 2002, and now I will walk through all the namespaces we have in Linux and how they provide virtualization capabilities, so that in the end you can run a container and be presented with a virtualized view of the system.

So, like I said, namespaces are provided natively by the Linux kernel. These days we have these 7 kinds of namespaces: mount, PID, user, UTS, network, IPC, and cgroup. These are the syscalls you use when you interact with namespaces. clone() takes flags, so when you create a clone of a task you can specify, via the arguments of the syscall, that it should have, for example, its own network namespace. Or later on you can join the network namespace of some other process via the setns() syscall. Also, during the runtime of a process, you can call unshare() and it will create a new namespace — which kind depends on the argument — for that process. So you can migrate processes between namespaces, and you can influence which namespaces a new process shares with the host system and which it gets for itself.

Every process has a directory in /proc that is called ns. At first glance this looks weird, because if you ls it — and you actually have colors enabled in your shell — you will see that these are dangling symlinks, basically. These files identify namespaces: the type of the namespace and some identifier. And they are good for joining the namespaces of other processes. You do that by opening that file and calling setns() on the file descriptor you get from open() — that is how you join the namespace of a different process. So let's say you want to join the PID namespace of a different process. First you check that the identifier is different from the one of your own process, because otherwise it doesn't really make much sense. You open it up, and then you call setns() on that file descriptor, and that is how you effectively join the namespace of that process — you basically become part of that namespace. setns is a syscall; there is no setns binary, but there is a binary that basically does that thing — no, no, not nsexec — nsenter.

So, the first namespace, the oldest one: the mount namespace, virtualization of the file system view. In most cases you would like to provide your container with its own view of a file system, and for that you use a mount namespace. With that, there is something called mount propagation. I meant to skip over explaining mount propagation, because I would bore you to death... no? OK, so this guy wants me to explain it — so if it is too boring, just blame him. I mean, it makes sense — for me it definitely makes sense, because if you don't know how it works and you just dump into it: what's happening, where is my mount?

So imagine you create a new mount namespace and you are in that namespace — you can actually try it out with unshare. That basically means you create a new mount namespace and your bash process will be spawned in it. Then, if you run the mount command, you will see the mount table from your parent mount namespace, basically. And now, if you mount something new in that namespace, in that bash shell, then depending on the mount propagation, a couple of things could happen. For example, if the mount propagation is set to private, the new mount you just did will not show up in the parent mount namespace — so basically the host system will not know that you mounted something inside. Then you can have shared — that is the default, I think — which means that if you mount something inside the mount namespace, it will show up in the parent mount namespace, in the host mount namespace. Then you have slave propagation, which basically means that nothing propagates up to the host, to the parent, but if the parent mounts something, you will see it in your mount namespace. And the last option is unchanged: it basically leaves the mount propagation as it is currently set for the process in which you are unsharing the mount namespace. So yeah, that was very quick, but I hope you understand what it is about.
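A hedged sketch of playing with a mount namespace and private propagation — run as root, and treat /mnt as an arbitrary example mount point:

```bash
unshare -m bash                    # bash in a new mount namespace
mount --make-rprivate /            # set propagation to private for everything below /
mount -t tmpfs tmpfs /mnt          # this mount stays invisible to the host
grep /mnt /proc/self/mountinfo     # visible here...
# ...while `grep /mnt /proc/self/mountinfo` in another terminal on the host shows nothing
```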
So, the PID namespace — of course, an obvious thing: virtualization of process identifiers. In a container you want to see PID 1 again; as you could see on BSD, that is not the case there — I didn't see a PID 1 in the container, because I saw PIDs from the host system. Here I can have a PID namespace for my process. And this is actually a nice thing about Linux namespaces: you have quite a lot of flexibility in how you want to build your containers — which things you want to share with the host and which things you do not. So you can play a bit with the PID namespace: if you call unshare -p --fork --mount-proc, that just creates a bash process in its own PID namespace and mounts /proc (and mounting /proc will also give you your own mount namespace, in which /proc is remounted). If you then do ps, you actually see just one PID — PID 1, which will be the bash process — and you will not see anything else. If you do not do that, ps looks at /proc and you would see the /proc from the host, and then processes inside the container would get confused.

And then there is that fork option, which is a bit weird — and it is weird because of how PID namespaces work on Linux. Basically, if you create a PID namespace and there is no process in that namespace acting as its init, then you cannot create any new processes in it. So it wouldn't work to just create a new namespace, like with unshare, and then join it: once you create it and its init is gone, you cannot create a new process in it. That's the semantics of PID namespaces on Linux. So if you leave out that fork option, you see strange errors, like fork giving you "permission denied" — it's a bit weird if you run into this for the first time. The PID namespace is the only one that does this; it is a bit special in this regard. So it is expected that you have some process that can act as init in a PID namespace, reaping processes within the namespace. If a process dies and it was forked off by the PID 1 of the namespace, then that PID 1 gets the SIGCHLD — not the PID 1 in the parent namespace, so not systemd on the host. And when the init exits, all processes in the PID namespace get SIGKILL. So again a difference from, for example, Solaris: in Solaris, if init dies, you can set it up so that the kernel restarts the init process in that zone. The PID namespace of a process can be changed, and again, it is possible to nest PID namespaces.

User namespaces provide virtualization of the user and group databases. That was the last namespace to be implemented — well, the cgroup namespace came even later. As far as the kernel implementation goes, this one is the most complex, and the patch for it was the biggest of all the namespaces I talk about here. It was a very complex patch that changed a lot inside the kernel, and because of that there were a couple of kernel CVEs caused by user namespaces. And what it gives you — this is the only namespace, by the way, that you can use as an unprivileged user — is the possibility to present a normal user with a view of the system such that it thinks it is root. So if it runs the id command, it will return zero. That was the default in zones on Solaris, and on FreeBSD I ran id there and I saw zero, but it is not the default on Linux. With the use of a user namespace, you can give an unprivileged user the possibility to be root on the system, basically — but if such a root in a user namespace tries to mess with resources that are not governed by that user namespace, it gets "permission denied". So as a normal user you unshare into a shell, you run id, and you see you are root; but then, when you try to execute some operation that actually requires privileges on the system — I don't know, flipping an interface up or something like that — you actually get "permission denied", because in your parent user namespace you are not root, just a normal user. And then you also have to set up the mapping of user IDs between container and host, between parent and child namespaces; you do that by writing to proc files (uid_map and gid_map). If you use the unshare utility, it will basically create the mapping such that you, the normal user outside the user namespace, are mapped to the root user inside it. And again, they can be nested, so you can create sub-user-namespaces if you want.
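A hedged sketch of both tricks just described, using the unshare utility; the second line works as an ordinary unprivileged user:

```bash
unshare -p --fork --mount-proc bash   # own PID namespace; `ps` now shows bash as PID 1
unshare -U -r bash                    # own user namespace; `id` prints uid=0, mapped to your real user
cat /proc/$$/uid_map                  # inspect the uid mapping of the current shell
```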
Then the network namespace — again, virtualization of network-related resources: interfaces, the IP stack, routing tables. This is actually quite a nice feature, and I use it quite a lot, not for container use cases but for testing network software. It's very nice for testing network daemons: you create a separate view of the network for one daemon, you run a second instance in a second network namespace, and you create a tunnel between the namespaces via veth links. Quite a nice use case that I use from time to time — so if you are writing some network software, it might be useful to look into that, again for testing and such. For container use cases: if you have some application that communicates only locally — over Unix sockets, maybe — and basically doesn't need access to the real network, you put it inside a network namespace and present it only with the loopback interface. Then it can't really talk to the real network, even if the code inside is privileged — I don't know, it depends on the use case — because it cannot create sockets bound to real IP addresses and so on; there are no interfaces inside except loopback, for example.

Then there are namespaces that I will go through really quickly. The IPC namespace: again, just isolation; it provides you with your own IPC resources, like POSIX message queues and System V semaphores. That is again different from the other operating systems: on Solaris you get this by default, in FreeBSD jails you by default do not have it; on Linux you again have the possibility to turn it on or off, depending on the use case. The UTS namespace — weird name; UTS stands for Unix Time-Sharing system, and I don't know why they named it that way. "Hostname namespace" would make much more sense, because that is what it is: with it, your container can change its hostname and NIS domain name.

And then there is the cgroup namespace, which was added to the kernel last — it's the newest one — and it is a virtualization of the cgroup tree view. If you look at cat /proc/<pid>/cgroup, you will see that a given process is in certain cgroups. By default, when you run systemd, if you do it for a process that is part of a system service, you will most likely see that it is in the systemd hierarchy: it is in some system slice, there will be a sub-cgroup for the service, and the process will be in that service. But you can create a cgroup namespace and provide the process with a different view of the cgroup tree — you can basically tell a process that it is in the root cgroup, even though it really isn't.
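Going back to the network-namespace testing trick mentioned above, here is a hedged sketch of two namespaces joined by a veth pair — names and addresses are placeholders:

```bash
ip netns add ns1                              # two named network namespaces
ip netns add ns2
ip link add veth1 type veth peer name veth2   # a veth pair as the "tunnel"
ip link set veth1 netns ns1
ip link set veth2 netns ns2
ip -n ns1 addr add 10.0.0.1/24 dev veth1
ip -n ns2 addr add 10.0.0.2/24 dev veth2
ip -n ns1 link set veth1 up
ip -n ns2 link set veth2 up
ip netns exec ns2 ping -c1 10.0.0.1           # a daemon under test could listen on 10.0.0.1
```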
And related to that: Linux cgroups, the technology that is mostly used not for isolation — process isolation or virtualization of resources — but for the second big part of how containers are used: process aggregation, resource limiting, and setting up resources on a system. With Linux cgroups you have multiple cgroup controllers. A cgroup controller is basically a kernel concept that governs how certain resources on a system are allocated between processes. These cgroup controllers are mounted by default on your Fedora or RHEL system under /sys/fs/cgroup, because that is where systemd mounts all these controllers by default — so you can have a look there. In those directories — /sys/fs/cgroup/blkio, /sys/fs/cgroup/memory — you can create hierarchies, basically directory trees: you create subdirectories, and these subdirectories will have control files in them, and by writing to those files you change the current settings. So it's a file-system-based API.

And these hierarchies are orthogonal: that means the hierarchy — basically the directory tree — in the pid controller can look totally different from the hierarchy in the memory controller, and you can set it up so that they look totally different. But there are some implications inside the kernel when you set things up a certain way, and the interactions between these subsystems are not obvious — like if you mess with memory and CPU, for example. It depends how you set it up, but in most cases, if you do something with CPU and how much CPU who can use, it may have implications for some other controllers. So what systemd does: it creates the hierarchies the same way in every controller. If you look at the directory trees and you did not mess with the cgroup file system yourself, you will see the same structure in every controller that we support — because we do not support every controller.

And now cgroup v2 is in the works; it's not finished yet, hopefully we will get it soon. It is again a file-system-based API — you would again mount that file system — and with cgroup v2 you get a unified hierarchy: basically one directory tree, not many directory trees, and you enable and disable the controllers that are supported in v2 by again interacting with the control files that you see in those directories. This is already supported in systemd, but kernel support is missing — for example, the CPU controller is not merged yet in the upstream kernel. Hopefully we will have it soon, but systemd already supports this, so you can boot your machine with the unified hierarchy turned on, and the kernel even supports it to some extent — some of these controllers can be used either in v1 or v2 mode. So you can experiment on Rawhide, for example.

And that's it — so if you have any questions about this part of the presentation, I will be more than happy to answer. Yeah, go ahead. [Question about cgroups and the devices controller — that it shouldn't really be a cgroup controller, but it is.] In v2 that controller won't be there, I think. The devices controller will not be there in v2; we will have memory, blkio, CPU, and maybe, I don't know, networking — but I think devices will not be there. [Q: you can have a whitelist or blacklist of devices a cgroup is allowed to use — so only what is allowed is visible?] Yeah, there is a whitelist or blacklist file — I don't remember, I would have to check. For systemd services you can do something a little bit different: you can turn on PrivateDevices=yes and you get your own separate /dev. With that you can do whatever you want — you get your own /dev with a couple of device nodes pre-populated, the basic ones like /dev/null and /dev/zero and these kinds of things — and the devices from the host are masked, so nothing from the host leaks in. I know that libvirt recently merged a patch where they use this private /dev in a bit of a different way. Their problem is that udev keeps touching devices that they do not want udev to touch, so they create a private /dev and create a couple of device nodes there manually with mknod — their own private /dev. You are from the LVM team and you have udev integration — you actually use udev to do some useful work; they basically want to avoid udev altogether, they do not want udev to ever touch these devices. So they have their own /dev: the QEMU/KVM process has its own /dev, basically its own mount namespace, in which they set up /dev differently from the host.
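A hedged sketch of that file-system-based cgroup v1 API — the group name "demo" and the 64 MB limit are arbitrary; run as root on a system where systemd has mounted the v1 controllers:

```bash
mkdir /sys/fs/cgroup/memory/demo                     # new cgroup in the memory controller
echo $((64*1024*1024)) > /sys/fs/cgroup/memory/demo/memory.limit_in_bytes
echo $$ > /sys/fs/cgroup/memory/demo/cgroup.procs    # move this shell into the cgroup
cat /proc/self/cgroup                                # see which cgroups we are now in
```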
So, do you want a break before I start? Who wants a break? Just one... OK. So maybe just two organizational things at the beginning. Because I have a slight phobia about typing on a keyboard in public, I put everything into the slides, so there will be no live demo unless you ask for it — if you really want to see something live, just scream or raise your hand or something like that. And the second thing: we got much more time than we asked for, like an hour more, so please ask as many questions as you can, because otherwise we will just stand here and stare at each other in silence for an hour.

So let's start. Is there someone who does not know what systemd is? Raise your hand... yeah, I guessed so. Basically, just a few words: systemd was made as a set of tools for managing systems, and its primary goal is to create a standard base for Linux userspace. A lot of those basic principles overlap with the container world, because you want to manage containers the same way you want to manage services.

So let's move to systemd-nspawn itself. As Michal mentioned, every containerization technology was made with some purpose — for example, jails were made because they wanted to remove the omnipotent root, and stuff like that. The reason Lennart started with systemd-nspawn was that developing an init system basically sucks. If you make a change in the code and you want to try it, you can install it on your computer, but if you made some mistake, the machine is basically broken. You can put it inside a VM, which also works, but then you have the problem that copying files into a VM, or attaching debuggers, is really not that comfortable. So the initial use case was that we wanted to run systemd inside a container, so we could attach, for example, a debugger from the host to that systemd and see what's going on. But in the end, the tool itself proved to be quite useful for more people, so right now it's part of the systemd project, and we already know about some users: for example, Rocket (rkt) uses it as its backend, and I even heard about some companies that based their business on top of systemd-nspawn. I think the reason for most of those was that they started with containers really early — as Michal said, nspawn was here before Docker — so they used nspawn because it was there first.

And how is this different from Docker? systemd-nspawn actually covers a completely different area than Docker. Docker is what we call application containers, which basically means you just want to run one application inside the container — it's basically a packaging tool for lazy developers. What we really needed was an environment where we could run the entire OS stack: we want to run systemd inside the container, we want, I don't know, journald to collect logs, we want to try some stuff with networking. So OS containers are really much closer to a virtual machine than to an application container. I think if you were at the talk just before this one, this was already explained: an application container does not run its own init. The main difference, of course, between a virtual machine and an OS container is that the OS container shares the kernel with the host, and there is also no virtualized hardware.
When I said this talk would be about systemd-nspawn, that's not quite true, because systemd-nspawn is really just the tool that creates the namespaces — the stuff Michal talked about. For containers you also need networking, you need to somehow collect logs, and it really doesn't make sense to implement all that just for nspawn, so we basically share a lot of stuff with systemd itself. We use systemd itself as init, so we can start the container, for example, at boot, and we also use it inside the container. We have machined, which tracks virtual machines and containers and works with the images. We have journald, which can collect logs from the container. There is a special tool, systemd-importd, that can download images. And as I said, we share the networking — we use it on the host, for example, as a DHCP server, and also in the container itself, but I will get to that part later.

So how do you actually get or build an image for nspawn? As Lennart mentioned in his previous talk, we didn't want to define any new format — we basically support everything. A container image, from systemd-nspawn's point of view, can be, for example, just a plain directory where you dump the OS tree; it can be a btrfs subvolume; you can have the directory packed inside a tarball; or it can be a raw disk image. So what do you do if you want to run something? The first thing you can do is use machinectl to pull something and download, for example, some cloud image. Ubuntu packs their cloud images as tarballs, so you can use the pull-tar command, while Fedora, for example, provides their cloud images as raw images, so you use the other one, pull-raw. If you want to create a container yourself, the old method was always to use the tools — the package manager — already present on the system. For example, on Fedora, if you want to create a container, you can just use dnf to download the packages into some directory, specifying the packages for the container. The same thing basically goes for Debian or Arch Linux — I don't know much about how those tools work, so if you are using those kinds of distributions, just look at the wiki or man page.

This is basically a lot of stuff to type, so Lennart last year wrote a new tool called mkosi. It's really nice — it's basically a tool that creates the image for you: you just specify what distribution you want and how the image should look. So, for example, if I would like to have a Debian image, I would just type mkosi with a directory and the distribution debian, and it would basically run the previous commands that I showed you. (debootstrap, for example, is packaged for Fedora, so you can create a Debian image there.) And a really nice thing is that if you build a raw image, you can also make it bootable, and such an image can then also be started inside QEMU — just with plain QEMU/KVM you can point it at this image. (By the way, if you need those slides, they are online.) You also need some firmware for the virtual machine that does EFI, but that's nothing that interesting. So this is the way you prepare an image.
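A hedged sketch of those image-acquisition paths — the URLs and image names are placeholders, and the mkosi flags are my reading of what the talk describes, not a verified invocation:

```bash
# download pre-built images via machined
machinectl pull-tar https://example.com/xenial-server-cloudimg-amd64-root.tar.xz ubuntu
machinectl pull-raw https://example.com/Fedora-Cloud-Base.raw.xz fedora
# build a tree yourself with the host package manager
dnf --releasever=24 --installroot=/var/lib/machines/f24 \
    install systemd passwd dnf fedora-release vim-minimal
# or let mkosi do the typing for you
mkosi -d debian -o /var/lib/machines/debian.raw --bootable
```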
So when we have an image, how do we start it — how do we create a container on top of that image? The basic call is just running systemd-nspawn and pointing it, for example, at the directory. What that does is basically chroot-like behavior. It's a little bit better than chroot, because it also mounts stuff like /proc and some basic /dev — but again, it's just like a chroot. If you move one step further, you often really want to have some init inside the container. I don't know if anyone has hit the problem with Docker and the zombie apocalypse — I don't know if you know about that. Basically, the problem is that, as Michal said, when you run something in a PID namespace, the first process started there is marked as PID 1 and is considered the init. So if you do something where there is a double fork, you have a daemon process whose parent will be PID 1, and if such a process dies, init must reap it. If you run some utility inside the container as PID 1 that does not do that — I think the problem case was running a makefile inside a container: it forked some processes and expected that init would reap them — you basically end up with zombies inside the container. What you can do in this case, if you don't want to run the full operating system inside the container, is to specify the --as-pid2 option. What it does is start a really small stub init — we have that inside systemd now — and the only thing this init does is reap the children; that's all the stub init does, and your own process will be run as PID 2.

But we are talking about OS containers, so what we want is to boot the whole operating system inside the container. It's enough to just specify the --boot option — then systemd-nspawn knows you want to start an init, it will actually boot the whole system inside the container, and you will get a login prompt. Any questions so far?

All of these methods will start the container in the foreground. If you want to start it in the background, you need to use systemd-machined and, through the machinectl command or the D-Bus API, tell it to start that container. Or if you want to enable it at boot, you just use the machinectl enable command. It's really the same as with systemctl: where you would use systemctl start, here you use machinectl start, and likewise enable. In the previous commands I did not say where you should put those images. You can have them all around your system, but then machined would not know about those images, so the best place to download them to is /var/lib/machines. Then you can again use machined to get information about the images you have downloaded: machinectl list-images will display the installed images with some basic information about each — what type it is, when it was created, when it was last modified. Again, if you want more specific information about an image, use the image-status command: it gives you human-readable output with various information. And it's the same split as with systemctl, where systemctl status is a tool for humans and, if you want something parsable, you use systemctl show: here you can use machinectl show-image, which gives you the parsable information — and while the status output is not considered stable and can change, we provide a stability promise for the show output.
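A hedged sketch of the run modes just described — the image name f24 and the /usr/bin/myapp binary are placeholders:

```bash
systemd-nspawn -D /var/lib/machines/f24                           # chroot-like shell in the tree
systemd-nspawn -D /var/lib/machines/f24 --as-pid2 /usr/bin/myapp  # stub init as PID 1, app as PID 2
systemd-nspawn -D /var/lib/machines/f24 -b                        # boot the full OS, get a login prompt
machinectl start f24                                              # same, in the background via machined
machinectl enable f24                                             # start the container at boot
```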
There is other stuff you can do with images: you can create a copy of an image, you can rename it, or you can mark it read-only. [Q: will a copy share the same pages on disk?] It depends which kind of image you are using. If you are using directories, it will really create a copy of the directory — cp, basically. If you use btrfs subvolumes as images, then it will create a snapshot, and btrfs does copy-on-write. I don't know much about the file system, but what I think is going on there, if you use btrfs, is that it basically creates copy-on-write subvolumes — that might be the case, I really don't know much about that. [Q: and with running processes?] No, this is a tool for images, so it works with files; it doesn't work with running containers. Oh, you are asking about running containers — no, the cloned image will reference the same blocks on disk as long as you do not change them. With btrfs you don't do a copy, you do a snapshot; for a plain directory you do a real copy — or if you do a reflink copy, it probably uses reflinks, I don't know. But I guess, as usual, if you run a process, the memory for a shared library is shared by mapping the page; if the clone is a separate copy, that will not be shared — so the memory, I mean the RAM, cannot be shared between the containers. That was how I understood the question. OK, so maybe we should just move on — there was a question in the back. [Q: ZFS?] No, we don't have any support for ZFS. But again, if you do your own job with the images — machinectl does not support that, so you can't do a clone there — if you basically prepare the directory yourself and run nspawn on top of it, nspawn really does not care what file system is underneath. I don't know much about ZFS, I don't use it, but if it has some clone ability and you can basically create a copy in the file system, you could do that. Well, as I said, of the file systems we only know about btrfs; for the rest of the stuff it will do a simple copy, while with btrfs it creates a snapshot.
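A hedged sketch of those image operations — image names are placeholders:

```bash
machinectl clone f24 f24-test        # btrfs: snapshot; plain directory: full copy
machinectl rename f24-test f24-dev
machinectl read-only f24-dev yes     # mark the image read-only
machinectl list-images               # basic info: type, created, modified
machinectl image-status f24-dev      # human-readable; use `show-image` for parsable output
```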
So I already showed you how to start a container in the background — that's probably nothing interesting. What I find really interesting is the machinectl status command. It gives you so much information about the machine that there was actually a CVE reported for it. It will tell you when the container was started, what the PID of its init is from the view of the host, the version of the operating system, the IP address; it will show you all processes running inside the container from the cgroup point of view, and also the last, I think, ten lines from the journal.

So if you have a running container and you want to stop it, there are basically three ways. machinectl poweroff is basically the same thing as systemctl poweroff on the host: it gracefully turns down the machine — it basically tells the init inside the container to shut down, so the init inside stops the running services and then gracefully ends the machine. machinectl terminate is much more destructive: it basically sends, I think, SIGKILL to every process inside the container. Or you can specify your own signal by using machinectl kill — then you can send whatever signal you want, and you can choose whether to send it to the init itself or to all processes inside the container.

So we have a running machine — what can we do with it? You probably all know how to use systemctl; you can also use it with nspawn containers. If I wanted to start Apache on my host, I would just type systemctl start httpd. If I want to start Apache inside a running container, I just specify -M and the name of the container to the systemctl command, and it starts Apache inside the container. In the same way you can use basically all the commands systemctl provides. You can display the status of a service running inside the container, with all the advantages that regular systemctl status has: it will show you all the processes running in that cgroup, the main PID, the status — basically everything. If you want to get a shell inside the container, you can use the machinectl shell command — again, it just gives you a shell inside the container, inside all of those namespaces. If you for some reason want to log in inside the container, you can use machinectl login. One interesting thing about this: if you don't specify the name of a container in the command, it basically creates a login shell on your own host computer. I think on some sites — Phoronix, that kind of magazine — there was news that in systemd we have reimplemented the behavior of su. The real reason behind it was to enable logging in inside a container, and all of those machinectl commands also basically work on the host — for example, you can, I think, call machinectl poweroff .host and it will perform the action on the host itself. So that was just a misunderstanding — but it happens quite often.

The next thing I want to show you: you probably know the systemd-run command — on the host, it just tells systemd that you want to run a command inside a transient unit, a transient scope. You can do the same thing with a container: systemd-run, again with -M and the name of the container, will start the command inside the container. The path is, of course, relative to the container itself, so this will not start the /bin/true command from my host, but the /bin/true from the container.

So the next nice thing is the integration with the journal. I guess every one of you knows the journalctl command — or is there someone who does not? It's nice to have a talk among experts; when you also have to explain what systemd is and what the journal is, it takes much more time. So if you have a running container, you can again use the same command as you would use on the host: you specify, again, -M and the name of the container, and you will get the logs. You can again use all the same options — so, for example, I could use journalctl -M with -u httpd.service, and it will show me only the logs from the container that relate to Apache. [Q: are the journals saved inside the container or outside?] I will get to it on the next slide — the next slide is exactly: where are the journals stored?
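A hedged sketch of managing a running container from the host — the container name and the httpd service are placeholders:

```bash
systemctl -M f24 start httpd          # start a service inside the container
systemctl -M f24 status httpd         # status, cgroup processes, main PID
machinectl shell f24                  # shell inside the container's namespaces
machinectl login f24                  # full login prompt inside the container
systemd-run -M f24 /bin/true          # transient unit inside the container
journalctl -M f24 -u httpd.service    # one service's logs from the container
```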
Completely by default, the journals will be inside the container, but you can use the --link-journal option. With the first mode, host, what it basically does is store the journal logs on the host itself, in /var/log/journal, and mount that directory inside the container — so the journald inside the container logs into a directory whose contents are actually stored on the host. The opposite of that is --link-journal=guest: the logs are still stored in the machine itself, but they are mounted so they are visible on the host. The main difference between those two modes is that with the first one, if you delete the container, you will still have the logs; with the second one, if you remove the container, you lose them. One problem is that with the -M option, journalctl gets the information about where the container is located from machined, and that works only for running containers. So if you want information from a stopped container, you need to do a hack like this one, where you basically tell journalctl to merge all the logs it finds in its log directory, and you can then, for example, search for specific entries by the hostname field.

So, what about the file system? I already mentioned that we use btrfs for some of this stuff. Here is an example command — this is basically a volatile container, how to call it, one with a tmpfs inside /var — and its configuration. You can use the --volatile option: if you specify --volatile=state, what it will do is mount a tmpfs over /var inside the container. So if you, for example, want to upgrade your system, you just stop the container, update the image itself, and then, when you run it again, it will stack the application state and data on top of the updated image — so it's not like…
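To close, a hedged sketch of the volatile mode just described — the image name is a placeholder:

```bash
systemd-nspawn -D /var/lib/machines/f24 --volatile=state -b
# the OS tree itself stays untouched (mounted read-only); /var is a tmpfs,
# so the state written there is gone when the container stops
```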