My name is Michael. My talk is about running and distributing FreeBSD containers, and more importantly, what we can do to, I claim, make them even better than Docker containers on Linux. I was here at EuroBSDCon last year talking about an open source version of the things I'm working on. After that, I took a couple of months and actually implemented it, and this is the result I'm presenting. So, the overview of the talk: I'll start with a quick introduction to containers and OCI, and why it matters when we already have jails and so many kinds of jail managers. Then I'll talk about the issues and special considerations you run into when implementing a container system for FreeBSD, and, more importantly, the special features FreeBSD gives us. Then I'll do some demos; actually, we'll do demos throughout the talk. Lastly, I'll talk about some future work and maybe go a bit deeper into architecture-related things. So, let's start with OCI. If you're not familiar with OCI, it's basically a project from the Linux Foundation. After Docker and all that, people realized they wanted a more standardized specification to define, not really what a container is, but how to distribute some blobs and then how to run them: the environment variables, different kinds of specifications. It breaks into three really important specifications. The first one is the runtime specification, which is kind of irrelevant in this case, but basically it tells the actual runtime what to run and what the required conditions are. The last two are more interesting. The image specification is basically a manifest saying what to run. And I think the most important is the distribution specification: when we have a container, how do we distribute it? That is what vendors build their infrastructure against to let people push their containers.
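To make the image/distribution split concrete, here is a minimal sketch of what an OCI image manifest looks like; the media types are the real values from the image specification, while the digests and sizes are placeholders:

```json
{
  "schemaVersion": 2,
  "mediaType": "application/vnd.oci.image.manifest.v1+json",
  "config": {
    "mediaType": "application/vnd.oci.image.config.v1+json",
    "digest": "sha256:…",
    "size": 1234
  },
  "layers": [
    {
      "mediaType": "application/vnd.oci.image.layer.v1.tar+gzip",
      "digest": "sha256:…",
      "size": 56789
    }
  ]
}
```

The distribution specification then defines HTTP endpoints such as `GET /v2/<name>/manifests/<reference>` and `GET /v2/<name>/blobs/<digest>` that every registry implements, which is exactly what lets a FreeBSD runtime reuse an existing registry unchanged.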
That translates to this: if we can adopt the distribution specification, we can just use the existing infrastructure, for example Docker Hub or other hosted container registries, to distribute FreeBSD containers, which is the prime goal of this project. Now, I'm implementing something new, right? So let's talk about what we already have and why I needed to implement anything at all. We already have a lot of tools to deal with jails: AppJail, Bastille, iocage, you name it, and we even have Podman now, which is fully OCI compatible. But except for Podman and pot, they are mostly about creating stateful jails. You build something, the jail contains state, and you manage it almost like a virtual host. But in reality, in DevOps or in backend-focused applications, those things just change a lot. If you attended Dave's talk earlier, he talked about why immutability is important when you're trying to distribute a software solution. pot was great. The only issue with pot is that it's not OCI compatible, so I can't really use it at work. It also requires its own image registry; to put it simply, you can't really use Docker Hub with it. Podman is basically the Linux Podman: whatever you can do with Podman on Linux, you can do on FreeBSD, with jails underneath. But this time I'm going to focus on XC, which is the tool I built, because I really couldn't think of a better name; naming is really hard. First of all, when I started working on XC, there were some reasons I didn't want to just port Podman or Docker over. Most specifically, I didn't realize someone was already porting Podman, so I didn't wait for it. I also really wanted something that plays well with FreeBSD, something FreeBSD native, both in terms of values and in terms of user experience. And I think there are some really important shortcomings in the OCI image specification, which we'll get into with the XC features. Again, I wanted something that's more of a FreeBSD container.
That means I want to utilize all the features I have on FreeBSD when I build containers. Otherwise, why use FreeBSD? I use FreeBSD because I want features that only exist in FreeBSD. We kind of talked about this already. Most importantly, again, it's distribution: once you figure out distribution, you have an ecosystem people can actually use. It's not expensive to run; you don't have to spin up your own image registry or anything like that. But before we jump into XC, let's talk about the issues you run into when implementing a FreeBSD container runtime. First of all, on FreeBSD, if you use bhyve, tun/tap, or nmdm, they all require access to some kind of device nodes. On FreeBSD, the way you manage device nodes is via devfs, so you really need something that dynamically generates devfs rulesets. Technically you could create a custom ruleset by hand every single time you need to run such an application, bhyve for example, in a jail, but it's not really practical, so we need something that automates it. At the same time, you don't want an image to be able to say, "I want to expose all the NVMe devices to the container," so you want something to guard against that. Some other special considerations are really FreeBSD features. For example, on FreeBSD we can have VNET jails and non-VNET jails, and like a talk mentioned earlier, there are some really nice security properties of non-VNET jails; for example, you cannot change the network stack. We can also run some sort of Linux jail. Those, to me, are very important features. We can also do things like `zfs jail`, and that is quite important, because it means you can dedicate a dataset to a container that implements a certain solution, or run something like poudriere in a jail. Lastly, there's DTrace.
FreeBSD with DTrace is really quite nice for doing container-related things compared with what you have on Linux, because on Linux a container is really just a bunch of cgroups and namespaces; there's no way, even with eBPF, to say "I want to trace exactly this container," because that doesn't really make sense there. You can trace a specific process, but nothing like a whole container, everything running together. The last two issues I'll mention are already fixed. There are changes committed so you can use `ifconfig -j` and `route -j` to change the routing table and configure network interfaces of a jail without having the binaries inside the jail. And as part of porting Podman, the same gentleman also implemented nullfs mounts of files, so we can actually use nullfs to mount a file onto a file, which is really nice. Okay, back to XC. It is a container runtime for FreeBSD. It uses its own image format, not the OCI one, although it can understand OCI. You can utilize the OCI distribution specification to upload and pull images; that means you can upload to Docker Hub, to Azure, basically to all the infrastructure vendors are already providing for containers. The container image format, I think, is improved; it's more self-documenting. If you think about Linux containers, those are containers quite literally without a label: there are just things inside, and you know it's some kind of liquid, but there's nothing telling you how to drink it or how much you should drink. What I mean by self-documenting container images, as you'll see later, is that there's a lot of guarding against the user even using them wrong. There are also a number of features available in XC that aren't available in Docker or Podman. For example, the networking is a bit interesting in terms of setup.
There's no notion of the container engine taking over a network with a bunch of constraints attached. It's more like: just tell me which interface I should create an address on, or which interface I need to bridge on, and optionally let me take care of address allocation, because you probably don't want to come up with an IP address yourself every single time you create a new container. XC supports both VNET and non-VNET containers, because I think non-VNET is a really good security feature. We'll get to the sanity checks later; those are pretty cool in my opinion. Volume hints as well; those are really great features we can add to containers. Most importantly, it handles the generation of devfs rulesets. If a container requires access to some kind of device nodes to function, XC just takes care of that, and it does it in a fairly secure way. It also supports jailing ZFS datasets. The easiest way to put it is that you can distribute something like poudriere as a container, so you really don't have to set it up yourself. Because XC also understands the OCI image format in general, many Linux containers available right now can just run unmodified. Lastly, how can we forget about DTrace? A lot of effort was also put into making the DTrace integration with FreeBSD containers more seamless, which I think is a great addition if you use it in production or just for DevOps work. The architecture of XC looks like this. First of all, let me take this with me. You have the client, which is the front-end `xc` command. It sends commands to a daemon. When needed, for example when you ask it to create a new container, the daemon forks itself into another process that runs the container main loop, and in between the two still communicate via a Unix socket.
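As a sketch of the client/daemon split described above, a newline-delimited JSON request over a Unix socket pair could look like this in Python; the message shape and the "jrv" id are made up for illustration, not XC's actual wire protocol:

```python
import json
import socket
import threading

def daemon_loop(sock: socket.socket) -> None:
    # Daemon side: read one newline-delimited JSON request and answer it.
    request = json.loads(sock.makefile("r").readline())
    response = {"ok": True, "action": request["action"], "id": "jrv"}
    sock.sendall((json.dumps(response) + "\n").encode())

client, daemon_side = socket.socketpair(socket.AF_UNIX, socket.SOCK_STREAM)
threading.Thread(target=daemon_loop, args=(daemon_side,), daemon=True).start()

# Client side: the front-end sends a command and waits for the acknowledgement.
client.sendall((json.dumps({"action": "create", "image": "mariadb"}) + "\n").encode())
reply = json.loads(client.makefile("r").readline())
print(reply)
```

The interesting part, compared with Docker's HTTP-over-socket approach, is that the same socket can also carry file descriptors, which plain HTTP cannot.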
And when it needs to create the jailed process, it just spawns it from there. The reason for using a Unix socket is not just that it's native-ish. When Docker uses a Unix socket, it's really just running HTTP over it. XC actually sends JSON messages back and forth, but it also sends file descriptors, which, if we have enough time later, we can talk about; it's a really cool thing to be able to use OS primitives like that. So let's talk about using XC. Well, first of all, you cannot run a container if you don't have an image, right? You have a pot, but if you don't have food, there's nothing for you to eat. There are a few ways to create these container images. First, as I said before, you can pull them from an image registry, for example Docker Hub, which we'll demonstrate later. Or you can convert a jail to a container image. There's a script I haven't uploaded yet, but basically you can take not just any jail but even a tarball and convert it into a container image that is ready to distribute to, say, Docker Hub. Lastly, before I realized Podman supported import, I also made a Jailfile thing: you get roughly the Dockerfile syntax and can build your images that way. Networking is kind of optional. You don't have to attach to an XC network; you can just pass an existing interface and it will just work. But networks are nice because they group the parameters in a single place, and they also do a few extra things: they create pf tables dynamically, they allow a container to belong to multiple networks, and the addresses allocated in a network get filled into those pf tables.
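Since the runtime tracks which addresses it has handed out per network, the allocation step itself boils down to something like this sketch; the function name and the bookkeeping are mine, not XC's:

```python
import ipaddress

def next_free_address(subnet: str, in_use: set[str]) -> str:
    """Return the first host address in `subnet` not already handed out."""
    net = ipaddress.ip_network(subnet)
    for host in net.hosts():          # skips the network and broadcast addresses
        if str(host) not in in_use:
            return str(host)
    raise RuntimeError(f"subnet {subnet} is exhausted")

# Example: two addresses are already allocated to running containers.
print(next_free_address("10.0.8.0/29", {"10.0.8.1", "10.0.8.2"}))
```

The same set of in-use addresses is what would be mirrored into the pf tables mentioned above.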
It basically brings all the networking down to two questions: I don't care what you do with networking, but if you're running a non-VNET jail, just tell me which network interface I should put the address on; and if you're running a VNET jail, just tell the runtime which bridge interface it should add the epair interface to; it just creates a new epair for that. Does it have netgraph support? Not yet; that's future work. It also handles, again, address allocation: you give it a subnet and it generates an IP address that's not yet used in that subnet, so you don't have to do the mental gymnastics of coming up with a new IP address every single time. So let's do a demo with Docker Hub and Linux containers, because that's probably fun, and we can also see some shortcomings of existing Docker containers at the same time. Let me figure out how to exit this. All right. What I'm doing in this demo is pulling MariaDB from Docker Hub. The Docker Hub MariaDB is really a Linux container; it's designed for Linux, but I'm going to show that we can just run it. If the demo god likes me enough, nothing should go wrong. As you can see, it's downloading all the layers, then extracting them, creating a set of datasets and all that. At the same time it also needs to register the diff IDs, but that's a technical implementation detail we don't really have to care about. So when I try to run it, I basically just say `xc run`, if I can type. I cannot type. And I haven't loaded the Linux kmod yet; as you can see, it actually tells me about that instead of just trying to run it anyway. There's also other logic: if it cannot identify the ABI of the binary, it falls back to treating it as a Linux binary, because that's what happens sometimes with Go containers. Okay, now we can actually run it. As you can see, we're already running a Linux MariaDB container without any modification, except here's the thing.
This happens with all the Linux containers, well, containers in general: it really depends on the entrypoint script, and how well that script is written, to tell you what's wrong or what kind of variables are required. We'll come back to this feature later, but let's set the MariaDB password so we can continue our demo. So I'm going to run it again, this time with the argument that sets a MariaDB password. I'm going to use a really secure password, which is "password", and run it. Now, as you can see, the database is running, which is kind of cool, but let's try to see if we can actually connect to it. First of all, I need to SSH to my... Excuse me? Well, I didn't put it on any network, because for this demo we don't really need one, but since I'm now sitting in tmux-inside-tmux, which is kind of annoying, it reminds me to show one feature: just like with Docker containers, you can detach from this session and re-attach it again. I didn't give the container a name, but if I say `xc ps`, I can see the container running, and I can refer to it by its ID, "jrv". I'll use that because it's easier to type, and attach it back. So here we go, we're back. In fact, you can have multiple sessions attached to the same session, so it's kind of like a poor man's tmux in that way as well. Anyway, back to the demo: let's try to run a client inside the same container, over the Unix socket, to connect to this database. So again, `xc ps`... this time I'll allocate a new terminal, and then, using the container ID instead this time, I run `mariadb`. Access denied, because I cannot type; it needs the password. And now we're connected to the same MariaDB. So this is running a Linux container without any modification. All right, back to the slides. Thank you. So that's the...
Unfortunately, that's probably the least impressive part of the demo. Before we talk about the more advanced features, let's step back a little and talk about devfs ruleset management, because I hope I've demonstrated how important the whole thing is. One class of application I forgot to mention: let's say you have a jail, well, a container, that works directly with a disk; then you obviously need a way to pass the device to the jail, and in that case you really need devfs ruleset management. So I think I have a cool demo for that, and for DTrace for that matter. Actually, I think I messed up my slide order, but anyway, we'll just show the Erlang demo anyway. Because the DTrace stuff runs much better on x86, I'm just going to use a server at home to do it. XC as a runtime actually registers DTrace USDT probes itself, so we can trace exactly what is going on inside the container runtime: if somebody creates a jail, it will show up, and so on. There's also a flag, because of the potential security issue, to optionally expose the USDT capability to the jail as well, so if you're running something like Erlang in the jail, you can DTrace it from outside the jail too. So, I will run the stock FreeBSD image and install Erlang inside. I'll run the DTrace probe script first, just to show that the runtime is doing something: when a jail is created, it should print the name of the jail and the jail ID. Let's run it. As we create the container, as you can see, DTrace picks it up and now it says, hey, the jail has been created.
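For reference, consuming a runtime's USDT probes from a D script looks roughly like this; the provider and probe names here are hypothetical (check the real ones with `dtrace -l` first), and the argument layout is an assumption:

```d
/* Hypothetical probe: fires when the runtime creates a jail.    */
/* Assumed args: arg0 = jail name (userland string), arg1 = jid. */
xc*:::jail-created
{
    printf("jail %s created, jid %d\n", copyinstr(arg0), (int)arg1);
}
```

`copyinstr()` is needed because USDT probe arguments that point at strings live in the traced process's address space, not the kernel's.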
So if you need to build some kind of special tooling to gather data and show it off, this is one way to do it: you don't have to patch XC for it to work, but do let me know where to put more USDT probes if that would help you. So we're going to install Erlang. By the way, as you can see, this time I attached a network, because I need one. I'm going to run `erl`, the Erlang virtual machine, well, the Erlang shell, inside. And now I'm going to trace the Erlang probes; as you can see, DTrace already matched them. Let's try something in Erlang, for example printing hello world. See, when I run the hello world, DTrace picks up what's happening inside the Erlang shell and prints things out. So if your workload, I don't know, maybe you are pre-Facebook WhatsApp, or let's say you depend on MQTT a lot, so you're probably using Erlang; that means when you're running such a container on FreeBSD, you automatically get the ability to trace the container that way. This is, in my opinion, a really good way to show the strength of FreeBSD and DTrace, but the cool thing does not stop here, because we have our own DTrace invention called dwatch, which is basically scripts built around DTrace, and I built a shorthand for it. First of all, I need the container name or ID; it's 5, so I can say `xc trace 5`. You can also use different dwatch profiles, but by default it shows the system calls. Now if I try to do things inside the container... oh, it's doing things already, great. We can already see dwatch doing its thing and showing all the system calls happening in the container. And by the way, a FreeBSD container is again just a FreeBSD jail, and jails are first-class in DTrace. That means you can literally just say: I want to trace the things happening in this container.
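The "trace the whole container" trick works because, if I remember correctly, FreeBSD's DTrace exposes per-jail built-in variables (`jid` and `jailname`), so even a hand-written one-liner can aggregate over exactly one jail; treat the variable names as something to verify on your FreeBSD version:

```d
/* Count system calls made by processes inside jail 5 only. */
syscall:::entry
/jid == 5/
{
    @calls[execname, probefunc] = count();
}
```

This is essentially what the dwatch-based `xc trace` shorthand automates.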
And you can see all the processes and how they interact with each other, which is really hard to do on Linux, if not impossible. So that shows off the DTrace feature, and back to our... let me skip this part so I can talk about... yeah, I prepared a slide for DTrace, I just messed up the order, and it caught me off guard a little. So I guess I have to do the devfs demo as well. The devfs demo I picked is bhyve; in fact, it's just a really simple bhyve setup. Let me show you the Jailfile for this bhyve image. It just installs the bhyve firmware. Forget about this part for now; we'll talk about these features later, they're super cool and important. Notice these directives here: these are actually devfs rule syntax. What this means is that this container requires access to some device paths: I need to unhide vmm and vmm/<name-of-the-bhyve-VM>, as well as vmm.io. For the entrypoint, instead of using a script, I'm using something really simple: I run bhyve, and these are the command-line variables for it. There's no disk here, there's no network here, just a really simple example of bhyve running a UEFI shell. And I will rebuild it, called... what did I call it? I kind of forgot; give me one second to check my notes. Okay, here we go. So I'll run it. Yup, if I run it right now... give me one second, I might have forgotten to pass the name. Ah, I know why. Let me give it a different name, because there's another bhyve already running with this name. So I'll call it eurobsdcon, and it also doesn't work. Why? Yeah, it turns out the demo god really hates me. I know, right? I really shouldn't have trolled the demo god. Is it going to work? It's still... I'm restarting the DTrace session; sorry, I mean the xc daemon, that's probably the issue... just give me one second. Okay, it's running now. What?
Okay, anyway. What it's supposed to show is a prompt asking: do you really want to add these devfs rules to devfs? So you can review the resulting devfs rules before it actually starts the container. That way the image is still able to specify what it needs, but at the same time you don't lose control over what devices you're exposing. Oh, that's just a path; it's the image name and a path, and I probably messed up the path somewhere as well, which is probably why it's not running, but five minutes ago it was running fine. It's not a jail ID; it's a devfs ruleset ID. Go ahead. You give it a range to generate from; for example, in my current configuration, if I can show you, it starts from 1000, and by default I think I allow a couple of hundred, so it doesn't try to exhaust the devfs ruleset IDs. Yep. It does actually do that, but it puts the default rules up front and then appends the other rules. The other reason you have to generate rulesets dynamically is that, depending on the workload you want to run, you can't really predict what kind of devices you'll need to expose. I have thought about adding a guard statement at the end of the devfs rules as well, meaning "definitely do not expose these device nodes," but I haven't added that yet, because let's say you really do have a container that works directly with one of the disks; you know, those edge cases. If we have time later... I hate the demo god.
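To make the ruleset review prompt concrete, a generated ruleset for the bhyve example might look roughly like this in devfs.rules syntax; the ruleset name, the number, and the exact paths are illustrative, while the `$devfsrules_*` includes are the stock ones shipped in /etc/defaults/devfs.rules:

```
[xc_bhyve_container=1001]
add include $devfsrules_hide_all
add include $devfsrules_unhide_basic
add include $devfsrules_unhide_login
add path vmm unhide
add path 'vmm/eurobsdcon' unhide
add path 'vmm.io/*' unhide
```

Starting from hide-all and selectively unhiding is what keeps an image from quietly exposing, say, every NVMe device to the container.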
So, all the things I told you to ignore: that's a feature in XC called environment variable guarding. Essentially, in a traditional container you can only provide default values for the environment variables. Think back to the MariaDB example: for the script to run, you have to create and start the container already, and then it depends on the script to tell you about your missing environment variable, and then the container gets destroyed and our time gets wasted. The idea here is: we embed the specification for the required environment variables in the image format. This is part of the reason we need a different image format than the OCI one, but the benefit it provides is just huge. And it's not just whether an environment variable is required at all, but also a good description. For example, if you try to run a container with a missing variable, instead of the script telling you what's missing and you going to GitHub to find a solution, you as the developer can provide a description of what the missing environment variable is and what it's for, so users don't have to go back to GitHub or external documentation. This was all inspired by `sysctl -d`, which I think is part of the FreeBSD culture around documentation. I have an example of this... you know what, I'll show it later. Later we have an example of running poudriere in a jail, so I'll show it then. And this... yeah, demo god... and this slide is about DTrace. Let's talk about ZFS, also a really important feature. If you're using FreeBSD, you're probably using ZFS, and what's the point of running containers if you can't take some kind of advantage of ZFS? The first feature is kind of easy to understand; it's called volume hints. Again, this is part of why I can't really reuse the OCI image format: it basically allows the developer to specify a bunch of things called
volume hints. Essentially these are hints about ZFS properties, the recommended defaults. For example, if you're shipping a Postgres container, you might have hints about the most optimized ZFS dataset properties for each volume, so you can specify all of that here. The mount point part is standard: it basically says where to mount it, whether it's required, things like that. The idea is that now you're not just shipping an unlabeled bottle; you're shipping one that says "don't take more than two doses a day" or something like that. And when you create a volume for that purpose, all these defaults apply automatically, so you're guaranteed to get the most optimal settings; of course you can override them later, but that's the point. Another thing is that the runtime also manages jailed ZFS. Normally, when you `zfs jail` a dataset into a jail, it works; but when you try to jail the same dataset into another jail, it just detaches from the previous jail and moves to the second one, which can cause some serious errors. So the runtime keeps track of the jailed-dataset allocations: if a dataset is already jailed into one container, it prevents it from being jailed into another. And I want to show the poudriere example, which has multiple things going on together: it needs jailed ZFS to work, it also kind of showcases nested jails, and also the environment variable things. The Jailfile looks like this: we install poudriere, git, and nginx, because I also want to showcase the port redirection. We need to create a directory there, because poudriere complains if you don't have it. We add an environment variable called ZFS_DATASET, basically the dataset for poudriere. Now, because I want to show how things work inside the jail, I'm not going to use an entrypoint script, but as you can imagine, once you know the value of the dataset assigned to you, you can write a script to patch the poudriere configuration file so it just works. The Jailfile also allows tons
of attributes, like nullfs and ZFS mounts and all that. We copy the nginx config over, and now here's the thing we talked about: we define a volume and hint that ZFS compression should be off for it, because the files stored there are already compressed; there's no point having ZFS try to compress them again. I have already built it, but I'll build it again: I run `xc build`, we need a network to run this build, and I call it poudriere-hero. As you can see, it's now running the commands in the builder. That said, I think the recommended way to do this kind of thing is to use Buildah and Podman, because their caching system is much better; the moment I knew those were being ported, I abandoned implementing the caching layer, that's too much work on my side. Okay, you can see here it's not actually stuck; what's happening is that it's running `zfs diff` to create a new layer for the container. And now we're done, and I've already forgotten what I named it... okay, it's poudriere-hero, so we'll try to run it. Before that, I'll create a ZFS dataset for the main poudriere stuff... no, sorry... yup, here we go. And then let's also create a volume for the distfiles; just copy the previous command, and that is... hero... and I give it the name hello-world. Now if I do `zfs get all`... you know what, maybe not `get all`, maybe `get compression`: you can see compression is set to off, and set locally, so this way you can distribute even the ZFS settings. And now we want to actually run the thing. This is the right command, but I don't want to use the right one; I want to use the wrong one. This flag basically means pass the dataset to the jail, but let's remove that. It'll be really funny if this does not work... okay, as you can see, because we don't have a volume bound to the dataset, it just refuses to run, and all of this is done without the cost of even creating a container, because all
of these checks happen at instantiation, and it also tells you what is missing and why it is missing, right? It's like: I need a dataset for poudriere. Okay, fine, I'll do that. Okay, that's funny, I cannot find the command... I'll just remove this file, because I think it's making it hard to see. I'll use another port for it, and it's not poudriere-2, it's poudriere-euro... okay, I'm in already, so creating the container is that quick. It's just sitting there because I don't have an entrypoint script here. Let's show the environment variables first, but I'm going to do it by hand... oh, that's really funny... all right, and then I need to copy my gigantic command. Basically I try to build something that's quick to build. Do I do the full port redirection? I don't care, maybe I will. And here I'm opening Chrome on the other screen; I want to redirect port 8080 to 8080. Anyway, it's probably a bad demo because it takes so long, five minutes, but because it takes so long, I can detach it again, like Docker; we can detach it, right? And then we can also demonstrate pushing an image. All right, so to push an image, let's say I want to push it to euro-demo, if I can type, to the container registry I have. As you can see, it's uploading things. The funny thing is that you can actually interrupt it at any time, because it's actually controlled by the daemon rather than by the program you're running. That also means you can have multiple users connected to the same daemon trying to pull an image, and if they're pulling the same layer, there's not going to be a race condition, because it's all managed by the daemon. And therefore I can seemingly cancel at any time, but if I attach it back, you can see the same upload is actually still running. This is kind of both a bug and a feature; it means I also need to implement a cancel command, which I may or may not. It turns out both of these operations are taking some time, but it's okay; we can, not cancel, but detach from one first and then attach to the other... yeah, it's still going; probably a really bad
idea. Yep, you basically use, in this case, a Jailfile, and you use it to build the image, and after you build it, it just copies the things in. Oh, of course, I forgot to run nginx inside this jail. I can exec in, allocate a terminal, /bin/sh... I cannot type, really... maybe `service nginx start`... sorry about that, and then port 8080... yeah, I probably need to figure out pf; my machine just restarted, so I don't have the pf rules on my laptop. But anyway, you can use `-p`, the port forwarding rule, just like you do with Docker; in fact, it supports the whole feature and more: you can even specify multiple interfaces for it to do the redirection on. And, am I in a jail? Outside? I'm in a jail, and you can see... I believe I didn't do the redirection, but anyhow, let's go back and check if the build is finished. Oh, it's done already. As you know, poudriere builds packages in jails, so if you were wondering about nested jails, that also works. I think I have two minutes left; any questions? I don't think I have time to talk about future work; mostly there are a couple of limitations. The first is DTrace: you can't really trace a container as a container from within, especially the multiple processes inside. The second limitation is really not a limitation, but all kinds of specification things, like the early checking, are just not defined in the OCI specs. So is it technically possible for them to implement that? Yes. The question is how well Linux containers are supported: it depends on FreeBSD's Linux translation layer. If something requires a system call that FreeBSD has not implemented for Linux, then obviously it won't run; but if you just have something Linux-ish, it can probably run. I think I tried Nextcloud and that works. To really verify it works 100%? I don't think that's possible, because Linux software does not understand the FreeBSD APIs and system calls. The question is about the progress and version: I think it's pretty alpha; there are still a lot of things for me to clean up. I mean, the print message is
really just Rust's println! right now. There are also a lot of user-experience things. I don't suffer from second-system syndrome, but I'm trying to push it out. If many people help, maybe a couple of months; if not many people help, maybe longer. Yeah, it's on GitHub. Any other questions? Yes, the question is: can you run multiple containers and have them interact with each other? Yes, because they really just talk to each other over the network. But unlike Docker containers, by default XC does not come with a DNS server. There is a feature in XC, but you have to use it manually: you basically put a couple of hosts in the same hosts group, and each container can belong to multiple hosts groups; it will write the name-to-IP-address mappings into the hosts file of each container. Not really sockets; they go over the network. I think not, but you can always monitor it; I haven't experimented with that yet. But that's also another future-work thing I want to do, which is about breaking the wall a little: having one container somehow magically tell the runtime, "I need to connect to this other container," things like that. One interesting thing about XC: if we go back to the XC architecture, this is actually not just a Unix socket; there's some abstraction built on top of it called channels. The daemon can actually create multiple channels, meaning multiple listening Unix sockets, and my goal is to make it multi-user friendly: each channel can have some kind of ACL attached, so you can expose part of the features to some jail. That's actually important, because that's how we'd support nested XC: you can't change the devfs rules inside a jail, so the inner instance has to have some way to talk to prison zero to deal with the devfs things. Yep. Well, thank you.
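As a footnote on the file-descriptor passing mentioned in the architecture discussion: on any Unix this is done with SCM_RIGHTS ancillary data over a Unix-domain socket. XC itself is written in Rust; this Python sketch just demonstrates the underlying OS primitive:

```python
import array
import os
import socket

def send_fd(sock: socket.socket, fd: int) -> None:
    # One byte of ordinary payload plus the descriptor as SCM_RIGHTS ancillary data.
    sock.sendmsg([b"F"], [(socket.SOL_SOCKET, socket.SCM_RIGHTS, array.array("i", [fd]))])

def recv_fd(sock: socket.socket) -> int:
    _, ancdata, _, _ = sock.recvmsg(1, socket.CMSG_SPACE(array.array("i").itemsize))
    level, ctype, data = ancdata[0]
    assert level == socket.SOL_SOCKET and ctype == socket.SCM_RIGHTS
    return array.array("i", data)[0]

a, b = socket.socketpair(socket.AF_UNIX, socket.SOCK_STREAM)
r, w = os.pipe()
send_fd(a, w)            # hand the pipe's write end across the socket
received = recv_fd(b)    # the receiver gets a working duplicate descriptor
os.write(received, b"hello")
os.close(received)
os.close(w)
msg = os.read(r, 5).decode()
print(msg)  # -> hello
```

Passing descriptors like this is what lets a daemon hand an attached terminal (or any open file) to another process without re-opening anything, which plain HTTP over a socket cannot express.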