Hello, hello. My name is Michael — well, my real name is kind of foreign — and today I'm going to talk about what the experience is like for us running FreeBSD containers in production, more specifically on AWS. I want to go through the tooling side of it — what the existing options are — and some fun things about FreeBSD containers versus what you have on Linux; there are some advantages and disadvantages. Me personally, I've been a FreeBSD user for about six to eight years. I've also submitted some patches. My background is in mathematics, which naturally makes me a functional programming person, with Scala and Haskell, and that also affects how I look at containers and what I think a container really should do. I also do some systems programming with C++ and Rust, mainly on FreeBSD. I'm currently working at a live streaming company in San Francisco. Basically, what we do is live streaming plus e-commerce — so it's kind of like TV shopping on steroids. It turns out the backend engineers also need to do all the DevOps stuff, and basically everything else except the front end. That means there's a lot of workload, but it also means I'm more free in picking what technology we use. And we picked FreeBSD running on AWS. Originally it was just because, okay, I'm more familiar with it, we needed to get this up super quick, so let's stick with FreeBSD. But later we did think about whether we should do Linux, whether we should do Docker, and tried out different things. It's funny, because our application is mostly written in Scala, and some of it is Go, so in terms of performance or deployment it doesn't really make a huge difference which operating system you run. It also means we don't really have the usual problems — things like dependency hell, we basically have none of that. So we were pretty happy and just decided to stay.
Plus, well, I love DTrace so much — that's another reason we stick with FreeBSD. But at some point, when you start to have more services and you start to have CI/CD pipelines, all of a sudden this becomes kind of painful. So how are we going to accelerate the CI/CD process? The idea was: containers. Because a container is this thing where you can build it and then restore its state from snapshots, so you can use existing real data from your database to do a lot of CI/CD work — you want the thing you're building to work with real data, and you want to make sure things like migrations all work. Now, this is what Google thinks a container is, right? Just library packages put together and run somewhere else. If you think about it, that sounds pretty wrong, because by this definition a chroot is a container, and by this definition a bundle with no capability of executing anything would also be a container. There's a lot missing from that. So how do I define a container? To me, a container is an environmental context, or execution context. It controls how a process in an operating system sees resources, and sees its privilege domain. One of the most obvious parts is the file system: this is how the process in this container — this environmental context — finds its shared libraries and finds its executables. But it also includes things like the network. A program usually cannot run with just files; from a functional programming point of view, a computation is kind of useless if it can never have side effects. The network is one example of state that is not in the file system but is still part of the container, right?
And SysV IPC — if you have ever run Postgres in a jail, then you know SysV IPC is something that is not exposed through the file system, yet it is absolutely part of the state of a container: that's shared memory, that's message queues. And more importantly, there are privileges — can I mount something, can I open a routing socket, can I use a raw socket? All of these things combined are, in my view, what a container should be, because — if you look at a container as a monad, from a functional programming perspective — this entire thing is a box, boxing your computation so that you can get some result out of it. Well, I kind of mentioned this before: why did we start using containers when everything was working perfectly fine? The biggest reason was CI and CD. Also, some team members could not access AWS, because they live in a country that blocks the Internet — I think you know which one. Yes. The second thing is that when you start to run services in the cloud, what you do is spin up some EC2 instances, each one running some kind of service. But that starts to get expensive. You have services that run most of the time but don't really need all their resources all the time. So the idea is: how can we make the scalability better? How can we scale up and down per service, and so on? And once you do that, it brings up another issue: if I want to use DTrace, if I want to use any kind of observability tool, with so many services running on so many servers, how are we going to deal with that? Well, this is one thing that is really good in FreeBSD but kind of lacking in Linux, surprisingly, because Linux does not really have a primitive for containers, something like jail.
On Linux, if you use bpftrace, for example, you're actually wading through a lot of garbage just trying to find out what your process is actually doing and how it interacts with the rest of the system. But when you put the thing into a jail and you run DTrace, all of a sudden — because you get the JID back — you know how the processes in this execution context are interacting with other processes, maybe even ones outside the jail on the same node, and how they interact together. Especially since we have dwatch, which is even better than plain DTrace — DTrace is amazing, you can do so many things with it, but dwatch will save you so much scripting work. The last thing is privilege management. This one is easier for us FreeBSD users to see in a container, because jail — the original paper says — was developed specifically to manage privileges and users better than a traditional Unix system, by introducing a namespace between them. We also happen to have some users — well, all of our users are actually developers — who just don't have anything at home that can resemble the development environment. So it's not just application containers that we're running; we're also running some traditional jails just to say: hey, this is your environment, don't worry about it, just build and run your code inside — you can't even see anything outside except over the network. Containers on different OSes are conceptually the same thing, but the implementation is very different, because each OS works a little bit differently.
On FreeBSD, if we go back to the definition of environmental context and execution context, it means the root file system of the jail, and the devfs ruleset — and I'm actually surprised how frequently the devfs ruleset is overlooked by jail orchestration systems, mainly because you're usually running an application or something that doesn't need devfs. But devfs is everything about the kernel interfaces and what you can access from userland. If you want to use a tap interface, if you want a pseudo-terminal, and especially if you want to run bhyve inside, you absolutely need something that can manage the devfs ruleset of the jail root. Even worse, it's kind of bad to have the devfs rulesets statically defined on the host system, because then you have this very inflexible architecture: every time you want to run a new container that requires some special devfs ruleset, you have to go back and change every host node. It kind of defeats the point. Then there are the other jail parameters — obviously, as in the Postgres case, System V message queues and shared memory — and a bunch of sysctls and parameters which determine, for example, what your jail is allowed to do with the network. Those are all part of the state of a container on FreeBSD. This is what you actually have to think about when you're building something and want to say, okay, I have a container — you need to think about every one of these. FreeBSD includes some base tools and utilities to manage jails. jail(3) is the libjail interface — libc-level routines that let you create a jail, have your process enter a jail, and get or set jail parameters by ID or by name. What most FreeBSD users and sysadmins are more familiar with is the jail(8) command. The jail command itself is really just a wrapper around libjail — and that's why you cannot actually reliably use the jail command to do your lifecycle management.
The reason: even though you have jail.conf on your system and it says what is supposed to happen on stop, the stop routine actually never runs if you don't stop the jail with the jail command. Essentially, when you run jail -r — that means remove the jail — it parses jail.conf again to see which routines it should run, and that's how it does lifecycle management. It also means that if your jail dies for whatever reason — which can happen, because the default behavior of a jail is to die as soon as there are no processes left in the container — you cannot rigorously capture the lifecycle of a jail. There are many, many existing utilities to manage jails and to try to do some orchestration on them. runj is a new one, aimed more at running OCI-compatible containers: basically you say, this is a process, run it as this UID, and this is how the image is defined. I personally haven't used it in production or really tried it, so I'm not going to include it in my next slides. Bastille, iocage and pot are currently some of the most used ones. iocage is probably a little older and more people are familiar with it; it's probably also the default recommendation in Michael Lucas's books. But iocage — and for that matter, Bastille too — is really meant to manage stateful jails. That means you use them almost like VMs, and you allow the jail to constantly change its own state — functional-programming-wise, it's like a recursive function that keeps mutating your file system, where to me a container should be more like a function of its parameters. And pot — Luca is right up there — pot is awesome because it lets you do everything you'd expect Docker to be able to do: you can build an image with ZFS, you can distribute it, and so on. Let me switch slides.
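To make the lifecycle point concrete: with the base tools, the start and stop routines live in jail.conf, and they only run when the jail(8) command drives them. A minimal, hypothetical entry (paths and names invented) might look like this:

```
# /etc/jail.conf — hypothetical entry
myapp {
    path = "/usr/local/jails/myapp";
    host.hostname = "myapp.example.com";
    exec.start = "/bin/sh /etc/rc";
    exec.stop = "/bin/sh /etc/rc.shutdown";  # only runs via `jail -r myapp`
    persist;  # keep the jail alive even with no processes inside
}
```

If the jailed processes all exit, or the jail is torn down some other way, exec.stop never fires — which is exactly the lifecycle gap described above.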
Yes, you can do all kinds of stuff — but let's go to Bastille first. The good thing about Bastille is that ZFS is totally optional; you can run it on UFS. That's kind of great, because with ZFS on AWS you have to tune the ARC down a little — you don't want the ARC consuming most of the memory. In that regard, Bastille is great. Also, a Bastillefile is basically almost a Dockerfile; it's essentially a wrapper around a shell script. You say: to get from point A to point B, what do I need to do, what kind of surgery do I need to perform on this jail so it runs a certain way? And it's very maintainable. One thing fewer people know is that you cannot use a nullfs or unionfs mount point as a jail root, right? The Bastille layout overcomes this problem by mounting the base system inside a hidden directory and then symlinking back to the base system. That helps maintenance a lot, because if you want to update the base system of a whole group of jails, you just make the change to that one base system, and because everything is symlinked, the upgrade propagates really nicely — and you use much less file system space, even without ZFS. But there are some problems. Bastille does not know about containers that you created but that are not started: if you created some container previously, and it died or you killed it, and you run bastille list again, Bastille doesn't know it's there. The way you find out is to ls Bastille's own directory and figure out which jails already exist. The jails are also state-preserving — they're still like your traditional jails — which is a poor fit for the cloud application layer, because there you actually want them to preserve as little state as possible.
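For reference, a Bastillefile reads almost like a Dockerfile, as described above. Here's a rough sketch from memory (the exact directive set should be checked against the Bastille documentation):

```
# Hypothetical Bastillefile: install and enable a web server
PKG nginx
SYSRC nginx_enable=YES
RDR tcp 80 80
SERVICE nginx start
```

Each directive is essentially a shell operation applied to the jail in order, which is what makes it feel like a Dockerfile wrapped around a shell script.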
So Bastille was kind of out of the question. The next one is iocage. The basic issue: it requires ZFS. So if I need to run it on a really small instance on EC2 or whatever, I either have to build a custom FreeBSD image with ZFS-on-root myself, or — what else can I do? — attach a volume, format it as ZFS, and create a zpool on it. But that's a tremendous amount of work and it's not really automatable, right? The templates are great, though. I really like that iocage templates are declarative, so you can parse them with standard tools and know what behavior you're going to get. And iocage knows about a stopped jail — it at least knows it's there — so you don't end up in the crazy situation of listing containers and only ever seeing the live ones, never the dead ones. Again, pot does everything, and by default it has orchestration with Nomad, so it's kind of filling the role of Kubernetes — while being a lot saner than Kubernetes. Basically, whatever you want to do, you can accomplish through pot's interfaces, which are nice. Is anyone using it in production? No? Yeah, that's always the question. But again, it requires ZFS. So which of these are we actually using? None of them. We have our own custom-built thing. ZFS on our nodes just isn't there yet — I think there's ongoing work on making ZFS faster, so maybe later it could be an option. I'm a functional programmer with a math background, so it's really important to me that a container is built from a manifest that is declarative and tells you what it is. That matters. I also want an image registry, and the way I get that is to make things as OCI-compliant as possible, so you can just use all the tooling that's already out there. And we also discovered something really funny: you can actually have jails with their root file system over NFS.
That means you can run essentially diskless nodes — but with containers. That is actually awesome; I'll come back to it later. So when I designed this interface — okay, I have this manifest thing — what should it really do? What are the correct behavior and semantics of a container? The first thing, obviously: you need a registry. That's just a must. But unlike how other tools treat containers — and especially Bastille — I don't think a container image or manifest should ever be considered trusted by default. You may be getting it from someone else; you may be getting it from your colleagues; you don't know what is actually going on in there. You don't want a Bastillefile to say "redirect" and have it just punch a port forward through PF. That sounds kind of crazy to me. For me, the operator should have the control to say: these are the policies I allow my containers to run under; these are the privilege domains I'm okay with. If a container violates that, it should not run by default. It should be overridable — you should allow the operator to override — but privileges should never be granted automatically, because images are foreign stuff: you don't know where they came from, especially when they're downloaded. So we designed an image format — I hope I can open source it in the next couple of months, after fixing some things. I wish it were written entirely in Haskell, but it's becoming Rust, because it turns out not many people understand Haskell. There are file system layers, which are the main state of any container — it doesn't matter whether it's Windows, Solaris, or Linux, the file system is always the biggest part. The nice thing is that OCI has a specification for storing layers in a compatible way, so we can reuse so many existing tools, and that's really awesome.
When we designed the manifest, the idea was that, again, the privileges and the requirements for using the container should be self-documented. It's like sysctl, right? If you run sysctl -d, it tells you what each sysctl does. If I have a Docker image, I can go read the Dockerfile and figure out how to use it — but wouldn't it be nice if it were self-describing? These are the volumes you're supposed to mount; these are the mount points and what they're for. I think that is a big requirement. And the manifest must declare privileges. If I want a jail that can run something like OpenVPN, I need the tun device. If I have a jail that I want to run bhyve inside, I need the vmm device nodes — they should be there. It should also tell you: yes, we may open these ports — and it should be up to the operator to decide whether the port forwarding gets enabled. Otherwise you're injecting some foreign thing into your system that secretly opens ports; even if it's only on the container's own assigned IP, in a production system, in a cluster, those are serious vulnerabilities. And obviously you need SysV IPC and other attributes declared too, because otherwise Postgres is not going to run. The way we manage it is that on every single host there is a host policy. The policy says: for this kind of container — maybe matched by name, maybe matched by prefix — what am I willing to allow it to do? So say you have a container whose manifest is signed by me, and I'm okay with it because I know what it is: if I'm fine with it opening port 80 and redirecting, then in the host policy I say this equivalence class of containers may do these things. The same goes for mounts and everything like that.
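As a sketch of that host-policy idea — in Python, with field names invented for illustration (our real manifest format differs) — checking a container's self-described requirements against what the host grants might look like:

```python
# Hypothetical manifest/policy check: every privilege a container asks for
# must be explicitly granted by the host policy, or the run is refused
# (unless the operator overrides).

def violations(manifest: dict, policy: dict) -> list[str]:
    """Return the privileges requested by the manifest but not granted
    by any host-policy rule matching this container's name."""
    allowed = set()
    for rule in policy.get("rules", []):
        name = manifest["name"]
        # match by exact name or by prefix, as described above
        if name == rule.get("name") or name.startswith(rule.get("prefix", "\0")):
            allowed |= set(rule.get("grant", []))
    requested = set(manifest.get("requires", []))
    return sorted(requested - allowed)

manifest = {"name": "web-frontend", "requires": ["port:80", "sysvipc"]}
policy = {"rules": [{"prefix": "web-", "grant": ["port:80"]}]}
print(violations(manifest, policy))  # -> ['sysvipc']
```

Anything left in the returned list is a contract violation: the container declared a requirement the host never agreed to.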
But if the container tries to violate that contract — say the container declares: actually, this one requires opening port 22, it's right there in the self-description — then at run time it gets caught: you want to open port 22? Let me ask the operator, right? So you provide a way for it to still be overridden. Otherwise you have an automation system that never reports any failure, and that's the worst kind of automation system. So that's the security domain we built in, but what's more interesting is probably the devfs part. For devfs, we kind of adopted an idea from firewall rule design: there are always open rules and close rules. The idea is that some things always need to be open — ptys, for example, so people can log into a user shell. You also have close rules: no matter what happens, these device nodes stay hidden; they should never be exposed to the container. Beyond that, in the container image you annotate which device nodes you need, embedded right inside. The finalized devfs ruleset is then: all the rules opened and allowed by the host, unioned with everything the image requested, and then the finalizer runs — everything the host wants to hide. The close rules are extremely important, because that's how you configure the system to say there are things you absolutely do not want to expose, no matter what. For example, you don't want to expose the block devices. Things like /dev/pf can be fine, because some jails legitimately use them — but there is just so much you actually want to hide. [Audience: what about /dev/kmem?] Yeah, exactly — well, I don't think you can actually open /dev/kmem in a jail anyway.
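The open/close idea maps fairly naturally onto FreeBSD's devfs.rules(5) syntax. A hypothetical generated ruleset for a jail whose image requested tun and vmm might look like this (ruleset name and number invented; the last line is a close rule the host always appends):

```
# /etc/devfs.rules — hypothetical generated ruleset
[container_myapp=5000]
add include $devfsrules_hide_all
add include $devfsrules_unhide_basic
add include $devfsrules_unhide_login
add path 'tun*' unhide    # open rule: requested by the image (e.g. OpenVPN)
add path vmm unhide       # open rule: requested for bhyve inside the jail
add path 'da*' hide       # close rule: never expose block devices
```

The host's open rules come first, the image's requested unhides are unioned in, and the host's close rules run last so they always win.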
But if every time a container spawns you create brand-new rules, that actually becomes an attack vector, because in FreeBSD the devfs ruleset number is only 16 bits. It's stored inside a u32, but 16 bits are used for the rule number within the ruleset, and the other 16 bits store the ruleset number. So you only have around 65K rulesets to work with, and if you keep spinning containers up over and over — by the way, I've seen machine learning clusters where people bring up a container just to run one function; it happens, it's crazy — you run out. So we have a housekeeping mechanism. It keeps a hash mapping: I hash the rules — that's how the ruleset is identified; it cannot be otherwise, because different rules would hash differently — and internally map that to a ruleset number. Whenever you bring up a new container: okay, you asked for these rules, so I go check — have I already created this ruleset? If I have, I can just apply that existing ruleset to the devfs mount of the jail. This closes the loop, because now the devfs ruleset is effectively programmable: you can write it down in the container image and distribute it, and whoever takes it can actually run it, because the devfs ruleset will be created automatically. Creating an image, I think, looks a lot like what pot does. Currently, nobody is going to create images on a UFS node on AWS — those nodes are not for development, they're for running actual workloads. So we only support creating images on ZFS-enabled nodes, and we abuse ZFS like hell. Basically, every image has some base layers, and the layers are contiguous. The idea is that we look into the dataset and ask: do I already have a snapshot of this thing?
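As a sketch of the bookkeeping behind this snapshot check and the staging that follows, here is a hypothetical Python helper (not our actual tool) that classifies zfs diff output into the entries a layer tar would need, writing .wh. whiteouts for deletions:

```python
# Hypothetical helper: turn `zfs diff` output into an OCI-style layer plan.
# `zfs diff` prints one change per line, e.g. "M\t/path" (modified),
# "+\t/path" (added), "-\t/path" (removed). Renames and the exact output
# format should be checked against zfs(8); only the simple cases are handled.
import os

def layer_entries(zfs_diff_output: str, jail_root: str) -> list[str]:
    entries = []
    for line in zfs_diff_output.strip().splitlines():
        change, path = line.split("\t")[:2]
        rel = os.path.relpath(path, jail_root)
        if change in ("+", "M"):
            entries.append(rel)             # ship the file itself
        elif change == "-":
            d, name = os.path.split(rel)    # deletion -> whiteout entry
            entries.append(os.path.join(d, ".wh." + name))
    return entries

diff = "M\t/jails/app/etc/rc.conf\n-\t/jails/app/tmp/build.log\n+\t/jails/app/usr/local/bin/app"
print(layer_entries(diff, "/jails/app"))
# -> ['etc/rc.conf', 'tmp/.wh.build.log', 'usr/local/bin/app']
```

The added and modified paths get bundled into the layer tar verbatim; the whiteout entries record deletions in the OCI layer format.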
If not, we just keep pulling and stage it. Staging really means: take a ZFS snapshot of the previous layer, then clone it so you can work with the file system; after you've made your changes, it gets snapshotted again, and you can use zfs diff to see the changes between your newly created dataset and the previous one — you get exactly what changed. You bundle that into an OCI-compatible layer file and upload it to whatever registry you want. The reason for zfs diff rather than plain diff is that zfs diff seems to be quite a lot faster than recursively walking every single file in a tree. It's also funny: zfs diff can surface some interesting things — for example, file system flags that most programming libraries aren't even aware of, like the nodump flag and that kind of stuff. I'm glad tar captures those, because an OCI layer file is really just tar — with a really horrible hack for whiteout files, but that's another story. One very funny thing, as I mentioned before: because on FreeBSD the jail is a primitive — whereas Docker on Linux is cgroups plus a bunch of other stuff — the root file system of a container can actually live somewhere on NFS. Recall that our company also does live streaming, and one part that usually costs a lot of CPU power is transcoding: say a client uploads a video, and we serve it as HLS because we're doing live streaming. So what now?
Now, you could use Amazon's services — the API is horrible and the pricing structure is extremely weird — and you can experiment with that and spend a lot of time, money, and frustration on AWS and eventually get your result back. But what's better: you can literally do the same thing with FFmpeg, and I can guarantee it is tested and works on every single system, and I know that if I switch to Microsoft, or DigitalOcean, or whoever one day, the whole thing will keep working. And the best part: you can just use the API to bring up a 32-core or 64-core EC2 instance, and literally in the user data you tell it: use this container over NFS, do this one job, then terminate yourself. Self-destruction. Now you're only paying for the CPU time to transcode that file — nothing more, nothing less. You're not even paying the latency of pulling container images, because you run them directly over NFS. It's an extremely nice cost-saving and time-saving mechanism, because it is much, much cheaper than Lambda — even cheaper than Amazon's own media-converting service. Just get that amount of CPU, get it done as soon as possible, and die as quickly as possible. We probably save a lot of money just by doing that. This is also why our auto-scaling story is much simpler: there are so many problems in auto-scaling — I have a bunch of nodes, how do I load-balance across all of them — but if you just bring up a fresh machine every single time, you don't need to worry about that. In terms of cost, you're still using the same amount of CPU time you would otherwise be using anyway, so no additional cost at all. And this is the point where my presentation self-destructed too — Keynote crashed and everything else is gone — so, any questions so far? [Audience] So you use unionfs? Yes, unionfs. [Audience] Do you have any horror stories with unionfs?
There used to be a bug where, if a binary lives on the unionfs and you try to execute it, the system panics — but very recently that bug was fixed. At the time we knew that could happen, so we only use union mounts where we have to: you always need somewhere to store some state at run time. Any other state-related directory you just nullfs-mount, and that's totally fine. [Audience] Which one — read-only or read-write? It depends; there's so much history there. We usually want it read-only, but some language runtimes somehow want to write something — I don't know what they're trying to write; I could probably debug it, but it doesn't matter: it's a container with a union mount on top, and one additional detail is that we actually clone the dataset before exporting it, so even if it's mounted read-write, there's basically no impact at all. [Moderator] Just make sure you repeat the question for the stream. Oh sure, yep. [Audience] What was the deal-breaker on Bastille for you? The deal-breaker on Bastille? First of all, the way the Bastillefile works. The Bastillefile has a lot of directives — some of them are things like LIMIT, some of them are things like RDR, redirect. Here's the thing: when I'm staging a container image, it's supposed to be an image, and an image does not set limits on itself. It's supposed to be the host that sets the limits on the container — the host decides the constraints on computing resources. The same goes for redirect: a container should not just redirect its own ports and do the port-forwarding itself. The contract should be the opposite: the host, which owns the entire privilege domain, decides what subset of it to give to this container. It should never be the other way around. For me, that breaks the semantics. Let's say you run something on TrueNAS and you download a plugin, or a Bastillefile, that says: hey, can you forward all my ports to this container? What are you going to do? You can't audit it? Well, yes, you can, because you can check the generated configuration — but I just think it's the wrong model. It also should be documented. [Audience comment about how, in the FreeBSD world, people use containers for their own infrastructure, while the rest of the world says Linux.] Oh, this is funny, because Linux had a historical problem to solve, which is that you have so many dependencies, but they're lying all over the place, so you can actually break your system in some really weird ways. Docker, in a way, was a bunch of developers saying: I cannot do this anymore — can I just ship my entire environment as one piece? But it got to the point where the container becomes like the new ELF: people run a container almost like an application. I have no complaint about that. I do see people run a container as a function, as I mentioned before, or even as a thread — that kind of freaks me out, but yeah, I'm okay with that. Oh, ten minutes left? Okay, any more questions? [Audience] I just wanted to ask about your tooling — do you want to publish it outside?
I want to open source it, but I also want to work with the base system people to get something in that can help everyone. The reason is that our thing is kind of cool, but it's designed to fit our needs. One thing I've seen on GitHub: someone made a kernel module that is really nice — when a jail dies, it immediately emits a message via devd. So no matter what tooling you use, you'll be able to capture those lifecycle events and deal with them. I think small things like that, which alone probably can't solve anyone's whole problem, would be a huge leap in terms of letting people develop tools that fit them. Because if you think about a container as a monad, the degrees of freedom are enormous, and if you develop a tool like an orchestrator to deal with those degrees of freedom, the tool has to have the same degrees of freedom, if not more — mathematically, you cannot just decrease the degrees of freedom; it's not possible; you always miss something when you make assumptions. That's also what makes orchestration so hard. So if we had some lifecycle support in the base system, that would be awesome. If we had something like a stacking file system, that would be awesome too — Linux's overlayfs is pretty nice; by the way, it works such that the layer you mount from becomes read-only, and they work around a lot of issues by doing that. It turns out that in the container world, storage actually becomes a problem sometimes, so a working layered file system would really help. Can I work with what we have? I don't have an issue. Am I going to run a jail root on it? I probably won't. So that's my take on it. [Audience] You mentioned briefly that you publish your images, so there's some sort of container registry — can you talk a little bit about what might be used? Yes. As I said, you build the layers yourself, so you can do whatever you want with them, but the layer itself is just a tar file — with the very bad whiteout-file workaround that is designed to fit how Linux overlayfs does things. So what happens is that the tar file contains every single file you changed. Obviously, if you update a file, it copies the entire thing, which is the right thing to do. But say you remove a file in the upper layer: the way to express that is to append an entry whose name is the prefix .wh. followed by the file name — that means you're supposed to delete that file at that path when extracting. So — because I'm kind of OCD — I wrote a utility in Rust that goes through the tar headers; and handling this is a little bit interesting, because tar is actually not just tar — tar is actually pax and tar combined. Basically it walks the tar blocks, and if it sees a whiteout file, it deletes the target; otherwise it just passes the underlying tar entry along and keeps extracting the distribution. And when you're creating a layer, you have to think about this too: from the zfs diff you look at which files have been deleted, and then you need to write the whiteout entries back. If I remember correctly, zfs diff is only a best-effort tool — there are corner cases where it can't tell you which file has been modified.
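Since the algorithm is described anyway, here is a minimal sketch of the extraction side in Python (my real tool is in Rust; names here are invented): entries whose base name starts with .wh. delete the target instead of being extracted.

```python
# Sketch of applying an OCI layer tar on top of an extracted lower layer:
# an entry named ".wh.<name>" means "delete <name>", everything else is
# extracted normally.
import io
import os
import shutil
import tarfile

def apply_layer(layer: tarfile.TarFile, dest: str) -> None:
    for member in layer:
        dirname, basename = os.path.split(member.name)
        if basename.startswith(".wh."):
            # whiteout entry: remove the shadowed file or directory
            target = os.path.join(dest, dirname, basename[len(".wh."):])
            if os.path.isdir(target) and not os.path.islink(target):
                shutil.rmtree(target)
            elif os.path.lexists(target):
                os.remove(target)
        else:
            layer.extract(member, dest)  # ordinary entry: pass it through

# usage: build a tiny layer in memory that deletes old.txt and adds new.txt
buf = io.BytesIO()
with tarfile.open(fileobj=buf, mode="w") as tf:
    tf.addfile(tarfile.TarInfo(".wh.old.txt"))   # whiteout for old.txt
    info = tarfile.TarInfo("new.txt")
    info.size = 2
    tf.addfile(info, io.BytesIO(b"hi"))
buf.seek(0)

os.makedirs("lower", exist_ok=True)
with open("lower/old.txt", "w") as f:
    f.write("stale")
with tarfile.open(fileobj=buf) as tf:
    apply_layer(tf, "lower")
print(sorted(os.listdir("lower")))  # -> ['new.txt']
```

A real extractor has to cope with the pax/tar mix mentioned above (long names, extended headers) and with opaque-directory whiteouts, which this sketch ignores.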