Hello, everybody. I'm going to be talking about containerizing the operating system and how to do that efficiently. But before that, a bit about me. Who am I? I am an intern on the CoreOS team, and I'm also pursuing computer engineering at the University of Toronto, and here are my contact details. So let's start with the agenda. Why are we even containerizing the OS? This is what I'm going to answer first. Then, how do we containerize it? And the more important question: how do we do this efficiently and intelligently? And the third part is, what did I achieve from this? So let's start with which operating system I'm talking about. We're talking about CoreOS, which is a Linux distribution specialized for containerized workloads. It came about from the merger of two different projects, Container Linux and Project Atomic, which brought together two very important open-source projects, Ignition and rpm-ostree. They helped solve some of the most important requirements of being a container host: the operating system should be lightweight, configurable, up-to-date, immutable, and should work on many kinds of infrastructure. While doing this, we came across several challenges. The first one was, obviously, configuring the operating system. Once you have already provisioned your operating system onto your infrastructure, you need to think about how you make changes to its config files. That was the first problem. The second one was, how do you handle adding RPM packages, or unpackaged data, to the base operating system after the node has been provisioned? So these were the challenges we had. And solutions to these problems already existed. The first one was Ignition reprovisioning, which means you rewrite your Butane configs, change whatever you need to, and basically reprovision the entire system.
And obviously, this is slow and not the most efficient way to do things. The second one was using client-side package layering, aka `rpm-ostree install <package name>`. But what this meant is that each of the client nodes needed to run something like a script that handled the rpm-ostree install. And doing that might lead to problems, because what if one of the nodes gets into an unmanaged state? You don't want to SSH into one of the nodes and try to fix things over there. Also, there's no good way of testing it, because the customer ships out this new package without having tested it before, and if something fails, it becomes increasingly hard to manage. The solution we wanted instead was: what if we could do all of this in just one single update? And this update would be delivered not from the base operating system, but built by the customers that needed it. And also, if there were a way to do server-side layering, a customer who wanted to add a package could test it against the base operating system and then deploy it; the layering part and the building part could be done as a single operation. And thus came our need for CoreOS layering, which means you can manage your entire operating system with the help of Dockerfiles. Any configuration you need to make, any package or unpackaged data you want to add, you can basically ship out as an update to your container image. And this is the part where I showed up. Everything that I've told you till now is what I inherited from the team. This is when I actually came in as an intern, and this is the problem that I tackled: how do we containerize the operating system efficiently? Now we get to that question. But hold on to this thought for a moment; let's give ourselves a refresher on how container images are actually stored.
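To make the idea concrete, here is a minimal sketch of what a customer-side derived build can look like. The base image reference, the package name, and the config file are illustrative placeholders, not the exact ones from this talk:

```dockerfile
# Derive a customized OS image from the base CoreOS image
# (image tag and package are placeholders for illustration).
FROM quay.io/fedora/fedora-coreos:stable

# Layer an extra RPM on top of the base operating system.
RUN rpm-ostree install htop && ostree container commit

# Ship a configuration change as part of the same update.
COPY custom.conf /etc/myapp/custom.conf
```

Building and pushing this image produces the single update the talk describes: the layering and the building happen together on the server side, and every node just pulls the resulting image.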
So they're stored using this thing called overlayfs, which is a union mount filesystem. Just to give a brief overview: it consists of two kinds of layers. One is the image layers, also known as the lower directories, which are read-only, meaning they cannot be modified, and there are several of these. And there's only one container layer on top of them, which is read-write and which is ephemeral storage. Whatever changes you make to your container are made only to the container layer, and as soon as you destroy that container, all the ephemeral storage gets destroyed. What you see from the top is the union of the layers plus the changes that you made. From this, we can infer that if we could somehow divide the operating system files into these layers, we could efficiently containerize the operating system. There are several ways to divide the operating system files. The first of them is to do it arbitrarily: put random files into one layer, put random files into another. The second could be to do it using the Linux filesystem hierarchy: /boot, /var, /root, each into one of those layers. But the one that I chose was segmenting the files by their RPMs: all the files belonging to one RPM form a single block. Why did I do this? Because RPMs form the basic building block of an operating system, and if one of the files within them changes, say a binary, the entire RPM package changes. OK, so now we've tackled the first part of the problem. But how do we do this efficiently? What does efficient mean? Imagine you have an operating system with just two very simple packages, glibc and OpenSSL, and you need to containerize this. Let's say you have two different layers in the container image, and, arbitrarily, you put both of these packages in the first layer and keep the second layer empty.
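The segmentation step can be sketched in a few lines. This is a toy example: the file-to-package mapping is invented here, standing in for what a query like `rpm -qf <path>` would report on a real system:

```python
# Toy sketch: segment OS files into blocks by owning RPM.
# The file-to-package mapping is made up for illustration; on a real
# system it would come from the RPM database (e.g. `rpm -qf <path>`).
from collections import defaultdict

file_owner = {
    "/usr/lib64/libc.so.6": "glibc",
    "/usr/bin/ldd": "glibc",
    "/usr/lib64/libssl.so.3": "openssl",
    "/etc/machine-id": None,  # unpackaged file, owned by no RPM
}

blocks = defaultdict(list)
for path, pkg in file_owner.items():
    # Files owned by no RPM go into their own catch-all block.
    blocks[pkg or "<unpackaged>"].append(path)

for pkg, files in sorted(blocks.items()):
    print(pkg, files)
```

Each resulting block then moves as a unit: if any file in a package changes, the whole block is considered changed, which matches how RPM updates behave.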
An important part of this is that all of these layers are cached on the client side. But when you get an update, if glibc updates, the entire layer needs to be re-downloaded because the SHA of the entire layer changed, and OpenSSL would need to be downloaded again too. Which is not efficient. Could we do this in a better way? Obviously we can. We can split these into two different layers, and now if glibc updates, only glibc needs to be re-downloaded. So essentially, we have created a design requirement for ourselves, which is minimizing layer deltas across upgrades. Other design requirements exist as well. The second is that the algorithm needs to be deterministic. The third is that it needs to be a two-way function, meaning you can reversibly go from an operating system to a container image and back to that exact same version. And this is the most important part: when I was talking about overlayfs, how it works is using kernel mounts, and there is a hard limit (you could almost say it's a bug in the kernel) on the number of kernel mounts that you can do, which means there's a limit on the number of container layers that you can have, which is right now 128. And for people who are interested in academia, this boils down to a multi-objective bin packing problem, since there is a limited number of bins and you've got to put a lot of these packages into them. So let's start with the naive approach in our solution quest: put each RPM into one layer, the next RPM into the next layer, and keep going. But obviously this doesn't work, because we have more than 400 RPMs in the operating system, especially in Fedora CoreOS, and there are only 128 layers. It doesn't work out. So let's try a better approach and use some data. Let's use size, for example: you sort these RPMs by size, so the biggest RPMs never get re-downloaded again and again.
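The glibc/OpenSSL example above can be worked through numerically. This sketch compares the two packings and computes how much must be re-downloaded after a glibc update; the package sizes are invented for illustration:

```python
# Compare re-download cost after a glibc update under two packings.
# Sizes in MB are made up for illustration.
sizes = {"glibc": 8, "openssl": 4}

def redownload_cost(layers, changed):
    # A layer's digest (SHA) changes if any package in it changed,
    # so the whole layer must be re-downloaded.
    return sum(
        sum(sizes[p] for p in layer)
        for layer in layers
        if any(p in changed for p in layer)
    )

together = [["glibc", "openssl"], []]   # both packages in one layer
split = [["glibc"], ["openssl"]]        # one package per layer

print(redownload_cost(together, {"glibc"}))  # 12: openssl dragged along
print(redownload_cost(split, {"glibc"}))     # 8: only glibc re-pulled
```

The difference between the two numbers is exactly the layer delta the design requirement asks us to minimize.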
So when you do that, you sort them by size and put the biggest in each layer, and the last layer contains all the little small packages. But when I tried this approach, the last layer actually becomes really, really big, and if a single package worth about one MB changes, the entire 200 MB worth of layer needs to be downloaded. Again, this is not the most efficient way to do this. So why not go all the way in? Let's use all the data. Let's not even infer: give all the specifications about what the RPM is made up of to a deep learning algorithm. Why not? It could work well, because we could define our loss function as the layer delta. But the problem with this one is that it's not deterministic. It is slow and unreliable, and you don't want your package manager, rpm-ostree, to run a deep learning algorithm every single time you do a build. That's not what you want from an operating system. Hence, I came up with this approach, starting from an observation: if you have a layer which has two packages, and one of those packages is updating at a higher frequency than the other, then these two packages should not be kept together, because one of them would keep updating constantly, while the other would just sit there and keep getting re-downloaded every single time. That's not efficient, right? So what you do is just separate them out. I was talking about high frequency packages and low frequency packages, but how do you actually classify them? The way we can classify them is using simple statistics: mean and standard deviation. If a package's update frequency falls more than three standard deviations below the mean, you classify it as low frequency, and if it falls more than three standard deviations above the mean, it classifies as high frequency.
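The classification rule just described can be sketched directly. The update frequencies here are invented (one fast-moving outlier among many slow packages) purely to show the three-standard-deviation cutoffs in action:

```python
# Sketch of the frequency classification: more than three standard
# deviations above the mean is "high" frequency, more than three below
# is "low", everything else is "normal". Frequencies are invented.
from statistics import mean, stdev

freqs = {f"pkg{i}": 10 for i in range(15)}  # many slow-moving packages
freqs["kernel"] = 100                       # one fast-moving outlier

mu = mean(freqs.values())
sigma = stdev(freqs.values())

def classify(f):
    if f > mu + 3 * sigma:
        return "high"
    if f < mu - 3 * sigma:
        return "low"
    return "normal"

labels = {pkg: classify(f) for pkg, f in freqs.items()}
```

In practice the real implementation has more subtleties, as the talk notes next, but the core idea is this simple outlier test on per-package update frequency.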
Now obviously, there are many more subtleties to this approach, but this is a boiled-down diagram of what I did. Once we have developed those high frequency, high size, low frequency, low size metrics, what we see when I do a scatterplot is this: the packages form natural clusters, right? These natural clusters can be seen at high frequency, low size, and so on, and there are a lot of packages pushed into this low frequency, low size quadrant. So since we have partitioned high frequency away from low frequency, and high size from low size, what we can do is, from each of these partitions, take a few packages and put them into a layer. What we have essentially done is separate the high frequency packages from the low frequency ones. If you do this for each partition and sort the packages by frequency, then we can ensure that these frequencies never really get mixed, right? But if you keep doing that for each layer, it might go beyond the limit, the hard limit of 128. Then what we can do is just merge some of these layers, as long as they are in the same partition. That way we still avoid mixing the partitions together. The other problem that came up is that if you recompute this for every single build, it becomes highly sensitive to changes in the RPMs. So instead of computing this algorithm on every single build, what if we only ran it in the x stream, which means the major build, and in the y and z streams just took the packing structure of the major build? This way we avoid recomputing the algorithm every single time and changing the packing structure, which would make it slow. So, to summarize the approach: you first split the operating system files into different RPMs.
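The packing and merging steps above can be sketched as follows. This is a simplified model of the approach, not the actual implementation: partitions, per-layer counts, and package data are all illustrative:

```python
# Sketch of the packing step: within each (frequency, size) partition,
# sort packages by update frequency and chunk them into layers; if the
# total exceeds the mount limit, merge adjacent layers, but only within
# the same partition, so frequencies never mix across partitions.
MAX_LAYERS = 128

def pack(partitions, per_layer, max_layers=MAX_LAYERS):
    # partitions: {partition_name: [(pkg, update_frequency), ...]}
    layers = []  # list of (partition_name, [pkgs])
    for name, pkgs in partitions.items():
        ordered = [p for p, _ in sorted(pkgs, key=lambda x: x[1], reverse=True)]
        for i in range(0, len(ordered), per_layer):
            layers.append((name, ordered[i:i + per_layer]))
    # Merge adjacent same-partition layers until under the limit.
    while len(layers) > max_layers:
        merged = False
        for i in range(len(layers) - 1):
            if layers[i][0] == layers[i + 1][0]:
                name, a = layers[i]
                _, b = layers.pop(i + 1)
                layers[i] = (name, a + b)
                merged = True
                break
        if not merged:
            break  # nothing left to merge without mixing partitions
    return layers
```

Because merging only ever happens inside one partition, a high frequency package can never end up sharing a layer with a low frequency one, which is the invariant the whole design relies on.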
All the files owned by a single RPM go into one of these blocks, right? And obviously there are files that belong to no RPM, and those go into a separate block of their own. You basically divide the operating system files like this. Then you use data to segment these RPM packages into the partitions, and once you have segmented them, you can put them into layers. Simple approach. So let's talk about the results now. The results were actually good, because we found a 30% reduction in the redundant data downloaded across upgrades as compared to the previous approach, where we only used size. Here we used size and frequency, so we used more data, and it helped optimize the packing. It also fulfills the other requirements. It's deterministic, since we're not using any machine learning algorithm. It's a two-way function: we can reversibly go forward and back. And we also adhere to the maximum number of layers, which is 128. Now I want to talk about what's next. I started with the problem of bin packing, because there was a limited number of bins. But there's this new thing called composefs, which is like an extension of overlayfs, and it avoids having a large number of kernel mounts. Instead of each layer needing its own kernel mount, it basically merges them together, which is a better approach, and you wouldn't need to do all of these fancy things just to get around the kernel limitation. It also has other optimizations, like providing filesystem verification. And yeah, if you want to get involved, here are some links that you can look at. And yes, thank you. Does anybody have any questions? Yeah, good question. The question was: how many layers are we actually shipping right now?
So obviously, since this is a feature that allows the customer to use their operating system as a container image and derive their own layers from it, we don't want to occupy all 128 layers, because you still want to leave leeway for the customers to add their own packages and their own customizations. So right now we are going with 64 layers, basically half of the entire limit, so the customer still has the other 50% of the container layers left for their customizations. Next, go ahead. So the question is: if you have fewer layers in a container image, does it perform better? I saw some Reddit articles claiming that, yes, it does affect performance. But when I actually tried this, I honestly didn't see much difference, because these layers are cached on the client side. Once they're cached, you don't really need to do anything with them; any time one gets updated, the client just pulls that delta from the container registry. So as far as I know, it did not affect performance. We are out of time, so I think we can talk about this after.