All right, guys, come on in, settle down. We're going to start. Cool. Now we have a talk from Santosh about the challenges of file system isolation at Twitter. It's a pretty exciting topic, so hopefully we'll have a lot of lessons that we can learn for production usage. Santosh is an Apache Aurora committer and an engineer at Twitter.

Thanks for the introduction. Hello, everybody. During the early part of the year, I was working on enabling file system isolation in the Mesos clusters at Twitter as a way to improve our operational efficiency. In this talk, I'm going to share my experience of the challenges that we faced and the lessons that we learned along the way.

First off, let me tell you how it all began. At the beginning of the year, our Mesos hosts were running CentOS 5, which was going end-of-life. That meant we had to upgrade the operating system on all of our hosts and move to CentOS 7. And we had to do this under a couple of constraints: zero impact to any running services, and a short window of time, less than three months. To put into perspective where the challenges actually come from, it's mostly the scale of Twitter's clusters. Our Mesos cluster runs hundreds of thousands of containers on tens of thousands of machines, belonging to thousands of different services. And we manage all of this with a handful of engineers, fewer than ten.

Let me talk about how a typical container infrastructure is usually set up. It usually involves both a container image and a container runtime, and any infrastructure that uses a container image gets file system isolation for free. There are several image formats available in the community right now, namely Docker, appc, and OCI, and their corresponding runtimes are Docker, rkt, and runc. So, just to call it out: using container images provides file system isolation.

At Twitter, we have some peculiarities in the way we set up our Mesos cluster. Those peculiarities are that we don't use a container image, and for the container runtime we use the Mesos containerizer. The Mesos containerizer is very modular and can work with or without container images. Since we don't use a container image, we essentially don't get any file system isolation for our containers.

So what is life without container images like at Twitter? A typical developer workflow involves building binaries and uploading them to our internal binary store. When you want to launch an application, the binary is fetched into the container and the application starts running. Since there is no container image in which to package dependencies, service owners get a couple of building blocks to build on: out of the box, we provide both the JVM and the Python runtime on our Mesos hosts. We know that's a limited set of choices, but Python and Java together make a very powerful platform on which you can build complex and extensive systems. As a matter of fact, pretty much all of Twitter is built on just Java and Python. What this results in is that the artifacts that get built for applications are either a jar or a pex, and these artifacts have the nice property that they're usually self-contained in terms of dependencies.
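To make the fetch-and-run workflow concrete, here is a minimal sketch of what a job definition for it might look like. Aurora configs are plain Python; the process names, binary-store URL, and resource numbers below are made up for illustration, not our production settings.

```python
# Hypothetical .aurora config: fetch a self-contained pex from the
# binary store, then run it. URL, names, and resources are illustrative.
fetch = Process(
    name = 'fetch',
    cmdline = 'curl -sfo app.pex https://binstore.example.com/my-service/latest/app.pex'
              ' && chmod +x app.pex',
)
run = Process(name = 'run', cmdline = './app.pex')

task = SequentialTask(
    processes = [fetch, run],
    resources = Resources(cpu = 1, ram = 1*GB, disk = 2*GB),
)

jobs = [Service(cluster = 'cluster1', role = 'my-team', environment = 'prod',
                name = 'my-service', task = task, instances = 3)]
```

The point is that the pex (or jar) is the whole deployable unit; nothing application-specific needs to be installed into the container beforehand.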
Now, since the platform provides both the JVM and Python out of the box, it becomes the platform owner's responsibility to maintain the Java and Python runtimes on these machines. So we can view the infrastructure at Twitter and its ownership model like this diagram: anything above the line is owned by service owners, and anything below is owned by the platform owners. When an application needs to be updated, we use Aurora to push the built artifacts, the jars or the pexes, into the containers. When it comes to the host, however, if there is a requirement to update the JVM or Python, we have a separate Puppet infrastructure that pushes the libraries onto the host and gets them installed.

Because we don't have any file system isolation in our clusters, this can lead to bad behavior from service owners, where services start depending directly on libraries that happen to be available on the Mesos hosts. The services running inside the container start leaking below the line of separation and create coupling between the host and the container, and that makes dependency management much harder.

To give an example of what really happened: we were rolling out a vulnerability fix, and we had a certain service that depended on a particular MySQL client that was available on our Mesos hosts. Rolling out the fix ended up breaking this service, and we were forced to roll back across the entire cluster because we were affecting this single service. We learned some lessons from it, namely that snowflakes are dangerous: don't let containers or applications have unique configurations. It's also really hard to test or canary changes at platform-wide scale. What this means is that in order to gain more operational agility, we needed a better dependency management system, which is to say we needed to break the coupling between the service and the host by enabling file system isolation.

In the end, we upgraded all of our hosts, upwards of 30,000 of them, which corresponded to almost 99% of our fleet. The remaining 1% were snowflake services with very tight coupling, which had to be handled as special cases. And we did all this with minimal, effectively zero, impact to any of our running services. All of this we did without even turning on file system isolation. So why didn't we turn it on?

Before we get to why, let me give you some background. First off, what is file system isolation? File system isolation is a way to create an isolated environment for each container. On Linux, it is achieved with a combination of chroot and shared subtrees. chroot is just "change root", and it helps create environments where dependencies can be self-contained. Here we have an example of a host with a couple of containers on it: as you can see, the containers have their own file system subtrees, separated from each other and from the host.
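As a rough illustration of the mechanism, here is a minimal sketch of entering an unpacked image root using a private mount namespace plus chroot. The rootfs path and the application binary are assumptions, and a real container runtime (including the Mesos `filesystem/linux` isolator) does considerably more: mount propagation, volumes, sandbox mounts, and so on.

```python
# Minimal sketch: give a process its own root file system, assuming an
# already-unpacked image at /var/image-cache/rootfs (hypothetical path).
# Requires root; illustrative only.
import ctypes
import os

CLONE_NEWNS = 0x00020000  # new mount namespace, from <sched.h>
libc = ctypes.CDLL("libc.so.6", use_errno=True)

# Detach from the host's mount tree so later mounts stay private to us.
if libc.unshare(CLONE_NEWNS) != 0:
    raise OSError(ctypes.get_errno(), "unshare(CLONE_NEWNS) failed")

rootfs = "/var/image-cache/rootfs"
os.chroot(rootfs)   # the process now sees only this subtree
os.chdir("/")
os.execv("/bin/app", ["/bin/app"])  # hand off to the application (hypothetical)
```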
So how does using file system isolation change our day-to-day workflow? Before, without file system isolation, there was no container image to fetch. Now that we're introducing one, there's an extra step before creating a container: fetching the image, since the image contains the file system that gets unpacked and mounted into the container.

We had a couple of requirements for the solution for enabling file system isolation. First, we had to ensure that our container launch time would not change. Second, we needed full adoption, which meant lifting and shifting all the services into container images, because we had to pull this off in a short window of three months.

File system isolation can be achieved at different levels. At one end, the most obvious, there is no isolation, where the container and the host share the file system. At the other end there is full isolation, where the host and the container have separate file systems. In between there is a middle ground, partial isolation, where some parts of the file system subtree are shared between the host and the container. It's worth noting that since we were lifting and shifting services, we had to package all the dependencies currently available on the Mesos hosts into the image, and that meant full file system isolation, to ensure that we completely decoupled the hosts from the services.

At this point, it's also worth noting how containers work today: when a container executes, the container runtime depends on the host kernel for execution. Why is this important? Because even if a container image contains specific kernel patches, they would not be in effect when the container is actually executing.

Next, let's look at some of the technical details of the choices we made for enabling full isolation. Since we had containers that were leaking into the host and coupled to the host itself, we had to package every possible dependency available on the host and create an image that would satisfy every service. This meant creating a really big, fat image, and this image turned out to be close to five gigs in size.

Now, looking back at our requirement that we not change container launch time: since we have this extra step of fetching the image, we had to look at the options Mesos provides for fetching images. The Mesos agent provides a couple of ways of fetching an image before launching a container. The first is the registry puller, which can talk to any container registry, a Docker registry for example, and fetch the image. The other is the local puller, which just reads the container image from the local host itself. Using the registry would have meant fetching the five-gig image on demand when the container was being launched, and that would definitely affect our container launch time. So we had to make sure the image was already on the host before we started launching, and the local puller was the option that best suited us. So we needed a way to prefetch the image.
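For reference, this is roughly what the agent-side setup for the local puller looks like. These are real Mesos agent flags, but the paths here are assumptions rather than our production values; pointing `--docker_registry` at a local directory makes the agent read image archives from disk instead of a remote registry.

```
# Sketch of a Mesos agent configured for the local puller (paths assumed):
mesos-agent \
  --containerizers=mesos \
  --image_providers=docker \
  --isolation=filesystem/linux,docker/runtime \
  --docker_registry=/var/lib/image-cache \
  --docker_store_dir=/var/lib/mesos/store/docker
```

The open question then is how the image archive gets into that local directory on every host in the first place, which is the distribution problem.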
Since we had to prefetch the image, we needed to think about distributing this image, close to five gigs in size, to upwards of 30,000 hosts. We looked at the first option, our default: the binary store that hosts the binaries created as part of CI. It works just fine for pushing application binaries to thousands of containers, but it was not designed for pushing gigs of data to tens of thousands of nodes. Next, we looked at CernVM-FS, a distributed file system project from CERN in Switzerland. It would have had the scale, but it did not have the ability to prefetch: it is a heavily cached distribution service, which means a file is actually fetched on demand when you try to access it. The implication is that container execution would have unpredictable performance, since fetching files on demand is totally reliant on the network's performance at that point. The third and obvious option was to use the appropriate registry; since we had picked the Docker image format, that would have been a Docker registry. But it did not allow us to prefetch either, so we had the same problem of lengthening container launch time.

So we ended up rolling our own distribution mechanism based on the BitTorrent protocol, with slight changes to make it more efficient by avoiding unnecessary off-rack traffic: the protocol prefers to fetch pieces of files from peers in the same rack. It's worth calling out that if we had had an object or block store, it would have totally worked and we would have used it; but since Twitter uses a private cloud, we don't have those options.

So we built a BitTorrent-based distribution system, and its installation looks roughly like this. The binary store hosts the built file system image. Then there is a layer of peers called the seeders, which are responsible for downloading the image and making it available for the Mesos hosts to download. This was necessary since the binary store did not have enough throughput to serve the large number of Mesos hosts we have. The last layer is the leechers, peers that talk to the seeders. As you can see, the modified BitTorrent protocol prefers fetching from a seeder in the local rack, and only if there is no such seeder in the same rack does it go off-rack.

So we built it, tried distributing our first image, and had a bad situation. The torrent traffic ended up overwhelming the host NICs and blocked all traffic going off the hosts, which also blocked the heartbeats from the agent to the master, which led to lost slaves and lost tasks. So we had to restrict the resource usage of the BitTorrent peers.

Some of the challenges we faced when we actually tried to isolate them: isolating the seeders was really easy, because these were essentially Aurora jobs running inside containers, and container isolation automatically took care of restricting their resource usage. The leechers, on the other hand, were not running inside a container. We had to do it that way because the leechers needed root privileges to access the Mesos agent's image cache, so the leecher ended up being a daemon running alongside the Mesos agent on every Mesos host. That meant we had to manually isolate resource usage for the leechers. Isolating CPU, memory, and disk was easy; it was just a matter of setting up the appropriate cgroups.
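To show what "just setting up cgroups" amounts to, here is a minimal sketch of confining a daemon by writing to the cgroup v1 filesystem directly, as you would find it on a CentOS 7 host. The group name, limits, and pid are made up, and a production setup would more likely go through systemd or a cgroup library than raw sysfs writes.

```python
# Sketch: cap a daemon's CPU and memory with cgroup v1 (paths as on CentOS 7).
import os

CG = "/sys/fs/cgroup"
name = "leecher"  # hypothetical cgroup name

def write(path, value):
    with open(path, "w") as f:
        f.write(str(value))

# Cap CPU to roughly one core: quota/period are in microseconds.
os.makedirs(f"{CG}/cpu/{name}", exist_ok=True)
write(f"{CG}/cpu/{name}/cpu.cfs_period_us", 100_000)
write(f"{CG}/cpu/{name}/cpu.cfs_quota_us", 100_000)

# Cap memory to 1 GiB.
os.makedirs(f"{CG}/memory/{name}", exist_ok=True)
write(f"{CG}/memory/{name}/memory.limit_in_bytes", 1 << 30)

# Move the running daemon (pid assumed known) into both groups.
pid = 12345  # placeholder
write(f"{CG}/cpu/{name}/tasks", pid)
write(f"{CG}/memory/{name}/tasks", pid)
```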
The network, however, was not so easy. That's because Twitter uses a non-standard network isolator, the port mapping isolator in Mesos, which is complicated and hard to change. Let me talk a little about how the port mapping isolator actually works. It divides the range of ports available on a host and assigns them to containers. Each container then gets its own network namespace, and it also gets a virtual ethernet pair, where one end of the pair is pushed into the network namespace and the other end stays on the host. Once that is done, appropriate routing is set up from the host ethernet to the container's virtual ethernet, so that any traffic destined for a particular port range is routed to that container, and vice versa. It's worth noting that we also install a rate limiter, a hierarchical token bucket (HTB), on the container's virtual ethernet to limit egress traffic. This makes sure containers have isolated network access and we don't get starvation.

Now we had to add network isolation for the leecher, which meant doing the same kind of setup for the leecher by hand: creating its own network namespace, a virtual ethernet pair, and its own HTB rate limit. The point worth noting here is the extra tc filter we install on the leecher's host-side virtual ethernet: the tc police filter, which we use to rate-limit traffic coming into the leecher as well. The reason is that a misbehaving torrent peer could start sending traffic at the leecher, flooding the ethernet on the host and browning out the host altogether, which would affect the rest of the containers. We don't do this on our production containers, since those are tier-one and tier-zero services: tc police is very aggressive, dropping packets and forcing TCP to reduce its window size. Doing it for a tier-two service that is only used for distributing images seemed appropriate, since we wanted to minimize any effect on running production containers.
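Here is a rough sketch of that by-hand setup, driving the standard iproute2 and tc tools from Python. The interface names, rates, and burst values are made up, and the real isolator also programs the port-range routing, which is omitted here.

```python
# Hand-rolled network isolation for the leecher daemon: its own network
# namespace, a veth pair, HTB shaping, and a tc police filter that
# hard-drops excess traffic. Names and rates are illustrative; run as root.
import subprocess

def sh(cmd: str) -> None:
    subprocess.run(cmd.split(), check=True)

sh("ip netns add leecher")                                  # dedicated namespace
sh("ip link add veth-host type veth peer name veth-peer")   # virtual ethernet pair
sh("ip link set veth-peer netns leecher")                   # one end goes inside

# Hierarchical token bucket on the host-side veth to shape traffic.
sh("tc qdisc add dev veth-host root handle 1: htb default 10")
sh("tc class add dev veth-host parent 1: classid 1:10 htb rate 200mbit")

# tc police filter: drop anything above the rate outright, so a misbehaving
# torrent peer can't brown out the host NIC. Aggressive (drops packets and
# shrinks TCP windows), hence acceptable only for this tier-two service.
sh("tc filter add dev veth-host parent 1: protocol ip u32 match u32 0 0 "
   "police rate 200mbit burst 1m drop flowid 1:10")
```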
So now we had a working distribution mechanism. Before we could take it to production, we wanted to have a good story around versioning these images. We looked at it and came up with a couple of requirements. First, we had to have full adoption, which meant using the big, fat image we created before. We also had to be able to maintain multiple versions of this image, in case we ever ran into a vulnerability; we needed multiple versions because it usually takes service owners a while to update their services. And we had to make sure we didn't break running services.

The challenges in this part were quite bad, in that we had certain hardware profiles that were not really designed for this use case, and these turned out to be a significant portion of our fleet, practically old hardware that we were still continuing to run. These hosts had barely 100-gig disks, and even after carving 10% of those disks away from container usage, we could only support a bare minimum of two versions. So we tried a bunch of things to reduce the image size, and we had picked Docker as our image format.

Generally, most of the container image formats out there support layering, and the layers are content-addressable as well, so it is possible to dedupe layers that are shared across different images. So we looked at that. However, it turned out that due to the way Docker image layering works, our images tended to explode in size due to the presence of whiteout files. These are marker files that get created when a file installed in one layer is deleted in a subsequent layer. So we could have multiple versions of the same file in the container image, even though when we actually unpack the image and recreate the file system, the file may not even be present in the eventual file system that gets created. We also found that the Dockerfile command sequence we had to use to get the maximum amount of deduplication was quite tricky, and we had to tweak the Dockerfile quite a bit to get the deduplication we wanted. This led to a really unpredictable build process, which was not favorable.
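To make the whiteout problem concrete, here is a toy sketch that walks a stack of layer tarballs and reports bytes that are shipped in the image but deleted by a later layer. It assumes the AUFS-style `.wh.` whiteout naming used in the Docker image format; the function and file names are hypothetical, and it ignores corner cases like files re-added after a deletion.

```python
# Toy scan for "dead weight" in a layered image: files whose bytes ship in
# some layer but are whited out by a later layer, so they never appear in
# the final unpacked file system.
import os
import tarfile

def wasted_bytes(layer_tars):
    """layer_tars is ordered base layer first."""
    sizes = {}        # path -> bytes carried in the image
    whiteouts = set() # paths recorded as deleted
    for path in layer_tars:
        with tarfile.open(path) as tar:
            for member in tar.getmembers():
                base = os.path.basename(member.name)
                if base.startswith(".wh."):
                    # AUFS-style marker: records deletion of the named file
                    whiteouts.add(os.path.join(os.path.dirname(member.name), base[4:]))
                elif member.isfile():
                    sizes[member.name] = member.size
    return {name: size for name, size in sizes.items() if name in whiteouts}

# e.g. wasted = wasted_bytes(["layer0.tar", "layer1.tar"])  # hypothetical files
```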
So essentially, the takeaway is that full file system isolation at scale is really hard, precisely because of the scale part.

Going back to the question: how did we actually pull off the upgrade? It turns out our peculiar infrastructure had certain advantages, and it comes down to these two points. First, the jars and pexes we used as our artifact formats stood in for the isolation, since they are self-contained in terms of dependencies. Second, since we had a very opinionated infrastructure, which limited the number of options we provided to our customers, we thereby limited the coupling that could happen between services and the host. That meant that even if a Python library had native bindings to a particular host OS, by fixing the problem for one service we essentially fixed it for the rest of the fleet as well.

On the whole, we compared the different levels of file system isolation along several dimensions. First, container launch delay, the extra time needed to launch our containers: if you don't have any image at all, that's the best case. Then service debugging, where with isolation you get the same file system environment in development that would be used in production. Service agility and operational agility are really high with full isolation, though partial isolation still makes it work. Then vulnerability updates, where we have to roll out fixes: having no isolation means you can roll out a fix to all services at once, but with the problem of not being able to properly canary changes. When it comes to the number of images, if we were to let service owners create their own images under full isolation, we would end up with huge images, essentially limiting the number of images the system can support at any point in time; here a partially isolated, application-specific image would be more appropriate. And lastly, having service owners own their own dependencies, for which full isolation is the best scenario, though it comes with the drawback of bloated image sizes. So even here, partial isolation seems to be the middle ground.

So the key takeaway that I want to reinforce is that partial file system isolation has the best of all worlds, and full isolation seems to be the hardest. Thank you.

I was wondering how many container base images you are deploying, because it sounded like you put quite a lot of effort into prefetching. Do you change those images a lot, or are you just planning for the future? Because if you have one image, the prefetch impact perhaps matters less: once you've got one task running on each machine, it's there, and it's going to stay there for quite a long time.

So, like I mentioned, we provide both the JVM and Python runtimes directly out of the box, so we have to maintain them. And we do have multiple versions of the JVM that we provide at any point in time, because the JVM is also owned by a separate team, and they have canary versions and other versions; similarly for Python. So we have multiple versions of these libraries, which essentially bloats up the image.

Any other questions?

I'm just curious, how long does it take to prefetch an image?

We were able to push a five-gig image in less than two hours.

Less than two hours to the entire cluster?

To the entire cluster.

OK, thanks.

One thing I noticed that I didn't see mentioned, maybe I missed it, that I'm wondering about: with this partial isolation approach that you were favoring, if I'm reading it correctly, do you have some other means in your peculiar infrastructure for managing, say, several versions of one microservice on a box, letting you leapfrog to others more easily with partial isolation? Or is that not a concern for you?

I don't understand the second part.

File system isolation: one of the reasons my team has been investigating file system isolation is as part of a cohesive security boundary between multi-tenant applications owned by different service owners, which might have different security stances because of their target audience or maturity or whatever. So how do you guys handle that conflict, with the partial isolation approach?

With partial isolation, it's very similar to application containers: anything specific to the application gets bundled into the container image, and anything provided by the platform is then mounted, shared, or laid on top of it. So it's completely up to the service owners to package things whichever way they want, and they have to deal with it; it's not dependent on the platform itself.

Any other questions? Cool. Thank you. Thanks, Santosh.