Today I want to talk about how containers ended up at Netflix. This is a question we get really often: how do Titus containers work? How does our containerization platform differ from others? What are some interesting things we've learned while building Titus, and what are some open problems? This is really a high-level overview of the Netflix containerization platform, and we want to talk about it a little more openly and see what we can exchange with other people in the industry who are facing the same problems we are.

A little background on Titus. Development began in 2015, so relatively recently, and we open-sourced most of the platform in April of this year (2018). It's not a total container platform: we leverage off-the-shelf components like Docker and Mesos for what they're best at. But even though it's a relatively young and relatively small platform, we run on the order of 100,000 to 300,000 batch containers a day to completion, and we run a few tens of thousands of service containers a day to termination.

But back to the earlier question: how did containers actually get to Netflix? Why do we have these containers? To answer that, we have to talk a little about Netflix engineering. Netflix engineering is different in the sense that there are no top-down edicts. Nobody tells you where to run your workloads or how to run them, and our engineering teams are really small. I think we have about 1,500 engineers today; three years ago we had fewer than 700. Every engineer at Netflix is responsible for their software, the operation of that software, and the computers that software runs on.

What that actually means is that, like at most companies, every engineer has a local test and dev environment and is responsible for testing. But they're responsible for testing far more than a traditional engineer would be, because they own the entire AMI, the entire image, and the machines it runs on. So when they rebake those images, they have to vet the new kernel that goes into them, potentially a new operating system, potentially a move onto new hardware SKUs, et cetera. This means they have to understand their system end to end, because they're also responsible for setting up monitoring on those systems and capacity planning for them. And if any of this goes wrong, they're on call for it. They're responsible for patching their OS, and they're responsible for deploying new machines.

So how have we possibly made this tenable? How are engineers at Netflix getting anything done? A big part of it is shedding responsibility using the cloud. EC2 was the first iteration of this: a way to hand Amazon the undifferentiated heavy lifting, because we don't like to build things that are commodity, or effectively commodity, in the industry. The second iteration is really giving developers the best tool to choose for the job. Whether they're deploying machines on EC2, or using something like cloud functions, or a variety of other cloud services out there, we as an infrastructure team don't tell them what to do. The only guidance we give them is that software has to run on autopilot. Software can't require the engineer to give it constant care and feeding. People have to be able to take vacation.
Teams have to be able to turn systems down when they're done, not constantly feed the machine blood, sweat, and tears to keep their systems up. That means the workload has to automatically capacity plan, and it has to automatically heal from failure. There's a certain level of magic around this, because we have so many tools that have been built up over the past 10 years. And this is great for standard workloads: services, where you have a massive set of homogeneous services, or the JVM, which abstracts all of this out for you. But ad hoc batch became a challenge a couple of years ago, and this is where Titus containers came in.

What happened a couple of years ago that really made this ad hoc batch problem interesting was the whole machine learning craze. We had a lot of people coming to us with grad-student code, or random code they pulled off the internet and copied and pasted from some TensorFlow tutorial, saying, "I want to run this on a cluster." It was using brand-new libraries, and they wanted to figure out how to get that into the CentOS 6 image or the Ubuntu 14.04 image. It was just untenable, because getting those native dependencies in and getting everything working became an entire ordeal.

This is what Titus unlocked. Titus let these users run experimentation on the order of thousands of cores initially. It made it so they could build Docker containers, OCI images, in tens of seconds, versus going through an entire OS build process that took tens of minutes, and potentially hours to deploy the new machines after the build. We wanted to enable fast iteration and fast testing, as opposed to running on top of a traditional hardware platform. The other thing it did, and this was somewhat ancillary, was let them choose arbitrary resource dimensions and not have to deal with capacity planning. Part of this was an efficiency play, but the bigger part was automatically figuring out what resources they need and pulling nodes in from EC2, automatically auto scaling our EC2 clusters to get those nodes.

So how do Titus containers work? What do we do a little differently? First, let's jump into our orchestration model. We only have two long-running components on the system, and they're soft-state components. One is the Mesos agent, which is used for state tracking of what containers are running on the system; it's effectively the source of truth for what the system is supposed to be doing. The other is Docker, which is really just there to orchestrate containerd and to act as an interface. We have a bunch of other tools around Docker that our developers use for troubleshooting what's going on in the system and understanding its state.

Before I go on, just by a show of hands: how many of you are familiar with Mesos? Okay, a handful of you. All right, cool. So Mesos is a really lightweight thing. It doesn't really do anything other than say, "go run this process," under some isolation, or in our case largely no isolation. That process, in our case, is the Titus executor, and it's responsible for actually getting the workload running. We run one Titus executor per workload, per container. And the only containment we apply there is the PIDs cgroup, so that we're able to reap everything under that cgroup, and its children, when the workload dies.
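To make that concrete, here is a minimal, hypothetical sketch of the bookkeeping a PIDs cgroup enables, assuming a cgroup-v1 hierarchy mounted at the conventional path. The `titus` directory name and the helper functions are invented for illustration; this is not Titus's actual code.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"strconv"
	"strings"
	"syscall"
)

const cgroupRoot = "/sys/fs/cgroup/pids/titus" // assumed mount point and hierarchy

// enroll places a process into a per-task pids cgroup so that every
// descendant it forks is tracked by the kernel. In a real executor
// this would be the container's PID 1, not our own PID.
func enroll(taskID string, pid int) error {
	dir := filepath.Join(cgroupRoot, taskID)
	if err := os.MkdirAll(dir, 0o755); err != nil {
		return err
	}
	return os.WriteFile(filepath.Join(dir, "cgroup.procs"),
		[]byte(strconv.Itoa(pid)), 0o644)
}

// reap kills everything left in the cgroup when the workload dies,
// looping because processes may fork while we are killing them.
func reap(taskID string) error {
	procs := filepath.Join(cgroupRoot, taskID, "cgroup.procs")
	for {
		data, err := os.ReadFile(procs)
		if err != nil {
			return err
		}
		pids := strings.Fields(string(data))
		if len(pids) == 0 {
			// Nothing left alive; remove the now-empty cgroup.
			return os.Remove(filepath.Join(cgroupRoot, taskID))
		}
		for _, p := range pids {
			pid, _ := strconv.Atoi(p)
			_ = syscall.Kill(pid, syscall.SIGKILL)
		}
	}
}

func main() {
	if err := enroll("demo-task", os.Getpid()); err != nil {
		fmt.Fprintln(os.Stderr, "enroll:", err)
	}
	_ = reap // called by the executor once the workload exits
}
```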
Once the Titus executor starts, it hands off to Docker: it tells Docker to wire up a container filesystem, start runc, and start the PID 1 of that container. And we own that PID 1, using runc's (the OCI runtime's) ability to inject your own PID 1 into a container. This PID 1 connects back to the Titus executor over a Unix domain socket that we bind-mounted into the container. This is how we're able to do things like provide EC2 integration and launch a metadata proxy on behalf of the container. We're able to launch special EC2 networking extensions on behalf of the container and set up NFS mounts on behalf of the container. It also lets us do our Netflix integration bits, like our mTLS solution that drops certificates into the container filesystem, and starting metric collection for the container.

But once this happens, we have no control over the user workload, and we let the user do whatever they want in their container. This is a big deal to us, because we want to enable every workload, wherever it comes from: some random image someone has pulled from Docker Hub, something a grad student has put together, or something that's gone through a traditional workflow system and been built as a production homogeneous image.

We do provide a couple of PaaS services, platform services, and one of those is logging. That's because everybody needs logging; it's one of the few things Titus does as a service. And it comes in two parts. The first part is file log rotation, and the way we do it is very similar to traditional file log rotation mechanisms. If an application has a log file based on a time window, say January.log, it keeps writing to that file until the window rolls over and it switches to February.log. Once the application is done writing, we decide, based on some rule system, that this file hasn't been written to for a day or so, and we upload it to S3. After that, the application may create some new files, and at that point we delete the old ones. This is, again, based on heuristics: how long has the file gone unwritten, how full is the disk, and so on.
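As a rough illustration of that rule system, here is a hedged Go sketch: find log files that have gone unwritten for a day, upload them, then delete them. The `uploadToS3` helper, the path, and the threshold are stand-ins; Titus's real uploader and heuristics are more involved.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"time"
)

const staleAfter = 24 * time.Hour // "hasn't been written to for one day"

// uploadToS3 is a stand-in for a real S3 client; the actual bucket
// layout and upload mechanics are not shown in the talk.
func uploadToS3(path string) error {
	fmt.Println("uploading", path)
	return nil
}

// rotate uploads, then removes, any log file that has gone unwritten
// for longer than staleAfter. Files still being written are left alone.
func rotate(logDir string) error {
	return filepath.Walk(logDir, func(path string, info os.FileInfo, err error) error {
		if err != nil || info.IsDir() {
			return err
		}
		if time.Since(info.ModTime()) < staleAfter {
			return nil // still active; skip it
		}
		if err := uploadToS3(path); err != nil {
			return err // never delete anything we failed to upload
		}
		return os.Remove(path)
	})
}

func main() {
	if err := rotate("/logs/demo-task"); err != nil {
		fmt.Fprintln(os.Stderr, "rotate:", err)
	}
}
```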
Where this becomes more interesting is the second part: how we deal with standard I/O log rotation. Our initial approach was just using Docker logs; I think someone actually talked about this earlier, the Fluent Bit fellow. Docker has a super nice logging API where you can get standard I/O out of the container and figure out what's coming from it, and it compresses this on disk and rotates it and so on. But we found some issues with it. We found a bunch of our Docker daemons pinned at 100% CPU, and that's bad. The underlying issue was a bunch of bad actors generating on the order of 10 to 100 megabytes per second of log lines, and the fact that Docker was running a little Go process that had to read through every one of those lines and compress them was untenable for us. And we didn't really have a good isolation mechanism short of putting Docker into yet another cgroup and dealing with all the problems within Docker.

So that PID 1, one of the few things it does is intercept the standard I/O file descriptors and send them to files opened with O_APPEND. We keep a little metadata associated with those files: xattrs (extended attributes) on the filesystem that say, for this given offset, we recognized a write at this given time. That gives us a virtual file-offset mapping, similar to what you get in traditional logrotate-style systems where the application can change which file it's writing to at a given point in time. And we have abstractions for interacting with these two special files: the log uploader understands this extended-attribute format, and our log viewer service understands it too, mapping everything into an API where people can fetch logs by their virtual names, "fetch me the January log" or "the hour-one log file."

Eventually, though, we need to reclaim this space. We upload the data just as we would with traditional file rotation, and then we use fallocate to punch holes in the files. This way we don't need the application to cooperate with us to move file descriptors around, or to reopen the descriptors it's using for standard I/O. It's kind of a hack, and we're not super happy with it. We eventually want to move to something like collapse range, but unfortunately that doesn't work well for us, because collapse range requires collapsing on block-aligned boundaries, whereas people are used to alignment that's time-based or newline-based.
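Here is a small, hypothetical Go sketch of both tricks in the current approach, using golang.org/x/sys/unix. The xattr naming scheme is invented, and the real metadata format is certainly richer, but fallocate with FALLOC_FL_PUNCH_HOLE and FALLOC_FL_KEEP_SIZE is the actual kernel mechanism being described: it frees the already-uploaded bytes without touching the file descriptor the workload is still writing to.

```go
package main

import (
	"fmt"
	"os"
	"time"

	"golang.org/x/sys/unix"
)

// markOffset records "offset X was written at time T" as an extended
// attribute, the general idea behind the virtual-offset mapping.
// The attribute name is made up for this sketch.
func markOffset(f *os.File, offset int64) error {
	attr := fmt.Sprintf("user.logmark.%d", offset)
	val := []byte(time.Now().UTC().Format(time.RFC3339))
	return unix.Fsetxattr(int(f.Fd()), attr, val, 0)
}

// punch releases an already-uploaded byte range back to the filesystem.
// KEEP_SIZE preserves the file's apparent length, so later offsets and
// O_APPEND writers stay valid; no cooperation from the app is needed.
func punch(f *os.File, offset, length int64) error {
	return unix.Fallocate(int(f.Fd()),
		unix.FALLOC_FL_PUNCH_HOLE|unix.FALLOC_FL_KEEP_SIZE,
		offset, length)
}

func main() {
	f, err := os.OpenFile("/tmp/stdout.log",
		os.O_CREATE|os.O_APPEND|os.O_WRONLY, 0o644)
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		return
	}
	defer f.Close()
	_ = markOffset(f, 0)
	_ = punch(f, 0, 1<<20) // reclaim the first mebibyte after upload
}
```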
The last really interesting thing the containerizer does is networking. And our networking kind of started backwards: it started with a set of goals, rather than trying to build a solution to arbitrary problems. Those goals were given to us because everything had to work with the rest of our EC2 environment, with the rest of our VM environment. That meant we had to have VPC network security. How many of you are familiar with what a VPC is? A couple of you. At a high level, a VPC is micro-segmentation as a service in the cloud: you have a bunch of EC2 VMs, they belong to a set of security groups, and you can put firewall rules on those security groups. We had about 10 years of infrastructure built up around this, because we worked really closely with Amazon to build the VPC product, so we wanted to leverage that and not reinvent the wheel.

With that as a requirement, we also wanted to leverage VPC-routable IPs. Everything within our VPCs has to be routable, primarily to ensure we can lift and shift applications from EC2 onto the container platform without rewriting them to understand things like port bindings or IP sharing, or to go through NATs. And we had a secondary set of goals, which was to prevent the interference of mice and elephants. A lot of these machine learning workloads download a bunch of data initially and then upload a very, very tiny model, and both transfers are equally important in the eyes of the service owner, so we needed both to be reliable. The last goal was to be able to modify these security groups and move these IPs around without a human or a system administrator coming into the picture.

The tool AWS gave us is the ENI, the Elastic Network Interface. The ENI is basically a traditional Ethernet interface from the perspective of the VM, of the OS, and you get somewhere between 2 and 15 per machine. We dedicate one to the control plane, so the Mesos agent and the Docker daemon can connect back to the Titus controller, but the rest are used for our containers. They're really rudimentary interfaces. All they give us is security control: we can bind a set of security groups to an interface and say that traffic on this interface is classified and filtered by these security groups. They're not performance-isolating either; they're just raw interfaces. We can multiplex them, because we can attach secondary IPs to them in AWS, and under the hood AWS will route those secondary IPs to us. But there's a really high cost to reconfiguring an ENI, and a really high cost to reconfiguring its security groups as well.

If we used off-the-shelf networking like a lot of container daemons do, we'd have a bunch of containers talking to a bridge that goes through NAT onto one network interface. That doesn't work well: you have no public IPs, you pay the overhead of NAT, and all containers have to share the same security group. That causes a massive bin-packing problem, in the sense that you can only run one type of workload on a given machine, and it doesn't solve the QoS problem either.

If we evolve this and say each container gets its own ENI, configured with that container's security groups, with one IP per ENI, it solves the security-group problem. But it again comes at the cost of no QoS, and we can only pack about 6 to 10 containers per machine before things get wonky, because we pay the NAPI polling cost across each interface, and it's just super expensive; whereas if we multiplex, we can get something like 50 to 60 containers per machine. So we continue to evolve: we can use IPvlan, which Google added to the kernel a couple of years ago, to multiplex those ENIs and hand the secondary IPs to sub-containers, as long as they share the same security group.

But we still don't have QoS, and the reason is that each ENI is rendered as a different entity from the perspective of QoS in the kernel. So we need to tie all of these ENIs together, and the way we do that is with IFBs, intermediate functional blocks. We're basically using mirred actions to steal network packets from the ENIs, on egress and on ingress, and shove them into an intermediate functional block. Then we use BPF to classify the traffic: is it high-priority or low-priority, and which container is it coming to and from? Those classes feed into HTB classes, configured by the Titus executor, saying that each container gets, say, 10% of the bandwidth of the network interface, or 30% of the bandwidth.
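For a feel of the plumbing, here is a hedged sketch that shells out to tc from Go: redirect an ENI's traffic into an IFB with a mirred action, then shape it with HTB classes. The device names and rates are invented, only the egress direction is shown, the IFB device is assumed to already exist (ip link add ifb0 type ifb), and the real setup attaches per-container BPF classifiers rather than a default class.

```go
package main

import (
	"fmt"
	"os/exec"
)

func main() {
	cmds := [][]string{
		// Steal egress packets from the ENI and inject them into ifb0.
		{"tc", "qdisc", "add", "dev", "eth1", "root", "handle", "1:", "prio"},
		{"tc", "filter", "add", "dev", "eth1", "parent", "1:",
			"protocol", "all", "matchall",
			"action", "mirred", "egress", "redirect", "dev", "ifb0"},
		// Shape the combined traffic with HTB on the IFB.
		{"tc", "qdisc", "add", "dev", "ifb0", "root", "handle", "1:",
			"htb", "default", "30"},
		{"tc", "class", "add", "dev", "ifb0", "parent", "1:",
			"classid", "1:10", "htb", "rate", "3gbit"}, // ~30% of a 10G ENI
		{"tc", "class", "add", "dev", "ifb0", "parent", "1:",
			"classid", "1:30", "htb", "rate", "1gbit"}, // ~10% for a smaller container
	}
	for _, c := range cmds {
		if out, err := exec.Command(c[0], c[1:]...).CombinedOutput(); err != nil {
			fmt.Printf("%v: %v: %s\n", c, err, out)
			return
		}
	}
}
```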
And since we have such a high-level abstraction from the network, and the only real feedback we can give the network is drop and delay, one of the things we instituted recently, about six months ago, was ECN, explicit congestion notification. That at least gives us a way to start telling the network to back off before we have to start dropping and delaying, and it's allowing us to run interactive applications next to batch-style applications. The other trick for mice versus elephants has been FQ-CoDel, which gives us stochastic fair queuing between different flows and different containers, to keep them from clashing and creating a queue storm.

But all this comes at a very high overhead, and we've been looking at a solution that gives us the same benefits without forcing everything through a single set of queues and a single interface. If we measure the overhead, we're paying on the order of 10% in CPU cost, and it's even higher than that in throughput. But again, this is worth it to us: we get higher capacity from our machines in density, at a slight cost in total efficiency.

The last part of Titus that's really interesting is the intelligent scheduler. That's not part of the Titus executor; that's the control plane, and I'm just going to touch on it at a high level. In our scheduler today we have two tiers of scheduling: the critical tier and the flex tier. Our critical tier has a launch-time SLO: be able to launch 10,000 containers in five minutes within a given region. It's typically used for customer-facing traffic; if you use an iPhone, an Android device, a PlayStation, or an Xbox, you're going through these. Then we have our flex tier, which is what you get slotted into by default. We don't have an SLO on it, and you can be queued for an unbounded amount of time, but if you look at the P95 realistically, you get scheduled in about a minute or so. It's typically used for internal services or batch. The reason we don't have an SLO there is that we might have to auto scale and get more instances from EC2, and if EC2 doesn't have instances at that point in time, we have to wait for it to actually give us nodes. Today these are two separate pools of VMs. Eventually we want to combine them, but as Facebook talked about earlier, running diverse workloads on the same machine is hard.

The real reason we have that SLO is these things called Kongs. Kongs are what we call our multi-region failovers, and the way we handle them is that the savior region basically has to scale up to support all of the traffic the victim region was carrying. From a traffic-pattern perspective, we move one region's traffic over to an entire other region wholesale, so that region basically has to double in capacity at that point in time, if it's in a trough. The way we handle that is with two scheduling modes. Normal scheduling spreads workloads across machines, and this pre-warms our runtime: it gets images onto those machines, gets the security groups configured, gets the networking warmed up. Then at Kong time, we switch over to a different scheduling mode, one that tries to pack as many workloads of the same type as possible onto the same machine. The reason to do this is to reduce app launch time, at a slight cost in reliability.
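Here is a toy Go sketch of the difference between the two modes; the types, names, and scoring are invented for illustration, not the actual scheduler, which weighs far more signals than replica counts.

```go
package main

import "fmt"

// Machine is a toy view of an agent: how many replicas of the job
// being scheduled it already runs.
type Machine struct {
	Name     string
	Replicas int
}

// score ranks a machine for one more replica. In normal mode we spread
// (prefer machines with fewer replicas, for reliability and to pre-warm
// images and networking everywhere); in failover ("Kong") mode we pack
// (prefer machines already running the job, so launches hit warm state).
func score(m Machine, packing bool) int {
	if packing {
		return m.Replicas // higher is better: co-locate
	}
	return -m.Replicas // higher is better: spread out
}

func pick(machines []Machine, packing bool) Machine {
	best := machines[0]
	for _, m := range machines[1:] {
		if score(m, packing) > score(best, packing) {
			best = m
		}
	}
	return best
}

func main() {
	fleet := []Machine{{"a", 3}, {"b", 0}, {"c", 1}}
	fmt.Println("normal mode picks:", pick(fleet, false).Name) // "b": spread
	fmt.Println("Kong mode picks:", pick(fleet, true).Name)    // "a": pack
}
```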
Eventually we go back into non-Kong mode, back to normal scheduling, back to the region we were in, and at that point everything unwinds on its own. Our container scheduler also has insight into our containers: it knows when they've started up and when they're ready to serve traffic. So it won't keep starting workloads, or a batch of workloads, on the same machine until that first workload comes up, in order to optimize from a queuing-theory perspective, until all the machines are full or the workloads are fully scheduled.

Now, a couple of challenges we've run into. The biggest challenge over the past two years, I would say, has been zombies and reconciliation across this variety of tools. This is really about what happens when something breaks, when one of these intermediate components breaks. In a traditional environment you might have something like Docker or your kubelet be responsible for tearing down your networking primitives or your storage primitives, and things like the kubelet might be supervised. But that doesn't really address intermediate component failure: runc fails, or the Titus executor fails, or even a soft-state component like the Mesos agent. When this happens, we get a zombie state where one of these containers leaks.

To deal with this, we've had to build a reconciler. The reconciler polls all of the components on the machine, compares their state, and tries to take remedial action. And the remedial action is basically to blow things away: just shoot containers in the head until you're in a good state. This is super risky, because we could blow away our entire environment, and we don't want to do that. So we see this as an anti-pattern, one we think we can solve using things like POSIX advisory locks, self-cleaning kernel mechanisms like PID namespaces, and isolating things via cgroups to prevent arbitrary leakage. Unfortunately, we've seen a bunch of people in the container industry use things like CSI and CNI that don't work well with this. They remind me of the System V init days, where you had a startup script and a shutdown script, and if the shutdown script never got called, things got into a wonky state.
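To make that reconciliation loop concrete, here is a deliberately simplified sketch: ask Docker what's running, compare against what the control plane expects, and kill the rest. The expected set is hard-coded and the Docker CLI stands in for the real polling of Mesos and the executors; the risk described above is exactly that this loop, fed bad state, will happily destroy everything.

```go
package main

import (
	"fmt"
	"os/exec"
	"strings"
)

// expected would really come from the control plane (the Mesos agent's
// view, in Titus's case); it is hard-coded here for the sketch.
var expected = map[string]bool{
	"abc123": true,
}

// reconcile compares what Docker says is running against what the
// control plane thinks should be running, and removes anything else.
func reconcile() error {
	out, err := exec.Command("docker", "ps", "-q", "--no-trunc").Output()
	if err != nil {
		return err
	}
	for _, id := range strings.Fields(string(out)) {
		if expected[id] {
			continue
		}
		fmt.Println("zombie container, removing:", id)
		// The "shoot it in the head" remedial action.
		if err := exec.Command("docker", "rm", "-f", id).Run(); err != nil {
			return err
		}
	}
	return nil
}

func main() {
	if err := reconcile(); err != nil {
		fmt.Println("reconcile:", err)
	}
}
```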
Another big challenge we've seen is CPU isolation. CPU isolation, or CPU interference, unlocks a can of worms, because not only are workloads affected unequally, it also opens up philosophical questions. Now that we have control over our machines, because we're taking over entire machines and able to control scheduling, what about things like Intel Turbo, where EC2 had chosen the most conservative rules for us, to give all of our workloads the most reasonable speed in their view? Where we can say one workload is more important, we can give it more resources. But how do workloads signal to kernel space that they should be more important than other workloads? It turns out the Linux APIs for this have existed for some time; they just haven't evolved as the processors have evolved. We have some indicators of what's going on, like cycles per instruction or cache hit rate, but these are only proxies, and they can be misleading depending on what's happening. And we can actually see this: this is an example of the kind of latency numbers we get when we see interference in production. When we compare container CPU utilization against VM CPU utilization, on real clusters, you see much tighter grouping on VMs; containers aren't really there yet. We've had a couple of ideas here, like relocating outliers, or doing what EC2 does and limiting everyone to the same speed and requiring them to capacity plan based on that, or figuring out some way to tell the CPU to isolate these workloads in an intelligent way, but we really don't have answers here. And, you know, we're thinking of kind of giving up and telling the user: this is not how you should think about your system. Instead, think about your system in terms of instructions per second, or cycles per second, that you're getting scheduled.

The last problem we've had is syscall performance, and this is due to the current mechanisms for filtering syscalls. BPF is awesome, but it still comes at a non-zero overhead, and when people compare against VMs, they keep asking us why this overhead exists. Again, I don't have an answer here, but it's something we have to come up with a better solution for.

So in closing: containers at Netflix, where are we? We've been pretty successful in batch. Pretty much every batch workload is targeting Titus and containers as its primary way of running things. We've been pretty successful with modern microservices; things like Node.js and Ruby can target Titus as a first-class platform. But we've been having a difficult time enabling traditional microservices, things that rely on very, very strong guarantees from the underlying system, things that have no hedging. And we don't really have a story around high-performance services, or security-oriented workloads without impeding performance. So our priorities today are to figure out how to make Titus a first-class deployment platform for every workload at Netflix. Part of this is ensuring that every part of our platform works correctly in containers; we've still found situations where things don't work as expected. And instead of trying to reach parity with EC2, with our VM environment, we're trying to figure out where we can exceed it if possible. Eventually, our third target, our third priority, is to find out how we can extract hidden efficiencies.

Thank you. Any questions? Do we have time? No, this was basically for filtering. Oh, sorry. The question was whether the syscall graph was for the eBPF-filtered syscalls or for all syscalls. We took a benchmark of how long syscall times took in a benchmark application, using various filtering mechanisms on it, in our environment.
And these numbers were the overall runtime speed of the application as a whole, compared to some other mechanisms we've tried. No more questions, I think; we're out of time, but I'll be out in the hallway to talk later. Thank you.