Hi, welcome to the Sidecars at Netflix presentation. I'm Rodrigo, a software engineer at Kinvolk. A little bit about me: I studied computer science in Argentina, and I'm a maintainer of MetalLB and a core Kubernetes reviewer. As I said, I work at Kinvolk. We are focused on developing open source software. You may know some of our projects, like Flatcar Container Linux, a container-optimized OS; Headlamp, a Kubernetes web UI; or Lokomotive, a fully self-hosted Kubernetes distribution. Today I'll be presenting with Manus, a software engineer at Netflix. He works on the compute infrastructure team on container execution, and we worked together on sidecar ordering in the kubelet. The talk is divided in two parts. First, Manus will share how Netflix uses sidecar containers, and in the second part I'll talk about the problems we face today with sidecars and the efforts we made to improve this situation in Kubernetes upstream. Hi, I'm Manus. Before we dive into sidecars, here is some background. At Netflix, our fleet is mostly composed of VMs, and Titus, our container platform, is gaining organic adoption. We are now working to make Titus the primary deployment target, to give developers one well-supported infrastructure that allows them to iterate fast, and that is also more scalable and more efficient at the same time. The goal is to do this migration fast and to automate as much as we can, so hundreds of platform engineers are not supporting two deployment targets for a very long time. To get a sense of what we are migrating, here is what a typical VM runs at Netflix: the application, typically running in the JVM, and a series of daemons providing service discovery, security and observability, to name a few things.
Application developers expect these daemons to just work, so we run them in an operationally sensitive fashion, by which I mean we start and shut them down in a specific order, restart them when they fail, and they are generally available to the application. As an example, the metrics forwarder starts up before the agent which bootstraps certificates. This is important: consider what happens when that agent starts failing. In our case, it can publish a metric to report these failures, and we can use our monitoring and alerting systems to track and remediate them. In the same way, the service mesh expects TLS certificates and the open policy agent to be available before it starts up. Otherwise, every application has to decide if the mesh needs to fail open or short-circuit its start. To do this we run daemons as systemd units and use its DSL to get the ordering and restart policies we want. We also control the shutdown sequence with systemd, so a node is taken out of service discovery and connections get drained before we shut it down. Now, here is what a Titus host instance is running. The Virtual Kubelet launches an instance of the Titus executor per container, and because Titus is multi-tenant, independent containers in different AWS accounts, with different security groups and IAM roles, run on a shared instance. We provide the functionality the daemons provide on VMs using what we call system services, alongside a few extra services we need; for example, we intercept the instance metadata service on Titus. Finally, we have some software we use to manage the fleet. It might seem that the best way to run these services is to build containers that look like VMs, but this means that we have to rebuild and redeploy containers when we want to replace system services. This impacts how fast we can react to security advisories or deploy hotfixes to these services, and we see these problems on VMs.
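To make the ordering idea above concrete, here is a minimal sketch in Go of computing a dependency-respecting launch order, the ordering half of what systemd's After=/Requires= DSL gives us. The daemon names and dependency edges are illustrative assumptions, not Netflix's real units.

```go
package main

import "fmt"

// startOrder returns a launch order in which every daemon starts only
// after all of its dependencies, via a depth-first topological sort.
// It reports an error if the dependency graph contains a cycle.
func startOrder(deps map[string][]string) ([]string, error) {
	const (white, gray, black = 0, 1, 2) // unvisited / in progress / done
	state := map[string]int{}
	var order []string
	var visit func(n string) error
	visit = func(n string) error {
		switch state[n] {
		case gray:
			return fmt.Errorf("dependency cycle at %q", n)
		case black:
			return nil
		}
		state[n] = gray
		for _, d := range deps[n] {
			if err := visit(d); err != nil {
				return err
			}
		}
		state[n] = black
		order = append(order, n)
		return nil
	}
	for n := range deps {
		if err := visit(n); err != nil {
			return nil, err
		}
	}
	return order, nil
}

func main() {
	// Hypothetical daemons: the metrics forwarder must precede the
	// cert agent, which the service mesh in turn depends on.
	deps := map[string][]string{
		"metrics-forwarder": {},
		"cert-agent":        {"metrics-forwarder"},
		"policy-agent":      {},
		"service-mesh":      {"cert-agent", "policy-agent"},
	}
	order, err := startOrder(deps)
	if err != nil {
		panic(err)
	}
	fmt.Println(order)
}
```

In the real setup, systemd also handles restart-on-failure and the reverse ordering on shutdown, which this sketch leaves out.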
Secondly, it is not possible for developers to pull a container from Docker Hub and expect it to work on Titus in this setup, and we want our developers to have the ability to experiment. Also, we cannot intercept the instance metadata service or mount remote storage securely in this model. So perhaps we should run multi-tenant services on the host. This has no ordering issues, and we might optimize some resource usage. We started by running a multi-tenant instance metadata service as an example, but we learned very quickly that users can accidentally deploy workloads which query it in a tight loop. This starves other containers on the host, and if that loop is leaking connections, then our service runs out of file descriptors and now we have a bad host. Another example is a log viewer which used to be multi-tenant, and developers realized that the log viewer is a performant HTTP server. So why write your own, when you can serve large files by writing them to the log volume? As a platform, we want to be resilient to such resource-sharing issues. The second problem is that we have workloads that expect to not be disrupted once they are running. This means all our system services would need the ability to hot reload, and to have perfect tenant isolation, if these workloads are going to meet their SLAs. There are some more serious challenges when it comes to securing these multi-tenant services. Consider our multi-tenant log viewer, which served logs for all containers, and someone figured out that relative paths work. We also allowed users to configure the metrics agent using a file in a volume, and the red team provided a handcrafted executable instead of a CSV. Now we have compromised the host instead of a container, and this is not acceptable. We feel it is difficult to anticipate all these issues and get perfect isolation and fault domains with multi-tenant services. So we write single-tenant services and contain them with the workload.
To that end, we have settled on a design where we give every container its own set of immutable system services, and usually this guarantees that once a container is set running, it will keep running. We tie the lifetime of the service to the container, so the service dies when the container dies, and the service also shares the identity, fault domains and resources of the container. This means, for example, that the application and service mesh share a cryptographic identity, namespaces and cgroups, which is almost always what our users expect. So how does the Titus executor run system services? It starts by creating the container using the image specified by the user, and then it bind-mounts data containers, which package the system services, into it. These are rootless containers in our system, and that has its pros and cons. It also bind-mounts a socket which we use to control the container, and a few volumes, and it does all of this in parallel. Next, it runs the container with the entrypoint set to Tini, which immediately blocks. This creates the namespaces and cgroups in which the system services run. Tini is a tiny init for containers which is used by Docker, and we use our own fork. Now the executor can set system services running with the correct ordering and restart policies, and we do this using systemd. These services are actually parametrized units which are managed by the systemd running on the host. The information about which container to join is passed in files on the host by the Titus executor, and finally we simply use runc to launch a process running inside the container. Now the service is running in the namespaces and cgroup of the container, which means that the kernel will reap the service when the container dies. Once the sidecars are running, we release Tini using the socket, and it runs the entrypoint as a child.
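The "which container to join" handoff can be pictured as building an argv that enters the container's namespaces before launching the service. The real Titus executor uses runc and parametrized systemd units for this; the nsenter-based command below is a simplified stand-in for illustration, and the service path is a made-up example.

```go
package main

import "fmt"

// buildJoinCmd returns the argv we could use to launch a system service
// inside the namespaces of the container whose init process has the
// given PID. This sketches the idea only; Titus actually drives runc
// via files written on the host for parametrized systemd units.
func buildJoinCmd(initPID int, service string, args ...string) []string {
	argv := []string{
		"nsenter",
		fmt.Sprintf("--target=%d", initPID),
		// Join the container's mount, UTS, IPC, network and PID
		// namespaces, so the kernel reaps the service with the container.
		"--mount", "--uts", "--ipc", "--net", "--pid",
		"--", service,
	}
	return append(argv, args...)
}

func main() {
	// Hypothetical PID and service path, for demonstration only.
	fmt.Println(buildJoinCmd(4321, "/titus/sidecars/metrics-forwarder", "--verbose"))
}
```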
Tini remains PID 1 except in special cases, and we use it to redirect standard output and error, and also to do more advanced things, with seccomp notify as an example. Since Tini is not really optional in the way we use it, Titus can avoid the overhead of a separate pause container. We just follow the reverse sequence when shutting down a container, so services are always available. We don't have to worry about an application starting before the metadata service starts and crashing, or the metadata service stopping before the application and causing the application to crash. I should point out that there are some special services, like SSH, where we don't use runc. There used to be a time when we ran docker exec to get a shell in the container, but that is insecure. Now we use a custom injector. This injector is a small C program. Again, we pass information about which namespaces to join via the environment, and the injector makes sure that it wears the correct AppArmor hat, via change_hat, before it enters the container and launches the SSH server. With this work we can use a real SSH server with all the bells and whistles, including the ability to scp files to the instance, and that is nice. We do some similarly advanced injection for mounting remote storage and intercepting the metadata service, and this way we can avoid running privileged containers or using iptables connection tracking. You can find more details in the Titus executor sources, which are open source. That brings us to how we version and upgrade the system services. For most workloads, the executor reads what the current stable version of the sidecars is from a centralized store, and this happens at every container launch. This makes upgrading sidecars fast and simple: we change the configuration in the store and trigger a redeploy if required. Some jobs have more exacting SLAs, and they specify an exact version for a specific service. For example, applications in the streaming path might pin the service mesh.
In this case, all containers in that Titus job get the pinned version. Some services are opt-in, and there is a parameter on the Titus job which says if a container gets that particular service or not. But broadly, it is an open problem for us how we go about managing and upgrading these services as we adopt a growing number of them to enable our developers. The challenge in this space is pretty big, and it's probably worthy of a discussion of its own. In summary, system services are pretty easy to implement using rootless containers. We get the benefit of being able to share mount and PID namespaces, so when a user logs into a container they see a process tree and filesystem layout that makes sense. But we can only have one user container in our model, and we want to have more user containers to move towards more composable workloads. It is not trivial to create these services in C or Go. It is not so easy to build a relocatable, statically linked SSH server, and currently we expect our partners to maintain these stacks alongside us if they want to run system services. Lastly, the debugging experience for these containers is terrible. You can't ship extra tools with these containers, and debug symbols blew up their size, which can make them the long pole in container start times. So what has this got to do with the kubelet? Titus started life as a Mesos cluster, but starting in 2019 we moved to the Virtual Kubelet. My colleagues talked about this at KubeCon San Diego; you can look at that talk. Over the past year, we have been adopting more Kubernetes components in our control plane, like kube-scheduler, CRDs and controllers which do fleet management and usage-based scheduling, as an example. So in 2020 we decided that we would try to migrate to the kubelet. Then we can run multiple user containers and get the benefits of the Kubernetes ecosystem, but we ran into some problems. First, we need startup and shutdown ordering to run pods in our model.
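The version-resolution policy described earlier, where a per-job pin wins and the fleet-wide stable version from the centralized store is the fallback, can be sketched like this. The map-based "store" and the service names and versions are illustrative assumptions.

```go
package main

import "fmt"

// resolveVersion picks which build of a system service a container gets:
// an explicit per-job pin wins; otherwise we fall back to the current
// fleet-wide stable version read from the centralized store at launch.
func resolveVersion(service string, stable, pins map[string]string) (string, error) {
	if v, ok := pins[service]; ok {
		return v, nil // exacting SLA: the job pinned an exact version
	}
	if v, ok := stable[service]; ok {
		return v, nil // default: current stable from the store
	}
	return "", fmt.Errorf("no version known for service %q", service)
}

func main() {
	stable := map[string]string{"service-mesh": "1.9.3", "log-viewer": "0.7.1"}
	pins := map[string]string{"service-mesh": "1.8.0"} // e.g. a streaming-path job
	v, _ := resolveVersion("service-mesh", stable, pins)
	fmt.Println("service-mesh:", v)
}
```

Because resolution happens at every container launch, flipping the stable version in the store rolls a sidecar upgrade out to new containers without rebuilding any images.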
We have Tini and we know how to write to the unix socket, so naturally we thought of building a coordinator using a webhook, and you get this contraption. This works, but we are not enthused about rewriting the entrypoints of all containers when using the kubelet. We also looked around and found KEP 753, and decided we would work with Rata to see how far we can get with two-phase ordering. Now, if we had that feature, some agents, like the log viewer and aggregators or the open policy agent, could start up before the application if we mark them as sidecars. You can even make the cryptographic bootstrap work by splitting it into an init container that generates the certificates and a sidecar that refreshes them. And then there are some system services, like storage, which we can implement using CSI, and metadata, which we can implement using CNI, if we are willing to tweak our resource isolation a little bit. But we cannot make the metrics forwarder start first, or shut a pod down with connection draining, until we have made some more progress on ordering. We will probably maintain some highly sophisticated injectors outside the kubelet, and that is fine, but we don't want to build a contraption outside the kubelet, or run an internal fork and risk maintaining it in perpetuity if the community pursues a different direction. In summary, we run sidecars in an operationally sensitive fashion, and letting them crash is not an option for us. We also run an aggressive security posture using user namespaces, seccomp and AppArmor, and there have been cases where one of these was the final barrier between us and an attacker, so we cannot compromise here. Finally, we like our single-tenant services: they are easy to write and run reliably. I want to wrap up by saying that we want to move towards a more composable pod, with sidecars and multiple user containers, and we could use the kubelet to get there if it had ordering and user namespace support. So now, over to Rata for more about the kubelet.
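The ordering requirement from this first half, sidecars up before the main containers and torn down after them, can be summarized as a small simulation. The two-phase shape follows the KEP 753 idea; the container names and callback are assumptions, not kubelet code.

```go
package main

import "fmt"

// runPod simulates two-phase pod lifecycle ordering: sidecars start
// before the main containers, and on shutdown the main containers stop
// first, so sidecars (mesh, metadata, metrics) stay available to them.
func runPod(sidecars, mains []string, log func(string)) {
	for _, c := range sidecars {
		log("start " + c)
	}
	for _, c := range mains {
		log("start " + c)
	}
	// Shutdown runs in reverse: mains first, then sidecars.
	for i := len(mains) - 1; i >= 0; i-- {
		log("stop " + mains[i])
	}
	for i := len(sidecars) - 1; i >= 0; i-- {
		log("stop " + sidecars[i])
	}
}

func main() {
	runPod([]string{"service-mesh"}, []string{"app"}, func(s string) { fmt.Println(s) })
}
```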
Hi. I'll now explain the problems that we have with sidecar containers in Kubernetes today, and the efforts we made in Kubernetes upstream to improve this. Before we start, let me say that Kubernetes knows nothing about sidecar containers today; Kubernetes treats all containers in the same way. The distinction between sidecar and main containers is useful only for us, at least for now. So let's see some problems we have today, given that Kubernetes doesn't really know anything about sidecars. I'll use a service mesh for several examples, but the problems are not limited to service meshes; they affect the vast majority of sidecar containers. One problem that we face today is that there are no startup order guarantees. The main container can be started before the sidecars, and when that happens, the window between when the main container starts and when the sidecar is ready is where the problems usually arise. For example, if the sidecar is a service mesh, all network connections will be black-holed until the service mesh container is ready. This is not a minor problem, and the workarounds are not great, quite the opposite. One workaround, in some cases, is to do nothing. Let's say your container crashes when the sidecars are not ready. If you do nothing, your container will crash until the sidecars are ready, and the next time it is restarted, it will work. This can be slow, though, and if your sidecar depends on some other sidecars, this can easily get super slow, and this will be a problem when you scale up as traffic increases, for example. Another workaround is to change your container entrypoint. For example, you can change it to a script that waits for the other sidecars to be ready, and only then starts the main process in the container. An example of this is the linkerd-await command: it waits for Linkerd to become ready and then executes your container process.
This has a lot of downsides: it becomes impossible with a lot of sidecars, and we break the promise of augmenting functionality without changes to the main container; the containers are completely coupled if we do this. Something similar happens on shutdown. There are no ordering guarantees on shutdown either, so the sidecar can terminate before the main application, and this is a big problem in several scenarios. Let's say the sidecar is a service mesh again. If the service mesh is killed before your main container, we won't be able to gracefully finish in-flight connections, because we can't use the network, and this is a big problem. The workarounds here are far from good too, like doing a sleep to delay being killed, or ignoring SIGTERM, or just losing some traffic. All workarounds have big downsides. Another problem that I want to talk about is that we can't use sidecar containers with a Job today. A Job is a pod that runs until completion: it computes something and finishes, like the first 100 digits of pi. It usually works like this: the container starts, computes something and finishes; therefore all containers exit successfully, and Kubernetes proceeds to clean up the pod. However, if we are using a service mesh, for example, we start the pod with two containers, the main container and the service mesh. The main container finishes, but the sidecar won't finish, and it will never finish, because it is a daemon; therefore the pod continues to run forever and will never be cleaned up. Similar problems arise when we take init containers into account, because init containers also run until completion, but let's not go into the details of those now; you can read about them in the KEP linked at the end of the presentation if you want. So those are the problems we currently have with sidecars. Now let me tell you about the Kubernetes Enhancement Proposal we worked on to improve this situation. A KEP is a document that describes an enhancement proposed for Kubernetes.
The sidecar KEP, which adds the notion of sidecar containers to Kubernetes, has a long history. It was created in 2019 by Joseph Irving, and it started a big discussion in the community about the different ways these issues we just talked about could be solved. I joined the sidecar effort in May 2020, and we really tried to do what is best for the community, so we reached out to as many people as possible to make sure what we proposed in the KEP worked for everyone, and not only for us. By reaching out, I found out that several companies today are using a Kubernetes fork just to add the concept of sidecar containers. I was really surprised about this, but it helped convince me that this is a problem we need to solve one way or the other. As I was saying, we invested a lot of time and effort in this. We met with these companies, but also with developers from Linkerd and Istio, and we did a bunch of other things, like making sure the kubelet graceful shutdown KEP interacts properly with the sidecar KEP. We did a detailed analysis of some edge cases that were a major concern for SIG Node, and we found some design issues and bugs in the KEP and its PR that had gone unnoticed in the past two years. After all of this, we improved the KEP to address all the concerns raised by SIG Node and by the companies using sidecars in their own forks. It was not easy, but we found a way forward with reasonable tradeoffs. Sadly, there was a catch. The original KEP focused on a simple change, just adding the concept of sidecar containers, and after much discussion and deliberation, it became clear that this change would leave out too many use cases that would require significant additional work in the future. So in October 2020, together with the SIG Node community, we decided to reject the original KEP and work on something completely different to solve more use cases. Some of the ideas that we might want to explore in a more radical KEP include something like runlevels
for container startup and termination, or even adding explicit container-to-container dependencies, like "container A depends on container B", so the startup of a pod looks like a graph. So we decided to reject this KEP and start from scratch, and really from scratch: we decided to start collecting use cases for things that are not properly supported in Kubernetes today and to sketch some high-level ideas, or pre-proposals. We have done that, and we have one proposal by Tim Hockin. The idea is to later choose, with the community, which proposal should turn into a KEP. Looking back, we made good progress: we have documented a lot of use cases we can now use to create proposals, and I really want to thank everyone who helped by adding their use case there. This journey also helped to clarify that sidecars are important to Kubernetes, and that we don't want to make minimal changes: we want something that properly solves the problem and that opens the door not only for bigger changes but also for better support, not just fixing the most obvious use cases that need to work, but also allowing more complex container dependencies within the pod. Overall, this is clearly a very big project, and we have come a long way, even though it is far from finished. And with this we arrive at where we are today: we are at this pre-KEP stage. Joseph moved on, and sadly I don't expect to have the time to move this from the pre-KEP stage we are currently at all the way into a KEP and later to a GA feature. But I spent a lot of time on this and would like to help, so let me know if I can review some proposals or give some general input. Luckily, Matias from the community has volunteered to continue driving this forward, and I want to thank him again for this. So, to close, regarding sidecar usage: we showed four problems with sidecars that don't have a reasonable workaround today, and this is making some companies fork the kubelet, like Pinterest and Lyft, whose public forks are available.
Netflix also diverged, as Manus showed, using the Virtual Kubelet, and sidecar support seems to be a big win for others too: it's a big win for Linkerd and Istio in the service mesh space, and it seems Tekton, the Google CI/CD project, will also benefit from this, as they mentioned in a presentation at KubeCon 2019, especially if we add more advanced features. Regarding proper sidecar support in Kubernetes, we are at this pre-KEP stage, and we expect a KEP to come out of the proposals to improve the situation. Here are the links for all the things that we mentioned in the presentation. Thank you very much; we are ready to take some questions now.