Thank you all for your interest in our work. I am Apur, and this is Aditya. We work on the compute platform team at Uber, and today we are going to talk about our ongoing migration to Kubernetes. Here is a brief summary of what we are going to cover. First, we will introduce the compute platform team at Uber and what we work on. Then we will briefly capture the current status of our migration. Next, we want to take this opportunity to give credit to the community and talk about the features which we heavily use and have found to work really well without requiring any changes. After that, we will talk about a few features and customizations we implement on top of Kubernetes, and why we implement them, so as to make it better suited for Uber's needs. And finally, we will talk about some of the interesting learnings we had during the course of the migration.

First, let me introduce the compute platform team at Uber. Today Uber manages its own on-prem data centers as well as leveraging capacity from Oracle and Google Cloud. These providers are abstracted away from the platforms via a layer we call Crane, which essentially implements host-as-a-service. It ingests capacity from these providers, provisions the hosts and VMs with the right OS and the right image, installs the right set of packages, and essentially makes the node ready for use by the platforms. Above Crane we have the container orchestration layer, which essentially provides containers-as-a-service to the rest of the company. Today this layer is built on top of Mesos and Peloton, where Peloton is a custom framework on Mesos, and we are in the process of migrating this layer to Kubernetes. This platform is used to run all the stateless microservices at Uber, including the ones which run on shared infrastructure, the ones which require their own dedicated infrastructure, as well as the low-level infrastructure services which are required to boot up the rest of the infrastructure. A number of batch workloads also run on this platform, including all machine learning workloads, all Jupyter notebook sessions, and a subset of Spark workloads. The Spark workloads which run on this platform are the ones which don't run well on YARN for various reasons, such as requiring large containers or customized features like gang scheduling or custom Docker images. This talk is primarily going to focus on the stateless side of things. We have another talk at 3:25 by Amit and Kevin which will cover the batch side of things at Uber.

Before moving forward, let me briefly capture the scale at which we operate and how we think about scale. Today our stateless fleet runs thousands of services across millions of cores. However, the one factor which has a significant impact on how we think about scale is the number of deploys which happen every day in our fleet. We have a fairly sophisticated CI/CD platform, nearly all our services are onboarded onto it, and every service gets deployed multiple times in a single day. Averaged over 30 days, we see more than a million and a half containers launched every day in our fleet. And that is just the average: during heavy deploy periods we see new pod launches at a rate of 120 to 130 per second. Note that this is just the new pod launch rate; the actual pod churn rate is much higher. So whenever we design our systems, or think about migrating to a new platform like Kubernetes, this is the one factor we keep coming back to.
The high pod churn rate is something we explicitly account for and design for.

Next, for context, let me also capture where we are in terms of the migration to Kubernetes. We started at the beginning of this year, and as of today more than 70% of our stateless fleet is running on Kubernetes. We expect to be done sometime next half. We have multiple 5,000-node clusters; our largest cluster is between 5,000 and 6,000 nodes today, and we expect it to grow to around 7,500 nodes.

When we started this migration, based on our previous experience using open source technologies at Uber, including Mesos and others, and having done similar migrations in the past, we laid out a number of principles for ourselves. I'm going to capture the three most important ones here. The first one is seamless upgrades. We would like to run the same Kubernetes version in our fleet as what the cloud providers are running at that point in time. In the past, with Mesos and other open source code, we really struggled to upgrade our fleet to new open source versions. Based on those learnings, we decided that when we move to Kubernetes, we are going to use the upstream code as-is, with minimal changes, and rely on Kubernetes-native extensibility like plugins and CRDs to inject any customization.

The second principle is reliable upgrades. In the past, whenever we tried to upgrade open source software to a new version, we invariably had incidents and issues in production. To mitigate that, for Kubernetes we have built extensive release validation, with numerous integration and end-to-end tests that we run against our own clusters, so as to catch any regression before we roll out a release. Another thing which has really helped us is that we have continuous probes running in our clusters which keep exercising the cluster control planes, so as to ensure there is no regression; if any issue is detected, we immediately roll back the new release.

The final principle is around transparent and automated migrations. We found that any time we required developers to put in effort, or changed the developer experience in an unexpected way, the migration invariably failed. So as a principle, we said that we want this migration to be fully automated and completely transparent to all developers. That is, developers just keep running their services as on any normal day; underneath, we change the platform from Mesos to Kubernetes without anyone doing anything or even noticing that something has changed. And we want to keep this incident free, with no business impact to Uber.

So let me talk about how we accomplish the transparent and automated migration, and for that let me introduce UP. UP is a platform owned by our sister team, and it implements Uber's global stateless federation layer. It is the primary service-owner interface and provides a number of federation features, including safe rollouts, continuous deployments, and distributing service capacity across multiple availability zones for service high availability. Importantly, it abstracts the cluster technology, Mesos or Kubernetes, away from developers. One interesting feature it has is what we call cluster selection: within one availability zone, if we have multiple clusters, it rebalances services away from clusters with high allocation toward clusters with low allocation.
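To make the cluster-selection idea concrete, here is a small illustrative sketch in Go; it is not UP's actual implementation, and the cluster names and numbers are made up. The point is only that, within a zone, new or rebalanced capacity flows toward the least-allocated cluster.

```go
// clusterselect.go: an illustrative sketch of the cluster-selection idea,
// not UP's real logic. Within one zone, capacity is steered away from highly
// allocated clusters toward the least allocated one.
package main

import "fmt"

type Cluster struct {
	Name           string
	AllocatedCores float64
	TotalCores     float64
}

func (c Cluster) Allocation() float64 { return c.AllocatedCores / c.TotalCores }

// pickTarget returns the cluster with the lowest allocation percentage,
// the natural destination for new or rebalanced service capacity.
func pickTarget(clusters []Cluster) Cluster {
	best := clusters[0]
	for _, c := range clusters[1:] {
		if c.Allocation() < best.Allocation() {
			best = c
		}
	}
	return best
}

func main() {
	zone := []Cluster{
		{"mesos-a", 9000, 10000},      // 90% allocated: hosts being drained away
		{"kubernetes-a", 4000, 10000}, // 40% allocated: just received the drained hosts
	}
	fmt.Println("rebalance toward:", pickTarget(zone).Name)
}
```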
So this is what allows us to do an automated migration: we just move physical capacity from Mesos to Kubernetes, so the allocation percentage in Mesos becomes higher and the allocation percentage in Kubernetes becomes lower, and UP automatically rebalances. This lets us run the migration in a fully automated way.

So how do we do a transparent migration? The fact that UP abstracts away the underlying cluster technology is obviously a very important step toward a transparent migration, and it is what allows us to even think about doing one. However, as you may guess, compute is a central infrastructure piece which integrates with numerous other infrastructure platforms that all developers use. So when migrating to Kubernetes, we have to rebuild all these existing integrations, and given that Kubernetes and Mesos have numerous subtle differences, each of these integrations requires a very thoughtful design to ensure that the developer experience remains exactly the same. In the subsequent slides, we will see a few examples of the customizations we had to build on top of Kubernetes to ensure this.

Next, we want to take this opportunity to thank the community for providing numerous well-built features which we use directly without making any changes. Note that this is not an exhaustive list; it is just a subset of things we think are fairly unique to Kubernetes, and pretty much anyone running Kubernetes should be using them.

The first is the default kube-scheduler. It's awesome: super stable and super scalable, and SIG Scheduling and SIG Scalability have done an amazing job scaling it up over the past few releases. Another thing we want to call out is the kube-scheduler's plugin architecture. We heavily leverage it, not only to use the numerous plugins which other companies have built and open sourced, but also to inject a couple of our own specific customizations.

The next is security. The security-first nature of Kubernetes stands out. I think one of the teams happiest with our migration to Kubernetes is the engineering security team, because they are able to reuse most of the features directly from Kubernetes and significantly upgrade our security posture. An example is how we secure our own cluster control plane: we use certs for authentication, which provides enough granularity for us to really lock it down. We are now exploring an authentication proxy together with a validating admission controller to potentially set up more fine-grained access control.

The next feature is API priority and fairness, which we heavily use to protect the API server. We configure it not only for every controller, every operator, and every service which integrates with our control plane, but we also limit what operations they can perform on the cluster. For example, we have pretty much disabled direct gets and direct lists against the API server, except during informer start-up. This particular feature has protected our API server time and again, and it is one of the reasons we have had an incident-free migration so far.

The next is the controller-runtime ecosystem. All our controllers and operators are built on top of it. It's great: intuitive to use, great telemetry, and no performance hit in using it.
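For readers who have not used it, the sketch below is roughly the skeleton that controller-runtime gives you: a manager wiring up caching informers, a client, and signal handling, leaving only the Reconcile function to write. This is the generic kubebuilder-style boilerplate, watching plain Pods as a stand-in; it is not one of Uber's controllers.

```go
// reconciler.go: a minimal controller-runtime skeleton, shown only to
// illustrate the ecosystem mentioned above.
package main

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/client"
	"sigs.k8s.io/controller-runtime/pkg/log"
)

// PodReconciler is called whenever a watched Pod changes.
type PodReconciler struct {
	client.Client
}

func (r *PodReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
	logger := log.FromContext(ctx)
	var pod corev1.Pod
	if err := r.Get(ctx, req.NamespacedName, &pod); err != nil {
		// The pod may already be gone; that is not an error worth retrying.
		return ctrl.Result{}, client.IgnoreNotFound(err)
	}
	logger.Info("observed pod", "phase", pod.Status.Phase)
	return ctrl.Result{}, nil
}

func main() {
	mgr, err := ctrl.NewManager(ctrl.GetConfigOrDie(), ctrl.Options{})
	if err != nil {
		panic(err)
	}
	r := &PodReconciler{Client: mgr.GetClient()}
	if err := ctrl.NewControllerManagedBy(mgr).For(&corev1.Pod{}).Complete(r); err != nil {
		panic(err)
	}
	if err := mgr.Start(ctrl.SetupSignalHandler()); err != nil {
		panic(err)
	}
}
```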
Finally, and something which is not that well known, is the support for a separate events database, which is super helpful to us because it allows us to scale our clusters quite well without losing the auditability and debuggability provided by events.

I'm now going to hand it over to Aditya, who will talk about the customizations we have added for Uber developers, discuss why we added them, and also cover some of the interesting learnings we have had during the course of this migration.

Hello, everyone. My name is Aditya. I work with Apur on the container platform team at Uber. As Apur mentioned a few slides earlier, transparent migration is one of our key guiding principles for this project. What that means is that as we move from Mesos to Kubernetes, our developers should continue to have the same developer experience, the same levels of deployment safety, and the same developer velocity. To that end, we have built numerous features on top of native Kubernetes to achieve this. For example, we have abstracted service intent into a CRD. We allow retrieving container artifacts after a pod has exited. We allow setting ulimits on containers, which lets service owners set things like FD limits. We have also improved the scale of the Kubernetes UI: if you take the UI natively and point it at a 7,500-node cluster with, say, 200k pods, it is going to take over 7 or 8 minutes to load; with our optimizations, this happens in under 10 seconds. I'm going to pause here; the highlighted features we will talk about in detail in the subsequent slides. But if you think anything here is applicable outside of Uber as well, we are all ears; we want to talk about it and find ways in which we can give back to the community.

To take transparent migration further, what it really means is that we do not want our service owners to care about any of the Kubernetes internals, nor do we want to expose them in great depth to our federation platform, UP. For example, if a service wants to run on custom SKUs, like a specific NVIDIA GPU, service owners should just tell us that, rather than telling us to use a specific node selector so that their service lands there. So we have abstracted service intent as part of an Uber deployment CRD. The intent can be that the service needs image prefetching, or in-place updates, or dedicated hardware, for example. The CRD controller's job is then to translate this intent into one or more meaningful Kubernetes-specific expressions and actually make it happen.
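As a flavor of what that translation can look like, here is a hedged sketch in Go: an abstract intent is rewritten into node selectors and tolerations on the pod spec. The intent fields, label key, and taint key are hypothetical, not Uber's actual CRD schema.

```go
// intent.go: illustrative translation of abstract service intent into
// concrete Kubernetes pod-spec expressions. Field names, label keys, and
// taint keys are hypothetical.
package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
)

// ServiceIntent is a hypothetical slice of what a deployment CRD might carry.
type ServiceIntent struct {
	GPUSKU            string // e.g. "nvidia-a100"; empty means no GPU requirement
	DedicatedHardware bool
}

// applyIntent rewrites the pod spec so that the scheduler honors the intent;
// service owners never see node selectors or tolerations directly.
func applyIntent(intent ServiceIntent, spec *corev1.PodSpec) {
	if spec.NodeSelector == nil {
		spec.NodeSelector = map[string]string{}
	}
	if intent.GPUSKU != "" {
		// Pin to nodes that expose the requested SKU via a node label.
		spec.NodeSelector["example.com/gpu-sku"] = intent.GPUSKU
	}
	if intent.DedicatedHardware {
		// Dedicated pools are modeled here as tainted nodes that only
		// intentful workloads tolerate.
		spec.Tolerations = append(spec.Tolerations, corev1.Toleration{
			Key:      "example.com/dedicated",
			Operator: corev1.TolerationOpExists,
			Effect:   corev1.TaintEffectNoSchedule,
		})
	}
}

func main() {
	spec := corev1.PodSpec{}
	applyIntent(ServiceIntent{GPUSKU: "nvidia-a100", DedicatedHardware: true}, &spec)
	fmt.Println(spec.NodeSelector, len(spec.Tolerations))
}
```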
To keep the developer experience seamless, we also support a widely used feature at Uber called container artifact retrieval. Today our developers are able to access artifacts like core dumps or heap profiles even after their container exits. These artifacts are written to local disk and exposed so that users can pull them from the host. For example, Java services are configured to write their heap profiles and core dumps locally, and these are very useful for developers debugging their crashed or OOM-killed containers for some time after the container has already crashed. We would lose this functionality when we move to Kubernetes, because all the local volumes and local data on the host get cleaned up during pod deletion. To support this, we introduced an artifact uploader daemon that uploads these artifacts to a blob store on container exit.

So how does this work? We have introduced a sidecar container in all of our stateless pods. The main objective of the sidecar container is to buy us time after the primary container exits and before the pod is deleted. Once the primary container exits, be it through normal updates or abnormal exits like OOM kills or segfaults, the artifact uploader daemon on the host gets a signal, execs into the sidecar, tars up all of the artifacts that developers care about, and uploads them to the blob store. It then asks the sidecar container to exit, upon which the pod gets deleted and all the local artifacts are cleaned up.

In terms of deployment safety, a widely used feature at Uber is controlled, or gradual, scaling. We have multiple services which use membership-based protocols like Apache Helix, we have Celery workers, we have sharded services, and all of them are very sensitive to rapid scale-ups or rapid scale-downs. To support this, we looked at what comes closest in the native ecosystem, and the closest thing is the rolling update spec, but that is applicable only during actual updates. On the contrary, Kubernetes is actually optimized to make scale operations go as fast as possible. So we introduced a batch-sizing concept in the Uber deployment CRD, so that our federation layer can specify a batch size during scale operations. If you specify a batch size of 3 and you want to go from 10 to 20 instances, you go in steps: 13, 16, 19, and then 20.

So we talked about slowing down scale operations, but the majority of service owners actually want to deploy their code to production as fast as possible. We heavily use CI/CD, service owners deploy multiple times a day through it, and therefore the desire is to make rollouts as quick as possible. With these rapid rollouts we also want to make sure that incident mitigation is super quick, so we can deploy hotfixes very quickly. For some services at Uber this is very hard to do. One reason is that their containers are super large; a single container of such a service can take up more than 25% of a host's resources. Our clusters run fairly hot, at 85 to 90% allocation. On top of that, there is a lot of churn, as we discussed earlier: a lot of pods restarting and being moved around, which means our clusters are inherently fragmented. So there are typically not enough hosts with that much free capacity to house these pods, and it takes a long time to place them. Once we actually place them, it is highly desirable that they do not lose this placement across updates. And when a pod does get placed, it also runs into issues like cold starts, because these pods can have image sizes close to 5 GB; it takes a long time to download these images and then start the pod. All of these things combine to produce slower rollouts.

To solve this, we started using CloneSets, which are developed by Alibaba. They are a Kubernetes resource that provides in-place updates using pod patching. Now that we had solved the in-place update problem, we introduced image prefetching, which allows pre-pulling the images that are about to be needed. When an update is taking place, we go zone by zone. So if we are updating zone A, we notify the image prefetch daemons in zones B and C to pre-download the image that is being rolled out, so that when the update hits zone B, the image is already present on those hosts.
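Circling back to the controlled-scaling batches described a moment ago, here is a minimal sketch of the stepping logic, matching the 10-to-20 example with a batch size of 3. The real controller would wait for each batch to become healthy before taking the next step, which is omitted here; none of this is Uber's actual code.

```go
// batchscale.go: a sketch of batch-sized scaling. With a batch size of 3,
// going from 10 to 20 instances steps through 13, 16, 19, 20.
package main

import "fmt"

// scaleSteps returns the intermediate instance counts a controller would walk
// through on the way from current to target.
func scaleSteps(current, target, batch int) []int {
	var steps []int
	for current != target {
		if target > current { // scale up
			current += batch
			if current > target {
				current = target
			}
		} else { // scale down, same batching applied in reverse
			current -= batch
			if current < target {
				current = target
			}
		}
		steps = append(steps, current)
	}
	return steps
}

func main() {
	fmt.Println(scaleSteps(10, 20, 3)) // [13 16 19 20]
	fmt.Println(scaleSteps(20, 10, 3)) // [17 14 11 10]
}
```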
Another interesting feature we provide is a unique 32-bit instance ID per pod. These IDs are unique within a service and an environment. Developers want to identify issues with single-instance failures, and they do so by tagging their metrics and logs with instance IDs to improve their debuggability. If instances are not assigned unique IDs, their metrics can get jumbled, so uniqueness is very important to developers. We support this by using the last five characters of the pod ID, and we put the service name and environment name, with some padding, in the first 58 characters. So the last five characters are random, but they are still unique within the scope of that service and environment. An ask from the community would be to provide, hopefully, ways of making this slightly better.

Even before we started this migration, we talked to a lot of community members and did a lot of research, and the general consensus within the community was that we should set up a large number of small clusters, with cluster sizes of around 1,000 to 1,500 nodes. At Uber, we were doing the exact opposite of this: our Mesos cluster sizes were between 5,000 and 7,500 nodes. Why did we do this? We did it to reduce fragmentation issues, to reduce the amount of stranded capacity that we can no longer use when there are smaller clusters, and to reduce operational toil. So with Kubernetes, we wanted to see for ourselves; we wanted a reproducible setup where we could scale Kubernetes to our requirements. We set up a state-of-the-art benchmarking cluster using Kubemark and ClusterLoader, and I'm happy to note that we were able to get to 100,000 nodes, 200k pods, and 150 pod launches per second reliably using this setup, with minimal changes to the Kubernetes control plane. So kudos to the community for that. There were some minor config and software changes that we had to make. For example, we had to carefully tune QPS settings and parallelism for the controller manager and the scheduler. We restricted API calls like list and get using API priority and fairness. We used proto encoding instead of JSON; and we heard in a talk here that CRDs now support proto instead of JSON, which is awesome. We also made some software changes to speed up things like the pod topology spread scheduler plugin. With all of these changes, we could actually get to the desired scale that we wanted.
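The QPS and parallelism tuning above is done with flags on the controller manager and scheduler. For a custom controller built on client-go, the equivalent knobs live on the rest config; the snippet below is an illustrative example of that kind of tuning, with placeholder values rather than the ones actually used at Uber.

```go
// clienttuning.go: a hypothetical example of client-side tuning -- raising
// client QPS/burst and requesting protobuf instead of JSON for built-in types.
package main

import (
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

func newTunedClient() (*kubernetes.Clientset, error) {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		return nil, err
	}
	// Default client-go limits (QPS 5, burst 10) are far too low for a busy
	// controller on a large cluster; these values are placeholders.
	cfg.QPS = 100
	cfg.Burst = 200
	// Ask the API server for protobuf-encoded responses for built-in types.
	cfg.ContentType = "application/vnd.kubernetes.protobuf"
	cfg.AcceptContentTypes = "application/vnd.kubernetes.protobuf,application/json"
	return kubernetes.NewForConfig(cfg)
}

func main() {
	if _, err := newTunedClient(); err != nil {
		panic(err)
	}
}
```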
So far, we have talked about what we proactively did to make sure we could reach that scale. But as we went ahead with the migration, we saw some unexpected issues, some unexpected behaviors, some quirks, and some lack of tooling, which I will get into now. Generally, we did not find a holistic monitoring solution out of the box to help us reason about the state of the cluster. For example, we started seeing a lot more fragmentation issues on Kubernetes versus Mesos. One of the reasons was that we did not have in-place updates at first: there was higher pod churn, we used make-before-break updates, and so on. We could not find a tool to really investigate where this fragmentation was and answer questions like "why is my pod not being placed?" There are some ways you can use kubectl and script around it, but they all use some variant of aggressive listing, which we wanted to avoid. We saw issues where pods kept getting rescheduled onto the same degraded hosts and crash-looped all the time. We saw issues with noisy neighbors, but we could not figure out why a set of hosts was seeing degraded performance: if there was a common set of services running across those nodes, we didn't have enough visibility into that. We also wanted more visibility into the kind of churn the cluster was seeing, in terms of how many parallel updates are in flight, whether there are straggling updates, and whether there are stuck updates. So to fix all of that, we built the deployment and fragmentation observability tooling ourselves.

Another issue we saw was with native informers and the way they reconcile their events. The way native Kubernetes informers resync is that every 8 to 10 hours they replay all the events in their cache to their controller, so as to make sure that no event is missed. This works fairly okay for a smallish cluster, but we have seen cases where a deployment gets created and the deployment-create event is not acted upon, because at the same time the controller is going through a leadership change. That create event gets lost, and the next time it is actually acted upon is after 8 to 10 hours, which is not desirable. So we created a custom reconciliation mechanism for high-level objects like Uber deployments.

Similar to faster rollouts, we also want to ensure that our rollbacks are quick and deterministic. To do that, we started using progressDeadlineSeconds, and we treated PDS as a wall-clock timer. That didn't work out quite as expected, because we had some services which had disabled their health checks but still had crash-looping pods, and we had services that had health checks enabled but with a long initial delay. In both of these cases, the pods were marked ready immediately before they started crash-looping, so the deployment kept appearing to make some sort of progress, and whenever it appears to make progress the PDS timer gets reset. So we couldn't use it. Instead, we decided to use a heuristic based on the number of container restarts during a rollout. For example, if we see more than 10% of pods getting restarted five or more times during a rollout, we consider that a bad rollout and we roll back.
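A minimal sketch of that restart-based check is below. The 10% and five-restart thresholds come from the talk, but how the rollout's pods are listed and how the rollback is triggered are left out, and none of this is Uber's actual implementation.

```go
// rollbackcheck.go: a sketch of the restart-count rollback heuristic.
package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
)

const (
	restartThreshold = 5
	badPodFraction   = 0.10
)

// isBadRollout reports whether more than 10% of the rollout's pods have a
// container restarted five or more times.
func isBadRollout(pods []corev1.Pod) bool {
	if len(pods) == 0 {
		return false
	}
	bad := 0
	for _, p := range pods {
		for _, cs := range p.Status.ContainerStatuses {
			if cs.RestartCount >= restartThreshold {
				bad++
				break // count each pod at most once
			}
		}
	}
	return float64(bad)/float64(len(pods)) > badPodFraction
}

func main() {
	// Fabricated example: 2 of 10 pods crash-looping -> roll back.
	pods := make([]corev1.Pod, 10)
	for i := 0; i < 2; i++ {
		pods[i].Status.ContainerStatuses = []corev1.ContainerStatus{{RestartCount: 6}}
	}
	fmt.Println("roll back:", isBadRollout(pods))
}
```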
Lastly, I want to note that we couldn't have made this rapid progress without UP, our global federation layer, as well as the investment in service portability and the extensive effort that went into keeping things reliable. At our peak we were moving about 250k to 300k cores per week, which is a pretty high number in my opinion.

So far we have talked about the stateless side of the story. Where we are going with this is that we have multiple cluster management technologies at Uber: we have Presto jobs, and we have Odin to manage stateful workloads like Cassandra, Redis, and so on. As a company we have decided to converge on Kubernetes as a unified platform for all of these workloads. So watch this space for the next year or two; we will have more updates on that. Lastly, I want to thank all the teams mentioned here for their support and work during this migration, as well as the Kubernetes community. Thank you very much for your efforts, and with that we can go to Q&A.

If I understand correctly, during your Mesos-to-Kubernetes migration you spread the instances of a workload across Mesos and Kubernetes, is that correct? If so, how does an instance in Mesos discover the pods in Kubernetes and vice versa? How does your service discovery work?

So we have a service mesh, and we integrate the same service mesh with both Mesos and Kubernetes, and the integration looks exactly the same. So the pods in Kubernetes as well as the containers in Mesos are visible to each other.

I think I remember you said you were using 48-core hosts, and then you also made some other optimizations with the API QPS and burst settings and so on. So how do you define the tradeoff between throwing more resources at the problem and actually optimizing the various parameters within the objects? And as a follow-up, how do you define the tradeoff between having, let's say, twelve 8-core hosts, which might give you more resiliency, versus a single large host, when you run a big cluster like that? Those are my two questions.

That's an excellent question. The way we think about it is that we want hosts as large as possible, so as to reduce the fragmentation we see. It's the same reasoning as for clusters: we have large clusters as well as large hosts within the cluster, so as to reduce fragmentation. Now, why can't we just go to, let's say, a thousand-core machine or something like that? The reason is that we have host agents which run on every node, including the service mesh, metrics, logging, and so on, and as we keep scaling the hosts up we find issues with them, which we fix as we move forward. So yes, we use 48-core machines right now, we are already moving to 96, and we hope to move to 120 and 256 cores over time. But it requires careful scaling of the daemons on the host. For example, Docker didn't scale; when we moved to Kubernetes we got containerd, and that scales.

And did you also run your control plane on larger machines? For the control plane, as part of our benchmarking we figured out what the minimum machine size is, what disk IOPS, how much memory, and so on, and what the best instance type is which fits that control plane. So we chose the cheapest possible machine which still allows us to reach the scale we want to reach.

Thanks for the talk. I'm curious to hear about the optimization you did on the reconciliation. You were saying that you have the resync period every eight hours and then you did some optimization on it. How did you do that?

Right. So by default, informers have a setting which is a little misleading. It says "resync," which I thought would re-list, but it doesn't re-list; it actually just replays its cache on a randomized timer between eight and twelve hours. Whenever that happens, all events are replayed, so you have to wait for the whole resync, which can take up to twelve hours. Meanwhile your deployment got created, and the developer is asking, "hey, I created my deployment, why are you not doing anything with it?" So that's why we force-reconcile high-level objects every 15 minutes. Thanks.

Thank you. I was just curious if you can share more about your experience moving from Mesos to Kubernetes with respect to resource limit enforcement, so requests and limits on the Mesos platform versus on Kubernetes.

Can you repeat that, you are asking about requests and limits? In Kubernetes there are requests and limits, which are very important, so I'm curious what your experience was moving from Mesos to Kubernetes.

So in Mesos we didn't use revocable resources, if you know what that means, so we didn't use CPU overcommitment as much. We used to use CPU overcommitment before, but we disabled it some time back. So when we move to Kubernetes, essentially request equals limit for CPU. Moving forward, if we have to enable overcommitment, we will do it on a case-by-case basis.
What else? That's fine. Overall, I think the sheer number of features available in Kubernetes compared to Mesos is huge, and overall the migration has been really good for us, for security as well as for numerous other reasons. Thank you. Thank you so much.