Good afternoon, everyone, and welcome to our talk — a case study of moving flipkart.com to Kubernetes on bare metal. A brief introduction about us: I'm Neeraj, and I work as a software engineer in the cloud platform team at Flipkart. My co-presenter is Livingston, who is my teammate at Flipkart.

Let's briefly touch upon today's agenda. We'll talk about what Flipkart is and how it is set up in terms of its IaaS infrastructure. Then we'll talk about our journey of migrating flipkart.com to Kubernetes. We'll touch upon the reasons we wanted to run Kubernetes on bare metal. Finally, we'll share our key learnings from this exercise, and we'll take up any questions at the end.

So what is Flipkart? Flipkart is India's largest e-commerce company. During our annual mega sale event, Big Billion Days (BBD), last year we received 600K QPS on our external load balancers and, overall, 3.8 million QPS on our internal load balancer VIPs. This setup is powered by 20,000 servers deployed across two private data centers and a few services running in GCP. We have a whopping 1,700 microservices maintained by 1,500 developers.

So let's start with the fundamental question: why would a company like Flipkart want to move its platform stack to Kubernetes? Firstly, for immutable and repeatable deployments. In Flipkart, the deployment model across services was fragmented and mostly procedural, automated via Ansible or similar automation frameworks. The problem with this is that bringing up a new environment containing all those 1,700 microservices we talked about was incredibly hard and time consuming. Cross-platform projects would often suffer from poor integration testing because of the lack of isolated integration testing environments, which would in turn slow down project delivery.

Secondly, we wanted to standardize our deployment model and use infrastructure-as-code declarations to manage it. The Kubernetes API provides a robust declarative deployment model. What this enables is that platform teams can now come in and automate and standardize the recipes for different runtimes and frameworks, because underneath, everyone deploys to the same uniform declarative API.

Thirdly, for auto-scaling. We touched upon how large the spikes during our BBD events are — about eight times our business-as-usual (BAU) peaks. This gives us an opportunity for workloads to scale down during non-peak hours and burst into public clouds to handle the peaks. If all the workloads could auto-scale, it would mean about seven times in theoretical resource savings for Flipkart. Containers provide faster spin-up times than virtual machines in our tests, making the move to auto-scaling more feasible.

Next, as I mentioned earlier, we want to utilize public clouds for bursting. But for application teams, migrating their workloads to a new cloud and a new deployment model could be a huge overhead. The Kubernetes API, being cloud-agnostic, allows these workloads to work seamlessly across private and public clouds with minimal changes.

Lastly, our custom virtual machine-based infrastructure stack was robust, but lacking in features that developers wanted — auto-scaling, auto-healing, faster spin-up times. We came to a stark realization that we, as a small team of 50 developers, cannot keep pace with these demands if we continue to run a proprietary closed-source stack. We instead decided to leverage the power of the CNCF community to modernize our IaaS stack. Let's discuss our tenancy model a bit.
I think it's fair to say that despite its huge benefits, Kubernetes is hard, and Kubernetes cluster management is extremely challenging for the uninitiated. We decided that it would make sense to have this cluster management responsibility taken up by a single cloud team that would manage the uptime and scalability of all the Kubernetes clusters. App teams are then provided namespaces and resource quotas in the clusters where they can deploy their workloads. This means that not all of the 150-odd teams have to learn how to manage a Kubernetes cluster. In Flipkart, we have six production clusters spread across three regions and two cloud providers — our private data centers and GCP. The largest cluster runs 3,000 nodes and 1,500 microservices and handles more than 30,000 pods.

One question we get asked is why we chose to host a small number of large clusters versus a large number of small clusters. Firstly, in the private data centers, we are dealing with servers of fixed, static sizes. Having larger clusters allows for more variety in pod sizes, which in turn means we lose fewer resources to bin-packing wastage. Secondly, large clusters allow complementary workloads to scale up and down and allow capacity to be transferred between them more fluidly. We know this is also possible using cluster autoscaling, but node provisioning is and will always be slower than the rate at which pod autoscaling needs to happen, and this would only get magnified with slower bare metal provisioning. It also means larger chunks of freed capacity in the cluster that can now be used by other workloads. Larger clusters also mean less control plane management overhead in terms of the number of master nodes, Prometheus pods, Grafana and the other control plane tools deployed per cluster. On the flip side, managing large clusters is a lot harder. We did struggle with API server memory spikes initially, but with a careful understanding of API server flows, choosing the right resource limits for each control plane pod and tuning API Priority and Fairness, we have managed to make this fairly stable.

As of today, more than 90% of our stateless services are running on Kubernetes, and we ran BBD 2021 with these services on Kubernetes. Stateful services are provided by managed database-as-a-service offerings from our platform teams, which are also being built from the ground up using CNCF and Apache projects like TiKV. The migration of other stateful stacks is ongoing as we invest heavily in building and hardening database operators.

So let's talk about the challenges faced during this migration. Firstly, as we touched upon, managing the stability of the control plane is very hard. Secondly, when a migration is carried out at such a large scale — both in people and in microservices — there is a lot more entropy than we would have assumed for such an exercise. We had to face issues where teams would come and say, "Hey, my pods have higher latency than my virtual machines, and I don't know why. Can you help?" We would find that there were frequent JVM argument mismatches between the two setups. Teams would be using older versions of Java which did not have container support. There would be duplicate versions of JARs in the classpath causing functional and performance issues — for example, logging being done asynchronously in the virtual machines versus synchronously in the pods. Next, we were in the middle of moving our virtual machine fleet from one Debian release to the next; hence, most of our workloads were running older 4.x kernels.
But for Kubernetes, we chose to run the newer 5.10 kernel, which was also required by Cilium. The Spectre and Meltdown and other vulnerability fixes that were pushed between these kernel versions degraded the performance of pods vis-a-vis VMs. We ran benchmarks with the mitigations turned off and observed an improvement of 5% to 10% in our synthetic benchmark tests.

Moving on, there will always be a few servers in such a large fleet whose CPU frequencies are being throttled, whether because of overheating, BIOS misconfigurations after maintenance, and so on. This manifests to users as one of the instances in their service having higher tail latencies, and we have to delete and replace these instances. In the VM world, this was less visible to users because VMs see a smaller churn rate; over a period of time of users deleting these virtual machines, we reach a sort of stable configuration where virtual machines that are less concerned with tail latencies sit on these slow servers. In Kubernetes, because of the higher pod churn rate, the same services would frequently keep getting scheduled onto these nodes, and a one-time deletion would not work — a rough sketch of the kind of per-core frequency check involved appears after this section. We also observed issues where teams would benchmark and qualify their services, only to see them degrade during central NFR runs where all the services are scaled up simultaneously. In such cases, most cores on the underlying servers would spike, causing Intel Turbo Boost to drop from the 3.1 GHz single-core turbo frequency to the lower all-core turbo frequency.

All the above performance issues could be attributed to external factors, but we had some services in our user path which used to run on LXC on bare metal. These services saw significant performance degradation simply because on LXC they were able to avoid the virtualization overhead. This forms the motivation for the next key milestone in our Kubernetes journey: running Kubernetes on bare metal.

Why did we start running Kubernetes on virtual machines in the first place? We had a mature virtual machine provisioning stack, and it made sense to start running Kubernetes on virtual machines to get a quick start on our Kubernetes journey. Also, we didn't want to introduce a new isolation stack within our Kubernetes stack; we wanted to continue with our mature isolation stack. Moreover, we always had control over our Kubernetes clusters, and we decided that we could always switch the nodes to bare metal in the future without involving the user teams.

So what were the considerations for running Kubernetes on bare metal? Firstly, as we touched upon, workloads which were migrating from LXC demanded bare metal performance. We were able to migrate them to Kubernetes on VMs by provisioning additional compute capacity, but this always seemed like a lot of waste, and we needed to reclaim it. We also observed that tail latencies and throughput improved for all workloads when they were running on bare metal. We wanted to eliminate the overhead cost of the cores reserved for the virtualization stack. On the security front, we argued that since we were running only one virtual machine per server, pods running directly on bare metal were not any less secure than the model in which all of the pods share the same guest kernel. As a bonus, moving some of the control plane packages to DaemonSets would make the deployment lifecycle easier to maintain. Next, taking you through the finer details of our bare metal design and journey is Livingston. Over to you, Livingston.
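As referenced above, here is a minimal sketch of the kind of per-core frequency check involved in catching throttled servers — not the actual tooling from the talk. It assumes Linux sysfs cpufreq paths and a hypothetical 10% threshold, roughly what a node-problem-detector-style plugin could report on before an operator replaces instances.

```python
#!/usr/bin/env python3
"""Rough sketch: flag CPU cores running well below their maximum frequency.

Assumptions (not from the talk): sysfs cpufreq paths are available, and a core
sitting more than 10% below cpuinfo_max_freq *under load* counts as throttled.
At idle, low frequencies are just normal frequency scaling, so a real check
would only sample while the node is busy.
"""
import glob

THROTTLE_MARGIN = 0.10  # assumed threshold: >10% below the maximum frequency


def read_khz(path: str) -> int:
    with open(path) as f:
        return int(f.read().strip())


def throttled_cores(margin: float = THROTTLE_MARGIN) -> list[str]:
    flagged = []
    for cur_path in glob.glob(
        "/sys/devices/system/cpu/cpu[0-9]*/cpufreq/scaling_cur_freq"
    ):
        max_path = cur_path.replace("scaling_cur_freq", "cpuinfo_max_freq")
        cur, max_ = read_khz(cur_path), read_khz(max_path)
        if cur < max_ * (1 - margin):
            core = cur_path.split("/")[5]  # e.g. "cpu17"
            flagged.append(f"{core}: {cur / 1e6:.2f} GHz of {max_ / 1e6:.2f} GHz max")
    return flagged


if __name__ == "__main__":
    # A node-problem-detector plugin or similar agent could cordon/replace the
    # node when cores stay flagged, instead of relying on one-time pod deletion.
    for line in throttled_cores():
        print(line)
```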
Thank you, Neeraj. Good evening, folks. This is Livingston from Flipkart's cloud platform team, here to walk you through the rest of the presentation.

Let's quickly take a look at how the virtual machine stack is set up on a single server in Flipkart's data centers. KVM is the kernel-based virtual machine that provides Linux hypervisor capabilities. QEMU is a virtual machine monitor which works with KVM and provides different types of devices and hardware to the guest machine through emulation. libvirt is a user-space application that allows management of virtual machines.

Let's discuss the networking setup for virtual machines, which is powered by vhost-net. There are two primary components in the vhost-net architecture: virtio-net, a virtual Ethernet card that runs in the guest kernel space, and vhost-net, a kernel module that runs in the host kernel space and implements the vhost protocol required for virtio-net to work. Apart from these, there is a Linux kernel bridge which is used to forward packets from the vhost-net module to other guests on the same server or on other servers. Our in-house VPC stack has a control plane that allows us to define isolation policies for different VPCs. There is a data plane component of this VPC stack that runs on the host server and programs the corresponding iptables rules on the respective network interfaces.

Like Neeraj already discussed, we decided to run Kubernetes on VMs instead of bare metal for a couple of reasons. One was that we already had a mature VPC stack for configuring network isolation rules, and we didn't want to re-architect it as part of the Kubernetes migration. Secondly, VMs are faster to provision. Lastly, our team structure didn't give us control over the roadmap of the isolation stack and bare metal provisioning when we started the Kubernetes project. So instead, we chose to run on large VMs that were as big as the entire bare metal server.

On the bare metal host, four cores are reserved for running the host OS along with the VPC data plane, the virtualization control plane and other agents that detect hardware failures. On the guest, two cores are reserved for the guest OS, kubelet and containerd. Another 2.6 cores are reserved for running DaemonSets. One of them is Cilium, which provides us an eBPF-based network stack; we primarily used it for bandwidth control, and although it has capabilities for enforcing network policies, we continued to use the in-house VPC stack that was built for and runs on the host OS. Some of the other DaemonSets are the logging agent, node exporter and node problem detector. The remaining 87.4 cores are used to host user workloads.

Now that we were already on Kubernetes powered by VMs, and the organization had seen the benefits Kubernetes as a platform provides in reducing ops overhead as well as better deployment models, we decided to move Kubernetes nodes to bare metal. In this model, two cores are reserved for the host OS. Another 2.6 cores are reserved for platform-owned workloads like Cilium, the logging agent, node exporter and node problem detector. When we compare running Kubernetes on VMs with running Kubernetes on bare metal, we see that some of the redundant components were removed. Instead of having two separate components for network policies, we have only Cilium now. The virtualization control plane that allowed the central VM orchestrator to spin up VMs is no longer needed. The agents responsible for identifying hardware failures on the host OS are now replaced with DaemonSets like node problem detector.
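To put numbers on the two node layouts just described, here is a small back-of-the-envelope sketch using the figures from the talk. The 96-core server size is inferred from the 87.4 workload cores, and the node count used to reproduce the roughly 40,000-core fleet figure is an assumption, not a number from the talk.

```python
# Back-of-the-envelope core accounting per node, using the figures from the talk.
# The 96-core server size is inferred (4 + 2 + 2.6 + 87.4 = 96); the node count
# below is an assumption chosen only to reproduce the ~40,000-core fleet figure.

SERVER_CORES = 96.0

# Kubernetes on VMs: host OS + VPC data plane + virtualization agents (4 cores),
# guest OS + kubelet + containerd (2 cores), platform DaemonSets such as Cilium,
# the logging agent, node exporter and node problem detector (2.6 cores).
vm_reserved = 4.0 + 2.0 + 2.6
vm_workload_cores = SERVER_CORES - vm_reserved            # 87.4

# Kubernetes on bare metal: host OS (2 cores) + platform DaemonSets (2.6 cores).
bm_reserved = 2.0 + 2.6
bm_workload_cores = SERVER_CORES - bm_reserved            # 91.4

per_node_savings = bm_workload_cores - vm_workload_cores  # 4.0 cores per node

ASSUMED_NODE_COUNT = 10_000  # hypothetical fleet size across the two data centers
print(f"per node: {per_node_savings:.1f} cores, "
      f"fleet: {per_node_savings * ASSUMED_NODE_COUNT:,.0f} cores")
```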
From the removal of these redundant components, we were able to save 4 cores per node, which translates to almost 40,000 cores across our two private data centers.

Apart from the savings we saw earlier, we also wanted to quantify the throughput and latency improvements by running Aerospike YCSB benchmarks. In this session, we will talk about the 100% read test. The server used for this test is an older-generation Haswell machine running in one of our older data centers: the CPU is an Intel Xeon E5-2670, and the NIC is an Intel 82599 10 Gigabit dual-port NIC. The host OS is Debian 10 with kernel 4.19, and the guest OS is Debian 10 with kernel 5.10. On the Kubernetes bare metal node, the OS is Debian 11 with kernel 5.10. The pods used for the test had 16 cores and 16 GB of memory. As you can see, the throughput and the mean latency were measured for three different setups. In addition to the vhost-net networking architecture that we use in production, we also benchmarked another networking setup where an SR-IOV VF is configured as a PCI pass-through device in QEMU; in this architecture, packets from the guest kernel go directly to the NIC and not via the host kernel. We observed an 80% increase in throughput between the vhost-net architecture and running Kubernetes on bare metal. We also saw a 45% improvement in the mean latencies. When it comes to the 95th and 99th percentile latencies, we observed approximately a 50% improvement.

Now, let's move on to some real-world benchmarks that we have done with a couple of our large user-path systems. Mappy, which is short for mobile API, serves requests to all frontends by aggregating data from a large number of microservices within Flipkart. By frontends, we refer to the different channels on which Flipkart can be accessed — the Android app, the iOS app, the mobile site, as well as the desktop website. Mappy calls close to 200 microservices. The benchmark was run on a single pod with an SLA of the 95th percentile latency being below 1 second. The CPU utilization was about 90% in both cases. With these constraints, we saw a 40% increase in throughput for Mappy workloads.

Now, let's move to another service called Athena. Athena is a microservice that serves product serviceability data for products available on flipkart.com. Qualifying for flipkart.com scale is generally done through centrally coordinated NFRs, and the data shown in this slide is from a couple of such NFRs. Athena runs a total of 250 pods with 16 cores and 35 GB each. In this test, we have compared data where the throughput is around 30K. With that, we were able to see approximately a 60% improvement in both the mean as well as the tail latency.

As we have seen from the previous slides, there has been throughput and latency improvement in both synthetic and application benchmarks. During the Aerospike YCSB benchmarks for the PCI pass-through VM, we noticed that the mpstat output for the host CPUs shows utilization as a combination of guest percentage and system percentage. The guest percentage seen on the host matched the overall CPU utilization seen inside the virtual machine. The virtualization overhead is not the same for all applications, and is generally higher for applications that are more IO intensive than for those that are more CPU bound, like inferencing of machine learning models. Eliminating the roughly 20% CPU spent on virtualization by the hypervisor would effectively help improve the peak throughput of a single pod — a rough sketch of this arithmetic follows.
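Here is that sketch, which the next part of the talk walks through with concrete numbers; the 20% overhead and the replica counts are the illustrative figures used in the talk, not fresh measurements.

```python
import math

# Illustrative figure from the talk: a hypervisor overhead of ~20% means a pod
# only gets 80% of its cores for application work on a VM-backed node.
HYPERVISOR_OVERHEAD = 0.20


def replicas_needed(total_peak_qps: float, per_pod_qps_bare_metal: float,
                    on_vm: bool) -> int:
    """Replica count needed to serve a peak, on VM-backed nodes vs bare metal."""
    per_pod = (per_pod_qps_bare_metal * (1 - HYPERVISOR_OVERHEAD)
               if on_vm else per_pod_qps_bare_metal)
    return math.ceil(total_peak_qps / per_pod)


# Talk's example: 10 pods each serving a peak of X on VMs, i.e. 10X overall.
X = 1000.0                                     # arbitrary per-pod peak QPS on a VM
per_pod_bm = X / (1 - HYPERVISOR_OVERHEAD)     # ~1.25X per pod on bare metal
print(replicas_needed(10 * X, per_pod_bm, on_vm=True))    # 10 pods on VMs
print(replicas_needed(10 * X, per_pod_bm, on_vm=False))   # 8 pods on bare metal
```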
This improvement in per-pod throughput would in turn translate to a reduction in a deployment's replica count, effectively giving core savings. In this example, consider an application that has a peak per-pod throughput of X and runs 10 pods. It achieves this using 80% of the CPU assigned to it, as the remaining 20% is used by the hypervisor, and it can serve an overall peak throughput of 10X. Moving to Kubernetes on bare metal would increase the per-pod throughput by approximately 25%, resulting in a peak per-pod throughput of 1.25X. So for the same application to serve a peak throughput of 10X, it would require only 8 pods now.

Let's summarize the overall projected savings that Flipkart will see by moving to Kubernetes on bare metal. This is a ballpark figure, as different applications have different virtualization overheads, which we have estimated at approximately 20%. The projected core savings from removing virtualization is estimated at 40,000 cores across the two data centers. The projected core savings from improved throughput per application pod is estimated at around 100K cores across the two data centers. Hence, the total core savings would sum up to approximately 140K cores, which is almost 30% of the total Kubernetes cores.

During our journey of migrating to bare metal servers, we have learned a couple of things. aRFS (accelerated receive flow steering) is a feature that ensures incoming packets are processed on the same cores where the application is running. Unfortunately, this feature breaks when we run applications that use a high number of threads. We have observed this behavior on both the Intel 82599 and the Intel E810. Another important learning is that bare metal provisioning is an often overlooked requirement, as it is not exercised very frequently. But if we want our infrastructure provisioned declaratively using something like Cluster API, then we need it to be more robust and faster. One open source option that we've seen so far is the Tinkerbell project.

With the success of migrating a couple of large user-path microservices to bare metal Kubernetes nodes, we are looking forward to expanding this project to other types of workloads, like stateful and Istio-enabled workloads. Another avenue for improvement will be faster and more consistent provisioning of bare metal nodes. Cluster API is a project that allows declarative management of Kubernetes clusters; writing a Cluster API provider for Flipkart's IaaS stack would help us manage our Kubernetes clusters declaratively. The final point is that even though we have shared some benchmark results, we would like to publish a white paper on the benefits and cost savings of moving real-world applications from Kubernetes on VMs to Kubernetes on bare metal nodes. Thank you, everyone, for letting us share the experience of migrating a large enterprise like Flipkart from virtual machines to a container orchestrator like Kubernetes. Thank you.