Okay, so hello everyone. I'm a software engineer at Databricks, and today my topic is our experience with hard multi-tenancy in Kubernetes using Kata Containers.

Starting with an introduction of who we are: Databricks provides a unified and open platform for data and analytics. We have our own data storage solution, which combines the data warehouse and the data lake into a single concept that we call the lakehouse. Many of our services are built on top of Spark for data analysis, and we also have our own machine learning products for AI and model training.

In our classic infrastructure model, we separate our own services from the customer's infrastructure. We integrate with cloud storage, compute, and security in the customer's cloud account, and we manage and deploy those cloud resources on the customer's behalf. We also provide our services on top of multiple cloud providers, such as AWS, Azure, and GCP. That's the traditional way we deliver our services.

Recently, one revolutionary aspect of our product is moving our services to a serverless mode. So what is the serverless mode?
Serverless means we migrate all of the infrastructure management into our own account instead of the customer's account. That eliminates the overhead of managing the cloud provider's assets from the customer's side. It also brings a lot of benefits to the service: provisioning time becomes very fast, and the service becomes very elastic. With serverless, we can usually provide a Spark cluster to a customer in less than five seconds. And since we take ownership of the infrastructure management, we have more flexibility to optimize it and lower the customer's infrastructure cost.

With the serverless mode, we want to use Kubernetes as our infrastructure, due to its great portability and extensibility for containerized workloads. Kubernetes is cloud agnostic and has a rapidly growing ecosystem, so we marked it as our number one candidate. Each customer's workload will be a single pod or a set of pods (basically a Deployment or a DaemonSet) in our Kubernetes cluster.

However, one of the main differences from a traditional Kubernetes cluster is that we require hard multi-tenancy. So what is hard multi-tenancy?
It means that the tenants within a single Kubernetes cluster may come from different companies and do not trust each other, and the infrastructure does not trust the tenants either. In this case, isolation of both the data plane and the control plane is critical.

Meanwhile, Databricks also deploys its own services into the same cluster, so we differentiate the services into two groups: services deployed by our own teams are first-party services, which are in the trusted group, and any pods running customer workloads are untrusted.

In a traditional Kubernetes environment, the default security boundary between pods is the container boundary. The tricky part is that in our environment a customer can run arbitrary code, whatever they want. One example is a machine learning product we recently released, where a customer can train their own models in an environment shared with other customers, and Databricks has no idea what the customer is doing. In this case the container boundary might not be good enough: it has a pretty large attack surface and is not safe for a hard multi-tenancy environment.

Think about the case where a malicious user writes a program to break out of the container. What can they do? They can enter other customers' containers and access their data; they can attack the node kernel to affect the services inside other customers' pods; they can directly attack the Kubernetes control plane; or they can even attack Databricks' trusted services. Basically, after a container breakout they can do whatever they want. As a result, we need to build an additional security boundary around the traditional container and pod to ensure hard multi-tenancy.

One solution is to use a cloud-provider-managed service, for example Fargate from AWS. However, it turns out that those services cannot fit our requirements one hundred percent. For example, some of our Spark workloads need to build a Spark cluster out of multiple pods, and each pod needs to connect to the others to serve as a single Spark cluster. We found that the cloud provider services are not flexible enough for us to build our own logic on top of them. As a result, we wanted to build our own hard multi-tenancy solution on top of Kubernetes.

We explored multiple directions, and one direction we explored to achieve hard multi-tenancy is Kata Containers. So what is Kata Containers? The high-level description is that it is a secure container runtime with lightweight virtual machines that feel and perform like containers. With a microVM, it provides stronger workload isolation using hardware virtualization technology as a second layer of defense, instead of the traditional container boundary, which is a purely software-based solution. The security advantage provided by Kata is obvious.
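As a concrete footnote: a Kata pod boots a real (micro) VM, so the node itself must expose hardware virtualization. A minimal sketch of the kind of node check involved (the KVM device path is the standard one on Linux, but the check itself is illustrative, not Databricks' actual tooling):

```python
import os

def node_supports_kvm() -> bool:
    """A Kata VM needs KVM on the node: either a bare-metal instance
    or a cloud VM with nested virtualization enabled."""
    dev = "/dev/kvm"
    return os.path.exists(dev) and os.access(dev, os.R_OK | os.W_OK)

print(node_supports_kvm())
```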
It provides a VM boundary instead of a container boundary. However, one of the trade-offs is that when we use Kata Containers, we require the cloud provider's VMs to support virtualization; basically, it means we can create VMs inside the cloud provider's node. As a result, not all instance types from all cloud providers can support it.

With Kata Containers, each pod has its own CPUs, its own memory, and its own dedicated disks and kernel, and it is pretty hard for a customer to break out of the VM boundary. That covers computation security. Meanwhile, for network access control we rely on the Kubernetes-native solution, network policy, to shape the customer's traffic. We also build a simple layer with the cloud provider's native firewall solution, such as network security groups, purely for defense in depth.

This is the single-node view after onboarding Kata Containers. We use large machines to hold multiple Kata VMs from different customers, and each Kata VM consumes its own compute and storage resources. On each machine we also need to reserve some cores for the system and infrastructure services, such as the kubelet, the logging service, our metrics-emitting service, and so on. As you can see in the graph on the right, the boundary between pods is the Kata VM boundary, and there are no shared resources between customers, or between customers and our first-party services.

Along with Kata, this is the network policy layer we built. Basically, a pod can only talk to pods from the same customer: we built network policies that disable network connections between different customers. We also disabled access to the Kubernetes control plane, and we disabled pods from talking to any of the cloud provider's other VMs, especially the open kubelet ports on the nodes in the fleet. We only allow one-way connections from our trusted services to the customer's pods.

One additional thing I want to mention is that Kata Containers makes network policy more secure in a multi-tenant environment. The way a CNI usually implements network policy is by translating the policy rules into either iptables rules or eBPF programs and applying them directly on the host. Without Kata Containers, if a customer breaks out of the container and gains root privilege on the host, it would be pretty easy for them to modify the iptables rules and bypass the network policy. With Kata Containers, the VM boundary makes that almost impossible: even if customers break out of their own container, they can only access the iptables rules inside their own Kata VM. The host's iptables rules are effectively immutable to any process inside the Kata VM. So that's how we built the network layer and integrated it with Kata.

Thanks to the high compatibility between Kata Containers and Kubernetes, onboarding Kata is pretty simple; it is just a special container runtime. What we did is, after installing the Kata artifacts on the node, we added a special runtime name in the pod spec and a runtime handler in the containerd configuration. At runtime, Kubernetes automatically figures out the right runtime and creates a Kata VM as the pod sandbox. Functionally, it just works. But is that enough for Kata to run directly in our production environment? The answer is definitely no. I will talk about the main challenges we found while exploring the Kata Containers direction.

Let's start with the biggest problem, which is performance. After onboarding vanilla Kata, we found that our Spark workloads had a 3x to 6x performance slowdown. This means we definitely cannot use vanilla Kata directly in our production environment. So what is the reason for the performance problem? Why is there a slowdown?
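To make the onboarding step just described concrete, here is a sketch of the Kubernetes objects involved (the handler and class names are illustrative; they must match whatever name the Kata runtime was registered under in the containerd configuration):

```yaml
# RuntimeClass pointing at the Kata handler registered in containerd
apiVersion: node.k8s.io/v1
kind: RuntimeClass
metadata:
  name: kata
handler: kata          # must match the containerd runtime handler name
---
# An untrusted customer pod opting into the Kata (VM) sandbox
apiVersion: v1
kind: Pod
metadata:
  name: customer-workload
spec:
  runtimeClassName: kata   # the kubelet/containerd create a Kata VM as the pod sandbox
  containers:
    - name: spark-executor
      image: example.com/spark:latest   # placeholder image
```

Any pod without a `runtimeClassName` keeps the cluster's default runtime, which is how first-party trusted services can stay on the plain container runtime while customer pods get the VM boundary.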
The nature of our Spark workloads is compute intensive, memory intensive, and I/O intensive, and onboarding Kata introduces an additional virtualization layer that makes all of these aspects more complex. For example, our workload runs on virtual CPUs layered on top of the host CPUs; when executing instructions, the CPU has to jump between the Kata guest and the host. We call that a VM exit, and it takes time. We also have to rely on the virtio protocol to expose virtual block devices into the Kata VM. All of these factors introduce additional overhead on both the computation and the I/O path.

So how did we solve this problem? Let's start with the storage performance tuning. Spark requires pretty fast disks for the I/O-intensive workloads, so instead of using the cloud provider's remote disks, we leverage the local SSDs on each machine. In our pod spec, we have a PVC to ask Kubernetes to handle this special mount. Without Kata, the vanilla storage support for PVs and PVCs simply mounts the local SSD to a folder in the host namespace and then bind-mounts this folder into the container namespace. With Kata, the default way to support this scenario is a component called virtio-fs. This component virtualizes the file system on the host and exposes a shared file system inside the Kata VM, so any I/O that happens on either the host or the guest is synced in real time. This is similar to bind-mounting a folder into a container. However, the performance is not that good, because during a single I/O there are multiple context switches between host user space and kernel space, and the I/O request goes through multiple file-system layers: one in the guest and another on the host. These factors add latency to every single I/O and also shrink the total throughput.

The technology we used to solve this problem is called SPDK, the Storage Performance Development Kit. It is an open source project that provides a set of tools and libraries for writing high-performance, scalable, user-mode storage applications. The biggest advantages SPDK provides us are, first, a polling mode instead of the traditional model of waiting for system interrupts to trigger the actual I/O, and second, the ability to bypass the kernel's file system layer on the host and talk directly to the kernel device driver, or even directly to the device itself, for example an NVMe device. With these techniques, the I/O path is dramatically simplified and the performance improves a lot.

The way we integrated SPDK with Kata Containers and Kubernetes is by implementing our own CSI driver. With our own CSI, when a new pod creation request comes in, the kubelet first asks our CSI driver to prepare the PV for that pod. During this preparation, our CSI driver talks to SPDK to create the necessary virtual block device, the control socket, and so on, and then uses the direct-volume function provided by the Kata runtime to record that virtual block device for that specific pod. Next, the kubelet calls the CRI to create the sandbox and the containers, and during sandbox creation the Kata shim hot-plugs the virtual block device directly into the Kata VM and mounts it into the container namespace, so that the processes inside see the block device mounted at the expected folder. That is the whole flow of how we integrate SPDK with the Kata VM and the containers inside it.

With the SPDK integration, this is the disk performance we measured: both read and write show significant improvements, and the SPDK disk performance is pretty close to what we get with native disks outside of Kata Containers.

Besides storage, we also did some exploration on CPU and memory tuning. First, for all the CPUs assigned to Kata VMs, we isolate them from the Linux scheduler, which prevents the scheduler from placing other host processes on this set of CPUs. We also pin each Kata VM's virtual CPUs to dedicated, isolated cores, so that every virtual CPU a process gets assigned inside the Kata VM is statically mapped to a fixed core on the host. These two tunings benefit us a lot from both the performance perspective and the security perspective. They prevent frequent context switching between threads on a single host core, so each core stays focused on a single Kata VM's workload and gets better CPU cache locality. They also prevent customers from sharing any computation resources with each other, which further prevents side-channel attacks, for example. Meanwhile, we also did some CPU state tuning, including enabling the CPU performance mode and tuning the CPU power-management options for lower CPU latency.

Another interesting optimization we explored is NUMA control. So first, a brief introduction to what NUMA is.
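Before moving on to NUMA, the pinning idea just described can be sketched in a few lines. In production this would target the QEMU vCPU threads of each Kata VM; the sketch below (the function name and the self-pinning demo are illustrative only) just shows the Linux affinity syscall such a tool would rely on:

```python
import os

def pin_threads_to_cores(tids, cores):
    """Statically bind each thread (e.g. a QEMU vCPU thread) to one dedicated host core."""
    assignment = {}
    for tid, core in zip(tids, cores):
        os.sched_setaffinity(tid, {core})  # restrict the thread to exactly this core
        assignment[tid] = core
    return assignment

# Demo on the calling process itself: tid 0 means "this process".
print(pin_threads_to_cores([0], [0]))   # {0: 0}
print(os.sched_getaffinity(0))          # {0}
```

We combined this kind of pinning with reserving the cores away from the host scheduler (for example via `isolcpus`), so nothing else is ever scheduled onto them.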
NUMA stands for non-uniform memory access. When the cloud provider provides a large instance type, it typically contains multiple physical processors and multiple memory slots. Some processors are closer to some memory slots, which gives the best memory access latency; when a processor accesses a remote memory slot, the latency is longer. A single NUMA node contains processors and memory that are close to each other, and the large instance types or bare-metal machines provided by the cloud providers typically contain multiple NUMA nodes.

As a result, what we try to do is make sure that, for every single Kata VM, the CPU and memory resources assigned to it come from a single NUMA node, so the memory access latency is short and consistent. Meanwhile, we also try to balance the load between the NUMA nodes on the same host. For example, in this case we see that NUMA node 0 has two VMs but NUMA node 1 has only one, so when a new Kata VM comes, we make sure the upcoming Kata VM lands on NUMA node 1 to balance the load.

The way we implement this is by introducing an on-node service for physical resource management. This resource management service is responsible for bookkeeping the physical resource usage metadata: it records which Kata VM is using which cores and which NUMA node's memory. Every time the Kata runtime creates a new VM, it first asks this service for a hint. The hint contains the set of cores we want to pin to this Kata VM, as well as, for example, the NUMA node ID. The Kata runtime then uses this hint to call the hypervisor with additional parameters, and the newly created Kata VM consumes exactly the resources we specified.

For the NUMA control, we also explored some Kubernetes-native solutions, such as the CPU manager and the topology manager, but it turns out they cannot meet all of our requirements. For example, the CPU manager can specify which Kata VM uses which CPU set via cgroups, but it cannot pin each virtual CPU to a dedicated core. The topology manager is not compatible with Kata Containers either. So we finally decided to implement our own on-node service instead of using a native solution.

With all the performance improvements I introduced, we improved our workloads' end-to-end performance from a 3x to 6x slowdown to less than a 5% slowdown. This indicates that, performance-wise, Kata is qualified to support our hard multi-tenancy infrastructure.

In our exploration of Kata Containers, we also identified some potential risks besides performance. One is the noisy neighbor problem. Inside a single node, there are still some resources shared by multiple Kata VMs, for example the L3 cache, the memory bandwidth, different partitions of the same disk, or the network bandwidth of the same NIC. When multiple Kata VMs access these shared resources at the same time, resource contention can happen, and it may cause performance variation.

Another risk is the additional infrastructure cost. Our original architecture assigned a single pod to a single VM, with the VM sized to fit that one pod. After adopting Kata, we have to use large machines, or even bare-metal machines, and allocate multiple Kata VMs on top of a single machine. The scheduler by default balances the load across the nodes in the Kubernetes cluster, and as time goes by, pods come and go, so the nodes cannot always be fully utilized. As the graph below shows, some nodes end up with fragmentation, and this fragmentation means more infrastructure cost for a SaaS company like us.

Those are some of the concerns and potential risks we found during our exploration, and there may be other problems too. For example, if we onboard Kata Containers on AWS, we have to use bare-metal machines, and it turns out the capacity for those instance types may not be as large as for normal VMs. We also need to allocate additional resources on each node to cover the virtualization cost: for example, after onboarding SPDK, additional CPU cores are consumed polling the I/Os, and we also need to reserve memory and CPUs for the hypervisors. And some of our product scenarios are not well covered by the current Kata VM. For example, if we want to do machine learning model training with GPUs, we would have to do device passthrough for the GPU into the Kata VM, and those scenarios are not well supported in the current Kata upstream.

So what is the conclusion? Kata Containers is a great project for supporting hard multi-tenancy environments. It provides a VM boundary instead of a container boundary, making it much harder for a customer to break out of a container, and by fine-tuning the performance, Kata Containers can reach a performance level similar to native container technology. On the other hand, Kata Containers brings its own complexity: if you want to use it in production, some additional effort is required to make the infrastructure consistent in performance and cost efficient.

That's all for today's talk. Finally, a short advertisement: Databricks serverless is now in public preview on both AWS and Azure, and it will soon go GA. You are welcome to try our products; it's super amazing. That's it for today. Thanks for attending the talk.