Hello, and welcome to this KubeCon talk, where we will discuss our experiences at Databricks integrating our platform with Azure Private Link over IPv6. I'm Mike Wierhold, an engineering manager at Databricks, and I'm presenting with Mei Sheng Li, a senior software engineer who led the Private Link integration. During this talk, I will give a high-level overview of the Databricks architecture and talk about what Private Link is and why it's important to customers running production systems in cloud environments. After that, Mei Sheng will discuss some of the challenges we faced when integrating Private Link, how we eventually did our integration, and then show a demo of how customers can access Databricks over Private Link.

Before we dive into the Databricks architecture, I first want to talk a little bit about Databricks and the platform we offer to customers. Maybe you've heard about Databricks and know a little bit about what we do, but if you haven't, then you've probably at least heard of Apache Spark. Apache Spark is a general-purpose analytics engine that was created by the founders of Databricks, and it's used for all kinds of large-scale analytics jobs, ranging from BI workloads to machine learning. While Apache Spark by itself is a great product, it, like all other distributed analytics engines, can be time-consuming to manage when you have large clusters. There is also a lot of software you need to manage outside of your analytics engine, like notebook software, experiment tracking software if you're doing machine learning, and so on. Databricks realized this early on and built a software-as-a-service platform that runs in the cloud to manage all of this complexity for you. This means that your data scientists can focus on what they're good at, which is writing analytics jobs, and not on maintaining infrastructure, which is not their area of expertise.

Since Databricks was founded in 2013, the platform has grown exponentially and is used by many Fortune 500 companies for mission-critical applications. As use of the platform has grown, so has Databricks. Today we have over 6,000 customers and more than 1,500 employees, over 300 of whom are engineers, and we have $350 million in annual recurring revenue.

So why is Databricks so popular? As I just mentioned, we provide a unified analytics platform that allows customers to easily write analytics jobs and get value out of their data. While Databricks originally started as a company built around Apache Spark, we have since incorporated other types of analytics engines into our product, like TensorFlow, so that you have a choice of engines to run jobs on. Our platform is also multi-cloud and runs in AWS and Azure. We provide built-in notebooks and reporting and allow customers to easily spin Spark clusters up and down in a cost-effective way. This allows data scientists, data engineers, and business users to focus entirely on the things they're good at and not have to worry about the details of the underlying platform.

So how do we build this platform? Databricks consists of both a control plane and a data plane. The control plane is where customers log in to Databricks, and it allows customers to manage users, manage Spark clusters, create notebooks, set up jobs to run periodically, and so on.
The data plane is where the Spark clusters run, and you may have noticed in the diagram that there is one control plane and many data planes. There's a reason for this: Databricks uses what's known as the NVPC model. This means that the control plane runs in the Databricks cloud account, but the data plane runs in the customer's cloud account. The advantage of building your system this way is that a customer's data never needs to leave their account, which is particularly important for security-sensitive customers. Our control plane is built on top of Kubernetes, and we leverage a variety of open-source projects such as Envoy, MLflow, Koalas, Nginx, Consul, Redis, Prometheus, CoreDNS, Jaeger, and so on. We write our services in Scala and Python. Databricks also doesn't operate just a single control plane: we operate many control planes all around the world and across multiple clouds. In fact, we operate over 2,000 Kubernetes clusters worldwide, which are accessed by over 100,000 users. These control planes manage hundreds of thousands of Spark clusters every day and launch millions of VMs to execute customer-submitted Spark jobs. These Spark jobs process exabytes of data in order to produce reports and other business insights for customers.

Alright, so let's talk a little bit about Private Link. What is Private Link, and why did we integrate with it? It turns out that even though customer data never leaves the customer's cloud account due to our NVPC model, some security-sensitive customers are still concerned about the results of their jobs, or other potentially sensitive information stored on the platform, being sent over the public internet, even if all of it is encrypted in transit. They want to be able to guarantee that communication to Databricks never occurs over the public internet, and they also want to limit access so that their Databricks workspaces are only reachable through specific endpoints in their cloud network. In order to provide this level of security, all major cloud providers offer a service called Private Link.

At a high level, Private Link has two components: a Private Link service and a Private Link endpoint. The Private Link service is configured as part of the load balancer that allows traffic into your VNet and ultimately to your application. The Private Link endpoint is, in Databricks' case, set up in the customer's account and provides an IP address and DNS name that can only be routed to from inside the customer's cloud network. As part of the Private Link endpoint setup, you specify the Private Link service that the endpoint sends traffic to. Traffic between the Private Link endpoint and the Private Link service always travels over the cloud provider's private network, giving customers a more secure way to access their cloud resources without any communication ever traversing the public internet.

Alright, that's a brief overview of Private Link, and I'll hand the presentation off now to Mei Sheng. He'll talk about the challenges we faced at Databricks integrating Private Link and how we solved them. We'll conclude the talk with a demo showing a Databricks workspace being accessed over Private Link.

Thank you, Michael. Hello everyone, my name is Mei Sheng. Next, I will share our journey to integrate with Azure Private Link. As you can see in the diagram, if you want to use Azure Private Link, you first need to provision a Private Link endpoint within your virtual network.
Then you connect the Private Link endpoint to the Private Link service inside the service provider's virtual network, in this case the Databricks virtual network. You can then send your traffic to this Private Link endpoint inside your VNet, which has a local IPv4 address. Azure Private Link takes care of routing the traffic over Azure networking instead of the public internet and delivers it to the Private Link service inside the Databricks VNet. That's how the traffic is kept more secure and more private.

At Databricks, we have several use cases for integrating with Azure Private Link to benefit our customers. The first use case, which we will focus on mostly in this talk, is user-to-web-application traffic. The user can set up a Private Link endpoint inside their virtual network, and through Private Link they can talk to the Databricks control plane web application. They can use the notebooks there to launch clusters and do all their data science work. There are other use cases as well; for example, Databricks control plane to data plane communication can also be secured through Private Link.

So what is the challenge on the infrastructure side of integrating with Private Link? First of all, Databricks is a first-party service on Azure. What does that mean? It's actually called Azure Databricks, and it appears as a native service in Azure. Creating a Databricks workspace is as easy as creating other Azure resources, for example virtual machines or databases: you just go through several clicks, and then you have a Databricks workspace to work in.

On the Azure side, they provide two types of Private Link support models. The first is the third-party offering, which is available to all Azure customers and is purely IPv4. The second type of support model is the PaaS version of Private Link, which provides deeper integration with other Azure services; all the other first-party services on Azure use the PaaS version of Private Link. Even though it appears to the customer to be routed over IPv4, as shown in the previous diagram, where you connect to the Private Link endpoint in your VNet at a local IPv4 address, the traffic routed by Azure networking is actually carried over IPv6 between the two VNets. That's the PaaS version of Private Link. As a first-party service on Azure, we have to use the PaaS version of Private Link.

The challenge for us is that there is a requirement to accept IPv6 traffic on the control plane to make the Private Link traffic work. On the Azure side, they do have a lot of IPv6 support on most of their resources: VNets (virtual networks), subnets, load balancers, and VMSS (virtual machine scale sets) all support dual stack, so you can assign both IPv4 and IPv6 addresses to these resources at the same time. The challenge for us is really our control plane. As shown before, we run entirely on top of Kubernetes, which was purely IPv4 at this point, and the Private Link traffic comes in as IPv6 traffic, so we have to accept IPv6 traffic to our Kubernetes services.

There are two high-level options to solve this problem. The first is a proxy solution: convert the IPv6 traffic to IPv4 outside of Kubernetes and then just talk IPv4 to the Kubernetes services. The other option is to support IPv6 natively in Kubernetes, so that the IPv6 Private Link traffic can directly hit our services running on Kubernetes. First, a little bit of background on running Kubernetes at Databricks.
At Databricks, we run all the control plane services on Kubernetes, but we are not using a managed Kubernetes service such as AKS. This is mainly because Databricks is multi-cloud and we want to be consistent across the different cloud providers. We build our own virtual machine images and make sure those VMs can bootstrap into Kubernetes clusters, which gives us more control: we can make sure every cluster has the same kernel and OS version and the same Kubernetes version, and it's easier to support our own services. In terms of configuration, we had completely disabled IPv6 at the kernel level, because we didn't need it before supporting Private Link. The CNI plugin we use is Flannel, the container runtime is Docker, and the Kubernetes version we are running is 1.16. The load balancer setup in Azure is also a little different from the load balancer setups in AWS or GCP: in Azure there is one single load balancer, and every Kubernetes LoadBalancer service is added as a load-balancing rule on that same load balancer.

We first explored the first option, a proxy solution outside the Kubernetes cluster. This solution is also used by some other Azure internal services. The basic idea is simple. Within the same VNet, we can provision a dedicated load balancer that accepts the IPv6 Private Link traffic. This load balancer sends traffic to a backend VMSS (virtual machine scale set). We can run the Private Link proxy on this VMSS, which terminates the IPv6 traffic, proxies it to IPv4, and then talks to the Kubernetes cluster. The Kubernetes cluster's load balancer continues to serve the public IPv4 traffic from the rest of the users. This seems to be a straightforward solution, but there are a lot of challenges. The first question is how we would deploy this proxy. Do we deliver it as a virtual machine image, since it runs on top of a virtual machine scale set, or as a container image? It is completely outside Kubernetes, so how do we actually deploy it? We cannot use kubectl; maybe we can run some Docker commands if we run it as a container, or maybe we just deploy virtual machines built from a virtual machine image. But we don't have a support model for that; it would be a completely new kind of service for us to operate. The other question is how we monitor this service: metrics, logging, and so on. We only have native support for that in Kubernetes; we don't have an existing pattern for running a virtual machine scale set outside of Kubernetes and providing metrics and logging for it. These are the problems with the proxy solution.

Because of that, we also explored the second option, which is to support IPv6 natively in Kubernetes. Because we still need to serve IPv4 traffic for public access, which does not go through Private Link, we would have to use the dual-stack feature in Kubernetes if we chose native IPv6 support. If we use the dual-stack feature at the load balancer level, we can accept both IPv6 and IPv4, and the overall architecture looks simpler, which is good. However, this option has its own challenges. The first is a stability concern: on the Kubernetes version we are running, 1.16, dual stack is just an alpha feature, and it was targeting beta in 1.20. It's not a good idea to enable an alpha feature for production workloads. The second is that this option seems to be overkill: we only need IPv6 support at the front end, so maybe a handful of services and pods would require IPv6, but most of our Kubernetes workloads would not. Finally, it could be a huge engineering effort of prototyping and testing to make sure everything works if we enable the dual-stack feature in Kubernetes.

We did some investigation on running dual stack on Kubernetes. First of all, don't confuse it with the IPv6 single-stack feature in Kubernetes, which entered alpha in 1.9 and moved to beta in 1.18; dual stack is a different feature. The dual-stack feature started as alpha in 1.16, but when we talked to the contributors, it seemed like the feature was mostly stable already. The reason it hadn't been promoted to beta was mostly some pending discussion on the Service APIs, which actually would not affect our use case for dual stack. Once you enable dual stack, it assigns both an IPv4 and an IPv6 address to literally every pod running in Kubernetes, but at the Service level you need separate Services: one for IPv4 and one dedicated Service for IPv6. To run dual stack, there are also some networking prerequisites. First, the Kubernetes nodes, at the host level, must have dual-stack support. This is not surprising, and Azure VMSS already supports it: the VMs can be dual stack. Second, because every pod will have both an IPv4 and an IPv6 address, the CNI plugin you choose must support dual stack as well. We are using Flannel, and Flannel does not support dual stack; only a few CNI plugins, such as Calico, had better support for the dual-stack feature at the time, and even then it's case by case on different cloud providers, so it's not guaranteed to work across clouds.
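To make that Service-level limitation concrete, here is a rough sketch of roughly what the alpha dual-stack API looked like around Kubernetes 1.16 to 1.19: each Service could only select one IP family via spec.ipFamily, so exposing the same backend over both IPv4 and IPv6 meant creating two Services. The names, labels, and ports below are made up for illustration, and this field was later replaced by ipFamilies and ipFamilyPolicy in 1.20, so treat this as a sketch of that era rather than current guidance.

```yaml
# Hypothetical illustration of the alpha dual-stack Service API (~1.16-1.19):
# one Service per IP family, both selecting the same pods.
apiVersion: v1
kind: Service
metadata:
  name: webapp-ipv4
spec:
  ipFamily: IPv4          # alpha-era field; replaced by ipFamilies/ipFamilyPolicy in 1.20
  selector:
    app: webapp
  ports:
    - port: 443
      targetPort: 8443
---
apiVersion: v1
kind: Service
metadata:
  name: webapp-ipv6
spec:
  ipFamily: IPv6          # a dedicated Service is needed for IPv6 clients
  selector:
    app: webapp
  ports:
    - port: 443
      targetPort: 8443
```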
So after exploring these dual-stack possibilities on Kubernetes, we actually went back to the proxy solution; it looked like the better option in the short term. Then we revisited the proxy solution: can we combine the two options, that is, is it possible to move the proxy into Kubernetes? If we do that, we get deployment and monitoring essentially for free, because we know how to deploy Kubernetes workloads and we have native metrics and logging in Kubernetes. At the load balancer level, Azure does support dual stack, so even a single load balancer can serve both IPv4 and IPv6 traffic, and we can combine them into one. The virtual machine scale set also supports dual stack. Really, only Flannel, the Kubernetes CNI network, does not support dual stack. So is it possible to deploy the proxy as pods on Kubernetes but give those pods the virtual-machine-level networking? There is a feature for exactly that, host networking, which would work for dual stack. That was the third option we explored. We didn't know whether it would work or not, so we prototyped it, and luckily it works pretty well.

So here comes our solution. Basically, to make it work end to end, we first need to make sure IPv6 is available everywhere on the Azure cloud provider infrastructure. First, we provision IPv6 on the VNet and subnet and add an IPv6 frontend IP to the load balancer. Then we create a virtual machine scale set with a special image in which we have enabled IPv6, so the VMs are dual stack, which can later be used by the proxy pods running on top.
To provision the VMSS with dual stack: if you use the Azure provider 2.0 with Terraform, you can use the Terraform resource directly; otherwise, you can always use the Azure CLI to attach an IPv6 interface to the VMSS. We then set the load balancer to only send the IPv6 traffic to this VMSS, so once traffic hits the IPv6 frontend IP on the load balancer, it only goes to this VMSS. Inside Kubernetes, we also have a dedicated node pool for this VMSS, so that only the proxy workload runs on it and doesn't interfere with other Kubernetes workloads.

Then we deployed the Private Link proxy, which is a v6-to-v4 proxy, as a regular Kubernetes Deployment onto that dedicated node pool. We set it to use host networking, so it actually gets both the IPv4 and the IPv6 interfaces. The proxy itself is just an NGINX proxy. If you use host networking and have pod security policy enabled, just make sure the policy allows the pods to use the host network. Then make sure the load balancer rules are correctly configured to send the traffic to this proxy pod, and also make sure you do the proper allow-listing on the Private Link traffic before any IPv6 traffic can come into your Kubernetes cluster. And then the proxy works.

We got a lot of benefits from this solution. First of all, it's straightforward and easy to troubleshoot. The deployment is managed by Kubernetes, so it's very easy to update the proxy, and because it runs as a stateless Deployment, it's very easy to scale up as the traffic load increases, just by increasing the pod replicas. Next, we get Kubernetes-native monitoring and logging. And finally, this solution also works for other use cases, for example our data plane to control plane traffic over Private Link.

Here are some screenshots as a demo. We provision this Private Link node pool inside the Kubernetes cluster, which is the special VMSS that accepts both IPv4 and IPv6 traffic. When you deploy the proxy pod, just make sure it has host networking and that it's scheduled only on the Private Link node pool. If we look at the networking at the pod level, by exec'ing into the pod, you can see both the IPv4 and the IPv6 IP. At the VM and pod level these are all private IPs; the public IPs are only on the load balancer. And because of host networking, you see the same network interfaces if you look at the host VM. On the Azure load balancer side, you can see the IPv6 frontend IP provisioned for this purpose. And then it works: with Private Link set up, if you are inside your VNet and look up the workspace URL, it resolves to a local IP address. You just connect to that local IP address, and it shows the Azure Databricks workspace, with the traffic routed through Private Link.

To recap, this isn't only about integrating with Azure Private Link: if you need to support any IPv6 traffic in your IPv4 Kubernetes cluster, you can use this approach. Basically, you enable dual stack, with IPv6 everywhere at the cloud provider infrastructure level, and then you set up a v6-to-v4 proxy. You can deploy it as a regular Kubernetes Deployment; just use host networking, so it can receive the IPv6 traffic and proxy it to IPv4, and the route will work all the way to your Kubernetes services. A rough sketch of what such a deployment can look like follows below.
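For concreteness, here is a minimal sketch of that pattern, assuming an NGINX-based TCP proxy: a ConfigMap holding an NGINX stream block that listens on the node's IPv6 interface and forwards to an IPv4 backend, and a Deployment that runs it with hostNetwork on the dedicated Private Link node pool. All names, labels, addresses, and ports are made up for illustration; this is not Databricks' actual configuration.

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: privatelink-proxy-config
data:
  nginx.conf: |
    # Minimal v6-to-v4 TCP proxy: terminate IPv6 on the host interface and
    # forward to an IPv4 backend. Addresses and ports are placeholders;
    # requires an NGINX build with the stream module.
    events {}
    stream {
      server {
        listen [::]:8443 ipv6only=on;
        proxy_pass 10.0.0.100:443;
      }
    }
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: privatelink-proxy
spec:
  replicas: 2
  selector:
    matchLabels:
      app: privatelink-proxy
  template:
    metadata:
      labels:
        app: privatelink-proxy
    spec:
      hostNetwork: true                   # pod shares the node's dual-stack interfaces
      dnsPolicy: ClusterFirstWithHostNet  # keep cluster DNS while on the host network
      nodeSelector:
        pool: privatelink                 # hypothetical label on the dedicated node pool
      tolerations:
        - key: pool
          value: privatelink
          effect: NoSchedule              # assumes the node pool is tainted for the proxy only
      containers:
        - name: nginx
          image: nginx:1.19
          ports:
            - containerPort: 8443
          volumeMounts:
            - name: config
              mountPath: /etc/nginx/nginx.conf
              subPath: nginx.conf
      volumes:
        - name: config
          configMap:
            name: privatelink-proxy-config
```

If pod security policies are enabled in the cluster, the policy that applies to this Deployment also needs to allow hostNetwork (and the relevant hostPorts range), or the pods will be rejected at admission.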
I think this is a good way to support IPv6 traffic with IPv4 Kubernetes if you don't want to use the dual-stack feature yet. We delivered this: Azure Databricks Private Link is in private preview now and is available in the Azure cloud. That concludes our KubeCon talk; hopefully it's useful to you. Thank you very much for joining our talk, and feel free to ask questions. Thank you.