I guess let's get started. So hi, everyone. I'm Teng Feimu, engineering manager for the image container team at LinkedIn, and with me is my colleague Aviv Shahab, staff engineer at LinkedIn. Today we are going to talk about LinkedIn's journey of adopting Kubernetes for a unified cluster management platform. As you can see from this picture, our journey is a Star Wars story, assuming many of you have watched Star Wars. Here's the agenda in episode view. We will start by introducing LinkedIn's existing cluster management platform and the new challenges it ran into, which led us to evaluate Kubernetes as our new hope. Then we'll cover how we quickly got started and proved its value by supporting different kinds of machine-learning workloads on Kubernetes, and finally how we are extending Kubernetes and unifying our platform.

So LinkedIn is the world's largest professional network, with more than half a billion professionals in more than 200 countries and territories worldwide. Our mission is to connect the world's professionals to make them more productive and successful. Both of us work in a team called LPS, LinkedIn Platform as a Service. Our mission is to provide a compute platform for LinkedIn engineers that makes them more productive and makes efficient use of our hardware resources. We have four pillars: enable innovation by providing the right building blocks and abstractions, increase developer productivity, increase operations productivity, and increase hardware utilization. This is our scale: we manage thousands of services, every day we have hundreds of thousands of builds and tens of thousands of deployments, and we manage hundreds of thousands of hosts in multiple data centers running millions of containers.

This is a simplified, layered view of our existing cluster management platform. On the very bottom we have InOps, which manages hardware assets and their lifecycle. Then we have RAIN, resource allocation at LinkedIn, and LEAD, the LinkedIn deployment control plane. On top of that we have the PaaS layer: RAIN for auto-scaling and auto-remediation, Orca for short-lived job orchestration, and Nuage as the storage service provisioner. At the topmost level we have Maestro; the idea is that we are moving to this intent-based service blueprint. In this architecture diagram, you can see we have multiple data centers, and each data center has multiple clusters; we call each one a fabric. Each fabric has its own control plane. We have one main Artifactory cluster, plus another in a different data center for disaster recovery, and all the fabrics have their own Artifactory proxies to scale out the read traffic.

Let's go through a typical manual deployment workflow. Developers first come to RAIN to allocate or change resources for their services; they normally create a resource slice, or profile, and add instances to it or remove instances from it. After resource allocation, they can trigger deployments through our deployment control plane, LEAD. We check RBAC against State Vault, check their deployment policies, trigger the config compilation and publish process, and then generate and send the deployment plan to the physical hosts. After the deployment plan gets to a host, the host launcher downloads the artifact image and the configs, then starts the application in a Locker container, which is our abstraction layer over runc. During the container start process, we decrypt secrets in the config, open the necessary firewall ports, and download the TLS certificate for the service, so that service-to-service communication is HTTPS-based.
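To make that launcher flow concrete, here is a minimal sketch in Python. This is not LinkedIn's actual launcher code; every helper name, path, and the `locker` CLI invocation are hypothetical stand-ins for the steps just described.

```python
import subprocess

def download_artifact(service: str, version: str) -> str:
    """Stub: pull the build artifact via the fabric-local Artifactory proxy."""
    return f"/var/artifacts/{service}-{version}.tar.gz"

def download_config(service: str, version: str) -> dict:
    """Stub: fetch the compiled, published config for this deployment."""
    return {"firewall_ports": [8443]}

def decrypt_secrets(config: dict) -> dict:
    """Stub: decrypt any encrypted values embedded in the config."""
    return config

def start_service(service: str, version: str) -> None:
    image = download_artifact(service, version)
    config = decrypt_secrets(download_config(service, version))
    for port in config["firewall_ports"]:
        # Open each port the service declared; iptables is just one way to do it.
        subprocess.run(["iptables", "-A", "INPUT", "-p", "tcp",
                        "--dport", str(port), "-j", "ACCEPT"], check=True)
    cert = f"/var/certs/{service}.pem"  # hypothetical path for the fetched TLS cert
    # Locker is the abstraction over runc; modeled here as a CLI invocation.
    subprocess.run(["locker", "run", "--image", image, "--cert", cert, service],
                   check=True)
```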
Other than the long-running service jobs, we also have batch jobs orchestrated through Orca. Orca started as our internal Jenkins replacement. Basically, it creates a tree of short-running jobs with priorities and then utilizes spare capacity in our common pool; when there are no available resources in the common pool, the jobs remain in the queue. Orca presented challenges in terms of its capability to support our growing batch workloads. Another challenge was how to model emerging workloads, for example AI workloads, in our existing platform. Last but not least, having a portable stack across on-prem and Azure is critical to us nowadays. We felt that Kubernetes and other modern schedulers gave us a new hope for dealing with these challenges.

So we set out to experiment. We set up a 16-node Kubernetes cluster and ran as many Orca post-commit jobs there as possible. Post-commit jobs are jobs that run when a LinkedIn engineer submits code, merges code, or creates a pull request. We found that Kubernetes performed admirably compared to our own stack on measures such as how long it takes to move a job from queued to running. We performed several more tests, and holistically we realized that Kubernetes and other modern schedulers can provide a generational boost to the productivity, operability, and utilization of our application fleet, beyond what our current stack of RAIN, LEAD, and Locker can. Among all the systems we tested, Kubernetes was the one with the most capabilities. It's built in a layered way, there's a robust ecosystem around it, and a lot of components have been built around it that extend its extensibility, as I'm sure you're already aware. And because of that extensibility, we felt it was the scheduler we could integrate best with LinkedIn's architecture.
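As a side note on that queued-to-running measure: it can be approximated as the gap between a pod's creation timestamp and its PodScheduled condition flipping to True. Here's a minimal sketch using the official kubernetes Python client; this is not our actual test harness, and the namespace name is made up.

```python
from kubernetes import client, config

config.load_kube_config()
v1 = client.CoreV1Api()

# Time spent queued = pod creation -> PodScheduled condition becoming True.
for pod in v1.list_namespaced_pod("post-commit").items:  # hypothetical namespace
    created = pod.metadata.creation_timestamp
    scheduled = next((c.last_transition_time
                      for c in (pod.status.conditions or [])
                      if c.type == "PodScheduled" and c.status == "True"), None)
    if scheduled is not None:
        print(f"{pod.metadata.name}: queued for {(scheduled - created).total_seconds():.1f}s")
```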
So we now had two paths forward: start researching our pluggable scheduler architecture idea, where we would decide how to integrate Kubernetes with the rest of LinkedIn and complete that work, or pick our first use case. There were a lot of teams looking to move onto Kubernetes, and we needed to support them. So we decided to do both, but first I'm going to talk about our first use case.

Our first use case was supporting Jupyter notebooks on Kubernetes. When we proposed this to LinkedIn's security team, their question was: Kubernetes is not integrated with any security system we have at LinkedIn, so is it going to bring down LinkedIn's security, or can it be integrated? So we decided on our integration strategy with security first in mind. We deployed Kubernetes on RAIN, so both the control plane and the application plane, the kubelets, run on RAIN. When they get deployed, they can get a certificate from our own certificate server, and they can also send metrics to our own metrics service. This allows us to discover and manage Kubernetes clusters just the way we've been managing and discovering RAIN. JupyterHub on Kubernetes is mainly two kinds of pods: an orchestration pod, which is the hub itself, and the actual worker pods, the single-user notebooks that an AI engineer would use to do their work.

Both of these types of pods can get certificates from our certificate server and talk to the rest of LinkedIn using LinkedIn's own certificate authority. The user's workflow is like this: the user logs in to our ML platform UI and gets a two-factor authentication token from our certificate server. JupyterHub uses that token to figure out which user to spawn the notebook pod for, and the notebook pod gets the token mounted on it as a secret so it can be used for subsequent actions. Users' notebooks are loaded from a Git-backed storage service we have at LinkedIn, which lets users do their work, save it in that Git-backed repository, and pick up where they left off when they come back to the notebook platform. And because the token is available to the notebook as a secret, it can be used to talk to our HDFS cluster securely. When the user launches a query against our HDFS cluster, we use a Spark context manager called Apache Livy, and Livy uses that token to identify the user and then talk to our secured HDFS cluster on their behalf.

Once we productionized JupyterHub, we thought the Kubernetes journey at LinkedIn would be very smooth; this is usually what day one with Kubernetes looks like. And that brings us to our first war story. One fine evening, our notebook pods started crashing, and we thought the flannel overlay network we were using was the one to blame, the root cause of it all. When we looked up advice about this on the internet, the prevailing advice was that we should nuke our cluster and start over. Luckily, we didn't do that; we kept digging into what the issue could be. Our next root-cause hypothesis was the CoreDNS pod we were running to serve DNS for all pods, but when we decoupled CoreDNS from the JupyterHub and notebook pods, the issue still didn't go away, so that hypothesis was not correct either. A few days into debugging, we started having internal arguments within the team about whether we should actually listen to the advice on the internet, kill our cluster, and start over. Then we got our first clue: all pod-to-pod networking was broken. That issue is not related to JupyterHub or notebooks; it's a deeper issue. So we had to take a deep look at pod-to-pod networking and really try to understand how it works.

This is how Kubernetes pod-to-pod networking should work: all pods have their own IP address, and when the JupyterHub pod wants to send an IP packet to the notebook pod, it just addresses the packet to the notebook's IP address and it should get there. In reality, however, this packet has to go out of the host where the JupyterHub pod is, travel through the routers and switches in the data center, and then enter the host where the notebook pod is. For that to happen, the IP address on the packet has to make sense to the routers and switches, and the internal pod IP addresses used in Kubernetes do not. So the way it works is that when a packet is sent from JupyterHub, the packet addressed to the notebook pod makes it to the flanneld daemon on the host, and what flanneld does is wrap this packet with the IP address of the destination host, the host where the notebook pod is.

Now this IP packet can make it out of the JupyterHub pod's host and travel to the host where the notebook is. The double-wrapped packet arrives at the notebook's host and goes to the flanneld daemon there, whose job is to unwrap it; once unwrapped, the packet carries the address of the notebook pod and makes its way to the notebook. Unfortunately, there's more to this madness. In both cases, when the packet is going out of the JupyterHub host or into the notebook host, we're asking the hosts to act as routers. On egress, we're asking them to take an IP packet coming from an internal network and send it to the external network; on ingress, to take a packet from the external network and send it to the internal network. Linux will not do that unless it is explicitly told to by setting a kernel flag, the IPv4 forwarding flag. Once we set this IPv4 forwarding flag, things started working. The root cause of the issue was that we have an underlying automation platform that sets up our hosts, and that platform had gone in and unset the IPv4 forwarding flag on a bunch of our hosts, which caused this problem. Once we fixed that issue in the underlying automation platform, notebooks were happy and users were happy.
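In retrospect, the fix itself was tiny. Here's a sketch of the equivalent check in Python, assuming root privileges and the standard procfs path; it does the same thing as `sysctl -w net.ipv4.ip_forward=1`.

```python
IP_FORWARD = "/proc/sys/net/ipv4/ip_forward"

def ensure_ipv4_forwarding() -> None:
    with open(IP_FORWARD) as f:
        if f.read().strip() == "1":
            return  # host already forwards between pod and physical networks
    with open(IP_FORWARD, "w") as f:  # needs root
        f.write("1\n")

ensure_ipv4_forwarding()
```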
After this, we thought we were done with war stories and Kubernetes was going to be successful at LinkedIn. That brings us to our second war story. One fine evening, notebook pods stopped working. We thought the flanneld daemon was to blame, and when we looked up advice on the internet, the prevailing advice was, again, nuke your cluster and start over. This time we quickly identified that pod-to-pod networking was not the issue. Taking a closer look, we found that our JupyterHub pod was complaining that it could not reach the API server; that's why it started crashing and dominoing crashes into the notebooks. So we looked at how the JupyterHub pod is constructed. In our case, the JupyterHub pod has three containers: an init container that gets certificates from our certificate server and makes them available to the subsequent containers, then an Nginx container to set up routes, and a hub container to orchestrate pods. In this particular case, it was the Nginx container complaining that it didn't have the KUBERNETES_SERVICE_HOST environment variable set and therefore didn't know where the API server was. And because it didn't know where the API server was, it didn't work; it just crashed. In our clusters, we set the KUBERNETES_SERVICE_HOST environment variable on every pod using a PodPreset. A PodPreset, if you don't know it, is something you can use to say: do this before the pod is set up. In our case, we basically say: set the KUBERNETES_SERVICE_HOST environment variable in the pod's environment before you start the pod. It's always set to the FQDN of the API server, which ensures users can talk to the API server securely, because the FQDN is the only name in the API server's certificate. Before Kubernetes version 1.14, init containers did not honor PodPresets, so presets could not be applied to init containers, and what we did instead was hard-code the KUBERNETES_SERVICE_HOST value in the init container. The init container limitation was fixed in 1.14, but we didn't know that when we upgraded.
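For concreteness, here is a rough reconstruction of what such a preset looks like. PodPreset lived in the alpha API group settings.k8s.io/v1alpha1 and has since been removed from Kubernetes; the names and FQDN below are hypothetical, not our actual manifest.

```python
# Rough shape of a PodPreset that injects the API server FQDN into every pod.
pod_preset = {
    "apiVersion": "settings.k8s.io/v1alpha1",
    "kind": "PodPreset",
    "metadata": {"name": "apiserver-host", "namespace": "notebooks"},
    "spec": {
        "selector": {"matchLabels": {}},  # empty selector: match every pod
        "env": [{
            "name": "KUBERNETES_SERVICE_HOST",
            # always the API server FQDN, the only name in its certificate
            "value": "kube-apiserver.fabric.example.com",
        }],
    },
}
```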
So the PodPreset controller tried to apply the KUBERNETES_SERVICE_HOST value to the init container and found a conflict: a variable with the same key but a different value was already there. And once it failed there, it failed to apply that environment variable to any subsequent container. Once we fixed this conflict, things started working again and the notebook pods were happy.

After this, we extended to a batch use case. Users wanted to be able to launch distributed TensorFlow training right from their notebooks, so we enabled that using Kubeflow's Fairing library. Users annotate their notebook code, and when they run it, it talks to the TF operator on the cluster; the TF operator then launches a TensorFlow cluster with the workers distributed across the Kubernetes cluster. These workers can talk to the NVIDIA device plugins on their hosts to get GPUs allocated to them. This enabled users to run very large distributed jobs and take advantage of GPUs, all while talking to our secured HDFS cluster.

Our next use case was another online use case: serving models directly from our Kubernetes cluster. Let's say there is a production service that wants to identify an image of a cat. It sends the image to the model service, and what it expects back is the word "cat". That request gets forwarded to a TF Serving pod in our model-serving cluster, and the TF Serving pod routes it to a model deployed by our model deployment system. This can also take advantage of GPUs on the box, so it can answer a lot of queries in parallel, cache queries, and things like that.

With these three use cases, we had started supporting model authoring through Jupyter notebooks, model training through distributed TensorFlow training, and model serving through TF Serving, and these are significant parts of an ML pipeline. So naturally, we now had to figure out: do we buy an entire pipeline from open source, or do we build it in-house? LinkedIn already has significant ML infrastructure built on the Hadoop/YARN stack, so we had to compare the strengths and weaknesses of the Hadoop/YARN stack versus the Kubernetes stack. The Hadoop/YARN stack is very well integrated into LinkedIn and has capabilities, such as hierarchical queues and preemption, that we use a lot. Kubernetes, on the other hand, is very strong, with mature container support, and its API is implemented by all the new ML training frameworks. We had to bring both of these worlds together, because there are strengths on both sides, and our next-gen ML pipeline had to adopt the best of each.

To bridge the gap, we started initiatives such as allowing users to securely access HDFS directly from their Kubernetes clusters. Kerberos is what's used across our HDFS clusters at LinkedIn, and we are building a product, which we may open source, to support Kerberos directly from the Kubernetes cluster. The way it works is that when a user logs in and submits a job to Kubernetes, they can submit a Kerberos ticket along with it, and that ticket is handled by a delegation token controller and a delegation token service that we are building in the Kubernetes cluster. The delegation token controller and service then talk to the HDFS cluster to get a Hadoop delegation token and mount it into the worker pods. This allows the pods to talk directly and securely to the HDFS cluster, and once a worker is done, the delegation token can be revoked. If a worker's lifetime spans longer than the delegation token's expiry, the token gets automatically renewed by the delegation token service.
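Here's a heavily hedged sketch of that flow in Python. Since the product isn't open sourced yet, the Kerberos/NameNode exchange is a hypothetical placeholder; only the Secret plumbing uses the real kubernetes client API.

```python
import base64
from kubernetes import client, config

def fetch_hdfs_delegation_token(kerberos_ticket: bytes) -> bytes:
    """Hypothetical: authenticate to the NameNode with the user's Kerberos
    ticket and return a Hadoop delegation token."""
    raise NotImplementedError

def publish_delegation_token(namespace: str, job: str, ticket: bytes) -> None:
    token = fetch_hdfs_delegation_token(ticket)
    secret = client.V1Secret(
        metadata=client.V1ObjectMeta(name=f"{job}-hdfs-token"),
        data={"hdfs.token": base64.b64encode(token).decode()},
    )
    config.load_kube_config()
    client.CoreV1Api().create_namespaced_secret(namespace, secret)
    # Worker pods mount this Secret as a volume; a renewer loop would refresh
    # it before expiry and delete it when the job finishes.
```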
All right. So after supporting both online and batch AI workloads on our Kubernetes platform, which proved the value and built confidence for the team, we are now working on extending Kubernetes and unifying the platform. Here are more integrations we are working on: PaaS-layer integration with RAIN and Orca, and unified topology integration. The unified topology provides a single shared view of LinkedIn's application fleet, basically what runs where. Many internal systems depend on such a view, for example the certificate management server, the distributed firewall system, and DNS discovery. We are also working on RBAC integration with State Vault using admission webhooks, and pod certificate integration using init containers. On the physical hosts, we have also implemented a CRI for our Locker containers; the idea is that the kubelet can speak to the Locker CRI out of the box, and the init container in the Locker CRI decrypts the secrets in the config and opens firewall ports if necessary. As you can see, with all these integrations, Kubernetes adoption at LinkedIn is really taking off.

So let's wrap up. Kubernetes and its whole ecosystem are really powerful and moving very fast. However, day-two operations and integrations, especially with a huge legacy infrastructure, can be challenging. Our ultimate goal is to create a unified compute platform on top of Kubernetes, and starting with emerging AI workloads to prove the value and build confidence was a great kickstart for us. Cloud native is more than cloud only: Kubernetes and its ecosystem embody cloud-native best practices and provide a cloud native approach to modernizing legacy infrastructure, which has huge value for large-scale enterprises. All right, that's it for this talk. Thank you for coming.