Okay, cool. It's 11 a.m. Let's get started. Thank you everyone for attending my talk today. My name is Shirley Yang and I'm an engineering manager at LinkedIn. Today we're going to talk about how LinkedIn stabilized its journey to an AI-first data lakehouse by provisioning 20,000 ephemeral clusters every year.

A little bit about myself: I've been at LinkedIn for seven years, and currently I'm working in LinkedIn's big data platform team. There are two areas I'm responsible for. One is the foundations team, which handles all the horizontal concerns within the offline stack, including developer productivity (that is, how we build these products), security, cost efficiency, as well as some of the resilience work. The other area is the LinkedIn Airflow team, where we work on the data processing jobs and some of the workflow orchestration within LinkedIn.

So, the agenda today: we will start by talking about our problem statement, then share our approach to solving it as well as the results. We'll also share some of our learnings and continuing development, summarize the talk, and touch a little bit on our next steps.

When I talk about 20K ephemeral clusters, I first want to share what the LinkedIn offline stack looks like and what these ephemeral clusters are simulating. Currently LinkedIn has 10+ Hadoop clusters, which translates to more than 35K total nodes and 4+ exabytes of storage. We run 500K jobs per day, which translates to around 100 million container allocations every day. On the whole offline stack we have 100 engineers, including both devs and SREs, maintaining 30+ core services, and we keep growing. Our users are essentially all of LinkedIn: AI engineers, data scientists, analysts, and everyone else who requests access to the offline stack.

So what are the challenges of running at such a large scale? I want to share two examples here, two Slack messages I took screenshots of. The first one is a machine learning error. From the emoji you can see our AI engineers are pretty frustrated, because their 11-hour job failed again; the reason was that it failed to save a checkpoint due to an HDFS connection error. The second one is a fairly common kind of issue within LinkedIn, because our services are interdependent on each other. In this specific case it's Trino: Trino released a backward-incompatible change, which broke the Jupyter notebooks that data scientists and AI engineers use a lot, completely blocking them for a couple of hours until it was fully rolled back. When you have such a large scale and all these services are interdependent with each other, your infrastructure just becomes very unpredictable.

On the other hand, our business, like every other company's, keeps growing. LinkedIn has transitioned its business to be AI and machine learning centric, so we constantly have multiple big initiatives going on at the same time. Here is a snapshot of the current ongoing initiatives; the orange ones are the new initiatives. At the top level, we are heavily investing in data pre-processing and training, and we introduced Megatron-DeepSpeed and large language models into our stack.
To support those, at the pipeline level we introduced Flyte, which is known for fast iteration on machine learning workflow orchestration, as well as Airflow as a replacement for the previous Azkaban, to orchestrate the data processing jobs. In the compute layer, we are shifting entirely from YARN to Kubernetes, introducing Spark on K8s into our stack, and also experimenting with Volcano for offline workload scheduling. In the metadata layer, we are adopting unified SQL on top of Apache Iceberg; LinkedIn just open sourced its product, OpenHouse, I think a couple of weeks ago, so if you're interested, feel free to check it out on GitHub. Lastly, we are adopting object storage. LinkedIn is currently building its own object storage, which gives better IO and read/write throughput for machine learning. One reason is, as I showed before, that HDFS is sometimes not that stable, so we want to eliminate those issues. Another is that you want faster throughput for datasets and for storing experiments and even code across the stack. We're also doing a security revamp: LinkedIn is still using the Hadoop delegation token, and with all of these changes in our stack it can no longer satisfy our needs, so we are leveraging SPIFFE tokens as well as SPIRE, and we want to adopt RBAC- and policy-based security mechanisms.

How do our engineers feel about all of this? Of course, our engineers are very excited to work on all the new initiatives, like every other engineer, right? On the other hand, they are actually afraid of making new changes. Two examples here. The first one: the last Hadoop upgrade was September 2021. Even though this message was written seven months ago, it's still around two years without a full Hadoop upgrade, so a lot of new changes were not able to be picked up by our stack. The second one is the coordination among our engineers when they try to release changes to our client libraries. At LinkedIn, all the clients of the offline stack live in a single monorepo. The reason we did this originally is that we wanted to make sure we always keep backward compatibility when we release changes across the stack. Now, after a couple of years, our stack has grown and our number of engineers has grown, so this has actually become a bottleneck. Consider having a hundred engineers contributing to the same code base: you become very scared of causing production issues when you release a change. Our engineers constantly need to think about backward compatibility, forward compatibility, and all of these things, so you can see they're even a little bit afraid of releasing changes. And the result is that last year we once saw 15 versions pile up without being released in our client libraries, which caused a big production issue. So we started to debug this, and we needed to solve it.

Like every other company, our development cycle is: we develop, we build, and we deploy to production. Because LinkedIn operates at such a large scale with so many clusters, the bottleneck is deploying to all of these clusters. Consider that deploying to one of the static clusters takes a couple of days.
Now if you find a bug, you have to roll back, go fix it, and go through the whole process again, which may take a couple of weeks, and eventually releasing to all the production clusters may take a couple of months. So it's pretty bad.

So we started to think: what if we could take something ephemeral and spin it up within 10 to 20 minutes? Our engineers could then test on top of this ephemeral cluster, and if we can simulate the production clusters, they can basically just test ephemerally. The whole process becomes: you deploy a cluster in a couple of minutes, you run your tests, and assuming everything is okay, within a couple of hours everything should be good; even in the worst case it may take one or two days to fix everything. The overall process of releasing to all of production can then be reduced from weeks or months to just days, at most a couple of weeks. That would be a large productivity increase.

That's how we introduced Groundhog Day. The idea is that we take a snapshot of the whole offline stack, bundle it, and allow our users to specify the production config they want, so that we can stand up a snapshot of these clusters on Kubernetes where they can run their tests. As you can see, the name comes from the movie Groundhog Day, and what it means is that we can run it over and over again. Since our users are mainly platform engineers, our goals were a simple UX, a learning curve for our engineers that is as low as possible, and letting them integrate this product everywhere they want.

I want to show a demo of how Groundhog Day works, and I hope it works. I hope the font is not too small; if it is, I can show the local one. It's good enough, okay. Basically it takes a JSON config, as you can see, where you can put some cluster-level configs as well as some flow-level configs. As I mentioned, LinkedIn supports three orchestrators: Azkaban (the legacy one), Airflow, and Flyte, so in the type field here you can put any of these based on your need. Because every orchestrator defines a flow differently, you need to put some flow-level config and metadata there, so our system can automatically create the flows for you. This example shows Azkaban, so you need to define what Azkaban calls a project and then a flow; if it's Airflow, you just have a DAG. The zip file is where the code for your tests lives; we download it directly from Artifactory, and it's then used automatically to execute the flows. I'm going to fast-forward a little bit so we don't spend too much time.

As I mentioned, we want a simple UX, so basically there's a CLI to execute our flows, and in this example we integrate it with our pre-commit, the CI process. One thing I want to mention is this cluster ID here; it's a UUID. The reason is that at any time we have many ephemeral clusters running in our Kubernetes namespaces, and we want to allow our users to access their own ephemeral clusters, so we need a unique identifier. In this one, it starts with H5U.
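To make the shape of this more concrete, here is a minimal sketch of what driving such a run from a pre-commit hook could look like. The CLI name ("ghd"), its flags, and every config field below are hypothetical; they only mirror the structure described in the demo (cluster-level config, orchestrator-specific flow config, a test artifact pulled from Artifactory), not the actual internal tool.

```python
# Hypothetical sketch of kicking off an ephemeral-cluster run from pre-commit.
# The "ghd" CLI, its flags, and all config field names are made up for
# illustration; only the overall shape follows the demo described above.
import json
import subprocess
import tempfile

config = {
    # cluster-level config: which production environment to mimic
    "environment": "prod-cluster-a",      # hypothetical environment name
    # flow-level config: orchestrator-specific metadata
    "flows": [
        {
            "type": "azkaban",            # could also be "airflow" or "flyte"
            "project": "my-test-project", # Azkaban groups flows into projects
            "flow": "sanity-check-flow",
            # test code is pulled as a zip directly from Artifactory
            "artifact": "com.example:sanity-tests:1.2.3",
        }
    ],
}

with tempfile.NamedTemporaryFile("w", suffix=".json", delete=False) as f:
    json.dump(config, f, indent=2)
    config_path = f.name

# Spin up the ephemeral cluster and run the flows. The CLI output includes
# the UUID cluster ID and a deep link to that cluster's orchestrator UI.
subprocess.run(["ghd", "run", "--config", config_path], check=True)
```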
In the ephemeral orchestrator UI you will also find this UUID. This screen shows that all the flows executed successfully; it's not showing clearly here, but I'll show it later. The rest is the same experience as running a flow on a normal static cluster. Because this one uses Azkaban, you see an Azkaban UI, and as I mentioned, the URL contains this UUID identifier. To let users easily find this URL, we construct it in the CLI output, so the user can directly click on it. The UUID also shows up here so that when you log in, if you have multiple ephemeral clusters, you know which one you want. The remaining steps are just logging in as a normal user; this is the same integration as for a production cluster, because we want to make sure that even when you're testing, the whole security flow, the authentication, still works. All of the projects you defined in the config are automatically created, and then you have your flows in the project, the executions, your DAG, and your flow logs, which you can use for debugging, as well as your job logs.

In addition to that, sometimes you can't find all the logs in the orchestrator. For example, if you have a Spark application, it depends on what your scheduler is. When we developed this we were still using YARN, so this demo uses YARN: if a user wants to debug a specific Spark application, they need to go to the YARN ResourceManager UI. We also output that URL in our log so the user can just click on it directly, and as you can see, it's basically the same experience as a static cluster: you click on it and you find your logs.

Okay, cool. So the next thing is the results. As I mentioned, Groundhog Day enables developers to provision our data lakehouse clusters on the fly using Kubernetes. Users can integrate it anywhere: spin it up from a dev box or integrate it with their CI. We also allow users to customize the production environment in their configs. After we launched this product, our compute team's CI success rate increased from 68.1% to 81.7%, and the storage team's CI success rate increased from 76% to 87.5%. On scale: I mentioned we run 20K, but the real number is 23.5K ephemeral clusters every year and 500K flows. And since we launched it last year, we have caught 2.1K flow failures in pre-commit; more than 2,000 bugs and production issues were caught before the code was even released to any cluster.

For the architecture: we're at KubeCon, so you can assume that everything is deployed on Kubernetes. Both the control plane and the data plane of our Groundhog clusters are deployed on Kubernetes. Internally, the control plane contains two services, the orchestration service and the history service. The responsibilities are as the names suggest: the orchestration service talks to the Kubernetes API to spin up these ephemeral clusters, and the history service keeps all the metadata for every execution, so users can debug, inspect, or replay their executions.
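To make that control-plane idea a bit more concrete, here is a minimal sketch of the pattern, not our actual orchestration service: give each ephemeral cluster a unique ID and carve out a dedicated Kubernetes namespace for it through the Kubernetes API. The namespace naming and labels below are assumptions for illustration.

```python
# Minimal sketch of the orchestration-service pattern: tag each ephemeral
# cluster with a UUID and create a dedicated Kubernetes namespace for it.
# Illustration only; names and labels are made up.
import uuid
from kubernetes import client, config


def provision_ephemeral_cluster() -> str:
    config.load_kube_config()             # or load_incluster_config() in-cluster
    cluster_id = str(uuid.uuid4())[:8]    # short unique identifier for this cluster
    namespace = f"ghd-{cluster_id}"

    core = client.CoreV1Api()
    core.create_namespace(
        client.V1Namespace(
            metadata=client.V1ObjectMeta(
                name=namespace,
                labels={"app": "groundhog-day", "cluster-id": cluster_id},
            )
        )
    )
    # Next steps (not shown): install the release for the offline-stack images
    # into this namespace, and record cluster_id plus execution metadata in a
    # history store so runs can be inspected or replayed later.
    return cluster_id
```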
Now I want to talk a little bit about how we are able to mimic the production environment. The way we do it is by integrating with our central release system, which is the source of truth for all the versions deployed in every environment. Whenever a user specifies the environment they want to mimic, we talk to the central release system and it gives us the list of services and the versions deployed in that environment. We then override those versions in the user-specified Helm chart, so when Kubernetes spins up the cluster it knows exactly which images to download and spin up. The data plane contains a group of Kubernetes resources used to create the flows I showed before and to execute the executions. Users can access their ephemeral clusters through an edge proxy, which is what I showed you just before, and if users want to log onto a specific container, they can just use the normal Kubernetes way, like port-forward.

After we launched this, the first thing we wanted to ensure is that we keep this platform as stable as possible, and the first thing for that is metrics. As we inspected the metrics in our ephemeral clusters, we noticed they were either sparse or had no data, basically as shown here. It doesn't work, right? We started to debug a little bit and noticed the reason was that at LinkedIn the metrics were not multi-dimensional. What that means is: consider a simple REST service emitting, say, status codes. Each status code you emit becomes its own graph, so four status codes become four separate graphs. That's not ideal, but it's still workable for REST services, because in the end you have a finite number of status codes. It doesn't work for our ephemeral case, though, because our ephemeral clusters come and go and every spin-up is different. We definitely needed to solve this problem, and like every Kubernetes service, the first thing we thought about was integrating with the Prometheus SDK.

How do we do it? Like everyone else, we use OpenTelemetry, and we expose different clients to our users if they want to use this tooling. For ourselves, we use the Prometheus SDK, and we also keep the StatsD client open because we want to stay backward compatible in case users want to keep using the LinkedIn-provided internal metrics service. On every host there is an OTel agent, and we expose one single protocol, OTLP. On the metrics collector side, for ourselves we use the OTel gateway, which is also open source, so we get a nice Grafana UI, and we can also integrate with MDM, a metrics service developed at Microsoft, and Kusto; as you can imagine, LinkedIn is a Microsoft company, so we try to adopt our own things as much as possible. On the other hand, we also maintain a fork of OpenTelemetry so we can make some improvements and ingest the metrics into our conventional metrics system, which is called inGraphs, as shown in the first screenshot earlier.

After we released this, for ourselves we definitely see very nice graphs, and not only for the operational metrics: we also enabled business metrics in the same Grafana dashboards.
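Going back to the multi-dimensionality point for a second: as a small illustration (using the open source Prometheus client library rather than our internal OTel pipeline, and with made-up metric and label names), the change is to emit one metric with label dimensions instead of one hard-coded graph per value. A minimal sketch:

```python
# Sketch: one labeled metric instead of a separate metric name per status code
# or per cluster. Metric and label names here are illustrative only.
from prometheus_client import Counter, start_http_server

FLOW_EXECUTIONS = Counter(
    "flow_executions_total",
    "Flow executions on ephemeral clusters",
    ["orchestrator", "status"],   # dimensions instead of separate metric names
)

def record_flow_result(orchestrator: str, status: str) -> None:
    FLOW_EXECUTIONS.labels(orchestrator=orchestrator, status=status).inc()

if __name__ == "__main__":
    start_http_server(9464)       # scrape endpoint for an agent or collector
    record_flow_result("azkaban", "succeeded")
    record_flow_result("airflow", "failed")
    input("metrics exposed on :9464, press Enter to exit\n")
```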
Business metrics are actually very important for me, because as a manager you want to showcase the impact. Operational metrics are good for our on-calls: we have alerts and metrics to inspect. Not only that, these two have been adopted by every service at LinkedIn that either wants to pull from open source or tries to open source itself, for example the OpenHouse service I mentioned before, built on top of Apache Iceberg. Eventually we worked with the monitoring infra team at LinkedIn and handed this tooling over to them, so they are developing on top of it, and it has become a general tool at LinkedIn for all the Kubernetes use cases.

But are metrics enough, right? Metrics help you detect, but do they help you actually debug your issues? This screenshot, the Slack message, is a user reaching out to us: "Hey, our integration test has failed. How do I debug?" If you run integration tests, you know they're good because they test the compatibility across all your dependencies, but they don't always help you debug: you don't know whether it's your code issue, a dependency issue, or an infrastructure issue. So you want a way to debug quickly. The other side is our own engineers, the on-call perspective: whenever there is an issue, you are blocking every user's CI, so we need to quickly detect, and actually not just detect but debug, the issue. If it's our problem, we need to fix it ourselves; if it's another component's problem, for example Spark or HDFS releasing something bad, we need to work with them to solve it ASAP.

So we definitely needed to improve our debugging experience so we can debug issues quickly. We took inspiration from an open source tool called Arthas. Basically, driven by a YAML config on top of it, this tool lets you change the log level and expose stack traces at the JVM level. It works quite well for us because most of our services, as you can see for the Hadoop ecosystem, are JVM services. We started by using the plain tool ourselves in our on-call, and we found it pretty useful because our debugging got much faster. After doing that for ourselves, we wanted to expose it to our users, so we integrated it with the Groundhog Day CLI and exposed it so users can debug too. After we launched this, we got pretty good feedback from our users, and our own on-call experience has improved a lot.

So, takeaways. What is Groundhog Day? Groundhog Day is an ephemeral data lakehouse hosted entirely on Kubernetes. This product helped our platform engineers improve their developer productivity by enabling much faster iteration. Along the way, we also made some foundational efforts across LinkedIn on observability, and we improved debuggability by introducing the Arthas tooling into LinkedIn. With that, we believe we can keep up with the pace our ML innovation is driving.

Now, this thing is not all shiny and bright; it actually took us two years to build. Fundamentally, the reason is that it's a horizontal problem.
It's an intersection of everything in the offline stack, and sometimes it's really hard to set the ownership right; that's one thing. Because it's a big company, you need to negotiate with different teams to make sure they agree to take the on-call for your product and maintain this platform.

Two lessons we learned. The first is that DevOps is increasingly important. Because LinkedIn is such a large company, we initially had a clean separation between devs and SREs. Since last year, we have been transitioning from that model to one where everyone needs to be DevOps. What that means is that regardless of your title, you need to write your code, deploy your code, and monitor your deployment and your product. Previously our developers were only responsible for writing code and tests; they had to change the way they work and learn all the tools they weren't used to: Helm charts, Terraform, the kubectl commands, all of these things. It took them some ramp-up time to get used to it.

The other lesson is that the most important part of the whole tool is that it needs to be able to reproduce the production environment, which means you need to version everything. That sounds very easy, especially for everyone here, since you're cloud native. But as LinkedIn was just going through this transition, it also took some time for folks to ramp up. Even now, if you look at some of the documentation for the Hadoop ecosystem, the deployment docs say: build your RPM, download it onto your host, and restart your service. That's not the model anymore in this cloud native world. You need to build an image for your code, mount your config, use Helm charts for your deployments, and check in everything. It took our SREs a while to get used to this model, and in the beginning, because they didn't remember to check in everything, we were not able to fully mimic the production environment. That's why the whole thing took a little longer than we expected.

Lastly, what do we want to do next? We definitely want to improve efficiency; what I mean by that is we should be able to spin up the whole cluster within five minutes. There are two parts to the why. The first is why we say data is important: we believe data will continue to be central in this AI-centric world. I'm quoting a blog post from James Betker, a research engineer at OpenAI, who wrote a very interesting post that I found very insightful; if you're interested, check it out. My reading of it is that data is the key part of all this AI innovation. Think about it: even now, a lot of big institutions, and we ourselves, just take pre-trained big models, apply our own data, and get the results we want. With some datasets, like one that was actually shown this morning, you can just train on your local laptop.
But the data pre-processing, the compute, is the part you continuously need to invest in, and we need to ensure our platform engineers can iterate quickly on all of this.

The second part is: why do we say five minutes? This comes from our customer feedback. Thirty minutes is basically not usable; it's too long. At 10 to 20 minutes, most of them are willing to use it in their pre-commit: they're willing to wait that long to do a sanity check before checking in code. Only if we can make it five minutes are they willing to use it more often in their development loop: you spin something up, experiment with your code, tear it down, do more development, and spin it up again.

How do we do this? Two ways. First, cache as much as possible: everything should be local as much as possible, except for the necessary parts. You patch in the necessary diff and upload the necessary jars, but the rest, your images and your dependencies, should be local so the cluster can just spin up instead of downloading from a remote image registry. Second, we want to expose a way for users to selectively spin up only certain components. Think about it: if you're a Spark engineer, you probably don't care whether your ephemeral cluster has Trino or Flink in its stack. You just need your orchestrator (in our case also some of the shuffle service), but you don't need all the other compute engines in your cluster. You need some storage so you can import data in, and that's it. By allowing users to selectively choose the components they want to spin up, we can reduce the overhead of spinning up all the different pods and services.

With that, I've concluded my talk, and I think we have enough time for Q&A.

"You hear me well? Yeah. One quick question. When you say you duplicate everything with snapshots to get your stack up and running in 25 minutes, I understood that it is all the versions of your containers that are snapshotted and redeployed automatically. What about the data? Because most of the time the problem is to initialize with accurate data, similar to what is in production."

Yes, so to repeat the question: how do we get data in for our users to run end-to-end tests, since sometimes accuracy depends on the data? There are two ways for users to do so. Most users will just choose to import some data from our staging cluster. The reason is that the production clusters contain PII data, basically people-related data, which from a security perspective we cannot just import. So we have obfuscated data from production in our staging cluster that we can import. Of course, because it's an ephemeral cluster, you can't have basically infinite storage, so we don't expect users to import a lot of data, just a small amount, at most I think around 100 megabytes. And because it's running in CI, it's mostly for functional testing. We also have a separate pre-prod cluster, which runs some of the flows before production.
Those flows have some production data and have passed the security requirements, so they can run as the final sanity check before we fully release to production. Does that answer your question? "Sure, thank you." Cool. If there are no other questions, thank you for having me today. Enjoy the rest of your conference.