Hello everybody, my name is Weiwei and I'm from the AI/ML Data Infra team at Apple. This was supposed to be presented by me and Bowen, but Bowen couldn't make the trip, so I will be presenting it alone. The session is "Beyond Experimental: Spark on Kubernetes." If you're running Spark, whether it's on Kubernetes or not, this is the right room for you, and if you're not running Spark but you are running some batch jobs, I think this is also very relevant.

Today's agenda: the first part is about how to build a cloud-native data platform. Everybody is talking about cloud-native, so what does it mean for a data processing platform whose backend is Spark? In the second part, I'll talk about the challenges we met when we built the platform - I believe most people hit the same ones - and how we leveraged open source software to tackle them. The third part is about the additional problems we saw when we went to scale, and how we address those issues as we scale up. The last part is future work.

What we want to do here is build a truly cloud-native batch processing and interactive analytics platform on top of Kubernetes, powered by Apache Spark. It handles large-scale data processing and machine learning workloads, for different users and all sorts of jobs. The platform supports scheduled workflows, ad hoc queries, batch jobs, and interactive sessions. In terms of job types, some jobs are mission-critical and need to finish as soon as possible, while others are POC jobs - sandbox jobs that users submit more or less at random. In terms of execution time, some jobs take a few minutes or even seconds, but some take days. Put all of this together and we are essentially building a pretty complex system that supports all of those workloads without pushing the complexity onto users. Users just want to do the easy part and have everything work. That's the user expectation, and our job is to make it that simple, building on cloud-native software and its ecosystem.

First, let's talk about the architecture evolution. I'm from the old Hadoop world - I spent many years working on Hadoop. In those days we were running Spark on clusters of thousands of nodes, and at that time we really wanted to build giant clusters. The reason is that data and compute shared the same hardware: when you wanted to store more data, you scaled up the cluster, and when you wanted to run more compute, you also scaled up the cluster. That's why we kept adding nodes and the cluster just got bigger and bigger. So when we wanted to replicate everything on the cloud, on Kubernetes, the first thing that came to mind was to do the same: build a giant compute cluster. The beauty of Kubernetes on the cloud is that we can disaggregate compute and storage and operate our own compute clusters, so why not just build really large compute clusters? But that first attempt was definitely not the right solution. When we evaluated this path, we realized there were two major issues. One is that it's really hard to scale Kubernetes to thousands of nodes.
We have tried, but beyond 1,000 nodes we saw a lot of problems and we didn't want to deal with that. The second challenge is that sharing resources between batch jobs is really hard. When we put all of those user jobs on one cluster, users are constantly competing for resources, and teams complain about why they aren't getting resources and why their jobs are starved. Solving these two challenges became the main problem, and that's what took us to the next stage.

We then evolved the architecture to be more cloud-native and to leverage what we call dynamic compute pools. Dynamic compute pools are really nothing more than splitting the giant compute cluster into many smaller clusters - clusters at a size we are comfortable managing, that run just fine and give us light operational overhead. To present that to users, we added a layer, labeled number one here: the batch processing gateway. The reason to introduce another layer is to hide how many clusters we have. We don't want users to know that; users don't need to know which cluster their job will run on - that's not something they really care about. So we introduced a layer that is essentially the job API for all end users, and users only need to interact with this gateway. Underneath there are multiple compute clusters, and each cluster looks like the one on the right side. We leverage a Spark Kubernetes operator to handle Spark job submission on each Kubernetes cluster; it is easy to use and we love it. But remember the second challenge: resource scheduling on each individual Kubernetes cluster. That's why we use a Kubernetes batch scheduler instead of the default scheduler.

Putting it all together, we have an architecture like this. The left side is what we expose to end users. First, users talk to the batch processing gateway, which is a set of APIs - REST APIs and command-line tools - so they can submit and monitor their Spark jobs. The second part is tooling and monitoring: the resource UI, the Spark history server where users can look up their historic jobs, the profiler they can use to tune their jobs, and the job UI where they can track their jobs, logs, and metrics. These are exposed to users, and essentially they are the only things users need to deal with. The right side is the compute pools. Each compute pool is a separate Kubernetes cluster that we manage, and we use the YuniKorn scheduler in place of the default scheduler to do the scheduling. For each cluster we scale up and down by leveraging the cluster autoscaler, and we use the Spark operator to manage the Spark jobs. The compute pools themselves can also scale out and scale in: we can spin up new clusters on demand when the existing clusters don't have enough capacity, and destroy clusters when we don't need them. With all of these pieces put together, we end up with a cloud-native platform that serves our purpose.
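To make the operator piece inside each compute pool a bit more concrete: a Spark job ends up expressed as a custom resource that the operator watches and turns into driver and executor pods. The sketch below assumes the widely used open-source spark-on-k8s-operator's SparkApplication CRD purely as an example - the operator and exact fields used in our setup may differ, and the image, class, and resource values are made up.

```python
# Minimal sketch: creating a SparkApplication custom resource with the Kubernetes
# Python client. Assumes the open-source spark-on-k8s-operator CRD
# (group "sparkoperator.k8s.io", version "v1beta2"); all names and values are illustrative.
from kubernetes import client, config

config.load_kube_config()  # use load_incluster_config() when running inside a cluster

spark_app = {
    "apiVersion": "sparkoperator.k8s.io/v1beta2",
    "kind": "SparkApplication",
    "metadata": {"name": "example-etl", "namespace": "spark-jobs"},
    "spec": {
        "type": "Scala",
        "mode": "cluster",
        "image": "registry.example.com/spark:3.2.0",        # hypothetical image
        "mainClass": "com.example.ExampleJob",               # hypothetical class
        "mainApplicationFile": "local:///opt/jobs/example.jar",
        "driver": {"cores": 1, "memory": "4g", "serviceAccount": "spark"},
        "executor": {"instances": 10, "cores": 4, "memory": "8g"},
    },
}

# The operator on that cluster picks this resource up and launches the driver and
# executor pods, which the batch scheduler then places according to queue settings.
client.CustomObjectsApi().create_namespaced_custom_object(
    group="sparkoperator.k8s.io",
    version="v1beta2",
    namespace="spark-jobs",
    plural="sparkapplications",
    body=spark_app,
)
```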
The recipe here goes back to the challenges we mentioned. First, we need a scheduler that works for batch jobs - in our case, Spark jobs - and that's why we chose Apache YuniKorn; we'll talk more about YuniKorn later. The second part is the service gateway, which gives users a very simple interface so they don't need to deal with all of the infra complications - that's the batch processing gateway. The third part is really being empowered by the Kubernetes ecosystem: within it there are so many great components you can choose for networking, logging, and metrics, and we leverage all of those to make the platform better and better.

A typical workflow looks like this. When users submit their jobs, they interact with our gateway service, which exposes REST APIs and a command line so they can simply submit jobs. Once a job is submitted to the gateway, the gateway puts it into a queue. The queue is a very central concept in our system - these are virtual queues. We call them virtual because, across all of the compute clusters, we give users one central virtual queue to manage resources, and the resources behind a queue actually come from one or more Kubernetes clusters. After the job is submitted to the queue, the gateway dispatches it to one dedicated cluster to run. In our design a job only runs on one cluster, never across clusters - that avoids a lot of the complicated things you have to deal with when one job spans multiple clusters. Once the job is dispatched to a cluster, the Spark operator interprets the job spec and launches the job there, which means the job creates a bunch of pods. The YuniKorn scheduler watches the pending pods and schedules them on that cluster, respecting the queue settings. That's how everything comes together.

The batch processing gateway we are talking about here is a central, stateless jobs API server for Spark across Kubernetes clusters. The key idea is to hide the Kubernetes clusters behind the scenes: we really want users to use the API to submit and monitor their jobs instead of caring about which cluster a job runs on, or logging on to a cluster to see what's happening. We want to hide all of that complexity. The gateway also manages the virtual queues, which map to the actual YuniKorn queues on the compute clusters; we aggregate the metrics and present one virtual queue to users, and that's all they need to care about. The gateway also handles load balancing to avoid creating hotspot clusters, and it gives users the essential APIs to track job status and retrieve logs.

This is what we want the platform to be for our users: very easy to use. From the user perspective, they access the gateway to operate their jobs, they get good observability from the UIs and the logs we provide, and with the monitoring the system has built in, they get good metrics about their jobs.
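To make the job API a bit more concrete, here is a rough sketch of what a submission through the gateway might look like. The endpoint paths, field names, and response shape are made up for illustration - the real API is in the Batch Processing Gateway project mentioned below - but the shape of the interaction is the point: users talk only to the gateway, never to a cluster.

```python
# Hypothetical sketch of submitting and tracking a Spark job through the gateway.
# Endpoint paths and field names are illustrative, not the actual gateway API.
import requests

GATEWAY = "https://bpg.example.internal"        # hypothetical gateway URL

job = {
    "queue": "team-analytics",                  # a virtual queue, not a cluster
    "mainClass": "com.example.ExampleJob",
    "applicationJar": "s3://example-bucket/jobs/example.jar",
    "driverMemory": "4g",
    "numExecutors": 10,
    "executorMemory": "8g",
}

resp = requests.post(f"{GATEWAY}/api/v1/spark", json=job, timeout=30)
resp.raise_for_status()
submission_id = resp.json()["submissionId"]     # hypothetical response field

# Status and logs also come back through the gateway; the user never needs to know
# which compute cluster the job was dispatched to.
status = requests.get(f"{GATEWAY}/api/v1/spark/{submission_id}/status", timeout=30)
print(status.json())
```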
On the right side, you can see how much complexity we are hiding from users. Cluster provisioning - users don't need to care about it. Cluster upgrades - we do those quarterly for Kubernetes versions. Scaling up and down is automatic, on demand. IP rebalancing - there are IP limits we want to avoid hitting. Instance capacity planning, node balancing, workflow scheduling, failure recovery - all of those are hidden from the user. More importantly, the key features - job queuing, bin packing, resource fairness - are provided as well, along with log rotation and so on. None of this complexity is visible to users; they can just happily run their jobs on the platform while we handle the hard parts.

We keep talking about virtual resource queues, and I really want to reinforce that. The key idea is that we have virtual queues, and this is where teams and users plan their budget for the year and tell us how much resource they want. The resources behind each queue may come from one cluster or from ten clusters; how we back a virtual queue with resources depends entirely on how we plan the compute pools. Once user jobs are submitted to a queue, they are queued and scheduled based on a number of factors such as priority, submission time, and sometimes resource usage, depending on what users need. The gateway also provides tracking so users can view their jobs, retrieve logs, and so on. So the virtual queues give users a good way to plan and track resources, while hiding all of the complexity of an architecture with that many clusters behind it. The batch processing gateway will be open sourced as well, so I encourage anyone interested to search for the project on GitHub.

The second challenge is job scheduling, which we need to talk about more - this is where YuniKorn fits into the picture. YuniKorn is an open source batch scheduler for Kubernetes. Compared with the default scheduler, it provides some key features that batch systems need. The first is resource quota management: I know many folks use namespace resource quotas, but those were not really designed for batch systems. Beyond quota management there is job scheduling: we need to queue the jobs, and we need to schedule jobs, not just pods - jobs have priorities and other factors that must be considered during scheduling, and YuniKorn takes those into account. Beyond that, there are advanced scheduling features such as gang scheduling, and most recently the community added a scheduler plugin deployment mode for some use cases. The last thing is throughput: we really care about throughput, and we want the scheduler to be fast enough to keep up with user requests. With YuniKorn, we can address the challenges of queuing jobs and managing resource quotas, and on performance, our benchmarks show it is at least twice as fast as the default scheduler in the same environment.

For the resource quota use case, YuniKorn provides the queue concept, which lets users manage and plan resources and share them between teams and users. The guaranteed resource is the minimum you can count on getting from the queue; when the queue's utilization is below that, we know the queue is starving and needs more resources, so its weight becomes higher than other queues and it gets resources faster. The max resource is the hard limit.
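Written out for illustration, the queue settings just described look roughly like the following. In YuniKorn the queue configuration normally lives in the scheduler's own configuration; here it is shown as an equivalent Python structure, and the queue names and numbers are made up.

```python
# Illustrative queue resource settings: "guaranteed" is the minimum a queue can count on,
# "max" is the hard limit it can never exceed. Names and numbers are made up.
queues = {
    "root.analytics": {
        "guaranteed": {"vcores": 200, "memory_gb": 800},
        "max":        {"vcores": 600, "memory_gb": 2400},
    },
    "root.ml": {
        "guaranteed": {"vcores": 400, "memory_gb": 1600},
        "max":        {"vcores": 1000, "memory_gb": 4000},
    },
}

def is_starving(queue: str, usage: dict) -> bool:
    """A queue running below its guaranteed share is 'starving' and should be
    favored by the scheduler, i.e. it gets resources faster than other queues."""
    guaranteed = queues[queue]["guaranteed"]
    return any(usage.get(res, 0) < amount for res, amount in guaranteed.items())

# Example: this queue is under its guarantee, so it gets a higher effective weight.
print(is_starving("root.analytics", {"vcores": 120, "memory_gb": 500}))  # True
```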
One key thing about that hard limit: even if you keep submitting jobs to the same queue, usage won't go beyond the hard limit, but the jobs and the pods will not be killed, no matter how much more you submit. That matters because it simplifies the user's client side - they don't need to handle a lot of failures caused by exceeding quota. There are also use cases that need a hierarchy of queues. In the old YARN world, we had a lot of use cases with hierarchical queues for teams and the users under those teams, and we can use the same concept in YuniKorn.

Bin packing is another feature built into the YuniKorn scheduler: it squeezes the pods onto the minimum number of nodes, which works very well with autoscaling. It does have side effects, such as creating hotspots and sometimes disk pressure, so there are optimizations to distribute jobs based on job type - whether a job is CPU intensive or memory intensive. Those are job-level metrics that the scheduler can leverage.

When we scale up, here are the challenges we hit and our tips. The first is discipline: we really don't want to scale a single cluster to thousands of nodes. We've seen a lot of problems there, and it takes really hard cleanup to get a cluster back. In our practice - maybe this isn't a universal rule - we keep cluster sizes between roughly 200 and 1,000 nodes, and that runs very well in most cases. The second is agility: we want to leverage how easy the cloud makes it to build clusters and to destroy them. The third is elasticity: if possible, put all nodes into auto scaling groups and make sure they can scale automatically on demand. Also provide good observability so you can detect issues. Diversity is key when running workloads on the cloud: leverage as many instance types as you can. And automation is very important, because the complexity we hide from users lands on the infra layer, and you need good tools to automate it. This is the pattern we see: there is no fixed pattern, it's all on demand. Users can submit anything, the clusters scale up and down, and the number of clusters scales up and down as well. The key thing is that we run instances purely based on demand - we don't reserve resources for any reason.

For future work, we are looking at several directions. One is remote shuffle. A lot of people know that Spark keeps shuffle data on local disk, which creates problems when running on the cloud; remote shuffle moves the shuffle data to remote storage, and that's definitely a direction to make Spark more cloud-native. There are also efforts in the community on new autoscalers to replace the cluster autoscaler, such as Karpenter and Ocean from Spot. I think those projects are a very good direction for new Kubernetes cluster setups and could save a lot of effort. I'm running out of time, so that's all for my presentation. Thank you. Questions?

How are you constructing the virtual queues? Because it looks like, with the scheduler, that's actually a cluster-local concept.

So the question is how to construct the virtual queues. For each physical cluster we have a YuniKorn instance running, and that instance provides the actual queues on that cluster. What we mean by virtual queues is that one queue in the virtual layer can actually draw on one or more Kubernetes clusters for the same queue's resources. Each compute cluster queue still has its own min and max capacity settings, and when you sum them up, that's the max capacity of the virtual queue. That's basically how it works today.
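As a rough sketch of that aggregation - per-cluster queue capacities summed into one virtual queue - the cluster names, queue names, and numbers below are made up for illustration.

```python
# Illustrative only: a virtual queue's capacity is the sum of the capacities of the
# same-named YuniKorn queues on each compute cluster that backs it.
per_cluster_queues = {
    "cluster-a": {"team-analytics": {"min_vcores": 100, "max_vcores": 300}},
    "cluster-b": {"team-analytics": {"min_vcores": 150, "max_vcores": 400}},
}

def virtual_queue_capacity(queue: str) -> dict:
    """Sum min/max capacity of the matching queue across all backing clusters."""
    total = {"min_vcores": 0, "max_vcores": 0}
    for cluster, queues in per_cluster_queues.items():
        if queue in queues:
            total["min_vcores"] += queues[queue]["min_vcores"]
            total["max_vcores"] += queues[queue]["max_vcores"]
    return total

print(virtual_queue_capacity("team-analytics"))  # {'min_vcores': 250, 'max_vcores': 700}
```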
Hey, really great talk - a lot of it is remarkably similar to what we're doing in our setup, actually. I was wondering about your batch API: you said it's stateless. How have you achieved that? Doesn't it need some kind of state to know, after you've submitted things, where to find information about jobs or workloads?

Sorry, may I ask you to repeat the question? The batch API - the gateway API. How is it stateless? Don't you need to save some state to find jobs or workloads?

That's right. So the question is how the batch processing gateway is stateless. I don't think it is stateless in the typical sense - it still has state. We have a database, and because the gateway tracks all the jobs, we write that data into the database. But different instances can share the same database, so we can easily tear down one instance and spin up another, and we run several instances together for load balancing. That's what we wanted it to be. So it's not truly stateless - it still has state.

Makes sense, thanks.

For this gateway, what exactly do users submit to it?

The question is what users submit to the gateway. If you submit jobs to a Kubernetes cluster using spark-submit, you run spark-submit with a bunch of parameters. What we do is that users don't use spark-submit; they use the gateway command line or REST API to submit their jobs. They provide a set of parameters - how many executors they want, which queue to run in, what the main class is - and all of that goes to the job submission API, a REST service on the gateway, and from there we dispatch the job.

All right, so it's essentially something like a Kubernetes resource, but with some extra metadata in there that you need, is that true?

The gateway is more like an API server for Spark jobs; it doesn't necessarily deal with Kubernetes resources directly.

Hi, for log and metric aggregation, is there one large instance of Prometheus or Elasticsearch that users access logs through, or is there some segregation between what different users can access, or does each queue get associated with its own Prometheus instance, et cetera?

You're asking about logging? Yeah, specifically the logging and metrics you said users have access to. Right, right.
So for logging, I think we've done nothing different from everyone else: we use something like Fluentd to collect the logs and put them on remote storage, and in our service gateway we created APIs for users to retrieve the logs from that storage. I think that's what most people do today.

I see, so there is an API implementation in between? Yeah. Gotcha.

I have a lot of questions - hopefully we can talk later - but I wanted to focus on the cluster autoscaler. You said that YuniKorn supports cluster autoscaling? I ask because, as far as I know, the cluster autoscaler - at least the open source one - embeds the kube-scheduler code to do its simulation. So I wonder how that works with YuniKorn.

Right now YuniKorn works with the cluster autoscaler fine, but when we want to use a different scheduling strategy - for example, in some cases we might want to distribute the pods differently across nodes - there can be conflicts. That's why we are not doing that today, in order to stay fully compatible with the cluster autoscaler. In the last part of my session I mentioned some other autoscaler projects. Right now the autoscaler does part of the scheduling work, because it needs to simulate placement to figure out how many instances are needed, and that sometimes creates challenges, I would say. One of the directions we're looking at is how to make YuniKorn's scheduling work better with the cluster autoscaler and with other autoscalers. So that part is still a bit unknown for me, at least for now - definitely something we can talk about more.

I think we are almost out of time - okay, one last question and then we have to wrap up.

From Jamie's talk, there seems to be a lot of interest in multi-cluster workloads in general. I was curious: with this work, are you able to submit MPI operator jobs, or other types of CRDs, or is this geared more toward Spark?

Right now this is purely for Spark. All our compute runs on Spark, so there is no general CRD support. All right, thank you so much. Thank you.