 Hi folks, thanks for attending this session and are you My name is Huizhao Zhao and today I feel a bit deeply honored to be here to share our learnings and the journeys regarding how to running spark Kubernetes and Let me introduce myself firstly and and I have been working I from Apple AI-MIL Department, I have been working on design develop and manage large-scale spark cluster and the spark workload on Kubernetes around the four years now and throughout our journey most recently we have encountered some new and Very crucial new requests from our internal and the users which inspiring us Try to exploring a new way to allow user to running their spark workloads on Kubernetes which is running Apache stand-alone cluster on top of all communities and Over the next 30 minutes. I try to have a deep dive to this topic. So I try to Firstly introduce was our existing, you know platform on our how to run spark workloads on Kubernetes and then why we need what kind of challenges we are fitting right now and Why we introduce this new way and then I try to give you some high-level design Examples we try to achieve and there's a detail implementation for that and Lots of things like what's our ongoing work going to do As you know Apache spark is an open-sourced software engineer a soft software and which is kind of Distributed computation engine just in a field of lines of codes in either, you know pass on Scala or even circle data scientists and the engineer can easily define a spark application to process a huge amount of data and The spark will take care of Paralyzing the work with the help of a cluster of machines. However Spark itself cannot manage this kind of machines Directly Usually it will rely on some third-party class managers to help it There are some popular, you know class managers like how to be young Apache muscles Kubernetes as of course so around the same years ago, we have been exploring how to run in spark on Kubernetes and We Through this kind of journey we find that there is quite a lot of kind of advantage if we can take this design First of all The full containerization these kinds of capability makes the spark application on is a very agile and portable our data scientists and Engineers can easily install whatever packages to the container images and the run it's everywhere and This is really helpful and speed up our developers, you know development of velocity and iterations Secondly and By running spark workloads on cloud with the help of a connect is we can easily apply the auto scaling features to Spark applications so we can automatically adjust to the scale in or scale out to the machines to adjust to the different size of a spark workloads this help us to save a lot of cost to our internal users as well and Thirdly, and you know, we always try to take in security and privacy as our first Citizens to our platform. This is really important especially when we try to you know, build a maintenance platform for our users With the help of a Kubernetes such as you know service account and the cluster rows Such as this kind of those, you know security control services provided by connect is itself We can very easily to apply authentication tokens or authorization policies to every spark port and or even every data access operations Inside the spark workload Here is our existing Architecture how to run in spark workloads on top of a Kubernetes from this picture, you can see we build a unified batch processing gateway and We leverage this gateway firstly to manage several several configurable Numbers of a spark Kubernetes cluster and once this gateway Get to the request from whatever API call come the line airflow operator or Jupyter Notebook it will translate this kind of request to a CRD object so here what is the CRD object and You can see on each spark Kubernetes clusters. We install several spark Kubernetes operator and The CRD object are going to be Rotated to the corresponding spark because clutter and the operator will process this kind of CRD instance and Basically the operator are going to try to translate all the fields defined in the CRD object to a parameter required by the spark submit if you're familiar like you know spark and It's a provide a spark submit and also it's already, you know support Kubernetes as a cluster manager as default so that spark submit script can directly talk to Kubernetes endpoints and By leveraging this shell script spark operator Can help you to spawn bunch of reports which is including one spark driver pod and several Numbers of a spark executor pod and At the bottom you can feel we leverage a unicorn to apply the resources quota per each tenant and of course, we also leverage autoscaler class autoscaler and All cabinet are to automatically adjust the machine or trigger the new machine to create it to support the new part and you may have a lot of a kind of you know Questions regarding this kind of architecture, but let's see have a deep dive later if we have time at the end We have been leveraged this Clutter to support large-scale production workloads on our internal platform for several years and we made a lot of Internal customizations and the defectors to this open source to software, which is a spark Kubernetes operator however, we are still facing some new challenges and Here is a brief summary first of all, we do have a lot of very short processing spark applications and this is we got we got a lot of requests from our end users and the They try to require their small applications to be finished in only around three to four minutes and also the interval of each spark application They need to be less than 30 seconds You can imagine all over this kind of a spark application task is a schedule is by airflow or other, you know, or just administrators and You also can imagine any kind of job schedule delay or kind of a pod allocation delay Well leading to the failures to a spark application or even the whole airflow decks and From the previous Architecture you may also notice we also provide interactive Analytics capabilities to our users. So what does that mean? That means if we user want to have a query session Where is the truth or notebook? It will also need our back end to spawn bunch of old pod So that means If we can reduce the startup time of the pod creation is well Help very helpful to enhance our user experience certainly Jin AI of course is a very hot topic right now and On our platform. We also try to provide both data parallelism and model parallelism to our platform So that's why we try to exploring how to running a non running recluster on top of a spark however, the existing architecture cannot Satisfy this need That's why we conduct a lot of you know research and try to find some way to reduce the start up Time of each spark application and also try to find a new way to run in this kind of machine learning framework To find the the firstly Let's try to find the root cause why the sport or the spark application start Start up time taking so much time Firstly, let's check what happened when CRD Rotated to the spark operator and the house by the spark operator does and From this workflow of Kubernetes operator, we can see once the controller received the spark application CRD's and it will firstly translate to a shell script and then using the submission runner to submit to execute this shell script and allow it to talk to API server directly and Then the driver part going to be created and then the spark driver part will directly Talk to the API server to spawn the bunch of executor part We can see they're going to be a long negotiation process, especially when for example, there is one spark application requires thousands of executor part and our bottom components like a unicor or Autoscaler need to find the machines or create the new machines to allocate this kind of you know bunch of Executor part. However, the spark operator right now is the involvement already says it Of course, it will provide some power the monitoring and the web validations And also it will try to help you to create the spark ui service your spark ui service and the ingress rules But it is a long negotiation process Another thing I want to point out here is that once the spark application workloads is completed and The spark operator will terminate as a driver part and all the corresponding executor part This is actually is a totally unnecessary or repeated. You know post creation creation and the deletion Especially when we try to you know running a production grades workloads because typically this kind of workloads There is no frequent, you know doctor image change or configuration change This is totally a waste on the other hand. We try to exploring. What's the spark operate? What's the spark? Capabilities already provide It's provided a standalone mode to all the users to you know Create a tiny cluster on your local machine to a test or experimental purpose it's basically idea is is try to running a shell script to Start some, you know JVM process Among all these JVM, I mean Java process. There is one master process and the several worker process and We can take this master as a simulated a cluster manager and it provides some rest for API to all the users to able to repeat the spark Submit to the spark application to this tiny clusters and also this master provides a default a faithful scheduler to allocate the different, you know spark applications And the two, you know and also monitoring the workers usage metrics and Monitor the hails of this simulated cluster. So as a infra engineer, we try we have a global picture Compared to our, you know and the users because we have both kind of Infra We have kind of a global pictures to the infra and also both to the spark epic Coral code base So what we're trying to do here is we try to continues leverages sparks octopus to help us Manages, you know handling like virtual machine provisioning auto scaling and this kind of a security control And but on the other hand, we try to leverage Sparks inherent class of managers functionality to Maintain a master and the worker for us That's being said that if we can find a way to keep the master and the worker running on top of a pod so we can Have a better situation here Our justification is here if we can keep a master and the worker running on top of a Kubernetes There are going to be no machine. I mean virtual machine startup delay There are going to be no container image pulling delay. No Kubernetes API and the unicorn interactively with the, you know cloud private provider delay to allocate the pod from the user side Here is another difference like Especially compared to the previous architecture right now We only need to expose the master's end point instead. We totally hide as a Kubernetes and the pod end point for users And also it's totally, you know simplifies, you know management as a infra engineers and On the other hand, we also can take this kind of tiny Stand-alone cluster as a team-shared Cluster which should provide a new multi-tenant management styles for for us This has been approved like that. This is very helpful to foster collaboration and maximize resources utilization so after, you know finalizing the motivations and the design choices how to you know design and Implementation is a relatively straightforward Here is our design principle Like how to support the drawing stand-alone cluster on top of a Kubernetes Firstly, we try to provide a unified the spark operator as I mentioned the previous line We have been leveraged this kind of open-sourced software, which is called the spark Kubernetes operator right right now It's a merged to the Cooper flow this you know get up triple and We try to continually using this kind of you know code base and the try The one important things that we try to ensure as a totally compatibility to our platform to continually support our users to using regular spark app this kind of Secondly, we try to provide an extendable framework based on this operators and For the CRD life circle measurement, of course, we also try to make it fully automated and Status update is very important and we try to leverage the CRD status updates So our batch gateway can fetch that the details back to our users so they can know what's their cluster status or job status So we try to provide this capabilities as well as As I mentioned before we try to make the spark master and the spark worker to keep them always running on top of Kubernetes, so the cost of efficiency is going to be a Concern so we try to provide some cost of efficiency, you know solutions So here is some detailed implementation we have did First of all like I mentioned the spark master provide the rest API. So in the new versions of Spark operator is right now. It's can provide both submission style You can either veer spark spamming shell script or rest API to submit your job Secondly, we introduce two new CRDs and their corresponding controllers to manage Which is one is the spark standalone cluster. So we all are users for to firstly create a standalone cluster And secondly, we introduce a new CRD, which is called a spark application on standalone So we so we all are users to submission to submit their sub spark application to the corresponding standalone clusters We also able to allow users to add it as an initial containers and the side cars to their standalone clusters This is a very helpful especially when user try to you know Download some very large-scale, you know artifacts or even machine learning models and usually user try to leverage your initial container to do this So the car for example, we try to leverage a side car to allow you to make some you know authentication stuff as a as I mentioned the previous lives are like a cost of efficiency and So for each CRD, we will attempt automatically added a pod levels auto-scaler Which is called that, you know HPA to issue a CRD. So we as a start point right now We only take a you know CPU utilization this metric to trigger the scaling The last thing is we try to provide both the cluster levels, you know observer and the job levels observer for the cluster levels observer because as I mentioned the master provide API and it's a provide some methods to check the master status for the job status observer and we can Try to you know append the driver ID and the application ID to the request and the tool checks Corresponding spark application status So here is the example to the two new CRDs on the left hand is the sparkle standalone cluster CRDs You can see we use are only needed to specify the master and workers resources and then a new standalone cluster will be created on the right hand there is This is an example for the spark application on standalone We can see if you familiar the previous I mean the original sparkle operators CRD definitions You can see this is pretty similar to that CRD definitions and though with we Keeper most of their you know fields like a main application files and the environment Variables and the spark conf the only new field that we added is a spark cluster service name So we can identify which Sustainable cluster going to be wrong is applications Yeah, that's pretty much about how we implement spark operator. The next thing is how we integrate to our existing Spark platform We try to reuse every component. I mentioned before Firstly, we able to reuse the batch processing gateway Just with just added some new CRD spot supported as I mentioned it to try to translate users request to the new CRDs object Secondly, we able to reuse the spark history server because we made some internal optimization to you know indexing all the logs we stored on the you know storage object object storage system Suddenly we able to reuse the unicor to manage the maintenance resource quota and the limitations per queue and We continue can use the queue to manage this kind of stuff also and If you really if you remember I just applied HP of it HPA to the pod levels of the scaling But we still able to reuse class of the scalar or capital to scale the machines to adjust to support as that's kind of a HPA The its activities So Accordingly, this is a new architecture. You can see the most of all components is the new change We just added the two new block to each Spark Kubernetes clusters and we can leverage unicor to manage the several standalone clusters and So that's being said every tenant tenant. I mean our you know Platform tenants can either using the new standalone cluster all the previous Regular spark application CRDs on the same queue Okay, now let's talk about why this kind of new spark operator can provide a Production grid spot and why it's a faster safer and the service Is this is a data point we collect we collect it Firstly, let's take the Original, you know design as a baseline, which is means like we keep using connect is a as the resource manager and but Let's take the baseline like resources use pattern as a temporal what that means that means all the virtual machine going to be a start as a code Status like it will require auto scaler and the cabinet are tools provision the machine and then leverage unicor to allocate the power to this new machines We call we let's take this as baseline and then the first thing is we did is we try to keep this machine running and then To check as a what's the average startup time By the way, when we doing some baseline benchmark and we find that the average startup time is around 30 to 90 seconds per each application here the application is It's required around the thousands of pods to create to be created So if we keep that the virtual machine is running the average start of time going to be Two times faster that means the average, you know start of time is around 30 to 40 seconds If we leverage our new design like we if we submit our application to a team shared Spark standalone mode the average startup startup time will be 10 times faster than the baseline and Basically our testing number is around the fall to five seconds per each spark applications Another thing is why we think a standard class cluster is more safer first thing As you know, this is going to be a long running standalone cluster. It's more stable and it's also Help our info engineer to you know when we're doing some authentication and also reason we always try to keep a you know talking to API server especially when we Repeated creating the new pod and the delete but if we keep a long running standalone cluster is will hugely, you know decrease the Reduces a you know API servers stressed and Also, it's a help us very you know more easier to audit all their you know activities Secondly because we added added a new layers of you know Like a standalone cluster to our customers and we can provide a more granular of a security isolation we can apply a dedicated the service account and that there are policies or I am a rose to it Certainly We it's provides a tenant autonomy. So that's being said. We no need to expose All the you know necessary policies so users do we can get some new information for this kind of standalone cluster because previously user only needed only can get a Status from the pod level and there are probably their application levels status right now. They also can get some You know master and the worker levels so that us Third thing is how it is called the serverless Yeah, the first thing is like We simplify the you know cluster creation process user just need a single API call user cast Creators, they're dedicated Even team share the you know cluster on top of committees and There's a no need to consider what kind of virtual machine instant type of families They're going to select and what kind of storage layer going to select are very you know platform try to help them Also, they no need to continually like updates of the spark version and Our infra or platform can handle it handle it They're going to we try to you know maximize the resources Isolations so they are going to be no idle capacities with the help of HPA auto scalar a clever auto scalar together Also user only can pay only their current usage At the end If you remember the recall the previous slide, we try to also provide some of the abilities functionalities So at the bottom we you know build some cost of pipelines to calculate every Clutter level and the application levels cost the numbers for our users On the other hand, so what is the new to our users? By introduce this stand-alone cluster and Because I mentioned the spark master also provide a new UI. So we expose this new UI to our users So they need to learn how to you know use leverage this new spark web UI On this spark UI, you can check all the Submitted applications and the running status of your jobs and you also can check the workers that are like CPU and the memory usage that's kind of stuff So they can fetch more details Not only you know driver and executors, but also master and workers Of course we provide a new option to a team share the long running cluster and new options for the Jupiter users and lastly It helps us able to running dream on top of Spark cluster if you look at the race, you know official website You can you can say right now really only can running on standalone cluster Okay, lastly, this is our ongoing work First of all, we try to provide a better multi-tenant management If you remember recall what I mentioned that before right now the vanilla spark master is You know code base right now only support the faithful scheduling and we try to you know make some patch to the spark master to able to you know support Other scheduling strategy like fair or prior priority based, you know scheduling Secondly spark master is a kind of a stateful service and We try to Make you know avoid any kind of restart the feelings and We are considering leverage PVC and the stateful set to support this master pod and Lastly, but a very a lot of a huge work to the third party is like how to Improve the debug ability to our platform. This is an ongoing issue from the other, you know previous platform as well and we try to continually Optimizing the login metrics and the web UI for our users Yeah, I think that's all My today's the sharing and now it's time for the question Hello, is the operator you showed open source and we can all find it someone use it Yeah, good question. That's when we started this project This is the first question. We you know think as you know right now as an open source of the spark Kubernetes operator is already merged to the undercube flow right now. We have you know Better position to make PR to that's a public repo And that right now we try to speed up this kind of process internally and we go into you know Merge our PR to the open source software soon Thank you. Hello. Thank you for the presentation So my question is as I know spark standalone mode does not support Python a pie spark code So do I say correctly that in your like solution there won't be support for Python code? I don't think that's correct the statement based on our internal testing firstly you need Using spark cluster mode Yes, so cluster mode does not support Python right now No, actually based on our internal testing It definitely supports and the buzzer you just need Using the right way over spark matter API to submit to their to submit to the job That's being said that the main class going to be not your applications Spark applications the main class, but so you have to using a Didn't show that but you have to using a Java submit that's kind of class as a main application file and then Added your person application file to the dependency Pass then you're able to run. Oh, so you don't use native Python support for spark But you submit Java code and run Python from there, right? Yeah, but this is a capability already supported to the current spark master source code. Okay. Thank you We can yeah, we can check it detail offline Hi Have you considered using spark connect to further shorten the start at the pine? Thanks. So the question is why not we leverage a spark connect instead of using the very old spark standalone mode, right and Yeah, we made a lot of research to the spark connect At that time at least I mean, you know half years ago spark net is not as mature It's cannot spot the data frame and lots of passing you the udf still cannot support. That's why We you know didn't leverage it but I think You know weighing the spark connect is you know mature getting mature. We may be able to spot it But the main advantage maybe is provide the small debug abilities for our users And maybe better for the Jupiter notebook users, but not so for the larger scale workload But the leverage is the stand-alone cluster. You also can run in ray. I don't think a spark connect can you know connect can spot this last question What are the challenges of running ray on spark? Yeah, good question I think our first question is our first thing is a first the challenge we facing is Dependence management Even though we can leverage, you know container autografer to manage this But you know if we're running some GPU based the machine learning workloads It's very hard to find the combination versions of you know could a driver You know a media driver could you know took it this kind of stuff with the you know other you know machine learning framework that's one of those you know first no challenges we are facing right now and Secondly how efficiently, you know improve the GPU utilization with the various tools and Accelerators and doing whatever inference or training that says ongoing challenges we are facing and Yeah, and we can share more detail offline if you need it Good question. How do you scale? standalone clusters in response to varying workload queue like nightly batches or Monday morning spikes on submissions of the jobs So basically your question is how to handle peak usage for the some emissions, right? Yeah, basically Right now as a start point we all know users to specify the Minimum and the maximum power the numbers to your CRDs and then leverage HPA to adjust the power the numbers Inside this range. That's our current status But how to support, you know of struts, you know or peak usage for the very large scale or backfill use cases We haven't explored that but basically I will say in that case we may you know Encourage user to Overproduce over-providing there, you know spark port spark standalone clusters so we can leverage HPA to do that but not Thank you. Okay. Thanks for your time