Hi everyone. Thank you for joining us today. It's our pleasure to present to you the elastic story of running Spark on Kubernetes natively at massive scale for Apple. My name is Bowen Li. I lead the batch processing and interactive analytics areas at the Apple AIML data platform. My team builds and operates cloud native services like the batch processing service powered by Spark on Kubernetes, the interactive data science service with interactive Spark and Jupyter, and the interactive analytics service powered by Presto and Trino. We serve hundreds of data engineers and scientists every day to improve our AIML products like Siri and Apple Search with best-in-class data analytics and processing infrastructure. Joining me is Huichao, an engineer from my team who has been focusing on how to run Spark elastically and cost efficiently on Kubernetes.

So here's the agenda today. We will first talk about the benefits of cloud and our design principles for leveraging those cloud native characteristics, then the architecture of our cloud native Spark on Kubernetes platform and why we need to auto-scale our Spark service based on cost saving and elasticity needs. Then we will dive deep into the design of the reactive auto-scaling, the productionization of it, our learnings, and future work. Sounds good. Let's get into it.

So, why are we moving to the cloud? This might not be a new topic, but I want to reiterate our unique perspectives. Cloud and Kubernetes can help solve a lot of the problems that legacy infrastructure has. First, it is agile: resources are consumed on demand and users pay as they go. Second, it is elastic and scalable: we can acquire the resources we need and return them when we are done, which saves us a lot of money, and compute and storage are at almost infinite scale. Then, Kubernetes enables us to build services in a container native way with strong resource isolation, so users' workloads do not impact each other. This supports our multi-tenancy and isolation guarantees. With cloud and Kubernetes we can leverage cutting-edge cloud native security techniques to build a privacy-first data infrastructure. And lastly, Kubernetes and the providers on the cloud took away a lot of the heavy lifting from us, which enables our developers to focus on building and improving the business-critical batch processing service to achieve higher ROI. So a couple of years ago it was a no-brainer for us to decide to go all in on cloud and Kubernetes.

With the benefits of cloud and Kubernetes in mind, we set a few critical design principles for ourselves when designing the system. First, we want to fully embrace the public cloud and the cloud native way of thinking about and building infrastructure. That is quite a mindset change. For example, in the cloud native world, when we want to upgrade our infrastructure we don't have to do any in-place upgrade, which would expose a huge risk to our infrastructure and our users. In the new world we can just spin up a completely new environment, gradually roll traffic over from the old environment to the new one, and instantly switch back if there is an issue. That kind of flexibility is a huge win for our DevOps. Second, everything should be containerized for elasticity, agility, and reproducibility. We aim to scale and replicate our infrastructure very fast to cater to business needs, and full containerization enables us to do so. Third, compute and storage have to be fully decoupled so they can scale independently according to business needs.
For example, the shuffle data size can vary significantly from Spark job to job, so we have to build our Spark service to handle that in a flexible way rather than with a one-size-fits-all solution. Then, security and privacy, and user experience; I'm talking about them together since they are related. In our new infrastructure, security and privacy are first-class citizens in the design stack, and we leverage fine-grained techniques like roles and policies to govern our data services. At the same time, we still want to make it super easy for users to run Spark jobs while following that governance. So instead of having users run spark-submit directly, we expose a REST API that has exactly the same parameters as spark-submit, so we can enforce security at the REST API layer while still giving users a very familiar development experience. Last, there is an Apple internal Spark distribution that we decided to use.

Next I want to present our cloud native elastic Spark architecture. We can start from the data plane, where in the backend we have multiple Spark Kubernetes clusters. We use the Spark Kubernetes operator to submit Spark jobs and manage the job lifecycles. There is a replica set of Spark operators so they can load-balance and achieve high availability. Each tenant on the platform has their own resource queues powered by Apache YuniKorn. YuniKorn plays a few key roles here. First is multi-tenancy support and resource quotas: each tenant's queue is fully isolated from the others. Second, the YuniKorn queue runs all the resource scheduling for the Spark workload, from basic ones like gang scheduling requirements to more advanced scheduling policies like FIFO, priority, or preemption. Lastly, YuniKorn handles elasticity of the queues by independently scaling resources for each tenant. We have many Spark clusters in the backend, and the multi-cluster, multi-queue strategy provides us with many folds of elasticity and linear scalability without any single bottleneck.

In the control plane, we built our own Spark service gateway, which exposes the REST API I mentioned before. It is itself containerized, so it can be deployed and scaled very easily as a microservice on Kubernetes. When submitting jobs through our REST API, users can specify additional parameters like the queue name, and the gateway will route the job to the underlying queue. On the client side we provide the REST API, a simple, easy-to-use CLI for users to run jobs from the terminal, and a corresponding Airflow operator so users can run scheduled jobs.

We also have the data science service, where our data engineers quickly iterate and build their Spark ETL pipelines and data scientists build and train their models interactively. We aim to share a unified backend for the two Spark services. As you can see, the interactive Spark workload that comes from Jupyter notebooks goes through its own interactive Spark gateway, and the workload runs on the same infrastructure on the backend. This way we achieve the goal of reusing most of our infrastructure without reinventing the wheel. Lastly, we closely collaborate with our security and privacy team and our observability team to develop and integrate on these two fronts in a fully integrated way.
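To make the gateway idea concrete, here is a minimal sketch of what a client-side submission against such a REST API could look like. The endpoint URL, field names, queue name, and job artifact below are hypothetical placeholders rather than the actual internal API; the point is only that the payload mirrors familiar spark-submit parameters while routing and security are enforced at the gateway layer.

```python
# Hypothetical sketch of submitting a Spark job through a gateway REST API
# whose parameters mirror spark-submit. Endpoint and field names are
# illustrative placeholders, not the actual internal API.
import requests

job_spec = {
    "name": "daily-aggregation",
    "mainApplicationFile": "s3a://my-bucket/jobs/daily_agg.py",  # assumed artifact location
    "sparkVersion": "3.3.0",
    "queue": "root.tenant-a",                # tenant queue the gateway routes to
    "driver": {"cores": 2, "memory": "4g"},
    "executor": {"instances": 50, "cores": 4, "memory": "8g"},
    "sparkConf": {"spark.sql.shuffle.partitions": "400"},
}

# The gateway can authenticate the caller and enforce security policies here,
# so users never run spark-submit directly against the clusters.
resp = requests.post(
    "https://spark-gateway.example.com/api/v1/jobs",  # placeholder URL
    json=job_spec,
    headers={"Authorization": "Bearer <token>"},
    timeout=30,
)
resp.raise_for_status()
print("submitted job:", resp.json().get("jobId"))
```

The same payload shape could back both the CLI and the Airflow operator, since all three are just clients of the one REST endpoint.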
So our Spark service has been running in production for about a year so far. It currently supports many business-critical workloads for Apple AIML. The deployment scale is massive: we are running hundreds of thousands of vCPUs and hundreds of terabytes of memory, supporting hundreds of thousands of Spark jobs per week. The job scale is also very large: our users' biggest jobs can consume up to thousands of executors and CPUs at the same time and run for hours. We have been very active contributors to the Apache YuniKorn project and have grown committers and PMC members organically from the team. We are also planning to open source some of the components in the stack.

While this has been super successful, we initially operated all the resources statically for users. For example, our YuniKorn queues all had a static amount of resources, and we saw a massive opportunity to make the stack more elastic and save costs. Workload patterns can vary from time to time within a week or even during a day. They also vary quite a bit from use case to use case: from running only scheduled jobs, to mostly ad hoc and interactive jobs, to a mix of both, to occasionally super-large-scale backfill jobs. When using a fixed amount of resources, you have to account for the max usage, and that causes waste. So we have been investing heavily in auto-scaling Spark on Kubernetes and have achieved great results so far, cutting down cost for our users by as much as 70 to 80 percent on a per-queue basis. Next I'll hand it over to Huichao to talk about how we achieved that, our learnings, and our roadmap in that direction.

Hi folks, this is Huichao Zhao from the AIML data platform at Apple. Now let me walk you through the architecture and design of the reactive auto-scaling feature we delivered recently in our cloud native Spark clusters.

First of all, let me talk about the auto-scaling cluster node group layout. As a multi-tenant auto-scaling cluster, we provide physical isolation among system components, Spark drivers, and Spark executors, and each of them is located in its own node group. The system components include the node problem detector, ingress controller, Spark Kubernetes operator, YuniKorn, and so on. Also, by mapping each tenant queue to its dedicated executor node groups, we can isolate different tenants from each other to minimize potential impact, and it also helps us generate cost and usage reports per tenant very easily. We provide a min capacity setting per queue, so there is a guaranteed amount of machines kept running to support long-running and latency-sensitive workloads. The max capacity setting provides a guardrail for each queue, and workloads will wait in the queue if they exceed the maximum threshold, until the scheduler finds released resources.

This is the workflow of how the cluster size changes based on the Spark workload per node group. When users submit their job to our gateway, the gateway service first creates the CRD on the corresponding cluster. It creates the driver pods on the driver node group to make sure the job can always be scheduled, and then the executor pods are created by the Spark operator in the pre-designated node group, scheduled by YuniKorn. Once the Kubernetes cluster autoscaler finds pending pods in whichever node group, it talks to the cloud provider to scale out a suitable number of nodes in that node group, which maps to our YuniKorn resource queue. Vice versa, once it finds that there are idle nodes, it terminates the nodes to save cost.
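As a rough illustration of this node group mapping, below is a minimal sketch of a SparkApplication custom resource, created through the Kubernetes Python client, that pins the driver and executors to separate node groups via nodeSelector and tags the pods with a queue label. The node group labels, namespace, image, queue name, and the use of the "yunikorn" batch scheduler and "queue" pod label are assumptions for illustration; the real platform's labels and scheduler wiring may differ.

```python
# Minimal sketch, assuming illustrative node group labels, namespace, queue
# name, and image; not the actual production manifest.
from kubernetes import client, config

spark_app = {
    "apiVersion": "sparkoperator.k8s.io/v1beta2",
    "kind": "SparkApplication",
    "metadata": {"name": "tenant-a-etl", "namespace": "spark-jobs"},
    "spec": {
        "type": "Python",
        "mode": "cluster",
        "image": "registry.example.com/spark:3.3.0",     # placeholder image
        "mainApplicationFile": "s3a://bucket/etl.py",     # placeholder artifact
        "batchScheduler": "yunikorn",                     # assumes operator support for this scheduler
        "driver": {
            "cores": 2,
            "memory": "4g",
            # Drivers land on a scale-out-only node group so they can always
            # start and their logs stay reachable.
            "nodeSelector": {"node-group": "spark-driver"},          # assumed label
            "labels": {"queue": "root.tenant-a"},                    # assumed queue label convention
        },
        "executor": {
            "instances": 100,
            "cores": 4,
            "memory": "8g",
            # Executors land on the tenant's dedicated executor node group,
            # which the cluster autoscaler grows and shrinks.
            "nodeSelector": {"node-group": "spark-exec-tenant-a"},   # assumed label
            "labels": {"queue": "root.tenant-a"},
        },
    },
}

config.load_kube_config()
client.CustomObjectsApi().create_namespaced_custom_object(
    group="sparkoperator.k8s.io",
    version="v1beta2",
    namespace="spark-jobs",
    plural="sparkapplications",
    body=spark_app,
)
```

With pending executor pods carrying the tenant-specific node selector, the cluster autoscaler has an unambiguous signal about which node group to grow.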
Besides this, we also provide some customized scaling controls on our auto-scaling clusters. For scale-in control, our backend only applies scale-in on the executor node groups, and the scale-in process is triggered only when there are no running executor pods on a node. We have enabled bin packing, provided by YuniKorn, to minimize the number of instances in use. The default allocation policy of the scheduler tries to evenly distribute pods across all the nodes; the bin packing policy instead sorts the list of nodes by the amount of available resources, so our scheduler allocates pods to the partially utilized nodes first and only then to the idle nodes. That way, the cluster autoscaler can trigger scale-in very efficiently.

On the right-hand side are EC2 machine utilization dashboards. The top one shows the metrics of a static queue without bin packing: most of the CPU and memory utilization is only around 10 percent, and only a few machines approach 45 percent. The bottom dashboard shows the metrics after enabling bin packing on an auto-scaling cluster: there is a pretty good usage rate for both CPU and memory compared to the massive waste before. Regarding scale-out control, we provide a scale-out-only setting on the driver node group, which guarantees that users always get their driver pods launched and can always check their logs there. We also speed up the scale-out latency by tuning some Spark configurations.

Now let me talk about our production status with this new feature. So far we have onboarded more than 19 internal teams onto our auto-scaling clusters over more than three months, and the average cost saving ranges from around 20 percent to 70 percent. During migration we found that all scaling events work as expected, and a machine will not be removed as long as there are running or active Spark pods on it. The scale-out latency is consistent and stays below five minutes; here the maximum scale-out range we are talking about is from two to two hundred machines. Moreover, the auto-scaling feature works with various types of resource usage patterns, such as ad hoc, ETL, and mixed patterns. In the meantime, we also found that, compared to the massive over-provisioning approach before, the runtime of workloads with auto-scaling enabled may increase. However, this is expected, because of the much higher usage rate of CPU and memory compared to the massive waste before. Given this, users need to take it into consideration and optimize their jobs if there is a strict data delivery time requirement.

I know we have covered a lot in a short time, so here are some key takeaways from developing and delivering this feature on our platform. Physical isolation and the min/max capacity settings are very important for customer requirements; we can leverage the node group min/max settings and the YuniKorn resource quota settings together to achieve this, and it will help us support budget-based control going forward. Guaranteeing that there is no impact to existing Spark jobs when scaling happens is the most important requirement for production jobs. We need to apply customized scaling controls based on the different node group types to provide this guarantee. In the meantime, we also need to enable bin packing to improve efficiency. The scale-out latency is important to large-scale jobs: by using the dedicated driver node group and tuned Spark configurations, we can keep the scale-out latency as low as possible.
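The talk mentions tuning Spark configurations to speed up scale-out without naming them. As one plausible, purely illustrative example (not necessarily the settings used on this platform), increasing the executor allocation batch size and keeping the batch delay short makes all pending executor pods visible quickly, so the cluster autoscaler can provision nodes in a single wave rather than trickling out requests.

```python
# Illustrative only: standard Spark-on-Kubernetes allocation settings; the
# values are placeholders and in practice would typically be set at submit time.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("fast-scale-out-example")
    # Request executor pods in larger batches with a short delay, so pending
    # pods surface to the cluster autoscaler almost at once and nodes can be
    # provisioned in one scale-out wave.
    .config("spark.kubernetes.allocation.batch.size", "50")   # default is 5
    .config("spark.kubernetes.allocation.batch.delay", "1s")
    .getOrCreate()
)
```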
Going forward, there are still a lot of areas to improve, such as how to support mixed instance types per cluster and how to fully support dynamic allocation. Spot instances are much cheaper than the on-demand instances we are using right now; supporting them will be another big win, with the help of remote shuffle services or a similar disaggregated compute and storage architecture. Then we can trigger scale-in more aggressively and even scale compute and storage independently with different auto-scaling controls. In the future, how to provide a predictive auto-scaling feature on the platform is another interesting topic. That's all for today's sharing. Thanks for your time.