Maybe, yeah, let's get started. Hi, everyone. Thanks for being here with us. I'm really excited to present the Spark on Kubernetes talk today. My name is Melody Yang. I'm a senior big data architect at AWS. With me is my co-presenter, Keyong Zhou. He is a big data engineer from Alibaba Cloud, and he is also a creator of Apache Celeborn. Thank you. So sorry, I have to present in English, because I don't want to make a mistake presenting in Chinese: "shuffle service" translates to something like "xi pai fu wu", which literally sounds like a card-shuffling service. We are not presenting a gambling industry use case; shuffle service here means Spark data shuffling. So I'm going to present in English, and Keyong is going to present in Chinese to deep dive into the Apache Celeborn service. Don't worry, if you don't understand, we have a lot of pictures and architecture diagrams for you, all written in English. And if you have any questions in English, that's totally fine, I can translate. OK, thank you for joining us. Yeah, so let's get started.

In the era of data driving everything, Apache Spark has emerged as a very popular big data framework. It can handle large-scale data processing needs. The common use cases are machine learning, such as the hot topic we are all talking about today, generative AI, and ETL processes. Some data scientists say 80% of their workload is ETL and data engineering. So what is the challenge when we talk about Spark on Kubernetes? One of the well-known challenges is how to support dynamic resource allocation, DRA for short. So this is our talk today. We will cover the key challenges in Spark on Kubernetes, especially around data shuffle and DRA. I will also share with you first-hand benchmarking results from testing the solution with Apache Celeborn. After that, I will hand over to Keyong to deep dive into Spark shuffle, DRA, and the Celeborn project, and he will also cover what is next on the roadmap.

So I want to just ask the room: how many of you are familiar with Spark on Kubernetes? Great, of course. OK, so actually less than half of the audience in the room, so I will quickly go through what Spark on Kubernetes is. On the left-hand side, we have the Kubernetes control plane. I'm a data person, so I don't know that much about Kubernetes internals, so I'll keep it quick and simple. Inside the control plane, we have the scheduler and the API server. When a user submits a Spark job, the control plane schedules the driver pod in the data plane. Inside the driver pod, we have those components. The driver pod then sends a request back to the scheduler saying it's time to schedule some executor pods. So this is what happens: we schedule three nodes in this case, then we schedule a number of Spark executors on each node, and finally the API server sends executor pod watch events back to the driver. This is the common workflow for Spark on Kubernetes.

Now let's talk about the DRA challenges with Spark on Kubernetes. What is this first one about? Since Spark 3.0, we have a lightweight way to support dynamic allocation without an external shuffle service. That means you can scale the number of executors up and down based on your workload. If an executor is idle, it gets removed from your Kubernetes cluster; if pending tasks exist, it scales up and requests more Spark executors. Can we see the problem here?
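To make that workflow a bit more concrete, here is a minimal sketch of pointing a Spark session at a Kubernetes API server from code. This is only illustrative: the API server URL, container image, namespace, and executor count are placeholders, and in practice you would usually submit through spark-submit in cluster mode rather than build the session like this.

```scala
import org.apache.spark.sql.SparkSession

// Minimal sketch: Spark driver asking Kubernetes to schedule executor pods.
// Endpoint, image, and namespace below are hypothetical placeholders.
val spark = SparkSession.builder()
  .appName("spark-on-k8s-demo")
  .master("k8s://https://my-k8s-api-server:6443")
  .config("spark.kubernetes.container.image", "my-repo/spark:3.5")
  .config("spark.kubernetes.namespace", "spark-jobs")
  .config("spark.executor.instances", "6") // static allocation; DRA is discussed next
  .getOrCreate()
```

With dynamic resource allocation instead of a fixed executor count, the driver grows and shrinks this set of executor pods, which is exactly where the shuffle problem discussed next comes in.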
So actually, Spark on Kubernetes doesn't support the external shuffle service. Even though ESS, the external shuffle service, is natively supported by Spark on YARN, it is not supported in the Kubernetes deployment mode. The lightweight solution I mentioned earlier is shuffle tracking, the last item here. Shuffle tracking must be turned on when you enable dynamic resource allocation with Spark on Kubernetes. What happens if we don't turn it on? That's the error message you will see in your Spark application immediately, and it's a little bit misleading: the SparkContext error says it cannot initialize the driver program because you have to turn on the external shuffle service. But like I said, the external shuffle service isn't supported there. So what can we do? Just turn on shuffle tracking.

So let's deep dive into the challenge when we do DRA with shuffle tracking. What's the problem with this lightweight solution? The dashed line is what we get with shuffle tracking, and the blue line represents the actual usage of compute and memory resources. We can see there's a lot of resource wastage here, the gap highlighted in red. Our customers have to pay for that idle time.

Let's go through an example of how shuffle tracking actually stops your Spark cluster from scaling down. The scenario here is a DRA setting that can scale up and down from 1 to 100 executor pods, and the executor idle timeout threshold is the default of 60 seconds. I will go into more detail about what that means. Let's look at the right-hand side. When we have pending tasks in Spark, it requests a scale-up of three EC2 nodes, just for simplicity. What's going to happen? The pending tasks run on three executors, one executor per EC2 node, again for simplicity. What happens when the pending tasks finish? There are no more pending tasks from our workload, and we have two executors idle for over 60 seconds. What's going to happen? Those executors are supposed to be removed and released because they have been idle for a long time. It's time to release the resources, but it's not that simple. Spark checks: does the executor hold any shuffle data? If the answer is yes, as on node 2, where one executor has been idle for more than 60 seconds but contains shuffle data, what happens? We have to keep node 2. If there is no shuffle data, we terminate those executors, so we can terminate node 3. Great, we saved some money by releasing one EC2 instance. However, the problem is node 2: we cannot scale it down. So that is DRA with shuffle tracking.

What's the second challenge? We tried to make the shuffle tracking timeout faster. Maybe you remember that 60-second timeout. What if we shorten the shuffle tracking timeout to five seconds? That's the configuration spark.dynamicAllocation.shuffleTracking.timeout set to five seconds. Then this happens. Fantastic: we don't really waste any resources, because we scale up and down very close to the actual usage. But what's the problem here? This is the problem from my testing. When you look at that long red bar, it says the stage failed, and the reason at the bottom here says fetch failed, meaning it couldn't fetch the shuffle data. Why is that? Because we timed out the executors too fast, every five seconds.
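To make those settings concrete, here is a sketch of the configuration being described, written as Spark conf entries in Scala. The 1-to-100 executor range mirrors the scenario on the slide, and the aggressive 5-second shuffle tracking timeout is the experiment that produced the fetch failures; treat the values as illustrative, not as recommendations.

```scala
import org.apache.spark.SparkConf

// DRA with shuffle tracking on Kubernetes (no external shuffle service available).
val conf = new SparkConf()
  .set("spark.dynamicAllocation.enabled", "true")
  .set("spark.dynamicAllocation.shuffleTracking.enabled", "true") // required on K8s without ESS
  .set("spark.dynamicAllocation.minExecutors", "1")
  .set("spark.dynamicAllocation.maxExecutors", "100")
  .set("spark.dynamicAllocation.executorIdleTimeout", "60s")      // 60s is Spark's default
  // Shortening this keeps utilization tight, but risks removing executors that
  // still hold shuffle data, causing FetchFailed errors and stage recomputation:
  .set("spark.dynamicAllocation.shuffleTracking.timeout", "5s")
```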
We kill the executors too aggressively, so Spark can't find the shuffle data because it has been removed. We lose shuffle data during these frequent timeouts, the Spark stage fails, recomputation is triggered, and your job runs slower.

So the third challenge is shuffle migration. What does that mean? Since Spark 3.1, we also have a graceful decommission feature; there are three key configurations for it, and when you need graceful decommission, you turn those on. What happens then? Your migration triggers extra cost for storage, compute, and network, because before you scale down one of the EC2 instances, you have to move its shuffle data to another healthy EC2 instance or compute node. That copy takes time, so it triggers extra cost.

The second point is the Spot Instance scenario, where we have a two-minute notification. Before we take a Spot Instance back, we notify the customer two minutes in advance: we need this compute instance, so it's time for you to move all the data off this EC2 instance. The problem is that the migration doesn't know about the two-minute threshold. If you have a large amount of shuffle data to move from A to B and you don't finish copying within that two-minute window, you lose part of the data, so the recomputation still happens.

The third bullet point here is quite an extreme scenario, but we do have customers who see this problem. Sometimes when you move the shuffle data from node A to node B, node B is about to be interrupted as well, unfortunately. So you have to keep moving, from node B to node C, from node C to node D, and your decommission process is dramatically delayed.

OK, so to solve those challenges, we went looking for a solution. I'm pretty sure there are lots of solutions in the open-source community, right? Recently we tested the Apache Celeborn project, and we found the results quite compelling. Inside AWS, this is the architecture we designed to host this remote shuffle service. There are two options here. The first one, on the left-hand side, hosts the Celeborn cluster outside of your EMR environment, so it's completely standalone; it can serve any other workloads, not only the big data or Spark workloads. On the right-hand side, which is quite interesting, we host the Celeborn cluster inside the EMR cluster. The key difference between these two is the HDFS storage; that's the only difference. When we host the Celeborn cluster inside EMR, it actually shares storage with the Spark executors.

In my benchmarking I actually used the left-hand-side option: I hosted my Celeborn cluster standalone inside a Kubernetes environment, and my EMR can run in any environment, it could be EMR Serverless or EMR on EKS, anywhere, as long as they can talk to each other. So this is my setup. I turn on Spark on Kubernetes DRA, dynamic allocation, with Apache Celeborn, which means I send all of my shuffle data out of my Spark cluster, across the network, to a standalone Apache Celeborn cluster. And we can see that even though the dashed line isn't a perfectly tight curve against the red actual-usage line, it is still very close to the actual resource usage. So this is the great result we can achieve using Apache Celeborn. So, can anyone guess?
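For reference, here is a sketch of the kind of Spark configuration this setup implies: point the shuffle manager at a remote Celeborn cluster and turn shuffle tracking off. The class name and the spark.celeborn.* keys follow the Celeborn documentation for recent 0.3.x releases, but the exact keys (and any extra DRA support settings) depend on your Spark and Celeborn versions, and the master endpoint is a placeholder.

```scala
import org.apache.spark.SparkConf

// Sketch: Spark on Kubernetes DRA with a standalone Celeborn cluster.
// The Celeborn master endpoint below is a placeholder service address.
val celebornConf = new SparkConf()
  // Route shuffle data to Celeborn instead of local executor disks.
  .set("spark.shuffle.manager", "org.apache.spark.shuffle.celeborn.SparkShuffleManager")
  .set("spark.celeborn.master.endpoints", "celeborn-master-0.celeborn.svc:9097")
  .set("spark.shuffle.service.enabled", "false")
  // DRA without shuffle tracking: executors hold no shuffle state,
  // so idle executors can be released quickly.
  .set("spark.dynamicAllocation.enabled", "true")
  .set("spark.dynamicAllocation.shuffleTracking.enabled", "false")
  .set("spark.dynamicAllocation.executorIdleTimeout", "10s") // the "sweet spot" used in the tests
```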
I did a side-by-side test, one with Celeborn and one without. Which one is DRA with Celeborn? Can anyone give me the answer? OK, I will give you the answer. On the left-hand side is the usual shuffle tracking with a slow executor release, and on the right-hand side we have the more responsive scaling that releases the executors up and down. Sorry, each blue box represents a Spark executor. You can see that the highlighted two-to-three-minute time period actually has some red boxes, which represent executors being killed by the driver because they are idle. But the other one, because of shuffle tracking, never shuts down the executors, even though they are idle.

OK, this is our favorite part: what's the result for our customers, how much do they need to spend when they enable Celeborn with DRA? I have two tables. The top one is using Amazon EMR with Spark; the bottom one is open-source Spark running on Kubernetes. We can see that when we run the jobs in parallel, the first run, DRA with shuffle tracking, costs around $2.32 US dollars, and the second run, EMR with Celeborn and without shuffle tracking, only costs $1.43. Why? Because we can release the EC2 nodes more responsively and quickly. So it actually provides up to 38% cost efficiency when we enable DRA with the Celeborn remote shuffle service. At the bottom is the open-source Spark on Kubernetes test. The baseline is already over $5 anyway, but you can still see that when we enable Celeborn without shuffle tracking, it's also around 30% cheaper. On the right-hand side are the settings we used, and you can see that the executor idle timeout is around 10 seconds. That is the sweet spot we used: we don't define a really aggressive timeout, but we still time out reasonably quickly.

All right. So that's all from me, the user perspective on Apache Celeborn. I'm going to invite my co-presenter, Keyong Zhou, to deep dive into Spark shuffle with Apache Celeborn.

Hello, everyone. I'm Keyong Zhou, from Aliyun. I'm going to use Chinese to share with you some of the design details of Apache Celeborn and explain why Apache Celeborn gives Spark on Kubernetes better performance and better stability.

OK, let's take a look at the traditional Spark shuffle, its flow and its existing challenges. Here, by traditional Spark shuffle I mean shuffle with the External Shuffle Service, ESS. We know that in distributed computing, shuffle is a very important operation and it consumes a lot of compute resources; there is plenty of data showing that overall it consumes more than 15% of resources. At the same time, it is itself not very efficient and not very stable. If you operate Spark regularly, you will often see large shuffle jobs whose shuffle fetch wait time is very long, and in the shuffle read stage there are often fetch failures or out-of-memory errors.

Why do these problems occur? Let's look at this picture on the right. This is the traditional shuffle flow. On the left, the mappers are the tasks that produce shuffle data; on the right, the reducers are the tasks that consume shuffle data. In a map task, as it produces shuffle data, it first sorts the data by partition ID, where every partition ID corresponds to a reducer at the bottom, and then writes the sorted shuffle data into a single file. At the same time, there is an index file that records the offset and length of each partition's data within that file.
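To make that write path concrete, here is a tiny conceptual sketch, not Spark's actual code, of what a map task's output looks like: records sorted by partition ID into one data file, plus an index of (offset, length) per partition that reducers later use to fetch only their slice.

```scala
// Conceptual sketch of sort-based shuffle map output (not Spark internals).
object SortShuffleSketch extends App {
  case class Record(partitionId: Int, payload: String)

  // A map task's in-memory output, destined for 3 reducers (partitions 0..2).
  val mapOutput = Seq(
    Record(2, "c1"), Record(0, "a1"), Record(1, "b1"), Record(0, "a2"), Record(2, "c2"))

  // 1) Sort by partition ID so each partition's records are contiguous.
  val sorted = mapOutput.sortBy(_.partitionId)

  // 2) "Write" one data file: payloads concatenated in sorted order.
  val dataFile = sorted.map(_.payload).mkString

  // 3) Build the index: (offset, length) of each partition inside the data file.
  val index: Map[Int, (Int, Int)] = {
    var offset = 0
    sorted.groupBy(_.partitionId).toSeq.sortBy(_._1).map { case (pid, recs) =>
      val len = recs.map(_.payload.length).sum
      val entry = pid -> (offset, len)
      offset += len
      entry
    }.toMap
  }

  // A reducer for partition 1 reads only its small slice -- with thousands of
  // reducers per map output file, these small scattered reads become random I/O.
  val (off, len) = index(1)
  println(dataFile.substring(off, off + len)) // prints "b1"
}
```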
In fact, in this process you can see that if a map task produces a large amount of shuffle data, it cannot sort everything in memory at once; it has to spill intermediate files to disk and merge them, which adds extra disk I/O. You can also see that each partition ID only gets a small amount of data from each map output. If you look at it from the perspective of a single shuffle file, that file is read by all of the reducer tasks, which is often several thousand of them, and every reducer only reads a small slice of the data. This brings a serious random I/O problem. From this process you can also see that the shuffle data is stored on the local disks of the compute nodes, which makes storage-compute disaggregation difficult.

In addition, there is the dynamic allocation problem with Spark shuffle: Spark cannot release an idle executor if that would lose the shuffle data stored on it. Melody mentioned some of this already. In a YARN environment, Spark has a solution. We know that on YARN, every node runs a long-lived service called the NodeManager, and Spark implements an auxiliary service inside the NodeManager, which is the External Shuffle Service. ESS is also a long-lived service. The executor hands over the shuffle files it produces to ESS for management; after that, the idle executor can be released, and downstream tasks read the shuffle data through ESS, so no data is lost. So in a YARN environment, the External Shuffle Service solves the dynamic allocation problem, although the performance and stability issues I mentioned still exist.

But that is not the case in the Kubernetes environment. Upstream Spark does not have an External Shuffle Service that solves the problem the same way on Kubernetes, so you can only use shuffle tracking or graceful decommission as workarounds. As we just saw, both of those have efficiency problems. Of course, we also understand that some users deploy the External Shuffle Service on Kubernetes as a DaemonSet. That can solve part of the problem, but it places requirements on the infrastructure: for example, it needs fixed addresses, and the ESS nodes need to be stable, because if one is evicted, shuffle data can still be lost.

So the question is: is there a way that solves the dynamic allocation problem on both YARN and Kubernetes, and at the same time gives better performance and stability than ESS? Yes, there is. Celeborn is a unified remote shuffle service. It takes over the shuffle data generated by the compute engine, so the executors no longer need to hold shuffle data themselves, which means they become stateless. In this case, when Spark decides to release an idle executor, there is no worry, because it will not cause data loss. From this diagram, Celeborn may look like just an independently deployed ESS, but it is more than that: compared to ESS, it has better performance and stability. Let me introduce the key differences from ESS.

First, here are the three important components of Celeborn. The Master and the Workers form the Celeborn cluster, which stores and serves the shuffle data. The Celeborn client lives inside the Spark application and is divided into two roles: one is the LifecycleManager in the driver, which is in charge of the shuffle lifecycle management for the current application; the other runs in each executor and is in charge of pushing and fetching the shuffle data.
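For contrast with the Kubernetes situation, this is roughly what the classic YARN setup looks like from the Spark side, a sketch using standard Spark configuration keys: the NodeManager's auxiliary shuffle service is enabled, so idle executors can be released without losing their shuffle data.

```scala
import org.apache.spark.SparkConf

// Sketch: classic DRA on YARN, where the NodeManager hosts the
// External Shuffle Service and keeps serving shuffle files after
// idle executors are released.
val yarnConf = new SparkConf()
  .set("spark.shuffle.service.enabled", "true")   // use the ESS running in the NodeManager
  .set("spark.dynamicAllocation.enabled", "true")
  .set("spark.dynamicAllocation.minExecutors", "1")
  .set("spark.dynamicAllocation.maxExecutors", "100")
  .set("spark.dynamicAllocation.executorIdleTimeout", "60s")
// On Kubernetes there is no NodeManager to host the ESS, which is why the
// shuffle tracking and decommission workarounds discussed earlier exist.
```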
On the service side, to make the service reliable, we made the Master highly available: multiple Masters reach consensus through the Raft protocol. The Master's role is to manage the entire Celeborn cluster and to allocate load, deciding which Workers serve which shuffles. The Workers' role is to store and serve the shuffle data. We can see that there are links between these components, and the dashed lines represent control messages: the Workers have a connection to the Master, the LifecycleManager also has a connection to the Master, and the executors have connections to the Workers for the shuffle data itself.

Let me introduce the whole flow when a shuffle goes through Celeborn. When an executor starts to write shuffle data, it sends a message to the LifecycleManager saying, I need to register this shuffle. The LifecycleManager then asks the Master to allocate a set of Workers to serve the shuffle, and the Master chooses some Workers according to the current load. The LifecycleManager broadcasts these Workers to each executor, so each executor knows which Worker it needs to push each partition's shuffle data to. After the map tasks are done, the LifecycleManager sends a commit files request to all of the shuffle Workers, and they flush the shuffle data to storage. After that, the reduce tasks start and read the data from each Worker's fetch server.

From this flow, first of all, you can see that Celeborn is a push-style shuffle. But more important than the push style, Celeborn has two key design points: one is the aggregation of partition data, and the other is the splitting of partition files. To be specific, with push style plus partition data aggregation, each mapper pushes the data belonging to the same partition to the same Worker. For example, partition 1's data belongs to Worker 1, so mapper 1 and mapper 2 both push their partition 1 shuffle data to Worker 1. After the data arrives at Worker 1, it is buffered in memory and then written out as one complete file for that partition. In the shuffle read stage, each reducer reads a complete partition's data from the right Worker.

If we look at the shuffle read stage, we can see that, first of all, the network connections go from the original many-to-many pattern to essentially one-to-one. Secondly, a partition file has a much larger size, for example 256 MB, and a file like that is read sequentially, so the random I/O turns into sequential I/O. Thirdly, you can see that the mappers push the shuffle data directly to the Celeborn cluster, so the executors do not need to keep the shuffle data locally, which is much better for storage-compute disaggregation and elasticity.

The picture on the right shows what happens when my shuffle is very large and a partition file grows very big: it may not fit well on the disk, or it may create a hotspot and slow down Worker 1. Celeborn provides a mechanism called split. It monitors the size of the partition file, and if it exceeds the threshold, 1 GB by default, subsequent data for that partition is written to a new split of the same partition. All of this information is recorded in the LifecycleManager, and the reducer then reads the shuffle data from both splits.

That was an introduction to Celeborn's core architecture. Next, I will introduce some evaluations. These compare ESS against Celeborn 0.2.0 and 0.3.0 at three increasing shuffle data sizes. You can see that for large shuffle sizes, Celeborn is a significant improvement over ESS, and the bigger the shuffle size, the more obvious the improvement.
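Here is a small conceptual sketch, again not Celeborn's real code, of the two ideas just described: every mapper pushes records for a partition to the Worker assigned to that partition, the Worker aggregates them into one file, and when a partition file passes a size threshold a new split is started. The names and the 1 GB threshold only mirror the description above.

```scala
// Conceptual sketch of push-based shuffle with partition aggregation and splits
// (illustrative only, not Celeborn internals).
object PushShuffleSketch extends App {
  val splitThresholdBytes = 1L << 30 // mirrors the ~1 GB default mentioned in the talk

  final class Worker(val name: String) {
    // partitionId -> list of splits; each split is an append-only buffer
    val splits = scala.collection.mutable.Map.empty[Int, List[StringBuilder]]

    def push(partitionId: Int, data: String): Unit = {
      val current = splits.getOrElseUpdate(partitionId, List(new StringBuilder))
      val active = current.head
      if (active.length + data.length > splitThresholdBytes) {
        // Partition file too large: start a new split for subsequent pushes.
        splits(partitionId) = new StringBuilder(data) :: current
      } else {
        active.append(data)
      }
    }

    // A reducer reads the whole partition: every split, sequentially.
    def fetch(partitionId: Int): String =
      splits.getOrElse(partitionId, Nil).reverse.map(_.toString).mkString
  }

  // Two mappers push partition 1 data to the same worker; it is aggregated there,
  // so the reducer gets one contiguous read instead of many small random reads.
  val worker1 = new Worker("worker-1")
  worker1.push(partitionId = 1, data = "from-mapper-1;")
  worker1.push(partitionId = 1, data = "from-mapper-2;")
  println(worker1.fetch(1)) // prints "from-mapper-1;from-mapper-2;"
}
```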
Because, as we know, the total shuffle time can be divided into shuffle write and shuffle read. The total times were in the previous table; the first number is the shuffle write time and the second is the shuffle read time. With Celeborn's architecture, the shuffle write time is not a very obvious advantage, but from the earlier introduction we know that Celeborn mainly addresses the stability and performance of shuffle read, and you can see that the shuffle read time is significantly reduced.

The next screenshot shows that after using Celeborn, the stability of large shuffle jobs is greatly improved. This is one of our users' production jobs. You can see that the shuffle size is over 400 TB, which is a pretty extreme number, and the number of tasks is also extreme: the shuffle stage has a total of about 380,000 tasks. But across the entire run, we can see that there is not a single failed task. So it demonstrates very well that Celeborn improves both performance and stability for very large shuffle jobs.

We also ran a test on the standard TPC-DS benchmark at 10 TB scale. The conclusion, since Celeborn also supports two replicas, is that with a single replica it has about a 20% performance improvement compared to ESS, and with two replicas about 15%.

I mentioned earlier that Celeborn improves the performance and stability of shuffle. So is it merely a shuffle service, and only for Spark? In fact, no. In the future, we hope that Celeborn will also be able to manage spilled data and other intermediate data, so that the intermediate data generated by big data engines is managed by Celeborn, and the compute side can truly get rid of its dependence on local storage. For example, we hope to use memory storage for small shuffles, and cheaper storage tiers to reduce the cost of the Celeborn cluster. The LinkedIn team, for example, will contribute authentication and security isolation features to the community. In the future, we also plan to support more compute engines. Finally, we hope that some of you can join us. Thank you. Are there any questions?

Thank you so much. Sorry, it's going to be in English. When it comes to your node sizes, when you're actually running the executors, if the shuffle files can fit on the executors, is it sometimes not advantageous to do a backup? You write to the executors and then, at the same time, send out to Celeborn. If you wait on Celeborn every time, your write time might be longer than just writing to the disk on the executor itself. Doing a backup at the same time allows you to continue with the task and not block on writing. I'm wondering if that's a use case, because I think the exact example was that you directly write to Celeborn.

We directly send all the shuffle data out of the Spark cluster to Celeborn over the network. Are you asking whether we have a decision mechanism before sending over to Celeborn, so that if the data is small and it can fit on the node, you just write it locally to the executor? Is that your question? If I understand correctly, your question is: can we use Celeborn as a backup, or, for example, use the typical local shuffle for small shuffle jobs and Celeborn for very big shuffles? The answer is yes, because currently Celeborn's shuffle integration has a plug-in mechanism, so you can customize which shuffles go to local shuffle and which shuffles go to Celeborn.
Based on multiple policies, for example the shuffle size, or the status of the Celeborn cluster, and so on, you can have a configuration that forces a shuffle to use the traditional shuffle mechanism.

If you have heterogeneous hardware, is it going to be aware of the actual node size too, or is the setting just, if the shuffle is larger than 2 GB then go local? Are you referring to the Spark executor pod or the Celeborn pod? If the shuffle is bigger than 2 GB, you send it to Celeborn. But if your pod has ephemeral storage of 300 GB, you're like, oh, I can fit 300 GB, right?

Yeah, for now, Celeborn does not support dynamically switching between local shuffle and Celeborn within one shuffle. That is to say, if you decided that a given shuffle ID uses Celeborn, you cannot have some tasks of that stage use Celeborn and all the other tasks use the local disk. It doesn't support that. It only supports the whole shuffle for one shuffle ID: either all tasks in that shuffle stage use Celeborn, or all tasks of the shuffle use the local disk. It can't switch between traditional shuffle and Celeborn within one shuffle. If I understand correctly, you may be referring to deciding, based on the size of the pod's storage, whether we should go to Celeborn or not. Yeah. My opinion is that it can be a policy; you could implement such a policy so that it does this. But I'm not sure whether it's a generally good policy, because if, for example, the number of tasks is really large, say more than 10,000, then even though your executor pod has enough disk storage to hold the shuffle data, it may not be efficient because of the random I/O. So I think it depends.

Yeah. So I'll just add a little bit to that explanation. Apache Celeborn is not good for everything, so we need to make decisions about what kind of use case you have. If your entire Spark job only contains gigabytes of shuffle data at each stage, you don't need a remote shuffle service; you can just use the default shuffle tracking, that's fine. This is for the extremely large scale, so we have that example: we have a customer with hundreds of terabytes of data shuffling in one single stage. That's when you need to enable the Apache Celeborn remote shuffle service for that particular job. Usually, if your Spark job is okay and doesn't have large-scale shuffling, you don't need to turn it on.

The reason I was asking is because you can have both. If you allow a backup to happen concurrently, then when you do your shuffle fetches, you can either read from the executor or read from the Celeborn cluster, the remote shuffle service, and based on that, if the executor goes down... So you mean that, for example, in the shuffle write phase, the map task writes the shuffle data both locally and pushes it to the remote service? Yeah, this is how Apache Spark does it. Yeah, I know that method; it was contributed by people from LinkedIn and the project is called Magnet. With the external shuffle service, I think it's a good solution. And for Celeborn, I also think it could do this; we can have a try. Maybe you can contribute to the community to bring in this fantastic extra feature as well, to speed up the entire performance. HDFS writes are very slow, so sometimes it's like... Yeah, that's right. Thank you for your questions. Are there any other questions in the audience? Yeah, sure. Thanks for your great talk. I have two questions.
First is about the cost reduction numbers. Do they also include the cost of the Celeborn cluster? Which cost? Is it this one, the benchmarking? In your talk. This one? Yes. So which cost are you referring to? The 38% cheaper when using... Yeah, so the 38% cheaper is purely comparing these results, so this cost. And this cost contains everything: the EMR uplift premium price, plus the EKS compute and EC2 instance price, plus storage. Plus the Celeborn storage cost? Very good question. It didn't include Celeborn, but because all of these runs have the same Celeborn cost behind the scenes, I assume the cost should be highly similar. But that's a very good point, we should add the Celeborn cost in; I'm very confident it shouldn't make a big difference. The big difference is the dynamic resource allocation scaling up and down: when you release an EC2 node, that saves you a lot of money.

Okay. And my second question is, I've seen in your slides that Celeborn runs directly on EC2 nodes. Does it support running on a Kubernetes cluster? Actually, it is running in a Kubernetes environment. Probably this diagram wasn't that clear; actually, this is inside Kubernetes. Sorry, this should say EKS rather than just EC2 for the Celeborn cluster. Also, does Celeborn support auto-scaling for the workers? Correct.

Let me add a bit. For the first question, we have some users that use exactly this setup, Spark on Kubernetes with a Celeborn cluster, and they told me that the overall cost decreased remarkably because of the better elasticity. And the Celeborn cluster usually does not require much resource. You can think of it like this: with traditional shuffle, your compute node, your pod, has to provide the storage and network resources for the shuffle; with Celeborn, you just offload that storage and network to the Celeborn cluster. So it does not add much more resource overall, and at the same time, because it is more efficient and more I/O-friendly, it actually reduces the resources spent on storage and network. That's for the first question.

And for the second question, yes, Celeborn supports being deployed on Kubernetes, and it supports dynamic elasticity. Even though it is not very complete for now, it does support decommissioning a Celeborn Worker, and the Worker is released once it finishes serving its local shuffle data. Okay, thank you very much. I like today's talk. Thanks. Thank you.

Any more questions? I understand it's already lunchtime; I really appreciate that everyone has the passion to stay here with us and ask all these questions. Yeah, please go ahead.

Yeah, thanks. I have a very quick question. I know it's probably a little bit off topic, because my company uses both Kubernetes and Databricks to run Spark. I just wonder if there's any chance, from an upstream perspective, that Celeborn can be enabled by a SaaS provider such as Databricks, because we like the simplicity of having SaaS rather than running everything from scratch, if it's available. May I ask a couple of questions first: where is your Databricks hosted? Our Databricks is hosted on Azure. On Azure, okay. So Databricks at this stage doesn't support containerization, so your Databricks clusters have to run on the EC2-equivalent compute nodes. So if you want to use Celeborn, self-hosted, you could run the Celeborn cluster either on compute nodes similar to your Databricks compute nodes.
The elasticity, the up-and-down scaling, probably won't be that good in that case because it's a fixed set of EC2 instances. However, if you run the Celeborn cluster in a Kubernetes environment, which in Azure would be AKS, yes, AKS, then hosting it there should enable you to use Celeborn. But the downside is you need to set up the networking, right? Your Databricks compute needs to be able to talk to your Kubernetes cluster in AKS, because you can imagine you will have large amounts of data being sent over from the Databricks cluster into the Kubernetes environment, and also a lot of reads happening. So the network could be the bottleneck, and you have to fully test whether it's worth it or not. Okay. Yeah. All right. Thank you.

Okay, let me add something. If you are asking whether we need to modify Spark to use Celeborn, the answer is no. But the scheduling of Databricks clusters is managed by Databricks itself, so yeah. I kind of answered it myself: if it's not supported, that's the end of the story, unless you want to build everything in Kubernetes or AKS yourself. I just wonder, from an upstream perspective, if there are any plans in the future for it to be enabled, because I'm interested in reducing any potential costs that I can. Yeah. So actually, at AWS we did introduce this with EMR on EKS. Our EMR product actually does offer this solution as part of the EMR on EKS feature. Yeah, just to let you know. Yeah.