Hi, everyone. Thank you for coming to this session. My name is Bo, and I'm from Apple. I have been working in the big data area for more than 30 years, so it's a long time. Now I'm at Apple building the data platform and the machine learning platform. Very nice to see you here. And my colleague: Hi, bonjour. My name is Hai. I'm also from Apple, and I work on the data infrastructure team. I'm glad to be here talking about reliability and cost efficiency for running Spark on Kubernetes in the cloud. Thank you.

Good, let's get started. This talk is about cloud native data processing and Apache Spark. First, a very quick survey: how many people here have used Spark before? Oh, great. Then you are in the right place, and I'm in the right place.

Here is a quick agenda. We will very quickly introduce Spark and how fault tolerance works in Spark. Then we will present the kind of problem we are trying to solve when running Spark. There are already many solutions, built for different reasons, so you will see a lot of them. And we built another one, which we call the Cloud Shuffle Manager, or CSM. We will explain how it works and what benefits it brings. Then hopefully we can have more discussion, so if you have any questions, please ask so we can discuss more.

Very quickly about Spark: Spark was created about 15 years ago, so it has been around a long time. It is a unified framework for data processing at very large scale, and it is very fast. Under the hood, it is based on MapReduce, a very simple concept introduced by Google: if you have a very large amount of data, you split it into small chunks and process those chunks in parallel, so you can apply your compute power at massive scale and finish very quickly. Originally, Spark ran in your own data center, for example on YARN; these days people bring Spark to the cloud, where it can run on Kubernetes.

Here is a very short explanation of how MapReduce works. MapReduce processes data in two stages. The first stage is the map: it splits the data and puts related data together on each machine, which here we call an executor. Here is a SQL statement, select word and count, to do word counting, a very simple SQL execution. The map side reads the data splits, then the data is shuffled so the same word lands in the same place. On the reducer side, it just counts each word and generates the output, which says how many times each word appeared. Underneath, Spark is just this simple. Even though it is simple, Spark brings multiple MapReduce stages into the framework, and between different stages it shuffles data. When it shuffles data, it exchanges data files among different executors, and we will see that some problems can happen here.

First, how does Spark solve the fault tolerance problem today? Because of these separate stages, if one stage goes wrong, Spark will try to recompute the data from the previous stage and then continue running. For example, say executor 3 dies, and executor 5 depends on executor 3. What happens? Spark launches a new executor, executor 6, and this executor reprocesses the data. Now the next stage is happy: executor 5 can fetch the data from there and continue running. So what problem do we have? Let's think about this scenario.
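To make the word-count example concrete, here is a minimal sketch in Scala. The input path, view name, and column name are illustrative assumptions; the point is that the GROUP BY is what triggers the shuffle between the map and reduce stages.

```scala
import org.apache.spark.sql.SparkSession

object WordCount {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("WordCount").getOrCreate()
    import spark.implicits._

    // Map side: split each line into words, one row per word.
    spark.read.textFile("input.txt")
      .flatMap(_.split("\\s+"))
      .toDF("word")
      .createOrReplaceTempView("words")

    // The GROUP BY forces a shuffle: rows with the same word are
    // exchanged across executors so each reducer counts one word.
    spark.sql("SELECT word, COUNT(*) AS cnt FROM words GROUP BY word")
      .write.csv("output")

    spark.stop()
  }
}
```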
Normally you have a lot of data, and an executor may live across multiple stages. In this case, executor 1 spans stages 10 and 11, so it holds shuffle data for both stages. Now suppose executor 1 dies and executor 5 in the next stage gets a fetch failure. What happens? Spark launches another executor for the middle stage. That middle stage tries to read data from the previous stage, but because the executor from the previous stage is dead, the middle stage fails again, and Spark launches a new executor for the previous stage too. The failure propagates back through earlier stages, a chain reaction. The result is that the job may run slowly because of all the retries, or, since there is a limit on retries, the application may fail if retries happen too many times.

This is a real problem in the cloud. People like to use spot VMs because they are cheap and save costs, but the downside is that they can be terminated by the cloud vendor at any time. If one is terminated, your Spark job will very likely fail because all the shuffle data on it is lost. Spark's dynamic allocation feature will also kill your executors, so it can cause your application to fail in the same way.

How do we solve this? In the cloud era, the idea is very simple: decouple compute and storage. Here the storage is the shuffle storage. Instead of keeping the shuffle data on local disk, we store it on remote storage. Now when your executor is gone, the data is still in remote storage, so execution can resume at any time. This is very powerful: it means that when you run a Spark application, you can kill executors at any time without impacting the success of the application. The downside is that remote storage, like cloud object storage, can be slow when you read a lot of files. It is good at throughput but not at latency, so many small files will be pretty slow. We have some ways to solve this problem as well.

Before our solution, the industry already worked on this for a few years and came up with all these solutions; you can search for them and find a lot of information. Before Apple, I was at Uber, where I built my previous shuffle service, the Remote Shuffle Service. There we launched dedicated servers to store the shuffle data. Here at Apple, I built another version, because we wanted it to be serverless; we did not want to run yet another server. We will see how we do that later.

Here is a quick overview of how the different solutions differ. We look at three angles: whether it supports remote storage, what the operational cost is, and whether it supports spot VMs. Right now, our solution, the Cloud Shuffle Manager, is the only one that satisfies all three dimensions. The other solutions each cover some part and fit some scenario; every solution is good in its own scenario.

Here is the overall architecture of our solution. On our side, we built the whole platform around Spark Gateway. Spark Gateway is also an open source project; you can check the link and go there. It helps you run Spark jobs very easily on Kubernetes. We enhanced it with a Spark Config Manager, so it can inject the Cloud Shuffle Manager related configuration, and users don't need to do much work.
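As a rough illustration of the kind of configuration such a config manager might inject: `spark.shuffle.manager` is a real Spark setting for plugging in a custom shuffle manager, but the CSM class name and the `csm.*` keys below are hypothetical, since the project is not open source.

```scala
import org.apache.spark.SparkConf

// Hypothetical sketch of injected configuration. spark.shuffle.manager
// is a real Spark knob; the class name and csm.* keys are invented.
val conf = new SparkConf()
  .set("spark.shuffle.manager", "org.apache.spark.shuffle.csm.CloudShuffleManager")
  .set("spark.shuffle.csm.storage.uri", "s3://my-bucket/shuffle") // remote copy location
  .set("spark.shuffle.csm.asyncWrite.enabled", "true")            // copy in the background
```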
Then on the right side is how the Cloud Shuffle Manager is implemented. The green blocks are the new components we add to Spark. The first is a dual shuffle manager: while the executor is running, we copy the shuffle files from local disk to cloud storage. It's just a very simple copy, but it's fast, and we can also optimize the copy, which we'll explain later. Once the data is generated, another executor reads it and execution continues. What happens if the previous executor is dead? We add a fallback shuffle reader: if the previous executor is dead, the fallback shuffle reader reads from cloud storage instead and continues running. This makes your application very reliable; it should never fail. Sometimes when an executor dies, it takes a while to detect that, which slows down the whole process, so we proactively added a dead executor detector. If we detect that an executor is dead, we read from cloud storage directly without waiting for the fallback, which minimizes the failure recovery time. Now my colleague will dive into some details and explain how we tested it.

Okay, I would like to talk about some details of our design. Firstly, why did we choose cloud storage as the place to store the shuffle data? Because it brings many benefits we really like: high availability, high scalability, built-in lifecycle management so we can clean up data automatically very easily, and security features such as fine-grained access control and encryption. A lot of built-in services and features are available to us. This is really an easy, even lazy, approach: we try to take advantage of all the existing services from the cloud so we don't have to reinvent the wheel, and we save effort. That's the main reason we chose cloud storage. On the other side, the main challenge is that cloud storage is relatively slow compared to local SSDs, and it is a real challenge: much of our effort went into addressing that problem.

As I mentioned, that's the challenge, so we had to make a number of optimizations to achieve cost efficiency. Here we list a few of them. Firstly, we read from cloud storage only if we have to, meaning the majority of reads still go to the fast local disk; we only fall back when we must. Also, we added a feature called async write. Async write basically means we don't have to stop and wait until the copy finishes: the reducer side can continue to move on while the copy happens asynchronously in the background. This saves us some runtime. Another feature is leveraging some caching mechanisms to cache small files. For example, the index files are pretty small, so we cache them on both the executor side and the driver side to boost performance.

With all this work, we were able to evaluate the performance we really care about. To make a fair, meaningful evaluation, we first did benchmarking using TPC-DS. TPC-DS is an industry-standard benchmark for typical Spark workload performance tests, and it allowed us to run a number of queries against CSM to evaluate how it works. Secondly, we developed a utility called the termination simulator.
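Conceptually, the fallback reader works like this. The sketch below is a hypothetical illustration, not the actual CSM code: `CloudStorageClient` and the block path layout are invented names, and the real implementation hooks into Spark's shuffle reader internals rather than sitting beside them.

```scala
import java.io.InputStream
import org.apache.spark.shuffle.FetchFailedException

// Invented abstraction standing in for S3/GCS access.
trait CloudStorageClient {
  def open(path: String): InputStream
}

class FallbackShuffleReader(cloud: CloudStorageClient, prefix: String) {

  // Try the normal executor-to-executor fetch first; if the mapper is
  // gone (e.g. a spot VM was reclaimed), read the copy that the dual
  // shuffle writer uploaded to cloud storage.
  def readBlock(shuffleId: Int, mapId: Long, reduceId: Int)
               (fetchFromExecutor: => InputStream): InputStream = {
    try {
      fetchFromExecutor
    } catch {
      case _: FetchFailedException =>
        cloud.open(s"$prefix/shuffle_$shuffleId/map_${mapId}_reduce_$reduceId")
    }
  }
}
```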
Basically, we want to simulate what happens in the real world. We run the benchmarking jobs one after another: first we run the baseline, then we run the job with CSM enabled, and we trigger the termination at the same time, in the same stage. We apply the same conditions to both so we can compare the results afterwards. The key metrics we care about are two-fold. One is the overhead: we want to make sure the overhead we introduce is insignificant, or at least reasonable. On the other side, we want to measure the runtime reduction: we want to see the runtime reduced significantly so we can claim a benefit.

You may be familiar with this UI; it's the Spark History UI. It shows what happens when the termination occurs. In this example, we killed four executors, two at a time, and note that the shuffle data is lost when the termination happens, because every executor had some shuffle data on it.

This shows what happens without CSM: the baseline, how native Spark behaves when termination happens. At the bottom you can see a fetch failed exception. That means the reducer side hit this exception: it failed to fetch shuffle data from the mapper side because the mappers had been killed; they were gone, so the shuffle data was lost. As a result, on the left-hand side you can see multiple stage retries, because Spark had to regenerate the shuffle data. On the right-hand side you can see multiple entries in the input column: when the job retried, it had to re-read the data from the source. All those retries take a lot of time, and that all costs dollars.

Here is what happened when we enabled CSM: there was no stage retry, and the data was read only once. Behind the scenes, with CSM we have another copy of the shuffle data on cloud storage, so when the executors were killed, the reducer was able to fetch the other copy from cloud storage and continue. It didn't need to retry. The result is shown at the bottom: the runtime with CSM enabled is about a 50% reduction compared to the baseline.

We ran this benchmark multiple times, using a scheduler to run the jobs regularly, and on average we observed about 5% to 10% overhead, due to the extra write. That's the part we watch. With that, we achieved reliability: there were no job failures, because we had another copy of the shuffle data on cloud storage, and there were not many stage retries either, because the job was able to continue without regenerating the shuffle data.

We also ran this on production jobs, basically large-scale applications in production, and we observed similar results. On quite a few of our production jobs we enabled async write, and we observed essentially the same runtime as the baseline, meaning the overhead is very minimal with async write enabled. Basically, we take a little risk but gain a performance boost. We did notice about a 5% CPU usage increase, mostly because reading the shuffle data from local disk involves decompression and decryption, and the copy itself takes CPU.
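The termination simulator itself isn't shown in the talk, but it could be approximated with Spark's developer API for killing executors. A minimal sketch, assuming hard-coded victim executor IDs (the real utility presumably picks the stage and victims dynamically):

```scala
import org.apache.spark.SparkContext

object TerminationSimulator {

  // SparkContext.killExecutors is a real developer API that asks the
  // cluster manager to remove the given executors. Polling for the
  // target stage lets the baseline and CSM runs lose executors at the
  // same point in the job.
  def killWhenStageStarts(sc: SparkContext, targetStage: Int, victims: Seq[String]): Unit = {
    new Thread(() => {
      while (!sc.statusTracker.getActiveStageIds.contains(targetStage)) {
        Thread.sleep(500)
      }
      sc.killExecutors(victims) // e.g. Seq("1", "2"), two at a time
    }).start()
  }
}
```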
There is also network IO, so there is a slight overhead, about 5%. Just to recap CSM: it is a solution that increases reliability for Spark, especially when Spark is running on top of Kubernetes in a cloud environment. It is a serverless approach, meaning it doesn't require setting up a separate shuffle service, which saves a lot of effort: we save costs both on the compute side and on the SRE side. And we take advantage of cloud storage as a mature service: it is reliable, scalable, and secure, with a bunch of features we just want to take and use.

This solution can be applied in multiple scenarios. Firstly, it allows us to run Spark on top of spot VMs. Spot VMs come with a significant cost discount, so this enables cost efficiency by utilizing spot VMs. Another use case is dynamic allocation, and I want to talk a little bit about that. Currently, if we run Spark on Kubernetes and want dynamic allocation, we have to enable shuffle tracking, and the shuffle tracking timeout is infinity by default. That basically means that if there is any shuffle data on an executor, dynamic allocation will not work very effectively. But with CSM, since we have another copy of the shuffle data on cloud storage, we achieve a decoupling of compute and storage. When we enabled it on our production jobs, we observed dynamic allocation happening more effectively and efficiently. It is essentially horizontal auto-scaling for Spark at the job level.

That's pretty much it for today's talk. We'd like to take questions or feedback if there is any. Thank you.

Yeah, thank you. If you have any questions related to this, or any questions about Spark dynamic allocation, feel free to ask.

Oh, the question is: is it open sourced? It is not open sourced yet; we are working on it. But the idea is general, and overall it doesn't take too much effort to implement it yourself. Okay, go ahead. Sorry, I cannot hear; there's a microphone there.

Can you elaborate a little bit on the data format that you are using when you're writing to cloud storage? Is it something proprietary?

Okay, I can answer this. The question is about the data format. The short answer is that we didn't change the data format, because we want the solution to be simple. The long answer is that a Spark shuffle file is segmented: it has several segments, each corresponding to a partition split in Spark. We just copy the whole file to cloud storage; we are not trying to do any sorting or merging of those segments yet. But it's a good question; that may potentially improve the performance further.

I may have one. You mentioned that storing the shuffle data on the cloud blob store is secure. How do you achieve that? You mean security? Yes, like authentication: can another user read the very same bucket or the same data? Good question. As I mentioned, we try to fully leverage the features of the cloud storage, and cloud storage comes with fine-grained access control. We basically use that access control to control access; for example, we use IAM roles, which gives us authentication and authorization on the shuffle data.
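The settings referred to here are real Spark configuration keys. A sketch of the dynamic allocation setup, including the timeout-to-zero trick the speakers describe later in the Q&A, which is only safe when a remote copy of the shuffle data exists:

```scala
import org.apache.spark.SparkConf

// Real Spark settings for dynamic allocation on Kubernetes.
// shuffleTracking.timeout defaults to infinity, which keeps executors
// alive as long as they hold shuffle data.
val conf = new SparkConf()
  .set("spark.dynamicAllocation.enabled", "true")
  .set("spark.dynamicAllocation.shuffleTracking.enabled", "true")
  // With a cloud copy of the shuffle data (as CSM provides), the
  // timeout can drop to zero so idle executors are reclaimed at once.
  .set("spark.dynamicAllocation.shuffleTracking.timeout", "0s")
```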
And we also have key-based authorization, so we can keep the shuffle data secure on cloud storage. That explains the cloud storage part. On the Spark side, there is a setting to enable shuffle data encryption. It's a Spark config supported by native Spark, and we leverage that as well. After you enable the encryption, Spark generates a unique key for each application and uses that key to encrypt your data before writing to the local disk and to cloud storage. Because other applications and other people do not know that key, the data is secure: only your own application can read it.

All right, another question. From what I understand, the lack of an external shuffle service means that if you're trying to use dynamic executor allocation, Spark is going to have, how to say, a harder time downscaling executors, because those partitions may still be needed in future stages. Does your solution allow Spark to downscale those dynamically allocated executors?

Yeah, exactly. As you mentioned, without another copy of the shuffle data, all the shuffle data is stored on local disk and is tied to the executors. So even if you enable dynamic allocation, shuffle tracking will prevent the scale-down from happening effectively. Thank you. We just want to explain that with our Cloud Shuffle Manager, you can enable that and downscale very easily; with native Spark alone, it normally doesn't work well. There is a risk involved if you don't set the tracking timeout to infinity: if an executor gets deleted, you lose the data and start again. When you use the Cloud Shuffle Manager, we set that timeout to zero, so it expires immediately: an executor can be shut down at any time, because you can get the data from cloud storage.

By cloud storage, are we talking about S3, GCS, and that kind of thing? Yes, exactly. We are trying to make the solution cloud provider agnostic, so it can be used across different service providers.

One curiosity: how long did it take you to build that? Because you said it's not open source yet and you can build it yourself, and the idea is quite easy, but I'm curious how long it took. Good question, thanks for asking. Altogether it was not that easy: as you saw, there are so many prior solutions, and we tried different ideas and went through multiple iterations. The idea is very simple now, but only after several rounds of iteration. If you just focus on the idea, the change is only in the shuffle writer and the reader. It's a small change, plus some retry logic in Spark internal code. I would say that if you are very familiar with Spark, you can do the change, including some testing time, in maybe one or two months.

And did you also need to patch Spark core, or was it possible to do it just using plugins, let's say? Both are possible. If you want to do it quick and dirty, you can just make the change inside the Spark core code. But Spark has an abstraction for this, the ShuffleManager interface. For my previous version of the shuffle service, I used that interface without needing to change the Spark core code. That takes somewhat more time, but both are possible.
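For anyone curious what the plugin route looks like, here is a heavily simplified skeleton that delegates everything to Spark's built-in SortShuffleManager; a real CSM-style manager would wrap the writer to also upload shuffle files and the reader to fall back to cloud storage. This is a sketch under assumptions: the exact ShuffleManager method signatures vary across Spark versions (these match Spark 3.2+), and because the trait is package-private, implementations live under the org.apache.spark.shuffle package.

```scala
package org.apache.spark.shuffle.csm

import org.apache.spark.{ShuffleDependency, SparkConf, TaskContext}
import org.apache.spark.shuffle._
import org.apache.spark.shuffle.sort.SortShuffleManager

// Illustrative skeleton only, not the actual CSM implementation.
class CloudShuffleManager(conf: SparkConf) extends ShuffleManager {
  private val delegate = new SortShuffleManager(conf)

  override def registerShuffle[K, V, C](
      shuffleId: Int, dependency: ShuffleDependency[K, V, C]): ShuffleHandle =
    delegate.registerShuffle(shuffleId, dependency)

  override def getWriter[K, V](
      handle: ShuffleHandle, mapId: Long, context: TaskContext,
      metrics: ShuffleWriteMetricsReporter): ShuffleWriter[K, V] =
    delegate.getWriter(handle, mapId, context, metrics) // wrap: copy files to cloud

  override def getReader[K, C](
      handle: ShuffleHandle, startMapIndex: Int, endMapIndex: Int,
      startPartition: Int, endPartition: Int, context: TaskContext,
      metrics: ShuffleReadMetricsReporter): ShuffleReader[K, C] =
    delegate.getReader(handle, startMapIndex, endMapIndex,
      startPartition, endPartition, context, metrics) // wrap: fall back to cloud

  override def unregisterShuffle(shuffleId: Int): Boolean =
    delegate.unregisterShuffle(shuffleId)

  override def shuffleBlockResolver: ShuffleBlockResolver =
    delegate.shuffleBlockResolver

  override def stop(): Unit = delegate.stop()
}
```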
Okay, but in your case, I'm curious: is the fallback logic built into your shuffle manager, so you don't need to touch anything in Spark core to make that work? Because your shuffle manager is probably borrowing some code from the standard one, plus adding your logic. Is that the case? Right now, we embed that code in the Spark core part, because we want to iterate fast. We are working on extracting it and putting it behind the shuffle manager abstraction, so it won't change the internal code. Got it. So right now you have your own Spark distro. Yes. Okay, thank you. Cool, no problem.

Okay, thank you, everyone. Bo and I will be around; if you have questions or feedback, you're welcome to reach out to us. Thank you. Okay, thank you. Thank you.