Hi there all, welcome and thank you so much for joining us. We're gonna be talking about scaling Apache Spark on Kubernetes. We're Amanda and Holden, and we're so delighted to be here at KubeCon EU. So first and foremost, we're gonna tell you a little bit about ourselves. Then we're gonna talk about what Apache Spark is and why it matters to Kubernetes; what Yarn and Mesos are; Spark standalone mode, so get ready for that one; why you should put Spark on Kubernetes; what worked well as we went through this process, and what didn't work so well; best practices for doing this kind of transformation; and new features that have been implemented and upstreamed, plus areas for improvement.

So who are we? The two of us you'll see in this collection of pictures work for Apple, and that's me and Holden. You'll also see our two pups, Timbit and Jack. They support us throughout the day as we get our work done. Both of us have worked for Apple for over a year; I think Holden is getting close to two years. And both of us have been a part of the Apache Spark community, Holden much more so: she's an Apache Spark committer, she's written books on Apache Spark, and she's been with it pretty much from the get-go. I came to Spark a little bit later, but I've been involved with it for quite a few years. I've done quite a few talks on Spark, and it's a technology that I enjoy, that I like teaching other people about, and that I use myself.

All right, so let's talk about Spark. What is Spark? "Lightning-fast unified analytics engine" is one of the taglines you'll hear quite frequently. Apache Spark is really used to do large-scale data processing on large data sets. It's used by data scientists, data engineers, machine learning engineers, basically anyone who is working with large data sets, regardless of title. Apache Spark allows for batch jobs, streaming jobs, and the ability to use Spark SQL, R, Python, Scala, or Java. You can do machine learning using Spark, and you can do graph analytics with Spark as well.

Some other taglines you may have heard about Spark: it's Hadoop MapReduce, but with seven cups of coffee. Of course we have a little disclaimer there, check with your hardware vendor first. There have been so many numbers thrown around over the years, but essentially Spark jobs, because of their ability to utilize memory for that large-scale processing, run anywhere between 10 and 100 times faster than Hadoop. Yes, that's a big range, but the point is: it's faster. Spark is also billed as a good way for folks to learn functional programming, which is true, since Scala is a functional language. Me, myself, I know this much about Scala, and I prefer to use Python; that's my language of choice when using Spark, so I did not take this opportunity to learn functional programming. And it's a great way to use a lot of compute resources, right? Because you're doing extremely large jobs across super large clusters, using a ton of memory and CPU.

So let's talk about how Spark works. The TL;DR is that Spark spreads compute out across a cluster, most specifically a Spark cluster. There are two main components when you launch a Spark job: the driver, which contains the Spark context, and the executors, which reside on each node. The driver works to transform the user's code, code that I've written maybe in Python, and splits it up into tasks, bite-sized chunks that can be sent to the executors to be performed.
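To make that driver/executor split concrete, here's a minimal PySpark sketch (Python being my language of choice). This isn't code from our clusters, and the input path is a placeholder; the point is just that the code below lives in the driver, while the lambdas get shipped out to the executors as tasks, one per partition.

    from pyspark.sql import SparkSession

    # The SparkSession (and its SparkContext) live in the driver process.
    spark = SparkSession.builder.appName("word-count-sketch").getOrCreate()
    sc = spark.sparkContext

    # The driver splits this dataset into partitions; each partition becomes
    # a task that some executor runs.
    lines = sc.textFile("hdfs:///tmp/input.txt")  # placeholder path
    counts = (lines.flatMap(lambda line: line.split())
                   .map(lambda word: (word, 1))
                   .reduceByKey(lambda a, b: a + b))

    # take() pulls a small sample of results back to the driver.
    print(counts.take(10))
    spark.stop()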
So from there, it's very easy to scale up and scale down depending on the workload, because you can always add more nodes and more executors. Especially when you're working in a cloud native environment, it's really easy to add those extra resources. Apache Spark has abstracted away any need for users to deal with orchestrating data-processing parallelism or worrying about fault tolerance. This is all taken care of for them. Because of this architecture, and knowing that Spark would be spread across multiple nodes on very large clusters, it was known that nodes would fail throughout the process. So Spark is able to handle those failures, because it knows that they will happen, and it can just recompute the results. The executors are long-lived, especially when compared to MapReduce, and they store data in a mixture of memory and disk, utilizing both to run faster.

With all this said, now we know what Spark is, right? Spark can be run either in a standalone mode, with just Spark and a JVM on each machine, or you can use a cluster resource management system. And that's what we're gonna talk about next.

So what is Yarn? Yarn stands for Yet Another Resource Negotiator. Yarn was released in 2012 and was a rewrite of the MapReduce engine from Hadoop 1.0. MR2 is the informal nickname (it really means MapReduce 2.0), and it's an application that is actually managed by Yarn, so it all gets a little bit confusing. But Yarn is the de facto standard for big data workloads. Yarn supports a variety of processing engines and applications: you can run Hadoop and Spark on the same cluster, and you get isolation and dynamic allocation of resources. Yarn actually keeps track of the available resources (memory, CPU, storage) and includes multiple types of scheduling methods. It supports lightweight isolation, and it allows sharing local disk. With all this talk about resource allocation and managing resources, what does this remind you of, KubeCon? Kind of reminds you of Kubernetes, doesn't it?

So what is Mesos? Mesos is a distributed systems kernel. Mesos is very similar to Kubernetes, but it's a bit more flexible: not only can you manage containers, you can also manage applications that are not containerized. So it tries to be more than just analytics, unlike, say, Yarn, which is mostly about managing Hadoop and Spark. There's a private company now that is dedicated to working on it, and they are a bit more Kubernetes-focused than in the past. And it also allows for local shared disk as well.

So let's go into standalone mode. Are you tired of doing your actual job, which is writing Java or Python and doing data analytics? Do you want to spend more time writing shell scripts and managing your servers like they're your kids or your pets? Then this is the mode for you. Standalone Spark is essentially a way to create a Spark cluster and manage it yourself. There's no support for dynamic resource allocation or resource management and scheduling. This is the more painful option for sure, but it is possible; maybe you have a use case where this works better for you, and I would love to hear it, actually. It does not support dynamic scaling. That's what you're for, along with maybe some awesome scripts that you wrote, and PagerDuty as well.
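As an aside, the choice between all of these modes shows up in exactly one place in your application: the master URL. Here's a hedged PySpark sketch of the options; the URL scheme for each cluster manager is standard Spark, but the host names and ports are placeholders, not anything from our environment.

    from pyspark.sql import SparkSession

    builder = SparkSession.builder.appName("pick-a-cluster-manager")

    # Standalone: point at the master process you run (and babysit) yourself.
    # builder = builder.master("spark://spark-master.example.com:7077")

    # Yarn: Spark locates the ResourceManager via the Hadoop config on the host.
    # builder = builder.master("yarn")

    # Mesos:
    # builder = builder.master("mesos://mesos-master.example.com:5050")

    # Kubernetes: point at the API server; executors are launched as pods.
    builder = builder.master("k8s://https://kube-apiserver.example.com:6443")

    spark = builder.getOrCreate()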
So let's get into why we should put Spark on Kubernetes. I think we've made it pretty clear that you need a dynamic cluster resource management system to help you manage your Spark workloads. But what was wrong with Yarn and Mesos, and why move to Kubernetes, right? Well, there's a list of reasons. First, this is KubeCon, right? So we want to focus on Kubernetes. But all jokes aside: one ring to rule them all, and we'll talk about that more in a second. You can use your spare capacity for analytics, in the cloud and elsewhere. We're gonna talk about Python and what using Kubernetes allows you to do with Python. We'll talk about security and what using Kubernetes gives you there. And of course, it's all about learning new skills, right?

Just from our quick review of Yarn and Mesos, it's easy to see the benefits of a container management platform. When putting Spark on Kubernetes, you can add Spark alongside any other types of workloads; it doesn't have to be just Spark or Hadoop. You can better utilize your cluster's resources, because you can have all types of workloads from across your organization on one Kubernetes cluster. And just to point this out: where many of us work, we may have a little bit of everything, every type of technology there is, spread out across our companies. So you could have three cluster management systems, right? Mesos, Yarn, standalone, Kubernetes... that's actually four, right? Managing all of those leads to a sad team and sad engineers, because that's a lot of overhead for them, having to switch between platforms, troubleshoot, and so on. So of course, the solution would just be to turn one or two or three of them off, right? That'll just make all the workloads better, right? Well, yes and no. You definitely want to try to converge on one, and Kubernetes seems like the direction so many of us are taking. It gives you all the benefits of Kubernetes, which we'll talk about here in a second, and it makes your life a little bit easier to standardize on one platform.

Like I said before, you can run different types of workloads within that one Kubernetes cluster, so it allows you to use that spare capacity. What's nice about Kubernetes, again, is that multiple types of applications and services can all be running on the same cluster. You don't need a dedicated Spark cluster managed by Yarn or Mesos: your Spark jobs can run on the same cluster as the very services they're analyzing, or do any other kind of data crunching.

Additionally, let's talk about preemption. Kubernetes has the ability to preempt workloads based on priority classes, which is really powerful. Workloads that are less time-sensitive than other jobs can be rescheduled for when resource demand is lower, and this can all be done fairly automatically. The power of Kubernetes also allows resources to be redistributed once a Spark job is complete and the pods have been released, so those resources can be given to other Spark jobs or to other services.
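To sketch what the priority-class piece can look like in practice: since Spark 3.0 you can hand Spark a pod template file, and a priorityClassName in that template is just ordinary Kubernetes pod spec. This is a hedged sketch, not our production setup; the "batch-low" PriorityClass, the image, and the API server URL are all placeholders that a cluster admin would define.

    from pyspark.sql import SparkSession

    # A pod template marking executors as low-priority, preemptible work.
    template = """\
    apiVersion: v1
    kind: Pod
    spec:
      priorityClassName: batch-low
    """
    with open("/tmp/executor-template.yaml", "w") as f:
        f.write(template)

    spark = (SparkSession.builder
             .appName("preemptible-analytics")
             .master("k8s://https://kube-apiserver.example.com:6443")  # placeholder
             .config("spark.kubernetes.container.image",
                     "registry.example.com/spark-py:3.1.1")  # placeholder image
             .config("spark.kubernetes.executor.podTemplateFile",
                     "/tmp/executor-template.yaml")
             .getOrCreate())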
So next, let's talk about containers, Python, and security. The ability to easily move workloads from dev to prod, or to add new libraries, comes from things like the isolation that Kube has. Data engineers and data scientists are always finding new Python packages to add and new workloads to run, and they want to be able to utilize the latest versions of these libraries and new versions of Spark. With a single Yarn, Mesos, or standalone Spark cluster, you're absolutely tied to the one version of Spark or Python that's on that cluster. With Kubernetes and containerization of Spark workloads, one data scientist can be using the latest version of Spark, 3.1.1, with Python 3, while your data engineer can continue to use Spark 2.4 (why not?) with Python 2.7, all on the same cluster, because it's all containerized and isolated. Anyone who has had to deal with Python dependencies understands the importance of being able to abstract things away using containers, so that their jobs can use exactly what they want to use. Also, the ability to add increased isolation when dealing with data is very important, and Kubernetes allows for this where Yarn and Mesos just didn't to the same degree. And just to add: when moving from Yarn or Mesos, Kubernetes has a lot of custom configuration options that make it a bit more flexible than Yarn, per se. And last but not least, learning a new skill, right? For Holden and me personally, adopting Kubernetes has allowed us to pick up one more tool for our tool belt, and of course now we're so much more popular, right? Because now, not only do we know Apache Spark, we also know all about Kubernetes.

So with that said, I will pass this off to Holden to tell us more about what worked well and what didn't work so well in making the transformation from Spark on Yarn and Spark on Mesos to Spark on Kubernetes.

Thank you. Awesome, and thanks for that introduction. So now I'm gonna talk about the second half of the presentation, namely what worked well and where we had room for growth. Small to medium-size ETL jobs worked really well, and migrating them to Kube was relatively easy. The only thing that we really had to do was increase the resource requests to match the reality of what they were actually using, because Yarn and Mesos weren't enforcing the resource limits quite as strictly as Kube ended up enforcing them. There were some challenges around integration with the different data sources, and that mostly comes back to some networking configuration choices that were made. I think we could make some improvements there, but it's mostly something to keep in mind: make sure that you can easily access your data sources. It was a relatively easy fix.

Large and long ETL jobs were a bit more challenging. Long-enough-running jobs at low priorities tended to run into overcommit issues. They'd still succeed, but often they'd take a lot longer than they would on Mesos or Yarn. The primary room for growth here is more efficient resource utilization and more effective handling of overcommit issues.

Multi-language jobs are where things started to get a little challenging. One of the really great things is that dependency management improved substantially compared to Yarn. In the Yarn world, all of the dependencies had to be managed by a systems administrator, whereas running on Kube it can be very much more self-serve.
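To make the self-serve, per-job isolation point concrete, here's a hedged sketch of how each job pins its own container image on Kube. spark.kubernetes.container.image is the real Spark setting, but the registry and image tags here are placeholders.

    from pyspark.sql import SparkSession

    # The data scientist's job: a current Spark 3.1.1 / Python 3 image.
    spark = (SparkSession.builder
             .appName("new-stack-job")
             .master("k8s://https://kube-apiserver.example.com:6443")  # placeholder
             .config("spark.kubernetes.container.image",
                     "registry.example.com/spark-py:3.1.1-py3")  # placeholder
             .getOrCreate())

    # Meanwhile, a legacy job on the very same cluster could be submitted with
    #   spark.kubernetes.container.image=registry.example.com/spark-py:2.4-py2
    # and the two never interfere, because each runs in its own containers.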
The initial migrations often ran into resource difficulty, and unplanned exits caused large amounts of recomputation. Some amount of recomputation is expected, but this was more than the amount of recomputation we were seeing running these jobs in other cluster environments. Most of our opportunities for improvement here are around memory allocation, and specifically how we share native and JVM memory.

Complex ML jobs did not work well when we tried to migrate them, primarily due to much more expensive recovery costs. In all of these situations, Spark handles executor failure by recomputing the lost data. With the complex ML jobs, the cost of recomputing that data is really, really quite expensive. The opportunities for improvement are around checkpointing: one of the things we can do to deal with this more expensive recomputation is, when we get to those really expensive points, checkpoint to persistent storage. But we had difficulties connecting to persistent storage, specifically the kinds with TTLs that can do automatic cleanup. So this is one of the rooms for growth that we have for complex ML jobs.

Streaming jobs had a lot of room for growth and many areas of opportunity for investment; they just frankly don't work very well right now. A lot of that is around the connection to the different data sources. It turns out that our connection to the streaming data sources has even more room for growth than our connections to the batch data sources. Also, checkpointing is especially important in streaming, so the same problem that we had with ML is even more present with the streaming jobs.

So, okay, that's about our experiences moving to Kube. What are the best practices that we learned from this? One of the things that we learned is not to just cache everything. Caching was never free, right? But now it costs even more, because when we decommission an executor, we now have to migrate the cached data. And for caching stuff to disk, disk is now actually a metered resource, so be careful. Think about it: are you actually gonna use the data twice? If you're done with it, tell Spark you're done with it (there's a small sketch of this below). Especially for users in notebook environments, the garbage collector isn't able to handle this as well, so you need to explicitly tell Spark that you're done with the data and it can get rid of it. Timbit, of course, is not done with the bone, and he would prefer that we never get rid of the bone.

Using disaggregated storage is super important. The big thing is there's no stay-resident block manager anymore. And yeah, okay, data locality does matter, but it's not enough of a reason to try and co-locate HDFS; it's not worth it. Another really interesting thing is that using cloud storage from on-prem doesn't have as much overhead as one might think, if you structure your network correctly. Initially, in a lot of situations, we assumed that we had to use HDFS, but after we did some benchmarks, it turned out to be perfectly reasonable to use cloud storage from on-prem. So definitely don't just assume that you need to co-locate HDFS; take the time, run the benchmarks, and see if it's actually gonna be worth it for you.

One of the other things is that you're gonna need to increase your resource requests, and we talked about this a little bit with the job migrations. Essentially, Yarn containers are very fuzzy definitions of containers; we have much stricter resource requirements in Kube. So you're gonna need to allocate more memory. And ephemeral disk starts mattering; that wasn't tracked at all previously.
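Here's the small sketch promised above: explicitly releasing cached data, plus checkpointing for the expensive-recompute cases. This is a hedged illustration, not our production code, and the storage paths and column names are placeholders.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    df = spark.read.parquet("s3a://some-bucket/events/")  # placeholder path

    # Cache only what you'll actually reuse, then release it explicitly.
    features = df.selectExpr("user_id", "value * 2 AS scaled").cache()
    features.count()      # first action populates the cache
    # ... several jobs reuse `features` here ...
    features.unpersist()  # done: don't make executors carry these blocks around

    # For expensive-to-recompute stages, checkpointing to persistent storage
    # truncates the lineage, so losing an executor doesn't replay everything.
    spark.sparkContext.setCheckpointDir("s3a://some-bucket/checkpoints/")  # placeholder
    aggregated = df.groupBy("user_id").count().checkpoint(eager=True)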
Another one is: don't set quota just for the sake of quota. Default Spark uses config maps, and we ended up doing a large rewrite of some code because of a config map quota. It turns out that config maps weren't as expensive as we had assumed. So this is very important: when you are starting to set quota, definitely take the time to see if it's actually a constrained resource, or if you don't actually need a quota there and it's perfectly fine to let things run wild.

So in addition to the best practices for your jobs, let's talk about the things we changed. We added a new mechanism for dynamic scaling in Spark. I'm really excited about this; it's actually based on a design that I came up with five years ago. It just didn't make sense back then, but we'll talk more about it in a little bit. We removed the config map requirement, as we talked about; we essentially added an alternative because of our quota system. We added persistent storage as a fallback on out-of-disk events. We were really hoping that would trigger more frequently, but it turns out that, because of how ephemeral disk quota works, we don't actually get the out-of-disk events in the same way. So we also added some additional hooks inside of there: integrations with PVCs, some templating. We also gave users the ability to explicitly remove unneeded shuffle files. And there were a bunch of corner cases in Spark's understanding of pod state that historically hadn't mattered, but once we started to add dynamic scaling, did matter.

So we added graceful decommissioning, and we made this to support dynamic allocation for Spark on Kube. This is really important, because historically Spark on Kube has not had good dynamic allocation. Only recently was a restricted dynamic allocation added, where if there was no data on an executor, we could get rid of it. By adding graceful decommissioning, if there is data on an executor, we can still get rid of it; we just migrate the data away first. There are some alternatives proposed by other people in the community, adding a truly external shuffle service. There are a few different ones, and we're not sure which one is gonna land, but it'll be interesting to see. And I think even once one of those lands, we'll probably still keep graceful decommissioning, and we'll still leave it turned on, because a truly external shuffle service only handles shuffle files, and Spark also has this concept of cache blocks; I think it makes sense to migrate cache blocks too.

While we're talking about this, one of the things we found a little counterintuitive in the configuration was this: we went from an initial configuration of an executor idle time and a cache idle time of 120 seconds, then increased those idle times, and we actually got better scale-up and scale-down with the higher idle times. That's because Spark doesn't do a great, or really perfect, job of keeping track of whether an executor is likely to have a task scheduled on it. So we would get into situations where executors would come up and go away, essentially flapping very quickly, when we set the tighter timeouts that we thought would give a better scale-up and scale-down experience. By relaxing these timeouts, we actually got a much more reasonable scale-up and scale-down experience.
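To tie this together, here's a hedged sketch of the Spark 3.1+ settings involved in dynamic allocation with graceful decommissioning, including the external fallback storage we'll get to next. The configuration keys are real Spark settings, but the timeout values and the storage path are illustrative placeholders, not recommendations; as noted below, we don't yet know the right general-purpose values.

    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .config("spark.dynamicAllocation.enabled", "true")
             # Track shuffle files in Spark itself, since there's no external
             # shuffle service on Kube.
             .config("spark.dynamicAllocation.shuffleTracking.enabled", "true")
             # Relaxed idle timeouts: counterintuitively, we saw better
             # scale-up and scale-down with values higher than our initial 120s.
             .config("spark.dynamicAllocation.executorIdleTimeout", "600s")
             .config("spark.dynamicAllocation.cachedExecutorIdleTimeout", "1200s")
             # Graceful decommissioning: migrate blocks off executors
             # before removing them.
             .config("spark.decommission.enabled", "true")
             .config("spark.storage.decommission.enabled", "true")
             .config("spark.storage.decommission.rddBlocks.enabled", "true")
             .config("spark.storage.decommission.shuffleBlocks.enabled", "true")
             # Optional external fallback storage for blocks on scale-down.
             .config("spark.storage.decommission.fallbackStorage.path",
                     "s3a://some-bucket/spark-fallback/")  # placeholder
             .getOrCreate())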
We added external shuffle storage. This allows scaling down beyond what executor-to-executor migration supports. There are alternative proposals for doing essentially more on top of this, but this is really important: with executor-to-executor migration, we can only scale down to the point where we still have enough ephemeral disk available for the data. If you have a lot of data, but, let's say, your data scientist goes home at the end of the night, it probably makes sense to go ahead and put that data in some kind of external storage while the data scientist is taking a break from their job, hanging out with their family.

So what were some areas for improvement? Specifically in graceful decommissioning and dynamic allocation, I would say our biggest area for improvement is documentation. It is possible to turn on, but right now, as we saw from that configuration example, it involves a lot of fine-tuning and playing with things. We haven't documented it because we don't know what the right settings are in general; we know what the right settings are for our cluster, but we haven't had enough other people play with it, so we don't have good recommendations yet. If you do want to play with graceful decommissioning and dynamic allocation for Spark on Kube, I would really, really appreciate your feedback on what's working and what's not, and if you can contribute that to the documentation, that would be amazing.

Another thing is that not all data is equal, right? Spark has some internal heuristics around what kind of data is more likely to be used or not. They're not perfect. We could start by applying those heuristics to block migrations, or we could try to come up with better heuristics for what kind of data is worth migrating. Another one is avoiding cascading failures. We have some work on this; essentially, it can come to the point where quota can trigger cascading failures, as we do migrations and force executors over their quota limits. We have a somewhat hacky solution; longer term, though, I think this is an area for better investigation. Lazy write-back support, I think, is also really interesting. That's the idea that, yes, we still want to try and store data locally on the executor, but we can start writing it back to persistent storage as soon as it lands on the executor. I'm not sure if this is gonna give us good performance or not, but I think it's an area that really has a lot of potential, and I'd like to see some more folks investigating it.

More generally, Spark on Kube has a lot of areas for improvement. There is some documentation for Spark on Kube, but I think this is an area where we can once again improve a lot. Another one is queue mechanisms: a better understanding of the different kinds of jobs Spark is scheduling, and integration with the Kube scheduler, so that we can actually get faster spin-up. Another one is better communication of failure reasons. This is something that I've been working on a little bit, because there are all kinds of reasons why things can fail, and it can be difficult for a user to get at that information; it's not populated into the Spark web UI. So I think finding ways to better communicate to users what's going on is really important.
Dynamic preemption priorities: this is really complicated in the Spark world, because we definitely have this concept of pods that are maybe more important, but the problem is that which pods are more important is going to change a lot while our jobs are running, even as we change resource profiles, or even within the same resource profile as data ages in and out. We don't have a really good way to communicate that to the Kube scheduler right now. Another one, and I don't know if this is a good idea, but I think it's worth exploring because it's a sort of hack that we've used with Spark on different systems, is using a shared local volume when we have multiple executors scheduled on the same node. This allows us to pass data back and forth without necessarily having to go through the JVM on both sides. Another one is handling jobs with changing priorities or deadlines. We might have a job which is very low priority, but which really does need to complete by the end of the month, and we don't have a good way to express that right now. In fact, a user would probably have to cancel and reschedule their job at a higher priority if it was getting close to its deadline. I think having a good way to express this in Kube, or in, say, Volcano or something, could be very useful.

So, in conclusion: Spark on Kube migrations are not something that you can just set and forget. And yes, we can preempt Spark more than we can preempt API servers, and it has less impact on end users, but there's still a lot of work we can do to make the Spark-on-Kube preemption experience better. The increased isolation does come with some overhead, but I think it's well worth it for the benefits that we get: the security improvements and the additional flexibility around Python and native libraries. And also, we really like cake and docs, you know. So thanks for coming, we really appreciate it. I hope you're having a wonderful KubeCon, and please stay safe. Thank you.