OK, welcome, everyone, and thank you for coming. My name is Alon Philanenko, and I'm an engineering team lead at Bloomberg. I'm joined here by my colleague, Aki Tsukagawa, who's a senior software engineer with me on the Enterprise Data Science Infrastructure team. Today we'll be talking about managing multi-cloud Apache Spark on Kubernetes. Starting from the top: managing data science infrastructure in a multi-cloud environment is really hard. The variation in behavior between different hardware and cloud providers is vast and rather daunting. Kubernetes, with its cloud-native design, provides an extraordinarily powerful abstraction across all these hardware stacks and enables the building of highly composable and transportable infrastructure substrates. This talk will focus primarily on that feature of Kubernetes as we address complications that arose when we ran a data science platform both on-prem and in various cloud environments like AWS EKS and Azure AKS. With Kubernetes as the foundational abstraction, we found that when designing various compute runtimes like Apache Spark, each of these runtime components requires the same standard managed services: identity management, network policies, federated job APIs, and more. In this talk in particular, we will be focusing on the journey we took to build out the Spark compute runtime within our data science platform. So what is Apache Spark? Quick show of hands: who is familiar with Spark, and who's not? Good. OK, so there's a couple. We'll do a little bit of a run-down. The quick background for those who are unfamiliar is that Spark is an ETL application that enables us to do data science at scale. Traditionally, ETL for files smaller than a gigabyte uses pandas, but when you start scaling up to 100 gigabytes, you either need to chunk pandas data frames or use Dask, and Spark can also be a pretty good candidate. Where it really shines is when you start looking at petabytes or more. It has a variety of features that make it a powerful general execution engine, especially when communicating with cloud object stores. For the purpose of this talk, we'll be focusing on its distributed architecture and its relation to Kubernetes. Spark runs as independent processes on a cluster, coordinated by the SparkContext object in your driver program. The SparkContext can connect to several types of cluster managers, like Mesos, YARN, or Kubernetes, for the purpose of allocating resources across applications. Functionally, Spark's cluster manager just has the purpose of providing IP addresses, which Spark then uses to communicate. This allows the cluster manager layer to be rather flexible and composable across a variety of different scheduler backends, like YARN or Kubernetes. Now, this has benefits for pluggability via a scheduler interface, but it does mean that Spark isn't good at looking up, or trying to bubble up, too much Kubernetes-specific information, as it won't necessarily fit into the generic interface. We call this a lack of information flow, and it has a lot of drawbacks that we'll talk about throughout this talk. Upon receiving the provisioned executors, Spark sends the application code to the executors, after which the SparkContext sends tasks to the executors to run.
The SparkContext could be initialized, for example, from a Jupyter notebook, as you can see in the diagram, which means that the notebook pod would need a service account for directly creating executor pods. So if we were to put this in Kubernetes terms, we could think of the driver here as a controller of sorts: it's now tasked with provisioning and managing state. However, this is all done from within Spark. Are folks familiar with controllers, or Kubernetes controllers? Sweet. Now, what does it mean for a driver to have controller-like privileges? It means that the user's pod will have elevated service account privileges, which in some environments, like ours at Bloomberg, is rejected due to security concerns. For example, some environments don't allow users to create pods. Another potential problem is that the full information flow from Kubernetes is expected to reach Spark, but because of the agnostic nature of the scheduler, this might not be entirely possible, as we discussed a little bit earlier. So to solve this, Bloomberg, as I also talked about in my previous KubeCon talk, looked to enable a pluggable executor creation mechanism that pushed the pod creation out to an admin controller. This allowed the service account requirement to be pushed out of the user's pod and into the communication with the admin controller. Now, communication with controllers is traditionally done via custom resources. So this means that the Spark driver now modifies a custom resource instead of creating pods directly, which moves the pod creation to an entity that doesn't share service accounts with the users. That solves the first issue, and it could potentially fix the information flow issues that might come up, but we'll get into that later. To help understand this, let's work with a visual. Your driver pod, which in this example is running within the user's namespace from a Jupyter notebook pod, communicates with the API server to create and modify a Spark app custom resource, with the intention of having the controller in the admin namespace create those executors securely in the user's namespace. The admin controller watches for Spark app custom resources, where the spec field of these custom resources holds some desired state; in our case, we're specifying the expected executor pod templates. The controller then acts on the contents of that spec field by creating the respective executor pods via its reconciliation loop. The driver then waits for those executor pods asynchronously, which is functionally the same as it would be otherwise. So the rest of the Spark paradigm remains the same, even with this pluggable design. Now, we've been running our own fork of Spark in a managed data science platform across bare metal and various clouds for some time, so let's walk through a couple of case studies that cover the success stories and the complications that arose. Let's start with all of our success stories. We've got a comment that it works. OK, now for what matters: the complications that arose. Without spending too much time on each of these issues, as you can review them and their solutions offline (I think we posted the slides), what I want to highlight here are the two categories of issues: one relating to the pending status of the executors, and the other relating to failed states.
Now, something to note is that these issues, in a bare-metal, non-autoscaling environment with a constrained resource quota, can differ from the pending issues that arise when running in the cloud with autoscaling and preemptible spot instances. So catering to both is quite challenging. Regarding failure scenarios: node scale-down, preemption, and OOMs are quite ubiquitous in Spark applications, especially in a large-scale Kubernetes cluster, and especially when you're running in the cloud. So catering to all these variations of failures is another challenge. Now, Kubernetes with its cloud-native resources provides all the necessary information to cater to these problems across a multi-cloud environment, as you can see from the solutions to these complications. So we will generalize these problems and their respective solutions into three categories, and then show that by solving the information flow from Kubernetes to Spark, we can solve most of the common user complications that have arisen on our platform. The three generalized limitations that result from missing information flow are: autoscaling information; the capture of scale-down, preemption, OOM, and failure events; and post-job introspection, where a user might want to triage their Spark job after the job completes or fails. This complements the existing monitoring or logging dashboards that might be available in their platform. So I'll now pass it over to Aki so that he can walk us through our approach and our solutions to addressing these limitations in Apache Spark. So our solution to the lack of information flow is to collect the information ourselves somehow and store high-level statistics into the status field of the custom resources we described earlier. As you can see in this slide, the missing autoscaling information will now be stored in the cluster autoscaling field on the right-hand side. Scale-down and OOM (out of memory) failure information is stored in a terminated status field. And for post-job introspection, we can keep the job object around for a while and query this object. And this is how it will look. So we build controllers to collect the necessary information somehow, aggregate it into high-level statistics, and make it available in the Spark app custom resource for each Spark app or context. We can, for example, build a UI on top of the information we gather from the custom resource, or bubble the information up to the Jupyter notebook that is running the driver program. We could instead directly collect the information and provide an endpoint to get the statistics, all in one service. There's nothing wrong with that approach, but we prefer having a separate component just for collecting the information inside the Kubernetes world. By having a controller collect information from the underlying Kubernetes layer, we get a nice separation of concerns, and we can also follow the common controller pattern for handling Kubernetes objects. So what does that mean? As we saw earlier, we used the Spark app custom resource for creating executors. In the controller pattern, the spec field represents the desired state of the world, and the status field represents the observed state of the world. So the controller tries to realize the desired state and reports back the actual realized state. We are following the controller pattern here when we store the statistics about the executors back into the status field.
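To make that spec/status split concrete, here is a minimal sketch, in Go, of what a Spark-app-style custom resource along these lines could look like. The type and field names here are illustrative, not the actual CRD from the platform described in the talk.

```go
// Illustrative Go types for a SparkApp-style custom resource. The driver fills
// in the spec; the admin controller creates executor pods from it and writes
// aggregated, per-executor statistics back into the status field.
package v1alpha1

import (
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// SparkAppSpec is the desired state written by the Spark driver.
type SparkAppSpec struct {
	// DesiredExecutors is how many executor pods the driver is asking for.
	DesiredExecutors int32 `json:"desiredExecutors"`
	// ExecutorPodTemplate is the template the controller uses to create
	// executor pods in the user's namespace.
	ExecutorPodTemplate corev1.PodTemplateSpec `json:"executorPodTemplate"`
}

// ExecutorSummary is one aggregated entry per executor pod.
type ExecutorSummary struct {
	PodName string `json:"podName"`
	// ClusterAutoscaling records the latest thing the cluster autoscaler said
	// about this pod, e.g. a scale-up was triggered or was impossible.
	ClusterAutoscaling string `json:"clusterAutoscaling,omitempty"`
	// Terminated records why the pod ended, e.g. OOMKilled or node scale-down.
	Terminated string      `json:"terminated,omitempty"`
	LastUpdate metav1.Time `json:"lastUpdate,omitempty"`
}

// SparkAppStatus is the observed state the controller reports back.
type SparkAppStatus struct {
	Running   int32             `json:"running"`
	Pending   int32             `json:"pending"`
	Failed    int32             `json:"failed"`
	Executors []ExecutorSummary `json:"executors,omitempty"`
}

// SparkApp ties spec and status together like any other Kubernetes object.
type SparkApp struct {
	metav1.TypeMeta   `json:",inline"`
	metav1.ObjectMeta `json:"metadata,omitempty"`
	Spec              SparkAppSpec   `json:"spec,omitempty"`
	Status            SparkAppStatus `json:"status,omitempty"`
}
```

The important part is only the shape: the driver writes the spec, the admin controller owns the pods, and the status field becomes the single place a UI or notebook has to look.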
Another consideration about the controller pattern itself is that its reconciliation loop needs to be idempotent, meaning that it needs to be stateless. It cannot rely on previous results; it works from scratch every time the reconciliation loop is invoked. Another way to think about this approach is that we are effectively using the custom resource as a data store for the high-level statistics we want; we are effectively storing information in etcd via the Kubernetes API server. So here are the characteristics of this approach as a data store. It introduces no extra dependency. It provides an out-of-the-box pub-sub mechanism via watch. We can do basic querying and aggregation via selectors and controllers. The latency is not ideal, but it's not too high either; we can expect latency on the order of a second. And not to mention, its availability is really excellent. Since we are providing high-level statistics there, we want to push updates via long polling or WebSockets, so we want a pub-sub mechanism in the data store. The users are not clicking on something and actively waiting for the response, so the latency requirement for our UI is not that tight. And we only need a basic query mechanism in our use case. Since we run on bare metal and on multiple cloud services, we really want to minimize extra dependencies, because the maintenance cost and the complexity in each environment add up. There were two alternatives we considered as a data store. One is external log and metrics services, such as Humio or Splunk, which we already use for debugging purposes. They are excellent for retrospective troubleshooting via ad hoc queries, but building a real-time UI on top of them is not really trivial, and latency is also a concern, as they are typically backed by S3 storage to handle a large amount of data. The other option we thought about was using an additional database. This is certainly doable, but it introduces extra infrastructure, and we'd also need to choose a solution carefully and design on top of it with care. So as we saw, we are actually choosing etcd as our database, for the advantages listed before. Now that we have proposed using a custom resource to store the executors' high-level statistics, let's walk through each of the limitations in information flow we identified earlier. We will start with autoscaling. So what is cluster scale-up? Cluster scale-up happens when the existing Kubernetes nodes cannot fit newly created pods. If there are not enough nodes already, the cluster autoscaler tries to create more nodes. When that happens, there are actually two problems from the perspective of the Spark user experience. Firstly, it takes much longer than usual pod initialization. And secondly, it's not always possible, for example if the user isn't allocated enough quota for AWS or Azure resources. So without knowing this underlying cluster scale-up status, Spark users will be left wondering why the requested executors don't come up, whether they are ever going to come up, or whether they need to ask cluster administrators for help, and things like that. So now that we know how cluster scale-up works and why surfacing it is useful for our users, we're going to actually collect the information to detect this status. If we are running the standard cluster autoscaler implementation provided by Kubernetes, that implementation will inform us about cluster scale-up via event objects.
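As a brief aside on the idempotence point above: here is a minimal sketch, using controller-runtime and the hypothetical SparkApp type from earlier, of a stateless reconcile that recomputes the status from scratch on every invocation rather than remembering anything between runs. The import path and the pod label are illustrative assumptions, not the talk's actual code.

```go
// A stateless, idempotent reconcile: every invocation re-reads the SparkApp,
// re-lists its executor pods, and rewrites the aggregated status in full.
package controller

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/client"

	sparkv1 "example.com/spark/api/v1alpha1" // hypothetical module path
)

type SparkAppReconciler struct {
	client.Client
}

func (r *SparkAppReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
	// Re-read the desired state on every invocation; nothing is cached here.
	var app sparkv1.SparkApp
	if err := r.Get(ctx, req.NamespacedName, &app); err != nil {
		return ctrl.Result{}, client.IgnoreNotFound(err)
	}

	// Re-list the executor pods belonging to this application. The label used
	// to associate pods with the app is an illustrative choice.
	var pods corev1.PodList
	if err := r.List(ctx, &pods,
		client.InNamespace(req.Namespace),
		client.MatchingLabels{"spark-app-id": app.Name}); err != nil {
		return ctrl.Result{}, err
	}

	// Recompute the aggregate counters from scratch and write them back to the
	// status field, which is the "data store" in etcd described above.
	var running, pending, failed int32
	for _, p := range pods.Items {
		switch p.Status.Phase {
		case corev1.PodRunning:
			running++
		case corev1.PodPending:
			pending++
		case corev1.PodFailed:
			failed++
		}
	}
	app.Status.Running, app.Status.Pending, app.Status.Failed = running, pending, failed
	if err := r.Status().Update(ctx, &app); err != nil {
		return ctrl.Result{}, err
	}
	return ctrl.Result{}, nil
}
```

Because each run re-lists the pods and rewrites the whole status, a restarted controller simply converges on whatever the cluster state is, which is exactly the property being relied on here.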
The cluster autoscaler creates an event object with a reference to the affected pod, like the one shown in the slide; the reference is in the involvedObject field. One reason we chose to run the standard cluster autoscaler, by the way, was our multi-cloud environment and the desire to keep a single implementation across all deployments. And this is another example, for when a cluster scale-up was impossible; the previous one was for when scaling up was possible and was triggered. So we can detect the cluster autoscaling state of each executor pod by listing the events from the cluster autoscaler involving that particular pod and picking the latest one. We can have one controller watching pods and cluster autoscaler events and aggregating all the information into the Spark app custom resource. A minor concern here is that the reconciliation loop is about a Spark app, spanning all the executors in that Spark application, so one reconciliation loop will go through all the pods of the app and also the events of each pod in the app. On top of this, the autoscaler events can be repeated, which can cause a spike of incoming updates and can potentially make the controller less responsive at that time. So we can separate the concerns nicely by introducing another controller. The additional controller works on cluster autoscaler events exclusively and stores the cluster autoscaling status information of the affected pod into the affected pod itself. And the main controller only watches pods. Simple. This is possible because we can dynamically store additional information on the pod itself. So how do we do this? One way is to put a custom label or annotation on the pod: we define an arbitrary custom label name and put arbitrary information there, with some constraints. Another, less known, way is to put a custom condition into the pod's status field: we can define a custom condition type with additional fields and more flexible constraints. Now that we've handled scale-up, the next problem we'll work on is cluster scale-down and out of memory. The information source we need here is actually the same as before. The cluster scale-down information is provided in exactly the same way as cluster scale-up, and the out-of-memory information can already be found in the pod itself, in the terminated state within the pod's status field. So the same pods and events are the information sources. However, the challenge here is that since these pods have failed already, they can be deleted. As we are following the controller pattern, we need all the information sources to be available at the time of reconciliation, because the controller is stateless and needs to be idempotent, as I covered before. So one way to deal with this deletion is to not delete the pods in the first place. There is a relevant option introduced in Spark 3.0 that prevents Spark drivers from deleting failed executor pods. By default, the Spark driver deletes failed pods, but by setting this deleteOnTermination option to false, it leaves the failed pods as they are. This solves the majority of the problem, and it solved the issue for out-of-memory errors. But this only prevents the Spark driver from killing pods, and there are others that can kill executor pods. For example, in the case of a cluster scale-down or an eviction, the pod will be deleted by Kubernetes core components, not the driver, so it's not prevented by the above config. It turns out that even for those cases, we can actually prevent the deletion by using Kubernetes finalizers.
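Before getting to finalizers, here is a rough sketch of the per-pod detection just described, using plain client-go. The assumption that the standard cluster autoscaler reports itself as the "cluster-autoscaler" event source, and the helper name, are mine rather than the talk's.

```go
// Find the most recent cluster-autoscaler event that references a given pod.
package podstatus

import (
	"context"
	"sort"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// LatestAutoscalerEvent lists the events involving the named pod, keeps only
// those emitted by the cluster autoscaler, and returns the latest one
// (or nil if there is none).
func LatestAutoscalerEvent(ctx context.Context, cs kubernetes.Interface, ns, podName string) (*corev1.Event, error) {
	events, err := cs.CoreV1().Events(ns).List(ctx, metav1.ListOptions{
		FieldSelector: "involvedObject.kind=Pod,involvedObject.name=" + podName,
	})
	if err != nil {
		return nil, err
	}
	var fromCA []corev1.Event
	for _, e := range events.Items {
		// Assumed source component name for the standard cluster autoscaler.
		if e.Source.Component == "cluster-autoscaler" {
			fromCA = append(fromCA, e)
		}
	}
	if len(fromCA) == 0 {
		return nil, nil
	}
	// Pick the most recently seen event.
	sort.Slice(fromCA, func(i, j int) bool {
		return fromCA[i].LastTimestamp.Time.After(fromCA[j].LastTimestamp.Time)
	})
	return &fromCA[0], nil
}
```

The result of this lookup is what the additional controller would then write back onto the pod itself, as a custom condition in the pod status or as an annotation, so the main Spark app controller only ever has to watch pods.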
Finalizers is a field in the metadata, and it's just an array of strings. As long as it's not empty, the object cannot be deleted. So we could put a custom, arbitrary finalizer there and prevent executor pods from being deleted for as long as we want. We did not choose this approach, because the scope of the implications is really hard to know. The pod is such a central component of Kubernetes, and so many types of resources are attached to it, that it is very hard to foresee every single edge case we could face down the road. So we decided that we cannot keep the pod resource itself; we need to store the information somewhere else. Given the previous considerations about the data store, another custom resource in etcd is a natural fit. So now the pod status controller watches the executor pods and the cluster autoscaler events, and stores the necessary information into a new pod status custom resource that we just created. It's worth noting in this diagram that some of the executor pods may be deleted, and so they won't exist in etcd anymore, which is why this pod status custom resource is necessary; this is indicated by the dashed circles. The Spark app controller will now watch the pod status objects instead when managing the aggregated status, so the main controller is still only watching a single kind of object. This is what the pod status custom resource actually looks like. We follow the common pattern of the spec and status fields here, where the spec is the desired state and the status is the realized state; in this case, it's a variant of that. The spec field specifies the target pod to track in the object, and the controller replicates the necessary information into the status field, meaning that the status field itself is the realized state. So we can selectively activate this functionality for a single pod by creating one pod status object like this. Optionally, we can have a separate webhook or controller on top of this that automatically creates such pod status objects for each pod that matches certain criteria. For instance, in Spark there is a requirement that all executor pods have a label, spark-role equals executor. So in the example configuration in this slide, if a pod has spark-role equals executor, a corresponding pod status object will be automatically created, and then the controller will watch it and keep updating the relevant information from the events and the pod into those pod status objects. Now it looks like we are only copying field values around from one object type to another, so we might as well make it more generic and describe it with a declarative field-copying spec. This is an alternative design of the pod status controller. For this controller, we have a configuration like the one on the left-hand side, where we specify the source and destination fields declaratively, and the types, of course. The controller will watch the source objects and keep replicating the specified fields. The resulting destination object in this case is shown on the right-hand side: the owner reference field and the container statuses field are replicated. And this is another example, for the cluster autoscaler events. On top of the previous one, we also let the controller watch events and replicate the latest field values from the cluster autoscaler event to the destination status object. Our implementation of this approach is powered by the unstructured type in client-go.
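A rough sketch of what that declarative field copying could look like using the unstructured helpers from apimachinery; the CopyRule type and the example field paths are illustrative, not the actual configuration format from the talk.

```go
// Generic, declarative field copying between arbitrary Kubernetes objects.
package fieldcopy

import (
	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
)

// CopyRule names a source path on one object and a destination path on another,
// e.g. {"status", "containerStatuses"} on a pod to the same path on a
// PodStatus custom resource.
type CopyRule struct {
	From []string
	To   []string
}

// ApplyRules copies each configured field from src into dst. Everything is a
// generic map[string]interface{}, which is what keeps the controller tiny,
// at the cost of losing static typing.
func ApplyRules(src, dst *unstructured.Unstructured, rules []CopyRule) error {
	for _, r := range rules {
		val, found, err := unstructured.NestedFieldCopy(src.Object, r.From...)
		if err != nil {
			return err
		}
		if !found {
			continue // nothing to copy for this rule
		}
		if err := unstructured.SetNestedField(dst.Object, val, r.To...); err != nil {
			return err
		}
	}
	return nil
}
```

Because everything stays a generic map, the same small controller can copy fields from pods, events, or anything else without compiling in their Go types, which is exactly the trade-off discussed next.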
Our implementation is actually quite lightweight, let's say several hundred lines of Go code, and it might be even more straightforward with Kubernetes clients in dynamic languages. The downside of this approach is that everything becomes a Go map in the implementation, and we lose static typing. You might want to do more than just copying, for example define more complicated trigger rules and have more varied actions triggered by those rules, not just copying, et cetera; that's totally possible and not that hard to do. But we would recommend keeping the configuration really simple, like this, as you don't want to end up having to unit test your configuration down the road. One realization from this experimental implementation is that this way we can really make sure that the additional controller, the pod status controller, only does copying and nothing else. So it can be a good design pattern, or discipline, to follow in certain use cases. In other words, you can make sure that all the business logic resides in the main controller and isn't randomly spread across multiple places. But in our use case, we actually wanted to encapsulate the autoscaler-related logic in the additional controller, the pod status controller, so we didn't choose this route. So, as a recap, now that we've walked through providing autoscaling information and the various failure scenarios by fixing the information flow to Spark, we can look back at the Spark app custom resource and know how it was produced. Regarding the persistence of the Spark app information for the purpose of post-job introspection, we went through various storage options, with the conclusion that the custom resource is an effective candidate for our multi-cloud environment. Building upon what we talked about in this presentation, similar features can be implemented in other distributed execution systems, such as PyTorch or TensorFlow, that rely on Kubernetes as a resource manager. In that case, the pod status controller would be reusable as-is, and the main job controller can handle the job-type-specific statistics on top of that. While the autoscaler detection behavior via event objects is specific to the standard cluster autoscaler, we can actually switch to alternative autoscalers, such as Karpenter, without affecting the main job controller, by only switching the detection logic inside the pod status controller. This is especially useful if we have multiple job types and multiple main job controllers. So feel free to check out the rest of the talks from Bloomberg, and also check out the links below to learn more about us. And by the way, we are hiring. Thank you. So, do we have any questions? Yeah, there's a question, of course. Hi, thank you for the presentation. How do you manage the data? I mean, in a multi-cloud environment, how do you manage the different aspects of the data: isolation, security? Do you share the data? How do you run Spark in multi-cloud with different types of data and different types of teams? Got it. So this is a general data management question; can you guys hear me? Sorry, a general data management question? Yeah, the context of the talk is mostly about the compute, but in terms of data, it varies between the restrictions and the security concerns of the application. If it's something where we're taking data that's only within a certain scope, then, say, the network we're constraining it to would have Calico or Cilium policies locking everything down in terms of moving data in and out, right?
And so it varies based on the implementation. The platforms themselves are composable, so they vary with the security concerns. But if you run the same code in different places, do you have to share the data? Do you have it in only one place, or in different places? So when you say the... okay. The pods themselves are in the user's namespace, right? And so the data itself is either co-located with the namespace, or you have specific verification of what data source you're communicating with. So the pods themselves are always in the user namespace; the admin controllers are just the control plane. The pods are not in the control plane; they're only in the user's namespace for the Spark app. Does that answer the question? Hey, so, yeah, there was one behind. Hey, I have two questions. The first one is: you had an on-prem data center footprint, and you had AWS and GCP. I was curious if all the cluster compute nodes on all these three footprints, across cloud providers, are homogeneous, or do you have heterogeneous compute environments? So, homogeneous in terms of the hardware types, like CPU, GPU, that kind of stuff? Yeah, yeah. Got it, got it. In terms of Spark, there are NVIDIA plugins that allow you to run on GPUs, and so it's based, once again, on the configuration of the cluster. The infrastructure itself is portable, so functionally, yes, it could be, yeah. Thanks. And the second question is, I'm curious to learn if you built the cluster autoscaler before Karpenter was announced, or what the reasons were for building it in-house versus using another open source project? Do you mean... Yeah, so we were using the standard cluster autoscaler from Kubernetes. The intention, as Aki mentioned, was that because we're in a multi-cloud environment and we need portable infrastructure, the built-in standard cluster autoscaler is something that would run in EKS, AKS, and on bare metal, because it's all the Kubernetes abstraction. So we just use the base cluster autoscaler; we didn't build our own. But the idea is that because we have a separation of concerns between the job controllers, which reconcile however that job operates, and the autoscaling information, the autoscaling controller could consume either the cluster autoscaler events or Karpenter's way of doing it, or not. Got it, yeah. Thank you. And the last question. So I guess, because you're using etcd, which is kind of a database, I'm curious if you hit any scaling or performance constraints, and whether you did any testing on that sort of thing; it seems like pod statuses are just created, and I imagine they'll pile up over time. So with regards to those considerations, traditionally it's about what your Spark app clusters scale to, right? We had some battle tests with, I think, 400 to 500 executor clusters, and that's what I've seen mostly, like sub-1,000; I don't really know of too many that have more than a thousand executor pods. So obviously, again, if you're talking about something very large, it's also a question of how much you can store in etcd. We felt that it fit within our model of 400 to 500 pod clusters. We can talk offline about other strategies and tests around the actual sizing itself, but it seems to fit for most applications that we're running in prod. Thank you.