Hi everyone, and thank you. I know it's evening, so thank you for joining the session today. Today we're going to talk about dynamic, large-scale Spark on Kubernetes with Argo Workflows and Argo Events. Before we kick off, I'd like to check: how many of you are currently running Spark on Kubernetes today? Yeah. And how many of you are using Argo Workflows today, for any type of workload? Thank you very much. That gives me a good sense; you can probably take away some points from today's session.

My name is Vara. I'm a principal solutions architect working for AWS, specialized in data analytics and Kubernetes. And today I have my colleague Ovidiu with me.

Hello everybody, I'm very excited to be here. My name is Ovidiu. I'm a container specialist solutions architect at AWS, and I'm excited to give this presentation together with my friend Vara.

All right, thanks Ovidiu. Without further ado, let's get started. Just a quick agenda: I'll give a little introduction to Spark on Kubernetes for anyone who doesn't know it, then we'll move on to why customers want to run Spark on Kubernetes, and then we'll discuss some of the best practices, especially, since today's talk is about running Spark on Kubernetes at large scale, the kinds of considerations and best practices you need to think about before deploying your workloads onto Kubernetes. We'll also talk about Argo Workflows and Argo Events: for the data pipelines you're building, you need to ensure they run on specific schedulers or workflow engines, and we'll dive into Argo Workflows and Argo Events there. And finally, we prepared a demo that showcases how you can create a DAG with multiple Spark jobs and trigger those Spark jobs using both the Spark operator and spark-submit. So that's the plan for today.

Let's get started with Spark on Kubernetes. As you see on the slide, Apache Spark is a distributed processing framework, mainly used for processing terabytes and even petabytes of data at very large scale, with both structured and unstructured data. Apache Spark comes with a set of libraries, such as Spark SQL, MLlib, Spark Streaming, and GraphX, for various kinds of processing: SQL-style data processing, machine learning, real-time stream processing, and graph processing. You can run Apache Spark on a standalone machine, such as your Windows machine or a laptop, but to run Apache Spark in distributed mode, to process terabytes of data, Spark needs a resource manager such as Hadoop. You might be familiar with Hadoop YARN; YARN is used as the resource manager to distribute the executors across instances and run your workload. Then in 2018, with Spark 2.3, Apache Spark added support for Kubernetes as a resource manager. That means you can run Spark workloads on Kubernetes, and that's when the shift started.
Customers have started to think about migrating their existing workloads to Spark on Kubernetes, asking how mature it is, because of all the scalability features Kubernetes offers.

So, why Spark on Kubernetes? That's probably the common question everybody asks, and most of you know Kubernetes: it's a very powerful container orchestration tool whose features you can leverage. One of them is dynamic scaling: you can run bursty Spark workloads, scaling from zero nodes to a thousand nodes when you trigger a job, and scale back down when you're not running jobs, using Cluster Autoscaler or Karpenter. Portability allows you to write your Spark job, containerize it, and run it on any flavor of Kubernetes cluster, so you can use EKS, GKE, or even on-prem Kubernetes clusters and it should work. Resource isolation is another key feature: every single job can have its own CPU and memory definition, so you ensure a job doesn't take more than the CPU and memory you allocated, and every job gets its own quota. It's also cloud agnostic: you can develop the job once, and if you're thinking about migrating from on-prem to cloud-native platforms like EKS, GKE, or any other Kubernetes service, it will run there. Multi-tenancy is one of the key features of running Spark on Kubernetes: with namespace isolation, multiple teams can share the same cluster, run multiple workloads, and run multiple versions of Spark. With Hadoop, you might be in the situation where you have to create a dedicated Hadoop cluster for every single Spark version, but with Kubernetes you can have multiple Spark versions running on the same cluster, EKS in this case. That's a great feature. Then there's the CNCF ecosystem: when you want to monitor your Spark jobs, create dashboards, and extract logs, the CNCF ecosystem comes with a lot of open-source add-ons you can leverage to build a complete ecosystem around a full data pipeline. And cost optimization: all of these features together drive cost optimization compared to the traditional way of running Spark on Hadoop, mainly through autoscaling, portability, and multi-tenancy.

Right, this is a simple spark-submit command, no different from how you run on Hadoop, except that the master URL points to the Kubernetes API server. If you want to run your Spark job using this command, you can run it from your local machine, point it at the Kubernetes API server, which can be EKS or any other Kubernetes cluster, and the job will run.
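To make that concrete, here is a minimal sketch of what such a spark-submit invocation looks like. The API server endpoint, namespace, service account, and image are placeholders, not the exact values from the slide:

```bash
# Minimal sketch: replace <K8S_API_SERVER> with your cluster's API endpoint
# (for EKS, the cluster API server URL) and the image/namespace with your own.
spark-submit \
  --master k8s://https://<K8S_API_SERVER>:443 \
  --deploy-mode cluster \
  --name spark-pi \
  --class org.apache.spark.examples.SparkPi \
  --conf spark.executor.instances=5 \
  --conf spark.kubernetes.namespace=spark-team-a \
  --conf spark.kubernetes.authenticate.driver.serviceAccountName=spark \
  --conf spark.kubernetes.container.image=apache/spark:3.4.0 \
  local:///opt/spark/examples/jars/spark-examples_2.12-3.4.0.jar
```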
Let's dig into how a job runs on Kubernetes. This diagram shows the internals: on the left-hand side we have the Kubernetes control plane, and on the right-hand side the Kubernetes data plane. When you run spark-submit from your local machine, from Airflow, or from any other scheduler, the job is submitted to the API server, and the scheduler schedules the driver pod. That's the first step. Once the driver pod is created (and although it's not shown here, a headless service is also created along with it), the driver requests the API server to schedule the executors, however many you asked for, and the scheduler places them on nodes. Once those executors start running, they connect back to the driver through that headless service. That's how the driver communicates with the executors, sends tasks down to them, and the job runs inside the individual executors. So that's how Spark on Kubernetes works, the communication end to end.

Now, as you see here, this is a YAML file for the Spark operator. In the Kubernetes world we want to define everything in a declarative YAML way. What we saw before, spark-submit, is a CLI command with a set of flags; instead, you can define your Spark job in a simple YAML file with the job name and, as you see here, the driver cores, the executor cores, and the number of executors you need. This simplifies the whole process of running a Spark job: I can simply write this file, apply it with kubectl, and that goes and runs my Spark job on Kubernetes.

So how do we get to this point, and how do we leverage the Spark operator? Let's talk a bit more about it. Google created this Kubernetes operator for Spark to simplify the process. What does it do differently from spark-submit? Let's take a look. It creates two CRDs, SparkApplication and ScheduledSparkApplication, and it comes with four main components: the controller, the submission runner, the Spark pod monitor, and the mutating admission webhook. The controller watches for the SparkApplication objects, the YAML you saw before, and asks the submission runner to submit the job. What the submission runner does is assemble that YAML into a spark-submit command and submit it to the Kubernetes API server, which is similar to what you saw with plain spark-submit. So even the Spark operator is leveraging spark-submit behind the scenes. Once a job is submitted, the operator does two things differently, through the Spark pod monitor and the mutating admission webhook. Your job might need storage, mounted volumes, or mounted ConfigMaps, and so on; the mutating admission webhook is responsible for patching the pod before it is persisted, creating and mounting the volumes and ensuring the pod comes up with all of them. The Spark pod monitor watches the driver and executor status and keeps informing the controller that the job is running, so if the job fails, the controller can restart it. That's how the operator works: a more simplified experience than spark-submit, but underneath it still uses spark-submit.
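For reference, a minimal SparkApplication manifest of the kind shown on the slide might look like this; the namespace, image, and resource numbers are illustrative placeholders, not the demo's exact values:

```yaml
apiVersion: sparkoperator.k8s.io/v1beta2
kind: SparkApplication
metadata:
  name: spark-pi
  namespace: spark-team-a          # hypothetical team namespace
spec:
  type: Scala
  mode: cluster
  image: apache/spark:3.4.0        # placeholder image
  mainClass: org.apache.spark.examples.SparkPi
  mainApplicationFile: local:///opt/spark/examples/jars/spark-examples_2.12-3.4.0.jar
  sparkVersion: "3.4.0"
  driver:
    cores: 1
    memory: 2g
    serviceAccount: spark
  executor:
    instances: 5
    cores: 2
    memory: 4g
```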
So, scaling Spark on Kubernetes. These are some of the common challenges we see when working with customers. They come back and say: hey, we tried Spark on Kubernetes, but when it comes to scaling, when we go beyond 200 nodes or even 1,000 nodes, we hit all sorts of issues. What considerations do we need to take? What high-availability considerations? What logging and monitoring do we need to build when we run Spark on Kubernetes? How do we choose the right compute and storage, what network configuration do we need to define, and do we need a dedicated batch scheduler or queues to run the jobs? So let's talk through the best practices for those challenges.

As you see in this diagram, we have the VPC CNI. I'm not sure how many of you have heard about networking within EKS: we use the VPC CNI within AWS, and similar networking is available with other cloud providers as well. With the VPC CNI, we use pod networking where every pod gets a VPC IP. So when it comes to running Spark on Kubernetes, if you want to run 50,000 pods, you need 50,000 IPs, which means you might hit IP exhaustion issues. To avoid that, one of our main recommendations is to use a non-routable secondary CIDR range and create two large subnets, each with about 65,000 IPs. With this architecture you can go up to around 120,000 pods running within the same EKS cluster, if you want to run at that scale.

The next step is IPv6. The secondary CIDR range is an option today, but this is something we're looking at for the future, as customers move to IPv6, and it's a lot easier to work with. Kubernetes supports IPv6 from version 1.23, and Spark 3.4.0 added support for IPv6 as well. All you need is the configuration you see here when you submit the job; that lets you leverage IPv6 and avoid IP exhaustion when you want a very large number of pods within the same Kubernetes cluster.

Right, so this is a common issue when it comes to CoreDNS. If you're running Spark on Kubernetes at large scale, the first thing you might see is an error like UnknownHostException. That's because Spark is by nature a bursty workload: 10,000 pods spin up, try to communicate with each other, and put pressure on CoreDNS. To avoid that, we recommend using the cluster-proportional-autoscaler with CoreDNS, which means that as your Kubernetes cluster grows, your CoreDNS pods scale horizontally to support the larger cluster. We also recommend NodeLocal DNSCache, which is basically a DNS cache on every single node, so pods don't need to make DNS resolution calls to the CoreDNS pods; they hit the local cache instead, which reduces the number of calls to CoreDNS and improves performance for large-scale clusters. And finally, pod anti-affinity: make sure you don't run more than one CoreDNS pod on the same node, so the CoreDNS pods are spread across the nodes rather than everything running on a single node.
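As a sketch of that last point, this is the standard Kubernetes pod anti-affinity stanza you could add to the CoreDNS Deployment in kube-system so that no two CoreDNS replicas land on the same node. The k8s-app: kube-dns label matches the default CoreDNS Deployment, but verify it against your cluster:

```yaml
# Excerpt from the CoreDNS Deployment pod template (kube-system namespace)
spec:
  affinity:
    podAntiAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        - labelSelector:
            matchLabels:
              k8s-app: kube-dns        # default label on CoreDNS pods
          topologyKey: kubernetes.io/hostname
```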
One important consideration here is storage. When it comes to Spark, what kind of storage do we need? We highly recommend NVMe SSD-based volumes, because Spark by nature does a lot of shuffling, and NVMe SSD volumes, as you see, are part of the EC2 instance itself. When the Spark job runs, it pulls the data from the S3 bucket onto that NVMe SSD, and the executors can work with the data quickly, without added latency. That gives you high-throughput, low-latency performance for Spark jobs. You also have the option of block storage, that is, EBS volumes. The difference is you get variable throughput, because the volume is external to the EC2 instance and EBS bandwidth varies with the instance size you choose, smaller versus larger. But EBS also comes with other features, like reusing persistent volume claims. PVC reuse is a feature in Kubernetes where, if a node dies for some reason and the pod gets killed, you can keep the volumes that were attached to the pods on the old node and reuse them with the new pods that come up. The shuffle data doesn't need to be recomputed; the job can pick up from where it left off. Over to you. Yep, thank you.

Now, what compute options do we have when running Spark on Kubernetes? When you deploy Spark on Kubernetes, the executors can be scheduled on Spot instances and the driver on On-Demand instances. Scheduling executors on Spot instances enables faster results by scaling out executors cost-effectively, and there is another reason: if the driver pod runs on a Spot instance and that instance gets terminated, the whole application fails and has to be resubmitted. That's why we recommend the driver always be placed on On-Demand instances. For the executors, though, even if Spot instances get terminated, Spark's resiliency means the driver will create new executors once new Spot capacity comes up. This way you also achieve cost efficiency within your cluster.

While running Spark on Kubernetes, you can hit challenges with scalability and performance, but with Karpenter, Spark clusters can be scaled quickly and dynamically so the workloads get the capacity they need, while you maintain resource availability and cost efficiency. Karpenter also comes with features like workload consolidation and node deprovisioning and termination, further improving efficiency and cost savings.
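To illustrate the driver-on-On-Demand, executors-on-Spot pattern, here is a hedged sketch using node selectors in a SparkApplication. The karpenter.sh/capacity-type label is Karpenter's well-known capacity label; the rest of the manifest is assumed to be like the earlier example:

```yaml
# Placement excerpt for a SparkApplication (sparkoperator.k8s.io/v1beta2)
spec:
  driver:
    nodeSelector:
      karpenter.sh/capacity-type: on-demand   # keep the driver off Spot
  executor:
    instances: 10
    nodeSelector:
      karpenter.sh/capacity-type: spot        # executors tolerate interruption
```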
Apache YuniKorn, which you'll actually see in our demo, is a batch scheduler purpose-built for Kubernetes, usually used for big data and ML workloads. It's particularly useful because of features like application-aware scheduling, where it recognizes users, applications, and queues and schedules according to submission order; gang scheduling for pod placement, which you'll also see in the demo; and built-in hierarchical queues, all of which help optimize Spark performance.

For logging, the Fluent Bit Kubernetes filter allows you to enrich your log records with Kubernetes metadata. And with the extra option you see in the filter configuration, you can get that metadata from the kubelet instead of querying the API server. That's very useful when you have large clusters, because you don't put pressure on the API server anymore.

Getting into Argo Workflows: this is a DAG orchestrator, a workflow engine that's the best way to run workflows on Kubernetes, because it's built for Kubernetes. It's very popular and very useful when you're running ETL workloads or training jobs, and you can also orchestrate deployments and use it in your DevOps and CD pipelines. Argo Workflows can be integrated with Argo Events, and Argo Events can consume almost any event source: Kafka, Slack, various webhooks, you name it. You'll see in our example an external source that triggers an event, and consequently a workflow and a job. That's what we're going to show you in the demo today.

In the demo, we created an Amazon EKS cluster and installed Argo Workflows and Argo Events in their dedicated namespaces, as you can see on the screen: argo-events and argo-workflows. On the outside, we have an Amazon SQS queue created to receive requests from users, and an SQS event source, which you'll see as an object in the argo-events namespace, set up to fetch events from that SQS queue. The event could come from anywhere, for example an S3 put-object notification, not necessarily from users; this is just what we chose for demo purposes. The sensor runs and waits for certain conditions to be met (we'll go through how the sensor is built), and when a message is received by the event source, the sensor sees it and creates a workflow, and the workflow, through the Spark operator, creates a Spark job in the spark-team-a namespace. Vara mentioned multi-tenancy earlier; this is how you separate teams into namespaces. It allows you to build that end-to-end data pipeline using Argo Workflows and drive those pipelines using Argo Events. And the Spark operator, Prometheus, and YuniKorn all sit within the same Kubernetes cluster.

Okay, so now I'm going to jump to the demo. As you can see here... all right, we'll kick off the demo first and then talk through it. I will start it because the job takes two minutes, and then I'll tell you exactly what I did.
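As a reference point before the demo output, a minimal sketch of what that SQS EventSource object could look like, assuming the argo-events namespace from the demo; the event name, region, and queue name are hypothetical placeholders, and AWS credentials are assumed to come from the pod's IAM role rather than inline keys:

```yaml
apiVersion: argoproj.io/v1alpha1
kind: EventSource
metadata:
  name: aws-sqs
  namespace: argo-events
spec:
  sqs:
    spark-workflow:                  # event name, referenced by the sensor
      region: us-west-2              # hypothetical region
      queue: spark-jobs-queue        # hypothetical SQS queue name
      waitTimeSeconds: 20            # SQS long-polling interval
      jsonBody: true                 # parse the message body as JSON
```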
All we do is trigger this job, and it's basically just a simple YAML configuration: you can define the whole DAG in a YAML file using Argo Workflows. You can build very complex workflows, similar to how you might be using Airflow for scheduling your data and ML jobs. On Kubernetes, Argo Workflows is a very lightweight but powerful workflow engine; you can build your end-to-end data pipelines, include the ML pipelines as part of them, and use Argo Workflows as your Kubernetes-native engine to run your jobs. And all you'll be doing is defining simple YAML files. In this case there is a sensor YAML file, which Ovidiu is going to show you in a minute, once the job is triggered.

Allow me to trigger the job, and then we'll come back and show you exactly what I did. We have the SQS queue created, and I'm going to send a message to that SQS queue to invoke the pipeline. One second, let me make sure I have the variables in place... yeah, it's all good. I didn't copy the whole thing; that makes it more interesting. In production you would have this automated, of course. The message to SQS was sent, let me zoom in a bit, and we should have a workflow triggered. Yeah, it's here; I can see it on the web UI, which is the Argo Workflows web UI.

Can you explain that a little bit? Yes. That's the sensor... I need to log in again on this one, I think. One second. Argo Workflows provides a really nice UI that you can expose using NGINX, backed by an Application Load Balancer or a Network Load Balancer, and you can set up AD authentication with your own LDAP configuration to make it more accessible.

So our event source is the SQS spark-workflow event. The dependencies for the sensor were met, and it created a Spark workflow, and that is happening right now, as you can see. Let me make it a bit bigger for you. The workflow has several parallel jobs: one was a simple hello-world job, from that job another two jobs were created, and at the end a bigger one. You'll see at the same time, as I move to the CLI, the pods being created in the spark-team-a namespace, and Karpenter provisioning the node for the driver and the Spot instances for the executors.

Behind the scenes, it's using YuniKorn for gang scheduling and Karpenter to scale the nodes. So basically you have an empty cluster, just a Kubernetes cluster running, and once you trigger the job through Argo Workflows, it goes and spins up the nodes, runs your Spark job, and then scales back down to zero.

Okay, and this is the event source running in the argo-events namespace, watching for events to fetch from Amazon SQS. And this is the sensor; I'll go through it really briefly with you. The sensor triggers the Argo workflow here, as you can see, and this one has several parallel jobs: two hello-world jobs and one more complex Spark operator job. In this Spark operator job, the driver needs to be placed by Karpenter on On-Demand instances, as you can see here.
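For reference, a trimmed sketch of such a sensor, assuming the event source and event names from the earlier sketch; the embedded Workflow here is a stand-in placeholder, not the demo's actual DAG of hello-world and Spark jobs:

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Sensor
metadata:
  name: sqs-spark-sensor
  namespace: argo-events
spec:
  dependencies:
    - name: sqs-dep
      eventSourceName: aws-sqs         # matches the EventSource above
      eventName: spark-workflow
  triggers:
    - template:
        name: spark-workflow-trigger
        argoWorkflow:
          operation: submit            # submit a new Workflow per message
          source:
            resource:
              apiVersion: argoproj.io/v1alpha1
              kind: Workflow
              metadata:
                generateName: spark-pipeline-
              spec:
                entrypoint: main
                templates:
                  - name: main         # placeholder step standing in for the demo DAG
                    container:
                      image: alpine:3.19
                      command: [echo, "hello from the pipeline"]
```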
At the same time, YuniKorn takes care of the scheduling, and the executors will be placed on Spot instances. The same happens for the bigger job. I measured it and it takes exactly two minutes, and I think I've already been talking for two minutes about this, so we should see things happening here. If I click this, it brings up the node viewer. All right, so these are the first three nodes; they are the core nodes where the system pods are running. And Karpenter created one On-Demand node for the driver and two additional nodes where the executors are running. If I watch this in the spark-team-a namespace, I can see that the first driver in this workflow, the one from the spark-operator Pi job here, is already running, and its executors are being created on the Spot instances, while the second job, the spark-operator taxi job, is waiting for the Pi job to finish before it starts running. And now the executors from the first job are terminating, and the last job is being created as well.

Yeah, I think we're close to time. Sorry, we've taken longer than planned, but if you have any quick questions, we can take a couple. Yeah, go on, please. One question, and then, sorry, can they find you somewhere afterwards? Yeah, yeah, we'll be standing outside.

So, that's a good question. A lot of customers are experimenting with IPv6 at the moment. Even though spark-submit supports it, there are other components that do not yet support IPv6, so it's mainly about validating whether the other tools or add-ons you want to run on Kubernetes support IPv6 or not. For example, the Spark operator does not support IPv6, and for that reason customers are waiting until that support is available. So we're treating it as a future direction; once we have that support, we're heading that way.