So today, we'll be running some Spark jobs on Kubernetes with Argo Workflows. Here's what we'll cover: the scaling and stability advantages of running Spark jobs on Kubernetes versus traditionally running them on YARN; how to use Argo Workflows to deploy and automate Spark jobs on Kubernetes, running not just one job but multiple Spark jobs successively or in parallel; and, once we're running on Kubernetes, what else becomes possible.

A little bit about us. I'm Kaelin, co-founder and CEO of Pipekit, which you probably just heard about. At Pipekit, we use Argo Workflows to scale data pipelines for enterprises. We help them with a SaaS control plane that sets up Argo Workflows and gets it running in production on Kubernetes, so they can build faster, more reliable data pipelines without having to know everything about Argo. I'm also a contributor on the Argo Workflows project. I was supposed to give this talk today with my talented teammate, Darko, but unfortunately he couldn't make it from Serbia, so shout out to Darko for his contributions to this talk.

Let's get into the background of the tools we're using. First, Apache Spark. It's a really popular open-source big data compute framework. Besides being super performant for batch data jobs, it's a really versatile tool, which is why it has maintained its popularity over time, even with upstarts like Dask and Ray coming onto the scene that are arguably even better at pure batch jobs. You can use the framework to run streaming jobs alongside your batch jobs, and it has a handy machine learning library, creatively named MLlib, so you can transition right from ingestion and data processing into machine learning, using a bunch of common languages.

We work with a lot of companies that use Spark alongside all these other frameworks, and do it on Kubernetes. We want to talk about the advantages today, because more and more Spark jobs are moving away from YARN, and there are a bunch of issues we run into when we deploy on YARN.

First is requiring global installs. Sharing libraries between your Spark jobs on one cluster is really hairy to deal with. Often you want to use a different Spark version for different apps you're building, without having to go back and refactor old Spark jobs as versions upgrade. That isn't really possible with YARN, so it's a big advantage of adopting Kubernetes.

Using Docker natively is another big advantage of moving to Kubernetes. On YARN, you have to do a lot of legwork to containerize your applications and take advantage of the faster dev experience and improved dependency management that Docker offers. There's a resource benefit as well: instead of running a JVM on each node, you can take advantage of Kubernetes pods. And autoscaling is the final big advantage we really want to exploit: how can you use different CPU or GPU thresholds on the same cluster, and how can you adapt to spiky data jobs depending on the workload you need?

But we often get asked: why use Kubernetes for data science at all? To quickly touch on this for data teams who are newer to Kubernetes: when you use containers as the building blocks of your pipeline, you get reproducibility, so you can share parts of the pipeline, or the full pipeline, with different people on your team. Containers also mean each job can pin its own Spark version, as in the sketch below.
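To make that container-as-building-block point concrete, here's a minimal sketch (not from the demo repo; image names and tags are illustrative) of two steps in one Argo Workflow, each pinned to a different Spark version through its Docker image, something that's impractical with YARN's global installs:

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: spark-versions-
spec:
  entrypoint: main
  templates:
    - name: main
      dag:
        tasks:
          - name: legacy-job            # stays on Spark 2.4
            template: run-spark
            arguments:
              parameters:
                - name: image
                  value: myrepo/spark-app:2.4.8   # illustrative tag
          - name: new-job               # runs Spark 3.1 on the same cluster
            template: run-spark
            arguments:
              parameters:
                - name: image
                  value: myrepo/spark-app:3.1.1   # illustrative tag
    - name: run-spark
      inputs:
        parameters:
          - name: image
      container:
        image: "{{inputs.parameters.image}}"
        command: [/opt/spark/bin/spark-submit]
        args: ["--version"]             # each step sees only its own Spark install
```

Each step resolves its dependencies from its own image, so upgrading one job never forces a refactor of its neighbors.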
Beyond reproducibility, you get improved reliability thanks to better dependency management. Kubernetes also gives you a declarative approach to pipelines: you essentially define what outputs you want and let Kubernetes handle the scaling. That's where vertical autoscaling comes into play. Kubernetes is a great complement to the horizontal scaling that a framework like Spark offers, because it introduces the ability to also scale vertically depending on the compute you want. And finally, we'll talk a little later about how the cloud native ecosystem can come into play to bring your data pipeline fully to production.

We've talked a lot about Argo Workflows today, and that's what we'll be using. The big benefit here is that it's generalizable: beyond just running your batch jobs, you can use Argo Workflows alongside Spark to trigger ML training jobs, model serving, or deployment. Ultimately we're using it as the workflow engine, the orchestrator, to automate many, many Spark jobs instead of having to run them manually.

Taking a look at what the solution looks like: on the cluster, we deploy Argo Workflows, and the user submits a workflow to Argo in the Argo namespace. Argo then talks to the Kubernetes API to schedule that workflow, which is going to be a series of Spark jobs. Kubernetes starts the workflow, which steps through and submits Spark jobs; the Kubernetes scheduler spins up the Spark driver pods, which in turn spin up the executor pods that the drivers assign tasks to. There are two big advantages here. First, Argo Workflows is managing the state of these jobs, which is how we can begin to string jobs together into a longer pipeline. Second, Kubernetes is handling the spinning up and down of all the executor pods: as pods are no longer needed, Kubernetes spins them down and saves you resources, and once the driver is done, it spins that down as well, so you're never holding resources idle.

All in all, we're really seeing the problems we saw initially solved by moving container native. We're now running different Spark versions and different jobs on the same cluster, we can manage dependencies more easily, and we can stop downloading those dependencies at runtime. The other big thing we get is lower resource requirements for running our jobs, plus vertical autoscaling.

But wait, there's, of course, more. The first thing to call out is that this architecture is duplicable: if you have a team that wants to move beyond Spark and start adopting Dask or Ray, you don't have to set up a completely new cluster. You can spin up a similar architecture on the same cluster, run some jobs with Spark and some with Dask or Ray, and use Argo Workflows to orchestrate all of it. You can extend it, bringing other data processing or machine learning tooling into the same cluster, which we'll talk about in a minute. And you get a bunch of other benefits from being cloud native: you're cloud-agnostic, and you can start to pull in other tools like Prometheus for your metrics and monitoring.

So, quickly, let me jump into a demo of how we get this done. What we're going to run is a couple of basic Spark jobs over a bike share dataset that we're pulling from Kaggle, each a simple MapReduce; a sketch of how such a step reaches Kubernetes follows below.
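As a rough sketch of the flow just described, and with the caveat that this is not the demo repo's actual code: one way a workflow step can hand a Spark job to the Kubernetes scheduler is to run spark-submit in cluster mode against the in-cluster API. Spark then creates the driver pod, and the driver requests its executor pods. The image name, main class, jar path, and spark service account here are all assumptions for illustration:

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: spark-on-k8s-
spec:
  entrypoint: spark-job
  serviceAccountName: spark        # assumed service account with RBAC to create pods
  templates:
    - name: spark-job
      container:
        image: myrepo/spark-bikes:3.1.1    # illustrative image with the job jar baked in
        command: [/opt/spark/bin/spark-submit]
        args:
          - --master
          - k8s://https://kubernetes.default.svc   # the in-cluster Kubernetes API
          - --deploy-mode
          - cluster                                # the driver runs as its own pod
          - --name
          - bike-type-count
          - --class
          - com.example.BikeTypeCount              # hypothetical main class
          - --conf
          - spark.kubernetes.container.image=myrepo/spark-bikes:3.1.1
          - --conf
          - spark.kubernetes.authenticate.driver.serviceAccountName=spark
          - --conf
          - spark.executor.instances=2             # two executor pods, reclaimed when done
          - local:///opt/spark/jars/bikes.jar      # hypothetical jar path inside the image
```

When the job finishes, Spark tears down the executors and Kubernetes cleans up the driver pod, which is exactly the no-idle-resources behavior described above.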
Now, the jobs themselves: we're going to count our bikes by bike type, and total up the hours ridden. Then we'll create an Argo workflow, a DAG, that runs these two Spark jobs in parallel: the bike-type count job and the ride-length job side by side.

So we create an Argo workflow and start to define the DAG steps. First we define the bike-type step, along with its success and failure conditions, which is handy when we want to string together tasks depending on how a step completes or fails. We then designate that this step uses the Spark operator, and we designate the language: we're using Scala for this job. We set the mode to cluster, which is required for running Spark on Kubernetes. We also put in our Docker image and set our main class, which is the Scala class where we defined our job. Then we set the jar location and the Spark version. This is where, within the same workflow, you can have one step running version 3.1 and another step running version 2. Down at the bottom, we define our resource requirements: first for the driver, which we're running locally for this demo, and then for the executors. We can of course scale these up when we move to our AWS account and off of local, but for now we'll run two executor instances, which means Kubernetes spins up two pods to execute each Spark job. A condensed sketch of what such a workflow can look like follows below.

Now we take our workflow, submit it to Argo, and run it. We can see our DAG, which maps back to our workflow file: our bike-type count and our ride-length count. While it runs, we can go into kubectl and monitor it: we see two driver pods, plus two executor pods for each driver. Once they complete, we grab the logs and see the output of our count by bike type, and then the logs for our ride length by bike type. Since this is a demo, we're just using print statements for output. But that's an example of running two Spark jobs in parallel on Kubernetes with Argo.

Next, let's shift into how we can parameterize that workflow file, so more people on the team can run workflows on their own just by passing arguments. For instance, instead of writing all the YAML we did before, we can use a workflowTemplateRef to call a shared Spark workflow template and simply pass in the template parameters it defines. As a data scientist, you just define what your Scala job is going to be: the file, the image, where you want to run it, and what resources you need. On the workflow template side, this looks like a bunch of input parameters that map back to our original workflow, but with defaults set, so any user can adapt the template to whatever Spark job they want to run. Both sides of this are sketched below as well.
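Here's a condensed sketch of what that two-job workflow can look like. It assumes a Spark operator such as the Kubeflow spark-on-k8s-operator (and its SparkApplication CRD) is installed, and the class names, image, jar path, and service account are illustrative rather than the demo repo's exact values:

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: bike-spark-dag-
spec:
  entrypoint: main
  templates:
    - name: main
      dag:
        tasks:                    # no dependencies between them, so both run in parallel
          - name: bike-type-count
            template: spark-app
            arguments:
              parameters:
                - name: main-class
                  value: com.example.BikeTypeCount    # hypothetical class name
          - name: ride-length-count
            template: spark-app
            arguments:
              parameters:
                - name: main-class
                  value: com.example.RideLengthCount  # hypothetical class name
    - name: spark-app
      inputs:
        parameters:
          - name: main-class
      resource:
        action: create
        # The conditions Argo uses to track each Spark job's state:
        successCondition: status.applicationState.state == COMPLETED
        failureCondition: status.applicationState.state == FAILED
        manifest: |
          apiVersion: sparkoperator.k8s.io/v1beta2
          kind: SparkApplication
          metadata:
            generateName: bike-job-
          spec:
            type: Scala                  # the job language
            mode: cluster                # required for Spark on Kubernetes
            image: myrepo/spark-bikes:3.1.1
            mainClass: "{{inputs.parameters.main-class}}"
            mainApplicationFile: local:///opt/spark/jars/bikes.jar
            sparkVersion: "3.1.1"        # swap per step to mix Spark versions
            driver:
              cores: 1
              memory: 512m
              serviceAccount: spark      # assumed RBAC setup for the driver
            executor:
              cores: 1
              instances: 2               # two executor pods per job, as in the demo
              memory: 512m
```

Because neither DAG task depends on the other, Argo submits both SparkApplications at once and then just watches their status fields.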
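And here's roughly what the parameterized version can look like, in two parts: the shared WorkflowTemplate that holds the defaults, and the short Workflow a data scientist actually submits. The template name and parameter names are illustrative, not the repo's exact ones:

```yaml
# The shared template (fragment): defaults that any caller can override.
apiVersion: argoproj.io/v1alpha1
kind: WorkflowTemplate
metadata:
  name: spark-job-template
spec:
  entrypoint: spark-app
  arguments:
    parameters:
      - name: main-class
        value: com.example.BikeTypeCount    # default job class
      - name: image
        value: myrepo/spark-bikes:3.1.1     # default image
      - name: executor-instances
        value: "2"                          # default executor count
  # ... the spark-app template consumes these via {{workflow.parameters.*}}
---
# What a user submits: no Spark YAML at all, just arguments.
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: ride-length-count-
spec:
  workflowTemplateRef:
    name: spark-job-template
  arguments:
    parameters:
      - name: main-class
        value: com.example.RideLengthCount
      - name: executor-instances
        value: "4"                          # scale this job up without touching the template
```

The submitted Workflow inherits everything else from the template, which is what lets teammates run their own Spark jobs without learning the full spec.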
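One more sketch before next steps: once a shared template like that exists, putting Spark jobs on a schedule takes only a few extra lines with a CronWorkflow. This reuses the hypothetical spark-job-template from above, and the node label is illustrative:

```yaml
apiVersion: argoproj.io/v1alpha1
kind: CronWorkflow
metadata:
  name: nightly-bike-counts
spec:
  schedule: "0 2 * * *"          # every night at 02:00
  concurrencyPolicy: Replace      # replace a still-running instance rather than piling up
  workflowSpec:
    workflowTemplateRef:
      name: spark-job-template    # the hypothetical template sketched above
    nodeSelector:
      pool: gpu                   # illustrative label: route the workflow's pods to a GPU pool
```

One caveat: a workflow-level nodeSelector only covers the pods Argo itself creates; driver and executor pods launched by the Spark operator need their own nodeSelector inside the SparkApplication spec.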
So, jumping back into next steps. Now that we're running Spark on Kubernetes, we can enable cron workflows, like the sketch just above; if you hop into the GitHub repo we have online, there's an example of that you can check out as well. We can start to define different node pools, so if you want to scale certain jobs up onto GPUs, you define those nodes and select them with a few lines of code, and with Argo Workflows you have the logic to run certain Spark jobs on more or less compute depending on your needs. We can start to use cluster autoscaling, and then extend the pipeline with more capabilities.

An example of that is an architecture where we deploy Argo Workflows, but also deploy a REST API that feeds events through Kafka into a Spark Structured Streaming app. We can ingest that alongside batch jobs that pull from a database, and use Argo and Argo Events to trigger these Spark jobs simultaneously. And because we now know the state of these jobs, we can trigger ML model training, hyperparameter tuning, and even model deployment as the pipeline progresses. So you get an all-encompassing way to automate your ML pipeline on Kubernetes with Argo Workflows.

As far as resources and next steps: feel free to check out our repo, where we have all the code from the demo. Hit us up on Twitter if you're curious about chatting about any of this further, and make sure to check out a couple of extra resources if you end up diving in. Thank you.