We'll be talking about automating cloud-native Spark jobs with Argo Workflows today. Thanks for coming.

So, a quick agenda. We'll do an intro and talk about the goals for the talk. We'll cover some background on the tools we'll be addressing and using today. We'll talk about a few common problems with Spark that we've run into and how we've solved them with Kubernetes and Argo Workflows. We also have a demo for you, planned to be a live demo, so we'll show you how that runs and share a repo with you. And then we'll talk about some next steps and leave time for questions.

So, for our intro, the goals: today we want to learn about the scaling and stability advantages of using Spark on Kubernetes. We'll also cover how to use Argo Workflows to automate and orchestrate Spark jobs, as well as jobs you might run with other compute frameworks like Dask and Ray, all on Kubernetes. And then we'll learn about what else is possible once you've deployed your Spark app onto Kubernetes. What can you do next? We'll talk about that.

A quick bit about us. I'm Caelan, the co-founder and CEO of Pipekit. At Pipekit we use Argo Workflows to scale data pipelines for enterprises. For companies that like to use Spark, we help them get Spark jobs running on Kubernetes, as well as other compute frameworks, and help them build fully featured data science platforms that are more reliable and help them move faster. So if you're interested in managed Argo Workflows, that's what we do. We've worked with numerous companies that use Spark, which is why we're talking about some of the issues we've run into and how we've solved them. I'm also a contributor on the Argo Workflows project, so you'll see me in the community answering questions, attending community meetings, and in general helping with the Argo Workflows roadmap. And I'm fortunate today to be with my talented teammate Darko.

Hello, my name is Darko and I'm a senior software engineer at Pipekit. I'm from Serbia, and I mostly work with distributed systems, cloud infrastructure, and Kubernetes. Thank you.

All right. Now let's cover some background and make sure we're all familiar with these three tools. How many folks here are using Spark already? A couple, awesome. Great, so a little bit about Spark then. It's currently the most popular data compute framework. In the past you might have used Hadoop; Spark is the next generation of that, and in turn it's the generation before some newer frameworks like Dask and Ray. But it's still the most popular because it's so versatile; it's kind of the one-stop shop for data science. You can run batch jobs, parallelize jobs and really get them done faster. You can also use the streaming framework Spark has, which really complements doing batch jobs alongside streaming, and there's a whole machine learning suite you can use as part of Spark as well. You can use almost any common language you would use for data science, which is handy. So that's why we see a lot of people wanting to use Spark as their first step.

However, there are some issues, especially with the traditional deployment using YARN on Hadoop, and these make doing data engineering pretty frustrating. Things like managing dependencies and different libraries on the same distribution are really difficult, and so is figuring out how to run experiments between environments or hand off an experiment to a teammate.
If I want to use different compute configurations on the same cluster, that's actually pretty challenging, because Spark on YARN requires global installs of the version, so you're pretty much locked into a single version for your cluster. It doesn't support Docker natively, so you have to do some extra work to use containers and run your jobs with containerization, and that also means debugging any issues you run into is a little harder. The deployment itself also consumes more resources at baseline than if you were to deploy on Kubernetes, so if you're deploying a lot of Spark jobs on a given cluster, this adds up even when you're not actually running your compute. And finally, there are no autoscaling capabilities like you have with Kubernetes, so if you want to vertically scale up to a GPU or add more CPU, that's a lot harder on the traditional deployment.

Fortunately, we're going to talk about Kubernetes today. Is anyone using Kubernetes currently at work? Good, okay, it's becoming more popular, which is great. And how about for data use cases? Okay, a couple, cool. So today we'll be focusing on using Kubernetes for data science applications, and the benefits we see are, number one, containerization. When you start to adopt containers as the building block for each step in your pipeline, you see two big benefits. First, your data pipelines are more reproducible: you can hand off portions of a pipeline, or the whole thing, to a teammate for them to rerun or iterate on. They don't have to fully reproduce your local environment; everything's containerized, and if you have a container registry set up, they can just pull down the Docker images for any step in the pipeline and rerun it. This helps a team move a lot faster, since there's a lot less overhead to reproduce any portion of your data pipeline. And reliability-wise, you get a lot of benefit from having all your dependencies managed in a given container. You can move easily between versions of Spark, or whatever other compute framework you're using, deploy that to production, and know with confidence that if it worked in dev, it will work in prod.

The second big benefit of moving to Kubernetes for data science is the declarative approach. We can focus less on how we're going to scale our compute and more on defining the parameters we want to scale within, and Kubernetes will handle the rest. We'll get into that a little later today, but that really changes things and makes the data engineering side more predictable and easier to deal with. A good example is vertical autoscaling, a key feature of Kubernetes that we'll be talking about today. In the event you have a job that's very heavy in usage, say you get a big data dump every so often, maybe only once a week or once a month, you can define parameters that will scale up the compute and even scale the cluster down to zero when you're not running that job.
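To make that declarative idea concrete, here is a minimal sketch (not from the talk; the job name, image, and numbers are made up) of declaring what a step needs rather than how to provision it. With a cluster autoscaler in place, nodes can be added to satisfy these requests and removed again when nothing is running:

```yaml
# Illustrative only: describe the resources a job needs and let the
# scheduler (plus a cluster autoscaler) work out where to run it.
apiVersion: v1
kind: Pod
metadata:
  name: big-batch-job                                    # hypothetical job name
spec:
  restartPolicy: Never
  containers:
    - name: worker
      image: registry.example.com/batch-worker:latest    # hypothetical image
      resources:
        requests:
          cpu: "4"
          memory: 16Gi
        limits:
          cpu: "8"
          memory: 32Gi
```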
And then the final big benefit of Kubernetes is what you get in addition to just the data applications: a bunch of other things in the ecosystem that are built on top of the Kubernetes API and let you integrate easily with CI/CD tooling or observability tooling, if you want to add that for a more production-ready data pipeline.

So here's a quick glimpse of our architecture, and sorry if this is a little small, but this is how we deploy Spark on Kubernetes. We're essentially using the Kubernetes master as the scheduler instead of the old traditional approach of relying completely on Spark. Kubernetes spins up a driver pod and talks to that driver pod, and when there's a request for an executor, Kubernetes says okay, I'll spin up executor pods for you. It then tells the driver pod those pods are ready, and the driver pod communicates with them, shards the work, and spins up the executions. Once that's all done, Kubernetes knows the job was successful and spins all of this down, so you don't have to keep it live.

Now, getting into Argo Workflows. Argo Workflows is a DAG orchestrator, a workflow engine, and it's the best way to run workloads on Kubernetes because it's built for Kubernetes. The deployment is very straightforward; there are only a couple of dependencies that need to be deployed. And it's very generalizable, so you can use it for just about anything on Kubernetes: ETL is really popular, you can spin up training jobs, you can orchestrate deployments, and you can run traditional DevOps and CI/CD workloads with Argo Workflows as well. That's a big benefit, and it's why the community has grown a lot lately, to just over 12,000 stars on GitHub, and you'll find a variety of people in the community using it for any of these reasons or a combination thereof. Argo Workflows uses YAML as the definition, and it also has a Python SDK, so you can define workflows with Python, which is really handy for us data folks (there's a minimal sketch of a workflow manifest a little further down). And I think the biggest takeaway with Argo Workflows is that once you've declared your DAGs and your pipelines, there are also ways to trigger those jobs besides manually, which is what we'll do today: you can use Git to trigger jobs, you can use cron jobs, and you can also do event-based triggering. So this really opens up the possibility of doing real-time workflow orchestration, all on Kubernetes. All right, I'll hand it over to you.

Thanks. So now I'm going to show you our architectural diagram and how we connected everything with Argo. At the beginning you simply submit a new workflow to Argo, and Argo schedules the new workflow using the Kubernetes master, specifically the Kubernetes API, and the scheduler brings up the new workflow. Our new workflow then submits a new Spark job and starts monitoring its state. After that, the Kubernetes master brings up the driver pod for Spark, Spark requests the executor pods, and we get our new executor pods inside the cluster. Spark schedules all the tasks on the executors, and in the meantime the Argo workflow constantly monitors the state; when the execution is done, we get the result, which can be a failed job or a successful job. So that's how, from our diagram, you can schedule a Spark job using Argo and Kubernetes.
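For anyone who hasn't seen one before, here is roughly what a bare-bones Argo Workflow manifest with a two-task DAG looks like. This is a generic sketch, not the demo manifest; the task names, image, and message values are made up:

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: hello-dag-            # Argo appends a random suffix
spec:
  entrypoint: main
  templates:
    - name: main
      dag:
        tasks:
          - name: extract
            template: echo
            arguments:
              parameters: [{name: msg, value: "extracting data"}]
          - name: transform
            dependencies: [extract]   # runs only after "extract" succeeds
            template: echo
            arguments:
              parameters: [{name: msg, value: "transforming data"}]
    - name: echo
      inputs:
        parameters:
          - name: msg
      container:
        image: alpine:3.19
        command: [echo, "{{inputs.parameters.msg}}"]
```

Each task runs as its own pod, which is how the container-per-step model mentioned earlier falls out naturally.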
So everything is contained inside your Container images and you do not need to manage any dependencies or install anything And by using that you are also Well, you do not need to download any dependencies at the runtime. So you can run everything really fast And this also allows you to standardize your environment. So You can use the same configuration on your local machine and in the cloud Also, this also provides you Well, you will use a lower resource because you do not need to install anything inside the kubernetes and Because you're using kubernetes. You can auto scale easily And then you can also Use other frameworks besides Spark so you can create and use similar architecture for other frameworks like desk array And you can extend all your toolings And at the end since You're running everything in container and and using kubernetes. You can add some kind of a log aggregator You can provide metrics and monitor everything really easily And you can also extend your c i cd pipeline So you can bring everything inside argon kubernetes and you can cover Your entire process from the moment you are building your images and submitting chains to the git to the moment when you're running test and running the actual workload And this also allows you to easily move from One cloud vendor to the other because you're using kubernetes and that's not the actual Spark service from the cloud provider And yeah, so Now i'm going to show you a little small demo that we prepared So we are going to create small spark job that analyzes the kegel data set So this data set contains some bike Share ride and we are going to analyze this data set and find the How much is each bike type used and we are going to find out How long is each bike type? Read and yeah bike type can be electric bike regular bike So yeah, that's our spark job. And after that, we are going to show you how Argo manifests our workflow manifest looks like and how are we executing that on kubernetes And at the end we are going to simplify our our workflow manifest by using workflow templates So yeah, i'm running everything on local using local kubernetes cluster We already have installed spark operator and argo workflows and we already have pre-built Images for our jobs So We'll just pull up the live demo So yeah Okay, so yeah, here's our spark job written in scala and Yeah, here you can see how it looks like. So basically we are loading everything from our data set and simply calculating the occurrence Of our bike type and at the end we are simply printing everything up And yeah, and this is how our argo workflow manifest looks like and it's really easy to write one So at the beginning you need to define the name for your workflow and you need to specify Your dag tasks so the jobs that you want to run. So here we have two jobs one for the bike type occurrence and another for the ride length And here we are defining how our Spark job is looking like so since we are using Spark operator we are creating a spark application We are defining Spark job name And yeah, here here we are doing some configurations. So since we are using scala we here specify that Our spark job is written in scala It also can be written in java Python or r and since Our driver Pod and executor pod are inside kubernetes. We are using cluster deployment mode And here we are specifying our Docker image our location of our jar inside container and our main class And yeah, one another important thing is here. 
One other important thing is that here we specify the Spark version, and you can use 2.x or 3.x; it doesn't matter, everything will work flawlessly. At the end we define the resource requirements, so CPU and memory usage. One thing specific to the driver pod is that we need to specify a service account. A service account is the Kubernetes mechanism that lets us authenticate with the Kubernetes API and bring up more pods, so this service account has permission to create pods. That's important, because here we specify how many executor instances we want; in this case two instances, that is, two executor pods. The second job is quite similar to the first; the main difference is this one line, where we use a different main class. And last but not least, since we're using a DAG, it's important to let Argo know what our execution is doing and how it ended, so here we specify the conditions under which our Spark task fails or succeeds; that's how we monitor the actual Spark job.

Now let me go over here. This is the Argo Workflows UI, and I'm going to submit the new workflow. Here you can see a graphical representation of our DAG, and you can see it's quite similar to the YAML representation. Everything is running now, so let me list our pods. You can see that both driver pods are up and that our executor pods are up. Let me list again: the job is completed, and now I'll show the logs for the bike type job, where we count the occurrence of each bike type; you can see the result. And now the second job, where we calculate the ride length for each bike type; you can see that job is successful too.

The last thing I want to show you is how you can simplify your Argo workflow by using workflow templates. Instead of writing this fairly complicated manifest, you can use a workflow template and make your life a lot easier. Everything is quite similar to the previous example; the important thing is that here we simply reference our workflow template and pass parameters to it. You may notice we're passing things like the language we're using and how many resources we need. Now I'll show you what our workflow template looks like. Here it is; you may notice it's quite similar to the manifest I showed you earlier, and the important difference is that we use these parameters instead of actual values. They're all specified here, you can give them default values, and there are different types of templating. This particular template is a more general-purpose template that we use, where we parameterize almost everything; your template can be a lot simpler. And that's about it for the demo.

One quick point on the workflow template: essentially it means you can have one person define the original template, and they can then enable any other user, who doesn't really know Argo well, to simply submit a workflow with the parameters they want for their job, push that to Git, and trigger an Argo Workflows job off of the workflow template, instead of having to write their own Argo Workflows manifest. That's something we see a lot: the team that implements Argo Workflows can then let other folks on the team, who don't necessarily need to know the details, start running workflow jobs based just on the parameters that have been defined.
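To give a flavor of that split (again a sketch with made-up names, not the repo's actual template), the short workflow a teammate submits can be as small as this, with everything else living in a WorkflowTemplate that declares the parameters and their default values:

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: ride-length-
spec:
  # All templates, defaults, and the SparkApplication wrapper live in the
  # referenced WorkflowTemplate (hypothetical name).
  workflowTemplateRef:
    name: spark-job
  arguments:
    parameters:
      - name: main-class
        value: com.example.bikes.RideLength   # hypothetical class
      - name: executor-instances
        value: "3"                             # overrides the template's default
```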
Okay, thanks. Let me just go back to the presentation. So, we saw how you can create the Spark job, we saw how you can use Argo Workflows to execute your Spark job on Kubernetes, and at the end we showed you how you can simplify your Argo workflow manifest by using a workflow template.

Now, the next steps. Since we're already running Spark on Kubernetes, by using Argo we can use cron workflows, so you can run your workflows daily, or on whatever schedule you want, and execute whatever you like on that basis (there's a sketch of a cron-scheduled run at the end of this section). You can also define different node pools, so you can use different types of instances, for example GPU node pools for your machine learning, or node pools with other hardware requirements, and easily schedule your Spark jobs there using the node affinity feature. You can also easily autoscale your cluster: your Spark job comes up, Kubernetes brings up more worker nodes based on what it needs, executes your Spark job, and scales back down to one or zero nodes, and that way you lower your cost. And last but not least, you can extend your entire pipeline: you can use GitOps and keep everything in Git, and you can also trigger your workflows from events. Everything is possible. So thank you.

And on this last part, here's a simple architecture diagram of how you can build on top of this. We already talked about running batch jobs, which is really the example we did before: data appears in the database, that triggers an Argo Workflows job, and it runs a job with Spark. You can then add in other data sources, so you can deploy an API that streams data through Kafka and feed that into Spark or another streaming analytics tool you'd like to use, and again orchestrate that depending on triggers you set up in Argo. And since Argo is monitoring the state of all these jobs, it can essentially trigger any sort of machine learning job you want to do. So you could trigger a training job, for example once the data has been processed, on the same cluster, and once that's done you can do a hyperparameter tuning check, or move on to model deployment and deploy the model behind another API. All of this can run on the same Kubernetes cluster, which makes things a bit easier to manage, and if you then wanted one cluster for dev and one cluster for prod, all of this code is pretty reproducible.

So that wraps it up; a few extra resources. Once again, we have a repo, so you can go check out all the code we ran today. In addition, we included a cron job that uses Python, so feel free to check that out; what's handy there is you can pull down the workflow template that we've parameterized and start using it, for instance. We have a blog that covers more Argo Workflows topics and Spark topics. And a few other things to check out: if you're going to deploy Spark on Kubernetes, check out the Spark Kubernetes operator GitHub repo, and check out the Argo Workflows docs for more details on how you can scale up resources. And that's it. Thanks.
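As referenced in the next steps above, a cron-scheduled run can be expressed as a CronWorkflow that reuses the same workflow template. A minimal sketch, with an assumed template name and a made-up schedule:

```yaml
apiVersion: argoproj.io/v1alpha1
kind: CronWorkflow
metadata:
  name: nightly-bike-analysis          # hypothetical name
spec:
  schedule: "0 2 * * *"                # every night at 02:00
  concurrencyPolicy: Replace           # don't stack runs if one is still going
  workflowSpec:
    workflowTemplateRef:
      name: spark-job                  # the hypothetical template sketched earlier
    arguments:
      parameters:
        - name: main-class
          value: com.example.bikes.BikeTypeOccurrence
```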
We'll be happy to take questions. Yeah, right in front.

So the question is how we control the templating: how do you change the templates and parameters that are available to users, and how is that made available in the source code? Cool, thanks. So basically these parameters are available through environment variables, so you can easily define them, you can easily use workflow manifests to add them, or you can even do it programmatically using the Python SDK for Argo Workflows.

Yeah, right behind. Okay, cool, good question. The question was: what's the advantage of using Argo Workflows versus something like Airflow? We did have a couple of quick points on that. The biggest one is that it's lighter weight to deploy Argo on Kubernetes than Airflow: there are fewer dependencies to manage and less overhead from a DevOps perspective. You're also working completely container-native with Argo Workflows, so by default every step runs as a container, which is just really handy when you're working with Kubernetes. And then there are a lot of other features with Argo; a lot of the dynamic behavior and the parameterization we talked about today is out of the box with Argo, whereas with Airflow it's a heavier lift. I think the biggest challenge is definitely Python: Argo Workflows is a YAML-first approach, but now there are Python SDKs, like the one called Hera, that essentially let you declare your pipelines in Python from a functional perspective, which is handy. If you've used Airflow or other Python-based orchestrators, a lot of the time you have to use decorators to define much of the pipeline, and you actually don't have to do that now with Hera and Argo Workflows. So if you prefer a more functional approach, that's definitely an advantage of using Argo Workflows. Does that make sense? Okay, thanks for the question. Yeah, over there.

The question is: since we're using the Spark Kubernetes operator, what's the value of also using Argo Workflows on top? The big value is that you can now create larger DAGs, larger pipelines. In addition to just using Spark, you can link up Spark with other tooling. So really the advantage is that since you're adopting an orchestrator to run your jobs, you can now orchestrate tooling around that initial Spark job; that's probably the biggest advantage, what you get beyond just the initial Spark deployment. If you just want to deploy one Spark job, you could totally use the Kubernetes operator alone and run that job. If you then want to do a bit more event-based triggering, or link in a GitOps approach, that's where Argo Workflows adds a lot of benefit. And then, more towards the right half of this diagram, as you get closer to deployment, Argo Workflows is essentially like a CI tool; a lot of people describe it that way, where from there you can run a lot of tests, basically your whole CI suite, out of Argo Workflows if you want. So at the end of the day, it's not the simplest out of the gate, but it provides many more options going forward. Yes, another question?
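Picking up the first answer above about parameters being exposed as environment variables: in an Argo template, a workflow parameter can be injected into a step's container environment roughly like this (a sketch; the parameter, image, and variable names are made up):

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: parameterized-run-
spec:
  entrypoint: run-analysis
  arguments:
    parameters:
      - name: dataset-path                             # hypothetical parameter
        value: /data/rides.csv
  templates:
    - name: run-analysis
      container:
        image: registry.example.com/analysis:latest    # hypothetical image
        command: [python, /app/main.py]
        env:
          # The user-facing workflow parameter becomes an ordinary env var
          # inside the step's container.
          - name: DATASET_PATH
            value: "{{workflow.parameters.dataset-path}}"
```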
The question is how Argo Workflows talks to the Kubernetes master and the API. Okay, thanks. So basically Argo is built directly on the Kubernetes API, and using service accounts and the permissions available through them, Argo is able to bring up the entire workflow, monitor everything, and at the end destroy everything when it's no longer needed. And yes, inside the cluster you have the Argo server and the workflow controller; those two services control everything regarding Argo. To add to that: you deploy Argo into something like an argo namespace, and, as Darko mentioned, it has the resource definitions it needs for Kubernetes to be able to talk to the API. That's a big advantage of using Argo: all of that is very straightforward compared to something like Airflow, where there are more dependencies and it isn't built to be Kubernetes-native. Any other questions? Yeah, one more question up front.

Okay, for the first part I'll try to summarize the question: in the past you were assessing Airflow versus Argo Workflows on Kubernetes, you've run into some issues with the overhead of Airflow on Kubernetes, but you're still looking to understand what the development lifecycle looks like with Argo Workflows on Kubernetes, and specifically whether the Argo Workflows UI is where everybody interacts. So, as far as the development lifecycle goes: you can deploy Argo locally, which is really handy, and that deployment will be pretty much the same as what you'll deploy to AWS, Google, or Azure; the big variance is just how granular you want to get in your roles, your RBAC controls, and your security parameters, but for the most part the deployment is quite straightforward. Then, once you're running workflows, all of them will appear in the Argo UI. So if anyone isn't as familiar with YAML or the Kubernetes side of things, they can access workflows there; you'd make that available at a URL they can reach, and they can see the logs and all the workflows that have run. You can also set up workflow archiving to save workflows once they're done. That's primarily where folks who aren't as Kubernetes-oriented would go. The other option is running it out of a Git repo: once people are very familiar with the workflow definitions, either in YAML or in Python like we talked about earlier, you can set up GitOps triggers to use the repo as the source of truth instead of the UI, so on any pull request or merge you can trigger workflows. But yeah, you end up going back to either the UI, or kubectl, or the Argo CLI, to view the status of the workflows you're running. Does that answer your question as far as that goes? And we'd be happy to talk more about larger deployments.

All right, thanks a lot for coming, everybody. Have a great day.
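Related to the earlier answers about service accounts (both the one Argo uses and the one the Spark driver uses to create executor pods), here is a hedged sketch of the kind of RBAC involved; the namespace, resource list, and verbs are assumptions you would adjust to your own setup:

```yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: spark
  namespace: spark-jobs                 # hypothetical namespace
---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: spark-driver
  namespace: spark-jobs
rules:
  # Roughly what a Spark driver needs in order to manage its executors.
  - apiGroups: [""]
    resources: ["pods", "services", "configmaps"]
    verbs: ["create", "get", "list", "watch", "delete"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: spark-driver
  namespace: spark-jobs
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: spark-driver
subjects:
  - kind: ServiceAccount
    name: spark
    namespace: spark-jobs
```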