All right, so my name is JP Zivillich. Today we're going to be talking about CI/CD for data pipelines using Argo Workflows. Cool. So a quick little agenda. I'm going to have to speed through this just so that we stay on time. But here's what we're going to talk about: what is CI/CD for data pipelines, what it solves, and then we're going to do a quick demo.

All right. So the goals for the talk: we're going to show a development setup for CI/CD for data pipelines using Argo Workflows, we're going to understand how to test our workflow templates, and lastly, we're going to cover some strategies for versioning those workflow templates.

So real quick about me: my name is JP Zivillich. I'm the CTO and founder of PipeKit, and also an amateur Harry Potter impersonator. Didn't land quite as well as I was expecting, but OK. At PipeKit, we are a control plane for Argo Workflows, targeted at data teams. So if you're trying to manage several different clusters, let's say a dev, a staging, and a prod cluster, or you have different clusters for different teams, or different clusters for different customers, we help with that use case. Also, we do CI/CD for data pipelines out of the box as part of our SaaS product. But within this talk, we're just going to be using the open source.

Cool. So without further ado, what is CI/CD for data pipelines? Could I get a quick show of hands: who here is doing data pipelines or data processing? Cool. That is fewer hands than I was expecting to see. Can I get another show of hands: who here is doing automated testing on parts of those pipelines before pushing to prod? Kind of. All right, that's what I thought. It's not that common, even though a lot of people are doing data pipelines. Most of the time, we're just pushing maybe to a staging environment, and then maybe to a prod environment, and really hoping for the best.
So really, we're taking our data transformations, just pushing them to the higher environments, and fingers crossed. Ideally, we should be validating these transforms when we're making pull requests or pushing changes. Also, rollbacks are currently pretty difficult. It's hard to go from, hey, this is the version that's currently in my prod environment or my staging environment, we introduced a bug with this latest change, let's roll back to a previous version. Especially if you have a big team and you're trying to manage a bunch of different people pushing to the same clusters all at once.

So a lot of forward thinkers in the community are applying the CI/CD concepts that we're seeing in traditional software engineering to data pipelines. They're factoring out their transformations and other critical pieces into components, they're running tests on these components during pull requests and other events, and they're also versioning these components using semantic versioning. And this solves two big problems. One is money, and the other is time, or you can flip those depending on which you think is most important. The money one is wasted cloud spend. Cloud spend is pretty intense. I know a lot of people have pretty high AWS, GCP, and Azure budgets, and data pipelines can really rack up those budgets. They take everywhere from several minutes to over a month for some customers we've spoken to. Also, if you're running expensive GPUs, that's another thing. And then data scientist time. Let's say you push a change and there's a bug. You want to learn about that within the dev lifecycle, before actually pushing that change to staging or prod.

Cool. So I'd assume that everyone here knows what workflow templates and cluster workflow templates are if you're using Argo Workflows heavily. But really, what they are is Argo Workflows' Floo Powder from Harry Potter: Argo Workflows' native reuse component.
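As a concrete illustration, a minimal reuse component for a repeated task might look roughly like this; the resource name, image, and parameters here are illustrative sketches, not taken from the talk:

```yaml
# Hypothetical WorkflowTemplate wrapping a common task (cloning a repo)
# so other workflows can call it instead of repeating the step.
apiVersion: argoproj.io/v1alpha1
kind: WorkflowTemplate
metadata:
  name: git-clone
spec:
  templates:
    - name: clone
      inputs:
        parameters:
          - name: repo
          - name: revision
            value: main          # default branch if the caller passes nothing
      container:
        image: alpine/git:2.43.0
        command: [sh, -c]
        args: ["git clone --branch {{inputs.parameters.revision}} {{inputs.parameters.repo}} /work/src"]
```

Other workflows would then invoke the `clone` template through a `templateRef` rather than copying the step into each pipeline.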
So let's say you have a part of a workflow and you want to refactor it into a piece that you can reuse over and over and over again. That's what you're going to use a workflow template for. Now, why would you want to do this? Really, you just don't want to be repeating yourself all the time. So one workflow template that we're going to use a little bit later is one that just clones down a Git repo. That's a common task you're going to have to do in a lot of different workflows, so it makes sense to refactor it into a workflow template. Also, if you're doing data transformations, those are another good candidate. So yeah: data transforms, and setting up and tearing down Kubernetes resources like Dask and Spark deployments. If you're interested in how to do that, stick around for my co-founder Kalen's talk coming up next. And I also gave a talk at last ArgoCon on Dask deployments and utilities.

So, testing these components. Now, a workflow template is really just a function. It has inputs and it has outputs, so we can test those inputs and outputs. And if we want to do that, we want to make sure that these workflow templates, these components, are pure functions, meaning that for a given set of inputs, you get the same outputs. So anything that's non-deterministic or random is going to cause a problem, and so are side effects in the function. But this unlocks testing for us, which is great.

Real quick, before I get into the demo, I want to talk about semantic versioning. Quick show of thumbs: is everyone familiar with what semantic versioning is? Yes, love to see that. So semantic versions are great to have, but they're not available in vanilla Argo Workflows. Docker images can be semantically versioned, but workflow templates cannot be. Now, we've seen two ways of implementing that with workflow templates and Argo Workflows.
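Putting those two ideas together, a pure-function template carrying a version might be sketched like this. The label key is an assumption of mine (vanilla Argo Workflows defines no versioning convention), and the template body is a guess at what a "doubler" could look like:

```yaml
# Hypothetical "doubler" WorkflowTemplate: a pure function of its input,
# so the same input always produces the same output, which makes it testable.
apiVersion: argoproj.io/v1alpha1
kind: WorkflowTemplate
metadata:
  name: doubler
  labels:
    # assumed label key for carrying a semantic version;
    # Argo Workflows itself defines no such convention
    pipelines.example.com/version: "1.0.0"
spec:
  templates:
    - name: double
      inputs:
        parameters:
          - name: x
      script:
        image: python:3.11-alpine
        command: [python]
        source: |
          # the parameter is substituted in before the script runs;
          # whatever is printed to stdout becomes the step's outputs.result
          print({{inputs.parameters.x}} * 2)
```

Because the template has no side effects and no randomness, calling `double` with `x: 2` should always produce `4`, which is exactly the property the assertions rely on.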
The first, which is what we do internally and with our SaaS platform, is appending the version to the name with dashes, i.e. doing template-12-3-9. Or we've seen it done with labels and annotations. Shout out to our friends at Bloomberg.

All right, so now we're going to go through a quick example of doing that CI step, where we're going to push a change to a workflow template in a PR, and then we're going to use Argo Events and Argo Workflows to test it out. And we're going to do it fast. Here's a quick architecture diagram that I don't have too much time to talk about. But you can see that I'm taking a GitHub event, in this case a pull request event. The Argo Events deployment is set up with an event source to read the webhook payload and load it onto the event bus, and then the sensor is going to spin up an Argo workflow. That parent workflow is going to clone down the repo, apply the workflow template, and then spin up three assertion workflows to run that template. And I'll get into why we're doing it as multiple workflows rather than just one big workflow in two slides.

So with Argo Events, we're doing a one-time configuration of that event source to connect to the webhook. And then the sensor is the star player, the point guard in this setup: it takes that webhook, maps it to an Argo workflow, and also extracts some information from the payload. Cool.

So the Argo workflow in that sensor is doing some really cool stuff. What it's going to do is clone down that workflow template. In this case, it's going to be a quick doubler, a workflow template that doubles inputs. And then it's going to apply that workflow template to our cluster.
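The sensor wiring just described could be sketched roughly like this. The event source name, trigger fields, and payload keys are my assumptions about a typical GitHub setup, not the talk's actual manifests:

```yaml
# Hypothetical Sensor: maps a GitHub pull-request webhook event
# to a test Workflow, extracting the commit SHA from the payload.
apiVersion: argoproj.io/v1alpha1
kind: Sensor
metadata:
  name: pr-template-tests
spec:
  dependencies:
    - name: github-pr
      eventSourceName: github        # the one-time event source configuration
      eventName: pull-request
  triggers:
    - template:
        name: run-template-tests
        argoWorkflow:
          operation: submit          # submit a new Workflow per matching event
          source:
            resource:
              apiVersion: argoproj.io/v1alpha1
              kind: Workflow
              metadata:
                generateName: template-tests-
              spec:
                arguments:
                  parameters:
                    - name: revision
                      value: ""      # overwritten from the payload below
                workflowTemplateRef:
                  name: template-test-runner   # assumed name of the parent workflow
          parameters:
            - src:
                dependencyName: github-pr
                dataKey: body.pull_request.head.sha   # pull the commit SHA out of the webhook body
              dest: spec.arguments.parameters.0.value
```

The `parameters` section at the bottom is what does the "extracting some information from the payload" step: it copies a field from the webhook body into the submitted workflow's arguments.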
And then once it has applied it, it's going to spin up an additional set of workflows so that it can run the workflow template we've just applied several times with given inputs and outputs. And the reason that has to happen in the workflow-of-workflows pattern is because at compile time, Argo Workflows checks a workflow to validate that each template it calls is present. So if you're making modifications to your templates within the course of a workflow, you actually have to spin up other workflows for those modifications to take effect. I found that out when I was making this demo, and that was a fun thing to do.

All right, so let's rock through this demo. I'm going to have to do it, I think, at two times speed, so hopefully that's not too bad. What I'm going to do quickly is make a change in a pull request, just a quick README update, and push that commit. But this pull request also contains a workflow template called doubler, which takes an input and multiplies it by two. So if we have 2, it's going to give us 4; 4 gives us 8; and 0 gives us 0. Those are the test cases that we're going to be running on. So you see that we have that pull request, and the commit just got updated. And within our Argo Workflows instance, we have a workflow that was created using Argo Events. Cool. And the UI is being a UI. Refreshing. Cool. Now we have the workflow that is running, and we can see that the first step is checking out the GitHub PR. This is going to clone down the code and store it within a volume. I'm going to have to keep this going extra fast. Then it's going to get the template and apply the template. From that template, it then spins up the three assertions that we've specified using withItems, which I'll show in just a minute. Cool. So those run successfully, and I will pause that. So we should be good. On that part, we successfully ran some assertions. I know that was quick.
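The assertion fan-out described here could be sketched roughly like this. The step names, images, and comparison logic are my assumptions; the talk only shows the inputs and expected outputs:

```yaml
# Hypothetical assertion workflow: runs the just-applied doubler template
# once per test case via withItems and fails any case whose output mismatches.
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: doubler-assertions-
spec:
  entrypoint: main
  templates:
    - name: main
      steps:
        - - name: case
            template: run-case
            arguments:
              parameters:
                - name: input
                  value: "{{item.input}}"
                - name: expected
                  value: "{{item.expected}}"
            withItems:                      # the demo's input/expected pairs
              - { input: 2, expected: 4 }
              - { input: 4, expected: 8 }
              - { input: 0, expected: 0 }

    - name: run-case
      inputs:
        parameters:
          - name: input
          - name: expected
      steps:
        - - name: run
            templateRef:
              name: doubler                 # the template the parent just applied
              template: double              # assumed template name inside it
            arguments:
              parameters:
                - name: x
                  value: "{{inputs.parameters.input}}"
        - - name: check
            template: assert-equal
            arguments:
              parameters:
                - name: actual
                  value: "{{steps.run.outputs.result}}"
                - name: expected
                  value: "{{inputs.parameters.expected}}"

    - name: assert-equal
      inputs:
        parameters:
          - name: actual
          - name: expected
      script:
        image: python:3.11-alpine
        command: [python]
        source: |
          # exit nonzero (failing the step) when actual != expected
          import sys
          sys.exit(0 if "{{inputs.parameters.actual}}" == "{{inputs.parameters.expected}}" else 1)
```

Because this workflow references the doubler through a `templateRef`, it can only be submitted after the parent workflow has applied the template to the cluster, which is exactly why the multi-workflow pattern is needed.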
Trying to be respectful of time. Lightning talk. Bing, bang, boom. So now what we're going to do is push a breaking change to that workflow template. You're going to see me pop open Vim and modify that workflow template so that instead of multiplying by two, it's going to add two to every input. So I'll go into this doubler template and unfold it, and we'll see where the multiply is. I'll just switch that to an add. This is going to simulate a breaking change to a data transformation. Then we're going to push this guy one more time. And I've got to keep it going fast. I don't know why it keeps going to slow speed. Anyway, so we've pushed it. It's cloning the PR. It's applying the template. And then it's spinning up the other cases. Cool. And we can see that these are running, and we should get an output of two failures and one success. Whoops. Struggling with this video. So we can see that one of the cases passed, two plus two is still four, but the other two failed. And then lastly, if I go to our setup here, we can see the test cases that we did this with: the withItems, where we specified our inputs and our expected outputs. So we didn't have a chance to cover all the versioning and whatnot and show an example of that. But this is how you can specify a set of inputs and outputs and validate all of your workflow templates using just Argo Events and Argo Workflows. Again, that's all the time I've got. My name's JP Zivillich. If you all want to chat a little further about this or about some of the stuff we do, come holler. Thank you so much.