Let's hop into managing artifacts at scale with Argo. I'm Kaelin; my co-presenter Julie couldn't make it, but she'll be joining virtually.

So, goals for the talk. Today we're going to first cover some basics: what artifacts you typically use with Argo Workflows and why, what artifact storage options we have at hand, and the scenarios where you'd pick one storage option over another. Then we'll go through the key features Argo Workflows offers for scaling up artifact management, and cover a few of the common missteps folks run into when setting up artifacts for scale.

As I mentioned, I'm Kaelin. I'm the co-founder and CEO of Pipekit. We help data teams scale up Argo Workflows for their data workloads: we provide a SaaS control plane that helps them maintain Argo Workflows at a lower cost, and we also provide professional support. We're contributors on the Argo Workflows project, so you'll see me and my colleagues often in Slack and on GitHub. Don't be shy, feel free to say hi. Julie is a staff software engineer at Intuit. You're probably familiar with Intuit, the creators and backers of the Argo project. She's a maintainer on the Argo Workflows project, and she'll be hopping in over video to share a few of the features we're going to talk about today.

Some quick background, just to make sure we all know what we're talking about. Artifacts cover several different data types depending on the use case. For data processing we're typically talking about tabular data, image and video files, maybe some vector data. For CI it looks very different: typically Git repos, Dockerfiles and the like. And machine learning has some specialized data types: training datasets, trained models and feature stores.

As for the data we're talking about today, we really want to zoom in on how persistent it is, because that really affects how you manage it throughout the Argo workflow. Transient data is data you don't care about beyond the step it's used in; it's rarely needed beyond the life of the workflow. As data gets more persistent, we start caring more about how to archive it and how to search for and find it. At the semi-persistent level we might not be too worried if it's lost, because we can easily reinstate it (Dockerfile caches, for example). Then there's persistent data that we definitely want to keep, typically the fully processed output of a workflow. We'll keep that spectrum in mind as we look at today's workflows and how we want to manage their artifacts.

As for storage types, there are three main ones to cover today, and we'll focus on the first: blob storage. Each has pros and cons, and it's important to keep them in mind when you're designing your workflow and how it reads files in and out of artifact storage. Blob storage is the familiar stuff everybody knows: S3, GCS, maybe MinIO. The big benefit is that it's easily queryable in the archive and easily visible in the Argo Workflows UI, which is a big plus, as we'll see later when we demo some of these workflows.
But it does have some downsides: it's a bit slower, because the data gets tarred between workflow steps, which takes some extra resources, and there's not as much ReadWriteMany compatibility, so you might want to consider that when you're designing workflows. Block storage is an alternative. We actually don't see a ton of users using it in the workflows community, but it's worth considering for its slight performance advantages over blob. And then there are network file systems, where the big benefits are ReadWriteMany compatibility and scaling well for high-throughput needs. If you're interested in that, there's a lightning talk a little later comparing S3- and NFS-based artifact setups, so definitely tune into that.

As for the problem we're trying to solve today, it's really just: how do we pass data between steps in our workflows? We have a few different tools in the chest for that within Argo Workflows. People are probably familiar with parameters; this is the most basic way to pass information between steps in a workflow. The biggest challenge is that they're very limited in size. As your workflows grow in length or parallelism, or as you run more and more workflows on your cluster, you can hit size limits with the amount of data you're passing in parameters or script outputs. So even as you're starting out, it's worth thinking about how you want to set up your artifact repos or your NFS to handle that.

On the blob storage side, yes, you get more persistent artifacts. They're a good fit for data processing, especially for versioning, and for trained models in the ML space. Just remember that artifacts get packaged between steps, so you'll lose a little speed, but you may save some money thanks to the archiving benefits. And lastly, with network file systems the big benefit is ReadWriteMany compatibility. If you'll be running a lot of concurrent workflows, that's worth keeping in mind, and if a step in the workflow processes a lot of files at once, you'll save some runtime.

So before you start passing artifacts around, we're going to cover some of the features that can help you solve problems at scale. We'll cover four: centralizing artifact repository configs; implementing good naming conventions that help you scale up and help the team find their artifacts; managing small versus large artifacts; and a newer Argo Workflows feature called artifact garbage collection. What I'm going to do now is hand it over to Julie to talk about centralizing artifact configs. Let me just hand it over to you, Julie.

So if we look at this example workflow, we can see just how verbose it can get. You've got a bunch of different input and output artifacts, and you're specifying all this information about where to put each artifact or where to retrieve it from. For an S3 artifact: which bucket it's in, what the endpoint URL is, what the secret credentials are. You can imagine how redundant that gets, both within a workflow and across your workflows, when a lot of that information is common.
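(The slide itself isn't reproduced in this transcript, but a fully inline S3 artifact configuration of the kind Julie is describing looks roughly like this sketch; the bucket, endpoint and secret names are placeholder assumptions.)

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: verbose-artifacts-
spec:
  entrypoint: main
  templates:
    - name: main
      inputs:
        artifacts:
          - name: input-data
            path: /tmp/input.csv
            s3:
              endpoint: s3.amazonaws.com        # all of this repeats...
              bucket: my-team-bucket
              region: us-east-1
              key: inputs/input.csv
              accessKeySecret:
                name: my-s3-credentials
                key: accessKey
              secretKeySecret:
                name: my-s3-credentials
                key: secretKey
      container:
        image: alpine:3.19
        command: [sh, -c, "cp /tmp/input.csv /tmp/output.csv"]
      outputs:
        artifacts:
          - name: output-data
            path: /tmp/output.csv
            s3:
              endpoint: s3.amazonaws.com        # ...for every artifact, in every workflow
              bucket: my-team-bucket
              region: us-east-1
              key: outputs/output.csv
              accessKeySecret:
                name: my-s3-credentials
                key: accessKey
              secretKeySecret:
                name: my-s3-credentials
                key: secretKey
```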
And if you need to change it, you don't want to have to change it everywhere at once. So what if we could consolidate all of that down to something very small and move all of that common configuration out? Now our S3 information is really just the key itself, because that's the thing that's actually likely to differ between workflows, and between artifacts within a workflow. All the other information moves somewhere else. This is actually a concept called key-only artifacts.

Okay, so where do we move that information to? Basically, to a ConfigMap. You can have one or more ConfigMaps that can be referenced from your workflow, and a ConfigMap can hold multiple keys, where each one is a distinct repository configuration. From your workflow, a section called artifactRepositoryRef is where you say: here's the ConfigMap I want to reference, and here's the key within it. All of the artifacts defined within that workflow will then use that repository by default. If one of those artifacts should use some other S3 bucket or repository, you can still specify that within the workflow as well; it's just that anything without its own specification uses the referenced repository by default. And there's a particular ConfigMap name, artifact-repositories, that you can create to act as the default: any workflow that doesn't define the repository information itself, via artifactRepositoryRef or within the artifacts, will use whatever is defined there.

Awesome, thanks Julie for coming through over video. So again, the big takeaway here is that you have a few different options for how to set and centralize your artifact repositories. Remember that you can have a default if you want, but you can also expand that into configuring multiple repositories, and you don't have to duplicate that code in your workflow definitions.

The next point we want to talk about for scale is how you name artifacts. Ultimately, when you're working with a lot of workflows and a lot of users, often doing data processing, maybe machine learning, or CI, you need to enable users to find the outputs of those workflows, or to get in between steps and debug a workflow based on whatever processing was applied to a given artifact. At this point we definitely recommend using parameters in your artifact keys, specifically the workflow UID: if you have concurrent workflows running with the same name in the same namespace, users can end up overwriting each other's artifacts, and this is a great way to avoid that. In addition, to help users find artifacts after a workflow runs, we recommend parameters like these to make it easy and logical to search through S3 and pick out the artifact you're looking for. One thing that's less helpful, if you have many users running workflows concurrently, is a timestamp, because everything will be stamped with roughly the same time. So the takeaway here is: think through how you name the artifacts in your repo to help your end users at scale.
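(As a rough sketch of how those pieces fit together: the bucket, secret and repository key names are placeholders, while the special artifact-repositories ConfigMap name and its default-repository annotation come from the Argo Workflows docs. The output key also shows the UID-based naming convention.)

```yaml
# A ConfigMap holding one or more named repository configurations.
# The special name "artifact-repositories" plus the annotation below
# makes "default-v1" the namespace-wide default repository.
apiVersion: v1
kind: ConfigMap
metadata:
  name: artifact-repositories
  annotations:
    workflows.argoproj.io/default-artifact-repository: default-v1
data:
  default-v1: |
    s3:
      endpoint: s3.amazonaws.com
      bucket: my-team-bucket
      region: us-east-1
      accessKeySecret:
        name: my-s3-credentials
        key: accessKey
      secretKeySecret:
        name: my-s3-credentials
        key: secretKey
---
# A workflow that references that repository, so each artifact only
# needs a key (a "key-only" artifact). Using the workflow UID in the
# key keeps concurrent runs from overwriting each other.
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: central-config-
spec:
  entrypoint: main
  artifactRepositoryRef:
    configMap: artifact-repositories   # optional; this name is the default
    key: default-v1
  templates:
    - name: main
      container:
        image: alpine:3.19
        command: [sh, -c, "echo done > /tmp/result.json"]
      outputs:
        artifacts:
          - name: result
            path: /tmp/result.json
            s3:
              key: "{{workflow.name}}/{{workflow.uid}}/result.json"
```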
Next, we want to get into provisioning resources so that processing artifacts of different sizes doesn't cause issues and kill your workflows. The important feature to know here is podSpecPatch, which lets you override resources at the template level to accommodate a larger artifact, or to give a step that consolidates a bunch of artifacts extra resources so it finishes quicker. You do that right in the workflow definition, and you can provision resources for the init container that spins up to fetch the file, so you don't run into some out-of-memory issue that kills your workflow. As a reminder, definitely provision enough resources for the executor as well. Even if you're using podSpecPatch, the whole workflow, artifact handling included, needs enough resources; a lot of folks end up having to come back to the workflow-controller ConfigMap to fix that.

Lastly, when it comes to artifact scale and size, you might consider adjusting your archiving strategy, which is also something you can set in the workflow definition. You can turn off compression entirely, so things aren't tarred and zipped at the end of a step, which can be handy for some CI use cases. Or you can compress even further than the default by raising the gzip compression level, if you're trying to squeeze, say, a big text file down as much as you can. At scale, we do see this helping you save some costs if you're aggregating a lot of artifacts in your repository.

Next we'll talk about garbage collection for artifacts, which is a newer feature in Workflows 3.4, and I'll hand it back to Julie to dive into this.

So the problem we're trying to address is that users have traditionally needed to manually delete their artifacts to avoid incurring storage costs, or maybe their admin sets up some TTL-type policy in the cloud storage, but then it's controlled by the admin and not by the user. So we developed something called artifact garbage collection. Basically, you can configure your workflow to automatically delete artifacts either when the workflow completes or when the workflow is deleted. Currently this is implemented for S3, GCS and Azure, but for other artifact types we'd be very happy if people want to contribute pull requests: there's an artifact driver interface with a delete method that can be implemented. Under the hood, we perform the deletions through pods that run in the user's namespace.

This is what it looks like when you update your workflow spec to include the deletion policies. You can specify the policy at the workflow level, at the artifact level, or both; the artifact level basically overrides whatever is set at the workflow level. You can imagine that a workflow has temporary artifacts being passed between steps, and then artifacts you want to keep around after the fact. So how do we express that within our workflow? One way is a policy at the top level that says: when this workflow is deleted, I want most of the artifacts in it deleted as well. That applies to any artifact that doesn't override the policy. But for an artifact we don't want to delete, we can override it with the Never strategy.

Okay, and the pods doing the garbage collection will need to be able to access that backing storage. So you'll probably want to use a service account that enables them to perform the deletion, or if you're using AWS, you can use a role-ARN annotation. That can also be overridden for an individual artifact.
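(Pulling those pieces together, a sketch of what the spec changes look like: the service account and IAM role here are hypothetical stand-ins, and on AWS the eks.amazonaws.com/role-arn annotation is the IRSA-style route Julie mentions. The strategy values OnWorkflowDeletion, OnWorkflowCompletion and Never come from the feature itself.)

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: artifact-gc-
spec:
  entrypoint: main
  artifactGC:
    strategy: OnWorkflowDeletion        # or OnWorkflowCompletion
    serviceAccountName: artifact-gc-sa  # hypothetical SA the GC pods run as
    podMetadata:
      annotations:
        # hypothetical IAM role granting delete access to the bucket
        eks.amazonaws.com/role-arn: arn:aws:iam::111122223333:role/artifact-gc
  templates:
    - name: main
      container:
        image: alpine:3.19
        command: [sh, -c, "echo tmp > /tmp/scratch.txt; echo final > /tmp/keep.txt"]
      outputs:
        artifacts:
          - name: scratch
            path: /tmp/scratch.txt      # inherits OnWorkflowDeletion from the workflow level
          - name: keep
            path: /tmp/keep.txt
            artifactGC:
              strategy: Never           # artifact-level override: never delete this one
```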
Awesome, thanks, Julie. So that covers all the features we wanted to share for managing your artifacts at scale. Now we're going to try our luck with a couple of live demos. We'll show a workflow that does a simple fan-out, fan-in example and highlight the artifact management configurations we're using in it, and then we'll do a CI demo where we build and deploy a web app with Argo CD. All of this is available on GitHub: if you check out the Pipekit GitHub org, you can find our talk demos there and pull down these examples to suit your needs.

First we'll hop over and check out the config. We're using the same ConfigMap for both workflows. As you can see, we've defined the resources the executor needs so we won't run out of them, and we've set up the artifact repository using MinIO and some local, very "secure" creds, so don't use this in prod.

Now we'll look at the fan-out, fan-in example. This is going to fan out into 20 parallel steps. In this workflow, we actually define an artifact GC strategy: when the workflow is deleted, all the artifacts will be deleted. Moving down the workflow, you can see we use the workflow UID in the artifact key; again, if we run this a couple of times in parallel, we won't be overwriting any artifacts. Further down there's a reduce step, and we wanted to speed that step up, so we use podSpecPatch to add some resources and make sure the reduce happens faster. And finally, on the output artifact down at the bottom, we did an override to make sure the garbage collection strategy doesn't kick in, so we keep all the outputs of this workflow.

So let me submit this workflow to Argo. And here we have it: the workflow is running, and we're just waiting for it to calculate the fan-out. There it goes. Now it's running a Python script for each fan-out step. And as you can see, the Argo UI now has artifact visualization: if you click on an artifact, you'll see it populate once it's processed. Here it's a basic JSON object, so you get formatted JSON, but this could also be an image file or an HTML file if you'd like. Again, this helps end users figure out what's being created, and when, as their workflow runs. It only works with S3, GCS and Azure at the moment, but as Julie mentioned, PRs are definitely welcome and we'll help you out with that. Now the parallelization is kicking in and the objects are still processing, so we'll give it a couple more seconds. And now it's hitting the reduce step, running a reduce function in Python; since we added more resources, that finished a bit faster. So that ran our workflow right there: just an example of how you'd use Argo Workflows to run batch processing, fan out into a bunch of parallel jobs, and then reduce back to an output.
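(A condensed sketch of that workflow's shape, not the exact demo code; the real files live in the Pipekit GitHub org, and the image, shard count and keys here are placeholders. It combines the features just shown: a workflow-level GC strategy, UID-based keys, podSpecPatch on the reduce step, and a Never override on the final output.)

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: fan-out-fan-in-
spec:
  entrypoint: main
  artifactGC:
    strategy: OnWorkflowDeletion            # clean up intermediates with the workflow
  templates:
    - name: main
      dag:
        tasks:
          - name: map
            template: process
            withSequence:
              count: "20"                   # fan out into 20 parallel tasks
            arguments:
              parameters:
                - name: shard
                  value: "{{item}}"
          - name: reduce
            template: reduce
            depends: map
    - name: process
      inputs:
        parameters:
          - name: shard
      container:
        image: python:3.12-slim
        command: [python, -c]
        args: ["import json; json.dump({'shard': {{inputs.parameters.shard}}}, open('/tmp/part.json', 'w'))"]
      outputs:
        artifacts:
          - name: part
            path: /tmp/part.json
            archive:
              none: {}                      # store raw JSON so the UI can render it
            s3:
              key: "{{workflow.uid}}/parts/{{inputs.parameters.shard}}.json"
    - name: reduce
      podSpecPatch: |                       # extra resources so the reduce finishes faster
        containers:
          - name: main
            resources:
              requests:
                cpu: "1"
                memory: 1Gi
      inputs:
        artifacts:
          - name: parts
            path: /tmp/parts                # downloads the whole key prefix as a directory
            s3:
              key: "{{workflow.uid}}/parts"
      container:
        image: python:3.12-slim
        command: [python, -c]
        args: ["import glob, json; json.dump([json.load(open(p)) for p in glob.glob('/tmp/parts/*.json')], open('/tmp/final.json', 'w'))"]
      outputs:
        artifacts:
          - name: final
            path: /tmp/final.json
            s3:
              key: "{{workflow.uid}}/final.json"
            artifactGC:
              strategy: Never               # keep the reduced output after deletion
```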
I also mentioned that we have a CI example, so I'll talk about this really quickly. This is an example of using MinIO S3 as the artifact repo for a CI job and then deploying an app with Argo CD. You can see that all in this definition, and we'll run it as a workflow template. I'm just going to kick it off while we talk through the file, since this one takes a little longer to run.

So it's kicked off, and we're starting the git checkout step. The tarball artifact shows up here now, so you can see at what point artifacts are created throughout the workflow. Since this one's zipped, it won't display inline, but if you did want a file here that formatted some sort of output or test object, that's totally possible. While it builds, let's take another look at the workflow manifest. In here we specified an artifact GC strategy, so we keep all these artifacts after every run until we delete the workflow. It's still building, so we'll give it some time; this usually takes maybe a minute. Other than that artifact GC strategy, we didn't provide any other configs in this workflow, but you could add podSpecPatch, maybe for the container build step to speed it up, and you could even skip zipping some of the artifacts. Those are steps you could take in the future if you'd like. And now it's deploying, so in a second we'll see if we can pull up our website. Just as a reminder, all of this is on GitHub, so you can definitely check it out, adjust the workflow however you like, and run it for your own CI jobs; just visit our Pipekit org to find it.

All right, it looks like we've deployed. Let's see if it's actually there. Yep, hello world, there we go. I'm glad the live demo went all right; it's always a bit of a risk. As far as next steps go, like I mentioned, check out the repo and hit us up if you have any questions. And there's an upcoming lightning talk on configuring volumes for parallel workloads with my teammate Tim and Lukonde from AWS; come back in about 30 minutes and you'll get to see it. If you're interested in artifact management beyond just blob storage, it's definitely worth a stop. And that's about it, I'll wrap it up there. Any questions?

Questions, anyone? Hi, thanks for a great presentation. A question about separation between different workflows and artifacts: does Argo provide some way to make sure that artifacts created by a certain workflow cannot be accessed by other workflows, and also a way to verify that they have not been tampered with? So can I trust, between steps, that what I created is what I receive in the next step?

Yep, so that gets into your key strategy and how you pass parameters in your artifact keys. Between steps, you'll want to define a name for the artifact, pass that name into the key via a parameter, and then the next step in your workflow, or even another workflow, can reference exactly that artifact. Happy to give you a further demo of that, and it's in the repo as well, so definitely check that out. If you have any other questions for Kaelin, please find him, because we're kind of running out of time. So thanks. Thanks. Thanks, everybody.
Thank you.