So today we're going to be talking about leveraging Argo Workflows templates within a data platform. We're going to be comparing and contrasting our two companies' data platforms and seeing how they work under the hood. So, my name is JP Zivillich. I'm the CTO and founder of PipeKit. We are a SaaS control plane for data teams that allows users to manage Argo Workflows. And I'm joined by...

I'm Yao Lin. I come from Bloomberg, where I work as a platform software engineer. Our team manages Argo Workflows as a service for our internal teams.

All right. So in order to get the most out of this talk, you'll need a basic familiarity with Argo Workflows. We'll be using a couple of different terms. The first is a workflow: a series of steps that either orchestrate different units of work or execute a unit of work within a pod on Kubernetes. The second is a workflow template, which is a way to factor out sections of a workflow and reuse them across multiple workflows; it's scoped to a given namespace. And then a cluster workflow template is the same thing, but scoped to a cluster. Very obvious naming, which is good, keeping with Dutch stereotypes, I guess.

And so we're setting out to solve the following challenges. Our main problem is that managing those templates across multiple clusters and environments is very hard, even though templates by nature do a good job of reuse. It's our responsibility to set up a platform for user teams, so here come the questions: how are resources managed well across clusters? How are resources created, persisted, and cleaned up? And how do user teams track their versions? Although we face different user profiles, we see the following essential concepts in common. First, don't repeat yourself; that's handled by the nature of templates already.
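To make the terms above concrete, here is a minimal sketch of the two resource kinds, written as Python dicts rather than the usual YAML manifests; the template and step names are illustrative, not from the talk.

```python
# A WorkflowTemplate factors out a reusable unit of work.
# (Swap kind for ClusterWorkflowTemplate to scope it to the cluster.)
workflow_template = {
    "apiVersion": "argoproj.io/v1alpha1",
    "kind": "WorkflowTemplate",
    "metadata": {"name": "hello-template", "namespace": "data-team"},
    "spec": {
        "templates": [{
            "name": "say-hello",
            "container": {"image": "alpine:3.19",
                          "command": ["echo", "hello"]},
        }],
    },
}

# A Workflow reuses the factored-out template via templateRef,
# which is how "don't repeat yourself" is handled.
workflow = {
    "apiVersion": "argoproj.io/v1alpha1",
    "kind": "Workflow",
    "metadata": {"generateName": "hello-"},
    "spec": {
        "entrypoint": "main",
        "templates": [{
            "name": "main",
            "steps": [[{
                "name": "call-hello",
                "templateRef": {"name": "hello-template",
                                "template": "say-hello"},
            }]],
        }],
    },
}
```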
Second, logical resource grouping: as some of you may have experienced, templates in a given namespace are managed in a flat way; you don't have groupings. We also need to keep clusters clean, meaning outdated resources can be cleaned away, saving space and maintenance effort. And throughout this process, production security and stability are guarantees we must provide: how do you guard your production environment? Next, we'll walk through the template lifecycle: how a template gets created, how it's maintained, and how it's deleted.

I'll kick off with the setup on our side. We have a UI built on top of the workflow clusters so users can choose their preferred deployment strategies, and that's handled by an admin service running outside the workflow clusters. We have a component that persists the input resources, and we rely on a syncer on each workflow cluster to pull them back down. During this process, we have policies installed in those clusters to guarantee validation.

Now, let's look at the actual steps. First, there's a dry run. The deployment API, the component we just mentioned, takes input from our UI or from GitOps, and can also accept direct user input. It then runs that template input through a dry run against the real clusters to make sure everything looks okay, both syntactically and from an environment perspective. We have two policies here: one validates that the template names carry a valid version, and the other parses that version out and labels the template with it. After everything checks out, the deployment API writes that input into a source-of-truth database, which is a Postgres database. That's the end of the deployment API's mission so far. Then the syncer on each cluster picks up that change and loads the resource onto the cluster. At this stage, validation is no longer needed, since the resources are already trusted.
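The two admission policies just described can be sketched roughly as follows; the naming convention and label key here are stand-ins for illustration, not Bloomberg's actual scheme.

```python
import re

# Policy 1: validate that a template name ends with a version.
# Policy 2: parse the version out and attach it as a label.
NAME_RE = re.compile(r"^(?P<base>[a-z][a-z0-9-]*)-v(?P<version>\d+\.\d+\.\d+)$")
VERSION_LABEL = "example.com/template-version"  # hypothetical label key

def validate_and_label(template: dict) -> dict:
    name = template["metadata"]["name"]
    m = NAME_RE.match(name)
    if m is None:
        raise ValueError(f"template name {name!r} must end with -vX.Y.Z")
    labels = template["metadata"].setdefault("labels", {})
    labels[VERSION_LABEL] = m.group("version")
    return template

tpl = {"metadata": {"name": "etl-daily-v1.2.0"}}
validate_and_label(tpl)
```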
That's also when the actual parsing and labeling happen. Now, let's hand it over to JP to explain their setup.

Thank you, yeah. So we'll be going over the setup of the SaaS platform here at PipeKit. We take a very Git-based approach. First, we have a UI for selecting folders within a GitHub or GitLab repository, and upon clicking submit, each workflow template within that folder is added to the PipeKit control plane. GitHub and GitLab each have an app mechanism that gives the administrator fine-grained control over repository permissions. The same cluster administrators then install the PipeKit agent, a Kubernetes deployment, in the clusters they manage, which allows those clusters to be managed by the PipeKit control plane. PipeKit manages Argo Workflows resources on each cluster, and the templates connected through GitHub or GitLab then become available on all of the clusters through the control plane. The PipeKit agent deployment pulls the workflow templates from GitHub or GitLab and adds them to the cluster at workflow creation time.

One thing we wanted to handle gracefully was conflicts when you import a template. Whenever a workflow gets run, it locks in the templates it uses at creation time. That means if you update a template that a workflow is using while it's in flight, that won't change the business logic of the running workflow itself, which is pretty cool.

All right, here's an architecture diagram. You'll see some similarities between our architecture and Bloomberg's. You have a user who can submit through the PipeKit control plane using their browser or the CLI. We have the connection to GitHub or GitLab, and then we have multiple clusters here, cluster A and cluster B, each with the PipeKit agent installed.
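The lock-in-at-creation behavior can be sketched like this; the function and field names are illustrative, not PipeKit's actual code (the real agent works against the Kubernetes API).

```python
import copy

def create_workflow_run(workflow: dict, template_store: dict) -> dict:
    """Snapshot every referenced template into the run record at creation time."""
    pinned = {}
    for step_group in workflow["steps"]:
        for step in step_group:
            ref = step["templateRef"]["name"]
            # Deep-copy so later edits to template_store don't leak into the run.
            pinned[ref] = copy.deepcopy(template_store[ref])
    return {"workflow": workflow, "pinned_templates": pinned}

store = {"etl": {"image": "alpine:3.19", "command": ["echo", "run"]}}
run = create_workflow_run({"steps": [[{"templateRef": {"name": "etl"}}]]}, store)

# Update the template while the workflow is "in flight": the run is unaffected,
# because it locked in the template at creation time.
store["etl"]["image"] = "alpine:3.20"
```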
The agent submits workflows and workflow templates to the Argo server directly, and then takes the resources that get created on the cluster, like logs and other things, and relays that state back to the PipeKit control plane. What's interesting about this setup is that the PipeKit agent doesn't allow any ingress over HTTP: it pulls from a queue set up through Redis and has egress-only connections back to the PipeKit control plane. So if you're very security-minded, that's an interesting feature. Next, I'll give it back to Yao to talk about the usage of Bloomberg's data platform.

Yep. The two main usability highlights for our users are the version history view and the approval required to promote something onto the production tier. We'll take a look at the view first. Here's a mock of how our UI looks. On the main page, templates are grouped by name plus major version, and we show high-level summaries like deployment clusters and version counts. When you click into a detail page, it shows further information, like how many versions you already have, where they're deployed, and the timestamps, of course.

Next, we'll explain a little bit about how we gather that information for the UI. When a resource gets modified, created, or deleted on the cluster, an audit log entry is generated. We rely on Fluent Bit to pick up that information and forward it to a Kafka topic, and it ends up in a Postgres database of job history. We also built a GraphQL layer to serve the select/group-by queries from the UI. Users can also access the source of truth through the deployment API.

Next, we'll look at the promotion steps. We need to ensure our production tier only accepts resources that have been validated and approved, so we ask our users to deploy their resources onto the development tier first.
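The group-by summary feeding the UI can be sketched as a small aggregation; the row shape below is an assumption for illustration, not Bloomberg's actual schema.

```python
from collections import defaultdict

def summarize(rows):
    """Group job-history rows by (template name, major version) and summarize."""
    groups = defaultdict(lambda: {"versions": set(), "clusters": set()})
    for r in rows:
        major = r["version"].split(".")[0]
        key = (r["name"], major)
        groups[key]["versions"].add(r["version"])
        groups[key]["clusters"].add(r["cluster"])
    return {key: {"version_count": len(g["versions"]),
                  "clusters": sorted(g["clusters"])}
            for key, g in groups.items()}

rows = [
    {"name": "etl", "version": "1.0.0", "cluster": "dev-a"},
    {"name": "etl", "version": "1.1.0", "cluster": "dev-b"},
    {"name": "etl", "version": "2.0.0", "cluster": "dev-a"},
]
summary = summarize(rows)
```

In the real system this query runs in the GraphQL layer over Postgres; the sketch only shows the grouping logic.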
And then once everything checks out, they can click the promote button on the UI, which essentially sends a deployment request to the deployment API, and we generate an approval process first. During this process, we try to encourage our users to use templates as much as possible rather than customized resources. Custom resources are allowed, but we don't encourage them; we encourage templates by providing a fast track for templated resources. Once the approval passes, the deployment API essentially copies the record from the development tier, creates one for the production tier, and relies on the syncer again to pick it up. Next, JP.

Awesome. So now I'll talk about PipeKit's usage and how to use the platform. When we were designing it, we had a couple of goals in mind. The first was to avoid changing the open-source workflow specification: we didn't want anything custom that would prevent a company from lifting and shifting the workflows they already run onto the PipeKit control plane. You should be able to just take it, use it, and it works. The second was to allow users to select the Git tag, commit, branch, et cetera, for the corresponding workflow template definitions. These are kind of two conflicting goals: how do you avoid modifying the workflow template or the workflow while still adding a bonus feature? So shout-out to one of the engineers on our team, Phillip, who came up with an interesting solution where we use metadata to override the Git reference. We can specify a custom PipeKit label when invoking a template (not when defining it), and that supports Git tags, commits, branches, et cetera. You'll see the highlighted section on the right: we've got a workflow, and there's the metadata section under the template invocation. We specify metadata labels, and you'll see the pipekit.io label with the version number.
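The metadata override can be sketched as follows. The exact pipekit.io label key wasn't spelled out in the talk, so the key below is a stand-in, and the function is illustrative rather than PipeKit's actual code.

```python
VERSION_LABEL = "pipekit.io/template-version"  # stand-in label key

def resolve_git_ref(workflow: dict, default: str = "main") -> str:
    """Return the Git tag/commit/branch the control plane should pull
    the workflow template from, falling back to a default ref."""
    labels = workflow.get("metadata", {}).get("labels", {})
    return labels.get(VERSION_LABEL, default)

wf = {"metadata": {"labels": {"pipekit.io/template-version": "v1.4.2"}}}
ref = resolve_git_ref(wf)
```

The point of the design is that the override lives entirely in standard Kubernetes metadata, so the workflow spec itself stays unmodified open-source Argo.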
This label corresponds to a Git version. So when this workflow is invoked, it tells the PipeKit control plane to look into the GitHub or GitLab connection at that given version, pull that workflow template down if it's not already on the cluster at that version, and then run it when invoking the workflow. A lot happening at once. But next, I think we've got a quick demo video to see it working.

First, you can see that I ran kubectl get to show that there were no workflow templates on the cluster: no resources found. Next, we show the workflow itself being invoked, and you can see where we're specifying the metadata. So with no workflow templates living on the cluster, we just run a quick submission of that workflow. Within our UI, we can see that the workflow is running, right? Normally, if there were no workflow template on the cluster, Argo Workflows would complain and say it's missing a resource, but we saw it run to completion: it pulled down the workflow template at that given version and added it to the cluster it was running on. All in 40 seconds. So that's cool. Next, I'll be handing it back to Yao to talk about cleanup.

Thanks. Yeah, as we explained before, our templates are deployed right after creation, which implies that if we want to delete something, it has to be deleted from the source. We must be very conservative there, in case something gets deleted and that's the end of it. So we have asynchronous steps to make that happen. First, a CronWorkflow runs regularly to examine the resources and the reference records on the tier, making sure templates that look stale really are stale. It then patches a label onto those templates with the expected delete time. Then we have the deleter, which sits on those clusters, monitors that label, and executes the real delete operation.
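The mark-then-delete step can be sketched like this; the label key and grace period are assumptions for illustration, not Bloomberg's actual values.

```python
import datetime

DELETE_AT_LABEL = "example.com/expected-delete-time"  # hypothetical key
GRACE = datetime.timedelta(days=30)                   # assumed grace period

def mark_stale(templates, referenced_names, now):
    """Return label patches for templates nothing references on this tier.
    The deleter later acts on the expected-delete-time label."""
    patches = {}
    for tpl in templates:
        name = tpl["metadata"]["name"]
        if name not in referenced_names:  # stale across the whole tier
            patches[name] = {"metadata": {"labels": {
                DELETE_AT_LABEL: (now + GRACE).isoformat()}}}
    return patches

now = datetime.datetime(2024, 1, 1)
patches = mark_stale(
    [{"metadata": {"name": "old-etl"}}, {"metadata": {"name": "live-etl"}}],
    referenced_names={"live-etl"}, now=now)
```

Splitting marking from deletion is what makes the process conservative: the label is just an intent, and nothing is removed until the grace period elapses.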
Of course, during this process, we give users the option to cancel or postpone the deletion. Now, let's look into a little more detail. That CronWorkflow runs on a single cluster, but it has a view of all the clusters' resources on the given tier. That's how it can make a sound conclusion that certain templates are really stale: nothing refers to them across the entire tier. The label patch is applied through the deployment API; sorry, it ends up in the database. Once it lands in the database, the flow diverges. On one side, that information is picked up by the syncer and lands on the template itself, and the deleter monitors the deployed templates; when the time comes, it fires a delete request to the deployment API for the final deletion. On the other side, the UI also gets notified of that delete label and surfaces the notification to the user, so users can take action if they want to, like removing the label or extending the deadline. Yep. Next, I'll hand it over to JP. We have slightly different user profiles.

Thank you. The way PipeKit's cleanup process works is a bit different from Bloomberg's. Whereas they use a CronWorkflow to do a lot of the labeling, we rely very heavily on the PipeKit agent deployment, which handles garbage collection from start to finish. The first thing we do is apply a last-used timestamp label that gets set each time a template is run. And instead of having a CronWorkflow, we use a separate goroutine running within the deployment itself to check the templates on the cluster hourly and make sure they should still be on that cluster. Users have the ability to specify a TTL in an environment variable, but our default is set to 24 hours.
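The hourly TTL check can be sketched like this; the label key and function names are stand-ins, not PipeKit's code (the real agent is a Go deployment using goroutines).

```python
import datetime

LAST_USED_LABEL = "pipekit.io/last-used"      # stand-in label key
DEFAULT_TTL = datetime.timedelta(hours=24)    # default, overridable via env var

def expired_templates(templates, now, ttl=DEFAULT_TTL):
    """Return names of templates whose last use is older than the TTL.
    The agent runs this check hourly and deletes what it returns."""
    expired = []
    for tpl in templates:
        last_used = datetime.datetime.fromisoformat(
            tpl["metadata"]["labels"][LAST_USED_LABEL])
        if now - last_used > ttl:
            expired.append(tpl["metadata"]["name"])
    return expired

now = datetime.datetime(2024, 1, 2, 12, 0)
tpls = [
    {"metadata": {"name": "fresh",
                  "labels": {LAST_USED_LABEL: "2024-01-02T00:00:00"}}},
    {"metadata": {"name": "stale",
                  "labels": {LAST_USED_LABEL: "2024-01-01T00:00:00"}}},
]
stale = expired_templates(tpls, now)
```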
So on the right, you can see the highlighted section: this is a workflow template that got created, actually the same workflow template from the demo in the other slide, and you'll see it has a last-used timestamp. I can't read timestamps, so I don't know what that actually corresponds to. But our PipeKit agent will be checking that workflow template every hour, asking whether it should still be live or not, and taking the appropriate action.

So here's just a quick map of the control flow. The PipeKit agent deployment starts up. On one goroutine, it checks the status of the in-flight workflows and pulls or updates any workflow templates those workflows are going to run. On a second goroutine, it loops: delete the collection of workflow templates that are beyond their TTL, then wait an hour.

Cool. Lastly, we'll do a comparison between the two platforms. As Yao said, we do have some very different user profiles. PipeKit is a SaaS platform where any user can sign up and onboard their company, so we had to make it very generic to support a wide variety of use cases, and we couldn't be as opinionated. For Git handling, we have the back end provided by GitHub or GitLab. And for cleanup, we have just-in-time delivery of workflow templates and a default cleanup of 24 hours, configurable by end users.

Yep. So, luckily, we mostly serve internal users, so we can be very opinionated and promote what we call recommended or best practices. In this case, we encourage users to use pre-built workflow templates. But on the other hand, we face some specific SLAs as a company-internal platform. For Git handling, we don't have that in-depth GitOps integration, but we do provide CI/CD agents for our users to make use of. Also, we use Git versions more from an informational perspective.
And there are some major differences in the cleanup steps, because of how our templates are delivered to the cluster. We have a much longer grace period; deletion is cancelable, but it's also not recoverable once it happens.

The topics above aren't everything on our minds. We do have some ideas, but we don't have solid plans yet. We'd welcome version detection and refresh: we want to allow users to just specify a latest tag or a version range, so they don't need to look up a specific version in the history, and a worker can patch it for them, plus auto-update whenever something new comes out. And what about PipeKit's side?

Yeah, over the next couple of quarters, we're interested in figuring out a good workflow template pull policy, similar to what Kubernetes does for Docker images. We want to figure out: whenever there's a change to one of the workflow templates, does it get pulled instantly into the clusters that already have the workflow template? Is there an override? Do we say it should never get pulled unless a workflow is being invoked? So we'd definitely appreciate any design thoughts on that one.

So lastly, thank you all so much for showing up to our talk. I think we've got a little under five minutes to do some Q&A, so we can hang out for that. Thanks, everyone. Any questions?

Hi. Did you ever consider using Argo Events to trigger any of your workflows?

Was that for myself or Yao, or both? Both. I'll let Yao go first. So using Argo Events is something we have thought about for a long time, and we are running some experimental stuff with it. We plan to bring it to production, but we don't have a specific ETA yet. We're in a similar boat; we're looking for design partners. We just want to figure out the right design to make sure the integration is seamless and works, especially in a multi-cluster context.
That gets a little challenging when you're doing it as a SaaS platform. But yeah, I guess that's it. Thank you. All right, thanks, everyone. Thank you.