Welcome, everybody, to our talk. I'm Sudesh. I work for the FikaWorks Collective, and I did a project at Rabobank. Today I'm here together with my colleague, Carly Heibergts, and we will talk about the journey we had at Rabobank introducing Argo Events and Argo Workflows and running them for two years in our production environment. So Carly, let's start.

Before I kick off, let me introduce myself. I'm Carly. I've worked at Rabobank for the last nine years; I started off as a financial advisor and am now a product owner. In these nine years I've worked in the area of special asset management. More on special asset management later; let me first introduce Rabobank. We are the second largest Dutch bank, founded decades ago as a cooperative of farmers. We still have a cooperative mindset and we are still a cooperative: we have over 2 million members, and those members have direct influence on the decisions the bank makes. The cooperative roots are also reflected in our mission, "Growing a better world together." We have over 9 million customers and almost 2 million mortgages, with a total of 193 billion euros.

Within that mission we have three transition topics: the first is food, then the climate and energy transition, and the last one, which I want to focus on, is the transition to a more inclusive society. We believe that everyone deserves a fair and equal chance to pursue their ambitions. We do that by taking away barriers to financial products, but also by helping customers who are no longer able to pay their mortgage. It can happen to any of us, whether through unemployment, a divorce, or other financial difficulties in your life: there comes a point where you can no longer pay your mortgage, and the department where I've worked for the last nine years, special asset management, is there to help these customers.

To support the advisors who work at special asset management, we have three processes in place. First, our mortgage administration registers the payments that come in every night, and whether any were missed, and sends out a nightly data file. Then Argo Workflows processes this data file and updates the database of our case management system. The case management system supports the employees: tasks are created on their dashboard so they know which clients they should start supporting with their financial difficulties. So using Argo Workflows and Argo Events, this critical information on missed mortgage payments becomes visible in the case management system, which is really important to us. Sudesh will now take you on a journey through how we set up these jobs in the public cloud.

Thank you, Carly. First I'm going to explain, at a high level, what the infrastructure looks like; I'm not going to dive too much into the details there. Then I'm going to explain the issues we had and how we improved our workflows, how we are using reports, and how we alert on our workflows.

So let's start with the infrastructure. This is basically it: on-premise we have our data center, where the mainframe is running and where all the transactions happen, and we run our own infrastructure in the public cloud. Every night we get the latest information on the client accounts from the mainframe. We receive that data using Argo Events and then start processing it using Argo Workflows. The data is subsequently stored in our databases and in our storage, which is then used by the operational platform for the operators.

When we started out with Argo Workflows, we had jobs basically everywhere. Some things were running in development, other things were running in production; we had different versions in development and different versions in production; sometimes we didn't even know what was running where. It was very inconsistent, we just had stuff all over the place, and you can imagine that leads to a lot of problems: it was quite hard to manage. So we had to take some steps to bring this under control. We did three things, which I'll explain in more detail: we added role-based access control so that people can no longer change too much by hand, we set up a pipeline for consistency, and we split up the workflows using templates.

First, the RBAC. We created three roles. What we saw is that developers would test something in development, then go to the GUI and copy-paste the YAML from the GUI into the production environment. You can imagine this is not something you want. At the same time, if you're running these jobs or you're responsible for them, you need some kind of access that allows you to do the things you have to do. So we created a specific operator role: operators can stop, start and restart workflows and check the logs, but they can no longer remove things or add things by hand. We also created a reader role, because we think it's good to have some transparency towards the business owners: these are very important processes that need to be reported on daily, so it's nice that they can look into the GUI and check what is going on, how long things take and what is failing, without having to understand it in much detail. And of course we created a third role for ourselves, so that we can step in if there are any issues and still manage everything through the GUI as well. So that's RBAC.
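To make this concrete, here is a minimal sketch of what such an operator role could look like in plain Kubernetes RBAC terms. The names and namespace are illustrative, not the actual Rabobank manifests (those live in the speakers' GitHub repo); the reader role would be the same rules without the patch and update verbs.

```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: workflow-operator        # illustrative name
  namespace: workflows           # illustrative namespace
rules:
  # see what is running and read the logs
  - apiGroups: ["argoproj.io"]
    resources: ["workflows", "cronworkflows", "workflowtemplates"]
    verbs: ["get", "list", "watch"]
  - apiGroups: [""]
    resources: ["pods", "pods/log"]
    verbs: ["get", "list", "watch"]
  # stop, resume or retry existing workflows, but no create or delete
  - apiGroups: ["argoproj.io"]
    resources: ["workflows"]
    verbs: ["patch", "update"]
```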
The second thing we did was use Kustomize. I assume everyone is familiar with Kustomize: it basically allows you to templatize your manifests so that you can roll them out consistently across different environments. But there is a small problem with that. Kustomize is a built-in tool for Kubernetes, so it works with the standard Kubernetes resources, and the Argo resources are custom resources. Kustomize doesn't recognize those, and that can lead to problems. For example, say you have an environment prefix: every time I create a new workflow, I want "production" or "development" in front of the name, so I know where it is running. If you just try that, it doesn't work: either you cannot deploy it, or the prefix doesn't carry through to the Argo resources that you want to deploy.

There is a little trick you can use, though, and that is a Kustomize configuration. You basically tell Kustomize: here is a resource I want you to know about, you can modify this too, and this is how. I'll give a couple of examples, and I'll share the code later; everything is on GitHub, so if you want to have a look you can check it there.

One of the things you can do, for example, is patch the names. In this case I want to patch the name of the workflow template that I'm using inside a cron workflow, so that when I add a prefix, that prefix is also carried through to the reference in the job that uses the template. Another thing that is quite important, I think, is that if you want to use different images in different environments, you can patch those as well. Kustomize has a way of overriding images, but you need to tell it to also look inside the workflow templates and the other custom resources that Argo has. That's what you see here: by specifying in the Kustomize configuration where it can find the image in this type of resource, I can then use the normal Kustomize syntax, which you're probably already familiar with, to override those images too.
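As a rough illustration of that trick, a per-environment overlay might pair a kustomization.yaml with a small Kustomize configuration file along these lines. Everything here is a sketch with made-up names, not the repository shown in the talk, and the exact field paths can vary between Kustomize versions, so treat them as a starting point.

```yaml
# kustomization.yaml in an environment overlay (all names here are illustrative)
namePrefix: production-
resources:
  - ../../base
configurations:
  - kustomizeconfig.yaml        # teach Kustomize about the Argo custom resources
images:
  - name: registry.example.com/nightly-import    # hypothetical image
    newTag: "1.4.2"

# kustomizeconfig.yaml (the file referenced above)
# nameReference: carry a prefixed WorkflowTemplate name through to the
# CronWorkflow that references it; images: tell the image transformer where
# image fields live inside the Argo resources.
nameReference:
  - kind: WorkflowTemplate
    fieldSpecs:
      - kind: CronWorkflow
        path: spec/workflowSpec/workflowTemplateRef/name
images:
  - kind: WorkflowTemplate
    path: spec/templates[]/container/image
  - kind: CronWorkflow
    path: spec/workflowSpec/templates[]/container/image
```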
The last improvement we made concerns the job definitions themselves. We have jobs that run in the weekend which are different from the ones during the week, because the bank uses different processes then and we don't have to do all the work we do during the week. What we saw is that developers were copy-pasting the tasks into new cron jobs. Then you get copies of things, copies tend to diverge, and that again leads to all sorts of problems and inconsistencies. So we split things up: by creating separate templates for the different processes, we can reuse them wherever we need them. In this case, for example, we have one job for the weekend and a different job for during the week, but they use the same templates, which means the work that actually runs is always the same. This is what that looks like in practice; we always use the DAG type, the directed acyclic graph. All the examples are on GitHub, so you can check them yourself.
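A stripped-down sketch of that shape, with made-up template, image and schedule values rather than the real jobs: one WorkflowTemplate holding the DAG, and a weekday CronWorkflow that only references it, so a weekend variant can reuse exactly the same steps.

```yaml
# Illustrative only: step names, images and schedules are placeholders.
apiVersion: argoproj.io/v1alpha1
kind: WorkflowTemplate
metadata:
  name: nightly-processing
spec:
  entrypoint: main
  templates:
    - name: main
      dag:
        tasks:
          - name: import-file
            template: import-file
          - name: update-cases
            template: update-cases
            dependencies: [import-file]
    - name: import-file
      container:
        image: registry.example.com/import:1.0     # hypothetical image
        command: [python, import.py]
    - name: update-cases
      container:
        image: registry.example.com/update:1.0     # hypothetical image
        command: [python, update.py]
---
# Weekday schedule; a weekend CronWorkflow would differ only in its name and
# schedule and reference the exact same template.
apiVersion: argoproj.io/v1alpha1
kind: CronWorkflow
metadata:
  name: nightly-processing-weekdays
spec:
  schedule: "0 21 * * 1-5"
  workflowSpec:
    workflowTemplateRef:
      name: nightly-processing
```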
Once we had that done, we had a reasonably stable environment, but we saw that the developers were looking in the GUI every morning to see whether the jobs had run successfully. If you have a lot of jobs, the GUI shows you a long list of things, and it can be difficult to find the one you're looking for. So we created a reporting mechanism that only shows them what they actually need to see.

How does that work? We created a job in Argo, a workflow that checks the other workflows. The nice thing is that Argo has an API, which you can call with the token from a service account, and that's exactly what we do: the pod reads the token from the service account, uses it to connect to the API, fetches the workflows it needs, generates the report and sends it out, either via email or, in our case, via Teams. This is what it looks like. The job runs at 7 o'clock in the morning and checks all the jobs that ran from 9 o'clock the previous evening. It gives us an overview of the status: was it successful, did it fail; you can click on each entry, and you can also check the history. For that I can highly recommend enabling the workflow archive. Everyone who gets the report can click through and immediately see how a job performed and how it is trending over time. Very convenient, very nice feature, highly recommended.
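The real report generator is in the speakers' repository; as a rough sketch of the pattern described here (all names, namespaces, images and URLs are placeholders, and the Teams posting is left out), a scheduled workflow can read its own service account token and query the Argo server API for the other workflows:

```yaml
apiVersion: argoproj.io/v1alpha1
kind: CronWorkflow
metadata:
  name: morning-report
spec:
  schedule: "0 7 * * *"                     # 07:00, after the nightly runs
  workflowSpec:
    entrypoint: report
    serviceAccountName: workflow-reporter   # hypothetical SA allowed to list workflows
    templates:
      - name: report
        script:
          image: python:3.12-slim
          command: [python]
          source: |
            import json, ssl, urllib.request
            # the pod reads its own service-account token ...
            token = open("/var/run/secrets/kubernetes.io/serviceaccount/token").read().strip()
            # ... and uses it to ask the Argo server for the workflows in this namespace
            req = urllib.request.Request(
                "https://argo-server.argo.svc:2746/api/v1/workflows/workflows",
                headers={"Authorization": "Bearer " + token})
            # TLS verification is skipped here for brevity; verify the cert in a real job
            resp = urllib.request.urlopen(req, context=ssl._create_unverified_context())
            for wf in (json.load(resp).get("items") or []):
                print(wf["metadata"]["name"], wf["status"].get("phase"))
            # a real implementation would filter on last night's runs, build a
            # status table and post it to a Teams incoming webhook
```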
Then we thought about how to improve this further. We have this report, but it would be nice if people were also informed at the moment something breaks. Some of these jobs run for multiple hours; if you only check at 7 o'clock in the morning and see that a job failed, you have to start it again, it can take another five hours during the day, and then you get into all kinds of timing problems: does it finish before the next one, do we have to stop it? So you want to be informed as soon as possible.

Argo does have the concept of an exit handler, but the problem is that normally you have to set it on every template, on every job you have, and that is not really doable if you have a lot of jobs to manage. So we created what I call a global exit handler. There is a little trick you can use: the ConfigMap of the workflow controller. The workflow controller has some settings, and one of them is the workflow defaults. These are settings that are patched automatically onto every workflow by the controller. One of the things we added there was an exit handler. It's basically just a Python script that parses the statuses and, based on those, sends out an alert when a job has failed. As you can see here, this one failed, but the exit handler always runs and sends out the status.

What does that look like? In our case we use PagerDuty, and here you see the alert in PagerDuty: what went wrong, at what time, which job it was, all the information is right there. We also added a link, so if you click on it you go straight to Argo, right where you need to be. And for the technical folks, here is the payload we send; again, everything is on GitHub, including the Python script, so you can check it yourself.
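The talk does not show the exact manifest, but one way the global exit handler idea can be wired up, assuming an Argo version where the workflow defaults can attach an exit hook that points at a shared cluster workflow template, is sketched below. The template, secret and URLs are made up, and the PagerDuty call is reduced to the essentials of the Events API v2 payload.

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: workflow-controller-configmap
  namespace: argo
data:
  workflowDefaults: |
    spec:
      hooks:
        exit:                        # runs for every workflow, success or failure
          templateRef:
            name: alert-handler
            template: notify
            clusterScope: true       # alert-handler is a ClusterWorkflowTemplate
---
apiVersion: argoproj.io/v1alpha1
kind: ClusterWorkflowTemplate
metadata:
  name: alert-handler
spec:
  templates:
    - name: notify
      script:
        image: python:3.12-slim
        command: [python]
        env:
          - name: PD_ROUTING_KEY
            valueFrom:
              secretKeyRef: {name: pagerduty, key: routing-key}   # hypothetical secret
        source: |
          import json, os, urllib.request
          status = "{{workflow.status}}"          # substituted by Argo at runtime
          if status != "Succeeded":
              event = {
                  "routing_key": os.environ["PD_ROUTING_KEY"],
                  "event_action": "trigger",
                  "payload": {"summary": "{{workflow.name}} finished with status " + status,
                              "severity": "critical",
                              "source": "argo-workflows"},
                  # link back to the workflow in the Argo UI (placeholder host)
                  "links": [{"href": "https://argo.example.com/workflows/{{workflow.namespace}}/{{workflow.name}}"}],
              }
              urllib.request.urlopen(urllib.request.Request(
                  "https://events.pagerduty.com/v2/enqueue",
                  data=json.dumps(event).encode(),
                  headers={"Content-Type": "application/json"}))
```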
So that's our journey, and now we have some time for questions. Do we have a mic, or no? You'll have to shout really loudly.

Hello. Well, first, thank you for the presentation. I was wondering, for workflow templating, are you using nesting? Do you have workflow templates which call other workflow templates, or do you just have workflows calling workflow templates?

They are always templates. I think why you're asking is because the workflow is the running instance, right? All our jobs live inside the cluster as templates; it just depends how they are triggered. Everything is a workflow template, and they can be triggered either through a cron workflow or through events, if they are event-based. But we use workflow templates for all of them.

But you don't have a workflow template which in turn reuses other workflow templates? How are you then tracking those dependencies? At some point you could have three or four levels of nesting of workflow templates. Do you have anything to visualize and track how certain changes will affect other workflow templates down the line?

The jobs we need to run are quite consistent, so we do have an overview of these things, but that's just on paper.

All right, thank you.

Thanks a lot, great talk. As you're working in a very high-security industry, I'd like to ask a question which, sorry, you may hate me for: why did you choose Argo CD, when the same functionality can be achieved with, for example, GitHub Actions or even Jenkins CI? What was the decision to choose this particular technology in your case?

So you're asking if we could use GitHub CI for the jobs that we are running?

I'm asking what the logic was behind selecting Argo CD in your particular use case, because you could achieve the same results with other technology available in the market.

I think Argo was already implemented before I came, but before this the bank was running these jobs on-premise with Chronos on DC/OS. Now we're moving to the public cloud and running this on Kubernetes, and that's why Argo was chosen.

I was interested in the reporting: did you ever consider using the metrics endpoint, scraping it with Prometheus and putting it on a dashboard, as opposed to emailing or using MS Teams?

I did look into it. We don't use that endpoint; the jobs can be very long-running, and they expose their own metrics. The nice thing about doing it this way is that when you call the API, you get all the information from all the workflows at once, so you have everything together.

Anyone else?

To actually create the workflow templates, you would need some CI as well, to validate and prove that your templates aren't going to break Argo Workflows itself, like a CI for the CI.

We do; that's why we have the pipeline, so everything can be tested in multiple environments before we actually run it in production, including all these templates.

I guess that tests the jobs themselves. I mean something to prove that your templates are linted and clean, because if you deploy a workflow template which isn't valid and legitimate, it breaks all the workflows. Is there something that would catch that? It would be possible to break Argo itself with some workflow templates.

I don't quite understand.

I've deployed templates before where there was a typo, an indentation that was incorrect, and I pushed that, deployed it using Argo CD for Argo Workflows, and the Argo Workflows were not available because they were broken.

We do test everything, and we don't have many changes to the templates themselves. The code might change, but that's the script or the logic inside the job; the templates themselves are quite consistent, so for us that's not really an issue.

Thank you. Anyone else? Okay, thank you.