Thanks, everyone. You'll find me in the CNCF Argo Slack, and I've been working with Argo for about four years now, across a fair few parts of the project. I'm joined by Alec from Fetch Analytics.

Very briefly about Pipekit: we offer two main services. We offer services on Argo Workflows, so if you need support on Argo or architecture and things like that, we're more than happy to help you. We also offer a product that sits on top of Argo that helps you with scaling. It does cool things like multi-cluster, making RBAC just less sucky, and all that kind of stuff.

We're Fetch Analytics, and we work with data about human movement. The one problem I like is trying to work out where the best place is to put a coffee shop in a given area, but we've also got transport use cases: people interested in their clientele, people planning high streets, managing historic monuments, and so on. We're active in these five different areas, and, yep, we're a start-up, so we're scaling.

So how does Argo fit in with our business? The core of our company is this pipeline that runs twice a week. It's got more than 500 billion records, and essentially we've got these algorithms that we've written that help derive mobility patterns and surface this information. If you rewind six months, it was kind of unreliable; we had this feeling that it could potentially be better or faster, and, like the image on the right, it was a bit opaque as to what was actually going on under the hood. So we embarked on these last couple of months of improving it, and we want to share some of the insights we've had along the way.

I'm a data scientist by trade, so I guess the official term for what we did would be a parameter grid search for optimising our workflow. We ran about 66 different experiments, which was super costly, and that came down to about 15 different variables that we were tuning, changing, and experimenting with. Some of the important ones were thread usage, how we tuned the pods to the different nodes, types of storage options, how we gave data to each pod and how many chunks of data we gave to each pod for a specific run type, and then machine specs. Essentially we were monitoring pipeline run time as our main metric, while also looking at improving reliability and reducing cost. And I guess the headline result is that we've got a much more reliable, much more consistent, deterministic pipeline; we know what we're going to get when we run it.
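To make that concrete, here's a minimal sketch of the kinds of knobs such a grid search can sweep over on a single workflow step. The template name, image, parameter names, and values here are all hypothetical, not Fetch Analytics' actual configuration.

```yaml
# Hypothetical Argo Workflows manifest illustrating grid-search knobs:
# threads per pod, data chunk size, and the machine spec under test.
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: grid-experiment-
spec:
  entrypoint: process
  arguments:
    parameters:
      - name: threads      # thread usage per pod
        value: "8"
      - name: chunk-size   # how much data each pod receives per run
        value: "50000"
  templates:
    - name: process
      nodeSelector:
        node.kubernetes.io/instance-type: m5.2xlarge   # machine spec under test
      container:
        image: example.com/pipeline:latest             # hypothetical image
        args:
          - "--threads={{workflow.parameters.threads}}"
          - "--chunk-size={{workflow.parameters.chunk-size}}"
        resources:
          requests:
            cpu: "7"        # tuned to fit the chosen node type
            memory: 28Gi
```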
The test pipeline that we were using, we got down from four hours to one hour, which is really handy, and our CTO said that he can sleep in on Sundays, so I think that's a good result.

So, before we look at how to scale our workflows, it's probably worth briefly explaining what happens in the basic lifecycle of a workflow. You take your manifest and you shove it into your cluster somehow, via the Argo CLI or kubectl, something like that. You're going via the Kubernetes API to get that manifest into etcd. The workflow controller wakes up at that point, realises you've popped something in there, and starts doing workflow controller stuff. It starts asking the Kubernetes API to create your pods and starts monitoring all of that, and all of that's going through the Kubernetes API with lots of back and forth, lots of calls, which at scale gets very noisy and very scary quite quickly. On top of that, the controller is writing the state of all the steps in your workflow back to the manifest in etcd. And of course, if you're a user using the UI to look at your workflow, that's also going through the Kubernetes API, looking at the manifest and basically rendering it into nice pretty blobs on the screen for you. So there's a lot of chatter that goes on just to create even quite a simple workflow.

In terms of stuff that could break, in no particular order, at a high level: your Kubernetes control plane might explode; your workflow's manifest may get too big for etcd and you get all sorts of scary errors; your workflow controller and server can't handle what's being thrown at them, so you run out of resources and they get OOM-killed, that kind of stuff; and lastly, your cluster itself could just not have enough nodes to do what you're asking of it, and things get quite sad quite fast.

So, if we take them one by one and look at the Kubernetes API first: by default that's quite a black box, especially if you're in a cloud environment. The Kubernetes API is over there somewhere, the workflow is over here, and you don't really see anything. You need to get some sort of observability in there to be able to see what's going on, which is shown in the graphs on the right-hand side there. Those are the graphs that come with the Prometheus stack; I'm sure there are others out there that are equally as pretty in Grafana. If you don't have observability, you can look at the workflow controller logs, and you'll see stuff like "I tried to ask the Kubernetes API to do a thing and it said no, so I'll try again in 30 seconds." Those kinds of log messages are a good indicator that your workflow controller is basically asking too much of the Kubernetes API.

And this is where the big lists start to come in, where I just try to give you some things to look at; you're going to have to go away and actually Google some of these to work out what's best for you. Essentially, you can ask the workflow controller to back off a little on how often it talks to the Kubernetes API using the qps and burst settings. You can look into limiting how many workflows you're running at once using workflow parallelism. There's an option to limit how many pods the workflow controller creates each time it asks for something to be created, through the resource rate limit. And if you're on an older version of Kubernetes, you'll see quite a significant advantage in going to 1.27 or above; I think we found a three-fold increase in Kubernetes API performance just by upgrading.
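As a rough sketch, the throttling knobs just mentioned live in the workflow controller's ConfigMap (and its container flags). The numbers below are illustrative placeholders to tune, not recommendations.

```yaml
# Sketch of workflow-controller-configmap with the knobs mentioned above;
# values are starting points to experiment with, not magic numbers.
apiVersion: v1
kind: ConfigMap
metadata:
  name: workflow-controller-configmap
  namespace: argo
data:
  # Cap how many workflows the controller reconciles at once.
  parallelism: "50"
  # Cap per-namespace concurrency separately if you're multi-tenant.
  namespaceParallelism: "10"
  # Rate-limit pod creation so the controller doesn't hammer the API server.
  resourceRateLimit: |
    limit: 10
    burst: 25
# Client-side rate limits are flags on the controller container instead,
# e.g. args: ["--qps=30", "--burst=60"].
```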
And on that Kubernetes upgrade: you don't have to do anything else other than just upgrade. If the worst comes to the worst, you can throw more clusters at the problem; I know a great product that will handle that for you. You also need to consider what else you're using that cluster for. It's very rare that we see people only using a Kubernetes cluster for workflows; they've usually got something else going on on that cluster as well. That something else will also be talking to the Kubernetes API and will have its own demands, so you need to balance everything in your cluster to make sure that you're not going to break things.

Something we ran into was this error at the bottom, that your request entity is too large. This is when you're exceeding the etcd size limit, which can happen if you're pushing how many pods you've got in your workflow. It can also happen if you're putting data into your parameters, so you can consider using ReadWriteMany volumes to actually write that data out rather than storing it in the manifest. You might have a really big workflow, and in that case splitting it into smaller workflows is a good option. In our case, we wanted a lot of parallelism and a lot of pods, so we've got quite big workflows, and we went with the option of persistence, where you offload that manifest to a database (sketched below).

Checking in on the workflow controller: if you install the controller the sensible way, through the Helm chart or through the official release manifest, there are no resource limits or requests set on that controller out of the box. So if you've just installed it and walked away, you're in quite a scary position. You need to go ahead and set some kind of resource requests on that controller. Unfortunately, I can't stand here and give you a magic number, because everyone's workflows are different and everyone's requirements are different, so you're going to have to do a bit of experimentation to find out what your magic number is.

And then, possibly slightly controversially, I would suggest not running the controller in HA. In the example here we've got three controllers; only one of them will actually be running, and the other two will be sat there dormant, waiting to win a leader election. So if you, for example, request 10 CPUs for each controller, you're burning 30 CPUs only to ever use 10 at a time. You might be better off burning 15 CPUs on one controller and using priority classes to make sure that controller comes back healthy if it ever dies. The other thing to remember is that if you run just one controller, it will still try to do a leader election, even though it will always win. You can turn that off using the environment variable there, which saves yet another set of API calls to the Kubernetes API.

And then likewise with the server: if you're using the server, I recommend spending some time improving that experience; get SSO up, get your TLS sorted. If you have a lot of workflows on the server, you're going to find it slowing down. Again, there's no magic number; it depends on their size. There's a delete command in Argo, which is quite useful, and you can also use the TTL strategy on the manifest, which will delete them for you. Also, a really nice one is to set up your log aggregation, because otherwise, as soon as your pod dies, you lose the logs. Having that link out and keeping your logs, no matter when you come back to your workflows, is super handy.
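For reference, here's a minimal sketch of the persistence and single-controller setup described above: node-status offload to a database in the controller ConfigMap, plus one controller replica with leader election disabled and resources set. Hostnames, credentials, the priority class, and the resource numbers are all placeholders.

```yaml
# Sketch: offload large workflow state to Postgres instead of etcd.
apiVersion: v1
kind: ConfigMap
metadata:
  name: workflow-controller-configmap
  namespace: argo
data:
  persistence: |
    nodeStatusOffLoad: true        # keep big node status out of the manifest
    postgresql:
      host: postgres.example.com   # placeholder host
      port: 5432
      database: argo
      tableName: argo_workflows
      userNameSecret: {name: argo-postgres-config, key: username}
      passwordSecret: {name: argo-postgres-config, key: password}
---
# Sketch: single controller, no leader election, requests actually set.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: workflow-controller
  namespace: argo
spec:
  replicas: 1
  selector:
    matchLabels: {app: workflow-controller}
  template:
    metadata:
      labels: {app: workflow-controller}
    spec:
      priorityClassName: argo-controller-critical   # hypothetical PriorityClass you define
      containers:
        - name: workflow-controller
          image: quay.io/argoproj/workflow-controller:v3.5.6   # pin your version
          env:
            - name: LEADER_ELECTION_DISABLE   # skip elections a lone controller always wins
              value: "true"
          resources:
            requests: {cpu: "4", memory: 8Gi}   # no magic number: experiment
```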
You can obviously skip the whole UI and just use kubectl commands, which is another option. And likewise with the server, there are no defaults set by the official manifest, so if you've got users using the UI, you might want to set those here again.

So, we went on this journey of tuning the nodes and the resources, which is where we got quite a lot of performance from. The first thing you're going to need is obviously monitoring, to know what the performance is like. The top-left image is basically what you want to aim for, where you're using 80% of your resource utilisation. If you're going over that with memory or ephemeral storage, it's going to break; if you're going over that with CPU, things will slow down. In the bottom-right image you can see we've got two steps with quite different resource requirements, so in that case it made sense for us to split those up, put them on different nodes, and tune them accordingly, to save some money and give them exactly what they need. And then you can see there's one node going wild there: that was our workflow controller, which wasn't tuned. So, as Tim was saying, tune that, and that will also have some performance impact on your workloads. And lastly, just look at what else is running: your other metrics on the pods, and any sidecars or drivers that might also be draining resources and causing things to slow down.

On to your Kubernetes nodes: hopefully you're autoscaling. If you're in the cloud, please autoscale. In case you're not aware, if there's a node attached to your cluster, you're paying for it; whether it's being used 100% or 0%, you're paying for all of it. So the important thing here is really to turn nodes off when you're not using them, rather than anything else. We scale down to zero at night and on the weekends; it's definitely doable.

In terms of pods pending, a good indicator is the image that I mocked up on the right-hand side there. If you've got pods pending for a long time, there's something a bit iffy. Maybe you're not scaling fast enough, maybe your images are big, maybe you're not caching your images well; there's a whole rabbit hole to go down there, a whole talk in that. Basically, try to avoid the yellow blobs. Do whatever you can to make the yellow blobs go away.

Slightly on a tangent, in terms of cost-saving, do consider using ARM over AMD64. We use 100% ARM in our cluster, and Argo Workflows works well on ARM. Obviously, depending on what you're running in your workflows, that's where things might get a bit trickier, but you should see a significant cost saving if you're in the cloud using ARM over AMD64. Hopefully most people know about spot instances by now, but spot is significantly cheaper, at the risk of your node being pulled out from under your feet within two minutes. You obviously need to write your workflows in a way that they can handle a surprise eviction event, but we do the same and it definitely can be done.

Something to draw your attention to and think about is how you structure your algorithms in a way that you can scale horizontally quite easily. Fundamentally, thread level is the lowest level of parallelism that you've got to work with, but you want to start thinking about how, practically, you can make it easy for yourself to just go horizontal with more and more pods, given the right resources and data per process that you're running (there's a sketch of that fan-out pattern below).
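As a hedged sketch of those two ideas together: fan out into many short-lived pods rather than one long-lived pod holding all your progress, schedule them onto ARM spot capacity, and retry on eviction. The taint key, labels, image, and counts are assumptions that vary by cloud and workload.

```yaml
# Hypothetical templates: horizontal fan-out over data chunks, on ARM spot
# nodes, with a retry so a surprise eviction doesn't fail the workflow.
- name: fan-out
  parallelism: 50                    # cap concurrent pods
  steps:
    - - name: crunch
        template: crunch-chunk
        arguments:
          parameters: [{name: chunk, value: "{{item}}"}]
        withSequence: {count: "200"} # one pod per data chunk
- name: crunch-chunk
  inputs:
    parameters: [{name: chunk}]
  retryStrategy:
    limit: "3"
    retryPolicy: OnError             # node loss / eviction surfaces as an error
  nodeSelector:
    kubernetes.io/arch: arm64        # ARM for the cost saving
  tolerations:
    - key: cloud.example.com/spot    # assumption: your cloud's spot taint key
      operator: Exists
      effect: NoSchedule
  container:
    image: example.com/pipeline:latest   # hypothetical image
    args: ["--chunk={{inputs.parameters.chunk}}"]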
Yeah, so that can inform quite a lot of other decisions: which nodes you choose, and how long-running you want your pods to be. Ideally, you don't want them to die after a very long time and then lose all of that progress. Also think about whether you're running very small workflows or just large workflows; maybe you want small workflows to use your full cluster while not affecting how your large workflows run.

Hopefully by now we've got the point across that you really need observability in your cluster to have any chance of scaling workflows, and we have a few things here that should help you on that journey. There has been a long-standing Grafana dashboard for Argo, but it's pretty long in the tooth: it doesn't work with the latest version of Grafana, and it doesn't really give you all the metrics you need. We just spent a bit of time polishing it up. It's the exact same dashboard as it was before, but it works, which is kind of important. Feel free to download it. If you've got an observability stack already, you can just plug it in, get running with it, and obviously tweak it to your own needs. It's there; just use it, basically.

In terms of the future, my colleagues at Pipekit have already started work on enhancing the metrics coming out of Argo Workflows. We hope for 3.6, but I'm not going to promise anything. We have basically ripped out the old ones and put in some new ones. There are some questionable things in the old ones, which we don't have time to go into today, but hopefully the new ones will give you the answers to the questions that you really need answered. As you can see on the screen now (I won't repeat them), the bottom graph shows how long a workflow template takes to run across multiple workflows, giving you an average of how long your template takes. In the top corner are the reasons why your pods are stuck in pending, so you can start to see: oh, okay, I've got an image pull problem, or I've got a scaling problem. It hopefully starts to give you a better picture straight out of the workflow controller. The proposal link is there; we'd love a thumbs-up, we'd love some comments, all that kind of stuff. As I say, it's in the works, and hopefully we'll get it out soon.

If you don't have observability, I do appreciate that it's quite a scary thing to take on; it's not a quick thing to just roll out. The quote at the top is from a recent Grafana study: you should have observability. Do invest in it. Hopefully we can help you on that journey. We have written this free plugin that you just plug into your cluster. It covers exactly the metrics you need for your Argo workflows, you can see them on a dashboard, and you can start to see some of the answers to the things we just went through today.

I'll show it to you really briefly. I've loaded it in separate tabs, and I'm not going to hit refresh. This is an actual cluster; I've been desperately running some random jobs just to try to get some spikes on the graphs. This first page shows you everything about the workflow controller itself. It's showing successful workflows and failed or errored workflows, as well as what the inside of the controller looks like: how many things are being added to the controller's queue, and the depth of things in the queue. If we've got a problem, we'll start to see those lines go up and not come down. This cluster looks fine.
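Before moving on to the rest of the dashboard: if you're wiring observability up yourself rather than using a plugin, the workflow controller already exposes a Prometheus endpoint. A minimal scrape sketch, assuming the default metrics port (9090), an `argo` namespace, and the standard `app: workflow-controller` pod label, might look like this.

```yaml
# Minimal Prometheus scrape sketch for the workflow controller's built-in
# metrics endpoint (default :9090/metrics); namespace and label assumed.
scrape_configs:
  - job_name: argo-workflow-controller
    kubernetes_sd_configs:
      - role: pod
        namespaces:
          names: [argo]
    relabel_configs:
      # Keep only the workflow-controller pods.
      - source_labels: [__meta_kubernetes_pod_label_app]
        regex: workflow-controller
        action: keep
      # Point the scrape at the controller's metrics port.
      - source_labels: [__address__]
        regex: (.+?)(:\d+)?
        replacement: ${1}:9090
        target_label: __address__
```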
The Kubernetes control plane will be in here as well, so you can again see if your control plane is crying; it will really cry on this graph and you'll see it at a glance. You can see the same for the queues and things like that within the API itself. You can then see your workflow pods: pending versus running. If you see a large spike of pending and it takes a long time for them to turn into running, that's when you can start to see that you've got some kind of scaling issue. In fact, this large spike here is from when I made the screenshot a few slides back showing all the pending pods.

Lastly, looking at Kubernetes nodes: how many nodes have you got? Are they scaling up and down? This particular cluster scales down to four when it's doing nothing and spikes all the way up to thirty-ish when it's doing some work. Really, you want to aim to use as much of your nodes as possible, because you're paying for all of them anyway. In this particular cluster we're doing a pretty bad job: we're not using much of the CPU at all, and the memory is a little bit low. So we've got some work to do here. We can now see at a glance that we need to do something in our cluster to get the best economy out of it.

We hope to have this out very soon. Please just register on the quick link there and we'll get in touch with you in a couple of weeks. As I say, it's completely free. The collector itself is open source, so you can see all the code and all the data that's being collected. Everything we collect, we show back to you; there's nothing hidden there. We just want to help people get the best out of workflows.

So that's a summary of our findings. Definitely the key theme of getting observability up was pivotal: having Prometheus, kube-state-metrics, and Argo's custom metrics was all super useful. Some other tips: using NFS, a network file store (Tim's got a great talk on that, worth checking out if you haven't seen it). Getting rid of lifecycle hooks was quite important, because we'd often have a screen like you saw earlier with a bunch of green where things were actually failing under the hood; getting rid of those is worth it. And then also just investing in some tooling that we hadn't come across: making things more DRY with Kustomize, and controlling cost by having lower environments use cheaper storage versus production using the high-end NFS (there's a small sketch of that pattern below). And k9s is super useful if you haven't come across it. The latest version of Argo is also obviously good. So those optimisations are what got us much more performance at various levels, at pod and node level; a quarter of the run time, and knowing what we're going to get each time, is super valuable.

We've collected all the stuff we've mentioned onto one slide, basically, just to make it a little bit easier. I won't go through them all again. These slides will be available on the top link, and there are extra slides in here that we didn't have time to talk about today, so if you want to go into a bit more depth, there are some things to look at. In terms of Pipekit, we're more than happy to listen and help wherever we can. We're on booth E34, and we're happy to hear your open-source questions and try to answer them. We do regular office hours to do exactly that: just sign up, throw your questions at us, and watch us cry sometimes when we're not quite sure what to do. But yeah, do just turn up and get help; that's what we're here for.
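As a small sketch of that Kustomize pattern: a shared base plus a lower-environment overlay that swaps in a cheaper storage class. The file layout, resource names, and storage classes are hypothetical.

```yaml
# overlays/dev/kustomization.yaml: hypothetical lower-environment overlay
# that patches the base volume claim onto a cheaper storage class,
# while a prod overlay would point the same claim at the high-end NFS.
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  - ../../base
patches:
  - target:
      kind: PersistentVolumeClaim
      name: pipeline-scratch      # hypothetical claim name in the base
    patch: |
      - op: replace
        path: /spec/storageClassName
        value: standard           # cheap disk in dev
```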
Yeah, and if you guys are interested in this data and the story of human movement, come and reach out to us, because that's what we do. Thank you.