I think we can start. Yeah, perfect. So hello, everyone. Thanks for joining me this morning. I hope you're having a great time so far. Today I'm going to talk about Argo Rollouts at scale, or how we brought automated rollbacks to more than 2,100 microservices at Monzo.

A little bit about me, and why I'm doing this talk in the first place. I'm Joseph Paramedici — that's a picture of me with much more hair. I'm a backend engineer working in the platform team at Monzo. I led this migration project where we essentially replaced our default deployment strategy to use Argo Rollouts and automated rollbacks by default. I was also the tech lead of the engineering effectiveness team — as came up a couple of times during the keynote this morning — basically the team doing everything DevX: all the tooling engineers use to interface with the platform, the deployment pipeline and CI/CD, and the observability stack. And as you can probably hear already, I have a strong accent, and unfortunately I cannot do anything about it, so please bear with me.

So first of all, what am I going to talk about this morning? This is a case study of how we migrated all of our services to Argo Rollouts, how it became the default deployment strategy at Monzo, how we did it safely — because we are a regulated bank in the UK, and the name of the bank is Monzo — and how we are actually quite happy with what we achieved with Argo Rollouts. And there will be a lot of emojis, because that's part of Monzo's branding.

A little bit of scene setting, a little bit of context. What is Monzo? It's a regulated bank in the UK. It's seven years old and has more than 5 million customers. It won best British bank this year, I believe. It's a mobile-only banking app — you can see what the app looks like. And if you've been in the UK, you've probably seen those very flashy hot coral cards that we're famous for; that's what they look like.

From the tech stack point of view — the more interesting bit — Monzo really bought into the microservices philosophy. Our application is made of more than 2,100 microservices, depending on what you call them. We have over 200 engineers, most of them backend engineers. And that's where the interesting numbers come in: last year we did 27,000 prod deployments. Those are actual prod deployments — not config changes, actual code that we pushed to production. The other interesting bit is that everything runs in one mega Kubernetes cluster of around 300 nodes and 20,000 pods. Massive scale — it's a really tricky Kubernetes cluster. And our entire platform is quite centralized and quite homogeneous: all of our Golang services look the same and use basically the same libraries.

So, as I mentioned, deploying at Monzo is bread and butter. We made that process incredibly efficient — as efficient as possible — for engineers, and our deployment pipeline is built in-house. You can deploy in less than two minutes, and that's including building the images. So it's very efficient. But it's something we do so often, on a daily basis, and our deployment strategy was kind of simple: we still relied on Kubernetes native Deployments and used a rolling update strategy.
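For reference, here's a minimal sketch — illustrative only, not Monzo's actual manifest — of the kind of native Deployment setup just described, where a rolling update replaces every pod and nothing checks the release afterwards:

```yaml
# Illustrative baseline: a native Deployment using the rolling update
# strategy. Every release goes straight to all pods; once the pods have
# been replaced, there is no automated post-deploy verification.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: example-service          # hypothetical service name
spec:
  replicas: 5
  selector:
    matchLabels:
      app: example-service
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 25%              # spin up new pods before old ones terminate
      maxUnavailable: 0          # never dip below the desired replica count
  template:
    metadata:
      labels:
        app: example-service
    spec:
      containers:
        - name: example-service
          image: example-service:v1   # hypothetical image tag
```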
When you deploy, there is alerting — automated alerts and also business-level alerts — but engineers kind of had to babysit deployments, looking at Grafana dashboards, which is, as you know, not super interesting. And also, with that setup, every time you did a new release, you released it directly to 100% of customers.

As I mentioned, banking is a regulated industry. Downtime is very costly, financially and in terms of reputation. For all of us in this room, you definitely do not want downtime; but for us there is also regulatory risk. If we're down too often, or if some very important payment systems are down for long enough, that might actually trigger regulatory action, and those are quite unsavory. We want to avoid that.

So we try to stick to best practices, and we have this Swiss cheese approach to change management: we try to catch bad changes at multiple layers — code review, our CI pipeline, a staging environment that is actually quite close to the production environment, and alerting. But we realized that we did not have anything after you deploy. You deploy and then it's on you, except for, obviously, the alerting. So we wanted to make deployments safer, to reduce the cognitive load of deploying, and also to allow for more automation, such as continuous deployment. In short, to make deploying safer and quicker. So we started this progressive delivery initiative. The big idea at the beginning was to move traffic gradually to a new version of the code, and to automatically roll back a bad change if one happened.

I need to do a bit of an introduction to Argo Rollouts. This is stolen from the project documentation: Argo Rollouts is a Kubernetes controller and set of CRDs which provide advanced deployment capabilities, such as blue-green and canary deployments. Argo Rollouts can be used as a replacement for Kubernetes native Deployment objects — it essentially provides more features, those advanced deployment capabilities, such as doing staged rollouts through canarying and things like that. In essence, it can be used either as a replacement for Kubernetes native Deployments or alongside them, by referencing an existing Deployment object.

This slide has a lot going on. What I really want to convey is that you have the rollout controller, which manages the different Rollout objects. The central thing in Argo Rollouts is the Rollout object. Like I said, it can be used as a replacement for a Kubernetes native Deployment: it's where you define your pod template spec and annotations, but also where you define what the deployment strategy should look like — for example, canary: shift 10% of traffic for the first five minutes, then 100% after that. It's also where you can define some analysis. An analysis is essentially a condition using some metric provider — in our case, Prometheus — where you can define rules such as: roll back if that metric is above a certain threshold this many times, for this amount of time, and so on. In essence, that's what Argo Rollouts looks like. If you went to the workshop yesterday, you probably played with it a little bit. And that's basically all there is to it, which is great — a great and simple tool.
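To make that concrete, here is a minimal sketch of a Rollout implementing the example strategy just described — 10% of traffic for five minutes, then full promotion — with a Prometheus-backed analysis attached. All names are hypothetical, and the `error-rate-check` template it references is sketched further down:

```yaml
# Sketch of a Rollout object (hypothetical names throughout): canary
# strategy shifting 10% of traffic for five minutes, then 100%, with a
# background analysis that aborts and rolls back the update on failure.
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: example-service
spec:
  replicas: 5
  selector:
    matchLabels:
      app: example-service
  template:
    metadata:
      labels:
        app: example-service
    spec:
      containers:
        - name: example-service
          image: example-service:v2   # the new version being rolled out
  strategy:
    canary:
      steps:
        - setWeight: 10               # 10% of traffic to the new version
        - pause: {duration: 5m}       # hold there for five minutes
        - setWeight: 100              # then promote fully
      analysis:                       # runs in the background during the update
        templates:
          - templateName: error-rate-check   # hypothetical AnalysisTemplate
```

Without a service mesh, `setWeight` is approximated by scaling replica counts rather than shaping traffic precisely — which is the coarser behaviour mentioned below.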
And why did we decide to choose Argo Rollouts in the first place? Because it was easy to understand and experiment with, very well maintained, and well documented. It has this plug-and-play quality to it, as you can just install it alongside your existing cluster. Our cluster is very big and very complex, with a lot of business logic at the Kubernetes level, especially everything related to security — we have a custom network isolation capability built into the Kubernetes layer at Monzo. But even with all of that, you can just easily add Argo Rollouts, and it doesn't have many requirements to work fine. It works better if you have a service mesh that allows you to do fine-grained traffic shaping, but even if you don't have a service mesh, traffic shaping still works; it's just a bit more coarse. And it's highly customizable: you can define custom rules for each of your rollouts, for each of your services, which is what we were looking for in the first place.

So now let's talk a little bit about our journey: how we got to automated rollbacks at Monzo. The first thing is a bit of a bait-and-switch. I talked about canarying a bit — a canary deployment is when you shift traffic progressively to a new version of an app — and that's what we were looking at in the first place. But then we asked ourselves: is that really what we want? Is that really what we need? Because we would be going from a really simple process — engineers not having to think about how things could be rolled back, with only a couple of assumptions, like only one version of the code running in the cluster at a time — to something much more complex. And there was a clear need to educate engineers across the org and to build the consequent tooling. So: is that really what we need right now? By the way, this illustration was made with DALL·E, which is a game changer when you actually need to put illustrations in your presentation — as you can see, with this kind of wonky giant canary and giant cage.

Reviewing canary deployments, what we actually care about is reducing the impact of a bad change, and what we were really looking for is automated rollbacks. So that's what we decided to implement first. We're using a blue-green-like strategy at the moment: it still uses the Argo Rollouts canary strategy, except you go straight to shifting 100% of traffic to the new version, and then you have this analysis period where you run checks to know whether you need to roll back or not. We did that to allow us to move to real canarying in the future, because we'd already have most of the infrastructure and most of the work in place — going from a rolling update strategy to just automated rollbacks first, and later moving on to canarying.

And this is what the deployment workflow looks like in this new world. An engineer triggers a deployment, which updates the Rollout object, and the new version of the app becomes live. Then, for a short period of time, there is this analysis step where the different rules — we call them rollback rules — are evaluated. If it's successful, the engineer is notified of the success. We do that in Slack, because we use a lot of Slack apps — we have a lot of processes in Slack.
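As a rough illustration of that "straight to 100%, then watch" approach — my reading of the strategy described above, not Monzo's actual configuration — the same canary machinery can be reduced to a single full-weight step plus a pause that serves as the analysis window:

```yaml
# Blue-green-like use of the canary strategy (hypothetical names and
# durations): the new version takes all traffic immediately, then the
# rollback rules are evaluated for a few minutes before completion.
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: example-service
spec:
  replicas: 5
  selector:
    matchLabels:
      app: example-service
  template:
    metadata:
      labels:
        app: example-service
    spec:
      containers:
        - name: example-service
          image: example-service:v2
  strategy:
    canary:
      steps:
        - setWeight: 100          # go straight to the new version
        - pause: {duration: 3m}   # analysis window before the update completes
      analysis:                   # a failed rule here triggers automated rollback
        templates:
          - templateName: rollback-rules   # hypothetical template name
```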
And if one of the rules evaluates to false, we automatically roll back to the previous version, and we notify the engineer and the team. This workflow allows us to choose the deployment strategy per service and per deployment: for each individual deployment you make, you can potentially change what type of deployment you want. Do you want automated rollbacks? Do you want canarying? Do you want something fancier in the future? That's what this flow allows us to do.

Then again, this slide is probably not super interesting right now; it's more for if you're reviewing this talk later and want to look at the details. This is what it looks like from an architecture point of view — I won't mention everything here. Our deployment pipeline is built in-house: an engineer uses a CLI to trigger the deployment, which makes one of the services generate a new manifest and put it in a Git repo, and then another service applies that to Kubernetes and updates the Rollout object, which creates a new replica set, and so on. Basically, this is just to prove that we actually use GitOps. And there will be a complementary blog post about this talk, with all the nice gritty technical details about all the aspects of the work we did, for those of you who love that.

So this is what the tooling looks like. At Monzo, we have this philosophy that backend engineers should not need to know about the platform: they should be able to interact with it, but they should not actually need to know things about Kubernetes like networking or even autoscaling. We have this one tool called Shipper to deploy. The rest isn't super interesting here, but what I wanted to show you is that you deploy just by calling Shipper, giving it a commit hash and a service name, and it deploys it for you. We completely abstracted Argo Rollouts away. Argo Rollouts itself provides great tooling and great dashboards, but we decided not to expose that to our engineers and to abstract all of it away, which required us to integrate with Argo Rollouts in a slightly different way. Same thing here: this is what our Slack notifications look like. When you start a deployment, it tells you that you started the deployment with this strategy, with links to the dashboards and the logs if you want to see what's going on, and then there's the notification of a successful or failed deployment.

So, I talked about rules — rollback rules. We actually started with only one rule, and we try to keep our rules as simple and generic as possible. At the moment, all the services in our backend use the same rollback rules, so they have to be quite generic. We started with: if the total error rate of a service is above a certain threshold, then you need to roll back. As for the rules themselves, we decided to tweak them and add to them based on our incident handling process. Every time we have an incident, or something that looks like an incident, we ask the question: could that have been caught at deployment time? And if yes, could we have written a rule to actually catch that issue?
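As an illustration of what such a rule could look like as an Argo Rollouts AnalysisTemplate backed by Prometheus — the query, metric names, address, and threshold below are assumptions for the sketch, not Monzo's actual rule:

```yaml
# Hypothetical "total error rate" rollback rule: sampled every 30 seconds;
# one bad sample is tolerated, a second failed measurement aborts the
# rollout and triggers the automated rollback.
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: error-rate-check
spec:
  args:
    - name: service-name          # supplied by the Rollout's analysis args
  metrics:
    - name: total-error-rate
      interval: 30s
      failureLimit: 1             # allow one failed sample; a second fails the analysis
      provider:
        prometheus:
          address: http://prometheus.monitoring:9090   # assumed address
          query: |
            sum(rate(http_requests_total{service="{{args.service-name}}", code=~"5.."}[1m]))
            /
            sum(rate(http_requests_total{service="{{args.service-name}}"}[1m]))
      failureCondition: result[0] > 0.05   # roll back above a 5% error rate
```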
The issue with rules, when it comes to tuning automated rollbacks, is that a false positive is annoying for your engineers — the change is rolled back, so the deployment is rolled back. Customers shouldn't see much of a problem, hopefully, but it's really annoying and it's not great developer UX. And you definitely want to avoid a false negative, where you're not rolling back when you actually should.

So the takeaway here: if you're planning to move your entire organization to something more complex like Argo Rollouts, do it in steps, and start as simple as possible. You don't need to go from a simple process straight to something state of the art with a very complex canary strategy. Start with what actually matters for your org — in our case, that was automated rollbacks. And Argo is relatively easy to integrate with your custom tooling, as you can see, because it provides a lot of things out of the box: notifications, the Argo Rollouts controller metrics — like the rollout metrics — and its CLI and dashboard, which are quite useful in the development phase.

Now I'm going to talk a little bit about how you actually migrate to Argo Rollouts. As I mentioned before, Argo Rollouts allows you to do it in an incremental fashion, where the Rollout object references an existing Deployment. The Rollout object essentially defines the strategy and a couple of other parameters, but the pod template spec and everything related to the actual deployment stays on the Deployment object, and the Rollout just references it (there's a sketch of this after this section). This lets you migrate to Argo Rollouts in a safe manner that avoids downtime, because you essentially just bring in a new Rollout object that references the Deployment object. At that point you have two replica sets — one managed by the Deployment resource, one managed by the Rollout resource — and then you can safely decide to set the number of replicas of the Deployment to zero. You end up with just a replica set of pods managed by the Rollout object, and doing it this way allows for a no-downtime migration, which is great. It also lets you easily reverse the process if you don't want rollouts anymore or you're just in an experimental phase.

So that is great, and that is super easy to do — a few kubectl commands, probably two or three minutes, relatively safe. Except, as I mentioned earlier, we have 2,100 applications, so we needed to do that a lot more, and there was a clear need to automate the process. Through this automation we also decided to enforce even more safety, because, as I mentioned earlier, we are a regulated bank that cannot really afford downtime, and part of our company motto is making money work for everyone. There is nothing scarier for our customers than having their card not work at the supermarket because, you know, prod is down. So we wanted to make sure we were extra safe, and that the whole process was repeatable, observable, and idempotent.
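A sketch of that incremental setup (hypothetical names), using the Rollout's workloadRef to adopt an existing Deployment's pod template rather than redefining it:

```yaml
# Migration-friendly Rollout: the strategy lives here, while the pod
# template stays on the existing Deployment, referenced via workloadRef.
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: example-service
spec:
  replicas: 5
  selector:
    matchLabels:
      app: example-service
  workloadRef:                  # adopt the existing Deployment's pod template
    apiVersion: apps/v1
    kind: Deployment
    name: example-service
  strategy:
    canary:
      steps:
        - setWeight: 100
        - pause: {duration: 3m}
```

Once the rollout-managed replica set is up and healthy, scaling the original Deployment to zero (`kubectl scale deployment example-service --replicas=0`) leaves only rollout-managed pods; scaling it back up reverses the migration.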
So this is what the migration looked like for us. First, we notify the service owner on Slack. We create a new Rollout object for the service we're migrating. We freeze deployments, because we want to limit the amount of change while we're doing this process — which is not very long, about two or three minutes per service on average. Then we scale up the rollout. A note about the scaling: if the service has an HPA, we wanted to make sure we scale the rollout up to the max bound of the HPA, just to be extra sure that if there is a surge of load while we do the migration, there's not going to be any problem. Then we check that all the pods we're supposed to see are actually alive and well in our cluster, and only then do we scale the Deployment down to zero replicas, because we know we have enough replicas managed by the Rollout object. We then check that we don't have any pods managed by the Deployment object left, and then, if the service has an HPA, we switch the target of the HPA from the Deployment to the Rollout (there's a sketch of this re-targeting at the end of this section), unfreeze deployments, and notify people of the success — or not. That's basically what the process looks like.

We decided to execute on it by going for a migrator pattern, where all the steps I just described are independent, and the migrator service we're using retries each step if it fails. It also allows for fine-grained scheduling. We had so many services to migrate, we only wanted to do them during business hours, and we wanted to spread the migrations across the entire day, because we wanted to avoid adding too much churn in a short amount of time — basically, to control the amount of risk. We did all that through a migration service, where a user decides which services to migrate, and then a cron every day triggers the migration for those services and spreads them out through the day.

We did it using our service tiering: we went from the lowest-tier services to the highest-tier services — from the least important to the most important. By the time we actually got to migrating our top-tier services, including the ledger service — which is not a big deal, it just manages, like, four million quid, and you don't want that to be down, because if it's down, all the payment systems are essentially down — we had already migrated more than 1,700 services with that process, so we were quite confident it was safe.

So, the takeaway: to make it safe, automate it. If you automate it, you can enforce additional safety rules. Obviously, if you have 10 or 20 applications, you can still do it by hand — it's probably much quicker — but if your backend is big enough, it might be worth adding some automation. The migration is actually what took the most time — the act of migrating — because we had to do it slowly and carefully. So it's worth investing in tooling and monitoring to make sure you can run this process safely. In terms of time spent, probably half of it was just actually running the migration after we'd developed everything.
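Going back to the HPA step in the walkthrough above: since Rollouts implement the Kubernetes scale subresource, re-targeting is just a change to the HPA's scaleTargetRef. A hypothetical sketch:

```yaml
# The same HPA, now pointed at the Rollout instead of the Deployment.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: example-service
spec:
  scaleTargetRef:
    apiVersion: argoproj.io/v1alpha1   # was apps/v1
    kind: Rollout                      # was Deployment
    name: example-service
  minReplicas: 3
  maxReplicas: 20                      # the "max bound" used during migration
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
```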
So, what did we learn moving to Argo Rollouts and running it at this scale? First of all, some numbers, because everyone likes numbers. We have, in our production cluster right now, 2,143 rollouts — 2,143 services running in production, probably a couple more since I made the slides a week ago. We have seven rollback rules, which cover some things that are very specific to us but are generic enough to catch large classes of problems. And we've been running Argo Rollouts in production, for essentially every service, for about four months now, which means we've done 7,000 production deployments with Argo Rollouts, and 22,000 deployments in staging — I told you we do 27,000 production deployments per year, but the truth is we did more than 100K deployments in staging last year. So we've built the confidence that Argo Rollouts is actually capable of operating at this scale. And we've had 63 rollbacks since we started using Argo Rollouts — a little less than 1%. A couple of them were false positives, which is why you need to keep tweaking your rules.

So what about operating Argo Rollouts at scale, which is the whole point of this presentation? There was actually very little for us to do. It just scaled really well with our kind of unconventional architecture — it's very unusual to have one Kubernetes cluster that big with that many rollouts running — but the Argo Rollouts controller did not have an issue with our scale of operation at Monzo, which was quite surprising. We even had a small incident in staging where we did about 5,000 deployments over the course of three hours, and it worked, which is great. Other stuff broke, like Prometheus, because of the churn, but other than that it actually worked quite well at this very large scale.

There's a clear need to share knowledge about Argo Rollouts, because it becomes a new critical component that your ops and platform teams need to understand quite well — as does the rest of the organization, which needs to understand that the process is a bit different now. There is still one little performance issue with leftover AnalysisRun objects, but we're on the case, and we hope to push a PR soonish to actually fix it. And it's important to invest in alerting and observability when operating Argo Rollouts in general, and especially at this scale. The project makes that easy, because it already provides most of the metrics you can think of that you would want to monitor.

Would we do it again, and should you? The answer to the first part of the question is yes, and to the other question: yes as well. For us it was a clear positive outcome. We reduced the cognitive load of deploying, we caught loads of issues across tons of deployments, and we're quite happy with what we got in return — and with how easy it was to set up, given the amount of customization and business-specific stuff running in our cluster. It's a nice addition to our Swiss cheese, defence-in-depth approach to controlling for bad changes. And the most important thing is that engineers actually love it. This is one piece of feedback we got from a backend engineer who got paged for an issue due to a deployment — and it was automatically rolled back in, I think, less than a minute. It would have been a rather nasty problem with the Mastercard 3DS processing on our end that would have rejected every request, which basically means a customer would not have been able to pay with a Monzo card online for the duration of that incident. But that
was automatically reverted within one minute. And yeah — thank you very much for listening. We have some time for a few questions if you have any, and like I mentioned earlier, there will be a complementary blog post with all the actual technical details of how we enabled the different aspects of this migration and our setup. If you have any questions about the gritty details, please grab me after this, or grab me during the conference. Any questions? Yes.

I think I got the essence of what you were asking: about the migration process, when we scale the Deployment and the Rollout up and down?

Exactly, yeah. When you're scaling the Deployment up and down versus the Rollout, what I've heard is that there can be some kind of cost repercussions, because either you have two workloads that are both scaled up fully, or you have to have downtime to scale one down to roll out the Rollout. Was that ever a conversation in terms of that process, especially at the scale you were rolling out these microservices?

Sorry — so the question is: was that a problem for us in general?

Yeah, I guess: was it a topic of conversation, a concern about, maybe, lack of resources from duplicating these workloads?

So it wasn't that much of a problem for us because, due to the nature of what we are — a bank — we run with a lot of headroom in our cluster, so we had a load of capacity left over. The whole point of spreading the migration out across the entire day is to control for those kinds of problems. Sure, you're going to run double the workload for a short amount of time, a couple of minutes, but you spread the migrations out across the day — for example, for low-tier services we did batches of 200 spread out over eight hours, which means one migration every five to ten minutes, and the migration process itself only lasts two or three minutes. So that's how we controlled for it, and that's why we decided to go slow, to be extra safe. But yeah, indeed, it's something to take care of. And that's also the whole point of having this migrator pattern and checking that we have the right number of pods we're expecting, because for very, very large deployments, with like 500 pods, sometimes it would take a couple of retries — it would take time for pods to actually come up due to scheduling pressure. So yeah, that's why.

Okay, thank you.

Seventeen seconds — any questions? Okay, cool, perfect. Well, thank you very much.