I came in from Australia on Thursday, which is why I'm no longer very jet-lagged; I got to enjoy the weekend here. I'm Jason O'Connell, from Macquarie Bank in Australia. I presented at the OpenShift Commons Gathering in Boston two years ago, and at that time we were just migrating our first applications into production, so our whole focus was getting into production and making sure everything was stable and running. After that, we wanted to scale OpenShift out and offer it to every team in the organization, so that any team could use it to migrate to the cloud. OpenShift became a core part of our cloud migration strategy. So today I want to talk about how we industrialized OpenShift and scaled it out as an offering for everyone.

First, a bit about Macquarie Bank. There's Macquarie Group, which is a large global financial organization; Macquarie Bank is a retail bank in Australia. We're a digital-first bank, meaning we have no branches: our customer engagement happens through our digital channels, so digital is very important to us. I've got a few screenshots. This is our personal banking site; we also have business banking and wealth, with a lot of nice features I won't have time to go through. This is the mobile app, and you can see it recognized when I came to Spain; there's a lot of little personalization like that. We also have a nice transaction search: we stamp the location on all your transactions, so you can search for transactions based on where you spent. There are a lot of cool features, but let me get to OpenShift.

Right now we have four clusters, 42 teams onboarded, and 300 applications, so usage has grown a lot in the last year. We're optimized, I'd say, for stateless applications, so generally we onboard microservices. We have a lot of legacy as well, but we focus on a particular use case: these stateless applications. Everything runs on AWS, and we rebuild our clusters every 90 days. That means we build a new cluster from scratch, which lets us patch and upgrade, and then we reinstall all the applications across it. We have two clusters running, and then we cut traffic across. It's mandated that we do this every 90 days, and it means our automation is quite good, because we're practicing it all the time.

In terms of scaling the platform, I've got a graph here of the cost per application versus the number of applications. This isn't the cost of running the infrastructure, but the cost, for the platform team I run, of supporting and migrating each application. When we first went live in production we were getting quite good at onboarding new applications, so our cost was coming down; we had some good automation we could replicate. But things started slowing down as we onboarded more and more teams. Initially, we worked side by side with product teams to onboard their applications. Later, as we expanded out to the organization, we were dealing with teams that aren't even in the same building, let alone the same country, and teams that are less mature, let's say, in their understanding of Docker, where everything was new to them. So we slowed down a lot: we had to support applications in production, we had to support onboarding new teams, and we had to refine our scripts. Naturally, things can slow down like this.
This is a diseconomy of scale. What we really want is an economy of scale: we want to onboard 10 teams, 20 teams, and go from 300 applications to 500 applications seamlessly, with no friction. The platform team should need no extra resources to onboard more and more applications. That's what we're aiming for.

So how do we achieve that? Naturally, we need to automate. But we had automation from the start, so automation by itself wasn't the answer. Initially we automated everything per application: every newly onboarded application got its own deployment scripts. A lot of teams would copy and paste, and eventually you have people who don't even know how those scripts work. They get a lot of problems, and we need to support them. So what we really want is reuse. We don't just want to automate; we want to automate once, for everyone, and have teams reuse those scripts.

To do that, we have to standardize. Rather than everyone doing things differently, we started to say: you have to do things the same way. Although all teams think they're doing things differently, we realized that 90% of the applications we run are Spring Boot microservices, and they're actually very, very similar. So we had to push teams to standardize.

You also need a clear scope for your cluster. We focus on stateless microservices, which means we give no persistent volumes to any team. That's the scope of our cluster, and it makes managing the cluster very simple. For your own platform you might offer a different scope, but we narrow ours, and that allows us to scale. And naturally, we need a lot of self-service for teams.

I think it's very important to have a clear line of responsibility between what the platform team does and what the product teams do. The product teams are responsible for creating amazing products; the platform team has to run a stable platform. Another way of looking at it: the product teams focus on customer experience, and the platform team focuses on developer experience.

Governance is a big part of this; in a bank, governance is very important. We need preventative controls, detective controls, managed resources, onboarding; there's a lot of work we do in governance, and naturally that's a platform responsibility.

Then there are the services I call capability services: things like secrets management (which we're migrating to Vault), chatbots, CI/CD with Jenkins, Knative. These are core capabilities. In OpenShift it's very easy to install some of these tools in five minutes, and development and product teams want to do this, but that would break the principles above: we'd lose standardization, lose control, and things would become messy. So we say that any capability service is run by our team, and we build it properly: we make it multi-tenanted, we address security and risk controls, and we build it for everyone to use.
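As a side note, a preventative control like the "no persistent volumes" scope rule above can be enforced in stock Kubernetes or OpenShift with a per-namespace ResourceQuota. This is a minimal sketch of the general technique, with a made-up namespace name, and not necessarily Macquarie's exact implementation:

```yaml
# Hypothetical quota: forbids any PersistentVolumeClaims in a team's
# namespace, keeping the cluster scoped to stateless workloads only.
apiVersion: v1
kind: ResourceQuota
metadata:
  name: stateless-only
  namespace: team-demo            # assumed namespace name
spec:
  hard:
    persistentvolumeclaims: "0"   # any PVC creation is rejected
```

With a quota like this in place, creating a PVC fails at admission time, which is exactly the kind of automated, self-enforcing control the talk comes back to later.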
Our deployment and release automation follows the same principle as those capability services. As I said before, teams used to run their own deployment scripts, and things got very, very messy. So we said there's going to be one way to do deployments, and we're going to write those scripts. What that meant is that early on, when we upgraded from OpenShift 3.5 to 3.7, there were some breaking API changes, and we would have needed 20 teams to update their scripts; if one team can't upgrade, the entire upgrade is held up. By moving the scripts back into our team, we're in full control: upgrades happen seamlessly for teams, and we can change our scripts and roll out new offerings without them even realizing it's happening.

So now I'm going to show a demo. I wanted to do a live demo, but I was playing around last night and our VPN into Australia was quite slow, so hold on a second: I pre-recorded it last night at 1 a.m. What I've got here is a demo application that we created. Sorry, that's the end of the demo; give me one second. All right, so we've got a demo application. We had to create one; sorry, I can't show you any of our real applications. This demo application is broken: you can see the spending overseas is not showing anything. What I want to do now is fix this application, so let's see how we fix it.

I go into our OpenShift portal. This is the self-service portal that we built; you can see it looks a lot like OpenShift, because we use PatternFly, the same UI framework OpenShift itself uses. This is what all developers use to interact, do deployments, and run self-service functions. You can see the sort of self-service functions we've got: user access management, deployments, application onboarding, vulnerability scanning, Pulse (synthetic testing), capacity management, and so on. Everything self-service goes through this single portal.

What I want to do here is look at that broken application. It's deployed here, and we just called it "broken". To make a change, I branch it. You can see it's not just one application: usually microservices work together to form what we call a product, so we never deal with an application individually; we group everything together. The application you saw before, the product, is three microservices and five sets of config. What I can do here is take that whole broken environment and branch it. If you did this in production, you could branch the current production environment and bring it into test. I've just called the new branch environment "feature-jason-fix". What's happening now is we're getting an exact copy of that environment; if we copied a production environment, it would be exactly the same, except the config would point to non-production. So it's just running through the deployment.

I noticed earlier, when everyone was asked to put up their hands on whether they like Jenkins or not: we had a lot of trouble with Jenkins, so we got rid of it. We now do all deployments with our own custom tool. All of it has APIs, so developer teams can integrate end-to-end pipelines with it; they don't need to use this UI. You'll just see it deploying into OpenShift here. The cool part is that an entire environment is defined in YAML files in git.
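As a rough illustration of the shape such a per-application file might take (the schema below is a guess based on the description, not the actual format used):

```yaml
# Illustrative per-application definition (hypothetical schema).
# Everything not specified here falls back to platform defaults:
# replicas, health probes, routes, logging, and so on.
name: transactions-api
kind: spring-boot                   # 90% of the apps fit this profile
image: demo/transactions-api:1.4.2  # hypothetical image tag
resources:
  cpu: high                         # the only override this app needs
```

The idea behind defaulting everything is that the file records only what is special about the app; the platform owns the rest.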
So we've created a branch automatically in git, and what we want to do is update these files to make the fix. You've got a file per application, and if you look at one of them, they're quite simple; a production application isn't much more complicated than this. We try to default everything, so teams don't explicitly define or override anything unless they need to. This one is just a Spring Boot application that needs high CPU; there's not much more to it. Everything that goes into an environment, from non-prod to prod, is in this repository: one repository per product. You can see my config map variables here; usually you'd have dev, test, and prod. We didn't create those for the demo, but you'd have all of your configuration for every environment in there.

So we're just going to edit this file and make the change, from "broken connection" to "valid connection", and commit it. Again, if you were changing application code, you would update the version tag of the application in the repository.

Now I go back to the deployments after making that change, and what we should see pop up here is the original deployment I did when I branched earlier, and then a new one that got kicked off from a webhook. What's happened is that as soon as I made the change, it's syncing up and applying that change, and that change alone, to the environment. We don't use this in production, of course, because there are more controls to adhere to, but in non-prod it's very powerful. It tells you the change we made: as you can see, it says it's going to redeploy that application and apply that config, because of the change we made. It means everything is kept in sync between the definitions in the YAML files and what's actually in the environment. And we can go into OpenShift here and see the new pod has just come up.

So that's all great, we made a fix, but now we need to apply the fix to the actual customer traffic. You can see it's not applied here yet, because I've created a whole new environment. For that, we do canary releases with a tool we built on top of Apigee. Hold on, this is a bit quick. In this tool we can say we want to move from one backend to another, and we can move by probability for a canary release, which I'll talk more about later. Here we're just going to flick all the traffic across. In the future, we're moving all of this canary-release capability away from Apigee and into Istio.
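For reference, this is roughly what such a weighted canary looks like once expressed in Istio; a minimal sketch with made-up names, not the actual routing config:

```yaml
# Hypothetical Istio VirtualService: send 90% of traffic to the old
# environment and 10% to the newly branched one, then shift weights.
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: demo-product
spec:
  hosts:
    - demo.example.com             # assumed customer-facing host
  http:
    - route:
        - destination:
            host: demo-product-v1  # current environment
          weight: 90
        - destination:
            host: demo-product-v2  # new environment with the fix
          weight: 10
```

Shifting the weights from 90/10 to 0/100 is the "bleeding traffic across" described next.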
It's much cleaner with Istio, using operators, but we've been using this capability with Apigee for two years, and it's allowed us to release much faster. So let's see if the change worked. And there, it worked. You can see the URL at the top is the same; that would be the customer-facing URL if this were prod, but we've moved everything in the backend behind it.

Let me just go back to the presentation. When we do releases, we version an entire environment, a whole product. When we do a deployment, even changing one or two things, we build a new environment, then we bleed traffic across: a canary release. This is great because Macquarie has a lot of channels and a lot of partners connecting through to these banking APIs, and we don't want to make a release that affects every single partner at the same time. That's what happened five years ago, when you couldn't release often because you had to line up your partners and get them to agree on a date; now we can release anytime. I could bleed across our mobile channel first. We also have open banking APIs, and those partners could make a change the same day if they wanted: they've done their testing, they go straight in, and they know they're only impacting themselves. Then we can remove the old namespace and bleed the remaining traffic across. We have full capability to bleed traffic, which allows us to do targeted releases and to release frequently.

And importantly, the fact that the entire product is deployed together, all of those microservices and all of those dependencies, means we can effectively promote environments from test through to our lab, beta, and production environments. Our lab environments are production, except they're cut down in terms of the users who can use them: only the digital team. But it's still real money; a payment is still a payment. That allows a lot of experimentation in those environments, and teams can spin up as many lab environments as they want for different features. Beta environments are opened up to staff and some public beta testers, and we get a lot of feedback from our internal staff on new beta features or on anything that has regressed. So you promote the full set through from test to lab to beta and into production.

You'll probably hear about GitOps; I think it's getting really popular. Everything we do is GitOps-based. I wanted to demo this, but I'm not going to have time. Our user access management is a pull request, made via that portal. Resource management, application onboarding: everything is done with pull requests. The nice thing about that is every team can attach approvals by approving the pull request, and we get full auditability of everything.
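As a sketch of what pull-request-driven access management can look like (the file schema below is illustrative, not the actual format): the portal edits a file like this in git and raises a pull request that the right owners must approve before automation applies it.

```yaml
# Hypothetical access definition for one product, stored in git.
# The portal raises a pull request against this file; automation
# applies it to the cluster once the product owner approves.
product: demo-product
teams:
  - name: demo-dev
    role: edit      # may deploy to non-prod environments
  - name: demo-ops
    role: admin     # may approve production releases
```

The approval on the pull request is the control point, and the git history is the audit trail.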
Now I just want to talk quickly about controls. If you look at a pod, there are a lot of things that surround it: the pod is part of an application, which rolls up into a product; there are teams that own the pod or the application; you've got user access management; you've got resource management. For us, being a bank, we need controls on all of these things, and you need different types of controls at each level. I can't go into all the controls, except to say that if you want a scalable platform, any control you put in needs to focus on developer experience. Your controls need to be frictionless, automated, and self-managed. If you block a deployment on release day for a team and they don't know why, they're going to come to the platform team, to my team, and make a big noise, and then there's a lot of work to get them unblocked. There's a lot of work in managing even one control.

I've just got an example here of controlling resources: this is where a team requests the CPU and memory they get allocated. We need incentives for them, or disincentives in this case: we say that if you're over your capacity, we're going to block your deployments until you bring it back under. That's a disincentive that forces them to adhere to the control. To ensure they optimize, we need chargeback, so if they ask for more CPU and memory, they get charged; if you don't put in chargeback, they just ask for more and more. We need them to be able to self-manage, so we give them auto-cleanup in non-prod: they can say these developer environments get cleaned up nightly, these environments last longer, and so on. If they do want more resources, it should be self-approved, so they don't need to involve our team. They also need visibility of their current usage on a dashboard. And we have email alerting, and we're just building out chatbot alerting, so that when they approach their limits they're warned early and know they're not going to be blocked in production. So you can see that just for this one control to be seamless and frictionless, we had to build six different components and applications, just to manage the control of resource usage.

So, to wrap up: I saw this quote from Adrian Cockcroft about a month ago, and our principle is the same. The product teams should focus on everything inside the container, on the business logic and the business value, so that eventually all the code they write is business logic. And the platform team focuses on the developer experience; really, what we're there for is improving developer productivity. So yeah, I think that's it. Thank you.