So today, I'm here to talk to you about an implementation and partnership between ThoughtWorks and Ritchie Brothers. ThoughtWorks is the company that I work at; Ritchie Brothers is our client. The story I'm going to tell you is about two use cases that we built on top of an engineering platform. But to get you there, I have to tell you a little bit about what we built and why. So we'll spend the first half on context and the second half on the two cool use cases.

First up, there was supposed to be another person up here with me. He had a medical emergency and can't be here. His name is Ranbir Chala, and he's an SVP at Ritchie Brothers and an ex-ThoughtWorker. He's been working on platform engineering for years, long before it was even called that, so it's been great to have him as a stakeholder. If any of you are going to QCon in London, he's doing a one-hour deep dive on this platform as well, so please check that out if you're going to that event.

Then me. My name's Brian Oliver. I'm a principal at ThoughtWorks. We're the people with the books, if you're familiar with Martin Fowler, and I'm working on one myself with O'Reilly on cloud native delivery patterns. I've been speaking around the world: Open Source Summit in Japan last year, cdCon in Vancouver, all kinds of fun stuff. You can check me out at those sites.

So, why we're here. First of all, who is Ritchie Brothers? Ritchie Brothers is an auctioneering company. They sell, and, well, they do all kinds of services with heavy construction equipment, mining equipment, forestry equipment, just about anything you can think of. They're essentially the largest auctioneering and service broker in the world for heavy equipment. Think million-dollar cranes; they help you sell all that kind of stuff. These auctions happen all over the world as well. They have auctions in Japan and Germany and the United States, just about anywhere you can think of that would need heavy equipment, they're probably involved in having sales in those places.

Some of the problems this company has had, and it's a very old company, are self-inflicted. They've acquired a lot of companies over the years. These companies have all kinds of different services and platforms that they've built and tried to integrate together, and what they end up doing is creating a lot of parallel services that are doing the same thing. That's ended up with a really brittle architecture. So they brought ThoughtWorks in to build a platform, as well as a marketplace for all of their services.

Some of the challenges they were trying to solve: these disparate acquisitions were not being integrated, just deployed side by side. Their architecture required linear growth in personnel, and that is not a joke: every time they needed to scale up, they would have to add more people for every single service. Yet they were still facing record growth despite that inability to scale. And they needed to be able to domain-bound their services and APIs.

So, to talk about what we did there, and to get you to why I'm talking to you here, I'm gonna briefly tell you about the engineering platform we built at Ritchie Brothers. This is the shortest version of it ever; we've given hour- to two-hour-long talks on what we've built there. So, very high level, we have four things we'll look at: the principles, numbers, design, and scale. By numbers I mean how many people, apps, et cetera.
So, the principles we built this platform with: you're gonna hear about a lot of these throughout the rest of the day, and we actually heavily align with those principles. But the three I would highlight for the platform we built there: first, the platform is a product. We treat it as an engineering product: we have a backlog, we do not do operations and support requests, we do feature management, all that kind of fun stuff. There are several talks on that topic later, so we'll leave it there. Second, building an engineering platform is an exercise in software engineering, not an exercise in operations. And third, our highest priority is removing developer friction.

The first number, and the most important one: if you're a brand new team at this company, you can deploy to production in less than an hour, from zero to full production. That is tested, that is measured daily; that is a real metric that we track. The second metric is probably more interesting to some of you: thousands upon thousands of deployments per month, over 32 clusters globally, dozens of namespaces that represent individual teams, thousands of pods, thousands of deployments. Fairly large scale, just to give you an idea of the context we're dealing with these problems in.

From a design perspective, I don't expect you to go through and understand all of this; check out Ranbir's talk at QCon later if you get a chance. But just to give you a sense of what we're doing with this platform: some of those interfaces we talked about in the last talk are present. We have CLIs, we have APIs, we have custom Kubernetes operators, we have starter kits, we have templates. We built all of these different interfaces for the engineers on our platform so that they can consume and use this platform. Within the platform, we're also using things like admission controllers and custom operators, and we'll get into some of that in a little bit.

This gives you more of a deep dive into some of that architecture. Again, I'm not gonna go too far into this; it's just giving you the context of the scale and the complexity of the platform. We use a sidecar model for things like OPA and Istio so that we can connect services across the globe. We do admission-controller deployments within these environments. The teams manage all those resources inside their namespace themselves; we just provide self-service APIs for them to do so.

And lastly, this platform is completely global, meaning we have multiple regions deployed; these are the current ones, with plans for more. Teams can opt in and opt out of these regions, and that's by design with our self-service APIs. What I mean by that is, if a team has residency or data requirements, they can choose to have services deployed in US West or US East and not in the EU, or vice versa. Our global network handles that for them, but we let them make those choices, between the Istio service mesh and our global network.

So now we can get into the two things we built on top of this platform, the things the platform enabled us to do. The first one is a concept we like to call compliance at the point of change. If some of you are familiar with the traditional DevOps pipeline, it might look like this. It's probably gonna look more like this: lots of stoplights and hand-raises, meaning various different teams are gonna stop you from doing certain things, and you don't really own your own pipeline.
We've completely flipped this on its head with our platform, and the way we've done that is by separating the evidence of compliance from the doing of the compliance work. What I mean by that is, the development teams are responsible for scanning their resources, scanning their artifacts, or doing whatever it is they're required to do in order for a service to be deployed into our environment. Our environment handles the verification of that work before it gets deployed, and we handle this at the boundary of the environment, meaning an admission controller sitting on top of the cluster.

What this does is it means that developers now own their pipelines 100% outright. They can do whatever they want with them. We do not even have administrative access on any of their pipelines, because all of the compliance work is handled by them, and we're just verifying it at the actual boundary of the cluster. This is really important in that global context: when we have over 30 clusters, with plans for dozens more, we don't wanna own any of that stuff. We just wanna verify that it was done, at the individual cluster level.

To give you an example of how this works: you could, for example, use Gatekeeper with OPA, something like that. It would sit on the boundary of your cluster, verify that the compliance work was done with something like Rego, and then either allow or deny that deployment into the environment. You can do this with a lot of different services; we just happen to use OPA and Gatekeeper for ours, and we do this across the globe for our entire platform.

So, an example: teams can use Snyk to scan for CVEs, right? Then what we can do is go and look up that CVE work done by the team and make sure they've actually checked off and verified that application. We could also pass that into a bill of materials, for example, and verify that, say, via Gatekeeper. And I think we have that example here, yeah. So here we could say: you've done a scan on your application, passed the result into your bill of materials, and then our admission controller is gonna either let that deployment in or reject it.

Another case where this is useful at Ritchie Brothers, because of that global scale: there are still some teams that haven't adopted end-to-end testing, especially post-deployment. So one of the things we enable them to do is use JIRA as an approval hook for product owners. For example, say all of the work's been done and you're ready to go to production with your service across the globe. Well, the admission controller isn't gonna allow that deployment to pass until it actually has a sign-off from a product owner, and that's happening at the actual cluster level: you have an admission controller reaching out to the JIRA API. Not ideal, but until you get those end-to-end tests, or maybe canaries, in place in your environment, it's a good stopgap.
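To make that concrete, here's a minimal sketch of a Gatekeeper ConstraintTemplate and Constraint of the kind described above, assuming the scan evidence is surfaced as annotations on the Deployment. The annotation names, domain, and status values are hypothetical; the real platform verifies reports pulled from an artifact registry, which this simplified Rego check doesn't capture.

```yaml
# Hypothetical Gatekeeper ConstraintTemplate: reject Deployments that carry
# no evidence that a CVE scan was performed and passed. Annotation names
# are invented for illustration.
apiVersion: templates.gatekeeper.sh/v1
kind: ConstraintTemplate
metadata:
  name: requirecvescanevidence
spec:
  crd:
    spec:
      names:
        kind: RequireCveScanEvidence
  targets:
    - target: admission.k8s.gatekeeper.sh
      rego: |
        package requirecvescanevidence

        violation[{"msg": msg}] {
          # Deny when the workload carries no scan-report annotation at all.
          not input.review.object.metadata.annotations["compliance.example.com/cve-scan-report"]
          msg := "deployment has no CVE scan report attached"
        }

        violation[{"msg": msg}] {
          # Deny when a report exists but its recorded status is not "passed".
          status := input.review.object.metadata.annotations["compliance.example.com/cve-scan-status"]
          status != "passed"
          msg := sprintf("CVE scan status is %v, expected passed", [status])
        }
---
# The Constraint instance that applies the template to Deployments.
apiVersion: constraints.gatekeeper.sh/v1beta1
kind: RequireCveScanEvidence
metadata:
  name: deployments-must-have-cve-evidence
spec:
  match:
    kinds:
      - apiGroups: ["apps"]
        kinds: ["Deployment"]
```

With something like this in place, the deny decision happens at the cluster boundary. That's the compliance-at-the-point-of-change idea: pipelines stay team-owned, and the cluster only checks the evidence.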
The second use case I wanted to talk to you all about today is operators, and it's always a contentious topic, talking about building operators for a platform, because operators are a fairly heavy solution. If any of you are familiar with, say, the Operator Framework, you can write them in Go or Helm, and they take quite a bit of time, because you're dealing with things like event loops, continuous reconciliation, and all these other Kubernetes concepts that are much more low level. So it's definitely a choice you have to make, and you have to have reasons to make it. You don't build a DynamoDB operator just because you're trying to abstract something; there should be a deeper reason behind it. And in our case, we had multiple deeper reasons.

So for today, what we're gonna talk about that this platform enabled us to build is an S3 operator, an IAM operator, and a DynamoDB operator. The reason we chose to build these operators is we basically just didn't want all of our developers to have access to our cloud environments. Ritchie Brothers being a heavy acquisition company, they have GCP, they have Azure, they have AWS, they have Oracle, it's all there. AWS is currently the vendor that we're building the engineering platform on top of. However, we wanted that flexibility for our developers if we were to begin moving to other clouds like Azure or GCP, so we're trying to build concepts into our engineering platform that are transferable to those other clouds. We'll look at a few examples around that. But the key takeaway is that you're abstracting away certain details the developer may not care about, like IAM policies or roles, et cetera.

If you take the role, move it to the platform domain, and then give your teams control of that role, there's a sort of interesting dynamic switch. What I mean by that is, you no longer have teams sending tickets in asking for roles to be created, for policies to be created, or any of that other stuff you have to do on Amazon. You don't have to give them access to your Amazon cloud console anymore, and they don't have to write Terraform anymore. This was really powerful, because we had hundreds of developers that did not wanna do any of that stuff, but they needed DynamoDB, they needed S3 and some of those other services. So we took the concept of the role, moved it into the platform domain, and abstracted it, but in a way that made sense: we weren't trying to hide it completely, it's not magic under the hood, we want them to understand what's going on. We just wanna make it an abstracted concept that's reusable across the platform, but still in a way that makes sense.

What I mean by that is, instead of, say, an IAM role in AWS, they're gonna create something called a service role in our platform. A service role can be assigned to any resource, like DynamoDB or S3, and they can create those same resources with the DynamoDB or S3 operator. So effectively, all they're doing is saying: give me a service role from the platform (they're just deploying that via, say, Helm, as a CRD); give me a DynamoDB table, specifying the regions they want it deployed in, us-east-1, us-west-2, et cetera; and maybe an S3 bucket. And all they're gonna do in those CRDs is say: this is the role being attached to those resources. Then, inside of our environment, the operators handle attaching policies as well as service accounts to their workloads for them, but the teams do have to specify that role. If they just deployed some application and tried to access Dynamo without assigning this role to their workload, nothing's gonna happen.

So, to break this down a little more, let's say our developer is going to create a DynamoDB table.
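As a rough illustration, the CRDs described next might look something like this. This is a hypothetical reconstruction: the API group, kinds, and field names below are invented for the example, since the talk only describes what's on the slides.

```yaml
# Hypothetical reconstruction of the platform CRDs described in the talk.
# The API group, kinds, names, and fields are invented for illustration.
apiVersion: platform.example.com/v1
kind: ServiceRole
metadata:
  name: checkout-service-role
  namespace: team-checkout
---
apiVersion: platform.example.com/v1
kind: DynamoDB
metadata:
  name: orders
  namespace: team-checkout
spec:
  # Regions the table should be reconciled into; teams opt in per region.
  regions:
    - us-east-1
    - us-west-2
  # The service roles whose workloads may access this table.
  allowedServiceRoles:
    - checkout-service-role
---
apiVersion: platform.example.com/v1
kind: S3Bucket
metadata:
  name: order-exports
  namespace: team-checkout
spec:
  allowedServiceRoles:
    - checkout-service-role
```

Behind a set of resources like this, the operators do the cloud-side work the talk describes: create the table and bucket, create a Kubernetes service account, and bind the IAM policies to it, so the team never touches the console or writes Terraform.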
They're gonna do that with a CRD, and then they're also gonna create a service role, again with a CRD. Now, when they create that DynamoDB table, going back to our slide for a moment, if you look at that Dynamo CR, you can see where we've set allowedServiceRoles at the bottom, right? So we're attaching our service role to our DynamoDB table.

On the Kubernetes side, again, globally, in every single cluster this is being reconciled within, all we're gonna be doing is creating a pod, maybe some secrets and config, and a service account. The operators are creating the service account and attaching IAM policies to that service account. If you're familiar with EKS, it has this thing called IAM Roles for Service Accounts, where effectively you can take an IAM role and assign it to a service account. Azure and GCP have similar concepts, so we could take our IAM operator, move it into those clouds, and just change the implementation layer to match, meaning you might be dealing with an Azure AD role, or maybe a service account or service principal on Google, whatever. It doesn't really matter, because our developers are still just gonna use the same concepts on our platform across those different clouds.

To give you more of an example of the flexibility: let's say they then also decided to add an S3 bucket. What do they need to do on the Kubernetes side? Absolutely nothing. There's no change here. All they needed to do was assign that service role to the S3 bucket, and on the Kubernetes side our operators are gonna reconcile everything for them. It's just gonna add maybe some secrets into their workload, and the service account is already gonna have access to that role immediately after the assignment. So it adds a lot of flexibility for our teams, just from this change.

The reason we decided to build this out is the scale of this company. If you think about where our users are interacting with our platform, as well as where our developers are interacting with our platform (and I had some fun with Keynote here), it's in all these different places. So we have to handle that from both an interaction standpoint and a data standpoint. What I mean by data standpoint is we're gonna have users talking to DynamoDB tables and databases that are potentially set up with certain rules and regulations, meaning you might have an auction in the UK where you want the writes to go only to that region, but you want all the data replicated globally. Well, with our Dynamo operator fit to that business need, we're able to meet that for them.

All right, so I've got a quick demo we can do here. It's to show you some of the observability we've built into these tools, and then we can do some questions if you want. So this part is using Honeycomb. Or a blank screen; let's change that. I guess I need to turn screen mirroring off, hold on, mirror, there we go.

Part of this platform, another thing that we do, is we want those operators to be easy to use. One of the other chief complaints about operators, when you build them, is that if you have a globally distributed platform and you have all those operators reconciling developer CRDs across the world, how on earth are developers gonna debug those operators across the world? So what we do is we automatically instrument all of our operators with OpenTelemetry and pass that into Honeycomb.
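As a hedged sketch of one common way to plumb that (not necessarily how this platform does it): the operators export OTLP traces to an OpenTelemetry Collector, and the Collector forwards them to Honeycomb. The endpoint and header below are Honeycomb's standard OTLP ingest settings; the API key is a placeholder pulled from the environment.

```yaml
# Sketch of an OpenTelemetry Collector config that receives OTLP traces
# (e.g., from instrumented operators) and forwards them to Honeycomb.
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317

exporters:
  otlp:
    # Honeycomb's documented OTLP ingest endpoint; the x-honeycomb-team
    # header carries the API key, supplied here via an environment variable.
    endpoint: api.honeycomb.io:443
    headers:
      x-honeycomb-team: ${env:HONEYCOMB_API_KEY}

service:
  pipelines:
    traces:
      receivers: [otlp]
      exporters: [otlp]
```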
So when our developers are trying to debug, say, an operator reconciliation, they don't need to go and figure out which cluster it happened in or what have you; they can just go into Honeycomb and find it there. And that's just using OpenTelemetry traces and spans.

For example, this is a DynamoDB operator change where a team was trying to update the TTL, the time to live, on a DynamoDB table. Typically, when you look at traces, you think HTTP, step by step, sort of a web request. These are actually the functions within an operator: the actual calls happening inside of our Go operator's reconciliation loop. And you can see there, at the very end, that the TTL has been modified too many times. So maybe they're just trying different things, and they're doing it too quickly, and Amazon has a limit there. But they can quickly get to that problem without having to go and find out which cluster it happened in, which region, or whatever. It's all already there. Super powerful.

The other one I wanted to show you is our IAM operator. This one was fine, there were no errors, but it's another case of making a change to one of those service roles, where maybe the team was updating read or write access to their Dynamo service or their S3 service. So they're updating and making changes there, but they can step through those changes quickly, instead of trying to go into kubectl and find that across 32 clusters. Not fun.

And that is it. Questions? Or, thank you first. Thank you. Any questions? One question.

Thanks for the talk. Can you share with us the size of the team that is managing such a platform?

That might be the best part. The team in the US is five people, and we also have four colleagues in China, so a total of nine. And the number of developers is 1,000, maybe 2,000 developers. It's a lot.

OK, perfect. Can I ask a second one? Regarding the pre-hook validation, especially when you mention JIRA, for example.

Sorry, it was hard to hear you. Can you say it a little louder?

When you mention JIRA, for example, in the pre-hook validation on Kubernetes: what happens if the link to JIRA is broken, or not working, on your Kubernetes cluster?

If the link to JIRA is broken? That's happened a couple of times. The teams have control of their actual policies. All of the policies they're using for those admission controllers, the developers own, unless it's the platform-specific ones like CVEs or static analysis. But the JIRA ones are configured by them, meaning that's a choice they make. And they sync them with GitOps, so they can just go in and disable that policy until JIRA is back up or whatever is going on there.

OK, thanks.

That's exactly what they did, by the way. Any other questions? Oh, good one. Get you a workout. There we go.

You work with highly autonomous teams that actually own their own pipelines. Was there any hesitancy or resistance to taking more responsibility over their own work, rather than the traditional ops throwing-things-over-the-fence approach?

Yeah, this has been a multi-year journey, and there was plenty of resistance at first.
What we found, at least in this case, was that even though there were lots of teams coming from different companies, from the various acquisitions, they all settled on this interface of owning their pipelines, deploying with Helm, and us providing them best practices through applications that we maintain and deploy as a reference. They slowly became comfortable with that. Then we started building CLIs and self-service APIs into the platform, and they consumed those with the pipelines they own, in their Helm charts. But yeah, there was definitely friction at first. As time went on, now everybody's just in love with it and asking for more.

Don't fall over. How do you handle different environments? If a team wants to spin up their own environment to do some experimentation, or they want to play around with a third party but don't want to do that in prod or something, can they go and create their own environment that is then isolated to a region? How does that work?

Yeah, they can totally make their own environments. They can spin up their own namespaces that are outside of the standard set, and those can even be production namespaces if they attach them to the ingress gateways that we provide. They totally have the autonomy to do that. The only thing that would stop them is the admission controllers that are looking at the CVEs and those third-party tools; if they find anything, they're not going to let those through, so the team would have to go and reconcile and fix that. But yeah, they totally have autonomy to go in there and do that, and they can choose which regions to do it in as well, whether that's within a sandbox set of clusters or all the way up to production if they wanted to.

Thank you. It's really interesting, the operator part where you can assign roles. Do you have any ways to control the approval? Like, say, which team can access which resource?

Yeah, so that's actually driven by a Teams API that we built for the platform. The teams have control of their own namespaces, and they can assign other teams to have access to those namespaces. They can then give those teams access to their services with the service role. So it's completely self-managed by them, with their own governance policies.

Hi. How do you manage deletion of custom resources? Can users just wipe a database by removing the definition?

Good question. With Dynamo, we have a 30-day retention policy, which I think Amazon does as well, so they can't just wipe out a production database immediately. There are also finalizers on that Dynamo operator specifically that would prevent a short-term deletion.

There's another one over there. Sorry if I don't look at you, it's the light.

Hi. How do you manage the lifecycle of CRDs and operators in general? Because CRDs are open to upgrade.

Open to, sorry, what was that last part?

Just, how do you manage the lifecycle of your operators? Because it's an API, right, for your developers?

You mean how do we release the operators themselves? Oh, good question. Initially, we developed our own multi-region pipeline tool that dynamically generates CircleCI, but it's quite hairy and complex to do it that way. So we're actually shifting to using Karmada, an incubating project in the CNCF, to help us release those operators across the globe. You can specify a propagation policy and say which regions you want the operators to be updated in.
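For a flavor of what that looks like, here's a minimal Karmada PropagationPolicy sketch: it tells Karmada to propagate a named operator Deployment to a chosen set of member clusters. The cluster names, namespace, and the operator's name are made up for the example.

```yaml
# Sketch of a Karmada PropagationPolicy that rolls an operator's Deployment
# out to selected member clusters. All names here are illustrative.
apiVersion: policy.karmada.io/v1alpha1
kind: PropagationPolicy
metadata:
  name: dynamodb-operator-rollout
  namespace: platform-operators
spec:
  # Which resources this policy propagates.
  resourceSelectors:
    - apiVersion: apps/v1
      kind: Deployment
      name: dynamodb-operator
  # Which member clusters receive them.
  placement:
    clusterAffinity:
      clusterNames:
        - us-east-1-prod
        - us-west-2-prod
        - eu-west-1-prod
```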
And then the operators run canaries on themselves once they get inside the cluster, before they're actually able to start reconciling the current CRDs. Any other questions? No questions from this side? From this side. Thank you.

Hi. It's very appealing, what you're saying about these self-managed things, that developers can do everything they want. But I work in a relatively small company with a traditional SRE team, where the SREs know everything the developers do. They're very involved in their work, and they know every piece of the Amazon stuff the developers use, and that's really important to me. How does that connect with platform engineering?

Yeah, so we have a one-to-one relationship with the SRE team. We actually think that's fairly important. They are just as in the weeds as the platform team at this company. They're also just as small as we are, too. But when we're implementing features, say at the operator level, they're right there with us working on it, or at least taking a look at how it works so that they understand it. It's definitely a partnership, yeah.

Hi there. Sorry, this might be super basic, I'm a novice here. In your diagram, you had the security scan at the start and then the policy block at the end, and I don't really understand what's happening with that blocking part at the end. I'm assuming it's not just the same scan.

Sorry, you got a little quiet at the end there.

Sorry. You had the pipeline with the security scan at the start and the policy blocking at the end. I don't really understand what the policy blocking at the end is. It's not just the same scan again, I'm assuming, to block it. So what is it?

Yeah, so they're pushing a CVE report into an artifact registry. Then we're verifying the authenticity of that CVE report they generated, as well as its recency, its timestamp, effectively. So if they've generated a new one, we'll verify it's them, and then we'll use that as the most current report.

Another one? Hello. Do you have some shared resources, for example, inside one team or between teams? How do you manage that? Because for us, for example, it's a very big problem.

Yeah, so there are some teams that ended up becoming mega-teams, as we call them. There's one team that is 130 people, which, I don't think that's a team anymore, that's a company. And the projects that live within that team, it's interesting, have kind of become an inner-source-like service. There are only a few of them: their account service and a team service that they manage. It's mainly the org service, where they're basically registering the data of their users, as well as the organizations that live within their databases. If you can imagine, Ritchie Brothers has hundreds of businesses they interact with, so they have an API for all the organizations they manage within that platform. Those services are managed by that, essentially, mega-team. And the way they've handled it is doing pull requests and self-reviewing, and that's worked fairly well for those larger services. But they're starting to figure out a way to maybe break some of them up and self-manage them as smaller teams. When that occurs, we treat it as a smell, if you've heard the term, and we'll then go through an exercise of event storming and domain-driven design to help them understand why that team has gotten so large, and to break it down into smaller teams.
But usually we just try to keep track of the size of teams and make a suggestion. Ultimately, it's up to them. Any other questions for Brian? OK then, big round of applause for Brian. Thank you very much. Thank you.