Thanks for coming to my talk after lunch. Hope it was a good one. Today we're gonna talk about an idea for multi-region, multi-primary, and asynchronous globally distributed APIs. We're gonna keep this as simple as possible because it's a very complex topic, but I hope it gives you some inspiration and ideas on how you could accomplish this. So who am I? I'm Brian, I'm a principal engineer on the platform team at ThoughtWorks. I'm based in Atlanta. You can find me at this website or on GitHub. Fun fact about me, I'm currently working on a book with O'Reilly on cloud native delivery patterns, which is something I talked about last week in Japan. And before I get started, I wanted to say just one thing about this year because it's been kind of crazy. My daughter was born in January. Obviously working on a book with O'Reilly. I got to give a keynote in Vancouver, give workshops in Minneapolis, attend the KubeCon Contributor Summit, and then last week I spoke at Open Source Summit Japan on the CD track. Kevin, who was on stage earlier up here, also spoke on that same track. So that was an awesome event. And then I don't have any fun jokes for today, but I did take a really cool picture in Japan. That is not painted. That is in a Buddhist temple in Kyoto where they do illuminations for the fall leaves and such. So pretty fun little trip. So agenda, we're gonna talk a little bit about why and who this sort of thing applies to. We'll talk about some requirements. We'll go through a policy-based example. We'll look at Istio patterns, and then we'll have a summary. So the cases where I think this applies the most are financial and auction-based systems. However, there are probably other use cases that we're not focused on here. These two just make a really good way to explain this topic, so it's really helpful to dig through it that way.
So when we talk about financial systems and trading in a multi-region context, there's a few things that we have to think about. The first is that they're usually time-sensitive, especially with analytical or read data. There are a lot of regulatory requirements for data writes, and what I mean by that is there are often regulatory requirements for where your data gets written and how it gets written there. And usually with these types of systems, delays or issues with querying lead to millions of dollars of losses, if not more. Another item that we have to think about with these types of systems is when we look at, say, trading across the globe, which is happening more and more, we have traders that are operating in one part of the world and operating on trading systems that are in multiple parts of the world. So we then have to think about policy and routing for these types of things, as well as the regulatory requirements we talked about. In the auction-based systems context, we have some similar requirements actually. Writes need to be fair. At ThoughtWorks, we have auctioneering companies that we work with where they will have an auction that is on site, but then they will also have people participating in the auction remotely from different parts of the world. And the interesting consideration with that is you wanna make sure that these stay fair. So the people that are on-prem are obviously going to have the fastest experience, and the people that are remote need to be writing as if they were on location, but from where they are. Meaning your write is coming from one place but needs to land in the right one, and we'll see it in an animation. Another thing to think about with database systems where you may have multi-primary and asynchronous replication is a lot of these, like DynamoDB or YugabyteDB, have last-writer-wins behavior, meaning that if two writes come in at the same time and they're conflicting, the one that came in last is actually the one that wins.
And for an auction-based system, this doesn't work so well. So we have to think about that. Lastly, user dissatisfaction with any of these issues leads to millions in losses. So if we think about the auction-based perspective, and I got really fun with Keynote on this presentation, just warning you now, we have people that are participating in auctions from different parts of the world. Again, kinda similar to that financial one we were talking about, where we might have somebody that's in Australia participating in an auction that's in Japan, but we want the writes to go to the API that is in Japan, not to the API that's in Australia, even though we wanna support replication and reads from where they are locally. So we get fast reads, but then also fair writes. Some additional considerations: we need reads to be incredibly fast for both of these use cases, meaning we want the data for reads to be as close as possible to the user. This is why we wanna look at some of those multi-primary, asynchronously replicating databases like DynamoDB or YugabyteDB, where we get those capabilities. Slowness in read performance leads to loss of queryable events and data. What do we mean by that? We mean that if we have, say, a delay in speed when you're looking at a stock price or a trading price, then somebody can't act on that as quickly as possible, which leads to loss of money. In an auction-based scenario, if we have slowness in reads and somebody trying to decide what bid to put in, you're gonna lose bids, which means you have less competition on those items, which leads to loss of money. There we go. So let's take both of those scenarios and kinda talk through them from a high level. We have a user that's based in the United States, and they wanna get their data as quickly as possible, so we wanna send their reads to somewhere that's near them. So we might have Central Canada, US West, US East, right? And that's fine, they're getting very, very fast read requests from this.
But the event that they're talking to, auction or stock-based or whatever it may be, is in the UK. So we wanna send that write over there, and then we want that write to be replicated to all those regions as quickly as possible. If we're leveraging one of those database systems we talked about, after that write gets put in, it's gonna be replicated for us automatically. What we're really concerned with is, how do we make sure we send the write to the right region and location with some policy and networking rules around that? And so what we're really talking about is we need to have some sort of programmatic API in front of that write request that says, I wanna send this write to the right location. Oh, there it goes, can you hear me? All right, it's skipping to the right location, which in this case is in the UK. So to do this, we're gonna simplify our requirements a bit, and there's really just two: policy and routing. Policy, we say this because we need to make sure that when write requests are made to a certain region or location, they're valid. Routing, so that we get them to the right place. For these two things we're gonna be talking about Open Policy Agent and Istio. So in the policy case, we're gonna be using OPA, which, if you're not familiar with OPA, is a policy engine that lets you write in a language called Rego, which you can then use to write policy that is able to interact with Kubernetes as well as other cloud-native technologies. To give you a much clearer example in the case of what we're doing today, if you imagine a Kubernetes service talking to, say, a pod, that pod's gonna have an Envoy sidecar if you're using Istio service mesh. Well, if OPA is also in the mix, you're gonna be using something called the Envoy OPA plugin, which basically just means Envoy is going to send all of your HTTP requests to OPA first before it passes those requests to the API that is part of that pod. Looks like that.
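To make that Envoy-to-OPA handoff concrete, here's a minimal sketch of what such a policy might look like with the Envoy OPA plugin. This is not the demo's actual policy; the package name, path, and method are illustrative:

```rego
# Illustrative sketch, not the demo's actual policy.
# The Envoy OPA plugin sends each HTTP request to OPA as `input`.
package envoy.authz

import rego.v1

import input.attributes.request.http as http_request

# Deny by default; Envoy rejects the request unless a rule allows it.
default allow := false

# Permit read-only access to the auctions resource.
allow if {
	http_request.method == "GET"
	http_request.path == "/auctions"
}
```

The `allow` decision is what gets passed back up to Envoy, which then either serves or rejects the request.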
So this is a really simple example. We have maybe an HTTP request coming to our API and it says, we're gonna allow maybe GET requests on auctions, right? And if that's the case, we're gonna allow that to pass through. That decision is passed back up to Envoy and then that service is able to actually serve that request. That's a super simple example of how OPA works with Envoy. So now we're gonna look at a demonstration to get more into what you can do with this. Or not. We're just looking at a blank screen. See it now? Nope. We'll just do mirror displays. Maybe that's the issue. We'll zoom in a bit. So before I start, the context here is we have a very simple demo setup. I don't believe in demos working when you have networking in the middle of that. So I try to do everything locally with demos, learned from experience. So in this case, we have a kind cluster running locally. On that cluster, we have a few things. We have an example application. This is running a simple API, and we have a bundle server, and we'll look at both of those in a second. And then we also have a database running in our database namespace. This is YugabyteDB. This is one of those databases that meets the requirements of asynchronous, multi-primary replication across regions. In this case, we're just using it in the context of our cluster, but it's useful for the purposes of our demo. So the first thing we're gonna look at is some of our code. I'll make this a little bigger. And the first thing we're gonna look at is just our application. So here we have a fairly basic setup. We're not running Istio inside of this small cluster. I wanted to keep it as small and easy to demo as possible. So instead, what we're doing is a proxy init with Envoy, which essentially just allows us to stand up Envoy standalone, without Istio, inside of a smaller cluster. We have a basic application. This application is an API that serves Teams requests.
At ThoughtWorks, we maintain a reference implementation of a Teams API used for a platform starter kit. So this API is actually something that I maintain as part of that. But in this case, we're just using it for a few simple API requests, but you can take a look if you want. And then we're having it connect to that YugabyteDB database on our cluster. We have an Envoy sidecar and we have an OPA sidecar, which is that plugin. OPA is loading its policies from this bundle server, which is the pod we saw earlier. Let's take a look at that. So if you remember when we did get pod, we saw a bundle server. So if you look at bundleserver.yaml, we have a fairly basic OPA bundle server. What this does is it serves policies to the pods inside of our application. So in this case, what we're doing is we're serving a config map called authcpolicy to our application. And then we're serving the bundle server just as a regular Kubernetes service so that on init of our pod we can fetch policies from it. So then lastly, let's take a look at that policy. Here we have a basic Rego policy. What this is doing is intercepting HTTP requests and making decisions based on what you did with that HTTP request. And we're gonna look at some of the unique things you see here in a minute. A few of the ones that are more standard for OPA that you might see all the time are like, is this token valid, meaning the JWT token we passed to that pod? We're validating it as part of OPA. We're also doing some basic action-allowed things like, is this person trying to fetch teams, or maybe trying to create teams if they're an admin? Then we'll go ahead and allow those things if the right criteria are met. So now that we've done that, we're gonna go ahead and actually start talking to our API. So I'm gonna set up a basic curl pod to act as a user. And the first thing I'm gonna do is get my JWT token. We're gonna go ahead and assume that I've logged in.
And then we're going to talk to our API and see our list of teams here. So we currently have Team Sapphire one, two, three and four inside of our database, right? Now, maybe I wanna go ahead and create new teams inside of this Teams API. Well, let's assume that maybe this is one of those auction or financial-based cases where we have some requirements around the writes actually being propagated to the correct region. So in this case, the writes can only happen in Singapore, meaning that you can only write to the primary that is in Singapore, even if you have primaries in Singapore, US, Canada, whatever, right? So let's go ahead and take a look at our basic request examples, and here what we're gonna do is use just this simple request here. We're gonna modify it to do, let's say, Team Five, and we'll go ahead and pass that guy up. And here we can see that that request was forbidden. Now, why is that? The reason is because we are actually looking at the region of where this request came from. So what I mean by that is, let's go back to our code for a second. In this case, we've just set a static value for where this API is actually located. We're saying that the API is located in Singapore. In a production case, what you would do is detect this from the cluster that it's running on using metadata, which OPA has access to. To keep the demo short and compressed, we're just setting this as a value here. We're then looking at the request that's coming in to our API and comparing that to the required write location of that API request. So if we don't have a write location being propagated with that request, we're gonna deny it. If we also pass the wrong region or it comes from the wrong region, like say we're trying to write to the US and instead we needed it to write to Singapore, it's also going to be rejected. So let's go back to our curl pod and we're gonna go ahead and modify our request again.
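Roughly, the region comparison just described might look like this in Rego. The static `api_region` value and the header name are assumptions for illustration, not the demo's exact code:

```rego
# Illustrative sketch of the write-location check described above.
package envoy.authz

import rego.v1

import input.attributes.request.http as http_request

# Static for the demo; in production you'd derive this from
# cluster metadata that OPA has access to.
api_region := "sg"

default write_region_valid := false

# Deny if no write location is propagated with the request, or if it
# doesn't match the region this API is running in.
write_region_valid if {
	http_request.headers["x-write-region"] == api_region
}
```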
And this time we're gonna actually use the correct write location. So let's do Team Five. Wrong, maybe wrong region. I realize what we did there. We forgot to pass our token to our API. So let's grab our token, and now we can go ahead and do our Team Five write. There we go. And now let's go ahead and ask for our teams. There we go. So now we have our team added to our database, coming from the right location or region. So that's a simple example of how you could maybe use policy to enforce where those writes are coming from and where they're being propagated to. That's only one half of the equation. The other half is routing. So how do we actually get them to the right place so that we don't get those rejections that we saw? So going back to our demo, coming back to our slides. There we go. So in this case we're gonna be using Istio, and for the purposes of this we're just gonna be talking through the code and examples. We're not gonna have time to do a full demo of multi-region Istio on a stage. So the assumption that we're gonna make is that we have Istio multi-cluster. Something to note, if you're not familiar with Istio multi-cluster, is it can be installed in various network topologies, meaning that there is single network and multi-network, like if you're on Amazon you might have multiple VPCs or a single VPC. Those are all fine with what we're talking about here today. You can kind of pick and choose how you want. There just might be different things you'll have to do depending on what you choose. There are also multiple control plane modes within Istio multi-cluster that you'll need to take into consideration when you do this. Like, there we go. So the first thing that we need to talk about is destination rules. Destination rules are basically policies that apply to traffic after routing has occurred, which sounds like a lot. Really all it means is when that request gets to your cluster, that means routing has already occurred.
The request has been received, or gotten into your service mesh. Then we're going to apply policy to make a decision on that request after it has entered into our Istio multi-cluster network. So that's what we mean by after routing has occurred. We're still gonna do some additional routing, but this is saying your request is now already there; we're gonna do something about it. You can use it for things like targeting specific versions of services or targeting specific services based on criteria. The next thing to talk about with destination rules is subsets. So we can use subsets to do what I just talked about, which is target specific services, and we can use various criteria, like say version numbers or headers or whatever you want, to target that specific service. It sort of defines the criteria for how the subset is targeted through the destination rule. So in the Istio multi-cluster context we have something called topology labels, which basically just means inside of our multi-cluster global environment every single cluster inside of our Istio mesh is going to have a topology label. So if you have, say, two clusters, west and east, west is going to have a topology label of cluster-west and east is going to have a topology label of cluster-east. You can change these and modify them however you want. These are going to be useful for us though. So to dynamically route, what we can do is actually write a destination rule using the subsets we just talked about to then target specific regions within our cluster, meaning that we can define a subset that says we want anything that's cluster-sg to go to the services running in cluster-sg inside of the Istio multi-cluster service mesh. Same thing with cluster-usa. These are obviously abstractions, because I'm using them for simplicity. Yours might be more like US East 1, US East 2, et cetera. So we're using these subsets, and I know this is a lot, we'll go through it.
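Subsets keyed on those topology labels might be declared like this; the host and subset names are illustrative assumptions, not the talk's actual config:

```yaml
# Illustrative sketch: host and subset names are assumptions.
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: auction-regions
spec:
  host: auction.example.svc.cluster.local
  subsets:
    # topology.istio.io/cluster is set per cluster by Istio, so each
    # subset targets the workloads running in one specific cluster.
    - name: cluster-sg
      labels:
        topology.istio.io/cluster: cluster-sg
    - name: cluster-usa
      labels:
        topology.istio.io/cluster: cluster-usa
```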
What we're going to then do is define a virtual service, and using this virtual service we can actually target our regions using those subsets and destination rules. So what I mean by this is we can say, okay, for the requests that are targeting Singapore, we are going to target the destination of subset cluster-sg. For the events and requests that need to go to the US, we can target the destination of cluster-usa. Keeps it really simple for us. We can also then specify certain criteria, like we only want to apply this to POST requests. And why do I say that? Well, if you remember in the animation from much earlier, we want reads to go everywhere because we don't care where those go. We just want writes to go to certain regions. This is one way to allow us to do that, meaning that we're only going to enforce sending these requests to those regions when we have a write request. Otherwise we're just going to let it go wherever it wants. We're going to let Istio do what it wants and load balance that as quickly as possible. So to give you a simpler, easier to understand example, let's pretend I'm a person in the US. And I'm going to send a request to, say, my API in the US, and that's routed for me because that's the closest thing to me, right? So I'm going to be asking for maybe some data that's in an auction that's based in Singapore, but it's just a read request. So it's going to go through our gateway. It's going to go through our virtual service. It's going to go through that destination rule and apply some policy, and it's just going to decide, oh, this is a read request, so it's not going to apply that write policy. So it's just going to go straight to the auction service local to that region and read it straight from there. Now when we look at the write request, things are slightly different.
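A virtual service along the lines described above might be sketched like this; the host, header name, and subset names are illustrative assumptions:

```yaml
# Illustrative sketch: host, header, and subset names are assumptions.
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: auction-write-routing
spec:
  hosts:
    - auction.example.svc.cluster.local
  http:
    # Pin POST (write) requests to the region they belong to.
    - match:
        - method:
            exact: POST
          headers:
            x-write-region:
              exact: sg
      route:
        - destination:
            host: auction.example.svc.cluster.local
            subset: cluster-sg
    - match:
        - method:
            exact: POST
          headers:
            x-write-region:
              exact: usa
      route:
        - destination:
            host: auction.example.svc.cluster.local
            subset: cluster-usa
    # Reads fall through and get load balanced locally as usual.
    - route:
        - destination:
            host: auction.example.svc.cluster.local
```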
We're still going to send that write to the closest possible region for that person, meaning that you might have a GSLB or something in there, or something just making sure that you get into your cluster mesh as quickly as possible, and we're then going to go through a few steps. Still going to go through the gateway. Still going to go through our virtual service. And then we're going to hit our destination rule, and our destination rule is going to say, this is a write to an auction that's based in Singapore, so we actually need to send that to our API that's based in Singapore. So that's going to go there and get written to our database in Singapore, and then the replication will occur to the one up in the US for read requests later on. I see a few phones out for photos, so I'll give it a second. I will share the slides too. Good. All right. So what if we don't know the downstream location? In my opinion, in these scenarios the client should already know the location of the event before sending a write, meaning that you're probably doing some GET requests prior, so you already know some metadata about the event. But if for some reason you don't, you can do manipulations through virtual services, or also OPA, although I don't recommend doing it in OPA, it's quite tricky, where you can apply, say, header-based changes directly to that service request from a virtual service. Again, only if you really need to do this; otherwise I would say maybe stick with getting some metadata through your read requests and then using that as part of your workflow into write requests. It's still important to note that policy still protects us from the edge cases in either type. So in the summary, we can use Istio multi-cluster and Open Policy Agent to accomplish a global, multi-region, asynchronous, multi-primary, that's a lot, distribution of our API. It's a very complex use case, and it really only happens in specific examples; finance and auctions are just good ones.
OPA ensures writes are only accepted at the right locations, meaning that when we talked about our two requirements of policy and routing, we wanna make sure that requests are being routed to the correct place, but then once they get there, we always need to make sure that they're actually valid. So think about the admission controller pattern in Kubernetes, where you're sending signed deployments to Kubernetes. Well, the thing that actually validates that those deployments are signed is local to that environment, meaning that your policy should always live as close as possible to the boundary of the thing that's actually validating it. So in our case, that's the API request. So we want that policy to live alongside the API. We then use some clever uses of Istio in order to get those requests there, using destination rules and virtual services. And this is a solution very focused on cloud native projects. There are a lot of other ways to tackle this. There's a really good talk by Vanguard at AWS re:Invent 2022 where they used all AWS services to do a lot of the same thing. They did have to write a config library for their dev teams, which is sort of what spawned this idea for me. I was sort of like, I wonder if you could accomplish this without having the dev teams put any code into their applications in regards to the actual routing, and came up with this concept. And then lastly, I would highly, highly, highly recommend, if you need to go down this path, looking into global service load balancers, global DNS, global accelerators, things that are gonna get those requests as quickly as possible into your network. This solution does not do that for you. You wanna look at some of those services I mentioned in order to just get the request into your network, and then you can take it from there with this solution. And we have two minutes and 50 seconds left, but thank you for coming. Thank you. Questions, comments?
I got one, and then there's a mic in the middle. Nope, two. How did you establish network connectivity across these multi-clusters with Istio? Yeah, so that's gonna be, like, super implementation-specific, right? In our case at ThoughtWorks, we work with all three of the major cloud vendors, but I'll just talk to AWS. In our case, we typically do a multi-VPC, Istio multi-primary setup, and then you obviously have to do VPC connectivity for those services. We actually started using their Cloud WAN service in order to accomplish that, which is basically just like a managed mesh of transit gateways, so that you can get the network connectivity for each of those VPCs. There are others; it depends on the cloud implementation though. But yeah, if you go through Istio's multi-cluster docs, it'll explain all that for you and then point you down paths for each cloud provider. Thanks for the slides, but I have two questions. How do you handle regional failures, and secondly, how do you recover from the regional failures? When the recovery is happening, how would the OPA policy and routing work in that case? Yeah, that's a great question. It was something I debated about putting in but didn't think I would have time, and I think I was right. But basically, when you look at those destination rules that are, say, routing us to one region, well, destination rules actually allow you to send a balance of traffic as well as define fault tolerance patterns. So those are actually the normal use cases for destination rules, meaning you might define, if this endpoint is down, like let's say the Singapore service is down, in your destination rule you might have defined a secondary region that's as close as possible that still meets the regulatory requirements of that API. So you might have, like, let's take the U.S., it's simpler for me to remember the region names. If the regulatory requirement is to send these writes to U.S.
based regions for, say, a stock exchange, then your primary defined in your policy might be U.S. East, and you might define in the same destination rule a secondary failover of U.S. West. But you can do all of that in the virtual service and destination rule, and then it'll just handle it for you. And we're out of time, so thank you.
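For reference, the kind of failover described in that last answer maps onto a destination rule's traffic policy; the regions and thresholds below are illustrative assumptions, not config from the talk:

```yaml
# Illustrative sketch: host, regions, and thresholds are assumptions.
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: exchange-failover
spec:
  host: exchange.example.svc.cluster.local
  trafficPolicy:
    loadBalancer:
      localityLbSetting:
        enabled: true
        failover:
          # Prefer us-east-1; fail over to us-west-2, which still
          # satisfies a US-only regulatory requirement.
          - from: us-east-1
            to: us-west-2
    # Outlier detection must be configured for locality failover
    # to kick in when endpoints become unhealthy.
    outlierDetection:
      consecutive5xxErrors: 3
      interval: 30s
      baseEjectionTime: 60s
```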