Hi, this is Anurag, I just joined from the wrong meeting. We're about to start. So go ahead, Serena.

Thanks. Welcome back to another OpenShift Commons. We're here with one of our really great partners — we love all our partners, but especially Cockroach Labs — to talk about geographically distributed CockroachDB with OpenShift. Raffaele Spazzoli is another one of our über-architects, so if you have any questions about OpenShift, and today about geographically distributed OpenShift in particular, please ask them. And Keith McClellan from Cockroach Labs is here with us. So please take it away.

Thank you. I'll start. My name is Raffaele Spazzoli. Like Irina said, I am an OpenShift architect in consulting. Keith, would you like to introduce yourself?

My name is Keith McClellan. I am on the Solutions Engineering team at Cockroach Labs, so I help our customers implement solutions like the one we're going to be demonstrating here today in their own environments. I'm excited — Raffaele and I have been working hard on this demo for quite a while and we've used it a number of times, so I'm glad we get to show it off to a broader audience. Thank you for inviting me.

You're welcome. So today, the idea is to talk about how we can distribute workloads across different geographies as an approach to managing disaster recovery. In this case, obviously, the workload is CockroachDB. Then we are going to present a demo where we simulate a disaster and show how the system reacts to it.

I started about two years ago thinking about how we can manage disasters for stateful workloads. My line of thought was: as a community, we have figured out stateless workloads for OpenShift and Kubernetes in general; it's time to think about stateful workloads. Obviously, stateful workloads bring more problems. In particular, they bring state, and they need to sync that state across instances, so there is storage involved — but today we are narrowly focused on disaster recovery. One other thing I was trying to do was to define a new approach to disaster recovery, which I call cloud-native disaster recovery.

Here is how I define it and how it differs from traditional disaster recovery. In traditional disaster recovery, usually a human decides when a disaster has occurred and triggers the disaster recovery procedure; the situation is not detected by the system. In cloud-native disaster recovery, we cannot wait for a human — we need faster reaction times — so the trigger has to be autonomous: the system has to identify the situation and react. When a human is doing the reacting, typically a long time passes before anyone realizes what's happening — one or two hours, in what I see at my customers — and that's just to start the recovery procedure. The recovery procedure itself, in traditional disaster recovery, is usually a mix of automation and human actions: if you're good, you probably have it all automated; if you're not, you probably have a lot of human actions. In cloud-native disaster recovery, we want it fully automated.

The two main metrics through which you measure the SLA of a disaster recovery procedure are RTO and RPO. RTO is approximately how long the system is down, and RPO is approximately how much data — how many transactions — you have lost: the window of transactions that you missed. The first essentially measures availability; the second essentially measures the consistency of your data.
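For reference, those two metrics can be written down precisely. A minimal formalization — assuming t_failure is the moment the disaster hits, t_restored is the moment service is back, and t_last is the timestamp of the last write that survives:

```latex
\mathrm{RTO} = t_{\mathrm{restored}} - t_{\mathrm{failure}}
\qquad
\mathrm{RPO} = t_{\mathrm{failure}} - t_{\mathrm{last}}
```

Traditional DR accepts an RTO of minutes to hours and an RPO at or above zero; the cloud-native goal described here is an RTO of a few seconds and an RPO of exactly zero, i.e. no committed transaction is ever lost.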
In traditional disaster recovery, you can have a fast RTO — minutes — but it can go up to hours, and we have seen why: one of the reasons is the human component in the detection and in the recovery procedure. In cloud-native disaster recovery, we want near-zero RTO. It could theoretically be close to zero, but there are components like load balancers and health checks that need to react to the new situation and start diverting traffic, so we get a near-zero outage, on the order of six seconds. For recovery point objective: in traditional disaster recovery it can be anywhere between zero and hours, depending on how you sync the state; in cloud-native disaster recovery, we want it to be exactly zero.

Then, when it comes to ownership of the process, what I usually see is that the ownership of designing a disaster recovery process is formally on the application team. But what the application team usually does is turn around to the storage team and ask, "What SLA can you give me?" — and that becomes their SLA for disaster recovery. So basically they rely completely on the storage team. In cloud-native disaster recovery, it's going to be entirely on the application team to find the right kind of middleware or software that can deal with disasters. In cloud-native environments there is really no single storage team anymore, especially if you have a hybrid cloud: there is AWS storage, Google storage, maybe your internal storage, but there is no single team you can go to and say, "Give me your SLA."

From a technical capability standpoint, there is another interesting difference. In traditional disaster recovery, we usually build recovery procedures using capabilities that come from storage and storage products: backups, volume sync, those kinds of capabilities. For cloud-native disaster recovery instead — and this was an interesting finding for me — the capabilities that we need come from the networking space. In particular, we need the ability to communicate east-west between these geographies, so that all the instances of our workload can find each other, and we need a good global load balancer that can detect that a geography has gone offline and start directing traffic to the available geographies.

If you don't mind me jumping in here, I'd like to talk a little bit more about why all of this is important from a cloud-native perspective. I've been dealing with these types of problems for a long time, much like Raffaele has. The reality is, as we become more abstracted away from the infrastructure — as you mentioned, with hybrid workloads we're potentially running across on-premise data centers and cloud data centers — and even as we move towards full cloud deployments, what we're seeing in a lot of cases is that silent disasters happen a lot more frequently. We can't rely on our own processes to guarantee that we don't have a data center outage, an availability zone outage, or a network partition. Because some of these scenarios become more likely as we move to a cloud-native ecosystem, we need to start treating them like any other issue that comes up on any given day. Fundamentally, that's something we're going to be showing as part of the demo later today.

Yeah, thanks, Keith, that's exactly right.
A disaster should start to feel like an HA event — a high-availability event — where a component goes down but everything keeps working. In a disaster, an entire data center or an entire region goes down, so we lose more than one component; we lose an entire piece of our IT. But still, everything should keep working, except for those glitches we talked about on the availability side.

Okay, so we have prepared a demo for you to see this in action. Let me talk a little bit about the infrastructure we set up for this demo. We have three OpenShift clusters in three AWS regions: two in the eastern United States — North America, I should say — and one in the west. We set up these clusters using a tool called Red Hat ACM, Advanced Cluster Management, which runs here in the top-right corner on the administration cluster. It can be used to manage cluster lifecycles and to observe cluster status, and we used it to bring up these three clusters.

Then we deployed a tool called Submariner. Submariner helps you establish a tunnel between the OpenShift SDNs — the SDN is the software-defined network established inside an OpenShift cluster for the pods to run in. With this tunnel between the SDNs, we are able to open a connection from a pod running in one cluster to a pod running in another cluster, without having to egress and ingress through the router or other means of entering the cluster. So Submariner brings us discovery and connectivity: we can configure pods to find their peers in other clusters, and that's what CockroachDB needs in order to establish its own logical cluster. So we will have three OpenShift clusters and one CockroachDB cluster (a sketch of the Submariner setup follows after this overview).

Another thing we did was deploy Vault for secrets and certificate distribution. We needed a single certificate authority for all of the CockroachDB instances; Vault is a way to provision that, and certificates are provisioned using an operator called cert-manager, which you see here in the slide.

The last piece of infrastructure — this is all preparation we needed to be able to deploy CockroachDB — is a global load balancer. Since we are in AWS, we are using Route 53, which is a DNS, but a very powerful one. And we have a global load balancer operator here on the right, running on the administration cluster, that observes the other clusters and automatically programs Route 53. So when we deploy something like Vault here, a route — a DNS definition — is created in Route 53, and the same thing will happen for CockroachDB.
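For readers who want to try this, here is a minimal sketch of a Submariner setup using its subctl CLI. The kubeconfig file names and cluster IDs are illustrative, not taken from the demo environment (the demo's actual scripts are in the references at the end):

```bash
# Minimal Submariner setup sketch; names are illustrative.

# 1. Deploy the broker on the hub (here, the ACM administration cluster).
subctl deploy-broker --kubeconfig admin.kubeconfig

# 2. Join each workload cluster to the broker. broker-info.subm is written
#    by the deploy-broker step above.
for c in east1 east2 west1; do
  subctl join broker-info.subm --kubeconfig "${c}.kubeconfig" --clusterid "${c}"
done

# 3. Confirm the inter-cluster tunnels are established.
subctl show connections --kubeconfig east1.kubeconfig
```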
So I have a couple of questions for you, Raffaele, specifically on the infrastructure setup, just to get the opportunity to pick your brain a little bit. I think that's fun. Obviously I'm a fan of Submariner, but how exactly is Submariner different from some of the other ways we could peer the networks between these different OpenShift clusters?

Yeah. What Submariner does is establish an IPsec tunnel between the SDNs. It's a very efficient tunnel, because IPsec is an established technology, and it encapsulates layer three on top of UDP, which is layer four. We have seen other solutions that use higher levels of encapsulation, so they're slightly less efficient. And I should add: this is one of those problems where we need to keep latency as low as possible for this distributed workload to be efficient, so we need to solve the problem as close as possible to the physical network, and Submariner does a good job there. There is also an upcoming way of running Submariner that will make it even more efficient, which is using WireGuard instead of IPsec to establish the tunnel; WireGuard is a more lightweight protocol.

So are those the reasons why it's a superior production solution to the other ways we could peer the networks, or is there other stuff we should be thinking about here?

No, I think that's it — that, and the fact that it's deployed by your administrator and it serves the entire cluster, not just an individual namespace. It's a piece of the infrastructure: once it's there, it almost disappears, and it just works.

If you don't mind, I'm going to pick your brain about some of the other parts of the infrastructure that you set up for the demo too. Is this a good time for that?

Yeah, yeah, go ahead.

So I personally haven't used Advanced Cluster Management before. Can you talk me through why you chose to use it to orchestrate these Kubernetes clusters? And I'm curious to know whether that administrative cluster is running all the time, or whether it's something more ephemeral in nature.

No, the administrative cluster is supposed to be there all the time. Let's go to it for a second — let me just go here and quickly show what it is. On this page I can manage my clusters. It's a single pane of glass for my whole fleet of clusters, and customers are starting to have several — tens and tens — of clusters at this point, so this makes an easy entry point to manage all of them. There are administrative capabilities you can run from here: for example, I can upgrade all of them at once, I can set up monitoring so that all the metrics are collected into a single spot, and I can even deploy applications through ACM and spread them across multiple clusters, or enforce policies. In this particular demo, we just used ACM to spin up the clusters where our workloads are going to run.

Got it. So it's safe to say, then, that while we have this administrative cluster and we use it for administering the distributed multi-cluster configuration, it's not a single point of failure? If the administrative cluster were to go down because of a failure, the infrastructure it has already provisioned is independent of it?

That's a correct statement. Yes.

Awesome. So I want to pick your brain a little bit about how you configured Vault and cert-manager here, because I think it's really interesting. As you mentioned, CockroachDB does mTLS between its pods — we're going to be talking about that in the next few minutes. But to get a single certificate authority across all three clusters, you chose Vault, which is great.
It's the same thing I would recommend to customers going into production. What specifically did you do to make Vault work as a single CA across all three of these clusters?

Right. First of all, I think it's important to discuss why I decided to deploy Vault this way, because you could say: well, I need a common CA across these three clusters, but the CA could be running anywhere — why run it in the clusters, and across the clusters? Here is the reasoning. With these three clusters, we're trying to build the most available infrastructure in our IT: it's distributed across multiple geographies, so it's the most available thing we're going to have. And a CA — a PKI, a secrets management tool — is one of those pieces of infrastructure that sits in the critical path for applications. It used to be something that only needed to be available at boot time, but now, if it's not available, things stop working; it needs to be available all the time. So it should also benefit from the most available infrastructure we're building. I was looking for a way to have a PKI-slash-secrets-management tool that never goes down, and Vault can do that, because Vault supports Raft as a storage protocol — which is the same thing CockroachDB uses. So in terms of syncing state and managing availability, they have some commonalities. Vault here is deployed as a single logical cluster across these three OpenShift clusters. It can serve secrets — the same secrets for all three clusters — or it can serve certificates, all generated from the same PKI (a sketch of such a PKI setup follows below).

Okay, can I also ask a quick question, Raffaele? About the global load balancer: where is it actually running? Is it running on a specific OpenShift cluster's infrastructure? Where is it actually located?

The global load balancer itself is Route 53, so it's run by AWS. The global load balancer operator, which is the thing that configures Route 53 for our needs, runs on the ACM cluster — as you can see, it's this one here.

Right. So my question was: if it's not AWS — if it's on a customer's own infrastructure, their own data centers — where would this load balancer run? Would it be on a specific OpenShift cluster's infrastructure? Just curious.

If you have on-premise data centers across multiple geographies, I recommend you build the global load balancer with a DNS. You could use something like an F5 BIG-IP as your DNS, which has roughly the same capabilities as Route 53 — and it can certainly do health checks, which is the thing we need here. Then it's a matter of how you configure it: you can do manual configuration, but I recommend having the configuration programmed automatically, and an operator would be a good fit there. So it would be essentially the same architecture you see here, except that instead of Route 53 in parentheses, you would have an F5 BIG-IP or something along those lines.

Got it, thank you. And the green logo is the CockroachDB logo, is that correct?

No, CockroachDB is not in the picture yet. The green logo is Submariner.

Okay. And what is the blue logo, then?

That's cert-manager — the operator that creates the certificates.
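As a rough illustration of that setup, here is what configuring Vault as the shared PKI could look like with the vault CLI. The mount path, CA name, role, and allowed domains are assumptions, not the demo's actual configuration; each cluster's cert-manager would then point a Vault issuer at this same PKI mount:

```bash
# Sketch: Vault as the single certificate authority for all three clusters.
# Mount path, common name, role name, and domains are illustrative.
vault secrets enable pki
vault secrets tune -max-lease-ttl=87600h pki

# Generate the shared root CA inside Vault (10-year TTL here).
vault write pki/root/generate/internal \
    common_name="cockroachdb-ca" ttl="87600h"

# A role that cert-manager's Vault issuer can use to request node and
# client certificates for CockroachDB.
vault write pki/roles/cockroachdb \
    allowed_domains="cockroachdb,svc.clusterset.local" \
    allow_subdomains=true \
    max_ttl="720h"
```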
Okay. We're about to build on this and add CockroachDB to it — this is all the work Raffaele did before I was allowed to even get started.

Okay. So let's get to CockroachDB. Actually, Keith, would you like to describe it, or should I?

No, absolutely. CockroachDB is a distributed SQL database. It's cloud-native: the vast majority of our installs run in Kubernetes and OpenShift, and we run our database-as-a-service product on Kubernetes. Fundamentally, we function a lot like the other technologies we've already been talking about. Raffaele mentioned Vault and how it uses Raft to do consensus-based replication across sites; that's the same way that etcd in the Kubernetes ecosystem replicates the state that Kubernetes maintains across different pod hosts. CockroachDB also implements the Raft protocol for consensus-based replication of our data layer. Under the covers we use a KV store — it used to be RocksDB, if you've ever heard of that; we've since re-implemented a KV store, called Pebble, that is more purpose-built for what we're trying to do. It's a single-binary deployment, almost completely written in Go. On the front end, what we're creating is a mesh where every single node has the authority to act on some portion of the data in the database, is a follower for some other portion, and is potentially not involved in some third portion. So every node is active as the leader for some portion of the data, and we create a global logical cluster that allows you to talk to any given node; we will route your queries to wherever the data lives.

There's a lot of great stuff here, but one of the prerequisites, as Raffaele mentioned, is that those nodes talk to each other over mTLS — encrypted communications. So CockroachDB is going to communicate with, in this case, cert-manager to get the certificates that enable that. And for our back-end communications, we require that all the nodes be able to route to all the other nodes; this allows us to deal with losing a pod or a site. So we're using Submariner here to let all the pods talk to each other across the sites, so that they can act as a single global database cluster.

I've been at Cockroach Labs for about two years now, and it is by far the easiest database — particularly OLTP database — that I've ever had the privilege of supporting. One of the great things about designing the database to be cloud-native from the very beginning is that a lot of the operational challenges you would have in a traditional OLTP system, particularly if you were trying to run it in Kubernetes, we don't have. We could talk about how we manage data replication, about query performance — there are a lot of great topics we could go into — but I'll pause there as the high-level description of the database.

So, Keith, you said it's a SQL database. As a developer, let's say I already have an application running on a SQL database and I want to start using CockroachDB. I can probably reuse my SQL skills, because it should feel the same. But is there anything that changes, or that you want to highlight?

Yeah. We've implemented the Postgres wire protocol, so you can connect to us using Postgres drivers.
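Because of that wire-protocol compatibility, standard Postgres tooling connects directly. A small sketch, with an illustrative host, database, and user (CockroachDB's default SQL port is 26257):

```bash
# Connect to CockroachDB with stock psql; names and paths are illustrative.
psql "postgresql://app@cockroachdb-public:26257/tpcc?sslmode=verify-full&sslrootcert=certs/ca.crt" \
     -c "SELECT version();"
```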
You can, in a lot of cases, use your existing Postgres tooling to interact with us, and there are CockroachDB variants of the ORMs that are out there. The one thing you need to know — and this is true for any distributed system — is that the data has a location attached to it. It may have an intrinsic location, like an address does; it may not, and then we need to consider where it's going to be accessed from and what that access pattern looks like. So the one piece you have to add to your DBA bag of tricks when you're moving to a distributed SQL environment is thinking about how we want to distribute the data across the cluster and, inversely, how we're going to get it back out. If you ever take a data modeling class in college, they'll talk about the physical data model as opposed to the logical data model; when you move to a distributed system, you have to think a lot more about the physical data model.

In CockroachDB, we make this super easy. We have a couple of what we call DDL extensions: basically, when we define a table, we define how we want to distribute the data across that table or set of tables. By default, we do something called follow-the-workload, where we move the authority to act on any particular segment of the data to where it's most likely to be used from. But we also have concepts of regional tables and global tables (sketched below). All of these have different trade-offs on read and write performance, and they also affect what types of scenarios we're going to survive without user impact.

One of the big philosophical things we talk about is designing to survive, as opposed to designing to fail. Traditional DR is designing a system that can pick up when your primary system fails — that's why you have two-site solutions, failovers, backups, and all of that. We're designing to survive: we're going to have three or more sites, because if we lose an entire site — as we're about to demonstrate — we want the system to continue to operate and function as if it were any other day. The data center, if you will, is the new rack. If you've ever set up a distributed system in a physical data center and wanted to make sure it survived, say, a rack failure — a PDU failure, a top-of-rack switch failure — you didn't want your application to go down in that scenario. We're now treating the data center as that new abstraction layer that needs to be survivable without any noticeable downtime. I hope that answered your question.
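As an illustration of the DDL extensions Keith describes, here is a hedged sketch using CockroachDB's multi-region SQL. The database, table, and region names are assumptions and would have to match the localities the nodes were started with:

```bash
# Sketch of multi-region DDL: survival goals plus regional and global
# tables. Database, table, and region names are illustrative.
cockroach sql --certs-dir=certs --host=cockroachdb-public <<'SQL'
ALTER DATABASE tpcc SET PRIMARY REGION "us-east-1";
ALTER DATABASE tpcc ADD REGION "us-east-2";
ALTER DATABASE tpcc ADD REGION "us-west-2";

-- Survive the loss of an entire region, not just an availability zone.
ALTER DATABASE tpcc SURVIVE REGION FAILURE;

-- Each row is homed in one region and served fastest from there.
ALTER TABLE warehouse SET LOCALITY REGIONAL BY ROW;

-- Read-mostly reference data: fast reads everywhere, slower writes.
ALTER TABLE item SET LOCALITY GLOBAL;
SQL
```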
Yeah, it does. And I think this is a perfect segue to my next question. I see customers now that are considering migrating their SQL farms — it could be any database product — to OpenShift, and they may have maybe a thousand database instances running on VMs, essentially treated like pets, with a team of DBAs that tend and care for them. And I feel there is a risk that we migrate these databases inside OpenShift — they can certainly run there — but we still treat them like pets. Instead, the philosophy of Kubernetes and OpenShift is to treat everything as cattle: things that can die and will respawn somewhere else. So here is the question for you — and sorry, to conclude the thought: this can obviously be difficult for stateful workloads, much more difficult than for stateless. How does Cockroach help in that space?

Yeah. Fundamentally, state is what makes something special from an IT perspective, right? If you remove the state from almost any system, you can probably genericize it pretty easily. So we have fundamentally taken the same approach to this problem as Vault has and as etcd does: we make sure that all of our data lives in more than one place, and we never depend on any single copy. Rather than having a single point of failure, we have effectively configurable availability while still guaranteeing consistency.

One of the things I didn't talk about is the replication factor (sketched below). By default, everything that gets written to CockroachDB gets written to at least three places, and that's configurable up from there. There are scenarios where there might be five, there might be seven — you could theoretically go even higher, although if you lose 51% of your replicas when you had seven of them, you probably have bigger problems than the database. The intent is that you say: these are the everyday occurrences I want to survive. In AWS, it might be an availability zone outage, which happens a couple of times a year. Maybe it's a region failure, which happens once every three years — you want to make sure you survive that. In some cases it might be a full cloud: we've had scenarios where cloud providers have had cascading problems — caused by user error, because almost all disasters are caused by user error at the end of the day — where multiple regions were lost, in AWS and Azure and GCP. So you may want to spread your workload across multiple clouds. This is very much in the neighborhood of the hybrid workload: I've got two physical data centers, and my third data center is in Google or in AWS. Or maybe I only want to maintain one data center now, and I want my other two data centers to be in two different cloud providers, because I don't want to put all of my eggs in one basket.

Fundamentally, we use Raft to do this, with some enhancements that make it work for a SQL database. If you ever read our Life of a Distributed Transaction documentation, or the blog posts about how we guarantee serializable isolation across transactions — even when two transactions come into two completely different nodes in two completely different data centers — we have a lot of very interesting writing on that topic that I won't go into today. But the first thing we do is simply make sure that nothing about any pod in a CockroachDB cluster is so special that we can't survive without it. That's the fundamental baseline; that's how we solve that problem.
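For concreteness, the replication factor Keith mentions is a zone configuration that can be set cluster-wide or per database/table. A minimal sketch of raising it from the default of 3 to 5, so any range keeps quorum even after losing two replicas:

```bash
# Sketch: raise the default replication factor from 3 to 5.
cockroach sql --certs-dir=certs --host=cockroachdb-public <<'SQL'
ALTER RANGE default CONFIGURE ZONE USING num_replicas = 5;
SHOW ZONE CONFIGURATION FROM RANGE default;
SQL
```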
Hey Keith, can I ask two quick questions? You mentioned that most of the time we design for failure and not for survival — can you explain what exactly you mean by that? And then, can you help us understand how CockroachDB differs from, say, Redis or Infinispan? Is it an inherently different architecture that you're using?

Yeah, I'll answer the second question first, because it's pretty short: those other databases are NoSQL databases. If you look at the CAP theorem: as soon as you're partition-tolerant — as soon as you have to manage network partitions — you can either guarantee consistency or you can guarantee availability. Generally speaking, NoSQL databases lean towards availability, so they don't guarantee consistency in all cases. I'm not going to go into the specific databases, because the nuances get really specific. We're a CP database, but we can increase our availability by increasing our replica count, because we're using consensus-based replication. Fundamentally, that makes us more valuable for system-of-record-type workloads — things like inventory management and financial transactions. We're used by a number of large financial institutions in the United States and Europe, for example, because we can run in a cloud-native environment like this and have extremely high availability as well as guaranteed consistency for transactions. Things like Redis and Cassandra and MongoDB are much better at workloads that are kind of write-once, read-many. That's a broad generalization — I know there are people on the call who could give me specific examples where something like Cassandra or MongoDB would be a better fit for a problem than CockroachDB, and as always, use the right tool for the job. We are specifically focused on transactional workloads that require guaranteed consistency. If you look at ACID, we're a fully ACID-compliant database, and all of our transactions are serializably isolated.

To your earlier question — designing to survive versus designing to fail — this is, in my mind, the difference between high availability and disaster recovery. When people put together a disaster recovery plan, they're expecting things to be bad enough that they're willing to accept that things aren't going to operate as they normally would. The challenge is that we've moved to these newer cloud-native technologies; the cloud is just us running on other people's computers, and we have less control, so it's more likely that something outside of anything we've done could cause one of these failure events. What we want to do is treat them as high-availability events, where we have automated failover — to the point Raffaele was making earlier — and not treat them like disasters where we're taking a multi-hour outage. Philosophically, it's like coming at it glass-half-full versus glass-half-empty: saying, hey, I need to be able to continue to operate if I lose an entire region of AWS, and I'm going to design a system that solves for that. Then, if a region in the US fails, I shouldn't need to get paged in the middle of the night to fix my systems; I should be able to deal with it in the normal order of things rather than treating it like a disaster. As soon as you look at it as "I want to continue operating as normal during these scenarios," rather than "I need to get back up and running at some point in the future if something like this happens," all of a sudden you're designing to survive as opposed to designing to fail. Hopefully that answered your question.
Can I add a consideration on eventual consistency, just building on this conversation? When I started looking at these architectures, I could have chosen an eventually consistent database — a database that chooses to be available rather than consistent in the event of a network partition. If that had been the case, we would probably see only two OpenShift clusters in this picture, because then you need only two to continue working: if you lose one, you just need one to keep going. But I read a lot about eventual consistency, and one thing people may not realize is that eventual consistency does not mean eventual correctness. Eventual consistency means that when the network partition goes away and all the instances can talk again, they will converge to a state — but there is no guarantee that that state is the one that is logically correct for your business problem. As a developer, I don't want to think about that situation and how my code would have to be designed to react to it. So I chose to go with consistent databases for this research, and I think it makes things very much easier for developers.

Yeah, I'd add one thing to that — and I know we're running short on time; I really want to get to the demo. Effectively, NoSQL databases take all of the logic that we bake into an RDBMS and offload it to the application. So you have to think about all of those potentially inconsistent cases you just mentioned, Raffaele, and handle them at the application layer. There are valid reasons why you might need to do that in certain scenarios, but in a lot of auditable situations — particularly, as mentioned, financial management and inventory tracking, where correctness is of the utmost importance — that risk is unacceptable, at least in my opinion. Which is why I'm at Cockroach Labs and not currently working at a NoSQL database vendor.

So, Keith, as we go forward, what is the data we are storing in this CockroachDB cluster? Is it application data?

That's right, it is application data. This is a great transition to the next slide, actually. There's an industry-standard OLTP benchmark called TPC-C. It's been around since the '90s. It simulates literal warehouses and how packages might flow into or out of those warehouses, as well as a point-of-sale system where those products are manufactured and then shipped out to customers. It's very much a transactional use case — the same type of database implementation you might use for inventory tracking at a large big-box retail vendor. It's actually modeled after a large big-box retail vendor and what they do in their environment. It's a good generic benchmark because it's one of the most widely available benchmarks for SQL databases: there are published results and guidance on how to run it on pretty much every SQL database I've ever seen, going back to about 1996, so it gives you a good, wide swath of what that looks like. What we're going to show here today is what happens when one of these sites goes away while we're running TPC-C against CockroachDB. With that, I will leave it to Raffaele to walk us through the demo, since he owns all of the wonderful infrastructure.

Right.
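CockroachDB ships TPC-C as a built-in workload, which is presumably similar to what the demo's load generators run. A sketch with an illustrative connection string, warehouse count, and duration:

```bash
# Sketch: initialize and run the built-in TPC-C workload.
CONN="postgresql://root@cockroachdb-public:26257?sslmode=verify-full&sslrootcert=certs/ca.crt"

cockroach workload init tpcc --warehouses=100 "${CONN}"
cockroach workload run tpcc --warehouses=100 --duration=30m "${CONN}"
```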
And on the infrastructure, I should say that one thing you said before, Keith, is very correct: today many customers, many enterprises, are considering multi-cloud solutions where they deploy on different clouds. There is nothing in this demo that cannot be deployed across multiple clouds; it's just that the account I have is on AWS only, so we're using AWS for that reason alone. You could run this demo across multiple clouds.

So here we have the CockroachDB Console. We can see on this nice map where the data centers are and where the CockroachDB nodes are. We have nine nodes — three, three, and three, of course — and we have some ranges. These are the data spaces that Cockroach manages; sorry if I'm not using the right word here, but essentially these are the partitions and the replicas being managed by Cockroach. I have preloaded some data for the use case we're going to run, and now I'm going to start loading the database with the TPC-C workload. This TPC-C workload, as Keith said, generates a bunch of OLTP transactions — typically fast inserts and fast updates.

Yeah, the majority of the workload is going to be individual item updates: as a particular widget moves around a warehouse or a set of warehouses, that record gets updated. And then a portion — I think it's six percent, although don't quote me on that; maybe I shouldn't have said that on a broadcast forum — are aggregate queries that look at the current state of the inventory for that warehouse. There's a very succinct description of exactly what the query mix is — how many are updates versus selects versus deletes versus aggregates — on the tpc.org website. I'd urge anyone interested in how the workload simulation actually works to check that out.

Okay. And you may have seen that one of the nodes was orange; that's what happens when one goes down. It must have been just a little glitch — everything is up now. I want to show you that we are generating load: these little processes are pods running inside the clusters, simulating traffic coming from different sources. We direct a portion of the traffic to the database nodes close to each source, so the traffic stays local — each generates traffic on its local cluster, and obviously Cockroach will dynamically spread the data where it needs to go. I'm redirecting the output here, and we can see all the transactions being generated. If we go to the metrics, we should see some queries — you can see there is load being generated on this database.

So now, to simulate a disaster, we are going to take down one of the regions. I am going to take down the west region, and the way I'm going to do it is to completely isolate the VPC in which that OpenShift cluster is running, so nothing can go out and nothing can go in. This is the perfect disaster simulation, because it's a network partition: you're sending a packet, and nothing answers, so you don't know whether the packet has been received or not. That's much more difficult to manage than sending a packet and receiving a response that says there is an error — and it's exactly what happens in a real disaster. To do that, I need to copy a script — I can't remember everything. This, if you can read it, is probably too small here (a reconstruction of what the script does is sketched below).
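The script itself isn't legible on screen, but based on the description — denying all traffic on the west region's VPC — it plausibly looks something like the following. The network ACL ID and rule number are hypothetical:

```bash
# Hedged reconstruction of the disaster-simulation step: replace the west
# VPC's network ACL entries with deny-all rules in both directions.
WEST_ACL_ID="acl-0123456789abcdef0"   # hypothetical NACL of the west VPC

for direction in --no-egress --egress; do
  aws ec2 replace-network-acl-entry \
      --network-acl-id "${WEST_ACL_ID}" \
      --rule-number 100 \
      --protocol -1 \
      --rule-action deny \
      --cidr-block 0.0.0.0/0 \
      "${direction}"
done
```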
Let me copy it over. This is going to set a deny rule, basically, on all traffic for all the addresses on this VPC, which is the west region's VPC. So now what we should see is — like I said, there can be some glitches when this happens. Remember, this is our Cockroach console, so the traffic coming from my browser is load-balanced by the global load balancer I described before; it could go to any of the regions. Maybe we took down the pod that was serving this console — Keith, in case you want to describe it. But as you can see, after a few seconds we were able to connect again, and the console is already aware: you see that three nodes of the nine are suspected of having a problem.

Yeah. What happened there is that, because each node serves all of the services of every node in the cluster, the load balancer was originally routing you to one of the pods we just segregated from the network. As soon as the load balancer realized that, it routed you to a pod that was still available. That's what we would expect in this type of scenario: a few seconds' service glitch for certain operations, but queries that had come into nodes that weren't impacted continue to operate and can still be processed against the database.

And in fact, I want to show you that we are still processing — see, the metrics did not go down. And our client number one is still working, although you can see it had some glitches. This client didn't have a connection problem, but Cockroach was adjusting itself, and there were two errors, which this particular client handles with retries. That's a best practice developers should also follow in their own code (a sketch of the pattern follows below) — and you see the client didn't break and continued to work. Same thing for the second client; it only got one error in this case. The third client died because, well, we severed its connection — we even lost the tail of its log here, right?

Okay. So far we have simulated the disaster and demonstrated that we didn't have to do anything: the system reacted by itself and continued to work. Now we are going to restore connectivity, and we're going to show that, again, we don't have to do anything: the system resumes working with all of the capacity that is available. Because another problem with disaster recovery procedures is that, when the system that was down comes back up, it's usually just as painful to restore the workload to where it used to run — it's the same kind of process. Keith, you were about to say something?

Yeah, I was just going to say: right now, those three nodes are still listed as suspect. We don't evict them from the cluster until — I think it's five minutes — at which point we assume they're dead. The only difference in recovery, if we were to wait past five minutes, is which path we take for re-replicating the data to the nodes as they come back. Under five minutes, we assume the nodes aren't that far behind and we can catch them up using the Raft logs. After five minutes, we assume they're too far behind, and we re-replicate the ranges, which is a slightly more expensive operation. But both of those paths are invisible to the user, aside from a slightly different performance impact after we bring those instances back online.
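The retry behavior the clients showed is worth imitating. A minimal sketch of the pattern in shell, with an illustrative statement and table — CockroachDB reports retryable transaction conflicts with SQLSTATE 40001, and client code should retry them with backoff:

```bash
# Sketch of client-side retries with simple backoff; the statement is
# illustrative. cockroach sql exits non-zero on failure.
run_txn() {
  cockroach sql --certs-dir=certs --host=cockroachdb-public \
      -e "UPDATE warehouse SET w_ytd = w_ytd + 10 WHERE w_id = 1;"
}

for attempt in 1 2 3 4 5; do
  if run_txn; then
    break                      # success: stop retrying
  fi
  echo "attempt ${attempt} failed, retrying..." >&2
  sleep "${attempt}"           # linear backoff between attempts
done
```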
Keith, while Raffaele goes and saves the world, maybe I can ask you another question. What is the pitch when my customers are using Oracle Database, they're moving applications to OpenShift, and those applications just connect to this Oracle Database sitting outside OpenShift? Are we saying: instead of that, use CockroachDB now?

Well, this will allow you to move the database into OpenShift as well. As a recovering operator — I used to run systems like this in production — one of the things that's really frustrating is when you have to treat something as special. Right now, in the scenario you're describing, the infrastructure for Oracle is special, the tooling for Oracle is special; if you have an Oracle disaster, you have a completely different runbook for resolving it than if your application fails. By using CockroachDB and moving the database into OpenShift, all of a sudden you're handling a database failure just like you would handle an HAProxy failure or an app-tier failure. It drastically reduces the scope of the types of disasters — the types of availability events — you might have to manage. On top of getting all the great self-healing capabilities that Raffaele is showing here today, reducing the administrative burden of having to understand the multiple different ways the different applications in your stack run can drastically reduce how difficult that work is. There's a ton of other things we could talk about too, but we only have like...

Yeah, I'm going to go save the world, like you said — I like that phrase. And then we can talk while we observe what happens and how the cluster recovers.

So what is the underlying layer where it persists the data? Does it write to a file system?

We use StatefulSets in Kubernetes, and those StatefulSets present a file system to the database. We use a KV engine as our storage layer, so you have a decent amount of flexibility there: you generally use something like a persistent volume claim to get a persistent volume from whatever storage layer happens to be available in your various OpenShift clusters. Here, because we're on Amazon, we're using EBS volumes as the backing store.

Yeah, before Karina kicks me out, just one last question. When this data center came back up, it had to re-replicate all the ranges that had issues. While that is happening, if a request comes into the OpenShift cluster that just came back up, what happens to that request? Is the CockroachDB database locked at that point, unable to handle it?

No, because we never lost quorum on any of those ranges. Until the West Coast data center was caught up, any queries that came into it were routed to one of the other two remaining sites. Every node in CockroachDB is a common gateway to the entire cluster, so as soon as those nodes had connectivity to the rest of the cluster again, they could act as what we call a query coordinator. They aren't necessarily the query responders — they're not the ones doing the work on the data — but they can act as a client gateway immediately. So you don't get a scenario where your database is locked up while we're re-replicating, or anything like that.
It's just a matter of those queries getting routed to whatever node currently has the authority to act on that particular segment of data. And then, once the west is back up and running, it takes over its fair share of that workload.

In fact, these two clients kept working, and they're still working. See — once I restored connectivity, the first thing that needed to heal was the network tunnel: Submariner needed to re-establish all those connections. And then Cockroach — see, it's healing right now; it's re-replicating the ranges, and then every node will be back at full capacity and serving traffic. Here we go — now it's done. So again, I think the point to take home, beyond the inner workings of Cockroach, is that as an administrator I didn't have to do anything. The system reacted to the disaster, and when we fixed the disaster, it recovered to full capacity all by itself.

I just want to add before we close: this demo is completely scripted, and anyone should be able to reproduce it if you're interested — everything is available at these links.

We also have an awesome two-part blog post that Raffaele and I co-authored, walking through exactly the underlying steps we did here; we can share that link as well.

Can you also share the link to the slides in the chat? Actually, let's not do that just yet, please. Raffaele, can you put the links to the blog posts in the references, and then we will post the link out? We are a bit over time, so — Anurag, you can ping Raffaele offline too, right?

Yeah. Actually, the last two links here are the blog posts.

Oh, perfect. Awesome. Thank you.

Thank you for having me. Raffaele, it's been great working with you on this project.

Yeah, same here. Thank you, Keith. Thank you so much. Bye-bye.