Good afternoon. Welcome to our virtual talk. This is a talk on Kubernetes persistent data challenges: availability zones, regions, and multi-cloud patterns. So, what's the next slide? Quick hellos from us. My name is Chris Milstead. I'm a solution architect at Ondat, working with customers around all things data in Kubernetes. And my co-presenter is Patrick McFadin. I work at DataStax, in developer relations and developer experience, and I work on the Apache Cassandra project and a few others, but mostly around open source data and Kubernetes. I'm also writing a book right now on managing cloud native data on Kubernetes, available at a bookstore soon, Chris. Very soon. Oh, I've got it on order, don't you worry. Well, you can still download it; we'll show the link at the end. Brilliant.

And before we get into the bulk of the presentation, just a quick couple of thanks to Alex and Rags. Alex is one of the engineers on the K8ssandra project and has been an invaluable help in setting up a lot of the demos and stuff we're going to show later. And Rags actually built a lot of the crazy multi-cloud demo that we're going to show right at the very end. So with that, on to the bulk of the presentation, and I believe you're going to lead us off, Patrick.

I will. So, Chris, I think everyone's going to ask this question: why would we even do this? I'm going to walk you through some of the reasons you may want to be outside of one data center, one rack. And this is a pretty typical question, because it does add complexity. Let's be very clear: it's very simple to run something when it's running on your laptop. That's the simplest way to do it. But of course, that's not how we put things into production. So when we think about production workloads and try to conceptualize failure, this is why we use multiple AZs, multiple multi-anything: it's about failures. And then there are more contemporary reasons, I would say, which I'll get into in a minute.

But we'll start with just availability zones. Availability zones are built by the clouds exactly for this problem: to mitigate certain types of failure. So I have a little chart here; it's nothing new, but it's about managing your failures. The service level agreement, the SLA, that clouds give on an individual VM gets you three nines. Okay, that's not bad, but it's not great either; that's still minutes or hours of downtime. With hardware failures, and when we're talking about cloud, one piece of hardware usually runs multiple VMs, multiple images, multiple containers, you get a five on the end of that. But when we're talking about entire data centers, we're now into four nines, and that's where multiple availability zones come in. Because if you look at a map, the satellite map of Northern Virginia, you'll see all these really big buildings with no windows, all lined up in one part near Reston. That's Amazon's US East. And each one of those buildings is an availability zone. And any one of them could lose power, network, or anything else. That's what we're trying to avoid.

Now, when you get into multi-region, and Amazon, say, has regions all over the world, there are, like I said, more contemporary reasons to do this, where you may want to be actually closer to your customers. The speed of light is still a thing. Especially when we're talking about network speeds, because it's not even the speed of light.
It's much slower, about 70% of the speed of light. So when you look at distances, like the difference between North America and India, that's pretty much halfway around the world, you're looking at round-trip times in the hundreds of milliseconds. (A rough back-of-the-envelope: roughly 12,500 km at 70% of 300,000 km/s is about 60 ms one way, so over 120 ms round trip before any routing overhead even enters the picture.) And if your data is in one place and your customers are in another, somebody's going to lose out. So getting your data closer to customers is the name of the game now. And if you have global customers, you need global data.

There are also a lot of regulatory requirements you have to manage. Every sovereign data protection law has different rules. If you're in Europe, in North America, in Australia, any of these governments (and they're always changing, by the way) are going to have regulatory requirements on privacy and things like that. You have to manage that. And when you have customers in those different areas, you need to manage the data that goes there. And then finally, and this is my favorite part, maximum uptime. I say this: you have a 0% chance of 100% uptime if you're only in one region. One region can fail. And if you do a quick Google on cloud failures, you can see that this happens quite often.

And then multi-cloud, and this is where people really get freaked out. Why would you do multi-cloud? It's a funny thing, because sometimes it's something that's done to you. As an SRE or an operator, multi-cloud can come about whenever, for instance, you have an acquisition and you acquire a company that's in a different cloud provider than you're in. Or if you're migrating to a new provider, say you're going from Amazon to Google; well, while you're in the middle of that, which is probably going to be three or four years, maybe five, you are now a multi-cloud company. And there's my favorite, which happens all the time: there's some rogue unit in your company that's just decided to go use a different cloud. Why? I don't know. No one's ever figured out exactly why, but it happens all the time. And I know a lot of you are probably nodding along like, yeah, it's that group. And there are other reasons for multi-cloud, and I think these are more relevant and contemporary, like I said: we're at the point now where we're talking about avoiding lock-in, spreading your bets, and using the cost basis of cloud in a better way.

So this is where we're getting into this world of hybrid. And hybrid is probably where a lot of you are now, where we're trying to save money in the cloud, but we still own a lot of data centers. And when you're migrating from the physical data center you spent billions of dollars on over to the cloud, during that timeframe, again, probably measured in years, you are in the hybrid zone. And when you're in the hybrid zone, you have a whole different management challenge. So what we're attempting to do at this point is help you manage that and make it less painful. Chris?

Brilliant. Yeah. And the other thing I'd just add to all of these, you know, the things we're using to build these Kubernetes platforms and how we're trying to get this maximum uptime, is that we're also talking here about data. Why are we saying it's now going to be critical to get your data into the Kubernetes clusters? I'm a big fan of looking back at history, and I'm pretty sure there's nothing really new we've come up with in the IT industry over the last 20 years.
We probably did it in the 40 years before that. And I think two of those trends apply here. The first one is looking at the data problems: we've moved our compute to the data. If you look at the whole MapReduce shift we had, and how we did data transformation, you know, "I need information from my data, how do I mine it?", we found out that moving datasets around is just too costly. So we took the compute to the data. But all our compute now is in Kubernetes clusters. So, QED, we need to have our data in our Kubernetes clusters.

And the other side of it is virtualization. We still think of grabbing storage as grabbing a disk. If you look at the 15K RPM disks of yesteryear, they max out at around 200 IOPS on random I/O. Now we can take an enterprise-grade NVMe SSD and get a million IOPS out of it without batting an eyelid. And the gentleman who looks after a lot of the Linux kernel components here has a setup with a couple of Optane cards, just two cards, and he can get 10 million IOPS out of them. So think about what we did when processors got that much more powerful: we virtualized them. These two trends are about making the best use of our resources. If we're virtualizing them, how do we get at the spare resources when they're not being used? How do we go back to those virtualization principles? And the other side of it is, how do we actually bring our data into the Kubernetes cluster?

So if you look at the next slide as well, the reasons for doing this: there's a community that my company is involved with, and Patrick's heavily involved with as well, which is Data on Kubernetes. We're surveying the people in there and asking, what are you finding when you put data in Kubernetes? They're finding there are a few things they've got to fix, but they're also finding that all the benefits they were getting from their applications, the standardization, the CI/CD, the clicking a button, putting their developers closer to the compute and giving them back control, moving faster, getting applications to production safely in a day, all those things we're looking for, data is actually complementing as well. There are a few things to solve, but what we're seeing is that this is what people are looking to do in their Kubernetes platforms now. So watch out: this is the revolution. And I think, Patrick, you're going to talk about virtual data centers next.

Yeah. The thing that I feel, and this is a very important part of my book, Chris, so this is a shameless plug, is this concept in Kubernetes where we're building out something very revolutionary and evolutionary at the same time. We went from physical servers to virtual machines to containers, and now we're orchestrating containers with a little bit of magic and creating virtual data centers. This is a logical progression in our virtualization: instead of owning a data center and managing the power and managing the network, all those things, we're at the point now where we're renting compute, network, and storage, and creating our own bespoke virtual data centers for exactly what we need. And the reason is that we're building applications that have a certain amount of input and require a certain amount of output.
And what happens inside that virtual data center we can be very specific about, and we can build the tools we need instead of just renting a service or something like that all the time. We're thinking a lot more about what builds our business, and being more agile.

So, just one thing before we go into the data patterns and look at how we can complement the data in Kubernetes, and where some applications have application-level capabilities: we are not saying to change any of the Kubernetes architectural patterns that are already out there. There's a link at the bottom here to the CNCF disaster recovery paper; it's very good. Broadly, it says don't stretch your Kubernetes clusters across regions. Start doing that at the application layer. As Patrick said earlier, the speed of light is finite. As soon as you leave an availability zone, if you're building in the cloud, things cost money. Bandwidth is not infinite, and all the other fallacies of distributed computing apply. At some point the CAP theorem will kick in, and one of partition tolerance, availability, or consistency will have to give. So what we're saying is: here's how to complement at the data layer. Don't think we're saying to do anything different at the Kubernetes layer. We just want to be very clear on that before we go any further. So yes, on to the next slide.

So now we get to the bulk of the fun, which is the first pattern. This is the demo I've put together. What I'm going to try to show is the single-cluster, single-region (multi-AZ) pattern. And for that, I've picked what I'll call a traditional application: Prometheus. It wants a single data store, and if the pod dies, it will want to restart and reconnect to that same data store. On the other side I've got Cassandra, which is using stateful sets. Cassandra is a fantastic application; we can stretch it halfway across the world, as we'll see in the next demo. But in this one, what I'm going to show is how we can let Cassandra do all the storage-layer intelligence itself. The important thing in both of these demos, the left-hand side and the right-hand side, is how we complement the application with the CSI (container storage interface) plugin, making sure those two things work in harmony. It's about an hour and 45 minutes from cold start to the end of this demo; we've cut it down to about eight minutes, so I'm going to try to narrate as we go along. So, Patrick, let's go for it.

First step, we're going to spin up our cluster. I'm using AWS because I can use eksctl to easily spin up my cluster. I'm using i3en instances; those are instances with local SSDs. I'm using them because I want 300,000 IOPS; I want the fastest storage I can get. And you can hopefully see there are two in each of the 1a, 1b, and 1c availability zones, so hopefully that aligns with the picture you saw on the previous page. Then we go into Lens; I'm just using that as my GUI so you can see what's going on. You can see my cluster, and there are no metrics; that's the important thing on the page we just went past. I'm using Ondat as my CSI plugin. I'm using it because it can run against local SSDs and will allow me to do intelligent things like storage replication, encryption, and topology awareness (both the spin-up and that more intelligent storage class are sketched below).
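A minimal sketch of that spin-up and, jumping ahead a touch, the replicated and encrypted storage class. The eksctl flags are standard, but the provisioner and parameter names follow StorageOS/Ondat conventions from memory, and the cluster name and sizes are illustrative, so check everything against the Ondat docs for your release:

```bash
# Spin up a six-node EKS cluster on i3en instances (local NVMe), two per AZ.
eksctl create cluster --name data-demo \
  --region us-east-1 \
  --zones us-east-1a,us-east-1b,us-east-1c \
  --node-type i3en.xlarge \
  --nodes 6

# The "intelligent" storage class used for Prometheus: replication across
# nodes/AZs plus per-volume encryption, handled at the storage layer.
# NOTE: provisioner and parameter names are assumptions based on
# StorageOS/Ondat conventions; verify them for your version.
kubectl apply -f - <<EOF
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: replicated-encrypted
provisioner: csi.storageos.com
parameters:
  storageos.com/replicas: "2"       # one primary plus two replicas, spread across AZs
  storageos.com/encryption: "true"  # per-volume encryption
volumeBindingMode: WaitForFirstConsumer
EOF
```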
And I can also do things like fencing for faster failovers. So I'm taking dumb SSDs, NVMe drives, and using a CSI plugin to add some intelligence on top. I've installed StorageOS (that's the underlying technology name of the Ondat product) onto the platform, and you'll see that component coming up there. When it installs, by default we get a storage class created which is just, I'll call it dumb storage: give me a straight bit of storage from an NVMe drive. I also created a more intelligent storage class where I'm asking the storage layer to do replication, and I've told it to encrypt as well, so we can do per-volume encryption. And I've asked it to replicate between the three AZs. That's the one I'm going to use for Prometheus, the more intelligent one, because Prometheus doesn't have the smarts to replicate the data itself. But for Cassandra, I'm going to let Cassandra do that.

So now we're on to installing the actual workloads. cert-manager has been going in in the background there, and then I'm using Kustomize to actually build and deploy the K8ssandra operator. So far we've had a kubectl plugin, we've had Kustomize, and we've had plain out-of-the-box kubectl apply: three different ways to install things. Now we're going to put Prometheus onto the cluster, so we'll go for a fourth way of installing things and use the community Helm charts. So Prometheus is going onto my cluster now as well. What we're getting to is the point where we have that traditional workload, and we should start to see metrics in our cluster. And in a second we're going to spin up our actual K8ssandra... sorry, Cassandra cluster, using the K8ssandra operator. So let's go and apply that. There should be a custom resource of... oh, I'm going too quick. There we go. But hopefully the important thing in there was the 1a, 1b, 1c. I told the K8ssandra operator to deploy a Cassandra cluster, and there should be a stateful set in rack one, rack two, and rack three, and those racks are pinned to the 1a, 1b, and 1c availability zones (there's a sketch of roughly what that looks like below). So hopefully in the bottom right you can see that little 1a, 1b, 1c.

I also do a little trick here: I'm going to label the pods at the Ondat layer to do fencing, so we can do faster failover. If a node times out, rather than waiting the five minutes for the Kubernetes node timeout to expire, the platform can actually monitor it, kill the pod, and start a faster failover if you want. Just a little trick to speed things up. The other thing here: I'm now running a job, NoSQLBench, so there's a live workload running against this Cassandra cluster. And you see the top right there? We've got metrics. So Prometheus is running, we've got metric storage for Prometheus, Cassandra is running, and you can see NoSQLBench ticking along. We'll keep going back to NoSQLBench. And what you should see is we get up to about 60% through my workload; it's not running a hard benchmark, just a background load. And now we're going to do something nasty: we're going to cordon and drain a node. I think this is the one in 1c; I can't remember exactly, but it's one of the availability zones.
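Roughly the shape of that rack-pinned custom resource, following the cass-operator CassandraDatacenter CRD that K8ssandra used at the time; names, versions, and sizes here are illustrative, and the fencing label in particular is an assumption to verify against the Ondat docs:

```bash
kubectl apply -f - <<EOF
apiVersion: cassandra.datastax.com/v1beta1
kind: CassandraDatacenter
metadata:
  name: dc1
spec:
  clusterName: demo
  serverType: cassandra
  serverVersion: "4.0.1"
  size: 3                           # one Cassandra node per rack
  storageConfig:
    cassandraDataVolumeClaimSpec:
      storageClassName: fast-nvme   # the "dumb" local-NVMe class; Cassandra replicates itself
      accessModes: ["ReadWriteOnce"]
      resources:
        requests:
          storage: 100Gi
  racks:
    - name: rack1
      nodeAffinityLabels:
        topology.kubernetes.io/zone: us-east-1a
    - name: rack2
      nodeAffinityLabels:
        topology.kubernetes.io/zone: us-east-1b
    - name: rack3
      nodeAffinityLabels:
        topology.kubernetes.io/zone: us-east-1c
EOF

# The fencing trick: labelling the Cassandra pods so the storage layer can
# fence a lost node quickly. The exact label name is an assumption; check
# the Ondat fencing docs before relying on it.
kubectl label pod -l cassandra.datastax.com/cluster=demo storageos.com/fenced=true
```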
And I picked this node because it's running both one of the Cassandra pods and my Prometheus server. Are you sure you want to drain the node? Yes, of course I am; I want to kill my workload mid-flight and see what happens. So, Kubernetes being Kubernetes... Yeah, exactly. Kubernetes being Kubernetes does its happy thing and goes, oh, I've evicted my pods, I'd better go and talk to my scheduler and see where to put them. So the Prometheus server is coming up there. And if we go and look at Cassandra, here we go: you see my rack2 instance is very orange there and very pending; it's not very happy. So let's check that my NoSQLBench benchmark is still running through. Even though Cassandra has lost a node, it's still running. Happy days. And it's not using any smarts in the CSI plugin here: it's using the CSI plugin to automate access to the NVMe drives, but it's not doing any replication at that layer; that's all happening in the Cassandra layer.

But my Prometheus workload, we killed it on that node. It was using a local SSD, but luckily I've got replication at the storage layer. So even though that workload has moved to another node, another local NVMe drive, if we have a look, going into one of these, we've still got metrics, and we've got all the historical metrics as well. You can go back and look at the timestamps and you'll see it's been running all the way through. So we'll bring that node back into the cluster as well.

And now, of course, because I'm evil, we're going to go for the big test: rather than turning things off nicely, let's just check our NoSQLBench... yeah, still got workload going through our Cassandra cluster, brilliant, things are still ticking along... we're going to pull the plug out of the server. So I go into my AWS console and I terminate my instance. And of course it says to me, are you sure? Because you're using very fast local NVMe devices, the 300,000-IOPS ones, and if you kill them, you lose them. And I'm like, yeah, don't worry, we've got that covered with the plugin. It's fine. So we kill it. And you notice I've lost one of my other Cassandra nodes now, but NoSQLBench is still plugging away. So, brilliant, I've still got workload; we're up to about 40%, 44% of that run at the moment, and it's still running through.

Now, what actually happened here: it couldn't recover onto the second node in that availability zone, because the Stargate pod was running there. And I discovered later on that in the Cassandra operator there's actually a flag you have to set if you want to allow Stargate and Cassandra to coexist on a node. So what actually happened was Kubernetes did its stuff. If you go and look at the Kubernetes cluster in EKS, you'll see it's initializing here: it's actually building a seventh node to replace the sixth one that I so nicely killed. It gets tainted as it comes up. And as it comes up, notice on the right-hand side there's a small gap in the metrics, because we killed that node, we killed that stateful set's pod, by pulling the power out. But there we go, there's another StorageOS daemon set; the daemon set is running on there, so we can access the storage over the network now. And there we go: the Cassandra cluster is back to full health, we've got three stateful sets in three AZs, and NoSQLBench is running along. So there we go.
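For reference, the two failure tests boil down to something like this; the node and instance names are illustrative placeholders:

```bash
# Test 1: the polite failure. Cordon and drain the node running both a
# Cassandra pod and the Prometheus server, then watch things reschedule.
kubectl drain ip-10-0-1-23.ec2.internal --ignore-daemonsets --delete-emptydir-data
kubectl get pods -o wide --watch

# Test 2: the evil one. Terminate the EC2 instance outright, local NVMe and
# all, and rely on storage-layer replication (for Prometheus) and Cassandra's
# own replication to carry the workload through.
aws ec2 terminate-instances --instance-ids i-0123456789abcdef0
```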
Hopefully everyone followed that. As I said, it's about an hour and 45 minutes end to end in real time, and we've compressed it to eight minutes. And if we go on to the next slide, these are the big takeaways for me. Follow the Kubernetes design principles: we're not trying to do clever things by stretching Kubernetes clusters; we are doing clever things with the CSI plugins. I'm trying to get the fastest storage at the most cost-effective point I can, and I'm trying to make sure that CSI plugin is doing the right things for the right workload. So for a workload like Prometheus I'm doing storage-level replication, storage-level encryption, storage-level topology awareness, all of those components. But for Cassandra I'm just saying, hey, I'll automate your access to these NVMe drives so you can have as many IOPS as you want for your back-end storage. And what we're trying to say is, as we said, you can get to no-to-low downtime in a single region by taking this approach. And because we're using local NVMe drives, this approach will map to your EKS Anywhere or your on-premises environments, anything like that. Hopefully you can see there's nothing special about running this in the cloud; we're using local NVMe, so you could take this and run it in your data center today as well. And I believe that's the end of the first set of patterns, and I'm going to hand back to Patrick now to take us into the multi-region, multi-provider ones.

Yeah, the multi, multi, multi: the multiverse, the multiverse of madness. Speaking of shameless plugs. So this is another pattern, and it exploits the goodness of Cassandra. This is again using K8ssandra, which is a project built for running Cassandra on Kubernetes without pain, batteries included. And there are some really cool things K8ssandra can do, especially with the latest operator, around running multi-cluster, multi-Kubernetes-cluster deployments. So if you have multiple Kubernetes clusters across regions or across different clouds, it doesn't really matter. Thankfully, Rags is really the expert in this, and he recorded a quick video about how to do it. I'm going to just let him walk through it, and then at the end I'll go through some lessons learned and some takeaways.

This is a demo about Cassandra on multiple clusters. So the first thing you're probably going to want to do is set up the networking: the routing, the peering, and all that. I myself am not a networking geek, so what I've done is use a product called Aviatrix. Using Aviatrix, you can set up the VPCs, the routing, and the peering fairly easily. So essentially, this is the Aviatrix controller, and it's running on the cloud somewhere, right? And as you can see here, I've set it up in such a way that AKS, which is the Azure Kubernetes Service, is running on 10.2 (10.2.0 and 10.2.128 and so on), EKS is on 10.1, and GKE is on 10.3. And we can take a really brief look at the Terraform scripts. They're on my GitHub, ragsns, under the avx multi-cloud Kubernetes repo, and I basically forked them from Aviatrix. And essentially, what you do here is specify the different cloud accounts. Once you've onboarded those accounts, which I haven't talked about, you set up the controller IP, the username, and the password.
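That wiring looks roughly like the stanza below. The argument names follow the Aviatrix Terraform provider, but the controller address and credentials are placeholders:

```bash
# Point the Aviatrix Terraform provider at the controller; the rest of the
# scripts then drive the VPC creation, routing, and peering through it.
cat > provider.tf <<'EOF'
provider "aviatrix" {
  controller_ip = "203.0.113.10"          # the controller VM's address
  username      = "admin"
  password      = var.controller_password # keep this out of source control
}
EOF
terraform init && terraform plan
```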
And then the rest of the Terraform scripts use this to do the appropriate networking. Let's go into Lens, okay, and look at these different clusters. As you can see here, I have a bunch of different clusters, but I've added three of them to the hotbar here: the AKS AVX cluster, the EKS AVX cluster, and the GKE AVX cluster. Okay, so those are the three.

So now that we've seen that, the next step, once I've set up the clusters on the three public clouds. There used to be a manual step if you wanted to do a multi-cluster, in which you had to manually inject the seeds from one cluster into another. You don't have to do that anymore. Instead, you have the concept of what's referred to as a control plane, and then the concept of a data plane, very similar to what the Kubernetes services themselves provide. The control plane essentially takes care of installing Cassandra across Kubernetes clusters, multi-cluster, multi-region, multi-cloud, again, it doesn't matter; the control plane handles that for you. So what I did to install the control plane is pretty straightforward: basically, I installed the K8ssandra operator, and there's a control-plane setting, and you set that to true. So if you look at the different clusters, you will see that one of them, AKS, is set up as the control plane, and the remaining two, GKE and EKS, are set up as data planes. So as you can see here, this is the GKE cluster, and it has the K8ssandra operator; if you dig through that operator, you'll be able to see that it's operating as a data plane, and AKS is the control plane. So that's how I set that up. To install the K8ssandra operator on the control plane, you essentially set the control-plane value equal to true, but for the data planes, you set it equal to false.

The next part is installing the client configurations into the control plane cluster. So you take the GKE client config and install it in AKS, and likewise take what's in EKS and let the control plane on AKS know that EKS is part of it. And if you look at the ClientConfigs, you can see that these two clusters have been set up on AKS: the EKS cluster and the GKE cluster. To install the cluster, and we can take a quick look at this, essentially we're installing a K8ssandra cluster. You can specify some default values for the storage class, but as you know, the storage class is going to vary across the different clouds: for GKE I'm using premium-rwo, and for EKS I'm using gp2. And then I'm going to set up Stargate, which is our unified API layer for REST, CQL, gRPC, GraphQL, and so on.

Now that we've done that, let's go take a look at one of the clusters; let's take a look at GKE. Let me look at the pods that are running and, most critically, the Cassandra datacenter itself. So this is the GKE cluster. And if I go back and look at EKS, you'll find something similar as well. So if you look at the Cassandra datacenter, this is on EKS.
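Pulling that install together, a hedged sketch. The controlPlane value and the K8ssandraCluster fields follow k8ssandra-operator conventions of the time, the ClientConfigs are assumed to be registered already (the operator repo ships a helper script for that), and the context names, sizes, and versions are illustrative:

```bash
# (assumes: helm repo add k8ssandra https://helm.k8ssandra.io/stable)

# Control plane on AKS (controlPlane defaults to true in the chart):
helm install k8ssandra-operator k8ssandra/k8ssandra-operator \
  --namespace k8ssandra-operator --create-namespace

# Data planes on EKS and GKE: same chart, controlPlane=false.
helm install k8ssandra-operator k8ssandra/k8ssandra-operator \
  --namespace k8ssandra-operator --create-namespace \
  --set controlPlane=false

# Then, on the control plane, one K8ssandraCluster spanning both data planes:
kubectl apply -f - <<EOF
apiVersion: k8ssandra.io/v1alpha1
kind: K8ssandraCluster
metadata:
  name: multi-cloud-demo
spec:
  cassandra:
    serverVersion: "4.0.1"
    datacenters:
      - metadata:
          name: dc-eks
        k8sContext: eks-avx          # must match a registered ClientConfig
        size: 3
        storageConfig:
          cassandraDataVolumeClaimSpec:
            storageClassName: gp2    # EKS storage class
            accessModes: ["ReadWriteOnce"]
            resources:
              requests:
                storage: 100Gi
        stargate:
          size: 1
      - metadata:
          name: dc-gke
        k8sContext: gke-avx
        size: 3
        storageConfig:
          cassandraDataVolumeClaimSpec:
            storageClassName: premium-rwo  # GKE storage class
            accessModes: ["ReadWriteOnce"]
            resources:
              requests:
                storage: 100Gi
        stargate:
          size: 1
EOF
```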
And essentially what happens when you install the K8ssandra operator with the control plane and the data planes set up appropriately is that the nodes talk over Cassandra's gossip protocol. So you need to enable those ports so they can talk to each other, and once they can talk to each other, they can form a multi-datacenter cluster. And one of the critical things you do then is make sure it really is one giant, humongous cluster; basically, that's a nodetool command (there's a sketch of it after the takeaways below). So what you do is exec into the pod that's running the datacenter, right? You specify the namespace, obviously, and then you specify the username and the password, and it will tell you what kind of a datacenter it is. And it essentially comes back saying that it is a multi-datacenter cluster spanning EKS and GKE. So essentially what I've shown here is that setting up a multi-cluster with K8ssandra is pretty straightforward, because once you enable networking, K8ssandra is able to talk to the other datacenter, the other nodes, and automatically join up the cluster.

So that was a really quick and interesting demo about how we can run multi-cloud. And there were some really important takeaways inside of that. The first one, really, and Rags was very clear about how he did it, is getting the network right. When you're transiting from one region or from one cloud to another, that's not just the click of a button, and we're not going to try to sugarcoat it. Getting the network right means you need to have your routes, your security, and everything else right. It's not impossible; we've done worse things. Those of you who've ever done BGP and Cisco networking know this is probably going to be way easier than that. But it is the most important thing in this case. And keep in mind your charges, because when you do transit, that's when the cloud charges really get up there. The next thing: if you noticed in the demo, he did this cross-credentials thing, where, say, Google needed Azure's credentials, from one cluster to the other. That's a really important thing to take note of, because when you're spanning one Kubernetes cluster to another, they need to be cool with each other, and the way we do that is with credentials. And then finally, with the K8ssandra operator, we do have this difference between a control plane and a data plane. And that's an important callout, because it's not just one monolithic thing. What we're trying to do is run across multiple Kubernetes clusters, which goes against what Kubernetes really wants to do; it's like, "I want to run it all." Well, you have to share it out. So the K8ssandra operator does the important work of making sure Cassandra works across multiple clouds. And Chris, you've got the next one.
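And here's the sketch promised above of that verification step; the pod name, namespace, and credentials are illustrative:

```bash
# Exec into a Cassandra pod and ask nodetool for the cluster view; with the
# gossip ports open, you should see two datacenter sections, dc-eks and
# dc-gke, with every node reported as UN (Up/Normal).
kubectl exec -n k8ssandra-operator multi-cloud-demo-dc-eks-default-sts-0 -c cassandra -- \
  nodetool -u demo-superuser -pw "$CASSANDRA_PASSWORD" status
```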
Yeah, so I think, just a summary of it all. Hopefully you've seen how you can put data into your Kubernetes clusters safely, how you can take these patterns and build applications with no-to-low downtime, and, I mean, this is the big thing, the big chunk in the middle here: complement your CSI provider with your application. There's got to be that contract between the infrastructure and operations side of Kubernetes and the applications consuming it, so application teams know how to design their applications. You don't want nine-way replication because you've got three-way replication in the application on top of three-way replication at the storage layer. The other thing in there: protect yourself. Don't let yourself get denial-of-serviced; think about your usual Kubernetes resource quotas and limit ranges and all those kinds of things. Use a CSI plugin. And lastly, as Patrick said before: you can do this, but you will be swapping complexity into the network and security layers by doing it. And always remember what your costs will be as soon as you exit anything; ingress and egress costs are the things that will start costing you money when you run this in anger across large distances.

Yeah, and that's a really important point. Neither of us is here saying this is click-a-button easy, and that's because the clouds don't get along with each other. That's okay. It should be something we're all used to as engineers and operators. These are the hard parts; that's why we still get paid money. If it were the click of a button, developers would be doing this all day, right?

Indeed. And I think, in summary, we're saying: if you're going to be successful with Kubernetes, you've got to get your data into Kubernetes. So I'm expecting that over the next 12 to 24 months, everyone's going to have this as the next leg of their Kubernetes strategy. So hopefully there are some good questions for us after this; please hit us up on the Q&A tool. And there are some links in here as well. All right. I'll just say thank you, everyone, for listening, and I hope it's been useful. We've had a lot of fun putting this stuff together. So thank you. Thanks, everyone. Yes.