So yeah, welcome to this talk. The title is Multi-Data Center Strategies for Data Services. It was meant to be a full session, so I had to cut it down to 10 minutes, which I didn't quite achieve; it's going to be a little bit more. A few words about myself: I'm Julian Fischer, CEO of anynines. You might have heard of us; we focus on data service automation for Cloud Foundry and other platforms. A little disclaimer: as I said, the talk was meant to be longer than this, so we are only going to scratch the surface, and at several points I will have to oversimplify things to a degree that I would throw tomatoes at myself. If that's the case, feel free to grab me outside later, and we can have discussions or questions afterwards. The reason I'd like to talk about this topic is that during our work with customers we've seen it being discussed widely, and also that, in many cases, some elementary understanding of the dos and don'ts is missing. Because of the length of the talk, I'll try to get through the basic scenario definitions a little, then talk about what we can actually achieve in terms of data service automation and about the multi-data-center topic as such. I think one of the most important things, when you think about having multiple data centers, is to be clear about what you actually want to achieve. Sounds obvious, but having two data centers can mean very, very different things. The term itself gives you no information; you can present some fancy slides and tell somebody what this means for data services, but there is simply no connection between merely having multiple data centers and a good design. You have to go down the road, gather requirements, and define the goals you want to fulfill. In those conversations, many different motivations may be relevant.
For example, you could have an audience spread around the globe, so you want to serve content from a nearby data center instead of having a long-distance connection with a lot of latency. Lowering latency may be an interesting motivation, and of course availability: in case something happens, for example a local disaster such as a fire (sadly, California is experiencing something like that at the moment), your data center could be threatened. And there are things in between. For example, do you want a load scale-out, or do you want a capacity scale-out? Do you want to provide redundancy and load balancing? These things are interconnected, and you have to go through your particular scenario and define what you actually want to achieve. As for us, we've been serving Cloud Foundry as a public offering since 2013. We started on vSphere, moved to OpenStack, had a lot of regret, and moved to vSphere and AWS. So we've been running Cloud Foundry, and, talking about migrating customer applications in each of those migrations, we've learned that we have to be infrastructure agnostic, because infrastructure technologies still change a lot: either you learn that whatever infrastructure you've got is not the best, or another infrastructure provider becomes more important or cheaper, or you want to go to a geographical area where a certain provider is more accepted than another. In Asia, for example, I've heard of local companies preferring Microsoft Azure over Amazon, where in the US it might be the other way around. So in our platform strategy, we always had the assumption that we want a homogeneous operational model, which means we would have similar operational procedures for running a Postgres as for running a Redis, and we want that across multiple data centers without tying into the technical details too much.
I've seen customers who run on OpenStack and are happy with it, and therefore accept infrastructure-specific ties. This is one of those design goals you have to be conscious about; that's all I'm saying. There's no right or wrong, it's just something you have to be clear about. Then you can start thinking about the data service design. Just to have a common understanding, there are two major scenarios we've seen repeatedly. The first is where you have multiple data centers in close range. I think Switzerland is a good example, right? Because it's smaller. I know a company that has multiple data centers and the luck to also own the connections between them. That's a good example of close-range data centers, geographically close to one another; I'm talking about less than a few hundred kilometers, ideally less than 100 or 50 kilometers. So you have high bandwidth, low latency, and therefore a reliable connection. This makes a few tricks more appealing than in the other scenario, which we'll look at a little later. Here we're talking about the possibility to stretch the infrastructure. We've seen customers do that. We are not infrastructure experts as such, so I won't say anything against it, although my gut feeling suggests this might be a problem unless you're really, really sure about the quality of the connection between the data centers. In the stretched-infrastructure scenario, you look at possibly several data centers and somehow manage to make a shared network happen. In that case, each of the data centers has one or multiple availability zones, which is already interesting: look at that scenario and you see two data centers and three availability zones in total.
We've also seen two data centers with one availability zone each, obviously leading to the problem that any quorum-based data service won't be free of the split-brain problem, because a quorum requires 2n + 1 nodes, so at least three. In the first case you have three nodes, but consider what kind of disaster we're guarding against: why are we going to a second data center in the first place? The reason is not that data center one is full; it's redundancy. A data center might be troubled entirely. If you lose data center one in that scenario, you're losing two of the three availability zones, and for most data services this means either losing the capability of electing a leader or even losing data, depending on the data service you're looking at. You could add a second availability zone to data center two, but the problem obviously is still quorum, because one of the most likely scenarios is a connection problem between data center one and data center two, and then you have a split-brain problem: the cluster manager usually runs on the nodes, and with the nodes spread across the availability zones, they would be split down the middle. There would be two cluster nodes on the left side and two on the right side, and neither partial cluster would be able to determine whether it is the new majority, so leader election is impossible. If you add another availability zone to one of the data centers, you're basically back to the first scenario, where if data center one goes down, you still lose the majority of the nodes. Whether the data is lost really depends on the data service. In most data services you would be able to recover and survive losing two out of five nodes, so only the scenario of losing data center one would be really, really problematic. Still, what's the point of having two data centers then?
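The quorum arithmetic above can be sketched in a few lines. This is a minimal illustration, not taken from any real cluster manager; the node counts mirror the scenarios in the talk.

```python
# Minimal sketch of the quorum rule behind the split-brain discussion:
# after a partition, a group of nodes may elect a leader only if it still
# holds a strict majority of the full cluster.

def has_quorum(nodes_alive: int, cluster_size: int) -> bool:
    """A partition keeps quorum only with a strict majority of the cluster."""
    return nodes_alive > cluster_size // 2

# Two data centers, two nodes each (cluster of 4): a link failure splits 2/2,
# and neither half can establish a majority, so no leader can be elected.
print(has_quorum(2, 4), has_quorum(2, 4))

# Five nodes, three in DC1 and two in DC2: losing DC1 takes the majority away.
print(has_quorum(2, 5))  # DC2 alone: False
print(has_quorum(3, 5))  # DC1 alone: True
```

This is why the two-data-center layouts in the talk all fail the same way: however you distribute an even split, at least one likely failure leaves no side with a majority.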
Another scenario could be three data centers with one availability zone each. That's a way to go. And of course you could go all in and have three availability zones per data center, with local failover across three data centers, but who would be able to pay for that? Nobody would ever do that. In any case, the stretched-infrastructure scenario has the appealing part that you can just declare the availability zones, without telling BOSH much about the existence of different data centers, and let the virtual machines be spread across availability zones. In that particular scenario, you would be distributing virtual machines across three data centers with minimal impact on the data service automation. In our case, data services are provisioned with BOSH: we just create a Postgres cluster, and the Postgres cluster is automatically spread across data centers. I'm not advertising this solution, because deep down I don't trust the stretched data center, so I would be careful with it. But it's an appealing idea; I've seen people looking at it, I've actually seen people doing it, and so far they haven't told me that everything went south. And if you have data centers close by, whatever scenario works for a multi-region setup obviously also works for a setup where the data centers are close by; it's just a question of whether that gives you any benefit. So let's look into a scenario which is pretty common in countries such as the US, where you have East Coast/West Coast kinds of setups. Your data centers are widely spread, and with that, even if you own the connections, I'm pretty sure that physics alone will give you some fluctuation in bandwidth and latency, simply because signals take time over that distance. So I'm not really sure you want to go with the stretched infrastructure in that scenario. What I have seen instead is that you have multiple Cloud Foundry instances, or foundations, however you want to call them.
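The "declare the AZs and let the VMs be spread" idea can be sketched roughly as follows. This is an assumption-laden toy, not BOSH's actual placement algorithm: the round-robin policy and the zone names are invented for illustration.

```python
# Rough sketch of a director-style placement: instances of a deployment are
# spread round-robin across the availability zones declared in the manifest.
# In the stretched scenario, each zone happens to map to a different data
# center, but the placement logic itself doesn't need to know that.

from itertools import cycle

def spread_instances(instances: int, zones: list[str]) -> dict[str, list[str]]:
    placement: dict[str, list[str]] = {z: [] for z in zones}
    for i, zone in zip(range(instances), cycle(zones)):
        placement[zone].append(f"postgres/{i}")
    return placement

# Three zones, each backed by a different data center (hypothetical names).
print(spread_instances(3, ["z1-dc1", "z2-dc2", "z3-dc3"]))
```

The point of the sketch is the indirection: the data service automation only ever talks about zones, so the cluster lands in three data centers without the automation changing at all.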
So for example, three data centers and three Cloud Foundry foundations. That immediately leads to the question of who orchestrates those Cloud Foundries and who synchronizes them. Because, spoiler alert, Cloud Foundry doesn't really support that, at least not at the moment. I'm pretty sure they have something on the roadmap, and if somebody knows more about that, it would be an interesting question afterwards. So next: how do you deal with multiple Cloud Foundries? I've seen people using CI pipelines, load balancers in front, and proxy services in front that replay activities towards multiple Cloud Foundries and try to keep their state similar. The thing with that kind of approach is always the question of what it means for the data services. As long as the data services are not connected, can you really assure that each Cloud Foundry, with all its data services, is in the same state? I wouldn't bet my life on that, because it's really hard to achieve. Also, similar to a database, you have the problem that if one Cloud Foundry is unavailable, whatever changes are to be committed to it have to be stored in a kind of transaction log and replayed once it comes back up. But a change could have a different effect depending on when it's actually executed. So I'm not really sure; you get a lot of trouble with that one, because you have to monitor, you have to perform failover, and you have to think about what actually happens if a Cloud Foundry goes away and comes back. Another way could be to have a primary Cloud Foundry. Where the previous setup was a kind of multi-master scenario, this would be a primary/secondary, or master/slave, scenario: you have one Cloud Foundry you're writing to, and you try to replicate all the changes made to that Cloud Foundry to the secondary ones. In such a scenario, what you could do is have a database.
I'm using Postgres as an example. We use Postgres with asynchronous replication, and asynchronous replication is pretty tolerant when it comes to fluctuating bandwidth and latency. So in a multi-region setup, it's something you could do. In that scenario, you would be writing to a master database in a primary Cloud Foundry, maybe because the request hit that particular Cloud Foundry, and then you'd replicate the changes to databases in other Cloud Foundry foundations, in other remote regions. When it comes to automation, this seems doable to me: you could have a service broker that triggers a BOSH deployment, either in multiple BOSHes or with one BOSH that supports multiple infrastructures. So behind the service broker, you could provide automated awareness of different regions. Doable. But then the problem with asynchronous replication is its inherent drawback: replication lag. We'll look a little later at what it means to have an asynchronously replicating database. One of the questions I'm asked a lot is: is there a generic way to make data services multi-DC capable? We have a framework for automating data services, so we've been looking at that, and the answer clearly is no, because data services are vastly different. Even when you know that a data service uses asynchronous replication, the implementations might be so different that you have to look at the particular data service and its very specific operational model to see what it means to make it distributed across data centers, or whether it makes sense at all. Another commonly asked question: is multi-region awareness a matter of the application or of the data service layer? I think the answer to that is not clear-cut. It really depends: on the design goals, on your data center topology, on the data service you're using, and on the application architecture you're looking at.
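The broker-driven automation described above could be sketched like this. Everything here is hypothetical: `deploy()` stands in for triggering a BOSH deployment, and the region names and return shape are invented for illustration, not any real broker API.

```python
# Hypothetical sketch of a service broker provisioning a cross-region
# database: one primary deployment plus asynchronous standbys in other
# regions. deploy() is a stand-in for a BOSH deployment trigger.

from typing import Optional

def deploy(region: str, role: str, upstream: Optional[str] = None) -> dict:
    """Pretend to trigger a BOSH deployment in the given region."""
    return {"region": region, "role": role, "replicates_from": upstream}

def provision(primary_region: str, standby_regions: list[str]) -> list[dict]:
    deployments = [deploy(primary_region, "primary")]
    for region in standby_regions:
        # Each standby follows the primary via asynchronous replication.
        deployments.append(deploy(region, "standby", upstream=primary_region))
    return deployments

for d in provision("us-east", ["us-west", "eu-central"]):
    print(d)
```

The design point is that region awareness lives behind the broker: the application just binds to a service instance, while the broker decides where the primary and standbys land.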
But if you force me to give you some kind of navigational knowledge, it wouldn't be much, but at least I could say that it is a data service matter if the data service is inherently designed for it. If you look at a technology such as Cassandra, where you can configure the replication behavior for your particular use case, that is somewhere in between: it's still application specific, but it also has a data service side to it. And it's a data service automation matter in the Postgres example, where Postgres gives you the ability to replicate but doesn't give you a cluster manager: no failure detection, and no automation to be executed once a leader election takes place. So there are still blanks to fill in, but at least you have the tooling to do it. I'm pretty sure there are a lot of data services that are inherently not designed for multi-DC scenarios, and in those cases you most likely have to do it at the application level or choose another data service. But let's look for a second at the idea of having an asynchronously replicated database in a multi-region environment. You can make that happen: you have your master database in US East and your slave in US West, or your primary and your secondary, and as long as the write rate to your application in the primary location is low enough, you won't have problems with it. I found that picture very suitable, because replication lag is like this: if you go faster with your car, at some point your dog will start suffering. Same with databases. If you have databases that are widely spread and you start writing to the master database too fast, the replication lag increases and increases, and at some point the replication is just way too far out of sync and may collapse.
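The car-and-dog picture can be put into a toy model: if the primary commits writes faster than the standby can apply them, the backlog grows without bound. The rates below are invented numbers purely for illustration, not measurements from any real Postgres setup.

```python
# Toy model of asynchronous replication lag: the standby applies changes at a
# fixed rate; whenever the primary's write rate exceeds it, the backlog of
# unapplied writes (the "lag") grows every second.

def lag_over_time(write_rate: float, apply_rate: float, seconds: int) -> list[float]:
    lag, history = 0.0, []
    for _ in range(seconds):
        lag = max(0.0, lag + write_rate - apply_rate)  # backlog can't go negative
        history.append(lag)
    return history

print(lag_over_time(write_rate=100, apply_rate=120, seconds=5)[-1])  # keeps up: 0.0
print(lag_over_time(write_rate=150, apply_rate=120, seconds=5)[-1])  # falls behind: 150.0
```

Real lag also depends on network jitter and batching, but the shape of the problem is the same: below the standby's apply rate the system is stable, above it the lag only ever grows.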
So it works under certain circumstances, but how can you grasp those circumstances across a wide range of applications? It's hard. My assumption was that in every other case it's more of an application concern, but the conclusion is actually that multi-datacenter awareness is per se not an aspect of a platform. It's not something you just enable in your platform so that it becomes transparent to your developers, and they just deploy something and are multi-datacenter aware. Not without making assumptions about the data service layer and choosing data services carefully. However, it is wise to design early for multi-datacenter awareness. I've seen customers adding that feature to their platform late, having started with one configuration of availability zones, number of data centers, and so on; then you have to make those changes and propagate them across the stack, which is costly and time consuming. So in our opinion, multi-datacenter awareness in general hits the entire stack: from the data center design to the infrastructure design, how you configure your availability zones, the choice of data services, possibly the automation when looking at cluster managers, and obviously also your application design, where you can still do a lot of tricks. Sorry guys, that was way longer than expected. So thanks a lot, and please feel free to blame me or ask questions.