I guess we are ready to start. Hello, good morning to everyone. Hello, everyone. My name is Paro, and I'm here to talk to you about high availability and high reliability in a distributed cloud-native infrastructure. I believe that everybody in this room, at one point or another, has had problems with cloud outages. These things happen. They're inevitable. It is really difficult to avoid them, and no cloud provider can guarantee 100% uptime, simply because there are so many things that even the best teams can't avoid, because those things don't depend on them. There are external factors like upstream providers or internet connection issues. There is the human factor: sometimes we make mistakes in our deployments or in the code that we deploy. There are natural disasters that can take regions down or cause major trouble for all kinds of services around the globe. And it is prohibitively expensive, no matter what type of business you are, to deploy your applications everywhere around the world at the same time. So it is not a question of if your servers or services will go down, it is a question of when. And usually these things happen when you least expect them or are not prepared for them. Somebody gets paged in your company and they have to figure out, probably in the middle of the night, or if they're lucky during business hours, how to solve the problem. Even major things like your DNS can go down, and your team can make a mistake that is difficult to roll back. So many of us usually deal with these things with disaster recovery, and the vast majority of people fall into one of these four categories. When it comes to disaster recovery, you either have some form of active-active deployment, which means if your primary server goes down, you flip the switch in your DNS and your requests go to the second one, which is also active. If you're in this category, you're probably one of the lucky ones. The next one is active-passive. It is similar to the previous one, but it's cheaper because you don't pay for the hosting of the passive instance or cluster or whatever it might be. You have to spin up the passive instance, then you flip the switch in your DNS, and now you have happy users again: no more complaints on social media that your services are down, no more unhappy customers. Another option is to have periodic backups of your databases. Then when things go down, you spin up your code somewhere, you restore the backups, and you continue serving as normal. And the final category is that you don't have any disaster recovery whatsoever, because you're busy building features, and when things go down, you'll figure it out. Unfortunately, there are way too many people in that category. Even people who used to be in the active-active one tend to fall down the category list, because staying there requires extremely high discipline, and the human factor gets in the way. The reason I've titled this slide like this is that your entire team needs to be responsible for what will happen in case of a disaster. Because disasters will happen. Disaster recovery usually requires something to be executed: you have some kind of runbooks or some kind of script, something that you have to run in case of a disaster, or some system of yours does it for you. That's the best case: no people are involved, and some system executes it when some metric hits an unreasonably low or high level, whatever the metric might be.
You still have to replay your disaster recovery plan over and over again, because as you keep adding new features and new components to your system, your disaster recovery might change. So it's not "create it once and be comfortable forever". You have to rehearse your disaster recovery plan maybe every month, every quarter, or as often as you can. Not all companies can afford to have a dedicated SRE team. That costs money. Not everybody can be like the big guys out there, the big companies. I won't mention names, but they have huge SRE teams. Startups especially can't afford a huge SRE or DevOps team that is only responsible for this. And does your disaster recovery require manual intervention? Even if you have a reasonably good disaster recovery plan, oftentimes somebody has to wake up, or step in, or open their laptop in the middle of a trip, and type some magic into the terminal, and things get restored, assuming they got notified in time, and so on. So there is often a human factor involved during an incident or disaster recovery. And it all boils down to these two terms, RTO and RPO. They are extremely important in the solutions architecture world, and even in management meetings. RTO is the recovery time objective. Many times you have a talk with your service provider or some form of cloud provider: what is their RTO? In case of a disaster, how soon can they recover from it? Guaranteed? Well, there is nothing guaranteed in this world, but we have some goals; they give you some SLAs. And RPO, the recovery point objective, is how much data you can afford to lose during an incident. If you talk to your higher management or your customers, they will always want a recovery time objective of "immediately"; actually, they don't want any downtime at all. And for the recovery point objective, in some industries you're actually not allowed to lose data. The financial industry, for example, or the healthcare industry. Losing data in these industries could lead to serious compliance issues. And during an incident, you can't predict what will go down. So these are two terms you can hear SRE teams talking about over and over again. So what if there were a different approach to disaster recovery? An approach that is self-healing and does not require any human intervention whatsoever. I repeat: whatsoever, no human intervention. An approach that doesn't involve any single points of failure and expects that anything could go down, including your DNS servers, the things that you rely on to switch to something else in case of a disaster. So I have my opinionated view that this type of problem can be solved with BGP, specifically anycast IP addresses. How many people in the room are familiar with BGP or anycast? Okay, we have half the room. That is super cool. So for those that don't know what BGP is, I'll explain it very quickly. The internet is a network of networks, and BGP, excuse my drawing skills, makes sure that these networks, which are called autonomous systems, communicate with each other in the most efficient way. So if there is one server that needs to reach another server, BGP makes sure that the IP packets from server A to its destination find the most efficient route on the internet. That doesn't always mean the shortest path, but in the vast majority of cases, it is the shortest path.
Each of these dark dots on the map is, let's call them, a BGP actor. They announce some IP range, and each of them tells its peers, its neighbors: hey, this IP range lives in my autonomous system, tell it to your peers. Then each autonomous system propagates that message to everyone else. So when a server announces an IP range, BGP makes sure that in a couple of seconds the whole internet knows where that IP range lives. Then, when there is a packet that needs to reach one IP from that range, everybody in the world knows where to send it, and when it reaches that autonomous system, internal routing finds the exact server with the exact IP address and delivers your packets there, if that server exists, and it will be the shortest path. This is important because it is what I'm going to be using in my demo. And what is an anycast IP? That's another thing that you need to understand. There are different kinds of IP traffic on the network. There are unicast packets, and these are one-to-one: you have one server that sends a message over the network, and that message is supposed to reach exactly one destination. They are both somewhere on the internet, and BGP will make sure that this packet reaches the destination server in the most efficient way. There is also multicast. Multicast is different in that, again, there is one source of the IP packets, but there are multiple destinations. Think of it as a chat room. You type a message in the chat room; you don't send separate copies of the packets to, let's say, all of the hundreds of members of the chat room. You send it once, the network replicates it, and the packets reach all of the members. That's multicast. And anycast, which is the most interesting one, this is where the magic happens: anycast is one-to-nearest, which in simple words means there are many servers around the world with the same IP address. I will repeat that because it's extremely important: there are multiple servers online with the same public IP, and your packets are guaranteed to find the nearest one. This opens up endless possibilities. The benefits of using BGP for failover over DNS: basically, DNS has a TTL. Apart from the fact that your DNS server itself can go down, there is the TTL, and we've tried this in production with DNS records that have an extremely low TTL, but not all ISPs honor that. They don't respect it. We have still had to wait five or more minutes to recover from a disaster that doesn't depend on us. We set a proper TTL and tried to recover, and it just doesn't happen, because somewhere around the world there is somebody who didn't honor our TTL, the customers happen to be just on the other side, and they say: oh, it's your fault. BGP convergence takes seconds. You announce an IP range, the whole world knows about it. You stop announcing it, the whole world knows about it. As simple as that. The downsides of BGP: well, you have to have your own IP range, but the good thing is you buy it only once. Your cloud provider has to support bring-your-own-IP, but most of them do; the vast majority of cloud providers support that. And of course there is a learning curve: if you don't have the expertise in your team, you have to invest a little bit in getting your feet wet with it. And now I'm going to demo three Kubernetes clusters in these three locations on the map. I have one cluster in New York, one cluster in Amsterdam, and one cluster in Sydney, and I'll send a request to them; it is supposed to reach the nearest one.
Then I'm going to kill the nearest one, and our requests should go to the next closest one. I've created single-node clusters because I'm lazy. I've used K3s on some bare metal instances and created the Kubernetes clusters there, and let me show that. I've recorded my demos; actually, I had to record this one 11 times, because I wanted to show something that happens only in certain circumstances. So without further ado, I'm going to play the demo and talk over it. Okay. I hope that you can see it. Can everyone see it? Okay, what we are looking at is my YAML file of a simple deployment. It's a simple Go application that has a small GraphQL server, and I've deployed it to all three Kubernetes clusters. It is exactly the same everywhere, with one little difference: it has this environment variable, LOCATION, which I've created just to know where my responses are coming from. In my kustomization YAML I override just the value of that variable; everything else is the same in all my clusters. And this is the simple demo I will now show you. I'm intentionally not using a domain name; I'm using IPs to show you that I don't use DNS. Okay. I'm now accessing my servers using the IPs that were provided to me by the cloud provider. And I see that this response comes from New York. Then I access the next server and I see that my response comes from Sydney, and then the next one, and my response comes from Amsterdam. Now I'm going to SSH into all of these clusters and I will prove to you that they all have the same public IP. They actually have two public IP addresses: the one that was given to me, and another one that I announce and added myself. So I grep the output of ip address and look for that particular IP address, and the red text on the screen shows you that all of them have the same public IP address. That's an anycast IP address. So if I were to send a request to that IP address, theoretically speaking, it should reach the nearest server. And since we are in Valencia, voilà, I'm getting a response from Amsterdam, which is exactly what I expect to happen. This is so powerful, and it does not involve any moving pieces. This is how the internet works. So if I execute the same request again from curl, I get the same response; I still get Amsterdam. Now I'm going to start executing these requests every second and I'll take one of the clusters down. So: while true, every second, send a request to my cluster, my little application, and tell me where the responses are coming from. And I start getting responses from Amsterdam. I'm using the BIRD daemon to announce my IP range, and I stop announcing my IP range, which simulates a region going down, and then you see a little error. I want to comment on this error, because I recorded this demo 11 times to show it, but you see that in less than a second I started getting responses from New York. I did not touch anything. I didn't have to do anything. I don't even get paged if I don't want to, right? Then I took down New York and I started getting responses from Sydney, without touching anything. My system has 100% uptime, and this is how the internet works. I'm using the backbone of the internet; BGP is the backbone of the internet. I didn't have to do any disaster recovery. BGP does it for me.
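For reference, the demo application needs almost nothing. Here is a minimal sketch of a similar service, assuming a plain HTTP endpoint instead of the actual GraphQL schema; the only detail taken from the demo is the LOCATION environment variable that gets overridden per cluster:

```go
// Minimal sketch of a demo service like the one described above: it reads a
// LOCATION environment variable (set differently per cluster via kustomize)
// and echoes it back, so you can tell which cluster answered.
// This is a plain HTTP endpoint for illustration; the real demo uses GraphQL.
package main

import (
	"fmt"
	"log"
	"net/http"
	"os"
)

func main() {
	location := os.Getenv("LOCATION") // e.g. "new-york", "amsterdam", "sydney"
	if location == "" {
		location = "unknown"
	}

	http.HandleFunc("/whereami", func(w http.ResponseWriter, r *http.Request) {
		// Every cluster runs identical code; only this value differs.
		fmt.Fprintf(w, `{"location":%q}`, location)
	})

	log.Fatal(http.ListenAndServe(":8080", nil))
}
```

The same image would run in every cluster; only the LOCATION value differs in each kustomization.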
And the reason I got that little error during the demo, and the reason I recorded it 11 times, is that I wanted to show an error. It can happen with long-lived connections. The reason I got that error is that curl reached the server and started getting the response. It received some packets of the response, but then that region died and it couldn't get the rest, so I get the error. So if you have long-running connections with such a setup, you just have to configure your client, tell your developers to just retry and reconnect. If you are using something like gRPC, some other long-lived connections, database connections, or some kind of streaming, you just have to reconnect and then continue as normal.
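As a rough illustration of that advice, here is a sketch of a reconnect loop; connectAndStream is a placeholder for whatever long-lived call your client actually makes, and the address is just an example:

```go
// Sketch of the "just reconnect" advice for long-lived connections behind an
// anycast IP: if the region serving you disappears mid-stream, the connection
// errors out and the next attempt simply lands on the nearest surviving region.
// connectAndStream stands in for a gRPC stream, a DB connection, a subscription, ...
package main

import (
	"context"
	"log"
	"time"
)

func connectAndStream(ctx context.Context, addr string) error {
	// Hypothetical: dial addr, consume the stream, return an error when it breaks.
	return nil
}

func runWithReconnect(ctx context.Context, addr string) {
	backoff := 100 * time.Millisecond
	for ctx.Err() == nil {
		err := connectAndStream(ctx, addr)
		if err == nil {
			return // clean shutdown
		}
		log.Printf("stream broken (%v), reconnecting in %s", err, backoff)
		time.Sleep(backoff)
		if backoff < 5*time.Second {
			backoff *= 2 // simple exponential backoff, capped
		}
	}
}

func main() {
	runWithReconnect(context.Background(), "203.0.113.10:443") // example anycast address
}
```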
So that's why you can't guarantee 100% uptime, but it's pretty close. It took less than a second; within the same continent it's going to be milliseconds, and my system recovered. Like, how cool is that? Just remember that long-running connections have to be reconnected. That is important; you can't have the peace of mind of "oh, this thing will save me in all situations". This has serious implications, you have to be prepared, you have to reconnect, and your developers just need to know about this. They don't need to be low-level network experts, they just need to understand how this thing works. Okay. And now, the elephant in the room. The question that I always, always get after showing this to someone: what about my data? You just killed one region, and you promise low latencies, but your latency can only be as low as your nearest data. If you're adding and removing clusters on the map, what do we do with the data, especially data consistency? Here I'm going to present my highly opinionated view of this problem. I believe that in the vast majority of cases, especially in microservices deployments, I'm a developer after all, the answer to the data consistency problem, the latency problem, the reliability problem, and the disaster recovery problem is eventual consistency. And there is this CAP theorem diagram. How many people in the room are familiar with the CAP theorem? Okay, quite some people. For the remaining half, who didn't raise their hands: every database or data store solution that exists out there can have at most two of these three features. Availability: being always online. Partition tolerance: your nodes can go up and down at any time, like in our team, where we run an edge compute platform and clusters appear and disappear on the map at any time. And the third one is consistency: all your users see exactly the same data at all times. You can't have all three of these. And because my talk is about availability and reliability, I have only one option, which is to choose eventual consistency. That is not always the case. There are some applications that cannot ever sacrifice consistency, like low-latency financial trading, for example, and there are other examples. But the vast majority of applications and services that I've been working with can tolerate eventual consistency. And as a matter of fact, they do, and you would be surprised how many services you use online rely on eventual consistency. You can use a hosted DB solution that promises high availability, but this is yet another moving part in your big picture, right? Do they provide a 100% guarantee? Chances are they will, until you sign a contract. Do they have 100% consistency? Some of them do. How much do they cost? And are they yet another single point of failure in your system? So again, I'll drop a grenade under the table about a lot of microservices deployments. I would say that half of the microservices I've ever built in my life, over the past 16 or 17 years, do not need a database. I know that this is an extremely bold statement and my highly opinionated view, but I'll explain further. I fancy using event sourcing. What is event sourcing? It is ideal for microservices, because your microservice can use a CQRS pattern: you're producing events and somebody is consuming them, but the producers do not need to know who is consuming them, and the consumers do not know where the event came from. And in a microservices world, technically speaking, in order to keep up with good practices, you should have at most one service using each database, so every database has exactly one owner. If you need a report across multiple databases, you're in trouble, because you have to use data from more than one database. And how do you guarantee consistency? Well, with eventual consistency. If you're consuming events, you consume the events since the beginning of time, and then you end up with the current state of the world. It requires a durable event store, something like NATS JetStream, my favorite, or Kafka, whatever you fancy; there are other options as well. One benefit is that your data is immutable. You never have a delete in your system; you keep appending data forever. Your bugs, data, and requests are repeatable: you can replay the events since the beginning of time at any point and reproduce the problem that you had. And of course, it results in eventual consistency, which fits exactly what I'm talking about. How does it work? Well, imagine Git. You start working at a company, this is your first day, and you want to write some code, but you have to download the project onto your computer. So you check out the project with Git. You consume the commits since the beginning of time, which are just events. Then Git builds the materialized view: the current state of the world, the current files on your file system. You start editing them, and when you're done, you start pushing commits. Your colleagues push commits. Everybody's pushing commits. You don't know where the commits came from; you don't care who put them there. But if you consume the commits since the beginning of time, you will always get to the current state. They're repeatable, they're always there, provided that nobody in your team is doing force pushes. And this is how event sourcing works, in simple terms. You have an immutable event log, like the one in that picture: I keep appending events at one end, and you can always replay them. You do not have to react to all the events, only to the significant ones. Imagine that events are things that happened. You're sitting in your office and somebody is doing something: somebody turned on their computer, somebody went for coffee. You can ignore all these events; they happened, but you don't care about them. Suddenly, your boss enters the office and everybody pretends they're working. That's a significant event. So your app only reacts to the significant events, the events that it cares about, and it can ignore all the rest, which means that you can usually consume the events since the beginning of time in seconds. Even if you have a lot of events, it's still possible to consume them pretty fast and then start responding to your requests.
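A minimal sketch of that idea, with made-up Event and Article types and a plain slice standing in for the durable event store, could look like this:

```go
// Sketch of the materialized-view idea: replay an append-only event log from
// the beginning and fold it into an in-memory map, ignoring events you don't
// care about. The Event and Article types here are made up for illustration;
// in the demo the events would come from a durable stream such as NATS JetStream.
package main

import "fmt"

type Event struct {
	Type string // "article.created", "article.updated", "article.deleted", ...
	ID   string
	Data string
}

type Article struct {
	ID   string
	Body string
}

// apply folds one event into the current state; unknown event types are ignored.
func apply(state map[string]Article, e Event) {
	switch e.Type {
	case "article.created", "article.updated":
		state[e.ID] = Article{ID: e.ID, Body: e.Data}
	case "article.deleted":
		delete(state, e.ID)
	default:
		// an event happened, but this service does not care about it
	}
}

func main() {
	events := []Event{
		{Type: "article.created", ID: "1", Data: "hello"},
		{Type: "article.updated", ID: "1", Data: "hello, world"},
		{Type: "coffee.break", ID: "x"}, // insignificant for this service
		{Type: "article.created", ID: "2", Data: "second article"},
		{Type: "article.deleted", ID: "2"},
	}

	state := map[string]Article{} // the "database" is just a map in memory
	for _, e := range events {    // replay since the beginning of time
		apply(state, e)
	}
	fmt.Println(state) // map[1:{1 hello, world}]
}
```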
When does my pod become ready? If I start my pod and start serving requests immediately, we have a problem, because I don't have the data yet. So what I do is check the last event, the one that is currently at the end of the stream. My readiness probe is false. I start consuming events since the beginning of time, and when I reach that last event, I turn my readiness probe to true, start serving requests, and then continue consuming events from that point on as normal. I build my materialized view, my view of the world, in a local variable, a map or whatever it might be. You can even use an in-memory full-text search option, so that you have the response to every single query ready to serve. You don't have to make database connections. And if your pod dies, it consumes the events since the beginning of time again, just like Git commits when you join a company or a project, and then you continue from there.
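A sketch of that readiness trick, assuming the nats.go client with the older JetStream API and an ARTICLES stream on the articles.> subject (both names are just placeholders), might look roughly like this:

```go
// Sketch of the readiness idea described above: the pod reports ready only
// after it has replayed events up to the sequence that was the head of the
// stream at startup. Stream and subject names are assumptions for illustration.
package main

import (
	"log"
	"net/http"
	"sync/atomic"

	"github.com/nats-io/nats.go"
)

func main() {
	nc, err := nats.Connect(nats.DefaultURL)
	if err != nil {
		log.Fatal(err)
	}
	js, err := nc.JetStream()
	if err != nil {
		log.Fatal(err)
	}

	// Remember where the stream ends right now; that's our catch-up target.
	info, err := js.StreamInfo("ARTICLES")
	if err != nil {
		log.Fatal(err)
	}
	target := info.State.LastSeq

	var ready atomic.Bool
	if target == 0 {
		ready.Store(true) // empty stream, nothing to catch up on
	}

	// Replay everything from the beginning of time, then keep consuming live.
	_, err = js.Subscribe("articles.>", func(m *nats.Msg) {
		// ... apply the event to the in-memory materialized view here ...
		if meta, err := m.Metadata(); err == nil && meta.Sequence.Stream >= target {
			ready.Store(true)
		}
		m.Ack()
	}, nats.DeliverAll())
	if err != nil {
		log.Fatal(err)
	}

	// The Kubernetes readiness probe points at this endpoint.
	http.HandleFunc("/readyz", func(w http.ResponseWriter, r *http.Request) {
		if !ready.Load() {
			http.Error(w, "catching up", http.StatusServiceUnavailable)
			return
		}
		w.WriteHeader(http.StatusOK)
	})
	log.Fatal(http.ListenAndServe(":8080", nil))
}
```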
And here I have a second demo. This time, I will show that same GraphQL application, which is deployed to those three clusters. I will create some events, some CRUD operations on the data, then kill one of my clusters again and see if there is any downtime for my users. Okay. What we're looking at is me trying to read my data in my New York cluster. There isn't anything there, because it's a clean slate: empty cluster, empty application, no events, and no articles. This is an articles application. If I try the same thing through my anycast IP address, I'm again hitting Amsterdam, because it is the closest location to Valencia. I paste some articles from my clipboard that I want to insert, and that will generate some events for me. I will also update some of them and delete some of them, just to create some more events. I want to create events that get propagated to NATS JetStream and make sure that these events get propagated to all the other clusters in the NATS mesh. I change the title of my article to something else, I change the body of my article, and generate some more events. These things are eventually consistent, because they don't change everywhere at the exact same time, but within milliseconds or seconds all my data will be the same everywhere. I can even delete that article if I want to. I can check what I have: currently I have about five articles in my database, and my database is a variable, a simple map in my memory. When I hit that request, I read the data from memory; I don't have to make a database connection. Let's delete an article again. I send the request, that request goes to NATS, then the event comes back, I consume the event and update my map variable with my articles, and I'm left with only four articles. Now let's kill this server, okay? Let's kill it. I'm using BIRD, the BGP daemon, which is open source; everything in this session is open source. And I kill my K3s server, I kill Kubernetes, just to simulate a region going down. At this point we don't have an IP announcement and we don't have the Kubernetes cluster. And now I'm accessing the anycast IP again, but my users are seeing New York. There was no downtime for my users. And because browsers tend to retry until they connect, you will actually almost never see downtime in your browser. And all the data that I created in my Amsterdam cluster is now in my New York cluster and in my Sydney cluster, and I didn't have to do anything about it. That's how NATS JetStream works. There were other talks at this conference about it; I linked one of them from a previous conference, and it explains how this works. It's a bit outside of the scope of this talk, but it is impressive how I get extremely high availability and super low latency, because the data is always in my memory. I can give another example. I could build another microservice for this publishing platform, let's say an RSS service. Let's say it only exposes the last 20 articles that my team has published. Every time I start it, I don't even need to consume the events since the beginning of time, because if the company is publishing, let's say, 500 articles a day, I can consume only the events since yesterday, for example. And I can build my materialized view, which can be a queue in a variable, and keep adding one article at the top of the queue when a new article event arrives and removing one from the bottom when I reach the 20-article limit. I can even render that as XML, so I will have the response to every possible query ready, and I'll have sub-millisecond responses without ever making requests to any database whatsoever. So that service will be unlikely to ever go down in any situation.
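A sketch of such a bounded, in-memory feed, with made-up types and only the last-20 logic taken from the description above, could look like this:

```go
// Sketch of the RSS-style materialized view: keep only the latest 20 articles
// in memory, updating the slice as "article published" events arrive. The
// pre-rendered feed is then served straight from memory, with no database call.
package main

import (
	"fmt"
	"sync"
)

const maxItems = 20

type Feed struct {
	mu     sync.Mutex
	latest []string // newest first, at most maxItems entries
}

// OnPublished is called for every "article published" event we consume.
func (f *Feed) OnPublished(title string) {
	f.mu.Lock()
	defer f.mu.Unlock()
	f.latest = append([]string{title}, f.latest...) // push to the front
	if len(f.latest) > maxItems {
		f.latest = f.latest[:maxItems] // drop the oldest
	}
}

// Render returns a copy of the feed; a real service might pre-render XML instead.
func (f *Feed) Render() []string {
	f.mu.Lock()
	defer f.mu.Unlock()
	return append([]string(nil), f.latest...)
}

func main() {
	feed := &Feed{}
	for i := 1; i <= 25; i++ {
		feed.OnPublished(fmt.Sprintf("article %d", i))
	}
	fmt.Println(len(feed.Render()), feed.Render()[0]) // 20 article 25
}
```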
In summary, outages will happen sooner or later. It is expensive and non-trivial to keep an up-to-date disaster recovery plan that will guarantee a short RTO and a zero-loss RPO, and anycast plus event sourcing are a really good fit to help you solve these problems. As an added benefit, because anycast means many servers around the world share the same IP address, if you get DDoSed, guess what happens? You load balance the requests across all your servers. Or, if it's a DoS attack coming from one location, you spin up a cluster next to it, all of the attack traffic goes to that one cluster, which doesn't even need to run any workload, and the remaining servers are not even touched. So that gives you even more control for protecting against yet another problem that might happen. All the resources that I've used in this demo are open source, no proprietary solutions. I'm going to share the code that I've written on my GitHub profile. I've used BGP, the Border Gateway Protocol, which is the backbone of the internet; NATS, my favorite CNCF project, and this video over there is highly recommended; BIRD, which is what I've used to announce the IP range, and your favorite cloud provider probably has documentation on how to announce your own IP; it's outside the scope of this talk, I just wanted to mention that yes, it's possible to have the same IP address on multiple servers. I used K3s, which helped me build this demo in hours instead of a lot more time, and the GraphQL library; I'm maintaining that library, so big disclaimer there, I chose it because I had an existing app that I tweaked a little bit. And with all that being said, I believe there is some time for questions if anyone wants to ask anything. Okay, we have a question over here. I just have one question really, which is about the persistence behind the scenes. You mentioned CQRS, and you're obviously interrogating something which is durable somewhere. In a cloud scenario, when you've potentially got an outage and the storage has gone as well, how do you manage that use case? So the question is what happens if my durable queue goes down, which in my case, in this demo, is NATS JetStream. NATS JetStream was designed to have no single point of failure, which means that in each of my Kubernetes clusters I can have a NATS cluster with three or more instances, and they can use persistent volume claims, and if that goes down, I will have some kind of health check. It would detect that my system has a problem with one of its components, and it would turn off the BGP announcement. That server stops accepting requests in seconds. This is what we do in production in my team. We have a health checker which is checking way too many things, but that's necessary, and if one of those things in your big picture goes down, we turn off the BGP announcement in that region or in that cloud provider, because we have different cloud providers in different regions and we are a multi-tenant system, so things are going down all the time. If we had to fix those things manually, we would probably have to double our team and only have people doing disaster recovery. But because we have a health checker, we turn off the BGP announcement and then that cluster is off the network. It doesn't accept requests anymore. Our users' requests go to the cluster nearest to them, which is ideally in the same region, but if not, in a different region or a different cloud provider, and then we don't have to care about this problem right away. It's a problem for another day. These problems happen; that's why we have BGP to take care of them for us. Does this answer the question? You've got to presume that only works as long as, in the other regions that you're failing over to, all of that data has been propagated. Okay, so when there is an event, the way NATS works, to my understanding, is that you can configure, and there are a lot of things to configure in NATS, how many acknowledgements across the clusters you want before that event is considered acknowledged. So you might configure a full cluster mesh, which means every node in every cluster around the world needs to acknowledge the event before you consider it published, or maybe you can say: I want at least, let's say, two clusters in two cloud providers in two different regions to acknowledge it, and then I consider the event published. There was a really, really interesting NATS talk, I believe at a previous conference. There are many, many talks explaining exactly how NATS works. It's outside the scope of this talk, but NATS is really amazing for these types of problems. We have a question over there. Hi, so my question is, would something like this work on a load balancer level, when we have, for example, Cloudflare as a provider in front of the load balancers? Can you repeat the last part of the question? Yeah, when in front of the load balancers you have Cloudflare, which is handling your IP addresses. Yes, it does work, and many people use it this way. You can have somebody else in front of your BGP announcements or behind your BGP announcements. There are many, many talks explaining different scenarios there. In my presentation I'm just saying: hey, it's possible to have two servers with the same IP address. There are really a lot of solutions that you can work with. In our case, we have our BGP, and after that we have a load balancer that goes to all of our ingress nodes. But you can play with it, and the scenarios are endless; the sky is the limit. You can use it in that particular scenario. It's possible. Any other questions?
Well, thank you for coming to my session.