Thanks very much. Slight corrections: I spent about five years at VMware and ten years at Sun Microsystems doing various things, including Postgres. Actually, I think the only public credit I have is that I did the only published Postgres benchmark, with SPECjAppServer, back in 2007 at Sun. I don't think I've seen any real industry benchmark publication of Postgres since, but that's always something that can change. So let's go quickly through the current things I'm doing. Currently, I am an architect at AppOrbit. What does AppOrbit do? AppOrbit primarily does three things. We help you set up dev and test lab environments. We also help you set up DB-as-a-service for the applications in your dev and test environments. One of the new things we are now doing is helping you with test data management: if you have production data and you have dev and test environments, what's the workflow you need to bring that data in, do cleaning, subsetting, whatever you want to do, and then make it available to your dev and test environments. But this topic is more about high availability of Postgres, especially in a containerized world, so let's go with that. How many of you are running containers right now? Just a quick poll; that's good. How many of you are running Postgres in a container? That's good. I would like to see that number go up. So let's take a step back. What we really want to do is talk about high availability. If you are positioning or doing any project in an enterprise, the first thing you want to figure out is your requirements: what do you call highly available? Before that, let's take a quick look at how a typical enterprise application looks.
You would have a web tier and an app tier, and typically there would be multiple instances, or replicas as we call them in the container world, for each of those tiers, but at the end of the day there is one instance of the DB. That is the classic, somewhat legacy picture of how an enterprise application looks. If you jump to the next level, modern applications are all microservices based, right? In that scenario, every service is an isolated function with its own data store, and your web application is basically a collection of these microservices, each doing one task. The front end is more of a gateway to backend services, each of which does one thing well: maybe just your UI with its own data store, maybe authentication with its own data store, and so on. So if you look across an enterprise, those are the different levels of application you would see out there. Now, in this presentation we are only going to focus on the database side of it. We are not going to talk about the rest; we can probably do that in a separate session. When you go to the database, before you say, hey, I'm going to do HA and make sure we have an HA service deployed, you need to figure out what you mean by HA, right? The last thing you want to do is say "HA" and jump straight to a solution. Don't do that; never do that. The first thing you want to do is look at your enterprise collectively and create a set of requirements.
What I'm trying to do is guide you through the process of how you actually approach HA in your enterprise, and that will help you identify a lot of things which may not have been picked up previously. For the purposes of this talk, I'm going to set three requirements. The number one thing people ask for is simplicity. I want something that is simple to use, right? HA typically brings complexity, and that is a big problem in the enterprise. So you do want to focus on that aspect: whatever you develop needs to be easy to use for the end user, and for a database the consumers are typically applications or end users. There are really only two classes: applications using the DB service, or some analytics platform or DBA group querying the database directly. So you want to make sure you serve both well. The two main metrics typically used for HA are high availability from a service point of view, and high durability. Durability is about what happens on disk: how quickly can you recover, what are your chances of losing data, and so on. So those are the aspects we'll drill into more deeply. We talked about high availability and high durability. How many of you have heard of the five nines metric? What you need to know is that there are actually two different numbers behind the metrics. One is for the availability portion, which is typically what people talk about. The other number is for durability, which is the on-disk portion. Why is the on-disk part important? Once we go deeper into Postgres replication, we'll talk more about that.
The way people look at them is that they address two different things. Availability is when somebody tries to query your database and the service is not there; that's an availability problem. Durability is when something has gone wrong: can you get back all your transactions? That's the durability problem. So that's how you need to look at those things. Just to give you context, when somebody says five nines, what it really means is that in a whole week, your downtime cannot be more than about six seconds. Try doing a master-replica failover in six seconds. So these numbers are important, because they help you gauge how deep you need to go in your high availability setup. What you really need to start with is: what SLA expectations are you setting with the teams that are going to consume your database? Without that, I would advise, don't jump to a solution. You really have to ask what your SLA requirements are, because at the end of the day, if your architecture doesn't meet the requirements, the whole project will be a big flop, and that is something you need to be careful about. Any quick questions so far? Okay, cool. So there is Murphy's law, right? How many of you have actually invoked it? Murphy says anything that can go wrong will go wrong. To be honest, there was no person called Murphy. There was a mathematician called Augustus De Morgan who essentially said that if anything can statistically go wrong, then with enough iterations you will hit that scenario. And it is actually true in this distributed world right now.
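To make that "six seconds a week" number concrete, here is a small sketch (mine, not from the talk) that turns an availability SLA into a downtime budget:

```python
# Convert an availability SLA (as a fraction) into a downtime budget.
SECONDS_PER_WEEK = 7 * 24 * 3600
SECONDS_PER_YEAR = 365 * 24 * 3600

def downtime_budget(availability: float, period_seconds: int) -> float:
    """Maximum allowed downtime, in seconds, over the given period."""
    return (1.0 - availability) * period_seconds

five_nines = 0.99999
print(downtime_budget(five_nines, SECONDS_PER_WEEK))        # about 6 seconds per week
print(downtime_budget(five_nines, SECONDS_PER_YEAR) / 60)   # about 5.3 minutes per year
```

So a single master-replica failover, including detection time, can easily blow an entire week's budget at five nines.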
If you look at the way people are doing things, with the number of transactions exploding, if there is a probability of any event occurring, it will occur. And that is actually the truth. You can see it from history: disks were assumed to fail every three years, but as soon as you have, say, thousands or tens of thousands of disks, you will hit a failure every day, or maybe every six hours. The probability of anything happening means it will happen, right? So there are a lot of things to consider here. Another important point: organizations sometimes differentiate between planned downtime and unplanned downtime. This matters because it gives you some leeway in your SLA agreements. You could have one SLA for unplanned downtime and a different SLA for planned downtime. These are tricks for how you use them internally when you're positioning a highly available database service. So now let's start drilling down further. We know what we are talking about and what expectations we are setting, and now we get into the details of how you do this. As you start planning your deployment, there are the high-level things we talked about. First, you need to design to handle failures, because, per Murphy's law in a distributed world, any failure that can happen will happen. Second, we talked about the microservices world, which means more and more database instances will be deployed, which means whatever you do, you want to do it at scale, multiple times. There is no longer one database serving everybody; the modern world is moving to lots of database servers serving different applications.
The other thing is that it should still be convenient for your consumers. You can't tell them, hey, today this is your IP address, tomorrow that's your IP address. You need a uniform way of accessing the database server. Then you get into the more advanced features: you want to avoid human intervention. Nobody likes a page in the middle of the night saying, hey, my master has gone down, somebody go and promote the replica, right? And the final thing is more on the application side: what are the impacts of a failover, or anything on the database side, on your applications? How do they react to it? So you have to consider this whole global view of how you're going to use your Postgres server. Okay, so from a design point of view: it's interesting that if you look at the history of highly available services over maybe the last 15 or 20 years, one style of architecture which is still primarily used today is shared storage-based HA. Think of it as a highly available EMC array or a NetApp filer or something like that. This is still a common design used in many enterprises. There are advantages and disadvantages. Let's talk about the advantages first. The main advantage of these shared storage-based designs is that they let you use hardware acceleration, meaning storage-specific features that can help you quite a bit. They also tie into cluster services in the OS, which helps you fail over automatically at the hardware level. Because you have shared storage on a multi-host storage system, if a node goes down, another host takes over the storage and restarts all the services.
This is still heavily used in many enterprises. It also gives you more advantages: the storage vendors give you DR replication at the block level and other things. And there are newer variations: many startups are building distributed stores that give you the same shared-storage style of access patterns. So this is still a prevalent design. There are disadvantages, though. First, as soon as you talk about these storage systems, the main thing is that they are expensive, which prices a lot of enterprises out. What do you really want? As the motto goes, software is eating the world; everything should be software that I can deploy on commodity hardware. From that point of view, shared storage doesn't really bode well, although, like I mentioned, there are interesting startups coming up with distributed stores that would be good to look at. The other main problem, I would say, is that there is no offloading of read scaling in these shared storage systems. The standby side in most shared storage setups sits idle: it's designed to be a hot standby, but it cannot serve read-only queries, or cannot do so easily. Some vendors have workarounds: they give you storage snapshots to spin off another system for read queries, or reporting servers as they call them. That's done in the classic setups, but there are still limitations. If you look at it from the Postgres point of view, read scaling is one of the most heavily used features when people talk about replication.
And of course, the biggest thing is, like I said, you cannot take more than six seconds. These systems cannot converge to that kind of bound, and that is one of the primary reasons you would look at the other architecture. So what is the other architecture? In-database replication, and we are going to talk in particular about Postgres replication. Postgres replication has gained features over the years that deliver tremendous value. The thing I really like is the very fast promotion you can do on a standby. That is the key thing that cuts down your unavailability: apart from the detection portion, switching over to a replica is very fast, and that is an important requirement. The other thing you will see is that Postgres, for most people, is a Swiss Army knife: you can do a lot of things in different ways depending on your needs. That also makes it a bit bulky, because you have to figure out which tool does the right job, but all the features are there. So what are the main options for Postgres replication? You have asynchronous replication and synchronous replication. But before we talk about that, note that all the disadvantages of the shared storage system go away. First, it's less expensive to implement the solution. Second, you can do load balancing of your read queries, which means you now get value out of those additional replicas. Third, and the most important one for me if you're trying to run a highly available service, is cutting down the unavailability window during promotion.
You can practically do a pg_ctl promote, take out the failed node, and your downtime is practically seconds. That is a very important consideration. Now let's go a little deeper into Postgres replication. We said there are two ways: synchronous replication and asynchronous replication. Why would you do synchronous replication? Go back to my original slide, where I said there are two metrics you care about: the number for high availability and the number for high durability. If you really care about high durability, meaning you want to drive your transaction-loss probability as close to zero as possible, you want synchronous replication. If you do not care about the durability metric that much, asynchronous replication is something you could use. What are the disadvantages if you do want that high durability? Postgres synchronous replication is latency sensitive, which means you want to do it within the same data center. Every bit of latency on the synchronous path directly impacts the performance of your applications. So if you really care about high durability, you want your synchronous members as close together as possible: probably not in the same rack, but at least in the same data center, close to each other, so you get a single hop to the switch and back and the minimum latency. People will say you can do synchronous replication across data centers. What I have seen is that every person who starts with that assumption, I'm going to do synchronous, I do not care, I have the fastest bandwidth possible, I'm going to do it across data centers.
They start with it, and bang, two days later they have performance problems, and the first thing they do is turn off synchronous replication. It's the failsafe switch, but now you've completely abandoned your high durability requirement. That's why you need to be sure about your SLA agreements: if high durability is something you really care about, you want to plan your deployment accordingly. Another point: even though you are using shared-nothing storage, you can still leverage shared storage for some cases, not as the classic multi-host failover scenario, but for things like the snapshot features on shared or distributed systems you may already have in your enterprise. People have invested in those systems and you still want to leverage them, so you can use them alongside replication, just not for the availability function itself, but maybe for spinning up new replicas and other things. So you can use the combination, just not for the replication per se. Okay. Then why do people use asynchronous? The simplest answer is performance. There is also another advantage: you can do cascading replication. And I don't know about most enterprises, but some want to figure out how to recover from fat-finger failures, and there are not enough tools right now, but it is possible to run an asynchronous replica with a delay, so you have a delayed copy. If something goes wrong, you can cut the replica off, preserve what you have, and quickly recover from it. Those are some of the scenarios people want out there.
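The synchronous-versus-asynchronous choice described above comes down to a couple of server settings. A minimal sketch of the primary's postgresql.conf for streaming replication; the standby name `standby1` is a placeholder, and exact parameter names vary a little by Postgres version:

```ini
# postgresql.conf on the primary -- streaming replication
wal_level = replica              # 'hot_standby' on older releases
max_wal_senders = 5

# Replication is asynchronous by default. To make a standby synchronous,
# list its application_name here:
synchronous_standby_names = 'standby1'

# How hard to wait per transaction:
synchronous_commit = on          # wait for the standby flush (high durability)
# synchronous_commit = local     # effectively async again -- the "failsafe switch"
```

Note that `synchronous_commit = local` is exactly the switch people flip when cross-datacenter latency bites, which silently abandons the durability SLA.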
Unfortunately, classic streaming replication did not have that kind of delay feature built in. If you're looking for that, you may still want to use the classic log-shipping mechanism, where you dump WAL into archive logs and have a replayer that applies them at a delayed time. That gives you the real-time DB and a delayed DB, so you have an earlier copy you can fail over to. Those are some of the considerations, and again, you may want them because of your SLA agreements and things like that. Like I said, depending on your metric, the cost of implementation will be higher. Things to worry about. Now, Postgres replication: I don't know what your experience with it is, but the biggest thing I find is that it's fully featured but not easy to use. It violates almost every deployment requirement I laid out. It doesn't have automated failover, which means you have to do some work for that. The setup is complex. With every version of Postgres it's getting easier, but it's not as easy as yum install postgres, yum install postgres-replica, and I'm good to go. That's the level of easiness you need to figure out how to get. The other common problem is that most customers we talk to have multiple terabytes of data, so if you lose a replica and need to spin up a new one, it's going to take a long time to restore. Reprovisioning a failed master as a replica is therefore an important feature. pg_rewind is a very good tool for this; it was actually delivered from VMware when I was there, done by Heikki, and it's very useful. It's something you need to figure out how to use. And there are a lot of other side things that don't get handled for you.
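For completeness: Postgres 9.4 and later did add a built-in delay knob for a streaming standby, which covers the fat-finger scenario described above without a hand-rolled replayer. A sketch, with placeholder host, user, and archive path; on 9.x these settings live in recovery.conf (they moved into postgresql.conf in Postgres 12):

```ini
# recovery.conf on the delayed standby (Postgres 9.4+)
standby_mode = 'on'
primary_conninfo = 'host=primary.example.com user=replicator'  # placeholder
restore_command = 'cp /archive/%f %p'                          # replay archived WAL
recovery_min_apply_delay = '1h'   # stay one hour behind the primary
```

If someone drops a table on the primary, you have up to an hour to pause the delayed standby, stop replay before the mistake, and promote it.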
You need to figure out things like: how do you update your DNS? Why do you want DNS? For easy access, and so on. So there are a lot of small pieces missing in classic Postgres replication by itself, and those are things you need to take care of. I'm going to speed up; I have a lot of slides to go through. The title of the topic was deploying Postgres in containers, so let's cover that. Why do you want containers? The metric I generally cite is that containers really simplify your deployment. How do they do that? Before that, a quick poll: how many of you are using Docker containers? And how many of you are using containers other than Docker? Two, three, four, okay. All right, cool, sounds good. So, a very quick introduction, and I'm going to move fast through some of it. Containers are a kind of lightweight virtualization. The concept is not new; Docker really popularized it, but it has been around for ages. When I was at Sun we had Solaris Zones, which was itself influenced by the earlier BSD jails. So it has been around for a long time, but as they say, once you get it on Linux it becomes really popular; that's the nature of it. And there are some things Docker did well: they got the user experience right, like how you push and pull images, to give you a seamless experience. This goes back to my number one requirement: if you make it easy to use, people will come to you. Anyway, in the old days we had bare metal with the operating system. Then came hypervisor virtualization, which VMware made popular. Containers are trying to do the same thing, except more lightweight.
So now you have a common shared kernel, which allows you to pack in more applications. Why are people doing this? There are two main reasons. One is the bin-packing problem: you have limited resources and you need to pack a lot of things together to get the efficiency of the hardware up. The second is that you're doing management at a more granular, containerized level, which is easier. It's very hard to manage individual processes, and it's too coarse when you're managing whole VMs running multiple processes. A container gives you another abstraction to package things together and operate at that group level. Can virtualization and containers work together? Yes; nothing prevents you from deploying containers inside VMs. Different people have different opinions about that; I personally don't see any problem doing it. It's about what you're comfortable with. A lot of people are comfortable deploying VM images first and then deploying containers within them. Some people say containers can run fine directly on bare metal. Either way is possible; it's mostly a philosophical choice plus your requirements. If you really need direct access to particular hardware, you could go bare metal and take the virtualization layer out of the middle. So, going on: one of the more important points I already covered is that bin packing is one of the main attractions. The other advantage is very quick start and stop times. When you boot up a VM, you first boot the whole OS instance and then your application. With containers there is no OS boot; you're booting just your container instance.
Why is that important? When you're trying to hit five nines, every piece of time you can cut out helps. That's why I prefer containers in this respect: everything I can peel off from the components that make up my downtime calculation, I want to peel off. There are disadvantages of containers too. All your applications have to use the same kernel; it's not possible to run Windows in a Linux container right now. There are a lot of changes happening. There are not great patterns for container security yet; it's slowly improving, but there are still some questionable things out there. And the number one problem: a container is a technology, not a solution. You can't just push a technology as your solution. When you talk about a solution, you have to look at the different aspects and see how the technology helps solve those needs. The number one container runtime right now is Docker, and it's very easy to run Postgres in a Docker environment. All you do is say docker run postgres with a container name, and you can get it running. Docker has really done that usability well: it pulls the Postgres image from the registry, deploys it, and starts executing it, and you now have a Postgres container running that you can connect to and work with. So that goes back to easy to deploy and easy to manage. When you do Postgres in Docker containers, make sure you have a volume defined for your data. Container images should for the most part be considered immutable, which means any changes you want to make and track, you want to put into volumes.
That volume is now your persistence layer; think about it that way. Any configuration changes, anything like that, you put into a volume, and you treat the pure container image as an immutable runtime image. Because if you make changes inside it and then go to a different system and do the same docker pull of the image, it will not work the same, since your configuration changes aren't there. The best way is to preserve everything in volumes. Next: we talked about the ease of running Postgres, but there are still things to handle when you deploy in the enterprise. You need to do resource mapping for memory, and figure out which ports to manage and who gets access. I'll probably share the slides later today so you can see those details, but I want to move on to the other portions. After using containers and Postgres for a while, here is my set of best practices. Start with memory limits. One important problem I saw with Docker containers on any Linux system is that when you look at the system memory from inside the container, it shows you the entire host memory, even though the container does not have access to all of it. If you have any tools that rely on /proc, say /proc/cpuinfo or /proc/meminfo, it's going to throw off their calculations. Things like that will really hurt you, so you need to be aware of it. These are some of the feature requests we put to Docker, to limit the memory a container sees; right now it shows everything. So you need to figure out how to map resources appropriately so you can size things properly.
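Putting the volume and memory-limit advice together, a hypothetical invocation might look like this. The container name, password, volume name, limit, and image tag are all illustrative:

```shell
# Create a named volume so the data directory lives outside the
# immutable container image.
docker volume create pgdata

# Run Postgres with an explicit memory limit and the data volume mounted.
docker run -d --name pg-main \
  --memory=2g \
  -e POSTGRES_PASSWORD=secret \
  -v pgdata:/var/lib/postgresql/data \
  -p 5432:5432 \
  postgres:9.6
```

Remember the caveat above: even with `--memory=2g`, tools inside the container that read /proc will still see the full host memory, so size shared_buffers and friends against the limit you set, not what /proc reports.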
The other thing: I mentioned volumes for data, but when you're doing high availability, you also need to take care of the WAL volumes; the archive logs need to go somewhere, so you may want a secondary volume for those. And because the Postgres container image should be considered immutable, you want to do a pre-selection of all the extensions you need and bake them into your private image, because that way you can guarantee the extensions are available in all your deployments. Once we get further into the use cases, you will see why that is important: you want to make sure that as soon as an instance starts, it has everything it needs, and the best way to guarantee that is to control the image you deploy in your organization. Any quick questions on that? Cool, all right, sounds good. Now, based on the trends I'm seeing with containers: going back a few years, I would heavily optimize my Postgres. I would do special layouts, WAL logs here, other filesystems in this directory; I'd create maybe six tablespaces to separate my indexes, hot tables, and other things. That is a problem with replication, not just containers, because every time you create a replica, you need to make sure the same setup exists there. So the other trend now coming back is: hey, put everything into one bucket, because then it's easier to replicate everywhere as a cookie-cutter setup. It's not that this is right or wrong, but it is the trend people are picking up, because they want to avoid problems later. If you do care about some of those settings, you need to go deeper and ask how to solve it.
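The "pre-select extensions and bake them into a private image" advice above might be sketched as a Dockerfile like this; the base tag, package name, and volume paths are illustrative, not prescriptive:

```dockerfile
# A private Postgres image with pre-selected extensions baked in, so every
# deployment in the organization starts with identical capabilities.
FROM postgres:9.6
RUN apt-get update && \
    apt-get install -y --no-install-recommends postgresql-contrib-9.6 && \
    rm -rf /var/lib/apt/lists/*
# Declare a second volume so archived WAL stays out of the data volume.
VOLUME ["/var/lib/postgresql/data", "/var/lib/postgresql/wal_archive"]
```

Because the image is rebuilt rather than patched in place, every replica spun from it is a cookie-cutter copy, which is exactly what the replication story needs.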
But the trend I'm seeing is fewer and fewer people using tablespaces, especially in these wide deployments. They actually keep the Postgres data and the xlogs, the WAL logs, together, which in the old days would have been a performance problem, but these days, because you want a consistent view everywhere, people put it all in one place, so you can quickly get the same layout at the other end. An interesting shift; good or bad, we don't know yet, we'll see. I've already mentioned microservices and faster updates. Containers give you very fast patching through images: if there is a new Postgres image with a security fix out there, you just bring down the container, upgrade to the new image version, and boot it up. It's very fast to do upgrades that way, so there is real operational value in cutting down those times. The next technology I want to talk about is Kubernetes. With the Docker container at the base level, I'm managing a single container; Kubernetes comes in and says, I'm going to help you deploy this across multiple nodes and schedule the containers appropriately, wherever you have resources. You don't want to do that managing yourself. One of the things I mentioned is that we want to remove the human interaction required, because that increases your unavailability window. Kubernetes as a container scheduling system cuts out a lot of the routine decision making: where should I deploy this container? Which node has the resources? Which one will be enough? Kubernetes uses a nice labeling system for determining these things.
So if you don't like some of the selections it makes, you can put your own labels on things and make a label mandatory for your deployments. It gives you a lot of nice features there. So that's the deployment side; I'm going to rush through the other topics. I mentioned that you want to make access convenient for your consumers, right? One thing to note is that dense packing of containers comes at a cost: on a given host, a port can be used by only one container, which means if you pack multiple containers onto a host, they will be on different ports. These are some of the challenges you see from the usage point of view. Kubernetes has a nice feature for this called Services. For Postgres, a Service gives you two main things. It gives you a virtual IP for your services, so you can reach your master and your replica clusters consistently. For a Postgres deployment you would typically have two Services. One points at the master, so you are always able to locate your master. The other is a load-balanced replica Service, meaning: I want a replica connection, and it doesn't matter which replica I get; it's a collection of nodes that can serve that. So Kubernetes gives you, within Kubernetes itself, a lightweight load balancer for your replicas, and it also gives you the VIP equivalent. So it helps you there. But working with raw addresses is hard, right? What people are more familiar with, and what I actually like, is DNS names. So how do you tie DNS names into this? There is another component you would use, called Consul. People also use Consul as a distributed key-value store.
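A rough sketch of the two-service pattern just described, with pod IPs and labels that are made up for illustration: the "master" service always resolves to the single primary, while the "replica" service load-balances across the standbys, both selecting endpoints by label the way a Kubernetes Service does.

```python
import random

def service_endpoints(pods, role):
    """Return the pod IPs whose 'role' label matches, mimicking how
    a Kubernetes Service selects its endpoints by label selector."""
    return [p["ip"] for p in pods if p["labels"].get("role") == role]

pods = [
    {"ip": "10.0.0.1", "labels": {"app": "pg", "role": "master"}},
    {"ip": "10.0.0.2", "labels": {"app": "pg", "role": "replica"}},
    {"ip": "10.0.0.3", "labels": {"app": "pg", "role": "replica"}},
]

# The "master" service has exactly one endpoint.
master = service_endpoints(pods, "master")

# The "replica" service picks any standby; clients don't care which.
replica = random.choice(service_endpoints(pods, "replica"))
```

Because selection is purely label-driven, a promotion only needs to flip the labels; clients keep using the same two service names.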
I don't know whether it's really great for that, but what I like it most for is fast-updated DNS lookups for service discovery. I mentioned you would typically have a Postgres master and replica configuration. What you can now do is abstract away where it lives: you create DNS names that are dynamically updated all the time to point at the right service. Using Consul together with the Kubernetes Services, you can guarantee a non-changing name for your service, which is actually important in enterprise deployments. If there are 50 different clients using it and I have to tell them, "I've relocated the database somewhere else, you need to update your systems," that's going to be a nightmare, right? So you want a DNS entry that is quickly updated to the latest location, and that's where Consul comes to the rescue. You can set up subdomains in your network for service discovery, and service discovery finds the service at the right location. It's a very time-saving thing I've found, because it helps you avoid reconfiguring all your clients. I talked earlier about the port problem. One enhancement we need, especially in libpq, is support for SRV records, which are actually a standard, but not many services use them. The advantage is that when you query by name, the answer can now describe a service: along with the IP address, you also get the port of that service, which means you no longer have to figure out what port number to use. Even if the port number changes, you can find it dynamically.
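To illustrate why SRV records help: the record data carries a port as well as a target host, so one lookup gives a client everything it needs. A minimal parser for the standard SRV RDATA layout (`priority weight port target`); the Consul-style record below is a made-up example.

```python
from collections import namedtuple

SRV = namedtuple("SRV", "priority weight port target")

def parse_srv(rdata):
    """Parse the RDATA of a DNS SRV record: 'priority weight port target'."""
    priority, weight, port, target = rdata.split()
    return SRV(int(priority), int(weight), int(port), target.rstrip("."))

# Hypothetical SRV answer for a Postgres master service in Consul.
rec = parse_srv("0 10 5432 pg-master.service.consul.")
```

A client that understood this answer could connect to `rec.target` on `rec.port` without any port being hard-coded anywhere.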
Currently there is no support for that in libpq, but it is something that could be resolved fairly easily; maybe something for someone to pick up, and it would make service discovery seamless to the end user. Now comes the main topic: self-healing. Self-healing, to me, is automated failover. We talked about the various failures that can happen; how do you handle them? This is where you need something dedicated to automate the failover of Postgres itself. How do you do that? There are now many HA projects available. Zalando has a project called Patroni, which was based on Governor from Compose. There is another project called Stolon, which is completely Go-based and tries to do the same thing. Let's go through the key requirements. I have some slides I'm not going to cover in the interest of time, but let's go to the distilled architecture. Any automated failover system you look at has two main components. One is a distributed store: think ZooKeeper, etcd, or Consul used for that purpose. Its real job is to keep the state of the nodes in your cluster; it helps you figure out what state each of those nodes is in. The second piece is the main controller process, which makes the decisions: who is going to be the master? What is the state of the master? If the master fails, what are the steps to promote the next replica to master? Those are the two components. If you did not attend the Patroni presentation, I believe the slides will be available later, so you can take a look at that. Since it was already covered, I'm not going to spend much time on it in this presentation.
But to list the steps: the primary flow is detect the failure, shut down (STONITH) the old master, elect the new master, promote it, and see if you can even repurpose the old master. The reason you want to spend time on this is, again, how much can you decrease the non-available time; the more of this you automate, the more nines you can deliver to your end users. What are the limitations of all these systems? One thing we ran into is that they do not always elect the right replica: with asynchronous replication, the distributed store holds the status of all the replicas, and the controller may effectively pick a random one, so you can still have problems in your setup. So it's not completely human-free yet, but it will get there. Another thing they don't really understand is multi-data-center topologies; they are all designed for everything being in the same data center. There are a lot of edge cases we still run into that need fixing, but it's a great step, I would say. Automated failover is a requirement if you want to cut down human intervention, and only if you cut down human intervention can you actually get the nines and reach your SLA. All right, now let's put everything into perspective and look at what a real deployment looks like. This is a classic single-cluster setup: a three-node cluster managed by, say, Patroni or Stolon, with the service information stored in Consul. When a client comes in, it looks up the service host name and port from Consul and gets a VIP that goes to the Kubernetes proxy.
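The steps above can be sketched as a toy controller. Real systems like Patroni do considerably more (leases, fencing, timelines), and all the names here are illustrative; the LSN comparison is one answer to the random-replica limitation just mentioned, promoting the standby that has replayed the most WAL.

```python
def failover(cluster):
    """Toy failover sequence: detect a dead master, fence it,
    elect the most caught-up replica, and promote it."""
    master = cluster["master"]
    if master["alive"]:
        return master["name"]  # healthy master, nothing to do

    # 1. Fence (STONITH) the old master so it cannot take writes.
    master["fenced"] = True

    # 2. Elect the replica that has replayed the most WAL, rather
    #    than a random one, to minimize data loss.
    candidate = max(cluster["replicas"], key=lambda r: r["replay_lsn"])

    # 3. Promote it and record the new topology.
    cluster["replicas"].remove(candidate)
    cluster["master"] = candidate
    return candidate["name"]

cluster = {
    "master": {"name": "pg-0", "alive": False},
    "replicas": [
        {"name": "pg-1", "alive": True, "replay_lsn": 1000},
        {"name": "pg-2", "alive": True, "replay_lsn": 1500},
    ],
}
new_master = failover(cluster)
```

The missing step in this sketch, repurposing the fenced `pg-0` as a new replica, is exactly the part that is hardest to automate safely.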
The Kubernetes proxy then connects you to the master or to a replica as needed, load-balancing across replicas, and all of this now happens very seamlessly, with no human intervention required for the most part. You can also leverage shared storage: when you want to spin up new replicas, you can create them from a hardware snapshot clone, which lets you provision new nodes very fast. Taking the next step: as we discussed, you will end up with more and more different database servers to manage. This is where the power of Kubernetes comes in, because Kubernetes can turn everything we've talked about into a cookie-cutter setup: you deploy multiple clusters of instances, one for each of your microservices or each of your different applications, and they all work the same way across the board. If you have multiple geographies, the good news is that Consul can locate services across the different Kubernetes clusters, so clients can be directed to the right place. You can do more innovative things here, too. One thing I've seen people do is figure out which country an incoming connection is coming from and route it to the database for that region. Using Consul in that combination, you can give the same URL to everybody around the world, but each client goes to the database cluster located in their region. That gives you very powerful capabilities as you do bigger and bigger deployments across the world.
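A sketch of that geo-routing idea: every client uses the same logical name, and resolution picks the nearest regional cluster, in the spirit of what a Consul-based lookup can do. The service name, region codes, and cluster hostnames are all made-up examples.

```python
# Hypothetical mapping from client region to the regional cluster
# a geo-aware lookup would resolve to.
REGIONAL_CLUSTERS = {
    "us": "pg.us-east.example.internal",
    "eu": "pg.eu-west.example.internal",
    "ap": "pg.ap-south.example.internal",
}

def resolve(service_name, client_region):
    """Resolve one logical service name to the closest regional
    cluster, falling back to a default region when unknown."""
    assert service_name == "pg.service.consul"  # one name for everyone
    return REGIONAL_CLUSTERS.get(client_region, REGIONAL_CLUSTERS["us"])

host_eu = resolve("pg.service.consul", "eu")
host_unknown = resolve("pg.service.consul", "sa")
```

The key property is that adding a region or moving a cluster changes only the mapping, never the URL that clients hold.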
If your company takes off, becomes really popular, and has data centers all over the place, this helps you manage the chaos of which thing runs where, and keeps it simple for your end users. The final thing I want to talk about is being graceful. We've covered ease of use and self-healing; one thing that is still trouble is how applications react to a failover or to things going down. I don't know how many of you have seen this, but I've seen it many times: a failover is happening, the application tries to connect, and the error it surfaces is "the database system is starting up." That kind of thing trips up applications completely. I don't know why applications aren't designed with fault tolerance in mind, but hey, these are real-world problems; applications just don't have that capability. To avoid these issues, I recommend using a connection pooler, which can hold connections and apply a delaying tactic rather than failing immediately, so applications don't break. You could use something like a custom PgBouncer setup. The key idea is that while the failover is happening, the pooler holds the connections instead of propagating the error back to the application; when the database is available again, it connects through and the application proceeds. The other approach is to change your application's connection logic to retry on failure. Almost none of the applications I've seen do any kind of retry when a connection fails, so the failover happens and the application still dies. Those are the problems you'll see: even though you've done the right thing on the database side, if the application falls over, you still don't meet your SLA.
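A sketch of the application-side retry idea, written against a generic `connect_fn` rather than any particular driver: retry with a short delay while the failover completes, then re-apply session-level settings, which are otherwise lost on the new connection. The fake driver, the retry counts, and the `work_mem` setting are illustrative assumptions.

```python
import time

def connect_with_retry(connect_fn, session_settings=None,
                       retries=5, delay=0.5):
    """Keep retrying a connection while a failover completes, then
    re-apply session-level settings on the new connection."""
    last_error = None
    for _ in range(retries):
        try:
            conn = connect_fn()
            # Session variables vanish with the old connection, so
            # they must be re-applied on every reconnect.
            for name, value in (session_settings or {}).items():
                conn.run(f"SET {name} = {value}")
            return conn
        except ConnectionError as exc:
            last_error = exc
            time.sleep(delay)
    raise last_error

# Fake driver: fails twice (failover in progress), then succeeds.
class FakeConn:
    def __init__(self):
        self.statements = []
    def run(self, sql):
        self.statements.append(sql)

attempts = {"n": 0}
def fake_connect():
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise ConnectionError("the database system is starting up")
    return FakeConn()

conn = connect_with_retry(fake_connect,
                          session_settings={"work_mem": "'64MB'"},
                          delay=0)
```

Whether this logic lives in the application or in a pooler in front of it, the effect is the same: the failover window is absorbed instead of surfaced as an error.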
So those are things that will still impact you. One thing to keep in mind: because you lose connections and reconnect, possibly to a replica, any session variables you had set, say special config parameters to force a specific join method such as a nested loop join, will need to be re-run, right? These things can trip you up. They are very fine points: sometimes a connection lands on a different replica and you suddenly see different performance, but a lot of the time it's because the session variables were lost with the connection. So these are things the application needs to take care of: when it reconnects, how does it restore that state? In summary, and I'm really running out of time: these are the main pain points we talked about today, and the ways you would solve them in a containerized world. The key things to remember are: make it simple and easy for your end users; be self-healing, which will let you sleep better at night; and be graceful, that is, teach your applications how to do the right things with your database. Everything we've talked about can be summed up in this one slide: there is no "just Postgres" that solves your HA requirements by itself. There are a lot of other components you need to bring together to come up with a highly available architecture for your deployment. With that, your feedback is important. If you have questions, we will have some time for Q&A, but if you have more questions, feel free to email me. [Audience question about Docker Swarm.] Sure. I have used Swarm. Unfortunately, it's still not ready for the use cases I'm looking at, like managing services and other things; Swarm also didn't give me a lot of the features I needed.
So it's still very early days there; let's see how it goes. When we started, a lot of this work was a couple of years ago, and Kubernetes was way ahead. Even now, the features Kubernetes offers keep solving the needs I have, so I'm pretty gung-ho on Kubernetes right now. [Audience question about etcd.] When we deploy right now, for the individual instances we have currently started with a single instance of etcd, and we are still deciding whether to replace it with a distributed etcd or continue with the single one. So far we haven't seen many failures directly related to etcd itself, because if etcd fails, since it has its own volume, Kubernetes starts it up again, so we actually get that high availability anyway. A distributed etcd is more useful when you have fast-changing data and want to do more complex deployments, but I haven't seen the need for that yet: because it's a Kubernetes pod, Kubernetes restarts it pretty quickly, and we haven't seen many issues. [Audience question about major version upgrades.] That is actually a completely separate topic. There is no major-version support in these tools; the only upgrades you should be doing with them are minor-version upgrades. Major-version upgrades are a topic by themselves; let's stop there. [Audience question about routing read versus write connections.] That's the reason I would say there is no mechanism that Postgres, or any application, gives you, which is why we give you two endpoints right now: one for the master and one for the read-replica cluster URL. The reason is that it is still up to the DBAs to figure out which kind of connection they want to use; if you don't know, you select the master. Kubernetes will do the mapping correctly underneath via the labeling system: when a promotion happens, the label on that pod changes to master and new connections go there.
Yeah. Okay. Any other questions? If not, thank you for coming, and I hope this was educational. Thanks.