Hi, I'm Jaime Casanova. I'm from Ecuador, and I represent 2ndQuadrant there. I now work a lot with failover situations, repmgr and things like that, so that's what I'm going to talk about today. I'm going to show you an example, based on a customer, of how we solved some problems they had.

The customer is the Corporación Nacional de Telecomunicaciones (CNT), the national telecommunications corporation. They are a public company that provides fixed telephony, internet access, television, mobile telephony and other services. In this case we are going to talk about a service they provide to the national police, something they call the panic button. Yes, you press the button and then you panic.

It's a PostGIS-enabled database that they use to track patrols. When someone calls 911 or presses the panic button, they can tell which patrol is nearest and send it. It's a very specialised service, and the police are actually very happy with the service they are receiving. As you can expect, this is a critical service. It can never fail, because someone could be in danger, so citizens' security is at risk.

That service started four years ago. Now CNT is also offering other services based on the same platform as the panic button. They improved some core business processes: for example, when they need to send technicians, they can locate the boxes where the landlines are more easily, it's easier to invoice, and a few other things. They have even implemented new services. One of them is tracking tablets: you can buy a tablet, give it to your kid, and if the kid gets lost you can find them. Also for phones and things like that.
It's a service they are implementing based on the same technology, because they are now very happy with what they got. They currently have basically 99.98 percent availability. They can improve it, but of course that costs money.

So what was their first approach to high availability? This, which of course is not high availability. It has some flaws in its basic design because, for example, they used a single storage for two servers. If one of the servers fails, we are happy. But if the SAN fails, everything crashes. I have seen three different companies with such failures just in Ecuador, and I know there are a lot more. It's a little more complicated for us: if, I don't know, the network card in the storage fails, there are no spare parts there. You have to import it from here, so that's a few months to recover.

Of course, they said: okay, we need redundancy, so, another storage. That's not highly available yet. There are still a lot of flaws in that design. Why? Because to be really highly available you don't only need to fail over the database. You need to instruct the application servers to look at the new master. If you just make the failover happen and the application servers are still looking at the wrong server, you are not solving anything. You have exactly the same problem. Sometimes we who work with databases think at the database level, but we need to see the whole picture, not just the database.

So, to be really highly available we need redundancy. We need redundant data centers.
We need redundant networks, and everything, actually. When I say redundant everything I normally think of hardware, but we actually need redundant people too. If only one person knows what happens, you have the exact same problem. So yes, redundant everything.

When we got to this customer, the first thing we did was to remove everything they had and create this basic infrastructure. We have a machine in the capital, Quito. They have a local node in Quito too, which is bad because it's in the same city, but well. And the other node they put in Guayaquil, in another city. For a really highly available setup you need at least three servers. The best you can do is put every server in a different location: a different data center, ideally a different city or country or whatever. Because of political constraints the servers can't be outside Ecuador, but different cities is good.

What else do you need? What else did we ask them to have for high availability? Well, they need a plan. Seriously. Building a high availability setup is not a mechanical thing. You can think of the basics, but then you need to know the infrastructure they have and the exact problems they are going to have. What happens if the application server crashes? What happens if the database server crashes? And so on. Everything must be covered. You must understand how much time you have to recover, whether you are allowed to lose data (that's an important question), where high availability stops, and how everything affects the other components.

You also need to detect the problem, of course. Otherwise it doesn't matter that you have redundancy. And then, what do you do if your database server crashes? Sometimes we don't like automatic failover situations.
I don't like them. I don't like them because a computer is not as smart as a person. But if you really want something to be highly available you need some kind of automatic failover, because otherwise you need to hire this guy to enter the building and do everything.

So this is the basic infrastructure they have now. They have application servers, and they have a master server replicating via streaming replication to the standby servers, even if one of the standbys is local.

Now, what else do we need? We need to know which server we are going to promote in case of a failure. We need to know what happens with the remaining standby, and we need to inform the application servers of what happened. If one of those is missing, you don't have a highly available setup.

So what tools do we use? repmgr and PgBouncer. repmgr is the tool we use to detect the problem, to do the failover for us, and to determine which node is the best candidate to be promoted. PgBouncer we are using to isolate the master, to avoid split-brain situations in which the master fails and then reappears. You have a big problem if that happens.

repmgr is from 2ndQuadrant; the current version is 3.1. It's a very good tool. It also helps us set everything up. You can use standby clone to create a standby. You need to register the master and register the standby, and it shows you the current situation of the cluster. I'm going to go through a few of the configurations and what happens when each of those commands gets executed. And you have these other two commands, standby promote and standby follow, which help with the magic.

Okay, you need to install everything, and you need to configure Postgres with this configuration. This is just a template.
It doesn't need to be exactly that way for everything. For example, instead of using wal_level = logical you can use wal_level = hot_standby. You don't need wal_keep_segments, except for cloning; after that you can change it. These ones are mostly there for practical reasons. If you have a standby, you are going to archive WAL at some point, so we activate archive_mode. The exit 0 is just there to do nothing for now, until we have a command to put there. For example, some customers use Barman to take backups, so they put the Barman archive command here, and that's fine. Or anything else you want to put here. That's not what is in the customer's setup. It's just for the test: you can copy these commands and execute them and it will work. So this is just for you to execute.

Remember to allow port 5432 in the firewall, and the connections.

So we create a standby. This is the exact same thing as executing... I forgot the name of the binary. I don't use it, I only use this, so I forgot it. Why use this if it's the same as executing pg_basebackup? If you are using Debian you probably have the problem that pg_basebackup doesn't follow the configuration files, because they are in another directory. This one follows the configuration files and puts them in the right place. If you get something like this, everything executed fine. Only, yours will be in English, not Spanish.

Then you need to create a repmgr.conf, and you need to put one of those on each node: one on the master and one on each standby you have. This is a very simple configuration. There are more settings available in this file; this is the basic set you need to start working.

Then, register the nodes. This is important so that repmgr knows which servers are in the cluster: which is the master, which standbys I have available, etc.
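The repmgr.conf shown on the slide is not part of the transcript. As a minimal sketch of "one file per node", the Python below renders a basic repmgr.conf for each server. The key names (cluster, node, node_name, conninfo) are the ones used by repmgr 3.x as far as I know; the cluster name, node names, addresses and the repmgr user and database are hypothetical and not taken from the customer's setup.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    node_id: int  # unique number of the node inside the cluster
    name: str     # hypothetical node name
    host: str     # hypothetical address


def render_repmgr_conf(cluster: str, node: Node,
                       user: str = "repmgr", dbname: str = "repmgr") -> str:
    """Return the text of a minimal repmgr.conf for one node."""
    lines = [
        f"cluster={cluster}",
        f"node={node.node_id}",
        f"node_name={node.name}",
        f"conninfo='host={node.host} user={user} dbname={dbname}'",
    ]
    return "\n".join(lines) + "\n"


# Hypothetical three-node layout: two nodes in Quito, one in Guayaquil.
NODES = [
    Node(1, "quito1", "10.0.1.10"),
    Node(2, "quito2", "10.0.1.11"),
    Node(3, "guayaquil1", "10.0.2.10"),
]

for node in NODES:
    print(f"# repmgr.conf for {node.name}")
    print(render_repmgr_conf("panic_button", node))
```

Only the node number, the node name and the host in conninfo differ between the files; the cluster name must be the same on every node.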
Okay, so register the standby too. You can use cluster show to see which nodes you have and what the role of each node is. You can check the same information in the repl_nodes table in the repmgr schema.

Okay, those are the basic things to do. Now we have a cluster with one master and two standbys, and everything is registered in repmgr. There's no magic so far. You need to start repmgrd. This daemon watches every repmgr node, and if the master crashes, this daemon will notice. This is the one that makes the magic. If it is not running on a node, that node will not participate in failover.

Something repmgr knows is the lag between the nodes and the master. It keeps track of that information all the time, so if the master crashes it knows which node is the best candidate to get promoted. The repmgr daemon connects to each remaining standby and asks for this information, and the one that is in the best position gets promoted.

So these are the basic parameters we need to add to repmgr.conf for the failover to happen: master_response_timeout, reconnect_attempts, reconnect_interval. Failover: with automatic, you can put a failover command there; with manual, that server is only counted as a vote, to know whether it can see the rest of the nodes. Priority: which node I want to promote. This is not static, it's not fixed. It's just a suggestion.
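The candidate selection described above can be sketched roughly as follows. This is not repmgr's code, only an illustration under stated assumptions: each standby reports how much WAL it has received as a plain integer, the most up-to-date reachable standby wins, priority only breaks ties, a higher priority value is preferred, and a priority of zero means "never promote". The node names and numbers are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Standby:
    name: str
    reachable: bool    # could the daemon connect to this standby?
    wal_position: int  # assumed: received WAL as a byte offset, higher = less lag
    priority: int      # assumed: higher is preferred, 0 = never promote


def pick_candidate(standbys: List[Standby]) -> Optional[Standby]:
    """Return the standby to promote, or None if nobody is eligible."""
    eligible = [s for s in standbys if s.reachable and s.priority > 0]
    if not eligible:
        return None
    # Least lag first; priority is only a suggestion used to break ties.
    return max(eligible, key=lambda s: (s.wal_position, s.priority))


standbys = [
    Standby("quito2", reachable=True, wal_position=5000, priority=100),
    Standby("guayaquil1", reachable=True, wal_position=5200, priority=50),
]
print(pick_candidate(standbys).name)
```

With these numbers the remote node wins even though the local one has the higher priority, because it has received more WAL.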
We suggest this server to get promoted, but if repmgr knows that it's a bad decision, it won't promote it. promote_command: what command to execute to promote the node. And follow_command: what command to execute to make a standby follow the new master. For follow_command you can put exactly that and it will work. promote_command you normally need to improve, because this one is too simple. We are going to change that line and put a script there, a script provided by you, the user, and that script can do anything. More on that later. The last one is retry_promote_interval_secs: how much time should pass before retrying if it couldn't promote the node.

So with this configuration, in the worst situation you are going to be without a master for two minutes, right? 60 seconds waiting for the master's response and 60 seconds trying to reconnect to the master. There is some time between this and the failover, so at most five minutes until you start with a new master, basically.

So this is what we have right now. If the master crashes, we now have a new master. If we just execute this command, standby promote, this is what will happen: we have our new master and a standby following it. We are still in trouble, because the application servers are looking at the wrong server. So we need to see the whole picture, not just the database.

This is what a promotion looks like: trying to reconnect, promoting a standby, success. You are going to see this in the repl_nodes table. Look at this: there are two masters, but one isn't active, and this indicates which one is the real master.

repmgr alone doesn't provide high availability. For the second part of the high availability setup we are going to use PgBouncer. It's not the only way we can do it; there are lots of ways.
I like this way because it's simpler. It doesn't depend on other people. It doesn't depend on other hardware.

Yes. I don't care. The machine can explode, the server could crash and stop, you can stop the service. The thing is that if Postgres sees corrupted data it will stop. Yeah, okay. So the master is not responding, whatever happens, for whatever reason. Yeah, it's not exactly SELECT 1, but you get the idea.

So what are we going to do with PgBouncer? Configure it. The important thing here is that when configuring PgBouncer you need to indicate which node it will connect to, and here you must indicate the master. You can be smart and put different entries for read-only servers and change it however you like, but I'm just doing it this way.

And I'm going to put PgBouncer on every application server, exactly here. Why here? Why not on the database server? Why not in the middle? Well, my answer is: because it's simpler. If I put PgBouncer here and this server crashes, I don't care, because every application server connects to its own PgBouncer. If I put PgBouncer on the database server, then I need to connect to the application servers either way to inform them of the problem. And what if the application has the connection in the code? How do you change that? It's a little more complicated, so I prefer to put it here. Yes, that's clear.

Okay, why not in the middle? Simple: because then I have the same situation again. I need a failover for the case in which the PgBouncer server crashes. So this is simpler.
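To make the "PgBouncer on every application server, pointing at the master" idea concrete, here is a minimal sketch that fills a pgbouncer.ini template with the current master. The template is deliberately incomplete (a real file also needs authentication settings), and the database name, addresses and listen port are hypothetical values chosen for the illustration, not the customer's configuration.

```python
# Minimal pgbouncer.ini template: one [databases] entry that points at the
# current master, and a listener bound to the application server itself.
PGBOUNCER_TEMPLATE = """\
[databases]
{dbname} = host={master_host} port={master_port} dbname={dbname}

[pgbouncer]
listen_addr = 127.0.0.1
listen_port = 6432
"""


def render_pgbouncer_ini(master_host: str, dbname: str,
                         master_port: int = 5432) -> str:
    """Return pgbouncer.ini text that routes `dbname` to the given master."""
    return PGBOUNCER_TEMPLATE.format(
        dbname=dbname, master_host=master_host, master_port=master_port
    )


# Before the failover the entry points at the old master (hypothetical address)...
print(render_pgbouncer_ini("10.0.1.10", "patrols"))
# ...and after it, the same template is rendered again with the new master.
print(render_pgbouncer_ini("10.0.2.10", "patrols"))
```

The application keeps connecting to 127.0.0.1 on the PgBouncer port in both cases; only the rendered file changes when the master changes.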
Okay, and every application server connects to localhost. You can get the same strings you need in PgBouncer by executing this query on repl_nodes. So if I only have this entry, I can execute this query and know which one is the master now. Yeah, yeah, but one of them is inactive. Yeah, I know, absolutely: the first time I got the two records.

Okay, then I use a script similar to this one. This is a very simplistic script based on the one at the customer's site. Well, first I have the list of application servers. It executes standby promote to promote the node to master, gets the line I need to put in PgBouncer, fills in the template, and moves the new configuration to every application server. The interesting thing is that PgBouncer allows me to connect to it as if it were Postgres. So I can replace the configuration, connect to PgBouncer on every application server, and run PAUSE, RELOAD, RESUME. The application servers are reconfigured; they see the new server. Yes. As I said, this is a stupid version of the real script, but it works. Yeah, you have to have PgBouncer on every application server.

Now, some people ask us why we have that promote command. Why not give repmgr the ability to do this by default, or something like that?
Well, the reason is that there are a lot of different situations, not only Linux application servers. For example, this customer has a few servers that are Windows servers, and of course I can't do that on a Windows server. We actually replace the hosts file on those machines. Yeah, it's not pretty, but it works. We tried to do it with DNS, but their DNS is a Windows server and they don't trust it, because a virus can take it down, and that has happened to them. So we prefer to use /etc/hosts.

We actually survived the DNS servers crashing in the capital: the DNS in Guayaquil took over and they saw this without a problem. In more than one situation the database was working and the application servers were really pointing to the right database, but they had some external problem and couldn't recover the service for another reason. But we do the right thing every time.

It's usual to see failover situations there, and sometimes the node in Guayaquil gets promoted even if the other node in Quito is still active. When we check, repmgr did the right thing: for some reason the remote node was more up to date than the local one. I don't understand why, but that happens, in this network at least.

Okay, so this is what we do there, and this is what it looks like after that happens. Of course, then we need to reconstruct this as quickly as possible, and that is just a matter of another standby clone with the force option, to rebuild on the same data directory and not copy everything.

Okay, so we have around nine minutes if you have some questions. I don't know how to translate that, but I have the stick here to hit people. So, any questions? Yeah? You can lose data?
Yes, you can lose data: a few seconds, probably, unless you have a synchronous standby. If you have a synchronous standby you are not going to lose any data.

Standbys in cascade? Yeah, it supports that. That's actually the reason why repl_nodes has this column, so it knows if there is another level of standbys. Those can't be promoted as well.

Sorry, I can't understand you. Ah, okay. Yes, if you give me your email I can send them to you. I am not sure, but I believe they are going to put this somewhere. By the way, I didn't put my own email here, but it's jaime@2ndquadrant.com. If you want the slides you can email me and I will send them to you: jaime@2ndquadrant.com.

I've never seen Consul, so... SSL? I don't understand the question. You mean here? Oh, this one. It works with the same technology as Postgres; it's really using that.

Yeah. In Ecuador everyone wants automatic failover, because they basically don't know how to do anything else.

How is the setup when using the same witness for a hundred clusters? Yeah, it doesn't matter, actually. There is a witness in case you have an even number of nodes: you can create a witness, and that node is only used to ask whether I can see the witness, nothing else. Actually, and don't tell anybody about this, you can't use repmgr with only two nodes, because it needs to know that the new master sees the majority of the other nodes. If you have only two and one fails, there's no majority. But you can fake it to make it work. You can say: okay, this node, which is the standby, is also the witness. You don't need another database.
You don't need anything. It only connects to it, executes a query to see whether it can see it, and counts a vote. So you can use it with a hundred clusters if you want, or do other crazy things, like having only two servers.

Okay, I have two minutes. Yeah?

Yes and no. People like to lower those values. You need to put something there; you can't just leave it at zero. I have seen that too. But the problem is that repmgr still restarts the new master, and that restart can take some minutes. So you can lower these to a few seconds, but the restart will take at least a minute or two.

Yeah, I understand you are asking about the last parameter. It tries to promote the node. If for any reason it couldn't promote the node, then it waits this time and tries again to promote the node. It's basically for race conditions.

No, it tries two times, because it's the best node, the best candidate. Yeah, you can force it. You can force it via the promote command. But normally what it does is try a second time and then stop trying. And we are out of time.
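The timing discussed in the talk and in this last answer can be written down as simple arithmetic. The sketch below assumes that the detection window is master_response_timeout plus reconnect_attempts times reconnect_interval. The 60 seconds of waiting and 60 seconds of reconnecting come from the talk; splitting the second 60 seconds into 6 attempts of 10 seconds, and the restart times, are assumptions made only for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FailoverTimeouts:
    master_response_timeout: int = 60  # seconds waiting for the master to answer
    reconnect_attempts: int = 6        # assumed split of the 60 s reconnect phase
    reconnect_interval: int = 10       # seconds between reconnect attempts

    def detection_window(self) -> int:
        """Seconds that pass before the failover itself can begin."""
        return (self.master_response_timeout
                + self.reconnect_attempts * self.reconnect_interval)


def worst_case_outage(timeouts: FailoverTimeouts, restart_secs: int) -> int:
    """Detection window plus the time the new master needs to restart."""
    return timeouts.detection_window() + restart_secs


default = FailoverTimeouts()
print(default.detection_window())        # the "two minutes" from the talk
lowered = FailoverTimeouts(10, 3, 5)     # hypothetical lowered values
print(worst_case_outage(lowered, 120))   # the restart now dominates
```

Lowering the timeouts shrinks only the first term; as the answer above notes, the restart of the new master still takes a minute or two.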