I would like to introduce myself and my report today. My name is Lee Kwong Ha and I am a DBA at BlackBerry in Ontario. BlackBerry sells cloud services for several kinds of database, including Postgres, Oracle, MySQL, SQL Server, and Cassandra. We have over 3,000 applications; of roughly 10,000 total staff, around 800 work in IT, of whom around 80 work on databases, with the others doing cloud services, networking services, assurance, storage supply, and so on. I work as a database administrator for Postgres, and we have a plan to move most of the applications onto Postgres: a few thousand BlackBerry applications on the cloud will be changed to use Postgres within this year and next year, whereas over the whole of last year we moved only a few, as tests. My work, and the main subject of this report, is to create the Postgres image to sell on the BlackBerry cloud service. In this report I will present our models, our implementation, our performance testing, and the conclusion. For a few years we have been running our first cluster, Keepalived plus repmgr; it has carried the applications on BlackBerry's Postgres databases, within a single containment, for those years. We are now testing HAProxy plus PgBouncer to bring the new Postgres image into production.

This is the model of Keepalived and repmgr that we have been using for the past few years. We use replication managed by repmgr, an open-source tool for Postgres, to run a master with standbys: a hot standby and an asynchronous standby, plus a witness to avoid split-brain errors; on failover, the witness helps choose the higher-priority standby to promote. Because Keepalived is installed on this cluster, Keepalived requires the subnet of the whole cluster to be in the same containment. That means we cannot cross cloud zones with this model. On failover, when the master fails, Keepalived switches the virtual IP to the hot standby, and the notify_master script of Keepalived on the master side promotes the hot standby to be the new master.

Then we set out to create the image for HAProxy plus PgBouncer, and I will discuss how we chose this solution for the Postgres image. We use HAProxy, the high-availability proxy, as the load balancer, and we use PgBouncer as the connection pooler; it is more compact than Pgpool, the older option, so it requires less memory than Pgpool. In our design the front-end servers are inside the same subnet, but cross-containment, meaning cross cloud zones, can happen on the back end: the back-end servers can be in different subnets. And we have load balancing, so the workload is distributed across all three databases, the master and both standbys. This is the model. On the front end there is Keepalived, and because of Keepalived's requirement, HAProxy 1 and HAProxy 2 must be in the same subnet; but behind the load balancer, the PgBouncer back ends can be in different subnets. Comparing PgBouncer with Pgpool: Pgpool can distinguish which node is the current master of the cluster, but PgBouncer cannot, so with PgBouncer we have to implement a health check ourselves.
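Before going into that health check, here is a minimal sketch of what this front end looks like from the application's side, assuming psycopg2; the VIP, ports, database name, and table are illustrative stand-ins, not our production values.

```python
# Sketch of the application's view of the front end: one virtual IP,
# two ports. All names and addresses here are hypothetical.
import psycopg2

VIP = "10.0.0.100"     # Keepalived virtual IP on the HAProxy pair (hypothetical)
WRITE_PORT = 6432      # HAProxy routes this port to the current master only
READ_PORT = 6433       # HAProxy load-balances this port across all nodes

# Writes always land on the current master, wherever it happens to be.
with psycopg2.connect(host=VIP, port=WRITE_PORT,
                      dbname="appdb", user="app", password="secret") as conn:
    with conn.cursor() as cur:
        cur.execute("UPDATE accounts SET balance = balance - 10 WHERE id = 1")

# Reads are spread over the master, the hot standby, and the async standby.
with psycopg2.connect(host=VIP, port=READ_PORT,
                      dbname="appdb", user="app", password="secret") as conn:
    with conn.cursor() as cur:
        cur.execute("SELECT balance FROM accounts WHERE id = 1")
        print(cur.fetchone())
```

The point of the design is that the application only ever sees the VIP and two ports; everything behind them can fail over or rebalance without the application changing anything.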
HAProxy health-checks each back-end server using TCP (telnet-style) and HTTP checks, and each server returns a code saying whether it is writable or read-only. That way we can share the workload. In every case the load balancing continues, even when a failover happens or any server fails: the health check detects it right away and the load balancing carries on transparently, without any error reaching the application.

For the implementation of this model, I would first like to describe our deployment of Keepalived and repmgr over the past few years, and then report our research on HAProxy plus PgBouncer: how read requests are load-balanced; how write requests use HAProxy port splitting to distinguish reads from writes for PgBouncer; the statistics report of HAProxy; the two kinds of failover with HAProxy, farm failover and auto failover; then the details of the front-end and back-end cluster implementation, how to install a shared witness, and the switchback procedure. At the end of this part we come to our approach for creating the auto-image for Postgres across containments.

For Keepalived: the BlackBerry cloud comprises 20 network zones, of which 16 are production, two are laboratory zones, and two are restricted zones. We developed software called Saturn Ring, which stores information about customers' storage requests: how much storage and how many devices each customer has requested. On each cloud zone we run an instance of Saturn Ring, so that for that zone we know which customers require how much storage. These production zones are spread globally: about half are located across Canada and the United States, a few in Germany and Amsterdam, and a few in Singapore, India, and Hong Kong, so customers all over the world get fast access to our service.

This is our implementation of HAProxy and PgBouncer. On the front end, Keepalived is installed only on HAProxy 1 and HAProxy 2, to create a virtual IP; HAProxy then forwards to the back end and load-balances. On the back end we have PgBouncer, Postgres, and repmgr to control failover, together with repmgrd, the watchdog service of repmgr. On the witness we do not install PgBouncer, and its database is empty, because we do not need data on the witness and we never want the witness to become the master, so we put no data there; replication happens through WAL shipped from the master to the standbys.

The first observation is that the instances on the front end and the instances on the back end are two entirely different sets of components. So the first question: many people would simply use one image for all six virtual machines here. Is that good? We needed to test it, because the front end is HAProxy plus Keepalived with no Postgres, while the back end is Postgres, PgBouncer, and repmgr, so the front-end image and the back-end image could reasonably be two different images. We tested it. We use OpenNebula to create the virtual machines, and in the first approach we used only one image for every node, with an OpenNebula contextualization script to detect whether the new VM is a front end or a back end.
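As a purely hypothetical sketch of what such a role-detection script might do (our actual script and package names differ; the role marker file here is an invented stand-in for OpenNebula contextualization):

```python
# Hypothetical sketch of the single-image experiment: one script
# decides the VM's role and strips the components that role does
# not need. Package names and role detection are illustrative.
import subprocess

def get_role():
    # In our tests the role comes from OpenNebula contextualization;
    # here we just read a marker file as a stand-in.
    with open("/etc/vm_role") as f:
        return f.read().strip()   # "frontend" or "backend"

if get_role() == "frontend":
    # The front end keeps only HAProxy + Keepalived.
    remove = ["postgresql", "repmgr", "pgbouncer"]
else:
    # The back end keeps only Postgres + repmgr + PgBouncer.
    remove = ["haproxy", "keepalived"]

subprocess.run(["apt-get", "remove", "-y"] + remove, check=True)
```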
In that script, if the VM is the front end we remove Postgres, repmgr, and PgBouncer; if it is the back end, we remove HAProxy and Keepalived. The first thing we found is that this runs slowly: when you create different VMs from the one image, you sit there and wait, which is not good. Then we tried making two images: the first containing only HAProxy and Keepalived, the second containing only Postgres, repmgr, repmgrd, and PgBouncer. With two different images, creating a VM takes only one third of the time of the first approach: in around five or six seconds we have a new front-end server, and in under ten seconds a new back-end server. So there is a trade-off: if you do scaling, managing two images at the same time may be more difficult, but the creation time is clearly only a third of the single-image time. We therefore tend toward two different images, accepting the extra management work so that the cloud solution is better.

This is how the load balancing happens. Suppose the Keepalived virtual IP is on HAProxy 1, and HAProxy 1 listens on a port; this can be any port, I just use 6432, but you can change it in your configuration. The application sends a request; when HAProxy receives it, it is distributed across the three nodes. We tested this, and it does load-balance.

This is how writes work: we use a different port, splitting the ports to distinguish reads from writes for PgBouncer. With the older option, Pgpool, we would not need port splitting, because Pgpool declares the master inside its own configuration. So we send write requests to a separate port, which goes directly to the PgBouncer on the master; PgBouncer forwards the write to the master's database, and after the write the master ships WAL to the hot standby and the async standby.

How does the port splitting know where to send writes? We combine it with the health check to learn which node is the current master. The health check, set up via xinetd, sends an SQL statement to the local database to see whether it is a writable database, and the front end uses its TCP and HTTP checks against that to know which node is the current master to write to. If a failover makes another node the master, the health check detects it and the writes follow it there.

We can then monitor through the statistics report of HAProxy. In our setting we configure the stats page with user admin and password admin; looking at it, we can also verify the load balancing and everything else. The link is included here.

Now we come to testing the two kinds of failover with HAProxy. The first we call farm failover. Failover means that when the master is gone, the hot standby is promoted to be the new master. A common beginner question: if you set up the image or Postgres wrongly, can the asynchronous standby be promoted to master? The answer is: it should not be. During such a failover, read requests are load-balanced across only the two surviving back ends, and the failover generates no errors back to the front end or onward to the application.
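To make the health check above concrete, here is a minimal sketch of that kind of writability probe, in the style of the classic xinetd-plus-HAProxy httpchk pattern: xinetd runs the script on each probe and returns its stdout to HAProxy, which treats HTTP 200 as writable and 503 as not. The connection parameters are illustrative, and this is a sketch of the pattern rather than our production script.

```python
# Writability probe: intended to be invoked by xinetd per connection,
# replying with an HTTP status that HAProxy's httpchk understands.
import sys
import psycopg2

def is_writable():
    try:
        conn = psycopg2.connect(host="127.0.0.1", port=5432,
                                dbname="postgres", user="monitor",
                                connect_timeout=2)
        with conn.cursor() as cur:
            # On a standby this returns true; only the master is writable.
            cur.execute("SELECT pg_is_in_recovery()")
            in_recovery = cur.fetchone()[0]
        conn.close()
        return not in_recovery
    except psycopg2.Error:
        return False

if is_writable():
    body = "master"
    sys.stdout.write("HTTP/1.1 200 OK\r\nContent-Length: %d\r\n\r\n%s"
                     % (len(body), body))
else:
    body = "read-only or down"
    sys.stdout.write("HTTP/1.1 503 Service Unavailable\r\nContent-Length: %d\r\n\r\n%s"
                     % (len(body), body))
```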
Now the read path. A read comes in to PgBouncer on a back end, which forwards it to the Postgres port, 5432, and the reads are load-balanced. HAProxy is configured with a backend cluster that it keeps checking: if the check against a node's PgBouncer shows that node is not all right, HAProxy moves on within the backend cluster and serves the read elsewhere, avoiding the failed server.

Now auto failover. Suppose we have many standbys; we do not know in advance which one will become the master, so auto failover leans on the health check for write requests more than for read requests. For reads we only need a check on port 5432, the Postgres port, and that check needs no script at all; you can do it directly in HAProxy. For writes we need to send an SQL statement to make sure the server is writable. So for auto failover we use the health check to find which server is currently writable, and we write to that server, the newly promoted master after failover. The new master then ships WAL to the async standby; WAL files are exchanged between the Postgres servers to keep all the databases in the cluster the same.

When you create a Postgres image, you have to test everything, even the WAL shipping. So the question here is: if you set the WAL folder size, what is the best parameter? Let's say I have plenty of storage. I give the pg_xlog folder 40 gigabytes, so it can hold up to about 36 gigabytes of WAL files, and feeling strong, I set wal_keep_segments to around 2,000 by my calculation (at 16 MB per segment, roughly 31 GB), using the WAL-size formulas you can find on the Postgres homepage. Then we test it on the real Postgres image. What happened? First day, very good. One more week, very good. The third week: why is our cluster frozen? We opened it up and saw the IOPS were very high. Why such high IOPS? Because, believing we were strong, we were keeping nearly 40 gigabytes of WAL files in the system, and creating each new WAL file inside a folder that full is very slow; shipping WAL files from the master to the standby also became very slow, because every operation touches a folder holding 40 gigabytes of WAL files. Put that under a real application and, after a few weeks, the application starts having issues as the database becomes slow, and you are left cleaning pg_xlog manually. Difficult. So on a real production application I would rather set wal_keep_segments to only 500 (about 8 GB). Then, to still recover to within two minutes, I take a snapshot backup every two minutes, so that whenever a crash happens I can recover with at most two minutes of data lost. Snapshots combined with a small wal_keep_segments number guarantee the speed of the application on our Postgres image.

Now the details of the front-end cluster implementation. We use Keepalived. If HAProxy 1 fails, it does not matter: it takes only one to two seconds to move the VIP, because there is no data on the front end. By default the Keepalived master holds the virtual IP, and there are settings inside Keepalived, on the master and the backup, to speed up the IP switchover. For the back end, we install PgBouncer as the connection pooler, and when we put PgBouncer into the image, we test it.
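A sketch of such a with-and-without comparison follows: churn through many short-lived connections against the pooler port and against Postgres directly, and compare wall-clock time. The host, ports, and credentials are illustrative, and this is only one simple way to exercise connection cost, not our full benchmark.

```python
# Compare connection churn through PgBouncer (6432) against raw
# Postgres (5432). Hypothetical host and credentials.
import time
import psycopg2

def churn(port, n=200):
    start = time.time()
    for _ in range(n):
        conn = psycopg2.connect(host="10.0.0.11", port=port,
                                dbname="appdb", user="app", password="secret")
        with conn.cursor() as cur:
            cur.execute("SELECT 1")
            cur.fetchone()
        conn.close()
    return time.time() - start

# PgBouncer keeps server connections open, so repeated client
# connects should be noticeably cheaper than raw Postgres connects.
print("via PgBouncer:  %.2f s" % churn(6432))
print("direct Postgres: %.2f s" % churn(5432))
```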
We tested to make sure, with and without PgBouncer, whether PgBouncer improves the performance of the back end, and we saw that yes, it does. Some people ask the naive question: why not throw PgBouncer away and go one step directly into the database? Surely one hop fewer must be quicker than going through PgBouncer and forwarding? That reasoning is not right: we tested it and saw the performance improvement with PgBouncer. The cluster runs more smoothly across the board, because the connections are managed by PgBouncer.

For the witness: a witness can serve many clusters and control failover for all of them. At first we have the HAProxy PgBouncer cross-containment cluster, where you can see the cross-containment, one subnet here (C4918) and a different subnet there (122183), with the shared witness attached. Then suppose we connect a second, same-containment Keepalived repmgr cluster to it. Which is the best way to create the shared witness? We ran a lot of tests here too, to make sure our image runs smoothly and automatically provides a witness with an empty database.

Our image currently runs Postgres version 9.4.6, and in the witness's repmgr configuration file the priority is set to zero. Newer versions of repmgr say this value must be at least zero; before that, it was minus one. Either minus one or zero prevents repmgr, even when it has no other choice, from deciding to promote the witness to be the new master. Inside the witness we put the first repmgr config file, which manages the HAProxy PgBouncer cluster, in the first, main folder; and we put the second repmgr config file, for the Keepalived repmgr cluster, in a separate repmgr witness folder, so the witness holds two different settings files for the two different clusters.

To create it, we use the repmgr witness create statement. In the first invocation we pass the master of the HAProxy PgBouncer cluster and its repmgr config file; in the second we pass the IP of the master of the Keepalived repmgr cluster and its config file. This creates a shared witness. Then, on the same shared witness, we use the repmgr cluster show statement to display the two clusters on one server: passing the first config file, for HAProxy PgBouncer, outputs its current back ends (1818319); passing the second config file, for Keepalived repmgr, shows the second cluster (187, 188, 189); and from the two outputs we see the same shared witness (191).

Now, how do we do the switchback? But first, a point about scaling. If we try to develop our image toward scaling, say horizontal scaling, then scaling up can hit an error: for example, if at that moment our image file suddenly becomes inaccessible, the VM is created as an empty standby, and when load balancing sends traffic to an empty standby, it returns errors straight back to the customer's application. So this is related.
If instead we scale vertically, then scaling up, adding more RAM or more hard disks, is easy to do, but scaling down, removing them, is difficult: if you remove RAM from a running server, it stops. So we also tested this for the image's scalability.

Now the switchback steps for HAProxy and PgBouncer. Suppose the master has been failed for a while. First, after starting it back up, we have to re-clone it to get its data up to date, then start the service and have it follow the new master, so that the cluster always has exactly one master at any one time. The failed master is now running again as a new standby. Next we return the cluster to how it was: the original master should be the master again, and the promoted standby needs to go back to being a standby. We stop the service on the promoted master, which is the hot standby; we promote the failed master back to master, and at this point we have one master again, the original one. Then we clone the hot standby from the master, because the hot standby now needs to come back as a standby; this requires a clone statement, and the clone marks the hot standby as a read-only database. We start the service on the hot standby and have it follow the master. Now the cluster is as it was: the master is switched back, and the standby is a standby again. If there are any other standbys, they also need a repmgr follow.

One difference: after many changes and restarts across the whole cluster, on the front-end cluster I restart the HAProxy service. Restarting HAProxy after the back end has seen many restarts refreshes the whole cluster and ensures the load balancing runs smoothly over the following week.

So now we come to the decision. We have two solutions: one image for the Keepalived repmgr cluster, and two images, a front-end image and a back-end image, for HAProxy PgBouncer. Currently we have 100 applications on Keepalived repmgr and none yet on HAProxy PgBouncer, so we are comparing to verify which one is better and which is more reliable, checking the throughput, the input/output, and the CPU of each cluster. I will report the JMeter results here; we also have pgbench, which produces very similar CPU and input/output numbers to JMeter, and we likewise use Wallyhose and Grafana dashboards for real-time CPU observation, all very similar. So we show the JMeter results.

The JMeter formulas we use: throughput is the number of transactions per second, and KB per second is the throughput multiplied by the average bytes, divided by 1024. You can find these formulas on the Apache homepage; the documentation does not present them as equations, it just states in one line that throughput is the real number of transactions over the real execution time, and with a little testing you find that the time is in seconds.

On JMeter we run HTTP request tests for read-only, simple write, and read-write. For read-only, we test both without a data-execution loop and with per-row execution. (The simple-write cases are split further, as explained right after the sketch below.)
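First, those JMeter formulas in concrete form; the numbers here are invented purely for illustration.

```python
# The two JMeter formulas above, applied to made-up numbers.
requests = 1_000_000        # completed transactions in the run
elapsed_s = 540.0           # real execution time, in seconds
avg_bytes = 850.0           # average response size, in bytes

throughput = requests / elapsed_s            # transactions per second
kb_per_sec = throughput * avg_bytes / 1024   # KB per second

print(f"throughput = {throughput:.0f} tps")
print(f"bandwidth  = {kb_per_sec:.1f} KB/s")
```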
For simple write, we distinguish insert-and-update from delete. Why distinguish them, rather than having one simple-write case with insert, update, and delete together? At first I borrowed this idea from pgbench: I saw the two cases there and wondered why not one. When we tested, we discovered that delete throughput is much higher than insert and update. The statements are different in nature, so the transactions per second differ greatly between insert/update and delete. So I agree with pgbench that write should be divided into two cases. If you combine two statements this different, the result depends on the mix: sometimes the throughput is high, sometimes low, and you end up with different test results each run, whereas you expect to get the same result every time you test. For the same reason, read-write also has a select-insert-update case and a select-delete case.

For each of these tests we measure the throughput, the CPU, the threads, the response time, the bytes throughput, the response time percentiles, the response time over threads, and the transaction throughput over threads. Altogether, across JMeter, pgbench, Grafana dashboards, and Wallyhose, we have around 200 charts, but here we show perhaps 20, and only from JMeter, because they are all very similar; one tool looks slightly better than the rest, but one of them is enough.

This is the throughput, measured over one million requests. The test database holds 7 million rows, and we run one million requests against that 7-million-row database. The lowest throughput is 1,000 and the highest is 2,100. This is the throughput in transactions per second of Keepalived repmgr. This is the CPU of Keepalived repmgr, showing only the master: Keepalived repmgr offers no load balancing, and the 10 to 20 percent CPU on the standbys only reflects the WAL-shipping communication with the master, with no real workload there. This is the threads chart of Keepalived repmgr: after one minute of testing the number of active threads is 100, and over nearly six minutes it steadily drops until the one million requests are done. This is the response time chart of Keepalived repmgr. This is the bytes throughput over time. This is the response time percentile chart of Keepalived repmgr. This is the response time over threads: on average we have 37 milliseconds of response time at around 85 threads. This is the transaction throughput over threads: on average we have 1,700 transactions per second at around 90 threads.

This is the failover. The failover of Keepalived repmgr, surprisingly, takes from six seconds on our cloud up to a maximum of nine seconds, so the failover completes in under ten seconds. We also run Oracle on our cloud, and Oracle failover is similar, a little slower. It means Postgres failover time is great. We verify it with a test like this: per minute, we send 120 update statements nonstop to the Keepalived repmgr database cluster, and then we trigger the failover.
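A minimal sketch of that failover probe, assuming psycopg2; the VIP, port, and heartbeat table are hypothetical stand-ins, and real runs would trigger the failover partway through.

```python
# Send two updates per second (120 per minute) through the VIP and
# count errors while a failover happens. All names are illustrative.
import time
import psycopg2

errors = 0
for _ in range(120):                 # one minute at 2 updates/second
    try:
        conn = psycopg2.connect(host="10.0.0.100", port=6432,
                                dbname="appdb", user="app",
                                password="secret", connect_timeout=1)
        conn.autocommit = True
        with conn.cursor() as cur:
            cur.execute("UPDATE heartbeat SET ts = now() WHERE id = 1")
        conn.close()
    except psycopg2.Error:
        errors += 1                  # a slow failover shows up here

    time.sleep(0.5)

print("errors during failover window:", errors)
```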
With the tuning in place, Keepalived repmgr failover takes less than nine seconds, and the user keeps using the application with no errors at all, and no warnings, because if the failover takes less than 30 seconds, a Python application in Django does not even detect the change of database master. So it passes. By contrast, on our first image the failover was very slow, and in one minute the application returned 120 errors for the 120 update statements we tested with. We had to tweak the configuration until the failover came down to under nine seconds; since then, no failover produces any errors.

You can see it in the CPU. When Keepalived detects that the master has failed, the master's CPU reaches 37 percent: that is Keepalived running notify_down. Then the hot standby detects that the cluster now has no master, reaching around 29 percent CPU, and the second CPU spike on the hot standby, around 18 percent, is the repmgr promote run inside Keepalived's notify_master on the hot standby. So after that second spike, the hot standby is promoted.

Now we come to the second cluster, our new HAProxy PgBouncer cluster. The throughput again starts from around 1,000, but it reaches 2,300 in the tests. The first test we do is: does the load balancing actually happen? We use sysbench, and we measure that while we are sending to the database there are, at the same moment, 1,100 requests on the master, 1,900 requests on the hot standby, and 800 requests on the async standby. So the load balancing is certain; it happened right in front of us, which means our settings are right. This is the load balancing.

Now we get into the CPU and the throughput of HAProxy PgBouncer. This is the front-end CPU usage: only HAProxy 1 receives requests from the application, because there is no data on the front end.
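As a closing illustration, here is one simple way to double-check that load balancing from the application side, assuming psycopg2 and that each node's PgBouncer reaches its local Postgres over the node's TCP address (over a Unix socket, inet_server_addr() would return NULL); the VIP and port are hypothetical.

```python
# Tally which backend answers each read sent through the balanced
# port, using Postgres's inet_server_addr(). Illustrative values.
from collections import Counter
import psycopg2

hits = Counter()
for _ in range(300):
    conn = psycopg2.connect(host="10.0.0.100", port=6433,
                            dbname="appdb", user="app", password="secret")
    with conn.cursor() as cur:
        cur.execute("SELECT inet_server_addr()")
        hits[cur.fetchone()[0]] += 1
    conn.close()

# With round-robin balancing, the three backends should receive
# roughly equal counts.
print(hits)
```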