Hello everyone. I am Amit Kapila. I am a senior PostgreSQL developer and committer. I have been working on PostgreSQL for the last 10 to 15 years, and on databases for the past 20 years. Today I am here to talk about logical replication, especially the past, present, and future of logical replication in PostgreSQL. I would also like to mention that I work as a senior director at Fujitsu, which sponsors all of my work in the PostgreSQL open source community.

So the broad agenda for today: I will first walk through the overall high availability and scalability solutions, then how logical replication has evolved in PostgreSQL, then the basics of logical replication in PostgreSQL and some of its key features. After that I will talk about what is coming up in the PostgreSQL 15 release, which is due in September, and what is likely to happen beyond that in the logical replication area. Finally, I will talk in general about the multi-master solutions that could be built on top of PostgreSQL's logical replication. That's the agenda.

Let's start with high availability and scalability solutions. This is the most basic solution for high availability: there is a master and multiple standby nodes, and we perform physical replication between them. If the master node goes down, one of the standbys can be promoted to become the new master, and the other standbys follow it. Now, the thing here is that writes can only be performed on the master node, and we can only scale by routing reads to the standbys. For multi-site deployments, writes still need to go to the master, which is time consuming: if there are multiple sites and writes have to happen at different sites, say a standby is at a different site, the writes still have to travel to the master, which is expensive given the network latency involved. Then, on master node failure, there is always a time lag to promote the standby, because we need to apply the residual WAL so the standby catches up to the latest state before it can start accepting write requests. There is also a well-known phenomenon called split brain, where two masters end up operating at the same time: the old master which went down sometimes comes back up, and now we have two masters in the system. This can lead to data loss, and human intervention is required to recover. So this is the basic form of high availability solution: your data will not be lost, but write scaling is quite limited, and there is a fair amount of delay while promoting the standby.

Then we have another kind of solution, known as a shared-storage master-master architecture. Here writes can be performed on all the nodes, unlike in the previous topology, and all modifications are done on a single shared storage. If one of the nodes goes down, writes can still continue on the other nodes, so there is almost zero or minimal failover time. But because all the data is shared, if writes happen on the same resource, like a table or a page, a lot of communication is required between the nodes to make that write happen. So such a solution provides high availability, but it can limit scalability for such workloads. Oracle RAC is a typical example of such a system. Then comes another kind of multi-master, where we have separate storage for each of the nodes and writes can be performed on all the nodes.
Modifications are performed in separate storage and are replicated to the other nodes. As you can see in this diagram, wherever the data is written, either on node one or on node two, it is replicated to the other node. And even if one of the nodes goes down, writes can still continue on the other node. This has zero or minimal failover time, and multi-site writes can take slightly more time because the data needs to be replicated to the other node.

Now, in such solutions, conflicts can often occur when modifying the same database object or resource. For example, sometimes we try to delete a row, and that row is already deleted on the node where we replicate the delete. Such things we call conflicts; I will talk a little bit more about conflicts later in my slides. For such solutions, we need to see how to resolve the conflicts. Normally, conflicts can be resolved either automatically based on predefined rules, or the user can choose to resolve them manually. For example, if there is a primary key violation, the user can manually delete the conflicting row from the node. Here split brain won't be a problem, because from the very beginning all the nodes act as masters. And unlike the master-standby setup, both reads and writes can be distributed among the various nodes. This is the biggest win over the first high availability solution I talked about, where we have a master and standbys: here both reads and writes can be distributed.

Some other typical use cases for such a system or topology are: sending incremental changes in a single database, or a subset of a database, to subscribers as they occur; consolidating multiple databases into a single one; replicating between different major versions of PostgreSQL; and giving different groups of users access to the replicated data.

Now I will talk briefly about how the logical replication solution has evolved in PostgreSQL through the various releases. The founding features for logical replication started to mature in 9.4, where we added logical decoding, which is the backbone of this whole logical replication solution. This is the facility that allows the raw WAL data to be decoded into a form that can be sent to another server, essentially in a SQL-like form. To support it, replica identity and replication slots were also introduced in PostgreSQL 9.4. Then through each release, from 9.5 to 14, we kept adding features to make the solution more mature. In 9.5 we added commit timestamp tracking and replication progress tracking; these help with conflict resolution and let replication solutions track how far replication has progressed. Then, I think, the major feature: in PostgreSQL 10 we added in-core logical replication. In 11 we allowed replicating TRUNCATE and reduced memory usage. In 12 we allowed replication slots to be copied, and in 13 we added the logical_decoding_work_mem parameter; before 13, the memory used by logical decoding could grow very large depending on the size of the changes, but with 13 we can control that. In 13 we also allowed replicating partitioned tables, so an entire partition hierarchy can be replicated just via the root table.
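To give a feel for what the logical decoding facility exposes at the SQL level, here is a minimal sketch using the built-in test_decoding output plugin; the slot and table names are just placeholders for illustration.

    -- Requires wal_level = logical; create a slot with the test_decoding plugin
    SELECT pg_create_logical_replication_slot('demo_slot', 'test_decoding');

    -- Make a change so there is something to decode (assumes some table t1 exists)
    INSERT INTO t1 VALUES (1, 'hello');

    -- Read the decoded changes from the slot in a human-readable, SQL-like form
    SELECT * FROM pg_logical_slot_get_changes('demo_slot', NULL, NULL);

    -- Drop the slot so it no longer holds back WAL removal
    SELECT pg_drop_replication_slot('demo_slot');

The slot is what remembers how far a consumer has read, which is also why WAL has to be retained on its behalf.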
Continuing the evolution, we also added a new parameter, max_slot_wal_keep_size, which allows us to limit the WAL that needs to be retained for the slots. Then PostgreSQL 14 brought in quite a few features to further enhance the solution: streaming of in-progress transactions, monitoring of logical decoding of transactions, improved initial table synchronization, decoding of prepared transactions, improved performance when there are a lot of DDLs in the system, and the ability to transfer data in binary mode. We also added some syntax to make it easier for users to set up logical replication. So there are various features that made logical replication much bigger with PostgreSQL 14.

Now I will talk briefly about how logical replication works in PostgreSQL, and about some of the core improvements we have lately introduced in PostgreSQL 14. This diagram shows a two-node system and how the logical replication happens. See the boxes I have labeled as a primary database and a standby database. Actually, writes can happen on any node, but here I have shown writes happening on the primary database and then flowing to the standby database. What happens is that as operations occur on the primary database, a background process called the walsender keeps reading the WAL from the WAL files. Via the output plugin, it decodes the data and sends it over the network to the secondary database. On the standby database there is an apply worker which receives each change and performs it as a normal SQL operation: insert, update, delete, or truncate. The data then goes to the data files and the WAL, and users can read it, or even write to it, on that system. I have shown a one-way system, but the same thing could be built in both directions. So this is the basic mechanism of how logical replication works. Note that by default it works at commit time: each transaction is decoded only at commit time and then sent to the standby database, which then applies the entire transaction.

As I said on the previous slide, in the basic or default system all transactions are decoded at commit time. For very large transactions this leads to a very big apply lag, because until all the WAL for the transaction has been decoded, we don't send anything to the standby. But with PostgreSQL 14, we can start streaming the data for in-progress transactions as well. Some of the benefits: previously, the decoder would accumulate the data of a large transaction on disk and only at commit time decide whether to send it, which wastes a lot of resources, especially I/O, on the publisher side. Now, because we can send the data at an earlier point in time, it never has to accumulate on disk. So this feature saves a lot of disk I/O for such transactions, and we have seen performance improve by two or more times because of it.
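To make the mechanism concrete, here is a minimal setup sketch, including opting in to the PostgreSQL 14 streaming behavior; the table, publication, and subscription names and the connection string are placeholders.

    -- On the publisher (requires wal_level = logical in postgresql.conf):
    CREATE TABLE t1 (id int PRIMARY KEY, val text);
    CREATE PUBLICATION my_pub FOR TABLE t1;

    -- On the subscriber (the table definition must already exist there):
    CREATE TABLE t1 (id int PRIMARY KEY, val text);
    CREATE SUBSCRIPTION my_sub
        CONNECTION 'host=publisher-host dbname=postgres'
        PUBLICATION my_pub
        WITH (streaming = on);  -- PostgreSQL 14: stream large in-progress transactions

    -- The publisher-side threshold at which a large in-progress transaction
    -- is streamed (or, without streaming, spilled to disk) is the
    -- logical_decoding_work_mem parameter (default 64MB).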
Now, when to start streaming? We can't stream each change as it comes, because that would take a lot of network bandwidth. So we send the changes when the decoded size of a transaction exceeds the logical_decoding_work_mem parameter, with which the user can configure after how much memory a transaction should get either spilled to disk or streamed to the subscriber.

Then the big thing is that, because we can decide to start streaming a transaction at some arbitrary point within it, we need the association between a top-level transaction and its subtransactions up front. Otherwise we would know that association only at commit time, and we could never gather all the data that needs to be sent. So we introduced a new mechanism by which a transaction and all its subtransactions get associated: the first time any subtransaction writes WAL, we record the association in the WAL. PostgreSQL also has a mechanism for sending invalidations; we now write invalidations into the WAL at each command, so that decoding can use this information. This is mainly helpful for processing DDL; earlier this too was done at commit time, but now we need it at each command so that we can stream the data as soon as we want.

So this is the basic idea. As you can see on the left side, instead of waiting until commit, we split the transaction into multiple streams. For each stream we send a start-stream and a stop-stream message, so that the standby knows one stream of a transaction has arrived, and it spills that data to a temporary file. Multiple streams of a transaction can come in, and there can be other transactions interleaved between those streams as well, so it can get quite complex. But the end goal is that all the transactions get applied in commit order. This way we allow the data to be sent to the subscriber earlier and reduce the apply lag. Note that even at this stage we don't start applying until commit time; what we have saved is that we didn't wait until commit to send all the data. A natural extension would be to start applying the data as it arrives, via some background worker, which we are planning to do in future releases.

To support the streaming of in-progress transactions at the implementation level, we also extended the logical replication protocol to identify in-progress transactions, and we modified the apply worker on the subscriber side so that it can spill the streamed data to temporary files. By default, streaming does not happen for in-progress transactions; users have to create the subscription with the streaming = on option, after which the publisher side can stream the data for in-progress transactions. So this is one of the major features of PostgreSQL 14 which I have explained.

Next we'll talk about PostgreSQL 15 and beyond. One thing to note here: this is based on what I can see being proposed in the community at this stage, and any feature being done for PostgreSQL 15 can be changed or reverted until it is released, based on community decisions. So, a nice feature which will hopefully come in PostgreSQL 15 is logical replication of prepared transactions. This means we can now send a transaction and all its changes at prepare time, instead of waiting until it gets committed. First of all, this reduces the apply lag, and it also enables a two-phase commit protocol between the nodes of a multi-master solution. I will come to the advantages of this.
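As a quick sketch of how this is expected to be exposed, the subscriber opts in with the two_phase subscription option; the names and connection string here are placeholders.

    -- On the subscriber: opt in to two-phase mode, so prepared transactions
    -- are applied (and prepared) at PREPARE time rather than at commit
    CREATE SUBSCRIPTION my_sub
        CONNECTION 'host=publisher-host dbname=postgres'
        PUBLICATION my_pub
        WITH (two_phase = true);

    -- Note: the subscriber must allow prepared transactions, i.e.
    -- max_prepared_transactions > 0 in postgresql.conf.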
You can see from this diagram that at prepare time we apply all the changes on the subscriber side, and at commit time we just commit it. So this removes a lot of apply lag; before PostgreSQL 15, all the data of such transactions got applied only at commit time. And mind you, as I have shown in the diagram, between the prepare and the commit there can be a lot of other transactions as well. So this will allow us to support two-phase commit via logical replication, and it reduces the apply lag, as I told you. The other big point is that this kind of feature can help us build conflict-free logical replication: at prepare time we already know whether there is a conflict, and if there is, we can roll back that transaction even on the master. This will be slightly costly, but for some financial institutions it could be quite useful.

The next set of features is again about enhancing the logical replication solution so that users can replicate data more easily. The first is allowing replication of all the tables in a schema. Before this feature, if you wanted to replicate all the tables of a schema, you had to specify every table manually. With this, you only need to specify ALL TABLES IN SCHEMA with the schema name, and similarly for sequences. So this allows publishing a whole schema, which helps users replicate all the tables in that particular schema instead of specifying them individually; the sequence part also reduces the manual intervention where users previously had to keep sequences in sync themselves.

Then another interesting feature we have developed for PostgreSQL 15 is row filtering for logical replication. You can see in the syntax that the user can specify a WHERE clause to filter rows on the publisher side. This lets us send only selective data to the standby, which reduces network bandwidth and improves performance; we have seen the performance improvement be proportional to the number of rows filtered out. Another big advantage is that the user can create multiple publications and then shard the data across various nodes, so that different parts of the data can be replicated to different nodes. This is one of the major logical replication features of PostgreSQL 15; it had been in progress for the last four years, and we were able to complete it in this release. I hope it helps a lot of users.

Then there is another feature being developed for PostgreSQL 15, named column lists. With this, users are allowed to specify the columns which they want to replicate, rather than replicating the entire row. Restricting the columns is useful when the target table doesn't have the same columns as the source table, and sometimes columns in the source table contain sensitive information which users might not want to replicate. Such a feature will be really useful for those use cases.
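A minimal sketch of these publication features; the schema, table, and column names are placeholders, and note that in released PostgreSQL 15 the schema clause is spelled TABLES IN SCHEMA.

    -- Publish every table in a schema instead of listing them one by one
    CREATE PUBLICATION pub_sales FOR TABLES IN SCHEMA sales;

    -- Row filter: only rows matching the WHERE expression are replicated;
    -- for UPDATE/DELETE the expression may only use replica identity columns
    CREATE PUBLICATION pub_eu FOR TABLE orders WHERE (region = 'EU');

    -- Column list: replicate only the named columns of the table
    CREATE PUBLICATION pub_users FOR TABLE users (id, name);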
Now we come to an interesting topic which many users care about, which we call conflict resolution. As I have told you, conflicts can happen in various ways, like a primary key violation, or what we normally call an update or delete conflict, where the row that we try to update on the standby database doesn't exist anymore or has already been deleted. All such things are referred to as conflicts.

Currently, users can resolve a conflict manually by removing the conflicting data, or by skipping the transaction via pg_replication_origin_advance, which is a function that allows you to advance the origin past the conflicting transaction's WAL and start applying from the next transaction. With PostgreSQL 15 we will additionally allow skipping the conflicting transaction by directly specifying its LSN in a SQL command, which I will explain in the next few slides. The big thing is that all the detailed error information is available via the server logs, which helps users identify the conflicting transaction and skip it in case they are not able to resolve the conflict manually.

Another feature: currently, whenever an error happens, replication stops, but the apply worker keeps restarting and retrying until the conflict is resolved. For the cases where the conflict can't be resolved automatically, we have provided a feature to automatically disable the subscription when an error occurs.

We have also introduced a new system view, pg_stat_subscription_stats. This shows statistics about errors that occurred while applying logical replication changes, as well as during initial table synchronization. These stats are removed when the subscription is dropped, and users can use the pg_stat_reset_subscription_stats function to reset the error information for a single subscription or for all subscriptions. Later we can extend this view to track other transaction-related statistics, like the number of transactions committed or aborted for a particular subscription. The information available in this view includes the apply error count and the sync error count for each subscription; basically, it tells users how many errors have occurred in the system, either during apply or during table synchronization.

Now I'll try to explain briefly how users can skip a transaction with this new feature, ALTER SUBSCRIPTION ... SKIP. With it, users can supply the LSN value of the conflicting transaction and skip that transaction. Normally a conflict produces the kind of error you can see on my screen in the third bullet point, like a duplicate key violation, and then we give detailed information about the replication origin, the transaction, and the LSN at which the transaction finished. Users can fetch this information and use this LSN with the ALTER SUBSCRIPTION ... SKIP command, and the apply worker will skip all the data modification changes within that transaction. This should help users proceed when manual conflict resolution is difficult or tricky. Hopefully we'll see this feature in PostgreSQL 15.
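A sketch of these conflict-handling commands; the subscription name is a placeholder, and the LSN value is whatever the server log reports as the finish position of the failed transaction.

    -- Automatically disable the subscription when the apply worker errors
    ALTER SUBSCRIPTION my_sub SET (disable_on_error = true);

    -- Check the accumulated error counts for each subscription
    SELECT subname, apply_error_count, sync_error_count
    FROM pg_stat_subscription_stats;

    -- Skip the whole conflicting transaction; the LSN is the "finished at"
    -- position reported in the server log for the failed transaction
    ALTER SUBSCRIPTION my_sub SKIP (lsn = '0/14C0378');

    -- Re-enable the subscription so the apply worker resumes
    ALTER SUBSCRIPTION my_sub ENABLE;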
Then I will talk a little bit about some other miscellaneous features. We are also working on allowing logical replication from a physical standby. As of now, if the primary database goes down, subscribers cannot connect to the physical standby and continue the replication. With this new feature, subscribers will be able to connect to a physical standby for an existing publication and get the data from there. It will also provide a way to continue logical replication after the standby is promoted to master. This work is not going to get done in PostgreSQL 15; we will probably see it in future releases. But we know it is very important work that will help a lot of users.

Until now, before PostgreSQL 15, there was also a restriction that only superusers were allowed to apply replication changes. With this change, non-superusers can also apply changes on the subscriber, provided they have permissions on the required objects. This will ease some of the requirements for logical replication, and I hope users will like it.

Then there are some other big features which will make logical replication much more lovable in PostgreSQL. They are not under active discussion for 15, but some of this work has gotten traction lately, and I hope we will see some or all of these in future releases. First, DDL replication: this is one of the most demanded features, especially for major version upgrades, where currently users need to set up the schema manually on both nodes; DDL replication will make that happen automatically. Then, automatic conflict detection and resolution handlers: this is a really big, multi-year project, which we hope to see someday in PostgreSQL. We also want to do parallel apply: for streaming transactions we currently just write to a temporary file and then apply everything at commit time, but we could start applying in a background worker as the changes arrive; and for normal transactions too we could start applying changes as they come in, rather than waiting for the commit, and do it with multiple workers. This should help improve the speed of replication. Then, we don't yet have bi-directional replication of the same table, which limits the use of logical replication in PostgreSQL, because users can't set up true bi-directional replication for a table written on both nodes. We have started working on this, and I hope to see it in a future release.

That is all about the features of logical replication and how they evolved in PostgreSQL. Next I will talk about some of the multi-master topologies that could be built on top of what PostgreSQL provides. This is the first topology, where one can set up N-way bi-directional logical replication, which means there will be N copies of the data and any table can be read from or written to on any of the data nodes. As you can see from this diagram, the data comes in through a coordinator node which is used for load balancing, and it routes the writes to different nodes depending on some configuration. For example, writes to table 1 and table 2 are routed to data node 1, and data node 1 replicates tables 1 and 2 to data node 2 to allow scaling; similarly, writes for tables 3 and 4 are routed to data node 2, and they are replicated to data node 1, so that reads and writes can be performed from any node. For multi-site deployments, the data can be written to the nearest node. So this is one topology which could be built on top of the PostgreSQL solution; a sketch follows below. What I have shown here is not provided by PostgreSQL out of the box, but users or application developers can write a thin layer on top of PostgreSQL to route the reads and writes and get the benefit of bi-directional logical replication; it also helps in scaling reads and writes. One drawback of this system is that there will be N copies of the data, which requires a massive amount of storage and a massive amount of network transfer.
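Here is a simplified sketch of the table-routing idea with two data nodes: each node publishes the tables it owns and subscribes to the other's. All names and connection strings are placeholders, and the coordinator or routing layer itself is outside PostgreSQL.

    -- On data node 1: owns writes for table1 and table2
    CREATE PUBLICATION pub_node1 FOR TABLE table1, table2;

    -- On data node 2: owns writes for table3 and table4
    CREATE PUBLICATION pub_node2 FOR TABLE table3, table4;

    -- Then cross-subscribe (publications must exist before the subscriptions).
    -- On data node 1:
    CREATE SUBSCRIPTION sub_from_node2
        CONNECTION 'host=node2 dbname=appdb'
        PUBLICATION pub_node2;

    -- On data node 2:
    CREATE SUBSCRIPTION sub_from_node1
        CONNECTION 'host=node1 dbname=appdb'
        PUBLICATION pub_node1;

    -- The routing layer sends writes for table1/table2 to node 1 and
    -- writes for table3/table4 to node 2; reads can go to either node.

Because each table is only ever written on one node, this layout sidesteps the current lack of bi-directional replication of the same table.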
Then there is another way to build a multi-master solution on top of this. In this topology we have one master database which holds all the data, and the other data nodes hold specific sets of tables. This reduces some of the drawbacks of the first solution, where the data has to be replicated across all the nodes and the storage on every node has to hold all the data. Here only the master node has all the data, and the other nodes are for specific tables. You can easily see that if users know their data topology and queries well, they can shard the data in such a way that reads and writes scale at the individual node level, with a fallback: queries which need data from all the nodes can go to the master database and get everything there. In the diagram, for tables 1 and 2 the queries only need to go to data node 1, and from data node 1 the data gets replicated to the master node and to data node 3. For tables 3 and 4, reads and writes go to data node 2, and the data gets replicated to node 3. Now, for a join of table 2 and table 3, the query needs to be routed to data node 3, because it has the data from both tables.

Let me try to summarize: each of these solutions has its own advantages and disadvantages. In the first solution the number of data copies equals the number of nodes, whereas in the second the total data is distributed across the nodes, plus just one extra copy for supporting joins. A single-table read can happen from any node in the N-copy design, whereas in the distributed solution the coordinator has to identify the data node and route the query. Similarly, multi-table reads can happen from any node in the first solution, whereas in the second the query has to be routed by the coordinator to the special data node we call the master node, from where all the data can be fetched. Writes can go to any node in the first, whereas in the second writes have to go to the particular data node owning those tables. In the first, the data has to be synced to N nodes, whereas in the second solution data only needs to be synced to one node, which is one of its plus points. The first solution can be managed with a very lightweight load-balancing coordinator, whereas in the second the coordinator needs to manage the data slicing across the cluster. A key point is that both allow reading slightly stale data, since replication takes some time, unless the user chooses synchronous replication, in which case there won't be any lag and the data will be visible in the same order. Both solutions also allow conflicts to occur, and I have explained how conflicts can be resolved. As for scaling up, if we have to add more nodes, neither solution requires downtime; if we have to scale down, the first solution probably doesn't need any downtime, while for the second it depends on the implementation, but it will probably require downtime. So this is the broad-level comparison.

And that's all from my side for today's presentation. I'm happy to hear your feedback; please send it to me at amit.kapila16@gmail.com, or ask your questions after this presentation. Thank you very much.