Thank you for coming. I'll be talking about how to build your own little transaction across multiple servers using just Python and Postgres.

Let me introduce myself. My background is cheminformatics; I worked on software for drug discovery. Then I co-founded a machine learning startup. After that I worked for Twisto, a fintech startup. I was there from when it was about a 20-person company until it was sold, and I helped it scale up technically across multiple countries. Now I work for a company called Monitora. We do media monitoring. It's again a small company, operating mostly in the Czech Republic and Slovakia, and we are working on building a worldwide product. So that's what I mostly do: I help companies scale up from a small startup to a major company on the technological level. And with my wife, I also organize courses, mostly for people who want to learn programming but who do not want to become programmers.

So what do we do at Monitora? We scan the internet. We watch for news: printed news, online news, social networks, and so on. So we can answer the question "what do they write about me?" if you are a big company, a politician, whoever. We can give you the top news from today, and we can run analyses on the articles. On the back end, this means we need to do a lot of crawling. We do OCR and a lot of interesting tech; the deduplication of articles across the web is an interesting problem on its own. We do a lot of machine learning to recognize what an article is about and to summarize it, and we search large amounts of text. What is interesting about this company is that we are a small team: at the moment, six people on the back end. And we want to remain a small team as we build a worldwide product. Currently we are looking for one DevOps person, if anyone is interested.

So what scale are we talking about? At the moment we cover mostly the Czech Republic and Slovakia, and partially Germany and other countries, and we have more than six terabytes in a Postgres database. We are aiming for 50 or 100 times more, to be able to scale it to the whole world. For this we use a distributed database: we want to scale horizontally, because one big server would be possible but a bit unwieldy. We chose Citus. It's a Postgres extension, and you can put only some tables into distributed mode, so you have a clear migration path from your existing Postgres database. But some of the more complex joins between tables that live on different servers may be limited. And I like to understand how the technology I use works at a deeper level. So here, we will be doing something similar to what Citus does, but implementing it ourselves.

So let's build a toy example: a piggy bank. We will have some users with their accounts, each account has a balance, and we want to be able to safely transfer money between the accounts. But because we are a piggy bank, we do not lend money and we do not allow overdrafts. So we want to be able to move money around without losing any. And we will be making it distributed across multiple servers.

First, we need to decide how we will distribute the data itself. For this, we can use the sharding approach and simply create the same table structure on every worker server. So we have two servers here, and each of them has a table of accounts with their balances, but each server stores only half of the data.
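As a minimal sketch (the connection strings and column types here are my assumptions, not anything from the slides), creating the same table on every worker might look like this; the CHECK constraint is our "no overdrafts" rule, enforced locally on each shard:

```python
# A minimal sketch, assuming psycopg2 and made-up worker DSNs.
# The same schema is created on every worker server.
import psycopg2

WORKER_DSNS = [
    "dbname=piggybank host=worker1",  # assumed connection strings
    "dbname=piggybank host=worker2",
]

DDL = """
CREATE TABLE accounts (
    name    text PRIMARY KEY,
    balance numeric NOT NULL CHECK (balance >= 0)
);
"""

for dsn in WORKER_DSNS:
    with psycopg2.connect(dsn) as conn, conn.cursor() as cur:
        cur.execute(DDL)  # "with conn" commits the transaction on success
```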
And how do we decide which server should store which row? We simply take a hash of the account name and apply some operation that deterministically decides where to put it.

In Python, we do it this way. We define which shards we have and which workers they should be placed on. We define a helper function called worker_for_account: it takes the name of an account and returns the ID of the worker, or the ID of the shard. And there is another helper function, simply called sql, which takes the ID of a worker and SQL commands to execute on that worker; it will come in handy later. I'm not showing the implementation of that second function here; there will be a link to the code at the end if you want to play with it.

First, we want to put some money into an account: we want to create an account for a person. That's easy. We just need to know which worker manages that account and route the INSERT query to the right worker. We know the hash function, so we know which worker it is, and then it's a matter of a single server. Now we want an operation that spans multiple servers. For example, we want to know how much money people have at our bank in total, i.e. how much money we might be forced to pay out if they all come and ask for it. Again, that's fairly easy: we list all the workers we have, go across them, ask each for its sum of balances, and report the total.
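Put together, it might look roughly like this. worker_for_account and sql are the names from the talk; their implementations here are my assumptions (the talk deliberately elides the sql helper):

```python
# A sketch of the routing helpers described above.
import hashlib
import psycopg2

WORKERS = {0: "dbname=piggybank host=worker1",   # assumed connection strings
           1: "dbname=piggybank host=worker2"}

_conns = {}

def worker_for_account(name: str) -> int:
    # hashlib is deterministic across processes, unlike the built-in hash().
    digest = hashlib.sha256(name.encode()).digest()
    return int.from_bytes(digest, "big") % len(WORKERS)

def sql(worker_id: int, query: str, params=()):
    # One persistent connection per worker, so consecutive statements
    # (e.g. BEGIN ... PREPARE TRANSACTION later on) share a session.
    if worker_id not in _conns:
        conn = psycopg2.connect(WORKERS[worker_id])
        conn.autocommit = True  # transactions are controlled explicitly in SQL
        _conns[worker_id] = conn
    cur = _conns[worker_id].cursor()
    cur.execute(query, params)
    return cur.fetchall() if cur.description else None

def create_account(name: str, balance: int) -> None:
    # Routed to exactly one worker: a single-server operation.
    sql(worker_for_account(name),
        "INSERT INTO accounts (name, balance) VALUES (%s, %s)", (name, balance))

def total_balance():
    # Fans out to every worker and adds up the per-shard sums.
    return sum(sql(w, "SELECT coalesce(sum(balance), 0) FROM accounts")[0][0]
               for w in WORKERS)
```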
OK, so that's fine. Case closed, we can go home: we have our big bank across multiple servers. But what if one of the servers fails? I will be talking mostly about what happens when something fails, because that's why we do transactions. So one of the workers went kaboom, and we can no longer connect to it. That's bad: we lost half of our data, because we have two workers. How do we deal with that in Postgres? We can use replication, a standard tool for the Postgres database. We replicate data from one server to another, and we do it independently for each of the shard workers. So when one of the servers fails, we have some mechanism to route the traffic to the other one. It can be a proxy, a virtual IP that moves, or a DNS name that lets us say: OK, this is worker one, and worker one now lives on a different server. There are standard tools for Postgres that do this. We use Patroni, but if you use Kubernetes, there is the Crunchy Postgres operator, and Patroni also has Kubernetes support. This is something you'll probably want to do anyway, because with multiple servers there is a higher probability that at least one of them will fail. But it's a problem that can be managed.

A short diversion: the CAP theorem. Raise your hand if you have never heard of CAP. OK. It's a theorem which says that in the case of a network partition, when some of the servers cannot reach the others (maybe the network failed somewhere, maybe they are just too busy), you have to choose between either consistency of your data or availability. Because if you cannot reach the other part of your cluster, either you stop and wait until you can reach it, since it may have newer data than the part you can reach; or you say: oh well, we know some, maybe even all, of the data, so we can serve it from the part of the cluster we can reach.

There are a lot of trade-offs, and a lot of places where you can decide whether you want to lean more towards the consistency side or more towards the availability side. For example, when we do replication, it can be either synchronous or asynchronous. If we make the replicas synchronous, we have to wait until all the replicas have confirmed that they have the data we are writing. That means that if a replica fails, we need to wait until we know it either failed for good or recovered, and during that time we cannot continue writing. So we are leaning towards the consistency side. Or we can make the replicas asynchronous, and we are more available. If you are interested in this topic of distributed systems and these trade-offs, there is Jepsen.io, a company by Kyle Kingsbury, a person who is very knowledgeable and has very accessible talks and blog posts on the subject.

OK, so we made a replica. Now we can add an assumption or two to our piggy bank. In order to build the distributed transactions, we will assume that a node, a server, doesn't lose the data it has committed: once you receive confirmation of a database commit, that data will never be lost. And if a node fails, it will eventually recover, maybe from a replica, but it will recover. We will rely on these assumptions.

So this is our function to transfer money from one account to another. The accounts may be on different servers. We find the worker for the destination account and add the money being transferred; we find the worker for the source account and subtract the same amount. It's fine, right? What could go wrong? Nothing... unless a server fails in between. Then we have added money to one account, and the other account never got debited: we get an exception in our Python code, but the first change is already written. So we do nothing about it and simply lose money. The same could happen when someone tries to transfer more than their balance: the database constraint saying the balance must always be non-negative would make the second update fail, and again we would lose money.

So how can we do it better? In plain Postgres, with just one server, you would wrap those two commands in BEGIN and COMMIT to make them a transaction: either it all happens, or none of it does. And this is what we want to do across servers. Luckily for us, Postgres has a feature that supports exactly this use case: PREPARE TRANSACTION. We prepare the transaction on all the servers, and we want to know that it will succeed: each server should guarantee us that it has the data, that it will not lose it, that the transaction doesn't violate anything, and that it will safely commit. Once we have this confirmation from all the workers that are part of the transaction, we commit them all, again sequentially.

So our function works like this: for each of the servers involved, we create a transaction, execute the SQL relevant to that particular server, and append this transaction to a list of transactions to commit in the second step. If we didn't get any exception, we go over the list and commit everything. If something went wrong, we roll back the transactions we managed to prepare.
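A rough sketch of that two-phase transfer, assuming the persistent-connection sql() helper from before and that max_prepared_transactions is enabled on the workers (it defaults to 0 in Postgres):

```python
# A hedged sketch of two-phase commit with PREPARE TRANSACTION /
# COMMIT PREPARED / ROLLBACK PREPARED. It glosses over the case where
# src and dst land on the same worker.
import uuid

def transfer(src: str, dst: str, amount: int) -> None:
    txid = str(uuid.uuid4())  # one global id shared by every participant
    prepared = []             # workers whose transaction is safely prepared
    steps = [
        (worker_for_account(dst),
         "UPDATE accounts SET balance = balance + %s WHERE name = %s", (amount, dst)),
        (worker_for_account(src),
         "UPDATE accounts SET balance = balance - %s WHERE name = %s", (amount, src)),
    ]
    try:
        # Phase 1: do the work everywhere, but only PREPARE, don't commit.
        for worker, query, params in steps:
            sql(worker, "BEGIN")
            sql(worker, query, params)
            sql(worker, f"PREPARE TRANSACTION '{txid}'")  # durable, still holds locks
            prepared.append(worker)
        # Phase 2: every worker has promised success, so commit them all.
        for worker in prepared:
            sql(worker, f"COMMIT PREPARED '{txid}'")
    except Exception:
        # Undo whatever was prepared before the failure.
        for worker in prepared:
            sql(worker, f"ROLLBACK PREPARED '{txid}'")
        raise
```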
It's fine, right? What could go wrong? Yeah, something can still go wrong. What if we fail? Remember, we have a list of transactions in memory, and the computer where we run this Python code can fail at any time: there may be a bug in our code, the machine may catch fire. These are four discrete steps, so there are three places in between where we can fail. And because the database guarantees us that the prepared transaction will safely commit, it still holds the locks. For example, once we have changed a balance and called PREPARE TRANSACTION on one server, that account cannot be modified by any other transaction at all. So if we fail after the first prepare, that account on that server will be locked forever, and nobody will be able to touch it until an admin comes and unlocks it for us.

So what do we do? How do we fix this problem? Well, we, the coordinator managing the other servers, have to be a database as well. We have to remember what we did, and at each of these steps we have to be able to recover and continue. So we extend the assumption that the nodes don't lose committed data and don't fail forever to include the coordinator node, including us. Even though we are not storing any piggy-bank data ourselves, we need to store some data regarding the distributed transactions.
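In practice that means persisting the coordinator's own state before each step; a sketch, where coord_sql() and the txn_log table are names I am making up for illustration, not the demo's actual schema:

```python
# Sketch of a durable coordinator: write the intended state to the
# coordinator's own Postgres before each phase, so a crash leaves a
# record of how far we got. coord_sql() is an assumed helper that runs
# SQL against the coordinator's local database.

def transfer_durably(src: str, dst: str, amount: int) -> None:
    txid = str(uuid.uuid4())
    workers = sorted({worker_for_account(src), worker_for_account(dst)})
    # Log the intent first; a committed log row survives a coordinator crash.
    coord_sql("INSERT INTO txn_log (txid, workers, state) "
              "VALUES (%s, %s, 'preparing')", (txid, list(workers)))
    ...  # phase 1: BEGIN / UPDATE / PREPARE TRANSACTION on each worker, as before
    # The decision point: from here on, recovery must commit, never roll back.
    coord_sql("UPDATE txn_log SET state = 'committing' WHERE txid = %s", (txid,))
    ...  # phase 2: COMMIT PREPARED on each worker, as before
    coord_sql("UPDATE txn_log SET state = 'done' WHERE txid = %s", (txid,))
```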
The first consequence is something I have already touched on: these transactions may take longer than you are used to. From the moment you make a change on one of the servers until the moment all the servers have done their work and you commit the prepared transactions, all the locks held by those transactions stay held, and you have to wait for them. And if the coordinator does not serialize access, or do something clever with regard to the transactions, the changes on the different servers may not become visible atomically. As you saw, when we commit the prepared transactions one by one, there is no way to do them all at one moment. We can guarantee they will either all be applied or none of them, but we cannot guarantee that someone who goes directly to a worker server and asks for its state will see it synchronized with the other workers. And again, this is where we touch the CAP theorem. We can let clients do this without synchronization, and they might get an inconsistent view of the database; or we can handle it somehow in the coordinator, make everyone go through the coordinator, and serialize the access.

So that is all for the talk. Thank you for listening. There is a link to the example code I was showing; you can play with it. I can do a short demo now, and if you are interested in more, find me in the hallway after the talk, or maybe here in this room, because I think there is no talk after this one, and I can go through it with you in more detail. So I will just show a small demo... oh, actually, are there any questions at this point? I can answer questions now.

OK. So the question is that we are handling sharding at the application level and replication at the database level, so why not delegate sharding to the database level as well? Actually, this is what we do, with Citus. There is this tool, Citus, the extension to Postgres, and you interact with it in plain SQL. It creates tables that look like they are there, but they are just empty shells: it parses the SQL queries and passes them on to the other servers. That's exactly what we were doing here, but somebody had to write it. A database is an application as well. Here I am just making a toy example in Python of how you would do it; what we actually use in production looks like a database, but there is application code inside doing exactly that.

The next question: I'm not sure I understood it, so I'll try to repeat it. How does one shard know about the others, and how do I make a transaction happen at all if the shards don't know about each other? Yes, that's where the coordinator comes into play. The coordinator knows about the shards, and the coordinator is the one who says: OK, this shard, please subtract the money; this shard, please add the money; OK, everything is fine: this shard, commit; this shard, commit.

There was another question: how do you get a consistent view of the database across all the transactions? I'm not sure how exactly this works in Citus. It definitely takes some locks itself, so it isolates some transactions from others; I am not sure at which level exactly. At the strictest level you would have to serialize all the transactions one after another, but you can be more clever and wait only for the transactions that conflict with each other. Or you can just let it be and accept that at some point some of the data might be inconsistent. I am not sure how exactly this works in Citus. But as an engineer developing this kind of system, you have the possibility to make different trade-offs and decide for yourself, depending on whether the application can tolerate it. If you leave it entirely to the developers, there may be a ton of bugs, yes. But if you are architecting a system, you need to know what kind of inconsistency it can tolerate, because consistency at all costs is slow, very slow. For example, in our case with online articles, we do not care that much about consistency between the different articles themselves; we do care about the consistency between an article and its related metadata. So we care about consistency in certain pockets of the database, and if we are consistent within one shard, that is usually OK for most operations. As a media monitoring company, we do not usually need consistency across the shards; for some cases, yes, but they are rare. I would be happy to discuss the details with you afterwards.

OK, any other questions? Yes; and yes, this is exactly how it works. The coordinator remembers what it did and stores it in a database table. If it fails and boots up again, the first thing it does is look at the transaction log, clean up the transactions that should not be there, and commit the transactions that should be committed.
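A rough sketch of that recovery pass, building on the hypothetical txn_log and coord_sql() from before (pg_prepared_xacts is the real Postgres view listing pending prepared transactions on a server):

```python
# Hypothetical recovery pass run at coordinator startup.

def recover() -> None:
    rows = coord_sql("SELECT txid, workers, state FROM txn_log "
                     "WHERE state != 'done'")
    for txid, workers, state in rows:
        for worker in workers:
            # Is a prepared transaction with our global id still pending here?
            pending = sql(worker,
                          "SELECT 1 FROM pg_prepared_xacts WHERE gid = %s", (txid,))
            if not pending:
                continue
            if state == 'committing':
                # We had already decided to commit: finish the job.
                sql(worker, f"COMMIT PREPARED '{txid}'")
            else:
                # We never reached the decision point: safe to roll back.
                sql(worker, f"ROLLBACK PREPARED '{txid}'")
        coord_sql("UPDATE txn_log SET state = 'done' WHERE txid = %s", (txid,))
```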
So the next question is whether you have one access point, one coordinator, in front of all the databases. You have to go through some coordinator, but again, it depends on your trade-offs and how much consistency you need. The coordinator may itself be distributed: if you do not need that much consistency and want higher throughput, you can have a distributed coordinator; if you want more consistency, you go through a single one. This is not about Citus; this is about the problem space in general. If you want more throughput, you can get it with more coordinators. You need to have some place which knows about the shards and is able to coordinate them, but it doesn't have to be a single one if you do not need a serializable point of view onto the database. And in any case, when architecting, think about keeping it as simple as possible first, before making your developers think about how all of this should work.

So the next question is about maintainability. Yes, I agree with you that this approach is not maintainable, and this is not Citus. What I was showing, the Python code, is my demo for you: it's about a 100-line Python script that implements distributed transactions with a two-phase commit. It's on GitHub, and you can play with it. That is not Citus. Citus is something you install into your Postgres, and for the most part the tables behave the same as they were behaving before. It has some limitations: it cannot process some SQL queries, it cannot enforce some of the foreign key constraints, and it cannot, for example, enforce a unique constraint across shards. But for the application developers, it looks like any other SQL table.

The next question is whether we should not take a lock when changing a user's balance. Not really, because when we use this kind of SQL query, with balance = balance + something, the update is atomic, and as soon as you execute it, the row is locked. This is how Postgres behaves.
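To illustrate the difference, a small sketch reusing the helpers from before:

```python
# Sketch contrasting the two approaches from the question.

# Racy read-modify-write: this is what you would need explicit locking for.
#   balance = sql(w, "SELECT balance FROM accounts WHERE name = %s", ("alice",))[0][0]
#   sql(w, "UPDATE accounts SET balance = %s WHERE name = %s", (balance + 100, "alice"))

# Atomic in-database arithmetic: Postgres evaluates balance + 100 itself,
# and the UPDATE leaves the row locked until the transaction ends.
sql(worker_for_account("alice"),
    "UPDATE accounts SET balance = balance + %s WHERE name = %s",
    (100, "alice"))
```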
And that is why we are doing these two phases: first, we try to do the operation on all of the servers, but we do not commit, we just prepare. If you run Postgres at the read committed isolation level, then after just PREPARE TRANSACTION the change is not visible to other users yet; it becomes visible only after COMMIT PREPARED. But COMMIT PREPARED, because you do it across multiple servers, cannot itself happen atomically; there is simply no way. So if you need to maintain a consistent view of the balances across the whole system, you have to serialize it in the coordinator.

As for the coordinator, the question was whether we use an event sourcing approach or something different. You can look at the code: there is no event sourcing, we just log the transaction. It's four Python functions, called directly, and as you go through the steps, you log the state, so you are able to reconstruct it. So you could say it is event sourcing in some loose sense. If you are interested in more, I'll be around for a while, because I do not want to compete with the lunch queue, so I'll be able to show you the demo itself.

OK, so one last very quick question. The question was whether we thought about using Mongo or something that is horizontally scalable by design, as opposed to Postgres, where the horizontal scalability is bolted on. Actually, yes, we thought about using Elasticsearch as the primary store for the data, but there are two problems with that. One is that we already have quite a big application, several years old, written with Postgres in mind. But the bigger problem is that these databases, like Elasticsearch, are not well optimized for relational data. Originally I thought: let's move everything to Elasticsearch, because an article has some tree structure; it's a document, after all. Not really. You have authors, you have news sources, you have a lot of metadata, and they are all highly relational. It actually surprised me how relational the data is: we have about 500 tables, and it's much easier to work with a SQL database. So instead, we evaluated Citus, the extension of Postgres, and two Postgres-compatible solutions which are horizontally scalable from day one: CockroachDB and YugabyteDB. All three are cool. If anyone has any more questions, you can find me outside. So thank you.