Hey everyone, can people hear me all right? Is this set up all right? Cool. So my name is Tess and I'm an engineer at a San Francisco startup called Chain, which builds blockchain infrastructure. And if you don't know what that is, don't worry, I'll get into it in just a minute. Today I'm going to talk about a system that I wrote called Iago. Iago is our high availability storage system that's built on top of Postgres and etcd, and it really is the storage backbone of every system that we build.

Okay, so here's what I'm going to cover this afternoon. First, as promised, I'll give a brief introduction to blockchains and describe the network topology of our systems, just to give you a sense of the use case we're solving for. Next, I'll touch on Postgres replication, introduce etcd and Raft, and then talk about Iago itself and the algorithm it uses, before discussing how the system behaves in the event of various failures. Then, because Iago is a work in progress, I'll talk a little bit about some of the changes that are coming to it. And finally, just for fun, I want to tell you why we gave it this kind of goofy name.

So first, what is a blockchain? I know some of you are probably already familiar with blockchains, but I bet at least one person out there is sitting there like this, which is certainly how I felt when I started learning about them. So I do want to give a quick overview. I know there was also a fair amount of blockchain talk at the regulated industry summit yesterday, so if that's you, just close your ears for a couple of minutes. We'll get back to the database stuff in just a minute.

A blockchain is a ledger where transactions are grouped into blocks, and the blocks are chained together via hashes of all the data in the previous blocks. So in this simplified diagram, we have block one, which is full of transactions. The block that comes after it, block two, contains a bunch of its own transactions in addition to the hash of the data in the previous block. And as more transactions happen across the network, they get grouped into a new block, block three, which also contains the hash of block two. That block then gets appended to the ledger. So you end up with this long chain of blocks, right? Each block points to its previous block by including that block's hash, and this chain of blocks is the ledger. That ledger is then distributed across every node in the network.
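Just to make that hash-chaining idea concrete, here's a minimal Go sketch. The types, fields, and transaction strings are purely illustrative; this is not Chain's actual block format.

```go
// Minimal, illustrative hash chaining: each block commits to the previous
// block by including its hash, so rewriting history invalidates every
// later block.
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

type Block struct {
	Height       int
	Transactions []string // stand-in for real transaction data
	PrevHash     string   // hash of the previous block
}

// Hash covers the block's height, its link to the previous block, and all
// of its transactions.
func (b Block) Hash() string {
	h := sha256.New()
	fmt.Fprintf(h, "%d|%s", b.Height, b.PrevHash)
	for _, tx := range b.Transactions {
		fmt.Fprintf(h, "|%s", tx)
	}
	return hex.EncodeToString(h.Sum(nil))
}

func main() {
	b1 := Block{Height: 1, Transactions: []string{"alice->bob: 5"}}
	b2 := Block{Height: 2, Transactions: []string{"bob->carol: 3"}, PrevHash: b1.Hash()}
	b3 := Block{Height: 3, Transactions: []string{"carol->dan: 1"}, PrevHash: b2.Hash()}
	fmt.Println(b3.PrevHash == b2.Hash()) // true: the chain links via hashes
}
```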
In many blockchains, including Bitcoin, the network comes to consensus about what the next block should be through an algorithm called proof of work, which involves every node in the system racing to find the hash of the current block. If you've ever heard of Bitcoin mining, that's what's going on there. There are some really cool things going on with proof of work, but unfortunately that's out of scope for this talk.

By the way, the word blockchain is a little bit like the word cloud, in that it's a buzzword and a little nebulous. If you ask five different people what a blockchain is, you'll likely get five different answers. The way I see it, you could pick maybe six attributes, and any system that fulfills four or five of those attributes could reasonably be called a blockchain. If I had to take a stab at what those attributes are, I'd suggest the following six things.

First, a replicated ledger. In other words, does this system use an append-only log? And is that log replicated across multiple servers?

The second attribute is a cryptographic commitment history, used to protect the integrity of that ledger's history. This, of course, is what's happening when blocks include the hash of the previous block.

Third, transactions that are authenticated by public key crypto. When I send you money, do I have to authenticate myself by signing something with my private key? This idea is pretty central to Bitcoin and most other blockchain implementations, and again, unfortunately, not something I really have time to get into detail with today.

Fourth, accounting rules. Are there accounting rules about the way that transactions must balance inputs and outputs? For example, both the Bitcoin protocol and the Chain protocol, which I work on, use a UTXO model, which stands for unspent transaction outputs, to balance their transactions. But something like Ethereum, if you've ever heard of that, doesn't.

The fifth attribute is shared access. Is there shared access, either publicly or perhaps across several organizations? Generally, within a single organization there's enough internal trust for things like a cryptographic commitment history to be more trouble than they're worth. Although this is certainly one of the more contentious attributes, I would say.

Finally, proof of work, like I mentioned. Do nodes in the system come to consensus through proof of work? Again, that's the algorithm that relies on mining. Bitcoin uses it because it's resistant to malicious attacks from within the network, so it can truly be decentralized, but it's also quite expensive. This value goes up and down, but processing a single Bitcoin transaction takes roughly the same amount of energy as powering an American household for a day. So it's a very expensive consensus algorithm to run.

And again, not every blockchain has all six of these attributes; many only have four or five. The Chain protocol in particular has the first five, but not the sixth.

So at this point, I've said the phrase Chain protocol a couple of times, and you may be saying, Chain protocol, what's that? Let me clarify. Chain has created a blockchain protocol, which is very imaginatively called the Chain protocol, and an implementation of that protocol, which is called Chain Core. We've partnered with a number of financial institutions, including these four, and Chain Core powers their blockchain products. Meanwhile, we've also open sourced what we call Chain Core Developer Edition, which is a complete implementation of our protocol and contains all the blockchain logic, all that good stuff, so that developers can experiment, get to know it, and decide how they like it for themselves. We didn't open source the code necessary to safely run any of this in production, which is most of our security and scalability features, including Iago itself. So although most of what we do at Chain is open source, today I'm talking about a corner of our system that's actually closed.

If I had to succinctly describe exactly what we're building, I'd say that Chain builds infrastructure for permissioned multi-asset blockchains. When I say permissioned, I mean that the networks are closed: you need permission to join them. And when I say multi-asset, I mean that unlike in Bitcoin, there's no native currency; assets are issued onto the network instead of mined. And like I said earlier, we don't use proof of work.
Instead, we use a consensus model called federated consensus. Federated consensus is an algorithm that takes advantage of this permissioned blockchain model, and in this model there are three types of nodes. A single node is the generator. It's responsible for receiving transactions from the network, filtering out transactions that are invalid, and periodically aggregating those transactions into blocks. It then sends these blocks to the signer nodes. The signer nodes, in turn, are responsible for validating all of the transactions in a block and ensuring that there isn't already another block that's been accepted at the same height. That's the consensus problem, right? If the block passes validation, the signer will add its own signature before sending it back to the generator. The generator will publish the block to the network once it's received M of N signatures, where M and N are two numbers specified by a network-level consensus program. Note also that the generator itself can be a signer node as well. All of the rest of the nodes in the system are simply what we call participants. They can verify and validate blocks and transactions, but their approval isn't required for blocks to be published and accepted by other nodes. So, to go back to this blockchain image, you might visualize it like this: we have a single generator, a couple of signers, and the remaining nodes are simply participants.
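To make that flow a bit more concrete, here's a very rough Go sketch of the generator's M-of-N step. Every name and type here is illustrative, a sketch of the flow just described rather than Chain's actual code.

```go
package main

import "fmt"

// Block is a stand-in for a proposed block awaiting signatures.
type Block struct {
	Height     int
	Signatures [][]byte
}

// signFunc stands in for a signer node: it validates the block and either
// returns a signature or an error (for example, it already signed a
// different block at this height, or the block failed validation).
type signFunc func(b *Block) ([]byte, error)

// publish collects signatures from the signers and reports whether the
// M-of-N threshold set by the network's consensus program was reached.
func publish(b *Block, signers []signFunc, m int) bool {
	for _, sign := range signers {
		sig, err := sign(b)
		if err != nil {
			continue // a signer may refuse an invalid or conflicting block
		}
		b.Signatures = append(b.Signatures, sig)
	}
	return len(b.Signatures) >= m
}

func main() {
	ok := func(b *Block) ([]byte, error) { return []byte("sig"), nil }
	down := func(b *Block) ([]byte, error) { return nil, fmt.Errorf("offline") }

	b := &Block{Height: 42}
	// A 3-of-4 network with one signer unavailable can still publish.
	fmt.Println(publish(b, []signFunc{ok, ok, down, ok}, 3)) // true
}
```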
Federated consensus is pretty safe. It guarantees that we won't have conflicting blocks, that is, it guarantees that the nodes will be able to come to consensus, assuming that no more than 2M minus N minus 1 signers violate the protocol. So for a three-of-four network, two signers would have to misbehave before the protocol fails. Note that signers can misbehave in a few ways, either by validating and signing an invalid block or by refusing to sign a valid block. And the protocol does support consensus improvements, so we could implement a more paranoid consensus algorithm if we wanted to.

So federated consensus is more efficient than proof of work, but it does rely on the assertion that the generator and at least M of those signer nodes stay online. They must be robust and highly available. And this is really different from a system like Bitcoin, where there are many nodes that can create a block; Bitcoin has no availability requirements for any particular participant. In a system like Bitcoin, each node is a single server. So in that context, when we talk about distributed systems, when we talk about fault tolerance or consensus, we're looking at networks at this level. And so far today, I've been talking about consensus at this level. But I'm about to drill down and take you into the internals of a single one of these nodes, because when Chain networks are running in production, each node contains its own little distributed system for maximum fault tolerance.

So this is the anatomy of a single node. A single node can contain several servers, each running cored, which is the core daemon that runs all the blockchain logic. When I talk about Chain Core Developer Edition, that's largely what it is: it's largely cored. And off to the side, on the far right, we have signerd, which talks to an HSM, or hardware security module, which securely generates private keys, stores them, and uses them to sign blocks and transactions and things like that.

Now, down towards the bottom of this diagram is really the meat and potatoes of what I'm talking about today. Each cored server has a fleet of replicated Postgres instances. In other words, within each cored server there's a set of three or more Postgres instances that all have the same data. So let's now talk about that storage layer itself and how we use Postgres.

I know I don't need to pitch Postgres to this crowd, but I do want to note that we chose Postgres as our database because it's reliable and well established, because lots of engineers are familiar with it, because there's a great community, and also because it has really great replication features. As many of you probably know, Postgres transactions can be replicated between a primary and a secondary database both synchronously and asynchronously. In synchronous replication, the primary won't commit transactions until the secondary has registered them as well. In asynchronous replication, the primary will commit transactions as soon as it receives them. By default, Postgres replicates asynchronously. This is more performant, but there's a higher risk of data loss: the primary could commit a transaction and then crash before the secondary commits that transaction.

For this data layer, we want to combine those two replication modes. We want cored to write to a primary, which will synchronously replicate to a secondary, which in turn will asynchronously replicate to a tertiary. We can call these instances A, B, and C. We could also add an instance D, which would asynchronously replicate from C, and an E, which would asynchronously replicate from D, and so on and so forth. We can have as many asyncs as we like. Now, if our primary A dies, we want cored to start writing to the sync rep B, which becomes the primary. C moves up to become the new sync rep, and D, E, and so on continue to asynchronously replicate from their upstreams. I will generally only show three instances at a time, because you can sort of imagine the asyncs. But if there are fewer than three live instances, this is considered a degraded state, and an operator has to intervene. It's still a safe state, we're not going to lose data this way, but if we only have two instances, we won't be able to recover if another peer goes down.

By the way, this is all controlled by this config file, which has a name that, as I understand it, is a little bit controversial to pronounce, so I'm not going to say it aloud. But anyway, here's a very abbreviated version of that config file, with just the replication settings highlighted. Streaming replication works by shipping and applying write-ahead log records from one peer to another. So the very first thing we tweak is the WAL level. This setting determines just how much information is written to the WAL. By default, Postgres writes just enough data to the WAL to recover from a crash. But that isn't sufficient for streaming replication, so we set wal_level to replica, which logs the extra information necessary to replicate, as well as to run read-only queries on a standby server, which is actually not a thing that we take advantage of in this system, but it's kind of a nice thing to have.
In the same vein, we also need to set up WAL archiving, which we can do by setting archive_mode to on and then choosing an archive_command, which is platform-specific. This is a really simple command that we chose for our Unix systems; it copies all of the archivable WAL segments to an archive folder inside of Postgres's data directory.

Next, we set a couple of things that are specific to this database's current role. So this config file will be different depending on whether this particular database is the primary, the sync rep, or one of the async reps. For the primary, we set synchronous_standby_names, which is a list of servers that can act as sync reps. This database won't consider transactions committed until it's received an acknowledgement from one of the databases listed here. On the other hand, if this list is empty, the primary will commit transactions before they're acknowledged by anyone, so it's really important to set this value to something. The simplest option is just to use an asterisk, which allows this database to accept acknowledgements from any other replica.

Now, for the sync and async reps, we toggle the setting called hot_standby, which is ignored completely on the primary. By the way, at some point I got curious about the etymology of this phrase, and it turns out that "hot standby" comes from steam engines. Back in the day, steam train operators would sometimes keep a spare engine running in a station, hot, ready to go, because starting a new steam engine could take a really long time. And here we're basically doing the same thing. This setting tells the sync and the async reps that they need to remain queryable and ready to go throughout any kind of recovery process that might happen when they get reconfigured to a different role.

Okay, back to this config file. Lastly, just for the async reps, we turn off synchronous_commit. This ensures that async reps can acknowledge transactions before they've written WAL segments to disk. In general this creates a risk of transaction loss, so the default setting is on, but in the case of an async rep only, we do want to turn it off for the performance improvement.
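Pulling those settings together, here is a rough reconstruction of the kind of abbreviated config being described. These are not the actual file contents, just the settings mentioned above with illustrative values.

```
# Rough reconstruction of the replication-related settings described above,
# not the actual file. The role annotations indicate where each one matters.

wal_level = replica                   # log enough WAL for streaming replication
archive_mode = on
archive_command = 'cp %p archive/%f'  # copy WAL segments into an archive folder
                                      # under the data directory (Unix example)
max_wal_senders = 5                   # not discussed in the talk, but streaming
                                      # replication needs this to be nonzero

synchronous_standby_names = '*'       # primary: accept acks from any standby;
                                      # if left empty, commits never wait!

hot_standby = on                      # sync/async reps: stay queryable during
                                      # recovery; ignored on the primary

synchronous_commit = off              # async reps only: ack before WAL segments
                                      # are written to disk
```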
Okay, so with all of that config out of the way, I'd like to take a step back and talk about Iago itself and the algorithm it runs to coordinate this fleet of Postgres instances. Iago uses an algorithm from Joyent's Manatee project. And yes, that is an ASCII manatee; I took him from the Manatee project's readme. Actually, I think he's also at the top of every single file in that project, so that's a strong commitment to the ASCII manatee. Manatee is a high availability Postgres project that uses ZooKeeper, and it's written in Node. My team really liked the algorithm, but we were less excited about operating ZooKeeper or managing a service written in Node, since we don't use those technologies for anything else. So I re-implemented the algorithm in Go, swapped out ZooKeeper for etcd, and the resulting system is called Iago. If you're not familiar with etcd, it's a distributed key-value store that uses Raft as its consensus algorithm. Raft is meant to be a more understandable version of Paxos, and like Paxos, it isn't Byzantine fault tolerant. That is, it's not resistant to malicious attacks that come from inside the network. Of course, that would be a big problem in something like a Bitcoin network, where anyone can add a node. But in this particular case, Raft is totally safe, because it's just being used inside of a single node, inside of a permissioned blockchain.

Each Iago peer contains three things: it runs a Postgres process, an etcd process, and what we call the sitter daemon, which owns the Postgres process and is responsible for knowing what this peer's role is and for promoting this peer if necessary, so from async to sync or sync to primary. In other words, the sitter is the daemon that's responsible for running that algorithm from Manatee that I mentioned.

So let's dig into that algorithm a little bit. When a new peer comes online, the first thing its sitter daemon does is register itself with etcd. It creates an ephemeral representation of itself with a configurable timeout, which right now I think is about four seconds, and it refreshes that representation halfway through each timeout. The key thing here is that if this node goes offline, its ephemeral representation will also disappear. So with that in mind, each node's ephemeral representation is watched by another node's sitter daemon. The primary's sitter monitors the sync rep and all of the async reps, and the sync rep monitors the primary. So everyone is being watched by someone else, right? And if any of these nodes go offline, their ephemeral representations will disappear from etcd, and the monitoring sitter will take action to restore the system.

Whenever an Iago peer makes a change to the system, it does so by modifying cluster state. The cluster state is stored in etcd, so every participant agrees on what the values are. It can only be updated by the primary, or by the sync rep if it's about to become the primary. So if the sync rep notices that the primary has gone offline, it's allowed to modify cluster state, but other than that, it's really just the primary. And every peer in the system checks the cluster state in order to know what its role is at that moment.

So this is what the cluster state includes. It includes unique identifiers for the primary, the sync rep, and all of the async reps. It also includes a number for the generation, which uniquely identifies this particular cluster state. If a peer changes anything in the cluster state, it must increment the generation to prevent collisions. The one exception is the list of async reps: that can change without changing the generation. But other than that, the generation always gets incremented when any change is made to cluster state. It also includes the position of the write-ahead log, recorded at the time this generation was declared. Keeping track of the WAL position can be used to prevent data loss during certain edge-case events, and I'll get into that in a little bit.
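As a sketch of what that might look like in code: the field names, key path, and value below are made up, and this uses the stock etcd v2 keys API directly rather than Iago's real client code, but it shows the two ideas just described, a cluster state record and an ephemeral, TTL'd registration that disappears if the peer dies.

```go
package main

import (
	"context"
	"log"
	"time"

	"github.com/coreos/etcd/client"
)

// ClusterState is the record every peer consults to learn its role.
// Field names here are illustrative, not Iago's actual schema.
type ClusterState struct {
	Generation uint64   // incremented on every change (except async-list edits)
	Primary    string   // unique ID of the primary
	Sync       string   // unique ID of the sync rep
	Asyncs     []string // unique IDs of the async reps, in replication order
	InitWAL    string   // primary's WAL position when this generation was declared
}

// registerEphemeral keeps a TTL'd key alive for this peer; if the peer
// dies, the key expires and the watching sitter takes action.
func registerEphemeral(kapi client.KeysAPI, id string) {
	const ttl = 4 * time.Second // the talk mentions roughly a four-second timeout
	key := "/iago/ephemeral/" + id
	for {
		_, err := kapi.Set(context.Background(), key, "alive",
			&client.SetOptions{TTL: ttl})
		if err != nil {
			log.Printf("registering %s: %v", key, err)
		}
		time.Sleep(ttl / 2) // refresh halfway through each timeout
	}
}

func main() {
	c, err := client.New(client.Config{
		Endpoints: []string{"http://127.0.0.1:2379"},
		Transport: client.DefaultTransport,
	})
	if err != nil {
		log.Fatal(err)
	}
	go registerEphemeral(client.NewKeysAPI(c), "db-a")
	select {} // block forever; a real sitter would also run its monitoring loop
}
```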
So to dig into this, let's look at some of the code that gets executed when a peer is running as a sync rep. When the sync rep runs its monitoring loop, it's checking for three things. I've highlighted these, but can you see the highlights, or is it totally blown out? Okay, I'll point, how's that? First, it's checking that the primary is still online. If it notices that the primary's ephemeral representation has expired, it will go ahead and promote itself to primary. Second, it's checking for a change in generation. If the generation has changed, that means the primary, right, it can only be the primary, made a change to cluster state, and so this node needs to check its role and make sure it's still the sync rep. Finally, in this monitoring loop, the sync rep is running a health check on its own Postgres process. If this peer's Postgres process fails, it will take itself offline, signaling to the primary that it needs to promote a new sync rep.

Before we dive deeper into internals and failure cases and things like that, I do just want to take a step back and point out how Iago looks from the perspective of cored. Again, that's basically our application server, or business logic server, running all of that blockchain logic. When cored goes to query this data layer, it doesn't have to step through some complex web of Iago logic. In fact, it doesn't touch the sitter program at all. On startup, it simply asks etcd for the IP address of the primary Postgres process, and then it queries that process directly. If cored ever can't reach the process, it means that Postgres process has gone offline, and cored will query etcd again to get the latest primary. So as far as our application server is concerned, this is actually a really simple interface.

Okay, so let's get back into the guts of this. I'd like to illustrate how this works by showing you what happens when something goes wrong. I just remade some animations for this yesterday, so if I lose track of them, just bear with me. Let's first look at this scenario, beginning with a very basic cluster state with four databases: primary A, sync rep B, async reps C and D, generation one. Here's basically the state of the system as far as etcd is concerned. The first four slots are all part of cluster state, and the last one holds those ephemeral representations that show who's online and who's not.

Okay, so what happens if the primary goes offline? I'm so happy I finally found a good use for that particular Keynote animation. First, the primary's ephemeral representation will expire; you can see that with the skull emoji up there. The sync rep will notice this, and so it will go to declare itself primary. The first thing it does is promote its Postgres process. So it will, whoa, did something just change? Oh gosh, okay, cool, I'll just keep going. That's a little spooky. Okay, so where was I? The sync rep has just promoted its Postgres process, so it has reconfigured itself to receive writes from a client instead of replicating from a primary. Next, it will rewrite cluster state, since it's allowed to do that at this point. So now the cluster state shows that B is the primary, C is the sync rep, and the generation counter has increased. At this point, cored can't connect to A anymore, right? It's dead. So cored will query etcd to find the new primary and connect to B instead. Meanwhile, C will notice that the generation has changed and check cluster state to see its new role. It will notice that it's now the sync rep, and it will reconfigure itself accordingly to act as the sync rep for B. So we end up with the state that we were hoping to get if we had to fail over.
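From cored's point of view, by the way, the lookup-and-reconnect behavior described a moment ago boils down to something like this sketch. The etcd key path and the connection string details are made up; the driver is lib/pq, which comes up in the Q&A later.

```go
package main

import (
	"context"
	"database/sql"
	"fmt"
	"log"

	"github.com/coreos/etcd/client"
	_ "github.com/lib/pq" // Postgres driver
)

// primaryAddr reads the current primary's address out of etcd.
// The key path here is illustrative.
func primaryAddr(kapi client.KeysAPI) (string, error) {
	resp, err := kapi.Get(context.Background(), "/iago/primary", nil)
	if err != nil {
		return "", err
	}
	return resp.Node.Value, nil
}

// connect asks etcd who the primary is and opens a connection to it.
func connect(kapi client.KeysAPI) (*sql.DB, error) {
	addr, err := primaryAddr(kapi)
	if err != nil {
		return nil, err
	}
	dsn := fmt.Sprintf("postgres://cored@%s/core?sslmode=disable", addr)
	return sql.Open("postgres", dsn)
}

func main() {
	c, err := client.New(client.Config{
		Endpoints: []string{"http://127.0.0.1:2379"},
		Transport: client.DefaultTransport,
	})
	if err != nil {
		log.Fatal(err)
	}
	kapi := client.NewKeysAPI(c)

	db, err := connect(kapi)
	if err != nil {
		log.Fatal(err)
	}
	if err := db.Ping(); err != nil {
		// The primary is unreachable: it has probably failed over, so ask
		// etcd again and reconnect to whoever is primary now.
		if db, err = connect(kapi); err != nil {
			log.Fatal(err)
		}
	}
	_ = db // hand the connection to the rest of the application
}
```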
Now let's look at another scenario. What happens if the sync rep goes offline briefly and then returns? That could cause some confusion, right? So as before, once the sync rep goes offline, its ephemeral node will disappear, and the primary will notice this and rewrite cluster state in order to promote a new sync rep. Note that the sync rep, async reps, and generation fields have now all changed. So now A will start replicating to C, and C will reconfigure itself to be the sync rep. But now let's say B comes back online. Yeah, more animations. This is what you have to do when you have an entirely black and white deck, right? Anyway, B will notice that the generation has changed, and therefore it's not allowed to rewrite cluster state, even if for some reason it thought it needed to declare itself primary or something like that. So instead, it will just recreate its ephemeral node, and when the primary notices, it can bring B back online as an async. So once again, we end up with a totally safe state.

Excuse me. I'm sorry, this is really awkward. Wow, I do not know what just happened to my voice; this is very embarrassing. Hang on, I'm going to get a cough drop and maybe that will solve the problem. Thank you for bearing with me.

So finally, I do want to look at a particularly hairy failure case. Let's imagine that the primary receives a transaction from the client and forwards it to the sync rep. The sync rep commits the transaction and acknowledges it to the primary, and then the primary ACKs it to the client. Meanwhile, the connection between the sync rep B and the first async gets disrupted, so the sync rep fails to send this transaction to the async C. In other words, we're now in a place where the client believes the transaction has been accepted, but none of the async reps have committed it. Now let's say that the primary A, as it does, goes offline. This triggers a new generation, with the old sync rep B as the new primary and the old first async C as the new sync rep. I'm going to be a little less explicit about walking through every single step of a new generation being declared this time, but this new generation gets declared, and this is where we're at. Now let's say that this new primary B goes offline as well, in short order. This leaves us in an uncomfortable position. What's happened to that transaction? The client thinks it's been committed, but the database, that is, database C, which is supposed to become primary at this point, doesn't actually have it.

Fortunately, we do have something in place to catch this very issue. Remember, I mentioned that we store a WAL position in cluster state. This value represents the position of the WAL on the primary at the beginning of this generation. So whenever a sync rep is about to promote itself to primary, it first has to check its own WAL position against the WAL position stored in cluster state. That node can only promote itself if its WAL position is ahead of the one stored in cluster state. So if a potential new primary is behind the cluster, it cannot become the primary, and in that case Iago will stop and an operator has to intervene. Obviously we'd rather not get to that state, but it's better to halt the system than to lose transactions.

Okay, so with that in mind, let's take another look at that hairy failure case, and this time we'll keep track of the WAL position, both for the cluster and for all of our instances. So again, basically the same starting state, and let's say that at the start of this generation, the WAL position stored in cluster state is N.
So now, as before, the primary receives that transaction and forwards it to the sync rep. The sync rep ACKs it, and both databases commit it. Notably, both A's and B's WAL positions now increase to N+1. And when the async rep C fails to receive that transaction, its WAL position remains at N. Then, as before, the primary A ACKs the transaction to the client before going offline. So now, when the sync rep B tries to declare a new generation in which it's the primary, it first checks that its WAL position is ahead of the WAL position recorded in cluster state. Since the sync rep has a WAL position of N+1 and the cluster state has recorded a position of just N, this is safe to do. So database B declares that new generation, and this is what it looks like. The important thing to note, of course, is that in this new generation, the WAL position in cluster state is set to N+1. So now, when the primary B goes offline, C will try to become the primary. But when it checks its WAL against the position stored in cluster state, it will realize its WAL is behind: database C has a WAL position of N, while the cluster state has recorded a position of N+1. So at this point, the system stops and an operator has to intervene. Again, not the most pleasant solution, but it's the best way to handle this particular failure case, which is pretty rare anyway.

So here's some abbreviated code. Excuse me, I guess the appropriate analogy here is that you don't want to give a talk with a cough drop in your mouth, but it's better than not being able to say anything at all, so we're rolling with it. Okay. Here's some abbreviated code that the sync rep runs to check whether it's actually allowed to promote itself or not. I've removed a lot of the error checking and things like that; as any of you who are familiar with Go know, including the error checking would literally quadruple the length of the code, so I have not included it here. But otherwise, this is pretty much what gets run. First, we get the WAL location on this instance. The code says xlog because that's the name of the directory that all the WAL files live in, and I'll show the code for how we do that in just a second. Next, we pull the WAL position stored in the current cluster state in etcd; this slightly weird variable, kapi, is the keys API used to access etcd. Finally, we compare these values, and if this peer has a WAL position that hasn't caught up to the position stored in cluster state, we log an error and return false. That is, this peer cannot promote itself to primary.

Meanwhile, here's how we actually query that WAL position from Postgres. When the sitter daemon needs to talk to the Postgres process, it usually uses psql, and this is no exception. So here you can see we're using psql, along with the usual bevy of options used to specify a given database. We actually use two different Postgres queries here: the primary uses pg_current_xlog_location, and the other nodes, the sync and asyncs, use pg_last_xlog_replay_location. You can see those two queries above being passed into the piece that actually queries Postgres. Next, we parse the actual WAL position out of the response, since psql returns sort of a messy thing like this, with hex-encoded values and a column heading, and we need to parse all of that out before we can hand something useful back to the sitter.
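The code on those slides isn't open source, but based on the description above, its shape is roughly the following sketch: shell out to psql for the position, parse the hex pair, and refuse to promote if this peer hasn't caught up to cluster state. The function names, flags, and connection details here are mine, not Chain's.

```go
package main

import (
	"fmt"
	"os/exec"
	"strconv"
	"strings"
)

// walQuery is what a standby would run; the primary would use
// pg_current_xlog_location() instead (Postgres 9.x function names).
const walQuery = "SELECT pg_last_xlog_replay_location()"

// localWAL asks the local Postgres for its WAL position via psql.
// -t -A strips the column heading and alignment so only "X/Y" remains.
func localWAL(port string) (uint64, error) {
	out, err := exec.Command("psql", "-h", "127.0.0.1", "-p", port,
		"-U", "postgres", "-d", "core", "-t", "-A", "-c", walQuery).Output()
	if err != nil {
		return 0, err
	}
	return parseWAL(strings.TrimSpace(string(out)))
}

// parseWAL turns a position like "0/3000060" into a comparable integer.
func parseWAL(s string) (uint64, error) {
	parts := strings.Split(s, "/")
	if len(parts) != 2 {
		return 0, fmt.Errorf("bad wal position %q", s)
	}
	hi, err := strconv.ParseUint(parts[0], 16, 32)
	if err != nil {
		return 0, err
	}
	lo, err := strconv.ParseUint(parts[1], 16, 32)
	if err != nil {
		return 0, err
	}
	return hi<<32 | lo, nil
}

// canPromote is the guard described above: a sync rep may only declare a
// new generation if its WAL has caught up to the cluster state's position.
func canPromote(mine, cluster uint64) bool {
	return mine >= cluster
}

func main() {
	mine, _ := parseWAL("0/3000060")
	cluster, _ := parseWAL("0/3000000")
	fmt.Println(canPromote(mine, cluster)) // true: this peer is ahead
}
```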
Okay, so having done that deep dive, I want to talk a little bit about the work that's ongoing in this system. I started working on Iago about a year ago, then I stopped working on it for a while, and then I started working on it again recently. And that's really due to one big thing that happened in the past year: we started doing production deployments into our partners' data centers. What we discovered very quickly is that for on-premise deployments, operational complexity is much more expensive than code complexity. That is, the operational costs of installing and maintaining dependencies are actually larger than the costs of developing and debugging new code. So we set a new goal of basically minimizing the number of Unix processes that need to be run at once.

This, again, is what the relationship between cored and Iago looks like currently. We have a cluster of servers running cored, talking to an Iago fleet, where Iago is running these three processes: etcd, the sitter, and Postgres. If we consider our goal of reducing the number of different processes, there's really only one candidate for removal. We can't remove cored, since that's literally the bread and butter of everything we do. We can't remove Postgres, or at least we can't remove it easily. We can't remove the sitter, but we can remove etcd. We don't need to be running a bunch of independent etcd processes; we just need to be able to query some highly available, highly consistent service. So we took the raft package from etcd, which is also written in Go, and we moved it into cored. Remember, Raft is the consensus algorithm that etcd uses in order to maintain consistency, and the folks at CoreOS, who make etcd, have written a really fantastic Raft implementation. People sometimes do use it without the bells and whistles of etcd. So the idea in this case is that all the cored servers can just run a consensus round amongst themselves, and the Iago sitter can query back to cored to get things like cluster state. So our application server now has routes that do normal application-logic kinds of things, like creating assets or building transactions (when I say transactions in that case, I mean financial transactions), but it has a whole set of Raft-related routes as well: a route for new nodes to join the Raft cluster, and a route for nodes to communicate with one another, to communicate state changes.

This work isn't done yet, and there certainly is tooling from etcd that we'll miss and will probably need to minimally replicate. But the prospect of not having to maintain an extra dependency inside of someone else's data center pretty much makes this worth it. We also get one more nice benefit from this, which is that we're now able to put other data into this Raft storage. We looked for data that needs to be consistent but doesn't get written often, and it turns out that the configuration state for each cored node, stuff like, is this node a generator? is it a signer?, is a really prime candidate for Raft storage. This work is really, really ongoing. I have PRs open for it now, this is one of them, and it's all public, so if you're curious, you can see how that's going.
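For a sense of what driving the raft package directly involves, here's the general shape of the loop that package's own documentation describes. This is a single-node sketch of the approach, not Chain's actual integration into cored.

```go
package main

import (
	"time"

	"github.com/coreos/etcd/raft"
)

func main() {
	storage := raft.NewMemoryStorage()
	cfg := &raft.Config{
		ID:              1,
		ElectionTick:    10,
		HeartbeatTick:   1,
		Storage:         storage,
		MaxSizePerMsg:   4096,
		MaxInflightMsgs: 256,
	}
	// Single-node cluster for the sketch; a real deployment lists every peer.
	n := raft.StartNode(cfg, []raft.Peer{{ID: 1}})

	ticker := time.NewTicker(100 * time.Millisecond)
	defer ticker.Stop()
	for {
		select {
		case <-ticker.C:
			n.Tick() // drive election and heartbeat timers
		case rd := <-n.Ready():
			// Persist entries and hard state, send rd.Messages to peers
			// (this is where cored's Raft HTTP routes would come in), and
			// apply rd.CommittedEntries to the state machine, e.g. cluster
			// state or node configuration.
			storage.Append(rd.Entries)
			n.Advance()
		}
	}
}
```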
Okay, so finally, as promised, I want to share why I named this project Iago. Obviously, the most important consideration when you have a project written in Go is to include the word Go in the name somewhere, right? But I also liked Iago because it's short, and because Iago is a Shakespearean character, a fellow who goes around creating chaos by lying, telling different lies to every different character in the play. So this name is kind of aspirationally ironic: Shakespeare's Iago tells everyone different things and creates chaos, and our Iago coordinates every node in the system by telling them all the truth. Okay, that's all I've got. Does anyone have any questions?

Sorry? That's true. Again, aspirationally ironic; it could. Right now, we pretty much only run three cored processes at a time, so that would be a real consideration. But right now, we do like three or five.

We use, I mean, what is it, lib/pq, I think? I think it's the one a bunch of Heroku folks made. My boss is sitting here nodding: it's lib/pq.

Yeah, the code that I showed occasionally might have used the etcd client package to communicate with etcd in those code samples, but all the code I showed was stuff I wrote.

Yeah, I mean, etcd really only gets written to when there's actually a change in the Postgres state. So it really only gets written if one of the Postgres instances goes offline, which hopefully does not happen often enough to stun etcd into a stupor. No, it's only if there's an event. To be clear, the WAL position gets written at the beginning of a generation, not every single time there's a transaction. That probably would. Well, I haven't stress tested etcd, but I can imagine that wouldn't be great.

Yeah, I mean, basically it's so we can have an extra level of security or safety if something goes down. You don't need two asyncs; you can totally have just three database instances total. The nice thing about having more than one async is that it gives you an extra buffer if one of them goes down. Again, if you have three nodes and one goes down, you end up with two nodes, which is safe, but you can't recover if another node goes down. So if you're in a two-node position, someone is getting paged. And if you start with four nodes, you can have one go down and you get a little bit more buffer before there's a crisis.

Sorry? How does that replication work, actually, A to C, A to D? It goes A to B, B to C, C to D. So it's a chain of, we're really into chains, it's a chain of instances.

Right now, the archive just lives in a directory inside of Postgres's data directory. Is the whole archive for A on A and not on B? Yes, I think. I'm not sure, actually. So A has access to it to catch B up, or B to catch C up. But if B's gone, C can't catch up at all? No, if B's gone, C can catch up to A. Yeah, I'm sorry, I haven't touched this in a little while, so I don't know exactly.

Sorry, any what mechanisms? Yeah, we haven't looked at that. Generally, if something goes wrong, the node will just take itself offline. That's the current expectation.

Yeah, you can access them; you could psql into any of those instances. But the normal mode of operation is just that cored writes to the instances. Right, we don't have that.

Do you use a replication slot? I don't know, actually.

Yes. Okay, thank you. About this particular implementation, the short answer is no.
The Manatee project itself has really amazing and thorough documentation, which is really what allowed me to do this in the first place. But yeah, this is the most information I've ever shared with anyone about this project. So maybe in the future, but not yet.

Right. The honest answer is that that's what Manatee does, and I copied their work pretty faithfully. There are a few things they've done that I have notes on, like, investigate why we do this instead of that, and this is definitely something I will add to that list. Yeah, no, actually, I'd really enjoy that.

Yeah? When you look at etcd, is it all the same data center, or are we talking multi-data-center? At this point, it's all inside the same data center. It really depends on what our partners want. We provide some level of operational solutions to them, but all of them operate their own data centers and have opinions on what the right way to do things is. So we really work with them to develop whatever answer works for us and works for them. And this is really just a small piece of the operational puzzle.

No, sure. At this point, it's just assuming etcd is up. As that Raft piece moves into cored, something will have to be online, since cored is basically querying itself at that point. So yeah, that is not a state that we check for right now.