Can I just go? So, hi, everyone. My name is Bart. I'm a software engineer working for ScyllaDB, and I'll be talking about what we're doing to introduce stronger consistency guarantees into the database by implementing Raft. Before we get into Raft, how many of you are familiar with ScyllaDB? All right, not that many, so I'll start with a brief introduction.

Scylla is a NoSQL database compatible with Apache Cassandra: we implement the same APIs and we use the same file formats. We are, however, much faster. We have ten times the throughput at much lower latencies, especially at the higher percentiles. The database is also self-tuning; we employ a lot of control theory to make sure we can schedule all the different user workloads and balance them against the internal processes the database runs. We use O_DIRECT and bypass the Linux page cache, so we do our own caching. And the database is written in C++ with a lot of attention paid to the underlying hardware, which we want to fully utilize.

We are a Dynamo-style system, so there are no special nodes. A particular piece of data is replicated across a set of replicas, and the full data set is partitioned across the whole cluster. When a user performs a read or write, they can specify the consistency level, which is how many replicas should acknowledge the operation before it is reported as successful to the client.

The data model is based on tables, and tables can have many columns. There is a primary key with two components. The partition key is the set of columns that determines how the data is distributed across the cluster, so it determines which nodes get to handle which pieces of data. And then there is the clustering key, which defines the order in which data is sorted and stored on disk.

In terms of consistency, when we perform a write, the coordinator node that handles the user request is going to replicate that write to all of the replicas. We can, as I said, specify the consistency level. If we say we want a quorum, a majority of replicas, to acknowledge the write for it to be considered successful, then we can tolerate one failure, say a replica not receiving the write. If we then do a quorum read, and let's say one of the replicas we read from has the data that we wrote previously, then we consider the operation successful and return that piece of data. This gives us some consistency guarantees, namely read-your-own-writes and monotonic reads. During that read, we will also notice that a replica was missing that particular piece of data, and we will propagate it there. We do this in the absence of reads, too, by having a background anti-entropy process called repair, which periodically ensures that all the replicas converge on the same values.

There is a peculiarity here. Say we are doing a quorum write and two replicas fail, so only one replica acknowledges the write. In this case the user receives an error, but one of the replicas did store the new value, and repair will in time propagate that value to all the other replicas in the system. So the user's write failed, but the value made it into the system and will be observable by future reads.
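To make the consistency-level arithmetic concrete, here is a minimal sketch, not Scylla's actual code, of how a coordinator could decide how many acknowledgements it needs; it shows why a quorum write followed by a quorum read is guaranteed to observe the written value with a replication factor of three.

```cpp
#include <cassert>
#include <cstddef>

// Minimal sketch: how many acknowledgements a coordinator needs before it
// can report an operation as successful at a given consistency level.
// This mirrors the rule described in the talk, not ScyllaDB's real code.
enum class consistency_level { one, quorum, all };

std::size_t required_acks(consistency_level cl, std::size_t replication_factor) {
    switch (cl) {
    case consistency_level::one:    return 1;
    case consistency_level::quorum: return replication_factor / 2 + 1; // majority
    case consistency_level::all:    return replication_factor;
    }
    return replication_factor; // unreachable; keeps compilers happy
}

int main() {
    // With a replication factor of 3, a quorum write needs 2 acks,
    // so it tolerates one replica not receiving the write.
    assert(required_acks(consistency_level::quorum, 3) == 2);
    // A quorum read then touches 2 replicas, so it must overlap the
    // quorum write in at least one replica that has the value.
    assert(required_acks(consistency_level::quorum, 3) +
           required_acks(consistency_level::quorum, 3) > 3);
}
```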
Another part of the consistency model is how we deal with concurrent updates. If you're trying to update the same key and the same column to different values, how do we decide which one wins? And more importantly, how do we make the updates commutative, so that all replicas eventually converge on the same values? The technique we employ is the same as Cassandra's: last write wins. We look at the timestamps and pick the value with the highest timestamp. If the timestamps match, we pick whichever value is higher according to the underlying data type. This is really not a great solution, because timestamps don't capture intent, and we are silently discarding one of the values the user is trying to write. The workaround today is to make changes to the data model: model the data as an append-only log, turn all writes into appends, and things like that.

Another particularity of Scylla is that it is implemented on top of the Seastar framework, a C++ framework we wrote ourselves but which is used by a bunch of other projects for high-performance applications. It imposes a thread-per-core design in which there is one thread pinned to a particular CPU, and only one. That means that if you want to fully utilize your CPUs, there can be no blocking, ever; otherwise the CPU will just go idle. Because of that, Seastar provides a set of asynchronous APIs, not only for networking but also for file I/O and multi-core communication. In Seastar, data is confined to a particular core, and when another core wants access to that data, it uses message passing to fetch it. These APIs are exposed through a future/promise interface.

For Scylla, this has the consequence that all of the components of the request path are replicated per CPU. And just as data is partitioned across the cluster, it is also partitioned inside a node: we use the same partition key to decide which CPUs get to handle which subset of the data. We use the middle bits of the partition key, just to avoid aliasing situations when the shard count per node approaches the node count in the cluster.

The reason we want to introduce a stronger consistency model in Scylla is simply that it enables more use cases. If you want uniqueness constraints, say that a particular email address exists only once in the database, it's really cumbersome to do that with just quorum reads and quorum writes. If you want read-modify-write accesses, where you read a value, make some decision, modify the value, and write it back, you can't really do that safely in the presence of concurrent updates with the mechanisms we have right now. Similarly, if you want an update to be all or nothing, so that the value is written on all the replicas or on none of them, instead of the partial writes we saw, we have no mechanism to prevent that right now either. Of course, we want these stronger consistency operations to be opt-in: we don't want you to have to pay for what you don't use.

Cassandra exposes this feature as lightweight transactions, which provide strong consistency per partition, so per key. It is essentially a distributed compare-and-swap: you specify the new value, and you specify what has to be there already for the new write to apply. It's not really a full transactional API, so you end up having to write your programs the way you'd write lock-free algorithms: you make some modifications to some table, and at the end you try to update a pointer to the new data, and by having clients coordinate on this pointer you get the strong consistency properties.
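Here is a rough single-process analogue of that pattern, a sketch with an invented document type: writers prepare new data off to the side and then publish it with a compare-and-swap on a shared pointer, which is the role the LWT condition plays at the database level.

```cpp
#include <atomic>
#include <string>

// Single-process analogue of the "coordinate on a pointer" pattern described
// above: prepare the new data aside, then publish it with a compare-and-swap.
// In the database, the LWT condition ("apply only if the current value is X")
// plays the role of the compare_exchange below. Illustrative only.
struct document { std::string body; };

std::atomic<document*> latest{nullptr};

// Returns true if our version was published, false if another writer won.
bool publish(document* prepared) {
    document* expected = latest.load();
    // Writers that read the same 'expected' race here; exactly one wins,
    // the others observe the newer pointer and must retry with fresh data.
    return latest.compare_exchange_strong(expected, prepared);
}
```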
LWT is a data path operation, so we want it to be as high-performance and as available as possible. And finally, because of the way the feature works, it does require an internal read-before-write: we fetch the current value so we can match it against the preconditions the user specified.

To implement strong consistency, we typically resort to consensus algorithms. Consensus is the process by which a set of replicas reaches agreement on a particular value. Consensus protocols provide consistency guarantees about the underlying data, and they are leveraged to implement what's called replicated state machines, where a set of replicas apply the same operations in the same order and work together as a single coherent unit. We can tolerate some non-Byzantine failures: with two times F plus one nodes, we tolerate F failures. A run of the consensus algorithm is typically called a round, and it advances the underlying state. There is a set of guarantees that make these protocols worthwhile. Stability: if a replica ever decides a value, then that value stays decided forever. Agreement: no two replicas decide different values. Validity: if we decide on a value, it's because a replica actually proposed that value; it didn't come out of thin air. And, very importantly, termination: eventually all replicas do reach a decision.

So how do we go about choosing an algorithm? There are two very popular ones nowadays, Paxos and Raft, but there are others. First of all, we want an algorithm that's understandable. There was a very famous paper out of Google detailing their experience implementing Paxos for Chubby, and the gist of it was that Paxos was a very under-specified algorithm and they had to search all of the literature to come up with a Frankenstein of a solution. We also don't want the algorithm to be too cutting edge; we want there to be some real-world usage of it, some experience and validation. Another thing to look at is the overhead of having all those replicas coordinate to decide on a value, which is usually measured in round trips until a decision is reached. Another is general performance: how amenable is the algorithm to typical optimization techniques? And an important trade-off is whether there is a strong leader or whether any replica can decide a value. If the latter, then we require at least two round trips to decide on a value; this is classical Paxos, or the newer algorithm CASPaxos. Cassandra does something similar, but it requires four round trips. The other approach is to have a strong leader: we go through a process called leader election, select one of the replicas to be in charge, and then we can commit values with one round trip of communication between the leader and the followers. This is what Multi-Paxos, Raft, and Zab all do.

Aside from this, there are a bunch of challenges in implementing these algorithms. They typically rely on a write-ahead log to order operations, which takes space, so we need to find ways to compact that log. Read-only requests typically require a full run of the consensus algorithm, but if we just want to read the latest value we shouldn't have to go through that, so there are optimizations that apply to some algorithms. And dynamic membership, having nodes come and go from the cluster, is also a big issue.
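To picture the replicated-state-machine idea mentioned a moment ago, here is a minimal sketch with invented names: the consensus layer feeds every replica the same committed commands in the same log order, so all replicas end up in the same state.

```cpp
#include <cstdint>
#include <map>
#include <string>

// Illustrative sketch of a replicated state machine: the consensus layer
// hands every replica the same committed commands in the same order, so
// all replicas converge on the same state. Names are invented for the example.
struct command {
    std::string key;
    std::string value;
};

class state_machine {
    std::map<std::string, std::string> kv_;
    uint64_t last_applied_ = 0;          // index of the last applied log entry
public:
    // Called only for entries the consensus algorithm has already committed,
    // strictly in log order.
    void apply(uint64_t log_index, const command& cmd) {
        if (log_index <= last_applied_) {
            return;                      // idempotent: ignore replayed entries
        }
        kv_[cmd.key] = cmd.value;
        last_applied_ = log_index;
    }
};
```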
Another challenge is multi-key transaction support: can we extend the algorithm, or compose instances of it, so that we can have multi-key transactions? Another thing to look at is performance overall, and finally whether there is actually some formal proof of its safety, or whether there's, say, a Jepsen blog post about it.

Not surprisingly, given the title of the talk, we picked Raft. It is focused on understandability and is very precisely described. It is very widely used: there are a lot of implementations, and it's running in a lot of databases. It has strong leadership, with a leader election process, and in the context of lightweight transactions it is the leader that does the read-before-write and validates whether a transaction can apply or not. And for us, the fact that there is a log of operations is no big deal, and it can be compacted, as we shall see.

In Raft, a node can be in one of three states: a follower, a candidate to become a leader, or the leader itself. The leader is the one that handles user requests, carries out the read-before-write in the context of LWT, applies the new value, and replicates it to the followers. It is also in charge of sending periodic heartbeats to the followers to let them know it is still alive. If the leader becomes unavailable, one of the followers will notice, become a candidate, and propose itself for leadership. A candidate becomes the leader if it receives votes from a majority of the other replicas.

In Raft we have three components. There is the main consensus module, which drives all of the algorithm's logic. There is the log, the write-ahead log, which is the thing that actually provides most of the guarantees of the protocol. And when entries of the log are safely replicated to a majority of the followers, they are applied to the state machine, which in the case of Scylla is the database itself. The client then observes the data in the state machine.

The algorithm provides a set of guarantees that make it very understandable. The election safety guarantee means that for any given term there is at most one leader in the system. The leader is the only replica that can append entries to the log. Raft enforces, through the log matching property, that if two logs, say the leader's and one of the followers', contain the same entry at a particular index for a particular term, then the logs are identical up to that point. If that is not the case, then either we send whatever entries the follower is missing, or we remove from the follower whatever extraneous entries it has. For a node to be elected leader, its log needs to be at least as up to date as the log of every replica that votes for it, so a new leader is guaranteed to contain all committed entries. And finally, if a log entry is applied to the state machine on a particular replica, then it is guaranteed that the same entry is applied at the same position of the log on all the other replicas. This is how we guarantee that all the replicas are in fact working as a replicated state machine.

In Raft, time is divided into terms. Each term starts with a leader election, which either elects a leader or, if no node receives a majority of votes, the process is repeated in a new term. Time in Raft is logically divided into these terms.
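To make the log matching property a bit more tangible, here is a sketch, with illustrative names, of the consistency check a Raft follower performs when the leader sends it new entries.

```cpp
#include <cstdint>
#include <vector>

// Sketch of the log matching check a Raft follower performs when the leader
// appends entries. Field and function names are illustrative.
struct log_entry {
    uint64_t term;
    // ... the replicated command would live here
};

struct raft_log {
    std::vector<log_entry> entries;      // entries[0] holds log index 1

    // The leader sends the index and term of the entry that should precede
    // the new ones. If the follower's log doesn't contain a matching entry,
    // it rejects the append, and the leader backs up and retries until the
    // logs agree (sending missing entries or overwriting extraneous ones).
    bool matches(uint64_t prev_index, uint64_t prev_term) const {
        if (prev_index == 0) {
            return true;                 // appending at the very start of the log
        }
        if (prev_index > entries.size()) {
            return false;                // the follower is missing entries
        }
        return entries[prev_index - 1].term == prev_term;
    }
};
```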
Applying all of this to Scylla leads to a bunch of interesting design choices. First of all, because Scylla is partitioned, there is no single node that holds all the data. So even if we wanted to have a single leader for the whole cluster, it would be very cumbersome: it would need to contact the other nodes that actually hold the data to do the read-before-write, and then to eventually apply the write, all the while holding locks that would prevent other concurrent operations from proceeding. It would also mean, obviously, that if all write operations had to go through a single node, that node would become a huge bottleneck.

So what we do instead is have as many Raft groups as there are partitions of the token space. For each token range there is a set of replicas that handles that subset of the data, and those replicas form a Raft group. So there are many leaders and many Raft groups. A nice consequence of this is higher concurrency: writes can go to different leaders, in different Raft groups. And if one of those leaders fails and we have to elect a new one, there is going to be a brief period of unavailability, but it will only affect a subset of operations. There is some fine print here: if we have n nodes, say 10 nodes, and a leader dies, we lose the leaders of 10% of our token ranges. So if a client is making random requests, there is a 1/n probability that it will hit a key for which the leader is unavailable. It doesn't mean we beat the CAP theorem; we're still going to be unavailable in the presence of failures.

Because of how Scylla is implemented on top of Seastar, each group itself is going to be sharded on a particular node: a shard, a CPU, handles a subset of the operations of that particular Raft group. This impacts how we organize the Raft logs. Inside a particular node we have many of these Raft groups, each with the typical Raft modules: the core consensus module, the log, and the database. The state is whatever persistent state we need to keep across reboots, and the RPC module is just what we use for messaging between replicas. Each shard, each CPU, has an instance of all of these per group. However, shared between all of them is the heartbeat module. If a node is a leader for many groups, and another node is a follower for many of those groups, we don't want to send one heartbeat per group; we want to coalesce the heartbeats into just one message and send just that one heartbeat.

This is how the simplified write path looks inside a node. We have a mutation, which is how we represent a write request in Scylla; the mutation is idempotent and commutative. And we have a set of conditions that we are going to compare against whatever is on disk. First, we take some locks to make sure concurrent operations are blocked. We query the database to find the current value and compare it against the set of conditions. If they match, we append the entry to the log. From there, we communicate with the followers to try and replicate that log entry, and once we have replicated it to a majority of followers, the entry is committed and can be safely applied to the database.
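Here is a deliberately simplified, synchronous sketch of that write path; the real implementation is asynchronous (Seastar futures) and distributed, and all the types and function names below are stand-ins for illustration.

```cpp
#include <optional>
#include <string>

// Simplified, synchronous sketch of the LWT write path described above.
// Storage and Log are placeholders; nothing here is Scylla's real API.
struct mutation { std::string key; std::string new_value; };
struct condition { std::optional<std::string> expected; };     // "apply IF value = X"

struct lwt_result { bool applied; std::optional<std::string> current; };

template <typename Storage, typename Log>
lwt_result lwt_write(Storage& db, Log& raft_log, const mutation& m, const condition& c) {
    auto key_guard = db.lock_key(m.key);        // 1. block concurrent ops on this key
    auto current = db.read(m.key);              // 2. internal read-before-write
    if (current != c.expected) {                // 3. check the user's precondition
        return {false, current};                //    condition failed: report back
    }
    auto index = raft_log.append(m);            // 4. the leader appends to its log
    raft_log.replicate_until_committed(index);  // 5. wait for a majority of followers
    db.apply(m);                                // 6. the committed entry hits the database
    return {true, m.new_value};                 // lock released when key_guard goes out of scope
}
```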
Sharding, then, is where all the differences between a typical Raft implementation and this one arise. You noticed how we organize our Raft groups per node, but inside a node each shard handles a non-overlapping subset of the ranges. So why didn't we organize the groups by shard instead? The reason is that this would lead to a combinatorial explosion of groups, and the metadata for all those groups would overwhelm the database.

In Raft, a particular entry is identified by the term it was written in and by its index in the log. In Scylla's Raft, we add to that the leader shard the entry belonged to. We have to be careful with heterogeneous clusters, clusters where the shard count differs between nodes. This is not a typical situation, but it can happen during a rolling restart where someone is provisioning more CPUs on their machines, so for a short while the shard count differs among the nodes. And also with resharding, which is when one of those nodes reboots and comes up with a different shard count. If we are dealing with a homogeneous cluster, everything is fine: shard zero of node zero talks to shard zero of some other node, and their logs are exactly the same and organized the same way. If, however, we have different shard counts, it can be that different leader shards are talking to the same follower shard. This means that in the follower's log, those entries are not going to match how they are ordered in the leader logs. In particular, for the log matching property of Raft, when we send a new log entry we also send what we expect to precede it, so the follower can check whether its log matches the leader's expectations; if it doesn't, we trigger the process by which the follower is brought up to speed. In this case, the follower can have entries that another leader shard gave it, and they won't match what this leader shard is expecting. We could solve this by going over the log and filtering out entries, but it would be cumbersome. We can have the mirror scenario, where one leader shard is talking to different follower shards. In that case, the logs in the follower shards will have gaps in them, because some of the entries went to another shard on that machine. And it's the same thing: when we try to append an entry, the log matching property requires that we look at what precedes it in the log, we find gaps there, and the algorithm assumes that the follower is missing data. So it won't work very well.

To solve this, we change how logs are organized. We organize them by term, because for a particular term the leader's shard count is stable, and then we also organize them by leader shard. When a leader restarts, its term ends and it has to go through another leader election; in other words, whether a node is a leader or not is not preserved across reboots. This is also why, if you're changing the CPU count of a node, you have to go through a restart. So we don't have to worry about a leader resharding in the middle of its term. And because logs on the followers are now organized by leader shard, it can be that in some cases different shards have to access the same log, so they will require synchronization. This goes a little bit against the philosophy of Seastar, where everything is partitioned, but this situation is really not typical, so we accept some synchronization here.
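As a rough picture of that log organization, here is a sketch with illustrative names: entries carry the standard Raft (term, index) plus the leader shard they came from, and a follower keeps a separate log per (term, leader shard) pair so that entries from different leader shards never interleave or leave gaps.

```cpp
#include <cstdint>
#include <map>
#include <utility>
#include <vector>

// Sketch of the per-term, per-leader-shard log organization described above.
// Names are illustrative, not Scylla's internal types.
struct entry_id {
    uint64_t term;
    uint64_t index;
    uint32_t leader_shard;   // Scylla's addition to the usual (term, index)
};

struct log_entry {
    entry_id id;
    // ... payload
};

class follower_logs {
    // One log per (term, leader shard): within a term the leader's shard
    // count is stable, so each of these logs stays dense and gap-free.
    std::map<std::pair<uint64_t, uint32_t>, std::vector<log_entry>> logs_;
public:
    void append(const log_entry& e) {
        logs_[{e.id.term, e.id.leader_shard}].push_back(e);
    }
};
```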
Finally, log compaction. As I said, when entries are committed, replicated to a majority of followers, they can be written into the database. At this point we can compact the log and get rid of all of those committed entries. In other words, the database itself becomes responsible for that prefix of the log. This has consequences when we are dealing with a follower that has fallen behind and needs to be brought up to date. Whereas before we could just ship the log entries over to that node, now we have to look into the database state and figure out what that node is missing. Fortunately, we already solved that problem: we had exactly the same problem with diverging replicas. So we can use the anti-entropy process we call repair to figure out exactly what data a follower is missing and then transmit only those mutations to it. There are some interesting aspects to this. Due to natural concurrency, it can be that what we're transferring to the node overlaps with entries in the log that we are also shipping over, but because our mutations are idempotent and commutative, no harm comes of this. It's also the case that repair is a heavyweight process to run, so we'd like to avoid it. If we notice that some followers are falling behind, we can choose to keep some committed log segments around, say one gigabyte's worth, just to make it much faster to bring those followers up to date; it's much faster to just send sequential pieces of data over.

Now, the most difficult aspect of all of this is membership changes. As we saw, there will be many Raft groups in a Scylla cluster: for each keyspace we are going to have replication factor times the number of vnodes, which is how many token ranges the keyspace is divided into, Raft groups. And Raft has a restriction on configuration changes: only one node can be added to or removed from a group at a particular time. So more complex changes have to be implemented as a series of these single-step changes. In Scylla, unless you are changing the replication factor of a keyspace, the member count of a group stays the same whenever you're adding or removing a node. This means that if you are adding a node to a group, you'll have to later remove one of the other nodes from that group, and conversely if you are removing a node.

So we have here a segment of the token space in which node A is the primary replica for a particular token range. It forms a Raft group, and nodes B and C are the secondary replicas for that token range, so these three nodes are in a Raft group together. The same goes for B: it is the primary replica of the next token range, and C is in a group with it, along with whatever node comes next. Now, if I bootstrap a node D, and D claims a token range that falls between A and B, several things happen. D is now a secondary replica for A's range, and C is no longer in that Raft group. The ranges for which B was the primary replica have now changed, because some of those ranges now belong to D. And because of that, D is also the primary replica for a new token range, and B and C are in a Raft group together with it. Out of all of this, what's important is having node D join the group of A and having node C leave it. Raft forbids these two operations from happening concurrently, so we have to order them, and that actually requires changes to how our existing group membership algorithms work.
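Here is a sketch of how a "D replaces C" change could be decomposed into the single-member configuration changes Raft allows, in a safe order; the group interface and function names are invented for the example.

```cpp
#include <string>

// Sketch: replacing one member of a Raft group with another, one
// single-member configuration change at a time. Illustrative interface only.
struct node_id { std::string name; };

template <typename RaftGroup>
void replace_member(RaftGroup& group, node_id joining, node_id leaving) {
    // First single-member change: the new replica joins the group and is
    // brought up to date (via log shipping or repair, as described above).
    group.add_member(joining);
    group.wait_until_caught_up(joining);

    // Second single-member change, only after the join has committed:
    // the old replica leaves. Ordering the two steps keeps every
    // intermediate configuration's majority overlapping with the next one's.
    group.remove_member(leaving);
}
```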
So before we bootstrap a node, we need to wait until it has joined all of its groups, and that is when we know it's safe for the other nodes to exit those groups. Doing a single operation at a time, as Raft requires, has the nice property of ensuring that any majority of the configuration before the change overlaps with any majority of the configuration after the change. These configurations, which replicas comprise the Raft group, are written as special entries to the Raft log. The drawback is that bootstrapping or removing a node becomes a much longer process, because it requires consensus rounds across many of the groups in the cluster, so adding or removing nodes becomes much more expensive. And finally, new nodes, because they don't yet hold the data, enter the system as non-voting members: they can be sent log entries by the leader, but they can't participate in the leader election process. This can also hurt availability in some cases.

So this is it for Raft, and with everything I've described we get single-key transactions, exposed to the user in the form of LWT. However, LWT has another important constraint: there is a consistency level that says a transaction should only apply within a particular data center. This means our Raft groups cannot contain replicas from multiple data centers; instead, each data center has its own Raft groups. That leaves us with another problem, because there is also a consistency level that says a transaction should apply across all data centers, so we need some way to coordinate among Raft groups. This is more work upfront to support LWT, but it's actually a subset of what is needed to support multi-key transactions. The way we solve it is by layering an agreement protocol, two-phase commit, on top of the Raft groups. Other databases like CockroachDB do this. It's a typical two-phase commit implementation, with the benefit that all of the participants, the resource managers, the coordinator, and the nodes that hold the transaction status, are fully replicated and fault tolerant because they form Raft groups. I'm not going into details; it's typical two-phase commit with additional fault tolerance.

I've spoken about how external users can benefit from strong consistency, but the database itself can also benefit from it. One example is concurrent schema changes. Today, a schema change is carried out locally on whatever replica the client contacted and then propagated throughout the cluster. This means we have no protection against concurrent schema changes, or changes that are done in incompatible orders. For example, if I'm trying to create a table on one side, and on the other side I'm trying to drop a user-defined type that's used in that table, there's really no way to order these two operations. And because they are concurrent, there's really no way to give a good error message to the user; it's just going to be errors in the logs later. However, if we partition our keyspaces, say by keyspace name, across the cluster, then we can reach a Raft group that is responsible for all operations on that keyspace. Once we have that, we can use LWT internally when applying schema changes, so that they all go through the same leader, they are ordered, and we can provide nice error messages to the users.
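Here is a sketch of that idea, with invented names: every schema change for a keyspace is routed to the Raft group that owns the partition key "keyspace name" and applied as a conditional, LWT-style update on the current schema version, so concurrent or incompatible changes fail cleanly at the leader with a real error.

```cpp
#include <cstdint>
#include <string>

// Sketch: serializing schema changes through the Raft group responsible for
// the keyspace, using a conditional (LWT-style) update. Illustrative only;
// RaftGroupRouter and its methods are invented for the example.
struct schema_change {
    std::string keyspace;
    std::string ddl_statement;        // e.g. "CREATE TABLE ..." or "DROP TYPE ..."
    uint64_t expected_version;        // schema version the client based its change on
};

template <typename RaftGroupRouter>
bool apply_schema_change(RaftGroupRouter& router, const schema_change& change) {
    // Pick the group the same way data is partitioned, but keyed by keyspace
    // name, so all changes to one keyspace are ordered by a single leader.
    auto& group = router.group_for_key(change.keyspace);

    // Compare-and-swap on the schema version: the change only applies if
    // nobody else modified the schema since `expected_version` was read,
    // which is where a clean "conflicting schema change" error can be returned.
    return group.conditional_apply(change.expected_version, change.ddl_statement);
}
```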
Another thing is range movements. Currently, we can't really have concurrent range movements: we can't add more than one node at a time, because they can pick conflicting token ranges. And when a node joins the cluster, it picks whatever token ranges it wants, and it does so randomly, so it's not really optimal. What we want is a strongly consistent way of ensuring that when we add a new node we give it whatever token ranges are best for the system: we can see which nodes are more overloaded, take ranges away from them, and give them to the new node. It also means we can tell the system we want to add many nodes at the same time, and the system can pick tokens in a way that lets those operations actually proceed concurrently. Of course, they wouldn't be able to be in the same Raft groups, but we could do better than what we have now. The only thing is that we can't use the same partitioning approach we used for schema updates, because picking a token is a centralized decision made on behalf of the whole cluster. So we could do one of two things. We could have a global group, saying all nodes in the system belong to a single Raft group, which is simple but has many disadvantages. Or we could say that the Raft group that coordinates cluster range movements is formed by very specific nodes: for example, the seed nodes, the ones that everyone knows about, form the Raft group that controls range movements and token ownership.

Finally, materialized views. A materialized view is a table derived from some base table. It is typically used to index a column of the base table, but it is different from a secondary index because you can actually denormalize data into it: you can include in a materialized view many columns from the base table. A materialized view is handled by a totally different set of nodes than the base table, and it cannot be written to directly. When the base table is updated, the database processes those updates, calculates the view updates needed to keep the materialized views current, and sends them to the view replicas. It does so in the background and in an eventually consistent fashion. This has the advantage of preserving base table availability: if one of the view replicas is down, or all of the view replicas are down, the base write still succeeds; we just accumulate pending view updates in memory. This can be a problem for two reasons. One, we can have issues with consistency, with keeping the views up to date with the base tables. Two, we can have issues with flow control: all of those pending view updates on a base node can start to take over all the memory, and then we need a distributed flow control solution, so we can communicate that resource pressure back to the coordinator node and slow down the client. This becomes very complex. With multi-key transactions, we could instead specify that the base table is updated together with the view table in the context of a single transaction. This would ensure that they are always kept consistent with each other and would provide natural back pressure from the view replicas and base replicas to the client. This would, of course, have to be opt-in, because it comes at a performance cost.
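Here is a sketch of what that could look like, with illustrative types and no claim of being an existing Scylla API: the base-table write and the derived view updates are grouped into one transaction, so either all replicas, base and view, apply them or none do, and a slow view replica naturally pushes back on the client.

```cpp
#include <string>
#include <vector>

// Sketch: updating a base table and its materialized views atomically in one
// multi-key transaction. TxnCoordinator and ViewSchema are placeholders.
struct write { std::string table; std::string key; std::string value; };

template <typename TxnCoordinator, typename ViewSchema>
bool write_with_views(TxnCoordinator& txc, const ViewSchema& views, const write& base) {
    std::vector<write> all = {base};
    // Derive one write per materialized view from the base-table write.
    for (const auto& view : views.views_of(base.table)) {
        all.push_back(view.derive_update(base));
    }
    // Commit atomically across all the Raft groups involved, using the
    // two-phase commit layered on Raft described earlier. Returns false
    // if the transaction aborts, so the client sees back pressure directly.
    return txc.commit_atomically(all);
}
```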
That's it, that's all I have on the subject, so if you have any questions, I'll take them now. Thank you.

The question is how we support multiple workloads, and how we avoid blocking when we have a slow shard. The answer is that the problem is similar to how it's handled across different nodes: if you have a slow node, you have the same problem, and the way to solve it is with better data partitioning. If the way you select your partition key produces hot nodes, then you have to employ a different partition key. The same thing happens with shards: a good partition key ensures that all shards are balanced and do roughly the same amount of work. If you have a hot shard, there's not much we can do about it; it's like having a hot node, and it's up to the user to tweak the partition key. Makes sense?

So I didn't mention this, but when I was showing the request path and we were taking locks, there's an opportunity to allow a lot of concurrency there. If we're talking about different keys, they'll take different locks, and they'll be able to progress independently within the node. Any other questions? I don't think so. Okay, thank you.