Hi everyone, pleasure to be here. Nice to see all of you, and many familiar faces. My name is Alex, and today we're going to talk about the importance of transactional metadata and the avenues it's opening for Apache Cassandra. Before we begin, I wanted to mention that I have several copies of the Database Internals book, and I'm going to give them out to the people who ask good questions. I have four copies on me and several more stashed away; Mick will help hand them out.

Anyways, let's begin. Let me first give you an overview of the talk. We're going to start with the motivation: why we wanted to make the changes that we've made. Since the best way to plan a project is to first have clear requirements about what we want to achieve, we'll also set out clear requirements. And because I'm presenting to a technical audience, I'm quite certain many of you are curious about the implementation details; in fact, many of you are probably here exactly because of that. Lastly, since transactional metadata has already landed in Cassandra trunk, we can also start talking about the avenues it has opened, the future steps, and the things we have started working on.

So let's first talk about how we came to the conclusion that we need to make the cluster metadata transactional. Around 2010, a new wave of NoSQL databases appeared, mostly inspired by the Dynamo paper, and all of them had several things in common: homogeneous nodes running on commodity hardware, and cluster metadata implemented through Gossip. With Gossip you can have huge clusters of nodes with spotty connectivity, and nodes will collaborate, probabilistically propagating state throughout the cluster. In Cassandra, Gossip was used for disseminating many things, among them schema and cluster metadata, which consists of membership and ownership information. Membership information identifies the presence of a node in the cluster, while ownership information identifies the target nodes for reads and writes.

There is nothing wrong with disseminating cluster metadata through Gossip, since that's what it's best at. Things start getting tricky when we expect Gossip listeners to have a coherent view of this metadata and to act as if that were the case. Gossip became a subsystem that reacted to membership and ownership changes announced by the nodes, and it created, among many other things, something called the token metadata object. Anybody who has worked with Cassandra long enough probably knows about its existence. Token metadata is built from the information passed to the subscribers of Gossip state changes, and it is invalidated whenever the ring version changes, which means it has to be rebuilt. Unfortunately, this was happening on the hot path, which is something we ultimately wanted to avoid with the transactional cluster metadata project.

Now let's talk a little bit about the motivation. Most Cassandra clusters operated today contain only hundreds of nodes, which is a relatively small number compared to the number of participants in something like a mesh network or some Internet of Things setups, where it might be impossible to live without Gossip or spanning trees or something like that.
We believe it may be more efficient to use simple broadcast as a first line of distribution, with natural cluster communication as a backup, to disseminate the information that is currently distributed by Gossip in older versions. This will make it simpler to reason about and lead to quicker, time-bound convergence.

Let's talk through the process of node bootstrap. When a node comes up, it contacts the seed nodes, which can give it the last snapshot of the membership and ownership information known to them. Unfortunately, we cannot know whether there are other nodes that are currently trying to make changes to the ownership of the range we're about to bootstrap. The most straightforward way to work around this problem is to introduce a delay, so that other interested parties can make their intentions known. To reframe this a bit: the shadow round and the other delays you see mentioned on the slide are necessary because we cannot grab a lock on the cluster instantly, or quickly ensure that no other node will attempt to modify overlapping ranges concurrently with us.

Another source of complication is that the current Gossip implementation does not provide a way to ensure strict ordering of these changes, which may lead to out-of-order propagation of concurrent events. Say two nodes are bootstrapping simultaneously: coordinator A may observe node one joining before node two, while coordinator B may observe them joining in the opposite order. This also means that there can effectively be two disagreeing views of the ring. Of course, in practice all of this can be worked around, since operators are usually careful and make sure to only permit safe operations in the cluster. But even if we're willing to ignore these things, there are still problems we cannot easily solve with Gossip. I will give an example of a transient data loss issue later in my slides. Besides, because of issues with automation software or human mistakes, there is still some room for error in the current implementation which we cannot fully eliminate.

One of the things we've been discussing in the Cassandra community lately is the concept of token ownership. While the idea of a hash ring and consistent hashing looks good on paper, there are still some things that aren't obvious at first glance. The first problem is that it uses randomness, in the form of consistent hashing, for load balancing and distribution, and this does not guarantee an optimal load, because even though your hash function provides a uniform distribution, your data is not uniformly distributed. There are partitions that outweigh the others in terms of size and load, and there are 2^64 tokens if you assume the Murmur3 partitioner, so the space between tokens is rather large, which means there are still many chances for hotspots. Vnodes, as you may know, were an attempt to address the granularity issue, and they did allow us to pick multiple sources for streaming, but they have not completely eliminated the load balancing problem. The ultimate goal of load balancing is to have a way to gradually and dynamically relocate data.
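To illustrate that first problem, here is a minimal sketch in Java (purely illustrative, not Cassandra code) showing that even perfectly uniform random token placement produces uneven ownership shares, before data skew makes things worse:

```java
// A toy demonstration: random tokens, like those produced by hashing node
// identities, split the token space into ranges of wildly varying width.
import java.util.Arrays;
import java.util.Random;

class TokenBalanceSketch {
    public static void main(String[] args) {
        Random rng = new Random(1); // fixed seed for a reproducible run
        int nodes = 8;
        long[] tokens = new long[nodes];
        for (int i = 0; i < nodes; i++) tokens[i] = rng.nextLong(); // like Murmur3 tokens
        Arrays.sort(tokens);

        // Fraction of the 2^64 token space each node owns (wraparound range
        // ignored for simplicity; TCM actually removes wraparound entirely).
        double space = Math.pow(2, 64);
        for (int i = 1; i < nodes; i++) {
            double share = (tokens[i] * 1.0 - tokens[i - 1]) / space;
            System.out.printf("node %d owns %.1f%% of the ring%n", i, share * 100);
        }
        // Expected share is 12.5% each; actual shares scatter widely.
        // Vnodes shrink the variance but do not eliminate hotspots.
    }
}
```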
The second problem is that token ownership means a node's position in the ring, together with the replication factor, dictates which ranges it is going to own. Because of the ring, you cannot pick a subset of nodes for a specific range and relocate the data to them without doing token arithmetic, and you cannot relocate a range to a subset of nodes without having them own the neighboring ranges as well. I consider this a problem.

It is also important for us to have a single point of reference. Any node joining the cluster should be able to quickly and easily find the latest membership and ownership state, and modify it, without having to wait for other nodes to announce their potential actions. If you take a look at how, for instance, Paxos v2 is implemented in Cassandra, its payloads contain a list of participants, because there is no other easy way to ensure that two nodes have exactly the same view of the ring. We should make similar information available for every operation, even eventually consistent ones, and we can make it very compact: a single number identifying the ring epoch should be sufficient. Of course, this applies not just to Paxos but to eventually consistent operations just as well; coordinators and replicas should agree about the view of the ring, and we should have mechanisms that allow lagging participants to quickly catch up to the latest epoch for the operation that is ongoing. And lastly, we would like ring consistency for the duration of every operation, which means that the version of the ring available at the end of a read or write operation should be compatible with the responses that the coordinator has collected from the replicas.

All of these things are required to maintain consistency levels honestly. For instance, if we advertise a replication factor of three, we are effectively making two promises: by the end of a quorum write we will have at least two copies of the data available, and a quorum read that follows the quorum write will observe the effects of that write. Let's see under which conditions these promises may not hold.

I'm going to walk you through an example of transient loss of a write during bootstrap. Imagine a cluster with only three nodes, A, B and C, a replication factor of three, and we would like to bootstrap a fourth node, X. As you may know, to bootstrap a node in Cassandra you first add it to the cluster as a pending node: it is going to receive writes but will not serve any reads. Now, the reader still sees node X as joining. Meanwhile, from the perspective of the writer, X has fully joined the ring, A is already excluded from the write set, and the placements for the write are going to be B, C and X. The result is transient loss of the write during bootstrap: the reader may collect nodes A and B as its replica targets, and since that's a quorum it is sufficient to complete the operation, while the writer may collect C and X as its write targets. Since these quorums are non-overlapping, we have a transient loss of the write. One of the underlying problems is that reads and writes are not monotonic with respect to the ring's evolution; there are no ordering guarantees between the two, and this can result in this problem, which has existed in the system for a long while.
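Here is a minimal sketch (hypothetical code, not Cassandra's) of that exact scenario, showing how two legitimate majorities, computed from diverging views of the ring, can fail to intersect:

```java
// The reader's view of the ring lags behind the writer's, so their quorums
// are computed from different replica sets and may not overlap.
import java.util.*;

class QuorumOverlapSketch {
    public static void main(String[] args) {
        // Reader's stale view: X still pending, so reads go to {A, B, C}.
        List<String> readReplicas  = List.of("A", "B", "C");
        // Writer's newer view: X joined, A no longer replicates the range.
        List<String> writeReplicas = List.of("B", "C", "X");

        // One possible read quorum and one possible write quorum (2 of 3 each).
        Set<String> readQuorum  = Set.of("A", "B");
        Set<String> writeQuorum = Set.of("C", "X");

        // Both are legitimate majorities of their respective replica sets...
        assert readQuorum.size()  > readReplicas.size()  / 2;
        assert writeQuorum.size() > writeReplicas.size() / 2;

        // ...yet they share no replica, so the read can miss the write.
        Set<String> overlap = new HashSet<>(readQuorum);
        overlap.retainAll(writeQuorum);
        System.out.println("overlap = " + overlap); // overlap = [] -> lost write
    }
}
```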
Now, let's discuss other problems. If you've been in the community long enough, you may have heard about CASSANDRA-13004, a bug where a schema disagreement between two nodes corrupted data during a read: essentially, read repair was corrupting the data in the SSTable. Even when the results are not as drastic, there are still problems that are difficult to resolve, such as racing CREATE TABLE statements; schema disagreements can manifest in many different ways.

So let's now set out the requirements for what exactly we're trying to achieve. First, we would like to make sure that we remain honest about consistency levels even during range movements. This is a non-negotiable correctness requirement. We would also like to have safe concurrent bootstraps: multiple nodes should be able to join and leave the cluster, and you should need zero knowledge of Cassandra internals to make this happen. Next, we would like nodes to find agreement about the state of the cluster metadata quickly and easily, ideally using natural internode traffic, so that no additional traffic is introduced to the cluster. Lastly, we would like the metadata changes to be completely transparent to the user. Clients connected to the cluster should see absolutely nothing while range movements are in progress, not even a single unavailable or a timeout. That would be great.

So let's talk a little bit about the implementation details. We have discussed two things already, so let's try to incorporate them into the implementation. We will maintain compatibility with tokens during the upgrade period and for as long as you continue using the old nodetool commands, so everything remains 100% backwards compatible. After migrating to transactional cluster metadata, however, nodes will own ranges, not tokens. Good news for Cassandra maintainers: while doing that, we have also made sure to remove range wraparounds. All ranges now go from the minimum token to the maximum token, so there is technically no ring anymore; it's just a flat line from the minimum possible token to the maximum possible token. For now we have decided to maintain full placement compatibility with the old token model, but the plan is to allow dynamic relocation, fully decoupling range ownership from token ownership, so that the two are not interrelated in any way. This is technically possible even now; it's just not exposed to operators yet.

In 5.0 and previous versions, token metadata changes are just re-computations triggered by the listeners of Gossip state changes. Instead of this, we now have explicit transformations that live in their own classes and explicitly describe what they're doing, so you will be able to inspect every change executed on the system in lots of detail. If we assume that cluster metadata is an atomic variable and start with some empty state, we can formulate the change events as pure functions that take cluster metadata from one state to the next, with no side effects in any of them whatsoever. The results will be computed on all nodes in the same way, and there will be no way for two nodes to have divergent views of the metadata if they have seen the same sequence of events, which they will. This also means that every cluster metadata state can be uniquely identified by its epoch. Epochs are monotonically incremented and comparable.
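As a sketch of that idea (hypothetical and heavily simplified relative to the real classes in Cassandra trunk), a transformation is a pure function from one immutable metadata state to the next, bumping the epoch as it goes:

```java
// Metadata changes as pure functions: no I/O, no mutation, so the same
// sequence of events yields the same state on every node.
import java.util.HashSet;
import java.util.Set;

record ClusterMetadata(long epoch, Set<String> members) {}

interface Transformation {
    ClusterMetadata apply(ClusterMetadata prev); // pure state transition
}

class Register implements Transformation {
    final String node;
    Register(String node) { this.node = node; }

    @Override
    public ClusterMetadata apply(ClusterMetadata prev) {
        Set<String> members = new HashSet<>(prev.members());
        members.add(node);
        // Same input state + same event => same output state everywhere,
        // and each applied event produces exactly one new epoch.
        return new ClusterMetadata(prev.epoch() + 1, Set.copyOf(members));
    }
}

class Demo {
    public static void main(String[] args) {
        ClusterMetadata empty = new ClusterMetadata(0, Set.of());
        ClusterMetadata next = new Register("node1").apply(empty);
        System.out.println(next.epoch()); // 1: the epoch identifies the state
    }
}
```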
So any two participants can quickly and easily determine who has the higher epoch and who has to catch up. To preserve the history, all of these transformations are stored in the metadata change log. Long operations such as bootstrap, decommission, move and replace are now able to survive node crashes and are safely resumable. The metadata change log is also strongly consistent: there is only one permitted order of events, and there can be no other history. The log is stored on a dynamic subset of nodes called the cluster metadata service, or CMS for short. This service can be consulted at any point in time to find out the latest state and to make modifications to that state. The participants in the cluster metadata service are managed by the cluster metadata service itself, without any user intervention, reacting to changes in your cluster size and the health of the metadata log owners. You simply specify the replication factor for the cluster metadata service, and the rest is done for you. Newly joining nodes only have to know a single node in the cluster in order to discover the latest state of the metadata, so no concept of CMS nodes has to be introduced into your scripts or setups.

Some operations leave no visible trace in the final metadata, such as a bootstrap that was rejected because another bootstrap was already ongoing and the ranges were locked, or the re-creation of an existing table. The only events that actually make it to the log are the ones that actually modify it. So a script which creates a table if it doesn't exist will have no effect on the cluster metadata object, and won't grow it exponentially large even if it runs every single second; we only preserve the things that make relevant changes to the metadata. The rest are transparent and do not bump the epoch.

And finally, the metadata state is computed from the log, as I have already mentioned. Starting from the empty metadata state, we apply all the events in order, without any gaps, and this way obtain the latest version of the metadata state. Since we plan to allow rather many operations that can transform the metadata state, on the order of tens of thousands per day, we decided to use data structures with structural sharing to reduce the memory footprint, because if you have several copies of the cluster metadata object in memory, we would like them to be as small as possible. And most of the time, new cluster metadata changes will have no influence on ongoing operations. Computation is done completely off-thread and is made visible only when it is complete, so when we manifest the epoch, that is the first time the node is going to see it and experience it, and there will be no additional computation required for it.

All metadata elements, including schema, are now versioned. Changes to one keyspace or table do not bump the epoch for other tables and keyspaces, and placement information is also versioned. Epochs are changed only for the ranges that are actually affected by the operation. So if you have bootstrapped a node, it doesn't mean that the entire state has to be recomputed and the epoch for each and every range has to be bumped; only the touched ranges actually get their epochs bumped. Read and write requests now include the coordinator's relevant schema epoch and the epoch at which the range being addressed was last modified.
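To make this concrete, here is a minimal sketch (hypothetical names, not Cassandra's real request path) of how a replica could use the epoch carried by an incoming request:

```java
// A replica compares the epoch in the request against its local epoch and,
// if it is lagging, fetches the missing log entries before serving.
class EpochCheckSketch {
    static long localEpoch = 5;

    static void catchUpFromPeer(long targetEpoch) {
        // Stub: fetch and apply log entries (localEpoch, targetEpoch] from
        // any peer that has them; the CMS itself is not involved.
        localEpoch = targetEpoch;
    }

    static void handleRequest(long rangeLastModifiedEpoch) {
        if (rangeLastModifiedEpoch > localEpoch)
            catchUpFromPeer(rangeLastModifiedEpoch); // we are lagging
        // Safe to proceed: we have at least the placement the coordinator used.
        System.out.println("serving request at epoch " + localEpoch);
    }

    public static void main(String[] args) {
        handleRequest(7); // coordinator's view is newer: catch up, then serve
    }
}
```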
Nodes can thus quickly and efficiently check that their epochs match, and a lagging node can catch up to the relevant epoch. And best of all, none of this requires communication with the cluster metadata service: nodes can simply exchange the relevant metadata mutations among themselves without involving the CMS nodes, so there is no extra load on those nodes either.

Let's talk about maintaining consistency levels, and how we solved the transient write loss I mentioned before. I will simply walk you through the same scenario in a more visual form: how node X is bootstrapped with transactional cluster metadata. We again have the same setup: a three-node cluster, a replication factor of three, and node X being bootstrapped into the cluster.

The first step is something we call prepare join. During this step, the ranges owned by the neighboring replicas are split, and no other changes are made to the state of the cluster. Basically, we introduce the new token where X is going to reside, but apart from that we do not make any changes to the ring. In the next step, we add X to the write set. This is very similar to the introduction of pending nodes in the previous model: we essentially say that the write set for the given ranges is expanded by the addition of node X. So if previously the range from zero to 100 had a write set of A and B, it is now also going to include X.

For the next step, we had to introduce the concept of a progress barrier. It serves as a way to ensure we maintain the promises about the consistency level, as discussed before. We can summarize the need for it in three conjectures. First, the bootstrapping node has to receive every single mutation, every single write that has been made; in other words, streaming must wait for the event that adds the target node as a write replica to be acknowledged by a majority of the replicas owning the range. Second, streaming should start only after writes to the pending node have started: streaming has to be blocked until we know that a majority of nodes has experienced, manifested, that epoch. And third, all possible majorities of the replica sets, for both reads and writes, have to overlap. This is done by computing the majorities for each and every step; we have a special simulator for that, and you can check the Cassandra code for more details. Every subsequent event must wait for the previous event to be acknowledged by a majority of the replicas owning the range before it can be submitted. The progress barrier essentially prevents the next operation from being executed before the previous one has been acknowledged and manifested by a majority of the existing replica set.

The next step is something we call mid join. In this step, we simply replace the node that is losing the range with the node that is gaining the range in the read replica set, without touching the write replica set. And in the last step, we fully manifest X as a complete and rightful owner of the token it has bootstrapped for.
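Stepping back to the progress barrier for a moment, here is a minimal sketch (hypothetical, much simpler than the real implementation) of the idea: block the next step of a multi-step operation until a majority of the affected replicas have acknowledged the epoch introduced by the previous step.

```java
// Wait until ceil((n+1)/2) replicas report having seen a given epoch.
import java.util.List;

class ProgressBarrierSketch {
    // Pretend RPC: ask a replica whether its local epoch >= the awaited one.
    static boolean hasManifested(String replica, long epoch) {
        return true; // stub; a real implementation would make a network call
    }

    // Block until a strict majority of `replicas` has seen `epoch`.
    static void await(List<String> replicas, long epoch) throws InterruptedException {
        int majority = replicas.size() / 2 + 1;
        int acks = 0;
        while (acks < majority) {
            acks = 0;
            for (String r : replicas)
                if (hasManifested(r, epoch)) acks++;
            if (acks < majority) Thread.sleep(100); // retry until quorum catches up
        }
    }

    public static void main(String[] args) throws InterruptedException {
        // The start-join epoch must reach a majority of {A, B, C} before
        // streaming to the new node may begin.
        await(List.of("A", "B", "C"), 42L);
        System.out.println("majority reached; next step may be submitted");
    }
}
```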
So, having walked the node through all of these steps, it now serves as both a read and a write replica for the ranges it owns. To summarize the bootstrap process: we are now bootstrapping using ranges, not tokens. In order to bootstrap, the node first has to contact one of the nodes in the cluster, and that node will tell it who the CMS is. The CMS gives it the latest information about all the epochs. The node catches up to the latest epoch, then it plans and splits the ranges, starts streaming, and moves through all the necessary steps in order to get bootstrapped. Between every operation we have a progress barrier that prevents progress until a majority of the nodes have experienced the epoch that precedes the next one. And finally, we unlock the ranges and the node is fully bootstrapped.

Of course, you may be concerned about what happens if you lose the majority of the cluster metadata service nodes. It's not a problem: we have a series of scripts that allow you to quickly and easily recover from that. Essentially, you can introduce new CMS nodes by forcing a, so to say, two-phase commit over the existing nodes. We expect this to never be used by anybody; we wrote it to reassure everybody that if something really, really bad happens, you still have a way out, but we cannot really conceive of a scenario where something like that would be necessary.

And of course, we've done a lot of testing in order to make sure that this feature is as rock solid as possible. As you may know, Cassandra has a state-of-the-art simulator, and we've done hundreds of hours of simulation on it using the fuzz testing tool that we wrote, which is called Harry. We also wrote a special simulator for quorum intersections, in order to analyze and conclusively show that there is no way for us to collect non-intersecting majorities for the steps being executed: as long as we comply with the protocol given by the progress barrier, there can be no transient data loss (a toy version of this check appears below). And we have created a new set of tests which we call coordinator/replica tests. Here, you write tests from the perspective of a single node, either a coordinator or a replica, and the rest of the state of the cluster is completely simulated. So you can simulate arbitrarily large or small clusters by essentially scripting the behavior of a single node. This is very useful for testing scenarios like the transient data loss and recovery from it. We have written a bunch of tests for TCM as well, and you can of course take a look at all of them in the Cassandra tree.

And if you leave this presentation with just one word in your mind associated with transactional cluster metadata, I would like you to keep the word elasticity in mind. I think Cassandra has to become more elastic, more dynamic, and many of the things that transactional cluster metadata makes not only possible but actually trivial are listed on the slide; I don't want to read them all out to you. I'm personally most excited about the byte-ordered partitioner, and I have heard that, for instance, Branimir, who presented right before me, is also very excited, because it plays very nicely together with the trie-based memtables and SSTables.
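Returning to the quorum-intersection simulator mentioned a moment ago, here is a toy version (hypothetical, far simpler than the real one) of the property it checks: for the replica sets in effect at consecutive bootstrap steps, every possible write majority must intersect every possible read majority.

```java
// Brute-force enumeration of all majorities of two replica sets, asserting
// that every write majority intersects every read majority.
import java.util.*;

class QuorumIntersectionSketch {
    // Enumerate all subsets of `replicas` that form a strict majority.
    static List<Set<String>> majorities(List<String> replicas) {
        List<Set<String>> out = new ArrayList<>();
        int n = replicas.size(), need = n / 2 + 1;
        for (int mask = 0; mask < (1 << n); mask++) {
            Set<String> q = new HashSet<>();
            for (int i = 0; i < n; i++)
                if ((mask & (1 << i)) != 0) q.add(replicas.get(i));
            if (q.size() >= need) out.add(q);
        }
        return out;
    }

    static boolean allIntersect(List<String> writeSet, List<String> readSet) {
        for (Set<String> w : majorities(writeSet))
            for (Set<String> r : majorities(readSet)) {
                Set<String> overlap = new HashSet<>(w);
                overlap.retainAll(r);
                if (overlap.isEmpty()) return false;
            }
        return true;
    }

    public static void main(String[] args) {
        // During start-join, writes go to {A, B, C, X} while reads still go to
        // {A, B, C}: every write majority (3 of 4) contains at least 2 of
        // {A, B, C}, which intersects every read majority (2 of 3).
        System.out.println(allIntersect(List.of("A", "B", "C", "X"),
                                        List.of("A", "B", "C"))); // true
        // The unsafe pre-TCM interleaving from earlier: {B, C, X} vs {A, B, C}.
        System.out.println(allIntersect(List.of("B", "C", "X"),
                                        List.of("A", "B", "C"))); // false
    }
}
```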
Transactional cluster metadata is also going to serve as better metadata for the existing LWTs and for Accord. You will be able to improve your control planes, and many operations that were previously painful and had to be watched by a team of trained Cassandra professionals can now hopefully be executed completely painlessly.

So, to recap and summarize everything we talked about today: we talked about the motivation, why we wanted to introduce the concept of transactional cluster metadata, and we discussed why we would like to stop using Gossip for this subset of things. Of course, we are still going to use Gossip for things like load, failure detection and some other transient information about the nodes, but it is not going to be used for cluster metadata anymore; cluster metadata is going to be transactional. We discussed the requirements, what exactly we would like to achieve and the goals we set for ourselves before we set out on the project. You can also read the CEP document (CEP-21) on the Apache Cassandra Confluence; we presented it to the community, and many folks agreed that this is the way to go and that we should pursue it. Then we talked about the implementation, some details of the previous implementation and some details of the new one: basically, moving from the token model to the range model, and making the metadata changes strictly ordered and consistent. And we talked about the future and the avenues that transactional cluster metadata opens for us. That's pretty much it. Thank you very much for coming to the presentation. I appreciate your attention, and of course the books still await those who would like to ask a question. So go ahead and shoot. Yes, please.

[Audience question about simulating concurrent operations.] Yes, so in addition to the two simulations that I mentioned, we have a third one that I failed to mention, which is the metadata change simulator. The metadata change simulator picks one of the possible operations given the setup of the cluster. It first generates an arbitrarily large cluster, then looks at the state of the cluster and decides: can I bootstrap a node into the cluster? Can I make one of the nodes leave? Can I move one of the tokens in the ring? And so forth. Then it decides to perform that operation, and on the next step, instead of continuing with the bootstrap, it can choose to bootstrap one more node, or have one node bootstrapping while another node is leaving, or one node leaving and another one moving its token. So we definitely have simulated concurrent operations. This was a huge area of focus from the very beginning, because we thought that if we cannot achieve that, then what are we doing at all?

[Audience question about cold restarts.] I will repeat this one, because I think I haven't repeated the previous one: how do we expect this to affect cold restarts? I'm assuming that by cold restarts you mean rolling bounces, where you need to restart every single node in the cluster? So, once again: you would like to bring up a whole new data center with hundreds of nodes, none of which has been a member of the cluster before? Okay, in this case we have two scenarios. One scenario is that you first bring up some of the nodes that are going to be the cluster metadata service; you can bring up one node that is going to be the CMS.
That's the quickest and simplest scenario and the easiest to achieve; I would personally probably do that. Then you just start all the other nodes and they start registering, and registering is as easy as submitting a single LWT, a Paxos operation. It is very quick, currently two round trips, because we are using Paxos v2. I expect maybe there will be some contention, but even for a thousand-node cluster... we have actually simulated way more contended scenarios with Paxos v2. We have not simulated anything like that specifically with TCM, but I don't expect any problems. And in the scenario where you don't have any data in the cluster, instead of using the multi-step operations I presented previously, you can use something that we call unsafe join, which is a single operation, meaning that everybody just joins in one step. So they register, they join, and after that you have a thousand-node cluster. Essentially all the time taken is going to be startup time; you will not even notice anything on top of that. So that's my take on that. Anybody else? Josh?

[Audience question about how the expanded write set affects quorums.] Okay, so that's the difference from the concept of a pending range, which meant that you had to write to the majority of the previously existing nodes, plus one. We have changed it: we essentially expand the replica set by one, and the majority is the number of nodes divided by two, plus one. That means for three nodes plus one bootstrapping we have four nodes, so the majority is three nodes, but it can be three existing nodes. So you're not good with your old quorum: you have expanded your quorum, and you will have to write to all of them effectively. In this case, you have absolutely no problem: as you've expanded the quorum, you will have to write to three nodes instead of just two. And you can quickly revert the operation, meaning you can abort the bootstrap, and that's a very easy and quick operation. So that's the answer. Yes, please, in the white t-shirt.

Okay, the question is: what is the upgrade path from existing Gossip clusters to TCM clusters? The process is as follows. You bring up all the nodes with a version that is aware of the existence of transactional cluster metadata, and you start the upgrade process by running a simple nodetool command; I don't remember exactly, it's initialize CMS or something like that. There is documentation for it on the Cassandra website, of course. This command does the following. It ensures that every single participant in the cluster has the same Gossip view of the ring, meaning that they have converged in terms of Gossip, because otherwise, what state would we base our transactional cluster metadata upon? So that's the prerequisite. If you have some nodes that have a corrupted view of the Gossip state, you can actually ignore them. You can manually inspect it; in fact, during the upgrade we print the diff for you and say: here are the disagreeing nodes and here is where they disagree. And you can still say: you know what, I think these nodes are wrong, I don't know why, and I don't have time to figure it out, so let's just say that this Gossip state, the main one, looks fine to me.
So let's go ahead with this one. After that, we use a two-phase commit, which requires all the participants to respond during the first phase and say: yes, I promise that I will upgrade to a single instance of transactional cluster metadata. Because if you accidentally launch two upgrades, we would like to prevent that, and we would like nobody to have to think about how to avoid something like that. So we have to consult not just a majority of nodes but every single participant; this is the point of the two-phase commit. After all of them have promised, you go through the second phase and essentially instantiate the first instance of transactional cluster metadata, and after that everything goes on as before, just with TCM instead of Gossip. There is a single limitation: during the upgrade process the metadata is essentially locked, so you cannot bootstrap or decommission nodes. But the word on the street is that if you are upgrading, you'd better not have bootstraps in flight at the same time anyway, because there can be several conflicting things in flight; most people don't do that. So that's the only limitation. There is a way to downgrade as well: you essentially say, convert the transactional cluster metadata back to the Gossip state, the Gossip state is propagated, and you downgrade. But I don't really anticipate anybody needing this, barring issues with something else in Cassandra.

Yes? Sorry, I'm really sorry, I understand what you're saying, I just don't hear the question. Could you phrase it in the form of: what are you asking? Are you trying to achieve something, or do you think that the way we've implemented it has some... yes, thank you. Okay, now I see. Of course, transactional cluster metadata can be used to dynamically move individual partitions between nodes. This is absolutely true. The only reason we have not done that yet is so that everybody in the community can keep using their current Cassandra tooling, because I'm assuming that you actually have the concept of tokens in all of your scripts, right? There is absolutely nothing preventing you from saying that certain ranges are owned by a single node, or even that a single token is owned by a subset of nodes. You can even say that the replica set for a given token is not three but five or seven; anything goes, really. From the perspective of transactional cluster metadata, everything is just a range and a set of nodes that owns the range. So this is already implemented. The reason we have not exposed it is that we wanted everybody to have an opportunity to transition first to the system where everything is transactional; after that, we will start introducing new tooling that will allow dynamic repartitioning and different ways of load balancing, and reintroduce the byte-ordered partitioner, which would also mean that you will be able to essentially have shards in Cassandra, and so forth. So this is in the works; everything needed for it is essentially implemented as a prerequisite, but the features themselves are still to be done. This is planned.
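As a closing illustration, here is a minimal sketch (hypothetical types, not Cassandra code) of the placement model described in that answer: ownership is just a mapping from ranges to replica sets, with no requirement that replica sets follow ring neighbors or have a fixed size.

```java
// Ranges mapped to arbitrary replica sets, decoupled from token positions.
import java.util.*;

class PlacementSketch {
    record Range(long start, long end) {}   // [start, end), no wraparound

    public static void main(String[] args) {
        Map<Range, List<String>> placements = new HashMap<>();
        // A range owned by ring-style neighbors...
        placements.put(new Range(0, 100), List.of("A", "B", "C"));
        // ...a range pinned to an arbitrary subset of nodes...
        placements.put(new Range(100, 200), List.of("D", "E", "F"));
        // ...and a hot range with a wider replica set (RF=5 just for it).
        placements.put(new Range(200, 300), List.of("A", "C", "E", "F", "G"));

        placements.forEach((range, replicas) ->
            System.out.println(range + " -> " + replicas));
    }
}
```

Under this model, the current token-based placement is just one particular way of populating the map, which is why full compatibility with existing tooling can be preserved while the new capabilities are rolled out.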