Hey, good morning, everyone. My name is Scott Andreas. I'm an engineering manager at Apple in our cloud services group, focused on Apache Cassandra. I'm here this morning to offer a retrospective on where Cassandra has been over the past decade or so, and to share some possibilities that recent work like transactional metadata and Accord opens up for the database. The features I'll be sharing today aren't exhaustive, but they're some of the ones I've been a little closer to. It's so good to see all of you. I'm glad that we can have an in-person event again, and I hope that we can connect well at the conference.

We'll start by framing the context of Cassandra's emergence in the original Dynamo paper. Then I'll take this as a jumping-off point to follow how Dynamo-derived systems like Apache Cassandra evolved. We'll survey what stuck and some concepts that didn't pan out quite as well, at least for us. And finally, I'll offer some thoughts on what's new and what I see as next in majority quorum databases. This includes an introduction to the new leaderless Paxos protocol developed to implement cross-shard transactions in Dynamo-derived systems that execute in one WAN round trip, which we call Accord. Other features that we'll talk about aren't necessarily committed, accepted, or even discussed roadmap items within the Apache Cassandra project, but maybe they will be. The question this talk poses, though, is whether systems that begin from humble origins, but with a few really uniquely powerful capabilities, can build on that architecture as an asset and slowly close the capability gap with traditional databases through about a decade of hard work. You can bet, because I'm here today, that I think the answer is yes. So let's dive in.

The Dynamo paper and Apache Cassandra, which was modeled after it, were my introduction to distributed systems. I was a liberal arts grad and early-career engineer just learning how to build services when a colleague shared it with me. Dynamo described a system for storing shopping cart state, functionality that would typically have been served by a relational database at that time, but in a very un-database-like way. I was fascinated by the path this paper carved to solve a very specific problem, and the set of scenarios it seemed to anticipate along the way. I'd love to thank all of the Amazonians that are here, and everyone involved in that work, for what that paper taught me.

So what did the paper claim? The paper didn't call Dynamo a database, and it seemed really careful not to use that word. The system was a highly available key-value store designed with a leaderless architecture focused on incremental scalability, properties that were pretty uncommon for contemporary databases at that time. The API was very simple: it stored only opaque bytes and let you retrieve them given the key. Availability was paramount for that system, and it meant meeting an aggressive latency SLA at the 99.9th percentile. But there's a lot that it left out. There wasn't a query language. There was no query planner or optimizer; much of that work was delegated to a smart client. Dynamo had no concept of transactions. There weren't indexes, there was no MVCC, no snapshot isolation. There were no rich data types, no aggregations, views, functions, or procedures. The paper didn't even describe a storage engine; it noted that BDB and MySQL were both pluggable options.
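To put that slimness in perspective, here's roughly the whole API surface the paper describes, sketched in Java. The names and types are mine, not Amazon's; treat this as a reading of the paper rather than real code.

```java
import java.util.List;
import java.util.Map;

// A minimal sketch of Dynamo's API surface as described in the paper:
// opaque keys and values, no query language, no transactions, no
// indexes. Names and signatures here are illustrative, not Amazon's.
public interface DynamoStyleStore {
    // get() may return multiple conflicting versions, each tagged with
    // the vector-clock "context" the client passes back on write.
    List<VersionedValue> get(byte[] key);

    // put() stores opaque bytes; the context lets the store reconcile
    // concurrent writes, or hand conflicts back to the application.
    void put(byte[] key, Context context, byte[] value);

    record VersionedValue(byte[] value, Context context) {}
    record Context(Map<String, Long> vectorClock) {}
}
```

That's it: no scans, no filters, no schema. Much of the rest of this talk is about what you can build on top of a surface that small.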
So if you think about contemporary storage systems that were popular at that time, like Oracle, MySQL, SQL Server, and Postgres, these were really glaring omissions for something that acted as a data store. And that paper, about 16 years ago, sparked a lot of discussion and interest in what became the field of non-relational data stores. Dynamo was successful because it took a use case that typically would have been served by an RDBMS, framed a very specific set of goals, namely availability, latency, and scalability, pivoted its entire universe around those capabilities, and then optimized for them at the expense of features that any RDBMS would have offered at the time.

And that raised a somewhat existential question among distributed systems practitioners: what is a database? It also drew some really reasonable eye-rolling from people who had worked in the capital-D database engineering field for a long time. They were, and are, quick to remind the world not to confuse a distributed hash table with a database. But some started to wonder whether there were more database-adjacent problems that could be solved with an architecture that had unparalleled availability, scalability, and latency characteristics, and also how much people were willing to give up for it. Crucially, a lot of people agreed with that idea, sometimes a little too readily. If you remember the database landscape around 2010 to 2013 or so, and some of the work that followed in Kyle Kingsbury's Jepsen research, that work taught us that many early data stores from that era were dangerously unsafe and failed to provide the properties they advertised, and yet still enjoyed a surprising degree of market success, which surprised a lot of people too. Kyle's still doing this work today, and things haven't gotten a whole lot better.

This pattern isn't a new one, though. If you've read it, Clayton Christensen framed it really well in a book called The Innovator's Dilemma, which featured a case study on the disk drive industry at the time of early personal computers. Christensen identified a pattern: a new technology solves a problem in an undeveloped market. It lacks table-stakes features that most would regard as critical, but it finds a niche in which it can establish itself and succeed. Then over time, as it gains market share, that technology gains the capabilities needed to enter minor segments of mature markets, then slowly chips away at that gap while building a moat around what it really excels at, until it can compete with entrenched players. Eventually, those entrenched players retreat upmarket to more sophisticated segments, and the new technology becomes firmly established, gaining most of the missing capabilities while renegotiating with the market which requirements really are table stakes. This process of succession maps really well to some of the transformations we've seen in the database industry over the last decade: emerging products missing major features carve out a niche, then buy time to grow what they lack and move into more established markets, building a moat around what made them disruptive in the first place.

And this is where Apache Cassandra comes in. Cassandra was co-authored at Facebook by an original author of the Dynamo paper at Amazon. It took many aspects of the Dynamo design, like data placement via consistent hashing, tunable consistency, gossip, hinted handoff, Merkle trees, and timestamp-based conflict resolution.
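If consistent hashing is new to you, here's a toy token ring to show the placement idea. This is a sketch of my own; real Cassandra uses Murmur3 tokens, virtual nodes, and pluggable replication strategies.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

// Toy token ring: each node owns a token; a key is placed on the first
// node at or after the key's token, wrapping around, and replicated to
// the next RF - 1 distinct nodes. Only a sketch of the idea.
public class ToyTokenRing {
    private final TreeMap<Long, String> ring = new TreeMap<>();

    public void addNode(String node, long token) {
        ring.put(token, node);
    }

    public List<String> replicasFor(long keyToken, int rf) {
        List<String> replicas = new ArrayList<>();
        // Walk clockwise from the key's token...
        for (String node : ring.tailMap(keyToken).values()) {
            if (replicas.size() == rf) break;
            replicas.add(node);
        }
        // ...and wrap around the ring if we ran off the end.
        for (String node : ring.values()) {
            if (replicas.size() == rf) break;
            if (!replicas.contains(node)) replicas.add(node);
        }
        return replicas;
    }

    public static void main(String[] args) {
        ToyTokenRing ring = new ToyTokenRing();
        ring.addNode("a", 0L);
        ring.addNode("b", 100L);
        ring.addNode("c", 200L);
        System.out.println(ring.replicasFor(150L, 2)); // prints [c, a]
    }
}
```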
Cassandra added a few features that weren't included in the original Dynamo paper, like rich data types and an LSM storage engine. But it sure had a lot of growing up to do. Dynamo's tight framing of its own problem statement was critical to its success. But what problems do Dynamo-derived systems like Apache Cassandra try to solve today? Most of these are extensions of the design goals of the original paper.

For us today, scalability and availability remain paramount, with a pretty incredible range from 3 to 1,500 database instances per cluster. Availability standards are very high. Dynamo's architecture enables Cassandra to deliver five nines of availability easily, and with multi-region deployments especially, six nines or more. This is only possible because all maintenance operations are designed for zero downtime. We'll talk a bit more about how the majority quorum architecture contributes to that in a bit, and why it's so critical that we preserve it. Cassandra also targets wide distribution, with databases deployed active-active across five regions. It can store many petabytes of data in a single database and serve millions of queries per second. It offers strong consistency and linearizable transactions. There aren't many databases that can do these things in our industry, and especially few that are open source. So within our Innovator's Dilemma frame, we've moved upmarket a lot from Dynamo's origin, which had very loose guarantees and a very slim API. A question we can ask ourselves is: how far can we push this while building on the strength of Dynamo's scale, availability, and distribution characteristics?

It's useful to distill some of the most important Dynamo properties that helped Cassandra get to this point, because we probably want to double down on those. They're what have really made us successful. For Cassandra, these are Dynamo's leaderless architecture and majority quorum design. The simplicity of the leaderless architecture was brilliant. It avoids bottlenecking on a distinguished transactor. It eliminates the impact of gray-mode failures in the case of a leader on failing hardware: think of a bad disk that's slow to respond, or a dead fan that results in thermal throttling that might not otherwise be detected. It eliminates the risk of blips during leader elections, or the impact of a failed election or loss of a leader. Queries and transactions can be initiated and led from any region, active-active. With three or even five regions, this enables Dynamo-derived systems to remain available and consistent during even major disruptions to internet infrastructure. Transactions can be initiated from any region with equal latency. And crucially, failure domains in Cassandra aren't global; they're scoped to a small set of, say, three replicas for a given key. In a cluster of, say, 1,500 instances, this confines the blast radius of concurrent hardware failure to just 0.2% of the cluster. But individual hardware failures have almost no impact at all, thanks to the next attribute I mentioned a moment ago: Cassandra's majority quorum design. This approach was a great design to build a database around. In majority quorum systems, the probability of failure is a function of the likelihood of concurrent failures of replicas across multiple failure domains. That's a powerful concept.
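Here's the back-of-the-envelope arithmetic behind those numbers, using the same figures from a moment ago (RF 3 in a 1,500-instance cluster); the code is just a worked example of the claim.

```java
// Back-of-the-envelope arithmetic for the blast-radius claim above.
// With RF = 3, a majority quorum is floor(3 / 2) + 1 = 2, so any single
// replica can fail with no impact at all, and even losing 2 of 3
// replicas for a key affects only that key's 3-instance replica set:
// in a 1,500-instance cluster, that's 3 / 1500 = 0.2% of the cluster.
public class QuorumMath {
    static int quorum(int replicationFactor) {
        return replicationFactor / 2 + 1;
    }

    public static void main(String[] args) {
        int rf = 3, clusterSize = 1500;
        System.out.printf("quorum size: %d of %d replicas%n", quorum(rf), rf);
        System.out.printf("tolerated failures per replica set: %d%n", rf - quorum(rf));
        System.out.printf("blast radius: %.1f%% of cluster%n", 100.0 * rf / clusterSize);
    }
}
```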
Cassandra's data placement strategies can be made aware of the failure domains of the data center and the cloud they're deployed on to achieve even greater availability. For physical DCs, this means replicating data across rack, power, and networking failure domains. The database's awareness of the physical topology of the data center makes concurrent replica failures incredibly rare. Similarly, control planes that are aware of data center maintenance activities can green-light or deny non-essential maintenance, like upgrading software on a switch that would take down a rack, or deploying a new software release, which can even further mitigate the risk of concurrent failure. This approach maps really well to cloud concepts too, with availability zones and regions mapping similarly. Apache Cassandra achieves strong consistency by ensuring that all reads and writes are served by overlapping quorums, and, in cases where replicas disagree, by blocking read repair of the most recent value before returning to the client and publishing that value outside the system. This produces a consistency contract that matches what a user would experience when querying a single-machine database like Postgres at a non-transactional isolation level. Together, Dynamo's leaderless architecture and majority quorum design are at the heart of Apache Cassandra's scalability and availability, and contributors to the project have designed approaches on top of them that achieve the consistency and even linearizability characteristics of traditional databases. This took a long time.

So, thinking back to the Innovator's Dilemma example of moving upmarket by gaining features, let's take a look at Cassandra's evolution of these database-like features built on a Dynamo origin. In 2011, Cassandra gained a SQL-like query language called CQL, with rich data types as well as secondary indexes. CQL was unique among query languages in that it was designed to let users declare a data layout on disk that was optimized for efficient storage and retrieval of what the user wanted to store, rather than something meant to be passed to an optimizer for planning and execution. And this is, I think, one of the genius elements of CQL. If you think about SQL, the goal was to express queries in terms of a relational algebra that let the database construct a plan and execute it. CQL's design, by contrast, focused on declaring the data layout and retrieval strategy, mapping what the user was trying to access directly, with real mechanical sympathy, to how the data was stored on disk. (I'll show a small example of what I mean in a moment.) In 2012, the database properly supported strong consistency, and in 2013 it added linearizable transactions for a single key, built on Paxos, which we'll talk more about later. In 2015, Cassandra grew a new storage engine. And now Cassandra's gaining strong HTAP capabilities, such as deep integration with Apache Spark, making it possible to query the database at over 600 gigabits a second, which Dinesh presented on yesterday, as well as distributed transactions with strict serializable isolation across multiple tables.

That's all pretty exciting, but what's with this little gap here in the resume? During this gap, the Apache Cassandra project confronted the quality debt that plagued many non-relational databases at the time. This took several years, which you might have noticed during Josh's presentation yesterday on the health of the community, which touched on some of the time it took to ship Cassandra 4.0.
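Here's that CQL example I promised: a sketch of my own using the DataStax Java driver, with an illustrative schema that assumes a demo keyspace already exists. The PRIMARY KEY and CLUSTERING ORDER clauses declare partitioning and on-disk sort order directly.

```java
import com.datastax.oss.driver.api.core.CqlSession;

// A sketch of CQL's layout-first design. The schema is my own example,
// and it assumes a "demo" keyspace already exists: the PRIMARY KEY
// clause declares that readings are partitioned by sensor_id and
// stored sorted by recorded_at descending, so the "latest N readings"
// query below is a sequential read of one partition on disk.
public class CqlLayoutExample {
    public static void main(String[] args) {
        try (CqlSession session = CqlSession.builder().build()) {
            session.execute("""
                CREATE TABLE IF NOT EXISTS demo.readings (
                    sensor_id   uuid,
                    recorded_at timestamp,
                    value       double,
                    PRIMARY KEY ((sensor_id), recorded_at)
                ) WITH CLUSTERING ORDER BY (recorded_at DESC)""");

            // Retrieval matches the declared layout directly;
            // there's no planner deciding how to satisfy this.
            session.execute(
                "SELECT value FROM demo.readings WHERE sensor_id = ? LIMIT 10",
                java.util.UUID.randomUUID());
        }
    }
}
```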
Becoming a database that competed at the upper tiers of the market really required foundational work to make the system safe and correct. One downside of Dynamo's resilience was that its design can be devilishly good at hiding inconsistency, and fiendishly hard to investigate after the fact. Property-based testing was among the most effective strategies the community adopted to address these issues. These tool chains work by fuzzing a Cassandra cluster with randomized queries and schemas while keeping track of the responses the client has observed from the server. These responses provide a signal to a model that enables it to differentiate between what should have been allowed to happen and what could not have happened inside the database. Some examples would be that acknowledged writes can't be dropped and must be observed by later reads, or that timed-out writes may or may not be visible to a later read, because they're indeterminate. Maintaining this model provided the basis for asserting correctness, a test that's especially powerful when paired with fault injection like cluster restarts, range movements, and instance replacements. These tests can be executed both on a developer's workstation and at scale in the cloud, and deterministic seeds really helped the project improve reproducibility. (I'll sketch the shape of that response model in a moment.)

Simulation was another technique that project contributors adopted, and this one was invaluable for validating the evolution of our Paxos implementations. By simulation, I mean generating randomized input, then taking full control over the execution of Cassandra and modeling it as a single-threaded program, controlling all aspects that could introduce non-determinism, like time, disk IO, thread scheduling, and lock acquisition, to give you something that effectively behaves as a single-threaded database. This means that from a given seed, regardless of the hardware or operating system platform, execution of the suite is guaranteed to be completely deterministic. This is a really powerful primitive. In the context of Cassandra's transaction subsystem, we simulate thousands of hypothetical clusters and topologies and millions of Paxos transactions, paired with a verifier that asserts the linearizability of the result. Simulation enables at-scale fuzzing of consensus algorithms with near-instant local reproducibility. This has dramatically improved our ability to evolve Cassandra's Paxos implementations, and our confidence in doing so. These techniques have also helped us identify and resolve over 30 critical bugs that resulted in data loss or incorrect query responses, and today they provide us tools to think with while we develop new features in Cassandra.

In the early days, the project also used to deprecate features in a manner that required action from many users, which resulted in users often running database versions somewhat far behind the most current release. But we learned that heavy deprecation of features forces work onto those users, work that they often won't do, or can't do in the case of a vendor-deployed product that uses Cassandra, which means the project risks losing those users. That meant finding strategies to carry forward some previously deprecated features to enable drop-in upgrades across versions, something we value pretty highly today.
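As promised, here's a stripped-down sketch of the shape of that response model. The real tool chains, like the project's Harry fuzzer, track far more (schemas, topologies, fault injection); the names here are mine.

```java
import java.util.HashMap;
import java.util.Map;

// A stripped-down sketch of the visibility model described above:
// acknowledged writes must be observed by later reads; timed-out
// writes are indeterminate and may or may not appear. Names are mine,
// and real harnesses track far richer histories than "last write".
public class VisibilityModel {
    enum WriteOutcome { ACKNOWLEDGED, TIMED_OUT }

    private final Map<String, WriteOutcome> lastOutcome = new HashMap<>();
    private final Map<String, String> lastValue = new HashMap<>();

    void recordWrite(String key, String value, WriteOutcome outcome) {
        lastOutcome.put(key, outcome);
        lastValue.put(key, value);
    }

    // Checks one observed read against the model; true means legal.
    boolean checkRead(String key, String observed) {
        if (lastOutcome.get(key) == WriteOutcome.ACKNOWLEDGED) {
            // An acknowledged write can never be dropped.
            return lastValue.get(key).equals(observed);
        }
        // A timed-out write is indeterminate: observing either the old
        // or the new value is legal, so this simplified check passes.
        return true;
    }
}
```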
And finally, we had some reflecting to do. We've come this far, but what features are we missing? There's a lot of good stuff ahead, some of which you've probably heard about so far this week.

You might have heard from Piotr's presentation that storage-attached indexes are coming to Apache Cassandra 5.0. SAI is a scalable, partition-local index. These indexes achieve feature parity with existing secondary indexes, but with much better performance, storage, and operational characteristics. Because they're implemented as SSTable components, they get zero-copy streaming for free, meaning they get shipped straight over the wire at link rate rather than triggering expensive index rebuilds as they're streamed in. The initial release adds support for AND queries, numeric ranges, fixed-length numeric types, text indexing, and equality queries, as well as optional case sensitivity and Unicode normalization. Soon, we expect to incorporate the prefix, LIKE, and OR queries that you might be used to in other databases. In the same way that Cassandra moved upmarket from a Spartan feature set 10 years ago, SAI will enable indexes to rapidly gain new capabilities by incorporating other functionality from Lucene's search and indexing libraries, another Apache project that the Cassandra community can greatly benefit from.

Transactional cluster metadata is a down-to-the-studs remodel of cluster membership in Cassandra that's designed to resolve some of the database's oldest and most persistent problems. As Alex presented yesterday, gossip and epidemic broadcast were in vogue in distributed systems research around 2007, and they're fun techniques and solutions for systems that might have 100,000 or more nodes, but that typically doesn't describe most Cassandra deployments. It was impossible to build correct and reliable state machines on gossip, because that model relied on one node deciding to do something, like claiming ownership of a token and stomping on existing state, and then telling everybody else, who then had to take compensating action as they learned about it. You can't build a great state machine on that. TCM enables safe and rapid modifications to cluster state, which improves Cassandra's elasticity. It eliminates schema divergence. So if you've ever found that schemas diverged after making rapid DDL changes, or maybe you used a library like Liquibase and then had a mess to clean up after, this is a problem that simply won't happen anymore with transactional metadata. And if you've got a lot of scaffolding that orchestrates operations in Cassandra, like instance replacements, repair, upgrades, and so on, this work may enable the database to take responsibility for some of those operations itself in future releases.

But this last one here is likely to be pivotal to the future of Cassandra. Today, Cassandra distributes data based on tokens, usually via random partitioning. Ordered partitioning is necessary for cross-partition ordered scans and joins, but to do it as a database, you have to be able to split hot or dense ranges. TCM provides the machinery that might allow for efficient range splitting in a future release, as well as a transition from token ownership to range-based ownership, where those ranges could be redefined and moved. This is gonna be a super heavy lift, but if we complete it, it would enable Cassandra to become a completely new kind of database, pivoting from being a multi-dimensional unordered map to an ordered map, much like Spanner and DynamoDB.
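To make that unordered-to-ordered pivot concrete, here's a toy contrast of my own: with ordered, range-based ownership, a prefix or range query becomes a contiguous sweep rather than a scatter-gather across every token range.

```java
import java.util.NavigableMap;
import java.util.TreeMap;

// My own toy illustration of the pivot described above: random
// partitioning scatters keys by hashed token, so a range scan must
// touch every token range in the cluster; ordered, range-based
// ownership keeps keys sorted, so a scan is one contiguous sweep.
public class OrderedOwnershipSketch {
    public static void main(String[] args) {
        NavigableMap<String, String> ordered = new TreeMap<>();
        ordered.put("user:alice", "...");
        ordered.put("user:bob", "...");
        ordered.put("video:intro", "...");

        // With ordered ownership, "all keys with prefix user:" is a
        // contiguous slice: the building block for scans and joins.
        // (';' is the character after ':', bounding the prefix range.)
        System.out.println(ordered.subMap("user:", "user;"));

        // Under random partitioning, the same query would require
        // consulting every replica set in the cluster.
    }
}
```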
Next up is witness replicas. This is a feature that was controversial a few years ago, but I'd love to share more about why I find it extremely important to the future of Cassandra. If you've ever had an exec or finance partner come to you and say, "we need to get our cloud costs down," you'll probably care a lot about this. This feature could save you about 30% on your storage costs. Today, Cassandra couples quorum size and replication factor; they're the same to the database. That means that if you have three replicas and three data centers, you're paying to store nine copies of the source data. And if you deploy on top of a replicated block device like EBS, which replicates data a further 2x under the hood, those nine copies are actually 18 copies of the original data. This is incredibly inefficient. Witness replicas build on a fundamental invariant of incremental repair: IR guarantees that the repaired set in Cassandra is immutable and consistent across replicas. Because the repaired set is guaranteed to be identical across replicas, there's no point in querying more than one copy of it. This means we can drop the durable RF from three to two without impact, and then we only have to read one of those copies as well. Quorum reads are guaranteed to be consistent if they include one consistent copy and a quorum of the unrepaired sets. This feature shipped as experimental in Cassandra 4.0, but contributors are working this year to bring it up to scratch and to remove that experimental tag in a future release. So please don't go turn it on now, but think ahead. Saving 30% on storage is a pretty big deal, and we hope to have more to share in a year or so.

Distributed ACID transactions are our next focus. Some of the design goals we've established are as follows. We think users should be able to transact over any subset of keys in the database, including across tables. We think that Cassandra should support strict serializability, the strongest level of isolation possible. We want to achieve optimal latency: one WAN round trip for all transactions under normal operation. We also want optimal fault tolerance: latency and performance must be resilient to a minority of replica failures. In terms of scalability, we don't want a single point of coordination or a bottleneck introduced. And we value portability as an open source project: no specialized hardware should be required by the feature. All of these sound like the capabilities that Dynamo was built around, and a lot of them are echoed today in our design for distributed transactions.

To achieve this, project contributors have designed a new Paxos protocol that's harmonious with Cassandra's architecture, called Accord. The name Accord is a nod to Leslie Lamport's Part-Time Parliament paper that introduced the original Paxos algorithm; here, it's adapted to broker consensus between parliaments, or sets of replicas that form a consensus group. Think of the name like a grand bargain between parliaments. Enabling transactions over the whole of the database without a leader means that the members participating in a Paxos round may vary from transaction to transaction. So when I say leadership here, leadership is scoped to the coordinator of a single transaction, just a single query. Accord achieves this via a hybrid logical clock, and without specialized hardware or time distribution infrastructure. Transactions execute in one round trip if a sufficient number of replicas are available, falling back to two round trips in rare cases.
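Since "hybrid logical clock" can sound exotic, here's a minimal sketch of the general idea, after Kulkarni et al.'s HLC work, and not Accord's actual implementation: pair the highest wall-clock time you've seen with a logical counter, and you get totally ordered, causality-respecting timestamps with no special time hardware.

```java
// A minimal hybrid logical clock sketch (after Kulkarni et al.), not
// Accord's actual implementation. Timestamps pair the highest wall
// clock observed with a logical counter, giving a total order that
// respects causality without GPS or atomic-clock infrastructure.
public class HybridLogicalClock {
    private long maxPhysical;  // highest wall-clock millis observed
    private long logical;      // tie-breaking counter

    // Issue a new timestamp for a local event or outgoing message.
    public synchronized long[] now() {
        long wall = System.currentTimeMillis();
        if (wall > maxPhysical) {
            maxPhysical = wall;
            logical = 0;
        } else {
            logical++;  // clock hasn't advanced; bump the counter
        }
        return new long[] { maxPhysical, logical };
    }

    // Merge a timestamp received from another node, so local time
    // never runs behind anything already observed. (Simplified: a full
    // HLC also folds the local wall clock into this merge.)
    public synchronized void observe(long physical, long counter) {
        if (physical > maxPhysical) {
            maxPhysical = physical;
            logical = counter + 1;
        } else if (physical == maxPhysical && counter >= logical) {
            logical = counter + 1;
        }
    }
}
```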
We validated the protocol by writing a formal specification and proof of the design, and by asserting linearizability via simulation of millions of transactions across thousands of logical clusters in the software as implemented. But that still doesn't mean everything's perfect. So we've continued to collaborate with academic researchers, who have identified a couple of additional problems with the protocol and then suggested solutions to them. I've never before received a bug report in the form of a three-page LaTeX-typeset document, and it was stunning to get that from a group who then said, "also, we can fix it." So we value that research collaboration very much as well. There's a link to the paper on this slide; you can also find it on the Apache Cassandra Confluence wiki.

A lot of distributed databases use leader-oriented protocols, like a distinguished transaction manager, Multi-Paxos, or Raft. Why is Cassandra thinking leaderless? Well, that comes back to the Dynamo architecture again, which presumes no distinguished nodes. Every instance operates as a peer. This eliminates bottlenecks on a leader, it avoids gray-mode failures, and it leaves a path for scalability if you bound the size of the consensus set you're transacting over. It also preserves the design simplicity that's at the heart of just about everything Apache Cassandra has achieved. For multi-region applications, it also enables initiating transactions from any region with predictable latency. Active-active apps can issue transactional writes from any region without first having to route them to a primary region, unlike Multi-Paxos. Finally, Accord achieves single round trip latency, meaning the same network latency you would incur from a read or write at quorum. This speed improvement, down from four round trips for transactional writes just a year ago, will make a new class of applications possible. When you think about distributed transactions, or even just a linearizable write, becoming as cheap as a quorum read or quorum write, that opens a lot of new possibilities.

I shared this slide yesterday during the keynote with Patrick and Ekaterina, but I'd like to zoom out a bit from the protocol to think about what impact this work can have on Apache Cassandra as a petabyte-scale database in relation to the wider field. With this work, Cassandra will become the only distributed database that can scale to petabytes of data, offer strict serializable isolation, be deployed in any data center or public cloud, offer the availability and performance advantages of a leaderless database, and deliver single round trip latency for transactions, whether single-key or multi-key, local or remote. This slide doesn't mean that Cassandra is the very best database for all applications; there are tons of other characteristics one could optimize for that aren't captured here. But the set of capabilities is incredibly compelling, something we wouldn't have expected to be possible at the outset, and something that I certainly don't think was anticipated by the Dynamo paper about 15 or 16 years ago.

But there's a lot ahead. There are a lot of new possibilities that distributed transactions open for us. Strong transaction primitives let us rethink indexes: an index could be modeled as a transactional mutation to a base table and a derived table. The transaction subsystem composes well with SAI today, and might allow us to build even more on top of SAI in the future.
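To sketch what that could look like: the API below is entirely hypothetical, not Cassandra's or Accord's real interface, but it shows the shape of the idea, with the base-table write and the index-table write committed as one atomic transaction.

```java
import java.util.List;

// Entirely hypothetical API, not Cassandra's or Accord's real
// interface: a sketch of modeling an index as a transactional mutation
// to a base table and a derived (index) table, applied atomically.
public class TransactionalIndexSketch {
    record Mutation(String table, String key, String column, String value) {}

    interface TransactionCoordinator {
        // Applies all mutations atomically with strict serializable
        // isolation, in one WAN round trip under normal operation.
        void commit(List<Mutation> mutations);
    }

    static void updateEmail(TransactionCoordinator txn, String userId, String email) {
        txn.commit(List.of(
            // The base-table write...
            new Mutation("users", userId, "email", email),
            // ...and the derived index-table write, in one transaction,
            // so the index can never drift from the base table.
            new Mutation("users_by_email", email, "user_id", userId)));
    }
}
```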
Strong transaction primitives also let us rethink materialized views, a feature that the project needed to retroactively classify as experimental. You could think of modeling a materialized view as a transactional mutation to a base table and a derived table with a predicate. They provide a basis for enforcing foreign key relationships within the database, something Cassandra's never been able to do. They could be a primitive that enables Cassandra to support multi-version concurrency control, or even snapshot isolation via something like an epoch visibility protocol. Auto-balancing data placement solves the biggest problem with Cassandra's byte-ordered partitioner. This lets us reimagine Cassandra as a multi-dimensional ordered map, with keys sorted within the cluster according to their lexical ordering. Snapshot isolation and ordered partitioning are building blocks that could enable the database to support features like cross-table joins. This may enable Dynamo-derived systems like Cassandra to support a large number of use cases, even relational ones. All of these things become possible by reimagining a Dynamo-derived system as a transactional database.

So we've come back to our frame: a cycle in which disruptive innovation slowly moves upmarket after carving out a niche, gaining new capabilities while building an impenetrable moat around a core competency like scalability and availability. But the news isn't all good, because when you get to the end of the book, you see that once you reach the top, there's usually another technology maturing right behind you. Some example moats for future databases might include things like hard multi-tenant isolation, making exceptional use of tiered or disaggregated storage, or full SQL support. I think all of these could be built in Apache Cassandra, but they're things to think about, because you've always got to watch your back.

Until then, though, we should be really proud of what the Dynamo architecture has enabled us to build. A database that can serve petabytes of data and millions of queries per second with up to six nines of availability. One that can replicate active-active across five regions. One that can be deployed in any data center or public cloud without specialized hardware. One that can execute leaderless transactions across the entirety of the database from any region. And one that can be downloaded, learned from, or modified by anyone. That's a great foundation to build on. Thanks, all of you, for your time today. It's great to see you here in person. We've got about three and a half minutes for questions if there's anything that folks would like to ask.

Thank you, that's a great question. The question was: as you evolve the database and gain new features, do you ever risk forsaking the foundations, or straying from what you were? I think that in any software system there's a constant renegotiation of identity, usually based on where you've come from and what you need to become. Now and then, some of the things you valued in the past might matter a little less in the future and need to be rethought. There are some examples of that in Cassandra's history, like replacing foundations such as gossip with things that are fundamentally transactional, and the shift from a focus on eventual consistency to strong consistency and linearizability. I don't think of those things as forsaking the history of the project so much as becoming what we need to be in light of where we've been.
"You mentioned relational, Scott. Can you talk a little bit more about that? What sort of use cases might Cassandra be able to address, and which ones might it not?"

That's a great question. I think that if we were to try to bite off becoming a fully relational database right from the beginning, that probably wouldn't be a good lift. But if you think about that frame of negotiation with the market over which features are required, I think there are a lot of semi-relational capabilities that are really easy for Cassandra to gain. If you look at APIs like DynamoDB's, which aren't fully relational but do allow for some cross-table or cross-partition activity, people have been very successful building on top of that API. So if you look at distributed transactions as features that allow you to enforce things like referential integrity constraints across tables, plus the possibility of a cross-table join if we were to move to ordered partitioning in the future, I think people could express a very large share of traditional relational applications in the database. That's a bit different from having a full ANSI SQL grammar capable of common table expressions and 14-way joins and aggregations, like you might have in a more traditional Oracle-style database. That's probably not a good goal for Cassandra, but I think it gives us at least a path to move a little further up and to the right. Maybe 50% of the relational market by volume. We've got about 30 seconds for one more question.

"One other feature that's new to Apache Cassandra that people may not know about, and it's pertinent to this, is vector search."

Absolutely, thank you for mentioning that. There's so much in 5.0, and coming after in 5.1, that it was difficult to get it all in one presentation. Vector search, I think, is a really great example of Cassandra building on capabilities like SAI to evolve features that are needed by new markets like artificial intelligence. Thank you all so much. We've just hit our time, but I would be happy to connect with any of you in the hall and would love to chat with you at this conference. Thanks again.