Welcome, everyone. This should be an interesting talk. I'm here to talk about a common problem that we at DataStax have been dealing with for the past few years, and it's probably something a lot of people trying to build a Cassandra platform for their business are seeing as well. My name is Jake. I'm a lead developer at DataStax, and I've been working on Cassandra for many years, since it came out. This is an evolution we went through internally, and I think it serves as a good example of a different way of thinking about Cassandra: how Cassandra can serve as a great bedrock for building an effective database as a service.

Now, this is specifically about Cassandra. The core premise is that Cassandra was built with the idea that it does everything itself. It doesn't rely on anything except TCP/IP and the kernel; everything else it does itself. We took the view that there's this new, modern world of cloud computing, infrastructure as a service, Kubernetes operators, object storage. How can we rethink Cassandra in that world, and how can we simplify some of the issues?

So this is probably the most controversial slide. Cassandra is awesome. I've built my career on it, and it's fantastic. A lot of the talks you saw today were about compaction strategies, trie-based memtables, SSTables, all these nifty, really low-level engineering bits that have been used to make Cassandra as scalable as it is. It's built and proven by many large organizations and lots of users all around the world. The problem, at least the one we've found, is that operating it in production can be difficult. It's getting better over time, but running DataStax Astra, we felt this pain in a specific way.

Part of the reason it's hard to operate now is, ironically, the things that used to make it easy to operate. Back in the day, before the cloud, before Kubernetes, before all these things, running Cassandra on your own was nice: you set up your seed nodes, you had fixed IPs, and everything just worked. But over the years, as the platform has gotten bigger and more features have been added, some of the design choices on the Dynamo side have actually made things difficult. If you went to Alex Petrov's talk earlier, he was talking about how we're trying to fix the consistency problem of our topology metadata. So this is a common problem people have.

One of the insights we've had is that because Cassandra is leaderless and all nodes are the same, they all do everything, that homogeneity is actually a crutch in this world. Your CPU, your memory, and your disk all have to scale at the same rate. You end up wasting money: my i3.4xlarge machines all get this much disk and this much memory, and I have a disk problem, so I have to add more nodes, even though my queries per second are relatively low. You end up in this weird place where the fact that you can't break things apart makes it hard.

And I think the last point is a good one too. Something I've seen a lot is that many Cassandra clusters out there are running at max scale, because they can't handle scaling up and then scaling back down.
So elasticity is a problem in the sense that you have to run at your peak, because if you run at anything less than your peak, you can't respond to a peak. So this is the world we started in. This is Cassandra as a service circa 2019: let's just run Cassandra clusters on Kubernetes. This is where a lot of people are today. What we tried to do was build Kubernetes operators, which automate some of the operational burden. The operator tries to deal with the impedance mismatch between Kubernetes and Cassandra: seed nodes, IPs changing, attached disks and the reservations of attached disks being assigned to specific nodes that get restarted, all that stuff.

What ends up happening is we're just going with the basic approach of running Cassandra in the cloud. It's on Kubernetes, but it's not using Kubernetes in the way it was meant to be used. It's like we're taking an apple, putting it into a cake, and calling it an apple cake, when what you really want is an apple pie. It wasn't thought of from the ground up. That's a terrible analogy, but, I don't know. Yes.

OK, so what would an ideal database as a service look like? That's the problem we had. The other problem we had is that if you're paying for this, you run into the same problem as running a Cassandra cluster yourself, where the minimum cost of running it is about $1,500 a month in instances. So it's difficult to attract people to the platform, because the threshold to start is very high.

So what would an ideal database as a service look like? Cheap, that's the most important thing: you just want to pay for what you use. You want it to actually be elastic, to fulfill the promise of being able to scale up and scale down. You want it integrated into the cloud ecosystem. You want it to be secure. You want it to be reliable, by building on things which are proven rather than building them yourself and waiting many years for them to stabilize. And you want it to be simple from an operational and development standpoint. We wanted to build something where, instead of someone joining and having to learn everything, they could focus on one individual piece.

As an aside, we started looking at, and this is in 2020, how do these cloud databases work? They scale massively based on load. You only pay for what you're using. They're deeply rooted in cloud architecture, built sympathetically to it, to leverage it to provide a better service. A great example is Aurora; if you read that paper, it's very informative about how you can build a cloud-based database. The problem at the time was that, other than papers, there really wasn't anything out there. Back in 2020, the only serverless cloud databases that really existed were from Amazon, Microsoft, and Google. There was one, FaunaDB, which came out of a Twitter engineering spinoff. Other than that, it was: well, how can we build it ourselves?

So that's what we set out to do. We decided: let's use Cassandra as a shared library. Let's not use Cassandra as-is. Let's use the parts of it that work really well for what they are, and let's use the bits of the cloud and Kubernetes that work really well in the cases where the Cassandra bits don't necessarily fit.
So the first choice we made was: let's stop using attached disks. Let's make S3 the source of truth for the data on disk. When we flush data from a memtable, we write it to S3 instead of the local disk, or rather to both, and we treat the S3 bucket as the source of truth. And if you went to the talk earlier about consistent cluster metadata, this next one should be a no-brainer: we're already running in Kubernetes, we're already built on etcd, we're already dependent on etcd's availability, so let's use etcd for our cluster metadata. A lot of the same tricks that are coming out of 5.0 we've already been doing: consistent schema changes, consistent topology changes. I think I put the wrong CEP on the slide; it should be CEP-21. But this all fits into exactly what CEP-21 is doing; we're just building it on top of something that already works instead of building it from scratch in Java. And the other idea is: let's smash the monolith, hulk smash, break it into its constituent pieces, and combine all these things together to build a very cloud-sympathetic service.

This is the logical description of what just came out of my mouth. We have our object storage at the bottom, split across multiple availability zones. We have the physical underlying nodes which are running. We have lambda-style services; for example, compaction now runs as a separate service rather than in the same JVM, and it can scale on its own. We're using etcd as our metadata tier. We have our data tier, which holds the disks, and we have our coordination tier.

This is a more physical mapping of what things look like. Obviously I'm not going to get through all the details here; I'm trying to paint a picture, and I can go into specifics as questions come up. I'm trying to leave as much time at the end as possible. For example, our source of truth, again, is S3. We still use fast disks; you can't just fetch data off of S3 and expect it to be fast. But one of the insights is that the relative cost of NVMe disks on nodes with ephemeral storage is way less than attached disks for the same number of IOPS, and those disks come for free with the node. So it was ideal in the sense that we could basically build a file system caching layer: if there's a miss on the local disk, it just goes to S3. This gives us a huge amount of benefit. When you flush an SSTable to S3, that SSTable gets registered in etcd, our metadata service. Our compactors see it and go, oh look, a new SSTable; they compact that data and put the result back into S3, which triggers another etcd update, and the node that owns that SSTable then goes and fetches it. So it creates this cloud-native loop. And actually, it turns out that if you look at what's happening today, this is the new hot thing; this is what everyone's doing. We didn't really talk about it much at the time, but this is something that has been proven in production. This is what runs Astra.

So we have our coordination tier, our data tier, our metadata tier, and a bunch of auxiliary services off to the side. And this is an example plot of our scaling, from when we were initially launching. The orange line is scaling just the clients, and then there's scaling up the nodes which source the data from S3. You can see that within a few seconds you're able to double your throughput.
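To make that flush-and-compact loop a bit more concrete, here's a minimal sketch of the idea; this is my paraphrase, not the actual Astra code, and the ObjectStore and MetadataStore interfaces below are hypothetical stand-ins for an S3 client and an etcd client:

```java
import java.nio.file.Path;
import java.util.function.Consumer;

// Hypothetical stand-ins for the real S3 and etcd clients -- not actual SDK APIs.
interface ObjectStore {
    void put(String key, Path localFile);
    void get(String key, Path localFile);
}

interface MetadataStore {
    void registerSSTable(String keyspace, String sstableKey);
    void watchSSTables(String keyspace, Consumer<String> onNewSSTable);
}

class FlushService {
    private final ObjectStore objectStore;
    private final MetadataStore metadata;

    FlushService(ObjectStore objectStore, MetadataStore metadata) {
        this.objectStore = objectStore;
        this.metadata = metadata;
    }

    /** Flush an SSTable: the local file is just a cache, the bucket is the source of truth.
     *  If either write fails, the whole flush is considered failed. */
    void flush(String keyspace, String sstableName, Path localSSTable) {
        String key = keyspace + "/" + sstableName;
        objectStore.put(key, localSSTable);       // durable copy in the bucket
        metadata.registerSSTable(keyspace, key);  // "commit": the SSTable joins the live set
    }
}

class CompactionService {
    /** Runs in its own JVM, scaled independently of reads and writes. */
    void start(MetadataStore metadata, ObjectStore objectStore, String keyspace) {
        metadata.watchSSTables(keyspace, key -> {
            // Fetch the new SSTable, compact it with its peers, write the result back to
            // the bucket, and register the compacted SSTable (all elided in this sketch).
        });
    }
}
```

The real system obviously does far more (failure handling, manifest versioning, shard-aware grouping), but the shape is the same: object storage holds the bytes, and the metadata tier holds the pointer to what is live.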
The nice thing is that you can actually scale it back down, too. So, the benefits of this design, as a summary: our topology and schema changes are now consistent. We have components that can scale independently; compaction can scale separately from reads and writes. All pods have access to all data, which means you don't have to stream data from node to node anymore. Whenever you try to scale up a classic Cassandra cluster and bootstrap a new node, it actually makes your current workload worse, because the existing nodes have to stream all that data out. All of that is gone.

We also now have this tiered storage layer I mentioned, and the interesting computer science problem that comes with it: how do we most efficiently pre-cache data onto the disk so we have as few misses as possible? We have a bunch of different strategies for this, and it's one of the fun, exploratory, cloud-oriented areas we get to focus on: the hard engineering problem of tracking what data is being read and written so we can optimally pre-cache the right amount onto disk, versus always keeping all the data on disk. For a workload like time series, where you're really only reading the past day or two of data, you can have a petabyte of data sitting in S3 that isn't actually being read from but is still accessible if you really want it. So it gives us the benefit of both.

The other nice thing is that we don't have the attached-disk problem of Kubernetes. Kubernetes was built with API services and stateless services in mind, not a durable, consistent, stateful database. This design solves that problem, and that's why we call it a serverless database: there's no individual piece of infrastructure that has to be there. All the pieces can fall apart and we can bring the whole thing back. To that end, we can fully recover our topology and scale up and down quickly. Another nice thing about using S3 is that a backup then just becomes the manifest of which SSTables are live at any given time. We don't have to move a bunch of data around; it's already stored redundantly, many times over, in S3. And the value here for integration and security is that it's just an S3 bucket. Every database has its own S3 bucket, which belongs to it. We could potentially make that a bucket you give to us, so we don't even touch your data: it's your bucket, your data in your bucket, and we just support reads and writes to it.

Now the last bit is probably the most interesting bit. All of that is great, but you're still running a bunch of hardware for a particular database. The real holy grail is: could you come up with a multi-tenant data platform? And we did, by changing the concept of the ring. In Cassandra, the ring is a node-level construct; we decided to make it a keyspace-level construct. This means that a given set of nodes in our production fleet, in a particular region, can be part of the ring of one or more databases.
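As a rough illustration of that keyspace-level ring idea (hypothetical types again, not the actual implementation): the metadata tier maps each tenant's keyspace to its own token ring over whatever pods are currently assigned to it, so the same pod can appear in several rings at once.

```java
import java.util.*;

// A token range owned by a pod, for one tenant's keyspace.
record TokenRange(long start, long end) {}

// Hypothetical view of the metadata tier: the ring is per keyspace, not per node.
class TenantTopology {
    // keyspace -> (pod id -> ranges that pod owns for that keyspace)
    private final Map<String, Map<String, List<TokenRange>>> rings = new HashMap<>();

    /** Assign a range of a tenant's ring to a pod. A pod can serve many tenants. */
    void assign(String keyspace, String podId, TokenRange range) {
        rings.computeIfAbsent(keyspace, k -> new HashMap<>())
             .computeIfAbsent(podId, p -> new ArrayList<>())
             .add(range);
    }

    /** Move a tenant's ranges to another pod: scale an active tenant onto its own pods,
     *  or squash an idle one back into the shared pool. No streaming is needed -- the
     *  receiving pod just starts pulling that tenant's SSTables from the bucket. */
    void reassign(String keyspace, String fromPod, String toPod) {
        Map<String, List<TokenRange>> ring = rings.get(keyspace);
        if (ring == null || !ring.containsKey(fromPod)) return;
        ring.computeIfAbsent(toPod, p -> new ArrayList<>())
            .addAll(ring.remove(fromPod));
    }

    List<TokenRange> rangesFor(String keyspace, String podId) {
        return rings.getOrDefault(keyspace, Map.of())
                    .getOrDefault(podId, List.of());
    }
}
```

In the real system this mapping lives in the metadata tier and is shuffled by the operators, but the point is the same: ownership is metadata, so moving a tenant is a metadata change plus a cache warm-up rather than a streaming operation.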
Now, since the data is isolated per keyspace and per bucket, you can't access someone else's data: the tokens don't match in the first place, the data is already separated, and our auth is integrated deeply into Cassandra's role-based system. What's nice about this is the ability to scale things. We have this fleet of pods sitting there; they all have a disk cache, CPU, and memory, and we can dynamically move the metadata and the tokens around between individual pods. That means if you have an active workload, you end up on your own set of pods, isolated, and they can scale on their own. And if you stop using it for a day, it gets squashed back down into the smaller shared pool. The problem for us then becomes: how do we optimally manage this fleet, to make the resources as highly utilized as possible while maintaining a good SLA for everyone? That's where the Kubernetes operators, the monitoring framework, and the ML logic come in: looking at the state of the cluster, which nodes are active, which ones aren't, which tenants are active, which ones aren't, and how they should optimally be shuffled into groups. This is where the really interesting cloud bits come from.

And that's kind of it. I dumped a lot, because I figured there would be a bunch of questions, so I'm happy to go into details here. I have a microphone; who wants to talk? Yes, okay.

With the S3 storage, if you're caching the data on one of the ephemeral disks on the server, it's going to be fast, but if a read is a cache miss, it goes to S3. What's the latency of cache misses? It seems like cache hit ratio would be critical for your read SLAs, right? So what would be the expected read latency on cache misses?

Well, the latency we see for fetching from S3 is in the 10 to 20 millisecond range, which, if you're reading from five SSTables, isn't great. But this is where we have a bunch of different caching strategies. For folks who want to pay for really low read latencies, we effectively pre-cache as much as we can, in a greedy fashion. My colleague here, Zhao, just worked on a caching strategy that takes into account time-based access, the age of the SSTable, plus the token-range accesses of the SSTables. It generates a histogram once per minute or once per hour, I forget which. Is it once per hour? And we use that data whenever a new node comes up to cache more intelligently. But this is a cold-start problem, and solving the cold-start problem is tractable. This is why, for us, if we don't know how to intelligently cache something, we just aggressively cache as much as we can, and over time we improve the intelligence of the caching so that our users don't see an SLA hit. So it's not seconds; it's still milliseconds, it's still fast. And that caching logic is applied per table. If you have a table that's a bunch of log data you don't really care about or really read, we don't cache it. But if it's a table you're actively using all the time, we just cache more and more. There is still that cold-start problem, though. With a lot of cloud databases, you kind of have to pre-warm your workload.
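Here's a rough sketch of what that kind of warm-up heuristic could look like; this is my paraphrase with hypothetical names, not the actual strategy: keep a periodic histogram of reads per token-range bucket, weight newer SSTables more heavily, and pre-fetch the highest-scoring SSTables onto local disk when a new node joins.

```java
import java.util.*;

// One period's access histogram: reads observed per token-range bucket.
class AccessHistogram {
    private final Map<Long, Long> readsPerBucket = new HashMap<>();
    void record(long tokenBucket) { readsPerBucket.merge(tokenBucket, 1L, Long::sum); }
    long reads(long tokenBucket) { return readsPerBucket.getOrDefault(tokenBucket, 0L); }
}

// Minimal description of an SSTable for caching purposes.
record SSTableInfo(String key, long minBucket, long maxBucket, long ageHours, long sizeBytes) {}

class CacheWarmer {
    /** Score an SSTable by how much its token range is read, discounted by its age. */
    static double score(SSTableInfo t, AccessHistogram hist) {
        long reads = 0;
        for (long b = t.minBucket(); b <= t.maxBucket(); b++)
            reads += hist.reads(b);
        return reads / (1.0 + t.ageHours());
    }

    /** Pick which SSTables to pre-fetch onto the local NVMe cache, within a byte budget. */
    static List<SSTableInfo> planWarmup(List<SSTableInfo> tables, AccessHistogram hist, long budgetBytes) {
        List<SSTableInfo> sorted = new ArrayList<>(tables);
        sorted.sort(Comparator.comparingDouble((SSTableInfo t) -> score(t, hist)).reversed());
        List<SSTableInfo> plan = new ArrayList<>();
        long used = 0;
        for (SSTableInfo t : sorted) {
            if (used + t.sizeBytes() > budgetBytes) continue;
            plan.add(t);
            used += t.sizeBytes();
        }
        return plan;
    }
}
```

A new node would run something like planWarmup against the latest histogram and fetch those SSTables from the bucket before taking traffic; everything else gets pulled lazily on a miss.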
If you're expecting a big job to run, you're better off trying to pre-warm it. One of the things we've seen with database as a service is that a lot of people come from databases like Dynamo, where the docs say: if you want really good read latencies, pre-warm your workload. But for people going from a five-node Cassandra cluster to this, there's a difference. So that's one of the things where we're trying to find the right mix: how aggressively do we pre-cache everything, versus how intelligent can we be, versus can we separate disk and compute even further and build a more durable version of that disk cache, one that still isn't the source of truth but is there all the time? Thank you.

I was just wondering, how do consistency levels and replication factor work in all this, now that you've moved everything to S3?

It's still Cassandra. It's still RF 3, and we split the racks across availability zones. This does mean that when you flush, you're flushing three copies, each of which is then written to S3 as well. But one of the beautiful properties of this is that once you compact the data, you've actually repaired it and deduplicated it. So only the recently flushed data is multi-copied, and since the S3 bucket works across AZs, you get much better consistency in your database, because you don't have nodes going, oh, we haven't repaired this in a long time. The system is always getting all copies of the data pushed to it when it fetches new data from S3. Sorry, this is my boss; he's handing the microphone around. That's how I like it. You know who's the real boss.

So how big are the local disk caches? And you talked about pre-warming the data, but since they are ephemeral, you can't predict when you might lose the disk and the data, right? So how do you deal with those scenarios? As they are local and ephemeral, do you detect when they're lost, and then when new ones come up, do you cache data onto them?

Yeah, great question. This is where the Kubernetes operator pattern comes in: we have operators that manage this for us. One of the design decisions we made, which I think worked out well, is that we don't have down nodes. If a node goes down, since it's ephemeral, we effectively remove it from the cluster and bring up a brand new node that rejoins. And since we're working with a quorum-based system, if a node goes down, it's fine; you're still achieving quorum. So yeah, if the disk dies or goes bad, we just kill the node. That's the nice thing about a serverless, ephemeral system: there's no state we can't recover. Even the data in etcd we describe as sort of a transactional cache. It's the transactional barrier for state changes, but we store the state itself back in S3. So even if we lose all of etcd, we just rebuild the state from S3. It's a fully self-sufficient architecture in that sense.
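Here's a rough sketch of what "rebuild the state from S3" could mean, purely illustrative and reusing the same hypothetical ObjectStore/MetadataStore stand-ins as before: because the bucket holds both the SSTables and the manifests of what is live, a fresh metadata tier can be repopulated from a listing of the bucket.

```java
import java.util.List;

// Hypothetical stand-ins, as in the earlier sketch -- not actual SDK APIs.
interface ObjectStore {
    List<String> listKeys(String prefix);   // e.g. list the keyspace's SSTable/manifest keys
}

interface MetadataStore {
    void registerSSTable(String keyspace, String sstableKey);
}

class MetadataRecovery {
    /** If the metadata tier is lost, rebuild its live-SSTable list from the bucket.
     *  The metadata tier is only a "transactional cache"; the bucket stays authoritative. */
    static void rebuild(ObjectStore bucket, MetadataStore freshMetadata, String keyspace) {
        for (String key : bucket.listKeys(keyspace + "/"))
            freshMetadata.registerSSTable(keyspace, key);
    }
}
```

This glosses over manifest versioning and tombstoned SSTables, but it captures why losing the metadata tier, or any individual node, isn't fatal.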
Yes. Do you do any kind of cache coherency across pods? And if you do, what's the overhead for that?

Since it's Cassandra, we don't do cache coherency. I guess I'm not sure what the question is.

As in, if you're maintaining these caches on ephemeral disks, then unless you're flushing everything down to S3 every time, with S3 as the source of truth, you'd have to worry about it.

Yeah, that's what we do. Our transaction for a flush is: flush the memtable to local disk and flush it to S3. If either of those fails, the flush failed. So we're guaranteeing that the data is in S3 before we're done flushing, which means we know we have a backed-up copy of the SSTable available to everyone. And the notification at the end of the transaction is an update to the live SSTable list in etcd: hey, new SSTable. The nodes that own the ranges of that SSTable go, hey, there's a new SSTable in my range, let me go grab it, and they fetch it back from S3. This does mean that S3 needs to be consistent, which it now is on all three cloud platforms; you don't end up in a case where you wrote something to S3 but it's not actually visible somewhere else. That's something they fixed right around 2020, which was very helpful for us, because we didn't have to worry about that problem. Okay, thank you. Yep.

Is your Cassandra binary running inside the worker node separately?

So this goes back to the shared-library analogy; I think that's the best way to think of it. We're actually running Cassandra 4.0, but we're using the bits of the classpath and the classes that make sense. Our main functions are things like: okay, start the compaction manager, grab the metadata from etcd. We built a lot of pluggability into Cassandra so that we could swap the implementation of where the token metadata comes from or where the schema comes from. Those are all things that are in 4.0. And this goes back to the bigger picture: there's a funny old ticket from Jonathan Ellis, the founder, the main Cassandra guy from a while back, where he was basically saying Cassandra is not your database construction kit. But actually, it makes a pretty good database construction kit. So we just leaned into it: we're going to make everything pluggable, so we can use the bits we care about. And if other bits get better over time, for example CEP-21 might negate the need for us to use etcd at all, we have the freedom to swap them.

One question we've gotten a lot is: is this open source, can we use this? I think the problem is we made a very deliberate decision to go all-in on cloud. You can't run this on your laptop, unless you're running a kind cluster. We took a really opinionated approach: we're going to build a very cloud-sympathetic service that requires all these bits and pieces that come with the cloud for free. If you're going to run it yourself in your own data center, it's probably better to just run K8ssandra. But I do think it would be great if we get to the point where we can work together with the community on this; maybe it's a sub-project, I don't know, but I think it deserves to exist in some way. Thank you.
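To give a flavor of what that pluggability means, here's a hypothetical sketch; these interfaces are illustrative only, not Cassandra's actual internal APIs. The idea is simply that a node asks a provider for schema and token metadata, and the provider can be backed by gossip and system tables, by etcd, or eventually by CEP-21's transactional metadata.

```java
// Hypothetical provider interfaces -- illustrative only, not Cassandra's real internals.
interface SchemaProvider {
    String schemaFor(String keyspace);              // e.g. serialized schema for a keyspace
}

interface TokenMetadataProvider {
    java.util.List<Long> tokensFor(String nodeId);  // tokens owned by a node (or pod)
}

// A node/pod wires in whichever implementations the deployment calls for.
class NodeBootstrap {
    public static void main(String[] args) {
        SchemaProvider schema = pickSchemaProvider();
        TokenMetadataProvider tokens = pickTokenProvider();
        // "main" is just: start the services this pod needs (coordinator only, data pod,
        // or a standalone compaction manager) with the providers chosen above.
        System.out.println("schema for ks1: " + schema.schemaFor("ks1"));
        System.out.println("tokens for pod-0: " + tokens.tokensFor("pod-0"));
    }

    // In the serverless deployment these would be etcd-backed; in classic Cassandra,
    // gossip/system-table-backed. Hard-coded stubs keep the sketch self-contained.
    static SchemaProvider pickSchemaProvider() {
        return ks -> "CREATE TABLE " + ks + ".t (k int PRIMARY KEY)";
    }
    static TokenMetadataProvider pickTokenProvider() {
        return node -> java.util.List.of(Long.MIN_VALUE, 0L);
    }
}
```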
I wanted to ask a little bit about repair when things go wrong. You were saying that each of your nodes writes one copy, which is effectively unrepaired. Do you do something to collect those three copies and then write them into the repaired set? Is that basically what you do?

Compaction.

Just general compaction?

Yeah, actually compaction. Since we're using the UCS strategy, it's sharded by token range, and since the copies flush at around the same time, they're all on level zero, so the compactor groups them together into a compaction. For multi-region clusters, which we do support, we have a separate repair service that does Merkle tree repair across regions, and we also have hints for multi-region running as its own service. Smashing the monolith lets us build the bits that make sense for each piece. And the great thing is it doesn't affect our GC profile; you don't have these competing allocation workloads running at the same time. It lets you strip out the garbage, pun intended, and get to built-for-purpose services that just do one thing well.

Okay, thanks. We're out of time. Out of time. All right, thank you.