All right, thank you everybody for coming to my session. Let's get started. I'm Dinesh Joshi. I'm a Cassandra committer and PMC member, and an engineering manager at Apple. Today I'm here to talk to you about Cassandra Analytics. Apache Cassandra is a very popular open source database used to store a lot of data, petabytes of data, for various applications. It's super scalable and very versatile, and that's why it's so popular. Today I'll be talking about two open source projects that my team has contributed to the community. They are part of the official Apache Cassandra project, and together they represent a paradigm shift in how you analyze data, whether you bulk load data into Cassandra or bulk read data out of it.

We have a packed agenda. First, a quick introduction to Cassandra. Just by a show of hands, really quickly: how many of you have already used Cassandra? Okay, most of you are familiar with it. That's great. Then I'll set up a problem statement on what doesn't work in Cassandra right now. To understand that problem, we need to talk about how we move data around in Cassandra, and then about the solution we have implemented, which is covered in the Cassandra Analytics architecture. We'll also talk a little bit about how to get started, with some quick references for further reading if you're interested.

So what is Apache Cassandra? It's an open source, distributed, NoSQL, wide-column, highly available, linearly scalable database. If you read about Cassandra, all of these words appear at some point in its documentation. It's really great; linear scalability is its power. It's about 13 years old now, so it's maturing as a database, and database technologies don't move very quickly. It's used to store petabytes of data, and that scale brings unique challenges to the table.
So let's set up the problem statement: what are we trying to do, and what's the problem? If you've used Cassandra in production and you also have analytical workloads, the architecture that has been recommended for ages is this: connect your live, user-facing OLTP applications to one data center, the live data center, and for any analytical workloads, use a separate data center. Most of the time we use something like Spark to process that data. This is done to create isolation between your live workloads and your OLAP workloads, because when you throw a hundred or a thousand Spark workers at a Cassandra cluster, they dominate all of the IOPS and your live traffic suffers. Your SLOs suffer. That's why this architecture is proposed in so many cases: just isolate your analytical workloads. The two data centers are set up within Cassandra, so Cassandra replicates data between them. But this is suboptimal. We want to get to an architecture where you can run your OLTP and OLAP workloads on the same cluster, at scale, without impacting your SLOs.

So problem number one is: how do we analyze data in place without creating another data center? Problem number two is bulk loading a lot of data into a Cassandra cluster. Whether it's moving data out or moving data in, we must not bring the cluster down or impact your SLOs; that is the most important thing. Trying to analyze a ton of data is a common problem in any database. Which begs the question: why is reading or writing a lot of data a problem in Cassandra? Cassandra is supposed to be super scalable. You can store petabytes of data, and that is all true.
It's horizontally scalable. But when you try to analyze a lot of data, you have to isolate that workload from your Cassandra cluster. I'll give you the TL;DR version and not bore you with all the details: we need to look at the read path of the database. The read path performs a lot of work. We do a network read, read the data out of the network buffer, deserialize the query, parse the query, look up data in caches, look it up in the memtable, check the bloom filters, look up the compression chunks to decompress the data, then collect all of the results, serialize them, and send them back to the client. That is a very, very expensive operation when you try to read every row that exists in Cassandra. If you're trying to read 500 terabytes of data, doing this over and over and over, the cost adds up. And let's not forget the coordinator. The coordinator node is yet another hop your query goes through, and the coordinator has to contact other nodes in the cluster and federate the responses back to the client. When you do this for a lot of rows, it is slow and very expensive. This is why, when you throw a bunch of Spark workers at a cluster, up to 80% of your CPU can go just to serialization, deserialization, and related costs, and your live traffic suffers.

What about the write path? Writes are supposed to be super efficient, and they are for the most part. But here too we do a bunch of operations: a query comes in to write data into Cassandra, we do a network read, deserialize the query, parse it, write to the memtable and the commit log (not necessarily in that order), then serialize the response and write to the network to say the write succeeded.
And then the coordinator node needs to make sure the consistency level is met before it says, hey, we have successfully written this data. So when you're trying to bulk load, say, 500 terabytes of data, doing this over and over for each row is very expensive and the cost adds up. Basically, writing a lot of rows all at once dominates CPU, network IOPS, and disk I/O.

This also creates a unique problem: compaction. How many of you have had issues with compaction in Cassandra? Okay, at least 40% of the room. For the other 60%: when you write a lot of data in very quick succession, Cassandra creates SSTables on disk, and that impacts your read path, because you now have more SSTables to read. In the background, Cassandra also kicks off compaction, and compaction dominates your disk I/O as well. It can saturate the disk for a period of time unless you throttle it. So you have to be very careful about running too many compactions simultaneously, and compaction should also not fall behind. All of that is to say: if we have a way to optimally create SSTables so that they lead to a minimal number of compactions, we reduce the impact on SLOs.

To visually represent what I'm talking about: whether you use the Java driver or the Cassandra Spark connector that's out there currently, both go through the same flow. Here we have a Cassandra node with some of its components called out. It's the same pipeline: we read and write data over the network, deserialize or serialize it, process it, and maintain in-memory data structures as well as on-disk data structures like the commit log, SSTables, and so on.
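To make the compaction point concrete, here is a toy Python sketch of the write pattern just described: a memtable that flushes into immutable SSTables, which a compaction pass later merges back into one. This is an illustrative model only, not Cassandra's actual implementation; the class and its limits are made up for this example.

```python
class ToyLSM:
    """Minimal LSM-style store: memtable flushes become immutable SSTables."""

    def __init__(self, memtable_limit=3):
        self.memtable = {}
        self.sstables = []  # newest last; each is a sorted key->value dict
        self.memtable_limit = memtable_limit

    def put(self, key, value):
        self.memtable[key] = value
        if len(self.memtable) >= self.memtable_limit:
            self.flush()

    def flush(self):
        # Fast writes create many small SSTables on disk.
        if self.memtable:
            self.sstables.append(dict(sorted(self.memtable.items())))
            self.memtable = {}

    def get(self, key):
        # Reads must consult the memtable plus every SSTable, newest first,
        # so more SSTables means more work per read.
        if key in self.memtable:
            return self.memtable[key]
        for table in reversed(self.sstables):
            if key in table:
                return table[key]
        return None

    def compact(self):
        # Merge all SSTables, keeping the newest value per key.
        merged = {}
        for table in self.sstables:
            merged.update(table)
        self.sstables = [dict(sorted(merged.items()))]


lsm = ToyLSM()
for i in range(9):
    lsm.put(f"k{i}", i)
print(len(lsm.sstables))  # 3 SSTables after 9 rapid writes
lsm.compact()
print(len(lsm.sstables))  # 1 after compaction
```

Writing many rows in quick succession produces many small SSTables, and the merge step is exactly the extra disk I/O the talk warns about.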
That's the mental model I want to start with. So how did we solve this problem in the open source community with the Cassandra Analytics project? Going back to this diagram, the main cost comes from the serialization and deserialization of individual rows, and from touching a lot of internal components: in-memory components and data structures. So in the open source community we have introduced two subprojects: the Cassandra Sidecar and the Cassandra Analytics library. Both are now part of the Apache Cassandra project.

What does each subproject bring to the table? The Cassandra Sidecar brings APIs for streaming, token metadata, and snapshots. It essentially allows you to skip talking to Cassandra altogether: it has APIs to stream SSTables out of a Cassandra node and into a Cassandra node without involving Cassandra in any part of the transfer. It gives you token metadata to discover the topology of the cluster and understand how the tokens are laid out. And it allows you to snapshot data. On the other hand, the Cassandra Analytics library is an artifact that you deploy on Spark. It gives you the ability to read and write SSTables on the Spark nodes without involving Cassandra at all. The part of Cassandra's code that takes data from an SSTable and deserializes it into Java objects is exposed through this library. It hasn't been physically moved, but the library provides APIs that let you do such things. What this means is that if you have an SSTable on disk, you can point the Analytics library at it, and it will deserialize it and give you the actual rows and columns in that SSTable. So how do we put these together to really improve performance? This is basically how we do it.
You deploy the Cassandra Sidecar alongside Cassandra on the Cassandra node, where it runs as a separate process and exposes these APIs. The Cassandra Analytics library is integrated with your Spark job, and when you want to read or write data, instead of talking to Cassandra, you talk to the Sidecar. The existing open source Cassandra Spark connector, by contrast, internally still uses the Java driver, so it still goes through the same read/write path the Java driver goes through. With Cassandra Analytics you skip that entirely. If you benchmark this, you'll find it is about 30 times faster, because you don't involve the CPU or the garbage collector on the Cassandra node at all: you're streaming data out and in at the file level, not at the row level anymore.

So let's look in detail at how this works and how to get started. Bulk reads are covered in CEP-28 in the open source community; the whole write-up is available there, and this diagram is a screenshot taken directly from the CEP. When you start a Spark job and you want to bulk read data out of Cassandra, the Cassandra Analytics library provides a Spark data source for Cassandra. It first goes to the Sidecar of the cluster and reads the topology to figure out what the cluster looks like. Then Spark forks a bunch of tasks, and those tasks go and read individual ranges of data from the Cassandra cluster. The way they read it is that they, again, go to the Sidecar and stream the actual SSTable files instead of reading one row at a time.
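The topology step above can be pictured with a small sketch. This is a hypothetical illustration of token-range ownership on a ring; the function and the toy token values are made up for this example and are not the Analytics library's actual API.

```python
def node_ranges(sorted_tokens):
    """Map each node's token to the (exclusive-start, inclusive-end) token
    range it owns; the first node's range wraps around from the last token."""
    ranges = {}
    for i, token in enumerate(sorted_tokens):
        # For i == 0, sorted_tokens[i - 1] is the last token: the ring wraps.
        ranges[token] = (sorted_tokens[i - 1], token)
    return ranges


# A toy four-node ring. Each range can become one Spark task that streams
# the SSTables of a replica owning that range.
tokens = [25, 50, 75, 100]
for token, rng in node_ranges(tokens).items():
    print(f"node at token {token} owns range {rng}")
```

Each task is responsible for one slice of the ring, which is why the library needs the token metadata from the Sidecar before any data moves.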
It just streams out the entire file from the cluster. Streaming files is super efficient because we use sendfile under the hood, which barely touches the CPU. It does use disk I/O and network bandwidth, but it streams the file as a whole. As a consequence, the files for the range of data you're reading are transferred to the individual Spark tasks, and then, using the Analytics library, the data is read from the local disks of the Spark nodes. That's how bulk read works.

On the bulk write side, we materialize the data frame. If you're familiar with Spark, Spark has a concept of data frames. When you are writing, say, 100 terabytes of data across 1,000 Spark executors, each task takes the subset of data it owns and materializes it on disk as SSTables. The local disks on the Spark nodes act as a staging area where we write all the SSTables. Once the SSTables are written, those same Spark tasks stream the data into Cassandra via the Sidecar, so we do achieve some level of isolation between the Cassandra daemon process and the Sidecar. Once the Sidecar receives these SSTables, it calls import on the Cassandra daemon using JMX. If you're familiar with nodetool import, it allows you to take a bunch of SSTables that exist on disk and load them into Cassandra's working view of SSTables. And when we move these SSTables into the staging directory on the Cassandra node, that operation is very efficient as well, because we are not really involving the CPU; we are just copying files, again with sendfile semantics. All Cassandra has to do is import the SSTables and make them part of its working set, and they become part of Cassandra's data directory.
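The sendfile mechanism mentioned above can be seen in a standalone sketch: the kernel moves file bytes straight into a socket without copying them through user-space buffers. This is just an OS-level illustration using Python's `os.sendfile`, not Sidecar code, and the file name and payload are made up.

```python
import os
import socket
import tempfile

# A toy payload standing in for an SSTable file.
payload = b"sstable bytes" * 1000

with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(payload)
    path = f.name

# A connected socket pair: "sender" plays the Sidecar, "receiver" the Spark task.
sender, receiver = socket.socketpair()

with open(path, "rb") as src:
    offset, remaining = 0, len(payload)
    while remaining:
        # The kernel copies file pages directly to the socket; user space
        # never holds the data, which is why the CPU cost stays low.
        sent = os.sendfile(sender.fileno(), src.fileno(), offset, remaining)
        offset += sent
        remaining -= sent
sender.close()

received = b""
while chunk := receiver.recv(65536):
    received += chunk
receiver.close()
os.unlink(path)

assert received == payload
```

The same system call underlies Cassandra's zero-copy streaming, which is why whole-file transfer barely registers on the CPU even though it still consumes disk and network bandwidth.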
So this allows us to skip the entire read and write path while providing isolation through a separate process, the Cassandra Sidecar. How do you get started? These projects are available on GitHub under the Apache organization. One is Cassandra Analytics; we have been contributing to it, and it's still under heavy development. The other is the Cassandra Sidecar, which you deploy alongside Cassandra on your Cassandra nodes.

Then, when you want to do any sort of bulk reads or bulk writes: how many of you are familiar with Spark? Okay, about 50% of the room. Spark has the concept of a data frame, which is like a table. When you want to do a bulk write, you have a data frame obtained from some other source, and you want to write it to Cassandra. All you do is take that data frame, call write on it, and use the org.apache.cassandra.spark.bulkwriter.CassandraBulkSource data source, which has all of the magic baked in. There is some configuration you need to provide, like what the Sidecar instances are, and so on. But once you do that, you just call save, and it does everything under the hood without touching the Cassandra database at all. On the read side you do the opposite: you obtain a data frame, or a dataset, from Cassandra, and it's similar in nature. So if you're writing a bulk job that does data ingestion, like an ETL job, or you're performing analytical computation on a dataset that lives in Cassandra, you can take all of the data out of Cassandra. The way this works under the hood is that it creates a snapshot on the Cassandra ring.
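As a rough PySpark sketch of the calls just described: this is hedged heavily, since the exact class and option names vary by release and you should check the project's README. The `spark` session, `df` data frame, host names, and option keys shown here are placeholders, and the bulk source class name is taken from the talk; it is not guaranteed to match the released artifact.

```python
# Hypothetical sketch only: requires the Cassandra Analytics jar on the
# Spark classpath and a running Sidecar, so it is not runnable standalone.

# Bulk write: materialize the data frame as SSTables and stream them in
# via the Sidecar.
(df.write
   .format("org.apache.cassandra.spark.bulkwriter.CassandraBulkSource")
   .option("sidecar_instances", "cass1.example.com,cass2.example.com")  # placeholder key
   .option("keyspace", "my_keyspace")                                   # placeholder key
   .option("table", "my_table")                                         # placeholder key
   .save())

# Bulk read: obtain a data frame backed by SSTables streamed out through
# the Sidecar (class name assumed, not confirmed by the talk).
df2 = (spark.read
       .format("org.apache.cassandra.spark.sparksql.CassandraDataSource")
       .option("sidecar_instances", "cass1.example.com,cass2.example.com")
       .option("keyspace", "my_keyspace")
       .option("table", "my_table")
       .load())
```

The point of the shape, regardless of exact names, is that the Spark job configures a data source pointed at the Sidecar instances and never opens a driver connection to Cassandra itself.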
You can read the same snapshot over and over without creating new snapshots, and if you want a fresh one, you can set the option to create a snapshot, and it will create a fresh one whenever you run the job. What this allows you to do is essentially stream all of the SSTables out of Cassandra and into Spark, and then deserialize them using the Analytics library. So if you have existing Spark jobs, moving to this is fairly straightforward. One of the key differentiators is that this makes the network your main bottleneck instead of any other component in your system. So you want to be a little careful, and on the Sidecar we have throttling controls, so you can throttle the data being streamed out of Cassandra.

For further reading, I would encourage you to visit the wiki. CEP-28 has a lot of interesting details, and the Sidecar and Analytics projects have Getting Started guides in their READMEs, including an example you can play with locally: you can set up a three-node Cassandra cluster on your machine, run Cassandra Analytics in a Spark job, and see how it feels. If you want to engage with us, you can find us on ASF Slack, on JIRA, or on the Cassandra dev list. That's pretty much it. Thank you. I'm not sure if I have time for questions, but if anybody has questions, I'll be happy to answer them.

So the question was which specific versions of Cassandra this works with. We currently support Cassandra 4, and by extension 4.1 as well. There are plans to support Cassandra 5.0. If you are on Cassandra 3, it's a bit tricky, but the idea is to support version 4 and later going forward. Yes, I was the author of the zero copy streaming work in Cassandra, so this is similar work, an extension of that work.
So the question is where the filtering would sit. In this case, since we are not inspecting the actual SSTable files, we don't look at the partitions or the tokens inside them; the data is streamed as a whole file. So if you want to do any sort of filtering, you do it in Spark. This is an important observation: this approach is useful for workloads that need to pull out all, or pretty much all, of the data. If you have a highly selective query to run against the cluster, this is not the tool for it.

That's a great question: what's the consistency guarantee? The library understands the token metadata and the cluster configuration. It looks at the keyspace and ensures that the data that is streamed meets the consistency guarantees of your tables and keyspaces. I don't believe you can specify a consistency level explicitly, but the consistency level implied by your keyspace configuration is met. We actually do an in-memory compaction, removing all the tombstones, and we stream data from two replicas and check that the two replicas are indeed in agreement. If not, we pull data from a third replica to ensure your consistency level is met.

For that, I would say look at CEP-1. The Cassandra Sidecar is meant to be a management process, a co-process that sits alongside Cassandra and exposes a set of REST services for managing Cassandra. As of now, the concrete use case we have is the Analytics library, which streams data in and out of a Cassandra node without involving Cassandra in the read and write path. From an operational standpoint, you need to deploy the Sidecar alongside Cassandra for this to work.

So there are two questions: one is the separation in terms of the security model.
And the other question is whether it's part of Cassandra. We have not yet cut a 1.0 release of the Sidecar or the Analytics library, since both are under heavy development in the open source, so they are not part of the Cassandra distribution. When released, they would ship as separate artifacts alongside Cassandra, since they are not tied very deeply to Cassandra itself. In terms of the security story, authentication would be via mutual TLS; that is work in progress, and we would be open sourcing it at some point.

There was a question in the back. It is currently available in the open source repository as code, so you can build and run it yourself; we've not yet had a release. Unfortunately, I don't have a concrete date for you on the release, but we are working on it. As for versions, I believe Spark 2 and 3 are both supported, and there's a whole compatibility matrix in the documentation covering Spark, Scala, and JDK versions.

Okay, any other questions? We have three minutes. Yes. In terms of disk I/O: since you're streaming the entire files out, there would not be any technical reduction in disk I/O. I get your point; we will probably see IOPS reduce, but not the actual bytes read from disk. You still have to stream the same amount of data that sits on disk over the network.

In terms of infra cost, if you're talking about the number of nodes you have to run: the Cassandra Sidecar is a pretty lightweight process, a Vert.x application, and you don't have to allocate much memory. So there will be some delta increase in cost, but nothing like 2x or 3x. Cool, there was one more question in the back: what is the resource allocation?
You can run the Sidecar with, say, a two-gig heap. It doesn't require a lot of heap space, since it isn't doing much most of the time, and a couple of CPU cores is more than sufficient. It's a fairly lightweight process.

There was one more question here. Great. The question is about the difference between snapshotting these SSTables, shipping them to S3, and connecting that to the rest of your infrastructure, versus this approach. It's perfectly fine to have that pipeline, but with this approach you don't incur the S3 cost at all. If you already have Spark and you're already processing data on it, you can read directly from the Cassandra cluster. And if Spark happens to be deployed in the same VPC and zone as Cassandra, you're not incurring bandwidth costs either: you can create the snapshot and stream it directly to Spark without involving S3. That is the advantage here. The only additional cost to you is setting up the Sidecar, and I'm assuming that pretty much anybody who runs Cassandra already has internal tooling with some form of agent or sidecar running alongside it. The goal of the Cassandra Sidecar project we proposed several years ago was to ensure there is a community-supported sidecar.

All right, I think we are out of time, but if you have questions, I'll be happy to talk to you in the hallway.