Alright, welcome everyone. Hello, I'm Philip Moore, I'm from Voltron Data. I was going to be co-presenting this session with Sebastián Estévez from DataStax. Sebastián just had a little baby girl, so he wasn't able to make it, so I'll do my best to do it on my own. Please forgive anything I miss. So this session is how to do OLAP on your Cassandra data with Arrow, Flight SQL, ADBC, and DuckDB. We'll be talking about these technologies and how they all fit together to let you run OLAP queries on your NoSQL Cassandra database, including Astra DB of course, DataStax's commercial offering of Cassandra. So DataStax and Voltron Data teamed up and built a Flight SQL server for an Arrow-native engine to enable OLAP capabilities for Astra DB or Cassandra. If you haven't heard of Flight SQL, it's relatively new. It's part of the Apache Arrow project. It basically lets you connect a server up to a data engine that really excels at querying data but isn't necessarily interested in presenting a server for client connectivity. Flight SQL lets you complete that picture and enable client connectivity to your data engine. So think of it as a component in a composable data system. At Voltron Data we're big believers in composable data systems, and we do that by establishing standards. There's Substrait, which is a machine-readable representation of a query plan that should be universal and can be sent to various data engines, including DuckDB and an engine we've just announced called Theseus that uses GPUs; I'll talk about that a little bit. And we have ADBC, which is Arrow Database Connectivity. You've probably heard of JDBC and ODBC drivers; the really cool thing about ADBC is that it takes it a step further and allows columnar data exchange throughout the entire communication process. Even with columnar databases like DuckDB or Snowflake, usually at the last leg of the trip they have to transpose the data to a row-based format before it gets to the client. Arrow changes that. The Apache Arrow format is columnar, so the data stays columnar all the way through to the client, which makes it an optimal transmission method. I just touched on Arrow being a columnar data format. Formats like Parquet work really well with Arrow because Parquet is already columnar, and Arrow enables that columnar transfer. If you're not familiar with columnar data, it compresses much more easily than a row-based format, so it's much better for OLAP queries, and it works really well with vectorized engines that use a single instruction to process many values at once. So we enable a user interface via ADBC. We also have an Arrow Flight SQL JDBC driver and an ODBC driver. With JDBC and ODBC the data gets transposed to a row-based format at the last second, so they're not as efficient as the ADBC driver, but they still let you get to your Arrow Flight SQL server. We, of course, have execution engines; in this demo I'll be showing DuckDB. DuckDB, if you're not familiar with it, is an in-process database, or data engine. It's really, really good at OLAP queries. It's free, it's open source, and it performs really well on AWS Graviton or even x86. So we'll be showcasing that a little bit.
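To make the columnar idea concrete, here is a minimal PyArrow sketch (the table contents are made up for illustration, not from the talk). Each column lives in its own contiguous array, which is what makes the compression and vectorized execution described above work so well.

```python
import pyarrow as pa

# A tiny, hypothetical table: Arrow keeps each column as one contiguous array
# rather than storing three separate row records.
table = pa.table({
    "l_orderkey": pa.array([1, 1, 2], type=pa.int64()),
    "l_returnflag": pa.array(["N", "N", "R"]),
    "l_extendedprice": pa.array([17954.55, 34850.16, 7712.48]),
})

print(table.schema)
print(table.column("l_returnflag"))  # one contiguous, easily compressed array
```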
And we believe in open data storage formats like Parquet and ORC, which let you talk to your data with various engines if you use those open formats. All right, so I'll get into the solution architecture of what we built here. First of all, Cassandra is a NoSQL database, sort of like an object store, and it doesn't lend itself well to OLAP queries, things like the joins you typically do in OLAP-style queries. If you're familiar with data warehousing, you know you typically have a star schema, a fact table that you join to dimension tables via dimension keys. Those kinds of queries don't work well in Cassandra; it's not designed for that purpose. So we have to export the data from Cassandra to an open format, and we happened to choose Parquet. Seb and the team built this near-real-time export process. It makes the Cassandra data available as Parquet files for analytic engines like DuckDB or Theseus, as separate analytics infrastructure with zero impact on Cassandra latencies. It doesn't go through Cassandra's query path; it's a separate process. Tombstones and deduplication are handled on the OLAP engine side, so the export process stays pretty lightweight; it doesn't have to deal with those things. This proof of concept uses a repo that Seb, I think, created: DataStax's sstable-to-arrow. It basically lets you export your data from Cassandra SSTables to the Arrow format. Snapshots or backups happen in near real time, and the data goes from the SSTable format to Parquet. When the Parquet files get dumped out, we upload them to an object store like S3 or GCS, Google Cloud Storage. Once you've exported the data, you need to be able to query it, of course, so you need a data engine that's really good at OLAP. For this proof of concept we used DuckDB, and the really cool thing about DuckDB is that it speaks Arrow very, very well, so it talks to the Arrow ADBC and JDBC drivers in the Arrow format, which is columnar. The way the flow works, there are two avenues to get to the data. Remember, I mentioned JDBC and ADBC. A typical SQL client usually speaks JDBC or ODBC, so it issues SQL to the JDBC driver, which sends the SQL down to the Arrow Flight SQL server process. That in turn sends the SQL to the data engine, which in this case is DuckDB, and hopefully Voltron Data's Theseus in the future, which uses GPUs for acceleration. The data engine, of course, reads the Parquet. That could be on local storage, on the Flight SQL server's local file system, or it could be in S3, a remote file system. It doesn't matter, but there are performance implications: if the data is on a local NVMe solid-state file system, the query is probably going to execute more swiftly because the data is local and doesn't have to come across the network. Once the DuckDB engine gets the data, it processes the query and sends Arrow data to the Flight SQL server, which then sends Arrow data to the Arrow Flight SQL JDBC or ODBC driver. At that point the driver has to transpose the data to a row-based format for the client. So that last leg of the journey is not as optimal as it could be. That's where ADBC comes in.
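As a rough sketch of the query side, this is how you might point DuckDB at the exported Parquet, both on a local drive and in S3. The paths and bucket name are hypothetical, and it assumes the DuckDB Python package with its httpfs extension; it is not the exact setup used in the proof of concept.

```python
import duckdb

con = duckdb.connect()            # in-process engine, no server needed
con.execute("INSTALL httpfs")     # adds S3 support to DuckDB
con.execute("LOAD httpfs")        # (S3 credentials would be configured separately)

# One view over Parquet on the local NVMe drive, one over the same export in S3.
con.execute("""
    CREATE VIEW lineitem_local AS
    SELECT * FROM read_parquet('/nvme/tpch/lineitem/*.parquet')
""")
con.execute("""
    CREATE VIEW lineitem_s3 AS
    SELECT * FROM read_parquet('s3://my-export-bucket/tpch/lineitem/*.parquet')
""")

# Queries read the Parquet files directly; nothing is ingested into DuckDB's own format.
print(con.execute("SELECT count(*) FROM lineitem_local").fetchone())
```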
So ADBC is very similar to the route I just explained, except that you're sending SQL through and getting Arrow data back on every leg of the journey, so it's a bit more optimized for high-volume data transfer. If you're querying millions of rows of data, you're going to see a big bump by going to ADBC. If you're just querying five rows, or one row, the difference is so negligible that it's not a big deal. As always, there's a trade-off. ADBC is very new, so not a lot of tools have adopted it yet, but a lot of tools talk JDBC. So you'll probably use JDBC for compatibility, but if you need every drop of performance, you probably want to go to ADBC. So that's how the solution works.

Now let's talk about the architecture: how do we actually implement this thing? We deployed this proof of concept in Kubernetes, and we used AWS for this. There's nothing stopping you from doing this in GCP or Azure; it's just the cloud we chose. We run a StatefulSet in Kubernetes that has the Arrow Flight SQL server running with DuckDB, of course. We have two avenues for the data: we attach a local NVMe drive to the instance, but we also have access to S3. Remember, there's always a trade-off, as you all know in IT. If your export process is writing to S3, the data will always be up to date if you query S3. But if you're going to query local storage for performance, you have to have a synchronization process that periodically copies the data from S3 to your local drive. The other cool thing is that we're using Graviton. DuckDB runs really, really well on Graviton, so we're using Graviton 3 instances, and these nodes can be pretty beefy. We're using r7gd.16xlarge, which has 64 CPUs and 512 gigs of RAM, so it's a pretty beefy server. It gives you really good performance that's comparable to things like Snowflake and Databricks SQL at terabyte-and-below workloads. You can't go too big with this. You can query, say, a 10-terabyte database, it just won't perform as well as a commercial solution. But it's really good for the money, which is very cheap. These instances cost, I think, about $4.50 an hour, and running terabyte-scale analytics for $4.50 an hour is really good. And again, the Arrow Flight SQL ADBC or JDBC client sends queries through a load balancer, which has TLS, so all the communications are encrypted. And of course we enforce authentication, because we don't want to let just any schmo get in and grab our data. If you're familiar with DuckDB, it has no authentication, so the Arrow Flight SQL server adds a valuable layer there, enforcing authentication and encrypting the traffic to and from the database. So it's a secure solution.

All right, I'll try to demo the solution here. Please bear with me. I'm going to use one of my favorite tools, DBeaver. I think they're one of the sponsors of the event; I talked to them upstairs and got stickers, so I'm really proud of that. I've gone ahead and instantiated a connection, so I'm running a Flight SQL server in AWS on Graviton right now. This is a live demo, what could go wrong? And it's running on that r7gd.16xlarge, so it's got that local NVMe, 64 CPUs, and 512 gigs of RAM.
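As an aside before the walkthrough continues, this is roughly what that secured ADBC path looks like from the client side, using the Python adbc-driver-flightsql package. The endpoint, credentials, and exact option names here are my assumptions about that driver's dbapi interface, not something shown in the session.

```python
import os

from adbc_driver_flightsql import DatabaseOptions
from adbc_driver_flightsql import dbapi as flight_sql

# Hypothetical endpoint sitting behind the TLS load balancer.
uri = "grpc+tls://flight-sql.example.com:31337"

with flight_sql.connect(
    uri,
    db_kwargs={
        "username": os.environ["FLIGHT_USERNAME"],   # basic auth enforced by the server
        "password": os.environ["FLIGHT_PASSWORD"],
        # Only for self-signed certificates in a test setup:
        DatabaseOptions.TLS_SKIP_VERIFY.value: "true",
    },
) as conn:
    with conn.cursor() as cur:
        cur.execute("SELECT 42 AS answer")
        print(cur.fetch_arrow_table())   # results come back as Arrow, column by column
```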
So I'll instantiate a connection here... and it connected. I'm running 0.9.2 of DuckDB; I think that's the latest version as of an hour ago. They may have released since then, who knows. And if we query the DuckDB settings table, you can see that we've given it 425 gigabytes of memory. We didn't give it all the memory on the instance, but that's a lot of memory. And the threads parameter, yeah, we give it 60 CPU threads. So it's pretty beefy in terms of a database, much bigger than you would think you could get for free. All right, so we've registered some views here. If you expand here, you can see the views. We don't have any tables in this database; we just have views on top of Parquet. The cool thing about DuckDB is you can create a view and select directly from the Parquet; you don't have to ingest the data. Ironically, it's actually faster to query Parquet than the DuckDB internal format, because the DuckDB internal format is designed for updates and things like that, and Parquet is not really designed for that. So we've got basically two versions of the TPC-H schema here: one on local NVMe and the other on S3. If I query the customer table, it reads from the local disk; if I query the S3 version, it reads from S3 live. So let's test that out. Moment of truth. You can see I queried the region S3 table and it pulled directly from the S3 bucket. You'll have to trust me that it worked. And so forth. You can run longer queries too; I can sum the extended price and get a record count. I won't run this one because it takes a while, but if I run the other version here, you'll see the latency is a little bit better, right? Now, how much data are we dealing with here? This is TPC-H scale factor 1,000, so it's terabyte scale. If I run this query, you can see that's six billion rows of data. It just counted those rows and summed the extended price, which is very high. So it queried six billion rows that fast against the Parquet. Now, it cheats a little with Parquet; there's metadata in the Parquet files that lets it cheat. But this next query is a little harder to cheat on because it's touching all these attributes. We're going to query the lineitem table, filter on the ship date, and do a group by and an order by. This query will run for a little while; talk amongst yourselves. All right, it almost always takes 10 seconds, so it's like Old Faithful. It produces four rows of output, but it actually read six billion rows to generate those four row totals. To me, that blows my mind, because I came from the Oracle world, and to get that kind of performance at terabyte scale you had to spend a lot of money with Oracle. So it's very exciting to me to see an open source solution, running on Graviton with ARM processors for $4.50 an hour, run that fast. That really blows my mind. I'm excited about it, so pardon me, I'm nerding out a little bit. I apologize. And then we can run query two. If you're familiar with the TPC-H query suite, there are 22 queries that you run to get a general feel for the performance of the database. I don't know how long this one runs, but it shouldn't take too long, hopefully. Yeah, five or six seconds, something like that. So as you can see, DuckDB is really performant.
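The query described above, a ship-date filter with a group by and order by that returns four rows, matches the shape of TPC-H query 1. Here is a sketch of it against the exported Parquet through the DuckDB Python API; the path and the memory and thread settings mirror what the demo mentioned, but this is a reconstruction, not the exact demo script.

```python
import duckdb

con = duckdb.connect()
con.execute("SET memory_limit = '425GB'")   # settings comparable to the demo instance
con.execute("SET threads = 60")             # adjust both to your own hardware

# TPC-H Q1-style aggregation straight over the Parquet export (hypothetical path).
result = con.execute("""
    SELECT
        l_returnflag,
        l_linestatus,
        sum(l_quantity)                          AS sum_qty,
        sum(l_extendedprice)                     AS sum_base_price,
        sum(l_extendedprice * (1 - l_discount))  AS sum_disc_price,
        avg(l_quantity)                          AS avg_qty,
        count(*)                                 AS count_order
    FROM read_parquet('/nvme/tpch/lineitem/*.parquet')
    WHERE l_shipdate <= DATE '1998-12-01' - INTERVAL 90 DAY
    GROUP BY l_returnflag, l_linestatus
    ORDER BY l_returnflag, l_linestatus
""").fetch_arrow_table()

print(result)   # four groups summarizing roughly six billion input rows at SF 1000
```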
And the cool thing about the Flight SQL server is that if we find something faster, and it speaks Arrow, we can swap out DuckDB for that new thing. So it's really good for a champion-and-challenger model: you can swap out components as new ones that are shinier, faster, and better become available.

All right. All this cool stuff is open source, of course, so we're standing on the shoulders of giants. Voltron Data has a repo for the Flight SQL server portion. It's a public repo, and if you'd do me a favor and star it, that would be awesome. It's the flight-sql-server-example, and there's this awful guy named Philip Moore who works on it. That's me. Long story short, in the README we show how to run this stuff. It's easy to run in Docker: you can literally copy that, start a terminal, and spin up a Flight SQL server on your laptop just like that. So if I do a docker ps, let me make sure it's running. There it is. And if I do docker logs flight-sql -f... all right, it should be listening. So instead of the AWS one, I'll just connect to localhost, hopefully. And in this one we actually do have real tables. So if I open a console... I didn't generate a terabyte of data on my laptop, by the way. If I do a count(*) of this lineitem table, it's 60,000 rows, so much less data. But I have run 10-terabyte scale on this; it's nowhere near as zippy as terabyte scale on AWS, of course, and certainly not on the laptop. So it's probably not a good solution if you're trying to query 60 billion rows, but if you're at terabyte scale or below, it's a pretty good solution, I think.

Now, the other thing to worry about is concurrency. DuckDB doesn't really like multiple users connecting at the same time; it's not designed for that. So if you try to run two queries on the same database at the same time, it's like crossing the streams, bad things can happen. I'm old, sorry, it's a Ghostbusters thing. You can get some strange errors if you try to do concurrent things with DuckDB. Now, we have mitigated that in the past in the Arrow Flight SQL server code base, which is a C++ code base: you can put in things like mutexes that basically lock a resource, you query some stuff, and then you release the lock, and the other sessions just block and wait until that mutex is released, and then they're able to query the database. That's not ideal, obviously. A more commercial solution like Snowflake or Oracle is going to handle concurrency much, much better than DuckDB would, because they're designed for that.

So, other cool things. This is the JDBC driver, of course, and in the repo there are instructions on how to download and set up the JDBC driver. We link to another repo that shows how to set up the JDBC driver in DBeaver; it takes you through it step by step with screenshots. So say I want to start messing with the Flight SQL server, how do I get started? This is a good way to get started: just query the Flight SQL server and run some really wicked crazy queries against it. Of course, there are also links to what Apache Arrow is, and it talks about Flight SQL and Arrow and all that good stuff; there are some article and blog links and things like that. And we say, hey, if you just want to mess with DuckDB directly, you can click the links to go to DuckDB.
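Circling back to the concurrency caveat above: the mutex approach lives in the server's C++ code base, but the idea is easy to sketch in Python, where a single lock serializes every query against one shared DuckDB connection. This is only an illustration of the technique, not code from the actual server.

```python
import threading

import duckdb

con = duckdb.connect("demo.duckdb")   # one shared in-process connection (hypothetical file)
query_lock = threading.Lock()         # stands in for the server-side mutex

def run_query(sql: str):
    # Only one session touches the engine at a time; other callers block here
    # until the lock is released, avoiding DuckDB's concurrent-access surprises.
    with query_lock:
        return con.execute(sql).fetch_arrow_table()
```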
There's also a section in the README where you can start up with your own data. It's an example where you can attach your own database: we generate a scale factor 1, about one gigabyte, TPC-H database and start the Docker container mounting that local database file. That gives you, I think, six million rows as opposed to 60,000, so the 60,000 one is very small, of course. So you can start playing with bigger data sets on your laptop. You don't want to go too big, though; you don't want to go to a terabyte, because you probably don't have that much local storage, and if you did, you probably wouldn't want to fill it up with this database.

The other cool thing is that we talk about how to set up the JDBC driver, but we also show how to connect with the ADBC Python Flight SQL driver. So if you follow these instructions... we'll try it here, we'll try it live, Bill O'Reilly style. We'll go ahead and set up a Python virtual environment and install the packages. All right, and then this little snippet of code should connect to the Flight SQL database that's running locally on my laptop and just run a query. This will use that optimized ADBC driver and keep the data columnar all the way through. And... hey, it worked. As you can see, it returned... I just selected from nation where the nation key is equal to 24, and that happens to be the United States, so it returned that one record out of the nation table. Now, this format might look a little bit weird, and that's because it's the PyArrow format. You're probably used to pandas; pandas is row-based and PyArrow is, of course, columnar data, so each column is represented as an array instead of a row-based layout. That's why it looks a little different. You can, of course, call to_pandas on that, so if we do x.to_pandas(), it comes back in that row-based format. Arrow works really well with pandas. You've probably heard of pandas; Wes McKinney was a co-founder of Voltron Data. He's moved on now, but he made sure that PyArrow worked really well with pandas.

So that's about it for the presentation and demo. I'm glad to take questions if you'd like. Let me get you the mic; I think they want everyone to hear. Thank you.

So when you copy the SSTable over to your destination, you mentioned there's duplication of data there, and you said the OLAP portion is going to take care of it. Can you elaborate on that? Is it when you run your query that the duplicates get taken care of, or are those duplicates still going to be there in the destination?

Yes, a really good question. The way we handled that, and I didn't actually demo that part, I apologize, because it wasn't here, is that we create views on top of the SSTable export data, and we do a max_by. There are timestamps generated for each column, basically like a version history of each column of the SSTable, and we select the value that has the greatest timestamp. So we create these views that use that max_by operator; it lets you get the maximum value of one column by the value of another column in the table. Effectively, that takes care of the tombstones and makes sure you have the latest version of the data. There is a performance penalty, because you're reading more data than you would if you only had the latest version, but it's still functional and works pretty well.
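To make the de-duplication answer concrete, here is a rough sketch of what such a view could look like, using DuckDB's max_by aggregate. The table name, columns, and write-time fields are hypothetical, since the real export schema wasn't shown in the session.

```python
import duckdb

con = duckdb.connect()

# Hypothetical raw export: one row per SSTable write, with a write timestamp per column.
# max_by(value, ts) keeps the value whose timestamp is greatest within each group,
# which is roughly how the views hide tombstoned or superseded versions of a row.
con.execute("""
    CREATE VIEW users_latest AS
    SELECT
        user_id,
        max_by(name,  name_writetime)  AS name,
        max_by(email, email_writetime) AS email
    FROM read_parquet('/nvme/export/raw_users/*.parquet')
    GROUP BY user_id
""")
```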
Okay, handling tombstones and getting the latest version is one thing, right? But if you're using a replication factor of three, you've got three copies of that data there, right? So the six billion rows you mentioned earlier were originally only two billion rows?

Well, in this case this was TPC-H, so it was an OLAP data set; it didn't even come from SSTables. Seb actually has the process that generates the SSTables. I don't have that part of the solution; he was going to demo that part.

Got it. Okay.

Yes, I was just querying regular generic data, but essentially it's the same kind of thing.

Okay, thank you.

Thanks. But yeah, you're right, there is a penalty for that, so it's not going to be as fast as if you just had the latest version.

Did you compare the performance with other solutions like ClickHouse?

Oh, that's a good question. I haven't done ClickHouse. I have benchmarked against Snowflake and Databricks SQL; they're faster, of course, but they're way more expensive. I haven't done ClickHouse, though. I need to do that; it would be really interesting to try.

And also, is the prototype powering a client-facing application?

It's just a prototype right now. I think Seb is going to roll it out fairly soon, but don't quote me on that.

If you had multiple servers that this would be running on, do you have any idea how you would keep the local files consistent, so that whichever node you query, you get the same data?

Good question. I think if you have them all query S3 live, that would maintain some sort of reasonable consistency. But if you copy locally, there's a performance cost, yeah, exactly. I haven't done that, though, so I'm hesitant to say it works great, because there could be some concurrency cost or some locking that I'm not aware of. But it would be neat to try.

When you export data to Parquet, do you have any options to partition it a certain way? Because with Parquet, performance really depends on how you partition it.

I'm not sure; is it a faithful export of the SSTable? I think it's just brute force right now. But of course it's open source, so you can change the code to partition it how you want. I don't want to speak for it, though. I think we have a question in the back.

Hey, thanks. I was curious: it seemed like a core part of this was keeping the data columnar on the wire all the way to the client with ADBC. Is that actually significantly better than just doing some form of compression of row-based data, say LZ4 on top of it? Is this significantly lower total bandwidth back to the client? What kind of ratios are you seeing?

I've seen ten to one on big data sets, because it really depends on how much repetition you have in the column, right? Row-based data compresses pretty well too if you have a lot of repetition. But since it's columnar, say you have a very low-cardinality column with only two values, and you have millions of rows, that's going to compress really well in a columnar layout. So we have seen ten to one. But if it's two rows of data, the difference isn't going to be measurable; you'd have to be really fast to measure it.
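As a standalone illustration of that low-cardinality point (not a measurement from the talk), a two-value column repeated a million times shrinks dramatically once it is dictionary-encoded and compressed in a columnar file. The column name and file path are made up for the sketch.

```python
import os
import random

import pyarrow as pa
import pyarrow.parquet as pq

# Hypothetical low-cardinality column: two distinct values repeated a million times.
flags = pa.table({"l_returnflag": [random.choice(["N", "R"]) for _ in range(1_000_000)]})

pq.write_table(flags, "flags.parquet", compression="zstd")

# Dictionary encoding stores each value as a tiny index into {"N", "R"}, so the file
# ends up far smaller than the ~1 MB of raw single-character values.
print("rows:         ", flags.num_rows)
print("parquet bytes:", os.path.getsize("flags.parquet"))
```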
Yeah, I was just curious how it compares to the generic approach, where you have a page of row-based data and just run a good compression algorithm like LZ4 over it; you get a ratio there too. I was curious whether the columnar format really comes out ahead, and I guess it sounds like the answer is yes.

Yeah, from what we've seen, yes.

Yeah, I understand the advantage on the storage engine side.

The other thing is the serialization and deserialization cost. The cool thing about the Arrow format is that you minimize serialization and deserialization, because the data is shipped in the same format all the way through the ecosystem, if that makes sense. Other databases typically have to do some sort of translation for the client to understand the data. An Oracle JDBC driver, for example, gets the data in the Oracle wire format but then has to transpose or deserialize it to present it to the client.

I see. So you're saying that internally, even though Arrow presents it as, say, a list of integers, it's actually only storing the unique ones internally? Is that what you're saying?

Yeah, essentially. There's some sort of magic happening there that I can't explain one hundred percent faithfully, but it minimizes serialization and deserialization, and it's the same in every language and every format. It's the same in Java as it would be in Python, and so forth.

Right. That's very interesting. Thank you.

Thanks. Any other questions? I think we're good. Thank you. Thank you. Appreciate it.