Okay, let's get started. My name is Aaron Morton. I've been involved in the Cassandra project since the middle of 2010, and I've spent my professional career since the beginning of 2011 just focusing on Cassandra. I'm at DataStax now, and what I want to talk about today is how we made a JSON retrieval API. So let's get this clear: there are other document databases out there that have set some standards, and some of that is around usability, and we would like to see how we can be a bit more usable for people in the Node.js community.

Today we're going to talk about the requirements of what we were trying to do, and try to understand how the JSON data model works so we can map it onto two features that are the future of Cassandra: secondary indexes and transactions. That sounds pretty weird — I've been doing this since 2010, and Cassandra isn't big on secondary indexes and transactions — but you were just in this room listening to Scott from Apple talk about the future of transactions and secondary indexes. Then we'll look at how we can do this now: the data model we use for storing documents and how we have to deal with updates. Our goal was to make something a Node.js developer would expect when they're storing data.

Back in 2005, when I lived in London, Honda had an advertising campaign for launching their diesel cars called "Hate Something, Change Something". So in the spirit of hate something, change something, we're going to learn about the JSON data model and how we can make Cassandra deal with it. On that pathway, here's what's important: functionality is more important than performance — we will get to performance, performance will be good enough, and we will make it better. If something is hard to do but it makes Cassandra and CQL better, then we should really put in the effort, even if it's hard. And we had to work out what that Node.js developer was actually looking for — we didn't have a clue; we don't have many Node.js developers using Cassandra. So we partnered with a project called Mongoose.js. It's an object-document mapper on NPM, and to say it's popular is an understatement: it gets about 1.8 to 2 million downloads a week, and GitHub claims there are 3.8 million dependent repos out there using it. Valeri Karpov, the author and maintainer, is a great guy, and we worked with him to understand what back end someone writing Node.js is looking for.
All right, into the data model — learning about the JSON data model. Well, it looks pretty simple: I've got a map, I've got names, I've got values, and they have a type — numbers and strings. But there's only one type of number, so 1 and 48.8 are the same data type, which is kind of weird. There's a null, but the null is not typed — it's an untyped null, kind of like a Python None. So maybe I could just put this into a table, because we're coming from Cassandra, there are tables and things — this all looks easy.

Now I'm going to add an array. I've got an array of things that look like numbers, but maybe they could be anything, not just numbers. If I think about this from a table perspective, I could have a CQL list or something like that to hold the array, but I don't know what's in the array — my example here is just numbers. Okay, to add a bit more complexity, we could put a sub-object into our JSON document: address points to an object, and that object has other things in it. That's a bit harder — if I was doing this as a table mapping, maybe I could have another table for the sub-object, or I could use a CQL map to store it. It's getting harder. And then when we try to go across documents it just gets crazy: in the document on the left, age is the number 48.8, and in the document on the right, age is a string. So which is it — is age a number or a string? Well, there are no rules; you've just got this situation.

So what did we learn? A document, or a JSON object, has a number of fields in it. Those fields are identified by a name, and that name has one or more parts to it — address.suburb. The fields have data types: they could be a scalar type, with some basic types (and null is a type), or they could be compound data types — objects and arrays. There's an unnamed, implicit object at the root, which is the document. And across documents this is kind of polymorphic-ish: across two documents they've both got a field called address, one of them is a string and one of them is an object, and all I know is that there's a type and a value.

Next up: how do people think about pulling this data back out? Again, there are other examples in the industry of how people think about this. One of them is JSONPath — I don't think JSONPath is very easy to understand. This is a typical sort of query, and it looks pretty easy: go and find one document where the filter says the name equals — I guess — "aaron". Easy. I could add to this: now my filter clause is two operations, and I'm going to guess they're joined with an AND, and for age I've got not just equals, I've got some type of operation I've defined. Add a little bit more complexity: now I've got four operations running in this filter. They each have a path that they're talking about — that field name — and there are a couple of comparison operations. One looks like an existence operation — does the is human field exist in the document? — and another is an array operation: find me documents where fav numbers is an array of size three. So the complexity there is growing.

And now I get to the next annoying part. On the left: go and find documents where age is equal to the number 48. On the right: go and find documents where age is equal to the string "48". Which one do you want? Do you want me to treat it as a number or a string?

Let's do some learning again. Our filtering has a number of expressions. Those expressions can either be logical expressions or comparison expressions. The basic logical expressions are things such as AND and OR.
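As a rough sketch of that naive "just put it in a table" mapping I was musing about a moment ago (the table and column names here are illustrative only, not what we actually built), it would look something like this:

```cql
-- Naive mapping of one document shape straight onto a table
CREATE TABLE users_naive (
    id          int PRIMARY KEY,
    name        text,
    age         double,          -- JSON only has one number type
    fav_numbers list<double>,    -- the array, assuming it only ever holds numbers
    address     map<text, text>  -- the sub-object flattened into a map
);
-- This falls over as soon as two documents disagree on a type,
-- e.g. age = 48.8 in one document and age = "48" in another.
```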
A comparison expression points to a path and says what the operation is — a comparison such as equal-to or greater-than. You can do element operations, such as does the element exist: find me all the documents that have the is human field, I don't care what value it is. And there are operations on arrays, which match the contents of the array: finding documents where the array contains a particular thing, or where the array is a specific size — operations like that. And then there's this polymorphism-ish thing that JSON does: we're going to use the expression to tell us what type we should be locking in on. So if you say equals 48, the number, I'm going to go find documents where it equals the number 48; if you say the string, I'm going to look for the string "48". We're going to trust that the clients sending us requests are adults and they know what they're looking for, and if they're getting confused between numbers and strings there's nothing we can do about that. Another part of this is that most of the time there is something strong like Mongoose making the schema — people are defining classes, they're using an ODM — so normally all the data we get will be the same shape.

All right, last part of this learning step: what do we have to write? Well, here's an operation. There are basic inserts that do an overwrite, and everybody in Cassandra land knows how inserts work — that's easy. But things get a little more complex, and this is where functionality trumps performance. What this says is: go and find me the document where id equals one, and the update clause says add the field is old and remove the field is human from the document. So I've got a couple of operations happening in that one update. This next one is getting harder, because we don't do this: it says find the document where id equals one, add the field is old, go to the field age, treat it as a number, and set it to be the maximum of its current value or 50. That's a read-modify-write operation. You can't even do that in a CQL update — you can't write "field equals field plus two" in CQL unless it's a counter. Now we're getting a bit more complex again: this one says find the document, go to fav numbers — it's an array — and add 22 and 33 to it if they don't already exist; that's what $addToSet says. So again there's a level of complexity in the update operations, there are a bunch of read-modify-write things going on, and these are things we don't like to do in Cassandra.

And this is all the things I could possibly want to do: findOneAndUpdate. Find the document; update the is locked field to be the max of its current value or 1; give me a projection, so return the resource and is locked fields from that document; and return it in the state it was in before you did the update. These are all the capabilities of findOneAndUpdate, and it's basically how you could implement a distributed lock. Again, that's the level of complexity — in CQL we're used to the idea of just being able to do a no-look write and overwrite everything.

All right, so what did we learn? We've got documents that have updates. Those updates can be on fields and arrays. Our updates require the document in memory, because we have to do the read-modify-write and make decisions about what the document state should be based on its current state. Inserts, by definition, mean we already have the document in memory.
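To make that "field equals field plus two" point concrete, here's the counter exception versus a regular column — just a sketch with made-up table names:

```cql
-- Counters are the one place CQL lets you write relative to the current value:
CREATE TABLE page_hits (page text PRIMARY KEY, hits counter);
UPDATE page_hits SET hits = hits + 2 WHERE page = 'home';

-- For a regular column the same thing is rejected, so an operator like
-- {"$max": {"age": 50}} forces a read-modify-write in the application:
-- UPDATE users SET age = age + 2 WHERE id = 1;   -- not valid CQL
SELECT age FROM users WHERE id = 1;   -- read the current value...
UPDATE users SET age = 50 WHERE id = 1;   -- ...then write back the value computed client-side
```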
So from a Cassandra point of view we can just keep rewriting the whole thing again and again, because we always have it — we'll see that in a minute.

Okay, the Cassandra features that are actually useful for this. In the talk before this one, Scott was saying SAI is great — and it is, and it's the future, because all the vector stuff runs on SAI, and it's good after all these years to have good secondary indexes. And lightweight transactions: Paxos and Accord are the future as well, which is really weird. Briefly touching on SAI: it's the third try at secondary indexes in Cassandra, it's available in open source Cassandra 5, it's very efficient at dealing with multiple indexes on the same table — we'll see a chart about that in a minute — and it's very high performance. There's AND and OR support; whether that OR support is in the version that's in Cassandra 5, I've not kept up with everything, but it's coming — JD's shaking his head — it's coming. And we're going to see the list of things we've uncovered in this work that would make SAI better.

So yeah, in the real world this is maturing technology. This chart is from CEP-7 — SAI was one of the first things to go in through that process. You can see from the blue line at the bottom that as you add more SAI indexes you are not using a linear amount of additional disk space; it is very efficient at indexing multiple columns on a table, and that's really important when we get to the monstrosity I'm going to show you shortly. Also from CEP-7, there are some performance numbers here — this is reading four billion rows; I don't know all the details of what it was, it's on CEP-7 — it's fast, and other people are making it faster, which is always a good situation.

Quickly, if you haven't seen them: you make a table and create a custom index. You can put it on text, numbers, or dates, and I've included some of the options there for text indexes. Coming soon as well will be Lucene tokenizing of your text fields. Now the really exciting part: I can use those like a regular database — I can say WHERE name equals Aaron, and that's powered by the SAI index. But in those last two examples I can put two terms into my WHERE clause, and SAI will go and read both of those and do the join for me, and do it smartly — it's not a sequential get-one-then-get-the-other, it's parallel work — and it can do the same with OR. When we found all this, it was great stuff, and we also looked at collections. I can put indexes on maps and sets and lists — I just haven't included lists here because we don't use them. You can have three different types of indexes on a map: you can index the keys, or the values, or the entries, which is the really useful one. So you can see, if I've got a map here of metadata, I can index it and say get me all the rows where metadata topic equals monkeys, and I can put that together in multiple clauses — it's starting to look pretty cool — and I can do a set test as well, to see if a set contains a value.
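Roughly what those index definitions and queries look like in CQL — this is a sketch with made-up table and column names, and the exact index options vary by version, so treat it as the shape rather than the letter:

```cql
CREATE TABLE people (
    id       int PRIMARY KEY,
    name     text,
    age      int,
    metadata map<text, text>,
    tags     set<text>
);

-- SAI indexes on regular columns
CREATE CUSTOM INDEX ON people (name) USING 'StorageAttachedIndex'
    WITH OPTIONS = {'case_sensitive': 'false', 'normalize': 'true'};
CREATE CUSTOM INDEX ON people (age) USING 'StorageAttachedIndex';

-- Use them like a regular database; with two terms SAI reads both indexes
-- and does the intersection for you
SELECT * FROM people WHERE name = 'Aaron';
SELECT * FROM people WHERE name = 'Aaron' AND age > 40;

-- Collections: index the entries of a map, or membership of a set
CREATE CUSTOM INDEX ON people (ENTRIES(metadata)) USING 'StorageAttachedIndex';
CREATE CUSTOM INDEX ON people (tags) USING 'StorageAttachedIndex';

SELECT * FROM people WHERE metadata['topic'] = 'monkeys';
SELECT * FROM people WHERE metadata['topic'] = 'monkeys' AND age > 40;
SELECT * FROM people WHERE tags CONTAINS 'human';
```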
Okay, and now on to lightweight transactions. They've been around since Cassandra 2, and they are used in a lot of places and relied on at a scale that would probably surprise you, given their reputation. They came in in Cassandra 2; in Cassandra 4.1 there's a new implementation which is supposedly 50% faster; and with Accord the performance gets better again — basically the same performance as a normal single write, a normal insert, in most cases. So we now get to have better consistency, which is handy, because some of those use cases for Node.js users expect to have primary key constraints, and we just went through all the need for read-modify-write in order to have all the functionality. Quick example here: I can do an insert with IF NOT EXISTS, so it only happens if there is no user with the username "the beard of Zeus", and I can also do an update that only changes the value if the value is still the same as what it was previously.

All right, we've got the building blocks now. We've got a pretty good understanding of the JSON data model — how we store it, how we need to read it back — and we know that there are these features that are in Cassandra now and getting better: SAI is getting better, our transaction support is getting better. So we can put these two things together and have something that will get better in the future.

We came up with this idea called super shredding, which goes through three steps. We walk the JSON document, take note of the nodes in it, work out the path name — user.name or something like that — and understand the data type. Then we look at all the types of operations you could want to run against it and work out what data structure we need to store for that, and whether we need any extra metadata. For example, for the $exists clause we put the path into a set, because we want to be able to look at all the fields we can test — we'll look at an example. And we want to put it all into one row of CQL, so that the SAI query planner can go about its business — it's actually pretty good at working out how to run those queries — so we always want it to be in one row. So if we have this document here with the is human field, for $exists we just know to store it away into a set that has "is human" in it, and then we can go and query to find that back.

All right, the next slide has the data model. We've improved it since, but this is the easiest way to explain it — and yes, I do get people sending me emails at work about some of these things. Let's break this down. There are really two things we need to do: store the blob, and maintain some way to be able to pull the data back. The primary key is a tuple here, because the id of your document could be any JSON scalar value, so we have to encode it — the string "48" is different from the number 48. The tx_id is what we use in the transactions; we'll see that shortly. doc_json is where your blob goes. All the others are data structures — maps and sets — that together allow us to implement all of the functionality of the API by being able to go and pull those up; we'll go through some examples in a minute. Then comes the worst part: I go and put all these SAI indexes on it. SAI is going to be slower — it takes more effort on the writes, that's how databases work — and we will work to improve this, but we now have a good understanding of what it's like when someone puts too many SAI indexes on a table.

Bit of a worked example: we've got a document and we want to be able to serve the $exists filter, which basically says find me all the documents where is human is defined — I don't care about the value, it could be null, it could be a boolean, it could be a string. Given the document at the top there, we do an insert — I've just pulled out the exist keys, it's a set — we walk the document, get all the paths, and we insert those into the super shredding table. Then when you come back to do a read, "where is human exists = true", we go to that set and say WHERE exist_keys CONTAINS 'is human'. Now, SAI in the current implementation cannot do a NOT CONTAINS — that's coming as well, as one of the things we've added, and we'll see that at the end — but it allows me to serve a really simple query.
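As a cut-down sketch of what that row looks like — only the columns mentioned here, with names and types approximated; the real table carries more of these structures and more SAI indexes on them:

```cql
CREATE TABLE docs (
    key               tuple<tinyint, text> PRIMARY KEY,  -- type tag + encoded document id
    tx_id             timeuuid,            -- used by the conditional updates later
    doc_json          text,                -- the document blob itself
    exist_keys        set<text>,           -- every path in the document, for $exists
    query_dbl_values  map<text, decimal>,  -- numeric leaf values by path
    query_text_values map<text, text>      -- text leaf values by path
);

CREATE CUSTOM INDEX ON docs (exist_keys) USING 'StorageAttachedIndex';

-- Shredding {"age": 48.8, "is human": true} and writing it, only if the id is free:
INSERT INTO docs (key, tx_id, doc_json, exist_keys, query_dbl_values)
VALUES ((8, '1'), now(), '{"age": 48.8, "is human": true}',
        {'age', 'is human'}, {'age': 48.8})
IF NOT EXISTS;

-- The $exists filter becomes a CONTAINS test on that set:
SELECT doc_json FROM docs WHERE exist_keys CONTAINS 'is human';
```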
So what about a more complex query? I've got is human, and I've got the age. We have a map of all of the numeric values, query_dbl_values; we put the age in there, and I put the id in there as well. So if I get a query that says find me documents where is human is defined and the age is greater than 48, we can construct a CQL query where we put both those terms into the WHERE clause, push it down to the database, and say go and solve this problem. That's a really easy thing for me to do — it's a harder thing for the database people to go and sort out — but again, SAI is going to get better, because a lot of the types of queries you see coming from the vector space are like this: filter using some metadata, then do a vector similarity search, and then maybe do some de-duping and some ordering after the similarity, because I've cut off the top 10 and I want to order those by some other thing. Complete queries like this, where you need to use ad hoc metadata in the filter, are going to be more important over the next few years. Here's an example with sub-documents, just to show the pathway: where we've got address, and city inside the sub-document, we use "address.city" and put that into the map of all of the text values. And if I get a query that says find me age greater than 48 and this address.city, again I'm just constructing a CQL WHERE clause, using "address.city" as the path and putting it into query_text_values. So super shredding breaks it all down based on knowing the types of queries we want to serve, and then it puts a whole lot of pressure on SAI to go and serve those queries.

Updating — what did we learn? We've got basic inserts and overwrites, and then all of our updates are read-modify-write. That's probably not strictly true — I think you could look at an incoming request and ask whether it requires a read-modify-write operation and then make a decision — but unfortunately in Cassandra, if you want to go down the strong, linearizable consistency path, you can't do it ad hoc. I can't say changes to this document are read-modify-write so I'll do one thing, and this one isn't so I'll do something different — it doesn't work that way. So we're always going to have the document in memory, and we're going to overwrite all the time. That tends to lead to better outcomes from compaction — compaction has more work to do, but it's easier for it to drop stuff, because we don't have one column written six months ago that has to sit way over in an old SSTable while everything else lives at the front. When we have an insert, we just do the inserts we saw before and put IF NOT EXISTS on them, and that runs a lightweight transaction — currently, in Cassandra 4, that's four round trips; it's faster in 4.1; and in Cassandra 5 it will be the same as a regular insert in the happy path.

All right, now there's the really complex one: findOneAndUpdate. This is all the things it can ask us to do: it can ask us to find something, but if we don't find the document, create it; make the change; project it; before and after. Basically we have to pull the thing into memory and make the change — we can't really get around it. So given a query that looks like this — we're going to update, and again add the field and set the max — we're going to do a select, and we know that this query is by id, so we're just going to get it by id here.
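The two filter queries walked through just above map onto that same sketch of the table roughly like this (the city value is made up, and the maps would need ENTRIES indexes on them for these to be served):

```cql
-- "is human exists AND age > 48", both terms pushed down for SAI to solve:
SELECT doc_json FROM docs
WHERE exist_keys CONTAINS 'is human'
  AND query_dbl_values['age'] > 48;

-- Sub-document fields are just dotted paths in the same maps:
SELECT doc_json FROM docs
WHERE query_dbl_values['age'] > 48
  AND query_text_values['address.city'] = 'Wellington';
```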
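And the findOneAndUpdate path itself, end to end, looks roughly like this — a sketch only, reusing the example values from above; the re-shredding in the middle happens in the application:

```cql
-- 1. Read the current document and its transaction id (the query is by id):
SELECT tx_id, doc_json FROM docs WHERE key = (8, '1');

-- 2. In the application: deserialize doc_json, apply the update operators
--    ($set / $max / $addToSet ...), then redo the super shredding.

-- 3. Write everything back, but only if nobody changed it in between;
--    the IF clause makes this a lightweight transaction, and if it fails
--    we read again and retry:
UPDATE docs
SET tx_id            = now(),
    doc_json         = '{"age": 50, "is human": true, "is old": true}',
    exist_keys       = {'age', 'is human', 'is old'},
    query_dbl_values = {'age': 50}
WHERE key = (8, '1')
IF tx_id = 6c84fb90-12c4-11e1-840d-7b25c5ee775a;  -- the tx_id we read in step 1
```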
We're going to deserialize that JSON, update it, and then redo the super shredding — and we fortunately have the maintainer and author of the Jackson library on our team, so when we want to make that go faster, we can; that's kind of fun. And then we're going to rewrite everything with an update, conditional on that transaction id not having changed from when we read it. I'll say it — this stinks. It's not great. It works. There are probably more elegant ways to do it once we have a better understanding, maybe if we move more things down into the database. From a Cassandra point of view, in 2015–16 we would have talked about these things as something you shouldn't do a lot; by this time next year, when we have Accord everywhere, we're going to look at things like this and say, oh, this is just easy — there's no reason not to do this, you can do this now.

Okay, so what improvements did we come up with? We've got a bunch of improvements coming to SAI: NOT CONTAINS on sets, not-equals on map entries — you can see them all there — and OR support is coming; there's a CEP out there for that. I'm still a big believer in, and still push for, global ORDER BY in CQL — that would be great — and LIMIT and OFFSET, to be able to support those queries we saw earlier. And we've found some things we can improve in Paxos v1 without having to go to the next level.

So what did we get out of all this? With just a one-line config change, a Node.js app can come and work against DataStax Astra, or against Stargate sitting in front of Cassandra. That doesn't mean everyone's going to switch; what we wanted to do was make something that was appealing to Node.js developers. From a performance perspective it's a difficult thing to get a handle on, because we're doing things that Cassandra really hasn't liked to do in the past, but trying to compare apples to apples: just reading by id adds about 0.3 to 0.5 milliseconds to go through HTTP and the JSON stack and so on, and there are still a lot more performance improvements we can put in there. Going to SAI: we've got a table with three million rows, and when we do a read using one of the SAI indexes on the map, it's about a 1.4 millisecond read at the coordinator level, and another one and a half milliseconds to get through the rest of the stack. So SAI can deliver some performance. We also have an example of reading with two fields.

For the future: we've got more HTTP tuning to do, because that's a new skill, going up against the native binary transport, which has had years and years of becoming very good at high concurrency and high throughput. Maybe we relax some of those consistency rules — we started this before vector was such a big thing, and with vector being such a big thing you tend to have data that loads in as a bulk job or something like that: the embeddings are calculated and you want to push them in, so we're working out some way to deal better with that. All these SAI indexes: we're going to go and work out how to deal with them all and make them more efficient — we've got a use case now, so that's great. And just more operations to support vector workloads.

If you want to get hands-on with this, go to Astra and get one of the vector-enabled DBs. There's a GitHub repo for this — stargate slash... sorry, jsonapi; there's a typo on the slide, it's the JSON API repo — so you can stand this up against Cassandra yourself. And Jeff's doing a talk — do I get a t-shirt for cocking that up? — Jeff's doing a talk about using this API now for real things.
And if you saw the blog post about building a Taylor Swift-enabled chatbot — this happened in the last couple of days, and there was a whole bunch of work put in to get it going quickly — you can see an example in that blog post, which came out this morning, of being able to go: look, I'm writing TypeScript, Node.js, and I'm able to use Cassandra — DataStax Astra — with all that power and do something really simple in Node.js, rather than having to get a driver and do stuff in CQL. The example there is just searching on embeddings.

So I have a couple of minutes for questions, if there are any. I guess... t-shirts... I don't need this thing.

Question: what's the most complex app that you've tested against this so far?

We've had some production apps test on it — things that were originally coded to use a document database from another leading vendor — so production things that other people have run. We launched private preview in September, with some of those SAI enhancements that let it do basic things like less-than and greater-than, which are now available; it turns out there's a whole bunch of apps that don't use those. The complexity, I think, will be in the vector space. Imagine at some point in the future someone wants to do a query of all the restaurant reviews Aaron has written that mention the word hamburger in them, where I then want you to take that set of things and do a vector sort on it, and then sort those by the distance from the street corner I'm standing on. Those multi-stage queries will be the complex parts.

All right, thanks everyone.