Greetings, everyone. We're here to talk about enhancing Apache Cassandra drivers with vector support; it's pretty much what it says on the tin.

Quick introduction first, so you know you're not talking to just a random face. My name is Bret McGuire. I've worked at DataStax for almost nine years now. I've done a number of different things while I've been there, but for the past three or four years I've been doing a lot of support and development for our various native driver offerings. I'm currently a member of the Astra connectivity team; I don't know if anyone was just in the room next door, where we were talking about another one of our products, ZDM. We cover a wide portfolio of offerings at DataStax, including the native drivers, ZDM, DSBulk; we cover a lot of things. But enough about me.

This is a rare opportunity for me to actually get information from the field, from some of our users who are actually using things. I write a lot of software and ship it out into the world, and our wonderful support people give me feedback, but I always like to get this information directly from people.

DataStax currently maintains five active drivers that we offer support for: drivers for Java, Python, C#, C++, and, am I missing one? Node.js, there it is.

So, just a quick show of hands; don't worry, I'm not going to interrogate anybody, and Jeremiah is excluded because I know he already knows the answer to all of these. Out of these, which of the drivers have you worked with? How many people work roughly with the Java driver? Okay, quite a few. How about Python, any Python takers? Still a pretty good representation, okay. We'll lump the other three into a larger bucket.
We can look at them individually if we need to. C#: any C# takers? Oh, okay, excellent, a couple of C# folks here. Node.js? A little bit, okay, that works. And C++: any C++ takers? Wow, I should have known. Okay, that matches about what I was expecting.

So my next question: it sounds like people here have used the drivers and some of their APIs. Has anyone here spent any time looking at the binary protocol that the drivers use to interact with the servers? Peter, Jeremiah, keep your heads down. Anybody else? Okay, if not, that's okay. It does come up a little bit in this talk as something we need to review, so I'll try to mention the relevant parts, but as I said I won't go into too much detail, because it can be a little intimidating when people first meet it. We'll cover just enough of it to move on and keep things going from there.

So what are we trying to get to here? We haven't really even said what's been done yet, but I do want to highlight what my intention in this presentation is. This is another instance in which we've had to add support for some new functionality, in the form of a new type. We'll get into that in just a moment,
We'll get into that in just a moment Within the driver So I wanted to sort of document and serve a chronicle for adding for describing that process from a driver perspective So that we have this here in case we ever have to do it again or in case people want to do it on their own by modifying the driver Source that's always an option all of the drivers are referenced our Open source so you're free to make whatever implications modifications you'd like and get your functionality as you wish but I also wanted to highlight some of the How do I phrase this some of the design considerations associated with the driver not not everything it's it's it's very very Easy to look at the drivers and think this is simply a technical question of putting bytes together and shipping them over a wire And then reading those bytes off and generating from them But there are also there is also a number of design considerations that go into that because you are presenting an API to users And you have to think very seriously about how you want to do that and how you want to expose some of that information So we're gonna we're gonna touch on some of those notions as well as we kind of move along here But that's those two things are kind of broadly what I'm what I may mean to get out of this this talk All right, so what is a vector anyway? It is an object that has been added into Cassandra will get to which versions in a moment which has a couple of these features And I or a couple couple characteristics, and I list those characteristics up here I swear I have a quote from Jonathan Ellis somewhere that actually lists these these properties But I cannot for the life of me find it So I'm going to present them for purposes of our conversation as a framing device and you'll have sort of to take my word for it These are these are basically fixed-length constructs. 
They consist of elements that are all of a common type; vectors do not mix strings and ints and so on, the elements are all of a common type. And they also don't allow nulls. You can think of this as very similar, for those of you who are familiar with Java, to Java arrays; we'll come back to that in a moment. But that's what we're talking about, broadly speaking.

All right, so why are we talking about this at all? Vector support was originally discussed, and has been referenced in a number of talks here, in Cassandra Enhancement Proposal 30 (CEP-30), which there's a link to for those who want to download the slides. That proposal introduced the idea of implementing vectors backed by storage-attached indexes. The support was originally rolled out and made publicly available via DataStax Astra, the hosted Cassandra offering that DataStax makes available. And as you've all been hearing over the past few days of talks, that support is now also coming to Apache Cassandra 5 itself. We should all have a great party for that, because that is awesome.

But our initial driver support was intended to support DataStax Astra, with the intention that we would eventually support Apache Cassandra once the Cassandra 5 implementation stabilized. The conversation was not exactly, but not dissimilar to: "Hey, we're going to be releasing this thing. Can we get driver support? And if we can't get it tomorrow, could we get it the day after tomorrow?" So it got done pretty quickly, but it got done reasonably well. I don't have it on this slide, but I'll mention now that we do have native vector support in three of our drivers right now: the Java driver and the Python driver, and subsequently it's been added to the Node.js driver as well.
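Before moving on, the three characteristics described above (fixed length, one common element type, no nulls) can be made concrete with a short sketch. This is purely illustrative; the `Vector` class here is not the API of any of the drivers.

```python
# Illustrative only -- not the API of any DataStax driver. A tiny class
# enforcing the three vector properties: no nulls, one common element
# type, and a length fixed at construction time.
class Vector:
    def __init__(self, element_type, elements):
        if any(e is None for e in elements):
            raise ValueError("vectors do not allow null elements")
        if not all(isinstance(e, element_type) for e in elements):
            raise TypeError("all elements must share the declared type")
        self.element_type = element_type
        # stored as a tuple: the length cannot change after construction
        self._elements = tuple(elements)

    def __len__(self):
        return len(self._elements)

    def __getitem__(self, i):
        return self._elements[i]

v = Vector(float, [1.0, 2.5, -3.0])
print(len(v))  # 3
```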
So each of those three drivers already includes native support for vectors.

All right, so how did we implement this? We're going to go a little bit under the hood here, so bear with me; again, I want to document this process so that it serves as a record.

Vectors are implemented via a custom type. I'll ask generally: are folks familiar with the concept of a custom type in Cassandra? Is that resonating for people? Okay, good, I'm getting nods. The really nice thing about implementing vectors this way is that we don't need a new protocol version. A new version of the binary protocol is a really big deal: it entails a lot of changes not only to the drivers, but also to the servers, and there are administration costs; it's very intrusive. Sometimes it's necessary in order to bring new functionality; that's life. But adding a new data type ideally is not one of those instances; it doesn't necessarily drive a new protocol version. It can be promoted to a first-class encoding in a subsequent protocol version, but it shouldn't drive one itself.

And as mentioned here, we actually do have some precedent for this approach. For those who might remember, the version of the protocol just before the current one, protocol version 4, introduced the notion of a duration type. That was supported by the servers, and it was originally implemented as a custom type, in a way very similar to what we're going to describe here. It was subsequently elevated to a first-class type in the next protocol version, v5, which is the version supported by Cassandra 4. So this is not the first time this process has been done, and it gives us a rough idea of where we're at.

Cool. All right, so what does this look like in practice?
We'll take a simple example: the process of iterating over the rows in a result set, something everybody has done before. Now we're going to get into some of the low-level messaging, because that's really what the driver deals with; in large part it implements support for the messages involved, processing and evaluating them.

Rows are returned in a certain kind of result message. It's a very specific message which includes a format; I can go into the format here if people want to see it. Oh, sorry, there we go: that result message includes a set of metadata to provide additional information about the data in each of the columns in the rows that have been returned. Included in that metadata is the notion of col specs. This is lifted almost entirely from the binary protocol spec; I won't read it all to you, you can see it there. It talks about an identifier for the type of the column, which the col spec defines. For most first-class types that have full support within the protocol version, there is a unique integer identifier that identifies them: 0x0001 for one type, 0x0002 for another, and so on. For custom types, the col spec is a pair: an integer tag and a string identifying the type. The integer (0x0000) identifies that we are dealing with a custom type, but the string then defines what the actual type is. That's what the driver looks at to know what it should do.

And we have an example here to see what this looks like for the type I referenced earlier, the duration type, which was a custom type in protocol version 4 but was elevated to a full type in v5. This is what the driver will actually see when it goes to parse the
result message that it gets back from the server, and it has a large infrastructure to parse these types and to understand what to do with them.

For vectors this looks a little different. This is an actual extracted custom type that we get for a vector of size three of floats. Notice anything different? We now have params; we didn't have params before. This is the first instance of a custom type which includes parameterized types, in which we don't simply say, "hey, the type is a date type or a duration type"; now it's a parameterized type. This isn't new for the type system within Cassandra, to be clear: we've had the notion of parameterized types for some time in the form of lists and sets and maps. But we haven't had to deal with them in custom types at all.

So the first task in implementing support for vectors was to make sure that our parsing infrastructure for these custom type strings could handle the notion of parameterized types and process them accordingly. It took a little while to get that down, but it was not a big deal, so we got that done.

So that's it, right? We handle parameterized custom types and we're done? Not exactly. We also have a second kind of type name, a type specification which is available within Cassandra, introduced in Cassandra 3.0.
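As a rough sketch of what that first parsing task involves, here's a hypothetical parser for the Java-style parameterized type string shown on the slide. The function name and the `ClassName(param, param)` shape it assumes are illustrative; the real drivers' parsers are considerably more involved.

```python
# Hypothetical sketch, not any driver's internals: split a custom type
# string like 'a.b.VectorType(a.b.FloatType, 3)' into a short name and a
# list of raw parameter strings.
def parse_custom_type(s):
    open_paren = s.find("(")
    if open_paren == -1:
        return s.rsplit(".", 1)[-1], []  # simple type, no parameters
    name = s[:open_paren].rsplit(".", 1)[-1]
    inner = s[open_paren + 1 : s.rindex(")")]
    # split on top-level commas only, so nested parameterized types survive
    params, depth, start = [], 0, 0
    for i, ch in enumerate(inner):
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
        elif ch == "," and depth == 0:
            params.append(inner[start:i].strip())
            start = i + 1
    params.append(inner[start:].strip())
    return name, params

name, params = parse_custom_type(
    "org.apache.cassandra.db.marshal.VectorType"
    "(org.apache.cassandra.db.marshal.FloatType, 3)"
)
print(name, params)
# VectorType ['org.apache.cassandra.db.marshal.FloatType', '3']
```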
I Meant to look up the JIRA for where that was and I did not find it But it's used within the system tables and it's sort of a simplified shortened version of the string that you just saw earlier So instead of this long lengthy Java ish class name that we have before we now have a much smaller version in the form of as you can see list and List what so our simple are much larger version of list type and float type reduces down to list and float Vectors are actually implemented on the server with this similar notion. So when one goes to look for Vectors and and sees them then then you will actually see this this is primarily implemented within the schema tables So the drivers don't you wouldn't necessarily hit a problem with this from regular usage of the drivers But if you did say want to go to the metadata that is provided by the drivers and look at any type information that is within there You would immediately hit some exceptions for being unable to parse this notion of vectors. Oh Sorry, sorry. I jumped ahead here. Sorry Yeah, it works for vectors too So we have the simplified notion of types in the form of vector and float 3 That we see within our within our schema table, so We are required to make sure that we can handle this We're actually this is a little better than the previous case because the types in question here Already support parameterization. We already we use these simplified types in the schema for some of the collection types I referenced earlier some of the types that are already parameterized things like lists and maps This simplified syntax is already there So the many of the the type parsing machinery within the drivers already support this We just have to extend it to make sure we can deal with vectors. So, okay, cool Here is a very specific example to show Precisely what I'm talking about. 
This is a table that has been created which includes a vector type that we want to run operations on, and a query underneath against the schema tables provided by Cassandra, so we can see precisely how those types are rendered. This is actually very similar to the queries that are done by the driver when it connects to the server: one of the things a driver does when it initializes its connection is query all of the schema tables, so that it has some notion of the tables, keyspaces, columns, and column types it's dealing with.

Cool. All right, so now we're done, right? We've added parsing support for both kinds of type strings; surely that's it? Of course, not quite. We still have to be able to read and write data in the format that we're getting from the server. The implementation of this varies pretty widely across the drivers, and I'd like to take a brief digression here to talk about a little bit of history.

In the past we have largely had discrete teams for handling the various drivers that DataStax supports: a team for the Java driver, a team for the Python driver, and so on. The teams would collaborate on higher-level goals and general principles, but didn't necessarily worry about implementation-level details; that was left largely within each team. The upshot here is that the internals of implementing this can vary considerably across the drivers.
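Whatever shape the machinery takes in each driver, the byte-level work underneath is similar. As a sketch, here is roughly what serializing a `vector<float, 3>` comes down to, assuming fixed-size elements are packed back to back as big-endian IEEE-754 floats with no per-element length prefix; the function names are illustrative, not any driver's API.

```python
import struct

# Sketch of the byte-level serde for a vector<float, 3>: three 4-byte
# big-endian IEEE-754 floats, back to back, no per-element length prefix.
DIMENSIONS = 3

def serialize_float_vector(values):
    if len(values) != DIMENSIONS:
        raise ValueError(f"expected exactly {DIMENSIONS} elements")
    return struct.pack(f">{DIMENSIONS}f", *values)

def deserialize_float_vector(buf):
    return list(struct.unpack(f">{DIMENSIONS}f", buf))

buf = serialize_float_vector([1.0, 2.5, -3.0])
print(len(buf))                       # 12 bytes: 3 elements x 4 bytes
print(deserialize_float_vector(buf))  # [1.0, 2.5, -3.0]
```

The values in the example are all exactly representable as 32-bit floats, which is why the round trip comes back unchanged.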
So don't assume that when you implement support for this in one driver, you will simply find the parallel implementation in the other drivers; you very definitely will not. Let's make that concrete; I'll run through a couple of instances so you can see what I'm talking about.

This is what serde looks like for the Java driver. It has a separate object, in the form of a codec, that is responsible for serializing and deserializing objects. Each type has its own type codec, in some cases multiple type codecs, responsible for turning bytes into objects and objects into bytes. Those are implemented separately. This is an extremely simplified version of the codec interface, but it's enough to give you the idea.

Python does it differently. Python actually has representations for each individual type, and the types are responsible for serializing and deserializing themselves. We have an example here for the duration type, and you see that the type is responsible for saying, "given a set of bytes, create myself," and the inverse, both with static methods on the classes that define them. I'm not saying that any of these approaches is preferred over another; that's not my intention. My intention is merely to point out that there is a significant difference between how each of the drivers approaches this problem.

Last, we have the most recent driver to get vector support, the Node.js driver, which does away with all of this and just has functions for decoding and encoding. There's actually an encoders file within the driver which has a whole host of encode functions for all of the types you can imagine, and a whole host of decode functions for all of the types you can imagine. So in each case you can imagine what adding vector support involves, but it's different for each of them.

One other thing to consider, and here we get into something of the
design-consideration side of this. This is not purely a technical question; it is a mixture, I should say, of a technical question and a usability question. We also need to expose the data we're getting as a type, within the programming language, that is meaningful for that system. What does that mean? If we want users to run queries on rows and get data back out, what type should we give them? It should be a type that is meaningful within the system they're working in. But that involves a lot of judgment about what is really the right way to do it.

And here we get into some of the principles that have guided most of the driver implementations over time. In general, the drivers try very hard to reuse, as much as possible, native types within the platform. You do not see very often, and there are certainly exceptions, this is not a blanket rule, instances where the drivers implement a new float class instead of just giving you back a float. That is very much by design. It is much easier for developers and engineers to reason with the floats they are already familiar with, from having worked with floats in Java or floats in Python, than to have to learn about the functionality of your cool new float class that does all kinds of neat things they don't really care about. That's something we do very much by design. Now, as I say, that is not a universal constraint; we do not apply it universally, because it doesn't always work.
I cite an instance here of the Java driver not reusing what seems to be a very natural mapping for duration, because of differences in implementation that actually prevented it. So unfortunately we had to have our own duration class; sometimes you have to deal with that.

Why does this matter more than you might think? Because there are certainly a number of situations in which the drivers will try to infer types based on what is passed in. This happens a lot with, say, a CQL statement that accepts arguments of some arbitrary data type from the user; we're then tasked with trying to find some way to convert that data type into what we need based on the schema. But the way this is set up, once a native type is assigned to an underlying CQL type, it doesn't get reused, because you want to know that when you see, say, a float, exactly this codec will be used to handle it. So you don't wind up reusing those mappings.
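That reuse constraint can be sketched like this. The registry and the names in it are purely illustrative, not any driver's internals; the point is only that a language type, once claimed for inference by one codec, can't also be routed to another, which is the motivation for giving vectors a distinct wrapper type.

```python
# Illustrative only: a toy codec registry showing why a common language
# type can be claimed for type inference by at most one codec.
CODECS_BY_LANGUAGE_TYPE = {}  # language type -> codec name

def register(lang_type, codec_name):
    if lang_type in CODECS_BY_LANGUAGE_TYPE:
        raise ValueError(
            f"{lang_type.__name__} is already claimed by "
            f"{CODECS_BY_LANGUAGE_TYPE[lang_type]}"
        )
    CODECS_BY_LANGUAGE_TYPE[lang_type] = codec_name

register(list, "ListCodec")           # lists were claimed long ago

class VectorValue(tuple):
    """Hypothetical wrapper type; a fresh class is free to be claimed."""

register(VectorValue, "VectorCodec")  # fine: no conflict with list
# register(list, "VectorCodec")       # would raise: list is already taken
```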
So you want to be very judicious about assigning codecs to handle very common types; reserve them for situations where they actually make sense and won't need to be reused. We'll see an example of that come up in just a moment.

So here are the decisions that were actually made with respect to how we expose vectors in each of the three drivers. Java is the example I wanted to refer to earlier with respect to reusing something that is very common within the language. The very natural mapping for a vector, based on the criteria I defined earlier, is to a Java array. It has very similar properties and can map very naturally to a vector as it stands. However, again: once we map a given language type to a CQL type, we can't then use it for other situations. Every array we see from that point forward, we'd try to make into a vector, and we may not want to do that for a new type like this that is just sort of catching on. So I decided, given that situation, not to use arrays for the representation to and from vectors, and to represent them via a new custom class, CqlVector. This was a difficult one, because I was very tempted to go with arrays, but in the end it seemed easier to have a custom class to handle it. That class actually supports the idea of easily moving back and forth between arrays, but it is not itself an array.

Python was less of a concern. We were helped out quite a bit by dynamic typing, and basically return something sequence-like, which is broadly workable for what we need in Python.

Node.js was more of a challenge, because, again, our implementation across the drivers is very distinct. In the Node.js case, arrays were already taken: arrays had been mapped earlier to the notion of a list. So on the Node.js driver, if you pass in an array of elements,
it will get converted; the type system will try to convert it into a list before sending it off in the CQL query. So we couldn't use that; it was already taken up. Fortunately, I discovered a relatively new, at least new to me, I had not come across this before, type in JavaScript called typed arrays, which actually match very closely the notion of numerical arrays in other languages, and which maintain most of the properties that we defined for vectors on the earlier slide. So that's how this is represented for Node.js. It's a little bit of a change, because not everyone is familiar with typed arrays. Oh, hello, what is that? Go away.

All right, so now we're done, right? We've done all kinds of type parsing, we've done serialization, surely we're done? Yes, fortunately, we're done. I've provided this here as a summary of the set of things that need to be done to perform a task like this. It includes the type information, it includes the serialization and deserialization operations, and again the very important notion of considering how you expose this within a language and how you make it available. These goals represent the concepts one has to deal with in bringing something like this to a language.

That's cool, but where's the new API for vectors? The nice part is that we don't really need one. Because of the way vectors were implemented, both in the server and within these mechanisms in the drivers, they fit within the existing driver APIs. So as long as you are aware of the types involved in either submitting or reading data through the driver APIs, you'll be able to use everything as it stands and not really have to change very
much. This was, again, very much by design. We didn't want vector-specific operations or vector-specific APIs that would potentially confuse things. As much as possible, we've tried to preserve the existing functionality.

All right, so, cool, we're done? Anything else to do? Yeah, we do still have some things to do. The original implementation, as I mentioned, was designed to support DataStax Astra, and Astra includes support for vectors of a single element type, at least it did in our initial rollout: vectors of type float, and that was all we had. Apache Cassandra 5.0, which is coming out very, very soon, is much more expansive in terms of the set of element types it will support within vectors. So our drivers need to make sure that they can work with vectors of many different element types, in order to support everything that users of Apache Cassandra 5.0 will want. We have some work to do there to make sure we're okay on that front.

We also have a lingering problem, when we get into the details, around serialization for variable-size types. I mentioned that we can support floats right now, and given the current implementation we can pretty easily support types that are of a fixed size: floats, integers, things like that. But when you start getting into types of a more indeterminate size, because of the way the serialization works within Cassandra 5, we have more of a problem as things stand right now. So we have some work to do in order to support this. This is all represented within the JIRA referenced here, so if you want to find out more about that process and what that effort looks like, you can find additional information at that spot.

And I believe that is all I have. I have some time left for questions, if anybody has some. Yes, sorry, go ahead.
Sorry, say again? [Audience question] Check with me after the talk and let's see if we have a ticket for you already; if not, then we'll see if maybe we can submit something. You'd like to see the code for the area you're actually seeing that in; that was the most recent version on npm, the one you were using? Yeah. Yes, it's tested. As for the versions of Node the version we last pushed to npm supports: I believe it officially supports 18 and 20, but it also works for 16, kind of unofficially; it should work pretty well for 16. So any of the major LTS versions of Node should be supported pretty well. But check with me after and we can talk about that.

Any other questions, about anything we talked about? Oh, sorry. [Audience question] So the param in the vector case, where is it? Oh, definitely. Yeah, most of the native numeric types in Python are set up via packing; the implementations are built off a packer. That's what struct packing gives us, so, I mean, you know, it's just kind of the life we live, right? Yeah, that's what we have to live with in that case.

All right, anybody else? Oh, yes, in the back, sorry. [Audience question] Same situation I mentioned earlier: check with me after; we've got to make sure we have time for the next speaker. So check with me after, we can talk about that some more, and we'll see if we can get you some information on that. Okay.
Oh, yes, good. Right now there are no plans to add vector support to either C# or C++. If you would like to see that, I would encourage you to either get in contact with DataStax support or post something on the mailing list and indicate that it's a desirable case. You could also file a JIRA as well. We have a lot of those, so I can't make any guarantees on any specific ticket getting addressed in a specific time, but that would certainly be a good way to get it into the conversation and have it discussed. And to what Jeremiah said: we've laid out an excellent example here to give you a lot of guidance as to all the things you would have to implement in order to support it. So I'd love to see that. And our C# developer is wonderful, so, yes; fortunately, they don't let me anywhere near the C# code, so that works out well for everybody.

All right, anybody else, any other questions? I think we're wrapped up; I think I'm at time. If anybody else has any questions, please come down and check with me. I think that's it. Thanks, everybody.