Might as well start. Thank you for being here today. We're going to go over semantic search in Cassandra: how it can be done today with Cassandra 4, as well as how we can use it in the future with the new versions, utilizing open source Cassandra under the free-to-use Apache license. We'll go over the basics of semantic search, vector databases, and vectors, and later we'll have a demo of the current version of Cassandra both with and without OpenSearch, as well as the new version coming out, Cassandra 5. So I am Bassem Shaheen, and with me here is Moralum Miranda. We're from NetApp; we're part of the Instaclustr group within NetApp. So we'll start with vector databases. The key here is that they enable AI and NLP applications to process data at large scale. They utilize various types of information, encompassing unstructured data such as text documents, rich media, audio files, and video files, and also structured data such as geospatial data, tables, graphs, and so on. So we're able to handle these diverse documents in vector databases. Unlike traditional databases, vector databases are capable of handling high-dimensional data efficiently. They can perform similarity searches, finding similar points in large datasets based on a query vector. Embeddings are created so that words possessing similar meanings, and often occurring together in similar contexts, end up related to each other. These embeddings are vectors in a high-dimensional space that capture aspects of a word's meaning. We'll be going over a description of what we're going to do, with the benefits and drawbacks of the existing versions, as well as the demos. So we'll start with Cassandra 4. In Cassandra 4, you can store vector data in the database, but there is no specific type for vectors.
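The similarity idea described above can be sketched in a few lines. This is a toy illustration, not the demo code: the three-dimensional "embeddings" below are made up for readability (real models such as fastText use hundreds of dimensions), but the cosine-similarity ranking is the same operation a vector search performs.

```python
from math import sqrt

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sqrt(sum(x * x for x in a))
    norm_b = sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy 3-dimensional "embeddings"; real models use hundreds of dimensions.
embeddings = {
    "san jose":   [0.9, 0.8, 0.1],
    "california": [0.8, 0.9, 0.2],
    "cassandra":  [0.1, 0.2, 0.9],
}

# Rank every stored vector against a query vector, most similar first.
query = embeddings["san jose"]
ranked = sorted(embeddings,
                key=lambda w: cosine_similarity(query, embeddings[w]),
                reverse=True)
```

Because "san jose" and "california" point in nearly the same direction, they rank next to each other, which is exactly the behavior the talk describes for semantically related words.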
So in essence, if an application wants to access the vectors, you're having to perform a full table scan every time you query the data. The problem with that is that with data science and AI/NLP you're dealing with large sets of data, so the application has to load the full table's data into memory, and the limitation on memory could be a problem down the line. The performance is also not ideal, because you have to do a full table scan every time you're trying to get this data. So the approach currently in Cassandra 4 is: you have a data processor that loads the data model as well as the file that you want to vectorize. The data processor vectorizes the data based on the model and then stores it in Cassandra in a non-vector column, text or something else. To run a query, the data processor will do a full table scan, do the ranking of the matches, and then present it to the customer. All the workload of the ranking is being done in the application, and therefore it's too burdensome and may have issues with memory and performance. So today with Cassandra 4 we have another option, which is to add yet another great open source tool, OpenSearch, for indexing the vector data. OpenSearch has a plugin that allows for nearest neighbor searches of your data, as well as being able to store data that will enable us to look up rows with the values from Cassandra and provide the results of the query. So the flow here is a bit different from what we saw earlier. The first step is similar: you have a data processor that's going to go through and pick up the data model and the PDF file. It will vectorize the data, creating vectors keyed by a document ID as well as a paragraph ID. In OpenSearch, the vector data along with the document ID and paragraph ID are stored in the index. Cassandra holds the other data related to the document and the paragraph as metadata.
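Since Cassandra 4 has no vector type, the "non-vector column" step above usually means serializing the embedding to text. A minimal sketch of that round trip, assuming a JSON text column (the keyspace, table, and column names below are illustrative, not from the demo):

```python
import json

def embedding_to_text(vec):
    """Serialize an embedding for a Cassandra 4 text column.

    There is no vector data type in Cassandra 4, so the list of
    floats is round-tripped through JSON.
    """
    return json.dumps(vec)

def text_to_embedding(s):
    """Parse the text column back into a list of floats for ranking."""
    return [float(x) for x in json.loads(s)]

# The kind of INSERT the data processor would prepare (names illustrative):
INSERT_CQL = (
    "INSERT INTO docs.paragraphs "
    "(doc_id, paragraph_id, body, embedding_json) VALUES (?, ?, ?, ?)"
)

stored = embedding_to_text([0.12, -0.5, 3.0])
restored = text_to_embedding(stored)
```

Every read then pays the deserialization cost on top of the full scan, which is part of why this approach does not hold up at scale.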
Then, from the other side of the equation, when you have queries coming against this environment, the data processor will first go to OpenSearch, run the NLP query against the index, and get the vectors that match. Based on those vectors, and not the full dataset, just those vectors, it will have the document ID as well as the paragraph ID that can be used to retrieve the data from Cassandra. This shifts the load of doing the match rankings to OpenSearch. It reduces the need to do full scans, so the application is not loaded with too much data, and you keep the benefit of Cassandra's fast data storage and retrieval. So now I'll hand it over for a demo of how we're able to show that. All right, so we will have two demos today. The first one will be showing what we can achieve with Cassandra 4; in the next one, we are going to show an evolution of this. I'm going to do this on a notebook. I preloaded some things here in order to not spend too much time during this demo. Basically, I had to define some functions here to extract text from documents. We are looking at a use case that people are looking for a lot nowadays, after that AI hype we got after ChatGPT. And this is the base of that: you can have your knowledge base in some folder, for example, some data lake, whatever, get your data into a database, and be able to look for things there, for information there. This is the part of the pipeline where you read your data and search your data. And if you want, at the end of the pipeline, or at the beginning, you can have an LLM system to do some more complex data management, or get information in a different, smarter way. Okay, so for the purpose of this demo, we are using the fastText model to convert text into embeddings. In order to do a vector search, you need to have vectors. This is a pre-trained model that was produced by Facebook's research team, and it is free.
You can download it and use it. It took like eight minutes just to load; that's why I said I preloaded some parts of the notebook, otherwise we wouldn't have time here. The first part of this notebook is just functions to get the documents from the folder and index them. So there are functions here to take a PDF file, for example, extract the text, and then convert it into embeddings; also for PowerPoint files and Excel files. Those were the use cases we tried here. We can extend this to other file types; of course, you need to create functions for that. That's what all of those functions are for. Then we have the main function here, which takes the text after extraction from the document and converts it into embeddings using this model. All right, so proceeding. Here we are going to connect to Cassandra 4 and load a variable with the keyspace name. I'm not going to run that because I already have the table created and documents in it. But what is important to note here on Cassandra 4: we don't have a vector data type. So we are going to use a list of floats to store the vectors. And this is a function to insert the data, a simple function with an INSERT. We add the file ID, take the file, and get chunks of the file: we are splitting it into paragraphs, so each paragraph will have a unique ID as well. Then we store the file name, the embeddings that we created with fastText, the clear text related to that embedding, and the date it was created. All right, so in this part we're passing a file path; this is the Cassandra best practices document. We call the function to extract the text from the PDF and convert it into a list of paragraphs with embeddings; the structure of this list is here, a little bit up.
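The chunking-and-embedding step described above can be sketched like this. The tiny word-vector table is a stand-in for the fastText model (which is far too large to inline; a real pipeline would call something like `model.get_sentence_vector(text)`), and the record shape mirrors the file ID / paragraph ID / name / path / text / embedding / date fields the demo inserts:

```python
import re
from datetime import date

# Toy word-vector table standing in for the fastText model; the real model
# maps every word to a ~300-dimension vector.
WORD_VECTORS = {
    "cassandra": [1.0, 0.0],
    "cluster":   [0.0, 1.0],
}

def embed(text):
    """Average the vectors of known words (a fastText-style sketch)."""
    vecs = [WORD_VECTORS[w]
            for w in re.findall(r"\w+", text.lower())
            if w in WORD_VECTORS]
    if not vecs:
        return [0.0, 0.0]
    return [sum(col) / len(vecs) for col in zip(*vecs)]

def paragraphs_to_records(file_id, file_name, path, text):
    """Split extracted text into paragraphs, one record per chunk."""
    records = []
    paragraphs = (p for p in text.split("\n\n") if p.strip())
    for pid, para in enumerate(paragraphs):
        records.append({
            "file_id": file_id,
            "paragraph_id": pid,   # unique per paragraph within the file
            "file_name": file_name,
            "path": path,
            "text": para,
            "embedding": embed(para),
            "created": date.today(),
        })
    return records
```

Each record in this list is then handed to the insert function, one row per paragraph.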
Okay, so each item of this list will have the file ID, the paragraph ID, the file name, the path, the clear text of the paragraph, the page number (in PDFs this makes sense; for Excel this stores the name of the sheet), and the embeddings. All right, having this list, we can call the function to insert into Cassandra, the function I just showed at the top. For that file, we produced 302 paragraphs. We inserted everything into Cassandra, and then we get to the part where we need to search. And here is the main problem when you are just using Cassandra 4 with this approach: we need to load the whole table. So we did a full table scan here; as you can see, there is no WHERE clause. We are just loading the whole table into memory: the ID, the paragraph ID, the embeddings, and the text. Afterwards, we use this cosine similarity function, which is available in this Python library, and we do the similarity comparison in Python. So the weight of this processing goes to the data processor. We need to load everything into memory, and if you have a very huge table, this is not going to be efficient. So this may not be a good idea if you have a lot of data. Anyway, this works. And here, for example, if I look for "cluster" (I don't know if the font size is good), you can see this found some sentences with "cluster" here, "clusters". The main thing with the embeddings is that you are not looking for a single word; it puts the word into context and looks for similar words around it. So for example, "California" and "San Jose" would relate to each other and both hit on a text. Of course, the similarity score would be different: if you search for "San Jose" and you have a text with "San Jose", the score would be higher than if it finds "California" in the middle of the text. Let me go back here. All right, so this is one of the ways to do that with Cassandra 4 only.
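The full-scan-and-rank step just described can be sketched as a small function. This is a simplified stand-in for the demo notebook: the rows would come from an unfiltered `SELECT` over the whole table, and the ranking happens entirely in the application, which is exactly the memory and performance cost being pointed out.

```python
def rank_by_similarity(rows, query_vec, top_k=3):
    """Cassandra 4 approach: the application has pulled every row
    (full table scan, no WHERE clause) and ranks them in Python."""
    def cosine(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        na = sum(x * x for x in a) ** 0.5
        nb = sum(x * x for x in b) ** 0.5
        return dot / (na * nb) if na and nb else 0.0

    scored = [(cosine(row["embedding"], query_vec), row) for row in rows]
    scored.sort(key=lambda t: t[0], reverse=True)
    return scored[:top_k]

# Stand-in for rows returned by "SELECT ... FROM paragraphs" with no WHERE:
rows = [
    {"paragraph_id": 0, "text": "about clusters", "embedding": [1.0, 0.0]},
    {"paragraph_id": 1, "text": "about backups",  "embedding": [0.0, 1.0]},
]
top = rank_by_similarity(rows, [0.9, 0.1], top_k=1)
```

Every query repeats this over the entire table, which is why the approach degrades as the dataset grows.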
So this is still possible. It's not ideal, but with smaller datasets it might work, depending on your use case. However, you have another, more effective way, like Bassem said, which is using Cassandra 4 and OpenSearch. So I'm going to connect to OpenSearch here, and I previously created an index using the k-NN vector type; this comes from a plugin, the k-NN plugin. This index has three fields, basically three properties: one is the file ID, another one is the paragraph ID, and then we have the embeddings in a knn_vector type. I need to set the dimension; for this case, the dimension is 300. Having that, we can insert all those embeddings into OpenSearch. As I showed, I just added three properties here, and the IDs are there basically to cross-reference the information between Cassandra and OpenSearch. And if I do the same search here, looking for "cluster", this is what we are going to have: just IDs and a similarity score. And as I have the rest of the metadata in Cassandra, I can use those unique IDs to go to Cassandra, like you're seeing here, and query for the file ID and paragraph ID. So I can enrich my data from Cassandra: I have the similarity score I got from OpenSearch, but I also have all the other information, all the metadata that is stored in Cassandra. I have the clear text, I have the file path, so if I want, I can go and grab the file. And from this point on, you can insert an LLM, get your text, and provide a nice answer for the end user. Okay. Yeah, that's it for the first demo. I will give it back to Bassem now. Thank you. So right now, Cassandra 4 is what's out in release. So if you have a need to do NLP or semantic search, Cassandra is available, but we recommend that you use a tool like OpenSearch as well to be able to perform this efficiently. The great news is the recently released Cassandra 5 alpha, with the beta version on the way.
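The OpenSearch side of this demo boils down to two request bodies: the k-NN index definition and the approximate nearest-neighbor query. A sketch of both follows, assuming an index named `paragraphs` and a 300-dimension embedding to match fastText (only the request-body construction is shown; the actual calls via `opensearch-py` are indicated in comments):

```python
def knn_index_body(dimension=300):
    """Index definition for the OpenSearch k-NN plugin: the embedding is a
    knn_vector, plus the two IDs used to cross-reference rows in Cassandra."""
    return {
        "settings": {"index": {"knn": True}},
        "mappings": {
            "properties": {
                "file_id": {"type": "keyword"},
                "paragraph_id": {"type": "keyword"},
                "embedding": {"type": "knn_vector", "dimension": dimension},
            }
        },
    }

def knn_query_body(query_vector, k=5):
    """k-NN query body; OpenSearch returns only IDs and a similarity
    score, and Cassandra supplies the rest of the metadata."""
    return {
        "size": k,
        "query": {"knn": {"embedding": {"vector": query_vector, "k": k}}},
    }

# With opensearch-py this would be roughly:
#   client.indices.create(index="paragraphs", body=knn_index_body())
#   hits = client.search(index="paragraphs", body=knn_query_body(vec))

index_body = knn_index_body()
query_body = knn_query_body([0.1] * 300, k=3)
```

Each hit's `file_id` and `paragraph_id` then drive a keyed lookup in Cassandra, so no full scan is needed on either side.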
We are looking at the first quarter of next year for Cassandra 5 to be publicly available for use. Embedded in this new version of Cassandra are some features which remove the need to use OpenSearch, so you can use just the one tool, Cassandra 5. I'm just kind of curious: how many people here are currently using Cassandra, and how many are here for the AI portion? Okay, cool, thank you. So I think this will be useful, so you can see how it can easily be done in one place. The benefit of Cassandra 5 is that it is now able to store vectors: there's a new vector data type introduced in Cassandra 5. It allows running similarity functions against that column: cosine, Euclidean, and dot product. It also uses approximate nearest neighbor (ANN) search against the data, so you're getting the similarity searches. And the way this data is indexed is using the new Storage-Attached Index (SAI) that's also being introduced as part of Cassandra 5. So the flow gets simplified again, back to not having OpenSearch in there. The first step is similar in the sense that we're getting the data model and the document, running the document through the data model, generating the vectors, then storing both the metadata and the vector data in Cassandra. The vectors go in a vector data type column. Which then makes it easier for the data processor at the end to do a search against Cassandra on the vector column and retrieve all the data in one query. So again, back to you. Yeah. All right, so we are going to the second demo, and in this one, Cassandra plays a different role. Before, Cassandra was merely a metadata store; now we can get more power from Cassandra, because we have the vector data type in Cassandra 5. So going straight to it, let me just connect to Cassandra 5 here and make sure the connection is up.
The difference between Cassandra 4 and 5 is this data type here, for one. You have a vector data type: it's a vector of floats, and I need to set the dimension of the vector, which again matches fastText. All the other functions are the same: extracting the data, converting the data into embeddings, et cetera. This is all the same. The difference here is the vector data type, and then we need to create an SAI index on the column in order to be able to search for similarity. As we are still dealing with a non-final version of Cassandra, we are getting this warning saying that ANN is still in preview. The path for inserting data is the same as in the other demo, same stuff. And then we have the search path. This is much simpler than before, because we can do everything in Cassandra now, with what is just a simple SELECT query where we can look for a text. So this function receives a text, converts it into embeddings, and sends them to this function here; this will be an array of floats, like in the other demo. In this query we're using the similarity cosine function. This is a native Cassandra 5 function that returns the similarity score of each row. Then we can order the rows by similarity as well, using this syntax. So going straight to the point here: if we search again for "cluster", we'll have from Cassandra the similarity score and all the metadata, basically. So we can pick whichever ones here. In this case, I got the text. And this is wrong, this wouldn't be "text"; it's the file name. And here's the path of the file. So we can get everything directly from Cassandra. You don't need to deploy an OpenSearch or process everything in your data processor, because Cassandra is able to compare, get the similar text, and return it to you. So we have both a repository and a smart search where you can get the similar data from our database.
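The three CQL pieces of this Cassandra 5 demo, the vector column, the SAI index, and the ANN query, can be sketched as follows. The keyspace, table, and column names are illustrative, not the demo's actual schema; the CQL constructs themselves (`vector<float, N>`, `StorageAttachedIndex`, `similarity_cosine`, `ORDER BY ... ANN OF`) are the Cassandra 5 features being shown.

```python
# Table with the new vector data type; the dimension (300) must match
# the embedding model, here fastText.
CREATE_TABLE = """
CREATE TABLE IF NOT EXISTS docs.paragraphs (
    file_id uuid,
    paragraph_id int,
    file_name text,
    body text,
    embedding vector<float, 300>,
    PRIMARY KEY (file_id, paragraph_id)
)
"""

# An SAI index on the vector column is required before ANN searches work.
CREATE_INDEX = """
CREATE CUSTOM INDEX IF NOT EXISTS paragraphs_embedding_idx
ON docs.paragraphs (embedding) USING 'StorageAttachedIndex'
"""

# One query returns the similarity score and the metadata together,
# ordered by approximate nearest neighbor: no full scan, no external index.
ANN_QUERY = """
SELECT file_id, paragraph_id, file_name, body,
       similarity_cosine(embedding, ?) AS score
FROM docs.paragraphs
ORDER BY embedding ANN OF ?
LIMIT 5
"""
```

At query time the application binds the same query embedding (as a list of floats) to both placeholders, so the score and the ordering use one vector.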
Just to show we are not only looking for "cluster": we can look for "place", for example, and even if the word "place" isn't there, this is going to find similar words in the file. Of course, the similarity scores are lower than before, but you still have the list here. All right, so, Bassem, I don't know if you want to talk about that now. Go to the slide. Sure, so if I go to this slide, just briefly here. Thank you for the demo. As I mentioned earlier, we're both part of Instaclustr, which is part of NetApp. We are a company that provides managed services for multiple open source projects, including Cassandra, OpenSearch, Kafka, and Redis. You're able to deploy clusters within minutes on the cloud, across multiple clouds, or you can deploy on-prem. For this demo, we were able to use both OpenSearch and Cassandra on our managed platform: easy to deploy and test, and when we're done, we can remove it. We provide support for all these products as well. So if they're already deployed in your environment and you're looking for support, we provide that 24/7 on production systems, as well as consulting for implementations, guidance, and reviews. So this is what it looks like if you log into the console: you'll see I've deployed Cassandra 4, Cassandra 5, as well as an OpenSearch cluster. Creating a cluster is straightforward. Pick the technology you want and just put in a name for the cluster. Let's try to create a Cassandra cluster. Hit next. You can choose the version of Cassandra you wish to deploy; you have versions from 3 through 5. Right now the 5 alpha is available, and the beta will be available for you to use soon as well. Hit next. Pick the region where you want to deploy in AWS, for example, and the node type; you can change that to your needs.
Hit next, and within five minutes you'll have a cluster created. You can see the cost upfront of what it takes. So we'll close that. So really, that's our presentation. I want to open it up for any questions you may have about this process. Go ahead, please. No, for both we use fastText to take the text and convert it to vectors, okay. No, not an LLM; I used fastText, which is a pre-trained model that converts text into vectors with some meaning, okay. So the main difference is in how Cassandra is storing the data. For Cassandra 4, you need a model, yes, but in order to do a similarity search you need to have all the text converted into embeddings in the same way. That model is a trained model with thousands of words; it's a very large model. No, no. The model is just used to convert the text into embeddings. After that, the model is not useless, because every time you need to search for something, you need to take the text you want to search for and convert it to embeddings using the same model you used when inserting into the database, okay. And this is the difference here. So this is the clear-text view in Cassandra 4: this is what the model is producing. The text is here, and this text was converted into this, a list of floats. In Cassandra 5, as we have a vector data type, this is how the vector looks in Cassandra, okay. So this is an excerpt, yeah. So in this case, the model that finds the similarities between words takes away comparisons like whether it has a tick mark or not, and creates vector data. Now, that vector data is stored in Cassandra. With Cassandra 4, you are not querying that data to do vector searches; you're just storing the data. The load falls on the application: when you're trying to query the data, it's going to pull the full set of data as well as the vectors.
The application will perform the search against the vectors to pull the data that's needed, and then present you with a ranking of the matching documents. Yeah, that's the main difference. The comparing of the data, for Cassandra 4, has to be done on your side; for Cassandra 5, Cassandra does that for you, okay. So you get rid of that processing part, because Cassandra 5 can do it for you and return the list with everything, with the score, et cetera, all the rest. Yeah, exactly. So as an intermediate step, if you are using Cassandra 4 and cannot move to Cassandra 5, you can leverage OpenSearch to handle the similarity searches, using the indexing with the k-NN plugin. Yes. There are other vector databases out there, and the question is how the Cassandra piece of it compares; maybe he has an answer for that. No, for example, we are using OpenSearch to complement Cassandra 4. OpenSearch is only one of the options you have out there, so the good news is just that Cassandra 5 will have this built in. It's another option you will have, and you don't need to deploy another service and pay for it, because everything is in the same place. And we like that because you still get the free software that you're using. Cassandra is very well known for fast writes and reads, so you're leveraging all of its scalability; those are all great benefits, and now you have the added benefit of being able to store and retrieve vector data. We find that's a very good combination, okay. All right. If you have any other questions... Thank you for coming, appreciate it. Enjoy the rest of the summer. Thank you. Thank you.