Thank you, everybody, for joining our talk today. You have so many choices, so we're extra thankful that you chose us, and I hope it will be a productive one for you, and that what we'll be talking about will be helpful in what you're looking for around GenAI, stream processors, and all of these things that are happening. Our talk is about building a ChatGPT data pipeline with the RisingWave stream processor and Cassandra vector search. It's a bit of an exploratory talk for us. There isn't any true integration yet, and we may be building our own connectors at some point, but working with data is actually very nice because things are loosely coupled, so you can make connections between systems. Anyway, so this is us. First of all, here's the agenda, just a rundown of what we're going to be talking about and who we are. Then a brief overview, not assuming everybody has already worked with GenAI and ChatGPT, which is so popular these days, so we'll give a quick introduction. Then we'll get into introducing RisingWave, and then Astra DB vector search, which is powered by Cassandra. And then we'll talk about building a ChatGPT data pipeline, with an architectural, conceptual walkthrough. That should take up the next 30 minutes or less. Okay, so a quick intro of the speakers. Let me have Karin introduce herself. Okay. Hi. Can you guys hear an echo? Do you hear us okay? All right, cool. So I'm Karin. Let me see what it says on my slide. I'm like, who am I? I don't know. So I'm a consultant. I do developer relations, community, all that type of stuff.
My background is that I used to work in entertainment and music before I worked in tech, which is an interesting transition. And yeah, that's it. I don't know if it's that interesting; that's my life story. I'm sure it's more interesting than that, but. Great. Okay, it's my turn. Yes, it's your turn. See, she even has a picture of herself jumping with a sword. How am I supposed to compete with that? No, this is Mary jumping with a sword. Okay, so this is a quick slide about me, and rather than talking through it all, I have pictures to represent it. I'm a senior developer advocate at DataStax. As we all know, DataStax is the primary sponsor of this conference, and thank you so much again for coming. I'm a passionate advocate; you see me everywhere. I really enjoy outreach to the developer community. I'm also a Java Champion, and I run the Chicago Java Users Group and other user groups in Chicago. Prior to this, I was doing event streaming and reactive systems advocacy at IBM, in the Java space too. So that's about me, and we'll share our contact information towards the end of the slide show. My interests are really in streaming and distributed systems, and now I'm moving into AI and machine learning, which I'm embracing, since that's what our company is all about these days. Okay, so first of all, as I mentioned, this may be new, and not everybody has worked with GenAI or ChatGPT, so let me step through it quickly to set the stage. ChatGPT specifically came out a year ago, on November 30th, 2022. That's when OpenAI published and released it.
And it has taken the world by storm, as we all know. I think that's because it's so user-enabled: you can just bring up your ChatGPT console and ask questions. Of course, a lot more chatbots have sprung up since; all of the companies are doing the same kind of thing. But ultimately, what I wanted to say is that this is all about automation, about making our lives easier. By no means will it take away our jobs. It's actually going to draw on our creativity to create new ways of solving problems. Always remember: the bot isn't going to take over, and we'll have more interesting things to do; it will make us more creative in creating new opportunities too. So let's quickly step through AI and machine learning, just to get an understanding. This diagram is the popular way of describing it: AI, at a high level, is mimicking the intelligence and behavior of human beings. If you peel down this onion, you see machine learning, which is really about training the computer to learn from data, so we don't even have to actively train it: it learns from the data and keeps building up a model. However, it can't go too far without a deeper way of doing things, one that addresses the cognitive level. The core of the onion is deep learning, which brings in neural networks: how our brains actually work. That's where all the magic is happening. That's where you find the LLMs we'll be talking about shortly; natural language processing all comes into the core of this onion.
Let's quickly talk about what GenAI is. It's a disruptive field in AI in the sense that it can take prompts in spoken, very humanly understandable language as input. It has the potential to change the way we create and consume content, and it uses a combination of machine learning and deep learning to produce that content. It's very disruptive, a new way of doing things, because, assuming a lot of you are engineers, and so was I, when we program, the way it used to be is that you give input that is very strict: strict rules specifying data types and parameters. That's not very human-friendly. The thing about GenAI is that you can talk to it like you're talking to another person. That's what's really disruptive about it, and it allows us to do a lot more creative things. Before GenAI there was predictive AI, which deals more with business forecasts, weather forecasts, those things: learning from past data and making predictions based on it. That's still doing things the old way; there isn't as much of this human-understandable way of talking to the bot. The thing with predictive AI is also that the data expires after a certain period of time: if you have a weather forecast for five days, after five days the data is gone, it doesn't apply anymore. But generative AI can have a lasting effect. Okay, so that's that. And then GPT. ChatGPT deals more with chat, with text, in that sense.
As input it takes prompts, but essentially GPT stands for generative pre-trained transformer. It takes simple prompts in human-language format as input, and what it does behind the scenes is a lot of pattern matching, called similarity search. Then it answers the questions in the prompts and produces content that is creative: writing essays, writing our code, maybe designing a dress or writing a blog post. Now let's get into NLP. That's important because it's an interdisciplinary subfield of linguistics and computer science. Its concern is being able to process natural-language data sets. That's what is enabling the LLMs, which I'll talk about in a second. It deals with figuring out what the language is trying to say and then doing all the work it needs to, using rule-based or probabilistic machine-learning approaches. So it enables the computer to actually learn; the actual learning is happening at this NLP level. It allows you to draw insights from documents, for example, from some input that isn't just strictly A equals A or B equals B; it goes beyond that and creates some context around it. And now we get into LLMs, large language models. They're a type of machine learning model, a foundational type of model. Typically, though, it takes a lot of money to produce one, because we're dealing with petabytes of data or even more. Oh, her mic is out. Just scream. Here, you want to talk into my mic? Okay, no worries. So yeah, essentially large language models are what we as developers mostly work with.
We're working with LLMs, and they're our window into all of this magic that's behind the scenes doing all of the cognitive work. I'm going to just give you this mic. Thank you. Okay, LLMs, that's it. These are just some examples of the API frameworks that I think most of you are familiar with: LangChain, Llama 2, PaLM, and Hugging Face. Okay, so now, you may be wondering: the title of our talk is Data Pipeline, so we're concerned with data, with processing data flowing through different steps. And Karin is working with RisingWave, so let's have Karin introduce it, and then we'll talk about this test case. Okay, so I'm going to talk a little bit about RisingWave. This does assume that you know a little bit about streaming in general, streaming data versus batch and things like that. Stream processing is a data processing approach that deals with continuously flowing streaming data. If you're familiar with streaming systems like Kafka or Pulsar or Kinesis: as events happen, you bring them into the system, and stream processing basically allows you to run queries, processing, and transformations on the data as you're bringing it in, before you even store it. That's the basic, high-level view of what stream processing is. It gives you the ability to do real-time analysis, transformation, and computation on data as it arrives. You really can't get any more real-time than that, unless you're predicting what's going to happen before the events actually happen, which you can eventually do. But basically the data is coming in, you're running queries on it, and then you're storing it on the downstream side.
This is used to extract valuable insights, detect anomalies, and trigger actions based on the incoming data. Okay, I'm talking really loud. That's my whole life; people are like, you are so loud. How about this, too far? Okay, I'm just going to go like this and make it a little more exciting for you guys. Types of queries in stream processing. One is filtering, which involves selecting specific data from a stream based on predefined conditions. I put examples in here because sometimes it's hard to figure out how you can apply these different kinds of queries; an example is filtering out log entries that contain errors needing immediate attention. Another one is aggregation, which involves summarizing or reducing the data over a certain time window or key. An example of that is calculating the average temperature over the last 10 minutes from a sensor stream. Then joining streams. This is personally one of my favorites, just because I think it's cool; you can literally do this with dozens of streams. You can combine data from multiple streams based on common keys and conditions and do processing across those streams. An example is combining user click data with product data to analyze user behavior. Continuing with types of queries in stream processing: windowing queries group data into fixed time intervals, or windows, for analysis. An example is analyzing the total sales for each hour of the day over a week. Pattern recognition is identifying specific sequences or patterns in the event stream, like detecting a series of login failures as a security breach attempt. A lot of anomaly detection and fraud detection falls under this as well.
And then stateful processing maintains state information to analyze data over time, such as tracking user sessions. An example is monitoring user behavior and detecting when a user session is idle. Very important. Sorry, I'm getting a little ahead of myself. So what is RisingWave? RisingWave is an open-source distributed SQL database used for stream processing. It's designed to reduce the complexity and cost of building real-time applications. If you've tried to do stream processing in the past, it can be quite a big build; it's not easy to get started, and RisingWave is meant to tackle that. Being all SQL-based, it's really easy to get started with and use. Essentially what it does is consume streaming data: events, as they happen, are brought into the system. It performs incremental computations when new data comes in, and then updates the results dynamically. I'll dig into this a little more to make sure it's all clear, but it also has a downstream side. RisingWave has a storage system, so if you're running transformations and queries on the data as you stream it in, you have the ability to store it as well, and access that data very quickly. It's built in Rust. I think this is cool; that's why I added it. It's very performant, and actually the co-founder of RisingWave happens to be here, so I'm like, oh my God, lots of pressure. Make sure you guys clap, okay? It's also SQL-based and PostgreSQL-compatible, which is great because if you know SQL, you can do stream processing; the learning curve is pretty simple. And since it's PostgreSQL-compatible, even on the downstream side, if you want to use other types of databases, you can do that as well.
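That incremental-computation idea, update a running result as each new event arrives instead of recomputing over everything stored, can be sketched in plain Python. To be clear, this is a toy illustration, not RisingWave code; in RisingWave this kind of state lives behind a SQL materialized view and is maintained for you.

```python
from collections import defaultdict

class IncrementalAverage:
    """Toy illustration of incremental computation: keep a per-key
    running sum and count, and update them on each event, rather
    than re-scanning the whole stream every time."""

    def __init__(self):
        self.state = defaultdict(lambda: [0.0, 0])  # key -> [sum, count]

    def on_event(self, key, value):
        s = self.state[key]
        s[0] += value
        s[1] += 1
        return s[0] / s[1]  # current average for this key after the event

# Simulated sensor stream: (sensor_id, temperature) events arriving in order.
stream = [("s1", 20.0), ("s1", 22.0), ("s2", 5.0), ("s1", 24.0)]
agg = IncrementalAverage()
results = [agg.on_event(k, v) for k, v in stream]
print(results)  # [20.0, 21.0, 5.0, 22.0]
```

Each event costs O(1) work, which is why the continuously updated view stays queryable in real time as the stream grows.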
Another cool thing about RisingWave is that storage and computation are separate, so you can scale them out independently of each other. It also offers the ability to do incremental updates on materialized views. Okay, so at a high level, what is RisingWave? On the left side of the screen, events are happening, streams are coming in, and you can pull in data from a variety of sources, all listed here: Kafka, Pulsar, Kinesis, whatever you have the ability to stream from. It gives you real-time ingestion, so as events come in, you're bringing the data in in real time. You can join streams, as I mentioned before, which I think is super cool and helpful for these types of use cases. You can do filtering, aggregations, transformations; you're basically running SQL queries on your data. Then it has continually updated and queryable state, and you have the ability to send it downstream into a storage area: message queues, databases, warehouses, and data lakes. Okay, a high-level overview of the RisingWave architecture. If you have questions, again, the co-founder is right over there; I'm not going to point him out, but he's right there. The RisingWave architecture includes these key components. First there is a serving layer that parses the SQL queries and then performs planning and optimization of query jobs. Then you have a processing layer, which performs all the computation. Then you have a metadata management service, which manages the metadata of the different nodes and coordinates the operations among those nodes.
And then you have a storage layer that stores and retrieves data from object storage like S3. Okay, some example RisingWave use cases. I'm going to try to run through these because I know there's still a lot we have to cover. One example is streaming ETL: the practice of continuously extracting, transforming, and moving data among different systems as the data comes in. Again, the big value of stream processing is that everything is in real time: streams are coming in, and you're running these queries and transformations on top of them before you're even storing them. Real-time analytics is the practice of analyzing data as it's generated or received, rather than after the fact. I also just want to touch on something called actionable insights, where you create things for your end users to take action on based on those real-time insights. That's something really powerful that I think people need to keep in mind when they're thinking about real-time analytics. A lot of times we think about it for internal use, or we think about dashboards, everything in real time. But when you have real-time insights and you're providing them to your end users, that can be really valuable. Then you have event-driven applications, which enable continuous event ingestion and processing and detect complex patterns and correlations within the stream of events. There are honestly a ton of different use cases. There are actually two people here from RisingWave; Ray is over here, and he can also talk to you if you have questions about use cases. He knows a lot about this. So I took a couple of screenshots off of this documentation page.
It's not the whole thing, so it's kind of chopped up, but if you want to see all of it, there will be a QR code at the end to access the slides, and you'll be able to go and look at this. You can also just Google "fast Twitter events processing RisingWave" and you'll find it. Thank you. Okay, so this is an example of one of the use cases. You're basically processing Twitter events and Twitter hashtags as they come in. This is just how you build it out. The first step is to launch a demo cluster. Again, a lot of things are chopped up; I just screenshotted pieces of each section, so if you want to see the whole thing, go to that link. So you launch the demo cluster. Step two, you connect RisingWave to the data streams, whether you're using Kafka or Pulsar or whatever, multiple data streams if you want. Then you define a materialized view, analyze the data, and query the results. If you want to see the whole thing, you can just jump in there. And then I'm going to hand it over to Mary. Thank you. Oh, she's got a handheld, that's fancy; I have a tiny little microphone. Okay. So now we get into vector search. We're going back to talking about the ChatGPT, GenAI type of scenario. What is a vector database, and why is it important in the GenAI space? Essentially, a vector database is a purpose-built database that serves up a vector data type for complex machine learning purposes, and it relies on vector embeddings. What makes it possible is these embeddings. A quick explanation: normally we're trained to think about things in a single dimension, so we're more used to thinking of data as a scalar data type. But with GenAI, we need to figure out the context.
So when we search and store this data, we're not only storing a single value, we're storing more than that. That's when data needs to be expressed in different forms: multi-dimensional data, the vector data type. If you're a math person, you're aware of linear algebra; think of that. That's what makes the magic happen: all of this linear algebra, the matrix math, makes it possible, and it also drives all of these similarity searches behind the scenes. So think of a string of data coming in: you need to tokenize it and break it down, and each word that comes out of that gets mapped to some numerical representation. In a database like Cassandra, and Astra too, we actually store the data as a vector data type, and if you do a selection, there will be an array of floating-point numbers stored that represents the string you're trying to store. That's the idea; these are vector embeddings, and they get stored. So it becomes very important to select a database that has very fast, very stable storage. For example, in our Astra Cassandra database, we make use of Storage-Attached Indexing, which is a very, very fast way of doing things. And as far as searches are concerned, that's also very important when you do queries: we use the approximate-nearest-neighbor technique for querying the data. Essentially it sits in the machine-learning space; think of the vector database as serving a feature-store, feature-engineering type of purpose. Okay, real quickly: what are vector embeddings being used for?
You can use them outside of the GenAI context too, but essentially they're very good because now you're able to store data in a vector format. It's good for search: you rank all of the results by how close they are to the query string, for example; these are called similarity searches. Then there's also clustering, and recommendation scenarios. Say, for example, you're running a shopping cart at a shopping site and want to select some products that are close to what the customer is looking for; it helps in doing recommendations, those kinds of use cases. It's also good for anomaly detection, diversity measurement, and text classification. Okay, really quickly, for those of you who are math-savvy, this is what it looks like. If you represent your data in normalized vector format, using a two-dimensional space to describe it, you can see all the words represented as vector embeddings, and the graph shows where they're placed in this space. So it transforms the text into vectors; these are called embeddings. Right now, with OpenAI, the embeddings have 1,536 dimensions; there's a set number of dimensions they can have. And as you can see in this example, v1 is more similar to v2 than to v3. If you do searches, that's how it determines how close your result is to what you're trying to search for.
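That "v1 is closer to v2 than to v3" comparison is typically measured with cosine similarity. Here is a minimal sketch using tiny made-up 3-dimensional vectors (real OpenAI embeddings have 1,536 dimensions, but the math is the same):

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two embedding vectors:
    1.0 means same direction (very similar), 0.0 means orthogonal."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Tiny made-up embeddings; real ones come from an embedding model.
v1 = [1.0, 0.0, 1.0]
v2 = [0.9, 0.1, 1.0]
v3 = [0.0, 1.0, 0.0]

print(cosine_similarity(v1, v2) > cosine_similarity(v1, v3))  # True: v1 is closer to v2
```

Ranking stored vectors by this score against the query's vector is exactly the similarity search the database performs, with ANN indexes making it fast at scale.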
Here's another diagram that explains it: if you're searching for cat, dog, and house, you can look at the graph to see how close your result is to the search term. You can use this kind of representation. Okay, and then one more. This one gets a bit deeper and shows you all the layers of the search. I won't get into all of the details here, but it gives you the idea. It also helps make storage more efficient, and we have built-in algorithms to help with fast retrieval, using nearest-neighbor search over the embeddings. I want to point out JVector, for example, which was developed by our technical co-founder, Jonathan Ellis. He built JVector, and it's actually becoming very popular; it's really fast, and it makes use of DiskANN, a Microsoft algorithm, for doing super-fast searches over vector embeddings. Okay, so that's that. And then, really quickly, since I don't have as much time: you could still go back to using a traditional database to do these GenAI things, but be aware that it's just not able to handle the complex data types required in GenAI, the multi-dimensional data sets and all of that. And if you tried anyway, it would take a very long time, so it's not practical. Also, here's a really quick description of a data pipeline: how does data get to the vector database? Normally, you read data from a source such as an S3 bucket, a website, or a Kafka topic. Then you process the data to extract the text to vectorize. Then you split the text into chunks of a given size.
Then you compute the vector embeddings for each chunk, and then you write the chunks to the vector database. After that, you clean up all of the obsolete data from the vector database. So essentially, when you interact with the vector database, those are the steps that are needed. Okay, so, we only have three minutes left, and this part is more exploratory. RisingWave: how does it fit here? We feel that RisingWave has real strength in doing some of this processing, and Karin looked into this Twitter-feed type of example. But let me also point out, for a ChatGPT data-pipeline experiment: this diagram explains that when you're working with a large language model, there's a training phase that essentially generates your LLM, and an LLM as such is limited. As we all know, it's static; even though it's trained on a lot of data, the data could be from a year ago. So what happens in between? We use prompts to ask the bot to search for the things we need. So I'm thinking that, using RisingWave with the Twitter feed example, you could search for certain tags very fast, using that example Karin showed. And from there we can massage the data, essentially massage the prompt, and then send it to a chatbot to do the searches you need. That's the example we'd like to share, although we just didn't have time to get it running; maybe in the next phase we'll have this project actually up and running.
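The flow just described, pull tags out of the Twitter feed and fold the top ones into a prompt for the chatbot, might look roughly like this. This is a toy stand-in: in the real pipeline RisingWave would do the tag extraction and ranking in SQL over the Kafka feed, and the tweets below are made up for illustration.

```python
from collections import Counter
import re

def hashtags(text):
    """Pull #hashtags out of a tweet's text."""
    return [tag.lower() for tag in re.findall(r"#(\w+)", text)]

# Made-up tweets standing in for the Kafka-fed Twitter stream.
tweets = [
    "Loving #GenAI and #streaming today",
    "More #streaming experiments with RisingWave",
    "#GenAI is everywhere #streaming",
]

# Top tags; in RisingWave this would be a continuously updated materialized view.
counts = Counter(t for tweet in tweets for t in hashtags(tweet))
top_tags = [tag for tag, _ in counts.most_common(2)]

# "Massage the prompt" before sending it on to the chatbot.
prompt = f"Summarize what people are saying about: {', '.join(top_tags)}"
print(prompt)  # Summarize what people are saying about: streaming, genai
```

The point of putting the stream processor in front is that `top_tags` reflects what is trending right now, not what the LLM saw at training time.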
So essentially, a high-level description of the data flow of the sample: you have a Twitter feed, with the feed coming from Kafka, and you send it to RisingWave. RisingWave extracts certain tags that you're looking for, maybe the top selected tags. From there we construct the chat prompts, and then we send them over to the chatbot, where we're using Astra DB vector search, or Cassandra, to help manage the querying and storage of this vector data. So yeah, I think we're pretty good here. This is just another example showing where vector data and a vector database can sit within a generative AI RAG application, a retrieval-augmented generation application. Essentially, think of it as something that can help augment the LLM's data, because again, an LLM is very limited in what it can hold. Okay, all right, so I think we're good. We're done. Yeah, there are t-shirts in the back. So I want to point out that if you'd like a t-shirt, you can get a DataStax t-shirt. Karin put some t-shirts out for us, and also some information about our company and stickers in the back row. I love how she's giving me credit; she brought the t-shirts over here, I literally just put them on the chair. So yeah, thank you. And then, okay, resources, of course. If we run out of t-shirts, or you can't find your size, please visit our booth. Yeah, there's only medium and large. So please visit our DataStax booth upstairs; there will be plenty of staff to help you and answer questions. This slide deck, if you're interested, is here; you can reach it using this QR code and short link. And then these are resources from DataStax.
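The RAG idea mentioned above, look up the stored chunks nearest to the question and splice them into the prompt so the LLM sees fresher data than its training set, can be sketched like this. Everything here is made up for illustration: a real application would get embeddings from an embedding model and ask Astra DB's ANN index, rather than brute-forcing a Python list.

```python
import math

def cosine(a, b):
    """Cosine similarity between two small vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

# Made-up chunk embeddings standing in for rows in the vector database.
chunks = [
    ("RisingWave ingests Kafka streams.", [0.9, 0.1]),
    ("Cassandra stores vector embeddings.", [0.1, 0.9]),
    ("Stream processing runs SQL on events.", [0.8, 0.2]),
]

def build_rag_prompt(question, question_vec, top_k=2):
    """Retrieve the top-k most similar chunks and prepend them
    to the user's question as context for the LLM."""
    ranked = sorted(chunks, key=lambda c: cosine(question_vec, c[1]), reverse=True)
    context = "\n".join(text for text, _ in ranked[:top_k])
    return f"Answer using this context:\n{context}\n\nQuestion: {question}"

prompt = build_rag_prompt("How do events get in?", [1.0, 0.0])
print(prompt.splitlines()[1])  # the closest chunk comes first
```

The augmented prompt, not the LLM's frozen weights, is what carries the up-to-date information into the answer.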
And also, there's free usage of vector search on Astra, our managed cloud platform. This is how you can get to it: you get $25 US per month of free-tier usage, and it renews every month, so you get $300 per year. That's good enough for personal projects. I kept it really simple. Honestly, you could just go to risingwave.com, but you guys are in a fortunate situation where you have the head of product for RisingWave right there and the CEO of RisingWave right there. So if you have questions about use cases or anything related to RisingWave, these are the guys to talk to. Okay, awesome. I also have a Twitch stream; I promise to get back to it in the new year. If you want to follow me, I'll be doing more hands-on things on there, and if you want to participate, I invite you to join my stream. That's how it works for me. You are such a hustler. Okay, so with that, we want to thank you again. Here's how you can get hold of us; we invite you to stay in touch on LinkedIn, Twitter or X, and I also have a Discord channel, and Karin has her own consulting service, so if you need her help, please reach out to her. I feel like Mary needs theme music when she comes out. Okay, but thank you. Thank you so much again. Stay with us. Thank you, thank you. Thank you.