Welcome, everybody. Hopefully you had a great coffee break and you're all energized to listen to us over here. I'm Ekta Sachan, a senior engineering manager with Amazon Keyspaces. Today with my colleague Himanshu over here, we're going to talk about how you can leverage Apache Cassandra to build highly scalable and efficient AI-based applications. I'll do some baselining first: what's machine learning? I'm sure you're all very well aware of it, but we'll take two minutes to go over that, and then why Apache Cassandra is a great backend system. We have chosen some technologies, one of them Amazon SageMaker, to set up a demo. And the fun part of the talk is going to be with Himanshu over here, who will go over two use cases and show some live demos as we speak.

So what's machine learning? As humans, some things come very naturally to us: you can look at certain data sets, make some observations, and make decisions. Teaching that to a machine is machine learning. In machine learning there are two approaches that are generally applied: supervised and unsupervised. In supervised learning, you have a known input data set and a known output data set, and you train your algorithm on those to predict outcomes. Within supervised learning you have two techniques: classification and regression. In classification, you label the data, so you could put it into a normal and an abnormal category and teach the algorithm what the right pattern looks like, then use that to predict the outcome of an event, or what could happen in the future. General use cases would be: is this email coming from a genuine place, or is it spam? Or your bank could use your data sets — you're doing transactions with them, you have a mortgage, you have a credit history — to predict when a customer could go into default. That's where classification comes in. Regression is also supervised, but there you're looking at a continuous set of data to teach your algorithm how things connect with each other, and using that to predict certain values. A general use case is house prices: the lot size, the dimensions of the house, the neighborhood, the number of bedrooms and bathrooms — how those are associated with the price. Predicting a pricing model for a house is where regression could be used. Now coming to the unsupervised side of the techniques: you do not label the data set. The algorithm is looking for common clusters, which means you cluster the data based on similarities between objects and what's different between those objects. One place you could apply this thinking is identifying which kinds of customers made purchases of a similar kind of product — a very powerful technique.
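(As a toy illustration of the two styles — not part of the talk's demo code — here is a minimal scikit-learn sketch with made-up features:)

```python
# Toy sketch only: supervised vs. unsupervised on made-up data.
from sklearn.linear_model import LogisticRegression
from sklearn.cluster import KMeans

# Supervised classification: inputs come with known labels (0 = genuine, 1 = spam).
X = [[0.1, 12], [0.9, 2], [0.2, 10], [0.8, 1]]      # e.g. link density, sender history
y = [0, 1, 0, 1]
clf = LogisticRegression().fit(X, y)
print(clf.predict([[0.85, 1]]))                     # predict the label of a new email

# Unsupervised clustering: no labels; group customers by purchase similarity.
purchases = [[5, 100], [6, 120], [1, 15], [2, 10]]  # e.g. order count, total spend
km = KMeans(n_clusters=2, n_init=10).fit(purchases)
print(km.labels_)                                   # which cluster each customer landed in
```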
There is another technique that commonly gets used in machine learning as well: semi-supervised, where you combine a small labeled data set with a larger unlabeled one, and you can get a better predictive algorithm over the whole data set. Now, where could we be using some of this machine learning? One of the common problem sets is anomaly detection. What's anomaly detection? Well, as humans we can look around and say: I'm not a person wearing a brightly colored pink shirt, but if someone looks different because they're wearing a different color, they stand out. That's a pattern you could teach your algorithm. Some of the places where we could use this are fraud detection and health monitoring — what's a normal-looking image and what's not. It can also help you find opportunities: you're in a certain market segment, and an anomaly could be, hey, we have not penetrated this market; that's a use case you could be building for and marketing. So it's a very powerful tool that can be utilized in many different places.

Now I'll come to why Apache Cassandra. We're almost at the end of the second day, so I'm sure we can all agree there are some great benefits Apache Cassandra provides as a NoSQL database. It's a highly scalable system: you can scale horizontally by adding nodes, and you can expand across multiple data centers. That means it can provide very high read and write throughput, so you could be sending in your IoT data, your log services data — a lot of data crunching you would want to do on those data sets. Its inherent architecture provides fault tolerance, and the replication gives us confidence that you're not going to lose any of your data, so you can store your mission-critical data there and do any kind of analysis on it, along with all that high throughput. It offers high performance at high throughput, which makes it really optimal for big data sets on the OLTP side — online transaction processing — where, if you define your schema really well, you can answer your questions in real time. And in conjunction with products like Apache Spark and Apache Kafka, Cassandra data sets can be used to answer questions over multi-dimensional data sets for online analytical processing. The other benefit of Apache Cassandra is that it's not bound to any single product or any single cloud. You could have a multi-cloud strategy, it could be hybrid, you could be in your own single data center — it's really great.

Now, I'll quickly talk about certain product choices we have made in our demo. We've used Amazon Kinesis instead of Kafka for ease of setup, as well as Amazon Keyspaces. Amazon Keyspaces (for Apache Cassandra) is a managed service which provides a similar kind of scale.
It's Cassandra-compatible, so you can use your CQL queries just as you do in Cassandra, with the same drivers, as long as they're Apache 2.0 licensed. And you just don't have to manage your servers — it's pretty much that. With that, we have also chosen to use Amazon SageMaker. It's a great service for the ease of being able to choose your data set and choose your model — it has all those models in there. So if you're just trying to figure out which model is best suited, you can use some of them to figure it out, train it, publish it, deploy it. It just works. One of the great things that was very recently added to the toolkit is model monitoring. While it's really great to be able to set up the model and train it and get the right predictions, over a period of time things can change: certain attributes change, you stop getting that data, some biases come into the data set. So being able to figure out how well your predictions are doing, and being able to adapt, is very critical as you operate over a long time. Having that monitoring is really key to making sure these things keep working well in the long term. With that — I know I've done a lot of talking over here — let's get some action. Himanshu, all yours.

Thank you so much, Ekta. Hey, folks. I'm Himanshu Jendal. Let's go into the demos. I'll try to do something very brave: I'll show two demos in real time. As we speak, I'll run those commands — hopefully everything goes well. The first demo is a recommendation engine using SageMaker: I'll show you how you can train on your data, read from Keyspaces, create segments, and then store them back in Keyspaces. We'll also do a real-time fraud detection exercise: we'll pump data into Kinesis, watch our machine learning algorithm predict based on the training we have provided, insert that data into Keyspaces, and see all of that happen in real time. So let's go straight into it.

The very first demo is about a recommendation system. Our use case is that we're a retail company and we're putting transactions into Keyspaces. It's a very common use case: as a retail company, you want to figure out who to send your ads to. Your ads for customers who are very new are likely going to be different from those for customers who have been using your products for a long time, and very different from those for customers who haven't used your product in a long time — they haven't come to your store; maybe they were using it before. So we are going to use a technique called customer segmentation. For this example, we use a very simple model to segment our customers into just five segments, so we can see what kind of ads to send them. We'll basically read the transactions from Keyspaces, segment them, and then store them back in Keyspaces. Let's see how this comes together. Mirror the display — all right, we're live. So we have a table here, already created, which contains a lot of transactions for our customers. Let's take a look at that table: it contains invoice, stock code, country, description, and we have about 20,000 entries over here.
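(A condensed sketch of what the notebook walked through next boils down to. Keyspace, table, and credential names here are hypothetical, and it assumes the Starfield root certificate that Keyspaces TLS connections use and the open-source, Apache 2.0 licensed cassandra-driver:)

```python
# Condensed sketch of the segmentation notebook. Hypothetical names throughout:
# keyspace "retail", tables "transactions" / "ml_clustering_results",
# and service-specific credentials.
from ssl import SSLContext, PROTOCOL_TLSv1_2, CERT_REQUIRED
from cassandra import ConsistencyLevel
from cassandra.cluster import Cluster
from cassandra.auth import PlainTextAuthProvider
import pandas as pd
from sklearn.cluster import KMeans

ssl_context = SSLContext(PROTOCOL_TLSv1_2)
ssl_context.load_verify_locations("sf-class2-root.crt")
ssl_context.verify_mode = CERT_REQUIRED

auth = PlainTextAuthProvider(username="ks-user", password="ks-pass")   # hypothetical
cluster = Cluster(["cassandra.us-east-1.amazonaws.com"], port=9142,
                  ssl_context=ssl_context, auth_provider=auth)
session = cluster.connect()

# Pull the raw transactions into pandas (roughly the 20,000 rows from the demo).
rows = session.execute("SELECT customer_id, invoice_date, amount FROM retail.transactions")
df = pd.DataFrame(list(rows)).dropna()

# Recency-frequency-monetary (RFM) features per customer.
now = df["invoice_date"].max()
rfm = df.groupby("customer_id").agg(
    recency=("invoice_date", lambda d: (now - d.max()).days),
    frequency=("invoice_date", "count"),
    monetary=("amount", "sum"),
)

# k-means into five segments, then write the assignments back to Keyspaces.
rfm["segment"] = KMeans(n_clusters=5, n_init=10).fit_predict(rfm)
insert = session.prepare(
    "INSERT INTO retail.ml_clustering_results (run_id, customer_id, segment) VALUES (?, ?, ?)")
insert.consistency_level = ConsistencyLevel.LOCAL_QUORUM   # Keyspaces writes need quorum
for customer_id, row in rfm.iterrows():
    session.execute(insert, (121, customer_id, int(row["segment"])))
```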
So with SageMaker, I have a lab created already, and I have this command stored. What I'm doing here is loading this data into SageMaker. You can see here I'm establishing an SSL connection to Keyspaces, selecting the schema, and selecting the data from that schema. Then I'm munging the data. You can see there are a lot of comments over here because, as you're building an algorithm, you want to remove all the nulls, you want to add data, you want to enrich that data. Then I'm creating a recency-frequency-monetary (RFM) score from it, which is based on how recently customers come to your store, how frequently they're coming, and how much money they're spending. Then I'm running k-means clustering on it, and then I'm storing the data back to Keyspaces. So I'm going to change this run ID to something unique — let's say 121 — and run this algorithm. So it's running. Yep, it has segmented, and it takes a little bit of time to process that and put it into Keyspaces. Now we go to Keyspaces. We have this ML clustering results table here, which we can query, and you can see that for run ID 121 there are five segments created, with the number of customers in each. So if we wanted, we could send them targeted emails, or whatever different ads we wanted to. So that's the first demo — it allows you to segment your customers. Let me just make sure we can see this. We got it back. Yeah.

Now that you saw how we can segment our data, let's see how we can build something real time. Let's say we're a bank, we have these transactions coming in, and we want to detect fraudulent transactions in real time. The number one piece we need to build is a model that can give us real-time results telling us whether a transaction is fraudulent or not. We have used the service Amazon Fraud Detector to train a model. We took annotated data in S3 — it was about 200 megs in size — and trained the model using Amazon Fraud Detector. It took about a few hours, but once we did that, we had an endpoint which we can call with our data, and it will tell us whether it's fraud or not. Now that we have that model, how do we actually use it to build a real-time system — a workflow that we can just deploy, that is totally serverless, so we don't have to worry about it at all? The first thing we'll do is take our transactions and put them into Kinesis. Kinesis is a managed serverless streaming service: you put your data in, you can get it back out, and you can put multiple processors on it. Here we're going to put a Step Functions workflow on it. Step Functions is another serverless offering; it allows us to orchestrate serverless workflows on events that are coming in. In this case, we have hooked a Step Functions workflow onto our transaction data stream. First it will perform inference using the model that we just trained. Then it will store those transactions in Keyspaces. And then, if the transaction is fraudulent, it will send a notification to SNS, where we can take actions — like saying, if the transaction is over $1,000, send a text to someone, or send a notification to a customer support person.
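(The demo drives this from Step Functions — one state calls Fraud Detector, a Lambda state writes to Keyspaces, and a choice state fans out to SNS. As a hedged sketch, here is the same flow condensed into a single Python handler for illustration; the detector, event type, table, and topic names are all hypothetical:)

```python
# Condensed sketch of the per-event flow (the demo splits these steps across
# Step Functions states). Detector, event type, table, and topic names are hypothetical.
import datetime
import json
import boto3
from cassandra import ConsistencyLevel
from cassandra.query import SimpleStatement

fraud = boto3.client("frauddetector")
sns = boto3.client("sns")

def handle_transaction(txn, session):
    # 1. Score the incoming transaction with Amazon Fraud Detector.
    resp = fraud.get_event_prediction(
        detectorId="bank_txn_detector",                   # hypothetical
        eventId=txn["transaction_id"],
        eventTypeName="bank_transaction",                 # hypothetical
        eventTimestamp=datetime.datetime.utcnow().strftime("%Y-%m-%dT%H:%M:%SZ"),
        entities=[{"entityType": "customer", "entityId": txn["customer_id"]}],
        eventVariables={"amount": str(txn["amount"]), "email": txn["email"]},
    )
    outcome = resp["ruleResults"][0]["outcomes"][0]       # e.g. approve / review / block

    # 2. Store the transaction plus its outcome in Keyspaces.
    stmt = SimpleStatement(
        "INSERT INTO bank.transactions (transaction_id, customer_id, amount, outcome) "
        "VALUES (%s, %s, %s, %s)",
        consistency_level=ConsistencyLevel.LOCAL_QUORUM,  # Keyspaces writes need quorum
    )
    session.execute(stmt, (txn["transaction_id"], txn["customer_id"],
                           txn["amount"], outcome))

    # 3. Fan suspicious transactions out to SNS for follow-up.
    if outcome != "approve":
        sns.publish(
            TopicArn="arn:aws:sns:us-east-1:123456789012:fraud-alerts",  # hypothetical
            Message=json.dumps(txn),
        )
```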
We'll also extend this use case to replicate this data to another AWS region. Imagine you're storing these transactions, but you want to provide 24/7 support to your customers. You might want to replicate this — I've replicated it to India, for example — so they can also access this data and act on it. If there's a fraudulent transaction, they can see it, call that person, and maybe fix it or put some notes in. So that's a lot of services; let's see how they actually play together in a demo.

Nice. Let's work from our building blocks. This is the Fraud Detector model that I've already trained using Fraud Detector. I have a detector here — you can see it was created four days ago — and here's the model, and it's trained. There are basically three rules; the model assigns a score. If it's above a certain number, we say, okay, this is fraudulent. If it's between certain numbers, we aren't sure, so we say it needs review. If it's less than that, we say, okay, we approve this transaction. For this example, I'll use the Kinesis Data Generator — if you haven't used it, I recommend this tool a lot. It lets you put realistic-looking simulated data into Kinesis. I'll show you how to create a template that produces transactions that look like real transactions. All right: you just select your region, you select your stream, you specify the rate here, and here's the template. I'm saying, hey, create a transaction ID within these ranges, a timestamp, a customer email — so I create a record like that, and then I can test my system and see how it works. And here's the state machine I have — I can even show you its definition. You can see that the state machine first calls Fraud Detector to get a prediction, then calls a Lambda to store that transaction in Keyspaces, and then, if it's a suspicious transaction, sends it to SNS; otherwise it just ends. So you can see it's all just configuration, very easy to deploy.

Now let's see how it looks in action — send data. So now it's sending data, and you can see I already have events coming in here. And I have my Keyspaces table here, where I can select my bank transactions, and you can see it's getting filled up. Run this. This is in my us-east-1 region — all my resources are in us-east-1 — and you can see here 3:11; these are the latest ones, because it is 3:11 right now. We can see our workflow is running, and if I refresh, there will be many, many more which have been executed. As you can see, it's running in real time and is flagging some of the transactions as blocked, some as needing to be investigated, and some as approved. So it works in real time. I'll also show how these are being replicated. This is my Bombay region — I didn't have to do anything; these are just being replicated to the Bombay region automatically, in real time. And it's also serverless. And you know, the machine learning algorithms that we've used are pretty simple; if you're into machine learning, you can really make these more advanced and train them further. And as I said, you can then monitor your model and improve on these models as well.
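(If you'd rather script the load than use the Kinesis Data Generator, a minimal boto3 producer like this sketch does the same job — the stream name and record fields are hypothetical and mirror the template described above:)

```python
# Minimal producer sketch; stream name and record shape are hypothetical and
# mirror the Kinesis Data Generator template described above.
import json
import random
import time
import uuid
import boto3

kinesis = boto3.client("kinesis", region_name="us-east-1")

while True:
    txn = {
        "transaction_id": str(uuid.uuid4()),
        "customer_id": f"cust-{random.randint(1, 500)}",
        "email": f"user{random.randint(1, 500)}@example.com",
        "amount": round(random.uniform(1, 2000), 2),
        "timestamp": int(time.time()),
    }
    kinesis.put_record(
        StreamName="bank-transactions",          # hypothetical
        Data=json.dumps(txn).encode("utf-8"),
        PartitionKey=txn["customer_id"],         # spreads records across shards
    )
    time.sleep(0.1)                              # roughly 10 records per second
```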
All right, so these are the links for the demos that I performed, and you can ask me any questions or anything that you have. Thank you so much. Here is our contact information, and these are the talks that we did. So thank you so much, folks — we'll take any questions that you have. Thank you. And thank you, everybody.

Hi. Our team currently uses Cassandra as our main data store. It's part of our data pipeline, and right now we don't have any machine learning components. But I think many teams are always thinking of ML as the next step — how do we integrate ML, just like you did? The challenge is that Cassandra really depends on how you design the data model: it changes how the partitions are created, and that affects read and write throughput and all of that. So when you created your products, did you have the machine learning component in mind, as well as the client endpoint? Did that change how the data model or partition design occurred? Did you have to make any tweaks? I'm just curious about your experience.

Yeah, I would say for prototyping, the schema is less important. Your most important goal is to show that you can provide value to your business. One of the things I did was take a subset of records — 20,000 out of a big data set — and run clustering on it. So the first step should be to see how you can get value out of that data, so you can show your business that yes, by running this algorithm using these tools, we can get insight out of it. Then, once you're going to scale, I would look at it as: for my machine learning, I need this schema, but for my reads and writes, I need a different schema. If there's a different schema, you can think of creating a replica. You can build it like we did in my second example, where there's a stream and you can put the data into two places. Or you can use Apache Spark to run a job every few hours to take that data and put it in another place. Then I would look at the cost analysis and see how important it is. Depending on how big your data is, it might not be a big problem; but if it is a big problem, you can definitely consider one of those options.

I will add: Cassandra is a NoSQL database, so defining your schema matters, especially if you're serving real-time queries. Everything he mentioned applies if you're doing analytical processing — that means you're munching that data after the fact; you could read it with Spark, or store it in a temporary setup and analyze it from there. But if you're trying to answer in real time, your schema really matters in your NoSQL configuration. And storage is cheap — that's the whole thing with NoSQL, that's where we have gone to. So having a copy of that same data in multiple forms makes it more efficient; that's the core idea behind it.
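(One way to picture that "same data in multiple forms" advice is a sketch like the following — the table definitions are hypothetical: one table partitioned for per-customer OLTP lookups, and a second copy partitioned by day so an ML feature job can scan a whole day cheaply:)

```python
# Hypothetical table definitions: the same transactions kept in two shapes,
# each partitioned for its reader.
oltp_table = """
CREATE TABLE IF NOT EXISTS bank.transactions_by_customer (
    customer_id text,                -- partition key: point lookups per customer
    txn_time    timestamp,
    amount      decimal,
    outcome     text,
    PRIMARY KEY ((customer_id), txn_time)
) WITH CLUSTERING ORDER BY (txn_time DESC)
"""

analytics_table = """
CREATE TABLE IF NOT EXISTS bank.transactions_by_day (
    day         date,                -- partition key: whole-day scans for ML feature jobs
    txn_time    timestamp,
    customer_id text,
    amount      decimal,
    PRIMARY KEY ((day), txn_time, customer_id)
)
"""

# A stream consumer (or a periodic Spark job) dual-writes so both copies stay current.
for ddl in (oltp_table, analytics_table):
    session.execute(ddl)
```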
I'll repeat what they said: in reality, you want to use Keyspaces when you already have existing data there. I would not recommend Keyspaces if you're starting a brand new project that you know you're going to use just for ML, for analytical processing. As I said, in that case you take the existing data from Keyspaces, create your data lake, and create other processes to serve you. But if you want to get data out of existing schemas, yes, this is the best example. And again, one thing from the cost perspective: you pay for each read, and for the ML approach you sometimes need to read the same data multiple times. So sometimes it's beneficial to build a process to extract the data — we have multiple ways of extracting data — store it in S3, store it in your data lake, and use that as your primary source.

Any other questions? If not, thank you so much for coming. It has been a pleasure to have this opportunity and share some of our learnings. Yeah, thank you so much.