Today, I'm going to demonstrate how to detect fraudulent transactions at a financial institution. The demonstration uses a Red Hat OpenShift Data Science (RHODS) project, which lets data scientists seamlessly train their models without worrying about the overhead of setting up a cluster and all the tools associated with it. I will be using TensorFlow, and I will also be using a product called Starburst, which internally uses Trino. OpenVINO is also used briefly, but mostly under the hood: the RHODS ModelMesh uses the OpenVINO Model Server.

So let's say you're a data scientist, and you have a user with a transaction who wants to verify whether that transaction is fraudulent. To do that, we start with a data set. I picked one up online, and I will share the resources later. That data set needs some pre-processing, and this is where Starburst will help us (Starburst has many other features, but this is what I'm going to use it for). Then I'm going to use TensorFlow to train a model on that data set, and after that I'm going to serve the model using OVMS, the OpenVINO Model Server, which the RHODS ModelMesh uses as a backend internally. Once the model is served, the user can send a REST request and receive a reply telling them whether the transaction is fraudulent or not.

Here is a brief introduction to each product; there will be links if you want to learn more about RHODS, Starburst, or OpenVINO. Red Hat OpenShift Data Science is a managed cloud service for data scientists and developers. It provides a fully supported environment in which to rapidly develop, train, and test machine learning models in a public cloud before deploying in production. Starburst is a fast and scalable SQL engine architected for separation of storage and compute. Starburst Enterprise is cloud native and can query data in S3, Hadoop, and other sources; basically, Starburst can display multiple data sources to you in one dashboard. OpenVINO is an Intel toolkit, and we will use it briefly: it provides a kernel, and it also provides a model server, which we will use to serve our model. In the diagram, this is the section where the Starburst query editor comes into play in my demonstration, and this is where OpenVINO is used internally.

Let's see all of this in action. This is the GitHub repository which has all the code that is needed, along with all the instructions. You might be here from the Sandbox or the Red Hat Demo Platform (RHDP), or you might have your own OSD or ROSA cluster, in which case the prerequisite, of course, is that you have RHODS, which is the backbone of this demo. I'll open this on one side. As the repository says, first we need to create our data science project. To do that, we go to the RHODS dashboard, open the Data Science Projects section, create a data science project, give it a name, and add a workbench. The notebook image we are going to select is TensorFlow, because that's what we are going to use; this installs all the TensorFlow packages for us in the JupyterHub notebook that RHODS uses in the backend.
RHODS is, if you want to think of it that way, an accumulation of multiple individual pieces of software. We will give the workbench a name and create it, and we will give it a minute or so to start up; that's when my JupyterHub notebook will be ready. If we hover over here, we can see what it is doing, and you can see the event logs if you want: it's pulling the notebook image and so on. And then, voilà, it's there. Our notebook image is pulled and we are ready to access our JupyterHub notebook. When I click on this, I will not see the permission prompt because I have already logged in a couple of times; it's the RBAC policy, and you will only see it the first time.

Once we're logged in, what does it say? It's asking us to clone this repository, so we'll come over here and clone it. At this point it's asking us to double-click on our repository and notebook. We can ditch the README file because we will have all of that in our JupyterHub notebook, so I'll close this window and open the notebook in full screen.

Now we'll start by installing some dependencies we need for our project. While that is happening, in this next cell I need my AWS access key and secret key, so I'll paste those (I'll blur this area in the video later). The second cell is done. The cell over here helps us visualize: we will plot a few graphs later in our notebook, and this just makes the graphs look pretty, so run that.

After running that, it says to use the Starburst Enterprise platform to visualize and clean your data, so we go to this guide. The guide has some requirements, but basically we have our original data set, a CSV file with around 280,000 entries, and it may have some empty values (trust me, it will have some empty values). We need to upload that to our S3; I'll show a small code sketch of that step below. Depending on how you came to this video, you probably already have this set up for you, or you might be bringing your own S3, in which case you will be following these instructions. You also have to configure Starburst, which needs a license; again, depending on where you're coming from, Starburst might already be configured. I have already gone ahead and configured all of this, which is to say I have uploaded my CSV file to my S3 bucket, which is over here, in a folder called data. And I've also configured Starburst. If you look at the end of the configuration, Starburst provides a UI, for which I have a route here, and all of that information is inside this. Then you have files to create the license, the AWS credentials secret, the Starburst Enterprise CR, the Starburst Hive CR, and also the route (or "root", however you pronounce it), which is all done here. Now, when I click on my route link, it gets us here. You can give any name; I'll use admin. When you have the Starburst Enterprise license, you get this query editor, a nice-looking query editor, which is one of the features that helps you use Starburst seamlessly.
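If you are bringing your own S3 bucket, the upload step from the guide can also be done with a few lines of Python. This is a minimal sketch, assuming hypothetical names (a creditcard.csv file and a fraud-detection-ds bucket) and that your AWS credentials are already available in the environment.

```python
import boto3  # AWS SDK for Python; reads credentials from the environment

# Hypothetical names -- substitute your own local file and bucket.
local_csv = "creditcard.csv"
bucket_name = "fraud-detection-ds"

s3 = boto3.client("s3")
# Upload the raw data set into the data/ folder of the bucket, which is
# where the Starburst external table will point later.
s3.upload_file(local_csv, bucket_name, "data/creditcard.csv")
print("uploaded", local_csv, "to", f"s3://{bucket_name}/data/")
```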
So what I'm going to do at this point is take this out and split the screen half and half, so that we know what to do where. First things first, step one asks us to create a schema, so we'll copy that and paste it here. This is a different bucket; I'm using my own bucket, which is this one. Yes, that's right. This will create a schema for me named fraud. That created the schema, and if we refresh the data source viewer, we can see the new schema here. One fun thing which I love to do is run a USE on this S3 catalog and the fraud schema; that way, from here on, when I run a command, I don't have to specifically say that I want the S3 data source and the fraud schema — I can just refer to the table name directly. I find that useful.

To continue with our example, the second step is to connect our data set, which is sitting in our S3, to Starburst so that we can visualize the data. To do that, we will create a table named original with these fields (you know about these fields already), and we will use a WITH clause that sets the external location to s3a://fraud-detection-ds/data (s3a being the protocol used to access S3) and the format to CSV. That ran successfully, so we should see a table called original, and there you go: our data is already here.

The next part, now that this is done, is to visualize the data. Instead of copying the command from the guide, Starburst already provides a few of the most commonly used commands for a table, so I'll click on that, run it, and make the output bigger. There you can see all the data, and you may notice, for example, that the V13 column has an empty value on this row. That is what we will be tackling next using Starburst. Before we do that, I also want to copy this command (Ctrl+C, then paste) and count how many rows were in the original data set before we do our cleanup. Originally there are 284,800-something rows in this data set.
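For reference, here is roughly what those steps look like if you drive them from Python with the trino client instead of the query editor. This is only a sketch: the Starburst route, the catalog name s3, the bucket name fraud-detection-ds, and the abbreviated column list are all assumptions, and in the demo these statements are run through the query editor instead.

```python
import trino

conn = trino.dbapi.connect(
    host="starburst.example.com",   # hypothetical route to Starburst
    port=443,
    http_scheme="https",
    user="admin",
    catalog="s3",                   # assumed catalog name for the S3/Hive connector
    schema="fraud",
)
cur = conn.cursor()

# Step 1: create the schema that points at the bucket.
cur.execute("CREATE SCHEMA IF NOT EXISTS s3.fraud WITH (location = 's3a://fraud-detection-ds/')")
cur.fetchall()   # fetch forces the statement to complete

# Step 2: expose the raw CSV in S3 as an external table. The real demo
# declares all ~31 columns; only a few are shown here, and with the Hive
# connector's CSV format the columns are typically declared as varchar.
cur.execute("""
    CREATE TABLE IF NOT EXISTS original (
        "time" varchar, v1 varchar, v2 varchar, /* ... v3 through v28 ... */
        amount varchar, class varchar
    )
    WITH (external_location = 's3a://fraud-detection-ds/data', format = 'CSV')
""")
cur.fetchall()

# Step 3: sanity-check the row count before cleaning.
cur.execute("SELECT count(*) FROM original")
print(cur.fetchall())   # roughly 284,800 rows in the raw data set
```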
We will now clean that data set, which basically means removing any row that has an empty value, and we will use this command over here to do it. But before we do that, we'll actually use a trick: setting a session property. The curious mind may ask, what is this? Basically, the way Starburst works in the backend is that there is something called a coordinator and there are also workers. Think of a worker as the one that does the heavy lifting in these queries, and the coordinator as the traffic director that decides what to do. Whenever I run this command, it goes to the coordinator, and the coordinator uses one or all of the workers, depending on how many you have, to do the actual heavy lifting. In the process, you may end up with multiple CSV files, and that's normally fine, because it means the data set is being processed in parallel. But we want one single file as output, because that's how I wrote the JupyterHub notebook.

So what we are going to do is first run this command, and we'll see something called writer_min_size, which is 32 MB. Basically, any time the data being written goes beyond 32 MB, the writer scaling property (which is enabled) kicks in and says, well, this is more than 32 MB, so I should start another file — and it keeps doing that until the whole data set has been processed. I know my data set is roughly 160 MB, so the trick, or kind of a hack, to get around this is to set this property to 160 MB, since I know my data set can be at most about 160 MB. If I run the command again, we can see it is now 160 MB. What happens now is that while one of the workers is processing my data set, it won't try to split the write in parallel, because the data stays within 160 MB: the writer scaling property is still enabled, but the minimum threshold is never met. That's what gets us a single file in the end.

After this, the next part is the query that actually cleans the data set, and after cleaning it we will save the result into a new table called clean. I'll remove this prefix — I don't need it because I already set the schema earlier — so I'll use clean as the table name, and the output will go to my bucket named fraud-detection-ds, inside a folder called clean. Once I run this, it's going to take about 45 seconds to a minute. Meanwhile, we can review what comes next: there's nothing more to do here, and once the results are in, we should go back to our S3 bucket, where we'll see a folder called clean which has our new file, something like this.

So we'll go back to the original tab and close this README file. The query has finished, and it also tells us how many rows it has written. If you remember, initially we had about 280,000 rows; now there are approximately 253,000. The rest of them had empty values, so they were removed. At this point I'm considering my data set finally cleaned. Is there anything else to do here? You can also check the schema and see that the new table shows up there, so that's all good. I'll close this out now, close this window, and make this window larger. Let's go to my S3 bucket and refresh it: like it said over there, I have a clean folder in my S3 bucket, which has a CSV file. It doesn't have the .csv extension, but it is a CSV file, because we set the format in our query to CSV.

So now we go back to our notebook and open it; we already pasted our access key and secret key. There's a small helper function over here which downloads that file, the clean data set, to my local directory. The reason we download it is that we will then use the local file to perform our training and testing. So let's run that; if all goes well, I should have a folder over here. It's already completed, as you can see — let me refresh, and there you go, the file is here, and it is 142.6 MB. Great. Next, I use a pandas read_csv on that file, and I've used index zero here because I only have one file.
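To make the session-property trick and the cleanup query concrete, here is roughly what those two statements look like, continuing the trino-client sketch from above. The property name writer_min_size matches what the demo shows but has been renamed in newer Trino releases, and the WHERE clause here is only illustrative — the notebook's real query filters on every column.

```python
# Continuing with the cursor from the earlier sketch.
# 1. Raise the writer threshold so the cleaned output lands in a single file.
cur.execute("SET SESSION writer_min_size = '160MB'")
cur.fetchall()

# 2. Create-table-as-select: keep only rows with no empty values. Because the
#    fraud schema was created with location s3a://fraud-detection-ds/, this
#    table's CSV output lands under the clean/ folder of that bucket.
cur.execute("""
    CREATE TABLE clean
    WITH (format = 'CSV')
    AS SELECT * FROM original
    WHERE v13 <> '' AND amount <> '' AND class <> ''  -- illustrative; check every column in practice
""")
print(cur.fetchall())   # reports the number of rows written (~253,000)
```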
Remember how we set that session property earlier (session property — I can never say that word nicely)? If we had not set it, you would have ended up with multiple CSV files here, and you would have had to loop through the whole folder, or maybe use a concatenate function or something. I didn't want to do that; I wanted to keep it simple, so I used the session property as a hack.

Let's describe our data frame. We can see the total number of rows, like we saw in the Starburst query editor: approximately 253,000. If I have to explain this data set: these columns at the beginning — all of these you see, I think there are about 30 of them — will be our training features, if I'm using the correct data science and machine learning terminology, and this is the target column; all of those columns will help me predict the class. The class is binary: if it's zero, it's not fraud; if it's one, it is fraud.

So what we will do next is rename that class column to is_fraud, to make it a bit more meaningful, and we will also look at the percentage of fraud in our whole data set, so let's run that. We can see the percentage is 0.0017, which is a really small number — only 0.17% is fraud. In the data science world, if I understand correctly, this is called an imbalanced data set. The writer of the blog I'm following on Towards Data Science puts it in a funny way: while a low percentage of credit card fraud is certainly good news for the credit card company, it's not good for our model. So one of our next tasks will be to use a technique called SMOTE. We'll also look for missing data, but we won't find any, because there is none left in this data set. We'll also look at the correlation graph, which gives us an all-round understanding of the variables in our data set.

Next, we'll define X and y; if I understand correctly, X is the features and y is the target, so let's define those. Then we will use a standardization technique — basically, it's best practice when training neural networks to standardize the features, so that we get a good model — and we'll do the train and test split. And this is the SMOTE technique; the full form of SMOTE is Synthetic Minority Oversampling Technique. Because of that low percentage of fraud which we saw earlier (0.17%), SMOTE is basically going to balance it out, so now you can see that we have equal amounts of zero (no fraud) and one (fraud); it has balanced our data set.

The next part — I don't know how legible this is, so I'll try to zoom in a little bit; it seems like it may not be very visible. Okay, good, I hope you can see it well now. Here we define our layers. Like I said, we are using 30 input dimensions, we have a couple of layers over here, and sigmoid, I think, is the activation used on the final layer to get an output between zero and one.
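Condensed into code, those preprocessing and model-definition cells look roughly like this. It's only a sketch: the local file name, the column names, and the layer sizes are assumptions and may not match the demo notebook exactly, but the flow (rename, scale, split, SMOTE, sigmoid output) is the one described above.

```python
import pandas as pd
import tensorflow as tf
from imblearn.over_sampling import SMOTE
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Trino's CSV output has no header row, so supply the column names ourselves.
cols = ["time"] + [f"v{i}" for i in range(1, 29)] + ["amount", "class"]
df = pd.read_csv("clean_part_0.csv", header=None, names=cols)  # hypothetical local file name

df = df.rename(columns={"class": "is_fraud"})
print(df["is_fraud"].mean())                 # ~0.0017, i.e. about 0.17% fraud

X = df.drop(columns=["is_fraud"]).values     # the ~30 training features
y = df["is_fraud"].values

X = StandardScaler().fit_transform(X)        # standardize the features
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

# SMOTE: oversample the minority (fraud) class so the training data is balanced.
X_train, y_train = SMOTE(random_state=42).fit_resample(X_train, y_train)

# A small network with 30 input dimensions and a sigmoid output; the layer
# sizes here are illustrative, not necessarily the demo's exact architecture.
model = tf.keras.Sequential([
    tf.keras.layers.Dense(32, activation="relu", input_dim=X_train.shape[1]),
    tf.keras.layers.Dense(16, activation="relu"),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
```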
Let's run that — so those are our layers. And the next part is the training part; this is the fun part. We have 100 epochs here, which will take some time to run, so what I will do is run this and pause the video, and come back when it's done — or maybe I will just do 10, although even 10 takes time.

Okay, so now we can see that our 10 epochs are completed. Something to note here: on epoch one we started with an accuracy of 0.21, and at epoch 10 we ended with an accuracy of 57%. With 10 epochs, that's all you're going to get; if you run it higher — ideally 100 — you will see your accuracy improve a lot more. But this is just a test model, and I'm just showing you how to run it. When you're doing this at your leisure, you can run 100 epochs, but it takes a while, probably 20 minutes or so — and that again depends on how many workers and how much resources your cluster has. Anyway, after that run, a model is ready at this point.

What we will do now is save our TensorFlow model — that's what we are doing here — into a folder called tensorflow_pb_models, because that's the format (.pb, I think) in which it saves. Then we are going to use a binary called Model Optimizer. This binary is provided by OpenVINO; it takes the TensorFlow model and converts it into OpenVINO IR format so that we can serve the model through the OpenVINO Model Server. Let's run this cell and see what we get. We can see the assets are written over here; let's refresh, and here we have our saved model, the assets, and everything, and the Model Optimizer has also taken that model and converted it into the supported format, OpenVINO IR, which the OpenVINO Model Server can read.

Okay, so that's it. The next part is to upload my OpenVINO IR formatted model to our S3 bucket, and there is a guide for that. One thing to note here is that, depending on where you are coming from again — Sandbox, RHDP, or your own cluster — step one in this guide is kind of optional (I should mention that there): optional in the sense that I recently added a new cell over here. We already have the AWS credentials from earlier, so we're going to make use of that session and upload our files — these two files over here — to our S3 bucket. Let's go ahead and run this so that our model is uploaded to the S3 bucket. And that's done, just like that. Let's go to our S3 bucket and check if our model is uploaded — there you go, our model is uploaded in the model folder. Great.

Now, at this point, I'll go back to the guide. The guide is still useful — it's not that it isn't — and I will tell you why: now that we have our model uploaded to S3, we need to do some more configuration in our data science project, in the RHODS dashboard. That is what step three covers; we don't need to do step two here, because we have already uploaded the model.
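Those three steps — save the TensorFlow model, convert it with the Model Optimizer, upload the IR files — boil down to something like the sketch below. The mo flags are from the openvino-dev Model Optimizer, the model variable comes from the earlier sketch, and the directory names, output file names, and bucket layout are assumptions for illustration rather than the notebook's exact cells.

```python
import subprocess
import boto3

# 1. Save the trained Keras model in TensorFlow SavedModel (.pb) format.
model.save("tensorflow_pb_models/fraud")

# 2. Convert the SavedModel into OpenVINO IR (a .xml / .bin pair).
subprocess.run(
    ["mo", "--saved_model_dir", "tensorflow_pb_models/fraud",
     "--output_dir", "openvino_ir", "--model_name", "fraud"],
    check=True,
)

# 3. Upload the two IR files into the model/ folder of the bucket, which is
#    where the data connection and model server will look for them.
s3 = boto3.client("s3")
for fname in ("fraud.xml", "fraud.bin"):
    s3.upload_file(f"openvino_ir/{fname}", "fraud-detection-ds", f"model/{fname}")
```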
Step three is the part which we will be doing here. Basically, we'll create a data connection and give it a name — I was looking for something to copy and paste, but let's just call it credit card S3. Again, we will use our credentials (I have to blur this section; I need to remember to do that) and paste them in. We'll leave everything else as it is: my bucket is in us-east, so I'll leave the region alone. The bucket name — yes, that's the important part — I'll copy that, paste my bucket name here, and just create the connection. There's this new option about connecting it to my workbench — is that needed? No, it doesn't seem to be needed here. All right, so now my data connection is created.

The next thing we need to do over here is to deploy a model server first. These are just the properties of the server: it's one instance, it's small, and I definitely need to make it externally routable, so I'll check that and click Configure. There you go — it's pretty much instant, and my model server is there. Now I just need to deploy a model in it, the model that this server will serve. I'll scroll down — oh, actually this tip shows me how to do it: click Deploy model — okay, this one here. Then the model framework is OpenVINO IR format (if you have your model as ONNX, that's also supported). Existing data connection: yes, we already created a data connection, and we give it a path. I first thought the path would be models/model2, but actually, no — the path is just model. Let's deploy that, and we'll wait a little bit for the model to get deployed.

So our model is now deployed; if I look at its status, it says loaded. It also tells me to copy the inference link and go back to the notebook, so I'll copy this inference link and go back to my notebook. Here the notebook is asking me to paste my inference link — that's the next task, so that's what I do. I have already pre-formatted a JSON request, which I'm going to send as a POST request to my REST endpoint, which is on my inference link, this one here (there's a small code sketch of this call below). So let's run that and see if it works. Okay, good — based on this, the output is telling me that this transaction is actually a fraud, because it returns one, which means fraud. Now, how effective is our model? We pretty much know our model is probably 50% accurate at this point — 49.99%, which is about 50% — so there's a high chance, a 50% chance, that this answer is wrong. But the purpose of this demonstration was not to build the perfect model; it was to show how seamlessly all of these different software products — the Jupyter notebook, TensorFlow and other Python libraries, Starburst, the data sources and connections — work together, how that can help your productivity, and how quickly you can stand things up, so that data scientists can focus on what they should be focusing on: creating good models, not the overhead around them.
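Here is a sketch of that inference call. The payload follows the KServe v2 REST format that a ModelMesh/OVMS endpoint typically expects in this kind of setup, but the URL, the input tensor name, and the shape are assumptions — check the deployed model's metadata and the notebook's pre-formatted request for the real values.

```python
import requests

infer_url = "https://<inference-link>/v2/models/fraud/infer"   # hypothetical endpoint
payload = {
    "inputs": [{
        "name": "dense_input",          # assumed input tensor name
        "shape": [1, 30],               # one transaction, 30 features
        "datatype": "FP32",
        "data": [0.0] * 30,             # replace with a real (scaled) transaction
    }]
}

resp = requests.post(infer_url, json=payload)
resp.raise_for_status()
# The sigmoid output is a probability; values near 1 mean "fraud".
print(resp.json()["outputs"][0]["data"])
```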
I also have some fun stuff which I was playing around with, so let's run those cells; I'm just taking the model's predictions here, and data science people will pretty much know what this is. Since each predicted value is supposed to be a probability between 0 and 1 (0.1, 0.2, and so on), I'm saying that if it is more than 0.5, then it is fraud, and if it is less than 0.5, then it's not fraud. That's what I did. I'm also calculating the accuracy score here, which tells me the model is 99% accurate — but wait till we get to the F1 score, which tells us it's not quite that; it's around 91%. I honestly don't know the detailed difference between these two metrics — that's something I'm just not so data-science literate about, I guess — but I do know that the F1 score is different from the accuracy score.

And this one, the confusion matrix — this is the thing I like — shows the true label against the predicted label. It's saying that the model predicted non-fraud to be non-fraud 50,606 times, which is great. But it predicted fraud to be non-fraud 22 times, and that is probably a problem: you cannot afford to predict a fraud to be non-fraud. So our model is not perfect; we can see that here. Then our model also predicts 25 non-frauds to be fraud, which is okay, because that's a false positive — non-frauds being treated as fraud — and that's not such a problem in the financial world. What is a problem is those 22 over here. And the other thing we see is that our model flags 75 of the frauds as fraud, so that's good.

Anyway, that's all for this presentation. I hope you enjoyed it. The GitHub repository has all the details on the things I was setting up over here, and the details for the things I didn't set up here are in the GitHub repository as well. Thank you. Thank you so much.
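As a short code footnote, the evaluation cells described above boil down to a few scikit-learn calls. This sketch reuses the model, X_test, and y_test names from the earlier sketches; the exact notebook cells may differ.

```python
from sklearn.metrics import accuracy_score, confusion_matrix, f1_score

y_prob = model.predict(X_test).ravel()       # probabilities between 0 and 1
y_pred = (y_prob > 0.5).astype(int)          # > 0.5 -> fraud (1), otherwise not fraud (0)

print("accuracy:", accuracy_score(y_test, y_pred))
print("f1 score:", f1_score(y_test, y_pred))
# Rows are the true labels, columns the predicted labels:
# [[true negatives   false positives]
#  [false negatives  true positives ]]
print(confusion_matrix(y_test, y_pred))
```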