Welcome everyone to the fraud detection use case demo, a walk-along, follow-through, follow-along, however you want to call it. This is version 2 of this use case. I had a previous video recording for it, but this is an updated version: we have made some changes to the use case, like the federated query and the materialized views we have added, and I've fixed the resolution issues. This is going to be a lengthy video, probably 30 to 40 minutes, so grab a double cup of coffee or something. By the end of it you will know about RHODS (not everything about RHODS, it's a big product, but parts of it), about Starburst, and about what model serving and OpenVINO are. So if you're absolutely new, you will pick up some things here, and if you are already in the field of AI and ML, you'll probably have a nice use case to demo to a customer or a potential customer, or just to play with yourself.

I'll stop talking and bring up the blog which my manager and I wrote. This is the blog associated with this video and this use case, and you can read more over there; it explains what we have done. I'm going to replace the embedded video with the one I'm recording right now (that one is the older video), but the last link over there will take you to the GitHub repository. That repository is the base repository for this workshop and use case. If you don't have access to demo.redhat.com (RHDP), you will be bringing your own cluster, and in that case you will have to install certain things yourself. There are some prerequisites, and the config folder tells you about all of them. If you bring your own OCP cluster, make sure you also bring a Starburst license (you'll probably have to get a trial license or contact Starburst for that), your Postgres credentials, which you get after deploying Postgres, and your Amazon S3 bucket or AWS credentials. We will be creating an S3 bucket and reading our dataset from there. So you're getting used to the terminology already.

The best part is, if you have access to RHDP, I would highly recommend deploying the workshop from there, because most of the overhead associated with this workshop has been automated using an Ansible script. The workshop link is over here; if I click on it, it takes me to the workshop, and I basically click on Order. I'm selecting my activity as Practice, and you can use your own, but remember that for some of them you need to provide a Salesforce ID, otherwise the form stays grayed out. The most important thing is to make this an open environment in the us-east-2 region. I confirm, and now I'm going to order this. It will take maybe 30 to 40 minutes, maybe an hour, to deploy an OCP cluster for me, and then it will run my Ansible script to do all the overhead: the prerequisites like RHODS and Starburst get deployed, installed, and configured according to the needs of this workshop. Once all of that is done (I'm going to pause the video now), I'll come back and we will continue from there.

Welcome back everyone. Our RHDP cluster and workshop are now deployed, and we can continue. Let me just get myself out of the way on screen and put this up. Great, still recording? Yes, good.
So this is our base repository, and this is the services page where we had the deployment scheduled: it was provisioning, and now its status is Running. Because we enabled an open environment, we get our AWS access key ID, AWS secret key, and so on, and all of this information is also emailed to the address you used to deploy the workshop. Some key things I have added here are the Starburst query editor link (this is the link for that) and the AWS bucket that was created by the workshop. So these are the two, actually three, important things: this, this, and this. Keep an eye on them, keep this page open in one tab, and then head over to the OpenShift console. Also, about the AWS keys: they are only valid as long as your cluster is there, and it's a pretty ephemeral cluster with auto-stop and auto-destroy features, so the keys are no good after a certain point in time. Even so, keep in mind not to expose them.

Let's log in; these are the login credentials, and we log in as the cluster admin. We're in. The next thing to do is hover over and take a look at the installed operators: RHODS is installed, and the Starburst Enterprise Helm operator is installed too. There shouldn't be anything under the RHODS instances yet, but in the Starburst section I do expect some instances. Yes, there are the StarburstHive and StarburstEnterprise instances, which are initialized and deployed. Let's check the resources: yes, all of them are deployed and running.

One thing to note here, though. Because of how the Red Hat demo platform is set up (it has an auto-stop feature) and how Starburst licensing works, the license is deleted from the cluster after the workshop is deployed. So if your OCP cluster or this workshop stops or pauses and you then restart it, you will have pods stuck in Pending or in a config error state, because they will be looking for a secret which is no longer there. So either keep the cluster running (it's not advisable to disable auto-stop, but you can take responsibility for it; just make sure you don't spend too much on cloud fees), or, if that is a concern, contact the Starburst partner team within Red Hat (there are a few people who represent Starburst) and ask them for a license. Once you get the license, you basically need to create the secret that this deployment refers to. Let's look at the YAML: the secret name is starburstdata. So you create a secret named starburstdata and insert your license into it. If you go to the config folder in the repository and then to the Starburst license file, that is the manifest you need to create in your cluster. All of it is there; you just need to insert the license you receive from our partner Starburst, or from anyone who has a demo license. I believe they also have a trial for people who visit their website, but I'm not too sure. That's a little bit of background. Let's go back; everything looks good over here.
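If you ever need to recreate that license secret after an auto-stop, a minimal sketch with the Kubernetes Python client might look like the following. Only the secret name, starburstdata, comes from the deployment YAML we just looked at; the namespace and the key name inside the secret are assumptions here, so check the manifest in the repo's config folder for the real values.

```python
# Hypothetical sketch only: recreate the Starburst license secret after an
# auto-stop. The namespace ("starburst") and the key name inside the secret
# ("starburstdata.license") are assumptions; verify both against the manifest
# in the repo's config folder before using.
from kubernetes import client, config

config.load_kube_config()  # reuses your current `oc login` context
core = client.CoreV1Api()

license_text = open("starburstdata.license").read()  # license file from Starburst

secret = client.V1Secret(
    metadata=client.V1ObjectMeta(name="starburstdata", namespace="starburst"),
    string_data={"starburstdata.license": license_text},
)
core.create_namespaced_secret(namespace="starburst", body=secret)
print("Secret created; the pending Starburst pods should pick it up on restart.")
```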
Now we'll head over to the RHODS dashboard, or the ODH dashboard, however you call it. It's the same username and password you saw on the services page; authenticate, and good, we're in. Let me try putting the dashboard and the Git repository side by side... surprisingly that doesn't work, and I'm a little technically challenged here, not a video expert, so let's skip that part. The point is that all the steps I'll be doing now are written in the repository as step 1, step 2, step 3, all the way down, so I'll refer back and forth to the repository as I move forward.

Okay, the first step is to create a data science project, "credit card fraud", and create it. The next step is to create a workbench. The image I'm selecting from the notebook images is TensorFlow, small size, with everything else as recommended. There is an option to use a data connection; that's something new, I haven't connected a data connection through a workbench before, so it's something I will explore, but right now I'm leaving it unchecked. I'll create the workbench. So we've gone from step 1 and are now at step 2. We'll wait for the workbench to come up; let's look at the event log. I don't know how long it's going to take, I'd assume a minute or two, but since I don't know, I'll just pause the video and come back when it's deployed.

Okay, our workbench is ready now and we can start working in it. As you can see, instead of Starting it now says Open, so we click on it, and again it asks for the username and password we received earlier (this is also in your email, just in case you lost it), plus an authorization step where we allow the selected permissions. Great, we have our notebook now.

Going back to the steps, the next thing is to clone the repository. We click on the button that copies the link, choose to clone a repository in the notebook, and paste it in; the clone takes a second. Now we have all of that in our notebook, so we don't have to go back and forth with the README that much, because most of the instructions are in here, and all the rest of the README files are in here too.

I'll start by executing the requirements cell. If you are curious what it installs, it installs these packages; they are the ones needed for things we do later. In particular, one of them is for the imbalance algorithm we are going to use, which is called SMOTE; that comes from the imbalanced-learn package. Next we run the imports. TensorFlow is usually very chatty and gives you a lot of, not errors, but warnings and such, so one line here reduces the number of warnings we receive. The next cell just makes the graphs look pretty.

Now let's look into the dataset we have. This is the dataset, and this is where I got it from, one of these links, I think.
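For reference, those setup cells amount to roughly the following. This is a minimal sketch, assuming the package list from the repo's requirements file (the exact pins live there):

```python
# Minimal sketch of the setup cells. The exact package list and pins live in
# the repo's requirements file; imbalanced-learn is the package that provides
# SMOTE, which we use later for balancing.
import os

# Quiet TensorFlow's warning chatter (0 = all logs, 3 = errors only).
# Must be set before TensorFlow is imported.
os.environ["TF_CPP_MIN_LOG_LEVEL"] = "2"

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import tensorflow as tf
from imblearn.over_sampling import SMOTE

sns.set_theme()  # the "make the graphs look pretty" part
print(tf.__version__)
```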
A lot of the work is from there; one of the only differences, I think, is that the way they do the balancing is different from what I'm doing here. I rather followed this other blog, by Sarah; oh, I can't see it, I need to sign in. That blog is older, about three years old, and a lot of the functions and things have changed with newer library versions, but it uses the same dataset we are using. Anyway, if you want to read more about what we are doing in this workshop, you can head over to the blog I had open earlier; it talks about what's going on in this whole use case. I will carry on with only as much explanation of the dataset as is needed for this follow-along.

So: it's a credit card dataset with these columns. It took me a moment to figure out how to scroll; there is more on that side, including one more column called Class. Each row represents a transaction; V1, V2, and so on are the input variables, and the output is the Class column, which is the label. Basically, based on these columns, our model will try to predict the class when, in the future, we provide it a transaction with all these variables.

Now, the original dataset was clean, but we have added some empty values just to mimic having some pre-processing to do. For that we have the Starburst README, and this is where Starburst comes into play. Note that all of the steps at the beginning of that README are optional if you are using the demo platform; we will jump straight ahead to the queries we are going to execute in Starburst. So we head over to the Starburst query editor link, give it a name, click over here, and we should already see the postgres and s3 catalogs.

One thing to note: in version 1 of this demo, all of the data was in S3. Now we have split the dataset, keeping part of it in S3 and part of it in Postgres. The reason we have done it this way is so that we can show you a federated query and then create a materialized view bringing both parts together. Let's see how all of this works. I'm going to close this and this, just to tidy up.

The first thing is to create a schema for S3 so that we are able to read from it. A good thing to note here: if you want to check the dataset in S3, you can either use the AWS CLI or head over to the AWS web console. This is the username and this is the password you can use, and once you sign in you will see that the workshop has already uploaded the dataset. Here's the bucket name, which matches what we saw on the services page, and if you go inside it you'll see a data folder, and the data folder has a features.csv file; that's one part of the original dataset.
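If you'd rather not click through the AWS console, a quick listing from Python does the same check. A small sketch, assuming your keys are the ones shown on the services page and the bucket name is the one the workshop printed:

```python
# Quick check that the workshop really uploaded data/features.csv to the
# bucket. Keys and bucket name are placeholders; the real values are on the
# RHDP services page and in the email.
import boto3

s3 = boto3.client(
    "s3",
    region_name="us-east-2",
    aws_access_key_id="<AWS_ACCESS_KEY_ID>",
    aws_secret_access_key="<AWS_SECRET_ACCESS_KEY>",
)
resp = s3.list_objects_v2(Bucket="<your-bucket>", Prefix="data/")
for obj in resp.get("Contents", []):
    print(obj["Key"], obj["Size"])  # expect something like data/features.csv
```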
To read that file in Starburst, we first create the schema: paste the statement, press Ctrl+Enter, and the schema is created. Next we create the table within that schema, so that we can run a SELECT statement and visualize the data; Ctrl+Enter again. Now the next task is to verify that we can read it. I'm going to close this tab, I don't need it; I will need this other one later, so I'm going to keep it. Paste, Ctrl+Enter, and there it is.

One thing we have also done is add an id field; that's how we have modified the dataset. I know that in machine learning you shouldn't use an ID in the dataset. The only reason we have this id is so that we can run our join command, which you will see later, and once the join is run, you will see that over in the notebook we drop the id column from our data frame. So don't worry, we are not using the id field to train the machine learning model. Moreover, this workshop and use case is not about making the perfect model; it's more about showing what we can do with an interactive engagement for a user, or let's say a power user, who knows a little bit about these commands.

Okay, so we can read from S3. This part of the dataset goes all the way to v28; what we do not have in this part are the time and class columns. Those are in the other part. Actually, we don't have a prepared command for that over here, so let me type one: SELECT * FROM postgres.public.transactions. So postgres is our data source (the catalog), public is the schema, and transactions is the table. You can see that we have the second portion of the dataset, the one with time, amount, and class. We'll see later in this demo that class is the column we rename: zero means not fraud, one means fraud. So we have now seen both sections of the same dataset.

Next comes the most important command of this whole workshop, which demonstrates two things. One, it runs a federated query, which is a query that spans two different sources, in our case S3 and Postgres. Two, it creates a materialized view over those two data sources. You will see that when we then query this materialized view, we are able to see what the full dataset looks like after the join. To query it, we do something like this: features was our original data in S3, and that's what we reference here. Ctrl+Enter. Now you have v1 through v28, and then time, amount, and class. Great, we have successfully executed the federated query and created a materialized view, so I think our task over here is done. Those are all the commands we run in Starburst.

Now we head over to our notebook and continue. We did that part, and the next step is connecting from our notebook to the Starburst host.
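Strung together, the Starburst side of the demo is roughly the SQL below. This is a hedged sketch: catalog, schema, table, and view names are placeholders for whatever the repo's Starburst README actually uses, and I wrap the SQL in the trino Python client so it's runnable outside the editor too (you can just as well paste each statement into the query editor):

```python
# Hedged sketch of the Starburst steps. All object names are placeholders;
# the real statements are in the repo's Starburst README.
import trino

conn = trino.dbapi.connect(
    host="<starburst-host>",  # the query editor / coordinator host
    port=443,
    user="admin",
    http_scheme="https",
    # add auth=trino.auth.BasicAuthentication(user, password) if required
)
cur = conn.cursor()

# The federated query: join the S3 half (features) with the Postgres half
# (transactions) on the synthetic id column, and materialize the result.
cur.execute("""
CREATE MATERIALIZED VIEW s3.fraud.credit_card_full AS
SELECT f.*, t.time, t.amount, t.class
FROM s3.fraud.features AS f
JOIN postgres.public.transactions AS t
  ON f.id = t.id
""")

# Query the materialized view to see the joined dataset.
cur.execute("SELECT * FROM s3.fraud.credit_card_full LIMIT 5")
for row in cur.fetchall():
    print(row)
```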
We read the result into a raw data frame using pandas read_sql, and we then perform a drop, because we don't want the id column when we are training our model. That's done. The next thing is that we change some of the data types, because for whatever reason pandas was not able to infer them. Actually, I remember why: when we read the data in, some columns got imported with the object dtype, which was giving us a hard time in this command and the subsequent ones, so we have to convert them. Let's go ahead and run this cell, and then run the describe command. We can see the count, the mean, and so on; this is the description of the whole dataset. And if I go back... I just don't know why I'm not able to scroll horizontally over here; it is supposed to scroll, I know there's more to the side, but sometimes Chrome acts up. Anyway, the reason I wanted to show you this is that what you see here matches what we saw in Starburst.

Next, we rename the class column to is_fraud, like the little sneak peek I gave you earlier: if it is zero it's not fraud, if it is one it's fraud. After renaming, we calculate the percentage of fraud in this dataset, and it comes out to 0.17 percent of the whole dataset. This is why this dataset is called an imbalanced dataset, and some of the next tasks we do are about balancing the number of fraud rows against the number of non-fraud rows. Otherwise our model would end up biased: it would not have enough examples of fraud to learn from, and it wouldn't be able to predict nicely. There are different ways to do this; I have used the SMOTE technique.

Now let's look for missing data. We don't have any, and what I'm calling missing data is rows where a few columns have empty values. The reason the count is zero here is that we earlier performed these steps, specifically this command, which removes all the rows with empty values, because for this model training I consider those dirty data. That's basically what I'm calling pre-processing. Next we check a correlation plot, which shows us how the different columns are connected to each other. I'm no data scientist, so I understand only a little, but data scientists who are exposed to data science and machine learning will probably read more out of it.

In the next part we define the x and y variables, and I'm trying to use the terms feature set and target correctly here: y is what we are trying to predict, and x is our feature set, the values which we know and from which we will try to predict y. Next we do some standardization, split into x_train, x_test, y_train, y_test, and this is where we apply the SMOTE technique. This plot shows us that we now have an equal number of fraud and non-fraud rows in our data after applying SMOTE; earlier, fraud was only 0.17 percent.
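End to end, those notebook cells look roughly like this. A sketch, assuming hypothetical connection parameters and split settings; the authoritative version is the notebook in the repo:

```python
# Sketch of the pre-processing cells. Connection details and split parameters
# are placeholders; the real cells live in the repo's notebook.
import pandas as pd
import trino
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from imblearn.over_sampling import SMOTE

conn = trino.dbapi.connect(host="<starburst-host>", port=443, user="admin",
                           http_scheme="https")
raw_df = pd.read_sql("SELECT * FROM s3.fraud.credit_card_full", conn)

df = raw_df.drop(columns=["id"])               # id was only there for the join
df = df.apply(pd.to_numeric, errors="coerce")  # object columns -> numbers
df = df.dropna()                               # remove the dirtied (empty) rows
df = df.rename(columns={"class": "is_fraud"})

print(round(df["is_fraud"].mean() * 100, 2), "% fraud")  # ~0.17% before balancing

X = df.drop(columns=["is_fraud"])  # feature set
y = df["is_fraud"]                 # target we try to predict

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# Balance only the training set: SMOTE synthesizes minority-class samples.
X_train, y_train = SMOTE(random_state=42).fit_resample(X_train, y_train)
print(pd.Series(y_train).value_counts())  # now 50/50 fraud vs non-fraud
```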
Right. Next we use a deep neural network, a DNN, and these are the layers. I have taken them from the blog I showed earlier; they did some R&D, and this number of layers and dropouts gives a good result, which is why there are this many layers. So let's run that, and this is our model summary. The next part is the metrics and the training, the compile and fit. About the epochs: the higher the number, the more accurate the model, but 100 epochs would take a long time to train with the amount of resources this cluster has, so I will put it down to five; there will be only five epochs. Training at 100 epochs has taken 30 or 40 minutes for me sometimes; five will take two to three minutes, so I'll pause the video here and come back when it's done.

Okay, wonderful. Now that the five epochs are done, we can see the summary over here: the test loss is about 1.2 percent, the test accuracy is 26 percent, precision is 69 percent, and recall is 85 percent, which is pretty much expected. As you let it continue towards 100 epochs, you will see the precision increase, the recall decrease, and the accuracy improve. For the sake of time we are doing less, but I want to remind you that this use case is not about making the perfect model; it's more about demoing the features.

Now that we have trained our model, let's save it. We use the TensorFlow framework, and we save the model into a folder called tensorflow/pb_models. After saving it, we are going to use Model Optimizer, a binary provided by OpenVINO, to convert it into the OpenVINO IR format. The conversion removes some dead neurons from the model, so it's a performance optimization, and it will also help us later when we are serving the model, because the OpenVINO IR format is what is supported by the serving runtime. I think they now support other formats for serving in RHODS too, or the work is going on, but we will convert to OpenVINO IR here. Let's go ahead and do that.

Okay, the TensorFlow assets were written out; you can see that two folders were created, and this one has the .pb model. Then we used mo to convert that model to OpenVINO IR, and these are the three files that were created. The next step exists because, when we serve the model, we will create a data connection here, fill in all the details to connect to S3, and our model will reside in S3; we will serve it by reading it from S3 (I do think it creates a local copy for serving, but that is the flow). Again, all of this information about what we are doing is in the blog, so you know what's going on.

Okay, my Chrome froze for a second there. This particular guide is optional, and the reason it's optional is that we will do all of this uploading using a script. However, it will need to be done by hand by people who are not using demo.redhat.com, because for them I do not know the bucket name, the secret key, and so on in advance.
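Stepping back to the model cells for a moment: the network is a plain Keras stack. The layer widths and dropout rates below are illustrative placeholders (the tuned values come from the referenced blog and live in the notebook), and the block continues from the variables in the pre-processing sketch above:

```python
# Illustrative DNN sketch; layer sizes and dropout rates are placeholders for
# the tuned values in the repo's notebook. Continues from X_train/y_train above.
import tensorflow as tf
from tensorflow.keras import layers, models

model = models.Sequential([
    layers.Input(shape=(X_train.shape[1],)),  # feature count after dropping id
    layers.Dense(64, activation="relu"),
    layers.Dropout(0.3),
    layers.Dense(32, activation="relu"),
    layers.Dropout(0.3),
    layers.Dense(1, activation="sigmoid"),    # probability of is_fraud
])
model.summary()

model.compile(
    optimizer="adam",
    loss="binary_crossentropy",
    metrics=["accuracy",
             tf.keras.metrics.Precision(name="precision"),
             tf.keras.metrics.Recall(name="recall")],
)

# 100 epochs trains better but takes far longer; 5 keeps the demo short.
model.fit(X_train, y_train, epochs=5, batch_size=2048,
          validation_data=(X_test, y_test))
model.evaluate(X_test, y_test)
```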
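Saving and converting then looks roughly like this. The folder names mirror what we see in the video; the mo flags are from the OpenVINO Model Optimizer CLI, so verify them against the openvino-dev version you have installed:

```python
# Save the Keras model as a TensorFlow SavedModel, then convert it to
# OpenVINO IR with Model Optimizer (the `mo` binary from openvino-dev).
# Continues from `model` above; folder names mirror the video.
import subprocess

saved_model_dir = "tensorflow/pb_models/1"
model.save(saved_model_dir)  # writes saved_model.pb plus assets/ and variables/

subprocess.run(
    ["mo",
     "--saved_model_dir", saved_model_dir,
     "--output_dir", "openvino_ir/1"],  # produces the .xml/.bin/.mapping trio
    check=True,
)
```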
Those users can do it manually, or they can use the script after putting in that information: the bucket name, the secret key, and so on. For us, what applies is this: I'm going to copy the access key and paste it here, copy the secret and paste it here, and then I need the bucket name, which is this value; copy it, come back to the notebook tab, and paste it here. Now, when we run this cell, what will happen is that in our S3 bucket it will create a folder path models/default/1, and it's going to upload these three files into it. So let's go ahead and run that. It's done; let's check that it worked. When we come over here and refresh the bucket view, we have a models folder now, it has default, it has version 1, and those are the three files we had. So the model is now up in the S3 bucket.

Next, we have to follow some steps in this guide to create the data connection, then configure a model server, and then deploy a model. So let's do those things. Let's add a data connection and give it a name, something like "cc". Again, copy the access key and the secret over. The endpoint: do we have an endpoint? Yes, we use us-east-2, so it's https://s3.us-east-2.amazonaws.com. This is the new format for the endpoints; back in the day, the legacy format was just s3.amazonaws.com, and in the new one you punch your region into the URL itself. The region field: I don't know if it matters, but I'll just put it in; it's not marked as required. Fill in the bucket; that's where I put this value. Hopefully this works. I'm not connecting it to any workbench, because I don't need to. So the object storage connection is now there.

Now let's add a model server. Let's name it, and for the runtime we will use OpenVINO Model Server. Oh, things might have changed a little bit here; before, there was nothing called a runtime, so this screenshot, or this GIF, is a little bit old. Click through, make sure you make it externally routable by ticking that checkbox, and with that we have configured our model server.

The last part is to deploy a model on our server. Let's name it, and for the model framework... oh, look, they are now offering TensorFlow 2, which is API version 2, I think, as one of the model frameworks. That's good: the conversion we did earlier is not strictly needed anymore, and you can directly upload just the .pb file. But it's still good to do the conversion; I don't know a lot, but what I've heard is that it's a performance boost, and the OpenVINO IR format is well supported. Since we have already done it, we'll use it: choose the existing data connection, and for the folder path, this is where we say models; let's actually copy that from the bucket and paste it. Looks good; let's click on Deploy. If everything goes well, up to this point, we should see this model being deployed. It's just a waiting game again; I will pause the video, and once it's deployed I'll come back and continue from there.

Okay, so I have an error over here. It says: failed to pull model from storage due to error, unable to list objects in bucket fraud-detection, missing region. Now that I see that error, I will actually go back and edit the data connection, put in the region (I'm guessing that's the problem), and update the data connection.
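The upload script the workshop provides boils down to something like this. A sketch with placeholder credentials; the IR file names assume mo's default output naming, so adjust them to whatever your conversion actually produced:

```python
# Sketch of the upload step: push the three OpenVINO IR files into the bucket
# under models/default/1, where the model server will look for them. Keys,
# bucket, and file names are placeholders.
import boto3

s3 = boto3.client(
    "s3",
    region_name="us-east-2",
    aws_access_key_id="<AWS_ACCESS_KEY_ID>",
    aws_secret_access_key="<AWS_SECRET_ACCESS_KEY>",
)

for filename in ["saved_model.xml", "saved_model.bin", "saved_model.mapping"]:
    s3.upload_file(f"openvino_ir/1/{filename}",
                   "<your-bucket>",
                   f"models/default/1/{filename}")
    print("uploaded", filename)
```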
After updating that, I'm going to go back, delete this deployment, and try to redeploy the model. I think I have to delete the whole model server to do that, so I'm probably going to delete the whole thing, the model server and the model. Actually, that probably makes more sense, because the model server is what is attached on the OpenShift side of things. Just to check whether things are working, the way to do that is this: for every data science project (in our case credit-card-fraud, the name we chose), there is a namespace with that name, and in that namespace there are workloads; for my model server, I will have this deployment. Apparently it is running, it just started right now. Let's check some logs. Okay, look at that: the status is "loaded". However, we are missing the inference endpoint, and I don't know why. Let's refresh the page; maybe it will appear. Yeah, there it is. So our model is deployed now, and we have also got the inference endpoint. We'll just click on Copy, close the tabs we don't need anymore, and come over here.

So we followed the guide on how to do the model serving and deployment. We'll copy and paste our inference link here, and I have a sample request with all the data filled in; let's see what it predicts when we hit the server. It predicts that it's not a fraud. That's what our model predicts, but I think if we go and train the model with more epochs, probably close to 80 or 90, there is more accuracy, and this particular data point, if I remember correctly, should actually be predicted as one, that is, a fraud. So that's the whole workshop (not workload, workshop) and the use case, and I hope you enjoyed it.

Here is a bit of extra stuff; we'll do some fun things. We get an accuracy score, and it says the accuracy is 99 percent, which is not the full story, but if we look at the F1 score, we see that it's 76 percent, which means roughly that out of 10 times you will be right seven times and wrong three times, I guess. And the last bit is a confusion matrix. People who work in data science know all of this. What I understand from this confusion matrix is: these 12 are the problematic ones. In reality, 12 of the transactions were fraud, but they were predicted as non-fraud, and that is problematic. The number of true frauds that were predicted as fraud was 69. The number of non-frauds predicted as non-fraud is this big one; that's fine. The number of non-frauds predicted as fraud is 30, and that is also fine: a non-fraud predicted as fraud is a false positive kind of thing, and as a bank or a financial institution, or as an individual, you are not going to lose money on that. What you need to watch out for are those false negatives. But these are all bonus items. This is our model; it's not perfect, and like I mentioned earlier, this is not about the model, it's more about the whole usability and looking at the features.

Anyway, thank you so much everyone for being with me here and doing this follow-along with me. Hopefully you made it to the end. Thank you, and take care.
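For reference, the sample inference request above amounts to roughly the following. This is a sketch that assumes a KServe v2-style REST path, which is what the OpenVINO Model Server runtime typically exposes; the input tensor name, the shape, and the URL layout are assumptions, so check your deployed model's metadata first:

```python
# Reference sketch of the inference call. The /v2/... path and the tensor name
# are assumptions; query your endpoint's model metadata to confirm them.
import requests

endpoint = "<inference-endpoint>"   # copied from the RHODS dashboard
payload = {
    "inputs": [{
        "name": "inputs",           # check your model's actual input name
        "shape": [1, 30],           # one transaction, 30 scaled feature values
        "datatype": "FP32",
        "data": [0.0] * 30,         # replace with a real (scaled) transaction
    }]
}
resp = requests.post(f"{endpoint}/v2/models/<model-name>/infer", json=payload)
resp.raise_for_status()
prob = resp.json()["outputs"][0]["data"][0]
print("fraud" if prob > 0.5 else "not fraud", prob)
```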
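And the bonus metrics cells are essentially scikit-learn one-liners, sketched here against the model and test split from the earlier sketches:

```python
# Bonus metrics sketch: accuracy looks great on imbalanced data, so the F1
# score and the confusion matrix tell the more honest story.
from sklearn.metrics import accuracy_score, confusion_matrix, f1_score

y_pred = (model.predict(X_test) > 0.5).astype(int).ravel()

print("accuracy:", accuracy_score(y_test, y_pred))  # high, but misleading
print("f1:", f1_score(y_test, y_pred))              # the more honest number
# Rows are true labels, columns are predicted labels; the costly cell is
# real frauds predicted as non-fraud (the false negatives).
print(confusion_matrix(y_test, y_pred))
```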