and welcome to the fraud detection use case demo using Open Data Hub and OpenShift Container Platform. In this use case, we wanted to build an end-to-end AI/ML platform that can detect fraud in credit card transactions. We downloaded the data from Kaggle; the dataset includes time, amount, and 28 anonymized features that protect consumer data. We created an ML model that can predict fraud, we served it, and we collected model metrics from it. We then provided model monitoring tools so DevOps can watch those metrics, and development tools for data engineers and data scientists.

In this high-level architecture, on the left side we have the data scientists using JupyterHub as their development environment. This is where they do their initial research on model development: figuring out what the best model is, how to train it, and how to test it. Within JupyterHub, each user gets their own Spark cluster with its own Spark master and workers. After the data scientist creates the model, tests it, and finally decides what the model should look like, its type and its parameters, they fully train the model and save it to the Ceph object store. Once the final model is stored in Ceph, we serve it using Seldon. Seldon exposes many metrics that are displayed through Prometheus and Grafana as dashboards for DevOps to monitor.

So at a high level: our credit card transaction data is stored in Ceph; data exploration and model development are performed in Jupyter; we used a random forest classifier and saved the trained model; the Seldon deployment pulls the model artifact from Rook-Ceph and serves it; and to simulate the credit card transactions we used Kafka.
We had a Kafka producer and a Kafka consumer: the producer generates data, reads data out of Ceph, and the predictions come out of the model every one to five seconds. Next we will show a couple of Grafana dashboards we created for monitoring, first for the Seldon model. You can see here that the red spikes are the fraud detections. Another example is the Seldon Core metrics dashboard, which we will show in the demo in the coming couple of minutes. And the next one is the Kafka Grafana dashboard with the Kafka data.

Now that we have seen example Grafana dashboards, let's move on to the main OpenShift console, where we see all the pods. These are all the different Kafka pods that we have. Going down further, you can start seeing the Grafana pods and JupyterHub. There are two Jupyter notebook pods, one for each user; we have two users currently. Then we can see the Kafka producer, the consumer, and the Seldon pod that's serving the model. We can also see the Prometheus pods, the Open Data Hub operator, the Seldon Core API, and the cluster manager pods. There are also Spark master and worker pods for each user; as I mentioned before, we have two users, and we see both of them here. So these are essentially all the pods running in our namespace in OpenShift.

Now let's move on to one of the notebooks that our data scientists worked on. First we install some needed PySpark libraries. Then we upload a sample of the credit card data to our Ceph storage. We ran that cell, and you can see that the HTTP status code is 200 and that our dataset was saved under the key uploaded/creditcard.csv in the bucket called Open.
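The upload step described above can be sketched roughly as follows. This is a hedged illustration, not the actual notebook code: the `upload_csv` helper is hypothetical, and the bucket and key names are taken from the demo. In the real notebook the client would come from boto3 pointed at the Ceph RGW S3 endpoint.

```python
# Sketch of the notebook's upload step: pushing the credit card CSV sample
# to Ceph through its S3-compatible API. In the real notebook the client
# would be built with boto3, for example:
#   import boto3
#   s3 = boto3.client("s3", endpoint_url="http://<ceph-rgw-endpoint>",
#                     aws_access_key_id=..., aws_secret_access_key=...)

def upload_csv(s3_client, bucket, key, body):
    """Upload bytes to an S3-compatible store; return the HTTP status code."""
    resp = s3_client.put_object(Bucket=bucket, Key=key, Body=body)
    return resp["ResponseMetadata"]["HTTPStatusCode"]

# status = upload_csv(s3, "Open", "uploaded/creditcard.csv", csv_bytes)
# A status of 200 matches what the notebook prints after the upload cell.
```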
So after we have the data itself, we use Spark to read it into a DataFrame. As you can see here, to get a Spark session we use the internal SPARK_CLUSTER environment variable, which gives us a link to our own private Spark cluster that's running for us. We create a DataFrame from Spark and use that DataFrame later on for model creation or whatever else we need. So let's run this. First we get a Spark session; again, this session is to the Spark cluster specific to the user who has logged in to this JupyterHub. After we get the Spark session, we read the data from the Ceph S3 interface and place it into a DataFrame, which will later be used for data manipulation, wrangling, or creating models. You can see here that the total number of credit card transactions is only 10,000, as I mentioned, and only 38 are fraud; it's a skewed dataset, which is okay for this example.

Next we create the random forest classifier. We take only 75% of the credit card transactions for training, drop Time and Class from the features, and keep Class as the label. So we create the random forest classifier, do the model fit, and run the code. At this point we have around 28 features; the number of training transactions is 7,500 while the number of test transactions is only 2,500.
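The train/test preparation above can be sketched in plain Python. This is a hedged stand-in for the notebook's Spark DataFrame operations: the `split_features` helper is hypothetical, and the column names follow the standard Kaggle credit card dataset (Time, V1..V28, Amount, Class), which matches the V10/V17 features mentioned later in the demo.

```python
# Sketch of the demo's data preparation: drop Time and Class from the
# feature set, keep Class as the label, and take the first 75% of rows for
# training. The real notebook does this on a Spark DataFrame.

def split_features(rows, train_frac=0.75):
    """rows: list of dicts keyed by column name. Returns train/test splits."""
    feature_cols = [c for c in rows[0] if c not in ("Time", "Class")]
    features = [[r[c] for c in feature_cols] for r in rows]
    labels = [r["Class"] for r in rows]
    cut = int(len(rows) * train_frac)
    train = (features[:cut], labels[:cut])
    test = (features[cut:], labels[cut:])
    return train, test, feature_cols

# With 10,000 transactions this yields 7,500 training rows and 2,500 test
# rows, matching the counts shown in the demo. The feature count is 29
# (V1..V28 plus Amount), i.e. the "around 28 features" mentioned above.
```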
Alright, so next, as another way of exploring what the model is doing, we can draw confusion matrices. A confusion matrix shows the predicted result versus the actual result, which is just a way to show how many credit card transactions were correctly predicted and how many were not. So we plot it, with predicted on the X label and true on the Y label, run it, and explore these graphs. Again, this is just to show the capability that you can do this on the DataFrame and analyze your model.

Now, we figured out we have too many features in the model, so we want to keep only the important ones. The next thing we do is plot the feature importances, and we can see from the graph that the important features are the top seven we see here; after that it tapers off to the less important features. So we take the top seven features and recreate and retrain the model: we drop all the features we don't need, create a random forest classifier again with the same (or maybe slightly adjusted) parameters, and do the model training. After we're done, we create the model.pickle file; this model file is going to be uploaded to Ceph again. So let's run this with the new important features and create the model again; after that we will save the model and test it a little bit later. You can see that the number of features is now eight, that's seven plus one for the amount, and the file is called model.pickle. We could do the confusion matrix again, but we're going to skip it this time. Instead we're going to do something interesting here: test the new model. The way we test it is we send it either fraud or not-fraud and see what prediction comes out.
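The feature-reduction and save steps above can be sketched as follows. This is a hedged illustration: `select_top_features` and `save_model` are hypothetical helpers, and the importance scores would really come from the fitted random forest (e.g. `featureImportances` in Spark ML), not be hand-built as here.

```python
import pickle

# Sketch of the demo's feature reduction: rank features by importance, keep
# the top seven plus Amount (eight features total, as in the demo), then
# pickle the retrained model to model.pickle for upload to Ceph.

def select_top_features(importances, k=7):
    """importances: {feature_name: score}. Returns top-k names plus Amount."""
    ranked = sorted(importances, key=importances.get, reverse=True)
    top = ranked[:k]
    if "Amount" not in top:
        top.append("Amount")
    return top

def save_model(model, path="model.pickle"):
    """Serialize the trained model so it can be uploaded to Ceph."""
    with open(path, "wb") as f:
        pickle.dump(model, f)
```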
So we're going to do not-fraud first, and we see that it returns a prediction of zero, meaning it's not fraud. We're going to test it again with fraud, and it should return one from this test. So here we go: we're just changing the data frame. We created two data frames, one fraud and one not fraud, filtered on the value of the Class column. Now we're sending fraud, so it should send us back one for the prediction. We're happy with this model for now; of course, in the real world we'd have to do more testing and evaluate the model better. But for this little demo, we just upload the resulting model.pickle into Ceph again, into the bucket called model and under the key upload/model. So we have the model ready, we get back HTTP code 200, and we can see where it's uploaded and what bucket it's uploaded into.

Now we want to serve that model. We log in to our OpenShift cluster, into the right namespace that we want to serve this model from, and we run another pod based on Seldon. That pod grabs the model from the Ceph interface and runs it in that namespace in OpenShift. If you do oc get seldondeployments, you can see we have MyModel, which we just deployed, and we also have the fully trained model from the 200k-row dataset, which we had served before. We'll do the same testing on the served model: we'll send it fraud and not-fraud first and see what predictions it gives us. And for the next part of the demo, coming up after this, we're going to show you how the simulation with the fully trained model appears on the Grafana dashboards.
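Testing the served model can be sketched in terms of Seldon Core's REST prediction format. This is a hedged illustration: the `{"data": {"names", "ndarray"}}` payload shape is Seldon's standard prediction format, but the helper functions, endpoint path, and feature values here are assumptions, not taken from the demo.

```python
import json

# Sketch of testing the Seldon-served model over REST: build a prediction
# request payload and pull the prediction out of the response. The actual
# endpoint URL (e.g. http://<seldon-endpoint>/api/v0.1/predictions) depends
# on the Seldon Core version and deployment.

def build_request(names, row):
    """Seldon prediction payload for a single transaction row."""
    return json.dumps({"data": {"names": names, "ndarray": [row]}})

def extract_prediction(response_text):
    """Pull the first prediction row out of a Seldon response."""
    body = json.loads(response_text)
    return body["data"]["ndarray"][0]

# POST build_request(feature_names, feature_values) to the endpoint, then
# read the result back with extract_prediction; in the demo, 1 means fraud
# and 0 means not fraud.
```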
So we're sending it fraud here, and you can see it's sending us back a prediction of one, meaning it's fraud. We'll switch it back to not-fraud, and then it should be giving us zeros. Again, as I said, we just had two data frames, one fraud and one not fraud. We run it again with fraud and you can see one; we run it again with not-fraud and it gives us zeros again.

All right, now we're going to hop over to the monitoring tools, the Grafana dashboards. The first dashboard you'll see is the model prediction dashboard, which shows all the parameters we're getting from Seldon. The first panel plots the probability of fraud, which is between zero and one, versus the amount; we don't see much of a pattern there, maybe we need more data to see one. But for probability of fraud versus V17, you can see that every dip basically marks a fraud detection. Same for probability of fraud versus V10: the dips indicate fraud detections. Let's move on to the Seldon Core metrics. These are the metrics Seldon provides by default: the success rate, which is between zero and one, the 4xx and 5xx HTTP responses, and the requests per second to the model. We can see an occasional spike here, maybe when we're hitting it on a shorter time frame.
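The two test data frames described above can be sketched in plain Python. This is a hedged stand-in: the real demo filters Spark data frames on the Class column, and `split_by_class` is a hypothetical helper, not the notebook's code.

```python
# Sketch of the fraud / not-fraud test sets: filter the transactions on the
# Class column, then send a row from each set to the served model.

def split_by_class(rows):
    """rows: list of dicts with a 'Class' key (1 = fraud, 0 = not fraud)."""
    fraud = [r for r in rows if r["Class"] == 1]
    not_fraud = [r for r in rows if r["Class"] == 0]
    return fraud, not_fraud

# A row from `fraud` should come back with prediction 1 and a row from
# `not_fraud` with prediction 0, as seen in the demo.
```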
We can also see the Kafka parameters: the topics, the inbound message rate, that is, how many messages are going through, and the Java memory utilization. Then let's move on to the second dashboard, which is the normal cluster monitoring dashboard. Here we can see cluster memory usage, cluster CPU usage, and file system usage; those are just the normal cluster parameters you see in every OpenShift cluster. Then we have the pod CPU usage panel, which lists the top pods with the most CPU usage, which is pretty interesting, and the pod memory usage, showing Spark using most of the memory from a pod perspective. All interesting parameters that a normal DevOps operator would be looking at to maintain full functionality of an OpenShift cluster. And that's it for our demo today. Thank you.