to believe what the screen tells me. So welcome, everybody, to this episode of DevNation. Today we're going to talk about credit card fraud detection and AI/ML. My name is Erwin Granger. I'm an architect, and I've been at Red Hat since 2021, so nearly a year by now. I'll let Audrey introduce herself as well. Hey everybody, my name is Audrey Resnick. I'm a senior principal software engineer acting in the capacity of a data scientist, and I've been with Red Hat since 2020. Before that I was in the energy industry for a number of years as a data scientist and technical advisor, and before that in the dot-com industry, also for a number of years, as a full stack developer. Okay, thank you, Audrey. So that's us, but we're also curious about you. Are you a developer? A data scientist? An architect, or some other profile? Why don't we check that the chat feature works and does what it's supposed to do? I can see some people have been typing already, so that's great. If you don't mind, quickly type in your title or position so we have an idea of the kinds of profiles we're talking to. Take your time, do it whenever; this is also just to make sure the chat functions. All right. While you do that: over the next hour, well, 58 minutes now, we will use a credit card fraud detection example to illustrate AI/ML processes and how Red Hat OpenShift Data Science fits into them. To be clear, we don't want to teach you how to commit fraud, and we don't want to turn you into a fraud detection expert. That's not the goal of the exercise. This is just an example to illustrate what it takes to do data science and what the phases and steps are. With that out of the way, let's answer the obvious question: why does fraud detection need data science? I like to give as simple a definition as I can, so if you're a data scientist, or you work in fraud, you will probably find it simplistic. Essentially, I see two points. The first is that fraud is costly. If you look at the numbers, it's costing companies billions of dollars, and it's not expected to get any better. So this is not something you can ignore; you have to do something about fraud, you cannot just accept it as a fact of life. That's the first reason. But why data science? If you're a developer, you might think it should be easy: I just write "if incoming transaction type equals fraud, then reject, else accept." That would be the traditional coder/developer approach. Obviously, that is not something you can do; if it were, we wouldn't need data science for these kinds of problems. When a new transaction comes in, you don't know in advance whether it's fraud or not, so we can't hard-code it like that. That's where we call in the data scientist; that's where people like Audrey come in and help us. So instead of that if statement, which is really science fiction, we use existing data, maybe the past fraudulent activity we have a record of, and the data scientist builds and trains a model. The role of that model is to make a prediction. So we go from an if statement, which gives a hard true-or-false answer, to a prediction: the model tries to get as close to the truth as it can, but it might not be 100%.
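To make that contrast concrete, here's a minimal sketch in Python; the field names, the labels, and the model object are illustrative, not the workshop's actual code:

```python
def rule_based_check(transaction: dict) -> str:
    # The "science fiction" version: this only works if we already know,
    # up front, which transactions are fraudulent.
    return "reject" if transaction.get("type") == "fraud" else "accept"


def model_based_check(features, model) -> str:
    # The data science version: a model trained on past transactions
    # returns a prediction, a best guess rather than a guarantee.
    predicted_label = model.predict([features])[0]  # scikit-learn-style API
    return "reject" if predicted_label == "fraud" else "accept"
```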
That model and that prediction now essentially replace our if statement. If we are coding a web application, we reach out to the model and ask: given these parameters, do you think this is fraud, or do you think it's a legitimate transaction? That's what the model does for us. However, the pattern might change. If the people committing fraud realize their approach isn't working anymore, they might change the way they do it. So what we did, which worked initially, this training and all of that, may not keep working in the long term. If the pattern does change, you need to be able to detect that the model is not as good as it used to be, through no fault of its own, and the cycle begins again: we collect new data based on the new patterns of fraud, and we potentially retrain the model so it matches the new pattern better. I did say I wasn't going to make you a fraud expert, and I think my very simple explanation of data science here proves my ambitions in that regard. So now that we've answered why fraud detection and why data science, a few quick words about Red Hat OpenShift Data Science, in case you haven't heard of it. It is an add-on, which you can obtain as a managed cloud service on top of your managed OpenShift. You can read the complete description here, but essentially it's an environment where your data scientists can work and have the resources they need, to put it in simple terms. We have this diagram, which I won't spend too much time on, but essentially, if I can organize my mouse: this is powered by AWS, on OpenShift Dedicated or Red Hat OpenShift Service on AWS. We have tools like TensorFlow, Jupyter, and PyTorch, which you might be familiar with or have heard of before. We have a lot of things related to OpenShift itself as well. And beyond Red Hat software, there is also third-party software: you can see things like Seldon and OpenVINO, and Anaconda might be familiar as well. We want this to be a full environment that has all the tools anybody in the data science industry might need. Here at the top we have the traditional data science lifecycle: gather and prepare data, develop the model, integrate it, monitor it, and then retrain. That's essentially what I said earlier. And then we have the personas involved along the way: the data engineer, the data scientist, the app developer, and IT operations. Now, I am not a data scientist; I play one on TV, or on DevNation TV, I guess. Audrey is. So my role today is just to get us started, and Audrey will take over when we get to the real data science. All right. Instead of talking too much more about Red Hat OpenShift Data Science, let's jump in and show you what this looks like. You can follow along today if you want to, but you don't have to; you can sit back and relax if you prefer, that's okay too. This session is being recorded and will be uploaded later, so if you want to just watch and do this afterwards, that's fine. I'm going to paste that first URL here in the chat for you. There we go. All right.
So I've pasted that first URL in the chat. You don't have to type it; you can just go there directly, and I'll explain a little bit what we're going to do as an activity. If you want, you can sign up for a sandbox account. It will grant you access to one of our environments in which you can follow along with the presentation. The signup process takes a few minutes: you need a Red Hat account, which, if you've attended DevNation events before, you probably already have, and then you access the sandbox. So I'm going to go through that, clone the project we're going to be using, and give a very basic orientation around Jupyter. If anybody here has never seen Jupyter, I'll show the ropes of how it works. Then Audrey will take over and talk about how we start building and training this model, how we test it, and how we deploy it. That's the plan, so let's get started. Okay, I've zoomed in my browser as much as I can; hopefully this is legible. This is the project we're going to be using. Oh, in case you're wondering, RHODS stands for Red Hat OpenShift Data Science; that's what we're doing here. The first steps here walk you through the basic setup. Since we only have, well, 50 minutes now, I'm not going to read all of the content on screen; I'll just point out a few things. If you don't have your own OpenShift environment with RHODS installed, I suggest you use the sandbox environment. When you do, it's going to look somewhat like this; let me zoom that in. "Try OpenShift Data Science in the sandbox," that big button here. If you click on it, it gets you here. In my case, I've already signed up; if you've never signed up before, it will prompt you to follow these steps. I believe if I just click start, I end up directly in the interface. So this is what the beginning looks like; you can go through it more slowly at your own pace later. For now, I'll follow the instructions that are here, which tell me to use the JupyterHub interface. This is what we call the Red Hat OpenShift Data Science dashboard. In the sandbox, that is all you will see; in a normal environment you would see much more, because you might have enabled other software like IBM Watson or Anaconda. Here, the only thing enabled in the sandbox currently is JupyterHub. So I'm going to do what I'm told and launch the application, and I'm met with this screen. I'm going to pretend I don't know what to choose here and scroll back to the instructions. What do I need to do? Ensure that Standard Data Science is selected, and make sure the container size is set to Small. Okay, so I need Standard Data Science and Small. Let me get back to that screen. Standard Data Science, that's fine. Small, well, Small was the default, so not much choice here. Once again, that's because this is the sandbox environment; in a real, full-blown OpenShift cluster you would have small, medium, large, extra large. If you need 200 gigs of memory, you can have 200 gigs of memory; the sky's, well, the cloud's, the limit, technically. It's restricted here just because it's the sandbox. So I've selected this, I've selected that, and I think I can start the server. Do I need any environment variables? Let me see. Nope, nothing about environment variables. Okay, so that should be good to go. Let me get rid of that.
And let me start my server. At this point, what's happening behind the scenes is that a notebook server is starting up. Technically, if you know a little bit about Kubernetes and OpenShift, it's really a pod starting up with a container inside of it. If you look at the event log, you can see it being assigned to a node; it's going a little too fast for me to show you much detail. Then my screen refreshes and I'm in the JupyterHub interface, or JupyterLab in this case. Okay. So this is the starting point. There is currently nothing in this environment, and what we're doing here is fast-forwarding through what might take a data scientist hours or days or sometimes even weeks. We don't want to start from scratch, so we're going to follow the instructions and clone the project. It tells us to use the Git icon, click "Clone a Repository," and in the window that pops up, paste the URL we copied. Not very difficult. Let's just go here and do that: clone a repository, paste the URL, and clone. Now I have a new folder that's been created, which is what was expected, so that's good. Let's see what the instructions tell us to do next. Yeah, I saw that, that's fine. At this point you should double-click on the folder and then double-click on "00 getting started." Okay. And you can see I'm at the bottom of the page; that's it for the instructions. The rest of the instructions will come from the notebooks themselves. So let me get back there and double-click on this. Yep, it did say there was going to be a getting-started notebook, and I can see it here. The good news is, if you're familiar with Jupyter notebooks, you don't need to look at this one, because you probably already know all of this. Just in case you're not, I'll go through some of the basics; I don't want to assume everybody is familiar with it, and even if you are, maybe you'll pick up a couple of things along the way. A Jupyter notebook is essentially a way to have, in the same file, your documentation, your code, and the output. This is a markdown cell, so if I press run, nothing happens. But this is a code cell, this is Python, and so if I press run, something will happen and the output will be displayed. The first time I saw this, I tried to guess what was going to be displayed; I was not successful, so I'm just going to run it. And yes, "this show is fun," as you can see. I don't know how many of you would have guessed this, but indeed. So you can see this cell was run, then I had this output, and this next cell hasn't run yet. I can also run that one, and things happen. If you want to run everything in one shot, you can click on this button to restart the kernel and rerun the whole notebook. This is something I recommend doing, just to make sure everything gets run from top to bottom in the right sequence, so you don't have any surprises. And now you can see all of it getting run, and I'm seeing the output. I've been using my mouse to do all of this, but if I were developing or coding, I could simply use the keyboard to make modifications, and then Ctrl+Enter runs a cell without leaving it. So you can see, Ctrl+Enter, I keep running that cell.
You can scroll back up, and you can also use Shift+Enter, which runs the cell and moves you to the next one. All of this I'm doing without the mouse, just the keyboard. Or vi, yes, if you prefer vi, that's definitely allowed too. Actually, that comment about vi in the chat is a good one: if you are a vi fan, you have this launcher here, which also lets you launch a terminal. Now, I don't like to work that way too much, but what I do like to do is, where did it go, not like that, there we go: here you can have the best of both worlds. You can have your notebook here and run the cells, and over here, if you want to, you can use vi in this terminal. It's just a standard prompt inside of your notebook environment, and you can access all of that. Right, let's not have a fight over text editors, please; that wasn't the intent. What I like to do, as an architect, is see how things run. I often set things up like this, where I'm running top here on the right-hand side and running cells over here, to see what happens and which parts of the environment are being stressed, and so on. You get the idea. All right, the first notebook really doesn't have anything to do with our use case, so I'll just save it. I won't demo this, but this little Git plugin here is quite useful: you can do most of the things you need to do in Git right from here. You could look at a diff of what changed, stage your changes, write your commit message here, and commit. It won't work here, but you could even push back to the repository. So if you're just starting with Git, this is a nice, easy learning curve for that piece. Okay, I've done "00 getting started." Now I'm going to put on my fake data scientist hat and start looking at the exploratory data analysis. This is my way of getting acquainted with the data I'm dealing with. So let's start running some cells. What do we have here? pip install -r requirements.txt. I wonder what this does. If you're familiar with Python, you know this is going to install a number of packages. Which packages, you may ask? Well, the ones we defined in requirements.txt, and you can see I have requirements.txt right here, so let's bring it up on the screen. I'll just try to put it here. There we go. These are the Python packages that I want, pinned to exact versions, so I know I'm using a known set of versions: if it works today, it will work tomorrow and the day after, because I'm keeping those versions. So, those Python packages are installed. Okay, here I am using the mouse again; maybe I should use the vi key bindings, I guess, for the vi fans in the crowd. Import pandas, import numpy; if you're a data scientist, you know all of this better than I do. Here we use the Boto3 client to download a couple of files from S3. Okay, and now this didn't look like it had any output, so you might wonder, did it work or not? If I refresh here, and it beat me to it, I can see there's a new file that showed up, the fraud-cleaned-sample Parquet file. So now I have my data file in the environment, and I can start using data frames and looking at what types of data I have, for example.
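Roughly what that download-and-inspect step does, as a sketch: the bucket, object, and column names here are placeholders rather than the workshop's actual values, and in the notebook the S3 endpoint and credentials come from the environment.

```python
import boto3
import pandas as pd

# Placeholder bucket and key names; the notebook pulls the real ones from S3.
s3 = boto3.client("s3")
s3.download_file("example-bucket", "fraud-cleaned-sample.parquet",
                 "fraud-cleaned-sample.parquet")

df = pd.read_parquet("fraud-cleaned-sample.parquet")
print(df.dtypes)              # what kinds of columns do we have?
print(df["label"].unique())   # e.g. 'legitimate' vs. 'fraud'
```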
I can see that the label variable has two values, legitimate or fraud, so that's going to be interesting. This is the historical data: what we've seen as legitimate transactions versus fraudulent transactions. We keep going down. So, where was I? Okay, the types of transactions. We have chip and pin, contactless, online, swipe, and manual. Very good. And here we have some counts: how many online, how many contactless, how many chip and pin, that kind of thing. These numbers are interesting, but we might want to graph them a little. For example, in this historical data we can look at the fraud here, which is blue, versus the legitimate, and match it against the transaction type: chip and pin, contactless, manual, and online. We can see that, if we only look at this, there's a lot more fraud with manual and online; there's more fraud than legitimate there, whereas with swipe, chip and pin, and contactless there seems to be less fraud. So do we say, all right, job done, we reject all the manual and all the online transactions and call it a day? No. We've just looked at one piece of the data; we're just getting acquainted with it. Then you have another one, foreign versus, well, I'm guessing domestic, or depending on where you live. Where does most of the fraud happen? By ratio, it's mostly transactions that look foreign that are more often fraudulent. You get the idea; I don't want to dwell on that too much. Transaction amount distribution: this is about whether a transaction is more likely to be fraud depending on whether it's a high or low dollar amount. I'll keep going. Interarrival time is the time gap between transactions. If you have five transactions in the same second, that's a bit suspicious; usually people don't shop that quickly with their credit card. But once again, that might just be your intuition; you need to look at the data and make sure it confirms those gut feelings. All right. Activity by time of day is also an interesting one. I rarely use my card in the middle of the night, so the local time zone would have an impact. For fraud, we see a very even distribution of the times of day the fraudulent transactions happen in, whereas for the legitimate transactions the distribution is quite different. All right, so that's the beginning of my exploratory data analysis, and when I reach the bottom here, I'm told that now the real work starts: we're going to click through to "02 feature engineering." I think I'm going to slowly fade into the background and call in the real data scientist. So Audrey, would you like to take this over before I make a fool of myself pretending to be a data scientist? Of course. So let me go ahead and share my screen, and let me know if y'all can see that. All right. I just want to put some context around where we are in this journey. We've been looking at some of our data, and I want to flip back to one of the slides that Erwan showed you earlier. Within a model life cycle there are four distinct steps. We gather and prepare the data; Erwan actually looked at the data. As a data scientist, I'm now going to develop the model, and there are a number of steps within that.
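Looping back for a moment to the exploratory step Erwan just walked through, here is a minimal sketch of the count-and-ratio view it takes of the data; the column names ('label', 'trans_type') are assumptions, and it continues from the data frame loaded in the earlier sketch.

```python
import pandas as pd

# How many of each transaction type, split by label?
counts = pd.crosstab(df["trans_type"], df["label"])
print(counts)

# Fraud as a share of each transaction type: the ratios, not the raw counts,
# are what make the 'manual' and 'online' types stand out.
print(counts.div(counts.sum(axis=1), axis=0))
```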
And within our RHODS framework, we've been using Red Hat OpenShift Data Science; that's the ephemeral IDE we're working in. Some of you may have used PyCharm or Spyder or some of the others. So I'm going to continue within RHODS and take a look at the feature engineering. Once we've collected and cleaned our data, we have to do feature engineering, which means processing that data into a format that the machine learning model will interpret correctly, because we can't just feed a lot of text strings into our model; that's not going to be very efficient. Any of you who have worked with transforming between data types know that a numeric data type will essentially be a lot quicker to work with than an alphanumeric or character data type. So in this notebook we're going to transform our data, and we're going to ensure that the data still holds the information needed to distinguish between legitimate and fraudulent transactions. Now, remember what our transaction data looked like. As Erwin mentioned, we had transaction times (how long has it been since the last transaction was initiated), transaction amounts, transaction types (was it in person, online, contactless), whether merchant IDs were used, and other details like where the transaction was made. Are you having transactions made in both Europe and North America at the same time? That might be a little difficult, so it could point to a fraudulent transaction. We take all of this information and encode it as a point in space, and we call these points feature vectors. This is really important, because once we move on from feature engineering to the model, we can think of the machine learning model as a function that takes in these feature vectors and returns a prediction. We have to be able to take all of the human-readable information we've been looking at to spot fraud and turn it into a numerical representation that our machine learning model can understand. In our case, when the machine learning model takes in that feature vector, it returns a label, and that label predicts whether we have a fraudulent or a legitimate transaction. I know that's a lot. If you're a data scientist, you'll dig this; if you're a developer, you'll go, okay, what next? Well, let's go through the notebook. The first thing we do is load the data. We're using Parquet; you can use CSV files, but Parquet is a little more useful in that it stores your schema in its metadata. With a CSV file you don't have that; you have to guess what the schema is by looking at what columns you have. Once we've loaded our data, we want training data and testing data, so we split the data into two parts. We'll just show you here that the length of the training data is over a million rows, the testing data is around 600,000, and then we look at the proportion of training to test data.
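A hedged sketch of that split, continuing from the data frame loaded earlier; the 'label' column name and the test fraction are assumptions chosen only to roughly match the counts mentioned.

```python
from sklearn.model_selection import train_test_split

# Hold out roughly a third of the rows for testing; stratify so the rare
# fraud class is represented in both sets.
train, test = train_test_split(df, test_size=0.33, random_state=42,
                               stratify=df["label"])
print(len(train), len(test), len(test) / len(df))
```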
Now that the data is loaded and split, some of our features are very obvious quantities; we talked about interarrival times and transaction amounts. But others are categories of things, like the merchant IDs and transaction types, which brings me back to that point about alphanumeric values: they're not really easy to deal with. With conventional programming (I used to be a Java developer, so I can refer to that), I might use distinct small integers to model categories of things, but trying to feed that straight into a machine learning algorithm isn't really going to work. So there are a couple of ways we can take those categorical features and make use of them in the notebook, and we use two methods: feature hashing and one-hot encoding. I'm just going to bring a screen over here to show you. For feature hashing, we take terms like "John likes going to watch movies," "Mary likes to watch movies too," "John also likes football," and for each term we can apply a numerical index. That's one way we could do it. The other way is one-hot encoding, where we represent those categorical variables as binary vectors: instead of having a categorical number for, say, apples or chickens or broccoli, you replace that with a binary vector. I know this is a lot, but I do have to go through it, because the data scientists out there want to know what we're doing and what we're talking about. So, past the encoding of the categorical features, we go ahead and use these techniques, feature hashing and one-hot encoding, and then we see whether we can visualize the results to make sure they look okay. And that brings up another topic that's a little bit esoteric: reducing the dimensionality of our encoded categorical features. We have these features, like a merchant ID transformed into a binary form, but we need to be able to plot those points on a plane. We're taking all this text and boiling it down, saying we want to represent it as a point. This means we're going from hundreds of dimensions, or five or six dimensions, depending on whether we used the hashed merchant IDs or the one-hot encoded transaction types, down to basically two dimensions. Now, this process is actually very expensive in terms of compute and memory, so we only sample a small amount of our data, and we look at two types of analyses to help us with that. The first is principal component analysis, and the second is what is called t-SNE, t-distributed stochastic neighbor embedding. So we go ahead and see if we can visualize these categories, and looking at the nonzero entries we see there are a couple that are really high and the rest are kind of low. And this is okay, because it just suggests that the data points we looked at first were hashed into something within our first five or so buckets, or categories, and within those first five categories the transform of our transaction type is going to be either fraudulent or non-fraudulent. So those values will be high; it means that our hashed categories are actually working.
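Here is a rough sketch of those two encodings in scikit-learn terms; the column names, sample values, and number of hash buckets are illustrative, not the notebook's exact choices.

```python
import pandas as pd
from sklearn.feature_extraction import FeatureHasher
from sklearn.preprocessing import OneHotEncoder

sample = pd.DataFrame({
    "merchant_id": ["m_1001", "m_2002", "m_1001"],
    "trans_type":  ["online", "chip_and_pin", "manual"],
})

# Feature hashing: many distinct merchant IDs squeezed into a fixed number
# of buckets, so the output width does not grow with the vocabulary.
hasher = FeatureHasher(n_features=256, input_type="string")
hashed = hasher.transform([[mid] for mid in sample["merchant_id"]])

# One-hot encoding: one binary column per transaction type.
encoder = OneHotEncoder(handle_unknown="ignore")
one_hot = encoder.fit_transform(sample[["trans_type"]])

print(hashed.shape, one_hot.shape)   # (3, 256) and (3, 3) in this toy example
```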
And when we sum up all of the nonzero values, they come to 5,000, which in our case is suspiciously the exact number of entries we took a look at for fraudulent and legitimate transactions. So with that, okay, we are able to visualize all of this. But wouldn't it be easier to just say that if we see a merchant ID that looks suspicious, we can correlate it with fraud? Again, we use principal component analysis to plot the first two principal components of the encoded merchants; you can think of that as a mapping from the high-dimensional space to a two-dimensional space. If we take a look at this graph, and it is interactive, we can see there's really quite a lot of overlap between the classes. So we can't say with certainty that the merchant ID alone is an obvious way to differentiate between a legitimate transaction and a fraudulent one. In that case, what if we use a non-linear visualization technique? Because there was a lot of overlap, the merchant ID clearly isn't the obvious way to differentiate between legitimate and fraudulent transactions, so a non-linear visualization technique can do better. The next method we use is called t-distributed stochastic neighbor embedding, or t-SNE for short, and it learns a mapping from high-dimensional points to low-dimensional points such that points that are similar in the high-dimensional space are also likely to be similar in the low-dimensional space. If we go ahead and execute this, let's see if I can get it here, when it eventually comes up we'll see that it gives a better representation, but there's still a lot of overlap between the classes. And we know from looking at the exploratory analysis notebook, our EDA, that the numeric features do contain a lot of information to help us distinguish between the classes, so we'll see how we exploit that with our models in another notebook coming up. But first, let's process these features. We're going to encode them, and we basically need to impute any missing values for things like the interarrival times. For example, the interarrival time is undefined for the first transaction of each user, since there was no previous transaction. We also need to scale all the numeric features to a common range, and we do this using the pipeline facility you can use from sklearn. So let's encode the numeric features, then fit and save the feature extraction pipeline. Just a little bit more on this pipeline: we took the data, transformed it into points, and tried to identify certain points, or features. Really what we're doing is stating how we want each column from the original data to be transformed, using the various techniques I've talked about, and putting all of that into one pipeline, which we can then fit to our training data, because we don't want to do all of this exploration over and over again. So we create the pipeline, and then we'll go on to train the model on our transformed data.
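As a sketch of that pipeline idea, continuing from the earlier split and assuming hypothetical column names (the notebook picks its own columns and transformers):

```python
import pickle
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

numeric_cols = ["amount", "interarrival"]
categorical_cols = ["trans_type"]

feature_pipeline = ColumnTransformer([
    # Impute missing interarrival times (undefined for a user's first
    # transaction), then scale the numeric features to a common range.
    ("numeric", Pipeline([("impute", SimpleImputer(strategy="median")),
                          ("scale", StandardScaler())]), numeric_cols),
    ("categorical", OneHotEncoder(handle_unknown="ignore"), categorical_cols),
])

# Fit on the training data only, then save the fitted pipeline for reuse.
X_train = feature_pipeline.fit_transform(train[numeric_cols + categorical_cols])
with open("feature-pipeline.pkl", "wb") as f:
    pickle.dump(feature_pipeline, f)
```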
Now, because of time I'm not going to go deeply into logistic regression, but logistic regression is basically a statistical model. In its basic form it uses what we call a logistic function to model one binary dependent variable, though more complex extensions exist. When we use it in statistical software, or in our case here, we're trying to understand the relationship between a dependent variable and one or more independent variables by estimating probabilities, and that's done by the logistic regression equation. All of this may sound esoteric, but at the end of the day this type of analysis can help you predict the likelihood of an event happening or a choice being made: does something pass or fail, do I win or lose, do I have a fraudulent transaction or a legitimate transaction? So we'll load in our data, split it into training and testing sets, and then load in our pipeline and create our feature pipeline. But there's something else that happens when creating a model: now that we've got all these features and these classes, we have something called imbalanced classes. That basically means our training data set can contain an unequal representation of each of our classes. In our data set, fewer than two percent of the samples are fraudulent and the remaining 98 percent are legitimate, which is awesome for the bank but not good for us when we're trying to make our model accurate at pinpointing those fraudulent transactions. Because of this we have an imbalanced class problem, and it can cause trouble: if a model simply classified all the transactions as legitimate, it would be correct 98 percent of the time, and that high accuracy can trick us into thinking we have a very good model. We can do a couple of things to tackle the problem, and today we'll use class weighting, which you've probably heard about. Weighting is a good way to tackle this, because you weight the samples, and those weights are passed to the logistic regression model to ensure the model is penalized proportionally for misclassifying each class while it's training. That way we get a more realistic picture. So we compute the weights for each of the data labels we had, and then we have to validate our model. This brings us to the topic of confusion matrices. We have this model we just trained to make predictions for the data in our test set, and we want to compare those predictions to the truth. The model is performing okay, but as we'll see, it's better at identifying legitimate transactions than fraudulent ones. So we use a confusion matrix to visualize the accuracy: we look at the predicted classes against the actual classes. Eventually a nice diagram shows up, and when we compare the predicted fraudulent transactions against the actual fraudulent transactions, we get values we can use to compare the raw counts with the estimated values. So if we predict fraudulent transactions that really are fraudulent, the value is pretty good, around 0.90. If we predict fraudulent transactions that are actually legitimate, the value is quite low. If we look at legitimate transactions that get flagged as fraudulent, you'll see that the count is also low, which is good, and for the legitimate transactions we predicted as legitimate, the count is pretty high, which is also good. We want to do this visual comparison.
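A hedged sketch of that training and validation step, continuing from the sketches above; the notebook computes explicit class weights, whereas here class_weight="balanced" stands in for the same idea.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix

# Class weights penalize misclassifying the rare fraud class more heavily.
model = LogisticRegression(class_weight="balanced", max_iter=1000)
model.fit(X_train, train["label"])

# Transform the held-out data with the already-fitted pipeline, then compare
# the predictions against the truth.
X_test = feature_pipeline.transform(test[numeric_cols + categorical_cols])
predictions = model.predict(X_test)

print(confusion_matrix(test["label"], predictions))   # rows: actual, columns: predicted
print(classification_report(test["label"], predictions))
```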
Then we dump everything into the model file so that we can save it outside of our notebook, and then we go on to creating the main pipeline. Now, I'm running short on time, so I'm going to skip this part, where we create the pipeline that takes into account all the algorithms we used, checks whether our model is robust enough, and adds in the weights, and I'm going to look at how we deploy the actual application. Because remember, we have this model now and we've determined that it's pretty good, but what we want is to create a prediction function. We don't want to deploy a Jupyter notebook; we want to create a prediction.py file and connect it to a Flask application, so that from the Flask application we can feed it some transactions, and the prediction function, using our model, which is now pretty good after all the different methods we've used to analyze it and weight it appropriately, should return either fraudulent or legitimate. Oops, that's not good. And then we can go into the detection piece, and I'll just take a look at the Flask app. That's just to show those of you who are developers that we're calling our prediction.py function, and within prediction.py we give it a pickle file, which contains our pipeline, and we also call the model for the actual prediction. If I go ahead and run this Flask app, it may not run properly, because I think I may have lost the connection here. At the end of the day, what we want to do is use a curl command, for example, just to check on the status of our service: it hits our localhost, goes through our Flask app, and then we can give it some information, some transactions, and it will say whether that transaction is legitimate or fraudulent. Now, the way we package this, because we're using OpenShift, is that we go into OpenShift itself, and because we've saved everything we've been working on into a GitHub repository, we can import from Git and build that application inside of OpenShift. Just one moment here. So what we'll do is point it at the Git repo, the RHODS fraud detection Git repo; in the advanced Git options, the context directory is "app," and we leave the Git reference blank. This detects an image for Python 3.9, but we don't want Python 3.9; we actually want to use Python 3.8 on UBI 8. We'll call our application the RHODS fraud detection application, and we'll just deploy it. I'm not going to create it, because I've already created it, but once it has been created, OpenShift will containerize our application and make it publicly available for us to send requests to the prediction function that lives there. We do this by taking the URL that was created when this containerized application was generated by OpenShift.
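Before moving on, here is a minimal sketch of the prediction.py-plus-Flask idea described above; the file names, routes, and request handling are placeholders rather than the workshop's exact app.

```python
import pickle

import pandas as pd
from flask import Flask, jsonify, request

app = Flask(__name__)

# Load the fitted feature pipeline and the trained model saved from the notebook.
with open("feature-pipeline.pkl", "rb") as f:
    pipeline = pickle.load(f)
with open("model.pkl", "rb") as f:
    model = pickle.load(f)


@app.route("/status")
def status():
    # The "poke at it" endpoint: is the service up at all?
    return jsonify({"status": "ok"})


@app.route("/predictions", methods=["POST"])
def predictions():
    # Turn the incoming JSON transaction into features and ask the model.
    features = pipeline.transform(pd.DataFrame([request.json]))
    label = model.predict(features)[0]
    return jsonify({"prediction": str(label)})


if __name__ == "__main__":
    app.run(host="0.0.0.0", port=8080)
```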
So we grab this URL, and within our fraud detection workshop here we can put in that URL and, well, of course I didn't include all the libraries that I needed, but we would give it that URL, send it some predictions, and see whether the prediction comes back as fraudulent or legitimate. I wanted to go through this quickly so that we could have some time for questions, so what I might do now is bring both Erwin and myself on stage and see if anybody has any questions at this point. Thank you, Audrey. So yeah, I'm here, I'm looking at the chat. I think we have 10 more minutes, so if you have questions, type them in the chat and we'll try to start answering them. Meanwhile, I'll check my environment to see if I can show what this looks like on my side. I'm just going to go back in here and see if I can pull this up too. Yeah, I did this yesterday, so it may not have kept the address. It might also be because I lost my connection halfway through. If you didn't lose your connection, Erwin, you might be able to go ahead and demo that. Yeah, you know what, I'll just talk a little bit about this piece while we're waiting for questions. Kristen, if you could switch over to my screen, thank you. Okay, so let's talk a little bit about this piece. Everything that Audrey showed happened here inside this notebook, and notebooks are nice when you're getting started, but you can't move a notebook to production. The result of all the hard work done by Audrey is essentially a little web interface, a little Python function, that relies on the actual model built by all these activities. Once that is ready, it needs to be put in a place where a developer can reach it. By putting everything in this "app" folder of the Git project and going through the steps that Audrey showed, a few things happen at the same time. I'm not sure how familiar you are with OpenShift and Kubernetes, so I'll describe it quickly. This application results in something called a build config, which is a way of building a container image. It will build the image, and you can see here I've clicked that "start build" button a few times; if you make changes in Git and need to rebuild, you can come here and click start, and if you want, you can also configure a webhook so it happens automatically and redeploys on the fly. Now, creating a container image is nice: it's this unchangeable version of your model, self-contained, you can move it anywhere. But the image by itself isn't that useful; you need to run it. So the next thing that happens is that once the image is built, it brings up a pod of it, and when I say a pod, it could be multiple pods if you have a lot of requests and need more than one. It also automatically creates a service on top of that, which is nice if you're in the same cluster, but in this case it also brings up a route by default. So, if my model has been deployed, yeah, okay, I got there. Right now, when I hit this route, it just says "status equals okay," because I'm not actually asking for a prediction; I'm just poking at the model, and it says, yep, I'm okay. So this is the URL; I'm going to copy this link address and go back to, which one was it, it was the packaging application notebook, near the bottom, I believe. Yeah, just go to the very bottom there and replace, yeah, so here there's a variable to hold the external name of my route, and then I'm going to send some data through this request.
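Roughly what that request looks like from the client side; the route URL and the field names are placeholders for whatever your OpenShift route and schema actually are.

```python
import requests

url = "http://rhods-fraud-detection-example.apps.example.com/predictions"
transaction = {
    "user_id": 1698,          # illustrative values, not real data
    "amount": 33.32,
    "merchant_id": 4795,
    "trans_type": "online",
    "foreign": False,
    "interarrival": 9609,
}

response = requests.post(url, json=transaction)
print(response.json())        # e.g. {"prediction": "legitimate"}
```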
Here I'm passing some data: the user ID, how much money, the merchant ID, and I want the model to tell me what it thinks about this transaction. So if I run it, it says, yep, that's a legitimate transaction. Now, I don't know if it's a trait of my personality, but when I see this I think, okay, that's good, but how do I know it really works? I know it's not enough, but it always makes me feel better to tweak some of the numbers. How about now? Still legitimate, okay. What about if foreign equals true, how about now, do you still think it's legitimate? Mm, okay, still legitimate. What if I change the interarrival time, whatever this is, and make it five? How about now? Oh, okay, now I've tricked the model into predicting fraud. This is not how you test that a model functions well, but it's a little sanity check that I like to do when I'm doing these things. And for all of this that I'm doing, I don't need notebooks anymore. This is running in my OpenShift cluster, it's reachable from anywhere in the world at this URL, and if you know what the URL is, you can pass it data and it will predict whether a transaction is fraudulent or not. And because it's on OpenShift, you can also scale it up if you need to. Right now, let me see the details, it's running one pod, but it's as simple as clicking, and, well, I might hit some limits in the sandbox if I do this and get in trouble, but essentially by doing this I'm spinning up more instances of this model. They'll potentially run on different machines in my cluster, so if there are lots and lots of transactions coming in, they'll be load-balanced across all these instances, and I get the goodness and the reliability of OpenShift. We find this a nice way to illustrate that the job of the data scientist and the job of the developer are not the same, but they do intersect. It's a good setup where a data scientist can get started here, use their notebook interface to get the ball rolling, and then, simply by pushing changes into Git, have the image automatically rebuilt and the application run. And as a developer, all I need to know is that the prototype is sitting at this URL; when we move to production it might be a different URL, but I can start mocking up the access to the model. Okay, so hopefully we didn't confuse you more than we needed to. Let's get back to the slides. I haven't seen any questions in the chat, so I will assume that everything was crystal clear from beginning to end, right, Audrey? You understood everything I said? I understood everything you said. Yeah, we're good. So to wrap up, we want to thank you for attending and giving us your attention. The sandbox account, I've seen some of you did go through the steps, so thank you for that, is good for, I believe, 30 days, and at the end you have the option to renew it. So by all means, feel free to go through all of these steps again at your own pace. Whether you're a data scientist or not, it does help to see what a data scientist's job is like, and vice versa, it also helps to see a little bit of what Kubernetes and OpenShift are doing and all of that. If you have questions, you can use the DevNation Slack; we keep an eye on it. The recording of this session will be posted later as well. That's two minutes left, so I want to say a big
thank you to all of you, and I think we can stop the stream now.