Hello everyone and welcome to another episode of Code Emporium, where we're going through some code on fraud prediction. A couple of months ago I made a video that covered mostly the theory: how chargebacks are created, what happens when customers file claims, and the nitty-gritty details at a conceptual level. Now let's actually use machine learning to predict, at the time of purchase, whether a customer will later file a claim or a chargeback. Transaction fraud is a pretty common problem that people tackle with machine learning, often from the get-go. A lot of people approach it as anomaly detection, or treat it as an unsupervised problem, but here we're going to take a typical supervised approach to predicting transaction fraud at purchase time. So with that, let's get started. We have a pretty solid, concrete problem definition: given past transactions, determine whether the current transaction is fraudulent. As for the data source, we have a Kaggle dataset which is massive; the train CSV alone is around 300 megabytes. I'll walk through a few of the columns so we know what data we're looking at. There's a credit card number (all of this, by the way, is synthetic data, so nobody's real credit card information is shown on Kaggle), a transaction date, the merchant name, the item that was purchased, and the amount, which is the price of that item, followed by some user details. And that's just 10 of the 46 actual columns we have to work with.
One of these columns is the label: a binary is_fraud flag, fraud or not fraud, and that is mostly what we'll predict. Interestingly, all of this data was generated; there's a little repository linked on Kaggle from which it came, and the author used Faker, a Python library for generating fake data. You can see certain fields just being populated with fake values. Be sure to check that out if you're interested in generating synthetic data; it's fairly useful for data science problems where you're not able to use real data from a real database. Now, coming to actually looking at the data: the training set is huge, and we can't really train on those millions and millions of rows. So what I do is sample 100,000 rows, i.e. 100,000 transactions, and use that for training and testing, and it works just as well. Looking at the sample, I have 100,000 transactions from the beginning of 2019 to mid-2020, and 575 of them are fraudulent, which is a pretty realistic imbalance. Now, the idea is to frame this as supervised classification: predict whether the current transaction is fraudulent by looking at all of the customer's past historical behavior. To do that, we need to be able to pull all of a customer's data with a single identifier; in this case I can just use the credit card number as the unique user identification.
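Before trusting the card number as a user key, it's worth verifying that the mapping really is one-to-one. Here's a minimal, dependency-free sketch of such a check; the field names `cc_num` and `user` are stand-ins for whatever the dataset actually calls them:

```python
from collections import defaultdict

def check_one_to_one(records):
    """Return (cards mapped to >1 user, users mapped to >1 card)."""
    users_per_card = defaultdict(set)
    cards_per_user = defaultdict(set)
    for rec in records:
        users_per_card[rec["cc_num"]].add(rec["user"])
        cards_per_user[rec["user"]].add(rec["cc_num"])
    shared_cards = {c for c, u in users_per_card.items() if len(u) > 1}
    multi_card_users = {u for u, c in cards_per_user.items() if len(c) > 1}
    return shared_cards, multi_card_users

# Toy transactions: one card per user, so both sets come back empty.
txns = [
    {"cc_num": "4111", "user": "alice"},
    {"cc_num": "4111", "user": "alice"},
    {"cc_num": "5500", "user": "bob"},
]
shared, multi = check_one_to_one(txns)
print(shared, multi)  # set() set()
```

If either set comes back non-empty, you'd need some entity resolution before using the card number as a user ID.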
I'm doing a check here just to make sure that one unique user maps to one unique credit card. There may be cases where a user has multiple credit cards, or where a credit card is tied to multiple users because they live in the same household. But it looks like none of those complications exist, so we don't need to resolve those entities here; it's pretty cut and dried, and I just use the credit card number as the identifier. This is something you would really need to look into when dealing with real data. So, on to feature engineering. The thing about the binary is_fraud label I talked about is that you don't actually get it right away: as soon as a transaction is made, you won't know whether it was fraudulent unless the customer comes back to you and says, hey, I see a transaction that was made on my card two weeks ago, can you deal with this, because it was fraudulent and I didn't make it. It's only then that you get a label. So ideally, to build this dataset, you would wait a certain amount of time and then say: after this long, some high fraction, say 97%, of the people who are going to file a claim have already done so. It's only then that you actually have reliable labels. Now, I don't have access to that kind of historical claim data in this context, but if I did, at a real company, what I would want to do is plot a graph that looks like this: the x-axis is days until the fraud claim is filed, and the y-axis is what percentage of the total claims came in before that day.
In this case, a bunch of people keep filing claims until maybe 60 or 70 days out, but at 90 days the curve really starts flattening, reaching something like 98%. That means 98% of claims arrive within the first 90 days, so we can use 90 days as the label cutoff when creating our dataset, which you'll see when I introduce the SQL file used to create the training set. But before writing any SQL, you want to step back and think about what factors would affect whether a transaction is fraudulent or not. I explained this in the other video, but basically you do this exercise so that you have criteria for determining fraudulent versus not fraudulent: form a hunch about the behavior, then go analyze the data to verify it. The first thing that comes to mind: how many past transactions, both fraudulent and non-fraudulent, has this customer made? I would think that more fraudulent past behavior indicates this may be another fraudulent transaction, and then I'd verify that with the data. Second, I would do the same for merchants; we have merchant data, remember, so we can check whether this merchant is legit as well. Sometimes there may be collusion between the buyer and the merchant, working together to dupe the company, so we want to validate this too.
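Circling back to the claim-delay curve for a second: once you have the delays, the label-maturity cutoff can be computed directly rather than eyeballed. A small sketch, with a made-up delay list for illustration:

```python
def maturity_cutoff(delays_days, coverage=0.98):
    """Smallest day d such that at least `coverage` of claims arrived within d days."""
    ordered = sorted(delays_days)
    n = len(ordered)
    for i, d in enumerate(ordered, start=1):
        if i / n >= coverage:
            return d
    return ordered[-1]

# Hypothetical claim delays (days between purchase and claim): most land well
# before 90 days, with one straggler.
delays = [3, 7, 10, 14, 20, 25, 30, 40, 55, 70, 80, 88, 120]
print(maturity_cutoff(delays, coverage=0.9))  # 88
```

With real data you'd pick the coverage level (90%, 98%, etc.) based on how much label noise you can tolerate versus how long you can afford to wait.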
Next is purchase amount: I would think that for larger purchases, say $10,000 products, fraudulent transactions tend to occur a little more often, but again you would need to verify this. Then location. I'm including location because, at e-commerce companies that ship through delivery services like UPS, there's a chance that a customer who lives in an area with a lot of package theft files a claim that their package was stolen. The claim itself may not even be dishonest, but the location is driving it, and that's why I've added it as an indicator; you can pass location to your model as a categorical variable. Now let's take all of this information (there's surely much more we could scrounge from this kind of data) and use it to create our training dataset. So I have this file, queries/base.sql; let's open it, and it looks like this. From the get-go it's a bunch of CTEs. If you don't have a good grasp on SQL, I highly recommend learning to write it with common table expressions; they're much easier to read than nested queries. The first CTE just pulls information from the transactions, basically whatever I have in my data frame, which is df. I'm able to write SQL here, by the way, because of pandasql.
It's easier for me to just write SQL than to chain pandas data frame operations, which to me is a lot harder than native SQL; hence I'm using it. Okay, so the next CTE computes past-transaction features. The goal here is, for every single user ID (which is the credit card number), to get the total number of past transactions (zero if there are none) and the average price of those transactions. And remember, I'm only considering transactions that happened over three months ago, because it's only at that point that the labels are reliable; for anything in the last three months, the is_fraud label may not exist yet. So I get all of this past information right here. Then I write essentially the same CTE for fraud, to determine how many past transactions were fraudulent; the only difference between the first and second CTE is the filter is_fraud = 1, so this counter includes only the fraudulent transactions. And I do the exact same thing with two more CTEs on the merchant side: how many transactions the merchant has had, and how many of them were fraudulent. Then I combine all of that in a final SQL expression that does a bunch of left joins back onto the original transactions. In it you can also see that if the location is missing, I populate it with UNK, for unknown.
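To make the shape of the query concrete, here's a cut-down version of the same idea run against SQLite's in-memory engine. This is a sketch, not the notebook's actual base.sql: the table and column names are stand-ins, the three-month label cutoff is hard-coded as a date for simplicity, and I've folded in the CASE WHEN guard for the fraud rate discussed below.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE txns (cc_num TEXT, amt REAL, is_fraud INTEGER, txn_date TEXT)")
conn.executemany(
    "INSERT INTO txns VALUES (?, ?, ?, ?)",
    [
        ("4111", 20.0, 0, "2019-01-05"),
        ("4111", 500.0, 1, "2019-02-10"),
        ("4111", 30.0, 0, "2019-12-01"),   # recent txn: its own label isn't mature yet
        ("5500", 15.0, 0, "2019-12-02"),   # first-ever txn for this card
    ],
)

# Past-transaction features per card, using only label-mature history
# (everything before a hard-coded cutoff standing in for "3 months ago").
query = """
WITH past AS (
    SELECT cc_num,
           COUNT(*)      AS n_past,
           SUM(is_fraud) AS n_fraud
    FROM txns
    WHERE txn_date < '2019-10-01'
    GROUP BY cc_num
)
SELECT t.cc_num,
       COALESCE(p.n_past, 0) AS n_past,
       CASE WHEN COALESCE(p.n_past, 0) = 0 THEN 0.0
            ELSE 1.0 * p.n_fraud / p.n_past END AS fraud_rate
FROM txns t
LEFT JOIN past p ON p.cc_num = t.cc_num
WHERE t.txn_date >= '2019-10-01'
"""
rows = conn.execute(query).fetchall()
print(rows)  # 4111 has 2 mature txns (1 fraud); 5500 has none, so its rate is guarded to 0
```

The left join plus COALESCE is what turns "no history" into an honest zero instead of a null, and the CASE WHEN is what keeps a first-time card from triggering a divide-by-zero.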
Also, if there are no past transactions at all, some of these joined columns will be null, and I want those to become zero, because no past transactions genuinely means a count of zero; you can infer the same for the other fill-ins here. Then there's divide-by-zero to worry about: I wanted a fraud-rate feature, to see whether it would be useful, but if the user has made no past transactions and this is their first one, the division throws a divide-by-zero error. That's why I wrap it in a CASE WHEN statement, and I do the same for the merchant fraud rate. So that's a really quick overview of how the dataset is constructed. Remember, we only use data up to three months ago, because that's the data for which we have accurate labels; beyond that three-month mark we just don't have labels yet. I hope that makes sense. Once you construct the dataset, the result is a single row of features per transaction. In this example the label is zero: even three months later, no claim was made that this was a fraudulent transaction. And for location you see 904; I use the zip-code prefix, essentially extracting the first three digits of the zip code. If you extract too many digits you get too many categorical levels; if you extract too few, the feature becomes less informative, because zip codes really are indicative of regions (it's pretty cool to read about how zip zones are assigned). Now, creating the model. First of all I have a function for undersampling; you'll see it when I use it below.
Let me actually go through this a little first. I have a bunch of categorical and numerical features that I've just manually listed. From here I split the data into train and test sets in a nine-to-one ratio: of the 100,000 samples, 90,000 are train and 10,000 are test. And I'm choosing not to shuffle, so that we're never predicting on data that happened before the training data; if you shuffle, you can suffer from data leakage, which is pretty bad and a very common mistake when starting out with machine learning. But you don't want to train on this directly, because we have a class imbalance: there are only about 500 fraudulent transactions out of roughly 100,000 total. With an imbalance this great, a model that just spews out "non-fraudulent" for every single input, essentially a function that always returns False, will technically score super well on accuracy, even though its precision on the fraud class is near zero; and we want the model to be able to pick up on these rare anomalies. To cater to that, I created a function called undersample, which undersamples the train set. If I have around 500 positive-label samples, it makes sure the ratio of negative to positive is at most four; so with 500 positives I would keep only 2,000 negatives. That's what I'm doing here: undersampling the negative label, because there are just so many of them.
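The undersampling step described above can be sketched without any dependencies. The function name and the 4:1 cap match what's described in the notebook, though the exact signature there may differ:

```python
import random

def undersample(rows, label_key="is_fraud", max_neg_to_pos=4, seed=0):
    """Keep all positives; randomly keep at most max_neg_to_pos negatives per positive."""
    pos = [r for r in rows if r[label_key] == 1]
    neg = [r for r in rows if r[label_key] == 0]
    k = min(len(neg), max_neg_to_pos * len(pos))
    rng = random.Random(seed)          # fixed seed for reproducibility
    sampled = pos + rng.sample(neg, k)
    rng.shuffle(sampled)               # don't leave all positives grouped at the front
    return sampled

# 500 positives among 100,000 rows -> 500 + 2,000 rows after undersampling.
data = [{"is_fraud": 1} for _ in range(500)] + [{"is_fraud": 0} for _ in range(99500)]
train = undersample(data)
print(len(train))  # 2500
```

Note this is applied to the train split only; the test set keeps its natural imbalance so the evaluation reflects reality.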
I hope this pace isn't too fast; I'm just trying to get through all these ideas, and you can look through the notebook line by line (and Google certain things) to understand the details, which I hope aren't too out of the box. Now, for the features, this is a typical pipeline structure that I encourage everybody to learn when they create models; it's a great way to pipe from preprocessing all the way to classification. All the categorical variables are imputed by scikit-learn's SimpleImputer with a fill value of -1, and then ordinal-encoded, which maps each categorical level to an integer. For the numeric variables, I impute with no strategy parameter, so it just takes the mean of the column and fills the NaN cells with it. Then we create a CatBoost classifier (CatBoost, very fun), pipe all of our preprocessed data into it, and fit the model on the undersampled train data. And you can see that over the 1,000 iterations the loss is definitely decreasing; the model is learning. As far as evaluation is concerned: because there's such a huge class imbalance, we want to make sure we account for both precision and recall correctly. So instead of the typical ROC AUC curve, we're using a precision-recall AUC curve, because precision and recall are the quantities that matter when dealing with imbalanced data. I'm taking all of this into account over here.
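What the SimpleImputer + OrdinalEncoder preprocessing amounts to can be mimicked in plain Python. This is only a sketch of the transformation's behavior, not the scikit-learn implementation; the column names are illustrative:

```python
def preprocess(rows, cat_cols, num_cols):
    """Impute missing categoricals as -1 and ordinal-encode; impute numerics with the column mean."""
    # Ordinal encoding: map each observed category to an integer code.
    encoders = {}
    for c in cat_cols:
        levels = sorted({r[c] for r in rows if r[c] is not None})
        encoders[c] = {lvl: i for i, lvl in enumerate(levels)}
    # Column means over the observed (non-missing) numeric values.
    means = {}
    for c in num_cols:
        vals = [r[c] for r in rows if r[c] is not None]
        means[c] = sum(vals) / len(vals)
    out = []
    for r in rows:
        enc = {}
        for c in cat_cols:
            enc[c] = -1 if r[c] is None else encoders[c][r[c]]
        for c in num_cols:
            enc[c] = means[c] if r[c] is None else r[c]
        out.append(enc)
    return out

rows = [
    {"category": "grocery", "amt": 10.0},
    {"category": None,      "amt": 30.0},   # missing category -> -1
    {"category": "net",     "amt": None},   # missing amount -> column mean (20.0)
]
print(preprocess(rows, ["category"], ["amt"]))
```

In the real notebook this lives inside a Pipeline so the fitted encoders and means from the train split are reused, unchanged, on the test split.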
And we have train and test metrics. Evaluating those, lookie here: we're actually doing pretty well. Even the ROC AUC is in the 80% range, which could be reasonable depending on the situation you're dealing with. The confusion matrix is, I think, super important, because it gives you the breakdown: how many fraudulent transactions we predicted, how many were actually there, and how many we got right. Let's just look at the test set, since that's what we're concerned with; if you sum the cells there are about 10,000 transactions. There are 58 cases that were actually fraudulent and that we predicted as fraudulent, which is great. There are 5 cases that were actually fraud but that we didn't predict as fraudulent, so we missed those, which is bad. Overall, of the 63 fraudulent cases, we picked up 58, which is pretty good. Then there are 124 cases that we predicted as fraudulent but that weren't actually fraud; those are our model's false positives, which is fine, it seems like a reasonably low number. And the rest are the typical true negatives: normal transactions that just pass through, which is good. So overall our model is already performing pretty well. But suppose a stakeholder comes up to me and says: this is great and all, you have a model, but what were the most important factors contributing to it? What determines a fraudulent transaction the most? For that, we have model interpretability using Shapley values.
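From the test-set counts above (58 true positives, 5 false negatives, 124 false positives), precision and recall fall out directly:

```python
# Counts read off the test-set confusion matrix described above.
tp, fn, fp = 58, 5, 124

recall = tp / (tp + fn)     # fraction of actual fraud we caught: 58 of 63
precision = tp / (tp + fp)  # fraction of fraud flags that were real: 58 of 182
print(f"recall={recall:.3f} precision={precision:.3f}")  # recall=0.921 precision=0.319
```

This is exactly why accuracy alone is misleading here: recall is strong, but roughly two of every three fraud flags are false alarms, and whether that trade-off is acceptable depends on the cost of reviewing a flagged transaction versus eating a chargeback.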
So SHAP is a library for visualizing the importance of features in models. I've made an entire video on this, but essentially it uses game theory: the gap between the model's prediction for a sample and the baseline is divided up among the features, with each feature credited in proportion to how much it contributed to pushing the sample toward (or away from) being fraudulent. In this case it's saying that price is by far the most important feature. This plot doesn't give any sense of direction, but my hunch is that a higher price typically means you've got to be careful with this one. After that comes the merchant's number of fraudulent past transactions: if a merchant has had fraudulent transactions before, you've got to be careful with them too. And interestingly, category also has a pretty high weight, so the type of item being purchased matters. When I actually looked at that (I did the analysis a little later, so let me hop ahead): category was interesting, because people who bought things like shopping nets and groceries actually had a higher tendency to file a claim, i.e. to be involved in these fraudulent transactions. I jumped ahead of myself a little, but let's continue. SHAP is a great library for visualizing these things: you can look at feature importance at the overall dataset level, or even at the individual transaction level.
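The game-theory idea can be made concrete on a toy model: compute each feature's exact Shapley value by averaging its marginal contribution over all feature orderings, and check the additivity property (the values sum to prediction minus baseline). This is illustrative only; the scoring function below is invented, and it is not how the SHAP library actually computes values for CatBoost:

```python
from itertools import permutations

def shapley_values(model, x, baseline):
    """Exact Shapley values over feature dicts; 'absent' features take their baseline value."""
    names = list(x)
    phi = {n: 0.0 for n in names}
    perms = list(permutations(names))
    for order in perms:
        current = dict(baseline)
        prev = model(current)
        for n in order:           # add features one at a time in this ordering
            current[n] = x[n]
            new = model(current)
            phi[n] += new - prev  # marginal contribution of feature n
            prev = new
    return {n: v / len(perms) for n, v in phi.items()}

# Hypothetical fraud score: price effect plus an interaction with merchant history.
def model(f):
    return 0.001 * f["price"] + 0.3 * f["merchant_fraud"] + 0.0002 * f["price"] * f["merchant_fraud"]

x = {"price": 1000.0, "merchant_fraud": 1.0}      # the sample being explained
base = {"price": 50.0, "merchant_fraud": 0.0}     # the baseline sample
phi = shapley_values(model, x, base)
print(phi, model(x) - model(base))  # the values sum to the prediction gap
```

Averaging over orderings is what splits the interaction term fairly between the two features; that's the "competing for credit" intuition.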
So let's say this is an example transaction where the label was not fraud, and the prediction was also not fraud. Why exactly did the model make that prediction, given these features? Here's the gist: each feature makes an additive contribution, where blue and downward means it's decreasing the predicted probability of fraud, and red and upward means it's increasing it. It's saying the biggest factor is that the price is super cheap, 6.84; that's probably why. And the merchant has never made any fraudulent past transactions, which is also a huge indicator. Then there are all these other factors pushing it down too, and that's why the predicted probability is pretty low. Next, let's take a look at another example, but this time a fraudulent sample: it's actually fraudulent, and our model correctly predicted it as fraudulent, but why? It looks like the price is over a thousand bucks; whatever this item was (yes, a shopping net, over a thousand bucks), that's really contributing to "you've got to look into this, this might be fraudulent." But there's also category = 11, which is surprising; this is the second most important factor, and category 11 is essentially the shopping-net category. Like I said before, when I looked into it, people who purchase shopping nets have the highest rate of fraudulent transactions among all item types, which I thought was super interesting.
That's a really fun fact you can see in the data, and you could probably pick up on many other patterns just by looking at samples too. That's about all I have for this notebook, but there are some other questions I would pursue. One is model interpretability: like I mentioned, this SHAP importance plot doesn't give directional information. I know price is a pretty strong indicator of whether something is fraudulent, but what is the dependence: higher price, or lower price? My hunch is obviously higher, but you can figure this out, and confirm it, by plotting partial dependence plots. A partial dependence plot shows the feature's value against the predicted probability, i.e. how the prediction varies as you vary that one feature while holding all the other features constant. So try plotting partial dependence plots and see what you get; it's a good way to verify our hunches and make sure the model behaves the way you'd think it would, which is super important when doing data science in the real world. Partial dependence plots, very fun. Another thing you can look at: Shapley values can be used on any model. I used them on CatBoost here, but you can use them on XGBoost, even neural networks, whatever you want. So try replacing the model and see what changes in the interpretability and the performance. Does this graph change much, and if so, which features change?
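A partial dependence sketch, with a made-up scoring function standing in for the trained model (the real thing would call the fitted pipeline's predict_proba on copies of the actual dataset):

```python
def partial_dependence(predict, rows, feature, grid):
    """For each grid value, average predictions with `feature` overridden across all rows."""
    curve = []
    for v in grid:
        preds = [predict({**r, feature: v}) for r in rows]  # other features held as-is
        curve.append(sum(preds) / len(preds))
    return curve

# Hypothetical score: fraud probability rises with price, shifted by merchant history.
def predict(f):
    return min(1.0, 0.0005 * f["price"] + 0.2 * f["merchant_fraud"])

rows = [
    {"price": 20.0,  "merchant_fraud": 0.0},
    {"price": 80.0,  "merchant_fraud": 1.0},
    {"price": 300.0, "merchant_fraud": 0.0},
]
curve = partial_dependence(predict, rows, "price", [10.0, 100.0, 1000.0])
print(curve)  # increases with price for this toy model, confirming the "higher price" hunch
```

Averaging over the real rows, rather than using a single reference row, is what makes the curve reflect the feature's effect across the population rather than for one customer.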
So check that out: try replacing the CatBoost model with something else. I don't think it will be too hard, because of the pipeline architecture we set up; essentially you just need to make a couple of changes here and everything else should remain the same. That plug-and-play quality is what makes machine learning pipelines so simple, and so important, and I encourage you to try it. If you get stuck on a roadblock, I've created an entire video and notebook explaining SHAP and the interpretation of Shapley values, and I encourage you to check those out as well. So that's all I have today. Again, you can treat fraud detection (fraud prediction, I should say) as a classification problem rather than some separate anomaly-detection-esque problem. It's not too hard to do, but it is a little different in terms of data collection, because we just don't have the labels up front. I hope this really helps with understanding a common problem, at least on Kaggle, for getting up to speed in machine learning. Hope you enjoyed everything here and learned something new, and I'll see you all in the next one. Bye bye.