There are a bunch of Kaggle notebooks and blogs online that take a credit card fraud detection data set, run it through some standard machine learning process, and report performance with a generic metric. But there are many nuances to dealing with fraud, from thinking about potential features to how we report results. My goal here is to add some color to fraud detection and prevention with machine learning. Fraud is fun, you just need to know how to deal with it.

But before we continue, this video is sponsored partially by Kite. They provide a code completion service for machine learning code. It integrates super well with your editors and even Jupyter notebooks, so click the link in the description to try Kite for free. Now back to the video.

So let's start with a little base example here. Think about Grandma. She runs a laptop repair line where people place a work order online and ship the laptop to a warehouse, her workers repair the laptop, and they send it back. We'll be using this business as an example throughout the video.

Now let's first ask some basic questions. How do we know that fraud has occurred? Through chargebacks. Now let's see what that is. This is J.J. Hey Bank, I don't recognize this $500 transaction that I paid to Grandma Fixes. Hello Grandma, it looks like J.J.'s transaction may have been fraudulent. We'll be taking the $500 back. Oh, I see. It is possible. No dispute here. So J.J., your money is now back in your account. Oh, nice. People can file for chargebacks if they don't recognize a transaction on their phone. This allows banks to forcefully reverse a transaction.

So another question. Why do we need fraud detection? In the best case, chargebacks cost a merchant nothing, but they typically incur losses. Merchants can dispute the chargeback if they are confident it wasn't fraud. So perhaps setting up a fraud detection system can prevent malicious users from making these transactions in the first place.
Now chargebacks can be a pain to deal with. Sometimes people don't file chargebacks until months after a fraudulent transaction occurs. They do their taxes and then they realize that they don't recognize a $500 transaction from six months prior. So fraud detection can mitigate this.

Now the main question. How does fraud happen? Let's paint a few different scenarios. The first being malicious actor Malcolm. Malcolm starts by creating an account on Grandma Fixes. Hello, Grandma. Can I get a work order for a laptop? Why, sure thing, sweet pea. That's $500. You can take it off my card, wink, wink. Why thank you, Malcolm. Here's your laptop fixed and good as new. Yay. One week later. Enter J.J. Hey bank, I don't recognize this $500 transaction that I paid to Grandma Fixes. Hello Grandma, it looks like that $500 transaction may have been fraudulent. We will be taking the $500 back. I see, it is possible. This Malcolm was winking a lot. No disputes here. I see. Hello J.J., your money is now back in your account. Rejoice. Ah, rejoice I will. Thank you.

This kind of fraud is harmful. Malcolm created an account on Grandma's website and made a fraudulent transaction with malicious intent. J.J. had to deal with the hassle, Grandma had to deal with the hassle and the loss, and Malcolm got a free work order in. We ideally want a system that blocks Malcolm's transactions.

But not all fraud happens this way. Incoming friendly fraud. Uh, hey Grandma, can I get a work order for a laptop? Why, sure thing, sweet pea. That's $300. You can take it off my card. Why thank you. And here's your fixed laptop. Oh, thank you so much. Six months later. Uh, hey bank, I don't recognize this $300 I paid to Grandma Fixes. Hello Grandma, it looks like the $300 transaction may have been a fraudulent one. We will be taking the $300 back. But I remember this young girl, J.J., though. From here Grandma could file a dispute claiming the transaction was legit, or just not deal with the hassle and J.J.
gets her money back. In this scenario the transaction was legit, but it's being flagged as fraudulent because J.J. forgot that she made the transaction. Friendly fraud is harder to predict since there is no suspicious activity. The situation isn't good for anyone though. Even though J.J. walked away with a free work order, Grandma is going to be extra cautious about J.J. in the future. Especially since this third scenario could have occurred too.

Let's get to that third scenario. Account takeover. Malcolm starts by logging into J.J.'s account. Hello Grandma, I mean, no. Hey Grandma, can I get a work order for a laptop? By the way, I'm J.J. Oh sure thing, J.J., what a sweet little girl. That's $500. Oh, um, take it off my card, wink, wink. Yes ma'am, and here's your fixed laptop, J.J. Yay, thank you Grandma, I appreciate it. One week later. Uh, hey bank, I don't recognize this $500 that I paid to Grandma Fixes. Um, hello Grandma, it looks like that $500 transaction may have been fraudulent. We will be taking that $500 back. Hmm, I thought that it was J.J. who indeed made that purchase though. Uh, nope, I didn't make that purchase whatsoever. Hmm, I think I've seen enough. Hello J.J., you get your money back. Oh, very nice, but who made that purchase from my account? Sounds kind of sus.

Account takeovers happen when malicious actors get hold of a person's credentials, like login credentials, and proceed to masquerade as that person. And this adds another level of required fraud detection. For the first two cases, we were more concerned with fraud at the transaction level, but for this account takeover case, we need to be concerned with fraud at the login level too. And this can be difficult. For this video, we will be looking only at transaction-level fraud though. So that's addressing mostly the first two cases, and maybe we'll take on account takeovers and these more complex cases in another video. Now, incoming machine learning.
I feel like this is where most blog posts and tutorials for fraud detectors start. But fraud isn't just about machine learning after all. You need to think like a fraudster and understand how they behave if you want to fight against them. I hope that intro helped paint the picture for fraud detection. Now we can think about the pieces of the machine learning pipeline with this fraud mindset.

So the first step here is defining the problem. Let's take the idea of fraud detection and define a concrete problem. Like I mentioned before, we want to be able to catch bad actors when transactions are made. So the input is some features about the user and their account. The output would be a binary classification: fraudulent or not fraudulent. Now we need to build the data set in this way too.

So let's start with building the features. To build features, a good exercise is to open a Google Sheet and create three columns. The first column is the feature. The second is your hunch about how this feature relates to fraudsters. And the third is what the actual relationship is, based on some exploratory data analysis (EDA).

Let's walk through a few examples together. So transactions are being made by a bad actor from their own account. One potential feature could be how long the account has been active. Typically you would expect these accounts to be short-lived, created for the sole purpose of getting lucky with fraud. Something else that may catch your eye is the number of successful purchases. A higher number of successful purchases could be indicative of slightly less fraudulent tendencies, though not necessarily. And what about the time between sessions on Grandma's platform? Shorter times between login attempts could be a little suspicious. Again, although not necessarily. Once you have these ideas and hunches, verify whether your hunches are true with the EDA process. Of course, to do this, you would also need to know what the labels look like.
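
As a rough illustration of the three example features just mentioned (account age, number of successful purchases, time between sessions), here is a minimal sketch in plain Python. The function name, its arguments, and the sample values are all hypothetical and made up for this example; a real platform would pull these from its own tables.

```python
from datetime import datetime


def transaction_features(account_created, purchase_outcomes, session_starts, txn_time):
    """Build three candidate fraud features for a single transaction.

    All inputs are hypothetical fields, not a real schema:
      account_created   - datetime the account was opened
      purchase_outcomes - list of booleans, True for each successful past purchase
      session_starts    - list of datetimes, one per past session
      txn_time          - datetime of the transaction being scored
    """
    # Hunch: very young accounts are more suspicious.
    account_age_days = (txn_time - account_created).days

    # Hunch: a longer history of successful purchases is less suspicious.
    successful_purchases = sum(1 for ok in purchase_outcomes if ok)

    # Hunch: very short gaps between sessions are more suspicious.
    ordered = sorted(session_starts)
    gaps_hours = [
        (later - earlier).total_seconds() / 3600
        for earlier, later in zip(ordered, ordered[1:])
    ]
    mean_hours_between_sessions = sum(gaps_hours) / len(gaps_hours) if gaps_hours else None

    return {
        "account_age_days": account_age_days,
        "successful_purchases": successful_purchases,
        "mean_hours_between_sessions": mean_hours_between_sessions,
    }


# Made-up example account, purely for illustration.
features = transaction_features(
    account_created=datetime(2024, 1, 1),
    purchase_outcomes=[True, False, True],
    session_starts=[
        datetime(2024, 1, 1, 10),
        datetime(2024, 1, 1, 12),
        datetime(2024, 1, 2, 12),
    ],
    txn_time=datetime(2024, 1, 3, 12),
)
print(features)
```

Each key in the returned dictionary corresponds to one row of the three-column sheet: the feature itself, with the hunch noted in the comment, and the actual relationship still to be checked against labeled data.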
So right now, let's build the labels. The labels for each transaction are either fraudulent or not fraudulent. And we only know this label if someone files a chargeback for that transaction. So let's say 97% of chargebacks are filed within one month of a transaction occurring. And you can verify this by just querying the data. This means that you can take all the transactions made at least a month ago, that is, 30 or more days old, as your training data set. If they had been fraudulent, you would most likely have seen a chargeback by now.

So overall, the things that we need to do are: brainstorm the potential features for the fraud model, verify whether these features are useful by querying the data, determine the time window within which you can comfortably say a chargeback would have occurred, and query all transactions that are older than that time window. That's 30 days in our case. And then get the corresponding labels for these transactions, and your data set is ready.

Now, the next step is the model setup. So a typical tendency of fraud data is imbalance. We have far more non-fraudulent transactions than actual fraudulent transactions. We could undersample the non-fraudulent data and oversample the fraudulent transactions so that the model learns something meaningful. Sometimes weighting the fraudulent examples more heavily in your model may be useful. You may have to play around with this, though, since it really depends on your data and your objective.

And the final step is evaluating the model. So how good really is this fraud model? For fraud, false negatives are bad. We need to be able to call out fraud when it occurs. But at the same time, we also don't want to call out too many non-fraudulent examples as being fraudulent. We can typically look at ROC curves for a balance. These are plots of true positive rate versus false positive rate. Ideally, the curve should hug the top left corner.
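
To make the labeling window, the class weighting, and the ROC idea a bit more concrete, here is a minimal sketch in plain Python. Everything in it is an assumption for illustration: the function names, the transaction IDs and dates, and the model scores are made up, and the weighting shown is just one simple inverse-frequency heuristic, not the only option.

```python
from datetime import date, timedelta


def build_labeled_dataset(transactions, chargeback_ids, today, window_days=30):
    """Keep transactions at least `window_days` old and label them by chargeback."""
    cutoff = today - timedelta(days=window_days)
    charged_back = set(chargeback_ids)
    return [
        (txn_id, 1 if txn_id in charged_back else 0)
        for txn_id, txn_date in transactions
        if txn_date <= cutoff  # newer transactions may still get a chargeback
    ]


def class_weights(labels):
    """Inverse-frequency weights; assumes both classes appear at least once."""
    n = len(labels)
    n_fraud = sum(labels)
    n_legit = n - n_fraud
    return {0: n / (2 * n_legit), 1: n / (2 * n_fraud)}


def roc_points(labels, scores):
    """Return (false positive rate, true positive rate) pairs, one per threshold."""
    positives = sum(labels)
    negatives = len(labels) - positives
    points = [(0.0, 0.0)]  # threshold above every score: nothing flagged
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(labels, scores) if s >= threshold and y == 1)
        fp = sum(1 for y, s in zip(labels, scores) if s >= threshold and y == 0)
        points.append((fp / negatives, tp / positives))
    return points


# Made-up data: five transactions, one of which received a chargeback.
today = date(2024, 6, 30)
transactions = [
    ("t1", date(2024, 5, 1)),
    ("t2", date(2024, 5, 10)),
    ("t3", date(2024, 5, 20)),
    ("t4", date(2024, 5, 25)),
    ("t5", date(2024, 6, 20)),  # only 10 days old, so it is left out
]
dataset = build_labeled_dataset(transactions, ["t2"], today)
labels = [label for _, label in dataset]
weights = class_weights(labels)

# Hypothetical model scores for t1..t4 (higher means more likely fraud).
scores = [0.1, 0.9, 0.4, 0.2]
curve = roc_points(labels, scores)
print(dataset, weights, curve)
```

In this toy case the one fraudulent transaction gets the highest score, so the curve jumps straight to a true positive rate of 1 at a false positive rate of 0, which is the "hugging the top left corner" shape. A real model on real data will be less clean.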
In some cases, though, true positive rate and false positive rate may be a little too generic, and we would want to make plots of more company-specific metrics.

And that's all I have for you now. Hope this video adds a little more color to dealing with fraudulent data out there. This is just the tip of the iceberg. And remember, fraud is fun once you know how to deal with it. Hope you enjoyed the video. And until next time, bye!