Data is a funny old thing, isn't it? We have data everywhere: on a minute-by-minute basis we're gathering data with the applications we build, or even from the people we talk to — data doesn't have to be on a hard disk. We're generating data just by sitting here: you probably have a smartphone, so Google probably knows exactly where you are, which is a little bit scary. And we do see a lot of examples of data being used badly. Hi, Facebook. But I think machine learning is a very important topic at the moment; I've seen lots more talks on it in the past few years, and lots of people attending those talks. I think that's because of this interest in: we have this data, we know it can be used for evil, for want of a better term — can it also be used responsibly, and for good? In Tebex's case, that means protecting players and protecting server owners.

Before I go any further, I just want to say a quick thanks to the organisers. I think this is my fifth year at PHP UK, and Joe and Sam and the rest of the team do a really good job every year. The staff here are great — they've done this a number of times now and it works really well every year — so personally I'd like to say thanks to them. And if you get two minutes, speak to Sam, Joe, or one of the other organisers or volunteers and just say thanks, because it makes a huge difference.

So, who am I? My name's Liam Wiltshire and I'm the CTO of a small company called Tebex. If you run servers for games like Minecraft, you might have heard of us; if not, you probably haven't. We provide a monetisation platform for community-hosted servers on games such as those. Being a monetisation platform — which is basically a fancy way of saying payment processor — we have an issue with chargebacks. Anyone in the e-commerce industry, anyone who deals with processing payments, always has this issue. So Braintree aren't big fans of our merchants, and Stripe don't exactly like us either. I will say at this point that throughout this talk I'll use the word "chargeback" as a collective term. I'm not necessarily talking about Section 75 credit card chargebacks — I could be talking about PayPal disputes, or similar mechanisms on other gateways and payment platforms. It's that process of saying "I didn't approve this charge, I want it back".

But while I'm saying that chargebacks are bad, and that Stripe and Braintree aren't our best friends, we're not talking about loads of chargebacks, right? It's no big deal. 0.5% — that's not bad. Except that's not us: that's clothing and apparel. Online fashion retailers like Missguided experience about a half-percent chargeback rate on purchases. Half a percent would be nice. 0.56%? No, that's media and e-content. This one scares me a little, because media and e-content includes those "get rich in seven days by buying this e-book" things — you'd think chargebacks would be high there, but that whole category sits at 0.56%. Not too bad. So now we're leaving the 0.5s behind: 0.65%. Again, not too bad, and still not us. And this one I find amusing: that's financial services.
People buying financial advice or investment advice and then charging back — hang on, you know they work in the financial industry, right? They're going to win. But yeah, it's fine. So no: it's not half a percent, it's not 0.65%. Across our network, our chargeback rate is actually about 0.85%. That's not the end of the world, because as long as you're below about a 1% threshold you're kind of okay; get above 1% and people start asking questions. But for 2018, for example, we're still talking about nearly 24,000 payments. So if there's something we can do about this, that would be good.

Now, the kind of thing people often think is: if players are charging back, surely our merchants — the people who use our platform to sell — must be doing something wrong. That's kind of logical, but it's not always the case. Here's a definition of a chargeback from Which? — for anyone who doesn't know, Which? is a consumer association based here in the UK. They say that a chargeback should be used in cases of goods not arriving at all, goods that are damaged, goods that are different from the description, or where the merchant has ceased trading. That's a pretty clear definition: four particular cases where a chargeback is a relevant course of action.

So, in our experience, is this how chargebacks are used? No. We see plenty of examples of players who will charge back just because — they'll actually admit it to us. One of the things we currently do: if a player charges back, we don't share that data in the sense of revealing their specific number of chargebacks, but other web stores across the network can say, "I don't want to deal with players who have a chargeback rating above 10%", for example. We never tell a store "this player has this chargeback rating" — we just silently won't let that person purchase. The stores don't know; we're not sharing the data; but we provide that protection (I'll sketch the shape of that check in a moment).

Then we'll have players come to us and say, "well, I know I charged back, but I got bored of playing on that server". That's not a reason to charge back — that's ridiculous. Or they'll say, "well, I bought this thing and then they banned me". Why did they ban you? "I was cheating." So they banned you because you were cheating, and you charged back because they banned you? This doesn't make sense. And sometimes they'll use it to attack another player, because a lot of our stores allow you to buy for other players: I can say, I've got a good friend online, I want to buy him this rank, or this sword, or whatever — I buy it, he receives the sword, I pay for it, and it's great. But some players will buy something for someone else and then charge it back, just to get them in trouble. It's a bit ridiculous. Clearly these things aren't the server owner's fault, and sometimes they aren't even the fault of the player who ends up punished — it's someone else causing it. So we want to see what we can do to protect the server owners and the honest players.

So we set ourselves a bit of a challenge. We have a lot of data — something like 18 million payment records, last time I checked.
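Just to make that concrete, here's a hypothetical sketch of that silent gate. This is not our actual code — the function and field names are invented — it just shows the shape of the check:

```php
<?php
// A hypothetical sketch of the silent chargeback-rating gate. The names
// here are invented; the point is the shape: the store sets a threshold,
// we never reveal the player's actual rating, and the checkout just
// quietly declines.

function canCheckout(array $player, array $store): bool
{
    // e.g. 3 chargebacks out of 20 payments = 0.15
    $rate = $player['chargebacks'] / max($player['payments'], 1);

    // Silent decline: the store never sees the number,
    // the player simply can't complete the purchase.
    return $rate <= $store['max_chargeback_rate'];
}

var_dump(canCheckout(
    ['chargebacks' => 3, 'payments' => 20],
    ['max_chargeback_rate' => 0.10]
)); // bool(false)
```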
We want to help our server owners as much as possible. Can we use our existing data to predict whether a given payment is likely to be charged back in the future? Can we avoid false positives as much as possible — because ultimately, if you've got lots of false positives, no one will trust the system and it's a waste of time? And it would be nice if we could also provide feedback on why we flagged a particular payment for review. That would make our platform better, make our merchants happier, and actually protect the players as well. How hard can it be?

As I mentioned, I'd been to some machine learning talks before. So when we started bouncing this idea around at work, I thought: I must be able to remember something. Anything. Literally, my brain's not that useless — well, it is, but that's a story for another day. And from somewhere in the depths of my memory, two words cropped up: supervised learning. Now, it sounds great that those words cropped up, but I could not remember what they meant at all. So what did I do? I googled it, of course. It turns out supervised learning was exactly what we wanted, so — hey, look at me, I looked smart. That was awesome.

Supervised learning basically involves giving a learning function a set of training data with known answers. In our case, we have payment data, and for historical payments we know whether each one was charged back or not — whether it's a good payment or a bad payment. For that set of data, we know the answer. You give your learning function that data with those answers, it analyses it, and it then uses what it learned to provide answers for previously unseen data: "we have this new piece of data, we don't know the answer — can you give us one, please?" The opposite of supervised learning is, well, unsupervised learning, which kind of makes sense. That's a little different in that you don't have answers: you hand over the data and say, "dear Mr Learning Function, can you sort this mess out and tell us how this stuff should be grouped, because we don't know." We didn't need to do that — which, I then read, made our lives a lot easier.

When we're talking about supervised learning, we're normally trying to solve one of two problems. The first is classification, which means trying to give something a label: is this thing an apple or an orange? Is this tumour malignant or benign? (Hopefully benign.) The other is what we call regression: given this data set, where along a line would a new point fit? A good example: if I've got data for 50 houses — number of bedrooms, floor space, number of bathrooms, amount of outside space, and the price — I can plot those. Then when I get a new property that I don't have a price for, I can say: four bedrooms, this much floor space, two bathrooms, this much outside space — what price should it be? It uses the data it already has to work along the line and say: it should be £150,000. If you don't live in London. Because we just wanted to ask "is it this thing or is it that thing?", we clearly wanted classification. So that's all good.
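As an aside, that house-price regression idea looks roughly like this minimal sketch using the php-ai/php-ml library (which comes up later in the talk). The figures are invented purely for illustration:

```php
<?php
// A minimal sketch of the house-price regression example, using PHP-ML's
// least-squares regression. All numbers are made up for illustration.
require 'vendor/autoload.php';

use Phpml\Regression\LeastSquares;

// Each sample: [bedrooms, floor space in m², bathrooms, outside space in m²]
$samples = [
    [2,  60, 1,  10],
    [3,  85, 1,  40],
    [4, 120, 2,  80],
    [5, 200, 3, 150],
];
// The known answers: sale prices in pounds
$prices = [110000, 150000, 210000, 350000];

$regression = new LeastSquares();
$regression->train($samples, $prices);

// A previously unseen property: where along the line does it fit?
echo $regression->predict([4, 110, 2, 60]), "\n";
```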
So we know we're doing classification, and we know we're doing supervised learning. It's just the simple job of finding an algorithm. Simple. Honest. There are lots of potential algorithms for machine learning, and lots of algorithms for classification in particular. As a complete aside: if you don't recognise this style of comic, it's from XKCD. They're hilarious, they're well worth checking out, and you'll find I use them a lot in my talks.

One thing we did say to ourselves is that we are not data scientists. I failed maths at school; I don't have a clue. I'm a software engineer, and I wanted to use something simple enough that even a moron like me could understand it. So we literally googled "simple machine learning classification algorithm" and came up with the naive Bayes classifier. If you do any research, this is the one that normally comes up first. It's fairly straightforward, it's based around categorising text, and it's used a lot in things like spam filters and language detection. So that's where we started.

Effectively what it does is say: given this block of text, what's the probability that it belongs in category X versus category Y? It's a fairly simple algorithm. You take your block of text and split it into individual words. You standardise the words where possible — that might mean removing plurals, or, in a language like French where adjective endings change with the gender of the subject, standardising those. Then you literally loop through every single word and work out the percentage of times that word appears in each category. The word "ballon", for example, might appear 60% of the time in the English category in our test data and 40% of the time in the French one, because in French it means "ball". Thanks for confusing things. You go through every word, give each one a probability, and then say: for each category, what is the average probability across all the words we've seen? And then, on the balance of probability, this block of text is likely to be French, or likely to be English, or whatever the result might be.
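For the curious, the word-probability loop I've just described looks roughly like this. It's a sketch of the idea rather than the actual demo code, and it assumes a $trainingCounts array of per-language word counts:

```php
<?php
// A rough sketch of the naive Bayes loop: for each category, average the
// per-word "how often does this word belong here?" percentages, and pick
// the category with the highest average.

function classify(string $text, array $trainingCounts): string
{
    // Split the block of text into individual, standardised words
    $words = preg_split('/\W+/u', mb_strtolower($text), -1, PREG_SPLIT_NO_EMPTY);

    $scores = [];
    foreach (array_keys($trainingCounts) as $language) {
        $total = 0.0;
        foreach ($words as $word) {
            $seenInLanguage = $trainingCounts[$language][$word] ?? 0;
            $seenEverywhere = 0;
            foreach ($trainingCounts as $counts) {
                $seenEverywhere += $counts[$word] ?? 0;
            }
            // Percentage of times this word appears in this category
            $total += $seenEverywhere > 0 ? $seenInLanguage / $seenEverywhere : 0;
        }
        // Average probability across all the words we've been through
        $scores[$language] = $total / max(count($words), 1);
    }

    arsort($scores); // highest average probability wins
    return array_key_first($scores);
}
```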
100%. That's not how effective it is — I wish it were, because then this would be a really short talk. No, 100% is the percentage of my previous talks in which I haven't done a demo. In fact, whenever someone does a live demo — and there may be speakers in the room who can attest to this — I go up to them afterwards and ask: why would you do a live demo? I don't get it. Guess what: live demo time. This could be horrendous.

Right, it's running — good start. This is a fairly simple language predictor. Let's make the text a little bigger, shall we? There we go. There's no training data at the moment — hence, given this piece of text in French (I'm not even going to try to mangle my way through it), it doesn't know what language it is. Each time I tell it what the language is, it adds that text to its training data. So this first one I'm marking as French, typing F for French. That one is German, so I'll type D for Deutsch. This one is English, so E; English again; and hopefully, in a minute... there we are. It's come up with English — and granted, we've already trained it on this one, so it's fine: it's looked through those words, which are now in its training data, and gone, "oh, okay, I recognise those; the probability is that these words belong in a piece of text that is English." And we say yes, that's correct. This next one it hasn't seen — "don't throw good money after bad" — but some of those words were in the other data we used to train it, so that's English. And this one is French; it doesn't know that one yet. And so on — I could keep going round, and it's recognised that one as French again. As you give it more training data, it becomes more accurate. That's the naive Bayes classifier in a nutshell. Thank you and good night.

Not so much good night. That's all well and good — that very quick demo we hacked together worked, and we went: oh, this makes sense. Awesome. But we're not dealing with text. What if we're dealing with numbers? Actually, it's quite straightforward: in that loop I described earlier, if you replace the words with tokens, you're doing the same thing — but now a token can represent anything. I mean, we use tokens in PHP: everything we write in PHP is turned into tokens, they represent certain things, operations happen, and everything is great. This is the same idea.

So here's an example describing fruit. Okay, fine — "red", "round", "green" and "crescent" are words anyway, so you could just leave those as words. But let's assume that in our test data we just have binary columns for "seed" and "stone". We couldn't just feed in a one or a zero, because the classifier wouldn't know what they related to — it wouldn't know that for the apple, the one meant "has a seed" and the zero meant "doesn't have a stone". It would just see a one and a zero, with no context. But by prefixing them — colour:red, shape:round, seed:1, stone:0 — we're saying it's red, it's round, it has a seed and it doesn't have a stone. Now we can stick that into the algorithm and it effectively understands the context (I'll sketch that step in a second).

So that's what we did: we took all the data we wanted to use and prefixed each value so that every token was unique. Even if the same number came up in two different contexts, or the same word — say we happened to have a gateway called USD and a currency called USD; plain "USD" wouldn't tell you which was which, but a prefixed token would. Within a library called PHP-ML, which we'll come on to in a little bit, there is actually a tokenizer for this sort of thing. There's a really good reason we didn't use it: we didn't know it existed.

Something interesting on this particular line: you'll notice there's a token called sigfig — significant figure. There's a reason for this: we googled something, read it, and they said it was a good idea, so we did it. There's a thing called Benford's Law, which suggests that in financial transactions certain leading significant figures occur more frequently than others. I couldn't tell you which ones, because I can't remember. It was a big paper; we understood half of it and went: yeah, that sounds good, let's put it in. See what happens. Experiment.
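The prefixing step itself is trivial — something like this sketch, where the field names are illustrative rather than our real schema:

```php
<?php
// A sketch of the token-prefixing idea: prefix each value with its column
// name so the classifier gets context, even when raw values collide
// (e.g. a gateway called "USD" and the currency "USD").

function toTokens(array $payment): array
{
    $tokens = [];
    foreach ($payment as $field => $value) {
        $tokens[] = $field . ':' . $value;
    }
    return $tokens;
}

print_r(toTokens([
    'country'  => 'DE',
    'gateway'  => 'paypal',
    'currency' => 'USD',
    'sigfig'   => 2,   // leading significant figure, à la Benford's Law
]));
// => country:DE, gateway:paypal, currency:USD, sigfig:2
```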
Notice also how on the first line we have country and gateway as two individual tokens, while on the third line we have a combined country-gateway token. That's because in the naive Bayes classifier, the "naive" part refers to there being no assumed relation between the individual tokens: there's no sense of "because this value is X, that other value being Y should be seen as odd". Every word, every token, is taken in complete isolation. Obviously, in our dataset that might not hold. A gateway such as iDEAL, for example, is used a lot in mainland Europe and not at all in China. So if someone from China was using iDEAL, you'd want to put a question mark against that — but with two individual tokens, nothing would say that combination is odd. So we added the combined token, and then we ran a test.

For most of the tests we did, we used our 2018 data as training data and picked records at random from 2017 to test against. In this instance we took 100 records of known good payments from 2017 and 100 rows we knew were charged back in 2017 — 200 rows. So far so good: there were no false positives. It didn't look at any of the good payment records and say "I think this is a chargeback" — and bearing in mind one of our main aims was avoiding false positives, that's really good. It didn't identify any fraud either, though. Yeah. This went well. What that means is that every single record — even the 100 we know were chargebacks — it thought was fine. That wasn't a good day. I went home slightly sad. Moral of the story: don't base everything you do on one Google search. Do at least two.

We didn't really know what the problem was to start with, because we'd done one Google search. We had to do some further research, and we learnt our first important lesson: imbalanced data is the enemy. Many machine learning algorithms — not all, I'll caveat, but many — are sensitive to an imbalance in the data. If one category simply has more data than the others, there will be a natural bias towards that category. In our instance we have 2.8 million good payments and about 24,000 chargebacks. So it's a little bit imbalanced. A tiny bit.

And it kind of makes sense when you think about it. Think about fruit for a second. Some apples are red and some apples are green. Most cherries are red. So far, we agree with that. If, however, you have a set of training data with 50,000 apples and 500 cherries, then according to your training data the word "red" is many, many times more likely — I told you I suck at maths — to belong to an apple than a cherry. Even though, if something's red, it's maybe only a 50% chance it's an apple, and arguably more likely it's a cherry, our imbalanced data will never come to that conclusion. Right? So in an odd way, our first algorithm was actually really accurate: technically it was over 99% accurate, because if we gave it our entire set of payment data it would correctly identify over 99% of it — since over 99% of it is good payments. Completely useless, but accurate.

So you have to come up with ways of fixing this. There are some solutions; I guess the obvious one is to collect more data.
Now, that's not really going to work for us, because we can't exactly email people saying "hey players, start charging back now" — I'd be out of a job if I did that. You could look at changing your metrics: as I mentioned, the classifier we're using is based on probability, so we could lower our threshold and say, even if the highest probability says it's a good payment, if it's over 35% likely to be in the fraud category, we flag it. Or the other option is to resample your data, which has two flavours (sketched in code a little further down). You can undersample the over-represented data — in our instance, pick just 1% of all our good payments for the training data — or you can oversample the under-represented categories. In other words, you change the weighting: you make one vote in the fraud category worth 99 times as much as a vote in the non-fraud category, or you replicate records, so you end up with the same effective number of records in each category and an even weighting.

As I said, we couldn't realistically collect more data — asking people to create more chargebacks would be pretty stupid — so we mixed resampling with changing our metric. We upsampled the fraud records to generate more data, and we lowered the decision threshold on the fraud label, regardless of its raw probability, just to see if there was any kind of pattern. Then we ran another test, again pulling 100 records of each. Now we've started moving the needle: it's no longer classifying everything into one category. What we've got is, frankly, absolute bollocks — excuse my language — but it's done something, and that's the definition of progress: it's broken differently to last time. The accuracy is about 50%, which is terrible. We looked at the fraud-probability idea, but given that we were already flagging 71% false positives, we decided that was a waste of time and gave up on it quite quickly.

It was quite frustrating. We were sure there must be something we were missing; we had no idea what, but it had to be something, right? We played with the algorithm a bit more and weren't getting anywhere, so after a bit more googling — because everyone googles everything, right? — we started to question the approach we were taking, and we realised we needed to understand our data better. There were a number of flaws in that initial approach. We had recognised that there were relations between tokens — that's why we combined some — but perhaps those relations were more important than we'd expected. For example, if someone pays $30 for a purchase and they're based in Germany, where the average purchase price is about $25 anyway, that's not that suspicious; it's within the realms of possibility. In Argentina, however, the average purchase price is about $8, so a $30 purchase there is over three times the average — should we be looking at that more closely? Likewise, because high-value purchases are more common in certain places, if a low-value purchase came from one of those, perhaps we need to look at that as well. We don't know whether these are the right answers, but these are the sorts of questions you have to start asking.
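To make those two resampling options concrete, here's a toy sketch, assuming a simple array-of-rows training set — not our production code:

```php
<?php
// Toy sketches of the two resampling options for imbalanced data.

// Undersample: keep only a random fraction of the over-represented class.
function undersample(array $majorityRows, float $keep): array
{
    shuffle($majorityRows);
    return array_slice($majorityRows, 0, (int) ceil(count($majorityRows) * $keep));
}

// Oversample: repeat the under-represented class until the counts match.
function oversample(array $minorityRows, int $targetCount): array
{
    if ($minorityRows === []) {
        return [];
    }
    $out = [];
    while (count($out) < $targetCount) {
        $out = array_merge($out, $minorityRows);
    }
    return array_slice($out, 0, $targetCount);
}

// e.g. balancing ~2.8m good payments against ~24k chargebacks:
// $good  = undersample($good, 0.01);
// $fraud = oversample($fraud, count($good));
```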
And do we need to consider context? What I mean by that is: if there's a store where the average item price is $3, a $35 purchase looks a little odd, because you'd have had to buy probably 11 or 12 different items to get to $35. But if there's a store that actually sells a rank costing $35, then a $35 purchase is much more expected.

The other thing we started thinking about was price itself. In the algorithm we were using, a price is treated as a word, so there's no sense of scale, no continuous measure: a price of five euros is a different token from a price of five euros and ten cents, even though we know the difference between the two is negligible. In that algorithm they're treated as two entirely separate entities. So perhaps we needed something that supports continuous data as well as discrete data.

So we started looking for a new algorithm — the old one wasn't working for us. We still didn't want anything too complex; our rule was still that we wanted to be able to understand it. We didn't want to find some fancy algorithm, punch some data in, have it work, and have no idea why. That's the worst thing you can do: if you don't understand why something works, at least to a degree, then when you get problems you'll really struggle to understand why it isn't working. We were still looking at supervised learning, obviously. We did a bit more research and came across an algorithm called k-nearest neighbours.

KNN is fairly straightforward to understand. Basically it asks: which K — and K is a number — other results are most similar to this one? I have this lump of data; which records in the training data does it most closely align with? It met a lot of our criteria. Because it's based on distances, it can handle continuous data, and there's a level of association between the different data points: if one value is further out, it makes the whole distance longer, while if they're all closely packed, the overall distance is shorter. It's not a causal relationship — it's not saying "if this value is X then that must be Y" — but every value stretches or shrinks the overall distance. As a side benefit, it's less sensitive to a data imbalance, because it looks at the local neighbourhood — though if the imbalance is large enough, there's still an issue.

Now, the easiest way to explain KNN is with a graph — we all like graphs. Hopefully it can be seen on the monitors; I'll explain it anyway. This represents our training data: a cluster of blue squares, which is set A, and a cluster of white circles, which is set B. If we add another point, represented by a magenta triangle, we're asking: which category does this most closely align with — the white circles or the blue squares? If we set our value of K to 1, we're looking for the single nearest neighbour, and that's it — it's the white circle right there. If we say 2-NN, so K is now 2, we take the two closest neighbours. Incidentally, this is a good example of why we tend to avoid even numbers for K: you'd need some form of tiebreaker rule, because right now it's a draw — one neighbour says white, the other says blue, and you don't know. So you either have a tiebreaker or, much more easily, you just use odd numbers — far more straightforward.
So we're going to use an odd number: K is 3, so 3-NN. The closest was that white circle, then a blue square, then another white circle. Based on that, which category is it most closely aligned to? The white circles. That's KNN in a nutshell. This example works on two axes — two dimensions — mainly because it's on a graph, but it works in more or less any number of dimensions; it doesn't matter. There's no pseudocode to explain this one, for a good reason: I don't understand the internals that well. So we're just going to look at an example.

Thankfully, as I mentioned before, there's a machine learning library for PHP called PHP-ML. They do loads of cool stuff around AI, machine learning and implementations of algorithms, so that you don't have to. This is a fairly straightforward example: we set up some training data — four samples labelled A and four labelled B, where each one is basically a set of grid coordinates in four dimensions. We train it with those samples and labels, and then if we give it some test data — a new set of coordinates — and ask which category it belongs in, the library does all the heavy lifting underneath. That's all you have to do. It's awesome.

Yeah, I still don't like live demos. Not a fan. Definitely not doing another one. Here we go. This is literally the implementation you just saw, with exactly the same training data. I can now just give it four coordinates: let's go with 2, 3, 5, 1 — and it says it's A. If I do 5, 2, 1, 1, hopefully it'll say B — and so on. That's it. Simple. If I give it something completely random — who knows, let's just try it, shall we? — that's an A. I don't know why, but apparently it is. And this in itself is an important point: if you're using data that you don't fully know — and we don't; we have 18 or 19 million payment records, and I don't know all of that data — first try your implementation on something where you can predict the output, to make sure it works, even if you're using a third-party library. Give it a go with something nice and simple, so you can say "okay, I get that this works", and then you're more confident about plugging your own data in.
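The demo is more or less this minimal sketch, using PHP-ML's KNearestNeighbors — the actual coordinates here are invented for illustration:

```php
<?php
// A minimal sketch of the PHP-ML k-NN example: four samples labelled 'a',
// four labelled 'b', each a point in four dimensions.
require 'vendor/autoload.php';

use Phpml\Classification\KNearestNeighbors;

$samples = [
    [1, 3, 4, 2], [2, 3, 5, 1], [3, 4, 4, 2], [2, 4, 5, 2], // the 'a' cluster
    [5, 1, 1, 1], [6, 2, 1, 2], [5, 2, 2, 1], [6, 1, 2, 2], // the 'b' cluster
];
$labels = ['a', 'a', 'a', 'a', 'b', 'b', 'b', 'b'];

$classifier = new KNearestNeighbors(3); // 3-NN: odd, so no tiebreaker needed
$classifier->train($samples, $labels);

echo $classifier->predict([2, 3, 5, 1]), "\n"; // a
echo $classifier->predict([5, 2, 1, 1]), "\n"; // b
```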
You might have spotted something about the data samples in that demo: they're all numeric. Which is another important lesson: the KNN algorithm, because it's based on distance, likes numbers. That kind of makes sense — you can't really measure the distance between the word "ham" and the word "cheeseburger". Yes, you could count the character changes, and that's a distance of sorts, but it wouldn't make sense for what we're doing; we're not looking for typos. So our data isn't numeric, and we needed to find a way of making it so.

First of all we thought: oh, this is fine, our database is fairly well designed — most of the time — so things are normalised and we have IDs for this sort of thing: a country ID, a currency ID. But that doesn't necessarily work. If you're measuring distance, you're implying that the distance between country ID 1 and country ID 2 is less than the distance between country ID 4 and country ID 50. In our database, for example, the distance between Australia and Austria would be 1, because countries are stored alphabetically, so the IDs are only one apart; but the distance between Australia and New Zealand — which are geographically much closer — is 160. That's not going to work; it's going to skew the results.

Now, sometimes this isn't a problem. If you've got data that's nominal but sequential — it might have names or words, but there's an inherent order — you can substitute values. School grades, say: A, B, C, D, E. Or take this example of a really old-fashioned company with a rigidly structured career ladder: interns, then non-management employees, line managers, department managers, executives. You have to estimate — there isn't a fixed number — but you can say, within our organisation, we think the gap between each of those levels is worth so much (an executive would probably disagree). And with school grades, you can just assign numeric values.

If you've got non-sequential data, it's not quite so straightforward. For countries, we thought about using latitude and longitude, because they're coordinates anyway. But where do you measure from? The centre of the country? The gap between the centre of the USA and the centre of Canada is quite large — much bigger than, say, the UK to Germany — yet someone could be on either side of the border at Niagara Falls, where the distance they've travelled is negligible and they might have just picked up a different Wi-Fi network. So that doesn't really work. And it wouldn't work at all for things like currencies or gateways: how do you score those? By order of popularity? What if the popularity changes — you can't then change those numbers. At the moment PayPal is the most popular, but if it ever stopped being the most popular, you couldn't swap the values around, and you're still adding an implicit bias.
So the simplest solution we found — i.e. someone on Google told us — is to make everything binary. Everything becomes a yes-or-no, on-or-off, one-or-zero question. Rather than asking "what is the country?", you ask a series of questions: is the country the US? Is the country Great Britain? Is the country France? Is the country Germany? And so on, and so on. It does result in a large number of dimensions, obviously, which can have its own issues — with large numbers of equidistantly spaced dimensions some weird things can happen — but we'll cross that bridge if we get to it. And obviously, if there are lots of dimensions, you wouldn't want to generate this data by hand; that would be a really bad idea.

The other thing you need to be careful of is normalising the data, because if you don't, your scales can be all sorts of out. Imagine a two-dimensional graph of length versus is-it-red. Something that is red and 10 centimetres long versus something that is red and 20 centimetres long gives a distance of 10 — even though they could both be chillies — whereas a 5-centimetre orange and a 5-centimetre chilli, one red and one not, gives a distance of only 1. So it's going to say the orange and the chilli are more closely associated, purely because the scale on the length dimension is so much bigger. Does that make sense? Some nodding? Good, I'm glad.

So you probably want to normalise things. You can even use that to your advantage, to a point: if there are particular dimensions with a stronger weight in predicting whether something is a chargeback, then instead of normalising everything to a scale between 0 and 1, or 0 and 10, you could normalise that particular dimension between 0 and 15 so it carries slightly more weight than the other dimensions. You can play with those scales to your advantage, to a degree.

So let's say we've got an example like this — I like fruit, you might have noticed. At the moment we've got data in this form: a length in centimetres (height, I guess, if it's an apple), then a colour and a shape. First of all, we go and collect up all the potential values — what are all the possible colours? — and work out the maximum length of any fruit, so we can normalise everything to the same scale. Then we go through all our data and literally ask a series of questions — is it round? Is it crescent? Is it green? Is it red? — and assign a 1 or 0 to each. We also divide every length by that maximum length, so the longest fruit is 1 and the shortest is close to 0. We've standardised all our scales, and we now have a series of data that looks like some really, really messed-up coordinates. But it works: literally a length, then is-it-round, is-it-crescent, is-it-red, is-it-green, is-it-yellow, is-it-orange. Everything that was previously words is now numbers.
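That one-hot-and-normalise step, sketched in code — the column values match the fruit slide, but the function itself is illustrative:

```php
<?php
// A sketch of the one-hot-and-normalise encoding for the fruit data.

function encodeFruit(float $lengthCm, string $shape, string $colour, float $maxLengthCm): array
{
    $shapes  = ['round', 'crescent'];
    $colours = ['red', 'green', 'yellow', 'orange'];

    // Normalise length so the longest fruit is 1.0
    $row = [$lengthCm / $maxLengthCm];

    // One binary "is it X?" dimension per possible value
    foreach ($shapes as $s) {
        $row[] = $shape === $s ? 1 : 0;
    }
    foreach ($colours as $c) {
        $row[] = $colour === $c ? 1 : 0;
    }
    return $row;
}

print_r(encodeFruit(5, 'round', 'green', 20));
// [0.25, 1, 0, 0, 1, 0, 0] => length, round?, crescent?, red?, green?, yellow?, orange?
```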
You know what, if I'm going to do one demo I might as well do three, right? Yeah, I thought this was a good idea. So here we go: another demo. Again, this is all the same data — I've just printed it out so I can remember what it is while I'm looking at this. For example, if we've got something that's 5cm, round and green, we know it's probably going to be an apple — but let's just make sure. Yes, it's an apple. And again, this is that point about using predictable data first, to make sure you've not screwed something up horribly somewhere — as we did the first, second and third times. So let's go 6cm, round and orange: it's going to be an orange. Incidentally, to try and throw the data off, I included some blood oranges but told it they were red — I know technically they're not red, but just bear with me; hopefully it's clever enough to cope. So if I give it something bigger than an apple — 9cm, round and red... I should have tested this first... good, it worked. So there we go. It's taking the words we provided, "round" and "red", and converting them — it is round, it's not crescent; it is red, it's not green, not yellow, not orange — into that same set of coordinates we converted our training data to. So that's pretty cool.

So this is what we did with the real data: we took the same set of dimensions as before, converted them all into binary, and normalised the prices and so on. The test data ended up at about 200 dimensions, I think, so it's quite big. Again we ran the test with 2018 providing the training data and 2017 providing the test data. And it is better — we're getting there. The accuracy is still not brilliant, about 62.5%, but the false positives have come right down: bear in mind that on that earlier test the false positives were 72, I think it was, and they've dropped to 42. That's really, really positive — we're definitely moving in the right direction. But we're obviously not there yet: if I said to my CEO "hey, let's put this into production", I wouldn't have a job. So we were still asking: what can we do to make this more accurate? It seemed to be moving in the right direction, so we were going to stick with it, but we needed to understand why it was so hit and miss.
And then we struck upon — i.e. someone told us — another important lesson: data without context isn't very useful. We had already touched on this, because in that very first approach we'd put the country and the gateway together as one token — we knew there had to be some interrelation. But when we got all excited about KNN solving all our problems, we completely forgot about it. A bit stupid of us, really.

So we started thinking that perhaps we needed to look at that in more detail. Consider two stores again: one where the average price is $3 and the highest-priced item is $5, and another where the average price is $30 and the highest-priced item is $50 — yes, it makes it much easier if you just multiply by ten. If someone on the first store tried to check out a basket worth $50, you'd probably think that seems odd. It could be that there's a bug — it's probably a bug — but something's going on, and you'd hope your machine learning would flag it. Likewise, if the average price of things is $30 and someone buys something really cheap — they've dug out the cheapest item in the store and bought one of it, when nobody else does that — again, that would be odd. And there's another consideration: take a player who makes all their purchases from France, with an average price of $20, and all of a sudden they make a purchase worth $40, based in the US. Same player, but suddenly they've moved and they're making more expensive purchases — and I can tell you, if you've just moved, you can't afford anything. So you'd be asking: has that person genuinely moved from France to the US, or is someone playing silly buggers with their card? Probably the second. You have to be able to apply this kind of context, which you can't do when you're dealing with data on a global basis.

So we realised that what's normal for one store might not be normal for a different store, and we needed to start asking two questions of each payment: is this normal for this store, and is this normal for this player? It means building a lot more different sets of training data — boo — but hopefully, if it's more accurate, then I look really clever. There are issues with this that we'll come on to, but we wanted to start down this road, and it's the thing we're working on at the moment.

So we created a custom test data set for one store. We weren't going to build data sets for every store to start with; we wanted to check our expectations first — again, using data you can understand to test things out before you spend hours building everything. We took that store's chargebacks in 2018, quadrupled them, took 5,000 good payments at random, and ran exactly the same test — still the KNN algorithm we used before, everything done the same way — just to see if it was any more accurate. And again, it's not perfect, but the point is that every time we're moving a bit closer. The false positives are still high, but they're coming down, and we actually did a really good job of identifying the fraudulent transactions. Overall accuracy is about 70%, so we're progressing. We've not finished with this yet — that's about as far as we've got with the per-store work.
We've got a number of things we want to try. Weighting different dimensions — I've already mentioned this: if you've got certain dimensions that you think, or can prove, are a better indicator than others, weight those more strongly. Setting different values for K: we've used 3-NN in all our tests, but perhaps by taking a bigger sample — 7-NN, maybe 9-NN — we'd get a better result. Removing outliers: that's something we're actively working on. Particular outliers can skew your data, and you can actually use KNN itself to identify the outliers in your training data, remove them, and make the training data more representative. I'm working on that at the moment; right now it doesn't work, but I've been told you can definitely do it. And weighted distance, which we're particularly interested in. Imagine you're doing 3-NN and your point is here — I'm pointing at empty space now, which is really unhelpful — with one neighbour right next to it, but the next two closest are in the other category, way over there. In a standard KNN implementation every node has the same vote, so the point will be associated with the two that are further away rather than the one next door — which might be right, but probably isn't. There are modifications you can make to KNN so that a closer node's vote is worth more than a distant node's. We think that's going to make it significantly more accurate.

But we also wanted to try per-customer detection. Now, this is a very narrow context: some customers might only have ten purchases, and they may never have charged back, so classification isn't going to work — you can't upsample data that doesn't exist. (Again, XKCD is quite amusing — statistically significant boyfriends? I don't know.) So we had to change our thinking. Instead of classifying, we're now trying to look for something that's dissimilar to the normal — an outlier, an anomaly. There are other algorithms that will do this, but we'd got used to KNN, we understood how it works, and it's fundamentally based on distance. So we thought: could we use distance to identify an anomaly? If all the training data belongs to one category, then yes, the predicted result will always be that category — but if the distance is 69 billion, it's probably not actually that related.

So we started playing with this idea. And we definitely didn't edit the vendor directory. Honest. Well, okay, we did, a little bit. We edited the vendor directory and added an extra bit to work out the average distance: it takes the neighbours the point was matched against, adds their distances together, divides by the number of nodes, and says "the average distance for this point you've asked me to predict is X". The theory goes: a bigger distance means less related; a smaller distance means more related. Again, we wanted to test this on known data before just plugging our own data in — otherwise we'd have had no idea what the numbers meant.
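Here's the idea behind our vendor-directory hack, sketched as standalone code rather than the actual patch: find the k nearest training points and report the average distance to them, so a big average means "unlike anything we've seen":

```php
<?php
// A sketch of "average distance to the k nearest neighbours" as an
// anomaly score. Plain Euclidean distance, no library.

function averageKnnDistance(array $point, array $trainingSamples, int $k = 3): float
{
    $distances = [];
    foreach ($trainingSamples as $sample) {
        $sum = 0.0;
        foreach ($sample as $i => $value) {
            $sum += ($value - $point[$i]) ** 2; // squared difference per dimension
        }
        $distances[] = sqrt($sum);
    }
    sort($distances);                        // closest first
    $nearest = array_slice($distances, 0, $k);

    return array_sum($nearest) / count($nearest);
}
```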
So we ran the fruit through it again, but this time — you'll see — we're spitting out the total distance, not the average distance; apologies. In this one, we know it's going to say banana, because it's longer than an apple or an orange and it's a crescent; but we've told it it's red. So we know it's going to say banana, but it shouldn't be that sure — and it's returned a total distance of seven. Now, we have no idea what that means. Is that good? Is that bad? I don't know, so we carry on. Next, we say it's the longest banana we've ever seen, and it's still banana-shaped — still a crescent — so it will still end up in the banana category, but it's also green, so we're not going to eat this banana because there's something wrong with it. And the total distance has gone up a little more: it was 7.0, now it's 7.7. So we're seeing that the less normal we make it, the bigger the distance gets — but we still don't really know what "normal" means. So now we do one that's slap bang in the middle: 12cm, crescent, yellow — and the total distance is 0.12. We've now validated our assumption: the bigger the distance, the weaker the relationship. Now we just need to apply it to our data. That can't be difficult, right?

Obviously, we had to identify some players who had both good payments and chargebacks. We kept all the chargebacks to one side so we could use them to test distances. The other thing we had to do, obviously, was validate our assumptions: from the pool of good payments, we withdrew a couple and held them back to test against the training data, to make sure those distances were indeed smaller — and that we hadn't just come up with something that happened to work on fruit. Not being fruit pickers doesn't help.

So this is what we did. For a particular player, we grabbed all their payment data, took a couple of good payments out, stuck the rest in as training data, then tested against the chargebacks and the remaining good payments, just to see what the values would be. So here's one, and it's a pretty good start: the good payments we tested all had an average distance of around 1.4 to 1.6, and we can see that the chargeback payments are quite clearly much higher. This one's a little closer: we've got a chargeback at 0.8 and a good payment at 0.8 — the chargeback is still larger, but obviously the gap between them is much smaller. Still, there's a clear line between the goods and most of the chargebacks. This one? Okay, it's not so clear-cut, so we know it's not going to work in every situation: there's one good payment with an absolutely minuscule average distance, but there's this other good payment that overlaps the chargebacks. It's not going to fix everything, and that's okay. Here, again, there's a much clearer delineation, and the values in general are much smaller — there are a lot of nought-point-whatevers — with the good payments showing really, really small distances, basically because the player made almost identical payments multiple times. And again, there's a fairly clear delineation here.

So we can see that in many situations the distance does correlate: a chargeback tends to be more unlike the player's other payments than a good payment is, if that makes sense. But what's also clear is that we can't just pick an arbitrary number and say "this is the line". Say we picked 1: that would be great for most of these — everything's above 1 — but on this chart we'd miss most of the chargebacks, even though they were still clearly different from the others. So we can't pick a single value; we basically have to define the line for each player.
The way we did that: we calculate the average 3-NN distance across the known values. With the training data we're given, we loop through every single training point and work out its average 3-NN distance from its neighbours, then take the average of those, so we can say "the average distance between the nodes in our training data is this". We then work out the standard deviation — because something something statistics — and add the two together. Anything beyond that gets flagged. So the delineating mark is different for every single player (I'll show that recipe in code in a moment). We know, even just looking at the data from before, that this is going to miss some chargebacks, but that's okay: our target isn't to reach chargeback zero. For a start, not all chargebacks are fraudulent — some are genuine — so we're never going to catch those. Even so, what we would catch would relate to nearly 5,000 payments over 2018, so that would still be a pretty big win in itself. And that's an important point: know what you're trying to achieve. If we wanted chargeback zero, this approach would never work; we'd need something way more complicated, probably way more expensive, and it probably wouldn't be worthwhile. If we just want to reduce chargebacks by a proportion, we can accept those trade-offs.

So we tested with about a thousand payments in total: 366 good payments and 786 chargebacks. We identified around a hundred players that we knew had both chargebacks and good payments, pulled all their data out, hived off three of each — I think it was — as a kind of control group, and plugged the rest in as training data. It identified 85% of good payments successfully, which is really positive: a 15% false-positive rate is definitely much more where we want to be. If you have 50% false positives, people will ignore the data — "you're flagging so many legitimate payments as fraud, this doesn't make sense, I give up" — and they won't use it. You have to be able to convince people that when it flags something, they need to pay attention. It only identified 31% of chargebacks, but that's still 31% that we're currently not catching, and that would still have been a reduction of about 7,000 payments in total over 2018. That's a big reduction; I'd be more than happy with that as a starting point.

In short, we've made progress. This is another really good comic, from CommitStrip, about life in a digital agency — you may not be able to see it, but it amuses me. The developers are celebrating; the CTO comes in and asks what's going on, and one of them says, "we've been stuck on this bug for two hours and he's fixed it!" That's awesome. And they go: "no, the bug's still there — but the error message is different." That's progress. So we are totally not at the end yet. However, if I were to say to my CEO, "I want to put something like this into production; we know we're only going to identify a small percentage of chargebacks, but we can say with relative confidence you might get a few legitimate payments flagged" — well, that happens. Whose bank has texted them saying "there's been a card transaction we don't recognise" when they'd just done something unusual? It happens with humans too. But this is much closer to something we can actually use.
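Here's that per-player flag line written out as a sketch — it reuses the averageKnnDistance() helper from the earlier sketch, and it's the recipe as described rather than our exact code:

```php
<?php
// The per-player flag line: average each training point's 3-NN distance
// to its neighbours, then add one standard deviation. Anything with a
// bigger average distance than this gets flagged for review.

function flagThreshold(array $trainingSamples, int $k = 3): float
{
    $averages = [];
    foreach ($trainingSamples as $i => $point) {
        $others = $trainingSamples;
        unset($others[$i]);              // a point is not its own neighbour
        $averages[] = averageKnnDistance($point, array_values($others), $k);
    }

    $mean = array_sum($averages) / count($averages);

    $variance = 0.0;
    foreach ($averages as $a) {
        $variance += ($a - $mean) ** 2;
    }
    $stdDev = sqrt($variance / count($averages));

    return $mean + $stdDev;
}
```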
There are still things we're going to try: the weighted-distances idea, which we think is probably a really good thing to experiment with, and removing outliers, which we're actively working on at the moment — though it takes quite a long time to churn through the data. We think we can make it more accurate, but actually we're quite pleased with the progress we've made so far.

So for Tebex, this is still very much the beginning. We set out our aims early: we knew we weren't going to catch everything, and that it was more important not to flag false positives. That was really our big goal — identify some chargebacks without raising bad flags at the same time. We've redefined those aims slightly as we've gone along, moving from "use all the data" to "look at it per store", but having an aim in mind means you can focus what you're trying to do, rather than just throwing algorithms at stuff and seeing what happens.

Machine learning is hard. We've tried to use techniques that we understand, because it means that when stuff doesn't work, we have half a chance of knowing why. Sometimes that took a bit of googling — most of the time it took a bit of googling — but because we have some grasp of what's going on, we can say "I understand why that doesn't work" or "I understand why this does work", and then change how we're doing things — our data choices, our dimension choices, whatever it might be — to improve the results. We might consider more complex algorithms in future, but ultimately, if you can understand why something works or why it doesn't, that's half the battle.

As I've said a couple of times: try it with simple data first. If you've got a lot of data that you don't necessarily fully understand, test your implementation on something you do understand — something as silly as fruit. It seems ridiculous, and you're right to say that a thing that classifies fruit is of no use, but it meant I knew our implementation worked, and therefore, if we didn't get the results we wanted, it wasn't the fault of the implementation: either we'd picked the wrong algorithm or the data wasn't any good. So remove those technical question marks first.

We've learned a lot doing this, all by experimentation. As I said, I'm not a mathematician, I'm not a data scientist, and I wouldn't claim to be in a million years — the fact I won't be alive in a million years helps. So just try things out. Stuff will work, stuff won't, and you learn a lesson each time. We went from that test where everything was classified as not fraudulent to something we can now try to integrate into our production application on a limited scale, and then we'll carry on iterating and improving. So — thank you. I think we're literally about at the hour, but there's time for a couple of questions. Oh god, they've written stuff down.

Q: You talked about the KNN algorithm, which is basically making decisions on ones and zeros — yes or no. It sounds a bit similar to the decision tree algorithm. Is it?

A: I guess it is, in that sense — except that with a decision tree, as I understand it (and this is where I show that I know nothing beyond what's on the slides), you're going down a path. As I said, with KNN there is a relation between the different dimensions, but it's not "if you go down this path you have to carry on down this path": each dimension just moves the distance back and forward,
whereas in a decision tree you're branching: you can't suddenly move across, because you've already made a predetermined decision further up. But yes, I guess it works in a similar way.

Q: How did it work performance-wise? How much data did you have to use for training, and when you had to make a decision, how much time did it take?

A: Because we were downsampling our good data on the global tests, the training took about two hours — though a lot of that, to be fair, was fetching the data, because we had to pull it down from the database and then reformat it. You couldn't do that in real time. What you'd have to do is run a long-running daemon with the training data already plugged in, just listening for "hey, can you give me a decision on this payment?" So you do the training once and leave it as a long-running process — in PHP, or Go, or whatever — and then you have an API endpoint that takes a payment and gives you an answer back, without retraining every time. (There's a sketch of that shape right at the end.)

Q: You mentioned that when you were training, you originally used all the dimensions from the payments, and then you removed some of them. A: was there a criterion you used to determine which dimensions to remove? And B: in the ones you kept, did you notice any correlation in the fraudulent payments?

A: For A — not really, other than instinct. At one point, for example, we took out country, because we thought perhaps the country wasn't that relevant, and the accuracy dropped by about 20%, so we put it back in. If I'm honest, there are statistical ways of doing it that are way beyond what I understand, so in our experience it was trial and error: we took stuff out and it didn't work, or we took stuff out and it made no difference at all, so we left it out. And obviously, the dimensions where the original naive Bayes approach had clumped two things together into one combined token — we could get rid of those, because that relation between the points exists in KNN anyway. But no — trial and error, sorry.
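For what it's worth, the "train once, answer many" shape described in that answer could look something like this sketch — the socket endpoint and the loadTrainingSamples()/loadTrainingLabels() helpers are hypothetical:

```php
<?php
// A sketch of a long-running prediction daemon: train once (the slow bit),
// then hold the model in memory and answer one request per connection.
require 'vendor/autoload.php';

use Phpml\Classification\KNearestNeighbors;

$classifier = new KNearestNeighbors(3);
$classifier->train(loadTrainingSamples(), loadTrainingLabels()); // done once, at startup

$server = stream_socket_server('tcp://127.0.0.1:9000');
while ($conn = stream_socket_accept($server, -1)) {
    // One JSON-encoded feature vector per request, one label back
    $features = json_decode(fgets($conn), true);
    fwrite($conn, $classifier->predict($features) . "\n");
    fclose($conn);
}
```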