Well, it's an honor to be here. I love this conference, and boy, those were great talks; that was fascinating. We've been watching some sports cars race, and now we're going to get into a little bit of the engine. It'll be a little bit old school. I have mixed feelings about the title of the group; I didn't know it was "Big Thinkers." It reminds me of about 15 years ago: five kids, and we were having some lunch after Sunday church. My oldest had just gotten back from her first semester of college and was sharing random facts she'd learned, you know how wise you get with one semester of college. So she said, "You know, for the first time ever, the life expectancy of American males has gone down," and the hypothesis was obesity, and my youngest went, "Dad." I had mixed feelings about that. I was impressed with her analytic powers; I was moved by her love. So, the Big Thinkers group: I have mixed feelings about it. But anyway, it's been a fascinating three sessions, and now I'm going to talk a little bit about the nuts and bolts. One advantage of being a little older is you can say, "Ah, back in my day, we did this."

So what are the top three things I would impart in 30 minutes? I want to talk about ensembles, the idea of using multiple competing models cooperatively. Then target shuffling: the key question of statistics is how likely is my finding to be real, how likely could I have gotten this result by chance? There are a lot of spurious correlations out there, and the more powerful our algorithms are, the more likely those correlations are to turn up, and most people fall prey to believing them. The publishing industry has a really hard time deciding what is real and what isn't, and there's actually a crisis in that field. The third one is a little more detailed, and it's one of the top 10 mistakes that people make; I'll talk some about those tomorrow, and I just picked out my favorite: leaks from the future, using data in your training that you don't have access to in the real world. It's astonishing how often that happens.

I also want to pay passing notice to cognitive biases, the ways that we carbon-based life forms are set up, the way we've been reinforcement-trained over the years and generations and millennia. We aren't well suited for thinking logically, and it's much worse than we think; I highly recommend the books on that subject. I won't talk about that today, I'll do it a little bit tomorrow, but the great obstacle in analytics isn't so much the technical part. The technical part is absolutely essential: being able to solve the problem and have it work on out-of-sample data unseen before. Almost all projects succeed there. Where they fail is in people actually using them, people actually implementing the beautiful thing you've created. I think it was a chef who had created a custom meal for a guest: it's paid for, but as the day ends he notices it's been scraped into the trash, uneaten. There's something wrong with that picture. So how can we get our environment right? It was great that Constant pointed out the incredible amount of work in the environment beyond the machine learning. While we're focusing on the machine learning, we really need to remember that the problem around it, the problem of getting people to use our work, is sometimes the greatest hurdle, and I have some statistics on that. But again, we'll focus on the technical stuff today, starting with ensembles.
So, the idea of using multiple non-linear modeling techniques: there are all sorts of different kinds. Some use the data directly, some use models that parametrically estimate the data, some are differentiable, some can work with small amounts of data, and they have lots of different ways of connecting the dots. All of data science depends on a smoothness assumption: if you're close in the X space, you should be close in the Y space. The nearest-neighbor algorithm, the one that friend of yours in junior high relied on, where you take the nearest neighbor's answer and write it down, is actually a good one, because it makes use of that smoothness idea: if you're close in the input space, your outputs are probably close too. A chaotic space would be one that's deterministic but not smooth, so you can't do any kind of inference backwards from chaotic data to a general theory, because there's no smoothness. When you find problems that are smooth, that have some relationship (it can be non-linear) between inputs and outputs, they're very amenable to any kind of inductive modeling technique.

But which is the best? There are lots of critiques out there. We did a study years ago, and here is out-of-sample accuracy (lower is better in this case) for five different algorithms on six different problems. Overall, the lowest of the group is neural nets, the red one, and this is an expert user of neural nets. These are not yet deep learning nets; this is old-school neural nets. There is a difference between a naive user and an expert user of the tools, and this was a fair race between people who were proponents of those tools; these are their out-of-sample scores, and neural nets win overall. They're very competitive, and I teach them as one of the top four techniques you should have in your toolkit, even though I hated neural nets as a youngster: they were so overhyped, an astonishing amount of hype around them. So it's wonderful to see deep learning coming around, the third generation of neural nets, and it's going through the hype cycle (the technology trigger, the peak of inflated expectations, the trough of disillusionment), but it will prove, and has already proven, extremely valuable and worth paying attention to. It's just hard to endure all the explosion and noise during that rapid rise, but it is worth it, because at the end it's the things that hold up that matter.

Still, a proponent of each of these techniques could be happy, because all of them come in first or second at least twice. Traditionally one has tried to look at the properties of the problem and guess which algorithm might do better out of sample, and that's worthy, but there is a better way: the ensemble idea. This is just averaging the predictions of those competing methods: take the five different models, take their estimates, average them together, and make a sixth estimate out of it. Or you could vote, and voting is arguably a little better in this case because it wins more often. That's an incredibly simple version of ensembles. I was creating new versions of ensembles, and they do a tiny bit better, but the key idea is that basically all reasonable ways of combining disparate, competing, independently developed models tend to help. It doesn't always work, unless it's decision trees: an ensemble of decision trees has, so far, beaten a single decision tree out of sample in every situation I've seen for two decades.
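To make the averaging-and-voting idea concrete, here is a minimal sketch (my own illustration, not from the talk): the dataset is synthetic and the four model choices and settings are illustrative assumptions, but the combining step is exactly the simple version described above, average the competing models' estimates or let them vote.

```python
# Minimal ensemble sketch: train several disparate models, then combine them
# by averaging predicted probabilities and by majority vote.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import accuracy_score

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

models = {
    "logistic":  LogisticRegression(max_iter=1000),
    "neighbors": KNeighborsClassifier(),
    "tree":      DecisionTreeClassifier(random_state=0),
    "neural":    MLPClassifier(max_iter=2000, random_state=0),
}

probs = {}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    probs[name] = model.predict_proba(X_te)[:, 1]
    print(f"{name:10s} out-of-sample accuracy: "
          f"{accuracy_score(y_te, probs[name] > 0.5):.3f}")

# Averaging ensemble: the mean of the competing models' estimates.
averaged = np.mean(list(probs.values()), axis=0)
print(f"averaged   out-of-sample accuracy: {accuracy_score(y_te, averaged > 0.5):.3f}")

# Voting ensemble: majority vote of the individual class calls (ties go to 1).
votes = np.mean([p > 0.5 for p in probs.values()], axis=0) >= 0.5
print(f"voting     out-of-sample accuracy: {accuracy_score(y_te, votes):.3f}")
```

The combined estimates typically land at or near the best individual model, and they protect you from having bet on the wrong single technique.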
So that's pretty interesting: an ensemble smooths the decision tree. A decision tree is not very smooth, but an ensemble turns its crude stair steps into something smoother, and we'll see that in a second. Here's a tough problem for a decision tree: draw this circle, separate the red from the blue. The best it can do is this crude picture that only a mother could love; you know, put it up on the fridge, Johnny's picture of a circle. But when you use bagging (bootstrap aggregating), you build a hundred different trees that have each been handicapped in some way, by not seeing some of the data or by seeing some of it twice, and when you find the trade-off there, the 50-50 point, you get something that's much closer to a circle. It's still a decision tree, still an Etch A Sketch toy with only the ability to draw north-south and east-west lines, but it comes closer to a circle. It smooths it out, and it doesn't overfit. Putting together a hundred trees that have been built separately is not an overfit, even though the number of parameters is vastly greater. It's a very interesting phenomenon; in fact, the ensemble is actually statistically simpler than that first tree we put up on the fridge. I won't go into that here, but it's an interesting result.

And here, if you have a bunch of different decision trees, these are the lift charts: what happens when you go a certain depth into the list, how much of the return do you get, how many of the sales, or if you're investigating fraud, how much of the fraud do you get back. These are different trees built on different subsets of the data, but if you put them together as an ensemble you get something that has many more cut points, many more places where you can distinguish between cases, and the lift chart is smoother. And that curve in the middle is, again, simpler than any of its components in a statistical sense, in terms of sensitivity to noise in the training inputs.

A last example for ensembles, a very practical one: credit scoring. If you use one method at a time (lower is better here), the MARS technique wins slightly over neural nets. But if you use the techniques in pairs, or three at a time, four at a time, five at a time, you see that although there's an occasional outlier (that's a scary thing up there, doing worse than the individual models), in general the more models you put together, the better it gets. And again, that's not subject to overfit. So ensembles are a key result, and heavily used to good effect.
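Going back to the circle example, here is a rough sketch of the bagging idea (again my own illustration, with synthetic circle data and default settings as assumptions). BaggingClassifier's default base learner in scikit-learn is a single decision tree, so this is literally a hundred handicapped trees, each fit to a bootstrap resample of the training data, averaged against one tree.

```python
# Bagging sketch: a single axis-aligned tree vs. one hundred bagged trees
# on a "draw this circle" problem.
import numpy as np
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(3000, 2))
y = (X[:, 0] ** 2 + X[:, 1] ** 2 < 0.5).astype(int)   # inside the circle = red

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

single = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)
# Default base estimator is a decision tree; each of the 100 trees sees a
# bootstrap resample (some points missing, some doubled), then they vote.
bagged = BaggingClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)

print("single tree accuracy :", round(accuracy_score(y_te, single.predict(X_te)), 3))
print("bagged trees accuracy:", round(accuracy_score(y_te, bagged.predict(X_te)), 3))
```

Plotting the two decision boundaries shows the effect described above: crude stair steps from the single tree, something much closer to a circle from the ensemble.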
The next thing I want to talk about is significance. The key question in statistics is: how likely could I have found something that interesting? Not how likely is that exact result, but how likely is it that something would have gotten my interest, because even if things were random, something would get my interest. There's always some best location, as we'll see in a minute. How likely could that best location have come about by chance? That's the key question of statistics. And statistics is the hardest subject to master in school; it's the least well caught. I was teaching data science to a bunch of statistics professors and they heard me say "taught." I didn't say it was the least well taught, though that is sometimes the problem; I said it's the least well caught. Most people don't remember enough except to pass the last quiz or exam, and then it fades from memory, because if you think about it, statistics is a combination of higher-order math and Buddhism: this reality you see is but one possible reality; now calculate this integral. So it's a tough subject, but it's so much better if we do it with simulation instead, if we do it the way statistics was intended: cards and dice, back when rich mathematicians... rich? There's no such thing as a rich mathematician. Sorry, rich gamblers were hiring poor mathematicians to solve odds problems for them. That's where statistics started. Target shuffling just takes that idea, and we'll look at it in a second, but first I have to establish the motivation, what's going on.

There is a crisis in science: the problem of irreproducible results. The Lancet is a good journal across the pond here; they take only about one out of 20 already self-selected submissions, yet they themselves believe that half of what they publish is false, that it cannot be reproduced. John Ioannidis, a famous and brilliant Greek doctor, has done a lot of studies on medical research, and he believes that 90% of medical journal articles are worthless. Ninety percent. Amgen tried to replicate studies that their entire business depends on and could reproduce only about six out of 53. Bayer did the same thing and fared a little better: they could replicate only about 25% of the 67 studies they rely on, when they went to all the trouble of trying. Recently, down the street from me at the University of Virginia, someone led a replication study of a hundred different psychology studies, and only 36% of them got results in the same direction when they were tried again; that paper was named result of the year for 2015 by Science magazine. Here's a chart where the original effect size is on the x-axis and the replication study (someone trying to follow the recipe and repeat the result) is on the y-axis. You would hope that everything would lie along that 45-degree line, but instead, two-thirds of the time it's significantly different, in the wrong direction. You can see there are many cases where the effect size was positive the first time and, in the repeat study, literally not just zero but negative. In fact, the person who led the study had one of those dots himself, and it's red. He was asked, "Well, what is that about?" He said, "Well, they must not have done the replication right." He was not ready to abandon his own result, even though it was a study he'd set up. Lots of fascinating stories in there. Again, these were the best psychology papers of that year being tested, and if you think psychology is a soft field (which it is), not as good as chemistry or physics or mechanical engineering or whatever, the same types of results are there: a majority of people have not been able to replicate their own results, and an even larger majority have friends who have had that problem.

And if you really want to know how someone is doing, there's a famous study of recycling, back when it was kind of a new idea. How much do you recycle? It was available commercially: you could work harder to separate your trash and then pay extra to have someone haul it away. What a deal. Anyway, you ask a homeowner, "How much do you recycle?" "Oh, recycling, sure." "And how much does your neighbor?" "Mmm, not much." But when researchers took the trash from the homeowners and counted it, what you suspected of your neighbor turned out to be a great estimate of how much you actually recycled. What you suspect of your neighbor tells more about you than about your neighbor, which is a sort of universal truth; there's some news you can use to take home. All right, so this is the problem: even the best studies in each field have a very difficult time being replicated.
And science depends on replication; it depends on a finding being real. If you're finding an association between what somebody looks at and clicks on in the recommendation engine and what they buy next, you would like that to hold up and not send them away. By the way, there was an organization that built a recommendation engine for a startup, and the startup went out of business six weeks after implementing it. The recommendation engine worked fine; it just had one little flaw: the sign was wrong. They were showing you the products you were least likely to be interested in, and that didn't help sales at all. But here, the sign is right, people are doing what the study tells them, and two-thirds of the time it is nothing, or worse. It reminds me, I don't know if any of you are Etsy fans, of those beautiful cakes people make, and then someone follows the procedure to the letter and gets something else. This is the state of research: you have these beautiful chicky pops and you get this result, or this lovely idea for a Christmas letter with a new arrival, and it just doesn't work out in practice. This is unfortunately the state of research all over the world right now, and we've got to do better.

So I have a technique, and I won't go into the nitty-gritty details today, but I'll show you a demo and hopefully you'll get the idea. It really solves the technical problem of recalibrating, of telling us how likely a finding really is. The statistical tests invented by those geniuses a century ago tell you the likelihood of making a draw of a certain proportion from a population of a certain size, and they work if you try one thing. But nobody ever tries one thing, right? We try a hundred or a million different things. Data science is a hypothesis-generating machine, and we cherry-pick the results and then apply the statistical tests, and that quietly lowers and lowers the real thresholds. I mean, why is the publication threshold for a medical journal 5%? If they really believed that was statistically right, they'd be saying it has to be true 19 out of 20 times. But wouldn't you take a drug that was right two out of three times, if the chances were that good? In fact, no drug is actually that effective. I worked with Pharmacia & Upjohn and helped them discover one of the drugs they developed in a decade-long period, one of the three drugs that they and Pfizer developed in that period, and I learned a lot from them. I learned, for instance, that 70 to 80% of the effect of a drug is the effect you get when you put nothing in it. What's the word... the placebo effect, there you go; I'm starting to need drugs myself. The placebo effect, where you're just in the right setting and someone hands you something even if it has no compound in it, accounts for the majority of the effect of even blockbuster commercial drugs that have been approved by the agency. So the level of effectiveness of medicine, when you get into the data, is actually very scary, very low. And drugs that passed ten years ago wouldn't pass today, because the placebos are better than ever before. People on social media were describing, "Oh, I'm in the study and I feel really sick in the morning." "Wait, you feel sick? I don't feel sick. I'm quitting; I'm in the control group." So they put side effects in all the placebos now, and the placebos work better than ever. Anyway, we need that same thing in every other field. We need a good placebo test: what would be the effect of just running data science on randomized data?
Target shuffling basically does that. Take an example from baseball: you have a strike zone where pitches are thrown, and only a fraction of them, about 9%, are hits where a person gets on base safely. You can imagine a lot of fields where people are divided up like this, by geographical location, age, gender, and so forth. You put them in boxes and ask: what subset of my customers, or my patients, or whatever, responds best to this treatment? In this case the interesting event is a hit, a red ball. So as you search through the space, you can use p-values to tell you how interesting a subpopulation is, in terms of its ratio of red to blue and its count. Two out of four would be interesting, but not as interesting as 10 out of 20: same ratio, bigger count, and the p-value takes that into account. So it's a great interestingness measure, but it's not a probability unless you test just one thing; when you test multiple things it loses its interpretation as a probability. It's still an interestingness measure, though. So take your interestingness measure and find the best spot. It turns out the best spot has nine hits out of 37 pitches, a hit rate of about one in four, and it's a sizable population; that is the most interesting hot spot in this data. So after we've searched it all, we identify that as the winner. And what was its score? A third of a percent: there's a third-of-a-percent chance that something that dense in red, over that large a population, could have happened in this data. But it's not a third-of-a-percent chance after you've checked all 120 different boxes; you've cherry-picked, and now it's not interpretable as a probability anymore. The chance of rolling a six, if you roll a die once, is one out of six. But what if you roll that die ten times? The chance of your best result being a six is much, much higher. That's exactly what's happening in data science: we're trying millions of different things, and the more powerful the algorithm, the more likely we are to find a spurious correlation, unless we have a lot of data to correct us.

But we can calibrate these things by just modifying the data. Here the data is literally the same as before; I'm just going to flip back and forth between these two views. Look at one data point that was red: it's probably blue this next time around. I shuffled that 9% around. I didn't change the physics, I didn't change the inputs; I just shuffled the output label, the target variable. Target shuffling. Now, in this new data there is going to be some point that is most interesting, and it turns out to be in a different location. The location doesn't matter as much as the interestingness measure: this was the most interesting thing on randomly labeled data with the same proportion of reds, but now the location was different, and there was still a hot spot. There's always a hot spot, right? Just by chance.
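The die arithmetic makes the cherry-picking point precise: one roll gives a six with probability 1/6, about 0.17, but the best of ten rolls is a six with probability 1 - (5/6)^10, about 0.84. And here is a minimal, self-contained sketch of the target-shuffling idea (my own illustration: the synthetic strike-zone data, the 10-by-12 grid of boxes, and the hypergeometric tail p-value standing in for the interestingness measure are all assumptions). Score the best box on the real data, then shuffle only the target column, redo the whole search, and count how often chance alone looks at least as interesting.

```python
# Target-shuffling sketch on synthetic "strike zone" data.
import numpy as np
import pandas as pd
from scipy.stats import hypergeom

rng = np.random.default_rng(0)

# 2,000 pitches in a 10x12 grid of boxes, ~9% hits, with NO real relationship
# between location and outcome (so any hot spot we find is pure chance).
pitches = pd.DataFrame({
    "x_bin": rng.integers(0, 10, 2000),
    "y_bin": rng.integers(0, 12, 2000),
    "hit":   rng.random(2000) < 0.09,
})

def best_hotspot(df):
    """Score every box with a hypergeometric tail p-value (lower = more
    interesting) and return the best score found anywhere in the grid."""
    n_total, n_hits = len(df), int(df["hit"].sum())
    best = 1.0
    for _, box in df.groupby(["x_bin", "y_bin"]):
        k, n = int(box["hit"].sum()), len(box)
        best = min(best, hypergeom.sf(k - 1, n_total, n_hits, n))
    return best

real_score = best_hotspot(pitches)

# Shuffle only the target, re-run the whole search, and count how often a
# world with no real X-to-y linkage produces something at least as interesting.
shuffled_scores = [
    best_hotspot(pitches.assign(hit=rng.permutation(pitches["hit"].values)))
    for _ in range(1000)
]
p_empirical = np.mean([s <= real_score for s in shuffled_scores])

print(f"best hot spot p-value on the real data: {real_score:.4f}")
print(f"fraction of shuffles at least that interesting: {p_empirical:.2f}")
```

That final fraction is the calibrated answer to the key question: how likely could I have been fooled by chance alone?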
Well, that shuffled hot spot thinks it's interesting. It's not quite as interesting as the previous, real result, so that's good, but it's still quite interesting. Well, we just do this many, many times and get a histogram of how often the shuffled data beats the real result; that would be the left-hand side. After 10 trials, two of our shuffled examples have beaten it; after a hundred, 15 percent have; after a thousand, 18 percent. So it hasn't quite converged yet, but we can say that roughly one out of five random labelings of the cases results in a finding more interesting than our original finding. And that's the key question in statistics: how likely could I have been fooled? How likely could I have found something as interesting as my real result, or better, by chance alone? We've created a world where the null hypothesis rules, where there is no linkage between the target variable and the input variables, and we counted how many times we got a better result. A very simple procedure solving a very subtle problem: over-search. You've heard of overfit, where things are fit too tightly to the data. With over-search, you can have relatively simple models; you just look at a million of them, or a billion of them, and that's a different kind of complexity that's completely hidden, because it doesn't show up in the final model. It's part of the invisible process that got you there. Target shuffling takes that into account and can give you a calibrated number. And now, if you were a business person, the sports manager, you'd know there's a four-out-of-five chance that this hot spot is real, and you can decide whether that's good enough to act on. By the way, that example mixed a bunch of batters together; they do this now for individual batters, so hitting is a lot harder because of analytics. Unfortunately, it makes baseball even more boring. I was in a restaurant in Boston, it seems it was yesterday, yes, yesterday, watching the different sports on the screens, and it took me three or four minutes to realize that the baseball channel was frozen.

All right, the third one: leaks from the future, with real-world examples. The hype around neural nets is great, and the reality is good, but it also means the false findings are even greater, because people say "this is how the brain works," or they have an irrational belief in the goodness of their results. And the problem is, you get results that are too good, and they're too good to be true for a reason: they're not true. When a PhD computer scientist using a neural net many years ago was asked to forecast future interest rates for a bank in Chicago, he did, and it got 95% accuracy. They said, "That's too good to be true," and he said, "Well, no one this smart has ever tried it before," you know. Anyway, they had to give it to somebody else to find the problem, and it took days, which is kind of embarrassing, because the end result was that one of the inputs effectively was the output variable. With future interest rates among the inputs, you only lose 5% of the information in predicting what future interest rates were. A regression, by the way, would have found that relationship right away, which is why I always say run multiple different techniques on the data; once you've done all the hard work of prepping the data, it's really easy to try multiple techniques. I work in the hedge fund industry a lot, and I'm the killer of dreams, really, because folks make it to me and then, nine times out of ten, we find a bug in their stuff. One of the stories there: a model that was 70% right in predicting whether tomorrow's S&P would go up or down.
It used thousands of lines of code in a fourth-generation language, and after much trouble I was able to reproduce their results, 100%, with a three-day moving average. Now, that would merely have been bad, except the three-day moving average was centered on today: with yesterday's price, today's price, and tomorrow's price, 70% of the time they could get tomorrow's price direction right. I could have said, "You know, if you drop a data point you could get 100% right," but I didn't say that; it was a sad thing. They did not know that that was what their huge model had devolved to. There was an off-by-one error in training, and they were off to the races: wearing suits, flying places, showing fancy PowerPoints, getting investment. We were able to help them before it turned into fraud; they didn't have to go to jail, because they stopped and sent the money back until they could fix this major problem. But it's amazing how far things can go.

An insurance company had data for predicting upsell. AAA is an auto-emergency, roadside-assistance kind of program with a wide selection of customers; it's a pretty good deal, and they try to upsell those customers to get their auto insurance or motorcycle insurance or boat insurance, whatever they offer, so they're constantly contacting them. Well, there was one candidate input variable that was blank most of the time, but when it was non-blank, a decision tree said: look, over here where it's non-blank is a quarter of your purchasers, and it's 100% purchasers; there are no non-purchasers in this group. Again, red lights should go off: too good to be true, right? We kept saying there's something wrong with this variable, and they said, "No, no, that's a legal variable." Ultimately we found out it was a cancellation code, how they canceled their insurance, which obviously occurs after they buy their insurance, not before.

And I have multiple examples. An ad that Vanguard showed all over the world on the web has great advice for how to do backtests, and then the example they show breaks those very rules; it's a horrible example, which I hope to detail more tomorrow. Or a data mining salary survey, which might interest some people: it was shown by a startup company that trains people from other careers to become data scientists, and it said that if you negotiate, on average your salary goes up 3%. Well, who are they interviewing? The ones that successfully got the job. What is another possible outcome of negotiating? Not getting the job, right? But because those people don't have the title, they're not interviewed, and so you don't get that data. It's a survivorship bias problem. You know the famous example from World War II: when planes were returning to Britain from bombing runs, they measured where all the holes were. They were in the wings: "We obviously need to put more metal on the wings." Then somebody realized: wait a minute, these are the ones that came back. We need to put metal where these planes don't have holes, because the planes that were hit in those other places didn't make it back. So you just have to think backwards: what could have happened to my data before I got it? What kind of filters did it go through? And are those the filters that a new data point out in the wild will have gone through?
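Going back to the S&P story for a moment, a tiny simulation shows why a moving average centered on today is a leak from the future: it quietly contains tomorrow's price. The random-walk prices and the naive "above the average means up" rule below are my own illustrative assumptions, not the fund's actual code.

```python
# Leak-from-the-future sketch: trailing vs. centered 3-day moving averages
# on a simulated random walk, where nothing should be genuinely predictable.
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
price = pd.Series(100 + rng.normal(0, 1, 2000).cumsum())
tomorrow_up = (price.shift(-1) > price).astype(int)       # what we want to predict

trailing = price.rolling(3).mean()                        # two days ago, yesterday, today
centered = price.rolling(3, center=True).mean()           # yesterday, today, TOMORROW

def hit_rate(feature):
    """Naive rule: predict 'up' whenever the feature is above today's price."""
    predicted_up = (feature > price).astype(int)
    valid = feature.notna() & price.shift(-1).notna()
    return (predicted_up[valid] == tomorrow_up[valid]).mean()

print(f"trailing average hit rate: {hit_rate(trailing):.2f}")   # about a coin flip
print(f"centered average hit rate: {hit_rate(centered):.2f}")   # suspiciously good
```

On a pure random walk the trailing average scores about 50%, while the centered average scores roughly 75%, right in the too-good-to-be-true territory of the story, even though there is nothing real to predict.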
You know, when I teach this I often ask people, after they learn the top 10 data mining mistakes, which ones have you seen in practice? And to my astonishment, leaks from the future is the one least observed, if only slightly, yet in my experience it's the one that most often happens. So I think there's a gap there, where there are hidden leaks that aren't being caught. They can be very subtle. A medical diagnosis company we worked with was trying to do a better job of identifying diabetes through infrared light shone through the skin, reflecting off the blood and coming back. There was a lot of hard work to take away the person-specific information that's in the skin, related to their age, their gender, their... I'm sorry, their race, and so forth, how much they work outside, whether they have lotion on, all sorts of things. You have to get through the skin, get to the common blood chemistry that we all share, and get that diagnosis. They were very successful in many different ways, with lots of different applications for it, but one of their failures was related to this; they made this mistake over and over and over again, using information from the future and not properly segregating it. In one case they built principal components as features of the data and then used those principal components in modeling, after separating the data into training and testing. Now, what did they do wrong? Why did their performance on truly new data, data they'd never seen before, underperform their evaluation data? Well, they had done the principal components on all the data. So if there was an outlier out there, the principal components already knew about it; they were already pointed in the right direction to take that outlier into account. You cannot do feature creation on all the data; you have to use only the training data, and the evaluation data has to be segregated at all times. It happens a lot.

Well, I have a couple more stories from lots of different fields, but let's just summarize, summarizing it a little bit backwards. Leaks from the future mean that we can't really trust our results. Ensembles tell us that we can't do it alone. And target shuffling says some of our success is luck. So it becomes a very humbling exercise to learn these things, as you realize: we're alone, no one loves us, we're gonna die. You know, it's not that bad, but... and cognitive biases also tell us we can't really trust our own judgment. So we really need other people; we need teams. We have to have a skeptic's eye, and it really helps to have another person try to red-team it, try to break into our model and bust it, because if we'd thought of it, we'd have built it better in the first place. So teams are really important; we need those alternative perspectives. And simulation can be a real leveler: it tells us how likely it is that we cherry-picked such a good result from random results. It's a very humbling experience, but it's doable, and the good news is almost every project succeeds if you do it right. The bad news is, if you do it right, your results always look worse than if you cheat (I don't know if you've noticed that), but they hold up better out of sample, and that's what it's all about. Thank you.

Thank you, thank you, John. I think we have time for a quick question. Again, thank you; a very interesting, humbling view of how hard it is to actually do good data science. The interesting part, I suppose, is the teamwork aspect. A lot of times PhDs are done on their own.
They get in the habit of writing thousands of lines of code and producing reports and results, but in more of a commercial setting, where there's a result that has to work and you have to get the right answer, how would you facilitate teamwork and the kind of robust discussion that needs to happen? Yeah, there are a couple of things. One is that Google did a really good study on the effectiveness of teams, and the key finding there was a property they called psychological safety: is it safe to throw out an idea, to criticize, to brainstorm? Are you being listened to? Are you in a safe environment for experimentation and failure? That's extremely important. And many of us in the field, not me, but many folks in the field, are introverts who really want to take the work, polish it, and get that blue ribbon when it's all done. But you're going to polish the wrong rock. You need to talk to the client at least once a week, you know, briefly, 20 or 30 minutes, but you have to have that interaction. It also builds their buy-in to the solution; they're much more likely to actually apply it if they've seen the sausage being made. So that's very important. I absolutely agree, very, very good. Thanks so much, John. More of that tomorrow at the master class, and as I say, there's still availability.