To give you an example: in the Vatican there are 5.88 popes per square mile. The number is correct, but you can see how a statistic like this doesn't tell you much about actual people. So, statistics show up everywhere in everyday life, whether or not you have advanced degrees in maths, and what not. So this talk will be about the use, and misuse, and abuse of statistics in everyday life, and essentially how not to lie with statistics. So the idea is we're not talking about Python or any advanced statistical modeling or machine learning. We just want to be sort of good citizens and be prepared for when we are exposed to statistics, and we want to understand what's going on. We're not talking about Python, but just out of curiosity: how many of you are Python users at different levels, beginners, experts, being exposed to Python more or less? Almost everybody. A few people are too tired to raise their hands but, you know, almost everybody. Everybody feeling okay? Anybody feeling sick? Nobody sick? So, there you go. Let's start with the first topic: correlation. An informal definition: correlation is some sort of relationship between two variables, where the two variables move together. A bit more formally, we also want to measure the strength of this relationship, of the association between the two variables. When we talk about correlation, the kind of simplest thing that comes to mind is linear correlation. It's just easier to visualize, right?
Linear correlation, when one variable is increasing, the other variable is either increasing or decreasing following some sort of line. So you see the line here, therefore, linear correlation. We talk about positive or negative linear correlation, but the idea is one variable moves and the other variable follows the line. To give you a more concrete example, let's say the temperature goes up and if you have an ice cream shop, also your revenue will go up, right? Nice weather, you sell more ice cream. And the way we look at this, there is kind of a, you know, a cause and effect. Nice weather, therefore, we eat more ice cream. But in the general case, that's not always true, right? Maybe you heard the expression correlation does not imply causation. So again on the ice cream example, we can see how there is a correlation between revenue for ice cream sales and the number of people who died drowning. So what's going on here? Is ice cream really the killer? To understand what's going on here, we need to introduce the notion of lurking variable. A lurking variable is a variable that we don't really see, but it's there, it's kind of looking at us, so it's lurking. Back to our ice cream example, that would be temperature, of course. Nice weather, more people eat ice cream, so revenue goes up. But also nice weather, more people go swimming and therefore more people die drowning unfortunately. So there is a third variable here explaining the connection between ice cream and drowning. One more example. Often people observe that whenever there is some sort of fire accident, if you deploy more firefighters on the scene, you will also have bigger damage, right? So from a decision making point of view, it makes sense to say, okay, let's deploy less people, less firefighters, so the damage will be smaller. Of course, big fire means you have to deploy more firefighters, so big fire also has a higher chance of causing a bigger damage. So that's the idea. 
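To make the ice cream example concrete in code, here is a toy simulation (all coefficients invented for illustration) where temperature, the lurking variable, drives both ice cream revenue and drownings, so the two series come out strongly correlated even though neither causes the other:

```python
import random

random.seed(42)

def pearson(xs, ys):
    # Pearson correlation coefficient between two equal-length sequences
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

# Temperature is the lurking variable driving both series
temps = [random.uniform(10, 35) for _ in range(200)]
revenue = [50 + 10 * t + random.gauss(0, 20) for t in temps]     # ice cream sales
drownings = [1 + 0.3 * t + random.gauss(0, 1.5) for t in temps]  # swimming accidents

# Strongly positive, yet neither variable causes the other
print(round(pearson(revenue, drownings), 2))
```

Control for the lurking variable (look at days with similar temperature) and the apparent ice-cream-to-drowning link disappears.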
There is the fire severity behind the scenes to describe the relationship. So, long story short, if we try to explain correlation and causation, it can be complicated, right? So here I'm kind of summarizing all the options. Either there is actually a cause, so A causes B, or the other way around, or maybe the two variables, A and B together, explain something else, or something else is the cause of A and B, or maybe there is a transitive relationship: A causes something, and something causes B. Or maybe there is just no connection between the two variables. A few examples of things that correlate. So the number of movies with Nicolas Cage and the number of people who drown in a pool. So Nicolas Cage, please don't do other movies. The consumption of margarine and the number of murders by a blunt object. So margarine makes you kind of more nervous, more aggressive. Facebook, quite easy nowadays to blame Facebook for everything, but the number of users of Facebook and the national debt of Greece, they kind of go together: so more users of Facebook, bigger problems for Greece. Number of users of Microsoft Internet Explorer and the murder rate, yeah. Again, the numbers are true. There is no lie here. Finally, my favorite one is the consumption of chocolate and the number of Nobel Prizes. So you see how every country is kind of following the line: the more chocolate you consume, the more Nobel Prizes you win. And there are a couple of outliers: Sweden, having more Nobel Prizes than expected, who knows why. And Germany, not very efficient at converting chocolate into Nobel Prizes. So that was correlation. Now moving on to the next topic: what's going on when you analyze data and you sort of slice and dice your data set. This is also called the Simpson's paradox, that was observed and described by somebody not called Simpson. But still we call it Simpson's paradox.
And I'm going to use a textbook example here to describe the Simpson's paradox, that's from Wikipedia, essentially. If you look at the number of admissions in grad school in the 70s for the University of California, and then you group by men and women, you see that there is a difference in the proportion between men being admitted and women. So the difference is kind of big enough to ask the question: is there some sort of gender bias going on? Now, the numbers here are correct. If you start digging into the details and you break down the numbers per department, so each line is a different department, A, B, C, D, and so on, what you observe is something funny. So you see how for many departments the proportion of women being accepted is actually higher compared to the proportion of men. So these numbers are also correct, and they're kind of telling the opposite story. If you look at the absolute numbers, you see how men tend to apply for departments with a higher admission rate. And on the other side, women tend to apply for departments with a lower admission rate. So essentially, well, one could say maybe women are applying to more competitive departments. But long story short, you will observe this kind of paradox whenever you have a data set, you kind of slice and dice the data set, and your classes, your groups, are not equally distributed. So the distribution across departments is highly skewed, and that's why you observe this kind of phenomenon, Simpson's paradox. So all the numbers are correct. If you have some sort of agenda to push, you can choose one or the other. The next type of lies is related to sampling bias. So, sampling bias: when I asked, do you know Python? Well, we are at a Python conference, so we kind of expect a lot of people to know Python, and I shouldn't use this information to draw conclusions on a bigger population. So back to the terminology: sampling.
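Going back to the admissions example for a moment: the paradox is easy to reproduce with toy numbers (made up for illustration, not the real Berkeley figures). Women do better in each department, yet worse overall, because the groups are unevenly spread across departments:

```python
# Toy admissions numbers (invented): (applied, admitted)
data = {
    "easy dept": {"men": (100, 80), "women": (20, 18)},
    "hard dept": {"men": (20, 4),   "women": (100, 25)},
}

def rate(applied, admitted):
    return admitted / applied

# Per-department rates: women do better in BOTH departments
for dept, groups in data.items():
    for sex, (applied, admitted) in groups.items():
        print(dept, sex, f"{rate(applied, admitted):.0%}")

def overall(sex):
    # Aggregate across departments before computing the rate
    applied = sum(g[sex][0] for g in data.values())
    admitted = sum(g[sex][1] for g in data.values())
    return admitted / applied

# Aggregated rates flip the story: men do better overall
print(f"overall men {overall('men'):.0%}, women {overall('women'):.0%}")
```

The flip happens because most women apply to the department that rejects most applicants; the skewed distribution across groups is doing all the work.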
The idea of sampling is selecting a subset of individuals with the purpose of doing some sort of estimate on a bigger population. Sometimes you cannot do estimates on the full population, so you need to build some sort of model and you do sampling. In the age of big data, that's what you have to do. On the other side, bias. In everyday language, we have maybe a bit of a negative connotation to the word bias. We also associate bias with prejudice. In science, maybe there is not explicitly this kind of negative connotation, so bias is just a systematic error. We don't know if the error was on purpose or by accident. So sampling bias is simply an error made during your sampling process. And again, a bit of a textbook example: "Dewey Defeats Truman". Truman was elected president of the U.S. in, I'm going to say, '48 maybe. That's the morning after he was elected, and he's waving a newspaper that says "Dewey Defeats Truman", so the newspaper says the opposite. And you see the guy smiling. So what happened here is the newspaper put in the wrong headline because they ran some sort of survey, a phone survey, precisely, and they asked: who are you going to vote for? And remember, this is 1948, so not everybody has a phone. So the kind of people with a phone at the time, who were actually readers of the Chicago Tribune, were all Republicans, essentially, and they were voting for Dewey, so the survey was clearly biased. Therefore, the wrong headline. As a special case of sampling bias, there is also survivorship bias, that was mentioned yesterday in the keynote. So survivorship bias is when you focus only on the lottery winners and you forget about all the people who bought a ticket but didn't win the lottery. And also when you hear all the stories of success, all the billionaires, Bill Gates, Jobs and so on, they are all college dropouts. So should you quit studying and become a billionaire? You are old enough to make your own decisions.
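Back to the phone survey for a moment: here is a toy simulation of that kind of biased sampling frame (all numbers invented, not the real 1948 figures). Suppose Dewey voters are far more likely to own a phone; a perfectly honest survey of phone owners then badly overestimates Dewey's share:

```python
import random

random.seed(1948)

# Hypothetical electorate: 55% vote Truman, 45% vote Dewey
population = (["Truman"] * 55 + ["Dewey"] * 45) * 1000

def has_phone(voter):
    # Invented assumption: phone ownership correlates with the vote
    # (wealthier households in 1948 skew Republican)
    return random.random() < (0.6 if voter == "Dewey" else 0.15)

phone_owners = [v for v in population if has_phone(v)]
sample = random.sample(phone_owners, 2000)  # the "phone survey"

dewey_share = sample.count("Dewey") / len(sample)
# Well above Dewey's true 45%: the sampling frame, not the poll, is broken
print(f"Dewey's share in the phone survey: {dewey_share:.0%}")
```

The estimate is wrong no matter how large the sample gets; more data from a biased frame just gives you a more precise wrong answer.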
I didn't quit studying and I'm not a billionaire. The next segment is on data visualization. So data visualization, in data analytics in general, is a very powerful tool. You can essentially use one image to describe a very complex kind of concept, and also as a data analyst, when you're doing data analytics, you still need to use visualization to understand what's going on with your data. So here you have, for example, four different data sets and they all share some summary statistics. So the average X, the average Y, they're all the same, the variance will be the same, some sort of correlation coefficient will be the same. So if you only look at the summary statistics of a data set, maybe you don't fully understand what the data set is about. Once you plot it, you will see how these data sets are really very different. Again, this is a bit of a textbook example, but the idea is data visualization gives you better insights into your data set. But data visualization is also used to communicate to the broader public: if there is a complex kind of topic, you can use just an image to communicate. So here there was some sort of court decision, and a newspaper just wanted to showcase how the different parties support this particular court decision, and you see how the bar for Democrats is much, much higher, almost three times bigger than the other. So it looks like Democrats are very much in favor of this particular decision. But then something funny is going on: the vertical axis is starting not from zero but from 50. Once you normalize everything, the one on the right is the correct version of the plot, and you see how the bars are different but the difference is not so huge. So maybe the story on the right seems less interesting from a newspaper point of view. More visualizations: so, guns in the US, very hot topic.
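Going back to those four data sets for a second: they match the classic Anscombe's quartet. Here is a minimal check, in plain Python, that their summary statistics really do agree even though the plots look completely different:

```python
# Anscombe's quartet: four data sets with (nearly) identical summary statistics
x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
quartet = [
    (x123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    (x123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    (x123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
     [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
]

def mean(v):
    return sum(v) / len(v)

def corr(xs, ys):
    # Pearson correlation coefficient
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

for xs, ys in quartet:
    # All four rows print almost identical numbers
    print(f"mean x={mean(xs):.2f}  mean y={mean(ys):.2f}  corr={corr(xs, ys):.3f}")
```

Same mean, same correlation, four wildly different shapes once plotted: a line with noise, a curve, an outlier-driven line, and a vertical stack with one leverage point.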
So in 2005 in Florida they introduced what is called the "stand your ground" law, and here you see how, when the law is introduced, there is kind of a drop in this graph that is representing the number of murders committed using firearms. Again, something funny going on: for some reason the vertical axis starts from a thousand and goes down. Once you fix the plot, reality is literally upside down. So this was published in Business Insider. The original visualization was by Reuters. One more example, this is from the Italian public service. Essentially this is a talk show, a political talk show. They did a survey and they asked whether the government is friends with the lobbies, and of course being friends with the lobbies is bad. So if you don't like the results, you take 44%, which is a big slice of the cake, and you squeeze it into a tiny slice. And I always thought in Italy we are kind of world champions at this. But then I moved to the UK about 10 years ago and I realized things are not any better anyway. So, some political leaflets in the UK. Just to give some context, the system is called first past the post. So essentially the narrative from the main parties is always: don't vote for the small guys because the vote is going to be wasted, you should vote for us. So it's always kind of a race between the two main parties. Here, the leaflet by the Conservative Party, in blue, says: don't waste your vote on UKIP, you should vote for us because we're going to be ahead anyway. So it's funny how the bar for the Labour Party, which is 42% rather than 32%, is smaller than the one for the Conservatives. So it's kind of giving the message that they are ahead. But they're not the only ones doing these kinds of little tricks. So this one is from the Lib Dems, for some sort of local election, I think. And you see how the yellow bar for the Lib Dems is kind of catching up with Labour. Almost there. We need your help. We need just a couple more votes.
But then when you normalize it, you see there is a huge difference. So it's kind of like, why bother? Now, to complete the picture with all the main parties. So again the story is going to be a race between two horses. This one is from Labour, and they say: don't waste your vote with these small yellow guys, vote for us, it's a race between two horses. But they completely forgot about the Green Party, which is the one competing for that particular constituency. So, just to be a little bit politically correct, you see how all the parties are doing the same little tricks. So that was the idea on data visualization: you can use visualization to convey any kind of message. Now for a slightly more advanced kind of topic: statistical significance. Statistical significance is one of the most unfortunate names in science, probably, because in everyday language, when we say something is significant, we kind of assume it's also important. So often statistical significance is used as a synonym for importance, but it's not really the case. So when we talk about statistically significant results, we simply mean that we are kind of sure about the results. So the results are more reliable; they're not by chance. But statistically significant results are not about how big the results are, it's not about how important the results are, and it's not about how useful they are. So they're simply statistically significant. So we're just more sure about the results. And that's it. One topic connected to statistical significance is p-values. When I was a student, p-values were one of the most confusing topics for me. And I wish I could tell you that now I fully understand p-values, but it's not really the case. And at the time we didn't have Wikipedia. Anyways, nowadays the topic is so confusing that it has a Wikipedia page on what the p-value is not. So, a lot of misunderstandings around p-values.
So I was chatting about the topic earlier with Vincent, who gave a presentation, so we were kind of preparing the slides before the presentation, and I know he knows a lot about statistics, so I asked him: do you know about p-values? And I could tell he's an expert because he didn't answer the question, he says: hmm, I'm a Bayesian. Okay, so let's put it on the slide. So even people who know about statistics, like Vincent, don't really have an answer on what the p-value is about. So let me try to upset the statisticians in the room. Let's see if I can. So the p-value, in its basic definition, is the probability of observing the results that we get, or more extreme ones, when the null hypothesis is true. So that's the basic definition of p-values. Remember, it's about probability, not certainty, and what you see in scientific publications usually is some sort of threshold, which is arbitrary and usually set to 0.05, so p smaller than 0.05. Other fields might have different standards, but what you see more often is 0.05. It means 1 out of 20, right? That's the idea. So essentially: can we afford to be fooled by randomness one time out of 20? That's the idea behind p-values. Connected to the notion of p-values there's an idea called data dredging. So dredging, in real life, is a kind of fishing. And in fact, data dredging is also called data fishing, or p-hacking, to say we're trying to hack the p-values. So what's going on with data dredging? Essentially, the conventional way of going about it: you formulate some hypothesis, you collect data, and then you either prove or disprove your hypothesis. In data dredging you kind of go the other way around. So you have your data and you look for patterns until something interesting and statistically significant comes up, so you kind of build your hypothesis in retrospect. So, looking for patterns in your data: it's fine, exploratory analysis, so you can sort of understand more about your data set.
That's totally fine, but testing your hypothesis on the same data set, that's typically wrong. That's what data dredging is about. Often it's quite easy to spot. Sometimes it gets through, and you see publications where you might feel like they were going for some fishing, but you're not really sure. So, wrapping up: we have seen a lot of examples in different directions where essentially you can use statistics to push your kind of agenda, and it feels like we can't really trust anybody. Well, the purpose of the talk was not to create or prepare the next generation of conspiracy theorists. The purpose was simply to remind you there's a big difference between big headlines from the media and proper science, and anyway this is something that can affect everybody, so nobody is really immune. So even if you are in good faith, from time to time you stumble upon these problems and you might kind of introduce your own bias. So the point is: always try to ask questions. In particular: what is the bigger context? If you are observing something about a study, who is paying for the study? Is there anything that is missing at all? What is the bigger picture? And long story short, the best question would be: so what? You observe some data, you observe some numbers; so what? What are the connections, what are we trying to describe here, is there anything that we don't see, and so on.
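The "one time out of 20" point about p-hacking can be simulated directly: run 20 studies where the null hypothesis is true by construction (a fair coin), and see how many come out "significant" at 0.05. A toy sketch, using a two-sided z-test with the normal approximation:

```python
import math
import random

random.seed(7)

def p_value(heads, n=100):
    # Two-sided z-test for "is this coin biased?" (normal approximation)
    z = (heads - n / 2) / math.sqrt(n / 4)
    return 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))

# 20 "studies", each just flipping a fair coin 100 times:
# the null hypothesis (the coin is fair) is true every single time
p_values = []
for _ in range(20):
    heads = sum(random.random() < 0.5 for _ in range(100))
    p_values.append(p_value(heads))

significant = [p for p in p_values if p < 0.05]
print(f"{len(significant)} of 20 null studies came out 'significant' at p < 0.05")
```

On average, about one of the twenty comes out "significant" purely by chance; test enough hypotheses on the same data and you can always fish out a publishable p-value.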
So that's the summary of it. The slides are on Speaker Deck, they will be around on the conference app, on Twitter, the usual things, and more links if you want to know more. And just to plug quickly PyData London: I'm one of the organizers of PyData London, there was a mention this morning of PyData London, so you will find me there and you can ask me about PyData London or other PyData chapters around the world. Thank you Marko, we've got time for a couple of questions, if anyone has any questions. So hi, thank you for the talk, it was very nice. I didn't quite get why the dredging is bad. I mean, I understand that if I have a hypothesis and I try one that doesn't work, so I can try the next dataset, okay, that is bad, I can understand that. But if I have a very big dataset and I just look for interesting patterns and I find one, why is it bad looking further into that? So, looking for patterns, in fact that's what we do with a new dataset, so during exploratory analysis you kind of look for patterns. The problem is when you kind of assess your hypothesis on the same data, so you do it in retrospect. Imagine, coming from a machine learning background in my case, imagine I do training and testing on the same dataset. So it's kind of similar; it's like cheating, basically. But looking for patterns is totally fine, you just shouldn't validate your hypothesis in that way. Okay, thank you for the presentation, and you say what the lies are, but then what is true, and how to do the proper analysis? Yeah, I think we need a lot of time to discuss what is true and what is not. I guess the title of the talk is taken from a famous expression. Yeah, that's the problem: there are facts and there are lies, but sometimes there are facts that are presented in a way that is clearly pushing some kind of agenda, and I guess the message is you need to be prepared for it. I'm not saying everything is fake; sometimes things are kind of representing reality, just in a way that is packaged for you to kind
of go in some particular direction. When you get something from the news, it's difficult to break down reality into smaller chunks to fully understand whether things are really lies or facts. Still, you know, if you want to be a good citizen, you should make an effort; that's just the message. I totally agree, it's a deep philosophical question, and I guess we need a couple of beers to approach the conversation. Any other questions? Got time for maybe one more, if anyone is interested. Alright, in that case, thank you very much Marko. Thank you.