Great, okay. It's my great pleasure to introduce Marco, who's going to talk to us today a little bit about statistics. It should be no surprise: Marco is a data science consultant and one of the main organizers of PyData, and I believe he's chairing PyData London this year. It's happening in about a month's time. It's a fantastic conference, so if you haven't got your ticket yet, you should definitely consider that. It's not too far away now. So with all of that, we'll get started. Over to you.

Okay. Thank you. Thank you. So, yes, thanks for joining me today. I just want to tell you there are three types of lies: there are lies, there are big lies, and there are statistics.

As a starting point, you can consider the following statement: in the Vatican City, there are 5.88 popes per square mile. And the number is correct, you know, I'm not lying. I don't know much about the Vatican City per se, but I kind of feel that something funny is going on here. And that's the idea for the talk. Essentially, we are exposed to the use of statistics in different ways in everyday life, and statistics can be used to lie to us. So that's the topic of the talk.

We're not going to talk about Python, even though we can do statistics and advanced statistical modeling in Python. That's not the point. The idea is that you, as the audience, simply want to be a better citizen. You have an interest in statistics. You don't have to have any advanced degree, a PhD in statistics or in math.
So that's the idea for today. I'm going to start with correlation.

As a simple definition, if I give you an informal view first, it's already in the name: a couple of things that are happening together, two things that are connected somehow. A bit more formally, correlation also gives you the strength of the association between two variables. It's easier to think about linear correlation when you're first getting into it. In linear correlation, the idea is that one variable is going up, the other variable is going up as well, and they kind of follow a line. That would be a positive correlation. Or one variable is going up and the other goes down, again linearly, which would be a negative correlation.

For example, to give you something a bit more concrete: when the temperature goes up, we sell more ice cream. Makes sense: nicer weather, more people are eating ice cream. Now, you can see that there is a connection between ice cream and temperature; intuitively it makes sense. But when you look at the bigger picture, at all the other possible examples of correlation, you don't always have causation. You don't always have a cause-and-effect connection between two variables. So the phrase you will always find referred to here is: correlation does not imply causation.

For example, talking again about ice cream: the more ice cream we sell, the more people die drowning. So what is going on here? Is ice cream really a serial killer? There's something called the lurking variable. A lurking variable is a variable that is sort of looking at us, but we don't see it. And in the case of our ice cream sales example, the lurking variable, again, is temperature.
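The ice cream and drowning example can be sketched as a small simulation. This is an illustration that is not part of the talk: the coefficients, noise levels and the `simulate` helper are all made-up assumptions, chosen only so that temperature drives both quantities while the two never influence each other. The partial correlation formula is the standard first-order one for a single control variable.

```python
import math
import random


def pearson(xs, ys):
    """Pearson linear correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def partial_correlation(xs, ys, zs):
    """Correlation between xs and ys after removing the linear effect of zs."""
    r_xy = pearson(xs, ys)
    r_xz = pearson(xs, zs)
    r_yz = pearson(ys, zs)
    return (r_xy - r_xz * r_yz) / math.sqrt((1 - r_xz**2) * (1 - r_yz**2))


def simulate(n_days=5000, seed=42):
    """Hypothetical model: temperature drives both series independently."""
    rng = random.Random(seed)
    temperature = [rng.uniform(0, 35) for _ in range(n_days)]
    # Assumed, invented relationships: neither series depends on the other.
    ice_cream = [10 * t + rng.gauss(0, 50) for t in temperature]
    drownings = [0.2 * t + rng.gauss(0, 1) for t in temperature]
    return temperature, ice_cream, drownings


temperature, ice_cream, drownings = simulate()
raw = pearson(ice_cream, drownings)
controlled = partial_correlation(ice_cream, drownings, temperature)
print(f"ice cream vs drownings:            {raw:.2f}")
print(f"same, controlling for temperature: {controlled:.2f}")
```

Under these assumptions the raw correlation comes out strongly positive, while the correlation that controls for temperature sits close to zero, which is what you would expect when a lurking variable explains the whole association.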
So if the temperature goes up, more people go swimming, and therefore more people can potentially die drowning. At the same time, more ice creams are being sold. So a lurking variable is something that is there, but we often don't notice it. We forget about it, or we just ignore it.

One more example of a lurking variable: when there is a fire incident, the more firefighters you deploy, the bigger the damage. So you might say, well, let's deploy fewer firefighters, so we keep the damage down. The lurking variable in this case is the severity of the incident. When there is a big issue, more firefighters are deployed. But then, you know, maybe politicians, when they make cuts to public services, kind of look at the wrong variables.

So when it comes to correlation and the notion of cause and effect, it's complicated. It could be that one is causing the other, or the other way around. Maybe there is a third variable involved: A and B together cause C, or C is the cause of both A and B. Or maybe there is a transitive dependency. Or maybe there is no dependency at all, so A and B simply correlate, but there is no connection. And just by looking at correlation, you can't really say whether you have causation or not.

Just to give you a few more examples: the number of movies where you see Nicolas Cage as an actor and the number of people dying by falling into a pool. Those two variables correlate, so Nicolas Cage may be bringing bad luck. Next, the consumption of margarine is also connected to violence, or murders by blunt object: margarine makes you kind of aggressive. Now, it is easy to blame Facebook for everything nowadays, but, you know, if you look at the numbers: more Facebook users, bigger problems for Greece.
So the national debt of Greece correlates with the number of Facebook users. Internet Explorer and, again, violent crime: there's a nice correlation here. And finally, one of my favorites: if you look at chocolate consumption per country and the number of Nobel Prizes per country, there's a nice linear correlation. So more chocolate means more Nobel Prizes. You see Switzerland kind of off the chart over there, and you notice a couple of nice outliers: Sweden having more Nobel Prizes than expected, who knows why, and Germany, where for once the Germans are not very efficient. They consume more chocolate than they should compared to the number of Nobel Prizes.

So that was correlation. Next, I want to tell you what happens when you start slicing and dicing your data set. What could happen is something called Simpson's paradox, which is a phenomenon first observed by someone whose name is not Simpson.

As a textbook example of Simpson's paradox, I give you some numbers about graduate school admissions at Berkeley in the 70s. You look at the total numbers and at the percentage of admissions, and essentially it is fair to ask whether there is some kind of gender bias, since the proportion of men admitted is much higher than the proportion of women. So it's fair to ask the question. Then you start slicing and dicing your data, you break it down per department, for example, and you see a different story. Looking at the proportion of admissions in the women's column, notice how for many departments this proportion is higher compared to the men's column. The thing is, when you look at the absolute numbers of applications, you see how men apply in big numbers for some departments, and in those departments they have a lower proportion of admissions. On the other side, women apply in low numbers where they have a high proportion of admissions. So there are a few interpretations here. Maybe men apply in big numbers for departments that are more challenging, or the
other way around, men apply in big numbers for departments that are kind of easier to get into. Essentially, what is the problem here? The problem is that if you have any kind of agenda, you can show one part of the story or the other, and all these numbers are correct. They're just telling you a different story.

So that was Simpson's paradox. Something else I want to mention is sampling bias. So what is sampling, first? Sampling is something we have to do nowadays in the age of big data. The idea is that we want to select a subset of data points, of individuals, from a bigger population, with the idea of making some estimate about the broader population. And we want, of course, a subset of individuals that is representative of the whole population.

What is bias, on the other side? When we use bias in everyday language, there's a bit of a negative connotation, because we associate bias with prejudice. Maybe there is some cultural influence going on. But in science, in statistics in particular, we talk about bias in a neutral fashion. It's just a systematic error. So when you put them together, sampling bias is some kind of error that happened during sampling. You didn't sample correctly. You don't have a representative subset of the population.

As a classic example, a big headline back in the 40s: "Dewey Defeats Truman". So that's President Truman, the day he was elected president, and he's waving a newspaper which is stating the opposite story. So what happened here? The newspaper, from Chicago, basically had to go off to the printer before they had the real numbers, and they simply trusted some kind of phone survey that had happened in the previous few days. But keep in mind, this is 1948. Not everybody had a phone at home, and when you look at who could afford a phone, it is a specific subset of the population. So the people who picked up the phone to answer the phone survey were definitely not representative of the population in general.
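The phone survey problem can be illustrated with a toy simulation. This is a sketch that is not part of the talk, and every number in it is an invented assumption: the share of wealthy voters, how each group votes, and how likely each group is to own a phone. It has nothing to do with the real 1948 figures; it only shows how polling phone owners can flip the apparent winner.

```python
import random


def simulate_poll(n_voters=200_000, seed=1948):
    """Compare the true vote share with the share seen by a phone-only poll."""
    rng = random.Random(seed)
    votes_all = []
    votes_phone = []
    for _ in range(n_voters):
        # All probabilities below are hypothetical, for illustration only.
        wealthy = rng.random() < 0.30
        votes_dewey = rng.random() < (0.70 if wealthy else 0.40)
        has_phone = rng.random() < (0.80 if wealthy else 0.20)
        votes_all.append(votes_dewey)
        if has_phone:
            # Only phone owners can be reached by the survey.
            votes_phone.append(votes_dewey)
    true_share = sum(votes_all) / len(votes_all)
    phone_share = sum(votes_phone) / len(votes_phone)
    return true_share, phone_share


true_share, phone_share = simulate_poll()
print(f"Dewey share, whole population: {true_share:.1%}")
print(f"Dewey share, phone survey:     {phone_share:.1%}")
```

With these made-up probabilities the candidate is slightly below half in the whole population but comfortably ahead among phone owners, because the sample over-represents the wealthy group.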
The people answering were, you know, kind of upper class, if you prefer, rich people at the time, and essentially the survey was giving a completely wrong picture. And that's what happens when you don't sample correctly: you don't get a representative subset.

A particular case of sampling bias is survivorship bias. Survivorship bias is what happens when you focus on the lottery winner and you forget about all the people who bought a lottery ticket without winning. So if you remember from the keynote this morning, when Lucas mentioned, "If I could do it, then you can do it as well", that's a classic kind of survivorship bias. He went through explaining what happened, and he was very honest, saying, well, when you are successful, it is always by chance, a couple of things could happen, and things can go in every sort of direction. But often, when you read the stories about, you know, "What do successful people have in common?", that's textbook survivorship bias. And if you look at successful people, well, if you consider being a billionaire a variable that determines your success, all these successful people, let's say Bill Gates, Steve Jobs, Zuckerberg and so on, are all college dropouts. So should you stop your studies? Should you quit college? The first thing about survivorship bias is that it works in both directions. I didn't drop out of college, and I didn't become a billionaire. So that's survivorship bias.

Next topic: data visualization. Why is data visualization important?
Maybe you've already heard the phrase "a picture is worth a thousand words". Essentially, the idea is that you can use pictures to better convey insights on your data, and often data visualization as a discipline will give you insights that you don't get just by doing some analysis on your data. Often we talk about summary statistics when we look at new data. Here we have a common example with four different data sets, and they all have the same summary statistics. So if you take the average value for x and y, if you look at the standard deviation, if you look at the linear correlation, if you look at a few summary statistics, they are all the same. But then, the moment you plot the data, you will see a very different story. So that's why data visualization is important. You get insights that you wouldn't have otherwise.

And you can use data visualization for storytelling, to convey complex topics in a simplified fashion. For example, here there is a picture about how different parties agree with some sort of court decision. This is based in the US, and the picture is showing how the Democrats have a much, much higher proportion of agreement. And that's fine, but for some reason the y-axis starts from 50 rather than from zero. So if you visualize the correct chart, the story is very different.
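How much a truncated y-axis exaggerates a gap comes down to simple arithmetic. The sketch below is an illustration that is not part of the talk; the two percentages are hypothetical, since the transcript does not give the actual figures from the chart, and `drawn_height_ratio` is a helper name made up for this example.

```python
def drawn_height_ratio(value_a, value_b, axis_start):
    """Ratio of the drawn bar heights when the y-axis starts at axis_start."""
    if min(value_a, value_b) <= axis_start:
        raise ValueError("both values must be above the axis start")
    return (value_a - axis_start) / (value_b - axis_start)


# Hypothetical percentages, NOT the figures from the chart in the talk.
group_a, group_b = 62.0, 54.0

honest = drawn_height_ratio(group_a, group_b, axis_start=0)
truncated = drawn_height_ratio(group_a, group_b, axis_start=50)
print(f"axis from 0:  bar A is {honest:.2f}x as tall as bar B")
print(f"axis from 50: bar A is {truncated:.2f}x as tall as bar B")
```

With these assumed values, the same eight-point gap is drawn as a bar about 1.15 times as tall on an honest axis and three times as tall once the axis starts at 50.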
Yes, the Democrats have a slightly higher proportion compared to the other columns, but they're kind of in the same ballpark. So depending on how you want to convey the message, you can simply cheat and cut the y-axis.

Again from the US, an example that is related to gun violence, of course a big topic in the US. In 2005 in Florida, they introduced a law called the "Stand Your Ground" law, and the chart shows how, as soon as the law is introduced, there's kind of a drop in the chart of gun deaths. Again, look at the y-axis: it starts from 1,000 and goes up to zero. So if you fix it, and on the right-hand side is the fixed version, reality is literally upside down.

Another example. This one is from the Italian public TV service, the equivalent of the BBC if you want. They ran a survey asking, "Is the government friends with the lobbies?" Of course, being friends with the lobbies is bad. So if you don't like the answer, 44 percent becomes a tiny, tiny slice. Luckily, I don't pay the TV license anymore.

So yeah, when I was living there, I was thinking, well, we are kind of world champions at this. Until I moved to the UK. So this is from a few elections ago: a few leaflets from a variety of parties. There is nothing as big as the bus with the fake numbers on it, but look at the numbers. This one is from the Conservatives. They're telling a story about how you shouldn't vote for the others, blah blah blah. According to the graphic designer of the Tories, 42 percent is much smaller than 32 percent. So that's one way of looking at it. The next one is from the Lib Dems.
They are kind of a "close second". It's one of those messages: yes, go out and vote, because we are a close second and we need your vote to catch up. But then when you look at the real chart, they were second, but not very close.

And for completeness, one more. This one is from the Labour Party. In the UK there's always this story about a race between two horses, because of the first-past-the-post system, and they say, you know, don't waste your vote on the third little guy, you have to vote for us. So it is a two-horse race, although they literally forgot the second horse: there was the Green Party in this particular constituency. So yeah, I tried to pick examples across different parties, either to upset everybody or to make everybody happy.

Next, I want to talk about statistical significance, a slightly more technical topic. The issue is that statistical significance is one of the most unfortunate terms in statistics, because in everyday life, when we talk about something as a significant event, we also consider it important. In statistics, we don't really have this idea of something significant being important as well. So when we look at statistically significant results, essentially we are simply a little bit more sure that the results are reliable, that they are not due to chance. It doesn't mean that the results are important. It doesn't mean that the effect is big. It doesn't mean that the results are even useful for any kind of decision process. Simply, we are a little bit more sure that they are not really random. But then, of course, when scientists report statistically significant results, a journalist can take the word "significant" and make a different story.

There is one more thing connected to the notion of statistical significance.
We have this concept of p-values. It's one of those concepts that, you know, when I was a student I couldn't figure out the meaning of p-values, and now, many years later, I wish I could tell you that I fully understand p-values, but I don't. So I was trying to look it up, and there is even a big Wikipedia page about what p-values are not. So if you don't understand p-values, you're not alone.

So let's try to figure out what p-values are. Essentially, it's about probabilities: the probability of observing results at least as extreme as ours when the null hypothesis is true. We're talking about probability, not certainty. And more importantly, typically in publications you will see the p-value compared to some kind of arbitrary threshold, which often is 0.05. Some fields are a little bit more strict, others are a little bit more relaxed. Essentially, a small p-value makes us a little bit more sure that what we see is not random, and it doesn't tell us anything else.

Related to the notion of p-values, we have a practice called data dredging. That's "dredging". When we talk about data dredging, also called p-hacking or data fishing, essentially we are looking for significance before testing, before having any hypothesis. The convention should be: we formulate some hypothesis, we collect data, and we confirm or reject the hypothesis. Often people try to simply brute-force the data that they have, look for something that is statistically significant, and come up with a hypothesis in retrospect. So the idea is: if you're looking for patterns, you are doing some exploratory analysis, and that's fine. But back-testing your hypothesis on the same data set is not, because you are going to confirm what you already saw.
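Data dredging can be demonstrated on pure noise. The sketch below is an illustration that is not part of the talk: it compares two groups drawn from the same distribution, over and over, and counts how often the comparison looks "significant" at 0.05. The group sizes, the number of variables and the helper names are assumptions, and a simple z-test with a known standard deviation of 1 stands in for whatever test a real study would use.

```python
import math
import random


def z_test_p_value(group_a, group_b, sigma=1.0):
    """Two-sided p-value for a difference in means with known std dev sigma."""
    n_a, n_b = len(group_a), len(group_b)
    diff = sum(group_a) / n_a - sum(group_b) / n_b
    z = diff / (sigma * math.sqrt(1 / n_a + 1 / n_b))
    # For a standard normal Z, P(|Z| > |z|) equals erfc(|z| / sqrt(2)).
    return math.erfc(abs(z) / math.sqrt(2))


def dredge(n_variables=1000, n_per_group=50, alpha=0.05, seed=7):
    """Test many pure-noise variables and count the 'significant' ones."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_variables):
        # Both groups come from the same distribution: no real effect exists.
        group_a = [rng.gauss(0, 1) for _ in range(n_per_group)]
        group_b = [rng.gauss(0, 1) for _ in range(n_per_group)]
        if z_test_p_value(group_a, group_b) < alpha:
            hits += 1
    return hits


false_positives = dredge()
print(f"{false_positives} of 1000 noise variables look significant at 0.05")
```

Roughly one in twenty of these comparisons should cross the threshold by chance alone, so anyone who tests enough variables and reports only the hits will always have something "significant" to show.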
Testing on the same data is kind of like cheating. So that was statistical significance, a slightly more technical topic.

One more thing, a little bit of bonus content: celebrities on Twitter. This could be a talk on its own, a talk of many, many hours, about celebrities talking on Twitter about things they don't understand and coming up with strong statements. And, you know, I could be lazy and pick any kind of celebrity, and I would find something. Here I went for Bill Gates, who is doing a lot of work nowadays in terms of charity, fighting malaria, and a lot of humanitarian work. Talking about his work, he's basically trying to make a point here about how mosquitoes are dangerous, and he's reporting some numbers: the number of people who die when they encounter a shark is relatively small compared to the deaths from mosquitoes. Those are raw numbers, and that's fine. The point is, well, again, we're looking at numbers, but what is the cause? The conditional probabilities are, of course, different. I have never encountered a shark in my life, but I've encountered many, many mosquitoes. So you see, the picture here is reported as if mosquitoes are terrible. Of course, they are the cause of malaria, and he's focusing on that at the moment. But the condition that we are looking at is something completely different.

So, wrapping up: I gave you a few examples of lies and big lies, something done on purpose or something done by mistake.
So it feels like we are screwed. Essentially, everybody lies. Now, the point of the talk is not to create a new generation of conspiracy theorists or anything like that. The point of the talk is simply to make clear that this kind of problem can happen to everybody: to us as citizens in the general case, and to us as specialized users. Some of us are data analysts, some of us are data scientists, some of us may have more formal training in statistics, others less. The idea is that it's important to ask questions and to make a clear distinction between what is good science versus, you know, big loud headlines. And as I said, nobody is immune to this. It could happen by mistake to everybody. So it's always important to ask questions. For example: What is the context? What is the bigger picture? Who's paying for a particular study? Is there anything that is not being reported? Is there anything missing? And all in all, you know, so what? You see some numbers, you see some statistics. So what? You should put things in context.

So that's the end for me. Thanks for sticking around. I just want to mention, as Alex mentioned already, I'm part of the organizing committee for PyData London. So if you want to listen to people who know about statistics and data science and machine learning and data engineering and all that stuff, next month we will be down the road near Tower Bridge with a three-day conference. Thank you very much.

We have two minutes for questions. You're close.

Thank you. Thank you. That was great.
That was very funny. Unfortunately, we live in a world where the truth doesn't matter as much as important people saying stuff. And, you know, we have the benefit of being in this lovely room and listening to these cool talks, and there are millions of people not in this room listening to this cool talk who think, for example, that 350 million pounds a week is a reasonable-sounding number. So what more can we do, other than just be informed and just us knowing about this?

Yeah, that's a huge question. So the opposite of ignorance is probably education, so that's my starting point. And of course money always wins against education and good intentions. So I don't have any short-term or long-term plan, but I would say yes, education, and carrying on with education, is kind of the reasonable, good-citizenship type of thing. In terms of making big statements, there is always the issue that there are parts of society that should be held accountable by other parts of society. So when there is a politician making a wrong statement, wrong as in reporting wrong numbers on purpose, there should be a journalist asking the question. It's not happening every time. It's happening less and less nowadays. And with what we experience online, with the freemium model of free content paid for by advertising, it's much easier to publish, you know, extreme views, extreme polarizing content if you want, than it is to have a proper in-depth analysis of whatever issue you're discussing, proposing mild views and suggestions. So yeah, it looks like we're screwed.

Another question: how can I get better at lying with statistics?

Yeah, that's not kind of the point, I guess. Yeah, if you have an agenda, you can always twist the table and propose correct numbers that suggest that your story is correct, while pretending to ignore numbers that are not supporting your point. That's what people do all the time, right?
So is omission really a lie? Well, it depends on the context again.

I think you just need more Twitter followers. Yeah.

Thank you very much, Marco. We're out of time for questions, but I'm sure you can take them offline.

Yeah, we can carry on. Thank you very much.