So thank you, everyone. I'm really glad to have so many people here from all over the world. We're going to talk about SARS-CoV-2 in terms of data and data visualization. It's probably not going to be a surprise to any of you that one of the side effects of COVID has been a deluge of statistics all over the place. If you just look at a few examples, these are two examples of official websites: the Swiss Federal Office of Public Health at the top, the Centers for Disease Control at the bottom. These are the groups that you expect to produce statistics in the case of an epidemic. On top of that, there's been a large number of websites that have produced data and have started to create data visualizations. So you have a few here. You have the Johns Hopkins website, Our World in Data, Worldometer. And for the Swiss among you, at the bottom right, you have the corona-data.ch website. So there's been a lot of these, and probably most of the ones on this slide already existed before COVID, so it's not a surprise that they would handle this data. But one thing that has been new to me is the number of media outlets that have started creating plots and dashboards all around the world. You have a few Swiss ones here. You have the New York Times, and the two at the bottom are from the Financial Times, which we'll come back to. All these newspapers, all these media outlets, had data journalists on board, but it looks like all of them have been working on this topic in the past few months. And this topic, as I mentioned, is not that easy to handle. Some things look easy: you just count cases, you count deaths and you plot them. But in fact, there are a lot of subtleties here, and that's more what I wanted to discuss today.
But I was not interested just in seeing what people did right or wrong and how they did it. I really wanted to go a bit further and ask: when we do data analysis anywhere else than with COVID data, do we have the same issues? Do we ask ourselves the same questions? And how does it help us to have these statistics and data visualizations all over the place? And when I say all over the place, I really mean it; there have even been articles about it. This one was in the New York Times, an opinion piece wondering if obsessing over daily coronavirus statistics was counterproductive. I'm sure many of you went through the same kind of phase, where at some point most people were looking at the stats every day to see how we were doing. And indeed it was probably a bit too much. So what can we learn from this? Not the obsession, of course, but the way to manage data. And that's what we're going to be interested in today. Before we really start, I just want to mention that one thing I was not missing when preparing this webinar was things to talk about, data ideas, because I have literally, I think, hundreds of tabs open in my web browser with hints and ideas that I could use. So I didn't manage to put in this talk everything that I wanted, not even everything that I promised, because it filled up pretty quickly. If there's anything you expected or were wondering about, please don't hesitate to ask in the questions or contact me afterwards, because it is really a wide topic. Okay, now we can really start, and I wanted to start from a well-known saying that I'm sure many of you have heard: we should let the data speak for itself. There's something logical, something intriguing about that, which is that we should not cook the data, we should not massage the data until it says whatever we want it to say. But still, I cannot help disliking this sentence, because, as I read on social networks, it assumes that the data has a voice.
So we should not expect the data to speak for itself; it's probably a delusion to think that data ever speaks for itself. Whatever we do, there are always a lot of steps with human intervention, where we have to make choices, and these are ways to bias the data, either in good faith or in bad faith. In the case of the COVID data, this has been more apparent than ever. So we're going to look at a few cases of why the data does not speak for itself, and in how many ways the data looks different depending on who speaks about it. So, there are a few reasons why the COVID data will not speak for itself. And I want to be clear about that: the COVID-19 data is dodgy. Wherever you are, it is not possible to have perfect data. And this is probably an understatement; whatever you're doing, it's not possible to have perfect data, but the COVID-19 situation has shown us how complicated it is to get reliable data. So what's the issue? I could make you a long list, but even if you just look at one country like Switzerland, you'll find that there are different sources for the data and they do not agree. Some discrepancies have been going on for weeks. The data is not always updated in a timely manner, so sometimes it can go for days with some counts missing. At some point, it will be updated retroactively, so the numbers from one or two months ago will change because there's been some new data. The reporting will change. The methodology used to count will change, and we'll see a few examples of this in just a few slides. I'll just give you one example. Some countries, like France or the UK, at some point were not counting the people who died in nursing homes. And at some point, when they were convinced that the data was of good enough quality, they started including them in the count.
And suddenly you have data that has changed because the way it was calculated has changed, so it becomes more complicated to compare the earlier data with the new data, or to make other comparisons. Sometimes it's only a matter of bureaucracy: the data is simply collected in a way that does not make sense, especially for scientists. For those of you in Switzerland, you've probably heard that there was a lot of discussion in March and April about how the federal authorities were not able to count the cases, that they had archaic methods, et cetera. But I don't want to delve too much into that, because most of the issues I just mentioned are not that bad. It's interesting psychologically to see how much scientists, but also the general public, have asked for up-to-date data and have complained that the data was not good enough. But by and large, in terms of having data good enough to make policy from, so that you know whether you're in trouble because you have too many cases or whether you're doing well, we've had this data. So was it perfect? By far it was not. Was it good enough? From what I've seen, most of it was good enough. We'll see a few plots. And I've never seen specialists complaining that they didn't get access to the data when they needed it. They complained because it was not possible to get some of the data, some correct data; we'll come to that in a few slides. But this problem of sources that do not agree may be a difference of plus 10 cases. It's not great. It would be better if they matched. But data is never perfect, and here it was, in my opinion, not too bad. However, if you look closely, you'll find some cases where it doesn't go well. The example I'm going to give you here is on the bureaucracy side. It's quite new; it was published in the Financial Times last week.
They showed this plot of the number of new cases in Leicester, in the United Kingdom. You don't have to know much about data visualization to see that we had something relatively high here in early April and that we've been going down. But if you look closely, you will see that we're told that this is only Pillar 1 data. And I guess most of you, just like myself a week ago, have no idea what Pillar 1 data is. In the UK, the data is split into several sources, and Pillar 1 is everything that comes from tests carried out in NHS and PHE laboratories, so the state laboratories. Pillar 2 is the data based on tests carried out by commercial partners or private labs. And these were not publicly available for cities. For Leicester, for example, only Pillar 1 data was available; Pillar 2 data was not provided. So when the journalists found out about that, they looked for the data and managed to find it, and they found that the real graph looked not like this, but like this, because in recent days private laboratories have accounted for more than 90% of new cases. And of course, this type of case is bad, because this is the type of data that you could base a policy on in a city, and if you look only here, it gives you the feeling that everything is going well, when in fact everything is getting worse. So this is not good, but to me, it is the exception rather than the rule. By and large, and I cannot talk about all countries, but in those I've seen, the data was good enough that specialists, epidemiologists, know what's happening, and they can base their policy on this data. But again, as I said, that's not what I really wanted to talk about today. I was more interested in what we can learn in terms of how we handle data in general. And to really understand some of the issues we have here, we have to remember what we do when we do statistics.
It's often forgotten, but almost every time you calculate a statistic, and every time you create a data visualization, what you're doing is a comparative task. You may have, for example, cases and controls, and you want to know if your cases are different from your controls. By cases and controls I mean it could be treated versus not treated, or knockout versus wild type if you're in genetics. Either way you have different groups, and a good scientific experiment always has groups because you always need controls. Or it could be different time points, something that you follow over time. Or it could be different cases: here it could be different countries, and you'd like to compare them as different groups to see if they behave in the same way. So we're always doing a comparison. Most of the time, when people do bad statistics or create bad data visualizations, it comes from the fact that they perform the wrong comparison. Those are the examples we have here. When we're doing comparisons in science, something that we almost always do is adjust for confounding variables. Or, as we often say, we normalize. That's something that we should always ask ourselves about when we get some data. And I'll just give you a very simple example. Let's imagine that you're looking at water and you're counting bacteria. There are two samples; in the first sample you find two bacteria, and in the second one you find eight bacteria, but the second sample is a 4 ml sample rather than a 1 ml sample. What anyone would say about that is that there are more bacteria in the 4 ml sample. This is true, but we do not care, because there are more bacteria only because there's more water. So what most people would do is say, okay, let's normalize by sample quantity, and each sample has two bacteria per ml, so basically they're the same. Fair enough. So that's the easy one.
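
The water-sample example can be written out in a few lines. This is only a minimal sketch: the counts and volumes are the ones from the example, while the dictionary layout and the function name `density_per_ml` are illustrative choices.

```python
# Counts and volumes from the water-sample example in the talk.
samples = {
    "sample A": {"bacteria": 2, "volume_ml": 1.0},
    "sample B": {"bacteria": 8, "volume_ml": 4.0},
}


def density_per_ml(count, volume_ml):
    """Normalize a raw count by the volume it was counted in."""
    return count / volume_ml


densities = {
    name: density_per_ml(s["bacteria"], s["volume_ml"])
    for name, s in samples.items()
}

for name, density in densities.items():
    print(f"{name}: {density:.1f} bacteria per ml")
```

The raw counts differ by a factor of four, but once divided by volume both samples come out at the same density.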
There are many cases, when you have real biological data, where it's not that easy to find out how you should normalize, but I'm pretty sure all of you who have been in a lab, all of you who have analyzed data, will say of the bacteria example, well, that's kind of easy. And that's fair enough. The problem is, how do we do that when we have COVID data? How can we compare the data and how can we normalize it? I won't keep the suspense too long: the main answer is that this is real data, this is observational data, this is not in a lab, and by and large there are not that many ways in which we can normalize the data. But let's see how we could do it. The first thing we would absolutely need would be to adjust, for example, the number of cases according to the number of tests that have been performed. Most of you have probably heard Donald Trump, the US president, who said last week something along the lines of: yeah, but our numbers are high because we test a lot, and the solution is that we should test less so we look better. There have been a few people saying he was joking; he said he was not joking. That's beside the point. The problem is, indeed, if you do more tests, you're going to have larger numbers. And there's no meaningful way in which we can compare results depending on the number of tests, because the number of tests is not the only variable of interest. More than that, it's the policy on testing. For example, in Switzerland, in early April, when there was a shortage of tests, tests were used only for people who were at risk or who had really strong symptoms. So there was a good chance that they would be positive, and we would get a high proportion of positive tests. If you do something more akin to what countries like South Korea have done, which is to test everyone as soon as they have a small symptom or as soon as they want it, you're going to do many, many tests.
I've heard numbers such as, in South Korea, for every single positive case they've done 8,000 tests, meaning that they do a lot of testing. But it's not just a matter of the number of tests. It's a matter of policy: why you do the test and who you test. And sorry to be blunt here, but I don't think anyone has a good way of adjusting for these matters. The other issue we have is adjusting for the infected population. And that's quite a big deal. For example, you would like to know the mortality rate of this virus. You'd like to know if it's deadly. So we'd like to know, among all the people who caught the virus, how many have died. And there was a lot of discussion at the beginning of the epidemic, saying it's 100-fold more deadly than the common flu, or 10 times, etc. And again, no one really knows this number so far. It's not the number of tests that will help us find out about the infected population because, as I just said, it depends on the policy. But in the future, if we're not in a hurry, we may get serological data that will tell us how many people have antibodies, and so we should be able to get an idea of how many people caught it. This comes with its own issues. But my main point here is that, at some point, we must be humble and realize that we just do not have the data. And I'll just give you two examples. One thing that is more reliable than tests is the number of deaths, of course, because you may miss someone who has coronavirus, because you may not test him or her and so not know about this person, but it's harder to miss a death. Still, not all deaths are recorded, and not all deaths are assigned to coronavirus. But you can see that quite a few countries have started creating this kind of plot, or always did, but they've become prominent. This plot here for Switzerland, that's the plot from yesterday, shows in green the normal range of deaths per week in Switzerland. So we start at the beginning of the year, and that's the end of the year.
You see that in winter, the normal range is a bit higher than in summer. This is due to the flu and other diseases like this. And you can see here this black curve, which shows an increased number of deaths that were not accounted for. There were about 30% more deaths in that period than what we counted as due to the coronavirus. The problem with these numbers is that they're much better than anything else, but they're definitely not perfect. They're definitely not perfect because... right, someone had an issue with the plot. I hope it works now; I've tried moving back and forth. It's not perfect because the authorities have admitted that we don't know why these people died; maybe some of them died because they were confined, or for some other reasons. Statisticians do their best here, but we'll never have the data. So what is important to remember here is that getting data is complicated. In many cases here it is going to be impossible, and making comparisons is going to be hard because of that. So we'll have a quick look afterwards at what we can compare, what works or does not work, but there are many things that do not work. I'm just showing you another graph here. This is data from Florida. One trick is that you should not rely on a single number alone. For example, you should not just rely on the number of positive tests. You should not just rely on the number of tests. This is not one of my favorite graphs; I'm not a big fan of having two data sets on the same graph. You can see the light background bars show the number of tests that were performed over several weeks. The dark bars at the bottom show the positive tests, and you have a line that shows the percentage of positive tests. You can see that since March, the number of tests performed has increased, which is a good thing: we're better at finding out whether people have the virus. The number of positive tests has increased as well, as you can see at the bottom.
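
The three series on a graph like this one are linked by a simple ratio. Here is a minimal sketch with invented weekly counts (these are NOT the Florida figures, which are not given in the talk); the names `weekly_tests`, `weekly_positives` and `positivity` are hypothetical.

```python
# Hypothetical weekly counts, invented for illustration (not the Florida data).
weekly_tests = [10_000, 20_000, 40_000, 60_000]
weekly_positives = [1_500, 1_600, 2_000, 4_800]


def positivity(positives, tests):
    """Percentage of positive tests for each week."""
    return [100.0 * p / t for p, t in zip(positives, tests)]


positivity_pct = positivity(weekly_positives, weekly_tests)

for week, (t, p, r) in enumerate(
    zip(weekly_tests, weekly_positives, positivity_pct), start=1
):
    print(f"week {week}: {t} tests, {p} positives, {r:.1f}% positive")
```

In these made-up numbers both raw counts rise every week, yet the percentage of positive tests first falls and then rises again, which neither count shows on its own.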
But if you only have these two pieces of data, the number of tests and the number of positive tests, it's hard to say whether the number of positive tests increased just because the number of tests increased. Of course, if you do more tests, you can only find more positive people. That's where a further piece of information, like the percentage of positive tests, will give you an insight. In this case, it was high at the beginning, in March; it went down, and then it started going up again. And that's the worrying part. Okay, you're screening more, but in proportion, each test brings you more cases. So that shows that it's not only because you're testing more that you get more cases. It's also because there are slightly more cases. And if you want to lie with statistics, it's very easy to pick only one number or just one part of the graph and say, look, it's not going up, or it's going up for this reason, or it's going this way, etc. You have to look at several numbers to really get a good feel for the situation. And even then, it's not possible to do anything perfect. Another type of normalization that has been discussed a lot is population size. And I think this is a really important one. If you're looking at cases of cancer in a country, it's going to be very natural to consider the number of inhabitants in the country. You don't want to just say, okay, we have 1,000 cases of cancer, because 1,000 cases of cancer in Switzerland, in the US or in China is going to be something very different. It's not that cancer strikes at random; we know perfectly well that it's not random. But within a country, given the general policies of the country, like screening, the health system and a lot of other things, it's natural to think that the number of cases of cancer depends on the number of inhabitants. If you have 10 times more people in the country, you should naturally have 10 times more cancer. Even that is not a given.
For example, you often have a lot of heterogeneity within a country, where some part of the country will have more cases for one reason or another. In a sense, there's no reason in many cases why official borders are the natural way to compare diseases. Of course, when you're looking at the effect of a policy, for example if the Swiss government decides that everyone should get vaccinated against a disease, it makes sense to see what effect it has globally in the country. But in many cases, there's no reason why it's the country level that is important. And in the case of the spread of a virus, by and large, it does not depend on the number of inhabitants. When you have a country where no one has COVID, that's Switzerland in mid-February, for example, and you get the first case, this first case is going to spread in the direct neighborhood of the first person infected. And whether the country has 100 people, 1 million people or 1 billion people will not make a difference. What's going to make a difference is the number of hotspots, the number of people spreading the virus. At least at the beginning, having a large country or a small country will not change the spread of the virus much. Of course, after a long time, when there's almost no one left to infect, the curve will stop, and the number of people who spread the virus will depend on the number of inhabitants. But contrary to other measures, it mostly does not make sense to calculate the relative infection rate of people in the country. What you get if you do that is that small countries will seem disproportionately badly affected. Take a small country like Luxembourg: you need only one person to come in, and by themselves this person will give you a really high rate, while they will transmit the virus at exactly the same speed as a single person would in the US or in China.
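
To see how strongly a per-capita rate reacts to country size early in an outbreak, here is a minimal sketch. The populations are rounded, approximate 2020 figures used only for illustration, and the count of 10 early cases is an assumption, not a number from the talk.

```python
# Approximate, rounded 2020 populations in millions (illustrative only).
population_millions = {
    "Luxembourg": 0.63,
    "Switzerland": 8.6,
    "United States": 330.0,
}


def cases_per_million(cases, pop_millions):
    """Convert an absolute case count into cases per million inhabitants."""
    return cases / pop_millions


# Assumption: the same absolute situation everywhere, 10 known cases
# shortly after the virus was first imported.
early_cases = 10
early_rates = {
    country: cases_per_million(early_cases, pop)
    for country, pop in population_millions.items()
}

for country, rate in early_rates.items():
    print(f"{country}: {early_cases} cases = {rate:.2f} per million")
```

With identical absolute numbers, the smallest country shows a rate several hundred times higher than the largest, although the local spread is the same.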
This depends on policies, of course, but at first, when we don't know that the person has symptoms, it will be the same. So that's the reason why, if you look at many graphs and many data sets about COVID, they will mostly show you absolute numbers. And even though that may seem a little bit strange, because that's not what we usually do in science, it's probably the right thing to do. If you go to some websites, and I'll come back to this one, the Financial Times is one of my favorites, you will see, if you look at the raw numbers, quite a nice plot. I've put four countries here. You can see Luxembourg is quite high compared to the others because, despite its small population, the number of cases does not really depend on the population. If you change that to the number of cases per million inhabitants, you can see that Luxembourg, which is the light blue curve here, is much higher than everyone else, at least at the beginning. Why? Because you just need a few infections, and you get a certain number of infected people out of a small population, so a high rate, while other countries will have a small number of infected people out of a much larger population. I'm not saying one of them is better than the other, but it's always a bad sign when changing the way you present data changes the story: here Luxembourg is at the top and then very quickly at the bottom of the four countries, and here it is above all of them and then crosses again. So it's a good idea to look at the data in different ways and compare them; otherwise we may get bad data. Or bad conclusions, at least. Now, before we talk about data visualization, let me go quickly over different things that can change your interpretation of the data. Your data may have different starting points. For example, not all countries had their first case at the same time. So how do you compare them? How do you compare countries where the epidemic started maybe a month apart?
I did a quick survey of several websites that show COVID data, and I found, for example, these four ways of representing the data. Some in fact align the data on the actual date, so there will be a shift depending on whether a country got its first case last year or in March. Some of them look at the days since 10 daily cases were first recorded, or the days since the 100th case was recorded, or the day on which the number of cases reached one per million inhabitants. Each of these may give you a slightly different way of representing the data. And they're all arbitrary. There's no right answer; there's no perfect way to align the data. If we go back to what we were looking at before, you'll see, for example, on the raw numbers, this is the Financial Times, that they use the number of days since 10 daily cases were first recorded. You can see the scale here. And you can see on this one that if you compare Luxembourg and the United States, they're quite far apart. And most importantly... which one did I want? Luxembourg and Belgium are more or less the same here, but I see my arrow has moved a little bit since I created it. And you can see that the United States is at a different time point than Belgium or Luxembourg, which are themselves at two different time points. But if you look at the other graph, per million inhabitants, the rule of the Financial Times is that they align the data on the number of days since 0.1 daily cases per million were first recorded. You see that the four countries we're looking at here, but mostly the three we're interested in, the United States, Luxembourg and Belgium, are all at the same time point in the epidemic. So that gives you an idea. Here we have maybe 20 days of difference, and here they're aligned. I'm not going to say which one is best, because there's no best. But again, we have to be careful: these little changes in how we normalize the data and how we decide on the starting point will change your story. They will change what the data looks like.
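
The effect of the alignment rule can be checked on a toy series. This is a minimal sketch: the two thresholds (10 daily cases, 0.1 daily cases per million) are the ones quoted above for the Financial Times graphs, while the daily counts, the population of 0.6 million and the function name `align_on_threshold` are invented for illustration.

```python
def align_on_threshold(daily_values, threshold):
    """Drop the days before the series first reaches `threshold`.

    The first element of the result is 'day 0' of the aligned curve.
    Returns an empty list if the threshold is never reached.
    """
    for i, value in enumerate(daily_values):
        if value >= threshold:
            return daily_values[i:]
    return []


# Hypothetical daily new cases for a small country (invented numbers).
daily_new_cases = [0, 1, 2, 4, 9, 15, 30, 60, 90]
small_country_pop_millions = 0.6  # assumed population, in millions

# Rule 1: days since 10 daily cases were first recorded.
aligned_by_count = align_on_threshold(daily_new_cases, threshold=10)

# Rule 2: days since 0.1 daily cases per million were first recorded.
daily_per_million = [c / small_country_pop_millions for c in daily_new_cases]
aligned_by_rate = align_on_threshold(daily_per_million, threshold=0.1)

shift_in_days = len(aligned_by_rate) - len(aligned_by_count)
print(f"day 0 differs by {shift_in_days} days between the two rules")
```

For a small country the per-million threshold is reached with the very first case, so its "day 0" moves earlier by several days compared with the absolute threshold; a large country would shift differently, which is why the two graphs line countries up differently.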
The choice of starting point will make countries look like they are at the same point in the epidemic, or it will make it look like some of them are far apart. Again, that's a good sign that the data is hard to interpret and that we should be really careful about it. In my opinion, it's quite hard to compare countries, for all the reasons we just discussed. What you should do is compare each country with itself, and you can compare the shapes of the curves. For example, I go back to the exact same graph. You can see that these four countries follow more or less three different trajectories. Luxembourg went a little bit high, then down quite strongly, and then they're going up again in some way. Belgium and the United Kingdom had the same trajectory: they went up, and then they're coming down like a firework, they're going down very slowly. And it looked like the United States was following more or less the same pattern, except that they seem to be going through a second wave when the first one is not finished. So this is a comparison we can make. Comparing which curve is highest, which curve shows the largest number of people, is a bit harder to do. It makes more and more sense as time goes by, because the differences between small countries and large countries will tend to disappear as the epidemic spreads. But it's hard to draw any conclusion from vastly different data. Okay, I'll just skip that because we don't have that much time left. A couple of little things before I give you the final words on data visualization. Some people have been looking at cumulative cases; others have been looking at new cases. Here I'm quite confident we should look at new cases. Why? Look at this plot that was shown by Donald Trump at the White House, which shows the number of tests that were performed. You can see this is a cumulative curve. You can see that the number of tests increases. And it can only increase, because it's a cumulative curve.
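
That a cumulative curve can only go up, whatever the daily numbers do, is easy to check. A minimal sketch with invented daily test counts (not the White House data):

```python
from itertools import accumulate

# Hypothetical daily test counts: a jump early on, then flat capacity.
daily_tests = [100, 100, 5_000, 5_000, 5_000, 5_000]

# Cumulative curve: each value contains all the previous days.
cumulative_tests = list(accumulate(daily_tests))

# Differencing the cumulative curve recovers the daily values.
recovered_daily = [cumulative_tests[0]] + [
    later - earlier
    for earlier, later in zip(cumulative_tests, cumulative_tests[1:])
]

print("cumulative:", cumulative_tests)
print("daily:     ", recovered_daily)
```

The cumulative series keeps climbing even though the daily count has been flat since the third day; differencing it brings the flat daily series back.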
So whatever you do, each bar contains all the previous bars, and it's only going to increase. The goal here was to show that the number of tests done for COVID increases. But of course it increases; it cannot decrease. If you do what someone who looked at this plot did, and plot the number of new tests instead, you see that the number of new tests does not increase. What has changed is that at some point here, it's not readable but probably in March, the number of tests increased from very low to a bit higher, and then it was quite stable. So while the first graph gives the impression that the testing capacity has increased, it's just the total number of tests that has increased. The testing capacity has stayed the same. And that gives a very different message. So I will argue that what we're mostly interested in is not the cumulative number but what happens on a day-to-day basis. If I go back to my curve here, we don't care about the fact that Luxembourg went up and is going up at a certain scale. What we care about is the fact that it was up, then down, and then up again. We'll see it again. There's just one thing missing to be able to really understand the data, and that's probably the most important question of all: whether we should use a log scale or a linear scale. In science in general, there are many good reasons to log data. One of them is when the data spans several orders of magnitude. That happens quite often with genomic data, for example: you have some counts that are very small, 10 or 100, and some others that are in the order of a million. That's a good reason to use logs. Another is when the data changes by a multiplicative factor rather than an additive one; often these two reasons are linked. And in the case of an epidemic, that's clearly what's happening, because every person who has the disease is likely to infect maybe another two people, depending on the current values of the propagation variables.
So that means every person infects another two, who themselves may infect another two. So we have a multiplicative factor here. And when you have this, you get curves that will be easier to interpret on the log scale. So in the case of the exponential process that we have here, both reasons apply. If you look at the cases in Switzerland, on the cumulative curve, you can see we went up; we have this curve that is shaped like an S. And it's really hard to see what's happening in the last few days, while it's a bit easier on the new daily cases. But even on the new daily cases, because the curve went up to more than 1,000 cases a day, the scale does not allow us to see the little things that happened in recent days. I didn't mention it earlier, the slide was skipped, but a word about smoothing the data. What you see in gray in the background is the actual data per day, and the blue curve is the smoothed data. It's quite important to smooth the data, because there's a lot of variability from one day to another, from a weekend to a weekday, etc. And these curves are much more readable. So new daily cases are a bit easier to interpret than cumulative cases. But most importantly, if you plot the data on the log scale, the cumulative curve is not that interesting, but the new daily cases curve comes back to what we were seeing in the Financial Times curve earlier. You can really see not only what is happening when we have high numbers, around 1,000, but also this trend, this unfortunate trend going upward here. To me, what is really important is to realize that when you add 50 new cases within a day, it is huge in Switzerland. 50 new cases is about what we get at the moment. So if you're telling me that we are adding another 50, so we get to 100, it's a big deal. It means the epidemic is out of control. If you add 50 new cases in the US, it's completely negligible, because they already have three orders of magnitude more new cases per day.
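
On a log axis, what counts is the ratio between the new level and the old one, not the difference. Here is a minimal sketch of that arithmetic; the two daily levels (50 and 50,000) are illustrative round numbers three orders of magnitude apart, not official figures, and `log10_step` is a hypothetical helper name.

```python
import math


def log10_step(current, added):
    """Vertical distance on a log10 axis when `added` cases are added to `current`."""
    return math.log10((current + added) / current)


# Illustrative round numbers: a low daily level, and one three orders of
# magnitude higher.
step_low_level = log10_step(50, 50)        # 50 -> 100, a doubling
step_high_level = log10_step(50_000, 50)   # 50,000 -> 50,050, +0.1%

print(f"adding 50 to 50:     {step_low_level:.4f} log10 units")
print(f"adding 50 to 50,000: {step_high_level:.4f} log10 units")
```

The same additive change is a clearly visible step at the low level and practically nothing at the high level, which is the behavior a log-scale plot is meant to show.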
Your data, and the way you look at it, should reflect that difference. So in Switzerland at this point, adding 50 new cases is a big deal, and we should see it on the plot. In March, when we were at really high levels, adding 50 new cases would make no difference, and the log curve will not show anything there, because these are additive effects. In terms of multiplicative effects, when you add 50 to 50, you're multiplying by two, so it's a large effect and you see it on the log curve. When you have 1,000 cases and you add 50, you're multiplying by 1.05, so 5%, so you don't see it, and you should not see it on the curve. If you did see it, it would mean you're not able to see other things. So I have little doubt, and that's the reason why most graphs favor the log scale. The FT curve that I've shown here, and I've shown a few other versions before, uses the log scale, and you can see exactly what's happening in Switzerland. You can see France going up and down. This dates from two weeks ago; you can see the United States being flat. And across all of these you can compare the trends, which one is going up and which down, even though the orders of magnitude are very different. If this was done on a linear scale, Switzerland would just be flat and you wouldn't see anything, and definitely not the shape of the data. Just a warning: data on the log scale may be counter-intuitive. Most of us are not used to interpreting it, and even if we are, it's harder to interpret. I'll just give you one example. Imagine that this is your curve on the log scale, so you've got days and you've got a curve that shows a number of cases going up. Most people would look at that and say, look, we can see that this curve is flattening; we can see that it's not as steep as it used to be. Well, if you look at it on a linear scale, you'll see that the curve is not flattening at all. The thing is, this curve that I made up is not linear, as you can see on the bottom plot.
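
A curve of this kind can be reproduced numerically. This is a minimal sketch: the formula cases = day squared is an assumption chosen to give a curve that grows faster than a line but slower than an exponential; it is not the curve from the slide.

```python
import math

days = [10, 20, 30, 40, 50]
cases = [d ** 2 for d in days]  # assumed made-up curve: cases = day^2

points = list(zip(days, cases))
segments = list(zip(points, points[1:]))

# Slope of each segment on a linear axis.
linear_slopes = [(c2 - c1) / (d2 - d1) for (d1, c1), (d2, c2) in segments]

# Slope of each segment when the vertical axis is log10(cases).
log_slopes = [
    (math.log10(c2) - math.log10(c1)) / (d2 - d1)
    for (d1, c1), (d2, c2) in segments
]

print("linear slopes:", linear_slopes)
print("log slopes:   ", [round(s, 4) for s in log_slopes])
```

The linear slopes keep increasing while the log-scale slopes keep decreasing, so the same series looks like it is flattening on one plot and accelerating on the other.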
It goes up with a slope that is ever-increasing, but it's not exponential either. It's in between: it's a power curve. It's less than exponential, so on the log scale the curve bends down, but it's more than linear, so on the linear scale the curve is not a straight line and bends up. So that's the kind of counter-intuitive information you could get, and one of the reasons why you should always look at data in several ways. You cannot just trust one way and forget about the rest. As a kind of conclusion, I just wanted to say a few words about data visualization in general. We've talked a lot about data visualization, and log scales are very useful for it. As I said in the beginning, a lot of people have been doing data visualization with the COVID data, to the point that some people, like this writer, wrote articles, and that was only mid-March, to say it's nice if you do data visualization, but it may not be a good idea to publish it, especially if it doesn't bring anything new, because you may not know the caveats of the data, you may not know the constraints, etc., and you should be careful about that. This being said, we're not going to complain about the fact that there's been a lot of data; there's been a lot of useful data there. So what are some of the issues? First, some people, as often, have used data to lie. There are not so many, but this is one of the rare graphs that I suspect was done in bad faith. It's the data of different counties in Georgia, in the US. You have dates here, and there's a clear trend that the number of cases is going down with time. Except that if you look at the dates, they jump from the 28th of April to the 27th, to the 29th, to May, back to April, etc., and it looks more or less like they were ordered so that the bars would be descending.
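The effect of that ordering is easy to reproduce. The sketch below uses made-up dates and counts, not the Georgia figures, and the helper `is_descending` is a hypothetical name. Sorting the bars by value produces a steady decline by construction, whatever the data say, while sorting by date shows what actually happened over time.

```python
from datetime import date

# Made-up (date, count) pairs for illustration only, not the Georgia figures.
counts = {
    date(2020, 4, 27): 80,
    date(2020, 4, 28): 100,
    date(2020, 4, 29): 70,
    date(2020, 4, 30): 50,
    date(2020, 5, 1): 60,
    date(2020, 5, 2): 90,
}

# Misleading order: tallest bar first, so the plot always "goes down".
by_value = sorted(counts.items(), key=lambda kv: kv[1], reverse=True)

# Honest order for a time axis: chronological.
by_date = sorted(counts.items(), key=lambda kv: kv[0])


def is_descending(pairs):
    """Return True if the counts never increase from one bar to the next."""
    values = [v for _, v in pairs]
    return all(a >= b for a, b in zip(values, values[1:]))


print("sorted by value:", is_descending(by_value))
print("sorted by date: ", is_descending(by_date))
```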
After this was noted online, they did a new plot. It's the exact same one, only the colors changed: the big blue bar here is the big green bar there, and you can see that the story, when the ordering is right, is a little bit different. We go up at the beginning (I mean, it's already April at the beginning of the curve), we go down, and then we seem to have a U-shaped curve. The final data points here are not complete, because the plot was made around the 10th of May, but you can see that the message, and the policy you may decide to follow based on this data, is different depending on whether you see this plot or that one. You can compare them: they don't give the same message, even though it's the exact same data. One problem when you do data visualization is that many people create puzzles. They don't create tools that are useful for transferring knowledge; they create puzzles. So this is a case from the White House. Let me just zoom in on it: you get all 50 US states. It's the cumulative cases, and we've discussed the fact that it may not be the ideal measure, but basically you see nothing here. The state that is at the top has a blue curve, and you have a whole list of blue states here that it could be. If you have domain knowledge, you may remember that the worst affected state in the US was New York, so it's likely that one. And this one may be New Jersey. But by and large we have spaghetti here and no way to get much out of it. It was also the case on the Swiss corona-data.ch website. It's a fantastic website created by a PhD student in bioinformatics, who took on the task of collecting data from different official sources and making it available, and he does a really good job of it. But I'm not that keen on the data visualizations. These are the different Swiss cantons, 26 of them, over time, and frankly, it's very hard to get much out of that.
But if you use the interactive visualizations that he provides, it may be easier to just pick whatever you want. On this one, it's hard to get much information. So when I talk about a puzzle, what I really mean is that you have one piece of the data, for example the yellow parts here, and another piece, which is the legend, and you have to piece them together, and it's hard work. I'm getting to the end, so I'll skip this part. What I wanted to say is that there have been some issues in scientific papers as well, which don't always use such great data visualizations. And the problem is that these papers have been read not only by people who are used to reading them and looking at the details, but also by people who look at them and make wrong interpretations, because it's the first time they see some of these graphs. So we really need to be careful about that. It's not all bad. I've shown you a few bad graphs here. One of the main articles, this one in the Washington Post, explains how graphs may have saved millions of American lives. Of course, it's the flattening-the-curve graph. It's interesting because it's not a true graph. It's kind of a made-up graph that shows that if we flatten the curve, we may avoid having too many people in hospitals. This was everywhere in March: it's been quoted, it's been printed, it's been shown on TV, and quite a few people believe that it has made a difference. So that's quite positive. I really believe that it's a good thing to have all this data around. My favorite data set, my favorite visualization, as you've guessed, is from the Financial Times. There's a lot to learn there about how you can do plots in the sciences when you have your own data, because these people have thought a lot about sending messages. And as a last graph, for example, this is their take on the number of excess deaths; you can see the Swiss graph that I showed you earlier. To me this is a fantastic graph, because they have a really clear legend that is explained in the text.
And even without looking at the legend, you see that there is something special in red and something behind in black. There's really a lot of attention put into the plot, which makes it much easier to read than many others. So we didn't have that much time, but we covered quite a bit of ground. It's never enough to summarize data using a single number. We all want to do that: we all want to have just the number of positive tests, or a similar data point, and compare countries. But it's always complicated, always much more complicated than that. As a final conclusion, two things. Statisticians like to quote the French writer Gustave Flaubert, from the 19th century. He was really complaining about something he called the rage to want to conclude, the rage of always wanting to conclude, meaning that you need to have an opinion, you need to have a conclusion, and you are not able to say, I don't know. And the problem is that with such data, and this is a really good example, but in many cases when I have real-world data, we cannot conclude. We don't have the information and we have to say, I don't know. It's not all bad. For example, this is not science but economics: I've seen the news that the United Kingdom, for example, has started to provide economic data like GDP with uncertainty, not just one number that is taken as gospel, but saying, okay, this is a prediction, and I know we may be wrong, it's between this number and this number. This is what most people do in science. We love error bars, we show uncertainty, not always, but it's something we commonly do. If this pandemic has taught us one thing, which is that everything is uncertain, and we use that when we do data analysis, it's going to be at least one positive outcome. That's what I wanted to say today, so thanks a lot for your attention. Those of you who have asked for the slides: you will have access to the recording, and you've seen that I've put references on most of my slides, so you will get those as well.
If you want to go further, and as I said earlier, do not hesitate to ask me if you have any other questions or want references or anything. Thank you for your attention.