I'm going to open the first notebook, notebook one: data manipulation and representation. The goal is to make sure that, before we do any statistics, we have a strong foundation in the Python we need to play with the data. If we are not at ease with these tools, everything becomes a bit harder; but if we are, then playing around, running simulations, testing things and so on becomes much, much easier, and you can focus more on the stats and less on the language. Also, you will see when we talk about data description and data representation that the way we represent things already helps convey a message: it says something about the way we think, the way we model our data or the process we are trying to represent. So it is not a neutral topic; it has deep implications and a lot of importance, and if you use the wrong kind of visualization you can introduce a lot of bias into the way you think about your data. So, without further ado, let's go there. You have a table of contents with some links to help you move easily through the content, and we will first play with importing, manipulating and representing data. This first little cell is mostly about importing the libraries we will be playing with: Pandas, Matplotlib and Seaborn. Pandas is used to import tables, Matplotlib to do some basic plotting, and Seaborn is an advanced plotting library built on top of Matplotlib which lets you interface very easily with Pandas to make complex plots in one or two lines. Then the next cell is not mandatory, but for me, when I teach online, it is very useful because it forces the figures to be fairly large and to have a large font size, so that you can actually see them well on my shared screen. All right. First things first: before we start playing with the data, we need to import it from a file into Python. Here I will assume the most common case, which is that you have your data as a table in a CSV or TSV file. That means a text file with one record per line, where each record is made of several columns or fields, and the fields are separated by a fixed character: a comma, a semicolon, a tab or a space are the most common cases. Pandas gives you the very nice read_table function (there is also read_csv) to read your file. It is a function with a lot of options to adapt to different cases: maybe you have a header, that is, a first line with the column names, maybe not; maybe there are some lines to ignore at the beginning or at the end of the file, and so on. There are a lot of options, and it is up to us as programmers to spend a bit of time in the documentation of this function and figure out which options are the correct fit for our data file. So we often look at the data file in some external software, maybe Excel or a text editor, see what it looks like, and then adapt our command a little so that it reads correctly.
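To make that concrete, here is a minimal sketch of what those first cells might look like. The file name `data/swiss_census_1880.csv` and the exact figure settings are placeholders; use the actual path and values from the notebook.

```python
# Minimal sketch of the setup cells described above.
# "data/swiss_census_1880.csv" is a placeholder -- use the actual file
# in the notebook's data/ subfolder.
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Larger figures and fonts so plots stay readable on a shared screen
plt.rcParams["figure.figsize"] = (10, 6)
plt.rcParams["font.size"] = 14

# read_table defaults to tab-separated fields; this file uses commas
df = pd.read_table("data/swiss_census_1880.csv", sep=",")
```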
And if your data is not in a CSV file, maybe it's in an Excel file, or an XML or JSON file, Pandas covers that too: in the same way that there is read_table, there is read_excel, read_json and so on. And again, you can go to the Pandas website; it is very well documented and goes much, much beyond what I can show you in the limited time I have. I really encourage you to browse their documentation; that is the best way to learn and to play around with these functions. I just list here a couple of the most important options. The separator: by default it is a tab, which corresponds to a TSV file, but you can change it to a comma for a CSV, and some CSV files actually use a semicolon, and so on. Header is the row that should be used for the column names: the default is to use the first line of the file, but sometimes the header is absent, and in that case you should write header=None. So we always check this sort of thing and we don't hesitate to change it to suit our data in particular. Then skiprows skips a number of lines; for instance, some files start with a few lines about the methodology, how the file was created and so on. Those are interesting to us as statisticians, but not so much to Pandas, so they are skipped with this option. And then there are other options along the same lines: how are missing values encoded? How is true or false encoded, with 0/1, true/false, yes/no? There are options that let you automate this sort of reading. All right, so without further ado, here is one of the datasets we will be playing with today and tomorrow: census data on the Swiss population in the 19th century. It's a big CSV file in the data subfolder, and we read it with read_table because it is a CSV; I just set the separator to a comma and otherwise use the default values. Of course, once you have a DataFrame you want to look at it, just to check what is in there, what the columns are, and whether the reading went properly; for that, the head method is quite useful. For example, if I had forgotten to change the separator, my output would look like this, which clearly shows that something went wrong: the splitting of the fields into different columns did not happen properly, and I should change the default tab to a comma to obtain the correct result. All right. Once you are confident that the file was read properly, you want to start getting to know the data; it's quite important. First, you can just look at its dimensions: the number of rows and columns is in the shape attribute of the DataFrame, which is just a tuple (number of rows, number of columns). If we print that, there are 3,190 rows, each corresponding to a commune, and 24 columns.
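Here is a sketch of such a read with the options spelled out, plus the quick sanity checks. The option values and the file name are illustrative, not the notebook's exact call; adapt them after inspecting your own file.

```python
# Illustrative read with explicit options (file name and option values are
# placeholders -- adapt them to your own file after looking at it).
df = pd.read_table(
    "data/swiss_census_1880.csv",
    sep=",",               # comma-separated fields
    header=0,              # first line holds the column names (header=None if absent)
    skiprows=0,            # e.g. skiprows=3 to ignore three lines of methodology notes
    na_values=["", "NA"],  # strings that should be treated as missing values
)

print(df.shape)   # (n_rows, n_columns), e.g. (3190, 24) for the census data
df.head()         # first five rows -- a quick check that the parsing went well
```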
And here are the columns. Year is 1880 for everything; then the number of the town, the name of the town, the total number of inhabitants, the number of Swiss people, the number of foreigners, the number of males, the number of females, the number of people in different age categories, the number of people of different faiths, the number of people with different primary languages, and then information about the district and the canton that the commune is in. One thing that is fairly important to look at is the type of the columns, for two reasons: first, it informs you about the nature of what you are looking at, but it can also be used to detect errors in data entry or in data reading. Here, for example, you can see that the town name column is of type object, which is the type Pandas uses for strings, so that makes sense: a town name is a string, no problem there. The rest, as you can see, are int64, which means they are integer numbers. If one of those had come out as object instead, that would be a bit surprising, and it could be a sign that at some point, when someone copied data into this table, their fingers slipped, they made a small mistake, and a wrong entry somewhere switched the type of the whole column to something else. So it's something to check, just to be sure that you have what you expected and that each column has the type you actually expect. So far so good? Yes. Okay. Is the zoom level of my screen okay? Can everyone see my screen correctly? Yes. Okay, perfect. So then, let's move on. Of course, now you have a table and you want to access parts of it. There are several ways of doing that, and I will not necessarily show all of them. To access a single column, you usually use the square-bracket operator with the name of the column (be careful that the name is a string, so it should be between quotes), or you can use the df.column_name notation; these two are equivalent, and in both cases you are returned one column. Then I also use the loc operator a lot. loc lets you select both which rows and which columns you want, and you can select several of each. For instance, here the colon means all columns, and rows zero to three; here, rows zero to three and only the columns total and town name, so I get the town names and the total number of inhabitants; and here, all rows and the column total. These can be combined however you want. Notice that I access elements by their names. Sometimes, depending a lot on the nature of the data, it makes more sense to access things not by the column names but by the column index, that is, by position, in which case you want iloc, for integer-location indexing. There is a little link to the Pandas tutorial, which gives a lot more detail about how to use loc, iloc and so on.
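A short sketch of those access patterns, using column names as they are described in the lecture ("town name", "total", and so on); check df.columns for the exact spelling in your own copy of the file.

```python
# Column access and loc/iloc selection on the census DataFrame.
# Column names here follow the lecture's description -- adjust to df.columns.
df["town name"]   # single column via square brackets (name is a string)
df.total          # same idea with attribute access (works when the name
                  # has no spaces or special characters)

df.loc[0:3, :]                        # rows 0..3, all columns
df.loc[0:3, ["total", "town name"]]   # rows 0..3, two named columns
df.loc[:, "total"]                    # all rows, one column

df.iloc[0:3, 0:5]                     # same logic, but by integer position
```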
For our purpose, we will focus a bit more on one particular usage of loc, which is what happens when you filter your data on a particular condition, that is, when you create subsets of your data. For that, we use a comparison to create what I would call a mask: a series of True and False values. So here we say we want df.canton, the canton column, to be equal to "VD", which corresponds to the canton of Vaud. That creates a mask object, which I keep in a variable, and then I use the value_counts method of a Pandas Series to count how many True and False values there are. You can see I have 388 cases where the canton is Vaud and 2,802 where it is not. If I look a little at what is in there, I can see a bunch of True and False values; here you mostly see False because they are the vast majority, but be sure that there are 388 True in there. This vector of True and False I can give to loc, in the slot for the rows, and it will return to me only the cases where the mask is True and discard the cases where it is False. That is a nice way of selecting a specific subset of my DataFrame. If I give loc this mask together with the columns town name, canton and canton name, you can see that now I have selected only the communes which are in Vaud, and here are the different towns. These masks can be combined as much as you want; you just have to use parentheses around each condition, and then the ampersand & to mean "and" and the pipe | to mean "or". The way it works is then very much like the way you would say it in English: I want towns which are in the canton of Zurich and where the total is greater than or equal to 10,000. These are combined, I get a mask, and it returns to me only the communes in the canton of Zurich; you can check that they are all in the canton of Zurich and that the total population is always greater than or equal to 10,000, which is indeed the case. So these can be combined as much as you need. So far so good? All right. Then we will check that everything is going well for you. You have here a micro-exercise: your goal is to use this sort of method to select the towns with fewer than 1,000 inhabitants, and, optionally, display only the town names and number of inhabitants. With this micro-exercise it's your turn to play, and I finally shut up. When you are done with the micro-exercise, or you think you have a good answer, put a green tick using the reactions so that I know you are finished, and if there is any trouble, of course, ask a question. Okay, I don't see a lot of green ticks; I hope everything is working well. I have put a very small hint on my screen, and I see that maybe most people have just forgotten to put their green tick. All right then, it seems that most people have succeeded, so great, well done. I'm going to give a small correction and, if there is no pressing question, we will move on. Let me clear the feedback. So we take the column total, we want something with fewer than 1,000, so total < 1000. Now we have our series of True and False, and we just feed that to df.loc, so df.loc of this mask, and there we have all the towns with fewer than 1,000 inhabitants; you can see they are all quite small. Maybe we also want to show only the columns town name and total, and that's what we get: 2,460 rows. You can see that here I gave the mask directly to loc; of course I could also have written m = ... and then passed m, for much the same result. And then maybe you are afraid that something went wrong and you want to check: are the towns indeed all under 1,000? What you can do is note that this subset behaves exactly like a DataFrame, so nothing prevents you from looking at the total column of this subset, and there are functions such as max that tell you the maximum value in that column. You see that the maximum is 999, so everything is indeed under 1,000. All good. Let me paste that in the chat, just in case you want the correction, and we can move on.
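Here is a compact sketch of the mask workflow just described. The column names ("canton", "canton name", "town name", "total") and the values compared against follow the lecture's description; adjust them to your file.

```python
# Boolean masks on the census DataFrame.
mask_vd = df["canton"] == "VD"           # True for communes in Vaud
print(mask_vd.value_counts())            # e.g. False 2802 / True 388

df.loc[mask_vd, ["town name", "canton", "canton name"]]

# Combining conditions: parentheses around each one, & for "and", | for "or"
mask_zh_big = (df["canton name"] == "Zurich") & (df["total"] >= 10000)
df.loc[mask_zh_big]

# Micro-exercise: towns with fewer than 1,000 inhabitants
small = df.loc[df["total"] < 1000, ["town name", "total"]]
print(small.shape[0], small["total"].max())   # 2460 rows, maximum 999
```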
All right, so you have some columns, and sometimes what you are given is not sufficient. Maybe, for instance, you are given height and weight and you want to compute the body mass index, or you have the body weight and the brain weight and, rather than the two weights, you want their ratio; then you want to combine columns. For example, here we want the number of people who are more than 14 years old, and for that we combine two columns: the 15-to-59 column and the 60-plus column. The way you do that is actually very, very simple: you manipulate the columns as if they were single numbers, but the operation acts at the scale of whole columns. So here I create a column "14 plus year old" which is just the sum of those two other columns, and that creates it, just like that. You see that this action has added a new column to the DataFrame, automatically appended at the end. So it's actually fairly simple: you combine columns with normal operators and you create new columns, exactly as you would add an element to a dictionary. To drop things from a DataFrame, you use the drop function. You say which columns you want to remove; here there is a single one, but I could remove several by giving a list of column names. A little trick to be careful about: you have to say inplace=True if you want it to overwrite the DataFrame right away. If you don't use inplace=True, it will not change df itself; it will return a copy of the DataFrame without the column, which you would have to catch in a new variable. Here I use inplace, and then we can check that "14 plus year old" is no longer part of df.columns, so the removal was effective. This inplace thing is, I would say, easy to miss, so double-check when you drop things from a table, just to make sure you have not mistakenly forgotten it. If you want to remove rows instead of columns, you pass the row labels to drop, via its index argument, instead. Okay, all right, so far so good? Yes, all right. We are starting with very, very simple stuff, but I think it's very important to build this common ground together. As I said, being able to manipulate data easily makes all the rest much, much better.
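A short sketch of that create-then-drop sequence. The age-bracket column names are approximations of those described in the lecture.

```python
# Creating a derived column and dropping it again.
df["14 plus year old"] = df["15 to 59 year old"] + df["60 plus year old"]

# drop(columns=...) returns a copy unless inplace=True is given
df.drop(columns=["14 plus year old"], inplace=True)
print("14 plus year old" in df.columns)   # False -> the removal worked

# To drop rows instead, pass row labels via index=
# df.drop(index=[0, 1, 2], inplace=True)
```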
So then, again, it's your turn to work, with our first exercise. The first one just builds on the micro-exercise from before, but you have to go a bit further. You want to select the towns with fewer than 1,000 inhabitants or more than one foreigner, so you have to combine two masks, and then answer the question: how many towns satisfy this more complex filter? Then create a new column in the DataFrame representing the fraction of the population which is of the reformed faith, for each town. And if you have a bit of time, what are the minimum and maximum values of this fraction? This is a more complex exercise, so I will stop sharing my screen and let you work for, let's say, at least five minutes; put a green tick next to your name when you are done. If at any point you are a bit fed up with it and you just want to look at the solution, feel free to uncomment this line; when you execute it, it will automatically load the solution, which is in an external file. So that is also there if you would rather race straight to the solution. Okay, so if that's all, let's do a correction for this little exercise. First we wanted to select the towns with fewer than 1,000 inhabitants, I will just reuse the correction for that, or with more than one foreigner. Oh, here it actually says more than zero; small mistake there, so: more than one foreigner. So this mask or that one; the pipe | is the "or". The little trick is that you have to enclose each mask in parentheses, otherwise you will normally get errors. Once you have that mask, there are several ways of looking at what is in there. For example, here I just use the sum function, because False counts as zero and True as one, so if you sum the whole thing, all the ones add up and give you the total number of True values. But there is also the other method I showed you earlier, value_counts, which gives you the number of True and the number of False, and you can see we get the same count. Question two was creating a new column representing the fraction of the population which is reformed. So let's go: the fraction of reformed is just the reformed column divided by the total column. That's how you get the fraction, and you put it into this fraction-reformed column, and then we look at the beginning of it just to get a feel for what it looks like. Then what is the minimum, what is the maximum? There is a min function and a max function for that column: the minimum is 0 and the maximum is 1.0. That kind of makes sense: you would expect that among all the communes of Switzerland there is at least one where no one is of the reformed faith and one where everyone is, especially in 1880.
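Here is a sketch of that correction as code. The column names ("total", "foreigners", "reformed") follow the lecture's description of the census table.

```python
# Exercise 1 correction, sketched.
mask = (df["total"] < 1000) | (df["foreigners"] > 1)
print(mask.sum())            # number of towns satisfying the filter
print(mask.value_counts())   # same information, split into True / False

# Fraction of the population of the reformed faith
df["fraction reformed"] = df["reformed"] / df["total"]
df["fraction reformed"].head()
print(df["fraction reformed"].min(), df["fraction reformed"].max())   # 0.0 .. 1.0
```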
Okay, now a very small interlude: for the rest of the notebook it is much nicer, much easier, if I work with a DataFrame that contains fractions rather than raw counts, so I will create a df_fraction DataFrame, and you can just look at how I do that. I start df_fraction from the original DataFrame with just the town name, canton, canton name and total number of inhabitants, and then, for each of the count columns, the number of Swiss, the number of speakers of each language and so on, I create a new column in df_fraction equal to that column divided by the total, so that they contain only fractions. That's what we get now: name, canton, total number of inhabitants, and then the fraction of Swiss, fraction of foreigners, fraction of males, fraction of females, fractions in the different age categories, faiths and languages. That will be a bit nicer for the rest. Okay, so now we want to actually describe what is in there; we want to go a bit further than just showing a little bit of the table. For this, one of my best friends, one of my favorite functions, is simply describe. You can use it on a single column or on the whole DataFrame, and for each numerical column you get a little summary of what is in there. The first line is count: it counts the number of cells where you have information. Here it is 3,190 everywhere, so all rows have the information; there are no missing values, no NAs in there. Then the mean value, so the average, basically the total sum divided by the number of cells: you see that on average there are 892.20 inhabitants per commune, the average fraction of Swiss is about 96%, and so on. Then the standard deviation, which tells you the typical deviation from this mean; the minimum; the maximum; and then the quartiles. The first quartile: 25% of the towns have fewer than 207.25 inhabitants and 75% have more. The median: 50% have less and 50% have more. And the third quartile: 75% of the towns are at or below this value and 25% are above. So far, so good. You get that for all the columns. For me this is also quite useful to detect whether there are outliers in there. For instance, here my minimum is 17, a very small town, but after all, why not; and my maximum is 61,000, a big town, but again quite possible. If I had a negative number in the minimum, I would say something weird happened: either I made a mistake or there was a mistake in the file. If the maximum were something crazy like 2 billion, again I would suspect foul play. Same thing for the fractions: if anything is negative or above one, I would suspect a problem. So this is both a way to get to know the data and a way to detect problems in it, and there are none here. I won't go into too much detail, but sometimes your data will have NAs, that is, missing values. In many cases it's not too problematic; you just have to exclude the rows with NAs in some procedures, or sometimes you want to do some imputation, where the NAs are replaced by a chosen value or something like that. There are two functions which are helpful: isna lets you detect missing values (there is a link to the documentation and some examples), and fillna replaces all the NAs by a fixed value. Exactly how you do that depends on the nature of the data, so I don't give any prescription here; just note that these exist for the cases where at some point you have some NAs.
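A sketch of that interlude and of the summary step. The column names are illustrative, not the notebook's exact list; check df.columns for yours.

```python
# Building the fractions DataFrame and summarising it.
df_fraction = df[["town name", "canton", "canton name", "total"]].copy()
for col in ["Swiss", "foreigners", "male", "female",
            "reformed", "catholic", "german", "french", "italian"]:
    df_fraction[col] = df[col] / df["total"]   # counts -> fractions of the total

df_fraction.describe()       # count, mean, std, min, quartiles, max per column

# Missing-value helpers mentioned above
df_fraction.isna().sum()     # number of NAs per column (all zero here)
# df_fraction.fillna(0)      # example: replace NAs by a fixed value
```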
All right, so as we said, describe gives you the mean, the standard deviation, the minimum, the maximum and the different quartiles. Other possible descriptors are the mode, which is the most common value, especially for discrete data, or, for continuous data, the value with maximum density: if you imagine a density plot, the mode is the point where the density is highest. All right, so now that we have seen how to get a bunch of numbers, let's see how we can get a visual representation, because we have all this computing power at our disposal, so let's actually use it to get some nice plots. For that I will use Seaborn, imported as sns as a shortcut. Seaborn usually takes either directly the column that you want, or, when you want several columns, the DataFrame plus the names of the different columns you want to show. It's very simple to use, at least for the basics, and it offers a lot of variations, so it's quite a nice library to play around with. They have a very nice tutorial, so after the course, if you want to play with their tutorial or gallery, I can only encourage that. Here, a single line, sns.histplot for a histogram plot, with the fraction of Swiss, so the Swiss column of the df_fraction DataFrame, and this is what I get directly. It has detected that this is the Swiss column, so it uses that as the label for the x-axis, and for a histogram the y-label is the count. You can see that most communes are very close to one, and then there is a nice little decrease as we get to lower fractions. That's the default, very simple, and of course it can be tuned and played around with. For instance, if I do the same thing but say kde=True, a kernel density line is added on top, the color will be red instead of blue, and with set_title I add a title of my choosing. This is the result, and you can see the density line added on top. There are many options; these functions have many, many options. I will show a few of the most important ones to know, but it's important to realize that a lot of variation is possible; the library is very, very versatile. So spending some time playing around, having a bit of fun, changing colors, color schemes, visualizations, is helpful: it gets you more at ease and more flexible with these libraries. Do not hesitate to spend time; they have a very good tutorial on their website.
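A small sketch of that histogram cell ("Swiss" is the column name of df_fraction used in the lecture; the title text is mine).

```python
# Histogram with a kernel-density line on top and a custom title.
ax = sns.histplot(df_fraction["Swiss"], kde=True, color="red")
ax.set_title("Fraction of Swiss inhabitants per commune, 1880")
plt.show()
```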
All right, and from there we can combine these plots and play with parameters that let us see our data slightly differently. In particular, for the histogram, I want to quickly discuss the bins, or bin-width, argument, because it controls with how much granularity you look at your data. I will show you a multi-panel figure, it will be easier to explain with the figure in front of us: on one panel I use the default number of bins proposed by Seaborn, which tries to find something that works well for your data and most of the time does it well, and then I manually set the number of bins to 5, 10 and 1,000 with the bins argument, to see how that changes how we look at the data. You can see that with the automatic choice we can see a trend in the data, maybe a bell shape, and some variation around that bell shape. If you drastically lower the number of bins, you maybe still see a bit of a trend, but there is too little granularity to see the finer aspects of what is in your data. With 10 bins you focus mostly on the trend. And with 1,000 bins, the trend is still somewhat there, but you focus more and more on the noise. So the number of bins is something that depends on the number of points you have in the original data. The automatic choice usually does a good job, but sometimes you have to play around a little and find the right level of granularity with respect to what you want to show. This is also quite useful because sometimes you will show histograms of different datasets with varying numbers of elements in them; there is also a binwidth argument that lets you set, rather than the number of bins, how wide each bin should be. Sometimes you might use a width of, say, 0.01, so you know exactly which bin corresponds to what, and that can also help you combine histograms from different sources, so that their bins correspond and they are more comparable. You also see here how to create multi-panel figures. We use the Matplotlib function subplots: plt.subplots, with two rows, two columns, and a figure size of seven inches by seven inches. For reference, my screens are something like 12 or 14 inches wide, which gives you a little guide to how large you can make these. This gives you two objects, f and axes. f is an object used to interact with the entire figure at once, for instance to create a big overarching title or to save the figure to a file. axes is, here, a little grid which you can use to interact with each individual panel: axes[0, 0] is the top left; axes[0, 1] is row zero, right column, so top right; axes[1, 0] is bottom left; and axes[1, 1] is bottom right. You pass them using the ax argument of histplot; almost all Seaborn functions have this ax argument, so you can tell Seaborn to draw its plot in a specific sub-figure. All right, so far so good? Yes. Okay. So we started with simple stuff, and you see it can become a bit more complex after that.
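Here is a sketch of that multi-panel comparison of bin counts, following the layout described above.

```python
# Multi-panel figure comparing bin counts for the same column.
f, axes = plt.subplots(2, 2, figsize=(7, 7))

sns.histplot(df_fraction["Swiss"], ax=axes[0, 0])             # automatic bins
sns.histplot(df_fraction["Swiss"], bins=5, ax=axes[0, 1])     # very coarse
sns.histplot(df_fraction["Swiss"], bins=10, ax=axes[1, 0])    # mostly the trend
sns.histplot(df_fraction["Swiss"], bins=1000, ax=axes[1, 1])  # mostly the noise

f.tight_layout()   # keep the panels' labels from overlapping
plt.show()
```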
I have shown the histogram plot, and we have seen that it can have a density line added on top; you can also plot the density directly, without the histogram. For that it's kdeplot, for kernel density estimation plot. You give it the column, and then there is this argument which for me is a bit important: cut. The default value is 2, and I'm not always very happy with that; let me show you. This is the default, and you see that the kernel density estimate continues even beyond the limits of the actual data. To me that is a bit problematic, because here we are dealing with a proportion, so we know there are limits at zero and one: everything above one is nonsensical and everything below zero is nonsensical, but this plot could give us the false impression that there is something out there. So, personally, that is a bit undesirable for me, and I prefer to set cut to zero. I personally think that should be the default, but I'm not a developer, so there you go. Just a little thing to think about when you are given such a density plot: some of it is an estimate and may not reflect actual observed values. All right. And then these can actually be combined: all of these Seaborn plots can be mixed with normal Matplotlib elements and played with together, as you want, to create the visualizations you want. So here, for instance, I create a little function that does something maybe not super sensible, but just to showcase what this allows. It's a function that takes some data, a column, and a Matplotlib axis to plot on, and then it computes the mode, the mean and the median of the column. Here the mode function may return several modes in case the distribution is multimodal; we just pick the first one. Then I make a histogram plot of the column on that axis and add vertical lines for the mean, the median and the mode, in red, green and blue, with different line styles, and with labels which are then used in the legend, which I call on the axis I plot on. Then I can create my multi-panel figure, give different columns to this plot-with-mean-median-mode function, and redirect them to different axes, and that gives me this: one panel with mean, median and mode for each column. Here you can see that it's not looking so nice: there is some overlap between the labels of the different panels. The solution is to use the overarching figure object and call the function tight_layout, which recomputes the layout and moves the borders between the panels so that they no longer overlap. I'll give it to you in the chat. There is a question by Enghe: "my module seaborn has no attribute histplot." It should be histplot; maybe there's a typo somewhere. It could also be because in an older version of Seaborn this function did not exist: there was distplot, and they changed the name from distplot to histplot, which can sometimes create problems if you have different versions, but it should be histplot. "I typed histplot and it still gives me the same error." Ah, okay. Then could you paste your version of Seaborn in the chat?
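Here is a sketch of the helper described a moment ago: a histogram plus vertical lines at the mean, median and mode, drawn on a given axis. The three columns passed to it at the end are illustrative choices, not necessarily the ones used in the notebook.

```python
# Histogram plus mean / median / mode markers on a given Matplotlib axis.
def plot_with_mean_median_mode(col, ax):
    mode = col.mode().iloc[0]          # mode() can return several values; take the first
    mean, median = col.mean(), col.median()
    sns.histplot(col, ax=ax)
    ax.axvline(mean, color="red", linestyle="--", label=f"mean = {mean:.2f}")
    ax.axvline(median, color="green", label=f"median = {median:.2f}")
    ax.axvline(mode, color="blue", label=f"mode = {mode:.2f}")
    ax.legend()

f, axes = plt.subplots(1, 3, figsize=(12, 4))
for ax, name in zip(axes, ["Swiss", "male", "reformed"]):   # illustrative columns
    plot_with_mean_median_mode(df_fraction[name], ax)
f.tight_layout()   # stop neighbouring panels from overlapping
plt.show()
```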
All right, and in the meantime, we now have this nice little plot, so we can start commenting on what we see. In the first panel we can recognize one big mound: there is a single mode, and you can see that the mean and the median are both very, very close to one another. The distribution is what we would call symmetrical, and that's when the mean and the median coincide. We can contrast that with the second one, where the mean and the median do not coincide; you can see directly that this is a case where the distribution is asymmetrical, skewed toward lower values. And finally we have the fraction of reformed, where the distribution is multimodal, meaning that there are several modes, several peaks. That is a case where the mean and the median become a bit less informative, because they don't necessarily represent a typical case of what exists in your distribution. This is also why it's quite important to always plot the columns of interest, plot your data, and not just rely on single-number descriptors, because they can sometimes be misleading. Okay, so there is now a second exercise for you. I think it will not take too long, so let's take, say, five to ten minutes to do it; then we'll do a correction, and after the correction we'll take a coffee break. In the meantime, while you do the exercise, if you are experiencing trouble, like the question we just had, do not hesitate to keep writing in the chat and tell me about it, and we will try to get you sorted as fast as possible. All right, I will stop the recording and share my screen. So, this exercise was made of three consecutive questions. I hope most people were able to do at least a few histograms and KDE plots. Who was able to do the first one, just making a plot of the total number of inhabitants? Please use the little green tick. Only a few people? Okay, I see a few more, perfect. If some things are still problematic, or if you are experiencing technical trouble, don't hesitate to let me know; if I don't know there is a problem, I cannot help. So here I call kdeplot, because I think the density plot is what I want; a histogram would work as well, but I will go with that. kdeplot gives me this, and I see a very big peak here, which must be something around 100: there are a lot of small communes, and then some larger ones, and it goes up to about 60,000. With a little experience, you would guess that we have what we call a high dynamic range in the data. This is fine and informative, but maybe we could go one step further: I have some towns with around 10 inhabitants (I think the smallest is 17), some with 100, some with 1,000, some with 10,000. That is a huge range, so maybe I could do a log10 transform of the number of inhabitants, and maybe that would smooth things out and let me see a bit better what happens in there. You can do that directly: log10 is a NumPy function, so np.log10 of my column, and then I just give that to histplot and let's see what we get. Ah, yes, I need to import numpy. There we go.
And there you can see we now have this visualization of the log10-transformed total number of inhabitants, and, I don't know about you, but to me that actually looks quite fine: I can sort of see a continuity between the small towns and the bigger towns. I have towns in the hundreds, towns in the thousands, towns in the ten thousands. So something to think about when you look at your data: sometimes you can detect columns that would look much nicer if you log-transformed them, and that is a very, very classical transformation to apply to data, to turn something a bit awkward into something closer to a normal distribution, or at least something where you can get a better grasp at one glance of what is in your data. All right, making sense so far? Yes, okay. There is another way of doing this; the result is, I think, less nice, but the little trick is still worth knowing: you do your plot and then you call set_xscale("log") on the axis, and this is what you get in this particular case. It's slightly less nice because it is a density, but now you have the total on the x-axis, with 10 to the power 2, 10 to the power 3, so you keep the actual scale. Quite a nice trick to know about. Maybe if I change that to a histplot it would be better; I've not tried. Let's check, maybe it will be horrible. Yeah, it looks weird, because then you have unequal bin sizes and so on, but you get the idea. Personally, I think the first one is a bit more appropriate, but I would then have to manually change the axis label to say "log10 of total" or something like that. Anyway, do not hesitate to play around with these two approaches. Okay, second question: call histplot twice in a row, once for the fraction of Swiss and once for the fraction of foreigners. That's what I did here, and if you call one and then the other, you will see that you get the two superimposed on the same figure, but then you cannot really tell which one is Swiss and which one is foreigners. What you can do, for instance, is set the color: which color shall we choose? I will choose a mint color. I will not go into the detail of all the ways you can name colors: usually you can just type most color names out, you can use their hex code, and here I grabbed one from a special palette, the XKCD palette, which I enjoy. So now we have in blue the fraction of Swiss and in mint the fraction of foreigners, and we can kind of see the symmetry, which makes sense: when the fraction of Swiss is high, the fraction of foreigners is low; they are complementary. Okay, that's the second question. The third: plot the distribution of the fraction of Catholics in the canton of Zurich. That is just to go back to what we did before: create a mask, apply the mask, and plot the result. So the mask is the communes that are in the canton of Zurich, we have this Catholic column, we apply the mask, so we have the fraction of Catholics in the canton of Zurich, we do a histplot of that, and this is what we get, simply. All right, so far so good, everything making sense? All right, good.
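A sketch of that exercise correction as code. The column names ("total", "Swiss", "foreigners", "catholic", "canton name") follow the lecture's description; adapt them to your file.

```python
import numpy as np

# Question 1: high dynamic range -> look at log10 of the total population
sns.histplot(np.log10(df_fraction["total"]))
plt.show()

# Alternative trick: keep the raw values but put the x-axis on a log scale
ax = sns.kdeplot(df_fraction["total"])
ax.set_xscale("log")
plt.show()

# Question 2: two overlaid histograms, distinguished by color
sns.histplot(df_fraction["Swiss"])
sns.histplot(df_fraction["foreigners"], color="xkcd:mint")
plt.show()

# Question 3: fraction of Catholics, restricted to the canton of Zurich
mask_zh = df_fraction["canton name"] == "Zurich"
sns.histplot(df_fraction.loc[mask_zh, "catholic"])
plt.show()
```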
So how about the rhythm, my elocution, the speed at which I go through topics and so on: does it work for everyone? Okay; otherwise, do not hesitate to ask me to repeat something. Okay, so it's 10:34. Let's take a good 15-minute break where you can go grab a coffee or just rest a little, to stay focused and present in the course, as we will move through slightly more complex topics. So, break until 10:50. I will cut my audio and video, but I will still be around, so if you have technical questions or anything else, you can still write them in the chat and I will answer as soon as I'm available. Otherwise, I wish you a very good break. [Recording resumes.] All right, so of course we are really only scratching the surface; you have here a little link to the official Seaborn tutorial, which is quite well done and shows you how to use the library to its fullest. We are going to continue looking at how we can represent our different data columns and datasets, and for that I'm going to modify our data a little and create some categorical columns. Let's look at the code briefly. The idea is that I want to create a "majority religion" column. The way it works is that, to get the majority religion, I select the three columns relating to it: reformed, Catholic and other. At this point, you can imagine I just have a DataFrame with three columns, and then I call idxmax(axis=1). What this does is, for each row (that's what axis=1 means here), give me the index, that is, the name, of the column which has the maximum value: here, for instance, it's reformed, and there it's Catholic. So I end up with a series of labels, reformed, Catholic or other, depending on which one had the maximum fraction for each row, and I put that information into a new column, "majority religion". Then I do the exact same thing for "majority language". So that's the little trick to get my categorical columns. Of course, you want to check that everything went well. Once you have that, you can compute your metrics, for instance a mean or a median for each group, using the groupby function of the DataFrame. Basically you group by something; here it's the canton name, so I group all my data by canton. That creates a grouped object, which is specific to Pandas, and which behaves sort of like a DataFrame, except that all operations are then split among the different categories you grouped by. So here, if you call sum, rather than one overall sum, you get the sum for each canton, and in particular the sum of the total number of inhabitants, which gives us the population of each canton as registered in 1880. Okay. And then we can start doing things: for instance, we can group by majority language and ask what the mean fraction of Catholics is, so we get, for each majority language, the mean fraction of Catholics. You can see, for instance, that among the Italian-speaking communes this fraction is 97%, but among the German-speaking ones it's only 38%. So there might be something different between these communes with different majority languages.
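Here is a sketch of those two steps, deriving the categorical columns and grouping by them. The lists of religion and language columns are illustrative; use the actual column names from your DataFrame.

```python
# Deriving categorical "majority" columns and grouping by them.
religion_cols = ["reformed", "catholic", "other"]
df_fraction["majority religion"] = df_fraction[religion_cols].idxmax(axis=1)

language_cols = ["german", "french", "italian", "romansh"]
df_fraction["majority language"] = df_fraction[language_cols].idxmax(axis=1)

# groupby splits every subsequent operation by category
df.groupby("canton name")["total"].sum()                      # population per canton
df_fraction.groupby("majority language")["catholic"].mean()   # mean Catholic fraction per group
```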
And this was just applying, let's say, classical functions: the mean, the median, standard deviations and so on. Sometimes you will want to apply functions which do not already exist pre-coded, and then you are free to create any function you want. For instance, here I make a silly function that goes and finds the town with the minimal population: idxmin on the total column gives the index associated with the minimal value, and I return the town name and the population of that smallest town. Then I just call apply on the grouped object, group.apply with my custom function, and that applies the custom function to each group. It's just a silly example to show you how to apply any custom function to your grouped objects; there you see the smallest town for each canton. Okay. So that's all good and well: it lets you group your data into categories and then get some numbers, as we have seen before. It's also nice to get plots, and for that the best function, I think, from Seaborn is catplot, for plotting categories. Its structure is like this: you give it the data, where the data comes from... Oh, there is a question by Ahmad. "Can you please go up a bit? Here, do we get the mean of the Catholic fraction among those who speak French, German and Italian, or how is it?" Yes. So here we use groupby to create the grouped object, so any action we do after that will always be split by the different categories in the majority language column. The majority language column has French-speaking, German-speaking, Italian-speaking and Romansh-speaking communes, so any action we do, rather than giving a single number, will be split by category. If I omitted the groupby, I would get one single number, the overall mean; because we group by, it is split by the different categories. Does that make more sense? "Yeah, I understand. I just want to make sure we take the mean of the Catholic fraction for those who speak each language, right? Or is it vice versa?" This is the mean of the Catholic fraction: we group by language, and then the mean is applied to the Catholic column. It reads, if you will, from left to right. "So if I want to explain these numbers to someone, I would say: these numbers are the mean Catholic fraction among those who speak each language?" It is the mean fraction of Catholics among communes where the majority language is French, or German, or Italian, or Romansh. "Okay, I got it. Thank you so much." Do not hesitate if you have further questions.
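A sketch of that "smallest town per canton" example with groupby plus apply; the column names again follow the lecture's description.

```python
# Applying a custom function to each group.
def smallest_town(group):
    i = group["total"].idxmin()                   # row label of the minimal population
    return group.loc[i, ["town name", "total"]]   # name and size of that town

df.groupby("canton name").apply(smallest_town)    # one row per canton
```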
All right, so to represent categorical values, catplot is, I think, the way to go. You give it the data, a DataFrame, and then you specify that along the x-axis will be one column and along the y-axis another column; one of these two columns should be categorical, here the majority language. This is what you get with the defaults: a visualization that gives you a strip of points. It's basic, but it works well in many cases; for instance, you can see that for these green dots the fraction is mostly high, while for the others it is spread toward both ends. But you can also see that the labels are all jumbled and overlapping, which is not so nice. In that case, you could spend a bit of time rotating the labels and so on, which would be a bit ugly; what I prefer to do is just switch x and y, and most of the time that solves the problem. Okay, that's much, much better. So that's the simplest kind, but catplot offers different views, governed by the kind argument. There is strip, the default you have just seen; box for a box plot; violin for a violin plot; bar for a bar plot; swarm, which is like strip but with another way of arranging the points (it takes a long time to compute when you have many points, so I won't display it here, but with a small dataset you can play with it, it is fairly visual, let's say); boxen, which is kind of a hybrid between a box plot and a violin plot; and point. Let me demonstrate the different kinds: here I loop over the kinds, each time calling catplot with the male fraction as the continuous variable and the majority religion as the category. You can see that I have a few more arguments: kind, to govern what the plot will look like, and then height and aspect. height means that the created plot will be two inches high, and aspect is the width-to-height ratio, so an aspect of five means it will be five times as wide as it is high; if the height is two inches, the width will be ten inches. That's why I get these elongated plots, which I think work fairly well when you have only two categories. So this is the box plot; I trust you all know box plots, everyone has seen one before, they are everywhere. The violin plot shows the density, and you can see that by default a small box plot is included inside, which is nice as well. Then the bar plot, with some error bars; I will come back to this, but you will see that I don't like bar plots very much. In particular, where the others tell us where things are and how they are spread, here we only have a single piece of information and a small error bar, and we don't really know what it relates to exactly: is it a confidence interval, a standard error, a standard deviation? We don't know. Then the boxen plot, which sits in between the box plot and the violin plot: it takes your box and then adds smaller and smaller boxes around it. Then the strip plot, the default, which is nice when there are not too many points; when there are a lot of points, there is so much overlap that you only get the feeling that the density is higher somewhere, and it is sometimes a bit hard to tell more. And finally the point plot, which is like the bar plot except that you remove the bars and you link the different levels by a line. It's not always the best, but when you have ordered categories, so that it makes sense to have a line between them, it can be nice for detecting some sort of trend. So what we have is a number of visualizations, and different kinds of data will work more or less well with different kinds of visualization.
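A sketch of that comparison loop, using the male fraction and majority religion columns as the lecture describes (names may differ in your notebook).

```python
# Comparing the catplot "kind" options on the same pair of columns.
for kind in ["strip", "box", "violin", "bar", "boxen", "point"]:
    g = sns.catplot(
        data=df_fraction, x="male", y="majority religion",
        kind=kind, height=2, aspect=5,   # 2 inches high, 5x as wide as high
    )
    g.fig.suptitle(kind)
plt.show()
```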
So sometimes you have to try several, or you have to know a little about the nature of your data, to know which visualization is the most appropriate. Let's go through some of the most common ones together. The box plot first: we see box plots everywhere. They are nice because even when they are very small, in a tiny sub-panel of a figure, they still convey quite a lot of information, fairly simply; they don't need to be large to be readable. So this is the anatomy of the box plot. A box plot is made of a box, whiskers, and possibly outliers. "Outlier" here is just the plotting term for those points; it does not mean that they are actual statistical outliers that should necessarily be removed from your dataset, it's just what we call them. The box is built around the median, with 50% of the data on each side. The limits of the box are the first quartile and the third quartile, so there is 25% of the data below the box and 25% above, and the box itself represents the middle 50% of your data. Then the whiskers extend up to 1.5 times the length of the box, that is, 1.5 times the interquartile range, but they anchor themselves to an actual observed value: that is why, even though the theoretical limit of 1.5 times the interquartile range is here, the whisker only goes to there, because that is the observed point closest to the limit while still inside the range. And everything beyond 1.5 times the interquartile range appears as these little dots outside the whiskers. The 1.5-times-interquartile-range rule is, let's say, a heuristic: most of the time it works fairly well when you don't have a ton of points. But when you have a lot of points, tens of thousands, there will always be some points flagged as outliers. As you can see here, for example, we have a few thousand points, so there are many points outside the 1.5-interquartile-range limit simply because the sample is so large that, of course, some points fall slightly outside the bulk of the distribution; that is completely expected, and it is, I would say, one of the limits of the box plot. Nevertheless, with a few graphical elements it conveys a lot of information. Box plots are very powerful and I personally like them a lot, but where they fail is when your data is multimodal. Here I show, for instance, the fraction of German speakers split by majority religion, and you can clearly see that the distribution is multimodal: there is one big blotch of data around one and one around zero. But if you do a box plot of that, it is not clear what is happening; the same box plot could just as well come from data spread completely uniformly along the whole range. So we have to be a bit careful, and I would always recommend using a couple of different visualizations, because if you use only one single kind of representation you might be tricked into believing false things about your data.
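To make the whisker rule concrete, here is the arithmetic spelled out for one column (the column choice is illustrative).

```python
# The whisker arithmetic behind a box plot, for one column.
col = df_fraction["german"]                 # illustrative column choice
q1, q3 = col.quantile(0.25), col.quantile(0.75)
iqr = q3 - q1                               # interquartile range = length of the box
low_limit, high_limit = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Whiskers stop at the most extreme *observed* values inside those limits;
# points beyond them are drawn individually as "outliers".
whisker_low = col[col >= low_limit].min()
whisker_high = col[col <= high_limit].max()
print(q1, q3, iqr, whisker_low, whisker_high)
```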
The next one is the violin plot, which is simply a density representation. This is what it looks like, and here you can see that it actually circumvents the problem of the box plot, because you clearly see the bimodal nature of the data. So I like violin plots a lot too. In my experience, the thing is that when you don't have a lot of visual space, when you have to make a tiny figure, the violin plot becomes a bit harder to read, whereas the box plot, because it is much more summarized, if you will, works a bit better in a small panel; when you encode less information, you need less space to encode it. But otherwise the violin plot is a great tool. One little caveat, exactly like the KDE plot: you can see that by default it extends beyond the range of actual values, so we have fractions beyond one and below zero. That is something you will most of the time want to correct, with the same cut=0 argument, and then you get something which I think is more true to the nature of the data, so a bit better. Also, because the range is smaller, more of the visual space goes to the range of the actual data. There are more options, of course: you can remove the little box plot inside if you don't like it, you can draw little bars at each of the data points, you can make half-violin plots, with half of the violin for one category and half for another, and so on. These are fairly simple options of the violin plot, all in the Seaborn documentation, so again, if you are interested, I urge you to go and look at them after the course. Okay, so we have seen two of my favorites, the box plot and the violin plot. Now let's go to the one I don't like, and I'm not alone, many other researchers don't like it either, and I hope I will convey to you why we like to be a bit wary of bar plots: they are everywhere, but they don't convey that much information and they can be hard to interpret. A bar plot is basically just a bar that goes up to the mean of the data, so the bar conveys one piece of information, the mean, and then it has an error bar which conveys one more piece of information, which might be a standard deviation or a 95% confidence interval. We will come back this afternoon to what exactly a confidence interval is and how to compute it; basically it is about two times the standard deviation divided by something that depends on the number of points in the bar. And the problem is that, most of the time, people show bar plots with error bars and do not indicate clearly whether the error bar is a standard deviation, a confidence interval, a standard error of the mean, and so on. That changes quite a lot; these are very different things. Let me show you: if I say the error bar is the standard deviation, I get one picture, and if I say it is a 95% confidence interval, I get another. Here this one is the standard deviation and this one is the 95% confidence interval, and you can see that they paint very different pictures. That means that if you didn't know this was a 95% confidence interval and thought it was a standard deviation, the way you think about the data would change quite a bit. So you can use bar plots, but always, always be super explicit about what the error bar means exactly.
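Here is a sketch of the clipped violin and of the two error-bar choices for the bar plot. Note the API detail: recent Seaborn selects the error bar with the errorbar argument, while older versions (like the one used in the recording) used ci="sd" and ci=95. Column names are the ones described in the lecture.

```python
# Violin plot clipped to the observed data range
sns.catplot(data=df_fraction, x="male", y="majority religion",
            kind="violin", cut=0, height=2, aspect=5)
plt.show()

# Bar plots with an explicit error-bar definition (errorbar= needs seaborn >= 0.12;
# older versions used ci="sd" / ci=95 instead)
sns.catplot(data=df_fraction, x="male", y="majority religion",
            kind="bar", errorbar="sd", height=2, aspect=5)        # standard deviation
sns.catplot(data=df_fraction, x="male", y="majority religion",
            kind="bar", errorbar=("ci", 95), height=2, aspect=5)  # 95% confidence interval
plt.show()
```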
Okay, because there are different definitions, so we have to be very clear about that. And now a little bit of a caveat: I'm going to go to this external Shiny app there. To me it's a nice visual demonstration of why we should use different kinds of visualization and why this is an important thing to look at. So here are different columns, different little data sets, each with a very specific profile, right? The first one has two blocks of data which are clearly separate, and this one is maybe more of one big spread, okay? Then here there is one group with most of the points and a smaller group with fewer points, then one where these are just inverted, then three blocks of data, one bigger than the two others, and then four blocks, each with exactly one value, and equally separated. Okay, so each of them is clearly different and distinct. And if you just do a bar plot of these, because the bar plot only shows you the mean and the error bar, you lose all of this information and they look exactly, or almost exactly, the same. You lose all the fine-grained information and you may think that they are the same when in fact they are not at all. Yes, there's a question. I can see that for the scatter plot, the box plot and the violin plot the y-axes are the same, while for the bar plot it's different. Yeah, it is on the same scale, but because the bar shows you the mean, that's why they change the label here. But they are on the same scale, yes. So that's why the bar plot, I mean, it's nice because again, if a figure is super tiny, then the bar plot conveys very simple information, so it's nice for that. But for other things, when you make a larger figure, a bar plot is basically just putting two numbers into a very large visual space. It's not the best use of the space and it can hide a lot of interesting information. Here you can see that the box plot is a bit better, but it still has trouble: here you lose the fact that the data are multimodal. Here you see it, because there is one cluster which is much more prevalent than the other, okay? And there you lose completely the fact that you have four points which are equally spaced and each have a single value. The violin is a bit better there, but it still gives a bit of weirdness here, okay? So no single visualization will always work, okay? Always use several. To me, I would say always use the scatter plot, so that's the strip kind in catplot, plus one other, okay, so that we have these different views. The strip is oftentimes the one which is very rarely misleading, let's say, and can be very informative. The problem is that as soon as you have too many points, you can see that here they overlap so much that you only get the idea that there is a higher density, but not much more; you don't know exactly what that density is. So it becomes a bit less informative than the violin plot. There is never one perfect visualization. We have to compose, play around and try several until we find one that fits our purpose for showing the data, and that also stays true to the data, of course. And I'm not the only one saying that.
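Before we move on to the literature, here is a small sketch of that "always use several views" advice; again `df` and the column names are just placeholders:

```python
import seaborn as sns

# Draw the same variable with several categorical plot kinds;
# each view reveals (or hides) different aspects of the distribution.
# NOTE: the column names are placeholders, not the actual census columns.
for kind in ["strip", "box", "violin", "bar"]:
    g = sns.catplot(data=df, x="majority language", y="fraction 60+",
                    kind=kind)
    g.set(title=kind)  # label each figure with the kind used
```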
So you have here a few articles talking about bar charts and possible alternatives. This is a literature which is quite interesting, and I will come back to a few results later on about how to efficiently and truthfully show our data. All right, let's now see together how we can go a bit further than that and show several categorizations at once. And for this there is a nice argument: almost all functions of Seaborn have a hue argument. So basically you take the KDE plot that we had before and you add a hue corresponding to a category. Here our category will be the majority religion: I want to show the fraction of Italian speakers, split by majority religion. And this is what it looks like. You see that now I don't get a single line, I get one line per category in this majority religion column, okay? With different colors and a little legend. All right, and this works for the KDE plot, for the histplot, but also for the catplot. So let's now see what it looks like with a categorization by majority language and by majority religion. One will be along an axis and the other will be encoded with the hue. Okay, and you also see here my little trick to use a log scale, because when I represent the total number of inhabitants on a log scale it looks much, much better. So now you see what I have: on the top, German speakers, French speakers, Italian speakers, Romansh speakers, and within each, the hue separates the communes with a Reformed majority from those with a Catholic majority. So you can see all of these different values there. Here with the histogram it works quite well. The violin plot doesn't like the log transform a lot, it does some weird stuff, all right. Here the bar plot, because we have very little visual space, I think is not too, too bad. The boxen plot is not so bad either. With the swarm plot we see basically nothing, and the point plot is like the bar plot but with a little bit less clutter, so it's actually quite nice as well, all right. So you can see that just by adding this hue parameter we have added a new level of categorization to our visualization. It's quite nice; don't overuse it, but it's quite powerful. Yes, there is a question. Yeah, I just want to understand what is the difference between the box plot and the boxen plot, the first and the fourth one. Yes, so the box plot will just show one box going from the first quartile to the third quartile, whereas you see that the boxen plot shows more boxes, and I think, if I remember correctly, that they go through the different deciles, but I'm not 100% sure, and I think this can be tuned with the arguments. I can look it up if you want. That's the idea, it's just drawing more boxes. Okay, so now it's your turn to play. We have seen different things: the catplot, the hue, and so on and so forth. So go and play with them. The exercise for you is to represent the proportion of people that are more than 60 years old across all cantons. Okay, try to play around with different visualizations and try to find one which suits your purpose. Okay, so I will then stop recording. So we want to represent the proportion of people more than 60 years old across all cantons. I'm going to load the solution. The idea is that our category will be the canton name, and the represented variable will be the proportion of people who are 60 or more years old, okay.
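Here is a minimal sketch of one possible solution, assuming the data frame is called `df` and the relevant columns are named something like "canton name" and "fraction 60+" (the exact column names in the census table may differ):

```python
import seaborn as sns
import matplotlib.pyplot as plt

# How many categories (cantons) do we have to fit along the axis?
# NOTE: "canton name" / "fraction 60+" are placeholder column names.
print(df["canton name"].unique())

# One point per commune, spread by canton: rarely misleading, a good start.
sns.catplot(data=df, x="canton name", y="fraction 60+",
            kind="strip", height=6, aspect=3)
plt.xticks(rotation=90)  # canton names are long, so rotate the tick labels
plt.grid()               # a light grid helps match points to cantons

# A box plot gives the median directly, which makes the shift
# from canton to canton easier to judge.
sns.catplot(data=df, x="canton name", y="fraction 60+",
            kind="box", height=6, aspect=3)
plt.xticks(rotation=90)
plt.grid()
```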
So I think that for me, I will go with the idea that our category is the canton name, and we know that there are many cantons, right? So df["canton name"].unique(): yeah, you see there are all of these cantons to represent. Okay, so we'll have to fit a lot of information into this figure. Then from there, here I go for instance with box, but we could also go with the default strip, which I thought would look ugly. So let's do a strip plot of the fraction of 60-plus-year-olds, split by canton name. I do this, and here you can see that it's actually not too bad. Okay, this gives us a bit of information; we can see some trends appearing here and there, so I think it's fairly informative. The default rainbow colour scale, here I find it useful; maybe it's not the most beautiful, but it serves this data exploration purpose. So that's not too bad. Maybe I could get a bit more information with box. Box gives me the median directly, and so I can now see more clearly that the median shifts quite a lot from canton to canton. Okay, so to me at least it's a bit clearer for seeing the different patterns or making judgments based on that. All right, have you tried some different things? Maybe a bar plot? No one? No one has tried something different? If you try a violin plot, just violin, then you can see that visually you don't have enough space here for it to be really worth it. Maybe if I changed the height to 15 and changed the aspect, so the width-to-height ratio, it would be a bit nicer, but then it doesn't fit on my screen anymore. All right, but these are always things that we can play with. Hop, hop, hop, let me come back to that. And that's my final plot. I don't know if you've seen, I've added this little plt.grid() call there. I think it helps interpretation, it helps you follow exactly what is what when you have a little grid visible. So that's also a nice little trick to have. Okay, so far so good, everything is still making sense. All right, so far we have seen a bunch of tricks to play with and represent our data. So last but not least, once you have maybe modified or created a few columns and so forth, you will want to write your table and your plots to disk. For your data frame, you just call .to_csv() or .to_excel() or whatever, depending on what you want to do, but generally .to_csv(), and then give the name of the file that you want to write to, and that's it. Simple enough, right? Of course, there are some options: how do you want to encode NAs, how do you want to write True and False, do you want to write a header or not, and so on. But the basic usage is very basic. And then for a plot, it's as simple as creating the plot exactly as before, storing it in a variable, and at the end calling .savefig() on that variable with the name of the file that you want. And Python is not too dumb: it will automatically detect the extension that you've chosen and save in that format. So .pdf, .png, .jpeg, .svg, all the classical ones are already known. The defaults usually give something fairly nice, but of course you can manually give arguments for the height and width in pixels or inches, the dots-per-inch (dpi) argument, and so on and so forth, to have full control of exactly what you are saving. Okay.
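A minimal sketch of both operations; the file names, `df` and the column names are placeholders:

```python
import seaborn as sns

# Write the (possibly modified) data frame back to disk as CSV;
# index=False drops the row index column from the output file.
df.to_csv("census_modified.csv", index=False)

# Create a plot, keep the returned object, and save it;
# the output format is inferred from the file extension.
# NOTE: column names are placeholders, not the actual census columns.
g = sns.catplot(data=df, x="canton name", y="fraction 60+", kind="box")
g.savefig("fraction_60plus_by_canton.png", dpi=300)
g.savefig("fraction_60plus_by_canton.pdf")
```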
And when you do a multi-panel figure, you have this f and axes pair of elements, and the savefig is done on the f element, which is the one that controls the overarching figure. So here, for instance, I save the whole thing as a PDF. You see that it's still shown there, but it's also saved here, and I will show you that in a moment. I come here, and you can now see that I have my output.png that was created seconds ago, and the multi-panel output as well. So this is my PNG file and this is my PDF file. Okay. All right. So saving figures is, I would say, in general far from the most complex thing to do in Python. All right. So what's left for me now, to conclude this part, is to talk a bit more about data visualization: what sort of data visualization works well and what doesn't, just to steer you slightly away from the bad ones. And there is some research on that. I will mostly use the figures from this 1984 paper. What they did, basically, is that they wanted to compare different kinds of visualization. On one side you have a pie chart and on the other side you have the exact same information as in the pie chart, but shown as, basically, a bar chart, right? And they just asked people to make judgments: which one do you think is the largest, is it A or is it C? Or is it B or is it E? And by how much do you think E is larger than B? They asked some people who were only shown the pie chart, and others who were only shown the bar chart, and compared how often they got it right or wrong. And then they did the same thing with these different kinds of bar charts: either you compare two columns right next to one another, or two which are slightly separated, or two which are far away, or two which are in stacked columns so they don't share the same baseline, or two segments stacked on top of one another. That's what they call position or length judgments: this here is a position judgment, this here is a length judgment, and the angle judgment corresponds to the pie chart. And this is a representation of the sort of error that people make: the lower, the fewer errors, so the better. And what you can see, and I'm sure you expected it, is that with a pie chart it's much, much worse: people make many more errors than when you give them this bar chart here. Okay, it kind of makes sense: here it's very easy to compare the bars; there our brain has to do a bit more work, we can get there, but it's not as obvious. And same thing here: it's much harder, for instance, to compare these two lengths than to compare these two bars next to one another. Okay, again, I think this is simple enough, it makes a lot of sense, but it's a good thing to remember and to see this effect clearly, because when we create visualizations we have to keep it in mind. People have to look at our data and interpret it, and if we can make it less of a cognitive load for them to understand our data, then we will be that much more impactful and convincing when showing it. So pie charts in general are not ideal; prefer bar charts, it's much better. And if you are ever shown a 3D pie chart like that, run away, I would say.
That means that you are dealing either with someone who is incompetent at creating visualizations, or with someone who is trying to sell you something and to scam you. Because here, the angle that is visually shown is actually not representative of the actual angle, so you do distort the representation that people have. And there have been studies done on this effect: the perception that people have is indeed skewed by this purely visual effect. So it's actually a great scamming tactic that has been used everywhere in commercial presentations, including in material by Apple, to kind of oversell their success and so on and so forth. So yeah, be very mindful of that effect. And last but not least, one small GIF which I like. You can go from something like this, very complex, and actually, when you think about what you want to show, you can make it much simpler. Remove the 3D, it's useless. Maybe replace your pie chart with a simple bar chart. And if there is one thing in particular which you want to make pop out, then color only that and not the rest; there is a label anyway, we don't need to have color everywhere. Okay, this is better than the original. All right, so just a few key ideas. I know this is a lot, but it's something to think about when you create visualizations: sometimes, go for something simple. Okay, do we have questions at this stage, or is everything still going well? All good? Okay, so I know this was a fairly long intro, not too much stats for now, just a little bit with the median, the mean and so on. But that's to ensure that we all have the same basis, and as I said, being good at data exploration and representation makes the rest that much easier. If this is a topic that interests you and you want to know more, we are also giving at the SIB a data analysis and representation course with Python. The next edition will take place in November. It's not open yet, but stay tuned. And if you cannot wait, the material is anyway already available on GitHub, that's this course here. You have there a lot of material about data manipulation and presentation, much, much more than what I'm showing here, if that's interesting to you.