Okay, so now we are ready for the second notebook, which is about delving a bit deeper into the description of our data, both numerically but also visually, all right? So this is our little menu there. And of course we first need to start by importing all of the modules that we need. By the way, is this zoom level enough or should I zoom in more? Please let me know in the chat, or put a little green tick or thumbs up if the zoom level is fine. Yes, okay, it seems that it's fine, but if you need me to zoom in more, please let me know. Also, if I speak a bit too fast or you need me to repeat something, please do let me know. All right, so there's a little bit of code here that you don't necessarily need. You do not need to run it, but I do, because as I'm presenting on screen it makes the figures I show you bigger by default, so that I don't have to constantly tweak figure sizes for them to display nicely on your screen. Okay, next: some of this we have already seen together. This part is about describing the content of a data frame using functions that give you a numerical summary, if you will. We have already seen the mean and standard deviation functions. There is also the sum function, very useful, the minimum, max, median, and the quantile function, which can give us any arbitrary quantile or list of quantiles. A few others which are very nice are describe, count, and value_counts. We will experiment with all of these together, so I won't go into the detail of the text here but rather show you what we do. So we will play again with the Titanic dataset, and the first one, which I like to use a lot, is value_counts, which counts the number of occurrences of each element encountered in a column, all right? 
So you have the embarked column, which has a number of S, C, and Q values for the different embarkation towns, and value_counts will just count each occurrence, sorted from the most common to the least common, all right? If you don't remember where to put the s — whether it's values_count or value_counts, which often happens to me, I don't necessarily remember everything — then don't hesitate to use the Tab key to have the auto-completion do that for you. Then the mean, median, minimum, and max are shown here. So df.fare.mean() gives you the mean for the fare column, and the same goes for the median, min, and max. I won't spend too much time on it because this is already part of what we have seen. There is a little bit of redundancy there, which is not bad in itself, but if you need me to spend a bit more time on it, please do let me know. And then describe, super useful, because it gives you most of the common numbers you might want out of a column with a single command. If you are familiar with R, this is the equivalent of the summary function. So df.fare.describe() gives you the count — that is, the number of elements that are not NAs — then the mean, standard deviation, minimum, and then the first quartile, the second quartile (so, the median), the third quartile, and the maximum. Simple as that. And then of course you can combine that with all sorts of selections, as you wish. For instance, if I build a mask which selects the males, I can fairly easily get the median fare for males by taking df.fare, applying the mask, and calling median. And a mask can be reversed with a tilde, so the reverse of male is female, and then I get the median fare for the female passengers. All right, and of course these can be combined with other masks and so on, to get exactly the numbers you would like to have. Right, so far so good. Is everything fine so far? 
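The value_counts, describe, and mask tricks above can be sketched on a tiny made-up stand-in for the Titanic data frame (the column names sex, embarked, and fare are the real ones; the values below are invented for illustration):

```python
import pandas as pd

# Tiny stand-in for the Titanic data frame; values are made up
df = pd.DataFrame({
    "sex": ["male", "female", "male", "female", "male"],
    "embarked": ["S", "C", "S", "Q", "S"],
    "fare": [7.25, 71.28, 8.05, 8.46, 26.55],
})

# value_counts: occurrences of each value, most common first
print(df["embarked"].value_counts())

# Single-column numerical summaries
print(df["fare"].mean(), df["fare"].median())

# describe: count, mean, std, min, quartiles, max in one call
print(df["fare"].describe())

# Combine with a boolean mask; ~mask inverts the selection
is_male = df["sex"] == "male"
print(df["fare"][is_male].median())    # median fare, male passengers
print(df["fare"][~is_male].median())   # median fare, female passengers
```

The same pattern works with any boolean mask, and masks can be combined with `&` and `|` for more precise selections.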
Little green tick to say yes, little red cross if things need a bit more time to be digested. Okay, it seems that everything is fine so far, right? Okay, so then of course you can sometimes apply something to the whole data frame. Here I did it on a single column, but you can also do it directly on the data frame. There we get a little FutureWarning, depending on your version of pandas — on most modern versions nowadays you will get the warning, but it's just a warning. The operation still works; it's just that, because it is a future warning, at some point this will not work anymore. For the moment, what we do get is a mean for all of the columns, because by default mean operates at the level of columns. Sometimes you might want to switch this behavior and compute, I don't know, a mean or median by row. For instance, if you have expression data in expression matrices, it's typical to have samples as columns and genes as rows, and it might be interesting to have a row-wise mean or sum or something like that. Then you just have to specify axis=1. So axis=0 is by column and axis=1 is by row. If it were a 3D table we could have axis=2 and so on, but data frames are only 2D. All right, so there we get all that, and of course, because we have some columns which are not numerical — for example embarked — you get this little warning that says: you asked me for the mean of something that is not numerical and I'm not happy with that. Of course you can then just select the few columns of interest for which you would like to compute the mean, and call mean on that. So basically you do a sub-selection on your data frame, just some columns, and then you compute the mean. Okay, right. 
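The axis behavior and the non-numerical-columns issue can be sketched like this (toy data frame, column names invented for illustration):

```python
import pandas as pd

df = pd.DataFrame({
    "a": [1.0, 2.0, 3.0],
    "b": [10.0, 20.0, 30.0],
    "label": ["x", "y", "z"],  # non-numeric column
})

# Column-wise mean (axis=0 is the default); numeric_only=True
# avoids the warning/error about non-numeric columns
col_means = df.mean(numeric_only=True)

# Row-wise mean: axis=1
row_means = df[["a", "b"]].mean(axis=1)

# Or sub-select the columns of interest first, then call mean
subset_mean = df[["a", "b"]].mean()
```

Passing `numeric_only=True` and sub-selecting columns are two ways around the same problem; the sub-selection also documents which columns you actually care about.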
So then we get to our first micro exercise, to try and get you to apply this, because that is when we actually see whether we have understood or not. So, micro exercise: please compute the mean fare for each passenger class. Remember, the pclass column contains the passenger class, which can be one, two, or three, so try to get the mean for each of these separately. So let's correct this exercise together. Our task was to compute the mean fare for each passenger class. First we need to make sure that we are able to select each passenger class. The passenger class has values one, two, or three; if you are not sure exactly what the different possibilities are, you can always use the unique function, which tells us what's in there. So then we can write df.pclass and ask whether it equals one, okay? That gives us a bunch of True and False values — our mask — which we apply to fare, exactly as we did before. So now we go df.fare, apply the mask, and ask for the mean, and I get the mean fare for passengers of class one. After that it's just a matter of copy-pasting: I change this to two or three, and then I get my three means, all right? That's the simplest possible answer. Of course, rather than doing it manually by copy-pasting, I could put this in a for loop over all the different values returned by the unique function, and so on and so forth, but the result in the end would be the same, okay? So far so good. Any questions on that? None, all right. So then, another function that is tremendously useful. We've now seen things like std, mean, or median, which are super useful when your data is numerical, and we have also seen value_counts, which is very useful when your data is categorical. 
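The exercise correction above — the copy-paste version and the for-loop version over unique() — can be sketched on invented data (real column names pclass and fare, made-up fares):

```python
import pandas as pd

# Stand-in for the two Titanic columns the exercise needs
df = pd.DataFrame({
    "pclass": [1, 2, 3, 1, 3, 2],
    "fare": [80.0, 20.0, 8.0, 60.0, 7.0, 15.0],
})

# The copy-paste version, one class at a time
mask = df["pclass"] == 1
print(df["fare"][mask].mean())

# The loop version over all unique class values
for cls in sorted(df["pclass"].unique()):
    print(cls, df["fare"][df["pclass"] == cls].mean())
```

Both give the same three means; the loop just saves the copy-pasting.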
Another function which I like a lot is pd.crosstab, which lets you compute a contingency table between two or more variables. Basically, the way it works is that you give it two columns, and then you get this cross table: for instance, for each passenger class, you see how many survived or did not survive, okay? It's quite simple to use. Okay, so it's time again for you to do something, all right? Use the dataset that is in this file: read it, then compute the sum for each column, and then normalize each column by dividing the values in it by the sum of the column. You will see it's not too hard. So first I gave you the code to read it, and you will notice that I have used the argument index_col=0, such that the first column becomes the index of the different rows, all right? So all my columns are numbers, okay? In this particular case it's single-cell data, so each column is a sample and each row is a gene measured in that sample. So first we want to compute the sum for each column, and for that you just need to call sum on the whole data frame, which performs a column sum, okay? If for some reason you wanted a row sum — which you might want sometimes — you would again say sum but now with axis=1, and that gives you the sum for each gene, okay? Here, apparently, this is a very, very sparse matrix: most of what we see are zeros, but when we do the column-wise sum we see that in fact there are some non-zeros inside this matrix, right? Now, the second part was about applying these numbers to each of the values there; specifically, we want to divide each column by its column sum. 
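Both pieces above — pd.crosstab and the column versus row sums — can be sketched on toy data (invented values; the small counts frame stands in for the much larger single-cell matrix):

```python
import pandas as pd

# Toy stand-in for the Titanic pclass/survived columns
df = pd.DataFrame({
    "pclass": [1, 1, 2, 3, 3, 3],
    "survived": [1, 0, 1, 0, 0, 1],
})

# Contingency table: survival counts per passenger class
table = pd.crosstab(df["pclass"], df["survived"])
print(table)

# Stand-in for the single-cell matrix: genes as rows, samples as columns
counts = pd.DataFrame({
    "sample1": [0, 2, 6],
    "sample2": [1, 0, 3],
}, index=["geneA", "geneB", "geneC"])

col_sums = counts.sum()          # one sum per column (per sample)
row_sums = counts.sum(axis=1)    # one sum per row (per gene)
```

The crosstab index holds the values of the first column you pass, and its columns hold the values of the second, so each cell is a count of co-occurrences.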
For that, the simplest is just to use the vectorized aspect of this sort of computation, so we can write df_cell divided by its sum. On the one hand we have df_cell, and we can look at its shape: it is 32,000-something rows by 50 columns. On the other hand we have these 50 numbers. So Python understands, if you will, how we want to do this sort of operation, because we have the equivalent of a matrix of size 32,000 by 50 and we divide it by a vector of size 50, so it applies the division column by column, and that's what we get there. It's not super obvious to see what happens, but if we create a new data frame and then compute the column-wise sum on this new data frame, we see that each column now sums to one, because we have applied this normalization by the column sums, okay? All right, so that was it. As you can see, it's actually fairly simple — so simple that it seems unlikely that this would work, but this is actually how it works. Okay, are there any questions on that? Is it fairly clear what happens when I do these different operations and why this works? Yes, no, green ticks, thumbs up, thumbs down, red cross if you don't understand. Maybe you're all dead behind your computers. No, still some signs of life, thank you. Yeah, when I teach online it's very hard for me to gauge how you are doing, and whether I'm just talking into the void, so you will see that I regularly ask you to give me little signs of life, just so I'm sure that my web connection has not gone down without me realizing, for example. Okay, so we have been able to do that. Now, I already discussed this with you, but there is also .describe, which lets you have a lot of info at once for all the numerical columns. 
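The column normalization just described — dividing a whole frame by a vector of column sums — can be sketched on a tiny made-up matrix (gene and sample names invented for illustration):

```python
import pandas as pd

# Tiny stand-in for the single-cell matrix: genes as rows, samples as columns
df_cell = pd.DataFrame({
    "sample1": [0, 2, 0, 6],
    "sample2": [1, 0, 3, 0],
}, index=["geneA", "geneB", "geneC", "geneD"])

col_sums = df_cell.sum()   # one number per column
print(df_cell.shape)       # (4, 2): the frame
print(col_sums.shape)      # (2,): the vector of column sums

# Dividing a (4, 2) frame by a length-2 Series aligns on the
# column labels, so each column is divided by its own sum
df_norm = df_cell / col_sums

# Each column of the normalized frame now sums to 1
print(df_norm.sum())
```

Strictly speaking, pandas aligns the Series index with the frame's column labels (rather than relying on position alone), which is why the division lands on the right columns.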
It shows the count, that is, the number of values which are non-NA — that's why, for instance, age shows only 714 whereas pclass shows 891: there are quite a number of NAs in the age column and they are not counted. Then the mean, the standard deviation, the minimum, and then the quartiles, the median, and so on, right? So at a glance you get a little overview of the numerical columns in the table. Now, maybe some of you can spot a little something — not necessarily a big problem, but something that might not make a lot of sense given the nature of the data there. Please write in the chat if you think that you've spotted something a bit weird, and also write how you would solve it. Okay, so Jean-Sébastien writes about pclass. Okay, maybe someone else can elaborate on that and on the important aspects. Okay, I see. Yeah, so that's a very terse way of describing the problem, but you are right. The idea is that the passenger class is encoded as a numerical column, as floats, okay? But in fact it's a categorical column, and in R, for instance, we would transform it to a factor. In Python it's not a factor; we just say that we need to change the type. So we take, for example, pclass and cast it with astype to the category type, okay? And survived, I think, could be cast to the Boolean type instead of zero/one. And when we do that, we see that age, family, and fare are the only numerical columns left, okay? If we then also look at df.dtypes, there they are: we see that indeed pclass, and also survived, are now described with these new types, okay. 
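The type change just discussed can be sketched like this (toy frame with invented values; the real Titanic frame has more rows and columns):

```python
import pandas as pd

df = pd.DataFrame({
    "pclass": [3, 1, 3, 2],
    "survived": [0, 1, 1, 0],
    "age": [22.0, 38.0, None, 35.0],
})

# pclass is really categorical and survived is really boolean,
# even though both arrive encoded as numbers
df["pclass"] = df["pclass"].astype("category")
df["survived"] = df["survived"].astype(bool)

print(df.dtypes)     # pclass is now category, survived is bool

# describe() now reports only the remaining numeric columns
print(df.describe())
```

After the cast, describe() drops pclass and survived from its numeric summary, which is exactly the behavior you want for columns that only look numeric.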
Right, so that's a little something that is sometimes important to take care of. Here it's just about a visual or numerical summary, but if you keep a categorical column that has been encoded using numbers and you don't account for the fact that it is not numerical, it might change the way some statistical function reacts to it, and you might get unexpected behavior in some statistical function or some modeling analysis, which, if you don't catch it early enough, can be a headache to understand and solve later on. So it's better to act early if possible. All right, I won't go into detail there; that just recaps what describe gives you: count, mean, standard deviation, and so on.