Let me then move on to the third chapter: accounting for categories in the data. It happens very often, and that's what we just did. In the previous micro-exercise, I first asked you to compute the mean fare per passenger class, and now I just asked you to do the same thing with three histograms. We have done so in perfectly valid ways, with three different calls, but it's an operation that is so common that we need a better way of doing it.

So first, when it comes to purely numerical computations, we can use the DataFrame groupby method. groupby does what its name says: it groups your data by the different categories found in the column that you give as argument. So here, df.groupby on the sex column returns an object, which is a DataFrameGroupBy object. It's quite a custom sort of object, but you can interact with it almost as if it were a data frame; it's just that any operation will then be replicated for each level of the column you have grouped by. That's why, if I create this group-by on sex and then ask for the age column and its median, I get not one median but one per sex.

Now I need another example to show a little bit how we can play around with that. For this, I use a little table that contains the 1880 Swiss census. It is a big table with all the communes and all the little towns that existed in Switzerland back then, and plenty of numbers for each of them. By grouping by canton name, I can then ask for the Catholic column, which is the number of Catholic people per commune. When I ask for the sum, normally I would get just the total number of Catholics in Switzerland at the time, but since I've grouped by canton name, I get it per canton. So far so good? Yes, okay.
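To make the two group-by examples above concrete, here is a minimal sketch. It does not use the course files: both tables are tiny made-up stand-ins, and the column names ("Sex", "Age", "canton name", "Catholic") are assumptions chosen for illustration.

```python
import pandas as pd

# Made-up stand-in for the passenger table (column names are assumptions).
df = pd.DataFrame({
    "Sex": ["female", "male", "female", "male", "male", "female"],
    "Age": [29.0, 40.0, 35.0, 22.0, 54.0, 4.0],
})

grouped = df.groupby("Sex")            # a DataFrameGroupBy object
median_age = grouped["Age"].median()   # one median per level of "Sex"
print(median_age)

# Made-up stand-in for the census table: one row per commune.
census = pd.DataFrame({
    "canton name": ["Aargau", "Aargau", "Bern", "Bern", "Bern"],
    "Catholic": [120, 80, 15, 30, 5],
})

# Without groupby this sum would be a single total; with it, one sum per canton.
catholics_per_canton = census.groupby("canton name")["Catholic"].sum()
print(catholics_per_canton)
```

Swapping median or sum for mean or std works the same way, since each is applied once per group.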
And all of that works in the way that you would want it to, in the sense that here I can ask for the median, and I can switch that for the mean just as easily, or for the standard deviation. It works exactly like what we had before.

Now I want to show you a trick that took me a little bit of time to discover. I had to do this by hand quite a number of times, but then I discovered the trick and it made my life much, much easier. Once you have done this group-by, sometimes what you want is not just to compute a number; you would like to grab the data for each group, maybe for doing some testing later on. Let's say you want to do some sort of ANOVA: that means you need the list of all the numbers for each canton, or for each of the categories that you want. For that we use the function agg, for aggregation, and we say that we want to aggregate to something such as, for example, a list. When you do that, what you get is actually a Series. Each element in this Series is named by the canton that is of interest to you, or the name of the category, and each element is a list of all of the values that are of interest to you. Then it's very easy to just grab them, rather than having to manually build a lot of masks, because if you have to do one mask per canton it will take you forever. Here it's automated for you in one single simple line of code.

Just to show you how this can then be used in practice, here I briefly show you how I would do an ANOVA on the number of Catholics per canton. I take my census data frame, group by canton name, grab the Catholic column, and aggregate to a list, so I get something like this.
And then I feed that to the function f_oneway, so named because an ANOVA is an F statistic and it's a one-way ANOVA, from the scipy.stats module. I just feed this whole object there, and this little star is the Python way of saying: give the first element of this container as the first argument, the second as the second argument, the third as the third argument. So it basically gives all of these to the function as separate arguments, and that does my ANOVA all at once. Here, in two little lines, I get the ANOVA I wanted. I would say it's not fundamental to know this; it's more like a trick, but it's quite a useful trick when you start to do a lot of data analysis with Python.

All right, that's mostly what I wanted to show you. Another little thing is that once you have these little structures, they come to you as a Series, so a pandas structure. Sometimes it can be nice to have it as a dictionary, where the keys are the canton names and the values are these lists, so that you can easily refer to one or the other, exclude a few, and so on. Each time you are playing with a data frame, or Series, sorry, you can always call this to_dict method, which will transform the Series into a dictionary where the index entries are the keys and the values are the values in the Series. So by doing that, you see I chained one more thing at the end, which is a to_dict. Now I have a dictionary where the first key is Aargau, with all the numbers of Catholics per commune in Aargau, and then if I scroll I have another one, and another one, and so on and so forth. So all of these basic types are sort of interconnected, and it's possible to go from one to the other in a few calls.
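The aggregate-to-list trick, the starred call to f_oneway, and the final to_dict can be sketched as follows. This is only an illustration under assumptions: the census table is a made-up stand-in and its column names ("canton name", "Catholic") are hypothetical, not taken from the course file.

```python
import pandas as pd
from scipy.stats import f_oneway

# Made-up stand-in for the census table: one row per commune.
census = pd.DataFrame({
    "canton name": ["Aargau"] * 3 + ["Bern"] * 3 + ["Zug"] * 3,
    "Catholic": [120, 80, 100, 15, 30, 5, 300, 280, 320],
})

# A Series indexed by canton; each element is the list of values for that canton.
per_canton = census.groupby("canton name")["Catholic"].agg(list)
print(per_canton)

# The star unpacks the Series: each list becomes one positional argument,
# i.e. one group of the one-way ANOVA.
result = f_oneway(*per_canton)
print(result.statistic, result.pvalue)

# Same content as a plain dictionary: canton name -> list of values.
as_dict = per_canton.to_dict()
print(as_dict["Bern"])
```

With the dictionary in hand, leaving out a canton before the test is a matter of filtering the keys, for example `f_oneway(*(v for k, v in as_dict.items() if k != "Zug"))`.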
Sometimes, of course, you need to experiment a little or go back to the documentation to check exactly what works, but it's quite convenient to play around with. So let me then clear the feedback. That's it for the group-by.

Do I want to do this micro-exercise? Let me check just briefly. Yes, let's do this micro-exercise. It is basically a little application of the group-by that we've done together. Just be sure to actually use the function that has already been copied there: apply it to compute the age category for all elements in the data frame, and then do a group-by on these new categories to compute the survival rate.

So let's try and solve this together. The first task was to create this new column, the age category. I gave you this function, and we want to apply it to the age column, and that gives us some categories. We want to push that into a new column, so dfc, and then we will call that, I don't know, agecat, and then an equals sign. I can just check this with a little head, and you see here I have this agecat column, and it depends on the age. When the age is above 17 we have adults, and then depending on these little thresholds it will be teenager, child, and so on and so forth. So far so good? Yes, a few people. I know it's getting a bit late in the day, but let's make the most of the time we have.

Once we have that, we do our group-by, so dfc.groupby, and we group by agecat. The task says: compute the survival rate by gender, age category and passenger class. So we group by age category, and we want the survival, so let's take the survived column. Survival is true or false, so if we take the mean of that, we get a survival rate, that is, on average how many survived.
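Here is a minimal sketch of the two steps of this micro-exercise. The categorising function below is a hypothetical stand-in for the one provided in the exercise: only the "above 17 is adult" rule comes from the session, the other thresholds are invented. The data and the column names ("Age", "Survived", "agecat") are likewise assumptions.

```python
import pandas as pd

def age_category(age):
    """Hypothetical stand-in for the function given in the exercise."""
    if age > 60:      # invented threshold
        return "senior"
    if age > 17:      # threshold mentioned in the session
        return "adult"
    if age > 12:      # invented threshold
        return "teenager"
    return "child"

# Made-up passengers; "Survived" is boolean.
dfc = pd.DataFrame({
    "Age": [30, 8, 65, 15, 45, 3, 16, 70],
    "Survived": [True, True, False, True, False, True, False, False],
})

# Step 1: apply the function to every age and store the result in a new column.
dfc["agecat"] = dfc["Age"].apply(age_category)
print(dfc.head())

# Step 2: the mean of a True/False column is the fraction of True values,
# so the grouped mean is the survival rate per age category.
survival_rate = dfc.groupby("agecat")["Survived"].mean()
print(survival_rate)
```

Replacing "agecat" in the groupby call by another categorical column gives the survival rate for that grouping instead.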
So then this, and then the mean. By age category, we see that adults survive in 39% of the cases, children 58%, seniors only 9%, and teenagers 47.7%. Then we can duplicate that; maybe we want to do it by sex. We see that there is a definite bias there, where females survive more than males. And then by passenger class: we group by it and we see again a bias by class, where it seems that passengers in first class survived with a higher frequency than passengers in third class. All right, so that was it. Are there any questions on this group-by operation? Everything good, everything clear?

Okay, so if there are no further questions: we have seen how to do the group-by, and it lets us compute tons of stuff. Now we are going to see together how to do this visually, and for this we can use the hue argument of seaborn. A lot of seaborn plotting functions have this hue argument, and it's always used to do a sort of group-by, but visually: basically you get one line per category. For example, let me make that slightly smaller, with displot, with kind set to KDE, I just have on x the age, the data comes from the data frame, and the hue is the sex. Automatically it then splits the line by sex and creates this little legend for me. This is all automated, and I can do that with most of the existing seaborn functions.

Here is where I put a little digression about the colors. The default colors are quite nice. Usually when I'm just doing data exploration I don't change them, but when I want to create a plot for a paper, then of course I definitely change them and use something custom. So here I just want to spend a little bit of time on four different ways of specifying the colors. The first is to use the default palettes of seaborn.
The second is to change the palette of seaborn. If you go online, you will see that seaborn has a number of predefined palettes which you can cycle through and test. The third one is that you can manually set the colors, for example either with names, as I have shown you, like teal, dark orange and so on and so forth, or with hex values, so that's red, green and blue in hexadecimal. You can find plenty of guides online on how to go from these codes to the color and conversely. Or you can use other named colors. I've shown you a few, and I also show you here an example with the XKCD palette. You can go to this little link to see where these colors come from and so on. So this is the same plot; I've just changed the colors, and you do that with the palette argument. To the palette argument you give a number of colors, and then it will know: okay, I have two categories and you've given me two colors, so it works, and I will map the first category to the first color and so on and so forth. If you have too many colors, it's fine, but if you don't have enough colors then I think what it does is cycle through them. So there you go. That's how you can play around and change the colors with seaborn: use the palette argument. You have some links here if you want to go further, of course.

So that's our basic, let's say, visual group-by. Sometimes it's not enough, and we have plenty of other plots that we want to do with it. For this we have another figure-level function. Instead of displot, which is made to show one distribution, we can use catplot, which is made to show the distribution of a variable according to different categories. With catplot, you give as x one numerical column and as y one categorical column. Or you can do the converse, and then you will get it all arranged vertically instead of horizontally.
So yeah. If you don't want to switch the x and y, you can also define the orientation. Then you specify where the data comes from, and then you have these two arguments. Height we have already seen together: it determines how high the plot should be. Aspect is the width-to-height ratio, so if it's higher then the figure is wider, and if it's lower then the figure is narrower. So that's our basic catplot. You can see that it just shows little clouds of points, and that lets us see some of the data that we had seen before. In one plot, we can now represent the fare for all three classes, which is much simpler than what we had done in the exercise.

As I showed you, you can for instance change the aspect. Aspect equal to one means it's as high as it is wide, so you now have something that is a bit narrower. Personally, I prefer it a little bit like that. And as I told you, I can remove the orient argument and switch the x and y as well. You don't have to have your categorical column on y; you can also have it on x. See, now passenger class is on x. As long as there is one categorical and one numerical column, catplot will understand which is which and adapt the plot to what you want to show. So far so good? Yes.

Now, a little bit like displot had several kinds, which could be histogram, density line and cumulative density line, catplot has different kinds as well, and the main ones are shown there. You've got strip, the default; that's these little clouds of points. And then you have many which are quite useful. The ones that are used the most, I think, are box, to create box plots, and violin, to create violin plots. You also have other kinds; for instance, bar is the bar plot.
And then you have swarm, which is also clouds of points, but organized in a certain way. It takes a while to compute, so that's why I don't do it here, because there are many, many points to show. But when you only have a couple of hundred points, it's doable and it looks rather nice. Boxen is an intermediary: as you can see, it sits between a box plot, as shown here, and a violin plot, as shown there. And last but not least, there is point, which is shown there. It's a little bit like an alternative to the bar plot, except that a line is drawn between the heights of the bars, and that's it. It's just another way to visualize something. Here it's not very useful, but in some cases it's visually quite pleasant.

So you can see them all there. For this particular aspect, you see that I made it very, very narrow. You can see that here the box plots are visually informative, but the violin plots are maybe too squished; I would need more vertical space. With the bar plot you can see something as well; it works well with this limited space. The boxen plots are also quite informative, and this one as well, but there again the vertical space is maybe not big enough for us to really show something. So you see that you have many, many options, and you have to think a little bit about what sort of plot will work well depending on the visual space that you have at your disposal to show whatever it is you want to show from your data. That is something to keep in mind whenever you create visual representations: it's not purely about the function that you use, but also about how you organize the information in space.
So here you have some more information on the details. And of course, once you are there, you specify one numerical column and one categorical column, but sometimes you can also represent two categorical columns by adding the hue parameter on top of that, and then you create something a little bit like this. What I do there is that my x is the fare, here you see the passenger class, which is my main categorical column, and then the hue is set to the sex. It then takes control of the color and splits each box plot into two box plots, so for each class I have the distribution for each sex. I can see, for instance, that females seem to have paid more on average than males in first class, but this difference is maybe not exactly the same, or slightly different, for the other passenger classes. So this is the sort of visualization that we can get fairly easily by tweaking exactly which categories we show.
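A box plot split by a second categorical column, as just described, can be sketched like this. The table is a made-up stand-in and the column names ("Pclass", "Sex", "Fare") are assumptions; the grouped mean at the end is only there to read the same comparison as numbers. The non-interactive backend is selected so the sketch runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # no display needed; drop this line in a notebook
import pandas as pd
import seaborn as sns

# Made-up fares by class and sex.
df = pd.DataFrame({
    "Pclass": ["1st"] * 6 + ["2nd"] * 6 + ["3rd"] * 6,
    "Sex": ["female", "male"] * 9,
    "Fare": [110, 70, 95, 60, 130, 80,
             22, 20, 26, 18, 30, 21,
             9, 8, 12, 7, 10, 8],
})

# y is the main categorical column; hue adds a second one and
# splits each box into one box per sex, with a legend.
g = sns.catplot(data=df, x="Fare", y="Pclass", hue="Sex",
                kind="box", height=4, aspect=1.5)

# The numerical counterpart: group by both columns at once.
mean_fare = df.groupby(["Pclass", "Sex"])["Fare"].mean()
print(mean_fare)
```

Passing a list of columns to groupby is the numerical equivalent of combining a categorical axis with hue.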