 Welcome back everyone. Last time yesterday, where did we leave everyone off? We left it off with some basic stuff. Some questions about the indexing, the lock and iLock. Yes, go back to that for a bit. That's very good. I think we skipped over that maybe a little bit too quick. Let's go back to that. So yeah, we had some good questions. So how exactly now do we select columns and rows? And the truth is there are many, many ways in pandas to do this. And if you try to take in all of these different methods at once, it will be a giant mess in your brain. So let's see if we can simplify it a little bit. To two cases. So maybe you can demonstrate. I know let's say a case number one is that we want to select columns by name. Just a quick reminder from yesterday, we had this Titanic passenger database that contains some information about each passenger. And let's use info to list columns. So this is all the data, all the columns of data that we have in the data frame. So what should I do? So when we select a column by name, then we use just the square brackets and we just give it the column name. Okay, that's how you select a column by name. So and now select two columns. So you can also give it a list of names, which will select multiple columns by name. So let's do the name. Brian, can you be a bit softer or Yarno a bit louder? I can be a bit softer. Okay, thanks. Okay, so let's try. Now I shouldn't be too much louder. Give it another column. Let's say it's survived. Yeah, so this is how you select multiple columns by name. The agent survived. You pass it a list of names. So when we're selecting columns by names, that's when you use just plain square brackets. If you're doing anything else that is not selecting columns by names, anything else, we're going to use either dot lock or dot I lock. So let's do lock first. So lock also allows us to select rows and columns by name. And so this is with lock you index it like you would index a 2d non pi array. So say we want row and then column first row and then column. And if you leave away the column part can also only select rows just like in so here is a row name. Yes, we do it by name. And this would now select a single row. Yes, let's try. Okay, so now if you want to select both a row and a column, I would do that. Okay, so first row then column. Yes. And for example, if you want all the column or a range of columns, so you want to slice it. Okay, range of columns. It's still by name, remember? Yeah, so we're slicing columns by name. Okay, so we'll take from name to age. Yeah, is name still one of our regular columns? Um, actually probably not. So age is here. Let's go from survived to age. That's how you do it. Yeah, so if you want to select both rows and columns or just rows, anything that's not just selecting column by name, you use lock. Now, if you want to use integer indexes, so we don't want to like write out these long strings, we want for example, we want row number 20 to 30. So instead of this whole name, I want to use a number. Yes. To select a particular person. Yes, or range of person. You said 20. Yeah. So dangling Ilock. And it will tell me the name fortunately. Okay. And Ilock works exactly the same as lock, but now you use integer numbers to identify rows and columns. But you can do the same thing. You can slice it up. Say we want rows 20, 25 and 30. We want the three random rows. You give that as a list. That's why. Yeah, that. Yeah. Okay. I was about to ask. So the first entry is the rows and the second entry is the columns. So it needs to be one thing. And in this case, it's a list. Yeah. So we get three things. And then say if you also want like the second and the third column or something. Yeah. Like this. Yeah. So second and third columns are passenger class insects. And I guess you could also do it by slicing like this. Yep. Actually slicing from two to four, because it's not inclusive of four. Okay. So I think with these use cases, I think you can do almost anything that you want. So to quickly recap, if you are selecting a column by name, then you use regular square brackets. That's it's only if you're selecting columns by name. If you do anything else that is not selecting a column by name, you use either lock if you want to specify stuff by name or use I lock if you want to specify stuff by number. And those are the three cases you need to know. And then you could do anything. Let's leave it at that. Okay. Okay. So let us know in the HackMD if it's now more clear. If you still have questions, put them there and we will try to get to them. But let's let's move onwards. So we've now up to now with, I think we've always now not always, but we've been loading our data using some of these read functions, like CSV. And we can also read from Excel sheets. We can read from HDF files, rotation files, anything we want. And pandas can write to all these file formats. But maybe we should quickly show how to create these data frames from ourselves within Python. Maybe if we have some data in NumPy, how do we get that into a data? Okay. So first we'll create a NumPy array. And to do that, I have to import NumPy. Okay. And then I guess we can create random data. So it's a quick reminder of the some things from the NumPy lesson. So I'm too many parenthesis. I'm creating a two times four matrix. Six times four, yeah. Six times four. What am I saying? Okay. And I should save it into a variable. Yeah, let's do that. Okay. So now, if we want to put that into a data frame. So new data frame. What would be the easiest ways to do it, like the shortest way? Hmm. Shortest way, not sure. But I probably want to construct a data frame. Shortest way would just be the past matrix and not do anything else. Now it will not give any names or well, it will not give any useful names to the rows or the columns. No, just not. They're just zero, one, two, three, zero, one, two, three, four, five. But the data's in there. So if we want to specify some columns, let's specify some column names. Okay. Specify column names. So columns equals and then a list. A list of column names, yeah. At this point, just to make sure that this fits into on the screen. I'll do this. Okay. Now, hmm. Maybe just call them A, B, C and D. I can call them anything, of course. Yeah. Right now, the content isn't really, doesn't make too much sense anyway. They're just random numbers. So yeah, let's call them A, B, C and D. Right. And now let's give the rows some names. Yeah. So, um, I guess I can do rows equals. Yeah, unfortunately, sorry. Pandas calls it something else. Yeah. Pandas calls this the index. Yeah. Um, so I could give it a list. Yes. Should we try that? In the materials, there's a somewhat different example. Maybe we could try that. Which is also nice to show. Maybe first show that you can just give it a list and it will do what you expected to do. Okay. Now I have six names. Yeah. So now the rows have names. Beautiful. Yeah. But the index, also the column names, by the way, but also the index, it doesn't have to be string names. Let's quickly take a look at another super useful thing. And that is to put dates there. So, okay. It can be, uh, uh, we have to define dates. Where each row corresponds to a different point in time. So we can put date objects there. So Pandas has a function called date range. Which will give me six different dates. And apparently I have to give it a starting point. So, um, let's do 2021. Um, that now is 10th and 26th. So that minus five is 21. Okay. And then I want six days. Now it's called periods because you can do other things than days. Right. Um, but by default it's a date, a period if it is a day. So this should give me six different days. Yes. Let's see if it works. It does not. Um, you call it data range. Ah, data. No, date, date range. Okay. All right. So now each row is a different day. Yeah. Ending at today. Beautiful. Right. All right. So, um, that's one way to do it. So now we have, so our original data was in the form of a numpy array. Um, if we have like different columns, like we have like one column with numbers and one column with text and one column with dates, maybe all kinds of different stuff. Let's do that. We can't easily put that into a numpy array. So usually we, we, um, so, so here we probably want to use a dictionary. We have the data and in dictionaries, I want to put it into a data frame. Okay. So we'll create a data frame again. What I want to do is this. Okay. And now, or maybe what I want to do is actually this. So I will start a dictionary here. Okay. Yeah. And what does this dictionary contain? Well, keys. The dictionary has keys like this and then it has entries for those keys. And I guess this will be, this will be rows. No, this will be columns. So the keys will be the columns. This will be the column names. Like here. So remember that a Python data frame, a pandas data frame is always a collection of columns. That's how we find it. So also our dictionary will be a collection of columns. So the key will be your column name and then the value will be the contents of the column. And the values will, okay. Values will be the contents of the column, which makes sense because this will actually be turned into numpy arrays. Yes. Or arrays of objects, but preferably numpy arrays in most cases. So if I put something like, um, should, maybe first some, some numbers or Yeah, I'm now reports are you're a bit louder than right? Okay, but not by much. Okay. So what I'm wondering is if I should go back to the previous exercise at this point and call these runners already. Oh, that's maybe actually a good idea. So we're now recreating one of the data frames from, from yesterday. Yeah. Actually, maybe just copy paste that. So now this is a list. That's another way to do it. You can either give it a list of dictionaries. Yeah, but it can be just one dictionary, but you can also give it one dictionary. So there are many ways of doing this. Again, just like with the indexing, there are many ways of doing this. So we had runner one, runner two, runner three, three, four. Now this doesn't quite fit. No, let's just do three runners. Just three of them. Oh, okay. Yeah, true. We had only three in the, in the example. Okay. So that's a runner column. And let's do another column. Maybe age, age of the runner. And that should be a list. So 21. I mean, these are, of course, professional runners. So they should be relatively young. It's 229. Yeah. I'm not sure if that's realistic anymore. Okay. Here we go. So we have three runners and we have defined ages for these runners. So that's another way to do it. Yeah. So actually now we've seen three ways. You can do it from a numpy array. And you can do it from dictionaries, either from a list of dictionaries or from a single dictionary. So it depends on what formats your data is in. Usually you can just give it to a pen as data frame and it will try to do the right thing. Let's see how we, all right. Let's go through some other things we will have to do. Like, for example, splitting and combining data sets. How did that work? So maybe first splitting. So maybe split this little table you now made into, like, only the first entry and then the last two entries. And let's be consistent. So we're going to deviate slightly from the dictionaries. Let's use the indexing we've been promoting at the start. So are we selecting columns by name in this case? Well, in the example, no. No. So we're going to use a lock or iLock. Yeah. Okay. And well, we can do both. Well, we can split it into two in one row. We could also do this in two rows. But let's take use lock. Let's use iLock. Yes. If you want to do a range, a numerical range. And this will get the first two rows. Yeah. That's fine. And then the second part will be the rest. So yeah, from two forward. That's great. So that should give you yeah, that should work, I think. Now it didn't print anything. But we have a table sub one, which is the first two rows from here. And a table sub two, which only contains the last row. So what if we want to glue them back together again? Okay. So if we have two data frames that we want to put together, and they have exactly the same columns. Yes, that's more convenient. Yeah. You can also do it when they have different columns. So we'll get to that. So I guess the most general option is to use concat concatenate. And that just takes a list of data, data frames. So sub one and sub two. And because I'm not assigning it to anything, it will print it. Okay. So we have our original pack back. Yeah. Okay. So that's yeah, splitting concatenating that you can do. But that's that's when the columns match. Okay, yeah, let's let's. So now we get to a point where where we would like to discuss merging these data frames where the columns don't match. And that is actually, in my opinion, one of the killer features of pandas. This is where pandas really starts to shine. So we now have this great little table of the runners and their age. And we have previously defined the table of the runners and the time it took for them to run a certain distance, which I have in fact overwritten. So can we quickly recreate that again? Yes. So here we created the database with data frame with these three runners and the times. And we use this melt operation to bring that into a canonical form, where each row has one time for one kind of measurement and the distance for where that measurement was taken. Okay. So we have now these many columns, each either runner one, runner two, or runner three and different variables for each. And we have a table that contains for each runner their age. And I will need to change that into something for the different names. So let's call it the ages database data frame. I keep saying database. Okay. So now we have data. Df is the big data frame. And then we have the ages in a smaller, yeah, smaller frame here. Yeah. Smaller table. This can be a, this is sort of a little example of a scenario where you pull data from different sources so that ages may come from the population back to history and the big table comes from the, from the running competition. So we now want to combine this data from these two different sources and want to just smoosh them together. And if we want to do this manually, it's going to be, well, it's going to be a bit of trouble because we have to go like row by row because nothing matches. The columns don't match. The rows don't match. We cannot just glue them together. We actually have to like piece everything together. But pandas can do it for us in really elegant way. And it's called a merge. So again, if you're familiar with database systems, this is what you do all day. You merge to get a different tables. Pandas can do the same. So there is a merge function. Yeah. So say for, we, yeah, we want, maybe use the, the big table as a basis. So maybe call it. Okay. So the big table comes first. Yeah. Well, well, let's, let's use the merge as a method of the first. Okay. Df.merge. Yes. Let's do it like that. So our Df, our big table is like the basis. And now we're going to merge in the ages of the runners. So this is now not the pd.merge, which would be a function inside pandas. Pandas is pd here. This is df.merge, which is basically the same function, but we are calling it, well, we're calling it from or with the Df data frame. Yes. The big data frame. And we'll have another data frame as an argument here. Yes. So, and that would be the ages data frame. And now there's one more thing we need to tell pandas. And that is what column to use to match the rows together. Yeah. So there's only, of course, here, one column that has the same values. So we cannot use age because the other table doesn't have ages and time or distance. But in principle, there might be two columns that match. Yes. But here we'll use the runner column. Yeah. So we're going to use the values in the runner column to see which rows should match to which rows. So let's see what this does. Okay. It has now successfully annotated our data with the ages of the runner. See in it, whenever runner one appears in the set, it will have, it has annotated with the correct age. So this is super useful for pulling together data from multiple sources and editing things. And so pandas is very clever. So you have merge, but you have, this is one way of merging. So the merge method, if you look at the documentation, it has many parameters, many different types of how to merge, what to do with missing data, how to deal with columns that are in one data set, but not the other data set and so forth. So what almost anything you want to do, like with smushing together different tables, you can even do it. And it will take care of all the little finicky things, make sure that the rows match and the right ages aside to the runner here and that sort of thing. So it says a huge amount of time. Okay. So let's talk about another super powerful function. So merge super powerful. Let's talk about group by. Okay. I need to refresh this page. We've updated this morning. Let's try. Yeah, that's better. All right. So we've just come, yeah, we just came from our merge example from the runners. Let's talk about group by. So that, let's get back to the Titanic data set. I like it. It's nice and dramatic. So you know, the nautical saying goes that women and children first, right? So let's explore whether they applied that to the Titanic. So if you were a woman, if you were a child, did you actually have a better chance of surviving the accident? So for this, we have to group the data. We have to split the data into different groups. You have to split first by sex where you mill. So what we did yesterday was use group by to group by. And we just had one column here. We had the survived column. Yeah, that's what we did yesterday. But now we want to look at the survival rate. I mean, I'm sorry, group by. So say we want to first split them up by, yeah. And then we can take the mean of. Now we want to select a column. We want to select a column by name. If we would do anything else, we would use a lock or iron. And then we do it. Then we take the mean. So this came to the survival rate. Mean of survived is the survival rate. And it will give it for each sex separately. So I think we did this yesterday or something very similar. But now we want to do something a bit more complicated. So we split up the data even more. That makes two groups. First split on based on sex and then split further down on whether you are a child or not. Women and children first. But we don't have that data and that information. Yeah, we don't have a column yet for whether one is a child. But we do have an age column, don't we? So let's see, when are your child? Well, it was a long time ago. So we just assume, okay, if you're younger than 12, you are a child. So if people are putting people into lifeboats, what do they consider a child who looks like a child? Yeah. So first, we're creating a new column. We are taking a column by name, but this column doesn't really exist yet. We're assigning to it. So this is completely legal in Python. And what are we assigning to it? Well, we're taking the age. And we want to check that it's smaller than something. Let's say 12. Yeah. Okay. And I kind of want to maybe see now what type of column did we create here? Let's do actually titanic.info. And did I create it twice? Well, actually overwrites one, because it's the same name. Okay. So our... Okay, but let's first send us a discussion of group by because it's such a powerful function. Yeah, okay. Almost every time you spend us, I'm doing group by stuff. Yes. So now we have the child column. And we want to know the survival rate of adults and children. Yes. And the child column was now, it's a column made out of Boolean value. So either true or false. Oh, right. That's the type of the column here. Yeah. And that's fine. We can use that to group by. Okay. So now we can group by two things, right? We can first split based on male-female. Give it a list of column names. And then we group by whether they work or not. Pandas, sorry. Yeah, this is great. We want the survival again, whether they survive or not, and to get the rate for each. So first we group, group by, not just group. Okay. We group by two things. So we create four groups. We take the survival for each group and the mean. So this is taking the survival rate. And it'll do it for each group separately. Like, well, we kind of saw earlier already. I made a mistake. What mistake did I make? The column not found. Survival. Oh, it's called survive. Survived. True. Ah, there we go. Now it worked. Okay. So first by sex and then by whether the person is a child or not. So it's certainly beneficial to be a woman if you want to survive the tectonic accident. Yeah. I mean, it is kind of children first, yes, in that children go before males. But it is women and children first in that order, I guess. Yes, first the women and then perhaps a couple of children. Yeah, adult male. Well, the difference between 75 percent and 59 percent is 60 percent. It's not huge in this small sample of people. Yeah. Okay. So that's group by. Hi. Okay. So, yeah, let's now we've talked long enough, I think. Yeah, let's give them an exercise. So go and play a little bit with this tectonic data set. We've shown you a few very powerful functions. Can you answer these sort of questions? Like, what different friendly sizes exist? There's a unique method there in panels you can use. Yeah, and then find out. Okay. So when you have the family sizes, find the largest family size and what are the names of the people in them. And you can also create histograms using the distribution of these family sizes and passengers and things like that. You have a button function where you saw also yesterday called East. And after this exercise, we'll look at a bit more plotting functions. Now I would invite everyone to start doing this. So, so we give our welcome back everyone. Yeah, welcome. So should we, I mean, we probably won't go through the rest of the material, but we can quickly showcase some of the most important things, which is plotting. I think. So, and we can just do this with the tectonic database in the, if you go through the materials later, you will create a new database with noble laureate data, but the tectonic database is fun anyway. So we already did this, we just did it here, we did this group buy and took this created the survival rate or calculated the survival rates for each, well, each sex and each, whether they're children or not. So, yeah, and maybe we can plot that in, well, at least one way, a couple of different ways, maybe. So here we want to take, I'm actually plotting the age in my example. Can we plot the survival rate? Well, you can try easily. We could create a new column that is the survival. So when you plot box plots, you generally want continuous values there. Yes or no. So it needs to be for each person. Okay, we'll just take the So in box plot, you group buy, you do essentially group buy, but you use the buy argument here. So let's use the child column and the sex columns, they need to be strings. Okay, so it's a list of strings. It will group by these two columns and then plot the age as a box plot for each. And here it is. So kind of unsurprisingly, children tend to be younger than adults. This is 12, is the cutoff point here. Cutoff point. And the women were slightly younger, maybe? Not significantly. Okay. Okay, for a final trick, we only have one trick left, right? Then we are out of time. So let's create a bar plot. So let's do something simple. Let's just say, okay, we have three passenger classes here, first class, second class, third class. And we also know the amount of money people paid for their tickets. So we're doing group buy. We have three classes. Yeah, so for the bar plots, we need to do this grouping buy. We have to do it ourselves. And this is true for many of the plotting functions. So it's a general thing that often when you want to plot stuff like the mean prize, somebody paid for a ticket, you first need to do this computation yourself and then feed it through the plot. So before calculating the mean, you need to group because otherwise it will take the mean for everyone. But yeah, let's take the fare for each class separately. Take the mean of that. And this will return numbers as you saw before. So it returns numbers. But we want to plot that. So is it just plot? Yeah, and then you said the kind of plot to bar. Some of the plots have dedicated functions to them like the box plot. And there are many different kinds of plot that go with the plot and then kind equals bar. Or I think you can also just do box plot or line plots or circle diagrams, things like that. But this does show you quickly and visually how much more expensive first class is compared to the other two classes. How close to each other actually second and third class? Yeah, it's not that bad. Okay, I think that's the time we had. So I hope you all got an overview of what you can do with pandas and how powerful it can be. There's many, many, many more things to learn and a good place. It also has good documentation. So I recommend everyone go read the documentation. Whenever you want to do some sort of transformation to your data, chances are very good that pandas has a function that does exactly as you want, if you can find it in the documentation. All right, let's leave it here. Let's go back to well, yeah, let's go for a break.