So yeah, next up we have pandas. Pandas builds on NumPy: under the hood it uses NumPy, and it lets you do a lot more high-level operations. Here we have my colleagues, who can introduce themselves and take it away.

All right. Hello, I am Marijn van Vliet, and I'm a research fellow at Aalto University. I use Python, and I also use pandas a lot for my research, which is in neuroscience. And co-teaching with me today we have Jarno. Hey. You've all seen me already today. Hey, again. That's great.

So, Marijn, can you tell me what pandas is? Yes, let's dive right in. What you need to know about pandas is that it's all about tables. Up to now we've been mostly working with NumPy, and NumPy really deals with huge, uniform grids of numbers. That's the kind of data that you get out of a measurement device like a telescope or the Large Hadron Collider: fields of numbers. Pandas is what you use when you have data that is more like what you get from a database. For example, a list of all the books in a library, or the passenger manifest of a large passenger ship, as we will see. I think the best way to show what pandas is and what it does is to load up some data and look at it through the eyes of pandas. So I'm going to ask Jarno to get a notebook ready. Okay. Everyone watching is of course free to follow along, or just watch Jarno.

So let's get some data into pandas. Maybe first import pandas. Yeah, the first thing I'll do is import pandas as pd. I guess that's the standard name. Yeah, that's what people usually use: just like np for NumPy, pandas is pd. And let's load some data. Pandas is very good at handling CSV data, comma-separated values, and it can even read data just by putting a URL in there. That is what is shown in the lecture notes below: we just have the CSV file as a URL there.
So this is a URL that actually points to a CSV file online. Yeah. You just use one of pandas' many data import functions, and this one is cool. So let's do read_csv, for reading a CSV file, and I'm calling the result titanic here. Okay. Oh, oops, I need to pass the URL as a parameter. Let's see what happens. Now I have something called titanic. Should I just take a look? Yeah, just type it in and let's see.

The first thing you will notice is that Jupyter notebooks know how to show these pandas tables really prettily: Jupyter renders this as an HTML table. In a text console you get a nice text rendering of it, but here inside the browser we get a nice HTML rendering. And here you see a table. Pandas calls this a data frame, so maybe I should also start calling it a data frame, but it is a table. This data frame is the passenger list of everyone who was on board the Titanic when it sank on its maiden voyage. And about what I was saying about NumPy really being about grids of numbers: you see that with this data we don't only have numbers. We also have strings, we have both integers and floats, and maybe categorical data. It's all in there. The way this is organized in the data frame is as a collection of columns, and what pandas provides is a huge number of tools to manipulate this data, to select from it and to compute all kinds of things with it.

For example, we can print some summary statistics. Would you like to try that, Jarno? Say we want to get a quick overview of this data; let's use the describe method. Oh, okay. So titanic.describe(). Let's see what that does. All right, that's a lot of things. Yeah, so what do you see, Jarno? Well, there are 891 passengers, but 891 survived? That doesn't seem right. No, well, it's the count.
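The loading and summary steps above can be sketched as follows. This is a minimal sketch, not the notebook from the session: the CSV text is a small hand-typed sample in the shape of the Titanic table (the rows and the variable name `csv_text` are assumptions for illustration), read from memory so the example runs offline. In the session, a URL string was passed to `read_csv` in the same position.

```python
import io
import pandas as pd

# Hand-typed stand-in for the CSV file (illustrative rows only).
# A URL or a file path could be passed to read_csv instead.
csv_text = '''PassengerId,Survived,Pclass,Name,Age
1,0,3,"Braund, Mr. Owen Harris",22
2,1,1,"Cumings, Mrs. John Bradley",38
3,1,3,"Heikkinen, Miss. Laina",26
4,0,3,"Moran, Mr. James",
'''

titanic = pd.read_csv(io.StringIO(csv_text))

# describe() summarizes the numeric columns: count, mean, std, min,
# percentiles and max. "count" is the number of non-missing values.
summary = titanic.describe()
print(summary.loc[["count", "mean"]])
```

In this sample the count for `Age` is lower than the count for `Survived`, because one age is missing, and the mean of `Survived` is the fraction of sampled passengers who survived.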
It's the number of values that are present. Some of these columns have missing values, Age for example, but Survived doesn't. Yeah, we have data for Survived for everyone, but not the age of every passenger. Then we have the mean. The mean passenger ID, I guess that's not that meaningful, but the mean of Survived is actually an interesting number: it's the survival rate on the ship. The mean age, about 30, might be interesting. Then there is the standard deviation and all kinds of other things. Okay, minimums. What do you call these in English? Percentiles. Percentiles, okay. And the maximum, of course. Maximum age is 80.

You see that it computes these values for all the numeric columns. Pandas is a lot about columns: when you have a data frame in pandas, the data frame really is just a collection of columns. So the rows are individual entries of data, individual people, and the columns are what data there actually is for all of these people. Yes, each row here is a single passenger, and each column tells us something about this passenger.

And we can do all kinds of more complicated data queries on this. For example, you can try the two that are listed below. Well, let's try one. Can we split the data into the people who survived and the people who did not, and compute the mean age of each group? Okay. So the first thing we want is to get the group of people who survived, let's say. Yeah. Let's do groupby, or are you suggesting something else? This is just a demo: look at the shiny stuff that pandas can do for us, see how easy this is. We're not going to go deep now into how this works. Survived is one of the column names. So we're taking all the different values of Survived and creating... okay, this is a group-by object. Printing it doesn't tell you a lot. But then we can take the Age column from that. Does printing this do anything?
No, it's a syntax error at the moment, because we forgot to call it. Okay, yes. It's still a kind of SeriesGroupBy. I mean, you can turn it into something that actually is printable, but it hasn't actually done the operation yet; it needs to know what to do. But now we are asking for a number, the mean. So we have grouped the passengers by whether they survived or not, which gives two groups, survived and not survived. We have taken only the Age column and are calculating the mean of that column. So let's see. It does a lot of work in a single line. Again, don't worry yet about exactly how each of these functions works. This is the demo to show what kind of stuff you can do with pandas, and you can quickly compute this. This is not exactly what I was expecting: the average ages are very close to each other. That's true. So the old adage, women and children first, well. I guess more children survived.

So let's do the second one: let's plot the entire age distribution. Maybe you can just copy-paste this one. Okay, well, we could. hist is for histogram. If I do this, it will add a line break and break the flow of the code. Let's see. Okay, well, we did it wrong. Oh yeah. So pandas also comes with some plotting functions. Tomorrow we will look deeper into the plotting functionality of pandas and a companion library called Seaborn, which provides many more visualization functions. Okay, so we have two plots, labeled zero and one. That's for Survived, whether they survived or not. Yeah, zero means they did not survive. And we're plotting the age. Yeah. And then there are some details about the graph. But the important thing is that zero is the people who didn't survive and one is the people who survived, and then here's the age. Yeah, so did most of the children survive? Can you scroll down?
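The group-by query from the demo can be sketched like this. It is a minimal sketch on a few made-up rows (the values and the names `grouped` and `mean_age` are assumptions for illustration, not the real passenger data); the histogram step is left out because it needs a plotting backend.

```python
import pandas as pd

# Illustrative rows only; None becomes a missing value (NaN) in the Age column.
titanic = pd.DataFrame({
    "Survived": [0, 1, 1, 0, 1, 0],
    "Age": [22.0, 38.0, 26.0, None, 4.0, 54.0],
})

# Group the rows by the values of Survived and keep only the Age column.
# Nothing is computed yet: this is a SeriesGroupBy object.
grouped = titanic.groupby("Survived")["Age"]

# Asking for the mean triggers the computation, one value per group.
# Missing ages are skipped.
mean_age = grouped.mean()
print(mean_age)
```

The result is a Series indexed by the group labels 0 and 1, so `mean_age.loc[1]` is the mean age of the rows marked as survived.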
There are a lot more children who survived than who didn't. But not all of them survived. No, some children didn't. Okay, so this is a way to see how pandas quickly allows you to do some data queries and answer these types of questions when you have a big table of... Wait, I'm reading this wrong. So the first two bars are really the smallest children. And I guess, yeah, there is nobody here. So the smallest ones all survived. Well, one is survived, right? And the first bar is ages, I don't know, zero to five or something? Yeah, if this is zero to five, or well, zero to three. Yeah. All the children from zero to three survived. Oh, okay. Yeah, that could be. Okay, good to know.

All right, but let's dial it back a bit and get into how this actually works. How does this data manipulation work? Maybe go back to the data frame, to the table, again. Okay, I'll do it again. All right. So, like I said, with pandas data frames we're always looking at things in two dimensions. We always have rows and columns, where the rows are the data points, the observations that we have, and each column gives us some information about each data point. And a data frame like this is really a collection of columns. You can see that by using the info method. The info method of a data frame, so of this titanic object, shows you some information about all the columns. This is how pandas really sees this data set: as a collection of, I think, 12 columns here. Yeah, zero to 11. And note that even though we have different data types, numbers and text in columns side by side, inside a single column all the data has the same type. So inside a column there can only be items with the same data type: integers, or floats, or strings. Text shows up as object in this case.
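The "collection of columns" view can be inspected roughly as follows. This is a minimal sketch on a three-row sample (the rows are assumptions for illustration); the exact type reported for text columns depends on the pandas version.

```python
import pandas as pd

# Illustrative rows only: one text column, one integer column, one float column.
titanic = pd.DataFrame({
    "Name": ["Braund, Mr. Owen Harris", "Cumings, Mrs. John Bradley", "Heikkinen, Miss. Laina"],
    "Survived": [0, 1, 1],
    "Age": [22.0, 38.0, 26.0],
})

# info() prints one line per column: its position, name,
# number of non-missing values and data type.
titanic.info()

# dtypes gives just the data type of each column. Each column has a
# single type; text typically shows up as "object" (newer pandas
# versions may report a dedicated string type instead).
print(titanic.dtypes)
```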
So these could be NumPy arrays. These are, in fact, NumPy arrays: each column is stored as a NumPy array, but that's behind the scenes.

Now, the most common thing you want to do with data is to select parts of it, like we have been doing: only the people who survived, or only the people who did not survive, and that sort of thing. Let's first look at how to select a single column. That's pretty easy, because you can select it in the same way as you would select things from a Python dictionary. Okay, so the name of the column is a string, so I guess I put in a string. And you used it correctly here: we use square brackets. Square brackets in Python mean: select something out of a list, or a dictionary, or a NumPy array. So here, too, we use square brackets to select something from this data frame. Now you only have the Age column.

There is another way to select things which may be useful to quickly show. If you don't want to type the square brackets and the quotation marks and all this stuff, there is a convenient shorthand: you can just write titanic.Age. Yeah, so this is a very Pythonic way of doing things, a very lazy way of doing things. But when you're exploring data in the terminal and just typing out commands, it is sometimes really useful for quickly selecting something.

Okay, well, that's a column. Now the big question: how do we select rows? Should I just try the name of the row? No, actually. Well, the rows don't really have a name. If you show the data frame again, there's something I must tell you about this first. All the columns have a name assigned to them. Yes, that much is obvious: all the columns are named.
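The two ways of selecting a column described above can be sketched like this. It is a minimal sketch on a three-row sample (the rows and the variable names are assumptions for illustration).

```python
import numpy as np
import pandas as pd

# Illustrative rows only.
titanic = pd.DataFrame({
    "Name": ["Braund, Mr. Owen Harris", "Cumings, Mrs. John Bradley", "Heikkinen, Miss. Laina"],
    "Age": [22.0, 38.0, 26.0],
})

# Dictionary-style selection with square brackets and the column name.
ages = titanic["Age"]

# Attribute shorthand: handy when exploring interactively. It only works
# when the column name is a valid Python identifier that does not clash
# with an existing DataFrame attribute.
ages_shorthand = titanic.Age

# A single column is a pandas Series; its values can be viewed as a NumPy array.
values = ages.to_numpy()
print(type(ages).__name__, type(values).__name__)
```

The bracket form is the safer default in scripts, since it also works for column names with spaces or names such as `count` that collide with methods.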
Actually, in pandas, the rows have names too. But the names at the moment are zero, one, two, so it's not very obvious that they are actually names. We didn't tell pandas to assign any specific names, so it just assigned zero, one, two, and so on. I think it will be clearer if we assign something else as the row names. So let's take the names of the passengers and use them. What was the function to do that? I could check it, but can you remind me? set_index, with an underscore. Yes, and it's called set_index because these row names are referred to in pandas as the index. We are indexing by name. If you know something about databases, for example SQL databases, then the term index may mean something to you: it's something that you look up rows by. But in any case, the names of the rows are called the index. So this line will use the Name column, the name of the passenger, as the row names. So I'm guessing it's very important here. Oh, it didn't print anything.

So if there are two people with the same name, this would not work correctly? This would actually work. The row names do not necessarily need to be unique. Oh, okay. The column names, I think, do not need to be unique either, but it would be very confusing. So it's good practice to be very much aware: do I have duplicate data here? They don't need to be unique, but it's very convenient if they are. Now it's more obvious that rows have names, because now they actually do have names.

So to get back to the question: how do I select rows? We can do it in two ways. We can select rows by their number, give me row number 15, or we can select them by their name, and maybe that's even more intuitive. If you look in the lecture notes, now at the bottom of the screen, there are many different ways to select data, to select rows.
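Turning a column into the row names can be sketched as follows. This is a minimal sketch on a three-row sample (the rows are assumptions for illustration); it assigns the result back to the same variable, which is one common way to use `set_index`.

```python
import pandas as pd

# Illustrative rows only.
titanic = pd.DataFrame({
    "Name": ["Braund, Mr. Owen Harris", "Cumings, Mrs. John Bradley", "Heikkinen, Miss. Laina"],
    "Survived": [0, 1, 1],
    "Age": [22.0, 38.0, 26.0],
})

# Before: pandas assigned the default row names 0, 1, 2.
default_index = list(titanic.index)

# set_index returns a new data frame whose index (the row names) is the
# Name column. The assignment itself prints nothing.
titanic = titanic.set_index("Name")

# Row names are allowed to repeat, but it is worth checking whether they do.
print(titanic.index.is_unique)
```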
For example, we can select a row and a column together by their names. So maybe do that, so we see how old someone is. titanic dot... So I want a row and a column. Yeah. Okay, well, at should do it. Yes. The names make it very clear that they are names, but the downside is that they take a while to type, and I might make spelling mistakes. So I'm taking the age of Mr. Alilam. Yeah. That is NaN, so it's not set, I guess. Can we try someone else? Heikkinen, Miss. Laina. Is there a dot after Miss? Yes. And is it abbreviated, or... It's the whole word. Okay. So it's 26. That's good. So now we can query single passengers. That's the at locator. Also note that I call it a locator, not a function, because, see, we don't use round brackets here. We're still using square brackets. It uses a bit of Python magic to be able to index this object in various ways. So it's kind of as if this at was an array, or something like an array. Yes.

So let's take a look at the others. Say we want all the information about Miss Laina Heikkinen. You're Finnish, so... Yeah, that's why I chose the name: I can spell it. For example, loc will give you a row of data, and you can just look it up by name. So for loc, do I need the colon, as in the... That's slice syntax. No, let's not do that. Just give it the name of a row. Yes. So now we get all the information about Miss Laina. Okay, so that's by name.

And we can also do it by number. For example, what if we want rows five to ten, say? Now we don't use loc, because loc works by name. We use iloc, which means... Index? I is for index, I guess. You are right, although the names are also called the index, so maybe that does not clarify as much as I was thinking it would.
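The three locators mentioned above can be sketched side by side. This is a minimal sketch on a four-row sample indexed by name (the rows and the variable names are assumptions for illustration).

```python
import pandas as pd

# Illustrative rows only, with the passenger name as the index.
titanic = pd.DataFrame({
    "Name": ["Braund, Mr. Owen Harris", "Cumings, Mrs. John Bradley",
             "Heikkinen, Miss. Laina", "Moran, Mr. James"],
    "Survived": [0, 1, 1, 0],
    "Age": [22.0, 38.0, 26.0, None],
}).set_index("Name")

# at: one single value, addressed by row name and column name.
age = titanic.at["Heikkinen, Miss. Laina", "Age"]

# loc: select by name; a single row name gives that whole row as a Series.
row = titanic.loc["Heikkinen, Miss. Laina"]

# iloc: select by position; slices work like Python list slices,
# so 0:2 means the first two rows.
first_two = titanic.iloc[0:2]

print(age)
print(first_two)
```

Note that all three use square brackets, not round ones: they are indexed like arrays rather than called like functions.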
So that would be the first five rows. Yeah. It is easier to remember when you remember that the i is for index. So this is the first five rows. Actually, the first six. And you can do all kinds of clever things. If you want, for example, rows three, five and six or something, you can also select with a list. Well, it needs to be a list. This is very much how you would slice a NumPy array. True. Okay, I will give it the list. All right. So there are two ways to do these things.

And now we get to a super useful thing. Now that we can select rows, what we want to do is select rows based on some condition. Say we want all the people who are older than 70. The syntax for selecting is the same, but we want something different in here. Actually, did we do this in NumPy? I think we did. So let's first do what goes inside the brackets. I select the Age column and say that the age must be larger than 70. Yeah, so that's what this looks like. Now you see the names, because the index is the name of the row, the name of the entry. But the value is either True or False. It's mostly False here; it is True where the value in the Age column is higher than 70. And if I do 30, we will probably see some more. Yeah, there are some True values as well. But let's go back to 70.

All right. So now we can use this as a mask, like we did with NumPy arrays at the end of the NumPy session. We can plug this into the selector syntax, into the square brackets. Yes. Now we get the whole row for the people who are older than 70. Yeah, and the age is here. Okay, yes.
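Selecting rows by a list of positions and by a condition can be sketched as follows. This is a minimal sketch on a five-row sample (the rows and the variable names are assumptions for illustration).

```python
import pandas as pd

# Illustrative rows only, with the passenger name as the index.
titanic = pd.DataFrame({
    "Name": ["Braund, Mr. Owen Harris", "Heikkinen, Miss. Laina",
             "Svensson, Mr. Johan", "Barkworth, Mr. Algernon Henry Wilson",
             "Moran, Mr. James"],
    "Survived": [0, 1, 0, 1, 0],
    "Age": [22.0, 26.0, 74.0, 80.0, None],
}).set_index("Name")

# Positions can be given as a list, much like indexing a NumPy array.
picked = titanic.iloc[[0, 2, 3]]

# Comparing a column with a number gives a Series of True/False values,
# one per row, carrying the same row names. A missing age compares as False.
mask = titanic["Age"] > 70

# Plugging the mask into the square brackets keeps the rows where it is True.
older = titanic[mask]

# An equivalent spelling that makes explicit that rows are being selected.
older_loc = titanic.loc[mask]

print(older)
```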
So now you've used the general square brackets, which are kind of magic. You could also use the iloc locator for this. So if you have titanic.iloc and then square brackets, I think that should also work. I think it doesn't. Okay, maybe that doesn't work; it really needs numbers. Or loc. Oh, loc does work. Okay, because this was by name. If we take just the mask, it has names and True or False, and loc takes names. Yes, and that's what it wants. And at least loc specifies that we're interested in rows, which is sometimes a bit clearer than just using square brackets. Because if you just use square brackets, pandas will try its best to figure out what you want. So they're very powerful, but they can also be a bit confusing when you do more complicated things with them. Okay, so we can write queries like this.

Let's try one more... actually, I think we should go to the exercise. You've now seen a bit of how to load data. You've seen how to select columns, and you've seen how to select rows, including rows based on some value in a column: all the people above 70, for example. So let's go to a quick exercise where you can try this yourself, and let's take maybe 10 minutes for it. You can find the exercise in the lecture notes. So load some data, and maybe have a look at the API reference for pandas or use the autocomplete feature to see what kinds of methods are available to you. You've already seen that there is a method called mean to compute the mean of values, but there are many, many others. Try to compute the mean age of the first 10 passengers. You first need to slice by row to take 10 passengers and then use mean to compute their mean age. See if you can do that. We will take 10 minutes for this, and if you have time left, there is more to try.
You can compute, for example, the survival rate among passengers over and under the average age. So you can do somewhat more complex queries on this data. Okay, so yeah, we'll be back. All right. Good luck.

Everyone, I hope you've managed to play a little with pandas, get the data inside and look around a little bit. Let's now talk about a section that's called tidy data. This really is about a convention. If you want to use the full power of pandas, you should organize your data in a way that fits nicely with its row and column structure. The most important thing is that in a tidy data set, each variable in your data is its own column. So with the passengers of the Titanic, one row is a passenger, and everything we know about this passenger is called a variable and is in one of the columns. Let's take a look at some data that is not organized like that and see if we can reorganize it into this tidy form. That may be the best way to explain it.

So Jarno, can you execute that bit of code that is in there? This bit of code will create a new data frame. Tomorrow we will get more into how you actually create data frames from scratch, but if you want a sneak peek, this is how we do it. This creates a data frame, but it's not tidy. This is some data about three runners who were running a race. There was a person with a stopwatch next to the track who timed the runners as they passed 400 meters, and then again at 800 meters, 1200 meters and 1500 meters, and wrote the times down. And this is now the data that we have. You see that 400, 800, 1200 and 1500 are now the names of our columns. What makes this data not tidy is that these column names are actually the values of a variable. We have a variable called distance: the distance that a runner ran.
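An untidy table of the kind described above can be built like this. It is a minimal sketch: the column name `Runner` and the time of 160 seconds for runner two at 800 meters come from the session, while the other times are assumed values for illustration.

```python
import pandas as pd

# One row per runner, one column per distance: the column names
# 400, 800, 1200 and 1500 are really values of a "distance" variable.
# Times are in seconds; most of them are assumed for illustration.
runners = pd.DataFrame([
    {"Runner": "Runner 1", 400: 64, 800: 128, 1200: 192, 1500: 240},
    {"Runner": "Runner 2", 400: 80, 800: 160, 1200: 240, 1500: 300},
    {"Runner": "Runner 3", 400: 96, 800: 192, 1200: 288, 1500: 360},
])

print(runners)
```

In this layout, a question such as "what is the mean time at distances between 500 and 900 meters" has to be answered by reasoning about column names rather than by filtering on a column.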
And this distance is either 400, 800, 1200 or 1500. So let's reformat the data such that we have the runner as a column, the distance that they ran as a column, and the time in which they ran that distance as a column. The original layout is the natural way that people would probably write this data down, but it is less machine readable. It's not as good for pandas, or for most programs. It will start to limit us when we want to do certain queries. Because distance is not explicitly a variable here, it's just encoded in the column names, it becomes trickier to do queries based on the distance. Say we want the mean time that runners took to run, I don't know, 500 to 900 meters or something, those sorts of queries that involve distance, or we want to plot time versus distance, that sort of thing. Yeah, so let's make a variable called distance.

In the teaching materials we now use one of pandas' many, many data mangling functions. Let's not get into the specifics of this one either. I recommend everyone read the documentation. There are so many ways to mangle the data that it would take weeks to go over all of them in depth. Just know that there are many. If there's something you want to do with the data, chances are very good that there's a specific method that will do it for you, and you can find it in the documentation. So with pandas you really have your data in one hand and the pandas documentation in the other, and this is how you use pandas: you always use it with the documentation.

But here I'm choosing that the Runner column will stay as it is, and I want to turn the values 400, 800, 1200 and 1500 into a new variable called distance, I guess. Distance, yeah. Yes, and then... What should happen? I put some unnecessary parentheses there.
So there is a name for the column I create, but I also need a name for the values. Yeah. The column distance is where the numbers 400, 800, 1200 and 1500 will go, but we also need a place where all the other values go. What's this? Look at the bottom: KeyError, Runners. Oh, there's no column called Runners. No, because it's just Runner. Okay, yes. It's probably clearer what this does after I actually run it, right? So Runner remains: runner one, runner one again, and so on. But then the old column names become the values of this distance column. Yeah, and the values of those columns become the values of the time column. So for example, if you now want runner two at 800 meters, that's now a separate row: runner two, at 800 meters, ran it in 160 seconds. This is the same data, just another way of looking at it, but now distance is a column of its own.

Note that we want each column to be a variable. However, it's not necessary that each row is an individual runner. For runner one, we now have several rows that each tell us something about runner number one. So in the tidy data format, each column is a separate variable, but each row does not have to be a specific runner, or passenger of the Titanic, anymore. So this is a bit easier to work with now.

Yeah, let's wrap this up. So what do you want to do, Jarno? Let's do some query with the distance, just to show how this now works. Okay, runner one. And I think you need parentheses, right? And let's go fancy: we take all the data for runner one where distance is bigger than 800. Let's see if this works. It works. Okay, that's one query. I guess the point was that we don't need to specify the runner. So, yeah. There's a parenthesis missing. Oh, actually an extra one. Yeah.
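The reshaping and the follow-up query can be sketched as follows. This is a minimal sketch that uses `pd.melt` as one way to do the reshaping described above; the session does not name the function, so treat that choice, the variable name `tidy`, and all times other than runner two's 160 seconds at 800 meters as assumptions.

```python
import pandas as pd

# Untidy input: one column per distance (times in seconds, mostly assumed).
runners = pd.DataFrame([
    {"Runner": "Runner 1", 400: 64, 800: 128, 1200: 192, 1500: 240},
    {"Runner": "Runner 2", 400: 80, 800: 160, 1200: 240, 1500: 300},
    {"Runner": "Runner 3", 400: 96, 800: 192, 1200: 288, 1500: 360},
])

# Keep Runner as it is, move the old column names into a "distance"
# column and the old cell values into a "time" column.
tidy = pd.melt(
    runners,
    id_vars="Runner",
    value_vars=[400, 800, 1200, 1500],
    var_name="distance",
    value_name="time",
)

# Make sure distance is stored as a number so it can be compared.
tidy["distance"] = tidy["distance"].astype(int)

# Each condition needs its own parentheses when combined with &.
far = tidy[(tidy["Runner"] == "Runner 1") & (tidy["distance"] > 800)]

# A query that does not mention any runner at all:
# mean time over distances between 500 and 900 meters.
mean_mid = tidy[(tidy["distance"] > 500) & (tidy["distance"] < 900)]["time"].mean()

print(far)
print(mean_mid)
```

The missing or extra parenthesis mentioned in the session is a common slip here: `&` binds more tightly than `==` and `>`, so each comparison has to be wrapped in parentheses.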
And you can quickly do things like that. Or, if you want, you can take the mean of that, and that sort of thing. Tomorrow especially we will look at plotting, but maybe you can quickly do it here. Let's make a quick line plot. I don't think we have it in the notes right now. Okay, we may be able to improvise this. Plot time versus distance, for example. So let's make a plot where on the x axis we have distance and on the y axis we have time. Is it pd? No, it's df.plot. And then you give it two parameters, and let's name them. So x equals what we want on the x axis, and y equals, for example, time. The x and y are the x and y axes of the plot. Time is with a lowercase t. Yes. There we go. Now there are several runners at each distance, so it looks a little bit weird as a line plot. Yeah, we can probably give the runners different colors, but now I would have to look at the documentation. It's probably color equals runner or something, but I'm not quite sure. Okay. I mean, we don't have to do that. No, of course it wasn't that. See, now I would also have to go to the documentation. Yeah, there's much more than you can remember. But these sorts of things are easier when all the variables are neat, when one column is a single variable and we don't have weird column names that are numbers. That's the main point of tidy data.

Okay. So now you know a little bit about pandas: how data should be organized in there and how we do simple queries on it. That's it for today's lesson on pandas. Tomorrow we will pick up with more complicated queries and a lot more about plotting. But for now, this is it. Thank you, Jarno, for your assistance. All right, thank you. Does anybody on the teaching team know how to plot runner as a color?
Is there a preferred way to specify both rows and columns? I think you would do that with loc. We talked about the loc selector a little bit, and there we said you can use loc to select rows. But loc can actually also take two selectors, so you can get both rows and columns with loc. So the question was about both a specific row and a specific column, right? Oh, do you still remember the name? Nope. This is actually exactly the line I was starting to type. So if I want the row of Heikkinen, Miss. Laina and the column Age, I use .loc and then give both of the values: first the row and then the column. Yeah, these can also be slices, by the way. So you can take all the rows from the beginning to Laina. Oh, yeah, everything until Laina. And from Age forward. Yeah, Age and all the rest, right? But in this particular data set the rows aren't really ordered, so I'm not sure if this makes sense. It would make more sense if we sorted them by name first. Of course, right? Pandas has many ways to sort the data, as you can imagine.

The easiest way to get different plots for different runners was to use groupby before calling plot. And I don't actually need to have this time here; that was unnecessary. Oh, well, no. Okay, now it made three plots. That is an interesting reaction. Why did it do that? Okay. Yeah, so now we have the times. Which one was that? First one. Okay. Any other burning questions, or shall we leave it for today? Maybe we all deserve to take a long break.
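The answer about selecting rows and columns together can be sketched as follows. This is a minimal sketch on a four-row sample (the rows and the variable names are assumptions for illustration); it sorts by name first so that the label slice over rows is meaningful.

```python
import pandas as pd

# Illustrative rows only, deliberately not in alphabetical order.
titanic = pd.DataFrame({
    "Name": ["Moran, Mr. James", "Heikkinen, Miss. Laina",
             "Braund, Mr. Owen Harris", "Cumings, Mrs. John Bradley"],
    "Survived": [0, 1, 0, 1],
    "Pclass": [3, 3, 3, 1],
    "Age": [None, 26.0, 22.0, 38.0],
    "Fare": [8.4583, 7.925, 7.25, 71.2833],
}).set_index("Name")

# loc with two selectors: first the row name, then the column name.
age = titanic.loc["Heikkinen, Miss. Laina", "Age"]

# Sort the rows by their names so that "everything up to Laina" makes sense.
by_name = titanic.sort_index()

# Both selectors can be slices. Label slices include their end point:
# rows from the start up to and including Laina, columns from Age onward.
subset = by_name.loc[:"Heikkinen, Miss. Laina", "Age":]

print(age)
print(subset)
```

Unlike position slices with `iloc`, label slices with `loc` include the last label, which is why the Heikkinen row appears in `subset`.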