So, yes, pandas. It's a Python package that builds on NumPy and allows us to do many more fancy operations, basically ones that involve tabular data. We will see it has lots of different built-in functions and things like that. Pandas is one of these things where, at least for me, I know the basics, but for almost everything I do, I will do a web search and then figure out, okay, how do I do this particular thing, to find the functions I need. Yeah, that's the same for me. I'm sure there are some people who know it really well, but in fact I wanted to add an exercise here which was specifically about doing this web search to learn more. So what's the point? The point is, be aware that this lesson is a starting point. There are three exercises. We do two exercises today, hopefully in about 20 minutes, and then one tomorrow, where we'll continue.
So, yes, let's get started. Here you should see my screen. There are a lot of different getting started guides here, and really I'd say we have an overview, but you're going to have to read these quick start guides and so on anyway. Can someone add a pandas section to the notes? I can't do that right now because I'm talking. Okay.
So, the pandas overview: it's conventionally imported with import pandas as pd as a shorthand, and then it provides lots of functions, for example reading from CSV. I think NumPy also provides CSV reading, but pandas actually does a lot more and provides lots of summary information and things like this. But no, Yarno will be the one doing the typing, so let's switch to Yarno's screen. Yeah. So I guess the main difference, in terms of reading a CSV file, is this: in NumPy, all the data in an array will always be the same type. If there's one integer, it's all integers. If there's one floating point number, it's all floating point numbers.
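A minimal sketch of the point just made about NumPy arrays holding a single data type. The values are invented for illustration and are not from the lesson.

```python
import numpy as np

# An array built only from integers gets an integer dtype.
ints = np.array([1, 2, 3])

# A single floating point value makes the whole array floating point.
mixed = np.array([1, 2, 3.5])

print(ints.dtype)
print(mixed.dtype)
```

Every element of mixed is stored as a float, including the 1 and the 2.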
And pandas has different kinds of columns. It reads the column header, and it keeps a lot more information about the data, such as the different types. Yeah. So let's show this. First things first, I'll start by importing it. We'll be going relatively fast here, so you can try to follow along, but at some point we'll probably get ahead of you. Then stop, watch what we're doing, and come back to it during the exercises. Yeah. Okay. So here I go, I imported pandas.
Then there is this address in the notes, which I'll mostly copy, although I think I missed the .csv. I'm just assigning this string to a variable, and it points to a CSV file. And again, I didn't create it. Okay. This could just as well be a file on your computer. Yeah. So what we're seeing here is loading the data directly from the web. I will call the data frame titanic, because it is about passengers on the Titanic. And let's use the read_csv function to read it. The mouse is hovering over the text. Okay. And url is the variable set here. And I will also tell it to use an index column.
What is an index column? Well, let's show the next line, titanic.head(), and we can see. Every pandas data frame has columns and rows. But notice the first row here, the column names, is bold. And so is the first column, this one, the Name column here: that is the index. So basically, in NumPy, we'd access these rows as rows 0, 1, 2, and so on. In pandas, we can also access them by these names here, which we will see a little bit later. Yeah. Okay.
Let's do another one. Another very useful function to start with is describe. Yes, that's spelled correctly. Okay. So we see there are summary statistics here, for example the count, so how much data actually exists. Yeah. I guess these are the percentiles, 25%, 50% and so on. They're not that useful for the passenger ID, but if you look at age, for example, then they make some sense. So let's go on. There are two more things here. We will show them, but not explain what they do.
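A minimal sketch of the steps just shown (read a CSV with an index column, look at the first rows, get summary statistics). To stay self-contained it reads a tiny made-up CSV from a string instead of the lesson's URL; the passenger names, values and column set are hypothetical, and pd.read_csv would accept a URL or a local file path in the same place.

```python
import io

import pandas as pd

# Hypothetical stand-in for the Titanic CSV used in the lesson.
csv_text = """Name,Survived,Age,Fare
"Doe, Mr. A",0,22.0,7.25
"Roe, Mrs. B",1,38.0,71.28
"Poe, Miss C",1,,7.92
"Moe, Mr. D",0,35.0,8.05
"""

# index_col tells pandas which column holds the row labels.
titanic = pd.read_csv(io.StringIO(csv_text), index_col="Name")

# First rows: column names on top, the Name index on the left.
print(titanic.head())

# Summary statistics (count, mean, std, min, percentiles, max) per numeric column.
summary = titanic.describe()
print(summary)
```

The count row for Age is lower than for the other columns because one age is missing.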
So these two commands show how you can do some pretty fancy things. The group by means that we separate the passengers into two groups, depending on whether they survived or not, then take the age of those groups, and then find the average age of these people. So, okay: group by Survived, take the Age, and take the mean of the age. Okay. So we see the average ages for survived, one, and didn't survive, zero, are pretty similar.
And let's do the next one, this histogram. Okay. Yeah. So, the Age column. This is going to make a histogram of the ages, and there'll be two groups, the passengers who survived and those who didn't. And then some other parameters, like the number of bins, the layout, and so on. I wonder how important these parameters are for this example. I'll just copy. Yeah, I would say just copy. So figure size, layout, z-order, shared axis. Let's just go on. At this point, we don't need to know what these things mean. Okay. So, "positional argument follows keyword argument". There's by equals Survived there. Oh, this needs to have an equals sign. Okay. Yes. Okay. And we see these two figures. Zero didn't survive and one did survive, and this is showing the histogram of age. We're not going into the details here now, because this is just a preview.
But now let's go to the actual details of what just happened. Okay. So what's in the data frame? Maybe start with the info function, I guess. Can you go to the lesson on the screen again? Yes. There's a picture there. Yeah. Okay. So this is what a data frame looks like. We see it consists of rows and columns, with column names instead of just numbers at the top, and an index column on the left side. Each of these columns is itself a pandas.Series object, and the data inside these Series objects is stored as NumPy arrays. Okay. So should we go back to Jupyter and run titanic.info()?
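A minimal sketch of the group-by preview and of the data frame structure described above, on a tiny invented table (the values and row labels are hypothetical, and the histogram is left out because it needs a plotting backend).

```python
import numpy as np
import pandas as pd

# Hypothetical passengers: 0 = did not survive, 1 = survived.
people = pd.DataFrame(
    {"Survived": [0, 1, 1, 0, 1], "Age": [22.0, 38.0, 26.0, 35.0, 30.0]},
    index=["a", "b", "c", "d", "e"],
)

# Split rows by the Survived value, pick the Age column, average each group.
mean_age = people.groupby("Survived")["Age"].mean()
print(mean_age)

# Each column is a pandas Series, and its values are held in a NumPy array.
age = people["Age"]
print(type(age).__name__, type(age.to_numpy()).__name__)
```

The result of the group-by is itself a Series, indexed by the group values 0 and 1.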
The head looks a lot like the picture, although a little bit different. Yes. There's also the info function, which tells you what columns exist and what types those are. So these are the Series that exist in this data frame. And these data types are, in fact, the NumPy data types, int64 and float64.
Once we have this, we can do things like extract single columns. For example, we can take titanic and index it with square brackets by the column name. Yes, the column name Age. And notice it has pulled out both the names and the ages, because the index got preserved in the Series itself. And there's another way here, titanic.Age, with the column name as an attribute. This doesn't work when the column name is also an attribute or method of the data frame, but it's a convenient shorthand. So this is the same thing, basically? Yeah. Okay.
We can list all the columns with titanic.columns. Now it returns a list-like Index; otherwise this is the same information as in the info function. But since it's list-like, you can, for example, do a for loop over it. Yeah. I guess it didn't contain the index. We have the name as the index, so you can also take the index. Yes, although this will now be a list of the names. The name of the index column is Name. This is in fact a NumPy array inside as well.
We can get single individual values in different ways. For example, with this loc method, titanic.loc, we can give it a position by the index and then the column name. The index is the name again, so I'm typing a name in, and then the column name is Age. It's easy to make a spelling mistake here, so it might actually be good to... okay. I think Lam is capitalized. Yeah. Okay. NaN. Okay. Well, the age is not recorded in the data frame. So there's a little bit more here. I think we don't have time to go into all these details, but we can do things like using at to set a value.
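A minimal sketch of the selection steps described above: column access by name and by attribute, listing columns and the index, and reading or setting a single value by label. The passenger names and values are invented for illustration.

```python
import pandas as pd

# Hypothetical data; the index plays the role of the Name column in the lesson.
df = pd.DataFrame(
    {"Age": [22.0, None, 26.0], "Fare": [7.25, 56.5, 7.92]},
    index=pd.Index(["Doe, Mr. A", "Lee, Mr. B", "Poe, Miss C"], name="Name"),
)

ages = df["Age"]   # a Series that keeps the Name index
same = df.Age      # attribute shorthand for the same column

# df.columns is list-like, so it can be looped over.
for col in df.columns:
    print(col, df[col].dtype)

print(df.index.name)

# Label-based lookup: row label first, then column name.
missing = df.loc["Lee, Mr. B", "Age"]   # NaN, the age is not recorded

# at reads or sets one single value by label.
df.at["Lee, Mr. B", "Age"] = 40.0
print(df)
```

The row label has to match exactly, including capitalization and punctuation, which is why copying the name is safer than typing it.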
So if we copy. Yes. Okay. Well, I'm not going to use the index method here, so I'll just copy the name. Yes. Should we set the age? Okay. Age. It says set the age to 42. Oh, well, okay. I mean, I set it to 40 instead of 42, but that's fine. Yeah. Okay. We don't actually know the age.
So one thing that's taken me some time to get used to is that you can use the loc and at methods to get values based on the names in the indexes, but you can use the iloc and iat methods to get values based on position, like first row, second row, first column, second column, and so on. And these have different uses in different cases. For example, if there's a big time series, then it makes sense to extract what happened on a particular day. But if you're iterating through, maybe it makes sense to get the first row, then the second row, and so on. Okay. So basically, it depends: if you're doing something for all the columns, or splitting the data in some way by numbers, then use the numbers. Otherwise, it's usually more readable to use the names. Yeah. Okay.
We can do Boolean indexing just like with NumPy arrays. I guess I'll do it in a couple of steps. So did you want to get something other than this age thing? Let's go on. Let's extract the passengers that are older than 70. So this returns a Boolean Series. It's mostly False. Most people are not over 70 years old. But I can use that as an index for the data frame and just take the passengers who are older than 70 years old. Yes, five of them. So this is now just the passengers older than 70. Okay. And this looks a lot like the NumPy syntax, because it's designed around that. Yeah. It's good that things work in a similar way.
So there are a lot more things to demonstrate here. For example, we can get all the NA values. We can remove the NA values. We can replace the missing data with other things. Which one would you like to demonstrate? Maybe let's... So, okay.
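A minimal sketch of position-based access, Boolean indexing and the missing-value operations just listed, on invented data (the row labels p1 to p4 and all values are hypothetical).

```python
import pandas as pd

df = pd.DataFrame(
    {"Age": [71.0, None, 26.0, 80.0], "Fare": [34.65, 56.5, 7.92, 30.0]},
    index=["p1", "p2", "p3", "p4"],
)

# Position-based access: first row, first column.
first_age = df.iloc[0, 0]
same_age = df.iat[0, 0]

# Boolean indexing: a comparison gives a True/False Series
# (a missing age compares as False), which then selects rows.
older = df["Age"] > 70
over_70 = df[older]

# Missing values: count them, drop them, or fill them.
n_missing = df["Age"].isna().sum()
complete = df.dropna()                          # returns a new data frame
filled = df.fillna({"Age": df["Age"].mean()})   # mean() skips missing values

print(over_70)
print(len(df), len(complete))
```

Neither dropna nor fillna changes df here; both return new data frames.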
Often you just need to drop the NaN values. This does not overwrite the data frame. It creates a new one and returns that. So we don't actually change this titanic data frame now. Yeah. But it's often useful to get a data frame that only contains defined values. Of course, you might want to first take a set of columns and then drop the rows where those columns are not defined, because this drops every row where any value in any column is NaN. Okay. So we are going a little bit fast here, but that's by design, to give you more time for exercises. Again, we're just summarizing the biggest high-level things. Okay.
The next section. So now there are exercises, but instead we're going to cover the next little section, well, not so little, and then have you do two exercises at once. So now there's tidy data. This is not purely pandas stuff; it's also about how you arrange the data itself. In tidy data, the idea is that every column is a variable, each variable has its own column, and each observation is a row. So for example, down here, we're making a sample data set about runners. And if Yarno creates that... Yes, let's go back to demonstrating. So I will again just copy this data in, but it contains three runners and some values for, I guess, run times. And now runners is a data frame. Okay. So distances, and times for those distances run.
So is this tidy data? No. There are multiple measurements per row here. It's measuring run times, possibly for different runs, or possibly at four different places in the same run. But in any case, four different measurements. So basically, yes, every row contains the results from four different races. So what's next? This melt function. Okay. So maybe just comment while I write it. We'll replace the runners data frame.
pandas.melt is the function, and we put the current runners data frame in. And then we have to define a set of identifier variables. These are kept in every row. This is something that identifies the runner or the experimental subject, something you want to keep. It is not a measurable. The name of the runner is not something you measure; it's something that identifies the measurement. And then, I guess, the value variables. Actually, these are numbers. These are the columns that contain the actual measured values. So these we want to split into separate rows, and the identifiers we do not.
I guess then we need a name for this new variable. We split this into multiple rows, so that creates a new column, and we want a name for that column. So what is the name for these numbers? I guess the distance run. And what is the name for the thing that we're measuring here? That's time. Hopefully... Oh, not quite right. It says these variables are not present in the data frame. 400, 800... Okay, maybe they are strings. Maybe. So I'll just copy from the lesson to see. Yeah. Let's see. Well, it did work now. Yeah. Okay, so it was some spelling error somewhere.
Okay, so what did it do? It took all of these numbers and made a row for each of them. And then we have identifying information: the runner name and the column title go here as identifying information for that measurement. Yeah. Okay. And this is now tidy data. The reason is that each row is only one observation, and an observation is the amount of time a race took. And this lets us do things like the group by we saw before. So basically, by using the other operations, we can do cool stuff on this. But we need to carry on now.
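A minimal sketch of the melt step described above. The runner names and times are illustrative values, and the distance columns are assumed to be integers rather than strings; the keyword names id_vars, value_vars, var_name and value_name are the actual pandas.melt parameters.

```python
import pandas as pd

# Wide (untidy) table: one row per runner, one column per distance.
runners = pd.DataFrame(
    [
        {"Runner": "Runner 1", 400: 64, 800: 128, 1200: 192, 1500: 240},
        {"Runner": "Runner 2", 400: 80, 800: 160, 1200: 240, 1500: 300},
        {"Runner": "Runner 3", 400: 96, 800: 192, 1200: 288, 1500: 360},
    ]
)

# Long (tidy) table: one row per (runner, distance) observation.
tidy = pd.melt(
    runners,
    id_vars="Runner",                    # identifies the measurement, kept in every row
    value_vars=[400, 800, 1200, 1500],   # columns holding measured values
    var_name="distance",                 # new column built from the old column titles
    value_name="time",                   # new column holding the measured values
)
print(tidy)

# With one observation per row, group-by works directly.
mean_time = tidy.groupby("distance")["time"].mean()
print(mean_time)
```

If the entries in value_vars do not match the column labels exactly (for example strings instead of integers), melt raises an error saying the variables are not present in the data frame.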
So again, there's a linked article on this that you should probably read. It has a lot more, and it really convinced me why this is a good idea. Okay.
Working with data frames. There's a lot of other stuff we can do with them, and I propose that you can all read this as well as we can say it right here. Is there anything to comment on? There's an example of making a data frame that has a date range index, where the index values are dates, and then ones with other things, and it shows how we can combine them, merge them, and so on. Yeah, I propose we go to the exercises now and leave the 20 minutes for them. So we'll come back at 52. And we can go into more details about things tomorrow and after we get back. Does that sound good? Yeah, that sounds good. Okay. Great. So let's go to the exercises until 52. And see you then. Okay. Bye.
Hello. We are back. So we have a little bit of wrap-up for the day. And we know this was a rather rough lesson. Like I tried to motivate when we started, pandas is the kind of thing where even we are always going through and reading about it to figure out how to do things. So this is sort of an impossible lesson to teach. We can either go so slow that we don't show anything interesting, or show some cool stuff that you need to go back and read about later to figure out how it works yourself. We tried to have a little mix in here, and, well, it didn't work that well. It's unfortunate, but we'll try to do better next time. It's sort of how it is. By the way, at the bottom of the notes here, you have a place where you can vote about what you thought of the lesson. So use this poll to say what you thought of it. And please give us comments, and we'll go look at those quickly.
So, any comments from the exercises, or on what we can do? Let's take a look. So this thing here: we see the problem with this, and the solution here. The difference is that these are in parentheses here.
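A small sketch of the kind of expression being pointed at, combining two conditions with an ampersand. The data frame and its values are invented; the exercise itself used the Titanic data.

```python
import pandas as pd

df = pd.DataFrame({"Age": [15.0, 35.0, 72.0, 50.0], "Survived": [1, 0, 1, 1]})

# With parentheses, each comparison is evaluated first, then combined with &.
mask = (df["Age"] > 30) & (df["Survived"] == 1)
selected = df[mask]
print(selected)

# Without parentheses, & binds more tightly than > and ==, so the expression
# is read as the chained comparison  Age > (30 & Survived) == 1.
# With this data that fails with a ValueError about an ambiguous truth value.
try:
    df[df["Age"] > 30 & df["Survived"] == 1]
    raised = False
except ValueError:
    raised = True
print(raised)
```

Wrapping each comparison in its own parentheses avoids the problem regardless of the column types.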
So basically, Python's order of operations is not what you'd expect here: it would try to do the ampersand first and then the comparison. And I know this has tripped me up many times in the past. So much, in fact, that nowadays, if I type an ampersand, I also type the parentheses around both sides just automatically, because otherwise it will usually fail. Okay. Yeah.
Okay. And yeah, there's this convention in Python that when you're slicing things, the first index is included and the second one is not included. And pandas follows the Python convention when you slice by position. You can read the debate there on which one is better or worse. Well, it's an interesting question, and let's not get into that. We can write by chat. Are there any other questions or comments here?
So Yarno, what's your overall summary of pandas? What should someone have gotten from this lesson? Well, one thing is just that it exists, and how the data frames work. So there is a Python library that's really good at this sort of table-like data, and you can do kind of magical things with it. Well, I guess that last part of the sentence was the second thing. You don't need to remember how to do all of these things; mostly you will do a web search for what you want to do and look through the documentation. But yeah, it can do similar things to what a spreadsheet can do, but a lot faster, and you can save it as a script. Yeah.
Maybe my summary would be: if you're making things that have a bunch of NumPy arrays, one for each effective column, or you have these deeply nested dictionaries of lists or lists of dictionaries or things like that, then maybe pandas, with all this extra structure like the indexes, the names, and the way you can slice different things, can do it better. And we'll see another example of this tomorrow. But as of now, we have this feedback of the day. So please, please comment here.
It's the only way we have to improve things. There will be videos produced for tomorrow, hopefully. If you would like to help with that, please let us know. And yeah, if you could do the stuff today, then tomorrow should be okay. It's more important to have JupyterLab or similar, because we do visualization and we need to show these graphics in the notebooks. And if there were any problems with the software today, you should make sure you install this for tomorrow. Yes.
Any positive feedback kind of things? So yeah, I mean, if you're completely new to Python, like you're still getting used to the Python syntax a little bit, today will have been really hard. But don't let that discourage you, because this is an intermediate kind of course. Hopefully you can stop, take a step back, watch and see what we're doing, and use this as inspiration when you're learning on your own later on.
Any other comments? A lot of this is an overview of stuff that will probably be useful at some point in the future. There is one comment that there is a lot of material and that can stress you out. But the idea is not that we cover, or that you learn, all of the material. It's a selection; you can go for what you're interested in. And yeah, not everything will be useful immediately. You will not learn everything immediately, but you can always come back. Yeah. Okay. So, well, it's time to stop. Let's go then. See you all tomorrow. Same time, or a little bit early for icebreakers and initial discussion. And thank you for attending. Bye.