Hello. I think you should be able to hear us now. Okay, hello. Is our audio balanced? Please write in HackMD. Yes, please let us know. So now we are at Pandas. Well, first actually, Thor, would you like to introduce yourself? Sure, I'm Thor, or Thor Wikfeldt. I live in Stockholm, Sweden. I work for the ENCCS center, which is a competence center focused on HPC and AI. We belong to RISE, the Research Institutes of Sweden. And I've been involved in CodeRefinery for several years. Okay. So, Pandas. Yeah. What is Pandas? Good question. I really enjoyed the previous episode, the previous part presented by Marian and Johan about NumPy. NumPy is good for numerical arrays, right? You can have one-dimensional, two-dimensional, three-dimensional arrays, with all kinds of really fast methods happening under the hood. But sometimes data is more mixed, right? It can be more heterogeneous. You can have different types of data in a dataset: strings, integers, floating point numbers, dictionaries and all kinds of things. And that's where Pandas comes in as a very useful package in Python. But fortunately, it can also be very fast, because it uses NumPy under the hood for some of the numerical stuff. We're not going to focus on performance now. We're just going to focus on how Pandas works and how we work with data frames. Yeah. So I think we can just jump right into the episode now. Do you want to share your screen? Yeah. It's shared here. So let's see. Where do we begin? Well, we'll be talking about Pandas, right? But there are also some tips in there on how you would in general approach learning a new package. So we're going to focus on Pandas, but show some pointers on how you go about learning something in Python from scratch. Okay. That's a good way to look at stuff, since I guess we do that often. Yeah. You see that list of links right there.
These are some external resources, which are really good to have a look at. Okay. Good to know. So should I go to Jupyter and start importing things? Yes, please do. Okay. I have my JupyterLab started here. I will split and arrange the windows somehow. Let's see. There is a simple interface and a presentation mode. I guess this is probably good. Should I make a new notebook? Yes. Go ahead. Okay. And let's remember to name it. How do I rename this? I usually just do Control-Shift-S. Okay. Or Command-Shift-S on the Mac. Right. Yeah. That would have been easier. Okay. So here I am. All right. So first things first, shall we add a title? Markdown. Okay. Yes. So I push Escape and M to switch to Markdown mode. Enter. And then the name of this lesson. Yeah. For good measure, I'll paste the URL in here, even though no one's going to be looking at this. Great. Okay. So first things first: all of you who are following this have probably installed pandas. You have it available in a Python environment on your computer. And to get access to it, we need to import it. You see it right there. The command is import pandas. You can type it out. And then normally, by convention, we type as pd, so that we have a short form. Just like NumPy is usually called np, pandas is pd. And that's just a shortcut or something. Right. Yeah. Like an alias for the package. Yeah. And then I'll ask you to copy-paste a little bit, because you don't have to type every single command out. And anyone who is typing along can copy-paste too. The idea now is to download a data set. It's right there. This is a publicly available data set. Yeah. Click the button. And this is just to show some cool functionality, how easy it is to do really complicated things. Well, maybe not easy, but at least you can do it in a very compact way with pandas.
So first we'll just show something sort of semi-advanced, and then we'll go back to basics to show more incrementally how you build data frames and how you work with data frames. Sounds good. Yeah. What do you think is happening here, Richard? Well, right here, this is clearly something on the web. I could probably open that in my browser and see it or download it. And it looks like here it's reading the CSV file straight from the web. Yeah. We are actually passing an additional argument to that function. So the function read_csv can be used for a file on your hard drive or directly from the web. But we have a second argument, index_col equal to "Name". I think we'll come back to what that actually means when we visualize the data frame. Yes. So you executed it. Let's have a look now at what's in this data frame. The first command to use is to write out titanic, which is the name of the data frame object that was just created. And then we call one of the methods available for pandas data frames, which is head, so dot head. Okay. So object, dot, method, and call it, and I'm going to run it. Yeah. Hmm. Okay. Looks like data. And I mentioned that pandas supports different types of objects in the data structure, which is different from NumPy, right? In NumPy you can only have one type per array. Yeah. Here it's like: these are numbers, this is a string, these are integers, these are floating point numbers, and so on. And it's a fun data set. Actually, it's the passenger list of the Titanic, which sank, what, 100 years ago. So we have names of people, and now we actually see the effect of this extra argument index_col. If you don't supply that argument to the read_csv function, you get row numbers: zero, one, two, three, four, five. Yeah. So it took one of those columns and said the name column becomes the index. Indeed. Which is the column. Yeah.
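As a small illustration of the index_col argument discussed above, here is a sketch that reads CSV text with and without it. To keep it runnable offline, it reads from an in-memory string instead of the web; the three rows and the names "Passenger A" to "Passenger C" are made up for the example and are not taken from the real passenger list.

```python
import io

import pandas as pd

# Made-up stand-in rows (NOT the real Titanic data), so the example runs offline.
csv_text = """PassengerId,Survived,Name,Age
1,0,Passenger A,22.0
2,1,Passenger B,38.0
3,1,Passenger C,26.0
"""

# Without index_col, the rows are labelled with numbers 0, 1, 2, ...
default_index = pd.read_csv(io.StringIO(csv_text))

# With index_col="Name", the Name column becomes the row labels (the index).
titanic = pd.read_csv(io.StringIO(csv_text), index_col="Name")

print(default_index.head())
print(titanic.head(2))  # head(n) shows only the first n rows
```

read_csv accepts a file path, a URL, or a file-like object, so the same call works for a file on disk or on the web.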
So how is the index different than other columns? How? Sorry. How is the index different from columns? Yes. How's it different than these other ones? Yeah. What to say about that? I mean, convention. I think we'll come back to this. Let's figure it out later. Yeah, sure. Okay. Let's try another pandas command, which is describe. This is also a method of the data frame. Yeah. And it's maybe self-descriptive. Let's see. What does describe do? Yeah. So I see passenger ID. For each of the columns, it shows a count, the mean value, the standard deviation, the minimum and the maximum. Yeah. So I basically get a pretty good picture of the things here. Like, the average age was 29, with a standard deviation of 14. Yeah. That's a quick statistical summary. And it's actually picking out only the numerical columns. Yeah. So we see the mean, the count, the standard deviation and so on. So this is good if you want to get a quick idea of what a data frame contains, statistically. Quite nice. Okay. So what's next? Next, let's look at two of the, well, I wouldn't say advanced. These are sort of standard methods, but they do a lot of things at the same time. And these two commands just show us the power of pandas, how many things you can accomplish with a single command. You can find them right there if you scroll a little bit down. Yeah. So should I do this first one here? Yeah. Maybe I'll type it out bit by bit, so you can explain it. Sure. Yeah. The object again, the data frame. Then let's use the groupby method. And then you open parentheses. What do we want to group by? We'll see in more detail later what grouping by means, but we want to take one of the columns, which is called Survived. So that's the survival status: zero for didn't survive, one for survived.
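The describe call mentioned above can be tried on any small data frame. The following is a minimal sketch with three invented rows (hypothetical passengers, not the real data), showing that the summary covers only the numerical columns.

```python
import pandas as pd

# Invented stand-in data; names and values are hypothetical.
titanic = pd.DataFrame(
    {
        "Age": [20.0, 30.0, 40.0],
        "Fare": [10.0, 20.0, 60.0],
        "Sex": ["male", "female", "female"],
    },
    index=pd.Index(["Passenger A", "Passenger B", "Passenger C"], name="Name"),
)

# describe() returns a new data frame with count, mean, std, min,
# quartiles and max; by default only numerical columns are summarized.
summary = titanic.describe()
print(summary)
```

The text column Sex does not appear in the summary, which matches the remark that describe picks out only the numerical columns.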
So that's what we're grouping by, and then we're picking out one of the columns, now not with regular brackets but with square brackets: "Age". And finally, we add dot mean to compute the mean. So can anyone guess what this will now do, before you run it? Oh, sorry. That's good. Yeah. So I see there is Survived, zero and one, and two different things that look like mean ages there. Exactly. So we grouped by. This means that pandas will look into the Survived column and make two groups in this case, because the value is either zero or one. If there had been five different values in that column, it would have created five different groups. But now it's two groups: zero is for didn't survive, one is for survived. And then it takes out the age column and computes the mean on it. So we see that the mean age of people who didn't survive is a little bit higher than that of those who survived. Yeah. So it's a lot of statistical analysis done on a single line. That's pretty cool. I guess this is all pretty easy once the data is in the data frame. So basically, by using this object, we get all of this kind of stuff for free or something. Yeah. Exactly. So that's why data frames exist in different languages. They do not only exist in Python. They exist in R and Julia too. That's why this is a standard workhorse of data scientists in the wild, you know, people in industry, people in academia. People work with data frames because they provide so much functionality. Yeah. Okay. Should we try the next one? Yeah. Okay. Maybe I'll copy and paste this one. Yeah. So you see that there's another one. Okay. I see two histograms here. So hist is for histogram, I guess. Yes. So life has gotten simpler since I, you know, started my studies, let's say. One needed to write long scripts to plot.
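The one-line group-by described above can be sketched on a tiny data frame. The five rows below are invented for illustration; only the column names Survived and Age come from the lesson.

```python
import pandas as pd

# Invented stand-in rows: 0 = did not survive, 1 = survived.
titanic = pd.DataFrame(
    {
        "Survived": [0, 0, 1, 1, 1],
        "Age": [40.0, 30.0, 20.0, 25.0, 30.0],
    }
)

# 1. groupby("Survived") makes one group per distinct value (here 0 and 1).
# 2. ["Age"] picks the Age column out of each group.
# 3. .mean() computes one mean per group.
mean_age = titanic.groupby("Survived")["Age"].mean()
print(mean_age)
```

The result is a Series indexed by the group labels, so mean_age.loc[0] is the mean age of the group that did not survive in this toy data.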
Now you can just have a data frame, you do data frame dot hist, and then you just get a histogram. Right. So there are two histograms here. Yeah. So again, we're splitting by something. We saw that there was a keyword argument to the hist function, by. So we're splitting by the same column. We're splitting by the survival. Yeah. So these people died and these people survived. Yeah. And we get the age histogram in both groups. For example, people less than 15 or so years old were more likely to survive. Yeah. Indeed. And I mentioned, right, that pandas uses NumPy under the hood for many mathematical operations. And what does it use for plotting, do you think? Well, I know it's Matplotlib, which we're going to talk about tomorrow to see a bit more. Yes. Yeah. We're talking about the standard Python packages, NumPy, pandas, and these are not completely independent packages. They are sort of based on each other. Yeah. Okay. What next? Let's scroll down a little bit. As I mentioned, there are tips and tricks for learning a new package. You can see there's a callout box there, the getting help one. I mean, how do you know there's a groupby method? How do you know which methods exist for a particular object? Yeah. Type a question mark. If you don't know, or if you forgot how something works, you can type a question mark. Try a different thing first. Try two question marks. Okay, two. It shows the source code. Yeah. It shows the actual source code as well. Okay. And this tab thing: if I do titanic dot and I push tab. Mm hmm. Yeah, do that. I think just tab twice. Yeah. So it shows everything that's on here. So I guess we can find groupby in there. Yeah, groupby. Mm hmm. Yeah, push enter. Yeah. This is one way to figure out functionality that you don't know, or if you forgot the name of a function. Yeah. Okay. So what's next? Let's dig down.
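The question mark and Tab tricks above are features of Jupyter and IPython. As a rough plain-Python counterpart (an editorial sketch, not part of the lesson), dir() lists the attributes that Tab completion would offer, and the docstring holds the text that the question mark displays. The two-row data frame is invented for the example.

```python
import pandas as pd

# Tiny invented data frame, just to have an object to inspect.
titanic = pd.DataFrame({"Survived": [0, 1], "Age": [40.0, 20.0]})

# Roughly what Tab completion shows: the public attributes and methods.
public_names = [name for name in dir(titanic) if not name.startswith("_")]

# Narrow the list down the way typing "titanic.group" + Tab would.
matches = [name for name in public_names if name.startswith("group")]
print(matches)

# Plain-Python counterpart of "titanic.groupby?": read the docstring.
doc = pd.DataFrame.groupby.__doc__
print(doc.strip().splitlines()[0])
```

In a regular Python session, help(pd.DataFrame.groupby) prints the full documentation.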
Let's talk. I mean, we've been talking about data frames. This might be a new concept for some people. So you can see a nice visualization there in the lesson. Okay. Yeah. What is in a data frame? We have rows and columns. Mm hmm. And we'll talk about some good practices on what you should have in rows and what you should have in columns a little bit later. And I guess it's pretty intuitive based on what we saw up above. Mm hmm. So columns have the same type, and rows are the same observation or something like that. So rows are typically observations, independent measurements, or in this case, individual passengers in the passenger list. And then the observables are in the columns. Are we going to talk more about this later? Well, if we have time, I think we can bring up the topic of clean data, or tidy data. Yeah. Okay. So one technical aspect here is that each column is a Series object. There are two fundamental data structures in the pandas package. One is the DataFrame and the other one is the Series. A series can be created standalone, for a series of observations, let's say, but series are also the components of the data frame. Each column is a Series object. Mm hmm. Yeah. And run one more command for me: titanic dot info. Okay. This sounds like something I do a lot. Hmm. Yeah. So titanic dot info: it's a data frame, with the index column. It shows passenger ID, survived, passenger class, I guess, sex, age, Parch, ticket, fare, cabin. So we see some are int64, which is the NumPy type. Yeah. Some are object, which I guess means it's a string. Hmm. So, yeah, this is just to show what's in the data frame. These are the different Series objects that it's composed of, and we can see the type of each. And the series, so the columns, need to be of a uniform type. Yeah. So either int64s or floats, or objects if it's, well, strings or so. Okay. So we've got info. Yeah.
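To make the point about columns being Series objects concrete, here is a short sketch on invented data. The exact type name reported for text columns depends on the pandas version (often object), so the example only prints it.

```python
import pandas as pd

# Invented stand-in rows; only the column names follow the lesson.
titanic = pd.DataFrame(
    {
        "Survived": [0, 1, 1],
        "Age": [22.0, 38.0, 26.0],
        "Sex": ["male", "female", "female"],
    }
)

# Picking out one column gives a Series with a single, uniform type.
age = titanic["Age"]
print(type(age).__name__)

# dtypes lists the type of every column; info() adds non-null counts.
print(titanic.dtypes)
titanic.info()

# A Series can also be created standalone, outside any data frame.
standalone = pd.Series([1.5, 2.5, 3.5], name="measurement")
```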
I think we should talk about indexing now. Okay. So I see that's down here, the index. Should I print the index and we see what's in there? Yeah. That's a good idea. Except it might not be what you think. I don't know. Let's see. Index. Well, these were the names from before. Yeah. So if we had not used this keyword to read_csv, we would have had numbers here, zero, one, two, three, but we chose to have the names as the index. Yeah. Okay. So I've usually seen in other things that the index might be a number or an ID or something like that. But okay. If we compare this now with NumPy: with NumPy, you can slice data, you can select an individual element of the NumPy array with numbers, zero comma zero, or one colon three for slicing an array. But it's more general in pandas, because it's sort of a little bit higher level. And there are some examples of what you can do. You can type titanic dot a column name. Let's say Age. Yeah. So this is the series. We're picking out only this column. This is one way to select a column. You can also put it into square brackets with a string. So you can have "Age" right there. It's the same thing. So we see it's the same. I don't think we have to spend a lot of time on it, but if you would just scroll a little bit down in the material, I just want to point out that there are these accessors. There's dot loc, dot iloc and dot at. Yeah. So every so often I read about these and remember them, but then I forget and have to read again. Yes. Why are there three different ones? Yeah. So iloc enables us to slice in the same way as we can do with NumPy, for example. The i is for integer or index, so something like integer locate. And it comes with square brackets. It's not a function. And you can take zero to two, three to six, if you want to slice out a particular portion of your data frame. So should I do this?
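The two ways of selecting a column and the position-based iloc slicing mentioned above can be sketched as follows. The four rows are invented and the passenger names are placeholders.

```python
import pandas as pd

# Invented stand-in data with the names as the index, as in the lesson.
titanic = pd.DataFrame(
    {
        "Survived": [0, 1, 1, 0],
        "Age": [22.0, 38.0, 26.0, 35.0],
        "Fare": [7.0, 71.0, 8.0, 53.0],
    },
    index=pd.Index(
        ["Passenger A", "Passenger B", "Passenger C", "Passenger D"],
        name="Name",
    ),
)

# Two equivalent ways to pick out one column as a Series.
age_attr = titanic.Age
age_item = titanic["Age"]

# iloc slices by integer position, like NumPy arrays or lists:
first_two_rows = titanic.iloc[0:2]   # rows 0 and 1, all columns
block = titanic.iloc[0:2, 1:3]       # rows 0-1, columns 1-2 (Age, Fare)

print(first_two_rows)
print(block)
```

The square-bracket form also works for column names that are not valid Python identifiers, which the attribute form cannot handle.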
Yeah, you can do one of them. Take the iloc. So if you recall from NumPy, well, the basic NumPy lesson, which we didn't do: this looks a lot like the NumPy indexing. Yeah. Or what we did with lists. And we see, yeah, it takes two of them out. Yeah. But I don't want to spend too much time. I just want to tell you that without the i, so just loc, you use the names of the rows and columns instead of their numbers. That's one thing. And another thing is that there are these at and iat. iat and at are actually a lot faster than loc. So if you have something in an inner loop, you know, in Python you have a for loop, it's much faster to use the at accessors, but you can only access individual elements. That's the difference. Okay. So one more on indexing. Very cool. Okay. So here. How do I do this? Should I try to do one of these things? So I see here titanic, titanic age greater than 70. Let's do this one at a time. So titanic age is the ages of everyone. Titanic age greater than 70: now I see false, false, false, false, false. Yeah. So it looks like this has turned into Booleans, which are true where the condition is true. Yeah. This is called a Boolean mask. It's a Series object now, with trues and falses, depending on whether the condition is met or not, whether the value is larger than 70. And if you pass this into the data frame, what happens? So now I'm using this as the selector. Yeah. So this is giving me some passengers. Are these all the passengers that were more than 70 years old? Yeah, exactly. Okay. Interestingly, only five of them. I guess people didn't live as long. Yeah. Okay. And I see what's next is, well, a bit more stuff, but then an exercise. There's an exercise. Let's go to the exercise. The last box there with code is just about missing numbers. There are methods in pandas to drop missing numbers, or not-a-numbers.
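The label-based accessors, the Boolean mask and the missing-value methods described above can be sketched together on a few invented rows. The ages and names are made up for the example, and one age is deliberately missing.

```python
import pandas as pd

# Invented stand-in rows; one Age value is missing (NaN) on purpose.
titanic = pd.DataFrame(
    {
        "Age": [22.0, 71.0, float("nan"), 80.0],
        "Survived": [0, 0, 1, 1],
    },
    index=pd.Index(
        ["Passenger A", "Passenger B", "Passenger C", "Passenger D"],
        name="Name",
    ),
)

# loc uses row and column names instead of positions.
age_b = titanic.loc["Passenger B", "Age"]

# at does the same for exactly one element (faster inside loops).
age_b_single = titanic.at["Passenger B", "Age"]

# A comparison gives a Boolean mask: one True/False per row.
# A missing age compares as False.
mask = titanic["Age"] > 70

# Passing the mask to the data frame keeps only the rows marked True.
over_70 = titanic[mask]

# Missing values: drop the affected rows, or fill them with a value.
dropped = titanic.dropna()
filled = titanic.fillna(0)

print(mask)
print(over_70)
```

Whether filling with zero is sensible depends on the analysis; for ages it would distort the mean, so dropping the rows is often the safer choice.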
You can fill in a value if there's a not-a-number in your data frame, because data are sometimes messy. There are missing data and so on. There are different methods that help you deal with it, you know, replace them with zero or something like that. Should we give, say, 15 minutes for the exercise? Sure. And then we have five minutes to wrap up at the end of the day. Mm hmm. Yeah. And then we continue tomorrow with pandas for about 30 minutes, it seems. We have until the hour sharp. Yeah, is that true? Correct. Okay. We do the exercise now for 15 minutes. Yeah. Okay, so I'm filling this into HackMD. Yes, so see you then. Okay. Bye for now. Hello, we're back. So if you need to leave soon, you'll notice in HackMD there is feedback of the day that you can answer, but let's go back to the lesson. So were there any important questions to answer right now? Or should we go on? I think they've all been answered there. So should we go straight to the screen? Yeah, sure. We hope you had fun with the exercise. There are two more exercises that will be done tomorrow. And there are some comments that this has been a bit of a quick introduction to pandas. So the real introduction to how data frames work starts tomorrow. This was sort of the big technical summary. So don't worry. We have it for tomorrow. Yeah. Okay. So let's wrap up with a discussion of what tidy data means. Yeah. Okay. This is actually a pretty good theoretical summary, something for you to think about for tomorrow, and then we'll be good to go. So what is tidy data? I think the term comes from a paper that was published some years ago. Yeah. It's about how you structure the data, and there are different ways to structure data in a data frame. And you might be doing it wrong. Maybe that's the bottom line. I have certainly done it wrong. It's not wrong per se, but having it tidy, which is also called the long format, makes various types of analysis easier.
That is, compared to if you have the wide form, or the untidy form. So I think the basic idea is that any time you get some data, the first thing you would do is arrange it into a useful form. And then all the other querying and analysis becomes really easy with the data frames, or observations, or whatever it may be. And I think seeing an example is easier than just talking about it. Yeah. So basically, each variable should be saved in a separate column, and each observation, like each sample, is a row. And I guess what an observation is depends on what you're doing. But the basic point is you shouldn't have multiple observations of the same type in the same row. So all the observations of one kind are in one column, like this. Should I take a look and open this up or something? Yeah, try taking that example with the runners. So this is just creating a new data frame from scratch. We're not pulling it from the internet. We're creating it. And specifically, we're using dictionaries here. We'll talk about that tomorrow. Yeah, so what's wrong with this? As I said, it's not wrong. I mean, this works. But what's suboptimal? So, okay, I guess these are the runner identities, which are observations. But then this is the distance they ran, and these are the times, maybe. So you have several observations of the same type in the same row, when all times should be in one column. Yeah. And how could one do it differently? So I guess you would have runner, race distance, and time. Okay, so the observation is how fast they were in a race. And also part of that observation is how long the race was. So it goes like Runner 1, 400 meters, 64 seconds; Runner 1, 800 meters, 128 seconds; and so on. Yeah. Fortunately, there's a very easy, very straightforward way to do this. I mean, you don't have to do this manually by writing your own function. This is all built into pandas. So I've copied and pasted this here.
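For reference, a wide-form runners table of the kind discussed above might be built as in the sketch below. Only the 64 and 128 second times for the first runner are mentioned in the session; the second runner's times and the restriction to two distances are illustrative assumptions.

```python
import pandas as pd

# Wide (untidy) form: one row per runner, one column per race distance.
# Runner 2's times are invented for the example.
runners = pd.DataFrame(
    [
        {"Runner": "Runner 1", 400: 64, 800: 128},
        {"Runner": "Runner 2", 400: 80, 800: 160},
    ]
)

print(runners)
# The distances are stored as column names, and each row holds several
# time observations, which is what makes this layout untidy.
```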
So it's the melt function. Yeah. Melt. All right, the distance, the time. Well, let's see what happens. You have to print it too. Okay. Yeah. So here we see runner, distance, time. So this is the long form, the tidy form. Just like you said, distance is an observation too, and it should be in its own column. And maybe one has to explore this a little bit before it starts to make sense why this is better. But now you can much more easily do different types of analysis where you want to plot things or compute whatever you can think of from this data set. You can do it more easily if you actually have the distance not as a column name, but as an observation in its own column. And if you do want something like the average time for 400 meters, you can do the groupby or something like that, or the filtering. Yeah. So yeah. And now we're out of time for the day. There's this link here. I think this is the main article. So this is only the shortest introduction to tidy data, but it's actually a pretty good stopping point. So if you read this article for tomorrow, or at least browse it, then you'll be in a pretty good position. We're not actually covering any more about the contents of tidy data and so on. But this article has changed the way I think of data. Yes. So tomorrow goes really back to basics as well. How do we construct data frames from scratch? Pandas is also very powerful for working with time series and anything that can be interpreted as a time. And we'll see how we combine data frames and how we split data frames. For groupby, we go into more detail on how that works. And as I said, two more exercises as well. And a new data set, which is quite fun too: the Nobel Prize data set. Yes. So now we are done for the day. You will get emails and announcements about what you need to know for tomorrow, but we'll stay here chatting and looking at the questions and feedback for a little bit. So don't feel obligated to stay. Okay.
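The reshaping with melt described above can be sketched as follows. The wide table is rebuilt here so the example is self-contained; Runner 2's times are invented, and the column names distance and time are the ones read out in the session.

```python
import pandas as pd

# Wide form (Runner 2's times are invented for the example).
runners = pd.DataFrame(
    [
        {"Runner": "Runner 1", 400: 64, 800: 128},
        {"Runner": "Runner 2", 400: 80, 800: 160},
    ]
)

# melt turns the distance columns into rows:
#   id_vars    = columns to keep as identifiers
#   value_vars = columns to unpivot
#   var_name   = name for the new column holding the old column names
#   value_name = name for the new column holding the values
tidy = pd.melt(
    runners,
    id_vars="Runner",
    value_vars=[400, 800],
    var_name="distance",
    value_name="time",
)
print(tidy)

# With distance in its own column, per-distance statistics are one line.
mean_time = tidy.groupby("distance")["time"].mean()
print(mean_time)
```

Each row of tidy is now one observation: one runner, one distance, one time.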
So whoever is in HackMD and has selected everything, can you please unselect it, because that highlights it for everyone. So here. Feedback: people seem to think we were either too fast or just right. And overall it was good. Please fill out this feedback. It's the only way we have to see how it is going, and we really do take what we learn and use it later. So there are a few good things and some things to be improved. Yeah, we should have recalled some basics from the NumPy lesson, or reminded you in the emails to please review it briefly. But I think that the basic NumPy lesson is pretty good. So even if you thought, oh, I was missing something here, now you're in a really good position to go and read the basic lesson and then the advanced one. And I think that it will still have helped you overall. What else? Yeah. So basically a lot about the preparations. We'll get to more advanced pandas tomorrow. Don't worry. Many questions are coming in. I guess we'll look into all the questions that are coming into the shared document. So the credits link is on here. Some basics. Yeah, so I mean, this really isn't a basic Python course. It assumes you know the basic syntax. There is a summary of Python there, but there are so many basic Python lessons and we just don't have time to do it again. Extensions in the notebooks. Yeah. Okay. Let's see. So hopefully we will see all of you tomorrow. Videos will be ready. Well, Twitch will have the videos immediately, and they will be on YouTube by later this evening, assuming that I have time. And see you tomorrow. Anything else? I think that was good. Yeah, I hope you'll find it worthwhile to drop by tomorrow and the days after too. Yeah, we have a lot more good stuff for you. Oh, yes. So thank you and see you later. Thank you. Bye.