Hello, we are back. For the rest of the day we have two lessons. The first is on data formats, which we're not going to cover in much detail: we'll give a really high-level summary that addresses some of the questions people have been asking yesterday. Then we go on to productivity tools. We did teach this lesson in full last year, and you can find it in last year's course playlist, linked from the main page. So, Simo, where do we start?

Yeah, so maybe I'll introduce myself. My name is Simo Tuomisto, I'm from Aalto Scientific Computing, and I've been working with Python for, well, 15 years or something at this point, so a long time. In the HackMD there were lots of good questions about NumPy and pandas, like what the columns in pandas are, and there were also questions about the tidy format, which is popular in R's tidyverse. We'll quickly explain these concepts, but like Richard said, we had a longer dive on this last year, so if you want to see that you should check the video. Let's teach the main concepts.

So, when we're talking about data frames, I came up with this analogy of a hardware store. If you go to a hardware store, you have hammers in one aisle, nails in one aisle, nuts and bolts in one aisle, and wrenches in one aisle. Each aisle holds one kind of thing, and another aisle, say the lawnmowers, can hold a completely different kind of thing. This is basically what a data frame is. A data frame is organized in columns, and each column has one type of thing: it has integers, or it has timestamps, or it has strings. So in the code example there, you don't have to write it.
We have an example data frame that has strings, timestamps, integers, and floating point numbers, and all of these can of course be in some sort of correspondence. So you can have temperature and pressure or something. Or, going back to the hardware store analogy: you go down one aisle and find a certain kind of nut, and you know that you need a corresponding wrench to tighten it. Then you go to the wrench aisle and look in the matching place. In pandas, you usually have the corresponding things in the same place, in the same row. So the nuts that correspond to a certain wrench would be in the same row, and you can easily find the things you're looking for.

And this is basically what a data frame is: multiple columns collected together, and collected into this tidy data format. Below, in Richard's share, we have a view of the tidy data format. Each column is a variable: temperature, pressure, time, it can be whatever. And each row is an observation: at a certain time, the pressure and temperature were this. The idea is that if you keep this format, it's easy to write tools that work with it. So you can easily calculate an average. You don't usually want to calculate an average across time, pressure, and temperature together, because that doesn't make any sense. But you do want to calculate an average of one column, let's say the temperature over time. So some operations are written for columns and some operations are written for rows, and because everybody keeps the same format, it's very easy to manage these tools. And this is why pandas, and similar things like the tidyverse in R, are popular.
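The description above can be made concrete with a small sketch. The station names, readings, and column names below are made up for illustration and are not the data frame shown in the lesson material; the point is only that each column holds one type, each row is one observation, and a column-wise operation such as a mean is a one-liner.

```python
import pandas as pd

# Hypothetical measurements: one row per observation, one column per variable.
df = pd.DataFrame(
    {
        "station": ["A", "A", "B", "B"],            # strings
        "time": pd.to_datetime(                     # timestamps
            [
                "2022-11-01 12:00",
                "2022-11-01 13:00",
                "2022-11-01 12:00",
                "2022-11-01 13:00",
            ]
        ),
        "reading_id": [1, 2, 3, 4],                 # integers
        "temperature": [4.0, 6.0, 1.0, 3.0],        # floating point numbers
        "pressure": [1012.0, 1010.0, 1008.0, 1009.0],
    }
)

# Each column reports its own data type.
print(df.dtypes)

# A column-wise operation: the average of one variable.
mean_temperature = df["temperature"].mean()

# A row is one observation: all variables measured together.
first_observation = df.iloc[0]

# Because the data is tidy, grouping by one column and averaging another is short.
mean_by_station = df.groupby("station")["temperature"].mean()
print(mean_by_station)
```

Averaging across `time`, `pressure`, and `temperature` within a single row would mix unrelated quantities, which is why the sketch only averages within a column.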
And this is important, even though you might think: I just have a table, so what does it matter which way it's organized? It's organized in a certain way because people expect it to be that way, and all of the tools have been written that way, so you should just do it like other people do. This is basically how data frames are organized.

NumPy arrays are a bit different. The columns of a data frame are usually NumPy arrays, and NumPy arrays can be multidimensional, and they always have one data type. So you might have a one-dimensional array, like a column, or a two-dimensional array, like a matrix. Or, let's say, a temperature in the x and y directions, or a three-dimensional array of pressure at different altitudes and different places in the world. So in NumPy, you have this one big block of the same kind of data.

And you would do the same kind of operations across every row, column, and rank, I guess?

Yes. What you would usually do is, let's say, take a matrix or an array and multiply it by some constant, and that applies to all of the elements. Or you calculate the sum of certain rows or certain columns, or something like that. So you have these two different formats, and they are fundamentally different, even though in some sense they look the same. It's maybe hard to explain, but the main thing is that for all of these different things... Sorry. Well, go ahead. Yeah.
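As a rough sketch of the array operations just described, with made-up values and array shapes chosen only for illustration:

```python
import numpy as np

# Hypothetical temperatures on a 2 x 3 grid (rows = y, columns = x).
temperature = np.array([[1.0, 2.0, 3.0],
                        [4.0, 5.0, 6.0]])

# The whole block has one data type.
print(temperature.dtype)

# Multiplying by a constant applies to every element at once.
scaled = temperature * 2.0

# Sums along one axis: axis=1 sums each row, axis=0 sums each column.
row_sums = temperature.sum(axis=1)
column_sums = temperature.sum(axis=0)

# A three-dimensional block: hypothetical pressure at 4 altitudes on the same grid.
pressure = np.arange(24.0).reshape(4, 2, 3)

# Averaging over the altitude axis leaves one value per grid point.
mean_over_altitude = pressure.mean(axis=0)
print(mean_over_altitude.shape)
```

The `axis` argument is what selects "certain rows or certain columns"; the same call works unchanged on arrays with more dimensions.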
But for both of these things, there are tools that are designed around these formats. For example, you don't do matrix multiplications in pandas; you do matrix multiplications with NumPy arrays, because for tables it doesn't make any sense to do matrix multiplication, but for arrays it does. And then there are ways of storing this data that are designed for these things. For pandas, there are many file formats that are designed for a certain kind of data, and for NumPy as well.

And usually the situation goes like this: you might have seen the XKCD comic where there are competing standards and then somebody says, okay, let's just write a new standard that solves all of the problems, let's do something that does what both of these do. And then you have one more. This is how it always goes, so you have a huge number of competing standards.

There were people asking about R. If you want to move data from R to Python, you can use CSV, for example, or you can use Parquet or Feather; there are mentions of these different formats in the article. If you want to use MATLAB, you can use .mat files, which in their newer versions are HDF5 files. If you want to use Python and, let's say, Fortran code, you might need to use HDF5, or use the Fortran file support in SciPy. There are like a million file formats.

And yeah, we don't want to give too long a talk about it, so let's just say that they exist. If you want to see more about these formats, there's a huge list here, and more will be added as well. Basically, choose the file formats that your tools use and that suit your data. Yeah.
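A minimal sketch of letting the libraries handle the file formats, assuming a small made-up table and array; the file names are hypothetical and everything is written to a temporary directory. Only CSV and NumPy's own `.npy` format are shown, since they need no extra packages; pandas also provides `DataFrame.to_parquet` and `pd.read_parquet`, which rely on an additional library such as pyarrow being installed.

```python
import tempfile
from pathlib import Path

import numpy as np
import pandas as pd

# Hypothetical tabular data (suits CSV or Parquet) and array data (suits .npy or HDF5).
df = pd.DataFrame(
    {
        "time": pd.to_datetime(["2022-11-01 12:00", "2022-11-01 13:00"]),
        "temperature": [4.0, 6.0],
    }
)
array = np.arange(6.0).reshape(2, 3)

with tempfile.TemporaryDirectory() as tmp:
    # Tabular data: write CSV with pandas and read it back with pandas.
    csv_path = Path(tmp) / "measurements.csv"
    df.to_csv(csv_path, index=False)
    df_back = pd.read_csv(csv_path, parse_dates=["time"])

    # Array data: NumPy's binary .npy format keeps the shape and data type.
    npy_path = Path(tmp) / "grid.npy"
    np.save(npy_path, array)
    array_back = np.load(npy_path)

print(df_back.dtypes)
print(array_back.shape, array_back.dtype)
```

Note that CSV is plain text, so the timestamp column only comes back as timestamps because `parse_dates` asks for it, while the `.npy` file restores the array's shape and type without any hints.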
And I guess you could say: talk to people. Our main message here isn't "use this" or "don't use that", but do take a little bit of time to think about it before you go too deep into your work.

Yeah, basically. And I would say, when you do that thinking, that pandas and NumPy already have good interfaces for all of these different data formats. It's mentioned on this page, but check the documentation of pandas and NumPy, and don't write your own data reader, because somebody has already written it. Somebody has written a CSV reader; you don't need to open a CSV file in Python yourself. You can just use NumPy or pandas to read it, depending on what sort of data you have in the CSV.

There's also a question asking why Excel isn't considered good and human-readable. Well, Excel is a binary format; have you opened it with an editor? But it's a good point: many of these formats are complicated, and what "human readable" means is a complicated thing.

Okay, so maybe we can keep answering these questions in text and go on. And if you want more, watch the video from last year; maybe someone could link it here.

Yeah. So ask questions in the document and we'll try to answer as many of them as we can. That's a really good question. We'll happily answer them. So what's next, it's