All right, welcome to this intermediate Python course on data representation in Python. This is a small re-recording of the first part of the course, because we had a mess-up when we actually did it the first time. To access the course material, you can go to our GitHub repository, whose URL you see here and whose page you see there. The first step is to get all of the content, which contains both the notebooks and the solutions to the exercises and everything associated, so it's important that you get that. For that, either clone this repository if you know how, or just click on this green button there and download the zip file. Once this is done and you have all of the content, unzipped of course, on your computer, load up a Jupyter notebook instance, or JupyterLab if you prefer, navigate to where you have put the material, and at some point you should end up with something like this. Then if you go there and start up the first notebook, we are able to start with the content of the course, which is to learn how to load up tabular data and start analyzing it and looking at it with Python. So this is the menu for this very first lesson. First off, we are mostly going to talk about the pandas library. Pandas is a library that is built on top of NumPy arrays and implements the idea of a data frame, which is heavily inspired by the R data frame: it's basically tabular data, organized into rows and columns. The columns have column names, of course, and the rows may have names too; often by default they will simply be numbered, but sometimes they carry more meaningful labels, which we refer to as the index. And so a data frame is composed of several columns, and if we delve a little bit into the structure of a data frame, each column is actually its own little custom type, which is what pandas calls a Series.
A Series also has an index and a given type; we'll come back to that in more detail later on during the course. But first off, the first thing we need to do is to load our modules there, to make sure that we have all of the required libraries. Okay, so this worked. Then we need to be able to read tabular data into Python from a file, typically a TSV or CSV file, as a pandas data frame. So let's look at it together. Our base function is called read_table. To it you basically give the path of the file that you want to read, plus a bunch of parameters: typically the sep parameter, which determines the separator between fields (in a TSV it's tabs; in a CSV it's commas or semicolons, or it might be something else); whether or not there is a header, and where it can be found (by default it will expect a header); and you can also ask it to skip some rows. As we will see, there are many, many other options that one can use. So let's try and load some data. We import pandas as pd; pd is the very classical shortcut alias for pandas. df will be the name of the variable that we want to create from this file, which is a CSV file that contains data about the Titanic passengers. Then we look at it, and head shows us the first few rows, which is very useful to check that the reading went well. Now, there is something there: you see it doesn't look super nice. It looks like it has read everything into a single big column containing all of the data. So there follows our first micro-exercise, where you can try and fix the cell above so that the reading goes more smoothly and gives us a better result. And if you have the time, you can look at the help to see that there is a function that maybe helps you, one with different default values from read_table that makes your life a bit easier.
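To make the separator issue concrete, here is a minimal sketch using an in-memory CSV via io.StringIO; the Titanic-style rows are a made-up stand-in for the course file, not the actual data:

```python
import io

import pandas as pd

# A tiny comma-separated sample, a made-up stand-in for the Titanic CSV.
csv_text = "name,survived,age\nBraund,0,22\nCumings,1,38\n"

# read_table defaults to sep="\t", so a comma-separated file ends up
# crammed into a single column.
messy = pd.read_table(io.StringIO(csv_text))
print(messy.shape)   # (2, 1): everything in one big column

# Passing sep="," splits the fields correctly into three columns.
clean = pd.read_table(io.StringIO(csv_text), sep=",")
print(clean.shape)   # (2, 3)
```

The same calls work with an actual file path in place of the StringIO buffer.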
In the actual course we give people more time to solve these, but in the video version, of course, we cut directly to the solution. So just put the video on pause if you want to work on it; otherwise I'll correct it right now. To solve this one, you can gather that the problem is that the separator in the CSV file is a comma, whereas the default for pd.read_table is a tab. If you are unsure, you always want to check this with the help function. So, help on read_table: if you look at what's in there, the separator is set by default to a tab. So if you then just go back there, set the separator to a comma, and read the file again, you'll get something much, much nicer, much closer to what you would actually want to have. Now, there is an alternative: pandas also implements a family of read_ plus format name functions. And because CSV is such a ubiquitous format, there already exists a read_csv whose default separator is a comma. So you can get the same sort of result using pd.read_csv directly. It's always useful to know both functions and both possibilities when it comes to usage. Okay, so that's it for the correction of this little micro-exercise. The next part is just a little commentary on that, and also a reloading: if you've not done it, be sure to run that cell to reload the rest of our data sets, because we will use them for the rest of the notebook, going into more and more detail on the different parts of these data sets. All right, so that's it for my little part, and I'll now give the floor back to Robin with the original video from the original course for part one. Okay, so now another aspect that we might have to deal with when loading data from disk is whether we have a header or not in our data set.
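As a quick check that the two functions agree, here is a sketch, again on an in-memory sample rather than the course's Titanic CSV:

```python
import io

import pandas as pd

csv_text = "name,survived,age\nBraund,0,22\nCumings,1,38\n"

# read_table with an explicit comma separator...
via_table = pd.read_table(io.StringIO(csv_text), sep=",")

# ...and read_csv, whose default separator is already a comma.
via_csv = pd.read_csv(io.StringIO(csv_text))

# Both calls produce the same data frame.
print(via_table.equals(via_csv))  # True
```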
In the example we had so far, if we look at the actual file on disk, data/titanic.csv, we see that the first line of the file actually contains the column names. So what happened is that pandas automatically detected that there were column names and used the values of the first line of the file as column names. But now let's see what happens if I try to load a file that has no header. This titanic_no_header file actually doesn't contain this first line with column names. You see that what happens here is that pandas has used the first line of the file as column names, which here is not appropriate, because this is not supposed to be the names of the columns; it's simply the first row of the table. So if this is the case and I want to avoid it, I need to explicitly indicate header=None. If I add this header=None argument, pandas will be aware that there are no header values in our file and that there should not be any column names. All right, so now the import is done properly. And of course, since I didn't give any column names, what pandas does is simply use numeric values by default, so the columns are named after their position. As we will see, and as you should be used to in Python, the indexing or numbering always starts at zero: the first row has position zero and the first column has position zero. For people who come from R, this can be a bit unexpected, because in R the numbering actually starts at one. Now, let's assume that our data does not contain any header but that we actually want to give column names to our data frame. There, what we can do is pass the optional names argument to the read_table function and specify the names of the columns.
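The header behaviour just described can be sketched like this, with a hypothetical headerless sample standing in for the titanic_no_header file:

```python
import io

import pandas as pd

# Two data rows, no header line.
raw = "Braund,0,22\nCumings,1,38\n"

# Without header=None, pandas promotes the first data row to column names.
wrong = pd.read_csv(io.StringIO(raw))
print(list(wrong.columns))   # ['Braund', '0', '22']

# header=None keeps all rows as data; columns get positional names 0, 1, 2.
right = pd.read_csv(io.StringIO(raw), header=None)
print(list(right.columns))   # [0, 1, 2]

# names= supplies proper column names (passing names already implies
# header=None, but stating it explicitly does no harm).
named = pd.read_csv(io.StringIO(raw), header=None,
                    names=["name", "survived", "age"])
print(list(named.columns))   # ['name', 'survived', 'age']
```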
You see that I simply pass a sequence of strings: here I give a list, but I could also give a tuple, for instance. And now I have imported my data, and the values I passed in my list have been used as column names. The only requirement, obviously, is that the number of strings you pass to the names argument matches the number of columns. So this is how I can set column names. Now, it's also possible to set row names, and this is what we see right now. We see that by default, if there are no row names in our file, then the rows simply get a label, or name, that corresponds to their position, starting with zero. Just a vocabulary point: in pandas, the row names are called the index of the data frame. So whenever we refer to the index of a data frame, it means the row names. And we can access or set these row names using the .index attribute. Right, so let's see what happens if I load a file that contains row names. Here I just print the start of the file as it is in the original text file. We see that we have column names, and then the different rows of data. Now, what is different from before is that you see I don't have a name column anymore. If you actually count the number of elements in the first line, you will see that you have one less element than in all the other lines. So let's see what happens now when I load this file. You see that what pandas automatically did here is use the first value of each line, the one not accounted for by the header, as the name of the row, so as the index value. So now the index values are no longer the default numeric values zero, one, two, three and so on; the first value of each line in our file has been used as the row name, so as the index value.
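This auto-detection can be sketched as follows; the header line below deliberately has one field fewer than the data rows (made-up data, not the course file):

```python
import io

import pandas as pd

# Header has two fields, data rows have three: pandas infers that the
# first value of each row is a row name (index value).
raw = "survived,age\nBraund,0,22\nCumings,1,38\n"

df = pd.read_csv(io.StringIO(raw))
print(list(df.index))    # ['Braund', 'Cumings']
print(list(df.columns))  # ['survived', 'age']
```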
And so the reason why pandas did this is that it detected that the first row of the file has one less value than all the other rows in the file. When this is the case, pandas will automatically assume that the first element of each line contains the index value, so the name of the row. As I mentioned, the .index attribute of the data frame contains the index values. So anytime you want to access the row names, you can query this .index attribute of the data frame; for instance, here I display the first five items. Maybe just a note here to say that the type of the object returned by .index is not directly a list or tuple; it's actually an Index object. Most of the time you can use it directly: it's an iterable object, so you can use it in a loop just as if it were a list. But at times, if you really need to have the content of the index as a list, then you have to explicitly convert it, which you can do simply by using the list constructor, like that. Here you see I have the first five elements of the index as a list. Now, there are times when the file you're loading does not have this exact structure where the first row has one less value than the other rows. For instance, this data file is the same as we had in the beginning, so it has the same number of column names as elements in the subsequent rows. But for some reason, I want to use the name column as index values. Then I can always manually indicate that the index column should be the first column of the data I'm loading. I do this with the index_col argument. I can either pass a position, or, and this gives the same result, I can say that my index corresponds to the column called name of the data frame.
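A minimal sketch of index_col, again on a small in-memory sample rather than the course file:

```python
import io

import pandas as pd

raw = "name,survived,age\nBraund,0,22\nCumings,1,38\n"

# index_col accepts either a position or a column name; both are equivalent.
by_pos = pd.read_csv(io.StringIO(raw), index_col=0)
by_name = pd.read_csv(io.StringIO(raw), index_col="name")
print(by_pos.equals(by_name))       # True

# .index holds the row labels as an Index object; list() converts it.
print(type(by_pos.index).__name__)  # Index
print(list(by_pos.index))           # ['Braund', 'Cumings']
```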
So if I do this, instead of being a regular column, the name column is now the index of the data frame. It's not a normal column anymore; it's the index. If I want to check, to make sure, I can print the list of columns. You see it starts with sex, age, and so on, and name is no longer a normal column. You can actually see in the display that the name label sits a bit lower than the other ones, which tells you that this is an index. Okay, so these are the basics of loading a data frame with pandas. Of course, there are a lot of other options for read_table; we will not look at all of them, but we just list a couple here that can be useful. For instance, you can pass the na_values argument to indicate to pandas which values in your data set should be considered as NA values, the sort of empty values of your table. If you want to automatically convert certain values to the Booleans True or False, you can do this with the true_values and false_values arguments. If your data is compressed, there is also an argument to pass to pandas so that it automatically decompresses your data before loading it, and so on. So, just a quick example here. I have this very small data frame that I call the ugly data frame, because the data in it is quite messy. You see it has two columns. For NA values, the data frame uses the strings "missing" and "not available", and also "NAN"; all of these should become NA values. And I would also like the column that should contain only Boolean values to be converted: the numeric value zero should become False, and any other value, so one and two, should become True. I can fix this by passing the following arguments: for na_values, I give a list of all the strings that should be converted to NA, so for instance "missing" and "not available" should become NA; for true_values, one and two should become True; and for false_values, zero should become False.
You see that when I pass these arguments, my data frame is much cleaner: I have only True and False values, and all the missing values are proper NA values in the table. As I quickly mentioned at the start, there are many other functions to read specific tabulated file formats; we just list a couple here. You can directly read Excel files; JSON files, which is actually not a tabulated format but still a structured type of data; SQL, and so on. You can click here to see the exhaustive list: there are quite a few specialized functions to load all sorts of different types of files. So if you're using a fairly common type of file, it's quite likely that pandas already has a function to read it; please check the documentation. All right, here we have another small micro-exercise. You are asked to read this file here and to make sure that the data frame is nicely imported, so to find the right type of separator and also to test whether there is a header or not. So I give you again a few minutes to try to load this data set. But obviously I gave the wrong separator; it's not tab-delimited, and you see that I only have a single column with all the values. The separator actually seems to be simply whitespace, so I can now set whitespace as the separator, and now it looks a lot better: I have my different columns and rows loaded properly. Now, maybe one more thing I could do, and this depends again on the type of analysis I want to do with the data: you see that the first column contains gene names, and then each column is actually the count of how many times I see a given sequence in the gene. So maybe for this particular analysis, I would like the index of my data frame to be equal to the gene names. And I could do this.
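The whitespace-separator fix can be sketched like this, with hypothetical gene counts standing in for the actual exercise file:

```python
import io

import pandas as pd

# Space-separated counts, a made-up stand-in for the exercise file.
raw = "gene s1 s2\ngeneA 10 3\ngeneB 0 7\n"

# sep=r"\s+" means "any run of whitespace", which also covers multiple
# spaces or mixed spaces and tabs.
df = pd.read_table(io.StringIO(raw), sep=r"\s+")
print(df.shape)          # (2, 3)
print(list(df.columns))  # ['gene', 's1', 's2']
```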
If I want to do this, I need to manually specify it, and I use the index_col argument. Here I have two options: either I can say it's the first column, so zero, or I can pass the name of the column, which is gene. Both give me the same result, and now you see that the actual index of the data frame corresponds to the gene names. Maybe one more thing: some people might have tried to load the file without the compression argument, and you see that if I do this, it actually also works. So in this case, the compression argument appears to be sort of optional. What happens here is that because ZIP is such a ubiquitous compression format, pandas is actually able to auto-detect when files are compressed with ZIP. So for a ZIP file, it's not absolutely necessary to specify the compression; it's auto-detected. All right, are there any questions? Let me just check. No, it seems not. All right, so hopefully by now, how we can load files is clear for everyone.
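The compression behaviour can be sketched by building a small zipped CSV in memory (made-up data). One caveat: the auto-detection relies on the ".zip" file extension of a real path, so with an in-memory buffer we state compression="zip" explicitly.

```python
import io
import zipfile

import pandas as pd

# Build a small zipped CSV in memory, mimicking a .zip data file on disk.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("counts.csv", "gene,count\ngeneA,10\ngeneB,0\n")
buf.seek(0)

# With a real path ending in ".zip", compression could be left at its
# default ("infer"); with a buffer we pass it explicitly.
df = pd.read_csv(buf, compression="zip", index_col="gene")
print(df.shape)        # (2, 1)
print(list(df.index))  # ['geneA', 'geneB']
```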