Now we come to a very exciting lecture and we're going to see how to import data into Python so that we can work with it. We're going to work with a very famous and successful package called Pandas. Now Pandas is really one of the big success stories of Python and one of the main reasons why it's become so popular as a language for data science. So in notebook number four here, importing and manipulating data, one of the most important tasks that we have in data science. So I've already mentioned this package called Pandas and it really is for any kind of tabular data. So think of a spreadsheet. You've got a spreadsheet file and we're going to import that spreadsheet. Even if you store your data in a database such as REDCap or any kind of SQL database, you can export that as a flat file, in other words, rows and columns of data as a spreadsheet, and we can import that. Of course with Python, you can also import directly from a database, but what we're going to do here is stick with the basics. So we're going to import data from just a spreadsheet file. So we're going to start off with the packages for this notebook and you can see there we're going to import Pandas with the namespace abbreviation pd. And as I said, you can use anything you want other than pd, but it's just the industry standard, everyone just does pd, and then for numerical Python, NumPy, we use the namespace abbreviation np. So that imports all the functionality of these two massive and very important Python packages into this environment of ours that we're working with at the moment so that we can make use of it. Now specific to Google Colab, if you have your data files stored in your Google Drive, you'll have to import this function from google.colab. So google itself is a package, it has a module called colab, so we say from google.colab we're going to import the drive function, specifically because what we want to do is to maintain security.
We only want to connect to our own Google Drive and we want to keep that secure. So this notebook as it stands is not 100% secure. So we have to log in, re-log in, and give this notebook instance, this notebook that's running now, permission to enter our Google Drive and read some of the files. So it keeps things very nice and secure. So we've got to import that function as well. And also specific to Google Colab is this data table, so google.colab.data_table, and we use this magic command. So anything in Python that you see starting with this percent sign is a magic command, and this is just one of them, %load_ext, and all that's going to do is reproduce a table very nicely inside of Google Colab. You don't have to do that, and if you use Jupyter notebooks on your local system there's another way to do it, just to have the tables print very nicely to the screen. That's all this is about. And it's really not 100% necessary, but I like to do that because sometimes I like to work with those tables right inside of my browser here. So now we have to import our data. Now our data lives in our Google Drive. It's a CSV file, a comma separated values file. So I just want to spend a moment or two talking about that. A comma separated values file: if you are in any kind of spreadsheet, whether you are in Google Sheets or in Microsoft Excel, when you say save as, you can save it not in the proprietary format. For instance, Microsoft Excel would have XLSX as a file extension. That adds a lot of extra information to the file. Some people add colors to their spreadsheet and format it in some way, format values so that they print as a percentage to the screen, or add monetary formatting; underlying that formatting is the real actual data. And when you export a file as a CSV file, it strips away all that extraneous information, which we really don't need. We are just after the raw data. So always save your files as CSV files.
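Putting that together, the import cell for this notebook might look something like the sketch below. The Colab-specific lines are commented out so the sketch also runs outside Colab; uncomment them in a real Colab notebook.

```python
import pandas as pd   # tabular data: gives us the DataFrame object
import numpy as np    # numerical Python: arrays and numerical methods

# Colab-specific lines from the lecture (uncomment when running in Colab):
# from google.colab import drive
# %load_ext google.colab.data_table
```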
It's also universally readable, so no matter what software someone uses, they can still read those files. First off, remember that drive function we imported. It has a mount method, so drive.mount, and now we're going to mount our G Drive, and internally on Google's servers that's forward slash gdrive. And I always add this little extra argument. So remember, arguments are these things that we pass to a function or a method. So in this instance, to the mount method we're going to pass these two arguments, and we separate the arguments by commas. The second argument is force_remount, and I'm setting its value, I'm assigning to it the value True. And that just means if I run this notebook a little bit later and it's already connected, or I don't know if it's still connected, it just forces the remounting of, or mounting means access to, that drive. So we're going to run that, and that is going to provide us with the ability now to connect to our Google Drive. And this is what it looks like. It says go to this URL in your browser, and you're going to click on that, and that's going to open a second tab. And on that second tab, it's going to allow you to re-access your account. You have to sign in to your Google Drive all over again, and that's really just for safety purposes. So it's going to show you a couple of pages, actually there's going to be a tab as well that's going to show you all the access that it gains from you logging into your drive again. So this notebook is going to have all sorts of access to your Google Drive, and that's all fine, it's very secure, you need not worry about that. And what we're going to do then is give it the permission, and then a tab is going to open up with a secret key. And you can see there, enter your authorization code. So on that page there's going to be a very nice little button, so you don't have to highlight the whole key, because it's quite long.
You can just click on that button, which is the same as copy, and you can come here and paste it in here. So just click in there and hit Ctrl V or Command V, and it'll paste this authorization code, and you're going to hit Enter. So I'm going to do all of that. I'm not going to show you on the screen, so we're going to do all of that: I'm going to copy and paste that in here and just hit Enter or Return, and that's going to give this notebook access to my G Drive. Now one thing I just do want to mention here: of course, depending on where you work in the world, it might be a bit difficult to store sensitive information about people on Google Drive. We might have laws that prevent us from storing human information outside of the borders of a country, and for security reasons we don't want to save data in this way. So be sure about how you save your data if it pertains to human beings, and make sure that you are allowed to save it on a Google Drive. Now Google also has some features that you can pay for. You can pay for your Google Drive to get extra functionality and extra space, and that's more secure, of course, but please find out about the laws before you enter any human data on your Google Drive. Of course, this is a course just to show you how Python works and what data science is all about, and none of the data that we have in this course pertains to real human beings, so it's not a problem for us in the course. So I'm going to click on this, and it's going to take me to a new sign-in page, and I'm just going to re-sign in to this very same Google Drive that I'm using. And there you go, I've copied and pasted it into that cell and just hit Enter or Return, and now it says mounted at gdrive, so I've got access to my Google Drive now. Now if we scroll down to the next section here, what I'm doing here is using a terminal command. On your computer you'll have a terminal or a command prompt, and that's exactly what we're using here.
What we're doing is just changing the directory to the folder or directory on the hard drive, well, that won't be our hard drive, it's up in the cloud, where this notebook and where our data is stored. And I'm going to use, as you can see, quotation marks here because this is a string. So you can see it starts with a forward slash, gdrive, that's the home folder, and then My Drive, and then I'm typing this whole folder structure, so I'm just going to take you to a different tab so that you can see where we get all of this. So this is my Google Drive, and you can see all the notebooks here that we're going to work in, and that's in a folder structure. So on my Google Drive, it's under My Drive, then I've got Stellenbosch University as a sub-folder or sub-directory, inside of that I have another folder or sub-directory, School for Data Science and Computational Thinking, inside of that I have another sub-folder or sub-directory, Data Science, and that's where all my notebooks live for this course. Inside of this, though, I also have other folders. Remember you can always say new, and then folder, and by the way, right down there under more, if I go down, there's Google Colaboratory if you want to start a new notebook. I also have a Data folder, and inside of this Data folder is where all my CSV files are stored, so I keep things very nice and neat. But it's this whole address here, My Drive, Stellenbosch University, School for Data Science and Computational Thinking, Data Science, Data, that's the whole folder structure to get to that file, and that's exactly what we're going to do. So I'm saying %cd, so that's a magic command, change directory, and I'm typing all of that: gdrive, My Drive, Stellenbosch University, School for Data Science and Computational Thinking, Data Science, Data.
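The %cd and %ls magics map onto plain Python calls, which is handy to know if you ever work outside a notebook. A minimal sketch, using a temporary folder as a stand-in for the Google Drive path above:

```python
import os
import tempfile

# A stand-in for the Google Drive data folder (hypothetical location)
folder = tempfile.mkdtemp()

os.chdir(folder)          # the plain-Python equivalent of %cd
files = os.listdir(".")   # the plain-Python equivalent of %ls
```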
So my CSV file is there, so I'm just telling this notebook: change internally to that directory. And now what we can do is use the %ls command, and that's going to show us all the files, it's going to give us a list of all the files inside of this directory, which we've changed into. So think about just your explorer on your normal computer: you can click around and go to different folders or directories on your hard drive, and this is exactly what we're doing here. So I'm not going to run that ls, but if you do, it'll list all the files. What I want to do is, I know the file that I'm after, it's called data.csv, so there we have the CSV, comma separated values spreadsheet file, and the function that's going to import that for us is the read_csv function, and that comes from the pandas package. So pd, because we can use that namespace abbreviation, so pd dot, and one of its functions is read_csv. As argument, inside of quotation marks because we use strings for file names, it's data.csv, the name of my file. And by the way, you can just drag a file from your internal drive onto your Google Drive, and it'll upload and live there. So I'm going to run this cell, and look there, I'm assigning it to a computer variable called df. Remember the assignment operator, there we go, the equal symbol: whatever is to its right, it assigns to what is to its left. So it's going to read this data file first and import it, and save it in computer memory with the name of that little space in memory being df. So now that is done, I can look at the type. Remember the type function is going to tell me the type of this object, this df object that we've now stored in memory, and we can see it is a pandas type object, pandas.core.frame.DataFrame, so it's a dataframe object, and that's what pandas does for us, it creates this dataframe object that we can now manipulate.
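Since the lecture's data.csv lives on the lecturer's own Drive, here is a self-contained sketch that reads a tiny hypothetical CSV from a string instead; in the notebook you would simply pass the filename.

```python
import pandas as pd
from io import StringIO

# A tiny hypothetical stand-in for the lecture's data.csv
csv_text = """name,age,smoke
Ann,52,0
Ben,61,1
Cat,47,0
"""

df = pd.read_csv(StringIO(csv_text))   # in the notebook: pd.read_csv("data.csv")
print(type(df))                        # <class 'pandas.core.frame.DataFrame'>
```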
Now as far as our dataframe object df is concerned, it has many, many methods, so we can say df dot and then a method. And the way that you can see all the methods, not something that we do all the time, but if you want to learn about all the methods that are available to you with your df now, in other words what can you do with this dataframe object, is just to pass it to the dir, or directory, function. And if you do that, it will give us this long list of things that we can do with this object. I can say df.ffill, that's forward fill, or eval, or empty, or dtypes, or dropna, and there's a long, massive list, and that's what makes pandas so powerful, because there's so much we can do with our data, we can manipulate the data that we have so much. So one of the first methods is the head method, and I'm not even passing any argument to it, because it does have a default argument, and that's five. So df.head, that's going to show us the first five rows of our dataframe, and we always do this because we just want to make sure that our data object imported properly. And there we see the first five rows of the CSV file that we've imported, so at the top, which in the spreadsheet would be the first row, are the column headers, and we see that this is what we call tidy data. Now it's very important to work with tidy data as much as you can.
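The head method, the dir function and the shape attribute from this part of the lecture can be sketched on a small hypothetical dataframe (the names and ages below are made up for illustration):

```python
import pandas as pd

# A hypothetical miniature of the lecture's dataframe
df = pd.DataFrame({
    "name": ["Ann", "Ben", "Cat", "Dan", "Eve", "Fay"],
    "age":  [52, 61, 47, 58, 49, 55],
})

first_rows = df.head()   # default argument n=5: the first five rows
dims = df.shape          # (rows, columns) attribute, so no parentheses
available = dir(df)      # the long list of everything a DataFrame can do
```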
Now tidy data really comes from the R world, Hadley Wickham described this in a lovely paper, and it's all about every row in your dataset pertaining to an observation. And that observation might be something in your chemistry lab, it might be something in your biology lab, it might be something about economics; no matter what kind of science you're working with, each row pertains to a single observation, and all the columns are very neat variables. So you've read up about variables and the specific data types: down a column, the data type is fixed. So in age those would be integers, and we don't put anything else in there, we don't put words in there, we don't put any other kind of data type in there; that is a continuous numerical variable inside of age. And inside of vocation we have nominal categorical data, so it's all the same. And this is called tidy data: each row is an observation or a subject in our dataset, and each column pertains to a very specific variable, and we don't put variables together. Many times you'll see a dataset where there are lots of commas or ands, almost free entry of data inside of a cell, and there might be a variable that has lots of potential entries that you could make; what you have to do is make all of those separate entries, and such a column can just have zeros and ones, or yeses and noes, in it. So it must be very clear what the data type is, and it is a single data point value that goes into each cell. And by the way, because we've used that %load_ext magic, this is the printout in Google Colab that I was talking about, because look how nice it is: I can say how many values I want, how many observations per page. Now we've used the head method so it's only going to show me five, but I can really work with this table and look around the data quite nicely, so I like to use that magic command. So the .head method is what we're going to use just to make sure
that our data imported correctly. The next thing I want to use is the shape attribute. Now remember, an attribute or property is not a function or method, so it doesn't have parentheses after it, it's just df.shape. And what df.shape is going to do for a flat file is tell me how many rows there are and how many columns. So the rows are how many subjects or observations I have in my dataset, that's the number of rows, and the columns are my number of variables. So we can immediately see with the shape attribute, or the shape property, that we have 200 rows and 13 columns. Sometimes we also want to just have a look at all the variables that were collected, and for that we use the columns attribute, so df.columns, and it's going to give me this index of columns as a list object, and we see name, dob, age, vocation, smoke, hr, sbp, etc. Now there's also ndim, the number of dimensions, and we have two dimensions, row and column, so it's a two-dimensional dataset along two axes, in other words. There is the size attribute as well, and that's just going to show me how many data points I have, so it's actually going to be rows times columns, 2600. The dtypes, or data types, attribute is very important, and it's going to show me all the variables that we have, and it's going to show us what pandas thinks each data type is. So name it thinks is an object; an object is just a keyword for a string, and strings are going to be nominal categorical variables. Age it sees as an int64, so that's a 64 bit integer. Now the 64 bit just tells us the maximum and the minimum, the highest and the lowest value it can store as a 64 bit integer, which are actually quite huge numbers, so it's still an integer. Vocation is again an object, or a categorical variable. Smoke is a bit difficult, because it saw that as a 64 bit int, so let's just go back up a little bit, and you see the smoke column, well these
first ones were 0. We can well imagine that whoever created this dataset had in mind that 0 means no smoking, 1 means smoking, and 2 means smoked before, for instance. So that's encoded as a number, but that value clearly is not really a number. I cannot say what the mean smoke value is, because that'll come out as perhaps 1.25, which has no meaning. So that is a categorical variable encoded with numbers, and that might cause a few problems, and we have to keep that in mind: that is definitely not a numerical variable, that is a categorical variable. So we really have to watch out for that kind of problem in a dataset, we really have to be concerned about how these things were saved. Then the float; remember, float means decimal point values, and it's also 64 bit float, which gives us a lot of decimal places. Besides 64 bit there'll also be 32 bit, 16 bit, 8 bit, and this means less and less precision after the decimal place that this kind of data can keep. Next, though, we're going to talk about extracting some of the rows and the columns, because that's what this dataframe object and pandas is all about: it's for us to manipulate the data. So the first way that we can refer to just one of the columns, we might be interested in just one of the columns, and this is how we would do that. We use square brackets, well, there are actually a couple of ways and we'll get to that: df and then a set of square brackets, denoting that we're working here with a type of indexing, and then inside of quotation marks I'm just passing the name of the column that I'm interested in. So df, square brackets, age, and I'm going to assign that to a computer variable that I'm going to call age_column. So I'm just going to extract that single column and assign it to a new computer variable. Let's see what this new computer variable, this single column from a dataframe, what kind of type it is, and we see it's a series object. So in pandas you get a dataframe
object and you get a series object. If it's just a single column, it still has an index: on the left hand side you'll see 0, 1, 2, 3, 4. Remember Python is zero-indexed, so the first row will be row 0. That's just a single column, so that's one way to do it. There's also a shorthand notation where we don't use the square brackets, we just say df dot and then the name of the variable. We can only do that, though, if we don't have any illegal characters in that column name, and spaces are the big, big problem, either spaces between words or an invisible space after a word; we can't have that. So if our column variables are nice and neat, and always try and do that, don't be overly verbose and have a long sentence there as your column header, and definitely don't use illegal characters such as spaces, then we can just use the dot notation. So df.age, and that's going to give us exactly the same thing. And by the way, because a series object is just like a dataframe, we can use the head method, so .head, and that's going to give us the first five values in that age column. It is a series, and that's why I say the series will also still have the index on the left hand side: the first row, the second row, the third row, up to the fifth row; Python is zero-indexed, it starts counting at 0. Now in some instances we might not want a column that we extract like this to be a series object, we just want it to be a numpy array, remember those. For that we have the to_numpy method, so I can say df.age, which is going to extract that series for me, just that single column, and then we have the .to_numpy method there. I don't have to pass any arguments to it, and I'm going to assign it to a computer variable called age. So instead of a pandas series, I now have a numpy array. Just to make sure that is true, I'm passing it to the type function, and I see it's a numpy.ndarray, and that's exactly what we expected
it to be. So it's not a pandas series anymore, and it's not going to have a little index down the left hand side. Each value in a numpy array still has an index, but it's not going to be printed down the side. We can also look at all the many, many methods that are available to numpy arrays, again by passing it to the dir function. Here are a couple of them. The minimum, the min method: I'm just saying age.min and I'm going to get the minimum age in that numpy array. Then .max gives me the maximum age, and .mean is going to give me the mean of that array of values, 53.07. So that is how we extract first the series and how we convert it to a numpy array. Now you don't always have to do that, I just want to show you what is possible and how you have to think about the data type that you are working with, as far as what these objects are in Python. So next up I'm going to show you two very important methods, and those are the iloc and the loc methods. So iloc is integer location, and after it always come square brackets, because we are dealing with indexing here. So df, my dataframe object, dot iloc, and zero. So let's see what that does. Integer location, and the zero means the zeroth observation, the zeroth row, and what that's going to return for me is all my column headers there, the variable names, and then for that first observation, that first row of data, the zeroth row, it's going to show me all the values that are inside of that first row. So if you know exactly which one you're going for, you might be interested in your lab in a very specific data entry, you can just use this iloc and pass a single value to it, and it's going to give you that row of data. You might not only be interested in a single row, you might have a couple of rows that you are interested in, say rows three, four and six; remember those will be index two, three and five, Python being zero-indexed, and I just pass that
as a list object inside of the iloc square brackets. So note there that there's a Python list inside of the outer set of square brackets: df.iloc, and I want rows three, four and six, in other words index two, three and five, and now it's going to return for me just those specific rows. You see the index two, the index three and the index five, so now I can see, nicely laid out, only those rows that I was interested in. So what if I know that the rows, the subjects, the observations that I'm interested in are actually sequential? For that we can use that range notation, remember, the colon in between, that jumps in increments of one. So 0:2 suggests zero, one and two, but just remember the last value is not included, so it's only actually going to be row zero and row one. So if I know that they're contiguous in this fashion, I can just use that range notation with a colon, and now again it's showing me the first row with index zero and the second row with index one, but it doesn't show me two; that two is excluded. So that's just the rows, but what if I'm not interested in getting back all my columns? Well, this is one of the powers of indexing: it's always row comma column, because see there we're bringing in a little comma there, and if you see a comma inside of iloc or loc, that means row comma column, always in that order. So what am I asking for here? I'm asking for the rows with index zero, one, two, three, four, remember the five will be excluded, comma, and then I'm passing a Python list for the columns: I want the columns with index one and index two. Now even the columns are zero-indexed, so the first column is going to be the zeroth column, so I'm actually asking here for column two and column three, but this means I should know what those columns are, because we're not using their names, we're just using their index. So let's see if pandas did the right thing, if it did what we wanted it to do, and there we go, zero to four, so zero, one,
two, three, four. And if we look at name, that was my first column; remember the index doesn't count as a column, so name was my first column, that's the zeroth column, and that's not what we asked for. We asked for index one and two, so zero, one, two: one and two will be dob and age, date of birth and age, and that's exactly what we got back, dob and age. So that's integer location, which is sometimes useful as far as the columns are concerned, but .loc, that property, is actually much more useful, because there we can actually use the column names. So here I'm asking for much the same thing as we did with the previous iloc. I'm saying df.loc, or location, zero to five as far as the rows are concerned, comma, and then I'm passing as a list the two column names that I'm interested in, and that's much more useful than iloc, and I get back, as you can see, the same columns. One small thing to note is that loc slices by label, and a label slice includes its endpoint, so 0:5 in loc gives you the rows labelled zero through five, where 0:5 in iloc stops before five; so remember that difference between iloc and loc. I'm also going to show you that there's iat, integer at, and I'm just passing row comma column, so I want the third row, second column, I want that specific data entry, and I can use iat, and let's see, that was 43. Now we're going to come to one of the most important parts, or the most interesting parts at least, of pandas, and that's how we're going to filter some data. We only want to extract certain observations or subjects from our dataset, and that is filtering the data. Now for filtering the data we're going to use these conditionals, remember, the greater than or less than, that return a True or False value for us. So let's see: is three greater than four? No it's not, we get back a False. Then three equals equals three, so we're just checking whether the left hand side and right hand side are equal to each other, and three equals equals three point zero, it's exactly the same value, so True. So remember these conditionals, we're now going to use them. So first up, if we have a categorical variable, we might want to know the sample space; remember the sample
space is all the possible values that a variable can take. So we know beforehand this is the set of values that the variable can take, and now if we start entering data, whether that be from human beings, from chemicals in the laboratory, from some organisms, it doesn't matter what it is: if it's a categorical variable there's a sample space, and any observation that comes in is going to have one of those values. For a continuous numerical variable there's going to be an interval from the minimum to the maximum, and every subject will have a value in that interval. So that's the sample space, the possible values that a variable can take. So first, we know smoke, as human beings we know it's going to be whether this participant was smoking or not smoking. If I just want to know the sample space, specifically for categorical variables, we have the unique method for that. So df.smoke, remember what that's going to do, that's going to give me that smoke column as a series, and on that series, df.smoke, I'm using the unique method. I don't have to pass any arguments to it, and I can see the values that are down in that column, and that's why it's so important to have tidy data. I see there are three sample space elements: no matter which observation, subject, row is in my dataset, they will only have the values zero, two and one. And what pandas does is it goes down that column and it just starts returning the values in the order in which it finds them; that's going to be the order that it returns them in, because these are not really numbers, these are categorical values, they mean something, but there's no real order to them, so it's actually a nominal categorical variable there. So imagine then that in my dataset I'm interested in the ages of only those people who are non-smokers; by the way, those will be the participants that were encoded as a zero. So I want to know the ages
only of the non-smokers. How would I go about that? We're going to use conditionals for that, and there we see the code for it. So first of all, on the left hand side, I'm going to create a computer variable of my own making, non_smoker_age. You know, it's a very descriptive name: I'm only looking for the ages of the non-smokers. And there's more than one way to go about this; I'm going to show you this one. I'm calling df, the dataframe, and then immediately a set of square brackets. Now in reality I think many people would use .iloc or .loc here, but you don't strictly have to, so I'm just saving a bit of typing here. So I'm just saying df and then immediately a set of square brackets. What does that mean? Well, if there's a comma in those square brackets, it's going to be row comma column. There's no comma in there, so I'm only referring to rows. So it's df, and I'm indexing the rows. Which rows do I want pandas to look at? I want it to look at the df.smoke column, I want it to go down that column, so df.smoke, and then equals equals zero. And you can well imagine what this is going to do: it runs down that smoke column, and only where the entry is zero, which in our case marks the participants who do not smoke, is it going to return True; all the others are going to return False. If it's a one or a two, it's going to return a False and be ignored; it's only looking for the True values. So once you've done that, once you've selected only the people where that smoke variable was captured as a zero, then another set of square brackets: from those that have smoke equal to zero, please extract the age column for me. And then export that to a numpy array, so I'm using the .to_numpy method on that.
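The two filtering styles described here and just below can be sketched on a small hypothetical dataframe (the column names mirror the lecture's; the values are made up):

```python
import pandas as pd

# Hypothetical data mirroring the lecture's smoke and age columns
df = pd.DataFrame({
    "age":   [52, 61, 47, 58],
    "smoke": [0, 1, 0, 2],
})

# Plain square-bracket style: a Boolean mask selects the rows, then we take age
non_smoker_age = df[df.smoke == 0]["age"].to_numpy()

# The .loc style: row condition, comma, column name
non_smoker_age_loc = df.loc[df.smoke == 0, "age"].to_numpy()
```

Both return the same array, here the ages 52 and 47, and calling .mean() on it gives the average age of the non-smokers.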
So that's a whole sentence you've got there. It says: take my dataframe df, go down the smoke column, and only select those who don't smoke, being encoded as a zero; for those that you've then selected, please take the ages and convert them to a numpy array. A nice little English sentence there, and we've expressed that as code, and this is what computational thinking is all about. So I've saved that, and I've printed it out by calling it, and we see it's an array of values, and it's the ages of only those who in their smoke column have been encoded as a zero. So that's one way to do it. I'm going to show you a different way, where we use the loc property and we can use rows comma columns. So this would be another way to do it. Now it's confusing in the beginning, because there are so many ways to do it, but you can find the way that works for you, and it really makes it just that much more powerful. So I'm going to say df.loc, so location, and that means where I have a comma in there, after the comma I refer to the columns, and I can actually pass the column name instead of its integer index value. So I'm going to say df.loc and then df.smoke equals equals zero, so that's my row value. It says go to that smoke column, go down all those rows, and only select the ones where it's zero, so only those are going to return a True; comma, the age column please, and then again convert that to a numpy array for me please. So there we go, exactly the same values, the ages of those people who do not smoke. And now that I have that as a numpy array, I can just use .mean, and that's going to give me the average age of the non-smokers. It really is as simple as that, and once you get used to it, these things just run automatically, where you can start having these wonderful research ideas and you can extract the specific data that you're looking for from a very large dataset, and that's the power of using a computer language such as Python. So now we're going to get to a different kind of
question non smokers where the survey choice is three so what are we asking here we want the ages but they're two conditionals they've both got to be non smokers and they've got to have a survey choice of three so both things have to be true both have to be true for us to then include that and just extract those ages that's very different from asking the non smokers or the survey choice of three when we have all either the two can be true and that observation or subject will still be included so in python we have this idea that boolean logic is and all and for is we have the ampersand symbol you can see there and for all we have this pipe the line up and down search for that one on your keyboard that's all so again if it's and both have to or you can have more than just two but they all have to return true for that road to be included and the all anyone of my conditionals can be true so let's have a look at how we do that let's look at the right hand side first it's actually quite intuitive so it's div.lock so that I can use rows comma columns and then columns I can use the actual column name and you see there is my little ampersand there so I'm saying go down the div.smoke that series that column where that equals zero and go down div.survey where the values are three comma which column am I interested in I'm interested in the age column and eventually I'm just going to export it as a two numpy now in reality I almost never convert this to a numpy array but to keep things neat here I'm just I will keep on doing this this as two numpy and then we have the assignment operator and we're assigning it to a computer variable of our own choice and I've called it non underscore smoker underscore satisfied underscore h this is a little bit descriptive and now that is going to give me the ages of those that have zero for smoking and three for survey both of those sets of things have to be true and then I'm going to get the ages back of those now a little bit more difficult 
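The and-combination above might look like this, again on an invented miniature of the data frame; note the parentheses around each condition, which are required because & binds more tightly than ==:

```python
import pandas as pd

# Invented miniature of the lecture's data frame.
df = pd.DataFrame({
    "age":    [45, 52, 38, 61, 29],
    "smoke":  [0, 1, 0, 2, 0],
    "survey": [3, 3, 2, 3, 3],
})

# Both conditions must be True for a row to be kept.
non_smoker_satisfied_age = df.loc[
    (df.smoke == 0) & (df.survey == 3), "age"
].to_numpy()
```

Without the parentheses, Python would try to evaluate 0 & df.survey first and raise an error, so always bracket each condition.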
Let's look at never smoked or satisfaction score greater than three. I'm going to show you a slightly different way, because I think you can already tell what this line of code is going to look like. I'm going to save my filtering as a computer variable that I'm going to call crit. To it I assign what we had before the comma as far as .loc was concerned — the rows. So I say df.smoke == 0 and df.survey > 3, but since I'm asking for never smoked or satisfaction score greater than three — either of the two can be true — we use the little pipe | between them, not the ampersand. Then, when I actually do this, instead of writing out the criteria I've now saved them as the computer variable crit, so I just say df.loc and then crit comma age. As simple as that. So we run the line that saves our crit, and then we use it. The reason you might want to do this is that you may want to build up different criteria — criterion one, criterion two, criterion three — and just rerun this line of code, passing a different criterion each time. It makes things a bit easier to work with if you've saved it this way, so you don't have to continuously change the whole line of code; but you most certainly could have typed all of that in place of crit. What I want you to see is the difference between and and or. So let's look at this one: non-smokers and satisfaction score three or less — and still the age, by the way. The score of three or less is actually quite easy, because that's just less than or equal to three. But what about the non-smokers? The ones who are not smoking are not only the zeros but the twos as well, because a two signified that they smoked before but stopped — they are still non-smokers currently. So I need to include the zeros and the twos. Now of course I could write little sets of conditionals — smoke == 0 or smoke == 2, together with df.survey <= 3 — certainly no problem there. But I want to show you that there's not only and and or; there's also a not, and that's what I'm going to do here. I'm going to choose exactly the opposite of what I want. I want smoke of zero and smoke of two, but I don't want smoke of one, so that's exactly what I choose: df.smoke == 1. And what do we want for survey? We want three or less, so I choose exactly the opposite: greater than three. And I want both of those to be true, so I put the little ampersand there. Now when I do df.loc I have this little tilde symbol, ~crit, and what that means is not the criteria — the exact opposite. I'm actually looking for all the False ones. When we have df.smoke == 1, the smokers return True; the Falses are the zeros and twos, and that's exactly what I want. Survey greater than three all return True, but that's not what I want: I want all the others, all the Falses, and that's why I use that little tilde — not the crit, so all the False values. Then age, and we return all of that, and that gives me the list of my non-smokers — the zeros and twos — with survey scores of three or less, by choosing exactly the opposite. Sometimes it can really help to remember that you can select all the False values that get returned down a column. So, next one: create a new data frame object that only contains participants younger than 50. It's a very common thing to take one or two or three or four large groups in our data and save them as separate data frames. Why not?
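The or and not operators above can be sketched together on an invented miniature frame; the ~ inverts a saved criterion, exactly as described:

```python
import pandas as pd

# Invented miniature of the lecture's data frame.
df = pd.DataFrame({
    "age":    [45, 52, 38, 61, 29],
    "smoke":  [0, 1, 0, 2, 1],
    "survey": [3, 4, 2, 3, 5],
})

# OR: a row is kept if EITHER condition is True.
crit_or = (df.smoke == 0) | (df.survey > 3)
ages_or = df.loc[crit_or, "age"].to_numpy()

# NOT: build exactly the opposite of what we want, then invert it.
crit = (df.smoke == 1) & (df.survey > 3)    # what we do NOT want
ages_not = df.loc[~crit, "age"].to_numpy()  # everyone else (the Falses)
```

Saving the criterion in crit first, then writing df.loc[~crit, "age"], keeps the final line short and lets you swap in different criteria without retyping the whole expression.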
One thing to be concerned about, though, is if these data frames are very, very large, because as it stands with pandas they have to fit into the computer's memory. If you just keep creating new ones from old ones, you're going to run out of memory. It's just something to be cognizant of — but this is a small little data set, so no problem here. So how are we going to do that? With conditionals again. I'm going to say df — now strictly speaking I could say df.loc, but you can drop that in simple cases like this. There's no comma in there, so it's only the rows. Which rows am I interested in? The rows where the age column, df.age, is less than 50, and I'm assigning that to a computer variable, new_df. If we now call that, we'll see that, looking down the age column, no one is 50 or older; it's less than 50, so 50 itself is obviously not included. And as you can see, I didn't even convert this to a NumPy array, just to show you that you don't have to. So I just say new_df.age — I can use the dot notation for that column because age doesn't have any illegal characters in it — and then .max. There's a max method, just as there was for a NumPy array, and if we see what the max age is, it was 49, so it definitely extracted into this new data frame of mine only those participants that were younger than 50. And remember the alternative: if you do have spaces in the column name, just put it inside quotation marks inside a set of square brackets — it does exactly the same thing. Just to show you another way — there are so many ways to go about this — there's new_df.loc, locations, where I can say rows comma columns. The colon symbol on its own means all the rows; so all the rows, comma, the age column, and give me the max of that, and I also get 49. Really very powerful, because there are so many ways you can go about this — so don't be intimidated by that. When you sit down to try it yourself and all these ways are swirling in your head and you're not quite sure which is which — it comes with time and experience, and you'll find the way that works for you. So next up: create a new data frame for participants with a restricted list of job titles. How would we do that? If we go down the vocation column, there are lots of vocations in there; I'm only interested in the participants that were either IT consultants, energy managers, or clinical embryologists. So I save those as a Python list, assign it to the computer variable jobs, and then I use the isin method and assign the result to the computer variable crit. So I'm saying df.vocation.isin(...) — whatever you pass to that, it will return True only where that specific subject's vocation is one of those three entries in the Python list.
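The filter-into-a-new-frame step above, with the three equivalent ways of reading off the maximum, might be sketched like this (the column names and the under-50 cut come from the transcript; the data are invented):

```python
import pandas as pd

# Invented miniature data frame.
df = pd.DataFrame({
    "age": [45, 52, 38, 61, 49],
    "sbp": [133, 159, 122, 168, 140],
})

# No comma inside the brackets: rows only. Keep everyone under 50.
new_df = df[df.age < 50]

# Three equivalent ways to get the maximum age in the new frame:
m1 = new_df.age.max()            # dot notation (no illegal characters in the name)
m2 = new_df["age"].max()         # square brackets (also works with spaces in names)
m3 = new_df.loc[:, "age"].max()  # .loc with the colon meaning "all the rows"
```

All three return the same number; which spelling you use is purely a matter of taste, except that dot notation breaks down for column names with spaces.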
So I'm passing the Python list jobs to the isin method. That's my criteria, and now I can create a new data frame, jobs_df, and assign to it df.loc[crit]. It uses that crit — I could have typed the whole criteria in there instead and had one long line of code, but I break it up like this, and that's quite a common thing to do. If we now look at jobs_df, the new data frame, and look down the vocation column, it only contains IT consultant, clinical embryologist, and energy manager — it only selected those. So if you have a column with words as the sample space elements for that variable and you're only looking for certain ones of them, use the isin method: very easy and powerful. Now you might want to get slightly more granular than that — within a string, I might only be interested in a single word. For that we use str.contains. Look at this: we only want the participants that had the word manager in their vocation, so we build a new criteria. I say df.vocation — down that column, that pandas series — then the str.contains method, string-contains, and what I want it to contain is the word manager. Then you see I've got a second argument there, na=False: that just means exclude the rows where nothing was entered — empty data, and we'll certainly still have a look at empty data in this very notebook. Then I create a new pandas data frame, vocation_df, and assign to it df.loc with my criteria. If we have a look at that with the .head method — the first five — there's energy manager; tourist information centre manager; estate agent, land agent or estate manager; logistics and distribution manager; passenger transport manager. They all contain that word manager, and if you have very complicated data sets and want to extract only certain values, this str.contains can become very, very powerful. Next up, let's talk about updating or changing values in a data frame — a very common thing to do. The first one is just renaming some of your columns. In this first instance we have a name column and I want to change it to participant. Not a very useful example here, but you might have a horribly named column that you just want to rename. For that we use the rename method: df, my data frame, .rename, and one way to do it is to use the columns argument. I'm setting that to a dictionary, because a dictionary is very useful if I want to do more than one replacement. Remember, a dictionary is key-value pairs: the key is the old name and the value is the new name. They are strings, so they go in quotation marks — name was the old one, participant is going to be the new one. And then I'm setting inplace=True — you're going to see this inplace quite a lot — which means make the change permanent. If you don't put inplace=True, it only does it for that cell's output, and when you use the data frame again it reverts to the old name. So if you want to overwrite values in the data frame, use inplace=True, and that does it permanently. Let's look at df.columns, the columns attribute: instead of having name as my first column I now have participant, and if I look at the first couple of rows of my data frame, you'll see the name has changed to participant. Good. The next thing we might want to change is all the values down a certain column. This is one way of putting some anonymity into data — remember, data security and the laws pertaining to data on humans are very strict in most countries.
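The isin and str.contains filters above can be sketched together; the vocation titles below are examples mentioned in the transcript, and the rest of the frame is invented:

```python
import pandas as pd

# Invented miniature frame with a vocation column.
df = pd.DataFrame({
    "vocation": ["IT consultant", "Energy manager", "Estate agent",
                 "Clinical embryologist", "Passenger transport manager"],
    "age": [41, 35, 58, 47, 52],
})

# isin: keep rows whose vocation is one of an allowed list of titles.
jobs = ["IT consultant", "Energy manager", "Clinical embryologist"]
jobs_df = df.loc[df.vocation.isin(jobs)]

# str.contains: keep rows whose vocation merely CONTAINS a word;
# na=False makes empty entries count as False instead of raising.
crit = df.vocation.str.contains("manager", na=False)
vocation_df = df.loc[crit]
```

Note that str.contains is case-sensitive by default, so "Manager" and "manager" are different; there is a case=False argument if you want to ignore case.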
A very simple example that I'm giving you here: we might decide that when we capture the ages, we mentally subtract two from every age. So for that first participant there, we could say they were actually 45 years old, but we capture 43. That just adds a little more complexity: if someone found this data and wanted to figure out who these participants are, it makes life a little more difficult for them. Of course we'd also have to strip away the participant name and the date of birth — we'd have to delete those columns, and I'll show you how to delete columns — but this is just one way to change things. When we analyze this data, though, we've got to remember to add two to each age again to get back to the right age. So I'm going to show you a couple of ways to do that. The first and most useful way — but also the most complicated — is writing our own little function. Python comes with a bunch of functions, of course, and when you import other packages you get even more, but we can create our own, and we do that with the def — for definition — keyword. I say def, then I give my function a name — I'm calling it add_two, a descriptive name — then a set of parentheses and an argument. This is a very simple function: I have only one argument, and it's just a placeholder. It says that whenever someone uses my function, they put a value in there. Then a colon — and when you type that colon, the white space, the indentation, is entered for you, something that is very important in Python — and then the return keyword: whatever the argument value was, it adds two to it. So that's my very own function. Let's have a look at what the values were before: I call df.age, which gives me back a pandas series, and I call the head method on that series so that I only get the first five values: 43, 53, 33, 43, 46. Now I can use the .apply method. I say df.age.apply and just pass the function name to it, add_two. That function knows what to do: it takes every value and adds two to it. And what I'm doing here is overwriting that column, because I'm assigning it back: df.age = df.age.apply(add_two). That permanently overwrites the column, because I'm assigning the result to that specific column. Let's call head on that column, and now instead of 43 we see 45, instead of 53 we see 55, et cetera. So that's one way to go about it. There's perhaps a more convenient way: let's use a lambda function. A lambda is an anonymous type of function — it's part of Python, and I want to show you that it exists. This time I'm going to subtract the two again. I say df.age — overwriting it via the assignment operator — equals df.age.apply, and what I'm applying is this very shortened form of a user-defined function. I say lambda, which creates this temporary thing called x, and after the colon comes what to do with x: just subtract two from it. That apply goes row by row, and from the value in each row it subtracts two, so we're just back to normal. Okay, those are the complicated ways, and sometimes you really do need them, because there may be something more complex you want to do to all the values. But now I'll show you the one that all of us use most often — simply the simplest way: df.age = df.age + 2. As simple as that. On the right-hand side I get a pandas series, df.age, and it broadcasts this plus two to every value down that column — all the rows in that column just get two added.
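The three approaches above — a named function with apply, a lambda, and plain broadcasting — can be sketched end to end on an invented age column:

```python
import pandas as pd

# Invented miniature frame with only an age column.
df = pd.DataFrame({"age": [43, 53, 33]})

# Our own function: whatever value comes in, two is added.
def add_two(x):
    return x + 2

# apply runs the function on every value and we overwrite the column.
df.age = df.age.apply(add_two)           # ages become 45, 55, 35

# A lambda is an anonymous one-line function; here it undoes the shift.
df.age = df.age.apply(lambda x: x - 2)   # back to 43, 53, 33

# Simplest of all: broadcasting adds 2 to every row in the column.
df.age = df.age + 2                      # 45, 55, 35 again
```

For a simple constant shift the broadcast version is the idiomatic (and fastest) choice; apply earns its keep when the transformation is genuinely more complex than arithmetic.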
Then, with the assignment operator, df.age equals all of that, I'm overwriting what is in there, and now I have the added two again. Very, very simple. Now I might want to change a nominal variable to a numerical encoding. How do we go about that? If we look at the group column, there were two sample space elements in it — by the way, you can check that by doing df.group.unique(), calling the unique method, and you'll see there's control and active. I'm doing this with a dictionary: the key-value pairs represent the old value and the new value. All the participants were either in the control group or the active group, but I want to change that to zero and one — later in this course we're going to do some machine learning, some artificial intelligence, and some of those algorithms work only with numbers, so I have to convert my categorical variable to a numerical one. This is one way to go about it: I call df.group and then the .map method, and what I pass to it is this dictionary. If we do that, we see that the group column no longer contains control and active; it contains the values zero and one. I just want to warn you about this map function; the alternative to map is replace, as you can see there: df.group.replace. The difference is what happens to a value that isn't in your dictionary — an empty cell, for instance: map has a bit of a problem with it and turns it into a missing value, while replace leaves any value it doesn't recognize alone — it doesn't do anything with it. So sometimes I use .map and sometimes .replace, passing a dictionary of old values and new values. And it need not only be a categorical variable changed to a numerical encoding — you can go the other way around as well, or change things completely. The .map and .replace methods are things we use quite often. Now let's talk about splitting columns, and this might be very useful for you. You have a column in which the data is not tidy — different things were entered there with commas in between, or semicolons, or spaces — and you want to separate that out into different columns. We had this name column, which we changed to participant, remember. What we're going to do is split it, because it was a first name, a space, and a last name — a first name and a surname — and we use the .str.split method. I say df.participant.str.split, and what do I want to split it on? The space. And expand=True: that means it makes two separate columns, a first name and a last name. I call the result new_data, and that's the splitting that I want: it gives me the two values, the first name and the last name. And remember, those have indexes: the first one is index zero, the second is index one. So what I'm going to do is create two brand-new columns in my data frame df. The way you do that is you pass a string inside a set of square brackets: this first name column does not exist in the data frame, so I'm creating a new one, and this is how you do that. What I assign to it is new_data indexed at zero — all the first names; the second new column gets index one, all the last names after the space, which was our split. Now, we're very fortunate here that none of the entries have middle names and so on, so it's a simple example, but I think you get the point. If I now call df.head on this, right at the end we see two new columns: first name and last name.
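The map encoding and the str.split-into-new-columns steps above can be sketched together (the group labels come from the transcript; the participant names here are invented placeholders):

```python
import pandas as pd

# Invented miniature frame: a name column and a group column.
df = pd.DataFrame({
    "participant": ["Dylan Patton", "Sandra Howard", "Samantha Williams"],
    "group": ["control", "active", "control"],
})

# map: every value is looked up in the dictionary. Any value NOT in the
# dictionary becomes NaN, so list every sample space element;
# df.group.replace({...}) would instead leave unlisted values untouched.
df.group = df.group.map({"control": 0, "active": 1})

# str.split with expand=True returns a little frame of new columns.
new_data = df.participant.str.split(" ", expand=True)
df["first_name"] = new_data[0]   # index 0: everything before the space
df["last_name"] = new_data[1]    # index 1: everything after it
```

The square-bracket assignment with a new column name is what creates the column; it did not exist in df before.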
The participant names were split on that space. Sometimes you're lucky and someone put semicolons or commas in there, and then instead of that space you use the semicolon or the comma, whatever the case might be. I also just want to show you that you can concatenate those all over again. This time I'm creating a new column called name: df, and then inside square brackets, in quotation marks, name. What I'm going to do is concatenate values: df.last_name, then the plus to concatenate — remember, we concatenate strings with the plus — then a comma and a space, and then the first name. Let's have a look at what that does: if we look at df.name, I get Patton, Dylan; Howard, Sandra; Williams, Samantha. That works very well. The next thing I want to show you is how to create a categorical variable from a numerical variable, and sometimes that's called binning. I have a numerical variable, but I want to change it to a categorical variable — and that doesn't mean changing every single value; I'm creating bins, each with a minimum and a maximum value into which a specific value might fall. So let's go down this cholesterol before column — that series, that column in other words — and call the min method on it: we see the minimum cholesterol before was 1.2. And the maximum cholesterol before — let's have a look — was 11.1. So we know the smallest and the largest values. These might not be units you're familiar with — different countries use different standard units in their laboratories — but for this data set it doesn't matter what the variable was: the minimum of this continuous variable was 1.2 and the highest was 11.1, so we know that all the values fall inside of this interval, with 1.2 included and 11.1 included as well. But what if we want to chop this up into, say, three different bins, and call them low, intermediate, and high? We'll have this new variable — a categorical variable — and it's going to put the specific value of each observation inside one of these bins or buckets. So I create a new column in my data frame: df, square brackets, quotation marks, cholesterol before level. Now I'm going to use a pandas function: to this new column I assign pd.cut — I'm using the cut function in pandas. The first argument is the numerical variable that I want to cut, df.cholesterol_before; then I say how many bins I want the values divided into. What it does is look at the minimum and the maximum and just split the range into a third, a third, and a third — I'm not setting those cut points; they depend purely on the values that are there. This doesn't make physiological sense, what we're doing here; I just want to show you how it works. Then the labels I give those three bins are low, intermediate, and high. If we do that — let's have a look, and let's make a NumPy array of it so we can see — 1.2 fell into low, 1.2 fell into low, 2.1 fell into low, and if you print out a few more, you'll see values that are a bit higher land in intermediate, and even higher ones in high. That's our new column, cholesterol before level. Now just a little bit of mathematics, because you might wonder: if a value falls right on the edge between two bins, which side does it go to? In mathematics we have the idea of different types of intervals, and we use a notation for them. If you see this notation here, square brackets around 10 and 20 — let's just use the values 10 and 20 on this range, this interval from 10 to 20 — then I've mathematically put square brackets on either side. These are not computer square brackets, Python square brackets; we're talking about mathematical brackets here.
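The equal-width cut described above might be sketched like this; the column name and the three labels come from the transcript, while the values are invented to span roughly the same 1.2 to 11.1 range:

```python
import pandas as pd

# Invented cholesterol values spanning about the range in the lecture.
df = pd.DataFrame({"cholesterol_before": [1.2, 2.1, 5.5, 8.0, 11.1]})

# Three equal-width bins between the column's min and max; pandas picks
# the cut points itself, we only supply the number of bins and labels.
df["cholesterol_before_level"] = pd.cut(
    df.cholesterol_before, 3, labels=["low", "intermediate", "high"]
)
```

With min 1.2 and max 11.1, the three equal-width bins break at roughly 4.5 and 7.8, so the values above land in low, low, intermediate, high, high.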
With square brackets, the 10 and the 20 are included. If, in mathematics, I put parentheses around the 10 and the 20, it means the interval from 10 to 20 but with the actual values 10 and 20 not included — that's an open interval, and the square-bracket one is a closed interval. Then we get semi-open, or semi-closed, intervals: if I have the interval 10 to 20 with a parenthesis on one side and a square bracket on the other, that means the 10 is not included but the 20 is, and we can swap those around the other way as well. So remember those half-open intervals. Now consider these values that I have here: 11.2, 12.2, 13.2, 15, 16, 16.9 — I think I wanted to put a 16 there; let's correct that one, so there's another 16, there we go. We have those values and I want to create bins: everyone below 13 I want to be low, from 13 up to 16 normal, and from 16 and above high. So a proper interval I could have is 11.2 to 13, where 11.2 is included but 13 is not. The next bin then has to start at 13 with a square bracket, so 13 is included in this new middle group of mine, my normal group, and 16 is not included — as soon as someone is at 16 they fall into the next one, the high one. This is what we do with these semi-open or semi-closed intervals; and on the top side, as long as I put in a very large value that I know isn't even in the data set, I can close it off on the right-hand side — or I could just put infinity there — the point being that someone at 16 will still be included. So we can do that in code. I've gone about creating three bins here. I'm still using pd.cut, and the column I'm interested in is still df.cholesterol_before, but now I put in these bins: bins= and four values for my three bins, as a Python list. I go ridiculously low on the smaller side — I start at zero; I know no one was zero, so just go lower than you think — and then a five. So the first interval is zero to five, my first bin; my second bin is five to ten; and my third bin is ten to twenty. We know the minimum and maximum, 1.2 and 11.1, are well inside of zero and twenty, so just choose outer values that are well outside; it's the middle two edges, five and ten, that we're interested in. Now, what happens to someone with a value of exactly five, and what happens to someone with a value of ten — into which bin do they fall? For that we have the right argument, and I'm setting right=False. That means the right side of each bin is not included: for the bin from zero to five, the five is not included — as soon as a participant has a value of five, they go into the next bin. The same from five to ten: someone at five is included in this second bin, and someone at ten is not — they fall into the next one. Then I put the labels low, normal, and high. If we do that, we can be very precise about who we include in this new categorical variable of ours. So you can take a numerical variable and change it into a categorical variable with specific sample space elements — ours here were low, normal, and high. Now, deleting a column: sometimes we just want to get rid of a column — perhaps it has information we're not interested in, or we're concerned about anonymity of the data, et cetera. For that we use the drop method: df.drop, and it has this columns argument, which we can set to a Python list; inside that list we just list, as strings in quotation marks, all the columns we want to drop. We want to make this change permanent, though, so we say inplace=True. If we do that and look at the columns, the name column — remember, that was one we created — is now gone; it's not in our list anymore.
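The explicit-bins cut with right=False, and the column drop, can be sketched together on invented values chosen to sit exactly on the awkward edges:

```python
import pandas as pd

# Invented values, with 5.0 and 10.0 placed right on the bin edges.
df = pd.DataFrame({"cholesterol_before": [1.2, 5.0, 7.3, 10.0, 11.1]})

# right=False makes each bin half-open on the right: [0, 5), [5, 10), [10, 20).
# So a value of exactly 5 falls into "normal" and exactly 10 into "high".
df["level"] = pd.cut(
    df.cholesterol_before,
    bins=[0, 5, 10, 20],
    right=False,
    labels=["low", "normal", "high"],
)
levels = list(df["level"])

# Dropping a column permanently with inplace=True.
df.drop(columns=["level"], inplace=True)
```

The outer edges 0 and 20 are deliberately well outside the data's 1.2 to 11.1 range; only the middle edges at 5 and 10 carry meaning here.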
Sorting is very easy to do: sort_values is the method, so df.sort_values, and then there's the by argument, and the column I want to sort by is the last name. What that does: it sees that this is an object column, that it's letters, so it sorts alphabetically, and if we look at last name here at the end, we now have Abbott, et cetera. We can also turn that around, into reverse order, by setting the ascending argument to False — doing exactly the same thing, but now we start with the Zs, and you see there, at the corner of the screen on the right-hand side, the last names start with a Z. We can also sort by more than one column, and it will sort in that order. In this one we sort by age and then by sbp — two numerical variables. Let's see what happens: if we go down age, 32, 32, 32 and then 33, 33, 33, 33, 34, so it certainly goes in ascending order; but there are a couple of 32s, so then it goes to sbp, and for all those 32s it puts them in order: 133, 159, 168 — and then the 122 goes with the next age group, the 33s, which will in turn all be in order among themselves. You'll come across this whenever there's more than one value you want to sort by. So you can really sort to your heart's content, and you can also make each key ascending or descending: here I'm passing another argument, ascending, and passing a Python list to it, True and False — the age will be ascending and the sbp descending. So you can really manipulate your data into exactly what you need. If you very quickly want to see the largest or the smallest values, there's nlargest — and down here you'll see nsmallest as well. I'm saying df, go to sbp, show me the 15 largest values in that series, and you'll see the 15 largest. Or you can do it this way as well: df.nlargest, where I pass sbp, the column I'm interested in, as one of the arguments. Those are just the two ways to do it — and of course, done this way, it's much like sorting them. As I said, there's also the nsmallest method. So, in this section we're going to talk about missing values. It's so common to see a data set file, a spreadsheet file, where there's missing data — or where that missingness has been encoded by a value. Let's see how to deal with this missing data. In the numerical Python library there is a value nan: if I say np.nan, that's this NaN, not a number — it's missing. If I create a Python list with an np.nan value in it, there we see the list: one, two, three, NaN, nine. If I now do the sum using np.sum on my list, the answer is not a number, because that missing value could be anything — it's impossible to sum over all of those. So let's import a spreadsheet file that contains some missing data; it's called missing_data.csv. I use the read_csv function inside pandas — pd.read_csv — pass the name of the comma-separated values file as a string, and we can assign it to a computer variable called missing_df. Let's have a look at this missing_df, and there we go: we see an age column, a salary column, and a previous company column, and there are NaNs all over the place. Those cells were empty in the spreadsheet; this is what they look like. So how do we deal with this missing data? Well, number one is to delete it, and what we do here is: any row that contains even a single NaN value — that whole row, that whole observation, the information on that whole subject — is just dropped. It's certainly something you can do, but it's quite aggressive. The way we do that is missing_df — that's our data frame — .dropna, and open-close parentheses.
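The sorting and nlargest calls above might be sketched like this on an invented three-row frame (the tie on age shows how the second sort key breaks it):

```python
import pandas as pd

# Invented miniature frame with a tie on age.
df = pd.DataFrame({
    "last_name": ["Patton", "Abbott", "Zimmer"],
    "age": [32, 32, 33],
    "sbp": [133, 159, 122],
})

# Alphabetical sort on a text column.
by_name = df.sort_values(by="last_name")

# Sort by age ascending, then by sbp descending to break the tie.
by_age_sbp = df.sort_values(by=["age", "sbp"], ascending=[True, False])

# The two largest sbp values, in the two equivalent spellings.
top2_a = df.sbp.nlargest(2)      # on the series
top2_b = df.nlargest(2, "sbp")   # on the frame, naming the column
```

In by_age_sbp the two 32-year-olds appear with sbp 159 then 133, and the 33-year-old comes last, which is exactly the tie-breaking behaviour described in the lecture.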
dropna, the method we use there, takes no arguments, and it's just going to drop all of those rows. I'm going to assign that to a new data frame, complete_data_df. If we do that, then we're going to have a lot of data missing, but at least in the data that remains, everything will be there. Now, because I assigned the result to a new computer variable and did not pass inplace=True, the change is not permanent: the original missing_df still contains all that missing data; I just assigned the result to a new computer variable. Very importantly, the missing_df data frame was not touched, because we did not put inplace=True in there. If we don't do that, the original does not get changed. That would certainly be one way: just drop any row that contains a NaN. What we might want to do instead is only drop a row if a certain column has a NaN value in it, and for that we use the subset argument. So I'm going to say missing_df.dropna with subset equals, and then inside of a Python list object, square brackets, I pass the columns that I'm interested in. Now if we look down the age column, we'll certainly see that there are no missing values; a row was only dropped when age was missing. But we can certainly see there are still NaNs in there, in the previous company column, for instance. So that would be one way to go about it. Now, there is also this isna method. If I say missing_df.age, remember, I get the age column, the age series; if I then use the isna method, it's just going to return a bunch of True and False values. It goes down every row, and if the value is missing it says True, so you see the couple of Trues there, and if it's not missing, not NaN, it returns False. Now, internally in Python, False is represented by the number zero and True is
represented by the number one, and that means we can add True and False values to each other. So if I say missing_df, remember that wasn't changed, because I had assigned the dropped version to another computer variable name, so the original still has those NaN values in it, dot age, dot isna, and then dot sum after that, it says: take my missing_df data frame, go down the age column, produce all the Trues and Falses with isna, and then finally sum over all of those. As I said, False is zero and True is one, so if we sum over all of those ones, the ones that were missing, the result we see is four: there were four missing observations in that variable. So what can we do to replace missing values? That's called data imputation, and it's a very, very big subject. There are certain classifications for missing data, such as missing completely at random, missing at random, and missing not at random. They're not very descriptive terms, but there's a whole science behind how values go missing, and that really is a big topic. What we're going to do here is something very simple, though: we're just going to fill in the missing values, impute them, by various very easy methods. For that we take the missing_df data frame, go to the age column, remember, that gives us a series, and then from there we use the fillna method. It has a keyword argument called method, and we set that to 'ffill', f-f-i-l-l, forward fill. All that's going to do is look at the value just before each missing one and simply replicate that number. So if we do that, it forward fills all of those: you see there, where a value was missing, it just took the 46 in front of it and replicated it. There's also a backward fill, 'bfill', which uses the value just after instead. More commonly, though, we'll just use the median.
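The dropping, counting and forward-filling steps above can be sketched like this, again with a small stand-in for the lecture's missing_data.csv. Note that newer pandas versions prefer the .ffill() method over the fillna(method='ffill') spelling used in the lecture; the result is the same:

```python
import numpy as np
import pandas as pd

# A small stand-in for the lecture's missing_data.csv
missing_df = pd.DataFrame({
    "Age": [25.0, np.nan, 31.0, np.nan, 47.0],
    "Previous_company": ["Acme", "Beta", None, "Delta", "Echo"],
})

# Drop rows only where the Age column specifically is missing
age_complete_df = missing_df.dropna(subset=["Age"])
print(age_complete_df)  # a row missing only Previous_company survives

# isna() gives True/False per row; True counts as 1, so .sum() counts the NaNs
n_missing = missing_df["Age"].isna().sum()
print(n_missing)  # 2

# Forward fill: each NaN takes the value just before it
# (.ffill() is the modern spelling of fillna(method="ffill"))
filled = missing_df["Age"].ffill()
print(filled)
```

The subset argument makes the deletion far less aggressive: only the rows where the named column itself is missing get dropped.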
We look down that variable, calculate its median, and then all the missing values are just replaced by that median. That works to a great extent, inasmuch as some of the non-parametric statistical tests that we can do work on the median, so it's really not going to make a dramatic change to your data, although it's not always the best idea. What we do here is missing_df.age, which gets us the age column, the age series, dot median, and that calculates for us what that median is: it's 57. Now, with fillna, instead of the method argument we just put the value that we want all the missing values to become. If I put 100 there, it will put 100 in all the missing values, but here we're saying: please put the median in all of those missing values. If we run that, we see the median was 57, and there we see that number three, which was missing, is now 57; all four values that were missing are now 57. Remember, if you want to make these changes permanent, you've got to add the inplace=True argument as well. What about default missing values? There's another spreadsheet file, default_missing_data.csv. I'm going to use the read_csv function from pandas to import it, and let's print it to the screen. If we look down the age column, there are a couple of 999s: when someone captured this data and the age of a participant was not known, they just entered 999. That causes a problem if we want to do any kind of calculations on that column. And if we look down the salary column, every time the salary wasn't known, someone actually typed the word nil. This is a big problem, because this is no longer tidy data: if you look down the salary variable, we see integer values and we see strings. That word nil, n-i-l, is a word someone typed in; it's not a blank, there wasn't an empty cell in the original spreadsheet, someone physically typed it in.
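Going back to the median-imputation step above, here is a sketch with stand-in ages (chosen so that, like the lecture's file, the median comes out at 57):

```python
import numpy as np
import pandas as pd

# Stand-in data; the ages are chosen so the median is 57, as in the lecture
missing_df = pd.DataFrame({"Age": [25.0, np.nan, 60.0, np.nan, 57.0]})

# Calculate the median of the non-missing ages, then impute with it
median_age = missing_df["Age"].median()
print(median_age)  # 57.0 (the median of 25, 60 and 57)

missing_df["Age"] = missing_df["Age"].fillna(median_age)
print(missing_df["Age"])  # both NaNs are now 57.0
```

Assigning the filled series back to the column, as here, is an alternative to inplace=True that makes the change permanent.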
A string mixed in like that is effectively part of a categorical variable, and you can't mix different data types in a column when you're working with data, so that's certainly a big problem. The same goes for previous company: when it wasn't known, someone physically typed the word missing into the spreadsheet. So again you have categorical strings mixed in with what look like numerical values. And remember, those are not really numerical variables anyway; they're just encodings, one means something, two means something, three, four and five each mean something, so it's a categorical variable represented by numbers, and then we have these strings in between. That's not going to work well, but that's what was decided when this data was captured: to put something in when the data was missing. So we want to replace these values, and we can do that right when we import the data. If we were the designers of that data set, that spreadsheet file or that database, and we knew we had done this, or we received this data and someone told us this is how it was done, we can add an argument when we use the read_csv function: na_values, set to a Python list. Now pandas is going to look for those values, 999, nil and missing, 999 being a number, nil and missing being strings, when it imports the file, and we can assign the result to a new computer variable. Look what it's done: when it found a 999, it put an NaN value there; when it found nil, it put an NaN; when it found missing, it put an NaN value there. So we can encode that right from the get-go. And remember, there's always map and replace as well: with the replace method we can pass a little dictionary of key-value pairs, the keys and values being old value and new value, and you can certainly put np.nan, if you've imported NumPy as np, as the replacement for any one of those values. So that would be a different way for you to go about it.
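Both routes above, na_values at import time and replace afterwards, can be sketched like this. The CSV text here is an inline stand-in for the lecture's default_missing_data.csv:

```python
import io
import numpy as np
import pandas as pd

# A stand-in for default_missing_data.csv, using 999, nil and missing as codes
csv_text = (
    "Age,Salary,Previous_company\n"
    "34,nil,Acme\n"
    "999,52000,missing\n"
)

# na_values tells pandas which entries to turn into NaN at import time
df = pd.read_csv(io.StringIO(csv_text), na_values=[999, "nil", "missing"])
print(df)  # the coded entries all appear as NaN

# Alternatively, replace the codes after import with an old-value: new-value dict
df2 = pd.read_csv(io.StringIO(csv_text))
df2 = df2.replace({999: np.nan, "nil": np.nan, "missing": np.nan})
print(df2)
```

A nice side effect of the na_values route: once nil is recognised as missing, the salary column imports as a proper numeric column instead of strings.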
Working with times and dates is probably one of the most difficult things to do, whether you use Microsoft Excel, Google Sheets or a computer language such as Python. It's very difficult to work with dates and times, and it really depends on how that data was entered in the database or spreadsheet. So what we're going to do here is import a new data set, data_dates_times.csv, again using the read_csv function from pandas, and I'm going to save it in the computer variable dt. Let's look at this data set: we see an ID column, a batch column, a specimen date, so this looks like it came from a lab, perhaps organisms or some other microbiology laboratory test, a test date and a test time. To us as humans this looks very good. We see 2025, this is a simulated data set, so we're into the future: 2025/04/21, test date 2025/04/26, and the times recorded as 12:23, a bit of military time there, 13:45, and so on. So it looks like when someone entered this data, it was done properly in the spreadsheet file. Please just remember, if you are the designer of the spreadsheet, using something like Microsoft Excel or Google Sheets, don't set the formatting; don't click at the top and declare that a column is a date-time and try to save it as that. Just type the text as it is and leave it as a normal cell, just a little heads-up there. But let's look at what pandas thought about all of this. We're going to use the dtypes attribute: the batch was a 64-bit integer, which is quite correct, but the specimen date is an object, the test date is an object and the test time is an object, and objects are categorical variables, strings. pandas is seeing those values as strings, specifically, I suppose, because of that forward slash symbol and that colon symbol, so pandas is seeing these not as dates and times but just as normal strings. We've got to change that.
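A quick sketch of that diagnosis step, with an inline stand-in for data_dates_times.csv (the column names are assumptions based on the lecture's description):

```python
import io
import pandas as pd

# A stand-in for data_dates_times.csv: dates and times typed as plain text
csv_text = (
    "ID,Batch,Specimen_date,Test_date,Test_time\n"
    "1,7,2025/04/21,2025/04/26,12:23\n"
    "2,7,2025/04/22,2025/04/27,13:45\n"
)
dt = pd.read_csv(io.StringIO(csv_text))

# The slashes and colons make pandas treat these columns as plain strings
print(dt.dtypes)  # Batch is int64, but the date and time columns are object
```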
Here, first of all, let me just remind you that we can concatenate two columns. So I'm going to create a new column, and remember how to create a new column: I take the data frame dt, and then inside of square brackets and quotation marks I give my new column a name, and then I assign to it the following. Seeing that test date and test time are being read as objects, as strings, I might as well concatenate them. So I say dt.Test_date, plus a space, and remember that space is a string, so it goes inside of quotation marks, plus the test time string, which concatenates them all together. Now when we look, there's this new DateTime column, and it has taken test date and test time and concatenated them with a little space in between. If we look at the dtypes, though, it's most definitely still going to be a string, an object: if I do dt.DateTime, my new column, that returns a series for me, and if I call the dtype attribute on that, I see its dtype is O, O for object. So now we've got to change this so that it's recognised as dates and times, and fortunately for us there is a pandas function to do just that. I'm going to create a brand new column, dt, square brackets, 'datetime', but you'll notice this time I'm using a lowercase d and t, so it's seen as a different, new column. To that I assign the following: I call the to_datetime function from pandas, pd.to_datetime, and it takes two arguments. My first argument is the specific column that I'm interested in, dt.DateTime, with the uppercase D and T, the one I've just created by concatenating the strings, and then I have a format argument. Now let's just look at what we have here: we have 2025, a forward slash, 05, a forward slash, 02, so it's the full year, 2025, not just 25, and then the month as a number, 05,
and then the day, 02, a space, and then hours, a colon, and minutes, and that's exactly how we build our format string. Have a look at this: inside of quotation marks, because it's a string, we say percent uppercase Y. The uppercase Y refers to the fact that we have 2025 and not just 25; a lowercase y would mean just 25, but the format there is 2025, so it's uppercase Y. Then a forward slash, because that's the forward slash we have there, then the month, which is percent m, a lowercase m, then another forward slash, and then percent d. Then a space, because I'm going exactly by what the string looks like, then percent uppercase H, a colon, and percent uppercase M. We use these uppercases because the hours go into 13, 14, 15, and so on. So I write this format string exactly as my text appears, assign it to the format argument, and assign the result to a brand new column. If I now look at that new column, it looks a bit different: it now says 2025, a dash, 04, a dash, 26, a space, 12:23:00. And if we look at the dtypes now, this new column that we've just made, the datetime with the lowercase letters, is a datetime object, and now we can eventually work with it. So I really urge this workflow: when you fill in that spreadsheet file, just do it as normal text; don't change the cell type or column type by formatting it in any way, because that creates a lot of pain. Now this datetime column, or the datetime series inside of this pandas data frame, is very easy to work with, so I'm going to create yet another new column in my data frame. I'll say dt, and this time I'll call my new column Month, and I assign to it the following: dt, my original data frame, dot datetime, so I'm looking at that new column of mine, which is now a datetime64 data type, and I'm going to call the following method on
it: dot dt dot month_name, so .dt.month_name(), and then I add a .str.slice(stop=3) after that. What is this going to do? Let's take the first part first: dt.datetime, where that dt refers to the computer variable I gave to this data frame, the original data frame as we imported the spreadsheet file. It says take that data frame, go down the datetime column, so this is now a series, and on that I call, and note that this second dt has nothing to do with the data frame's name, it's the accessor for datetime objects, .dt.month_name. If we look at one of the values, it says 05, and of course that's May. Then all I want to do is stop after the first three letters, so let me show you what that gives. If we call up dt and look right at the end, we see what it's done: there's Apr, Apr, Apr for April, and indeed the date says 2025-04, the fourth month, which is April. If I hadn't put in that last little bit, the .str.slice, it would have given the whole word April, because .dt.month_name extracts the full name of the month: the month was given as a number, the fourth month, and because it's a datetime object it knows that refers to April. That's very useful, because sometimes we do just want to create some plots and graphs, or do some analysis, based on how many times a month appeared in our data set. Now I can also do something like this: dt, my data frame, dot datetime, that column or series, and then .dt.year, and you can well imagine I can assign that to another new column, such as dt, square brackets, 'Year', and it just extracts the year for me; or .dt.hour, and it extracts just the hour for me.
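Pulling the whole date-and-time workflow above together, here is a sketch with inline stand-in data; the column names are assumptions based on the lecture's description of the file:

```python
import io
import pandas as pd

# A stand-in for the lecture's data_dates_times.csv
csv_text = (
    "ID,Test_date,Test_time\n"
    "1,2025/04/26,12:23\n"
    "2,2025/05/02,13:45\n"
)
dt = pd.read_csv(io.StringIO(csv_text))

# Concatenate the two string columns with a space in between
dt["DateTime"] = dt["Test_date"] + " " + dt["Test_time"]

# Parse the combined string: %Y/%m/%d matches 2025/04/26, %H:%M matches 12:23
dt["datetime"] = pd.to_datetime(dt["DateTime"], format="%Y/%m/%d %H:%M")

# Once parsed, the .dt accessor unlocks the parts of each date and time
dt["Month"] = dt["datetime"].dt.month_name().str.slice(stop=3)
dt["Year"] = dt["datetime"].dt.year
dt["Hour"] = dt["datetime"].dt.hour
print(dt[["Month", "Year", "Hour"]])
```

The .str.slice(stop=3) at the end is what trims April down to Apr, exactly as described above.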
And again, in data science, it might be very useful to know something about that specimen and the hour in which something happened to it. So once you have a datetime object properly set up, it actually becomes a pleasure to work with, and there's a lot of information hidden in dates and times. So that was a quick introduction to the pandas package, and we're going to use it for almost all of the rest of this course: importing data using pandas, because it's so easy and so powerful to manipulate your data with pandas. Easy when you're used to it, I'll add that caveat, inasmuch as you do have to gain some experience. I will say this about doing this course, and I've probably said it a couple of times and will say it a couple more: this is like learning a language. You don't learn a language in just a week, and you don't learn a language by not speaking it; you have to speak it as much as you can to pick it up properly. So this was a good introduction, and a notebook for you to refer back to every time you can't quite remember what to do. And as always, remember that on the pandas website itself there's an enormous amount of documentation. pandas is a huge package, and it just expands almost by the day as new functionality gets added.