As the sun and the noise come in through the side window, let's talk about what we're going to discuss in this video. I'm going to show you the library called Pandas. P-A-N-D-A-S, pandas, as in the animal. It's one of the success stories of Python, or at least one of the libraries that has helped Python become so successful, and it's all about managing your data. Your data is going to be in a spreadsheet, and we're going to bring that spreadsheet in and use the Pandas library to manipulate that data. When you analyze data, you only want certain parts of that data, for instance when you want to use that data inside of a Student's t-test or a chi-square test. And that brings me to another thing that I'm going to start this video off with: the different data types. All the data that we collect are of a certain type, and that's very important, because that type is going to determine what statistical test we can eventually use. So this is going to be quite a long video; there are quite a few things I want to show you. Pandas is a massive library. We're only going to scratch the surface, but we're going to scratch the surface of the stuff that we are going to use when we analyze our data and when we do our statistical tests. And here's our notebook. We're going to talk about manipulating data. So you see I've given it a nice title there, underneath the clinical research one, all in capitals. Both of these, by the way, as you can see, sit under a single hashtag, so they both get the title size. Just to distinguish them, I put the first one all in capitals, uppercase letters, and the second one not. You see my name there? I just wanted to show you how to do that. If you start a line with this little greater-than symbol, you're going to get this nice little gray bar, and as I said, we'll get to how we import all these little images a bit later.
So the Pandas library. We're actually going to work with data. I think there are very few people who have not seen a spreadsheet before. I'm going to show you a few things about spreadsheets, but the data is going to be in a spreadsheet, and we're going to bring it into Python so that we can start manipulating and working with the data. We want to answer questions from that data. I always teach that the data hides a story; it hides information, and it is our duty to bring that story out of the data. We've got to pull it out of the data so that we can use it for inferential statistics and infer those results to a larger population, and we'll certainly get to all of those things. Now, one of the libraries that has made Python so successful is called Pandas, and it's P-A-N-D-A-S, just as the animal. It is one of the most successful libraries inside of Python, and as I said, it's really one of the success stories, or at least it contributed greatly to the success of Python as a language for data science and for statistics. So we're going to use this library to extract all of our data. Let's just go back one step here. This is my Google Drive; you can see there, My Drive. You can see I've created a folder in there called Coursera, and I've created another folder inside of Coursera, which is understanding clinical research, because most of these videos are really aimed at people taking my Coursera course, although, as I mentioned, it's open for everyone. And inside of here, I've created two other folders as well, and that you can do just by clicking on New and then hitting Folder. If I right-click on them, I can even change the color, give them a bit of a color just to distinguish them, which is always nice. And so I've created this data folder. Let's just open the data folder, and you see there's a bunch of stuff in here, mostly CSV files. CSV, CSV, CSV.
And a CSV file, well, that CSV stands for comma-separated values. So it's basically a text file, a very old-fashioned TXT file, and the data point values that were collected for every patient or every subject in your research, under every column, and I'm going to talk about that in this video, are simply separated by commas. So it can be interpreted nicely, just as a spreadsheet would be. Of course, if you use something like Microsoft Excel, it's going to save it in an XLSX format, and certainly you can use Python and Pandas to import those, but it's much more common to have files in CSV format. And the good thing about CSV files is that they strip all the unnecessary bits of a Microsoft Excel file. In a Microsoft Excel column, you can format that column so that it looks different, so that the values are displayed differently. That's no good for data analysis; that's no good for statistical analysis. We want to strip away all of those fancy added things that make the spreadsheet look a little bit better. And I'm going to show you a spreadsheet that might look good when you open it, but it's actually horrible for data analysis. You want to strip all those things away. So when you are in Microsoft Excel or LibreOffice or whatever software you use as your spreadsheet software, when you save that file, save it as a CSV, comma-separated values, file. It makes it much easier, at least for us, to work with that data. And you'll find, as you start analyzing data, as you learn how to do your own statistical analysis, calculate your own confidence intervals, calculate your own p-values, that a lot of the work is just about cleaning up the data and getting the data in the right form in the spreadsheet before you bring it into Python. That's certainly part of the work.
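As a minimal sketch of that idea (the column names and values here are made up, not the course's actual data file), a CSV really is just text with commas, and pandas turns it straight into a table. Here io.StringIO stands in for a file on disk; with a real file you would pass its path to pd.read_csv:

```python
import io
import pandas as pd

# A tiny CSV: the first row holds the variable names,
# every following row is one subject, values separated by commas
csv_text = """Name,Age,Group
Ann,34,Active
Ben,51,Control
Cay,47,Active
"""

# pandas reads the text exactly as it would read a file on disk
df = pd.read_csv(io.StringIO(csv_text))
print(df.shape)          # (3, 3): three subjects, three variables
print(list(df.columns))  # ['Name', 'Age', 'Group']
```

Notice there is nothing but names and values in the file, which is precisely what makes it so easy to analyze.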
So I've got one file here, problematic, and I've just created it as a Google Sheet instead of an Excel file or something like that. And if we open it, it looks like this. At first glance, it looks beautiful. Look at the lovely colors. I've got these totals for the columns down here, and I say on the left-hand side that that's a line for the average, and that's a line for the SD. I've got a little pivot table going on here, and I've got a nice plot here, a scatter plot. Everything is nicely colored, and I see the dollar symbols there and percentages there. It looks lovely. You cannot analyze this data. This data, as it's captured here, is not for statistical analysis at all. Well, you can do a little bit of statistics inside of Google Sheets or Microsoft Excel; yes, you can do quite a bit, but it's not nearly as powerful as using a language. So no, this is not good. This is not how you save your data; this is not how you analyze data. What we are after is sometimes called tidy data. In R, there's this whole thing about tidy data, and there's actually the tidyverse, a bunch of libraries that conform to the principle of the use of tidy data. So let's look at what a spreadsheet should look like. It looks bland, but this is exactly what we want. Look at this: there are no colors; there are all your columns, all your rows, and it's just very neat. And that's what we want. So what makes it neat? Let's have a look at it. First of all, let's have a look at these columns. I have a name column there in my first row. So my first row is going to be used for my statistical variable names. There I have DOB for date of birth and age for age. So what is a statistical variable? I've spoken about computer variables, but a statistical variable is a collection of data values that are all self-contained and all follow the same format.
Age is just a bunch of whole numbers, integers. So down that column, I have this variable called age. Sometimes you'll come across something called a random variable. A random variable would refer to any one of these values; the data point values can take on a random value. If I just walk down the street and grab some people and ask their age, that's just at random. Of course, we can be biased in our data collection, because if I walked around an old age home, there's going to be a bias in that random data; it's not going to be as random as you might think. If you walk around a school, it's likewise going to have a lot of bias in it as far as age is concerned, but a random variable means a random value in that variable. I like to use the term data value, or data point, or data point value. That for me is more descriptive of the fact that I'm going to collect one data point value for that variable, for that specific patient or whoever's in my study. So on that first line, I have all my variable names. Quick note: if you look at these, let's look for instance at this one, CholesterolBefore. And that looks odd, because I used an uppercase C and an uppercase B, and I left no space there. Forget about fancy typing. Don't put spaces in those. Either put an underscore, as we do with computer variables, or bunch them together like camel case here. I can still read what these variables mean, but it's so much easier to analyze if I don't have illegal characters like spaces and all sorts of funny symbols. Don't put them in. Look at this one, sBP, systolic blood pressure. I just put in a lowercase s and an uppercase BP. That's fantastic. I can still see I mean systolic blood pressure there, and I haven't written it out, and I haven't put spaces in between the words. Forget that; keep it nice and simple like this. Another thing about a statistical variable: if I look at age there, it's a bunch of numbers.
If I look at the vocation here, this first patient is an energy manager; that's a tax advisor. That's definitely not a number. Then I come to smoke, and zero, I think, was for a patient who doesn't smoke, one was they do smoke, and I think two was an ex-smoker. Whether you smoke or not, that's not really a number, is it? If you smoke, you smoke; if you don't smoke, you don't smoke. That's not a number. I can't do arithmetic with that. I can't take three smokers and divide by four smokers. Well, yes, I can count them, but I can't do sums with the words smoker and ex-smoker. Those are not numbers, but I can capture them as numbers; these are just placeholders for what they actually mean. So that variable smoke right there is a categorical variable, as we call them; it's not a numerical variable. Heart rate would be a numerical variable, because those are real numbers. If someone has a heart rate of 100 and someone has a heart rate of 50, the patient with a heart rate of 100 has twice the number of beats per minute as the person who has a 50 beats per minute heart rate. The same certainly goes for diastolic blood pressure: someone with a diastolic blood pressure of 50 and someone with a diastolic blood pressure of 100, that's twice as much or half as much, depending on how you look at it. Cholesterol there. Lipids there. Survey, now, what is that? Do you think that's a real number? In this instance, people were just given a survey and they could say whether they agreed with something or not, and you see that typical Likert scale with an odd number of choices: totally disagree, disagree, neither agree nor disagree, agree, strongly agree. There are five, with a number jotted down for each of those choices. Again, that's not a number. And even if that was a pain scale, where the patient could rate how much pain they have from one to five, that's not a number, because someone who rated their pain as a four does not have twice as much pain as someone who chose two.
You can't define that, and you can also not say that the difference between someone who chose two and three versus the difference between someone who chose three and four is the same difference. How do you quantify that? The point is you can't quantify it. These are not numbers. Also, you can't have a pain value of 4.5. So if the average of all your pain values came out to 3.475, what does that mean? And you can't just round that off and say, well, now it makes sense. That's nonsense. So that is not a number. Let's look at the cholesterol after, the cholesterol difference, and then the group the patient was in, active or control group. Once again, those are not numbers. So what you've learned here is that there are two large types of data. We call them numerical data and categorical data. Write that down somewhere. Numerical data is obviously numbers, numbers that we can do some arithmetic with, like age. I can calculate the average age, and someone who's 40 has been on this earth twice as long as someone who's 20. Absolutely. So those are real numbers, numerical variables. Now, there's a nitty-gritty there. There are basically two types of numerical variables: you get the interval type and the ratio type. Interval means there's no true zero. If we think of temperature, for instance, zero degrees Fahrenheit or zero degrees Celsius is not the absence of temperature. Hence, you cannot say that if it's 40 outside it's twice as warm as when it's 20 outside; I'm talking degrees Celsius here. There is an absolute zero on the Kelvin scale; that is the absence of temperature. But zero degrees Fahrenheit or zero degrees Celsius is not the absence of temperature; there's still temperature there. We don't use the Kelvin scale too often, though, and interval data doesn't really impact what we're going to do. The ratio type is, of course, what we deal with, and there zero means an absence.
So if your systolic blood pressure is zero, never mind that you're not alive anymore, it really is zero; there's an absence of blood pressure. And then we get the categorical variables, and there are two main types of categorical variables too: nominal and ordinal. Now, ordinal almost sounds like order, and indeed there's some natural order to the possible values in that variable. So if we look at the survey that someone filled in, maybe it asked how happy you were with taking this drug, or some question like that, and you could rate it one to five. Clearly, if five means you are more satisfied than one, there's some order to this level of satisfaction, or order to the pain scale. That's an ordinal scale. It's usually easy to express those as placeholder numbers. But then we get to, say, the vocation of the patients and participants in this study, and there's no natural order to that. How do you order it? Well, I suppose you can do an alphabetical order, but you can just use a synonym for someone's job and then it's out of order. So things that you can't naturally order, those would be nominal categorical variables. So please write that down. We have numerical and we have categorical. And it's very important to know which is which, because that is going to decide what statistical test you're going to do. If you see a Student's t-test, you'd better be working with continuous numerical data. You can't do a Student's t-test on nominal categorical data; it just doesn't work that way. It's impossible. So you've got to know what kind of data you're dealing with. That brings us back to the idea of tidy data, because if I look at each of my variables, the values in there are all of the same data type. I haven't put here that this patient is 43 and written that in words; I'll get to that, that's a problem as well. But I didn't put something in there that meant something else. I didn't put the year of birth there.
I could then calculate how old they are, but it is the same kind of data that's in there. It's clean, it's neat, and that's all you're going to find in that column. So you've got to be able to break down your variables into something that is that specific. Now, a specific variable like this has what we call a sample space, so write down that word as well: a sample space. That's all the possible values, random values, that that variable can take. So if you ran a study and you only included patients from the age of 18 to 80, that's the sample space of that variable, all the possible random values that can occur. If you went out and you accrued 100 patients to your study, or 2,000 patients to your study, those are the only values from which the ages can be taken. If we look at the group that the patients fall in, active and control group, so a placebo group and an active drug group, the sample space of the group variable only has two elements in it, only two possible values. That's the sample space from which a random variable can be drawn.
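A quick sketch of this sample-space idea in pandas (the group labels here are made up for illustration): the unique values observed in a column are the elements of its sample space as captured in the data.

```python
import pandas as pd

# Hypothetical group assignments collected for six participants
group = pd.Series(["Active", "Control", "Active", "Active", "Control", "Active"])

# .unique() lists the sample space actually observed in the data,
# and .nunique() counts how many elements it has
print(group.unique())   # ['Active' 'Control']
print(group.nunique())  # 2
```

Two elements, exactly as expected for an active/control variable; any extra element showing up here would be a warning sign about the data.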
So the survey, for instance, has a sample space of five elements. And that brings me to this idea of continuous numerical. When I said Student's t-test, I said continuous numerical, because age, even though we're capturing it there as a whole number, someone can be 43 years old, and then so many months, 2 weeks, 6 days, 5 hours, 2 minutes, 3 seconds, as fine-grained as you like to go. There's a continuum there. It's very difficult to make little blocks of that; you can, but the numbers themselves are continuous numerical. Someone's blood pressure, for instance: yes, the millimeters of mercury we always express as an integer, a whole number, but in reality, if our apparatus could measure more accurately, someone could have a systolic blood pressure of 145.745689. It just depends on what accuracy we can measure at, so that's a continuous numerical variable. Look at the cholesterol there as well; we would see those as continuous numerical. As opposed to that, we get discrete numerical variables, and that becomes a bit of a gray area sometimes. Say, for instance, we want to include in our analysis how many units of blood someone received. Usually people are going to receive these as discrete units of blood, so much blood, but I suppose you can break this down into volumes, milliliters, and now we hop over to a continuous numerical variable. So when we talk about discrete variables, sometimes there's a bit of a gray area, and I suppose someone can talk of the survey numbers as discrete numbers, they can only take discrete values, but be careful: those are still not numbers; in this instance, that still remains a categorical variable. So look at this data and keep these things in mind. I hope you've written them down so you can look at them: this idea of numerical variables and categorical variables, and this idea of a variable that has a sample space, where the values for everyone in there can only be taken from that sample
space. It brings me to the fact that, remember, I said the first row holds all your variable names, and then every patient or participant has their own row. That's another part of tidy data: everything about that participant in the study falls in that one row, and we're going to collect those random variables, or data point values, for each variable, and for each of them they are drawn from some sample space. You're starting to think like a statistician here; we're building this understanding of what this is all about, so that's fantastic. There you see, this is all tidy. And now it brings me to one of the last points here: sometimes people are bad at typing, and all fingers are pointing at me, because I'm bad at typing. So I'm going to double-click, say for instance, on this active, and I might type active here and put a little space behind that active. If I click off of that, you would never see that space, but believe you me, Python can see there's a space, and it's going to see that as something completely different from all the other actives. That patient is going to be counted totally separately. Now there are three elements in this sample space: there's active, and there's active with a white space behind it. It doesn't work like that. So sometimes you have these hidden things. Likewise, one of the actives might be written with a lowercase a, and now suddenly, again, I have three elements in the sample space. You've got to be very careful, when this data is collected and when you clean up this data, that you get rid of all these sorts of mistakes. Dates are horrible: sometimes people put years first, months first, days first, all sorts of things, forward slashes, backslashes, hyphens there for the date. It has to follow the same pattern. That's why I said initially that a lot of your work goes into cleaning up data before you analyze the data, and hence, when you are part of a project, see to it that that data is collected as cleanly as possible. Put your effort in there; you will be rewarded, because that's not what we
want to concentrate on. We want this information, the story that is hidden in the data; we want to bring that out. And there's a story here, because if I just look down all these values, I don't know what the story is. I have to do statistical analysis on all these values for the story to come out, and one of the first ways that we're going to do that is to start with summary statistics, and we're going to graph, plot this, so we can turn these numbers into something that we as humans can understand. So I hope my few words have taught you something about data. Please, this is not acceptable; there's no way we analyze data like that. This is acceptable, tidy data. Remember the term tidy data, and make sure that your data is tidy like this. Now, these files, I'm just going to close these two down, they do reside inside of this data folder. So this is the data.csv file; we're going to import some of these. And I just want to show you at the top, remember, they are on my Google Drive. Now, this is mock data, to be sure. Putting actual patient data on Google like this, even though I've got a paid account, so I pay Google every month, still raises some security issues, and you have to know what is allowed in your country, your region, when it comes to putting patient data up like this. This is mock data, absolutely made-up data, so it's easy enough to put it here. And as I said, later in the course I'll show you how to install Python on your own system if you want to work securely, and you do have to work securely; you can run all of this on your own machine. But to make it easy for us, we're running all of this here, because we're not there yet, to go through the hassle of installing Python on our own, although that has become very easy as well. But notice I'm in My Drive, the Coursera folder, the understanding clinical research subfolder, and the data subfolder. Look at this, that's horrible: I put spaces in between those words. That wasn't a good idea, but My Drive, Google puts a space there anyway, so I can't do anything about that one, so
we'll see how to get around those things. But that's where I stored all my data. And if we go back one, by the way, I've also got this image folder; that's where these images that you keep seeing appear from, and that's why I've hidden them away. There's a way, which I'm going to show you, to bring those images in. I just use them for decorative purposes, although I suppose you can use them for explanation, and if you create a notebook to share your research findings with others, you might also want to use images. So there we go. I've got a little template notebook here, because I've done all those things, and I just copy this template and start working; we'll get to those things. So what we are doing right now: we're in this week one data folder, that is where we are, and I'm going to show you this manipulating data with pandas. So now that you understand the way that you have to look at your spreadsheet files, let's start importing them. On the left-hand side, you see there are quite a few things that we have to get through, quite a few ways that we can use pandas to manipulate our data. I'm going to show you where the pandas website is; you can just look it up, just type pandas and python, and you're going to get to their website. It is an enormous library; you could make multiple courses just on pandas, but I'm going to show you the most useful things, the things that you would need when you get started. So there we go. As you can see, I've also neatened up things nicely: these are all my level two headers, with the two hashtags, and I've toggled them all closed, so we can see there are two hidden cells there, four hidden cells there, 20 hidden cells there, et cetera. First of all, though, it's been a while since I've been at this computer. Hopefully, by the time you watch this, the world pandemic has calmed down. So it's back and forth for me after work, so that I can record a little bit and keep on contributing to educational resources; those are getting left behind, and I really want to keep
contributing to that. Anyway, so I have got to reconnect, and I'm going to reconnect to the Google service. If you've been away from your computer a while, it'll kick you out, and for good reason: we don't want to overload this fantastic free resource that we do have; we don't want to overload Google. If you start running code, it will reconnect automatically, but you can also just hit that button. Now I'm connected. So let's see, pandas, as I said, is a library for ingesting your data and analyzing your data, so let's import these libraries. I'm going to import pandas, and remember, pandas is going to have a lot of new functions that are not inside of plain Python, so if I have to use one of those, I have to type pandas dot and then that function. And as is the norm in Python, we use a namespace abbreviation, and most people just use pd for pandas. You don't have to, but as it's the norm, let's just stick with that. So I'm saying import pandas, you've seen that before, we're just importing a library, but I'm going to import it as something. And look at that, it's always like the English language: import pandas as pd. It reads like normal English, and that's what makes Python so beautiful. So now I don't have to write pandas dot for that function, although there are people who really do that; they don't like this laziness of these abbreviations, but anyway, that's a different story. We're also going to import numpy, short for numerical Python. Some people pronounce it num-pee; I don't know, that just rubs me up the wrong way, but anyway, also a story for a different day. Import numpy, numerical Python, as np; again, that's just the standard abbreviation, you can use whatever you want, but that's what we're going to go for here. And then, from google dot colab: google is a library inside of Python here, and it has a submodule inside of it, a sub-library, called colab. So from google dot colab, I'm importing drive; that drive module we're going to import. Now, that is specific to using google
colab, and my data is inside of my Google Drive. If you have your own system, running off of your own hardware, you're not going to have to do this; that's specific to using Google Colab. So this last one is specific as well: you see this little percentage symbol there, percent load underscore ext. That is very specific here to Google Colab; if it starts with a percentage like that, it's called a magic command, and it is just going to do something to the rendering engine of your web browser. So this load ext of google dot colab dot data underscore table is just going to format the way tables are presented by your web browser, and remember, we're working in the web browser here, so that's just something that Google designed. If you use your own system, you're not going to do this; I'm just doing this so that things look nice on our screen. So with those things imported, let's connect to our Google Drive. And you might think that I'm already connected to my Google Drive, I'm inside of my Google Drive, and I'm using one of the things that Google provides, I'm using Colab here. Yes, but you still have to connect securely to your own Google Drive, and this is the way that Google does it. Again, if you're running Python on your own system, you won't have to go through this. So remember, we imported drive; drive is actually another little submodule, and it has a function called mount, so let's be more specific: I'm going to say drive dot mount, because look there, I've already imported that module, drive. So now I can say drive dot mount, and it's gdrive; this is specific to Google, don't ask questions, this is the way it's done. So inside of the little single quotation marks, see, forward slash gdrive. Even if you're on Windows: Google's running on a Unix-like system, that's Linux, the same family as macOS, and everything is a forward slash. It's us poor Windows people who have to deal with backslashes. This is a forward slash
, or at least it's supposed to be; it depends on how you see these things. Anyway, gdrive. Then there's a second argument here called force underscore remount equals True. I always put that in, in case I had to walk away, there was some emergency, and I didn't shut off properly, so I just force the remounting of my connection to my own drive. You don't have to put that in if you're just running this the first time. Now something's going to happen: it's going to say click on this URL in your browser so that you can connect to your account. Now, I'm going to blur some of the stuff out that comes up, because another browser window is going to open with a bit of detail, but let's just look at what is going to happen. It's going to let you choose between some of your accounts, and at the moment, this is the account I'm using to do all of this. It's going to say you are now allowing Google Colab to have access to all of these things; well, you can't go ahead if you don't, so you might as well. It's done securely, not to worry. So now we get to this page, the sign-in page, and this is what I'm going to blur out. Well, this code does disappear after a while, so it's not like anyone can make use of it, but nonetheless. I'm just going to click here on this copy button, or I can highlight all of that text and copy it, but I'm just going to click on that, and now it's copied. I can close this, and I'm going to come back to this enter-your-code box, paste this plain text, and just hit Enter or Return. So now, internally, I'm signing in and giving all those permissions to this Colab notebook, to work with stuff in my Google Drive. Right, now, on your normal computer, if files were on your normal hard drive, you would just navigate to them using Windows Explorer, or the other operating systems' version of that. Here we navigate slightly differently, by using another magic command. Now, cd stands for change directory, because we want to change to this directory. Just like on your hard drive, where your files are in some folder or some directory, we've got to do the
same here, because the data.csv file that we want to import, our spreadsheet file, is in this folder; remember, I showed you before. The way that we get away from those illegal characters, like spaces, is just to make it a string. So I'm going to say percent cd, and then inside of these quotation marks, which make this a string, I'm just going to write this whole address that points to the folder that contains my CSV files. And that's forward slash gdrive, forward slash My Drive, remember, we looked at all of these things up here, that's exactly what we're typing, Coursera, understanding clinical research, and now I can put the spaces in, no problem, forward slash data. And it's exactly as I wrote it, all with the same uppercase letters, so it's got to be typed like that. So now I'm just changing directory to that directory, because that is where everything lives, and the file from there that we're going to import is data.csv. And this is how the import happens: I'm going to create a computer variable called df, for data frame. Remember, I said that within the limits we spoke about before, you can name your computer variables whatever you like, but just about everyone uses df for a data frame. So I'm going to say df equals, and now, because we imported pandas as pd, I can just say pd dot, and one of the functions inside is read underscore csv. There's also a read underscore excel function for Excel files, but read underscore csv it is. It's a function, so you see the parentheses there, and then I pass, as a string, so inside of quotation marks, the name of the file. These are just things you really have to memorize, and after you've used them a couple of times, you become unconsciously competent; you just know how to do it. Stick to exactly the same name, uppercase, lowercase, exactly the same. So we're going to import that nice, neat data file that we saw. That nice spreadsheet file is now imported into Python. Very nice. Remember the type function in Python? Let's see what this
data type is this type of this object that is now stored in this piece of memory on google side name mighty machines called df and the type is a pandas dot core dot frame dot data frame object there you can see the full name of it remember before we also said that all objects have these attributes and they have these methods method is a function that pertains specifically to that object now to get to all the attributes and by the way pandas I think calls them properties not attributes doesn't matter properties attributes and the methods available to that object df now we can just use this dir function and if I pass the data frame object to it you see all these methods and attributes that are available so I can say df dot any one of these and all of them are going to do something some are attributes some are functions and look at a lot of stuff you can do it's just absolutely phenomenal and we're going to do some of these is in is in a is no items all these things just look at them dot mean dot median oh let's start looking interesting some statistical analysis here but all of these things are available to this object that's the way python and pandas designed it so we can use all of these things to do useful things to this object called df by the way you'll see these starting with underscores they those are dunder methods double underscore methods etc. 
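To make the import step concrete, here's a minimal, self-contained sketch. The real file lives on Google Drive and is read with pd.read_csv("data.csv") after the %cd; since that file can't be bundled here, the CSV content sits in a string, and the column names and values are made-up stand-ins for the course data:

```python
import io
import pandas as pd

# Stand-in for the course's data.csv (normally: %cd "<folder>" and then
# pd.read_csv("data.csv") -- the call is identical, only the source differs).
csv_text = """name,age,smoke,survey,vocation
Dylan,48,0,4,Energy manager
Ann,53,1,2,IT consultant
Ben,61,2,5,Nurse
Cara,39,0,3,Estate manager
"""

df = pd.read_csv(io.StringIO(csv_text))

print(type(df))           # <class 'pandas.core.frame.DataFrame'>
print("mean" in dir(df))  # True: .mean is among the methods dir() lists
```

The same read_csv call works whether the source is a filename, a URL, or a file-like object, which is why the Drive path only matters for finding the file, not for reading it.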
The ones with a single underscore are not dunders; it's the double-underscore things that are dunders. All of these mean something in Python, and if you want to get deeper into Python they're actually quite easy to understand, and they make a lot of sense. Anyway, notice these first few ones: Age, CholesterolAfter, CholesterolBefore, DOB, Delta, Group, HeartRate, Name, Smoke, Survey. They are also available to this object, the DataFrame. Those were our statistical variables, our column headers in our first row, and they are also available as attributes of this object. That's interesting.

Okay, let's scroll down and get some more information on our DataFrame object, this spreadsheet file that we've brought in. By the way, that is why I put that link to DataFrame objects inside the pandas documentation, but just type Python and pandas into Google or your favourite search engine and you'll get to it. So the first method we're going to use is the head method. So .head: it is a method, in other words a function, so open-close parentheses. I'm not passing any arguments to it, because there is a default argument, and that argument is 5, and what the head method does is print the first 5 rows. And remember, right in the beginning we imported this magic command to have tables printed nicely to the screen; that's what it is doing here. So we see these white and light grey alternating rows, and we can change the number of rows per page to 10 or 25 or 50 or 100. If I clicked on 10, for instance, it would show 10; it is not doing so now, because we are using the .head method. So just a little caveat there: if I just typed df, it would have printed that many rows, but I'm using .head, so it's constrained to what head does. As I said, the default is 5, but I can also put in another value there, depending on how many rows I want to see. Again, look at the left-hand side, the index telling us what row we're on; Python starts counting at 0, so we only see 0 to 4 here for the first 5. The first patient, Dylan Pattern, has an index of 0, but you see all of them there.

Let's just do that, let's run plain df, and now it's totally under the control of that initial magic command that we imported. Now I can easily go to page 2, page 3, page 8, etc., and back to page 1. I can say, no, no, I only want 10 lines displayed here, so it's under that control, and if I click on these headers it's going to allow me to change the order, the sorting. That's not going to be available if we don't use Colab and that magic command, so that's specific to here. Don't worry about all of these; there's also a filter there; that's all from this magic command, which is why working in Colab is so nice, because you've got this extra functionality. But we're going to do it all with code.

The next thing that you want to know about your DataFrame is its shape, and shape is a property, or attribute, so you're not going to put parentheses there, just df.shape. That's going to tell you the number of rows comma the number of columns. Get used to that, it's always rows comma columns; everything in Python is rows comma columns. 200 rows, so 200 patients, or participants, in the study, and there are 13 columns, 13 statistical variables on which we gathered some data. Then there's .columns, also an attribute, or property, and that's going to give us back all the values in that first row, and because we were particular about it, we've got tidy data, so those are all going to be our statistical variables. It creates this Index object for us, and inside of it is a list. Sometimes I call this repeatedly, because when I'm going on to the next analysis I can't exactly remember how a variable name was written, so I just call df.columns all the time, just to remind myself what all the variables in there are. They're listed: Name, DOB, Age, Vocation, Smoke, the heart rate, the systolic blood pressure, the cholesterol before, the triglycerides, that's your lipids, the Survey, the cholesterol after, Delta, which is the difference between the before and after, and which Group the patient was in.
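As a sketch, the same inspection calls on a tiny made-up DataFrame (the real study data has 200 rows and 13 columns; these six rows and three columns are just stand-ins):

```python
import pandas as pd

df = pd.DataFrame({
    "name":  ["Dylan", "Ann", "Ben", "Cara", "Ed", "Fay"],
    "age":   [48, 53, 61, 39, 45, 57],
    "smoke": [0, 1, 2, 0, 0, 1],
})

first_five = df.head()     # no argument -> the default of 5 rows
print(first_five)

print(df.shape)            # (6, 3): always rows, then columns
print(list(df.columns))    # ['name', 'age', 'smoke']
```

Note that head() takes parentheses because it's a method, while shape and columns take none because they're attributes, exactly the distinction made above.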
This is a little mock data set that had these patients take either a placebo drug or a new drug, and we looked at their cholesterol before and after, as simple as that. Then ndim is another property, the number of dimensions, and you can see 2: the 2 stands for the rows and the columns, there are two dimensions here, a row dimension and a column dimension, as simple as that. The size is just going to give me all the values there are, so the number of rows times the number of columns; that's how many data point values we have, 2,600.

And then we get dtypes, also an attribute, and that's an important one, because it tells me what Python, or pandas, thinks these types are. Remember, we spoke about numerical variables and categorical variables; now let's see what pandas thinks about all of this. It thinks Name is an object, and object is pandas-speak for a categorical variable, so that's quite right. DOB it also sees as an object, but it's actually a date, and we haven't spoken about what kind of thing a date is. Then there's Age and Vocation: it sees Vocation as an object, and Age as an integer, int64. Now, the 64 refers to how many little bits, those microscopically small on-off switches inside your computer's CPU: 64 means it takes 64 of those switches, bits, to describe that number, and as an integer it stores whole numbers. That's right. Then Smoke, which was 0, 1 and 2, is a categorical variable, and nominal at that, because there isn't a natural order; whether you put ex-smokers before smokers, or smokers before non-smokers, there's no natural order, we just used 0, 1 and 2. It says int64: that's wrong, ding ding ding, that's wrong. The systolic blood pressure it says is float64, that's a decimal point value, so that was correct. Survey it sees as int64; that's wrong, because it's a categorical variable, and so on and so on. But it gives us an idea of what Python, or pandas, is seeing, and we have to work with that, so dtypes is very important.

So let's start extracting some of the rows and some of the columns. The first thing I'm going to show you is how to look at a single column; I'm not interested in all the columns, I'm just interested in the ages of the patients in my study. There are two ways that I can extract just that column. This is probably the proper way: df, and then inside square brackets, which we've learned about before with indexing, and inside quotation marks, I pass the, I should say, statistical variable, the column header, as a string. And I'm saving this inside a computer variable called age_column. There we go, that is one way to extract only a certain column, and I've saved it in its own computer variable. So what is the type of this new thing? When we extract only a single column from a DataFrame object, it becomes a Series object. A Series has just this one column, but it also keeps an index, just as we saw with the DataFrame. If we scroll back up, there's this index column that pandas adds to the DataFrame so that each of the rows has its own index, and we don't have to stick with the default; if you want to, you can even make one of the columns the index, but then its values have to be unique, and sometimes that is done for dates and times, etc., but we won't get into that. So there we go, I have created a pandas Series by selecting only one of the columns: if it's only a single column, it's a pandas Series. There is another way to do it, and that is, if the column name doesn't have any illegal characters, like spaces, we can use dot notation.
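A small sketch of dtypes and single-column extraction, again on stand-in data; note how the whole-number smoke codes come back as an integer dtype even though the variable is really categorical, which is exactly the trap described above:

```python
import pandas as pd

df = pd.DataFrame({
    "name":  ["Dylan", "Ann", "Ben"],
    "age":   [48, 53, 61],
    "sbp":   [120.5, 131.0, 118.2],   # systolic blood pressure
    "smoke": [0, 1, 2],               # categorical codes, stored as integers
})

print(df.dtypes)        # name: object, age/smoke: integer, sbp: float

age_column = df["age"]  # bracket notation works for any column name
print(type(age_column)) # a pandas Series: one column plus its index
```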
So df.age, and that's a little shorthand. I like to use it, but it's probably not the proper way, and the square-bracket way is safer when you do have illegal characters, because if you put the name inside a string you can always do it. Good, let's look at the first five rows of our pandas Series, and you can see that it's a Series because it still maintains this index on the left side. There are no column headers there, in other words it's not saying index and age, but you can see that I just have this one column of age values.

Now, one thing I want to teach you, because it can be very useful at times, it's probably not absolutely necessary, but it is nice, and that is to convert an object to another object. We have this age_column; at the moment it's a pandas Series object, and I'm going to convert it to a NumPy array, a new object type. I like a NumPy array because it strips away that index, and I'm left with only the values that I'm really interested in. For that we're going to use the to_numpy method that belongs to Series objects: if I say df.age, I've got a pandas Series, and one of the methods of a pandas Series is that to_numpy method. So I'm going to call that method on the Series, and I'm going to store the result in a new computer variable called age. As I say, these are useful things, not absolutely necessary, but sometimes I find them very useful. The type of this is now numpy.ndarray, an n-dimensional array; at the moment it's one-dimensional, because there's one list of numbers, it's not two-dimensional with rows and columns like our DataFrame. And because this is a NumPy array, it has its own set of attributes and methods, and if I pass it to the dir function, you see all the dunder methods there, with the double underscores, and the normal methods and attributes, all listed.

So let's look at a few very useful ones if you've converted to a NumPy array, although they are also available for pandas Series, by the way. There's the .min method, so age.min, it's a method, so parentheses; that's going to give me the minimum age of that whole array of ages. I can do the same for max, and I can do the same for the average: the average age of our data was 53.07. See how easy that was, some statistical analysis: I extracted a pandas Series, converted it to a NumPy array, and just called one of the methods that is available for that type of object, .min. As I said, they're also available for pandas, I didn't really have to convert to a NumPy array, but hey, I just wanted to show you what is available.

Now let's look at extracting just the rows, and for that we're going to use this attribute called iloc. So df.iloc, and iloc is integer location, in case you're interested, and it's followed by a set of square brackets. What I'm asking for here is just 0. Now, iloc, integer location, is strictly going to go rows comma columns; if I don't put a comma, and just put values there, it's going to assume just the rows and give me all the columns. So if I put 0, remember, that's going to be the first row of data. Now, it's printed very nicely as two columns here, but you can think of it as that first patient, all their values in that first row: their name, their date of birth, their age, their vocation, etc., all their random variable values for these variables. Now, what if I wanted rows 2, 3 and 5? Well, that's index 2, 3 and 5, which makes it rows 3, 4 and 6, remember the counting from 0. I pass those as a list, so you see the inner set of square brackets there: I'm passing a list to the outer set of square brackets, to the iloc attribute, and now I'm going to see index 2, index 3 and index 5, those rows of patients, the ones that I want. So that's one way to go about it.
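Those steps sketched on stand-in data: convert the Series to a NumPy array, call its statistical methods, then pull rows out with iloc (the 53.07 above comes from the real data; this toy set has its own mean):

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["Dylan", "Ann", "Ben", "Cara", "Ed", "Fay"],
    "age":  [48, 53, 61, 39, 45, 57],
})

age = df["age"].to_numpy()               # strip the index, keep the values
print(age.min(), age.max(), age.mean())  # 39 61 50.5

print(df.iloc[0])                 # first row (index 0), all columns
rows = df.iloc[[2, 3, 5]]         # a list of row indices -> those three rows
print(list(rows["name"]))         # ['Ben', 'Cara', 'Fay']
```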
But the rows I want might be contiguous, so I can use a slice, 0 colon 2, and this, by the way, is all that I'm going to get: remember, the right-hand side in Python is usually excluded, so it goes up to 2 but does not include 2, meaning row 0 and row 1. One way to remember it is that I'm going to get 2 rows back, and that's why the 2 is there. Let's have a look: it gives me this contiguous set, 0 and 1, but it does not include 2; it's 2 rows, but it doesn't include that third one, which would be index 2.

So now let's do rows comma columns, because every time so far I've just asked for rows; you didn't see any single comma in there, well, there were commas in the list that I passed, but no rows comma columns. So here we have rows comma columns, and for my rows you see I'm using a range, remember range from the last video, so it's 0 comma 5. What do you think is going to happen to that 5? Does a pandas range exclude that 5, or include it? I guess we're going to find out. And then you see the comma, and then the columns, and I've passed the columns as integers, because I'm using iloc, integer location, so I can't use the names of the columns, I have to use the column numbers. The first column, the Name column, is going to have a value of 0, that's the zeroth column, and DOB, date of birth, is going to be the first column, so I'm asking for columns 1 and 2, that's index 1 and 2, meaning it's going to be DOB and Age. Okay, but let's see what happens to the rows. There we go: 0, 1, 2, 3, 4, no surprise there, I hope I didn't entice you; that 5 is excluded, even in a range here in pandas, so we're sticking to that rule. You can see the columns that we were after, the DOB and the Age column, and the 5 rows that I wanted, 0, 1, 2, 3, 4, with 5 being excluded.

Over and above iloc, you also get loc, and with that I can use actual words. Now, remember, my index here was just numbers, that was done automatically, but I could make one of the columns the index. For instance, if I have something like date-times, and the patients come in sequentially, those are all going to be unique values, and I can turn something like that into an index, so that all the values are still unique, and then I can use loc to refer to a specific string for the row, and a specific string for the column, the column header, the statistical variable name. loc allows me to do that, but here, remember, we're stuck with just the normal integer index. So I can say .loc, and the rows again are 0 colon 5, then a comma, and now, as a list, I'm passing these two names, DOB and Age. So instead of iloc, where I created a list of the two column numbers that I wanted, I can now pass a list of the actual names with loc, and that's going to give me back exactly the same thing. Lo and behold, though, does it? Now we are breaking the rule: here with loc, that 5 was included, so instead of rows 0 to 4, I now have 6 rows. So just bear in mind, sometimes you do get these little differences; just live with them, it's not a big issue.

There's also .iat, integer at, and that's more like what you would think of in a spreadsheet file: I want that row and that column. So row index 3, column index 2, which is actually the fourth row and the third column, remember we start counting at 0, and there's the value that was exactly there. And we can also just use .at if we want to use the actual names, like we did with loc.

Okay, let's go on to something much more interesting: I want to filter the data based on some information that I want. Instead of calling out the actual rows and columns, I want to finesse a little bit here, I want something more interesting, because I don't know in what order these patients were captured, and that's not what I'm interested in. So let's go and ask the first filtering question: we want to find all the unique values in a column. Remember what that meant: the sample space, specifically of a categorical variable.
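The iloc-versus-loc difference is worth seeing side by side; a sketch on stand-in data (seven rows, so both slices fit):

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["Dylan", "Ann", "Ben", "Cara", "Ed", "Fay", "Gus"],
    "dob":  ["1975", "1970", "1962", "1984", "1978", "1966", "1990"],
    "age":  [48, 53, 61, 39, 45, 57, 33],
})

a = df.iloc[0:5, [1, 2]]          # integer slice: rows 0-4, the 5 EXcluded
b = df.loc[0:5, ["dob", "age"]]   # label slice: rows 0-5, the 5 INcluded
print(len(a), len(b))             # 5 6

print(df.iat[3, 2])    # one cell by integer position (4th row, 3rd col) -> 39
print(df.at[3, "age"])            # the same cell, column by name -> 39
```

The inclusive end with loc is because loc slices by label, not position, and a label range has no natural "one past the end".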
So the Smoke column: remember, that was, I think, 0, 1 and 2, for non-smoker, current smoker, ex-smoker. I want to find out; there might be many more than that, and I'm only seeing the first 5 or 10 rows, and I don't want to scroll through something that has 2,000 rows. For that we have this unique method for pandas Series. So I've got df.smoke, and that's going to give me back just that column as a Series, and one of its methods is unique, so open and close parentheses. It's going to go look down that column and return for me the sample space elements, the only things that occur in that column: there's a 0, a 2 and a 1. Why did it give me this order? It just goes down and reports things in the order it first finds them: there was a patient with 2, an ex-smoker, before there was a patient with 1, so it's just going to give me that order, it doesn't matter, I can see the sample space now. And it helps a lot, of course, when we use it on categorical variables, not so much for numerical variables, especially when we get to floating point values, because then you're going to get back a lot of values; there it would be much better to use the minimum and maximum.

So let's have a look at filtering: all the ages of the non-smokers. Now that's interesting: I might want to know, just give me the ages of the patients who don't smoke. Think about it, that's a very interesting question; how do we translate that, teasing just those values out of our DataFrame? We're going to do it using Boolean logic. We've looked a little bit at that: it's where I can ask a question, and behind the scenes it's going to return either True or False. First of all, I'm going to create a computer variable, and I'm going to make it very descriptive: I'm going to say non_smoker_age, so that I know that inside of it I am only going to store the ages of non-smokers. So if I look at this a week from now, I'll probably be able to make out, or if I give it to someone else, they'll probably be able to make out, what values are in that computer variable.

Okay, so there's a couple of ways to go about this. I'm going to show you the very short way, and it is a bit different from what you've seen before with loc and iloc and those rows comma columns. So have a look at this notation, you've just got to get used to it: it's df, the DataFrame, and then inside a set of square brackets I'm going to say df.smoke equals equals 0. The double equals, we've seen it before, asks the question: is this equal to 0, True or False? And it's only going to include the rows where that is True, so only the ones where the patient is a non-smoker, if that is what the code 0 was for. If this were a string value, one of the sample space elements, I could just put it inside quotation marks and look for that specific one. So df.smoke equals equals 0, the double equal sign, is a Boolean question: is this true? Yes, it's true, include the patient; no, it's not a 0, don't include. And then, what do we want back? That goes in its own extra set of square brackets, with the name of the column inside quotation marks; that's why the quotation mark format is probably better than the dot format, because here there's no way to use the dot, well, I think I tell a lie, I think you actually can, but let's stick to this for now. And then, on all of that, I want a conversion, because I'm going to get back a pandas Series from this, but I want to convert it to NumPy; as I said, this is something extra I'm showing you. The object inside this non_smoker_age computer variable is going to be a NumPy ndarray, and that's very nice, because I can call .mean on it. So I say non_smoker_age.mean, and that method calculates the mean, and I can see the non-smokers had an average age of 50.1. Great stuff.
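A sketch of both steps on stand-in data: unique() to see a categorical column's sample space, then a Boolean filter for the non-smokers' ages (the 50.1 above comes from the real data; this toy set gives its own mean):

```python
import pandas as pd

df = pd.DataFrame({
    "age":   [48, 53, 61, 39, 45, 57],
    "smoke": [0, 2, 1, 0, 0, 1],   # 0 non-smoker, 1 smoker, 2 ex-smoker
})

print(df["smoke"].unique())        # [0 2 1] -- order of first appearance

# Rows where "smoke == 0?" is True, then just the age column, as NumPy
non_smoker_age = df[df.smoke == 0]["age"].to_numpy()
print(non_smoker_age.mean())       # 44.0 for this toy data
```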
Now let me just show you other ways to do this. Again, I'm going to use the exact same variable, so I'm going to overwrite the computer variable non_smoker_age, and I'm going to use the .loc method. So here I'm going to say df.loc, and inside the square brackets it looks almost the same as the last one, I've just added .loc: df.smoke equals equals 0, and then, in a separate set of square brackets, age, then to_numpy. I'm going to get back exactly the same thing. And there's yet another way to do it, and as I've written down here, it might be confusing in the beginning that there are so many ways to do exactly the same thing, but there's power hidden behind that, there are reasons why you can do things differently; for now, just accept the power that is at your command. So non_smoker_age, exactly the same computer variable, I should say, equals df.loc, and now I'm going to pass them almost as a row comma column version: df.smoke equals equals 0, comma, age. This is probably the more appropriate way, because you'll remember it this way: I'm doing rows comma columns, go down that Smoke column row by row, and only include the rows for which Smoke is 0, and then give me, comma, the Age column. Then again to_numpy, and I'm going to get back exactly the same NumPy ndarray object, no problem.

Okay, let's filter something else: filter all the non-smoker ages where the survey choice is more than three. Getting very specific here, and of course that doesn't make much sense in the context of the data that we have here, but when you have a data set you might want to be this specific in your subgroup analysis, so you want to be able to extract very specific things. So let's just tease this apart: I'm looking to filter all the non-smoker ages, the ages of the ones who are non-smoking, and the survey choice must be more than three. That survey choice was perhaps: are they satisfied with something? So I want the satisfied people here. Let's have a look at how we construct that. I've got to look at two things here, and both things have got to be true: they've got to be a non-smoker, and in the Survey column they've got to have a value of four or five. Both of those things have to be true before we get a True value, and then that age will be included, and you can imagine stringing even more together. So let's put it together; I'm going to show you the long way first, and then a slightly better way, but again I'm going to give it a descriptive name: non_smoker_satisfied_age. We've decided that if you mark a four or more, you are satisfied, by the way. Okay, so it's df.loc, and now I'm going to pass each of these little Boolean things inside its own set of parentheses: df.smoke equals equals 0, and df.survey greater than 3, so strictly more than three, that's only four and five, not three; it's not greater-than-or-equal-to, so they really are satisfied, just four or five. Then I bind these two together with this little ampersand; that is Python's symbol for and, and both must be true for a row to be included, not either of them, because that would be an or, and I want an and. Then comma, age, so I'm using this row comma column sort of notation inside my square brackets, and then to_numpy, because I want a NumPy ndarray; again, I'm just showing you that this is possible. Now I've got only those ages: they'll have a survey score of four or five, and they'll be a non-smoker, guaranteed, and I'll only have the ages for those, and now I can look at the mean for those. Very powerful stuff, if you think about it. I want to show you as many examples as possible, so if you want to take a break, have some tea or coffee, whatever your favourite beverage is, then come back, but come back! Okay, more, more, more.
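The combined filter sketched on stand-in data, with each Boolean question in its own parentheses and the ampersand between them:

```python
import pandas as pd

df = pd.DataFrame({
    "age":    [48, 53, 61, 39, 45, 57],
    "smoke":  [0, 2, 1, 0, 0, 1],
    "survey": [4, 5, 2, 3, 5, 1],
})

# Non-smoker AND satisfied (survey of 4 or 5): both must be True
non_smoker_satisfied_age = df.loc[
    (df.smoke == 0) & (df.survey > 3), "age"
].to_numpy()

print(non_smoker_satisfied_age)         # [48 45]
print(non_smoker_satisfied_age.mean())  # 46.5
```

The parentheses around each condition matter: & binds tighter than == in Python, so without them the expression would raise an error.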
Filter the ages of all non-smokers, or those who have a satisfaction score of more than three. So now it's an or: if either one of those is true, I want the age, not both of them have to be, as with and. So instead of putting these together with the ampersand, which means and, where both have to be true, and if I had more of them they'd all have to be true, I use the pipe, as we call it, that single up-down line. On my keyboard it's above Enter, Shift plus my backslash key; on your keyboard it might be somewhere completely different, so search for it, and sometimes it's actually drawn as a broken vertical line, with a little empty space in the middle, so check your keyboard. What I'm doing here is saving it as a computer variable, and I've called it crit, for criteria, criteria being more than one. So again, each one in its own set of parentheses, importantly: df.smoke equals equals 0, or, df.survey greater than 3. And then I just pass this crit as my rows, so df.loc, rows comma column: those are my rows, go down those columns row by row and look for either of those to be true, then include that age, convert it to a NumPy array, and store it in this non_smoker variable. I hadn't run the crit line there, so let's run that; now it's in memory, and now we can use it as this criterion. Okay, so now we have the ages of patients who are either non-smokers, or have a score of more than 3, or both: someone could be both a non-smoker and have a survey of more than 3. But there are certainly going to be people in there for whom only one side of the or is true, and then the other part of that or must be true for them to be included.

Okay, more, more, more: filter the ages of all patients who are not non-smokers and who do not have a survey score of more than 3. So, two negations here; let's just think about this for a while, let's twirl this open, and I'm going to read it for you. We're dealing with negation, so we could change our language a little bit and take the opposite view: if you are not a non-smoker, that means you are a smoker or an ex-smoker, and if you do not have a survey score of more than 3, that means you have one that is 3 or less. So you can change it into a positive way of looking at it, but sometimes that can become a bit difficult: say you have a sample space of 20 elements and you just want one of them excluded; now you'd have to include all the other 19. It's easier to go about it this way. First of all, we create the same sort of criteria: smoke equals equals 0, which you don't want, and a survey value of more than 3, which you also don't want, and you put this little tilde, that squiggly line, in front of each; that's negation, the not operator, if you know that kind of thing, inside Python. So: not this criterion, in other words exactly the opposite of it, and both have got to hold, so I'm using the ampersand there to indicate and. If I run this now, it's going to do exactly that: it gives the ages of people who are not non-smokers and who do not have a score of more than 3. Excellent.

More: create a new DataFrame object that only contains patients younger than 50. Sometimes I want to create sub-DataFrames, because from now on I'm only going to do subgroup analysis on this group of patients, and I want a new DataFrame so that I'm not contaminated by the other patients. Let's have a look at that, very simple. I'm going to give it a computer variable name, and I suppose I could have called it younger_df, that would have been a bit better, don't you think, but anyway, I went with new_df here. So I'm going to take df, and then I'm going to look at a specific Series in there, df.age, and I want the age to be less than 50. If I do that, there's no comma columns there, so it's going to return all the columns for me; you see I've still got all my columns, but if I look at the ages, I'm not going to find anyone who's 50 or older.
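All three ideas, the or, the negation, and the sub-DataFrame, sketched on the same stand-in data:

```python
import pandas as pd

df = pd.DataFrame({
    "age":    [48, 53, 61, 39, 45, 57],
    "smoke":  [0, 2, 1, 0, 0, 1],
    "survey": [4, 5, 2, 3, 5, 1],
})

# OR: either side being True is enough to include the row
crit = (df.smoke == 0) | (df.survey > 3)
either_age = df.loc[crit, "age"].to_numpy()

# Tilde negates each condition; & then demands both negations hold
neither_age = df.loc[~(df.smoke == 0) & ~(df.survey > 3), "age"].to_numpy()

# A new sub-DataFrame containing only the under-50s, all columns kept
new_df = df[df.age < 50]
print(new_df.age.max())   # 48: no one 50 or older survives the filter
```

Note that the "neither" group is exactly the complement of the "either" group, which is De Morgan's law at work: not-A and not-B is the same as not-(A or B).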
Okay, so now let's just make sure that that is so. So new_df, that's my DataFrame, .age, that's going to return a pandas Series for me, and I can call the .max method on that Series, and I get 49: that's the oldest patient there. And just a little reminder: new_df["age"], the bracket notation, is probably better, so let's not always use the dot notation; use the full notation and it's still going to give back the same thing. And here's just another way, again, I'm just showing you what is possible here, don't let this abundance of ways to do things confuse you, but just for interest, look at this one: I'm saying new_df.loc, and that means I can refer to the names, and I'm going to use a comma, so I have rows comma columns, and you see a colon there. Remember, when we sliced from 0 to 5, for loc the 5 was included; but if I just put a single colon there, it's shorthand for all. So that's all the rows, comma, just the age column, and then the .max method on that pandas Series, and again I get exactly the same thing: the oldest patient was 49.

Now, create a new DataFrame for patients with a restricted list of job titles. Because here's one thing we haven't spoken about: when we spoke about tidy data, I said have this clean sample space, and how many data sets have I analysed where there were columns with free-form input. One of the most famous ones in healthcare research is the list of comorbidities: one patient will have hypertension and diabetes, the next one will have hypertension and diabetes and ischaemic heart disease, and in the next row those three things will be listed with commas in between, but the order will just be changed. You cannot just put all that stuff in a single column; that is not tidy data, that is not tidy at all, you cannot analyse that, it's impossible to analyse. Everything must be in its own column. So if you think of comorbidities, you have to think of all the comorbidities that you're interested in, and they each have to become their own column, and the sample space for each of those will be yes and no. So hypertension will be its own column, and its sample space will be yes and no; the next column will be diabetes, and its sample space elements will be yes and no, with the values running yes, no, yes, yes, no, no, etc. So you have to break that down; it's called dummifying, but that's not of interest to us now. Sometimes, though, there's just no other way, there is this free-form input, so I want to give you some sort of idea of how to deal with it.

So, for instance, that Vocation column had all sorts of jobs for the people in our study, but what if I were interested in just these three jobs: for some bizarre reason I want the IT consultants, the energy managers and the clinical embryologists. Now, you'll have to know what is inside your data set, and remember, if that embryologist was written with an uppercase E in some rows and a lowercase e in others, those are two different elements, so be careful of that. But those are the only ones I want, and I'm passing them as a list to this jobs computer variable, so I'm passing a list object to it. Then I'm going to use a criterion, my crit, as I'm calling it: I'm saying df.vocation, so that gives me a pandas Series, and then on that I'm using isin, isin as a method, and I'm passing this jobs list object to it. So isin actually asks a little question: is it in this pandas Series? If IT consultant, energy manager or clinical embryologist is in that row, one of those three, then it goes into this crit that I've got there. And then I'm going to create a new DataFrame, I'm going to call it jobs_df, and I'm going to pass this df.vocation isin jobs, pass that to loc, the loc attribute I should say.
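A sketch of isin on stand-in vocations; the three job titles are from the lecture, the patient names are made up:

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["Dylan", "Ann", "Ben", "Cara"],
    "vocation": ["Energy manager", "IT consultant",
                 "Nurse", "Clinical embryologist"],
})

jobs = ["IT consultant", "Energy manager", "Clinical embryologist"]
crit = df.vocation.isin(jobs)   # True wherever the vocation is in the list
jobs_df = df.loc[crit]
print(list(jobs_df.name))       # ['Dylan', 'Ann', 'Cara'] -- no Nurse
```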
Then let's have a look at the first rows of that. If we go down the vocation column, people with those three jobs are the only ones we're going to find (I think there were only four in this whole dataset, mind you). So remember this .isin() method: you can pass a list of things to it, it will look only for those things in a pandas Series, you pass the result to .loc, and you get back just the rows you're interested in. Of more use when it comes to free-form input is looking for specific words or phrases in a column where people could type anything they want. What if I only want to include patients where, somewhere in that vocation column, the word manager appears? You can well imagine the same idea with keywords like hypertension, diabetes, or chronic obstructive airways disease. Now, this gets complicated, because the .str accessor has so much to it; there's so much you can do, it becomes complex, but it's actually a lot of fun. This is a useful one, the one I want you to practice, the one that goes into the memory banks. Look at what we do: we create a criterion again, crit, which is df.vocation (a pandas Series), and on it we call .str.contains(). contains is the method, and I pass the word manager to it, plus na=False. You don't strictly have to add that; I'm just letting you in on a little secret, because later in this video we're going to look at NaN values, so I'll keep it as a little surprise. Let's get back to what I'm actually trying to tell you: I'm only looking down that column for rows that actually contain the word manager.
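The keyword search could look like this; the column values and the missing entry are made up, but the na=False behaviour is the point:

```python
import pandas as pd

# Invented free-form vocation entries, including one missing value
df = pd.DataFrame({'vocation': ['Energy manager',
                                'Estate manager/land agent',
                                'Clinical embryologist',
                                None]})

# na=False makes missing entries count as "no match" instead of NaN
crit = df.vocation.str.contains('manager', na=False)
vocation_df = df.loc[crit]
```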
I'm saving this as a computer variable called crit, I'm passing it to .loc (so df.loc[crit]), and I'm giving the result a new DataFrame name, vocation_df. Let's look at the head of that: I see energy manager, tourist information centre manager, estate manager/land agent; anywhere the word manager appeared. And you can well imagine, if it were hypertension you were looking for, it would pull out only the rows in which hypertension appeared, so that can help you a lot. So remember that .str.contains(); great stuff. Let's move on to something brand new. If you want a break, go have a break and come back. Are you back? Great stuff: updating or changing the values in a DataFrame object. The first thing we'll do is rename some columns. That is a very useful thing, because people write all sorts of weird things, with all sorts of illegal characters, in their spreadsheet files, and sometimes you just want to clean that up. The best way is to use a Python dictionary. Remember dictionaries: key-value pairs inside curly braces; go watch the other video again if you need to. All right: df.rename(). rename is a method of a DataFrame object, and I pass the argument columns= with a Python dictionary. In this instance I just want to change the name column to patient, showing you that it's possible to use other names for your statistical variables, your column headers. If I had more, it would just be a comma and another key-value pair (and we indicate key-value pairs with a colon in between, remember). But this only does it temporarily. If you want to make it permanent (some of these pandas functions you've got to watch out for, and you'll pick them up as you go along), you really have to include the argument inplace=True. You've got to have that; then it makes the change in place, permanently.
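A small sketch of rename, showing the temporary copy versus the inplace=True version (toy data again):

```python
import pandas as pd

df = pd.DataFrame({'name': ['Ann', 'Ben'], 'age': [34, 49]})

# Without inplace=True, rename only returns a changed copy; df is untouched
renamed = df.rename(columns={'name': 'patient'})

# With inplace=True, the change is made on df itself, permanently
df.rename(columns={'name': 'patient'}, inplace=True)
```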
Right, so let's have a look at df.columns. Remember, that's how you get the list of your columns, and there we see we don't have name anymore; we have patient, our new column header. If I look at just the head, the first five rows, you'll see patient there instead of name, so we've changed that. What if I want to add two to each of the age values? That is how we change actual data-point values in a column. One thing you can do, if you want to protect patient privacy: as a rule, everyone involved in the research, your little team, agrees that when you capture patients' ages you subtract two from everyone's age. You're capturing these false ages, so if someone tries to track those patients, they might have difficulty finding them. There are many ways to obfuscate the data you capture; you just have to be able to put it back to the right values. This is not often necessary, and I just wanted to show you that it's available, so let's not spend too much time on it. The first way is to create a function, and I'll show you about creating functions. Remember I told you Python is full of functions, and you add extra ones by importing new libraries, but you can even create your own. Isn't life fantastic? def is our keyword; it stands for define. We're going to define a function and give it a name, and our very fancy name is add_two, because we want to add two to every value, which seems kind of appropriate. Then we have an argument, which is just a placeholder that we call x. After that line there has always got to be a colon, and when you hit Enter there's going to be some white space. I told you before, Python is very specific about its white space, so I couldn't put the return right underneath the def; no, no, no. When I hit Return, the notebook actually does that for me: it adds this
white space, and on that indented line it says return x + 2. So whatever I put in as my argument (it's got to be a number, otherwise this isn't going to work), it returns that value with two added. We've created our own little function; there's a lot more to functions, but anyway. Let's have a look at the ages before: the first one was 43 (remember, this is a Series), then 53. Now let's use this very useful method, .apply(). We apply something to a pandas Series: df.age is a Series, we use the .apply() method on it, and what we apply is our new function, add_two. Guess what it does: it looks at every value and adds two to it. Let's look: patient zero was 43, now he or she is 45; then 53 became 55. It did all of that for us. I'm going to show you another way to do this as well, also quite hectic: function creation with the lambda keyword. It's slightly more advanced Python, but it can be useful and I want you to know about it. df['age']: I'm going to overwrite the age column. Remember, the equals sign is not equality; it's the assignment operator. On the right-hand side I have the pandas Series, I use the .apply() method, and what I pass it is a lambda function that does the same sort of thing as the little function we created. So: lambda, then the placeholder x, then a colon. Symbols have different meanings in different places inside Python; here the colon means 'this is what to do with the placeholder': take it and subtract two from it. I use that lambda as my argument to .apply(), and because I've overridden the column (what's on the right-hand side gets assigned to the left-hand side), I'm back where I started: 45 minus 2 is 43, and so on for all of them. I wanted to show you that; keep it in the back of your head in case you ever need it.
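Here is that round trip as runnable code: a named function applied with .apply(), then a lambda that undoes it (the ages are made up):

```python
import pandas as pd

df = pd.DataFrame({'age': [43, 53, 30]})

def add_two(x):
    # Whatever number goes in comes back out with two added
    return x + 2

# .apply() runs the function on every value of the Series
shifted = df.age.apply(add_two)

# The lambda form does the same job inline; here it reverses the shift
df['age'] = shifted.apply(lambda x: x - 2)
```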
Now, changing a nominal variable to numbers. I have my group variable, and the sample space for group had two elements: control or active, the control group getting placebo and the active group getting an active drug. If I wanted to do something like logistic regression, I would have to use numbers inside my model, so I want to change these names, these strings, into numbers. I call it ordinal here, which is slightly naughty; just because it's 0 and 1 doesn't really make it ordinal, so I should probably change that, but I think you know what I'm trying to say. One way to do it is the .map() method. Again, df.group gives me that column as a Series, and I say: do the following mapping, using a Python dictionary. Wherever it says control, replace that with 0; wherever it says active, replace that with 1. I overwrite df['group'], so if I look at the head of that Series I have 1, 1, 1, because those rows were all active, but there'll be 0s in there as well. There's another function you can look at, .replace(), and it does basically the same thing. Good, we're almost there; hang in there. Next, the columns themselves: we can add new columns, and let me show you some of these. I'm going to split the patient column (remember, it held a first name and a last name) into separate first-name and last-name columns. Doing this with names is not that exciting; it's perhaps more exciting with other variables, but let's do it anyway. It's one of those .str things again, a bit more advanced, but be aware of it.
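The .map() recoding in miniature, with an invented group column:

```python
import pandas as pd

df = pd.DataFrame({'group': ['control', 'active', 'active', 'control']})

# A dictionary of old: new pairs recodes every value in the Series
df['group'] = df.group.map({'control': 0, 'active': 1})
```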
So: data, a new DataFrame I'm creating (I think I've used that name before, but anyway), is the pandas Series of the patient column, df.patient, with .str.split() called on it. I split on a space, because remember it was first name, space, last name, and I add expand=True. That's a bit of a difficult one; for now, just always put it in. To create a new column, I use the same notation as when calling up an existing one: df['first_name'], just as if it already existed. And what do I assign to it? data[0], because the split breaks the Series into two columns, a zeroth one and a first one. The zeroth goes into the new first_name column, and the first one (which is the second, of course) goes into last_name. Let's look at what df.head() shows now: right at the end, two new columns, first_name and last_name. Where the patient column had Dylan Patton, we now have Dylan and Patton, because the split happened at the little space between the n and the P. Right, let's do something else: combine two columns into a new column. I'm going to create a new column called name. Remember, that header did exist before, and we changed it to patient, so it doesn't exist anymore. So, df['name'], and what do I assign to it? A concatenation: the last_name Series, plus (the plus is the concatenation, putting them all together) a comma and a space, which is a string, so it goes inside quotes, plus the first_name pandas Series. Put all of those together and look at it now: it's last name, comma, space, first name. I've concatenated things into a new column. Wonderful stuff.
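Splitting and recombining, sketched with invented patient names:

```python
import pandas as pd

df = pd.DataFrame({'patient': ['Dylan Patton', 'Ann Smith']})

# expand=True returns a DataFrame with one column per piece of the split
data = df.patient.str.split(' ', expand=True)
df['first_name'] = data[0]
df['last_name'] = data[1]

# String Series concatenate with +, so we can build the combined column
df['name'] = df.last_name + ', ' + df.first_name
```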
Perhaps slightly more interesting is creating a categorical variable from a numerical variable by binning. Let's have a look at cholesterol. The minimum cholesterol_before was 1.2 (those are the SI units we use locally; other parts of the world use different units, so if those aren't the values you're used to seeing, just roll with it), and the maximum in our dataset was 11.1. So let's cut that up: between that minimum of 1.2 and maximum of 11.1, I want to create three groups of people, those with low, those with intermediate, and those with high cholesterol. That's an ordinal categorical variable: there is some order to it (low, intermediate, high), and it's categorical because we now have three discrete categories. So I create a new column; I call it cholesterol_before_level, which makes sense. And what do we do? We use the cut function. It's a pandas function, not a method of an object, so I've got to say pd.cut(). What do I want to cut? The cholesterol_before Series, please, and create bins for me. I said bins=3, so pandas automatically goes from 1.2 to 11.1 and divides that range into three equal sections. We give each of them a label, and it's got to be in an order that makes sense for what you're trying to achieve, so I've got low, intermediate, and high. Let's look at the first 10 of those using .head(10). It was kind of in order: 1.2 is low, 1.2 is low, 2.0 is low, 2.1 is low, and if we looked at 100 rows you'd see the intermediates and the highs as well, but just to show you here, 3.2 is indeed low. This all works out for us. Sometimes it doesn't work out, though: you want control, not three equal bins; you want to control the size of the bins yourself, and here we have exactly that. I'm overwriting cholesterol_before_level (it already exists) with another pd.cut() on the cholesterol_before Series, and this time four numbers are written for the bins, because between four edges you're going to have three
bins: from 0 to 5, from 5 to 10, and from 10 to 20. And I say right=False, keeping the same labels. So what does right=False do? (right=True is the default, I think.) If you remember intervals from school, there are closed and open interval ends: a closed end is denoted by a square bracket and an open end by a round bracket. With right=False, each bin is closed on the left and open on the right, so the first bin is [0, 5): the 0 is included, but the 5 is not. The 5 belongs to the next bin, [5, 10), where it sits on the closed left side and is included. right refers to the right-hand side of each interval, and right=False means that side is excluded. So think about it: a patient with a cholesterol of exactly 5, which of the groups do they fall into, low or intermediate? They fall into the intermediate group, because 5 is on the open, excluded side of the low bin and on the closed, included side of the intermediate bin. Likewise, a value of exactly 10.0 goes into the high group, not the intermediate one. That's how you control your bins; as simple as that, I hope. You can put the cutoffs exactly where you want them, because medically that's going to make a hell of a lot more sense. Now let's delete a column. That's just the .drop() method: df.drop(), then the argument columns=, and I pass a list of column names, so you've got to put them inside square brackets, and you can add others with commas. The drop method is one of those where, if you want the changes to be permanent, you have to include the inplace=True argument; this is one of those things.
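Both flavours of pd.cut(), plus the .drop() at the end, sketched over made-up cholesterol values:

```python
import pandas as pd

df = pd.DataFrame({'cholesterol_before': [1.2, 4.9, 5.0, 9.9, 10.0, 11.1]})

# Automatic: three equal-width bins between the min and the max
df['level_auto'] = pd.cut(df.cholesterol_before, bins=3,
                          labels=['low', 'intermediate', 'high'])

# Explicit edges; right=False closes each bin on the left and opens it
# on the right: [0, 5), [5, 10), [10, 20), so exactly 5.0 is intermediate
df['level'] = pd.cut(df.cholesterol_before, bins=[0, 5, 10, 20],
                     right=False, labels=['low', 'intermediate', 'high'])

# Delete the automatic column again; inplace=True makes it permanent
df.drop(columns=['level_auto'], inplace=True)
```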
So if I now look at df.columns, that name column (remember, the one we built as last name, comma, space, first name) is gone from our list of columns. Sorting: let's go on to sorting. I want to sort values, and sort_values is a method: my DataFrame object df has a method called sort_values(), with an argument by=, and I want to sort by last_name. By default this goes in ascending, alphabetical order, so surnames starting with A come first, and look at the index on the left-hand side: it's all scrambled now that we've sorted alphabetically by last name. I can do other things. I can sort in descending order with another argument, ascending=False, and if I run that and look, we now start with the Zs. Okay, simple stuff. Now, what if I want to sort by more than one thing? Well, I give it a list in the order I want to sort by: first by age, then by systolic blood pressure. The youngest patient was 30, but there were three of them, and if we jump to the systolic blood pressure, it's now in ascending order within that group: 133, 159, 168. All three of those patients were 30 years old, so it moved on to the second column I wanted to sort by. And you can even mix and match: I can pass a list of values to ascending as well, True and False. So age ascending, but SBP descending, please (descending means False for SBP), and I've got to have an equal number of values in both lists. Now look at it: 30, 30, 30, then 31, but within the three 30-year-olds the highest SBP is first: 168, then 159, then 133. So you can mix and match as much as you want. A couple of useful methods to almost finish off, before we get to missing values and dates and times: nlargest. Here I'm passing 15 to it, saying: take that df.sbp Series, the systolic blood pressure, and give me the 15 largest, please. Boom, there it is.
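Multi-column sorting and both nlargest forms, sketched (the ages and pressures are invented):

```python
import pandas as pd

df = pd.DataFrame({'age': [30, 31, 30, 30],
                   'sbp': [159, 120, 133, 168]})

# Age ascending; within equal ages, sbp descending
result = df.sort_values(by=['age', 'sbp'], ascending=[True, False])

# Series version returns a Series; DataFrame version returns whole rows
top_series = df.sbp.nlargest(2)
top_rows = df.nlargest(2, 'sbp')
```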
I can also change things around slightly, and this is a different notation (this is pandas: it can do things in many different ways), but this time it returns the whole DataFrame object for me. The first syntax gives me a pandas Series; this one is a function on the DataFrame itself: df.nlargest(10, 'sbp'), so give me 10, going down the SBP column, but give me back the whole DataFrame. You can see the difference there. (I'm probably getting tired here, am I not? I hope you were taking breaks.) If you haven't taken a break, take one now, because we're getting to something very important, and that is missing values. Are you back? Let's look at missing values, and at the NumPy library's NaN value. NaN stands for not a number, but it's used for basically any kind of missing data. With NumPy we write np.nan (remember, np is our abbreviation for numpy), and that gives us back nan, not a number. So if I have a list containing 1, 2, 3, np.nan and I look at what my list looks like, behold: it has 1, 2, 3, and then nan. Can I use NumPy's sum function, which we saw before, to sum over all of these? What do you think is going to happen? It returns NaN, because that last value is not a number. Python doesn't know what it is; you don't know; I don't know. How can it add these values? It says 1 plus 2 is 3, plus 3 is 6, plus... what? That depends on what NaN is, and it's not a number, so we don't know, and so it returns NaN for you.
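The NaN-poisons-the-sum behaviour in two lines:

```python
import numpy as np

values = [1, 2, 3, np.nan]

# 1 + 2 + 3 + "we don't know" has no answer, so the whole sum is NaN
total = np.sum(values)
```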
Let's see how that applies to missing data. Inside our data folder (which we're still in, because we changed directory to it right at the beginning) I'm going to import this missing_data.csv file, and very appropriately give it the computer variable name missing_df. Let's print missing_df to the screen and look at that: cells that were empty in the spreadsheet file now pop up as NaN values. Oh boy, are you going to have difficulty analyzing this data. So let's try to get rid of some of these NaN values, or do something with them. One way is just to drop them: missing_df.dropna(). It's a method of a DataFrame, and it drops all those rows. If I look now, those patients are missing: patients 1 through 6 are gone, just because each had perhaps one missing value, and those patients are now gone from my analysis. That might be a bit tough; we all know how often we don't have complete data. Now, you can specify how it should be dropped. The default for the how argument is 'any': if any one value in a row is missing, even just one thing, the row is dropped. You can also say how='all', so that everything in the row has to be missing before it's dropped. There's a bunch of stuff you can do. We were dropping rows just now, but you can also drop columns; you just add axis=1 (or set it to axis='columns'), and it drops a column for you if it finds anything missing, and you can use how there as well. More importantly, you can say you only want to drop a row if something is missing from one very specific column, or more than one. For that we use the subset argument: missing_df.dropna(subset=['age']) will only drop a row if the age was missing. The ages are all there now, and we have dropped some patients (the ones whose age was missing), but some of the other columns still have NaN values in them.
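dropna() with its how, axis, and subset options, on a four-row made-up table:

```python
import numpy as np
import pandas as pd

missing_df = pd.DataFrame({'age': [43, np.nan, 30, 51],
                           'sbp': [120, 133, np.nan, 159]})

# Default how='any': a row goes if even one of its values is missing
dropped_any = missing_df.dropna()

# axis='columns' drops columns instead; here both contain a NaN, so both go
dropped_cols = missing_df.dropna(axis='columns')

# subset: only drop rows where age specifically is missing
dropped_age = missing_df.dropna(subset=['age'])
```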
We can also call the .isna() method on a pandas Series, missing_df.age, and it goes down that Series saying False, it's not missing, or True, yes, it is missing. Behind the scenes (I think I've told you before) False is stored as the number zero and True as the number one, and that makes for a nice little cheat: I can say missing_df.age (that's the pandas Series, the age column), call .isna() on it, which gives me a bunch of True and False, and then call the .sum() method on that. It counts all the Trues as ones and the Falses as zeros and adds them up, and that's an easy way to show how many values in that column are actually missing; I can see there are 17 missing. Okay, so they're missing; what can we do about it? Well, we can replace them: replace missing values. You needn't throw that whole patient's data away; we can impute the data. Data imputation is where you fill in values that you didn't know before, and it's a huge subject all on its own, so I'm going to show you some of the easy ways to do it. You can use the .fillna() method on a Series object. So I say missing_df.age (that's a Series), I use the .fillna() method, and I can set a method for it: 'ffill', which means forward fill. When we look here, forward fill means it takes the value just above the gap and fills that in, and if there were two NaNs in a row, they'd both get that value. There's a forward fill and there's also a backfill, 'bfill', which takes the number after the gap to fill it in. That's not very scientific, but hey, it works. So there I'm just using the forward fill.
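Counting and forward-filling the gaps; note that I use the .ffill() method here, which does the same forward fill as the fillna(method='ffill') spelling in the video (that spelling is deprecated in newer pandas versions):

```python
import numpy as np
import pandas as pd

age = pd.Series([22, np.nan, np.nan, 40])

# True counts as 1, so summing the isna() booleans counts the NaNs
n_missing = age.isna().sum()

# Forward fill: every NaN takes the last non-missing value above it
filled = age.ffill()
```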
Another way, more for numerical variables, is perhaps to fill in the median value: take the median of a numerical variable and fill in all the NaNs with the median for that variable. So let's look at age: missing_df.age.median(). Now, you see something very important here that we should just stop at: pandas is very good at automatically ignoring missing values. Unlike NumPy (remember, we couldn't sum because there was a NaN value), pandas by default ignores NaNs, though we can set that whichever way we want. So the median age was 40. I can now say missing_df.age.fillna(missing_df.age.median()), so that I fill all the missing values with the median, which is 40. That's scientifically a better way to go about it. Of course, if I wanted to make it permanent, I'd have to use the inplace=True argument. Coded missing data is something I want to talk to you about as well. Sometimes when we collect data, we agree with the people who collect it on codes that mean the value is simply not available. It's not that someone was lazy and didn't fill it in; somehow, no matter what, that data was not measurable, it does not exist. We can use a code like 999, or null, or literally the word missing; we can have all these codes. So when we import a spreadsheet file where we know such values mean missing, we can say to the pd.read_csv() import function: import this spreadsheet file for me, but if you find these values, change them to NaN. That's much easier than going through later and changing them all (remember .replace() and .map() with dictionaries in curly braces); just make them NaNs right from the beginning. We do that with the na_values= argument, passing a list of all these codes.
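The median fill described above, as a sketch:

```python
import numpy as np
import pandas as pd

age = pd.Series([30, np.nan, 40, 50])

# pandas skips NaN automatically when computing the median
median_age = age.median()

# Fill every missing value with that median
age = age.fillna(median_age)
```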
Now, in my dataset, instead of seeing 999 and null and missing (whatever codes you came up with), you'll see NaN values, and sometimes that is a lot more helpful. Go take a break if you want, because the last exciting thing is coming up, and I'm going to brush lightly over it, because again it's a very deep subject, but I want to show you the easiest ways to deal with it: dates and times. As I said, you're going to get dates, and people do the most horrendous things when they capture dates. They put the day first, or they just change convention halfway down the line; it gets crazy. Okay, so let's import this dataset, a spreadsheet file, dates_times.csv, and I'm using the computer variable dt for it instead of df; felt like something different. We have an ID column (patients got an ID when they entered the study), first name, last name, date of birth, and time of admission. You can see this is very contrived, because I wanted exactly this type of data so that I can show you. If we look at the date-of-birth column, it looks to be month, day, full year, so not just 58 but 1958. If we look at the times, the second one is 23:52, so that's a 24-hour clock, 8 minutes to midnight. That becomes important. So let's look at the dtypes of this dt DataFrame object. Contrived as it is: ID is int64, right; first name is object, categorical, that's right; last name, that's right; but DOB is an object, and it shouldn't be, that's a date; and time of admission (somewhere it was important what time of day these people came in) is an object as well, instead of a time. That's all wrong. So let's do something, but first something useful that we've seen before: we'll create a new column, call it DateTime, and it's going to be a concatenation of different Series. Give me dt.DOB, concatenate (plus) a space (that's a string, so I put it inside quotation marks), plus the time-of-admission column.
Let's have a look at that now: we have the DOB and the time all in one column. Okay, that's very wrong semantically, because the time of admission and the date of birth have nothing to do with each other; I just wanted to show you that you can combine them, because we have this thing called a datetime object. There's date and time in there, and that's why this contrived example has both. (I suppose I could have called it the time of their birth, but not all of us know exactly what time we were born.) But let's look at it now: this DateTime column we've created is still an object dtype, isn't it? So we get this to_datetime pandas function: pd.to_datetime(), with the pd. because it's peculiar to pandas. I'm going to create a new column, just because I can, and call it datetime, with a lowercase d, using this to_datetime function, and I'm going to say: take the dt.DateTime column as it stands now and interpret the following format for me. There's a format= argument, and you can go to the pandas website and look these codes up, because uppercase and lowercase letters all mean something very specific. Inside quotation marks we have %m, lowercase m, which just means an integer month value. Then we see a forward slash, and if we look here at this last row, index 24, the 25th patient, I'm going exactly according to that recipe. Then there's %d, lowercase, because the day is also written just as an integer, then a forward slash again, and then %Y, uppercase. Lowercase y would have meant the shortened year, just 59 or 58, but it's uppercase Y, so it means the full year, 1959. Then there's a space, because we have a space there, and then %H, a colon, and %M, both uppercase: they tell us hour and minute, in military time, a 24-hour clock.
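The concatenate-then-parse recipe; the column names and example values here are invented stand-ins for the video's dates_times.csv:

```python
import pandas as pd

dt = pd.DataFrame({'DOB': ['3/16/1959', '12/1/1958'],
                   'time_of_admission': ['08:15', '23:52']})

# Glue date and time strings together, then parse with an explicit format:
# %m month, %d day, %Y four-digit year, %H:%M 24-hour clock
dt['DateTime'] = dt.DOB + ' ' + dt.time_of_admission
dt['datetime'] = pd.to_datetime(dt['DateTime'], format='%m/%d/%Y %H:%M')
```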
So you've got to tease out what the format is all about. And now look at what it's done: there's my new datetime column. Beautiful, because this is now really a datetime object, and we can extract stuff from a datetime object. For instance, I can create a new column called month and say dt.datetime.dt.month_name().str.slice(stop=3). Goodness gracious, great balls of fire; I hope your mind is refreshed and you can tease that apart, but it's not that difficult. dt.datetime, that's our Series; let's start there. It has this .dt accessor, which has a method called month_name(): it looks at the datetime object, knows which of all those values is the month, and extracts the month name for us. If you just stopped there, you'd get the full January, the full February, the full March, but I want the short three-letter abbreviation, so I slice it a bit with the .str.slice() method, saying stop=3. January becomes Jan, February becomes Feb. You needn't put that part in if you want the full names; just stop at .month_name(). Okay, there we go, and now look at this last column here, month: Feb, Mar, Jan, Feb, Aug. It extracted that from the fact that it's a datetime object and knows exactly where everything is. So stop at month_name() if you want the full names; the extra bit is just me confusing you, but it looks nice like that, and I wanted to show you how to do it. And because it's a datetime object, I can also extract other things.
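Extracting pieces from a datetime column, with two made-up timestamps:

```python
import pandas as pd

dt = pd.DataFrame({'datetime': pd.to_datetime(['1959-02-08 23:52',
                                               '1958-08-01 07:30'])})

# .dt.month_name() gives 'February'; .str.slice(stop=3) shortens to 'Feb'
dt['month'] = dt['datetime'].dt.month_name().str.slice(stop=3)

# Other pieces come out of the .dt accessor the same way
dt['year'] = dt['datetime'].dt.year
dt['hour'] = dt['datetime'].dt.hour
```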
I'm saying: give me the datetime Series, dt.datetime, and there's a .dt.year attribute, so it goes down all of that and just extracts the years for me. Or I can extract all the hours, or say .dt.minute and get the minutes, and I can put that in a new column, because it might be important for my analysis. Okay, that's it; it's done. Congratulations, you've made it: you've learned something about pandas. Again, the first time you see this it's going to be difficult, and you'll have this question, I want to do this to my data, and you won't remember. Come back to this video, come back to this notebook, search on Google; formulate it somehow, type it in, and someone will have answered your question. Go to the pandas website; the documentation will answer your question too. And after you've used it for just a little bit, gone through one project, gone through your second project, by your third project all of this is just old news. You'll just know how to do it, and even if you haven't seen something before, you can guess how to do it, because these things form a pattern. Python, and pandas, is beautiful this way: once you figure out how to do one thing, you can kind of guess what a new thing is going to look like. I hope you enjoyed this video; give it a thumbs up, subscribe, all those nice things I should probably say so that you can confuse me with a real YouTuber, and then tell others about this. So many people have taken the course, and it's been such a wonderful thing; I want to keep on spreading these resources that teach you how to do this yourself, because in so many areas where we need answers from research, people can't do their research. There's no one there with the ability to do it, and it's expensive to get; there's expensive software to do it on (again, looking at you, IBM and SPSS). I hate that, because it removes people's ability to do research, and through that research we
can get proper answers, answers that might otherwise be hidden from us in the medical community. So I hope you're enjoying this; tell others about it so that they can learn to do their own data analysis using Python. I hope I get some time to make more of these videos.