In this lecture, I'm going to introduce you to pandas. I'm going to break this up into a few shorter lectures because there is quite a bit that we have to go through. First, as you can see here, I'm going to cover just the basics, to get you familiar with the package. Then we'll move on to actually importing some data that was created inside of a spreadsheet, which is the normal way that you're going to go about using pandas.

So what is pandas? Well, that is the module that does all the managing of the data for us. It's a phenomenal package. You can see I say here, I love pandas. It has aspects of data analysis to it. It has aspects of manipulating data as in a SQL database. As per usual, it extends the base Python language to do a lot of things.

Let's get going. So you see the first cell there. If I double click on it, it's just normal text. It's a markdown cell. Just to make all these notebooks look the same, I'm going to stick to this convention and import the CSS style sheet there. You don't have to do that. I'm doing that just for this nice gray background and the blue and orange font there.

First things first though, this is the proper part. We've got to import some things that we're going to use inside of this notebook. We're going to extend Python by importing a few things, and I want to run through these with you. Now I've marked this line as a comment. If I take the hash away, you see that line of code will be executed. But if I put the hashtag or pound sign in front, that becomes a comment line and Python will ignore it.

So I'm going to import pandas as pd. We've seen this before. We import the module pandas and we're going to use its abbreviation. So all the code words that were written into pandas, I can now access by saying pd, then a dot, and whatever code word I want to use.
SciPy stands for scientific Python. That's a phenomenal package. You can do an enormous amount with it. As far as mathematics is concerned, basically what you can do in commercial packages you can do in SciPy. It has a bunch of submodules and this one is stats. So from scipy.stats, I'm going to import this word here, norm. That stands for the normal distribution. Now scipy.stats contains all sorts of distributions. And what you can do with those distributions is say, give me samples, 1,000 sample values, but make them come from a normal distribution, or make them come from any of the other distributions. There are plenty of distributions, but I just want the norm one for the purposes of this lecture. We're going to skip those two commented lines, which we'll use in future lectures.

Now this is an important one. Import matplotlib.pyplot. So it's the submodule pyplot from the module matplotlib. You can see the word plot there. You can think to yourself, this is going to plot some graphs for me. And they're phenomenal graphs. You can just export them. You can submit them for publication to research journals. No problem. Powerful package there. And I'm going to import it again with its abbreviation, plt. These abbreviations you can call whatever you want. I'm going to call them plt and pd. This is the norm. This is what people use all over the world. So I try to stick to these, but you can use whatever abbreviation you want.

Seaborn just extends matplotlib. It has a few extra plots in it, and it adds a nice design to your plots. Makes them look slightly better. There's a bit of a design aesthetic to it, plus a few extra bits. And it's normal to import it as sns.

Now from warnings, I want to import filterwarnings. This is something I always do. Many times in the IPython notebook, you'll write some code and the code will still execute, but the notebook wants to warn you that there's something wrong with it.
It's still going to execute. There's nothing wrong with the code itself. It's some interaction problem. And then you get these ugly pink blocks of text. And I absolutely hate those. So you can set them to be ignored.

So I've done all my imports here. And I'm just leaving this space, this empty line, to make the code look nice. Then there's that percentage sign. There are a few magics built into IPython, and this one is %matplotlib inline. That says whenever I want to plot something, draw a graph, it'll actually render it right in this web page. If I didn't have that, an extra pop-up window would appear with the graph. And sometimes you want that, because from that pop-up window you can actually manipulate the graph a bit more. You can save it to your hard drive if you want to use it in a publication. For our purposes here though, I want the graphs that we're going to draw to appear right inside of the web page.

Now filterwarnings, remember, is the function I've imported from warnings. And it takes an argument. Arguments go in these parentheses, these brackets. And I put a string of text in there, ignore. And that says whenever it finds one of those things you might have done wrong and it wants to print that ugly pink text box, it'll just ignore that.

So we're going to run that code by clicking run up there, run cell as you can see. Or alternatively, when you're in that code cell, you can just hit shift enter, shift return. See the little star that was there? It was executing. Now it says this is the second block of code that I've executed.

Let's start off with pandas. I just want to show you if I double click, where are we? There we go. If I double click there, it's just the word Pandas as a heading one in a markdown cell. So if I were to run that, it's going to render a nice h1 tag. Introduction. Pandas is a Python module. Let's change that to module.
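To make the filterwarnings call described above a bit more concrete, here is a minimal sketch using only Python's built-in warnings module. The helper name count_shown_warnings and the warning message are made up for illustration; the only piece taken from the lecture is the call filterwarnings("ignore").

```python
import warnings
from warnings import filterwarnings


def count_shown_warnings(ignore: bool) -> int:
    """Raise one harmless warning and count how many would be displayed."""
    # catch_warnings restores the original warning settings afterwards,
    # so this demo does not change the rest of the session.
    with warnings.catch_warnings(record=True) as shown:
        warnings.simplefilter("always")  # start from "show every warning"
        if ignore:
            filterwarnings("ignore")  # the call used in the lecture
        warnings.warn("This code still runs, but something looks off.")
        return len(shown)


shown_by_default = count_shown_warnings(ignore=False)
shown_when_ignored = count_shown_warnings(ignore=True)
print(shown_by_default, shown_when_ignored)  # prints: 1 0
```

In the notebook itself you would simply call filterwarnings("ignore") once in the import cell. Keep in mind that this hides every warning, including ones that might be worth reading.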
There's a bit of confusion about when things are called a module and when they are called a library, et cetera. It's really not important for us as healthcare researchers to know the dark secrets of modules and libraries.

Anyway, you can see I've enclosed it in these opening and closing p tags. It's a markdown cell, so that becomes a paragraph. And you'll remember that a single little star on both sides makes text italic, and a double star in front and behind makes it bold. So this word pandas, with the single star in front and behind, is going to render in italics, and the double, as you saw when it was there before, would be bold text. Okay. As I said, you'll pick up the markdown and the HTML very quickly. I'm not going to use a lot of it.

Some basic pandas. Now I'm actually going to call this a list, because it's more of a list than an array really, but it's an array of values. So what am I doing here? I'm creating a space in memory, a bucket. I'm going to call that bucket values_1. I could call it whatever I want. I could call it i_like_my_dog. Obviously you mustn't use words that are part of Python's own syntax, the reserved words. And there are some restrictions as to special characters and numbers, et cetera. Underscores are fantastic. You can just use them. So, values_1.

Now I've put things in these square brackets. Square brackets allow me to create a list of things, and I'm going to put a comma in between each of them. I've just thumb-sucked these values: 12.3, 14.2, 15, and so on. So it's just values in a list. Let's run that. Remember we said a equals 7 where we introduced the notebook. Well, instead of just one value, I can put in a lot of values.

Now let's introduce some basic pandas. I'm going to make a new bucket in memory. I'm going to call it data_1. And that's going to equal pd, remember that's my abbreviation for pandas, then a dot. And now we're going to start using code that is within pandas.
And the first one I'm going to use is Series, with a capital S, and then I open and close my brackets there. And inside of it, I want to put this computer variable values_1, which holds this list. That's what I'm doing. Now, what goes inside of these brackets that follow a word like Series is called the arguments.

Now I'm just going to delete that quickly and show you something. Let's type pd and a dot. See that dot turns red? If I now hit the tab key, it'll give me a list of everything that pandas can do. Look at that. All computer code that you can execute. Lovely stuff. I can hit the capital S. Now I will just see all the ones that start with S. I can see Series is the one I want. So I can double click on that and it'll autocomplete for me. And watch this: as soon as I hit the open bracket there, it's going to give me a little tool tip as to what the arguments are that go in there. Now most of these arguments have defaults, so you don't even have to put them in. The one that it does want is the data. So you must put something inside of this Series. And what I wanted to put in there, remember, was values_1. So I'm going to say va. And if I hit tab again, it'll find anything that I am allowed to put in there. And values_1 is what I want to put in there. So I can double click on it and it will autocomplete for me. Beautiful.

Now remember I said a equals 7, and then I just typed a and ran that, and it told me a was 7, just printing it to the screen. Let's do that with data_1. And lo and behold, look here. That is what a series looks like. It almost looks like a small spreadsheet. We have two columns, and it has put each value from that list in its own row. This on the left hand side is called the index. Python always starts counting from 0, so index 0 is actually the first row. So the first row contains the value 12.3. The second row has an index of 1, and so on. That's how an index works. So a series is a list of values.
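As a rough sketch of the two steps above, the list and the series could look like this. Only the values 12.3, 14.2 and 15 are read out in the lecture (9.9 and 18.3 are mentioned later as the minimum and maximum); the remaining numbers are placeholders invented here so that the list has 13 entries.

```python
import pandas as pd

# A plain Python list: square brackets, values separated by commas.
# NOTE: most of these numbers are illustrative placeholders, not the
# lecturer's actual data.
values_1 = [12.3, 14.2, 15.0, 9.9, 18.3, 13.1, 16.4,
            11.8, 14.9, 15.4, 12.7, 13.6, 17.0]

# Wrap the list in a pandas Series (note the capital S).
data_1 = pd.Series(values_1)

print(data_1)        # index on the left (starting at 0), values on the right
print(data_1[0])     # the first row has index 0 and holds 12.3
print(data_1.dtype)  # float64, because the values are decimals
```

In a notebook you can leave out print and just type data_1 on the last line of the cell to see the same display.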
And it needn't just be numbers like it is here. These are decimal values, so they're floating point numbers. You can see it says float64: 64-bit floating point values. Float means decimal. That's the data type that's inside here. But a series like this is row upon row, each with its own row number.

You can always ask the computer. Remember I said that, unlike in other languages like C#, you don't have to explicitly tell Python what type of data you're putting into a variable. Python will infer it from context, but you can ask it if you're not sure. So you can say type, which is a Python code word, and put values_1 in the brackets. So you want to know what type of data values_1 is. Well, it is a list. That's what we created there, a list.

Now we can ask the same about data_1. Let me just do this first: I'm going to say Cell, All Output, Toggle. See what that does? It hides all the output. So where were we? We were there. And if I now execute it, it says pandas.core.series.Series. It is a Series object, where values_1 was a list object. The computer variable values_1 had a list inside of it. data_1 has a series inside of it.

Now what makes a series? Why would I not just use lists? Why did I want to create a series? This is the reason. Look at this. If I were to type data_1 now, let's do that. I delete that and say dat, and I can hit tab. It autocompletes to data_1, the only name starting with dat that it found. And if I put a dot, that dot is red. Look what happens if I hit tab. Look at all the things I can now do with data_1. Just look at it. It's phenomenal what you can do. But what I want is describe. So d, where are we now? There we go. describe is the one I want. And I'm just going to open and close the brackets. When I hit the open bracket, it automatically puts the closing one in for me. I don't have to do that. There's the tool tip. I don't want any of those.
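The type check and the describe call from this part of the lecture might be sketched as follows. As before, the list is a stand-in for the lecturer's data (only some of its values are mentioned in the lecture), so the mean and standard deviation printed here will not match the ones read out on screen.

```python
import pandas as pd

# Placeholder list standing in for the lecture's values_1.
values_1 = [12.3, 14.2, 15.0, 9.9, 18.3, 13.1, 16.4,
            11.8, 14.9, 15.4, 12.7, 13.6, 17.0]
data_1 = pd.Series(values_1)

# Ask Python what kind of object each variable holds.
print(type(values_1))  # a built-in list
print(type(data_1))    # a pandas Series

# describe() with no arguments uses all of its default settings and
# returns count, mean, std, min, the quartiles and max.
summary = data_1.describe()
print(summary)
```

The result of describe is itself a Series, so an individual statistic can be picked out by its label, for example summary["mean"].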
All of those are default values. So I don't actually want any of this. I just want to run that. Look at this. Phenomenal. Basic descriptive statistics done for you. It says there were 13 values in data_1, which was a series, remember? The mean value is 14. The standard deviation is two and a half. The minimum value is 9.9. Then there's the 25th percentile, and the median, which is the 50th percentile. The third quartile, the 75th percentile, is 15.4, and the max value is 18.3. So immediately you can do statistical analysis just because you imported a list of values as a series.

It gets even better than that. Remember we imported seaborn. When you type in sns, a dot, and tab, you'll get a list of things that seaborn can do. What I want is this distplot, distribution plot. And I want a plot of data_1. So the pandas series, I want it plotted as a distribution plot. Look at that. Phenomenal. So it made a histogram. We'll talk about what a histogram is. It just divides the values into these little bins. It looks like this bin goes from nine to something else. Now you can specify the number of bins. It is an argument. Let's say bins equals, let's make it ten bins. If I now were to run this, see, it just made the bins a bit smaller. So the bin goes from there to there, and the bar just tells us how many of those values were inside of that bin.

Now you'll see the y-axis here. What it does is scale things so that all of the bars, if you add them all up, make up the space that contains 100% of all the values. Now something like 100% can also be called a fraction, which is 1.0. 50% would be 0.5. So that's what it does here. It tells us that between that value on the x-axis and that value, about 9% of all the values fell there. That's a histogram. See, there were no values in that little range there.

What this plot also does is draw for me this nice curve, and you saw that the values that I just thumb-sucked were basically normally distributed.
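To see where the numbers on that y-axis come from without drawing anything, here is a small sketch that bins the values with NumPy instead of seaborn. This is a swapped-in technique for illustration, not the plotting call used in the lecture, and the list again contains placeholder values. If you want the picture itself, note that newer seaborn releases deprecate distplot in favour of histplot and displot.

```python
import numpy as np

# Placeholder list standing in for the lecture's values_1.
values_1 = [12.3, 14.2, 15.0, 9.9, 18.3, 13.1, 16.4,
            11.8, 14.9, 15.4, 12.7, 13.6, 17.0]

# Split the range of the data into ten equal-width bins and count
# how many values land in each one.
counts, edges = np.histogram(values_1, bins=10)

# Fraction of all values per bin: these add up to 1.0 (that is, 100%).
fractions = counts / counts.sum()

# With density=True the bar heights are rescaled so that the total
# AREA of the bars (height times bin width) equals 1.
density, _ = np.histogram(values_1, bins=10, density=True)
widths = np.diff(edges)
total_area = float((density * widths).sum())

print(counts)
print(fractions.round(2))
print(total_area)
```

A bin with a count of zero corresponds to a gap in the histogram, like the empty range pointed out in the lecture.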
There was some normal distribution to them. What this does is estimate a smooth curve from the values, something called a kernel density estimate. It's a statistical technique that spreads each value out a little so that it can draw this distribution curve. What's very nice about this distribution plot, and you don't have to remember what I say now, we'll talk about it extensively in future lectures, is that the area under this curve is actually 1. If I were to do integral calculus on the function of this graph, the answer would be 1: 100% of the values fall under this curve. Don't worry about it. This is a little taster of what statistics is all about.

Now I'm going to create values_2. It's a new computer bucket, and I'm going to create another series. Obviously I'm going to put something inside of this series. Before, I just put that list of values in, but here I want to do something else. One thing to note: we're never going to do this in our statistical analysis. This is just for interest's sake. I'm putting inside of it code that I want executed. You can see the open and close brackets turn green. All of this is going to go inside of this new series.

What I want to put in it is this norm. Remember we imported this normal distribution. rvs stands for random variates. And you can see the size. I say, please Python, give me 13 samples drawn at random, but they must come from a normal distribution. That's norm.rvs. Then loc, location, actually means the mean. I want the mean of those values to be 18. Then scale. Scale means standard deviation. I want the standard deviation to be 4, and I want 13 values from there, please. Let's run that code. It is now inside of values_2. Every time you run this block of code, it will give you 13 different values, because they come to you at random, but from a normal distribution.

Now let's just move on. Before we end this part of the lecture, I'm just going to introduce you to the data frame.
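The random draw described above can be sketched like this, using the loc, scale and size arguments mentioned in the lecture. No random seed is set here, on the assumption that you want to see different values on every run, as the lecture describes.

```python
import pandas as pd
from scipy.stats import norm

# Draw 13 random values from a normal distribution with
# mean (loc) 18 and standard deviation (scale) 4, and store
# them in a new Series.
values_2 = pd.Series(norm.rvs(loc=18, scale=4, size=13))

print(values_2)         # different numbers every time the cell is run
print(values_2.mean())  # usually near 18, but rarely exactly 18
```

If you ever need the same numbers on every run, norm.rvs also accepts a random_state argument.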
So the series was one, now the data frame is the other. I'm going to make a new computer bucket called data_2, and I'm going to make this empty data frame. Why on earth would I want to do that? The difference between a data frame and a series is that the series only had that one column with the index. The data frame can have multiple columns. And I can actually create columns by saying data_2, which is now this empty data frame, then opening and closing these square brackets and putting some text inside. That text is going to be the name of a column. It's a column header: like in a spreadsheet, where you can type a name in the first row of a column, that's what I want to do. And I want to put values_1 in that column. Remember my 13 values in my list? So this is another way to do that.

Let's just look. It's printed out slightly differently from before. Now it actually does create this little spreadsheet for you. I still have my index values on the side, but now I have a column name. Now if you wanted to build stuff like this, you could do that. I'm just introducing you to the concepts here. This is not the way we're normally going to use it. You're going to have a spreadsheet with data and you're going to import that spreadsheet straight away. You would not have to build a data frame like this.

So to my data_2, I can add another column. This one I'm going to call var_2. And in that I'm going to put these 13 random values. values_2 was this series. So I can put a series inside of a data frame. Let's do that. And now let's execute data_2, and lo and behold, there are now two columns. And as I say, every time you execute this, these values are going to come at random from a normal distribution with a mean of 18. Now you're only drawing 13 values, so your mean is almost never going to be exactly 18. It'll be off of 18 most of the time because it just drew 13. If I drew a thousand from that normal distribution, my mean would probably be much closer to 18.
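Here is one way the data frame construction above could look. The column name var_2 comes from the lecture; var_1 is an assumption for the first column's name, and the list of values is again a placeholder. The last few lines illustrate the point about sample size by comparing the mean of 13 draws with the mean of 1,000 draws.

```python
import pandas as pd
from scipy.stats import norm

# Placeholder list standing in for the lecture's values_1.
values_1 = [12.3, 14.2, 15.0, 9.9, 18.3, 13.1, 16.4,
            11.8, 14.9, 15.4, 12.7, 13.6, 17.0]
values_2 = pd.Series(norm.rvs(loc=18, scale=4, size=13))

# Start with an empty data frame and add one column at a time.
data_2 = pd.DataFrame()
data_2["var_1"] = values_1  # a list becomes the first column (name assumed)
data_2["var_2"] = values_2  # a Series becomes the second column
print(data_2)

# A small sample's mean wanders around 18; a large sample's mean
# is typically much closer to it.
small_mean = data_2["var_2"].mean()
large_mean = norm.rvs(loc=18, scale=4, size=1000).mean()
print(small_mean, large_mean)
```

In practice, as the lecture says, you would import a spreadsheet rather than build the data frame column by column.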
But look at this, I can describe the whole data frame now. It's going to do what we wanted before, the normal descriptive statistics, but it does it for each of my columns. And lo and behold, look, the mean was close to 18, with a standard deviation very close to 4 there, as you can see, and there are my minimum values, my medians, my maximum values.

Last thing very quickly, I'm just going to do this cell of code and then we're going to stop. I'm going to make a new bucket, and I'm going to do something to the data frame which I created, data_2. Now this is a bit of coding that you have to get used to. You're telling it: take data_2, and then take this column out of data_2. But you can't just refer to the column by itself. You have to refer to it with the whole name, data_2 and then the column, and you've got to put that whole thing inside of square brackets following data_2. So just get used to repeating this data_2, data_2. From that column, I only want the values that are larger than 15. So chuck out all the other ones and make a new data frame.

So this is what happens. Because I assign the result to a new computer variable, that variable also becomes a data frame. And then I'm just asking it to print it to the screen. You can see it keeps the original index values, but the new data frame only contains the rows where the values in column one, this var one column, are larger than 15.

Good, we'll stop here and we'll continue in the next lecture.
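To round off, the describe call on the whole data frame and the "larger than 15" filter might be sketched as below. The names var_1 and data_3 are assumptions made for this example (the lecture does not spell out the first column's name or the name of the new bucket), and the list holds placeholder values.

```python
import pandas as pd
from scipy.stats import norm

# Placeholder list standing in for the lecture's values_1.
values_1 = [12.3, 14.2, 15.0, 9.9, 18.3, 13.1, 16.4,
            11.8, 14.9, 15.4, 12.7, 13.6, 17.0]
data_2 = pd.DataFrame({
    "var_1": values_1,
    "var_2": norm.rvs(loc=18, scale=4, size=13),
})

# Descriptive statistics, one column of results per data column.
summary = data_2.describe()
print(summary)

# The inner part, data_2["var_1"] > 15, gives True or False for each row.
# Putting it inside data_2[...] keeps only the rows marked True.
data_3 = data_2[data_2["var_1"] > 15]
print(data_3)

# The original row numbers are kept in the new data frame.
print(list(data_3.index))  # prints: [4, 6, 9, 12]
```

Note that 15.0 itself is dropped, because the comparison is strictly larger than 15.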