Let's become familiar with the mock data that we're going to use throughout this lecture series. First of all, we have our fancy style sheet there. You needn't do this. That's just for me, to get consistency through all of these lectures so that all these notebooks look similar.

We're going to set up our Python environment: the things that we're going to import and the things that we're going to execute to expand Python's capabilities. The first thing I want to do is import numpy, and I'm going to use the abbreviation np. I'm going to comment out these lines. When I make these notebooks I use templates, and from these templates I skip lines by putting the pound sign, or hashtag, in front of the line. That comments out the line and Python will ignore it.

So we skip down here to import matplotlib. That is a plotting module. It has a submodule called pyplot, and I'm going to import that with the abbreviation plt. I'm going to import seaborn as sns. With all these "as" words you can use whatever name you want, but there are conventions and most people stick to them, so if we do share our code, we all have the same abbreviations. Now, seaborn just extends matplotlib. It adds a few extra plots to matplotlib and it also adds a bit of eye candy. There's a lot more colourization and customization of the graphs that you can do with seaborn.

I'm going to import the very important pandas, as pd. That is what we use to play with our data, to restructure our data, and to get solutions and answers from our datasets.

And then something I always do. Sometimes when you run these scripts there's no explicit error in the syntax of the Python that you write, but there's something that the Python interpreter or IPython does not like, and it will give you this ugly pink warning box. I like to ignore those, so from warnings
I import filterwarnings. How does that work? If I import something like this, from warnings import filterwarnings, then filterwarnings is one of the functions inside of the module, or library, warnings, so I can just reference the word filterwarnings directly. Then I use it, and I use this argument. Whatever goes inside of these brackets behind the function is called an argument, and the argument I use, which is in quotation marks (in other words, it's a string), is ignore.

In the IPython notebook you have these magic commands. Percentage sign matplotlib is one of them: %matplotlib inline. There are no round brackets and no quotation marks; there is a space, then inline. That will render any matplotlib plots, and seaborn plots for that matter, right on the web page here. It won't open a new window with the graph. Sometimes you want that, specifically if you want to save those plots to insert in a document that you want to submit for publication, but inline is what we want here.

Now, I mentioned to you that seaborn can really extend your capabilities for plotting graphs. This is not important for this course. Google seaborn and see what you can play with: setting the style of seaborn, setting the context to a paper background, or setting line widths. You certainly don't have to do this. You can just go with the default settings for seaborn.

So let's become familiar with our data. The data set that we are going to use is MOOC_mock.csv. That's its name. It's a comma-separated values file. That is one of the spreadsheet formats; most spreadsheet programs can save your spreadsheet as a CSV file as opposed to their native format: xls or xlsx for Excel, the open document formats for LibreOffice, and Numbers as far as the office suite for Apple is concerned. You can save your files as CSV. If it is a CSV file, I have to use this function, read_csv. That is Python code in the pandas module. Remember, I imported pandas as pd.
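Pulling the environment setup described so far together, a minimal sketch might look like the block below. The abbreviations np, plt, sns and pd are the ones named in the lecture. The Agg backend line and the "whitegrid" style are my own assumptions, added so the block also runs outside a notebook; inside a notebook you would use %matplotlib inline instead.

```python
from warnings import filterwarnings

import numpy as np
import pandas as pd
import matplotlib

# Assumption: a headless backend so this runs as a plain script.
# In an IPython notebook, use the magic command `%matplotlib inline` instead.
matplotlib.use("Agg")
import matplotlib.pyplot as plt
import seaborn as sns

# The string "ignore" is the argument: it hides the pink warning boxes.
filterwarnings("ignore")

# Optional seaborn styling; the default settings are fine too.
sns.set_style("whitegrid")  # assumed style, not named in the lecture
sns.set_context("paper")
```

Because filterwarnings was imported by name, it is called directly; with a plain import warnings it would have to be written warnings.filterwarnings("ignore").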
I must say pd.read_csv. I can't just use read_csv; otherwise I would have had to say from pandas import read_csv, and then I could just have used read_csv. I've deleted some lines, so it's pandas.read_csv. There we go, pandas.read_csv. Oh, there's another dot that has to go. As soon as it reads the file, it's going to store it inside of a computer variable, which I've called data. That computer variable data is actually an object, and that object is a pandas data frame. Whenever you use read_csv, that computer variable will become a data frame.

Let's import it, and we get an error. Why did we get that error? Because we never ran this block of code. We did not extend the Python language by running this block of code. Let's run that block of code, and now if we go down here and run this there are no problems whatsoever.

Now we need to see whether our data set was imported correctly. I've got the word data now. If I write data, then a dot, and hit the tab key, it gives me a list of all the things that are attached to data, all the methods that I can apply to data. In this instance I want to apply head. So I start typing h, and you'll see the two methods that I can apply to data start appearing there, and it's the head one. I can double-click it or just write out head. If I open my parenthesis there, it says self, n=5. These are the arguments that I give to this method so that it can work on the data data frame. I just want to put three in there. That self you can usually ignore; don't worry about it. Three, or n=3 (I can just put 3 there), means: please show me all the columns and the first three rows. I just want to see if the data imported correctly, and lo and behold, there are the first three rows of our data. Again, Python counts from zero: zero, one, two. So that's the first, the second and the third row.
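As a rough sketch of the read_csv and head steps: the real MOOC_mock.csv is not available here, so the block reads a few made-up rows from a string instead. The File column name and all the values are my own assumptions; Age, Gender and ICU are column names mentioned in the lecture.

```python
import io

import pandas as pd

# Stand-in for MOOC_mock.csv: made-up rows, not the real file.
csv_text = """File,Age,Gender,ICU
1,38,Female,No
2,32,Male,No
3,19,Female,Yes
4,20,Male,No
5,51,Male,No
"""

# With the real file this would be pd.read_csv("MOOC_mock.csv").
data = pd.read_csv(io.StringIO(csv_text))

print(type(data))    # the object is a pandas DataFrame
print(data.head(3))  # all columns, rows 0, 1 and 2
```

Calling data.head() with nothing in the parentheses falls back to the default of n=5.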
Let's run through what our dataset contains. File numbers: we didn't want our patients to be identified. We have age, which might or might not be coded. In other words, between yourself and all your investigators you might say, we're just going to subtract five from every patient's age and then capture it here, so that de-identifies our patients. Female and male: that's the wrong way to do that. Remember, we want to de-identify that data as well. In other words, use a code. You would say, use 2, 4, 6, 8 and the letters R, B or C; if I entered any of those, that would be female. So we have the secret code behind the scenes. For our purposes here we just put down male, female, male, female, etc. That's gender.

Delay: how many days from when the symptoms developed until they presented to hospital, in integer format. Stay: how long did they stay in the hospital before discharge? Were they admitted to the intensive care unit, yes or no? What was the retroviral disease status, negative or positive? So no, yes, no. Did they have a CD4 count? When you see something like NaN, that is pandas interpreting the fact that in the data set, in the spreadsheet, that was not a number. Either that cell was left blank, or someone typed in not done, not performed, not applicable, and that entry was marked as missing when the file was read. Where there is no number, it says NaN, not a number.

Heart rate: the admission heart rate. The patient's admission temperature. Whether a CRP, a C-reactive protein, was done and what the value of that CRP was. What the admission white cell count, the leucocyte count, was. What the admission hemoglobin values were. At surgery, was the appendix found to be ruptured, yes or no? All the appendices were sent away for histological examination in the laboratory, and under the microscope a call was made whether there was infection: yes for appendicitis, or no, we took out a normal appendix. Were there any complications while the patient stayed in the hospital?
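A small sketch of how NaN shows up on import, using a made-up CD4 column (the column name and values are assumptions). With the default settings a blank cell becomes NaN, but free text such as not done is kept as text, so the sketch also lists those markers in the na_values argument of read_csv.

```python
import io

import pandas as pd

# Made-up rows: one blank CD4 cell and one free-text entry.
csv_text = """File,CD4
1,
2,57
3,not done
4,312
"""

# Default settings: only the blank cell becomes NaN.
raw = pd.read_csv(io.StringIO(csv_text))

# Listing the free-text markers makes pandas treat them as missing too.
markers = ["not done", "not performed", "not applicable"]
clean = pd.read_csv(io.StringIO(csv_text), na_values=markers)

print(raw["CD4"].tolist())
print(clean["CD4"].tolist())
print(clean["CD4"].isna().sum())  # how many CD4 counts are missing
```

Once the markers are listed, the column is read as numbers with NaN in the gaps, which is what later calculations need.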
Yes or no: wound infection, respiratory tract infection, urinary tract infection, bleeding. Any complication we just coded as yes or no. And the modified Alvarado score: that is a scoring system with which we predict whether a patient has appendicitis or not. That is our data. This is mock data; it does not belong to any true patient. These values were just thumb-sucked to make this spreadsheet so that we can play with the data.

Now let's start finding out a bit more about our data set. Let's start playing around with it. It's called data, and this is the way that we refer to a column: we use the name of the data frame, which is data, then square brackets, then in quotation marks the exact name as it appears there. You must use the uppercase G, exactly as it's written there. Try never, ever, when you do your spreadsheet, to have spaces in column names; spaces can cause problems. But anyway, Gender, just as it's written there, and after that I put the dot and then the method value_counts, with open and close parentheses, but empty. What does that do? It will look down the Gender column and group everything that it finds. So female, male: there's a difference between those two. If there was a spelling mistake, say someone wrote males with an s, that will be deemed a different type of entry, and how many times each one occurs will be counted. So it's just going to count everything that it finds. Let's hit shift-enter, or shift-return, there, and we see it found the word male 94 times and it found the word female 56 times. It didn't find anything else.
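To see what value_counts does with a spelling mistake, here is a minimal sketch on a handful of made-up entries (the values are my own, not the lecture's 94 and 56). The misspelled Males is counted as its own category.

```python
import pandas as pd

# Made-up Gender column with one deliberate spelling mistake ("Males").
data = pd.DataFrame(
    {"Gender": ["Male", "Female", "Male", "Males", "Female", "Male"]}
)

# Refer to the column by its exact name, then count each distinct entry.
counts = data["Gender"].value_counts()
print(counts)
```

An unexpected extra row in the output is a quick way to spot data-entry errors in a column.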
So no one made a spelling mistake down that column. Let's do value counts for some other things. Again, it's data, that's the name of the data frame, and then the column, and the whole thing together is how you refer to a column. There was an ICU column; do the value counts for me. So it is .value_counts() once again. Let me just show you: if I were to write this dot and hit the tab key, those are what appear. Now value_counts, unfortunately, is one of those that does not appear. You have to type out value_counts. It's not going to appear as a legitimate method for you to use, but it is available. I use it all the time. You can see No appeared 136 times down this ICU column and Yes appeared 14 times. 14 patients went to the intensive care unit; 136 did not.

Let's look at the RVD status, just for you to start getting a feel for your data. How many patients were in this data set? Well, let me click there; there's something nice to show you. I'm here and I can say insert cell below. It makes an empty cell below this. Let's write data.tail: give me the last three rows. Oh, what did I do wrong? It was not spelled properly. You cannot make spelling mistakes. So, data.tail. There's file 148, file 149, file 150, so there are 150 entries. 150 patients were entered in this database. If you wanted to print all of them out, which we can do, you can just type in data, just data like that, and it'll do all 150 there. Well, it won't do all of them; within limits, it will skip the middle lot. But you can't just look down a column of 150 values and get an idea of how many times things occur. So this value_counts is excellent.
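A short sketch of the tail step, on a made-up data frame with 150 rows. The 136 and 14 ICU totals are the ones quoted in the lecture; the File column name and the order of the rows are assumptions.

```python
import pandas as pd

# 150 made-up rows with file numbers 1 to 150.
data = pd.DataFrame(
    {
        "File": range(1, 151),
        "ICU": ["No"] * 136 + ["Yes"] * 14,
    }
)

print(data.tail(3))                # the last three rows: files 148, 149, 150
print(len(data))                   # number of rows in the data frame
print(data["ICU"].value_counts())  # how often each ICU entry occurs
```

len(data) gives the row count directly, without having to read it off the last file number.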
Let's do the value counts for rupture at surgery: 60 were found to be ruptured, 90 were not. The appendices themselves, when we sent them away for analysis: it looks like 120 were taken out for the correct reason, they were inflamed, and 30 were taken out where, no, they were normal. That's kind of par for the course these days. We are a bit better at that; it might be a bit high, but again, this is mock data. 80 did not develop any complications and 70 did, so quite a high complication rate.

Now, that's one way to look at our data, just value counts, value counts, value counts. Human beings are much better with pictures, though. So let's introduce the matplotlib.pyplot submodule, which, remember, I use the plt abbreviation for, and we're also going to do one with sns. Now look at the difference between these two. The matplotlib plt version is quite a few lines of code. I could write slightly better code, but the sns boxplot has a much easier syntax; it makes life much easier, one line of code versus all this. So let's just run through this box plot very quickly.

What am I interested in? I'm interested in drawing two plots. I want to look at a box plot of the age distribution in males and in females. So I say plt.figure, empty there inside the parentheses. There are quite a few arguments that you can use. If I were just to do that, you see all the arguments that it takes, so you can play around with all sorts of things, but if I leave it empty, all I'm telling IPython is: please get ready to draw a figure. I want to make two subplots. I want to put the male and female box plots next to each other, and so I say plt.subplot, and you see 1, 2, 1, and there's the other subplot, 1, 2, 2. What does that mean? It means draw it in one row, two columns, number one; one row, two columns, number two. So you're going to have this one row with two columns next to each other. This is the way to do it; get used to it. So, subplot one.
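The contrast between the two approaches can be sketched as below, on made-up ages (the values, the titles and the Agg backend line are my own assumptions). The long version uses plt.subplot(1, 2, 1) and plt.subplot(1, 2, 2) with the pandas boxplot method on a filtered data frame. For the short version, the grouping argument appears to be spelled differently in recent seaborn releases than in the lecture, so the sketch passes the columns as x and y.

```python
import matplotlib

matplotlib.use("Agg")  # assumption: headless backend; use %matplotlib inline in a notebook
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# Made-up ages; only the 94/56 split comes from the lecture.
rng = np.random.default_rng(0)
data = pd.DataFrame(
    {
        "Age": rng.integers(18, 80, size=150),
        "Gender": ["Male"] * 94 + ["Female"] * 56,
    }
)

# Long way: one figure, two subplots next to each other.
fig = plt.figure()
plt.subplot(1, 2, 1)  # 1 row, 2 columns, plot number 1
plt.title("Male", fontsize=18)
data[data["Gender"] == "Male"].boxplot(column="Age", ax=plt.gca())
plt.subplot(1, 2, 2)  # 1 row, 2 columns, plot number 2
plt.title("Female", fontsize=18)
data[data["Gender"] == "Female"].boxplot(column="Age", ax=plt.gca())

# Short way: seaborn groups the Age values by the Gender column itself.
fig2 = plt.figure()
ax = sns.boxplot(x="Gender", y="Age", data=data)
```

In the long version the expression data["Gender"] == "Male" produces a true or false value per row, and only the rows that are true are kept for the plot.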
I'm going to give it a title. You can do it like this: plt.title, and the argument is the title that you want to give. Those are string values; you have to put them in quotation marks. Comma, fontsize=18. What do I want to put in there? The following data. Now, I'm using something a bit different here. This is not purely matplotlib. I'm using boxplot within pandas. I've created this plotting figure, but I'm doing a box plot as a method of data. Remember, when we just typed data and a dot, there's a lot you can do. So, data: what do I want to do? Take the Gender column of data, so I'm referring to that column, equals equals. Double equals, remember, asks a question: is it equal to male? When the answer is true, it'll take that row. If it's not true, it won't take that row. I then want only the males, to draw the box plot, and the column that I want is the Age column. And we run through that again for the females. Now, this is not something I would normally use. This is very laborious. Look through that code; it should make sense to you. And there we draw one figure with two subplots, one row of them, two columns of subplots. There is my box plot of the male patients and of the female patients. Very laborious.

Let's do it in sns, seaborn. seaborn.boxplot takes a lot of arguments, but the ones I want to use: just take data and the Age column. So there's my first argument, look down the Age column, and I want box plots of that. Comma, what's the next argument? Group by. What do I want the age split by? Well, I want males and females, and that's very easy. I just say group it by the Gender column of data. Remember, it found males and females in the Gender column, so it's going to group the ages by the gender. And then I just want to give names to my plots. You've got to be very careful which one comes first and which one comes second. Just note that you don't put the wrong title on the two graphs. But that's a much easier line of code, only the three arguments and very easy to understand. And look at those beautiful box plots, with statistical outliers even indicated for you by default.

Let's look at a violin plot. The violin plot is something I use quite often. I like a violin plot. I've done a different type of violin plot here where I break the data frame down into new data frames. We'll have a proper look at how to do this, with quite a few examples, in this lecture series. For purposes of this lecture let's just use this easy way of doing it: sns.violinplot. We had sns.boxplot; now we have sns.violinplot. Exact same thing: the data that I want, that's my Age column; please group it by whatever you find in the Gender column; and give the names, exactly the same. You'll see in both instances I use this semicolon. If I don't use the semicolon (let's not use the semicolon), it's going to write this line of text there and then draw your plot. The semicolon just gets rid of that, so that you just get your graph. And there's a violin plot. Violin plots are kernel density estimates turned on their side, so you get a much better idea of how the data was spread. Here, between 40 and 50, there were a few more females before it went down, and that sort of notch doesn't really occur here in the males. So you get a much better idea of how your data was spread than just from a bland old box plot. I really like violin plots.

Speaking of a kernel density estimate, there's something called a distplot, so sns.distplot. Now unfortunately this is not going to work, because here I am referencing (let's take out the comment marks) these two new data frames that I made. I said, create two new data frames. Both of these computer variables will be data frames because they are constructed from another data frame. What data frame are they constructed from? From the original one, data.
Square brackets, the column that I'm interested in, equals equals male. That is going to look down the Gender column of the data data frame, and when it finds male it'll put that row inside of male, and run through the whole lot and only put males in this new one, and only put females into this other new one. Run that, and now I can reference these: the male Age column, and bins. What does that do? Let's have a quick look. There's a distribution plot. It's just going to make these bins of ages; see those little bins? So between that age and that age there were so many, and then it also draws this beautiful smooth curve, which is basically the curve that you see here in the violin plot, just on its side. I can do the same for the females.

There's another thing I can do. I can say sns.set_style, white, so I'm only going to have a white background, because there's a bit of a problem with the overlap of all these lines. So I'm just changing the style. Let's redo the female one, with bins still 10. What does that 10 mean? It is the number of bins, so it has to decide how wide each one is, from what age to what age falls inside of one bin. You can give it whatever number of bins you like. And I'm going to say kde equals False, for kernel density estimate. It's another argument I can put in there. If I run that code we have a beautiful white background, from the style that I set, and the kernel density estimate is gone. Now I only get this little histogram to tell me how many, and you can see the axis has also changed from a fraction to a count. The fraction applies to the kernel density estimate, the area under the curve, which we'll have a lot to say about later on. This just gives me a normal histogram.

So these are normal ways just to get familiar with your data, to play with your data, and to see what your data set is all about, what is inside of it. These are common tools for you to use to become familiar with your data set.