So here we are in the IPython Notebook. It just runs right inside your browser. It's fantastic; it just doesn't get any better than this. We're going to write lines of code, and it's going to do the statistical analysis for us.

Now, the beauty of Python is that so many people work on it. They see a problem, they write a new module or library, and you can just import it, and it extends the Python language for you to use new types of code. Here we have pandas as the main library that we're going to use, or module, or whatever you want to call it. There we have import pandas as pd, and we're importing all sorts of other things.

The beauty of the IPython Notebook: I write a few lines of code and I hit Shift+Enter or Shift+Return, or I can just hit Run Cell. See the little star there that changed to a one? While the star is there, it is just executing that code, and it's all there. Don't worry about this code. I'm only going to show you what is possible, or actually just a small bit of what is possible. You can do so much more with pandas and Python.

So I've got this comma-separated values file. It is a file of data written in Microsoft Excel. It was saved as a CSV file as opposed to an Excel file; it doesn't matter. It lives on my hard disk, and I'm going to import it into Python with that line of code there. Now I just want to make sure that it is there, so I can print the first five entries, and it also shows me all the data points that I've collected. Now, this is mock data on appendicitis patients, so it's just some random values there. This is not attached to any real patient. But you can see there's my data set that was imported, and you can see an index here at the left-hand side, done automatically for me. As always with Python, it starts with a zero.
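The import-and-preview step described above might look roughly like the sketch below. The file name and column names never appear in the video, so the small in-memory CSV and its headers (File, Age, Gender, RVD, Rupture, Delay) are made-up stand-ins.

```python
import io
import pandas as pd

# Hypothetical stand-in for the CSV file on disk; the real file name and
# column names are not given in the video, so these are assumptions.
csv_text = """File,Age,Gender,RVD,Rupture,Delay
1001,21,Male,No,No,3
1002,34,Female,Yes,Yes,6
1003,18,Male,No,No,4
1004,25,Female,Yes,No,5
1005,19,Male,No,Yes,2
1006,27,Male,Yes,No,7
"""

# With a real file this would be pd.read_csv("path/to/file.csv")
data = pd.read_csv(io.StringIO(csv_text))

# Show the first five rows; note the automatic index starting at 0
first_five = data.head()
print(first_five)
```

With a file on disk, only the argument to read_csv would change; head() returns five rows by default.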
I don't want a zero. I want this patient file number to be my index. I can do that with a line of code and show the first three rows there, and you can see now the patient file number becomes my index; just easier for me to work with.

Now, here's a column of the patients' ages. We could say, well, can we just look at that column? Can we get some answers from that? Lo and behold, there's a describe function. If I execute that, it shows me there were 150 entries in that column and the mean was 20.9. In other words, there were 150 patients and the mean age was 20.9 years. We see the standard deviation there, the minimum and the maximum, and our percentiles. Beautiful.

Can I graph this? Not a problem. There's our line of code, and I execute that. This is called a distribution plot. So it gives me, percentage-wise with 100% being 1.0, a histogram and a kernel density estimate plot all in one. That's fantastic. Now, I've decided to have 10 bins between the minimum and maximum; you can have as many bins as you want.

Can you tell me, in the gender column, how many males and females there were? Well, there are the lines of code for that. Let's execute that. It looks down the gender column and sees what types of entries were there. Well, there were only two: male and female. So no spelling mistakes were made. There were 94 of the male type and 56 of the female type. Beautiful.

Now, there's a command in pandas called groupby, which is the best one to use here. I've done something else: I've created new data frames and new series. Don't worry about that. I'm splitting it in two, making two different data sets, one with just males and one with just females, and I'm looking at the age column of each and describing that. There we go. It says in the males there were 94 with a mean age of 20, and I can quickly see that for the 56 females the mean age was 22. Can I draw some box plots of this? Because I'd like to do that when I submit this for publication. Beautiful box plots.
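The indexing, describing and counting steps just described could be sketched as follows. The data and column names are invented for illustration, and the last step uses the groupby route that the speaker mentions but does not show.

```python
import pandas as pd

# Mock data with assumed column names (not taken from the video)
data = pd.DataFrame({
    "File": [1001, 1002, 1003, 1004, 1005, 1006],
    "Age": [21, 34, 18, 25, 19, 27],
    "Gender": ["Male", "Female", "Male", "Female", "Male", "Male"],
})

# Use the patient file number as the index instead of 0, 1, 2, ...
data = data.set_index("File")
print(data.head(3))

# Count, mean, standard deviation, min, max and percentiles of one column
age_summary = data["Age"].describe()
print(age_summary)

# How many times each distinct entry appears in the Gender column
gender_counts = data["Gender"].value_counts()
print(gender_counts)

# groupby gives the per-gender age summary without splitting by hand
age_by_gender = data.groupby("Gender")["Age"].describe()
print(age_by_gender)
```

A misspelled entry such as "Mael" would show up in value_counts as its own row, which is how the count doubles as a quick check on data entry.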
Is there something better than box plots? You bet: violin plots. They are much better because not only do they give you the same data as a box plot, but they also give you the kernel density estimate, so you can see the spread of your data, which you can't really get that easily from a box plot. So violin plots are fantastic.

Now, can we do some inferential statistics? Let's just look at the RVD column, the retroviral disease column. Apparently there were entries for no and yes: 96 no, 54 yes. Again, I'm just going to split my data frame into two different ones. Let's look at the age distribution now for the positive patients. We see there were 54 of them and the mean age was 25, with the standard deviation you see there. And for the negative patients, we see 96 of them with a mean age of 18.

Is there a statistical difference between these two? Well, I'm going to use a normal t-test for this, seeing that my age distribution is continuous data. Let's import that function and I can execute it. Let's have a quick look. It gives the p-value for the difference in age between RVD positive and negative patients, and the p-value is very significant if I were to choose 0.05 as my cutoff for significance. So there was a statistically significant difference in age between the positive and negative patients. Now let's plot that. Again, a violin plot makes life very easy.
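The t-test comparison described above might be sketched like this. The video does not name the function, so scipy.stats.ttest_ind is an assumption here, and the ages are randomly generated around the quoted group means (25 and 18) with a made-up spread.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Mock data; column names, spread and values are assumptions
rng = np.random.default_rng(0)
data = pd.DataFrame({
    "RVD": ["Yes"] * 54 + ["No"] * 96,
    "Age": np.concatenate([
        rng.normal(25, 5, 54),   # RVD-positive group
        rng.normal(18, 5, 96),   # RVD-negative group
    ]),
})

# Split the age column by RVD status
positive_age = data.loc[data["RVD"] == "Yes", "Age"]
negative_age = data.loc[data["RVD"] == "No", "Age"]

# Independent two-sample t-test (assumes equal variances by default)
t_stat, p_value = stats.ttest_ind(positive_age, negative_age)
print(f"t = {t_stat:.2f}, p = {p_value:.3g}")
```

Passing equal_var=False to ttest_ind would switch to Welch's test, which does not assume the two groups share a variance.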
We can see the distribution in age. Here we have the positive and negative, and we can see the medians marked there. I can clearly see there's a difference between positive and negative patients, and that's why our p-value was so significant.

Let's just look at the positive patients when it comes to rupture, and the negative patients when it comes to rupture. Now we have categorical data here. Can we do a chi-squared analysis? We see that in the positive group, 30 on histology had no rupture and 24 had, and in the negative group 61 didn't and 35 had. So let's import some functions that can do chi-squared analysis for us. There we go: we've created this 2 by 2 contingency table, 30 and 24, 61 and 35, and we can get the p-value on that: 0.431. So, using a contingency table, I can do a chi-squared analysis and get the p-value for that, and we see there wasn't a statistically significant difference in rupture rates between the RVD positive and negative patients.

What about confidence intervals? Well, let's look at the delay column. We see again there were 150 entries, and the mean time between the onset of symptoms and coming to hospital was 4.2 days, with the standard deviation there. Can I just describe the 95 percent confidence interval around that mean?
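For the rupture comparison above, a minimal sketch using the counts quoted in the video (30 and 24, 61 and 35) is shown below. It assumes the function used was scipy.stats.chi2_contingency, which the video does not state; with SciPy's default continuity correction for 2 by 2 tables, the result should come out close to the 0.431 quoted.

```python
import numpy as np
from scipy.stats import chi2_contingency

# Rows: RVD positive, RVD negative; columns: no rupture, rupture
table = np.array([[30, 24],
                  [61, 35]])

# For 2x2 tables SciPy applies Yates' continuity correction by default
chi2, p_value, dof, expected = chi2_contingency(table)
print(f"chi2 = {chi2:.3f}, dof = {dof}, p = {p_value:.3f}")
```

Passing correction=False would drop the continuity correction and give a somewhat smaller p-value for the same table.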
Well, there are two ways that I show you here how to do it. One is using the scikits.bootstrap function there. Let's just import that and print the result, and it says, yes, remember the mean was 4.22, the 95 percent confidence interval was from 3.8 to 4.66 days.

There's a better way to do it, though, than with that bootstrap, and that's to use this bayes_mvs function. If I import that, let's have a look. It gives me a lot more data. First of all, I can choose my percentage confidence interval: I can choose 80 percent, 90 percent, 95 percent, just by typing in a value there. And it actually gives me the upper and lower confidence interval values for the mean, the variance and the standard deviation, so it gives me all of those confidence intervals. But you'll see the lower and upper values are almost the same as the bootstrap values.

So that's a very quick look, just scraping the surface of what pandas and Python can do, and it is just phenomenal. I hope I've enticed you to make some effort to learn statistics using Python.
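The two interval estimates described above could be sketched as follows. Two assumptions apply: the bootstrap here is a hand-rolled percentile bootstrap in NumPy rather than the package used in the video, and the second function is taken to be scipy.stats.bayes_mvs. The delay values are randomly generated mock data, not the speaker's.

```python
import numpy as np
from scipy import stats

# Mock delay values in days; the real data are not available
rng = np.random.default_rng(1)
delay = rng.gamma(shape=4.0, scale=1.05, size=150)

# Percentile bootstrap: resample with replacement, take the mean each time
boot_means = np.array([
    rng.choice(delay, size=delay.size, replace=True).mean()
    for _ in range(5000)
])
boot_low, boot_high = np.percentile(boot_means, [2.5, 97.5])
print(f"bootstrap 95% interval for the mean: {boot_low:.2f} to {boot_high:.2f}")

# Bayesian intervals for the mean, variance and standard deviation;
# alpha is the probability that the interval contains the true value
mean_ci, var_ci, std_ci = stats.bayes_mvs(delay, alpha=0.95)
print(mean_ci)
print(var_ci)
print(std_ci)
```

Changing alpha to 0.80 or 0.90 gives the narrower intervals mentioned in the video. SciPy's documentation describes these as Bayesian credible intervals, which for a sample of this size tend to sit very close to the bootstrap bounds.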