There are wonderful packages in the Julia ecosystem for working with statistics. There's also a statistics package built right into Julia, so when you install Julia for the first time this package will be there. You've got to import it to be able to use it, though. It contains the bog-standard statistical functions that we use all the time. I'm talking about the two point estimates, mean and median, and the measures of dispersion: in other words, your variance, your standard deviation and the quantiles. These functions are very easy to use, and they're right there for you after importing the package. I'm going to open a notebook and we'll explore these functions.

I've opened our Julia notebook and you can see the heading here: the Statistics standard library in Julia. By the way, if you want to see how these colors were done, that's just some HTML. That's the button tag sitting there, and if you give it a class, in this case a danger button, that gives it its color. You can use CSS to override that. That's just how I get these buttons.

Anyway, let's get to the standard library. I'm going to use three packages here, and we're going to import them. CSV is a package that you have to install yourself, and the same goes for DataFrames. CSV, as the name suggests, is the package we use to import CSV files, and DataFrames is the famous package for working with the data. Those you've got to install yourself: in the REPL you type the right square bracket, ], and use the add command to add them to your Julia installation. Statistics, though, comes with your Julia installation. What you can also see is that I'm using the import keyword instead of the normal using. What import does for me here is that I have to reference the package name to get to the functions that are in there.
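As a minimal sketch of that difference between import and using, here is how the two behave with the Statistics standard library. The sample values are made up for illustration and are not from the notebook.

```julia
# Third-party packages are installed once (commented out here):
# import Pkg; Pkg.add(["CSV", "DataFrames"])

import Statistics              # only the module name comes into scope

xs = [1, 2, 3, 4]              # illustrative data, not from the notebook

# With `import`, every call has to be qualified with the module name.
m_qualified = Statistics.mean(xs)

# With `using`, exported names such as `mean` are available directly.
using Statistics
m_direct = mean(xs)

println(m_qualified, " ", m_direct)
```

The qualified form still works after using, so the two styles can be mixed while you are learning where each function lives.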
That's the practical effect of it here, and the reason I'm doing it is so that you can see where these functions actually come from. With using, you just have access to all the exported functions, and you can't always remember which package they came from. While you're learning to use Julia, I like to know where these functions come from.

So let's import our data file. This file, data.csv, lives in the same folder on my internal drive as this notebook. So I don't have to use the whole path (C, colon, backslash and so on for Windows, and something different for Mac and Linux). Because they live in the same folder, I can just reference the file directly. Let's run that. It'll take a couple of seconds or so and it'll import that file.

We're going to use the first function here. The first argument is df, and the second is how many rows of that data frame we want printed to the screen. We do that just so we can see whether the import worked the way we thought it would. It's only printing one, two, three, four, five, six, seven columns here and omitting the other five, because it does show that there are 12 columns. I asked for five rows and we can see the five rows there. At the top we can see the column headers, so in the spreadsheet file itself the first row contains the column names. That's name, age, vocation, smoke (zeros and ones, it seems), heart rate, SBP, which is systolic blood pressure, and cholesterol before. We can also see what the package thought these values represented as far as data types are concerned. So it interpreted the names; now, this is just mock data, it doesn't refer to anyone in real life.
So that was a string. It saw the age as 64-bit integers and the vocation as a string. It saw smoke, where zero was coded for participants who were not smoking, one for smoking and two for smoked before, something like that. It sees them as integers, but of course these are just numerical values that we are using to represent a categorical variable. For heart rate you see 64-bit integers, then 64-bit integers again, and then 64-bit floats. We can use the names function and pass the data frame to it if we want to look at all of the statistical variables, the column headers, in our data frame.

If you need to work with this properly, you may want to change the types. We can see smoke, for instance, is a 64-bit integer; for now we'll leave it as that. But let's change vocation, and there's also a group column, which you can see there. Remember that we refer to these using symbol notation, which has the colon in front of the name. So we're going to say DataFrames.categorical! with the exclamation mark, which is the bang, and that means change it in place, so it's going to overwrite what is there. Then comes df, and then I list the columns that I want to change. If we do that, those columns are now categorical and they won't be strings anymore.
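The load, inspect and convert steps described so far can be sketched roughly as follows. This is only a sketch under assumptions: the tiny CSV file, its column names and its values are all made up so the example is self-contained, and it assumes recent releases of CSV.jl, DataFrames.jl and CategoricalArrays.jl. As far as I know, the categorical! call mentioned in the talk belongs to older DataFrames releases, so the in-place conversion below uses transform! with CategoricalArrays.categorical instead.

```julia
import CSV, DataFrames, CategoricalArrays

# Write a tiny mock file so the sketch is self-contained.
# Column names and values are hypothetical, not the course data.
path = joinpath(mktempdir(), "data.csv")
write(path, """
Name,Age,Vocation,Smoke,Group
A,43,Nurse,0,Active
B,53,Teacher,1,Control
C,33,Nurse,2,Active
""")

# The second argument is the sink type; stringtype=String keeps plain strings.
df = CSV.read(path, DataFrames.DataFrame; stringtype = String)

println(first(df, 2))    # print the first two rows to check the import
println(names(df))       # column headers as strings
println(size(df))        # (rows, columns)

# Convert the two string columns to categorical in place,
# keeping the original column names.
DataFrames.transform!(df,
    [:Vocation, :Group] .=> CategoricalArrays.categorical;
    renamecols = false)

println(typeof(df.Vocation))
```

Smoke is left as integers here, as in the talk, even though it encodes a categorical variable.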
If I just want to reference a single column, I can use dot notation and just say df.Age. If we look at the type of df.Age, we see that we get back a CSV.Column data type. It's going to have an index column and the actual age values, so we see two 64-bit integer columns. If you're used to working with Python and pandas, referencing df.Age in the same way gives you back a pandas Series, and a pandas Series is just a column with an index, starting from zero (whereas in Julia we start counting from one), and a second column with the actual values.

So let's look at the size. I can use the size function in Julia on df, and because it's indexing we use square brackets. The exclamation mark here is just shorthand for "give me all the rows", and then I'm using symbol notation for the column. I could also just put Age inside double quotation marks. I get the size of just that column, and we see there are 200 values, as we expected. Right at the top I think we did call size on df and it gave us 200 rows and 12 columns. Yes, we did.
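The column access forms mentioned above can be sketched like this. The data frame is a made-up stand-in built directly in code, so the column type printed here will be a plain vector rather than the CSV.Column type seen in the recorded session; the column names are assumptions.

```julia
import DataFrames

# Mock stand-in for the course data (names and values are made up).
df = DataFrames.DataFrame(Name = ["A", "B", "C"], Age = [43, 53, 33])

ages_dot    = df.Age          # dot notation
ages_symbol = df[!, :Age]     # `!` means all rows, without copying the column
ages_string = df[!, "Age"]    # the column name as a string also works

println(typeof(ages_dot))     # container and element type of the column
println(size(ages_symbol))    # size of just that column
println(size(df))             # (rows, columns) of the whole data frame
```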
If we don't want to work with it as a CSV.Column data type, we can always convert it with the convert function in Julia. The first argument is what I want to convert to, and I want to convert to an array. What do I want to convert to an array? Well, the CSV.Column. If we convert that, I now have an array of 64-bit integers, and they're just in a single column, so it is a column vector, and we see the dimension there being one.

So that is it for our data. Let's get into the Statistics package. It's really divided into two things: the two point estimates and then the measures of dispersion. Remember, a point estimate in statistics is just a single value that must be representative of all the values. As human beings we are bad at looking at rows and rows of numbers. It doesn't make much sense to us. We need to summarize, and we need a single value to be representative of the whole. Of course the mean is the most common one, and here we have the mean function inside the Statistics library. It calculates the arithmetic mean, and we see the equation there, equation one. We put a bar on top to signify that this is the mean for the sample: we just add all the values up and divide by how many there are.

In case you're interested in how to get this green background with the mathematical notation in it, that's just HTML and LaTeX. I'm using a div here with the class set to alert alert-success (the success is what gives it the green color) and the role set to alert, and then I just have to close my div tag. Inside a set of double dollar signs is where I put my LaTeX code, and you can see the LaTeX there, which a Jupyter notebook can interpret properly to give us that mathematical representation.

Anyway, let's get to the mean function. It is part of Statistics. Now remember, I said we are importing these packages and not using the
packages, so I have to say Statistics.mean. If you had said using Statistics, you could of course just say mean. You can still say Statistics.mean, but I like to do this, as I say, to remind us where these functions come from. This mean function is not part of base Julia; it comes from this package. So Statistics.mean, and I pass to it this CSV.Column, df.Age. It correctly interprets the fact that I don't care about the index numbers, I care about the actual values, and it gives me the mean.

Because the mean is such a simple arithmetic operation, we can just write it in Julia using the sum and length functions. If I pass this df.Age collection to the sum function, it sums all of those elements, and I divide by the length, which is how many there are, and I obviously get exactly the same result back.

The mean function is also useful inasmuch as I can apply some transformation to the values in my collection before I take the mean. In this instance I'm doing a log transform: log is the first argument, so we take the log of each value in the collection separately before we take the mean, and what I'm taking the mean of is of course df.Age. So that's the mean of the log-transformed values.

We can also use mean on multi-dimensional arrays. Here, inside square brackets, I have what looks like a list, but it actually gives me back a two-dimensional array. We see a two-by-three array, two rows and three columns. Between values on the same row it's just a space, and to jump to the next row I use the semicolon. So now it's an array along a second dimension as well: not only down a single column, I've now got multiple columns, and we can see one, two, three and then four, five, six. For a multi-dimensional array like this, we can work out the mean along either dimension. I can get the mean of one,
two and three and the mean of four, five and six, or I could get the mean of one and four, two and five, three and six. I do that by just stipulating a value for the dims argument. So Statistics.mean, I'm passing my multi-dimensional array, and I'm saying dims equals one. If we do it along that first dimension, it's going to be for each of the columns: the mean of one and four is 2.5, the mean of two and five is 3.5, and the mean of three and six is 4.5. But if I say dims equals two, I get it the other way around: the mean of one, two and three is two, and the mean of four, five and six is five.

Of course we can also use the convert function. If you run into problems with whatever data type you have, remember you can always convert. So here I'm going to convert to an array, and what do I want to convert to an array? I'm using df, the data frame, with square-bracket notation, so indexing. I'm saying give me all the rows, and then as a list I pass two columns, cholesterol before and cholesterol after. Think about it: that gives me back a two-dimensional array with 200 rows across two columns. That means I can set dims, and if I set dims to one, remember, that gives me the mean for every column. So I know that for cholesterol before the mean was 6.0125 and for cholesterol after the mean was 5.7575. That's one way, when we're working with data and passing a multi-dimensional array, to use the mean function column-wise and get the mean of each of these two variables.

Remember, too, that you can always use skipmissing. Here I have a collection, an array: one, missing, three. If I pass this collection to
the skipmissing function as an argument, it ignores the missing value and then calculates the mean. Otherwise you get a result of missing back, because mean doesn't know what that value is and so cannot calculate the mean.

Then the median. Of course, we sort all our values in ascending or descending order. If we have an odd number of values, it takes the middle one, the one that divides our data set into two halves, and if we have an even number, it takes the middle two and just takes their average. That really comes from this middle function. So if I use Statistics.middle, what is the middle between one and five? Well, it's three, because I have one and two, and four and five, on either side, so I have two values on either side of three. But if I have an even number, it takes the middle two, which here are two and three, and averages them out to 2.5.

Just to remind ourselves of the length: we had 200 ages, so that means we have an even number. We'll call Statistics.median on df.Age, and it's 54, which means we have 100 values below and 100 above. But remember, with an even count there were probably two values of 54 smack bang in the middle, and their average is of course 54. We can always use skipmissing here, remember, the same as we did with the mean.

So that's mean and median. Let's look at the measures of dispersion. Standard deviation is the one most commonly used. Of course, we first work out the variance and then we take the square root of the variance. The std function does the standard deviation for us. By default it's the standard deviation of the sample, so that's small s, and you can see the sample size, which is n, and we subtract one from it. So we take the squared differences between each value and the mean, sum over all of those, divide by our sample size minus one, and take the square root of that, and that gives us the standard deviation. So the
std function there, and we see a standard deviation of 12.5. Because this is such an easy equation, we can just do it with Julia functions. The square root is on the outside: we're taking the square root of the sum divided by the length. As you can see inside the sum, it's df.Age dot-minus, and remember that gives us element-wise subtraction, so it takes each of the values in age and subtracts from each of them the mean of df.Age. And then dot-caret is element-wise power, so each of those differences now gets squared. After summing over all of those, I divide by the length, and I get the same result, 12.5. So remember, in Julia, to do element-wise operations we put the dot in front of our arithmetic operator.

I can also specify a different mean. I don't have to go with the mean given by the actual data set; I can say, work out the standard deviation for me given a different mean. You can see that it follows after a semicolon, so it is a named argument. I can put named arguments in any order, but before that semicolon I have the positional arguments, and they have to go in a fixed order. For the standard deviation we're only passing one of them, and that's the actual collection of values. So I can work out what the standard deviation would be with respect to a mean of 10.

If I want the population standard deviation, that's sigma there, I can use the argument corrected equals false. corrected is also a named argument, so it goes after a semicolon, not a normal comma. That gives me the standard deviation of the age column, but calculated as if this is a population and not a sample, the only difference being down here in the denominator, where we don't subtract one. If we do that, we get 12.52, and of course we can quickly do that with normal Julia functions as well.

Variance, as I said, is the same calculation, inasmuch as we're just not taking the square root. So here is the sample variance, which is s squared, as opposed to the population variance, which
is sigma squared. For a population we again subtract the mean from each value, and here it's the population mean, so mu there. var is the function for that, and if I set my corrected equals false named argument, it's going to be for the population, so slightly different. I can also specify a mean, as we did with the standard deviation, so work it out for a different mean.

Lastly in the Statistics library we have the quantiles. The most famous ones are probably the quartiles. Whereas the median divides my ordered data set, from minimum value to maximum or maximum to minimum, into halves, the quartiles divide it into quarters. So the first quartile has a quarter of the values less than it and three quarters more than it. The second quartile is exactly the same as the median, half below and half above. The third quartile has three quarters less than it and one quarter more than it. We express these as fractions, though, so the first quartile would be 0.25: 25 percent of the values, a quarter of the values, lie below it.

So I'm going to use Statistics.quantile. I pass my collection to it, df.Age, and 0.5. That's the exact same thing as the median; in fact, if we use median there, we see it's exactly the same value. I can also pass a list of values, so here we have a collection of all the quantiles that I want. Quantile is the group term, and we can take any fraction from 0 to 1. Another term for that would be percentile, so I'm asking for the zeroth percentile up to the 100th percentile. One would be the maximum value and zero would be the minimum value, with all the quartiles in between. If I run that, I get all those values, and we see 54 again there in the middle, 30 was the minimum and 75 was the maximum.

So that's it for the functions, the bog-standard but very useful functions inside the Statistics library. I just wanted to mention, just for completeness' sake,
remember the interquartile range. That is just the difference between the first and the third quartile, so I can just subtract the 0.25 value from the 0.75 value, and that gives me the interquartile range. That's something we can use in a box-and-whisker plot, or mathematically to estimate whether there are any statistical outliers. Remember the difference between the interquartile range and the range. The range is the maximum minus the minimum, so that's the full difference, and we use the standard Julia functions maximum and minimum: the maximum of that collection minus the minimum of the same collection, and that gives me 45.

Again, that's it for the Statistics package. These are very useful functions. Once you are familiar with where they come from, you can just use using instead of import, so you don't have to type Statistics dot all the time. It is just something that I do when teaching Julia, so that we remember exactly which package the functions we import come from.
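To recap the point estimates from earlier in the session, here is a minimal sketch using only the Statistics standard library. The age vector is made up for illustration and stands in for the age column of the course data.

```julia
import Statistics

age = [30, 41, 54, 54, 62, 75]     # made-up stand-in for the age column

m_pkg    = Statistics.mean(age)    # arithmetic mean from the package
m_manual = sum(age) / length(age)  # the same thing written by hand
m_log    = Statistics.mean(log, age)  # log of each value first, then the mean

A = [1 2 3; 4 5 6]                 # spaces separate columns, `;` starts a new row
col_means = Statistics.mean(A, dims = 1)   # one mean per column (1×3 result)
row_means = Statistics.mean(A, dims = 2)   # one mean per row (2×1 result)

with_gap  = [1, missing, 3]
gap_mean  = Statistics.mean(with_gap)               # propagates missing
safe_mean = Statistics.mean(skipmissing(with_gap))  # ignores the missing value

med_odd  = Statistics.median([1, 2, 3, 4, 5])  # middle value of an odd count
med_even = Statistics.median([1, 2, 3, 4])     # average of the middle two
mid      = Statistics.middle(2, 3)             # midpoint of two numbers

println((m_pkg, m_log, col_means, row_means, safe_mean, med_odd, med_even, mid))
```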
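The measures of dispersion can be recapped the same way. This is a sketch on the same made-up age values, showing the sample and population forms next to hand-written versions; note that the hand-written sample form divides by n minus one, while the population form divides by n.

```julia
import Statistics

age  = [30, 41, 54, 54, 62, 75]   # made-up stand-in for the age column
n    = length(age)
xbar = Statistics.mean(age)

# Sample standard deviation: divide by n - 1 (the default, corrected = true).
s        = Statistics.std(age)
s_manual = sqrt(sum((age .- xbar) .^ 2) / (n - 1))

# Population standard deviation: divide by n.
sigma        = Statistics.std(age; corrected = false)
sigma_manual = sqrt(sum((age .- xbar) .^ 2) / n)

# Keyword (named) arguments go after the semicolon, in any order.
s_about_10 = Statistics.std(age; mean = 10)   # deviations measured from 10

# Variance is the same calculation without the square root.
s2     = Statistics.var(age)
sigma2 = Statistics.var(age; corrected = false)

println((s, sigma, s_about_10, s2, sigma2))
```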
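Finally, a short sketch of the quantiles, the interquartile range and the range, again on made-up values chosen so that the minimum is 30 and the maximum is 75, as in the session.

```julia
import Statistics

age = [30, 41, 54, 54, 62, 75]    # made-up stand-in for the age column

# The 0.5 quantile is the median.
q2 = Statistics.quantile(age, 0.5)

# Several quantiles at once: minimum, the three quartiles, maximum.
five = Statistics.quantile(age, [0.0, 0.25, 0.5, 0.75, 1.0])

# Interquartile range: third quartile minus first quartile.
iqr = Statistics.quantile(age, 0.75) - Statistics.quantile(age, 0.25)

# Range: maximum minus minimum, using base Julia functions.
full_range = maximum(age) - minimum(age)

println((q2, five, iqr, full_range))
```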