In this video I want to introduce you to the computer programming language called Julia. It's a wonderful rainy, cold Cape Town winter's day, and I was rather curious: I always use Python in the IPython notebook, but could I use Julia to do some of this analysis? Julia is a programming language really aimed at computer scientists, mathematicians and so on, and I was wondering whether a mere mortal, a mere surgeon like myself, could use it. Now, I'm never going to move away from Python in the IPython notebook. They're fantastic, really phenomenal tools to do your work with. But I was wondering what I could do with Julia. Julia has packages that you can import. They are like your Python modules, but they're certainly not as well developed. And I don't know Julia well enough to write all the code on my own, so I want to use some of these packages. I'm also not going to use the IPython notebook for this demonstration. I'm going to use a different development environment called Juno. It's beautiful. It's built on Light Table and looks really good. For any kind of production work, though, I'm going to stick with the IPython notebook and Project Jupyter. I think it really doesn't get any better than that.
But let's go. Let's have a look at Julia. I'm going to take you through this video at quite some pace. There's quite a bit to do, and I'm not going to stop and talk about all the syntax. Just follow along. What I'm going to show you is a data frame, which is just a spreadsheet of values that we're going to use. I'm going to show you how to construct one from scratch in this video and populate it with random values. As I'll mention quite a few times in the video, in practice you just import a spreadsheet with your data and use it as such. But I wanted to show you some of the functionality of Julia.
So we're going through at some pace; maybe if I have some time later I'll break it down into slower sections and go through some of the syntax. Julia really is a phenomenal language, so let's go have a look. And here we are in Juno. You see that I have already typed everything I want to show you. Let's start right at the beginning. We're going to import the packages that we're going to use. The first one is a package called Markdown, just so that we can print some text to the screen. I'm probably not going to use it in the correct way, but you'll see how we do that. So in Juno we just type using Markdown, and I hold down the Command key on the Mac and hit Enter; on the PC that would be Control and Enter. You'll see at the bottom that the little spinning disk went on while it imported. Markdown is not a big package, so that goes fairly quickly. Next is the DataFrames package, again holding down Command and Enter, or Control and Enter. Again you'll see our nice little animation at the bottom as this gets imported. This takes quite some time, because DataFrames is a big package. There we go. It's done. The next one we're going to use is Distributions, which lets us draw random values from probability distributions. Import that. Done. Now a big package called Gadfly. That is the package we're going to use to do plotting or graphing on the screen. It's also a very big package, and if it's the first one that you load it might take quite some time. Okay, I cut the waiting period out of the video for you, because depending on the system you use you could seriously go out and take out two gallbladders before coming back. The last one we're going to use is HypothesisTests. This is the package that contains a few rudimentary statistical tests, and at the end we'll see how to use those. Good. This is how I'm going to use Markdown. So it's Markdown.parse.
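As a rough sketch of the loading step described above, assuming a current Julia 1.x installation rather than the version used in the video: Markdown now ships as a standard library, while DataFrames, Distributions, Gadfly and HypothesisTests have to be installed once through Pkg. The heading text passed to Markdown.parse is a placeholder, since the video does not spell out the exact string.

```julia
# One-time installation of the third-party packages (uncomment on first use):
# import Pkg
# Pkg.add(["DataFrames", "Distributions", "Gadfly", "HypothesisTests"])

using Markdown   # standard library in Julia 1.x, so no installation needed

# Once installed, the remaining packages load the same way:
# using DataFrames, Distributions, Gadfly, HypothesisTests

# Parse a Markdown string; the text itself is a hypothetical placeholder.
heading = Markdown.parse("# Creating a **DataFrame**")

# In a notebook or IDE this renders as formatted text; in the REPL it prints
# a styled plain-text version.
display(heading)
```

The first `using` of a large package such as Gadfly can still take noticeably longer than later ones, which matches the waiting described in the video.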
And it's just a string that I'm going to print to the screen, and you see it beautifully rendered here in Juno. There we go. So we're going to create our first DataFrame. There it is. We use the word DataFrame there, with an uppercase D and an uppercase F, and I'm going to attach it to this computer variable called df, just for DataFrame. It's going to contain two columns, and see how the columns are separated by a comma here. The first column is going to be called A, and there are no double quotation marks there, so I'm not entering it as a string. That's just going to be the name of the column. And I'm going to have a column B. Column A is going to contain the integers 1 through 10, and column B is going to contain the even integers 2 through 20. So at the end of the line, hold down the Command key or Control key, hit Enter, and there you go. This is what Juno is going to do: it renders the DataFrame there for you, hidden, but if you click on it, there you go. It shows you that DataFrame beautifully on the screen. So there are my columns A and B and, as I said, the integers 1 through 10 in column A, and in the rows for column B we have 2 to 20, going up in steps of 2. If I just click on it again, it hides itself. That works rather well, I think.
So some more Markdown, just for the text. Let's select certain rows and columns. I can do that by referring to the DataFrame df, which was the variable name I gave my DataFrame, and then giving the rows that I want, 1 to 3, separated by a comma from the columns. So I want rows 1 to 3 and I want only column A. Look how that goes in square brackets, and there is this colon in front of the column name. I haven't attached that to another DataFrame name; I'm just asking Julia to select those rows and columns for me. And if I do that, look at that: it's column A and just the first 3 rows rendered there. Let's just show how to add a column named Names containing just some strings.
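The DataFrame steps above can be sketched roughly as follows. This assumes a recent DataFrames.jl (1.x), whose indexing differs a little from the older version in the video: a bare column symbol now returns a vector, and a new column is added with dot syntax. The particular order of letters in `Names` is an assumption; the video only says there are ten single-character strings.

```julia
using DataFrames

# Two columns: A holds 1 through 10, B holds 2 through 20 in steps of 2.
df = DataFrame(A = 1:10, B = 2:2:20)

# Rows 1 to 3 of column A. A bare column symbol gives back a plain vector ...
first_three = df[1:3, :A]

# ... while wrapping the symbol in a vector keeps the result as a DataFrame.
first_three_df = df[1:3, [:A]]

# Add a third column of strings. It must have exactly as many elements as
# the DataFrame has rows (10 here). The letter order is hypothetical.
df.Names = ["A", "A", "A", "B", "B", "C", "A", "B", "C", "A"]

println(first_three)
println(df)
```

Note the commas inside the square brackets: `["A", "B"]` builds a column vector, whereas `["A" "B"]` without commas builds a one-row matrix, which could not be assigned as a column.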
So to do that I'm just going to refer to the DataFrame I created up there, df, and then in square brackets the colon again and Names, so that is going to be the name of this third column. And there are just 10 string elements there, A, B and C, just some single characters. There are 10 of them, so that will fit nicely into my DataFrame, because I do have 10 rows in it. If I hit Enter, you see it is a vector, a vector of A's, B's and C's, so Julia only shows you the vector that you created. This is a row vector; well, it's actually a column vector, I must say. That's part of the mathematics of Julia: if you have these commas in between the elements it's going to make a column vector for you, and if you didn't put the commas it would be a row vector. Anyway, it's just showing me the vector of 10 elements that it's going to add to my DataFrame.
So let's create a new DataFrame by selecting only the values of A in the Names column. Remember, here we only selected some of the rows and a certain column. Now we create a new DataFrame called data_A, and it is taken from the DataFrame df: the rows where the Names column is equal to A, so it's dot, double equal sign, A, and then the colon. You've got to learn how to use this syntax for Julia. And if I hit Command Enter or Control Enter, there we go. What it's done is gone into the Names column there, selected only the rows that contain the value A, and given me the corresponding rows in both columns A and B. So that is my new DataFrame. If I were doing some research and some of the research participants were selected to be in group A, I could create a new DataFrame with only their values in it, and this is the way to go about it. What if I wanted to have more selection criteria? The way to go about that is the same as this, except you put parentheses around each condition. So it's still the DataFrame in big square brackets, then look at this, there's a
parenthesis there and there, and it has the single ampersand sign there, and another parenthesis there and there. So the Names column contains the value A and the B column has numbers greater than 4. If I run that, it's still only A, but now in the B column we only have values that are larger than 4, so 6, 14 and 20 there. You can build up quite a few criteria to select parts of your database, your DataFrame, and that's wonderful if you want to do some analysis on it.
So let's do some simple statistics on what we have. I'm going to ask Julia to describe the A column of data_A for me. Now look what happens: a spin there, and nothing seems to happen. There's a check mark there, so it was executed. The way Juno works, it hides the output for us in what is called the console, and to view the console you have to hold in Command, or rather Control, I should say; even on the Mac it's Control. So it's Control and the little tilde sign, and the console opens up at the bottom. You can see it has described the A column, so the mean for the values in the A column. Remember that data_A only contained the A's. Where did we do it? There, there's our DataFrame. Looking down column A here, the values are 1, 2, 3, 7 and 10. It has a mean of 4.6, the third quartile there, a maximum. The NA stands for rows that contain no data entries; there were zero of those, and the percentage they make up is zero. So it gives us a little description there: min, first quartile, median, mean, third quartile, max, and the NAs. I hit Control and the tilde key again and it disappears. If we now look at the big DataFrame and just at column A, we can ask it to calculate the mean of all the values. There you go, it's 5.5. We can ask it to sum all the values, 10 plus 9 plus 8 plus 7 and so on, and that would be 55. The standard deviation, the variance, the minimum, the maximum: there are lots of normal descriptive statistics that you can just do on any of the
columns in your DataFrame. Here we've asked just for some numerical calculations, and we get all of those.
Let's ramp things up and make it a bit more interesting. We're going to populate an empty DataFrame. We're going to call this one data, so I'm giving it a different name, and it's DataFrame with open and close parentheses, so it's just an empty DataFrame. There it is. If I click on it, there's absolutely nothing in it. What I'm going to do, as I said, is create this DataFrame from scratch. It's empty, but I'm going to populate it with columns and rows. The normal way that you would go about it, obviously, is just to import a spreadsheet, one that you've saved in Excel or other spreadsheet software. You could save it as a CSV file, which is a bit easier. But what I'm going to do here is create one from scratch and put random values in it, just to show you how the Distributions package works. So first of all I'm going to create this array, I'm going to call it gender array, and I'm going to add just this single character F to it. You'll see in a moment why I do that. It's not necessary; it's for explanatory purposes. So there we go, it's this one-element column vector with the character F in it. And now I'm going to create this little for loop. Obviously, if you just do your normal statistics you import your spreadsheet file and you needn't go through all of this, but it shows you a little bit more of Julia. So I've got this for ... end loop, so it's going to loop through everything from this for to this end at the bottom, from 1 to 199, so it goes through 199 times. I'm going to create this random normal variable called rn, which is just randn with empty parentheses there, so it draws from a standard normal distribution. It's going to have a mean of 0, and it's going to just randomly select values
on either side of 0. As a standard normal distribution, in other words, it has a standard deviation of 1. So it's going to select a value, and if you think about it, that value is going to be either positive or negative. So what I'm going to do here is run an if ... else ... end statement inside the loop. I'm going to say: if I drew a number that was less than or equal to 0, I'm going to append female to the gender array, see the exclamation mark there; else, in other words if the random number was positive, I'm going to append male to the gender array. So I started with 1 entry and I'm going to add 199 other randomly selected males and females. Every time you run this you're obviously going to get different results. What I want is just a column vector, male, female, male, female, at random. So I'm at the end of the last end statement here, and I'm just going to hold down Command and Enter, and there it is, you see it's executed. If I want to look at it quickly, there's the gender array, executed, and you see it's this column vector of 200 entries, randomly either male or female. So if you ever wanted to create the entries in a column randomly, this would be one way of going about it. There are obviously other ways, easier ways, more computer-science ways of doing it properly.
Now, remember my data was an empty data frame. I'm going to create this column called Gender and assign to it this gender array, which is my 200 rows of randomly selected male and female patients. So there we go. When you do it this way, it's not going to show you the whole data frame; it just shows you the column vector that you added to it. Now I'm going to do the same again and create a new array called group array, and as I did before I'm going to just add something to
it, called A or B. You don't have to do that; you could have done it all in the loop. It's just for demonstration purposes. I'm going to do exactly the same and randomly put in either A or B, as if these patients belong to two different groups. If I run that, it goes through very quickly. If I look at my group array now, it's this vector, and again it's random. Let's go down so you can see these random A's and B's all the way down. Click on it and it goes away. Now I'm going to add that as another column in my data frame. I'm going to call this column Group, and I'm going to attach the array to it. There we go.
I've got two columns now, one with randomly selected male and female and one with randomly selected A and B, so let's add some more columns. I'm going to refer to my data frame data, and I'm going to add an age column, a days column, a temperature column, a white cell count column and a CRP column. I'm going to use random variable distributions, and the way to go about it is this. I'm going to use the Normal distribution with a mean of 35 and a standard deviation of 10, so that goes in these parentheses here, then comma 200: please give me 200 random values that are normally distributed with a mean of 35 and a standard deviation of 10. Done. If I look at it now, it's just this column vector of 200 values, one after the other, and they are randomly drawn from this normal distribution with a mean of 35 and a standard deviation of 10. That's quite beautiful to do. Let's do days, and we're going to use a Poisson distribution. We give it a lambda value here of 2, and I want 200 of those as well, please. And now I'm going to have an admission temperature, and I want it to be from a normal distribution around a mean of 38 with a standard deviation of 2, and I want 200 of those, please. For the white cell count I'm going to have
a mean of 15 and a standard deviation of 5, and I want 200 of those. And now I've just been a bit silly here. This is definitely not a proper distribution for CRP, but I wanted to choose something different from a normal distribution. So I'm going to have a lambda value of 2 for my Poisson distribution, I want 200 of those, and then what we do here is add times 2 plus 100 to each of those. So it takes each and every row entry, multiplies it by 2 and adds 100, just so that we get values in roughly the right range for our CRP. As I said, it's not a proper way of doing it, just something different. So if I type data and hit Command or Control Enter behind that, there's my whole data frame. Now you can see it has gender, group, age, days, temperature, white cell count and CRP, and I have 200 rows of entries for those. So I created this data frame just by drawing random values, to show you how to generate it. In real life you're obviously going to have real data inside a spreadsheet, and you can just import that spreadsheet file and not have to go through all of this.
I'm now going to use Gadfly, which is a popular plotting package for Julia. This is the way it works: because I said using Gadfly, it's now in memory and I can just use one of its keywords there, plot, and then there are my open and close parentheses, the closing one right at the back. Gadfly takes data frames as input, so I'm saying: use my data data frame; on the x-axis, please plot the values that are in Gender. There are only two values in Gender, and they are categorical; it's only going to be M, F, M, F. So it's obviously going to draw two columns for us. For the color, I want you to choose different colors for my two columns based on the values that you find in Gender. It found two values, so it's going to choose two colors. And I want you to draw a histogram for me, so that's the geometry. That's the way that Gadfly works: it draws these geometries. So it's
geometry, Geom.histogram. I also want you to put in a guide. With Guide you get the titles, the x labels and the y labels; that's the text that you can add to your plots. So with Guide.title I want the title to be Gender, and I want the y label just to be Numbers. So that's how you would construct a histogram. I'm going to hit Command or Control Enter, and the first time you run a plot like this you can also probably swallow down a quick cup of coffee while you wait for it to render. Okay, we're back. I had my coffee and took out another two gallbladders. It really takes quite a while to do the first one, which is annoying, but there we go. Gender is my title, Numbers is my y label, and it found only two values, F and M, so many males and so many females here. And it colored them differently, because I asked it to use colors based on what it found. It found two, and these are the default colors: the first one it chooses is a blue, and then this nasty-looking yellow for the second one.
Let's try a box plot. So again plot, which takes a data frame as input. On the x-axis I want whatever you find in Gender, so immediately you know it's going to find two things, and on the y-axis I want age. But I want it to be a box plot this time, so Geom.boxplot, and I'm going to add a title, Age difference between male and female. Now the second time you run a plot it's going to be a bit quicker; there certainly won't be time to do major surgery while you wait. And there you go, two beautifully rendered box plots. If you search the internet for Gadfly, you'll get all sorts of interesting information on how to, for example, increase the gaps between these. It's all just extra arguments that you put in there. So what did it do? On the x-axis it found only two types of categorical entries in the gender column, female and male there, and it took the age on the y-axis and created a nice box plot for us. Let's do a kernel density estimate. So plot again, with data. On the x-axis I want the white cell count, so it's going to be the
distribution of white cell counts, and color equals Gender. It's going to find two entries again, M and F, so it's going to draw two density estimates for me there. So the geometry is the density, and I'm going to title it that. If that doesn't make too much sense, look at the output. There we go: because it found two types of entries, it drew two graphs, so it does male and female separately, and there we have a kernel density estimate of the distribution. Now remember, I took white cell count from a normal distribution around about 15, with a standard deviation, and I asked it to give me 200 random values from that, and you can see beautifully the kind of normally distributed pattern that I get there. Let's just look at white cell count according to the groups. Remember there were groups A and B, so the code looks exactly the same, and it seems that this time, because these are just random values, there was a bit of a difference between groups A and B.
Let's look at the CRP. Remember I created that very awful distribution for CRP. This is for the whole data set; I didn't say color equals Group, so it took all of them as one, and you see this very funny distribution. Now, every time you run this you're obviously going to get something different, that being random, and I didn't seed it with specific initial values, so you're going to get a different plot. The reason I wanted to do that is that you'd certainly be a bit concerned about doing normal parametric analysis on a distribution like this. You can't be sure that those CRPs came from a normally distributed population. So let's do something else. Let's do some point plots. Again I'm taking my data frame as input. On the x-axis I want the white cell count, and on the y-axis I want CRP. So it's going to take those entries and just do a scatter plot for us, in essence, to see if there's some correlation
there between white cell count and CRP. I'm giving it a title again, but this time I'm just going for points. So all we have now is this, and look, because of that funny distribution I chose, this is what we end up with. Certainly, because of the weird distribution I chose, that looks a bit awful. Anyway, let's do something else. Let's do points again, and this time we're going to see the distribution of CRP. Some of these values fall on top of each other, so that's why it doesn't look like we have 200 values here. For the x-axis I asked it to find the different values in Group. Again it found two categories, A and B, and it just plots the actual CRP values that it found. So there were quite a few at 110 and quite a few at 100, and it just plots them on top of each other. Unfortunately I haven't found a way to add some jitter to these, so that the points are spread out and you can actually see how many there are.
Okay, let's create some sub data frames. Remember we did that earlier. Let's call them something new. We now want to create two groups. I'm going to call the first one group A. For group A I'm going to take the data frame data, take the column called Group, and I only want entries that are equal to A. Then, because there were only A and B, I could have said here equals B, but the other way to do it is not equals A. So again it's the data data frame and the Group column, where it doesn't find A. In our example here it's just going to take all the B's. So if I run those two, you'll see that these two data frames now contain only A for this one and only B for that one. So it has just selected all of those for me. Now let's do some statistical analysis. First, have a look at the mean of the white cell counts for the patients in group A: that was 14.7. Let's look at the standard deviation: almost 5, and that really is because of the distribution I chose up there. And that is the standard error of the mean. We can also
just ask for the 95% confidence interval, and that comes from the HypothesisTests package I imported. It has this function called ci, confidence interval, applied to a one-sample t-test; don't be too concerned about the name that was given there. So group A is my data frame that just contains patients from group A, and I want the 95% confidence interval for their white cell count. And there we go, 13.7 to 15.9. Let's see how that compares to the group B patients. In this instance the mean white cell count was 15.2, or 15.3, then the standard deviation there, the standard error of the mean, and the confidence interval there, 14.3 to about 16.2.
So let's compare these two. I can get a p-value, which also comes from the HypothesisTests package: pvalue of an equal variance t-test. If we look at the two standard deviations there, they are almost exactly the same, and I set it up that way, so I can use this parametric test that assumes equal variances. It takes group A's white cell count, comma, group B's white cell count, and if I hit Enter you see a p-value of 0.5, so no significant difference there. Just to show what you would do if they were not, let's do a Mann-Whitney U test. This is for interest's sake. See where the Mann-Whitney U came out: about 0.4, or 0.5 as well, I suppose, if you round it off. That just shows you once again, and I suppose this is not proof of it, just one example, that the nonparametric test is a little less sensitive than the parametric test. But don't use this example as an argument to say that that is always so. Because I chose the CRP distribution so badly, perhaps we should not use a parametric test there, so let's do a Mann-Whitney U on the CRP values between the two groups. Once again you see that wasn't really significant at all either.
So I hope you learned something and can see that you can do a bit of medical statistical analysis here. We ran through the packages that you need to import and how the data frame works, and created this data frame from scratch using these distributions. Again, in real life you'll just import your
Excel spreadsheet or your spreadsheet file, and you can do some pretty awesome graphing and some pretty awesome simple statistical analysis on your data. Good.
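For reference, the simulation and the group comparison walked through in this video can be sketched roughly as below. This is written for current Julia 1.x with recent DataFrames.jl, Distributions.jl and HypothesisTests.jl, so the syntax is not identical to what appears on screen: `confint` stands in for the older `ci` function, and columns are accessed with dot syntax. The column names, the `random_labels` helper and the fixed seed are assumptions made for illustration; the video does not seed its random numbers, so its figures (for example a group A mean of 14.7) will not be reproduced exactly.

```julia
using DataFrames, Distributions, HypothesisTests, Random, Statistics

Random.seed!(2014)   # assumption: fixed seed for repeatable runs (the video does not seed)
n = 200

# Hypothetical helper: label each patient from the sign of a standard normal
# draw, following the for / if-else loop narrated in the video.
function random_labels(n, nonpositive_label, positive_label)
    labels = String[]
    for _ in 1:n
        rn = randn()   # standard normal: mean 0, standard deviation 1
        push!(labels, rn <= 0 ? nonpositive_label : positive_label)
    end
    return labels
end

# Start from an empty DataFrame and add one column at a time.
data = DataFrame()
data.Gender = random_labels(n, "F", "M")
data.Group  = random_labels(n, "A", "B")
data.Age    = rand(Normal(35, 10), n)            # mean 35, sd 10
data.Days   = rand(Poisson(2), n)                # lambda = 2
data.Temp   = rand(Normal(38, 2), n)             # mean 38, sd 2
data.WCC    = rand(Normal(15, 5), n)             # white cell count: mean 15, sd 5
data.CRP    = rand(Poisson(2), n) .* 2 .+ 100    # deliberately non-normal

# Sub data frames: "equal to A" and "not equal to A".
groupA = data[data.Group .== "A", :]
groupB = data[data.Group .!= "A", :]

# Descriptive statistics for the white cell count in group A.
mean_a = mean(groupA.WCC)
sd_a   = std(groupA.WCC)
se_a   = sd_a / sqrt(nrow(groupA))               # standard error of the mean
ci_a   = confint(OneSampleTTest(groupA.WCC))     # 95% interval by default

# Parametric and nonparametric comparisons between the two groups.
p_t     = pvalue(EqualVarianceTTest(groupA.WCC, groupB.WCC))
p_u_wcc = pvalue(MannWhitneyUTest(groupA.WCC, groupB.WCC))
p_u_crp = pvalue(MannWhitneyUTest(groupA.CRP, groupB.CRP))

println("Group A WCC: mean = ", mean_a, ", sd = ", sd_a, ", sem = ", se_a)
println("Group A WCC 95% CI: ", ci_a)
println("t-test p = ", p_t, "; Mann-Whitney U p (WCC) = ", p_u_wcc,
        "; Mann-Whitney U p (CRP) = ", p_u_crp)
```

Because both groups are drawn from the same distributions, any difference between them is chance, so large p-values like those reported in the video are the expected outcome on most runs.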