A couple of years ago I made a video on Julia for medical statistics. It was quite a successful video, I think it's clocked up about 12,000 views now, so people are interested in that. Now it's a couple of years down the line, a short couple of years, and we have version 1.4, and I think that video was still done in 0.4, so many things have changed. Julia's really grown up. It is a phenomenal language for scientific computing, it solves so many problems in such an easy way, and there's a big community now and also a big package ecosystem, so it was time to redo this video.
The video is still going to be about showcasing Julia and how to use Julia to do medical statistics. What we're going to do this time around, though, is base it on a paper, an open paper, and I'll put the link down below. We're just going to use some of the summary statistics to simulate some new data, and that's the data that we're going to analyze. So I'll show you how that works, how to use distributions to create and simulate your own data. Then we're going to do summary statistics and plotting, because you've got to visualize your data, and I'm going to show you a plotting library called Gadfly that really produces beautiful plots. Then we're going to check the assumptions for the use of parametric tests and do one or two inferential tests. Lastly, just again to showcase the ease of use of the language and what it can do, we're going to hand code a chi-square test for independence.
Now as I mentioned, Julia has really grown up, a lot of people are using it, and the community is just getting larger and larger. Currently I'm working on a course, and as soon as that one's done I'll pop that link down below, so maybe by the time you watch this, that link will be up.
It's an introductory course where I teach you Julia from the ground up, whether you've programmed before, maybe a bit of R, maybe a bit of Python, or whether you've never coded at all, just to show you how easy it is to learn a programming language. If you're going to learn one, it might as well be Julia. It really is going places, and if you can code in Julia, you can basically code in Python as well, and you can code in R. There's really no point in coding in just one language. Knowing a couple of them makes life interesting and makes your work a lot more fun. Start with Julia though, why not?
Now though, let's look at using Julia for medical statistics. We start off with the Julia language website, julialang.org. There's lots of information here, and you can learn all you like about Julia in the documentation. I'm going to tell you that the documentation is on the technical side, so at least when you start, it is a little bit of a struggle. To download and install Julia, we can just click on the downloads button, and we can see that when I made this recording, the current stable release was 1.4.1. That's what we're going to use, and it was released on April the 14th. There's also a long-term support release, if that's what you want to stick to. You can see we have Windows, macOS, and Linux, and for some of them 32-bit and 64-bit, and you will know what to install. For Windows go for 64-bit, on macOS you're only going to get the 64-bit, and with Linux the 64-bit would just be the norm, since most computers and operating systems are 64-bit these days.
When you've installed Julia, though, it is only going to be available to you in the REPL, so that is your terminal, your command prompt, where you can write little lines of code that get executed. What you do need is a graphical user interface, and the best one to use at the moment is Atom, and there we go, atom.io.
When you go on the Atom website, it should recognize your operating system. It says that I'm on Windows here, and we can download and install that. If Julia was installed properly, and once you've downloaded it you can just accept all the defaults, it should be in your environment variables, or in your path. That means your computer knows where to find Julia when you want to start using it, and when you install Atom and set up Julia inside of Atom, it should pick up where Julia is. Atom is a general purpose coding environment, an IDE, and you can use a lot of languages inside of Atom.
Let me just open Atom for you there. There we go, that's Atom, and you can see right at the top there it says Juno. That means Julia was already set up to work inside of Atom, and I'll show you two ways that you could go about that. The easiest way, I should say, is just to go to Julia Computing, so not julialang.org but juliacomputing.com, the commercial arm of Julia. You can find out all about what they have to offer, but what you want to do is go down here and look at JuliaPro. Let's click on Read More, and there you can download JuliaPro free of charge. You can see there they've got 1.0.5.2, the long-term release, but also the current stable release 1.4.1-1, and again you can just download those. It might ask you to register, but it's all free of charge, and you can go ahead and download that. What that is going to do for you is install Atom, as we can see here, with Julia already connected to it.
If you install Julia from julialang.org separately, and Atom separately, you can always come to the settings page, which for me was control comma, or command comma, and you'll see packages there. It'll list all the packages that were installed, and you can see julia-client was already installed and language-julia was already installed. What you would have to search for is uber-juno, that's uber, Juno.
If you search for that package and install it, that's going to install everything for you. If everything worked well, so you installed Julia properly and you installed Atom as we've done here, you should have Juno there when you restart. You can go down to settings, and you can set all sorts of things as far as the Julia client specifically is concerned. You can see there are some Julia options, the UI options, et cetera. Atom itself has built-in themes as well, so you can go for a light theme or a dark theme. Let's close this all off.
What I've got on the left-hand side is just a project pane. In other words, I went to File and Add Project Folder, because the files that I want to work in are all in a specific folder. That also allows you to connect your code to a GitHub repository, and you can just continuously upload to GitHub. What you see down at the bottom is a terminal, and you can run Julia in the form of a REPL right here. So if I hit enter or return there, we can see Julia's launched, and when you install just base Julia, Julia without anything else, without an IDE, this is what you're going to see. I can just type some code there, two plus two, hit enter, and we're going to see four.
In the middle here is the coding environment itself, and you see I've got a file open, Julia for medicalstatistics.jl, and that's what we're going to work with. On the right-hand side, there's the workspace. This is going to show me my current workspace, the computer variables that I've created and the objects that are assigned to them. You get full documentation here, so you can search the documentation, although of course you can just go on the website as well, and we're going to have inline plots. What I like to do is grab the REPL here and move it right up there next to the workspace, where it makes that little blue mark, and drop it there. Now it's on the right-hand side for me.
I can also just close down this side, and we've got a lot more space to work with, because some of the lines of code are quite long. As with Python, we can use the hashtag symbol or pound symbol on the left-hand side of a line of code, and that means that whole line will just be ignored by Julia. Inside of an IDE such as this, you can use these just to write some comments, and when you download this file, you can see the comments that I've made. They'll tell you a little bit about Julia, about its type system, about multiple dispatch, why it's so fast, and why it's such a lovely language to use.
Basically, it comes down to this: it is as simple as Python, but it runs at the sort of speeds that you can expect of compiled languages like C, because that's exactly what Julia is. It's a compiled language. When you enter some code, it gets compiled for your computer, for your CPU. In other words, the code is going to execute very fast. It's just-in-time compiling, so it goes to the low-level virtual machine (LLVM), and that is going to compile the code for your system, bringing you that speed.
One more thing that I want to say about Julia before we start: everything in Julia is a function. It's a functional language. In other words, you saw me type 2 plus 2 there, but what's happening behind the scenes is that there is a plus function in Julia. You start by typing the function, and in this instance it is this plus symbol that is the name of a function. As with all functions, you pass it some arguments. Not all functions need arguments, but most functions would. They are separated by commas if they're positional arguments. We might talk a little bit about keyword arguments, perhaps not in this video, but those would follow a semicolon. Here I'm just passing two arguments to the plus function, 2 and 2. The plus function knows what to do with them. Why?
Well, it understands what 2 and 2 are. The 2 and the other 2 are both 64-bit integers, 64-bit by default because this is a 64-bit operating system. So those are 64-bit integers that I'm passing to this function, and through multiple dispatch the plus function will know what to do with integers. That will be something very different from doing just this, where one of them is 2.0. That's a floating point value. It's going to call a different method for the plus function, because it knows what to do with floating point values, those are decimal point values, and that's different from integers. But Julia understands what to do and how to compile the code so that the plus function executes as fast and as optimized as possible.
We don't have to tell Julia what type of variable we have. If I say a equals 3, I don't have to specify the type of that 3. I'm instantiating this object, a single integer value, and that is assigned to the computer variable a. So that's an instance of a 64-bit integer here, but I didn't have to specify the type. Julia is going to infer that for me. You can specify the type if you want to, and in some cases that can help execution.
All that being said, that's not what you're here for. Let's have a look at what we can do in Julia. Now we are going to make use of third-party packages. Most other languages have packages that you can install to greatly expand the functions that are available to you and what you can do. You see here I've just typed a single line of code, import DataFrames. Let me increase the size just one more tick. There we go. That was just holding down control or command and using my mouse wheel to increase the size there. So, import DataFrames. DataFrames is the name of a package, and I'm using import there to import it.
But you can also see down here that I'm importing Gadfly, and instead of import, I use using. So there's a difference between those two. There are a few differences. One of the main differences is that if I want to use some of the functions inside of DataFrames, I have to use the full namespace there, so DataFrames dot and then a function that lives inside of it. With using, for most of them, you don't have to do that. You just use the function as is. If you're familiar with Python, that would be like import numpy as np and then using np.random, for instance, whereas if you did from numpy import random, then you can just use random directly.
The reason why I'm going to do this here is that if I show you a function, I want you to know where it comes from. In many tutorials, all these packages are just loaded with the using keyword, and then functions are used, but you're not really sure where they come from. So I'm going to restrict myself to import here, which is not the norm, just so that if I use a function from, for instance, StatsBase, you know that that function comes from StatsBase.
So where did I get all of these extra packages from? Let's go back to the REPL. I'm going to delete that there. I'm going to hit the right square bracket, and that brings me into the package management system. All I have to do is say add, and then I could say DataFrames. If I hit enter now, the DataFrames package is going to be installed. That's a permanent installation, unless I remove it, so it's always there. But the first time that you start up the IDE... Just to say, to get out of the package manager, I just had to hit backspace and I'm back.
Before I forget, there's also the question mark, which brings you into the help system. If I were to look for the function called sum, look at that, I just get all that information. I could also go to the documentation, of course, and type in sum.
And I'm going to get this information as well. So the help system is always there for you.
Back to these packages. You install them there, but you have to import them. Every time you start the IDE or start the Julia kernel again, you have to import them or use the keyword using with these, and then they're going to be pre-compiled. That takes some time. So every time you import these packages, it is going to take time, and there is no way to rush it. It's not like importing NumPy as np, which is almost instantaneous. This has to be compiled. You can see here at the bottom it says pre-compiling DataFrames, and you see this little gear icon here. It is doing the pre-compiling, and you have to wait for that. If you want the speed, you've got to wait for the language to compile. And there we go. It didn't take that long.
Let's also import Distributions. The Distributions package is going to give me access to a lot of discrete and continuous distributions. Gadfly is a package for plotting, based on the grammar of graphics. If you've used ggplot2 before, it should not be too difficult for you to pick up Gadfly. It renders very beautiful plots, so certainly go to the Gadfly website and have a look around. It is just phenomenal, the type of plots that you can create, and of course we're going to create some plots in this tutorial. So let's do Gadfly. It's going to take a little bit longer to pre-compile. It's a large package, and it's going to take a bit of time.
While that runs, I'll continue by talking about HypothesisTests. That's going to give us access to a lot of statistical tests for inference, so our t-tests, F-tests, et cetera are going to be available to us. Not all inferential statistical tests are available, so have a look at their website. For instance, as I mentioned in the introduction, we are going to hand write some code to do a chi-square test for independence at the end of this tutorial.
StatsBase is also going to give me access to a bunch of statistical functions, which makes life very easy. It's a very useful package to have. So let's just go back to HypothesisTests. You can see that for Gadfly and for HypothesisTests I'm using using, but for StatsBase I'm using import. As I mentioned, this is because I want to show you where these functions come from. So I'm going to do import.
By the way, to execute these lines of code, I'm just at the end of the line, although I can be anywhere inside that line, and I'm holding down shift and hitting enter. Shift and enter, or shift and return, and that is going to run that line for me.
Let's do Statistics. Now Statistics and Random come with Julia. They're built into Julia. But if you want to use the functions inside of those two packages, you have to import them or use the using statement.
CSV, of course, is a package that's going to help us read CSV files. If you work with spreadsheet software, such as the ubiquitous Microsoft Excel, and you're going to work with data, don't save your files as Excel spreadsheets. Save them as CSV files, comma-separated value files. It's much easier to work with those kinds of files, because they have stripped away all the fancy things that have been added to the rendering of your cells inside of Microsoft Excel or some other spreadsheet software. All you want is the actual data, not its representation as a percentage. If it was captured as a fraction, you want those values as a fraction.
And then lastly, or second from last, we're going to use Query. That's a package that gives us a query language, and it is very powerful. It's not only for the DataFrames package. I didn't say much about the DataFrames package right in the beginning. That's the one that's going to allow us to work with the data that we've imported. The query language on top of that is almost like a structured query language. It allows us to write queries to interrogate the data.
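As a small aside on the import and using distinction described above, here is a minimal sketch that uses only Statistics and Random, the two packages that ship with Julia. The vector of numbers and the variable names are made up for illustration and do not come from the video.

```julia
# `import` keeps names behind the module prefix;
# `using` also brings the exported names into scope.
import Statistics          # functions must be called as Statistics.mean(...)
using Random               # exported names such as shuffle can be called directly

measurements = [1.0, 2.0, 3.0, 4.0]   # made-up numbers for illustration

average = Statistics.mean(measurements)   # qualified call, so the origin is obvious
shuffled = shuffle(measurements)          # unqualified call, made possible by `using`

println(average)
println(length(shuffled))
```

The qualified form is longer to type, but a reader can always tell which package a function belongs to, which is the reason given above for preferring import in this tutorial.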
And in the end, if we're going to do a chi-square test for independence, we've got to have some contingency tables, and that's the FreqTables package. It's going to allow us to do those counts in contingency tables. So let's import that.
What I'm going to do next is make use of this journal article that you can see here. The link will be down below, and you can read it. The article itself used an organism in tablet form, randomized some participants, and had a look at the effect that would have on their cholesterol, specifically HDL cholesterol. They've got beautiful tables in that journal paper, and we're just going to use the summary that they gave in those tables and generate some data of our own. That's one of the beauties of a computer language. Of course, you can generate your own data. So this is not real-life data. This is data that we're going to simulate based on the summary statistics in those tables.
The first function that we're going to use, you see here, is Random.seed!. There's an open and close set of parentheses because seed is a function, and I'm just passing some integer that I decided on, 12. You see the little exclamation mark there? That's the bang symbol. So Random.seed bang, and I'll tell you a little bit more about the bang symbol a bit later. To seed the random number generator means that all the random values that we are going to generate will be exactly the same if I run this code again. If you run this code, you're going to get exactly the same set of random numbers as well. If we don't do that, of course, every time you run this, you're going to get different random numbers. So let me just execute that, shift enter or shift return. If I go back and rerun some of these out of order, of course, it'll be different numbers. But if we run these all in order, we should all get the same random values.
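The effect of seeding described above can be checked with a short sketch. The seed value 12 is the one from the video; the range 1:100 and the number of draws are arbitrary choices for illustration.

```julia
import Random

Random.seed!(12)               # 12 is the seed used in the video
first_draw = rand(1:100, 5)    # five arbitrary draws

Random.seed!(12)               # re-seeding restarts the same sequence
second_draw = rand(1:100, 5)

# The two vectors hold the same values in the same order.
println(first_draw == second_draw)
```

The actual numbers can differ between Julia versions, so the sketch compares the two draws with each other rather than against fixed values.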
So the first random function we're going to deal with is a built-in Julia function, rand. It's going to take some arguments and return one or more random values. These are positional arguments, so you've got to put them in order, and the first one is what to choose from. This is a unit range, 30 colon 65. I haven't put a step size in the middle. I could say 30 colon 2 colon 65, and that's going to go up in steps of 2. If I don't put any step size, it's just going to use one as the step size. So from 30 to 65, that is a unit range, and it's going to select from those values. How do I know that that's a unit range? Well, let's go to the REPL here. There is a typeof function, and if we pass it 30 colon 65, it'll tell us what Julia sees that as. It sees that as a unit range of 64-bit integers, just as we suspected.
Back here, after the comma, the next positional argument is 46. I want 46 values back from this interval of 30 to 65. The rand function is going to sample from a uniform distribution with replacement. So if the age 35 was chosen once, it goes back into the pile and it can be chosen again. So just a uniform random distribution. Shift enter, and there we see we have a vector of 64-bit integers with 46 elements. If I twirl down on it, you can see a list of them there, 46 of them.
In this paper, they had 46 participants, 23 in each arm, taking either the drug itself, which as I say was just some organisms, or placebo. So we want 46 of them, and we're just going to take 46 values from a uniform distribution. And we're assigning that. Remember, the equal sign is an assignment operator. It assigns what is on the right side of it to whatever is on the left. On the left, we've created a computer variable named age, creating a space in memory in which the object on the right-hand side is stored. I'm going to use snake case for my computer variable names.
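The range and rand calls described above can be sketched as follows. The variable name age and the values 30:65 and 46 come from the video; the shorter range 30:2:36 is only there to keep the printed output small.

```julia
import Random
Random.seed!(12)

println(typeof(30:65))       # how Julia sees the range without a step size
println(collect(30:2:36))    # with an explicit step size of 2

# 46 draws from the integers 30 to 65, sampled with replacement.
age = rand(30:65, 46)

println(length(age))
println(all(a -> 30 <= a <= 65, age))   # every value lies inside the range
```

Because the sampling is with replacement, repeated ages are expected in a vector of 46 values drawn from only 36 possible ages.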
There you can see snake case. In other words, it's a name I came up with for a computer variable, but the words have got these underscores in between them. It's commonly seen in Julia, although you can use, I suppose, whatever convention you want, camel case for instance. It's really up to you.
The next one I'm going to generate is gender. Here I'm going to use the sample function from the StatsBase package, and because I used import, I now have to write StatsBase.sample. My first argument is going to be an array, and it is an array because of the square brackets. An array is a list of elements, and the two elements I'm going to give it are female and male, as strings. In Julia, strings go inside of double quotation marks. You can also use single quotes, but only with a single letter, and then it's not a string. It's actually a character type, not a string type. So there we go.
Now in this paper, there was just a binary allocation of gender, so there was only female or male. In the paper, 60% of participants were female and only 40% were male, so I'm going to add some weights to this random sampling. StatsBase.weights is my next function there, and again I'm passing an array, of 0.6 and 0.4. Of course, that's going to sum to one. And I want 46 of those. It's a very expressive language, because that's almost like an English sentence that I've written: sample for me from this sample space containing two elements, female and male, with weights so that 60% are female and 40% are male. Or rather, at every turn that's the likelihood of being selected. There's a 60% likelihood of choosing female and a 40% likelihood of choosing male. And I want 46 of them as well. So shift enter, or shift return, and we see my vector here, of strings this time, not 64-bit integers. Let's twirl that down, and we see male, female, female, female, male.
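To make the idea of weighted sampling concrete, here is a sketch that relies only on Julia's standard library. The helper weighted_draw is hypothetical and stands in for the StatsBase call used in the video, which is quoted in a comment; the capitalisation of the two labels is also an assumption.

```julia
import Random
Random.seed!(12)

# Double quotes make a String, single quotes make a Char.
println(typeof("Female"))
println(typeof('F'))

# Hypothetical helper: return one item, chosen with the given probabilities.
function weighted_draw(items, probabilities)
    u = rand()                       # uniform value in [0, 1)
    cumulative = 0.0
    for (item, p) in zip(items, probabilities)
        cumulative += p
        if u < cumulative
            return item
        end
    end
    return last(items)               # guard against floating-point round-off
end

# The video's version of this line is:
# StatsBase.sample(["Female", "Male"], StatsBase.weights([0.6, 0.4]), 46)
gender = [weighted_draw(["Female", "Male"], [0.6, 0.4]) for _ in 1:46]

println(length(gender))
println(count(g -> g == "Female", gender))   # near 60% of 46, but rarely exact
```

Each of the 46 draws is independent, so the weights are per-draw probabilities rather than a guaranteed split of the final vector.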
And with only 46, of course, it's not going to be exactly a 60-40 split. 46 is a small number, but it's going to be in that order.
Next up, I'm going to create this computer variable called group, and now I'm going to use the repeat function. To the repeat function I'm again going to pass an array, but this array only has one element, and I want to repeat it 23 times. You can guess at what's going to happen here. I'm going to have a vector of strings with 23 elements. And there we go, placebo, placebo, placebo. So what I'm trying to simulate here is that I first put in the 23 participants that took the placebo, and then I'm going to add the 23 that took the active ingredient.
So how do I add to the end of an already existing vector or array? You'll see it's called a vector here in Atom, but if you just ran it in the REPL here, it'll say array, not vector. If we want to add another 23 to the end of that 23, I'm going to use the append bang function. So there's append, and I'll tell you a little bit about the bang, because you get many functions with and without the bang. What the bang does to this function is make the changes permanent, in place. So I'm going to append to the end of these 23 placebo values, inside of the group computer variable, and the changes will be permanent. Sometimes you don't want those changes to be permanent. You only want them to happen inside of a for loop or inside of a function, and then you want the original back at the end without the permanent change. Here, I want the permanent change. Append to the group variable this repeat, so repeat active 23 times. If I run this now, suddenly my vector is 46 elements long. To the end of those 23 placebos, I've added 23 actives. Great stuff. So let's close these. There we go.
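The repeat and append! steps described above can be sketched in a few lines. The label strings are assumptions about the exact spelling used in the video, and the vcat lines at the end are an added contrast to show what a function without the bang does.

```julia
group = repeat(["Placebo"], 23)          # 23 copies of the single element
println(length(group))

# The bang signals that `group` itself is modified.
append!(group, repeat(["Active"], 23))
println(length(group))
println(group[23], " ", group[24])       # last placebo, first active

# Contrast: vcat has no bang and returns a new array, leaving its input untouched.
original = [1, 2, 3]
longer = vcat(original, [4, 5])
println(length(original), " ", length(longer))
```

The bang is a naming convention in Julia rather than special syntax, so it is a hint to the reader that one of the arguments will be changed.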
The next thing we're going to do is actually sample from a continuous distribution. In the paper, they looked at a lot of variables. I'm not going to simulate all of them here, just a couple of them. They looked at HDL cholesterol, high density lipoprotein, that's the good cholesterol, before and after the intervention. They told us in the paper what the mean and standard deviation were for the sample values of those variables. Of course, we don't have access to the values themselves, so I'm just going to simulate them based on those parameters for the normal distribution.
I'm going to call my computer variable HDL_cholesterol_before. I'm going to use the rand function, but this time I'm not going to take a unit range, because I don't want the uniform distribution. I actually want the normal distribution with a mean of 1.24 and a standard deviation of 0.31. That comes straight out of the paper. I'm saying use this distribution, so that's Distributions.Normal. If I had said using Distributions, you could just have said Normal, but because I said import Distributions, we have to say Distributions.Normal. As I say, I'm doing this so that you can see where this Normal function comes from. It doesn't just come from Julia or some other package. It comes from the Distributions package. These are positional arguments, and it's always going to be mean comma standard deviation. As I say, those come from the paper, and I want 23 of those. So let's do that, and now we've got these 23 elements.
Then I want to add another 23. The first set was read from the table as far as the placebo group was concerned. Now I want to add from another distribution, according to the summary statistics in their table for the 23 participants who took the active ingredient. So what do I do? I'm going to append with a bang because I want that permanent.
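For readers without the Distributions package to hand, the first 23 draws can be approximated with the built-in randn function. This is a stand-in technique, not the call used in the video, which is quoted in the comment; the lowercase variable name is also an assumption.

```julia
import Random, Statistics
Random.seed!(12)

# Stand-in for rand(Distributions.Normal(1.24, 0.31), 23):
# randn draws from a standard normal, which is then scaled by the
# standard deviation and shifted by the mean.
hdl_cholesterol_before = 1.24 .+ 0.31 .* randn(23)

println(length(hdl_cholesterol_before))
println(Statistics.mean(hdl_cholesterol_before))   # near 1.24, not exactly
println(Statistics.std(hdl_cholesterol_before))    # near 0.31, not exactly
```

With only 23 values, the sample mean and standard deviation will wander a little from the parameters taken from the paper.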
Append to HDL_cholesterol_before from this distribution with a mean of 1.24, so that was exactly the same, and 0.29 as far as the standard deviation is concerned. I'm going to add 23 values to that, which means I now have a 46-element vector of the values that we were interested in. So we've simulated that, and it's correct for the two groups of participants.
Then I'm going to do this a couple more times. I'm going to create an HDL cholesterol after, and again I'm going to go through the same thing, rand from a normal distribution with a mean of 1.4 and a standard deviation of 0.35, and then I'm going to append to that some more. You can have a look at that code. I'm going to run through it very quickly because it's just a repeat of what we've done before. So there's a weight before and a weight after, and diastolic blood pressure before and diastolic blood pressure after. I'm just going to run quickly through all of these and create them from the distributions as per the parameters in the table. So let me run through all of those. Done.
That would be one way to go about it. Let me just show you something else, and the reason why I'm doing this is to show you how a for loop works inside of Julia. So we have BMI_before, and I'm assigning to that an empty array. An empty array is just the two square brackets, open and close. It's a vector, and you see the type here is Any.
Now I just want to stop there for a bit, because I want to tell you about the Julia type hierarchy. Everything has a type in Julia. So let me type in the REPL on the right-hand side here. What is the type of just 3? Well, it's a 64-bit integer. What is the type of 3.0? I can just type 3 dot, and Julia knows it's 3.0. If I enter that, we see it's a 64-bit float.
And these are concrete types, and you can instantiate a concrete type. What that means is I can take a computer variable and assign 3.0 to it, and that would be an instance of a 64-bit floating point type. Now let's ask for the type of 3 comma 3, but passed inside of square brackets, so that is actually an array. If we look at that on this side, it'll indeed say array. In the IDE, it'll call it a vector, which is the same thing. So it's an array of 64-bit integers and it's a rank one tensor. In other words, in mathematical speak, it is a real column vector. Again, the array of 64-bit integers that we see here is a concrete type. I can create an instance of that type.
But I can also look up the hierarchy. So let's see what the supertype is. supertype is my function. What is the supertype of Array? Well, in this instance, it has a supertype of DenseArray. Let's go up and ask what the supertype of DenseArray is. Well, that's AbstractArray. Okay, let's go further up and see what the supertype of AbstractArray is. And that's Any. So Any is right at the top of this type hierarchy, and it's like branches going down on all sides, where you get all sorts of types. So Any will have many, many types below it.
I can also look at all the subtypes. For instance, what are the subtypes of, let's make it, Number? Number is some way up a different branch of the tree. So let's look at the subtypes of Number, and we see it has two subtypes, Complex and Real. Complex is a type I can actually instantiate. In other words, I can create a complex number. Real, on the other hand, continues to branch out, so I can't instantiate Real. I've got to go all the way down to, say, Int64 or Float64 in that hierarchy tree. If you're interested in that, Google it, and you'll quickly find this whole tree structure.
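The walk through the type hierarchy above can be reproduced with a short script. One assumption to flag: inside the REPL the subtypes function is available straight away, but in a script it has to be loaded from InteractiveUtils, which ships with Julia.

```julia
using InteractiveUtils   # provides subtypes outside the REPL

println(typeof(3))        # Int64 on a 64-bit system
println(typeof(3.0))      # Float64
println(typeof([3, 3]))   # a one-dimensional array of integers

# The same + function picks a method based on the argument types.
println(typeof(+(2, 2)))
println(typeof(+(2, 2.0)))

# Walking up from Array towards Any.
println(supertype(Array))
println(supertype(DenseArray))
println(supertype(AbstractArray))

# Looking down from Number.
println(subtypes(Number))

println(isconcretetype(Int64))   # true: Int64 can be instantiated
println(isabstracttype(Real))    # true: Real cannot be instantiated directly
```

The functions isconcretetype and isabstracttype are not mentioned in the video; they are included here as a quick way to check which types can be instantiated.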
Anyway, back to our for loop, which is what I wanted to show you. So for i equals 1 to 46. Again, remember that would be a unit range, so I just put in 1 colon 46, and it's going to go 1, 2, 3, and loop through all those values, because I didn't put a step size. I could have said something like this, with a step of 2 in the middle, and that would have jumped from one to three to five, et cetera. But if I don't put anything there, it'll be exactly the same as putting a step of one. I'm just going to go from 1 to 46 in steps of one.
So for 1 to 46, do the following. When you're at the end of the line there and you hit return or enter, it's going to create this indentation for you, and that helps us see what the flow is going to be. A for loop is always ended with an end, so you've always got to have the end there.
So what am I going to do here? I'm going to create two random numbers. The first one is going to be stored in placebo_BMI_before, and the next one is going to be stored in active_BMI_before. What are these two values? Well, both of them use the rand function, and both come from a normal distribution, but you see the slight difference in the mean and standard deviation from which those are going to be selected. And I haven't said comma, I want five of them. No, I just leave that blank because I just want a single one. So just give me one random value back from that distribution and store that in this placebo computer variable, then give me a random value from this other normal distribution and store that there.
So I've got these two values now, while i is still one, and I'm going to use the push bang function. I'm going to push to BMI_before. This BMI_before is empty at the moment, but now I'm going to push a single value to it. What am I going to pass to it? Well, that comes after the comma, and what I want to do is assign based on what is inside of group.
So let's have a look at group again; I'm just going to show you here on the right-hand side. So there's group, and you can see it's the 23 placebo and then the 23 active. But I can index that. I can ask what is in group at position number four, seeing that this is just a column vector that goes from one to 46. Of course, that is going to be placebo. So instead of giving it a specific number, what is inside number four, I'm asking what is inside number i, because every time we loop through this, i is going to be one, then two, then three, then four, then five, then six, et cetera, until 46. So it's going to iterate through all of them. And now it asks, is this equal to placebo? The double equals sign is a Boolean question; it's going to return true or false. So that's the one I'm looking at at the moment. Let's just look at group one. We all know what group one is: it's going to say placebo. If this is placebo, then you see a little question mark, and that says give me back the placebo value that was stored there. Else, after the colon, give me the active value back. And that means I can assign based on what is inside that group value. What we see here is called a ternary operator. Let me do this for you here. So we can say two is less than five, and then a question mark, and you've got to put these spaces, so yes, and then it's not a comma but a colon, no. Those are strings, and you can put in anything you like. Of course, we get yes back. If we now say two is greater than or equal to five, and then yes, and we've got to have those spaces there, colon, no, of course we're going to get back no. So it's just a very shortened version of an if-else statement, in case you were wondering. So all we're doing here is saying we read that group value, and then we assign one of those two values based on what we see in it, and we push that into this bmi_before. So let's do that.
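The loop and the ternary operator described above can be sketched like this. It is a sketch under assumptions, not the code from the video: the means and standard deviations are made-up numbers, the labels "Placebo" and "Active" are assumed spellings, and `mean + sd * randn()` is used as a plain-Julia stand-in for drawing a single value from a normal distribution with `rand`.

```julia
using Random

Random.seed!(12)  # arbitrary seed, only so the run is repeatable

# 23 placebo participants followed by 23 active participants (assumed labels).
group = vcat(fill("Placebo", 23), fill("Active", 23))

bmi_before = Float64[]  # empty vector that the loop fills

for i in 1:46  # a unit range: steps of one from 1 to 46
    # One random draw for each arm; the parameters are hypothetical.
    placebo_bmi_before = 28.0 + 3.0 * randn()
    active_bmi_before = 29.0 + 3.0 * randn()
    # Ternary operator: condition ? value_if_true : value_if_false
    push!(bmi_before, group[i] == "Placebo" ? placebo_bmi_before : active_bmi_before)
end

answer_a = 2 < 5 ? "yes" : "no"    # "yes"
answer_b = 2 >= 5 ? "yes" : "no"   # "no"

odd_steps = collect(1:2:9)         # a range with a step of two: [1, 3, 5, 7, 9]
```

The variables created inside the loop body are local to the loop, while `bmi_before` and `group` live outside it and keep their values afterwards.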
And now if we look at bmi_before, every one of those 46 values has been taken from the right distribution. And then we're going to run through the same for loop there. Now I've just created a bunch of random values for our random variables there. We've got our data point values. Now we're going to store them inside of a data frame. So I told you about the DataFrames package. That's an excellent package just to work with data. So if you're going to import a CSV file, a comma-separated values file, you can use CSV.read, and then it's going to be stored as a data frame. If you're familiar with R, that would be like R data frames, or in Python, that would be like pandas. So you see there an opening and closing set of parentheses. These are all arguments that I'm passing to the DataFrame function. And what we can do there is give a name to each of our variables. So think about it as a spreadsheet file: that would be row number one, all the column headers. And look at them here. They are just passed as normal words without spaces in between or other illegal characters, but they're not strings. I'm not passing them as strings, just like that. So I'm going to have ID. Here I'm going to use the range function, from one, comma, stop equals 46, so it's going to go from one to 46. That's another way to create this iterator. Age, with an uppercase A, is going to be my statistical variable name inside my data frame, and I'm passing that the age computer variable. And I go along all these ones that we've created, except here for HDL cholesterol delta. To that I'm going to assign an operation that I do on two of the variables we have created. We've created hdl_cholesterol_before and we've created hdl_cholesterol_after, and I want this column to be the difference between those two, hence the name I've chosen. And all I'm going to do is element-wise subtraction.
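The two building blocks just mentioned, the `range` function and element-wise subtraction with a dotted operator, can be tried on their own in plain Julia. The values below are made up for illustration, and the variable names are hypothetical stand-ins for the ones used in the video.

```julia
# Made-up HDL cholesterol values for three participants.
hdl_before = [1.5, 1.25, 2.0]
hdl_after = [1.0, 1.5, 1.75]

# The dot in front of the minus sign makes the subtraction element-wise:
# first minus first, second minus second, and so on.
hdl_delta = hdl_before .- hdl_after   # [0.5, -0.25, 0.25]

# range with a stop keyword is another way to write 1:3.
ids = range(1, stop = 3)
id_values = collect(ids)              # [1, 2, 3]
```

The same dot works with other operators and with function calls, for example `hdl_before .+ 2` or `sqrt.(hdl_before)`.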
Remember, this one has 46 elements in it, this one has 46 elements in it, and I want to do element-wise subtraction. In Julia, that means the minus sign, but we have to put this dot in front, so dot minus, and that indicates in Julia that we want an element-wise operation. So take those pairs of values and subtract them from each other. And among all the others, you see I've done a weight delta as well, so dot minus again: the difference between those two vectors, element-wise. And you see for BMI I didn't create a delta, because I'm going to show you how to do that just with code. So shift-enter, or shift-return, and now we have a data frame. You can see it's 46 by 15, so 46 subjects there across 15 variables, and there's not enough space, so it's going to omit some of them, but there's my data frame. And you can see it is like a spreadsheet file. There I have my column headers: Row, ID, Age, Gender, Group, HDL before, etc. But just below that, Julia tells us what the type is of these elements. So what is the type of this? Well, there are 64-bit integers here, another column of 64-bit integers, and then strings, strings, 64-bit floats. Now sometimes you want to deal with these things not as strings but as categorical variables. And there's a DataFrames.categorical! function. I pass it the data frame and then the column header, and it's going to change that column permanently, because this is a bang function, from a string into a categorical type. And what you have to notice here is this colon in front of Group. DataFrames allows us to use the column names as symbols, and that is a symbol. Once there is this colon in front, that's a symbol. And that is the notation that we're going to use when we want to refer to the columns inside of a data frame. So remember that we've got to use that symbol notation. So I'm just changing the group and the gender, and if we have a look at them now, you'll see gender is now a categorical type, not a string type anymore.
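A scaled-down version of this data frame construction might look like the sketch below. It assumes the DataFrames package is installed; the column names and values are hypothetical, and only four rows are used instead of 46. The `categorical!` call from the video is left out on purpose, since it belongs to the DataFrames version used at the time of recording and may not be available in the same form in other versions.

```julia
using DataFrames

# Made-up values for four participants.
hdl_before = [1.5, 1.25, 2.0, 1.75]
hdl_after = [1.0, 1.5, 1.75, 1.75]

# Column names are given as plain words (keyword arguments), not strings.
df = DataFrame(
    ID = range(1, stop = 4),
    Age = [34, 52, 61, 47],
    Group = ["Placebo", "Placebo", "Active", "Active"],
    HDLBefore = hdl_before,
    HDLAfter = hdl_after,
    HDLDelta = hdl_before .- hdl_after,  # element-wise difference
)

dims = size(df)                 # (4, 6): four rows, six columns
age_type = eltype(df.Age)       # Int64
group_type = eltype(df.Group)   # String

# A colon in front of a name makes it a Symbol, which is how columns are addressed.
is_symbol = :Group isa Symbol   # true
group_column = df[!, :Group]
```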
And there are advantages to that. Next up I'm going to show you how we get access to only certain parts of the data. Because when you start analyzing data, you don't want to see all the data in the data frame. You want to narrow it down and look at something specific. So I'm going to start off by just looking at the ways to slice a data frame. So remember when we had group, we just looked at the first value in there, and that was placebo. I used index notation there, and that's inside of square brackets. That's exactly what we're going to do here. But here with data frames, we have rows and columns. You can think of it as a rank-two tensor. We have rows and columns, not only values down a single column. So, as you do with a spreadsheet file, we've got to give the cells a row and a column address. So here we're going to use a unit range, one to three. That's going to give me one and two and three, so row one, row two, row three, and then a comma. Here, if I just use the colon symbol, that shorthand will give me all of the columns. So rows one, two and three, all the columns please. Let's have a look at that. I get indeed three rows across all 15 of the columns. So only the first three rows there, across all the columns. Now let's just ask for a single column. So still rows one, two and three, but only of the Age column please. And look again, I'm using symbol notation. So there we go. It's a vector of 64-bit integers, and there are only three elements in it because we've only asked for three elements. If I want to use more than one column, I've got to pass them as an array, so they're going to go inside of square brackets. I want the Group column and the Age column, still only rows one to three. So let's do that. Now I have a three by two data frame object. I have the group and the age, as I've asked for, in that order, and I see the first three rows for those.
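The three slices described above can be sketched on a small, made-up data frame. This assumes the DataFrames package is installed, and the column names and values are hypothetical.

```julia
using DataFrames

df = DataFrame(
    ID = 1:5,
    Age = [34, 52, 61, 47, 58],
    Group = ["Placebo", "Placebo", "Placebo", "Active", "Active"],
)

# Rows 1 to 3, all columns: the result is a data frame.
first_three = df[1:3, :]

# Rows 1 to 3 of a single column: the result is a plain vector.
ages = df[1:3, :Age]

# Rows 1 to 3 of several columns, passed as an array of symbols:
# the result is a data frame with the columns in the order asked for.
group_age = df[1:3, [:Group, :Age]]
```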
What if I only want rows one and three, not one, two, three? Then I've got to put that inside of square brackets, because this becomes an array, an array of rows, only one and three, then a comma, and an array of all the columns that I want. And now I'm just going to get this two by two data frame, only row one and row three, and only for those two columns. Now, remember I said up here that we use the colon as our shorthand for give me all. Just make a note that it's becoming more prevalent to use the exclamation mark when you want to refer to all the rows. So the data frame object, give me all the rows, and then just the Age column. And I'm asking a Boolean question there: would that be the same as using this other notation? Am I going to get back exactly the same thing? And the answer is true. So I could use either of those two notations. Now we're going to pass a rule. We want the data frame back with all the columns, but go down the Age column and only return those rows that have an age of more than 50. So how do we go about that? I want to see the whole data frame with all my variables, but I only want to see it for participants who are older than 50 years of age. So this is what we're going to do. We have this very nice dot notation. So if I say df.Age, let's do that in the REPL, df.Age, it's just going to give me back this 46-element array, or vector, of these 46 values. So df.Age is a shorthand. Otherwise I could have written df like we've done up here, and said give me all the rows, only of the Age column. It's exactly the same thing, but as a shorthand I can just refer to it as df.Age. So it's going to give me back this array. And you can see the return is not a data frame; it is an array that we get back. So go down each of them and see if they're greater than 50. The dot greater-than, again, means it's going to go element-wise. So it goes down every one.
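The row-array indexing, the two "all rows" notations and the dot notation can be sketched as follows, again on a small made-up data frame with hypothetical column names, and assuming the DataFrames package is installed.

```julia
using DataFrames

df = DataFrame(
    ID = 1:5,
    Age = [34, 52, 61, 47, 58],
    Group = ["Placebo", "Placebo", "Placebo", "Active", "Active"],
)

# Only rows 1 and 3, only two columns: a 2-by-2 data frame.
rows_1_and_3 = df[[1, 3], [:Group, :Age]]

# Exclamation mark and colon both mean "all rows" here.
same_values = df[!, :Age] == df[:, :Age]   # true

# Dot notation is shorthand for the whole column, returned as a vector.
age_vector = df.Age

# The dotted comparison goes down the column element by element.
older_than_50 = df.Age .> 50               # [false, true, true, false, true]
```

One difference worth knowing: in current DataFrames versions `df[!, :Age]` refers to the stored column itself, while `df[:, :Age]` returns a copy, so the two are equal in value but not the same object.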
Is it more than 50? False. It's not more than 50, so this first row is not included. Next one, 35. It's not more than 50, so it's not included. And let's go down now, and this would be the first row that gets included, because this participant was older than 50. Then comma, all the columns. So that's very nice notation. And you see we're down to 18 participants now, and if we go down the Age column, they're all going to be older than 50. Simple as that. Now maybe I want to string more of these together. Now I only want participants who were older than 50 and who were in the placebo group. So how are we going to go about this? Again, it's my data frame, rows comma columns. So there's my comma, and I want all the columns there. But let's look at what rows I want. Well, I'm going to put these inside of parentheses, because I've got these two rules that I want. So df.Age dot greater-than, so element-wise, 50, then dot ampersand, and that's the symbol for and. Both of these have got to be true before we get a true back, so we just have normal logic here. So it is dot ampersand, and we want df.Group to be dot equals equals placebo. So it goes row by row, and all these things have to be true before we get that row included. Now we're down to 10, and they are all going to be over 50 and they're all going to be in the placebo group. You can see how easy it is to manipulate the data to get something very specific back. That means I can also create new data frames, sub data frames from the full set. So I'm going to call mine placebo and intervention, and I want to split the participants up into two separate data frames. Very easy to do. Call the data frame, address the rows by df.Group dot equals equals placebo, comma, all the columns.
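Filtering rows with one rule, with two rules joined by the element-wise "and", and splitting into sub data frames can be sketched like this. The data are made up, the labels "Placebo" and "Active" are assumed spellings, and the DataFrames package is assumed to be installed.

```julia
using DataFrames

df = DataFrame(
    ID = 1:5,
    Age = [34, 52, 61, 47, 58],
    Group = ["Placebo", "Placebo", "Placebo", "Active", "Active"],
)

# One rule: rows where the age is above 50, all columns.
over_50 = df[df.Age .> 50, :]

# Two rules: each goes inside parentheses and they are joined with .&
over_50_placebo = df[(df.Age .> 50) .& (df.Group .== "Placebo"), :]

# Sub data frames, one per group.
placebo = df[df.Group .== "Placebo", :]
intervention = df[df.Group .== "Active", :]
```

The parentheses around each rule matter, because `&` binds more tightly than the comparison operators.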
So this new data frame only has 23 participants in it, but they're all going to be placebo as far as the group is concerned. And I'm going to do the same and call that one intervention, and of course in the group it's only going to be the active participants, those taking the active intervention. I promised to show you how to do this difference between two columns with DataFrames code. So if I want to create a new column, I have to create it this way. There's df, square bracket notation, and then the column by symbol name. So if we scroll back up, when I created this data frame I did not use symbol notation here, but when I want to add a new column I use symbol notation. So create this new column, and that is going to be the element-wise difference between these two arrays. The first array is df dot BMI before, then dot minus to indicate that it's element-wise, then df dot BMI after. So that is going to allow me to have this new column in my data frame, and we see that I now have 16 columns. Now I want to show you a little bit about the query language. What I want to show you, actually, is that it exists, because it is phenomenal and it is vast, for when you start selecting and manipulating only a part of your data. So I'm going to show you just a couple of examples here, but there are many, many lectures' worth of things you can do. So I'm going to start with my data frame, and then I have this pipe operator. That will just be the vertical bar on your keyboard. On my keyboard that's above my enter, so I hit shift and the key above it; it will be different for your keyboard, I'm sure. And that says take the data frame and pipe it into what comes next. Here I'm using, I suppose, the generic form of the Query package, so I'm going to start a query, and we're using a macro here.
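Before the query example, the column creation and the pipe operator just described can be sketched in a few lines. The column names and values are hypothetical, and the DataFrames package is assumed to be installed. The pipe shown here is the plain Julia `|>` operator applied to an ordinary function, not the Query macros from the video.

```julia
using DataFrames

# Made-up BMI values for three participants.
df = DataFrame(
    BMIBefore = [27.5, 30.0, 25.25],
    BMIAfter = [27.0, 29.5, 25.5],
)

# A new column is added with square brackets and symbol notation;
# the dotted minus makes the subtraction element-wise.
df[!, :BMIDelta] = df.BMIBefore .- df.BMIAfter

dims = size(df)                      # (3, 3): one more column than before

# The pipe operator passes the value on its left to the function on its right.
total_change = df.BMIDelta |> sum    # same as sum(df.BMIDelta)
```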
So Julia makes use of macros like this. What a macro does is actually generate code. So instead of us writing out the long code, we can create macros that generate code, and then that code gets executed. So it's a very nice way to write this succinctly: a macro is sort of a function that'll generate code and do something for us. But this @query macro is built into the Query package. So I'm going to say what query I want to have, and I'm going to create this variable called i. Now that's another thing we haven't spoken about, and that's local variables and global variables. Let's do that here. I've created a variable name and I've assigned some object to it. This placebo exists in the global space; in other words, I can make use of it anytime, and its value exists everywhere. Inside of a for loop, as we did before, we created some computer variables, but those only have a local scope. Outside of that for loop they don't exist and I can't refer to them again. They're not permanent; they're not in the global scope. So here we have query, i, and then begin, and to every begin there's an end here. Then I'm going to use other macros: @where i dot Age is greater than 50. So what's happening here? This pipe operator piped df into the query, and it was put into the query right there with the i. So I'm piping df into i, and when I say i dot Age, I'm actually saying df.Age; it's just being piped into this i here. So where the age is greater than 50, then @select, and I'm passing these inside of a set of curly braces: i dot HDL cholesterol before and i dot HDL cholesterol after. So it's going to select only those two columns, for only participants older than 50. Then end for my begin, close my query parentheses there, and pipe that into something. I want that to be piped into a data frame object, but because we only used import DataFrames, I've got to say
DataFrames.DataFrame. So if I run all of that, lo and behold, I get this back: only those two columns that I asked for, so HDL cholesterol before and HDL cholesterol after, and it's only going to be for people who are older than 50. So this is just a quick look at what the query language can do for you. Now, I could have done that much more simply before, but this complexity allows me to become very specific, and there are more macros than just these. Before I get to the next macro, I just want to show you this. Many times when we collect data we try to protect patient confidentiality. In a very simple example, what we might do as researchers beforehand is subtract, in our heads, the value two from everyone's age. So if someone was really 50, I'm going to capture 48 as the age. No one outside of the study knows that, so if someone got hold of that data, it's a little bit more difficult to bring that back to an actual human being. That's a very simplified example. It won't really work in the real world; you'll have to be much more inventive than that, and there are ways to do this. But if I want to change something permanently, so df.Age, I'm now saying assign to df.Age the value of df.Age dot plus two. So that's going to add two to everyone's age. And if you capture data in that way, where you changed it before you analyze the data, of course you want to change that back, so that would just be the way to do that. Here we are back with the Query package. So I'm going to pipe the data frame into this @filter macro. And we've done this one before: I only want the ages back above 50 and where they're in the placebo group. We saw that before, but this is the way to do it with a pipe. Pipe df into this, and instead of i we have this underscore here, so df is going to go in place of those underscores. But I'm using @filter, so I want ages greater than 50 and group equal to placebo. Note the differences, though. There's no at sign before that, and I'm using the double
ampersand for and, so you've got to read up on the notation of the Query package. And I'm piping that into a data frame, so what we're going to get back is this data frame of all the columns, but we're only going to have participants older than 50 again, and only in the placebo group. One more way to use Query that I just wanted to show you. I'm going to pipe df into the @groupby, so I want to group by df.Group, because this df was piped into this placeholder, and then the @map macro. I want two column headers: my first one is going to be called key and my second one is going to be called count. For the key I want the key function inside of Query, and you'll see in a little while what that means, and then in the count column I want length. length is a Julia function, and that is just going to count how many things there are. So let's have a look at what happened here. I've created a data frame, because I piped this all into a data frame, and I have two columns. One is called key, that is my selection there, and one is called count. Now, this key function, what did it do to this placeholder? Because this was all piped, let's just look at this piping. df was piped into the group by, so this now becomes df.Group. That got piped into all of this, so this is now df.Group, and so it's key of df.Group. And that's what piping is. We've got beautiful versions of that in R, the programming language for statistical analysis. Same sort of thing: you build this pipeline of execution. So this df was piped into this placeholder, and now the whole lot was piped into this placeholder, so this becomes df.Group. And the key of that is going to return for you the sample space elements, that is, all the unique elements that were found in that column. So in the df.Group column there were only two sample space elements, placebo and active, and that's what it's going to give me back. But in the second one I want to count how many times each occurred, so this
is a way to count the occurrences of the sample space elements of a variable. And there we can see we had 23 with placebo and 23 with active, exactly as we designed it. Great. Let's do some summary statistics. The describe function is very, very useful. It comes from the StatsBase package, so if you just say describe, Julia would know where it comes from, but because I said import StatsBase, I've got to say StatsBase.describe. And what I want to describe is, from the df data frame, all the rows in the Age column. Remember, there are different ways to write this; you could have just said df.Age. Anyway, let's execute this, and the result there is just a tick mark, because it's executed in the REPL. So it says there were 46 values, there were no missing values, the mean was 46.93, and the minimum, the first quartile, the median, the third quartile and the maximum value are all there, and they were all 64-bit integers. So nice descriptive statistics here of that column. There's also the summarystats function. That's going to do exactly the same, but it's going to return it for me here, so I can pull down and see those results here. And then the Statistics package; that is one of the built-in Julia packages, so you don't have to install it. Remember how to install: go here, type a closing square bracket, then add, space, and what you want to add, and then import it with import or using. So Statistics.median, and this is going to give me back the same value as we had before, the 48. There's the median there; it was 48 indeed. No problem. One thing you won't see here is the standard deviation, but there's a Statistics.std that gives you back the sample standard deviation, and if you read the documentation there, you can also ask for the population standard deviation. There's also a mean and standard deviation function, mean_and_std. That's a function inside of StatsBase, and that's going to give me back a tuple. A tuple is different from an array: instead of square
brackets we see parentheses, and by the way, tuples are immutable, so you can't change their values as you can with arrays, but that's a story for a different day. We get back the mean and the standard deviation. Variance, of course, is just Statistics.var. Again, there is another argument that you can pass that gives you back the population variance; this would be the sample variance. There's a Statistics.quantile function, and you can ask for the percentiles you actually want. I want the 25th and the 75th, which gives me the first and the third quartile values, and you can see for age those are the ones we saw there, 40 and 55. In StatsBase, not inside of Statistics but inside of StatsBase, there's an iqr function that is going to give us the interquartile range. But I can also do that with code. Let me just go to the front of this. I can say Statistics.quantile of the 75th minus Statistics.quantile of the 25th, and if I execute that, I'm also going to get back 15, because that's what the interquartile range is: the difference between the third and the first quartile. Then span. StatsBase.span is going to give you the full range, so the minimum and the maximum value. And remember, initially we chose ages from 30 to 65, but we added two to each one of the participants' ages, so now we see the youngest was 32 and the oldest was 67. Now this is a bit of a convoluted one that I want to show you here with the describe function. What if you don't want everything back? So I'm saying here df, all the rows, comma, only these columns. But this describe function can also take as arguments some things that we want to create. What do we want to create? Well, that comes after the next comma. Create a symbol for me, that's a column header with the name ave, and attach to that, and we have to use this arrow notation, so that's equals greater-than, StatsBase.mean. Then another symbol, std, and attach to that StatsBase.std. So that becomes quite convoluted, and it takes a while for you to get behind exactly how to do
that and remember how to do that. But that gave me the age and the cholesterol delta, and it gave me the average of each of those and the standard deviation of each of those. So that would be another way to go about this. And what you can start to see developing here is the fact that there are just so many ways to do things in Julia. There's the unique function in Julia, and I'm telling it: go down the Group column and, of all the values there, just give me back what is unique. Well, we know it's going to be placebo and active, so we're going to get back just the sample space elements. I could do it with the query language also. I could say pipe df into the @groupby, so this is going to become df.Group, pipe that into this @map macro, where all I want is a key, and pipe that into a data frame. And now I'm going to get exactly what I got back here with unique, except that instead of getting back just an array, I actually pipe this into a data frame, so I'm going to get my sample space elements there. There's a StatsBase.countmap function, so not only is it going to give me back the unique values, but it's also going to count how many there are. And again I see active and placebo. But I can also do this just with the query language. So I'm piping df into group by, so df.Group, and that gets piped into the map. The key, so the key of df.Group, gives the sample space elements, and the length is going to count all of them. And you know exactly what we're going to get here. This is what we had before: we're going to get placebo and active and their counts. So, so many ways. Whatever pleases you, you choose that way. One more thing: let's do the median, and we do it only for participants older than 50 in the placebo group. And then I can use another set of indices here, and that is to give me back only the median of the ages. So what we've done here is the median of the age of participants in the placebo group who were older than 50. So this is a different notation, to
do this first and then ask for what you actually want. So hang on to that notation. Sometimes it is easier to use, and sometimes it's much easier just to use some of these macros inside of the query language. We can also construct our own functions, and that's one of the beautiful things of Julia: construct your own function. So you see the function keyword there, to tell Julia I want to create a function. Then the name of my function, I decided on that, and it's going to take a single positional argument, x. And it's going to return for me the following: Statistics.mean of whatever I pass in, comma, Statistics.std of whatever I pass in, and then end, as we always have to have an end. And now I have a new function, mean and standard deviation. As you can see here, it is a new function. And now I'm going to pass an array into that function, so that takes the place of all the x's, and I'm getting back the mean and the standard deviation of that. So you can create your own functions too. Let's do this last one before we visualize data. I'm going to group by group, I'm going to map the key, and then create two column headers, average age and standard deviation age, and I'm going to perform those two on it. You can see what is going to be in that placeholder, because df is going to be piped into there, then that makes a df.Group that gets piped into there, which is also piped into there and there. I think by now you get the meaning of how to write these pipelines. I'm going to get placebo and active back, and that gives me the average age and the standard deviation of the age for those two groups. Very easy to start learning how to put these together. Now let's get to one of the exciting parts. We've summarized our data, and we've really summarized it a lot. We've created a lot of other variables, so please play with them and see what results you can get. The only way to learn is to start playing with these. So, Gadfly. There are other plotting libraries as well; I'm going to show you Gadfly
here. I like the way the plots look, and it's very simple, specifically if you know how to use ggplot. So Gadfly.set_default_plot_size: I'm just going to set it, for argument's sake, so it fits in here, to 800 pixels by 600 pixels. So here's a very simple plot. Remember I said using Gadfly, so I needn't have said Gadfly dot set or Gadfly.plot; I can just use the plot function directly. Now, why I like Gadfly: it works very well with data frames. So my first argument is this data frame; that's the data frame we're passing. On the x-axis I want the group variable, on the y-axis I want the age. And what geometry do I want? Well, Geom.boxplot, so I actually want Julia here to create a box plot for me. There are a couple of guides. One of them is title, so Guide.title, and you give it a title as a string. And then you can also do a theme. My theme here is going to have a default color of midnight blue. There's a bunch of these already built in; you can Google that and see all the theme colors that are built in. And I also want to make it 100 pixels in between my box plots. As simple as that. I'm going to hit shift and enter, or shift and return, and I'm going to go make a cup of coffee, because the first time you create a plot there's a lot of stuff that has to happen behind the scenes. It has been recognized that this takes a bit too long, and there's certainly a lot of work going on behind the scenes to try to improve this time to first plot in Julia. It is a known problem; you do wait quite a long time before that first plot is created. So I'm going to go off and have some coffee. I don't know what you're going to have. I'll see you back in a minute or two. And there we go. It really wasn't a minute; in my case it was about 10 seconds, slightly less than 10 seconds, so it's not that bad. Anyway, that is from the time I stopped talking, by the way. And there we go, I see a beautiful box plot. I see my age on my y-axis, I see the two groups, it found placebo and active all on its own, and we
see the title that we created up there, and we see these beautifully rendered midnight blue box plots. Very nice. You can assign that to a computer variable, because there are packages, and there is the draw function, which can save this plot that you've assigned to p to your hard drive as a PNG file or an SVG file. Now you might have to import some other libraries. I think with PNG you'll have to import the Cairo package, C-a-i-r-o with an uppercase C, to export to PNG. Otherwise you can just export to SVG; it's capable of vector graphics, and those are very nice to use in other programs like Inkscape or Adobe Illustrator, etc., to add a lot more to your plots. So here I'm going to use plot again and pass the df. I'm going to have on my x-axis the change in cholesterol, and color by the group. Now this color has nothing to do with color as such. The color argument here means split by, so group by the Group column, whatever sample space elements you find there. Give me a Geom.density, so this is going to be a density plot of the change in cholesterol before and after the intervention, split by the group variable, and then I've just added a title. So let's have a look at that. And there we see a beautiful density estimate for the placebo and active groups, as far as the change in their cholesterol values is concerned. So we can look at the two distributions for our data values. Beautiful density plots there. You can also just ask for plots that are models. Look at this: data frame, x-axis is BMI before, y-axis is HDL cholesterol after. So this is going to be a scatter plot, and indeed I'm calling Geom.point, which gives me point markers. So this is a scatter plot of two continuous random variables, BMI before and HDL cholesterol after. I'm trying to predict HDL cholesterol after, given an input of BMI before, and I want that done separately for each of the sample space elements in my Group column. So color equals the symbol Group, Geom.point, and then in the layer argument I'm going to call the Stat dot
smooth function inside of Gadfly. I want it to be a simple linear model, and I want a 95% confidence interval around my model. That's going to be a Geom.line, and the ribbon is the confidence interval around that. Then outside of that I've got a title and a theme, and in my theme I'm passing a bit of transparency and I'm making the point size, the markers of the scatter plot, quite big, with 10 pixels. Let's have a look at the output of this beautiful plot. It's going to take a second or two again, because it's now also creating this linear model behind the scenes. So I'm not even using a package that does linear modeling. They exist too, GLM being a beautiful example of that. But just inside of this plotting package, Gadfly, I can do this, and then I can see my two models, for both the placebo and the active group. I see my markers with a bit of transparency, and I see my two linear models created with a confidence interval around them. That is absolutely fantastic, and those plots are really beautiful. Let's do some inferential statistics now. With inferential statistics we always start off by describing our data. Now we've already done that, and the one that I'm going to concentrate on here is the difference in HDL cholesterol, that mean difference before and after, but between the two groups. So we've just got the subtraction, the before minus the after, which gives me one variable, and I've got that same variable split along one of my categorical variables, which is the group. So I'm just asking a question: is there a statistically significant difference in the change in cholesterol? My null hypothesis is that there's no difference between those two, and my alternative hypothesis is that there is a difference between the two, so one can be higher than the other. And I'm using an alpha value of 0.05. So there are my hypotheses. Let's just describe the placebo group. Remember we created these two sub data frames. Let's just go back to the REPL
to see the results. I'm describing HDL cholesterol for my placebo data frame. So that's another way to go about it: you needn't use a query to select these, just create two sub data frames, one for each of your groups, and that's what we did in the beginning. So I have my placebo and my intervention group, and there we can see summary statistics. The mean of the difference was negative 0.1 for the placebo group and negative 0.22 for the intervention group, so there was a bigger decrease in the HDL cholesterol in the time period before and after the intervention. We can also just ask for the following: I'm asking for confidence intervals. That comes from the HypothesisTests package, where there's a confint function, and what I'm going to do is call the OneSampleTTest function, also from HypothesisTests, on each of my two data sets, the HDL cholesterol delta for each of my two groups. So I've got the mean and I can work out the standard deviation, but here we have the 95% confidence interval around the means for both of these. So a very simple first use of HypothesisTests, just for confidence intervals. I've already plotted these two distributions for you, so let's just have a look at our assumptions for the use of parametric tests. In HypothesisTests there is a pvalue function. It does not have a Shapiro-Wilk test, but it has Kolmogorov-Smirnov and I think a few others, so I'm just going to use a KS test here. So ExactOneSampleKSTest, and I'm passing my cholesterol delta for the placebo group, against a normal distribution. So that's a KS test against the normal distribution. Let's do that. It's going to give me back a p-value. The null hypothesis, remember, is that it is from a population in which this variable is normally distributed. Let's do the same for my second group, and there we have a problem: we see a p-value of less than our chosen alpha value of 0.05, so we're not really meeting the assumptions for the use of a parametric
test. I can investigate that visually with a QQ plot, so let's just have a look at this. I'm creating two plots, p1 and p2, and each of these is going to be a QQ plot. On the x-axis I have the values, and on the y-axis we have the theoretical distribution, so I'm calling Distributions.Normal, and it's all going to be changed by this Stat.qq, so it knows what to do with the values for a QQ plot. Then Geom.point, so it's a scatter plot, and then a semicolon because I don't want any output to the screen. I'm going to do the same for the intervention group, and then we're going to call Gadfly.vstack. I could also just say vstack, because we said using Gadfly. So vstack of plot 1 and plot 2, so make a vertical stack of those two plots. And then in the plots I can see a vertical stack of these two QQ plots, and you can see here that the second one is really off of a straight line for normal, so we're not really meeting the assumptions for the use of parametric tests. And although this data was taken from the normal distribution when we created it, we only took 23 points from each, so there's always the chance that we get random values that are not going to show up as being from a normal distribution. So in that case, for two groups, a non-parametric test will be the Mann-Whitney U test, and of course there is a MannWhitneyUTest there. I'm just passing my two data sets, my placebo change in cholesterol and my intervention change in cholesterol, using the dot notation, so it's going to give me two arrays. That's what we like with a Mann-Whitney U test, and I want a p-value back from there. So let's have a look at the Mann-Whitney U test. It gives me a p-value of 0.7 there, so that is above my alpha value of 0.05, so we can't reject the null hypothesis there. Just to show you that we also have an equal variance t-test, so if we did meet the assumptions for the use of a parametric test, there we see EqualVarianceTTest is another function, and we can
run that, but we're going to see a p-value that is also above 0.05, so no problems there. Last thing on today's list: I'm going to show you just how to do a chi-square test for independence. There is no such function in HypothesisTests or any of the packages as yet, so we're going to do this by hand. I hope you know the equation for working out a chi-square value. We're just going to take the squared differences between the observed and the expected contingency tables, divide each of those by the expected value, and sum over them, and that gives us our chi-square value. And then we're just going to use a chi-square distribution with the right degrees of freedom. So let's create a frequency table, a contingency table of observed values. What we're going to use is the FreqTables package, the freqtable function there. What I pass is the data frame, and I want group against gender. Those are two categorical variables. Remember, we changed them to categorical variables. But what I want back is not a data frame, I want it to be converted to a straight-up array, so I'm going to use the convert function to change this frequency table to an array, and I'm going to store that in gg_obs. That's my contingency table of observed values, and what we get back is this two by two array of my observed values, and that's what you want for your chi-square test for independence. You can see it's 64-bit integers, but I have two dimensions now. In other words, there are rows and columns. This is a rank two tensor, or a matrix, and that's exactly what we want. Now I want to know from this the row totals and the column totals. So this first column will have a total at the bottom here, 16 plus 14 is 30, and 9 and 7 is going to give me 16, but then I want across the two columns for each row as well. So I want both the row totals and the column totals. I'm going to use the sum function. I pass my matrix, but I say along dimension one.
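As a rough illustration of the sum call just described, here is a minimal sketch. It uses only the four counts read out in the video, typed in by hand rather than produced by freqtable, and the variable names other than gg_obs are made up for this example. Which group or gender each row and column stands for is not stated, so no labels are assumed.

```julia
# Observed group-by-gender counts as read out in the video.
# The row/column labels are not given there, so none are assumed here.
gg_obs = [16 7;
          14 9]

# dims=1 collapses the rows, leaving one total per column (30 and 16).
col_totals = sum(gg_obs, dims=1)

# dims=2 collapses the columns, leaving one total per row (23 and 23).
row_totals = sum(gg_obs, dims=2)

# Without dims, sum adds every element: the 46 participants.
grand_total = sum(gg_obs)

println(col_totals)
println(row_totals)
println(grand_total)
```

Note that both results are still two-dimensional arrays (one row, or one column), which is why the dimension keyword matters.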
So let's see what that gives us back. That's going to give us back the 30 and the 16, so that's adding 16 and 14 and adding seven and nine. But if I say dims equals two, it's going to give me the totals across, 23 and 23. So 16 and 7 is 23, and 14 and 9 is 23. So I have the row totals and the column totals. I need those, and I need the sum total as well, and remember there were 46 participants, so no problem there. Now, size is another function we haven't seen before. What is the size of this array? That's going to give me back a tuple of how many rows, comma, how many columns, and if I had a higher rank tensor I would have more elements there, but it's a two by two array that we're passing to that. So now I'm just going to instantiate an empty array of similar shape, so a two by two array, and it's just going to have zero values in it. I want the number of rows from gg_dim, its first value, two, and the number of columns from gg_dim, its second value. If I had more than two sample space elements in each of those, well, group and gender were just male and female and active and placebo, so there was always going to be a two by two contingency table. But if I had more, and these were different, three comma four for instance, I would still just reference these two. Now, I could have just put in by hand two comma two, but I want to show you where that comes from. This gives me this array of all zeros. I'm just instantiating that because I want to overwrite each of these values, and I'm going to overwrite them with a double for loop. I'm saying for i equals one to gg_dim one, so that means two, so for i equals one to two, and then for j equals one to two as well, because I want to iterate through all four of these values: row one column one, row one column two, row two column one, row two column two. I want to overwrite all of those. And remember, how do I get that first one? Well, that is this row total multiplied by this column total here, divided by the sum total.
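The steps just described (take the size, allocate zeros, fill each cell with row total times column total over the grand total) can be sketched as below. This is an illustrative reconstruction, not the exact code from the video: the observed counts are typed in by hand, and gg_expected, row_totals, col_totals and total are names assumed for this example.

```julia
# Observed counts as quoted in the video (entered by hand for this sketch).
gg_obs = [16 7; 14 9]

# Marginal totals, flattened to plain vectors for easy indexing.
row_totals = vec(sum(gg_obs, dims=2))   # one total per row
col_totals = vec(sum(gg_obs, dims=1))   # one total per column
total = sum(gg_obs)                     # all participants

# size returns a (rows, columns) tuple; here (2, 2).
gg_dim = size(gg_obs)

# Array of zeros with the same shape, to be overwritten cell by cell.
# The name gg_expected is an assumption made for this sketch.
gg_expected = zeros(gg_dim[1], gg_dim[2])

# Expected count under independence:
# (row total * column total) / grand total.
for i in 1:gg_dim[1]
    for j in 1:gg_dim[2]
        gg_expected[i, j] = row_totals[i] * col_totals[j] / total
    end
end

println(gg_expected)
```

Because the loop bounds come from gg_dim rather than a hard-coded two, the same sketch would work unchanged for a three by four table.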
That's what we're doing there. So I'm overwriting. At the moment i is one and j is one, so I'm overwriting the gg expected array, which is just these four zeros. So now I'm at position one. What do I do? Well, I take column total one times row total one, divided by the total, 46, and that gives me that first value. Now I'm going to iterate over the inner for loop, so j becomes two, so that's the second column. We're now at that one there. I'm still in the first row, so now we're looking at this value up here, divided by the total, and that's going to give me that value. Now we're through this j loop, so we jump out to the outer for loop, and i goes from one to two. Now it's two, and we go again: it's back to column one, row two, and then column two, row two. So with a double for loop I'm going to iterate through all those values, and if we look at the table now we see our expected table. That's what we would expect given our values, versus our observed values, and we want to know, is there a difference? How do we do that? Remember, it is observed minus expected, but I do that element-wise, so it's dot minus. Then I square all of those differences individually, so dot and to the power two, and then each of those I divide by the expected value, and in the end I sum over all of those, and that's how I get chi-square. And look at the beauty of Julia: you see the chi symbol there. Well, we can actually use Unicode, so I can say backslash alpha and hit tab, and that gives me an alpha symbol, or backslash beta, so it is like LaTeX, and I hit tab and I get the beta symbol, and I can assign to that. That's a variable name. So χ2 there, that's a variable name. Look at that: backslash colon smile. So you can put in all of these little icons, which you can also use for variable names. It's just a little whimsy that exists there inside of Julia. And anyway, that's my chi-squared value, and there we go, it's 0.38. Is that significant? So is there dependence between those
two? Well, I'm going to use, from the Distributions package this time, the pdf, the probability density function. I'm going to use a chi-squared distribution with a degrees of freedom of one. Remember, that is the number of rows minus one times the number of columns minus one, so two minus one is one, times one is one, so it's a single degree of freedom. I pass the chi-square value to that, and that's going to give me a p-value, and lo and behold, it is more than an alpha value of 0.05. So there's no dependence: which of the two groups people landed up in was totally independent of their gender. So we have a chi-square test there, done by hand. Very easy, very quick, very simple to do. Are for loops slow? No, they're not slow, so you needn't vectorize your code to get speed in Julia, because remember, this is going to be compiled before it gets executed, and when a for loop is compiled it's very quick. So no problems there whatsoever. So that was a brief introduction to Julia by way of showcasing some simple medical statistics. This has been an update of the video that I uploaded in 2015, which still used Julia 0.4, and there were certainly lots of breaking changes when we got to Julia 1. Julia is now mature, we have version 1.4, and it really just is a pleasure to use right inside here of Atom. Now, you needn't use it inside of Atom. You can also use it inside of Visual Studio Code, which is becoming more prevalent. And of course IJulia comes with Julia Computing, but you can also install IJulia. Let me just show you here. I would go on this side and say add IJulia, execute that, and then we could just say using IJulia, and once we say that we can just call notebook as a function, notebook, open and close parentheses, and that's going to open a Jupyter notebook for us. So IJulia is going to install all its dependencies. Well, I should actually just put that in uppercase, that's the correct one: IJulia, the I and the J uppercase. And you can
use Julia notebooks. So you can code right inside of Atom, you can code inside of Visual Studio Code, or you can code right inside of Jupyter notebooks, whatever your preference is. So Atom comes with Jupyter Computing, I should say Julia Computing. When you install Julia Computing, you're going to get this as your standard, your default IDE, and it really is a lovely IDE to code in. I hope you enjoyed that video. Like, subscribe and comment. If something wasn't clear, let me know, and I can perhaps spend some more time and explain that. Otherwise, spread the word. It is a really easy language to use, beautiful, with a lot of speed. Of course, with the size of datasets that we commonly work with in inferential statistics when it comes to medicine, we don't really need that speed, but it's just such a lovely language to use and learn that I don't see any reason why you shouldn't spend a couple of days and weeks to get yourself familiar with this lovely language.
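As a recap of the hand-coded chi-square test from this video, here is a minimal sketch of the final two steps, the statistic and its p-value. It assumes the Distributions package is installed, types the observed counts in by hand, and uses names (gg_expected, dof, p) chosen for this example. Two things differ from what is said in the video and are flagged as such: the expected table is built with a one-line outer product instead of the double for loop, and the p-value is taken from ccdf, the upper-tail probability, rather than pdf. The pdf call returns the height of the density curve at the statistic, which is not a tail probability, although for this particular statistic and one degree of freedom the two numbers turn out to be similar and both sit well above 0.05.

```julia
using Distributions  # assumed installed; provides Chisq and ccdf

# Observed counts as quoted in the video (entered by hand for this sketch).
gg_obs = [16 7; 14 9]

# Expected counts under independence. This outer product of row totals
# (2×1) and column totals (1×2), divided by the grand total, gives the
# same table that the double for loop in the video fills in.
gg_expected = sum(gg_obs, dims=2) * sum(gg_obs, dims=1) / sum(gg_obs)

# Element-wise: subtract, square, divide by expected, then sum everything.
χ2 = sum((gg_obs .- gg_expected) .^ 2 ./ gg_expected)

# Degrees of freedom: (rows - 1) * (columns - 1).
dof = (size(gg_obs, 1) - 1) * (size(gg_obs, 2) - 1)

# Upper-tail probability of a chi-square distribution at the statistic.
p = ccdf(Chisq(dof), χ2)

println("chi-square = ", round(χ2, digits=2), ", p = ", round(p, digits=2))
```

With a p-value this far above 0.05 the conclusion matches the one reached in the video: no evidence that group and gender are dependent.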