Well, it's a lovely afternoon here in Cape Town. It's lunchtime, so I thought I'd step outside quickly and make this recording. Last night I did a screencast on Hotelling's T-squared test using R. That is a generalisation of the normal t-test. A normal t-test is a univariate test: it's a single variable that we're comparing between two samples. With Hotelling's T-squared, we're comparing multiple variables for two samples, something I think we should do a lot more. Human beings, biological systems, are complex systems. We're not just a single variable. We should probably make use of these tests much more, and we should see more of them in the literature. So I'm going to show you one article, but then more specifically how to do it using R.
So here we are in RStudio. That is our trusty development environment to write our code in R. I'm going to do this as an R Markdown file. If we go to RPubs, we have already published this, so let's have a look at the document when it's done. Here is my RPubs public page, and you'll find most of the documents that I create for these videos here. The majority of them will be on deep neural networks, but I also do this regular type of statistics, if I can use that term, and I post those documents here as well. If we go look at it, there is the document, the R Markdown document rendered as a nice HTML page. Remember, the code will also be available on GitHub, and the links for all of this will be down below. So there is the document that we are going to look at. We see some nice plots there, et cetera. You'll see some equations there. Of course, you can just ignore those equations. If you want to know more, they are definitely there, but we're not going to concentrate on them too much if you're not interested in them. So let's go back to RStudio and put this document together. At the top, there's the usual YAML.
So we are going to talk about multivariate analysis of the means for two groups. Now, we know comparing the means of two groups very well, with Student's t-test or the other varieties of t-test. We take a single continuous numerical variable. We take data point values from two samples for that single variable, and we compare the means between those two samples. Here, though, we're talking about something else. We are generalizing that very specific case of a single variable. I can compare the cholesterol in one sample of patients to the cholesterol in another sample of patients from the two populations. But I needn't stick to just that very specific case of a single variable, cholesterol. A human being is much more complex than that. Or, for that matter, any organism: a lot of bioscience is all about complex systems. And you certainly are not just a single variable. You consist of many variables.
And it is that comparison of all of the means of, say for instance, total cholesterol, some other forms of lipid, blood pressure and serum glucose. So you measure all of those combined against each other. It's not one on one. It's not just the cholesterol against the cholesterol between the two groups, and not just the systolic blood pressure against the systolic blood pressure, but that patient as a whole, with all of those variables. We're going to compare two samples across all of those variables, and that is multivariate analysis. So our dependent variable, our outcome, has many variables in it, as opposed to just one when we talk about Student's t-test.
So, with the YAML, back to my R Markdown: just the author there. The output that I like is HTML, of course, with the table of contents set to true, so that when we see the R Markdown file on RPubs there's a nice table of contents, and I don't like numbering of the sections. Then this is the normal setup chunk. Remember, this comes standard when you open an R Markdown file.
I do set the working directory with setwd(getwd()). So it's going to find where on my file system, on my internal drive, this notebook, this Markdown file, is saved, and it's going to set that as the working directory. I've got all my other files that I want to use there as well. For instance, we're going to import a CSV file that exists in the same folder, so I can just set the working directory to that single directory.
Here are all the libraries that we're going to use: tibble, readr, plotly, DT, mvnormtest, dplyr, ICSNP and Hotelling. And it's the Hotelling's T-squared test that we're after. If you haven't yet installed those libraries, remember to use the Packages pane here: click Install, type in the name and install them. I like to colorize my headings, heading one, two and three in my HTML files, with that deep navy blue and a bit of a gold color. That's fine. And that is how we import a logo. Remember, it's the exclamation mark, then a set of square brackets, and then, in parentheses, just the name of the file, without quotation marks, ending in .png. That asset also lives in the same folder, so I don't have to refer to its whole address on my internal drive, because I'm setting the working directory to the directory that this Markdown file lives in. Everything is in that same directory, no problem.
So let's get back to this issue of a more generalized form of a t-test. If we take just one variable, that is Student's t-test or the other variations of the t-test that we know: it's a single variable, and we compare its means between two samples. Here we are generalizing the situation and saying, well, we needn't stick to the specific case of a single variable; we can generalize it to more. So let's have a look at an example in the literature. Here we are in PubMed. Let's just increase the size here a bit so we can see properly.
And I'm searching in PMC, that's PubMed Central. If you search PMC as opposed to just normal PubMed, you're only going to get articles that are open access, that you don't have to pay a fortune to view. Paying for access, in my opinion and in the opinion of many others, is just the wrong way to go. Anyway, here's a nice little article in the Journal of Oral and Maxillofacial Surgery. We see PMC 2011, October the first, published in the journal itself in 2010, and it is comparing actual surgical outcomes and 3D surgical simulations. Of course that was 2010, so that software wasn't too advanced, and I'm sure we could do a lot better today, but that's not the point of this discussion of ours.
If we look down at the methods section, you see their Hotelling's t-test, which should actually be Hotelling's T-squared test. What they did, if we scroll down, is this: they operated on patients and had imaging, they also did the simulated virtual surgery and looked at the imaging after that. If we scroll down further, this is what we look at: the imaging after the surgery, with the different landmarks colored differently if there was a spatial difference between the two images. So they operated on a patient, operated on the virtual image, which was made from the actual patient, and then superimposed the two to see if there were spatial differences: how many millimeters were certain landmarks away from each other in the two images? If it's green, the difference is very low; into the blue and then on to the red side, it is more. Now we can take certain landmarks, so take the maxilla, say the left maxilla, and compare just that landmark between these two sets, the virtual image afterwards and the image from the real surgery afterwards, and see what the difference is.
And we can do a t-test there, if we meet the assumptions for the use of a t-test, that being a parametric test. We can look at those specific differences, and so we can go to each of these landmarks in turn. But we can also think: well, this is a complete and whole patient. The patient does not consist of just that one landmark, and doing a t-test individually for each of these kind of doesn't make sense. If you think about it, it doesn't really make sense that we see this abundance of univariate t-tests. Multivariate is probably what we should see more of, but we don't. Anyway, you can take all of these landmarks, so there'll be a column for each, for the two samples, and we can compare the means across all of them at once. One patient has many measurements, so we are comparing many means against many means, all at once.
So let's get back to RStudio, and if you want to read this, I explain a little bit about that. Now the assumptions. Hotelling's T-squared test being a parametric test, we are looking at a similar set of assumptions to those for the parametric t-tests. The main one is probably the assumption that the sample is taken from a population in which that variable is normally distributed, and we can use, for instance, the Shapiro-Wilk test, as we always do, to check that. You also get a multivariate version of the Shapiro-Wilk test, and we're definitely going to use that here to look at that assumption. Now, it's not as hard an assumption as you might imagine: the test is somewhat robust against the data not being from a population in which those variables are normally distributed. So it's not the end of the world if the data aren't perfectly normally distributed.
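The normality assumption can be checked one variable at a time before moving on to the multivariate test. What follows is a minimal sketch in base R, not the code from the video: the data are simulated, and the column names A, B, C and group are assumptions made up for illustration. It runs the ordinary Shapiro-Wilk test on each outcome variable.

```r
# Simulated stand-in data: two groups, three numerical outcome variables.
# Column names, means and group sizes are assumptions for illustration only.
set.seed(42)
n <- 30
dat <- data.frame(
  A = rnorm(2 * n, mean = 10, sd = 2),
  B = rnorm(2 * n, mean = 50, sd = 5),
  C = rnorm(2 * n, mean = 5,  sd = 1),
  group = rep(c(1, 2), each = n)
)

# Univariate Shapiro-Wilk test for each outcome column (stats package, base R).
outcomes <- c("A", "B", "C")
p_values <- sapply(dat[outcomes], function(x) shapiro.test(x)$p.value)

# One p-value per variable; small values suggest a departure from normality.
print(round(p_values, 3))
```

A small p-value for a column would suggest that variable departs from normality. The multivariate version mentioned above tests all of the columns jointly rather than one by one.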
Now, that just reminds me: while I'm doing the screencast, we haven't really spoken about the term multivariable versus multivariate. When we use multivariate, we're talking about the outcome, or dependent variable, side: we are measuring more than one outcome, more than one dependent variable. When we talk about multivariable, we mean the independent variables that explain something. So if we look at linear regression or logistic regression, we have a model, in logistic regression, of many variables that can have an impact on a single outcome. We would have odds ratios and p-values for each of those as they impact this single outcome. That is a multivariable model: many variables that impact a single outcome. When we talk about multivariate, it is the outcome that we are interested in, and it has more than one variable in it. So don't get confused between multivariable and multivariate. In that sense, Student's t-test is a univariate test: the outcome is just one variable, whose means we compare between two samples.
The second assumption is that the determinant of the variance-covariance matrix must be positive. Now, that is a bit of deeper mathematics. I'm not going to burden you with that.
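To see what this second assumption looks like in practice, here is a small sketch in base R with simulated data (the matrix X and its columns A, B, C and D are made up for illustration). It builds the variance-covariance matrix with cov, takes its determinant with det, and then shows how adding a column that is an exact linear combination of the others drives the determinant to essentially zero.

```r
# Simulated data: 40 subjects, three outcome variables (names are hypothetical).
set.seed(1)
X <- cbind(A = rnorm(40), B = rnorm(40), C = rnorm(40))

# Variance-covariance matrix: variances on the main diagonal,
# covariances mirrored symmetrically around it.
S <- cov(X)
print(round(S, 3))

# The assumption: the determinant should be positive.
d <- det(S)
print(d > 0)

# A redundant variable (D is exactly A + B) makes the matrix singular,
# so the determinant collapses to zero up to floating-point error.
X_bad <- cbind(X, D = X[, "A"] + X[, "B"])
d_bad <- det(cov(X_bad))
print(abs(d_bad) < 1e-8)
```

A determinant at or near zero is a sign that one outcome variable carries no information beyond the others, and the matrix can then not be reliably inverted for the test statistic.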
So what we see here is the variance-covariance matrix for a multivariate analysis where there are three outcome variables, and I'm going to call them A, B and C. These matrices are always square, which means that the number of rows and the number of columns are equal: we have three variables, and therefore three rows and three columns. A square matrix has a main diagonal, the one that runs from the top left to the bottom right. Along that main diagonal we have the variance of the first variable, the variance of the second variable and the variance of the third: just the normal variance, remember, the square of the standard deviation. Then, symmetrically around the diagonal, we find the covariances: A and B, A and C, and B and C. Because the covariance between A and B is the same as the covariance between B and A, I've just put AB in both positions. Strictly speaking it should be AB and BA, but that doesn't matter.
There's the equation for covariance, in case you've forgotten it. For each subject you take the value and subtract the mean of that variable, you do the same for the second variable, and you multiply those two differences. You sum it all up and divide by the total number of subjects minus one, and that's your covariance. That's what goes into all of these entries, and you can well imagine that if you're looking at the variance of just one variable, it is simply the difference squared.
So that's variance and covariance. We should also probably talk about the dependence between these variables, because that will impact the results. If some of the variables in your multivariate analysis are dependent on each other, of course that is going to have some effect on your results, and if they're totally uncorrelated, then each will really add unique variance to your results, which is what you actually want to see. Multicollinearity, for that matter, is about the independent variables
that explain variation amongst themselves, but we're really talking about something different here. Just read that; I've put it in the notes. There are some other assumptions, sphericity, compound symmetry, et cetera. You can read a little bit about that, and if you want to know more, obviously you can look it up.
So let's have a look at Hotelling's T-squared test. This is just a reminder of a normal t-test. By the way, I like to show these: there's my LaTeX there, so that is what you would have to type in to see this nice little mathematical equation on your screen. LaTeX is very easy to learn; you can have a look at that there. So remember t-squared, the square of the normal Student's t-statistic: that is just the squared difference between the two sample means for the same variable, divided by a term containing the pooled variance, and you see the pooled variance in the denominator.
Hotelling's T-squared is different. Here we don't just have a single mean minus the other mean; we have a vector of means. In our case there were A, B and C, so that would be three: mean number 1, mean number 2 and mean number 3. You put them in a little column, that's a column vector, and you make another column vector for the other sample with its three means, and that's what we're really interested in. So what we're doing here is subtracting these two vectors from each other, x sub 1 and x sub 2, and we transpose that. Transpose means that if I have a single column of three and I transpose it, I will now have a single row with three columns, so you just turn it on its side. That we multiply by the inverse of the pooled covariance matrix, and I show you there, if you're interested, how to calculate the pooled covariance matrix. And we multiply that, these are vector and matrix multiplications, again by the untransposed difference, the one vector minus the other vector. If you know something about the dimensionality of vectors and matrices, you know that we're going to end up with a scalar. A scalar
means a single value, not a vector of values. And there's our scaling factor, which has to do with the sample sizes, and then we get T-squared. We express this T-squared in the form of an F distribution, and an F distribution is always going to depend on degrees of freedom. So to get the F statistic, we see there that T-squared is multiplied by a factor involving d1 and d2. There's another way to write it, if we look at the total number of subjects, that is n, and k being the total number of variables, and it works out exactly the same. So d1 is just the number of dependent variables, and d2 is n1 plus n2, that's the two sample sizes, minus d1, the number of variables, minus 1. If you multiply T-squared by that factor you get F, and F follows a certain distribution dependent on these degrees of freedom, and therefore we can work out a p-value. Very simple.
So let's run a few things. I'm going to go all the way back up, because I haven't executed any of these code chunks. Let's execute this first code chunk and import all the libraries. You can see the green bar moving down there; that has executed beautifully. Let's move down. Nothing to execute here, nothing to execute here, and here we go: let's import the data. I'm going to use read_csv, which is from the readr library, r-e-a-d-r, and that creates a tibble as opposed to a more old-fashioned R data frame. The file is data.csv, and it lives in the exact same folder as this R Markdown file, so with my setwd and getwd functions there I don't have to worry. Then the datatable function here comes from the DT library, and that's just going to print a nice little HTML data table to the screen. Let's have a look. We execute that, it's been imported, and we have this nice data table. You can search, you can sort ascending and descending, show all the entries, move to the second page, et cetera. This is a nice way to view a table on a web page. Now let's just have a look at our data here. We
seem to have three variables. Don't worry about what the values are or what the actual variables are, but every subject has three variables, and we want to compare patients in group one and patients in group two on those three combined. So, for each patient, this is a value that lives in three-dimensional space, not on a single line. It's not a single variable. But let's look at them one by one.
I'm going to use plotly for my data visualization. I'm going to create a plot and call it f1, and that will use the plot_ly function. The data is the data that I'm passing to it, and the type is box, so I want a box plot. Plotly is fantastic because I can use this pipe operator here. So I pipe to that a box plot with the complaints variable on the y-axis and the two groups on the x-axis, and I give it a name, complaints. Then I pipe that into a layout, and the layout will have a title and an x-axis given as a list that contains only one keyword argument, the title. You can view my videos on plotly. For me it's the most fantastic library because it's very interactive. You've got all these buttons there: I can save the plot as a PNG so that I can use it in a report that I'm writing, you can zoom in, zoom out, do all sorts of things, view it online, change it online on Plotly's website. And it's very interactive if you hover over it; I really love that.
So we can see the sort of normal distribution for the complaints variable, and you can see there's a bit of difference between those two groups. Let's have a quick look at the second variable; we can see its distribution there. Let's have a quick look at the third variable, and there we go, we can see its distribution there. So let's do some of our tests now. The first one we're going to do is the multivariate version of the Shapiro-Wilk test, and that comes from the mvnormtest library. What it wants, though, is something very specific: it only
wants a data frame or a table of the actual numerical variables. Remember, we had a fourth variable that contained the categorical variable, the groups. You have to remove that. Unfortunately it only wants a sub data frame or sub table that contains only the variables that we're interested in. So I'm going to create a second variable called data.vars, and I'm going to pipe my data through the select function with the argument minus one_of, which just means: please remove one of the columns from this data table, and it's the group column that I want to remove. If we view these very quickly, there's data.vars, and it now has only those three numerical variables, whereas if we look at data, it has the group column with it. So I've just removed that, no problem. Let's close that up.
Now we can use the mvnormtest library, and I write mvnormtest::mshapiro.test just to show you which library is being used. Of course you needn't write that; you can just use mshapiro.test. Another little quirk: it wants the data in transposed form. If we look at it here, we have the variables as columns at the top and then the rows going down. Transposing is going to shove that on its side, so we're only going to have three rows, and all the values are going to run across. So you have to transpose the data before you pass it. Let's run that test. We see a W score of 0.9, and that gives us a p-value of 0.05. That's very close, but it's still not significant: it's not less than 0.05. So we can say that it passed this test, at least, and we can proceed under the assumption of normality.
The other assumption that we like to test is that the determinant of the variance-covariance matrix is positive. So cov, that's covariance, is going to give me the variance-covariance matrix, and det is just the determinant of that, and we want that to be a positive value. We get plus 0.7853; that's
definitely positive, so we've passed those assumptions. Let's go for the Hotelling's T-squared test then, as simple as that. Now I'm going to show you two versions of the Hotelling's T-squared test, and they come from different libraries: there's the ICSNP library and the Hotelling library. You can use either of those two libraries. They both have the Hotelling's T-squared test and they just use different functions for it. In the ICSNP library it's HotellingsT2, and it's hotelling.test in the Hotelling library.
What we want to pass is the two sets of data, the two matrices of values. So we're going to use dplyr's filter function. I'm going to filter the data according to the group column, where the group column equals equals 1. In other words, that's a Boolean question I'm asking: take only the rows that return a true value, so the ones with the group equal to 1. And I want all the rows, comma, only columns 1 to 3, because once again I just want the numerical columns in there. Then filter data where group equals equals 2, and again all the rows, comma, columns 1 to 3. If we execute that, we see Hotelling's two-sample T-squared test, and there's our hypothesised difference: we think that the difference between the means for each of the three variables should be zero. So we have this vector, written here as a row vector, zero comma zero comma zero, and that's what we expect the difference to be. And it's the three of them combined, not variable by individual variable. We see the value of df1 there and df2, that's degrees of freedom 1 and 2, and we see a p-value of 0.77. So there's no statistically significant difference between these two samples as far as this multivariate analysis is concerned. Then, from the Hotelling library, just using hotelling.test, we see the same value there, 0.771. So we fail to reject the null hypothesis and we state that there's no multivariate difference
between these two samples. So, in my opinion, this is a much more exciting way to do your analysis. Biological systems, human beings, are really not single-variable entities or systems, and we should probably make use of this type of analysis much more commonly. Of course, what we usually do is an analysis where we say we correct for all the other variables, so that our two samples are very similar as far as all the other variables are concerned, and we can drill down on a univariate comparison. Especially when we do randomized controlled trials, we just want to drill down on this univariate, single-variable analysis, and we correct for, or we take out, patients so that all the other variables are similar. But we can make an argument that it's probably a better way to really concentrate on all of the variables that make up the area of interest that we're studying, and you can well imagine that there are times when we can really make use of this.
So what we are doing here is comparing the vectors of means of two samples. Of course you can do that for more than two samples, and then we get to multivariate analysis of variance, and you also get multivariate analysis of covariance. Even when we talk about regression, you get multivariate forms of regression, where we try to predict a vector of outcome variables. So all of these tests really have multivariate versions, and I really want to make the case for the use of most of them. If I have some time, I might make videos on the multivariate analysis of variance and covariance, and on some of the multivariate regression as well. All of these things are so easy to implement in R.
So I hope you've learned something. Have a look at Hotelling's T-squared test, look in the literature and find some multivariate analyses. Just make sure that whoever published them didn't make a mistake by mixing up the terms multivariable and multivariate. I've seen that that
does happen. But just to keep things simple, try to stick to this: multivariable means your input variables, your independent variables in regression models, and multivariate is more about your outcome, or your comparisons, as we have here. So try to keep those two terms separate, just to keep things clear.
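For readers who want to see what a two-sample Hotelling's T-squared test computes, here is a minimal sketch in base R that follows the steps described earlier: a difference of mean vectors, a pooled covariance matrix, a sample-size scaling factor, and a conversion to an F statistic. It is not the code from the video. The data are simulated, the function name hotelling_t2 is made up for illustration, and the F conversion uses the standard textbook form with degrees of freedom d1 = p and d2 = n1 + n2 - p - 1.

```r
# Hypothetical helper: two-sample Hotelling's T-squared, written out by hand.
# X1 and X2 are numeric matrices: rows are subjects, columns are outcome variables.
hotelling_t2 <- function(X1, X2) {
  n1 <- nrow(X1); n2 <- nrow(X2); p <- ncol(X1)

  # Difference between the two vectors of means.
  mean_diff <- colMeans(X1) - colMeans(X2)

  # Pooled variance-covariance matrix.
  S_pooled <- ((n1 - 1) * cov(X1) + (n2 - 1) * cov(X2)) / (n1 + n2 - 2)

  # T^2 = scaling factor * diff' * S_pooled^{-1} * diff (a scalar).
  # solve(S, d) computes S^{-1} d without forming the inverse explicitly.
  t2 <- (n1 * n2 / (n1 + n2)) *
    drop(t(mean_diff) %*% solve(S_pooled, mean_diff))

  # Convert to an F statistic with d1 and d2 degrees of freedom.
  d1 <- p
  d2 <- n1 + n2 - p - 1
  f_stat <- d2 / (d1 * (n1 + n2 - 2)) * t2
  p_value <- pf(f_stat, d1, d2, lower.tail = FALSE)

  list(t2 = t2, f = f_stat, df1 = d1, df2 = d2, p.value = p_value)
}

# Simulated example: two groups, three outcome variables, no true difference.
set.seed(123)
X1 <- matrix(rnorm(25 * 3), nrow = 25, ncol = 3)
X2 <- matrix(rnorm(30 * 3), nrow = 30, ncol = 3)
res <- hotelling_t2(X1, X2)
print(res)

# Sanity check: with a single variable, T^2 reduces to the squared
# Student's t statistic from an equal-variance t-test.
u1 <- matrix(rnorm(20), ncol = 1)
u2 <- matrix(rnorm(22, mean = 0.5), ncol = 1)
uni <- hotelling_t2(u1, u2)
tt <- t.test(u1[, 1], u2[, 1], var.equal = TRUE)
print(all.equal(uni$t2, unname(tt$statistic)^2))
```

The last check is the point made at the start of the video: with one outcome variable the multivariate test collapses to the ordinary t-test, so Hotelling's T-squared really is a generalisation of it.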