So, because we have a little time left this morning, I will quickly finish on the permutation test and get into the canonical analysis portion. Okay, I think we have most of the people.

I think I may have generated some confusion by calculating these p-values. They are correct, by the way: here I am testing the alternative hypothesis that the correlation is positive, and we obtain something significant because our value is positive. But for the other one-tailed test, where the alternative hypothesis is that the correlation is negative, we are of course very far from that; this is why we obtain a p-value that is aeons from being significant, and that is correct. If the value here had been negative, then that test would have become significant, because the value would be over here, and we would compute the p-value in the lower tail. In all cases we take the real value and, for a one-tailed test in the upper tail, we sum everything to its right; for a one-tailed test in the lower tail, we sum everything to its left. So if the value was here, it would be significant; since it is here, it is clearly not significant.

So much for that. Then I said I would show you some simulation results on the influence of lack of normality on tests of hypotheses. This is for another test, the test of the difference between two groups; it comes from my undergraduate classroom notes, so it is in French. Here I generated two groups of data with, how many points, I don't remember, maybe 50 points. Oh yes, a variable number of points, that is the abscissa: two groups of 10 points, then 20, 30, 40, and 50 points. In this case the two groups come from the same statistical population, so there is no real difference between the means. The results in black are those of the permutation test, those in white are for the parametric test, and each of these points is the result of 10,000 simulations: I created new data 10,000 times, tested them either parametrically or by permutation, and looked, at the 0.05 significance level represented by the grey line, at the proportion of results that were significant by chance. An honest test should have a rejection rate of the null hypothesis very near the significance level, and that is what we see here, because these data were random normal deviates. What is the difference between this panel and that one? It must say at the top... oh yes, this one is for a one-tailed test, that one for a two-tailed test. So no problem there.

Now here I generated data, as many other authors have done, using cubed exponential deviates instead of normal deviates, a very strongly asymmetrical distribution, and I did the same thing, 10,000 simulations for every point. You see that the results of the permutation test are perfectly correct, while the parametric test has lower rejection rates. So the parametric test is still valid, but there will be a loss of power corresponding to this difference when we test data in which there is a real effect to be detected: we are less likely to detect the effect with the parametric test than with the permutation test.
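In R, the tail logic just described takes only a few lines. Here is a minimal, self-contained sketch for the two-group test of this slide (an illustration written for this text, not the simulation code from the notes), with the reference value included in the permutation distribution:

```r
# Permutation test of the difference between two group means.
# H0 is true here: both groups come from the same population.
set.seed(1)
g1 <- rnorm(10)
g2 <- rnorm(10)
t.ref  <- mean(g1) - mean(g2)              # reference value of the statistic
pooled <- c(g1, g2)
t.perm <- replicate(999, {
  s <- sample(pooled)                      # permute the pooled observations
  mean(s[1:10]) - mean(s[11:20])
})
dist <- c(t.ref, t.perm)                   # the reference value belongs to the distribution
p.upper <- sum(dist >= t.ref) / length(dist)            # one-tailed, upper tail
p.lower <- sum(dist <= t.ref) / length(dist)            # one-tailed, lower tail
p.2tail <- sum(abs(dist) >= abs(t.ref)) / length(dist)  # two-tailed
```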
With other statistics, like tests of regression coefficients, the effect goes in the other direction: the parametric test has rejection rates that are higher than the significance level, and in that case statisticians say that the test is invalid and should not be used, while the permutation test remains valid. So there are many published simulation results showing that effect of lack of normality on parametric tests of significance.

Now I repeat something that Daniel Borcard has already mentioned: permutation testing solves the problem of the distribution, normal or not, but it does not solve all the other problems of the data. For instance, as our first speaker mentioned this morning, when there are links between the observations, when the observations are autocorrelated, permutation tests do not solve the problem of autocorrelation. Or in analysis of variance, the case that Daniel Borcard mentioned yesterday or the day before, there is the assumption of homoscedasticity, that is, the different populations from which the groups are drawn must have equal variances; the permutation test does not solve that problem either, only the problem of distribution. But distribution is a major problem with frequency data, and here we have a nice solution, one that we will use in the next part of my talk, on canonical analysis, where we will analyse multivariate frequency data, species abundances, and where we will use permutation tests.

It is very easy to program a permutation test. There is a small function that you will find in the list of functions for this course; the file is called corPerm, for correlation by permutation, and it actually contains three functions for permutation tests of the correlation coefficient. As we saw yesterday when you wrote your own functions, it starts with the name of the function, corPerm1 (in the file you also have numbers two and three), then the magic word 'function', and then three parameters: the first vector, the second vector, and the number of permutations. All these lines are simply comments, because they start with the # symbol, and they explain what the function does. Here is the function itself; it is very short. In this version I simply use the r statistic to carry out the test instead of the t statistic; for the correlation coefficient this is perfectly valid, but for other tests, for instance tests of partial regression coefficients in multiple regression, you have to use the t statistic. Here the r statistic works very well.

Some of these lines are not strictly necessary: here I put my x vector as a column vector in a matrix, here is the number of rows, that is, the number of observations, and here I compute the reference value of r by calling the function cor on the two vectors. By default this computes Pearson's r; if you want Spearman's r or Kendall's tau, you have to add a parameter to specify that. Then comes the permutation loop, a small loop inside the function: for the index i going from 1 to the number of permutations, 999 for instance, we do what follows, which starts with the opening curly brace and terminates with the closing one.
Inside the loop I take one of the two vectors; the first one is called x, the second y. I chose y, I could have chosen x with the same result, and I permute y with the function sample, which performs a permutation (you can do bootstrapping instead by adding the parameter replace = TRUE, which allows replicates). The number of observations appears here, but it is not even necessary to specify it: if you don't, sample permutes the whole vector that you give it. That produces a vector of y under permutation, the equivalent of taking the values and throwing them back at random, and we calculate r under permutation by calling cor again on the unpermuted x against the permuted y, so we obtain the value of r under permutation.

Now we have this counter, NGT, the 'number greater than'. Usually in a loop like this you would start the counter at zero, but in this case we know that the true value is in the distribution and is larger than or equal to itself, so instead of starting at zero and performing that useless comparison, I start the counter at one: there will be at least one value greater than or equal to the reference, because the reference itself is in the distribution. Then we test: if the absolute value of r under permutation is larger than or equal to the absolute value of the reference r, we increase the counter by one; we have found one case that is greater than or equal. Taking absolute values makes this first version of the function a two-tailed test: it is like taking the distribution and folding it, so that everything ends up in the upper tail. At the end of the loop, if I have performed 999 permutations, the value of NGT will be between 1 and 1000, because it started at one, depending on how many of the permuted results were greater than or equal to the real one. Then we exit the loop and calculate the p-value, which is that number divided by the number of permutations plus one, the calculation I showed you earlier. The cat statement just prints out the result, and the function also returns a list in which Correlation is the reference r, No.perm the number of permutations, and P.perm the p-value.

This is perfectly operational; I have loaded the function already. Here are some results that I calculated last night, no need to recompute them. I took two variables from the environmental variables associated with the mite data (the file is called mite.env): the first one is substrate density and the other is water content, and we test whether they are correlated, with 9999 permutations. The function prints the value of the correlation, the number of permutations, and the p-value. Here I just repeated the run to show you that the p-value can vary a little, and here it actually has four decimal places, because I asked for 9999 permutations; it would be 0.0030, but R did not print the trailing zero. It varies a little because this is what we call a sampled permutation test: we cannot do all possible permutations here. There are 70 points, and permuting 70 points gives 70 factorial permutations, a big number; you could not run 70 factorial permutations in your lifetime, so forget it.
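Here, as a sketch in the spirit of the corPerm1 function just described (the actual file distributed with the course may differ in its details):

```r
corPerm1 <- function(x, y, nperm = 999) {
  # Permutation test of a Pearson correlation, two-tailed version
  r.ref <- cor(x, y)      # reference value of r (add method = "spearman" or
                          # "kendall" to cor() for the other coefficients)
  nGT <- 1                # counter starts at 1: the reference value is part
                          # of the distribution and is >= itself
  for (i in 1:nperm) {
    y.perm <- sample(y)   # permute one of the two vectors
    r.perm <- cor(x, y.perm)
    if (abs(r.perm) >= abs(r.ref)) nGT <- nGT + 1
  }
  p.perm <- nGT / (nperm + 1)
  cat("r =", r.ref, " n.perm =", nperm, " P(perm) =", p.perm, "\n")
  list(Correlation = r.ref, No.perm = nperm, P.perm = p.perm)
}
# e.g. with vegan's copy of the mite data (names from that package):
# library(vegan); data(mite.env)
# corPerm1(mite.env$SubsDens, mite.env$WatrCont, nperm = 9999)
```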
You can do a complete permutation test up to n = 7, or maybe nowadays n = 8, not 70. So in real life we do sampled permutations, that is, a sample of all possible permutations, and for that we need a function that guarantees it produces all possible permutations in an equiprobable way; the function sample in R does that, and we can test it. That is how we obtain this sort of result, knowing that it varies a little. If we are close to the significance level and we want to be sure, then we have to run more permutations in order to have a more stable value; the last decimal place is never stable.

So this was the idea of permutation tests, and it is what we are going to use in the next topic, which is canonical analysis. I am going to use this file, a portion of a chapter that I am giving you for reference (don't tell my editor that I give away chapters). The term 'canonical analysis' was coined by the statistician Rao, and it actually covers all the methods that handle two data tables. There are different ways of handling two data tables; I will draw the two tables here. We have Y, our response data, typically species abundances, gene frequencies, that sort of thing, and X, environmental data. This is the setting for the type of analyses I will describe today, the asymmetrical analyses: just as in regression, we consider that this side is the response and that side is the explanatory data, and we will have two variants of that, redundancy analysis and canonical correspondence analysis. Actually, if you had only one y variable, y against X would be multiple regression; redundancy analysis is the extension of multiple regression, and it is asymmetrical: in multiple regression you cannot switch the two sides.

In other cases we have two data tables that are equivalent, and we just want to compare them. There are other methods for that, described in the same chapter, and you can find them in the green book: canonical correlation analysis, a method that has been around since the 1930s, and more recent methods like co-inertia analysis and Procrustes analysis, which also consider two data tables that we can call X1 and X2. In co-inertia analysis, what we compute is the matrix of covariances between the two data tables; we decompose that covariance matrix and produce plots, and there is a test of significance that is a generalization of Pearson's r to two data tables. Procrustes analysis comes with other tests of significance. I will not describe these other methods, because I don't have time in this short course, but if you want to know more they are in my book, and we have software in R to carry them out.

Another of the asymmetrical methods is multiple discriminant analysis, where the response is a vector which happens to be a factor, the result of a classification. If you classified, for instance, the species data into groups, you could put your groups here and the environmental data there, and find out which of the environmental variables explain the differences between the groups. That is discriminant analysis, another of these asymmetrical methods, but I will not describe it today, because I have to limit the presentation.

So these are the different types of canonical analysis, all available in R software. The basic idea of the asymmetrical methods is that we combine ordination and regression.
On Monday we studied ordination methods, in particular principal component analysis and correspondence analysis, with sites-by-species or other frequency data. PCA can accommodate other types of variables, environmental variables for instance, if you standardize them, no problem, and many of you have even programmed your own function for PCA, so you know that it is very simple to calculate and produces nice graphs. Now, the things we have learned from principal component analysis are, first, that it preserves the Euclidean distance: if we want to put in species abundance data or gene frequencies, we have to transform them, because you have seen that the Euclidean distance is inappropriate for this type of data and can produce completely foolish results, while after transformation of the species data, with the chord or Hellinger transformation or the chi-square transformation, the results of PCA are meaningful. That applies directly to canonical analysis: everything you have learned about PCA for the response data applies to canonical analysis. And second, if you put in environmental data, or any data in different physical units, you learned that they must be standardized before PCA; they must also be standardized if they are used as response data in canonical analysis.

Yesterday we discussed regression a little. In regression we have a single response variable and many explanatory variables, and we produce a model like this; there are tests of significance that we can use, and we will have the exact same tests of significance here, because the X matrix is exactly the same as the matrix of explanatory data in canonical redundancy analysis. And just as in regression, this matrix can contain almost anything: not only quantitative variables, but also binary variables, or factors that will be decomposed internally into a series of dummy variables. Just as regression can handle any of those types of variables, canonical analysis will handle them too. When it comes to the analysis of geographic variation, which we will see in more detail in the next two days, we can use here either some transformation of the geographic coordinates or the new methods of spatial eigenfunctions, the exact same thing as in regression. And of course, since our data are not normal, just as we would use a permutation test in multiple regression, we will use permutation tests in canonical analysis, for the same reason. Everything is exactly like what we have seen on Monday for ordination, yesterday for regression and all the regression statistics, and today for permutation tests; we build upon what we have seen in the past two days.

Now, the symmetric and asymmetric forms. Redundancy analysis (RDA) is our basic form, the one on which I will spend the most time. Simple RDA means that we have one table Y and one table X; in partial RDA we will have two tables of explanatory variables, and eventually more in variation partitioning. The word 'redundancy' is synonymous with explained variance: it means how much variance of Y can be explained by the X variables.

Here is the big picture that summarizes what we do in redundancy analysis. It looks terribly complicated, but you will see that it is not. Here is our response data; if they are species, they are already transformed before we go into the redundancy analysis.
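In R, the transformations mentioned above are available through the decostand function of vegan. A minimal sketch, using the mite community data that ship with vegan; the name env stands for a hypothetical table of environmental variables:

```r
library(vegan)
data(mite)                                           # example community data
spe.hel   <- decostand(mite, method = "hellinger")   # Hellinger transformation
spe.chord <- decostand(mite, method = "normalize")   # chord transformation
spe.chi   <- decostand(mite, method = "chi.square")  # chi-square transformation
# Environmental variables in different physical units are standardized instead:
# env.z <- as.data.frame(scale(env))
```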
The first thing the function does is to center those variables; there is an option to standardize them if they are environmental variables in different units, but otherwise they are simply centered. The X variables, the environmental variables, are also simply centered. This is done in order to remove the intercept, which we don't need, because we don't need a model for each individual y variable; by centering here and there, there is no intercept to be computed, and that simplifies the calculations a little.

Now, you can imagine that we take the first column here, say the first species, regress it on all these explanatory variables, and obtain fitted values: we would write them in the first column here, and the residuals would be written in the matrix down there. So we decompose the first column into fitted values and residuals, and we could reconstruct it by adding this column to that one; we have lost nothing. Now do the same thing with the second column: do the multiple regression, obtain the fitted values, obtain the residuals; then do that in turn for each column. If you have 10 columns you can do it by hand; if you have 300 species it would be a bit tiresome, but you could hire an undergraduate to do it.

It turns out that there is an easier way, which we saw yesterday in the algebra of regression: we can compute all the fitted values directly from the matrix equation Ŷ = X(X'X)⁻¹X'Y. We saw this equation yesterday with a small y, a single response variable, and it produced a small ŷ, a single column of fitted values; but if we put the whole matrix Y in the equation, then we obtain the whole matrix Ŷ in one shot, with one line of R code. That is interesting: it saves a lot of time, and while doing it by hand is prone to mistakes, here you don't make mistakes; you write that line of code, and once it is correct it works every time.

With that, we already have enough to answer many questions. The next part will be a PCA of the matrix Ŷ, but before we do that we have to know whether our relationship is significant, whether X explains the variation of Y in a significant way, that is, more than random data would. From Y and Ŷ, which already embeds the effect of X, we can compute the R² of this analysis: the total variance of Ŷ divided by the total variance of Y. I showed you the same equation yesterday with a small y, in the case of regression. Since a variance is a sum of squares divided by degrees of freedom, the degrees of freedom cancel out, and we can compute R² as the sum of squares of Ŷ divided by the sum of squares of Y. And how do we calculate those? The data are already centered, so to compute the sum of squares of one column we simply square all its values and sum them; if we do that for all the columns, we have the sum of squares of Y. In R this can be written sum(Y^2), that's all, because Y is centered: we square all the values in Y and sum them. On top it is sum(Yhat^2). So in a single line of code, once we have these two matrices, we obtain the R². Isn't that nice? That is the power of the R language and the matrix algebra associated with it.
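A self-contained toy sketch of this regression step; the names Y and X are placeholders for the (already transformed) response table and the explanatory table:

```r
set.seed(2)
Y <- matrix(rnorm(10 * 3), 10, 3)   # toy response table: 10 sites, 3 species
X <- matrix(rnorm(10 * 2), 10, 2)   # toy explanatory table: 2 variables
Yc <- scale(Y, center = TRUE, scale = FALSE)   # centre the columns
Xc <- scale(X, center = TRUE, scale = FALSE)
# All fitted values in one shot (assumes X of full column rank):
Yhat <- Xc %*% solve(t(Xc) %*% Xc) %*% t(Xc) %*% Yc
Yres <- Yc - Yhat                    # matrix of residuals
R2 <- sum(Yhat^2) / sum(Yc^2)        # canonical R-square
```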
Now, we saw yesterday that from the R² we can compute the adjusted R², using the equation of the prophet Ezekiel, yes, you remember that; the same equation transforms the R² into the adjusted R² here. Another thing we can compute from the R² is the AIC, and from it the AIC corrected for small sample sizes, because we saw in yesterday's handout that AIC can be computed from the R²; it is simply a penalized form of the R². And of course we can compute the F statistic, which is simply R² divided by (1 - R²), with the R² divided by m, the number of explanatory variables, or, to be more precise, the rank of the covariance matrix of X: if there are strongly collinear variables in X, the rank may be smaller than the number of variables, so in software m is the rank of the covariance matrix of X. In the denominator we divide (1 - R²) by n - m - 1, that is, n - 1, the basic number of degrees of freedom, minus m. And there we go, we have the F statistic.

Then, to obtain a p-value, we do a permutation test, and this is done simply by permuting the rows of Y at random; we permute entire rows, we don't touch X, and we recompute the canonical decomposition, say 999 times. For the permuted data the total variance of Y has not changed, so it is a fixed value throughout the permutation test; we recompute the sum of squares of the Ŷ obtained from the permuted values, we have the R², from which we compute the F under permutation, 999 times plus the real value, and we calculate the p-value just as I showed you half an hour ago. Is that clear? So here it is simply an application of everything we have seen in the previous talks.

We then have the first half of our canonical analysis results. The p-value tells us whether we have explained more of the variation of Y using this matrix X than a random arrangement of the rows of Y against X would produce. If it is not significant, you stop there: either you go back to your data and see what you can do to obtain a significant result, for instance change the X variables and try to get better predictors for the structure of Y, or you drop it entirely, go to the bar, and have a beer. If it is significant, then you may be interested in looking at what the structure looks like, using plots that will now be called triplots, because there will be three types of elements in them. These plots are based on the principal component analysis of the matrix Ŷ; the function that many of you wrote yesterday could be used on Ŷ to obtain these results. There is no trick, no special handling: it is just a PCA of Ŷ. In that PCA you obtain eigenvalues, which would be at the top here, but I did not write them, and eigenvectors, which are here. From the matrix of eigenvectors, you can multiply the eigenvectors by this matrix to obtain the matrix of ordination scores in the space of the X variables. In PCA I call this matrix F, but here the authors use a different symbol, Z, because it is the result of the PCA of Ŷ; and also because, in the tradition of redundancy analysis, you can project the original data Y and rotate them using the eigenvectors derived from Ŷ: we take the eigenvectors computed from Ŷ and multiply them by Y, which is what is written here. That is not the result of the true PCA of Ŷ; it is what is called the F matrix in redundancy analysis, while the true PCA result is the matrix Z.
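To make the test concrete, here is a continuation of the toy sketch shown earlier (Yc, Xc, and R2 as computed there): the F statistic and its permutation test.

```r
n <- nrow(Yc)
m <- qr(Xc)$rank                     # rank of X, not just its number of columns
F.ref <- (R2 / m) / ((1 - R2) / (n - m - 1))
nGT <- 1                             # the reference F belongs to the distribution
nperm <- 999
for (i in 1:nperm) {
  Yp <- Yc[sample(n), , drop = FALSE]                  # permute entire rows of Y
  Yhat.p <- Xc %*% solve(t(Xc) %*% Xc) %*% t(Xc) %*% Yp
  R2.p <- sum(Yhat.p^2) / sum(Yc^2)  # total SS of Y is unchanged by permutation
  F.p <- (R2.p / m) / ((1 - R2.p) / (n - m - 1))
  if (F.p >= F.ref) nGT <- nGT + 1
}
p.perm <- nGT / (nperm + 1)
```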
Yes, a question? Well, since we are doing a permutation test, we don't have assumptions of normality, and we don't go to the trouble of checking the assumption of homoscedasticity for each of the regressions, because it would be one check for each species, and there are too many of them; we assume that over the whole set of species we will have something that may look like homoscedasticity. If you wanted to be formal, yes, you would have to do it, except that nobody goes to the trouble. If we don't have homoscedasticity, it simply lowers the power of the method, and it tends to produce non-significant results, which is all right, because then you will not end up with a significant p-value based on wrong premises: it lowers your power, it increases the p-value, it makes results less significant. Otherwise there is no other assumption, and in X you can have whatever type of data you want: quantitative, binary, qualitative. Okay.

Then, in some cases we are interested in looking at the residuals, which are here, and in doing a PCA of the residuals. In the software vegan this is done automatically: you get the eigenvalues and eigenvectors for Ŷ, and you automatically get the eigenvalues and eigenvectors for the residuals, and you can produce the plots if you want. Few people care about this, but in some cases it can be useful, and it is produced automatically by the rda function of vegan. When we give two matrices to the rda function of vegan, it does the canonical analysis; with one matrix, it does only the PCA; with three matrices, we will see what it does.

Then, in the chapter, you have a long discussion of the algebra. Step one, the multiple regression, is here; step two is the PCA, as I said. The statistics in simple RDA we saw: the R², the adjusted R², the F statistic. There is also a discussion of the cases where we could use a parametric test; in that case the degrees of freedom for the parametric test would be these, multiplied by the number of species p, but for the permutation test this p and that p cancel out in the equation, so that multiplication is useful only if you do a parametric test. A parametric test is possible if two conditions are met: first, the data are multivariate normal; second, you have standardized your variables. With species abundance or gene frequency data you don't want to standardize, because you want the most frequently found genes or species to count more in the analysis, and of course these data are not multivariate normal. That is why we don't implement the parametric test in RDA; we use it only in simulation work, where our data are multivariate normal and standardized, the only case where it would be used. So we don't use equation 11.6 but the simplified form where p has disappeared; the F is the same in the two cases, and in the permutation test you don't use the degrees of freedom in the testing procedure anyway.

It is also possible to test individual axes. When we obtain axis one, axis two, axis three, there may be a long list of these axes, and we don't want to look at all the possible pairwise graphs of pairs of axes, so it is nice to know which ones are significant; then we can forget about the others.
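In vegan, this whole sequence, the RDA itself, the overall test, and the tests of individual axes, follows a call pattern like the one below; spe.hel and env are hypothetical data frames holding the transformed species table and the environmental variables:

```r
library(vegan)
mod <- rda(spe.hel ~ ., data = env)  # two tables: canonical analysis (RDA);
                                     # rda(spe.hel) alone would be a plain PCA
anova(mod, permutations = 999)                 # overall permutation test
anova(mod, by = "axis", permutations = 999)    # tests of individual canonical axes
```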
There were proposals in the literature for this test. The first one was available in Canoco since version 3.10, the 1990 version, and then Jari Oksanen, who programmed the RDA in the vegan package, came up with another, slightly different procedure. There was some worry about which of the two procedures was the correct one, so we did a study together, which I published; I carried out the study and had my two colleagues check it and sign the paper with me, and it showed that the two methods have correct levels of type I error and comparable power. So they are equivalent methods: vegan and Canoco are both correct in testing the individual axes. This is nice, because usually you end up with two or three axes that are significant, and you can forget about the other ones. Here is the statement of the null hypothesis of that test.

This is a long section on the algebra of simple RDA; you can look at it if you are interested in the algebra. It shows where this equation comes from: it comes from the two steps that I showed, a multivariate regression followed by a PCA, and it produces in the end an equation involving the covariance matrix between Y and X and the inverse of the covariance matrix of the X's, S_YX S_XX⁻¹ S_XY; you put that into the eigenvalue-eigenvector equation and you obtain the results. This is also how you obtain the F matrix of ordination scores in the space of the Y's, and the Z matrix of ordination scores that are the true PCA of Ŷ. So you have all the algebra to do it. And of course, in RDA, what you want to obtain is a triplot, because you are interested in the sites and the species, but you also want to draw the environmental variables onto that same plot. How to do that is what I will explain now.

Before that, I mentioned that you can have scaling type 1 and scaling type 2 in RDA, just as we had in PCA: scaling type 1 preserves the Euclidean distances among the points of the Ŷ data, and scaling type 2 preserves the covariances among the columns of the Ŷ data; the equations are the same as in PCA. I will come back to this part in yellow in a moment, RDA scaling type 2.

So the question here (oops, I should have stopped here) is how to represent the environmental variables in the plot that already contains the sites and the species. It is not done within the calculation of the RDA itself; it is done a posteriori. For that you compute the correlations; I called this matrix R_XZ, where Z is the matrix of the ordination scores obtained from the Ŷ matrix, so R_XZ is simply written cor(X, Z). That produces a matrix whose rows are the X variables and whose columns are the canonical axes, and we are mostly interested in the first two columns, axes 1 and 2, to produce the first plot. In scaling type 2, these correlations can be used directly to draw arrows in the ordination plot representing the environmental variables. In scaling type 1, we have to modify these values into 'biplot scores'. The name, invented by ter Braak in the Canoco program, was originally a longer phrase, something like 'biplot scores of environmental variables'; we abbreviate it to biplot scores. It is a transformation of these correlations that takes into account the eigenvalues divided by the total variance of Y.
Here it is: you take the eigenvalue of axis 1, divide it by the total variance, take the square root of that, and you multiply each of the correlations in column 1 by that constant. For column 2 you repeat the calculation: take eigenvalue 2 divided by the total variance, take the square root, and multiply those values by it to obtain the biplot scores. So for scaling type 1, and still for the environmental variables, this is what we have to use, and it is again ter Braak, in the Canoco program, who demonstrated that it must be done in this way: you cannot use the raw correlations in scaling type 1. For biplot scores of type 2, it is simply the correlations. All of that involves a lot of work, but now we have all these results, we can just apply them, and I will show you an example.

My example, which you will be able to play with this afternoon, is one of the data files provided for the course: a coral reef that I have made up, with 10 sites going away from the shore, with depth increasing here, and six species. We will use this portion for RDA and all of it for the example of CCA. The six species have been worked out so that they are related to some of the environmental variables in one way or another. For instance, this species decreases in abundance from site 4 to site 10, so as the depth increases, that species decreases; this one increases all the way from the shore to the deeper sites. These two species are linked to coral substrate (substrate type is a factor, decomposed into three binary variables), and these two species are linked to the 'other substrate' class. For sand I have no species related to it, because if you know the warm seas a bit, near the shore on a sandy beach there are nearly no fish, so I just followed that idea: there would be almost no fish in the sandy part of the transect. And here we have species whose column sums decrease. So these are not real data, they are made up, and I use them to demonstrate what happens.

First of all, we can try to anticipate how many axes we are going to obtain from these data. The number of axes in a PCA would be equal to the number of species: six axes. But in an RDA, suppose there was only one variable in X: regressing all these species on the single variable would produce results in one dimension. So the maximum number of axes that we can obtain from the RDA is the number of explanatory variables. But is that true for this table? What do you see in this table that is a bit strange? We have more than one x, yes, but in particular the substrate type has three states and is decomposed into three binary variables, one of which is collinear with the other two. So the rank of that matrix is actually 3, not 4, which means that it will produce three canonical axes at most; there could be fewer, that is, some canonical axes could have a variance of zero, but it cannot produce more than three, even though there are four columns. And here are the results: indeed we obtain three canonical eigenvalues, three canonical axes. Then, on the non-canonical side, we could have six more axes, that is, the PCA of the residuals; but here, after four axes, we have exhausted all the variance, so the last two non-canonical axes have a variance of zero, an eigenvalue of zero.
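The rank argument can be checked in R. A small sketch with a hypothetical miniature version of the explanatory table; remember that RDA centres X, which is what makes one of the three dummy variables redundant:

```r
substrate <- factor(rep(c("coral", "sand", "other"), length.out = 10))
depth <- 1:10
X <- cbind(depth,
           coral = as.numeric(substrate == "coral"),
           sand  = as.numeric(substrate == "sand"),
           other = as.numeric(substrate == "other"))
# After centring, the three dummies sum to zero, so one is redundant:
qr(scale(X, scale = FALSE))$rank   # 3, not 4: at most three canonical axes
```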
So we will focus on the canonical side. Here are the eigenvalues; when we divide them by the total variance, that gives us the relative eigenvalues here, and if we put them cumulatively, the first one explains 66 percent, the second one 22 percent, so this plus that is 88, plus that is 95 or so, and the remainder is explained by the non-canonical axes; the last few percent are the non-canonical portion. It is because my example is made up that we have such high results: the cumulative total variance after three axes is also the R², 96 percent here. In real canonical analysis results, you are happy if you have something between 40 and 65 percent.

Let's see. These are the species scores; in the rda program of vegan they are called species scores, but they are actually the eigenvectors. And this is the matrix Z of the PCA of the Ŷ matrix: each site has a coordinate, and we will see them in the graph in a moment. Then the correlation matrix; there it is. And then we do the calculation of the biplot scores by multiplying the correlations by the correction factors, the square roots of these ratios here, that is, that divided by that: we take the square root and multiply the correlations by these correction factors to obtain the biplot scores, from which we will be able to draw the variables as arrows using these coordinates.

So I show you the result, which is here, and this triplot is very nice, because it tells you the story that you want to put in your paper or in your thesis, that is, how the species react to the environmental variables. As our first speaker was saying, this is a simultaneous analysis of Y and X, instead of what we were doing 30 years ago, which was to do a PCA of Y and then take the axes of the PCA and relate them to the environmental variables. Here, instead, we have an analysis that involves an ordination of only the portion of Y that is explained by X: the matrix Ŷ. In this case we have the sandy sites here, the coral sites there, and the 'other' sites here; the species are the dashed arrows, and the environmental variables are represented by bold arrows.

Now, there is another method for representing the classes of a factor, proposed by ter Braak and implemented in the Canoco program (I know some of you have been using Canoco already), and this method is also available in other functions. It is to represent 'sand' here, for instance, either as an arrow or as the centre of mass, the centroid, of the points that have this quality of being on sandy substrate. Here is the centre of mass of these three points; the circle with the cross is the centroid. For this group there are four points, one, two, three, four, and this is the centre of mass of these four points. So you can represent the classes of the factor either by the centroid or by an arrow. Here I used both types of representation to show that the arrow, corrected as described, really points toward the centroid: here the arrow points to the centroid, here the arrow points to the centroid, and here as well. If we had used the raw correlation coefficients, the arrows would not point to the centroids; this arrow, for instance, would point here instead of there. So the correction is necessary to have a correct representation, given the stretching of the axes done by the eigenvalues.
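Here is a sketch of the correction just described; Z, lambda, tot, and X are placeholder names for the canonical ordination scores (from Ŷ), the canonical eigenvalues, the total variance of Y, and the explanatory matrix (numeric, with factors coded as dummies):

```r
Rxz <- cor(X, Z)                 # rows = explanatory variables, columns = axes
bp.sc2 <- Rxz                    # scaling type 2: correlations used directly
w <- sqrt(lambda / tot)          # one factor per axis: sqrt(lambda_k / total variance)
bp.sc1 <- sweep(Rxz, 2, w, "*")  # scaling type 1: these arrows point to the centroids
# vegan returns its own version of these with scores(mod, display = "bp")
```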
About the arrows: there is an arbitrary decision made at the time of drawing; you can stretch them more or less, but the lengths remain comparable between arrows. You see that these two arrows are longer, so they probably have a stronger correlation than depth, for instance; that is the idea of comparative lengths. In the plotting program there is a factor that you can set if you want the arrows to be longer or shorter, and in the actual software they come out in colour: in this function that I wrote, these variables are in red and those are in blue. You know, programmers can implement things with different bells and whistles.

Good, so that is what I wanted to show you about RDA. Here is a summary figure, figure 11.4 in the book, the counterpart of the one that Daniel Borcard showed you for PCA. We could simply use the raw data against the explanatory variables: in the old days we were saying that when we have short gradients, that is, not too many zeros in the data, we could use either CCA or RDA, just as in simple ordination we can use PCA or correspondence analysis; and we were also saying, and I wrote it myself in past editions of my book, that for long gradients, when there are many zeros, we should use CCA. Now that we have transformation-based RDA, we can use RDA all the time, even with species abundance data. (For the example with my made-up coral reef, if I had been analysing it for a paper I would have transformed the data, but I did not do it, because it was simply a numerical example in the book.) We know that PCA implements the Euclidean distance, and the Euclidean distance produces foolish results with these data; we have seen it on day one. So when the raw data are sites by species, we transform them with the chord, Hellinger, or chi-square transformation, and we use that in the RDA to obtain a meaningful analysis.

The third method is for when we want to use any distance measure we like, for instance the percentage difference, alias Bray-Curtis: we can take our raw data, compute the distance matrix, and do a principal coordinate analysis. In that case we want to keep all the axes, because we want them to fully represent the data after the transformation done by the distance function. You could not obtain all the axes with nMDS, but you can obtain all the axes with principal coordinate analysis. But you have to remember that, since most of the distance functions we use for species abundance data have the property of being non-Euclidean, you have to take the square root of the distances before you do the principal coordinate analysis, in order to avoid producing negative eigenvalues and complex axes, which we would not know how to handle in the RDA. With a square-root transformation of the percentage difference matrix, or of the Jaccard, Sørensen, or Ochiai distance matrices for binary data, we do not obtain negative eigenvalues and complex axes; all axes are real, and we put them, with X, into the RDA to obtain our canonical analysis.

The advantage of the transformation approach over this one is that there the columns still represent the species; the data are transformed but the species keep their identity. Here, instead, the columns are now the eigenvectors of the PCoA, and the species are all integrated into them; they no longer correspond to individual species. So in the triplot, the species arrows, the dashed arrows of my previous graph, would be meaningless with this approach, whereas with the transformed abundance data they are meaningful.
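A minimal sketch of this distance-based route, again with vegan's mite data standing in for a species table:

```r
library(vegan)
data(mite)
data(mite.env)
D <- sqrt(vegdist(mite, method = "bray"))   # percentage difference, square-rooted
pcoa <- cmdscale(D, k = nrow(mite) - 1, eig = TRUE)   # PCoA keeping all axes
dbrda.mod <- rda(pcoa$points ~ ., data = mite.env)    # RDA of the axes on X
# vegan can also run the whole chain in one call:
# capscale(mite ~ ., data = mite.env, distance = "bray", sqrt.dist = TRUE)
```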
So there is a small advantage in using the transformed abundance data. Okay, I am chewing into your time. Yes: discriminant analysis is a form of canonical analysis. Factor analysis is a form of analysis where you want to maximize the covariance instead of the variance; it is used in the social sciences, but it is not used anymore in ecology. That is the quick summary; we could talk about it for an hour, but that is the quick summary.

So this is called transformation-based RDA, and this is distance-based RDA; you have the references in the document. Now, a last quick word about partial RDA. Partial RDA is when you have a third matrix of data; Daniel already showed it yesterday, and we called it W. It is the same as partial regression, except that now partial regression is applied to a multivariate data table. And if you give three tables to the rda function of vegan, it will produce a partial RDA; the mathematics is the same as what Daniel described yesterday for partial regression.

There are nice applications of partial RDA. Here are the statistics and all that; the F statistic is a bit different, because it has to involve the number of covariables, but Daniel already showed that. The tests of significance are described here; this is an intricate matter, but these problems were solved back in 1999, in our paper comparing different methods, some of which were invalid and some valid. We know that now, and there is no problem anymore.

I will quickly run through the different applications. The first application is to control for a well-known linear effect, and this is what we do most of the time; we will see it again in variation partitioning. If we have environmental variables and spatial variables, we can look at the effect of the environmental variables while controlling for space, or vice versa. We may also want to look at the effect of a single explanatory variable: suppose you have 10 environmental variables here; if you want the effect of a single one, you put the nine others in W and do the canonical analysis, and you do that for each variable in turn, with one here and all the others there. This is done automatically by an option of the anova function that follows the rda function in vegan: you will have these tests of the marginal effects of each variable, which is very interesting to know and a way, perhaps, of eliminating some environmental variables. We will see that this afternoon in the practicals, and there is a small sketch of the calls below.

The analysis of related samples is another nice application, and we will see it in more detail, is it tomorrow, I think it is tomorrow, when we look at space-time analysis. If we have a data set that is structured with 10 sites and repeated over time, and we want to analyse the effect of time, time class one, class two, class three, we put the description of the 10 sites in the W matrix; this is an analysis of related samples, just like a paired t-test, generalized, if you like. MANOVA by RDA: Daniel is going to talk about that, I don't know when, when I shut up, so I leave that to him. Principal response curves and other nice analyses too, but I will skip them. A partial PCA would actually be the PCA of the residuals; there are three or four ways of computing it here. And then we will also use partial RDA for the selection of environmental variables; Daniel is going to describe that with the function forward.sel, or with step or ordiR2step.
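Here are sketches of these two uses in vegan; Condition() declares the covariables W, and by = "margin" requests the marginal test of each variable. The data frames spe.hel and env, and the variable names, are hypothetical:

```r
library(vegan)
# Partial RDA: effect of two chemistry variables, controlling for a covariable
mod.part <- rda(spe.hel ~ pH + Hardness + Condition(Distance), data = env)
anova(mod.part, permutations = 999)            # permutation test of the partial effect
# Marginal (unique) effect of each variable in a full model:
mod.full <- rda(spe.hel ~ ., data = env)
anova(mod.full, by = "margin", permutations = 999)
```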
So that's it, and I will finish by showing you an example of variation partitioning by RDA, with the varpart function that you used yesterday with one response variable. We will simply use it with the whole table of response variables; it was designed for that. These are the fish data of the Doubs River, where I put in different explanatory matrices: the topography, the water chemistry, and the geography. In that case, topography was represented by altitude, slope, and water flow; chemistry by pH, hardness, and the concentrations of phosphate, nitrate, ammonia, dissolved oxygen, and BOD; and geography was simply the linear distance from the source of the river. So we wanted to know which of these sets of variables added information and which ones produced the same type of explanation, and here we have it, in the three circles. Let's see: I think this fraction is slightly significant, this one is more significant, and this one is not. It means that the explanation provided by geography is entirely embedded in the explanation provided by the water chemistry, and to a good extent also in the topography; geography is not useful in this analysis, it adds nothing to this and that. Now, between topography and chemistry there is also a good deal in common, but each one adds a significant component to the explanation, so they should both be kept in the analysis. Okay, I think I will stop there and give the microphone to Daniel Borcard.