OK, thank you. I thank the organizers for inviting me to this seminar. I am François Gillet, professor of plant ecology, community ecology, and numerical ecology at the University of Besançon. I will speak now about some methods, within the framework of canonical ordination, for when you have several tables to analyze together, but with symmetric methods. So how can we relate two sets of variables describing the same objects? This is the main question we have to address here. You have seen yesterday, I think, with Pierre and Daniel, that asymmetric approaches are based on regression. If you want only to address the response of one variable to a set of explanatory variables in another data set, you can apply multiple linear regression, of course. In this case you have one numeric response variable and several explanatory variables. This requires multiple tests, and we have seen that there is a problem with multiple tests: you have to correct the p-values. But the main problem here is that you ignore the correlations among the response variables, which are indeed correlated and linked together. So you can use, of course, multivariate linear regression through canonical ordination: RDA, CCA, db-RDA, et cetera. You have one response matrix, Y, and one explanatory matrix, X, and you can use a global permutation test of the causal relationship from one set to the other. You suppose here that you have response variables on one side and explanatory variables on the other side. And you have seen the possibility of variation partitioning among several subsets of explanatory variables. The alternative is to consider symmetric approaches based on correlation. The first idea, perhaps, is to use pairwise linear or rank correlations, which can be useful to detect linear or monotonic relationships between individual numeric variables.
For example, using Pearson's r, Spearman's, or Kendall's correlation, you get a correlation matrix or a scatterplot matrix of the bivariate relationships between the variables, for example between a species and some environmental variables. But this becomes very cumbersome with many variables. So one solution in this symmetric case is to measure the global symmetric association between the two sets of variables. Here I give a list, not exhaustive, of different methods that have been proposed to address this question, to link two tables of variables in a symmetric way. Perhaps the oldest one is canonical correlation analysis, which is described in the book, but I will not detail it. You have the family of Procrustes analysis, with ordinary or generalized Procrustes analysis, which relates two ordination diagrams of the two sets of variables. You have co-inertia analysis, which I will detail further, and co-correspondence analysis, a relatively recent method based on correspondence analysis of two tables, generally community tables, species matrices. And the last one is multiple factor analysis, which I will also detail. In these methods there are no response and no explanatory variables: both tables play the same role. But the question is, why and when should you use these symmetric canonical ordination methods instead of a well-established constrained ordination? Constrained ordination, such as CCA or RDA, should be preferred if all three of the following conditions hold. First, you can make an explicit, or often implicit, hypothesis of a linear or unimodal causal relationship between the variables in the two (or more, but generally two) data sets. Second, no feedback loop is assumed: biological communities are supposed not to influence their environment. And this is sometimes something we should think about.
There is not only an influence of the environment on the biota; biological communities can also modify their environment, with more complex interactions involving feedback loops between the community and its environment. In such a case, you may prefer a symmetric approach. The last condition is more technical: to apply constrained ordination, the number of objects, the number of sites, must be much higher than the number of explanatory variables. If you have more variables than sites, you cannot apply it, or you have to reduce the number of explanatory variables, for example by forward selection. So if these conditions are not met, symmetric coupling methods such as co-inertia analysis or multiple factor analysis may be useful. In this case, you do not assume any linear unidirectional causality between one set of variables and the other. It is possible to analyze relationships among any number of groups of variables, which is very useful: you can have as many groups as you want. And there is no constraint on the number of variables relative to the number of sites. This is often put forward as an argument, but we can discuss it: when we advise students on their sampling or experimental design, we advocate having more sites, more samples, and limiting the number of variables to measure, rather than the contrary. As general advice, it is always better to have more rows in your matrix than columns. But it is not always possible, especially with species data, for example. So what is the principle of this symmetric coupling? The first idea, when you want to couple two tables, is simply to merge them: make one table from the two by binding their columns into a single table. This is merging: you put the columns with the species and the columns with the environmental variables, for example, in the same table.
It is perhaps not the best idea, but it is the first idea one can have. This approach is normally meaningful only if the inertia, the total variation, of each of the two tables is similar, which is generally not the case, of course. You can then standardize the data, center and scale them, or equivalently run a PCA on the correlation matrix of the merged table. A better approach would perhaps be to cross the two tables using a matrix product: with n species and p environmental variables, you can simply compute the matrix product of X and Y and perform a PCA on the correlation or the covariance matrix of the result. To illustrate this, I will use the same example throughout; I do not think you have used this data set in the previous sessions. It is available from the vegan package and is called dune: dune contains the species matrix, and dune.env the environmental data. It is about the vegetation, the plant communities, and their environment in Dutch dune meadows. It looks like this: in this partial view of the dune data set you have 20 sites and 30 plant species in columns, with semi-quantitative cover-abundance codes from 1 to 9, 0 meaning absence of the species. The environmental data frame is made of the same 20 sites and five variables. As you can see in the summary of this data frame, the original data set contains mixed variables. You have, for example, A1, the depth of the A1 organo-mineral horizon of the soil, which is quantitative. You have Moisture, which is in fact an ordinal, semi-quantitative variable for soil moisture, between 1 and 5. And you have Management, which is purely qualitative: meadows used for biological (organic) farming, for hobby farming, for nature conservation management, or for standard farming.
You have another variable called Use, the main agricultural use of these meadows: hay fields for hay production (mowing), pasture for grazing, or both (hay pasture). This is again a semi-quantitative variable; sometimes we can treat it as quantitative. And the last one, like Moisture, is Manure, the fertilization level of the meadow, on a semi-quantitative scale. Here is the summary of another version of the data, which we must use when qualitative variables are not allowed by the method: if you perform a PCA, you cannot directly use these mixed variables, of course; you have to transform them and recode the data set to get only numeric variables. Here, Moisture is considered as quantitative, as well as Manure. Hayfield and Pasture are dummy variables, binary 0/1 variables, derived from Use; for hay pasture, both are set to 1. And the qualitative Management variable was also recoded as dummy variables. Now let us apply the first ideas to these data. The first idea is to merge the two matrices. Here is the code in R: we build a merged data frame by binding the columns of the numeric version of the environmental data set and of the species data set, and we use the vegan function rda, which performs a PCA when no explanatory matrix is supplied. Of course, because of the big differences between the units and values of these different variables, it is a heterogeneous table, so we have to set scale = TRUE to perform the PCA on the correlation matrix. The result looks like this: we have merged the numeric environmental variables and the species abundances. In this biplot in scaling 1 (scaling 2 would be better to see the correlations), you can see what the main variables influencing the principal components are.
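The merge-then-PCA idea is easy to sketch outside R. Here is a small Python/numpy illustration, with toy matrices of my own making rather than the dune data, of binding two tables column-wise and running a PCA on the correlation matrix, which is essentially what rda(..., scale = TRUE) does internally:

```python
import numpy as np

rng = np.random.default_rng(0)
Y = rng.poisson(2.0, size=(20, 8)).astype(float)  # toy species abundances
Y[0, :] = 0.0                                     # make sure every column varies
Y[1, :] = 5.0
X = rng.normal(size=(20, 3)) * np.array([10.0, 1.0, 0.1])  # toy env vars, mixed units

M = np.hstack([Y, X])                   # merge: bind columns into one table
Z = (M - M.mean(0)) / M.std(0, ddof=1)  # center and scale each column
R = (Z.T @ Z) / (Z.shape[0] - 1)        # correlation matrix of the merged table
eigval, eigvec = np.linalg.eigh(R)      # PCA: eigendecomposition (ascending)
eigval, eigvec = eigval[::-1], eigvec[:, ::-1]
scores = Z @ eigvec                     # site scores on the principal components
```

Because every column gets unit variance, each table contributes a total inertia equal to its number of columns, so the larger table mechanically dominates: one face of the imbalance problem discussed here.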
The problem is that the result, although interpretable, is not an ideal configuration, because the inertia of the two original tables is unbalanced, and you can sometimes get strange results with this method. So it is not something we can advise. A variant of merging addresses part of the problem. In the first analysis we only standardized the species, so the zeros disappear: all abundances now have a mean of zero, and absences are completely ignored. To address this problem, we can use, instead of the original species matrix, the Hellinger-transformed species matrix; you are now familiar with this, I think. We bind it with the scaled numeric environmental table, using the decostand function in vegan with the method "range" to scale all environmental variables to the range 0-1, so that they are comparable. You can perform a PCA on this and compare the result with the previous one; it is perhaps more correct. Here we perform the PCA on the covariance matrix, not on the correlation matrix, to conserve the variances of the different variables. But the problem remains that the inertia of the two tables is unbalanced. Alternatively, we can cross the two matrices. Here is just an example, perhaps not the best solution: I used species profiles, that is, the relative abundances by species. For this, I use the decostand function with the "total" method and MARGIN = 2, standardization by columns. In this case, we focus on the ecological niche of the species more than on the differences among sites. We then do the matrix product: we transpose the species-profile matrix, multiply it by the scaled environmental matrix, and perform a PCA on the covariance matrix of the resulting table.
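To make these transformations concrete, here is a hedged Python/numpy sketch (again on toy matrices of my own making) of the Hellinger transformation, the by-column species profiles, and the crossed species-by-environment table followed by a covariance PCA:

```python
import numpy as np

rng = np.random.default_rng(1)
Y = 1.0 + rng.poisson(2.0, size=(20, 8))  # toy abundances; +1 keeps all rows non-empty
X = rng.normal(size=(20, 3))              # toy environmental variables

# Hellinger transformation: square root of relative abundances per site (row),
# as with decostand(Y, "hellinger") in vegan
Y_hel = np.sqrt(Y / Y.sum(axis=1, keepdims=True))

# Species profiles: relative abundances per species (column),
# as with decostand(Y, "total", MARGIN = 2)
P = Y / Y.sum(axis=0, keepdims=True)

# Cross the two tables: a (species x environment) matrix product t(P) %*% X_scaled
X_sc = (X - X.mean(0)) / X.std(0, ddof=1)
C = P.T @ X_sc                            # 8 species x 3 env variables

# PCA on the covariance matrix of the crossed table
Cc = C - C.mean(0)
cov = (Cc.T @ Cc) / (Cc.shape[0] - 1)
eigval = np.linalg.eigvalsh(cov)[::-1]
```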
And this is interesting: the objects have disappeared, of course; there are no sites in this ordination. Instead, the species play the role of the sites, and the environmental variables play the role of the descriptors. You get a relatively nice picture of the relationships between the environmental variables, represented as gradients, as vectors, and the different species. But it is not a traditional way of coupling these tables; there are more elegant ways to do it. The first one I will present is the so-called co-inertia analysis. Co-inertia analysis was developed by my colleagues in Lyon, in France, and is available in R in the ade4 package. Briefly, the principle of co-inertia analysis, without entering into the details, is first to perform separate ordinations of the species matrix and of the environmental matrix, to find axes that maximize inertia in each table: principal components in PCA, or factors in correspondence analysis, depending on the method chosen for the species and for the environmental variables. Co-inertia analysis then aims at finding a couple of co-inertia axes, one for the species and one for the environment, on which the sites are projected, and it maximizes the squared covariance between the projections of the sites on these co-inertia axes. Here is the detail of the calculation of this squared covariance: you can see that it is a compromise between the squared correlation, the variance of the sites from the species viewpoint, and the variance of the sites from the environmental viewpoint. This is very well explained in the paper by Stéphane Dray and collaborators in Ecology. So this is a very flexible method, which allows various possibilities for the symmetric coupling of two tables. There is no constraint on the number and the nature of the variables in each table.
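The compromise mentioned above is the identity cov²(u, v) = cor²(u, v) · var(u) · var(v), which is easy to verify numerically. A small Python/numpy sketch with made-up site-score vectors u and v (my own toy data, not output of an actual co-inertia analysis):

```python
import numpy as np

rng = np.random.default_rng(2)
u = rng.normal(size=20)               # site scores on a species-side axis (toy)
v = 0.5 * u + rng.normal(size=20)     # correlated site scores on an env-side axis

cov = np.cov(u, v)[0, 1]              # sample covariance (n - 1 denominator)
cor = np.corrcoef(u, v)[0, 1]
var_u = np.var(u, ddof=1)
var_v = np.var(v, ddof=1)

# The quantity maximized by co-inertia analysis decomposes as
# cov^2 = cor^2 * var(u) * var(v): a compromise between correlation
# (as in canonical correlation analysis) and the inertia captured on each side
assert np.isclose(cov ** 2, cor ** 2 * var_u * var_v)
```

This is why co-inertia axes differ from canonical correlation axes: maximizing the squared covariance also rewards axes that carry a large share of each table's own inertia.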
It is possible to combine different ordinations, after different standardizations or pre-transformations of the data. The only constraint is that the row weights must be identical in the two ordinations, so you have to check that this condition is met. You can use the coinertia function in the ade4 package to perform the analysis. And you can use the so-called RV coefficient. I did not find any explanation of the abbreviation RV; perhaps my colleagues know. This RV coefficient, detailed here, measures the similarity between the geometrical representations derived from the two centered matrices X and Y, the species and the environmental matrices, or any two matrices in general. Technically, it is the ratio of the total co-inertia to the square root of the product of the total squared inertias of the separate analyses. It is, of course, a symmetric measure, ranging from 0 to 1: 0 means that the two matrices are completely independent, and 1 means that the two configurations are completely homothetic. This RV coefficient can be tested by permutations, or by an approximation of its distribution with a parametric test, which is more efficient and faster; this is described in the paper by Josse and collaborators. I give a list of these references at the end of my presentation. [Inaudible exchange with the audience about interpreting RV as a correlation or a similarity.] So in co-inertia analysis you have a lot of flexibility. You can apply various pre-transformations of the species abundances prior to PCA.
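The RV coefficient can be written directly from centered cross-product matrices. Below is a Python/numpy sketch (the helper name rv_coefficient and the toy data are mine) together with a naive permutation test; in R, ade4's RV.rtest and FactoMineR's coeffRV provide tested implementations:

```python
import numpy as np

def rv_coefficient(X, Y):
    """RV coefficient between two matrices sharing the same rows (sites)."""
    Xc = X - X.mean(0)
    Yc = Y - Y.mean(0)
    Sxy = Xc.T @ Yc                      # cross-covariance structure
    Sxx = Xc.T @ Xc
    Syy = Yc.T @ Yc
    # total co-inertia / sqrt(product of total squared inertias)
    return np.trace(Sxy @ Sxy.T) / np.sqrt(np.trace(Sxx @ Sxx) * np.trace(Syy @ Syy))

rng = np.random.default_rng(3)
X = rng.normal(size=(20, 4))
Y = X[:, :2] + 0.3 * rng.normal(size=(20, 2))   # Y partly driven by X

obs = rv_coefficient(X, Y)

# Permutation test: shuffle the rows of Y to break the site-wise link
perm = np.array([rv_coefficient(X, Y[rng.permutation(20)]) for _ in range(999)])
pval = (1 + np.sum(perm >= obs)) / (1 + len(perm))
```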
For example, for the species matrix you can use a PCA on species profiles (relative abundances per species), on site profiles (relative abundances per site, for example the famous Hellinger or chord transformations), or on double profiles (for example, the chi-square double standardization), et cetera. You can use various ordination methods for the preliminary separate analysis of each table. Basically, you can use a PCA on a covariance or a correlation matrix. You can use CA, correspondence analysis, for a species matrix or a contingency table. You can use, I do not know if you have spoken about it, multiple correspondence analysis if you have a table with only qualitative variables; it is a variant of PCA for qualitative variables. Or you can use PCoA, principal coordinates analysis, based on any resemblance matrix, provided that it is metric and Euclidean. For example, the Jaccard distance is OK. Here are the results of a first co-inertia analysis of our Dune meadows example. We performed a correspondence analysis on the raw species matrix, without transformation, no need for that, and a PCA on the correlation matrix of the numeric environmental matrix X. Here is the result you get by default. You can also detail each plot, but you have here all the main graphical information from this analysis: the RV coefficient, the eigenvalues, and the representation of the positions of the axes, for the environmental variables here and for the species here, sorry. And you have, interestingly, the positions of the sites, from the point of view of the environment at the base of each arrow to the point of view of the species at its head. So you can see in this plot the correspondence between these different points of view on the data.
And you see the projections of the species and, here, of the environmental variables. Here is an alternative to the previous choices: a PCA on the covariance matrix of the Hellinger-transformed species matrix Y, and the same PCA on the correlation matrix of the numeric environmental matrix. You can see there is not so much difference. [Comment from the audience:] I would prefer that last solution because it is more consistent with the principle of co-inertia analysis. In the previous slide, you did a CA on the species data and a PCA on the environmental data. Now, in co-inertia analysis, there is the condition that the row weights must be the same. So you either have to take the row weights of the CA, which differ between rows because they are the sums of values in each row, and apply them to the environmental matrix, or take the weights of the environmental matrix, which are all equal, and apply them to the CA, and these lead to two different solutions. I discussed that at length with Stéphane Dray when I was writing the book, and we concluded that there is actually no logical way of deciding between these two solutions, which produce totally different results. In the next slide, it is done in the appropriate way. [Speaker:] Indeed, in that first analysis I chose to use the row weights from the CA to perform the PCA. And you do not have this problem in the second analysis, because all rows have the same weight in a PCA. Thank you for your remark. The last method I would like to present is called multiple factor analysis. It is another method developed by French biostatisticians, in Rennes this time, and it is a very flexible method too. The idea is to perform a global ordination of several subsets of variables describing the same objects, based again on PCA. The only restriction is that the variables within each subset should be homogeneous.
So the subsets should contain only qualitative or only quantitative variables, scaled or not: you have to organize your data into relatively homogeneous subsets. You can use raw quantitative variables if they are homogeneous; in this case a PCA on the covariance matrix will be applied. You have to scale the quantitative variables if they are heterogeneous, which is typically the case with environmental variables. Or you can use a subset of qualitative variables as well. A subset of variables may be active or passive; passive subsets are called supplementary groups, and they are not used to compute the analysis but are projected a posteriori onto it. In the principle of multiple factor analysis, the subsets of variables are weighted so that their influences are equivalent. This is achieved by dividing each table by the square root of its first eigenvalue, so that the first axis of each weighted table has an eigenvalue of 1. Here is an illustration I took from an excellent paper; you can read it, it is very interesting and gives more details about this method. The first step is to compute a generalized PCA on each of the tables; generalized PCA means that, for qualitative variables, you can also use what we have already mentioned, multiple correspondence analysis. The next step is to normalize each table by this first-eigenvalue weighting, concatenate the normalized tables, and perform a simple PCA on the concatenated table to get the global PCA, which is the MFA result. An example of application, always with the same data: with the Dune example, you can use the mfa function of ade4 and you get this result, but I will detail more another package, which is more flexible. With ade4, I used two sets of variables: the Hellinger-transformed species matrix and the numeric Dune environmental matrix, which is required by the ade4 function. And we get this result.
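The weighting step described above can be sketched in a few lines of Python/numpy (toy tables with deliberately unequal variances; the sizes and seed are my own choices): each table is divided by the square root of its first PCA eigenvalue, so every block enters the global PCA with a first eigenvalue of 1.

```python
import numpy as np

def first_eigenvalue(T):
    """First eigenvalue of the covariance PCA of a table."""
    Tc = T - T.mean(0)
    cov = (Tc.T @ Tc) / (Tc.shape[0] - 1)
    return np.linalg.eigvalsh(cov)[-1]

rng = np.random.default_rng(4)
# Two toy tables on the same 20 sites, with very different scales
tables = [rng.normal(size=(20, 6)) * 1.0, rng.normal(size=(20, 4)) * 10.0]

# MFA weighting: divide each centered table by the square root of its own
# first eigenvalue, so its first axis carries an inertia of exactly 1
weighted = [(T - T.mean(0)) / np.sqrt(first_eigenvalue(T)) for T in tables]

# Concatenate and run one plain PCA: this global PCA is the MFA compromise
M = np.hstack(weighted)
eig_global = np.linalg.eigvalsh(np.cov(M, rowvar=False))[::-1]
```

A consequence of this weighting is that the first global eigenvalue lies between 1 and the number of tables, so no single block can dominate the compromise.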
You can see here the positions, the scores of the sites, the row projections from the points of view of the two data sets; here the projections of the columns, the species and the environmental variables; and the positions of the first axes of the individual ordinations. What is interesting with MFA is that you can have more than two groups. Here, for example, in the Dune example, I have considered different groups of plant species, the forbs and the grasses; I distinguished the environmental variables between soil variables and land use; and I used two passive, supplementary groups for shrubs and mosses, which are not very important in this data set, just for the exercise. And here I use the MFA function, in capitals this time, from the FactoMineR package: it has the same name but it is a different function. The FactoMineR package is developed by the authors of this method. Here is the script to do this, and you get a number of outputs from this MFA that you can examine in detail. I will not do it here; I will rather insist on the graphical outputs. Here you have the partial axes of the separate ordinations projected on the global PCA of the multiple factor analysis, and you can see the different groups, for example the grasses, which are strongly correlated with the first axis of this global PCA. Here you have the eigenvalues. Here you can see the correlations among the numeric variables, with colors corresponding to the different groups, the subsets of your analysis. You can also restrict the projection: if you have a lot of variables, you can limit their number by setting a probability threshold from a permutation test, or some other criterion, to keep only the most significant variables. You can see the different correlations between the variables of the different data sets; there is no explanatory variable here.
If you have qualitative variables, which is the case here, you can also project the centroids of their categories, with labels. And you have the different points of view for the active subsets: you can see, for example, that Hayfield, and the management levels SF and NM, influence axis 2 more than axis 1. Here you can view the global and partial projections of the sites from the different points of view. And in the last graphic, you can see the contributions of the groups to the main MFA axes: land use is expressed mostly on axis 2; the grasses, soil, and forb variables are more expressed on axis 1; and here you see the passive subsets. You can overlay the result of a cluster analysis on the site scores of the global PCA, using the squared Euclidean distance and Ward's method, and you get this picture with three main groups of sites. The advantage is that this classification is made on a combination of the different points of view, the different subsets of variables that describe the sites. Here are the RV coefficients for these data. You can test them: I used the test proposed by Josse and collaborators in this paper, with the coeffRV function. You have the p-values there and the RV coefficients in the upper part of the matrix, and you can see that there are strong associations between forbs, grasses, and soil. We have no test for land use, because this function cannot be applied to qualitative variables, but we get its RV coefficient from the output of the MFA; given its magnitude, it is probably significant: land use is strongly connected with the vegetation and the soil, of course. OK. Just a few words about an extension of multiple factor analysis called hierarchical multiple factor analysis, HMFA. It is an extension to the case where your data are organized in nested, hierarchically organized groups.
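Overlaying a Ward clustering on ordination scores is straightforward in any language; here is a scipy sketch on made-up two-dimensional "site scores" (three artificial clumps of my own construction; in practice you would feed in the site coordinates from the MFA). scipy's "ward" linkage works on Euclidean coordinates and minimizes the within-cluster sum of squares, i.e. it is the squared-Euclidean Ward criterion mentioned above:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(5)
# Toy stand-in for site scores on the first two MFA axes: three clumps of 7 sites
scores = np.vstack([rng.normal(c, 0.3, size=(7, 2)) for c in (0.0, 3.0, 6.0)])

# Ward hierarchical clustering, then cut the dendrogram into three groups
Z = linkage(scores, method="ward")
groups = fcluster(Z, t=3, criterion="maxclust")
```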
You have several nested partitions of the variables, and in this case HMFA balances the role of each group relative to its node in the hierarchy: you have a hierarchy of variables, and you can take this hierarchy into account in the analysis. To understand it, consider the Dune meadows example: you have vegetation and environment; this is the second level of the hierarchy, the first level being the groups we considered before: the forb species, the grass species, and the moss species nested in vegetation, and soil and land use nested in environment. It is relatively easy to do: you define the different groups at the bottom level, you assign these groups to a hierarchy simply by giving the number of subsets within each group of the second level, and you use the HMFA function to perform the analysis. You get this result: the eigenvalues, expressed here as percentages of the variation explained by the different axes; you can see that we need consider only the first two axes. You get the correlation circle with the relationships between the different variables in the groups, and the scores and projections of the sites and of the centroids of the qualitative variables. You can have a superimposed representation of the partial clouds at the highest level, species and environment, or at the basic level, which is a bit more confusing here; but you can of course isolate each point of view in a separate graph to analyze it. And you have the group representation, as in MFA. Finally, I performed a hierarchical clustering, using a function in the FactoMineR package, with four groups here; it is a classification based on the scores of the sites in the MFA. You can see the projection of this dendrogram with the four groups, and even a 3D representation, which is nice; I find it very nice, probably not the most readable, but interesting.
You have here the code for this; it is very easy to do. OK, I have some minutes left, yes? Two applications. I would like to show you published applications of this approach in ecology and environmental sciences, because that is our topic. I forgot to tell you that these methods, multiple factor analysis in particular, were originally developed in sensometrics, for comparing, say, the chemistry of products with people's sensory perceptions; it is in this field that these methods were first developed. But now they have more and more success in ecology and environmental sciences. The first paper is one I wrote with a colleague at the Muséum national d'Histoire naturelle in Paris, to analyze a very complex data set about vegetation, soil fauna, and humus components in a subalpine spruce forest ecosystem. For this we used hierarchical multiple factor analysis. I will not enter much into the details; you can read the paper if you are interested. We had data about the above-ground vegetation, and five depth layers for the observation of the humus horizons and the soil fauna. In each layer we had an inventory of samples: collembolan species, oribatid mite genera, Actinedida families, so the taxonomic resolution was not the same, as you can see, plus other soil animal taxa, and the humus components describing these different layers. These data were relatively complicated, and we chose to apply hierarchical MFA. One of the reasons: perhaps we could consider that there are explanatory and response variables in this problem, but in fact we had fewer rows than columns, because there were a lot of variables describing these layers. That is why we applied this method.
Here, extracted from the paper, you have some results: the scatterplot of the sampling points on the first two axes for the 11 habitats, which are labelled here, with the interpretation in terms of humus forms. It was a very interesting result; I cannot enter into the details of the paper. Here you have, for example, the positions of the plant groups only, at the arrow heads, compared to the barycentre at the arrow base for the 11 habitats; the barycentre is the score from the global PCA of the analysis. And here, the trajectories of the positions of some animal taxa in this plot across depths, because these symbols represent the different depths. It was interesting to show the different distributions of these taxonomic groups across the soil depths. The second and last example is more an environmental application, in environmental science in general. It is a paper about the use of multiple factor analysis to relate heavy metals, organic compounds, pollutants, et cetera, in sediment samples. I chose this example because it was in the Bay of Trieste, here. The study was made by Belgian people. This paper is pedagogically interesting because they begin with a PCA: they put all the chemical variables together and do what I presented at the beginning of my talk, a PCA on the correlation matrix. They got this representation of the variables, interpolated the scores on the first axis, and obtained this map. They then compare it with the result of the multiple factor analysis, where heavy metals and organic compounds are separated: a quite different picture, with all these variables here along axis 1, and some heavy metals influencing axis 2 more.
If we concentrate on axis 1, they can draw this second map, based on the interpolation of the scores of these points. I will let you read the paper if you are interested in the results themselves. OK, that is all. Here is the list of the references. I have put on the website the R script I used to analyze the Dune data, so you can explore it by yourself. And tomorrow, at the same time, we will move on to the practicals with these methods, using the famous Doubs River data set, to apply co-inertia analysis and multiple factor analysis. Thank you. Yes, you have questions?