OK, thank you. So good morning, everybody. For me also it is a great day, and I thank Vinko and Cosimo for the organization of this course. Pierre has shown you the inner workings of ordination, the basic mathematics of PCA. I shall now present some examples of how to work concretely with those methods. Let me first reassure you: to run a PCA, you won't have to write out yourself the equations that Pierre showed you. Of course there are functions in R that do that for you automatically. But it is very important to go through this mathematics so that you know what you are doing, because too many people use statistical software in a quite automatic way and actually don't know what they are doing. This is where you can make mistakes: you could misuse those methods, or use one instead of another, simply because you don't know their limits and fields of application. This is a point that I shall of course stress in some particular aspects. I'll briefly come back to some aspects of PCA later. But first I'll go through a short part concerning the transformation of species abundance data. Actually, I'll start even more basically, with general types of transformation, but this will be very short. You sometimes have to transform data for different purposes, one of them being, for instance, to make comparable descriptors that have been measured in different units. Ranging is one such transformation, although not so well known, or maybe not so often used. Ranging consists in expressing a variable in such a way that its maximum is always equal to 1. Or you can go further and rescale the data so that they lie between 0 and 1; this is the complete way of ranging data. It is used, for instance, in a particular case of model II regression called ranged major axis, but of course we will not go into that one for now.
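As a minimal sketch in base R (the numbers are purely illustrative), the two variants of ranging look like this:

```r
# Ranging, sketched in base R on an illustrative vector.
x <- c(2, 5, 9, 14)

# Partial ranging: divide by the maximum, so the largest value becomes 1.
x_max1 <- x / max(x)

# Complete ranging: rescale the values between 0 and 1.
x_range <- (x - min(x)) / (max(x) - min(x))

round(x_range, 2)  # 0.00 0.25 0.58 1.00
```

The decostand() function of vegan offers these as methods "max" and "range" (applied by columns by default), so in practice you rarely write them by hand.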
But as Pierre told you, one of the most used transformations is standardization: you subtract the mean and divide by the standard deviation, and you obtain the so-called z-scores. In PCA, as Pierre told you, you would do that when the variables are expressed in different physical units. And as we will see in a couple of days, in RDA the explanatory variables are automatically standardized by the software, because in the most general case those explanatory variables are expressed in various physical units. Among the other transformations, you certainly know about the square root transformation, which you may use when your data are moderately asymmetric, possibly with an added constant if you have negative values, to avoid complex numbers. The other one is the log transformation: when the asymmetry is a bit more extreme, you may resort to it, and there a constant may be needed if you have negative or zero values, because the log of zero is undefined. A typical case: if you log-transform species abundance data to scale down the largest abundances, and we do that quite often, you would set the constant to one. In that case a zero abundance gives the log of one, which is zero, so zeros remain zeros, the rest is transformed, and the added one is almost negligible for the larger abundances. This generally has the effect of making the data more or less symmetrical. You still have the problem of the zeros, and I'll come back to it later on. In other situations, you may have data expressed on semi-quantitative scales, like the Braun-Blanquet scale of phytosociology, well known in Europe, or simply on ordinal scales. Here you have the Braun-Blanquet scale, for instance.
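A minimal base-R sketch of both transformations, on illustrative abundance values:

```r
# Standardization (z-scores) and the log(y + 1) transformation,
# sketched in base R on illustrative abundance values.
y <- c(0, 1, 3, 10, 120)

# z-scores: subtract the mean, divide by the standard deviation
# (scale(y) does the same in one call).
z <- (y - mean(y)) / sd(y)

# log(y + 1): zeros stay zeros, the largest abundances are scaled down.
y_log <- log1p(y)  # numerically safe equivalent of log(y + 1)

y_log[1]  # 0, so absences remain absences
```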
And you have its re-expression on an ordinal scale of 10 units, 0 to 9 here. In any situation where you have variables expressed on such semi-quantitative scales, you may be interested in recoding them in a quantitative way. This does not restore more information; it is just a way of re-expressing your variables in a form that suits your analysis better. In the simplest case, you take the scale values and raise them to an exponent w: with w equal to 0 you re-express the data as a 0/1 presence-absence scale, and at the other extreme you give higher weight to the larger abundances of the scale. I used such things when I was studying oribatid mites, as Pierre told you, many years ago. With that type of animal you can find three or four hundred thousand individuals per square metre, so even in small cores you have quite a lot of them, and it is more time- and cost-effective to estimate the abundances than to count every individual. So I devised an ordinal scale from 0 to 5 that visually corresponded to something quite akin to a logarithmic transformation, actually. After that, if I needed to upscale the abundances, I could use such transformations. Another case, which we will address later because we will use those types of descriptors, is that of qualitative descriptors with several classes, like the levels of a factor in ANOVA. Here, for instance, I borrowed this material from François Gillet: we have four types of soil, which are four classes. But those classes are not quantitative in any way, and in several situations you may have to recode them into binary descriptors called dummy variables. You would think that for four classes you would need four binary descriptors, but if you look at it more closely, you see that three are enough, because everything is then fully characterized.
The first class is coded by a one here and two zeroes, the second by a one here, the third by a one here, and the fourth case is the one with no one at all, so you don't need a fourth descriptor. This is consistent with the fact that a factor with four classes, four categories, has three degrees of freedom. By recoding such a qualitative variable into dummy variables, you obtain as many variables as there are degrees of freedom in the factor. We'll come back to this, and we'll show you another kind of orthogonal coding, called Helmert contrasts, later in this course. OK, that was a short overview of the general types of transformation. But the main purpose of this first part of my course today concerns transformations to obtain ecologically meaningful relationships among sites while using linear techniques. What is hidden behind this concept? To understand it, I'll now present something that is specific to species data. It is also a good example of why many of the multivariate methods developed for ecology have been developed by ecologists: every field has special cases that may require specific treatment. So if you are interested in developing or transposing this kind of method in your own field, and your field does not concern species data, you may have to think about this and look for its own specificities. In our case, the problem is called the double-zero problem. In community composition data, where you often sample over large ecological gradients, the resulting data set is full of zeros, because in many places you have groups of species that are specific to one situation and other species specific to other situations, and those do not necessarily overlap: each species is absent from part of the sites and present in the others.
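As a small base-R sketch (the soil-type labels are invented for illustration), model.matrix() with the default treatment contrasts produces exactly the k - 1 binary descriptors just discussed:

```r
# Dummy coding of a qualitative descriptor with four classes,
# sketched in base R; the soil-type labels are hypothetical.
soil <- factor(c("clay", "sand", "peat", "loam", "sand"))

# model.matrix() with default treatment contrasts codes a k-level
# factor with k - 1 dummy variables (plus an intercept, dropped here);
# the reference class is the one coded by zeros everywhere.
dummies <- model.matrix(~ soil)[, -1]
ncol(dummies)  # 3 dummy variables for 4 classes = 3 degrees of freedom
```

Helmert contrasts, mentioned above, can be obtained the same way by setting contrasts(soil) <- "contr.helmert" before building the matrix.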
The proportion of zeros is all the greater when you sample over broader ecological gradients or transects. So you may end up with something like this. And even here there are not so many zeros, in this short excerpt of the oribatid mite data that Pierre has already presented to you. But look at the situations where you have two zeros for the same species: this species is absent from the two sites three and four. And the two sites need not be adjacent: this species is absent from sites five and ten. This is what we call double zeros. Why is this a problem? A zero value in a matrix of species abundances is tricky to interpret. If a species is present at two sites, it means that, minimally, these two sites offer ecological conditions that are suitable for that species: the main dimensions of its ecological niche are adequate at both sites. This can be interpreted as a resemblance between the two sites, because they share common ecological characteristics. On the other hand, if a species is absent from two sites, you cannot know whether the reason for its absence is the same at both sites. You may say: yes, I know my species very well. And indeed, if you work in restricted gradients or areas where you know perfectly well that every species could potentially be present at every site, then this may not concern you. But in the most general case you have a large number of species over broad ecological gradients, and then you certainly cannot be certain that in every case of a double zero, and there are many, many of them, the species is absent from the two sites for the same reason. A species may be absent from one site because it is too dry, from another because it is too acid, and from a third one...
Because you have simply missed it: it was just five centimetres away, and you might otherwise have encountered and captured it. So you cannot be sure. As a consequence, you cannot interpret a double zero as a resemblance between two sites. And this has major consequences on the choice of the methods you will use to analyze species data. This is extremely important. Pierre stressed the fact that most of these analyses are based on comparisons among sites, or among species or variables. So the choice of an appropriate distance or dissimilarity measure is crucial, because it determines whether the rest of the analysis will be adequate and usable, or whether you made a mistake from the beginning, in which case the rest may be meaningless. In the Green Book's language, these measures are called association coefficients; we will speak about them in detail later, tomorrow actually. Association coefficients are the measures of distance or resemblance between sites that Pierre already mentioned: the Euclidean distance is the main one, but only one example among many; you may have heard of the Bray-Curtis distance, for instance, if you work with community data, which is another example. The association coefficients that consider a double zero as a resemblance like any other value are said to be symmetrical, because for them a double zero and a double one, if you speak presence-absence, have the same meaning: the double zero is considered a valid indication of resemblance between two sites. This would be true with physical or chemical variables: if you have zero milligrams of, I don't know, nitrogen per cubic decimetre of soil at one site and also at another one, this is a true resemblance between the two sites. It tells you something about the ecological conditions of those two sites, and they do resemble each other on this particular point.
So symmetrical measures are adequate for those situations. On the contrary, the association coefficients that do not consider the double zero as a resemblance are said to be asymmetrical: a double one has one meaning, and a double zero has another, or possibly no meaning at all, in the sense that double zeros are not taken into account when measuring the resemblance between two sites. Those are the coefficients to be used when analyzing species data. Now, a word about Pierre's example. From that point of view, PCA, being based on the Euclidean distance, is not adapted to raw species data. Everybody knows the Euclidean distance, even if you don't know the equation: in any space defined by variables, it is the ordinary physical distance between two given points. There is a Euclidean distance of about three metres between me and you. Now replace the x, y, and z geographical dimensions by dimensions measured as pH, temperature and so on, scaled as you wish, and you can still define the same Euclidean distance between the points. But in the Euclidean world, two points that have many zeros in common, meaning that in many dimensions the measure is the same, will end up very close to each other. So two sites that share many zeros will be considered very similar in that sense. Then why did Pierre use PCA on species data in his very simple example? Because he used only three species, and you may have noticed that those three had no zero, and certainly no double-zero, values. It was a very small example of three species that may have been close to one another in a short part of the ecological gradient, and in such a case it may be acceptable to use PCA on raw species data.
Now, until 2001, that was about the end of the story for PCA, meaning that for practical purposes PCA was not an option to study species data. We resorted to correspondence analysis, which I will present in a moment, or maybe principal coordinate analysis, but not PCA. But then came Pierre, with his usual genius, and I weigh my words, because I was there when he had that idea, and I still remember that you could practically see the electricity in the room. There were other similarity and distance measures, asymmetrical ones, suited to species data, but they did not belong to the world of PCA. Then he saw that a couple of them actually had a Euclidean component, and that mathematically you could dissociate the Euclidean part from the rest. So these distances could be re-expressed as a pre-transformation, a prior transformation of the species data, done in such a way that when you submit the transformed data to a method based on Euclidean distances, and this includes PCA, but also ANOVA, multiple regression, and so on, the distance that is actually preserved is the one that was decomposed, and no longer the plain Euclidean distance. This opened up the whole world of linear methods to species abundance data. It led to an extremely important paper by Pierre Legendre and his colleague Eugene Gallagher in 2001. In summary, they presented five such distances, already known and used by ecologists, among them the chord distance and, at the end of the table, the Hellinger distance, which are actually the two most used. These distances can be preserved in PCA or RDA by pre-transforming the species data in the ways shown here. So what you see here are not distance formulas.
These are the ways you pre-transform your data before you submit them to the analysis. For instance, here in the first row, the chord transformation. What does it mean? In each row of your data matrix, you square all the values, sum these squares, and take the square root of the result; then you divide every species abundance of that row by this quantity. This will be done automatically in R as well. The other one that we use very frequently, the Hellinger transformation, is even simpler: you divide every species abundance in one row, one site, by the sum of all abundances of that row, y_i+ at the end of the row here, and you take the square root. This is your new, transformed abundance value. Experience has shown that these two transformations give the most interesting results, so I will mostly speak about them. Just a parenthesis about the chi-square distance here, a transformation which uses both row and column information: this is the distance that is preserved in correspondence analysis, which we will speak about later. Now a little example, a small version of what Orlóci (1978) called the species abundance paradox, as an illustration of the double-zero problem. You have here three sites described by three species with their abundances, and you see a couple of zeros here. To compute the Euclidean distance, you can imagine that those three species are represented by three axes, x, y, and z, and you position your sites the way Pierre showed you in his 3D graph, where he could place every point where he wanted. So here are the data. If you now compute the distances among all pairs of sites, you obtain this matrix, which is symmetrical; on the diagonal you have the distance from site one to site one, which is of course zero by definition.
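Written out in base R on illustrative abundances (in practice the decostand() function of vegan does this for you), the two row-wise recipes are:

```r
# Chord and Hellinger transformations written row-wise in base R,
# on illustrative abundances (decostand() in vegan does the same).
Y <- matrix(c(1, 0, 4,
              0, 2, 2,
              3, 1, 0),
            nrow = 3, byrow = TRUE)

# Chord: divide each row by sqrt(sum of its squared abundances).
Y_chord <- Y / sqrt(rowSums(Y^2))

# Hellinger: divide each row by its total abundance, then take sqrt.
Y_hel <- sqrt(Y / rowSums(Y))

rowSums(Y_chord^2)  # 1 1 1: every site vector now has length 1
```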
And the values below the diagonal are the same as those above it. Now observe: sites two and three are the two that have absolutely no species in common; site three has only species one, and site two has species two and three, but not one. Yet the Euclidean distance between sites two and three is actually shorter than the distance between sites one and two, which have all their species in common. This is an effect of the double-zero problem: the zeros shared by the sites interfere. And here sites two and three share only simple zeros; you may have other situations, with true double zeros, where it is even worse and the sites end up very, very close. But the effect of the transformations, those I have shown you, most of which appear here, is to correct the situation. In every case, when you apply the pre-transformation and then plug the result into the Euclidean distance formula here (to go from this equation to this one, you just replace the raw abundances by the transformed values; you see the correspondence), you obtain the appropriate situation: sites which have species in common are now actually closer, whether you use the chord distance, the Hellinger distance, or the chi-square distance. The species-profile transformation we don't use very much nowadays; it has been shown to have some problems. The take-home message is this: when you have species abundance data with zeros, which is the most general case, you can still use linear methods like PCA, RDA, and ANOVA, but you have to pre-transform your species data. The chord transformation, which is one that gives good results according to our experience, consists in the normalization I showed you a moment ago.
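To make the paradox concrete, here is a numerical sketch in base R. The abundances are illustrative, in the spirit of Orlóci's example, and not necessarily the values on the slide: raw Euclidean distances declare the two sites with no species in common to be the closest pair, and the Hellinger pre-transformation corrects this.

```r
# The double-zero paradox on illustrative abundances (in the spirit
# of Orlóci's example; not necessarily the values on the slide).
Y <- matrix(c(0, 4, 8,    # site 1: species 2 and 3 only
              0, 1, 1,    # site 2: species 2 and 3 only
              1, 0, 0),   # site 3: species 1 only
            nrow = 3, byrow = TRUE)

# Euclidean distances on the raw abundances: sites 2 and 3, which
# share no species at all, come out as the closest pair.
d_raw <- as.matrix(dist(Y))
d_raw[2, 3] < d_raw[1, 2]  # TRUE: sqrt(3) < sqrt(58)

# Hellinger pre-transformation, then Euclidean distance again:
# sites 1 and 2, which share all their species, are now closest.
d_hel <- as.matrix(dist(sqrt(Y / rowSums(Y))))
d_hel[1, 2] < d_hel[2, 3]  # TRUE
```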
After that transformation, each object now has a length of 1, and the chord distance is the distance between those objects along the chord. If you have two sites here, you could embed the whole thing in a circle, because now all objects have the same length, and the chord distance is actually this one here. So that is how to picture what the chord distance means. In R it can be obtained with the function decostand() of vegan, one of the packages that we will use heavily in our practicals. You just feed it your untransformed species data matrix and ask for the method "normalize", because this is what it does: it normalizes all vectors to length one. If you then use this y.chord matrix that I have produced here in any of those methods, you actually preserve the chord distance among sites instead of the Euclidean distance, and now the analysis is appropriate for species data. The Hellinger transformation can be obtained with the argument "hellinger", which you can shorten to anything unambiguous, like "hel" with one l; keep the two l's of "hell" and you are asking for a nightmare of a transformation. In any case, this makes your abundance data suitable for those methods. So these are the transformations that we strongly suggest you use whenever you deal with species abundance data. At this point... oh, there is a mistake here, a p is missing. At this point, if anybody has a question related to what I have explained... Questions will come with the practicals, I'm quite sure. And while I'm at it, concerning the practicals: I have written here the address where you will find all the material, and the material is abundant, for sure. You will see that for each day we present a series of documents built by Pierre, oriented in such a way that you may explore the mathematical aspects of the methods. We reproduce the examples that Pierre showed you, go further into them, and see how they work.
All those matrix equations can be written down quite easily in R. Another requirement for the practicals, maybe understated but in any case essential, is that you have a minimum working knowledge of R. Otherwise, the only possibility for you will be to spend the coming night studying the basics of R, using one or two of the documents that are provided to you. One is called Introduction to R, and with it you can really learn what R is, how it works, and so on. I must insist on this: it is absolutely necessary to have such a minimum working knowledge of R to be able to follow the practicals of the next days, including today. I'm certain that many among you already have a working knowledge of R. I would strongly encourage those persons to pair up, possibly this afternoon, with people who don't have that knowledge, so as to form small clusters of people who can help one another and go through the practicals together. And for each day, I think it's the last row each time, you have a document which is an R script: the practical that I have built. These practicals have been built upon those presented with the yellow book; they are borrowed, but adapted, from the material distributed on the web page of this book, the page Pierre presented to you earlier. They have been updated to the latest R version, and they will be usable on the computers here and, I hope, on everybody's personal computer. The scripts go through the day's methods, with titles saying, OK, let's do a PCA on the physico-chemical variables, which we don't have to transform, and so on, the notions coming sequentially. In many cases I have provided two different ways of presenting biplots, using two different functions that are available in the document.
What I suggest you do is download all the material into one folder and, in R, define this folder as your working directory. That way, each time you import material, it will be easy, because it will come from the same directory, and you won't have to navigate through the whole file hierarchy of your computer to go and fetch the documents. Normally everything should then work fine for you; we certainly hope so.
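A minimal sketch of that workflow; the folder name is hypothetical, and a temporary directory stands in here for the real course folder:

```r
# Working-directory workflow, sketched with a stand-in folder
# (the folder name is hypothetical; use your real course folder).
course_dir <- file.path(tempdir(), "course-material")
dir.create(course_dir, showWarnings = FALSE)

setwd(course_dir)      # make it the working directory
basename(getwd())      # "course-material": files can now be read
                       # with bare names, e.g. read.csv("somefile.csv")
```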