Good morning everyone. With this weather outside, when I opened the drapes in my room and looked out at the fog over the sea and the little harbour, I thought it was highly romantic, and I was glad to be inside and not to have to walk between buildings: we just go to the cafeteria and then down here, and we stay dry and safe.

Good. We have a lot of work to cover this morning. We will talk about measures of similarity and distance. This may look like a trivial subject, but it is not: it is central to ordination and to what we will do in the next few days. I will come back to these similarity and dissimilarity coefficients on Thursday, when I talk about beta diversity analysis; we will revisit their properties in order to choose those that are appropriate for beta diversity studies, but that will be on Thursday. Today I will give a general presentation of the main families of similarity and dissimilarity coefficients, and I will introduce a concept that you may not be totally familiar with: the concept of a Euclidean distance. What does it mean to be Euclidean, compared to being metric? It is not the same thing at all, and it has some importance for our choice of dissimilarity coefficient. Then I will talk about the basic statistics used in multiple regression, because tomorrow we will use the same statistics in canonical analysis; I am still preparing for the introduction of canonical analysis tomorrow. After the coffee break, Daniel will take over with polynomial regression, partial regression, and the important concept of variation partitioning.

I will use different documents today. Yesterday most of my talk was based on one document; today I will switch from one document to another instead of using a nice PowerPoint presentation. That is my style of teaching, and I think that when I change documents, that type of presentation makes people wake up, and that is good: it forces you to try to figure out what I am trying to do.

We have a document called "Some measures of similarity and distance"; where is it? Let me open it from here; not that one, "Some measures S and D". I will talk first about the coefficients for binary data. I think Daniel already introduced this idea yesterday: when we have binary data, we can more or less summarize the comparison of two sites using a sort of contingency table with four counts, a, b, c and d. Let me put that a bit smaller. For site 1 and site 2, a is the number of double presences and d is the number of double absences; and here comes the problem of double zeros. The counts b and c are the numbers of variables, or species, present only at site 1 or only at site 2. For two vectors like these it is easy to count. Going through the vectors, we have one double presence here and another one there, so a = 2 in this example. How many species are present only at site 1 and absent from site 2? Here is one, and there is another one, so b = 2. Tell me if I am making a mistake; sometimes when you are too close to the screen you do not see things well. How many are present at site 2 and absent from site 1? We have one here, and that is all, so c = 1. And how many double zeros do we have? One and two, so d = 2. From these four counts we can build all the binary coefficients. Here they are expressed in the form of similarities for presence-absence data, and I introduce here the two families of coefficients: those that keep the double zeros as an indication of similarity, and those that simply discard the value d, as Daniel explained yesterday. The first are the symmetric coefficients, because a and d are considered in a symmetric way in the construction of the similarity coefficient; the second family contains the asymmetric coefficients, where d is removed from the equation.
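The counting we just did on the screen can be written down in a few lines of Python. This is only a sketch: the two vectors below are hypothetical, chosen to reproduce the counts from the example on the screen (a = 2, b = 2, c = 1, d = 2).

```python
def abcd(x1, x2):
    """Count the four cells of the 2x2 table for two binary (0/1) vectors:
    a = double presences, b = present only in x1,
    c = present only in x2, d = double absences."""
    a = sum(1 for u, v in zip(x1, x2) if u == 1 and v == 1)
    b = sum(1 for u, v in zip(x1, x2) if u == 1 and v == 0)
    c = sum(1 for u, v in zip(x1, x2) if u == 0 and v == 1)
    d = sum(1 for u, v in zip(x1, x2) if u == 0 and v == 0)
    return a, b, c, d

# Hypothetical site vectors with the same counts as in the example
site1 = [1, 1, 1, 1, 0, 0, 0]
site2 = [1, 1, 0, 0, 1, 0, 0]
print(abcd(site1, site2))  # (2, 2, 1, 2)
```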
Here the similarity is simply the double presences plus the double absences, a + d, divided by the sum total. The second coefficient is the same thing but without the d: we remove the d portion. The first is called the simple matching coefficient, and it is nice and good in taxonomy, where you look at the presence or absence of structures. The second is the Jaccard coefficient, the most famous of all similarity coefficients, designed by Paul Jaccard, a Swiss botanist who was working in the Alps. He used that coefficient for the first time in 1896, and he wrote it in a paper with these letters a, b, c and d in 1901. So it is a very old coefficient, but old does not mean out of fashion, quite the contrary: it is found in all packages that do this sort of calculation.

Of course, this idea of removing the d portion is based on the theory of gradients in ecology; I think Daniel mentioned this. This figure is from a paper by Robert Whittaker; we will talk about Whittaker again on Thursday. He is the man who defined the alpha, beta and gamma diversity concepts. He worked at Cornell University and built a big lab where numerical methods were developed; this is where Mark Hill worked in particular, and Hugh Gauch as well. The idea is that along an ecological gradient, going up a mountain for instance, species succeed one another, each one having its optimum at some point along the gradient. At low altitude you may have a species that feels very good there, and it is eventually replaced by other species more adapted to the conditions at higher altitude: the dryness, the wind, and so on. It is because of this idea of species having an optimum at some point along the gradient, what is called a unimodal distribution, that we eventually exclude the d portion from the comparison. When we have a double one for a species, it means that the two sites are close to one another: they are both under that species' response curve. But a double zero may mean that the two sites are both to the left of the curve, both to the right of it, or one on each side. So we cannot interpret double zeros: we do not know whether they mean that the sites are similar or very far apart, and ecologists in general remove the double zeros for that reason. This is a very interesting point: the choice of the asymmetric coefficients is not based on mathematics, it is based on ecology.

So there is a whole family of these coefficients for frequency data: species presence-absence, or gene or allele presence-absence, it is the same thing. The data are often used in binary form, and these coefficients are very much in use; three of them, based on a, b and c, will be used a lot. The first is the Jaccard coefficient of community. We will also use the Sørensen coefficient, which gives double weight to a: you have a 2a here and two times a in the denominator. It is very much in use. The third one is the Ochiai coefficient, which has a more complicated equation: in the denominator we have the geometric mean of the site sums. If this is site 2 and that is site 1, the sum for site 1 is a + c and the sum for site 2 is a + b, and in the denominator you take the geometric mean of these two. These coefficients are the limits, for presence-absence data, of other coefficients that handle quantitative data: the Sørensen coefficient is the limit of the percentage difference, called Bray-Curtis in some software; the Ochiai coefficient is the limit of the Hellinger distance and of the chord distance; and the Jaccard coefficient is the limit of yet another quantitative coefficient.
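Here is a small sketch of these four binary coefficients computed from the counts a, b, c, d. The counts are the ones from the earlier example; the function names are mine, not from any package.

```python
import math

def simple_matching(a, b, c, d):
    # Symmetric coefficient: double zeros (d) count as resemblance
    return (a + d) / (a + b + c + d)

def jaccard(a, b, c):
    # Asymmetric: the d portion is discarded
    return a / (a + b + c)

def sorensen(a, b, c):
    # Asymmetric, with double weight given to the double presences a
    return 2 * a / (2 * a + b + c)

def ochiai(a, b, c):
    # Asymmetric; the denominator is the geometric mean of the two site sums
    return a / math.sqrt((a + b) * (a + c))

a, b, c, d = 2, 2, 1, 2
print(round(simple_matching(a, b, c, d), 3))  # 0.571
print(round(jaccard(a, b, c), 3))             # 0.4
print(round(sorensen(a, b, c), 3))            # 0.571
print(round(ochiai(a, b, c), 3))              # 0.577
```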
For quantitative data, it is called, what is its name again, the Ružička coefficient, something like that. OK, so much for the presence-absence data.

Now I will give you two examples of coefficients for quantitative data. The best known is the Euclidean distance, where you simply take the difference between the two values (here the difference is 5), square it, do that for every variable, take the sum of these squares, and take the square root. This coefficient has a lower bound of zero but no upper bound: it can be any large number, certainly larger than one. For instance, the distance between here and the sun, in microns, is larger than one; that is an example. So this coefficient does not have an upper bound, and for beta diversity studies we will see that there is a strong advantage in using coefficients that do have one; this is another reason why the Euclidean distance will not be appropriate for those studies. You have already seen, in the examples presented by Daniel Borcard yesterday, that the Euclidean distance produces the wrong answer when applied to some simple data, like the small paradox example he showed with three sites and three species abundances: the Euclidean distance indicated that the two sites with no species in common had the smallest distance. That makes no sense for an ecologist, although mathematically it is perfectly all right. So again, the decision between these two types of coefficients has to be made on the basis of ecology. OK, so much for the Euclidean distance. There are other coefficients for quantitative data, and I will describe in a moment the Gower index, which is very useful, but not for species presence-absence or abundance data. And here is the percentage difference; on this slide I was still using the old terminology, Bray-Curtis. So why do I suggest using the original name, percentage difference?
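Before the answer, a short numerical aside: the Euclidean-distance paradox mentioned above is easy to verify yourself. The three-site table below is hypothetical (Daniel's exact numbers are not reproduced here), but it shows the same effect: the two sites with no species in common obtain the smallest Euclidean distance.

```python
def euclidean(x1, x2):
    # D = sqrt( sum_j (x1j - x2j)^2 )
    return sum((u - v) ** 2 for u, v in zip(x1, x2)) ** 0.5

# Hypothetical abundances of 3 species at 3 sites
s1 = [0, 1, 1]   # sites 1 and 2 share no species...
s2 = [1, 0, 0]
s3 = [0, 4, 8]   # ...while site 3 shares two species with site 1

print(euclidean(s1, s2))  # ~1.73, the smallest of the three distances
print(euclidean(s1, s3))  # ~7.62
print(euclidean(s2, s3))  # 9.0
```

Ecologically absurd, mathematically impeccable: exactly the point made above.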
For two reasons. First, this coefficient was properly described by Odum in 1951 under the name percentage difference. Second, Bray and Curtis, in 1957, did not describe this similarity coefficient: their paper was about a new ordination method, and they simply said that they used a coefficient available in the literature. And the coefficient they used is not the one now known as the Bray-Curtis coefficient; they never used the "Bray-Curtis" coefficient in their paper, so there is no reason to call it by their name. The people who wrote software and called it that have never read the Bray and Curtis paper. So let us go back to the original name, the percentage difference, while knowing that in software the name Bray-Curtis is simply a synonym for it.

Here is the idea of this coefficient, presented first in the form of a similarity, which we then turn into a distance coefficient. For each species, if you have this much of the species at one site and that much at the other site, the similarity is the minimum of the two numbers, and we add these minima over the species to compute the similarity; the difference between the two numbers is the dissimilarity part. You can also rework the formula to write it in terms of the differences between the two values, but here I wrote it in terms of the minima, the similarity computed species by species. For these data the minima are 2, 3, 4, 5 and 1, and their sum, 15, is the essence of the similarity between the two sites for these five species. Now, we would like a coefficient bounded between 0 and 1, so we ask what is the largest possible value the sum of minima could take: it is the mean of the sum of the abundances at site 1, which is 20, and at site 2, which is 22. So if W is the sum of the minima, the similarity is W divided by the mean of A and B, the two site sums; the 2 migrates to the numerator, giving 2W / (A + B). That is the basic form.

This similarity form was described by a Polish mathematician called Steinhaus during World War II. I have never found Steinhaus' papers; they were probably destroyed during the war, maybe. But the work was reported by other Polish mathematicians after the war, republished in '47, '49 and so on, and later authors picked up on these reports and named this the Steinhaus similarity coefficient. We turn it into a distance by taking 1 minus the similarity: because the formula is bounded between 0 and 1, you can take 1 minus the similarity to obtain the distance. We will see, however, that this distance is not without problems; I will first have to describe some additional properties of these similarities. But before I do that, I will go through two of the other similarity coefficients that we often use, and then I will go back to the transformation of similarities into distances, which is the tricky part of the operation.

OK, this is the description of the chi-square distance. It is the distance preserved in correspondence analysis, so it is an important one, and we will see the domain of application of that distance by examining what the formula does. First of all, if you do not consider this portion here, these are the values of the variable at site 1 and at site 2 for species j. This coefficient is strictly for frequency data, like species abundance data or gene frequency data; you cannot apply it to quantitative measurements of salinity, pH and things like that. It is only for frequency data, and with frequency data you can sum each row of the data table to obtain the total number of individuals found at the site.
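The percentage-difference calculation we just walked through looks like this in code. The two vectors are hypothetical, chosen to reproduce the numbers on the slide: minima summing to W = 15, and site sums A = 20 and B = 22.

```python
def steinhaus_similarity(x1, x2):
    # S = 2W / (A + B): W = sum of the species-by-species minima,
    # A and B = total abundances at the two sites
    W = sum(min(u, v) for u, v in zip(x1, x2))
    A, B = sum(x1), sum(x2)
    return 2 * W / (A + B)

def percentage_difference(x1, x2):
    # a.k.a. "Bray-Curtis" in some software; D = 1 - S
    return 1 - steinhaus_similarity(x1, x2)

# Hypothetical vectors: minima 2, 3, 4, 5, 1 (W = 15), sums A = 20, B = 22
x1 = [2, 3, 4, 5, 6]
x2 = [7, 5, 4, 5, 1]
print(round(steinhaus_similarity(x1, x2), 3))   # 0.714
print(round(percentage_difference(x1, x2), 3))  # 0.286
```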
Doing that would not mean anything for physical or chemical variables: adding temperature to pH to salinity and so on would be meaningless. So you see that in the construction of this coefficient, the operation of dividing by the row sum makes sense only for frequency data. You do that for site 1 and for site 2, and if you did only that transformation, it would be the transformation into species relative frequencies, what we call the profile transformation. And, setting aside the weighting term, the rest of the equation is the Euclidean distance applied to these profiles: take the difference for each species, square it, sum the squares, take the square root. Now, what is the remaining part? It is the weighting of each squared difference by the total abundance of the species in the data set.

Here I have made up a small data set. I think I should replace this example by the one Daniel Borcard showed, which has six species and looks more meaningful, but for the calculation this is enough for you to understand what we are doing. So I have three sites with five species, and here are the row sums; now we will also take into account the column sums. If we first divide each value by its row sum, we obtain these values, and the distance between species profiles would be the Euclidean distance between the values of that table. Then these terms intervene: the column sums divided by the grand total, 87. The division by the grand total actually has little effect on the coefficient; it is the column sum that has a very important effect. When we divide a squared difference by a small column sum, like 10 or even less, we give that squared difference a high weight, whereas dividing by a large column sum gives it a low weight. So at the end of the story, when a species is abundant in the whole data set it receives a low weight in the calculation of the distance, and when a species is rare in the whole data set it receives a high weight, because of the division by the column sum.

What does that mean ecologically? It makes a lot of sense for ecologists to use a coefficient like this, because a rare species found at two sites is a better indication of their similarity than a common species found at two sites. Ubiquitous species are found everywhere; you will always find them in the comparison of two sites, so they carry little information. But a rare species found at two sites may mean that these two sites share some property, physical, or soil chemistry, or whatever, that makes them suitable for that species. It is for that reason that the chi-square distance, and hence correspondence analysis, is interesting for the analysis of community data.

However, in 1986 we held a workshop on numerical ecology in Roscoff, in Brittany. Professor Cardi was present at that workshop, and other people of course; there were about 50 participants. In the mornings we had presentations of methods by methodologists and statisticians; we even invited psychometricians to talk about their methods. In the afternoons there were working groups of ecologists discussing the application of the methods presented in the morning to their ecological data. When we discussed the application of this coefficient in particular, it produced an opposition between two groups of ecologists. Some ecologists were saying: oh yes, we want to use this coefficient because of that salient property, it gives higher weight to rare species and that is very informative. Other ecologists were saying: well, in my case the rare species are badly sampled, so I cannot trust the values that I have.
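As an aside, here is a sketch of the chi-square distance calculation described a moment ago, with the row-profile step and the division by the column sums made explicit. The table is made up for illustration; it is not the one on the screen.

```python
def chi_square_distance(Y, i, k):
    """Chi-square distance between rows i and k of abundance table Y:
    D = sqrt( y++ * sum_j (1/y+j) * (y_ij/y_i+ - y_kj/y_k+)^2 ).
    Small column sums (rare species) inflate their squared differences."""
    grand = sum(sum(row) for row in Y)            # y++
    col_sums = [sum(col) for col in zip(*Y)]      # y+j
    ri, rk = sum(Y[i]), sum(Y[k])                 # row sums y_i+, y_k+
    s = sum((Y[i][j] / ri - Y[k][j] / rk) ** 2 / col_sums[j]
            for j in range(len(col_sums)))
    return (grand * s) ** 0.5

# Hypothetical 3 sites x 5 species table
Y = [[10, 10, 20, 10, 10],
     [10, 15, 10,  5,  0],
     [ 0,  5, 10, 10,  5]]
print(round(chi_square_distance(Y, 0, 1), 3))
print(round(chi_square_distance(Y, 0, 2), 3))
```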
We realized that people who sample things that they can see, like vegetation (when you look at trees, they do not walk away, which is very nice), can trust the fact that when they counted a rare species it was really there. Other ecologists sample things that move, or sample with very small units like insect traps: a rare species may be present in the surroundings but absent from the trap, simply because it passed by and the sampling units are very small. So we concluded that when you can trust your rare-species data, the chi-square distance is fine and it is what you want; but when you cannot trust your rare-species data, do not use the chi-square distance, because it gives high weight to data that are highly variable because of the sampling. And at that point we were stuck, because, as Daniel mentioned, at that time we thought that for ordination it was either principal component analysis or correspondence analysis, or principal coordinate analysis of course, but the two main contenders were PCA and CA. We could not trust principal component analysis because of the double-zero problem and because it implements the Euclidean distance, and now we could not trust correspondence analysis in half of the cases because it gives high weight to rare species. So for about 10 or 15 years, the literature developed around ways of removing rare species from the data table before computing a correspondence analysis.
Daniel even developed one of the best methods to decide which of the rare species should be eliminated: you eliminate them stepwise, starting with the rarest, look at the result of the correspondence analysis, then eliminate the next rarest, and so on; at each step you look at the eigenvalues of the axes and at the total chi-square, you make a graph of that, and from the graph you can decide where to stop eliminating rare species. That lasted until about the year 2000. At that time, in discussions with Daniel in our lab and with Eugene Gallagher in Boston, we talked about transformations, and I think that gave us a new avenue: with transformed species data we could use principal component analysis on species abundance data, instead of correspondence analysis, in cases where we could not trust the rare species abundances. OK, so that is the story of the development of ideas in this field, but it is important to understand this property of the chi-square distance. Is that clear for everybody, or do you have questions about it? [Question from the audience.] Simply a transformation? I do not see which transformation would work by itself. I mentioned that we can eliminate the rarest of the rare species, and that during 15 years people developed different methods to choose which species to eliminate; that is one possibility. But if you simply transform the species data using the chord or the Hellinger transformation, then you can use principal component analysis, and in that case PCA does not react to the presence of the rare species. You can just leave them in the table; you do not have to remove them, because removing them or leaving them there produces the same result at the end, so it is easier. So you have these two choices: eliminate the rare species and go to CA, or keep the rare species, transform the data, and go to PCA. Another question? Yes, the log transformation is another possibility.
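The Hellinger transformation just mentioned is simple to write down: within each site, take the relative abundances and then their square roots. A minimal sketch follows; a PCA would then be run on the transformed table (for example with rda() in vegan, if you are working in R).

```python
def hellinger(Y):
    # y'_ij = sqrt( y_ij / y_i+ ): square root of the relative abundances.
    # A PCA of the transformed table preserves the Hellinger distance.
    out = []
    for row in Y:
        rsum = sum(row)
        out.append([(v / rsum) ** 0.5 if rsum > 0 else 0.0 for v in row])
    return out

Y = [[1, 0, 4], [0, 1, 8], [0, 3, 0]]   # hypothetical abundances
for row in hellinger(Y):
    print([round(v, 3) for v in row])
```

A handy check: after the transformation, the squared values in each row sum to 1, which is why rare species no longer dominate the analysis.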
For instance, if you want to try it, take the spider data you were working with yesterday for ordination and do an ordination after the log transformation, log(x + 1), and after the Hellinger transformation, and run a PCA on each of the two: the PCAs are very similar. But the log transformation is not sufficient for what we will do on Thursday, because the resulting distance misses some important properties; just for ordination, though, it is fine, it is a good idea. OK. This is a very important topic for ecologists, to know what to do: some people are telling us to do this, others are telling us to do that, and why? I am trying to give you the lines of reasoning leading to these different solutions, and then you decide what to do. I do not know; that is your decision.

Good. Now I am going to talk about another important distance, the Gower distance. It was not developed for species abundance data; there is a form of the Gower distance that can be used for species abundance data, but that is not our main concern. The Gower distance has suddenly gained importance for the study of species traits: in the FD package, for instance, the basic calculations for the functional diversity indices can be based on the Gower distance, and this is what is done in the FD package. So people now want to understand that distance, and I am going to explain it quickly. It is available in several R packages, but the different packages do not necessarily implement the Gower distance to its full extent, so I will try to describe how it works and what the main variants are. In vegan, for instance, the Gower distance is calculated correctly for quantitative data only; I do not think vegan handles factors, but other packages do. Then there is the problem of missing data: vegan does not handle missing data, while FD does, and in FD you can also add weights, which vegan does not do.

So here I will describe the application of the Gower distance to quantitative data, and then I will briefly run you over what to do for factors and for presence-absence data. This example has seven variables; let us say they are physical or chemical variables of the environment, the water, the soil, whatever, and I limited the example to small values. We are going to compare two objects from a data table that contains, suppose, 50 objects. Here I made up two objects that do not really exist: one that has the minimum value of each variable in the whole data table, and one that has the maximum, just for illustration; with a real data table there would be a search through every variable to find its minimum and maximum. The basic equation is the one at the top: the Gower coefficient is a sum, over the variables, of a small similarity (small because I use a lowercase s), computed for each variable. So you compute a similarity between 0 and 1 for each variable, you sum them, and you divide by the number of variables. That is the simple but very efficient method that John Gower developed during the 1950s to solve a problem commonly found with ecological data, and with taxonomic data as well. There is a video somewhere on YouTube where John Gower recalls the story of the development of that coefficient, and what it was like to use a computer at that time. That was before the invention of programming languages; the first one was FORTRAN, the "formula translator", and before FORTRAN you had to program in basic machine language, in many cases using hexadecimal code. He tells the story of what it was to compute a similarity coefficient at that time; it is a nice story. So he
developed it there. John Gower, by the way, was the successor of Ronald Fisher as chair of statistics at the Rothamsted Experimental Station, and he spent his career developing methods for numerical taxonomy and for ecology. He was with us at the Roscoff conference that I talked about earlier.

OK, so what do we do? For each quantitative variable we compute this small standardized similarity s for variable j. First we compute the difference between the two values: the difference for variable 1 is 1, for this variable it is 2, and so on; here there is no difference, so we put a 0. Then we look at the range of each variable in the data matrix, that is, the difference between its minimum and maximum: here the maximum difference you can find is 1, there the maximum difference you can find in the data table is 4, and so on. You divide the actual difference by that maximum difference to obtain these values, and you turn each into a similarity by taking 1 minus that, which gives this. Then you sum all of them, as in the equation, and divide by the number of variables, 7 in this case, to obtain the similarity. Now, since in the R language everything has to be a dissimilarity or a distance, we take 1 minus the similarity to obtain a dissimilarity. This dissimilarity will not have the property of being Euclidean, which I will describe in a moment, so we will have to wonder what to do about that.

Other types of variables can be handled too. Presence-absence data use this same equation: if the two values are identical, 0-0 or 1-1, you obtain a similarity of 1 for that variable; otherwise it is 0. For multi-state factors, qualitative variables, you use the same logic: if the two sites have the same state of the variable, the similarity for that variable is 1; if they do not have the same state, you put a 0. So this sort of variable is easy to handle. For ordered variables, John Gower simply treated them as quantitative; there have been some later proposals, especially by János Podani in Hungary, to rework the states of an ordered variable in different ways, and this is described in the documentation of the gowdis function in the FD package, for instance, or in my book. But that is a small point.

The other main point I would like to mention is that, using the formula at the top, we can add weights to these small similarities; when we give different weights to the variables, instead of dividing by p we divide by the sum of the weights. This is rarely used, but it can be useful in circumstances where some variables are linked together to code for one thing, so you may want to give them less weight than other variables that are used fully. The main use of this idea of weighting variables, however, is that it can handle missing values. If you have a missing value here, you could of course exclude the whole variable from the data set, and this is what most software would do; a more intelligent program excludes the missing value only from the comparisons where it occurs, and keeps the variable for comparisons with other sites where the value is not missing. So the idea is: if there is a missing value, you exclude that variable from the calculation for that pair of objects, so you have one small similarity fewer than for other pairs; then you count how many of the variables do not have missing values and divide by that number, the sum of the weights. The trick is to see that a missing value gives that comparison a weight of zero, while fully present values give it a weight of one.
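Putting the pieces together, here is a sketch of the Gower dissimilarity for quantitative variables, with the range standardization and the missing-value weighting just described. The data table is hypothetical, with None standing for NA.

```python
def gower_dissimilarity(X, i, k):
    """Gower dissimilarity between rows i and k of a quantitative table X.
    For each variable j: s_j = 1 - |x_ij - x_kj| / range_j.
    A missing value (None) gets weight 0 and is dropped from the average.
    D = 1 - S."""
    p = len(X[0])
    num = den = 0.0
    for j in range(p):
        col = [row[j] for row in X if row[j] is not None]
        rng = max(col) - min(col)
        if X[i][j] is None or X[k][j] is None:
            continue                      # weight 0: skip this comparison
        s = 1.0 if rng == 0 else 1 - abs(X[i][j] - X[k][j]) / rng
        num += s                          # weight 1 for a complete pair
        den += 1
    return 1 - num / den

# Hypothetical table: 3 objects, 4 quantitative variables, one missing value
X = [[2.0, 10.0, 0.5, None],
     [4.0, 30.0, 0.5, 1.0],
     [6.0, 50.0, 2.5, 3.0]]
print(round(gower_dissimilarity(X, 0, 1), 3))  # 0.333 (3 usable variables)
print(round(gower_dissimilarity(X, 1, 2), 3))  # 0.75  (4 usable variables)
```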
By summing the weights, you then obtain a correct coefficient that excludes data from a comparison only when there is a missing value. This is a great advantage for our kind of data, especially the physical and chemical variables we collect in the field: if the pH meter's battery dies, you have a real missing value, not a zero, and here you can exclude those values from the comparison. Or think of people who work in paleontology and code the characters found on fossils: fossils are rarely complete, so sometimes pieces are missing and you do not know the value of the variable you wanted to measure. You put NA in the data file, meaning missing value, and this coefficient will handle the data correctly. So there are all sorts of applications for this sort of thing, and all of it is available in the gowdis function of the FD package. A very useful coefficient, developed by John Gower. There are of course many other coefficients, and I will mention more of the quantitative coefficients in this group of asymmetric coefficients when I come back to the subject on Thursday.

Now, I showed that many of these coefficients were first computed or described as similarities, and when we want to turn them into dissimilarities we can do it in different ways. I changed the version of the handout "Some measures S and D" on the web page last night: I added this page from my book, which is useful to have in the handout, so you can later go back and download the new version that has this extra page. We are mainly concerned with transforming similarities into dissimilarities, because the R packages that use this sort of matrices require that they be dissimilarities. There are three ways to do it. The most common is to take 1 minus the similarity, when you have a coefficient that is already bounded between 0 and 1.
zero and one or then you can take the square root of one minus the similarity and this has also been suggested but there is no not much use for that form so the big decision will have to be made between this and that and it is not a trivial trivial thing you will not obtain the same result at the end I will just mention that for coefficients that do not have a fixed maximum for instance if you had the if you had a coefficient like do you clearly understand that does not have a fixed maximum then you could take each value and divide by the maximum in the dataset to obtain a norm value of the similarity that is between zero and one or you could do it in that way also and this is if you want to turn the distances into similarities but we don't want to do that in our so I will now discuss the difference between that and that okay and this is a more mathematical subject so if you are allergic to mathematics you can go and have coffee immediately because I'm going to talk a little bit about the mathematical part so let's see I'm going to go to my chapter on distances and I will go to page 311 which is the deficient definition of a proper distance here it is so the this is the mathematical definition of proper distance a proper distance has a minimum of zero and then other values that are not zero they are positive you cannot have a negative distance okay makes sense and then all the distances that Daniel oh yes yes thank you good idea a bit more yeah why not okay so the first two properties are pretty obvious and then all the distances that we will also describe are symmetric that is if you compute it between site one and site two or between site two and site one you obtain the same value so all these the first three are found in all our coefficient number four is an important one the triangle in equality this is what makes a distance metric a triangle in equality is something that seems to be trivially simple so if you have three points x1 x2 and x3 and you look at the 
distances among them, the sum of any two of the distances must be larger than or equal to the third; it cannot be smaller. Equality would mean that x2 has migrated onto the line between x1 and x3, so that this distance plus that one equals the third; but the sum can never be smaller. Now, what happens with many of the distance coefficients that we use all the time? I will go back to page 311 to give you an example; it will be easier for me if I scan at lower resolution. Here is the example. This is the kind of data that makes full sense for ecologists: three sites, five species, with these values, no problem. Here I apply the distance corresponding to the binary Sørensen coefficient. The Sørensen similarity is 2a / (2a + b + c), and the distance is 1 minus that similarity; in other words, (b + c) / (2a + b + c). That is what I use here. We calculate the three distances: between sites one and two we obtain 1; between sites one and three, 0.5; between sites two and three, 0.43. And 0.5 plus 0.43 is 0.93, which is smaller than 1. Maybe I made a mistake in my calculations; try it yourselves, but I would be surprised if you found another result. So here is a coefficient that we use all the time and that violates the rule a coefficient must satisfy to be metric. I will now show you the same thing with another example, a bit further down; I think it is on the next page. Yes, here it is: the percentage difference, alias Bray-Curtis. In my book there is a note at the bottom of the page explaining the story I was telling you about that coefficient and its name not being what it is supposed to be. Here is an example from the work of László Orlóci, who was a vegetation scientist, and then another example with quantitative data: between sites one and two we have this value, between sites one and three that one, and between sites two and three that one; and again, the sum of two of the distances is smaller than the third.
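A violation like this is easy to reproduce numerically. Here is a small sketch in Python (the course itself works in R; the species lists below are made up for illustration, not taken from the book's example), computing the Sørensen dissimilarity from the a, b, c counts:

```python
# Sketch: the Sørensen dissimilarity D = (b + c) / (2a + b + c) can violate
# the triangle inequality. Illustrative presence-absence data: the species
# observed at each of three sites.
site1 = {"sp1", "sp2"}
site2 = {"sp3", "sp4", "sp5"}
site3 = {"sp1", "sp2", "sp3", "sp4", "sp5"}

def sorensen_distance(x, y):
    """1 minus the Sørensen similarity for two sets of species."""
    a = len(x & y)  # species present at both sites
    b = len(x - y)  # species present only at the first site
    c = len(y - x)  # species present only at the second site
    return (b + c) / (2 * a + b + c)

d12 = sorensen_distance(site1, site2)  # 1.0 (no shared species)
d13 = sorensen_distance(site1, site3)  # 3/7, about 0.43
d23 = sorensen_distance(site2, site3)  # 2/8 = 0.25
# A metric coefficient would require d13 + d23 >= d12, but:
print(d13 + d23, "<", d12)  # about 0.68 < 1.0: triangle inequality violated
```

No triangle can have side lengths 1, 3/7 and 1/4, which is exactly the situation where a rigid stick-and-joint construction cannot close.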
Another violation of the triangle inequality. These two coefficients, the one I showed first and this one, are related, because the first is the binary form of the second; but many of our coefficients violate the triangle inequality. When we come to ordination (this is where I want to take you), when we want to do an ordination with principal coordinate analysis: principal coordinate analysis can be imagined like one of those games children had in the old days, before kids received electronic games; real games that you handle with your hands, with blocks with holes, and sticks, and you assemble the sticks and try to make a construction. Kids learn to manipulate things, they may learn some principles of geometry, and they may learn that some distances are not Euclidean, because they cannot quite close their constructions. What happens with this sort of distance, especially when many of the distances are not Euclidean, is that some of the triangles will not close: you have a joining unit here and one there, you place a stick between them, then another joining unit; this piece should reach that one, but the stick will not reach. So you have a problem: you cannot fully represent your distances by a rigid construction. Are there people from Brussels in the room? Nobody from Brussels? Many of you may have been to Brussels and seen the symbol of the international exhibition of 1958, the Atomium, with its spheres connected by corridors. Well, imagine that some of those corridors were too short: people would walk along and just fall off at the end. That is the consequence of violating the triangle inequality: some of the sticks are too short to reach the other end. And in principal coordinate analysis we are trying to produce a Euclidean representation of a multivariate data set that
contains all these distances. If you have distances that are too short, the method will still manage to produce a fully Euclidean representation, but to do so it has to borrow some pieces of distance to make the sticks long enough for the construction. At the end of the list of eigenvalues, you pay for that with ugly things called negative eigenvalues. Negative eigenvalues measure the amount of variance that principal coordinate analysis had to invent, in the first group of axes, to make the construction fully Euclidean; the extra axes at the end of the list carry negative eigenvalues to compensate for the variance that had to be invented. Say you have one hundred points: you will have ninety-nine axes altogether; the first ninety may be fully Euclidean, while the last nine may have negative eigenvalues. So what do we do with negative eigenvalues? The first solution is to forget about them: if you want an ordination in two dimensions, you use the first two axes, the ones with large positive eigenvalues. Even with small negative eigenvalues at the end of the list, the first two axes will still provide a good representation of the distances among the points. But we will see applications in this course, including on day five, where we will need all the eigenvectors. We will also see that tomorrow, when we use principal coordinate analysis as a method of transformation of the data: we start with a data matrix Y, produce a distance or dissimilarity matrix D, and go to principal coordinate analysis to turn that back into a rectangular data matrix; and we want this new matrix to recover the information on all axes, because we want to use the
dissimilarity function as a transformation of Y into something that will then be used in RDA, where the transformed Y is compared with a matrix X of environmental variables. In that case, principal coordinate analysis is a transformation of the data through a well-chosen dissimilarity measure; we will do that tomorrow, and in that case we want to recover all the axes. But if some axes have negative eigenvalues, what happens to the eigenvectors? I have to go back quickly to the document on PCA and PCoA, of which I showed you only the PCA portion yesterday; there is an example of PCoA in there that Daniel did not quite go through. In PCoA, quickly: you start with your distance matrix and transform it like this; each distance is squared and multiplied by minus one half. Then this matrix is centered using this equation, which makes the rows and the columns sum to zero; in the ordination diagram, it simply brings the centroid of the cloud of points to the center of the diagram. We will see on Thursday that this same transformation is used again to obtain very interesting coefficients for beta diversity studies; the diagonal values here will be very useful. After this transformation and that one, we compute the eigenvalues and eigenvectors, and this is where the negative eigenvalues pop up, when there are any. Then, the eigenvectors: I told you that the function eigen() of R produces eigenvectors with a length, or norm, of one. We change that norm in principal coordinate analysis; "we" meaning that John Gower told us to change it, because this is the way to reconstruct the original distances. We take the eigenvectors, which come out of eigen() just fine, and we multiply each one by the square root of its eigenvalue. And what is the square root of a negative eigenvalue? An imaginary number, quite right.
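The PCoA steps just described (square each distance, multiply by minus one half, double-center, extract the eigenvalues) take only a few lines. This is a sketch in Python with numpy rather than the course's R code; the distance matrix is a small illustrative Sørensen example in which the triangle inequality fails, so a negative eigenvalue shows up:

```python
import numpy as np

# Sketch of Gower's transformation in PCoA (illustrative, not the course's R code).
# D holds Sørensen dissimilarities among three sites; the triangle inequality
# is violated here (3/7 + 1/4 < 1), so D is not Euclidean.
D = np.array([[0.0, 1.0, 3 / 7],
              [1.0, 0.0, 0.25],
              [3 / 7, 0.25, 0.0]])

A = -0.5 * D**2                      # step 1: a_ij = -(d_ij^2) / 2
n = D.shape[0]
C = np.eye(n) - np.ones((n, n)) / n  # centering matrix
G = C @ A @ C                        # step 2: double-centering; rows and
                                     # columns of G now sum to zero
eigenvalues = np.linalg.eigvalsh(G)  # step 3: eigen-decomposition
print(np.round(eigenvalues, 4))      # the smallest eigenvalue is negative
```

Gower's scaling would then multiply each eigenvector by the square root of its eigenvalue, and that is exactly where a negative eigenvalue turns into an imaginary number.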
and when we multiply our real-number eigenvectors by an imaginary number, we obtain complex numbers. How do you represent a complex number in an ordination, or use it in an RDA? I do not know how to program that; maybe some physicist would, but I do not. So people have been looking for ways of modifying the distance matrix. This is what it is to be non-Euclidean: it is to have sticks that do not fit, and it translates into negative eigenvalues in the principal coordinate analysis. Actually, the diagnostic (not the criterion, the diagnostic) for non-Euclideanarity, a word invented by John Gower, is to do a principal coordinate analysis of the distance matrix and look at the eigenvalues: as soon as there is a negative eigenvalue in the list, we say that the matrix is not Euclidean. There is a function in R for this, to be found in my practicals this afternoon: is.euclid(). It asks "is that matrix Euclidean?", and the answer is TRUE or FALSE. This function is in the package ade4; we ask the question about the matrix, and the function tells us TRUE, it is Euclidean, or FALSE, meaning it has negative eigenvalues. So it is easy for us to check. Now, there was a lot of discussion on how to take a distance matrix that is not Euclidean and turn it into a Euclidean one. At first, John Gower suggested two corrections that were already available in the literature for other purposes: one is called the Lingoes correction and the other the Cailliez correction. Michael, I found the discussion between you and Daniel Chessel about that, back in 1997 I think; I received copies of it from Daniel. So it is not a trivial question. But a bit later we found that there is an even simpler way of turning a non-Euclidean distance into a Euclidean one, and it works in almost all cases: take the square root of the distance. Taking the square root of the distances, that is what we had, not here, not here... there, yes: this column of the table. In all those cases, with all these distance coefficients that do not produce Euclidean distances, we can solve the problem by taking the square root of the distance. When we change the similarities into distances, or since R produces the dissimilarity or distance matrices for us, we simply take the square root before use: just write sqrt() of the matrix, which takes the square root of each value, and that is what we feed into principal coordinate analysis. And boom, the negative eigenvalues disappear. Isn't that great? Now, is this a trivial problem that concerns only one application in ten thousand? Not if you are an ecologist, because most of the dissimilarity functions that we use in ecology are like that: they can produce negative eigenvalues because they are not Euclidean; indeed, even metric coefficients can produce negative eigenvalues in some cases. There are tables in my book, in chapter 7, saying which of the similarity and dissimilarity coefficients are metric or not, and which ones are Euclidean or not; some may be metric but not Euclidean, and the concern is with those that are not Euclidean: those are the ones for which we have to take the square root. To terminate this subject, I can show you a picture, again produced by John Gower, in my chapter 9, page 500. Ah, the famous page 500! Here it is; see, page 500, a very famous page for a very famous picture. This is an example of distances that are metric but not Euclidean. Look at that: we have four points, 1, 2, 3 and 4. For each triplet we have a well-shaped triangle in which the sum of any two distances is larger than the third: this one, this one and this one. But in this triangle x4 is there, in that one x4 is here, and in that one x4 is there: when you make a construction with several points, the distances are too short for x4 to meet in one point. And that situation exists in our data.
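The square-root cure is just as easy to check numerically. A sketch in Python (in R one would simply take sqrt() of the distance matrix before the PCoA), using a small illustrative non-Euclidean Sørensen matrix, before and after taking the square root of each value; the helper name pcoa_eigenvalues is ad hoc:

```python
import numpy as np

def pcoa_eigenvalues(D):
    """Eigenvalues of Gower's double-centered matrix for a distance matrix D."""
    n = D.shape[0]
    C = np.eye(n) - np.ones((n, n)) / n
    return np.linalg.eigvalsh(C @ (-0.5 * D**2) @ C)

# Illustrative Sørensen dissimilarities; the triangle inequality is violated
# (3/7 + 1/4 < 1), so the matrix is not Euclidean.
D = np.array([[0.0, 1.0, 3 / 7],
              [1.0, 0.0, 0.25],
              [3 / 7, 0.25, 0.0]])

print(pcoa_eigenvalues(D).min())           # negative: D is not Euclidean
print(pcoa_eigenvalues(np.sqrt(D)).min())  # no longer negative (up to rounding)
```

After the square root, the three dissimilarities (1, about 0.65, 0.5) satisfy the triangle inequality and embed in the plane, so the negative eigenvalue disappears.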
We see it all the time, and this is what makes a coefficient non-Euclidean, so that it produces negative eigenvalues in principal coordinate analysis. This problem has created a lot of turmoil in the literature, so much so that in some packages it was said that coefficients with this property should never be used. Well, that would be a hindrance for ecological analysis, because most of our coefficients are like that. Other packages, like the PRIMER package developed at Plymouth (there is somebody from Plymouth here, yes), the package developed by Clarke and Warwick: for many years it did not have principal coordinate analysis, because principal coordinate analysis could produce negative eigenvalues, and they pushed non-metric multidimensional scaling for that single reason. Principal coordinate analysis has now been reintroduced, since about the year 2000, in the extension called PRIMER-E, and peace has resumed among users, because now we know that with this square-root transformation we solve the problem of the negative eigenvalues. You see, when faced with problems like that, ecologists and software developers have tried different solutions in good faith, to offer their clients things that would work; and now, with a bit more research, we find that there are solutions that can satisfy everybody.