Thank you very much to the organizers. It's a pleasure to share my thoughts on CoDA for the microbiome with you on this CoDA day. Today I'm going to talk about variable selection in microbiome studies, based on a paper we published last year together with Toni Susin, Yiwen Wang, and Kim-Anh Lê Cao in NAR Genomics and Bioinformatics. In this work we focus on the characterization of what is called microbial dysbiosis, meaning an alteration of our microbiota, the microbiota being all the microorganisms that live in our body. Microbial dysbiosis has been associated with several diseases, especially metabolic and inflammatory diseases. Our goal is to identify the species whose abundance increases with the disease, those whose abundance decreases, and those whose abundance is not altered. So we have three kinds of species with respect to a disease or an intervention. When we talk about variable selection, there are slightly different goals. One focuses on understanding the biological mechanisms: there we want to identify all the species that are altered in one direction or the other. Another goal focuses on prediction, where the aim is to define a microbial signature; it may not contain all the components that are altered, but it works well for prediction or regression. This is important because some models will be more useful for one purpose than for the other. So let me briefly explain what microbiome data is. Microbiome data is obtained after taking samples, for instance stool samples for the gut microbiome. The DNA is sequenced, and sequences that are almost identical, allowing about 3% difference, are grouped together into what is called an OTU, an Operational Taxonomic Unit.
So we can count, for example, how many reads we have for OTU1, OTU2, and so on, but we need to know which species these operational taxonomic units correspond to. For this there are several databases that tell us the species, genus, family, order, and so on; this is called the taxonomic assignment of the microbes, usually bacteria. This is important because, as you will see, we can analyze microbiome data at different levels of the taxonomy. After this process we have, for each sample (here the samples are the rows), a dependent variable, for instance the disease status, and the number of reads observed for each OTU. Sometimes we may want to analyze the data at a higher taxonomic level, and then we agglomerate or amalgamate the data by simply summing the counts. So microbiome data is count data: the number of DNA reads for each variable. There is large variability in the total number of counts per sample; this total is not informative and is constrained by the technology used, and the dimensionality of the problem depends on the taxonomic level of agglomeration we are considering. Okay, so microbiome data is clearly compositional, and I guess that in this audience you know that variable selection with compositional data is complex. Let me go through this. Imagine the following situation in the environment, before we observe the data: a disease affects taxon 1 by increasing its abundance by a fold change F1; taxon 2 is altered by a decrease, F2; taxon 3 is also decreased; and the last two taxa are not affected. (We call "taxa" the components at whatever level we are analyzing.) So variable selection means testing, for each component, whether its fold change is equal to 1 or different from 1.
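The agglomeration step described above is just a sum of counts over the OTUs that share the same taxonomic assignment. A minimal sketch with made-up counts and a hypothetical OTU-to-genus mapping (the genus names are only illustrative):

```python
import pandas as pd

# Hypothetical OTU count table: samples as rows, OTUs as columns.
counts = pd.DataFrame(
    {"OTU1": [120, 85], "OTU2": [30, 44], "OTU3": [400, 310]},
    index=["sample1", "sample2"],
)

# Hypothetical taxonomic assignment: OTU -> genus.
taxonomy = {"OTU1": "Bacteroides", "OTU2": "Bacteroides", "OTU3": "Prevotella"}

# Agglomerate to the genus level by summing the counts of all OTUs
# assigned to the same genus.
genus_counts = counts.T.groupby(taxonomy).sum().T
print(genus_counts)
```

The same idea applies at any taxonomic level: only the mapping changes.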
But this univariate differential abundance testing results in a large proportion of false positives, because a change in the abundance of one or several species induces changes in the observed abundances of the other species, even if those taxa were not affected in the original environment. When we observe the data after sequencing, we observe a change. So let me specify this bias. The control group has a closed composition P, and we have a vector of fold changes F1, ..., Fk, all positive, some of them possibly equal to 1. The observed proportions after the disease, or after whatever has affected the original composition, are normalized: they are affected by the normalizing constant, which in this case has this form. So if we perform a univariate test comparing the proportions with and without the disease on the log scale, the effect is biased by the log of the normalizing constant, and for the components with fold change equal to 1, the components that were not affected, we observe a spurious effect. Here I just write the result for two instances, but this would be the expected value of the bias, the expected observed effect, written out across individuals. I think this is already known in the CoDA world, but it is interesting to quantify this quantity. So what are the options for a microbiologist who acknowledges this problem? They will go to the CoDA toolbox and pick one of the CoDA methods that is appropriate for testing these effects.
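This bias is easy to verify numerically. The sketch below uses a made-up composition and fold-change vector: after renormalization, every component's observed log-proportion effect is shifted by the same constant, minus the log of the normalizing constant, so even the unaffected components (fold change 1) show a spurious effect:

```python
import numpy as np

# Hypothetical baseline composition (controls) and fold changes (cases);
# the last two taxa are NOT affected (fold change 1).
p = np.array([0.2, 0.2, 0.2, 0.2, 0.2])
f = np.array([5.0, 0.5, 0.1, 1.0, 1.0])

# Observed case composition: fold changes applied, then renormalized.
q = f * p / np.sum(f * p)

# Univariate effect on the log-proportion scale.
effect = np.log(q) - np.log(p)

# The true effect is log(f); the difference is the same constant for
# every component: minus the log of the normalizing constant sum_j f_j p_j.
bias = effect - np.log(f)
print(bias)
print(effect[3:])  # unaffected taxa show a spurious nonzero effect
```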
Okay, so maybe the most straightforward option would be the centered log-ratio (CLR) transformation, because, as you know, the CLR transformation brings the data from the simplex to real space, and then, one might think, everything is fine. Or is it? I would say this is perhaps the most important misunderstanding in CoDA: that once you have your data in real space you can forget about the compositional problem. As I will show, this is not true for variable selection. So if I do the same very simple computation as before: I compute the CLR before the intervention, I compute the CLR after the intervention, and I take the difference between both. There is a bias. We were expecting the real effect, the logarithm of the fold change, but the result when we use the CLR is biased. The bias is different: here it is minus the log of the geometric mean of the fold changes, but it is still a problem. And for the components that were not affected, instead of zero we get a difference of minus the log geometric mean of the vector of fold changes. So we cannot escape this bias. When we work with log proportions we have the bias I described before, and let me elaborate a bit more on it: based on the normalization constant, this bias can be written as minus the log of one plus an expression that depends on the fold changes that are different from one, that is, on the components that really changed. So the bias with log proportions depends on the magnitude of the fold changes and on the abundances of the components that have been shifted by the intervention or the disease. The CLR bias, instead, depends only on the vector of fold changes, not on the abundances: it can be written as minus one over K, where K is the total number of components, times the sum of the logarithms of the fold changes different from one (the logarithms of the others are zero).
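The CLR bias can be checked the same way. With a made-up composition and fold-change vector, the difference of CLRs equals the log fold change minus the average of the log fold changes, so again the unaffected components show a nonzero effect:

```python
import numpy as np

def clr(x):
    # Centered log-ratio: log of each part minus the average of the
    # logs (the log geometric mean).
    logx = np.log(x)
    return logx - logx.mean()

# Hypothetical composition and fold changes; taxa 4 and 5 unaffected.
p = np.array([0.2, 0.2, 0.2, 0.2, 0.2])
f = np.array([5.0, 0.5, 0.1, 1.0, 1.0])
q = f * p / np.sum(f * p)

effect = clr(q) - clr(p)

# The normalizing constant cancels, but a new bias appears: minus the
# mean of the log fold changes, i.e. minus the log geometric mean of f.
bias = effect - np.log(f)
print(bias)
```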
Okay, so this makes a difference, and a small example shows how these biases act. Consider this five-part composition, with simulated data and fold changes of 5, 1/2, 1/10, 1, and 1. I consider two examples: one where the initial abundance of all components is the same, 20%, and another where the first three components, the ones affected by the disease, are low abundant, 2%, while taxa 4 and 5 are more abundant. Let's see what happens with the log transformation and the CLR transformation in these two examples. In the uniform case, with the log transformation we see a shift, this bias, and it produces false positives; we already knew that. For the CLR we also see a bias, but with the opposite sign: the log transformation reduces the observed abundances for the cases, while with the CLR we see an increase in the cases. What happens in the other example, where the components affected by the disease are low abundant? Here the log-proportion bias is very small, almost negligible, so the log proportions would give good results in this case. The CLR, however, gives the same result whatever the abundances of the truly affected components are, so here the CLR bias is much larger than the log-proportion bias. And you can see that this bias also reduces the power to detect, for instance, that the second component is significant. So the conclusion is that both univariate approaches, with log proportions and with CLRs, are biased. We would therefore like another approach, based on the log-ratio idea. A log ratio considers two components.
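The two scenarios can be reproduced in a few lines. The abundances of taxa 4 and 5 in the second scenario are my own made-up values, chosen only so that the affected taxa are rare; the point is that the log-proportion bias shrinks when the affected taxa are rare, while the CLR bias does not move:

```python
import numpy as np

def clr(x):
    logx = np.log(x)
    return logx - logx.mean()

f = np.array([5.0, 0.5, 0.1, 1.0, 1.0])  # fold changes; last two unaffected

scenarios = {
    "uniform (all 20%)": np.array([0.2, 0.2, 0.2, 0.2, 0.2]),
    "affected taxa rare (2%)": np.array([0.02, 0.02, 0.02, 0.47, 0.47]),
}
for name, p in scenarios.items():
    q = f * p / np.sum(f * p)
    # Both biases are the same for every component, so inspect the first.
    logprop_bias = (np.log(q) - np.log(p) - np.log(f))[0]  # -log(sum f_j p_j)
    clr_bias = (clr(q) - clr(p) - np.log(f))[0]            # -mean(log f)
    print(name, logprop_bias, clr_bias)
```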
That would be fine if only two components in the environment were affected by the disease, but if more components are affected we would like to extend this notion to include more components. So here I will introduce two extensions. One is compositional balances, the method selbal, which we published a couple of years ago and which is becoming rather popular. The other is a penalized log-contrast regression, coda-lasso. I will talk about these two approaches. The first is based on the concept of a balance, an extension of the log ratio: given a composition and two subcompositions of it, the balance between the two subcompositions is the logarithm of the ratio of their geometric means, which can also be expressed as the difference between the arithmetic means of the log-transformed variables. The other approach is based on log-contrast functions, which are linear combinations of log-transformed variables with the restriction that the coefficients sum to zero. So, we proposed selbal in this paper, and I'm happy, Jenny, that it was useful for your work. The idea is to identify two groups of components, A and B, where A contains the ones that increase and B the ones that decrease, that are related to a dependent variable, possibly adjusting for other non-compositional covariates. For a continuous response we consider a linear model where one of the covariates is this balance, which we still don't know; for a dichotomous response we consider a logistic regression containing the balance. We build the balance so that we obtain the lowest mean squared error for a continuous outcome, or the largest AUC, the prediction accuracy, for a binary outcome. The approach is as follows.
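The two equivalent forms of the balance mentioned above, the log-ratio of geometric means and the difference of arithmetic means of the logs, can be checked on a toy composition. I follow the definition given in the talk, without any normalizing constant:

```python
import numpy as np

def balance(x, idx_a, idx_b):
    # Balance between subcompositions A and B: the difference of the
    # arithmetic means of the log-transformed parts.
    return np.log(x[idx_a]).mean() - np.log(x[idx_b]).mean()

# Hypothetical 5-part composition; group A = parts 0 and 1, group B = part 2.
x = np.array([0.30, 0.25, 0.10, 0.20, 0.15])
b1 = balance(x, [0, 1], [2])

# Same value via the log-ratio of geometric means.
b2 = np.log(np.sqrt(0.30 * 0.25) / 0.10)
print(b1, b2)  # equal up to floating point
```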
We first start by replacing the zeros. Then we check all the log ratios of two components and take the optimal one, the one that optimizes the evaluation criterion. We then keep adding components to the numerator or the denominator of the log ratio while adding variables improves the optimization criterion, and we stop including variables according to a cross-validation criterion: we perform cross-validation to determine the optimal number of components the balance will contain. Okay, so let me illustrate the selbal results with this example on Crohn's disease, where we have 975 individuals, 662 with Crohn's disease and the rest with no symptoms. Here the number of individuals n is much larger than k, because we are analyzing at the genus level and have 48 taxa. In the paper you can find another example with just the opposite situation, n much smaller than the number of covariates. The first step is to estimate the optimal number of variables to be included in the balance. Here we have a binary outcome, the disease status, and in this case the optimal number was 12 variables, because including more variables did not improve the cross-validation accuracy. This is the balance we obtained. It should be interpreted as the average relative abundance of one group versus the other, and it provides a score that, in this case, separates the two groups quite well, with very good classification accuracy. We also give some additional information on this. One drawback of selbal is that it is a forward selection process, so everything depends on the previous selections being right. An alternative is an approach that jointly estimates all the parameters.
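The forward search just described can be sketched as follows. This is not the selbal implementation, just a minimal greedy search under simplifying assumptions: a hand-rolled AUC, no zero replacement, and a fixed cap on the number of variables in place of the cross-validated stopping rule; the function names and the synthetic data are mine:

```python
import numpy as np
from itertools import combinations

def auc(score, y):
    # Probability that a random case scores above a random control
    # (ties count one half).
    pos, neg = score[y == 1], score[y == 0]
    gt = (pos[:, None] > neg[None, :]).mean()
    eq = (pos[:, None] == neg[None, :]).mean()
    return gt + 0.5 * eq

def selbal_sketch(X, y, max_vars=6):
    """Greedy forward search for a balance, in the spirit of selbal.
    Assumes zeros were already replaced; stops at max_vars instead of
    using the cross-validated stopping rule of the real method."""
    logX = np.log(X)
    k = X.shape[1]

    def score(num, den):
        bal = logX[:, num].mean(axis=1) - logX[:, den].mean(axis=1)
        return max(auc(bal, y), auc(-bal, y))  # orientation-free

    # Step 1: the best two-component log ratio.
    num, den = max((([i], [j]) for i, j in combinations(range(k), 2)),
                   key=lambda nd: score(*nd))
    best = score(num, den)

    # Step 2: add one variable at a time, to numerator or denominator,
    # while the criterion keeps improving.
    improved = True
    while improved and len(num) + len(den) < max_vars:
        improved = False
        for v in set(range(k)) - set(num) - set(den):
            for cand in ((num + [v], den), (num, den + [v])):
                s = score(*cand)
                if s > best:
                    num, den, best = cand[0], cand[1], s
                    improved = True
    return num, den, best

# Synthetic demo (my own toy data): taxa 0 and 1 carry the signal.
rng = np.random.default_rng(0)
n = 200
y = np.repeat([0, 1], n // 2)
X = rng.dirichlet(np.ones(8), size=n)
X[y == 1, 0] *= 6.0
X[y == 1, 1] /= 6.0
X /= X.sum(axis=1, keepdims=True)
num, den, best = selbal_sketch(X, y)
print(num, den, round(best, 3))
```

The real method also re-evaluates the criterion by cross-validation at each size, which is what prevents the greedy search from overfitting the training AUC.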
And here we consider what we call coda-lasso, which is based on a log-contrast regression model with a constraint. This was proposed in the context of the microbiome by Lin et al., and we have implemented it. So we have the log-contrast regression model with an L1 penalization term, a lasso penalty, which can be extended to others, for instance elastic net, and we add the restriction that the coefficients sum to zero. We implemented this minimization process in two steps: soft-thresholding and projection. For the Crohn's disease case you can see how, depending on the value of lambda, the coefficients shrink to zero; at the end only two taxa have coefficients different from zero. But for comparison with selbal we take lambda equal to 0.15, which selects 12 variables. And we can present the results, and I think the presentation of the results is very important: we can display the 12 selected variables like this. Some coefficients are positive and the others are negative, because the sum of all of them is zero. And these are the variables that selbal and coda-lasso share; you can see that they overlap quite well, at least for the most important variables. Okay, so in the paper you can see the relationship between these two models through a simulation study. And let me finish by emphasizing that the compositionality of microbiome data cannot be neglected, it's very important, and that univariate differential abundance testing, whether using log proportions or a CoDA approach like the CLR, provides biased results. So it is important to have alternative approaches, and it depends on the goal: selbal is more focused on prediction and on obtaining a sparse model, the very smallest model that gives good predictions, while coda-lasso provides another view of the same problem, but in the end the results are similar.
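The soft-threshold-plus-projection idea can be sketched as a proximal-gradient loop on the log-transformed data, with a projection onto the zero-sum constraint after each step. This is a simplified stand-in, not the actual coda-lasso algorithm (composing the two steps this way is only an approximation of the exact penalized solution), and all names and the toy data are mine:

```python
import numpy as np

def soft_threshold(v, t):
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def coda_lasso_sketch(X, y, lam, n_iter=2000):
    """Proximal-gradient sketch of an L1-penalized log-contrast
    regression with the zero-sum constraint on the coefficients."""
    Z = np.log(X)
    Z = Z - Z.mean(axis=0)                 # center log-abundances
    yc = y - y.mean()
    n = Z.shape[0]
    step = n / np.linalg.norm(Z, 2) ** 2   # 1 / Lipschitz constant
    beta = np.zeros(Z.shape[1])
    for _ in range(n_iter):
        grad = Z.T @ (Z @ beta - yc) / n         # least-squares gradient
        beta = soft_threshold(beta - step * grad, step * lam)
        beta -= beta.mean()                      # project onto sum(beta) = 0
    return beta

# Toy demo (synthetic): y depends on the log ratio of taxa 0 and 1.
rng = np.random.default_rng(1)
X = rng.dirichlet(np.ones(6), size=100)
y = np.log(X[:, 0]) - np.log(X[:, 1]) + 0.1 * rng.standard_normal(100)
beta = coda_lasso_sketch(X, y, lam=0.05)
print(np.round(beta, 3))
```

Increasing `lam` shrinks more coefficients exactly to zero, which is how a single path of lambda values traces out models of different sizes, like the 12-variable model at lambda 0.15 in the talk.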
Okay, so that's all and thank you very much.