OK. If everything is fine, I think you can start.

OK, good afternoon. My name is Manuel E., and I'm a PhD student in a joint collaboration between the Stazione Zoologica in Naples and the Laboratory of Interdisciplinary Physics in Padua. The title of this school is "From single cells to cell communities", and today I will discuss some taxonomic and functional patterns in diatoms, basically following the opposite path. So I will start from cell communities, and in particular I will rely on the Tara Oceans dataset, which Martin already mentioned last week. For those of you who are not familiar with it, Tara Oceans is an expedition that collected samples all across the oceans, over a large range of organism sizes, mainly plankton. Thanks to high-throughput sequencing they produced three meta-omics datasets: the metabarcoding, which will be the focus of the first part of the talk, and then the metagenomics and metatranscriptomics, which will be the focus of the second part. Thanks to this very big dataset we can ask questions like: despite the strong biogeochemical heterogeneity of this dataset, can we observe some common rules that regulate the taxonomic composition and also the functional expression of our communities?

We focus on diatoms, mainly for two reasons. The first one is that they are important: they are a class of phytoplankton, so they play a key role in the carbon cycle, and they are also huge in terms of biomass. But the main reason here is that they are distributed ocean-wide. If we want to infer properties across all latitudes, for instance, or all temperatures, we need species that are everywhere. And although they of course adapt to different conditions, they occupy similar trophic levels. This makes them ideal candidates for investigating global patterns, let's say. OK.
So before really getting into the topics: the most important descriptor during this talk will be the species abundance distribution, which is very widely used in ecology. Here I show you a tropical forest, just for historical reasons. Let's suppose that you have a sample from your forest. A possible way to study it as a community, as a whole, is to classify the species that you observe into classes of abundance, usually on a log-2 scale. So for instance here we can divide the abundances into classes from 1 to 2, from 2 to 4, from 4 to 8 and so on, and we can count how many species fall in each class. As you can see, here we have a lot of species that are rare, with just a few individuals, and just a few species that have a lot of individuals. Properties like this are shared by a lot of communities, despite all the differences that you can find in nature. These shared features suggest that there exists some very fundamental and also simple mechanism that is shared across communities and somehow regulates this distribution. This is a question that puzzled ecologists for several years, and a possible explanation was given by the so-called neutral theory of ecology: by assuming that the species basically do not interact and just obey some very fundamental mechanisms such as birth, death and immigration, we can describe the properties that we observe in nature. And this is not just, let's say, fitting. It is also interpreting what we observe, because for instance here we can say that in a sense the interactions are weak, so they are negligible.

OK, so in the first part I will talk about some taxonomic results, so I will focus on the metabarcoding in Tara.
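The log-2 classification of abundances described above can be sketched in a few lines. This is only an illustration: the function name `sad_octaves`, the half-open class convention [1, 2), [2, 4), [4, 8), ... and the example numbers are assumptions made here, not something specified in the talk.

```python
from collections import Counter


def sad_octaves(abundances):
    """Count species per log-2 abundance class [1, 2), [2, 4), [4, 8), ..."""
    classes = Counter()
    for n in abundances:
        if n < 1:
            continue  # species absent from the sample are not counted
        k = int(n).bit_length() - 1  # exact floor(log2(n)) for integer counts
        classes[k] += 1
    # Map each class index k to its abundance interval (2^k, 2^(k+1)).
    return {(2 ** k, 2 ** (k + 1)): c for k, c in sorted(classes.items())}


# Made-up abundances (individuals per species), for illustration only.
example = [1, 1, 1, 2, 3, 5, 9, 120]
print(sad_octaves(example))
```

With a real sample, many species would be expected in the first classes and only a few in the last ones, which is the pattern of many rare and few abundant species mentioned above.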
A first work that focused on the species abundance distribution is a work by Enrico Ser-Giacomi, in which they analyzed different size classes of plankton, from the pico to the meso scale, and found that the distribution is very similar all across the ocean. So despite the differences that you can have between the polar regions and the tropical ones, the species abundance distribution is very similar. They proposed a model in the spirit of the neutral theory, so you just have birth and death terms and an immigration term, and from it you can derive your theoretical species abundance distribution, which is a simple power law with an exponential cutoff. We started by testing whether, focusing only on diatoms, we observe similar patterns, and the answer is yes. In this plot I'm representing the species abundance distributions of all the communities collected by Tara, and I also put a power-law reference with the same exponent that Ser-Giacomi found. As you can see, there are clear differences related to temperature, in particular for the coldest stations, the polar ones. But despite this diversity in total abundance, the power law is more or less the same all across the dataset. Then you can perform a maximum likelihood power-law fit, and you find that the median exponent is in agreement with the one found by Enrico. So we propose a sampling hypothesis, because the idea that we have is that the diversity that we observe is mainly due just to different sampling efforts in our datasets. In a sense we should expect that the only difference that we observe is just the total number of OTUs. So we started from a very abundant station, one of the polar ones, and we started to sample from it. In this way we can produce synthetic samples, synthetic sets, just by matching the number of sampled OTUs with the observed ones.
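The maximum likelihood power-law fit mentioned above can be illustrated with the standard closed-form estimator for a continuous power law, alpha = 1 + n / sum(ln(x_i / x_min)). This is a sketch under stated assumptions: the talk does not say which estimator or which `xmin` was used, abundances are really discrete counts, and the exponential cutoff of the model is ignored here. The function names and the value alpha = 2.0 are purely illustrative.

```python
import math
import random


def powerlaw_mle(values, xmin=1.0):
    """Continuous maximum likelihood exponent for p(x) ~ x**(-alpha), x >= xmin."""
    tail = [x for x in values if x >= xmin]
    log_sum = sum(math.log(x / xmin) for x in tail)
    if log_sum == 0:
        raise ValueError("need at least one value above xmin")
    return 1.0 + len(tail) / log_sum


def sample_powerlaw(alpha, xmin, size, seed=0):
    """Draw synthetic power-law values by inverse transform sampling."""
    rng = random.Random(seed)
    return [xmin * (1.0 - rng.random()) ** (-1.0 / (alpha - 1.0)) for _ in range(size)]


# Synthetic check: the estimate should land close to the exponent used to sample.
synthetic = sample_powerlaw(alpha=2.0, xmin=1.0, size=20000)
print(round(powerlaw_mle(synthetic), 2))
```

Fitting each station separately and then taking the median of the estimates would mirror the comparison with the reference exponent described above.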
As you can see, this very simple exercise produces quite good results. We can test it by comparing the richness, the total number of OTUs that we observe, and the agreement is quite high, although you can see that the deviations are not random: we usually overestimate the richness in the most abundant stations and underestimate it in the less abundant ones. So we propose a theoretical model in which the simple idea is just that in the polar regions, or in the most abundant stations, the sampling effort is better than in the other regions. We assume that the dynamics is the same everywhere, with the same rates, but what we observe is different from the reality because the less abundant stations are affected by the different sampling effort.

By sampling effort, do you mean getting more reads in general?

Yes.

But is it also because there is more biomass at the poles?

Exactly, yes.

It was just to understand.

Yes, and we use the ratio between the total abundances as a proxy for the ratio of the sampling efforts. That's the idea. So we can fit, as Ser-Giacomi did, the species abundance distribution of one of the reference stations, and then we can make a prediction of the species abundance distribution of the other stations just using the fitted parameters and the ratio of the total abundances as a proxy for the ratio of the sampling efforts. The results are pretty good, I would say. Then we can also compare the number of different species, and again we observe a good correlation, so the prediction is good, but again the deviations are not random, so they follow a biogeography, let's say. For us this means the following. At a first level we can say that the sub-community of a very specific condition, so in the Arctic, is large but not that large: it has more or less 7% of the total richness. So starting from the properties of one station we can infer properties anywhere in the ocean.
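The subsampling exercise behind the sampling hypothesis can be sketched as follows: reads are drawn without replacement from an abundant reference station down to the depth of a poorer station, and the number of distinct OTUs that survive is counted. This is a minimal sketch, not the procedure actually used in the analysis. The function name `subsample_richness`, the choice to match total read depth, and the reference counts are all assumptions made for illustration.

```python
import random


def subsample_richness(reference_counts, n_reads, seed=0):
    """Richness (distinct OTUs) left after drawing n_reads reads without replacement."""
    rng = random.Random(seed)
    # One entry per read, labelled by the index of the OTU it belongs to.
    pool = [otu for otu, count in enumerate(reference_counts) for _ in range(count)]
    if n_reads > len(pool):
        raise ValueError("cannot draw more reads than the reference station has")
    return len(set(rng.sample(pool, n_reads)))


# Made-up reference station: a few abundant OTUs and several rare ones (830 reads).
reference = [500, 200, 80, 30, 10, 5, 2, 1, 1, 1]
for depth in (10, 100, 830):
    print(depth, subsample_richness(reference, depth))
```

Comparing the richness predicted this way with the richness actually observed at each station gives the deviations discussed above.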
But this is not the whole story. At the same time, the deviations reveal that we should also take other ingredients into consideration. The richness of one station is a function of the sampling effort, which explains the mean properties, but then the deviations are a function of the environment, and we can also ask which ingredients we should consider. In particular we find that the deviation, delta S, is highly correlated with the geographical properties, latitude, temperature, salinity, while it is not correlated, for instance, with the nutrients, and this should be understood. Yes?

Sorry, but did you try to disentangle whether it is temperature or latitude?

Yes. I don't know why the plot is not here, but we used a random forest to try to understand which were the most important ingredients, and usually it is temperature. The strange thing is that when we look at the highest correlations, the highest one is actually not with one of these properties but with the evenness, which is a measure based on the Shannon entropy, so it is something that is not related to temperature or salinity. In other words, we were sort of puzzled, because we were expecting the deviations to show the highest correlation with temperature, so it is somehow strange that the highest correlation is with an intrinsic property of the station.

But they are sort of correlated, because the evenness is like a measure of how broad the distribution is. And if you think that sampling is the origin of the trend in diversity, you should expect... I mean, evenness is actually quite insensitive to sampling, so by measuring evenness you are sort of measuring the true contribution of diversity, which is what you also measure by looking at the deviation of diversity.

Yes, I agree. The point is that we were expecting the evenness to be correlated with the power-law exponent, so naively
one should expect to observe different evenness if you have different slopes of the power law, which was not the case. I don't know if you got my point.

No, I... OK.

Yeah, so there is something that we should understand. But as I said before, the average diversity can be understood just by the sampling hypothesis, and then we should investigate these deviations. This is more or less the take-home message for this part.

Then let's move to the second part, in which I will focus more on the functional level, so on the gene expression level. When we started to analyze the metatranscriptomic data we started to read the literature, and the species abundance distribution has already been studied in the context of gene expression. The underlying idea that we found was that gene expression was always basically power-law distributed, and the exponents were found to be similar across several species, from bacteria to humans. This was what we found in the literature, but as you can see there was something puzzling us: already here we can see some deviation from this power law. So we started, as an exercise, to look at the species abundance distribution, and if you perform a maximum likelihood fit with a simple power law you find a weird result, in the sense that for the less abundant genes the power law works pretty well, but as soon as you consider the most abundant genes you have a clear deviation from it. This is something that you can observe both for the metagenomics and the metatranscriptomics.

OK, yes. So yes, exactly, the left plot here is the species abundance distribution. Here I consider a Tara station and I look at how many genes I have in the metagenomics or in the metatranscriptomics. In which sense? Do you mean how they build the datasets? Yes, yes, they have some reference... Yes, so the question is how they cluster it. I think they use some reference genome for some of the species of
diatoms, because they are known, but I think that mainly they use a clustering algorithm, because many of the diatoms that they find in the ocean are not yet classified. OK, so once you have defined what a gene is, you can count them. You have raw counts of several genes, and then you can count how many copies of a particular gene you find in your station. For the metagenomics yes, because it is the gene in the DNA, while for the metatranscriptomics no, because it is the number of expressed genes, which is a function both of the genes that are present in the DNA and of the functions that are expressed. In this way you can make the left plot, while the right plot is the so-called rank-abundance distribution, which is basically another way to look at the same properties: you take your genes, you sort them from the most abundant to the least abundant, and then you make a scatter plot, which is related to the cumulative distribution of the left plot. So they are just two different ways to look at the same properties. Is it clear?
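The relation between the rank-abundance plot and the cumulative distribution can be made concrete with a small sketch. The function names and the gene counts below are illustrative assumptions, not part of the Tara pipeline.

```python
def rank_abundance(counts):
    """Sort non-zero counts from most to least abundant and attach ranks 1, 2, 3, ..."""
    ranked = sorted((c for c in counts if c > 0), reverse=True)
    return list(enumerate(ranked, start=1))


def n_at_least(counts, x):
    """Number of genes with abundance >= x (an unnormalised cumulative count)."""
    return sum(1 for c in counts if c >= x)


# Made-up raw counts for six genes in one station.
gene_counts = [3, 10, 1, 0, 3, 250]
for rank, abundance in rank_abundance(gene_counts):
    # When an abundance value is unique, its rank equals n_at_least for that value.
    print(rank, abundance, n_at_least(gene_counts, abundance))
```

For tied abundances the ranks differ by convention, but the largest rank in a tie still equals the number of genes at or above that abundance, which is why the two plots carry the same information.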
OK, so the first thing that one can do, of course, is to try to fit the most abundant genes. So we perform another maximum likelihood fit for the most abundant ones, and to us it seems that there are two regimes, each one following a power law with a different slope. Also, if you look at all the sets for all the stations and compare them with the metabarcoding that I showed you before, you can clearly see that the metatranscriptomics in particular follows something different from a simple power law. So the question is basically: why do we observe these two slopes, and what is their meaning? The first thing that we tried was to perform likelihood ratio tests among a quite broad class of distributions, and eventually we found that in more than 90% of the cases the selected distribution was this generalization of the power law, which has this formula here, but which is nothing else than, let's say, a double power law. Here you have a scale k, and you can see that in this function the argument is always x minus mu, where mu is just the minimum value, and then you have some exponents. When x is small this factor is basically one and you have a simple power law, this term here, while when x is large you have the product of two power laws, which is again a power law. So we started to test it on the different stations and it works pretty well. I mean that among several distributions this is the only one that is able to fit these data across several orders of magnitude. The rank abundance here, for instance, spans five orders of magnitude, and to us it was surprising that in a dataset like the one from Tara, which is usually full of noise, you can have a single function that describes five orders of magnitude of abundances of gene expression in a metatranscriptome, so in a collection of individuals. In this sense, regarding your question, we analyzed also of
course the two exponents of the two power laws, and what we found was that in the metatranscriptomics we can clearly see a dependence on the latitude: if you look at the polar regions, the exponents are lower compared to those of the more tropical areas, while this does not occur as clearly in the metagenomics. In other words, if you look at the correlations between the distribution parameters and the environmental parameters, you find that k, which is a proxy for the mean abundance of the genes, correlates with the environment, while this does not occur... And on the other hand the slope of the distribution is correlated with the environment for the metatranscriptomics, while this is not the case for the metagenomics. And yes, then you can also do a random forest.

So the most puzzling question for us is: why do we observe this distribution, or at least why do we observe two power-law regimes? The first thing that one can think is that, since we are observing a metacommunity, the double slopes are just the result of dealing with such a complex system and not with single cells. But eventually we analyzed more than 300 samples of single-cell data and we found that this double slope is everywhere. We looked not just at diatoms but also at other species, and we consistently found that a simple power law is not the law that describes the data. So our intuition is that the mechanism that shapes this distribution is something that is already present at the single-cell level, it's not just a result of the community level.

These are just... they are always gene expressions? I have known that those can be fit very well with the log-normal.

We tried with the log-normal, but the Pareto law is always the one that was selected.

But you have two more parameters.

Yes, we also did BIC, which accounts for the number of parameters, and the Pareto was the one selected. But yes, in a sense I
agree. For us the important point here is that we observe two slopes, not the Pareto. We don't believe that the Pareto law is somehow a golden rule of gene expression. The important point for us is that we observe two slopes and we want to account for them. In a sense, yes, I agree, but the log-normal is not able to describe the tails that we observe, so we have particular problems on the tails of the distribution.

Just a quick comment. If for a moment we trust this distribution and we compare different conditions within the same experiment, we find that the exponents are clustered according to the different conditions. For instance, here the blue points represent the control, and you can argue that the influence of heat stress can be seen at the community level, while the cold conditions are not as stressful as the heat ones. So you can also make some inference about how the community level, sorry, how the whole genomic level responds to different stresses. Yes?

Can you reconstruct from the Tara data the gene expression of single species of diatoms, or not?

I think that you can, but just for a very limited number, so I don't think you can do something statistical.

No, just because here you have single species, but I was wondering if you can do single species in your data.

I'm not aware of anyone who has... I mean, if you have a known diatom species I think in principle it's possible, but I think that the known species are less than 1% of the ones that we are studying. But yeah, it's something that we can try.

OK, so at the end we would like to propose, let's say, a minimal model to understand whether it is possible to explain these two slopes. We started from the central dogma. As you know, you can have promoters that are active or inactive, and they can jump from one state to the other, and then when the promoter is active
with a certain rate we can have the transcription of mRNA, which in turn can either degrade or be translated into proteins. As you might know, usually people focus on the distribution of proteins in this scenario, and that distribution is the negative binomial, which in the continuum limit is the gamma distribution, which is not what we observe. The mRNA is Poisson distributed, so it is again not something that we observe in our data. In the literature we found a paper from 2006 that proposes a slightly more complex model in which they add two ingredients: the first one is a feedback, a Hill-like feedback, which is basically a Monod function, and the second is a leakage term, by which you can have transcription of RNA in the inactive promoter state. Within this model you can derive a species abundance distribution for the proteins, which is a double power law with an exponential cutoff. Here we have two main problems. The first one is that this distribution is for proteins, not for genes. It can be translated into the language of mRNA if you consider that the mRNA is not directly transcribed as mature: it is first transcribed in a nascent state and then it becomes mature. The second point is that if you want to match the distribution that you find with the Pareto one, you have to impose that your Hill-like feedback can assume negative values, which is something weird. So we are still investigating it. But something that I really like is that if you are able to match this model with the distribution, you get a biological interpretation of the parameters that you fit. For instance, in the model you can define an activity as the mean number of bursts per cycle in the protein expression, and when you match the distribution with the Pareto law you have this simple relation between the activity and the Pareto slopes. For instance, we looked at one experiment in which
they stressed diatoms with three months of prolonged darkness, and what they say in this experiment is that after some days the diatom cells show a reduction in metabolic and transcriptional activity. This is something that we also observe when we look at the activity derived from the Pareto exponents, so you can have an interpretation.

Let me just conclude with this part. Last week Andrea Weisse talked about a mechanistic single-cell model in which, using some trade-offs, you can derive 14 ODEs that rule the important stuff inside the cell. If you focus on the free mRNA, which is what we measure, you will observe that in the equation you basically have some positive contributions that do not depend on the current state, so they do not depend on m, and some negative contributions that are linear in m. So what we tried was to start with a very minimal effective model in which we call mu all the positive components, so we call it transcription, and then we put a death term that is linear in x, where x is the mRNA concentration. But of course we should consider that we have other variables that vary in time, so since this is a very minimal and effective model we put in two sources of noise. So we have this Fokker-Planck equation, which can be integrated, and eventually you find a double power-law behavior with a few relevant parameters: one is the death rate, so you know what the average lifetime of the mRNA is, then you have k, which is the ratio of the two noises and sets the scale, and then you have two parameters, alpha and beta, that regulate the slopes. You can do some simulations, for instance here I have a zoom on the mRNA production over 10 minutes, if I remember correctly, and then you can look at the species abundance distribution, and it has this double power-law behavior. Then you can try to fit it to the Tara Oceans data. And yes, I don't think that Pareto is the golden law, but
again, you need to have some mechanism with this feature, to have two power laws. And yes, since I am late, I think I will keep this as the take-home message. Let me just thank my collaborators, and have a good lunch.

We still have some time for questions.

In the model, I am just trying to wrap my head around the model: what is causing some genes to be much more highly expressed than other genes, the difference in mRNA levels of different genes? So in the model, why is a highly expressed gene high and a lowly expressed gene low? What is the mechanistic origin in your model?

The intuition is that you have two sources of noise.

So it is all noise? That is all just the result of noise?

The intuition is that, yes, you have two noises, and one is dominant on one scale and the other is dominant on the other scale.

Yes, but we know that this is regulated. Ribosomes are always highly expressed in any organism, and things like transcription factors are always lowly expressed. You can take mRNA from any animal you want, you correlate the orthologous genes against each other, and you get a 0.8 correlation coefficient. This is not random, this is designed by evolution.

Stochastic does not mean random. I mean, I am not sure I got your question, but I don't see these as two opposite ways of looking at the same properties. We know that stochasticity plays a role. I am not saying that here I just throw a die. I am just saying that there are two scales, and on each scale I have a source of noise, which is something that we know, and depending on which scale I am on, the slope is different. I am not saying anything more than that.

But let's say actin will always be on the right of your distribution, no matter what kind of system you look at. In all eukaryotic cells actin is one of the most highly expressed genes, it is always the case. I understand you are saying that this is the result of noise, and that seems strange to me.

I will say something else. I will say that this is just another perspective. Usually you will focus on
something, but here the idea is that also at the level of the whole expression you have a reorganization of your distribution. For instance, when you change conditions it is true that you are affecting some particular genes, some particular transcripts, but this is not all: you are also reorganizing the whole distribution, and this is related not just to one class of genes but to all the genes.

Thank you very much for the talk. It is really good to see neutral modeling being thrown at the Tara data. I don't know enough about all the analyses that have been made with the Tara data, but from what you said, basically, if you sampled equal biomass at each of the stations you would get the same number of species?

Yeah, we wanted to test this, and the answer is yes, but with some deviations, and the deviations are somehow related to the different conditions.

Yeah, I think that's very cool, so I'm looking forward to the paper. I think that's going to be very interesting, and I feel like the community that started the project is not necessarily the community that thinks a lot about neutral models and sampling, so I think what you are doing is very cool.

Thanks.

I was wondering if you could help me clarify one of the points that I think you tried to make. You looked at a metatranscriptome and you fit this Pareto distribution to the entire community, and then you looked at individual species as well and you found that they also followed this Pareto distribution with similar types of parameters. So that's really interesting. It suggests that interactions among individuals in the ocean might not fundamentally change this distribution of transcript activity. I'm still trying to think about what that means biologically and how to interpret that pattern. I was wondering if you could maybe elaborate a little bit, and that's more so I can understand.

I mean, basically I'm here to understand more these
questions. I would say that in a sense this observation makes me think that interactions are not the right ingredients that shape the distribution, so this is somehow an indication of neutrality at the community level. But I will not say more than this. I mean, if you have an interpretation...

No, I've thought about this myself, because we've looked at distributions of metabolic activity in isolates, not gene expression per se, and then we've gone into environmental communities, and the distributions of activity for individual populations and for communities look qualitatively different: different models need to be used to fit those data. That's made me wonder what we can learn from isolates or individual populations when we then go into mixed communities. For our data it seems like there's something different, and in your data it seems like the distributions for isolates or single populations and for whole communities are the same. So for me personally, with that comparison, I'm trying to reconcile what that means for your data and my data. I mean, they are different sorts of information, but it's the same general principle: we're looking at distributions of activity in single populations and in mixed communities.

OK. We also found some differences, but not in the functional form. For instance, in the Tara datasets the range of exponents is larger than the one that we observe in the single-cell data. So there are some differences, they are not completely the same, but at least the distribution is functionally similar.

Thanks.

All right, in the interest of you all getting lunch we will unfortunately have to cut the questions here. I was told that we continue at 2:30, and Wolf still wants to say something before we can thank the speaker again.

So it's just a short announcement related to the book chapter writing. I think most of you know what I'm talking about. For those who just arrived today: we have sessions almost every day for