Cordero, and thank you very much. Thank you. Is the sound OK? Yes, OK. Thank you, Jacopo. Thank you, everybody, for the invitation. It has been very stimulating these couple of days. Unfortunately, tomorrow I have to go back to Vienna, where I'm doing my sabbatical. I'm Otto Cordero from MIT. I am from Ecuador. So I'm going to talk about something that I think last night's conversation made me realize is in many people's minds, which I find satisfying because it's also in my mind. And I think it's a really good problem to work on. So I'm going to talk about it from different perspectives. I'm borrowing here some language from my friend Seppe Kuehn, from the University of Chicago. They're asking the question: what is the right language to describe microbial communities? This is obviously a kind of poetic way to ask the question of what are the right variables. Because the variables have been imposed on us by sequencing technologies, where you get ASVs, OTUs, whatever; some generic units. And I think I don't need to convince you that these may not be the right variables, although we're going to look into it in a second. So it turns out ecologists, the traditional ecologists, let's say, have thought about this for a while. And so they have what is called trait-based ecology. Here's just one example, to give an idea of what this could mean. Somebody mentioned yesterday that when thinking about ants and plants and whatever, people were used to thinking about roles. And so in this picture, roles could be the pollinators, the herbivores, the insects that are eating the plants, the organisms in the soil that are recycling or fixing nitrogen, et cetera. To some extent, this is a human construct, one can imagine. But in many other cases, at least in the case of plants, I think this is rooted in physiology, and people understand the constraints and so on.
So they can explain quite a bit from these traits. But the general problem that stands in any kind of ecology is: what is a trait, beyond the things that make sense to our limited understanding of biology? In microbial ecology, this is apparently more problematic. But people have tried to address this problem and develop databases of traits, for example. These are noble efforts, in my opinion, but they have inherent problems. Here's an example: the maximal growth rate, whether the organism is dormant or not, cold tolerance, motility, and stuff like that. I don't know what you guys think. I find this dissatisfying, again, because whatever we think is key, it's supervised learning: the categories are whatever is in our minds that we think is relevant, that we know, maybe what we have read. But this is not necessarily what matters. There is no objective way here to define what really matters. And, more complicated still, is how you determine what "matters" means. So this is out there. But the question in my mind is how to do this better, because this is based on manual curation, functional annotations, and so on, which are problematic. Is there a way to discover these ecological guilds? I'm also going to use this terminology of ecological guilds. Is there any way to discover these guilds in a data-driven way, or in an unsupervised way? OK, that's the challenge I'm going to try to address. Yes? Oh, OK. So guilds are simply the roles in the previous case. Right, so organisms that share a set of traits. And the roles, in a very compelling example, would be: you have primary producers, you have herbivores, you have carnivores, and stuff like that. These are roles in an ecosystem, roles within the community. And the properties of the community may emerge from interactions between these guilds, these functional groups. So just to give you a warning (it's not the right word): this problem is not solved in general.
So I'm just going to tell you that I think it's solvable and we're making a little bit of progress in this direction. And I'm also going to tell you the limitations. But to start, I want to introduce you to these beautiful little things that are called the pink berries of Sippewissett Marsh. Sippewissett Marsh is near Boston, in Massachusetts. These little things are about a millimeter in diameter. They are microbial consortia that self-organize into these little granules. There are other cases, in industry and in the environment, of microbial consortia organizing into spatially structured aggregates, like spheroids, and this is one of the most charismatic ones. Before I go on, I'll tell you that this is one of those projects that has been going on at a very slow rate in the lab for many years, but I think we're probably going to submit it this year. It is being led by Akshit Goyal, whose face you saw previously, and Gabriel Leventhal, who's now back in Switzerland, in industry. It's a collaboration with people at Santa Barbara, Lizzy Wilbanks and Boris Shraiman. So what are these pink berries? From a metabolic standpoint, Lizzy Wilbanks published a paper some years ago showing more or less a schematic of what they do. In principle, this is an engine that cycles different chemical species of sulfur, sulfur in different redox states. The pink things are purple sulfur bacteria. They are autotrophs that oxidize sulfide, and they are also photosynthetic: instead of using water, like green algae, they use hydrogen sulfide. And that produces sulfate. It can also produce sulfur granules. I skipped that picture from the slides in the interest of time, but it's a beautiful picture: you can see little sulfur granules on the surface of the berries.
And the sulfate goes to the sulfate reducer, which is a heterotroph that takes organic carbon and uses sulfate as an electron acceptor, producing sulfide again, which goes back. So the cycle goes on. I'm not sure this is necessarily a cycle, because there is an abundant amount of sulfate where these creatures are found, but that's not very important. What I want to say is that there are two main functional roles here: sulfide oxidizer and sulfate reducer. Where you find them is a beautiful marsh, as I told you, Sippewissett Marsh, on the coast of Massachusetts. You can walk in this beautiful place when the tides are low. It's a tidal place: sometimes it's completely flooded, and sometimes the tides are low. Then you can look in these little tidal ponds, and you may already see, from where you're standing, these little pink dots. These are the pink berries. So you can pick one and say this is one community, and you pick another one and say this is another ecological replicate of a community. That's what attracted me to this years ago, when we started looking into these systems. What we did is, from a methodological standpoint, embarrassingly simple, and I sort of regret not doing more now that I know more about it. We just sequenced the genomes of these things. When I say I regret not doing more, I regret not measuring fluxes, for example, because we could have done much more with that. But, in my defense, I didn't know that at the time, because what I was interested in at the moment was co-evolution. It's a beautiful case where you have the genotype of one organism that is metabolically coupled with the other one, and you have potentially hundreds and hundreds of these replicates. So you can see if there is any pattern of co-diversification. And the other thing that we can do is this structure-function type of question.
So about the first one, I could tell you a story, but I'm not going to do that in the interest of time. The simple answer is that there's an interesting analysis that Akshit did, but there is no evidence of co-diversification. As far as we can tell, these little things assemble and disassemble frequently. So it's not like they are replicating as a unit, in which case you would have co-inheritance of mutations. So what I'm going to talk about is the structure-function question. We have 182 individual berries that we sequenced with metagenomics. From that, we assembled 58 of what are called MAGs, metagenome-assembled genomes: approximately the consensus genomes of the different species in the system. So there are at least 58. It's not just these two things, right? OK, so the structure-function question. This is the phylogeny of those 58 things. If you're not a microbiologist or environmental microbiologist, these names don't tell you anything. I have been in the field long enough that I can look at these names and something pops into my mind with a certain function. And sometimes you don't need to know much to do that. We can also, of course, look at the genomes, and we know where the genes involved in these pathways are. What I'm highlighting here are the sulfide oxidizers: a specific clade of Gammaproteobacteria, well known to be what are called purple sulfur bacteria, as well as some Alphaproteobacteria that also have the genetic potential to do that. Then there are the Desulfobacteria, which not surprisingly, as the name indicates, are sulfate reducers, also a very specialized type of organism. There are a few of those in this specific clade there. And another thing that pops out, I think, is this relatively large clade of Bacteroidetes. Similar, to some extent, to those that are in our guts, they love to degrade complex polysaccharides.
And these polysaccharides in the salt marshes are very complex and sulfated. There are also some things in there that are predators, bacterial predators. Maybe they're eating other bacteria, but whatever. Now we have 182 replicates, so you can look at the variation across this system. So this picture is... yes? Is it a biofilm? How do they build this structure? It's like a biofilm, yes, but it's relatively densely packed with cells. They're about a millimeter, all of them; 20% variation around that, I would say. But over seasons, they change in time. In our sampling, they are comparable. When it's cold, they're small. The maximum size could be a couple of millimeters, but those are rare to find. Probably somebody eats them, yeah. You can keep them alive, but you don't see biomass increase. And you can also kill them very easily: if you put them in the wrong medium, you see accumulation of ammonium and such, they stink, and they become dark. But you can keep them alive, and Lizzy Wilbanks is the expert in that. They're photosynthetic, yeah. Yes, they're in shallow tidal ponds. From pictures I have seen, and there's only one picture of this, there was no obvious gradient structure. I don't know, maybe you need to do more. There should be something, I think. So let me explain this complicated figure. This was previously on the slide: I just took that one, rotated it 90 degrees, and chopped the bottom so I can show you the upper part. OK, so that's the phylogeny on the left. Now this confetti thing in the middle: each of these little bubbles is the fractional abundance of a particular genotype in a given berry. The dark dot is the mean of that fractional abundance. And the colors correspond to different geographic locations, different ponds, but you can ignore that for the purpose of this talk.
And then here I'm highlighting the two most abundant things, which are indeed a sulfide oxidizer and a sulfate reducer. The main thing I want to say with this is that there is a huge variation in fractional abundance. That's my main point. There are approximately four orders of magnitude of variation in many cases when you look at the fractional abundance of each individual taxon. OK, so this is not super new, right? There's a lot of compositional variability when you look at this level of resolution. Now, if you group them according to what we think are functional roles, and then you look... yeah? If you group together all the genes of the 58 genomes that you assembled, and then you look at all the other genes that you found but couldn't assemble, which fraction of the functional genes do you have? That's a good question. Sorry, I don't remember that number. My guess is that it's pretty high, because we get the most abundant organisms, right? But so if you group these taxa into what we think are the functional groups, oxidizers, reducers, and others, then you get something that is obviously, I should emphasize, much more stable in the statistical sense: the variation is not four orders of magnitude, from 10 to the minus 4 to 10 to the minus 1 or something like that. It's all fluctuating, not too much, around 20% for the reducers, around 50% for the oxidizers, and so on. I emphasize that this is extremely obvious to anybody that knows elementary statistics, because you're just grouping things, right? There are a lot of highly cited papers in the literature saying that functions are more stable than taxonomy, and it's just statistics. Because if you take anything random of the same size and you group it, you'll find exactly the same thing. Xiaoyu did this, and there's no signal, right? So it's just statistics. So the question: is this just statistics?
Is the question. So this is where Akshit intervened. He took this phylogeny and broke it into little pieces, basically all possible subclades, and created all possible groupings, say, of size two, all possible groupings of size three, and so on, and asked: what is the variability of these groups, in the statistical sense again? This is what I'm describing here. The metric used is the coefficient of variation, and the sum of the coefficients of variation for a given number of groups. And it turns out that the sulfide oxidizer / sulfate reducer bipartition is the most statistically stable one. Which is to say that if we don't know any biology, and we just take this thing and ask the computer to give us the most statistically stable grouping, we recover the biologically meaningful set. And I don't think that's totally trivial, right? So this gives me hope that one can actually approach this problem in a systematic way. But this is a very simple example. Maybe once we go to more and more functions, it becomes more complicated. That's where we are at the moment. We don't have a great solution for this yet. Oh, sorry: and if we set the number of groups equal to three, then you find the Bacteroidetes, the polysaccharide degraders. There is one extension of this idea, because here we don't have a readout of function. This is what I said at the beginning: I regret not measuring more about the fluxes or anything. It's just the stability of composition that we're looking at. But there are other types of data sets where one could say, I have a measurement of a function. This is typically in the context of human health, where you have all these examples of microbiomes from patients and some phenotype, maybe disease or no disease, right? And then the game that people want to play is: can I find a predictor of the disease? And then, you know, obviously all the things that would follow.
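The stability comparison described for the pink berries can be illustrated with a small sketch. This is not the analysis from the talk: the data are synthetic, all function names are hypothetical, and for simplicity the search runs over every bipartition of six taxa rather than over the subclades of a phylogeny. The score is the one mentioned in the talk, the sum over groups of the coefficient of variation of each group's total abundance, and a set of random groupings of the same sizes serves as the "is this just statistics?" comparison.

```python
import itertools
import numpy as np

def summed_cv(abund, partition):
    """Sum over groups of the coefficient of variation (std/mean) of the
    group's total fractional abundance across samples."""
    total = 0.0
    for group in partition:
        g = abund[:, list(group)].sum(axis=1)
        total += g.std() / g.mean()
    return total

def best_bipartition(abund):
    """Exhaustively score every split of the taxa into two non-empty groups
    (a simplification: the talk restricts candidates to subclades)."""
    n = abund.shape[1]
    best = None
    for k in range(1, n // 2 + 1):
        for left in itertools.combinations(range(n), k):
            right = tuple(t for t in range(n) if t not in left)
            score = summed_cv(abund, (left, right))
            if best is None or score < best[0]:
                best = (score, left, right)
    return best

# Synthetic data (assumption): taxa 0-2 form one guild whose total is ~60%,
# taxa 3-5 form another with ~40%; membership within a guild varies freely.
rng = np.random.default_rng(1)
n_samples = 200
total_a = np.clip(rng.normal(0.6, 0.02, n_samples), 0.5, 0.7)
split_a = rng.dirichlet([1, 1, 1], n_samples)
split_b = rng.dirichlet([1, 1, 1], n_samples)
abund = np.hstack([total_a[:, None] * split_a,
                   (1 - total_a)[:, None] * split_b])

score, left, right = best_bipartition(abund)

# Null comparison: random groupings with the same group sizes (3 and 3).
null_scores = []
for _ in range(200):
    perm = rng.permutation(6)
    null_scores.append(summed_cv(abund, (perm[:3], perm[3:])))

print("most stable bipartition:", left, right)
print("its summed CV:", round(score, 3))
print("median summed CV of random groupings:", round(float(np.median(null_scores)), 3))
```

In this toy setting the planted guilds are recovered because their totals are stable while individual taxa fluctuate widely; with real data the candidate set and the null model would need more care.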
So these are called association studies, like the typical GWAS in the case of genetics: microbiome association studies, where you find a microorganism that is correlated with a disease or with health. But if you think about functional redundancy, you immediately see a problem: there may be no individual microorganism that correlates, because you're not looking at the right unit. If this disease, or whatever function it is, is attributable not to a single species but to a group of species, the statistics break down. I think it's an important problem, but I'm not so interested in the application; I use it only to introduce the way we approach it. So Xiaoyu Shan, a student in my lab who just graduated, looked into this. I actually thought initially it wasn't possible, but he found a solution that now, in retrospect, makes it very clear that it should be possible to do this in this particular case. You have some composition of microbes. This is a cartoon, right? The different bars are samples, the colors are different species, and the little dark triangles are, say, a function that you measure. Let's say CO2, whatever it is, but this is important, it will come later. And then you want to find the grouping that best explains that function in the statistical sense. It turns out in this example, of course, it's the blue and the red: you group them together, then you get it. What becomes obvious when you do this is what kind of things you should group, what statistical properties they should have: the blue and the red should each be somewhat positively correlated with the function that you want to explain, and ideally they should be as anti-correlated as possible with each other, which is exactly the same idea as in the stock market.
When you want to have a portfolio of stocks, you don't want your stocks to be correlated, because then you are increasing the risk of fluctuations. In order for the portfolio to be stable, you want to spread the risk by having anti-correlated things. Exactly the same idea. And that's kind of what the algorithm does. This is just explaining the same thing in slightly more formal terms. These are vectors, because you have many different samples. What you want to maximize is the projection of your group of species on the vector of the function that you measure; to the extent that this angle is zero, your correlation is excellent. The orthogonal axis is the errors, the residuals of the regression, right? Those you want to cancel out, and you only cancel them out when the vectors are pointing in different directions, like the blue and the red. It's just what I said. So Xiaoyu came up with an algorithm to do this. I'm not giving you any details, but I can talk more about it. It's just an objective function that is an expression of the R squared, a search process that goes through the possible combinations of species, and some penalty for group size; we have to regularize. We have three examples in the paper, and in all these cases the answers are very satisfying in terms of our understanding and expectations of the system. I'm just going to show you one, which I think serves the purpose. We took the Tara Oceans data. The Tara Oceans data is just many, I think 128, stations in the ocean that were sampled at different depths. In the ocean, you have gradients of many things, for example nitrate. The dominant nitrogen species in the ocean has a profile with depth. Not only with depth, but mainly depth. And nitrate, we know, is controlled by microbial activity. That one we know is important.
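The three ingredients just listed (an R-squared objective, a search over combinations of species, and a penalty on group size) can be sketched as follows. This is a minimal illustration under stated assumptions, not the algorithm from the paper: the search here is a plain greedy forward selection, the penalty weight of 0.01 is arbitrary, the data are synthetic, and the names `group_score` and `greedy_group` are hypothetical. The portfolio intuition enters through Var(a + b) = Var(a) + Var(b) + 2 Cov(a, b): two members that are anti-correlated with each other cancel the part of their variation that is unrelated to the function.

```python
import numpy as np

def group_score(abund, func, group, penalty=0.01):
    """R^2 of a linear regression of the function on the group's summed
    abundance, minus a penalty proportional to group size."""
    x = abund[:, group].sum(axis=1)
    r = np.corrcoef(x, func)[0, 1]
    return r ** 2 - penalty * len(group)

def greedy_group(abund, func, penalty=0.01):
    """Greedy forward search (an assumption, chosen for brevity): keep adding
    the taxon that most improves the penalized score until nothing helps."""
    n_taxa = abund.shape[1]
    group, best = [], -np.inf
    while len(group) < n_taxa:
        candidates = [t for t in range(n_taxa) if t not in group]
        scores = [group_score(abund, func, group + [t], penalty)
                  for t in candidates]
        i = int(np.argmax(scores))
        if scores[i] <= best:
            break
        group.append(candidates[i])
        best = scores[i]
    return sorted(group), best

# Synthetic data (assumption): taxa 0 and 1 carry out the same function in
# different micro-environments. 'oxygen' decides how the guild total is split
# between them, so they are anti-correlated with each other.
rng = np.random.default_rng(0)
n_samples = 300
guild_total = rng.uniform(1.0, 3.0, n_samples)
oxygen = rng.uniform(0.0, 1.0, n_samples)
abund = rng.uniform(0.0, 1.0, (n_samples, 8))      # taxa 2-7: unrelated
abund[:, 0] = guild_total * oxygen                  # "aerobic" member
abund[:, 1] = guild_total * (1.0 - oxygen)          # "anaerobic" member
func = guild_total + rng.normal(0.0, 0.1, n_samples)  # measured function

group, score = greedy_group(abund, func)
r_members = np.corrcoef(abund[:, 0], abund[:, 1])[0, 1]
r2_single = np.corrcoef(abund[:, 0], func)[0, 1] ** 2
r2_group = np.corrcoef(abund[:, group].sum(axis=1), func)[0, 1] ** 2

print("selected group:", group)
print("correlation between the two members:", round(r_members, 2))
print("R^2 with taxon 0 alone:", round(r2_single, 2))
print("R^2 with the selected group:", round(r2_group, 2))
```

Neither member predicts the function well alone, but their sum does, which is the situation in which single-species association tests fail and a group-level variable succeeds.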
I mean, it's a result of ammonia oxidation, nitrification, and these things. So can we find, out of the many species in the ocean, which ones, when grouped together, allow you to explain statistically the concentrations of nitrate? And this is the group of species. The size of the bubbles indicates how important they are for the statistical regression, and the edges between the bubbles show how important having both together is for the regression. That doesn't matter so much for what I want to say. The two most important ones are these two things, Nitrosopumilaceae and Candidatus Scalindua. If you look at what these things do, they are both ammonium oxidizers: they convert ammonium to nitrite. And it turns out one is aerobic and the other one is anaerobic. So this is exactly what I was telling you about. They perform the same function in terms of ammonium oxidation, but in different environments. When you consider other variables like oxygen concentration, they will be anti-correlated, as they are in this figure. So this is how the game works. There are a lot of different microenvironments that we are not taking into account that explain the diversity of species that perform the same function. And you can statistically sort of integrate over that and recover the function that you care about, if you have the functional readout, which in this case is nitrate. OK. Anyway, there are other examples. This one is maybe relevant, but only for some of you. If you haven't seen the paper from Alvaro Sanchez's lab on emergent simplicity, forget about this slide. If you have seen it, then you know the functional groups are taxonomic units, families: Enterobacteriaceae and Pseudomonadaceae. And you can recover that using this algorithm; you get exactly the same thing. So you don't need to know the phylogeny to know this. Anyway, to me this... yes? Phylogeny.
The ammonia oxidizers will be one of these cases. Distant, completely different. It's by horizontal transfer. I'm not sure what the history of the ammonia oxidation enzymes is, but yeah, this is not monophyletic. Most are archaea, but these guys are distant. OK, yeah. But in general, I think there is a strong phylogenetic signal, I would say. That's my expectation. So this is an aspirational slide. What we have now is the microscopic variables that we get from sequencing, from the omics. We would like to get to these mesoscopic variables that, we postulate, would be much better at explaining the environmental parameters, the functions, the fluxes, and things like that. I think this is, technically speaking, a solvable problem. The limitation, I don't think, is computation, and I don't think it's math. The limitation is the lack of suitable data sets. You cannot count the number of data sets that are quantifying what's on the left of this picture, structure: thousands and thousands of data sets from the ocean, from the gut, from soils, et cetera, telling you what species are there, at what abundance, what genes, and so on. But there is almost nothing on the functional side. I think this is partly a cultural problem and partly a technical problem. It's much more difficult to do than sequencing, but I think this is what's holding us back. OK, and this is kind of what we would like to work on. So I'm looking forward to having discussions with people about this, because if we have good ideas, we can really put a lot of resources into something that addresses this challenge. Yeah? You're going to shift to the next topic; before you do, let me just catch you on this. For this very nice classification you are doing, can you zoom into the next level to see where you stop?
You know, I mean, even when you do this shuffling and so forth, you're deciding on some kind of coarse-graining level. And then it is a classification that reproduces your sulfur reducer and oxidizer. Ah, OK. So in principle you can push this forward. Ah, OK. Yeah, that's a good question. For the first case, for the sulfur reducer and oxidizer. Yes, yes. That's a good question. So I interpret the question as: it's like a clustering problem, how do you know the ideal number of clusters? But what's the limit? Yeah. We haven't looked into this, but this is something we can tackle, and it should be what people have thought about in terms of clustering. There has to be some statistical metric. For sure, at some point there should be this agreement, and then at some point it has to be worse than the previous level by some metric. Yeah. I would like to congratulate you on this. If it works, it's great. And I understand the limitations, because of not measuring a lot of functions. I'm just wondering how sensitive this technique is if you're trying to combine data from different sources. When you work with Tara Oceans, that was kind of a single source. But when you are combining data where different groups measure the same function with slightly different techniques, all those negative correlations, which are essential for you to hedge against risk, so to say, might be batch artifacts. So that's... OK, so this to me, if I understand, gets into the problem. Yeah, OK. It's not just measuring functions. Now we need to understand a bit more about what state these communities are in. For example, if this community is going through succession, then things are going to be positively correlated, because there's a systemic change.
So ideally these things should have equilibrated in some way after community assembly, and you have variation that comes from the microscopic processes. Let's say the noise, as somebody said yesterday: phages, resource ordering, you know, the hierarchy of resource preference, whatever it is. That is the ideal data set. But that can be constrained when you... so OK, it's not just taking samples, I should say. OK, that's a good point, yeah. All right, so you answered another question which I wanted to ask, but right now I was asking about much more mundane things: if you have one big project like Tara Oceans, which measured everything to one kind of standard, then those negative correlations are relatively free of artifacts. But if you have seven groups which measured it using their different standards, then when you combine the data and try to do this correlation analysis, you will have batch effects, which will spoil the power. That's true also, yeah. It should be standardized, yeah. Thanks. Martina? Actually, I was thinking that over successions is actually when you have problems, because I thought that over succession you have a lot of anti-correlated abundances, because you have species increasing in abundance and species decreasing in abundance. But at different time points. So in one snapshot... Yeah, no. For example, if I take many replicates of succession at initial time points, everything that is an early successional species will be correlated, and anti-correlated with the late successional species. OK, I get it. Thank you. OK, so can I ask a question? The existence of these guilds and the possibility of finding them, it seems to me, strongly depends, let's say for a data set like the one you showed, on how the environment is varying across samples, right? In the sense that the axes of variation of the environment determine what functional groups you can find. It's not something... Yes.
Because, I mean, also this anti-correlation, right? Let's say if the environment is fixed, in quotes, like what you can do in experiments, you expect this anti-correlation between species within the same group. But if you're in a natural environment where one of these environmental axes is varying, then this correlation becomes positive, right? So, I mean, let's even imagine a case where the experiment is done in the lab under controlled conditions. Let's imagine it's a complex community, but I have controlled conditions. I'm still not totally sure about this, in the sense that it depends on how you vary that environment. Imagine that I can manipulate it: I can put in a little bit more nitrogen, a little bit less oxygen, a little bit of this, whatever, put some noise on the dilutions or something like that. All of these will give me some answers, right? But is there a way to do this that is unbiased? I'm not totally sure of this yet. The data you show, I don't think it's even fluctuation. You know, the nitrogen thing, right? Basically, I look at your data, and there are just two clusters. What's the x-axis? Oxygen, yeah. At an oxygen point. Exactly. So it's not bet hedging or anything; it's just that in one regime it's doing one thing, and in the other regime it's doing another thing. So in that case, right, if you perturbed the oxygen concentration, then you would learn this. No, no, but in that case, that is constant, and then you have fluctuations; I mean, there is something that determines the fluctuations within, which is what is varying, right? But suppose that this other thing that is varying is now constant, and what you vary is whatever determines the fixed level; you should see a completely different picture. It's related to what is varying and what is constant in the environment.
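The point raised in this exchange, that the sign of the correlation between two members of the same group depends on what varies across samples, can be seen in a toy simulation. This is only an illustration with made-up numbers: two species split a guild total according to a share, and either the share or the total is the main source of variation. The function name `guild_pair` and all parameter values are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500

def guild_pair(total, share):
    """Abundances of two species that split a guild total by 'share'."""
    return total * share, total * (1.0 - share)

# "Fixed" environment: the guild total is nearly constant, and what varies
# across samples is how the total is split between the two members.
a, b = guild_pair(total=rng.normal(1.0, 0.02, n),
                  share=rng.uniform(0.2, 0.8, n))
r_fixed = np.corrcoef(a, b)[0, 1]

# Varying environment: an external axis changes the guild total a lot,
# while the split between the members stays nearly constant.
a, b = guild_pair(total=rng.uniform(0.2, 2.0, n),
                  share=rng.normal(0.5, 0.02, n))
r_varying = np.corrcoef(a, b)[0, 1]

print("within-guild correlation, fixed environment:", round(r_fixed, 2))
print("within-guild correlation, varying environment:", round(r_varying, 2))
```

The same pair of species is strongly anti-correlated in the first case and strongly positively correlated in the second, so a grouping criterion that relies on anti-correlation sees different things depending on the sampling design.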
Is the data just lumping everything? No, yes. We can talk more about this; I actually really want to talk about it. I can tell you where we are now, but this thinking is evolving on a daily basis. We're going to just perturb a set of variables that I think make sense, knowing that this is still biased, because it's whatever we're imposing. But yeah, this is a hard problem. So I have maybe 20 minutes left, if I'm not mistaken. I want to tell you about a different approach to this problem, which is, well, I guess partly a way to solve the problem, but maybe from the bottom up. And with that, I also want to tell you a little bit about the things that we have been doing over the years, in a very simple manner. We have been studying community assembly in marine ecosystems, coastal environments. The way we conceptualize the ocean, let's say, is as a huge bio-digester: you have primary production on the surface, an enormous amount of carbon and nitrogen and complex organic matter that starts to sink in the ocean. And in the ocean, bacteria are the main recyclers of this form of organic carbon. In soils, fungi are more relevant, but in the ocean we think it's mainly bacteria. As for the way this happens: there's a lot of dissolved organic matter, where you have oligotrophs consuming it that have very high affinities for the substrates, but you also have these little patches of nutrients that people call marine snow, where bacteria sort of congregate. This could be fecal pellets from zooplankton, or it could be dying algae, a bunch of different things. These are hotspots of biological activity, and this is where the copiotrophs (thank you for introducing this terminology earlier on), which love high concentrations of nutrients and have these boom-and-bust dynamics, colonize and grow.
So that's where community assembly happens on well-defined spatial scales. OK, that's what I said. So here we can ask the question of how metabolic labor is divided. What are the roles in the community, in other words? I'm not going to talk too much about this, it's a bit of old news, but the way we started doing these experiments was using a synthetic particle, in this case a little bead: it's a hydrogel with chitin, and it has a magnetic core. We put it in seawater, and it turns out you can farm little communities and pull them out of seawater, and you see these beautiful patterns of colonization on the surfaces. These are, for example, natural seawater bacteria colonizing and forming little colonies and doing crazy things. Now we are developing techniques to look at this colonization process in real time using microfluidics; that's something we can talk about. Anyway, for the purpose of this talk, what I want to tell you is that we developed an isolate collection from this system: a collection of marine bacteria that colonize these particles. The pipeline has been: you take this particle, you immerse it in seawater, you get the colonization, and you can sample your particles at different times so you can see the dynamics of the assembly. And then we can culture these bacteria. It turns out that many of the bacteria that we collect on particles are culturable, for reasons I can explain. So we developed a culture collection from that, and we have a few hundred isolate genotypes, about 200 of which have good-quality full genome sequences. And then we did the phenotyping; that's what I want to tell you about. Then we can try to understand how to put them together in a way that makes sense. So this phenotyping is what's left in my talk.
And this was the work of Matti Gralka, who's incidentally a physicist and who's now leading a group on quantitative ecology or quantitative microbiology, I'm sure I forgot how he calls it, at the University of Amsterdam. So about 186 strains are grown on 135 different carbon sources, and you see here the different taxa. Of course, this doesn't need to mean anything to you, but these are the most abundant groups of copiotrophs in the marine environment. So this is what the data looks like when he does his growth experiments. This is one organism. You know, these things are organized in 96-well plates. For those of you that know, there are 12 columns and eight rows, and 1A is the first one. So this is 1A, and it happens to be a Vibrio. And so you see the growth curves, they look pretty nice I think, for different substrates, which, this will come later, are classified here as sugars, organic acids or amino acids. You can also see that in many cases the growth is zero. And so Matti basically fits a growth function to them, from which you can get yields, lags and rates, but we are mainly looking at the rates. And then we have this matrix of resource utilization. Okay, so what can we say from this matrix? Here's one thing, in case you care about the relationship between phenotype and genetic distance: there isn't a very strong signal here, and there's no characteristic genetic distance at which things really change. Of course they become more phenotypically different as you go to long distances, but the similarity goes down slowly, and very closely related things are not that similar phenotypically. That's what I can say about this. And the other, perhaps pretty obvious, thing that you can do with this type of matrix is a principal component analysis. And this is what that looks like. And it turns out this is interpretable. So here each of these dots is a strain.
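As a rough illustration of the step described above, where each growth curve is reduced to a rate and the rates are collected into a strain-by-substrate matrix, here is a minimal Python sketch. It is not the fitting procedure used in the study, which the talk does not specify. It assumes a simpler stand-in: the steepest least-squares slope of ln(OD) over a sliding window. The function name `max_growth_rate`, the `window` and `floor` parameters, the strain and substrate labels, and all numbers are hypothetical.

```python
import math

def max_growth_rate(times, ods, window=3, floor=1e-3):
    """Largest least-squares slope of ln(OD) over any `window` consecutive points.

    Assumed stand-in for a proper growth-function fit; returns 0.0 if the
    curve never increases.
    """
    logs = [math.log(max(od, floor)) for od in ods]  # floor avoids log(0)
    best = 0.0
    for i in range(len(times) - window + 1):
        t = times[i:i + window]
        y = logs[i:i + window]
        t_mean = sum(t) / window
        y_mean = sum(y) / window
        sxx = sum((a - t_mean) ** 2 for a in t)
        if sxx == 0:
            continue  # repeated time points, no slope defined
        sxy = sum((a - t_mean) * (b - y_mean) for a, b in zip(t, y))
        best = max(best, sxy / sxx)
    return best

# Toy curves (made-up numbers): exponential growth at 0.5 per hour, and no growth.
times = [0, 1, 2, 3, 4, 5]
growing = [0.01 * math.exp(0.5 * t) for t in times]
flat = [0.01] * len(times)

# One row of a strain-by-substrate rate matrix, with hypothetical labels.
rates = {
    "strain_1A": {
        "glucose": max_growth_rate(times, growing),
        "citrate": max_growth_rate(times, flat),
    }
}
print(rates)
```

A matrix built this way, with zeros where a strain does not grow, is the kind of input one would then pass to a principal component analysis.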
And so when you project the type of resources, sugars, organic acids or amino acids, on this PCA plot, you see that basically you have this coincidence, right? You have things that point in the direction of sugars, in the direction of TCA cycle intermediates, and in the direction of amino acids. And this match is so good that Matti developed a simple index, really just based on the growth rates on sugars, KS, and the growth rates on acids, KA. He calls it the sugar-acid preference, SAP. And this first principal component is almost perfectly correlated with that index. So in the rest of the talk I'm just gonna use the index, but it's the first principal component. So why would you have this type of specialization on acids and sugars? I'm talking about the first principal component. I think Terry may disagree with this, and I don't know what the explanation is, but I really want to mention other people's work because it's very compelling and I think it, let's say, sounds relevant. So you have sugar metabolism that takes you down, well, down in the orientation of this graph, of course, towards the TCA cycle. And then you have gluconeogenesis that brings you up from the TCA cycle. Okay, that part is fair, this is just textbook stuff. And so what I was referring to when I said the work of other people is the idea that you may have a frustration if you want to do both things at the same time, because glycolysis and gluconeogenesis have opposite directions in their flux. As I was saying, glycolytic reactions go down in this scheme, whereas gluconeogenic reactions go up. And if you want to do both, you may have some type of futile cycle. So this has been looked at in a beautiful, I think, detailed way for E. coli and Pseudomonas. What is not clear is whether this is a general explanation, but E.
coli and Pseudomonas do indeed have these preferences for glycolytic metabolism and gluconeogenic metabolism, respectively. E. coli does glycolysis preferentially and Pseudomonas does gluconeogenesis preferentially. So, you know, I think they could potentially regulate and do this better, but it seems to be somewhat imprinted in their genomes, or in their genetic code somehow. So can we actually read that from the genomes? And again, without going into complicated metabolic models, which, well, I mean, if I ever have to go into that, I'll do it, but I'm not super fond of it. So my question is: is there a simple way to make these predictions from the genomes? And as I will tell you, there is. And I think this is kind of interesting. So here's a critical slide, so make sure that you're following what I'm saying here. What we are doing here is counting the number of genes that are bringing sugars into glycolysis. So for example, galactose, the sugar, goes into glycolysis, and propionate, an acid, goes into acetyl-CoA or something like that, gets into the TCA cycle, or goes upwards. So we are counting the number of genes in those reactions that bring sugars into glycolysis, or in those reactions that bring acids into gluconeogenesis. Okay, just counting genes. So for example, it could be that for galactose, I'm not sure, I'll show you the exact pathway, but let's say in my scheme there are three genes; there are more than three genes in reality. It could be that you have many copies of the same gene. They don't need to be identical, but they have the same functional annotation. So you count that up, and you count all the other genes in the feeder pathways into central metabolism. When you do that, you see there's a really nice correlation, in this particular case for propionate and galactose. When you have the sugar-acid index on the horizontal axis, it's a very nice correlation.
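The two quantities being correlated above can be sketched in a few lines of Python. The talk does not state the formula for the sugar-acid preference, so the normalized difference used in `sap_index` is an assumption, as are the function names, the annotation labels and every number in the toy data.

```python
from statistics import mean

def sap_index(rates, sugars, acids):
    """Assumed form of the index: (K_S - K_A) / (K_S + K_A), where K_S and K_A
    are the mean growth rates on sugars and on acids."""
    k_s = mean(rates.get(s, 0.0) for s in sugars)
    k_a = mean(rates.get(a, 0.0) for a in acids)
    if k_s + k_a == 0:
        raise ValueError("strain grows on neither sugars nor acids")
    return (k_s - k_a) / (k_s + k_a)

def count_pathway_genes(annotations, pathway_functions):
    """Count genes whose functional annotation belongs to the pathway.
    Copies with the same annotation are counted separately."""
    return sum(1 for f in annotations if f in pathway_functions)

def pearson(xs, ys):
    """Plain Pearson correlation coefficient."""
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5

# Hypothetical growth rates for one strain.
strain_rates = {"glucose": 0.8, "galactose": 0.4, "succinate": 0.2, "propionate": 0.0}
sap = sap_index(strain_rates, ["glucose", "galactose"], ["succinate", "propionate"])

# Hypothetical annotations: three copies of step 1, one copy each of steps 2 and 3.
galactose_pathway = {"gal_step1", "gal_step2", "gal_step3"}
genome = ["gal_step1", "gal_step1", "gal_step1", "gal_step2", "gal_step3", "other"]
n_genes = count_pathway_genes(genome, galactose_pathway)

# Made-up per-strain values: gene counts in the galactose pathway versus SAP.
gene_counts = [8, 6, 4, 3]
sap_values = [0.9, 0.5, 0.1, -0.1]
r = pearson(gene_counts, sap_values)
print(sap, n_genes, r)
```

With real data, `gene_counts` would come from genome annotations and `sap_values` from the measured growth rates, one pair per strain.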
The more they prefer sugars, experimentally measured, the more genes they have in the pathway that brings galactose into central metabolism. So they have more redundancy in a way, or the pathway is longer, or something like that, and the opposite for propionate, for the acid. In the particular case of galactose, we can look at that in more detail. So these are all the steps that bring galactose into glycolysis. What is expanding in this particular case is the first step of the pathway. That's what's driving the correlation with the SAP. It turns out that there are between six and eight copies of the genes that are doing that step. And when you look at what those copies are, so this is a phylogeny that we have made of all these genes in the first step, I think it is. The beta to alpha D-galactose, second step. You see, these are the different clades where you find the copies of the gene. The one where I have the red arrows is one organism, a flavobacterium that is a sugar specialist. And you see where the genes are: they are nested in many different clades distantly related to flavobacteria, in the alphaproteobacteria, in the gammas, et cetera, et cetera, which means that this is horizontal gene transfer. These are not just copies of these genes. These genes are accrued from all over the place, and they just have more diversity of things that do approximately the same thing. So the picture is something like this: there has been an expansion of this part of the pathway. Okay, so I'm gonna go back to the previous slide, where I had the SAP and the galactose and propionate slopes, sorry, correlations, but I first want to tell you my opinion on what these things are doing. Why do we have so many genes there? But this is just speculation, okay? And I'm borrowing ideas from what people have observed in other cases.
So my speculation about this is that these enzymes are not exactly redundant. They're just optimized for different conditions. In the case of oxidoreductases in the electron transport chain, it has been shown, for example, that some work better at low oxygen concentrations and some work better at high oxygen concentrations. And therefore you have many copies of what seems to be the same thing, but it's really not the same thing. It's just optimized for different conditions. So if you have different environments, you may use the different genes, that's the idea. I was gonna say, if I'm an expert in repairing computers, I probably have a lot of tools to repair computers, right? Not just one type, that kind of thing. Okay, this is speculation, but back to the data. The point here is that we can take the slopes and put a number on this tendency to lose or gain genes as a function of the sugar-acid preference. And we can put the slopes there, and this works pretty much everywhere. There are a few exceptions, but it holds for the majority of points. If you have an acid pathway, the slope is negative. If you have a sugar pathway, the slope is positive. So galactose and propionate were not special cases. This is generally true for sugar and acid pathways. And so then we can basically aggregate them and come up with a simple way to predict this sugar-acid preference that doesn't need any complicated metabolic model. It's just a linear model with two variables: the gene counts of the sugar pathways and the gene counts of the acid pathways. Oops, I wrote sugar in both, but this should be sugar and acid, I'm sorry about that. All right, so this simple model, how good is it? We have to train it on something, right? In our case, I think it's reasonably good.
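The two-variable linear model just described can be sketched as an ordinary least-squares fit. This is only an illustration under assumptions: the talk does not say how the model was fitted or whether it includes an intercept, so the intercept, the function names (`fit_sap_model`, `predict`, `r_squared`) and the toy data below are all hypothetical.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):  # back substitution
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x

def fit_sap_model(sugar_counts, acid_counts, sap):
    """Least-squares fit of SAP ~ b0 + b1 * sugar_count + b2 * acid_count
    via the normal equations (intercept b0 is an assumption)."""
    rows = [[1.0, s, a] for s, a in zip(sugar_counts, acid_counts)]
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(3)] for i in range(3)]
    xty = [sum(r[i] * y for r, y in zip(rows, sap)) for i in range(3)]
    return solve(xtx, xty)

def predict(coef, sugar_count, acid_count):
    return coef[0] + coef[1] * sugar_count + coef[2] * acid_count

def r_squared(coef, sugar_counts, acid_counts, sap):
    y_mean = sum(sap) / len(sap)
    ss_res = sum((y - predict(coef, s, a)) ** 2
                 for s, a, y in zip(sugar_counts, acid_counts, sap))
    ss_tot = sum((y - y_mean) ** 2 for y in sap)
    return 1.0 - ss_res / ss_tot

# Made-up training data: aggregated gene counts per strain and measured SAP.
sugar_counts = [10, 8, 6, 4, 2, 5]
acid_counts = [2, 3, 5, 8, 9, 4]
sap = [0.8, 0.5, 0.1, -0.4, -0.7, 0.1]

coef = fit_sap_model(sugar_counts, acid_counts, sap)
print(coef, r_squared(coef, sugar_counts, acid_counts, sap))
```

In this toy example the fitted sugar coefficient is positive and the acid coefficient is negative, matching the signs of the per-pathway slopes described above. A held-out set of genomes would be scored with `predict` alone.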
We can predict the sugar-acid preference from the genomes using this simple model. We have a pretty good R-squared. And then what I think is more compelling is that Matti went to public data sets. It turns out that, this used to be the case, I don't know if they still do it, when people find a new species, they test whether it grows on lactose, on acetate, and a couple of simple things, and they report this data. So there are these tables with like 10,000 species and whether they grow on lactose and so on. And then of course we have the genomes. So we can make the predictions from the genomes using the simple model, and then we can see how well it works based on that data, and there is a signal. How good it is actually depends on your standards. It's pretty good considering that this is data that we have not trained on at all in the regression. So that's, I think, compelling in the sense that there are these genomic signatures that are predictive of the function of the system. So I think I'm kind of out of time, right? How much time do I have? Five minutes, okay, five minutes. If there are no questions, I will transition just to conclude. I will skip some things that I had, but I was thinking that I might need to skip them, so I'm gonna press this button. And so just to kind of wrap it up a little bit: what this means for the ecological dynamics. So remember I told you that you can take these particles and sample them at different time points, and you get some picture of the ecological dynamics. So this is old news. It's the first paper I published in my lab, and this is the figure from that paper. If you haven't seen this, on the rows you have the different taxa, and the data is normalized per row, and it's the fractional abundance of that taxon at different time points. Red means the maximum fractional abundance, and black is when it's undetectable.
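The per-row normalization behind the heat map described above can be sketched as follows. This assumes that "normalized per row" means dividing each taxon's fractional abundance by its own maximum over time, which is consistent with red marking the row maximum. The function names, taxon labels, time points and abundances are all hypothetical.

```python
def normalize_rows(abundance):
    """Scale each taxon's time series by its own peak, so every detected
    taxon reaches 1.0 at its maximum and undetected values stay at 0.0."""
    out = {}
    for taxon, series in abundance.items():
        peak = max(series)
        out[taxon] = [v / peak if peak > 0 else 0.0 for v in series]
    return out

def peak_time(series, times):
    """Time point at which a taxon reaches its maximum abundance."""
    return times[max(range(len(series)), key=series.__getitem__)]

# Made-up fractional abundances on particles at four sampling times (hours).
times = [8, 24, 48, 96]
abundance = {
    "early_degrader": [0.40, 0.20, 0.05, 0.00],
    "late_scavenger": [0.00, 0.02, 0.10, 0.20],
}

normalized = normalize_rows(abundance)
# Ordering taxa by peak time makes the succession pattern visible.
succession = sorted(abundance, key=lambda t: peak_time(abundance[t], times))
print(normalized, succession)
```

Sorting the rows by peak time is one simple way to expose an early-to-late ordering of taxa.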
And so you see a pattern of succession where you have an early arrival, and it stays for a little while, and then other things arrive later, and so on and so forth. But the interesting thing is also that you have these phenotypes attached to it: because we have the isolates, we can ask what they do. These, by the way, were particles made of chitin, so you ask whether they grow on chitin, chitin is a sugar, it's a polysaccharide, or whether they eat the monomers of chitin or the dimers of chitin, and the answer is pretty much yes for the first part and no for the later arrivers. And so there you have this really drastic shift, in a way, in the phenotype. And it turns out the early ones are glycolytic organisms, primary degraders we call them, or exploiters if they don't produce the enzyme to break down chitin. And the later part are the predominantly gluconeogenic organisms that we call scavengers, because they're utilizing metabolic byproducts. The reasons are still somewhat unclear, but at that time we had no idea why these metabolic byproducts would be released if the only thing we have there is chitin. And there are different theories for this, but this is kind of the wrong thing to do, I think, in terms of, I know that you probably are very tired, so I'm gonna show you something very complicated, but if you want to close your eyes for the rest, it's okay. I think you did really well. Thank you. So maybe I just tell you the punchline, right? What mediates the transfer, we don't know, but from the data we have, we think it is phages, phage predation: prophages that get induced in the early colonizers release metabolites that the other things can utilize. So maybe I just leave it there, and if you want to see the data I can show you, but I'll stop here and then I take questions. And the summary is just repeating things that I already said.
There's a broad pattern of specialization, and we can read it from genomes because it has a signature, which could be just a correlate of these expansions in the pathways, in the number of genes in the pathways. And then, well, the story, I didn't show you any data, but it's this idea that the transfer of different forms of carbon from the sugar specialists to the acid specialists, at least in the communities that we study, is, I think, largely driven by prophages being induced: you know, cells lyse and all these metabolic byproducts are released for other organisms to take. Okay, so, tons of people to acknowledge, I already did partly during the talk, as well as the funding sources and this fantastic collaboration of people that includes Terry, which I'm proud to be a member of. So, thank you.
Thanks a lot, we have time for questions. Martina.
Thank you, very nice. One thing is, okay, let's say you see this change, first you have a sugar specialist and then an acid specialist, and it's in a succession. But for example, on a single resource like glucose, you see both sugar specialists and acid specialists. Do you think that you need phages also in that case, or is cross-feeding enough?
I don't know, I think, I don't know. In that case, it may be just enough. Well, certainly there's acetate coming out, right? But then there is also the problem in those experiments, as Terry mentioned, I think yesterday, that you get into stationary phase and then there is death. So this could be phages, it could be natural death, but I'm sure there's tons of things coming out of them in stationary phase.
Okay, thank you.
Thank you, so I am very fascinated by the last thing you were saying about prophage induction in the degraders. I mean, I don't want to steal time if this is gonna take time, we can discuss it in private, but do you have a sense of how prevalent these prophages are in these degraders?
Like if you take, I don't know, 100 of these degraders, how many of them will have prophages, and how many prophages on average will they have?
Well, at least, so we looked at this in the communities. Okay, I also don't want to derail too much because the slides are a bit complicated, that's what's annoying, but what we did, let's say, is we looked at the metagenomes of single particles, and we can detect prophage induction by looking at the ratio of reads that recruit to the prophage versus the rest of the genome. This is not easy to explain, but I think, ah, sorry, I don't have it here. But in the communities that we assemble on these little particles, there are degraders and there are non-degraders, which you can tell from the genomes. And in those particles, the degraders more frequently carried prophages than the non-degraders. So in those particles, at least statistically, they are more likely to have them, and then when you look at the things that get induced, it's pretty much only in degraders, because these are the first colonizers and they are also growing fast. So when they're growing, there's something, we have no idea what the signals are, but that's when the prophages get induced.
Thank you for a wonderful talk. I'm interested in, well, all parts, but my question is about the second part, where you have shown evidence of this extensive horizontal transfer of tools. I have long-time collaborators, and we have this toolbox model of evolution, but in this toolbox model we assume that if you already have a tool, you don't need a second one. And you say that it's important to have multiple tools adapted to different conditions. So I'm just wondering how much worse your fit would be if the presence or absence of tools were binary. So if you have one tool, you already have one, just at least one tool, zero or one.
So is it really the tendency to accumulate variants which drives this correlation, or is just presence or absence enough?
No, okay, I can tell you, because I asked Matti many times, because I'm showing this example since I think it's interpretable and I really like it. But he repeated to me many times that this was not the only way. There was also pathway length increasing, which I don't understand. That's why I'm not mentioning it. But yeah, let's say 50-50 or something like that. It was not clear, but this definitely happens, and this I find more interpretable. Also somehow pathway length gets larger, I don't get it.
The first thing is about this multiplicity of the tool, very interesting. I think let's pick an organism and study it.
Yes, okay.
And by the way, for the galactose, you should only count the first two as that, because the other is used by everybody, because it's to make components of the membrane. So that's where the signal is, that's great. About the phage part again, what's the rationale that phage will give rise to acid eaters? Because when it lyses, you get amino acids, amino acids are fine, but you're not gonna get these acids. Where are you gonna get these acids?
No, but acids include amino acids. Acid eaters include amino acids and organic acids.
But then by your counts, you're not gonna get any of them. Because for amino acids, you're taking generic transport, you're not gonna find a signature of amino acids. Because you directly go into amino acid.
What do you mean? I don't understand.
Like the way you were doing the propionate, right, these pathway signals. How are you gonna find signals of amino acid eaters?
That's a good question. So that's a good question. Because, okay, if I'm not mistaken, when we count the number of genes for sugars and we count the number of genes for acids, these are not amino acid pathways.
Yeah, yeah, so I'm guessing that you're actually recognizing acid eaters specifically, right? But it still includes, I don't see a logical link.
Right, I agree with you. But it still includes the amino acids. The ones that prefer amino acids.
Yeah.
I guess it's because they're primed to do gluconeogenesis. And amino acid eaters are doing gluconeogenesis.
No, right, but yeah.
Okay, any other question? Okay, let's thank Otto again. Thank you. So,