Hello everyone. Can we see how many participants we have? We have 44 participants, as far as I can see. So thank you for joining our session on ecosystems, with applications in ecology and agriculture. As we know, ecosystems such as the whole ocean, the human gut or human saliva are sustained by their microbial communities and, of course, by the diversity of these communities. It is therefore very important that we understand how this diversity changes through time and how microorganisms interact within these ecosystems. A better understanding would give us the means to keep these ecosystems healthy and also to manipulate them if we want to achieve certain desired outcomes. We also know that the data that come from environmental samples are not easy to deal with: they are large, messy and highly heterogeneous. So it's really necessary to develop robust bioinformatics approaches and models that capture the evolution of microorganisms and the interactions between them. Today we have five speakers who together will give a good idea of the latest work that has been going in this direction at the SIB. You can see that the topics are very diverse, and they reflect the variety of approaches and applications. So enjoy the session. I pass you to Teo, who will introduce the speakers.
Hello everyone. First we welcome Lukas Paoli, who is a PhD student at ETH Zurich under the supervision of Shinichi Sunagawa. He will talk for 12 minutes about the global ocean microbiome, and then we'll have time for one or two questions.
Hello everyone, and thank you for tuning in. As Teo said, my name is Lukas and I'm working as a PhD student with Shinichi Sunagawa at ETH Zurich. Today I'll present how metagenomics-based genome reconstruction can provide very important insights into an ecosystem, namely the global ocean microbiome.
And I'll focus on one angle, which is that of natural products and the biosynthetic potential of this ecosystem. It's an ecosystem that covers 70% of our planet, right? In that picture you have a view of the Pacific Ocean that shows just how prevalent that ecosystem is. And across that surface of our planet, in every single drop of water, we have more than 500,000 microbial cells. Altogether, these cells drive the biogeochemical cycles on our planet, producing most of the oxygen we breathe, for instance. To get a better understanding of that ecosystem, one can try to understand the genomic basis of that microbiome, right? With the traditional approach you would go to the field, for instance take a boat, sample, cultivate the microbial cells you sampled and sequence their DNA. With that, you would get isolate genomes from different sampling sites, as you can see on that map, and, having different strains of the same species, you could start doing comparative genomics. However, over the last 10 years, metagenomics has proven extremely useful for sampling the whole community in its natural environment. Here on that map you can see five of the most important metagenomic surveys of the global ocean microbiome. You have three spatial surveys, Tara Oceans, Malaspina and BioGEOTRACES, as well as two time series, the Hawaii Ocean Time-series and the Bermuda Atlantic Time-series. All of that together represents over a thousand metagenomes, and they capture the natural communities within the ocean microbiome. So by integrating these reference genomes derived from isolates with the metagenomes that sample natural communities in situ, you can see the natural occurrence of your isolate genomes in the communities you want to study.
However, the major issue here is that the mapping rate of these metagenomes on the reference genomes is only 10% on average across the whole ocean microbiome. That means that, of the DNA reads you have in your metagenomes, you can map only 10% to your isolate genomes. So you're missing 90% of the genomic content of the community you're trying to study, which is obviously the major fraction of it. With such a small fraction of the big picture, one can only wonder: what does this 90% of the community do? What's its ecological function? And, as I mentioned, what's its biosynthetic potential, meaning what is the secondary metabolism of these organisms that we don't know anything about: what are the chemicals they use to communicate among themselves, to defend themselves against viruses, or to attack each other through antimicrobials, for instance? This is particularly interesting in the midst of a viral pandemic: a good knowledge of the biosynthetic potential of the different microbiomes can lead to the development of antivirals and antimicrobials, which are under big pressure at the moment with the antimicrobial crisis. Right, but again, how do we get their genomes? I'll start by walking you through a pipeline for reconstructing genomes from metagenomes. I'll highlight one specific point on which I think we need to focus a little bit, and then move on to the results. Right, so you start with processed metagenomic reads from all your samples; in our case, that's about a thousand metagenomes. You assemble these metagenomes, so that you get genomic scaffolds for each of them. Once that step is done, you can back-map your metagenomic reads to your metagenomic assemblies. And here is where I want to focus a little bit: we back-map the metagenomic reads from all the samples in the data set to all the assemblies in that data set.
Right, and that enables differential abundance binning, meaning that you can see whether scaffold i has an abundance correlated with that of scaffold j, and therefore group them together. That's, for instance, the case displayed here. However, in another case, the abundances can be basically random, with no signal, and that's a case where you do not want these two scaffolds to end up together in a bin. Based on differential coverage and tetranucleotide frequencies, the genomic scaffolds are grouped together. Obviously, among these different groups you will have things that look like proper genomes, other genomic elements, or just random pieces put together. You need to filter based on marker genes to select only good-quality candidate genomes. Then there is a dereplication step: obviously there is redundancy among the genomes in the data set, and these metagenome-assembled genomes are dereplicated based on nucleotide identity to identify species-level clusters. All the genomes, and not only the dereplicated ones, are subsequently annotated functionally and taxonomically, also looking for mobile genetic elements. So, I insisted on the differential coverage, and I did so because on that figure we can show that using large-scale differential coverage improves the binning results threefold. The figure displays the ratio of cumulative quality scores of the binning results for metagenomes: the quality scores of binning with differential coverage divided by those of binning without differential coverage. This quality score captures both the number of genomes recovered and their quality. On all the data sets that we have, using differential coverage across 180, 190, 58 and 610 metagenomes improves the quality of the binning on average almost threefold, which is a lot. Right, so that's why I really wanted to insist on that.
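As an illustration of the differential coverage idea described in the talk, here is a minimal sketch, not the speaker's pipeline. It assumes each scaffold already has one coverage value per metagenome, and it groups scaffolds whose coverage profiles are strongly correlated. The scaffold names, the toy coverage values, the 0.9 correlation cutoff and the greedy grouping are all hypothetical choices made for the example; the talk describes binning that also uses tetranucleotide frequencies and marker-gene filtering.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation between two equal-length coverage profiles."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # a flat profile carries no co-abundance signal
    return cov / sqrt(var_x * var_y)


def group_scaffolds(coverage, min_corr=0.9):
    """Greedily group scaffolds whose coverage profiles correlate.

    coverage: dict mapping scaffold name -> list of per-sample coverages.
    A scaffold joins the first bin containing a member it correlates with
    at >= min_corr; otherwise it starts a new bin.
    """
    bins = []
    for name, profile in coverage.items():
        for members in bins:
            if any(pearson(profile, coverage[m]) >= min_corr for m in members):
                members.append(name)
                break
        else:
            bins.append([name])
    return bins


# Hypothetical coverages of three scaffolds across five metagenomes.
toy_coverage = {
    "scaffold_i": [10.0, 2.0, 30.0, 5.0, 18.0],
    "scaffold_j": [21.0, 4.5, 59.0, 11.0, 35.0],  # tracks scaffold_i
    "scaffold_k": [7.0, 25.0, 3.0, 14.0, 9.0],    # unrelated profile
}
bins = group_scaffolds(toy_coverage)
print(bins)
```

The more samples the reads are mapped across, the longer each profile becomes, which is one way to read the speaker's point that large-scale back-mapping gives the correlation more signal to work with.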
And then we applied the pipeline I described to the 1,000 metagenomes and recovered 26,000 metagenome-assembled genomes (MAGs) from them. These 26,000 MAGs could be grouped into 5,000 species. Despite previous efforts to reconstruct genomes from this specific ecosystem, we still find a lot of phylogenomic diversity. As you can see on this figure, a third of the species are known, a third are unknown species from known genera, and the other third are completely novel. We are not the only ones making such efforts to access the 90% uncultivated fraction of the global ocean microbiome, so we integrated our reconstruction efforts with other people's. We include manually curated MAGs from Delmont and colleagues, the single-cell amplified genomes (SAGs) from Pachiadaki and colleagues, as well as reference genomes from isolates using the MAR database. Overall, we end up with 35,000 genomes, and we can display how they are distributed across the tree of life, or rather the tree of bacteria and archaea. In gray you see the Genome Taxonomy Database (GTDB) tree backbone, and we overlay the 35,000 genomes on it through phylogenomic placement. A darker blue indicates a higher number of genomes in that specific part of the tree. We can then see how the different types of efforts capture different parts of the tree. We see that the MAGs are the ones capturing most of the phylogenomic diversity, although we see a very strong complementarity between the SAGs and the MAGs, with the SAGs capturing clades such as Pelagibacterales much better. Then, as I mentioned, we want to focus on the biosynthetic potential of the ocean microbiome and of these genomes. That's this outer layer, where you see the number and the type of biosynthetic gene clusters within the genomes that we have.
We can see right away that Actinobacteria stand out, and that's quite expected because that's where Streptomyces is, the bacterial genus from which we derive most of the antibiotics we have right now. But something else is striking, and that is the clade highlighted here with a red arrow. It's a MAG within the Eremiobacterota phylum, which is completely uncharacterized and has a completely unsuspected biosynthetic potential. As you can see here as well, among the top 10 most talented bacterial species reconstructed in the data set in terms of biosynthetic gene clusters, this unknown species from the Eremiobacterota phylum is the one with the highest number of biosynthetic gene clusters, so the highest number of natural products encoded in its genome. A quick snapshot of all these products shows that they are very diverse. They include a wide range of potentially active compounds, but I want to draw your attention specifically to three of them, which are proteusins. These are a recently characterized family of potent antimicrobials and antivirals. And interestingly, for instance, the cluster at the bottom shows homology and synteny similarity to a very recently characterized proteusin cluster that shows activity against adenoviruses such as the margot virus. So very critical diseases. So why don't we know about it, right? It seems to be particularly interesting, with a very interesting potential for the ecology of the ecosystem and for the natural products community. We potentially don't know about it because it's a very poorly known candidate phylum with only a few MAGs representing it, right? This is another view of the GTDB tree, where you see how the Eremiobacterota phylum is placed. And if you look at the other MAGs in that phylum, we see that they really don't have a biosynthetic potential similar to that of the MAGs we reconstruct in our study.
So there is a large difference between the two, between the picture of what we know in the phylum and what we reconstruct here. And finally, one explanation as to why we had no knowledge of that biosynthetic potential in that specific phylogenomic area is that the MAGs were reconstructed from a very challenging environment: samples from below 4,000 meters depth in the Pacific and North Atlantic Ocean, where it is quite challenging to go, to sample, and even more so to isolate organisms. So with that, I would like to thank everyone for your attention. I would also like to thank everyone in the Sunagawa Lab, the students that work with me, and I would like to thank the Piel lab at the Institute of Microbiology of ETH and Serina Robinson at the University of Minnesota, who are working with us on the experimental characterization of the biosynthetically rich MAGs that I presented today. Finally, I have a poster today, if you'd prefer to dive more into all of this. Thank you very much.
Thank you, Lukas. I think we have time for one question before going to the next talk. So I'll take the most important one, from Mose Mane, who asks: which programs or strategy did you use to infer the function of genes?
So for the secondary metabolism, we use the antiSMASH tool developed by Marnix Medema's lab in the Netherlands, which is specialized in the prediction of biosynthetic gene clusters, using the location of different genes in their genomic context to predict the type of natural product encoded by each cluster. Was the question specifically about the antivirals and things like that, or about functionality in general?
Well, antiviral was an example, but I guess it was rather a general question.
Okay. We complement that with KEGG annotations; we also use Prokka as a general annotation pipeline, and eggNOG.
So we have different sources, and we can complement the antiSMASH predictions with external annotations to confirm or complement them.
Thank you, Lukas. We're going to the next talk now, which will be given by Carlos Peña from the CI4CB in Yverdon-les-Bains.
Thank you very much. I'm going to share my screen. Hopefully all will be okay. So thank you, everyone. I hope you are only seeing the single screen, because I have some other things on my screen. If not, just let me know. I'm going to speak today about the Maldiv project, which is for machine learning diagnostic of soil, but it's in Switzerland, not so far away. This is a collaboration mainly between the Changins wine school, the School of Oenology, and our school, the HEIG-VD. CI4CB is my group, Computational Intelligence for Computational Biology. The idea is that vineyard soils are affected by several sources of stress, so we would like to have an easy way of indicating whether that's the case. And we need to monitor the soil quality. The hypothesis is that protists could be a good indicator of that stress because they are quite diverse, very abundant, and very sensitive to environmental conditions. Protists have been successfully used for water bioindication, but for the moment they have not been widely used for soil ecosystems. So our goal in the project is to combine microbiomics data, molecular data, with machine learning so as to have a tool that could serve as a diagnostic of soil quality. That's the method. We are interested in several environmental factors, that is, what kinds of factors could affect the presence of these protists, and for that we have data from actual vineyards in Vale. On this basis, what our partners at Changins have done is to produce metabarcoding data from samples taken from the soils.
Once we have the sequences and the metabarcoding, we can do different kinds of analysis, and I'm going to present one of them. As I mentioned, the metabarcoding part and the related bioinformatics are done by our partners at Changins, and that's not the subject of today's presentation. What we are interested in is mainly these soils. We have two sampling years, 2015 and 2016, with a number of samples, and they were able to quantify more than 1,000 taxa. They are not exactly species, but well, you know that better than me. From this data, the question is how far we can predict the presence of these environmental stress sources based on the abundance of different protists. As I mentioned, these factors usually have an effect on protists, and we think that the abundance of these protists will give us an indication of these stress sources. So what we are predicting, based on the quantification of protists, is how far we can identify the different stress factors. Just to continue: if we look only at copper, for example, we can see that it is possible to predict the presence of high levels of copper with a relatively good accuracy, more than 80%. So that's already a good sign. We tested three different methods here. We have tested other methods too, but that's a good indication. We are not actually going to quantify copper based on these protists, because copper is quite easy to measure; it's just to show that it is possible to infer the amount of copper from this kind of indicator. If we extend that prediction to other environmental variables, we can see that it is possible for some of them with a relatively high accuracy: 81% for plant coverage, 80% for organic matter, or even 72% for pH or the percentage of water, different environmental conditions that could be representative of the health or the state of the soil.
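To make the prediction setup concrete, here is a minimal sketch of classifying soil samples as high or low copper from protist abundances, fitting on one sampling year and scoring on another. Everything in it is an assumption for illustration: the nearest-centroid classifier is a simple stand-in and not one of the methods tested in the project, and the three-taxon abundance vectors and labels are invented toy values, not project data.

```python
def centroid(rows):
    """Column-wise mean of a list of equal-length abundance vectors."""
    return [sum(col) / len(rows) for col in zip(*rows)]


def fit_nearest_centroid(samples, labels):
    """Return one mean abundance profile per class label."""
    by_label = {}
    for row, label in zip(samples, labels):
        by_label.setdefault(label, []).append(row)
    return {label: centroid(rows) for label, rows in by_label.items()}


def predict(model, row):
    """Assign the label whose centroid is closest (squared Euclidean)."""
    def dist(center):
        return sum((a - b) ** 2 for a, b in zip(row, center))
    return min(model, key=lambda label: dist(model[label]))


def accuracy(model, samples, labels):
    """Fraction of samples whose predicted label matches the true label."""
    hits = sum(predict(model, row) == label
               for row, label in zip(samples, labels))
    return hits / len(labels)


# Hypothetical relative abundances of three protist taxa per soil sample.
# "Year one" samples are used for fitting, "year two" for evaluation.
year_one_x = [[0.60, 0.30, 0.10], [0.55, 0.35, 0.10],
              [0.20, 0.30, 0.50], [0.15, 0.25, 0.60]]
year_one_y = ["low_copper", "low_copper", "high_copper", "high_copper"]
year_two_x = [[0.58, 0.32, 0.10], [0.18, 0.22, 0.60]]
year_two_y = ["low_copper", "high_copper"]

model = fit_nearest_centroid(year_one_x, year_one_y)
held_out_accuracy = accuracy(model, year_two_x, year_two_y)
print(held_out_accuracy)
```

Scoring on a different sampling year than the one used for fitting is what makes an accuracy figure informative about new soils, rather than about the samples the model has already seen.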
So if we can combine all these predictions, we could have a better picture of the stress based only on the quantification of protists in these samples. That's the idea behind our project. We have done that. This is done with one method, Cubist, but we have been trying different machine learning methods and finding out which are the best adapted. We would like to go even further with some other environmental conditions because, for example, for the basal respiration of the microbial world the predictive value is relatively low. So we would like to see if it's possible to infer more information. Just to be clear, these values are predicted for 2016, but the models were trained on 2015. So we can see that from one year to another we can have a relatively good prediction. In the end, we will try to have much more data over a longer time, so that the models are more predictive and capture a better indication. Okay, I think that's what I have to say. In conclusion, we have shown that protist communities, given their response to different environmental conditions, can be combined with machine learning models to predict whether a given soil is subject to stress from different possible sources. That's what we are investigating just now, and it opens a lot of new questions. The next step will be to perform what we call feature selection to identify which of these protists are the best predictors: instead of using the more than 1,000, if we can use only about 100 of them to predict all these conditions, that will be much better from a practical point of view. And that's all for today. Thank you very much for your attention, and I hope we will have time for questions just now or at the end during the final Q&A. So I'll stop sharing. Thank you, everyone.
Thank you, Carlos. I think we do have time for a short question, and I just received one, I think.
So it's a question from Emmanuel Bouté, who asks whether you also consider the advantage composition of the soil, because various root exudates may strongly influence microorganism communities.
Okay. As I mentioned, all the biological questions are more on the side of our Changins partners, and unfortunately they were not able to be here. I think that, in general, we cannot completely separate the environmental sources from other possible sources of variation, and the only way to filter out or to account for all these effects would be to have actual information on them, so as to build models that are aware of the existence of those factors. I mean, the models are not magically filtering out this kind of perturbation, and we need to assess whether they are present or not.
Thank you, Emmanuel. We're heading to our next talk, from João Matias Rodrigues, who is a postdoc at the University of Zurich under the supervision of Christian von Mering. He will talk about the MicrobeAtlas project data sets.
Thank you, Theo. Hi everyone. I'm just setting up to talk. Yeah, so I'll talk about the MicrobeAtlas project. Unfortunately, I won't go into the new insights because that's still work in progress. I'm sure all of you are familiar with the usual microbial analysis: you collect your samples, isolate DNA, sequence, and in the end you get the raw sequence data. If you're interested in the composition of the microbial community, you can perform a 16S analysis and obtain a set of OTU representatives and, in the end, the OTU counts and the taxonomic annotation when that's available. If you go further, you can make your stacked bar charts for each of your samples, in which each color represents the fraction of a certain microbial species, for example.
And I'm going to assume that at this point you want to know: what is this species doing in my sample or in my set of samples? Does it shape, or is it shaped by, the environment? What are the prevalent and typical microbial OTUs? For this, you have to compare your samples to another set of reference samples, you know, to establish what the baseline is. You can also perform a PCoA analysis to identify clusters: are there differences between the sample groups in my study or not? But this is all for trying to understand what makes the differences between these groups of samples. How does my sample group differ from the reference samples, for example, or between the different conditions? So between these conditions you do, for example, a differential abundance analysis, and then you get a list of the most different taxa or species between the conditions. And this is where the problems and the really interesting part start. For example, to have an idea of what a certain species is doing in your sample, you might try to understand it by investigating its function. For that you would look up in the literature the reported function of the species. The problem is that this is not possible for the vast majority of taxa, since they're unclassified and have not been isolated either. And to illustrate how big the knowledge gap is: if you take the 16S sequences of culture collection strains and cluster them at what would be the species level, that is, 97% identity, you get around 800 species. So a lot of the strains belong to the same species. If you now take the high-quality sequenced genomes, there are 20,000 to 40,000 right now, and if you cluster those, you get about double that.
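The clustering of 16S sequences at 97% identity mentioned above can be sketched as a greedy procedure. This is only an illustration under simplifying assumptions, not the tooling used in the project: sequences are taken to be pre-aligned and of equal length so that identity is just the fraction of matching positions, and the strain names, toy sequences and the `with_mismatches` helper are hypothetical.

```python
def identity(a, b):
    """Fraction of matching positions between two pre-aligned sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def greedy_otus(sequences, threshold=0.97):
    """Greedy clustering into OTUs.

    sequences: list of (name, sequence) pairs, processed in order.
    Each sequence joins the first OTU whose representative it matches at
    >= threshold identity; otherwise it becomes a new representative.
    """
    otus = []  # list of (representative_sequence, member_names)
    for name, seq in sequences:
        for representative, members in otus:
            if identity(seq, representative) >= threshold:
                members.append(name)
                break
        else:
            otus.append((seq, [name]))
    return [members for _, members in otus]


def with_mismatches(seq, n):
    """Toy helper: change the first n bases of seq to a different base."""
    swap = {"A": "C", "C": "G", "G": "T", "T": "A"}
    return "".join(swap[c] for c in seq[:n]) + seq[n:]


# Hypothetical 100-base "16S" fragments.
base = "ACGT" * 25
toy = [
    ("strain_a", base),
    ("strain_b", with_mismatches(base, 2)),  # 98% identical to strain_a
    ("strain_c", with_mismatches(base, 8)),  # 92% identical to strain_a
]
otus_97 = greedy_otus(toy, threshold=0.97)
otus_90 = greedy_otus(toy, threshold=0.90)
print(otus_97, otus_90)
```

Running the same sequences at two thresholds shows why the number of "species" depends on the chosen identity level: a stricter cutoff splits strains that a looser one would merge.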
And as Lukas and Silas also said today, you get around 5,000 more genomes, so most of the time around 5,000 more species. But if you then take the most comprehensive census of 16S sequences and cluster them, you get 150,000 species-level OTUs. So you can see that what is known, what you can actually do experiments with, namely the culture collection strains, and what you have genome information for, is still about 2% of the most comprehensive data set that we currently have. And this is, of course, a very small fraction of the real diversity of microbial species out there. So the question is: how can we speed up the research into microbial knowledge? One clear thing is that we can use metagenomic sequence data. As of January 2020, there are around 3 million sequenced samples. And these are very diverse, geographically as well as environmentally. I mean, we have a specific interest in the animal gut. But using this information we could start building some knowledge about the unknown taxa: where are they usually present? We could even establish other things, like which specific microbes they are usually found with, the ecological relations between them. The problem is that until now most studies have concentrated on analyzing their samples independently. You have studies that concentrate on plant microbes, on soil, and on gut or fecal samples, but they've generally been studied in isolation. And of course there are a lot of challenges as well, because different studies target different ribosomal RNA regions and use different sequencing approaches, and comparing the results across studies would be quite cumbersome. You might not even know that a certain study found some microbe that you are interested in unless you download the raw sequence data and analyze it yourself.
There is also the metadata heterogeneity, there are large amounts of data, and of course also statistical issues pertinent to these data sets. So for the last 10 years we've been working on developing tools to be able to analyze all of this data. There are several of them: HPC-CLUST and MAPseq, just for the 16S analysis, and Janko, also in the group, developed FlashWeave for identifying interactions in this kind of data. We have already analyzed half of the data, so that's up to 2019, and we're currently analyzing the remaining data from last year. But of course, this data is quite large. So to allow researchers to very quickly and easily browse, search and compare their data to this, Gregor has been developing this website, MicrobeAtlas. Just to illustrate how useful it is: you can now take your 16S sequence and, just like in BLAST, paste the sequence. This will search our closed reference of 1.5 million sequences and basically immediately, in a third of a second, you get all of the samples in which this microbial taxon has been found, out of the almost 3 million microbial samples we've analyzed so far. And you get all of this information without even having a taxonomic classification for your species. So even for the vast majority of unknown species you can get some ecological information. You get all the IDs from the NCBI SRA data sets, including the abundances; you can download this information; you get the pie chart of the samples in which this microbial taxon was found, and the density plot of abundances. You can also submit your own sequence data. If you, for example, did all the analysis and already have OTU representatives, you can simply submit the OTU representatives and counts, or you can just submit your raw sequence data.
And it takes a couple of minutes to get your results back, and you'll get a profile, a plot of the composition of your sample, using the same OTUs that we use, so that it is directly comparable to the data in our database. More interestingly, you also get the samples we have already analyzed that are most similar to yours, so that you can download them and use them as a reference, for example, to compare your data to, and to find out what is different between your microbial samples and the ones that have already been analyzed and published. So, yeah, again, you can provide your own reference sample group or use one of the available ones, and of course we will give you the p-value of the differential abundance test. So with that, I would like to finish and thank everybody who was involved in this project: Christian for being a great boss, and Gregor, Yankuan, Maria, Sebastian, Lisa and the rest of the von Mering lab, and the SIB Days organizers, who I'm sure had a lot of work setting everything up, and you for your attention. Thank you.
Thank you, João. I'm not sure a new question has arrived yet. Yes, there is one, from Elena Montenegro-Borbolam: have you tried amplicon sequence variants instead of OTUs?
I guess the sequence variants are essentially 100% identity. But 100% identity over a certain region, for example, does not guarantee that the full sequences are 100% identical. In the database we actually consider different levels of OTUs, from 99% down to 90%: so 99, 98, 97, 96 and 90. So you can look at the data at all these levels. We haven't done 100%, but that could also easily be done.
Thank you. Thank you all for all of those interesting and diverse talks. Now it is time to switch meetings and go to the meet-the-speakers session, where we will all be able to talk with each other.
So I invite you to click on the link that is on the SIB website to go to that room and meet all together.