Okay, good morning everyone. Can everyone hear me? I realize it's almost lunchtime, but I would like to ask a moment of your time to hear me talk about mini-metagenomics. My name is Bojk Berghuis. I am a postdoctoral scholar in the lab of Steve Quake at Stanford, and we are developing methods that allow a more quantitative approach to genomic profiling of complex microbial communities.

I'd like to start off by thanking the organizers for a great conference and for giving me the opportunity to speak here, and also to thank the people without whom this wouldn't have been possible. Of course Steve Quake, who has been developing methods in genomics for quite a while, from single cell genomics to the field of microbial ecology, as I will explain later. And Brian Yu, a former graduate student in the Quake lab; we are now working together to further develop the mini-metagenomics method. If you are interested in this method, I would highly recommend looking up the publication, which will soon appear in eLife and can be found on bioRxiv. Thanks also to the others in the Quake lab for their expertise and support in sequencing and microbial ecology. This project is in collaboration with the Department of Energy Joint Genome Institute, partially in collaboration with the lab of Victoria Orphan at Caltech, and partially with the lab of George Church at Harvard. And these are the funding sources.

Before I start on mini-metagenomics, a little bit about myself. I finished my PhD in single molecule biophysics in April of 2016, where I looked at how single proteins, polymerases, bind to nucleic acid templates. From these binding and processing events I was able to extract binding lifetimes and polymerase pause lifetimes, which in turn allowed me to set up kinetic models for these systems, to understand and predict them better. I mention this because I see a number of parallels between the field of single molecule biophysics and what I'm doing now. It's basically taking the physics mindset: taking apart an initially very complex and messy biological system into its individual components, looking at them one by one, and building up understanding one small step at a time. In physics you have the particle-in-a-box system, which single molecule biophysicists took to biology by putting proteins in boxes. Now we're putting genomes, or cells, in boxes, which I hope to convince you has a number of specific advantages.

So what is the scope and scale of the problem here? At present there are around 5,000 complete microbial genomes, around 11,000 cultured microbial species, and around 10 million 16S rRNA sequences reported. Recent predictions estimate the total number of microbial species at around 10^12. Given the scope and scale of this discrepancy, this has been coined microbial dark matter. If this number is true, it would mean that 99.999% of the microbial genetic information on Earth still awaits discovery. But even if this estimate is off by several orders of magnitude, the scope of the problem is still immense. That is why we are developing methods to scan these communities in a more efficient manner.

So where to start? Well, in the lab we have a number of microbial communities from extreme environments. We start off with extremophile communities because, well, there's an intrinsic interest in the boundaries of life.
They are known to produce a number of useful compounds, for instance the Taq polymerase, which was found in a Yellowstone hot spring. They're complex, but not too complex. And they provide a nice opportunity to look at overarching patterns emerging in extreme environments in geographically separated locations.

In microbial ecology, it always boils down to two basic questions: who is there, and what are they doing? Traditional bulk metagenomics approaches this by taking an environmental sample, extracting the DNA from all the cells, cutting it up, and sequencing it. This has a very high throughput, but it treats the environment as a bag of DNA: there is no cellular context, the assembly can be computationally very intensive and challenging, and ultimately it has a very low resolution. At the other end of the spectrum we have single cell genomics, where you take a single cell from an environmental sample, extract the genome, amplify it, and sequence it. This achieves a very high resolution and is computationally much less complex, but has a much lower throughput, and in the case of very complex communities with hundreds or maybe even thousands of species, it would be a daunting task.

Mini-metagenomics occupies a space in between. We take an environmental sample, take a subsample of around 1,000 to 10,000 cells per experiment, and distribute these over the 96 wells of a commercially available Fluidigm C1 microfluidic chip. The complexity in each well is greatly reduced compared to the original sample. We lyse the cells, amplify the genomes, attach a well-specific DNA barcode, and sequence. Afterwards we assemble the DNA into longer contiguous sequences, and we do this per well, which also reduces the computational complexity compared to metagenomics. In doing this, mini-metagenomics achieves a resolution that can be similar to single cell genomics, but greatly increases the throughput: not quite to the level of metagenomics, but on the order of 1,000 to 10,000 cells per experiment.

Two key aspects of this assay to remember: the subsampling reduces the complexity of the community, and the partitioning is a Poisson process (see the loading sketch below). This adds information that we can later use, as I'll elaborate on. Another key feature of mini-metagenomics is that after sequencing and analysis, which I will also elaborate on later in this talk, you get your genomes or your species, and because we have a well-specific barcode we know exactly which chamber each piece of DNA was found in. We can then go back to those chambers and sequence them deeper, a targeted deep sequencing approach that no other method can do: you can really focus the cost and power of sequencing on the genomes that you think are most interesting. So the goals here are to perform genomic screening of around 1,000 to 10,000 genomes per experiment and, in doing that, to demonstrate that we can focus the cost and power of sequencing where it is most needed.
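To make the Poisson loading concrete, here is a minimal simulation sketch in Python. This is not from the paper; the well and cell counts are simply the illustrative values mentioned above. Each cell lands in a uniformly random well, so the per-well counts come out approximately Poisson distributed:

```python
import numpy as np

rng = np.random.default_rng(0)

n_wells = 96              # chambers on the Fluidigm C1 chip
n_cells = 1000            # subsampled cells in one experiment
lam = n_cells / n_wells   # Poisson mean occupancy per well

# Each cell is assigned to a uniformly random well, so the per-well
# counts are approximately Poisson(lam).
wells = rng.integers(0, n_wells, size=n_cells)
counts = np.bincount(wells, minlength=n_wells)

print(f"mean cells/well: {counts.mean():.1f} (lambda = {lam:.1f})")
print(f"max cells/well:  {counts.max()}")
print(f"empty wells:     {(counts == 0).sum()}")
```

The same logic applies per species: a species at low overall abundance contributes a small Poisson mean per chamber, which is what the co-occurrence argument in the next part relies on.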
We can also then relate the functional gene sets to individual species and really try to answer not only who is there, but who is doing what, and what the different species dependencies are. At a later stage it would be interesting to look at patterns emerging in complex microbial communities that experience a similar environmental pressure.

A number of method development questions arise. This is not an exhaustive list, but the first set (how efficient is the lysis, how does the partitioning actually improve the assembly, how else can we use the Poisson information?) all relate to this Poisson process. For example, in Brian's paper there is a figure with a number of genomes that he found in his experiments, from which he can create a binary presence map showing in which chambers of the microfluidic chip each genome was present. But from such a presence map alone it is impossible to know whether a genome in a given chamber arose from a single cell or from multiple cells. This is something you'd like to know, because if you know for sure that you had a single cell in a chamber, you can start comparing single cells of the same species across different chambers and look at genetic heterogeneity, and so on.

Using Poisson statistics, we can determine a co-occurrence threshold. We know that the probability of more than one cell of a species co-occurring in a chamber goes up with the Poisson average, and hence with the number of chambers that are occupied. Here, for example, I took a 1% threshold: for a 96 chamber experiment this means that at most 14 chambers may be occupied, and for a 50 chamber experiment at most 7 (a rough reconstruction of this calculation follows in the sketch below). The number of chambers thus sets the dynamic range of the threshold. You can then go back to your data, cross out all the genomes that occupy more than 14 chambers in the one case or more than 7 in the other, treat the remaining occurrences as originating from single cells, and start comparing these single cells across chambers.

The second set of questions depends more on community-specific parameters. If a species contributes many cells per chamber, Poisson statistics dictate that it will be present in many more chambers than a species that is present only once in a chamber, so you can read that off from the Poisson average occupation over the entire experiment. These community-specific parameters raise a number of questions relating to the scaling of the experiments: you would intuitively assume that measuring more is always better, but this is not necessarily the case. Microbial communities are typically assumed to have a log-normal abundance distribution, so the species abundances follow a log-normal form. If you have two microbial populations, each with an equal number of species, a community can be either very simple, meaning that all species have roughly similar abundance, or very complex, meaning that the log-normal distribution has a high skew, where a small number of species dominate the community and mask the presence of species with very low abundance. So if we subject these two kinds of populations, with different log-normal abundance distribution skews, to our experiments, what can we expect to see?
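As a rough reconstruction of that threshold, here is a short Python sketch. The exact criterion used in Brian's paper may differ; I am assuming here that the 1% refers to the per-chamber probability of holding two or more cells of the same species:

```python
import numpy as np
from scipy.optimize import brentq

def cooccurrence_threshold(n_chambers, p_multi=0.01):
    """Maximum expected number of occupied chambers such that the
    per-chamber probability of holding two or more cells of a species
    stays below p_multi, under Poisson loading with mean lam:
        P(k >= 2) = 1 - exp(-lam) * (1 + lam)
        P(k >= 1) = 1 - exp(-lam)
    """
    # Solve P(k >= 2) = p_multi for lam.
    lam = brentq(lambda l: 1 - np.exp(-l) * (1 + l) - p_multi, 1e-6, 5.0)
    # Expected number of occupied chambers at that lam.
    return n_chambers * (1 - np.exp(-lam))

print(cooccurrence_threshold(96))  # ~13, i.e. at most ~14 occupied chambers
print(cooccurrence_threshold(50))  # ~7
```

Under this assumed criterion the expected occupancy comes out at roughly 13 to 14 of 96 chambers and roughly 7 of 50, consistent with the numbers quoted above.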
Can we, from one single measurement, know the total number of species in a population? Here I have six environmental samples, and I perform a 96 chamber, 10 cells per chamber experiment, so roughly a thousand cells. Which sample has the most species? It is basically impossible to tell, because of the skew of the abundance distribution: from measuring only once, a small community of around 117 species looks similar in size to a large community of 2,400 species. Doing one measurement is simply not enough; the abundance distribution can mask many species. By doing an additional measurement at an increased number of cells per chamber, however, you would see a large rise in the number of species found for the large population, whereas the small population would turn out to be almost complete already (the simulation sketch below illustrates this masking).

Again, this all comes down to the sampling probability, which looks very different for the simple and the complex community. With a thousand cells you sample the simple population almost completely, whereas in the complex one the low-abundance species are masked by the presence of high-abundance species. If you increase the number of cells per chamber, the species with low abundance increase only very marginally in occurrence, while the species with high abundance increase very rapidly. So there is a larger spread in the rates of increase as you increase the number of cells per chamber, and this spread is directly proportional to the abundance distribution of species in your community. In the limit of an identical number of cells for each species, this would be a horizontal line. So the per-species rate of increase in your experiment is directly proportional to the abundance distribution of your population as a whole. This means that if we do a concentration sweep, we can map out the population abundance distribution: one axis is the species rank, the other the per-species rate of increase. If a few species dominate the population, or are present in very high abundance in your experiment, you see those increasing at a much faster rate than the species that are present only very sparsely. This implies that measuring more is not necessarily the smartest thing to do with these kinds of distributions.

So how would this look? Take a simple population and our 96 chambers. One cell per chamber means roughly 100 cells in your experiment, and here the dotted line is a single occurrence, the occurrence threshold in our experiment. We would have around 100 cells of the most abundant species. This would also be true for a complex community, but as we increase the number of cells per chamber, we soon see that the simple population is almost completely present in the experiment, whereas for the complex community you keep seeing new species emerging, while at the same time having species that are present at 10^4 cells or more in your experiment.
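Here is a minimal Python simulation of this masking effect. The species counts match the example above, but the log-normal shape parameters (sigma) are illustrative assumptions, not values from the talk:

```python
import numpy as np

rng = np.random.default_rng(1)

def species_seen(n_species, sigma, n_cells):
    """Distinct species detected when n_cells are sampled from a
    community whose abundances are log-normal with shape sigma."""
    abundances = rng.lognormal(mean=0.0, sigma=sigma, size=n_species)
    p = abundances / abundances.sum()
    draws = rng.choice(n_species, size=n_cells, p=p)
    return np.unique(draws).size

for n_cells in (1_000, 10_000):
    simple = species_seen(117, sigma=0.5, n_cells=n_cells)
    skewed = species_seen(2_400, sigma=2.5, n_cells=n_cells)
    print(f"{n_cells:>6} cells: simple community {simple:>4} species, "
          f"complex community {skewed:>4} species")
```

At around a thousand cells the two communities can report deceptively similar species counts; only at higher sampling depth does the skewed community keep yielding new species.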
This is why scaling doesn't necessarily make sense: for a complex community, the ratio of the maximum species abundance to the median keeps increasing, so you are basically spending more sequencing cost on the same species, and we need a smarter way to measure. However, you can tune the cell concentration such that certain species of interest fall within the single cell range, to look at those species in a single cell genomics fashion. Is there an optimum number of cells per chamber? Again, this is highly dependent on the abundance distribution of your community. For a simple community there is a peak in the single cell range, whereas for a complex community this is much less clear, and where this peak sits also depends on your population size. So what you need to do is measure at multiple concentrations.

To conclude this part: Poisson statistics can be used to set a co-occurrence threshold. The rate at which a species increases in occurrence is directly proportional to its relative abundance in the community. Scaling up your measurement, simply measuring more, doesn't necessarily make sense, but here we can rely on going back to the specific chamber holding a genome of interest and sequencing that deeper. Which species fall within the single cell range can be tuned, and the optimum number of cells per chamber can be found by varying the cell concentration. So there are two attractive approaches: do a light scan with our mini-metagenomics experiment first and then target specific microfluidic chambers for deeper sequencing, to avoid measuring more of the same species; or do a 16S scan first and then target the genomes that we find most interesting.

The second part is a bit about our current work in progress. As I told you, we have a number of samples in the lab that we are subjecting to these measurements, and for which we are developing an analysis pipeline and quality standards. We take an environmental sample and perform mini-metagenomics and shotgun sequencing together on the same sample, to be able to compare them. We assemble the genomes, then annotate them by uploading them to the JGI database. We do k-mer-based clustering and dimensionality reduction using t-SNE (a minimal sketch of this step follows below); probably many of you in metagenomics are familiar with this pipeline. Then we cluster, or form genome bins, from this t-SNE, as I will show. From this we can do all kinds of analysis: look at genomes, select certain species of interest for deeper sequencing, and look at functional interactions, so more of the who-is-doing-what question. Again, these steps are probably going to be very familiar to many of you.

So what have we done? We've taken five samples from Yellowstone National Park, the Obsidian Pool. They are from the same pool, a hot spring. We performed mini-metagenomics and shotgun sequencing, and we devoted an equal number of sequencing reads to each method. For the overall results, here we have a ranking of the phyla present in the experiments, for all five samples. One sample stands out: it has a high presence of cyanobacteria and almost no Crenarchaeota, and it was taken at a lower temperature. The agreement between the other samples shows that we can reproducibly measure these communities with mini-metagenomics.
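As a sketch of what the k-mer step can look like, here is a minimal Python version. The contigs below are placeholders, and a real pipeline would presumably count canonical k-mers (combining reverse complements) over assembled sequences longer than 5 kb:

```python
from itertools import product

import numpy as np
from sklearn.manifold import TSNE

BASES = "ACGT"
KMERS = ["".join(p) for p in product(BASES, repeat=5)]  # all 1024 5-mers
KMER_INDEX = {k: i for i, k in enumerate(KMERS)}

def kmer_profile(contig, k=5):
    """Normalized 5-mer frequency vector for one contig."""
    counts = np.zeros(len(KMERS))
    for i in range(len(contig) - k + 1):
        idx = KMER_INDEX.get(contig[i:i + k])
        if idx is not None:  # skip windows with ambiguous bases (N)
            counts[idx] += 1
    return counts / max(counts.sum(), 1)

# Placeholder contigs; in the real pipeline these are assembled
# sequences longer than 5,000 bp from each barcoded well.
contigs = ["ACGTTGCA" * 1000, "GGCCATGC" * 1000, "ATATCGCG" * 1000]
profiles = np.array([kmer_profile(c) for c in contigs])

# Project the 1024-dimensional 5-mer space down to 2-D for plotting.
# (perplexity must be smaller than the number of contigs)
embedding = TSNE(n_components=2, perplexity=2).fit_transform(profiles)
```

Each point in the resulting 2-D embedding corresponds to one contig, which is what the cluster plots in the next part show.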
What we can do is map the amount of DNA back onto the microfluidic chip, as well as the assembly quality and the origin of the different sequences. From this we get a ranking of the different phyla present with their assembled length, where the assembled length is a good proxy for the number of genes present, as you have around one gene for every kilobase. Mini-metagenomics performs very similarly to bulk metagenomic assembly, but it also finds phyla that are unique to mini-metagenomics and do not appear in the metagenomic sequencing effort.

So, like I said, we do a k-mer analysis, in this case a 5-mer analysis, and we reduce the dimensionality to be able to plot the high-dimensional space in two dimensions. What you get is this clustered data, where each data point represents an assembled sequence longer than 5,000 base pairs. If we color them by origin, the metagenomic contigs versus the mini-metagenomic contigs, you see that there are clusters that only mini-metagenomics finds. This means that those regions of k-mer space are only found by the mini-metagenomic method, whereas each blue cluster still has underlying green contigs, meaning that it is also covered by mini-metagenomics, just not as deeply as by the metagenomic method.

If we look at the mini-metagenomic contigs and their clusters, we can look at functionality in these clusters and search for specific enzymes. I guess this is all very familiar to people in metagenomics, but additionally we can map the contig occurrence back to the microfluidic chip, so we know exactly where each species came from. Here I'll show a number of phyla: the Aquificae all cluster into one region of the t-SNE, while the Crenarchaeota occupy more than one cluster, meaning that there are probably a number of subspecies present. We can do this for every phylum present, and we can also get an abundance distribution because we have this chip presence, which at this point doesn't say too much, but it would be interesting to see this shift as you increase the number of cells per chamber, as I saw in the simulations I have done.

There are a number of unassigned contigs, which actually occupy the entire space, so we need to assign these contigs. We can do that by clustering the data using HDBSCAN (sketched below), which groups the contigs much as you would draw circles around them by eye. Then we can compare and overlay the clusters found by mini-metagenomics and bulk metagenomics, and again you see that certain regions are occupied by both, and some only by one method or the other. Then we can also start looking at functionality, and for these clusters we can look at which phyla and which species are present, and again at the occurrence on the chip. This gives us a different type of ranking, based not on phylum but on cluster, where many clusters are unassigned, and it is now the task to figure out which species these represent.
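A minimal sketch of that clustering step, assuming the hdbscan Python package and the 2-D embedding from the previous sketch (min_cluster_size is an illustrative choice, not a value from the talk):

```python
import hdbscan  # pip install hdbscan

# 'embedding' is the 2-D t-SNE layout from the previous sketch.
clusterer = hdbscan.HDBSCAN(min_cluster_size=10)
labels = clusterer.fit_predict(embedding)

# Each non-negative label is a candidate genome bin; contigs labeled -1
# are noise, i.e. they remain unassigned, like the unassigned contigs
# mentioned above.
```

A density-based method like HDBSCAN fits this task because it does not require choosing the number of clusters in advance and explicitly leaves ambiguous contigs unassigned rather than forcing them into a bin.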
We can rank the clusters by assembled length, but actually more importantly by the number of cells, and the order changes, because there is not necessarily a strong correlation between the assembled length and the actual abundance of a species in an experiment or in a population. The chip presence in this mini-metagenomics platform is really the only reliable way to infer the relative abundance of species (see the sketch below); you cannot reliably do that by looking at the total assembled length.

Abstracting one step further, we can represent the clusters by spheres with the radius proportional to the assembled length and look at which phyla are most abundant in those clusters. And since we also map to the KEGG database, we can start looking at the functionality of our genomes. If you zoom in on these functions, you start seeing that not every genome is performing the same task, not everyone is doing everything, and here it really becomes interesting to look at who is doing what and how we could infer species dependencies. For instance, nitrogen metabolism definitely does not occur in every genome. If we then look at a specific pathway, nitrate reduction, the first step, nitrate to nitrite, can be done either by a three-subunit enzyme or by a two-subunit enzyme, and we can look at which species these are present in, and similarly for the two-subunit enzymes of the second step. Here, obviously, the problem of genome completeness starts to matter, so we need to treat this with care and develop quality scores for genome completeness, but this is still work in progress.

To conclude: this is an easy to use platform, ideally suited to analyzing complex microbial communities. As a matter of fact, I would want to challenge the system by moving away from extremophile communities only and also looking at highly complex communities, soil microbial communities for instance. The sequencing data provides information complementary to shotgun or bulk sequencing, including, as you saw, many additional phyla that were not present in the bulk sequencing data. And it enables a more quantitative approach to genomic screening of environmental samples: the abundance of phyla and genomes can be inferred from the chip presence, as can the functional abundance, the abundance of a certain gene set. We are looking forward to further developing the analysis pipeline, demonstrating deeper sequencing of specific wells, and doing single cell analysis by looking at SNPs and genetic heterogeneity within a species, and in doing that showing that we can measure smarter by not just throwing all the sequencing power at every species out there, but really focusing it on the most interesting or curious species. With that I'd like to conclude and thank you. And if you have any communities that you would like to subject to this method, I would be happy to talk about collaborating. Thank you.
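As a sketch of how chip presence can translate into relative abundance, here is the standard Poisson occupancy estimate in Python. This is my own illustration of the idea; the talk does not spell out the estimator, and the well counts in the example are hypothetical:

```python
import numpy as np

def poisson_abundance(occupied, n_chambers=96):
    """Poisson estimate of the mean cells per chamber for a species found
    in `occupied` of `n_chambers` wells (requires occupied < n_chambers):
        P(occupied) = 1 - exp(-lam)  =>  lam = -ln(1 - occupied / n)
    lam scales linearly with the species' true abundance in the sample,
    unlike total assembled length.
    """
    return -np.log(1 - occupied / n_chambers)

# Hypothetical example: species A found in 60 wells, species B in 10.
lam_a, lam_b = poisson_abundance(60), poisson_abundance(10)
print(f"relative abundance A:B is roughly {lam_a / lam_b:.1f} : 1")
```

Note that the logarithm corrects for saturation: a species seen in 60 of 96 wells is nearly nine times more abundant than one seen in 10 wells, not six times, because highly occupied wells increasingly hide multiple cells of the same species.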