now, but it looks like you guys are seeing my screen. Can you confirm that? Yes, it is looking good. Ludwig Geistlinger is here from the Harvard Center for Computational Biomedicine, right? Welcome, and take it away. You have 15 minutes. Perfect. Thank you so much, Vince. It is a great pleasure, of course, to be here, at least virtually, and talk about BugSigDB, a new database of microbial signatures for the interpretation of human microbiome experiments. I'll start with the acknowledgments by saying this is not only my own work but that of many people. In particular, I'm pointing out Curtis Huttenhower and Levi Waldron, who have been true guides on this project. A couple of words of introduction. I think many of us are familiar with differential gene expression analysis, where in the basic case we compare two sample groups and subsequently look for enrichment in differential expression using well-defined databases such as the Gene Ontology or the KEGG database, often subsumed in the very popular MSigDB database for some sort of gene set enrichment analysis. Now, today we're talking about microbiome data, and often enough the data looks very similar at the end of the day. That holds regardless of which technique you're using: 16S rRNA sequencing, where you map your reads to a marker gene and can resolve down to the genus level, or newer approaches, where you take your sample and map all the reads to different bacterial genomes. Either way, you end up quantifying bacteria as opposed to genes, again comparing them between two conditions in the basic case.
Now, you would likely want to do some sort of enrichment analysis here as well, taking this list of differentially abundant microbes. But the microbiome space lacks a comprehensive database such as MSigDB, GO, or KEGG for such an analysis, which makes it pretty much infeasible. So when we started this project a couple of years ago, the goal was to improve the interpretability of disease-linked microbiome profiles by translating the concepts of gene set enrichment analysis and developing microbial signature resources. We had three particular aims in mind. First, we wanted to develop a database that hosts microbial signatures. Second, we wanted an online wiki where curators could feed in curated signatures and users could access them interactively. And third, we wanted to develop and assess methods to enable microbe set enrichment analysis. This database is now available at bugsigdb.org. Now, a quick look at what kind of data is actually in the database. We started a couple of years ago and over time went through the literature with a number of curators. We currently have around 2,000 signatures curated from around 500 papers on differential abundance, covering all sorts of geographies and body sites, and also different conditions ranging from cancer and metabolic diseases to antibiotics. In all of these studies, we're typically looking at a contrast that compares some study condition with some controls. You can look a little into the characteristics of these signatures: most microbes are contained in only one or a few signatures, but, similar to gene sets or gene signatures, there is a bunch of usual suspects that turn up over and over again as differentially abundant.
So you see here things like Streptococcus or Prevotella, for example, that turn up as differentially abundant in up to 200 signatures. You can also look at the signature sizes, where you see that most of these signatures are rather small. A typical gene set or gene signature contains around 500 genes, but here we typically have around 5 to 10 microbes in a signature. You can then use controlled vocabularies to classify, for example, the condition, but also the body site that these signatures are annotated to. And you can see what kinds of signatures we really have in the database: a lot of anatomical system disease signatures, but also a lot of signatures that you can enter. You can also return to these top 10 microbes and ask in what proportions these bugs are found in signatures associated with these conditions. You see, for example, that Bacteroides is disproportionately often associated with things like metabolic disease or antibiotics usage when compared to the background of the database. Now, a quick look at the architecture of BugSigDB. We have the web access: this is the website where curators feed in the signatures and users can access them interactively. But we also have quite some infrastructure and GitHub magic to pull down these signatures systematically. There is a GitHub repo, BugSigDBExports, which does a weekly export of this dynamic data that curators continue to feed in. And in order to have a stable release, every half year, in sync with the Bioconductor release cycle, we publish a data dump to Zenodo, which people can pull for the sake of reproducibility.
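The signature statistics described above (how often a microbe recurs across signatures, and how small signatures typically are) can be sketched in a few lines of Python. This is only an illustration on made-up data: the signature names, their members, and the helper names `microbe_frequencies` and `signature_sizes` are assumptions, not BugSigDB content or part of any BugSigDB tooling.

```python
from collections import Counter
from statistics import median

# Toy signatures (hypothetical names and members, not BugSigDB data).
signatures = {
    "study1_increased": {"Fusobacterium", "Streptococcus", "Prevotella"},
    "study2_increased": {"Streptococcus", "Bacteroides"},
    "study3_decreased": {"Prevotella", "Streptococcus", "Roseburia", "Faecalibacterium"},
}


def microbe_frequencies(sigs):
    """Count in how many signatures each microbe appears."""
    counts = Counter()
    for members in sigs.values():
        counts.update(members)  # each member counted once per signature
    return counts


def signature_sizes(sigs):
    """Return the number of microbes in each signature."""
    return {name: len(members) for name, members in sigs.items()}


freq = microbe_frequencies(signatures)
sizes = signature_sizes(signatures)

# The "usual suspects" are the microbes with the highest counts.
print(freq.most_common(2))
# Median signature size across the toy collection.
print(median(sizes.values()))
```

On real data, the same counting would show the long tail mentioned in the talk: most microbes in one or a few signatures, and a handful recurring many times.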
And if you prefer to work in R and Bioconductor, which many of us likely do given that we're here at this conference, there is a Bioconductor package named bugsigdbr, which lets you pull either these weekly exports or the stable releases into a data frame in R. It then provides a number of convenience functions to extract these signatures, to write them out in other formats if you prefer tools outside of R and Bioconductor, and to programmatically access pages of BugSigDB. The last infrastructure piece is the BugSigDBStats package, another R package that lives on GitHub only. It analyzes the contents in a continuous integration setup and lets you look into signature statistics, but also metadata statistics and ontology-based summaries. So there are a lot of things that you can do with this data, and a major use case, of course, is using these signatures for enrichment analysis. As Samuel has pointed out, one difficulty is finding a setup where you have some sort of ground truth. For that, we turned to colorectal cancer datasets from curatedMetagenomicData, a package that is available on Bioconductor. We looked at 10 different colorectal cancer datasets that, when pooled, compare some 600 colorectal cancer stool samples with some 600 healthy control samples. Now, you can go about the enrichment analysis in two different ways. Either you apply your differential abundance analysis and your enrichment analysis to the pooled datasets, which is a good idea because you have more power in this setup. Or you apply your enrichment methods to every single dataset and then apply some sort of rank aggregation. The second setup is, of course, useful for inspecting the performance of these enrichment methods on individual datasets. You will also need some sort of ground truth.
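To make the second setup concrete, here is a minimal sketch of rank aggregation across datasets. The talk does not say which aggregation method was used, so this substitutes the simplest possible one, the mean rank per signature. The rankings, signature names, and the function `aggregate_mean_rank` are all hypothetical.

```python
from statistics import mean

# Hypothetical per-dataset rankings: signature name -> rank (1 = best).
rankings = [
    {"sigA": 1, "sigB": 2, "sigC": 3},
    {"sigA": 2, "sigB": 1, "sigC": 3},
    {"sigA": 1, "sigB": 3, "sigC": 2},
]


def aggregate_mean_rank(per_dataset_ranks):
    """Aggregate rankings by the mean rank of each signature.

    Only signatures ranked in every dataset are kept. This is an assumed,
    deliberately simple aggregation, not the method used in the talk.
    """
    names = set.intersection(*(set(r) for r in per_dataset_ranks))
    mean_ranks = {n: mean(r[n] for r in per_dataset_ranks) for n in names}
    # Sort by mean rank, breaking ties by name for a stable order.
    return sorted(mean_ranks.items(), key=lambda kv: (kv[1], kv[0]))


aggregated = aggregate_mean_rank(rankings)
for name, rank in aggregated:
    print(name, round(rank, 2))
```

The pooled setup needs no aggregation step, because differential abundance and enrichment are run once on all samples together.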
Here we used two of what we call spike-in signatures, so-called positive control signatures. Previous studies, Wirbel et al. in Nature Medicine 2019 and Thomas et al., also in Nature Medicine 2019, had established species- and genus-level signatures of bacteria found with increased abundance in colorectal cancer versus controls. We are now asking: do we find these positive control signatures in these 10 datasets when we apply differential abundance methods? What kind of enrichment methods are we applying? Well, we turned to a paper that we published last year, where we quite systematically looked through the literature on available enrichment analysis methods and benchmarked 10 different enrichment methods. Among them is the classic hypergeometric test, where you threshold your differentially expressed genes and then apply some sort of Fisher's exact test. And there is also what we and others found to work very well: a gene set scoring method that computes a gene score, takes the weighted mean over your signature, and then applies sample permutation to estimate the p-value. Now, you can look at the results for the colorectal cancer case. In A, we are looking at the pooled datasets, and we'll start with the Fisher's exact test results, the overrepresentation results. The first thing you see is that these two positive controls, the spike-in signatures from Wirbel and Thomas, indeed come up on top, so you have your proof of concept. But you also see two other colorectal cancer signatures, from Allali and Wu, which come up as independent validation.
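The overrepresentation test mentioned above can be sketched as a one-sided hypergeometric test, which is equivalent to a one-sided Fisher's exact test on the 2x2 table of signature membership versus significance. This is a generic illustration, not the implementation benchmarked in the talk. The taxon names, the universe of 20 taxa, and the function names are assumptions.

```python
from math import comb


def hypergeom_upper_tail(k, K, n, N):
    """P(X >= k) for X ~ Hypergeometric.

    N: size of the universe (all tested microbes)
    K: number of signature members in the universe
    n: number of significant (differentially abundant) microbes
    k: observed overlap between signature and significant microbes
    """
    upper = min(K, n)
    favourable = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return favourable / comb(N, n)


def overrepresentation(signature, significant, universe):
    """Return (overlap size, one-sided p-value) for one signature."""
    universe = set(universe)
    signature = set(signature) & universe      # restrict to tested microbes
    significant = set(significant) & universe
    k = len(signature & significant)
    p = hypergeom_upper_tail(k, len(signature), len(significant), len(universe))
    return k, p


# Hypothetical example: 20 tested taxa, 5 called differentially abundant.
universe = [f"taxon{i}" for i in range(20)]
significant = universe[:5]
signature = ["taxon0", "taxon1", "taxon2", "taxon10"]

k, p = overrepresentation(signature, significant, universe)
print(k, p)
```

With signatures of only 5 to 10 microbes, the overlap counts are small, so the exact hypergeometric tail is the appropriate choice over a large-sample approximation.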
And then you also see signatures from studies investigating other diseases and, interestingly, most of them come from oral samples. This is in line with the idea that in colorectal cancer, in these stool samples, you apparently have an introgression or migration of oral microbes into the gut as one possible disease-causing mechanism. You can also look at the two methods that we identified as being effective and working very well: the standard, very popular overrepresentation test and the sample-permutation PADOG test. For all 10 datasets you can ask: where do these two positive controls rank? Here, ranking closer to zero means ranking closer to the top. These are percentile ranks and, as it turns out, depending on the sample size of your dataset you of course have different power to detect these meta-analysis signatures; PADOG performed overall a little better than overrepresentation analysis. You can also return to this piece here in A and look at the actual overlaps between these different signatures from colorectal cancer, but also from other disease phenotypes. So we are seeing here again the signatures that we have in A, and you can see the microbes at the genus level that seem to turn up over and over again. Among them is Fusobacterium, which is highly increased in abundance in colorectal cancer versus controls, and it seems to be there in all the other signatures as well. Porphyromonas and Peptostreptococcus also seem to be very frequent in there.
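The percentile ranks used to score the positive controls can be illustrated as follows. The exact definition used in the benchmark is not given in the talk, so this sketch assumes one plausible reading: the fraction of signatures with a strictly smaller p-value, so that 0 means the very top of the ranking. The p-values and the function name `percentile_rank` are made up.

```python
def percentile_rank(pvalues, target):
    """Fraction of signatures ranked strictly better than `target`.

    pvalues: dict mapping signature name -> enrichment p-value
    target:  name of the signature of interest (e.g. a positive control)
    Returns a value in [0, 1); 0 means top of the ranking.
    This definition is an assumption made for illustration.
    """
    p = pvalues[target]
    better = sum(1 for value in pvalues.values() if value < p)
    return better / len(pvalues)


# Hypothetical enrichment results for one dataset.
pvalues = {
    "positive_control": 0.001,
    "sigB": 0.2,
    "sigC": 0.04,
    "sigD": 0.5,
}

print(percentile_rank(pvalues, "positive_control"))
print(percentile_rank(pvalues, "sigB"))
```

Computing this per dataset and per method gives one number per combination, which is what makes the 10 datasets and the two methods directly comparable.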
Now, I'll only inspect Fusobacterium a little more. Here, a working hypothesis that indeed goes beyond observational studies is that a particular species of the genus Fusobacterium, Fusobacterium nucleatum, seems to produce formate, which is apparently a metabolite that you don't want to have in your colon, as it seems to be an oncometabolite and apparently has pro-tumorigenic effects. Okay, and I think with this I'm already well over time, so I thank everybody for their attention. I will be happy to answer questions, and I'm just pointing out the availability of these different infrastructure pieces again. BugSigDB is available at bugsigdb.org, and there is the bugsigdbr Bioconductor package for accessing BugSigDB signatures from within Bioconductor. Thank you very much. Thank you, Ludwig. Questions from the audience or in the chat? Well, nothing yet.