to a software package called dreamlet, which can address some of these challenges. The specific dataset that motivated this work is what I'm going to talk about, but this applies to any kind of cohort study design that's increasingly common in the field. I've been most interested in Alzheimer's disease, which is a common cause of age-related dementia and a neurodegenerative disease with a strong genetic component. The goal is to identify cell-type-specific gene expression signatures associated with AD, to characterize the biology of progression as well as neuropsychiatric symptoms.

The dataset my colleagues generated is from single-nucleus RNA-seq on the 10x Genomics platform. It has 8.8 million cells, over 3,000 biosamples including technical replicates from over 1,600 donors, and over 560 technical batches, meaning 10x batches. That's about 5,000 cells per sample, across roughly 5 to 50 cell types depending on the annotation. One of the shared challenges of single-nucleus datasets is a fairly low read count per cell. On the biological side, we would like to ask questions about differential expression based on disease state within each cell cluster, and then, downstream, think about genetic regulation. Existing tools were struggling to scale computationally to this very large dataset, and they were not able to model the statistical complexity of RNA-seq with repeated measures at the single-cell level.

So the engineering goals, first on the statistical side: we need to statistically model repeated-measures designs with about 5,000 nuclei per biosample, including technical replicates. There's going to be variation in measurement precision due to the count nature of the data, as highlighted by limma-voom or DESeq2, which we're all familiar with, and that low read depth leads to heteroscedastic gene expression measurements.
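To make the precision-weighting idea concrete, here is a minimal numpy sketch, not dreamlet's actual R implementation: when biosamples differ in how many cells they contribute, their pseudobulk estimates have different variances, and weighted least squares down-weights the noisy ones. The simulation setup (cell counts, effect size, the 1/variance weights) is an illustrative assumption.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulate one gene across 100 biosamples: samples with fewer cells
# yield noisier pseudobulk estimates (heteroscedasticity).
n = 100
disease = rng.integers(0, 2, n)                # 0 = control, 1 = case
cells_per_sample = rng.integers(50, 5000, n)
true_effect = 0.5
sd = 1.0 / np.sqrt(cells_per_sample)           # precision grows with cell count
y = 2.0 + true_effect * disease + rng.normal(0, sd)

X = np.column_stack([np.ones(n), disease])

# Ordinary least squares: every sample counts equally.
beta_ols = np.linalg.lstsq(X, y, rcond=None)[0]

# Precision-weighted least squares: weight each sample by 1 / variance.
w = cells_per_sample.astype(float)
Xw = X * w[:, None]                            # X^T W X and X^T W y below
beta_wls = np.linalg.solve(X.T @ Xw, Xw.T @ y)

print("OLS effect:", beta_ols[1], "WLS effect:", beta_wls[1])
```

A full mixed model adds random effects for donor and batch on top of these weights; this sketch shows only the weighting step.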
There are also many small technical batches, and we'd like to use random-effect shrinkage to account for batch variation, in the form of precision-weighted linear mixed models. And finally, we'd like to use empirical Bayes shrinkage, as in limma-voom, to borrow information across genes. This statistical framework was inspired by a package I worked on previously called variancePartition, as well as limma, which we're all familiar with in the field, and then muscat, which is designed for single-cell RNA-seq differential expression on smaller datasets.

On the computational side, the basic data processing is pretty daunting, because the H5AD file is 160 gigs just zipped. We need to perform the analysis using on-disk instead of in-memory storage, and we'd like to parallelize both the preprocessing and the statistical analysis. The dreamlet package is well integrated with SingleCellExperiment from Bioconductor; it simplifies a lot of the user-facing tasks and handles the statistical and computational challenges in the back end. It also includes integration with plotting and downstream analyses. So the goals are scaling to fit complex regression models while remaining easy for the end user.

For multi-donor differential expression analyses, this is the basic workflow, with precision-weighted linear mixed models as the workhorse. From each of multiple donors, you collect, say, two biosamples, though it could be more or fewer. You can use a normal prior, in the form of a random effect, to account for these repeated measures. You can also use a random effect to account for batch-to-batch variation. The variation due to the number of cells observed is modeled with precision weights, and the expression from a single cell cluster is then aggregated into a pseudobulk, to which we apply standard library-size correction.
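The pseudobulk step described above can be sketched in a few lines. This is a toy Python version, not dreamlet's R code, which operates on SingleCellExperiment objects: within one cell cluster, sum counts over all cells from the same biosample, then apply a library-size correction, here log2 counts per million with a pseudocount.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy data: a sparse UMI-like count matrix with per-cell sample and
# cluster labels (8 biosamples, 3 cell clusters).
n_genes, n_cells, n_samples = 200, 1000, 8
counts = rng.poisson(0.3, size=(n_genes, n_cells))
sample_id = rng.integers(0, n_samples, n_cells)
cluster = rng.integers(0, 3, n_cells)

# Pseudobulk for cluster 0: sum counts per biosample over that cluster's cells.
keep = cluster == 0
pb = np.zeros((n_genes, n_samples))
for s in range(n_samples):
    pb[:, s] = counts[:, keep & (sample_id == s)].sum(axis=1)

# Library-size correction: log2 counts per million, with pseudocounts.
lib = pb.sum(axis=0)
log_cpm = np.log2((pb + 0.5) / (lib + 1.0)[None, :] * 1e6)

print(log_cpm.shape)
```

The aggregation collapses thousands of noisy cells into one well-measured value per biosample, which is what makes the downstream donor-level mixed models tractable.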
Then the mean-variance trend of count data, which we're all familiar with, is accounted for with a second round of precision weights. From there we can perform downstream analysis with dreamlet: say, case-control analysis within a cell cluster, analyses across cell clusters, or variance partitioning analyses at the gene level.

Computational scaling is especially important for datasets of this size. dreamlet, shown in red, across a thousand donors and two million cells, has by far the lowest memory use and very competitive run times just for computing the pseudobulk, which is a very time-consuming step; the analysis can be performed on a laptop. For differential expression analyses, it's the only scalable approach that can include random effects. Here the y-axis is time and the x-axis is the number of subjects, and even modeling batch as a random effect, it's orders of magnitude faster than, say, glmer or the MAST software. Statistically, it matches the performance of the leading software on benchmarks developed by muscat, and on a permuted dataset, that is, real single-cell data with permuted phenotypes, it correctly controls the false positive rate, while, depending on conditions and sample size, some of the other methods show an inflated false positive rate.

So, a quick introduction to an analysis. This group profiled memory T cells from donors that had been exposed to tuberculosis, performed single-cell RNA-seq, and identified subsets of T cells. With the dreamlet pipeline, we can, in this case, look at one subset of T cells, examine the mean-variance trend, and incorporate that into downstream precision weights. We can look at the gene-level contributions of batch and donor, and, as expected, TB status has a fairly small contribution to gene expression.
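The permutation check mentioned above, real expression data with shuffled phenotypes so that any hit is by construction a false positive, can be sketched generically. This is not the muscat benchmark code, and the per-gene test here is a plain two-sample t-test rather than dreamlet's precision-weighted mixed model; it just shows the logic of the evaluation.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)

# Stand-in for pseudobulk log-CPM: 500 genes x 40 biosamples.
n_genes, n_samples = 500, 40
expr = rng.normal(size=(n_genes, n_samples))

# Permute the case/control labels so no true signal remains.
pheno = rng.permutation(np.repeat([0, 1], n_samples // 2))

# Per-gene two-sample t-test between the permuted groups.
case, ctrl = expr[:, pheno == 1], expr[:, pheno == 0]
pvals = stats.ttest_ind(case, ctrl, axis=1).pvalue

# Under this enforced null, about 5% of genes should pass p < 0.05;
# a method that reports many more is inflating false positives.
fpr = float((pvals < 0.05).mean())
print("false positive rate at 0.05:", fpr)
```

In the real benchmark the same permuted labels are fed through each differential expression method, and the observed rate is compared against the nominal 5% level.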
dreamlet also includes downstream gene set enrichment analysis following the differential expression analysis, and we find that this population of T cells shows upregulation of T cell immunity even a substantial time after TB exposure. And with that, I just want to thank my colleagues at the Mount Sinai medical school and the Center for Disease Neurogenomics, and all of the software dependencies contributed by Bioconductor developers that make this work possible. This will be available at the target date of September 1st with documentation and examples, and will be on bioRxiv and Bioconductor. So, thank you so much.

Thanks, Dr. Hoffman.