Okay, yep, now I think we're good. Okay, so I'm going to present work that we've been doing on unsupervised analyses of spatially resolved transcriptomics data with a package we've developed called nnSVG, which is within the Bioconductor framework.
I'll start with some background on spatially resolved transcriptomics. This refers to new technological platforms that let us measure transcriptome-wide gene expression at spatial resolution on tissue slides. By that we mean we're measuring expression of thousands of genes at a set of thousands of spatial locations on small tissue slides. The illustration I've got here is from the 10x Genomics Visium platform, which is one of the most widely used platforms right now. Here there's a grid of spatial locations on the tissue slide, in a hexagonal arrangement in the latest version. And at each one of those locations, which are called spots, we're measuring transcriptome-scale expression of thousands of genes by sequencing. So they're tagged with spatial barcodes and then sent for sequencing. Examples of unsupervised analyses that we can do with this data are identifying spatial domains, spatially distributed cell populations, and spatially variable genes, which I'll talk more about.
And I like this illustration showing how this fits in with previous technologies. In bulk RNA sequencing, we've got all different types of cells, all measured at once. In single-cell RNA sequencing, we can identify cell populations, but in spatially resolved transcriptomics, we can also identify the spatial coordinates where they came from. This illustration here is from the human brain. In terms of the data, this means we end up with tables of expression counts, which we usually format as genes by cells, or genes by spatial locations in the spatial world. And then in the case of spatially resolved transcriptomics, we have additional columns with the spatial location that each measurement came from.
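To make that table layout concrete, here is a minimal sketch in plain Python. It is not the actual Bioconductor data structure: the class name `SpatialCounts`, its fields, and the toy gene names and values are all hypothetical, made up only to illustrate a genes-by-spots count table with one coordinate pair per spot.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class SpatialCounts:
    """Hypothetical container: genes-by-spots counts plus spot coordinates."""
    genes: List[str]                    # one name per row
    counts: List[List[int]]             # counts[g][s] = count of gene g at spot s
    coords: List[Tuple[float, float]]   # one (x, y) position per spot (column)

    def __post_init__(self) -> None:
        # The table only makes sense if rows match genes and columns match spots.
        if len(self.counts) != len(self.genes):
            raise ValueError("need one row of counts per gene")
        if any(len(row) != len(self.coords) for row in self.counts):
            raise ValueError("need one count per spot in every row")

    def expression(self, gene: str) -> List[int]:
        """Return the counts of one gene across all spots."""
        return self.counts[self.genes.index(gene)]


# Toy data (invented values): 2 genes measured at 3 spots.
toy = SpatialCounts(
    genes=["geneA", "geneB"],
    counts=[[5, 0, 2], [1, 7, 3]],
    coords=[(0.0, 0.0), (1.0, 0.0), (0.5, 1.0)],
)
for spot, count in zip(toy.coords, toy.expression("geneB")):
    print(spot, count)
```

The point of the sketch is only that each column of the count table is tied to a position on the slide, which is the extra information that spatial methods use.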
It's also possible to derive image features such as morphological features or the number of cells per spot. In this work, we're using the expression counts and the spatial locations.
Now, spatially variable genes. Here we're referring to any genes that have spatially defined patterns of expression across the tissue slide. And here I've got an illustration of six genes in a sample of human brain, dorsolateral prefrontal cortex, which was measured with the 10x Genomics Visium platform. In this region of the brain, there's this laminar layer structure. This is the same slide six times, showing six genes, with the expression counts at each spot on the slide. The top three are associated with the laminar structure and the bottom three are associated with other patterns related to blood and immune processes. And the crucial feature here, if we're trying to identify these genes, is that depending on the structure they're associated with, they can vary across different distances, which is what I've annotated there with those red arrows. So some of the genes have quite large patterns, others smaller patterns. In unsupervised analyses, we want to identify the top genes in a data set that are associated with any structures of interest in there. And doing that in a way that takes into account this varying range of expression patterns is quite tricky.
Now, why look for spatially variable genes in the first place? There are two main tasks that we do with this. The first is from a perspective of data preprocessing and data reduction. So we're reducing the number of genes to a smaller set of biologically informative genes instead of the full set of 20,000 or so protein-coding genes, and in that case we can use that as a feature selection or preprocessing step for other downstream analyses. Secondly, identifying a list of top informative genes to investigate individually.
So really the top genes associated with specific processes. And then the question becomes how to define biologically informative. In the non-spatial world, we use methods called highly variable genes. In the spatial world, we take into account the spatial coordinates as well, and that's referred to as spatially variable genes. In an analysis workflow, as I just mentioned, this fits into feature selection, which can be viewed as a preprocessing step. So feature selection reduces the dimensionality from 20,000 or so genes to often around 1,000, which reduces noise and improves computational performance for any further downstream steps, such as dimensionality reduction or clustering. Or secondly, it identifies those top-ranked genes for further investigation individually.
I mentioned methods called highly variable genes for identifying non-spatial informative genes. Those methods are more standardized by now, since they've been around for a bit longer. There we're ranking genes by excess biological variation above a technical trend, which accounts for the mean-variance relationship in single-cell data. But that does not take into account any spatial information. Then in spatial statistics, there are measures including the Moran's I statistic, which can rank genes by observed spatial autocorrelation. But that has not been adapted to the specific properties of spatially resolved transcriptomics data.
So now new methods have been developed to specifically focus on spatially variable genes. Several papers were published recently on this, including these three: SpatialDE, SPARK, and SPARK-X. One of the first there, SpatialDE, uses Gaussian process regression with a spatial covariance function, a kernel on distances, and then uses a likelihood ratio test to identify significant spatially variable genes. This was a really nice method. However, it scales cubically in the number of spatial locations.
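As a side note on the Moran's I statistic mentioned above, here is a minimal sketch of how it measures spatial autocorrelation for one gene. The weighting scheme is an assumption for illustration: two spots count as neighbors (weight 1) if they lie within a chosen `bandwidth` of each other, and weight 0 otherwise. The function name and the toy grid are hypothetical.

```python
import math


def morans_i(values, coords, bandwidth):
    """Moran's I for one gene, with binary distance-band weights (assumed)."""
    n = len(values)
    mean = sum(values) / n
    dev = [v - mean for v in values]
    denom = sum(d * d for d in dev)
    if denom == 0:
        raise ValueError("values are constant, Moran's I is undefined")
    num = 0.0    # weighted sum of cross-products of deviations
    w_sum = 0.0  # total weight over all ordered pairs of spots
    for i in range(n):
        for j in range(n):
            if i != j and math.dist(coords[i], coords[j]) <= bandwidth:
                num += dev[i] * dev[j]
                w_sum += 1.0
    if w_sum == 0:
        raise ValueError("no neighbors within bandwidth")
    return (n / w_sum) * (num / denom)


# Toy 4 x 4 grid of spots with unit spacing.
grid = [(float(x), float(y)) for x in range(4) for y in range(4)]
gradient = [x for x, _ in grid]                        # smooth left-to-right trend
checkerboard = [(int(x) + int(y)) % 2 for x, y in grid]  # alternating pattern

print(morans_i(gradient, grid, bandwidth=1.0))      # positive: neighbors are alike
print(morans_i(checkerboard, grid, bandwidth=1.0))  # negative: neighbors differ
```

A smooth pattern gives a positive value and an alternating pattern gives a negative one, so ranking genes by this statistic favors spatially ordered expression. Note that the result depends on the single fixed neighborhood size that is chosen up front.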
And with the new 10x Genomics Visium platform, where we have thousands of spatial locations, this becomes quite slow. So our work was on trying to adapt this in a way that lets us scale much faster and apply this to the newer platforms.
So we've developed this method called nnSVG. This uses a technique from spatial statistics called nearest-neighbor Gaussian processes, which approximates the likelihood using a small set of nearest neighbors instead of the full thousands of spatial locations. This approximates the data very well and allows us to scale linearly in the number of spatial locations, which is a huge improvement in terms of speed if we've got a large data set with thousands of spatial locations. This NNGP framework was implemented at the time in two packages, spNNGP and BRISC. We've applied the BRISC package in the context of spatially resolved transcriptomics data, so that this scales linearly on our data.
The methodology works like this. We fit one model per gene using BRISC, extract the maximum likelihood parameter estimates, and then, again, use a likelihood ratio test to compare models with and without spatial terms. Then we use those likelihood ratio statistics to rank genes by the strength of their spatial patterns across the tissue slide. That lets us do unsupervised analyses where we can simply rank all genes in the data set in terms of the strength of their spatial patterns. And crucially, this model has a flexible length scale parameter. That was the red arrow that I had annotated previously in the example of spatially variable genes. So we can identify spatially variable genes with different length scales for different biological processes within the same data set. We can also include covariates for spatial domains, which lets us look for spatially variable genes within subregions, and then rank genes by the likelihood ratio statistic.
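The nearest-neighbor idea can be sketched as follows. This is an illustration under stated assumptions, not the BRISC or nnSVG implementation: it assumes a zero-mean Gaussian process with an exponential covariance kernel plus a noise term, takes the parameters as fixed inputs rather than estimating them by maximum likelihood, and orders spots simply as they are given. All function names are hypothetical. Each spot is conditioned on at most `m` of its nearest earlier spots, so only small `m` by `m` systems are ever solved.

```python
import math
import random


def exp_cov(p, q, sigma2, length):
    """Exponential covariance between two locations (assumed kernel)."""
    return sigma2 * math.exp(-math.dist(p, q) / length)


def cov_matrix(points, sigma2, tau2, length):
    """Covariance of the observations: spatial kernel plus noise on the diagonal."""
    return [[exp_cov(p, q, sigma2, length) + (tau2 if i == j else 0.0)
             for j, q in enumerate(points)] for i, p in enumerate(points)]


def cholesky(a):
    """Lower-triangular L with L L^T = a, for symmetric positive definite a."""
    n = len(a)
    low = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = a[i][j] - sum(low[i][k] * low[j][k] for k in range(j))
            low[i][j] = math.sqrt(s) if i == j else s / low[j][j]
    return low


def forward_solve(low, b):
    """Solve low @ x = b for lower-triangular low."""
    x = []
    for i, row in enumerate(low):
        x.append((b[i] - sum(row[k] * x[k] for k in range(i))) / row[i])
    return x


def nngp_loglik(y, coords, sigma2, tau2, length, m=10):
    """Approximate log-likelihood: each spot conditions on <= m nearest earlier spots."""
    total = 0.0
    for i in range(len(y)):
        # Brute-force neighbor search keeps the sketch short; it is not linear-time.
        nb = sorted(range(i), key=lambda j: math.dist(coords[i], coords[j]))[:m]
        mean, var = 0.0, sigma2 + tau2
        if nb:
            low = cholesky(cov_matrix([coords[j] for j in nb], sigma2, tau2, length))
            a = forward_solve(low, [exp_cov(coords[i], coords[j], sigma2, length) for j in nb])
            b = forward_solve(low, [y[j] for j in nb])
            mean = sum(u * v for u, v in zip(a, b))  # conditional mean given neighbors
            var -= sum(u * u for u in a)             # conditional variance given neighbors
        total += -0.5 * (math.log(2 * math.pi * var) + (y[i] - mean) ** 2 / var)
    return total


def exact_loglik(y, coords, sigma2, tau2, length):
    """Full Gaussian log-likelihood; the n x n factorization is the cubic-cost step."""
    low = cholesky(cov_matrix(coords, sigma2, tau2, length))
    z = forward_solve(low, y)
    logdet = 2 * sum(math.log(low[i][i]) for i in range(len(y)))
    return -0.5 * (len(y) * math.log(2 * math.pi) + logdet + sum(v * v for v in z))


rng = random.Random(0)
spots = [(rng.random(), rng.random()) for _ in range(40)]
expr = [rng.gauss(0.0, 1.0) for _ in range(40)]
print(nngp_loglik(expr, spots, 1.0, 0.5, 0.3, m=5))  # approximation with 5 neighbors
print(exact_loglik(expr, spots, 1.0, 0.5, 0.3))      # full likelihood for comparison
```

When `m` is at least the number of spots minus one, the approximation reduces to the exact likelihood, which is a convenient sanity check. In the method described above, a likelihood ratio statistic would then compare the maximized spatial log-likelihood against a model without the spatial term; that fitting step is left out here.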
And the crucial point here is that this becomes linear in the number of spatial locations due to our use of the NNGP framework. We have a gene-specific length scale parameter, which is fitted independently per gene, and we can include covariates for spatial domains. We've parallelized this in our package using Bioconductor. Runtime is around 45 minutes on a laptop for one Visium slide.
Now I've got some results where we've evaluated this on several specific data sets. This is, again, that same data set I showed previously from the dorsolateral prefrontal cortex in the human brain. There are those six genes that we're particularly interested in here. And at the bottom left, I've got some ground truth annotations from that same data set. So the bottom left is showing six cortical layers and white matter. The white matter is the black region at the bottom left. And the second panel there is showing white matter versus all of the gray matter layers. When we fit our models, one per gene, here on the right I'm showing the estimates of those length scale parameters that I mentioned before. And what we see there is that for the bottom three genes, HBB, IGKC, and NPY, we get very small estimated length scale parameters, which is what we want. So the model is correctly fitted to those small patterns in those three genes. And the ones at the top have much larger length scale parameters. So that's good.
And then we evaluated this and compared it against several other methods. We evaluated it by calculating the rank of specific genes of interest within this data set, and also doing the same for other data sets later on. So on the left there, I'm showing these six specific genes of interest, with the rank from nnSVG in dark blue and other methods in the other colors. That's the rank in the list of top spatially variable genes from each method for each of those six genes.
And we show that nnSVG, in this data set and also in several other data sets, recovers at high ranks all of the main genes of interest that we know about in this specific data set. Other competing methods do not, or not consistently across all of the data sets that we evaluate. Here, same thing again with a longer list. Those were just six genes that I was talking about previously. We also have a list of 137 additional genes that were identified in that original study as associated with those cortical layers. And we show that we also get those as significant spatially variable genes.
And then we also simply plot the top spatially variable genes from our method. On the left here, we're showing our method. On the right, this is a competing method that we compare against. And here we show that of those top 20 spatially variable genes, just the top 20 ranked ones, most are associated with the white matter versus gray matter distinction, which we know is the strongest biological signal in this data set, so that's a good confirmation. We also do some simulations showing that it really is linear in the number of spatial locations. Here I'm subsampling the number of spatial locations per data set in two data sets and then running it several times, showing that it scales linearly, which is great.
Right, and then I've got some details here about the implementation. We've implemented this as an R package within the Bioconductor framework, which lets users integrate this into analysis workflows. Inputs and outputs are either SpatialExperiment objects, which I've got another slide coming up on, or simple numeric matrices, depending on what people prefer. It's parallelized using BiocParallel, and the package is available from Bioconductor and GitHub. We've got some vignettes and tutorials there showing how to use it. The left one is a screenshot of the Bioconductor vignette, the right one is the GitHub README.
And we've got a preprint up on bioRxiv going through this in much more detail.
And now here I've got some more details on the analysis workflow. As I mentioned, this is one step in an unsupervised analysis workflow for a spatially resolved transcriptomics data set. We've chosen to implement this within the Bioconductor framework. We really like Bioconductor. If you're not familiar with Bioconductor, it's an open source, community-based software development project for high-throughput genomic analysis in R. There are currently over 2,000 contributed software packages there, and really nice standardized data objects and documentation standards.
Specifically in this project, we've used the SpatialExperiment structure that we have also previously worked on. This is an extension of SingleCellExperiment for spatially resolved transcriptomics data. So this is a data structure that stores expression counts, as well as row and column metadata. Usually here rows will be genes and columns will be cells or spatial locations. In SpatialExperiment, we've extended SingleCellExperiment to include some additional slots, additional structure specifically for spatially resolved transcriptomics data, such as the spatial coordinates and image information. This schematic shows how the structure extends SingleCellExperiment. That's described in this paper here as well, and it's available from Bioconductor.
And we have a work in progress where we're building up an online book showing several example workflows for spatially resolved transcriptomics data using this SpatialExperiment structure, going through a complete analysis pipeline from preprocessing to feature selection, which I concentrated on in this talk, and also continuing on with further downstream analysis. This will be freely available online through Bioconductor. Right now it's available from my GitHub. So yeah, going through a complete analysis pipeline for spatially resolved transcriptomics data.
With example data sets and interactive code.
So in summary, I've talked about our new method called nnSVG for identifying spatially variable genes in spatially resolved transcriptomics data. There's a screenshot of the preprint there. This lets us identify spatially variable genes in a linearly scalable manner in large data sets. I also mentioned the SpatialExperiment structure for storing spatially resolved transcriptomics data within Bioconductor. And also Orchestrating Spatially Resolved Transcriptomics Analysis with Bioconductor, our work in progress on analysis workflows. And that's the end of my talk. So thank you to all of our collaborators, my advisors, and our funders. And that's it. Thank you.