Yes, perfect. So welcome back to the afternoon session, where we'll be focusing on exploring spatial transcriptomics. But before we get into that, I just wanted to give you a brief introduction to myself, so you know who's on the other side of the screen. My name is Alma Andersson. I'm from Oosterbeek, Sweden, which is a fairly small village with a population of 69 people, at least as of the day before yesterday, and the influx and outflux isn't super huge, so I'm pretty sure that this is a correct number. Now I'm living in Stockholm, Sweden, which is a slightly larger city, you could say, and I'm working at SciLifeLab in Stockholm and I'm also associated with KTH. I've been here at SciLifeLab since 2017, when I did a short stint in the Delemotte lab, where we focused on molecular dynamics, more specifically membrane proteins, and even more specifically ion channels and viral potassium ion channels. Since 2018 I've been part of the Lundeberg lab, where we work more or less exclusively with spatial transcriptomics, or ST, the technique, and in this lab I've been doing computational method development. So that's some brief information about me. I should also make a small disclaimer here: this is my first online teaching experience, so if the technology isn't on par every time, I hope you have some understanding.

So here's some information about the outline of this session. I'm just going to go through some notation super quickly, which will just take a couple of seconds, and then we'll look into some background on spatial techniques. I know Michael and Charlotte said that we expect you to have a bit of background knowledge, but spatial is a pretty new field, so maybe it's good to add some context on what sort of data we'll be working with. Then we'll briefly touch on the data processing and
move forward to the data analysis. We'll do some basic analysis and look at how this compares to what we do with single cell data. Then we'll have a short break and answer some questions, if any have arisen, and then continue with the data analysis, where we'll be looking at the mapping of certain cell types. I'll elaborate a bit more on this later on. Then we'll do another example and look at some more spatially focused analysis, which you can't really do with single cell data. My idea is also to give some information regarding the exercises and take some more questions.

So, for the notation: as you will see later on, the exercise session consists of three parts, and a lot of that material will also be covered here in the more theoretical part. I've tried to indicate whenever that is the case using these symbols, and I will place them in the top right corner, just for you to have something to relate to.

Right, so moving on to the background. My idea here is to give you a very brief overview of the different spatial techniques that are out there. You can more or less put these into five different categories.

The first one is the microdissection-based techniques. Here you isolate a region of interest in your tissue or sample, place the isolated piece in a separate well, and sequence it using either bulk or single cell methods. This is very much a brute-force approach to reaching spatial resolution. Some examples here are LCM, tomo-seq, TIVA, ProximID and NICHE-seq.

Then we have the in situ sequencing methods, where we sequence the transcripts in place, that is, within our tissue. These methods tend to offer subcellular resolution. Some of them rely on a priori defined targets, but not all of them; FISSEQ, for example, is a bit more unbiased.

We also have the in situ hybridization methods, and here we use labeled probes for specific
targets, and we let these hybridize to the targets. This of course tends to require a priori defined targets, since we're aiming for them. Before, there was quite an issue with the multiplexing capabilities here, but thanks to some clever expansion strategies and some good decoding schemes this issue has more or less been overcome.

We also have the in silico-based methods, and these are a bit of a curiosity. Maybe there are more of them out there, but this is the one that I'm aware of. Here we try to infer and reconstruct spatial structure from non-spatial data, for example single cell data. I think this is a pretty cool theoretical approach, so I would recommend you to check this one out if you haven't seen it already.

And then, as a fifth category, we have the in situ capture-based techniques. Here we capture the transcripts in situ, but we sequence them ex situ, and usually these are a bit less dependent on prior selection of targets.

I've actually taken this classification of the different methods from a review that was recently published, so if you want to know more about the different techniques and get some insight into the field, I would recommend you to have a look at it. But for today we will be focusing on the in situ capture methods, and even more specifically the ST method, which stands for spatial transcriptomics, and I will just give you a brief history of it.

This idea of spatially barcoded arrays was introduced to the scientific world, as you can see, in mid-2016; I think this paper was published in July. It was a Science publication by Ståhl et al. Back then they thought it was a really good idea to name the technique spatial transcriptomics. Today it causes quite a lot of confusion when you say that you work with spatial transcriptomics, because that's more or less the whole field. However, in late 2018,
10x Genomics acquired the IP rights to this technology, and that resulted in the launch of the Visium Spatial Gene Expression platform. That is the type of data that we will be working with today.

So just to give you some brief specs on the Visium platform. It is an array-based technique, and we have a 6.5 x 6.5 mm area that we put our sample on. Within this array we have about 5,000 spots arranged in a hexagonal grid. The spots are 55 micrometers in diameter and the center-to-center distance is about 100 micrometers. Each spot has millions of capture probes. I tried to get a more specific number on this, but 10x is pretty secretive with that information, so they just state that the spots have millions of capture probes. Each of these probes has a spatial barcode, and that's what allows us to link gene expression to an actual physical position on our array. The probes also have a poly(T) sequence, which allows us to capture polyadenylated mRNA, and that gives us full transcriptome-ish data. I add the "ish" here because we can't really distinguish between different isoforms and so forth, but it's untargeted and at least tries to capture the full transcriptome.

What I want to point out here, and you might have seen this from the sizes of the spots that I just mentioned, is that we don't operate at single cell resolution yet. 10x says that approximately 1 to 10 cells contribute to the observed gene expression in each of the spots. That's something that's good to keep in mind, and I will come back to it later on as well.

So, the experimental workflow, just to briefly summarize it. I'm not in the lab doing this myself, so this will be very much an overview. You take your sample, your tissue of interest, put it on the Visium array, and take a bright-field image, and this is something that you use later on as a
reference to map the gene expression back to where in the tissue you captured these transcripts. You do the mRNA capture and the cDNA synthesis, you cleave off the probes and you put them into the sequencer. There's some preprocessing going on there, and what you end up with is the count matrix. This is of course very similar to what you would get from a single cell experiment, where you have genes along one dimension, but rather than cells you have spots along the other dimension. These spots have barcodes, and the barcodes are associated with specific spatial coordinates.

So I should ask: are there any questions so far? No, doesn't seem like it, great.

Like I said before, we'll briefly touch upon the preprocessing, but the majority of the rest of this session will focus on what happens once you have obtained the count matrix. Still, it's good to know how we end up with this count matrix, so let's head on to the data processing.

Just as you have Cell Ranger for single cell data, we have Space Ranger for spatial data. We can use mkfastq to convert our BCL files to FASTQ files, and then we have the spaceranger count command, which does tissue detection and alignment and the UMI counting. When I say tissue detection and alignment, I mean finding which spots are under the tissue and where they are located physically. Once you have run these commands you will end up with output something like this, depending on which parameters you specify. You have some automated analysis there, which I don't tend to pay too much attention to, and then you have your count matrices, represented as MEX files or H5 files. You will see that some of them are named raw and some are named filtered. The raw data contains all of the spots, whilst the filtered data only contains spots that are under your tissue. You can read more about how these commands work and the different options that you can
specify on the 10x website, so I won't go into too much detail about this.

So let's say that you have processed your data. Usually what I want to do next is convert it to a more convenient format to work with. There isn't really any standardized format for that, but my personal preference is to convert it to an h5ad file, and this is what we will be using in the exercises as well. Up until recently, Scanpy and AnnData didn't really have a good module for Visium data, but they have just released one. It is slightly different from the format that we will be using, because it was super recent and I didn't have the time to update the material, but the format is very simple.

Just as Charles was speaking of before, this AnnData file has different slots that hold different types of information. To briefly describe how the spatial data is stored: the variable slot holds the gene identifiers, both Ensembl IDs and gene names. The observation slot holds the spot identifiers and their respective coordinates. And the unstructured slot holds the image, the H&E or bright-field image, and some scaling factors that allow us to go from the array coordinates to the actual pixel coordinates. I will elaborate a bit more on this in the exercise session, and you can also find more information on the GitHub page that I've linked here.

Right, so now we have our count data, the matrix, and we're ready to start doing some initial assessment. I will try to give you some examples using human breast cancer data that I've taken from the 10x website, so it's public data and anyone can use it. That is also what we will be using in the exercise sessions.

The very first thing you can do is plot how the spots are spatially arranged, so you see what your tissue outlines are and so on, but it's way more informative if you also overlay this on the H&E
image. We can also add an additional layer of information here. Let's say, for example, that we're interested in how a certain gene is expressed across the tissue. What we can do then is let the face color intensity be proportional to the gene expression values. Then we can see how this gene, ERBB2, is more highly expressed in the cell-dense, darker regions, which makes sense: ERBB2 is associated with HER2-positive cancers, so we would expect it where there are more cancer cells.

One question that I tend to get quite often is how you visualize this high-dimensional data. In the previous slides we looked at the expression of one gene, but we actually have about 20,000 features, and maybe we would like to condense all of this information into a single representation. Here we have the expression profiles of 12 different genes, and one suggestion might be to just join them all in one image. That works if we have 12 genes, but if we start to plot 5,000 or 20,000 features it's just going to be a mess.

So an alternative idea is to take the full gene expression data set and embed it in three-dimensional space using, for example, UMAP. We then do an affine transformation that moves the embedding into the unit cube, so all the values are between 0 and 1, and we can consider these values as RGB values, or values in any other color space. Each spot is then associated with a color, and that allows us to plot this in a more informative way, where regions with similar color also have similar gene expression profiles. That gives us some information about what we can expect from our data.

So, moving on to the data analysis. Before you start doing some deeper data analysis, you of course always need to filter, normalize and perhaps apply some batch correction. And as we noticed in
the previous exercise, there's actually no magic recipe to give. How you process your data depends very much on the samples you have and the objective of your analysis. But what I can say is that much can be learned from the analysis of single cell data; you can transfer most of these concepts to spatial data. So I will just give you some general advice here, based on things that I've learned from analyzing quite a lot of spatial data.

Regarding the filtering, it's good to filter the genes based on their expression level, where we make a cutoff and say that we only include genes that have a total expression across all the spots higher than some specified level. This depends a bit on how many spots you have, just as it previously depended on how many cells we have. We can also filter the genes based on spot presence. What I mean by that is that maybe you have one gene that is extremely highly expressed, but in only one spot. That's likely a technical artifact, or something going on with the mapping, so it could be a good idea to exclude those genes as well.

We could also filter spots based on their expression levels, where we say that to include a spot in our analysis it needs to have a total gene expression level higher than some threshold. This is not always necessary; it was more necessary with the old ST arrays, which weren't as standardized as the current Visium arrays. But it might be a good idea to do this if you suspect that the automatic detection of which spots are under your tissue is not completely correct. We could also filter away ribosomal and mitochondrial genes; they tend to exhibit quite sparse expression patterns and sometimes quench more relevant signals.

As for the normalization and batch correction, I would recommend you to account for the library size of the spots, since we have varying cell density across our tissue. Just because we have higher expression levels of certain genes in some
regions, it doesn't necessarily mean that they are upregulated there; maybe we simply have more cells there to express these genes. Also, if you try to regress out certain covariates, it could be a good idea to include the slide or array as a covariate to regress out, and not only sample ID and such. We've seen in big data sets, for example, that this actually did come into play. Some tools that have performed pretty well so far are SCTransform and Harmony. They are of course developed for single cell analysis, but they are applicable to ST data, spatial data, as well.

Right, so just to give you an example of some basic analysis. We could start by clustering the spots based on their gene expression, after we've done some normalization, and then visualize this in UMAP space, just as we would do for single cell data. The neat thing here is that we can map this back onto the actual tissue that we are working with, using the image as a reference. The image is actually a really valuable source of information. It can act as a sanity check, where we ask ourselves: does this make sense? Do the clusters seem to correlate with the morphology, and if they don't, why don't they?
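The filtering and library-size advice given earlier can be sketched in a few lines of plain Python. This is only a minimal illustration, not the pipeline used in the exercises: the function name `filter_and_normalize`, all threshold defaults, and the `target_sum` scaling are hypothetical choices, and a real analysis would more likely use Scanpy or a similar toolkit.

```python
from typing import List, Tuple

Matrix = List[List[float]]  # rows are spots, columns are genes


def filter_and_normalize(
    counts: Matrix,
    min_gene_total: float = 3,   # hypothetical cutoff: total counts per gene
    min_gene_spots: int = 2,     # hypothetical cutoff: spots a gene must appear in
    min_spot_total: float = 1,   # hypothetical cutoff: total counts per spot
    target_sum: float = 1e4,     # hypothetical library size after scaling
) -> Tuple[Matrix, List[int], List[int]]:
    """Filter spots and genes, then scale every kept spot to the same library size."""
    n_genes = len(counts[0]) if counts else 0

    # Keep spots whose total expression reaches the threshold.
    keep_spots = [s for s, row in enumerate(counts) if sum(row) >= min_spot_total]

    # Keep genes with enough total expression that are present in enough spots.
    keep_genes = []
    for g in range(n_genes):
        column = [counts[s][g] for s in keep_spots]
        n_present = sum(1 for value in column if value > 0)
        if sum(column) >= min_gene_total and n_present >= min_gene_spots:
            keep_genes.append(g)

    # Library-size normalization: divide by the spot total, rescale to target_sum.
    normalized = []
    for s in keep_spots:
        row = [counts[s][g] for g in keep_genes]
        library_size = sum(row)
        normalized.append(
            [value / library_size * target_sum if library_size > 0 else 0.0
             for value in row]
        )
    return normalized, keep_spots, keep_genes


# Toy example: 4 spots x 4 genes. Spot 2 is empty, gene 2 is high in one spot only.
toy_counts = [
    [10, 0, 50, 1],
    [5, 0, 0, 1],
    [0, 0, 0, 0],
    [20, 2, 0, 3],
]
normalized, kept_spots, kept_genes = filter_and_normalize(toy_counts)
print(kept_spots)  # [0, 1, 3]
print(kept_genes)  # [0, 3]
```

In the toy example the gene that is highly expressed in a single spot is dropped by the spot-presence filter, which is the situation described above as a likely technical artifact.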
So that's something that people tend to forget when we speak of spatial data: just having the image to compare with can add quite a lot of support for your conclusions.

Next, the natural question is usually what these clusters represent. Of course you can proceed with a standard analysis. Maybe we're interested in what this orange cluster up in the right corner represents. We find some genes that are upregulated in this cluster and subject them to some gene set enrichment or functional enrichment analysis, and all of this information can guide us in the annotation of the cluster; maybe it's a cancer-related cluster, or it's more immune-related.

But there's a "but" here. We need to remember that each of these spots is a mixture of multiple cells, which also means that we can have a mixture of different cell types present at each spot. Hence, if we cluster our data based on the spatial gene expression, the clusters don't necessarily represent specific cell types. I rather tend to think of these clusters as assemblies of spots with similar gene expression profiles, or maybe with similar populations of cell types. Some of them, like the BH cluster here, might consist purely of one single cell type, but that is not always the case. So a natural question that tends to arise, and which a lot of people are interested in, is: where are my cell types, then? How do I find out where certain cell types are located in my tissue?

So let's pick up where we left off. Some of you may have started to think about the question that I just posed, and one suggestion might be that we just look at marker genes. This is an easy and straightforward approach, but there are some issues. To start with, it requires that we know the marker genes for each respective cell type, and that is not always the case. There's also the risk of overlap among our marker genes: if you're
working with a really complex tissue with a lot of different and similar cell types, maybe we can't find a mutually exclusive set of marker genes for each of the cell types. Then, if we observe a certain marker gene, we're not sure whether it belongs to cell type A or cell type B, and which cell type it indicates the presence of. There's also the question of how we interpret the expression values. If we have a high expression value of a certain marker gene, does that mean that we have a lot of that cell type there, or is it just a few cells that express this specific marker gene really highly? And as for the lowly expressed marker genes, they may not always be observed; maybe we don't capture them, so we overlook these cell types.

So another solution here is to use single cell data and integrate it with our spatial data. The idea is that we extract information regarding the cell types from the single cell data, and then try to use this to infer the spatial location of these cell types within our spatial data. Of course, the big challenge here is that we need to deconvolve our data in order to get an idea of how much of each cell type we have at each of the spatial locations, or spots. So our objective, to present this in a perhaps more easily interpretable way, is that we want to go from information regarding the gene expression in our spots to being able to make some form of informed statement regarding the cell type population at each of the spots.

The approach that I will be presenting is a method I've been working on for about a year, and there's a preprint on bioRxiv. In it we use a model-based probabilistic inference approach.
Basically we can summarize this in three steps. We start by inferring the cell type expression parameters from the single cell data, which more or less translates to characterizing a statistical distribution for each of the cell types, and actually for each of the genes within each of our cell types. Then we use these inferred parameters to try to find the combination of cell types in each of our spots that would best explain the observed expression values. And then we can simply map these proportions back onto the tissue, just like we did with the cluster labels before.

Just to give you some insight into the underlying machinery: we work on the assumption that both single cell and spatial transcriptomics data can be modeled as negative binomial distributed. For notation, we say that x is NB distributed with the first parameter r being the rate and the second parameter p being the success probability, and here below we have the PMF for this distribution as well.

We start with the single cell data and say that the expression y of gene g in cell c, when c is of cell type z, follows an NB distribution. The first parameter is the product of a cell-specific scaling factor and a cell type and gene-specific rate, and the second parameter we only condition on the gene.

Now looking at the spatial data, we say that the expression of a certain gene g at a given spot s from a specific cell c, when c is of cell type z, can also be modeled using a negative binomial distribution. Here we use the same parameters as in the single cell data when we condition on the cell type. The idea is that no matter which technique we use to study our cells, they should behave in a similar way.
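The notation just described can be written out roughly as follows. The slide itself is not part of this transcript, so this is a sketch under assumptions: the symbols s_c (cell-specific scaling factor), r_{gz} (cell type and gene-specific rate) and p_g (gene-specific second parameter) are chosen here for illustration, and the PMF is given in one common parameterization of the negative binomial, which may differ from the one shown in the talk.

```latex
% One common NB parameterization (assumed): x counts successes, each with
% probability p, observed before r failures; r may be any positive real number.
x \sim \mathrm{NB}(r, p), \qquad
P(x \mid r, p) = \frac{\Gamma(x + r)}{x! \, \Gamma(r)} \, p^{x} (1 - p)^{r}

% Single cell data: expression of gene g in cell c, where c is of cell type z.
% First parameter: cell-specific scaling factor times cell type and gene-specific rate.
% Second parameter: conditioned on the gene only.
y_{gc} \sim \mathrm{NB}(s_c \, r_{gz}, \; p_g)
```

The spatial model reuses r_{gz} and p_g for a cell of type z, which is what ties the two data types together.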
It's not as if a B cell behaves like a B cell in the single cell data but all of a sudden turns into a T cell just because we use a spatial technique to study it. But we also introduce a scaling factor that is spot-specific, and a gene bias coefficient. The reason behind the inclusion of this gene bias coefficient is that maybe certain transcripts are more effectively captured by one technique compared to the other, so it's good to try to account for that as well.

Now, we don't observe the contributions from the single cells; we observe their sum. So the variables I referred to before are hidden variables, and like I said we observe their sum, and those sums are the elements of our count matrix. The idea here is that when we have a sum of NB distributed variables with a shared second parameter, the p_g here, the sum is also NB distributed, and even more beautifully, the first parameter is just the sum of the components' respective first parameters.

Having established this, we can do a neat trick: rather than summing over each of the cells, we change the index of summation and sum over the cell types, and we introduce a new variable n, which represents the number of cells from a certain cell type at a given spot. We also make another change, where we join the scaling factor alpha and these newly introduced cell counts into a new variable, which we call the unadjusted proportions, and I will soon explain why. We end up with this bottom expression here.

The idea then is that we use maximum likelihood estimation to get estimates of the cell type parameters from the single cell data. Then we use these as known information when we model the spatial data, and again we use MLE to get the gene scaling factors beta and these unadjusted proportions. The reason for this is that once we obtain estimates of the unadjusted proportions, if we just normalize them within each of the spots, we'll actually
get an expression that looks like this, where the scaling factors, the alphas, cancel out, and we end up with the number of cells from a given cell type z at a spot divided by the total number of cells at that spot. That is just a part divided by the whole, and of course that represents the proportion of each cell type at this spot. And that was what we wanted to know in the beginning, right?

So to summarize the theory: we use a probabilistic model where we model the data as NB distributed, and this has all been implemented in a tool called stereoscope. The output from stereoscope is a spot-by-cell-type matrix, where each element represents the proportion of a given cell type at a given spot. You can find it on GitHub if you're interested, and like I said there's also a preprint.

If we apply this to our breast cancer data, the same sample that I showed before, you can see some of the proportion estimates here. I should say that the single cell data was provided by the Swarbrick lab in Australia, and you will also be working with this in part two of the exercise session. Here I've done a bit of cherry-picking of some different cell types, so we can look at the CD4 T cells, the memory B cells, the epithelial cancer cells and the plasmablasts. What we can notice is that the B cells and the T cells seem to correlate spatially, while the epithelial cancer cells and the plasmablasts have a spatial anti-correlation.

We can quantify this in a better way using this cell type colocalization plot, which uses the Pearson correlation between the different cell types. If we look, for example, at the memory B cells, we see that they have a high correlation with the CD4 T cells. And if we were to look at the epithelial cancer cells and the plasmablasts, that would give us a strong anti-correlation.

So just to summarize the integration of the
single cell and spatial data part: it leverages the strengths of the respective techniques and it gives a spatial resolution of well-defined cell types. It could also be used as a basis for subsequent analysis, for example looking at patterns of cell type colocalization. Of course, ideally we would like to have an experimental technique that gives us single cell resolution, but until then I think this is a good approach to the problem. I should also say that there are alternative approaches, and this is not the only method that tries to solve this issue.

And it says my internet connection is a bit unstable, so, yeah, hopefully that's resolved now.

The final point here is that I just want to say that this trend of publishing a lot of atlases as resources, both with single cell and spatial data, is super exciting, because now we don't have to scavenge the whole web for a good single cell data set to integrate with our spatial data. There's way more available to us, so we can probably get some really cool analysis results from that as well.

Right, so now I've been speaking quite a lot about the integration between single cell and spatial data, and we've also looked at how we can do clustering, which again is very similar to what we do with single cell data. So my idea here was to give you an example of an analysis that is a bit more spatially focused.

Say that we have some spatial data, some Visium data, and we find an interesting domain after clustering our data, for example cluster two here. Just to give you some context, this is data from the mouse brain; I think it's a sagittal section, but I'm not 100% sure. We may then ask, for example, how the gene expression changes with the distance to this cluster. When I say distance to this cluster, I mean how far away we are from the edges of the cluster. This image here to the left, I should say, represents the distance: the larger the distance, the higher
the intensity of the face color of our spots.

So if we plot the expression of three different genes, with their expression level on the y-axis and their distance to this cluster on the x-axis, you can really see some trends. For example, the gene DDN, the green one, seems to have an elevated expression the further away we get from this cluster, while for the blue gene and the RG1 the expression seems to be a bit higher when we are near the cluster, but then decreases as we get further away. If we visualize the gene expression by overlaying it on the tissue, with cluster two colored in black to have a reference to compare with, we can see that the inferences we made from the previous plot really do check out. The clearest example is perhaps the green gene, DDN, where we see that there's almost no expression near the cluster, while the expression increases quite significantly when we get further away.

Now that we have become familiar with this notion of expression as a function of the distance to something within our tissue, we can continue and ask: within which distance d from cluster two, or whichever boundary we're interested in, can we find epsilon percent of all transcripts from some given gene? If this plot again represents the expression on the y-axis and the distance to our cluster on the x-axis, and the graph models this relationship, the question is equivalent to solving the equation that you see here to the right with respect to d.

To further emphasize what I mean by this, say that we want to find the distance within which 50% of all of the transcripts of this gene are contained. We can solve this equation and end up seeing that, okay, within about 0.17 distance units from this cluster we find 50% of our transcripts. If we were to say that we want 85% of all of our transcripts,
of course we would need to move further away from the border, and if we increase this number even further we need to move even further away from the border.

Now we can also flip this question and ask: which genes have epsilon percent of their transcripts within the shortest distance d from cluster two? That allows us to find genes that are spatially upregulated, or highly expressed, near the cluster borders. This might not be the most interesting example if you're not a mouse brain enthusiast, but if we put it into the context of tumor or cancer tissue, maybe we're interested in the tumor microenvironment, the regions just near the tumor edge. This could be one unbiased way of identifying genes that have certain expression patterns and seeing how they relate to our tumor edges.
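The two questions above (within which distance do we find a given fraction of a gene's transcripts, and which genes reach that fraction closest to the cluster) can be sketched as follows. The equation on the slide is not part of this transcript, so this is an assumption-laden stand-in: instead of a fitted curve it uses the raw per-spot values and an empirical cumulative sum, and the function names `distance_containing_fraction` and `rank_genes_by_distance` are hypothetical.

```python
from typing import Dict, List, Sequence, Tuple


def distance_containing_fraction(
    distances: Sequence[float],
    expression: Sequence[float],
    fraction: float,
) -> float:
    """Smallest observed distance d such that the spots within d of the
    cluster hold at least `fraction` of the gene's total expression."""
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    total = sum(expression)
    if total <= 0:
        raise ValueError("gene has no expression")

    running = 0.0
    # Walk outwards from the cluster edge, accumulating expression.
    for distance, value in sorted(zip(distances, expression)):
        running += value
        if running >= fraction * total:
            return distance
    return max(distances)  # guard against floating point round-off


def rank_genes_by_distance(
    distances: Sequence[float],
    expression_by_gene: Dict[str, Sequence[float]],
    fraction: float,
) -> List[Tuple[str, float]]:
    """Order genes by how close to the cluster they reach `fraction`."""
    result = {
        gene: distance_containing_fraction(distances, values, fraction)
        for gene, values in expression_by_gene.items()
    }
    return sorted(result.items(), key=lambda item: item[1])


# Toy example: five spots at increasing distance from the cluster edge.
spot_distances = [0.0, 0.1, 0.2, 0.3, 0.4]
toy_expression = {
    "near_gene": [5, 3, 1, 1, 0],  # expressed mostly close to the cluster
    "far_gene": [0, 1, 1, 3, 5],   # expressed mostly far from the cluster
}
print(rank_genes_by_distance(spot_distances, toy_expression, 0.5))
# [('near_gene', 0.0), ('far_gene', 0.3)]
```

Raising the fraction pushes the returned distance outwards, which matches the observation that 85% of the transcripts requires moving further from the border than 50%.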