Fantastic. So there are a lot of moving pieces here, and I've never used this software before, so hopefully I can keep track of the chat and also figure out what's going on. Today I'm going to be talking about our spatial analysis of high-parameter imaging data. I hope that anyone here listening was also listening to Nils this morning, because he gave a pretty good background on the type of data we're going to be talking about. I got a little bit of presentation envy this morning watching Nils live coding, so I've decided to pivot and try some live coding myself. If it all falls apart, it falls apart, but I'm pretty sure I can pivot back to what I was originally planning if things don't turn out well. So hopefully everyone's seen images like this. We've got these fantastic technologies that allow us to look at tissues and see cells in their tissue environment in quite high detail. The really awesome thing that's happened in the last few years is that we now have technologies which can measure up to 40 different antigens in this tissue space. Here we've got a piece of head and neck tumor, and we're just visualizing six colors, and you can already see the amount of variety and the resolution of different cell types. To me this is a really exciting thing, but with really exciting things also come really challenging problems. At our university we purchased a Hyperion imager a few years ago, so we generate IMC data, and more recently a CODEX imager. And so we've been trying to think of different ways to handle this analytically to get the data out for our collaborators.
I just mentioned CODEX and IMC, but there's obviously a large variety of technologies that can generate similar types of data, where we get these high-parameter measurements for lots and lots of different cells at either the protein or the RNA level. My experience has all been at the protein level, whether with CyCIF data, IMC or CODEX. There are obviously a lot of other technologies, and this is in no way an exhaustive list. Everything I talk about today is aimed at a pipeline to analyze information from these technologies. Hopefully I'm not being too ambitious, but I'm hoping to walk you through a demonstration of how we can use a bunch of different R packages that we've developed to get a handle on this type of data. Like Matt said, if at any point things get confusing, please interrupt me, whether in the chat or by just shouting out, and I'll try to clarify. And if you're feeling very motivated, you can find the code for this workflow on our GitHub page, the Sydney BioX GitHub page. You're welcome to check that out. Because this is a package demonstration, I thought I'd use some data that's publicly available, so I've decided to use this MIBI-TOF data set from Michael Angelo's lab, where they're looking at invasive breast cancer. You can see that they've managed to generate these really beautiful images where you can see a lot of structure in the tissue, and it's this kind of structure that we're hoping to quantify and encapsulate, and hopefully use to learn something about the biology. Another great thing about this data set, and why I chose it, is that it roughly follows the workflow that I'm trying to outline. What they've got is a bunch of images from people with breast cancers that either progress or do not progress.
They do a lot of different analyses. They look at cell types and cell states and look for differences in these. They look at tissue compartment enrichment, and at cell-to-cell proximities, looking for these spatial relationships. They do some tissue microenvironment morphometrics, which is something that I can't do at the moment. Then they eventually use all this information to classify patients. So hopefully in half an hour's time I'll have given you all the information you need to go off and do this style of analysis in maybe 20 lines of code. I think that's really exciting and really cool. And like I said, I'm going to try pivoting to some live coding, and if it falls apart, it falls apart. Again, please interrupt me if you've got any questions, and we can look at things as we go along. I'm going to start by reading in our data. Again, the reason I chose this data is that I thought it was pretty cool and nifty, but it's also quite small, at least the version I'm using. If we look in here, the data is organized so that there's a folder for every single patient. We can look at a particular patient ID here, and see that they've also split their images up into images for each individual channel. Like I said, this is small data, nicely compressed: we're talking about 10 kilobytes per channel, so around five megabytes per image. I've got all this up on GitHub, which means you don't have to go anywhere else to download stuff; these are all small files, everything's nice. But this is the way our data is organized. And to read our data in, we can use the EBImage package, which is fantastic.
It'll just read the data in from this folder-file format into an image object. And like I said, I hope a lot of you attended Nils's talk this morning, because now that we've read in all our images, I'm actually going to re-read them into a CytoImageList. I do this for a couple of reasons. One, it's a nice object that we can put some metadata on. Two, it allows us to store these images back on disk as H5 files, so we don't have to have them loaded into memory. This is generally quite a convenient thing to do, and as you can see, we're really not using much space in memory now that we've read these in. As with any data, we're eventually going to be interested in looking at associations with some sort of clinical phenotype. It's a little bit ugly, but I'm going to read in some clinical data, with a little bit of manipulation to make sure that our image IDs match between the clinical data set and what we've got. Then I can take all of this clinical data and put it into the mcols of our images. If we look at that, this is what we've got: for each image, its image ID, some information about the patient, in particular its status of whether it's a non-progressor or not, and then a bunch of other things that we're not going to use today. Okay, so the first step of our pipeline. We've read in our data and we've got a bunch of images. One of the first things we might want to do, if we're going to make conclusions about cells, is figure out where those cells are. So we've developed a simple package called simpleSeg, which performs really simple segmentation on these images. We developed this for a few reasons. One, we really liked the segmentation workflows outlined in EBImage.
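As a rough sketch, the reading step just described might look something like this with cytomapper. The folder paths, the HDF5 location, and the clinical table are placeholders, and the exact matching code will depend on how the IDs are stored:

```r
library(cytomapper)

# Read every channel image into a CytoImageList, keeping the pixel
# data on disk as HDF5 files rather than in memory.
images <- loadImages(
  "data/images/",          # placeholder: one folder per patient
  on_disk     = TRUE,
  h5FilesPath = "data/h5/" # placeholder: where the H5 files are written
)

# Attach clinical information via mcols(), after matching image IDs
# between the clinical table and the image names (illustrative columns).
clinical <- read.csv("data/clinical.csv")
mcols(images) <- DataFrame(
  clinical[match(names(images), clinical$imageID), ]
)
```

The on-disk option is what keeps the memory footprint small when there are 70-odd multi-channel images.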
We just wanted to make these even easier for our students who are going off and doing this analysis, to take out as many unnecessary complications as we possibly could. There are lots and lots of different options in the simpleSeg package, but what we're doing here is taking our images, using principal component analysis together with the histone H3 channel to identify our nuclei, square-root transforming these and thresholding them, and then, once we've found our nuclei, doing some size selection. We also get our cell bodies: we simply dilate out our nuclei by two pixels to try to capture the marker expression that's outside of the nucleus. While that's running, and it shouldn't take too long, I'll jump back and tell you why we like simpleSeg and why we think it's nice. There's obviously a lot of really cool software out there for doing segmentation. We mostly use ilastik when we're trying to do complicated things. But all of these require you to step outside of the R environment, or to use fancy wrappers that call Python from inside of R, and they're kind of complicated to install. So, like I said, we were originally doing a lot of our segmentation using EBImage, and to make things really easy for our students, Alexander Nichols, one of my honours students, has written this simpleSeg package. One of the really cool things about it is that it automatically chooses some of the key tuning parameters that we were having trouble with. It does a lot of other cool things and gives a few different options for the types of segmentation you might want to do. Like I said, this is simple segmentation: all we're doing is identifying nuclei, and we've got a few different ways of estimating our cell bodies.
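The segmentation call described above might look roughly like the following. This is a sketch only: the argument names are assumptions based on the description (PCA plus the histone H3 channel, square-root transform, size selection, two-pixel dilation), and the channel name is illustrative:

```r
library(simpleSeg)

# Segment nuclei and estimate cell bodies by dilation; all argument
# values here are assumptions matching the talk, not a prescription.
masks <- simpleSeg(
  images,
  nucleus       = "HH3",    # histone H3 channel used to find nuclei
  pca           = TRUE,     # combine channels via PCA before thresholding
  transform     = "sqrt",   # square-root transform before thresholding
  sizeSelection = 10,       # drop objects too small to be nuclei
  cellBody      = "dilate", # estimate cell bodies from the nuclei
  discSize      = 2         # dilate out by two pixels
)
```

The result is a set of cell masks aligned to the images, which the quantification step uses next.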
Here, in this diabetes data set, if we quantify synaptophysin expression in our nuclei and compare it to the original ilastik segmentation, we can see that we get a reasonable correlation between our new quantifications and the original ones. That's just looking inside the nuclei. If we dilate out from the nuclei and look purely in that disk around the nuclei, we also get a reasonable correlation in intensities between the two approaches. And if we combine the information from our dilated disks with the nuclei, we get a much stronger correlation. So by using a really simple segmentation approach, we effectively end up with very similar quantifications to going through and using ilastik, which requires you to train where cells are and needs a bit of manual curation. Again, for a first-pass analysis, we've been using this simpleSeg package. Here we've got some densities that show that if you just use the disk, you get reasonable correlation between our disk quantifications and ilastik; if you use the nuclei, the correlations become even stronger; and if you use both the disk and the nuclei, you get quite strong concordance between a really simple segmentation and the cool stuff you can do with ilastik. So let's jump back; hopefully it's finished running by now. And our segmentation is finished: we've just segmented the cells in around 70 different images. We can go and look at the quality of these segmentations. EBImage has a lot of nice functionality for looking at segmentation, and this display function is pretty cool because you can generate plots in your viewer, which means you can zoom into different regions to check out what things are doing. We can see that in general things seem to be segmented okay. Not perfectly, but okay.
Cytomapper also has some really nice functionality for checking out segmentations, and hopefully you saw this this morning with Nils. Here we have the capacity to set a bunch of different colors for different channels, and we can scale these using the bcg argument. We can see again that the segmentation looks pretty nice and pretty reasonable. For first-pass analysis, we think this simpleSeg package makes reading images in and segmenting them quite easy. Now that we've segmented our cells, something we're probably really interested in is quantifying the amount of marker expression in each cell. Again from the cytomapper package, we can use the measureObjects function. What that's going to do is take our masks and our images, and calculate the average abundance of each marker for each cell, as well as a few simple morphology characteristics. This isn't necessarily quick, and the reason it's not super quick is that it's reading each of the images from those H5 files, so it does take a little bit of time. How this function works is that it takes all of these features and loads everything into a SingleCellExperiment. It puts all of your measurements into an assay called counts, but in the colData it also stores information like the area and, most importantly, your X and Y coordinates. We're going to be using these quite a lot as we continue our analysis. We can go and look at the abundance of any marker. Here we're looking at pan-cytokeratin; zoomed in a lot, you can see that this marker is really highly skewed. Each of these is a density of the marker in one of our 70 images, all highly skewed, and it's probably quite difficult to see which cells would have pan-cytokeratin expressed or not.
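The quantification step just described might look like this. The img_id value is an assumption about which metadata column links masks and images:

```r
library(cytomapper)

# Average each marker over the pixels of each cell mask, plus a few
# simple morphology features; reads pixel data from the on-disk H5 files.
cells <- measureObjects(
  mask   = masks,
  image  = images,
  img_id = "imageID"  # assumption: the column shared by masks and images
)

# 'cells' is a SingleCellExperiment: marker intensities live in the
# "counts" assay, and area plus X/Y coordinates live in colData(cells).
cells
```

Everything downstream in the pipeline works off this one SingleCellExperiment.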
What we can do is use our normalizeCells function, where we do a square-root transform, trim at the 99th quantile of each marker, and do a min-max scaling between zero and one. If we plot these again, we can see that, while not perfect, we've now got this really bimodal distribution, and at least it's now clear that there are cells that express pan-cytokeratin and cells that do not. If we wanted to take this further, we would use our scMerge package for really harmonizing marker expression between images, but this kind of bimodal pattern is probably enough to go through and do some clustering and identify our cell types. So that's what we're going to do. One of my students, Elijah Willie, has developed the FuseSOM package. Effectively this is a self-organizing map that we've designed specifically for this spatial type of data. I'll run that now; it's quite quick. Sorry, I had COVID last week and so I'm suffering. What this does is take a SingleCellExperiment, and potentially a set of markers if you want to restrict the markers used for clustering, build a self-organizing map, and then cluster that self-organizing map into a specific number of clusters. Here I've clustered into 20 clusters, and we can use a function from scater to try to interpret what those clusters are. Let me zoom out just a little bit. So here we've got each cluster, and we can see this clustering's worked quite nicely. When in doubt, I always go looking for T cells, and we can see that we've got a cluster with our CD8-positive T cells and a cluster with, sorry, I'm going to have a drink, our CD4-positive T cells. If you look closely, you can see your B cells and macrophages. The wonders of presenting at home. So things look quite nice.
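A sketch of the normalization and clustering steps above. The option names passed to normalizeCells and runFuseSOM are assumptions based on the description (square-root transform, 99th-quantile trim, min-max scaling, 20 clusters), and clusterMarkers is a placeholder vector of marker names:

```r
library(simpleSeg)
library(FuseSOM)

# Transform and rescale marker intensities so expressing vs
# non-expressing cells separate into a bimodal pattern.
cells <- normalizeCells(
  cells,
  transformation = "sqrt",
  method         = c("trim99", "minMax")  # trim at 99th quantile, scale to [0, 1]
)

# Cluster cells with a self-organizing map; markers restricts the
# channels used for clustering (placeholder variable).
cells <- runFuseSOM(
  cells,
  markers     = clusterMarkers,
  numClusters = 20
)
```

The cluster labels end up as a column in colData, which is what the downstream proportion tests and spatial analyses use.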
So I'm quite happy with that. I just arbitrarily chose 20 clusters, but we can use the estimateNumCluster function to generate a bunch of different statistics or quantifications that we could use to optimize this. Using the gap statistic method, we probably would have chosen around 22 clusters, so 20 isn't a horrible choice, and like I said, we get the majority of the immune populations, so things are kind of nice. One thing we might want to do, now that we've got these classes, is see if the abundance of any of these clusters or cell types is associated with progression in some way. We don't use edgeR or diffcyt to do this; we've got a really simple convenience function just for doing t-tests and Wilcoxon rank-sum tests on these cluster proportions. I think the most important thing we can see here is that we really don't see any clusters that are associated with progression, at least at the proportion level. Nothing's really changing in abundance relative to progression, which is unfortunate, but that's what we see. This was the most strongly associated cluster, and we can see that for whatever reason it's more abundant in progressors versus non-progressors. Okay, we could do some dimension reduction and check things out, but let's give that a skip. So now we've segmented our images, identified our clusters, and looked to see if the abundance of those clusters is changing. I want to progress a little bit further and start making use of our spatial information. Here we've developed the spicyR package, written by one of my students, Nicolas Canete, and it's quite easy to run, so let's start it running. What it does is look at pairwise associations between each of the cell types, and see if these associations are changing relative to some condition.
We've got our cell types stored in the clusters column of our SingleCellExperiment, and we've also got our status condition. We don't need much information: we just need the XY coordinates, and the radii over which we want to quantify our spatial associations. Here I'm also using sigma = 50, which accounts for a little bit of the global structure in the images. We can see in this image over here that there is this global spatial structure; we may or may not want this to influence our quantifications of association, so I'm just dialing it back a little bit. And again, if we go and look, sorry, we've got the pairwise association between two clusters, and after accounting for multiple hypothesis testing, there's really not anything strongly associated with progression status. We can visualize this quite easily and nicely using the signifPlot function. On the y-axis here we've got clusters, on the x-axis we've got clusters, and each of these pairwise associations is colored red or blue depending on whether the cell types seem to be attracted to each other or avoiding each other. I've circled all the clusters with just a nominal p-value less than 0.05. If you're really quick, you can see that one of the only global patterns we seem to see, between the non-progressors and the progressors, is that the associations overall seem to be moving towards avoidance. Pairs that are strongly attracted in one group become more weakly attracted, and pairs that are weakly avoiding in one group become even more strongly avoiding in the other. So there seems to be some sort of dispersive thing happening as people progress. I don't know what that means. Okay, so we've checked for changes in proportions and changes in cellular relationships.
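A hedged sketch of the spicyR call just described. The column names and the set of radii are illustrative, and the exact argument names may differ between package versions:

```r
library(spicyR)

# Test whether pairwise spatial associations between cell types
# differ between progressors and non-progressors.
spicyTest <- spicy(
  cells,
  condition = "status",       # progressor vs non-progressor
  cellType  = "clusters",     # cluster labels from FuseSOM
  imageID   = "imageID",
  Rs        = c(20, 50, 100), # radii over which associations are summarized
  sigma     = 50              # dampen the global structure in the images
)

topPairs(spicyTest)   # pairwise associations ranked by p-value
signifPlot(spicyTest) # grid of attraction (red) vs avoidance (blue)
```

Each cell in the signifPlot grid corresponds to one pair of cell types, which is the view described on the slide.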
We can also look to see if we can find cellular neighborhoods in our images, which is effectively a fancy way of looking for spatial associations between multiple different cell types. We've gone from a pairwise situation to a kind of multivariate one. So we're going to run our lisaClust function, which clusters these spatial association statistics into lots of different regions, and we can again try to interpret what these regions are that we identify. Here I've just said to identify five different regions, and then we can see which clusters appear more frequently in each of these regions. You can see some of these regions are potentially capturing interactions between multiple cell types. We can obviously go and visualize these. If we look at one image, we can see that we end up with these fancy patterns of association: we've got some cells here that all seem to have some particular spatial arrangement, and things are what they are. Obviously, looking at this plot we can see which cells we've assigned to which region, but we can't necessarily relate that back to which cell types are which. So I've written this simple hatching plotting function. It's quite slow, but it does end up generating slightly informative images. Here we've got this hatching plot where each of the regions is represented by a hatching pattern, and each of the cells is colored by the cluster it's assigned to. We can see that we're finding regions that may represent something in particular: we've got a region that represents our tumor-immune border, regions that are very much the tumor, and then a bunch of other things that we probably don't need to go through in the time we have.
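A sketch of the region-finding step. Again, the argument names are assumptions based on the description (five regions, cluster labels in "clusters"), and the image name passed to the hatching plot is a placeholder:

```r
library(lisaClust)

# Cluster local spatial association statistics into k regions,
# adding a region label for every cell.
cells <- lisaClust(
  cells,
  k        = 5,          # ask for five regions
  cellType = "clusters",
  imageID  = "imageID"
)

# Which clusters appear more frequently in each region.
regionMap(cells, cellType = "clusters")

# Regions drawn as hatching patterns, cells colored by cluster
# (image identifier is a placeholder).
hatchingPlot(cells, useImages = "imageA")
```

The region labels in colData are what the per-image region proportion tests use next.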
Because we've assigned cells to regions, we can obviously go through and see if the proportion of cells in each of these regions changes from image to image and is associated with status. Most likely because we're doing fewer tests, in this case we do end up identifying a region that is significantly associated with progression status; it is what it is. Again, we could go back and try to interpret which cells are most frequent in these regions, hopefully knowing which cells are which cell types. Okay, we've also got this really nice package called scFeatures, which is designed for both single-cell sequencing data and this spatial data. It can calculate a bunch of different quantifications of features on these images. But I might just flip back to my original slides; I should go back and show you this, sorry to bounce around. The reason we like using FuseSOM is that it really does perform better than a lot of other clustering approaches. We've looked at FlowSOM and PhenoGraph, and when people apply these clustering approaches, quite often they over-cluster their data and then manually go through and merge clusters that seem to make sense, to identify cell types. So if we apply FuseSOM and FlowSOM to data, even though FlowSOM was originally used to generate the clusters, which were then manually curated, we end up finding that FuseSOM agrees much more with these manually curated clusters than FlowSOM does, if we choose a similar number of clusters. We get adjusted Rand indices and mutual information scores that perform quite well. All right, bouncing around again. Okay, so scFeatures. We know that this data can be quite complex. Here is a little cube that would represent some single-cell data, where we've got information on cells.
We've got abundance measures of genes or proteins for these cells, and each of these cells might come from a different person. To do any sort of machine learning or testing, we really want to take this complex data structure and flatten it out into something that's just samples by features. This package is pretty cool: we've developed a bunch of different ways of doing this. We can simply calculate cell type proportions with some different transformations. In our case, what we're going to end up using is cell-type-specific expression measurements, the expression of each marker within each particular cell type. We can obviously also look at some spatial statistics too. Oh no, I didn't want to do that; the whole time I haven't had it running. It shouldn't take too long. So this will go through, and we've asked it to simply calculate proportions, even though we've already got those, and then also calculate the average expression of each marker in each cell type. Sorry, trying not to cough. I'm glad that I'm not there coughing all over you; the wonders of virtual. Okay, so now we've got all these measures, we can simply test to see if there are any markers changing within any particular cell type that are associated with progression. Again, we find things with small p-values, but none of them really hold up once you account for the fact that we're doing 20 times 40 tests. Okay, so we've gotten to the stage where we've looked at different features within different cells, at spatial interactions, and identified cellular regions. Now we can try to pull it all together and perform some classification. We've developed the ClassifyR package, with Dario Strbenac, to not just perform classification, but as a framework for evaluating how that classification is performing.
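A very rough sketch of the feature-generation step. The scFeatures interface takes the cell-level data plus sample and cell type annotations, and the feature type names here are assumptions chosen to match the description (proportions, plus per-cell-type mean marker expression):

```r
library(scFeatures)

# Flatten the cells-by-markers-by-samples structure into
# samples-by-features matrices; feature type names are illustrative.
features <- scFeatures(
  data          = cells,
  feature_types = c("proportion_raw",       # cell type proportions per image
                    "gene_mean_celltype")   # mean marker expression per cell type
)

# One samples-by-features matrix per requested feature type.
names(features)
```

Each of these matrices can then be tested against the progression status, or handed to the classification step below.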
Hopefully most people who fit classification models and try to evaluate them are used to using cross-validation. We've developed a bunch of convenience functions that perform cross-validation on a lot of the objects we're used to using. So this will take a MultiAssayExperiment, a DataFrame, I don't know how people respond to this, a list of data frames, anything. We can then perform a range of implemented classification approaches: we've got random forests, SVMs; in this case we're using an elastic net model. And I've asked it to perform five-fold cross-validation with 100 repeats. If we do this, we'll hopefully get some classification results. There's the last one. What I've fed it here is my proportions and my means for each cell type; I've also handed it my region information, stored all this in a list of data frames, and passed it through to ClassifyR. So now I'm simply looking at a box plot of the AUCs from models built on each of these data types. We can hopefully see that if we use the average expression of each marker in each cell type, or the proportions of cells in each image, neither really does a fantastic job of classifying our patients into progression status. This is actually reflective of what they found in the original paper. But if we use some more complex spatial information, the proportions of our different regions, we end up getting AUCs of around 0.7. In the paper, using the morphometric measurements, they got up to around 0.75. But I think this is pretty neat and pretty nice. I can't believe I didn't stumble too much with the live coding. Effectively, what I've presented here, or tried to, is a cohesive pipeline that makes it easy to perform a standard type of spatial analysis that people might want to do.
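The cross-validation step above might be sketched like this. The three input matrices are placeholders standing in for the proportions, per-cell-type means, and region proportions computed earlier, and the classifier keyword is an assumption:

```r
library(ClassifyR)

# One samples-by-features matrix per data type (placeholder objects).
measurements <- list(
  proportions = propMatrix,   # images x cell type proportions
  means       = meanMatrix,   # images x per-cell-type marker means
  regions     = regionMatrix  # images x region proportions
)

# Repeated k-fold cross-validation of an elastic net model on each
# data type, predicting progression status.
results <- crossValidate(
  measurements,
  outcome    = "status",
  classifier = "elasticNetGLM",
  nFolds     = 5,
  nRepeats   = 100
)

# Box plots of performance (e.g. AUC) per data type.
performancePlot(results)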
Like I said, this is something that a lot of our students in the labs have started using, and they've been finding it reasonably easy and cohesive. I'd like to acknowledge the Sydney Precision Data Science Centre; we're a group of academics who are into anything statistical and biomedical, and they're always worth acknowledging. As part of this group, we're developing lots and lots of different R packages for analyzing both this spatial information and single-cell RNA sequencing data. So if you're doing anything single-cell, maybe check us out; we might have something that could help you along. And I'd just like to acknowledge everyone that contributed to this work; it's a labor of love from many, many people. Also Nils, for a bunch of really stimulating conversations, and Wolfgang Huber and Susan Holmes for their Modern Statistics for Modern Biology book, which really helped me when I was trying to get my head around a lot of this spatial analysis. I recommend checking it out. Cool. So thank you very much. I believe I've stuck to time, which is fantastic, and I'm happy to take any questions.

Q: Hi, Ellis, this is Leonardo. Great talk. Can you hear me?

A: Yeah, I can hear you.

Q: I don't work with this type of data, but I work with data from the Visium platform, and with immunofluorescence, and we are also struggling a lot with segmentation. I don't know how much of simpleSeg, I think that's the name, can generalize to different types of morphologies, or is it mostly for DAPI-identified nuclei? Or do you have plans for expanding it into more complex segmentation?
A: So, I kind of doubt that it would transfer directly to Visium. It would probably transfer to the high-resolution technology that they've got coming out, a high-resolution spatial transcriptomics; I'm sure there's a fancier name for it. It could probably apply to that. But in terms of Visium as it is, you could probably chuck it on, but I don't know what features you'd be trying to segment out, because obviously you're not at the cellular level with Visium. If you're just trying to segment out regions, maybe it could work. But I think you'd be better off, instead of segmentation, heading down the cellular neighborhood kind of route, where you start clustering all the different spots and hope things pop out that way. I'm sorry if that doesn't answer your question at all, or fill you with hope; I don't think it would apply to Visium data. We can try it.

Q: No, thank you, that does answer the question. But I do think there's also an opportunity to try to get information at the spot level, where for a particular spot you can say, here I have five cells and they have these different shapes, and of these shapes I have, I don't know, two cells. It could still be useful to get that information.

A: Yeah, with that scFeatures package we do have support for just analyzing things at the spatial level, so we obviously don't use our really cell-type-specific spatial metrics, but we do analyze things at that spot resolution. But no, I don't have anything up my sleeve for estimating the number of cell types and those kinds of things in spots.

Q: Hi, can you hear me? This is Lukas Weber.

A: Hi, nice to meet you.

Q: Hi, could you comment on the difference between simpleSeg and steinbock, from Nils this morning?
A: So, one of the differences is that simpleSeg is in R, so you don't have to go off into Python to run the thing, and that was really a key motivation for developing it. It's also applicable to different imaging technologies; I don't know how IMC-specific steinbock is. But I would imagine that their pipeline would probably end up having better performance, in the sense that they make use of DeepCell or ilastik, which are much more complicated approaches, but they do require training. So if you're really worried about having perfect segmentation, I'd potentially go through that kind of pipeline, where you can really try to optimize things. The point of simpleSeg really is that it's just supposed to be simple, and to allow you to do this quick first-pass analysis, and potentially final-pass analysis if you end up finding interesting signal. So I guess a lot of it depends on your expertise and your willingness to really do things properly, or optimally might be the better word.

Q: Okay, that's great. Thanks.

Q: Yeah, great talk. I have a question. I find that the spicyR and lisaClust packages use spatstat to perform the spatial point process analysis. For this kind of data you have to have an observation window to construct the ppp object in order to run the code in spatstat. So I just want to know: what do you use as the observation window? And I also want to know how much it affects the results. Do you use an annotated tissue boundary, or just the whole image?

A: Yeah, and this is a really important problem. I've got a student who has these really beautiful slides that show that the conclusions you make with spatial analysis can change very much depending on the window that you use. Obviously, if you zoom into a region, cells might look like they're randomly distributed.
If you zoom out a little bit more, you might find that there are really two populations, and they're very much avoiding each other. And if you zoom out even further, the two populations that look like they're avoiding, relative to the whole window space, actually start to look like they're attracting. So really nailing the window that you use is really important. In spicyR we've got a few different ways of calculating windows. Obviously, you can mask out the tissue, which works really well if you've got these punch samples and you're using a tissue marker. But we also have methods for estimating either a convex hull or a concave hull around the cells, and we find that performs quite well. Obviously you could also just use a square. And if you use the sigma parameter, that corrects a little bit for these global correlation structures, and it does an okay job of correcting for horrible window estimation. Yeah, that's a really important point. So in spicyR we do have these simple ways of at least estimating these hulls, which can help a little.

Q: Yeah, so in general, which way do you recommend?

A: I recommend using a convex hull, and then using this sigma thing to account for a little bit of the structure, because often there are little holes in the tissue, and that can help a little bit. But I was also thinking that I probably should implement something a little bit more complex, where maybe we dilate out from each cell and kind of imprint, or cookie-cut, what we think might be the tissue region; that should be a reasonable approach too. But yeah, I recommend at least using the convex hull and this sigma to account for any other weird stuff that's happening in the tissue.

Q: All right, thank you.

A: And obviously, if you've been having difficulties with it, I'm more than happy to chat at any time.
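To make the window point concrete, here is a small self-contained illustration with spatstat: the same set of points analyzed under a naive square window versus a convex hull estimated from the points themselves. The point pattern is simulated, so the numbers are only illustrative:

```r
library(spatstat.geom)

# Simulate some cell coordinates.
set.seed(1)
x <- runif(200)
y <- runif(200)

# Two candidate observation windows for the same points.
win_square <- owin(c(0, 1), c(0, 1))  # naive bounding square
win_hull   <- convexhull.xy(x, y)     # convex hull around the cells

pp_square <- ppp(x, y, window = win_square)
pp_hull   <- ppp(x, y, window = win_hull)

# The estimated intensity (points per unit area) differs because the
# window areas differ, and so would any downstream K- or L-functions.
summary(pp_square)$intensity
summary(pp_hull)$intensity
```

With real tissue, where cells occupy an irregular region inside a rectangular image, the gap between the two intensity estimates, and between the resulting spatial statistics, can be much larger than in this toy example.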
It's great that you've been applying it.

Host: Well, let's thank Ellis again for a very nice talk. I hope you feel better soon.

A: Yeah, I'm feeling pretty good; it's just the breathing. Thank you, and I enjoyed the opportunity to present virtually as well; that obviously made things very convenient for me.