Thank you very much for having me. I'm going to place some links here in the chat. Well, I'll fix it a bit later. Anyways, thank you very much for having me. I'm really excited to present some of our work here. Compared to the previous speaker's, this might be a bit more niche, in a specific area of research, but it is actually really powered by R and by Bioconductor. I've been working at the Lieber Institute for Brain Development, analyzing different types of gene expression data, and recently I've been collaborating with Keri Martinowich and Kristen Maynard from the Lieber Institute, as well as Stephanie Hicks from Hopkins Biostatistics. You can find the slides here, so I'll post a link in the chat. As we're interested in studying the brain, one particular region we're quite interested in is called the dorsolateral prefrontal cortex. That's because this region has been implicated in several neurodevelopmental disorders; for example, it has been implicated in schizophrenia. So this gives you a bit of why we're studying that brain region. But there are different ways you can study gene expression, that is, gene activity levels; there are different technologies for that. The older one is called bulk RNA-seq, and even though it's cheap enough that you can run it on a lot of samples, the problem is that if you have an illustration here of a brain, you end up looking at everything mushed together: you can't really tell the different colors or shapes apart. A few years ago, single-cell or single-nucleus RNA-seq was developed, and that allows you to classify the different, in this case, colors, so the different cell types, and measure gene expression for each of them. While that is pretty neat, you lose the spatial information, which is where spatial transcriptomics comes into play, and that's the most recent technology.
But unlike the picture over here that looks 3D, with spatial transcriptomics you only get one slice, so an X and Y plane. It was the method of the year in 2020, so we're pretty excited about it. We were like, oh, how can we use it to study the brain? But before trying something new on a tissue we don't know much about, we decided to try it out on the dorsolateral prefrontal cortex, because we know, based on studies from decades ago, that there are supposed to be six different cortical layers, so six different layers of neurons that have different shapes, densities, and cell type compositions. There's a lot of biological knowledge about those regions, layers, sorry. So we thought, okay, let's try it out here; let's see if we can recover what we already know, right? We used this commercially available solution for spatial transcriptomics called Visium, from a company called 10x Genomics. It has this little square over here that is 6.5 millimeters on each side, with a honeycomb pattern of spots that are 55 micrometers in diameter. We can measure gene expression at each spot using chemistry that's similar to what has been used for single-cell gene expression. That way we obtain gene expression measurements in a little area, where a single spot might contain one cell, five cells, etc. So it's not at the single-cell level, but it's close enough. We decided to run this pilot study using the dorsolateral prefrontal cortex, but that region is actually quite big: a 6.5 millimeter square is just a tiny portion of the whole brain region we could study. So we decided to select tissue where we could actually see all six layers plus the white matter. That's what we were aiming for over here.
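As a rough sanity check on those dimensions, here is a back-of-the-envelope sketch (in Python, since the idea is language-agnostic) of how many 55 micrometer spots fit on a 6.5 millimeter capture area, assuming the documented 100 micrometer center-to-center spacing of the honeycomb grid; the numbers here are illustrative, not pulled from our code.

```python
import math

# Back-of-the-envelope estimate of how many spots fit on a Visium capture area.
# Assumes the documented 100 micrometer center-to-center spacing between spots
# arranged in a hexagonal (honeycomb) grid on a 6.5 x 6.5 mm square.
side_um = 6500          # capture area side, in micrometers
spot_diameter_um = 55   # each spot is 55 um across
pitch_um = 100          # center-to-center distance between neighboring spots

spots_per_row = side_um // pitch_um                   # ~65 spots across
row_pitch_um = pitch_um * math.sin(math.radians(60))  # hex rows sit closer together
n_rows = int(side_um // row_pitch_um)                 # ~75 rows
approx_spots = spots_per_row * n_rows

print(approx_spots)  # on the order of a few thousand spots
```

This lands in the right ballpark: the actual Visium array contains 4,992 spots.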
And we did it across three different subjects with two pairs of spatially adjacent replicates. So imagine you have a loaf of bread: you take two slices, then you discard the middle portion of the bread or do something else with it, and then you also measure the next two slices. That's what we did over here, just to try out this new technology. People who are trained at this, not me, can look at this type of image and say, oh, actually here we see the white matter and the other six layers, same with the histology image. But now that we can measure gene expression, we can see that things do kind of match up. SNAP25 is a gene that is highly expressed in neurons, and layers one to six are supposed to have a lot of neurons; that's why you see a lot of expression in this area. Whereas MOBP is a gene that is supposed to be expressed mostly in the white matter, and that's why you see expression mostly over there in that region. Another one that's more finely defined is PCP4, which is a layer 5 marker gene. You can kind of see it over here, but you also see some noise in the rest of the tissue. And because this was the first time we were trying this, we said, okay, let's look at our spatially adjacent replicates. You're hoping to see really similar images across the spatially adjacent replicates, which is kind of what we see here; it's not a perfect clone of one another, but overall we were happy with how the data looked. Based on all of that known biology and what we were seeing, my collaborator Kristen Maynard was able to manually label all of the spots and categorize them into the six layers plus white matter. That is what gives meaning to this rainbow type of image that we see over here.
But in order to actually analyze it, and this data is quite sparse, so there are a lot of zeros, we have to compress the information using a process called pseudo-bulking. Say here we had around 4,000 spot columns; we end up with just seven, one number per layer for each of the genes we have. Once we do that, we can use principal component analysis, and we can see here that the first principal component, which explains the most variance by design, actually separates the white matter, shown in black, from the rest of the colors on the right side. So that's pretty good. But then the second principal component starts with layer one at the bottom, then layer two, three, four, five, six. We weren't expecting to see such nicely ordered data; it was pretty nice to see. Once we have that, we can run different types of linear regressions. As I said, this is where you could go back to an introductory statistics course: we ran an ANOVA model and then some linear regressions. We call one of them the enrichment model, where we look at one layer at a time against everything else, so we group the rest of the layers into a single large box plot. Or we can look at pairwise differences, because sometimes those differences can be useful for understanding the role of a particular gene. There are a lot of different genes you can find that way that explain different changes in expression. We focused on the enrichment ones, which are the easiest to interpret, because at that point you're saying, hey, we want a gene that is highly expressed in one layer and lowly expressed in the rest, for example, MOBP here. We found a bunch of different genes for the different layers, and some of them match what was already known from other studies. But a lot of those earlier studies were done in mouse.
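In spirit, the pseudo-bulking plus PCA step can be sketched like this. This is a minimal Python/numpy sketch, not our actual R/Bioconductor code, and the matrix sizes are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy counts matrix: genes x spots (sparse in practice, dense here for brevity).
n_genes, n_spots = 200, 400
counts = rng.poisson(0.5, size=(n_genes, n_spots))

# Each spot was manually assigned to one of 7 groups (layers 1-6 + white matter).
layers = rng.integers(0, 7, size=n_spots)

# Pseudo-bulking: collapse ~400 spot columns into one column per layer
# by summing counts, going from genes x spots down to genes x 7.
pseudobulk = np.stack(
    [counts[:, layers == k].sum(axis=1) for k in range(7)], axis=1
)

# Log-transform, then PCA on the 7 pseudo-bulk profiles via SVD.
logp = np.log1p(pseudobulk)
centered = (logp - logp.mean(axis=1, keepdims=True)).T  # samples x genes
u, s, vt = np.linalg.svd(centered, full_matrices=False)
pcs = u * s  # principal component scores, one row per layer

# PC1 explains the largest share of variance by construction.
var_explained = s**2 / (s**2).sum()
print(var_explained.round(2))
```

From here, the enrichment model is a regression of each gene's expression on an indicator for "this layer versus all the others".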
And when you're looking across organisms, there are some differences between them, right? This project involved a lot of new infrastructure, and that's why we developed the package called spatialLIBD, which is available on Bioconductor. I don't know if you've heard about Bioconductor, but it's a repository, like CRAN, for sharing packages. One key distinction is that the packages have to be related to computational biology, and they also have to have vignettes, so they have to have quite complete documentation. spatialLIBD has functions for visualizing this type of data, the gene expression data coupled with the images, as well as the results of the statistical models. But most notably, it can build a shiny app where you can interact with and explore the data. It was through that exploration, actually using the app, that my collaborator Kristen Maynard was able to annotate the different spots. It's now published as its own separate paper. All of this was pretty neat, but you can always go one level lower in terms of infrastructure. Here we needed an R package that could help us keep all the data together, right, to make it easy to build these visualization functions and interactive explorations. That's why we collaborated with Dario, Lukas, and Helena to build another package called SpatialExperiment, which is the one that stores the images, the spatial coordinates, and other properties that make the rest of the work smoother. Both of those papers were actually published within a few days of each other earlier this year, so that was pretty nice to see.
The reason we also invested time and energy in SpatialExperiment is that we wanted a common infrastructure package to make it easy for users, so they don't have to convert the data across the different containers that every author comes up with, right? Having a standard infrastructure can be quite useful for developers, but also for users, in the future. But not everything was easily done in R, and so Madhavi Tippani developed a set of MATLAB scripts called VistoSeg. While the code is public, MATLAB itself is not open source: you have to pay for it, unless you're an academic and can get a free account. So it's a little bit in that gray area; it's not as open source as R and Bioconductor, but you don't have to pay to use it from our side, though you might have to pay MATLAB. We did this because with MATLAB it was easier to do some operations on the images themselves, like splitting them and extracting information from them, segmenting them and obtaining the number of cells that we have in each of these spots, or circles. Yeah, I see a question already from Tyrone Lee: is this only for Visium data? Yes, it's not for osmFISH. So after this, we basically provided a framework in our paper for how to compare clustering results against this manual annotation. We needed this because if you run some unsupervised clustering, you get many different shapes of clusters, and you wonder, does this make biological sense, yes or no? That's why starting with a brain region where we knew what to expect was very useful for us. And this is how we compared different methods: the y-axis here is called the adjusted Rand index, and the higher it is, the better. But none of those values are very high; most of them are below 0.4. Still, that's what we could do at that point.
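For readers who haven't seen it, the adjusted Rand index compares two partitions of the same spots, correcting for chance agreement. Here is a small self-contained sketch of the computation, written in Python rather than the R tools we actually used:

```python
import numpy as np
from math import comb

def adjusted_rand_index(labels_a, labels_b):
    """Adjusted Rand index between two labelings of the same spots.

    1.0 means identical partitions; values near 0 mean agreement
    is no better than chance.
    """
    labels_a = np.asarray(labels_a)
    labels_b = np.asarray(labels_b)
    # Contingency table: entry (i, j) counts spots with label i in A and j in B.
    _, inv_a = np.unique(labels_a, return_inverse=True)
    _, inv_b = np.unique(labels_b, return_inverse=True)
    table = np.zeros((inv_a.max() + 1, inv_b.max() + 1), dtype=int)
    np.add.at(table, (inv_a, inv_b), 1)

    sum_ij = sum(comb(int(n), 2) for n in table.ravel())
    sum_a = sum(comb(int(n), 2) for n in table.sum(axis=1))
    sum_b = sum(comb(int(n), 2) for n in table.sum(axis=0))
    total = comb(len(labels_a), 2)

    expected = sum_a * sum_b / total
    max_index = (sum_a + sum_b) / 2
    return (sum_ij - expected) / (max_index - expected)

manual = ["L1", "L1", "L2", "L2", "WM", "WM"]
clusters = [0, 0, 1, 1, 2, 2]  # same partition, different names
print(adjusted_rand_index(manual, clusters))  # 1.0
```

Because the index only cares about the partition, a clustering can score perfectly even though its cluster IDs are arbitrary, which is exactly what you want when comparing against manual layer annotations.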
But because we shared the data early, other people were able to develop better spatial clustering methods. One of them is called BayesSpace, published in 2021. You can see here on the left the manual annotation that we have, and on the right side the BayesSpace results. Even though it's not a perfect match, it looks much closer in shape than, for example, the results from the walktrap algorithm. They compared a few different methods, and their BayesSpace method in general has a median a little above 0.4 in the adjusted Rand index. So that was better than previous results, though there's still room for improvement. Here I want to highlight why, in research in general but in my particular field especially, I would encourage people to share their data early. When we preprinted our study in February of 2020, that's when we made the data available. The BayesSpace preprint came out in September of 2020. If they had waited for the full publication, ours was published in February 2021 and BayesSpace in June 2021. If you look at all those dates and compare against a sequential, fictional timeline, it would have taken around 620 days; the reality was 461, so there's a difference of 159. And in terms of actual access to the software, the difference between the preprints was 190 days. This was possible because we shared the data so fast, right? Not everyone who posts a preprint shares their data. Sometimes there are concerns about patient privacy and things like that, and those are very valid concerns. But if you're able to share the data, I would encourage you to actually do it, because you can accelerate science quite a bit. Now, we provided a framework for comparing and developing new spatial clustering algorithms, which is great for us, because now we can use them on new data sets.
At the same time, there are some caveats with our ground truth, so you should consider it a guideline; it shouldn't be the final goalpost. I see a message, am I lagging? I don't know who else can respond. Right, John says it's good. Okay. And I think the message from Beth is for Michael, not for me, so I'll continue. Ultimately, the ground truth, that goalpost, will move as we learn more. We probably want to get closer to it, but we don't necessarily want to exactly recapture it, right? Because at that point, you could be overfitting your model. Our paper is doing quite well in a lot of metrics and things like that, and I think one reason it's doing well is that the data we're providing is way more challenging than the example data that 10x Genomics provides, which is based on the mouse. The mouse brain is very small: you can fit a full mouse brain, or half of it, on a 6.5 millimeter square. At that point, you're really looking across different brain regions instead of at these finer differences. So if you're developing methods, you want access to data sets that present some challenges; otherwise, just running k-means can give you the result that you want. This is another, more recent paper that also shows different results, and how methods that were not designed for this provide results where, if you didn't know what to expect, you would look at them and wonder, is that biology? I don't really know, right? And because we're working with all these brand new methods, you actually have to be careful and keep track of all the software versions of the tools that you're using, even if they're not from R, even if some of them are from Python, etc.
Because you might have to work with the latest versions of R and the latest versions of Bioconductor, things will change a lot; even a small version change or a single Git commit can drastically affect the results you're obtaining. You also have to interact a lot with software authors. We do this a lot on GitHub, asking for clarification of documentation and providing reproducible examples; maybe it's a bug, maybe we didn't understand the inputs to something. And just as we do that for others, we also hope that when people ask us questions, they provide reproducible examples. Because of all that, documentation becomes really important for making things easier for users, and having wrapper functions can help too, so we've introduced some of them in spatialLIBD. But then also testing your software: we test ours with GitHub Actions, and Bioconductor tests the software on Linux, macOS, and Windows every day, every 24 hours. All of this effort behind the scenes sometimes goes unnoticed, but it helps users trust your software and use it effectively. So, going back to disease a bit: there are a bunch of different types of studies you can do where the end result is basically a list of genes related to a particular disease or disorder. One of them is autism spectrum disorder. There's this paper from 2013, the SFARI one, where they have a list of genes that are enriched in autism spectrum disorder, and now that we have the spatial transcriptomics data, we could localize them to layer two and layer five. So that's pretty nice, particularly because a newer study from 2020, with 102 genes, basically replicated those results. But something else that we found is that this study from 2020 can break up the 102 genes into two sets, of 53 and 49 genes.
And once you do that, you actually find that one of them is related to layer two and the other is related to layer five. So this is how some of this technology can help us spatially localize the effects of some genes related to some disorders. That's something that we're excited about doing more of in the future. But gene expression itself can't do it all. One pilot study that we're doing on Alzheimer's disease is based on the fact that in Alzheimer's disease you can visually see neurofibrillary tangles and amyloid plaques; these are marks of the pathology. With a new technology called Visium immunofluorescence, or Visium IF, we can basically see on an image where the amyloid plaques and the pTau tangles are. That leads to this cartoon: the plaques are these green balls, and the tangles are in purple here. If we see them on a particular Visium spot, we can now say, oh, this spot over here is a spot without pathology; this one has both types of pathology; this other one over here just has the green, so the Aβ pathology. And that matters because the gene expression differences might be quite subtle, so you don't notice them with spatial clustering, but with the image you can find them. This is a project that Sang Ho Kwon is leading, where he generated some pilot data, and you can see here the Aβ signal and the pTau signal. Overall, we also see that the data looks good in terms of MOBP for white matter and SNAP25 for gray matter. But everything that is new involves new challenges, and here we have the challenge that the images for the immunofluorescence are quite big. They are multi-channel images, broken up into tiles, which I'm trying to illustrate with these squares here. And each channel has different features: in some of them the signal might be more regular, like these triangles; in some of them the signal might be more irregular.
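One of the per-spot summaries we need from those images, how much of a spot's circular area is covered by a given fluorescence signal, boils down to a small masking computation. A toy Python sketch of the idea (not the actual VistoSeg MATLAB code):

```python
import numpy as np

def spot_coverage(mask, center_row, center_col, radius_px):
    """Fraction of a circular spot covered by signal in a binary mask.

    `mask` is a 2D boolean image (True = pathology signal detected);
    the spot is the circle of `radius_px` pixels around the given center.
    """
    rows, cols = np.ogrid[: mask.shape[0], : mask.shape[1]]
    in_spot = (rows - center_row) ** 2 + (cols - center_col) ** 2 <= radius_px**2
    return mask[in_spot].mean()

# Toy image: signal fills the left half of a 100x100 tile.
mask = np.zeros((100, 100), dtype=bool)
mask[:, :50] = True

# A spot centered on the boundary is close to half covered.
print(spot_coverage(mask, 50, 50, 10))
```

The same circle-indexing trick gives the other per-spot features, such as counts of segmented objects whose centroids fall inside the spot.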
So there are some challenges there before you can get to a matrix like this, where you have the spots on the rows and, alongside the gene measurements, things like whether the spot overlaps the tissue and the number of cells. But then you also want to know the number of triangles or clouds, the percent of the spot that was covered by triangle signal or cloud signal, things like that. Madhavi Tippani updated VistoSeg for this type of data, and we were like, okay, this can work; it took quite a bit of effort. Now that we have that signal, we can label the spots into seven different pathology categories: whether they're pathology-free, have one, the other, or both, and whether they neighbor spots that have both, one, or the other. This is how a sample like that looks. Now that we have labeled each of the spots, we can go back to the gene expression data and try to find differences across those categories, which is what we do here. For example, RPN2 is a gene that has been linked to Alzheimer's disease by GWAS-type studies, so that was interesting to see here. This is work in preparation that we're trying to finish sometime soon. So, to recap: when you're working with Visium, or spatial transcriptomics in general, it can be very powerful. In general, 10x Genomics makes it open source friendly, though maybe the current settings can be a bit too restrictive. Still, there are opportunities for creativity: maybe you have brain regions that are small enough that you can fit two of them in a particular square, or maybe you need to use multiple of these squares arranged in a particular pattern to observe a larger brain region. Working with this type of data has required the development of a lot of software. We mostly like to use R and Bioconductor, and for myself, it's particularly fun to work on a project where there are no answers on Google, even though the project can be challenging.
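That seven-category labeling can be sketched as a simple rule applied to each spot and its neighbors. A Python sketch; the category names here are illustrative placeholders, not the published labels:

```python
def label_spots(has_abeta, has_ptau, neighbors):
    """Assign each spot one of seven pathology categories.

    `has_abeta` / `has_ptau`: boolean lists, one entry per spot.
    `neighbors`: list of index lists, the adjacent spots of each spot
    in the honeycomb grid.
    """
    labels = []
    for i in range(len(has_abeta)):
        if has_abeta[i] and has_ptau[i]:
            labels.append("both")
        elif has_abeta[i]:
            labels.append("Abeta")
        elif has_ptau[i]:
            labels.append("pTau")
        else:
            # Pathology-free spot: check what its neighbors carry.
            near_a = any(has_abeta[j] for j in neighbors[i])
            near_p = any(has_ptau[j] for j in neighbors[i])
            if near_a and near_p:
                labels.append("next_both")
            elif near_a:
                labels.append("next_Abeta")
            elif near_p:
                labels.append("next_pTau")
            else:
                labels.append("none")
    return labels

# Four spots in a row: both pathologies, Abeta only, clean, clean.
has_abeta = [True, True, False, False]
has_ptau = [True, False, False, False]
neighbors = [[1], [0, 2], [1, 3], [2]]
print(label_spots(has_abeta, has_ptau, neighbors))
# ['both', 'Abeta', 'next_Abeta', 'none']
```

With spots grouped this way, the downstream differential expression question becomes an ordinary comparison across categorical labels, just as with the layer annotations earlier.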
We have a few future directions involving different technologies. We want to finish that proof of concept that I told you about; we're going to integrate different types of data, maybe work with this new technology called Visium HD, which is going to be a larger square; and we need to better integrate the data from the images. As we're spearheading this effort, we're trying to leave a trail behind us so that other people can join us and follow us. So we're also trying to create educational resources; for example, we're writing this book, using bookdown actually, summarizing how we analyze some of this data. This takes the effort of a lot of people. I want to thank my collaborators from the Lieber Institute, in particular Keri Martinowich and Kristen Maynard, and from Hopkins Biostatistics, Stephanie Hicks. And we're hiring, actually, so if you're interested in learning how to analyze more of these types of data, you can find here some information from an anonymous team survey that I adapted from someone at the Howard Hughes Medical Institute; there are some things that are good, some things that are bad. I'll leave you with this: if you're interested in joining us on the data generation side or the analysis side, we have different opportunities. I think that's it for me, and I'll just post some of the links in the chat that I had prepared; my message was too long, so I couldn't quite paste it all in. Thanks, Leonardo. All right. So we filled out some of the time. We do have a break now until about four. If people want to chat with Leonardo or any of the other speakers, or have questions, you're more than welcome. Otherwise, we'll take a break now. Thanks. All right. Thank you very much.