Welcome to this single-cell analysis tutorial. My name is Wendi Bacon. I am a lecturer at the Open University, as well as a visiting researcher at EMBL-EBI. I'm going to do this tutorial on the Human Cell Atlas Galaxy instance. It should work on most Galaxy instances that I know of, and if not, try usegalaxy.eu or usegalaxy.org. I'm only using the Human Cell Atlas one because that's the first one I ever used, so it's my favorite. This tutorial is not going to explain a lot of the science. It's just a resource for you to follow along and get your parameters right. If anything is going wrong, or you're not sure where to click or what to do, hopefully watching this will help you. For any of the scientific content, please read the tutorial itself. That is going to be your home for the information.
Okay. So we start with getting data. You can either use the input history and then import it, or, as I'm going to do, import the AnnData object that was made in the previous tutorial. This can sometimes take a while. So I'm going to copy the link to that input history, and I'm just going to import that. I'm going to move the history and then make sure: is it H5AD? Yes, it is.
As part of the first question, you might be inspecting the AnnData object. And again, if you're using the tutorial version, you can just click on the tool. So I can just click on that and then I'm sorted. I want to do that a few more times. I want to see the obs. Okay, and then we can look at all sorts of our lovely information.
Okay, and then we're going to plot, copying the keys over. Do the same thing again. And then finally for batch, I'm going to make some scatter plots. I'm just copying and pasting from the tutorial. And I mean, you can pretty much use anything in your obs information. I just think this came out the cleanest in terms of being able to visualize this dataset, because it's quite a messy dataset.
So it helps clarify where you might want to put your filters, et cetera. Okay, and then we're naming our plots, which is for easy access because you end up with quite a lot of plots on here. All right, and now we have all the plots in the world.
You can also, if you want, enable Scratchbook and then look at the plots. If I resize that, I can look at the plots side by side. So if I click off this and then open that one as well, now I can look at the two plots at the same time, and that's quite handy when you're using Scratchbook. So that's a handy little utility within Galaxy. Okay, and then you're going to be given lots of different questions to analyze what's going on in each of the plots.
And then it's time to apply the filters. It's very easy to accidentally hit the Seurat one here, by the way, when you're searching, so make sure you don't do that. And I'm gonna do the standard. There are a few parameters that automatically come up. We're not using those because it's less clear, because the data's quite messy. I wanna do that. Oh, I see, I'm gonna plot next. And we're looking at genotype just because that's my important variable. To be fair, you want all the variables to be as similar as possible in terms of things like sequencing depth and output. So that's an important thing to keep in mind when you're setting up.
And then I'm gonna inspect. And this is quite cool because if you just expand the dataset, you can see right here, without even having to do an inspect step, the cells and the genes that are left. And then you can compare that with the original, right? Which had a lot more cells with the same number of genes. So you can look at this and see each of your different parameters and what's happened. You can compare it with the previous one and say, oh, look at these different batches. How cool is that? I think actually the best way to set this up right now is to say we want the violin by genotype.
So I'm gonna enable Scratchbook and open the violin by genotype. All right. And now I also want this violin as well, so I can see what's changed. So you can look and say, that hasn't really done much, I think. Let's look over here. And now we're seeing a much more significant change, because that's what we were filtering by. We've sort of cut off the bottom, for better or for worse.
All right. And now we're going to filter by total counts as opposed to gene counts. We do this step by the log1p of total counts rather than the log1p of gene counts. We're gonna go with 6.3. It'll always be higher than your genes cutoff because you are expecting to have multiple copies of a transcript. Well, not copies in the sense of PCR amplicons, but copies that the cell has made. A cell should have more than one transcript for a given expressed gene. So you would want to put your cutoff higher. And then we repeat. I'm so zoomed in that it's a lot more effort to do these sort of repeat steps.
All right. And we're there again. So I will rename everything. I know it's a little bit of a faff to rename everything all the time. This is gonna be a long tutorial, and it can be so confusing when you're using the same tools over and over again, which we are. A lot of the single-cell tools have a lot of functions and abilities packed into them, and it can be so confusing when you're looking back. So the more you can get in the habit of managing your data well, the better your life will be. So do it.
Okay. And then I'll add this filter by counts in, and now, using my Scratchbook, I can actually see each phase of filtering and what happens. This is not a great look, if I'm honest. To be frank, this is a better look because you can at least see the bottom of the violin plot, which is a lot better. But, you know, I'm putting in quite low cutoffs. So I'm being very liberal with my cutoffs.
That's how I would put it: I want to keep as much as I can, because I know the data is messy and there was a lot of background.
Okay. And now that we've filtered by genes per cell and by counts per cell, or, you know, the log versions of them, we're gonna do the same thing with percent counts mito. We know that a high level of mitochondrial RNA, et cetera, is a sign of stress, and we don't wanna keep our stressed-out, angry, half-dead cells. So we're gonna filter out our, well, this is easier, just do that. 5% is pretty standard. Always, you know, read the text; it depends on the sample that you're working with. But we're gonna go with 4.5 today, just to be a bit contrary really. I always think the ultimate test of your data is how well it survives weird analyses. And as always, relabeling, and then we can again look at all of our plots: one, two, three.
Okay. And the last step we're gonna do is filtering genes. Now, the first time I ran this protocol, I got a bit lackadaisical and thought, ooh, it probably doesn't matter what order I do these steps in. Filtering's filtering, right? No big deal. So I went rogue. I went rogue and I did the filter genes step first, like a fool. And basically this means that later on, once you filter out all these cells, you end up with a bunch of genes that don't have any cells associated with them. So you have these empty columns in your dataset, and later on, much later on, a bunch of the tools will break because they can't handle the fact that some genes have nothing in them. So don't make my mistake and filter genes first. Make sure you do all of your cell filtering and then hit filter genes.
I always love making these videos, because making the videos takes, you know, as long as it takes you guys to get through the tutorial. But then the final video at the end, when you edit out all of the time spent waiting for stuff, is much shorter, and it seems so efficient.
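To make the ordering point concrete, here is a minimal sketch in plain Python. It is not what the Galaxy tools run internally; the toy matrix, the cell and gene names, and the helper functions are all hypothetical, and only the 6.3 cutoff on log1p of total counts comes from the video.

```python
import math

# Toy count matrix: rows are cells, columns are genes (all values hypothetical).
genes = ["geneA", "geneB", "geneC"]
counts = {
    "cell1": [300, 400, 0],
    "cell2": [500, 250, 0],
    "cell3": [1, 0, 2],  # low-quality cell, and the only one expressing geneC
}

MIN_LOG1P_TOTAL = 6.3  # cutoff from the video, applied to log1p(total counts)


def filter_cells(matrix, min_log1p_total):
    """Keep cells whose log1p(total counts) reaches the cutoff."""
    return {
        cell: row
        for cell, row in matrix.items()
        if math.log1p(sum(row)) >= min_log1p_total
    }


def filter_genes(matrix, gene_names, min_cells=1):
    """Keep genes detected (count > 0) in at least min_cells cells."""
    keep = [
        i for i in range(len(gene_names))
        if sum(1 for row in matrix.values() if row[i] > 0) >= min_cells
    ]
    kept_names = [gene_names[i] for i in keep]
    kept_matrix = {cell: [row[i] for i in keep] for cell, row in matrix.items()}
    return kept_matrix, kept_names


def empty_genes(matrix, gene_names):
    """List genes that have no counts in any remaining cell."""
    return [
        name for i, name in enumerate(gene_names)
        if all(row[i] == 0 for row in matrix.values())
    ]


# Wrong order: genes first, then cells. geneC passes, then loses its only cell.
m, g = filter_genes(counts, genes)
m = filter_cells(m, MIN_LOG1P_TOTAL)
wrong_order_empty = empty_genes(m, g)

# Recommended order: all cell filtering first, then genes.
m2 = filter_cells(counts, MIN_LOG1P_TOTAL)
m2, g2 = filter_genes(m2, genes)
right_order_empty = empty_genes(m2, g2)

print("genes first:", wrong_order_empty)
print("cells first:", right_order_empty)
```

With genes filtered first, geneC is left behind as an all-zero column once its only cell is removed; with cells filtered first, it is dropped cleanly. The same logic applies whichever cell filters you use (genes per cell, counts per cell, percent mito).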
So if steps are taking longer than you're seeing in the video, it's because I edit out all of the waiting time.
All right, on to our processing. You know what, I'm just gonna, if I search Scanpy here, usually it'll be a lot easier for me to find all the stuff that I want. Although if you're using the tutorial mode, then you're pretty much good to go. Okay, normalizing some data. Yes, number 20. Do that. And I'm just gonna set up the next one, because it's kind of a sequence of mostly pretty standardized steps. Although there's a whole bunch of parameters here you could change if you wish. Scale data. Okay, two, 10, and then we're gonna run PCA. So this is our first, well, I guess with find variable genes you're also downsampling a bit. But now we're properly going to start reducing the size of our data, reducing the dimensions. I always double check the inputs on this because you end up with quite a big history. So I do recommend that. That's how I want that. And we use 50, so we'll show 50. Okay, and I will rename that.
I don't know if you can hear it, but it's a wonderful sunny day. Some lovely birds singing in the background, mocking me. All right. And then you can decide, ooh, where do I wanna put my cutoff? How exciting.
Okay, on to our compute graph. No space, no space. Right, so this is where we're defining our nearest neighbor graph and trying to put everything on a single graph, as opposed to 1,001 dimensions. Let's just check. I want 15 neighbors, sure. And for the number of PCs, we're gonna use 20, because that's what we've determined from the previous plot, or at least that's what we think. We can also get the rest of our plots calculated, right? So these are different ways of reducing dimensions, giving different places you're gonna be on your X, Y graph. Perplexity, yeah, so perplexity will be 30, but you can change it if you're working in a group. And we also want Run UMAP. These other ones, PAGA and Run FDG, are more about looking at trajectories.
So that's a separate tutorial. The next one, in fact, if you're following along. Now I have a lovely UMAP. We're not gonna get any pretty plots out of this yet because it's just calculating the coordinates, right? And now, with those coordinates, we can try and work out: if things are close to each other on this nearest neighbor graph, how do they cluster?
Let's start calling some clusters. I'll use Louvain. Lots of people are now using Leiden. We're gonna use a resolution because I know this dataset and it's not the cleanest. We don't wanna make too many clusters, because the more detail you have per cell, the more clusters you can trust, is what I would say. But it also depends on your sample. If it's a super homogeneous set, then you also don't wanna be calling a whole bunch of clusters where there are none. So you kind of have to take a few things into account in your cluster calling.
All right, we have our lovely clusters. So let's now do the fun bit, which is to figure out why a cluster is a cluster. Let's look at the genes that make it so. Sometimes you just need to hit that. I don't know if that glitch has been fixed yet, but for whatever reason, you do need to click that sometimes. There are loads of parameters you can change there. So it will sort of automatically go by the Louvain clusters. We might also be interested in whether there are any differences across genotype. And then, you know, the Manipulate AnnData tool could let you filter out specific clusters, so then you could just compare those to each other. There's a lot of fun that you can have with find markers to compare different things.
Okay, and again, this is actually particularly important. You want to make sure that you relabel these things appropriately, because you've got four things that look like find markers. And the other side of this is that it will store the result of your find markers within the object. Oh, I've just done that wrong.
Oops, okay. It will store that information within the object, and we are more interested in, yeah, so here's your marker table, and this is what we get. We're more interested in the cluster comparisons than the genotype one, because the genotype comparison is sort of like glorified bulk RNA-seq: you're just smashing everything together. So I'll just get rid of that one so that I keep my final object. Anyway, so yeah, you want to keep the results of the comparisons between clusters stored within your AnnData object.
Now we're going to do a little jiggery-pokery: final object, and I want variable information. Right, so we want this information because, if you look at it, oh, it has the Ensembl ID, which is the most accurate way of counting transcripts, but the symbols are really what we're going to be talking about when we're trying to understand it from a biological point of view. And right now our lovely cluster and genotype tables only have the Ensembl ID as their label. So we're going to do a bit of jiggery-pokery to make that work. Join two Datasets side by side on a specified field. And there, column four is the one that has the Ensembl ID. Yes, that's what we want. And yes, yes, no, yes. Now at this point, I strongly recommend checking, because sometimes the order will be a little bit bizarre. So you just want to make sure you have the right number.
All right, and then I want to rename these tables. It should have genes and then symbol. Awesome. And I know that the shorter one is going to be the one where you're splodging everything together, that is, when you compare across genotype. Really the most interesting one is the one that's by cluster. For my own peace of mind, I'm just gonna, and now I have a nice history.
And now the best bit: well, it's time to plot. So we're going to see all of our lovely hard work on an object, oh yes. We're going to start with PCA, using our predefined knowledge.
We're going to be plotting by a whole lot of different things, and there are lots of different bits and bobs for changing how your plot looks, essentially. And then I'm going to do the same thing for tSNE and UMAP. Autocorrect, mocking me. Oh, there's buckets and buckets of information you can now get from these images. It would help if I zoomed out a bit. All right, so buckets and buckets of information you can analyze and think about and interpret. And I strongly recommend that you do: the more time you spend trying to get into the mindset of why you might want these plots and what they might be telling you, the more easily you'll be able to direct your questions and direct your analysis. Yeah.
But we're going to move on to the annotating clusters step. We're going to rename our categories. We've cunningly been able to figure out exactly what each cluster is by looking at their marker tables and our known marker genes. And so rather than have them be named cluster zero and so on, we're going to give them their actual names, their cell type names. And then we don't necessarily want to delete the original labels, especially because if you look at the marker table, it will say cluster zero, cluster one. So it would be nice to keep them. Although I suppose you could just run the marker table again using the new categories, and that would work too. We'll just add them back in.
So now we're copying that cluster annotation back into your original object. And that means we essentially have louvain and louvain_0. So that's not ideal. And now it's called louvain_0. We don't really want that. Actually, I'm going to just get a fresh one so I don't accidentally repeat the same thing again. So we don't want it to be called louvain. We want it to be called cell type. So this second louvain category is getting changed. We have louvain and cell type. So that's a lot nicer. So we're going to rename this our final cell annotated object.
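The annotation step described above, keeping the numbered clusters and adding a separate cell type column, can be sketched in plain Python. This is only an illustration of the idea, not what the Galaxy tool does internally; the cell IDs, the `cell_type` column name, and the cell type names in the mapping are all hypothetical placeholders, so use the names from the tutorial itself.

```python
# Per-cell annotations, as they might appear in the obs table (hypothetical rows).
obs = [
    {"cell": "cell1", "louvain": "0"},
    {"cell": "cell2", "louvain": "1"},
    {"cell": "cell3", "louvain": "0"},
    {"cell": "cell4", "louvain": "2"},
]

# Hypothetical mapping from cluster number to cell type name.
cluster_to_cell_type = {
    "0": "cell type A",
    "1": "cell type B",
}


def annotate(rows, mapping, source="louvain", target="cell_type"):
    """Return copies of rows with a new target column; the source column is kept."""
    annotated = []
    for row in rows:
        new_row = dict(row)  # copy, so the original rows are untouched
        # Clusters without a name in the mapping fall back to their number.
        new_row[target] = mapping.get(row[source], row[source])
        annotated.append(new_row)
    return annotated


annotated = annotate(obs, cluster_to_cell_type)
for row in annotated:
    print(row["cell"], row["louvain"], row["cell_type"])
```

Keeping both columns means the marker tables, which still refer to cluster numbers, can be matched against the named cell types later.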
So if we now want to plot that so that we get our lovely labels, we can rerun one of these and just run it on the final object, adding in cell type. Or you can switch louvain to cell type. It'll color it the same way. It'll just label it differently. And now if we look at the plot, it's labeled by cell type rather than number. And that's cool. And there's whole heaps of information again across all of these different plots and what you can interpret. So please do take some time and do that.
The last bit is looking at some interactive visualization. So if we go over here to our UCSC Cell Browser, we'll choose the format. It's Scanpy, that's what we're using. Yes, final annotated object. Sure, sure. Yeah, louvain is fine. We can execute that. And we're up and running. So now I can hit view data. I've zoomed far too far in. There we go. Okay, and then this is brilliant. It's something you're gonna wanna spend some time playing around with. You can look at all sorts of different visualizations. You can color by different things. Yeah. It's a lovely thing to mess around with and to be able to interrogate your data. So spend a bit of time mucking around and seeing all the wonderful things you can do in this. It's also nice because you can just share your history, and someone can immediately click on this and start playing around in it just the same way you were, which is awesome.
Yeah, we're gonna just do that. All right, and it's finally changed to "I'm ready". So I'll click here to display. And this is going to take you into, it always makes me do this. That's fine. The interactive viewers, for whatever reason, always trip my security stuff. Now, I do find that sometimes this happens to you and it says proxy target missing. I don't know why it glitches this way. I promise it's worth the pain, okay? Just run it again and then leave it for a few minutes before trying to look at it.
As I said, if this plays you up or if it says proxy target missing or something, just exit, leave it for a minute, then come back and run it again, and then leave it for like five minutes. And then usually you'll get this, which is pretty great.
So this is a whole world to explore, if I'm honest. There's all sorts of cool things you can do with it. Like, okay, I want to color it by batch. I want to color it by cell type or genotype, right? Ooh, this is interesting when you color it by genotype. And now I've created a little population and I can click on it. There's all sorts of wicked stuff you can do to explore. And this is just nice because it means you're exploring your data without having to recreate plots left, right and center. Then you can pick which plots show what you're looking at and what you want to investigate. It's just a really nice exploration tool when you're trying to interpret your data.
All right, and then just make sure to come here and hit stop when you're done looking through that tool. And that brings us to the end of this tutorial. So I hope you had a great time.