Hello, this is Wendy Bacon from the EBI, and we're going to go through another tutorial. Oh, I've already found it, I cheated. Today we're going to do a transcriptomics tutorial: Clustering 3K PBMCs with Scanpy, or "scan-pie", depending on who you talk to. Again, go through this on your own; this video is entirely for if you get stuck or need a bit of help on any given step.
Right, so going through, we've got to get our data: genes, barcodes, and matrix. I have an input history you can have, or you can upload the data. Uploading the data. Okay, so we've uploaded, and now we're going to relabel these so that it makes a bit more sense. So we've got our matrix, we've got our barcodes, we've got our genes.
Cool, so now we're looking at the matrix file. If I click it, here's a lovely matrix file. So the question is how many non-zero values are in this matrix. Well, we know that it has this many genes, this many cells, and this many non-zero values, because this is your count. So if you want to know how many values are in the counts, you have this many. If you want to, you can multiply genes by cell barcodes to get the total possible counts, and so actually very few of them have an actual count associated with them. And then, oh, I've cheated. How many counts are in this gene? Whichever one that guy is. Okay, so that's associated with cell one, and there are 10 counts associated with it. Cool, all right. This is such a good image for wrapping your head around how this data started.
So we're doing this Import AnnData and loom step. However, as of specifically the 11th of January, 2021, there's a small glitch in this tool update. Hopefully it's fixed by the time any of you use it, but just in case, I'm going to be using the slightly older version of the tool.
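To make the matrix arithmetic above concrete, here is a minimal Python sketch. It assumes the usual Matrix Market coordinate layout: comment lines starting with `%`, then a size line giving rows, columns and stored entries, then one `row column value` triplet per stored count. The toy matrix and the function name `read_mtx_summary` are made up for illustration and are not the tutorial data.

```python
def read_mtx_summary(text):
    """Summarize a Matrix Market coordinate file passed in as a string."""
    # Drop comment lines (they start with '%') and blank lines.
    lines = [ln for ln in text.splitlines() if ln.strip() and not ln.startswith("%")]
    # The first remaining line is the size line: genes, cells, stored entries.
    n_genes, n_cells, n_nonzero = (int(x) for x in lines[0].split())
    # Every following line is one stored (non-zero) count: gene, cell, value.
    entries = [tuple(int(x) for x in ln.split()) for ln in lines[1:]]
    total_possible = n_genes * n_cells
    return {
        "genes": n_genes,
        "cells": n_cells,
        "nonzero": n_nonzero,
        "total_possible": total_possible,
        "fraction_nonzero": n_nonzero / total_possible,
        "entries": entries,
    }


# Hypothetical toy matrix: 4 genes x 3 cells with 3 stored counts.
toy = """%%MatrixMarket matrix coordinate integer general
%
4 3 3
1 1 10
3 2 1
4 3 2
"""

summary = read_mtx_summary(toy)
print(summary["total_possible"])    # genes x cells = 4 * 3
print(summary["fraction_nonzero"])  # stored entries / total possible
```

In the toy example only 3 of the 12 possible gene-by-cell combinations carry a count; the real matrix is sparse in the same way, just on a much larger scale.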
Okay, so I'm sure that this tool will be working by the time any of you see this. However, I'm going to use the slightly older one. So we come to Matrix Market. We've got our matrix, output from version two or earlier. We have our genes. We have our barcodes. Gene symbols, and make the variable index unique. Let's rock and roll. Lovely. So now we have our object, and if we click here, we can see that it's h5ad. Good. And then we can rename it.
All right. Let's rock on with some analysis, shall we? So first we're going to inspect it. Now, there are some pretty nifty things happening: we're trying to make it so you can kind of skip that and just see the information down here. But we're going to do the full shebang. So go to Inspect AnnData, general information to start with. We're going to do a lot of inspecting in this tutorial, because you can't just click on something. It's not just a simple matrix from here on in, so this is your way of getting information. Right now it's very little. We've got 2,700 cells by 32,000-plus genes. So yes, that makes sense with what we got from the matrix. From now on these are always going to be called observations and variables, so the 2,700 and the 32,000.
So now we're going to inspect the full data matrix. I'll just rerun the tool, and this time I want the full data matrix. Right. So that's finally finished, and we can view it, painstakingly. Along each of these rows is a different cell barcode, and along the top we have the different genes. So each number is the count for that cell barcode for that given gene.
Looking into the key-indexed observations and variables is super helpful as you're going through your analysis. All right. So now we can look, and yes, we see the list of all the different cell barcodes. And now we're going to look at the variables. Cool. Now we can see what we know about our genes, and for this one we have two columns.
We have both the Ensembl ID and, I mean, how I would refer to a gene, so the gene symbol.
So now we're finally going to do some analysis. Before I do that, for my own peace of mind: if I come over here to operations on multiple datasets, I've got all these inspects that are kind of just in my way, so I'm just going to hide those. Cool. And then I click this off. So now I have my nice clean session. In fact, you know what? I'm going to go big. I'm in.
Let's start some pre-processing. For this we need Filter with scanpy. Okay. We're going for filter genes, based on, okay, yes, minimum number of cells expressed. It's very important you hit this. I have repeatedly done this tutorial and forgotten to make sure it says cells expressed, because minimum number of counts and minimum number of cells look very similar. So don't make my mistakes. And we're done, in record time. Rename it 3K PBMC. What we've removed are random genes that only appear once or twice and may very well be alignment issues. And then we're going to look at it. Cool. So now we have gene IDs, numbers of cells, loads. This is very important information about why it's so important to filter stringently: with very big data, lots of zeros and ones can really hose it.
All right. Oh, this is important, right? The next few steps are a bit of a roundabout way of adding in what I would call metadata, so a different way of labeling what you have, okay? All of this is so that we can label which genes are mitochondrial and which are not. So if we go and inspect it now, Inspect AnnData, and we want information about our genes. Orange circle of doom. All right. So, yeah, we have our gene annotation. And now, basically, when you filter these things, it will sometimes add in metadata information.
It shows the number of cells that a given gene appears in. All right. So we're going to switch that. That's our gene annotation.
Now comes the tricky bit. We're going to do some text reformatting, and we're going to use the AWK program, okay? Because at this point this isn't any fancy single-cell thing. This is literally just a table. So, text reformatting. Sure, we want to do it to that table. We want it to catch anything that starts with MT, because then we think it's mitochondrial. There are some genes that don't necessarily start with MT depending on the species, but this is a quick and dirty way of pulling out those genes. And we're there. Okay. So now it should have... well, it probably won't show me anything, but it should be... oh, yes. It's checking whether any of these start with MT, and it gives a one if it does and a zero if it doesn't. Cool.
Inspect the generated file. How many genes were found as mitochondrial? So if we go to Filter... there we go. Okay. So we want column two to equal one. And yes, skip the first header line. Sure. All right, let's see what that gives us. What do we got? So there's one, two, three, four, five, six, seven, eight, nine, 10, 11, 12, 13 here that all start with MT. All right. And indeed that is what was found here. Cool.
Now we don't want it to be zero or one. We want it to be true or false. So, replace text in a specific column. Okay. Make sure you've got it on the AWK output, not on our filter; I've screwed that up before as well. So in column two, zero needs to be false and one needs to be true: it is mitochondrial. Okay. And one important thing, if you're trying to figure out how to do this for maybe another bit of metadata: you don't want any of the tools to sort what you have going here. It needs to always stay in the same order, and sometimes random tools will just automatically sort the data that you're working with.
So just watch out for that if you're applying this to some other piece of metadata. All right. We're done there. So now it should have taken all of the zeros and ones and made lots of falses, because most of these are not mitochondrial.
Okay. So there are two versions of this next tool. Make sure you click the tail one: keep everything from line two onward. All right, and we're good. So this should have gotten rid of the header. Fantastic. Now, really, we already have this information as part of the metadata of our object, right? That's what we inspected. What we really want is specifically this column, in the exact right order, to be added to our metadata, so that it will tell us, for each gene, true or false. So we cut that column: I want column two from dataset 18. Yes. All right. Yeah, it's quite a roundabout, a bit of a faffy way of getting there, but we'll get there. And we're there. So we have specifically our column.
Okay. But we now need to give it a title. So I'm going to do a paste upload, and I'm going to put in just the text mito. Okay. Sometimes it messes up the auto-detect, so I want it to be tabular. And it's mito because mito is picked up by a lot of the tools. So you don't want to, you know, go rogue here and add some random title to your mitochondrial annotation. Weirdly, that upload often takes a while. And we're in. As you can see, I've created a word.
Okay. So now we want to put those two things together. So if we come over here to concatenate datasets, which is fancy for copy and paste. All right. And not the one with "cat". Okay. We want this dataset to go on top of this dataset. And we're there. Hopefully it'll say mito and then a whole bunch of falses. We have our mitochondrial annotation. So it would be really super great if our mitochondrial annotation showed up in the object with this information as well. Okay.
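As a compact view of what the AWK, replace, tail, cut and concatenate steps add up to, here is a small Python sketch. It is not what the Galaxy tools run internally. The `MT-` prefix (with the dash) is an assumption about human mitochondrial gene symbols; the walkthrough only says "starts with MT". The function name `mito_column` and the short gene list are hypothetical.

```python
def mito_column(gene_names, prefix="MT-"):
    """Build a one-column annotation: the header 'mito', then True/False per gene."""
    # Order is preserved on purpose: row i of the output describes gene i of
    # the input, which is why nothing in this chain is allowed to sort.
    flags = [name.startswith(prefix) for name in gene_names]
    return ["mito"] + [str(flag) for flag in flags]


# Hypothetical gene list, in the same order as the gene annotation table.
genes = ["MT-ND1", "CST3", "MT-CO1", "NKG7", "MTOR"]
column = mito_column(genes)
print("\n".join(column))

# Count how many genes were flagged as mitochondrial.
n_mito = column.count("True")
print(n_mito)
```

The `MTOR` entry shows why a bare "starts with MT" test is only quick and dirty: without the dash in the prefix, that nuclear gene would be flagged as well.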
So now we're going to add that annotation back into our object. So I go to Manipulate AnnData, and we're again having a little bit of a version issue here, so for the moment go to 0.6.22, right? And then yes, we want 3K PBMC, and we want to add this new annotation, which is the mitochondrial annotation. We want to add it to the genes information, in the variables section. All right. And then we'll execute that. Okay. Hopefully that will have added our mito information in. So we've got that. Okay.
Now we can inspect it like we did before. And one glorious day, it will just appear down there in the window. All right. We want information about the variables, a.k.a. genes. And now we can check and see if everything's worked. And yes, we now have a fourth column with the mitochondrial information. Awesome. So I'm again going to do my little trick: we just spent a whole bunch of time trying to get some mitochondrial labeling into our AnnData object, so let's just hide all of that. All of that lovely, lovely effort. All right. Hide it. Cool. So we've gone from our input, we've filtered out sparse genes, and we have our mitochondrial information.
So now we're going to calculate some QC metrics. We go to Inspect and manipulate, and we want to calculate this. And yes, we have some mitochondrial information. So now we have, hopefully, some new annotations. Yes. You can see here all sorts of different pieces of information that have been added into our object. So we can inspect it again. First, let's rename it QC metrics. And we can inspect it. All right. And now we can see all sorts of information that's been added in: cells by counts, mean counts, log mean counts, dropout by counts, all the counts. Yes. Oh, this is really good. It's nice that it has an explanation of every single one in there. So let's start using this to process our data. All right.
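For a feel of what the per-cell QC metrics mean, here is a minimal Python sketch computed on a made-up count table. The key names mirror the ones used later in the walkthrough (genes by counts, total counts, percent counts mito), but the function `qc_per_cell` and the data are hypothetical, and the real tool reports more columns than this.

```python
def qc_per_cell(counts, mito_flags):
    """counts: one row per cell, each row a list of per-gene counts.
    mito_flags: one True/False per gene, in the same gene order."""
    metrics = []
    for row in counts:
        total = sum(row)                              # all counts in the cell
        n_genes = sum(1 for c in row if c > 0)        # genes with at least one count
        mito = sum(c for c, is_mito in zip(row, mito_flags) if is_mito)
        pct_mito = 100.0 * mito / total if total else 0.0
        metrics.append({
            "n_genes_by_counts": n_genes,
            "total_counts": total,
            "pct_counts_mito": pct_mito,
        })
    return metrics


# Hypothetical toy data: 2 cells x 4 genes, the first gene is mitochondrial.
mito_flags = [True, False, False, False]
counts = [
    [1, 0, 5, 4],   # 10 counts, 3 genes detected, 1 mitochondrial count
    [5, 5, 0, 0],   # 10 counts, 2 genes detected, 5 mitochondrial counts
]
qc = qc_per_cell(counts, mito_flags)
for cell in qc:
    print(cell)
```

Both toy cells have the same total counts, but the second one has fewer genes and half of its counts are mitochondrial, which is the pattern the scatter plots are used to spot.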
So first we're going to plot some stuff. We want genes by counts, total counts, and percent counts mito. So we're going to start making some plots, and that's going to help inform our decisions on where we're going to put our lines. We want a violin plot, subset. We want those; they go in here under attributes. Yes. Yes. Jitter, 0.4. And we're in. Let's look at what we got. Plots. Right. So this is the total counts, with each little dot representing a cell, and this is the amount of mitochondrial stuff you have. So obviously this isn't looking great. So that's mito.
Okay. So let's look at it another way. We want a scatter plot. Use coordinates: total counts and n genes by counts. And we'll just throw the other one on as well while we're at it. So we can look at this plot, which is the percent counts mito. This is always helpful, because you can see that the mitochondrial genes are highly expressed in the cells where you don't have that many genes at all, which makes those cells quite spurious, so you're going to put some sort of cutoff here. All right. And then we have our number of genes versus our number of UMIs, or our total counts. We can see that there is a relationship, which is quite good. You're hoping that the more you count, the more genes you get. Otherwise you're probably picking up a lot of some gene that's quite boring because it's repetitive.
So what are we going to do with... ah, yes, we can check the solutions. What are we going to do with all this lovely information? We're now looking for good thresholds. All right. We're using this data to determine the thresholds we're about to use for filtering. Okay. So if we look at our percent counts mito, a 5% cutoff is quite common, which makes a bit of sense with this chart; we're seeing these out here. And then we're looking at 200 genes total. Yeah. So, I mean, well, that's 500, so where's 200? Something like that.
We're also going to include a high threshold, and sometimes people don't do this. These seem like some pretty intense outliers, so we might want to cut those out. All right. So we're going to use 2,500 for this.
So let's start filtering. We're going to start with a minimum of 200. All right. So we've got our QC metrics, and we're going to filter on minimum number of genes expressed. Make sure you click the right thing. Minimum of 200 genes expressed. All right. And now let's look and see what's happened. No cells have been removed; we still have 2,700. So the dataset you were given didn't have any cells with fewer than that. Most likely you've been given nice data, because you will always find cells that are junk. So maybe the alignment, or RNA STAR, or whatever step preceded this has already taken out a certain number of cells. A lot of it just takes computers ages to deal with.
Okay. So now let's do the same thing, but for a higher cutoff. So I've got maximum number of genes expressed, 2,500. Make sure to use the correct output. Okay. And let's look and see. We can see we've dropped a whopping five cells. So we've cut the ones up here. We can even count them: one, two, three, four, five. So we've removed five.
Now we're going to filter on the mitochondrial content, that is, we want everything less than 5%. This requires a different tool. So we come over to Manipulate AnnData, and just to make sure nothing goes wrong later, make sure you click 0.6.22. Okay. Yeah, the correct output. All right. We want to filter: filter observations or variables. We want percent counts mito, and observations. Yeah, because percent counts mito in the variables wouldn't make sense. Mito is in the variables, whereas the percent counts mito for each cell is in the observations. So you want, yes, number, and we want less than. And we're there. I'll rename this 3K PBMC after QC filtering. Cool. Then I want to inspect it and see what's happened.
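The three filters above (at least 200 genes, at most 2,500 genes, less than 5% mitochondrial counts) can be read as one predicate per cell. The Python sketch below applies it to made-up QC values. Whether the gene-count bounds are inclusive is an assumption here, and `passes_qc` is a hypothetical helper rather than anything from the Galaxy tools.

```python
def passes_qc(cell, min_genes=200, max_genes=2500, max_pct_mito=5.0):
    """True if a cell survives all three thresholds used in the walkthrough."""
    return (
        min_genes <= cell["n_genes_by_counts"] <= max_genes  # bounds assumed inclusive
        and cell["pct_counts_mito"] < max_pct_mito           # strictly less than 5%
    )


# Hypothetical cells: only the first one should survive.
cells = [
    {"barcode": "cell_a", "n_genes_by_counts": 800, "pct_counts_mito": 2.1},
    {"barcode": "cell_b", "n_genes_by_counts": 150, "pct_counts_mito": 1.0},   # too few genes
    {"barcode": "cell_c", "n_genes_by_counts": 3100, "pct_counts_mito": 1.5},  # too many genes
    {"barcode": "cell_d", "n_genes_by_counts": 600, "pct_counts_mito": 12.0},  # too much mito
]
kept = [c["barcode"] for c in cells if passes_qc(c)]
removed = len(cells) - len(kept)
print(kept, removed)
```

In Galaxy the filters run as separate steps, so each step has to take the previous step's output as its input; combining them in one predicate gives the same surviving cells.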
And we can look at that. And we can say, oh, a few more cells have been removed. So how many have been removed? Well, there's 2,638 now, so a whopping 57 cells.
All right. Onwards with the normalization and scaling. So we're going to Normalize with scanpy. Run it on the after-QC-filtering object, fill in the target sum, and go. Now, it's not enough just to normalize, because, to be frank, there are such discrepancies between cells that the sort of standard normalization you might do for, I don't know, PCR or something doesn't quite work. So you also want to log transform. Otherwise you can really exaggerate different expression levels. Okay. So, Inspect and manipulate. Yeah, we don't want to do that. We want to logarithmize the data matrix.
I'm going to be a bit sneaky here and just go for the next step. So I come to Manipulate AnnData, pick the correct version, and I'm going to freeze this. I want this to be in the raw attribute. The other stuff we're going to do to map it or show it might change the numbers, but what we want when we're doing our differential expression testing is the raw attribute. So I'm going to get that to go. Because you don't want to be removing variation or variability from a gene expression level or transcript level and then do statistics on it. That's not great stats. And we're there. So I'm going to rename this 3K PBMC after QC filtering and normalization.
And now we're going to move on to feature selection. The first thing we're going to do is find our highly variable genes. So I come to Filter, let's get it again, and what we want is to annotate and filter highly variable genes. Because if something's not showing variation in our sample, then it's just going to add dirt to our analysis and hide the interesting stuff, so we're going to get rid of it. And I think the defaults for this are fine. All right. And now we're going to plot this to see how many of our genes might be variable.
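The normalization and log-transform steps described above can be sketched per cell in a few lines of Python. The target sum of 10,000 is an assumption (the value is not stated in the recording), `log1p` is the natural log of one plus the value, and `normalize_and_log` is a hypothetical name; this is not the tool's implementation.

```python
import math


def normalize_and_log(row, target_sum=10_000):
    """Scale one cell's counts to a common total, then log-transform."""
    total = sum(row)
    if total == 0:
        return [0.0] * len(row)
    # Step 1: every cell ends up with the same total (target_sum is assumed).
    scaled = [c * target_sum / total for c in row]
    # Step 2: log(1 + x) compresses large values and keeps zeros at zero.
    return [math.log1p(v) for v in scaled]


# Two hypothetical cells with the same gene proportions but a 10x depth difference.
shallow = normalize_and_log([1, 3])
deep = normalize_and_log([10, 30])
print(shallow)
print(deep)
```

After normalization the two cells look the same, which is the point: sequencing depth is removed as a difference between cells before any comparison. Freezing a copy into the raw attribute at this stage keeps these values available for the later marker gene tests.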
Yeah, that's one of the reasons I like the Human Cell Atlas instance: it's set up so the single-cell tools come up right away. And under pre-processing we want dispersions. Let's take a look. So gray are all the genes we're going to get rid of, and black are the highly variable genes. So, taking normalization into account, you can see what you're able to get rid of. All right. And let's look at the output. We now have highly variable as one of the tags on our variables; that's another annotation alongside mito. How many genes are stored in the object? Where is the information about the genes stored? I don't know, I'll just look at that.
All right. So now we go to Manipulate AnnData, and we're going to filter so that we only keep the highly variable genes. Make sure you have the right version. Filter observations or variables, under variables. I always hesitate before I hit execute to make absolutely certain I have the correct input. That's why I like to clear the clutter in here. And it's finished. So we're going to rename it, and then inspect it. And we can now look and see how many genes we've lost. Absolute buckets of genes.
Right. Onwards with the scaling. So scaling sort of goes along with removing different confounders or batch correction, depending on what you're going to do. Okay. And now we're going to do the scale step, and that's in Inspect and manipulate: pp.scale, to scale the data. Zero center, yes. Maximum value, 10. And then we're going to rename it after scaling. That makes it much easier to compare.
Onwards with dimensionality reduction. There we go. Okay. Yes, that's our input. We want to compute PCA, so we want tl.pca. Compute. Yes. And now we finally get to some of the fun bits, which is plotting. I'm going to plot PCA. I want the scatter plot. All of these are named very similarly, so just be careful. Okay. And then plot attributes. We want to set components one and two.
And set components two and three. Number of panels per row. All right. We can look at those lovely plots, and we can see some nice variation along principal component one and principal component two. Oh, principal component three, not looking great, but hey, here we are.
Now we're going to look at what is contributing to each of those PCs. So we're going to plot again. This time we're going with PCA, rank genes. Okay. And now we can see the top genes that are contributing to each of the principal components. CST3. And now we're going to plot and look at those genes, the top loadings in each of the three: top of this, top of this, top of that. And now we can see the expression, or rather the presence, of each of these transcripts and where they are on the different PCs.
So let's plot again, and we're going to go with PCA, variance ratio. Sure. If we click on this, we can see our different principal components, and, as we kind of saw before, the first few really show all the variation. We'll take that information and use it to determine where to put a cutoff. Anything more than 10 would be kind of silly, because there seems to be very little variation at that point.
So now we're going to start computing our graph, so that we can start figuring out what a cluster is. We start with computing the neighborhood graph, with our old friend Inspect and manipulate. We want to compute a neighborhood graph. The size of the neighborhood for this tutorial is 10, and the number of PCs is 10; we've determined 10 PCs as our hard threshold. Yes, we want umap down here, euclidean there. And we're going to rename this: it now has our neighborhood graph. And then we're going to do the inspecting again, because we want to know how this neighborhood information is stored within the object. We can look and see how that information is stored, and we have a whole bunch of other new ways of storing information now.
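To illustrate what a neighborhood graph is, here is a brute-force Python sketch that finds each point's nearest neighbors by Euclidean distance. In the walkthrough the points are cells described by 10 principal components and the neighborhood size is 10; the toy below uses five 2-D points and two neighbors. Scanpy's own neighbor search and its UMAP-style connectivities are more involved, so treat this as the idea only; `knn` is a hypothetical helper.

```python
import math


def knn(points, k):
    """For each point, return the indices of its k nearest other points."""
    neighbors = []
    for i, p in enumerate(points):
        # Distance from point i to every other point, sorted nearest first.
        dists = sorted(
            (math.dist(p, q), j) for j, q in enumerate(points) if j != i
        )
        neighbors.append([j for _, j in dists[:k]])
    return neighbors


# Hypothetical cells in a 2-D "PC space": two near the origin, three near (5, 5).
points = [(0, 0), (0, 1), (5, 5), (5, 6), (6, 5)]
graph = knn(points, k=2)
for i, nbrs in enumerate(graph):
    print(i, nbrs)
```

Points 2, 3 and 4 only list each other as neighbors, which is the kind of densely connected group a clustering algorithm later picks out from the graph.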
So the neighbors, as well as the PCA, to be fair: a lot of this stuff is now stored in some new object locations. So if we wanted to inspect it, we could now go to... so we've done obs, we've done var, but now we can go to uns. We want to look at neighbors, and we can get all sorts of information from that. All right.
And now, you know, I want to cluster. So the next thing we're going to do is look at a UMAP. So I do another one of these: Cluster, infer trajectories and embed. All right. I want to embed the neighborhood graph. Now we're getting into a position where we can start looking at a UMAP. I'm just going to clean this up a little bit for my own sake. All right. Now that that's done, we can look and see how that data is stored. Although, to be fair, if we just click on it, we can see it's under obsm, so we can investigate. We can look at all sorts of information and see what we've got from there. Look, some lovely, lovely UMAP coordinates.
And now we can finally plot our lovely UMAP. We want embeddings, UMAP. Oh, yeah. And we're also going to include those genes we were interested in a second ago. That's all completed. So now we can see the lovely pretty picture with our genes highlighted. All right. And then you get the question: can you visualize the clusters in these graphs? Might they be linked to PCs? I mean, yeah, sure. We can start eyeballing clusters, but it would be a lot better if we could calculate them. Although, to be fair, we do a lot of calculation, and it always comes down to iterating between what you think might be true and what's not true, and it's all a bit messy. But let's do it the good code-y way. We're going to do another Cluster, infer trajectories and embed, and this time we're going to do tl.louvain.
That's just because it's the more traditional one; I think people are using Leiden now. We want a resolution of 0.5, which is something you'd normally play around with: you'd cluster, you'd see if it makes sense, you'd come back, you'd change your resolution. Let's see what we get. All right, we're going to rename this before I get ahead of myself again. Okay.
And the question: how is the information stored in the object? Well, check out down here. Oh, I love it. It saves you a whole Inspect AnnData step, as you can see here. I mean, you've got louvain, louvain categories, et cetera. And if we did do the inspect, we can see it's also going to label the cells themselves, so within the obs category we've got louvain. We can go over here as well, right over, and you can see it's given a cluster number to each of these. Now, the numbers are just by size. The fact that it's zero or two is meaningless; it's just the biggest cluster to the smallest cluster.
All right. Well, enough chat. Let's plot these bad boys. So I'm going to plot, and I want to plot my clusters. I'm again going to do pl.umap. This time, however, I'm going to add in louvain. And then I'm just going to set a plot attribute: I want two per row. This next bit is actually super important. Sometimes the plotting errors out a little bit, so it may be that you need to change this to viridis. If you ever find that one of the plot UMAP functions doesn't work, it may be because something's buggy with the colors in the back end. So just change that and see if it fixes the issue. And then we can look at our lovely, lovely plots. These three are colored by gene, and this one is colored by cluster, so you can see all the different clusters.
And then finally, home stretch. We're nearly there. We're going to go into finding marker genes. The first thing we're going to do is use the t-test. So we're going to Inspect and manipulate with scanpy. Okay.
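The point that cluster numbers only reflect size can be shown with a tiny Python sketch: take arbitrary cluster labels and renumber them so that "0" is the largest group. This is an illustration of the naming convention only, not of the Louvain algorithm, and `relabel_by_size` and the labels are made up.

```python
from collections import Counter


def relabel_by_size(labels):
    """Rename clusters '0', '1', ... from the largest to the smallest."""
    sizes = Counter(labels)
    # most_common() sorts by count, largest first; equal counts keep the
    # order in which the labels were first seen.
    order = [label for label, _ in sizes.most_common()]
    mapping = {old: str(new) for new, old in enumerate(order)}
    return [mapping[label] for label in labels]


# Hypothetical per-cell cluster labels with arbitrary names.
labels = ["b", "a", "a", "c", "a", "b"]
renamed = relabel_by_size(labels)
print(renamed)
```

Cluster "0" is simply the one with the most cells; the numbers say nothing about cell type or about how similar cluster "0" is to cluster "1".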
And this time we're going to rank genes. We're going to be comparing by cluster; you could compare by whatever you want. Use the raw attribute, yes. So we don't want the, you know, changed values from the neighborhood graph or anything else that we've done to it. We want the original, straight after normalization. And then we're going to compare each group to the union of the rest. Number of genes, 100. For the method, we're going to choose t-test. And then we're going to roll that.
Again, if we consider how this information might be stored, we can run another inspect. Okay. So we've got a wide variety of information. We can look within this one, so we've got the rank genes groups. Let's look at more information to do with that. And now we're going to get, whoa, five different bits of information that all have to do with the marker genes and the marker gene test. Enjoy looking at that.
All right. We're now going to move on to plotting. This time we're going to look at the marker genes. We want to look at 20 different genes, and we want three panels per row. And now we can look at this, and in the different comparisons we can see the different genes for each cluster as compared with everything else. You can look at the top five ranked genes per cluster.
We're now going to do the exact same thing with a different test: the Wilcoxon rank-sum test, which is a little bit more normally used. Okay. So we go back to our Inspect and manipulate. We're not going to do this on the t-test output that we've already done, which would be this guy. We're going to do it on the previous one. Rank genes for groups. All right. The key is louvain again, use raw again. This time we're going to do Wilcoxon rank-sum. And I'm going to be good and name my previous bit of data, which I should have done already. I'm going to remember to rename this one with the Wilcoxon test. And then we're going to plot again, but we're going to plot that one.
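To show why the two tests can rank genes differently, here is a small Python sketch comparing one cluster against "the rest" for a single gene. It computes Welch's t statistic and the Mann-Whitney U statistic (the rank-sum idea) by hand on made-up values. Scanpy's exact variants, scores and corrections are not reproduced here, and both function names are hypothetical.

```python
import math
from statistics import mean, variance


def welch_t(a, b):
    """Welch's t statistic: difference in means over its standard error."""
    return (mean(a) - mean(b)) / math.sqrt(variance(a) / len(a) + variance(b) / len(b))


def rank_sum_u(a, b):
    """Mann-Whitney U for group a: pairs where a beats b (ties count half)."""
    return sum((x > y) + 0.5 * (x == y) for x in a for y in b)


# Hypothetical expression of one gene in a cluster versus all other cells.
cluster = [5, 6, 7, 8]
rest = [0, 1, 1, 2]
rest_with_outlier = rest + [100]  # one extreme cell outside the cluster

print(welch_t(cluster, rest), rank_sum_u(cluster, rest))
print(welch_t(cluster, rest_with_outlier), rank_sum_u(cluster, rest_with_outlier))
```

A single extreme cell flips the sign of the t statistic, because it drags the mean of "the rest" above the cluster mean. The rank-based statistic barely moves, because it only asks how often a cluster cell is higher than a non-cluster cell. On well-behaved genes the two rankings tend to agree.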
And we'll inspect it again, because why not. Come down here and rename this one ranked genes with t-test. And again, I'm just going to do a little bit of tidying. Okay. So now I've got the names, and this one was ranked genes with the Wilcoxon test. Now I can look at that, and then I could come over here and compare it with the ranked genes for the t-test. A lot of them are quite similar, actually. Well, that makes sense; one would hope that would be true. And then I can also look at the plot. Again, slightly different, but reasonably similar. You can quickly look at the top five and do the analysis as it's suggested.
And now we're going to do some even cooler plots. We're going to keep going with the Wilcoxon test, just because that's the most common one. All right. We're now going to go back to a generic violin plot. We're going to subset, look at our favorite three, and we're also going to group by cluster. And now we can look and see our pretty, pretty graphs. This is very exciting. So you can imagine, you know, what genes you might be looking for or what you might be interested in. Cool. And then, yeah, you consider whether they're in the clusters you want them to be in, or that you thought they would be in.
And now let's start getting some even better plots. So yeah, at this point the tutorial basically becomes: what are some cool ways we can visualize data or answer certain questions? So now we're going to plot a whole bunch of them. Now, this is slightly misleading. Okay. Number of categories is actually referring to the number of clusters here, not the list of variables, just in case that throws you off. Use raw, yes, please. Custom figure size. Swap axes, yes. Add a strip plot, no. We can now look at our super cool stacked violin plot. I love those. We can analyze that more, and really that brings us to the end of the tutorial today. If you happen to go further than that, you know, good on you.