Hello, my name is Wendi Bacon. I'm a lecturer at The Open University and I work with EMBL-EBI, and this is going to be a tutorial on using Alevin to create a single-cell expression matrix from single-cell data, ideally from a droplet-based strategy. Some important points about this: I'm going to do this within the Human Cell Atlas Galaxy instance. This has all the same tools that are on the usegalaxy.eu instance, they're just a little bit easier to find because it's set up for single cell in general. So you should be able to use any Galaxy instance you want in the European group. It just may be a little bit trickier to find tools. So if you ever get stuck because a tool is not coming up when you search for it and you're not using the Human Cell Atlas instance, go through our gorgeous little graduation cap here, and then you'll be able to click on the tools directly in the tutorial. So if you're ever stuck, that's your get-out-of-jail-free card. The other thing to note is that this screencast is all about helping you navigate Galaxy. All of the important scientific information is within the tutorial. This is purely for if you get stuck. So if you're able to blitz through the whole tutorial on your own without having to use this video, do not bother coming on, because there's no other secret information that I've stored in this video. This is entirely for if you're getting stuck, or a parameter's not working, and you're trying to figure out what you're supposed to do. Right, onwards with the tutorial. Now, normally at the beginning of a tutorial we do a load of getting data. We copy all the links to the data, we come over here, we go to Paste/Fetch data, and we're able to paste our links in and then download it. With bigger files that can sometimes take a little while, so I quite like the little sneaky way. That's why, when I make a tutorial, I'll often also give an input history. So I'm going to nab my input history and I'm going to import that directly.
Terrible data management there. You should probably name this something else, so it'll be "part one tutorial". Let's go, all right. And yes, if you do it that way you also don't have to rename the data sets. And if I was quite on my game, I put any important labels, or tags, on here, and that can sometimes help. It's not that big of a help in this tutorial. Right, you're then given some questions to look at different things, right? We can examine, oh, what happens if I do that? It loads, right, and you can look at this here and examine it, that's pretty good. We can look at all the information in our GTF file. Four for four on it not automatically downloading, that's pretty good. We can look at all of our loads of transcript information, so you can examine this to your heart's delight. Ooh, that's quite short, I wonder why that is. Read the tutorial. And now we're going to start actually doing stuff. So our first tool of the day is GTF2GeneList. Now as I said before, in some instances this doesn't come up if you search for it here, so you will need to use the tutorial version. The tools exist, sometimes the search bar is just a little bit wonky. All right, what we want is our GTF file. It's probably not even going to recognize anything else. That's a fairly good point. This has come up as a FASTA, this has come up as GFF and these have both come up as FASTQs, so that's good, it's the right format. Feature type, we want transcript. And at this point I'm copying over what's in the tutorial, so, yeah, the header setting is just about making it format properly so that the tools will talk to each other. We're not going to flag mitochondrial genes yet, don't worry. That will come up. This should automatically grab my FASTA. So I'm going to filter that, all right. And this may take a little while. And through the magic of prerecording, we are there. So I can rename these, and this file, this GZ file, can become the filtered FASTA.
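For anyone curious what that step is broadly doing under the hood, here is a minimal Python sketch, not the tool's actual code. It assumes the useful part is pulling a transcript-to-gene mapping out of the GTF attribute column for rows whose feature type is transcript. The GTF lines, the IDs, and the function name `transcript_to_gene` are all made up for illustration.

```python
import re

# Hypothetical GTF lines for illustration only; a real file is far longer.
GTF_TEXT = """\
#!genome-build example
chr1\thavana\tgene\t100\t900\t.\t+\t.\tgene_id "ENSMUSG0001"; gene_name "GeneA";
chr1\thavana\ttranscript\t100\t900\t.\t+\t.\tgene_id "ENSMUSG0001"; transcript_id "ENSMUST0001"; gene_name "GeneA";
chr1\thavana\texon\t100\t300\t.\t+\t.\tgene_id "ENSMUSG0001"; transcript_id "ENSMUST0001";
chrM\thavana\ttranscript\t10\t500\t.\t+\t.\tgene_id "ENSMUSG0002"; transcript_id "ENSMUST0002"; gene_name "mt-Nd1";
"""

# GTF attributes look like: key "value"; key "value";
ATTR = re.compile(r'(\S+) "([^"]*)"')


def transcript_to_gene(gtf_text, feature_type="transcript"):
    """Return (transcript_id, gene_id) pairs for rows of the given feature type."""
    pairs = []
    for line in gtf_text.splitlines():
        if not line or line.startswith("#"):
            continue  # skip blank lines and header comments
        fields = line.split("\t")
        if len(fields) < 9 or fields[2] != feature_type:
            continue  # column 3 holds the feature type
        attrs = dict(ATTR.findall(fields[8]))
        if "transcript_id" in attrs and "gene_id" in attrs:
            pairs.append((attrs["transcript_id"], attrs["gene_id"]))
    return pairs


TX2GENE = transcript_to_gene(GTF_TEXT)
for tx, gene in TX2GENE:
    print(f"{tx}\t{gene}")  # two columns, no header line
```

Printing without a header mirrors the point about formatting the output so the next tool can read it. The real tool also filters the FASTA to match, which this sketch leaves out.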
And now we're going to run Alevin. This is the factory of this tutorial. It does all of our work for us. Okay, and sometimes, if you're very lucky, there's a built-in one you can use. We've got mouse here now that you could use, which is pretty cool, because that's mm10. I think it's slightly different from the one that we normally use, but man, that will speed up your work if you use the built-in transcriptome. But instead, we'll build our own. So we want our FASTA, which has been filtered. Yes, that's fine. And now we're filling it in. We want read one, and that's read two. This is Drop-seq data. And this is going to take a long time. And we're in. Okay, and now we can look at different files. Ooh, these are nice files. Duh-duh-duh, sure. And we can look at our matrix. Ooh, ooh, look at those numbers. Those numbers are important. And if you ever get kind of confused about what's what, it's helpful that you can look at the columns versus the rows, and that's good information. And now we're going to make some droplet plots. Oh, and I should be zoomed in. There we go. Okay, so we want our Droplet barcode rank plot, not in matrix format. raw_cb_frequency, cool. Oh, this is very good, remembering to label a plot. I never remember to do this in Galaxy, so this is some excellent data management technique right there. And we can look at our lovely barcode plot. Ooh, and then we can look at the thresholds it gives us. Cool, cool. And I'm going to be good and I'm going to rename my plot. And then, you know, I'm going to be honest, these numbers are poor. So hopefully it's giving you images, because this is downsampled data. So yeah, there's a reason it looks rough. Although to be fair, it would be kind of rough anyway. Okay, so let's redo this. Let's redo this jazz. All right, but this time we are going to input in Matrix Market format. quants_mat.mtx, so we're looking at Alevin's processing and we can look at what Alevin did. Whoa, and onwards and upwards. We're going to run Alevin again.
I'm going to cheat and, rather than look for the tool, I'm just going to rerun it. All right. And this time, I'm going to come down here and keepCBFraction is going to be one. freqThreshold is going to be three, rather than the default of 10. That way it's going to keep way more of my stuff rather than applying its own threshold, so that I can apply a threshold myself. And we're there, and we can again look at all these lovely outputs. We can look at our matrix, see what's going on with that. Ooh, these numbers look different, or do they? Yes, so you can examine that all you want. Oh, I should have done this, but I forgot. It is helpful to try and remember which Alevin run is which. You definitely want the later numbers when you're working on the next tools. So let's do that. And again, you can use the tutorial version in order to find these tools. So you can see the two versions there. And I'm getting all the later things, so that's good. We're going to rename everything. So this is going to be, what, matrix table, just because everything else I named a table, and so we're stuck with that now. Lovely stuff. Okay, so we need to add in a whole bunch of information. And so again, we'll use our favorite GTF file to pull all this information. So we've got gene, gene_id, suppress header. Yes, because that's just how the tools work together and how the formats work. And what I want to pull is gene ID, gene name and whether or not it's mitochondrial. So yes, I want to flag this. Oh, sorry. That one, that's the one that I want to flag. I'll flag everything. And then we no longer need to filter a FASTA like we did before. And we're there and we can look at this. Ooh. Yes, so now we have all of our lovely gene information. But we need to get this information into this gene table, so we need to organize it in such a way that it'll do that. And how we're going to do that is with our Join two Datasets function. I want to join the gene table, column one, with gene information.
Oh, I do this every time. Remember to make sure that the data type is tabular. There we go. For our gene table, gene information. And then we want yes, yes, no, no. And then we also want Cut. Oh, that's Advanced Cut. I don't want Advanced Cut on that one. Columns one, four, and five. So this is now our annotated gene table. And that looks a lot better. So this is now in the exact same order that our matrix is in, and now we can put the two together. So let's read it in. We want our matrix, our new gene table and our barcode table. And now we can finally run emptyDrops. So we want the output object, which we definitely remembered to label properly. And then I don't think we really have to put in anything else. And we're off. And we can see our cool output. And then in our object there are some interesting things. That's cool. So we're going to rerun that, because obviously that's not really going to work for the fake data that we have. Ooh, new version, you say. Sure. I'm in. Okay. And this time we're going to say the lower bound is going to be five, because it's downsampled data. And also, don't freak out if this number is 24, 22 or something. It's all going to be pretty close. There is an element of randomization within this tool, so don't stress. Okay. And now I put in the number five. So I'm gonna make this. If you have one. Okay. And so I should have around, well, hopefully around exactly 111 barcodes. Cool, this is great, but not the right format at all for the thing I want to do. Come to me, SCEasy. And so I'm going to go SingleCellExperiment to AnnData. That's what I want. And we're finally there. We have our AnnData object and we can move forward with our lives. Huzzah! However, it is only actually 400,000 reads of the total thing. And there's only really one lane, where you might want to be combining multiple lanes. And thus we hit the second half of the tutorial. This is where we're combining FASTQ files.
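Before moving on, it may help to see why the join-then-cut step earlier in this part matters. Below is a minimal Python sketch of the idea, not the Galaxy tools themselves: look up each gene's annotation while keeping the row order of the gene table, because that is the order the matrix uses. The gene IDs, the names, and the function `annotate_in_matrix_order` are hypothetical, and keeping unmatched genes with empty annotation fields is an assumption of this sketch.

```python
# Hypothetical gene IDs, in the order the matrix rows use them.
GENE_TABLE = ["ENSMUSG0003", "ENSMUSG0001", "ENSMUSG0002"]

# Hypothetical annotation rows (gene_id, gene_name, mito flag), in GTF order.
GENE_INFO = [
    ("ENSMUSG0001", "GeneA", "False"),
    ("ENSMUSG0002", "mt-Nd1", "True"),
    ("ENSMUSG0003", "GeneC", "False"),
]


def annotate_in_matrix_order(gene_ids, gene_info):
    """Join annotation onto gene_ids without changing the order of gene_ids."""
    lookup = {gid: (name, mito) for gid, name, mito in gene_info}
    rows = []
    for gid in gene_ids:
        # Assumption: genes missing from the annotation are kept, with blanks.
        name, mito = lookup.get(gid, ("", ""))
        rows.append((gid, name, mito))
    return rows


ANNOTATED = annotate_in_matrix_order(GENE_TABLE, GENE_INFO)
for row in ANNOTATED:
    print("\t".join(row))
```

The output has one row per gene in the original gene table order, which is the property the read-in step relies on when it lines the table up with the matrix.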
Now you're very welcome to go through this entire tutorial six more times, or indeed seven if you want to get the full FASTQ data, so you'd have to redo this one as well, and then put them together in a history. Top tip for that if you're working on it. Oh, you'll see all my messy histories. It's that when you create your new history, my new history of lovely FASTQ files, right? And let's say, sure, you've done this a whole bunch of times on your different data sets. You can just click and drag them over, and then you can start with your new history, and it's brilliant. Okay, but we're not going to do this a bunch more times. So instead, we're just going to grab the input history. So I'll open that link again and here we are. And I went through and added little labels, which you can do by clicking on any data set and adding a little tag, just so that I would know which one was male, wild type, et cetera, all the metadata, which is quite important. So we have all of our data. All of the data is h5ad. Otherwise we would come over here and change the data type, but that all looks fine. So we're going to start with concatenating objects. And this can go very wrong if you don't click it correctly. So make sure you start with one there. And yeah, we're still using the downsampled stuff to make it a bit easier. And then you want two through seven here. Make sure you don't accidentally click one again, or you end up with essentially eight libraries where there were actually only seven. And yeah, we want intersection of variables, so they don't just keep adding the same, you know, metadata field twice and just add a dash one or whatever. And we want the batch separator, and away we go. And this is going to stitch all of our data sets together in a meaningful way. And now we can use one of my all-time favorite tools. Ooh, all sorts of stuff.
I mean, to be fair, what's cool now is that you can look in this little window and get all sorts of information about your object, which is awesome and didn't use to exist. So gold star to the developers of that, one of whom I believe is Mehmet. So gold star to you, Mehmet. And we can get even more information by running this get-me-more-information tool, which will prove very useful to you when you're trying to manipulate your metadata, which we're going to be doing very soon. And metadata is things like, where did the sample come from? Was this knockout or was this wild type, in this case? And now we can look. This will tell us, you know, our cells by genes. This will tell us lots of the different metadata we have. We can look at our columns of metadata and say, oh, this is cool. This gives me some sort of maths from emptyDrops. This is telling me my batch information. So these are all from the first one, and then later on they're batch one. And then variables. So this is information about each gene. So it's the symbol, and whether it's mitochondrial or not, here. And if we look at, yes, our experimental design, right? So when we added these, from N701, two, three, four, five, six, seven, this one would be considered batch zero because it's the first one we added. And then we added two, so this would become batch one, and so on and so forth. So you end up with batches zero through six. All right, let's add in some metadata, shall we? We're going to use this information to change it, because calling N701 batch zero is very confusing. So let's stop that. All right, so we're going to go with Replace Text. Our observations, so this is our cell data. And remember that these numbers come from that experimental design object. This is how you can figure out which batch is supposed to be what. And I had one female. We're going to rename that whole column sex.
And we only want that column, rather than, you know, if we look at this, what we're interested in is creating a lovely column that's useful here. We don't want to repeat all of this extra information, so let's cut it. Again, I don't want Advanced Cut. All we want is c9, right? And then this should give us, yes, our column of sex metadata. And now we're going to do everything again, but we're going to label them by genotype instead. So let's do this again. We're going to switch it, and it's going to be zero, three, four, five, and you're going to be wild type. One, six: knockout. And then we're going to be calling this genotype. And now we can call that our genotype metadata. And now we're going to paste two files side by side, genotype data and sex data. Delimit by tab. Yes. I mean, you can probably skip that step and just manipulate the AnnData, like we're going to do in a second, all at the same time, but that's how I did it the first time, and so that's how I'll forever do it. In case anything goes wrong, you've at least saved yourself a little bit earlier on, at a step that's easier to get working. Yeah, that looks right. Okay. And in the next step, we're going to be manipulating the AnnData by adding that information in. So add new annotations, to observations, and it's going to be that. While that's working, we also know that some of the labeling is poor. So we're going to try and rename these categories of annotation. Rather than having it be batch zero, one, two, three, four, five, six, we're going to rename them to their actual indices from the experiment. And I've seen people fail here, because if you get to this step and it doesn't work, it's often because you didn't actually concatenate all of the data sets in that very first concatenate step. So if this one fails for you, check that and make sure you didn't accidentally have one data set twice or miss off a data set. And we're so, so close.
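If it helps to see the relabelling logic written down, here is a minimal Python sketch of turning a per-cell batch column into sex and genotype columns and then pasting them side by side. This is not what the Galaxy tools run internally. The batch-to-label assignments below are placeholders, not the real experimental design, so take the real ones from the experimental design file. The function name `labels_from_batches` is also made up.

```python
def labels_from_batches(batches, groups):
    """Map each cell's batch id to the label whose group contains that batch."""
    lookup = {}
    for label, members in groups.items():
        for batch in members:
            if batch in lookup:
                raise ValueError(f"batch {batch} is listed under two labels")
            lookup[batch] = label
    # Fail loudly if a batch has no label, e.g. after a bad concatenate step.
    missing = sorted(set(batches) - set(lookup))
    if missing:
        raise ValueError(f"no label for batch(es): {missing}")
    return [lookup[batch] for batch in batches]


# Hypothetical per-cell batch column: four cells drawn from three batches.
CELL_BATCHES = ["0", "0", "1", "2"]

# Placeholder assignments for illustration only.
SEX_GROUPS = {"male": ["0", "1"], "female": ["2"]}
GENOTYPE_GROUPS = {"wildtype": ["0", "2"], "knockout": ["1"]}

SEX = labels_from_batches(CELL_BATCHES, SEX_GROUPS)
GENOTYPE = labels_from_batches(CELL_BATCHES, GENOTYPE_GROUPS)

# "Paste side by side": one row per cell, genotype then sex, tab-delimited.
METADATA = list(zip(GENOTYPE, SEX))
print("genotype\tsex")
for genotype, sex in METADATA:
    print(f"{genotype}\t{sex}")
```

The check for unlabelled batches is there because of the failure described above: if a data set was added twice or left out when concatenating, the batch numbers will not line up with the design.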
See, now we have all of our lovely genotype and sex in our observations, or cells, information. And now the final thing is, we did all that work flagging the mitochondrial reads. We don't want a column that just says true or false. We want a column that says, you know, what percentage of mitochondrial genes are in this cell. So we're going to use that information now. So we've got our output. AnnData format, sure. Copy, no. Insert. Change the gene symbols field. We want it to look within the column that is talking about our mitochondria. So we're going to trick it into looking within the mitochondria one, because it's slightly more accurate to count the mitochondrial genes the way that we have done, using the GTF file, rather than just by the names necessarily. And we are there, my friends. So we'll rewrite the name of this, okay, and then for my own sanity, now that it has all of these objects in it and they're all labeled and it's all lovely, I'm going to just remove them. Yeah, and I was using these tags in some of these to distinguish between whether I let Alevin throw its own thresholds or I depended on emptyDrops only, all right? And so the only tag I'll leave this one with is that one, because it's 400k reads. And that's important to realize: this is not the full object. This is only 400k reads per FASTQ. So it's a downsampled object, right? That's just because it goes a bit faster in a tutorial. We've done all that, awesome, fantastic. There are other ways you can pull data if you aren't as interested in taking it from raw. If you want to believe other people's preprocessing steps, we can download the exact same data just with the EBI's preprocessing, which, for better or for worse, is amazing, because the way it works is they'll apply the same general preprocessing standards to everything.
But if you're looking within a specific data set or a specific cell type, there is also the other side, where you want to curate your analysis for a specific cell type or group. So there are definitely swings and roundabouts between having a sort of standard pipeline and having a targeted pipeline, where these are the parameters that work for these cells based on what we're finding in these samples. So it depends on how you want to access the data. You have it either way. And then, obviously, it's four files right now, so it's not in the format that you want. So we have to read it in, and that's fine. So we pick our matrix and our gene table and our barcodes, and keep in mind, these will have already had some extensive filtration applied to them, and the experimental design goes there. I don't think there's anything for that. Yeah. So the data will look a little bit different because it's already had that preprocessing. And we're done, congrats on making it to the end. I hope you learned a lot, I hope you had fun. Let's be honest, the next tutorial is far more fun, because that's when you get to make your plots. So I'll see you on the other side.