I think we're good to go. People have gotten drinks and food, excellent. So we're going to continue into a bit more of the nitty-gritty: once we've planned our single-cell RNA-seq experiment, what do we do now? And once we've got our scone. Yeah, so we're going to look at how we quantify expression in single cells. So I was supposed to do learning objectives because that's how we do teaching here. I hate learning objectives, so I'm not going to use them. Instead, you all chose to be here, so I want you all to think about what you want to learn in this course. And if we don't cover it, or if I don't cover it, ask. Make sure you learn what you wanted to learn. I want to give you two minutes now. I hope that you're on the Slack. If you could post two things you want to learn from this course on the Slack right now, that'll be super helpful, and I will make sure I cover all of those things tomorrow.
Hopefully you've all written down a couple of things you want to learn. So we'll start off with just how single-cell RNA-seq actually works. You've heard about all these technologies from Trevor earlier this morning, but they all sort of break down into this pipeline. We start with isolating our single cells. We collect the RNA out of the single cells. We turn that RNA into DNA. We amplify that DNA until we have a ton of it instead of one single cell's worth. We then turn that pile of DNA into a sequencing library. We throw it on a sequencer, usually an Illumina, but other sequencers are available. And then we take that sequencing data back, we process it, and we turn it into a count matrix of our gene expression. And we do all kinds of downstream analysis on it.
So, first step where we have choices. Obviously, there are lots of ways of isolating your cells; Trevor already covered that. He also briefly mentioned different ways of turning your RNA into DNA.
So you can do full-length single-cell RNA-seq in one version, or you can do tag-based. I'm not going to talk about full-length because quantification for full-length single-cell RNA-seq is essentially the same as bulk RNA-seq. If you want to learn that, you can do the course just before this one. Instead, for single-cell RNA-seq, almost all of you will probably be using tag-based methods, and they are significantly different from bulk RNA-seq. So that's what I'm going to talk about.
So this is your mRNA. First, to capture this mRNA, we use a poly-dT sequence that hybridizes to the poly-A tail. To this, we add a whole bunch of barcodes to record exactly where this mRNA molecule came from. So first, we add the unique molecular identifier, the UMI. This is a randomly synthesized barcode. They're typically between 8 and 12 base pairs, giving you over 65,000 possible combinations even at the short end (4^8 = 65,536). So in theory, this should be unique to a particular mRNA molecule in a particular cell. In practice, it's not quite that straightforward. So we end up actually having to use the combination of which read, and so which gene, my UMI is linked to and what my UMI sequence is to identify a unique molecule in our sample.
We then have to add a cell barcode. These are 14 to 18 base pair sequences, and they come from a predetermined pool. These are predetermined so that we can make sure there are at least two base pair differences between any two cell barcodes, which makes it much easier to determine precisely which cell our RNA reads come from, because there are, of course, sequencing errors. So we can get one or two errors in these cell barcodes, but as long as we've got two base pair differences between our different barcodes, we can get almost all of our reads correctly matched to the correct cell. So if you're using 10x Genomics, they have a whole bunch of different whitelists. Depending on which particular chemistry you used, you need to make sure you're using the right whitelist.
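As a rough illustration of the two barcode ideas above, here is a minimal Python sketch. It computes how many UMI sequences exist for a given length, checks the minimum pairwise distance of a whitelist, and assigns an observed cell barcode to the whitelist allowing one mismatch. The toy whitelist, the function names, and the "unique match within one mismatch" rule are assumptions made for the example; they are not the actual 10x whitelists or the CellRanger correction procedure.

```python
from itertools import combinations


def umi_space(length: int) -> int:
    """Number of possible UMI sequences of a given length (4 bases per position)."""
    return 4 ** length


def hamming(a: str, b: str) -> int:
    """Number of mismatching positions between two equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(x != y for x, y in zip(a, b))


def min_pairwise_distance(whitelist):
    """Smallest Hamming distance between any two barcodes in the whitelist."""
    return min(hamming(a, b) for a, b in combinations(whitelist, 2))


def assign_barcode(observed, whitelist, max_mismatches=1):
    """Return the matching whitelist barcode, or None if there is no unique match.

    Exact matches win. Otherwise the barcode is accepted only if exactly one
    whitelist entry lies within max_mismatches (a simplifying assumption).
    """
    if observed in whitelist:
        return observed
    hits = [bc for bc in whitelist if hamming(observed, bc) <= max_mismatches]
    return hits[0] if len(hits) == 1 else None


# Toy 8 bp whitelist (hypothetical; real cell barcodes are longer).
WHITELIST = ["AAAAAAAA", "AACCAAAA", "GGTTGGTT"]

print(umi_space(8), umi_space(12))           # sizes of the 8 bp and 12 bp UMI spaces
print(min_pairwise_distance(WHITELIST))      # at least 2 by construction
print(assign_barcode("GGTTGGTA", WHITELIST))  # one error, unique nearest barcode
print(assign_barcode("AACAAAAA", WHITELIST))  # one error away from two barcodes
```

Note the last case: with only two differences between barcodes, a single error can land exactly halfway between two whitelist entries, so this sketch drops the read rather than guess. Using the wrong whitelist in `assign_barcode` would send most reads down the `None` path.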
So they're all available, I should say. So once you've got your cell barcode, you then also add a 10x barcode. These are 10 base pair sequences, again from a predetermined pool. They come in a kit of 96 different ones, and they're used to identify a particular sample. So if you sequence 12 different mice, you give them each a different 10x barcode, then you can pool them all together when you sequence them, and your sequencing costs less, which is good.
Then we add a PCR primer so that we can amplify this like crazy. This gets reverse transcribed by reverse transcriptase into cDNA. It ends up with some poly-C at the end, which allows us to hybridize a template switch oligonucleotide to those poly-Cs. With our other PCR primer on it, reverse transcriptase goes back to synthesize the other strand. Now we have a cDNA that we can amplify with standard PCR. So next we amplify this through something like 12 rounds of PCR, so we get tons and tons of DNA. And then to prepare our libraries, we have to fragment this into roughly 300 base pair fragments for our sequencing machine if we're doing Illumina. If you're doing Nanopore, you don't fragment it and you sequence the whole transcript. But almost nobody does that because it's expensive, and single cell is expensive enough. So we're going with Illumina and 300 base pair fragments, and we do paired-end sequencing to get two 100 base pair reads back from our sequencing machine.
All right, so what are these reads? First, read 1 is purely dedicated to our barcodes, to make sure we sequence all of those barcodes together. Then it will run into the poly-A tail, so most of the time the rest is just going to be a whole bunch of A's. That's not actually useful for us at all, so we just trim that off and throw it away, and we just get our set of barcodes from read 1. Then our matching read, the paired read, read 2, is in our transcript fragment. And that's where we can actually see what gene this molecule belongs to.
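To make the read 1 layout concrete, here is a small Python sketch that splits a read 1 sequence into cell barcode, UMI, and the leftover tail that gets trimmed. The lengths used (16 bp cell barcode, 12 bp UMI) are assumptions chosen from within the ranges mentioned above, and `split_read1` and `homopolymer_fraction` are hypothetical helper names, not functions from CellRanger or STARsolo.

```python
from typing import NamedTuple

CB_LEN = 16   # assumed cell barcode length (the talk quotes 14 to 18 bp)
UMI_LEN = 12  # assumed UMI length (the talk quotes 8 to 12 bp)


class Read1(NamedTuple):
    cell_barcode: str
    umi: str
    tail: str  # whatever follows the barcodes; normally trimmed and discarded


def split_read1(seq: str, cb_len: int = CB_LEN, umi_len: int = UMI_LEN) -> Read1:
    """Split read 1 into cell barcode, UMI, and the remaining tail."""
    if len(seq) < cb_len + umi_len:
        raise ValueError("read 1 is shorter than cell barcode + UMI")
    return Read1(
        cell_barcode=seq[:cb_len],
        umi=seq[cb_len:cb_len + umi_len],
        tail=seq[cb_len + umi_len:],
    )


def homopolymer_fraction(seq: str) -> float:
    """Fraction of the sequence taken up by its single most common base."""
    if not seq:
        return 0.0
    return max(seq.count(base) for base in "ACGT") / len(seq)


# Made-up read 1: 16 bp barcode, 12 bp UMI, then a homopolymer run.
example = "ACGTACGTACGTACGT" + "GATTACAGATTA" + "T" * 30
parsed = split_read1(example)
print(parsed.cell_barcode, parsed.umi, homopolymer_fraction(parsed.tail))
```

The same `homopolymer_fraction` check could be applied to read 2 to flag fragments that contain only tail sequence and no gene, which is the degraded-sample case.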
You may also get an I read if you use CellRanger to generate your FASTQs from your sequencing machine output. The I read just contains the sample index. It's not really important. You can do everything you need with just read 1 and read 2.
Now there are two different versions of 10x Genomics and some other protocols. So you can do either 3 prime or 5 prime sequencing. And all that's different here is which oligo we're attaching all of our barcodes to. So if you want to do 5 prime single-cell RNA-seq, take all those barcodes, attach them to the template switch oligo and you're done. You're now doing 5 prime single-cell RNA-seq. Everything else is exactly the same. Yeah. Is there a high-resonance reduction? Nope. Because it's a lot more common. 3 prime is much more common basically because it was first, because we were using the poly-Ts to capture the RNA. So it was like, oh, well, we'll just add our barcodes to that and we'll sequence the 3 prime end. In theory, there's no efficiency difference. However, you will note that if you do both 3 prime and 5 prime, you cannot just put those back together again. Because of the tons and tons of amplification we do, and because the sequence at the 5 prime end of your gene is obviously different from the sequence at the 3 prime end of your gene, any little bias in that amplification that depends on the sequence, like the GC content, will cause a bias that depends on which end you're using, and you can see that. So if you take 3 prime and 5 prime from the same sample and stick them together, they don't integrate unless you apply some sort of integration to it.
No, so for TCR and BCR, you have to do the 5 prime end because that's where the variable section is. So if you do 3 prime sequencing, you can still see the constant portion of your TCRs and BCRs, but you cannot see the variable portion because that's at the 5 prime end of the gene. Oh, so why don't you have it after the fragmentation?
Because the template switch oligo is at the 5 prime end. Oh, don't have it any? Oh yeah, so it is possible that after fragmentation you can get your read 2 to just be poly-A tail. Yeah. Oh yeah, sure. Yeah, so the question is, is it possible that after fragmentation you won't have any gene sequence in your fragment anymore? And that is possible if you have a particularly long poly-A tail, though it's fairly unlikely. But it can also happen if your sample is really badly degraded and all of your transcripts are really short now, and then you'll get lots of poly-T in your read 2s. I don't see any other questions. And again, our read 1 is going to be all our barcodes, and our read 2 is in our cDNA transcript, the actual gene.
So that's what in theory should happen, and that should be awesome, but people make mistakes. So what would happen if your sample is super degraded and you had really short transcripts? I already kind of answered that. You get lots of poly-T in your read 2s, which means you have to throw away a whole bunch of your reads, essentially. But what would happen if you use the wrong cell barcode whitelist? Or if there's a strong bias towards certain sequences when you're generating your UMIs? Take a couple of minutes to think about these, feel free to discuss with the people next to you, and then we'll go through the answers. What would happen?
Okay, I think everyone's had time to think about these. So, answers from the crowd. What would happen if you use the wrong cell barcode whitelist? No one feeling confident in their answers? Oh, yeah. Yes, you'd be throwing out a whole bunch of your sequencing because it wouldn't match any of your cell barcodes. So you'd end up with like half or a third the number of cells you're actually expecting to see, and all the rest would have gotten thrown out by CellRanger or whatever tool you're using to quantify your single-cell RNA-seq. What would happen if there's a strong bias towards certain sequences in your UMIs?
So it wouldn't affect the differential expression of certain transcripts. Yeah, so the UMIs are independent of the genes, so it won't affect any gene in particular. They're random in theory, but they're generated by a particular process in the lab. So when we say random, they're technically random, but they're not actually random. If you actually sequence and count the number of UMIs you get of different sequences, they are not evenly distributed across all possible sequences. There are biases towards certain sequences in there. Yeah. So your expression level quantification will be an underestimate, because you'll have repeated UMIs that are actually tagging different molecules but just happen to be the same UMI because you have a bias in your UMI pool. So the true frequency of UMIs is not completely random, which is why we end up using both the gene your UMI is linked to in that second read and the UMI sequence when we're counting how many molecules there are for each gene in your pool. Those UMIs are not actually going to be unique in a single cell, because there aren't actually equal frequencies of all the tens of thousands of possible UMIs. Certain UMIs are more frequent than others.
Okay. So this is how single-cell RNA-seq works. So what can we capture with single-cell RNA-seq? mRNA we're going to capture really well, because those have poly-A tails. Mitochondrial RNAs, RNAs from the mitochondrial genome, we also capture really well. They have poly-A tails. MicroRNAs you do not see. tRNAs you can't see. And rRNA, so ribosomal RNA, you're also not going to capture. They don't have poly-A tails. But viral transcripts, sometimes, maybe, you can capture them, because if you have a transcript that is very high in A's and T's, which you can have in certain viral genomes, you might have a very AT-rich genome for your virus. There you can get internal priming.
You can just have a spontaneous run of poly-A's in a gene, and then that will get captured. The same goes for bacteria. If you have a bacterium that's very AT-rich, you can have spontaneous poly-A sequences just in the middle of your gene, get internal priming, and get recorded gene expression for that gene. But if you have a very GC-rich genome, you won't see it.
In theory, yeah, if you do single-nucleus RNA-seq, you should get no mitochondrial RNA reads. But in practice, it depends on what protocol you use to extract the nuclei from your cells. Some of them will truly get you zero mitochondrial reads. With some of them you can still get some mitochondrial reads, because it captures not just the nucleus but some of the stuff stuck to the nucleus as well. Not really. You just use the total genes detected then; there isn't a substitute for mitochondrial RNA.
You'd have to look in the literature for your particular system if you want to know if your virus can be captured or not. Or download a publicly available dataset for your system and see if you can find it, because a lot of people don't check to see if they can find it or not. There are tons of datasets available. Generally, all the tools we use to do the quantification throw out everything that doesn't match the human genome. So if you take everything that doesn't map from your initial mapping and then basically remap it to viral, yeah, it's a second pipeline on those leftovers. You can look at whether they map to bacterial transcripts or viral transcripts. I've heard a report that if you take that sort of junk that's thrown away and map it to microRNAs, sometimes you can get one or two microRNAs in there as well. But it's a secondary pipeline you'd have to design yourself. I don't think the pipeline that was used for those studies is actually available. Oh, yes, lncRNAs. If they have poly-A tails, you can capture them. If they don't, you can't.
And the problem with lncRNAs is also that they tend to be very lowly expressed. So even if in theory you can capture them, just by chance you might not see them. Except for MALAT1, which you will see everywhere and in everything, and it drives everyone crazy. Yeah, everyone here who's done some analysis themselves has seen MALAT1. It is everywhere.
It can. Yeah, it can. I would not recommend it though. If you're going beyond sort of model systems, I would recommend using STARsolo instead of CellRanger because, and I'll cover this a bit later in this talk, STARsolo allows you to tune all of the parameters used in mapping, which can greatly improve your sensitivity for non-model organisms. CellRanger, in theory, is all open and you can tune the parameters. In practice, it's very difficult to change the parameters. So it's basically a black box: you put in your input, you get out your output, and you can't really change anything about it. This brings me to the next slide, which is talking about exactly that.
So, quantifying gene expression. Okay, so we've thrown our library on the sequencing machine. We've got our FASTQ files back. Now we need to take those FASTQ files and turn them into a UMI count matrix. Right. So the first step is to map the reads to the transcriptome. Here you need to consider what genome you're using. If you're using mouse or human, you can easily download the genome. If you're using a non-model organism, you may or may not have a good genome available. You also need to consider your annotation quality, because we're mapping to the transcriptome. If your transcript annotations are garbage, you're not going to get very much out of your single-cell RNA-seq. So you may want to consider re-calling your transcripts using your single-cell RNA-seq data if you're using a bad genome for some weird non-model organism. You also need to consider whether you're going to include introns as well as exons.
So until last year, the recommendation was to only include exons in your transcriptome. As of July 2022, it's now recommended you include introns when you're mapping your reads, because you get about 10% to 15% more reads mapped and counted. Every read counts when our number of UMIs for a lot of our genes is one or zero. You then have to assign reads to cells. This is where your whitelist comes in. And then you count your UMIs. So things to consider here are how you deal with multi-mapping reads: if you have a read with a UMI that maps to multiple different genes, which gene do you count that UMI for? And how do you deal with sequencing errors in your UMIs? You could have two different UMIs mapped to the same gene that are off by one base pair. Are those actually two different molecules, or is that the same molecule but with a sequencing error in one of those UMIs?
Our standard tools for doing this are CellRanger, which probably most of you, if you've got data, have used or have had your data run on, because that's released by 10x Genomics and designed for 10x Genomics stuff. We're not going to run it in this course because it's super boring to run. It's one command, you put it into Linux, you wait eight hours, you get your UMI count matrix back. There's also STARsolo. So CellRanger is based on the alignment tool STAR, with some slight modifications of parameters and the genome. The people who made STAR weren't too happy about CellRanger getting all the credit, so they made their own version of CellRanger called STARsolo, which again is still based on STAR. And it tries to do all the downstream steps as close to CellRanger as possible. So in theory, you get the same answer. In practice, you don't, because there are secrets that 10x Genomics doesn't tell you some of the time. So it's not quite the same, but it's in theory the same.
The benefit of STARsolo, as I mentioned, is you can tune any parameter you want, and the key one there is the mismatch rate. So say you're using a new species that doesn't have a genome available, and you're using its closest relative that does have a sequenced genome. Then you're going to have a higher mismatch rate. If you use CellRanger, a lot of your reads are going to be thrown away. If you use STARsolo, you can change that mismatch rate to be more permissive, and all those reads come back and can be useful. There's also Alevin, which is based on pseudoalignment. I've met one person who's used it, so I'm not going to talk about it.
Okay. So I mentioned sequencing errors. When I talk about sequencing errors, the people at Illumina don't get too happy, because they say there are no sequencing errors, they're 99.9% accurate. When we're talking about sequencing errors in the bioinformatics context, we're talking about errors by reverse transcriptase, errors by DNA polymerase when doing the amplification, as well as errors in the actual sequencing. All of those things can contribute errors. Some of those we can see from our sequencing quality score, most of them we can't. So they'll get through and we'll have to deal with them computationally.
So the way CellRanger deals with sequencing errors in UMIs is they allow for one mismatch. If there are two UMIs that map to the same gene with one mismatch between them, and one of those UMIs has significantly lower numbers of reads, they'll merge those two together into one UMI. Is that perfect? Probably not. There are tools that are better. So if you want to look at UMI-tools, it's technically better, but it's slower and you're only getting like 1% or 0.1% more counts. So people can't be bothered and they just use the CellRanger one mismatch.
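The one-mismatch rule just described can be sketched in a few lines of Python. This is only an illustration of the idea for the UMIs of one gene in one cell: the threshold for "significantly lower" (`min_ratio=2.0`) and the function names are assumptions for the example, not CellRanger's actual values or code.

```python
def hamming(a: str, b: str) -> int:
    """Number of mismatching positions between two equal-length UMIs."""
    return sum(x != y for x, y in zip(a, b))


def collapse_umis(read_counts: dict, min_ratio: float = 2.0) -> dict:
    """Merge UMIs that look like errors of a more abundant UMI.

    read_counts maps UMI sequence -> number of reads, for ONE gene in ONE cell.
    A UMI is merged into an already kept UMI when they differ at exactly one
    position and the kept UMI has at least min_ratio times as many reads
    (min_ratio is a hypothetical stand-in for "significantly lower").
    """
    kept = {}
    # Visit UMIs from most to least abundant so likely parents are seen first.
    for umi, n in sorted(read_counts.items(), key=lambda kv: (-kv[1], kv[0])):
        parent = next(
            (k for k in kept if hamming(umi, k) == 1 and kept[k] >= min_ratio * n),
            None,
        )
        if parent is None:
            kept[umi] = n
        else:
            kept[parent] += n
    return kept


reads_per_umi = {
    "AAAAAAAA": 50,
    "AAAAAAAT": 2,   # one mismatch from a much more abundant UMI: likely an error
    "CCCCCCCC": 30,
    "CCCCCCCG": 25,  # one mismatch but similar abundance: kept as its own molecule
}
collapsed = collapse_umis(reads_per_umi)
print(collapsed, "molecules:", len(collapsed))
```

The number of keys left after collapsing is the UMI count that goes into the count matrix for that gene and cell.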
And a sort of demonstration of why these mismatches and sequencing errors are important to consider is mismapping reads. So here on the left, I have a plot where on the x-axis is the gene expression level, so it's the log expression level. And on the y-axis here is the number of cells where the gene expression for that gene was zero. So lowly expressed genes are zero in almost every cell, whereas highly expressed genes are detected in every cell. And you can see we have this main curve in gray where most of the genes fall. So obviously there's a pretty strong relationship between the expression level and the number of cells you see it in. But you can see the second curve in orange that is basically the same as the main curve, but it's shifted to the left by a small amount. And if you look at what genes those are, those are processed pseudogenes.
So processed pseudogenes are genes in the genome that have no promoters. They have no enhancers. They have no regulatory sequences at all. They don't even have introns, because they are mRNAs that got reverse transcribed somehow out in the wild and then got embedded into the genome. So these things really should not be expressed. But here, if you look at them, we do have expression of them. So the truth here is that all of the reads mapping to these pseudogenes should actually be from the original gene that each is a copy of. But what we observe is about 4% of the reads or the UMIs get mapped to the processed pseudogene and not the main gene. And it's because of this that CellRanger has removed the processed pseudogenes from all of their genomes. So that's one of the differences. Whereas if you use STARsolo, you probably won't do that, and you'll see processed pseudogenes in your data because of this effect. Yeah. So if you see processed pseudogenes in your data, they did not use CellRanger.
Or they used a very, very old version of CellRanger, from before I presented this at a talk in Cambridge where there was a CellRanger representative, and they were like, oh, we should get rid of that.
Yes. So STARsolo is designed to work on 10x data just like CellRanger, and it's designed to mimic CellRanger's pipeline. Because everything is customizable, you can customize the barcode structure in STARsolo, so it can be used for other data sets as well. You can use it for Drop-seq, inDrop, or other technologies. Whereas CellRanger only works for 10x. Yeah, if we assume a 1% sequencing error rate and 100 base pair reads, this 4% roughly corresponds to reads that have three or more sequencing errors, which seems reasonable for those to then be mismapping.
So once we've got our UMI by cell matrix, the next thing that we do is create a barcode rank plot. Here we just take the total number of UMIs mapped to each of our whitelist cell barcodes, see how many UMIs were assigned to each, and plot it on this log-log plot. And you can see that we get reads for 100,000 droplets, even if we only loaded 10,000 cells. So why did we get all of these reads mapping to these droplets that clearly aren't cells, because we only loaded 10,000 cells?
So then we have to think about how the droplets are actually generated. We have these beautiful pictures, and that is the ideal case, right? We flow our barcoded beads and our cells together, and we get a droplet with one cell and one barcoded bead in it. In practice, this isn't what happens, because we have no way of checking whether this is happening or not. Our barcoded beads and the cells are flowing through our microfluidic device, and we have no way of waiting until there's one cell and one bead in each droplet before pinching off that droplet. It's happening way too fast. We're doing thousands of droplets per minute, so that's simply not practical.
So what we actually do is we control the flow rate so that, on average, we tend to get only one cell and one barcoded bead for most of the droplets where we get a cell. To achieve that, we have to have a lot of droplets where we don't have a cell. We also want to only have one barcoded bead per droplet, which means we also have a whole bunch of droplets where we don't have a bead.
So this is our ideal case, and you'll see already in my ideal case I've added something that's not on that diagram, which is ambient RNA. In our cell mixture, there's probably going to be some ambient RNA from our dead or damaged cells that's just floating around and sticking to our other cells, and that will end up in our droplet along with our single cell. So there's always going to be some ambient RNA in there. Most of our droplets are actually empty. A lot of them will still have a barcoded bead in them, which is why they will match a whitelist barcode. So here we have ambient RNA and a bead. We've got some RNA that can get captured, and we've got a bead that can capture it, so it's going to get captured and sequenced. And we have ones that are truly empty.
We can also have doublets, right? So we have one droplet with two cells in it. And we can also have what's known as a barcode multiplet, where we have a droplet with two barcoded beads in it and one cell. This will end up as basically a photocopy of that cell in our data set at the end. In practice, there's nothing we can really do to correct for barcode multiplets. We just have to keep in mind that some portion of our data is actually just copies of the same cell. It's just something that happens. Yeah? Okay. I've been thinking about this since the beginning, but I'm just sort of curious about the specifics of how the UMI gets attached with a one-to-one mapping to the mRNA of interest.
So I'm trying to imagine this process, maybe I'm wrong, it could open up and it would sort of match it, but can we touch on it, or maybe you'll talk about it later? Oh, sure. No, I had finished talking about that, so I'll go back to it. So the way we get a unique molecular identifier attached to each mRNA is these barcoded beads. I can't really tell you exactly how it works, because how they make these is a trade secret of 10x Genomics. But in these droplets, we have our poly-T sequences. They basically generate a whole bunch of random UMIs, take their poly-T sequences, throw them in, so they've got this big mixture of UMIs and poly-T sequences, and then you throw in some DNA ligase and ligate one of these UMIs onto each of the poly-dT sequences, and you hope it's different. And then you take that pool and you shove it into one of these hydrogel beads that then goes into the machine.
Yeah, I can understand the randomness process if you only put, say, theoretically, one UMI of one type, and it captures only one kind of mRNA, but you have to make sure that that mapping is actually isomorphic with the mapping in all the other cells. So with the other barcode, it should work the same way, but in this process it sounds like in one bead, one cell, you could have one UMI mapped to one type of mRNA, but in another one cell, one barcode, you could have the same UMI mapped to a different mRNA. Yes, yes, so that is how it works. The UMI is randomly associated with your mRNA, right, through the hybridization. So you'll have the same UMI repeated across multiple droplets, but mapped to different genes. Yeah, by design, yeah. A lot of this relies on random chance.
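Because the same UMI can legitimately turn up in different cells and on different genes, a molecule is identified by the whole combination of cell barcode, gene, and UMI. The Python sketch below counts molecules that way, and also estimates how many distinct UMIs you would expect to observe if a number of molecules drew UMIs uniformly at random. The uniform draw is an assumption and a best case: as discussed earlier, real UMI pools are biased, which makes collisions more frequent. The function names and toy records are made up for the example.

```python
from collections import defaultdict


def count_molecules(records):
    """Count distinct UMIs per (cell barcode, gene) pair.

    records is an iterable of (cell_barcode, gene, umi) tuples, one per read.
    Repeated reads with the same triple are treated as PCR duplicates.
    """
    seen = defaultdict(set)
    for cell, gene, umi in records:
        seen[(cell, gene)].add(umi)
    return {key: len(umis) for key, umis in seen.items()}


def expected_distinct_umis(n_molecules: int, umi_length: int) -> float:
    """Expected number of distinct UMIs when n_molecules each pick one UMI
    uniformly at random from the 4**umi_length possibilities."""
    n_umis = 4 ** umi_length
    return n_umis * (1.0 - (1.0 - 1.0 / n_umis) ** n_molecules)


records = [
    ("CELL1", "GeneA", "ACGTACGT"),
    ("CELL1", "GeneA", "ACGTACGT"),  # same triple again: a PCR duplicate
    ("CELL1", "GeneA", "TTGCATGC"),
    ("CELL1", "GeneB", "ACGTACGT"),  # same UMI, different gene: separate molecule
    ("CELL2", "GeneA", "ACGTACGT"),  # same UMI, different cell: separate molecule
]
counts = count_molecules(records)
print(counts)

# With 8 bp UMIs, 5,000 molecules of one gene in one cell would be undercounted
# even in the unbiased case, because some molecules share a UMI by chance.
print(expected_distinct_umis(5000, 8))
```

The gap between the number of molecules and the expected number of distinct UMIs is the underestimate described above; it only matters for very highly expressed genes, and it grows if the UMI pool is biased.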
So how do we distinguish, computationally, the droplets that have a single cell in them from the droplets that have doublets in them? Oh, and also, we can affect a lot of these things. So here I've got a set of numbers that are from this paper; they're sort of average, typical numbers. But depending on how you design your experiment, these numbers can change. Depending on our experimental design, we can get different numbers of doublets, different numbers of cells captured, and different amounts of ambient RNA. Have a couple of minutes to think about what parameters of your experimental design can influence these three factors. And again, feel free to chat with your neighbor if it's helpful.
Okay, I think everyone's had time to discuss this. Hopefully you've got some answers. So what about our experimental design can affect the number of doublets we see? Yep. Yeah, so your cell concentration that you put onto the 10x machine or whatever experimental capture system you're using. More cells means more likelihood of doublets. Anything else? Yep. Yeah, if you change the flow rate. So if you're using a system other than 10x, you can change the flow rate to increase or decrease your chance of doublets. Yeah, so how sticky the cells are and how thoroughly you dissociate them, right? You can change your dissociation protocol to isolate single cells, or be more gentle and then you'll get more doublets. Yep. No, so in theory, you should actually see fewer doublets in single nucleus than single cell, because it's less likely your single nuclei will stick to each other. Whereas your single cells, if you're taking them from a solid tissue, a lot of them are used to being stuck together.
So they will tend to stick together more and you'll get what I call real doublets, where it's a doublet because your cells are actually stuck together rather than just the random chance of them being put in the same droplet. Yep, sure. Well, you mentioned that barcode multiplets are something we can't detect afterwards when we are doing the analysis, and we just assume that this happens and there's nothing we can do. But isn't there a possibility, because I have two barcodes in one droplet, that my total UMI count is going to be too low when I do this exercise? In theory, yes, and you might think that having two barcoded beads in the same droplet will halve the number of UMIs you get for each of those. In practice, there's so much variability in the number of UMIs we capture per cell or per droplet, just in the normal case of one cell per droplet, that we can't really tell that apart. It's the same thing with the doublets. In doublets, you might expect there to be twice as many RNAs, so you'll capture twice as many UMIs. In practice, you can't really tell, and there's so much variability that even when we can detect doublets using the other methods, sure, they tend to be at the higher end, but they're not exclusive to that higher end. There are plenty of real, valid single cells in that higher end. Yeah, that's the same thing at the low end. Yeah. Yeah. Yeah, absolutely. And even in experiments where you don't deliberately do that, we've accidentally found cell-cell interactions because we see an overabundance of doublets with two particular cell types involved. Yeah, so the one we saw it for was macrophages and endothelial cells. We tended to see a particular type of macrophage and a particular type of endothelial cell forming doublets more often than we would expect just based on the math of the randomness. So doublets can actually be interesting. How about the number of cells captured? I mean, we kind of covered this.
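For the "math of the randomness" behind doublets, a common simplification is to treat the number of cells landing in a droplet as Poisson-distributed. That model is an assumption brought in for this sketch, not something measured in the talk, and it ignores the "real doublets" from cells that are physically stuck together. The parameter `mean_cells_per_droplet` and the function names are hypothetical.

```python
import math


def empty_fraction(mean_cells_per_droplet: float) -> float:
    """Fraction of droplets with no cell, under a Poisson loading model."""
    return math.exp(-mean_cells_per_droplet)


def multiplet_fraction(mean_cells_per_droplet: float) -> float:
    """Among droplets that contain at least one cell, the fraction that
    contain two or more cells, under a Poisson loading model."""
    lam = mean_cells_per_droplet
    p_zero = math.exp(-lam)
    p_one = lam * math.exp(-lam)
    return (1.0 - p_zero - p_one) / (1.0 - p_zero)


# Loading more cells per droplet lowers the empty fraction
# but raises the share of cell-containing droplets that are multiplets.
for lam in (0.05, 0.1, 0.3):
    print(lam, round(empty_fraction(lam), 3), round(multiplet_fraction(lam), 3))
```

Under this model the trade-off is built in: keeping doublets rare means most droplets have to be empty. An excess of doublets between two specific cell types, beyond what random loading predicts, is the kind of signal that pointed to the macrophage and endothelial cell interaction mentioned above.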
Yeah, so the number of cells captured is basically how many cells you're loading and what concentration you're loading them at, as well as how healthy those cells are. If your cells are all sick and dying, you're not going to be able to capture that many intact cells.
What about the amount of ambient RNA? Yeah, degraded samples, so making sure your sample is good quality, cell viability. Yeah, if lots of your cells are dying, you're going to have more ambient RNA than if they aren't. Not so much. So yeah, how you're manipulating the cells, and in particular, if you're doing your dissociation for a very long time, you'll have more destroyed cells and a lot more ambient RNA. The other thing you can do is wash your cells before you put them on the 10x machine to try and reduce the amount of ambient RNA. But then, of course, your cells are dissociated for a longer amount of time, so your RNA is going to be degraded more, so it's sort of a trade-off. Depending on your particular system, you may want to wash them or you may not want to wash them. Yep. So we've seen in our data sets that single-nucleus RNA-seq tends to have more ambient RNA, because if you think about the nucleus, it's covered in pores that are designed to export mRNA from the nucleus. So it makes sense that mRNA is getting shunted out of those nuclei into solution while you're loading them on the 10x machine, so you get more ambient RNA. However, it does differ between different papers. Some papers do see this, some papers don't see this, but we do see it, so I believe it, but it's up to you. Current controversy. Yep. Yeah, it shows for other reasons. Well, one, we already sort of talked about: if you use the wrong cell barcode whitelist, then you're going to get an underestimate of the number of cells that should be there.
Other issues would probably be more like if your hydrogel beads are from a really old kit for 10x Genomics and they've started degrading, you might get fewer than you expect, or loading issues. So one thing that's quite common, not using the 10x machine, but if you use something like a Drop-seq machine: it uses beads instead of hydrogel ones, and they can get stuck together, and then you'll get fewer cells captured. Yeah, I don't think Drop-seq has improved much, but lots of the others have improved. But yeah, which technology you use can massively influence how many cells you capture. There's also just random chance, right? We're putting these cells into droplets just at random, so you can get just super unlucky, and you may end up with high doublets, or lots of your cells end up in droplets that don't have a bead in them, just by random chance. So we typically see, in the data that I've looked at, that you can have like a three- or four-fold difference in the number of nuclei or cells you capture, just between different samples, even if you run them all identically, just because of random chance. So you do the experiment over a number of days, yeah. The tissue difference is usually your dissociation. So PBMCs you can capture really well; neurons, not so much. So they're getting destroyed or damaged during the dissociation or the tissue handling. All right, so this is sort of the table of, empirically, how many doublets you get versus the number of cells loaded. I'm not going to go through it, but you can see the more cells loaded, the higher the doublet rate. All right, so we've got our barcode rank plot, so now we want to find which of these droplets contain cells. So here, this is a pretty easy one, right? We've got a whole bunch of droplets up at this high end, where we have our real cells, and there end up being about 5,000 real cells on this curve, just where this knee is. And that's probably how many we were expecting.
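As a rough aside on why loading more cells pushes the doublet rate up: the sketch below assumes that cells fall into droplets independently, so the number of cells per droplet follows a Poisson distribution. That assumption, the function names, and the example loading values are illustrative and not taken from the lecture or from any 10x documentation, and the empirical table mentioned above remains the better guide for a real experiment.

```python
import math


def poisson_pmf(k: int, lam: float) -> float:
    """Probability of exactly k cells in a droplet when the mean is lam."""
    return math.exp(-lam) * lam ** k / math.factorial(k)


def multiplet_fraction(lam: float) -> float:
    """Among droplets holding at least one cell, the fraction holding two or more."""
    p_empty = poisson_pmf(0, lam)
    p_single = poisson_pmf(1, lam)
    return (1.0 - p_empty - p_single) / (1.0 - p_empty)


# Hypothetical mean numbers of cells per droplet (higher = more cells loaded).
for lam in (0.01, 0.05, 0.1, 0.2):
    print(f"mean cells per droplet {lam}: "
          f"multiplet fraction {multiplet_fraction(lam):.4f}")
```

Under this model the multiplet fraction rises steadily with the loading density, which matches the trend in the table. It says nothing about the "real" doublets from cells that are physically stuck together.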
And then we have a whole bunch of stuff down here at like 10 UMIs per droplet. So those are definitely going to be empty. Complexity comes in if we consider a tissue where the cells aren't all the same size, or don't all have the same amount of RNA. So maybe you have a tissue where you have some really big, metabolically active cells and some really small, relatively inert immune cells. Now what happens? So now you get two humps on this plot, one with your big cells, and then a second hump with your little cells. What happens if our sample contains a lot of ambient RNA? Oh, we get the same thing. We have two humps: one where we have high ambient RNA and a barcoded bead, right, so we can capture that ambient RNA, and another hump with all the droplets where we didn't have a barcoded bead. And we have a problem, because this middle hump could either be a bunch of small cells with a low amount of RNA in them, or it could be droplets with a high amount of ambient RNA. So how do we tell the difference? So if you use the old version of Cell Ranger, it didn't even try, and just said this first hump, that's your cells, everything else is not your cells. The newer version of Cell Ranger uses a method inspired by EmptyDrops, where you can have this region at the top where you definitely have cells, and then this region where you have some cells and some not cells. So how does it actually do this? Well, I haven't looked into the Cell Ranger code for what "inspired by EmptyDrops" really means, so I'll just talk about what EmptyDrops does, because you can use it yourself as well, right? So EmptyDrops, to identify cells, says: okay, we've got this bunch of droplets that have really low counts, and we know those definitely don't have cells in them.
So we can use those droplets to estimate the amount of ambient RNA, and what the gene expression pattern for the ambient RNA should be, so which genes have high expression in the ambient RNA, and which genes have low expression in the ambient RNA. We can then take each of our droplets and ask: is the distribution of gene expression significantly different from the ambient RNA? So here our black line is what our ambient RNA should look like, and here's the significance of whether each of our droplets deviates from that distribution. And everything that's below the significance threshold, we say that is a cell, and everything that is above it is not a cell. So here you can see at the high end we have our 100% cells, and then as we go down, we get portions where we have some cells and some not cells. And lastly, I'll just talk about how we identify doublets. So similar to barcode multiplets, you have homotypic doublets, where you have two cells of the same type, and you can't tell those apart from a droplet that has one cell of that type. But you can identify these heterotypic doublets, where you have two cells of different types. There's a whole bunch of methods for this, and they all do basically the same thing. So they first randomly generate doublets from measured cells, take some sort of classification method, and train it to identify these doublets that we've created from the true real cells, and then we take that classifier and apply it to our original data. Because we can calculate the expected number of doublets based on how many cells we loaded, we figure out how many doublets we expect to see, and draw a threshold on our classification score: everything above that threshold is a doublet and everything below it is a single cell. There's tons of methods for this.
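The shared recipe just described (simulate doublets from measured cells, score the real cells against them, then call the expected number of doublets) can be sketched in a few lines. This is a toy illustration under stated assumptions, not the implementation of DoubletFinder or any other package: the nearest-neighbour score stands in for "some sort of classification method", and the function names, the four-gene toy profiles, and the 5% expected rate are all made up for the example.

```python
import random


def normalize(counts):
    """Convert raw counts to proportions so doublets are not flagged by size alone."""
    total = sum(counts)
    return [c / total for c in counts]


def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def doublet_scores(cells, n_synthetic, k, rng):
    """Score each real cell by the fraction of its k nearest neighbours that are synthetic."""
    # Step 1: make synthetic doublets by adding the counts of two random measured cells.
    synthetic = []
    for _ in range(n_synthetic):
        a, b = rng.sample(cells, 2)
        synthetic.append([x + y for x, y in zip(a, b)])
    # Step 2: pool real and synthetic profiles, remembering which is which.
    pool = [(normalize(c), False) for c in cells]
    pool += [(normalize(s), True) for s in synthetic]
    # Step 3: a stand-in "classifier": how synthetic is each real cell's neighbourhood?
    scores = []
    for i in range(len(cells)):
        profile = pool[i][0]
        neighbours = sorted(
            (sq_dist(profile, p), is_syn)
            for j, (p, is_syn) in enumerate(pool) if j != i
        )
        scores.append(sum(is_syn for _, is_syn in neighbours[:k]) / k)
    return scores


def call_doublets(scores, expected_rate):
    """Call the top-scoring cells as doublets, as many as the loading predicts."""
    n_expected = round(expected_rate * len(scores))
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return set(ranked[:n_expected])


def toy_cell(rng, profile):
    """Hypothetical cell: noisy integer counts around a mean profile."""
    return [rng.randint(max(0, m - 10), m + 10) for m in profile]


rng = random.Random(1)
type_a, type_b = [60, 40, 5, 5], [5, 5, 40, 60]  # two made-up cell types
cells = [toy_cell(rng, type_a) for _ in range(40)]
cells += [toy_cell(rng, type_b) for _ in range(40)]
# Four true heterotypic doublets: one type A cell plus one type B cell.
cells += [[x + y for x, y in zip(toy_cell(rng, type_a), toy_cell(rng, type_b))]
          for _ in range(4)]
is_doublet = [False] * 80 + [True] * 4

scores = doublet_scores(cells, n_synthetic=80, k=10, rng=rng)
called = call_doublets(scores, expected_rate=0.05)
print("called as doublets:", sorted(called))
```

Note that synthetic doublets made from two cells of the same type look just like single cells of that type, so real singlets also pick up a non-zero score. That is the homotypic case that cannot be detected this way.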
They all do the same thing, just with different classification methods, different numbers of synthetic doublets, and slightly different ways of generating those synthetic doublets. In the lab, we're going to cover DoubletFinder because it's sort of a nice trade-off between how fast it is and how accurate it is. I would also note that the accuracy score here should go up to one, and we're maxing out at 0.55. So it's not that great, but it's better than nothing. And I'll just quickly summarize. Yeah, so we've talked about all this stuff, and I think I'm about out of time. So any last questions?
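For anyone who wants to see the EmptyDrops idea from earlier in concrete form, here is a minimal sketch: pool the very low-count droplets to estimate the ambient profile, then ask by simulation whether a droplet's counts are plausible under that profile. It is heavily simplified and is not the DropletUtils implementation. It uses a plain multinomial model, a pseudocount of one per gene, and no multiple-testing correction, and the function names, profiles, and thresholds are assumptions made for the example.

```python
import math
import random


def ambient_profile(droplets, max_total):
    """Estimate ambient gene proportions from droplets with very few UMIs."""
    n_genes = len(droplets[0])
    pooled = [1.0] * n_genes  # pseudocount so no gene gets probability zero
    for counts in droplets:
        if sum(counts) <= max_total:
            for g, c in enumerate(counts):
                pooled[g] += c
    total = sum(pooled)
    return [x / total for x in pooled]


def log_likelihood(counts, probs):
    """Multinomial log-likelihood of the counts under the given gene proportions."""
    ll = math.lgamma(sum(counts) + 1)
    for c, p in zip(counts, probs):
        ll += c * math.log(p) - math.lgamma(c + 1)
    return ll


def draw(total, probs, rng):
    """Simulate one droplet with `total` UMIs sampled from `probs`."""
    counts = [0] * len(probs)
    for g in rng.choices(range(len(probs)), weights=probs, k=total):
        counts[g] += 1
    return counts


def ambient_p_value(counts, probs, n_iter, rng):
    """Monte Carlo p-value: how often is a simulated ambient droplet as unlikely as this one?"""
    observed = log_likelihood(counts, probs)
    total = sum(counts)
    as_extreme = sum(
        log_likelihood(draw(total, probs, rng), probs) <= observed
        for _ in range(n_iter)
    )
    return (as_extreme + 1) / (n_iter + 1)


rng = random.Random(0)
true_ambient = [0.40, 0.30, 0.20, 0.05, 0.05]  # hypothetical ambient profile
cell_profile = [0.05, 0.05, 0.10, 0.40, 0.40]  # hypothetical small cell type

empties = [draw(10, true_ambient, rng) for _ in range(50)]  # ~10 UMIs each
ambient_droplet = draw(200, true_ambient, rng)  # bead plus lots of ambient RNA
small_cell = draw(200, cell_profile, rng)       # same size, different genes

probs = ambient_profile(empties + [ambient_droplet, small_cell], max_total=20)
p_ambient = ambient_p_value(ambient_droplet, probs, 200, rng)
p_cell = ambient_p_value(small_cell, probs, 200, rng)
print(f"ambient-only droplet p = {p_ambient:.3f}, small cell p = {p_cell:.3f}")
```

The two test droplets have the same total UMI count, so they would sit in the same place on the barcode rank plot. Only the gene expression pattern separates them, which is the point of the method.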