Okay, hello everyone, and thanks for having me here. My name is Avi Srivastava. Let me start with a brief introduction: I'm a postdoc at the New York Genome Center, where I work with Rahul Satija, and I did my PhD in computer science at Stony Brook, where I worked on quantification for bulk RNA-seq and single-cell RNA-seq. This talk is mostly going to concentrate on my thesis: how we came up with different ideas, and how we can improve quantification specifically for single-cell RNA-seq. Hopefully it will get more interesting as we go along.

Let me start with a brief motivation. Every day we are generating terabytes of data, and lots and lots of new data keeps coming out. The beautiful part of this new data is the resolution we can get from it, and that it is generated at very high throughput. One important aspect is that we are trying to measure many different types of assays, and with different protocols, different methods, and different tools, you need methods that are designed and tweaked for the analysis you're trying to do with your particular assay. Over the last decade or so there have been lots of studies based on different types of assays, measuring different things in many different ways. Around 2006 or 2007 RNA-seq came along, and from that point it just exploded; there has been a huge amount of research on RNA-seq, and it has become a very important part of downstream analysis. Even within RNA-seq, with the introduction of single-cell RNA-seq, a bunch of single-cell protocols came along, each with caveats of their own, and they try to measure RNA in different ways, which we're going to talk about in the coming slides. So you can start to appreciate that with so many different kinds of measurements, you need different tools and methods to analyze them.

There was a beautiful study by Paul Muir and colleagues arguing that, as you keep generating data, the real cost of an experiment also comes down to how much money you spend analyzing it. And it's not just the new data that is the problem: the data that is already out there is huge, and you need methods that can work through it efficiently and give you the discoveries you were looking for when you designed your experiment in the first place. These are the basic motivations for why we need faster, more efficient methods to analyze the data.

Specifically in RNA-seq, lots of studies concentrate on expression and quantification. We talk a lot in terms of quantification, but what you're really interested in is, to give a brief example: which genes and transcripts are turned on and off, at what level they are expressed, and whether they are differentially expressed if you compare a couple of experiments, one with and one without some environment, stimulus, or other additional factor. So just to give a brief summary: you have DNA, a genome; it gets transcribed into RNA, and that gets alternatively spliced into multiple isoforms.
What we do is take these small mature mRNAs and chop them into multiple small pieces, which are generally called reads. So when I say quantification, what I mean is: you are given a reference, which can be a transcriptome (a set of transcripts) or a genome, and, from your experiment, you have a set of reads. What you have to do is take these two things and figure out the frequency, or the count, you would expect for each reference sequence in the transcriptome or genome, given that this set of reads can be assigned to them. Say there are three transcripts and some set of reads; here I'm showing that the first two have one count each and the third one is not expressed at all. So this is what I mean: you have to quantify the set of reads into bins defined by previously known reference sequences, which can be the transcriptome or the genome.

When we talk about these RNA-seq studies, over time they have fallen into two regimes: bulk RNA-seq and single-cell RNA-seq. There's a beautiful analogy that came out of Twitter: bulk RNA-seq is kind of like a smoothie, it mixes and blends everything up, while in single-cell RNA-seq you want to study each fruit separately, disjointly. Just to summarize: in bulk, you take a tissue and sequence everything together, while in single-cell you try to isolate each and every cell, then sequence and study them downstream. You can imagine both have their own caveats and pros and cons. In bulk RNA-seq we typically have millions or tens of millions of cells, and with so many observations you have, let's say, high confidence and high fidelity in your downstream analysis. What it measures is transcript-level abundance at the population level, which is the tissue level. In single-cell, we typically talk about tens of thousands of cells. Single-cell is moving so fast that this slide is already outdated: with combinatorial indexing you can now study millions of cells within one experiment. But in general, you can study tens of thousands of cells, usually at low coverage; in droplet-based sequencing, the reads-per-cell coverage is relatively low. And what it measures is the gene level. Remember, in bulk we studied the transcript level; since the sparsity is high and the reads-per-cell coverage is too low, in single-cell we study, for each cell, the gene-level abundances, and the analysis divides up along those lines. For this talk, we concentrate mainly on single-cell RNA-seq.

Now, there are a bunch of different ways you can perform a single-cell gene expression study. Very briefly: you start with a solid tissue, you dissociate it into single cells, and you isolate the cells. This cell isolation part is very important, and it is generally done in three main regimes. One is micro-wells, which need manual pipetting,
and the throughput there is very, very low. Then there was microfluidics, an automated, integrated system, but the throughput is still low even though the steps are integrated. Then, around 2015, droplet-based technologies came along: Macosko and Klein introduced the Drop-seq and inDrop cell isolation techniques, which use microfluidic chips. The beautiful part of droplet-based sequencing is that you can isolate cells at a much higher rate; the throughput is very, very high. Once you have single-cell isolation, you lyse the cell, extract RNA, reverse transcribe it to generate complementary DNA, and amplify it. This amplification step is important because the amount of cDNA you get from a single cell is way, way too small. Then you perform sequencing, and you generate a single-cell profile. This is where this talk generally sits: going from sequencing data to a single-cell expression profile. Down the line, tomorrow and the day after, Shale is going to talk about using these profiles downstream, and we can discuss more once we get to that stage. For this talk, we're going to concentrate mostly on droplet-based sequencing methods.

Just to give a brief overview of droplet-based methods: there's a microfluidic chip with multiple insertion points. From one side you insert gel bead emulsions, from another side the cells, and from a third side you insert oil. What happens is that a droplet captures the bead and the cell together. In 10x terminology these droplets are usually called GEMs (gel bead-in-emulsions), and within them the molecules get tagged with a cellular barcode and a UMI. The cellular barcode is essentially an ID that can separate the sequencing reads of one cell from those of another; you can imagine the colors of these different droplets representing cellular barcodes, since they are uniquely identifiable. (Here two droplets have the same color, but we'll talk about how that can be worked out.) And what is the UMI? UMI stands for unique molecular identifier. If you zoom in on one cell, there can be multiple molecules within it; how do you separate the molecules within a cell? That's where the UMI comes into the scene, and we can separate them apart in silico. So this is how tagging is usually done in the droplet-based world: you take these tagged sequences, prepare the library, amplify, and sequence. That's the big-picture view of how the sequencing is performed.

Okay, now coming to the computational side. What does such a sequencing experiment generate? A pair of FASTQ files: the first end carries the cellular barcode and the UMI, and the second end is generally the actual sequence from the mature RNA. So you are given a set of reads, all coming from FASTQ files. First, you align them to the reference and group them.
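To make that read structure concrete, here is a minimal sketch of slicing the first end into barcode and UMI. The 16-base barcode plus 10-base UMI layout is an assumption matching 10x v2 chemistry, and the toy read is made up; real pipelines of course parse FASTQ records rather than bare strings.

```python
# Minimal sketch: split read 1 of a 10x-style experiment into
# cellular barcode (CB) and UMI. Assumes 10x v2 lengths: 16 bp CB + 10 bp UMI.
CB_LEN, UMI_LEN = 16, 10

def split_read1(seq: str) -> tuple[str, str]:
    """Return (cellular_barcode, umi) from a read-1 sequence."""
    cb = seq[:CB_LEN]
    umi = seq[CB_LEN:CB_LEN + UMI_LEN]
    return cb, umi

# Toy read 1: the first 16 bases identify the cell, the next 10 the molecule.
r1 = "AAACCTGAGAAACCAT" + "GTACGTACGT" + "TTTTTTT"  # trailing bases ignored
cb, umi = split_read1(r1)
print(cb, umi)  # AAACCTGAGAAACCAT GTACGTACGT
```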
So what do I mean by grouping? You look at, say, the first 16 bases, which represent the cellular barcode, and you group the reads by it; each group should represent the cell those reads came from. Here I'm showing four cells whose different cellular barcodes separate them in silico. The second level of grouping is performed within a cell: separating this read from that read, so that we can assign each read to the gene it came from, and this is where the UMI comes into the scene. And at the end, you perform deduplication. Remember, since the amount of starting material was so sparse, you had to amplify it; so you have to perform deduplication, which we're going to talk about a lot in the coming slides. At the end of the day, you generate a gene-by-cell count matrix, and this is the lingua franca of a lot of downstream analysis: clustering, pseudotime, RNA velocity, and all those things take these gene-by-cell count matrices as input.

If you look at this workflow, it looks very simple; it can literally be done with pandas in Python in, say, ten lines of code. But in practice it can get very, very complicated, and let me start with why, and what problems can occur in these experiments. Let's start with one small pre-PCR molecule. With the blue square I'm showing the UMI sequence, and with the tilde-like trailing part I'm showing the actual RNA content of the molecule. Once we have this molecule, we perform PCR, and each copy gets copied into two; there's a tree that grows, and from this one pre-PCR molecule you get a bunch of amplified sequences. Now suppose one sequencing error happened during the second round of PCR: if you keep copying down the tree from that molecule, you end up with a group of molecules that all carry this one error. Similarly, if this red error happens a bit further up the tree, all of its copies will carry that error. Remember, I was saying you have to deduplicate all these observations: the idea of deduplication is that you have performed amplification, you have this set of reads or observations, and you have to predict how many pre-PCR molecules were initially present in the experiment. The basic idea is to take everything blue within the amplified sequences, group it, deduplicate it, and say this group represents one pre-PCR molecule. If everything had stayed correct, everything would be blue; you would consider it one group and say one molecule was present. But here we have a problem: this group of IDs differs from that group of IDs, and if you deduplicate them naively you end up saying there are three molecules when in actual reality there was only one. So this is one way you can inflate, or sometimes deflate, the true count present in your experiment.
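Since I joked that the naive workflow fits in ten lines of pandas, here is roughly what that naive version looks like, and why it over-counts. This is a sketch with made-up reads, not what any real tool does: a single PCR-round error in the UMI creates a second "distinct" molecule.

```python
import pandas as pd

# Naive pipeline sketch: one row per aligned read, with its cell barcode,
# UMI, and the gene the read mapped to (made-up toy data).
reads = pd.DataFrame({
    "cell": ["AAAC", "AAAC", "AAAC", "AAAC", "TTTG"],
    "umi":  ["GTAC", "GTAC", "GTAA", "GTAC", "CCGG"],  # GTAA = 1-base PCR error of GTAC
    "gene": ["G1",   "G1",   "G1",   "G1",   "G2"],
})

# Naive deduplication: count distinct (cell, umi) pairs per gene.
counts = (reads.drop_duplicates(["cell", "umi", "gene"])
               .groupby(["cell", "gene"]).size()
               .unstack(fill_value=0))
print(counts)
# Cell AAAC gets 2 counts for G1, although all four of its reads came from
# one pre-PCR molecule -- the erroneous UMI "GTAA" inflated the count.
```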
And it's not just PCR and sequencing errors that are the problem. Imagine the following scenario. Look at the top box: there's a group of UMIs which all map to gene A, and I ask you to deduplicate them. You will say, huh, they all look the same; let's deduplicate them and assign one count to gene A. That's fine. Look at the middle one: the same group, but with one error. You say, lots of reads are mapping to gene A, and this is a one-off situation, a single base with a sequencing error. Fine: allow one error in your deduplication and you can still assign one count to gene A. Even then it works out. The real problem comes in when there's multi-mapping. Assume there's a group of reads that maps equally well to two genes, gene A and gene B. Now, if you group them together, allowing one error, you can say there's one molecule present; but which gene are you going to increment? That's a problem. Or consider a different failure mode: say you assign two of the reads to gene A and two to gene B. Then, even though they all come from one pre-PCR molecule, you will group the reads in gene A together and assign one count to gene A, group the reads in gene B together and assign one count to gene B, and you end up predicting two counts when there was actually one. The assignments are mutually exclusive, and how you make them matters. This can explode combinatorially if you have millions and millions of reads mapping equally well to multiple genes.

And this matters a lot for real experiments. Here is an observation on open datasets from 10x Genomics: we looked at five datasets, PBMCs and mouse neurons, and tried to figure out how many reads of the experiment you're paying big bucks for are actually multi-mapping like this. It turns out the fraction can be quite large, from 14% to 23%. And basically almost every tool out there just tosses these reads away; we are throwing away a significant portion of data we already have and could utilize, just to generate a count matrix. This is one of the problems we try to solve in our method, which we're going to talk about shortly.

So, to summarize the challenges in single-cell quantification. First, one UMI is not equal to one pre-PCR molecule: you have to group a bunch of UMIs and deduplicate them to figure out the pre-PCR molecules; that's the deduplication process. Second, one cellular barcode is not equal to one cell: two cells can end up with the same barcode, which is usually called a doublet, and you have to separate them apart downstream. Third, dropout is important, though it has gotten a little controversial over time, so I'm not going to discuss it much. But bias is important.
You can imagine, in these kinds of situations, how you assign gene A versus gene B, and what happens if you consistently ignore a set of multi-mapping reads: the counts for those genes are going to be heavily deflated. If your experiment depends on genes that attract multi-mapping reads, you may see no counts at all, because the methods are just dropping those reads. So this is important, and how it creates bias we'll discuss in the coming slides. The last challenge is UMI collision: even reads coming from two different pre-PCR molecules can carry the same UMI. This is relatively rare, though, because the UMI length is 10 to 12 bases (4^10 is already about a million distinct sequences), and the collision probability decreases significantly as you reach 12; with 10x v3 chemistry, the UMI length is 12. So these rare cases are relatively improbable, and we ignore them for this talk.

Let's see if we have any questions. Michael, did we have any questions, or should I...

So far, I didn't see any raised hands.

All right. Okay, so we discussed all these problems, and we tried to come up with a solution, which is called Alevin. It's a dscRNA-seq quantification method; the "ds" is just a prefix we came up with for droplet-based single-cell RNA-seq. It's designed for, say, 10x Chromium, inDrop, and all the technologies that build on them, like CITE-seq and SNARE-seq, which measure protein levels and ATAC-seq alongside the RNA. These are the families of sequencing protocols Alevin can take care of. So what does Alevin do? Given a cell population, you generate a sequencing experiment, and Alevin goes through a bunch of steps to generate a cell-by-gene count matrix. Remember, that's the thing we're interested in: given the FASTQ reads, you want just a cell-by-gene count matrix, but with less bias, less dropping, and within a principled framework. It involves multiple steps; just to give a brief overview. First we group reads based on cellular barcodes; that part is relatively simple. You take the cellular barcodes, which I'm denoting here by circles: yellow goes to this bin, red and green to theirs, and you separate them so each cell can be processed disjointly. From here on, we process each cell's group of reads disjointly. We perform some barcode correction: we look for barcodes one edit distance away and try to correct them. The important step comes when we take the sequence and have to map it; we're going to talk about the different kinds of alignment that can happen. Once we do the alignment, we perform UMI deduplication, and we're going to go into much more detail about what I mean by that. Once we've deduplicated, we generate the cell-by-gene count matrix, and then we can process it downstream to whitelist cells. When I say whitelist: remember doublets, where one cellular barcode can cover two cells? We can figure out how to separate those apart once we have the cell-by-gene matrix.
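As an aside, the barcode correction step I just mentioned can be sketched like this. This is only the idea (substitutions at Hamming distance one against a hypothetical whitelist); Alevin's actual procedure, with frequency-based whitelisting and ambiguity handling, is more involved.

```python
from collections import Counter
from typing import Optional

BASES = "ACGT"

def correct_barcode(bc: str, whitelist: Counter) -> Optional[str]:
    """Return bc if whitelisted, else the unique whitelisted barcode
    one substitution away; None if nothing matches or it's ambiguous."""
    if bc in whitelist:
        return bc
    candidates = set()
    for i, orig in enumerate(bc):
        for b in BASES:
            if b != orig:
                cand = bc[:i] + b + bc[i + 1:]
                if cand in whitelist:
                    candidates.add(cand)
    return candidates.pop() if len(candidates) == 1 else None

# Hypothetical high-confidence barcodes with their read frequencies.
wl = Counter({"AAACCTGA": 5000, "TTTGGCAT": 3200})
print(correct_barcode("AAACCTGA", wl))  # exact hit
print(correct_barcode("AAACCTGT", wl))  # 1 mismatch -> corrected
print(correct_barcode("GGGGGGGG", wl))  # unrecoverable -> None
```

Okay. So, we talked about this mapping step, right?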
Let me formally define what I mean when I say mapping, or alignment. I'm generally posing it as a read alignment problem. As we said, you are given a set of reads; I'm denoting that set by capital R, and each read r_i can have a different length (that's not very important for our use case or for the definition here). You are also given a reference sequence, which can be a transcriptome or a genome. Given these two things, what you have to figure out, for each read, is the set of locations in the reference where the read maps equally well, within some bound; this bound is called the edit distance bound, and it's basically how much error you want to allow when mapping the read to the reference sequence. The symbol η (eta) denotes that bound. That's how you generally pose a read alignment problem.

There are two major regimes for read alignment: one based on the genome, the other based on the transcriptome. In genome mapping, you are given the big reference genome, and reads can map to intergenic, intronic, and exonic regions, and as spliced alignments. By spliced I mean that, since exons are separated apart on the genome, the alignment has to allow a read to jump from here to here, because the reads are generated from the transcriptome but you're aligning them to the genome. There are a bunch of tools out there; the most frequently used are STAR and HISAT. You map to the whole genome, which for human can be as big as three gigabytes. Both regimes have their pros and cons, but if you map to the genome, the rate of multi-mapping is relatively low (by multi-mapping I mean how many locations a single read can map to); usually a read maps equally well to one or two locations. A typical use case is identifying new transcripts; for non-model organisms, you have to figure out what the transcripts even are. For model organisms like human and mouse, where we know the transcriptome relatively well, we can relax the problem, though it becomes difficult in different ways, and that's what transcriptome mapping tools work on. Tools like Bowtie2, RapMap, and selective alignment map to the transcriptome, which for human is around 300 megabytes. It looks like the problem should be easier, but it gets complicated: assume a read comes from an exon shared by three isoforms; it will map equally well to all three. To put a number on it, 80 to 90 percent of reads multi-map in transcriptome space, because a read can come equally well from an exon shared by multiple transcripts. That complicates the mapping problem: you have to track all the equally good locations and the variation in where a read can map, and that ambiguity is what we later try to resolve through quantification; that's the typical use case. And even within this transcriptome mapping world there's a bifurcation, where you can relax the problem further.
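To pin down that problem statement, here is a brute-force toy version of read mapping: report every position where the read matches the reference within η mismatches (substitutions only, no indels). Real mappers use indexes such as suffix arrays or k-mer hashes rather than this linear scan; the sequences here are made up.

```python
def map_read(read: str, ref: str, eta: int) -> list[int]:
    """Brute force: all start positions in `ref` where `read`
    matches with at most `eta` mismatches (substitutions only)."""
    hits = []
    for start in range(len(ref) - len(read) + 1):
        mismatches = sum(a != b for a, b in zip(read, ref[start:start + len(read)]))
        if mismatches <= eta:
            hits.append(start)
    return hits

ref = "ACGTACGTTTACGTACGA"
print(map_read("ACGTAC", ref, eta=0))  # [0, 10] -- a multi-mapping read
print(map_read("ACGTAG", ref, eta=1))  # same two hits, allowing one error
```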
We are now generating reads in the billions, and we have to perform read alignment as fast as possible without losing accuracy. So we relax the problem, and this is how the two kinds of read processing separate. One is read alignment, as I posed it initially; and one thing I missed is that when we find which read maps to which location, we also have to produce a CIGAR string. What is a CIGAR string? It describes the process by which the read can be converted into the reference sequence, and producing it complicates the whole process a lot. If you relax that requirement, so that you don't need a CIGAR string, just which read maps to which location, that is called the read mapping problem. It is different from the read alignment problem, which needs the CIGAR, the edit script: you have to change this base and that base to actually match the reference sequence. So that is the difference between read alignment and read mapping strategies. Why do we need this relaxation? The answer is speed. If you compare three tools, say Bowtie2, STAR, and RapMap, on 75 million reads of 76 base pairs, and look at how much time they take to map them: with 10 threads, Bowtie2 took somewhere around 420 minutes, and STAR took relatively less; but if you go down to four threads, which is typical for current laptop standards, STAR takes around 40 minutes, while RapMap, which solves the read mapping problem, can do it in minutes. This is why the distinction can become super important. Just to put a note here: you are trading something when you go from alignment to mapping, and we're going to talk in the coming slides about what you're trading and why it's important.
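Since the CIGAR string is exactly what separates an aligner from a mapper, here is a tiny parser showing what it encodes. The M/I/D conventions are standard CIGAR operators, but the example string is hypothetical, and the full CIGAR grammar has more operators (S, N, and so on) that this sketch ignores.

```python
import re

def cigar_spans(cigar: str) -> tuple[int, int]:
    """Return (read_bases, ref_bases) consumed by a CIGAR string.
    M consumes both, I consumes read only, D consumes reference only."""
    read_len = ref_len = 0
    for count, op in re.findall(r"(\d+)([MID])", cigar):
        n = int(count)
        if op in "MI":
            read_len += n
        if op in "MD":
            ref_len += n
    return read_len, ref_len

# A 13-base read aligned with one inserted base and a 2-base deletion
# spans 14 reference bases:
print(cigar_spans("3M1I4M2D5M"))  # (13, 14)
```

With that, let me give you a brief phylogeny, a one-slide rundown of what we've discussed so far.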
Alignment, or mapping, can be performed on the DNA sequence or the RNA sequence; we've been talking about RNA sequencing. Within RNA sequencing there is genome mapping and transcriptome mapping. In genome mapping we perform spliced alignment against the genome; TopHat, STAR, and HISAT are tools you can use. In transcriptome mapping we have a very high multi-mapping rate; you map to the transcriptome, and the tools are Bowtie2, BWA, RapMap, and selective alignment. Within transcriptome mapping there's the bifurcation: either you perform base-to-base alignment, producing the CIGAR string, or, if you don't need the CIGAR and just need where each read maps, you use RapMap. That's how you separate an aligner from a mapper. And you can ask: is it possible to get the speed of a mapper with accuracy as good as alignment on the transcriptome? That's what our recent paper, called selective alignment, tries to do. You do not perform full alignment for every read: you can imagine there's a significant fraction of reads that doesn't need any of the fancy machinery because they match the reference transcriptome exactly. So, as the name suggests, you select the reads that actually have to be aligned, and you align only those. The benefit is that you get speed close to a mapper but accuracy close to regular aligners like Bowtie2 and BWA. That's the selective alignment method; you can check out the paper if you're interested. This was the overview of how the alignment strategies separate out. Okay, let me break here. Michael, any questions?

Can I ask a question?

Yeah, sure.

How does this form of alignment compare to pseudoaligners like kallisto?

Right. So pseudoaligners and RapMap are in a similar kind of, let's say, regime: they are both mappers. RapMap's quasi-mapping and pseudoalignment are in the same category, generally called lightweight mapping techniques, and they do not calculate CIGAR strings. So that's where pseudoalignment falls; I should have mentioned it on this slide, my apologies. Does that answer the question?

Partially. I was also wondering: kallisto is supposed to be really fast, right? How does it compare in speed?

In speed, it is as fast as RapMap and all the lightweight mapping methods; in accuracy, it is not, and that's what we discuss in detail in this paper. As I was saying, when you try to map things very, very fast, you have to trade something off, and what gets traded are some of the complicated cases where you need an actual end-to-end alignment, which cannot be done by lightweight mapping alone. So: speed, yes, as fast as RapMap and the other lightweight techniques; with accuracy, it's not quite there. Does that make sense?

There is another question, from Sebastian.

Yes, I wonder about the 3' bias: you don't have reads from the whole transcript, but only from one end. What consequences does this have? Is it easier, because you see less, and maybe you get more unique counts, at least at the gene level?

Right, that's a very, very good question. We're going to cover the 3' end and all those things in our exercises, but to answer your question briefly:
yes. All, or let's not say all, most of the single-cell technologies use only one end. Even though we have the second end, it's basically the cellular barcode and the UMI; there's no actually mappable sequence in it. So you end up with one sequence, and with one sequence it's harder to align. Imagine the following scenario with paired-end reads: if a pair maps to two different locations, you can take the fragment length, the distance between the two ends of the fragment, and build a model; say at one location the two ends map way too far apart, while at the second location they map at a typical fragment length. Then you can decide between the two locations using the fragment length. But in the single-end, single-cell world there's only one end, which makes the mapping decision problem harder. So, to answer briefly: yes, but the multi-mapping cases are still there, and those are what we try to resolve using an expectation-maximization algorithm, which we'll talk about a bit in the coming slides when we get to UMI deduplication and resolution.

Okay, thanks for your questions, guys. We've talked a bit about alignment strategies; now let's take up the other important aspect we talked about initially, UMI deduplication. Going back to our example: we have cellular barcode, UMI, and the transcript sequence; we align the reads and group them, and then we have to perform UMI deduplication. We have some idea of how to reach this point; how to go from here to the counts is what we want to talk about next.

Over time, a bunch of tools have come out that perform this dscRNA-seq quantification. Cell Ranger has been out since 2017; it's pretty well known and is run by default if you're performing an experiment on a 10x Genomics dataset, and under the hood it uses STAR as the aligner. In 2017 another tool, called UMI-tools, came out, which under the hood also uses the STAR aligner. In 2018 we proposed Alevin, which used Salmon's quasi-mapping and has now started to use selective alignment, which is much more sensitive and much more accurate. Then STARsolo came out recently, which uses the STAR aligner. Then there's Hera-T, an awesome tool from a company called BioTuring; they have performed multiple benchmarks, and under the hood they have their own aligner called Hera. And lastly there are the BUStools, which use kallisto's pseudoalignment strategy. Just to note one point: none of these tools other than UMI-tools and Alevin are published anywhere, and for Cell Ranger there's barely anything known about the method, about what their algorithm is actually doing. There's some gist on their website, and my understanding comes basically from reading their code, but I'll try to summarize what they're doing. And one very important point: no tool except Alevin tries to resolve the multi-mapping reads. There's no principled framework to use them, and every tool is tossing them away. It can bias your analysis toward a group of genes, and if your gene of interest is not among them, you can end up with no counts at all.
So we'll talk about why that can be important. This was just a gist of the different strategies out there, and it's not exhaustive: there's zUMIs, I think, and a couple of other tools that use the STAR aligner under the hood and try to replicate Cell Ranger. But this gives you a basic rundown of the quantification tools.

Let me now pose the UMI deduplication problem, and we'll walk through how the different methods try to solve it. Remember how each cell has been separated out: before performing UMI deduplication, you segregate the reads into multiple bins, where each bin represents one cell, one cellular barcode. So, for each cell, you are given a set of UMIs with their frequencies and their mapping locations; remember, UMI deduplication happens after mapping. And what you have to figure out is: deduplicate them, and tell me the number of pre-PCR molecules, that is, how many molecules were actually present in the experiment before the PCR cycles.

UMI-tools came up with their directional approach; let me give you a gist of what they're doing. Each circle represents a UMI, and the sequence shown is the UMI sequence (it's usually longer, but this is a toy example). You start with the highest-frequency UMI within one cell and look for its one-edit-distance neighbors. Here, since only the T changed from this sequence, this is a one-edit neighbor; the C changed, that's another; the A changed, another. And you connect them using these arrows. The number in the middle of each circle gives the frequency of that UMI, and the gene label, G1, tells you which gene that set of UMI-tagged reads maps to. So, to read one circle: this UMI has the sequence ACGT, all of its reads map to gene G1, and there are 456 such reads within this one cell; similarly for the other circles. Once you've connected the one-edit-distance neighbors, you extend recursively: you look for UMIs one edit away from this ACAT, find that only the G changes, and connect them. Now you might ask: why not connect these two, when there's only a difference of one base between them? That's where they propose another constraint: when connecting, the frequency of the circle you're connecting to has to be less than half the frequency of the circle you're connecting from. Here, since 72 is less than half of 200-something, you make that connection; of these two candidate UMIs, only one was less than half, so you connect that one and not the other. Once you have this network, the task is basically to find connected components: you take all the circles connected through the network and say, these all represent one molecule, and this other component represents a second molecule.
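In code, that directional collapse looks roughly like this. It's a sketch assuming the published UMI-tools rule that UMI a connects to UMI b when they are one substitution apart and count(a) >= 2*count(b) - 1 (the "less than half" criterion I paraphrased), with connected components standing in for molecules.

```python
def hamming1(a: str, b: str) -> bool:
    return len(a) == len(b) and sum(x != y for x, y in zip(a, b)) == 1

def directional_components(counts: dict[str, int]) -> int:
    """UMI-tools-style directional collapse: the number of connected
    components is the estimated number of pre-PCR molecules."""
    # Connect a -> b when one edit apart and count(a) >= 2*count(b) - 1.
    adj = {u: set() for u in counts}
    for a in counts:
        for b in counts:
            if a != b and hamming1(a, b) and counts[a] >= 2 * counts[b] - 1:
                adj[a].add(b)
                adj[b].add(a)  # treat as undirected for component counting
    seen, components = set(), 0
    for u in counts:
        if u not in seen:
            components += 1
            stack = [u]
            while stack:
                v = stack.pop()
                if v not in seen:
                    seen.add(v)
                    stack.extend(adj[v])
    return components

# Toy UMIs within one cell, all mapping to the same gene.
print(directional_components({"ACGT": 456, "ACGA": 72, "TCGT": 2}))  # 1
```

On this toy network, the 456-count UMI absorbs both of its one-edit neighbors, so the whole component counts as a single pre-PCR molecule.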
So this is how UMI deduplication is performed in UMI-tools.

There's a question, sorry to interrupt. Is there a minimum UMI length for UMI deduplication? Based on what you're showing here, it probably wouldn't work for four or six base pair UMIs.

I'm assuming you're talking about edit distances and how far apart they can be? I didn't quite understand the question.

Note the length of the UMI, which is four bases here. What is the minimum length at which this makes sense?

Right. There's a beautiful paper called dropEst that looks at this. To give you a little rundown of why it's important: you can imagine two pre-PCR molecules that happen to have the same UMI sequence; this is called a UMI collision. What is the probability of that? It is relatively low, based on the evidence, if you take a UMI sequence that is long enough, and the long-enough criterion is usually around 10 bases. But as we do deeper and deeper sequencing, even 10 is not enough; that's why in the latest chemistry, 10x v3, they use 12-base UMI sequences. The four-base UMI here is just a toy example; that's why it's used in the paper. So at 10 to 12 bases, I would say the probability that you will have a collision is relatively very low, and you can avoid the errors that would come from that. Does that make sense? I'm assuming it does.

Yep, thank you.

Okay. So that's the UMI-tools deduplication strategy. Now imagine the following scenario; I was talking about gene-level multi-mapping and why it matters. Take the network we had, and suppose the UMI with 72 counts, instead of mapping just to G1, maps to both G1 and G2. What happens? You just toss it away; all the reads coming from that circle get discarded. And if you toss away the multi-mapping reads, the network breaks, and if the network breaks you end up predicting inflated counts. In this specific scenario there will be no connection from this circle to that one, since they are more than one edit distance apart, and you will group these together as one component, these as a second, and these as a third: you end up saying there were three pre-PCR molecules. So this is one way it can mess up your counts; there are multiple other scenarios, which we can discuss offline, but this is one of them. And that was the gist of UMI-tools' directional approach.

Now, Cell Ranger. Again, this is very brief, because it's my understanding from their code, but what they do is actually reverse the situation: where we assign UMIs to a gene, they assign a gene to each UMI. Within one cell, again after binning and after mapping, suppose we observe this UMI, ACAG, mapping to G1 with some frequency, and then the same UMI mapping to another gene with frequency nine. What they do is look at all the genes this UMI could have come from within the cell and assign it to the one with the highest frequency. So this is how Cell Ranger deduplicates its UMI counts.
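That gene-to-UMI assignment, as I understand it from their code, is something like the following sketch; this is my reading, not their actual implementation, and the numbers are toy ones.

```python
from collections import defaultdict

def highest_frequency_counts(records: list[tuple[str, str, int]]) -> dict[str, int]:
    """records = (umi, gene, read_frequency) within one cell.
    Assign each UMI to the gene with the highest read support,
    dropping ties, then count one molecule per assigned UMI."""
    support = defaultdict(dict)
    for umi, gene, freq in records:
        support[umi][gene] = support[umi].get(gene, 0) + freq
    counts = defaultdict(int)
    for umi, genes in support.items():
        best = max(genes.values())
        winners = [g for g, f in genes.items() if f == best]
        if len(winners) == 1:          # ties are discarded
            counts[winners[0]] += 1
    return dict(counts)

# The UMI ACAG has 9 reads consistent with G1 and 8 with G2:
print(highest_frequency_counts([("ACAG", "G1", 9), ("ACAG", "G2", 8)]))  # {'G1': 1}
```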
But imagine the following scenario, again with multi-mapping. Say there's a bunch of reads mapping equally well to gene 2 and gene 3, a separate gene, with 8 counts among them; basically, those get tossed away. Now suppose some algorithm could have assigned all of those counts to G2: the overall count supporting G2 would become 8 plus 9, around 17, and under the highest-frequency rule you would assign the UMI to G2; but instead, having thrown the multi-mapping reads away, you assign it to G1. So this is one example of how Cell Ranger's counts can go wrong; this particular case is not deflation exactly, but a result different from what is expected, and the highest-frequency-based approach can mess things up. Then there are the pseudoalignment-based strategies (I didn't add a slide, but briefly): you look at the groups of UMIs per gene and assign everything at the gene level; say you look at gene 1, take its group of UMIs, deduplicate them, and assign the counts to G1. That is another way to deduplicate UMIs, and again, if you have multi-mapping reads, you end up tossing them away with no idea of how to resolve them.

There is another question.

Sure. Sebastian, would you like to ask it yourself?

No, that's not a question from my side.

Oh, okay, you raised your hand. Oh, sorry, Avi, for the interruption.

Yeah, no problem, it's good; I like it when the session is interactive.

So, to show how we try to solve the problem: we reformulated it a bit, and how we formulate it is important. We use a graph-based approach, similar to UMI-tools, but to define it, let's formally define a graph: a graph is composed of vertices and edges. First we have to define the vertex. A vertex is a tuple of two things. The first is the equivalence class: the set of transcripts that a specific set of reads maps to. The second component of the tuple is the UMI sequence, the actual nucleotides. Combine these two together and that creates a vertex. With each vertex, just like in UMI-tools, we have a frequency, given by c(v). Once we have the vertices, we connect them through edges, and we define two types of edges: bidirectional and unidirectional (directional). It will get much clearer with the cartoon, so I'll skip the detailed equations. In the cartoon, a circle is a UMI, and the polygon inside shows the UMI sequence: if the polygons are the same, the UMIs have the same sequence; if the polygons have different numbers of edges (three here, four here), I'm showing they are one edit distance apart. So what do I mean by a bidirectional edge? You draw a bidirectional edge between two vertices that have the same UMI sequence and share at least one transcript in their equivalence classes. Take this example: we have a UMI whose equivalence class is {t1, t2}, which means the UMI maps equally well to both, and you cannot tell from the mapping which transcript it came from; and there's another group of reads, with the same UMI, that maps just to t1. You assign a bidirectional edge, meaning these two vertices can be collapsed into each other.
What is a directional edge? It's similar to the directional notion in UMI-tools. If there are two UMIs, one with equivalence class {t1} and another with equivalence class {t1, t2}, they share at least one transcript, but their frequencies differ, say five and twenty; if the lower one is less than half of the higher one, you assign a directional edge, which means the lower one can be collapsed into the higher one. So: bidirectional means collapsible into each other; directional means one collapses into the other. That's how we define edges and vertices. Deep breath here.

Now, how do we pose the problem? We are given this UMI resolution graph: the set of vertices we defined and the set of edges connecting them. What we have to find is a minimum cardinality cover by monochromatic arborescences. That's a fancy way of saying we have to group the UMIs together so that each group represents one pre-PCR molecule, while preferring parsimony; what I mean by parsimony in this specific scenario will get much clearer in the cartoon. In computer science fashion, we proved that the problem is hard, essentially exponential to solve exactly, through a reduction from an already hard problem, dominating set. For this talk that's not so important, because the instances we actually have to solve are much less complicated than the asymptotic worst case of an NP-complete problem.

This cartoon will make it much clearer. Again, the polygons represent UMIs, and the gray bar I'm showing is the read sequence. This group of reads maps to two transcripts, T1 and T2, which come from two disjoint genes, G1 and G2; these reads map equally well to both T1 and T2, while this other group of reads maps just to T1. We build the network using the rules from before (I've removed the frequencies; they're not important here, because all we need are the edges). This UMI's equivalence class is {T1, T2}, this one's is just T2, and this one's is T1. Now we have to group them together, taking the multi-mapping into account, and figure out the pre-PCR molecules. There can be multiple ways to resolve it. Suppose we assign the {T1, T2} vertex to T2: then the T2-consistent vertices group together like this, this T1 vertex separates apart, and the bottom T1 vertex separates apart as well, so you would report three deduplicated pre-PCR molecules. But another solution is to assign it to T1: then all of these group together, and only this one separates apart as T2, and you end up predicting two arborescences (an arborescence here is just standing in for a pre-PCR molecule). And we select this second solution in Alevin, because we are looking for parsimony: we resolve the graph, over all the possible combinations, into as few components as possible.
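Here is a minimal construction of that UMI resolution graph following the definitions above: a vertex is (equivalence class, UMI sequence), a bidirectional edge joins same-UMI vertices whose equivalence classes share a transcript, and a directional edge joins one-edit-apart UMIs with shared transcripts under the 2*count - 1 frequency rule. It's a sketch with toy data, not Alevin's implementation.

```python
# Sketch of the UMI resolution graph from the definitions above.
# A vertex is (equivalence class, UMI); c[v] is its read frequency.
Vertex = tuple[frozenset, str]

def hamming1(a: str, b: str) -> bool:
    return len(a) == len(b) and sum(x != y for x, y in zip(a, b)) == 1

def build_edges(c: dict[Vertex, int]):
    bi, uni = [], []
    verts = list(c)
    for i, u in enumerate(verts):
        for v in verts[i + 1:]:
            (eq_u, umi_u), (eq_v, umi_v) = u, v
            if not eq_u & eq_v:            # must share at least one transcript
                continue
            if umi_u == umi_v:             # same UMI -> bidirectional
                bi.append((u, v))
            elif hamming1(umi_u, umi_v):   # one edit apart -> directional
                if c[u] >= 2 * c[v] - 1:
                    uni.append((u, v))
                elif c[v] >= 2 * c[u] - 1:
                    uni.append((v, u))
    return bi, uni

c = {
    (frozenset({"T1", "T2"}), "ACGT"): 20,  # multi-maps to T1 and T2
    (frozenset({"T1"}), "ACGT"): 5,         # same UMI, unique to T1
    (frozenset({"T1"}), "ACGA"): 5,         # one edit away
}
bi, uni = build_edges(c)
print(len(bi), "bidirectional,", len(uni), "directional edges")  # 1 and 1
```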
But the important question remains multi-mapping. It is possible that every UMI in the network is multi-mapping: these two UMIs come from {T1, T2}, and this one is also coming from {T1, T2}. If they are all coming from the same gene, fine, you just say they come from that one gene and are done with it; but if they span different genes, then after you perform deduplication, which gene do you assign the counts to? That is significantly important, and we resolve it through an expectation-maximization algorithm. To give a sense of what it does and how it works: we first separate the problem into three components, tier one, tier two, and tier three, ranked by how hard the instances are. Tier one problems are the easiest to solve: everything deduplicates cleanly within G1 and within G2, and you report, say, four counts for gene 1 and two counts for gene 2 (the slide says T2, but it should be G2). What are tier two problems? You have a group of UMIs mapping ambiguously to {G1, G2} with three counts, and another UMI with two counts mapping to only one gene, G2. So you can assign the two unique counts to G2, that's fine, and then you could just split the ambiguous three half and half, 1.5 to G1 and 1.5 to G2. But since there is some unique evidence that reads are coming from G2, most probably these ambiguous reads are also coming from G2, so you can assign G2 the higher weight; that's how it ends up at 3.5. Tier two problems have some unique evidence to lean on. The real problem is when there is no unique evidence at all; single-cell data is very, very sparse, and these problems are very hard to solve, because there is nothing unique to anchor on. Here you cannot do better than splitting 1.5 and 1.5. We call these tier three genes, the hardest problems to solve; we have another paper where we try to resolve them by sharing information across cells, but that's beyond the scope of this talk. So that is how we set up the problem, and the EM tries to solve it; that's the, let's say, bird's-eye view.
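To see how the EM uses unique evidence, here is a toy EM-style allocation in the spirit of what I described, with hypothetical counts: four molecules unique to G1, two unique to G2, and three ambiguous between them. The unique evidence splits the ambiguous three 2:1 instead of tossing them away.

```python
def allocate(umi_groups, n_iter: int = 50) -> dict:
    """EM-style allocation sketch. Each group is (candidate_genes, count):
    `count` deduplicated molecules whose reads map equally well to every
    gene in `candidate_genes`. Unique evidence reweights ambiguous counts."""
    genes = set().union(*(g for g, _ in umi_groups))
    theta = {g: 1.0 / len(genes) for g in genes}      # uniform start
    for _ in range(n_iter):
        expected = {g: 0.0 for g in genes}
        for cand, count in umi_groups:                # E-step: split counts
            total = sum(theta[g] for g in cand)       # proportionally to theta
            for g in cand:
                expected[g] += count * theta[g] / total
        norm = sum(expected.values())                 # M-step: re-normalize
        theta = {g: e / norm for g, e in expected.items()}
    return {g: round(e, 2) for g, e in expected.items()}

# Four molecules unique to G1, two unique to G2, three ambiguous:
print(allocate([({"G1"}, 4), ({"G2"}, 2), ({"G1", "G2"}, 3)]))
# -> {'G1': 6.0, 'G2': 3.0} (order may vary): the ambiguous three split
#    2:1 following the unique evidence, rather than being discarded.
```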
So let's get to why this is important and why you should really care. This is a fascinating example; I really love it, because it makes the point jump out. What I did here was take the set of all genes, for human, assign each a score, and bin them: score bin 0.1 means that set of genes scored less than 0.1, and similarly for 0.2, 0.3, and so on. What is that score? It's a gene uniqueness ratio: we parse the nucleotide sequence of each gene and figure out what fraction of its sequence is unique to that specific gene. If the gene uniqueness is one, it means the gene's sequence is not shared with any other gene at all; in that situation there can be no multi-mapping, because no sequence is shared, and every read gets assigned to that specific, unique gene. As you go toward the left the numbers keep decreasing: a score of 0.1 really means that 90% of the gene's sequence is shared across multiple genes. You can imagine our algorithm will work particularly well, or at least relatively better, for these kinds of genes: where all the other tools see multi-mapping and toss the reads away, we end up keeping them in the analysis. So the x-axis is the gene uniqueness, and your mental model should be that, as we go toward the left, Alevin should perform better than Cell Ranger or any other tool. What is the y-axis? If you perform UMI deduplication, you have some counts after deduplication and some counts before; this number is the ratio of post- to pre-deduplication counts, and that ratio should stay relatively constant, regardless of which genes you are looking at. And you see that Alevin and Cell Ranger have a similar deduplication ratio for the high-confidence genes on the right, but as you keep going left Cell Ranger keeps dropping, because it just keeps throwing reads away, while, and this is the fascinating part, Alevin keeps a similar trend all the way to the left. One thing to note: a significant portion of genes are on the right-hand side of this spectrum, but there is a bunch of genes on the left side as well, and if your experiment concentrates on those, then it becomes super important to consider these kinds of methods.

This is the last slide, I think, but it shows how we can compare this to, let's say, bulk RNA-seq. We took a sample assayed with bulk RNA-seq and the same sample studied with single-cell RNA-seq, combined all the cells of the single-cell experiment into a pseudo-bulk, and compared the correlation between the actual bulk and the pseudo-bulk; you would hope that correlation does not depend on the gene uniqueness ratio. The y-axis is the Spearman correlation of the pseudo-bulk of each experiment against the actual bulk, and we show that Alevin, as you keep going left, has higher correlation with the actual bulk experiment. There are instances that remain genuinely difficult, which we show through this tier matrix: remember tiers one, two, and three, with tier one the easiest. You see that tier one is concentrated relatively on the right side and tier three on the left, where the problems get harder to solve. So this motivates why we should be thinking about and working on these kinds of methods. And lastly, it's super fast; it shouldn't be otherwise, since we've been saying from the start that computational efficiency is really important. Compared with Cell Ranger and UMI-tools, Alevin is almost an order of magnitude, sometimes two orders of magnitude, faster than running Cell Ranger, and it uses much less memory. Just to end: we are working a lot on extending this to multiple technologies. As I said at the start, Alevin can work with all the downstream single-cell technologies that build on droplet-based techniques: combinatorial indexing, spatial quantification, RNA velocity, CITE-seq data, SNARE-seq data combining ATAC-seq and RNA-seq. There's multimodal information coming, new technologies are coming out, and Alevin can work with them. And we're working extensively on writing tutorials to help people out.
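Before I close, as a concrete footnote on that comparison: collapsing a gene-by-cell matrix into a pseudo-bulk and correlating it against the matched bulk boils down to something like this. The matrices here are hypothetical, randomly generated stand-ins; only the shape of the computation is the point.

```python
import numpy as np
from scipy.stats import spearmanr

rng = np.random.default_rng(0)
# Hypothetical gene-by-cell matrix and a matched bulk profile.
single_cell = rng.poisson(2.0, size=(1000, 500))   # 1000 genes x 500 cells
bulk = single_cell.sum(axis=1) + rng.poisson(5.0, size=1000)

pseudo_bulk = single_cell.sum(axis=1)              # collapse all cells
rho, _ = spearmanr(pseudo_bulk, bulk)
print(f"Spearman correlation: {rho:.3f}")
```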
I hope this was useful, and I'm open to questions if you have any.