Okay, so, module seven: we're going to be talking about single cell RNA sequencing. I've been following some of the discussion on the Slack about single cell DNA as well, so some of those lessons are also going to come into play here. If anything is unclear, especially the bridge between the world of DNA and RNA, this upfront section is probably the best place to cover that off. This is all Creative Commons, of course; Francis has really highlighted how important this part is. Everything here is either published or on a website somewhere, so certainly share and share alike, these slides are free for you to use. So, module seven. My name is Trevor Pugh. I'm a senior scientist at Princess Margaret and the director of Genomics at OICR, and I'm sharing this time with Javier Diaz. I'll let you introduce yourself. Thank you Trevor, I will open my video in a minute. But yes, my name is Javier Diaz. I worked with Trevor as a scientific associate for about three years at UHN's Princess Margaret Cancer Centre, and starting in March this year I joined a startup called Phenomic AI. We do cancer analysis using machine learning and artificial intelligence, and I'm the head of data science there. And I'll also highlight that Javier is the organizer of the Toronto single cell RNA-seq working group, which I'll plug at the end. I think some of you are already members, but I'm happy to continue interactions with this group through that working group as well. The goals for this afternoon are to know the difference between bulk and single cell sequencing. It seems obvious, but there are a lot of little technical caveats, especially in the upfront pre-sequencing steps, that are really quite fundamentally different. I'm also going to talk largely about how single cell sequencing is being applied specifically to cancer samples and how it differs from work on non-cancer samples.
The caveats: just like in RNA sequencing, and again following the Slack conversations, there's just a lot of nuance around how you generate your data, and all those caveats apply to single cell as well. Probably the most exciting part of this is really the hands-on work with real published data; most of the figures towards the end of my talk, Javier will essentially teach you how to make on your own. So, first two pieces: what are the technologies, and then I'm going to use the glioblastoma stem cell data set that we recently published as an example of all the things you can pull out and derive from single cell RNA sequencing data. Then, picking up that exact same data set, Javier will take you through what a single cell analysis looks like, right from reads all the way through to data sharing. Okay, so we're going to talk about technologies to start. There are so many now-clichéd ways to communicate single cell sequencing and how it compares to bulk; this is my version. Really thinking of tumors, and I'm going to be careful with the language here: a tumor is the collection of all the cells that make up an oncogenic mass, and tumors contain a diversity of cancer cells, immune cells and other cell types. I'm going to show a bulk myeloma bone marrow next to a bowl of M&Ms, because conceptually it's kind of the same. What defines a blue M&M? What defines a cancer cell? What defines an infiltrating blood vessel or a T cell? The challenges are really the same: if you're only interested in blue M&Ms, or blue Smarties, how do you go through and pick out those cells specifically? We really couldn't do that with bulk sequencing; the old approach would be to just grab a handful of M&Ms and eat them, or analyze them. Now we don't have to do that: you can literally pick through the bowl and ask different questions of different types of cells.
And this is the formal way to show that: single cell transcriptomes really let you see heterogeneity that you couldn't have seen using bulk sequencing. The bulk approach is essentially: take these six cells, grind them up, and you get an average. So for six green dots you get a single square at six. Now we get a different gene expression level for literally every single cell, but there are a lot of technical caveats that come with that increased resolution, specifically our ability to fully read out all genes being expressed by a single cell; we'll go into that specific topic in a little more detail. But before we get there I really wanted to go through the generic approach to single cell sequencing. If you try to keep up with the single cell literature you'll rapidly become overwhelmed by the huge number of ways to process a sample. How do you digest the tissue? Do you use flow cytometry? Are you working with a mouse? There are lots of ways to get single cells floating in a tube. I'm not going to belabor that point, but there really are lots of ways to get to a single cell suspension. From there, the goal is to individually barcode each of your pieces of RNA for each cell, and there are lots of ways to do that. We're going to be talking a lot about the 10x Genomics platform, but certainly there have been very manual ways using flow cytometry, putting each cell in an individual well; with 10x Genomics you basically pull each cell into an oil droplet; and there are ways to do this with antibodies now. This molecular barcoding step probably doesn't have as many approaches as tissue digestion, but there are still lots of ways to barcode RNA and assign it to a cell. Probably the most generic step at the end is sequencing, which is still the common currency, and Illumina really has a dominant position in the sequencing market.
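To make that barcoding idea concrete, here is a minimal sketch of how a droplet platform ties each RNA molecule back to its cell of origin. It assumes 10x-style Read 1 lengths (a 16 bp cell barcode followed by a 12 bp UMI, as in the v3 chemistry); the read sequences and gene name are invented for illustration.

```python
# Sketch of droplet-style cell barcoding, assuming 10x v3-like Read 1
# structure: 16 bp cell barcode + 12 bp UMI. Sequences/genes are invented.
from collections import defaultdict

BARCODE_LEN, UMI_LEN = 16, 12

def parse_read1(seq):
    """Split a Read 1 sequence into (cell barcode, UMI)."""
    return seq[:BARCODE_LEN], seq[BARCODE_LEN:BARCODE_LEN + UMI_LEN]

def count_umis(read1_gene_pairs):
    """Collapse duplicate UMIs so each molecule is counted once per cell/gene."""
    seen = defaultdict(set)                      # (cell, gene) -> unique UMIs
    for seq, gene in read1_gene_pairs:
        cell, umi = parse_read1(seq)
        seen[(cell, gene)].add(umi)
    return {key: len(umis) for key, umis in seen.items()}

reads = [
    ("AAACCCAAGAAACACT" + "TTTGGGAAACCC", "EGFR"),  # molecule 1
    ("AAACCCAAGAAACACT" + "TTTGGGAAACCC", "EGFR"),  # PCR duplicate of molecule 1
    ("AAACCCAAGAAACACT" + "GGGTTTAAACCC", "EGFR"),  # molecule 2, same cell
]
print(count_umis(reads))  # {('AAACCCAAGAAACACT', 'EGFR'): 2}
```

The key design idea is that PCR duplicates share a UMI, so collapsing on the (cell barcode, UMI) pair recovers molecule counts rather than read counts.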
So a lot of the data that comes out, certainly in terms of FASTQs, really does look the same, and there aren't really that many different approaches to sequence the DNA and align it. Then there's a lot of debate around how to normalize and cluster, and it's particularly challenging in cancer samples; we'll go through that in more detail as well. But this, in general, is the generic approach to a single cell sequencing experiment, and especially as bioinformaticians and data analysts, your work is, I'd say, sort of down here in the third box. It's extremely important to understand how the two upfront processes were done, because that will really totally change how you approach single cell data. And this paper is basically a review of all the ways that you can isolate those single cells. In the very early days of single cell sequencing it was literally a pipettor and a single cell suspension: you just pipette each cell one by one into a well and do what's essentially a bulk RNA sequencing reaction in a single well on a single cell. Other approaches: FACS is probably the other big one, using a flow sorter rather than a pipettor to pull out and assign cells to individual wells. There's also laser capture microdissection, which I actually did a lot as a graduate student: literally a laser to cut around a cell of interest, and you pull it out. More conventionally, this is how the 10x Genomics platform really works: you digest your tissue into a single cell suspension, and then that is loaded into a microfluidic device. This works essentially because you can deliver your single cell suspension into an oil droplet, and that oil droplet has the reagents you need, so it's conceptually very similar to a plate, but you can do this in much higher throughput. And there are extremely fancy ways to pull out specific cell populations, for example looking at circulating tumor cells.
So essentially using antibodies or other ways to capture cells, to lift those cells out of suspension. Suffice to say, this step here is probably the step that will influence your data the most downstream, and it's very important to control a lot of the parameters around how exactly you're isolating and pulling out your single cells. Once you have your single cells, there are lots of ways to generate the RNA sequencing libraries. I put this title in quotes because it's literally the title of the paper cited at the bottom of the slide. So like the DNA sequencing capacity plots, you can really just see the huge increase in the number of cells that can now be profiled in a single experiment, and this has just continued to grow since this paper in 2018, with lots of different ways to get there. At the end of the day the concept is really the same: it's essentially an RNA profile of a single cell, profiled to varying levels of depth in terms of transcripts per cell, but also breadth in terms of coverage across a full RNA transcript. The two dominant technologies currently for generating single cell sequencing data are the 10x Genomics Chromium system, which essentially profiles end reads of your RNA from either the 3' end or the 5' end, and the still very widely used Smart-seq2 technology. I did want to go through these in a little more detail; this paper goes into even more detail than this, but I did want to hammer home the effect of these two different approaches on what your data will look like when it comes out the other end of the sequencer. The big difference here is specifically this poly-A priming, and actually both technologies use poly-A priming, so that's already caveat number one: if you're really interested in transcripts that don't get poly-A primed, for example circular RNAs, don't go looking for them, because they're very unlikely to be found if your transcript is not polyadenylated.
So in the 10x method you essentially prime off that poly-A tail, and it amplifies what was previously thought to be a relatively short distance, which actually turns out to be quite a bit longer than previously thought. You then amplify those transcripts by PCR, and then there's this tagmentation approach, which essentially inserts randomly into the transcript and results in a little tiny short tag here at the end. The big difference with Smart-seq2 is you actually read to the end of the transcript, and then there's this little step at the end called template switching: when you get to the end of the transcript, the enzyme starts appending C's, and that essentially adds your barcode in at this point. And then it's basically the exact same idea: this tagmentation approach delivers the sequencing adapters randomly throughout the entire transcript. The huge benefit of Smart-seq2 is that you're actually getting full end-to-end reading of the entire transcript, whereas the 10x Genomics approach really just gives you this little short tag, sufficient for gene expression but not going to let you get any of the exon usage or potential splice isoforms or fusion transcripts that you would get, and that we're used to, from bulk RNA sequencing. As a result, there's also a huge cost difference: Smart-seq2 is extremely expensive because you have to pay to sequence across the whole transcript, while with 10x Genomics you're just sequencing the very end, so your cost per cell is much lower. You can almost tell which technology they used just by reading the abstract: a lot of Smart-seq2 papers are in the hundreds of cells, and 10x Genomics papers are usually in the thousands, now often tens or even hundreds of thousands of cells being profiled. And this is really the effect on the data: even if no one told you what technology it was, you could tell just by looking at the coverage of reads across the transcript.
So here's what Smart-seq2 data looks like. Here's a generic transcript, a little three-exon gene. In Smart-seq2 you see great coverage across an exon, nothing in the intron, great coverage across another exon, nothing in the intron. It looks like bulk RNA sequencing data, but it's at the single cell level. Contrast that to 10x Chromium: you get this huge spike right at the 3' end and essentially no coverage elsewhere. It turns out, since you are poly-A priming, that if there's a little internal stretch of A's, those regions do get amplified as well, so depending on your bioinformatics pipeline you may only look at this 3' region. Increasingly we're starting to use this off-target poly-A priming as an approach to start to call mutations deeper into the transcript, because those data are there; it's just a question of whether your pipeline is actually utilizing that information. As I alluded to a little bit earlier, in general you're just able to financially afford way more cells using 10x Chromium, because you're just sequencing this little end, whereas with the Smart-seq2 approach you basically have to pay for the sequencing reads mapping to all these exons. So in general you're sequencing fewer cells, but you are getting a deeper profile. In this example here, on average you're getting north of 7,500 genes per cell with Smart-seq2, whereas in the Chromium approach there is this issue of transcript dropout, where you're actually seeing 4,000 or even fewer genes being detected per cell. Trevor, question. Yep. So why do you see that plateauing effect when you're using the Chromium? It's not 100% understood. It is thought to be molecular dropout during the amplification within a droplet. So essentially you deliver the cell, you then lyse the cell and all the barcodes get delivered to it, and then there is some loss here, partly due to this tagmentation effect, because you're only capturing the transcripts that happen to be sampled.
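The genes-per-cell comparison above is easy to compute yourself once you have a count matrix. A minimal sketch with an invented toy matrix:

```python
# Computing "genes detected per cell" from a toy cells x genes UMI matrix
# (values invented); this is the metric behind the Smart-seq2 vs Chromium
# comparison above, just at toy scale.
import numpy as np

counts = np.array([
    [5, 0, 2, 0, 1],   # cell 0: 3 genes detected
    [0, 0, 1, 0, 0],   # cell 1: 1 gene detected, a dropout-heavy droplet
    [3, 2, 4, 1, 6],   # cell 2: all 5 genes detected
])

genes_per_cell = (counts > 0).sum(axis=1)   # 3, 1, 5
umis_per_cell = counts.sum(axis=1)          # 8, 1, 16

# A simple QC gate: keep droplets detecting at least 2 genes.
keep = genes_per_cell >= 2                  # True, False, True
print(genes_per_cell.tolist(), umis_per_cell.tolist(), keep.tolist())
```

Real pipelines apply the same idea at full scale, with thresholds chosen per data set rather than the arbitrary cutoff used here.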
So it's essentially, I wouldn't call it an artifact, but it's a function of how the RNA is prepared in the lab. I see. Yeah, and they have been improving it since the first version; they're on version four of the kits now, I think, the experimental kits, and they've been increasing the number of genes and number of reads that can come from each individual cell. It's basically a physical limitation of how much reaction can happen within the droplet, and the amount of material that you have for the PCR; you only have a small volume to do all the reactions in, but they've been increasing it. And a similar concept is also a challenge with single cell DNA sequencing, because if you have two copies, two alleles, and you lose one, you're never getting it back; it's literally the only copy. Same idea in RNA: if you're expressing a gene at a very low level and you just happen to not amplify it, it's gone for good. The benefit of RNA is that for the highly expressed transcripts at least there are multiple copies, so you're very likely to see them even if you have some attrition. So does that fully answer your question? Do you have any follow-up to that? No, that answers it. Thank you. Thank you. Yeah, please break in with questions like that; that was actually a great point. So we've talked about single cell isolation, and we've talked about how the RNA is actually amplified and how the data is made. Now, a slide on bioinformatics tools in general. This was actually reviewed relatively recently in this paper here, really trying to address some of the issues that do come through in single cell RNA sequencing data.
There are two major factors. Technical variation is by far the big one; as a field we spend a lot of time really measuring and quantifying it. There are batch effects, different cells are captured at different efficiencies, and it's even trickier depending on whether you're doing whole cell or single nuclei sequencing. We're not going to talk too much about nuclei sequencing, but essentially you're freezing the sample, and cells will come out of a freeze at different viabilities, so you actually get bias just from the act of freezing a cell. There's amplification bias, very similar to regular RNA sequencing, and sequence bias. And of course we have biological variation as well: just like having technical replicates in RNA sequencing, if you look at 1,000 T cells there's going to be some natural variation among them. I alluded to this also on the Slack channel, this concept of primary, secondary and tertiary analysis; the same concept really holds for single cell sequencing as well. So there's the generation of data, which I largely just talked about, going through either Smart-seq2 or 10x Genomics and into the sequencer. By far the most important step is QCing your data. Do you have the expected number of transcripts per cell? Do your cells classify as the cells you thought you were putting into the experiment? Do you see copy number variation, and are the copy number variants in what you suspect are the malignant cells? Do you have to throw out a sample because you're just not getting enough reads, or do you need additional reads to fully saturate a sample? Really, don't underestimate the secondary analysis; spending a lot of time there certainly saves you a lot of pain. Then comes what I find is kind of the fun part, which is the tertiary analysis: how do you derive biological meaning from a well normalized, well formatted and structured data set? And to me these are really three separate examples.
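One of those QC questions, whether a sample needs additional reads to saturate, is often summarized as sequencing saturation. Here is a hedged sketch using the common 1 - (unique molecules / total reads) formulation; note that Cell Ranger's exact definition restricts the denominator to confidently mapped reads, and the numbers below are invented.

```python
# Sequencing-saturation check: the fraction of reads that hit molecules
# (UMIs) already seen. High saturation means extra reads add little new
# information. Simplified 1 - unique/total formulation; numbers invented.
def sequencing_saturation(total_reads, unique_umis):
    return 1.0 - unique_umis / total_reads

# Nearly saturated library: most new reads are duplicates of known UMIs.
print(round(sequencing_saturation(1_000_000, 150_000), 2))  # 0.85
# Undersaturated library: most reads are still new molecules.
print(round(sequencing_saturation(1_000_000, 800_000), 2))  # 0.2
```

In practice you would plot this as reads are downsampled; a flat curve near 1.0 means more sequencing buys you very little.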
And at least in my lab we've moved more towards this exact model when we talk to both our collaborators and sequencing cores. A core will get you through primary analysis, they'll even get you through secondary analysis, but who takes it to tertiary analysis? Do you need full support all the way through this entire process, or do you just want the reads? Using phraseology like this, and linking it to a specific paper, just makes sure that everyone understands who's doing what, where and when. And the tertiary part is really where the computational creativity comes in. Really defining a strong biological question is absolutely the number one thing to do in single cell sequencing. Certainly in the early days it was very tempting to download every package and attempt to run it. It's sort of fun from a technical perspective, but it really doesn't get you very far scientifically. So as we go into both the background to the glioblastoma project and the workshop, it's very important to know exactly what question you're trying to answer at each step in the analysis, and we have projects that are working in all three of these spaces. What are the cells that reside in a tumor, and how do they change over time? In a purified cell population, how are the regulatory networks changing between two states? We're very interested in pre-treatment versus post-treatment. And how do cells develop, both cancer cells and also the immune cells that infiltrate tumors? It's really about building a specific question around each of these analyses, rather than just grinding through and running every algorithm that has ever been mentioned alongside single cell sequencing, especially when it comes to cancer genome analysis. And I certainly caution the group: a lot of tools, especially these trajectory tools, are really being built and driven by developmental biologists, assuming two copies of every chromosome and well behaved regulatory networks.
Cancer cells are absolutely not like that: they're aneuploid, you have lots of different copies of chromosomes, sometimes you're missing chromosomes, and all of these have effects on the underlying distributions of gene expression patterns. So really tread carefully when applying tools that are built for developmental biology. They will almost certainly give you an output, in the sense that the bioinformatics will actually function, but it won't necessarily reflect the underlying biology of a cancer sample with an unusual karyotype. So I'll pause there; that's sort of the introduction to technologies. Are there any questions at this point before we dive specifically into the brain cancer example? Okay. So you have these concepts, and concepts can be a little abstract without a solid set of questions or a solid data set, and that's really why Javier and I thought we'd focus specifically on a recent publication looking at the cancer stem cell populations that give rise to glioblastoma, a highly malignant brain cancer. The whole concept for the entire data set is really this idea that glioblastomas arise from cancer stem cells, and it's this very rare minor population that actually gives rise to the bulk. So the whole point of generating this data set was to take bulk tumors, with the goal of finding treatments that specifically target cancer stem cells. The model the study assumed going in was that current treatments debulk the tumor but leave behind these cancer stem cell populations, and it's these cancer stem cell populations that repopulate the bulk tumor and give rise to relapse. And the idea here is: let's isolate the cancer stem cells, understand them, and come up with therapies to treat them. An ideal cancer stem cell specific therapy would remove the cancer stem cells, and as a result you would see regression of the tumor, even if you're not actively targeting the bulk right off the bat.
So this data set is specifically a set of cultured glioblastoma stem cells. You take a primary GBM and dissociate it, just like you would for single cell sequencing; in this case, they culture it. The culture conditions select for brain tumor stem cells, and these are then expanded into lasting cultures that can be sequenced. There are entire research programs, in this case Peter Dirks' and Sam Weiss's, that characterize these cancer stem cells using a variety of other functional assays. And this data set here is essentially looking at what transcriptional programs are active across a collection of glioblastoma stem cells. So the first question is relatively simple: are brain tumor stem cells comprised of genetic and transcriptomic subpopulations? Are they homogeneous, like cell lines were once thought to be (we now know they're not)? Or are they heterogeneous, with multiple populations within the stem population? And is it possible to nominate a drug that is specific against one or multiple clones? So this is the first plot. In this experiment the single cells have all been dissociated; in this case they've all gone through the 10x Genomics Chromium platform and been sequenced, they've gone through batch correction and some level of normalization, and they've gone through the Cell Ranger pipeline. Javier will teach you how to go from the FASTQs to plots like these. These are t-SNE plots, which are analogous to principal component plots; we'll go into a little more detail on this in the workshop. And what I've colored here, going through samples one by one, is essentially the cluster assignments for distinct transcriptional populations, and there's an algorithm here that essentially has a dial that lets you resolve some of these populations from one another.
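To illustrate that "dial" idea, here is a deliberately simplified toy. Real pipelines use graph-based clustering (Louvain or Leiden) where a resolution parameter controls cluster granularity; in this sketch a distance cutoff on invented 2-D embedding coordinates plays that role, which captures the intuition but is not the actual algorithm.

```python
# Toy illustration of a clustering "dial". Real single cell pipelines use
# graph-based Louvain/Leiden clustering with a resolution parameter; here
# a distance cutoff on invented 2-D embedding coordinates plays that role.
import numpy as np

def threshold_clusters(points, cutoff):
    """Greedy single-linkage grouping: a point joins the first earlier
    point within `cutoff`, otherwise it starts a new cluster."""
    labels = [-1] * len(points)
    next_label = 0
    for i, p in enumerate(points):
        for j in range(i):
            if np.linalg.norm(p - points[j]) <= cutoff:
                labels[i] = labels[j]
                break
        if labels[i] == -1:
            labels[i] = next_label
            next_label += 1
    return labels

cells = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]])
print(threshold_clusters(cells, cutoff=1.0))   # coarse: [0, 0, 1, 1]
print(threshold_clusters(cells, cutoff=0.05))  # fine:   [0, 1, 2, 3]
```

Turning the dial one way merges everything into a few coarse populations; turning it the other way splits every cell into its own group, which is exactly the trade-off you tune when resolving subpopulations.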
So I've ordered these basically by the number of clusters present in each GBM, or glioblastoma, stem cell population. At the top are the ones that are largely clonal, essentially one or two clusters, and whenever there are two clusters, the second cluster is almost inevitably an actively cycling cluster. We see these in patient samples as well; there's always this hyper-proliferative malignant population, and they score for cell cycle markers and pathways like Ki-67, DNA synthesis, these types of themes. And I've ordered these so that as you get further down the slide you start to see additional clones, until you get to this very biologically distinct IDH-mutant line. So this is really just trying to answer that first question: are these highly heterogeneous? And the answer was: it depends. In general there are not that many clones coming out in this GSC population specifically, but there is some very complex clonal structure present in some of these lines. This really draws on what we knew about glioblastoma from bulk DNA and RNA sequencing: we know that there are specific copy number alterations, and you can also call out very large arm-level or chromosome-level changes from RNA sequencing data. The way this works is by looking not just at individual genes but at whole cassettes of genes that exist on a chromosome arm or across an entire chromosome, and looking for that whole collection of genes to be over- or under-expressed versus a baseline. And the key to calling large-scale copy number alterations from RNA sequencing data is to have a high quality reference. This is extremely hard to do from bulk RNA sequencing data, because you have so many different cell types contributing to that bulk. In single cell sequencing data you can actually take each individual cancer cluster and normalize the gene expression values against their original cell of origin.
So that was sort of the secret to getting this to work: doing a genome-wide inferred copy number analysis from the RNA, using normal oligodendrocytes as the control. This is really the key to getting the visualization of the copy number variants we knew were there from RNA sequencing, but also being able to do a very deep dive on specific populations. I've just zoomed in on one of those t-SNE plots here. You can see there are copy number alterations that absolutely every cluster has, so this gain of chromosome 7 is sort of a hallmark of glioblastoma, but you can see the subpopulation here marked in orange essentially has this loss of chromosome 10, which notably takes out PTEN, and you can see it's not in every single population. So already you can see the subclonal structure, and you can play this game over and over: there's a sub-subclone with this chromosome 21 gain, or chromosome 20 gain, and that's completely unique to this one population. And I've italicized the word "partially" here, because these clusters are not explained completely by copy number variation; even within this clonal population there are specific subpopulations that are essentially not well explained purely by their underlying copy number profiles. And you can play the sub-subclonal game over and over: is this little loss of chromosome 6 a very important and unique subclone? It's not as brightly colored or called out as a subclone, but this is really where the biological analysis and data exploration comes in. Suffice to say, it is possible to call large-scale copy number variants from single cell data, not just at the bulk population level but actually for each cluster individually, and Javier will show you how to do this in practice, so you'll be able to make a plot like this later on this afternoon.
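The cassette-of-genes idea behind inferred copy number can be sketched in a few lines. This is not the published pipeline; real inferred-CNV tools add gene ordering along the chromosome, smoothing, and careful per-cell normalization. The gene assignments and expression values here are invented for illustration.

```python
# Sketch of arm-level inferred copy number from expression: average a
# cassette of genes on one chromosome in tumor cells and compare to a
# normal reference. Gene panel and values are invented; real tools add
# gene ordering, smoothing and per-cell normalization on top of this.
import numpy as np

genes = ["EGFR", "CDK6", "MET", "PTEN", "MGMT"]          # toy panel
chrom = {"EGFR": "7", "CDK6": "7", "MET": "7", "PTEN": "10", "MGMT": "10"}

# Log-normalized expression, rows = cells, columns = genes (invented).
normal = np.array([[1.0, 1.2, 0.9, 1.1, 1.0],
                   [1.1, 1.0, 1.0, 0.9, 1.1]])
tumor  = np.array([[2.1, 2.0, 1.9, 0.1, 0.2],            # chr7 up, chr10 down
                   [2.0, 2.2, 2.1, 0.2, 0.1]])

def arm_score(expr, target_chrom):
    """Mean expression of all genes on one chromosome, across all cells."""
    idx = [i for i, g in enumerate(genes) if chrom[g] == target_chrom]
    return expr[:, idx].mean()

for c in ("7", "10"):
    shift = arm_score(tumor, c) - arm_score(normal, c)
    call = "gain" if shift > 0.5 else ("loss" if shift < -0.5 else "neutral")
    print(f"chr{c}: {shift:+.2f} -> {call}")
```

The point of averaging a whole chromosome's genes is that any single gene can be up or down for regulatory reasons, but a coordinated shift across an entire arm is much more likely to reflect a copy number change.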
Let's zoom back out a little bit. Across all of this, what we're looking for is a magic bullet that would allow us to understand all of the glioblastoma stem cells together as a population. We did see that each patient's glioblastoma stem cells were different, but was there any sort of common biology that really brought them together? This is the exact same concept, just another plot; in this case it's a UMAP plot, which is sort of an updated version of the t-SNE plot, and you can see each patient really clusters completely distinctly. This has been reported many times in many single cell sequencing experiments: the malignant cell populations, which these glioblastoma stem cells count as, are all highly unique, and they virtually never cluster together across different patients. Contrast that to the normal cells that we would see in a lot of these tumors, which do intermix very strongly. For example, if we took 50 primary brain tumors, the normal cells would all intermix across multiple patients, but invariably each cancer or malignant population will form its own distinct group, very similar to what's shown on this plot here. You can see there is some relationship between two spatially distinct tumors taken from the same patient: whenever we had two lines from the same patient, these were closer to each other, or at least more transcriptionally related to one another, but they didn't beautifully intermix the way a normal T cell or immune cell population necessarily would. The analysis essentially uncovered a gradient. Glioblastoma stem cells that express more developmental programs are at the top, and the second principal component was these injury response programs, essentially these lines down here at the bottom. And when we scored out those two specific programs, those two axes were mutually exclusive.
So we basically scored out a developmental score and an injury response score, and they were largely independent of one another. If we take these two axes, injury response and developmental program, these lines are either heavily developmental, heavily injury response, or a mixture in between; we never saw a line that really populated both ends of the axes simultaneously. The ones in the middle are kind of on their way along the gradient, but we never really saw one line that fully occupied both ends of the gradient. And this is really another way to look at that data: the exact same scores, but looking at the distribution of cells across an individual GSC line. The take-home message from this was really the need for multiple samples to fully characterize the full spectrum of transcriptional states between this injury response program and this developmental program. But this is really the power of using single cell sequencing, because you get multiple measurements of similar biology and you really get nice distributions of these scores across multiple samples. So keep in mind we've been looking only at glioblastoma stem cells. The final question here was: let's move into primary data. This is the other real power of single cell RNA sequencing data: there's just an absolute ton of public data out there. We're going to talk a lot about Crescent towards the end of the workshop, which is a data sharing platform to put all this public data into one computable place. The take-home from this slide is really that the two ends of the gradient uncovered in glioblastoma stem cells are actually there to be found in primary patient tumors as well; they're just masked by the overall bulk population.
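Scoring a cell for a gene program, as in the developmental versus injury-response axes described above, reduces to averaging expression over a signature gene set (published methods typically also subtract a matched control set). The signature lists and expression values below are invented placeholders, not the published gene lists.

```python
# Scoring cells for two gene programs by averaging signature genes.
# Signature lists and expression values are invented placeholders, not
# the published developmental / injury-response signatures.
import numpy as np

genes = ["SOX2", "OLIG2", "CD44", "ANXA1"]
developmental = {"SOX2", "OLIG2"}        # placeholder signature
injury_response = {"CD44", "ANXA1"}      # placeholder signature

# Rows = cells, columns = genes (log-normalized, invented).
expr = np.array([[2.0, 1.8, 0.1, 0.2],   # a developmental-high cell
                 [0.2, 0.1, 1.9, 2.1]])  # an injury-response-high cell

def score(expr, signature):
    """Per-cell mean expression over the signature's genes."""
    idx = [i for i, g in enumerate(genes) if g in signature]
    return expr[:, idx].mean(axis=1)

dev = score(expr, developmental)
inj = score(expr, injury_response)
for i, (d, j) in enumerate(zip(dev, inj)):
    print(f"cell {i}: developmental={d:.2f} injury={j:.2f}")
```

Plotting the two scores per cell against each other is exactly how the mutual exclusivity of the two axes shows up: cells land near one axis or the other, not in the high-high corner.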
And this is really where single cell sequencing is so powerful, because you can informatically pull out the suspected glioblastoma stem cells from these primaries and then look at these injury or developmental scores specifically in those populations. And this is the model where this paper really landed: the idea that the glioblastoma stem cell populations within bulk tumors exist on this developmental to injury response axis, and that these are essentially churning out the bulk tumor, which actually sits on an alternate axis, around this astrocytic developmental program. I did caution early on about the dangers of using these trajectory programs, but they really are actually quite informative, as long as you're aware of some of the caveats and assumptions the trajectory methods have coming in. So specifically here, we have the glioblastoma stem cells existing on the developmental to injury response gradient, and then giving rise to these mature astrocyte-like bulk tumor cells. And single cell sequencing really allows you to score out and look at these signatures on a cluster by cluster basis. That's really where this project is, and where a lot of the thinking around single cell sequencing has come. A lot of genomics, certainly when I was a trainee, was really around binning, putting cancer types or cancer specimens into subtypes; classical, proneural and mesenchymal are sort of the bedrock of glioblastoma.
The thinking is really starting to evolve with single cell sequencing, as we see cells really being anchored, especially by stem cell populations, as a feature of developmental biology: not trying to necessarily place a cell into a specific bucket, but trying to understand where it may be on a developmental trajectory, and potentially starting to map the position on these gradients to specific therapeutic interventions, especially in glioblastoma, where treatments really have not changed and survival certainly has not improved very much at all over the last decade. And this is also really where a lot of the bioinformatic analysis stemming from single cell work lands: the need for secondary experiments. It's really important to think about how you are going to validate anything that you infer from your single cell data. Is it a functional screen? Is it a drug screen? Where we're going with this project is trying to look at drug predictions at the cluster level, but just like most genomic work, genomics in isolation really gives you leads or hypotheses to test, and you really do have to plan out that additional validation experiment using orthogonal technologies like proteomics or functional genomics approaches. I just want to finish off with some of the gaps and opportunities in cancer single cell genomics research. A big one certainly is single cell methods that are aware of aneuploidy. A lot of these methods, especially the batch effect corrections, do not take underlying copy number variation into account, and some of these transcriptomes are extremely skewed due to little hyper-amplicons or deletions of large chunks of DNA. Reference sets of single cell data from healthy tissue are also extremely valuable, especially for answering questions around how normal cells inhabit, change or differentiate when they're within a tumor.
And really where I certainly see the field going is: what is the clinical single-cell sequencing test going to look like? How do you interpret a profile? How do you make sense of multiple subclones from inferred copy number variation? And how do you really roll these up into a distilled report that can be used to guide patient treatment? One step towards that is infrastructure. So you're going to get to learn how to use CReSCENT today. This is essentially a cloud-based single-cell RNA sequencing analysis system. It takes raw data, allows you to keep your own data private but also to share it with your colleagues in a controlled way, so some of the heavy lifting is taken off of you. This is all done through a web browser, with provided dashboards for quality control, working with clusters, labeling clusters, providing metadata, and then actually doing some analysis as well, like differential gene expression, and, coming soon, integration with multiple public data sets. So I encourage you to poke around the system. There are many cells from many data sets; some of them are private, some of them are public. I welcome you to join the CReSCENT user group and continue to interact through the Toronto single-cell working group. And the data that I just showed you is also available in CReSCENT, so you can actually see that little GSC and bulk tumor pattern in a web browser. You can type in your favorite gene and explore to your heart's content. This is now publicly released. There's not enough time to workshop everything; this is sort of in the planning. Everything I've talked about is focused solely on the cancer cell population. Of course, there's a whole world of infiltrating cells that inhabit tumors, and exploring the tumor immune microenvironment is likely a topic for another module. So I encourage you to sign up for that when it's fully developed. So there's the paper if anyone needs it.
If you'd like to join the Toronto single-cell working group: Toronto is in quotes because it's growing far beyond the bounds of Toronto; it's pan-Canada, with actually several international participants as well. So that's the site there, and the developers of CReSCENT are listed there. And if you want to generate data, we offer this as a service through the Princess Margaret Genomics Centre. One of the objectives of the hands-on part of the session today is that by the end of this lecture you will understand the methods that are commonly used to analyze single-cell RNA-seq data, and you will be able to select parameters for your analysis, so you know how to use state-of-the-art tools to analyze your data. In particular, we're going to be using glioblastoma data sets from a paper that was recently published by Trevor's group; I will give you the reference in the next slides. So, the agenda for this tutorial: Trevor already gave you an overview about where the single-cell analysis field is going. I would say that this is a relatively new field, not more than seven years old; compared with other fields in genomics, this is relatively new, and it has had a very good number of tools developed, so we tried to select the best ones based on benchmark studies, which we will speak about, and those are the ones that we will cover today. I will also speak about the motivations to automate single-cell RNA-seq analysis, give you a broad overview of how a typical single-cell analysis pipeline looks, and then we will get into the methods themselves. We will start with Cell Ranger, which is proprietary software developed by the company 10x, which makes one of the technologies that Trevor mentioned in his talk, the Chromium technology. It handles essentially everything from taking the FASTQ files that the sequencing facility provides you, to mapping those reads against a reference transcriptome that you have to indicate; it can be mouse or human.
And then we have single cell versus single nuclei; we will talk about that later. This is a process that takes a while to run; even on a high-performance computer it can take a few days. So we will not run it today, but I will show you the results and how to interpret them. In the lab for this module, we will focus on this part here, which uses a tool called Seurat, developed by Rahul Satija and his group in New York. It does a second round of quality-control checks after Cell Ranger. And then we will also do dimensionality reduction using PCA or RPCA, cell clustering, visualization using t-SNE and UMAP plots, and finally differential gene expression analysis. A couple of other tools that I'm going to touch on today are tools that we identified as top performers in a benchmark study, like GSVA. It's used for many other purposes, like the pathway enrichment analysis that I think you will see tomorrow, and it's also used in the bulk RNA-seq field. Essentially you want to rank the expression of the genes that characterize a certain cluster, and you want to label those clusters based on gene sets. And Trevor showed a heat map with the copy number variants for the glioblastoma data; I will explain how to use that tool and how to interpret the results. And finally, we will cover a tool that is being developed in our lab with Trevor. This is called CReSCENT; it stands for CanceR Single Cell ExpressioN Toolkit. It's a web app, made user-friendly for people who don't necessarily have a computational biology background, so you don't have to install any software dependencies; you just need your raw data and then you can run the analysis there. Again, the sample data that we will be using today is glioblastoma data from this paper. All right. So what's the motivation for doing single-cell analysis in an automated way?
We have biological, technical, and I would say bioinformatic motivations for this. The biological motivation is pretty much what Trevor covered in his talk: we have a lot of different cell types in the tumor microenvironment, and we want to characterize the differences between those cell types and cell stages. In terms of technology, we have many different ways to measure single cells. Trevor spoke about two of these, Smart-seq2 and 10x Chromium, but there are others, and they produce different types of readouts; the normalization has to be different. And of course, you want tools that can handle different types of measurements, normalize them, correct for batch effects, etc., so you can build a better picture of the tumor microenvironment with data from different labs or different studies. These two plots that I'm showing here are meant to show how the field has been growing in terms of data generation. Around 2015, studies were being published with about a few dozen cells per study, and from 2015 it started to grow exponentially. Now some studies are publishing up to one million cells measured by single-cell RNA-seq technologies, and this is still growing exponentially. Not only is the number of cells being measured growing, but also, as a function of the number of cells, we keep seeing more and more cell types and cell clusters, assuming that cell clusters are representative of specific cell types or cell stages. This is telling us that we still need to do much more sequencing to get a more complete picture of cancer and the tumor microenvironment. And because we have all this wealth of single-cell data, we need powerful tools that can handle it. The next numbers I took from a GitHub repository where the community is curating the computational tools that are being developed.
So if you develop a new tool for single-cell RNA sequencing, you can go there and ask the coordinators to index it, so that the community has an overview of who is doing what, to avoid redundancy in terms of computational tools. That figure is, I think, at least one year old, so there are probably around three or four hundred computational tools out there now, just in this repository. In Bioconductor, this went from two tools in 2016 to 15 in 2019, and there are about 100 these days just in Bioconductor. So that tells you how hot this area is nowadays. But for a computational biologist, or a data scientist in general, having so many tools can sometimes be challenging, because you don't know which tool you should use. You see a nice paper that uses one tool, you see another paper that uses another tool, and then it becomes a little confusing which tool you should use. I will tell you the approach that we're taking in CReSCENT to address this wealth of data and of available computational tools. This is exactly what I'm trying to summarize in this slide. This is a typical single-cell RNA-seq analysis pipeline. It has nine steps, but they are not linear: sometimes you do the QC, then you go to the clustering, and then you go back to the QC. I just list the steps here for the purpose of illustration, but sometimes it is a process that goes back and forth. If you're working with 10x data from the Chromium technology, you will use Cell Ranger. There is no need to use anything else, because this software is free to use, it is very well maintained, it has several releases during the year, and I have to say that they have phenomenal technical support at 10x. So if you're working with 10x Chromium technology, you go straight to Cell Ranger. The only thing that you have to take care of, in this case, is the reference transcriptome.
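For reference, a typical invocation looks roughly like this. The paths and sample names are placeholders, and the reference bundle name follows 10x's prebuilt-reference naming; check the Cell Ranger documentation for your version's exact flags:

```shell
# Sketch of a cellranger count run; all paths and names are placeholders.
cellranger count \
    --id=sample1_run \
    --sample=sample1 \
    --fastqs=/path/to/fastqs \
    --transcriptome=/path/to/refdata-gex-GRCh38-2020-A
```

The `--transcriptome` argument is where the reference choice (species, genome build, single cell versus single nuclei) comes in.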
If your samples come from human or from mouse, you have to use the corresponding reference transcriptome or genome. And in human, you also have to take care whether it's the hg19 version or GRCh38, etc. So try to keep things consistent. That's very important, at least at the lab level: all your students and postdocs should hopefully be using the same genome version, so that the data is comparable. The other thing that you have to take care of when you're using Cell Ranger: in the single-cell field, sometimes, more so with cancer samples or samples that come from a certain disease, from a biopsy, etc., the samples cannot be processed right away in the lab. Sometimes they have to be frozen down, stored at cold temperature, and processed afterwards, and one way to handle those types of samples is to measure the expression of genes within the nuclei rather than the whole cell. So we call that single nuclei instead of single cell; it will be snRNA-seq, single-nuclei RNA-seq. Because the transcripts come from the nuclei, they haven't been spliced yet, so the reference transcriptome used for Cell Ranger has to be built accordingly, so that unspliced, intronic sequence is included. That's one point that you have to take care of when you run Cell Ranger. Now, in most cases, at least at the Princess Margaret hospital or UHN, Cell Ranger is run by the sequencing facility and the computational facility at UHN, so you will just indicate which reference transcriptome you want to use. Okay, so once you have your reads mapped to a given genome, you will have a matrix. Let me see if I have a slide here. Yeah, you will have a matrix like this one. It's called the MTX format. It has three files, and it is nothing more than one file that has the barcodes that identify each of the cells; each of these DNA sequences is unique and represents one cell in the sample. And then you have a second file that is called features in newer versions of Cell Ranger.
In older versions it was called genes.tsv, and for single-cell RNA-seq the features are genes. The reason for the name is that there are other technologies, as was brought up during the last part of Trevor's presentation, with other types of measurements, like methylation, etc., where the features might not be genes; they might be chromosomal regions. So the file is called features, and in the case of single-cell RNA-seq, the features are genes. So imagine these in a matrix: your barcodes are the columns and the features, the genes, are the rows. Then you have a third file, which is the MTX file proper, and it has this format. You have the number of features, meaning genes, in that particular sample; the number of barcodes, meaning cells that were identified; and then the number of non-zero entries in your matrix. And then you have three columns: a column indicating the feature index, which is the number 10 in this file; then the barcode index, which is the number one here; and so on. And then here you have the number of times, the number of UMIs or number of reads, that this particular gene was measured in this particular cell. In this case it was measured only once; sometimes it's twice, and so on; you can have hundreds, depending on the cell and the gene. Why do we have this format? The main reason is that the matrices are big. In a single sample you can have 5,000 to 10,000 cells, let's say 5,000 cells, and about 30,000 genes, so you will have a matrix of 30,000 rows by 5,000 columns. Now, think about a typical single-cell RNA-seq study: you will probably have about 50 samples from multiple patients and multiple conditions. So when you integrate those matrices of thousands by thousands, you might end up with a matrix that just cannot be loaded into memory, or that takes a lot of space.
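As a concrete illustration of that triplet layout, here is a minimal pure-Python reader for a toy matrix in this format. In real analyses you would use something like scipy.io.mmread or Seurat's Read10X instead, and the banner line and numbers below are invented:

```python
def read_mtx(lines):
    """Parse MatrixMarket-style triplets into {(gene_idx, cell_idx): count}.
    Indices are 1-based, as in Cell Ranger's output; absent pairs are
    implicit zeros, which is what keeps the file small."""
    entries, dims = {}, None
    for line in lines:
        if line.startswith("%"):   # banner / comment lines
            continue
        if dims is None:           # first data line: n_features n_barcodes n_entries
            n_genes, n_cells, _n_entries = map(int, line.split())
            dims = (n_genes, n_cells)
            continue
        g, c, count = line.split() # feature index, barcode index, UMI count
        entries[(int(g), int(c))] = int(count)
    return dims, entries

toy = [
    "%%MatrixMarket matrix coordinate integer general",
    "3 2 3",   # 3 genes, 2 cells, 3 non-zero entries
    "1 1 5",   # gene 1 counted 5 times in cell 1
    "2 1 1",
    "3 2 2",
]
dims, m = read_mtx(toy)
print(dims, m.get((1, 1), 0), m.get((1, 2), 0))  # -> (3, 2) 5 0
```

Note that gene 1 in cell 2 never appears in the file, so looking it up falls back to zero: the sparsity is implicit.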
And the most important point that I haven't mentioned is that about 90% of that matrix will be empty, meaning zeros: cases where a given gene was not detected at all in a given cell. That's a waste of hard drive space. To avoid storing zeros, the MTX format has indices only for cases where the gene was measured in the cell at least once; the zeros are not included. Okay, so that's the output of Cell Ranger. During today's labs, we will cover the rest of the steps here. I have to tell you that some of these steps are very memory-intensive, and hopefully we can run them all, but I will share with you the GitHub repository where I have all the documentation, and I also uploaded the intermediate files to Zenodo so that today's labs can run smoothly. Okay, so the second step is a quality-control check. Cell Ranger has its own quality control that we are going to see in a few minutes; actually, I should probably go over it now. Let me go back to my slides. This is another sample, where you can see that the slope is not as pronounced as in the first one; in the first one it was more like this, whereas here the slope is smaller. That means that you have fewer cells with a high content of RNA, and probably some of these barcodes that are called cells were actually ambient RNA. You just have to be careful with that. And Cell Ranger will actually give you some warnings in the output; it is telling us here that there was a low fraction of reads in cells, which is exactly what we're seeing here. Ideally, you want at least 70% of the reads to be within this region in blue. If it's less than 70%, you will get this warning, and I think if it's less than 30% or so, the warning will be in red, and it will be an error rather than a warning.
So this is kind of the first quality-control check that you want to do as soon as you get your data back from the sequencing facility. Other types of warnings that you might see are related, for example, to the number of reads. If this is too low, it might be that the indexing at the sequencing facility didn't work out as well as expected; maybe there is a mismatch of the indices, etc. One common warning that you might get is a low fraction of reads mapped to the reference transcriptome or to the genome. Here, for example, we have about 95%: 95% of all the reads mapped to the genome, which is a very good number. If you have less than 70%, you will get a warning, and I think less than 50% gives you an error. That typically means that the reference transcriptome used for the analysis wasn't correct: perhaps you have a human sample and the reference was a mouse genome, or you used single-cell data and the reference transcriptome was a single-nuclei one, so you have a lot of issues with splicing there. Another thing that you have to take care of, of course, is the type of chemistry, because 10x has released different versions of the experimental kit. This is just something to keep in mind: the newer versions of the chemistry detect more genes and more reads per cell, so if you are using version one, it's okay that these numbers are in general lower than for version two. And then finally, you also want to keep in mind which version of Cell Ranger you are using. I think there is a version five now, but the main difference was between Cell Ranger version two and version three: between those two versions, there was a big improvement in the algorithm that distinguishes background from cells. So if you have a sample that was mapped with Cell Ranger version two, I would strongly recommend that you run it again with a newer version of Cell Ranger.
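The warning logic just described can be summarized in a small helper. The cutoffs below (70% reads-in-cells, 70%/50% mapping rate) are the approximate numbers quoted in the lecture, not Cell Ranger's exact internal thresholds, so treat them as illustrative:

```python
def qc_flags(frac_reads_in_cells, frac_mapped):
    """Rough sketch of the kinds of alerts described above.
    Thresholds are the approximate ones from the lecture, not Cell Ranger's."""
    flags = []
    if frac_reads_in_cells < 0.70:
        flags.append("WARN: low fraction of reads in cells (possible ambient RNA)")
    if frac_mapped < 0.50:
        flags.append("ERROR: very low mapping rate (check the reference transcriptome)")
    elif frac_mapped < 0.70:
        flags.append("WARN: low mapping rate")
    return flags

print(qc_flags(0.95, 0.95))  # -> [] (a clean sample)
print(qc_flags(0.60, 0.40))  # one warning plus one error
```

The point of the sketch is simply that these summary fractions are cheap to check programmatically across a batch of samples before any downstream analysis.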
And actually, I would strongly recommend that whenever you are starting a new project, if you're getting samples from different labs or from publicly available repositories and you want to add your own samples, and you have access to the FASTQ files, you run Cell Ranger from scratch. It's just computational time, but it will save you a lot of time afterwards in the analysis, because then you will have exactly the same genome annotations, the exact same number of genes being measured, etc. I will talk a little bit later about how it's not always easy to get FASTQ files from data that is supposed to be publicly available; sometimes it's not that public, to be honest. But if you have access to the FASTQ files, I recommend that you always run Cell Ranger. Okay, let me go back now to the slides. Do we have any questions so far? If not, we can always come back later. All right, so we were at step one; now step two. I will go faster through these steps because we will see them live in the labs. Step two is QC. Here we want to see if there are many transcripts belonging to mitochondrial genes, which is usually an indicator that when the sample was processed in the lab, many cells were dying. When cells are dying, they often start to express a lot of mitochondrial genes as a general stress response. So if you have a lot of mitochondrial transcripts in your cells or in your sample, the chances are higher that the cells in that sample were dying. Maybe you don't want to consider that sample, because then you are not seeing the biology of interest; you are just seeing a general stress response, and you don't want that. Then, once you have these two quality-control steps done, you do normalization of the samples; I will tell you more about it, essentially to correct for- Sorry, I have a quick question.
So cells dying, is that more common in cancer samples, or just in single-cell work in general? I think it's just common with the way the samples sometimes are processed. I wouldn't be able to tell you if it's more common in cancer than in other types of tissues or samples, but certainly the way some cancer samples are processed makes them more prone to cell damage. I know, for example, that liver tissue from cancer is very prone to cells starting to die, so it can become damaged very easily. Yeah, I think one nasty artifact is ambient RNA from necrotic tissues, where transcripts essentially start floating around and get encapsulated in droplets on a 10x platform. Sure, sure. I wouldn't say it's necessarily higher in cancer versus other tissues, but in definitely highly necrotic tissues you could expect a higher ambient RNA level. But yeah, I don't think anyone's really done a systematic cancer versus non-cancer comparison. Okay, great. Thanks guys. No worries. The good thing with single-cell technologies is that, computationally, you can try to identify groups of cells that presumably were dying and separate them from the rest, compared with bulk RNA-seq, for example, where you have everything in a mixture and cannot decompose it very easily. We can talk about that during the tutorial. Then you do normalization. Here we are essentially trying to correct for sequencing depth: some samples might have been sequenced deeper than others, and we want to control for that. Then we do something called batch-effect correction; we will spend most of the time during the tutorial going over batch-effect correction. This is for when, for example, two samples come from different labs, were prepared by different people, or even used different technologies, 10x Chromium versus Seq-Well, etc. This corrects for technical artifacts that would otherwise mislead the biological conclusions we are making.
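Two of the steps just discussed, flagging dying cells by mitochondrial content and then correcting for sequencing depth, are easy to sketch. The 25% mitochondrial cutoff below is an arbitrary placeholder (real cutoffs are tissue- and dataset-dependent), and the normalization mimics the counts-per-scale-factor-then-log1p scheme used by tools like Seurat:

```python
import math

def pct_mito(counts):
    """Percent of a cell's UMIs coming from mitochondrial ('MT-') genes."""
    total = sum(counts.values())
    mito = sum(v for gene, v in counts.items() if gene.startswith("MT-"))
    return 100.0 * mito / total if total else 0.0

def lognorm(counts, scale=10_000):
    """Depth normalization: counts / cell total * scale factor, then log1p."""
    total = sum(counts.values())
    return {gene: math.log1p(v / total * scale) for gene, v in counts.items()}

cell = {"MT-CO1": 20, "GAPDH": 60, "TP53": 20}  # toy cell, made-up counts
print(pct_mito(cell))             # -> 20.0
keep = pct_mito(cell) < 25.0      # placeholder cutoff; tune per dataset
normalized = lognorm(cell)
```

Because every cell is scaled by its own total, two cells sequenced to different depths end up on comparable scales before clustering.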
The fifth step: once we correct for those batch effects, we integrate all the samples into one big matrix. Because we will have a very high-dimensional matrix of thousands of genes by hundreds of thousands of cells, and analysis with that big matrix would take forever, we need to reduce the dimensionality; typically here we use something like PCA or RPCA, and we will see it in the tutorial. Then, with the dimensions reduced, we cluster the cells using only the features, meaning the genes, that are most informative. What I mean by that is: we want to keep genes that were well measured in many cells, and remove genes that are poorly measured in most of the cells. But we also want to remove genes that are just housekeeping genes, expressed always and in all cells, because those don't give us any information either. We want genes that are well measured, but that at the same time vary across samples and cells. With those genes, we do the cell clustering by comparing the gene expression profiles of the cells in this reduced-dimension space. And then we do differential gene expression: we want to see which genes are differentially expressed between one group of cells and another, etc. And then, as a data scientist, you want to show these results to your audience, which can be your paper's readers or your collaborators, so you want to present all those tables and analyses in a nice way, using visualization tools like UMAP plots, heat maps, etc., which we're going to see today. And then cell cluster labeling: instead of telling your collaborators "I have 10 clusters", you can say "I have five cell types, which are split into 10 clusters", something like that. In CReSCENT, just quickly, the way we are selecting the algorithms or methods for each of these steps is that we rely on benchmark studies, either done by others or by ourselves.
And we pick the top performers; that's what we implemented in CReSCENT. Okay. This is the data that Trevor was showing in his talk, from Laura and others in Trevor's lab. In the tutorial today we are going to focus on these seven samples, all from glioblastoma, but you can see that they have different numbers of cells and different types of cells; for example, these are all the malignant cells from these seven samples. So I thought it was a very nice example to work with today. Hopefully we can run all seven. Actually, I know that we will remove at least one because of computational limitations in some of the steps, so hopefully we can work with the remaining six; perhaps we will just work with three of them or so. And then, once you have time and more computational power, you can essentially copy the same code and run all seven, or more samples, from this paper. We have all the data from this paper in CReSCENT, and I will tell you how to get it.
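To close with one more concrete sketch: the feature-selection idea from the pipeline overview (drop flat housekeeping genes, keep well-measured but variable ones) can be illustrated in a few lines. Real tools, for example Seurat's FindVariableFeatures, model variance as a function of mean expression rather than ranking raw variance, and the matrix here is invented:

```python
from statistics import pvariance

def top_variable_genes(matrix, n=2):
    """Rank genes by expression variance across cells and keep the top n."""
    variances = {gene: pvariance(values) for gene, values in matrix.items()}
    return sorted(variances, key=variances.get, reverse=True)[:n]

# Toy matrix: gene -> normalized expression across four cells.
matrix = {
    "HOUSEKEEPING": [5.0, 5.0, 5.0, 5.0],  # flat everywhere: uninformative
    "MARKER_A": [0.0, 0.0, 8.0, 9.0],      # variable: likely informative
    "MARKER_B": [1.0, 7.0, 1.0, 7.0],
}
print(top_variable_genes(matrix))  # -> ['MARKER_A', 'MARKER_B']
```

Clustering on only the genes this kind of filter keeps is what makes the dimensionality reduction and clustering steps both faster and less noisy.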