Okay, so welcome all to this SIB Virtual Computational Biology Seminar Series, which is the last one before the summer break, so we're going to see you back in September. Today, for this last talk, we have the pleasure to host here in Lausanne Hubert Rehrauer, from the Functional Genomics Center Zurich, which is affiliated with ETH Zurich and the University of Zurich. Hubert studied physics at the University of Würzburg in Germany, where his main interest was in computational modeling in statistical physics. After his diploma, he joined ETH Zurich for a PhD in image data mining, and in 2000 he shifted to life sciences and joined the company Genedata as a scientific consultant, where he developed algorithms and software solutions for the then-emerging field of expression microarrays. In 2005, he joined the Functional Genomics Center Zurich as a bioinformatician, where he currently holds the group leader position in Genome Informatics, and Hubert is also a group leader of the Swiss Institute of Bioinformatics. His group at the Functional Genomics Center collaborates with researchers on the analysis of next-generation sequencing data, provides web-based analysis to the research community, as well as custom development of analysis pipelines. The group supports the analysis of virtually all NGS applications based on data produced by any recent sequencing technology, trains researchers and bioinformaticians on various aspects of data analysis, and provides access to its computing infrastructure for running analyses. So today, Hubert will tell us how single-cell RNA sequencing platforms perform, and I want to thank you again for accepting this invitation. The floor is yours.

Hello, everybody. Hello, everybody here in this room, and hello, everybody who is virtually online and follows this presentation from a different place. So yes, today I'm talking about a performance comparison that we did, together with other research groups, of platforms that allow single-cell RNA-seq. The outline of my talk is as follows: I will quickly recap what we are doing, but that's not going to take long; then I will introduce these single-cell platforms and shortly mention how they have evolved, also quickly mention existing comparisons, and then I will come to the study that we did and the quality metrics that we have computed so far. This is still work in progress, and we will keep adding to it.

Now, talking about genome informatics, here's what we do at the Functional Genomics Center. Our mission is to provide bioinformatics services and collaborations in the area of omics, with a focus on next-generation sequencing data. So all types of bioinformatics challenges that involve next-generation sequencing, but it can also involve metabolomics, proteomics, or clinical data where we provide support. We also maintain and run a high-performance computing environment, cluster and cloud-based, and we also develop some software and pipelines. On the next graph I visualize roughly the ways in which we want to make bioinformatics easier for researchers. In my view, this depends on the bioinformatics skills of the researchers and on the depth of bioinformatics support that we can provide.
At the base, we have a series of Shiny web applications; they serve for visualization of various data, and this is really for every researcher, pretty much independent of bioinformatics skills. The idea is that these web interfaces are intuitive and can easily be used by virtually everybody who knows what he or she wants to do. One level higher, we have our web pipelines, where we have our SUSHI environment, a framework in which we can run next-generation sequencing data analyses. If you know what parameters to choose and how to run this, if you really know what to do and you are able to express it, then this is a great tool that lets you, from a web browser, generate QC reports from your data, align your data, call SNPs and so on. At the end of the analysis, you get a folder with the shell script that has been executed, the parameters, log files, and finally also all the results. So everything is fully documented; you can take it away and use it for downstream analysis and interpretation. And there we support really everything: RNA-seq, de novo assembly, metagenomics, all types of applications. Then, for the highly sophisticated users who really need their own environment, we also have a collection of Linux boxes where they can develop their own scripts on the command line. And finally, on top, we also offer collaborations, where we from our side provide the most input, and this is helpful for any type of researcher; whether he or she has limited bioinformatics skills or is an expert in bioinformatics, this can be a fruitful collaboration in every sense.

That was already all about us. Here now is the overview of the study that we did. The study was performed by the genomics research group of the ABRF, the Association of Biomolecular Resource Facilities, which is an association of core facilities like ours, the Functional Genomics Center. There were people from the University of Rochester, SUNY Albany, Harvard and the University of Zurich involved, and we also got support from Bio-Rad, Illumina, 10x Genomics, WaferGen, Fluidigm and the ABRF itself. And just because I name these companies, I don't want to express that these are the best companies for single-cell technologies; these are just the technologies that we had at hand.

Now let's quickly review the status of single-cell RNA-seq. It's now already eight years ago that it was first developed, and it has been improved over the years in some major steps, especially the introduction of 3' end tag counting, so that you no longer do full-length pre-amplification; with 3' tagging and sequencing you can still get very good expression estimates. We've also seen the introduction of UMIs; these unique molecular identifiers help you to get rid of the PCR duplicates that were introduced during amplification. And if you do the math, you have to do a lot of amplification, because a cell has something like a quarter of a million mRNA molecules that you want to sequence, and on the sequencer you want to produce something like a million reads per cell, so you need substantial amplification.
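To make the UMI idea concrete: deduplication amounts to counting distinct cell-barcode/gene/UMI combinations instead of raw reads. Here is a minimal sketch in Python, with made-up read tuples rather than data from any of the platforms discussed here:

```python
# Minimal sketch of UMI-based deduplication: reads sharing the same
# cell barcode, gene and UMI are treated as PCR copies of one molecule.
# The read tuples below are invented for illustration.
reads = [
    ("cell_A", "GAPDH", "ACGTACGT"),
    ("cell_A", "GAPDH", "ACGTACGT"),  # PCR duplicate of the read above
    ("cell_A", "GAPDH", "TTGGCCAA"),  # same gene, different molecule
    ("cell_B", "ACTB",  "ACGTACGT"),
]

molecules = set(reads)  # one entry per unique molecule
duplication_rate = 1 - len(molecules) / len(reads)
print(f"{len(reads)} reads -> {len(molecules)} molecules "
      f"(duplication rate {duplication_rate:.0%})")
```

In practice one would also allow for sequencing errors in the UMI, for example by collapsing UMIs within one mismatch of each other, which this sketch ignores.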
Now these protocols have been compared in an excellent paper by Ziegenhain and collaborators, which was published this year in Molecular Cell and was already available on bioRxiv last year. What we see from there is that all protocols they investigated have a substantial fraction of reads that actually map to exonic regions of the genome and can therefore contribute to gene expression values. There is also a substantial fraction of reads that are mapped but non-exonic, as well as unmapped reads, and all these fractions are larger than you would expect from bulk sequencing. With bulk sequencing we easily expect up to 70% of the reads, or even higher, anything between 60 and 90%, to map to exonic regions. With single-cell RNA-seq, the fraction of reads that really come from protein-coding exons is smaller. What you can also see, across the roughly eight protocols that were analyzed, is a substantial amount of PCR duplication: the light blue shows the number of reads that map exonic, and the dark blue shows the number of different UMIs among those, so the light blue is basically the redundant fraction. This means, for CEL-seq2 for example, if you got one million reads in total, the estimate is that something like 400,000 reads map exonic and only something like 20,000 of those are really unique molecules.

What this comparison also showed is that something around one million reads is enough: the figure on the right shows that the number of detected genes saturates around one million. There are different views on that, but somewhere between half a million and five million reads you reach saturation. The graph also shows substantial differences between protocols: Drop-seq, the brownish curve at the bottom, has one of the lowest sensitivities, with the lowest number of detected genes at a fixed sequencing depth, while Smart-seq2 had the highest sensitivity in this comparison; with a sequencing depth of one million reads you easily get something like eight to nine thousand genes detected.

They further checked the following: every cell is different, and we see only 8,000 genes in one cell, but how does it look across a set of cells? Do we detect the same genes in each cell or not? As has been shown many times, in different cells you detect different genes. This is due to different biology, due to the way transcription works, but also due to dropouts, and these dropouts can also be caused by the protocol, which just misses a few molecules in one cell but not in another. The fact that the cumulative number of detected genes across, for example, 100 cells still increases shows that this dropout process is somewhat random. And this also shows that it's really worth sequencing more cells.

Then in a second paper, by Svensson and colleagues, also published this year, they checked the actual sensitivity of the protocols. They used a spike-in model for that, with spike-ins of known concentration, and with that model they estimated that the best protocols, like Smart-seq2 or CEL-seq2 and STRT-seq, can detect down to something like five molecules at a depth of a million reads.
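To make the spike-in approach concrete: a detection limit such as "five molecules at a million reads" typically comes from fitting the probability of detection against the known input amount. A minimal sketch with invented numbers, not the data from the Svensson paper:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical spike-in data: known input molecules per cell and
# whether each spike was detected (at least one read/UMI observed).
molecules = np.array([1, 2, 5, 10, 20, 50, 100, 200, 500, 1000])
detected  = np.array([0, 0, 0,  1,  0,  1,   1,   1,   1,    1])

# Model detection probability as a function of log10(input molecules).
X = np.log10(molecules).reshape(-1, 1)
model = LogisticRegression().fit(X, detected)

# Report the input amount where detection probability crosses 50%.
log10_limit = -model.intercept_[0] / model.coef_[0][0]
print(f"estimated 50% detection limit: ~{10**log10_limit:.0f} molecules")
```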
So the best protocols really have a high sensitivity. But what this also shows, again, is that the droplet-based approaches like Drop-seq or GemCode tend to have a higher detection limit; they need more input molecules for a gene to be detected. And while the previous slide showed the average, this graph shows that even with a well-defined average, the detection limit can be very different from cell to cell, so you have considerable cell-to-cell variation. This paper is also worth reading because they compare the full-length protocols versus the tag-counting protocols and check the individual performances.

Now, these comparisons focused mainly on the wet-lab steps that you do after you have the cell isolated, but let's go one step back and ask how you actually get single cells. Single-cell manipulation has been around for years: you can do micropipetting or laser-capture microdissection, or you can do FACS sorting. But the first two, micropipetting and laser capture, only work for low numbers of cells; if you want 10, 20, 100 cells, that's okay, but if you want thousands of cells, that doesn't fly. The good thing is that these work with any tissue; the bad thing, when it comes to cost, is that the reactions run in microliter volumes and the consumables are usually expensive. There have been approaches that do the whole reaction in microdroplets, and if you go into microdroplets the reaction runs in nanoliter volumes and the cost of the reagents drops. And then, as a special product, there are the commercially available microfluidics chips from Fluidigm, the C1; they are designed to handle hundreds of cells, they can also visualize the cells, they are very fast, and they do the reaction in nanoliter volumes.

So these two approaches, microdroplets and microfluidics stations, have been around for a few years, and this obviously triggered companies to follow on. Here I'll come first to the C1 Auto Prep and show the specs. It's a microfluidic cell-manipulation device; you can have it with 96 cells or 800 cells, and you can do full-length transcripts, basically the Smart-seq2 protocol, or you can do 3' end tag counting. A big drawback is that you have doublets and that you need different chips for specific cell sizes. So if you do a tissue disruption and you have small and large cells, you can't put them all together on the same chip. Then, for the microdroplets, the 10x company developed the Chromium system, where all reactions happen in an emulsion system, in small drops that contain the necessary reagents for cDNA generation; the cell is added, and the first step of the RNA sequencing happens in these drops. Obviously you can easily have thousands or tens of thousands of drops, tens of thousands of cells, that you can handle; the reagents are quite cheap and you can process them all in parallel.
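A side note on doublets and capture rates: cell loading in droplets, and likewise in nanowells, is commonly modeled with Poisson statistics, which makes the trade-off between singlet capture and doublet rate easy to compute. This is textbook reasoning, not any vendor's exact calibration:

```python
from math import exp, factorial

def poisson_pmf(k: int, lam: float) -> float:
    """Probability that a droplet or well receives exactly k cells,
    where lam is the average number of cells per compartment."""
    return lam**k * exp(-lam) / factorial(k)

for lam in (0.1, 0.3, 0.5, 1.0):
    empty = poisson_pmf(0, lam)
    single = poisson_pmf(1, lam)
    multi = 1 - empty - single
    # Doublet fraction among the compartments that contain any cell.
    doublet_frac = multi / (1 - empty)
    print(f"lam={lam:.1f}: {single:5.1%} singlets, {empty:5.1%} empty, "
          f"doublet fraction {doublet_frac:5.1%}")
```

Loading more dilutely gives fewer doublets but wastes more compartments, which is why all these systems quote a recommended loading range.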
After these two products had been established, Illumina joined the market together with Bio-Rad and developed the ddSEQ SureCell protocol. It also works with a droplet approach; it can capture between something like 300 and 1,200 cells, it is more flexible in the cell size, up to 5 micron, its protocol is 3' end tag counting only, but it has UMIs, and the big plus is that it gets away without any pre-amplification step. Then another player that joined the market is WaferGen with the iCELL8 system, and that's a different approach: a nano-dispensing array. Here you have an array of roughly 5,000 wells, these wells contain cell barcodes, and the dispenser dispenses individual cells into the wells. This follows a Poisson distribution, so with the dispenser you usually fill up to 34% of these wells, which is why you can get between 1,000 and 2,000 cells on one chip. The big plus is that after you dispense the cells you can visualize them, check whether they are the ones you want and in good shape, and you can select them and sequence only those you are interested in, while with the other devices, for example the droplet devices, all the drops end up in the pool and you're going to sequence all of them.

Now, the study design that we used: we started out with a breast cancer cell line, we had a vehicle treatment and an HDAC inhibition with TSA, and on both we also did bulk RNA-seq. We used these platforms: the WaferGen iCELL8, the Illumina Bio-Rad ddSEQ, the Fluidigm C1 chip that has 96 cells and does full length, the Fluidigm C1 HT chip that has 800 cells and does only 3' tagging, and the 10x Genomics Chromium solution. As you've seen, these are very different technologies with different strengths in terms of throughput and sensitivity, and our design somewhat matched that. For the WaferGen iCELL8 the typical batch size is around 900 cells in one run; we did 900 times two on one chip, and we did the whole thing twice. For the Illumina Bio-Rad we had 600 cells in both batches. For the Fluidigm C1 you can have up to 96, but given the failure rate we counted on 80, and with the Fluidigm C1 HT we counted on 350. With the 10x Genomics we targeted around 1,000 cells. The target sequencing depth was 100,000 reads per cell for each technology, except for the Fluidigm C1, where it was one million reads per cell, because that protocol covers the full-length transcript, so you obviously need more reads to capture the molecules.

When evaluating such technologies, also from a core-facility standpoint, these are the aspects we were interested in: how reliable is such a system, how much can we trust it; from the scientific point of view, what transcript sensitivity do we want, is 3' tag counting enough or do I need full-length transcripts, do I also need non-coding RNA, do I need to worry about doublets; and from a technical point of view, are there UMIs, can I do de-duplication with UMIs, how many cells can I get, and what are the total costs.
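As a back-of-the-envelope check on this design, the cell numbers and target depths translate directly into total sequencing effort. The cell counts and depths below are the ones just mentioned; the reads-per-lane figure is only an illustrative assumption:

```python
# Rough sequencing budget for the study design; numbers as in the talk.
design = {
    # platform: (cells, target reads per cell)
    "iCELL8":         (1800, 100_000),    # 900 x 2 on one chip
    "ddSEQ SureCell":  (600, 100_000),
    "C1 96":            (80, 1_000_000),  # full-length needs more depth
    "C1 HT":           (350, 100_000),
    "Chromium":       (1000, 100_000),
}
READS_PER_LANE = 400e6  # illustrative lane capacity, not a spec

for platform, (cells, depth) in design.items():
    total = cells * depth
    print(f"{platform:15s} {total / 1e6:6.0f} M reads "
          f"(~{total / READS_PER_LANE:.2f} lanes)")
```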
For the evaluation we looked at the following metrics: what part of the transcriptome do we cover, what is the diversity of this transcriptome readout, what is the sensitivity of the detection, how reliable and efficient is it, and what fraction of cells and reads is actually usable. We also checked whether we would rather sequence more reads per cell or have a higher number of cells with fewer reads. Things we didn't cover here are doublets, and we also didn't cover ERCCs or other spike-ins, because not every technology allows spike-ins. We also didn't cover normalization aspects; there are publications on that, including a recent one, and I think normalization for single-cell RNA-seq is still a major issue, especially when you have low read counts and a large number of genes whose measured expression is exactly zero.

Now, the first thing we looked at is strand specificity. Here, all protocols that claim to be strand-specific were indeed highly strand-specific, except for the C1 HT chip, where a number of cells had a considerable amount of reads on the wrong strand. For the C1 96 the strand specificity is 50%, because this is an unstranded protocol, and the same holds for the bulk RNA-seq protocol that we used. For the mapping targets, we looked at where the reads go if you are able to align them, and the majority of the reads in our example went to protein-coding regions; this was basically true for all protocols. So the majority of the reads really come from protein-coding exons with this treatment. There are also some reads from pseudogenes, some from short non-coding RNAs and mitochondria, but in none of the protocols did these have a major impact.

And then here is a complicated slide. Here I compare the informative reads and the usable cells in the end. On the X axis I plot the number of reads, and on the Y axis the number of usable cells. For the iCELL8 you see at the top the dot that shows the nominal numbers: we have a device that claims it can do up to 4,000 cells, and we sequenced that with around 200 million reads. In the end we looked at how many reads ended up on protein-coding exons and how many cells could actually be used, which you can read from the triangle at the bottom: something like 150 million reads ended up on protein-coding exons, which I consider a high efficiency, but while the device is made to capture up to 4,000 cells, we had only 1,600 cells from which we could get reliable expression values. So a large number of cells were just lost during processing: they did not yield libraries, we didn't detect cell barcodes for them, or we didn't have enough material.
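The "usable cells" accounting just described can be mimicked with a simple per-cell filter. The toy numbers and the threshold below are invented; they only show the shape of the computation, not our actual QC criteria:

```python
import numpy as np

rng = np.random.default_rng(0)
n_cells = 4000                      # nominal capacity of the device
# Simulated per-cell read totals and exonic fractions.
total_reads = rng.lognormal(mean=10.0, sigma=1.5, size=n_cells)
exonic_frac = rng.beta(a=5, b=2, size=n_cells)

MIN_EXONIC_READS = 10_000           # illustrative QC threshold

exonic_reads = total_reads * exonic_frac
usable = exonic_reads >= MIN_EXONIC_READS
print(f"nominal cells: {n_cells}")
print(f"usable cells:  {usable.sum()} ({usable.mean():.0%} pass QC)")
print(f"informative reads: {exonic_reads[usable].sum() / 1e6:.0f} M "
      f"of {total_reads.sum() / 1e6:.0f} M total")
```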
So the circle always shows what the manufacturer tells you that you can get, and the triangle shows what you actually get. What is also remarkable here: for the C1 96 chip, nearly all cells were really usable; we had enough reads, we had consistent expression profiles, only a low number of cells was excluded in the QC step, and a large fraction of the reads actually contributed to the expression values. For the C1 HT, the efficiency on the sequencing side is not as high; we have more reads that come from somewhere else and are not protein-coding. And the outlier here is a bit the SureCell: in terms of cells, most of the cells are usable, but this data set has been completely over-sequenced, because most of the reads were duplicates, and all of these duplicate reads were eliminated in the UMI step. We only had something like 50 million different unique molecules in total, for all cells, but this was sequenced with something like 600 million reads. Why was this done? This run was supported by Illumina; they basically put it on a whole flow cell, so the sequencing was not a cost factor for them, and they wanted to look very good in this comparison, so they spent a lot of reads on it. But I think this was over-sequenced: you can produce a lot of reads, but if the diversity is not in your library, they are wasted. And this is not that new either; the 10x Genomics run has also been over-sequenced in some sense, because we have many fewer UMIs than reads.

Now let's look at the expression diversity. These violin plots show the fraction of reads that is spent on the top expressors. Next-generation sequencing is a competitive thing: if some highly expressed genes eat all your reads, you have no sequencing capacity left for the lowly expressed genes. How does this happen? If highly expressed or abundant genes get preferentially amplified, then you will mostly see those genes and little else. This is true for the IFC 800, which has the highest fraction of reads going to the top 500 genes. You also see here that we ran a pilot study of this already in 2016; those data sets are labeled with the prefix 2016, and the values were comparable. So the IFC 800, which is the C1 HT chip, is inefficient: it shows signs of over-amplification and spends the majority of its reads on very, very few genes, on 50 genes here. This is true for the C1 HT, it is also true for the 10x Genomics run where we did the 10,000-cell loading, and to a certain degree it is also true for the iCELL8. In that respect the C1 96 chip is good, and the bulk sequencing obviously is also good: you have a high diversity and less over-amplification.

Another aspect of diversity is how many genes are detected. Obviously, with the bulk sequencing you detect the most genes, and the C1 96 chip comes closest; this matches what the papers from Ziegenhain and Svensson have shown, namely that the Smart-seq2 protocol is really a sensitive protocol. After that, the C1 HT chip is still quite sensitive, and less sensitive are the iCELL8, the 10x Genomics and the SureCell. But this obviously also depends on the sequencing depth: the more you sequence, the higher the chance to detect something.
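A common way to factor out sequencing depth, which is what we do next, is to down-sample every cell to the same number of reads before counting detected genes. A generic sketch, not necessarily our exact procedure:

```python
import numpy as np

def subsample_counts(counts: np.ndarray, depth: int,
                     rng: np.random.Generator) -> np.ndarray:
    """Down-sample one cell's gene-count vector to a fixed total depth
    (multinomial draw on the read proportions, an approximation to
    sampling reads without replacement)."""
    total = counts.sum()
    if total <= depth:
        return counts.copy()
    return rng.multinomial(depth, counts / total)

# Toy example: detected genes before and after down-sampling one cell.
rng = np.random.default_rng(2)
cell = rng.negative_binomial(n=0.3, p=0.01, size=20_000)  # 20k genes
sub = subsample_counts(cell, 20_000, rng)
print(f"genes detected at full depth: {(cell > 0).sum()}")
print(f"genes detected at 20k reads:  {(sub > 0).sum()}")
```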
In order to correct for this depth dependence, we sub-sampled 20,000 reads or UMIs from each data set and then counted the number of detected genes; this way we factor out the sequencing depth, which influences the detection rates. Now the numbers are more comparable, but what you can still see is that bulk sequencing is best, and the ranking basically did not change, except for the SureCell, which yields a really high number of different genes if you take only 20,000 reads. I think this high diversity comes from the fact that the SureCell protocol does not do any pre-amplification step and therefore does not over-represent individual genes.

We then also checked, going from one cell to the next: if you look at different cells, how many genes do we detect cumulatively across all cells? Do we always miss the same genes, or do we detect more genes as we add cells? Now, these are all the same cells from the cancer cell line; it's not like you have different cell types as you would within a tissue, so we expect smaller variability. You see the same plot on the log scale and on the linear scale; again, for comparability, everything has been sub-sampled to 20,000 reads. What we can see is that the 10x Genomics droplet approach has this lower sensitivity and detects fewer genes, and from the linear scale you can learn that the whole thing saturates, in our setting, at something like 200 to 500 cells: once you have something like 400 cells, you have seen most of the expression diversity. Now, things have been sub-sampled to 20,000 reads here, and with the 10x Genomics technology, if you have 5,000 cells, you tend to sequence shallowly; as shown before, it doesn't really make sense to sequence deeper there, you don't get a major benefit. But if you take only 96 cells, as for example with the Fluidigm approach, then you usually still have enough money left to afford sequencing them at higher coverage. We then made the same plots using all reads we had, which means more than a million reads per cell for the C1 96, and there you see that you really do get a benefit: it is worth sequencing cells produced with the C1 96 and the Smart-seq2 protocol with more than a million reads, while with the other protocols you start to saturate earlier.

What we also wanted to look at: there is a lot of amplification, and amplification can be influenced by the GC content of the genes. So we checked what the chance is that a gene is detected if it has a low GC content versus a high GC content. The plots in red show the chance of detection for genes with a low GC content, where we took an extremely low GC content, below 0.4; the green plots show genes with a GC content around the center, between 0.5 and 0.55; and the blue ones are the genes with very high GC content. What becomes apparent is that the C1 96 has a GC bias, as does the C1 HT chip, while only a very small GC bias is present in the 10x Genomics and the SureCell; some GC bias is also there in our bulk RNA-seq protocol and in the iCELL8.
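The GC stratification just described is easy to reproduce from a detection matrix and per-gene GC values. Here both are simulated, with an artificial bias against high-GC genes, just to show the computation:

```python
import numpy as np

rng = np.random.default_rng(4)
n_cells, n_genes = 50, 5000
gc = rng.uniform(0.25, 0.75, size=n_genes)       # per-gene GC content

# Simulate a protocol whose capture efficiency drops for high GC.
p_detect = 0.5 - 0.6 * np.clip(gc - 0.5, 0.0, None)
detected = rng.random((n_cells, n_genes)) < p_detect

bins = {
    "low GC (<0.40)":     gc < 0.40,
    "mid GC (0.50-0.55)": (gc >= 0.50) & (gc <= 0.55),
    "high GC (>0.60)":    gc > 0.60,
}
for label, mask in bins.items():
    print(f"{label:20s} mean detection rate: {detected[:, mask].mean():.0%}")
```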
We further stratified that according to gene length: we binned all genes in a plane with GC content on the X axis and gene length, on a log2 scale, on the Y axis. What we can see is that most protocols have a certain bias; for example, the C1 96 prefers short genes and most easily detects short genes with low GC content. Short genes sit low on the Y axis, low GC content is on the left of the X axis, and there even the very short genes, with less than 512 bases, have been detected. But what we can also see is that the SureCell, while it overall has a low detection rate, detects genes basically independently of GC content and length.

Now we looked at the big question of what one should do: should I rather sequence 1,000 cells with a million reads each, or 10,000 cells with 100,000 reads each? These are the trade-offs you can make, and cost-wise it's very similar. What we did is we loaded the 10x Genomics machine with 1,000 cells, 2,000 cells, 5,000 cells and 10,000 cells, and always sequenced around 150 million reads from it. This means the number of reads per cell goes down: the more cells you have, the fewer reads per cell. What we wanted to know is: if we vary the cell loading, what is our detection rate? And what we can see, with the UMIs sequenced per cell going up to 60,000 here, is that as we increase the sequencing effort per cell, we also detect more genes. So it is true that if you load 10,000 cells the whole protocol works, but if you spend only a small amount of sequencing capacity on it, you also detect only a low number of genes. So it is definitely worth sequencing up to 50,000 UMIs per cell, which in our approach translated into approximately half a million reads per cell. You could also push it further; we didn't do that, but I think it would have worked, I think the information is there, but our budget was limited.

Then we did a t-SNE analysis, where we tried to see if we can separate the cells. With the 1,000-cell loading we can easily separate the two conditions, that was no problem at all. With the 2,000-cell loading and its lower sequencing depth we could still separate the two conditions, but we saw two subgroups, and this is because the whole thing was done in two batches; these are the two batches, and here the green cloud is also clearly two batches, which explains the substructure. With the 5,000-cell loading the separation becomes unreliable and we can no longer clearly classify all cells correctly, and with the 10,000-cell loading, at the resulting low-coverage sequencing, everything was gone; there was no sensitivity left to discriminate the two conditions. But this is exactly what, for example, the 10x Genomics people advertise: that it's enough to sequence 10,000 or 20,000 or at most 50,000 reads per cell, because they want you to spend your money on the 10x reactions and not on the sequencing; I guess if you ask an Illumina representative, it's going to be exactly the other way around. There has been a paper out that showed that with as few as 20,000 reads per cell you can classify different cell types from each other, but from our study I think you really have to look at the classification task at hand and the sensitivity you need, and I would feel safer starting with at least 100,000, or rather half a million, reads per cell to get a feeling for the number of reads that your research question needs; the demands may definitely vary from case to case.
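As a footnote on the t-SNE check mentioned above: it is essentially an embedding of the normalized count matrix, colored by condition, and the separation degrades as the reads per cell drop. A toy version on simulated data, not our actual cells:

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(3)
n_cells, n_genes = 200, 500
counts = rng.poisson(lam=5, size=(n_cells, n_genes)).astype(float)
condition = np.repeat([0, 1], n_cells // 2)
counts[condition == 1, :50] *= 3      # treated cells up-regulate 50 genes

log_counts = np.log1p(counts)         # simple variance stabilization
embedding = TSNE(n_components=2, perplexity=30,
                 random_state=0).fit_transform(log_counts)

# With enough signal the two conditions form distinct clusters.
for c in (0, 1):
    pts = embedding[condition == c]
    print(f"condition {c}: mean position "
          f"({pts[:, 0].mean():7.1f}, {pts[:, 1].mean():7.1f})")
```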
Conclusions: All technologies allowed successful single-cell RNA-seq experiments. I think the different technologies satisfy different needs in terms of detection sensitivity: if you want the highest sensitivity, we also had very good experience with the manual plate-based Smart-seq2 protocol, but if you have a higher demand on cell throughput, then you might want to go for a different technology, and this also depends on the costs you face and on the batch sizes that these technologies provide. There is a coverage bias present, and it is obviously stronger than in bulk RNA sequencing. The droplet-based methods provide the highest throughput in terms of cells, but at the cost of gene detection sensitivity. With this, I want to thank all those who contributed to this study, I also want to acknowledge the whole Functional Genomics Center that supported this work, and I am open for questions, whether from here in the room or from our online attendees.