Hi, so we'll get started. My name is Asim Siddiqui. Francis probably said a few words about me yesterday — Francis, did you say anything about me other than that I'm an evil industry dude now? Yeah. So that's actually true. I used to work with Francis when I lived in Vancouver for a number of years; I was at the Genome Sciences Centre there. But in the last few years I've moved to the corporate side of things, to the dark side. For the purpose of full disclosure, I'm now working for Applied Biosystems (Life Technologies), on the SOLiD platform. So I'll try to keep a fairly neutral tone to my lecture — there won't be any Borat-style singing of the national anthem proclaiming that AB is the best technology in the world and all the other technologies are not so good. None of that. But I would say: if you have any questions on the platforms, I can answer specific questions about SOLiD, but if you have questions about Illumina or 454, I'll direct those to my colleagues. And certainly if it comes to points of comparison, just to be fair, I'll direct those to my colleagues as well. Okay, so if you made it through the lectures so far, you'll be glad to know that there's very little math this afternoon, if that was something that was bothering you — I have no equations on my slides. For my first set of slides: we're calling this transcriptomics, but it's really any sort of counting application. And for counting applications, the only math you really need to know is the Poisson distribution. If you know the Poisson distribution, you basically have all the tools you need: you can use Poisson statistics to calculate the significance of a difference, given the depth of the library and the counts.
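To make that concrete — this is not from the talk, just a minimal sketch of the kind of Poisson-based test meant here. Under the null hypothesis of equal expression, two Poisson counts conditioned on their total follow a binomial distribution, which gives an exact two-sided test. The function names and the example library depths are my own illustrative assumptions.

```python
from math import comb

def binom_pmf(k, n, p):
    """Binomial probability mass function, exact via math.comb."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

def count_diff_pvalue(k1, n1, k2, n2):
    """Two-sided exact test for a difference in tag counts between two
    libraries of depth n1 and n2. Under the null (equal expression), the
    counts are Poisson, and conditioned on the total k1 + k2 the count in
    library 1 is Binomial(k1 + k2, n1 / (n1 + n2))."""
    n = k1 + k2
    p = n1 / (n1 + n2)
    p_obs = binom_pmf(k1, n, p)
    # Sum the probability of every outcome at least as extreme as observed.
    return sum(binom_pmf(x, n, p) for x in range(n + 1)
               if binom_pmf(x, n, p) <= p_obs + 1e-12)

# 50 vs 10 tags in two equally deep libraries: clearly significant.
print(count_diff_pvalue(50, 1_000_000, 10, 1_000_000))
```

The conditioning trick is what makes this practical: you never need to estimate the Poisson rates themselves, only compare the split of the combined count against the split of the library depths.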
And if you've done any sort of SAGE profiling in the past, you'll probably be familiar with these types of techniques. These are pretty much exactly the same slides I used last year; there have been a few developments since then, and I'll speak to a few of the changes that have happened. What I understood from talking to some of my colleagues over dinner last night is that there was some consternation yesterday about the pace of change in our space — there are a lot of new technologies and new advances being made. I would counter that somewhat. If you're a bioinformatician, obviously you want to keep on top of the latest techniques and algorithms out there, because that's where you make your mark. But if you're more on the biology side and you're really interested in getting to the bottom of a biological question, a method that was developed a year ago is probably going to work just fine for getting some interesting biology out. You may not have the most sensitivity, and there may be certain questions that only the latest methods are able to answer, but you can probably get 90% of the value out of methods that are already out there. OK, so we're going to start off by talking about the types of problem spaces we're looking at. This first talk is really about what you do beyond just genome sequencing: what other sorts of questions do you want to ask, around mRNA sequencing and beyond? What other types of questions are of interest, and which of those questions can you answer using the next-generation sequencing devices? The natural questions that follow once you have a genome are about understanding how that genome impacts the cell's function and how the cell is actually working.
To get at that, there are questions around which proteins are present in the cell — so which mRNAs are present — how those proteins are functioning, where they're binding to the genome or to mRNAs, what's methylated, and what's being turned on or off. So basically: how does this whole milieu of biological species work to create a functioning cell? This is, I'm sure, familiar to everyone in the room — the basic relationship between DNA, RNA, and protein. I wasn't here yesterday, but I'm assuming this is familiar to everyone, right? I don't need to go through it again. OK, so the transcriptome of a cell. The transcriptome is the entire set of RNA transcripts in that cell. Interesting features of the transcriptome are that it's cell-specific — different cell types will have different transcriptomes — and that it's time-variant: over a time course, you will see the transcriptome modified in response to the various stimuli applied to that cell. Understanding the transcriptome in those different states allows us to understand how cells differentiate and how they respond to changes occurring in their environment. Historically, transcripts were thought of as relatively simple things. As we've been able to investigate them in more and more detail using the newer techniques, we're learning new things about them. In terms of splicing, for example, it used to be thought that there was a relatively limited number of splice forms associated with each gene. Over the last year or so, there's been a recognition that there are a lot more splice forms out there — essentially, over 90% of genes have multiple splice forms, many more than we previously thought. And this has really been made possible by the deep sequencing that whole-transcriptome sequencing allows. I'll expand on that later on.
So there are various ways of assaying the transcriptome. Historically, gene expression experiments were done using Northern blots and RT-PCR. The downside of these types of assays is that they're targeted to a specific locus. ESTs provide a more genome-wide scan for transcribed elements, and the major reason they aren't used as much today is the cost. But now, with the cost of sequencing being reduced by next-gen sequencing, we're able to come back to this way of sequencing the RNAs directly. Chips — microarrays — are still very popular; they've been highly successful, and they have some advantages. Right now, they're still cheaper than running a next-gen experiment of equivalent depth, and I'll explain what I mean by that in a few slides' time. They're not useful when there is no genome sequence, though — you need a sequenced genome in order to design a chip. They provide around a 500-fold dynamic range across expression levels. A couple of chips have also been approved for clinical use. But the disadvantage is that you're limited to what's on the chip: the probes spotted onto the chip are specific to the loci they were designed for, so you essentially only have measurements at certain points across the genome. And if there are translocations or inversions, chips won't be able to detect those either. SAGE came along in the late 90s, if I remember correctly. The advantage of SAGE is that you get a digital count for each transcript — essentially one count, one sequence per transcript — and that sequence allows you to map back to the genome. From the counts that accumulate at a locus, you can infer the underlying gene expression level. The way SAGE works: you have your mRNAs, and there's an anchoring enzyme that you use, typically NlaIII, a four-base cutter.
You capture these by the poly-A tail, and then you take the three-prime-most NlaIII site, if that's the enzyme you're using to cut, and extract 20 or so base pairs downstream of that. Those tags you can then concatenate together and sequence as linked tags — typically on a Sanger sequencing platform. Each tag then represents the three-prime-most tag in the original mRNA population, and you can add those up, count them, and get back to the original gene counts. Now, one of the disadvantages of this approach is that you only get the three-prime-most SAGE tag, so you won't see expression across the whole gene. The advantage is that you can do novel transcript discovery. The disadvantages are that a single tag may map to multiple locations, and alternative transcripts may share a tag. And this doesn't work well if the genome is completely unknown. It's relatively expensive as well, especially in comparison to chips. But as you've seen already, with the drop in cost of next-gen sequencing it's becoming more practical to run larger-scale libraries. Now, the rough rule of thumb out there is that to run a transcriptome sequencing experiment — where you're running sequencing tags across your mRNA population — you need around 10 to 20 million tags to get coverage equivalent to the dynamic range you would expect from a chip experiment. Using those numbers, you can get a feel for the relative cost of next-gen sequencing versus running microarrays. So chips are probably going to be around for a while. But some of you may have seen that there's an article published in Nature — I forget his name, a postdoc from George Church's lab who is now faculty somewhere else.
The paper is basically a "death of microarrays" commentary piece. I think microarrays are going to be around for a little while longer, but certainly, as the cost of next-generation sequencing drops and as features such as barcode multiplexing — which lets you multiplex multiple samples into a single next-gen run — are introduced, I anticipate there's going to be a crossover point where it will actually be cheaper to do the experiment using next-gen sequencing. So what does the basic flow look like for an mRNA-seq experiment? This is where, as I said, if you've survived the math so far, there isn't a lot more — you use the same methods we talked about yesterday. You align your tags to a genome: you get the tags from your transcripts, align them, and then tally the transcript counts. This requires you to have a model of the genes, but if you have an annotation for those genes, you can simply count up the tags aligning to each transcript and add those together. And essentially, you're done. So I'll go through a few papers that were published in this space. This was a paper out of Sean Grimmond's lab where they sequenced using SOLiD to generate around 10 gigabases of data. They had an approach which allowed them to map across exon junctions using known splice events: they took their library of genes and created a special FASTA file containing the junction sequences bridging the different exons, generated from the known genes. And they also created all the alternate junctions between the exons of those genes, to allow them to do novel splice junction discovery as well.
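As a hedged sketch of that idea — not the actual pipeline from the paper — building a junction library from known exons might look like this. The exon sequences, read length, and naming scheme are illustrative assumptions:

```python
def junction_library(exons, read_len, gene_id="gene"):
    """Build junction sequences for every ordered exon pair, taking
    read_len - 1 bases from each side so that only reads which actually
    cross the junction can align to them. `exons` is a list of exon
    sequences in transcript order; pairs beyond adjacent exons cover
    exon-skipping (novel splice) events."""
    records = []
    for i in range(len(exons)):
        for j in range(i + 1, len(exons)):
            left = exons[i][-(read_len - 1):]   # donor-side flank
            right = exons[j][:read_len - 1]     # acceptor-side flank
            name = f"{gene_id}|e{i + 1}-e{j + 1}"
            records.append((name, left + right))
    return records

# Three exons and 5 bp reads: yields e1-e2, e1-e3, and e2-e3 junctions.
for name, seq in junction_library(["AAAAAAAA", "CCCCCCCC", "GGGGGGGG"], 5):
    print(f">{name}\n{seq}")
```

Reads are then aligned against this junction FASTA alongside the genome; a hit to an `e1-e3`-style record is evidence for a splice form skipping the intervening exon.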
This is a figure from the paper, where you can see that expression is localized, for the most part, to where the exons are — but there is some intronic expression as well, and you can see differences between the two cases here too. In terms of their pipeline, this is what they developed; it's available from their website. It's somewhat complicated, but essentially they sequenced — at the time — 35-mers, and they had a cascading flow: if a read didn't match as a 35-mer, they truncated it and mapped it again as a 30-mer, and then truncated it again if it didn't map and tried it as a 25-mer. And they had the splice junction approach as well that I mentioned earlier. From this data, they generated BED and wiggle (WIG) files, which you can import into UCSC. And I'll make an AB-specific comment: we also have our own pipeline at AB which does something similar to this. But there isn't really any magic or mystery around these approaches; as I said, it's really just a case of aligning tags to the genome and then counting those against the annotation. Highlighting from their libraries here: of the tags that do map, most — about 60% — map to known exons, but a remainder maps to unannotated regions. Of those tags in unannotated regions, 14% map to "known" regions, which they describe as regions with some prior evidence of expression, such as a previous EST being found there, or presence in another annotation set such as MGC or RefSeq. Then there are predicted regions, where a gene prediction covers them, and conserved regions — but there are still other regions where we see some expression.
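A minimal sketch of that kind of category tally — the categories and coordinates here are made up for illustration; a real pipeline would use the genome annotation and an interval tree rather than linear scans:

```python
from collections import Counter

def classify_tags(tag_positions, annotation):
    """Tally tag start positions against annotated intervals.
    `annotation` maps a category name ('known_exon', 'EST_evidence', ...)
    to a list of half-open (start, end) intervals; categories are checked
    in dict order, the first match wins, and anything left over is
    counted as 'unannotated'."""
    tally = Counter()
    for pos in tag_positions:
        for category, intervals in annotation.items():
            if any(start <= pos < end for start, end in intervals):
                tally[category] += 1
                break
        else:
            tally["unannotated"] += 1
    return tally

# Toy annotation: two known exons plus one region with prior EST evidence.
annotation = {
    "known_exon": [(100, 200), (400, 500)],
    "EST_evidence": [(700, 750)],
}
print(classify_tags([120, 150, 420, 710, 900], annotation))
```

Dividing each tally by the total gives exactly the kind of percentage breakdown quoted above (so much to known exons, so much to regions with prior evidence, the rest unannotated).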
Following on from the ENCODE paper, which came out a year or two before, that work showed evidence for expression occurring right across the genome, and these latest papers have built on that and found the same sort of thing. Another paper used 454 to look at the transcriptome of ES cells. And I'll spend a couple of slides on a method that came out of Barbara Wold's lab, where they used the Illumina technology to look at liver cells. The key things they found: again, most expression occurring in annotated exons. If you look at the dynamic range here — 10^4 to 10^9 — that's five orders of magnitude, five logs; a lot of orders. That's obviously much higher than what one sees from microarrays, so the sensitivity of these techniques is a lot higher. There's also this graph, where they show the saturation of genes detected at different expression levels. For the more highly expressed genes, shown by this curve, you don't see saturation until you get up to about 40 million tags. But if you're just asking "is it expressed at all?", you only need around 10 to 15 million tags before, at the low expression levels, you're seeing most of the genes you're going to see. Hence my earlier estimate of 10 to 20 million tags to get expression measurements equivalent to a microarray. Now, where the field has gone into a lot more detail over the last year is in looking at alternative splicing: identifying expression across splice junctions, and finding new junctions. Some of the techniques being developed now use paired-end sequencing — there's a paper that came out about a month ago where paired-end tags were used to look for fusion transcripts occurring in a cancer transcriptome.
And there are also approaches showing promise that can identify splice junctions using even just a single read. Michael Stromberg and I were talking earlier — when Michael Brudno gave his talk, he said there weren't really any methods that can do breakpoint resolution of indels. That's probably true today if you go out and look at programs you can download right now. But among the methods that are going to come out in the next couple of months, there will be ones that allow you to identify breakpoints of indels at single-base resolution using single reads, and you can apply the same sort of approach to the transcriptome to find fusion transcripts. So those are some of the developments in this area over the last year. Now, one of the issues you have with these mRNA-seq approaches in general — as you may have noticed from some of these graphs, and you see this with SOLiD data, Illumina data, and 454 data alike — is that you don't get constant coverage over an exon. There are several reasons for that. One is that you have different mappability across regions of the genome: if an exon is similar to another region in the genome, mappability is reduced and you'll see fewer counts there. The other reason is simply that the library sample preps and sequencing methods themselves are not bias-free. In any method there is typically some bias, so some sequences are going to be more prevalent than others — that's just the way of the world. These issues lead to the jagged profile you see across an exon. Then some reads map to multiple locations, and some reads don't map at all. There are different schools of thought on how to treat reads that map to multiple locations.
Some people suggest that if a read maps to multiple locations, you should throw it out. Others say it's better to map each read to each of the locations it maps to, and then essentially divide the count by the number of places it maps. But there's actually no uniform approach to representing those multi-mapping reads. OK — and I always like to do a back-of-the-envelope calculation, just to think about what we'd actually need to answer some of these questions. Say we're looking at exhaustive sequencing of a transcriptome. A paper by Carter and colleagues estimated there are around 500,000 to 800,000 transcripts in a cell. If we say the average size of a transcript is around 2 kb, then the transcriptome of a single cell is around 1 to 1.6 gigabases — which would suggest that the cost of sequencing a transcriptome is roughly equivalent to the cost of sequencing a genome. Now, that's actually not quite true. The reason is that we depend on having coverage — coverage is what allows us to find differences and variations. At the genome level, if we're looking at structural variants, as Michael Brudno was discussing earlier, we rely on multiple reads crossing a structural variant to let us find it; with a genome, we're really just trying to construct a single consensus sequence. With a transcriptome, because you have multiple alternative splice forms, it doesn't quite work that way. And the other problem you run into with the transcriptome, which builds on that, is that different transcripts are present at different expression levels. If a gene is highly expressed, and all of its transcripts are highly expressed, you're probably going to find all the alternative splice forms of that gene.
But if a gene is present at only low expression levels, you're not going to have enough reads covering all those breakpoints to satisfactorily recover all the alternative splice forms. So the rule of thumb isn't quite true, but it's a neat one. And I'll make another AB-specific plug, if I may: we had a method come out earlier this year where we actually recovered the transcriptome of a single cell, and that paper is available in Nature Methods. Yes? [Audience: Going back — what's the range?] I have to admit it's been a couple of years since I read this paper, probably more than that, so I don't recall — but I would imagine they looked at a particular cell type, and it's going to vary by cell type, for sure. Right — I should repeat the question for the camera. The question was that certain cell types will express specific transcripts very highly. Yes, that's very true. If you sequence pancreas cells, you find huge amounts of the insulin gene present, and that drowns out all the other expression. So yes, you're absolutely right: there can be a gene like that, turned on at a huge level. But I actually don't know what the real answer to that question is, because — what I'd throw back to you is — I don't know what's happening to all those other transcripts. Is the cell actually more active than other cell types, in other words does it have way more transcripts in general, so the other transcripts are all present and this one is just really high? Or does the cell have a limited mRNA production capacity, in which case the fact that this one is really high means everything else is just generally low, and there's not a lot more to find? I don't know if we know the answer to that question.
[Audience: And then the other thing — errors seem to be getting picked up as transcripts?] Yes — so one of the issues — and I remember talking to, I'm blanking on the name, someone from Max Planck — they were doing a theoretical calculation of how many transcripts are present in a cell, based on gene expression data from a next-gen sequencer. One of the issues you run into is that as you sequence more deeply, you accumulate sequencing errors, and if you have genes that are similar, a sequencing error can look like expression from a closely related gene. So they were trying to account for those types of issues. That's definitely a problem, yeah. Okay, so moving on — how am I doing for time? Okay. So another class of experiments looks at DNA-binding proteins and tries to identify where they bind on the DNA. Obviously this is important from the point of view of activation and repression of genes: we can identify the promoters and repressors binding there, and that can help us understand how expression is controlled. We're also interested in histone binding sites, and, a little bit later, methylation as well. You can have both sites in cis and sites in trans that affect the expression of your protein of interest. The common evolution here has been from targeted assays to ChIP-chip and now ChIP-seq, with ChIP-seq obviously being the next-gen approach to answering this question. The ChIP — chromatin immunoprecipitation — assay allows us to isolate those fragments of DNA which have a protein bound to them. The way this is done is by cross-linking, so the protein is stuck to the DNA; the DNA is then sheared to separate out the fragments; and an antibody targeted to the protein we're looking at is used to enrich for the bound fragments so we can identify them.
Now, once we have these fragments, we could run them on a chip, as was done historically. But with the next-gen methods, the approach is to just sequence them directly. Excuse me. This gives much better resolution on the binding sites, and also lets you look genome-wide, with no prior hypothesis as to where those binding sites should be. And again, we have the same basic workflow: the reads are aligned to the genome, and then you look for peaks. I think this was one of the first papers — it's not the first paper — that came out, looking at STAT1 binding sites. They used the Illumina technology, and in their comparisons to orthogonal methods they found they had very high sensitivity and specificity. And they didn't actually have to go very deep. Again, back to throughput: you only need a fraction of the throughput that the SOLiD and Illumina platforms are currently generating in order to do this type of experiment. Even with 24 million tags in total, of which around 12 to 15 million mapped uniquely, they reached saturation in their analysis. So even with these now-low tag counts, you're able to do these sorts of experiments — you can imagine that with barcoding you could multiplex 10, maybe 20 of these into a single experiment and run them much more cheaply. The typical profiles look like this: this is the stimulated case, this is the unstimulated, over the same region. And as you would expect, where the cells have been stimulated, a lot more STAT1 binding sites are apparent. There have been other methods published in this space as well; this paper found 98% concordance with ChIP-chip.
This shows the correlations between ChIP-seq and ChIP-chip, and you can see they agree very well. Here's an interesting paper out of Barbara Wold's lab showing some of the power of these techniques. This gene had been known for years to be regulated by NeuroD1, but the actual binding site had never been found — traditional biochemistry methods, and bioinformatics methods scanning with consensus binding site profiles, had failed to find the target. They ran a single next-gen ChIP-seq experiment and, lo and behold, the answer just fell out. They were able to find the binding site; it was a weak match to the consensus motif, which is why the original bioinformatics techniques could never find it. Just one run on a next-gen sequencer, and they got the answer. And then, just before we move on to methylation: ChIP-seq is fairly well established for looking at the DNA binding of proteins, but you can look at protein–RNA binding too. There's a neat paper that came out just a couple of months ago where the authors developed a technique to capture ribosomes bound to mRNA molecules. They were able to pull out tags and identify the positioning of ribosomes across the mRNAs, and so they could determine protein translation rates and find positions where ribosomes were pausing during translation. So that's a neat example, and I think we're going to see an evolution of these sorts of techniques — there are probably new techniques that are going to come out and be established over the next couple of years. But these basic techniques are there, they're established, and the methods are not going to change much.
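Before moving on, here's a toy version of that align-then-find-peaks step. Everything in it — the fragment length, the threshold, the read positions — is invented for illustration; real peak callers additionally model background, strand structure, and duplicate reads:

```python
from collections import defaultdict

def call_peaks(read_starts, frag_len, min_height):
    """Toy ChIP-seq peak caller: pile up reads extended to the estimated
    fragment length, then report maximal runs of positions whose coverage
    reaches min_height, as (start, end, peak_height) tuples with a
    half-open [start, end) interval."""
    coverage = defaultdict(int)
    for start in read_starts:
        for pos in range(start, start + frag_len):
            coverage[pos] += 1
    peaks, run = [], None
    for pos in sorted(coverage):
        if coverage[pos] >= min_height:
            if run and pos == run[1]:          # extend the current run
                run = (run[0], pos + 1, max(run[2], coverage[pos]))
            else:                              # close the old run, open a new one
                if run:
                    peaks.append(run)
                run = (pos, pos + 1, coverage[pos])
    if run:
        peaks.append(run)
    return peaks

# Ten reads piled up around position 1000, plus two stray background reads.
reads = [995, 998, 1000, 1001, 1003, 1004, 1005, 1007, 1010, 1012, 5000, 9000]
print(call_peaks(reads, frag_len=200, min_height=5))
```

The clustered reads produce one peak while the two stray reads never reach the threshold — which is the whole idea: a bound site shows up as a pile of overlapping fragments well above background.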
Where I think there is still a lot of work to be done on the algorithmic side is in looking at methylation. In methylated DNA, cytosines are methylated; regions which are methylated are silenced, and the genes in those regions are not transcribed. So together with the histone modifications, it's another form of transcriptional control. There are a number of techniques for applying next-gen sequencing to this. You can enrich for hyper-methylated regions and then sequence those. Another approach is to bisulfite-convert your DNA: when you take it through that process, unmethylated C's get converted to T's, but methylated C's do not. And this causes all sorts of problems for your sequence alignment — your sequenced sample is now a hybrid of the original, where some of the C's are converted into T's and others are not. As you can imagine, this makes the alignment much more difficult. Common techniques involve aligning to the standard genome and also aligning to an in silico bisulfite-converted genome; that's kind of the way it's done, but these methods are still being developed. One such method, developed by this group, used dynamic programming to try to identify those regions; they also used targeted sequencing to reduce the search space, which gives you a much smaller genome so you can do a much more exhaustive search over that reduced space. [Audience: I was just going to point out one of the caveats — when you align normally, you can just take the reverse complement, but with an in silico bisulfite-treated genome the strands are no longer complementary, so that trick no longer works.]
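To illustrate just the forward-strand case described above, here is a minimal, assumption-laden sketch of in silico bisulfite alignment. Real aligners must also handle the reverse strand (the six- or nine-gigabase search space just mentioned), mismatches, and multiple hits; the sequences here are invented:

```python
def bisulfite_convert(seq):
    """Fully convert a sequence in silico: every C reads out as T.
    (In a real sample, methylated C's would survive as C — that retained
    signal is exactly what the assay measures.)"""
    return seq.replace("C", "T")

def align_bisulfite_read(read, genome):
    """Naive forward-strand bisulfite alignment: compare the C->T
    converted read against the C->T converted genome, then go back to the
    untouched genome to list which read positions sit over a reference C
    (the candidate methylation calls). Returns (offset, ref_c_positions)
    or None if the read doesn't align."""
    conv_genome = bisulfite_convert(genome)
    offset = conv_genome.find(bisulfite_convert(read))
    if offset < 0:
        return None
    # At each reference C: a C in the raw read means methylated,
    # a T means unmethylated.
    ref_c_positions = [i for i in range(len(read))
                       if genome[offset + i] == "C"]
    return offset, ref_c_positions

genome = "AACGTTCCGATACG"
read = "TTCCGA"   # read from a molecule whose C's were methylated (C's retained)
print(align_bisulfite_read(read, genome))
```

Note that the fully converted read `"TTTTGA"` (no methylation) aligns to the same place — converting both read and reference is what makes methylated and unmethylated molecules from the same locus land together, so the C/T states can then be tallied per position.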
Right — so you can't just align against the three-gigabase genome; you have to align against more. And it's kind of interesting: it's six gigabases if you're looking in base space, and in color space you actually need to search nine. So there are advantages and disadvantages to color space here. The advantage in color space is that you still have four colors that are reasonably well balanced. I don't have the stats for that here, but you can imagine that if you bisulfite-convert your genome, with C's being converted to T's — and most of the genome is not methylated, so most of your C's get converted — you end up with essentially three bases present. But in color space, because the colors correspond to transitions between bases, you don't see that much reduction: the four colors still end up relatively well balanced. The downside is that when you bisulfite-convert your genome, the rule that — let me get this straight — the rule that your forward and reverse strands are identical in color space, which I think Michael, or one of the Michaels, pointed out in their slides yesterday, no longer applies. Which is why you need to search against essentially nine gigabases instead of just six. And then another method, again more exhaustive, tested reads against every possible methylation pattern and retained the unique hits. So again, though, the basic workflow is to align reads — more difficult for bisulfite-converted reads — but it's essentially the same process of aligning reads, counting, and analyzing. And if you can align the reads, you can basically apply any of the methods I just talked about. Let me also just briefly mention metagenomics. Some of the earliest papers were published by Craig Venter — this is the most well known one, the sequencing of the Sargasso Sea samples — and that used Sanger sequencing.
But there have been many more recent studies where the various next-gen sequencing techniques have been used to study metagenomics. These have primarily used 454, but I think now people are using Illumina and SOLiD as well — I'm certainly aware of people using SOLiD, and I'm sure people are using Illumina too. 454 does have an advantage here through its longer read length. Typically these studies target the 16S or 18S ribosomal subunits and look for variations in those units to identify the species that are present. And as I mentioned earlier, it's the same basic process: take your reads and align them. Once you have counts, you can analyze them using many of the existing tools and approaches — if you have counts for gene expression levels, you can plug those into GeneSpring or whatever analysis tool you've been using previously. And metagenomics has obviously been gaining interest in this area as well. And I think that's it. So I'll take any questions on that.