Okay. Good morning, everyone. Welcome to our 11th week of the Current Topics series. Today we're going to be turning our attention to large-scale expression analysis, techniques that have quickly become integral not just to the field of genomics but to biomedical research in general. So it's my pleasure to introduce to you today Dr. Paul Meltzer, chief of the Genetics Branch at the National Cancer Institute's Center for Cancer Research. Dr. Meltzer's research focuses on understanding genomic instability, the abnormalities in genome structure and function that often lead to cancer. He was one of the very early pioneers driving the original development and refinement of microarray technologies, both on the instrumentation side and on the bioinformatics side, and has used this approach to its best advantage to better understand the basic biology underlying both tumorigenesis and cancer progression. He'll be sharing many of his own insights on how best to use both microarray-based and sequence-based techniques to understand changes in patterns of gene expression, as well as how to apply these techniques to important clinical questions, particularly how our understanding of subtle yet important changes in gene expression might play a role in actual clinical practice in the future. Please join me in welcoming today's speaker, Dr. Paul Meltzer.

Okay, thank you, Andy, for the opportunity to speak in this course today. I want to make a few introductory remarks before I get going. One is that this field has expanded and changed tremendously over the years it's been developing, and it can no longer be covered in anything approaching a comprehensive manner in a single lecture. So it presents a challenge for what to talk about today. I'm going to address the audience that I think is the largest group: the people who will be consumers of scientific publications that include large-scale expression analysis, which is an immense number of publications now. The next group is those of you conducting laboratory research who will at some point use these technologies in your own work, which is also a really huge group. Then there's a smaller group that's more concerned with the details of the technologies and the details of the bioinformatics, which are incredibly important and complex but which we really can't present in detail in a lecture of this duration. So that's what I'm going to try to do. I would also like to thank, at the beginning, the many people from my lab who've worked on various aspects of large-scale expression profiling for almost the last 20 years now. I particularly thank Sean Davis, who besides doing that has helped me get some great slides to add to the talk on the topic of RNA sequencing, which is very much a current topic in genome analysis. With that, I'll proceed. I have no relevant financial disclosures.

So the way I like to orient my thinking about this topic is that now, in the era of having a tremendous amount of genome sequence available for many species, gene expression is one of the fundamental aspects of the analysis of genome function. This field of transcriptomics is important to really all of the topics you can think about: function, variation, organization of the genome. Now, right here at the beginning, I'd like to list some topics that I think are important to bear in mind as we go through some of the more specific parts of my presentation.
First, the expression of genes has a huge dynamic range, from transcripts that may be present in only a few copies per cell, or perhaps not even in every cell of a population, to transcripts present in tens or even hundreds of thousands of copies per cell. So that's huge. Then we have another issue: there are a large number of genes, particularly in chordate genomes, and that creates a statistical problem, because we're measuring lots of things while the sample number is usually much smaller than the gene number. I want to point out that, amazingly, not all transcripts are known, not just splicing variants but even specific gene transcripts, and with the explosion in research into non-coding transcripts that's very much an ongoing process. Finally, all the technologies are imperfect and have their own limitations. There is no technology that will perfectly deliver the ground truth about the transcriptome. All of them have issues, and it's a truism here that analytical tool development lags behind data generation technology development. So once we start to crank out a new kind of data, it usually takes some years for the analytical tools to catch up with the kind of data we're now generating.

Now a very brief historical comment. The talk is going to be divided between microarray hybridization analysis and sequence-based analysis. The timeline for microarray hybridization research goes from about 1995 to the present, and you may think of sequencing as something that's really new, but actually tag-based capillary sequencing of transcripts began around the same time as hybridization-based analysis and was used extensively by certain groups. It didn't become as widespread in use because not that many people were capable of doing a lot of capillary sequencing on cDNA, but that all changed in about 2008, as we'll discuss in a little detail, with the advent of what we call next-gen or second-generation sequencing methods, which really have changed the landscape, I think, forever. During this period from the mid-90s to recent years on the microarray side, which I'll talk about first, we went from being able to look at relatively small numbers of genes to generating arrays that can have millions of probes, which is now completely routine. So there's been a huge evolution on the technology side, and we're now talking about a fairly stable technology for gene expression analysis.

I'm going to talk about a few technical points. I'm going to try to keep this to a minimum, but just so that you've heard this: a feature is an array element, and a probe is a feature corresponding to a defined sequence which is hybridized to a target, a pool of nucleic acids of unknown sequence. Array features can be any kind of nucleic acid, but at the present time they are by far most often synthetic oligonucleotides. This is very attractive because oligonucleotide array design is extremely flexible. It can be made any way you want. You can have a 3' bias, which can be useful; you can have the full length of a transcript, exon-specific oligos, candidate genes, microRNAs, whatever, and you can achieve very high densities. But you do have to have a priori sequence data for the genome in question, which is not so much an issue in the present era.
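To make the flexibility of in silico probe design concrete, here is a toy sketch in Python of the very first filtering step one might apply: sliding a window over the 3' end of a transcript and keeping candidate oligos with acceptable GC content. The sequence, probe length, window size, and thresholds are all invented for illustration; real probe design also models melting temperature, cross-hybridization, and secondary structure.

```python
def gc_fraction(seq: str) -> float:
    """Fraction of G/C bases in a sequence."""
    return (seq.count("G") + seq.count("C")) / len(seq)

def candidate_probes(transcript: str, length: int = 60,
                     region: int = 600, gc_min: float = 0.40,
                     gc_max: float = 0.60) -> list[str]:
    """Slide a window over the 3'-most `region` bases of a transcript
    and keep windows whose GC content falls in a target range.
    This is only the first, crudest filter in real probe design."""
    three_prime = transcript[-region:]   # 3' bias, as for oligo-dT-primed targets
    probes = []
    for i in range(len(three_prime) - length + 1):
        window = three_prime[i:i + length]
        if gc_min <= gc_fraction(window) <= gc_max:
            probes.append(window)
    return probes

# Hypothetical transcript sequence, for illustration only.
tx = "ATGGCGGCTTCAGGCTTAACCGGTACCGATGCTGCTAAGCTAGCTGGATCCGGTACC" * 20
print(len(candidate_probes(tx)), "candidate 60-mers pass the GC filter")
```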
Microarray manufacture has pretty much converged on in situ synthesis, of which there are several methods. I'm going to illustrate very quickly three that are very current. One is light-directed oligonucleotide synthesis using a mask-based photolithography method; this is the Affymetrix chip method. Basically, the ordinary chemistry of oligonucleotide synthesis has been adapted so that, using light-directed activation of steps in the reaction, you end up creating a lawn of oligos on a silicon support, and in the end that's what is delivered to you as a reagent to hybridize. Agilent has developed a similar technology using inkjet-directed synthesis, which can also achieve very high feature density and is completely configurable in silico. Illumina has a variant technology based on randomly positioned high-density arrays of addressable oligos coupled to beads, which is a really remarkable technology. So all these and others are commercially available, and they're relatively stable products with a lot known about their performance characteristics and many accompanying bioinformatics tools.

To use these things, you hybridize the target to the probes, and you need a readout: the idea is that you want to determine the quantity of target bound to each probe in a complex hybridization. To do this well, you have to have a high degree of sensitivity and a low intrinsic background, and high spatial resolution has to be possible in order to resolve individual features on very high-density arrays. It's useful to have dual-channel capability, and really fluorescent tags seem to be the best way to do this; that's essentially the dominant technology. To actually do these things in a laboratory, and I should say that most of the time this happens not in the individual investigator's laboratory but more likely in a centralized core facility, you have to have the arrays, the equipment to hybridize and wash them, the appropriate kind of optical scanner to convert the hybridization signal into a number, software for processing the array image, software for data analysis and display, and, for many projects, a bioinformatics collaborator who can help with the statistical issues that are important.

Now, when thinking about things you might do with expression profiling, which I'll talk about and give some specific examples of, probably one of the more important things I can tell you is to first find out what's out there already. There are two major public databases that contain a tremendous amount of this type of information, already processed for your consumption. One is here at NCBI: the GEO database, which currently houses expression data on over a million samples. That's a tremendous amount of data from many different species, and it's incredible how useful it is and how frequently it preempts the necessity to actually do an experiment, or provides data that can be merged with new experimental data in a creative and informative way. There's a complementary database in Europe, ArrayExpress, which now has about 1.4 million arrays available, so a tremendous amount of data.
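As a concrete illustration of mining what's already out there, here is a minimal sketch that queries NCBI's E-utilities for GEO DataSets entries matching a search term. The query string is just a hypothetical example; for real use, consult the E-utilities documentation for rate limits and API-key etiquette.

```python
import json
import urllib.parse
import urllib.request

# Hypothetical query: array-based expression profiling studies of rhabdomyosarcoma.
term = 'rhabdomyosarcoma AND "expression profiling by array"[DataSet Type]'
url = ("https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?"
       + urllib.parse.urlencode({"db": "gds", "term": term,
                                 "retmax": 5, "retmode": "json"}))

with urllib.request.urlopen(url) as response:
    result = json.load(response)["esearchresult"]

print("total hits:", result["count"])
print("first few GEO UIDs:", result["idlist"])
```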
The next thing to think about, if you're going to be doing studies, is that you have to be prepared to deposit your data in one of these databases, since many journals will require data submission. This is done according to a particular standard format, and I've provided the URL so you can read about it: the MIAME standard, or Minimum Information About a Microarray Experiment, was established so that uniform data sharing systems could be developed.

So let me talk a little bit about the properties of some of the different array formats. The two-color array, embodied by the inkjet arrays I showed you a moment ago and some other commercial platforms, is typically based on a test and a reference sample, which are labeled with distinct fluorochromes and hybridized to an array, which after suitable washing is scanned. You then get two grayscale images corresponding to the two fluorochromes, and this data, which you see here pseudo-colored, is turned into a series of ratio data. One-color arrays, which can be done with a number of the different oligonucleotide platforms, including inkjet and photolithography-generated arrays, generate a single color, so the normalization is done across arrays instead. In the end the output is expression ratio or relative expression information, depending on whether you're hybridizing with two colors or one, but both types of data can be analyzed with essentially the same tools.

So why would you want to do these experiments in the first place? What are they good for? There are two major categories of application that I want to mention. One is the expression profiling of tissue specimens. This can be from any organism and is frequently done to compare tissues that represent different disease states. To do these kinds of studies properly, which are intrinsically complex because of the high order of biology you're looking at, the power arises from large sample numbers, particularly with clinical samples that might come from human patients. In contrast, there are many direct comparisons between experimental groups that are possible, and these include things such as gene induction or knockdown. Here the configuration of the experiment is absolutely critical. Just as some examples of where this type of array analysis comes into play: studying the effects of disease genes, genes that are regulated by specific transcription factors or hormones, genes altered by exposure to drugs, infectious agents, or physical agents, treatment with siRNAs, and on and on, all the different things you can do in the lab to manipulate cells and perturb their gene expression. Instead of looking at one or two genes, it's now quite routine, and usually almost required, to look at gene expression at large scale.

When you do these types of experiments, whether they're tissue profiling experiments or laboratory-based experiments, you will generate a large amount of data. For example, if you look at 25,000 probes or probe sets across 200 samples, you're going to get five million data points, which is in excess of what you can interpret by looking at a large spreadsheet manually. So you are inevitably cast into the era of computational analysis, and you have to have analysis and visualization tools.
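As a first taste of that computational analysis, here is a minimal sketch of the two-color ratio step just described, using made-up intensities: background-subtracted test and reference signals become median-centered log2 ratios. Real pipelines use more sophisticated normalization, such as intensity-dependent loess, but the logic starts here.

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up background-subtracted intensities for 10,000 features:
# channel 1 = test sample (e.g., Cy5), channel 2 = reference (e.g., Cy3).
reference = rng.lognormal(mean=7.0, sigma=1.5, size=10_000)
test = reference * rng.lognormal(mean=0.1, sigma=0.3, size=10_000)

log_ratio = np.log2(test / reference)

# Global median centering: assume most genes are unchanged, so the
# median log ratio should be zero after normalization.
normalized = log_ratio - np.median(log_ratio)

print(f"median before: {np.median(log_ratio):+.3f}, "
      f"after: {np.median(normalized):+.3f}")
```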
So I'm going to go through some of the things that might be done with these kinds of data, and some of the thinking I'm going to spend the next several minutes on is equally relevant to microarray data and sequence-derived data. In general, each sample is analyzed individually first and needs to be assessed for quality control; quality control metrics are relatively standard for the commercial microarray platforms, and one would throw out experiments that didn't work well enough. You'll then go through some type of pre-processing, which is somewhat platform-specific but will include a form of normalization; it may well include removing probe sets or genes which are not accurately measured because they didn't generate a signal, and it will tend to involve removing genes which are similarly expressed in all samples, because they won't really add anything to the differential analysis of gene expression. Then you'll go into probably both an unsupervised analysis and a supervised analysis, and I'll explain what I mean by that.

An unsupervised analysis really asks, without any prior knowledge about the samples, how do the genes and samples organize into groups? It's a very powerful method of data display, but in and of itself it does not necessarily prove the validity of the groups that might pop out. The idea is that clustered samples will tend to be biologically similar, and clusters of co-expressed genes may be functionally related and may be enriched for pathways or other important features. I'll give one example, which was one of the first that we did in our lab. It's simple, which makes it easy to explain quickly. We were working on a group of pediatric cancers that had similar morphology, and we were interested in understanding, at that point, whether there was a specific expression profile related to a group of tumors called rhabdomyosarcomas. On a very early generation microarray, we had taken several of these tumors, which are across the top here, and several related and unrelated cancers, listed across the bottom. This was done on a two-color array, and what I'm showing here are the scatter plots of all the possible comparisons between one of the samples and all of the others. We calculated a Pearson correlation coefficient, and we saw that the similarity, as judged by that metric, ranged from 0.77 down to 0.40: here, two rhabdomyosarcomas which were rather similar, and there, a prostate cancer which was very dissimilar. By generating the full matrix of correlation coefficients, you get something that looks like this, with values ranging from very low to very high. With these types of matrices, there are some fairly standard display tools available, which are still at the root of many of the display methods current for this type of data. One of them is hierarchical clustering, in which a tree is built by putting the most similar samples closest together and then working your way out to the more dissimilar samples. We were very excited at that point to find that we did indeed see something that could be considered significant enough to group all of these sarcomas close together. So this is still a current data display. You will also see displays of this type in which each sample is shown as a point in either two-dimensional or three-dimensional space.
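Before moving on to those point displays, note that the correlation-and-clustering workflow just described is easy to sketch. Below, made-up expression values for six samples are turned into a Pearson correlation matrix and then an average-linkage tree, the same logic underlying the dendrograms shown; the sample labels and group structure are invented.

```python
import numpy as np
from scipy.cluster.hierarchy import dendrogram, linkage
from scipy.spatial.distance import squareform

rng = np.random.default_rng(1)

# Made-up data: 500 genes x 6 samples, with samples 0-2 sharing one
# expression program and samples 3-5 sharing another.
genes = 500
base_a, base_b = rng.normal(size=genes), rng.normal(size=genes)
data = np.column_stack(
    [base_a + rng.normal(scale=0.5, size=genes) for _ in range(3)] +
    [base_b + rng.normal(scale=0.5, size=genes) for _ in range(3)])

corr = np.corrcoef(data.T)          # 6 x 6 Pearson correlation matrix
dist = 1.0 - corr                   # convert similarity to distance
np.fill_diagonal(dist, 0.0)
tree = linkage(squareform(dist, checks=False), method="average")

# Build (without plotting) the dendrogram that groups similar samples.
dendrogram(tree, labels=[f"tumor_{i}" for i in range(6)], no_plot=True)
print(np.round(corr, 2))
```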
Plotting each sample as a point in this way is called multidimensional scaling, and again, you can see the related tumors are plotted together. Here's an example of a three-dimensional display from another dataset that again shows similar samples plotting together. These have some advantage in that you can, I think, more easily see the relative relationships of all the samples. Here's a more complex example where variations of similar statistics were used to extract the genes that allowed several different tissue types to be plotted on the same axes, and this concept can be extended to large numbers of sample types.

To give you an idea of what this looks like, I'm going to work through an extremely simple example based on only six genes, and these happen to come from breast cancer. So we've picked about 200 samples and six genes, and what you're looking at here is the completely unclustered data. This is a data visualization problem I'm trying to address, and this actually isn't that useful. So let's see what happens if we cluster the genes in the fashion I've just briefly described. Now we see a little bit of order: ESR1, which is estrogen receptor alpha, and GATA3, which is a master regulator of mammary epithelial development that governs the expression of the estrogen receptor positive phenotype, are now clustering together, and you see the others have paired up in a similar fashion. This is a little bit informative, but this heat map is not all that easy to look at until we go ahead and apply exactly the same type of math to the samples, which are plotted individually across this axis. When we do that, we've grouped the samples, and now we can see a series of estrogen receptor positive samples clustering together, and here some strongly estrogen receptor negative samples. Within that, we can see a subset expressing the ERBB2 oncogene, which we know is important in a subset of breast cancers, and so on. So that's how this looks with a simple set of six genes. With a very large set of samples and genes, you can see the kind of structure and complexity that arises, which can generate many happy hours of perusal wondering what that group over there is. This is an extremely useful approach to data visualization; it can be done with either freely available or commercial software, and it is actually not terribly difficult to do.

Supervised analysis is a bit of a different approach. Here we're asking the question: what genes distinguish samples in selected groups from each other? The choice of the groups can be based on any known property of the samples. There are many possible underlying statistical methods, but the t-test and the F-statistic are among the more frequently used. The output will include a ranked gene list, which is both a wonderful thing and the bane of our existence, because it's hard to look at long lists of genes and make sense of them. This can be used for the development of classifiers, which can be applied to unknown samples, which is potentially of clinical relevance. We do have to address the problem of false discovery due to multiple comparisons and the discrepancy I already mentioned between sample and gene numbers. There are a number of statistical approaches for this, which can be used to filter the results down to the most significant genes.
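Here is a minimal sketch of that supervised step on simulated data: a per-gene t-test between two groups, followed by a Benjamini-Hochberg correction for the multiple-comparisons problem just mentioned. Only the first 50 genes truly differ, so the printout shows how well the filter behaves.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
genes, per_group = 5_000, 20

group_a = rng.normal(size=(genes, per_group))
group_b = rng.normal(size=(genes, per_group))
group_b[:50] += 1.5                      # only the first 50 genes truly differ

t, p = stats.ttest_ind(group_a, group_b, axis=1)

# Benjamini-Hochberg step-up: q_(i) = p_(i) * m / i, made monotone from the top.
order = np.argsort(p)
ranks = np.arange(1, genes + 1)
q = np.minimum.accumulate((p[order] * genes / ranks)[::-1])[::-1]
significant = order[q < 0.05]

print(f"{len(significant)} genes pass FDR < 0.05; "
      f"{np.sum(significant < 50)} of them are true positives")
```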
So here's an example where we've compared two types of sarcoma, gastrointestinal stromal tumor (GIST) versus other sarcomas, and we've come up with genes that are most highly expressed specifically in GIST. Many of you may know that this is a tumor that can be caused by mutations in the KIT gene, which is also a therapeutic target, and this comes out at the top of the list. This kind of list can be displayed in a fairly simple fashion with a heat map, and here we're seeing the genes most highly expressed in GIST. So this is one way of looking at the results of a two-way comparison. Here's a more complicated example where we're looking at several different types of tumor; we've calculated the genes specifically associated with each of several different diagnoses and plotted them in this form of heat map, and of course this is accompanied by the gene lists that go with each of the known tissue types.

Because of the potential clinical implications of this, quite a bit of work has been done trying to turn these findings into clinically useful assays, and the workflow is pretty much the same in all cases: you start out with a whole-genome profile, go through some sort of gene selection process to identify the informative genes, and then proceed through a validation step and assay development. Now, one of the problems that arises in these clinical studies is that the signal strength varies quite a bit. The examples I've given you so far are easy ones, where we have tremendous differences between different tissues or different tumor types. But many of the more interesting and important clinical questions, for example whether there is a predictive signature that predicts response to a treatment, or an intrinsic prognostic signature that predicts tumor behavior, can be associated with a rather weak signal, and that's really a big problem. Inevitably we find that some features work well to separate samples into classes and could be reduced to single-gene tests implemented in a conventional fashion, but that's unusual. Others are more difficult and require multiple gene measurements, and many of the clinically relevant features appear to fall within this more difficult group. When working through the data you're starting with, you have to continuously reckon with the possibility that some genes will show differences between groups of samples by chance alone, and there may indeed be no single gene which separates the groups reliably, so you end up wanting to find the most informative genes and use them in combination. That creates a risk of all sorts of statistical errors, overfitting being one of them, and the best way around these problems is ultimately independent validation sets. I'd like to point out this publication from Rich Simon here at NCI, from 2007, where he found, at that point relatively late in the game, that there were still quite a few publications that continued to make errors in statistical analysis. One hopes that this is continuing to improve, but you, the consumers of these publications, are ultimately the guardians of your own understanding; hopefully you can educate yourselves enough to read this kind of publication reasonably critically. Now, as I mentioned, array, or now RNA-seq, studies generate organized lists of genes, and these are often cryptic and hard to interpret.
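On the overfitting point from a moment ago, here is a minimal sketch, using scikit-learn, of doing cross-validation correctly: the gene selection step lives inside the pipeline, so each fold selects genes from its own training data only. The data and parameters are fabricated; with purely random labels, honest cross-validation should hover near chance, whereas selecting the "best" genes on all samples first would not.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(3)
samples, genes = 100, 5_000

X = rng.normal(size=(samples, genes))
y = rng.integers(0, 2, size=samples)     # random labels: no real signal

# Correct: gene selection happens inside each training fold, never on
# the full dataset, so the test folds stay honest.
model = make_pipeline(SelectKBest(f_classif, k=50),
                      LogisticRegression(max_iter=1_000))
accuracy = cross_val_score(model, X, y, cv=5).mean()

print(f"cross-validated accuracy on random labels: {accuracy:.2f}")
```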
Coming back to those gene lists: they're very useful for hypothesis generation, but this can be rather subjective because you tend to see the gene that you know about, so that's a bit of a problem; it would be nice to have more objective approaches. They seldom provide strong evidence for a specific mechanism, although they can certainly strongly suggest that a specific mechanism is important. And ultimately, expression data standing on its own is intrinsically limited, and you need to think about other ways to nail down a mechanism almost all of the time.

So, getting beyond the gene list, there are some things I can suggest. One is the optimal use of gene annotations. These are readily available in the form of the Gene Ontology database, which has many tools linked to it, and I'll mention the DAVID system, which is available freely through NIH. There is also an opportunity here to come back to the point of optimizing the use of existing public data: this has been done for you and pre-calculated in many different ways, both through GEO and ArrayExpress, which have some very nice pre-calculated profiling tools. There are several academic websites with useful pre-calculated data, and then there are many gene signature-based methods, such as Gene Set Enrichment Analysis and others. I'll mention this website, provided by the Broad, which has a tremendous number of molecular signatures curated and aggregated, with tools to compare your data against existing known signatures. There are also a number of pathway analysis tools, including some commercial and lots of free software, which can be used to generate the types of maps that organize lists of genes so that it's easier to think about their functional interrelations.

I want to talk for just a minute about what you should look for in clinical microarray studies, since these are important, and a little bit about microarray technologies relevant to clinical practice. First of all, the study needs to be very well defined, with the right patients and the right question. There have to be high-quality array measurements, and the data should be publicly available; that has proved important again and again in allowing recalculation and reanalysis of the data. Appropriate and rigorous statistical analysis should be carried out, and the authors should present a formal classifier that can be applied to new samples in a way that's transparent; hopefully there will be a validation sample set included in the first publication, and if not, you certainly want to see it downstream from other groups. The goal in any of these studies should be to seek and validate clinically relevant signatures within defined patient groups for which no current features adequately answer the clinical question posed. There are some problems with the clinical translation of array technologies. It's a specialized technology; RNA itself is unstable, and frozen tissue, which has been the basis of many of the discovery studies, is not part of the usual sample flow in clinical tissue-handling environments. Options include the use of reference laboratories, technical approaches for preserving RNA or using fixed tissues, or using arrays simply to discover signatures which can then be assayed with alternative technologies that are more clinic-friendly.
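Before turning to clinical examples, let me make the enrichment idea from a moment ago concrete. Many annotation tools boil down to an over-representation test like the hypergeometric test sketched below: given a list cutoff, how surprising is the overlap with a pathway's gene set? All of the counts here are invented.

```python
from scipy.stats import hypergeom

# Invented numbers for illustration:
genome = 20_000      # genes measured on the platform
pathway = 150        # genes annotated to some pathway of interest
selected = 300       # genes in our differentially expressed list
overlap = 12         # selected genes that are also in the pathway

# P(overlap >= 12) when drawing 300 genes at random from 20,000,
# of which 150 belong to the pathway.
p_value = hypergeom.sf(overlap - 1, genome, pathway, selected)
expected = selected * pathway / genome

print(f"expected overlap by chance: {expected:.1f}, "
      f"observed: {overlap}, P = {p_value:.2e}")
```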
Now, two clinical examples: one array-based, the so-called 70-gene microarray signature, and one an RT-PCR signature that is also used commercially now; both of these are being investigated and applied in the stratification of breast cancer patients.

So now I want to use the rest of the time to talk about the transition from array-based to sequence-based analysis, though I'll point out that many of the conceptual issues I've just been through for arrays are equally relevant for sequence data. I'm going to focus particularly on this period here, roughly the last six or seven years, since publications began to appear. Once again, as we saw more than a decade ago, we're seeing an explosion of publications, from the few that came out in 2008 to over 600 last year, and I'm sure that will be eclipsed significantly this year. We're in a period of exponential growth, so you can see why it's not possible to cover in detail all the things that people have accomplished. These are some of the critical initial publications that showed this was possible, and I'm going to spend some time explaining how it works at a relatively superficial level, to give you a general framework to think with as you see this kind of option presenting itself more and more.

First of all, array technologies basically do nothing more than measure the relative abundance of nucleic acids of defined sequence in a complex mixture, so sequencing can in principle do the same thing. What you may not appreciate at first is that we had been sitting comfortably in our relatively stable home of microarray expression analysis, which was working really well, when out of left field came this genome sequencing technology, which has really blown things up, both in a good way and in a challenging way. The good way is that we have many new opportunities to do things that were difficult or impossible with arrays, but there are also many new challenges in finding the best way to use this new technology.

So let me compare and contrast a little and expand on some of these points. On the pro side, microarrays are readily available; they're a mature technology; they're relatively inexpensive; they're effective with very complex samples; you can easily process hundreds of samples in a matter of days to a few weeks; and you can target a subset of the genome if that's all you want to do. On the con side, microarrays require platform- and application-specific data processing; they're prone to platform-specific artifacts; they have a limited dynamic range; some probes always perform poorly; there are many sources of noise; and whole-genome studies may require lots of arrays, making things really complicated. In contrast, sequencing is intrinsically able to give you whole-genome data; it can have a relatively uniform analytical pipeline; it doesn't necessarily have hybridization artifacts; and it has a large dynamic range, particularly at the low end, which I think is very valuable. The possibility is there of one technology platform for all array-type applications. But on the con side, it's still very much an evolving technology; it turns out to have technology-specific artifacts; it has been relatively resource-intensive, though that may be changing; it certainly is computationally intensive; there is no standard data analysis yet; and it will tend to have lower sample throughput at this point. But there are some advantages.
First of all, it provides variations detected at single nucleotide resolution. So you can look easily at allele-specific expression. You can look for the expression of mutations. You can look at things like RNA editing, and you can look at RNA structure, splicing, start sites, termination sites, and interestingly structural rearrangements. The detected signals are relatively unambiguous, which gives them a potential to outperform microarray, at least in certain applications. And of course, sequencing can be used for transcript discovery, which is harder to do with arrays. So measuring gene expression by RNA sequencing could be done in several different ways, just like several different modes, just like microarrays. So it could be full-length mRNA seek. You could be doing tag sequencing, just like the old SAGE serial analysis of gene expression technique. You can be looking at polyA-specific RNA or total ribosomal-depleted RNA. You could be looking at strand-specific versus non-strand-specific data, microRNAs, or long non-coding RNAs. Some of the limitations that you encounter are a lower limit of detection, which is constrained by the mRNA abundance distribution and the number of aligned reads per sample. Large numbers of samples are difficult to process without automation. Software is still evolving and requires sophisticated bioinformatics collaboration at this point. And the computational hardware requirements are substantial, at least if you're handling a fair amount of different samples. The library preparation methods are still evolving, so there are many alternative ways to make RNA-seq libraries. And the data comparison of these is problematic when the methods for generating libraries are different. So I'm going to just illustrate now the methods for RNA-seq in very broad strokes. So for mRNA, one usually quickly fragments the RNA first and then converts to a CDNA library. So the ends are then sequenced using the appropriate primers with the given sequencing platform, and you generate sequence tags, which typically now range from 50 to 250 nucleotides at each end. So that means you're sequencing relatively short fragments and generating large numbers of individual tags. So that's in sequencing of a full-length molecule. You can go for just three prime tag sequencing by generating a digest and then priming from the poly-A end, and then you can get specific tags. That hasn't become as popular as you might think it would have, but I think we're going to see that being a lasting technology in certain settings. So the sequences are simply aligned and counted from the three prime tags, and you can sequence libraries of many samples together by adding an oligonucleotide barcode to each sample that are then pooled before sequencing. There is a potential also to analyze that way, a large number of samples in parallel. So let's talk about, in very broad strokes, the RNA-seq computational workflow, which is absolutely impossible to talk about all of the problems and issues in a brief period of time, but you always start with the raw reads and you have to align those reads to something. So usually this is the genome, but one could imagine also creating transcriptomes, and you certainly will need a transcriptome for annotation of the results, even if the data has been aligned to the genome. Then we have a question, well, what aligner should be used? 
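Hold that question for a moment; first, as promised, a toy sketch of the barcode demultiplexing step. Pooled reads carry a short sample barcode at a fixed position and are binned by barcode with a one-mismatch tolerance. The barcodes and reads are fabricated, and real demultiplexers also use quality scores and dual indexes.

```python
from collections import defaultdict

# Hypothetical 6-base sample barcodes, read from the first 6 cycles.
barcodes = {"ACGTAC": "sample_1", "TGCAGT": "sample_2", "GATCGA": "sample_3"}
BARCODE_LEN = 6

def mismatches(a: str, b: str) -> int:
    return sum(x != y for x, y in zip(a, b))

def assign(read: str) -> str:
    """Bin a read by its leading barcode, tolerating one mismatch."""
    observed = read[:BARCODE_LEN]
    hits = [s for bc, s in barcodes.items() if mismatches(observed, bc) <= 1]
    return hits[0] if len(hits) == 1 else "undetermined"

pools = defaultdict(list)
reads = ["ACGTACTTTGGCATCA",   # exact match to sample_1
         "ACGAACGGGCATTTCA",   # one mismatch from sample_1: still binned
         "CCCCCCAAATTTGGGC"]   # matches nothing
for r in reads:
    pools[assign(r)].append(r[BARCODE_LEN:])   # strip barcode before alignment

print({sample: len(rs) for sample, rs in pools.items()})
```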
As for aligners, there are many, and they have very different properties; many are not really suitable for RNA-seq. This is still an area of research, and we're still seeing new aligners introduced and existing aligners refined for RNA-seq purposes. The reason this is challenging is illustrated by this diagram. RNAs, as you know, are produced as a continuous transcript which is discontinuous with respect to the content that ends up in the spliced mRNA, because of the intron-exon structure of genes. So you'll end up with reads from a mature RNA which map easily to the mRNA, these are the black reads, but you'll also have reads that cross exon-exon junctions, and this is where it becomes a challenge for the aligners to handle things efficiently: the short read will be split by an intron when you're aligning to the reference genome. Nonetheless, there are some pretty good tools available now, and I'll show just one screenshot of a genome browser loaded with RNA-seq data from several samples. It shows one multi-exon gene, the IGF-1 receptor, and you can see that the sequence tags assort quite cleanly to exons, so you see the alignment with specific exons. This is then the basis of the expression analysis that has to be worked through to get to the next stage.

Okay, if you have successfully aligned the reads, now you have to come up with some sort of normalized read count, which is an amazing thing if you think about it: you're taking a huge amount of data and, for each annotated transcript, reducing it to a single number, and that is rather challenging. At this point the statistics of arrays and sequencing diverge somewhat, because the statistical properties of counts are a little different from the statistical properties of a continuous variable like a fluorescence signal. There are many different issues that need to be addressed: how to implement a count-based model, whether to attempt to resolve isoforms, whether to use the paired-end information and look for aberrations in the implied fragment lengths, and how to handle positional bias along the transcript length as well as sequence bias, to name just a few. I'll show you one figure here that illustrates pretty well the problem you're up against. The top case is really simple: you've got your read, it aligns within an exon of a gene, and no matter how you look at it, this should be counted as credit toward the expression of that gene. But what if you have a read that overhangs the exon? Depending on how you approach this, it may yield either no signal or a signal. What if you have something like this? Same problem. Now here is a frequent situation, and you probably do want to count this, because it looks like an appropriately spliced transcript. There are further complexities: what if there's a gene expressed on the opposite strand? Then, again, you get different answers depending on how you decide to do the counts. I might add at this point that strand-specific libraries are now becoming more commonly used, which may help with issues like this when genes lie on opposite strands. Sequence bias, which is a big issue in array hybridization, turns out to be important in RNA-seq as well, and the one we're most familiar with is GC bias.
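Before coming to GC bias, the counting policies just described can be made concrete. This toy sketch is loosely modeled on the "union" and "intersection-strict" modes of tools like htseq-count; the gene models and reads are invented, and real implementations work from genome-wide interval indexes and strand information.

```python
# Each gene is a list of exon intervals (start, end), half-open coordinates.
genes = {
    "GENE_A": [(100, 200), (300, 400)],
    "GENE_B": [(150, 250)],             # overlaps GENE_A's first exon
}

def covering_genes(block):
    """Genes whose exons fully contain an aligned block (strict policy)."""
    s, e = block
    return {g for g, exons in genes.items()
            if any(xs <= s and e <= xe for xs, xe in exons)}

def touching_genes(block):
    """Genes whose exons overlap an aligned block at all (union policy)."""
    s, e = block
    return {g for g, exons in genes.items()
            if any(s < xe and xs < e for xs, xe in exons)}

def count_read(blocks, policy):
    """A read is a list of aligned blocks (two blocks = spliced read).
    Returns the gene to credit, or why the read was discarded."""
    per_block = [policy(b) for b in blocks]
    hit = set.intersection(*per_block)
    if not hit:
        return "no_feature"      # e.g., an overhang under the strict policy
    if len(hit) > 1:
        return "ambiguous"       # e.g., overlapping genes
    return hit.pop()

print(count_read([(160, 190)], covering_genes))  # ambiguous: inside A and B
print(count_read([(190, 260)], covering_genes))  # no_feature: overhangs every exon
print(count_read([(190, 260)], touching_genes))  # ambiguous: touches A and B
print(count_read([(310, 340)], covering_genes))  # GENE_A, unambiguous
print(count_read([(160, 190), (300, 330)], covering_genes))  # splice junction rescues: GENE_A
```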
On GC bias, there is a paper worth looking at that came out from the Hopkins group in 2012. What they illustrate is that in a two-way comparison, the frequent thing we do with microarrays, you get a significant skew in the data depending on whether you're looking at GC-rich genes or not. So that's an issue which, perhaps not surprisingly, continues to follow us into the sequencing era.

Once you deal with all these problems, however they've been coped with, you're still left with a big spreadsheet of numbers: samples by normalized, log-transformed gene counts. This is what you would now use to look at the differential expression problem. But you have to remember, as I've tried to hint, that a large number of variables intrinsic to RNA-seq accompany the data, and these pose a whole new set of computational problems which differ substantially from those encountered in the analysis of microarray data. This is an area of intensive work in the bioinformatics field; several papers cited here deal with this issue by comparing different software. I'll particularly call your attention to this one, because it's based on the software available through the Bioconductor project in the R statistical environment. Things are beginning to converge. It appears, from my take at least, that many of the significant problems have been identified, and we're beginning to understand what the elements of a correct analysis of RNA-seq data should be. But, of course, the underlying technologies continue to evolve, and they may bring new problems or remove old ones, so it's hard to be dogmatic at this stage of affairs. This slide shows an RNA-seq workflow proposed for R-based tools in a recent review in Nature Protocols, and you can see the relatively complex number of stages you need to go through to finally get to results; you'll still need some additional sanity checks to make sure the results are not absurd. So it's still a fairly complicated problem, and although the tools are there, they're not necessarily packaged in a way that leads to an easy automated workflow for someone who is not already invested in learning the bioinformatics.

Beyond this point in the RNA-seq computational workflow there's still a whole lot more, and in some respects this is the most interesting part. Up to here, we're basically emulating microarrays, but with what may be a somewhat better technology. All these other things, though, are relatively unique to RNA-seq, or difficult to do with arrays: looking at exon usage, transcription start and stop sites, structural variants, RNA editing, the expression of single-nucleotide variants, antisense expression, strand specificity, and so on. These are the things you can do really nicely, uniquely, but even they present fairly significant computational problems which are not trivial and maybe not completely solved at this point. So let me talk about one that people who work in the cancer field are always very interested in: fusion gene detection, because, as you probably know, rearrangements of the genome that fuse two genes underlie many forms of cancer.
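Before getting into fusion detection, here is a toy sketch of the normalization step that sits underneath count-based differential expression: median-of-ratios size factors in the spirit of DESeq, applied to a fabricated count table. Real tools add a count-appropriate error model (typically negative binomial) on top, which is exactly where the R/Bioconductor packages come in.

```python
import numpy as np

rng = np.random.default_rng(4)

# Fabricated count table: 2,000 genes x 4 samples, where sample 4 was
# simply sequenced twice as deeply (a pure library-size effect).
means = rng.lognormal(mean=3.0, sigma=1.5, size=2_000)
counts = rng.poisson(np.outer(means, [1.0, 1.0, 1.0, 2.0]))

# Median-of-ratios size factors: compare each sample to a pseudo-reference
# (the per-gene geometric mean), then take the median ratio per sample.
expressed = (counts > 0).all(axis=1)
log_geo_mean = np.log(counts[expressed]).mean(axis=1)
ratios = np.log(counts[expressed]) - log_geo_mean[:, None]
size_factors = np.exp(np.median(ratios, axis=0))

normalized = counts / size_factors
print("size factors:", np.round(size_factors, 2))  # roughly [0.84, 0.84, 0.84, 1.68]
```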
Now, fusion detection. If you look at this as a variation of the alignment problem we talked about earlier: instead of a nice neat spliced transcript that comes entirely from one gene, we have an odd chimeric transcript that has a piece of one gene and a piece of another. From the point of view of RNA-seq, to discover this all we really need are enough reads spanning the junction whose ends map uniquely, so that we can spot it. And of course we have to be able to map rearrangements arising from two different chromosomes as well as from an internal rearrangement of the same chromosome. There's lots of software out there that you can get your hands on to do this, and I'll show some figures from one relatively recent paper that compared many of these tools, which differ in how they handle the various features that come up during the mapping process. What they found, and this shows that you can't escape noise in genomic analysis (you always have to understand the signal-to-noise properties of anything you do at genome scale), is that with completely synthetic data containing no true positives, all of these packages generated some level of false-positive fusions. It varies, but there is probably a trade-off: the tools with extremely low background will probably have a slightly weaker ability to recognize true positives. This trade-off between true positives and false positives permeates everything I've talked about today in one way or another and needs to be dealt with (a toy sketch of the junction-counting logic appears below). In transcriptome analysis there is never one correct analysis that will give you one correct answer; you're always dealing with the trade-off between wanting the best possibility of discovery at the least cost in noise.

So, as of today, and this will probably change next week, when is RNA-seq the preferred technology? I think it entirely depends on the experimental goals of your project. I would say that currently RNA-seq is the preferred method for assessing transcript structure and sequence variation at genome scale. It's a wonderful method, and for many of the hundreds of publications that will be some aspect of their work. The role of RNA-seq for routine count-based expression analysis is less clear at this time, but what is clear is that as sequence throughput increases and sequencing costs decline, and, importantly, as standardized analytical pipelines for specific experimental goals are developed, RNA-seq will certainly become attractive for general use. So I think that's the direction this train is going, but we're currently in a state where I think it's very defensible to say, we're going to do this count-based project with arrays; you may be able to get an answer much more quickly at lower cost. And there will be situations where you may want to do that to get a quick look at a problem, and then still do the RNA-seq to see what it may add.

I'll conclude with this list of websites. There's a tremendous amount of information on this topic available online, and many of these sites are well worth familiarizing yourself with as you plunge, hopefully ever deeper, into the field of genome transcriptional analysis. With that, I'll thank you for your attention, and I'd be glad to answer any questions from people who are actually here. Thanks. Okay, great.
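Finally, to make the junction-counting logic from the fusion discussion concrete, here is a toy sketch: read pairs whose two ends map uniquely to two different genes vote for a candidate fusion, and a minimum level of support is demanded to suppress the false positives just discussed. The mappings are fabricated, and real callers work from split reads and discordant pairs with many additional filters.

```python
from collections import Counter

# Fabricated mapping results: for each read pair, the gene (or genes) that
# each end maps to. Multi-gene hits mean the end did not map uniquely.
pairs = [
    ({"BCR"}, {"ABL1"}),
    ({"BCR"}, {"ABL1"}),
    ({"BCR"}, {"ABL1"}),
    ({"TP53"}, {"TP53"}),          # concordant pair, not a fusion
    ({"EWSR1"}, {"FLI1"}),         # one supporting pair: likely noise
    ({"BCR", "ABL1"}, {"ABL1"}),   # non-unique end: discarded
]

MIN_SUPPORT = 2   # demand >= 2 uniquely mapping pairs per candidate

support = Counter()
for left, right in pairs:
    if len(left) == 1 and len(right) == 1:     # both ends map uniquely
        (g1,), (g2,) = left, right
        if g1 != g2:                           # ends land in two different genes
            support[tuple(sorted((g1, g2)))] += 1

candidates = {fusion: n for fusion, n in support.items() if n >= MIN_SUPPORT}
print(candidates)   # {('ABL1', 'BCR'): 3}; EWSR1-FLI1 filtered as noise
```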