Good morning, everyone, and welcome to today's lecture, the 10th in this current topic series. We're going to turn our attention today to large-scale expression analysis and techniques that have quickly become an integral part, not just of genomics, but of biology in general. We're pleased this morning to have our speaker, Paul Meltzer, who's the Chief of the Genetics Branch at the Center for Cancer Research at the NCI, and also an adjunct investigator at NHGRI. Dr. Meltzer's research focuses on understanding the abnormalities in genome structure and function that occur in cancer. He's been one of the driving forces behind the development and refinement of microarray technologies, both on the instrumentation and on the bioinformatics side, and has used this approach to better understand the basic biology of tumorigenesis and cancer progression. He'll be sharing many of his insights on how to best use these and other high-throughput techniques to understand changes in patterns of gene expression, as well as how to apply those techniques to important clinical questions. Please join me in welcoming today's speaker, Dr. Meltzer. Okay, thank you, Tara. I'm glad to be here in this course and to cover this really interesting and exciting topic in genome analysis. I'll start by explaining the title slide: the talk is going to have two parts. The evolution part will bring you more or less up to date on what's going on in the microarray world, focusing on gene expression. The revolution part is the impending and current impact of new sequencing technologies on the analysis of gene expression, which is certainly an important aspect. Initially, I'm going to cover the development of the technology and the kinds of things you can do with microarrays, then talk about the analysis, and then move on to RNA-seq. It used to be that you could cover this topic pretty thoroughly in an hour, or an hour and a half. Now it really takes a course. This talk is really directed at those of you who are junior investigators, or senior investigators who are just moving into this area, to learn about things that are going to be part of your toolkit for molecular biology in the coming years. So I'm not going to be able to go into anything in tremendous detail, but I think this should give you an overview of where the field stands. I have no relevant disclosures today, so let me get going and put this in a very broad context. We've had access to the human genome sequence, and the sequences of lots of other organisms, for some time now, and today's topic sits squarely in the middle of the next kinds of things that can be done. Gene expression, gene variation, and gene function can all be addressed with aspects of microarray technology. It's important to think about this in a very global way: microarrays basically provide a useful, rapid way to do whole-genome analysis, and the primary impact of all of this is accelerated discovery and hypothesis generation. With expression arrays alone, you can very seldom prove a biological mechanism, but you can generate hypotheses tremendously rapidly, which can then be validated with other technologies. That's an important thing to bear in mind: we don't want to ask too much of expression profiling alone, but as a tool, it's incredibly valuable. So this gives you a brief history of the development of the modern era of microarrays starting in the late 90s, and this is just PubMed citations for DNA microarrays.
And really, for the past few years, it's come in at around eight to ten thousand citations a year. But what that statistic masks is the fact that there are many more publications now which use microarrays where the terms don't appear in the title or abstract, because this has become such a routine tool in molecular biology and in all aspects of biomedical science. And the rapid rise here represents the speed with which the community embraced this technology, which tells you something about how badly something like this was needed. So again, the main reasons for this are the acceleration of discovery, new theoretical constructs through hypothesis generation, and really helping to move us toward a systems approach in biology. This is another part of the short history, in terms of feature density: starting out with just a few thousand probes on the surface of a microarray, up to the present, where multi-million-probe microarrays are utterly routine and a normal part of the way we work today. And again, in the revolution part, sequencing technologies may really supplant arrays in many applications. A little bit of array terminology. We talk about the array elements as features, or probes; that is, a feature corresponds to a defined nucleic acid sequence, which is hybridized to your target, a pool of nucleic acids of unknown sequence. Possible array features include synthetic oligonucleotides, which are really the dominant type used today. You could also use PCR products or cloned DNAs as features on microarrays. That's really the landscape still today. Oligonucleotide array design is extremely flexible. It can be done with a 3-prime bias or with a full-length design; it can be exon-specific. You can have candidate transcripts, unknowns, microRNAs, non-coding RNAs, you name it. It's very flexible because it's all computer-controlled, and you can achieve a very high density, again, multi-millions of probes. But it does require that you have sequence data to go with it. That is, you have to know the sequence of the probes you want to create, so that's a limitation. How do you make microarrays? Well, they were first made by printing technologies, and these still have a place in the array universe, but a small one at the present time, using this type of robotics to spot DNA sequences on slides like this. But synthesis in situ, either light-directed or mechanically directed synthesis of oligos, is really the principal technology that dominates what goes on in the array world. So this is an example of light-directed oligo synthesis, in this case for short oligos, using masked oligo design, which is really very similar to the photolithography used in the semiconductor industry. With cyclical changes in masks and a light-activated de-blocking reagent, you can build up lawns of oligos of a defined sequence. This is the basis, for example, of the Affymetrix GeneChip. A similar method is used in the NimbleGen light-directed system, which is based on digital micromirrors, such as might be found in digital projectors, to produce very high-density synthesis in situ. Alternatively, the Agilent system uses inkjet-directed synthesis to produce fairly high densities of rather long oligos on the surface of a slide.
And another technology that's out there is the bead array from Illumina. This utilizes beads that are coated with oligos of a specific sequence, an absolutely remarkable technology, because the beads are randomly arranged on the slide and decoded, and the arrays arrive in the lab with a bead manifest that allows the address of each individual oligo to be identified in the scanned image. So I'm not particularly recommending any one of these technologies. They all work, they all have their place, and the choice will depend a lot on local expertise and the availability of instrumentation. To get information from any of these platforms, what you're really trying to do is determine the quantity of target bound to each probe in a complex hybridization. So you need a good readout mechanism to do this, and fluorescent tags turn out to be the most important. I'll come back to this point in a minute, but dual-channel capability is also useful, and that can be accomplished with, again, two fluorescent colors. So, to build microarrays: the methods are applicable to any organism. For sequenced organisms, oligonucleotides are preferable. For unsequenced organisms, which are getting to be fewer and fewer, cloned DNAs become the way to go. The density depends on the specific technology, which I've already covered, and the array design will be linked to purpose. The laboratory essentials you need to be able to do this are the arrays themselves; hybridization and wash equipment to hybridize your labeled targets; a scanner to capture the image of the hybridization signal; software for processing the image and for data display and analysis; and, for big projects, probably a collaborator who can really work with the relatively complicated data sets that are generated. There are many microarray applications, including gene expression, copy number determination, SNP arrays, and ChIP arrays for chromatin immunoprecipitation, that is, transcription factor localization or chromatin and histone modification; all these things can be done on arrays. In the interest of time, I'm going to focus completely on gene expression, but I wanted to make the point that array technologies have been used for all of these other applications. To get you into the right frame of reference: there have been studies done on microarray data quality. When you have these kinds of high-throughput measurements, data quality becomes important, and this was addressed in a large multi-center study several years ago. Probably one of the major results of that study and similar ones is that expertise is important. So getting the arrays done by people who are familiar with doing them, often in core laboratories at most institutions, is the way to go to get good-quality data. It's also important that you be able to access existing expression data. The first thing I tell people who come to me about planning a project is: well, has anybody already done that study? There are two very nice databases that are publicly available. GEO, the Gene Expression Omnibus, is an NCBI system, and it now has almost 600,000 sample sets available. You don't need to write down URLs, because all the things I'm presenting today are easy to find on Google. EMBL has the ArrayExpress website, and that also has a large number of experiments available, I think about half a million or so microarray hybridizations.
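As a minimal sketch of what pulling an existing series from GEO might look like programmatically, here is one option using the third-party GEOparse Python package; the package and the workflow are an assumption (the talk doesn't prescribe a tool), and the accession shown is a placeholder, not a real study.

```python
# A minimal sketch of fetching an existing expression series from GEO,
# assuming the third-party GEOparse package (pip install GEOparse).
# The accession below is a placeholder; substitute a real GSE of interest.
import GEOparse

gse = GEOparse.get_GEO(geo="GSE0000", destdir="./geo_cache")  # hypothetical accession

# Each GSM is one hybridization; its .table holds the probe-level values.
for name, gsm in gse.gsms.items():
    print(name, gsm.metadata.get("title"), gsm.table.shape)
```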
So to publish data and to get it into these databases, you'll hear the term the MIAME convention or the MIAME standard; that stands for Minimum Information About a Microarray Experiment. This format is required for database submissions and is required by many journals for publication of conclusions drawn from those studies, so that's important to know about. Now I want to talk for a minute about strategies for generating the signal from mRNA. It could be a fluorochrome-conjugated cDNA; that's a very standard way of doing things. Or you could use ligand-substituted nucleotides with secondary detection, for example a biotin-streptavidin system; that's also very standard. Radioactivity is now not used very much except for very specific studies. RNA amplification certainly can be used, so it's possible to do arrays on extremely small amounts of RNA. I'll give you a walkthrough of a simple two-color array experiment, and then we'll talk about how that compares to one-color experiments. In this figure, we've created arrays, in this case by robotic printing, but they could also be made by the other technologies. So we've created arrays and labeled two samples, a test and a reference sample, with two different fluorochromes, which are pooled and hybridized to the microarray. That goes into a scanner which extracts the fluorescent signal at each element of the microarray with appropriate laser and filter combinations, which gives you a grayscale image that is pseudo-colored here and merged. And this essentially generates the relative expression in each channel, the test or the reference channel. In contrast, the one-color hybridization signal, as illustrated here, is really based on relative expression normalized across an array or a set of arrays, typically with a labeled cRNA, which is often labeled with biotin and hybridized to the oligos on a microarray; the biotin signal is then detected with a fluorophore-coupled avidin-type molecule, and that generates a similar kind of scanned image as I described for two-color arrays. In the end, both can be used for most purposes, but the important difference is that the output of a two-color experiment is an expression ratio, and for a one-color experiment it's a relative expression level. Both types of data can be analyzed with essentially the same kinds of tools, and they have pros and cons, which we'll discuss in a moment. So what can you do with these kinds of expression arrays? I like to divide this into two simple categories. There's profiling, which means taking tissues, extracting RNA, and analyzing your targets of interest across a large number of samples; here the power to see things arises from increasing sample number. And there are lots of direct comparison experiments, which I refer to as induction experiments; I'll give examples of those in a moment, and here you're really looking at a specific, well-defined biological system. Arrays can also be used for genome annotation. You can make an array, for example, covering all of a chromosome or all of a bacterial genome, and without prior knowledge of the transcription units present in that genome, you can actually annotate the genome. And by induction experiments, I mean the sorts of studies that ask how a treatment of cells affects downstream targets: what genes are directly or indirectly targeted by manipulations such as those I've listed here? And really tens of thousands of experiments like this have been done, because this is an important and very general problem in biology.
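To make the two-color readout described above concrete, here is a minimal sketch of turning the background-subtracted intensities of the two channels into normalized log2 expression ratios, the output of a two-color experiment. The numbers are made up, and median-centering is just one simple normalization choice among several.

```python
import numpy as np

# Hypothetical background-subtracted intensities for one two-color array:
# 'test' might be the Cy5 channel and 'reference' the Cy3 channel,
# one value per array feature.
test = np.array([1200.0, 85.0, 430.0, 15000.0, 52.0])
reference = np.array([300.0, 90.0, 410.0, 2900.0, 260.0])

# A small offset guards against taking the log of zero on weak features.
log_ratio = np.log2((test + 1.0) / (reference + 1.0))

# Median-centering assumes most genes are unchanged, so the typical
# log ratio should sit at zero.
log_ratio -= np.median(log_ratio)
print(np.round(log_ratio, 2))
```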
Okay, well, you end up with a large amount of data when you do these kinds of experiments. We used to think of it as a large amount of data, anyway; I have to say that compared to the data scale produced by the next-generation sequencing technologies, it's actually not that bad. But it's a lot more than biologists were used to dealing with routinely, although it's now a routine part of molecular biology. For example, if we had 25,000 genes, or probe sets for genes, on a microarray and 200 samples, we would get something on the order of 5 million or more data points. That's still a fair amount of data to manipulate, and more than you'll typically want to handle with an Excel spreadsheet. So this does require analysis and visualization tools. But I'm here to tell you that, although it's a little bit clunky, the field has really evolved and stabilized to the point where there are many software tools available that make this all rather readily accessible. So let me talk a little bit, in broad strokes, about the analysis of expression data. Typically the most important step will be to check the quality of individual experiments. It's very sad to have to throw away data, but the fact is that you really need to do that. If an individual array hybridization doesn't pass quality-control metrics, which are pretty standard in the field now, then the best thing to do is to throw that experiment out. It's a garbage-in, garbage-out situation: you really want to clean up the data before working with it. Often there will be some pre-processing of the data. This will be somewhat platform-specific, but it could include normalization of the data; removing the genes which are not accurately measured, because they're not really expressed or are detected poorly; and removing genes which are similarly expressed in all samples, because obviously they'll carry very little information about heterogeneity within the sample set (a small sketch of this filtering follows below). Then, typically, tools are used which can be broadly defined as unsupervised or supervised analyses. This just shows a typical microarray scatter plot for a two-color experiment where we're comparing a treated and an untreated sample. Here I'm showing the hybridization signal for the treated sample on one channel and the untreated on the other channel, and this deviation from a perfect line, this group of probes over here, would be those whose expression was shifted by the treatment. So this is a typical kind of scatter plot that one looks at in two-color array experiments. What are the types of questions that we ask in analyzing the data? In unsupervised clustering, we really want to understand how genes and samples are organized into groups. It's a powerful method of data display, but it does not necessarily prove the validity of the groups; that requires additional statistical tests. But the idea is that the clustered samples are going to be biologically similar, and that clusters of co-expressed genes may be functionally related or perhaps enriched for pathways of one form or another. We'll show a couple of examples of this.
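Backing up to the pre-processing step for a moment, here is a hedged sketch of the filtering just described: dropping features that never rise above a detection floor, and features with near-constant expression across samples. The matrix is simulated and the thresholds are arbitrary illustrations, not recommendations.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical log2 expression matrix: 25,000 probes by 200 samples.
expr = rng.normal(loc=8.0, scale=2.0, size=(25000, 200))

# 1) Drop probes that never rise above a detection floor in any sample
#    (poorly measured or not really expressed).
detected = (expr > 6.0).any(axis=1)

# 2) Drop probes with little variation across samples; they carry almost
#    no information about heterogeneity within the sample set.
variable = expr.std(axis=1) > 0.5

filtered = expr[detected & variable]
print(expr.shape, "->", filtered.shape)
```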
So just to get you going, I'm going to give you a somewhat ancient, but very simple and easy-to-explain array experiment, one of the first ones we did, by Javed Khan when he was a fellow in my lab. We were interested to know whether a particular pediatric tumor called rhabdomyosarcoma had a discernible expression profile. So basically we did two-color microarray experiments where we took several of these tumors of interest, I'll call them alveolar rhabdomyosarcomas, or ARMS for short, and we took several other kinds of cancers, and we hybridized them to a very early, first-generation microarray. What I'm showing here is the scatter plot of ratios when we compare one tumor to each of the others in this group. I should mention that this was a two-color experiment where we used a fibroblast cell line as a reference to allow us to normalize the data across all of the experiments. And what you can see is that each of these comparisons gives you something that's either more globular or more linear looking. Those two-way comparisons can be captured in a single statistic, the Pearson correlation coefficient, which can be very high or very low depending on the similarity. So you end up with this sort of matrix of correlation coefficients for all possible comparisons in this little experiment, which range from very high to very low. And it's easy to visualize this sort of information. Here, using the hierarchical clustering dendrogram, we've plotted the samples which are most closely related as the final twigs in the dendrogram, and then, as we move up through the hierarchy, we get forks which capture the more distantly related things. We were very pleased at the time to see that the related rhabdomyosarcomas clustered together, which at that time was not a foregone conclusion, and at least it told us our technology was probably seeing something meaningful. So this is still an important visualization tool. The other one that was introduced at that time was multidimensional scaling, which is a two- or three-dimensional way of displaying individual samples. It's convenient because it captures the distance between samples in a way that's, I think, a little easier and more revealing as a visualization. This is sometimes presented as a three-dimensional plot, as illustrated here, with multiple different tumor types, and principal component analysis can be used to generate very similar plots. So these are very, very useful tools for looking at large numbers of samples to see how they're related. So I've explained, to some extent, I hope, clustering samples. Now let's talk about clustering genes, and the typical two-dimensional heat map that is so familiar. Here's a set of data; I've just picked out six genes to make it easy. This is breast cancer gene expression data, and we've got about 200 samples here, and I've just got the data displayed in a random order. Each vertical column represents a sample; each horizontal row represents the expression of one gene. In this case, it's on a red-green color scale, where red represents high expression and green represents low expression. Now, if we look at this data, it really doesn't seem to be telling you very much. But let's go ahead and cluster the genes, and what we see right away is that the genes do cluster. Probably the easiest one to explain in terms of breast cancer biology is the clustering of the estrogen receptor and the GATA3 gene, which are co-expressed in ER-positive breast cancers. So this helps organize the data a little bit, and this is done with exactly the same kind of algorithm as I illustrated for clustering samples.
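Here is a minimal sketch of that sample-clustering recipe, a Pearson correlation matrix turned into distances and fed to hierarchical clustering, on simulated data in which the first four of eight samples share a common expression shift. The linkage method is an arbitrary choice for illustration.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram
from scipy.spatial.distance import squareform

rng = np.random.default_rng(1)
# Hypothetical log-ratio matrix: 1,000 genes by 8 tumors; the first four
# samples share a shift in 200 genes, so they should cluster together.
expr = rng.normal(size=(1000, 8))
expr[:200, :4] += 2.0

# Pearson correlations between all pairs of samples (columns).
corr = np.corrcoef(expr.T)

# Convert correlation to a distance (1 - r) and cluster hierarchically.
dist = squareform(1.0 - corr, checks=False)
tree = linkage(dist, method="average")

# The leaf order puts the most closely related samples on adjacent twigs.
leaves = dendrogram(tree, labels=[f"S{i}" for i in range(8)], no_plot=True)
print(leaves["ivl"])
```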
So now let's go ahead and take this heat map and apply clustering to the samples as well. Now we have a heat map where we've clustered both the genes and the samples, and now it actually does make quite a bit of sense. What you see is that we've split these tumors into two large groups: the ER-positive tumors here, which are expressing some level of the estrogen receptor, and the ER-negative tumors over here. We can see how each of these genes behaves across this, and we can see, for example, that within the ER-negative group we have an ERBB2-positive group, not perfectly clustered, but pretty well clustered. This is just to give you a flavor of how this sort of data clustering and visualization tool helps the investigator make sense of the data. On the next slide is a much larger set of genes and experiments, and what you can see is a lot of complexity in the heat map. This is the type of data where, if you did a study on this scale, you would spend some time trying to figure out the significance of this cluster here, or what this little fine structure there is telling you, and so on. So it's really a data visualization and discovery-facilitating tool. Okay, now let's talk about supervised analysis, which is a little bit different. Here we're asking the question: what genes distinguish samples in selected groups from each other? The groups can be based on any known property of the samples. In a clinical situation, for example, that might be good-prognosis or poor-prognosis patients, high-grade or low-grade tumors, survivors or non-survivors. Whatever property you can assign to a specimen, clinical diagnosis or anything else, you can use, and typically you'll have lots of annotations for these kinds of samples. There are many possible statistical methods, which I won't have time to go into, but very simple ones, such as variations on a t-test or F-statistic, are still very frequently used. The output, though, is different from the unsupervised approach: it will generate a ranked gene list with some statistical metric assigned to each gene. This can potentially lead to the development of classifiers which can be applied to unknown samples, moving in the direction, perhaps, of developing clinical tests, among other things. But you do have to worry about a problem with this kind of data, which is one of false discovery. That's simply due to the mismatch we have in making a large number of multiple comparisons: the sample number, which even in a large study is typically only a few hundred samples, against the universe of the genome, which is more on the order of tens of thousands of genes. That mismatch creates a lot of potential for false discovery, and this creates a lot of problems for misunderstanding of array data. I'll just illustrate one way of dealing with this, and that's through the use of a random permutation test. Here I'm showing data where we've compared two types of cancer: a cancer called gastrointestinal stromal tumor, or GIST, and morphologically similar-looking tumors that are spindle cell sarcomas. We've run microarrays on these, then run a t-test-type analysis comparing the two groups, and then a permutation test. In the permutation test, the labels on the samples are permuted thousands of times to generate a random curve of discovery, and that's this red curve here. The statistic here is referred to as a gene weight; the bigger this statistic, the better it separates the groups, and the y-axis shows the number of such genes. You can see that even with completely randomized data, you still come up with some genes with a high weight. I'm sorry, it's actually the green curve that is the permuted data, but you'll still see that the green curve has a few genes coming out with a significant weight. But that is random and completely false discovery. The difference between these two curves, the red curve being elevated further off the x-axis, contains the informative genes, and the list would look something like this. The kind of cool thing about this experiment was that KIT, which happens to be both the mutational and therapeutic target in GIST, is the number one gene in this particular analysis, and the other related genes can then be ranked in this fashion. They can be visualized in a heat map, and here we see a very clear separation of the GIST from the other tumors.
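Here is a hedged sketch of that permutation idea on simulated data. The exact "gene weight" statistic isn't specified above, so the absolute t-statistic stands in for it here, and the number of permutations and the cutoff quantile are arbitrary illustrative choices.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
# Hypothetical data: 2,000 genes by 24 samples, the first 12 GIST-like,
# with a real difference planted in the first 30 genes.
expr = rng.normal(size=(2000, 24))
expr[:30, :12] += 2.0
labels = np.array([0] * 12 + [1] * 12)

def gene_weights(x, y):
    # Absolute t-statistic as a stand-in for the talk's "gene weight".
    t, _ = stats.ttest_ind(x[:, y == 0], x[:, y == 1], axis=1)
    return np.abs(t)

observed = gene_weights(expr, labels)

# Null distribution: shuffle the sample labels many times and record the
# largest weight seen each time (random, completely false "discovery").
null_max = np.array([gene_weights(expr, rng.permutation(labels)).max()
                     for _ in range(1000)])

# Genes whose observed weight beats most permuted maxima are informative.
threshold = np.quantile(null_max, 0.95)
print("genes above permutation threshold:", int((observed > threshold).sum()))
```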
An important point to make here: if someone pre-selects genes in this fashion, or in any fashion, and shows you this kind of heat map, that in itself does not necessarily prove the validity of the separation. You need to know what the underlying statistical information is, so you have to be a little careful not to get caught in a circular analysis. This can be scaled up to more kinds of cancer; here we see the genes that are statistically significantly expressed in several different kinds of tumor, and so on and so forth. But these types of signals, separating different kinds of tissues and tumors of all the types you can imagine, are very strong and easily recognized by very standard methods, so this is no longer something that would be looked at as a very novel thing to do. Here's a very simple two-way comparison of high-grade and low-grade ductal carcinoma in situ of the breast, in which we've added some clinical annotations to the array data, which help to interpret these results in terms of the behavior of the tumors. So this is the next level: building other kinds of annotations into the analysis. Now let's switch gears slightly and talk about taking genomics from the bench to the bedside, so you can see where we're going with these sorts of technologies. We start with a whole genome, look at a lot of samples, and from that go through a gene selection process to come up with what are really nothing more than candidate biomarkers. So we come up with these candidate biomarkers, and then we have to go through some sort of validation procedure to make sure that the arrays are telling us the truth. That might include immunostaining, it might include other mRNA quantitation techniques, but we want to make sure we have valid data, and we're correlating that with clinical properties and then proceeding to assay development. The assay development could be simple, say an immunohistochemistry assay for a single gene that you've found with an array experiment, or it might be more complicated, and that's of course where things get a little bit more sticky. I'll try to explain this, because this is probably where most of the blunders in the analysis of array data occur, and where most misunderstanding occurs.
So I've used the term signal strength, and by that I mean the number of genes and the magnitude of their differences between the groups you're interested in. The signal strength can vary quite a bit, but typically the most interesting questions, the ones people really want answered, who will respond to a drug, who will have the best prognosis, tend to be associated with a weaker signal, and they're harder to get right. I'll give a very simple analogy. Suppose we had a sample set like this, apples and oranges. If we take a single property of these, say their color, we can cluster them and separate them perfectly. That's because they're really very different: if we were to plot color, we'd get very different distributions, and you can see it would be statistically very robust. So if this were a single, very highly informative gene that had a radically different expression level between two groups of tumors, they're going to be easy to distinguish. But what if you have something like this, different kinds of apples? These are not so easy to separate. Well, I can probably do that by eye, but I might make some mistakes, if I think this one looks a little like that one, and they might wrongly cluster together. They're harder to separate because they require more than one property to separate them, and some of the properties overlap. If you map this back to the tissue case, here we might have expression levels which do have a trend toward a difference, but which also have some overlap in your sample set. This is typically what you encounter in most of these more difficult, lower-signal-strength kinds of questions. So it requires more than one measurement per individual to achieve a separation. That's really the analogy I like to use: we can tell apples from oranges, but can we really distinguish different kinds of apples? And to reiterate, because I think this is important: some features are going to separate tumors, tissues, or other disease states easily into classes. Those might be reduced to single-gene tests and implemented in a very conventional fashion. Others are going to be harder and require multiple gene measurements, and really, many clinically relevant features appear to fall within this difficult group. And then we always have to remember the problems with these kinds of studies. Some genes are going to show differences between groups by chance alone, and there may be no one gene which reliably separates the groups. You can take the most informative genes and use them in combination, but there is a risk of overfitting the data if you have small sample sets. Ultimately, the cure for all of these statistical problems is independent validation sets. That's ultimately what people in the field are looking for before they really believe something that is claimed from a weak signal, and there are many, many examples of weak signals that have fallen by the wayside when independent groups attempted to validate them with additional sample sets. However, some of these have validated, and clinically meaningful things have indeed come about. I'll recommend this paper from Rich Simon's group, published in 2007, which is a critical review of the array studies for cancer published up to the time he wrote it. What he found, which is really striking, is that about 50% of the studies contained at least one of three basic flaws in the statistical analysis.
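One flaw often pointed to in this literature is selecting genes on the full sample set and then "validating" a classifier on those same samples. Here is a minimal sketch, on simulated data, of the defensible alternative: the gene selection step lives inside a scikit-learn pipeline, so each cross-validation fold re-selects genes from its training samples only. The classifier and the value of k are arbitrary illustrative choices, not a recommendation.

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(3)
# Hypothetical study: 80 samples by 10,000 genes, with a weak real signal
# planted in the first 20 genes of one class.
X = rng.normal(size=(80, 10000))
y = rng.integers(0, 2, size=80)
X[y == 1, :20] += 0.8

# Gene selection sits INSIDE the pipeline, so every cross-validation fold
# re-selects its top genes from training data only. Selecting genes on all
# 80 samples first and then cross-validating would be the circular analysis
# that inflates accuracy estimates.
clf = Pipeline([
    ("select", SelectKBest(f_classif, k=20)),
    ("model", LogisticRegression(max_iter=1000)),
])
scores = cross_val_score(clf, X, y, cv=5)
print("cross-validated accuracy: %.2f" % scores.mean())
```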
And despite going through peer review, and despite coming from really sound groups, there were still mistakes made in the analysis. With time, the dissemination of better tools, and more people on the statistics side coming up to speed on what can be done with these technologies, this is improving, but the pitfalls are still there; they'll never go away. If you're going to be a consumer of this kind of information, you have to be aware of these things. Okay, now let's move on to the problem of interpretation. What array studies really do is generate organized lists of genes, but those lists are often cryptic and hard to interpret. They're hypothesis-generating, but that can be subjective if you just look at a list of genes: you see the genes you know about, but you're not really thinking about it in an objective way. They seldom provide strong evidence for a specific mechanism, because there are intrinsic limitations in array data, and in expression data alone, in what you can hope to conclude. So you want to get beyond these sorts of gene lists and use more robust approaches. Again, I don't have time to go into this in tremendous detail, but gene annotations are out there now, and there's a very nice Gene Ontology annotation system which can be used objectively through very easy-to-use, extremely user-friendly online software, as well as lots of other tools, to categorize gene lists into their functional groups. Again, I want to emphasize the optimal use of public data; a lot of this is out there already, and from many sources. I'd point out that the GEO database now has some very nice tools built in for doing simple kinds of analyses, readily available online, which don't require a great deal of sophistication to use. One thing I want to emphasize is the importance of gene signature-based methods. The particular one I'm going to feature is Gene Set Enrichment Analysis, which was developed by Jill Mesirov in Boston. This is captured as a very nice set of tools on the GSEA website, and you'll also hear the term MSigDB. The Broad Institute, which hosts this website, has done a tremendous service to all of us by aggregating a whole lot of the existing array data that captures signatures of interest. So for a lot of the kinds of treatment and comparison experiments I've talked about so far, the significant gene lists from those studies are captured in the database, and GSEA gives you a tool that allows you to compare your gene list with individual gene lists or with the large number of signatures that came from a whole lot of other investigators. This is very powerful, and it includes not just these curated gene sets but also sets related to positional information in the genome, to specific motifs, computationally defined sets, and Gene Ontology-defined sets. So there are a lot of things that can be done with this database. No cost, very user-friendly, very simple. And this just gives an example from one of our recent publications, where we looked at a tumor of interest to me, Ewing sarcoma, which is driven by a particular transcription factor translocation. We looked at the genes that were up- or down-regulated with a knockdown of that transcription factor and used Gene Ontology to identify the pathways which were regulated.
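GSEA itself uses a weighted running-sum statistic that is too long to sketch here, but the simpler idea behind many GO and pathway enrichment tools, hypergeometric over-representation, fits in a few lines. The counts below are invented purely for illustration.

```python
from scipy.stats import hypergeom

# Hypothetical counts: 20,000 genes measured in total, 400 of them
# annotated to one pathway, 250 genes in our significant list, and
# 15 of those 250 falling in the pathway.
total_genes, pathway_genes, list_size, overlap = 20000, 400, 250, 15

# Probability of seeing this much overlap or more by chance alone.
p = hypergeom.sf(overlap - 1, total_genes, pathway_genes, list_size)
print(f"over-representation p-value: {p:.2e}")
```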
Just to make another illustration: you'll also see, and will perhaps be expected by journal editors to generate, these kinds of pathway-based figures. This one was generated with Ingenuity Pathway Analysis software, but there are many other tools available which are based, at one level or another, on curation of gene functional data, to try to make it easy to visualize the relationships of the genes in your list, in your data, based on their functional relationships or on annotations that come directly from the scientific literature. These can be quite valuable in helping you think about the data and in suggesting hypotheses which can then take you deeper into your problem of interest. Okay, I want to move on now, very briefly, to clinical microarray studies and their implementation in clinical practice. What do you want to look for in clinical studies? I think this is important, because that's been one of the hopes for this kind of technology. First of all, a very well-defined question and patient sample set. High-quality data, and I'll emphasize that that's hard to assess without reference to the primary data; data of this type should always be made public in one of the databases I've mentioned, so that other investigators can reanalyze it. And believe me, that's routine in this field: there are lots of people out there who will pull the data down from the web and see if they can extract the same thing that you did, and this has sometimes led to very interesting discussions. You want appropriate and rigorous statistical analysis of the array data; you don't want any of those flaws from Rich Simon's paper to be present in the data analysis before you're going to believe it. You want to see some reference to a formal classifier. If you were to apply this to someone who walked into the clinic with a particular disease state, and you had a tissue to profile, well, you'd like to know: what should I do? What should I look for? Someone has to tell you that, or you won't know how to use their data, so the methods have to be transparent. Then, ideally, you want a completely independent validation sample set; that's probably the single most important thing, ultimately. And finally, there should be some sense of novelty. I think we're well past the point where we want to do it just because it's a cool technology. It's fun, but we'd like to know that there's some real novelty in the result, that we can do something for which there are no adequate tests using previously available techniques, something you couldn't figure out with a routine pathology assay. There should be some new discovery. There are some real problems with doing expression profiling in the clinic. First of all, it's a specialized technology, and I've told you that the quality of the data can vary from lab to lab. RNA is an unstable molecule, not very convenient to work with. And frozen tissue is not part of the usual operating room or biopsy room sample workflow. So those are problems. Some of the options to solve these would be to use reference laboratories that are very good at the technology; to use RNA preservatives, which might keep the RNA from degrading; or to be able to analyze paraffin-embedded materials, which are quite routine in tissue processing. And actually, current technologies do allow really remarkably good analyses of paraffin-embedded materials.
The other possibility is to use microarray technology for discovery, and from that to extract a signature, which might be a multi-gene signature, which can then be used to assay new samples with alternative technologies that might be more clinic-friendly. I'll just illustrate two examples of the now-several profiling tests which are in clinical use at the present time, both breast cancer examples: the 70-gene signature, which is an array-based signature for prognosis, and a multi-gene PCR signature, which is also in routine clinical use. So things are beginning to move into the clinic. Exactly which situations in cancer medicine and other fields of medicine will ultimately call for signature-based testing is not yet completely clear, but these things are definitely moving into the clinic. Okay, now I'm going to switch gears for the rest of the talk, because I think this is so important and timely. This slide is just to remind me to tell you about the explosion of next-generation sequencing technologies. The website cited here keeps track of this, and you'll see that these instruments are really all over the world at this point. They have really spread, and that's because of their incredible attraction. And here's the counterpart to my opening slide on the rise of the microarray: this is the rise of RNA-seq, setting aside sequencing as a whole and just focusing on RNA. This really started in '08 and is continuing. The rapid expansion is very reminiscent of what we went through in the early days of microarrays, and a lot of the issues and problems have a very close parallel to what we went through at that time as well. Back in those accelerating days of microarrays, I would have said this is not a stable technology yet. Now arrays are pretty stable; we know their properties, we know how to use them, and the tools are out there for everybody to use. But here, we're still in the accelerating phase. We don't really know everything about how to do this or how to analyze the data, and the tools are really not as readily available as they are probably destined to be in the future, perhaps the very near future. And here are the citations for the three papers that really opened the floodgates for RNA sequencing. So what's the primary thing to think about as you compare these? What do array technologies do? I hope I explained clearly that all array technologies measure the relative abundance of nucleic acids of defined sequence in a complex mixture. Well, sequencing can obviously accomplish the same thing: if you can do enough sequencing cheaply enough, you can really get to the same place. So here's a bit of a comparison as of today, and all of this is changing so fast that I'd probably have to edit this slide within months. On the pro side for microarrays: it's a readily available, mature technology; it's relatively inexpensive; it's effective with very complex samples; you can really do hundreds or even thousands of samples; and you can target a subset of the genome very easily. For RNA sequencing, it's nice because you get whole-genome data. That means you can look across the entire genome without regard to annotation. Remember, arrays are always constrained by whatever's on the array; maybe the most important gene for your discovery isn't there.
Well, then you're blind to it, but if you can get it by sequencing, you're really looking at the whole genome. That's an advantage. There's a relatively uniform analytical pipeline for different applications, although there are some counterbalancing issues there. You're free of hybridization artifacts, of course, which plague the microarray world; that's not to say there aren't unique sources of bias in sequencing, but you do have the possibility of one platform for all types of applications, and that's appealing. So all the categories of arrays beyond RNA, all of those kinds of things can potentially be addressed with sequencing as well, and indeed, that's being done. The cons: for arrays, you have to have platform-specific data processing equipment and applications. They're prone to platform-specific artifacts, so you have to know about each of those. There are many sources of noise. Whole-genome studies generally require a lot of arrays; that doesn't apply so much to expression arrays, but it does to other applications. On the sequencing side, the cons are that it's still an immature technology; it may yet have technology-specific artifacts; it's still relatively resource-intensive; and it's quite computationally intensive, which is a big difference. A big microarray study you can still carry around on a flash drive; you can't do that with sequencing. There's no standard analysis yet, and there's a significantly lower sample throughput in most labs. There are some advantages, though, that are worth expanding on. First of all, you get RNA sequence variation detected at single-nucleotide resolution. This allows you to look very nicely at things that are harder to do with expression arrays: allele-specific expression, mutations that might occur in a disease, and RNA editing. You can also look at RNA structure, for example splicing, which can be done on exon-specific microarrays but is really easier to do with RNA sequence data, as well as determining exact start and termination sites and finding aberrant, disease-specific rearrangements in RNA. So that's a plus that is for sure easier to do with sequencing. It's also nice because the detected signals are relatively unambiguous, so you have the potential to outperform microarrays at very low signal. If you think about it, in hybridization, as expression goes down and the abundance of the transcript gets very low, you'll have a very weak hybridization signal, and sometimes it's hard to know where the true signal rises out of the noise of the non-specific background. But if you have even one or two reads that align perfectly with an exon over a reasonable distance, that exon was expressed; there's really no ambiguity about that. And then, of course, you can do de novo assembly and discover transcripts you didn't know about to begin with. You can also do everything you would imagine wanting to do: full-length mRNA sequencing, or tag sequencing, and I'll come back to that. You have options in what kind of RNA to sequence: you can look at strand-specific expression, you can sequence microRNAs or other small RNAs, and you can look at long non-coding RNAs. So you can look at whatever kind of molecule you want to look at. There are some issues. The lower limit of abundance is constrained by the abundance distribution, as in microarrays, and by the number of aligned reads available, which comes back to the total number of reads. Large sample numbers are still difficult to achieve.
Software is still in the early years of development, the computational hardware requirements are substantial, and library methods are still evolving, "library" being the buzzword used in the sequencing world for nucleic acids converted into sequence-ready form. The data may not merge well if not generated with the same method; just as in microarrays, you probably need to be pretty consistent in how the data is generated, and this is something that's emerging as people change their protocols. So typically, what's done in mRNA-seq is to randomly fragment the RNA, convert it to a cDNA library which is sequence-ready, and get some number of millions of reads which reflect the sequences of all the different molecules in the mixture. Well, what do you do with this? This is where it gets sticky. Here's an RNA-seq workflow painted with very broad strokes, no real detail, just to give you some idea of the challenges and what you would look for. You start with raw reads; this is the data that comes off the sequencer. Now you have to align this to a reference transcriptome or genome. But right away you have a question: what are you going to align it to? If you align to the genome alone, that's actually not a very good choice, because mRNAs make punctuated use of the genome in the form of exons that are spliced together. You know that many fragments in your library will cross splice junctions, and the aligner may not be able to handle that if you align to the whole genome. But then you have to pick a transcriptome. Which transcriptome will you use? RefSeq, the EMBL gene models, something you've developed yourself? You have to make that decision as well, so there's that additional layer of complexity. Maybe you'll need to use more than one alignment tool to get the alignment you want. If you do this, you'll end up with some alignment to a transcriptome, and then you'll get some unaligned reads, which are ultimately aligned to the genome elsewhere. So you have these aligned reads, and they can be used fairly simply to generate a normalized read count, which will be the equivalent of an expression level. The data can also tell you amazing things if you go on and work with it further: exon usage, transcription start and stop, structural variants, RNA editing, single-nucleotide variants, antisense expression with strand-specific protocols; all of that can be squeezed out of this with considerable effort. Then you have to turn back to the unaligned reads, because you can't assume they're unimportant. You'll have to do some sort of de novo alignment and assembly of those if you want to capture that information, to look for potential new genes or new exons in existing genes. And believe me, there's plenty of that out there. This just shows a little of what the data looks like. Each of these represents a single track in a genome browser view of several samples for a particular gene. What you're seeing is how the reads line up: here's the gene and its many exons, there's the 3-prime UTR, so you see lots of reads there, and then reads coming up at each exon of the gene, very specifically and beautifully. So this gives you an idea of what the data looks like; I won't show you a lot of these kinds of pictures, there's plenty of that kind of data out there. And the number of reads here, shown in the top track, is ultimately related to the abundance of the transcripts.
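One common way to make that normalized read count concrete, though not the only choice and not specifically endorsed here, is TPM (transcripts per million), which corrects counts for transcript length and sequencing depth. A minimal sketch with made-up counts and lengths:

```python
import numpy as np

# Hypothetical aligned-read counts for five transcripts in one library,
# with transcript lengths in bases.
counts = np.array([500.0, 1200.0, 30.0, 0.0, 25000.0])
lengths = np.array([2000.0, 4500.0, 900.0, 1300.0, 1800.0])

# TPM: first correct counts for transcript length (reads per kilobase),
# then rescale so each library sums to one million, making the values
# comparable across libraries of different sequencing depths.
rate = counts / (lengths / 1000.0)
tpm = rate / rate.sum() * 1e6
print(np.round(tpm, 1))
```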
Just going back to one of the '08 papers, this shows the very nice relationship between RT-PCR and expression as determined by RNA-seq. So the data has a quantitative relationship with other types of assays, but each of them will have its own properties. One point I would make about the statistics of RNA-seq is that they're really different from the statistics of microarrays: there you're dealing with the statistics of a continuous numerical measurement, while here, in RNA-seq, you're dealing with the statistics of counting things, and that's really a different approach. This just shows the abundance curve and the challenge you still have, because some genes are present at tremendous abundance, given the huge dynamic range of RNA expression. If you really want to drill down here, it's very challenging to do that accurately and to get enough reads to be able to make comparisons. If you only have a few reads across a bunch of samples, you can make the statement that the gene is expressed, but it may not be statistically possible to compare across samples if they all have relatively low expression, even though, if you had a true absolute measurement, there might be true differences. So this is still a problematic area for RNA-seq, and you very quickly hit a point of diminishing returns, because the hyperabundance of the most common transcripts tends to use up your sequencing space. So what are some ways to simplify this? One might be to just sequence the ends of the fragments, so just take a few nucleotides off the end and have that be your data; do less sequencing. Probably the most attractive embodiment of that concept is 3-prime tag sequencing. Instead of sequencing across the whole cDNA, you would cut off just one end, with restriction enzymes or some other strategy, sequence that, and generate a tag, and you can sequence a truly enormous number of tags with current sequencers. There have been a number of publications using tag sequencing, which has not caught on as well yet as it might, but I think with progress, mainly on the data analysis side, and with better tools, it's a technology that may be more widely embraced in the future. So basically you take all those tags, align them, and count them, and you could imagine multiplexing samples by adding a barcode sequence to them and putting many samples together on the sequencer. And if you were able to automate the process of creating the libraries, you could look at very large numbers of samples in parallel. In fact, some of the large sequencing centers are doing this, particularly for microRNAs at this point, because you can really sequence a lot of samples for microRNAs.
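Here is a toy sketch of that align-and-count idea with barcoded, multiplexed samples: reads are split by a leading sample barcode, and the remaining tag is counted per sample. A real pipeline would align tags to a reference and tolerate sequencing errors; the barcodes and reads below are invented.

```python
from collections import Counter, defaultdict

# Hypothetical pooled reads: a 4-base sample barcode followed by the tag.
barcodes = {"ACGT": "sample_1", "TGCA": "sample_2"}
reads = [
    "ACGTGGGAAACCCTTT",
    "TGCAGGGAAACCCTTT",
    "ACGTTTTGGGCCCAAA",
    "ACGTGGGAAACCCTTT",
]

tag_counts = defaultdict(Counter)
for read in reads:
    sample = barcodes.get(read[:4])
    if sample is None:
        continue  # unrecognized barcode; simply discarded in this sketch
    tag_counts[sample][read[4:]] += 1  # count each tag per sample

for sample, tags in sorted(tag_counts.items()):
    print(sample, dict(tags))
```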
So what is the future of all this? I think that as sequencing throughput increases and the cost per read declines, sequencing is likely to become an important and attractive alternative to microarrays for more and more applications; I think that's pretty much guaranteed. It's hard to say when, and for which applications, for certain things outside the expression field. For chromatin immunoprecipitation, for example, I think sequencing has pretty much replaced ChIP-chip for most labs, because it's not very resource-intensive and it generates higher-quality data, so that's an array application which is rapidly fading. For expression, I think that future is still a little bit further off. I'll conclude there; the handout has some useful websites, which just scratch the surface. You can get to most of the topics I mentioned very easily just by googling them. So I think that's the end of my formal presentation. I'd be glad to answer questions if there are any. [Audience question, partly inaudible: the substrate is the same, but do different library preparation methods introduce different artifacts?] Correct. I think it's probably true that there is bias associated with the library prep methods, just as in microarrays. It depends, of course, as always with these things. The first question is: what is the study for? What's the goal of the study? If it's a quantitative comparison, then for sure all the libraries should be made with the same methods, and the sequencing should follow basically the same methods. Even if something new and very attractive comes along during the study, you really shouldn't change unless you want to go back to the beginning, because there's a risk of bias being introduced by the method. [Audience follow-up, partly inaudible: is there one method everybody follows? Even for microarrays, after many years, hasn't this remained unsettled?] That's correct. It's still an issue with microarrays, absolutely, and it'll never really go away. And in a way it's good that this is happening, because it allows people to develop their own approaches and do innovative things. In the case of sequencing, I'd say even more so: it's good that people are trying different things, and it will probably, as in the array world, tend to converge on a few methods. What's probably the biggest challenge is not so much the front end, where at least it's easy to describe what's going on and every molecular biologist can understand it. It's the back end, the computational end, which is much more complex, and where molecular biologists are going to have a steep learning curve to understand how to compare the results of one RNA-seq experiment to another, because you can get very different results depending on what you do with the raw data. A question back there? [Audience question about expression profiling of formalin-fixed, paraffin-embedded samples, partly inaudible.] So the first point, as I'm sure you recognize, is that with formalin fixation you end up with fragmentation of the RNA molecules. So when you attempt to work with that material, you're up against variability across samples. Depending on how they were fixed, the exact fixative conditions, when they were fixed, and the age of the samples, different samples are going to have very different size profiles, and that can introduce bias into the data no matter how you work with it. It's relatively straightforward with some of the newer technologies to label RNA from FFPE samples, and if there's reasonable RNA available, meaning fragments on a scale of 100 bases or greater, you can actually get pretty decent hybridization to microarrays. Sequencing is tougher, because if the fragments are very short, they may not align uniquely as easily, so you may have a lot of poorly aligned or even incorrectly aligned sequence. You may get fewer fragments that accurately cross splice junctions, because if you imagine taking, say, a 75-base-pair molecule and splitting it at a splice junction, or maybe even two, you may end up with really short tags that don't align well, and that's a problem. On the other side of this, some sets of paraffin samples, particularly very fresh ones, may actually have pretty decent RNA and work quite well.
So there's really less published information about this; it's kind of at the community lore and experience stage rather than in a lot of publications. I think a lot of people are asking your question and beginning to dip their toes into this problem. It's going to be difficult, though. Okay, if there are no other questions, thank you.