Thank you, and thank you, Andy. It's a pleasure to be here. It's funny, I remember giving a talk in this auditorium years and years and years ago when I was at TIGR, The Institute for Genomic Research. I came in here, and it was in the days when we used slides. We turned on the slide projector, the first slide went up, and then the bulb went out. And for the next hour, I did an interpretive dance presenting my talk because no one could find a bulb. So hopefully this will go off with fewer hitches. But it's a real pleasure to be here. My presentation today is going to be more or less a historical journey through expression analysis, but I'm going to try to really emphasize what I think are some of the important points about how we approach this overall problem. Since this is a continuing medical education credit course, I have to provide disclosures. I have lots of titles, as you heard: I'm professor of biostatistics and computational biology at Dana-Farber, professor of computational biology and bioinformatics at the Harvard T.H. Chan School of Public Health, and I have many other academic titles. In academia, rather than paying you more, they give you more titles, which means you sit on more committees and waste more time. I'm on a bunch of advisory boards, none of which are relevant to the work I'm going to present here. I founded a precision medicine software company in 2012. That company provides software, so I disclose that on the official disclosure slide, but nothing that company does is relevant to anything I present here, other than the fact that everybody deals with data. And data is really our currency for all of the analysis we'd like to be able to do. I like to use quotes in my talks, and this is actually one of my favorites. This quote really crystallizes the challenges and opportunities that come with the data we can generate today: "Every revolution in science, from the Copernican heliocentric model to the rise of statistical and quantum mechanics, from Darwin's theory of evolution and natural selection to the theory of the gene, has been driven by one and only one thing: access to data." And if you think about it, that's really true. When we think about how science advances, data is what allows us to test our theories and hypotheses, to either falsify or validate them. And if we falsify them, data is the raw material we use to develop new theories, new testable hypotheses. So you can see why I love this quote. In fact, the reason I really love it is that I made it up. This is my intellectual selfie. And once I made it up, I realized it was too long, so here's the Twitter version, 115 characters, so you can tweet it and make me famous. But it's really true: data drives innovation. And one of the real revolutions we've seen over the past 15 to 20 years has been in our ability to generate large-scale quantities of gene expression data. So how do we do that? And what is that data useful for? Well, I'm going to give you a little history of expression analysis to start, because even in the earliest methods people used, there are really important lessons for understanding how we analyze gene expression data today. The human genome was sequenced in 2000, and the papers describing the draft sequence were published in 2001.
I'm sure you all know a lot of the history of the battle between the public and private genome projects. Having worked in the field at the time, it was a very interesting time and place to be, full of interesting characters, some of whom you love and some of whom you hate. I won't tell you which are which, but we can talk about that later if you like. It was a very interesting time to be involved in genomic science, because there was this wealth of data and all these grand pronouncements about what the value of the data was. What was most interesting was that the genome project gave us a reasonable catalog of the set of human genes. And that was really transformative in the way we think about biological systems and how biological systems function. I came into the field in the early 1990s with a fellowship from the then National Center for Human Genome Research, now the National Human Genome Research Institute, to work in genomics. But my PhD and all the work I had done previously was in theoretical physics, so I had to learn molecular biology from the ground up. And I realize that while most people here understand this, there are some people here, or watching this on YouTube, who don't really understand molecular biology the way I hope I do now. This is the way I really started thinking about biological systems: encoded within that DNA, there are elements of DNA called genes. If I asked everybody in this room what a gene was, I'd probably get 100 different answers, 100 different definitions. But the working definition that people have been using for a long time is that a gene is a region of DNA that encodes a protein. Even in saying that, there's a lot of difference in interpretation about what exactly the protein-encoding piece is, how extensive it is, and whether isoforms are the same thing or different things. But we'll take a gene to be the region that encodes a protein. That gene, in order to make that protein, is first transcribed to RNA. The RNA is exported from the nucleus, taken up by the ribosomes, and translated into a protein. This is what people often call the central dogma of molecular biology: DNA to RNA to protein. And when we look at this simple paradigm, what we see is really the core of the place where we ask questions about what these genes do and how they function in producing the phenotypes we'd like to observe and understand. What the proteins do is a little more complex: those proteins fold up into a three-dimensional structure, and that three-dimensional structure is fundamentally important because it dictates their function. The analogy I always like to give is that I can take a piece of metal and use it to make a hammer or a wrench. I might be able to use the wrench to pound in a nail, but I can never use the hammer to loosen a bolt. So the structure actually matters, and it dictates the function. And part of that function, for some of the proteins called transcription factors, is regulating the expression of other genes. When I say expression, I'm really referring to how much RNA is being made. So we have this really interesting closed loop.
And at the end of this presentation, I'm going to come back to this idea that this regulatory process is really important for understanding biological systems. So what the genome did was give us a parts list, a catalog of all the proteins that make up a human cell. But what we recognize is that different cells in the body perform different functions, so all those cells are taking the same fundamental parts list and doing something very different. Now, when I say it's very different, it really isn't entirely: all cells have to consume oxygen and sugars and produce carbon dioxide and energy. But these cells are performing different functions in the body. A neuron is going to be expressing a core set of metabolic genes that are also expressed in a muscle cell, but the neuron is going to be making neurotransmitters while the muscle cell isn't. So what we see is that in each cell type there are core functions carried out in common and a set of functions carried out differently, and those different functions are represented in different sets of genes that are expressed. In the same way, when we think about healthy tissue and diseased tissue, we recognize that one of the differences between those tissues is fundamentally the expression of different genes, which changes the way those cells act. So even before the genome was sequenced, people were developing techniques to ask: are there differences in the genes that are expressed? One of the earliest techniques was the northern blot. A northern blot is a really interesting technology in which we take RNA from cells, load that RNA into a gel, and use electrophoresis to separate the RNA based on size in that gel, so the smaller fragments are at the bottom and the larger fragments are at the top. I can also run a sizing ladder, RNA or DNA with fragments of known size, and use that to benchmark where different sizes fall in the gel. Then I can take that acrylamide or agarose gel, transfer the RNA fragments onto a membrane, take specific gene fragments that I want to know about, and hybridize those gene fragments as probes to the RNA that is now bound to the membrane. Then I can visualize that in a number of different ways. One typical way is to use radioactively labeled probes and visualize them by exposing X-ray film to the radioactive membrane and developing it. So this is an example of a northern blot, and one of the things you can do is look at the presence or absence of different genes, where what really distinguishes the genes from each other, what allows us to detect them, is their specific sizes. This is important because the size of the fragment on the filter really acts as a filter on the data: it's what allows us to detect the presence or absence of different genes. But one of the important things I have to do to compare the levels is to scale the data. What I actually measure when I look at a northern blot is the darkness of these bands, the intensity.
And in order to compare them, I can look at this and say, well, this one is clearly brighter than that one. But I can only really say it's brighter if I know I've loaded the same amount of material. If I loaded ten times as much material here as here, then while it is brighter, it may not mean that on a per-cell basis there's more RNA here than here. So what people typically do is load some kind of normalization standard. In this case, it's what we refer to as a housekeeping gene, actin, a gene that we assume is expressed at near-constant levels across all the cells. If I look at this, what I can do is estimate the intensities here and then scale the data for each lane up or down in such a way that the actin levels scale the same way, to really facilitate that kind of comparison. So I can do a quantitative analysis based on the assumption that I'm loading the same amount of material into each lane in this gel and that this gene down here, actin, is expressed at constant levels, and I can scale actin and all the other things in that lane to make effective comparisons, okay? It's an important lesson, and it's one that carries through to all the other methods we use for analyzing gene expression data: quantification requires normalization. And in this case, normalization is based on the assumption that actin is expressed at constant levels. After northern blots and the invention of PCR, people realized you could use the polymerase chain reaction as a way of quantifying the levels of different genes based on the RNAs that are expressed. So people developed quantitative RT-PCR, and this is now a widely used technique for accurately assessing gene expression levels, based on the detection limit you have for seeing fluorescence from a minimum number of RNA molecules once they're amplified into double-stranded DNA. For each gene that I want to look at, what I see is a curve in detected intensity, and there's a detection threshold. Essentially what I look at is after how many cycles of PCR this curve crosses the threshold. If I have a lot of material, I'm going to reach the detection threshold earlier, so I detect it in a smaller number of cycles. If I have less RNA, I have to do more and more amplification until it's detected. And so by running fixed quantities of RNA for particular genes, I can actually get a titration curve, where very low levels of RNA are detected very late and very high levels are detected very early. By looking at the number of cycles after which I detect the presence or absence of a particular gene, I can in fact determine its approximate expression level, its approximate representational level. There are always things below threshold that I don't see, but by looking at the number of cycles, I can back-calculate the approximate amount of RNA. This technique was very widely used, and it's still very widely used; people still refer to it as a gold standard. Although again, there are a lot of assumptions. One of the assumptions is what I use for the titration curve. And the other is that, just as in northern blots, I have to include some control genes for each sample that I assume are going to be expressed at about the same levels in each sample. I have to include some kind of housekeeping gene control, and I use that to scale all of my measurements.
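To make that back-calculation concrete, here is a minimal sketch, assuming the common 2^-ΔΔCt (Livak) approach. The gene names, Ct values, and the assumption of roughly 100% amplification efficiency are all illustrative; a real analysis would also check efficiency against the standard curve.

```python
# Minimal sketch of relative qRT-PCR quantification via the common
# 2^-ddCt (Livak) approach. All Ct values here are hypothetical;
# real analyses also use standard curves to check amplification
# efficiency rather than assuming perfect doubling.

def relative_expression(ct_target_sample, ct_ref_sample,
                        ct_target_control, ct_ref_control):
    """Fold change of a target gene in a sample vs. a control,
    each normalized to a housekeeping (reference) gene."""
    d_ct_sample = ct_target_sample - ct_ref_sample    # normalize to the reference gene
    d_ct_control = ct_target_control - ct_ref_control
    dd_ct = d_ct_sample - d_ct_control                # compare sample to control
    return 2 ** (-dd_ct)  # assumes ~100% PCR efficiency (doubling per cycle)

# Example: the target crosses threshold 2 cycles earlier (relative to
# actin) in treated cells than in controls -> ~4-fold higher expression.
fold = relative_expression(ct_target_sample=22.0, ct_ref_sample=15.0,
                           ct_target_control=24.0, ct_ref_control=15.0)
print(f"approximate fold change: {fold:.1f}")  # ~4.0
```

The key point is that every measurement is interpreted relative to the housekeeping gene, which is exactly the actin-scaling assumption from the northern blot, carried over into a new technology.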
So the filter here on detecting which genes are being expressed is actually the specificity of the two PCR amplification primers that I use. Quantification requires normalization, and this involves comparison to standard curves as well as the use of a reference RNA. That normalization is again based on a number of assumptions, in particular that the gene I select for my standard is expressed at the same level in all of the different cells. Now, quantitative RT-PCR is often used a gene at a time. But in the late 1990s, there were a number of studies that really looked at large-scale gene expression using quantitative RT-PCR as a way of measuring large numbers of genes. This is actually one of my favorite papers, published in 1998. This study looked at rat brain and at the expression of 100 different genes over different stages of development. And what they were able to do was perform a very interesting analysis. This was really the first example I can find in the literature of someone using what we refer to as a heat map to represent gene expression levels. They have a number of different genes and a number of different classes, and they depict, for each one of these genes, its expression level in each row as a gradient of intensity: the darker it is, the more highly expressed it is. So what you can see are lots of different patterns, which they also represented in the paper as waves of expression. And they grouped genes together in clusters, what they refer to as functional clusters or expression clusters, based on the patterns in which these genes were expressed. That was really interesting because it allowed them to map what they refer to as different trajectories of these different classes through the developmental cycle in the rat brain. And I thought this was really interesting because this early RT-PCR paper captures a lot of the ideas that are used even today in the analysis of gene expression. Now, with RT-PCR, estimating the expression of even 100 genes was extraordinarily difficult. So after this technology was developed and deployed, people recognized that there were probably better ways. In the late 1990s, a new approach was developed to assay gene expression: the DNA microarray. There are lots of different flavors of DNA microarrays that were created. Some of the earliest were developed by Pat Brown and his students and colleagues at Stanford. Those arrays, shown here at the top, were made by mechanically spotting DNA fragments that were complementary to the RNA sequences we were interested in assaying, DNA fragments from individual genes mechanically spotted onto a glass microscope slide. Other technologies, like those developed by Affymetrix and Agilent, synthesized probes de novo on the surface of the slide. In the case of Affymetrix, they used photolithographic techniques like those developed for the semiconductor industry. For Agilent, the technology was based on using inkjet printers to deposit the chemicals for de novo synthesis. In either case, what these array technologies allowed us to do was represent the thousands of genes in the genome on a single substrate to perform expression analysis. Now, the implementation of these technologies was slightly different.
For spotted arrays, what one typically did was isolate RNA from two different samples. You can think of these as a treatment and a control sample, or an experimental sample and a universal reference sample. In each case, we take the two samples, isolate RNA, and create fluorescently labeled target molecules, often complementary DNA made from that RNA, representing the different genes in those samples, and we co-hybridize them together to a single array. If you look at the panel at the top, you can see different colors; the way these were typically represented is using red and green. So if I take my two samples and hybridize them together, the red spots and the green spots represent RNAs that were present in one sample and absent, or present at much lower levels, in the other. You can see some spots are red and some spots are green, indicating presence in one sample or the other, but most of the spots look sort of yellow, suggesting presence in both. In either case, what one can do is measure the relative levels of expression in those two samples and produce a vector of expression levels across genes. And by looking at hundreds of samples, or tens of samples, anyway, some number of samples, you get a matrix of gene expression levels. Single-color arrays in some sense were conceptually a little easier. What one would do is take RNA from a single sample, create labeled targets, and then hybridize those to an array. In the case of Affymetrix, there were multiple probes for each gene, so what one would do, as illustrated in the bottom panel, is take the expression levels for the individual probes and average over them to get an estimate of the expression level for each gene. But again, the end result is a vector for that sample giving, for each gene, an estimate of its expression level. And then around this data, a whole suite of tools and analytical methods was developed to really facilitate understanding how those expression levels change between different states. Well, DNA microarrays were extraordinarily successful. But as sequencing technology became less expensive, we moved beyond assaying just the genes we could place on an array to doing de novo sequencing of RNA. Although arrays are still fairly widely used, almost any project you undertake today is very likely to use RNA sequencing, or what we refer to as RNA-seq. What we do with RNA-seq is extract RNA from samples and run a standard sequencing assay: we take the RNA, convert it to cDNA, and do DNA sequencing. And then there's a pipeline we've developed to take those sequence reads and align them to the genome, or align them to a library of gene sequences. In doing this, we're able to estimate expression levels for individual genes, or in fact to go even further and estimate expression levels for individual transcript variants, different splicing variants, even the fusion genes that occur in some diseases. So it gives us a much finer-grained window into what's happening. But again, at the end of the day, what we have coming out of these assays is a vector of genes and their relative expression levels. In the case of RNA-seq, it's individual counts.
In the case of DNA arrays, it was fluorescence levels. In the case of quantitative RT-PCR, the raw data that come out are delta-Ct values. But at the end of the day, what all of these technologies do is give us the opportunity to measure gene expression levels at a very fine-grained level of detail. And armed with that, what we've been able to do is begin to think not just about single experiments, but more completely about how we tackle the problem of understanding disease development and progression. So I and my colleagues, like many others, have started to look at this overall progression of disease and to think about how we can use these modern technologies, including RNA profiling technologies, to address the entire lifecycle of disease. If we think about disease, every one of us carries some genetic risk, and what we'd like to do is be able to estimate it. A lot of the technologies we use for that really look at genes, gene sequences, and genetic variants. But beyond estimating risk, what we'd then like to do is a better job of early detection of disease. And once we've detected the presence or absence of disease, we'd like to be able to stratify patients into different groups, recognizing that diseases which manifest and appear very similar to each other may in fact be different at the molecular level. They may express genes, for example in cancer, that make the tumor more sensitive to one therapy than another, or make a person much more likely to progress rapidly, or make their tumor much more likely to metastasize. So we'd like to be able to look at that kind of data and information, often using expression data, and use it as a way of staging the disease, but then, more importantly, selecting the most appropriate treatment option, one that's likely to provide a good outcome and improve quality of life. Now, a lot of my career has been built around the use of data from DNA arrays, but as you might imagine, with the rapid expansion of DNA sequencing technologies, a lot of our work has shifted to using techniques like RNA-seq or genome sequencing to really understand the molecular basis of disease. All right, so we have technologies, but now the question is, how do we actually use them in a way that allows us to make measurements that are meaningful? And that raises the whole question of experimental design. So let's take a look at that. When we think about experimental design, and this has been a really interesting experience for me over the course of my career, a lot of people just look at these high-throughput technologies like RNA-seq or DNA microarrays, and they'll show up in my office and say, well, look, I've generated a lot of data; why don't you tell me what this data means? And invariably, my first question for them is: what was your experimental design? Honestly, the experiments we can analyze are those for which the person has actually thought about the experimental design ahead of time. Although a lot of people take this for granted, it's actually one of the most fundamental elements of any analysis anyone is going to do: think carefully about the design and about what questions you want to ask of the data, and then make sure you have an appropriate design to address those questions.
So when we think about designing the experiment, the first thing we have to think about is whether or not we actually have enough replicates, and the appropriate number of assays, to allow us to estimate the expression and the variance and do meaningful comparisons between groups. Having done that, the next thing we want to do is collect the data and process and normalize it so we can make comparisons. Once that's done, the next step is to perform a meaningful statistical analysis of the data. Some of that is exploratory analysis, where we look for broad patterns. Some of it is much more focused, testing our experimental hypothesis: how do these two groups compare to each other, and what is the meaning of those differences? At the end of the statistical analysis, what we typically have is a set of differentially expressed genes, and what those genes represent are the ones with the biggest average difference between our experimental groups. So in the standard two-class comparison, treated and control, we might perform a t-test after normalization to do our statistical analysis. What we get is a set of genes and the associated p-values, showing how different they are and how confident we are in the differences between those two groups. The last step is a biological interpretation of the genes we get out. Ultimately, that's where the rubber meets the road: we don't just want a gene list, we want some insight into what those genes tell us about the underlying disease. So at that point, what we typically do is look for over-representation of different classes of genes, different functional classes, different pathways. And that allows us to understand what's driving the different phenotypes that we see. We'll take a look at each one of the elements in this pipeline in a bit more detail. So first, why design an experiment? This is a fundamentally important question. The reason, as I alluded to earlier, is that the design of the experiment really influences everything you do downstream. It influences the analysis, and it influences the accuracy of the analysis. If you don't put time into the experimental design, you end up in trouble. I remember when I was first working in DNA microarrays, one of my colleagues walked into my office one day and said, could you help me analyze data? I have 60 microarrays. And I sat with him and said, well, that's great. What was the experimental question you had? His answer was: I had 60 microarrays. So I said, well, what were the different conditions? And he told me he had 60 different conditions. I looked at him and said, there's not much we can do, because with 60 different conditions and no real thought put into which conditions you're comparing, we don't even know where to begin to look for meaningful differences. So the take-home lesson for me and for him, and I hope for everyone I tell this story to, is that putting time into the design is extraordinarily important, because the goal of an experiment dictates everything from how the samples are collected to how the data are generated. And it's really worth thinking carefully about that, because what we ultimately want is an experimental design that will provide us with the insight we hope to glean from the experiment. Doing the experiment takes a lot of time and a lot of effort.
Analyzing the data can be relatively quick, but one of the things I've seen in my career is that bad experimental designs don't lead to publications; they lead to post-mortem discussions about why the analysis failed. Having that discussion before you've invested time in generating the data is vitally important, because it's going to save you a lot of grief later. So the design of the analytical protocol, and what you choose to do with the data once you have it, should be reflected in the experimental design. And there are some simple questions you have to ask. Do you have enough replicates in your experiment to actually measure the things you want to look at? Do you have sufficient numbers of experimental controls so that you can make meaningful comparisons? And do you collect the samples and data in a way that avoids confounding and batch effects? There are many classic examples in the literature now of people who've done bad experiments where they haven't thought carefully about confounding and batch effects. They've collected all the treatment samples at one hospital and all the control samples at another. Or they've run all their treatment samples on Monday and all their control samples on Wednesday. Or they've used one lot of reagents for one set and another lot for another set. And in those situations, when you see differences, you never know whether they're actually biological or due to some difference in the experimental protocol. A second type of experiment is class discovery. Breast cancer, for example, we used to think of as one monolithic disease. Around 2000, Chuck Perou and his colleagues published a really foundational paper looking at gene expression in breast cancer. What they found, by looking at the thousand most variable genes, is that they could classify breast cancer into four broad subtypes based on patterns of expression. Those subtypes are still widely accepted today, and they're the foundation of how we think about managing patients. The different patterns of expression are reflected in different outcomes and different responses to therapy. We have a better understanding of why, and a better understanding of how to make therapeutic interventions. By looking at the genomic data, we've been able to discover different classes. Now, in breast cancer, people argue over how many classes there are, they argue over the definitions of those classes, and they argue over how best to classify patients. But that class discovery was incredibly important. And if you read any of the papers from The Cancer Genome Atlas, using either genome sequencing or gene expression, those groups all attempt to identify new classes in the data and to link those classes to meaningful clinical endpoints. So class discovery is still something that's important, but again, careful experimental design is going to be important for identifying robust classes. A third type of experiment is classification, and in a way it follows from class discovery. With a set of samples that represent different phenotypes, can I come up with a set of genes whose expression levels provide a rule for classifying patients? Can I take the data and come up with a rule that will identify my different classes, or that allows me to predict whether a patient will respond to therapy or not, whether a patient will have a good outcome or a poor outcome?
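As a minimal sketch of what such a classification rule looks like in practice, and not any particular published predictor, here's a nearest-centroid classifier, one of the simpler rules used for expression-based signatures, trained on synthetic data:

```python
# A minimal sketch of expression-based classification on a synthetic
# (samples x genes) matrix: learn a nearest-centroid rule on training
# samples, then predict labels for held-out samples. Nothing here
# corresponds to a real published signature.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import NearestCentroid

rng = np.random.default_rng(0)
n_samples, n_genes = 60, 500
X = rng.normal(size=(n_samples, n_genes))
y = np.array([0] * 30 + [1] * 30)      # e.g., responder vs. non-responder
X[y == 1, :20] += 1.5                   # 20 genuinely informative genes

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, stratify=y, random_state=0)
clf = NearestCentroid().fit(X_train, y_train)
print("held-out accuracy:", clf.score(X_test, y_test))
```

The held-out test set is the point: a classifier evaluated only on the data used to build it will look far better than it is, which is one reason so few published signatures survive validation.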
There are a large number of published experiments that have come up with different class predictors. Now, one of the challenges: even in breast cancer, where there are over 3,000 published classification methods, the number that have actually made it into clinical use I can count on one hand. And the reason is that a lot of those experiments were not carefully designed and were not carefully validated, so the biomarkers themselves are not that useful. Here too, you have to think carefully about how you design the experiment and how you design the follow-up analysis and validation strategy. The last piece is large-scale functional studies. In a lot of ways, this relates back to class comparison: what I want to do is try to discover some kind of causative mechanism associated with the difference between classes. I present these as if they're four different types of experiments, but almost every microarray analysis anyone has ever done involves some mixture of the different components. Again, when you start an experiment, what you really have to do is think about your goals, what you want to get out of the experiment, and then think about how you're going to design the experiment to meet those goals. A single data set can be used in different ways if it has the appropriate properties. So once I've designed an experiment, I collect data. Once I collect data, I have to normalize it. So let's think about gene expression data and why we want to normalize it. The goal of normalization is to remove systematic variation from the data and to scale assays from different samples so that I can effectively compare them within and between studies, all right? In DNA microarray analysis, the most widely used method is called RMA, robust multi-array average. RMA was a method developed by Terry Speed and Rafael Irizarry, who's now one of my colleagues at Dana-Farber. And RMA is based on a really simple assumption: when I measure a gene expression level on an array, what I'm really measuring is a background signal plus a real biological signal. So the overall pattern of gene expression I see across thousands of genes is going to be the sum of a background distribution plus a real biological signal. And what we have to do is deconvolute this to really understand what the signal is and to compare these signals. One of the underlying assumptions in RMA is that the distribution of gene expression levels within each sample, for closely related samples, is going to be approximately the same. So what RMA does is look at that distribution, and it forces the distributions across multiple arrays to look the same, using an approach called quantile normalization. Quantile simply means we take gene expression levels and organize them from lowest to highest. Just as percentiles divide a distribution into hundredths, what we do with quantile normalization is take the thousands of genes assayed in each experiment and, for each sample, organize their expression levels from the lowest to the highest.
And what we see is that there are a number of things expressed at low levels, a lot of things expressed at intermediate levels, and very few things expressed at high levels. So we can take each and every sample in the data and normalize the data. We look at each distribution, recognizing that each assay is going to differ a little bit in what it measures: we have slightly different quantities of RNA, slightly different hybridization conditions, slightly different detection conditions. So all of these distributions are going to look slightly different. But our assumption is that for a related set of samples, they should be more or less the same. What quantile normalization does is take all of the different quantiles and scale them so the distributions are approximately the same. And that's what's shown here: the median of all these different distributions, which we scale everything to. So what you do is fit this model I showed you earlier, of background plus signal; you can fit it in a variety of different ways and filter the data. But what RMA does is scale those distributions so you can make comparisons. Here are just two examples where each box-plot column represents a different sample. All of these samples are very similar; they're all going to be compared in the same experiment. Some arrays are brighter than others, so their median expression level is higher or lower, and the distributions are all different. What RMA does is take those different distributions and force them to be the same. Now, you can argue about whether or not that works, whether it's the best idea or not. The truth is that what it has been shown to do over the last ten years is give us very reproducible differences in gene expression across multiple experiments, differences that can largely be validated using other technologies. So this idea of quantile normalization, based on the assumption that the distribution across each sample should be about the same, works surprisingly well. And even for modern techniques like RNA sequencing, the most widely used methods for normalizing RNA expression levels are built around this idea of quantile normalization, of scaling the data appropriately under the assumption that there are some low-expressed, mid-expressed, and high-expressed genes and that the distribution across samples should be approximately the same. Now, if I compare expression levels between any two samples, the raw data often look like this. Let me explain what these diagrams are. People call these ratio-intensity plots, or MA plots: the M is the log of the ratio, and the A is the average log intensity for the two samples. I take the expression level in one sample, divide it by the expression level in the other, and take the log; that log ratio is plotted along the vertical axis against a measure of the average intensity for the two samples. And what you can see is that all of these comparisons have some kind of artifact: in one, a sample is expressed higher than the other at low intensities; some curve; some are almost flat; but one sample is just generally higher than the other one.
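Before looking at what normalization does to these MA plots, here is a minimal sketch of the quantile normalization idea just described. Real implementations, like the one inside RMA, also do background correction and handle ties more carefully; this is just the core transform.

```python
# A minimal sketch of quantile normalization: force every sample's
# expression distribution to match a reference distribution (here,
# the mean of the per-sample sorted distributions).
import numpy as np

def quantile_normalize(X):
    """X: genes x samples matrix. Returns the normalized matrix."""
    ranks = np.argsort(np.argsort(X, axis=0), axis=0)  # rank of each gene within each sample
    sorted_X = np.sort(X, axis=0)                      # each sample sorted low -> high
    reference = sorted_X.mean(axis=1)                  # reference distribution across samples
    return reference[ranks]                            # map each rank back to a reference value

X = np.array([[5., 4., 3.],
              [2., 1., 4.],
              [3., 4., 6.],
              [4., 2., 8.]])
print(quantile_normalize(X))  # every column now has identical values, in its own order
```

After this transform, every sample has exactly the same distribution; only the ordering of genes within each sample, which is where the biology lives, differs.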
And what normalization does is scale all of these so that I can make comparisons. Really, in comparing any two samples, what I'm looking for are the things up here, the things that are very different from the average in the comparison between treated and control. So normalization is an incredibly valuable technique, because it makes the data across all our samples, or all our pairs of samples, look approximately the same, so I can actually do meaningful comparisons. And this is exactly the basis for the approach used with northern blots, where we scale the data appropriately so that we can make effective comparisons; it's just much more sophisticated in the way it's implemented and applied. There are many, many different methods for normalization, and all of them more or less attempt to do the same thing. RMA is widely accepted as a standard for microarrays. There's less consensus as to what works best for RNA-seq; there are lots of different methods out there, and there's still a lot of competition. And we constantly have to test our assumptions, even with normalization. I'm going to give you one example, and that's data from GTEx. My colleagues and I at Dana-Farber are working on a large project analyzing gene expression in GTEx. GTEx has 600 individuals, all of whom consented to provide autopsy samples; these are subjects who died in a variety of different ways. From each individual, genotype data is collected, and then gene expression data is collected from as many as 50 different sites across the body. From those data, what we wanted to do was make comparisons across all the different tissues, and so we had to come up with a normalization strategy. But what you can see in this animation, on the left-hand side, is that because a lot of genes are expressed in a tissue-specific fashion, we have a large tail of things that don't meet the assumption of quantile normalization: they're not expressed at the same level in all the different tissues. And in fact, if we're not careful with normalization, we can get swamped by the things that appear in the tail. So in normalizing these data so we can make comparisons, we actually had to invest a lot of time and effort into dealing with the stuff in the tail, so that at the end of the day we get distributions that are reasonably comparable to each other. We constantly have to test our assumptions, think about what this means, and be smart about how we do it in order to make these comparisons. Once we have the data normalized, what we want to ask is: what does the data tell us about the biological systems we're interested in studying? One of the most commonly used methods goes back to that early paper by Roland Somogyi and his colleagues: looking for clusters, or patterns, in the data. Some of the most commonly cited papers on clustering come from Mike Eisen and Pat Brown. And the reason those papers are so highly cited is that Mike did one thing that stands out, and it's something I tell all my students they should do: if you have a method that works well, provide software that's easy to use so people can apply it. We do clustering, I think, largely because Mike Eisen created a program called Cluster that did hierarchical clustering. So what's hierarchical clustering?
Well, it's grouping samples together, or genes together, based on shared patterns of expression across the experiment. So what I might imagine doing is having a set of samples down here, or a set of genes, and I want to group them based on which are most similar to each other in their patterns of expression across the experiment. To do this, I have to come up with some way of measuring the distance between them. When I draw them down here, I can draw them so that geometrically these two are closer to each other, but mathematically what I want to be able to do is determine which are actually most similar to each other in the experiment. The most common way of doing that is to use something like the Pearson correlation coefficient. We're looking for genes that have similar patterns over time or across samples; we want to find groups of genes that are similar in how they behave. So I can use the Pearson correlation, or some other measure, as a distance. Once I've decided how I'm going to measure distance, I calculate the distance between all of the things down here that I want to cluster. I can then look at all those pairwise distances and find the smallest one. Once I've done that, I group those two together and call them my first cluster. Then I have to take that cluster and calculate the distance between it and all the other genes. The distance between two and five hasn't changed, but now I have to come up with a way of calculating a distance between two and this new thing I've created. There are different ways of doing that, and I'll come back to them in a minute. But once I do that, I can recalculate all the distances, really the distance between this new cluster and everything else. I find the shortest distance and merge things; I find the shortest distance and merge things; and I repeat until everything is linked together. And then I draw a tree representing those results. So the process is to find the most similar things, link them together, and repeat until I get a full tree, where I have different clusters, different groups of samples linked together, that describe what I hope are relationships capturing the patterns in gene expression. Now, there are different methods for calculating the distance between my new clusters and everything else, and these are called linkage methods. The three most widely used are single linkage, average linkage, and complete linkage. In fact, single linkage is almost never used, and you'll see why in a minute, but average linkage and complete linkage are both widely used. In single linkage, I calculate the distance between any two clusters, or between a cluster and any gene, by taking the smallest distance between their members. For average linkage, I take the average distance. For complete linkage, I take the biggest distance. When I apply these to synthetic data sets, what I see is that the patterns of grouping are different, and they're most different between those two and single linkage. Single linkage, because it's always finding the smallest distance, does what's called add-one clustering: it builds a cluster and then adds the next thing, builds a cluster and then adds the next thing.
So you see lots of these very long branches. Average and complete linkage do a much better job of grouping things together in larger, more coherent groups. These are the two most widely used linkage methods, and in fact the default for most software tools is average linkage. There are other ways of clustering data. What we often do, if we have an idea about how many clusters there are, is use k-means clustering. Say I assume that in my data I may have five clusters of patients, or five clusters of genes. I'll take, in this case, my genes and randomly assign them to different clusters. Then, for each cluster, I calculate an average expression profile, using some measure of similarity like Pearson correlation. Then I test each gene, ask which cluster it's nearest to, and move each gene to its closest cluster. I do that for each and every gene; some things stay, some things move. And I repeat the process until the cluster assignments stop changing, okay? The advantage here is that I get well-defined clusters, but I have to know how many clusters I have ahead of time. In either case, these methods are very, very useful as a first step in exploratory data analysis, because they allow us to look for patterns. One of the first things we do with RNA-seq data, or with microarray data, is typically to run a hierarchical clustering and look for patterns that emerge. If all of the cases are on one side and all of the controls are on the other side of a big hierarchical clustering dendrogram, that's great. Though if they're that well separated at first pass, I might worry about how closely related the samples are, whether there are confounding effects, whether there's something else that might be driving those differences. And sometimes I can run a clustering and find that there's a group of samples from one hospital that just stands out from all the other hospitals. That tells me that before I do any other analyses, I should go back and check for systematic biases in how the data or the samples were collected. So it's a great exploratory technique, and it's a great technique for revealing patterns in the data. But ultimately what we want to do is find differentially expressed genes. And this is a place where statistics plays a big role. Differential expression is really based on making comparisons between populations; in this case, it's gene expression levels in one population compared to another. The most commonly used test for comparing two populations is the t-test. When I think about the t-test, what it really does is measure the signal, the difference in the average expression level in one population versus the other, relative to the noise, the spread of these distributions. So the statistic is the difference in the mean expression of gene one in population one and gene one in population two, divided by the standard error, okay? And what I actually want to do is find significant differences in expression level between the groups. Now, there are lots of different variations on t-statistics for making these assessments and for correcting for different expression levels. But the fundamental test is the one we all learned in first-year statistics: we're going to measure the signal relative to the noise, right?
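As a minimal sketch of that signal-to-noise calculation, here is a per-gene two-sample t-test on synthetic data, with a Benjamini-Hochberg correction for the fact that we're testing thousands of genes at once. Moderated statistics like limma's, discussed next, refine this basic logic but don't change it.

```python
# A minimal sketch of differential expression: a two-sample t-test
# per gene on a synthetic (genes x replicates) data set, followed by
# a Benjamini-Hochberg FDR adjustment across all genes.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n_genes = 1000
treated = rng.normal(size=(n_genes, 10))   # genes x replicates
control = rng.normal(size=(n_genes, 10))
treated[:50] += 2.0                         # 50 truly different genes

t, p = stats.ttest_ind(treated, control, axis=1)  # signal / noise, per gene

# Benjamini-Hochberg: sort p-values, scale by m/rank, enforce monotonicity.
order = np.argsort(p)
ranked = p[order] * n_genes / (np.arange(n_genes) + 1)
q_sorted = np.minimum.accumulate(ranked[::-1])[::-1]
q = q_sorted[np.argsort(order)]             # map q-values back to gene order
print("genes at 5% FDR:", np.sum(q < 0.05))
```

The multiple-testing correction is not optional: with 1,000 genes and no real differences at all, a raw 0.05 cutoff would hand you roughly 50 "significant" genes by chance.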
And so what we're looking for is a set of well-separated genes, where the expression levels are well separated between populations, not those where the distributions overlap substantially. This is going to give us our differentially expressed gene set, and it's what we're ultimately going to use to drive the analysis we want to do. As I mentioned, there are lots of variations on doing t-tests for gene expression. One of the most widely used methods was developed by Gordon Smyth and his colleagues at WEHI, the Walter and Eliza Hall Institute, in Australia. This is a method called limma. limma uses a moderated, empirical Bayes t-statistic as part of a linear model to take the samples I want to compare and do a very sophisticated analysis, looking for significant differences in gene expression. And again, there are lots of variations on this method and lots of different techniques, but what they ultimately give you, based on some assumptions in the model, is a set of genes that differ between conditions. So the next question is: what do we do with those genes? How do we interpret their biological function? Or, what do I do with the genes in this list? That's a question people ask me almost all the time. When I first arrived at Dana-Farber, I remember the first day I sat in my office, somebody walked in and said, oh, I ran a microarray experiment, somebody did a statistical comparison, I have this list of genes; what do they do? I had no idea what the experiment was, I had no idea what they were trying to test, but that fundamental question is the same question we all ask: what do these genes do? Or, what it often turns into: tell me a story, grandpa. Look at this list of genes and tell me something, anything, about the biology I'm trying to understand, okay? So an obvious way to gain biological insight is to assess those differentially expressed genes in terms of their function. Now, what a lot of people have done over the years is look at that list, get to about the 15th thing, and say, oh, I know something about that gene, so it must be important, right? But that's not really a rigorous way of determining what the list of genes is telling you. That's really grandpa telling you a story, right? Oh, back in the old days, I studied that gene, so it must be important. That's not biology; that's biopoetry. While we can tell these wonderful stories, what we'd really like to do is look at the patterns of expression. We need an objective and automated approach. So what we often rely on is functional profiling, or pathway analysis, and there are lots of different ways of doing this. The earliest were really a generalization of the tell-me-a-story-grandpa approach: people would take the differentially expressed genes, group them together based on their functions, and then look for differences in the representation of different functional sets. But there are some potential problems with that, and I'll illustrate them with a nice example, all right? So here's an analysis from, I think this is actually from a paper, I don't remember where it's from. Maybe it's just a toy example.
I created this many years ago, so I don't remember exactly where it's from. But what I can do is look at my genes, say I have 100 genes that are differentially expressed, and go to some functional classification system like the Gene Ontology, which groups genes together based on the functions carried out by their proteins. And I might discover that of my 100 differentially expressed genes, 40% of them are involved in immune response. Since that's the biggest category, I might guess that it's actually the most important category for telling me what's different between my two biological classes, right? That even seems like a reasonable conclusion, and I can tell you that for any category you find, you can spin a good yarn about why that category is important. But what you really want to know is whether that difference is more significant than you'd expect by chance. If I have 100 differentially expressed genes and 40% are immune genes, but on my array 40% of the genes are immune-associated, then 40% is exactly what I'd expect to see by chance. The fact that I see genes skewed toward some category doesn't necessarily mean that the category is differentially represented between my two phenotypic classes. So I need to consider not only the number of significant genes but also the background distribution, the representation on the array. What I can do is look not only at the observed number but at the number I expect based on what I actually assayed. For an array, this expectation is based on the set of genes I'm surveying; for RNA-seq, it's based on the representation of genes in the entire genome, right? Almost every experiment we do with RNA-seq, we see ribosomal genes. Guess what: ribosomal genes are one of the biggest classes in the genome, and they show up all the time, more or less by chance. Unless we see a really strong skew in the representation, we know we're really just sampling that background distribution. So what I'm really looking for are examples where I see something that exceeds what I'd expect by chance, and exceeds it by a significant amount. And if I'm looking for a biological explanation for the difference between my phenotypes, immune response probably isn't going to be it; these two other categories are more likely to reflect real biological differences. So what I want to do is look at these categories and really try to understand whether they tell me something about differential expression. There are lots of different tools for doing this, and this is one of my favorite areas to play around with different methods. There's a great tool called DAVID that was developed here at NCI Frederick; we use it all the time. DAVID uses a statistical test based on Fisher's exact test. Another widely used approach is gene set enrichment analysis, GSEA, which was developed at the Broad Institute. Ways of categorizing genes come from the Gene Ontology project, the KEGG pathway database, WikiPathways, Pathway Commons; there are lots of different ways of putting genes into categories. But all of the statistical tests rely on doing the same sort of thing: testing whether we get a greater representation in some category than we'd expect by chance.
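To make that chance-expectation argument concrete, here is a minimal sketch using the hypergeometric distribution, the one-sided form of the Fisher's exact test just mentioned. The counts are hypothetical and mirror the immune-response example above.

```python
# A minimal sketch of category over-representation: given a background
# of assayed genes, how surprising is it that k of my N differentially
# expressed genes land in one functional category?
from scipy.stats import hypergeom

M = 20000   # genes assayed (the background)
n = 8000    # genes in the category on the array (e.g., immune-associated, 40%)
N = 100     # differentially expressed genes
k = 40      # of those, how many fall in the category

# P(at least k category genes among N draws) under random sampling.
p = hypergeom.sf(k - 1, M, n, N)
print(f"enrichment p-value: {p:.3f}")   # ~0.5: 40/100 is what chance predicts
```

Rerun it with a category that covers only 5% of the background (n = 1000) and the same k = 40, and the p-value collapses toward zero; that's the kind of skew that actually tells you something.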
Oh, and one I almost forgot: MSigDB, which is used by the Broad. So we have lots of ways of looking for biological differences and putting those differences into context. And this really represents the state of the art. If you came to me wanting to do an experiment, what I would do is sit down with you, map out the experiment, and map out our analytical pipeline, going from differential expression to differentially expressed genes to functional pathways and categorizations, so we can begin to understand what those genes are doing and put them into some context. But one of the things we're really coming to recognize is that there are differences that aren't captured in a linear fashion, differences that represent the complex interactions that occur in biological systems, and that, in fact, biological systems are driven by networks. So at this stage, I'm actually going to indulge myself by telling you a little bit about some of the research we've been doing over the last few years to move beyond gene lists and gene functional categories, to see whether we can discover new processes by looking at biological networks. The question is: can we make this more complicated, and hopefully more informative? Now, there are lots of different ways of creating biological networks, and I'm going to give you my own sort of tongue-in-cheek bias about how not to do this, because I'm going to tell you about our methods, which are of course the best methods, and they're the best methods because they're ours. But I'll tell you why I think a lot of the other methods may not be the best. Every method is based on some assumptions. One widely used way of doing differential expression analysis and looking at pathways is to take two different conditions, two different phenotypes, and do a statistical comparison. We normalize our data, we do some version of a t-test, maybe limma, and we identify a set of differentially expressed genes. Then I can do a functional pathway enrichment analysis using DAVID from NCI Frederick, or GSEA, or some other method. Once I do that, I take that list and I want to interpret it in a new way. I want to see how these genes are linked together functionally and biologically in the cell. One of the things a lot of people rely on is protein-protein interaction networks. They'll take a protein-protein interaction network, which was measured in one experiment, and their differential expression from their own experiment, color the nodes based on whether they're up-regulated or down-regulated, and then start the process of biopoetry, telling stories. Now, interesting things have definitely been learned by doing this, but my problem is that I always wonder: should things that are differentially expressed actually be functionally connected? One of the hypotheses people in my group have is that different phenotypes have different networks, which manifest different interactions. I might imagine that in a tumor, these two things that are expressed in the same condition may actually not interact, because this one may have a mutation that prevents it from interacting with that one. Or something may not be expressed, or may be expressed as the wrong isoform, or may be phosphorylated, or something may happen that prevents the interactions that show up in these protein-protein interaction networks.
Coming back to those interaction networks: I don't know that the things that are differentially expressed should actually be connected, and I'm not always sure that the protein-protein interaction network is relevant to the situation I want to analyze. I won't tell you this is wrong, but it's something we shy away from, and I hope I've at least given you the reasons why. Another approach people often use is to start with the same comparison: they'll do a statistical analysis and get differentially expressed genes, but then they'll ask, are these genes correlated with each other across samples? When I did my clustering, I looked at correlation, but I can also look at pairwise correlations and join things together, or look at some other metric of similarity, like mutual information, and join things that are highly similar. So I can draw a network that looks something like this, where the edges, the connections, are not physical interactions as in the protein-protein interaction network; here the connections represent similarity, a high degree of correlation. Some methods look at all correlations; some are a little more selective and look at correlations between transcription factors and their targets, or between transcription factors and other genes. Either way, a lot of methods simply look at these correlations. Some methods scale the correlation: WGCNA is a widely used method that takes the correlation coefficient and raises it to some power, with the goal of really separating out the individual clusters or groups. But what all of these methods rely on is correlation, and what often happens is you color this network and you start to tell stories. My problem with this is that I'm not always sure that things that are correlated are actually functionally connected with each other, and I'm not sure that the correlations are the same in each individual phenotype. So the philosophy we've developed, and what I'll tell you about, is a method we've come up with that's based on the hypothesis I alluded to earlier: the network in phenotype one and the network in phenotype two should be different from each other. What I'm going to do is develop a method that produces a network for each phenotype. I have the same collection of genes, but I'm going to find different associations in the different phenotypes. Then, rather than doing the statistical comparison and node painting — we always do the statistical comparison and look at the genes and what they do, but rather than imposing that on top of a single network — I'm actually going to compare the networks. I'm going to look at their overall topology and structure, look for differences in connections, and then look at how those differences in connections relate to differences in expression, to try to understand them. Okay, so the real difference here is that rather than inferring a single network and trying to compare expression in the context of that network, I'm going to infer individual networks and compare their structures to see whether they're informative. My assumptions going into this are that there's no single right network; that each phenotype has its own individual best representation of its network; that the structure of the network matters; and that network structure can change between states. So I would argue that a chemo-sensitive and a chemo-resistant tumor have different networks that are active.
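To make the WGCNA-style scaling concrete, here's a small sketch of a soft-thresholded co-expression adjacency; the power of 6 is a common default in that literature, not a recommendation:

```python
# Sketch of a WGCNA-style co-expression adjacency: correlate every pair of
# genes across samples, then raise the correlation to a power ("soft
# thresholding") so weak links fade and tight clusters stand out.
import numpy as np

def coexpression_adjacency(expr: np.ndarray, power: int = 6) -> np.ndarray:
    """expr: genes x samples matrix; returns a genes x genes adjacency."""
    corr = np.corrcoef(expr)             # Pearson correlation between gene rows
    adjacency = np.abs(corr) ** power    # soft threshold: |r|^power
    np.fill_diagonal(adjacency, 0.0)     # no self-edges
    return adjacency

rng = np.random.default_rng(1)
adj = coexpression_adjacency(rng.normal(size=(50, 30)))   # 50 genes, 30 samples
```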
So we have to move away from asking, "Is this network right?" — and I put this up here in case any of you review my papers — because the real question we want to ask is, is this network useful? Or, more fundamentally, does this network inform our understanding of the biology we're studying? This is our defining question: can the network tell us something we couldn't have learned otherwise? To answer it, I'm going to give you one example of a method we've developed based on modeling gene regulatory networks. This is a method we published a couple of years ago, a method we call PANDA. The expansion of the acronym is down here; it doesn't matter. If you ever develop a method, one of the things you'll learn in bioinformatics is that it needs a cute name, and as we all know, there's nothing in the world cuter than a panda, thank you. So we have PANDA — it's a really cute method; that's why it's the best: it's cute. This was a lot of work done by my friend and colleague Kimberly Glass. PANDA is based on modeling the one thing I would claim we actually understand in biological systems, and that's how an RNA is made. RNA sequencing and DNA microarrays measure the levels of RNA transcripts. To make an RNA, I have to assemble an RNA polymerase complex on the DNA that runs off that RNA, and typically, for that to happen, I need transcription factors that bind in the surrounding region to assemble the complex and make the RNA. So one way to think about this — and I like very simple explanations — is that these transcription factors, which are encoded by other genes, are sending a message that says "turn on," right? Because, if you remember, one of the functions of genes is to regulate gene expression. So how do we model this process of a transcription factor sending a message to its downstream target? Well, as we thought about this, we did what all great scientists do: we stole a method somebody else developed. This is an idea from communications theory called message passing, or affinity propagation. It relies on having a transmitter and a receiver. The transmitter is going to send a message to its downstream target, and for that message to flow, we have to recognize that both parties must be actively engaged in the transmission of the information. The example I like to give for how this breaks down is that when I come home at the end of the day, I walk in the door, and before I take off my coat or put down my bag or take off my shoes, my wife immediately starts telling me about her day. I now realize it's because she loves me and misses me dearly, but I can tell you — gentlemen — in those first five minutes, I hear nothing, right? And later on she'll say, "Well, don't you remember when I told you?" When I first got married, I would say, "You never told me," which we all now understand is wrong — ladies, I am wrong. So if I didn't hear it, it doesn't mean she didn't tell me; it means I didn't listen. But the truth is, she never told me, okay? Without an external source of validation, we have no way of knowing whether she didn't tell me or I didn't hear. Now, I could say my son is in the room and could serve as a judge, but my son is now smart enough that he will throw me under the bus to curry favor with his mom.
So without a larger network to really validate what happened, we have no way of knowing where communication breaks down; but for communication to flow, both parties have to be actively engaged. What we can do is write down mathematical functions to estimate what we call these two properties, the responsibility and the availability, but what we really need is a much larger network in which we can iteratively estimate all of these functions. If I were to tell you all about this method, give you a quiz at the end, and measure how well we did based on that quiz — well, if I were to tell only Andy, behind closed doors where none of you heard, and only give the quiz to Andy, then if he did well, we'd all agree we did our jobs; if he did poorly, you'd have no way of knowing why. But if I gave everybody the quiz and you all did well except for Andy, the simplest conclusion is that you did your job, I did my job, and he's just a bitter disappointment to his parents, right? Not true — he would do really well, and his parents are very proud of him. So if we have a large enough network, we can iteratively estimate these functions. Now, one of the things we use to build networks is the fact that there's a lot of prior information. I'm going to assume I can guess what the network is: I know where transcription factors are likely to bind in the genome, so I can take transcription factor motifs, scan the entire genome, and construct a preliminary network. Then we have other sources of data: protein-protein interaction data, because we know proteins interact with each other, and gene expression data from each condition. So I'm going to take each edge of that network, estimate the responsibility and availability, update my model, and iterate the process until it converges. Once I have a final network, I'm going to use it as a way of comparing my different phenotypes. The PANDA paper describes all of that, along with some early applications. I'm going to give you one example that shows how this is at least in some way useful, and that's an application comparing two subtypes of ovarian cancer. This comes from a paper we published a couple of years ago in which, in a class discovery experiment with gene expression analysis, we identified two subtypes of ovarian cancer that differ in their angiogenic potential. Angiogenesis is the process by which tumors stimulate the development of blood vessels, and not surprisingly, patients with the angiogenic subtype, whose tumors are creating blood flow and feeding themselves, have a poor long-term prognosis. So we have two subtypes defined in that paper, and what we wanted to do was ask: are the networks different between these subtypes in meaningful ways? Rather than using the data sets we used to derive the subtypes, we used the biggest available data set, the one from The Cancer Genome Atlas. We assigned samples to the angiogenic and non-angiogenic subtypes, used PANDA to build an individual network for each subtype, and then compared them. For me, this was a really informative way of looking at these networks, because what we found were differences in the connections.
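To give you a feel for the iteration — and this is a deliberately loose sketch, not the published PANDA equations (the real method uses a form of Tanimoto similarity on standardized matrices; see the paper for the details) — the flavor is something like this:

```python
# A deliberately simplified sketch of PANDA-style message passing.
# W: TF x gene motif prior; P: TF x TF protein interactions; C: gene x gene
# co-expression. Each round nudges W toward agreement with both P and C.
import numpy as np

def panda_sketch(W, P, C, alpha=0.1, n_iter=50):
    for _ in range(n_iter):
        R = P @ W                      # "responsibility": support via TF partners
        A = W @ C                      # "availability": support via co-expressed targets
        R /= np.abs(R).max() + 1e-12   # crude rescaling, just for this sketch
        A /= np.abs(A).max() + 1e-12
        W = (1 - alpha) * W + alpha * (R + A) / 2.0   # damped update toward consensus
    return W

rng = np.random.default_rng(2)
W0 = (rng.random((10, 200)) < 0.1).astype(float)   # motif prior: 10 TFs x 200 genes
P = np.eye(10)                                     # placeholder PPI (no partners known)
C = np.corrcoef(rng.normal(size=(200, 30)))        # co-expression from one phenotype
W_subtype1 = panda_sketch(W0, P, C)                # repeat with the other subtype's data
```

The key design point is that the motif prior is shared, while the expression data differ by phenotype, so each phenotype gets its own inferred network.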
In looking at those differences in connections, what we really came to recognize is that the atom of the network is not the set of genes but the edges, the intangible connections that differ between the two subtypes. So we ran PANDA, we compared the networks, and we picked out the ten transcription factors that had the biggest changes in their connections. This is a representation of the regulatory networks surrounding those ten transcription factors: the red edges are the angiogenic subtype, the blue are the non-angiogenic — here's the red, here's the blue. Study this carefully; it will be on the quiz I alluded to earlier, and you'll have to draw these and tell me what the differences are. But if you look at them, you can actually see differences in the networks, and those differences are what we want to capture and understand. So we spent a lot of time looking at what the individual differences were. And then Kimberly Glass came up with what I think of as a creative way of representing the data. She calls it the donut, because there's a hole in the middle; I think of it in terms of a toy I had as a kid, the Spirograph. What we have is this pattern of connections. And what you can see — I apologize that you can't read the names of the transcription factors on this projector — is that there are some genes out here that are regulated by these transcription factors in very simple ways, and some in which the pattern of regulation is very complex, with multiple transcription factors appearing to play a role. So we asked ourselves: what are the differences in connectivity between these transcription factors? And what we discovered was that many transcription factors appeared to co-regulate genes at a frequency far higher than we'd expect by chance. We thought that was really interesting, so we went back to the literature to see whether there was some known association between these transcription factors. What we found is that the pairs in this list actually represent transcription factors that are known to form regulatory complexes, and that those regulatory complexes are associated with angiogenesis. There's even information in the literature about interfering with the dimerization of these transcription factors — or, in the case of the SOX5 complex, about changing methylation — in a way that disrupts the regulatory process and prevents angiogenesis. For us, this was a really interesting validation, because by comparing networks, by looking at the characteristics of a presumptive regulatory process, the analysis was able to find differences in that process, identify meaningful associations, and point to potential therapeutic interventions. So for us, this really provided grounding that we have ways of looking at gene expression beyond the individual genes: we can start to think about networks. The jury's still out on what the best network method is, but there are lots of them out there, and as you start to analyze data, it's very useful to look at these methods to see whether there are more complex patterns of association, beyond the simple linear correlations that come out of clustering, that can inform our understanding of the systems we study. So there's a paper describing this application to angiogenesis.
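Before I leave this example: the comparison step itself can be quite simple. Here's a sketch of how you might rank transcription factors by how much their edges change between two inferred networks — the names and scoring are illustrative, not how we did it in the paper:

```python
# Sketch of the comparison step: difference the edge weights of the two
# subtype networks and rank TFs by total change. Names are illustrative.
import numpy as np

def top_changed_tfs(w_angio, w_nonangio, tf_names, n_top=10):
    """w_*: TF x gene edge-weight matrices from the two inferred networks."""
    change = np.abs(w_angio - w_nonangio).sum(axis=1)   # total edge change per TF
    order = np.argsort(change)[::-1][:n_top]            # biggest movers first
    return [(tf_names[i], float(change[i])) for i in order]
```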
There's also a paper describing an application of this kind of comparison to sexual dimorphism in chronic obstructive pulmonary disease. We have papers describing PANDA applied to asthma, to children who do or don't respond to corticosteroids, to different dietary interventions, and to different subtypes of breast cancer. So we're really starting to get a lot of mileage out of comparing these networks. There are a lot more questions about networks than I can answer here — we're almost at 90 minutes — and if I had more time, I would tell you more about those methods. But for today, I hope I've convinced you that there are at least some interesting things we can start to look at once we have expression data, things that allow us not only to rediscover known biology, known pathways, but potentially to find new biology: new signaling cascades, new associations between sets of genes that may drive the biological phenotypes we're interested in. At the end of the day, the goal of an experiment is to discover new biology. The challenge is sorting through lots and lots of data. Comparing groups of samples, though, really requires that we understand what those groups are, and that takes us back to the underlying experimental design when we begin to think about what we're going to do in a gene expression experiment. Making sense of the data we generate requires a good analytic plan, one that goes back to our experimental design and experimental plan. But the great thing is that we're just awash in data, so we have lots of data sets we can begin to explore in new and interesting ways. I mentioned GTEx early on; what we've been doing is building gene regulatory networks in each and every tissue, and what we're finding is that the differences between tissues are fascinating. We've also been building gene regulatory networks comparing males and females in each tissue, and what we're finding there is fascinating too. Some tissues are very similar between men and women; some are very different. Some of them won't surprise you, like breast: male breast and female breast are completely different. Some of them are surprising, like subcutaneous fat — which may explain why you and your spouse argue over whether to open the window at night. Subcutaneous fat is very different between men and women, and by looking at these networks, we're starting to understand some of the mechanisms why. So, I like quotes. I opened with a quote from me; now I'll close with quotes from real people. This is one of my favorites, from William Gibson, the science fiction writer: "The future is here. It's just not widely distributed yet." Thanks to YouTube, this talk is now broadly distributed. My PhD is actually in theoretical physics, and as you now know from listening to me, theoretical physicists are the smartest people on earth. So this last one is from Enrico Fermi, a physicist. Fermi said, "Before I came here, I was confused about this subject. After listening to your lecture, I am still confused, but at a higher level." So I hope I've raised your level of confusion, and I want to thank you for inviting me here today.