So I want to talk to you about what we know about optimal pre-processing methods and how we know it, talk very quickly about the cases where microarrays really matter a lot and highlight some of the key issues in each of those, and then quickly provide some code overview and comments for you guys moving forward. So a core question that we talked about yesterday but didn't really address in detail is: what is the right way to process your data set? This is clearly an important challenge, but we still don't have an answer for how you know what the right way is. The expresso function lets you choose among very large sets of pre-processing methods. These are all the options enabled by default in expresso: three background correction methods, seven normalization methods, three PM correction methods, and five summarization methods. You quickly do the math in your head and find out that there are 315 ways of processing your data using the affy package alone, and the affy package is not the only way of doing it. I do not recommend that people process their data 315 different ways; that's a lot of work.
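To make those choices concrete, here is a minimal sketch of driving expresso by hand. It assumes CEL files sitting in the current working directory; the four method lists are the expresso defaults being counted above.

```r
library(affy)

# Load the CEL files in the working directory into an AffyBatch
raw <- ReadAffy()

# The four pre-processing choices expresso exposes by default:
# 3 x 7 x 3 x 5 = 315 possible combinations.
bg.opts   <- c("mas", "none", "rma")
norm.opts <- c("constant", "contrasts", "invariantset", "loess",
               "qspline", "quantiles", "quantiles.robust")
pm.opts   <- c("mas", "pmonly", "subtractmm")
sum.opts  <- c("avgdiff", "liwong", "mas", "medianpolish", "playerout")

# One combination; this particular set of choices reproduces standard RMA.
eset <- expresso(raw,
                 bgcorrect.method = "rma",
                 normalize.method = "quantiles",
                 pmcorrect.method = "pmonly",
                 summary.method   = "medianpolish")
```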
But on the other hand, how do you even evaluate that? How do you establish that one technique is better than another? There are two approaches that are prevalent in the literature. One uses what we call defined data sets. A defined data set is entirely synthetic, entirely constructed. Somebody will create a tube, and in that tube they will put X proportion of molecule A, two times that of molecule B, and one third of that of molecule C, and of course they'll do this using robotics. So they will create an entirely synthetic sample of RNA. That, of course, means you know exactly what is in that synthetic sample, and you can move forward readily. Alternatively, you could use something like real-time PCR or NanoString as the gold standard, as something that provides an absolute truth. That's a great approach too, because at least it's likely to represent what people would actually do downstream of a microarray study. But all of these technologies have their own weaknesses. So both approaches are in the literature, and both have been tried a number of times. We'll focus on one particular study built around a wholly defined spike-in data set; in other words, a data set where they created the entire thing synthetically, from nucleotides they synthesized themselves.

So: "Preferred analysis methods for Affymetrix GeneChips, II." Wait, why is this part two? Good question. It's part two because way back in 2005, the same group published "Preferred analysis methods for Affymetrix GeneChips revealed by a wholly defined control dataset." They varied the wording from wholly defined to wholly spiked-in and so forth, but they basically tried to do this five years earlier. That initial paper was controversial. I was a PhD student when it came out, and everybody who did microarray analysis looked at the paper and found a different thing wrong with it. There were a lot of challenges with it. Just to give you an idea of what was said in public, never mind what was said in private, here are some comments of a kind you will rarely see in print. First, that the spike-in concentrations are unrealistically high: "We demonstrate that background noise makes it harder to identify differentially expressed genes at low concentrations. We point out that the concentrations for spiked-in features result in artificially high intensities." Second, that a large percentage of genes are differentially expressed: "This design makes the spike-in very different from that used in many experiments." And the people writing this were Rafael Irizarry, who created RMA; Leslie Cope, who wrote the affy package; and Zhijin Wu, who created GCRMA. So people who really, really know what they're talking about, being fairly robustly rude about it. But a better one is the rebuttal from John Storey. John Storey created the false discovery rate adjustment; if there were a Nobel Prize for statistics, he would win it. "Unfortunately, serious errors are evident in the Choe et al. data, disproving their conclusions and implying that the dataset cannot be used to validly evaluate statistical inference methods." That is not a gentle thing to say in a public rebuttal to a paper. And at conferences, much more strident things were said about it.

So there were a couple of things that they did wrong in the first round. And five years later, interestingly enough, none of the authors on the first paper are present on the second paper, except for the senior author. All of the people involved in the first paper said, ah, I'm not going to try this again; but the PI said, no, I'm confident we can do better. So they took five years and came up with what they called round two, with a couple of different design choices intended to make the study robust and to address the earlier problems. I'm not going to go through the problems with the first study, because there are a lot; it's actually almost easier to say what they did right than what they did wrong. They had real challenges with an asymmetrical design, with biases shared between samples so that they didn't have proper independent replicates, with concentrations that were aphysiological, with mis-hybridized arrays that saturated signal intensities where they shouldn't have, and about a billion other problems. So their second attempt was much, much more rigorous. The first thing they did was create a block design. By that I mean they split the experiment into two groups. One group had technical replicates, where each individual sample was hybridized in triplicate, so sample E1 would be run three times. A technical replicate is where you take the same aliquot from a tube and repeat it on the microarray, so the only source of noise should be microarray-dependent noise, that is, hybridization differences or manufacturing variation. The other group is closer to biological replicates: the difference between me making a sample today and making the same sample again tomorrow. Cell lines might drift over the course of a couple of days, so there would be differences. In both groups they also included spike-ins, these are Drosophila sequences, I believe, that were either present or absent. And they know exactly whether a gene is present or absent because they put it there, so there's a gold standard of whether or not each gene is there. And you'll see the analysis methods ranged from some that were basically random chance up to some right up near one, with most methods getting accuracies of about 75 or 80% at even telling you whether a gene is present.
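Because the truth is known for every spiked-in gene, scoring any one pipeline against that gold standard is mechanical: rank genes by the pipeline's differential-expression statistic and ask how much sensitivity you get at a tolerable false positive rate. A minimal sketch, where `scores` and `truth` are hypothetical inputs (a per-gene statistic, and the known spike-in labels):

```r
# Sensitivity at a fixed false positive rate, given gold-standard labels.
# scores: per-gene differential-expression statistic (larger |value| = more DE)
# truth:  logical, TRUE for genes that really were differentially spiked in
sensitivity_at_fpr <- function(scores, truth, fpr = 0.05) {
  ord   <- order(abs(scores), decreasing = TRUE)  # strongest evidence first
  truth <- truth[ord]
  fp <- cumsum(!truth) / sum(!truth)  # running false positive rate
  tp <- cumsum(truth)  / sum(truth)   # running sensitivity
  if (!any(fp <= fpr)) return(0)      # no cutoff achieves this FPR
  max(tp[fp <= fpr])                  # best sensitivity while FPR <= cutoff
}
```

Running a score like this over each pre-processing combination is, in spirit, what the paper's comparisons boil down to.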
The first thing that they showed is how much of the spread between methods is driven by background correction. If you look at the background correction methods tested, and these are different replicates and different data points within them, a lot of the variation is caused by choices in background correction. It's fascinating to see where RMA, the most widely used method, sits when looking at differential gene expression. They produce these plots, and the way I would phrase it, after reading this paper four or five times, is that they are a little bit ad hoc, not very intuitive ways of expressing what they mean. And indeed there has been a lot of controversy about the second paper over how the results ought to have been displayed. The plot is asking: at a given percentile of accuracy, what fraction of methods and combinations reach that accuracy? The purple, red, and blue lines are interesting. The purple is loess normalization, which is absolutely standard for two-color arrays; then quantile normalization, which is what RMA uses; and VSN, which again comes from the two-color world. So there's a lot of evidence here that there may be methods that were invented for other microarray platforms, but never broadly applied to Affymetrix arrays, that would be of use.

The conclusions from the study are kind of interesting. They claim that most commonly used methods perform strongly and that there is no single best way to analyze Affymetrix microarray data. But the performance is not great: the best methods reach an 85%-ish sensitivity at a 5% false positive rate. In other words, the best methods are missing 15% of the hits, 15% of differentially expressed genes. We had a couple of questions about merging multiple methods together and only taking the common hits; these data actually argue in the other direction. They say that we have a much bigger problem with missing things than with false positives, and that at any particular cutoff or method we choose, we are going to miss more hits than we are going to gain false positives. So these conclusions are a bit interesting, because even in the best study in the field, clearly a lot more work is needed. Their basic conclusion is that it doesn't really matter what you use, that all of the methods are about equally good, and that doesn't quite accord with what we've seen; we've seen that there are big differences. Their argument would be that you are missing a different 15% and making a different 5% of false positives, and that compounds to a 20% difference in the hits you're going to get. That's basically the source of all your problems, and that's reasonable, but if you give me two gene lists that differ by 20%, I'm going to be a little bit worried about the results. And if you start to do downstream analyses, like pathway analyses, 20% is enough to completely change them. So their study has some clear merit to it, and some truth to it, but it really underscores the need we have as a community to do better work in this field. And as microarrays are starting to be seen as clinical diagnostic tools, there's an increasing need for studies like this that will let us essentially say: this is the way in which we should be analyzing these data. That's something that's going to need to be firmly decided and established, especially for a diagnostic, where it needs to be locked in and not changed from patient to patient.

Optimal pre-processing methods may not exist. There is a huge amount of work in bioinformatics now asking whether we can merge multiple methods together. Basically the question is: if we analyze our data using 10 different techniques, can we get a better result if we merge or integrate them? The basic strategy is to use machine learning methods to integrate the results of different pre-processing methods. There was a paper from my group last year, and there's a paper that I've reviewed that's in press at a good journal right now, that do the same thing for different kinds of applications. And there's a lot of reason to believe that if we do it cleverly, we can integrate multiple techniques beneficially, and then instead of having to choose, we just use all of them. The right way to do that remains completely up in the air, but it's likely to change the way we think about processing arrays in the next three or four years.
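To give a flavor of what "use all of them" could mean, here is a deliberately naive integration sketch: rank genes under each pre-processing pipeline and average the ranks. The real work in this area uses proper machine-learning integration, so treat this as the simplest possible stand-in; the `stats` input, a genes-by-pipelines matrix of differential-expression statistics, is a hypothetical placeholder.

```r
# Naive integration across pre-processing pipelines by rank aggregation.
# stats: matrix with one row per gene and one column per pipeline, holding
#        a differential-expression statistic (larger |value| = more DE).
aggregate_pipelines <- function(stats) {
  ranks <- apply(-abs(stats), 2, rank)  # rank 1 = strongest within a pipeline
  rowMeans(ranks)                       # small mean rank = consistently top
}

# Genes with the smallest aggregated rank are supported by every pipeline,
# rather than by one particular pre-processing choice:
# head(sort(aggregate_pipelines(stats)))
```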
Any questions about pre-processing? The last couple of things I want to point out: microarrays, of course, are big in the area of expression analysis, and that usage is diminishing fairly rapidly; it's something I think we're going to see people doing less of in three or four years. But there are a number of areas where I actually expect microarray usage to grow. The first one is QA/QC for sequencing studies. I'll give you a good example of this: every time my lab does a sequencing study, we run a microarray to make sure that our results are reliable. It's cheaper as a quality control metric than anything else. Arrays also provide very affordable copy number profiling. There was a paper in Cell about two months ago which did whole genome sequencing of about 60 prostate cancers. With whole genome sequencing you can call copy number variants directly off the sequencing data. This was a very, very top bioinformatics group at the Broad, and yet they did not use their whole genome sequencing data to call copy number; they believed that the cheap Affymetrix SNP 6.0 array that they used was more reliable. So from their perspective: spend a few hundred dollars on an array, get more reliable data, and not have to worry about how to do the analysis or how to run it through the pipelines, because it's all much more established. There are a lot of reasons why groups are continuing to use arrays for copy number profiling. Similarly, for one of my large projects where we're doing 500 RNA-seq experiments, we're running a matched Affymetrix expression array with every single one, thinking that it's good QA/QC, requires less RNA, and is very affordable next to the RNA-seq. Another application that's growing very quickly is custom SNV chips. Illumina especially has the ability to create, I think it's a 40,000- or 50,000-feature chip, for about $100 to $150. You can put whatever SNPs you want on that chip and then interrogate a set of samples with it. And because it's $100, you can now start using it in routine clinical diagnostics. PMH, as a hospital, probably sees about 10,000 patients a year; $100 times 10,000 patients is, if I can do my math, a million dollars. So if you were able to come up with a set of 20,000 or 30,000 SNPs that you wanted to profile on every patient, you could now do it for an entire hospital for a million dollars a year. A million dollars gets you something like a hundred whole genome sequences. So there's a big advantage to being able to do these massively parallel experiments looking at lots and lots of patients. And with 10,000 patients, you can start doing some really interesting downstream statistics.
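As a sanity check on that back-of-envelope math (the per-unit prices are the talk's round numbers, not quotes):

```r
chip_cost  <- 100                    # dollars per custom SNP chip
patients   <- 10000                  # patients per year at a hospital like PMH
chip_total <- chip_cost * patients   # = 1e6: a million dollars a year

wgs_cost   <- 10000                  # rough dollars per whole genome
chip_total / wgs_cost                # = 100: about a hundred whole genomes
```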
And lastly, for methylation studies, microarrays remain the standard. They are dramatically cheaper than sequencing methods, and here I mean by an order of 15 or 20 fold, still. They provide the same quality of data, at least today. It's likely that advances in sequencing will eventually eliminate that gap, but I imagine methylation is one of the technologies that will remain array-based for a long, long time. Did anybody have any questions about the integration of arrays and sequencing, or how those two things might fit together?

[Audience question about discordant results between arrays and RNA-seq.] Yeah, so two things. One, that comes down to validation. And two, I would trust the array over the RNA-seq for most things. For example, if a gene was going up by array and down by RNA-seq, I would immediately think to myself: the Affymetrix array experiments in my lab validate 95-plus percent of our hits; the RNA-seq experiments at my institute validate 80 percent. So if there's an error or a discordance, I think it's much more likely to be the RNA-seq. Now, there are things you find by RNA-seq that you can't find by array: fusion transcripts, weird splice variants, alternative start sites, point mutations, all those things you can't get from an array. But for absolute abundances, especially for low-intensity genes, the arrays are much, much more accurate. So that would be the general approach, but if I really cared about that hit, I would start by validating it using real-time PCR or NanoString or something.

I'll give another example of what we use arrays for. In standard whole genome sequencing studies, you can use the array to genotype a million locations for an individual, use that to update the genome and give an improved estimate of what that patient's genome looks like, and then do your whole genome sequencing relative to that improved estimate and improve the accuracy of the alignments in your whole genome sequencing pipeline. That's one good trick. Another one is to identify sample mixups using the array before you bother spending $10,000 on sequencing; a sketch of that check follows below. And the third one is getting an estimate of the true positive and true negative rates of your sequencing relative to the array. So there are a lot of things you can do with that. Any questions about any of these applications?
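The mixup check mentioned above is nearly trivial once genotypes have been called on both platforms. A minimal sketch, where `array_gt` and `seq_gt` are hypothetical vectors of 0/1/2 allele-count calls at the same SNPs for the same nominal sample:

```r
# Fraction of SNPs at which array and sequencing genotypes agree.
# 0/1/2 = copies of the alternate allele; NA = no call on that platform.
genotype_concordance <- function(array_gt, seq_gt) {
  called <- !is.na(array_gt) & !is.na(seq_gt)   # compare called sites only
  mean(array_gt[called] == seq_gt[called])
}

# The same nominal sample should agree at the vast majority of sites;
# a swapped sample drops toward the concordance expected between
# unrelated individuals, which is far lower.
```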
So let me go over the basic things that I want you guys to remember. The basic process for microarray data is to load it, pre-process it, do QA/QC, and do the statistical analysis; a minimal skeleton of that pipeline is sketched at the end of this section. If you keep that in mind and go from step to step, you'll be able to get through almost any of the analyses we've looked at. There are little tricks and tips in how different things work, but that core will take you most of the way.

And I want you to remember three things. First, pre-processing is very hard, and nobody knows what the right way to do it is. If somebody says to you, "How did you pre-process your data? You used X? Oh, that can't be right," then that person probably doesn't know what they're talking about. It's much more likely that they've always used one method because it worked for them a few times, and so on. In fact, we have data sets in my lab where one pre-processing method will generate a high validation rate and another will generate a low validation rate, and with experience and attention to the characteristics of the data, you can start to get an idea of which it will be. Getting that experience just means seeing lots of data sets, analyzing them, and building up some familiarity with the characteristics of good and bad data.

The second thing is that nothing else really matters if you've designed your experiment incorrectly. Find a statistician and talk to them over a beer about your statistical design; that is incredibly worthwhile. One of the numbers on speed dial on my phone is a statistician I worked with during my PhD, and every once in a while I'll just call her up: "So, Melanie, I have a stupid question for you. If I have X and Y, can I do Z?" And usually those turn out to be stupid questions. But every once in a while the answer is, huh, I have no idea. Then she goes and calls somebody on her own speed dial, and eventually an answer comes back to me along the lines of: well, four or five statisticians talked about it, and we think this is an unresolved problem, so why don't you try it and let us know what happens? And that's really interesting, because now there's an opportunity to improve statistical analysis, and it creates communication between the two disciplines. So finding a statistician who can be a kind of consultant or advisor, even if they're not going to be deeply involved in your studies, is extremely valuable. Also, when you decide to do a statistical test, think really carefully about what you're asking. What is the biological question? If you can't precisely define the biological question, you'll never pick the statistical test to match it. And once you have that, ask yourself: I have the statistical test, but what are its assumptions, and are they met? The time spent thinking up front about these issues is extraordinarily worthwhile for the quality of your analysis and your results.

And lastly, I told you on the first day: if you forget everything else I've said over these last two days, remember that microarray analysis is a pipeline, and you have to complete one step before the next. You can't skip steps, and you can't bounce backwards and forwards. You have to say, I'm here and I'm going there next, and get one step finished properly before moving on. The single biggest mistake I see people make is skipping a step: saying, oh, I've got something, I really want to know what the genes are, then spotting a gene and going, oh, this gene is really interesting, then telling the supervisor, and the supervisor gets excited, and the next thing you know there's a follow-up experiment, when the QA/QC wasn't done properly and they're going to have to exclude the three arrays that were driving that entire outlier result. I've seen that happen over and over again. It's important that you always go through one step at a time, and no matter how much your supervisor or a collaborator is saying, can we see the results, can we see the results, you say: no, we have no results yet; I'm busy making sure the data are the right quality. I often use lines like: I can show you the results, but they're going to be wrong. Do you want to see the right results, or do you want to see things that are going to be wrong? And you know, that's a little bit mean, but it's not incorrect. You want to give people results that are accurate, and so it's worth taking the time to do those upfront steps of the pipeline properly.
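As a closing recap, here is a minimal skeleton of that load, pre-process, QA/QC, statistics pipeline. The six-array, two-group design and the limma-based test are illustrative assumptions, not a prescription:

```r
library(affy)    # loading and pre-processing
library(limma)   # statistics

raw  <- ReadAffy()       # 1. load: CEL files from the working directory
eset <- rma(raw)         # 2. pre-process: RMA is one choice of the 315

boxplot(exprs(eset))     # 3. QA/QC: per-array intensity distributions
plotMDS(exprs(eset))     #    and sample clustering; hunt for outlier arrays

# 4. statistics: assumes six arrays in two groups of three (hypothetical)
group  <- factor(c("ctrl", "ctrl", "ctrl", "trt", "trt", "trt"))
design <- model.matrix(~ group)
fit    <- eBayes(lmFit(eset, design))
topTable(fit, coef = 2)  # top differentially expressed probe sets
```

And in the spirit of the last point above: finish each step, and actually look at the QC plots, before you let anybody near topTable.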