So, basically, we're going to talk about microarrays because they provide the fundamentals of how pipelines are thought of and developed for analyzing any type of omics data, actually even beyond that, any type of data, period. It's not dissimilar from how you think about working with imaging data. Essentially, we're going to start off by taking a look at a couple of different features. First, we're going to ask: if we do expression profiling in cancer genomics with a microarray, what are we supposed to be thinking about? What kinds of factors should make you go, I need to know this about the experiment, I need to understand this about the data set itself, I need to understand this about picking the methods that I'm going to use for analysis? And when I get the results, what are the key things that should make you go, hmm, do I trust that? So we'll talk about that theoretically for the first two hours and 20 minutes. And then we'll switch to going through an actual microarray experiment, a published experiment small enough that it can be run easily and locally, and go through all of the things that go wrong in it and give you a chance to see what happens. The experiment was picked fairly carefully so that there are significant things that can go wrong pretty easily. That's intentional, so you can see some of the errors and things that will happen. And at the end, I'll post a full analysis script that shows working through exactly how you would come up with a standard analysis of a microarray study. So, any questions on that before we kick in? All right, don't hesitate to ask questions as we go through, or just stick up a red card or a yellow card: yellow if you think I'm being nice, red if I'm being mean. So what do I want you to learn by the end of this? There are a couple of easy things: understand the types of microarrays and where the noise is in a microarray study. But most important is point number three. If you forget everything else but you understand the pipeline by which we analyze microarray data, the way that there are sequential sets of algorithms that interplay with one another, I will be a happy person. The last three points have to do with how you would actually do this in practice: how you would input your raw data, how you would pre-process it and understand how to run a series of different pre-processing methods, and how you do statistical analysis of it. So let's start off with a question. And this is the question I'm going to ask you in a second. Let me pre-empt it with a couple of questions for the audience as a whole. How many of you have analyzed microarray data before? Yep, so about a third. How many of you anticipate analyzing microarray data in the next two years? So about everybody, so that's good. And is there anybody who's going to be doing, like, a ChIP-chip microarray experiment? So, one person. Anybody who's going to do a protein microarray, a methylation microarray? So what you're capturing there is that almost everybody does microarrays, but essentially all of those microarrays turn out to be mRNA expression arrays. They are the most sold, most utilized genomic technique that we have. There are products that have sold millions of individual assays, which have literally been applied to millions of individuals, mice, and rats, and there's a huge amount of data available there. So what do they measure?
Anybody who knows about this technology, and all of you know enough that you think you're going to be using it: what does an mRNA expression microarray measure? So, a number of transcripts. Go ahead. A level of transcripts relative to a control of some kind. So, the level of transcripts relative to a control. Does anybody want to supplement that? So is it just fluorescence? Good. Anybody else? Alternative splicing. Which alternative splice form gets measured by the array? The probe could be right on the junction. So we're already getting fairly complex, and you could immediately say it's not as simple as you'd like. It sort of sounds like it should measure how much mRNA is there in the cell. But we definitely don't get that, because we have many cells in the population and they have some sort of a distribution of RNA levels. So we're getting the center of a distribution or the sum of the distribution, depending on what you do. And it's only relative mRNA levels. Essentially no microarray gives good absolute quantitation. This is probably the biggest advantage of RNA-seq over microarrays: it theoretically gives absolute quantitation. You may then go into many RNA-seq papers and ask if they generate absolute quantitation and take advantage of that data, and it's not common. But it is doable, it is sometimes done, and it cannot be done at all with arrays. So it's a really clear differentiating point. Which genes are on a microarray? Well, actually it's only a subset. Typical microarray platforms will have 20 to 30,000 genes represented. There are only about 22,000 genes in the genome, so the other 8,000 might be splice variants or specific isoforms, sometimes fusion proteins, sometimes just different regions of the same gene. And so essentially you're getting some relative fluorescence quantity that is proportional to mRNA levels averaged across a population of cells for specific target regions of a single gene. That sounds fairly limited, and to some extent it is, but it represents enough of the important biology that you can do really, really interesting things with it. So we're going to start off talking about what microarrays are in a little bit more detail. And if you are going to work with any technology, you kind of have to start here. What is it? How does it work? What are the error characteristics that might come up in the way that the experiments are run? That will guide you to start thinking about how you would analyze the data, because the data analysis pipeline should be intimately linked with the experimental pipeline. It should reflect steps that go on experimentally that do or don't introduce different types of noise. So this is a reasonable Wikipedia-style definition of a microarray: a multiplex technology consisting of thousands of oligonucleotide spots, each containing picomoles of a specific sequence. That's a reasonable definition. What you actually do with that is typically to quantitate some sort of nucleic acid. Typically DNA, that's the most common usage, and I said these are mRNA expression arrays, so you can immediately guess there's a reverse transcription step in pretty wide use. But you can do it to directly measure RNA, and that leads to a whole slew of interesting applications. So we typically run microarrays in what's called hypothesis-generating mode. They're treated as an unbiased experiment that's going to give you a survey of what possible interesting things are present in the universe. That's not the only way to do it.
There are clearly exceptions to that, but the majority of experiments that you'll read that have a large microarray screen are really looking for the interesting gene that I'm going to follow up with additional experimental biology. And that sometimes leads to the impression that, oh, because it's a screen I don't have to worry too much about the experimental design. Actually, it generally goes in the exact opposite direction. Your screen is the first part of a long experimental pipeline, so if you screw up the experimental design here, you can kiss the next five or six years goodbye. So that means that you have to be very careful in selecting sample number. You also have to be very careful because these studies are naturally expensive, and that's true for sequencing studies as well. We spend a lot of money on them, which means that we really don't have excess statistical power. You don't go and say, well, I could probably get away with 400 whole-genome sequences, but we're going to do a thousand just to be safe. I just described 6 million bucks of sequencing or something, and you just don't do that. And so as a result, the studies are even more sensitive inherently, because we tend to barely power them. They're just at the threshold of finding what's interesting, so if you mess up the experimental design, you've got a problem. And there are a lot of aspects of experimental design that I'll touch on over the course of the day. I'll give you one really easy one to think about: the nature of your input samples. Essentially what you're going to do in any array experiment is start off with a sample and extract RNA directly. The RNA may later be transformed in some way for hybridization, but it's going to be RNA. Now, you would think, why would anybody do something silly like compare total RNA preps to poly-A RNA preps to something else? It's really, really common, and here's an example. I'm studying a rare type of cancer, anaplastic thyroid cancer, and I say, geez, this has got an incidence of 1 or 2 a year in Canada and there are 4 of these data sets published, and I'm going to just take those 4 data sets and integrate them with mine because that will increase my statistical power. Then you do your statistics, and guess what your subgroups are: you have a sudden outlier group because the samples were prepared inherently differently. You could also say, I'm going to be looking at frozen versus FFPE-fixed samples. And why would I do that? Again, a similar example: you might be looking at a rare tumor type, it might be a subtype of breast cancer, and you've got lots of clinical specimens with 15 years of follow-up that are FFPE-fixed, that is, the standard pathology fixation format that you can get in any hospital, and then you need control samples. So you go ahead and say, where could I get those? Ah, control breast: there's a reduction mammoplasty clinic where we can get fresh or frozen tissue. And I've seen that exact experiment happen, and it looks like there are huge differences in gene expression, huge being 15,000 of 20,000 genes in the genome are altered. It's not real, of course; it's because of the sample prep. So the first thing you have to do is think about this kind of issue with your sample prep. Once you get past that, then you're going to start off saying, I have a microarray, and if I have a microarray and I'm trying to process it, what is that actually going to look like?
So this is a conceptual one-spot microarray, and we'll take a look at how this works for terminology and then scale up. In a one-spot microarray, what you're really seeing is this glass substrate, this chip, and on that chip there is a feature. The feature represents a specific oligonucleotide sequence represented many times by different probes. So the probes are attached to the chip, the feature is the collection of probes, and the chip is the glass substrate. The idea here is that if you've got this one-spot microarray and you've got some nice yellow RNA strands floating around, or cDNA strands, you are going to label them with some sort of fluorescent dye, and then they're going to be called the target DNA. Getting those words straight is really, really important. In fact, there was confusion in the microarray field for two or three years, with people not being able to reproduce stuff because they were talking about different things. So the target is your experimental sample; the probe is on your underlying chip. Then you hybridize the target onto the chip, and because these bind by Watson-Crick complementary base-pair associations, some things will stick tighter than others. So if it is a strong base-pair association, then you're happy, everything is going to bind, and you can wash quite stringently and wash everything else away. And so you can essentially use the hybridization stringency as a filter for only exact matches, and you can customize that. You can decide just how much cross-hybridization you want to accept. And of course, eventually you're going to wash and scan the chip, and you're going to have a fluorescent image, because you fluorescently labeled each of your individual nucleotides in the target population before you started. So that's kind of the overall view. Now, nobody does a one-spot microarray, so think about a real microarray. You might want to start looking at how reproducible it is. So the first study you might do is what's called a homotypic hybridization. You might take a rat, mouse, tumor sample, whatever, and you might extract RNA from it twice. You'll have two separate RNA extractions, which will get labeled separately. They will go through separate processing, and then they're going to be merged into one tube and put on your chip. And what you're really asking here is, are these two different? What is the technical variability that I get if the only thing I change is the way that I process my samples? Those numbers can be distressing, especially in cancer. Why especially in cancer? Does anybody pick up what the challenge is going to be? Go ahead. Anybody? Yeah, so, right, the tumor itself is inherently heterogeneous, and that means actually doing these studies to benchmark your data on a tumor sample is usually a bad idea. You usually want to take a normal, matched sample, age-matched and controlled, and you'll get much, much better results because you're really getting at technical variability. The other thing you could do is do a single RNA extraction and pull everything from that pool. This is also called a homotypic hybridization. It's a standard way of assessing the noise of an experiment. Obviously, once you've done a homotypic hybridization and know the noise, you're going to think conceptually of doing experiments like this. You have one rat, you introduce another rat, you do something different to these two rats.
You treat one with a chemical, and now you extract RNA from each and take a look at the differences, and you can sort of conceptually see exactly where the differences are. Look, here's a gene that is only expressed in the green rat, another one that's only expressed in the red rat, one that's expressed in both with some sort of differential amount. I'm giving a toy example. We're obviously going to do this properly with statistics, and we'll get there by the end. But you can kind of conceive of how this would have been done, and if we go back to the first microarrays, which were from Pat Brown's group at Stanford, they were done with robotic printers and that's exactly how they did it. Grad students would look at the six or seven thousand spot array and compare yeast condition one to yeast condition two. Oh, this one looks brighter. That one looks brighter. And they would go through the array spot by spot to make those calls. We obviously don't do it that way anymore, but it's not conceptually all that different. We're just trying to automate it computationally. So one of the other things that makes microarrays a really powerful technology is that they allow you to look at differences from species to species. You can change the washing stringency, and that allows you to use microarray knowledge of one species to powerfully advantage your study of another. So you can do primate studies. You can say, I'm going to take a human array, where we have good knowledge of the genome and the transcriptome, and I'm going to use that to compare it to a specific primate. I wish I knew my primates well enough to say which one that actually was. The idea there is that there's enough similarity between the two that you should be able to get most of the genes cross-hybridizing, and in fact you could use this for studies just on monkeys themselves, and that would make your life a whole lot easier. In general, when there is a species that has a poorly annotated genome, this is a very, very effective approach. It's even more widely used in plant studies, where people use arrays designed for one plant to study other, different types of plants. Nobody laughed, so I take it nobody here recognizes what plant that is, which is very good, people. So all of the arrays that I've been showing you so far are what are called spotted arrays. They're the first type of microarray developed. They were developed, as I said, by Pat Brown's group at Stanford in the very early 90s. I think the first one was published in '92; it was a yeast study in Science. Essentially what's happening here is that this thing here is a robotic head which has 384 needles on it. Those 384 needles are just hanging off the head like this. And adjacent to it, back behind here, are a series of 384-well plates. Those 384-well plates have a solution containing the cDNA of interest. So the needle is going to dip into the 384-well plate that contains your solution of interest. Surface tension is going to lead to some of the liquid coming up onto these needles. And it's going to move over to the individual slides, these guys over here, and it's going to touch them. Contact deposition is going to deposit a small amount of liquid there. These things are then going to be baked to get a chemical reaction to occur that links those cDNAs to the glass slide. And this will repeat, with washing steps in between and fun stuff like that. And this will happen over and over.
So a typical robotics facility could produce 10,000 slides a day, maybe more. And you can sort of immediately see why this would have some disadvantages. So what are the things that immediately strike you as things that could go wrong with a robotic printer? There should be a few. Yeah, so somebody has to program the robot, and if they make any mistake, things go bad. And there are many, many examples of our local array center having to throw out batches of 10,000 arrays because of mistakes like that. So, human error in programming the robot. What else? What are the implications of the glass plates? Yeah, very similar. Typically human error. The glass plates, as you see in a lot of things now, are not exactly plain glass slides; they have some sort of markings to try to prevent you from making that mistake. But yeah, absolutely. Any kind of liquid retention in the needles as a whole? Yeah, so, very good. The needles turn out to be super critical. They can do a whole bunch of unfortunate things. They can retain liquid, and washing can be incomplete. That's a problem, but that's actually not the most disastrous. Much worse is when the needles start to wear down, and the needles then deposit a lot at the beginning and then less and then less and then less. And you can sometimes see very clear batch biases. So there are a lot of interesting things that you can immediately infer about trends you might see in your data set. If you did 500 arrays with this technology, you might see exactly that: the signal intensity declines over the 500 arrays, and you need to think about computationally adjusting for that. That's a clear, simple example of how your analysis needs to link back to the experiment itself. Please, David. How many oligonucleotides on average do you expect in one spot? Good question. I'll give numbers on that in a second. But it's much more than you might think, so in the millions at least. Oh, wow. So what I just described, spotted robotically printed arrays, are the first type of microarray. In, OK, people who do mass spec will not like this statement, but in some sense it's the first large-scale genomic technology that was invented. And you can defend that because it could do thousands and thousands of genes almost instantaneously, and modern mass spec got to thousands of proteins, like, last year. To the 7,000-protein range last year. So, I mean, you can make arguments. The key point is it's been a technology that's been used for almost two decades now. It's not the only type of microarray. It's just the first. And there are at least three others. There are inkjet arrays, photolithographically generated arrays, and bead arrays. We'll spend the majority of our time on photolithographic arrays, and I'll talk about those the most. There's also been some work on protein arrays, or cell arrays, or lipid arrays. And those have a long history of being described in a high-profile paper saying that we can do good protein arrays or good lipid arrays, and then nobody ever does it again for three years until another high-profile paper with a different technology describes how it's possible. And even today, I think the best protein arrays look at about 200 targets total, genes or isoforms, so they're quite limited in size. So let's start off with inkjet arrays. Inkjet. What does inkjet have to do with arrays? Well, it turns out that in 1999, HP was really enjoying the dot-com boom and said, we're making all this money off of selling servers and having our people do consulting.
So we are going to get rid of all this boring life sciences and measurement stuff, and we are going to spin that off as a separate company. It'll make our shareholders happy. They can focus on what we actually do. And we have a lot of patents; we can do this nice cross-licensing of patents between our own work and what this new company is going to do. And it'll be really friendly. So of course, that company is Agilent. It's a little ironic that Agilent is significantly outperforming HP. And they were thinking very hard at the beginning: what are the key things that we can use as an exemplar, as a guiding light, for how this collaboration is going to work? And they thought about inkjet printers. They thought about inkjet printers because, really, what is a printer? It's depositing a very small amount of liquid in a very specific place, very accurately. It's kind of like what a microarray printer needs to do: to make a microarray, you deposit a small amount of liquid in a very specific place. And so they said, we're going to put a lot of work into seeing if we can get these things to work as inkjet printers. Does anybody know who did this work and created the inkjet arrays? They're a prof here at Toronto. Tim Hughes. So Tim Hughes, who's at CCBR and is subsequently famous for huge amounts of work on RNA-binding proteins and yeast transcriptomics and some really big biology studies, was the person who led this work as part of his postdoc. And so Tim basically worked with a series of engineers to create what is a proprietary, and so we can only kind of guess at, system for producing arrays based on inkjet printing. So if you see an Agilent microarray, we don't exactly know how they are produced, but we kind of know. They simply replaced the four colors in a printer, CMYK, with four nucleotides. And then they would move the inkjet head into specific locations and they would drop the next nucleotide on top. Now, that doesn't exactly work. You need some sort of a polymerization reaction. And in the case of Agilent, we have no idea how exactly they're doing that. We just know that it involves heat and they're doing something, either a drying or a heat-activated reaction, from step to step. But basically what they're doing is moving the inkjet head over, dropping a little bit of the next nucleotide solution on, doing something to the chip to get the chemistry to go, and repeating billions of times until they make the entire array. That's a great idea. We don't exactly know how it works. What we do know is how photolithographic arrays work. Photolithographic arrays were pioneered by Affymetrix in 1992 and 1993. Affymetrix is headquartered just down the road from Stanford. So essentially the people who were working on robotic arrays and photolithographic arrays were doing the same thing four miles apart, something like that. And they chose fundamentally different ideas. The robotic arrays are robotics: you're basically going to take what a human does and make it repeatable, faster, and cheaper. Photolithographic arrays said, let's use the technologies used in semiconductor manufacturing, and we'll apply them in high throughput to generate a microarray. And the technique used in semiconductor manufacturing is called photolithography. Hands up if you know what photolithography is. OK, awesome. So I'll explain very basically how it works, with a pretty worked-out example of Affymetrix arrays. So pretty much all Affymetrix arrays start off with a glass wafer. The glass wafer is silanized.
You're probably wondering, why would you do that? Well, silanization gives you a really active hydroxyl group. This is going to involve some chemistry. So it involves a really active hydroxyl group. That really active hydroxyl group is something that we can do stuff to, uniformly, with high reactivity, good thermodynamics, and awesome kinetics. So we can transport these wafers, we can be pretty sure nothing's going to happen to the surface, and then we can activate them when we want. And the way they activate them is with a proprietary linker molecule. It's proprietary, but we're 99.9% sure that it's using sulfhydryl chemistry. And you'll see that come up a couple of times. This linker molecule has a couple of nice characteristics. First, it's photosensitive. And secondly, it's very reactive once it's been photoactivated. So it's kind of inert in the absence of light, and then once it gets shone with light of the right wavelength, or the right band of wavelengths, it becomes activated and can undergo other chemical reactions. And so how do you control where light gets to on a chip? Well, you do that with what's called a photolithographic mask. It kind of looks like that. It's just selectively choosing where light is going to shine through onto an array. And it kind of works like this. You take a lamp. The lamp has light going in every different direction, as any point source would. And there's a blue thing, it's not actually blue, but there's a thing here drawn in blue that's called a collimator. Essentially what it's doing is ensuring that all the light beams are going in parallel. You can sort of see why this is important: if the light beams are going off in every direction when they get to the mask, they're not all parallel, and some of them are going to scatter and diffract out into different directions. With the collimator, the light passes through the mask and you can very tightly control where the light is going to go and impinge on the chip, which will only photoactivate small parts of the chip. So there's this huge amount of work to control where exactly light hits the chip. A couple of comments. One, the lamp is almost always a laser; it's a very specific wavelength. Two, the collimators and the masks for them are fairly expensive, so it's not one of the cheaper parts of manufacturing one of these things. Three, you don't want to remake masks all the time, which is why this technology is really suited to somebody who will have a single array design that they're going to reuse frequently. And four, you just discovered why everybody has to use a clean room to do any manufacturing of a semiconductor. Imagine that there is a speck of dust in there between the mask and the chip. The light's going to hit the speck of dust, diffract and reflect around the speck of dust, and now you've got light shining where you don't want it to. And so that's why they will have things like 5 ppb: you're allowed to have five parts per billion of dust in the air, and that's controlled to certain quality specifications. Everything I talked about there, every single word, is exactly equivalent for the production of computer chips. So the advantage Affymetrix has had is that they don't have to do any R&D. They wait until Intel or AMD or ATI does R&D on new ways of manufacturing computer chips, wait until that gets off-patent, and then apply it here. And so what's actually being used to produce microarrays is something like 15-year-old computer chip manufacturing technology.
So it's actually not even close to state-of-the-art, and it's certainly good enough for the applications that we do, because computer chips are super tiny, and microarrays are tiny, but not that tiny. So let's walk through what actually happens in a photolithographic synthesis. Yeah? Why are we even actually using the mask? I'll show you right now, exactly what I'll show you next. So why do we use a mask? If you take a look at the initial step of a chip, you're going to have the wafer. It's silanized, and there are these linker molecules on top of it. And that's useless. We have no differentiation of one sequence versus another. So what we'll do is we'll use the photolithographic mask to stop UV light from impinging on one area of the chip, but we'll allow it to hit two others. The net effect is that the linker molecule is going to be activated. Two parts of the chip are now going to be prone to chemical reactions; they have a much higher affinity that will allow them to start doing stuff. And of course the third part in the middle, feature two, has still been masked, so its linker molecule is blocked and it's not active. Now you pass over the different nucleotides, one at a time. So here you're passing over an adenine. It's a labeled adenine, with this photolabile linker molecule at the end of the adenine. And it will only bind to those places that have already been activated. So now, instead of having a uniform chip, you have a chip that in two places has an A, and in one place has nothing attached. We can now apply a different mask that blocks UV light to features one and three, activating the linker molecules on feature two. That gives us an activated feature two, where we pass over another labeled nucleotide, and now we've built up the first layer of the chip. Two spots have A, and one spot has C. You can repeat this. You can now mask two of the features, activate the third, and put a G on there. And now we've built up the second level, or part of the second level. You can immediately see some things that could go wrong. What would happen if one of these reactions did not go to completion? You didn't have full extension of a chain. And these are massively parallel chemical reactions, so it absolutely happens. Well, that's a problem. And so what Affymetrix does is pass over a very high-affinity molecule called a capping agent. The capping agent essentially has a chemical reaction that is going to occur any time it sees an activated linker molecule, and it will go to completion pretty much immediately. It doesn't allow the chain to extend further; it kills the chain. So if there is a problem with incorporation of nucleotides into a growing chain, then you cap it and say, this chain is lost. And then you continue building up step by step. And at the end, you end up with something that looks like this. You've managed to build up an entire array, one nucleotide at a time. That capped area is just a chain that is incomplete, only partially built. And every other chain has its nice sequence of bases that we've built one at a time. And now you have these spatially separated features derived from light-controlled localization of chemical reactions: photolithography. Any questions? How do you know which ones are capped and which ones are not? You don't. Yeah, there's no experimental technique that gives you easy access to that.
You could do something like scanning electron microscopy to get a feel for how much is capped, or you could do comparisons of different arrays to see if the proportion is changing, but to know exactly which strands have capping and which ones don't is almost impossible. So this gets to the next slide. You guys are setting me up for questions for later. So I'll take one thing at a time and I'll get to you guys. So why we don't care is because the typical chip is about a centimeter and a quarter by a centimeter and a quarter. And that means that there's going to be, let's say, a million and a half individual features per chip. But if you take 1.3 centimeters by 1.3 centimeters, divided by one and a half million, the area that's left is still 100-plus square microns per feature. And 100 square microns is easily enough to have probably about 25 million of these probes. So if you end up with a few hundred capped, the effect on your net signal is basically negligible. You can't detect it. So you don't know exactly how much is capped. As long as you don't find that when you wash off the capping reagent there's a ton missing, you're fine. It means that just a small amount is being capped because of general chemical inefficiencies. And their process control will know what those numbers look like. There were two other questions. One over here. What about the sequence, the ordering of the masking? In other words, how do you decide what order to add nucleotides in? Yeah, well, the order that you're masking determines the order in which you're going to add nucleotides. And so, good question. Let's pretend that your probes were 25 base pairs long. Then the maximum number of masks you need is 25 times 4, so 100. You don't want to do that. So they will typically get through this in about 75 masks, and they are optimizing the order of masking to minimize the mask number. And so there are some pretty sophisticated algorithms that work through the minimal number of masks that you need to be able to get this to work. Yeah? Sorry, I think at the back first and then somebody at the front. Steph, sorry. Is it the same technology that you put inside? I don't think so. I think those are done using a labeling with a barcode on it, but I'm actually not sure. It's a good question. Somebody else? What's the typical capping rate? The usual capping rate? Oh, nobody's asked me that before, but it's like less than 1,000 in 10 million. So, yeah, it's less than one in 10,000. And actually it's probably less than that. Affymetrix's typical statement is that they scan their arrays on a one to 60,000 scale, and if there's an array where capping would change that, then they would consider it below quality. So it's like less than one in 60,000, I guess, is the rate. Okay, so at the end of it, what you've produced is a million and a half spots, each of which contains millions of probes, and then you can start to do measurements on them. And of course that's the chip and those are the features. So we're going to do millions of measurements on them. What kind of sample do we put on? It's worth going through the Affymetrix sample prep quickly, because it's a little unusual. So as with everything, it starts with RNA, and as with most microarrays, it starts with total RNA. This gets reverse transcribed into cDNA, and that's super common for all microarrays. Here it gets weird.
Next, they in vitro transcribe the cDNA back into RNA. So they go from total RNA to cDNA and then in vitro transcribe that back into cRNA. During that in vitro transcription step, they're labeling the uridines with biotin. They fragment that, basically just sonication, into smaller chunks, and then the fragmented chunks are hybridized to an array, where they can use biotin-streptavidin conjugation to light it up and quantitate the array. So there are a couple of interesting things here. They get to cDNA and then say, we're not done, we're going to do an in vitro transcription and we're going to have all this RNA hanging around. Why are they doing that? Why would you want to have a tube of RNA hanging around? So, what's more stable, DNA or RNA? So everybody says DNA? Nobody says RNA? The question I posed is not fairly posed, actually. Double-stranded DNA is certainly more stable than RNA, but single-stranded RNA is significantly more stable than single-stranded DNA. And they're going to be storing it as single-stranded material, that's their goal, because they would like to store the labeled material, and to do that you want RNA to maximize stability. They also believe that biotin as a label gives you more consistent, more accurate quantitation than using fluorescently labeled nucleotides, where the nucleotide itself directly carries the label, and that it allows the hybridization to be cleaner and more pure. And so for both those reasons they use a significantly different sample prep. And it differs so much that essentially if you were to take the cRNA from an Affy experiment you wouldn't be able to use it in any other experiment, but if you took the cDNA from most other arrays it could actually be used for pretty much any array. And of course you wash and hybridize, and you end up with a chip that looks sort of like this. You have complementary Watson-Crick base-pairing and labeled uracils, and in this case the uracil happens to be at the top. That's just for demonstration. And eventually you wash things off so that you have these patches with specific intensities hybridizing to each feature. And you scan that to get an image that looks like that. So let's look at that image. Is that image uniform? Does it have any interesting features? So there are bands, yes, there are definitely bands. We're going to talk about bands in 15 minutes. What else? Two other really clear features. There's a green blob, this guy, right? You might also notice this thing up here, right? Okay, so if you were to zoom into that little blob, those are the Affymetrix control probes. They are probes that will light up on every array, and they are shaped into the version number of the array. So if somebody gives you a microarray experiment and goes, I ran this on an Affymetrix array. Awesome, which one? The human array. They sell 40, which one? The new one. You go and you look at the image and you go, oh, you used HG2 gene 3.1 plus two, and you now know the exact characteristics of it. This sounds like a situation nobody would ever get into. Trust me, the number of times that what I've been told by collaborators and the actual truth of an array differ is higher than you'd expect. One more thing you might see on this. If you take a look right down this line, you can see these dots all the way along, and if you look carefully you might see them here at the bottom and again on the side.
So Affymetrix has a series of control probes as borders to the array, and that's how it lines up the array for the scanning, so you can try to match spot to spot. And we'll talk about that again in five minutes. The last type of array that I'll talk about, really quickly, is Illumina bead arrays. These are essentially three-micron silica beads. Each one of those beads is coated with about 100,000 identical 25-base-pair sequences. That 25-base-pair sequence has, just before it, a barcode, an address label, that can be easily read using complementary multicolored labeling probes. And rather than say, we're going to have an array that is physically structured, that has these 20,000 probes, 20,000 beads, each of which represents a gene, they say, we're going to have an array that might have 200,000, and we're going to randomly sample a number of these probes, and so some of them will be present 10 times, some 50 times, some only once. The idea here is that you get additional replication from it. The disadvantage is that the precision of your measurements varies in a random, unpredictable, experiment-to-experiment way. This was a very late technology to the microarray game, and that had a lot of consequences for its adoption, which didn't end up being very large. So we've talked about four platforms: spotted cDNA, Affymetrix, inkjet, and bead arrays. And sometimes people ask me, which one should I use? And the answer is, it kind of depends on what you want to do. If you want really, really long probes, say kilobase-long probes, you're going to use a spotted cDNA array. It's the only technology that really is capable of doing that effectively. If you want things that are quite cheap, you're probably going to do spotted cDNA probes, or maybe inkjet or bead arrays. If you want the bioinformatics to be easy, you're probably going to do either spotted cDNA or Affymetrix. And if you're looking at data quality, in general Affymetrix arrays have the highest data quality. So there are trade-offs depending on what you want to do. I personally have no endorsement of a specific platform. They can all do awesome things. What I will say is that if we look five years from now, it's likely that Affymetrix arrays will still be widely used and the other platforms probably will not. I think that's the direction the market is going. And there are a lot of reasons for that, but largely, prices continue to come down and quality becomes a more important consideration. And when the price difference lessens, people are going to end up using larger numbers of the higher-quality arrays. Yeah. What does the technology cost, as a matter of interest? Are we talking a few thousand dollars, a few thousand dollars for a sample? So the array itself could be used up to three times. Prices continue to change, but what we're paying now for Affy arrays is $150, $175 per array, which is pretty comparable to a multi-use array at that pricing. Arrays used to cost $4,000, $6,000 each. I think for most groups, the price differential no longer really matters. Somebody else? Yes. Actually, you answered half the question. I wanted to ask about the other part of the price. Yeah, sure. How about the preparation kits? Is that taken into consideration, the cost of the prep kits? So pretty much everybody I know no longer does their own arrays with their own prep kits. They send extracted total RNA to a core center, and then the prices are all-in, except for analysis, at that stage. But are they all similar for different prep kits?
They have different prep kits, but I never pay for them or think about them separately; they get folded into the facility core costs. So yeah, cost differences are a function of everything from the labor to the equipment to the kits to the actual chips, and it's basically the total price for everything to get the data before somebody analyzes it. I've heard a lot about a technology called NanoString. How does that compare to arrays? NanoString is what would be called a medium-throughput platform, by which I mean you can't do more than 700 probes on a NanoString assay. So in that sense they are very limited; they can't handle whole-genome surveys or anything like that. They're customizable in the sense that you can decide which genes you want those 700 probes to go to, splice variants and interesting things. They have two key advantages. One is that they can work with very small amounts of RNA. And two, they can work quite robustly on FFPE-fixed samples, more so than arrays. So there are a lot of groups that will do their initial exploratory discovery work on arrays and then will later robustify the assay into a NanoString assay for attempted clinical application. And that's perfectly reasonable. They use similar, but not identical, analysis approaches. Okay, so now I think I'll very quickly talk about what arrays are used for and then dig into how we analyze array data. So what are arrays used for? In one slide: a lot of things. And it kind of depends on what you're hoping to assay. So if you are looking at DNA arrays, you could use them for sequencing. That's true especially for SNP discovery, and you probably see a GWAS study published in Nature Genetics every two days. No, that's not fair, once every week. And those GWAS studies essentially all use DNA SNP arrays today. The first exome sequencing GWAS studies, large ones, are likely to come out this year, and the first large whole-genome sequencing GWAS studies are probably three or four years away. So there's a lot of ability to process large numbers of samples, and when you're looking for relatively subtle effects in a population, this is almost the only way to do it still. You can get copy-number assessments from a microarray. And capture: for a long time, people would do exome capture by using a microarray to actually facilitate the exome sequencing study. In the RNA space, there's a lot of work on splicing, transcript degradation, transcription rates using pulse-chase-type experiments, quantitating mRNA abundances, RNA-protein interactions, and many other applications like that. And if you take just one of those, say RNA, you kind of flow through to a series of things that you might want to do with the data. You might want to find candidate genes, which means you need to know a lot of statistical analysis, and we'll touch a little bit on that here, but of course you're going to hear more about statistical analysis later. You've already talked about pathway analysis, and of course pathway analysis is a fundamental output of any of these large screens. You can also do some pretty interesting things around dose-response studies and machine learning to do classification and prediction models. And the data allows you to do those sorts of things pretty readily. So, all right, let's get to what we're actually going to talk about for an hour. How is microarray data analyzed? If you get an experiment, what is it? So let's pretend this is our hypothetical microarray experiment.
It is a glass slide with a series of features, each of which contains some probe DNA that's going to look at something. For simplicity, I'm going to draw this like it's a two-color microarray; it wouldn't matter at all what technology we're looking at. And now you've got the image back from your core facility. The first thing that you're going to do is say, I need to turn this into numbers. I'm going to quantitate it. So that's all good. We turn it into numbers, and for now I'll describe them as Cy3 and Cy5, because by convention those are the two most widely used dyes in two-color microarray experiments. We're going to background-correct it: we're going to remove local noise around each individual spot. We're going to identify poor-quality spots and drop them out of our analysis entirely. We're going to normalize the data within an array to remove spatial biases and heterogeneity. We're going to normalize data between arrays to make sure that they're all on the same scale and all comparable for statistical analysis. We're going to do that statistical analysis, identify interesting lists of genes, do clustering or other machine learning, and maybe do some integration studies. So I just described a pipeline, a set of sequential algorithms that we're going to apply. Those algorithms fit into two groups. The first group, and the one that we're mostly going to focus on, is the removal of noise. Does anybody know another word for the removal of noise? Sir? QC? QC? Pre-processing. QC is part of the pre-processing step. And pre-processing is the idea of, let's start off with our raw data and do all the stuff that we need to do to get it in place to start asking biological questions. And so in a sense, the second set of things is about extracting information from the experiment. Don't make the mistake of jumping into extracting information until you're confident that you've actually removed noise appropriately. That's a fundamental mistake that many people make, and then they'll get an exciting hit and become invested in that hit before really going, oh, it's actually a technical artifact. And so it's really important that you systematically think through pre-processing. So we will systematically go through the steps of pre-processing one at a time. What do they do? What are they trying to accomplish? How well do they work? And the first is image quantitation. You start off with an image and you need to convert that into numbers. That's done using image segmentation algorithms. It's a fairly difficult thing to do, and it's not very well researched at all. You can kind of imagine that if I gave you this image, you could go, oh, I bet there are grids there. I can see the patterns. And if you look at it long enough and you think about it hard enough, you're eventually going to come up with that. Well, that's nice. And initially grad students did exactly that. So the first automated image quantitation involved grad students basically placing those grids one at a time. That sounds miserable. And so people tried to automate it further. So what we're really trying to do is teach a computer how to do this: teach the computer how to find those grids, and then after it finds those grids, teach it how to dig into any individual grid and find the spots within it. And so how might you do that? Well, the algorithm used is pretty darn simple. It's basically a signal integration approach.
Imagine that you take this image, you take a look at every line, and you just sum up the intensity of the spots on that line. Well, obviously, for the first line there's nothing there, so you're going to get no intensity. But in other cases, you're going to get these peaks corresponding to individual lines of the array. You can do the same thing on the X and the Y axes and see peaks corresponding to individual rows and columns of the array. And now you've actually kind of mapped the X, Y coordinates of every point. You just take each of these and each of these and say there ought to be a spot right where they intersect. Great. And now you look at that area and you say, there's a spot. I don't know exactly where it's going to be, but I'm going to place a big box around it, see how much signal intensity is in there, and I'm just going to start shrinking it from side to side until I find where the spot is. Nice and easy. So that sounds nice and easy. It turns out to be disastrously difficult. For example, focusing on exactly that grid, what do you do with something that looks like that? So that little circle thing there, is that a spot? It kind of looks like a spot, but it could also be stray signal, stray fluorescence, and it looks like it's offset one column to the left. So if you decide that's an actual, real spot, now you've got this entirely new column and you have to pull everything over. And if you decide you have a missing column but you don't know whether it should be on the left or the right, this could lead you to put it in the wrong place. That's a total disaster, because now your arrays are all off. The intensity you report for one gene actually belongs to the gene next to it. So it's actually critical to get this right. You can see that pretty clearly on these two rows. Those rows don't really have much going on. You can see there's a little tiny piece of fluorescence right there, but that's super hard to see and could easily be stray background noise. So then you could ask, is this a missing row from the grid below? Is this a missing row from the grid above? And you could end up having two grids that just get pushed into each other by this. It's difficult to figure out what happens with empty rows. It's one of the reasons why the design of arrays like this is done very carefully. You think hard about what goes into each row. You try to make sure that you don't have huge regions that have nothing on them that's expressed. And so you're actually trying to balance tissue specificity and features like that to maximize your chances of having something expressed in every column and in every row. So with that being what's obviously a really critical challenge, you would expect a lot of research. There's not. There's not a lot of research, for a couple of reasons. One is it's really hard. I don't actually immediately have any intuition or insight about how that would be done better. And when I say I don't have any intuition or insight: I've taught this course six or seven times, and I still don't have any intuition or insight. There are no obvious things that you can do in terms of research on it. The best thing to do would be to improve scanning, or to get rid of the background signal, or to do something experimental. And a few things have been tried, but nothing has been very successful. It's probably a source of error in pretty much every study.
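Just to make the signal-integration idea concrete, here is a minimal sketch of that gridding-plus-windowing logic in Python with NumPy. It is a toy, not any scanner vendor's actual segmentation code: the function names, the peak threshold, and the fixed window size are all arbitrary choices for illustration, and real scans are far messier than the synthetic two-spot image at the bottom.

```python
import numpy as np

def profile_peaks(profile, min_frac=0.2):
    """Indices that are local maxima of a 1-D intensity profile and
    exceed min_frac of the global maximum (candidate grid lines)."""
    interior = profile[1:-1]
    is_peak = (interior > profile[:-2]) & (interior >= profile[2:])
    keep = is_peak & (interior > min_frac * profile.max())
    return np.where(keep)[0] + 1

def find_grid(image):
    """Signal-integration gridding: sum intensities along each axis and
    take the peaks as candidate spot rows and columns."""
    return profile_peaks(image.sum(axis=1)), profile_peaks(image.sum(axis=0))

def quantify_spot(image, r, c, half=8):
    """Crude quantitation: take a window around a grid intersection and
    sum the pixels brighter than the window mean as the 'foreground'."""
    win = image[max(r - half, 0):r + half + 1, max(c - half, 0):c + half + 1]
    return float(win[win > win.mean()].sum())

# Toy usage: a synthetic image with two bright spots.
img = np.zeros((40, 40))
img[10:14, 10:14] = 100.0
img[25:29, 25:29] = 60.0
rows, cols = find_grid(img)
spots = {(r, c): quantify_spot(img, r, c) for r in rows for c in cols}
```

Even in this toy form you can see the failure modes from the lecture: a stray fluorescent blob adds a spurious peak to a profile and shifts a whole row or column of spot assignments, and an unexpressed row contributes no peak at all, so the grid silently loses a line.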
A couple of groups have estimated, using the only way we know how to do this, which is manual spot checking, that the error rate is something like 1%, maybe 2%. So that means that every microarray study that we do has probably got exactly this issue happening. And when you see an outlier and you go, that sample just doesn't make sense for that gene: yeah, it probably really doesn't make sense, and there may be nothing you can do about it. And I promised to draw some analogies to sequencing data. Sequencing data has got a series of clusters growing on a chip that you take images of one at a time. The algorithms to work with that are almost identical to what's being done here. So the first step of quantitation of a sequencing study has exactly the same issues as quantitation of a microarray study, and sequencing studies contain not just a few million spots, but hundreds of millions to billions. So this is actually a systematic problem in sequencing, metabolomics, and microarray work, but not in proteomics. So that's the first thing that you do, and it immediately introduces, fortunately, what we think is reasonably random error into your study. Yes, Eric. Can the control probes at the edge help to anchor it? It certainly does. It helps anchor the edges, but it doesn't get you very far in the middle. So you could tie the control probes on the two sides and force the rest of the spots to lie along the line? That assumes that your array construction is perfectly linear, that there are no deviations, that the glass slide is not bending. In an ideal world that would work; in reality, there's still lots of noise. But why doesn't the manufacturer put some control spots somewhere in the middle, so you don't have to rely just on the lines? So you're pointing out that you could make a better array if you spent more money, and it's true, but researchers unfortunately buy on price a lot. And so if the manufacturer devoted a really good number of probes to this and used a quarter of the array for control probes, then the number of genes on it might drop, or the number of splice variants, or the costs might go up, and then people wouldn't buy it. And we know that because manufacturers have a couple of times tried to make higher-quality arrays, and the only people who buy them are for clinical usage. Yeah. Is fluorescence the only detection method that people have tried? I just know that before, radioactivity was used for a time. Yeah, so absolutely, radioactivity has been used. There's something else that I'm not thinking of, but yeah. What about post-processing stages? Is there a post-processing stage for the images, once you get them, after segmentation? So, I'm going to get into the post-segmentation stages. Oh, for the images? No, no, the images are not pre-processed. So you're not applying band filters or anything like that in any deep way. There will sometimes be, what's it called, a quality check on the image using heterogeneity measures, and the scanners will sometimes be pretty specific to wavelengths, so in that sense they're not just catching any visible light, they're catching within certain wavelength bands. But that's about it. Could you do something like a dilation or an erosion? It's not being done standardly. It has been researched and wasn't found to be dramatically successful, but it's also not being routinely researched. But yeah, your point is well taken that it could be possible to improve the signal processing of the arrays.
And I would strongly encourage people to take that as a research direction. There may be nothing that anybody in this room could do that would be more impactful than improving the quality of every sequencing study by one and a half percent. The massive effect that would have societally would be huge. So if you have ideas, I would strongly advocate that as a good research direction. The PhD group that I was in did some modeling work on this, and these could clearly be things that get used in this area. So let me continue. If we go through, after quantitation we have to start thinking about, now that we have the actual array data, what are we going to do with it? And the first thing, numerically, that we do is try to remove background noise. Background noise is really stray fluorescent signal, caused by cross-hybridization, around the individual spot. Removing it is typically done using statistical models. It also turns out to be very difficult, with not that much research happening. So what is this? What does this actually look like? Well, if you're doing the spot segmentation, it's actually a little harder than I described, because each spot is not really just a circle or a square, depending on your technology. It definitely has that: it's got this core signal region. But outside the signal region, there's a second halo that is less intense, quite symmetrical in most cases, that is thought to be bleed-over of fluorescent signal from underneath the main forest of probes. And then outside of that, there is something that is clearly background. It is random hybridization around the spot. This particularly happens in low-intensity spots, and it's thought to be incomplete washing of probes with similar sequences. So what you might think is that this foreground area would be your main signal, but it would be the main signal plus the additional noise caused by the background intensity, and so you need to apply the very sophisticated mathematical transform of subtracting the two. And that's a really good idea. It also turns out to be really terrible in practice, for a couple of reasons. One is that if your background is larger than your foreground, you start getting negative signals. That's not realistic. It's not realistic for genes to be negatively expressed. And on top of that, it breaks a lot of the things that we tend to do downstream. Most microarray data, in fact a lot of human genomic data, follows a distribution called the log-normal distribution. So we take a lot of logarithms, a lot of logarithms, and you can't take the log of a negative number, so this messes up, practically, a lot of things that we want to do. And you might say, how much does it actually happen? Well, 2% is a reasonable estimate in some of the older array studies, probably 1% now. So in 1% of your spots, you're seeing this weird thing where the gene is negatively expressed, and people put a lot of work into figuring out why that might be the case. And there were a couple of studies, both of which came out of Argonne National Labs in 2001, that were able to demonstrate why it happened. What they did is they used a hyperspectral scanner. Instead of scanning at a single wavelength, they scanned across a range of wavelengths and were able to see what the signal looked like. And across that range of wavelengths, they saw two really distinct peaks.
And those peaks corresponded really clearly, one to the underlying background and one to the foreground. They had different peak intensities, and the authors spent a lot of time tracking down why that would be the case. I'll highlight this: when you see that phenomenon where the background signal is higher than the foreground, it corresponds to unbound spots, which correspond to unimportant genes like transcription factors and signaling molecules, stuff we don't often care about in cancer research. That was sarcasm, in case anybody was missing it. We were missing the most important genes in cancer research because of this problem. A classic gene affected by it is MYC, one of the most important and most frequently dysregulated transcription factors in all of cancer, and it is regularly invisible in microarray studies.

So essentially what people recognized is that the core cause is the glass itself fluorescing. There's a weak autofluorescence of the glass, especially when it has nucleotides conjugated to it, and that autofluorescence has a different wavelength distribution, which messes everything up. A theoretical solution is to use hyperspectral scanners that read across a range of wavelengths. Unfortunately those are expensive and nobody actually has them, so although conceptually an immediate fix, it's not a viable one. So instead there's a broad range of mathematical models that are used. They fit into three basic families, with different assumptions about the distributions of signal and noise. The Edwards model assumes the signal is linearly distributed and the noise logarithmically distributed. The Smyth model, normexp, assumes the noise is normally distributed and the signal exponentially distributed. And the Kooperberg model uses a very fancy and fun Bayesian analysis. The underlying mathematics are advanced and really complicated, so I'm going to put up an easy cheat-sheet summary of how you might pick amongst them. Basically, what people say is that Edwards is fast and good, normexp is slower and better, and Kooperberg is very slow and best.

That sounds simple enough, until you actually ask: how would you make that decision? What experiment would you do to produce that ranking? There isn't an easy experiment that extracts only the effect of background correction from all the other characteristics, and does so reproducibly across the types of backgrounds seen in practical data. These conclusions are derived from the MicroArray Quality Control consortium, MAQC, which is great, except those were super clean, high-quality arrays, and it's unclear how generalizable the conclusions are to other arrays. And when I say Kooperberg is slow, it's actually pretty slow. The last time my lab ran it was maybe a year ago: on 100 arrays it took something like a week and a half. That's slow; that's sequencing-analysis level slowness. And you can reasonably go, do I care about that? Is it really going to make that much of a difference? Sometimes you cannot justify it. So we have difficulty selecting which of these methods is best. But what we can say is that all of them vastly outperform simple subtraction. How do I know they vastly outperform subtraction? Because they don't force you to throw out spots with negative values, and because they give more accurate estimates of fold changes when compared against real-time PCR. That's across a couple of studies.
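For the practically minded, here is a minimal sketch of what those choices look like in the limma package, one of the standard Bioconductor tools for this step. The file names and the source= setting are placeholders for whatever your image-analysis software actually produced.

```r
## Minimal sketch with the limma package. File names and source= are
## placeholders for whatever your image-analysis software produced.
library(limma)

files <- c("array1.gpr", "array2.gpr")            # hypothetical GenePix output files
RG <- read.maimages(files, source = "genepix")    # two-colour foreground/background intensities

## Plain subtraction: can produce negative values that you cannot log.
RG.sub <- backgroundCorrect(RG, method = "subtract")

## Model-based alternatives discussed above.
RG.edw  <- backgroundCorrect(RG, method = "edwards")              # Edwards model
RG.nexp <- backgroundCorrect(RG, method = "normexp", offset = 16) # Smyth's normal + exponential model

summary(as.vector(RG.sub$R) < 0)   # how many red-channel spots went negative under subtraction
```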
So we think you should use one of these, and we don't have strong guidance about which one. So that's background correction.

The next thing you're going to want to do is spot quality assessment, and we'll talk about that for a little while. Spot quality assessment is the idea that your array is not perfect, so you want to identify artifacts in it. It is very challenging and very under-researched. So what exactly do we want to do? Imagine we're looking at a microarray. It's easy to guess that not every spot on every array is equal. Some have better quality than others. Some are perfect; we should use them fully in our analysis. Some are complete noise, for whatever reason; they should be excluded entirely. And some are somewhere in between, and we'd like to weight them accordingly. Conceptually that's a very natural idea. Our data is not all of equal quality. We have a ton of data, millions of spots per array and then dozens or hundreds of arrays, so we could have hundreds of millions of data points. All we want to do is give each one a weight between zero and one telling us how confident we are in it. Of course, the problem is how to calculate it, and a few ways have been proposed. (Interesting, the slides aren't cooperating; it's a Mac thing. You have the printouts, and I remember what I wrote.)

So there are a couple of approaches. The most common way, by far, is to use the mean-to-median ratio. If you've got a spot, that spot is going to have 50, 100, 150 pixels in it. Calculate the mean of those pixels, calculate the median, and take the ratio. If the ratio is one, it suggests a very symmetrical distribution. It can be symmetrical in different ways, it doesn't have to be a nice unimodal distribution or anything like that, but there is some fundamental symmetry when the mean and the median match. That's a good thing, and it suggests the spot is good quality. And then you penalize spots the more different those two numbers are, because if the mean is much higher than the median, it suggests there are some badly messed-up outlier pixels, and if the median is much higher than the mean, it suggests some sort of skewness in the distribution; both are interesting things to know. Unfortunately that doesn't always work, and I'll come back to that in a second. The other approach is what are called composite quality metrics. When you're doing the segmentation you ask: how circular is the spot? Is it oblong, or really a circle? How even is the signal intensity across it? You can calculate a measure of entropy. You do a whole bunch of calculations like that and then merge them into a single quality metric. Those sound like good ideas, but both kind of fail. And when I say kind of fail, they will occasionally do really disastrous things: I've seen arrays where pretty much every spot is flagged as poor quality, even though the array matches every other array in the experiment quite well, and there's no obvious reason why that's happening.
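Just to make the mean-median idea concrete, here is a toy sketch. It is not any published method; the penalty scale is arbitrary and would need tuning against spots flagged by hand.

```r
## Toy illustration only: a mean/median-style quality weight for one spot,
## given the pixel intensities inside the segmented spot. The penalty
## scale (the 5 below) is arbitrary.
spot_weight <- function(pixels) {
  ratio <- mean(pixels) / median(pixels)
  w <- exp(-5 * abs(log(ratio)))   # weight 1 when mean == median, shrinking as they diverge
  pmin(pmax(w, 0), 1)
}

spot_weight(rnorm(150, mean = 1000, sd = 50))         # symmetric, clean spot: weight near 1
spot_weight(c(rnorm(145, 1000, 50), rep(60000, 5)))   # a few saturated outlier pixels: heavily penalised
```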
And you will see the quality metrics produce very sensitive results that seem simply unrealistic given the tolerances you'd expect in a biological experiment. So of course the question that gets asked, appropriately, is: do we need this? Do I need to spend a lot of time and effort thinking about it? I hope to scare and convince you in the next two minutes about how critical it is.

We're going to start with a cDNA spotted, printed array; can somebody grab the lights at the back? I want you to be able to see these slides clearly. Thanks. Okay, so this is a tiny fragment of one array. You can see that bright center spot. It looks great. But you should also be able to see that there's so much fluorescence coming from it that it's starting to impinge on the region around this other spot. The fact that its neighbor is highly expressed is going to change the estimated background intensity for this spot and artificially reduce how intense it appears. To put it differently, whatever gene happens to sit next to this gene, regardless of what it is, regardless of whether it has any biological relationship, is going to have an underestimated signal intensity. Over here you can see something which may be a very highly expressed gene or may be a dust mote, some small artifact, that is affecting the estimate for this gene and basically obliterating the signal for that one. Here you've got a really disastrous scenario. Here's a piece of dust. That dust fleck is right adjacent to a highly expressed gene, and the dust fleck itself has managed to soak up lots of nucleotides, so it is also bright. If you look carefully, you can see the circular halo of increased background around all of the spots in this region, so all of them are going to have overestimated background and underestimated signal intensities. And here is my favorite. I have no idea what this is, but you can see one spot smearing into the adjacent spots. I guess it's a printing defect: likely the needles were on the slide and the slide moved a little bit at one moment in time.

And of course you're going to say, well, how common could that be? These are all from a single array, a single good-quality array that we published and validated data from, but we had to adjust for some of these things. So these are very, very routine issues. At this point somebody will often say, but Paul, I use Affymetrix, which has higher data quality, or whatever other platform is supposed to be better, or I do sequencing, it couldn't have this issue. What I've just described happens for everything, everywhere. Let's take a look at Affymetrix data. You can look at this array and see some sort of pooling in the bottom right corner; the signal is more intense there. Here you've got a weird artifact down the middle that's probably something like an oil stain on the cover slip that changed the hybridization intensity. And here, somebody's thumb. Those three are from a spike-in experiment done by Affymetrix, put up on their website as: here's an awesome experiment that everybody should use as high-quality Affymetrix data. That's not to say there isn't good data there; of course there is. It just shows that we really need to think hard about how we would go ahead and use these quality differences from one spot to another.
I promised to come back to the question about scanning, so here's a really interesting scanning artifact. These are slides from an experiment we did. This is the third slide in the experiment, and it looks pretty normal. The one thing you might notice is this weird line there, right? These here are some small spatial artifacts, pretty typical, but this line, who knows what that is? By the time we got to array five in the experiment, you see a lot more of these lines, and by the time you get to array eight, you see a ton of them. So can anybody pick out what's happened? I told you it's a scanning problem, so that should give you a big hint. Camera motor? Good guess, but not the camera motor. Something shaking? No, unfortunately it's too ordered to be shaking; if it were stochastic it would actually be easier to pick out. No, it's not the arm either. The progressive nature of it is what tells you: it's the capacitor. It needs to be able to store a lot of energy, and it's not doing a good job of holding charge anymore, so the scanner goes fire, pause, recharge, and as you get further and further through the experiment it gets worse and worse. And now you've got this spatial trend of signal intensity running through your experiment. Does this invalidate the experiment? Should you send it back to the array center? No, this is trivial to adjust with standard normalization methods, if you know it's there. So you have to look at the quality of your own studies and get a feel for what it is, and you have to put real time and effort into thinking about how to adjust for good and bad spot quality. Michelle, could I bug you for the lights again? Thank you.

So spot quality is a huge issue, and it's a platform-independent issue. It's an issue for methylomics, proteomics, everything you can think of. And how are you going to fix it? Well, you could nominate graduate students to do manual flagging and identify problems with studies. Unfortunately, and probably as you'd expect, they disagree with each other maybe 10 or 15% of the time, and it takes a lot of time. So spot quality is a huge unsolved issue. Most investigators who do genomic experiments never even think about it; it doesn't even occur to them that there is variable quality within their data set. And most bioinformaticians will think of it, contemplate it for a while, go, wow, that's important, that's really hard, and then ignore it, because there's no obvious solution. So I've told you about it, and now you know it exists, and I hope one of you will come up with a solution, because I have no idea what it is. There's essentially no published research on this anymore. I think in 2014 there was maybe one paper on low-level quality of microarray data, and I know of two on low-level quality of sequencing data. That's minuscule, especially compared with how much research there is, as we'll see, at other stages of the pipeline. These are opportunities the field is missing to do a better job. That doesn't mean we can't get good-quality information out of our studies, but it clearly means we're leaving things on the table. So by now, I hope you're depressed. I mean, are you depressed? You should be depressed. Good.
I like seeing that nodding. You should be depressed, because I came in and told you: microarrays, the classic omics. We've been doing this for twenty-plus years. We know how to handle it. It's the only omic data we really know how to handle. Except that we can't quantitate the data all that well, we don't know how to decide which background correction works, and we know there are lots of quality issues but we ignore them. So that's really, really depressing. How do we square the circle between thinking we know what we're doing and not knowing how to do any of it well? It all comes down to the next two steps. Essentially what we say, and what is said across pretty much all omics, is: okay, we've got all these physical biases and problems that we don't understand, and we cannot form an accurate physical model of reality. So you know what I'm going to do? I'm going to take some heavy-duty statistical tools, even out all that noise, and force the data into the distributions that I want. I'm going to manipulate my data strongly to generate distributions with good characteristics that let me do the downstream work. And I'll know it at least didn't become totally disastrous, because I'll validate some of the results with alternative techniques and say, oh, they validate. They don't all validate; maybe 90% of them validate, but that's okay, 90% is enough for me to do a lot of what I want experimentally. That's not intellectually satisfying, it's not intellectually optimal, but it is very pragmatic.

And so the next two steps, intra- and inter-array normalization, end up accounting for most of our noise reduction. Very little noise reduction happens during quantitation; quantitation probably increases your noise quite substantially. Background correction usually has a relatively modest effect. QA/QC is usually absent. And so instead we get to these stages where we really take mathematical and statistical hammers to our data to control it, to make sure it looks nice. And control is not an inappropriate word.

So, of course, the first thing we do is intra-array normalization. We work within a single array, doing things like balancing the different channels. If it's a two-color array with two samples on it, how do we make sure they're effectively in equal quantities? We also remove some of the spatial artifacts that we saw. This is typically done with a series of mathematical transforms that are quite robust. It is intensely researched: there are still new methodological papers coming out today for what you could argue is a solved problem, and it is no longer thought to be very difficult. The idea, again, is to either balance channels or remove spatial artifacts, and you can think of three types of bias. The first is spatial gradients. I'm going to make up a spatial gradient that looks really, really strong so you get the idea. Imagine your sample is tilted to one side: all of the hybridization happens at one end, the other end barely gets any, and your intensities are strongly biased to one side. That type of bias turns out to be reasonably easy to remove.
There's an algorithm called a Gaussian spatial smoother: essentially you fit Gaussians of different widths across the array, and they smooth out the spatial variability. It's actually kind of magical how effectively this works, especially for uniform artifacts. If you have multiple samples on the same array, then you can have pipetting issues. How many of you do experimental work? Awesome. So if I came to you and said I was a 98% accurate pipettor on that weighing-water test, would you think that's good or bad? You can be honest, because the answer is: really bad. Really, really bad. We should be more like 99.5 or 99.9. A good pipettor has something like a one-in-a-thousand error, and your average undergrad might be at 97 or 98% when they start in your lab, and with a little training they get better. So that sounds fine, but there are roughly 20,000 genes in the genome, and even with a one-in-a-thousand error you are extremely likely to see apparent gene-to-gene differences that are solely a function of pipetting. So you adjust for that, so the red and green channels behave as though they started from equal quantities of sample.

And then, last, there is something called intensity bias, which is actually kind of fascinating. What you're looking at here is a homotypic hybridization. Homotypic hybridization means the same sample analyzed twice, once on the x-axis and once on the y-axis; the dots are different probes and the axes are signal intensities. You can see they're very well correlated: great correlation up here, great correlation down there, not too many differences. You can also see that there's some inherent noise. What's kind of weird is that as we go down even to these low intensities, the amount of noise stays pretty consistent. The noise is independent of the signal. But that means how much the noise matters is strongly dependent on the signal. In other words, this measurement up here is 65,000 plus or minus 1,000: not a big deal. This measurement down here is 500 plus or minus 1,000. That is signal-dependent relative variance, it's a common feature of almost all genomic studies, and to say it has big consequences for the downstream statistical analysis is an understatement. So a variance-stabilizing transform of some type is extremely widely applied in microarray and other omic studies. I said a variance-stabilizing transform of some type; what's a really common one? How about just taking the logarithm? The logarithm alone will tend to stabilize the variance by compressing the differences between large values. So these transforms can be quite simplistic, or they can be much more sophisticated algorithms. The data you're seeing here, by the way, are real.
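As a hedged sketch of the two ends of that spectrum, assuming a matrix of background-corrected intensities: a plain log2 with an offset, and the generalised-log transform from the vsn package if you happen to have it installed.

```r
## Hedged sketch of two variance-stabilising options, assuming 'E' is a matrix
## of background-corrected intensities (probes in rows, arrays in columns).
set.seed(1)
E <- matrix(rexp(20000, rate = 1 / 1000), nrow = 5000)   # stand-in data

logE <- log2(E + 16)   # simple option: log2 with a small offset to tame low intensities

## More sophisticated option: the generalised-log (vsn) transform, if installed.
if (requireNamespace("vsn", quietly = TRUE)) {
  vE <- vsn::justvsn(E)   # returns values on a glog2 scale
}
```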
Couldn't you fix this just by normalizing? It depends on how you normalize. What kind of normalization are you thinking of that would get rid of that big difference in the top right-hand quadrant? Say your intensities go up to, I don't know, 20,000 or so, and another array's values only go from 0 to 2,000: you would normalize the highest intensity to fall within the range from 0 to 1,000, for example. So you mean standardize to a fixed range? Yes. So yes, you could do that. We pretty much never do, because it loses a lot of biologically relevant information: it loses the information about whether one gene is more expressed than another and how you would reason about high and low expression. So it could be done, and it would have some benefits, but it actually won't even get you what you're hoping for. Imagine we take that gene that's at 20,000 plus or minus 1,000 and divide the 20,000 by 20: we also divide the 1,000 by 20. The scaling applies to the variance as well, so it doesn't help with variance stabilization.

(What are these? Maybe you could bring the lights down. Sure.) So this is not a difficult thing to handle. These intensity effects are almost always fit using what are called robust splines or smoothing splines, which are very standard curve-fitting tools in engineering. Another very widely used algorithm for intensity effects is loess, a locally estimated scatterplot smoother, and it does essentially what you'd think: it fits small linear pieces step by step along a curve. And of course Gaussian spatial smoothing is often applied for spatial effects, and these are often applied sequentially: sometimes you normalize out multiple effects one step at a time. All the methods are well established. By well established, I mean you could write one line of R code and get any of these methods to run. So they are easy, and you don't have to understand the math to make them work, although you should.
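Here is roughly what that one line of R looks like with limma, continuing from the background-corrected two-colour object in the earlier sketch.

```r
## Sketch of intra-array normalisation in limma, continuing from the
## background-corrected two-colour object (RG.nexp) in the earlier sketch.
library(limma)

MA <- normalizeWithinArrays(RG.nexp, method = "loess")   # remove intensity-dependent dye bias
## Other method choices include "robustspline" and "printtiploess",
## which need the print-layout information for the array.

plotMA(MA, array = 1)   # MA-plot: the intensity trend should now be flat
```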
And after we get through that, we do inter-array normalization: the idea that we're going to balance the different arrays in our study so that they're comparable. There are many different algorithms for doing this. They are all extremely robust and very easy to run. There hasn't been much impactful new research here in the last seven or eight years, and it turns out to be about the best-solved area of microarray analysis. The basic idea is that pipetting error can lead to differential loading of sample between arrays. It could also be batch effects in array production or in how the arrays were run, or some other systematic artifact. Whatever the cause, your solution is going to be to scale the arrays, and it's extremely easy to handle. When I say scale, I don't actually mean mathematically scaling or shifting; what I mean is distributional normalization. For example, this is a series of arrays before and after. It's two-color, red and green. The y-axis is the fraction of probes with a given intensity and the x-axis is that intensity; this is a standard way of displaying an array experiment. And you've got all this variability from chip to chip, each line being a different chip. A distributional normalization will do that to it: it forces them all into a common distribution. Is that a good thing to do, to force all of your data into a single distribution? Who thinks it's a terrible idea? One, two, three, four; a few people who are very shy put their hands up like this. Somebody tell me why you might think it's a good idea. You can compare values across arrays, and your statistical tests assume the samples have a similar distribution. Good. So it might better match the assumptions of a large fraction of statistical tests and let you do things better. But, someone says, if some of the red ones are a different type of sample, tumors say, and some of the green ones are normals, they shouldn't all match. Exactly the problem. Take a look at the before data on the left: can you figure out how many distinct sample groups are in there, within the noise of the experiment? That variability turns out to be solely experimental noise from hybridization differences between red and green. There is actually just one group here, but you would never be able to pick that out for sure. And so, exactly the problem. If you have an experiment that has liver and kidney samples, should you apply distributional normalization? Should you apply distributional normalization separately for the liver and separately for the kidney and then try to compare the two? Ah, don't do that. Anybody think that was a serious suggestion? Do not do what I just suggested. You have to understand that distributional normalization is the sledgehammer of all sledgehammers. It says: no matter what my data looks like, I am going to force it to look like this at the end. That can be super valuable, because it can rescue experiments with noise that we have no understanding of, and it can be extremely valuable in allowing you to detect low-intensity effects. But it is also extremely dangerous, and you have to treat it with great caution. Let's talk about this experiment, and we'll come back to it a little later. You see that outlier right over here next to my mouse? I applied distributional normalization. I tried to force that sample to look like everything else. I even percentile-matched every probe on it, trying to force it to fall within the same percentiles as the rest. And it still failed. That sample still looks like an outlier; it's fundamentally different, even though its distribution looks that good right now. Distributional normalization should usually yield essentially bang-on identical distributions, and if it doesn't, you almost certainly have a big problem. So these methods are widely used, easy to work with, and have a lot of value.
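In the same spirit, a short sketch of the inter-array step in limma: quantile normalization is the sledgehammer distributional option, scaling is the gentler one.

```r
## Sketch of inter-array (distributional) normalisation in limma,
## continuing from the within-array-normalised object MA above.
library(limma)

MA.q <- normalizeBetweenArrays(MA, method = "quantile")  # force every array onto a common distribution
MA.s <- normalizeBetweenArrays(MA, method = "scale")     # gentler alternative: match scale only

plotDensities(MA.q)   # after quantile normalisation the density curves should sit on top of each other
```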
So let's go back to where we are for a second. We start with a raw chip, we quantitate it, background correct it, assess spot quality, and do a couple of steps of normalization. And what we end up finding is that we do quantitation, but people don't pay much attention to it; they often just use whatever algorithm is presented by the vendor of their software and don't think about it much. Background correction is often done in some way that is opaque to users, is not carefully thought through, and is difficult to evaluate as right or wrong. Spot quality is usually ignored, which unfortunately usually means people don't even look at the images of their studies, which you should do all the time because they can point out all sorts of useful things. And that leaves the place where most of the improvement ends up happening, which is normalization. Because we can't come up with accurate models of the physical processes that generate our data, we end up using statistical techniques that work on the distributions instead. That's second-best: if we had an accurate model, we would use it every time. We just don't yet.

And all of these things together are called the pre-processing of our data. So, I've been going for an hour and forty minutes. Do people want a break before we come back, talk more about how pre-processing works in an applied sense, and then get into actually doing some analysis? Please.

So we left off with me depressing you, and now I'm going to depress you a little bit more. I depressed you by telling you that we have to pre-process data to make it reflect the biological things we care about, and that we're bad at doing so because we can't accurately, physically model the processes that generate the data. That lack of accurate physical modeling forces us to use heavy-duty statistical manipulations of the data, and that whole stage is called pre-processing. Once we have successfully pre-processed the data, we can start thinking about what we're going to do with it, about the kinds of questions we'd ask statistically. There are a number of things that are particular to statistical analysis of big genomic data, and there's a fair amount of research into this; it's quite an active topic. I won't go into exhaustive detail about the stats, but suffice it to say a lot of the questions are actually familiar. Are two groups different? That's the classic statistical question you spend most of your time on in undergraduate courses. Do two factors synergize? That's essentially a two-way ANOVA or a general linear model. And again, the methods are pretty much the same as you might use in regular statistics, with some important tweaks to make them more appropriate and to avoid some of the disasters that can come up with genomic data. The one we're going to focus on in the practical portion is whether or not two groups are different. So we'll start with a simple statistical question, and you can use it as an analogy for how you build up to more complex ones.
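As a preview of that practical portion, here is a hedged sketch of the two-group question in limma on a made-up expression matrix; the sample names and group labels are invented for illustration.

```r
## Hedged sketch of the "are two groups different?" question with limma,
## on a made-up log-scale expression matrix with three tumours and three normals.
library(limma)

set.seed(2)
expr <- matrix(rnorm(6 * 5000), nrow = 5000,
               dimnames = list(paste0("probe", 1:5000),
                               c("T1", "T2", "T3", "N1", "N2", "N3")))
group  <- factor(c("tumour", "tumour", "tumour", "normal", "normal", "normal"))
design <- model.matrix(~ group)

fit <- eBayes(lmFit(expr, design))    # moderated t-statistics across all probes at once
topTable(fit, coef = 2, number = 10)  # the ten most differential probes
```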
I will speak only briefly about clustering and machine learning, and say that clustering is an extremely widely used technique in genomics for data visualization. You can think of clustering as a way of finding patterns in the data, where the patterns correspond to clusters: groups of things that look similar. It's a very small branch of a field called machine learning. So do they get machine learning at all this week, Michelle? Good. So how many of you have used machine learning today? Only that many? Okay, let's start over here: what did you do with machine learning today, like this morning? You watched me talk about clustering, so you heard about it that way; that counts, and it's very flattering that my talk is now machine learning. I'm a machine. Has anybody done more applied machine learning today? Your alarm clock? Yes, we could call that machine learning, but we can do better. Google? Absolutely, Google search is machine learning. Anybody else? Logistic regression, as statistics? You could call that machine learning too, but I think we can come up with some really classic examples: elevators, high-frequency stock trading. I don't know if you've been doing high-frequency stock trading this morning, but if you did, you made a lot of money and still wanted to be here to celebrate. Weather forecasting is intensely machine learning. On Amazon: if you read this book, you might also like this book. Google Ads: if you came from this page, you're likely to like that page. YouTube's recommended videos. Spotify, the one that picks the next song it thinks you'll like. Facebook, for your timeline. In short, it is one of the foundational technologies behind things you care about in life; it's an incredibly important field. There are sometimes distinctions made, appropriately, between statistics and machine learning. You can argue that all machine learning is a sub-branch of stats, but not all stats is a sub-branch of machine learning, and some people care a lot about that distinction. Suffice it to say, the distinction you care about right now is that clustering is a specific sub-type of machine learning called unsupervised machine learning. Unsupervised because you've given it no information about what the groups are. You don't go into a clustering algorithm and say, these are tumors and these are normals, now figure out what the differences are; that's supervised machine learning. In unsupervised machine learning you say, I have a bunch of data, is there structure in there? And it will tell you, there are two really big groups that are really different. Then you might go and annotate that yourself and say, hey, group one is tumors and group two is normals; there's a big difference. So this is an incredibly valuable thing, but it's also very, very overused in bioinformatics, and bioinformaticians make recurrent, fundamental mistakes with it. I'll walk you through what the output of a clustering looks like first, and then we'll talk a bit about the mistakes.

So this is a heat map. Heat maps are matrices that encode a value, usually a quantitative one, using color. In this case it's a red-green heat map, and you should never, ever do what I'm showing on screen. Do we have anybody here who's red-green color blind? It affects something like 3% of the male population. Don't do that in your figures; I put this up simply as an example, to remind me to tell you not to do it. You should be using blue-red, blue-yellow, purple-white, pick something better. There are a lot of better ways of doing this. Either way, the colors encode the information, and you've got rows and columns whose sortings or orderings are given by the dendrograms. If those look a whole lot like phylogenetic trees, that's because they should: there's quite a lot of similarity in how they can be generated, although they're not identical. And the heat map proper is the colors. You can have one without the other, but they are often shown together.
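A small sketch of a clustered heat map in base R with a blue-white-red palette instead of red-green, reusing the made-up matrix from the previous sketch.

```r
## Sketch: a clustered heat map with a blue-white-red palette instead of
## red-green, reusing the made-up 'expr' matrix from the previous sketch.
pal <- colorRampPalette(c("blue", "white", "red"))(50)
heatmap(expr[1:100, ], col = pal, scale = "row")   # row and column dendrograms are drawn by default
```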
And I can compress several hours of machine learning into this: what you're basically trying to do is take a scatter plot that looks like this and find out where the groups are. In this scatter plot, each point is a gene, plotted by its response to stimulus 1 on one axis and stimulus 2 on the other: induced or repressed by each. And you would like to draw circles around the clusters and say, these genes behave similarly. These are induced by one and repressed by the other, these are induced by the other and repressed by the first, and then there are a couple of outliers. You want to be able to draw those circles. If you're trying to do that, the two metrics you care about are, first, how tight is the cluster? In a perfect world a cluster would be so tight it contained a single data point, which is totally useless, so it's not just cluster tightness; it's also how far apart the clusters are. You're trying to balance those two things, coming up with clusters that are quite far apart but reasonably tight themselves, and you can trade them off to come up with an optimal clustering. There are many different ways of encapsulating that trade-off. Sometimes you encapsulate it by simply saying, I think there are five clusters; that's perfectly reasonable, you're giving your own intuition about the structure. Alternatively, you can go back to the dendrogram and cut it at different levels: cut it here and you've got two groups, cut it over here and you've got three, or four. So you can decide that way, and there are a lot of different ways of deciding. All of those decision-making metrics, and you should think about this, are heuristics. They are rules of thumb that are informative and helpful, but they're not statistical facts handed down from on high such that the world ends if you violate them. Sometimes there really are two groups in your study and you're trying to confirm a hypothesis, and that's perfectly viable.

This exact same diagram will also give you a lot of insight into the difference between supervised and unsupervised machine learning. Pretend for a second, as a thought experiment, that all of these genes over here are blue, all of these over here are also blue, and these ones are red. Now imagine I have a new gene right where my cursor is. Is it going to be red or blue? Hands up if you think it's going to be blue. Hands up if you think it's going to be red. Good, it's going to be red, and you just did machine learning. The algorithm you used in your head is probably something like k-nearest neighbours: you looked at its neighbours, saw that the five or six closest genes are all the same, and guessed it's the same too. Notice how different that is from clustering. In clustering we're trying to draw the groups; in supervised machine learning we're asking what will happen when we add one new point, where the decision boundaries are. Those are fundamentally different questions, and if you confuse them you can cause yourself heartache and disaster.
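To make that clustering-versus-prediction distinction concrete, here is a sketch using base R's hclust/cutree for the unsupervised side and the knn function from the class package for the supervised side; the labels are invented.

```r
## Sketch of the two ideas side by side, on the made-up 'expr' matrix:
## unsupervised clustering (cut a dendrogram into k groups) versus
## supervised prediction (k-nearest neighbours). Labels are invented.
hc     <- hclust(dist(t(expr)))   # cluster the samples
groups <- cutree(hc, k = 2)       # "cut the dendrogram" into two groups

library(class)                             # ships with R
knn(train = t(expr[, 1:5]),                # five labelled samples
    test  = t(expr[, 6, drop = FALSE]),    # one new, unlabelled sample
    cl    = factor(c("blue", "blue", "blue", "red", "red")),
    k     = 3)                             # predicted label for the new sample
```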
So why do we use clustering? Well, there are a lot of really good reasons. One is data visualization: it just makes it a lot easier to visualize complicated data. A second is to predict class assignment, and we'll talk about what that means in a second; it sounds like I'm describing supervised machine learning there, but I'm not. The last one is quality control. It's really nice to take a large dataset and ask, does this dataset cluster by the technician who ran it? Or, if you're mixing experiments from your own data and public data, does it cluster by the site that generated the data? You'd be quite surprised how often that happens, and this is a very informative way to look for it.

As an example of predicting class assignment, this is something very widely done. Most genes have no functional annotation, no experimentally derived functional annotation: in yeast it's about 20% of genes, in humans about half. So you'd really like to be able to predict their function automatically. Imagine you took 300 yeast strains, each with a knockout of a specific gene, and for each strain you did expression profiling; so the columns are strains and the rows are genes. And you look at it and say, gosh, right over here all the genes involved in mitochondrial function are clustered together. There are 21 genes known to be involved in mitochondrial function in this cluster of 22 genes; I hypothesize that the 22nd is a novel mitochondrial gene. That's a pretty darn good hypothesis, and that was how a large fraction of genes got their initial functional annotations, either that or domain-structure studies. The first time it was done was, again, by Tim Hughes; it was the first demonstration of the utility of the Agilent inkjet arrays. After a nice paper describing their manufacture, there was a nice Cell paper doing exactly what I described: 300 strains, all expression profiled, and the clusters analyzed and demonstrated to be functionally coherent.

Now, there are a lot of bad things you can do with clustering. They include clustering pre-selected data. Imagine I do an experiment comparing tumors and normals, and I identify 91 genes that are significantly different between tumors and normals. Then I cluster those 91 genes and find that they perfectly separate the samples into tumors and normals. Yeah, of course they do: I just did a statistical test to select for exactly that. The clustering adds nothing. It's not meaningful if you bias the set of things you're clustering; you can't infer anything from it. If you do that kind of selection, that supervision, then all you're doing is visualization, which is totally okay. You just have to be careful not to say things like, "and the clustering supports my results," because it doesn't. You will also occasionally find a collaborator who looks at your data and says, geez, I want these genes, and points at a region of a heat map. Don't pull genes out of a heat map as being the genes of interest. Figure out what they're actually trying to get at, the underlying biological trend, and construct an appropriate statistical test. You will never find that a clustering analysis is better powered statistically than a proper statistical analysis; it's not possible. And lastly, clustering is a statistical procedure. If you see something and think, oh, that looks like a non-random cluster, put a p-value on it. If you think your clusters are enriched for tumors versus normals, you can do a chi-square test, or Fisher's exact test, and test exactly that hypothesis. That's how you can be sure the results you're getting are real.
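Putting a p-value on a cluster really is a one-liner; the counts below are made up purely to show the shape of the test.

```r
## Sketch: put a p-value on "my cluster is enriched for tumours".
## The counts are made up purely to show the shape of the test.
tab <- matrix(c(18,  2,    # cluster 1: 18 tumours, 2 normals
                 3, 17),   # cluster 2: 3 tumours, 17 normals
              nrow = 2, byrow = TRUE,
              dimnames = list(cluster = c("1", "2"), type = c("tumour", "normal")))

chisq.test(tab)
fisher.test(tab)   # preferred with small counts
```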
And people routinely make those kinds of mistakes: drawing conclusions from a clustering when they've biased their experiment, or doing something sub-optimal in the first place by missing a better technique, or not actually providing any inference about chance. So, what have I told you? Well, microarray data is analyzed with a pipeline of sequential algorithms, that pipeline defines the standard workflow for microarray experiments, and it looks like this. If you forget everything else I've said, and if you're sleeping, wake up: this is the most important thing. Remembering the pipeline and understanding that flow of how things work is the most important point. The second most important point is that this is very active. Even in a field that's 20 years old, where the technology hasn't really changed much, we're still coming up with better ways of doing things and thinking through new algorithms. You could argue about whether the research is in the right places, but we know there are major unsolved problems. And summary point number three: I talked completely generically there. I didn't talk about Affymetrix or Agilent in great detail, I talked about microarrays, and those basic steps are really similar to the basic steps of a sequencing study. They're pretty much the same as the basic steps of a ChIP-seq or a proteomics study. There are tweaks and adaptations for each, and it's not identical, but this is the core of how you think about bioinformatics.

So now we're going to look at some data. We're going to look at Affymetrix data, because it's the single most widely available type of genomic data, and we're going to do so using Bioconductor. Have you used Bioconductor yet? Yeah? Awesome. We're going to use Bioconductor largely because it's really, really good for microarrays; it's kind of the standard. The people developing new methods are developing them for Bioconductor, and it's cutting-edge, state-of-the-art stuff. The general workflow is the one we've discussed, but there are always tweaks or techniques that have to be optimized for whatever specific platform you're looking at. So let's talk about how Affymetrix data is analyzed: what are the differences between the general workflow and the Affy-specific workflow? Well, it starts off with all of the same problems, but as you know, quantitation is typically done by commercial software that nobody looks at, and to make our life simple today, we'll do just that. We're going to take the quantified data from the Affymetrix software, rather than go through the long and fun exercise of trying to grid the spots ourselves and discovering that it's impossible. Affymetrix data is one channel: there's only one sample per array, so we're only going to have one intensity measure for each probe. Being single-channel means it's typical to do all of your normalization in a single step: you do inter- and intra-array normalization simultaneously rather than breaking them out sequentially. You actually could break them out sequentially with no mathematical difference, but nobody does; it's just convenient to do them together. So if we collapse and rephrase the pipeline a little, we get this. I've added one step that I hadn't talked about before, probe-set annotation, which just means saying which gene each probe actually belongs to; it's no more complicated than that. So you start off with the quantified data, the CEL files, the chip expression level files.
You background correct them, normalize them and annotate them, in other words pre-process them, and then you do statistics, clustering and downstream analyses to extract information. So not all that different, really, just a slight re-grouping of the steps.

Let's quickly talk about that probe-set annotation, because it turns out to be really important for Affymetrix data, and if you're analyzing your own data it's something you want to think through very carefully. Arrays have a big weakness relative to sequencing: the array itself never changes. If I find there's a gene I didn't know existed before, it's already in the sequencing data I have right now, and I can get my team to go back through it and say, I think there's an extra gene in here. GRCh38, the newest release of the human genome, includes a hundred genes that had not appeared in any previous release. Think about that: there are a hundred genes that haven't been looked at in most sequencing studies, simply because they weren't part of the reference. But they're still there; you can go back to your old experimental data and drag that information out. They just weren't being looked at. With arrays, if something hasn't physically been put on the array, you may be in trouble, because you might not be able to get it back. And that's a problem, because over time the genome sequence changes substantively. Novel splice variants are found. Genes turn out to be duplicated. Sometimes genes are removed entirely because they're found to be artifacts. SNPs and polymorphisms are identified. So what you think is the right definition of a probe can change: some probes you thought were informative may no longer be, while other probes you thought uninformative may now capture an important splice variant. What you'd like to do is improve your array, and the obvious way to do that is to make a new one. That's expensive, and Affymetrix doesn't want to make new arrays quickly. So instead, what researchers do is exploit the fact that there are multiple probes per gene. The average gene on an Affy array is covered 10 to 20 times, with probes of 25 base pairs each. So you take those 10 to 20 sequences and BLAST every single one of them against the most recent version of the human genome. You'll find that something like 30% no longer match; toss those. You'll find roughly 40% match exactly where Affymetrix said they would; leave those alone. The remaining 30% may represent specific isoforms, or match two different genes that cross-hybridize because of high similarity, and those you re-annotate, so that your array analysis uses the most recent genomic information. The file in the Affymetrix analysis protocol that makes this possible is called the chip definition file, the CDF file. It's an incredibly good diagnostic for whether somebody has competently analyzed an Affymetrix experiment: if they bothered to do this, the chance that they botched the rest of the analysis is fairly low, and if they did not, you immediately wonder. It should be a standard part of any good analysis workflow. And basically it's a file you can download off a website, and when you incorporate it into your analysis, it automatically updates the annotations for you.
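A hedged sketch of what that looks like with the affy package; the cdfname value is a placeholder for whichever re-mapped CDF package you actually have installed (for example, one of the Brainarray custom CDFs).

```r
## Hedged sketch with the affy package. The cdfname value is a placeholder
## for whichever re-mapped CDF package you have installed (for example one
## of the Brainarray custom CDFs).
library(affy)

ab.default <- ReadAffy()                                 # all CEL files in the working directory,
                                                         # vendor probe-set definitions
ab.updated <- ReadAffy(cdfname = "hgu133ahsentrezgcdf")  # hypothetical custom CDF package name
```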
The most widely sold Affymetrix microarray, even today, is the HG-U133 array. U-133. Does anybody know what that stands for? The 133rd update. Right, but the 133rd update of what? Ah, Fazlis knows: UniGene. So who knows what UniGene is? UniGene used to be an incredibly important database. It took all the EST sequences, expressed sequence tags from early transcriptome sequencing, and merged them with reference gene models and the emerging human genome builds. If I go to the UniGene website, which is NCBI slash unigene (I'll probably break this, it's a Mac and I don't use Macs), you'll find that we're currently on build 230 or something like that. So given a release cycle of roughly once every month or two, you can see that the most commonly sold array today is built on definitions of the human genome from a decade ago. If you believe human genomics has improved at all in the last decade, you want to use an updated CDF file to capture that. And it's free: you download one file, add one extra line of code, and you have a better analysis. In some sense it's the biggest simple thing you can do to make your life better.

Now let's go back quickly to pre-processing, and let me be clear about what exactly pre-processing is. What would be a definition of it? I define pre-processing, as most people do, as the removal of technical, often systematic, sources of noise; non-biological noise. If your pre-processing removes important biological signal, it has failed, perhaps not disastrously, but failed. And if it fails to remove the technical sources of noise, you're also in a miserable position. So where does the technical noise come from? When you think through your analysis, where does it come from? It comes from here, the construction of your array, and from here, every step of the actual sample-prep pipeline. So it's design, manufacturing, quality, hybridization. I put ozone up here because ozone levels affect the quality of your array; a good array center will control ozone levels. Before people knew that, they simply avoided doing microarrays in the summer, because they got really bad results and couldn't figure out why, or they did them really early in the morning. We had a window here in Toronto where you could do good microarrays between about 10pm and 4am, so you'd have a whole group of people in the lab running the experiments overnight and sleeping instead of analyzing afterwards.

So pre-processing is trying to remove those systematic artifacts. It is not an excuse to do bad experimental design; you still have to get the design right, and I want to point out a couple of things to keep in mind. First, try to balance your experimental groups: if you are comparing two groups and that comparison is what you actually care about, try to have equal numbers of each as much as possible. While you're at it, randomize the sample order in which you run the experiments. If you have to choose between biological and technical replicates, choose biological replicates for your actual experiment; do the appropriate work-up of the noise characteristics of the study, but biological replicates are preferred and will maximize power. And if you can't do all of that, include controls.
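A trivial sketch of the balanced, randomized run-order idea, with made-up sample IDs, so that group and processing order are not confounded.

```r
## Trivial sketch of a balanced, randomised run order, so that group and
## processing order are not confounded. Sample IDs are made up.
set.seed(42)
samples <- data.frame(id    = paste0("S", 1:20),
                      group = rep(c("tumour", "normal"), each = 10))
samples$run_order <- sample(nrow(samples))        # randomise the order of processing
samples[order(samples$run_order), ]
```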
Imagine your supervisor comes to you and says, okay, we can do this experiment, it's 200 arrays, but I have grant money to do 100 this year and 100 next year. The right thing to do is to say: awesome. We'll do 100 this year, in December, and 100 next year, in January, to keep the batches as close together as possible. And because I'm still worried about batch differences in January, we'll run 90 new ones and repeat 10 from the set we ran the year before, and use those to see just how much things have changed. On one of the large projects I run, we've been doing 500 methylation arrays spread out over, I guess, almost three years now. So we have standard samples that we run routinely; we decided at the beginning that these are samples we're going to measure many, many times to make sure the results don't change significantly. That tells us whether our normalization is working well and gives us confidence in the results. Thinking through those experimental design issues will make your life a lot better.

Okay, the last thing before we get to giving you fun and exciting stuff to work on is to point out that there are two major ways of processing Affymetrix data. One is called RMA, for robust multi-array average, and the other is MAS5, for Microarray Analysis Suite version 5. They are probably the two most widely used methods for Affy data, period, and they are in routine use by groups around the world today. You can think of them as a precision-accuracy tradeoff. MAS5 will give you a more accurate measurement of the true value of how much something is expressed, but with more variance, more variability. RMA is the inverse: it will often be biased to one side or the other, but its measurements will be much more precise. And you can see why you might want one or the other in different situations. If you've got an n-of-3 cell line experiment, you might not mind a few extra false positives, because you're not going to find much anyway. On the other hand, if you have a 500-patient cohort, you may really want to be sure your measurements are as close to accurate as possible. So what we're going to do is