So, I'm a principal investigator here at OSDR. I started my lab about five years ago, a little less, and my team is basically looking at using omic technologies to develop biomarkers that predict how we should treat different patients. In some sense we use any omic technology, really any large data set we can get our hands on. And we're going to spend today talking about microarrays. Microarrays are a really good technology to talk about because they're probably the only omics technology where we actually, more or less, know what we're doing. I think it's fair to say, having seen the rest of the course, that there's a lot of lack of clarity and imprecision about how you might analyze other types of data. With microarrays, there are some things we can clearly say are right and wrong, some things we know work well and some that don't. There are obviously still opportunities to improve the way we do the analysis, but it's reasonably worked out. The other part of it is that, as we've gone forward, we've started to understand general principles of how to analyze large-scale genomic data from microarrays, and those principles are now being transferred to all the other types of data. So microarrays have led the standardization that's now being seen for other kinds of information, and it's a good way to end off, or almost end off, the course: by seeing where all the other omic data types are ultimately going to get to. Obviously all the slides will be available to you and you can use them as you wish under our Creative Commons license. We're going to talk about gene expression profiling with gene expression microarrays. We'll spend the first two hours or two hours and fifteen minutes talking about what microarrays are, how we pre-process them, and what the characteristics of the data are. Then we'll spend the last hour and a quarter or so, after a coffee break, looking at how you analyze a microarray experiment. And unlike sequencing, you can trivially download a microarray experiment to your computer and work with it; that's one of the nice things about it. During my PhD I analyzed most of the major experiments on planes while travelling to conferences, so it's something that is much more viable to do while you're on the road. So the things I want you to take away, even if you forget just about everything else I say, are these. Number one, understand the different types of microarrays; there are several. Number two, understand where the noise comes into the experiment; if you know where the noise is, you can at least figure out whether your experiment looks good or not. Three, appreciate the entire pipeline: what are the steps involved in an analysis, so that when you read a paper you can ask, is a key step missing? Have they forgotten something entirely important, or have they not given me the information I need to interpret what was done? And then, on the practical side: how you can import the raw data into R and Bioconductor, how you can pre-process it, and maybe how you can get into the standard statistical analyses. That last part some people will finish and some won't, depending on how far people get during the practical, but we give full answers so you can see template code that works through those types of analyses. You'll also discover that I ask a lot of questions, so you're going to have to stay awake.
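To preview the practical, here is a minimal sketch of the kind of template code we'll work through, assuming Bioconductor's affy and limma packages, a directory of CEL files, and a simple two-group design; all file paths and group labels below are placeholders, not a specific experiment.

```r
# Minimal sketch: import, pre-process and test an Affymetrix experiment
# (assumes Bioconductor's affy and limma packages; paths and labels are placeholders)
library(affy)    # CEL-file import and pre-processing
library(limma)   # linear models for differential abundance

raw  <- ReadAffy(celfile.path = "celfiles/")   # read all CEL files in the directory
eset <- rma(raw)                               # background-correct, normalize, summarize (RMA)
expr <- exprs(eset)                            # log2 expression matrix: probesets x samples

groups <- factor(c("control", "control", "control",
                   "treated", "treated", "treated"))   # one label per array, in CEL-file order
design <- model.matrix(~ groups)

fit <- eBayes(lmFit(expr, design))             # moderated t-statistics per probeset
topTable(fit, coef = 2, number = 10)           # top differentially abundant probesets
```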
And let me start off with one. Microarrays are really old; the first version of them is from 1995, so almost 20 years ago. So what do they measure? What does an expression microarray, a standard Affymetrix or Agilent array, measure? Hybridization signal. Anybody want to tweak that answer? Intensity of probes. Okay. Sorry, mRNA levels. I saw one more hand at the back. So we've got hybridization, probe intensities, and mRNA levels. There are a couple of things to comment on there. One is that they only capture steady-state mRNA abundances. We will often, as I did on the slide, call them expression arrays; it's a terrible term. They're not really measuring expression, they're measuring abundance. And they don't even measure an absolute level; they generate a signal intensity that is roughly proportional to the abundance. In fact, we don't get a particularly good estimate between genes of which genes are more expressed and which are less expressed. What we really find out, for an individual gene, is an ordering: essentially they tell us the order of the samples for each individual gene. We do that for every gene in the genome, and therefore we have the ability to learn quite a bit. So we're going to start off talking about what these are, fundamentally. We'll talk a little about the technologies and a lot about how they're analyzed and the key steps that are common to all of the technologies. And then we'll zoom in on one particular technique, Affymetrix arrays, which are the most widely used today, and look at some of the details of how they're analyzed before trying that in the practical. So here's a kind-of-good definition of a microarray; this is the Wikipedia definition. It's a multiplex technology containing thousands of oligonucleotide spots, each with a specific DNA sequence. It's not actually a bad definition. It's a little bland, but I think the key point is that it's multiplex. A microarray is a way of measuring a lot of things in parallel, and that has two consequences. One is that the errors you observe on each of those things will be correlated with one another, because they're measured at the same time. The second is that you can generate very large amounts of data, and large amounts of data allow you to do statistical and other types of analyses that would otherwise be impossible. The whole point is to quantitate something. The whole goal of a microarray is to let you measure how much of something you have: how much RNA, how much DNA. DNA applications can be specific to particular sequences, so you can say here's how much of allele A versus allele B, a genotyping array, or here's how much I have of one region of the chromosome versus another, a copy-number array. So there's a wide variety of applications: anything you can imagine measuring about DNA or RNA can be measured using a microarray. Fundamentally it's a flexible technology, and we shouldn't think that microarrays only measure DNA, or only measure RNA, or only measure RNA abundances. You can measure splice variants, genotypes, pretty much whatever you can think of. And in general, although not exclusively, we don't use microarrays as a technique to verify hypotheses, although there are clear exceptions to that.
Instead, we usually use them to generate new ideas, new questions. They're a screening technique: they highlight specific features or genes that are of key interest to investigate further. There are obvious exceptions; my whole research program is around biomarker discovery, which is a hypothesis-driven experiment rather than hypothesis-generating, and there's pathway analysis and things of that nature. But because they're mostly hypothesis-generating, there's often an idea that, therefore, I don't need to worry about experimental design. Actually, the fact that we typically can't afford to do as many experiments as we need for statistical power, and you'll talk about statistics in more detail this afternoon, makes it very important that you design these experiments correctly. We'll talk in about 45 minutes about some of the key principles of designing a microarray experiment, but suffice it to say that a poorly designed microarray experiment is orders of magnitude harder to analyze than a well-designed one. You can save yourself months of time by simply thinking through your experimental design properly. Now, the fundamental nature of an array. We start off with some sort of sample. Typically we're going to have several samples; we might have samples across a clinical range, different types of disease, or different patients with different outcomes. And we're going to extract DNA or RNA to put onto the microarray. That sounds simple, but it's actually one of the first challenges: it's difficult to get your samples to really look similar. What happens if some of your samples are a precious tumor type, something very rare, and as a control you say, oh, well, we can go and get the benign version of that tissue from patients who are coming into the hospital right now? Now you're comparing a control group, which is fresh-frozen samples, to a treatment group, which is formalin-fixed archival tissue that has been sitting in a hospital, in air-conditioned storage, for 40 years. And then you say, I expect there are no technical differences. Well, it turns out you can trivially detect the RNA and DNA differences between those types of samples. The sample preparation has as big an effect, or a bigger one, on your data than anything you do in the bioinformatics, so you have to design up front to make sure your samples are as similar as possible. Similarly, it's not just the sample, it's how you do your extraction. A poly-A versus a total RNA prep: completely different biases and sources of error. If you gave me the same samples processed with both protocols, you could pick them out, clear as day, as two separate groups, because they have systematic biases. And there's nothing wrong with those systematic biases as long as we know they're there and we can control for them. But it means that from the earliest step we have to be thinking about the samples and their characteristics, so that we can incorporate that into our pre-processing and our statistical models. So let's say we've got a great sample, and we're going to put that sample onto a microarray. Here you've got your simple, one-spot, simplest possible array, and some terminology. The glass substrate on which the microarray sits is called the chip. An individual spot, a group of DNA molecules all of which are identical, is called a feature, and the individual DNA molecules are called probes.
This sounds stupid, but in practice it took the field something like six years to settle on this terminology, and it led to huge confusion. So we've got the probes that are on the chip, and we've got a sample of high-quality DNA or RNA that we extracted; for now I'm just going to say RNA to keep life simple. The first thing we want to do is label it with a fluorescent dye, and we call this labeled material the target. The target is your sample; the probe is on the chip itself. We're going to hybridize the target onto the chip. That allows standard Watson-Crick base pairing, which creates an affinity between complementary sequences; that base pairing is the whole basis of how microarrays work. At the next step we're going to have all sorts of non-specific hybridization, essentially noise, and we wash that away: washing conditions stringent enough to remove non-specific hybridization but keep the hybridization we really want. This is part of what makes microarrays super flexible, because we can use it cleverly. We can wash so stringently that only perfect matches remain, and now we're genotyping. Or we can relax the washing to allow certain mismatches, which is really useful if we have samples from a species we don't know very well. Ultimately, we scan each individual spot. The target DNA is fluorescently labeled, so by scanning we can simply see how much light is emitted, and that gives us a measure of how much signal is at that spot. So what we're doing is detecting a fluorescent signal that is proportional to the amount of hybridization at that individual feature. Obviously a microarray doesn't contain one spot, it contains a series of spots, and for many microarrays we don't use just a single color. Instead, we start with an individual sample, say a liver, and label one aliquot with a red dye and another with a green dye. That gives us two identical samples that differ only in their fluorophore label. We mix those together and put them on a microarray. This is called a homotypic, or self-self, hybridization, and it's a very important quality-control metric. Essentially, what you expect is equal signal intensity in both channels at every spot. So you can see here: at the spots where red and green are equal, there is, as expected, no bias from sample to sample; at other spots you see differences, and that gives you an estimate of the noise in the microarray. About half to 60% of microarrays use a single dye; those are primarily Affymetrix-style arrays. About 40% to 50% use two dyes; those would be things like Agilent arrays. Both technologies are perfectly viable. With two-color arrays you have the advantage of being able to do this additional quality-control check, looking at homotypic error rates, which is very useful. On the other hand, having two dyes inherently brings in additional complexity and noise, so single-color arrays are sometimes thought to be simpler. Now let's stay with the two-color design and introduce a second individual. We treat them differently, one with a drug, the other without. We again extract DNA or RNA from each, mix them in a tube and hybridize. And this gives us good estimates of the abundance of each gene in a relative way. So microarrays don't really measure absolute abundance.
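To make the two-color ratio and the homotypic quality-control idea concrete, here is a minimal sketch in R on simulated intensities: in a self-self hybridization the per-spot log-ratio M = log2(red/green) should center on zero, and its spread is an estimate of the array's technical noise. All values below are simulated placeholders, not data from a real array.

```r
# Homotypic (self-self) hybridization QC sketch: same sample in both channels,
# so log2(red/green) should be ~0 at every spot; the spread estimates technical noise.
set.seed(1)
n.spots <- 10000
true.signal <- 2^runif(n.spots, 6, 14)               # simulated spot intensities (placeholder)
red   <- true.signal * 2^rnorm(n.spots, sd = 0.25)   # same sample, independent technical noise
green <- true.signal * 2^rnorm(n.spots, sd = 0.25)

M <- log2(red / green)          # per-spot log-ratio; expected ~0 for a self-self hybridization
A <- 0.5 * log2(red * green)    # per-spot average log-intensity

summary(M)                      # should be centered near 0
sd(M)                           # an estimate of spot-level technical noise
plot(A, M, pch = ".", main = "Self-self hybridization MA plot")
abline(h = 0, col = "red")
```

In a real treated-versus-control hybridization, the same M value is the relative abundance the array reports for each gene.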
What they give you is relative abundance. They'll say, in group A, the animal on the left versus the animal on the right, this is the ratio: there is more of this gene in the animal on the left, and more of that gene in the animal on the right. So it provides a relative assessment of abundances. I mentioned that microarrays allow non-specific hybridization. That's really important because, for example, that's how cross-species studies get done; essentially every important cross-species study of the last 20 years was done using a microarray. Why? Because we can take a human array and hybridize a chimp to it. You can hybridize a chimp or a bonobo or other primates to a human array and get very, very good signal. Similarly, we can do the same thing for plants, and so a vast amount of agricultural research has used this. There was not sufficient laughing there, which means not enough people recognized that we can put many different related agricultural species onto the same array. There are obviously still error rates, but this allows people who are researching one type of orange to study all citrus fruits simultaneously. That's a lot cheaper: the cost envelope of an array is about $150, the cost envelope of a sequencing experiment is about $1,500. With that kind of price differential, these studies have continued to happen on microarrays and probably will for the next several years. Spotted arrays are produced by a robot; you're actually seeing the robot right here. The way this works is that the robot has a series of pins on a head, and the robot goes to a 96-well or 384-well plate and dips the pins into the plate. That alone is enough to draw a tiny bit of liquid up onto the pins. Then it goes to a glass slide and touches the slide, and a tiny bit of the liquid that was on the pins drops onto the glass at that contact position. That sounds like an incredibly strange way to make a microarray. Well, it was the first way microarrays were ever made. It came from Pat Brown's lab at Stanford; basically they were thinking about robotic ways of analyzing yeast better. One of the reasons this type of array became so common is that they made all of their designs for the robot publicly available on what was then a very lightly used internet that was basically only for universities. So somebody would see the Science paper showing the first high-throughput expression profiling of an organism, yeast, say 'I want to do that,' and find the schematics for how to build the robot freely available. Literally thousands of microarray printing robots were built that way using the schematics they provided. You can easily recognize a two-color spotted array; they're trivial to spot when you look at the data. First, there are two colors, so you can immediately see a mix of red, green, and lots of yellow: yellow because red and green are present in equal quantities and average out to give you yellow. You've got some red spots and some green spots; that's one clue. The other reason is that, because the printing happens in these 384-well-plate batches, you will very clearly see grids. Each grid corresponds to one 384-well plate, and there will be multiple grids; on this little array they've got four. As soon as you see that pattern, you know the technology that was used to produce the array. You should also almost immediately be able to recognize some of the characteristics that are bad about this.
Well, what happens over time as the pins wear out? What happens if one of the pins gets bent a little in one direction? What happens, as I'm continually doing this, if my sample starts to dry out a little, so the concentration of DNA increases over time because of evaporation in each of those 384 wells, which are kind of small? These problems are common and they lead to systematic artifacts across experiments. As a result, for years there was a push to develop new technologies for microarrays. We'll talk about a couple of these: inkjet-generated and photolithographically generated arrays, which are still in reasonably wide use. There have also been, for probably the last 15 years, routine publications describing high-quality protein microarrays and cell arrays. As I like to say, protein arrays get re-invented in a Nature paper every couple of years and never quite make it to market, because the technology is very, very difficult. It's hard to get antibodies that are robust, shippable around the world, and stable in conjugated form like that, and it's hard to do this at high throughput very cheaply. The net effect is that there aren't really any broadly useful protein arrays on the market. There is a roughly 184-protein array that gets used for a lot of studies; it's not useless, but it's 184 proteins out of 20,000-plus, never mind all the post-translational modifications. So it's a very limited assessment compared to what we do with typical genomic studies. I'll go through each of these in a bit of detail, starting with these. Is that all clear so far? No questions? No, the human part was mostly a joke; it would be a plant on a plant array. Making custom arrays turns out to be reasonably hard, and here's why. Imagine we take the second or third most important agricultural crop. First is certainly corn, second is certainly rice; the third most important is cotton. Cotton just got sequenced: the genome of cotton was reported, I think, a week ago. That's a big problem, because if you want to design an array you need to know what the genes are. We're talking about all of our clothing; cotton is a billion-dollar industry, and the genome just got sequenced. It will be a long time before people studying many other crops, many types of grapes, fruit, and so on, have good genome sequences to build an array from. Even once you have that, building the array isn't cheap, and you need to be planning to use a lot of them. I don't know how many cotton research labs there are in the world, but let's pretend there are 100 and that across them they're going to run 100 arrays a year each. That's 10,000 arrays, which is about the size of a typical single print run from any of the major array producers. So all the labs in the field together barely generate enough demand to make it worth a company's while to design and print that array. It's very difficult to get traction and good financial characteristics around that, so I think that's a big part of the problem. By contrast, cotton shares something like 75 or 80% of its genes with Arabidopsis, and there are tons of Arabidopsis arrays, so you can get a lot of information that way. The probe itself is single-stranded DNA, so it doesn't do that. That's one comment. The second is that hybridization is generally done at reasonably elevated temperatures; the number is not in my head, but I think the hybridization is at 60 degrees, or maybe it's 70.
That's enough to make sure that no secondary structure should be forming in the RNA. Sorry? Why does it hybridize to the probe? Because it's thermodynamically more favorable than RNA-RNA pairing. There's a lot of thermodynamic advantage there, but that said, we also typically put the RNA in excess on the array; some self-pairing does happen and some sample gets lost, but there's still enough to give good signal. We don't have a strong reason to believe that formation of secondary structure or self-hybridization biases some samples and not others. It's obviously biased towards some sequences, but what we really care about is whether our results are biased because of it, or whether it's just a uniform shift of signal, and it appears to be the latter. And remember, microarrays are not good at telling you that this gene is more expressed than that gene anyway; they're more a technique for telling you that this gene is higher in this patient than it is in that patient. I've got a question at the front. I'll get there, I'll get there in five minutes in a lot of detail, and then after that, if you still have questions, bug me. So there are a lot of uses for two-color arrays. There's a series of really beautiful statistical papers, in Nature Reviews Genetics and journals like that, from around 2003, that showed how you could use these effectively for a very wide range of experimental designs. But the classic would simply be: imagine you have two groups, treated and control. You label the treated sample with red and the control with green and compare them directly. Similarly, if you have an individual patient and a before- and after-treatment sample, you compare those together on the same array and use that as a kind of internal control or ratio. Those are classic examples where two-color arrays are very natural experimental designs. They can be. That example of pre- and post-treatment is widely done in cancer studies, and those are frequently run on the same two-color array, so it's not a practical problem in the sense of deeply hindering your microarray analysis. NGS does have better dynamic range, that's absolutely true, but that dynamic range comes at a cost. Imagine you sequence 20 million reads, which is not a large sequencing experiment: you don't get better dynamic range or better accuracy than a microarray. Imagine you sequence 500 million reads: you definitely do. So there's a trade-off point in the number of reads, and several calibration curves have been published. I feel like the number is 120 million, but I'd need to go back to the primary literature. In short, that's the trade-off. You'll have people say, oh, we'll do RNA-seq for you for $175, but they're also producing so little data that a microarray gives better signal. So there's a trade-off between how much sequencing you do and how good your results are, and inherently that's one of the advantages of RNA-seq: you can make that trade-off. With a microarray you can't; you basically say, this is what I'm going to get for this price and this is the quality of the data. Eventually this technology is going to go away for most uses, because sequencing allows you that greater flexibility. Other questions? I guess it all depends on how recent. About 18 months ago Affymetrix released a new human array which had more spots and a higher spot density than anything they'd released before.
And similarly, Illumina released a new genotyping chip at about that time which was denser than anything they'd done before. So that was 18 months ago; I don't know whether we're going to see another iteration that is denser. Both companies seem to be focusing on reducing the amount of input sample required, and they view that as a strong competitive advantage versus sequencing. For example, you can do a genotyping array from a few nanograms of material, and you probably don't want to do a reliable sequencing experiment on less than 200 or 250 nanograms. That potentially gives them access to a biopsy-based market, but that's pure speculation. Other questions? Let's talk a little about each of these other types of arrays. Agilent arrays started in 1999. It was the middle of the dot-com boom, the world was a happy, exciting place, and HP was unhappy: they felt they weren't getting enough credit for all of their wonderful technology, and they thought the problem was that the company was too complicated. Too complicated because they had all these other divisions, instruments that measured things extremely accurately and that did life-science work. So they said, let's create a spin-off called Agilent. Agilent Technologies would get all of that, and the computer company would remain, ride the dot-com bubble and be super successful. It didn't quite work out that way for HP, but for Agilent this turned out to be a really great thing. Because Agilent was able to look around and say, wait, HP is our sister company, we can license any of their patents that we want for free, we're supporting one another. HP holds patents on printers: on how to put a small quantity of liquid in a very precise location. That kind of sounds like a microarray, putting a very small quantity of something in a very specific location. So they decided to see whether printer technology could be harnessed to generate microarrays. I'm not sure if anybody recognizes who did this initial work: they're here in Toronto. Tim Hughes, as a post-doc, did this as his initial work; that's part of why he ended up getting a faculty position. He basically developed the technology and did all the initial quality control on it. Essentially the way it works, and there are lots of proprietary steps here, but it's really ingenious: instead of having four dyes, CMYK, you have ACTG. And instead of spraying a dye at a spot, you spray a nucleotide at the spot. They add nucleotides to spots one at a time, and the probes build up on top of each other. We don't know the exact chemistry, but we have a good estimate of what it is, and I'll show that in a couple of minutes. The idea then is pretty simple. You've got a printer head. The printer head moves around the glass slide, dropping nucleotides one step at a time, with the same sound effect I just made, and that leads to the building up of an entire array. You can also immediately guess some of the challenges. If you're moving the head around over time, there's a limit before eventually the chemistry fails because you've waited too long between adding bases; in other words, you're limited in the length of the probes that you can print. The limit is not short, as we'll see in a couple of minutes; it's in the 60 to 70 base range, but the probes are not thousands or hundreds of thousands of bases long.
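As an aside on that length limit, a rough way to see it is through cumulative coupling yield: if each base addition succeeds with some per-step efficiency, the fraction of full-length strands decays with probe length. The 99% per-step efficiency below is an assumed, illustrative number, not one from the lecture or from any vendor specification.

```r
# Rough illustration of why probe length is limited in stepwise in situ synthesis:
# if each base couples with efficiency p, only p^n of strands are full length after n bases.
# p = 0.99 is an assumption for illustration, not a measured value.
p <- 0.99
probe.lengths <- c(25, 40, 60, 70, 100, 150)
full.length.fraction <- p^probe.lengths
data.frame(probe.length = probe.lengths,
           full.length.fraction = round(full.length.fraction, 2))
# At ~60-70 bases roughly half the strands are still full length;
# much beyond that, truncated (capped) strands start to dominate the feature.
```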
The other type of array that we'll talk about in a lot of detail are what are called photolithographic arrays. Photolithographic arrays were being developed right at the same time that Pat Brown's group was looking at printed arrays; in fact, both were in California, both at Stanford, in different divisions. One group was in robotics and the other was borrowing the techniques that have been used to make computer chips for a long time. This is probably going to be the last-surviving standard microarray technique, because these arrays have very good error characteristics. Does anybody know how photolithography works, what the technique is? At the back. Chris. Good. So that's basically how we build an Affymetrix array. Going back to the word photolithography, the critical aspect of it is photo, light, and lithography, basically digging: so it's digging, or building, using light. What was just described, and I'll go over it in a second, is how you can use light to build large oligonucleotide structures. It's also how computer chips are built: you've got a silicon wafer, you shine light selectively on certain parts, and that allows chemical reactions to happen there. This was initially done by Affymetrix; there are other companies that have done it as well, for a very long period of time. Here's the idea of how photolithographic arrays are synthesized. You start off with a wafer, and the wafer is really critical: it has to be completely flat. If it's not completely flat, sample will pool in certain parts of your array. The surface itself is silanized, covered with hydroxyl groups, and that creates a generally reactive matrix all over. A linker molecule, and we don't know exactly what's going on with the linker molecule because it's proprietary, is added to this wafer at every hydroxyl group; it's almost certainly some sort of sulfhydryl chemistry, but we don't know the details. This creates the ability to sequentially add things on top of one another: the linker molecule is a flexible, chemically reactive site. Then we do, just as Chris described, a step using something called a photolithographic mask. A mask looks like this: a series of black and white regions, white where light can shine through, black where it can't. Basically, you have a lamp, you shine light through this mask, and it only allows the light to reach specific locations on the chip. There are a couple of things to note here. One is that if you have a lamp, like the one over there, the light doesn't all travel parallel; it goes in all different directions. So there's this little blue ring, called a collimator. It's one of the most critical parts of the manufacture of a computer chip, because it ensures that all the light is parallel. You can imagine why this is important: if the light weren't parallel, if the light coming out of this spot right here was at different angles, then some of it would go over here and some would go to the bottom of the chip. Instead of having very precise definitions of where there is light and where there is not, you'd have fuzziness, and the more fuzziness, the less accurate and the lower quality your chips are. This is, of course, super critical for computer chips, because even a couple of stray photons can trigger the chemistry in the wrong place, put a transistor where it's not supposed to be, and you're doomed. By contrast, with a microarray there's a bit more tolerance. So the collimator is very critical.
The light is almost always a UV source, and it can be at different wavelengths. The mask has its own specific characteristics; it's actually quite expensive to produce different masks and to swap them sequentially, so the design of the masks, and the careful decision of what to add where and when, is one of the important aspects of developing a microarray. So the idea is that we take our wafer, to which we've added these linker molecules, and we shine UV light in two places. We've divided the chip up into what will eventually be features one, two, and three, and we allow UV light to shine on features one and three. Number two is protected by the mask; no light gets there. The UV light activates the linker molecule in some way, so that it can now undergo a chemical reaction that would not be possible in the absence of UV light. Let me make one other point here. This is also why it's critical to do this in a clean room. You can imagine that a single stray speck of dust coming in here scatters the light, and that light goes off in many different directions. So another key requirement for making microarrays, or good computer chips, is that you have a very low particulate count in the air; you want as few things floating around in there as possible. If this works correctly, then only features one and three are activated by the UV light. Now we pass a solution of modified nucleotides over the chip. The modified nucleotides have the ability to bind to the activated linker molecule. So here they're A's: A's bind onto the chip, everywhere the linker molecule has been activated, features one and three. And now we simply repeat this. We protect features one and three, activate feature two with UV light, and pass a different nucleotide over the activated area, so feature two gets its first base. We keep going: activate features two and three with UV light, pass nucleotides over, and we've built up the second base. You can do this sequentially for as long as you like; of course, the longer you go, the more likely a mistake becomes. A mistake like an incomplete chemical reaction. It's entirely possible that in one round a nucleotide won't get added to an activated spot, not through any fault of the experimenter, but simply because kinetics works that way: there's always some probability that a reaction will only go so far, and sometimes it won't go to completion. That's a big problem. Imagine now that we have this C in feature two that has not correctly had a G added to it. The rest of the sequence built on top of it is going to be a problem: the sequence will effectively carry an insertion or deletion, and it's going to bind to something we don't want it to. So instead, after each activation step you run some sort of capping agent, something that blocks any further activity. The capping agent again probably uses something like sulfhydryl chemistry; it's a very high-affinity reaction and the product is inert. So that strand will remain incomplete, but it won't introduce errors, because nothing more can be added to it. We eventually build the entire chip up this way, and you can see in this example that we have features one, two, and three, here is the capping agent, and there is the strand that just didn't get completed.
And if we color-code these differently, you can see we have four features right next to each other, with very little spatial separation, and each contains a series of identical oligonucleotides. Now, an Affymetrix probe will typically be 25 bases in length. Twenty-five bases, four possible bases at each position, means you would naively need 100 masks. With deconvolution and optimization you can often get that down from 100 masks to somewhere in the 85 to 95 range: you can be a little cleverer and allow some masks to build up different layers of different probes at the same time. Figuring out how to do that optimally is one of the big challenges in chip design: figuring out the right order in which to add things, especially given that you probably don't want to be activating adjacent spots together too often, because the more that happens, the more chance there is for edge effects, with light diffracting around the edge of the mask and going places you don't want it to. Similarly, it's easier to build spots at the edges of the array, because you don't have to worry as much about stray light. So a lot of considerations go into designing the order in which you do the synthesis and the way your masks work. Your final chip is made on a wafer roughly 5 inches by 5 inches, something like that size. Five inches by five inches is puny: if you take a look at an Intel wafer, where they're making real computer chips, they're big circular things that somebody has to hold up with both hands. There are a couple of reasons for that. One is that it's important to do robust quality control for computer chips, but for arrays it's not as critical, so you can make them in much smaller batches, more quickly, with smaller clean rooms, and you don't have to worry as much about a lot of the technical details. The second is that we make a whole lot more computer chips a year than we make microarrays, so you don't have the same economies of scale: the world probably sells, I'm going to guess, four or five hundred million computer chips a year, but maybe a million microarrays a year, something on that order. The wafer itself is 5 inches by 5 inches, but it actually contains a lot of identical chips, and if you go down to the size of the individual Affymetrix array, leaving aside all the plastic packaging, it's a centimeter and a quarter by a centimeter and a quarter. That's it. All the rest is plastic packaging and microfluidics; the entire thing is about a 5-inch package, but the important interior part is just over a centimeter. Within that, the individual features are typically on the order of 10 microns by 10 microns. There's been a little bit of a move to reduce that, as we talked about, but not dramatically; I don't think they're smaller than 9 microns now, and 11 was where they were a couple of years ago, when Affymetrix last released public information about this. A 9-micron feature is quite small, but relative to molecular dimensions it's large, so each individual spot contains a lot of replication at the probe level. What we're actually detecting is fluorescence signal on a range from zero up to millions: how many of those probes show fluorescence, at each of those different levels. And it's probably true that we are far more limited by our scanners, which detect about 65,000 grayscale levels, than by the arrays themselves, which could in principle let you distinguish millions of different levels. So, how do we know the sequences of the individual genes?
Sometimes we know it because the genome has been sequenced, which gives us a lot of information, so we can design the array well. In other cases the genome hasn't been sequenced, but there's a lot of intermediate information about what's transcribed. There's a technology called ESTs, Expressed Sequence Tags, which is basically a way of reading out a set of genes, identifying the abundant ones and sequencing them. There were a series of studies that tried to characterize ESTs for many different species, mostly in the early 2000s, and those provide a good estimate. Between those two sources, along with gene models and inference from genes that are conserved across multiple species, people come up with good estimates of the gene content for an individual species, and from that they design the probes that best represent those genes for the species they're studying. Yeah? So what you're getting at is: are there different gene isoforms that might have different consequences or effects from tissue to tissue? Typically, most microarrays have their probes designed against the 3-prime end. The 3-prime end is the least variable part of a gene, so that's one way companies try to get around it. That's not perfect; it just averages out the different isoforms. More recent arrays, from the last couple of years, will often include probes for specific exons, so you can look at individual exons and ask, is this exon expressed in this tissue or not? That's perfectly viable, just a bit more difficult to analyze, but it exists as a technology. Any other questions? Okay. So we have the Affymetrix chip: we have the chip itself and the probes on the chip. The next thing to consider is what we do with the sample. The sample prep for an Affymetrix chip is a little unusual, and this is a great trick question to ask students at committee meetings, because most people don't realize that Affy chips, and photolithographic arrays generally, are fundamentally different in their sample prep from most other chips. Like everything else, you start off with total RNA and do a reverse transcription to cDNA. At this stage, for many microarrays, you would incorporate a fluorescent label in the reverse-transcription step and hybridize the cDNA to a standard expression microarray. That's a very reasonable approach, but Affy arrays instead take the cDNA and do an in vitro transcription at that stage: they use the cDNA as a template and transcribe RNA from it, incorporating a biotin label at that step, so now you've got biotin-labeled complementary RNA. They then fragment that and hybridize it to the chip, with biotin-streptavidin labeling used to get a very robust estimate of signal. There are two reasons why they do it this way. One is that biotin-streptavidin conjugation is super powerful, a very strong interaction, which gives very sensitive signal detection, so it can potentially be more accurate than the simple use of fluorescent dyes. There's another reason why you might want to run an array using complementary RNA instead of complementary DNA. Does anybody know? I didn't know this until about four years ago, when somebody in this class pointed it out to me, and I realized it's entirely true and did some research on it. Single-stranded DNA is not stable; single-stranded DNA is an unhappy molecule. Single-stranded RNA is a perfectly happy molecule.
So it's much easier to store and keep cRNA than it is cDNA, especially if you store it under conditions where it isn't prone to forming Watson-Crick base pairs. Obviously, double-stranded DNA is a much more stable molecule than either of these, but single-stranded cRNA is a very good thing to be able to store and repeat hybridizations from, and that's one of the primary reasons this is done. So it's a typical sample-preparation procedure that's been tweaked and optimized to maximize sensitivity and to give you samples you can reuse later. After that stage, an Affy array is no different from anything else: you have your target hybridized to the features on the array itself, and you're using the signal from the biotin-streptavidin conjugation to read out what the intensity levels are. So this is what an Affymetrix array looks like. There are a couple of interesting features on this array. Does anybody see anything that looks weird? This. What is that? Can anybody read it? It says Affymetrix, right? Yes. Affymetrix decided to make it easy in case you forgot what kind of experiment you did: the control probes on the array spell out the name of the array. So it is impossible to lose track of what kind of array you have if you ran an Affymetrix array, because as soon as you have the image: you go to your collaborator, what type of array did you run? Affymetrix. Which one? Human. They sell 40 human arrays. You know, the human one. Then you can look at the image and say, ah, you ran HG-U133 Plus 2. Great. So now you have an easy way of working out what actually happened. This sounds like the kind of thing that should never happen, but it's happened in my lab at least five times: we've been unsure what people did, they couldn't tell us, and you look at the picture and at least it tells you. That's good. There's another key feature here that's quite interesting and that we'll come back to. Do you see, just next to my cursor, there's a kind of bright line or border? You can see it really clearly at the bottom here as well, a series of bright dots along the sides. They give the computer, when it's doing the scanning, a specific landmark to look at and say, here is the edge of my chip, these are the borders; now it can lock onto the borders and identify where all the other spots are relative to them. We'll talk about how that works in a few seconds, but this is a very useful feature because it reduces the likelihood of incorrectly identifying spots when gridding the array. Any other questions on Affy arrays? No, it's pretty robust. You need to incorporate a label; you need to have a label on your RNA, otherwise you can't do it. You have to have a step that allows you to label the RNA. Sorry, if you have RNA, you have to have a way of incorporating a label, and I'm not aware of a robust enzymatic procedure for doing that directly: you'd need an RNA-dependent RNA polymerase, and I don't think there is a robust one for this purpose. By contrast, there are robust DNA-dependent RNA polymerases. Other questions on Affy arrays?
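A practical aside on that anecdote about losing track of which array was run: the chip type is also recorded in the CEL file header, so you don't even need the image. Here is a minimal sketch, assuming Bioconductor's affy package; the file name is a placeholder, not a real file.

```r
# If you've lost track of which Affymetrix design a CEL file came from,
# the chip type is recorded in the file header (the file name here is a placeholder).
library(affy)
whatcdf("unknown_sample.CEL")                  # reports the chip/CDF name from the header
raw <- ReadAffy(filenames = "unknown_sample.CEL")
cdfName(raw)                                   # the resulting AffyBatch also knows its design
```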
Okay. We've been back in molecular biology for a while; don't worry, we'll return to pure bioinformatics in a couple of seconds. The last type of array we'll talk about, super briefly, is Illumina arrays, Illumina BeadArrays. It's a very different technology. Basically, instead of having probes at fixed physical locations, you create a whole series of tiny little wells and a little ball that is the same size as the well. On the ball, which we call a bead, you put an address, a sequence of about 25 bases that tells you which probe it is, plus the probe sequence itself, and the bead is coated with huge numbers of copies of that oligonucleotide. Now you take your whole array of little slots and you randomly drop a bunch of beads into those slots. That's a kind of interesting idea. The hybridization happens in those slots, and for some genes you're going to have many beads measuring that gene, which gives you a great assessment of internal error. On the other hand, it varies from gene to gene: some genes will have only a couple of beads, maybe one or two, others will have 10 or 15, so some genes get very precise estimates and some get less precise ones. The idea is very innovative and it certainly has a lot of potential. Unfortunately, and I guess there are a couple of reasons for this, bead arrays were the last technology to be developed. As a result, although their price is not too high, they didn't attract a lot of attention from bioinformatics researchers. Bioinformatics researchers said, well, we've already spent years on the existing technologies, I don't really want to learn to work with a new one, plus there's this cool sequencing thing starting to come up. So there wasn't a lot of method development for bead arrays. Fifty bases, typically. So they don't have a strong advantage there; there's not a big selling point. By contrast, inkjet arrays have longer probes, 60 to 70 bases, good research came out on them, and they give good quality data, probably about equivalent to bead arrays. Affy arrays are the most expensive; they have the advantage of very high quality data, the controlled packaging, the high level of quality control, and also very extensive bioinformatics research: people have spent a lot of time working out the optimal way to analyze them, so we're probably maximizing the information content from the technology. Spotted arrays are super cheap; you can probably make a spotted array for 25 bucks today. The probe length is incredibly variable: we published a paper in Lancet Oncology last year where we spotted arrays with 25,000-base-pair-long probes, and you cannot do anything remotely like that with the other technologies. On the other hand, it's an inherently noisy technology and it's very difficult to do robust quality control, though there's lots of bioinformatics research, so we have a good idea of how to handle and mitigate that noise. So there are clear trade-offs between all of these. In the long term, I think it's likely that the high-quality, well-analyzed, well-understood Affymetrix platform will end up being the last microarray technology in use, particularly in applications like quality control. You could imagine, if we had yogurt here, that you'd peel off the cap of your yogurt and there would be a tiny little microarray underneath, sensing for bacteria and telling you whether your yogurt had gone bad.
That's an easy marriage of microfluidics and microarrays, and the kind of application where the technology will probably last for a very, very long time. By contrast, for discovery, I imagine that over the next five years most of these array platforms will slowly vanish. To be clear, I don't endorse any particular technology. They all have their strengths and weaknesses, and we use all of them for different types of experiments. And you shouldn't just say, I'm going to use Affy because I like the sales rep; it's important to think through why you're using a particular technique and whether it's appropriate for the experiment you have in mind. Yeah, Chris. Yeah, so essentially it's decoded using a secondary probe that recognizes the address label. Other questions? So my group has a paper that came back from a good-quality journal with revisions, and the revision said, can you do some RNA work to give us confidence in your DNA-based assessments? Fortunately for us, on that project money wasn't limiting, and we had 80 samples where all we had to do was generate the data and analyze it. We chose to do microarrays instead of RNA-seq because, if we had done RNA-seq, we would still be analyzing the data 18 months or two years later. With the microarray, we were able to get the samples to the array center, get the data back, finish the analysis, get it into the paper, and it worked. So there's clearly a turnaround-time issue as well. And lastly, I guess, microarrays in some applications are probably superior and will be for a while. There's no example of clinically used RNA-seq, but there are clinically used microarrays. So if you have something that you believe today is good enough to go into the clinic and change patient management, then having it as a sequencing-based test is not helpful; having it as an array-based test would be really useful and applicable to an FDA application. So there are a couple of reasons why you would choose microarrays despite the potentially greater accuracy of sequencing, even if money weren't limiting. So it is a good idea, and it's perhaps not an absolutely standard thing to do, because it's rare to have a thousand coding mutations in a tumor. I mean 1,000 differentially expressed genes. Aha. So yes, if you have a thousand genes, there's nothing difficult about making that assessment. It depends a little on the technology used, how sensitive it would be to that feature anyway; for example, inkjet arrays would actually be quite sensitive to it, Affymetrix arrays less so. So with some array technologies we might be able to ignore that issue entirely. No, no, it just isn't today. The time frame for getting DNA-seq into clinical applications is a bit unclear, but you could probably imagine it over the next couple of years; RNA-seq will probably take a few years longer than that, so you might be looking at four or five. But making estimates about FDA approvals is something that even large pharmaceutical companies are terrible at, so don't trust my estimates. Where are we? Very good. I'm going to very quickly talk about what microarrays are used for, and then spend the bulk of our time talking about how the data are analyzed. So, very quickly: at a molecular level we can interrogate a large number of different features using a microarray. We talked about how, if we make the hybridization conditions very stringent, we can infer a sequence.
That's genotyping directly, and you can essentially do resequencing on an array if you really want to. Similarly, you can do copy-number analysis. Genotyping arrays are frequently reused for copy-number analysis without anything special in terms of data generation: you just have to play with the data, so that the same array used to call genotypes also gives you not-too-bad copy-number estimates. Microarrays are also still widely used for capture-type experiments, for example as one of the major ways of capturing the coding regions, so a lot of microarrays are actually used in combination with sequencing experiments, and for a lot of genetic applications. On the RNA side, a lot of the more novel applications of arrays involve things like mRNA stability, where you take samples and spike them, pulse-chase types of techniques with modified nucleotides: you use those modified nucleotides to separate newly synthesized from pre-existing RNA fractions, and from that you can infer half-lives and things like that. So there are a lot of molecular applications, and they lead to a huge range of biological applications, things you've probably heard about before or will hear about this afternoon. I can compare drug-treated to untreated, or drug-resistant to drug-sensitive tumors, and ask what differs with that sensitivity. You can do pathway analysis. Very commonly today we're looking at new models for studying cancer: there are huge numbers of primary patient xenografts being developed. I think here in Toronto the number is around a thousand primary xenografts, where a human tumor is put into an immunocompromised mouse and grown there. A thousand of them. We want to know what their genomics are like, what the characteristics of the transcriptome are, how accurately they model patient tumors, and whether we can use them to work out optimal treatment protocols. And then, of course, there's the whole topic of my own work, which is classification: making predictions about patients, how drugs are going to work, whether a drug is going to be toxic for an individual. So in our stream of analysis, you've seen pathways already, you're going to hear about clinical integration today, and this fits in between those two. The way a microarray is analyzed is kind of like a pipeline, or a pathway. You start off with a single glass slide which has been scanned, and on that you can see the spots, where they are, and all the different features. The first critical step is to quantify each of those features; that might be a pair of numbers, Cy3 and Cy5, the standard red and green dyes, or a single number for a single-channel array. Either way, we don't work with the images; images are impractical to deal with, so we start by turning them into quantitative values. Those quantitative values then go through several types of models that try to remove noise in a series of steps. For example, there's background signal, non-specific hybridization in the region around each spot, that we try to remove using different statistical models. We'll have some spots that are inherently low quality or have very little signal, and we want to either remove them entirely or mitigate their effect on the experiment. We also have to do normalization: different parts of the same array may have different error characteristics.
There can be spatial trends that we need to remove, and when we do experiments we usually, in fact always, do replicates, and therefore we have to balance the different arrays against one another before statistical analysis. Once we've done all that normalization, we can start doing statistical analysis, identifying patterns using machine-learning techniques, or integrating with other types of data like you've seen all week long. What we're really talking about here are two fundamentally different things. The first five steps are about removing noise and unnecessary technical artifacts from your data. The second part is about extracting information, extracting meaning, to draw biologically useful conclusions. We're going to talk a lot about the removal of noise, because the removal of noise is, in some sense, exactly like the removal of noise from any other type of genomic data. Yes. I will, in five minutes, talk about that in much more detail; I'm giving you the overview and then we'll go over each of these steps in detail. Other questions? So let's go through the steps one at a time. Let's start with image quantitation. Image quantitation is basically the idea of taking these pictures and interpreting them: changing the qualitative information in the images into something quantitative. Microarrays didn't originally work this way. The first microarrays were just images, and grad students looked at 6,000-spot yeast arrays and said, ooh, that spot is darker than that spot, that spot is lighter than that spot, and they did this over and over until they went insane. And then one of them said, you know, I'm going to write a computer program to do this so I don't have to do so much work. I wish I were joking; that's actually how it started, and ultimately we still have to do this. This image analysis step is not just a microarray thing: fundamentally, image analysis is at the heart of a lot of omics and, of course, of all next-gen sequencing. And it's potentially a major source of error, because it's difficult, and none of us actually goes into the images and looks at them in detail, which is something we might actually need to do. The image analysis starts off with something that looks like this. This is a typical microarray experiment, and the first thing we have to do is figure out where all the spots are. Fortunately, arrays are typically laid out in little subgroups, grids, which help us work out where things are, and by eye I think we can all sort of see where the grids are. The problem is that we've got to teach a computer to see where the grids are, and that's not an obvious thing to do. I won't go into all the details of how this works, but roughly: take every pixel, take all the pixels in this row and add them up, then go to the next row and add those up, and sometimes there's going to be a lot of intensity, a peak, and sometimes not very much. Do the same thing for every column. Now look for peaks in the rows and peaks in the columns, and say, here's a peak and here's another peak, and those correspond to this spot; use that as an initial estimate of where a spot is. There's a minimal sketch of that row-and-column idea below. It's not a very good initial estimate.
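Here is that sketch of the row-and-column projection idea, on a simulated image using only base R; real gridding software layers a lot of refinement on top of this, and the grid positions and spot sizes below are toy values.

```r
# Toy version of the gridding step: project the image onto rows and columns,
# then take peaks in those profiles as initial guesses for where spots sit.
set.seed(2)
img <- matrix(rpois(100 * 100, lambda = 2), nrow = 100)   # simulated background pixels
centres <- seq(10, 90, by = 10)                           # true spot centres (toy 9 x 9 grid)
for (r in centres) for (cc in centres)
  img[(r - 2):(r + 2), (cc - 2):(cc + 2)] <-
    img[(r - 2):(r + 2), (cc - 2):(cc + 2)] + 50           # add bright 5x5 "spots"

row.profile <- rowSums(img)   # add up every pixel in each row...
col.profile <- colSums(img)   # ...and in each column

# crude peak finder: values above the overall mean that beat both neighbours
find.peaks <- function(x) {
  which(x > mean(x) & x > c(-Inf, head(x, -1)) & x > c(tail(x, -1), -Inf))
}
find.peaks(row.profile)   # candidate spot-row positions, clustered around 10, 20, ..., 90
find.peaks(col.profile)   # candidate spot-column positions
```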
Of course, that initial guess is noisy, with lots of spot-to-spot variation, but it gives me an area to start from around each spot. I then expand out from that starting guess, initially in rings, and when I start detecting intensity I move in that direction until I converge on my best estimate of where the spot boundary is. So that's how image segmentation works. Image segmentation would be the easiest thing in the world if arrays were perfect. If arrays were 100% reproducible with zero error, I wouldn't be talking to you about image segmentation as a big problem. But instead, all sorts of things go on in any typical genomics experiment. For example, that spot there: is it a little fleck of dust? Somebody's skin flake that fell onto the array while the experiment was running? Well, it's green, so it kind of looks like a spot, but it's not in a normal location and it's not obvious what it is. Similarly, take a look at these two rows over here: you can sort of see hints that maybe there are spots there. Maybe this row is the last row of this grid, or maybe it's just noise, just like that's noise. That green spot is certainly more intense than this red one, or this red one over here, so it could easily be either. We have to trust, A, that our genomics is good, and B, that our segmentation can ignore these mistakes: that it identifies when something has likely been messed up without introducing new types of noise. Unfortunately, this is a very, very difficult problem. So difficult that in most fields people have stopped even thinking about how the image segmentation is done. For example, I think every lab I know trusts the Illumina-supplied segmentation for their DNA sequencing; I don't know a single lab that does it themselves or even develops algorithms for it anymore. There used to be a couple, and now we just assume the company gets it right. We usually do the same for Affymetrix arrays. Surprisingly little investigation, even though any error at this stage carries through the entirety of your downstream analysis, and it's probably a source of error in all your studies. The last time my team looked at this was maybe four years ago. We did an assessment where some very patient grad students and undergrads went through and looked at spots, and we estimated there were probably on the order of 100 spots misgraded on every array. Not the end of the world, right? With 20,000 genes and roughly 10 probes each, that's about 200,000 features, so 100 bad spots is a small fraction of a percent on every single array, which we hope is random rather than systematic bias, but it certainly doesn't have to be. The only fix I know of is manual spot checking, and I don't know anybody who does that systematically anymore. It's just impossible: 500 arrays at a couple of hundred thousand spots each, and I can't afford enough undergrads for that. So it's a significant challenge, but one that largely gets ignored at this stage, in microarrays and in essentially all genomic studies. (Question from the audience about probes versus features.) If we have a feature, that feature contains millions of individual strands of the same sequence, but for any gene there are typically 1 to 10 features representing that gene. Good question.
(Question about whether arrays could count absolute molecule numbers.) Imagine that your scanner was perfect; then yes, you absolutely could, but that would require something like a photon-level scanner. We certainly don't use those, and I don't even know if they exist for this. You can count single photons, but nobody uses that for microarrays; the scanners we have have much higher thresholds of intensity. I don't know exactly what that is in terms of molecules per cell; I think it's on the order of 1 to 2 molecules per cell, but I'd need to double-check that. (And on terminology: yes, I think you can use "feature" and "spot" interchangeably.)

So, segmentation is a bit depressing. Background correction is a little less depressing. The idea here is that you've got stray signal, stray hybridization, around each of your spots, and you want to remove it. This also turns out to be a very difficult problem that could benefit from a lot more research. If you look at a typical spot you'll see something like this: an interior region of strong signal; around that, a weird ring, probably some sort of diffraction effect, we're not sure; and around that, background. The background is hybridization that doesn't get washed off, is unrelated to the genes, and has a spatial bias across the array. What you want is to take the foreground and remove this background signal, which we think is entirely noise. It should be simple: signal equals foreground minus background. Of course, simple things don't always work, and this fails miserably, for a whole series of reasons. The primary one is that we often see cases where the measured background is more intense than the foreground, so the subtraction gives a negative signal. That's weird in a fundamental way, because if the background were really background it should be adding to all the spots, not just the region next to the spot. Negative signal is a huge problem: number one because it's biologically implausible, and number two because the vast majority of statistical analyses assume microarray data is roughly normally distributed after a log transform, and log transforms of negative numbers are not happy things. This affects maybe 2% of spots in a lot of experiments, which is potentially a couple of thousand genes, depending on how those spots are distributed. As for why it happens: a group at Argonne National Laboratory did some really beautiful work using what's called a spectral scan, which scans across all the different wavelengths. What they showed is that a spot with no bound target can actually read lower than the surrounding background: the spotted DNA quenches the fluorescence signal. That's very, very interesting; it suggests there's some sort of DNA quenching going on with several of the dyes. They also showed that the glass used in several of the chips would itself alter fluorescence. And here's the important implication: unbound spots are particularly prone to this problem. What might unbound spots be? Low-expressed genes, stuff that is so "unimportant" in cancer, like tumour suppressors, transcription factors, lots of signalling proteins. So we have a big problem: the genes we care about most in cancer are particularly prone to this artifact.
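To make the negative-value problem concrete, here is a hedged sketch using limma, the Bioconductor package that implements several of the corrections discussed next; it assumes `RG` is a two-colour data set read with `read.maimages`:

```r
library(limma)

## Suppose RG came from two-colour image-quantitation output, e.g.
##   RG <- read.maimages(targets, source = "genepix")
## Naive correction: foreground minus local background. Any spot
## whose background estimate exceeds its foreground goes negative,
## and log2() of those values is undefined.
RG.sub <- backgroundCorrect(RG, method = "subtract")
sum(RG.sub$R < 0)   # count the negative red-channel spots

## A model-based alternative ("normexp", the normal-exponential
## convolution) keeps corrected intensities positive, so the usual
## log transform stays well behaved; the offset damps low-end noise.
RG.ne <- backgroundCorrect(RG, method = "normexp", offset = 50)
```

limma also exposes an "edwards" method, so the model choices discussed below map directly onto the `method` argument.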
So we need to figure out ways to address it, and a series of fairly intense statistical models were developed. There are three in wide use, associated with the names of the people who developed them: Edwards, Smyth and Kooperberg. They differ fundamentally in their assumptions about signal and noise. Edwards assumes, let me get this the right way around, that the error is logarithmically distributed and the signal linearly distributed; Smyth assumes a normal-exponential convolution; and Kooperberg's is a considerably more complicated model. The mathematical underpinnings are advanced enough that we could talk about them for an hour, so I'll just spend a moment on how to think about them. The simpler models are fast, running in a few seconds on a typical microarray experiment, and reasonably accurate, certainly better than doing no background correction at all. By contrast, the Kooperberg model is very slow. The last time my lab ran it was, I guess, three years ago now, and we left it running for about a week and a half to do the background correction on a couple of hundred arrays. It's slow because it takes everything into account: the number of pixels per spot, how many of those pixels had a given intensity, the number of background pixels, how they relate to one another. But it's thought to have the best accuracy. Assessing "good, better, best" is a very difficult thing here, because how do you design an experiment that isolates background effects? Background is, by definition, non-specific hybridization, so it's hard to come up with a controlled experiment that would let us assess it directly. The way these methods were actually ranked was by using some entirely different downstream metric, like how accurate you were at a certain classification task. That's not a terrible idea, but it's also not a great criterion for choosing your algorithm up front. So background correction is important, and we have several techniques that can do it, but on the other hand we don't have full confidence in how best to do it.

Unfortunately, that's a much better situation than the one we get into with spot quality. Spot quality turns out to be extraordinarily difficult. The idea is that we should identify artifacts, things that are not correct in your experiment; you want to identify all the mistakes and fix them. Unfortunately, we don't really know how to do it, and it's an extremely difficult research problem. The essential idea is very simple. If you have a large-scale experiment with, say, 250,000 spots on your microarray, they're not all equally good. Some will have very robust signal; that's perfect, so maybe we give those a weight of one. Some are going to be complete garbage: the manufacture of the spot was poor and the net effect is that they're useless. Those should get a weight of zero; we shouldn't even look at them. And of course there are all the spots in between. The problem is how you come up with those numbers between zero and one. How do we assess that? There are a few approaches that have been widely used in the literature. For example, the mean-median ratio: take the mean of the pixels in a spot and the median of the pixels in the spot, and take the ratio.
If the ratio is exactly one, the mean and the median are identical and the distribution of pixels within the spot is symmetrical. That's nice. By contrast, if it's strongly skewed in one direction, that tells you something fundamental is going on in the spot: the distribution is certainly not what we might expect. Other groups have looked at composite Q-metrics, quality metrics. They take the circularity of a spot (you can compute a number between zero and one that says how circular a spot is), or how much variability there is between the pixels within a spot, and combine them. These metrics can do quite a good job at improving what we call the homotypic signal-to-noise: you take the same sample, hybridize it against itself, and see how much noise there is. You expect everything to have exactly the same signal, and the deviation from that gives you the signal-to-noise ratio. Unfortunately, all these approaches fail randomly. And when I say randomly, I really mean it: you'll sometimes look at a spot and go, "that is perfectly good, why is it being called bad?" Or you'll have huge artifacts in your experiment and the metric will say, "oh, this is a good-quality spot."

So the question you always get asked is: do I really need to worry about spot quality? I'll show you a few examples. Here's a little segment of a two-colour array. You can see a halo of background around a saturated, very intense spot, bleeding into the spots around it; that halo is going to affect the background signal of the neighbouring spots. Okay, not a terrible thing. By contrast, this probably is a terrible thing: some sort of dust mote impinging directly on two spots. There's no way to distinguish this spot's fluorescence from the dust's, and essentially you're seeing a large injection of noise. Here you have a combination of the two: some sort of dust event, who knows what it is, which has created a large halo of signal affecting the background of multiple spots in the surrounding region. And this one is probably a printing artifact, where the print head actually moved: you can see the DNA from these two spots being mixed, and look at this weird little pseudo-spot off to one side. Segmentation is going to have disasters with that. All of those are from a single array, a quality array that we analyzed and published on several years ago. So there's nothing stopping data like this from ending up in your experiment, and the problem is that we have to be able to address all of these issues. Now, when I show things like this, somebody always says: yeah, but that was a two-colour array; with Affymetrix, or whatever other array type they use, it will be so much better. Well, I did spend some time telling you that Affymetrix arrays are the highest quality; that's very true. So let's take a look at some Affymetrix data and see this high-quality data. Here's a nice array with a clear pattern of signal increase in one corner, almost as if a little grease mark from somebody's fingertip changed the hybridization temperature there.
Here's another one where you see, very clearly, a pattern of unusual signal distribution, probably again related to some weird hybridization characteristic. And here you have a thumbprint on the side of your array. So you say: that's fun, those can't possibly be from a good Affymetrix experiment. But those are from the standard Affymetrix data set, done by Affymetrix themselves and posted on their website as the exemplar of the best-quality data, the kind you should use when designing studies. That's a good example of it. Similarly, take a look at a very interesting experiment of ours that I mentioned a few minutes ago: a line, here. That's weird. Why is there a line there? This was the third array in our experiment. On the fifth array you can count one, two, three, four, five, six lines. That's weird. And by the eighth array, you see a lot of lines. Anybody know what's going on, and why we're getting more and more lines as the experiment goes on? Any guesses? Different samples? Well, it is correlated with array order, though not linearly. What do you think is causing it? Unfair to press the same person; I want a guess from somebody else. Unequal hybridization in these drug-treated rat samples, something purely experimental? That's a good guess, but it's weird that it would happen in such ordered rows, and it's hard to imagine what the mechanism would be. Anyone else? So the answer is simpler. Imagine you're scanning an array. The scanner goes in order, sweeping all the way across the array, and it needs a lot of power to do each sweep. There's a piece of equipment called a capacitor that stores that charge, and our capacitor was getting a bit tired. On the first array it held plenty of charge; by the later arrays it wasn't charging fully, some sweeps weren't scanned at full power, and you see this huge spatial artifact. Now, number one, it's trivial to remove once you see it's there: you have X-Y coordinates on the array and you can fit a simple model to remove the effect. That's fine. But you have to actually look at your data to see it. And number two, auto-detecting that kind of trend is quite challenging, because you can't enumerate in advance every possible thing a scanner might do wrong; this is one we've never seen before or since. So that gives you a feel for why spot quality is one of the big open issues. Regardless of the platform, regardless of the technique, we pretty much never talk about quality, and certainly not at this level when it comes to sequencing studies. Let's just say there are equal problems with the sequencing data sets we analyze in terms of quality, and we just don't know how to address them in general. We can try manual flagging, where you get very bored undergrads to look at spots and go good, bad, good, bad. Unsurprisingly, undergrads are not very enthusiastic about it, and if you take two or three of them, their concordance differs by between 5 and 20%, which is about what you'd expect. And we have far too many spots to do that at scale. So spot quality is a huge unsolved issue.
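That said, the simple per-spot metrics are at least cheap to compute. Here is a purely illustrative R sketch of the mean:median idea, assuming `spot.pixels` is a list with one vector of pixel intensities per spot (only available at quantitation time) and using a made-up tolerance parameter:

```r
## Toy spot weight from the mean:median ratio: 1 for a symmetric
## pixel distribution, decaying towards 0 as the spot gets more
## skewed (dust, bleed, printing artifacts). 'tol' is arbitrary.
spot.weight <- function(pixels, tol = 0.2) {
  ratio <- mean(pixels) / median(pixels)
  max(0, 1 - abs(ratio - 1) / tol)
}
weights <- vapply(spot.pixels, spot.weight, numeric(1))

## Downstream, limma's model-fitting functions (e.g. lmFit) accept
## per-spot weights, so flagged spots can be down-weighted rather
## than discarded outright.
```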
To my mind, this is probably the single biggest issue in the pre-processing of all genomic data. Most investigators ignore it; any bioinformatician who takes this stuff seriously struggles with it, and struggles with it, and struggles with it, and eventually ignores it too, because we have no good solution. I think if you wanted to make a real improvement in bioinformatics, this is the single biggest open field: doing a better job of quality assessment on individual spots.

Now, three or four minutes ago Angelina had a question I want to come back to, about technical replicates. It depends on the array platform. We'll talk about how Affymetrix does its replicates in a bit of detail: theirs are not pure replicates, they're probes against different sequences of the same gene, and they're spatially distributed around the array. By contrast, some of the Agilent arrays have exact technical replicates, duplicates, which are adjacent and so can suffer from shared spatial effects; others do not. So there's a huge amount of variability in whether there are technical replicates, what kind they are, and how they're spatially distributed. BeadArrays always have technical replicates that are spatially distributed, because they're inherently randomly structured arrays. (On the scanner artifact: someone in the lab caught it very early, in January, and asked why the data looked like this, and I asked exactly the same thing. Are the chips still scannable? I'm not 100% sure; we haven't seen other patterns, and since the scanner scans a band of spots at a time, I imagine the bias runs along that band.)

(Question about the literature.) If you look at any strong bioinformatician in the microarray analysis field, they'll have a series of papers on spot-quality assessment. There was a UK consortium, I think the group was called Music, that got non-trivial amounts of funding to try to resolve this. Wolfgang Huber's group, which produced one of the standard packages used for microarray analysis, published a couple of papers; Audrey Thompson, I think, is the first author. And Rafael Irizarry, who developed RMA, an algorithm we'll talk about as well, wrote a series. So there's a lot of literature on people attempting to fix this. (And yes, absolutely, there is similar work on the proteomics side, applying similar techniques; it's the same problem, where some things work and some don't, and the costs vary.)

Let me just do a time check: it's 10:05, 10:08, so we're running something on the order of 15 minutes late. In about 10 or 15 minutes I'll ask whether you want to go for a 20-minute coffee break and then come back. So, spot quality: it's nearly impossible, nobody researches it anymore, and we don't know how to fix it. Fortunately, in case you're getting depressed that we can't fully fix quantitation, background correction, or spot quality, the next step, normalization, turns out to be easier. The idea here is that we've got systematic biases of different types and we want to remove them: we want to bring the data to a common distribution so that spots on different parts of the arrays have similar error characteristics.
While this is a difficult thing to do in some sense, it is extremely intensively researched, even today; you'll find papers published on it almost every month, and it's probably close to being a solved problem. You've got three basic classes of trends or biases. One is simple spatial gradients: you start with an underlying array, and on top of it there's a spatial bias because of some characteristic lack of flatness in the array. Straightforward and simple. Second, you hope your sample has equal amounts of your two dyes, if you have a two-colour array; in practice it never quite does, so you get random, stochastic differences between the channels. Lastly, there's intensity-dependent bias. What you're looking at here is a homotypic hybridization: the X and Y axes are signal intensities for the same sample analyzed twice, and you can see it's pretty darn linear. Notice that it's not exactly a straight line; that's not atypical, that's the level of noise you should expect. And most of the spots fall within these dashed lines. Great. But you'll also notice that the spots fall within the same dashed lines whether their intensity is 65,000 or 5,000. You're seeing the same plus-or-minus for a spot at 5,000 as for a spot at 65,000. That's a bit of a problem: 5,000 plus-or-minus 5,000 versus 65,000 plus-or-minus 5,000; in the latter case we obviously have a much more confident estimate. So the noise is large relative to the signal at the low end, and there's a need to do what's called variance stabilization so that we can make accurate inferences. That's really important if we have a gene that goes from very highly expressed to very lowly expressed, like deletion of a tumour suppressor. If we don't stabilize, we have strong heteroscedasticity, which our statistical models would have to account for, and this unequal noise characteristic even makes the next stage of normalization difficult.

Removing these trends turns out to be, I don't want to call it ridiculously easy, but simple. If your array data set has large spatial effects, simple Gaussian spatial smoothers will remove them with high reliability in a few seconds of computer time. In the vast majority of arrays, though, the intensity-based effects turn out to be more important, and so a loess smoother, a locally estimated scatterplot smoother, implemented in R as the lowess (or loess) function, is widely used to remove these types of effects. And lastly, you'll sometimes see combination effects, where both types of error are present simultaneously; in that case splines, which are piecewise cubic polynomials, can be fit iteratively across the array to give good estimates for removing the noise.
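Here is what that loess step looks like in practice: a minimal sketch assuming `R.fg` and `G.fg` are background-corrected red and green intensities from one two-colour array. In a real analysis limma's `normalizeWithinArrays(RG, method = "loess")` does this for you, per print-tip group.

```r
## Intensity-dependent (MA-style) normalization with a scatterplot
## smoother, on one array.
M <- log2(R.fg) - log2(G.fg)        # log-ratio, the quantity of interest
A <- (log2(R.fg) + log2(G.fg)) / 2  # average log-intensity

fit <- lowess(A, M, f = 0.3)        # f = smoother span
## Subtract the fitted intensity trend from each spot's log-ratio:
M.norm <- M - approx(fit$x, fit$y, xout = A)$y
```

After correction, M should centre on zero at every intensity, which is exactly the "remove the curve from the scatterplot" operation described above.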
All these methods are well established; all can be run with a single line of R, and they never really require any thought or justification to a journal. And since within-array normalization comes that easy, we eventually get to inter-array normalization. Inter-array normalization is very intensively researched and is about as close to a solved problem as you get in this field. Essentially, the idea is that when we do multiple experiments, batch or other characteristics can lead to differential loading or overall intensities between arrays, and so you want to scale the arrays. Imagine an experiment that looks like this: each curve is a different array, or a different channel of an array; the y-axis is the fraction of spots with a certain intensity, and the x-axis is the intensity. It's just a distribution. You can easily apply a scaling algorithm and bring these to a common distribution; afterwards they look far, far more similar. Now, there's something very characteristic about this array experiment. Just looking at those plots, you should be able to tell what type of experiment it is. Can anybody figure out what we're looking at? It's a two-colour array, but there's something about what we're studying biologically that is really evident from these plots. What is that secondary peak, biologically? You're on the right track; not over-expression. As soon as you see this profile, with the secondary peak over here, you should immediately go: this is a ChIP-chip, or a RIP-chip, or some other enrichment experiment. The large peak at intermediate intensities is all noise, and this little peak here is the actual binding, the signal in the experiment. That's the characteristic pattern of an enrichment experiment: a big, clean noise distribution, a minority of spots carrying signal, and contrast between the two.

So that turns out to be easy to handle, and it lets us close the door on how we remove noise. We start with an image; we quantitate it into a series of numbers; we use mathematical models to do background correction; we at least think about spot quality, even if we don't do a whole lot about it; we use a spatial or intensity smoother to remove effects internal to each array; and finally we balance all the different arrays in the experiment. In summary, that's how we remove noise from the study, and it's very, very similar to how we'd treat a sequencing experiment: fundamentally, a series of steps that starts with an image and removes different types of noise until we have data we can do real analyses on. It's pretty close to an automated pipeline, but you should still invest in looking at the characteristics, asking "what does this look like, how did this work?" For example, think about whether it's worth looking at the images for all your arrays: not hours per image, but if you have 50 of them, 30 seconds on each is a valuable exercise. (Question about parameters.) Most of these are minimally parameterized algorithms; it's actually quite clean that way, especially relative to sequencing tools. Typically your choice is algorithm A versus algorithm B, rather than re-parameterizing algorithm A; there really aren't tuning parameters in any sort of regular use.
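Before we break, since the scaling step is the one people most often want to see in code, here is a hand-rolled sketch of quantile normalization, one common choice of "bring them to a common distribution" algorithm; it assumes `X` is a spots-by-arrays matrix of intensities:

```r
## Quantile normalization: force every array (column) to share the
## same empirical distribution, namely the mean distribution.
quantile.normalize <- function(X) {
  ranks <- apply(X, 2, rank, ties.method = "first")  # per-array ranks
  ref   <- rowMeans(apply(X, 2, sort))  # mean of each quantile
  apply(ranks, 2, function(r) ref[r])   # map ranks back to the reference
}

## Off-the-shelf equivalents in Bioconductor:
##   limma::normalizeBetweenArrays(X, method = "quantile")
##   preprocessCore::normalize.quantiles(X)
```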
(Another question, about built-in quality controls.) There are a couple of basic things: you have negative control probes and positive control probes, and you verify that the negatives are negative and the positives positive. But in terms of a kind of systematic QA/QC framework, no, I'm not aware of anything.

So, we're at quarter after, twenty after. How do you guys feel about taking a coffee break now, and then coming back to talk about the next steps for probably 20, 25 minutes before getting to the practical? I'm okay with keeping going another 10 or 15 minutes, but this is a nice logical point. So: green if you want a coffee break now, red if you don't. Okay, we're going for coffee. Coffee break now, you have 30 minutes, and then we'll come back and pick up the next step.

Alright, let's get started. Where did we leave off? We'd come to the conclusion that pre-processing your data is sometimes hard and sometimes easy, depending on which part of it, and that people like to study the easy things. We're now going to focus, for a little bit, on how we extract information from the data, and then we'll zoom in on how Affymetrix arrays work in a bit of detail. Very quickly, on significance testing: this is what you're going to talk about a lot this afternoon. In some fundamental way, statistics is statistics, and doing it on microarray data doesn't change all that much. That being said, statistics for microarrays has a couple of unique features. There are ways to take advantage of the dimensionality, and the most common statistical questions we ask of microarray data are, clearly: are two groups different, and do two things synergize? There's also a lot of work on survival analysis, which is becoming increasingly important. We're going to focus on the first of these in the practical; but in some sense, once you have your data, you apply the statistical model that is most appropriate, microarray data or not, and that's not a bad approximation of what you should be doing.

We'll spend a bit more time now on clustering. Clustering is a branch of machine learning, and machine learning is something I think everybody here has used today. Who's used machine learning today, hands up? Just three people, four people? Only four? As long as you have email, spam filtering is machine learning. Anybody else? Did people take the stairs to come up here? No? The rest of you took an elevator, and elevator allocation is machine-learned. Everybody who googled anything this morning, for whatever reason: that's machine learning. Automated, high-frequency stock trading is machine learning. Does anybody wager on sports? You don't have to answer that, but if you do, the odds are being set using machine learning algorithms as one of the major determinants. Anybody look at the weather today? Weather prediction is machine learning. I could keep going, but in short, one of the most important parts of your life is the ability of computers to predict what the heck is going on.

Now, unsupervised machine learning is a very, very tiny part of the field. When I say tiny: take a canonical textbook, the kind you give a fourth-year undergrad or a first-year grad student. The classic one is called Pattern Classification, about 500 pages. In those 500 pages there's one chapter, 12 pages, on unsupervised machine learning,
and the rest of it talks about other types of machine learning. So it's actually a very tiny part of the field, but one we use very frequently in bioinformatics. It's sometimes called clustering, but they're not quite the same thing: clustering is a type of unsupervised machine learning, and it's about finding patterns, clusters, in a data set. So: a very small branch of machine learning, and probably extremely overused in bioinformatics. Actually, "probably" is the wrong word; it is extremely overused in bioinformatics. Generally it produces pretty pictures like this one, which is a terrible picture. Is anybody in the room red-green colour blind? I forget the exact number; I think it's around 3% of the male population, more common in people of certain ancestries. So imagine you submit your paper to Nature and by chance it goes to three reviewers in Germany, Finland, and so on: there's a real chance one of them will be red-green colour blind, and you've annoyed your reviewer with this figure. Red-green is a bad colour choice; it should probably be red-blue. Either way, this is what we'd call a cluster-gram. It has a couple of really interesting parts. The tree at the top is a dendrogram whose structure is proportional to how similar the samples are. The heat map itself is the colour: the more intense the colour, the more intense the signal. So, for example, dark red is very up-regulated, dark green is very highly down-regulated, and black is in the middle, which is a weird colour-scale choice. And when would you make one? Here the x-axis is stimulus 1 and the y-axis is stimulus 2, and you can see some genes that go up in both, other genes that go up in stimulus 2 and down in stimulus 1, and so forth, through these circles that I claim are clusters.

Well, a computer is going to try to do the same thing: draw the circles. The way it does so is to employ two very, very simple heuristics. First, it looks at how tight a cluster is. It asks: what is the separation of things within an individual cluster, the average difference between individual elements within a cluster, for example? And it wants that to be as small as possible: you want the clusters nice and tight, so everything within one looks very similar. Second, you want your clusters to be separated from one another by a large distance. And there's a trade-off between the two: you could go down to one cluster for every data point, which is great, the clusters are super tight, but now they're unfortunately close together, because every gene is right next to another cluster. So you're balancing the intra-cluster distance against the inter-cluster distance, and those two quantities are essentially the metrics traded off in any clustering algorithm, be it hierarchical or k-means or whatever we choose to use. Generally we do a min-max: minimizing intra-cluster distance while maximizing inter-cluster distance. In some algorithms we specify the number of clusters; in other cases we allow the algorithm to figure it out on its own. Different algorithms also make different assumptions, so you can pick the assumptions most useful for your domain; there's a small sketch of both flavours below.
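A toy illustration in R of both approaches, on simulated data with one injected co-regulated block, drawn with a red/blue rather than red/green palette per the point above:

```r
## 60 genes x 10 samples of random data, with genes 1-20 shifted up
## in samples 6-10 to create a co-regulated block.
set.seed(1)
x <- matrix(rnorm(600), nrow = 60,
            dimnames = list(paste0("gene", 1:60), paste0("s", 1:10)))
x[1:20, 6:10] <- x[1:20, 6:10] + 2

## Hierarchical clustering: the dendrogram + heat map ("cluster-gram")
d  <- dist(x)                          # pairwise distances between genes
hc <- hclust(d, method = "average")
heatmap(x, Rowv = as.dendrogram(hc),
        col = colorRampPalette(c("blue", "black", "red"))(64))

## k-means instead, where the number of clusters is specified up front
km <- kmeans(x, centers = 2)
table(km$cluster[1:20])   # the injected block should mostly co-cluster
```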
There are probably four major reasons why we use clustering in bioinformatics. Let me rephrase: there are four good reasons to use it in bioinformatics; there are a lot of bad ones. Number one is visualization: these cluster-grams are kind of pretty, and done with sensible colours your reader will be able to figure them out without being annoyed. That's one good reason to do it. A second good reason is to make class assignments, which I'll talk about in a second, or to identify co-regulation: genes that tend to move together. The last reason is quality control. In early microarray studies, people would cluster all the arrays a centre had done and see patterns that looked like this: a series of arrays that looked very similar to one another. And they'd go: ah, we have discovered new subtypes of cancer! And then somebody would go: those four arrays were all done by John, and those other four arrays were all done by Jennifer; have you noticed that the data completely clusters by who ran it? Clustering turns out to be a really good way to identify biases in your data, a quality-control tool. When you cluster your data, you're asking: here are the strongest natural trends in the data; do they correspond to a biological phenomenon, like tumour versus normal, or do they correspond to a technical covariate, like the day I ran my experiment, the person who was working, or the batch of arrays I used? And imagine that over time, in a large experiment, the technology changes: do you see the different technologies as different batches in your clustering, or do they turn out to be intermingled, suggesting your normalization is working? It's a very effective method of assessing the quality of your data.

Let's talk quickly about class assignment. For most genes we don't know their function: there are about 1,500 yeast genes without assigned functions, and about 12,000 genes where all we have is an electronic annotation. Fifteen hundred in the most-studied organism, yeast, is quite shocking. So you might think you could come up with good inferences about their function just from patterns of expression. I mentioned Tim Hughes before; another thing he was involved in was the first major application of this idea. What he chose to do was take all of the yeast knockouts he could get his hands on and cluster the data. You'd see clusters of genes involved in mating: here's a group of 10 genes that are all involved in yeast mating. Oh wait, 9 of them; 1 of the 10 has no known function. Interesting: across these 500 experiments those genes move together and are highly correlated, therefore I predict that the tenth gene is also involved in mating. And they were able to show that this gives quick, reasonably good estimates of gene function. So that's a great example of how you can infer gene function using unsupervised methodologies.

As I said, though, clustering is overused, and some of the uses are terrible. I cannot tell you the number of times I will show a heat map and somebody, often a clinician, will point and go: these are the genes that I want. Well, we'll do a statistical analysis to figure out if any of them are statistically significant, and none of them will be, and then they'll say: but those are the genes that I want. These are fundamentally different endeavours. Clustering is not a replacement for statistical analysis; it can never be. In fact it has less power than a statistical analysis will, and it doesn't give you assessments of p-values in the way you might think; it's just a way to visualize the data. And of course that means some people will say: okay, I know what to do. I'm going to go ahead and do my statistical analysis first, and then
I'm going to cluster the results and see if they give the trend that I want. Well, okay: if I select the genes that differ between tumours and normals and then cluster on them, I'm going to see tumours and normals cluster apart. That's almost definitional. So clustering is either a visualization tool, in which case you can apply it before or after statistical testing; or it's a way to see the largest trends in a data set, in which case you have to use the whole data set, because as soon as you apply a statistical filter you've biased it and you can't infer anything from the clustering profiles. It certainly doesn't replace standard statistical analysis, and it doesn't inherently give you an assessment of chance. Actually, that last part is only almost true: if you've ever seen a phylogenetic tree, there's a confidence value on each branch saying "we are this confident in this split", and there are lots of techniques for producing those. Similarly, if I suspect my technicians show up as effects that cluster together, there are lots of metrics that let me assess whether the technicians bias the clustering patterns. But if you don't apply them, and you just look at the picture and go "oh, that looks non-random to me", that's not particularly meaningful. Good clustering should be accompanied by statistical evidence supporting whatever you think you're claiming from it.

So, remember the following things. Microarray analysis is summarized by a pipeline of algorithms; that's your standard workflow, and the pipeline is critical, because when you look at an experiment you can ask: are steps missing from the pipeline? Sometimes, with some technologies, steps will be split or merged, but these key characteristics are present in any microarray analysis. Second point: this is still an active area. You still see microarray analysis methods published at least monthly, on the order of 20-plus new methods a year. Those methods are in some cases better than what people currently use, so there are opportunities to get more value out of a microarray experiment simply by re-analyzing it five years later. And lastly, what I've shown you holds true for all microarray platform types, obviously with some type-specific changes added or subtracted.

The vast majority of microarray analysis happens in an R-based environment called Bioconductor. Bioconductor basically happened when a series of high-profile statisticians came together and said: I want to analyze microarray data, but I don't want to have to worry about how to read all those file formats and that sort of plumbing; can we make a standard library that we can all use, so I can put my effort into the interesting statistical problems? And they built a highly robust, wide-ranging software framework called Bioconductor, which is how the majority of microarray data is analyzed, the majority meaning easily 95%. There are non-Bioconductor approaches, which are sometimes really, really good, but it's a safe bet that somebody who did their analysis in Bioconductor is more likely to have gotten it right, with current algorithms, than somebody who didn't. And at this point I often tell a quick story about commercial software for analyzing microarray data. One of the most common algorithms for analyzing Affymetrix data is called RMA; we'll talk about it in a few minutes. RMA was initially implemented in Bioconductor, in R. A bug was found a few months after it was first released, and so they fixed it. At the same time, a
commercial software package had said "this is the new standard, so we've implemented it", and it took them over four years to fix, in the commercial package, the bug that was fixed in R within a month. So for years you could be paying for software, thinking "I'm doing RMA just as everybody tells me I should", and getting numerically wrong answers. Bioconductor has a huge advantage because the people developing it use it every day: if there's a bug, then the grad student or the programmer fixes it, or else all the analyses they themselves run will be incorrect. And because they're doing analyses all the time, those fixes happen very rapidly. So there's a big advantage to the open-source software over the commercial software in terms of up-to-dateness and correctness of the analyses.

I said there are some technology-specific characteristics, so let's talk quickly about an Affymetrix-specific workflow. Here's our generic workflow. With Affymetrix, we ignore image quantitation, because the vendor's software does it for us. It's a one-channel array, so there's no Cy3/Cy5 pair. We all ignore spot quality, because we have no idea what to do with it. And we do intra-array and inter-array normalization in one step; that doesn't mean they're being skipped, we just do them simultaneously. So we rephrase and collapse the pipeline a little and get this: start with the quantitated data; do a background correction and normalization; do probe-set annotation (there are typically 11 separate probes, against different regions, for each gene, and we collapse those into one value); and then go on to the standard statistics, clustering, and integration analyses.

Why does probe-set annotation matter? Because arrays can become outdated, and this is perhaps the most important thing to think about in a microarray experiment relative to RNA-seq. In RNA-seq you find whatever the heck is there in the RNA; with an array, in contrast, you're measuring targeted aspects of the transcriptome. You're saying: I want to measure these 20,000 genes. If somebody discovers a new gene tomorrow, you won't find it. The most commonly used microarray is the Affymetrix HG-U133A, the single best-selling array in the world; look on GEO or other databases and there are hundreds of thousands of them. Does anybody know what the name stands for? The U is for UniGene, and 133 is the UniGene build number the array was designed against. UniGene has moved through a great many builds since, released at something like 10 to 12 a year, so the annotation used to design the most widely used array in the world is badly out of date. We've sequenced a lot more genomes; we've identified probes on the array that target things that aren't even human genes, contamination from bacteria or viruses in the initial sequencing of the human genome; we've discovered new genes; and in many cases things we thought were a single gene turn out to be two or three genes that we hadn't been able to accurately resolve. As that happens, what the array actually measures changes. We can't redesign the array regularly, because the mask design and production are very expensive, but we can take advantage of the fact that each gene is covered by, on average, 11 probes. Of those 11 probes, some are going to be good, faithful representations of the gene; some might just be noise; we might find that one actually sits in an intron that's never expressed, that we made a mistake, or that this probe cross-hybridizes to 12 other genes in the genome. And so there's a great
opportunity here to say: let's take those multiple probes per gene and remap them. So when you do an Affymetrix experiment, you start with a chip, which gets scanned into a DAT file. A DAT file is actually just a TIFF; I don't know why it gets a special fancy file extension, except that they thought it would look good. That image then gets processed into your quantitated raw data, the CEL file (CEL stands for chip expression levels). The CEL file is what you need to take into all your analyses. Then there's a file called the CDF, the chip definition file, in which all the probes, both control and non-control, are related to genes. To remap the array, all you have to do is update that mapping: realign all the probe sequences against current annotation and build an alternative CDF. In my experience, use of an updated, alternative CDF is the single biggest indicator that an array experiment was well analyzed: when I see an experiment under review that was analyzed using an alternative CDF, I go, the odds are really good that they got the rest right too. So it's a key characteristic.

Alright, the last thing we should talk about: what exactly is pre-processing for? Pre-processing is the removal of sources of technical noise: anything that comes from the initial manufacture of the chip, like spatial effects in manufacturing, or anything that comes from processing the samples and hybridizing them to the array. Some of those artifacts you just can't anticipate. For probably 10 years we had systematic issues with microarrays run in the summer in Toronto: ozone rises in the summer, and ozone quenches the dye that is being used. During my Ph.D. we would find that between the months of May and August we got no useful hybridizations, and it took us about a year and a half to figure out why all of our experiments suddenly stopped working. The only way to get them to work was for a postdoc to come in at 11 p.m. and run the experiments between 11 and 5 a.m.
That sounds crazy, but most facilities actually run arrays in ozone-free rooms now to avoid exactly this kind of thing. Systematic artifacts like this are everywhere, in all sequencing and genomics experiments, and it's incredible. So that's the goal of pre-processing. But the pre-processing we do, the normalization, is kind of a sledgehammer: we don't know the exact contribution of each of these effects, so we apply broad statistical transforms that try to smooth out gross characteristics of the data, hoping to remove these sources of noise. Pre-processing is necessary because we have all these sources of noise, but it's more important to minimize them up front. This goes back to a question from the break as well: pre-processing is not the best way to remove these things. You want to remove them by good experimental design. Design your experiments to minimize the noise; then you can do gentler pre-processing and your data will be cleaner. In an ideal world we wouldn't have to pre-process at all; everything would be perfect right off the experiment and we wouldn't think about it.

I'm not going to talk for hours about good experimental design; you probably know most of the principles and should think about them. But I'll point out a few key things. Number one, balance: if you have the choice, doing 20 samples as 10 tumours and 10 normals will give you your maximum statistical power, and it's very valuable to have a group of normals or controls that you can use to characterize normal experimental variation. Number two, use biological replicates rather than technical replicates. If you can do 20 arrays, then 10 individuals with cancer and 10 individuals without, biological replicates, means you're simultaneously measuring biological noise and technical noise, which gives you combined estimates of both. By contrast, if you array a single individual 10 times, you'll have a very robust estimate of technical noise but no assessment of biological variability. And in most of our experiments biological variability is much larger than technical variability, so it's essential to be able to assess it. Number three, and maybe most important of all: you want to process your samples identically in every way, shape and form, but that's not actually possible. Imagine I get a grant that gives me $200,000 to do my experiments, $100,000 in year 1 and $100,000 in year 2, so I'll process 1,000 samples a year. I'm very excited, so in year 1 I process the first thousand; then I wait a year for the next money and process the next thousand. All sorts of things will change in between: the batches of reagents, samples getting older, the technician might change. So you include controls. For example, take a small set of samples, call it 20, and run those same 20 samples in both year 1 and year 2. In year 2 I now get 980 new samples instead of 1,000, but I have 20 exact estimates of what changed from year 1 to year 2, and I can use those to assess just how big my batch effects are, what kinds of batch effects I have, and whether I can somehow mitigate or remove them. In fact, you might even run those 20 samples first, before the other 980 in year 2, to get an early read on whether there are hints of large batch effects; if they're really large, maybe I should redesign my experiment, or go back to the genomics facility and figure out where they're coming from.
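To make that concrete, here is a hedged sketch of what you might do with those bridging controls, where `year1` and `year2` are hypothetical genes-by-20 matrices of log2 expression for the same 20 samples re-run in each batch:

```r
## Per-gene batch shift estimated from the bridging controls.
delta <- rowMeans(year2 - year1)
hist(delta, breaks = 50,
     main = "Year 2 minus year 1, bridging controls")

## A paired test per gene, with multiple-testing adjustment, flags
## genes whose batch shift is larger than chance.
p <- apply(year2 - year1, 1, function(z) t.test(z)$p.value)
sum(p.adjust(p, method = "fdr") < 0.05)   # genes with a real batch effect
```

This is the "look at it directly, do statistics and multiple-testing adjustment" assessment, as opposed to hoping a batch effect shows up in the first principal components.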
That kind of situation is incredibly common. We'll do 5 samples now, 15 later, and next year we find another 10. That sounds like a nice way of growing the experiment, but it's a very dangerous thing, because as the experiment grows you're also accumulating different batches and different amounts of technical artifact. So it's important to think through, in advance, the types of controls you're going to need. (Question: wouldn't batch effects show up in a principal component analysis?) It depends. Sometimes the batch effect is not the largest trend in the data set, so it won't show up in PC1 but might still be a large factor. You can imagine PCs 1, 2 and 3 comprising 80% of the variability while the batch effect sits in PC4 at 10%, and you'd miss it entirely. So I don't think PCA is a good way of assessing batch effects in general. It's a nice visualization, but if you want to assess batch effects formally you should have the same sample run multiple times, look at it directly, look at the fraction of genes that change, and do statistics with multiple-testing adjustments.

Lastly, before we get into the practical: there are two widely used ways of pre-processing Affymetrix data, which you're about to do. They are called RMA, for robust multi-array average, and MAS5, for Microarray Suite 5. A long time ago, like a decade ago, Affymetrix thought it was worth developing their own algorithms and did a lot of work on it. They released an algorithm that was truly terrible, called MAS 4.0. A bunch of statisticians said "we can do better" and created RMA. When Affymetrix realized that other people were going to put in the hard work and money of developing algorithms for them, they basically stopped; so the statisticians who fixed it also had the consequence that the company doesn't really do this anymore, and Affymetrix now routinely releases products without much analytical support behind them. The two algorithms trade off strengths and weaknesses in a very reasonable way. In short, MAS5 is the more accurate algorithm: it comes up with a better estimate of, for example, a fold-change, but it is not as precise; there's a bigger plus-or-minus around its estimates. RMA is less accurate: it tends to be biased, and in particular it tends to underestimate true effects, but it is much more precise. So the thinking goes: if you have a large patient cohort with plenty of statistical power, MAS5 is probably the better way to go, and if you have a very small one, RMA is probably better, because it allows you to work in discovery mode. Both are well accepted by reviewers and journals, and neither will get you into anything that approximates trouble in the peer-review process, as long as you do your QA/QC to check that they're appropriate algorithms for your own individual data set. So: we're going to go through the use of these algorithms on a real data set and see what it looks like. Okay.
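As a preview of the practical, here is a minimal sketch with the Bioconductor affy package, assuming a working directory containing Affymetrix CEL files:

```r
library(affy)

## ReadAffy() picks up every *.CEL file in the working directory.
## An updated, alternative CDF package can be supplied via the
## cdfname argument if you have one installed.
raw <- ReadAffy()

eset.rma  <- rma(raw)    # RMA: background-correct, quantile-normalize,
                         # summarize probes; returns log2-scale values
eset.mas5 <- mas5(raw)   # MAS5: Affymetrix's algorithm; linear scale

## Put both on the log2 scale before comparing:
rma.vals  <- exprs(eset.rma)
mas5.vals <- log2(exprs(eset.mas5))
plot(rma.vals[, 1], mas5.vals[, 1],
     xlab = "RMA (log2)", ylab = "MAS5 (log2)")
```

Plotting one against the other on the same sample is a quick way to see the precision-versus-accuracy trade-off discussed above, particularly the extra scatter in MAS5 at low intensities.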