OK, so we're going to talk about a bunch of different things today. For a large portion of the day we're going to be practical, so I would like you guys to actually have the experience of opening a microarray data set, learning how to load it, how to process it, those kinds of things. I'm going to do a hands-up survey about that in a second. I'm going to start off in the morning by telling you a fair bit of theory: basically, first, what is expression profiling, how it works, and what are the critical factors that you need to consider. I'll then walk you through, on screen, the analysis of a sample experiment. So I'll have an R session open, I'll take the questions, and I'll go through the first three or four questions with you step by step, showing you what I do, how I do it, and why I do it in that particular way. I'm going to actually code it while we're talking, so it's not going to exactly match the solutions, and that's kind of the point. You're going to see it done twice, which is going to be really useful. Then, for the workshop in the afternoon, I'll let you loose on the remainder of the questions, and we'll walk around and help you when you get stuck on different questions.

So, as Michelle suggested, I wanted to get a feel for what you guys are working with. How many of you are working, practically, with microarray data at the moment? So, a good half of the room. OK. How many are working with next-gen sequencing data? So, another group. And how many are working with something else: proteomics, metabolomics, or another type of omics? So, a couple of people. For the people who are working with microarray data, how many of you would say that you're working with a data set of more than 100 arrays? OK, a couple. So, for people working with large data sets, there are certain technical issues that we won't be discussing at all today. We're going to be focusing on issues that would mostly arise on smaller data sets. One of those issues is simply computational power. The data set that you'll be working with was chosen so it can run on your computers, and you can't run most hundred-array data sets on a local computer. There are also time factors: for large data sets, an analysis can sometimes take a very long time to complete, and that's not really practical for a one-day workshop.

All right. So we're in the first section, and there are a couple of things that we're going to talk about. The first thing I'd like to ask you guys is a question: what exactly does an mRNA microarray measure? Messenger RNA, somebody says. So what do we mean by mRNA? Does anybody want to expand on that? Messenger RNA from the cytoplasm of the cell, and most of it is polyadenylated. Usually, yes. And why is that? Because it's an easier way to grab it. Yeah, an easier way to extract it. Did anybody have something else? The transcription products of whatever is going on in a particular tissue at a particular point in time. So there's definitely tissue specificity there, and a time component; those are a couple of the critical things we'll come back to. Any other critical factors? It actually compares expression between your samples. Yeah, so it's some sort of a relative measure.
Very good. Anything else? It measures a fraction of the gene, not the whole transcript. The expression of the gene? It measures a fraction of the messenger RNA: it only measures a portion of the messenger RNA, because not the entire thing is represented on the array. Good. Anything else? And it's changing over time; it's not something fixed. So there's a temporal snapshot as well. Good. So what did we say initially, that an expression microarray measures mRNA expression? It doesn't really do that. It measures a snapshot in time. It measures a single tissue. It measures a fraction of the gene. And worse, even if you take one of the most homogeneous tissues, say the liver: the liver is 90-odd percent hepatocytes, but there's 10% of other cells, structural cells, immune cells, blood, all running through the liver. And it measures all of those cells pooled together into some sort of an average representation of a clump of tissue. It doesn't represent the entire transcript; it may not even catch all splice variants or all versions of it. And lastly, it's relative. It doesn't actually give me the absolute mRNA level; it gives you something relative. And because of the way the technology works, which we'll talk about in a second, it's relative in a different way for every gene, so you cannot trivially say that this gene is more expressed than that one.

So those are a lot of limitations for a technology that's very widely used. Why is it so widely used, then, if it's got all these limitations? Because it's simple to measure in comparison with other molecules, like protein. So it's easier to do than measuring something like protein, definitely. Anything else? Yes, exactly: it's just very cost-effective. The equivalent of doing 20,000 PCRs is one of the cheapest ways in which somebody can generate data in large quantities. All right, so that kind of gives you a hint of the limitations of the platform. I didn't ask: can everybody at the back hear me all right? Yes, OK.

So with those limitations in mind, we should talk about what microarrays actually are, how they work, and what the underlying sources of noise are. When you analyze any type of data as a bioinformatician, it's critical to know where the noise arises from. Because if you don't understand the technology itself, you can't understand how you need to model it in your computational procedures. And if you don't understand the computational modeling, you won't get your statistical analysis correct. So it all has to start with an understanding of the technology and of the potential sources of noise within that technology. So we'll talk a lot about what microarrays are and where errors or problems can potentially come up in microarray analysis. We'll talk a little bit about what they're used for, particularly at the molecular level, but also a little about the biological and downstream aspects. Then we'll talk in detail about the basic bioinformatic workflow for a microarray experiment, a kind of template that, with modification, would be useful for any microarray experiment. And then we'll discuss in particular the Affymetrix template, for one particular type of experiment and one particular technology, and we'll look at how you'd apply that to your own analyses and your own data sets.
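Before moving on, one small concrete aside on that last point, that the measurement is relative in a different way for every gene. Here is a tiny, purely illustrative R sketch (the expression values and probe affinities are invented) of why raw intensities can't be compared between genes, while within-gene comparisons across samples still work:

```r
# Two genes with identical true abundance, measured by probes with
# different (and, in practice, unknown) binding affinities.
true_expression <- c(geneA = 1000, geneB = 1000)
probe_affinity  <- c(geneA = 0.2,  geneB = 3.0)

sample1 <- true_expression * probe_affinity              # e.g. normal tissue
sample2 <- true_expression * c(2, 1) * probe_affinity    # geneA doubled in the second sample

sample1
#> geneA geneB
#>   200  3000    geneB "looks" more expressed, but that is just the probe

log2(sample2 / sample1)
#> geneA geneB
#>     1     0    the within-gene, between-sample ratio recovers the 2-fold change
```

That is all the relative caveat means in practice: you compare the same probe across samples, not different probes against each other.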
So, for starters, let's define a microarray. And remember, like I said, if you have a question, just stick up your hand and ask; I don't mind being interrupted. So underlying a microarray is a technology whose most critical feature is that it's multiplexed: it's highly parallel. You can describe it as an ordered array of things. If it's a DNA microarray, it's an ordered array of DNA, and each spot carries a specific sequence that can range in length from as little as maybe 20 base pairs up to 20 or 30,000 base pairs. So there's a specific sequence in each of those spots on the array, and it's generally used either to quantitate or to capture RNA or DNA. Sorry, when you say DNA, do you mean cDNA or DNA? Absolutely, one sec. So in general, when we deal with RNA, we normally reverse transcribe it into cDNA, and that gives you a more stable and more easily labeled substrate. But there are definitely some arrays that work directly on RNA; that's not unheard of at all. In terms of DNA, there's a number of things one can do. One can use DNA, and we'll talk about this in a second, to measure SNPs by microarray. Or one can capture the DNA from certain regions of the genome using an antibody and hybridize that to an array to determine where the antibody was binding; that's called ChIP-on-chip. So there are a number of different techniques, but each of them relies on the fact that the microarray is a parallel method of measuring amounts of DNA or RNA. And as long as you know the sequences, you can do just about anything that involves measuring RNA or DNA amounts.

In general, although not exclusively, microarrays are hypothesis-generating experiments. What that means is that a microarray is a way of asking which genes are involved, which features are involved. You don't hypothesize that feature X is involved; you hypothesize that there are some features, and you figure out what they are, and then you can go ahead and validate them and look at them in other ways. But it's not entirely true, and the word usually is there for a good reason. For example, you may have a hypothesis that there is a biomarker for lung cancer; that's something you would directly answer with a microarray. But in most versions of it, it's a screen, a way to identify candidate genes or regions of interest. And just because it's a hypothesis-generating experiment doesn't mean that you don't have to do experimental design; it actually makes it far more important to do good experimental design. For all of you working as bioinformaticians, the most critical thing in any microarray experiment is that before you do the experiment, you think through the analysis. You say: all right, when I have this experiment, how would I do the analysis? Do I have the appropriate controls to normalize my data? Do I know that I'm going to have sufficient statistical power to detect the effects that I'm interested in? Do I even know which technology is better, which regions of the gene are most interesting? And of course, every company will come up with a very interesting and useful explanation of why their technology is going to solve your problems, and you have to do a careful evaluation of which technology is best for which question. Sometimes that comes down to what you have skills or expertise with. So experimental design is actually a challenging thing, and I would say that in good experiments you spend almost as much time designing the experiment as you do on the analysis.
By the time you get down to the analysis, it becomes, not the first time you do it, but after a few times, relatively routine. You say: oh, I normalize the data using this technique, here's a chunk of code that does that; I need to do my statistical analysis using that technique. It's the evaluation and determination of those techniques that is time consuming. And that's one of the major things I want you to get out of today: how can I do a good job of picking techniques, and what are the things that I need to think about?

The other thing that really critically determines how an experiment works is the nature of the sample that you're using. For example, we could have a sample that has never been frozen, that comes directly from a cell culture line or from an animal tissue, and that directly gets RNA extracted and then hybridized onto a microarray. That's good, because the process of freezing down tissue or RNA can damage or degrade it, and so you'll see a definite difference in quality between samples that have never been frozen and samples that have been frozen. Similarly, large numbers of clinical samples are what are called FFPE-fixed. Does anybody know what FFPE is? Yeah, so can somebody stick up their hand who wants to explain what FFPE is used for? It's paraffin-embedded tissue that's used for histology; it's how most clinical samples are prepared. Yeah, so it's the standard way in which clinical samples are stored and maintained. One estimate is that there's something like a billion FFPE blocks in the world. It's basically taking the sample, fixing it in formalin, and embedding it in a wax, paraffin, so formalin-fixed, paraffin-embedded. This allows for long-term storage and, critically, long-term storage at room temperature. It's not necessary that an FFPE block be frozen to minus 80 or something like that, which greatly, greatly reduces storage needs. And it means that if your freezer goes down, you don't lose critical clinical samples. So FFPE studies are very interesting for a lot of reasons, in part because they represent rare clinical data sets, rare clinical tissue types. But the challenge is that in these tissues the RNA is very, very degraded; even the DNA can be substantially degraded.

So you've got a quality continuum. You can imagine that if in your experiment half your samples had never been frozen and half of them were formalin-fixed, you'd have a systematic difference in your experiment. And if these happened to be all your interesting cancer samples, and those happened to be your normal controls, now any difference that you detect between normal and cancer is in part going to be a result of the difference between formalin-fixed tissue and unfrozen tissue. Those are the kinds of experimental design issues that you have to evaluate up front. And in an experiment like this, it would be perfectly reasonable to say: OK, we're going to formalin-fix at least some of our normal tissues as a control, to understand what effect that's having on our results.

The last thing, which was mentioned earlier, is that you're not always working with the total RNA fraction, although in many cases you are. Sometimes it will be polyA-selected, so it will only be those transcripts that have polyA tails. Incidentally, that doesn't mean only protein-coding transcripts; there are lots of polyA-tailed transcripts that are not protein coding. For example, lots of processed pseudogenes, several untranslated RNAs, even some of the long non-coding RNAs have polyA tails on them.
So it's not as if that restricts you to only protein-coding genes. And of course there are other subsets that you could look at. One of the classic microarray papers looked at different subcellular fractions and showed that the RNA that's present near the membrane of the cell looks substantially different from the RNA that's present in the bulk cytoplasm of the cell. And maybe most interestingly, the transcripts close to the membrane mostly encoded membrane-associated proteins. So the ribosomes and RNAs tend to localize close to where their products are actually going to go. So there are a lot of subsets one can use to look at this.

Now I want to make sure that we're 100% clear on how a microarray works. Imagine you've got a one-spot microarray, and there are a couple of things here. This is the glass slide. This is the spot with DNA strands rising from it. We give each of these names: the glass slide is called the chip; the DNA that is extending off the chip is called the probe; and the spot that contains a number of identical DNA molecules is called the feature. I'm curious: does anybody have a guess for how many DNA molecules make up a single feature on a modern array? Order of magnitude? 25? 25 to 40 base pairs long, yeah, but that's the length. 40 probes? 40 probes per spot, OK. Anybody else? Any other thoughts? 100? 200? So I want to be clear: the question is, in one of these features, how many identical strands of DNA are there? 40, 200, 1,000? Typically it's more on the order of millions. We're talking about a single spot containing 10 million. I think on the low end it's been estimated at 100,000, and on the high end it would be 10 to 100 million. So that gives you a feel: when I draw this with three strands on it, it's got a lot more than those three strands. And that means there's a lot of possibility for experimental artifacts, or for things to happen just by chance alone, because there are 100 million molecules there. With 100 million molecules, anything can happen a couple of times.

So that's the terminology we use for the array itself. We also have our DNA or cDNA or RNA that we're going to be hybridizing onto the array. That will initially be sitting in the cell somewhere, and we're going to label it, typically with a fluorescent dye; there are a couple of different fluorescent dyes. And then it's hybridized: basically, at a slightly elevated temperature, it's allowed to flow in solution over the microarray. The microarray itself is a little bit sticky, so a large fraction of the DNA or RNA is going to physically bind to it, and some will bind more strongly than others. The regions of strong binding are, of course, going to be the ones that are complementary to the probe sequences. At that point in time, a wash is done. And this wash has to be done really carefully. It has to be stringent enough to remove all the nonspecific binding, but not so stringent as to remove the actual Watson-Crick base-pair binding. So it has to be at the right salt concentration, the right temperature, the right buffers, those kinds of things. But if it's done correctly, we're left with a slide where each feature carries only the Watson-Crick base-paired, exact-match binding. If there were no matching molecules, everything should get washed off. If there were 100 million, then the feature should be completely saturated and every single probe sequence will have a target sequence bound to it.
At that point in time, you can just directly scan this in a typical scanner, and you get a picture that looks sort of like that. You can see that it's not actually a perfect spot by any means. When you look at raw microarray data, it will always have this kind of fuzziness around it, and a lot of other features that make analysis tricky. But that's the core of it: you scan it, and the intensity of your fluorescence signal will be proportional to how many molecules of labeled DNA you have there. So if you have one, you should have half as much signal as if you have two. And since we have up to 100 million molecules on some arrays, you should see a big difference between 100 million and 10 million; you should be able to distinguish those large differences.

There are a couple of different microarray platforms, and I'm not going to focus primarily on the different technologies, but I'll talk about them a little now so you're familiar with them. The classic microarray technology is what are called two-color microarrays. Two-color microarrays work when you have two different animals, two different tissues, two different experimental conditions, and you compare them. So for example, you can take two different organs: you have a little rat, and liver and kidney. You would label each with a different fluorescent dye, combine them into a tube, mix them together, and hybridize them onto the array. When you do that, you can do what's called a competitive comparison. For each individual spot on the array, you know how many strands of red and how many strands of green there are. If red was liver and green was kidney, you'd be able to say: I have two liver, two kidney; one liver, one kidney; two kidney, one liver. And that allows you to compare directly on each spot. The advantage of this type of array platform is that any bias that occurs as a result of the features of a spot will be canceled out, because you're comparing this to this; it's a simple normalization technique. So for example, if some spots were bigger than others and you were unable to reproducibly guarantee spot size, then by doing a competitive hybridization you control for that directly. Similarly, if there were sequence-specific differences from one spot to another, or from one batch of arrays to another, this controls for that type of effect.

And of course, I showed you here comparing liver and kidney, but we obviously can take samples from different rats, mix them together and compare them. And this is not limited in any way to just rats; you can take different species. I'm slightly joking here, but not entirely: one of the major uses of this technology is in comparing different species. Two-color arrays do a very good job of comparing human to primate, and almost every experiment that has successfully compared primate gene expression actually had to use competitive hybridization to account for the differences in sequences. So it allows you to do a much better job of comparing species of great apes, or variants across different types of monkeys, or things like that. And I'm still semi-joking: similarly, a lot of plant research started off with this kind of experiment too. Plant research was initially, and still is, a fair bit behind the research in yeast or mammalian model organisms, and there were some very, very good cDNA libraries. And one of the things that's changing now is that for many of the other microarray technologies that we'll discuss, you need to have a sequenced genome to work with them.
And for plant researchers, that wasn't necessarily the case until recently. So they would take large cDNA libraries and use them to construct their microarrays using this type of technology. By contrast, if you have a sequenced genome, there are other things that you can do. Of course, with more genome sequencing, that's not as necessary anymore.

So these are spotted arrays. The way they're produced is with a robot that has a series of needle tips. What you're looking at here: each of these glass slides is a microarray, and this is a robotic arm that carries a set of needles, I think it's 96 on this one. It's going to dip into a solution (the solution wells are over here), grab a little bit of DNA onto the needle, then move over the microarray and just touch down. Essentially it's just capillary force that draws the liquid from the needle onto the glass slide. The glass slide has been treated with certain chemicals: it's extremely flat, and it has the ability to covalently attach the DNA, via linker molecules and things like that. But the point of this is that you've got a robot that basically just goes on producing array after array. You can imagine that one of the limitations is going to be size, because your limit is how close you can put the needles together, and that may be a tenth of a millimeter, something like that. But that's still a physical limitation. So other microarray technologies that have higher density, more tightly packed spots, don't use physical procedures like this.

You can almost always recognize a two-color microarray produced this way. One of the reasons is that they have this characteristic gridding pattern with spaces in between the squares. Each of these squares corresponds to a single tip on the needle head. And so essentially what happens is the robot lays down spot one, then two, three, four, five, six, seven, eight and so forth: a sequential gridding pattern. Over the course of a print run, it's entirely possible for the needles to wear down, and so you can see a bias either across one of these print grids, or spatially across the array, if the robot is not doing a good job of directly controlling things. So remember what I said before: to do good bioinformatics, you have to understand the technology and know the sources of error. And I just told you a whole bunch of sources of error for this technology. We won't be looking in detail at its analysis, but you can immediately guess that there are techniques for print-tip normalization to reduce the effect of sequential degradation of the print tips, and to account for the fact that as the robot goes through all of these samples, the amount of DNA on its needles will start to decline. It's not going to be exactly constant, and there's going to be some sort of systematic batch effect. So all those factors have to be taken into account in our statistical and bioinformatic modeling.

So that was sort of the classic microarray, probably the oldest one. It was developed by Pat Brown at Stanford, but a number of other technologies have been developed using different types of techniques. And they all try to get around the fact that the physical spacing of the needles is a key limiting factor; all the problems that I mentioned before really come down to the needle.
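As a concrete illustration of what print-tip normalization looks like in practice, here is a minimal R sketch using the Bioconductor limma package. It is only a sketch: it assumes you already have a two-color object called RG (red and green foreground and background intensities) with the print-tip layout recorded in RG$printer, which is how most scanner output gets read into limma:

```r
library(limma)

# Within-array normalization: fit a separate loess curve of
# M = log2(R/G) against A = (log2(R) + log2(G))/2 for each print-tip group,
# so that systematic differences between print tips are removed.
MA <- normalizeWithinArrays(RG, method = "printtiploess")

# Compare log-ratios before and after; the per-tip curves should now
# sit on top of each other around M = 0.
plotMA(RG, array = 1)
plotPrintTipLoess(MA, array = 1)
```

The same function also offers plain loess and median options, which is what you would fall back to on a platform without print-tip structure.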
So couldn't the needle be replaced? There are a couple of different techniques for doing that. There are inkjet-based arrays, there are photolithographically synthesized arrays, and then there are bead arrays. And we'll talk about each of these for a couple of minutes.

So, back 12 years ago, HP decided that it was going to be a printer and computer company. A couple of days ago it decided that it was no longer going to be a computer company, so I guess it's appropriate to be talking about this now. And it spun off all its life science and measurement work into a company called Agilent. It's actually quite striking: Agilent has been a very successful company, and I didn't realize 10 years ago that HP had such profound life science technologies. Agilent itself started thinking about what HP was good at and how that could be used. And it started to think: well, we're good at putting really tiny dots of ink onto a piece of paper. That's sort of the same thing as saying, I can put really tiny dots of liquid onto anything; maybe I could put really tiny dots of DNA onto a glass slide. So could you use the same technology used for printers to generate a microarray? That's the idea, and indeed you can. The basic idea for their arrays is that instead of four colors or three colors in your printer, you have four bases. They have some specific chemistry that allows the inkjet head to say, oh, this spot needs an A; then the microarray will be flashed at a temperature with a little bit of reagent to allow a linker molecule to be added to the top of it; then, oh, this next spot needs a C. And it will sequentially build up the sequence, base by base, just by shooting these out one at a time.

So there are a couple of things you can imagine. One is that this isn't necessarily the fastest thing in the world if you're going to be printing really dense arrays, so there are some speed concerns. But a more practical one is that if you're going to be building it up base by base, at some point in time you're going to make a mistake: a molecule is not going to be added correctly, a linker is not going to be attached, the printer will make a mistake, your software has a bug. Lots of possibilities. So there are practical limitations to how long the oligos produced with this kind of technology can be. What kind of range of oligo lengths do you think would generally be produced this way? 60 base pairs? Anybody think anything different? So 60 is what they sell commercially. Actually, there are different reasons for that. The initial paper that described this technology did some analysis and found that there is no real scientific benefit to going up to 70, 80 or 90, which is about the max that they can do. So they do 60 because it's a lot cheaper and a lot faster for them, and it gives about the same quality of data. But you could probably get up to 100 with modern printers and piezoelectric circuits. So these 60-mers are the core of all the Agilent microarrays. And you can imagine that this is produced in a way that is more reproducible than the spotted arrays. It's limited by the density at which you can shoot droplets onto a surface, so it's much higher density, although it's still physically aimed, so there are still some spatial concerns there. And it's got a much more limited range in terms of the length of the oligo. These kinds of arrays, I would say, are mostly used in a two-color fashion, although there are definite cases where people use them in one-color experiments.
The second major technique developed, and I really do mean second (the inkjet arrays that we just talked about were from 1999, 2000, but this was back in '96, a year or two after the first spotted arrays), came from a company out of Stanford that said: robotics are all nice and well, but we specialize in developing printed circuit boards, integrated circuits, transistors for computers. Can we use that technology to produce a microarray? The basic way in which those technologies work is based on passing light through a mask, allowing light to touch certain parts and not touch other parts of the microarray. In that case, you can imagine that you're going to be limited only by the wavelength of light that you're using: the shorter the wavelength and the finer the physical mask, the smaller the spots you can focus down to.

The way in which it works is quite different from the other technologies, and I'll walk through this in a little bit more detail than the other two, because we're going to be working with this data. The basic idea is that you start off with a silanated glass slide. Silanated just means that it's treated chemically in a couple of different ways, which has some nice effects: making it a little bit stickier, making it more chemically reactive, and also helping to smooth out the surface. The surface of the slide has to be extremely flat; if it's not, if there's a bulge in the middle, the sample will pool, and as soon as you have that you have some sort of bias, because different parts of the array don't have similar characteristics. So the slide is silanated, with these silane molecules on top, and those are hydroxylated. That provides a sort of sticky end onto which chemical reactions can occur. To those sticky ends, linker molecules are attached. To the best of my knowledge we don't know exactly what linker molecules Affymetrix actually uses; I think they're part of the trade secret, but we could probably guess, as there are only a couple of chemistries that are possible. But the idea here is that this provides something sticky and standard to which you can easily attach nucleotides, one at a time.

And then to that, we apply the photolithographic mask. A photolithographic mask looks like this: it's a series of spots, and you can imagine a big series of spots, where some places have a hole and other places do not. Obviously, where there's a hole, light will shine through; where there's no hole, it will not. This is identical to how your computers, your cell phone chips, anything like that is produced: the light shining through the mask produces the structures that later become the transistors in your computer. So you've got a mask with a series of holes, and a lamp is used (different wavelengths can be used here) to shine light through the mask, so that it illuminates only certain spots on the chip, and this can be controlled at very fine resolution. I want to make a point here. Notice that the lamp starts off as a point source, as most lamps do, but the light has to be collimated. If the rays are not exactly parallel when they go through the mask, you can have extreme problems, and this is one of the more challenging parts in the production of an integrated circuit: ensuring accurate collimation of short-wavelength light. So now you can imagine what we've got.
We've got the wafer, which has been silanated; the linker molecules have been attached; and for this kind of experiment you're shining UV light, with the mask present over some spots. Here it is blocking feature number two, or what will eventually be feature number two, but the UV light shines onto feature number one and feature number three. So in certain places the light shines, and in other places it does not. Then we do a deprotection, which basically means that the UV light stimulates the linker in such a way that its protecting group can be easily removed. At that point in time we wash over nucleotides. These nucleotides carry a protecting group on one end, and the other end is reactive and can attach directly to the deprotected linker. So we've built up an A on feature one and feature three; feature number two still just has the linker molecules. You wash off everything, get rid of all of your reagents, and then do this again. You change your mask, and now the mask shines UV light only onto feature number two, while features one and three are protected. That deprotects the linker (I like to say reactivates the linker), and now you wash over a different nucleotide, and here you've got a C built up as the first layer on feature number two. You continue like this, adding bases to whichever features you choose simply by changing the mask, and sequentially build up your array, base by base.

One thing that's worth pointing out: what happens if for some reason the chemical reaction is incomplete? That's definitely going to happen all the time, because you've got millions of spots and something like 10 million molecules on an individual spot. So instead of these two having CG, this one just has a C, and it's going to have an off-by-one error; every subsequent position will be incorrect. It's actually easy for them to handle this kind of case. After the base has been added, any molecule that failed to add it still has a free reactive end, and a capping agent can be added; the capping agent blocks that chain from being built any further. So you'll have some chains that get built completely and other chains that will not, and the chains that do complete are correct. And here you can see the feature we were just discussing: it's been stopped halfway, whereas the other features managed to go all the way forward. So this is sequential building-up using light-activated chemistry.

So what do you guys think is the range of lengths, in base pairs, that would be practical for this kind of technology? 25? 150? Under 50, under 50. Anybody else? Anything more than 50? Yeah, so 50 would be about the practical limit. A company called NimbleGen does manage to get 50s or 55s on some of their platforms, but that's about it, and Affymetrix uses 25 base pair probes as standard. So you see that there's a trade-off between the different platforms here. One of the things you pay for packing things more tightly is a reduction in the length of the individual sequences. These sequences are shorter, and shorter sequences have more potential to match in different places in the genome: there are fewer unique 25 base pair sequences in the genome than there are unique 100 base pair sequences.
And that means we have to think much more carefully about the design of the array and about the application of the array to different questions. I mentioned different species: if you've got a 100 base pair region, there's a pretty good chance that it will differ between two or three different species. But a 25 base pair region can bind in multiple places, and if we don't know the species' genome, that's even worse, because we've got a 25 base pair sequence that could bind to six places in one species but only one in another, and a sequence that short can be problematic. So there are big differences. The other thing we could be talking about here is: what happens if this is cancer? What happens if there's a SNP? A single SNP is not such a big problem in a 100 base pair sequence, but a single SNP in a 25 base pair sequence, you've got a problem. So there are a lot of other factors to consider when you're looking at the technology and the trade-offs: not just density, but also the length of the sequences and the other characteristics.

Perhaps you're going to speak about this later, but when you're talking about different platforms, how do the companies decide which region of the message to target? So I'll talk only a little bit about this. Different companies have different design criteria, and there are a couple of pretty straightforward bioinformatic things that you'd think about. One is uniqueness; that's practical and critical. The second is that many arrays are biased towards the polyA tail, towards the 3' UTR, and there are a couple of reasons for that. The 5' end is much less reliable: there's far more variability in terms of alternative start sites and alternative promoters than there is in terms of alternative tails to transcripts. So targeting the 3' end allows you to capture a region that will encompass more transcripts. But when you design something like an exon array, where you have sequences targeted to every exon in the gene, essentially you're just trying to maximize hybridization and uniqueness. And hybridization means balancing things like GC content and so forth. I guess I haven't said this explicitly, but each of these probes is going to have a different GC content, and the GC content determines a lot about how tight the Watson-Crick binding is. The tighter the Watson-Crick binding, the more stringent you can be with your wash, and the easier it is to remove non-specific hybridization. But you start to get a problem if your array contains huge ranges. So you have to find a good balance between having a sufficiently high GC content that you get good separation, and having too much variability or potentially losing some genes or interesting features. So it's a balancing act that each company does in its own specific way.

All right. So let's take a look at the structure of, this is an older Affymetrix array, eight or ten years old now. What you're looking at is an initial wafer, which is five inches by five inches. And from that five inch by five inch wafer, a tiny little piece is taken for each microarray. Five inches is roughly 12 centimeters, so that's about 144 square centimeters, and each array is about 1.4 centimeters across, so they're getting something like a hundred arrays out of a wafer. Basically it's a 10 by 10 grid onto which these masks are used to shine the light, repeatedly, one step at a time, and they make a hundred identical arrays. Each one of those is placed in a nice little plastic packaging with some other things.
The packaging is about this big, but the actual array in it is only about this big, so there's a big difference between the physical packaging and the actual experimental region. Each spot is 11 microns by 11 microns on the older arrays; some of the newer ones are, I think, nine by nine. And as I mentioned, each contains millions of identical probe molecules. I think the newer Affymetrix arrays have about one or two million features; the density is ultimately determined by the wavelength of the light. They look sort of like that. And you can see clear patterns and biases, lines here that look consistent. There are a couple of reasons for that. Some are experimental artifacts that we'll talk about, and others are placed intentionally on the array by the manufacturer so that it's easier for the software to distinguish one 11-micron spot from another. So it allows you to do a better job of distinguishing spot from spot. So we call that the chip, and we call that the feature. And let's talk a little bit about what the overall experimental procedure is, if I don't trip and kill myself.

All right, so the overall experimental procedure for an Affymetrix array starts off with total RNA, not polyA-selected, total, which is reverse transcribed into cDNA. That's then in vitro transcribed into a biotin-labeled cRNA. Biotin gives you a capturable handle, something that allows you to attach strong tags to it, to fix things to it. It depends: at this stage there are actually a couple of different kits that can be used; there's one that's random-primed, and there are some that are oligo-dT primed. So it depends what you're looking for on the array: some Affymetrix arrays only contain polyadenylated transcripts, others don't, so in some cases it wouldn't matter. These cRNAs are then fragmented. Remember that I said the microarray itself contains 25 base pair probes; that means it's going to be a little awkward if you have a thousand base pair RNA and you're trying to hang it onto a 25 base pair hinge. So you need to fragment it into smaller chunks that can attach. These are then hybridized onto the microarray and washed, washing to remove the nonspecific binding. And then the biotin can be stained with a fluorophore, and the chip can be scanned for analysis. So here we don't have the fluorescent label coming in at an early step like you do in the other arrays we talked about; instead you just have the biotin labeling, and it's only at the last step that you introduce the fluorescence, after things have been bound to the array. And this may or may not improve the signal-to-noise characteristics of the platform. I say may or may not: the problem is that we only know what the signal qualities of the final platform are; we don't get to do experiments where we compare five or six variants of the protocol. Presumably the companies have done that and we trust that they got it right, but we wouldn't be able to do it ourselves. We can only say Affymetrix has this signal-to-noise characteristic, Agilent has that; we don't really know exactly which feature of the platform leads to it. And of course, when you look at it in the end, you're going to have a series of probe DNA to which a target binds with its biotin conjugation, and then you can stain those and directly visualize them.
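Once the chips are scanned, what you actually receive as the analyst is one CEL file of probe-level intensities per array. As a rough sketch, assuming the Bioconductor affy package and a directory of CEL files called celfiles/ (both just placeholders for whatever your experiment uses), loading them and eyeballing the chip images looks something like this:

```r
library(affy)

# Read all CEL files in the directory into an AffyBatch of probe-level intensities
raw <- ReadAffy(celfile.path = "celfiles/")

# Pseudo-image of a chip: spatial artifacts such as scratches, bright blobs
# and scanner banding show up here, before any normalization is done
image(raw[, 1])

# Quick look at the raw intensity distributions across arrays
boxplot(raw, las = 2)
hist(raw)
```

Those image plots are the quickest way to catch exactly the kinds of artifacts described next.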
OK, so this is that image I showed you before, and I want to go into a little more detail about it. You can see along the edge there are these little dots; can people see those over here? Yeah. Those are control features, landing lights essentially, that allow the software to accurately grid and align the array. You can also see up here it says GeneChip HG-U133A. That's not something we add digitally; they actually make some of the control probes spell out the label for the type of chip, so that if you really don't know what version of the chip you're working with, you can look at that and find out exactly what it is. Now, if you don't know what version of the chip you're working with, you have other problems that we should talk about, but yeah. There are also a couple of other things worth noting. This right here, a bright spot, is an experimental artifact; that's not a reasonable control or anything like that. Similarly, you can see some dead spots: in this region over here the intensity is very low, and you can see bands. The bands are mostly scanner artifacts; they're generally not physically part of the array manufacture. And they are, again, experimental artifacts that need to be removed in analysis. We'll talk about how to do that in a second.

The other thing that's really important to emphasize is that a gene is represented by a set of 25 base pair probes. Affymetrix was clever enough to say: well, I don't think one 25 base pair probe can completely represent a gene in all its splice variants. So instead, we'll represent a single gene by multiple 25 base pair probes. Anybody have a feel for how many they typically use to represent one gene? Three to four? Ten? It depends entirely on the array, on the version of the array. The older arrays use something like 11 to 20 probes per gene; the newer exon arrays use three to four. And there are a couple of reasons why. One is that if you use 20, you get a fifth as many genes on your array as if you use four. If you want to represent every exon, you need many times as many probes, and therefore it makes sense to reduce the number, but there's immediately a quality trade-off there. The fact that you've got that redundancy is really important, though, because these 25 base pair technologies are really prone to SNPs. A SNP can mess up one of the 25 base pair probes, and therefore you need something that allows you to control for that; having multiple probes, and seeing that one is an outlier, allows you to control for it systematically.

But the other thing that's advantageous about this, and not just advantageous but critical to successfully analyzing this type of data, is the fact that the mapping between 25 base pair probes and genes can be changed. The array that's most commonly used is something called the U133 array; that corresponds to UniGene build 133. UniGene is at build 200 or something like that now. So it's using a really old definition of the genome, of the transcriptome, and a fairly unclear mapping of specific sequences. As a result, when you take a look at it you find that many of the genes we thought existed then don't exist anymore; some sequences we thought uniquely represented one gene turn out not to; some things we thought were two independent genes were actually a single gene with different variants. That's a problem, because if you try to use gene definitions that are 11 or 12 years old, you're not going to get very good results.
So instead, one of the things that is almost essential in the analysis of any Affymetrix experiment is to update the gene mappings. Those live in what's called the CDF, the chip definition file, and the CDF allows you to customize the probe-to-gene mapping however you like. There is substantial computational work involved in doing that. The typical way is to take the million-odd probe sequences, map each of them to the current transcriptome, and piece the mapping back together: come up with criteria to remove any mismatches or any sequences that are no longer represented in the transcriptome, remove hypervariable regions that are highly prone to SNPs, and then repeat this for every single Affymetrix array. Surprisingly, Affymetrix doesn't do that, although you'd kind of expect the manufacturer to do so. Instead it's done by several academic groups, and this afternoon we'll see the application of it to the actual data set that you'll be working with.

So that's Affymetrix. Another technology is the Illumina self-assembled bead arrays. It's pretty different from anything we've talked about. The idea here is that you have individual glass beads, each of which carries on the order of hundreds of thousands of copies of a small probe; notice that hundreds of thousands is actually quite a bit smaller than the millions we were talking about before. So these glass beads have 25 base pair probes attached to them, plus some sort of an address, a label that can tell you what the sequence on each bead is. And you have an ordered array onto which you just pour a random set of beads; I think it will be on the order of a million beads. Each individual sequence is represented tens or hundreds of times by replicate beads. So in essence, you're measuring a 25 base pair sequence, and the hundreds of thousands of probe copies on each bead together give you one intensity measurement, but then that sequence is on, say, a hundred replicate beads, so the measurement is replicated many times. And that allows you to assess the variability caused by the array manufacture, the bead placement and those kinds of things.

Immediately, from what I've just described, can anybody see the experimental design issues that we're going to have to deal with when analyzing this kind of data, some of the things that we really have to start thinking about? Yeah, so you've got the address mapping, which could have errors. Good. What else? Other technological things here that are going to cause noise in an experiment? Misha? Closeness of beads. Yeah, the beads land on the array in a way that then has to be decoded, and if there are any errors in working out which bead sits where, you have a big problem. There are a couple of other things that are really systematic artifacts here. One of them is that, as I mentioned, it's a single 25 base pair probe per gene: instead of replicating different parts of the gene many times, we're replicating the same part many times. So you're not getting a good assessment of the variability across the transcript; you're getting an excellent assessment of technical variability, and there's a problem there. This kind of technology would be very prone to SNPs, for example; it doesn't have the redundancy in terms of sequence differences that the other technologies we described do. And if you add that redundancy, you reduce the amount of technical replication, and so they don't.
On the array, though, you've typically got something like 70,000 unique sequences, 70,000 sequences for the 20,000-odd genes in the genome, and most of the extra will be splice variants.

Have there been comparison studies? Yeah, I wasn't going to talk about all of the comparison studies, but basically there have been a couple of big studies on this. The most important is called the MicroArray Quality Control consortium, or MAQC. MAQC, to the best of my memory, did not include an Illumina array in the comparison; it's from 2006, but they did do Affymetrix, Agilent, spotted cDNAs, and a couple of other platforms. The experiment itself was done by, I think, the NIEHS, the National Institute of Environmental Health Sciences in the US, and it's a controversial study for a couple of reasons. One, the bioinformatics of it, the way it was analyzed, was very controversial. Two, some people have suggested that they knew the conclusion they were going to reach before the experiments were done. And the conclusion was that the array platforms are very highly reproducible. They compared them to, for example, real-time PCR of around 1,000 genes. They did mixture experiments, where they had 100% of one thing, then 90%, 80, 70, 60, 50, through to 100% of something else, and they looked at the dilution curves and those sorts of things. And the stated conclusion is that reproducibility is high, on the order of 98% to 99%, and a subsequent follow-up suggested that modern arrays are better than that. I do agree with the criticism, and this is now personal opinion, that these studies have been slightly oversold, but I think they contain a pretty solid core of truth: for the majority of genes, you'll have pretty good reproducibility. That being said, it depends a lot on the nature of your samples. If you look at a much more unusual sample, say FFPE cancer samples, which have lots of mutations and degraded RNA, you'll see a much bigger difference than you do using cell culture material, like what they used, or fresh frozen tissue. So it's not exactly straightforward, and none of the manufacturers is going to do and publish a truly unbiased study of this. So it's a little tricky to address.

The other thing, of course, that causes a problem here is: if you have 100 technical replicates of something, how do you collapse them? Do you take the mean, the median? Do you throw out outliers? How do you do outlier removal? There are immense informatic and biostatistical questions involved in how you handle that data. For Affymetrix, all you've got is one signal for each probe sequence, you've got something like 11 signals per gene, and you merge them together. Here, you've got a variable number of replicates for each bead type, and you have to determine how to aggregate those. So there are a number of features here that make life more complicated; not necessarily worse, just more or less complicated.

Comparing platforms is tricky, and this is a very subjective ranking. It's not actually a ranking, and I'm certainly not going to stand here and endorse one platform over the others, and I think that's fair. But I think what's pretty clear is that there are well-described differences in price. The spotted arrays are by far the cheapest, although they're not really being manufactured very much anymore. Affymetrix arrays are probably the most expensive, and the expense is not necessarily manufacturing; it's also related to their market pricing ability. The inkjet and bead arrays are sort of in the middle.
The length of the sequences ranges from highly variable for the spotted arrays, to 25 base pairs for Affymetrix and the bead arrays, to 60 or 70 for inkjet. And this is a personal view of the data quality, and I think a few people would disagree, but the spotted arrays have the most artifacts and challenges to remove, for a number of reasons, primarily related to the manufacturing process; and many, though not all, would agree that Affymetrix has the highest data quality, with inkjet and bead arrays somewhere in the middle. But what exactly do you mean by data quality? That depends on your experiment, and it's also a key point to consider. Lastly, the amount of bioinformatics research is something else that's really important. You do not want to get a microarray platform and spend six months working out how to normalize your microarray because nobody has ever normalized this type of array data before. That is not a fun situation. Instead, you want to be able to rely on a community of experts who will help you work through the analysis. And that involves everything from the low-level image that comes off these microarray machines, and parsing it into R. That's a pain: I had to do it by hand once for a microarray platform, and it took two and a half months of coding and thinking and going through it. That, to me as a bioinformatician, is not a waste of time, but it's not what I would like to be doing; I'd like to be thinking about the biological relevance, about how it works. So there's clearly the most research on spotted cDNA arrays. They're the oldest platform, they have a number of features that really interested biostatisticians when they first came out, and there is an enormous number of methodological papers on them. Affymetrix arrays also have tons and tons of methodological work, and that's one of the reasons we're working through them in the practical session. The others have a smaller amount, but in many cases they can reuse the techniques that were developed for spotted cDNA arrays. And in part because they're more recent, the bead arrays have had the least amount of algorithmic development.

Question: the CDFs, the files that you change according to what you know about the genome, that was for Affymetrix, but is there the same thing for all the other platforms, or something similar? So for Affymetrix it's well systematized: it's well described how it's done and where the files are available. There was a group in Cambridge or Oxford, one or the other, that was looking at doing the same for cDNA arrays, remapping them, and I'm not sure if they still do, or whether they didn't have the resources to continue. Agilent does a better job of that themselves, and so more of it is available from the company. And I'm not aware of a systematic universal resource for spotted arrays, in part because spotted arrays are much more variable depending on who manufactures them and where it's done. Some places, core facilities that make them, will actually do that; others probably do not. I know that in my lab, when we dealt with spotted arrays, we had a script we would run at the time that would remap every probe, because it was just not worth the effort of looking for all the individual remappings. So it's variably available; I would say Affymetrix is where it's most developed, and for Affy it's also the most critical that it be there. With an inkjet array and a 70 base pair probe, it's less critical than it is with 25-mers.
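To give a feel for what swapping in an updated CDF means in practice, here is a rough R sketch. It assumes you've installed one of the academically maintained custom CDF packages; the package name used here, hgu133ahsentrezgcdf, follows the Brainarray naming convention for HG-U133A remapped to Entrez genes, so treat it as illustrative rather than prescriptive:

```r
library(affy)

# Default behaviour: probes grouped by the original, circa-2001 CDF
raw_default <- ReadAffy(celfile.path = "celfiles/")

# Same CEL files, but probes regrouped using an updated, re-mapped CDF,
# so each probe set corresponds to a current gene definition
raw_updated <- ReadAffy(celfile.path = "celfiles/",
                        cdfname = "hgu133ahsentrezgcdf")

# Summarize the probes into one expression value per probe set / gene
eset <- rma(raw_updated)
dim(exprs(eset))   # genes x arrays, under the updated definitions
```

The point isn't the particular package: the probe-to-gene grouping is just data, and you can replace it without touching the raw intensities at all.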
And so the next thing you might ask is what these microarrays are actually used for. I'm not going to dwell on this for a lengthy period of time, but I want to highlight some things that you may not have thought of. The first thing I'll point out is that, at the molecular level, these things can do a lot more than you might initially think. If you're hybridizing DNA to your microarray, the first thing you can do is look at DNA sequence. If you know there's a SNP, you can design two probe sequences, one of which recognizes allele A and the other of which recognizes allele B. You can look at the ratio of those two signals and determine: is this AA, AB or BB? So you can directly detect genetic variation using a microarray. Similarly, if you space your probes across the entire genome, you can look at trends in signal intensity across the genome and thereby identify copy number variation. And I think Sohrab Shah is going to talk about that technology in some detail. Tomorrow? Thursday. Thursday, so in a couple of days.

We can also use this as a technique to actually extract specific regions of DNA. John talked to you yesterday about sequencing, and one of the interesting things in sequencing is to ask: can we focus on a specific region of the genome? Maybe you really just want to sequence chromosome 2. Well, you could have a microarray that represents all of the sequences on chromosome 2. You hybridize the DNA from a patient sample to that microarray and wash off all the non-specific binding; what's left bound is just the chromosome 2 DNA. Now you reverse the hybridization with a very, very harsh salt wash, and what you've got is just the chromosome 2 DNA. So it can be a purification technique. You can do that to extract only the exonic regions of the genome; you can extract anything you like that way. Similarly, I mentioned chromatin immunoprecipitation: you can work on subsets of DNA extracted by antibody-based techniques, or cellular fractionation, or any other kind of technique. And the last thing I'll point out for DNA arrays is tag quantitation. Imagine somebody does a genetic screen. A sample experiment would be: they have an overexpression library, where they're going to put into each cell a construct representing one gene in the genome, they're going to randomly transfect cells with these, and they're going to look at some sort of phenotype. Perhaps they look at growth rate, and they'd like to see which cells grow faster and which grow slower. Well, if each construct carries a tag, you can quantitate those tags directly and determine their relative abundances, which accelerates how quickly you can do genetic screening. So that's another classic technique; it can also be done by sequencing, and more easily, but it's more cheaply done by microarrays, at least today.

At the RNA level, there are a number of other things you can do. We talked at the beginning about the limitations of an RNA array; there are also a lot of possibilities. Obviously what we normally get is mRNA abundances, but there's been a lot of research on how you can piece together different transcript isoforms, how you can look at different splice variants and determine the overall structure of a gene.
In particular, there are designs that include not just probes to individual exons but also junction probes, where half of the probe is in exon one and half of the probe is in exon two. And if a transcript that joins exon one and exon two doesn't exist, that probe won't light up. So you can piece those together to get an exact picture of what the transcript looks like. There are studies on mRNA localization, on mRNA degradation or half-life using metabolic labelling techniques, and on mRNA translation rates using sucrose density gradients for polysome capture. All those RNA techniques are pretty readily accessible. These are things that have been published pretty widely and are easy enough to use. They're not necessarily easily used on patient samples; several of them require you to be able to manipulate growing cells. And that's why techniques that allow you to manipulate patient samples, like xenografts or primary cell lines, are increasingly becoming important in cancer genomics. And lastly, we're not gonna talk about this at all, but as I said at the beginning, a microarray is just an array of stuff. It doesn't have to be DNA. It could be proteins, it could be cells, it could be lipids, it could be just about anything. And there are techniques present in the literature, and there are people doing analyses using all of those different things. You can imagine there are challenges with each of them, and the bioinformatics for each of them is a different question. But a microarray does not have to be limited to RNA and DNA, although that's the application we see the most. From a biological point of view, and let's focus here on RNA, there are a couple of things that you'd wanna think about. First, a microarray, as I said at the beginning, is most often used as a hypothesis generating tool. It says: we've done an experiment, what genes are important in this biological situation? I'd like to know what they are so I can figure out why they're important and how they work. That candidate gene identification is easily the most common thing that we do with microarray data. However, the next most common thing is to say, all right, we've got candidate genes, how do they fit into biological pathways? What are those pathways? What do they represent? Is there a series of changes, not the largest changes in the cell, that together account for the variability in a much more coherent way? Every glucose metabolism gene being down two-fold is not as exciting as having one gene that's down a thousand-fold, but that consistency will allow you to get a bigger-picture understanding of what's going on. On the other hand, you might be thinking about classifiers and predictive models. If you do a microarray experiment on clinical patient samples, the most common question asked is, can we predict the difference between subgroups of disease? Subgroups that differ in their outcome, differ in their clinical presentation, differ in their response to therapy? And those biomarker-based methods require all sorts of statistics and bioinformatics from the field of machine learning. Many people work on drug dose response and time course studies, and of course there's integration of this type of data with many other types of data. So there are a lot of different things that you can do with microarray data, and they all start with you actually having the data in hand and being able to extract core meaning from it: normalize it, reduce biological variability and technical noise, and focus on the interesting trends.
Some of the downstream analyses are going to be talked about extensively over the next couple of days. As an example, there's detailed work on pathways, detailed work on clinical integration, and you're going to hear about copy number variation. So you're going to hear about each of those things in more depth, and I'm not going to cover them. I'm going to tell you how you get the data to the point where you can generate what you need to be able to do pathways, to be able to do integration with clinical data sets, and to be able to look at how it relates to copy number variation. And so that leads us to the core question, which is: how is microarray data analyzed? And this comes down to the core workflow, or the core pipeline, that we use. The short answer to how it's analyzed is that you run a series of algorithms, one at a time, each of which is intended to remove one source of noise from your microarray study. And we're going to talk about these, how they fit together and what they're intended to do. So you start off with the microarray. Let's pretend for the sake of example that it's a two-color microarray; there are differences for each platform in how this goes, but not very substantive ones. The very first thing that you have is a nice pretty picture of 100,000 or a million spots, each of which is shining red or green or some other color, and you need to be able to do something with it. That means the first thing you need to do is take each of these probes and quantitate it. You need to be able to attach a number to it. So the first set of algorithms takes your image and converts it into numbers; takes analog and converts it into digital. And that is an absolutely critical and very important step. As an example, people who work with microarray data often can't tell you how this step was done, because it's sometimes done by manufacturer pipelines or manufacturer software that people haven't thought about. What is the source of noise? How does this work? What parameters does it use? At that stage, we start looking at the individual spots. And when you look at an individual spot, very frequently you'll find that the spot at the center is very clearly defined, but outside of it there's a sort of halo. There's some sort of a background effect, and this could be non-specific hybridization, it could be incomplete washing, it could be a lot of different things, but that background needs to be removed to ensure that what you're dealing with is pure signal. At that point, you'll typically want to identify low-quality spots, regions of the array that you don't trust for whatever reason, and exclude them from your analysis. If you have a two-color array, and even on many one-color arrays, you're gonna wanna remove trends that are specific to individual microarrays. That would mean, for example, if you have a two-color array, red and green, you can imagine pipetting the Cy3 and the Cy5 labelled samples into a tube. You're not necessarily going to get exact quantities; you may be off by a tenth of a microliter, a hundredth of a microliter. That needs to be accounted for. You need to be able to balance the signal intensities so the channels look the same. There are assumptions implicit in that approach, and we'll talk about them later, but that kind of variability on an array-by-array basis needs to be removed.
After that, your typical experiment will have many arrays, and you need to bring all those arrays to a common basis. Those arrays can differ because of batch effects, or differences in RNA quantitation or labelling efficiency, or sample differences in how good your RNA extraction was. And to do any sort of fair statistical analysis, you have to bring all of your arrays to a common distribution. At that point, once you have your arrays on a common distribution, you do the statistical analysis. You have to identify: are there any trends of interest here at all? Does this drug cause an effect? Are the genes correlated with patient outcome? Whatever the statistical question is, you need appropriate techniques to measure it. Then you get into downstream analysis. One of the first things that we do in most experiments is to cluster the data. We'll talk a little bit about why, what that teaches you, and what it shouldn't be used for. And then lastly, integration. We'll leave that as a black box for today; we're gonna talk about integration over the rest of the week. So what I've just described, you can break into two different sets. There's the first set of steps, algorithms that are used to remove noise. They're intended to clarify or purify your data set, to make it as clean as possible. We generally call the removal of noise pre-processing. And then you wanna extract information from your data, you wanna find biologically relevant conclusions, and that is the statistical or downstream analysis. So we're gonna talk about each of those steps in a little bit. And if I'm on track for time, it's about 10:20, so we'll go through a couple of steps now, and then we'll come back after the break and finish off the rest. When is the break? 11? OK, that actually works well; we should get almost to the end. Thank you. All right. So we'll talk about each of the steps in a little bit of detail. And the first step, as I mentioned, the critical one, is the quantitation of the data. The way this is done is with a series of algorithms called image segmentation algorithms, and it's actually one of the more difficult things that we deal with in microarray analysis. It's one of the hardest problems in bioinformatics, actually. And as an interesting side note, it's also one that's under-studied: there's not a lot of work going on on image segmentation algorithms for this type of problem. So the core idea, and I'm using a two-color array because it's easier to visualize, but the challenges are really unchanged for any array, is that if you look at the image, you can sort of see that there's a pattern there. You can imagine, well, I see these empty columns over here, and I think I see sort of rows like this. But you have to be able to identify them. And the algorithm is gonna say, all right, let's first start off by recognizing local subregions; within those local subregions, I can then identify specific spots, and then for each specific spot, delineate exactly where the spot is. So it's sequential. And it's not that hard for you to look at it and draw those grids around it, but for a computer it's actually a pretty challenging thing. And the way the computer goes ahead and does it is by integrating the signal across both dimensions of the array. There are many tweaks and differences in the algorithms used, but it really comes down to this.
Here we've got a microarray with four subgrids, and all the algorithm is going to do is, for every line of pixels, add up the intensities across: one, two, every single row, and every single column. And when it does that, you're going to see a pattern emerge like this. You're going to have peaks where you have spots. And you're going to be able to say, oh, here's a spot, corresponding to the different peaks. And that will allow you to identify the gaps as well as the spots within the grids. Now, there are several challenges here. One is that you would hope that the array would be perfectly geometrically arranged, but it may not be. So where you're expecting, all right, here's the intersection of this row and that column, the spot won't necessarily actually be there. So what you have to do is grid the spots initially, put them at an initial location, and then the software is smart enough to tweak and modify it to bring it into the correct location. So it's got an iterative adjustment procedure to account for that. But then when you get down to a single grid, you might repeat the procedure, and it still gets pretty challenging. There are a number of things, when you look at what appears to be a correctly gridded array, that will make life semi-miserable for your computer program. The first is: what is that? Is that a spot, or is that an artifact? And there's no information to tell you. You can't tell if there should be something there or not, and there's nothing the computer program has, looking at this image, that will help it distinguish those two cases. Similarly, there are a couple of blank rows in here. You can imagine, maybe these are both real blank rows; that, to me, looks clearly like a red spot. But of course, if you call it wrongly, maybe the entire array over here ends up shifted by one in either direction. And that misgridding or misalignment would throw off the entire array experiment. So there are a number of different things that have to be accounted for, and the software has to be smart enough to know when it can't accurately make a call. When it says, you know what, I'm not sure what I'm looking at, so I'm just going to flag this; I'm going to say there's something wrong here. The software packages not only have to handle cases like this, but also know when a case has come up that they can't make an accurate decision on. Surprisingly enough, there's not a whole lot of research into this, and maybe that's not surprising. It's probably a source of error in all studies, and nobody really knows how to fix it. You could maybe take a look at all the different kinds of features that you get on an array and try to program those in, but that requires a large amount of work. And as arrays get to be very high density, how do you even look at it? Who has the energy and time to systematically look across a million spots on 50 arrays, get sufficient replication, identify the systematic kinds of trends and artifacts, and then write a computer program to account for them? Part of it is that the people who have the capacity to write the computer programs may not be the people who have the patience to look at 50 million spots by hand. And even the people who have the skills to develop the algorithm to solve this problem may not be the people who can implement it in a computer program that will run fast enough to be useful. So this is almost certainly a source of error in all studies.
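Just to make that row-and-column integration idea concrete, here's a minimal sketch in R on a made-up matrix of pixel intensities. A real segmentation package does far more than this, so treat it purely as an illustration of the principle:

set.seed(1)
# img: made-up pixel intensities for one sub-grid of the scanned image
img <- matrix(rpois(100 * 100, lambda = 5), nrow = 100)
img[seq(10, 90, by = 20), ] <- img[seq(10, 90, by = 20), ] + 500   # fake spot rows
img[, seq(10, 90, by = 20)] <- img[, seq(10, 90, by = 20)] + 500   # fake spot columns
row.profile <- rowSums(img)   # integrate the signal across every row of pixels
col.profile <- colSums(img)   # ... and across every column
# Peaks in the profiles mark spot rows/columns; valleys mark the gaps between them.
# This is extremely crude peak calling; real software refines the grid iteratively.
spot.rows <- which(row.profile > mean(row.profile) + 2 * sd(row.profile))
spot.cols <- which(col.profile > mean(col.profile) + 2 * sd(col.profile))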
Manual checking of spots remains the only really reliable way of doing this. My lab did this up until three or four years ago, and we no longer do; it's just impractical with the size of the arrays and the experiments that we deal with. And I guess the last point: anybody have a guess as to how often it happens that there's a gridding problem? There are a couple of estimates in the literature. So what fraction of spots on an array appear to have a gridding problem? Who thinks it's more than 50%? More than 10%? More than 5%? I'll remember everybody who said more than 10%, by the way. Anybody think it's less than 1%? Yeah. So the estimates for cDNA arrays were 5% to 10%, and for Agilent arrays about 1%. So that's not totally horrible, but 1% is still 1% of an experiment where you're paying a lot of money and generating millions of data points. So it's a significant source of error. Are you talking about spotted arrays? No, this is true even for Agilent piezoelectric-printed arrays. So the next thing that you do, once you've converted the image to numbers, is to remove the background signal. We should talk about what exactly I mean by background signal, but it's stray fluorescence, something that's not biologically of interest. We typically do this using model-based techniques, and I'll say that this is probably the other extremely difficult problem, one of the two or three extremely difficult problems in microarray analysis, maybe even harder than the spot problem we just talked about. The typical way it's done is with model-based approaches, and there's a lot of research here. So what we're actually dealing with is a spot on an array that is going to look something like this. There's going to be a core spot that's very, very clearly hybridized. I'm drawing these as circles; it doesn't matter if your technology produces squares or circles or beads, they'll all have this feature. So each of those features is going to have its own core signal. It'll be surrounded by a sort of limbus, an aura of less intense signal that is just due to this part being saturated and this one not. And then by some sort of halo of background, which is either non-specific hybridization, or the fluorescence of the glass slide itself, or something like that. So it's not really clear what any of these things are except for the middle one. The middle one is clearly signal, but the other two, we don't exactly know what they are. So what should the true signal intensity be? Should it just be this, or should it be related to what we see in the background? Especially since those two can vary systematically from spot to spot. The typical way in which we look at it, and this is part of what segmentation software has to do, is to say: this is the foreground, we largely ignore that intermediate limbus, and this is the background. And then you might immediately think it's pretty straightforward: the signal intensity of the spot is just the foreground minus the background. It would be nice if it were that simple. Unfortunately it's not, and there are a lot of reasons why. The first is that when the background is larger than the foreground, you get negative signal intensity. There's no such thing as negative mRNA abundance; you shouldn't be getting negative fluorescence. So something physically is going wrong in your experiment, and it might still be okay to just ignore those spots if it didn't happen too often, but in a typical experiment it can happen one or two percent of the time.
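To see why that bites in practice, here's a tiny illustration in R with made-up foreground and background values; on a real array you'd be doing this for every spot:

fg <- c(5200, 830, 410, 12000)   # made-up foreground intensities for four spots
bg <- c(450, 790, 520, 600)      # made-up local background estimates for the same spots
fg - bg                          # the third spot comes out negative
log2(fg - bg)                    # ... and turns into NaN as soon as you log-transform it
mean(fg - bg <= 0)               # fraction of "impossible" spots; on a real array, often one to two percent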
So a simple background subtraction doesn't work, just from a pure physical perspective. Not only that, the spots that have these particular features are actually biased towards being particularly interesting spots. For example, they're often spots for low-intensity genes: genes that are turned on, that we can prove using other technologies are on, but that have low intensity. It's also correlated with certain sequence characteristics: low GC content, high AT content sequences are more prone to background problems than high GC content sequences. A couple of papers back in 2001 showed that in many cases an empty spot, a completely empty spot, would have less signal than the estimated background of the array. So if you had just the spotted DNA there, without the intrinsic fluorescence of the glass slide, you would actually have an inversion of this situation. And these spots in many cases turned out to be genes that were entirely off, negative controls and things like that. So dim spots are particularly prone to this problem, and that led a lot of people to think that maybe we should just threshold the background. There are a number of techniques in the literature that just say: we'll ignore the fact that we're going to lose low-expression and high-AT spots, and anything with this characteristic will be set to a signal of one or two. And that's a fair, although not exactly careful, technique for handling background. Other people have focused on what I would generously call heavy-duty mathematical techniques. There are three main ones, and they come from different places: one model from Stanford, Smyth's from Australia, and the Kooperberg model. Essentially, each of them, around the same time, used fairly sophisticated mathematics to attempt to remove background noise. I'll point you to the bibliography that was put up on the wiki, so you can see the papers that describe these techniques, but the math is fairly advanced and would probably take several hours to go over. So let's set aside the mathematical details and talk a little bit about the distinctions between the techniques. The first model is what's called a log-linear technique. It assumes that the noise is logarithmically distributed and the signal is linearly distributed, and it assumes that within an individual spot; if that assumption is incorrect, the model won't work. By contrast, the Smyth model uses a normal-exponential convolution: it assumes that the background noise is normally distributed and the signal is exponentially distributed. There's a big difference between log-linear and normal-exponential, and you can see immediately that each is making an assumption about the nature of the noise, and whether that assumption is correct or incorrect will change what happens. By contrast, the Kooperberg model is Bayesian. A Bayesian model does not make the same kind of fixed assumptions; it attempts to learn what they are and to empirically understand what the data look like. So with that in mind, you might not be surprised that there are big differences in how fast they run. When I say that the Kooperberg model is very slow: we did an experiment of 72 arrays, and processing them with just the Kooperberg background correction took two days on five nodes, five CPUs. So about 10 CPU days to do that background correction.
I'm not talking about quantitation, normalization, statistical analysis, anything like that; just that one step. It's hard to say for sure, with any of these questions, which is the best or the worst background correction, but the feeling in the literature is that the Kooperberg approach is probably the best. It does a good job of being flexible to unusual background noise characteristics, and it does a good job of handling even cases that have very, very little noise. But in any case, any of these methods is certainly superior to foreground minus background, or to foreground minus background where you pretend that everything negative is zero. They've been shown in validation studies against qPCR to be much more accurate techniques. So it's important to choose your background correction technique according to how much time you can wait for your analysis to run, your computational resources, and what you think the background noise characteristics of your data set are. All right, so background correction is challenging, and maybe the last super challenging question that we'll come across is spot quality. I mentioned that you're gonna want to know which spots on your array are reliable and which are unreliable. The reliable spots you'd want to include in your analysis, and the unreliable ones you'd want to exclude. Those could be there because of manufacturing defects, they could be artifacts in your hybridization, you could have scratched a portion of your array; a lot of things could have happened. And the best way of doing this I would classify as unknown. Conceptually, the general way we would really like to do this is to say that a perfect spot gets a weight of one, and a really useless spot gets a weight of zero. And there would be spots somewhere in between, for which we have different amounts of evidence. We'd say, this spot, I think it's okay, but there's a little bit of noise there: it's a 75% spot. Or: it contains a little bit of information, and if we have lots of corroboration and support from other arrays I would consider it, but by itself it doesn't give me a lot of evidence, so it's a 10% spot. So you want to be able to weight each feature in that sort of way. Then the real question becomes, how do we calculate those weights? How do we know what is really good and reliable, what is really useless and unreliable, and what fits in between? There are a couple of approaches that are well described in the literature. By far the most used approach is what's called the mean-median method. Imagine that you've got a spot and the spot contains 100 pixels. You can calculate the mean intensity of those 100 pixels and you can calculate the median intensity of those 100 pixels. If the spot has any sort of symmetrical distribution, the mean and the median ought to be about the same. If the mean is not similar to the median, if the ratio between the mean and the median is skewed, that immediately tells you there's something unusual about the distribution of the pixels within that individual feature. Remember that a feature contains, say, a million probe molecules, and if there are 100 pixels covering those million probes, then each one of those pixels is capturing about 10,000 of them. So there's a lot of signal integration going on there, but not so much that you can't detect problems. So that's one clear technique that's been used.
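A minimal sketch of that mean-versus-median idea in R, on made-up pixel values. The squashing of the skew into a 0-to-1 weight at the end is entirely arbitrary, just to show the shape of the approach:

# pixels: the ~100 pixel intensities inside one segmented spot
spot.weight <- function(pixels) {
  ratio <- mean(pixels) / median(pixels)   # close to 1 for a clean, symmetric spot
  exp(-abs(log(ratio)))                    # arbitrary mapping of the skew onto a 0-1 weight
}
clean    <- rnorm(100, mean = 1000, sd = 50)                      # well-behaved spot
speckled <- c(rnorm(95, mean = 1000, sd = 50), rep(8000, 5))      # a few saturated pixels inside the spot
spot.weight(clean)      # close to 1
spot.weight(speckled)   # noticeably lower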
The other approach that's been used is what are called composite Q-metrics; Q stands for quality. Composite quality metrics basically means that you somewhat arbitrarily come up with a number of different measures of quality, and you multiply them together, hence composite, and that gives you a composite Q-score. And people have demonstrated pretty clearly that if you hybridize the same sample against the same sample, so compare a sample to itself, taking spot quality into account allows you to reduce the noise and variability. The problem is how you define those Q-metrics. You need to come up with some sort of characteristic, and people use things like signal intensity, the circularity of the spot, the number of pixels, or the standard deviation of the pixels within the spot. The problem is that both of these approaches sometimes fail fairly randomly. And what I mean by fail: they will assign a perfect quality to spots that are very clearly unreliable, and they will say of a spot that is clearly reliable that it is unreliable. And that's because the distributions you're trying to estimate from only 100 pixels, or in some newer arrays even fewer, just don't support that kind of statistical inference. So it becomes really challenging, and nobody has really managed to solve it. So you might say, Paul, do I really need this? You're telling me it's a hard problem, so why do we need it? And the short answer is: you need it because if you look at an array, you can immediately see problems all over the place. So here's an example; these are Agilent piezoelectric-printed arrays, I think. You've got a bright spot that's got lovely signal characteristics, and you can see that its signal is bleeding over into the spots next to it, so that this strong central spot is increasing the background and reducing the quality of the spots around it. Here's a spot that looks perfectly fine except for this black X in the middle of it, a black plus sign. Now, if we go ahead and do quality assessments, the saturated spot is going to come up perfect, and this other spot, which to my eye looks just as good, is going to come up as a problem because it's got that interior heterogeneity. That's not really a bad spot, but this one is a much worse one. No one can actually use that spot and know exactly what we're dealing with. And if you ask me, there's no way any gridding algorithm is going to be smart enough to figure out what the underlying signal intensity here and here is; the correct answer is to drop both of those spots from the analysis. It gets worse. Not only does this one have an artifact and this one have an artifact, you can see that the artifact has a halo effect that's affecting the background of one, two, three spots. So not only is the one spot compromised, the other three should probably have reduced quality scores because their signal is going to be affected. This is probably an array manufacturing issue; it looks like in the printing, some sort of bleed occurred from one spot to the next. And I'm not sure that we could reliably trust the signal intensity of even this spot. Is it likely that the segmentation software accurately distinguished this from this? I think not. And all of those are from a single array that I would have called good quality; certainly we used the array in a published experiment.
So those are the kinds of feature quality issues that you will see over and over in a microarray. And somebody's going to say, ah, but I use this other commercial supplier, or Affymetrix, and that won't have those quality issues. So let's quickly look at some Affymetrix data. The first set of Affymetrix data I'll show you is just looking at the overall array signal intensities. I colored this in green and yellow to make it a little clearer what you're looking at, but you can look at it in black and white. All right, so over here in this corner, the hybridization pooled, perhaps because the slide wasn't level, and there's much more signal intensity in this corner than in any other part of the array. We've got some sort of interesting spatial pattern over here; these kinds of spatial patterns often correspond to temperature gradients in your hybridization ovens. I would have called that one a thumbprint kind of feature. And you can see here the Affymetrix label, one of the two Affymetrix label IDs, and it certainly does not have the same background signal intensities as everything else. All of these arrays were from a spike-in experiment done by Affymetrix themselves and published on their website as an experimental data set we should all use for assessing techniques. And that's the Affymetrix data. Here's some Affymetrix data from an experiment that we did. You can immediately see that, number one, there's higher variability across the array, much more spatial heterogeneity. We can see big patches that presumably again come down to pooling of the sample. And here we see something really interesting: a periodic vertical pattern. These are successive arrays, and as you go later, the pattern actually gets tighter. Our best explanation of this is the scanner getting tired: the scanner here at the beginning was able to do the entire scan in one go without any degradation in signal, whereas here it looks like there are periodic fluctuations, something going wrong with the scanner. We've seen this before. So with an experiment like that, we can clearly see there's something going on, and it gets worse from one array to the next. This experiment had eight arrays, and you can see at the beginning it looks fine, but it gets progressively worse until it ends up like this. So how do you account for that? I hope that convinces you that spot quality matters; it's clearly an issue. We just don't really know how to handle it. Some people have bravely tried manually flagging spots, and that was done for old cDNA arrays. Unfortunately, a couple of studies showed that the error rate of manual spot flagging is kind of atrocious: five to ten percent discordance between people doing it in different places. So people won't agree on what's good and poor quality. So I would characterize it like this: it's a huge problem, and I think it's one of the most critical problems facing microarrays, and probably facing all omics technologies. How do we assess, in an unbiased, accurate, high-throughput manner, the quality of our experimental data? How do we do that and incorporate those quality measures into our bioinformatics? At the moment, I do not have a usable solution to offer you. I think it's a problem that you should all be aware of, and if there were a good solution, I think everybody would immediately adopt it. But at the moment most labs, mine included, generally struggle with it and then eventually ignore it.
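If you want to look for these spatial artifacts on your own Affymetrix chips, the pseudo-images are essentially one line each in R. A sketch assuming the affy Bioconductor package and CEL files sitting in the working directory:

library(affy)
ab <- ReadAffy()   # reads all CEL files in the current directory
image(ab)          # pseudo-image of each chip in turn; pooling, scratches and gradients show up by eye
boxplot(ab)        # per-array intensity distributions, another quick sanity check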
I don't think there's an easy solution to that yet. Excuse me. Yes? You showed some problematic cases of arrays; what's the ideal picture that you'd like to see? So, all of these are arrays we used without modification in our analysis. What kind of picture would let me say, yes, everything's fine with this one? We used those; those were not the level of deviation that made me drop an array. One could argue that maybe we should have dropped all of those arrays, but in my mind, one way of knowing whether your array experiment was good enough is: are you able to validate the results? From that last experiment I showed you, we validated 18 out of 20, or 19 out of 20, genes by real-time PCR. So roughly a 90% validation rate, suggesting the arrays and the combined analysis pipeline didn't introduce a lot of errors. So when you should drop an array really comes down to what you're using it for. Or phrase it this way: the experiment you just looked at was a cell culture or animal experiment. There it's probably not so critical, because you're interested in biological replication and in looking at trends. If you're using this as a clinical diagnostic, and you want to be sure you're going to accurately tell this patient, do you get chemotherapy or do you not, then you have to go in and look at it, and I would have dropped every single one of those arrays if they were clinical diagnostics. But for a question where you've got three biological replicate animals, and the noise you believe is not systematic, that level of noise should average out. If I'm not going through each array and looking for errors by eye, what about the QC values that Affymetrix itself reports? It assigns a number for each array; is that something you should be looking at? So the question revolved around the Affymetrix QC metrics. Affymetrix and a couple of other companies produce QC metrics for their data sets. The first thing I'll point out is that they're per-array, not per-spot, QC metrics. And they're basically very, very coarse estimates of: should you drop this array, or should you continue using it? There are pretty well-defined thresholds below which I would suggest dropping the array. However, above those thresholds, in the intermediate range, it's not clear at all how the quality of the array and its usefulness relate to the QC metrics. So they're useful as a coarse filter for which arrays need to be repeated, but beyond that I haven't seen any reproducible reports of them being useful. And in some cases, if the QC metrics look bad enough, the core facility will repeat the array themselves, because they can see it was too low quality. Yeah, and we can discuss that offline, I guess. So at this point, my discussion of pre-processing takes a positive turn, because we go from things that are very difficult and not very well researched to things that are much easier and very well researched. And there's an interesting correlation between the easier problems and the amount of study they've received. So the first relatively easy problem is the normalization within an array, the intra-array normalization. And there's a wide variety of algorithms that have been developed to solve this problem.
The idea here is that you have an artifact that is specific to a single chip and is systematic across that chip. Something like the sample pooling that we saw, or incorrect balancing of your two samples, would be classic features that you'd want to remove. And the way in which you remove these features depends on exactly which feature you're looking at. So for example, if you have an array and you see superimposed on it some sort of gradient, a spatial gradient, that immediately makes you go, huh, how do I handle that? Well, there's a technique called a Gaussian spatial smoother. We're not going to go over the math of it, but very basically it was designed as a filtering technique in optics to remove exactly this. If you've got a systematic gradient like this, it will do fine. If instead this were five separate patches of spatial variability, the smoother won't like you so much. But any systematic artifact like that is relatively easily handled. I showed you one of the arrays where it looked like the sample pooled at the bottom; this algorithm would have handled it, very trivially removed that effect. There's also the question of channel balancing. What happens if in this case I have six units of red and four of green, because I accidentally pipetted in more of the one sample than the other? The signal intensities for that one are going to be 50% higher, the intensity ratios are going to be skewed, and it's going to look like every gene is up-regulated in one sample. And then lastly, there's something called intensity bias. To explain the idea, take a look here. This is a homotypic hybridization, so it's aliquot one versus aliquot two of the exact same sample. The axis runs from zero signal intensity through to the maximum, which is about 65,000. Along the middle we've got the vast bulk of the spots, and we've got a couple of outliers: technical noise, bad spot quality, background issues, whatever it is. But the bulk of the spots fall within this band of one to one and a half standard deviations. What should be immediately clear to you is, take a look over here: the spread is about the same at the high signal intensities as it is down here at the low signal intensities. So the variance, the noise, is not proportional to the intensity. If a gene is present at 100 copies, you've got 100 mRNA molecules, and a noise of 10 is only 10 out of 100. But if the gene is only present at 10 copies and you've got a noise of 10, you've got 100% noise. So this is called the intensity bias: microarrays have much more relative noise at low intensities than at high intensities, and that needs to be systematically removed, because it violates just about every assumption of every statistical test. We're not going to talk in great detail about statistics, but if I were to ask what the assumptions of the t-test are, I suspect most people would say normality. And that's true: normality is one of the assumptions of the t-test, that the sample is drawn from a normally distributed population. But it's actually a really, really weak assumption. The t-test doesn't care much if the data are drawn from a non-normal distribution; it can work around that most of the time. What the t-test really cares about is that there is equal variance. The assumption of what's called homoscedasticity is critical in statistical testing.
And if you think the t-test is strict about that, ANOVAs and general linear models are extremely rigid in their assumptions there. So removing intensity bias is another critical factor in allowing you to do your downstream statistical testing fairly. There are a number of algorithms that are useful for each of these. The intensity effects are well removed by loess smoothing. Loess smoothing is a technique that basically fits a straight line piecewise over small regions of your data, and puts those piecewise straight lines together in a way that allows you to smooth out a nonlinear shape. When you have multiple effects, techniques based on splines or quantiles have proven effective. So there are a number of different factors, and it all comes down to knowing the noise characteristics of your platform and your experiment, which will lead you to decide how you should normalize. Some of these methods are well established, and well demonstrated in the literature and in a variety of experiments to improve validation rates. Similarly, you might have multiple arrays and want to merge them together. This is maybe the most studied topic in microarray analysis. It is clearly important, and there are a number of different techniques that have been developed. The most important application is when you've done your experiments at different times: you did half your experiments this year and half the next year and you want to merge those together in some sort of meta-analysis. This lets you pool the data in a fair way. The differences could be caused by things like differential loading, or by batch effects in your array manufacture. And the underlying approach is simply to scale the arrays. The distributions, as you can see here, look really similar; it looks basically like they're just offset, just shifted over one way or the other. So all you need to do is scale them onto a common distribution, and there are a number of algorithms directly designed to do that. One of them, which in many cases is enough, is a simple Z-score transformation, simply subtracting the mean and dividing by the standard deviation; that works for a large fraction of array experiments. Otherwise, something called quantile normalization works for most of the rest. So it's relatively easy to handle. And here's a sample data set that you would look at and go, ah, a lot of noise, huge bias from sample to sample. To explain the plot: this axis is the intensity of the spots on the array and this is the fraction of spots that have that intensity; it's a smoothed histogram called a density plot. And afterwards you can see that everything looks basically the same. Can anybody look at this experiment and immediately tell me what kind of a microarray it was? There are actually two clear features that tell you. Yeah, so one thing is that it's obviously a two-color array, red and green. But the other thing that's more characteristic: see that hump over there? That's a ChIP-chip experiment. It's had an enrichment, and that hump is the enriched population sitting outside the main distribution. So that's very characteristic; any time you see that, it's a ChIP-chip or a RIP-chip or something like that. All right. So at that point we've removed all the noise and we can start moving forward to statistical analysis, assessing the data set itself to draw meaningful conclusions. And I'm not going to talk for hours about statistics, although I would like to.
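Before we move on to the statistics, here's roughly what those pre-processing steps look like in code for a two-color experiment. This is a minimal sketch using the limma Bioconductor package; the file names and scanner format are made up, and the method choices are just common defaults rather than a recommendation for your particular data:

library(limma)
targets <- readTargets("targets.txt")                      # hypothetical sample sheet listing the scanner output files
RG <- read.maimages(targets$FileName, source = "genepix")  # quantitated foreground and background for each array
RG <- backgroundCorrect(RG, method = "normexp")            # model-based (normal-exponential) background correction
MA <- normalizeWithinArrays(RG, method = "loess")          # remove intensity-dependent bias within each array
MA <- normalizeBetweenArrays(MA, method = "quantile")      # bring all the arrays onto a common distribution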
But I'll point out that microarray statistics have a lot of really interesting features. There's a lot you can do based on the fact that microarrays measure a lot of different variables at once, they're multivariate, and those variables are highly correlated. Let's talk really briefly about what I mean by multivariate, because the distinction may not be one you've heard. In a normal linear equation you would say there's an independent variable x and a dependent variable y. We will sometimes do multivariable statistical analysis, which means we have multiple x's. So we may look at how gene expression, the dependent variable, changes as a function of the dose and time of a specific drug we've given, or at how patient survival, the dependent variable, is a function of the stage of the tumor, the treatment, and the age of the individual. That's a classic multivariable analysis. A multivariate analysis means that we have multiple dependent variables, multiple y variables. Every single gene can be treated as a separate dependent variable. And that means we get into a field of statistics that, in general, people are not trained or familiar in. Does anybody have a multivariate statistics background? Multiple y's, so multivariate, not just multiple x's? Yeah, we'll talk after a bit. So multiple dependent variables is multivariate. But the key point is, this opens up a field of statistics that probably most people have no background in. And as a result, with no background in it, most of us don't necessarily know how to do it properly. So what we resort to in microarray statistics is using sequential statistics per gene instead. Rather than trying to fit a model to the entire data set, we say, well, that's pretty complicated; frankly, the statistics often isn't there for how to do it, and on top of that there are experimental constraints. Instead, we're going to fit a separate statistical model for each gene. There are a number of different models that you might use. Very generally, you can divide statistics into a couple of different areas: there are what you might call point estimation techniques and there are hypothesis testing techniques. What we're mostly going to focus on is the idea of statistical testing, significance testing. We will take each gene and phrase a hypothesis about that gene, and we will do that for every single gene on the microarray. So, to remind you of what we're going to be thinking about: we're going to ask, can we assign a p-value that tells us whether this hypothesis holds for a given gene? And we're going to sequentially test that hypothesis, gene after gene. There are a number of different questions that you might ask, and each corresponds to a different statistical test that you would apply to your microarray. For example, you could first ask, are different groups different? If I have control versus treated, drug-treated versus experimental control, I would sequentially ask, for each gene, is there a difference? I could also ask, is there synergy? If I give two drugs together, is it better than giving them separately? I could certainly ask questions about patient survival, and I could definitely ask questions about predicting clinical features. So, off the top of your head, I suspect most people have heard of the statistical techniques to answer at least one and two.
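As a concrete example of that sequential, per-gene approach for question one, are two groups different, here's a minimal sketch in R; expr and group are made-up stand-ins for a preprocessed expression matrix and its sample labels:

set.seed(1)
expr <- matrix(rnorm(2000 * 6), nrow = 2000,
               dimnames = list(paste0("gene", 1:2000), paste0("array", 1:6)))   # made-up log2 expression data
group <- factor(c("control", "control", "control", "treated", "treated", "treated"))
pvals <- apply(expr, 1, function(x)
  t.test(x[group == "treated"], x[group == "control"])$p.value)   # one t-test per gene
fdr <- p.adjust(pvals, method = "BH")   # Benjamini-Hochberg correction, since we've run thousands of tests
hits <- rownames(expr)[fdr < 0.05]      # candidate differentially expressed genes

In practice many people would use the moderated t-statistics in limma (lmFit followed by eBayes), which borrow information across genes, but the per-gene, sequential logic is the same.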
So, we're going to focus in the practical session on number one: the techniques used to assess whether groups are different. The most common, of course, is something like the t-test, but there are lots of others. And if people are interested, talk to me separately about the techniques for patient outcome and treatment differences. I believe patient outcome will be talked about extensively on Friday. So, at that point, you're going to have a list of genes that you've identified where your hypotheses are met, or where there's something interesting about them. And you're going to say, I want to visualize those genes or understand something about their characteristics. So we'll talk briefly about clustering. Clustering is another well-studied problem in bioinformatics, and it comes from a field of study called machine learning. So, I'll ask a quick question. Who has used machine learning today? One, anybody else? Two, three? All right, how did you use machine learning? Studying complicated data sets. All right, fine, he used it to do his research. Anybody else? Anybody who used it today, not as part of their research? All right, you all did. Who saw a weather report today? Anybody use Google? Anybody see an ad on a website? Get a book recommendation from Amazon? Ride an elevator with smart scheduling? These are all applications of machine learning. Machine learning is the set of techniques by which a body of data is processed to make predictions about future behavior. So, many aspects of what you've done today ran through machine learning algorithms. Computerized trading that makes stock exchanges go crazy: machine learning algorithms. So, machine learning is a big and very, very interesting field, and we're not going to talk about most of it. We're going to talk about a specific subclass called unsupervised machine learning. It's called unsupervised because it starts off without any particular information or labels about what you're trying to find. It doesn't know what the patterns are; it says, let's try to learn or discover them. So, it's the process of finding patterns in a data set, and each pattern in unsupervised machine learning is called a cluster. It's a very small branch of machine learning: if you get a standard upper-undergrad textbook on machine learning, it will be four or five hundred pages and clustering will be one chapter. So, it's a tiny little topic. And I should say that it's a very overused part of it; we resort to clustering techniques where a formally trained machine learning person probably would not. There are a number of reasons why we use it, and some of those are good reasons. The first is that it provides pretty pictures, and that's really useful for communicating data. So, this is a clustergram, or a heat map, and there are a couple of different portions here. The first, at the top, is the dendrogram. The dendrogram tells you the linkages between the different samples: the height at which branches join tells you how closely two things are related. So these two columns are very tightly related to each other. These four are a little less related to each other, but they're much closer to each other than they are to this group of four, which is different from this group of four, which is really different from these eight over here.
So there's a very specific mathematical linkage between the length of the arms of the dendrogram and how similar the samples are. Similarity could be measured in a couple of different ways, and we'll talk about that. The pretty part here is called the heat map. The heat map is typically colored, and there are a lot of different color schemes you could use. I'm showing red-green here; is anybody in the room red-green colorblind? One? So that's about right, it's about one to five percent of the population, and I show this, sorry, as an example of what you should never do: you should never make figures that are red-green, because they're really hard to interpret for people who are red-green colorblind. It's better to use things like red-blue or green-yellow or combinations like that. So what does this actually mean? In this experiment these are genes, these are samples, and each cell corresponds to the level of that gene in a specific sample. So it's telling us that this set of genes is very high in this set of samples but not in any of the other samples, and it allows you to see similarities in groups of genes across samples. That's an interesting thing to look at. There are a number of ways in which clustering is done, and a very simple way of looking at it is to say: if we have two stimuli and a number of genes measured under them, and we want to draw clusters for these genes, you can see pretty clearly that these four fit together, these five, these three, and that one, and I can immediately pick out where I think the groupings should go. If you want to teach a computer to do that, it's actually not super challenging: you tell it there are two things it should measure. The first is the distance within a cluster, how tight the genes are to each other; the second is how far the clusters are from each other. A clustering algorithm should minimize the distance within a cluster and maximize the distance between clusters, and there are a number of ratio techniques and things that can be done to trade those off. So a clustering algorithm is a nice way of organizing the data based on those criteria. Why is clustering used? There are a number of different reasons: obviously it's pretty visualization, it's unsupervised machine learning so it's a nice way of discovering patterns, it can be used biologically to identify co-regulation, and also for quality control. I'm going to stop in a couple of minutes so that we can go for coffee, but I'm going to tell a quick story about quality control. In some of the earliest microarray experiments, groups would do a series of experiments on yeast, and they would look at the experimental results, and when they clustered them they found a clustering pattern that didn't correspond to any biology they could recognize. They couldn't think of what was going on. So eventually they looked at it and said, wait, it's clustering according to the technician. So clustering can be used for quality control, because what it's finding is the strongest pattern in the data set. You're not telling it what that pattern is. You hope the strongest pattern in the data set is your biology, but if it is not, it's an interesting thing to find out what's actually going on. So we're off for a break now. We'll come back, finish talking about machine learning, talk a little bit about the Affymetrix-specific analysis pipeline, and then get into some practical work.
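For reference over the break, here's roughly what producing a dendrogram and a heat map looks like in R on made-up data; keep in mind the caveats about what clustering can and cannot tell you, which we'll come back to after the break:

set.seed(1)
expr <- matrix(rnorm(200 * 12), nrow = 200)       # made-up genes x samples matrix
expr[1:40, 7:12] <- expr[1:40, 7:12] + 2          # plant a block of co-varying genes in half the samples
hc <- hclust(dist(t(expr)), method = "average")   # cluster the samples on their pairwise distances
plot(hc)                                          # the dendrogram
heatmap(expr, col = colorRampPalette(c("blue", "white", "red"))(64))   # heat map, avoiding red-green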
So let's get back to it. Does anybody have any burning questions that came up over the coffee break? So we were talking about unsupervised machine learning, and I listed some of the things it's useful for; now let's talk about those in a little more detail. The first, I said, is data visualization. I think that's pretty obvious; it makes a pretty picture. The next one is to predict class assignment. Here's an example of something that would be interesting. It comes from yeast, but the same work has been done in human, and it basically revolves around the fact that for a large fraction of the genome we have not much of an idea what it does. So you could imagine it would be really nice to estimate the function of all the genes we don't really understand, and to come up with more refined functional annotations. A good decade ago, Tim Hughes's group said, this is a solvable problem, we can come up with estimates. What we're going to do is run a microarray on yeast cells that have had essentially every gene knocked out, one at a time, and subjected to a series of chemical stimuli. We're going to look at the genes versus the stimuli, and we're going to try to identify clusters, groups of genes. So you might identify a set of genes that all show reduced expression when mating has been impaired; that same set of genes also shows a response to a yeast mating factor of some sort, and shows activation of MAP kinase signaling. You can imagine that if there were ten genes in that cluster, nine of which were known mating response genes and the tenth of which was an uncharacterized novel gene, what do you think that uncharacterized novel gene does?
Probably involved in mating. And so they showed very elegantly that you can take a map like this and make quite accurate predictions about gene function across the entire genome. Now, you're going to make mistakes: just because a gene shows a similar transcriptional response to another gene doesn't necessarily mean it does the exact same thing. Co-regulation is not co-function. But still, it shows there's good information there. However, clustering gets abused in this process very frequently, and inappropriate clustering is probably the biggest crime we bioinformaticians commit. The first point is that clustering is a pattern discovery technique. It learns new patterns. It's not pattern classification; it's not meant to do predictive work, like Google predicting what your next query is going to be. It's meant to find the most important signals within a data set. That means clustering is supposed to be done on your entire data set. If I go ahead and find 100 genes in my data set that are related to response to a drug, and then cluster only those 100 genes, well, I'm going to see the response to the drug. That doesn't mean the clustering has told me anything; it just reaffirmed the statistical analysis that I did before. It didn't even reaffirm it; it just showed that I did a statistical analysis. So if you intend to draw any conclusions from your clustering, it has to be done on an unbiased gene set. Now, it doesn't have to be done on the entire gene set. Imagine that your clustering algorithm is too slow to work on the whole data set: then you can cluster random subsets of 10% of your data and show that those have the pattern you're interested in. There's nothing biased about that. But as soon as you introduce a specific experimental bias, your clustering results immediately lose their ability to tell you anything. Secondly, clustering is not a technique to identify differentially expressed genes. It is not a substitute for proper statistical analysis. So someone will say: I plotted control and treated samples, and I see a bunch of genes that are red in the control and green in the treated, so let me look at what those genes are, and those must be the differentially expressed ones. Clustering is a pattern discovery technique, and it's well demonstrated that it's underpowered for this: it will increase your false negative and your false positive rates relative to even extremely simple statistical techniques like a t-test or a U-test. So even though there's a definite temptation to say those must be the genes I'm interested in, the clustering will not identify a group of genes as reliably differentially expressed as other techniques will. It's not a substitute for your standard statistical analysis. And lastly, clustering is something that always gives you a pattern. I could throw random data at these clustering algorithms and they'll give me a pattern: they try to discover the strongest patterns in your data set, whether or not those patterns are actually meaningful. You cluster tumors and you see that you had 8 patients who responded to a drug and 8 who did not, and their clustering pattern was 5 and 3 versus 3 and 5. Is that random chance, or is that an enrichment?
You need statistics to tell whether your clustering showed you anything meaningful. When I told you about the QA/QC experiment where my arrays clustered according to technician — just looking at it, how do you know if that's real? You need to do statistics to say, yes, the clustering matches the actual known technician groupings better than we would have gotten by chance alone; this isn't just a random assortment. So in all those cases, clustering is something where you have to think carefully about how you're going to use it. So we've kind of walked through the entire analysis pipeline to this stage, and I want to draw out some key points that everybody should be keeping in mind — and frankly, if you forget just about everything else that you learned today, these are probably the most key points that you need to think about. First, microarray data is analyzed with a sequential pipeline of algorithms. A sequential pipeline of algorithms: that's the workflow. You can't skip a step. The results of one step are critically dependent on the results of the last step; you can't move to the next step until you optimize the previous one. You have to take things in a careful, systematic, ordered way, and you have to remember that each step of the pipeline does something different and requires different choices and different consideration from the analyst. So I mentioned at the beginning that there's not a lot of research into quantitation algorithms — Affymetrix quantitation is almost standard, the way that the company suggests is what we all do — but you are still making a choice, you still have to know that, you still have to report it in your manuscripts, and you still have to remember that it has downstream effects on your results. The second point is that this is a very active research area. So I showed you this pipeline, and on this pipeline I told you about all the research that's going on into the steps down here and all the research that's needed into the steps up here. None of those things are in stasis. The best algorithms that we used to analyze microarray data three years ago are not entirely different from what we would use today, but they have changed, and five years from now we would probably use different techniques: there will be better, improved methodologies, a better understanding of the signal and noise, and the technology itself will change in ways that lead us to do things differently. So this is the first time in this course that you've heard this about a pipeline, and what you're going to hear for every other type of analysis is exactly the same. It's a pipeline of sequential algorithms. That's how you analyze copy-number arrays and proteomics — sequentially, with a series of algorithms — and the exact same principles that you've heard me talk about for the last hour and a bit are exactly what you should be thinking about for those as well. The last thing I want to point out on this topic is that you should be thinking about getting very familiar with the Bioconductor package, or library of R packages. In different fields, different tools become used for a variety of reasons — field effects, where lots of people get involved in something, and founder effects — and so, I was originally an engineer and we used MATLAB for everything, and I don't today know why; now that I know better I realize how many flaws MATLAB as a language has. It's got its good points and it's got its bad points. For bioinformatics the two lingua francas are Perl and R. They're the most commonly used languages, and in part it's because of the existence of very strong, open-source, freely available libraries.
The library that characterizes bioinformatics in R is Bioconductor. It is very well maintained, it's reasonably well funded, and the vast majority of the analyses that we'll talk about use Bioconductor packages. They contain methods for proteomics data, for flow cytometry, and for other data types like that, and so you should think very carefully, whenever you want to do something: do I need to write it myself, or has somebody done it before that I can take advantage of? Bioconductor has a pretty user-friendly website, and I guess everybody will have installed it already yesterday before starting the tutorial. So here is where we were going to stop for a break, and what we're going to talk about next is Affymetrix preprocessing: we get specific about that workflow, and then we look at how we can load Affymetrix data into R, preprocess it, and compare different preprocessing techniques. So I showed you the workflow that is generic, that kind of covers every possible situation; now let's focus down on what is Affymetrix-specific. So this is the overall workflow. The first thing we can say is that quantitation is done according to the Affy defaults, with minimal user intervention. Affy does go ahead and tweak things on us sometimes, so report version numbers in papers. As a reviewer, it's one of the first things that I'll reject a paper for: I look at the methods, and if there are no version numbers for the software, then it's not a reproducible bioinformatics analysis, and you send it back and tell the editor to get them to put version numbers in before you'll review it. Similarly, if the data is not publicly available, we would reject it and tell the editor to get them to deposit their data in a public repository, and then you'll review the paper properly for scientific content. Secondly, it's a one-channel array, so there's no Cy3 and Cy5, just the one stain that they actually use. Spot quality is typically handled for you in most analysis pipelines, but definitely so in Affymetrix. It's a single-channel array, and in general simultaneous normalization procedures are used — ones that do between-array and within-array normalization at the same time. So if we collapse all that and rephrase it a little bit, you get a pipeline that looks like this: you start off with the raw quantitated data, which is in files called CEL files, which you need to background correct and normalize; then probe-set annotation — remember I mentioned the CDF files earlier, you need to annotate the probe sets — then statistics, clustering, and integration. OK, probe-set annotation is something that I wanted to spend a little bit of time on. So we've talked about it before: the arrays can become outdated, and there are a number of reasons for this. There are the changing gene definitions: the reference genome sequence that these arrays were based on was not finished — and I guess John yesterday probably talked to you about how difficult it is to define a finished genome sequence, but these ones were really, really not finished — and so there have been a lot of changes, new regions in the genome, deprecated regions that were collapsed, and so forth. Additionally, there's the problem of splice variants — some estimates would be 10 to 20 splice variants per gene — and it will often be found that probes will tag one splice variant but not another, and you have to think carefully: should we collapse those together because they're targeting the same gene, or separate them into different transcript definitions, because they're going to show different results if there's differential splicing?
And then, of course, any error that's made in the initial design remains present in all the arrays. Arrays are what's called a closed platform, and what that means is that you fix at design time what you're going to be assessing, and those are the only things you can assess. By contrast, sequencing is an open platform: you can find whatever you'd like. Does anybody know of a way in which microarrays can be made into an open platform, where you can assess any species, any single thing, on a microarray? Correct — so there's a technique which has been academically described, but I don't know of a commercial implementation; they're called universal arrays. Essentially the array contains every possible 15-mer, and that way you can hybridize any sample to it, and if you do appropriate normalization and deconvolution you can pick apart any single sequence, or any single gene, in any species — as long as it's got a unique 15-mer tag, that's the key question — but it's actually a pretty nice technique, and in theory it would be able to do everything that you can do in sequencing on a single array. If the arrays could be made more dense you could get up to 17- or 18-mers, and at that level it's actually practical for human genome work; at lower levels it's perfect for things like bacteria and so forth. The reason why Affy — why any company — does not come out with a new array every six months is because array design is expensive. In fact, the two very expensive things in making these arrays are the collimator and the masks, and you can sort of see you're going to need a lot of masks: at a minimum you're going to need 100 masks, because you build up every base pair one at a time and you've got four possible nucleotides, so 100 masks at a minimum. Based on design characteristics, you may not want to have a square of four spots synthesized together, because that will allow the light to spread out a little bit more and that diffraction can cause noise, so it will actually take more than that — 100 masks at minimum. These masks are time-consuming and expensive to produce, so the company wants them to get highly used; that's one of the reasons. The second reason is that customers can occasionally be unwilling to change and move to new technologies, and say, but you know, this one was working, I don't want to have to worry about how I integrate the old and the new array data — think about all the bioinformatics challenges, the batch effects, having to do another validation experiment to make sure I trust the new array platform. And as a result, Affy's best-selling array remains a 10- or 11-year-old product. Why? For exactly that reason: when we need to do an experiment, I'm like, oh, use the U133 Plus 2. Why?
Because I know how to analyze that array, we're familiar with its signal-to-noise characteristics, and it's easy to integrate with all the data that we've already generated in all sorts of other experiments and platforms. So there are advantages to that; of course, the disadvantage is that we're continually working with older designs. So that leads you to ask, well, can we take advantage of the fact that there are multiple probes, as I mentioned — that you've got 10 or 11 probes per gene on average? It's actually a little bit more complicated than that. What I didn't mention about Affymetrix arrays is that in the original version of the array, way back in the late 1990s, these 25-base-pair sequences were all designed with a paired control — a paired control where the 13th base pair, the middle base pair, was mutated and the other 24 base pairs were identical. The idea is that this mutated base pair will provide you a really good assessment of non-specific hybridization; it will give you a really good measure of any background that was binding to that same sequence. And it's a pretty good idea. So the exact-matching probe is called the PM, perfect match, and the one with the mismatch in the middle, at the 13th base pair, is the MM, mismatch, and the PM and MM probes need to be aggregated together in some sort of a useful and clever way to make sure that you can take advantage of that information. The challenge is — well, actually, who knows what the challenge is? What's wrong with MM probes? In case there's a SNP — good, yeah, if there's a SNP in that location it'll definitely be a big, big problem. Something that doesn't hybridize specifically to the perfect match could still hybridize to the mismatch, so the differential affinities may not be strong enough — definitely true. There are at least two other things that go wrong here. So one is — that's true, but what happens if the mismatch probe exactly matches another region of the genome? That is by far the most common problem, and I told you a few slides ago that we didn't know the exact definition of the genome at the time that Affy designed these probes. So now, when you go ahead, you find that a large fraction — I think it's about 30% — of mismatch probes perfectly match something else. That's not so useful, because what you were expecting to see as pure signal — pure perfect-match binding and no mismatch binding — you're instead going to see with mismatch binding as well. And then there are SNPs: if there's a SNP somewhere in that probe region — it sits in both the perfect-match and the mismatch probe, say at the 20th base pair — now the perfect match will bind at 24 out of 25 and the mismatch will bind at 23 out of 25. Am I going to be able to distinguish that difference accurately? Sure, 25 out of 25 versus 24 out of 25 is a big difference in affinity, but 23 versus 24 is not, so SNPs can often cause big, big problems in your ability to detect things. So today we normally exclude the mismatch probes. In fact, the CDF remapping algorithms that I described earlier actually treat the mismatch probes as if they're regular probes: usually they will take every 25-base-pair probe, perfect match or mismatch, and see if they can find any place in the genome that it could be used to interrogate, and that means some things that were never designed into the array are now interrogatable because of the mismatch probes. So they try to maximize the use of the data. That being said, I've kind of talked about this like the company really doesn't want to go ahead and give you new versions of arrays; that's not true, they certainly have released new products — it's just that they don't get as wide use.
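If you want to poke at those probes yourself, the affy package exposes them directly. A minimal sketch — the CEL files are assumed to be sitting in your working directory, and the object names are just placeholders:

    # Sketch: load raw Affymetrix data and pull out the PM and MM intensities.
    # (If the package isn't installed yet: install.packages("BiocManager"),
    #  then BiocManager::install("affy"); the install route has changed
    #  between Bioconductor releases.)
    library(affy)
    raw.data <- ReadAffy()           # reads all CEL files in the working directory
    pm.intensities <- pm(raw.data)   # matrix: PM probes in rows, arrays in columns
    mm.intensities <- mm(raw.data)   # the paired MM probes, same layout
    # One quick symptom of cross-hybridizing mismatch probes: how often is the
    # MM intensity at least as high as its paired PM?
    mean(mm.intensities >= pm.intensities)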
So there's an exon array, which doesn't get particularly wide use; there's what I call the gene arrays, which are kind of an updated version of the classic expression arrays with more modern probe definitions; and, maybe most importantly and most interestingly for everybody here, in January or February a paper was published in PNAS about an array that they co-developed with Stanford, and basically it's a hybrid exon and non-exon array. It's a nice product in the sense that in theory it covers a larger fraction of the transcriptome, which is better known today, but it tries to maintain compatibility in some ways with some of the older arrays, and most importantly it's been shown to work pretty well in FFPE samples. So in theory that's what they would hope would be the next flagship product, but as John would have mentioned yesterday and Anna's going to mention on Friday, one could question whether in 10 years anybody will run a new microarray for that kind of experiment — maybe the right thing to do at that point is just to sequence everything. So companies like this are in an intermediate, transitional stage, and that actually makes it difficult for a company to draw in substantial bioinformatics expertise: would I rather spend the next year working on algorithms for a new microarray platform, or would I rather spend it on algorithms for RNA-seq? One is probably going to have a better long-term payoff. There are some challenges with that, and that again comes back to why this probe-set remapping is so critical. So I mentioned the CDF file, and now that I've done that you can sort of see the other files here that come out of an Affy experiment. One that we didn't discuss is the first one, the DAT file. The DAT file is just a TIFF image; it's the actual raw scanned image, and it gets processed and quantitated, using the Affy default algorithm, into the CEL file. The CEL file — sometimes it's in a compressed format — is basically just a table of locations: X coordinate, Y coordinate, signal intensity, covering every single spot of the array in that way. And we'll talk a little bit about Affy-specific preprocessing. So let's think about this again: what exactly is preprocessing? I talked to you a little bit this morning, but do we now have a specific one-line definition that we would use after what we've just discussed? I would have just said removing technical noise. That's exactly right: preprocessing is removing technical noise from the dataset. And so, of course, you've got to think about where the technical noise comes from. So where would you find the technical noise in an Affy dataset? You could have batch effects from one batch to another, you could have effects that are related to the individual features, or effects that are related to the hybridization and scanning, and so you have to look at each of the individual steps and ask what could happen, what could go wrong, and those will dictate what you're going to do in your analysis. So, as an example, you've got a reverse transcription and an in vitro transcription for each sample, and that means they could differ from one sample to another, which means that probe affinities or biotin labelling could be different from one to another, and that would lead to systematic differences; hybridization could vary; and so forth. So when we come up with an analysis pipeline, we're going to try to remove all of the factors that could have caused technical noise, and one thing that's worth pointing out is the second-to-last one: hybridization.
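A first practical place to go looking for that technical noise is the raw data itself, before doing anything else. A small sketch, reusing the raw.data AffyBatch from the earlier sketch:

    # Sketch: first-pass quality checks on the raw arrays ('raw.data' is the
    # AffyBatch from ReadAffy() above). Spatial artifacts such as scratches and
    # bubbles show up in the pseudo-image; hybridization or batch problems show
    # up as arrays whose intensity distributions sit apart from the rest.
    library(affy)
    image(raw.data[, 1])   # reconstructed image of the first array
    hist(raw.data)         # per-array densities of log2 probe intensities
    boxplot(raw.data)      # per-array intensity boxplots, side by side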
Hybridization is greatly affected by ozone levels. It's pretty well known that environmental ozone levels can really, really mess up the fluorescence of certain dyes: the ozone scavenges the dyes and causes free-radical reactions — chain reactions, actually — that can prevent you from seeing any signal. And back in the mid-2000s, the lab that I did my PhD in gave up on doing microarray experiments in the summer, because ozone levels in Toronto were too high and we were just losing one of every two, one of every three experiments to insufficient signal. That's not the kind of thing you'd immediately guess, and the UHN Microarray Centre in the other tower at MaRS has an ozone-free room for doing parts of its hybridization and scanning procedures. That's not something where you immediately go, oh, I'll control for that — but actually, if you happened to record the ozone level with every sample, you could say, OK, this was done on a day where environmental ozone was X, therefore I'll scale everything to the ozone level and bring it all to a consistent baseline. So those kinds of things really require you to know your technology — I've said that a few times — because if you don't, you won't be able to do the analysis correctly. But while I just said preprocessing is there to remove technical noise, it is not a substitute for designing an experiment properly. You can't run a really poorly designed experiment and then say, OK, I'm going to use normalization to clean it up; the point of normalization is to remove whatever you couldn't remove with experimental design. So, as an example, there are some basic experimental design principles that you should always keep in mind. You should balance experimental groups. So imagine that you were going to be doing a hundred arrays — let's call it a hundred arrays — and that your array centre did them in batches of ten, so it's going to take them two weeks, and you have 50 arrays from patients with cancer subtype X and 50 from patients with a different cancer subtype Y. The worst possible experimental design is, on week one, to do the 50 patients with cancer subtype X and, on week two, the 50 patients with cancer subtype Y. However, I would wager that at least half of the core facilities in North America, if you just gave them all your samples, would not randomize them for you or even think of it, so you'd have to either tell them or pre-randomize the samples yourself. It's clearly established that there are systematic effects that occur over time, and it's obvious how that would go wrong, so you should always be thinking proactively not only about balancing your groups but about making sure you've recorded those groupings, so you can test whether there are group-specific effects. If you have a choice — and there's not always a choice — it's much better to do biological replicates than technical replicates. So if you only have money to do 20 arrays, and you wonder, should I do 10 arrays of treatment X and 10 arrays of treatment Y, each from a different patient, or should I do 5 patients per group with 2 replicate arrays from each patient — it's almost always better to do the 10, in the vast majority of cases it's better to do the 10. And the reason for that is that the technical variability will already be incorporated into the biological replicates — they still have that technical source of error — so you're going to be able to simultaneously assess biological variability and technical variability, and that allows you to increase your power and, most importantly, increase your generalizability.
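To make the balancing point concrete, here is a small sketch of how you might pre-randomize that hypothetical 100-array experiment yourself before handing the samples to the core facility — the sample sheet here is entirely made up:

    # Sketch (made-up sample sheet): assign 50 subtype-X and 50 subtype-Y
    # samples to ten weekly batches of ten, balanced within each batch,
    # instead of running all of X in week one and all of Y in week two.
    set.seed(42)
    samples <- data.frame(id      = sprintf("s%03d", 1:100),
                          subtype = rep(c("X", "Y"), each = 50),
                          batch   = NA)
    for (st in c("X", "Y")) {
      idx <- sample(which(samples$subtype == st))      # shuffle within subtype
      samples$batch[idx] <- rep(1:10, length.out = length(idx))
    }
    table(samples$batch, samples$subtype)   # every batch gets 5 X and 5 Y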
What you really care about is: are my results true in general, not just true for this one sample? And if you have a smaller number of biological replicates, you damage your ability to see that. Now, there are exceptions. Imagine I'm developing a clinical diagnostic — then I really do care whether it's true for this one patient. Or imagine that I have lots of money but a limited number of samples — OK, then you would do technical replicates. But in general, biological replicates are preferred. And then lastly, imagine that it's not possible to process your samples identically experimentally. When that's not practical, it's critical to introduce controls. Let me give you an example. Imagine that I get a grant from somebody — Michelle, did you want to give me a grant? — so Michelle gives me a grant to do 100 microarray experiments, thank you, and she says, well, these are expensive, so I can only afford to fund 50 this year and 50 next year. And I say, wow, that's a bit of a pain: my experiments are going to be physically separated in time by a whole year, and that's going to introduce a source of bias. Yes, I would make sure that I balanced my experimental groups, but the other thing I would do is take some samples and make sure they're run in both years. So instead of doing 100 experiments, Michelle would only get 90 experiments' worth, with 10 replicated in each year, and that is a much more preferable experimental design, because now I have a way to control for that batch effect. So you have to think, when there is something that is going to be systematically different in your experiment, what experimental control can you tell whoever is doing the experiment to include so that you can actually analyze the data better? If your controls are frozen and your experimental samples are formalin-fixed, then it makes sense to say, can we get some formalin-fixed controls? There won't have to be a large number, but enough to be able to systematically identify what the source of bias is and how big it's going to be. So obviously you aim for perfectly balanced and highly replicated experimental designs, but when you can't, you build those controls directly into the experiment to assess the sources of variability. So, I'm not going to go into great depth on the preprocessing techniques used for Affymetrix experiments, in part because half of you aren't dealing with microarray data, and in part because the half of you that are will be dealing with a bunch of different platforms, but I'm going to mention them so that when you work with them this afternoon you'll get a feel for what the differences are. So I'm not necessarily describing for you here the two best algorithms; I'm describing the two most commonly used — the two that, if a reviewer saw them, they would probably just go, ah, OK, fine, I'm not going to think about it too much. They may not go, wow, this person clearly thought about this for six years, but at least it's reasonably accepted by the community. Those two algorithms are RMA, for Robust Multi-array Average, and MAS5, for Microarray Analysis Suite version 5 — I'd have to double-check that. So they're very, very different algorithms that were developed around the same time. In some sense they were both inspired by the fact that there's an Affy algorithm called MAS4, which was based on an average-difference calculation — it doesn't really matter how it works, but it had a lot of statistical naivety to it — and that led several very, very good biostatisticians to look at it and go, wow, we can do better than that. At the same time the company looked at it and said, yeah, that's probably right, we can
do better than that too. And so two different algorithms got published around the same time that approached the analysis of Affymetrix data in different ways. Mathematically they're not trivial: they both include techniques that people are probably not highly familiar with, like median polish, an algorithm to summarize data and aggregate linear trends. But in essence, despite their differences, they come up with reasonably correlated results, and they each have strengths and weaknesses. The first is that they have a precision-accuracy tradeoff. MAS5 yields very, very accurate results — and by accurate, what I mean is that the estimated mean is very likely a good reflection of the underlying truth — however, the results are not necessarily precise, and by that I mean that there will be higher variability around that mean. RMA goes the other way: it's going to be highly precise, so technical replicates are going to be very close to each other, but they may not actually sit on the true mean, they may be off it. And you can argue about which is the better approach, about what you want to sacrifice. I'm not saying that MAS5 has absolutely zero precision and that RMA has perfect precision; there's some sort of a variance-bias tradeoff there, where one is going to yield more of one and less of the other. The general consensus is that RMA, because it has greater precision, does a much better job in small-n experiments, in small-replicate experiments. The idea is that if you have few replicates then you don't have a lot of statistical power, and if you artificially increase the noise — as you might with something that is accurate but has a lot of variance — then you will start to diminish the number of true results that you're able to see. So the argument, which is reasonably well accepted, is that RMA will work much better for low-replicate studies and MAS5 will work better for higher-replicate studies. MAS5 also has a couple of characteristics that allow it to work on a single array at a time, and this is critical: a single array at a time means that we can take a look at individual patients one at a time and say, I'll analyze the next patient who comes in the door, and the next one, and the next one, and that makes it much more amenable to clinical diagnostics. By contrast, RMA does not have that characteristic — it works on groups of samples together — and that can make it much more difficult to transfer into a clinical diagnostic. So we'll start out working with RMA right now, and this afternoon we're going to take a look at MAS5.
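For the hands-on part, running the two side by side looks roughly like this — a sketch, assuming raw.data is the AffyBatch loaded in the earlier sketch, and keeping in mind that RMA needs a group of arrays while MAS5 can be run one array at a time:

    # Sketch: preprocess the same AffyBatch ('raw.data', from ReadAffy()) with
    # both algorithms and compare the resulting expression estimates.
    library(affy)
    eset.rma  <- rma(raw.data)    # background correction, quantile normalization,
                                  # median-polish summarization; log2 scale
    eset.mas5 <- mas5(raw.data)   # Affy's MAS5; values come back on the natural
                                  # scale, so log2-transform before comparing
    expr.rma  <- exprs(eset.rma)
    expr.mas5 <- log2(exprs(eset.mas5))
    # Correlated, but not identical, per-probe-set estimates on the first array:
    cor(expr.rma[, 1], expr.mas5[, 1], method = "spearman")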