I'll be talking to you for the next couple of days about microarrays and how we analyze microarray data. Let me start off with one general comment. A couple of people asked before we started whether I recommend things like RStudio. In general, I actually don't. I think it hides you from the actual complexity of the code you're working with, and RStudio itself has its share of bugs and problems. I suggest people work directly with the R interpreter: it gives you the best control over your data, the best understanding of what's actually going on, and the best quality analysis. We'll work that way over the course of the next couple of days. While you're working on the assignments, I'll be doing the same thing in front of you, right up on the screen, so you can follow along and help each other out; we'll all be working on the exact same thing. If you have questions as we go, just raise your hand.

So, the key thing first. The most common type of microarray experiment is an expression microarray; it is by far the most widely sold type, and it is what we'll focus on. The first thing to understand is what the signal actually is. An expression array is really measuring a count, and it does not measure the absolute amount of mRNA. In fact, it doesn't even measure the amount of any particular mRNA. It measures a sequence, and anything that contains that sequence will be measured. So in some sense the technology is fundamentally sequence-focused, and you get a count that we hope is proportional to the amount of mRNA over a certain range, but it is not absolute. Keep that in mind as we go: we are always dealing with something that is proportional to the true signal. We never know that there are, say, 13 molecules of RNA in a sample without developing a titration curve or something like that.

Now let me talk a bit about basic microarray technology: how an array is constructed. We'll go through a couple of slides and then move on to the math. When Affymetrix arrays are made, they use what's called photolithography. Basically, it lets you shine light through a mask onto only certain regions of the chip. You can imagine what happens if a beam of light, a single photon even, comes in at an unusual angle: instead of landing exactly on the intended spot, it passes through the hole in the mask and scatters onto neighbouring features. The last thing I'll point out about Affymetrix arrays, and I'll say this a few more times, is that the arrays include redundancy. They produce a large series of output files, and those files include the information you need to work out what each measurement actually is. That's the principle. The other important technology out there is Illumina's self-assembling bead arrays, which is a very clever application of a quite different idea, and over time one approach will probably win out.
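To make the "plain R interpreter" workflow concrete, here is a minimal sketch of loading raw Affymetrix data. It assumes the Bioconductor affy package is installed and that the working directory holds .CEL files; that file layout is an assumption for illustration, not something from the lecture.

```r
# Minimal sketch of a plain R session for raw Affymetrix data.
# Assumes the Bioconductor 'affy' package is installed and that the
# working directory contains .CEL files (hypothetical file layout).
library(affy)

raw <- ReadAffy()        # read every .CEL file in the working directory
raw                      # summary: chip type, number of arrays, etc.

# Probe intensities are only proportional to mRNA abundance, so a first
# sanity check is simply to compare per-array intensity distributions.
boxplot(raw, las = 2)
```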
Anybody who's still having challenges with their arrays, we'll pick it up in the next practical. I want you to focus on the didactic part, and I don't want to delay everybody just to sort out a couple of people; what you're dealing with are probably small issues we can fix in the next practical in a couple of minutes. All right. Can we have laptops down and no side conversations, please? I want to go through a couple of quick points. There was a question just now about MAS5 versus other pre-processing techniques, so let's revisit why we do what we just did. I taught you the commands, and you should be able to recapitulate them on any data set, but the commands aren't the fundamental point; the point is understanding what pre-processing is actually doing.

So far we've talked about what a microarray actually is and how they work. Now we're going to switch to how you analyze them, which will be the focus of the rest of our time. Are there any questions so far? Anything that came up over the break where you thought, I didn't follow that, or I wanted more information? Otherwise I'll go right into it.

[A participant asks how often the array definitions are updated and how current they are.] Great, good question. We'll address this in a lot of detail over the course of the workshop, but how often is a corporate decision made by the companies. Take, for example, the classic microarray, the Affymetrix U133, the most widely sold array in the world. How many people have heard of the U133 class? And who knows what U133 stands for? U133 stands for UniGene Build 133. UniGene is an NCBI database that you can access just by going to the NCBI site, slash UniGene. It is basically our best guess at how transcripts in the genome are defined. If you take a look at UniGene and click on the statistics for Homo sapiens, you can see that the current build is UniGene Build 236. The Affymetrix U133 was based on UniGene Build 133, and given that they release roughly ten builds a year, that tells you the array was based on a definition of the human transcriptome from about ten years ago.

So what has changed in ten years of understanding the human genome? Most of the genome sequence was already there back then, but the annotation of its actual structure has changed enormously, and on top of that our understanding of what genes and their splice variants look like has changed. One of the major things, and we'll come to this in just a minute, is the definition behind an array: what comprises a gene? What are the different splice variants or isoforms being measured? Every new build of UniGene, even today, will merge genes that turn out to be a computational artifact, where two genes we thought were separate are actually the same thing, or alternatively remove genes that don't really exist. Those are the kinds of things that show up as differences between builds. As for the rate of updates, as I say, that's a corporate decision. The U133 became the best-selling array ever, and it made so much money that even when Affymetrix introduced newer designs, people stuck with it because they wanted comparability with existing data and were comfortable with it.
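Since the MAS5 question keeps coming up, here is a hedged sketch contrasting the two standard pre-processing routes in the Bioconductor affy package. It assumes raw is an AffyBatch as read in the earlier sketch; the comparison plot is illustrative only.

```r
# Two standard pre-processing routes for Affymetrix data; 'raw' is an
# AffyBatch object (see the ReadAffy() sketch above).
library(affy)

eset.mas5 <- mas5(raw)   # Affymetrix's MAS 5.0: per-array scaling and
                         # mismatch-probe-based background correction
eset.rma  <- rma(raw)    # RMA: quantile normalization across arrays and
                         # model-based probe summarization, on log2 scale

# The scales differ: mas5() returns linear values, rma() returns log2
# values, so log the MAS5 output before comparing the first array.
plot(log2(exprs(eset.mas5)[, 1]), exprs(eset.rma)[, 1],
     xlab = "MAS5 (log2)", ylab = "RMA (log2)")
```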
So Affymetrix produced these exon arrays, which were actually pretty good, but nobody bought them; there are only a few publications you can find using them. For a while after that they basically didn't produce new arrays. More recently, they've generated an increasing number of arrays, including non-coding arrays and the Human Transcriptome Arrays 1.0 and 2.0, which are very broad splice-junction and transcript arrays and a very good product. But you can see the ebb and flow in how often designs are generated. In an ideal world, you would want a new array platform every time we have a new definition of what genes are, and obviously that's one of the limitations of microarrays. Any follow-ups or other questions? Yeah.

That's a tough question to answer. Let me start here: this is the most painfully named array family in the world. The original array was called the U133 and it had two parts, A and B. Then they merged them together into what they decided to call the U133 Plus 2.0. And then they made a low-input version of the U133A array, which they called the U133A 2.0. So the U133A, the U133B, the U133 Plus 2.0, and the U133A 2.0 are all in existence, and I've routinely seen people not use the array they intended to. So that's the first thing. The second thing is how easy the comparison is. There are ways of doing the analysis, which I'm going to teach you, that make comparisons easier, and it depends on how the arrays differ. Imagine the exact same sequence is represented on two arrays. Any differences you see are then caused by the way the arrays are constructed or the way in which they hybridize; the sequence characteristics are the same. On the other hand, the two arrays might target different parts of the gene. Now you have sequence-based differences, differences in cross-hybridization, noise, et cetera, and you've got really big problems. So if it's the exact same sequence, it's not terribly difficult; if it's different sequences, it's hard. The U133A and the U133 Plus 2.0 share a common core of probe sequences present on both; the differences lie between the B array and the Plus 2.0.

My recommendation to people, almost always, is not to try to do a merged analysis of different microarrays. It's very difficult. There's lots of literature describing ways to get it to work, and they don't really work in a general way. Instead, discover in data set one and validate in data set two. That applies when the two data sets are on different microarray platforms, and even when they're on the same platform: I've had real challenges trying to merge data generated at Harvard and Toronto because of differences in the production facility, in other words batch differences. So in general, I suggest you treat each experiment as a separate experiment and simply determine whether your finding holds in each cohort. The hope is that the difference you detect is biologically large enough that you see it even given the change in platforms and technology; that's the kind of finding that validates across platforms. For example, if a gene is strongly over-expressed, you'd expect to see that on arrays and by RNA-seq alike, even though each technology has its own artifacts, such as sensitivity to RNA degradation.
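As an illustration of the "analyze separately, validate separately" strategy, here is a minimal sketch. It assumes two hypothetical ExpressionSet objects, eset1 and eset2, each carrying a two-level group factor in its phenotype data, and it uses limma only as one reasonable choice of test, not as the method the lecture prescribes.

```r
# Hedged sketch: fit the same two-group model in each cohort separately,
# then ask whether the effects agree, instead of merging the raw data.
# 'eset1' and 'eset2' are hypothetical ExpressionSets with a factor
# pData(eset)$group distinguishing the two conditions.
library(Biobase)
library(limma)

fitCohort <- function(eset) {
  design <- model.matrix(~ pData(eset)$group)
  eBayes(lmFit(eset, design))
}

fit1 <- fitCohort(eset1)
fit2 <- fitCohort(eset2)

# Validate on the probes shared by both platforms: how often does the
# direction of the group effect agree between the two cohorts?
shared <- intersect(rownames(fit1), rownames(fit2))
mean(sign(fit1$coefficients[shared, 2]) ==
     sign(fit2$coefficients[shared, 2]))
```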
If you do a focused, gene-wise analysis, you can go back to the raw data, pre-process everything through a single pipeline, and build a single model; a gene-wise question like that is a tractable research question. Yeah, that's absolutely right. If someone came to me with a data set of only 35 samples, the first thing I would do is a power calculation: what chance do we have of detecting the effect in this data set? Even if the answer comes out at something like 0.03 percent, working it out shows that you've actually thought about the question, rather than assuming you'll find the effect 100 percent of the time. As for the point that everyone is interested in different questions, that's absolutely true: there is no single bioinformatics analysis that can be universally applied all the time.

I'll give one story about TCGA, since we touched on it earlier. TCGA is The Cancer Genome Atlas, a US NIH-funded study to analyze roughly 500 tumors in each of about 25 different tumor types using DNA sequencing, RNA sequencing, and methylation analysis, cataloguing the alterations within each tumor type. TCGA is in its publishing phase now: there are maybe five or six papers published, five or so in press, and another five in the pipeline. Last summer you would have seen the lung squamous cell carcinoma, breast cancer, kidney cancer, and colorectal cancer papers, with more coming. So it's a very large data set; in terms of standardized data it's actually much bigger than the 1000 Genomes Project or anything like that.

The other thing I'll say is that TCGA is interesting because its standard analyses were designed to be uniform across the project, not to provide an answer to every question people might have. For example, they flag the top and bottom 10 percent of patients for every gene. So if you have 500 patients, the 50 patients with the lowest and the 50 patients with the highest levels of a gene are called under-expressed and over-expressed. That doesn't mean they necessarily are. It doesn't mean the population is tri-modal. It doesn't mean there's even any variability at all; it's just a uniform definition of bottom 10 percent, top 10 percent, and everyone in the middle. The effect is that, by construction, 20 percent of all patients are outliers for every single gene. That's probably not a great assumption: there are genes where the true fraction is much larger and genes where it's much smaller. But to standardize things across a big project, that's what they did, and depending on which level of the data you download, calls like that are baked in; no single analysis story is applicable to everybody. So if you download a public data set, I strongly recommend you re-analyze it with your own protocols and your own pipeline, exactly as you would if it were data you generated yourself. Don't assume that because that group did it, it must be right, even if it's my group. We think we do it right, and that's great, but that doesn't mean it's right for what you want to do with the data. That's one thing you'll see over the course of the session: different analyses provide different results, and different methods answer different questions.
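To see why the uniform 10 percent rule behaves the way it does, here is a toy re-implementation of the idea. This is not TCGA's actual code; the matrix, function, and cutoffs below are all hypothetical, chosen only to reproduce the "20 percent of patients are always outliers" effect described above.

```r
# Toy version of a uniform "top/bottom 10%" outlier call per gene.
# 'expr' is a simulated genes-by-patients matrix, not real TCGA data.
callOutliers <- function(x, frac = 0.10) {
  lo <- quantile(x, frac)        # bottom cutoff for this gene
  hi <- quantile(x, 1 - frac)    # top cutoff for this gene
  ifelse(x <= lo, "under", ifelse(x >= hi, "over", "normal"))
}

set.seed(1)
expr <- matrix(rnorm(200 * 500), nrow = 200,
               dimnames = list(paste0("gene", 1:200),
                               paste0("pt", 1:500)))
calls <- t(apply(expr, 1, callOutliers))

# By construction, ~20% of patients are "outliers" for every gene,
# regardless of whether the gene shows any real variability at all.
table(calls[1, ])
```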
Another question: let's talk about the algorithms and see how they fit into what we're going to cover. Quantitation is the first in a series of computational steps. The software has to locate the features on the scanned image: I see a grid over here, I see another grid over there. The placement is only accurate to within a couple of pixels either way, so a simple grid-alignment solution underlies everything that gets built on top of it. Let's talk a little about the specifics of Affymetrix today, and we'll go over the details of the technology in a little while. This is generic Affymetrix data, and a couple of things are distinctive: Affymetrix has its own defaults for quantitation, and then the steps that follow are pre-processing, annotation, statistics, and clustering.

On annotation: you have to remember that the reference genome itself changes. We used to think the reference represented the most common sequence at each position, but in places it actually carries the private variants of the handful of individuals whose genomes were sequenced to build it, including things like rare splice variants.

Now, about the math. We have the advantage that there are multiple copies of each probe printed on the chip and multiple probes per gene, and what we're trying to do with the math is account for the sources of noise that can arise: the construction of the array, the hybridization of the probes, and so on. I think at this point we'll stop here; we'll pick up in the afternoon and discuss where we're heading next.
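As a toy illustration of why multiple probes per gene help with noise, here is a sketch using Tukey's median polish, which is the summarization step at the heart of RMA. The probe matrix below is simulated, not real data, and is only meant to show the mechanics.

```r
# Toy sketch: summarize one probe set (probes x arrays) into a single
# expression value per array with Tukey's median polish, as RMA does.
# The data are simulated log2-scale intensities, not a real probe set.
set.seed(2)
probes <- matrix(rnorm(11 * 6, mean = 8, sd = 0.5), nrow = 11,
                 dimnames = list(paste0("probe", 1:11),
                                 paste0("array", 1:6)))

mp <- medpolish(probes, trace.iter = FALSE)

# Overall effect + per-array column effects give the expression summary;
# probe-specific behaviour is absorbed by the row effects and residuals.
mp$overall + mp$col
```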