It's just one department, biostatistics and medical informatics, but it's okay. Thanks so much for having me visit. It's great to be back in Berkeley. So I'm an applied statistician. I spend a lot of time looking at data, and a good amount of that time is diagnosing and trying to resolve data problems. What I wanted to tell you about is one particularly egregious example of this from my past work. It's not particularly new, but it's hopefully new to you. And for sure, I'm hoping I don't slip into jargon that's not defined, but if I say something that you don't understand, please ask, and I think you don't need to wait for the microphone to have me explain some jargon that I skipped over. I mostly focus on genetics, and mostly the genetics of model organisms, mostly genetics in the mouse, trying to identify genes that contribute to complex diseases or other traits in mice. Mice have a lot of advantages for biomedical research generally, and one of them is that you can create inbred strains of mice. Mice, like humans, have two copies of every chromosome. These mice are members of a single inbred strain. They're the result of many generations of mating between siblings, so a brother-sister pair for 200 generations. As a result of that inbreeding, you get a strain of mice where the two copies of every chromosome are exactly the same all the way down. All these mice are genetically identical, but also the two genomes, the chromosomes they got from their mom and the chromosomes they got from their dad, are identical. So that is advantageous for biomedical research in that you can completely eliminate genetic contributions to whatever you're studying. An advantage for genetics is that you can then do really big crosses.
A given female mouse can only produce 20 or 40 offspring in her lifetime, but because you have clones of her and her partner, you're not restricted to working with family sizes of 20 or 40; you can have crosses of effectively unlimited size, however much space you have for cages of mice. And mice are not unlike humans. We're not particularly interested in curing mouse diseases, but we're hoping that learning about, say, diabetes in the mouse will tell us something about diabetes in humans. I like to think of the mouse as a model for human disease in three ways. It could be that a particular gene for a trait like blood pressure in the mouse is also involved in blood pressure differences in the human population. In some cases it could be the same allele, the exact variants that you see in the mouse participating in blood pressure variation in humans. If it's not the same gene, it could be the same pathway: what genes are involved in blood pressure in the mouse could tell you something about what pathways are important in blood pressure in humans. And the third way is that the genetic complexity, or the genetic architecture, of a trait in the mouse could tell you something about human disease. If blood pressure involves 1,000 genes, all with complicated interactions, in a given mouse cross, you have to expect that in humans it's going to be similarly complex. So maybe it's obvious, but I have to give you a bit of background before I get to the whole massive data part. This is the background bit. So the typical experiment I study is an intercross between two strains that differ in some trait of interest. You have a strain of mice that has low blood pressure and a strain of mice that has high blood pressure, and you cross them, and you cross them again. You do basically the same sort of cross that Mendel did with peas.
You mix up the two genomes so that the mice down here get a chromosome from each parent, but because the two F1 hybrid parents each have one copy of a pink and one copy of a blue chromosome, you'll have genetic variability as a result of this process of recombination along chromosomes. These F2 mice may get a pink chromosome or a blue chromosome intact, but they mostly get a chromosome that is a mosaic of the blue and pink: in the meiosis that produces sperm and egg cells, these chromosomes find each other and exchange material, so the actual chromosome transmitted can be a mosaic of the two parental chromosomes. So typically what we do is measure blood pressure in each of these mice and then measure genotype along chromosomes; that is, at given sites, figure out which mice are homozygous pink, which mice are heterozygous, and which mice are homozygous blue, and then look for sites where the genotype you got is associated with the clinical outcome, blood pressure. And this is the sort of data that we would have: a bunch of genotypes, with positions along the genome, along each of the chromosomes, as the columns and mice as the rows. At each spot you get to see blue for homozygous blue, or pink for homozygous pink, or yellow for heterozygous, with some smattering of missing data in there, these little black specks. And then we have some phenotype.
For every mouse we get to see its blood pressure, and I've sorted the mice from lowest phenotype to highest here. We're going to scan along the genome and look for sites where there's an association between this categorical genotype and this quantitative phenotype. At every position in the genome we calculate some test statistic, which for historical reasons is called the LOD score, and it's a measure of how closely associated that region's genotype is with the mouse blood pressure. In a region where that thing is high, if you split the mice into three groups according to their genotype at that spot, blue-blue, red-red, or blue-red, those three groups show some difference in the phenotype. Regions where the phenotype is associated with genotype, regions where genotype seems to be affecting the trait, we call quantitative trait loci, or QTL: a site in the genome that is somehow affecting the phenotype. So I've been working on this problem since I was at Berkeley 20 years ago, but I moved to Madison 10 years ago, and the first major project I got involved in there was already ongoing before I got involved. It was an experiment like this; they were interested in diabetes in the presence of obesity. So they did an intercross with about 500 mice between two strains, B6 and BTBR. The mice were all knocked out for a gene called leptin. As a result, they're all really obese. The leptin gene produces a hormone that makes you satisfied at the end of a meal, sort of your signal that I'm done, I'll stop eating. Maybe if we were all knocked out for leptin, we would just continue to eat and eat and eat.
So these mice just eat everything given to them, and they're all really, really big. And one of these strains, BTBR, in this obese state becomes diabetic and the other one does not. They crossed those two strains, made an intercross of about 500 animals, with genotypes across the genome and lots of clinical phenotypes like insulin levels, but in addition they measured gene expression in six different tissues. So for each mouse, at a certain age, I think 10 weeks, they kill the mouse, remove all these organs, extract RNA, and put that onto a microarray to measure the level of mRNA of every gene in the genome. So for each of these six tissues, for each of these 500 animals, for 25,000 genes, we have their measure of expression. We were working on this for two years or so, and we were about ready to write the first paper, and I said, well, I'm willing to sit down and write the paper. But the way that I work, I wasn't going to just sit down and write a paper; first, let me go back to the raw data, so I really have a full understanding of everything that happened. And, well, first I have to say that the cross was done in a certain direction. They took a female BTBR mouse and crossed it to a male B6 mouse, and because of that, the offspring should have certain genotypes on the X chromosome. The female F1 will have two X chromosomes, one from each parent. The male F1 will have a BTBR X chromosome and a Y chromosome from B6, because his father was B6. That's where he got his Y chromosome from. Cross those two, and the offspring, these offspring that I'm working with, the females will get a blue X chromosome, a BTBR X chromosome, from their father, and then a recombinant X chromosome from their mother, and the males will get just the X chromosome from their mother. The males will be hemizygous along the X chromosome. They'll have either blue or pink, but just one allele, and the females will all be either homozygous blue or heterozygous.
But when I looked at the X chromosome genotype data, I saw that out of the 500 mice, there were 26 mice where this was clearly wrong, where either the females were homozygous pink in a whole bunch of places, or the males had a whole bunch of heterozygous genotypes. This is not atypical. To see, say, 25 mice like that is not unusual. But I was thinking, we have this expression data. How can I use that to help me figure out what went wrong? Are these mice male or female or what? I could look at a gene on the X chromosome, Xist, that is super highly expressed in females and not at all expressed in males. Maybe that would help me to tell. Then, later that day, I figured out that, jumping to the punch line now, there are 20% sample mix-ups in this data. Of the 500 mice, it turned out that for 100 of them, the genotypes didn't correspond to who they were supposed to be. These are the 96-well plates, 8 by 12 plates with little wells in them, where they put the DNAs. Each arrow here is pointing from where the DNA should have been to where it actually ended up. The insight I had was this. I was thinking, what's going on with these sex-swapped mice? I could look at a gene on the X chromosome. Then I thought, I have all these other genes whose expression is really closely associated with genotype. There are lots of genes like that, where the expression of the gene is super highly associated with genotype, especially with the genotype at the spot where the gene sits. There are other genes where the expression is super highly associated with some other spot in the genome, but there are lots of them anyway where the expression of the gene is super highly associated with the spot where it is. I could use those as tags of mice. So I looked at that. Here's a gene on chromosome 1 that is super highly associated with genotype at its location.
If I look at the expression of that gene broken down by the three genotypes, it looked really odd to me, like not how nature is supposed to work. For most of the mice, if you have the homozygous R genotype, you have really low expression of this gene. If you have the homozygous B genotype, you have high expression of the gene. If you're heterozygous, you're pretty high but not as high. But then there are these other mice that are homozygous B, so they should have really high expression of the gene, but they have really low expression. And these sex-swapped mice, if I highlight just those, it turns out that they're enriched for being in the wrong blob, if you will. So there were, say, 16 mice where the sex implied by their genotypes didn't match the sex that had been recorded, and nine of them end up being in the wrong blob. So it suggested to me that the sex swaps were sample mix-ups, and that all these other mice in the wrong blobs were also sample mix-ups. So I could use this association between expression of a gene and genotype at a site to try to figure out, for a given mouse, what its genotype really is. I don't believe that this mouse is really heterozygous here, because its expression is really low. What do I think its genotype is at this gene? I think it's really homozygous R. All these mice down here, I think, really are homozygous R, because they have low expression. All the mice that are in the middle, I think, are heterozygous, and all the mice that are really high, I think, are homozygous B. So I could use, say, a K-nearest neighbor classifier: from their expression, what do I think their genotype is at this position? For each mouse, I look at some number, say 30, of mice whose expression is similar, and I allow them to vote to say which genotype it should really belong to.
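The voting idea just described can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the code used in the project: the function name `knn_infer_genotypes`, the `min_share` cutoff for calling a mouse uncertain, and the toy numbers are all made up here, and the toy example uses 5 neighbors rather than 30 because it only has 18 mice.

```python
from collections import Counter

def knn_infer_genotypes(expression, observed, k=30, min_share=0.7):
    """For each mouse, let the k mice with the most similar expression
    vote on its genotype. Returns None when there is no clear winner.
    (Hypothetical helper; k and min_share are illustrative choices.)"""
    n = len(expression)
    inferred = []
    for i in range(n):
        # Rank all other mice by how close their expression is to mouse i.
        others = sorted((j for j in range(n) if j != i),
                        key=lambda j: abs(expression[j] - expression[i]))
        # Neighbors vote with their *observed* genotypes; skip missing ones.
        votes = Counter(observed[j] for j in others[:k]
                        if observed[j] is not None)
        if not votes:
            inferred.append(None)
            continue
        genotype, count = votes.most_common(1)[0]
        share = count / sum(votes.values())
        inferred.append(genotype if share >= min_share else None)
    return inferred

# Toy data: three expression "blobs" (low, middle, high) for one gene.
expression = [0.10, 0.20, 0.15, 0.05, 0.12, 0.18,
              1.00, 1.10, 0.95, 1.05, 1.02, 0.98,
              2.00, 2.10, 1.90, 2.05, 1.95, 2.02]
observed = ["RR"] * 6 + ["BR"] * 6 + ["BB"] * 6
observed[2] = "BB"    # low expressor recorded as BB: in the wrong blob
observed[13] = "RR"   # high expressor recorded as RR: in the wrong blob

inferred = knn_infer_genotypes(expression, observed, k=5)
suspect = [i for i in range(len(observed)) if inferred[i] != observed[i]]
print(suspect)
```

Mice whose recorded genotype disagrees with the vote of their expression neighbors are the candidates for being in the wrong blob; mice that come back as `None` correspond to the ones that sit halfway between two blobs.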
It allows me to very quickly form these bands: the low expressors are all this genotype, the middle ones are all heterozygous, and the upper ones are all homozygous B, and then maybe some in purple that are sort of halfway between two blobs, and I don't know for sure who they are. And then I do this at a bunch of other sites in the genome, for a bunch of other genes whose expression is really tightly associated with genotype at their site. In some cases, I can use pairs of genes. We have two genes that are really close together, and I can look at a scatterplot of the expression of one gene against the expression of the other gene, colored by genotype at that region. It's pretty clear what the three different genotypes are, and if I highlight the set of my sex-swapped mice, some of them are fitting into where they're supposed to be, but more than half are in the wrong blob. So, again, I can use the expression of these two genes and figure out what I think each mouse's genotype at this site should be: those in this blob are really homozygous B, even though the genotype data says they're not, those in the middle blob I think are really heterozygous, and those in the upper blob I think are homozygous R. Here's another example of it. I have two genes that sit really close to each other. Their expression is highly associated with genotype at that location. The actual genotypes for a bunch of the mice don't really fit what I think they should be. I can use a K-nearest neighbor classifier: for each individual dot, find the 30 dots nearest to it, and let them vote to say what I think that genotype really is for each mouse. So, as a result, my basic scheme is that I focus on a set of genes whose expression is really tightly associated with genotype at the corresponding location.
So, I have this rectangle of the expression values for those genes that have really, really strong QTL, and the observed genotype data for all the mice at those locations. And I use each column, or maybe pairs of columns, to make a predictor: from the expression, what do I predict each mouse's genotype really is at those locations? And then I can ask, for a given mouse, does my observed genotype data actually match what I infer it to be? And if it doesn't match, can I find another row that does match? So, I find a bunch of genes whose expression is really highly associated with genotype at a given location, maybe a hundred of those across the genome. For each of them, I make a classifier that takes expression and tells me what I think everyone's genotype should be. And then I ask, does that actually match what my genotype data says they should be? And is there another mouse that it matches better? And so, for every DNA sample I have and every mRNA sample I have, I can calculate a distance between them and make a heat map. How similar is the set of observed genotypes to what I predict the set of genotypes to have been? For every pair of a DNA sample and an RNA sample, I calculate the proportion of mismatches. And I get something that's kind of hard to see, because we have limited pixels. If we blow up the first hundred, the lower left corner, you get a nice perfect blue diagonal line: for the first hundred mice, the RNA sample and the DNA sample are really similar. The set of genotypes I have and what I predicted them to have are really similar. And the off-diagonal ones are all not similar. So it looks like these are all good.
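As a rough illustration of that distance calculation, here is a minimal Python sketch. It assumes the observed genotypes (one row per DNA sample) and the genotypes inferred from expression (one row per RNA sample) are already available as lists of genotype codes, with `None` for missing values; the function name `mismatch_matrix` and the tiny three-mouse example are hypothetical.

```python
def mismatch_matrix(observed, inferred):
    """dist[i][j] = proportion of markers at which the observed genotype
    of DNA sample i differs from the genotype inferred from RNA sample j.
    Markers missing in either sample are skipped."""
    dist = []
    for obs_row in observed:
        row = []
        for inf_row in inferred:
            pairs = [(o, p) for o, p in zip(obs_row, inf_row)
                     if o is not None and p is not None]
            if pairs:
                row.append(sum(o != p for o, p in pairs) / len(pairs))
            else:
                row.append(float("nan"))  # nothing to compare
        dist.append(row)
    return dist

# Toy example: three mice, four markers.
inferred = [["BB", "BR", "RR", "BB"],   # from RNA of mouse 0
            ["RR", "RR", "BR", "BR"],   # from RNA of mouse 1
            ["BR", "BB", "BB", "RR"]]   # from RNA of mouse 2
observed = [inferred[0],                # DNA 0 is where it should be
            inferred[2],                # DNA 1 and DNA 2 have been swapped
            inferred[1]]

dist = mismatch_matrix(observed, inferred)
for row in dist:
    print(row)
```

In a clean data set the small values sit on the diagonal; here the swap shows up as small values at positions (1, 2) and (2, 1) instead.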
But if I move up a little bit, to mice 200 through 300, you find that there are some that are nice on the diagonal, but then a whole bunch that are off by one or off by two. And so I know that their DNA sample is wrong. If I take that distance matrix and make a histogram of the diagonal, each DNA sample against its own supposed mRNA sample, I get a bunch of mice that are really close, where their RNA sample and their DNA sample look the same, and a bunch of mice that look totally different. And if I look at a histogram of all the off-diagonal elements in the distance matrix, most of them are on average about 65% mismatches, with a small proportion that are really similar. So the idea is that these guys are all right, these are all wrong, and they're over there. So for each row in my distance matrix, if the minimum isn't on the diagonal, I want to figure out where the minimum is, and similarly for each column. So for each DNA sample, I plot the value I get on the diagonal against the minimum for the row. I get a bunch of points where the value on the diagonal, the distance from an mRNA sample to its own DNA sample, is the smallest in the row; those look to be correctly identified. I get a bunch of green points where it's clearly not: the value on the diagonal is big, but there's another value in the row that's small. So it's clearly wrong, but it's fixable. And then a small number of others where the value on the diagonal is not small, but there's nothing else in that row that's small either. For that mouse, I don't have both an RNA and a DNA sample. I have just one or the other.
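The three cases just described (correct, wrong but fixable, no match anywhere) can be written down as a small rule on each row of the distance matrix. The sketch below is illustrative only: the cutoff `close=0.1` for what counts as a small distance is an assumption picked for the toy numbers, and `classify_samples` is a made-up name.

```python
def classify_samples(dist, close=0.1):
    """Label each row of a square distance matrix by comparing its
    diagonal entry with the smallest entry in the row.
    Returns (label, index of the best match or None) for every row."""
    labels = []
    for i, row in enumerate(dist):
        best = min(range(len(row)), key=lambda j: row[j])
        if row[i] <= close:
            labels.append(("ok", i))            # matches its own sample
        elif row[best] <= close:
            labels.append(("fixable", best))    # matches some other sample
        else:
            labels.append(("not found", None))  # matches nothing
    return labels

# Toy distance matrix (proportion of mismatches); unrelated pairs sit
# around 0.65, as in the talk.
toy_dist = [[0.02, 0.66, 0.64],
            [0.65, 0.63, 0.03],
            [0.60, 0.62, 0.67]]
labels = classify_samples(toy_dist)
print(labels)
```

Running the same rule on the columns gives the matching view from the RNA side.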
And then I went back to the records. The company that did the genotyping wrote down where the DNAs were arrayed onto plates, because they do them in this 96-well format. And that allowed me to make that picture: for many of the samples that I know are wrong, I can find where the sample really ended up. And that's what the arrows are here. Let's blow up just one of them. There are some additional annotations. The white dots mean that the correct DNA was found in the right well. The pink circle over there at D2 means that that sample was duplicated. It's got a white dot in it, which means the correct DNA was put there, but it was also put in B3. That arrow means that a sample that was supposed to be in D2 actually got placed in B3. And the thing that was supposed to be in B3 got placed in B4. And the thing that was supposed to be in B4 got placed in E3. And the little green arrow means the thing that was supposed to be in E3, I don't know where it went. And the orange arrow means that something got placed in 7, but I don't know what it was. And the purple arrow means that the thing that was supposed to be there, I don't know where it went. This one puzzles me. I don't know how, maybe it's my lack. I'm not sure how this would happen. But the data are clear that this is what happened: a sample got duplicated, put in the right place but also put in another place, and then three all shifted in some weird way. But of the 500 or so samples, really 20% were put into the wrong well, and they were mostly on these two plates. And let's just look at one of them. So something that was supposed to be on that upper plate got put into A1 there.
The thing that was supposed to be in A1 got put below, in B1. And that pink circle means that it was replicated, because it was put in the next two wells. And then everything was off by two for a period of time. And then there was a sample where they somehow got back to being off by one. And they were off by one for quite a period of time. So it seems clear that whoever was working with these DNA samples was moving them, say from one set of plates to another, one well at a time. And I like to think they got a phone call, and they came back and did things a little differently, or maybe they realized they were making some mistakes and tried to correct for it. I should emphasize, though, that there are two reasons I can ascribe these errors to the genotype data being wrong and not the expression data, because it could be that they mixed up the expression data. One is that I have these six tissues' worth of expression data, and the six tissues correspond, and it's the genotype that's different. If something is off from one tissue, it's off from all of them. And the second is that the errors correspond in a way that has to do with how the DNAs were arranged in plates, that geometry of being off by one, off by two. So it suggests that it's the DNA that was put in the wrong places. But I can play the same game with the expression data. If I look at two different tissues of expression data, I have expression in islets in the pancreas, which is really important for diabetes, and expression in liver, which is also important for diabetes. The rows of these two data sets are supposed to correspond. So I could pick a mouse and plot its expression patterns.
The expression of all genes in islet against the expression of all genes in liver: do they actually correspond? What you end up with is just a splotch. Twenty-five thousand points, just blah. Most of the genes aren't expressed. They don't show any association at all. So you don't see anything. So that didn't work. But if I look at a given gene, there are some genes that are really highly associated between the two tissues, bunches of them. Some of them are highly associated between the two tissues because there's a common locus that's driving their expression: genotype at a given site is affecting expression of that gene in islets and expression of that gene in liver. So what I'm going to do is just focus on those genes. Look at every gene one at a time, ask which ones are highly associated between the two tissues, and just focus on those. And then pick a mouse and look at, say, those 150 genes, and you can see that there's clearly association between islet expression and liver expression within a mouse. Except when there's not: you can find some mice where their expression in islet and liver doesn't look associated at all. But, oh, the expression of that mouse in liver and the other mouse in islet are highly associated. And the same thing the other way around. So for this pair of mice, if I look at a matrix of scatter plots, this is the plot of islet versus liver in mouse 3598, and this is the plot of islet versus liver in mouse 3599. They're not associated at all. But islet in 3598 and liver in 3599 are really highly associated, and the other way around. So it looks like these two were just swapped completely. And then here's the case of 3280 islet and liver, and 3281 islet and liver. You can see that for 3280, islet and liver are highly associated.
Looks like the same mouse. Oh, this is a good one; this is an example of what it's supposed to look like. But here, 3295 islet and liver are highly associated. 3296 islet is not associated with 3296 liver. It's instead super highly associated with 3295 islet. So that looks like what I would call an unintentional technical replicate, where for mouse 3295 the islet mRNA was assayed twice, and 3296 islet was not really 3296. The strategy is: take two tissues, look for genes that are highly correlated between the two tissues, focus on those, and then for every mouse, focusing on these highly associated genes, make a distance matrix, or a correlation matrix. How similar are they? You want everything nice and perfect along the diagonal and not nice off the diagonal. Where it's not, you figure there's a problem. And there were quite a lot of problems. I mean, not anything like the genotype data, but more than you'd want to deal with. In hypothalamus, there were nine cases of sample swaps. In the liver, there was a sample swap and then one of these, with the arrows pointing from who it was supposed to be to who it was actually written down to be. So that guy was done twice and the other one not at all. In adipose, there was a three-way swap. And in kidney, I can't really tell, but I'm pretty sure that these two samples were maybe mixed and then assayed in duplicate as a mixture, because they're really highly associated with each other and not with the rest of the panel of tissues I have. But probably the most surprising thing about it is that 20% of my DNA samples were mislabeled, and we didn't notice it for two years, because we were getting lots of cool findings. The findings were strong. If you fix the mix-ups, the findings get even stronger.
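The two-tissue strategy described above (keep the genes that are highly correlated between the two tissues, then compare every mouse in one tissue with every mouse in the other) might be sketched as follows. Everything here is an assumption made for illustration: the simulated data, the 0.75 correlation cutoff, the per-gene standardization step, and the names `pearson`, `standardize_columns`, and `best_matches`. It is not the analysis code from the project.

```python
import random
from math import sqrt

def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def standardize_columns(matrix):
    """Center and scale each gene (column) across mice, so a gene's overall
    expression level does not dominate the within-mouse comparison."""
    scaled = []
    for col in zip(*matrix):
        mean = sum(col) / len(col)
        sd = sqrt(sum((v - mean) ** 2 for v in col) / (len(col) - 1))
        scaled.append([(v - mean) / sd for v in col])
    return [list(row) for row in zip(*scaled)]

def best_matches(tissue_a, tissue_b, min_r=0.75):
    """Select genes correlated between tissues, then, for each mouse in
    tissue_a, find the tissue_b row whose expression pattern is most similar."""
    a, b = standardize_columns(tissue_a), standardize_columns(tissue_b)
    genes = [g for g in range(len(a[0]))
             if pearson([r[g] for r in a], [r[g] for r in b]) >= min_r]
    matches = []
    for row_a in a:
        sims = [pearson([row_a[g] for g in genes], [row_b[g] for g in genes])
                for row_b in b]
        matches.append(max(range(len(b)), key=lambda j: sims[j]))
    return genes, matches

# Simulated data: 40 mice, 20 genes shared between tissues, 20 unrelated genes.
rng = random.Random(1)
islet, liver = [], []
for _ in range(40):
    shared = [rng.gauss(0, 1) for _ in range(20)]
    islet.append(shared + [rng.gauss(0, 1) for _ in range(20)])
    liver.append([s + rng.gauss(0, 0.1) for s in shared]
                 + [rng.gauss(0, 1) for _ in range(20)])
liver[3], liver[4] = liver[4], liver[3]   # simulate one sample swap

genes, matches = best_matches(islet, liver)
swapped = [i for i, j in enumerate(matches) if i != j]
print(swapped)
```

Rows whose best match is not themselves are the candidates for swaps or unintended replicates. Note that the gene selection is done on data that may already contain mix-ups, which works only as long as most rows do correspond.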
So this is the scan of the genome for loci that affect insulin level. In pink is what we originally had with the original data. And in blue is what we got after we corrected the 20% sample mix-ups. Everything increases quite a bit. So that's good. But it's kind of surprising that 20% sample mix-ups was not more terrible than it was. I showed you these pictures of the really strong QTL effects, where the expression of a gene is really strongly associated with genotype at its site. In dashed are the curves that I showed you before. After correcting the sample mix-ups, these become super strong associations. My test statistic is way increased.
But in summary: sample mix-ups happen, and my sample mix-ups are worse than your sample mix-ups. With this expression data and genotype data, we have the ability to not only identify them, but correct them. Having these multiple tissues really helps you to figure out the origin of the problem and not just know that there is a problem. I think the general idea has a lot of application in data science generally. If you have matrices whose rows are supposed to correspond, you check that they actually correspond. And if you have some columns that have strong signals, that should give you an ability to really figure out whether your matrices' rows are actually corresponding. There's a paper describing this, which took a long time because people don't want to... It was great that my collaborator was willing to lay it all out, because others might not be so willing. A bunch of other people came to the same idea four years before me. That was at the time that I was working this out.
The lessons I take from this are, first, don't trust anyone, even yourself. You decide you're going to sit down and write a paper, and someone else did the data cleaning. Don't trust them. Go back and double check everything. Or it could be...
Well, if it was you two years ago, it's worthwhile going back and looking at it a second time. It wasn't me, but don't trust anybody. Make lots of plots. One problem that we have is that in moving from low dimensional data to super high dimensional data, you'd think that you're going to be making lots and lots of pictures, but in practice you tend to make fewer pictures. I can't look at 30,000 histograms, so I'm just like, forget it, I'm not looking at any histograms. But you need to really figure out a way to look underneath the hood and see the data. You need to make a lot of plots. In this case, we were burned a bit by having made a choice to transform all the outcomes, because the level of expression of genes can have lots of outliers, points that are just wrong because of artifacts. So we chose to normalize each gene's expression, to just sort of cram it into a normal distribution before we did any analysis, just so that we weren't looking at artifacts all the time. So when we did look at pictures, we looked at pictures like this, where the vertical scale was crammed into a normal distribution. Everything looks a little weird, but looks okay. Actually, the raw data was this one that I showed you at the beginning, which had really distinct clusters, with individuals that didn't belong and that were in the wrong cluster. If I had seen this picture, I would have found this problem earlier. Because we were looking at this other picture, we went 10 years in this project... no, it was two years before we figured out this was a problem. Another lesson is follow up all aberrations. 500 animals, 26 of them have swapped sex. You could say, let's just throw those out. We've still got 475 animals left. But with every little hint that there's some oddity going on, you should follow it up and really figure out what's causing it.
Because it could be that there's another 200 of them behind. You might ask, if there are 200 sample mix-ups, why were there only 25 that had swapped sex? Well, it's because mice are kept with the same sex in a cage. Two males, two females, two females, two males, etc. They labeled them in that order and they arrayed them on the plates in that order. If your mistakes are off by one, it's very likely your mistakes are all same-sex mistakes, and you wouldn't find them by looking at sex swaps. Take your time with data cleaning. Ask your collaborators to give you time for the data cleaning: a week, a month, a year. If you have big rectangles whose rows are supposed to correspond, check that they actually correspond. I think the general technique of looking for columns that have high correlations, focusing on those, and seeing whether the rows correspond is worth doing in all cases. A whole bunch of people were involved in this project, because it's a big project, and there are others really responsible for the situation that allowed me to spend a bunch of time studying this and write a paper about it. I think they're not on this particular list. These slides you can find on the web. You can find on my website slides for every talk I've given since I was here in Berkeley in 97. You can also find code that I've written, and you can find me on Twitter. Thanks very much for having me.