Okay, so there's a small change in the title. I've corrected it so that it reads "Detection, Diagnosis, and Correction of Batch Effects in TCGA Data," because I hate to be the bearer of bad news. So here's the good news: we actually are able to correct some of the batch effects that we see. And the really good news is there aren't many batch effects to begin with.

So here's the simplified flow diagram for the TCGA data. I'm sure you saw Kenner's presentation earlier. This is a very simplified diagram of how the process works. We have these multiple tissue source sites, and they send their samples to the BCRs, which in turn extract the DNA and RNA and send them to the genome characterization centers and genome sequencing centers, which in turn produce the data that goes to the DCC, which goes to the TCGA website. So there are many potential points at which batch effects can be introduced. They can be introduced at the tissue source site level. They can be introduced at the BCR level, the GCC level, and so on.

Once we have the data, that's all we have: the data at the end of that pipeline. So the first step, the objective, is to detect or quantify any batch effects that we might see and to identify the sources of those batch effects if possible.

We use certain tools for the diagnosis of those batch effects. We at the MD Anderson GDAC have developed this tool called MBatch, which is an R package that's available from that URL that you see. It has multiple tools in it. For example, it has PCA-Plus, which is a novel algorithm that we developed. It's an enhancement of the traditional PCA plot. There's a Batch Core algorithm that we have developed, which is again a novel algorithm. There's hierarchical clustering, which you're all familiar with. There are clinical correlates with the clusters. We have box plots, and we have ANOVA and MANOVA. There's a disclaimer: obviously there's no substitute for human input. These are just tools.
They allow you to look at the data, but what to do with it, the final decision, rests with the user.

I'm not going to go through each one of those tools in great detail, but I'm going to show you a few screenshots so that you have an idea of what I'm talking about. If you look at the traditional PCA plot, this is what it looks like. I don't know about you, but I can't really tell anything interesting from such PCA plots. So we have these enhancements, and what you see now is a PCA-Plus plot. It's exactly the same plot. The points are the same, except we have connected each of the points to its batch centroid. Each of the solid points that you see here is actually a batch centroid, and from here you can see that one of the batches, batch 29 in this case, stands out from the rest. This is how you can identify that there's potentially something weird or something different about that one batch.

We've also introduced a metric called DSC, the dispersion separability criterion, that quantifies the differences between these batches. Anything that has a value close to zero is good. Between zero and, let's say, 0.3, it's not too bad. Between 0.3 and 0.5, you should be somewhat concerned. At 0.5 and above, you should be really concerned.

Here's the other algorithm. This is called Batch Core, and what it does is produce numbers between zero and one; the closer the number is to one, the better it is. Anything that's less than 0.5 or less than 0.7, you need to be a little bit concerned about. But we've also introduced p-values there. So you have not only the metric itself, but also p-values for the statistical associations. Again, in this case, you can see that batch 29 is somewhat problematic.

We have hierarchical clustering, which you're all familiar with. We've annotated the hierarchical clustering with tissue source site, batch ID, ship date, plate ID, and BCR, among other variables.
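
As a rough illustration of the DSC idea described above, here is a minimal Python sketch. The talk does not give the formula, so this assumes one plausible formulation: the ratio of between-batch dispersion (how far batch centroids sit from the overall centroid) to within-batch dispersion (how spread out samples are around their own batch centroid), with each batch weighted by its share of the samples. The function names `dsc` and `concern_level` are hypothetical and are not part of the MBatch package; only the thresholds in `concern_level` come from the talk.

```python
import numpy as np


def dsc(points, batches):
    """Between-batch over within-batch dispersion (assumed formulation).

    points  : (n_samples, n_features) array, e.g. PCA coordinates
    batches : length n_samples sequence of batch labels
    """
    points = np.asarray(points, dtype=float)
    batches = np.asarray(batches)
    n_samples = len(points)
    overall_centroid = points.mean(axis=0)

    between = 0.0  # weighted squared distance of batch centroids to overall centroid
    within = 0.0   # weighted mean squared distance of samples to their batch centroid
    for label in np.unique(batches):
        members = points[batches == label]
        weight = len(members) / n_samples
        centroid = members.mean(axis=0)
        between += weight * np.sum((centroid - overall_centroid) ** 2)
        within += weight * np.mean(np.sum((members - centroid) ** 2, axis=1))

    if within == 0.0:
        raise ValueError("within-batch dispersion is zero; ratio is undefined")
    return float(np.sqrt(between) / np.sqrt(within))


def concern_level(value):
    """Map a DSC value to the rough guidance given in the talk."""
    if value >= 0.5:
        return "really concerned"
    if value >= 0.3:
        return "somewhat concerned"
    return "not too bad"
```

Under this assumed definition, batches whose centroids coincide give a value of zero, and the value grows as the centroids move apart relative to the spread inside each batch, which matches the "close to zero is good" reading in the talk.
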
So these are some of the variables that we annotate by.

And then we have the clinical correlates. For example, in this case, we have the colorectal data. If you look at the histological type (I know you can't read it, but that's histological type), you get excited and you say, well, we get a separate cluster for histological type. And we also see that the colon data and the rectal data are separated out. So maybe the microRNA expression profiles of the rectal samples are different from colon. But then we see that the rectal samples were processed in a completely different batch. So they are confounded with that batch, and we can't say for sure whether any differences are because of batch or because of biology. So there is some confounding that you will occasionally see.

Now, with a project as large as TCGA, we know that we have some of the best scientists producing this data. But it's such a complicated project that it's inevitable that sometimes you might end up with a few batch effects here and there. But like I said earlier, the good news is that we've analyzed most of the data sets and there are very few batch effects to begin with. And when there are batch effects, we hope we're catching them in time.

Here's another example. We have these box plots. In the box plots, the data is plotted by batch medians or by batch means. And you can see over here that batch 29 has low expression. So that's a suspect batch.

Then we have the ANOVA and MANOVA tools, and these actually quantify the amount of variance that can be explained by any one variable: the amount of variance that can be explained by batch ID, or by TSS, or by BCR, or even by subtype, which is a good variable. We want the variance explained by subtype to be maximized. So it shows you all of those variables in a neat Venn diagram.

One caveat about batch 29, which I keep talking about: it's actually confounded with subtype.
So it's actually one unique subtype of colorectal data, which is why it may very well be biology. The differences might be due to biology rather than technical effects. But because it's confounded, we can't be sure one way or the other.

Okay, so if you do see a batch effect, what do you do? How do you correct for that? Of course, you want to correct the source of the problem whenever possible, right? That's pretty obvious. But when that's not possible, or the source is unknown, then different bioinformatics algorithms can be used. For example, some algorithms that are in the MBatch package that we are offering are ComBat, which is empirical Bayes, ANOVA, and median polish. And there are two flavors of each, so a total of six algorithms. At this particular URL that I've highlighted over here, the package is up for download.

So now let me move on to an interesting story that came about in the kidney DNA methylation data. This story sort of evolved, and it's meant to give you an idea of how we use the tools in order to narrow down what the problem is. We worked with USC in order to narrow down this particular problem.

When we first saw the kidney DNA methylation data, we saw a dichotomy, and that was weird. So we drilled down into it, and we found that the dichotomy was due to male and female patients. So the next thing we did was remove the sex chromosomes. After removing the sex chromosomes, the same figure looked like that. Again we see a dichotomy, but the new dichotomy corresponds perfectly with batch ID. So then we drilled down into the data. We collaborated with USC, and USC informed us that they'd done some control experiments and so on, and they identified a few probes that were suspect. Now, when I say a few, I mean really few. We're talking about 27,000 probes in this data set. Out of those, 150 probes are responsible for the batch effects that you see over there: only 150 out of 27,000.
So we removed those 150 suspect probes, and then this is the figure that we get. Well, it's improved, but I put my hand to my head again because I saw another dichotomy, and I'm like, okay, this is the third time that we're seeing a dichotomy. What's going on? Lo and behold, these are chromophobe samples. It turns out that these samples were chromophobes that were not identified as chromophobes by the clinician. And there's a long story behind that. I'm sure you guys know that there were some chromophobe samples in kidney that were eventually removed, so that we have pure renal cell carcinomas. But this just shows you: you have a dichotomy, you've got to drill down. You can't just trust the tools and say, okay, the tool is saying there's a batch effect, so I should take it for granted.

Finally, I'm going to talk a little bit about the TCGA MBatch website that we've put up. That's the URL for the website; it's the same URL that you saw earlier. I'm going to switch to a live demo and keep my fingers crossed.

So when you go to the URL, this is what you see. This is the documentation page, and you're going to see two green buttons there. One lets you download the MBatch package. If you do that, then this is the page that you end up with. We've got a Linux version and a Windows version, and we've got documentation for the batch effects package. The alternative is to launch the TCGA Batch Effects website directly. If you launch the TCGA Batch Effects website, this is what you end up with. On the left-hand side, you can select different parameters: what disease you're interested in, what data type, what center, and what data level (right now we just offer level three), and then whether you want the original data or corrected data. I'm going to talk about that in a little bit. But here we have the assessment algorithms.
As of right now we have hierarchical clustering and PCA to choose from, and you choose which variable you want to categorize by: batch ID, tissue source site, or whatever other variable you want. And then you get this particular PCA plot.

Now, the cool thing about this plot is you can mouse over it. So if I clear this window right here and I say, well, let me see what that particular outlier sample is, I just mouse over it and I see all the information regarding that particular sample in the window here, the data point log. And I can mouse over multiple points and later select all, copy, and paste into my favorite text editor or whatever program I want, and drill down into what those samples represent. The DSC values are also given over here for your convenience. I can also zoom into the plot, and I can pan around and see whatever area of the plot I'm interested in. So zooming, panning, and mouse-over capabilities are included, among other features.

What is also included is this: if you want to see the compendium of TCGA data and see where the batch effects are in TCGA data, or which data sets are likely to have batch effects in them, then you can click on this button, which says Algorithm Specific Scores. When you do that, you end up with this DSC table. Over here you have the name of the disease, the disease type, the data type, and so on: all the information regarding a particular data set. And at the end of that, if you scroll to your right, you will see the DSC value. You can sort by any one of these columns. In this case, I've sorted by DSC in descending order, so I'm going to see the highest DSC values first. So when you see a high DSC value, you can say, okay, that's the data set that is showing me the highest DSC value, and you go to that particular data set. For example, one of the data sets in which we saw a relatively high DSC value was this data set.
This is prostate cancer data. We see two batches, and they're quite different from each other. When they're different from each other, what we can do, once we've exhausted all of the options for getting the data corrected at the source, is go ahead and download the corrected data. So here we have corrections by many different algorithms, like I mentioned. This one is empirical Bayes. This is before the data was corrected; this is the original data as you would download it from the DCC. And this is the data after it's been corrected by empirical Bayes. You can clearly see that the centroids match up.

And you can actually download this data. The way you get to that data is simple. You just go to the data set. Instead of "original," you say "corrected by batch ID." That'll update automatically, and you're going to see a figure that has been corrected. If you say, okay, I want to download this particular data now that it's been corrected, you go to related documents. And there you have this file that is about 800 megabytes. So I'm not going to download it right now, obviously, but you can download that and you will get your corrected data.

But again, a word of caution and a disclaimer: only download corrected data if you need to. If you don't see a batch effect, don't download corrected data, because the side effect of any correction algorithm is that some of the biology may, quote unquote, be "corrected." In other words, some of the variance that you want to see in the biology might also be corrected for.

So that's a basic demo of the website. And like I said, that's the URL for the website. I'm just going to skip over this slide, as I just showed that to you.

And I would like to conclude with acknowledgments. This is our group at MD Anderson. And we worked closely with the in silico people. And that's my email address right there. If you have any questions, please feel free to email me. I also have a poster, poster number 50. And here's the URL once again.
Thank you.

Rehan, thank you.

Hello, Mehmet, based on from NCI. Normalization has an effect on the batch effects. So the data that you showed, is it the raw data or the normalized data?

It's data at level three, which by definition has been normalized to a certain extent, depending on which data type we're talking about. But yes, it's level three data that I just showed you. And the MBatch website plans to do regular runs of the data, so the data is automatically downloaded from the DCC and shown to you as is.

Okay. Have you made any comparison with the raw data to see what part of the batch effect is reduced by normalization?

So that's data at level one and level two. And to be honest, we have not looked at that, primarily because of the size of the data that's involved. At level three, you have a considerably reduced size, and even then it takes us two to three weeks to get a run done. So for level one, which could be 10 to 100 times larger, it's just not practical at this point. But if you're interested in specific data sets, then you can certainly download the MBatch package, run it yourself at your end, and look for batch effects at level one or level two.

Thank you, Rehan.