Thank you very much for the introduction, and thank you to the organizers for giving me the opportunity to present here. What I'm going to talk about today is a joint project with Alexander Lex from Graz University of Technology in Austria, who was a visiting student in Peter Park's lab this summer. The project I'm presenting here is basically the result of his visit. It's a data visualization tool that we hope to give to the community very soon, and it should help you work with TCGA data, in particular to identify and characterize tumor subtypes.
A couple of other people were involved. Mark, here at the top of the list, was one of the main coders of this software. A couple of other people from Graz had some involvement. People at the Broad and at Dana-Farber gave input on this tool, and none of this would have been possible without the work of the Broad TCGA GDAC team led by Gaddy Getz and Lynda Chin. The project was funded by NCI and a couple of Austrian research funding agencies.
Okay, so I thought I was going to start this presentation with a question: why do we want to identify tumor subtypes? But it turns out that this question has been sufficiently answered during this wonderful symposium. We know this is very important. It has clinical implications. You can predict the prognosis, and so on and so forth. In the end this is really something that will help patients, because we'll be able to treat them better.
One classic example of subtype identification comes out of TCGA: the glioblastoma example with the four expression-based subtypes that we can see here at the top. This figure from this paper was what inspired the data visualization tool that we developed. So what you see here is the four subtypes are... Does this work? Four subtypes are shown at the top here. Okay, we'll just use the mouse. Can you see that? No. Anyway, subtypes at the top.
Is this better? All right. Yeah, subtypes at the top here. And then we have a bunch of other data types and other information that go along with the subtype information, which is sort of the characterization of these different subtypes. In this case, every column of this matrix represents a patient. We can now just turn this over on its side, and again we have a list of patients here that are stacked up.
Then we have these different groupings on the data. We potentially have other data types, or just the same data but with other groupings. For example, if we have micro... sorry, mRNA data, we can cluster that and we get a set of clusters. Nice. If we have copy number data, say the copy number status of a particular gene, we can group our patients based on that information. If we have mutation data, we can do the same thing.
Now, since we are very interested in the actual groups, we're not going to break them apart and put all this in a matrix. Instead, we're going to show the relationships between these groups using bands between these columns of data. The width of a band simply represents the number of patients who are in both groups, for example in cluster number two in the mRNA expression data and in the copy number group where the gene is amplified. We do this for all pairs of groups between two columns, and then we do this between all adjacent columns. So this is the main idea behind this visualization method.
The nice thing was that Alex had already developed a visualization method that did something very similar. This method is called VisBricks. It was just published last month at the VisWeek conference. It is implemented in the Caleydo visualization framework, which already has a bunch of different visualization methods built in. All right, so I'm just going to show you how it works.
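The band idea described above can be illustrated with a small sketch. This is not code from the tool itself; the function name `band_widths` and the toy patient data are made up for illustration. It only shows the counting step: for two groupings of the same patients, the width of each band is the number of patients who fall into both groups.

```python
from collections import Counter


def band_widths(grouping_a, grouping_b):
    """Count the patients shared by each pair of groups in two columns.

    Both arguments map a patient ID to a group label. Patients missing
    from either grouping are ignored.
    """
    shared_patients = grouping_a.keys() & grouping_b.keys()
    return Counter((grouping_a[p], grouping_b[p]) for p in shared_patients)


# Hypothetical toy data: an mRNA clustering and a copy number status per patient.
mrna_clusters = {"P1": "cluster1", "P2": "cluster2", "P3": "cluster2",
                 "P4": "cluster2", "P5": "cluster1"}
egfr_copy_number = {"P1": "normal", "P2": "amplified", "P3": "amplified",
                    "P4": "normal", "P5": "normal"}

bands = band_widths(mrna_clusters, egfr_copy_number)
for (left_group, right_group), width in sorted(bands.items()):
    print(f"{left_group} -> {right_group}: {width} patients")
```

A pair of groups with no patients in common simply has no entry, which corresponds to no band being drawn between them.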
Before I start, I just want to say a word about the data that we've been using. This is all GBM data from TCGA. We took the data set from July of this year. It's gene-level data, and we had about 500 samples for the major data types: mRNA, microRNA, and copy number. The analyses that we ran on this data set were consensus NMF clustering from the GenePattern analysis pipeline, and GISTIC2.
So, okay, how does it work? Here you see a set of clusters. We have five clusters in our data. At the very top there's a summary, a histogram of the expression values in our mRNA expression data set. Beneath that you see the five clusters. We not only show the groups, we also show the heat maps, because this is what gave rise to this grouping, so this information is of course very important to see.
Next, we just want to drag in another grouping. It's the same data, but only four clusters in this case. You see the bands right away between these two groupings. You see pretty thick bands, primarily, but also some thinner bands. This is to be expected, I guess, when you cluster. What we've added, though, is a little slider that you can use to control which bands you see. So these are the major trends, and that's a pretty clear picture. And here are some outliers, samples that move between the clusters in smaller groups.
All right, so we can turn this off again, and let's drag in another grouping. In this case we're going to use three clusters. Once they're in, we see a fairly interesting pattern of bands going around here. I should add that we're not optimizing the edge crossings right now. But if you're interested in how the patients travel together between these different groupings, if you want to go across more than two columns, we can actually select these clusters.
And this information will be highlighted. So we select the bottom left cluster here. And we see, very interestingly in this case, that when we cluster with three clusters, all these patients are together in one cluster. With four, they're split into two clusters. And then they come together again in a single cluster. I haven't investigated why this is. It's probably some artifact of the clustering algorithm. But nonetheless, it makes a nice picture.
All right. Now I want to show you how we actually configure these visualizations. We have a lot of data. You see here at the bottom that we have five different data types in here right now. The green stuff is mRNA data. And at the top there is a little box which represents the VisBricks visualization that I just showed you.
Now I want to show you how we add more data. Let's add some copy number data. We have three genes here. So I open up the copy number matrix, and now I'll just grab one of these purple spots down there and drag it up to the visualization view. It takes a little bit. It's a drag, actually. Once we've done that, we've got a connection between the data and the visualization, so we're always in the picture about what's going on with our visualizations. And we can do this for the other two genes that we've selected. In this case, David actually just talked about these: EGFR and CDKN2A and B.
So let's look at the visualization again. Now they're here on the side, and we'll just drag them in like the clusterings before. So EGFR... Sorry, where are we here? So we've added everything, and because it's gotten a bit dense, I'm moving out some of these clusterings, and we'll just keep the four mRNA clusters together with the copy number data for the three genes. The column with all the red, that's EGFR, highly amplified in most patients, and the other two columns are CDKN2A and B.
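The cluster selection shown a moment ago, where the patients of one cluster are followed across several groupings, can be sketched in a few lines. This is an illustrative sketch, not the tool's implementation: `trace_selection`, the grouping names, and the six toy patients are all hypothetical. The toy data is arranged to mimic the pattern seen in the demo (together with three clusters, split with four, together again with five).

```python
from collections import Counter


def trace_selection(selected_patients, groupings):
    """Report how a set of selected patients is distributed in each grouping.

    `groupings` maps a grouping name to a dict of patient ID -> group label.
    The result maps each grouping name to a dict of group label -> count.
    """
    distribution = {}
    for name, assignment in groupings.items():
        counts = Counter(assignment[p] for p in selected_patients if p in assignment)
        distribution[name] = dict(counts)
    return distribution


# Hypothetical clusterings of the same six patients with k = 3, 4 and 5.
groupings = {
    "k=3": {"P1": "c3", "P2": "c3", "P3": "c3", "P4": "c3", "P5": "c1", "P6": "c2"},
    "k=4": {"P1": "c3", "P2": "c3", "P3": "c4", "P4": "c4", "P5": "c1", "P6": "c2"},
    "k=5": {"P1": "c5", "P2": "c5", "P3": "c5", "P4": "c5", "P5": "c1", "P6": "c2"},
}

# Select every patient in cluster "c5" of the five-cluster grouping.
selected = {p for p, group in groupings["k=5"].items() if group == "c5"}
trace = trace_selection(selected, groupings)
for name, counts in trace.items():
    print(name, counts)
```

In the visualization, each of these per-grouping counts corresponds to the highlighted portion of a cluster in that column.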
And you see that, indeed, CDKN2A and B usually go together if they go. This is very clear in this pattern. There's this tiny little crossing that you can see, which goes from the normal state, the white one, to the heterozygous... to the homozygous deletion, but that's about it. Otherwise they're always together. All right. Now, of course... What's happening?
Okay, so now I want to show you what happens if we move columns around. As I pointed out earlier, we can basically only show binary relationships between two columns, and that's why we made this interactive: you can just grab one of these columns, move it to any other position in your visualization, and then directly see the relationships between the two neighboring columns. So I've just dragged the mRNA data over from the left to the right, and now we can see the relationship between CDKN2A and the mRNA data.
All right, so all very nice. Let's throw in some more data. Let's add some microRNA data and some methylation data. Red is methylation, blue is microRNA. If we go back to our visualization, we again get these two little things that we can drag into the view, and what you'll see now is that the histograms are quite different for the methylation and the mRNA data. We did this on purpose just to show it. If you, by accident or not, grab a data set that hasn't been normalized in the same way, you'll see it right away.
Okay, so great. Now we can add all these different data types, and we can try to find these groupings. We can certainly add more data types. There's plenty that we can do. I already mentioned mutation data. We can add clinical data, which provides categories or groupings of patients. We can add external classifications, say from a published data set. We could use all sorts of multivariate data, and we have these heat maps that could show that right away.
And if you want to do some more debugging kind of work, we can look at batch information, right? If you find a high correlation between your mRNA clusters and your batches, then you've probably found a problem.
All right, this is all nice for the grouping. Now we want to drill down a little bit, and we can do this with this tool as well. Here's a slightly smaller data set, about 50 genes in this case. Earlier we looked at 1,500. If you're interested, for instance, in which samples are in one of these clusters, we just click on a little button and it opens up a heat map, which is fully interactive, and actually, yeah, it's playing already. We can select things in there. This heat map view is also linked to another, more detailed heat map view which is built into Caleydo. If you had two screens, you could see this kind of information. This heat map has a lot more detail: a complete overview of all the data on the very left, then slightly more detail, and then the complete detail view on the right.
All right, so let's close this here. Now I want to show you how we can bring in other data types. Let's add some pathways to this. Where are we? All right, so now you'll actually see this is a real tool. We can just click on there, and then there's a dialog that lets you select a pathway from a list. In this case, we're going to select the Wnt signaling pathway, and we'll add this to the view. So here we go. They're over there on the side, and we drag them in. Now, what's important to point out is that we selected it for the right-hand column. So the data that will be mapped onto these pathways, the expression data, is coming from this right-hand column. On the left, this is still showing us the connections between the three clusters and the four clusters. Okay, so this is pretty small and hard to see. You probably can't see anything.
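The batch check mentioned earlier (a strong association between mRNA clusters and batches suggests a problem) can also be done numerically. The sketch below is an assumption on my part, not something the tool is stated to compute: it uses Cramér's V, a standard association measure for two categorical variables, and the function name and toy labels are hypothetical.

```python
from collections import Counter
from math import sqrt


def cramers_v(labels_a, labels_b):
    """Cramér's V for two equally long sequences of categorical labels.

    Returns a value between 0 (no association) and 1 (perfect association).
    """
    n = len(labels_a)
    joint = Counter(zip(labels_a, labels_b))
    count_a = Counter(labels_a)
    count_b = Counter(labels_b)

    # Pearson chi-squared statistic over the full contingency table.
    chi2 = 0.0
    for a in count_a:
        for b in count_b:
            expected = count_a[a] * count_b[b] / n
            observed = joint[(a, b)]
            chi2 += (observed - expected) ** 2 / expected

    k = min(len(count_a), len(count_b)) - 1
    return sqrt(chi2 / (n * k)) if k > 0 else 0.0


# Hypothetical example: clusters that coincide exactly with batches,
# and clusters that are spread evenly over batches.
clusters = ["c1", "c1", "c2", "c2"]
confounded_batches = ["A", "A", "B", "B"]
balanced_batches = ["A", "B", "A", "B"]

print(cramers_v(clusters, confounded_batches))
print(cramers_v(clusters, balanced_batches))
```

A value near 1 for clusters against batches would be the numerical counterpart of seeing each cluster connected to a single batch by one thick band.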
So let's just get a detailed view of one of these clusters. When we open this up, you can now see, hopefully, that there are indeed some data mapped onto this pathway. We don't have data for all the genes in here, but for a few. Because we have these four instances of this pathway over on the left, we can switch back and forth and start comparing what the expression data looks like within the context of this pathway across the different groupings that we've identified.
All right, so let's close this again. Of course, you're usually interested in more than a single pathway, so what we can do is load more than one set of pathways. Now we have the KEGG cancer pathways as well as AKT signaling from BioCarta. Again, we can do the same spiel here. We can open them, we can compare them. What you notice here is that the gene expression column moved out of the picture to make some room for this view. We can open multiple pathways from different columns and start comparing, and so on and so forth.
All right, so now the big question: how did we actually implement this? How did we make it? Well, why would this tool make it easier for you to work with TCGA data? Well, Firehose to the rescue. You've heard a lot about Firehose here, and maybe now you also know why Apple put this wonderful flaming animation into Keynote. Firehose, as you know, grabs a lot of data from the TCGA Data Coordinating Center and runs all this data through numerous algorithms, for instance clustering algorithms, GISTIC copy number analysis, MutSig mutation analysis, et cetera. It generates a large number of results, which are pretty hard to interpret if you have nothing else. That's why we've added these Nozzle reports, and we talked about them, I think, at a similar meeting earlier this year. Nozzle reports are basically just reports on a single pipeline. There's no way to compare across pipelines, for instance.
Well, and then we take all this and send it back. But since we're sitting at the source at the Broad, we can just grab these results, and actually anyone who has access to the DCC could do this. We grab the results, and we've written an importer tool that converts Firehose output directly into input for Caleydo. And there we go. We can now basically pre-generate a data set with all the results from, say, one tumor type from one Firehose run. We could put that on a web page. Caleydo is a Java application, so we could just Web Start it pre-loaded with the data, and you can start exploring the data set.
I should add one more thing. Given that there are so many results in a Firehose run, we can actually use some of that output, for instance the correlation analyses that we already run, such as the correlation between the mutation status of a gene and, say, a gene expression cluster. We could load that into Caleydo too. We're not doing it yet, but we will, to basically guide users when they explore this data set. They'll be able to look at genes that are already correlated with the gene expression data. And finally, one could take all these groups that we find using Caleydo, export them, and feed them back into Firehose to maybe run these analyses on the subsets that were identified. But that's definitely the future.
I'd like to thank you for your attention. And I want to point out that the tool exists. There's a website for Caleydo. If you go there, you won't be able to download this tool yet. But we've set up a little page, at that URL up there in the corner, where you can leave your name and email address, and we'll send you an email as soon as the software becomes available. Thank you.
Any questions for Nils? Okay, let's thank Nils again. Oh, we've got one here.
Okay. So if I look at the download page for the Web Start, it says Windows and Linux. Can you give us an OS X version?
Yes.
I'm actually, I'm running... So it's not available right now, but it will run on OS X. Specifically for this project, Caleydo was converted and now also runs on OS X.
Great question. Thanks, Nils. All right.