And Bahram will talk about the TCGA computational histopathology pipeline.

Well, thank you very much for the invitation. This is a joint collaboration between Hang Chang, Ju Han, Kemal Bilgin, Jerry Fontenay, Sandy Borowsky, who is a pathologist, Joe Gray, Paul Spellman, and myself, Bahram Parvin. So, as you know, the TCGA system is outputting a number of tissue sections as well, both frozen and paraffin-embedded. These tissue sections are often quite large, maybe about 40,000 by 40,000 pixels, up to 100K by 100K. After they are downloaded from the NIH repository, because they are so large, they are broken up into blocks of 1K by 1K pixels, and every block is then analyzed for its nuclear characteristics. So in this case, for example, a pinhole of this tissue section could be somewhat heterogeneous, or it could be homogeneous. Having computed the nuclear morphology and organization, we then compute some kind of a distribution function, which is put into a database system and normalized across tissue sections. Having this information, we proceed to do morphometric subtyping. And of course, if I can do the morphometric subtyping, I can compute the survival functions for each subtype. The next step is to link this information with the molecular data, which is transcriptomic in this case.

So let me give you a specific example. We start with glioblastoma. There are roughly 146 patients and 380 tissue sections. It takes about a week of computing time to process these images on a cluster of several hundred nodes. The challenges here are the technical and biological variation, because every tissue section has been prepared by a different laboratory, and the very large data sets. So in order to address these issues, we have developed a new algorithm for characterizing tissue sections.
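The download-and-tile step can be sketched as follows. This is a minimal illustration in Python/NumPy with a small toy array standing in for a whole-slide section; the actual slide formats, I/O, and cluster distribution code are not shown here.

```python
import numpy as np

def tile_image(image, block=1024):
    """Split a large section image into block x block tiles.

    Edge tiles may be smaller when the dimensions are not multiples
    of the block size. Returns (top-left corner, tile) pairs so each
    block can be analyzed independently and mapped back later.
    """
    h, w = image.shape[:2]
    tiles = []
    for y in range(0, h, block):
        for x in range(0, w, block):
            tiles.append(((y, x), image[y:y + block, x:x + block]))
    return tiles

# Toy stand-in for a section (real TCGA slides run 40K x 40K and up).
section = np.zeros((2048, 3072), dtype=np.uint8)
tiles = tile_image(section)
print(len(tiles))  # 2 * 3 = 6 tiles
```

Because every 1K-by-1K block is processed independently, the per-block nuclear analysis parallelizes trivially across cluster nodes.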
Actually, it is a collection of algorithms that characterizes tissue sections, and of course they need to be efficient to address the issue of the large data sets. Having done that, we compute a number of features and meta-features from the data, and then proceed with feature selection or dimensionality reduction, and then with the molecular association.

So in order to characterize the tissue sections on a block-by-block basis, what we learned is that, because of the technical and biological variation, you need to build a database of dictionaries. There are a number of reference images, about twenty, that have been hand-segmented by undergraduate students. As a result, we have a number of filter responses for the foreground and background models. What do these reference images look like? They could have a highly pink background; they could have a light pink background; sometimes the background and foreground have similar intensity; or the foreground could be dark and the background could be blue. So as you see, there is quite a bit of heterogeneity. What happens is that these LoG responses computed for the foreground and background models are then represented as a Gaussian mixture model. When a new image comes in, it is first normalized against all the reference images, and then a global probability model is built based on the Gaussian mixture model, the global fitness term over here. The other thing we learned is that even within a 1K by 1K block there is quite a bit of heterogeneity, so we add this local model as well. And whatever labeling takes place also needs to be spatially smooth, so we add a geodesic constraint. The net result is a cost function, which is optimized using the graph-cut formalism.
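The global probability term can be sketched as follows. This is a minimal illustration assuming scikit-learn and synthetic one-dimensional stand-ins for the filter responses; the actual component counts, the reference-image normalization, and the local and smoothness terms of the cost function are not reproduced here.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Hypothetical stand-ins for filter responses collected from the
# hand-segmented reference images: nuclei (foreground) vs. background.
fg = rng.normal(-2.0, 0.5, size=(500, 1))
bg = rng.normal(0.5, 0.7, size=(500, 1))

# One mixture model per class; the ratio of their likelihoods gives a
# global (pixel-wise) probability term for the segmentation cost function.
gmm_fg = GaussianMixture(n_components=2, random_state=0).fit(fg)
gmm_bg = GaussianMixture(n_components=2, random_state=0).fit(bg)

def p_foreground(x):
    """Posterior probability of foreground under equal class priors."""
    lf = np.exp(gmm_fg.score_samples(x))
    lb = np.exp(gmm_bg.score_samples(x))
    return lf / (lf + lb)

probs = p_foreground(np.array([[-2.0], [0.5]]))
print(probs)  # high foreground probability at -2.0, low at 0.5
```

In the full cost function this term would be combined with the local statistics and the geodesic smoothness constraint before graph-cut optimization.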
And the end result is validated through geometric reasoning. This is showing an example of the variation within a block; as you see, the background over here is more pink than down here. And this method of seed detection provides the necessary information to establish the local statistics for the local probability map. So let's look at some results. Again, you see a collection of dark and less-dark nuclei over here, and the segmentation result. Here is a far more complex example.

So now let's move on to the representation, now that we can delineate the nuclear features. Again, some information is computed from the block-by-block data. These are structural features, where the nuclear formation and the nuclei are represented as a graph, and a number of features are computed from the graph. Simultaneously, we make measurements on a cell-by-cell basis, and you end up with a multi-dimensional density function. Having this information, we then normalize it across the tissue sections.

So let's look at the morphometric data now. It appears that if I go through the GBM data, you end up with four different subtypes, strictly based on morphometric features. And having this bit of information, now I can go to the gene expression data and identify a group of genes that best describes each morphometric subtype. This is done through sparse coding, or L1 regression, and multivariate analysis. The edges between these dots basically indicate how closely those two genes predict, or infer, or classify a subtype. So let's do a sanity check. This is across all the patients in the GBM data. And then... I lost my pointer, okay. As you see, the sanity check shows that there are basically four different distributions in terms of cellularity and nuclear area.
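The gene-signature step (sparse coding / L1 regression) can be illustrated with an L1-penalized logistic regression on synthetic data. This sketch assumes scikit-learn; the expression matrix, the planted five-gene signature, and the subtype labels below are made up for illustration and are not the actual TCGA analysis.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
n_patients, n_genes = 120, 200
X = rng.normal(size=(n_patients, n_genes))      # toy expression matrix
# Hypothetical ground truth: genes 0-4 drive membership in one subtype.
y = (X[:, :5].sum(axis=1) > 0).astype(int)      # 1 = this morphometric subtype

# One-vs-rest L1 (sparse) model: the nonzero coefficients form the
# candidate gene signature for the subtype.
model = LogisticRegression(penalty="l1", solver="liblinear", C=0.5).fit(X, y)
signature = np.flatnonzero(model.coef_[0])
print(signature)  # indices of genes retained by the L1 penalty
```

In practice this would be repeated per subtype, with the selected gene sets then fed into pathway or subnetwork enrichment analysis.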
And one of these subtypes actually shows an improved survival rate with more aggressive therapy. Now, I need to point out that there are not that many patients in the GBM data set who received less aggressive therapy, and that is why this information could be a little bit noisy. And as I said before, we can do the molecular association: for each subtype, you obtain a set of genes that you subject to pathway or subnetwork enrichment analysis. Among these, STAT signaling, as you see over here, is highly enriched, which is one of the pathways necessary for the maintenance and activity of GBM; it also inhibits apoptosis.

So the next question we can ask is: since we are making measurements on a cell-by-cell basis, what can we tell about the statistical composition? What we do is that, instead of doing the morphometric analysis on patient-by-patient information, we break the tissue section into blocks. So now we are independent of the patient information, and we do the subtyping based on the blocks. This way we can measure the composition. Here is the sanity check: there are, again, four different subtypes across all the GBM data sets. So now, for a given patient, I can have a highly homogeneous population or a highly heterogeneous population, and based on that I can compute a heterogeneity index. Then I can plot that information into multiple quadrants and identify patients that belong to one subtype versus the others. From this plot, you can see that cellularity and heterogeneity are anti-correlated, which all makes sense. So what are the four subtypes? Something which is necrotic, and then low, middle, and high cellularity. And this is the survival analysis: for these two cases, we have a semi-decent p-value.
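The talk does not spell out the exact definition of the heterogeneity index. One natural formulation, sketched here purely as an assumption, is the Shannon entropy of a patient's block-level subtype composition:

```python
import numpy as np

def heterogeneity_index(block_subtypes, n_subtypes=4):
    """Shannon entropy of a patient's block-level subtype composition.

    0 means perfectly homogeneous (every block the same subtype);
    log(n_subtypes) means maximally heterogeneous. This definition is
    an assumption, not the paper's published index.
    """
    counts = np.bincount(block_subtypes, minlength=n_subtypes)
    p = counts / counts.sum()
    p = p[p > 0]                      # drop empty subtypes (0 * log 0 = 0)
    return float(-(p * np.log(p)).sum())

homogeneous = heterogeneity_index(np.array([2, 2, 2, 2, 2]))
mixed = heterogeneity_index(np.array([0, 1, 2, 3, 0, 1, 2, 3]))
print(homogeneous, mixed)  # 0.0 and log(4) ≈ 1.386
```

Any index of this shape would reproduce the behavior described: a patient dominated by one block subtype scores low, and an even mix of all four scores high.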
Again, bear in mind that there are not that many patients who received the less aggressive therapy, so keep this in mind for the next slide. The other way we can slice the data is to plot it for cellularity and, say, nuclear size, and essentially divide the population into four pieces. And this one pops out, which basically says that if the tumor is highly cellular with small nuclei, that population of patients is very responsive to more aggressive therapy. One possible interpretation would be that high cellularity and low heterogeneity mean highly proliferative cells, which therefore respond better to cell-cycle inhibitors.

So in conclusion, there are many ways to slice and dice the data. We've shown a couple of examples in terms of cellularity and nuclear size, but there are many other features that are computed and registered with our system. An example of a meta-feature is heterogeneity. Depending on which features we use, we can have different biological interpretations. We've shown new ways of linking this information to genomic data. And all of our data is available through our website, which also supports a Google Maps type of viewing for the tissue sections, with the segmentation results overlaid on top. Thank you very much. Questions?

Bahram, maybe I could get started. Have you compared, or perhaps thought about comparing, the heterogeneity index that you derive with other indices from molecular data? For example, from SNP data you can also derive a similar type of index of cellular heterogeneity. It would be interesting, I think, to see whether they're correlated.

No, that's something we haven't done. It's a very good question, and that's probably an area of collaboration that we've been talking about. It's definitely the right direction to go. So that's one of the advantages of having a tissue section analyzed: it can actually offer you information about the heterogeneity.
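The quadrant analysis described above (cellularity versus nuclear size) amounts to a median split on each axis. Here is a sketch on synthetic per-patient summaries; the feature values and units are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)
# Hypothetical per-patient summaries: mean cellularity and mean nuclear size.
cellularity = rng.normal(100, 20, size=146)
nuclear_size = rng.normal(40, 8, size=146)

# Median split on each axis yields four quadrants; the group with high
# cellularity and small nuclei is the one singled out in the talk.
hi_cell = cellularity > np.median(cellularity)
small_nuc = nuclear_size < np.median(nuclear_size)
quadrant = hi_cell.astype(int) * 2 + small_nuc.astype(int)  # 3 = hi cell, small nuclei
print(np.bincount(quadrant, minlength=4))  # patients per quadrant
```

Survival curves would then be compared between the quadrant groups, e.g. with a log-rank test, to see which quadrant separates responders.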
Whereas most other genome-wide data are basically bulk measurements. Thank you. Please. Here.

Hi. Have you tried to estimate the number of fields required for a reliable estimation of the type and heterogeneity of the sample?

Number of fields?

Yes, the number of pictures taken from the same section.

So basically, every tissue section is, as I said, about 40,000 by 40,000 pixels, up to 100K by 100K, and every 1K by 1K block is represented independently, okay? And there are about 336 such tissue sections. There's lots of data, so there's no shortage of data in this case for estimating heterogeneity.

Do you have a leave-out approach or something, where you try to see whether, if you use fewer of them, the result remains stable? In terms of sampling and all that.

So we actually tried it with cross-validation. This is done in two ways: one is leave-one-out, and also cross-validation. You end up with four subtypes regardless of how you do it, in terms of the number of tissue sections. And then we validated that by building a library of these blocks and looking at them visually, to see whether they actually follow the same signature.

Okay, thank you.

Linda? Great data. Maybe you haven't gotten that far; I don't know how many cases you've been able to do, or whether the 150-some cases really represent the four subtypes on the molecular level and the spectrum of genotypic distribution. I'm wondering whether you have been able to establish any kind of association of the morphologic features that you are detecting or measuring with a genotype or subtype. On an initial first-pass analysis, for example, the small-cell nuclear feature is associated with, say, a P53 mutation, that kind of molecular correlation. You may not have enough cases; I'm just wondering about that.

So in one of the subtypes, you do have...
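The stability check described in that answer can be illustrated with a subsampling experiment. This sketch assumes scikit-learn's KMeans and silhouette score on toy block features with four planted clusters; it is not the pipeline's actual clustering method, just a demonstration that a stable subtype count survives resampling.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(3)
# Toy block-level morphometric features with four planted subtypes.
centers = np.array([[0, 0], [5, 0], [0, 5], [5, 5]], dtype=float)
X = np.vstack([c + rng.normal(0, 0.5, size=(100, 2)) for c in centers])

def best_k(data, ks=range(2, 7)):
    """Pick the cluster count that maximizes the silhouette score."""
    scores = {k: silhouette_score(data, KMeans(k, n_init=10, random_state=0)
                                  .fit_predict(data)) for k in ks}
    return max(scores, key=scores.get)

# Stability check: the chosen k should not change across random subsamples.
picks = []
for _ in range(5):
    idx = rng.choice(len(X), size=300, replace=False)
    picks.append(best_k(X[idx]))
print(picks)  # expect 4 on each subsample of this well-separated toy data
```

The visual validation mentioned in the answer (building a library of blocks per subtype and inspecting them) complements this numeric check.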
So we haven't correlated with the mutation data; we've only done the correlation with gene expression data. And P53 does show up in one of the subtypes.

When I look at your imaging technique, you seem to pick up all the nuclei, regardless of whether they are tumor or stroma, so including all sorts of cell nuclei. Does that affect your results?

So that's also a very good question. Now it gets into the business of which cell types you're picking up. Right now, we're picking up everything, so there could be other cell types. Some of the cell types are easy to pick up: for example, when you look at the tissue sections, there are quite a few lymphocytes, and if you go to the gene expression data, you see those kinds of activities represented there as well. But do we separate them? No, we don't, not at this stage.

Interesting talk, thank you. I've worked with some pathologists where, even for the same endoscopy, they take multiple biopsies and get different readings. So I'm wondering, would it be worth using Alex's method, where you build a random forest with multiple biopsies for the same patients? Would that be better for classifying survival or not?

That's a good question. So the question is: instead of looking at a single variable, maybe we can look at multiple variables. Yes, that's a good idea. There's also another piece of information that I didn't present here, which is that there are essentially two sets of information computed. One set comes from nuclear measurements, that is, their morphometric properties as well as their organization. The other set comes from patch-based analysis, like: this region is apoptotic, and this region is half necrotic and half apoptotic. All of that information needs to be combined in order to provide a more reliable correlation and association. Thank you.
Okay, no more questions. Thank you.