In the last lecture, you got a broad understanding of the mega project on the Human Protein Atlas and also an overview of the Human Pathology Atlas project. Today Dr. Sanjay Navani is going to continue his talk on the Human Pathology Atlas. You will learn more about the Human Pathology Atlas, which is one of the three aspects of the mega project on the Human Protein Atlas, which also incorporates the Tissue Atlas and the Cell Atlas. So, let me welcome again Dr. Sanjay Navani for his lecture and to continue the discussion about the Human Pathology Atlas project.
At this time of big data, there are a lot of big data resources. The Protein Atlas is one of them, and The Cancer Genome Atlas is another place with large amounts of data. Can you combine resources and produce a product that is beneficial or gives us a better answer? So, it was with that thought that the RNA sequencing data with the clinical metadata, which means the survival and how the patient did, was derived from The Cancer Genome Atlas. Out of a total of 11,000 patients, we got data for about 9,600 patients, which was the study pool. We looked at the global gene expression patterns for all the protein-encoding genes. Now here you have to be a bit careful. I am using these words, protein-encoding genes. What I am basically telling you is only one protein per gene; I am not looking at post-translational modifications. So, you must remember that that is still another variable that may need to be crossed in the future. The gene expression in 37 normal human tissues was obtained from 162 patients from the HPA project. So, the cancer RNA-seq data with the clinical metadata came from The Cancer Genome Atlas and the normal tissues from the HPA.
How are you getting all these different global tissues from healthy people? So, I tried to define healthy. I mean, one source was autopsies and the other source was people who had been biopsied but who were non-cancer. That was the control that was used.
So, not every biopsy was expected to give a diagnosis of cancer. There were some of those patients, and some of the biopsies were done to rule out a cancer. So therefore, they were not entirely normal. There is only a limited number of cases where you might do a biopsy like that for cancer. Well, the Swedes had it. They had a biobank. A very well maintained biobank, I have to say. So, it was a struggle and that's why...
So, it was a potential bias? Yeah, yeah, of course. Or it's called a non-toxic and healthy person? Yeah, for sure. But it was when there was no pathologic issue seen in that tissue under the microscope or correlated. I'm sure the patients must have had many other problems. So, normal, I mean, all of us have many problems. So, we are not normal too. So, that was the best that was possible under that circumstance.
And all the RNA-seq data, both from the cancers as well as the normal tissues, were processed in the same pipeline and were normalized as FPKM. So, that was how it was expressed. When we looked at the data initially, the majority of all the cancers, 26 out of 33 cancers, clustered in the same group. The majority of all normal tissues, 33 out of 37, clustered in a different group. And the first basic conclusion was, of course, that most cancer types share expression features that make them quite different from normal tissues, and that was what we expected. Out of all the protein-encoding genes, 41% were present in all cancers. So, a breast cancer was not necessarily that different from a gallbladder cancer. There was a large overlap. Secondly, 46%, and this I considered to be very important at this stage of our research activity, showed a restricted tumor-type expression. So, in the tumors they were different, but when we compared them to normal tissues, those same genes were different. And 13% of the protein-encoding genes were not seen at all. Now, that's a big question mark. Where are those genes?
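As a side note on the FPKM normalization mentioned above: FPKM stands for fragments per kilobase of transcript per million mapped fragments. The sketch below shows the standard formula only. The function name and the example numbers are made up for illustration and do not come from the HPA or TCGA pipelines.

```python
def fpkm(fragment_count, gene_length_bp, total_mapped_fragments):
    """Fragments per kilobase of transcript per million mapped fragments.

    fragment_count: fragments mapped to this gene (hypothetical input)
    gene_length_bp: gene or transcript length in base pairs
    total_mapped_fragments: all mapped fragments in the sample
    """
    if gene_length_bp <= 0 or total_mapped_fragments <= 0:
        raise ValueError("length and library size must be positive")
    # Dividing by (length / 1e3) and (library size / 1e6) gives the 1e9 factor.
    return fragment_count * 1e9 / (gene_length_bp * total_mapped_fragments)


# Toy example: 500 fragments on a 2,000 bp gene in a 25-million-fragment library.
example_value = fpkm(500, 2000, 25_000_000)
```

Because the value is scaled by both gene length and library size, expression levels from different samples run through the same pipeline become comparable, which is the point made in the talk.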
You've probably read a bit about the missing genes and the missing proteins and stuff like that. For that group, there are only theories. What are the proteins that those genes are coding for? Are they important only before birth? But I won't get into that now.
The housekeeping genes were detected in all samples, both cancers as well as normal tissues. The housekeeping genes are very important. They are increased in cancers because they do all the activities of looking after every cell. They are the same in every cell. Only because the cancer cell is multiplying very fast, there are more of them in the cancer cell. But what we can't forget is that they are also present in normal tissues.
That was how the program worked out. So, the transcriptomics was taken from The Cancer Genome Atlas and the HPA. The HPA contains the immunohistochemistry-stained images, and the clinical data was got both from the HPA and The Cancer Genome Atlas. A systems-level analysis was done that gave rise to the Human Pathology Atlas. So, when you look at this diagram on the website, you can click any of these cancers and you'll be taken to that data.
We then narrowed the search to 17 tumors with large numbers and clinical metadata. From the original 23... 37, we came down to 17 because these cancers had adequate numbers. And as you can see, these cancers which are grouping here are gastrointestinal cancers. This is colorectal carcinoma, that is stomach adenocarcinoma, that is pancreatic adenocarcinoma. This group which is coming here are the squamous cell carcinomas. So, it's head and neck squamous cell carcinoma, cervical squamous carcinoma. If you see this group here, these are the endometrial adenocarcinomas, ovarian adenocarcinomas and the breast adenocarcinomas. So, they group together from the same system. There were only two very far outliers.
One was hepatocellular carcinoma, very different from everything else, and the other was glioblastoma multiforme, which is a brain tumor, which looked quite different from everything else. The surprises were yet to come.
When we look at this, each color and each figure corresponds to a particular type of cancer. You can see how much variation there is in each individual cancer. That means even though people like me say this is a moderately differentiated or a grade 2 hepatocellular carcinoma, if you take two of those, they look different. Not only that, you will see that there's also a spillover into cancers of other types. In fact, if you go only by the transcriptomics profile, you will say that this resembles this cancer more than it does its parent cancer. Which raises a question in my mind, which so far I haven't been able to answer. If that's the case, then why does it look like that?
So, the 10-year survival data also was available on The Cancer Genome Atlas. It shows prostate cancer and germ cell tumors to have the most favorable three-year survival. If you have to have cancer, those are the good cancers to get. You'll do well with them because of the therapies available.
So, the TCGA specifically focused on people with end-stage disease, right? So, I think that's a true statement for people with end-stage disease. It might not be fair. I don't know whether it's only end-stage disease, because the RNA-seq data on the genome atlas is obtained at the point of diagnosis. And they follow to the event of death, which is what interested us. But, and this is where your point is very valid, they don't say that he died of prostate cancer. They just say that he died. Now, it may have been a myocardial infarction. But because we had an end point of death and we had the RNA-seq levels at the start, that was the reason for using this data. Okay.
So, what we did was actually a huge exercise of Kaplan-Meier curves, which look at survival for each gene, for the RNA-seq data from each gene. And the RNA levels at the time of diagnosis were plotted against the survival data. That's something I was just saying. We stratified the RNA-seq data in each patient into those expressing the highest and those expressing the lowest and correlated that with the outcome. So, the basic point was to see: if this is the highest and the patient is doing badly, then that's not a good gene. That's finally what we wanted to say. And for this exercise, there were more than 100 million Kaplan-Meier plots that were generated. I don't think anybody saw all of them. It was just the machines. Okay.
So, let me give you an example of what favorable and unfavorable genes we found. So, let's look at the different cancers on the top right here. Black lines mean high expression, red lines mean low expression. So, this is being expressed high and therefore you see these events happening faster and faster until this time, until the patient is gone. Therefore, that classifies it as an unfavorable prognostic indicator. With this gene, Mark II, the events seem to be happening more slowly if it is present, and therefore that's a favorable prognostic indicator. Finally, if you combine them in a panel, that's what you get.
Let me put it in a different way. The prognostic genes were classified into favorable and unfavorable, as I just told you. Hepatocellular carcinoma and renal cell carcinoma, which are displayed at the top, had the maximum number of prognostic genes in the study in which we found a correlation. What did we call a prognostic gene? Why did we say that this has got some prognostic effect? Because the expression level was above the experimentally determined cutoff in an individual patient. That is a statistical analysis. So, I'm just reading it off the chart there: p less than 0.001.
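The stratify-and-compare step described above can be sketched in a few lines. This is only an illustration under stated assumptions, not the pipeline used for the atlas: the function names are hypothetical, the patient data are toy values, and the cutoff is simply the median expression, whereas the talk refers to an experimentally determined cutoff. The comparison uses a standard two-group log-rank test.

```python
import math
import statistics


def kaplan_meier(times, events):
    """Kaplan-Meier estimate: list of (time, survival) at each death time.

    events[i] is 1 if the patient died at times[i], 0 if censored.
    """
    data = sorted(zip(times, events))
    at_risk, surv, curve, i = len(data), 1.0, [], 0
    while i < len(data):
        t, deaths, leaving = data[i][0], 0, 0
        while i < len(data) and data[i][0] == t:
            deaths += data[i][1]
            leaving += 1
            i += 1
        if deaths:
            surv *= 1 - deaths / at_risk
            curve.append((t, surv))
        at_risk -= leaving
    return curve


def logrank_p(times_a, events_a, times_b, events_b):
    """Two-sided log-rank p-value for two groups (chi-square, 1 d.o.f.)."""
    times_a, times_b = list(times_a), list(times_b)
    events_a, events_b = list(events_a), list(events_b)
    pooled = zip(times_a + times_b, events_a + events_b)
    death_times = sorted({t for t, e in pooled if e})
    diff, var = 0.0, 0.0
    for t in death_times:
        n_a = sum(1 for x in times_a if x >= t)
        n_b = sum(1 for x in times_b if x >= t)
        d_a = sum(1 for x, e in zip(times_a, events_a) if x == t and e)
        d_b = sum(1 for x, e in zip(times_b, events_b) if x == t and e)
        n, d = n_a + n_b, d_a + d_b
        diff += d_a - d * n_a / n          # observed minus expected deaths
        if n > 1:
            var += d * (n_a / n) * (n_b / n) * (n - d) / (n - 1)
    if var == 0:
        return 1.0
    chi2 = diff * diff / var
    return math.erfc(math.sqrt(chi2 / 2))  # survival function of chi2(1)


def split_by_expression(expression, times, events, cutoff):
    """Split patients into high (>= cutoff) and low expression groups."""
    high = [(t, e) for x, t, e in zip(expression, times, events) if x >= cutoff]
    low = [(t, e) for x, t, e in zip(expression, times, events) if x < cutoff]
    return ([t for t, _ in high], [e for _, e in high],
            [t for t, _ in low], [e for _, e in low])


# Toy data for one gene (made up): FPKM at diagnosis, follow-up in years, death flag.
expression = [30.0, 28.5, 25.1, 22.0, 4.2, 3.9, 2.5, 1.0]
times = [0.5, 1.0, 1.5, 2.0, 6.0, 7.0, 8.0, 9.0]
events = [1, 1, 1, 1, 1, 0, 1, 0]

cutoff = statistics.median(expression)  # assumption: median split
t_hi, e_hi, t_lo, e_lo = split_by_expression(expression, times, events, cutoff)
p_value = logrank_p(t_hi, e_hi, t_lo, e_lo)
high_curve = kaplan_meier(t_hi, e_hi)
low_curve = kaplan_meier(t_lo, e_lo)
```

In this toy case the high-expression group dies earlier, so the gene would be read as unfavorable if the p-value passed the chosen threshold. Running this per gene and per cancer is what produces the very large number of plots mentioned in the talk.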
Now, in favorable and unfavorable genes, did we find the same favorable genes in more than one cancer? And unfavorable genes in more than one cancer? Yes, there's a big overlap. Some unfavorable genes for some cancers, like lung cancer and pancreatic cancer, clustered together. Favorable prognostic genes were seen in liver and lung, but there was an overlap of those genes there. And finally, no prognostic genes were shared in more than seven of the cancers, which we thought was significant. Because we didn't get it across the board, saying favorable and unfavorable for everything.
Take a look at this chart. These are the unfavorable indicators. These are the favorable indicators. In the unfavorable indicators, the most impressive one was the mitotic cell cycle, or the cell cycle phase, which we also know for a fact. We've seen that as well. Even before these tools were available, we knew that tumors which are rapidly dividing will be an unfavorable indicator. So, that much was proved. On the favorable side, it was mainly the regulation of immune cell activation. Because the mitotic cell cycle was found to be that significant, all 314 cell cycle genes were studied. And each gene was studied separately for a prognostic effect.
Now, if you ask me, this was a big surprise for me. You know what my concept as a diagnostic pathologist is? For the mitotic index, you either count the mitoses on the slide or there is a marker called Ki-67, MIB-1, which everybody swears by, but which does not work in all cases. So, that's just one prognostic gene from the cell cycle. Therefore, all cell cycle genes may not apply to all cancers.
Now, there was a publication in 2011, a very famous publication which was called Hallmarks of Cancer. In that, there were 2,000-odd genes which were defined as hallmark genes of cancer. It's the biggest work of its kind to date.
When we studied those hallmark genes in our data, two-thirds of them, 65% of them, were predictive for the clinical outcome in at least one cancer. So, what was earlier reported was verified, at least by our study. Most genes affected only a few of the cancer types. Not all cancers were affected by the hallmark genes. And the network analysis showed that most of those genes were not shared. So, the next step was to take a cancer example from lung.
And you see, in case you've lost track of it, we started out trying to say that these are prognostic genes. At the end of all this stuff, we said, yeah, these are the prognostic genes. Then, as we discussed in the earlier lecture, let's go back and ask what other people said about it. So, that's how the hallmarks of cancer paper came in. So, we are trying to find out what the prognostic genes in cancer are. We've done all this stuff, we've done all this data analysis and we say, yeah, we've got something. Now, is this true or are we barking up the wrong tree? And there's very little work in this area. So, is there any other work? It's the hallmarks of cancer papers, two of them to be exact. So, pull them out. What did they say are the hallmark genes? So, what they said are the hallmark genes and what we are saying, is it matching? So, 65% of them matched. So, they weren't entirely wrong or entirely right. What we found was that most of those genes which they had identified affected only a few of the cancer types. So, we began to think that maybe they didn't get all of them. And the network analysis showed that none of these genes were shared by the cancers, which maybe meant that they were looking for specific genes in individual cancers, not necessarily all the genes. So, am I clear? Is it better now? Okay.
Now, what we did then was we went into a specific lung cancer example. In that, the statisticians, they always create something very beautiful and attractive.
So, this is one of those examples. But don't get carried away by that. Let's try and understand what it means. These are all genes. I'm sorry, I couldn't enlarge it enough for you to read what's actually written, but that's the name of a gene in there. And if it's got this red circle around it, it means that it was the hallmark paper that first said it, and we studied it along with others. If it doesn't have the red mark, it means that it came up as a prognostic gene in our study. So, if you look at the middle, these are called the hub genes, right at the center. All of them were thought to be prognostic. They all came up as prognostic. But while plotting this, you get some genes which are in the hub and you get some genes which are in the periphery. And there is a greater likelihood of these genes in the hub having a prognostic effect than the ones at the periphery. And therefore, it's a speculation: when prognostic genes affect a cancer, 50 of them, 100 of them or 500 of them, who are the really bad guys? Who are the drivers? And who are the passengers? So, it's tempting to speculate that these guys in the middle are the drivers and the ones at the periphery are the passengers. But it's only a speculation and that's where it should stay for now.
Then another beautiful diagram. It looks like popcorn, doesn't it? Yeah, so what the statisticians said was, you're talking about these prognostic genes one gene at a time, and 20 genes you've got there. Talk about a prognostic cluster, all of which are closely related. There are so many genes which are related, which are doing DNA repair, cell cycle processes. You get them together, make it a prognostic cluster. So that's what we did. We made this a prognostic cluster. This is many genes together. And in this diagram, the reverse is true. If you look at the periphery, the large ones are the ones which have a greater chance of having an effect.
The inner ones, not so much. And also the differences are highlighted: the gray ones are the hallmark clusters. They were published 7-8 years ago. And the prognostic and hallmark clusters are our work, which is superimposed on the hallmark. We agreed that these are all important. And then there is a third group in which we say this is a prognostic cluster but the hallmark papers haven't talked about it. A slightly different way of putting this: these are the prognostic and hallmark genes. We agree. These are the prognostic genes which are co-expressed with the hallmark genes. And these are the different prognostic genes which we found in our study which the hallmark papers have not mentioned.
Now with all this information, bringing it almost to the end of my talk, is it possible to generate a personalized model for an individual cancer for treatment? Something that is referred to now as a genome-scale metabolic model. If you construct a full genome-scale model for this cancer and for the next cancer, let's take two liver cancers for which all the proteomics or transcriptomics are known, we'll be able to compare and say how they are different. Are you with me or no?
In that case, how will we develop that particular marker? No, it's quite simple. There's no need to get into genetic differences between individuals. At this stage, we are now talking about genetic differences between cancers which look the same, which are of the same type. So, for hepatocellular cancers which look the same, what does the transcriptomics data say? That's the question I'm trying to answer right now. So, let's just finish that. Are you with me? Yeah. Okay.
So, we carried out personalized genome-scale metabolic modeling for tumors from more than 7,000 patients across the 17 major cancers. And we expected something different, because cancer cells pull in more nutrients from the surroundings and they build up more of a biomass. These are some of the statistics that it threw out.
I won't go into all that except to say that 1,400 metabolites, a thousand reactions and 334 of all the genes were present in all the personalized models. They were common. Then, these are all liver cancer patients, and I'm looking at elements of tricarboxylic acid metabolism. So, I want to tell you that FH, or fumarate hydratase, was found in all liver cancer tumors. Then, I want to tell you that ACLY, right there on the top, see those small bars there? It was found in less than 5% of liver cancer patients. And finally, succinate dehydrogenase complex subunit A was found in 60%. The point I'm trying to make is that there is sufficient difference among cancers of the same type, which underscores the need for a personalized model. Okay.
One more impressive chart for you to understand. This thing over here: the most commonly expressed genes were those of the most common metabolic functions. And they were all expressed in several of these cancers. Now, if you think of a drug target, because these are common metabolic functions, they are also occurring in normal cells. Therefore, if you give anything to these patients, as I outlined before, there was a possibility that about 80% of these targets would have side effects. So, you hit the cancer, but you also hit the normal cell. And now, the concept of what chemotherapy does to a patient, I think you can begin to appreciate why they have so many problems. Okay.
Now, the last point. 32 gene targets that were mainly involved in nucleotide metabolism look like potential targets. They are expressed in more than 80% of the tumors of the patients, regardless of the cancer type. And they are potential targets because they will not affect the normal cells. Okay.
So, if you go to the Protein Atlas, there's a new section, well, it's six months old now, called the Human Pathology Atlas, which will give you access to all the Kaplan-Meier plots.
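The two counting steps described above, how often a gene appears across personalized models and which genes pass the "more than 80% of tumors, not needed by normal cells" filter, can be sketched as follows. This is a minimal illustration with made-up toy data. The function names, the placeholder gene `GENE_X`, and the idea of representing each personalized model as a plain set of gene symbols are assumptions for the sketch, not how the atlas models are actually stored.

```python
def prevalence(models, gene):
    """Fraction of personalized models (sets of gene symbols) containing gene."""
    if not models:
        raise ValueError("need at least one model")
    return sum(gene in model for model in models) / len(models)


def candidate_targets(tumor_models, normal_genes, min_fraction=0.8):
    """Genes present in more than min_fraction of tumor models and absent
    from the (hypothetical) set of genes that normal cells depend on."""
    all_genes = set().union(*tumor_models)
    return sorted(
        gene for gene in all_genes
        if gene not in normal_genes
        and prevalence(tumor_models, gene) > min_fraction
    )


# Toy data: five made-up liver tumor models. The percentages are NOT the
# study's numbers, only an illustration of the calculation.
liver_models = [
    {"FH", "SDHA", "GENE_X"},
    {"FH", "GENE_X"},
    {"FH", "SDHA", "GENE_X"},
    {"FH", "SDHA", "ACLY", "GENE_X"},
    {"FH", "GENE_X"},
]

fh_fraction = prevalence(liver_models, "FH")      # present in every toy model
sdha_fraction = prevalence(liver_models, "SDHA")  # present in some toy models
targets = candidate_targets(liver_models, normal_genes={"FH"})
```

A gene like FH in this toy example is common in tumors but also marked as needed by normal cells, so it is filtered out. That is the side-effect argument from the talk in miniature.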
If you want to see anything, the significant plots, the insignificant plots, all the data is up there. You will also get the survival of the patients if you wish to check that. More than 5 million IHC images for cancer can be seen there. Most of them were annotated by us here in India.
A few points just to leave in your mind as I finish. These prognostic genes, as we heard in the morning, have to be verified in an independent cohort. Secondly, the death, which was a question that Joshua raised earlier. We are assuming that the death was due to the cancer, but we don't know; that data is not available. Another point which was discussed earlier in the morning: how pure was the sample? People spoke about samples not being fixed on time, technical issues. Is that also confounding the data? And finally, out of all these prognostic genes, are all of them causing the cancer, or are some guys doing it and the others just following? And which are those?
I'd just like to leave with thanks to the people who made it possible for us to do the job here in India. In particular, this man right there, Mathias Uhlén, who is the director. Really quite a person; I enjoyed meeting and interacting with him. I think, you know, when you associate with people from other places, other disciplines, other countries, you get exposed to many things which may not necessarily be true of the country that you're working from. What I envied him mostly for, and which I told him quite frequently, was his capacity to think big, not just in numbers. You see, when you start thinking big, you really think big on all levels. I have never been able to get over how he gave me the job. I met him for 15 minutes. There was, of course, a previous work-up, you know, younger people in the organization met me. I put forth my ideas, everything happened. I gave a presentation and then finally I met him for 15 minutes and he said, okay, I think your idea looks good. I have just one question. Have you ever done this thing before?
Scan all these images, send them to India, look at them on the computer, work on the software. I said, no, it's just an idea. In the research field, people look for background. You can come and say anything, but what have you actually done? Now, fortunately for me, this kind of thing had never been done. So new ideas were okay. If the idea is good, let's give it a shot. We were supposed to do 2 million images in a year. And when I first heard that, I didn't know whether I should say yes, but I did. But the moment I said yes, he said, okay, like a good senior person. He said, okay, I'm sure you can do it, Sanjay. Why don't we do a small experiment? Why don't you just do 250,000 images for the first year? If you do those well and we get good results, and you're able to maintain the quality control. There are many things. The internet has to work. The pathologists must understand this big exercise. Then from the next year, you do all 2 million. I thought that was fair. But I don't think he and I knew what we were talking about, because nothing had happened.
So I went out, I hired pathologists who had never seen a single slide of IHC in their lives. All this stuff is done by guys and girls who'd never seen any IHC ever. And why did I hire them? Because they wanted to learn it. And these are people from here. I'm not talking about a different country. I'd just like all of you to remember that. So after we started it, we finished 250,000 images in one and a half months. And Mathias called me and said, do you want to continue? I told him, you bet. And that's how the whole thing happened.
So as time has gone by, you know, at one point in time, the numbers used to be very important to me. 15 million images with quality control. But all that, you know, everything finally passes. The only thing that's remained, and which always makes me feel very good when I think about it, is that there were Indian pathologists who didn't know anything about IHC, who did this job.
And it makes me feel very happy because I never expected that. So I just want you to remember that it's very important to have enthusiasm, to have a positive outlook and to say clearly when you can do things and when you cannot do them. It's all right. Thank you for your attention.
You know, a number of the major targets in cancer, even before all the omics, were in DNA metabolism. Yeah. Yeah. 5-fluorouracil. Yeah, yeah. So did those show up? Yeah, yeah, they showed up. In fact, they were the main group. So the cell cycle genes, because there was a group that separated from the rest. So the 314 genes in the cell cycle were evaluated on an individual basis. And 60% of them showed a correlation with an unfavorable prognosis in the cancers. But not all of them affected the same cancer. So there were different genes in the cell cycle itself, which had different impacts on the cancer. So they were very much a part of the group. Yeah, yeah. That was also described by the hallmark group. So all of that is confirmed, and I mean confirmed by our study. And now there are additional prognostic genes, which we've brought up, which we feel should be evaluated further.
So I'm sure by now you have got a very good understanding of the Human Protein Atlas project, and especially the Human Pathology Atlas project. The Human Pathology Atlas was created as part of the Human Protein Atlas program to explore the prognostic role of each protein-coding gene in 17 different cancers. It's really a mega project. And this project, the HPA project, shows the impact of protein levels on the survival of patients with cancer. It uses transcriptomics and antibody-based profiling to provide a standalone resource for cancer precision medicine. I must say that you should look into this really enriched resource for your own research, where you can get so much data and information for all the possible proteins in various cancer types.
In the latest versions of the HPA, the survival scatter plots also show the clinical status for all the individuals in the patient cohorts. All the data presented are also made publicly available in a very interactive open-access database that allows one to study the impact of individual proteins on clinical outcome in major human cancers. We are now moving almost toward the end of the course, and I hope you are enjoying not only these lectures, but also the information and resources available for you to conduct your own research, even if you want to do some bioinformatics work just sitting at your own place on your computer. You can do a lot just by looking at the available data from these resources. So, I hope you are really going to make the best use of this and I will see you again in the next lecture. Thank you.