So, we're going to talk about how one can start thinking about large-scale cancer genomic analyses and the cool things that can come out of doing that, and we're going to use three different projects in different ways as motivations and exemplars: the ICGC, ICGC-PCAWG, and CPC-GENE. And if none of those mean anything to you, that's why the first objective is to be introduced to them and know what those things mean, and then understand how we can discover really important facets of not just cancer biology, but of how we should think about cloud-based analyses in cancer biology. And we're going to spend a reasonable amount of time using prostate cancer as an exemplar. Not because prostate cancer is the only or most important tumor type, but because it's one of the ones that has moved farthest in cloud-based analyses, and it allows us to explore really deeply some ideas of how we'd link the genomics to specific clinical characteristics. And then we'll conclude by talking about some of the key lessons that you might apply to your own analyses as you move forward. So, I'm going to start off with an introduction to localized cancers and why we think hard about them. Then we'll talk about the ICGC, which is largely a study of localized cancers, then get into CPC-GENE, and conclude with PCAWG. So, if we take any type of cancer, the key problem when the tumor is confined to the organ is that patient outcome is hypervariable. You're looking here at lung cancer, though it doesn't really matter what tumor type we look at, and this is a Kaplan-Meier curve. The y-axis is telling you the fraction of patients that are still alive, and the x-axis is time. At time zero, 100% of the patient cohort is alive, and then over time, people are dying of their disease.
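For anyone who hasn't worked with them, the Kaplan-Meier estimator behind these curves is simple enough to sketch in a few lines. This is a minimal illustration with made-up patients, not production survival-analysis code:

```python
# Kaplan-Meier estimator, a minimal pure-Python sketch.
# Each patient is (time, event): event=True if the patient died at `time`,
# False if they were censored (still alive at last follow-up).

def kaplan_meier(patients):
    """Return [(time, survival_probability)] at each event time."""
    times = sorted({t for t, event in patients if event})
    surv = 1.0
    curve = []
    for t in times:
        # Patients still at risk just before time t, and deaths at t.
        at_risk = sum(1 for ti, _ in patients if ti >= t)
        deaths = sum(1 for ti, e in patients if ti == t and e)
        surv *= 1.0 - deaths / at_risk
        curve.append((t, surv))
    return curve

# Toy cohort: four patients, one censored at t=5.
cohort = [(2, True), (3, True), (5, False), (8, True)]
print(kaplan_meier(cohort))
```

Real analyses would add confidence intervals and handle ties and censoring conventions more carefully; survival libraries such as lifelines do this for you.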
And of course, if we knew at time zero what those two groups were, and which patient was going to die early and which was going to die late, we'd have a very natural and simple treatment modality: we intensify treatment for patients who have bad disease, and we keep it the same or de-intensify it for patients who are likely going to be cured. And there are lots of ways in a localized cancer where we can improve treatment. We can do radiotherapy more aggressively, where you simply increase the dosing; or you can take more aggressive surgical margins, where you take a little more adjacent normal tissue to try to get additional tumor that might have been missed. Or you can give standard chemotherapy regimens, like platinum-based chemotherapy, which has been in use for 40 years and is used to intensify treatment and remove the risk of metastases for these patients. Never mind that all the targeted therapies now widely used in metastatic disease are easily applied in localized disease; we don't apply them, because they cost so much and we don't know which patients would be good candidates for them. So if we could just make this distinction, it would be transformative. And so, of course, people have tried to do so using cancer genomic data. This isn't a new idea. The initial studies were based on a relatively straightforward hypothesis: that if you took any type of cancer, you could divide it into a number of subtypes; those cancer subtypes would be characterized by distinct molecular profiles; and those distinct molecular profiles would have distinct outcomes, distinct prognoses. And at least some of you are probably screaming at me to say, Paul, you're talking about pattern discovery, or clustering, or unsupervised machine learning. And you've all seen these types of analyses. This is the classic study in the field. It was by Chuck Perou's group, and it was published on September 11th, 2001.
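The subtype-discovery hypothesis is, at its core, clustering of expression profiles. The original studies used hierarchical clustering; as a stand-in, here is a minimal k-means sketch on toy "expression profiles" (all numbers invented):

```python
import random

def kmeans(points, k, iters=50, seed=0):
    """A minimal Lloyd's-algorithm k-means on equal-length feature vectors."""
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]
    for _ in range(iters):
        # Assign each point to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(
                range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])),
            )
            clusters[nearest].append(p)
        # Recompute centroids as the mean of each cluster.
        centroids = [
            [sum(dim) / len(cl) for dim in zip(*cl)] if cl else centroids[i]
            for i, cl in enumerate(clusters)
        ]
    return clusters

# Two well-separated toy "profiles" recover two clean groups of three.
profiles = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
groups = kmeans(profiles, k=2)
print(sorted(len(g) for g in groups))
```

The catch, as the lung cancer story shows, is that on noisy high-dimensional data the clusters found depend heavily on the cohort and the initialization, which is exactly why independent validation failed.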
They took 60-odd breast cancer patients with mRNA data, clustered it, and it pops out into these five incredibly distinct groups. Five groups that are so distinct that they have completely different outcomes, that today are treated completely differently, and that are the basis of molecular therapy in breast cancer. This is a real success story, and it shows that there are subtypes tightly associated with both outcome and treatment response in breast cancer. But unfortunately, this isn't the case for most tumor types. At almost the same time, David Beer's group at the University of Michigan did an identical study looking at lung cancer. They identified subgroups of patients with differential outcome, and then groups around the world tried to validate this. And we couldn't. We found different subgroups of patients in Toronto than in Boston, in Boston than in Michigan, in Michigan than in Japan. It was just hypervariable. And these studies came to a head in a really important paper published in the New England Journal of Medicine in 2006 by a group at Duke University. They used a very sophisticated empirical Bayes clustering technique to produce the patient clusters on the bottom. And on the right, you can see those survival curves. Those are the best survival curves you'll ever see in your life. They're the best survival curves because they're fraudulent. The work was retracted. Clinical trials started on the basis of this data had to be stopped. This is as much of a disaster for precision medicine as you can imagine. It had serious implications for the funding of these types of studies from the NIH, and it drove a real change in the rigor that is now required. So let's leave aside why people decided to be fraudulent and ask: why couldn't that hypothesis of subtypes associated with outcome work? If you look at this honestly, you see tumor types that look like this. Here again we're looking at lung cancer.
The columns are six carefully chosen genes, but just six genes. And the rows are 200 patients. Ask yourself: how many subtypes do you think you see there? It is certainly not five or six. It's probably 50 or 60. And our best guess today is that there are a few hundred subtypes of lung cancer. Of course, if we'd done a better job listening to our statistician colleagues, we would have recognized this from power analyses, which were telling us there's way too much variability: you need cohorts of thousands of patients to be able to find these biomarkers. And so in the mid-2000s, as a field, we pivoted and said, actually, we should stop thinking about this as if there's a small number of subtypes, because there's a large number, and their molecular profiles overlap. But what we really care about is still outcome. And outcome really is still kind of a binary phenomenon: good or bad. So what we should be doing is applying supervised machine learning techniques. I'm going to fast-forward through 10 years of really great machine learning work in computational biology, but pretty quickly, we got very good at taking any individual data set and making a prediction of outcome from it. In fact, using information content analyses, we can say that we are able to routinely extract more than 99% of the information present in an individual data set. Phrased differently, no algorithm is going to do better; we're reaching theoretical maximums. But is that good enough? Well, until two years ago, this was the best available biomarker for lung cancer. And there's a great Kaplan-Meier curve. It's got a nice p-value that you can certainly publish in a good journal. It looks happy. But actually, if you think about it, precision medicine means denying therapy to some individuals. What we're telling the patients in the blue group is: sure, you've got good-prognosis disease, so we're not going to give you chemotherapy.
Or we're not going to give you expensive radiotherapy. We're not going to give you this great new targeted agent, because we don't think it's going to work. And the patient would very reasonably say: great, how confident are you in that prediction? And with this very publishable biomarker, we would say 66%, at which point the patient's going to go, yeah, right, thanks, I'll take the drugs. And of course they should. Survey data suggest that biomarkers need to be around 80% accurate before they are accepted in patient communities, as they should be. And so that led us to realize that if we were achieving the best we could with single data sets and we were still so far from success, then we needed to move to looking at multiple data sets simultaneously. And in a really deep way, this is the origin of projects like TCGA, The Cancer Genome Atlas, and the ICGC, the International Cancer Genome Consortium, because their goals were to put together large numbers of tumors with multimodal information and allow us to generate clinically useful tools. And if you think about it, what you really care about in a biomarker is, of course, that it's accurate, but also that it's extensible, able to handle different types of problems or different patient populations; that it's clinically focused, not just the genomics in isolation, but considering what we already know about the clinical characteristics of the disease, the treatment options, and, in a health care system like Canada's, probably quality-adjusted life years and cost. And of course, you want it to be fast. One of the other hats I wear is that I work in transplant, lung and kidney. In those cases, you have an hour to make a clinical decision. Never mind extracting the DNA, sequencing it, letting it align on the cluster for 24 hours; it has to happen very quickly. So we need all these things to come together.
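That need for large cohorts can be made concrete with the kind of power analysis mentioned earlier. A back-of-envelope two-proportion sample-size calculation (the survival fractions below are illustrative, not from any real study):

```python
from math import sqrt

def n_per_group(p1, p2, z_alpha=1.96, z_power=0.84):
    """Patients per arm to detect survival p1 vs p2 with a two-sided
    two-proportion z-test (alpha=0.05, power=0.80 by default)."""
    pbar = (p1 + p2) / 2
    num = z_alpha * sqrt(2 * pbar * (1 - pbar)) \
        + z_power * sqrt(p1 * (1 - p1) + p2 * (1 - p2))
    return (num / (p1 - p2)) ** 2

# Separating 70% vs 75% five-year survival takes ~1,250 patients per group:
print(round(n_per_group(0.70, 0.75)))
```

Small survival differences between overlapping molecular subgroups are exactly the regime where hundreds-of-patient cohorts are hopelessly underpowered.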
And so this requires large multimodal data sets that have lots and lots of concordant information married to rich clinical annotation. So I'm going to talk to you about the projects that are generating that: the ICGC, ICGC-PCAWG, and CPC-GENE. Briefly, the ICGC is a set of consortia, a consortium of consortia, really. Countries around the world decided to fund projects looking at an individual tumor type, analyzing 500 tumors and doing DNA, RNA, methylation, and in some cases proteomics. And then we came together as a group, with all of those different projects and different tumor types, to share ideas, protocols, technologies, and benchmarking exercises. This is a list of the ICGC projects as of, I think, March. You can see that they're spread across the world, and there are several in Canada. Those projects not only have their own independent ideas and research questions, but they also come together to do shared meta-analysis. And that's what I mean when I say ICGC-PCAWG, the Pan-Cancer Analysis of Whole Genomes. It's the first 2,800 whole genomes from the ICGC projects merged together with standardized analysis, with different working groups looking at different questions. We're going to spend a fair bit of time at the end of the talk on PCAWG, what it taught us about cloud-based analyses, and, for that matter, what we can learn about how we should do these things better in the future. But I'm going to start off not with how we merge all those tumor types together; I'm going to talk about one of the three biggest individual projects. And that project is the Canadian Prostate Cancer Genome Network, CPC-GENE. CPC-GENE is the largest prostate cancer project in the world. Some of you may know that Canada is the world leader in prostate cancer research, both clinically and molecularly. Standard treatments have been discovered here.
And so when the decision was made to have a project looking at the genome of a tumor type, prostate was a really natural starting point. People often don't know the characteristics of prostate cancer as well as they do many other tumor types. It afflicts one in seven men over the course of their lifetime. So in this room, four or five of us will develop it. Fortunately, only one of those four or five will die of their disease. But all of those men will be treated and suffer serious morbidities, health care costs, and reduction in quality of life. And unlike almost any other tumor type, we're really good at finding prostate cancer. The gold curve in the plot on the right is age-adjusted incidence; it's telling you how much we detect in the population. You can see it doubles in the 1990s. That's from the introduction of PSA testing, a simple blood test to identify prostate cancer. The green curve at the bottom is mortality, and you'll notice that mortality has barely changed. We're detecting lots and lots of tumors that are never going to kill a man. As a result, we have this huge problem of overtreatment. The way treatment decisions are made today is very simple: patients are essentially staged with three different criteria. I mentioned PSA. Prostate cancer is the only solid tumor type for which imaging is not part of the standard clinical workup; there are no radiologic assays for prostate cancer. We also use biopsies, taking a small piece of the tumor. In prostate cancer, 3% of men who get a biopsy are later seen in the emergency room for bacterial infection. So that's a huge health care cost. Prostate cancer biopsy will be the single most commonly performed medical procedure in Canada by 2030, and 3% of men end up in the emergency room. And then there's the digital rectal exam, where a physician inserts a lubricated finger into a man's anus, palpates the prostate, and tries to say, hey, is there a bump?
And if so, maybe that's disease, and then I'll try to guess at the extent of it. And that's it. That's the extent of how clinical decisions are made. Those criteria are used to group men into low-, intermediate-, and high-risk groups. And I talk about those things like they're really terrible; actually, they work remarkably well, and it shows just how amazing our clinical colleagues are. Those groups are then treated differently. Men in the low-risk group are not treated actively. They go through what's called active surveillance, just monitoring their disease, in a protocol developed in Toronto at Sunnybrook. The men in the intermediate group receive definitive local therapy, surgery or radiotherapy, with the main surgical and radiotherapy protocols having been developed in Toronto and Vancouver, respectively. The high-risk men receive those therapies with adjuvant hormone therapy, to try to make sure any occult metastases outside of the organ are caught; the discovery of hormone therapy for prostate cancer was also here in Canada, in Montreal. And so we chose to focus on the intermediate-risk group of patients. About a third of these men are overtreated, and we want to identify those; a third are undertreated, and we need to identify those. We designed a big whole genome sequencing study. Toronto, Montreal, Quebec City, and Vancouver are the sites: 150 specimens taken prior to radiotherapy, 350 specimens taken prior to surgery, long-term clinical follow-up, and all sorts of cool things done, including whole genome sequencing, methylation profiling, ChIP-seq on the individual primary tumors, RNA-seq, whole-proteome mass spec, and more. You can't see it on the slide, but what it says is that all patients were reconsented.
So we actually had a CRA go to the home of every single patient to get an extra blood draw and to get approval for deposition of the data in every public database that we could think of, so that the data would have the maximum utility. This is one of the most open cancer data sets anybody can get. There are other prostate cancer genomics projects in the world, in other countries, and we've come together as a large consortium, the ICGC Pan Prostate Cancer Group, which is doing shared meta-analyses; our first co-publication is in press at Nature Genetics. We have 2,000 prostate cancer whole genomes alone, and we are harmonizing the bioinformatics for all of them in the cloud, with downstream working groups focused on prostate-specific questions. This will likely become the largest cancer genomics data set for any single tumor type next year. So, okay, that's why we did prostate cancer and the context; what did we actually discover? Well, the first thing that we did was say: this is really exciting, what we should do is take a look at an individual prostate and ask about the spatial variability. The prostate is about the size of a walnut, and you can imagine that we've put it through an egg slicer, and you're looking at different layers of the prostate here. For each layer, our pathologist would go in and identify the regions of tumor in that layer. So, for example, you can see that little blue region of what's called Gleason pattern 4 prostate cancer, more aggressive, and the green region of Gleason pattern 3, which is less aggressive. Each of those would be macro-dissected or laser-capture micro-dissected and then whole genome sequenced. That way we get an idea of the spatial variability in the prostate. Then we went ahead and looked at the variability within an individual man. That green rectangle is nine regions from one man's prostate. It's a complicated plot, so just focus on the top bar plot.
The top bar plot is telling you the number of point mutations, and you can see that at the low end it's about 20 and at the high end it's about 150. So region to region across the prostate, the number of point mutations varies by almost an order of magnitude. The second green rectangle is five regions from a different man's prostate, and you can again see this order-of-magnitude variability. And it's not just point mutations: it's true if we look at copy number aberrations in the top panel, and it's true if we look at genomic rearrangements in the bottom panel. Anywhere we look, we see this hypervariability in mutational burden spatially. And of course you're saying, sure, but is any of this clinically relevant? So let's go back to that man from whom we had nine regions of his prostate. You can see them laid out here with the H&Es and IHCs, and let's just focus on this panel of cancer-related genes. There are the nine regions laid out as columns. The first one is the diagnostic biopsy, and the next eight are surgical specimens. You can see the diagnostic biopsy has that green rectangle for PIK3CA. That's the H1047R non-synonymous point mutation, druggable, and it occurs in 1% of prostate tumors; really exciting. This man might be a candidate for targeted therapy instead of surgery or radiotherapy. You can also see the blue dot for TP53 deletion, and TP53 deletion is an adverse prognostic factor for the delivery of radiotherapy in prostate cancer. So this man is a good candidate for surgery, maybe a targeted therapy under a trial. But now if you look at the other eight regions, only one of the eight has the PIK3CA mutation. Three have the TP53 deletion. Four have a deletion of BRCA1, which is just bad, period. If we superimpose the pathological aggressivity upon this, it's the Gleason 4+3 regions that are more aggressive, and those are associated with the BRCA1 deletion.
And unfortunately, this man has had a relapse of his disease, and no points for guessing: it's region number three, the one without any of the actionable mutations we talked about, that gave rise to his relapse. This is a bit of a disaster for the delivery of personalized medicine, when any individual biopsy doesn't give you good information. And of course, it does a man no good to say that his prostate tumor is not aggressive after you did the surgery to cut it out and get all of the spatially separated regions. There's a question. Yeah, so my question is, while you're doing this analysis, are you also looking for markers that aren't necessarily linked to cancer and therapeutics? I'll get there. So this extent of spatial heterogeneity is actually an underestimate of what's going on. Here's another man, four regions from his prostate. The top plot is the copy number landscape laid out from chromosome 1 to Y. You can see deletions on chromosomes 8 and 16, those blue lines, in the top two regions, and in the bottom two regions, deletions on chromosome 19. You'll notice that there's nothing shared between the top two and bottom two regions. And actually, if you look at the bottom plot, which is point mutations, there's also nothing shared. This man doesn't have prostate cancer; this man has two genetically distinct tumors in his prostate. One in seven men gets prostate cancer, so one in 49 gets two, and that's before accounting for environmental effects, germline predisposition effects, or life-history effects. So now we've got a huge problem. We now estimate that 5% of men have multiple genetically distinct tumors in their prostate at diagnosis. Which one do we treat? How do we even detect these individuals? How do we understand treatment modalities? So this is a huge problem. And the question just asked was: how do we start to overcome it? Well, the natural thing to do is to be really frustrated, and so that's the first thing that we did.
Then the next thing you do when you have a very difficult problem like this is hire a really smart graduate student. Emilie Lalonde was the student who cracked this problem fundamentally. She said, okay, I get why my supervisors are depressed, but instead what I'm going to do is take a look at whatever is most consistent across different regions of the tumor. And that turns out to be copy number. Copy number is more consistent spatially than any other mutational metric; it occurs really early in the truncal evolution of a tumor. She took a series of pretreatment biopsies, looked at the copy number landscape of these 150 tumors, and saw that there are four distinct groups. Those are the four groups, and you can really see they look different. Cluster one is characterized by that red amplification on chromosome 7. Cluster four looks really flat, without a lot of mutations. And clusters two and three have their own hallmark characteristics. And if you look at those, they actually have different outcomes. It's that red group at the top which has different outcomes. Well, that red group is cluster number four, the one with few mutations. And somebody here is thinking: Paul, that's really obvious, tumors with few mutations aren't as bad. We thought so too. So we went ahead, using the ICGC data, and looked at 36 tumor types. There is one other tumor type with this relationship: breast cancer. Really interesting that two hormonally driven tumors share it. Ovarian cancer has the same relationship in the opposite direction, where more genomically mutated tumors have better outcomes, probably because of increased sensitivity to platinum-based therapy. That tells you that what we should be doing is maybe something very simple: just count the mutations, the proportion of the genome with a mutation. And that's a pretty good biomarker, almost 70% accurate by itself. But that's still not good enough. I told you the number we need to get to is 80%.
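The "just count the mutations" idea is usually operationalized as percent genome altered (PGA) computed from copy-number segments. A minimal sketch (the segment coordinates and the diploid baseline of 2 are illustrative assumptions):

```python
def percent_genome_altered(segments, genome_size):
    """segments: list of (start, end, copy_number) tuples; diploid = 2.
    Returns the percentage of the genome carrying a copy-number aberration."""
    altered = sum(end - start for start, end, cn in segments if cn != 2)
    return 100.0 * altered / genome_size

# Toy genome of 10 kb with one 2 kb single-copy deletion -> PGA of 20%.
segs = [(0, 1_000, 2), (1_000, 3_000, 1), (3_000, 10_000, 2)]
print(percent_genome_altered(segs, genome_size=10_000))
```

In practice the segments come from a copy-number caller run on array or sequencing data, and the patient's PGA is then thresholded or fed into a model alongside other features.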
And Emilie said, okay, what else is consistent across a tumor? Well, one thing is the level of oxygen. So she looked at tumor hypoxia, the level of oxygen in the tumor. She expected that genomic instability, here on the y-axis, would be related to the level of oxygen, here on the x-axis, with 100% meaning very low oxygen content and zero very high. And there's no relationship at all. But when you put these two things together in the context of clinical outcome, something magical happens. That curve at the top: those are patients who have low levels of hypoxia, their tumors are normoxic, and they have low levels of genomic instability. And synergistically pulling all the way down to the bottom are tumors with high hypoxia and high genomic instability. It's not good enough to have one or the other; tumors with both, those are the really aggressive disease. And that gives us a sense for how we might start putting biomarkers together. I'll take a quick aside here and say that hypoxia is actually something we can measure in every tumor type. As part of PCAWG and the TCGA PanCanAtlas, another consortium project I'm not talking about today, we measured hypoxia across all tumor types with sequencing data. I'm showing you a subset here. On the bottom left are the least hypoxic tumors, thyroid and prostate, interestingly. And on the extreme right are the most hypoxic tumors, lung cancer and head and neck cancer; certainly not what I would have predicted, but interesting. And you can use that data to start to understand the effects of hypoxia across tumor types. In fact, using these data, we've been able to identify interactions between somatic mutational profiles and tumor hypoxia. The top panel here is kidney clear cell cancers, with patients laid out from most hypoxic to least hypoxic.
If you look closely at that line, BAP1, you can see it preferentially occurs in hypoxic tumors. Or breast cancer at the bottom: TP53 is more commonly mutated in hypoxic tumors, and by contrast, PIK3CA is more commonly mutated in less hypoxic tumors. Whether this is selection or adaptation or clonal evolution is an interesting question; we have some hints, but it's hard to figure out. Let's go back to our prostate biomarker story. Using those two pieces of data, Emilie put together a biomarker looking at a hundred genomic regions that measures somatic characteristics of hypoxia and genomic instability together. It has an accuracy just under 80%. Our reviewers kindly told us that we should go validate it more. So, working with GenomeDx, a company here in Vancouver, we ran a thousand-patient validation cohort, and the top copy-number signature is Emilie's signature, with an accuracy just around 80%. Below it are a series of commercial and academic signatures, some of which are in routine use, with accuracies between 60 and 70%. So we've improved on the current standard of care by 10 to 15%, which is an amazing win. We've since gone ahead and done validations in large independent cohorts of low-, intermediate-, and high-risk tumors, and you can see the signature works really well. But if you think about it, we're still not doing all that well. 80% means we're making 20% mistakes. So what are we missing? Somebody is thinking: you're missing it because you looked at copy number, and there are a lot of other things that happen in the genome. And that's what I thought. So we went ahead and sequenced the whole genomes of 200 tumors and looked at every other characteristic. We also did their methylomes and transcriptomes, put this together with clinical outcome, and said: we're going to crack this problem. And we discovered all sorts of fascinating things. For example, here's the mutational landscape of these patients.
You can see patients laid out as the columns, and I'll point out a couple of patients that show something fascinating. That group of patients has 30 or 40 mutations in the entirety of their genome. This is a cancer that has almost no mutations. There is no adult tumor type, as we know from the PCAWG studies, with anywhere close to this low a mutation rate; essentially every adult tumor and almost every pediatric tumor has more mutations than prostate cancer. Ultimately, we put these data together with clinical outcome, identified a series of 40 driver events that drive aggressive prostate cancer, and then used machine learning techniques to combine them into a biomarker with an accuracy of 83%. So that's terrible. This is about as bad as it gets: I took an assay that we had working for $250 a patient at 80% accuracy based on the copy number data, and then I did whole genome sequencing, a transcriptome, and a methylome, $10,000 of work on clinically unusable assays, to go from 79% to 83% accuracy. So clearly whatever we're missing is not in the genome. We're missing something else. Well, actually, maybe we just looked at the wrong genome. A postdoctoral associate in my lab said, hey, you should have thought about something entirely different. You should remember that every human cell has two genomes, and the mitochondria might have their own mutations. So Julia Hopkins went ahead and looked for mutations in the mitochondria of prostate cancer, and also pancreatic and many other tumor types as part of PCAWG. This is the circular mitochondrial genome, and each of those lines pointing inward is a hotspot, and each of those mutational hotspots is as frequent as TP53 mutation is in prostate cancer. So there's a lot going on in the mitochondrial genome. Julia wasn't done there, though. She said, okay, maybe there's some association between nuclear and mitochondrial features.
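A standard way to test that kind of association is a Fisher's exact test on the 2x2 table of tumors with and without each mutation. A minimal right-tailed version in pure Python, with made-up counts:

```python
from math import comb

def fisher_right_tail(a, b, c, d):
    """Right-tailed Fisher's exact test for the 2x2 table
        [[a, b],   a = tumors with BOTH mutations,
         [c, d]]   b, c = one mutation only, d = neither.
    Returns P(overlap >= a) under the hypergeometric null."""
    n = a + b + c + d
    row1, col1 = a + b, a + c
    p = 0.0
    for k in range(a, min(row1, col1) + 1):
        p += comb(col1, k) * comb(n - col1, row1 - k) / comb(n, row1)
    return p

# Hypothetical counts: 8 tumors carry both events, 2 + 2 carry one, 8 neither.
print(fisher_right_tail(8, 2, 2, 8))  # small p-value: significant co-occurrence
```

A real analysis would use a two-sided test and correct for multiple testing across all nuclear-mitochondrial feature pairs.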
And here the rows are those different mutational driver features that we identified from the nuclear sequencing, and the columns are different mitochondrial genes or mutational characteristics. Every time you see a red dot with a black background, that means they co-occur: when a tumor has a certain mutation in the nucleus, it is significantly more likely to have another specific mutation in the mitochondria. And when it's blue, they're mutually exclusive. We can't understand the nuclear genomic mutational landscape without also looking at the mitochondrial mutational landscape; we need to see the two together to be able to interpret the effects that are going on. I'll give an example of how this explains missing variability. Here, the black curve shows patients who lack mutations in both the oncogene MYC and the origin of replication of the mitochondria. Blue are patients who have one or the other. And in red, synergistically pulling down, with every patient having an event, a relapse of their disease, are patients who have both. You can't interpret the nuclear mutation unless you know the mitochondrial profile. And this is actually mechanistically very reasonable: MYC increases mitochondrial biogenesis, and the origin-of-replication mutations prevent an explosion of mitochondria from wasting tumor resources. This explains a portion of the variability, but it's not the only thing, because in the end DNA is only part of the story. What we found is that we should have skipped past DNA and RNA and looked at the proteome, and paid a lot of attention to it. Working with Thomas Kislinger, a world-class proteomics scientist in Toronto, we measured the proteomes of a subset of our tumors using mass spec. We also had lots and lots of other multimodal data sets, and we asked: what do we learn from that? I'll show you an example first and then move into the clinical application. These are tumors driven by ETS fusions.
The ETS fusion is the most common mutation in prostate cancer; almost half of all tumors have it. On the right panel are RNA changes between ETS-positive and ETS-negative tumors, and on the left are protein changes. And if you look at them, they look pretty similar. So we thought, let's take a look at this very carefully and understand the relationships between them. We did this for every single gene, asking how related the changes between ETS-positive and ETS-negative tumors are at the RNA and protein levels: protein on the y-axis, RNA on the x-axis. And they look sort of related; that correlation at the top, Spearman's rho, is 0.7, so that's not bad. But if you stopped there, you would have missed an incredible story, because the axis labels are different. This is log2 space, and in protein it goes from minus 10 to plus 10: those are changes of a thousand-fold at the protein level. In RNA it goes from minus three to plus three, which is only about 10-fold. We see protein changes of a thousand-fold where RNA changes only 10-fold. And if you look at the very top, there's the gene GRM1. GRM1 is overexpressed at the RNA level, for sure, two-fold, but at the protein level it goes up a thousand-fold. And ephrin-B2, at the very bottom, goes down just under two-fold at the RNA level and 500-fold at the protein level; it's completely ablated. And it's not just this subset of genes. We looked at RNA-protein correlations across the entirety of the genome, and here we're dividing proteins by how abundant they are. On the left are the 10% most abundant, and even for these most abundant proteins, the correlation between RNA and protein is only 0.3. Phrased differently, RNA levels in prostate cancer explain only 10% of protein levels. So when we do an RNA assay and say, oh, this is a good proxy for what's happening at the protein level: it's not even close, not at all. But just think about what this means: the two are largely unrelated.
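The disconnect between correlation and dynamic range is worth sketching: a rank correlation like Spearman's is blind to magnitude, so perfectly ordered RNA and protein changes can correlate at 1.0 while differing enormously in scale (the log2 fold-changes below are invented):

```python
def rank(xs):
    """1-based average ranks, handling ties."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    ranks = [0.0] * len(xs)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and xs[order[j + 1]] == xs[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman(x, y):
    """Spearman's rho: Pearson correlation of the ranks."""
    rx, ry = rank(x), rank(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx) ** 0.5
    vy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (vx * vy)

# Same ordering, wildly different dynamic range:
rna = [-1.5, -0.5, 0.2, 1.0, 2.0]        # at most ~4-fold changes
protein = [-9.0, -2.0, 0.5, 4.0, 10.0]   # up to ~1000-fold changes
print(round(spearman(rna, protein), 6))
```

A correlation of 0.7 therefore says nothing about whether the magnitudes agree, which is exactly the trap described above.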
So there should be different pieces of information present at the RNA and protein levels. If you put these two things together, you get the second magical story I'll show you. We used the information-content techniques I talked about at the beginning. It's a very compute-intensive technique, so it was done on the cloud. On the left you can see a cluster of three blue curves. Those are for nucleotide-based biomarkers: CNA, methylation, RNA. And they have accuracies with a median around 0.6. Then the purple curve, shifted to the right by about 6%, is what happens if you make a biomarker using just protein. Protein's better, clearly. And then the two black curves on the far right are what happens if you build a multimodal biomarker with copy number, methylation, or RNA plus protein. And now you start getting accuracies that shift another 10% further out. Taken with the mitochondrial DNA, this allows us to develop biomarkers that are over 90% accurate for prostate cancer. And it's all based on the synergy of different levels of data. So that's why we want to have data sets like this. That should motivate you powerfully to want to look at questions like this in almost any tumor type, where you marry rich molecular data with strong cloud resources and key clinical annotations. And one of the first big projects to do so at scale with real power was ICGC PCAWG. And PCAWG is almost a model for how you think about coordinating these types of consortia. It's got a steering committee with members from four different countries: two from the US, one from Canada, one from the UK, one from Germany. It was sub-organized into 16 working groups, each with independent working group leaders. There was a fundamental underlying technical working group which did all of the core alignment and variant calling and things like that.
And those research working groups covered everything you could think of, from integrating the transcriptome with the genome, to the mitochondria, to mutational signatures, to evolution. And each of these has multiple papers that are either in the late stages of drafting or available on bioRxiv and under review right now. It's 2,800 tumors from 14 countries and 20 tumor types. You'll notice that this is smaller than the entire scope of ICGC. Part of that is because there was a data freeze several years ago, and also because different groups chose to study the same tumor type. For example, Germany has a modest-sized project looking at prostate cancer in young men, men under the age of 45. So that means that there are more prostate cancer samples than almost any other tumor type. And indeed, this is the distribution across tumor types, and you can see that there are multiple projects. So on the extreme left is pancreatic, where there are pancreatic projects in Canada at the bottom, two in Australia, and one in Italy, giving you a total of about 350 pancreatic tumors. And so there was initially, back in 2013 and 2014, a roadmap put together for how all of these analyses would fit together: data owners putting the information in an annotated, structured format on a cloud-based service, running three variant calling pipelines and an alignment pipeline, and then everybody getting the consensus calls for downstream working groups and using those for interesting things. And this had never been done before, and clouds were actually a lot younger than they are today. So there were occasional challenges. The alignment phase, for example, required 1.2 petabytes of disk, which was more than most people had considered. Getting the data from place to place in the world was challenging. And one had to do careful quality control to make sure that the runs of a pipeline in one place matched identically those in another place.
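The consensus-call step in that roadmap, keeping a variant only when enough independent pipelines agree, can be sketched as a simple two-of-three vote. The caller names and variant keys below are placeholders, not the actual PCAWG merge logic:

```python
from collections import Counter

def consensus(calls_by_pipeline, min_support=2):
    """calls_by_pipeline maps pipeline name -> set of variant keys.
    Return the variants reported by at least min_support pipelines."""
    counts = Counter()
    for variants in calls_by_pipeline.values():
        counts.update(variants)
    return {v for v, n in counts.items() if n >= min_support}

# Variant keys as (chromosome, position, ref, alt) tuples.
calls = {
    "sanger": {("chr1", 12345, "A", "T"), ("chr2", 999, "G", "C")},
    "dkfz":   {("chr1", 12345, "A", "T"), ("chr3", 500, "T", "G")},
    "broad":  {("chr1", 12345, "A", "T"), ("chr2", 999, "G", "C")},
}

# chr1 variant: 3 of 3 callers; chr2 variant: 2 of 3; chr3 variant: 1 of 3, dropped.
print(sorted(consensus(calls)))
```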
And then you had to track all that information, and there was a series of metadata standardizations to harmonize all the information we were collecting, and a search database to be able to put this into a central repository. The alignment happened at seven centers around the world. It ran on an eight-core, 16 GB workflow that took a few days for each tumor, and it generated about a quarter of a terabyte for each of those samples. So that's a lot of compute, and a lot of expensive compute. The first set happily finished in about four months, by December 2014. And as a result, additional tumors were opened at that time, about 800, including 200 more prostates actually. And at that point in time, things should have been worked out, and all we had to do was run the variant calling pipelines. That turned out to also be a lot harder than expected. People who ran their pipelines on a local cluster hadn't really tuned them to work on the cloud, and there was a lot of development that happened there. Not only that, but each sequencing center produces data with slightly different error characteristics. And that means that there were different artifacts in different data sets that had to be controlled for or analyzed. That also included new algorithm development for things that just had not been seen before in terms of bugs or errors, and a lot more compute was required. So the three core variant calling pipelines, one from the Sanger, one from DKFZ in Heidelberg, Germany, and one from the Broad with some collaboration with Baylor, required amongst them almost 4,000 compute hours to run. And those are actual runs on the cloud itself; when people were running them on their local compute, it was even slower. Fourteen more compute centers were brought online to support that and make it faster, including several AWS sites and the Cancer Genome Collaboratory itself. And this eventually allowed the core analysis to get completed.
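Those storage figures are worth sanity-checking, and the arithmetic is simple: about a quarter of a terabyte of aligned output per sample, across roughly 2,800 tumors:

```python
# Back-of-envelope only: real totals also include matched normal genomes,
# intermediate files, and replicated copies across centers.
samples = 2800
tb_per_sample = 0.25               # ~a quarter terabyte of aligned output each

aligned_tb = samples * tb_per_sample
print(aligned_tb)                  # 700.0 TB for the tumor alignments alone
```

With matched normals roughly doubling that, plus working copies, the 1.2-petabyte figure for the alignment phase is easy to believe.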
And you'll notice that there were long lag times before analyses even started. And once they started, suddenly there was this rapid acceleration where a large number of tumors were finished, and then a sudden change in the curve where there's a subset of tumors that are really complicated. And if you notice, sometimes the number of samples goes down, because errors were found or mistakes had to get fixed, pipelines were changed, new library preparation artifacts were discovered and had to be corrected. And so in the end, it took multiple iterations of many of these pipelines on many tumors to figure out how it all works. In the end, you then have 2,800 whole genomes, and somebody has to look through them. Fortunately, only 6% ended up being excluded: some because of a lack of clinical data, which made them of unclear value; a couple because of mismatches between what we thought the sex of the individual was and the sex of the genome itself; and there were also some contaminated genomes that appeared to have sequence data from two different individuals, including a couple with substantial contamination. On top of all that, there were also 1,200 with RNA-seq, which I haven't talked about, which went through a completely separate pipeline for automated annotation. And then another analysis to remove artifacts in the data: oxidative artifacts, which can happen during library preparation; regions with significant normal contamination because of low coverage; chromosome Y calls being made by pipelines in tumors from female patients, which are clearly just false positives. And then annotating all of this with quality data and genotype information that could be used. You can imagine how massive this effort was. It actually required people from 20 or 30 groups, full time, coming together to figure out how to do it. And it taught us a series of really important lessons.
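Screens like the sex-mismatch and contamination exclusions just described are exactly the kind of thing you want automated. A minimal sketch; the field names and thresholds here are illustrative, not the consortium's actual QC criteria:

```python
def qc_flags(sample):
    """Return a list of QC flags for one genome's summary metrics."""
    flags = []
    # Infer genetic sex crudely from chromosome Y coverage.
    inferred = "male" if sample["chrY_mean_coverage"] > 5.0 else "female"
    if inferred != sample["reported_sex"]:
        flags.append("sex_mismatch")
    # Cross-individual contamination estimate, e.g. from genotype discordance.
    if sample["contamination_estimate"] > 0.03:
        flags.append("contaminated")
    return flags

sample = {"reported_sex": "female",
          "chrY_mean_coverage": 18.0,      # looks like a male genome
          "contamination_estimate": 0.01}
print(qc_flags(sample))                    # ['sex_mismatch']
```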
And the first is how you might think about managing multiple clouds: how critical it is to have central metadata repositories that are able to merge that information together and serve as a standard place. And I'll show you in a couple of minutes how you might access that through the ICGC portal. It also taught us that we need multiple clouds, that one isn't good enough, and that we should use academic clouds along with commercial ones. Academic is usually cheaper, but every single academic cloud is a little different, so you need people who know it well to run it. And for that matter, they don't have the diversity of node types, CPU characteristics, and memory that are available on commercial clouds. But commercial is more expensive. And sometimes there are things that just take a long time to run, and does somebody really wanna be paying for two months for one individual sample? No; sometimes you've gotta figure out which are the ones that are gonna run quickly commercially and which are the expensive ones that will just keep running locally. In addition, some places have concerns about which data goes into which cloud, in which jurisdiction. So for example, Germany doesn't want German sequencing data to go to a cloud in the US because of differences in the legal regimes that exist between those two countries. And that means one has to think about a cloud that doesn't transfer data behind the scenes to the US for load balancing. There's a lot of other key pieces of information: figuring out the clinical data and metadata that need to be annotated to each sample, and being ready for the fact that your data centers go down all the time. And if you've got a project like this, you've gotta be able to flexibly move information from one to the other.
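Being ready for outages in practice means checkpointing: persist progress after every completed unit of work so a crashed run resumes instead of restarting. A minimal sketch of the pattern (the checkpoint file name and work items are hypothetical):

```python
import json
import os

CHECKPOINT = "progress.json"

def load_done():
    """Read the set of already-completed items, if a checkpoint exists."""
    if os.path.exists(CHECKPOINT):
        with open(CHECKPOINT) as f:
            return set(json.load(f))
    return set()

def run_pipeline(work_items, step):
    """Run step() on each item, skipping anything a previous run finished."""
    done = load_done()
    for item in work_items:
        if item in done:
            continue                       # finished before the crash; skip
        step(item)                         # the actual (possibly slow) work
        done.add(item)
        with open(CHECKPOINT, "w") as f:   # persist after every item
            json.dump(sorted(done), f)
    return done
```

If the cluster dies mid-run, the next invocation re-reads `progress.json` and only processes what's left, which is the difference between resuming at 65% and starting over.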
And for that matter, figure out how to have your processes checkpoint, so that if a run gets 70% of the way through and then, oh, your cluster goes down because of a power outage, you can resume at at least 65% and not have to repeat it. You have to learn how to figure out the right size of VM, the virtual machine that you're gonna use, to make sure that you're able to do this quickly, so you don't end up being limited by not having made the right selection. And we have to realize that people are almost always more expensive than compute, so it's worth the time to figure out how to have our pipelines automatically recover, retry, and restart. And of course, we have to validate the results. The validation that I described took months of effort from different people around the world, and that needs to be automated and done to an equivalent standard for the next 10,000 genomes that are sequenced. So how can you get access to these data? Well, where is the data available, and where can you get access to the workflows? All of the core ICGC data is available through the ICGC data portal, not shockingly at dcc.icgc.org, the data coordinating center. And it's got a beautiful search facility where you can put in queries. The red line there will allow you to build quite complex queries, and you've got faceted search characteristics that allow you to say, I only want to look at specific projects or specific tumor types, or do specific types of analysis. And when you do that, you can see over here, you get a file ID showing you where the results came from. And when you click on that, you can find all sorts of important information about it. First, that file ID is consistent across different repositories. So if ICGC data is stored in five or six places, you don't say, I think this tumor is different from that tumor. No, you know, because the file is identical and you know where it came from.
It also gives you the opportunity to click through to things like the actual statistics about the BAM file. You can see on the left there's a little hand pointing at BAM stats. And in real time, you can find out characteristics of what it looks like: what the coverage is, where there might be biases, what the mapping rate is. That allows you to rapidly assess the quality of a file even before you download it. You can go ahead and start to take a look at the aligned files and variants, figure out what the characteristics of those are for individual mutations, and in fact come up with a manifest of what you'd like to download. And that allows you not just to say, these are the files that I'd like to download, but to identify the different places in which they're hosted and to set your order of preference: I'd like to try to get this from Toronto, but if I can't get it from Toronto, I'm gonna go to Virginia, and if I can't get it from Virginia, I'll go to the UK. And you can make those decisions based on prices, or based on download speeds, or whatever characteristic. And in fact, there's a download client that will allow you to take those types of manifests and other types of JSON metadata files and rapidly go and extract the tumors that you'd like. It has authentication linked to the ICGC authentication portal, and it's got relatively straightforward command-line syntax that's flexible enough to allow you to do a broad range of things, both locally and on the cloud. I went through that quickly. There's a lot of documentation on how you can use the ICGC portal and the types of features that it has; there's a whole website, docs.icgc.org, that has it. The workflows used in PCAWG are largely available as Docker containers, so you can replicate the vast majority of what was done, and that's available at Dockstore, dockstore.org.
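That ordered preference, Toronto, then Virginia, then the UK, is just try-in-order-with-fallback. A sketch of the idea, where `fetch` stands in for whatever transfer client you actually use (this is not the real ICGC client API):

```python
def download(file_id, repos, fetch):
    """Try each repository in preference order; return the first success."""
    errors = {}
    for repo in repos:                       # e.g. ["toronto", "virginia", "uk"]
        try:
            return fetch(repo, file_id)      # first repo that works wins
        except IOError as exc:
            errors[repo] = exc               # remember the failure, try the next
    raise IOError(f"all repositories failed for {file_id}: {errors}")

# Usage: pretend Toronto is down, so the download falls through to Virginia.
def fake_fetch(repo, file_id):
    if repo == "toronto":
        raise IOError("mirror unavailable")
    return (repo, file_id)

print(download("FI0001", ["toronto", "virginia", "uk"], fake_fetch))
# ('virginia', 'FI0001')
```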
And so that way you not only have relatively easy access to the data, but also relatively easy access to the code, so you can replicate almost the entirety of what was done in ICGC PCAWG yourself for a little bit of compute cost. So what did I tell you? I hope I showed you that cloud-based computing can enable exciting clinical genomic discoveries within and between tumor types, and that those things happen in a way where the science is preeminent and enabled by the cloud. And if you think back to my prostate cancer talk, you have no idea which parts were in the cloud and which ones were on local compute, except when I told you. And that's the great part about it: you don't need to think about it. The same workflows were in some cases run in multiple places. I also hope I pointed out something fundamental about prostate cancer biology, which is that we can do a great job of improving treatment of tumors when they're localized and curable, just by having the right multimodal datasets and putting them together in clever ways. There's a lot of complexity in using the cloud, especially in interacting with commercial providers. And that means that you need true multidisciplinary teams, people who are working together. And that includes clinical colleagues, and it includes expert cloud shepherds. At OICR there's a team, Vincent Ferretti's group, who takes care of the ICGC portal and really shepherds how the development of that happens. That's a PI who puts a significant part of their research effort into thinking about these types of problems. And of course, he's not the only one around the world. And lastly, the PCAWG papers are gonna show all sorts of fundamental cancer biology. In a real way, we're within the decade where we're writing the textbook of the cancer genome, and the PCAWG papers are going to be a key chapter of it. There's now something like 20 of them on bioRxiv, and they are all worth reading.
They're really fundamental stories, and a large set of them are going to be published in good journals in the next year or so. I'll close by thanking the people who did the work on the prostate cancer side. These are the other people who have led the Canadian Prostate Cancer Genome Network with myself; my team, of course; and my colleagues at OICR, particularly Christina Yung, who put together a lot of the PCAWG slides that I showed you and was the project manager for that. Thanks, I'd be happy to take questions. [Audience] That was really incredible. Just curious what you think, in terms of the multimodal data, is the next improvement, something else? Yeah, it's a great question. I speculate two things. One is that we need much better metabolome information, and we need to see how well we can understand the metabolome as a function of the proteome, the non-coding RNAome, and so forth. That's one. Two, I've treated each case here, a tumor or a tumor region, as if it's a synchronous set of cells, but as we well know, that's not the case. And there are many groups, including my own, Sohrab Shah's group here in Vancouver, and Peter Van Loo in London, who have really done pioneering work to show the subclonal heterogeneity of individual tumor specimens. We need to do that at the RNA level, at the DNA level, and at the protein level, and stop doing what I'm doing here, which is treating tumors as consistent chunks. What I will say, though, is our biomarkers are now 90-95% accurate, which says that there's only incremental value to be added. Now, for a patient, that's immense value, but as a healthcare system, actually, we're transforming care with the types of assays that we're using now, which are quite cheap: the protein, I didn't show you, but we can do using immunohistochemistry for 100 bucks per patient, and the copy number with a $200 panel.
So actually, it's unclear that we need to do more for routine clinical care until genome sequencing prices drop by an order of magnitude, which will probably be another four or five years. [Audience] So was there IHC data that you put on top of it? Yeah, so we took the top hits from the proteomics analysis, validated those using IHC, and developed a multiplex IHC assay that recapitulates most of the value of having a whole proteome. I think that's a general phenomenon: there's huge value to having the entirety of a proteome or genome or transcriptome, but for making clinical predictions, you can often reduce to a subset that has most of the predictive value. [Audience] Ballpark, how many proteins do you need? Four. In this case; I don't know if it's true for other clinical questions or diseases, but in prostate cancer, four got us 95%. Any other questions? Yeah. Yeah, that's a great question. I never said it explicitly, but all of our biomarkers incorporated, oh, repeat the question. The question was, is it valuable to add PSA to our biomarkers? So yeah, it's a great question. And I didn't say it, but every single biomarker I showed was built on the backbone of Gleason score, pre-treatment PSA, and tumor extent. So it was always showing the ability to add to existing clinical characteristics rather than replace them. The only case where we think we would be replacing them is that we think in some cases tumor extent, the digital rectal exam, is not adding significant value, but the other two always have some value and are still worth using today.