Welcome, everybody, to the last session. I especially want to thank the organizers for this opportunity. I think we are all very grateful to them for putting together this program. As many of you know, this is the first time that TCGA has opened up these meetings to the larger community, and I think it was a resounding success. So thanks very much for their participation. I'm going to open the session with a little bit of overview. I want to talk first about some infrastructure issues that we're addressing now. Infrastructure is incredibly important for a large-scale operation, especially since TCGA is really setting the tone for a transformation that will happen in medicine, where genomics becomes much more integrated. Cancer is definitely at the vanguard of this, so we have to pay very, very careful attention to infrastructure. We're building an essential piece of infrastructure. It's called CGHub, or Cancer Genomics Hub, and it's designed to hold the massive amounts of data that are being produced by this project and a few sister projects, in particular the TARGET project on childhood cancers and an eclectic collection of genome projects under another umbrella at the NCI. We're in the process of thinking through the design issues and actually implementing it now; we've got the first hardware running. The lights are on and it's in the test phase. The system is designed for 25,000 cases, with a very rough estimate of 200 gigabytes per case. If you did full genomes, a full-genome BAM file for just one genome is 300 gigabytes; if you want to do the normal as well, that's 600, so we will need some compression if we're going to do full genomes. That doesn't count the RNA sequencing you heard about from Marco, the many other magnificent sets of data, and so forth. So we're going to have to work in the future to compress down to just that much per case.
If you multiply these numbers together, you get five petabytes, five times 10 to the 15th bytes. That's a lot of data, and you have to organize the infrastructure at a different level of rigor when you're dealing with that much space. A lot of the things that work on your normal-size systems no longer work with this amount of data. There are some technical details here, about how this is put together, that I won't go into in this short presentation. Suffice it to say you can read simultaneously 12 times off a very, very heavily redundant, very optimized disk structure. And the last thing I really want to emphasize, and I've been talking to a number of people here about it, is that there are co-location opportunities. We're in a big data center, so if other groups want to co-locate a rack of equipment, or share a server there with some other people in the same rack, that's a nice option. When you have this amount of data, petabytes of data, it's very, very hard to send that over the fiber optics to another location if you really want to analyze the full data in detail. The goals of the project are simple, and this is something that I and a number of us are extremely passionate about. We need to enable the direct comparison and combined analysis of these large genomics data sets. We will not get where we need to go if we're all analyzing our small data sets in isolation. We've seen this multiple times in this meeting. Eric Lander's opening talk was brilliant in pointing out that only the large data sets provide the statistical power to really attack the full complexity of cancer. We are not there yet, but we will never get there unless we're able to aggregate these large data sets. A project like this sets the standard for data storage and exchange and encourages data sharing. So we want to encourage a degree of openness. And of course, as you heard, we're running into obstacles.
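The sizing arithmetic above, and the remark about how hard it is to ship that much data elsewhere, can be checked with a short calculation. This is only a back-of-the-envelope sketch: the case count and per-case size come from the talk, while the 10 Gbit/s link speed is an assumption made here for illustration and was not mentioned by the speaker.

```python
# Back-of-the-envelope sizing for the figures quoted in the talk.
CASES = 25_000            # design target, from the talk
GB_PER_CASE = 200         # very rough per-case estimate, from the talk
BYTES_PER_GB = 10**9      # decimal gigabytes

total_bytes = CASES * GB_PER_CASE * BYTES_PER_GB
total_pb = total_bytes / 10**15

# Assumption (not from the talk): a dedicated link sustaining 10 Gbit/s.
LINK_BITS_PER_SECOND = 10 * 10**9


def transfer_days(num_bytes: int, bits_per_second: int) -> float:
    """Days needed to move num_bytes over a link at a sustained bit rate."""
    seconds = num_bytes * 8 / bits_per_second
    return seconds / 86_400


if __name__ == "__main__":
    days = transfer_days(total_bytes, LINK_BITS_PER_SECOND)
    print(f"total size: {total_pb:.1f} PB")
    print(f"transfer at 10 Gbit/s: about {days:.0f} days")
```

Under that assumed link speed the full collection would take on the order of a month and a half to copy, which is the practical argument for bringing the analysis to the data rather than the other way around.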
The Bermuda agreements and Fort Lauderdale agreements that have governed genomics up until now do not apply to clinical, private, or otherwise controlled research data. We have to understand how we can navigate through that world and still maintain some of the principles that were so brilliantly established in the early stages of the genome projects. And finally, it's very important to maintain compatibility with the European Genome-phenome Archive; the dbGaP archive; the ICGC project, the international arm of this, of which we are really just a part in a way; the 1000 Genomes Project, which is looking at germline variation but has developed many of the critical tools for thinking about large-scale genome analysis, which we need to adopt; and the ENCODE project, which is pioneering the other types of data that you can get with these high-throughput sequencing machines. We're barely scratching the surface in this project on epigenomics. There will be a flood of epigenomics data in cancer. It's going to be extremely important, and that is really the flagship project. So we need to coordinate with all of these projects. One of the things that I will emphasize is that we've worked with 1000 Genomes, ICGC, and other groups to establish the Variant Call Format. Just Google that: VCF, Variant Call Format. There is a large group behind it. It's an open standards group, and with it we will be able to represent the changes that we find in our genome sample relative to the reference genome. These files are about a gigabyte, as opposed to 300 gigabytes, and they also contain higher-level information. If we can establish a standard there, that will be extremely important. It's also relatively machine-independent. As we go forward, there will be different manufacturers producing the machines that we will be using for sequencing, and we need a format that is manufacturer-independent, above all of that, and at a higher level than BAM.
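To make the idea of representing changes relative to the reference concrete, here is a minimal sketch of reading one VCF data line. It relies only on the eight fixed, tab-separated columns that the VCF specification defines (CHROM, POS, ID, REF, ALT, QUAL, FILTER, INFO). The `Variant` type, the `parse_vcf_line` helper, and the example record are all made up for illustration; real pipelines would normally use an established VCF library.

```python
from typing import NamedTuple


class Variant(NamedTuple):
    chrom: str
    pos: int      # 1-based position on the reference
    ref: str      # reference allele
    alts: list    # one or more alternate alleles
    info: dict    # INFO key/value pairs; flags map to True


def parse_vcf_line(line: str) -> Variant:
    """Parse the eight fixed, tab-separated columns of one VCF data line."""
    fields = line.rstrip("\n").split("\t")[:8]
    chrom, pos, _id, ref, alt, _qual, _filter, info = fields
    info_fields = {}
    for item in info.split(";"):
        if item == ".":          # "." means the INFO column is empty
            continue
        key, _, value = item.partition("=")
        info_fields[key] = value if value else True
    return Variant(chrom, int(pos), ref, alt.split(","), info_fields)


# Hypothetical record, invented for this example only.
EXAMPLE = "7\t55000000\t.\tG\tA\t50\tPASS\tSOMATIC;DP=31"
variant = parse_vcf_line(EXAMPLE)
```

A record like this takes a few dozen bytes per variant, which is why a per-genome VCF can be orders of magnitude smaller than the BAM file it was called from.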
So that's my last sales pitch for that. Where are we right now? One of the things that the Cancer Genome Atlas has been doing is taking a very hard look at the process of simply taking the mapped reads, the BAM file that we're all familiar with at one level or another, and trying to call the mutations. So, just calling point mutations, we can ask three different groups, the Broad group, UCSC, and Washington University in St. Louis, to predict from the BAMs where the point mutations are. Where are there single-base changes that are just in the tumor and not in the normal? The first time we did this exercise, which we call a benchmarking exercise, we were immediately confronted with the fact that we all have very good software, but it doesn't exactly agree. So this represents the calls that are made on the exact same data by different programs. You can see that there's a substantial overlap in the middle, but there are substantial cases where one group is seeing something that the other groups are not. Now, this is a difficult problem. You have 30-fold coverage of a whole genome sequence. There are places that aren't covered so well, where it's kind of dubious whether it's real or not, and there are artifacts induced by the mapping of the reads to particular places in the reference genome that can be very misleading and make it look like there's a mutation there that doesn't actually exist. So I want to emphasize that mutation calling is not a solved problem. We need to harden our tools in this area, and that's one thing that we're working on very hard. We're going to hear a lot more about that from Andre, from the Broad, later in the session. And we are just now beginning to look at the accuracy and consistency in the detection of structural variation, which is much, much harder than detecting point changes in the genomes.
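The kind of comparison behind such a benchmarking Venn diagram can be sketched with plain set operations. This is an illustrative sketch only, not the procedure the three groups used: the caller names and the calls themselves are invented, and each call is assumed to be identified by a (chromosome, position, reference base, alternate base) tuple.

```python
from itertools import combinations


def venn_counts(call_sets: dict) -> dict:
    """Count the calls made by exactly each combination of callers."""
    names = sorted(call_sets)
    counts = {}
    for size in range(1, len(names) + 1):
        for combo in combinations(names, size):
            # Calls shared by every caller in this combination...
            inside = set.intersection(*(call_sets[n] for n in combo))
            # ...minus anything also reported by a caller outside it.
            others = [call_sets[n] for n in names if n not in combo]
            outside = set().union(*others)
            counts[combo] = len(inside - outside)
    return counts


# Hypothetical somatic point-mutation calls from three hypothetical callers.
calls = {
    "caller_a": {("1", 100, "A", "T"), ("2", 200, "G", "C"), ("3", 300, "C", "T")},
    "caller_b": {("1", 100, "A", "T"), ("2", 200, "G", "C"), ("4", 400, "T", "G")},
    "caller_c": {("1", 100, "A", "T"), ("5", 500, "G", "A")},
}
regions = venn_counts(calls)
```

Each entry of `regions` corresponds to one region of the Venn diagram, so the counts add up to the number of distinct calls across all callers. Exact tuple matching is a simplification; it works for point mutations but would be too strict for structural events, whose breakpoints rarely agree to the base.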
And I'm going to switch into a kind of case study that we've recently been doing between UCSC and Broad on whole genome glioblastoma data. I hope to also then switch into a few scientific topics, because this of course is interesting not just as an exercise; we're trying to learn about this deadly cancer. So if we look at the whole genomes that are available, there are these 18 genomes. For various technical reasons, UCSC was not able to analyze this particular data set and Broad was not able to analyze that one. So altogether we both looked at 16 whole genome sequencing data sets, for both tumor and normal, for GBM. And we were particularly interested in structural rearrangements. One of the first questions you want to ask is, does it alter a gene in some way? In particular, does it create a fusion of two genes? When we looked at gene fusions, our UCSC program BamBam, which is the product of Zach Sanborn, who you'll hear more about, detected 167 events in these data, and Broad's program dRanger detected 188. So a similar amount; after careful filtering and automated analysis, both come up with reasonable lists. And if you look at the top hits, these list the top hits and which gene is fused to which other gene by the rearrangement. You see that there's substantial agreement. It's only these three where there's really a disagreement, where we're saying it's there and dRanger is not saying it's there, for example. You could do the same thing with their list and you'd see that there are some missing the other way around. These ones here come from the fact that we actually looked for a gene that is fused to another part of itself. That is not something that we had from the dRanger list. We were only given a list of genes that were fused to other genes. So those were likely also detected by the Broad pipeline but not reported to us.
So altogether, there were 136 potentially overlapping events out of this, and you can see the Venn diagram. You can also see that this is clearly an unsolved problem. There are lots of borderline cases, and there is no single definitive program that will find you all of the rearrangements in a whole-genome tumor set. Now, let's dig in a little bit more on this. Two particular tumor samples had the majority of the called events. In particular, of the very top-ranked events, 23 of the 25 events in dRanger and 21 of the 29 events in BamBam occurred in these two tumors. And you can see from these Circos plots how rearranged they are in each case. You can see these large lines indicating different detected rearrangements, again primarily detected by both pieces of software. Now, if you go in and look at these things in the UCSC software, we take the genome and we distinguish two different alleles, two different germline alleles. So when we plot copy number, this is the overall copy number as you go along the entire genome, chromosome 1, chromosome 2, all the way through. You can see, of course, that the overall copy number, as detected by read depth, varies throughout the chromosome, but you can separate that by the two alleles. We arbitrarily designate the minority and the majority allele depending on which one occurs more often in the tumor. You can see that in fact there are differences in which allele has been deleted. Now, I want to point your attention to this line here. There's a light space here down below. You can get very precise estimates of the amount of normal contamination, stromal contamination, in the tumor from whole genome sequencing. So all of these counts that exist below this line represent what we think are actually normal reads that are mixed in with the clonal tumor reads.
So these kinds of data are extremely important and give us an idea, because the two things that we're really looking at in structural rearrangements are: do they alter the copy number, and do they create juxtapositions of segments of DNA that don't normally go together? And usually you see these things in concert, as we'll see. So, zooming into this one particular GBM tumor a little more, you can see the detail now on how we have broken out the allele-specific changes. In here we have a loss of an entire arm of chromosome 9. And you can see, because this is down to the level of just pure contamination reads, that there are essentially no reads from this section in the tumor on this allele. On the other allele we took a bite out of it at this point. These are clearly distinct and independent events. In the overall copy number you see a drop and then a notch, and now it's separated allele-specifically. The combined effect of these two events is that we lost this section of DNA, and of course this section contains the very important tumor suppressor gene CDKN2A/B, or p16. That event is quite common in glioblastoma, as was known, and here we have an explanation of it. For example, if we just do a high-level diagram of what the whole genome sequence information is telling us, including the germline data, which I wasn't showing in that diagram, it's telling us that the patient of course started out with two fully functional, or two intact, copies of CDKN2A/B. One was lost in a segmental deletion, and the other copy was lost, we can actually determine, by a non-reciprocal translocation event. So this level of analysis gives you an interesting picture of these events. Let's look at a few more of these glioblastomas. Sure enough, we see in another one here, again, that the overall copy number drops to zero in this section containing CDKN2A/B, and you can imagine what the mechanism is. And here I want to point out this diagram that comes with this.
This means that we have a number of reads where one part of the read, or one mate of the read pair, is here and then the second mate would continue over here. That indicates that these two positions are very, very close to each other in much of the DNA in the tumor, and that tells us that there was a deletion event along this. And then the second notch is going to be caused by another deletion event, as we talked about. And here's a case where we loop back and repeat a little bit, so this is a little more complicated analysis of what's happened. You can see which DNA is connected. The tilt of these things has to do with the orientation. So you would be reading here and then you would go back and start reading here in the tumor. Again, the net effect is loss of CDKN2A/B. So overall, of the 16 cases that were available to both of us, in 11 cases we not only had homozygous loss of CDKN2A/B, but it was always broken down into a focal loss and a large-scale event, which is interesting. Now, what's happening in the five cases where we do not lose this important tumor suppressor? Well, we're just starting to look at that. You can look at those five cases and try to see what's happening in other genes. You can see several other important oncogenes amplified. You can look at the type, proneural, mesenchymal, and so forth. And you can look at whether there are rearrangements, or you can look for specific mutations like the very important IDH1 mutation. But one has to say that we're a long way from personalized medicine in the sense that we can take one of these difficult cases, completely analyze the genome, and tell you what's going on in it. That is definitely the challenge of the future. I want to talk a little bit about one of these fusions. We listed a lot of fusions that we both found. And one of the fusions that is undoubtedly there, since both groups found it, is the fusion between these genes.
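The read-pair evidence described here, where two mates map much farther apart on the reference than the library insert size allows, can be sketched as a simple classifier. This is a deliberately simplified illustration and not how BamBam or dRanger work: it ignores read orientation and mapping quality, the thresholds are arbitrary assumptions, and the coordinates are invented.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class ReadPair:
    chrom1: str
    pos1: int
    chrom2: str
    pos2: int


def classify_pair(pair: ReadPair, max_insert: int = 1000) -> str:
    """Label a mapped read pair by where its two mates landed."""
    if pair.chrom1 != pair.chrom2:
        return "interchromosomal"
    if abs(pair.pos2 - pair.pos1) > max_insert:
        return "deletion_candidate"   # mates too far apart on the reference
    return "concordant"


def supported_deletions(pairs, max_insert=1000, window=500, min_support=2):
    """Group deletion candidates by coarse position bins and keep
    groups backed by at least min_support read pairs."""
    support = Counter()
    for pair in pairs:
        if classify_pair(pair, max_insert) != "deletion_candidate":
            continue
        left, right = sorted((pair.pos1, pair.pos2))
        support[(pair.chrom1, left // window, right // window)] += 1
    return {key: n for key, n in support.items() if n >= min_support}


# Hypothetical read pairs with invented coordinates.
pairs = [
    ReadPair("9", 21_000_100, "9", 21_000_400),   # normal spacing
    ReadPair("9", 21_000_150, "9", 22_500_200),   # spans a putative deletion
    ReadPair("9", 21_000_300, "9", 22_500_050),   # second supporting pair
    ReadPair("9", 21_000_200, "12", 58_000_000),  # mates on different chromosomes
]
deletions = supported_deletions(pairs)
```

Requiring several independent pairs for the same junction is one simple guard against mapping artifacts. Fixed bins can split neighbouring pairs across a bin boundary, so a real caller would cluster breakpoints more carefully.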
And you can see, again, the DNA that starts here and ends over here, indicating that it starts reading backwards through this other gene. That fusion event is associated with an enormous amplification as well, right there at the fusion boundary. So that much you can see from the data here. But if you zoom out, there's the fusion event, now looking at it in a broader context. All hell has broken loose in that region. Each one of these is a highly validated rearrangement event where a piece of DNA is being connected to another piece of DNA. This is chromothripsis, as we've heard about. There were brilliant papers by Sanger and the Vancouver group, and a number of groups noted this at the same time. And Broad had a nice paper, about how this is connected to this is connected to this, that Eric Lander referred to. So this is an amazing event. It must be a high-energy event that's really shattering these pieces and putting them back together again. And the fact that this fusion exists in the context of such an event makes it a very difficult fusion to analyze. Do we have any real confidence that there actually is a simple fusion? I would say no. God knows what that transcript really looks like. You really need RNA-seq at this point. And whether it's relevant or not is another question. As you look at this thing, you also see that in these highly amplified regions there are important oncogenes; for example, here's MDM2, the negative regulator of p53, which is highly amplified. And if we look up here, we see connections from this place on chromosome 12 to chromosome 7. If we go back to a Circos plot here, where I'm now only showing chromosome 12, chromosome 7, and chromosome 2 within this particular GBM and drawing all the lines, including the lines for rearrangements that are within the same chromosome, you see that in this shattering event in chromosome 2, we also brought in other material.
In particular, we brought in material that includes the other very, very important oncogene in GBM, EGFR. And it is highly amplified. We're talking 50 copies. So let's talk about that a little more. Chromothripsis can create these extraordinary events that produce multiple mutations that might be driver events in one shot. EGFR is well studied in GBM, and we have important subtypes, in particular this subtype of EGFR mutation in which exons 2 to 7 are deleted, creating a constitutively active, phosphorylated kind of kinase. Actually, we see that specific mutation in about four of 11 of our samples, and 11 out of 17, when we throw in that extra sample, have some kind of amplification of this. But what's fascinating is that this particular event, this exon 2 to 7 deletion, occurs at low copy in highly amplified cases. So it does suggest that a possible pattern of tumorigenesis is to duplicate EGFR quite a bit and then wait for this mutation. Now, of course, having 50 copies increases the rate of mutation, and so the tumor can get lucky that way. Whether that's a real strategy and a pattern is, I think, an interesting biological question at this point. Let's look at that in a little more detail. Here you have the enormous copy number in that area. Here's the EGFR gene. And among that, we see some evidence for this linkage. So this is skipping over just these exons, and it occurs in a fraction that we estimate as maybe one out of 50 copies. So it's very hard to detect this in your normal rearrangement analysis, and this is a warning to those who do that. One has to be careful. There is enormous, absolutely enormous, read coverage in this area, literally four or five thousand reads covering every part of EGFR, but we don't know which of the 50 copies they're coming from. So here's one thing that Zach Sanborn noticed that was very, very exciting, and I'll throw this out. This is just hot off the press, a couple of days ago.
It is a fact in the literature that GBMs release exosomes. These are encapsulated pieces of cytoplasmic material or nuclear material that can actually escape in their own little lipid bilayer vesicle. Sometimes they're called microvesicles, and they're known to be a part of glioblastoma pathology. In particular, people are actually finding important RNAs in them, and this paper is finding mitochondria, and they're actually detecting some of these things in the blood. Now, this is pretty exciting, because we would of course want to have a way to detect a glioblastoma, or a glioblastoma recurrence, in the blood. So our question was, could we see this in the data? Now, here it is, hot off the press. Zach just showed me this a couple of days ago. Here's this amplification of EGFR in this particular glioblastoma, and this goes way, way off the screen. You see this incredible number of reads that are colored yellow because they pair with something in chromosome 12. So we're looking at chromosome 7 right at the end of EGFR, and these pair with something in chromosome 12, and this is completely the dominant form of read in the tumor. In the blood, most of it looks great except for four reads here, and they pair with exactly the same place that they would pair with if they were tumor reads. If you look a little further into this, and you go down here, in addition there are two reads that show a completely diagnostic and characteristic pattern of mutations that exists in the tumor. So these tumor reads all have a certain pattern of mutations, and almost all of the blood reads are completely normal except these two, which look like split reads that would exhibit that pattern. So altogether we have six reads out of a very large coverage. Is that a tiny amount of tumor DNA that we're detecting in the blood? I would love to follow up with the Broad on that and see if that's true.
Last, copy number states are something that is incredibly important, and as I said, we have allele-specific distinction of copies, so we can plot every piece of DNA. Imagine going through the tumor genome, cutting it up into 100 kb fragments or so, and plotting each fragment's copy number in two dimensions. One is its overall copy number, and one is its minority allele copy number. If you color these dots by the chromosome of origin, then you get this kind of fingerprint for the copy number variation throughout the whole tumor. In this fingerprint, on this axis it's overall copy number, 0, 1, 2, 3, and on this axis it's minority copy number, 0, 1, and this is the level of contamination down here. So a lot of familiar things appear. In particular, in this tumor, we see the homozygous deletion of CDKN2A/B here. We have zero overall copy number and of course zero minority copy number. Red is the color for chromosome 10, and here we have a single-copy loss of chromosome 10, so overall copy number one, minority copy number zero. And here's a lot of normal diploid. You can see the bulk of the data is normal diploid, and then there's some amplification here, and the other amplifications are off the charts. So that is a great way to get a rough idea of a breakdown of the copy numbers. This is similar to the analysis that we heard from Gaddy, and I love this because you're essentially digitizing what is essentially noisy data. We are snapping this to integral values at this point, which is what we have to do if you're then going to do logical analysis downstream, like in PARADIGM. You don't want 2.15 copies; you want two copies or three copies or zero copies exactly. We can do that to a certain extent, except of course, what the hell's going on here, right?
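One simple way to picture the snapping step, together with the contamination baseline mentioned earlier, is a linear mixture model. This is an assumption made here for illustration and is not taken from the talk: the observed copy number of a segment is treated as `(1 - c) * tumor + c * normal`, where `c` is the normal-cell fraction and normal tissue carries two copies overall. The function names and the tolerance are hypothetical.

```python
def tumor_copy_number(observed: float, contamination: float,
                      normal_copies: int = 2) -> float:
    """Invert the assumed mixture observed = (1 - c) * tumor + c * normal."""
    if not 0 <= contamination < 1:
        raise ValueError("contamination must be in [0, 1)")
    return (observed - contamination * normal_copies) / (1 - contamination)


def snap(value: float, tolerance: float = 0.2):
    """Snap to the nearest non-negative integer and report whether the
    value was close enough to be treated as an integral copy state."""
    nearest = max(round(value), 0)
    return nearest, abs(value - nearest) <= tolerance


# With 20% normal contamination, a homozygous deletion still shows
# 0.4 copies' worth of reads, all of them from normal cells.
homozygous_loss = snap(tumor_copy_number(0.4, 0.2))
single_copy = snap(tumor_copy_number(1.2, 0.2))
in_between = snap(tumor_copy_number(1.56, 0.2))
```

Segments flagged as not integral under this kind of model are the interesting ones: they are candidates for events present in only a fraction of the tumor cells.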
So we're definitely getting something that's not snapping to an integral value on chromosome 6p down here, and in other places. Well, it's no surprise what's going on here. This is not a purely clonal tumor. There is an arbitrary fraction, not necessarily integral, of a different type of structural genome that represents a subclone that exists within this tumor. So we can now look at the different structural parts of the tumor in terms of what fraction of the tumor tissue they appear in, and try to identify these subclones in terms of their structural proportion within the overall tumor tissue. In particular, 23% of the tumor shows a pattern in which these mutations happen, and we'll see that these also occur in all of the other fractions that I'll list down here. So this configuration probably represents the first or earliest state that we can detect in this tumor, and it is already identified by the classic events of loss of the tumor suppressor CDKN2A and amplification of EGFR. As you'll see, those will be universal events that occur in all of the subclones within the tumor. Now, if you want to fit this data to an actual scenario, it looks like there's another subclone that incorporates 15% of it, where an additional event has occurred here. Then, independently, it seems like there's a second version of that event that occurs in 8%. And then finally the bulk of the tumor appears to have this second version plus even a new event. So if you summarize this from this structural analysis of the subclones, we might predict that we have a series of events in which there was an original set of drivers and then two variants, and the one variant was more successful and now represents more than half of the tumor mass. Okay, so thank you very much for listening to this, and I want to make sure that I acknowledge the extraordinary team that works behind this.
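The subclone scenario above can be written down as a small lineage tree to check the arithmetic. This is a sketch of the speaker's speculative scenario, not an inference method. The 23%, 15% and 8% figures come from the talk; treating the bulk as the remaining 54% assumes the four fractions sum to the whole tumor, and the clone names are hypothetical labels.

```python
# Percent of tumor tissue in each subclone state (first three from the talk).
FRACTIONS = {"founder": 23, "variant_1": 15, "variant_2": 8}
# Assumption: the bulk subclone is whatever remains of the tumor.
FRACTIONS["variant_2_plus"] = 100 - sum(FRACTIONS.values())

# Assumed lineage (child -> parent), following the scenario in the talk.
PARENT = {
    "founder": None,
    "variant_1": "founder",
    "variant_2": "founder",
    "variant_2_plus": "variant_2",
}


def descendants_and_self(clone: str) -> list:
    """Return the clone plus every clone that descends from it."""
    members = [clone]
    for child, parent in PARENT.items():
        if parent == clone:
            members.extend(descendants_and_self(child))
    return members


def prevalence(clone: str) -> int:
    """Percent of tumor tissue carrying the events that arose in `clone`,
    that is, the clone itself plus all of its descendants."""
    return sum(FRACTIONS[c] for c in descendants_and_self(clone))
```

Under these assumptions `prevalence("founder")` is 100, matching the statement that the founding events are universal, while `prevalence("variant_2")` is 62, consistent with the more successful variant representing more than half of the tumor mass.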
You heard from my co-PI, Josh Stuart, on how we're analyzing this from a systems biology perspective. You heard from Jing about the tumor browser. She gave a great demo and discussion of that, along with Chris and the top model project. A lot of the work that you heard about on the GBM and the tumor browsers is the work of a brilliant grad student, Zach Sanborn, and of Sofie Salama on the analysis of the GBM. And Mark Diekhans is here. He is the chief engineer on the CGHub project, so talk to Mark if you want to know about CGHub. Raise your hand, Mark. An extraordinary team, and I'm very privileged to be working with them. I think you also heard from Sam earlier, and Dan; I didn't have a picture of Dan. Sam and James were on the first talk. Any quick questions? Everybody wants to go. Okay, here we have one.
Thank you, Dr. Haussler. Very nice talk, actually. Your group helped me one time through the internet. Thank you for that.
Oh, thank you.
So my question is about the cancer evolution structure on the last slide you showed. Some markers you don't see in all of the tumor; the most you see is like 54%. I'm just asking for your general comments on how to explain that. Some of the other talks mentioned low frequency, high risk, and what you see is high frequency, relatively low, so why don't you see 100%? In general, how do you explain these kinds of phenomena?
I believe the explanation, which I think most of my colleagues share, is that the tumor represents essentially an ecosystem in which different subclones are competing. The ones that are growing more aggressively are more successful, so you definitely see the results of that competition, and oftentimes one has been so successful that it's essentially completely dominant. We see lots of cases where it is almost totally clonal, but we do see evidence that there are these subclones, and they may actually cooperate with each other, so there may be some stability here.
It might not be a simple replacement by the fittest; there may actually be a cooperativity, where one's doing one job and one's doing another, and so you would get a kind of homeostasis that preserves some kind of complexity in there. So we're not ruling this out as a possibility, and there are some very nice papers that are suggesting that as well. But what's important is that if you're going to hit it with a drug, then that's going to change the rules, and what was a very, very tiny but resistant fraction is now going to emerge. Yes?
I have a question. We know that many of the breakpoints of rearrangements and copy number breakpoints occur in regions of repetitive sequence in the genome, and we're missing them in these kinds of analyses.
Yes.
So we currently don't even have a good estimate of how many of them we are missing. How should we deal with it? We would like to have all the rearrangements in the genome, regardless of whether they happened in repetitive elements or not.
Excellent question, Gaddy. Yes, of course, I think if our read pairs jump over them, we might get them even though we can't locate the exact breakpoint to the base, but there are many cases where it's too repetitive and we're just losing the evidence altogether. So I actually favor a global copy number analysis, and I have a brilliant postdoc, Dan Zerbino, who's doing a logical kind of global copy number analysis of this, based on the necessity to explain all of the data. Sometimes you can actually infer that the only way to explain the whole data, in terms of the copy numbers and the new juxtapositions, is to hypothesize that there actually was another breakpoint that's invisible, and if you go back, you might find there was weak evidence for that. So I think we can use a little bit of overall sophisticated algorithmic logic to try to reconstruct events that must have happened and then go back and validate them.
That's my best shot at that, because it's very difficult with short reads, as you say.
Just to comment on that. At the GCC at the Broad we are playing with ideas of long-insert-size sequencing that would allow us to jump over that. Once that's in production, we could do it on all TCGA samples and then provide a complete map. And there are those cell lines that we are doing in this shared benchmark, so we can try these technologies on them.
That would be fantastic. I think we need to do a complete analysis, and we need these larger inserts. Of course, the sample requirements are larger for that and the expense is a bit larger, so it may not be routine, but we need to do that on a benchmark set. We'll have to go quick. Yes?
Yeah. I really like the way you can infer tumor heterogeneity from those different copy number variations. However, the chromosome 6p arm is mostly the HLA locus, and I was wondering whether some of those subclones that you define might be different cells, like immune cells that have infiltrated the tumor, where you can't map properly to the HLA locus?
We've accounted for what we think is the total amount of infiltration of non-tumor cells, and it's a pretty flat line across the whole tumor. So if that was heterogeneous itself, it might be, yeah, it's an interesting question. But I can't imagine that, at that gross level, there's such structural rearrangement in one of the non-tumor cell types that it would actually offset our baseline and be confused with tumor heterogeneity. But it's an interesting point, something to look at. Yes?
I have a question. So you're able to somehow track the evolution of the tumor and, you know, the added mutations.
Yeah, I mean, this is speculative at this point, but we try.
I'm sure my question is too early, but would you support a DAG, a tree, a single-source type of a model for whatever you're seeing, or do you see any evidence that there are multiple sources, but there is, you know...
We don't see any evidence for multiple sources, but it's way, way too early and way, way too speculative at this point. All right, we better move on. Let's bring on the next speaker.