So the next talk is from Mike Snyder, again, on the non-coding genome in cancer.
Sure. So I'm going to continue the same theme. Like many others, we're very interested in understanding non-coding information and its contribution to disease. I'll tell you about our cancer work, but we have work in other fields in this space, particularly in autism. So I think all of you appreciate that until recently, most disease studies were really focused on exomes, because it was cheaper and easier to get lots of data. I think this slide is from about 18 months ago, not when we launched the work but when we were getting ready to publish our first paper in this area. This is the relative size of the genome to the exome, but when you look at the sequences that had been determined at the time, it's much, much more skewed the other way. There were many, many more exomes than cancer genome sequences. And, of course, this is shifting now that the cost of human genome sequencing is dropping considerably, and, in fact, it's at the point where we pay about $1,200 for a human genome sequence. And so there are getting to be large projects out there, and, in fact, the Pan-Cancer project has just released 2,800 whole human genome sequences, so there'll be a lot of sequences in this space. What we've been doing is using several approaches to try to identify recurrent mutations, or regions highly enriched for mutations, in regulatory non-coding regions, as well as within the genes themselves. And to do this, we've been using a variety of approaches. I'll mention two today.
And we've been basically using the cancer data, but we've also been taking advantage of this database Colin told you about yesterday, RegulomeDB, which is primarily ENCODE data and Roadmap data, but it also includes other data sets from the literature that we felt were high quality, which we processed with the exact same pipelines and put into this database. It's a joint project between our lab and Mike Cherry, and now Alan Boyle, who started it and who's at Michigan. And so what's nice is that, again, it has a bit more information than just ENCODE itself, and it does have the information annotated in ways that I think Colin told you about yesterday.
So we've taken these whole cancer genomes, and we've used a fairly stringent mechanism for identifying the recurrent mutations. We're taking the called variants, and there are several methods for calling variants that are enriched in cancer samples. MuTect is a common one, and VarScan is another one. We've taken the intersection of those, which is throwing out a lot of good mutations, so to speak, but it does give us a higher-quality, more stringent list. At the time we started the study, there were 700 genomes that had been done in TCGA. We downloaded them into the cloud, and then we were shut down at 436, because at the time they didn't seem to like you to do computing in the cloud, even though we were very secure. That was a new concept, and that created all kinds of problems for us and others. But then gradually they said, yes, you can do secure computing in the cloud, so they gave us a break and let us continue. So anyway, that's why it's an odd number. It won't match anything you'll see in the literature, but it is what it is. So these are the tumors. There were eight types, and you can probably guess from the names what most of those are, and we called mutations in these various ones. But we are a bit underpowered, because it's only eight tumor types and the genomes are spread across them.
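As a rough illustration of the caller-intersection step described above, the sketch below keeps only the variants that two callers both report. The `(chrom, pos, ref, alt)` tuple layout, the function name `intersect_calls`, and the toy coordinates are assumptions made for illustration. This is not the pipeline used in the study.

```python
from typing import Iterable, Set, Tuple

# A variant is identified by (chromosome, position, reference allele, alternate allele).
Variant = Tuple[str, int, str, str]


def intersect_calls(caller_a: Iterable[Variant], caller_b: Iterable[Variant]) -> Set[Variant]:
    """Return only the variants reported by both callers (hypothetical helper)."""
    return set(caller_a) & set(caller_b)


# Toy call sets standing in for the output of two somatic variant callers.
calls_a = [("chr1", 100, "C", "T"), ("chr1", 250, "G", "A")]
calls_b = [("chr1", 100, "C", "T"), ("chr2", 900, "A", "G")]

shared = intersect_calls(calls_a, calls_b)
print(sorted(shared))
```

Taking the intersection trades sensitivity for precision: a real mutation found by only one caller is discarded, which is the "throwing out a lot of good mutations" cost mentioned in the talk.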
So we did do a lot of analysis on individual tumors, and there were marginal signals there. But to be honest, the best signals really came out of the aggregated data. Although, as I say, now with the larger data sets we should be able to drill into individual cancers, much like the analysis you just saw.
So basically what we did was we started looking for mutations that would be enriched in regions of interest, using a variety of methods. And our control in most of these cases is simulated regulatory regions that have similar characteristics. And specifically, the controls you need to run for cancer analysis are varied. For one, the kinds of mutations differ between the different kinds of tumors. So, as I say, lung cancer, as you might imagine, especially for smokers, has a certain distribution. Melanoma and other skin cancers have other kinds, and so on. So we actually control for the mutational spectrum across the individual tumor types, and we normalize per tumor. We also control for variation in gene expression. We control for replication timing, because it turns out, as many of you may know, there are more mutations in the late-replicating regions than in the earlier ones. So we control for that. And then we also control for base composition as well. So all of these are taken into account in the simulations.
So then what we do is we start looking for enriched regions, for different kinds of analyses. In one case, like this one here, we basically look at regions bound by transcription factors, using RegulomeDB or ENCODE, and see whether we can find enrichments in mutations for different kinds of transcription factors. And these are some of the positive results we've obtained. So basically, for these different factors, in certain kinds of tumors, you will see enrichments in mutations in the binding sites for these particular factors, again implying that they do have a role in cancer.
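To make the "compare against simulated regions" idea concrete, here is a minimal sketch of an empirical enrichment test. It assumes a single toy chromosome, half-open intervals, and control regions placed uniformly at random. The real controls described in the talk are matched for mutational spectrum, expression, replication timing, and base composition, none of which this toy version models. All names (`count_in_regions`, `empirical_enrichment`) are hypothetical.

```python
import random
from typing import List, Sequence, Tuple

Region = Tuple[int, int]  # half-open interval [start, end) on one toy chromosome


def count_in_regions(mutations: Sequence[int], regions: Sequence[Region]) -> int:
    """Count mutation positions that fall inside any of the regions."""
    return sum(any(start <= m < end for start, end in regions) for m in mutations)


def empirical_enrichment(
    mutations: Sequence[int],
    observed_regions: Sequence[Region],
    simulated_sets: Sequence[Sequence[Region]],
) -> Tuple[float, float]:
    """Return (fold enrichment, empirical p-value) against simulated region sets."""
    observed = count_in_regions(mutations, observed_regions)
    null = [count_in_regions(mutations, sim) for sim in simulated_sets]
    mean_null = sum(null) / len(null)
    fold = observed / mean_null if mean_null > 0 else float("inf")
    # Add-one correction so the empirical p-value is never exactly zero.
    p_value = (1 + sum(n >= observed for n in null)) / (1 + len(null))
    return fold, p_value


# Toy data: a cluster of mutations inside one "binding site" plus background.
rng = random.Random(0)
chrom_len, width = 1000, 20
mutations = list(range(100, 120)) + [500, 900]
observed_regions = [(100, 120)]
# Assumption: uniform random placement stands in for properly matched controls.
simulated_sets: List[List[Region]] = []
for _ in range(99):
    start = rng.randrange(chrom_len - width + 1)
    simulated_sets.append([(start, start + width)])

fold, p_value = empirical_enrichment(mutations, observed_regions, simulated_sets)
print(f"fold enrichment = {fold:.1f}, empirical p = {p_value:.3f}")
```

The quality of such a test depends almost entirely on how well the simulated regions match the real ones, which is why the covariates listed above matter.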
We can also drill into the individual sequences, in this case using motif analysis, to see where those mutations may lie. And specifically, here are two examples that were successful, where you can show that the mutations don't just lie anywhere in the sequences. They can lie at very specific residues. In this case, the central two seem to be the most mutated ones. In this case, for SB1, it's actually more varied across individual residues. What we found for ours, which differs from the other case, and if you think about it makes sense, is that most of our mutations are actually loss-of-function mutations. Not all, there are some that seem to be gain of function, but most appear to be loss of function. And so you can show that, relative to the PWM, most deviate from it. And so, let's see, well, I guess here are the various factors that we've analyzed. And the bottom line is they're, again, mutating away from the PWM. There are exceptions, individual cases.
So then we can ask, is there recurrence happening at very specific genomic loci? And I think I won't get into the details of this so much, but basically the bottom line is you're looking for an enrichment over certain regions, and we use fixed windows for this, of varying sizes, say 50 or higher, and survey the genome. And when we did this, we came up with about 123 sites, sorry, 123 regions. Let's see, I thought it was 200. Okay, I don't remember now what the difference is between the 123 and the 200. I should. Colin, are you here? Yep, I lost him at a key time. All right, anyway, so we found, I believe it's 123 sites, regulatory regions. And the top two are actually telomerase regions. And these are well-known sites that are mutated. In this case, they're actually gain-of-function mutations that activate. And then we found, again, 121 other regulatory regions.
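A fixed-window recurrence scan of the kind just described might be sketched as follows. This is a simplified illustration under stated assumptions: non-overlapping windows, a single toy chromosome, and a uniform Poisson background. The corrections for mutational spectrum, expression, replication timing, and base composition described in the talk are deliberately left out, and the names `poisson_sf` and `scan_fixed_windows` are hypothetical.

```python
import math
from collections import defaultdict
from typing import Dict, Iterable, List, Set, Tuple


def poisson_sf(k: int, lam: float) -> float:
    """P(X >= k) for X ~ Poisson(lam)."""
    cdf_below_k = sum(math.exp(-lam) * lam**i / math.factorial(i) for i in range(k))
    return max(0.0, 1.0 - cdf_below_k)


def scan_fixed_windows(
    mutations: Iterable[Tuple[str, int]], genome_length: int, window: int = 50
) -> List[Tuple[int, int, float]]:
    """Rank windows by how many distinct samples carry a mutation in them.

    mutations: (sample_id, position) pairs.
    Returns (window_start, n_samples, p_value) tuples, most significant first.
    """
    samples_by_window: Dict[int, Set[str]] = defaultdict(set)
    for sample, pos in mutations:
        samples_by_window[pos // window * window].add(sample)

    n_windows = math.ceil(genome_length / window)
    total_hits = sum(len(samples) for samples in samples_by_window.values())
    lam = total_hits / n_windows  # assumption: uniform background rate per window

    results = [
        (start, len(samples), poisson_sf(len(samples), lam))
        for start, samples in samples_by_window.items()
    ]
    return sorted(results, key=lambda r: r[2])


# Toy data: three samples share a mutated window, one sample has a lone hit.
toy = [("s1", 120), ("s2", 130), ("s3", 149), ("s1", 125), ("s1", 700)]
hits = scan_fixed_windows(toy, genome_length=1000, window=50)
for start, n_samples, p in hits:
    print(start, n_samples, f"{p:.4f}")
```

Counting distinct samples per window, rather than raw mutations, is what makes this a recurrence test: several hits in one hypermutated tumor count once.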
I think the difference with the 200 is that we can map them to more than one gene, and I think that's the point of this slide. Many of these sites, in fact the majority, are not in promoter regions or immediately adjacent to a gene, so they can map to one or more genes. And so I think we assigned about 200 genes that were either in the vicinity or linked using some of the interaction data. And a number of these were, in fact, known oncogenes or genes implicated in oncogenesis. Quite a few of them were, but the vast majority either weren't assigned to known genes, or maybe we were not mapping them to the right locus at the time.
So, similar to what was presented before, we took nine of these regions and tested them in reporter assays. And two of them, which were predicted to be loss-of-function mutations, actually do lose activity in these luciferase assays. So this is the original reference locus, and those are just two different constructs. Here's another one. And this is the telomerase control, which is a gain-of-function mutation. So that one goes up. So the vast majority of ours didn't validate in this simple, simplistic assay.
The other thing we've done, which was a little more sophisticated and came after this was published, is that rather than use fixed windows, we look for increased density overall, so we let the size of the window, if you will, vary. And then we just basically looked for enrichment across regions of any size, from very small, even single bases, up to several hundred base pairs. And in this case, we applied it to exomes. We wanted to get more power. And so you might think, why is this in an ENCODE talk? Well, it turns out that even in exomes there are plenty of promoter regions sitting right upstream, because the captures go just beyond the five prime end in many cases. And in fact, we actually do pick up quite a few promoter ones.
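One simple way to let the region size vary, as described above, is to group mutations by the gaps between neighbours instead of binning them into fixed windows. The sketch below uses plain one-dimensional gap-based clustering as a stand-in. It is not the scoring pipeline from the talk, and the function name `density_regions` and the parameters `max_gap` and `min_count` are assumptions chosen for illustration.

```python
from typing import List, Sequence, Tuple


def density_regions(
    positions: Sequence[int], max_gap: int = 10, min_count: int = 3
) -> List[Tuple[int, int, int]]:
    """Group mutation positions into variable-size regions.

    Neighbouring positions at most max_gap apart join the same region.
    Regions with fewer than min_count mutations are dropped.
    Returns (start, end, n_mutations) with inclusive coordinates.
    """
    regions: List[Tuple[int, int, int]] = []
    cluster: List[int] = []
    for pos in sorted(positions):
        # A gap larger than max_gap closes the current cluster.
        if cluster and pos - cluster[-1] > max_gap:
            if len(cluster) >= min_count:
                regions.append((cluster[0], cluster[-1], len(cluster)))
            cluster = []
        cluster.append(pos)
    if len(cluster) >= min_count:
        regions.append((cluster[0], cluster[-1], len(cluster)))
    return regions


# Toy positions pooled across samples: a single-base hotspot, a short
# mutated stretch, and two isolated mutations.
positions = [100, 100, 100, 100, 400, 405, 412, 430, 900]
regions = density_regions(positions)
print(regions)
```

With this kind of grouping, a single recurrently hit base and a stretch of several hundred base pairs can both come out as regions, which fixed windows cannot represent equally well. Each region would still need to be scored against a background model before being called significant.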
And we've run all the same corrections I mentioned before for sequence composition, replication timing, et cetera. There are many exomes. Now there are many more than this, but it was about 4,500 at the time. I think these slides didn't come out that great, sorry. But I think I can still explain what's going on here. The bottom line is, we call these significantly mutated regions. And you can get really huge enrichments when you have this kind of power, 4,500 exomes of data. And basically, I won't go through all the details of the scoring pipeline, but we wind up getting strong enrichment in many, many different oncogenes. So again, this is probably hard to see, but I think you'll see some of the points I want to make in a minute. So the bottom line is that all the top ones are very well-established oncogenes.
It was the case, as I mentioned, that many of these enriched regions, these significantly mutated regions, do lie in binding sites, either in five prime UTRs or three prime UTRs, and also in promoter regions. And this is really hard to see, sorry. Well, maybe you can squint or something here. And the bottom line is you will see a lot of them are in the promoter regions. And you can assign this either based on motif scores or by using ENCODE data to help validate that the motifs are the ones you think are really hit, and you can basically see that these are mutated in regulatory regions. Although I think you do have to be a bit careful about assigning function, at least for the UTR ones. For some of them that landed in UTRs, we don't think they're necessarily affecting translation per se. They actually may be affecting gene expression. And these are some of the ones we had. We had several that landed in UTRs, as I say.
What I liked about the variable approach is that you can actually drill into specific locations, whether you're outside the gene or inside the gene. So the example I'm going to show you here is that you can drill in and see hypermutated regions.
Typically this has been done for cancer across the entire gene. But what we did here, because we're finding recurrent mutations, is drill into subsections of a gene. And it turns out that a lot of the mutations actually landed at protein-protein interfaces, which was quite interesting. And then some of these regions were mutated in some cancers but not others. And here's where having 4,500 exomes is very valuable. So as an example, this is just the PIK3CA gene, which is mutated in many different cancer types. These are the different cancers, and these are the different regions that are mutated. So you'll see this particular region is mutated in virtually all cancer types that we've analyzed here. But other regions are only mutated in some cancer types. And so you're seeing very specific subregions of this protein that are obviously more important to one kind of cancer than another. And so it's nice to be able to get this kind of resolution. And I think the way you get this is by having huge amounts of power and lots of genome sequences.
So in summary, what I've shown you is that we can identify regulatory regions that are enriched for mutations in cancer. They often lie near known oncogenes. They can disrupt specific motifs of very specific factors, and we can implicate certain kinds of factors in very specific cancers. As I mentioned before, the density analysis is nice for identifying very specific coding regions. And of course, now we want to apply these approaches to the whole genome. And actually, we're trying to see if some of these regions can be put into clinical interpretation as well, which I think is promising as you get larger numbers. And with these 2,800 genomes, that's the next set that's coming along. And of course, there'll be tens of thousands after that. So I think we should be able to get highly annotated non-coding regions.
And I think, again, the recurrence information and the ENCODE data will be extremely valuable for these sorts of annotations, as they are for other kinds of diseases. So these are the folks who did the work. None of it was done by me. Colin Melton, whom you heard yesterday, did the first analysis I mentioned. The density work was done by Carlos. There was help from John Sennick and Jason Ruder on reporter assays. Alan Boyle is the one who made RegulomeDB. And the last analysis was a collaboration with Will Greenleaf's lab. And of course, if you want to learn more about this whole space, I'm making a plug for my new book that's come out. So feel free to go buy it and have fun with it. All right. So anyway, that's my story. If you have questions, I'm happy to answer them.
Great work. So I'm interested in the transcription factor binding site mutations that you mentioned there. I'm curious because, even though there's a consensus, my understanding is that a transcription factor binding motif can tolerate a lot of different variations. So why can a point mutation inside the motif, but at a less conserved position, have a functional impact? I'm curious about that.
Yeah, it may be the specific kind of mutation. I didn't see what they were switched to. Yeah, you're referring, presumably, to this one here. There it is. Yeah, the CBPD. Well, that C is clearly a highly conserved one. And presumably, I guess all I can say is, first of all, maybe the C isn't the only important thing. That is the reference one, maybe the two. But if you think about it, it should be both these residues, because it is a symmetrical sequence. So offhand, you would think both should be important. So I'm guessing this is the more likely conserved one. And so our data would suggest those are probably the two most important residues of the bunch, at least in this context. I mean, your point is well taken otherwise.
And certainly the same thing holds here. The highest ones are in these Gs, which are conserved. What's going on here isn't clear. And I think that illustrates, though, a good problem, which is understanding these motifs in the native context. So a lot of these PWMs come from in vitro experiments, which don't really match the ChIP experiments, for those who've done this sort of thing. Sometimes they match, and sometimes they match pretty well. But there are plenty of cases where they don't match that well. And there are plenty of cases where you will find alternative motifs besides the one that's in the literature, suggesting there may be a different binding site in context with something else. And so I think the in vivo ones matter, and I do think that would be a good lesson for this group to take away. Don't assume the PWMs you see from a SELEX experiment are going to match what you see in the ChIP ones. They generally kind of overlap.
Exactly. If you look at the strongest of the ChIP-seq peaks, the ones with a canonical motif inside the peak are, actually, a minority.
Yeah, and we believe there's probably a biological reason for that. Especially when you're turning genes on and off, you don't want to cement your transcription factor on permanently. You probably don't want the strongest binding site available if you want to turn that gene off later, if it's an activator. It's a great question, though.
Were the sites that you found in the three prime UTR typically microRNA targets?
Yeah, I don't remember the answer to that. I know we did look at it, so maybe.
Very nice presentation. Having worked with luciferase reporter assays for many years, it appears that the bioinformatics people are now beginning to adopt some of that. Maybe they've been doing it for a while. But I would actually just voice a word of caution about using those. They're very quickly done, and I understand the screening aspects of this.
But the false positives and false negatives from putting together these heterologous promoter regions and then doing the transfections, as I'm sure you're well aware, are very problematic. And I think that the assays and the plasmids that you derive and all those sorts of things all impact the results. So it gives you a warm and fuzzy feeling if it looks like things are up or down. And we still use them just to gain that warm and fuzzy feeling. But in reality, I think it's something that you have to be extremely cautious about, particularly when you're putting active segments, or potentially just random segments, next to a heterologous promoter. So just a word of caution.
Yeah, no, as you might imagine, we're pretty aware of this. And the purists hate those assays. It is an assay. That's how I treat it. It's just one more piece of information. There's no perfect experiment for anything we do in life. But it is an assay. We have a lot of controls that we run. And you're right, some random ones will go on. We always do the scrambled sequence as well. But it is what it is. You know, the problem with the knockout assay, which is a pure assay, if you go in and CRISPR it out, is that there's a lot of redundancy. And so if you don't see something, you don't know whether it's there either. So, you know, it's a bit of information. I don't disagree with what you said. And we're cognizant of that.
Yeah, so actually, I have a very simple question. You mentioned that you chose eight cancer types for the study. Are there any particular reasons for that choice?
Yeah, they had the most sequenced genomes at the time we were downloading the data. So that's why we chose them, because we were hoping to get a strong enough signal even within a cancer type. And I think we were marginal with that number. And they're roughly equal. So 436 divided by 8 is roughly how many we had of each. So, yeah. Well, by the way, I thought of one more thing on the luciferase assay.
You'll never see this kind of thing reported. But we actually did one where we were so sure about a hypermutated region. It was actually coming out of a cancer study as well. Very, very heavily mutated in cancer. And it sits right upstream of an AUG. So we were so sure it was going to affect translation. And we did a zillion reporter assays and had zero effect. So it actually turned us off thinking it was having an effect on translation, like we had thought. So people rarely use information that way. They use it to show what they're looking for. In this case, it was very disturbing to us, because it didn't give us the result we were looking for. So that one's not published, as you might imagine. So, yeah.
To your left. There's one more.
Oh, yeah.
Yeah, we're here. Yeah, hi. I'm just wondering whether the mutations within regulatory elements are mutually exclusive with mutations in the putative target gene?
Yeah, we did look at that. And I don't know the answer to that. I'd have to go back and look. Yeah, it's a great question, though. Yeah, you would think they could be separate. Although they don't have to be, because there are plenty of cases where genes are both mutated and amplified in cancer, as you probably know. So it's conceivable that they could co-occur. That's a great question, though. Thank you.