Good morning everyone. Let's see. What do I need to do? Do I need to do the matching here to get it to show up? This says nothing's detected. What am I doing wrong here? Oh, it's not plugged in. It is displays. John, you were using a Mac, right? Sometimes you just need to plug it back in. Yeah. I use this all the time for these. I've got my own. There we go. There you go. I guess that was equivalent to the "did you turn it off and back on?" solution. Okay. Good there. Okay. Yeah, I'll move around so probably you have a good idea. All right. Sorry. We need to turn it on now. All right. Be patient. Me? Okay.
Good morning everyone. So thank you. I'm sort of a last-minute invitee, and I had plane problems, so I'm glad I made it here, I guess in time for the talk. I think mine's going to be pretty different from what you've just heard. It's a little less mechanistic. Even though I'm very interested in transcription mechanisms, we do a lot of biomedical research, both in my lab, with lots of collaborators, and with the other faculty at our institute as well. And that really means a lot of different kinds of diseases. So I'm going to tell you about several examples of those and how we use ENCODE data. I do need to disclose, as was just mentioned, that I am part of ENCODE. Did we start in 2003? So it's been for a good part of my children's lives. Our group is one of the 16 or 20 groups, whatever the total number is, and it includes these people here. So we generate a lot of ENCODE data, but we also use it in a big way, mostly with other funding mechanisms, because our goal in ENCODE is to produce the encyclopedia. Hopefully you all know this, but I want to give a simple version of it. We're trying to figure out what's important in the genome. Not all the debate about function, et cetera, but what do the base pairs do in the genome? 
Lots of them in the human genome. And obviously that means other mammalian genomes, and actually I'm not going to give an example, but some of this even goes over into plants. We have plant genomics folks at our institute, and it's remarkable how you can take not only some of the principles, but even some of the base-pair ideas there. So we want to annotate the human genome, and the other goal is to get all the data out to everyone. I think this has been a hallmark of NHGRI from the very beginning: we produce a lot of community-type data and we get it out there rapidly, and in this case of course for free, for everybody to do whatever they want with it. So I'm going to give a few examples of how we've used it in our work, and that means my group, but other groups at HudsonAlpha, as well as many collaborators. We don't do anything completely on our own.
So here's the one I'll spend the most time on. This is really looking at genetics, and it's undiagnosed, relatively rare genetic diseases, but everything I'm going to say holds for common disease as well, with some complications, and it holds for cancer and other kinds of places where you have allelic variation in the genome. So here's one that's quite striking, and by the way, this is funded by another NHGRI grant, a project called the CSER project. I've been working on human genetics for more than 30 years, and most of the old work was linkage analysis: you had families, and you would work your way down and walk your way along and eventually find a mutation in a gene that you thought was responsible for the disorder, and there were many things, statistics, et cetera, that would help you do that. But what's changed is that we've identified, I don't know, maybe 5,045 of those Mendelian diseases in humans. There are probably at least that many more that we haven't found yet. 
There may be a lot more, and many of them are not even defined as genetic because they're so rare, and they don't necessarily run in families, and I'll tell you why that's the case. But this is a worldwide issue. When we talk about rare diseases in the aggregate, it's probably 20 million people in the United States alone. So it's huge numbers of people, and about 2% to 3% of kids born worldwide, it varies from population to population, have some kind of problem related to these. The majority of those appear to have, or I should say are likely to have, a genetic component, and may even be strictly caused by genetic variation. So the reason it's important to figure these out is that, even if they're rare individually, in the aggregate they're not so rare, and every individual case has a huge impact on the families who have these children. Many of these children live a full lifespan, so they might be adult offspring of the parents who've had these children. And it just causes all sorts of problems. You see this in the ones that you know about, but especially in the ones where you have undetermined causes. There's a lot of guilt, et cetera, a whole bunch of things, and it's an economic problem too, because people spend a lot of money, sometimes on invasive testing, going from doctor to doctor. This is not the physicians' fault. These are rare, undefined, often with amorphous types of symptoms. So if we can figure out what the causes are, it'd be great.
So we're one of several groups around the country, actually a growing number of groups. This particular one is funded to try to figure out the causes in 500 kids. And we're focusing on ones with mental issues, what we call intellectual delay and intellectual decline, developmental delay. And this is the team here. It's led by Greg Cooper, myself, and Greg Barsh at HudsonAlpha, with a wonderful group of clinicians and bioethics folks working on this. 
And our goal is to sequence, as I said, 500, but we're sequencing trios, both parents and one child. I'll mention why in a minute. So we started out with this, and the grant was funded to do it with exome sequencing. So the first, I guess, 75% of those that we've sequenced so far were exomes for the families. And the thing that shocks me is that the diagnostic rate is really high. It's at least 25% that are definitive, and another 20% or 19% that are almost surely right. They're ones where, as soon as we find another child, for instance, with the mutation, we would believe it. And then there's probably this 50-whatever percent that we can't figure out, and the diagnoses there will increase as we gain new knowledge and new technologies. So to me, that's remarkable, because we're returning results to about 40% of these families. That's been a couple hundred now.
We switched. HudsonAlpha bought into the Illumina, and I'm not advertising for the company, believe me, the Illumina X Ten system, which allows whole genome sequencing at 30X or greater. It's a lot cheaper and faster, et cetera. So it's gotten to the point where it costs us about the same to sequence a whole genome as an exome. So we switched over to whole genomes. We've finished 30 of these, and we have more in progress. And the result of that is that the diagnostic rate goes up maybe by 10%. Others have seen this as well. A couple of other groups have switched over to whole genome sequencing for these. And just in these 30, and I think this number is going to grow, we found three cases where we are almost certain, well, we actually are certain in a couple of those, that it's a regulatory change. You don't find those in exome sequencing. And as you probably already know, or certainly will learn from this meeting, there will be more DNA sequence variants in regulatory elements that affect function than there are in the coding regions. So this is important. 
That's one reason to switch to whole genome. And we relied heavily on ENCODE to help make those decisions. I'm about to tell you the ENCODE connection here. If you sequence any of you in the room, you'll find 100 stop codons and splice-site mutations, just looking at exomes. Sequencing your whole genome, you find millions of variants between each of you, and even between your two copies. Which of those are important? Or which one is important, because often it is just one. So that's the hard part of this project: the recruiting, and then not the sequencing, but that kind of analysis. So the problem is that there's a huge number of variants. Most of those, of course, are not important. So how do you figure out which ones at least have an effect on the molecular biology of the organism, not necessarily the outward phenotype? Does this mutation affect this gene? And that could be up expression, down expression, weird expression in the wrong place, et cetera. It is also really interesting to understand how this might affect the organism. I'm not going to talk about that here, but we have a number of things, certainly functional assays can do that, that help us say this is almost surely the cause, or is the cause, in this child.
So we've relied heavily on this set of programs. It's a suite called CADD, developed by Greg Cooper at HudsonAlpha and Jay Shendure at UW and their lab members. It was published, I think, just last year. It's a set of algorithms, I should say, that go through and score every single base pair in the genome as to whether it's important or not, at least with regard to molecular function. And this is a ROC curve. 
The top line shows you the performance of CADD versus any of the single conservation scores and other kinds of methods that we all use to say whether something is functional, especially in a protein. The difference here is that this does regulatory and non-coding variants as well. And it constantly improves. I'm not showing this, but as more data come in, the algorithm gets better and better. So it performs extremely well. And what I will say, though I'm not showing all the underlying data here, is that a major part of what goes into the algorithm is ENCODE data. That's why it changes frequently. As more ENCODE data go in, the algorithm gets better and better. Roadmap data and other literature data go in as well. But because the big ENCODE projects are producing so much data, they're having an impact on this.
This is just a good example. So you get a score, and the higher the score is, the more likely the variant is to be pathogenic. Here are ones that were clearly chosen to be non-pathogenic. They were control data in various ways. They show up with scores under 10 or so. And then this is just showing, for Elise and me, beta-thalassemia and other phenotypes as well, cases where there are sequence variants that are known to be pathogenic. They're absolutely known to cause the disease in these people, and those variants score high. So it's a really nice algorithm. It's free to use. If you're in a company, I think you have to pay some license fee to UW and HudsonAlpha, but otherwise it's free to use, and it's really valuable and really easy to use. So I encourage you to look at it, even if you're not thinking about this now, and you can go in both directions. You have a disease and you're looking for the variant. Or you can say, I have a variant that I just happened to find; is it important for molecular function? So now, I didn't pay attention to when I started, so give me a five-minute warning, please, if you would. 
I'm going to try to cover five of these. The others are shorter, but maybe it'll be four. So let me tell you about one story. A graduate student in my group, well, she's now a postdoc, Brittany Lasseigne, and Jim Brooks at Stanford University, a friend and colleague that we collaborated with, have studied kidney cancer. And we identified markers, or Brittany and Jim did, I should say, DNA methylation markers that are, well, I say 100%, but it's actually 99.0%. This ROC curve shows you the sensitivity and specificity of this, and you want the area under this curve to be as close to one as possible. So the data that we generated in this study from about 135 tumors gave 20 CpGs, methylated or un-methylated, 20 of these markers that were 99% predictive of whether this is kidney cancer or not. That replicated in data from another large project, TCGA, that came out after we had generated ours. They weren't studying this particular problem, but they had data that we then analyzed. Again, that's one of the great values of large data sets that everybody has access to: we could replicate and show that this was good.
This will be really important if we can detect those methylation markers in the blood or urine, because it's one thing to say this is in the tumor, when we already know we have a tumor. Kidney cancer gives you big tumors, usually big by the time they're detected, because you don't have symptoms. So it's very important to be able to detect it early. So we're working hard to see if we can see these positive markers. So how did ENCODE play a role in this? It turns out that for these kidney tumors, we didn't have RNA. They were older samples and there was no RNA left, so we did DNA methylation and copy number variation from those 135 tumors, tumor versus the normal tissue that was adjacent to it. 
And what was interesting about this, and this is where it's important to have an annotated genome, is that we found DNA methylation events in various places in the genome that we thought might turn down transcription. By using ENCODE data, we found that those DNA methylation events were correlated with low transcription levels. That was in some of the patients, and then in other patients, we found a deletion in the same region. So this is a good example of how two different, orthogonal types of assays combine to give you more power. I mean, 135 samples is not so bad for a study like this, but you get a lot more power by having some of them with this and some of them with that. And we're spending a fair amount of time on this, and ENCODE has as well: how do you integrate those different data sets quickly? Because there are lots of places in the genome where you need to look, and a lot of this right now has been manual. So we're working hard on that. Another way to state this is that there are two ways to lose gene expression. Or sorry, we measure two ways. And both of them show places where they are locked in.
So these diagnostic markers are already being used, because even if you know you have a tumor, they help you do the subtypes. I had the words on the first slide, but I didn't say it: they distinguish whether the tumors are going to be aggressive or not. And that determines the way that the physicians treat the disease. This is just corroborating how we went from just methylation, which caught some of the patients, and then got more when we did copy number variation, adding up to that much. So there's another problem. 
This is sort of related: can we use ENCODE data to try to figure out which of the places in the genome that change in cancer are the ones that we ought to pay the most attention to? I think a lot of you probably work on cancer, and you know that many, many things change in cancer cells and tumors and even the tumor cell lines. So we decided to see if we could, and this is a little bit related to John looking at hypersensitive sites in various ways, but we looked at transcription factor binding data. This is something we've just started, to see if we can prioritize the huge number of transcript variant changes that happen. This particular one is in prostate cancer. We're applying this to several others. I think this is the prostate cancer one. Yes, it is. We look in those regions, especially in the promoters, but also in the long-distance elements, and ask whether there are transcription factor binding sites that are over-represented, and then we focus in on those, because we think that gives us a hint that they might be the more important ones. So first it's the binding motifs themselves, and then, because we have ENCODE data in lots of cell lines for quite a few transcription factors, we see whether there are actual binding events at those sites, and that's starting to be quite helpful. The example I'm showing, in prostate cancer, is some of the factors that showed up when we did that type of analysis, and so we're now focusing in on those and the genes that they regulate.
Okay, now let me do this. This is my next-to-last one, I think, and it's related to the other one. It's another cancer study. This is a breast cancer study that a graduate student in my lab, Joy Agee, and a former postdoc, Katie Varley, have done and are just about to submit for publication. 
So there are multiple types of breast cancer, but there are two major classes, basal and luminal. Triple-negative breast cancer is typically basal, and those are really hard to treat. In fact, there are not really very good treatments for them at all. We'd like to understand the differences between those subtypes. A lot of people study this, and there's lots of data already out there, but what we decided was that we weren't going to go do ChIP. We do a lot of ChIP in our group as part of ENCODE and for other studies as well. We didn't want to do ChIP on dozens or hundreds of tumors or tumor cell lines for a whole bunch of transcription factors. It just wasn't practical. It may be practical in the not-too-distant future to do that, thousands and thousands of assays on the few dozen dissected breast cancer tumors that we had. So what we did, and this has now grown, because this was from a couple of years ago, is use the ENCODE data that existed for 70 cell lines and 150 transcription factors. We generated part of that, and Mike Snyder's lab and John's lab and others generated it as part of ENCODE. So we had those data, and then we took some tumor cell lines for breast cancer, some luminal and some basal, maybe a couple dozen of each. And we measured DNA methylation. Now it's just one assay that we're doing. It's relatively simple and not too expensive. We measured DNA methylation, and then we looked for places in those tumor types where there were under-methylated regions. So we found a bunch of under-methylated regions, especially around promoters, but in some other places too. Then we asked the question: which of those regions have transcription factors bound, no matter what the cell type is? There are some breast cancer cell lines in here too, but it doesn't really matter what the cell type is. Which of them have transcription factors bound? 
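The question just posed, which under-methylated regions overlap a transcription factor binding site from any cell type, comes down to an interval-overlap count. The following is only a minimal Python sketch of that idea, not the pipeline used in the study: the function names, the half-open coordinate convention, the made-up coordinates, and the placeholder factor names `TF_A` and `TF_B` are all assumptions for illustration.

```python
from collections import Counter


def overlaps(a_start, a_end, b_start, b_end):
    """True if half-open intervals [a_start, a_end) and [b_start, b_end) share a base."""
    return a_start < b_end and b_start < a_end


def count_tf_hits(regions, peaks):
    """Count, per transcription factor, how many regions overlap at least one peak.

    regions: list of (chrom, start, end) under-methylated regions (hypothetical input)
    peaks:   list of (chrom, start, end, tf_name) binding peaks pooled across cell types
    """
    hits = Counter()
    for chrom, r_start, r_end in regions:
        # Use a set so each factor is counted at most once per region
        bound = {
            tf
            for p_chrom, p_start, p_end, tf in peaks
            if p_chrom == chrom and overlaps(r_start, r_end, p_start, p_end)
        }
        hits.update(bound)
    return hits


# Made-up coordinates, for illustration only
regions = [("chr1", 100, 300), ("chr1", 1000, 1200), ("chr2", 50, 150)]
peaks = [
    ("chr1", 250, 400, "TF_A"),
    ("chr1", 280, 320, "TF_A"),
    ("chr1", 1100, 1150, "TF_B"),
    ("chr2", 140, 200, "TF_A"),
    ("chr2", 150, 200, "TF_B"),  # starts exactly where the chr2 region ends: no overlap
]
hits = count_tf_hits(regions, peaks)
print(hits)
```

Comparing counts like these between luminal and basal region sets, or against a background set of regions, would be the natural next step; for genome-scale inputs an interval tree or a sorted sweep would replace the nested scan used here.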
And by doing that, what Joy and Katie found is a big difference between these in terms of the methylation events. When they looked at the places that were un-methylated in the luminal lines, where un-methylated regions were more common, and these are ER and PR positive, sometimes HER2 positive as well, breast cancer lines or tumors, they found transcription factors, and they're greatly enriched, by the way, seven- or 10-fold or so. One of course was known, estrogen receptor, and one was not expected, FOXA1. So that was helpful to at least know that these are possibly master regulators, or at least important regulators, of breast cancer. And the particularly new ones were for the basal lines and basal tumors. We found two transcription factors there. They've been very interesting. They're significantly over-represented: glucocorticoid receptor and STAT3. So we're not claiming these are the only regulators, but they're clearly important ones. There's a lot of follow-on data for this to show. And one of the hopes, and what we're trying to do, is to see, almost like with an iPS line, whether we can take a basal line and turn it into something more luminal by expressing those transcription factors, or maybe repressing them. Joy has done some of those experiments and they look pretty promising.
Okay, one more? Five minutes? Okay, good. I'm not doing so badly. Oh, and hopefully it was obvious that that last study was totally dependent on the information and annotation of the genome. So this is a short one. People do RNA-seq all over the place. It's not just in ENCODE, obviously, although some of the first ones came out of labs that were in the first development of the assay. But what I'm about to show you requires a lot of annotation and understanding of the architecture of the genome and where the functional pieces are. So this is, again, Katie Varley's work. 
In breast cancer, I should put breast cancer here. She found in breast cancer seven genes that were pointing in the same direction, where the distance between them was a few kb or less, and where they had fusion transcripts. This is not like the Philadelphia chromosome, where the chromosome is rearranged and you get a fusion transcript. These are ones that are fusing with the genome in its normal state. And just as an aside, by the way, we sequenced these and found no cis-acting variants that were responsible for that. So it's probably something that's going haywire with the regulatory network, the splicing or polyadenylation network. This was published a little while ago. One thing that's really valuable about this is that for three of those, one of the partners is a membrane-bound protein. Cancer drug people love this, because of the idea that you might have something on the surface that's different, and these are breast cancer-specific. There are some that are not specific, but these are the specific ones. And Katie just showed that they are expressed on the surface. The reason this is potentially exciting is that you could then develop a drug that would recognize that aberrant protein. We are working with Seattle Genetics to try to do that now.
All right, I know, two more minutes, and this will take less, I hope. There's a big problem in the whole field of measuring transcription factor binding in cells. We do a lot in tissues. You can do it in all sorts of things. One of the problems is that for practically every transcription factor, I'm exaggerating a little bit, you get thousands of genuine DNA binding events. And that's in primary cells, mixed cell-type tissues, and tissue culture lines. And we don't believe that most of those are doing anything. 
They're spurious, they're evolutionary, or they're places in the genome that have maybe changed rapidly, the way John was talking about, that can still bind but are not in or around regulatory elements. So which of those are important? That's a really valuable question to ask. And you've probably already heard more than one answer. I'm sorry I didn't hear the earlier talks other than John's, but you've heard we can do all sorts of tests to say this element or this region is important. Almost everything we've done in ENCODE is a correlation. You say there are DNase hypersensitive sites here, there's something here, there's something here. There are transcription factors bound, there are different histone marks, et cetera. Can we actually test those? And of course you can. You can do CRISPR-Cas and change individual ones. In fact some ENCODE folks have done that a little bit. But what we decided to do was to take an old idea, or a new version of it that Barak Cohen's lab developed, and Jay Shendure and 10 other groups have done similar kinds of things, where you test putative transcription elements for function in a very artificial system: in tissue culture cells, in a transient transfection assay. Yes, I did these in graduate school. They are artificial, and you still learn an enormous amount, because this experiment is at least 10,000 times cheaper than one individual CRISPR, you know, blah blah, whatever animal model. So at least you get data. I'm apologizing too much. I'm not ashamed of this. It's a good thing to do. You get hints about what might be functional. And the interesting thing, we just heard a lot about tissue specificity. I studied that in graduate school. We all talked about that endlessly. It's a hard thing to say what is really, truly tissue specific, just this cell type. There's probably nothing that really behaves that way. But when you do these artificial assays, you take it out of context. 
You can also learn whether something is active in general. You retain some of the tissue specificity, but you don't retain all of it. So it allows you to assay many, many things. So what Dan Savic in my group has done, collaborating with Barak and his group, and Jay Gertz, a former postdoc in my lab, is use this assay called CRE-seq, which Barak's lab developed. The details don't matter. You can look this up if you want, but you either clone or take lots and lots of putative elements. And this can be tens of thousands of them. These are often done with long oligos. They can be done with fragments that are cloned en masse into reporters, if you want to do it that way. There are various ways of doing this. And they're set up so that you connect the readout of the transcript from each of these with the putative regulatory element. So you can screen through thousands of these to say which of them have positive transcription activity. This is a positive assay, not a repressor assay. And if you're not relying on every single one of those to be absolutely accurate, but you're looking at the population, you can get really interesting data.
And we're having trouble publishing this, because everybody says, yes, this is obvious, we already knew it. Nobody's done this at this level to really test the function, but the answer was simple, or at least one of the answers was simple, I should say. Of all the distal elements, and it's really the long-distance elements that we're struggling with here, not the promoters, because the promoters are pretty obvious and we can usually figure those out pretty quickly, the long-distance elements that have RNA polymerase bound were the ones that are functionally active, in this assay at least. So at least we have that picture of it. And we got this information about RNA polymerase being bound from ENCODE data. 
So it relies on the large data sets. All right. I hope I didn't go way over. Those are just snippets of some of the ways that we think about it, or at least my group thinks about using these data. And I'm happy to answer questions.
Yeah, Rick. When you are intersecting the methylation regions, do you use like a window? How do you...
Yeah.
How do you... You don't do individual C's.
No, no. I think it's a sliding window of about 100 base pairs. It's interesting that you do get a little back and forth within a few base pairs, but in general they're patches: if it's un-methylated here, it's un-methylated right around it, in general. But it doesn't hold up 100%. So that's the kind of region that's generally un-methylated, maybe with a methylation in the middle of it.
So yeah, you just use some sort of cut-off and all.
Yeah, and I think it's 80 or 100 base pairs that Katie used. Okay.
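The sliding-window idea from that last answer can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the analysis that was actually run: the 100 bp window comes from the answer above, but the step size, the mean-methylation cutoff of 0.3, the minimum of three CpGs per window, and the function name are all hypothetical choices.

```python
def unmethylated_windows(cpgs, window=100, step=50, cutoff=0.3, min_cpgs=3):
    """Call windows whose mean CpG methylation is at or below a cutoff.

    cpgs: list of (position, methylation_fraction) for one chromosome, sorted by position.
    Returns a list of (start, end, mean_methylation) for half-open windows [start, end).
    step, cutoff and min_cpgs are illustrative assumptions, not values from the talk.
    """
    if not cpgs:
        return []
    calls = []
    start, last = cpgs[0][0], cpgs[-1][0]
    while start <= last:
        end = start + window
        values = [m for pos, m in cpgs if start <= pos < end]
        if len(values) >= min_cpgs:
            mean = sum(values) / len(values)
            # Averaging tolerates a single methylated CpG inside a mostly
            # un-methylated patch, as described in the answer above
            if mean <= cutoff:
                calls.append((start, end, mean))
        start += step
    return calls


# Made-up data: an un-methylated patch with one methylated CpG in the middle,
# followed by a fully methylated stretch
cpgs = [(10, 0.05), (30, 0.1), (55, 0.9), (80, 0.0), (95, 0.05),
        (300, 0.95), (320, 0.9), (350, 0.85)]
calls = unmethylated_windows(cpgs)
print(calls)
```

Scoring the window mean rather than individual cytosines is what lets a patch survive one methylated CpG in its middle; a stricter per-site rule would split such a patch in two.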