All right, so thank you to the organizers for giving me the chance to present today. As pointed out, I'll focus on cancer. And one of the key take-home messages that I hope you'll have is that we can all tap into this ENCODE rainbow of information and make our way to that pot of gold, all be leprechauns, and make the best out of the data that ENCODE has made public to all of us. There are three key points that I want to showcase today as to how we've been able to work with ENCODE-like data or actual ENCODE data to make better sense of how cancer biology relates to cancer development. So one has to do with the capacity to functionally annotate the genome of any given cell using epigenetic signals. One relates to being able to merge epigenomes with information about genetic predispositions or mutations called in cancer to make some sense of the functional biology behind these mutations or genetic predispositions. And a final one looks at how one can use ENCODE data to make better sense of how the 3D architecture of the genome is organized and, hopefully in the near future, help us get a better prediction of which regulatory elements influence which genes. So I've been in the business for a little while, even though I'm not one of the oldest in the business. So back in 2007, 2008, when Bing Ren came out with this notion that one could effectively annotate enhancers using K4 methylation and K27 acetylation, we came out in 2008 with a similar story, specifically using the power of K4 mono- and dimethylation to discriminate cell-type-specific enhancers. Since then, the field has progressed enormously, as we've just heard from the previous speaker. One can use a wide range of epigenetic modifications to effectively annotate more than just enhancers. And if you do focus on enhancers, you can discriminate their level of activation, but you can also identify insulator elements or CTCF-bound regions.
You can identify transcribed regions, promoters, and so on and so forth. And so we've been very tempted to use that information to better understand cancer. We've done a number of studies looking at how one can use epigenetic annotation to better understand the process of cancer initiation; that was published in 2012. And for these specific figures, I'm showing you how we've been able to use epigenetic annotation to define functional elements within the genome to understand the processes that relate to cancer progression. So what we're looking at here on the top line is the status, at the epigenetic level, of a cell that would be sensitive to drug treatment. We're focusing on breast cancer. We're looking at cells that are responsive to endocrine therapy. And when you annotate the genome based on ChIP-seq for histone modifications, you can find these regions of active enhancer activity demarcated by this green star, which corresponds to K4 methylation, or to K36 trimethylation that's taking place at the enhancers and the like, which is really specific to sensitive cells. When you then do the same type of annotation in the resistant cells, the story is quite different. These enhancers that are active in the sensitive cells are shut down in the resistant cells, and vice versa: elements that are shut down in the sensitive cells become open in the resistant models. And as we've just seen, you can mine these regions for the DNA sequence identity that they have, and you'll find out the network of transcription factors that drive the growth of the sensitive versus the resistant cells. So for the models we were looking at, we were dealing with a breast cancer that was dependent on estrogen receptor signaling and, in agreement, looking at the motifs enriched within the open chromatin of the sensitive cells, one could see elements of response to the estrogen receptor, ER alpha. In the resistant cells, those regions were gone, even though the protein might be expressed.
We've had plenty of cases where ER was highly expressed, but the cells were no longer dependent on ER, so the level of expression of a transcription factor is not sufficient. Instead, these cells had become dependent on Notch signaling, and from the cancer perspective, there are drugs that can antagonize Notch signaling, so we could actually use these drugs to treat these resistant cells and antagonize their growth. So we could definitely get a lot of information from epigenetic annotation of different models of cancer progression. With regards to merging GWAS data, TCGA and ICGC data with the epigenome, we've done a lot of work. The publications listed on the left parallel many of the key publications that came out from ENCODE on the same topic around the 2012 time point. And just to showcase one example of what we've done, this is a study that focuses on breast cancer, where I'll put the emphasis on the tool development that we've done, because I think a lot of you in the room are computational biologists. So one of the tools that we've developed is called VSE, for Variant Set Enrichment analysis. It's a tool that's specifically geared towards identifying the enrichment for any given feature of the genome at the risk loci ascribed to a specific disease. So the minute you have more than one risk region for a specific disease, you can run VSE. And what it does is simply compute a null distribution of what overlap is expected by chance between the risk SNPs for a given disease and your feature of interest. And then it computes the actual observed value for the feature you've measured in the cell type of interest. And so what you can see here, for instance, on the top side, we've looked at coding exons, 5' and 3' UTRs, introns and the like. And there's a ton of overlap. There's a ton of risk regions, all the rs numbers of the different risk regions associated with breast cancer. There's a ton of them that overlap with introns, for instance, all these gray boxes.
But none of these overlaps are significant. They're just about what you would expect by chance. Now, if you do this type of analysis looking at ChIP-seq tracks for histone modifications, no surprise, as we've heard over the last day or so, the risk SNPs tend to enrich within enhancers, within regions that are marked with K4 monomethylation. So nothing new there. But there's a tool. We're about to release it, so it's not quite out yet. But if you want to be aware of when the official release is, just send me an email, and I'll put you on our mailing list so you can be kept up to date on the actual release. I think it's a great tool. You need to get that null distribution in to make sure that your enrichments are significant. Now, of course, if these SNPs, if these risk regions, target enhancers, they have to do something specific. It's not just an overlap. They have to be acting, ideally, on transcription factors, since there's a hotspot of transcription factors going to these regulatory elements. And so we ran VSE against ChIP-seq tracks from a large collection of transcription factors that were profiled in breast cancer cells. And the key thing that comes out as being significantly enriched within breast cancer risk loci is binding sites for the estrogen receptor, as well as the pioneer factor FOXA1, two well-established drivers of growth in breast cancer. So this was not much of a surprise. But again, this is just a correlation. We're just telling you there's a significant enrichment of SNPs within binding sites for that particular factor. But we don't have a function. We need to have a function. Now, many of you have heard of PWMs. People have been using them to assess the extent to which a SNP can be affecting a motif by changing the PWM score. Those of you that have played enough with PWMs know that it's not an ideal way of measuring binding affinity. There's lots of caveats that come with the PWM score.
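The null-distribution idea behind the enrichment test described above can be sketched roughly as follows. This is a minimal illustration, not the VSE tool itself: the function names (`overlap_count`, `enrichment`) and the toy coordinates are hypothetical, and the null here is built from uniformly random positions, whereas a real analysis would presumably draw matched random variant sets (for example matched on linkage structure).

```python
import random

def overlap_count(snp_positions, intervals):
    """Count SNPs that fall inside any half-open (start, end) interval."""
    return sum(any(s <= p < e for s, e in intervals) for p in snp_positions)

def enrichment(risk_snps, feature_intervals, genome_length, n_null=1000, seed=0):
    """Compare the observed overlap with a null built from random SNP sets.

    Assumption: null SNP sets are drawn uniformly along a single toy
    chromosome of length `genome_length`.
    """
    rng = random.Random(seed)
    observed = overlap_count(risk_snps, feature_intervals)
    null = []
    for _ in range(n_null):
        random_snps = [rng.randrange(genome_length) for _ in risk_snps]
        null.append(overlap_count(random_snps, feature_intervals))
    # Empirical one-sided p-value, with a +1 correction so it is never zero.
    p_value = (1 + sum(n >= observed for n in null)) / (n_null + 1)
    return observed, null, p_value

if __name__ == "__main__":
    # Toy data: eight risk SNPs, all inside a feature covering 10% of the genome.
    snps = [100, 150, 200, 250, 300, 350, 400, 450]
    observed, null, p = enrichment(snps, [(0, 1000)], genome_length=10_000)
    print(observed, max(null), p)
```

The point of the sketch is the one made in the talk: a large raw overlap (as with introns) means little until it is compared with what random variant sets of the same size produce.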
And so we decided to improve on the means to call allele-specific biases induced by SNPs on transcription factor binding. So we came up with our own little methodology, which we call IGR, for intragenomic replicates. It's very simple in application, so I'll take you through it just now. So the principle is the following. You have a SNP of interest, and in this case it's either a T or a C allele at that given SNP. It lies within a contextual sequence that we know of. We know exactly what the nucleotide sequences upstream of it are, as well as those that are downstream of it. So what we decide to do is simply look for 7-mers, 8-mers, 9-mers, whatever length you're interested in based on how often it's represented in the genome, that span the contextual sequence and the SNP of interest, one at a time. So the first sequence, for instance, that we'll be looking at is this 7-mer that starts with TT, GC, TA and ends with your T, the first allele of interest for the SNP of interest. Let's say this corresponds to the red sequence so we can easily map it across the genome. What we do then is take a given ChIP-seq track for a given transcription factor we're interested in and ask, where are these sequences? Where is that 7-mer found across the genome? And what we end up with is thousands of genomic locations that have that 7-mer. They're very, very frequent. And then what we do is compute the ChIP-seq signal for that specific transcription factor of interest over those 7-mer sequences. And then we'll have this bottom left panel, which showcases a strong enrichment for binding of that transcription factor over regions that have that 7-mer. We then slide the window. So we still use a 7-mer in this specific example, but our SNP allele, the T, instead of being at the last position, is at the minus-one position.
And same thing, we look for these sequences across the genome. It turns out that for that specific sequence, it's a flat line. There's not much of an enrichment over these green sequences, but it's still providing information. And we keep going. We keep sliding our window until we cover the entire span of the 7-mer, of the 8-mer, of the 9-mer. Once we're done with one allele, we repeat the process for the second allele, and we'll get a profile there again, one that's very different from the profile we had for the first allele. And then we compare these profiles. So here we have the red profile versus the blue profile, which correspond to the profiles that we get for binding affinity from a given ChIP-seq track for different SNPs. For different alleles of the same SNP, sorry. And then what you can see is that the overall ChIP-seq signal enrichment is much stronger on the red allele than it is on the blue allele. So we can effectively call allele-specific binding. It's that simple. Now, how good is our methodology compared to PWM? It's quite good. Here's a case example where we've actually studied in great depth the forkhead motif, looking at the expected change in binding affinity calculated by the PWM metrics and by the IGR pipeline versus actual ChIP-qPCR for a subset of these different alleles that we can play with. The first thing you'll notice is that for PWM, the dynamic range in binding affinity change that can be calculated is very narrow. We typically have a PWM score change between 0.8 and a maximum value of one. So that's a 20% fluctuation in binding affinity that can be caused by any single nucleotide change. If you look at IGR, this is an 80% change in binding affinity that can be calculated. So the dynamic range is much greater. And if you also pay attention, you'll notice that we can actually call changes in the sequence that will increase binding compared to the consensus sequence defined by the PWM. That's pretty impressive.
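The sliding-window procedure just described can be sketched as below. This is a toy reading of the idea, not the IGR pipeline: the function names are hypothetical, the genome is a plain string, the ChIP-seq track is assumed to be a per-base list of signal values, and only the forward strand is searched.

```python
def kmers_spanning_snp(left_flank, allele, right_flank, k):
    """All k-mers that contain the SNP, sliding the window one base at a time.

    The first k-mer has the SNP at its last position, the next one at the
    second-to-last position, and so on.
    """
    seq = left_flank + allele + right_flank
    snp_idx = len(left_flank)
    kmers = []
    for start in range(snp_idx - k + 1, snp_idx + 1):
        if start >= 0 and start + k <= len(seq):
            kmers.append(seq[start:start + k])
    return kmers

def find_all(genome, kmer):
    """Start positions of every (possibly overlapping) occurrence of kmer."""
    positions, i = [], genome.find(kmer)
    while i != -1:
        positions.append(i)
        i = genome.find(kmer, i + 1)
    return positions

def mean_signal(genome, signal, kmer):
    """Average per-base signal over all genomic copies of one k-mer."""
    hits = find_all(genome, kmer)
    if not hits:
        return 0.0
    k = len(kmer)
    return sum(sum(signal[h:h + k]) / k for h in hits) / len(hits)

def igr_profile(genome, signal, left_flank, allele, right_flank, k):
    """One mean-signal value per window position, for one allele."""
    return [mean_signal(genome, signal, km)
            for km in kmers_spanning_snp(left_flank, allele, right_flank, k)]

if __name__ == "__main__":
    # Toy genome with one copy of each allele's context and a made-up track.
    genome = "ACGTT" + "NNN" + "ACCTT"
    signal = [4.0] * 5 + [0.0] * 3 + [1.0] * 5
    for allele in ("G", "C"):
        print(allele, igr_profile(genome, signal, "AC", allele, "TT", 3))
```

Comparing the two per-allele profiles (for instance their sums or maxima) gives a predicted allelic bias. How the real method summarizes and tests the difference between profiles is not stated in the talk, so that step is left open here.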
And now, the ChIP-qPCR data compared to this: well, by eye it looks good. If you actually do a correlation analysis, so IGR on this axis versus the ChIP-qPCR data, we had a 0.89 correlation; PWM versus ChIP-qPCR, 0.45. So I hope I've convinced you that there's a huge need to replace PWM with new methodologies to effectively be able to assess the impact of any single nucleotide variant on transcription factor activity. A few more comparisons between IGR and PWM. So this is looking at thousands and thousands of SNPs, assessing their impact on binding of a specific transcription factor, FOXA1. So no surprise, the IGR predictions and the PWM predictions are not very correlated with one another. However, an interesting feature of IGR is that it's agnostic to where the data comes from. It doesn't matter which lab created the ChIP-seq for that given factor of interest. So here we're comparing the IGR predictions using ChIP-seq for FOXA1 generated by the group of Myles Brown versus the group of Jason Carroll. Nice correlation in the prediction of allele-specific biases created by any given single nucleotide polymorphism. You can even use ChIP-seq from any given cell type. So here we're using ChIP-seq from LNCaP cells, a prostate cancer cell line, versus MCF7 cells, a breast cancer cell line; it doesn't matter. So regardless of where your SNP has been identified, whatever disease it's been found to be associated with, if the factor that you think is important for that SNP has been ChIP-seq'd in any of the cell lines that anybody has ever ChIP'd in the world, you can use IGR to predict the differences that that SNP could cause in allele-specific binding for that transcription factor. The only filter that we found to be crucial is the one that tells you which regions of the genome are active or open.
So ideally DNase I hypersensitivity or something of that flavor, so that the effect you're measuring is really caused by the DNA sequence as opposed to the epigenetic context in which that SNP would lie. How good is it at making sense of the extent to which genetic predispositions associated with a specific disease have the capacity to disrupt a given transcription factor? Well, here's some evidence. So I told you before that for breast cancer, FOXA1 was a primary target. Its binding sites were commonly found to harbor genetic predispositions to breast cancer. And so what we did here was to simply ask, if we use IGR to predict the capacity of any given SNP to alter binding affinity for FOXA1, how likely are we to see this for the SNPs that are associated with breast cancer, and is this taking place more than expected by chance? So the red line on the left side graph shows you the actual proportion of breast cancer-associated SNPs mapping to FOXA1 sites that have the capacity to change binding affinity. It's over 70% of them. And it's more than expected by chance, which is represented by the null distribution that you see there. Of course, it's always important to go into the lab and do some actual functional validation of our discoveries. And so what we did was to look for cell lines that were heterozygous for some of the SNPs we had found associated with breast cancer mapping to FOXA1 binding sites. And once we had these cells, we conducted allele-specific ChIP-qPCR for FOXA1 to assess whether we could indeed validate the allele-specific binding of FOXA1 at these sites. And the answer is yes. Out of the 12 sites we could find, nine validated and showcased allele-specific binding for FOXA1. So this supports the methodology: yes, we can indeed use a computational method to more or less streamline which SNPs have the potential to disrupt a given transcription factor's activity and then validate these in the lab.
So I think that this is an important feature to keep in mind, in the sense that a lot of the post-GWAS analysis has been focusing on finding the downstream target genes, but there's an enormous amount of information that you can get by characterizing the upstream events that specifically deal with the proteins that recognize these SNPs, more than what we've done so far. We can obviously use IGR to do a whole lot more than just look at SNPs. We can use IGR to look at the impact of mutations that have been called in cancer through the ICGC or the TCGA effort. So here I'm showcasing, for instance, the data using IGR comparing over 2,000 mutations versus a large collection of transcription factors. I think there's close to 100 transcription factors that have been profiled in total so far. The mutations are on the x-axis, the transcription factors are on the y-axis. And what you can see here is a score of the expected absolute fold difference in binding affinity predicted by IGR. And you'll see that there's a large collection of mutations that form clusters that actually showcase their capacity to change the binding affinity for specific transcription factors. So there's a number of clusters that show up here. There's also mutations that don't appear to do anything with regards to the transcription factors we've so far ChIP'd. We need to ChIP more, I would argue. But still, there's these clusters, and this specific cluster that I'm highlighting here is of interest by the fact that if you look closely at which transcription factors are found to be affected by these mutations, well, you see that it's CTCF, SMC3, and RAD21. So CTCF doesn't really need to be introduced, I think. Everybody's very familiar with CTCF as a factor that can be involved in chromatin looping structures. SMC3 and RAD21 are components of the cohesin complex.
They interact quite tightly with CTCF. Of these three, CTCF is the only one that has a DNA binding domain. But they're known to be going to the chromatin often together. And so presumably the relationship here between the mutations affecting the binding affinity for these different factors is caused by the fact that CTCF is commonly going to the same sites as RAD21 and SMC3. They work as partners. Of interest as well is this recent publication that came out last week, showcasing that indeed, at CTCF regions, the motif itself is commonly targeted by mutations in cancer. So the study that was published was predominantly focusing on colorectal cancer, but the principle is the same. Another feature of interest that I'll showcase here is that from the study, you see quite well that most of the mutations lie outside of the motif. And so if you restrict your analysis to the motif, you wouldn't necessarily be able to pick up this type of enrichment. IGR is not motif restricted. You can restrict it to the motif, but by default it is not, which is a huge advantage, because then we can find mutations that lie outside the motif as causing allele-specific biases in binding for these factors. All right, the key question, though, that we are all always asking ourselves is: we have a risk locus that maps to an enhancer; what is the target gene, right? We have this gene-centric perspective of the world. We need to find a gene. And so we're no different. We have that same bias. We always want to be able to find that downstream target gene. As we heard yesterday, the chromatin organization, or the 3D architecture of the genome, is very complex. There's many different types of interactions. So you heard last night about topologically associating domains. These are showcased here as the turquoise and orange regions. The orange regions are regions of active chromatin, active expression and the like; the turquoise regions are repressed regions.
They can be far, far away from each other. The key thing is that they'll be dependent on specific interactions, discriminating these active versus repressive TADs. Within the active TADs, you'll have different flavors of interactions. These interactions will be the ones that allow, for instance, promoters to interact with their enhancers of interest and regulate gene expression. And so it's important to be able to discriminate these types of interactions effectively. Whenever I think of this problem, I think of my youth, when I used to go out dancing, whatever, and I used to be that guy, or I used to see that guy, dancing around, being very active. So at the epigenetic level, for instance, that would be K27ac, right? Next to another person, also very active, another K27ac region. But being active next to one another is not sufficient. Even though that individual might think that he's connecting to the other individual, because they're so close to one another and they're dancing within a certain space, they're not connecting to one another, right? They're completely oblivious to one another. So there's a need for help. There's a need for matchmakers, for connectors, to be present. And there's a wide range of connectors that we've used in our society, and depending on the type of interactions we're looking for, you might go for a different flavor of connector. For those of you that spend too much time in front of the TV, you'll be familiar with this Millionaire Matchmaker. So there's different flavors of matchmakers in our normal life. The cell is very likely to be of that same flavor. So we came up with the principle that the matchmakers involved in regulating the interactions between promoters and enhancers would potentially be different from the matchmakers involved in establishing the boundaries of topologically associating domains, okay?
And so with that in mind, we decided to tap into the wealth of information that ENCODE had already gathered about the 3D organization of different cells. At the time we started the study, 5C had been done on a collection of cells, the GM12878, K562 and HeLa-S3 cells, where specifically they had asked, what are the distal elements interacting with the promoters of specific genes? It was a biased 5C specifically looking at enhancer-to-promoter types of interactions. So we felt that it was the best dataset for us to use in order to identify the matchmakers specifically relevant to those types of interactions. And so this is a report from one of the ENCODE publications that showcases the number of interactions that were picked up. We're talking about thousands of interactions, some cell-type specific, some shared across the different cell types. And so we decided to mine this data. To our surprise, at that time it had not yet been mined by ENCODE to the point of addressing whether there were any specific factors associated with chromatin loops. And so we decided to do it. What we did was to simply ask, for any given transcription factor that had been ChIP-seq'd in GM12878, K562 or HeLa-S3, whether it was significantly enriched, more than expected by chance, at these chromatin loop anchors, at those regions that set the interactions between promoters and enhancers. And so we have the null distribution represented by the box plot, as I've done in other figures, and then the observed enrichment with the red dot. And as you can see, there's a ton of transcription factor ChIP-seq tracks that are found to be significantly enriched within chromatin loop anchors, okay? No surprise, we see CTCF and SMC3, one of the cohesin components, highly enriched within these anchors of chromatin loops, which validates, in a sense, our methodology. We also see a number of new factors that had not yet been associated with chromatin loops.
Of interest to us was this ZNF143 factor. Now, as we were going through the reviewing process, we were fortunate enough to see publications from both ENCODE and the Lieberman Aiden group come out with analyses looking at either Hi-C data or ChIA-PET data, looking for enrichment of transcription factors, and they reported ZNF143 as highly enriched within these regions as well. So our enrichment score was validated, in a sense, by publications from colleagues a few months before we could actually come out with our publication. So what is so special about ZNF143? Well, ZNF143 is actually a promoter-bound factor. What we're looking at here on the left side is its binding strength across all of its binding sites with respect to CTCF, SMC3, and Pol II binding. And what you'll notice is that at the top of that figure are the strongest ZNF143 binding sites, and these coincide with the sites which are also bound by RNA polymerase II. If you look at the chromatin states, which define promoters, enhancers, insulators, and so on and so forth, those are the sites that specifically map to promoters. So that red section on the far right, let me see if I can get this pointer, this section here, that's all promoter regions. So the primary, strongest binding of ZNF143 is at promoters. Weaker binding is seen in association with CTCF and the cohesin complex, and if you look at the chromatin state, it agrees with ZNF143 going to these sites which are CTCF rich, formerly referred to as insulator elements. We also see enrichment within enhancers, which is the purple color. Another feature of interest that further supports the notion that ZNF143's primary binding is at promoters comes if you look for the motif it's known to recognize.
If you specifically ask, for this top 10% of sites, the ones at promoters, what proportion of these sites have the motif, it's over 80% of them. And then if you ask what the proportion is at the other sites, it goes down to 25% or so. And so it's really going to the promoter as a primary target, and we think that the signal we're seeing at the other sites, the weaker sites, is potentially an effect of the chromatin loop itself and of the fact that we cross-link for ChIP-seq, such that we're seeing these shadows or phantom events that others have reported in the past. At the promoter, it sits right next to Pol II. It's not competing with Pol II, quite the opposite. It actually helps Pol II. So is it the chromatin looping factor, yes or no? To address that issue, we've used the power of genetics. Just as we heard in the previous talk, there's a ton of SNPs out there that we can use to actually assess the allele-specific effects on transcription factor binding, so we might as well use that power. So what we did was to screen the genome for SNPs that map to the motif known to be recognized by ZNF143, looking for SNPs that had the capacity to change the motif sequence to the point where you would expect a significant drop in binding affinity. So here's one of these case examples. The SNP we chose was rs2232015. It turns out that it's heterozygous in GM12878 cells. We have the perfect model, because that SNP maps right here at the fourth position of the ZNF143 motif. And if you look at the sequence closely and you're a fan of PWMs, you'll already be able to assess that there's likely a significant change in binding affinity, because the T allele is rarely, if ever, present at this fourth position. In GM12878, the SNP turns out to map right in the middle of the ZNF143 binding site. The only thing we're missing is a loop. Now, this given site is positioned at the promoter of PRMT6.
And this is the genomic context of that gene. We used the cross-correlation methodology that John Stam and others developed a few years ago to try to infer where loops could be taking place. These are all these red lines going off from the PRMT6 promoter to elsewhere. We had a hotspot of expected potential interactions going all the way up to here. We tested them by 3C and could effectively find a 3C loop connecting the PRMT6 promoter bound by ZNF143 to a distal site. Now we were all set. We had a heterozygous SNP predicted to change the ZNF143 binding affinity that we could test for allele-specific loop formation, to demonstrate that ZNF143 was guiding loop formation. And so that's what we did. We validated the allele-specific prediction by doing two things. First, we used the ChIP-seq data. We actually developed a tool, which is called ABC, to call allele-specific binding from ChIP-seq data. So what we did is that we used the ChIP-seq from GM12878 and ascribed reads to one allele or the other at that given SNP. And the results are showcased here. So the gray line corresponds to all the reads that map to the other allele, the red line to all the reads that map to the T allele. So you can see that you have 240 reads for the other allele and about 140 reads for the T allele. So our prediction seems to validate using the ChIP-seq data. We then went to a more sensitive assay, allele-specific ChIP-qPCR, showcasing here again that our cells are diploid at this site when looking at genomic DNA, and then, in the ChIP DNA that's pulled down, the significant enrichment for the other allele over the T allele. So we have a site that indeed behaves the way we want it to. Are the loops allele-specific? That was the next level. We did allele-specific 3C to measure allele-specific loop formation and, bingo, same bias. The loops are more frequent on the other allele than on the T allele. So preferential binding of ZNF143 induces more loop formation, with an effect on gene expression.
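As a rough check on read counts like the ones quoted above (240 versus about 140), one generic way to ask whether an allelic split differs from 50/50 is an exact binomial test. This is only an illustrative sketch and not how the ABC tool works; the function name `allelic_imbalance` is hypothetical, and the test assumes a heterozygous site, independent reads, and no mapping bias toward either allele.

```python
from math import comb

def allelic_imbalance(reads_a, reads_b):
    """Fraction of reads on allele A and a two-sided exact binomial p-value.

    Null hypothesis (an assumption of this sketch): each read comes from
    either allele with probability 0.5.
    """
    n = reads_a + reads_b
    if n == 0:
        raise ValueError("no reads at this SNP")
    larger = max(reads_a, reads_b)
    # With p = 0.5 the distribution is symmetric, so the two-sided p-value
    # is twice the upper tail from the larger count, capped at 1.
    tail = sum(comb(n, k) for k in range(larger, n + 1))
    p_value = min(1.0, 2 * tail / 2 ** n)
    return reads_a / n, p_value

if __name__ == "__main__":
    fraction, p = allelic_imbalance(240, 140)
    print(f"fraction on first allele: {fraction:.2f}, p-value: {p:.2e}")
```

With counts this size, a split of roughly 63% to 37% is very unlikely under the 50/50 null, which is consistent with the follow-up by allele-specific ChIP-qPCR described in the talk.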
Here's a second SNP, same biology: we called allele-specific biases in ZNF143 binding, which cause allele-specific 3C loop formation and differential gene expression. Our tool, ABC, if you're interested, is out. You can use it. It's great for calling allele-specific biases, so if you have any interest in figuring out whether any of the SNPs in the model you're working with can cause allele-specific biases, the tool is out there, feel free to use it. The key take-home is that we now have a better sense of how the machinery is set in place to define which promoters can interact with which enhancers. So at promoters that are establishing loops, there's a strong recruitment of ZNF143 present at those sites; it's right next to Pol II. It itself is rarely found at the enhancers as a primary binding event, but we see the signal there at sites that are bound by CTCF. So we expect there to be a partnership between a distal site bound by CTCF and promoter regions bound by ZNF143, and together they bookmark the regions that can establish chromatin loops. Hopefully this will allow us to significantly improve our predictions, i.e. the ones from the cross-correlation analysis or anything else, by restricting these types of cross-correlations to regions that are bound by these factors that are known to be able to form chromatin loops. So with that I'll just take a few seconds to thank the people that did the work. From my group, Swneke Bailey did an amazing amount of work. Kinjal Desai as well; former colleagues Richard Cowper-Sal·lari and Nick Sinnott-Armstrong, as well as Xiaoyang Zhang, were outstanding. Our local collaborators at the Princess Margaret in Toronto, and as well Peter Scacheri; I have to showcase Peter Scacheri, who's done phenomenal work in the post-GWAS era and also in enhancer mapping. So thank you for your attention. I didn't speak fast enough to have more than one. I'm not kidding.
So point well taken about the importance of looking upstream. In the analyses at this meeting, a lot of what we've focused on is downstream, and now we have two talks in a row about the importance of upstream analysis. I could have asked either of you this, but a lot of times transcription factors have similar DNA binding motifs because they have similar DNA binding domains. Are you able to reach through and say, of the class of ETS factors, for example, I can tell which one it is, or does this kind of analysis get you to a group of factors but not an individual factor? So if you restrict your analysis to just the motifs, you're correct. You're gonna have that issue, and that's gonna be hard to parse out. People in the past have tried to pair that up with expression; sometimes it helps, sometimes not so much. In our case, we're actually using ChIP-seq data. For the enrichment score of risk loci for breast cancer over transcription factors, we've actually used ChIP-seq data coming from a large collection of transcription factors that were profiled by the nuclear receptor community or the breast cancer community. And so using ChIP-seq data, we're a whole lot more confident that the enrichments we're finding are directly relevant to the disease of interest. But no, you're correct. If you restrict your analysis to the motif, it's gonna be difficult, and even more so with the GWAS hits, considering that you don't necessarily know which tissue type you have to focus on. That's another layer of complexity that makes it hard to restrict just to the motifs. Great, thanks.