Thank you so much. It's wonderful to be here to tell you about some work we've been doing in my lab to understand how genes are regulated through the 3D structures that form between distal enhancers and promoters. I wanted to start with some pictures that are a little more elaborate than the usual line drawing with blobs on it. These are from colleagues who've attempted to make videos or still graphics of the process of transcription. The purpose is to remind all of us that this is happening in three-dimensional space, that it involves a lot of proteins, and that it involves DNA and chromatin forming complex structures. And I would argue that while more complicated than the line drawings, these still fall short of the true complexity. So our goal was to understand what we could about this process from ENCODE-like data, including actual ENCODE data. The motivation is probably clear; it has been spoken about already by Matthew this morning and by Tyler last night. But just to recap, and to give you a little personal look into my motivation: there are many mutations in the non-coding portion of the genome, and it's an obvious hypothesis that these might affect enhancers, as we've also heard about today in several other talks. So if you have, in the line-drawing version here, a mutation that's associated with a disease, for instance, or something I'm really interested in, a human-chimp difference, a divergence between us and our closest relatives, you might want to ask: could this variant actually be causal, or is it some other linked variant nearby? And if I could find the causal variant, how would I follow it up? I would want to know the genes and the pathways being targeted. That gets complicated if there are many genes in the locus, which there are in most places in the genome where we look at these non-coding mutations.
So, what we've been working on a lot in my lab: we came to it from this evolutionary question of comparing humans and chimps, and from our observation that the fastest-evolving regions of the human genome have the genetic and epigenetic signatures of distal enhancers that function during development, but I think it's also very relevant to the disease question. We've been asking ourselves: can we annotate where the enhancers, the distal regulatory elements, are? I know many people here have been thinking about that. And more recently we've been thinking that it would be great if we could map those elements to their target genes. For example, here, if that T/C variant on the right looks interesting because it falls in an enhancer in a relevant cell type, the inference that it regulates the closest gene, gene C, is wrong in this map, where it's actually looping over here to regulate gene A. These problems are both hard; as everybody here knows, finding enhancers and figuring out which genes they target are not easy problems. Some of the standard approaches, such as performing a few ChIP-seq experiments and saying, oh, I found enhancers, these are H3K27-acetylated, for example, fail to identify many of the experimentally validated enhancers and find many false positives. But there is helpful information there, and the idea is that if we combine lots of data sets together we can do better. Our hope was that we might also improve on predicting gene targets, because the commonly used practice of picking the closest gene, or even picking all genes within a reasonable window on the genome, turns out, when you do the chromosome conformation capture experiments that measure these interactions, to be right only about 8 to 10 percent of the time. So often when we do a gene ontology enrichment or some sort of follow-up functional study, we're actually pursuing the wrong functions, the wrong genes, and the wrong pathways.
So I'm going to assume initially that we know the enhancers, and I'm going to focus on the question of predicting the gene targets. The question is: can we reconstruct 3D interactions between enhancers and promoters from 2D genomic data? Here's a picture of a region of the human genome, just to convey the complexity of the problem a little more and to show you some of the data. This is a browser-style shot. There are many genes here. These are active enhancers and active promoters, in this case from ChromHMM. And these are interactions detected in a high-resolution Hi-C experiment, where chromosome conformation capture measured that some of these enhancers, E1 and E2, are looping over to this promoter, not to the intervening genes. So could we predict this from all the data up here? Why we might want to do that has been motivated by several other speakers, but one huge motivation is that the experiment, at this resolution of single promoters and single enhancers, is incredibly expensive: millions of dollars to generate that data. This other data is easier to generate in a short time period and with less money. Another motivation, which I think is in some sense even more exciting than the financial one, is that if some of this data were predictive, we might actually learn something about how chromatin loops form, something about the mechanism. The approach we've been using is supervised machine learning. What that means is that we need training data: some examples of promoters and enhancers that are active in a cell type and are in physical proximity to each other, and other enhancers and promoters that are active, that have the active marks, but are not physically interacting. And then we have feature data from which we are going to try to learn a model.
Once we learn a model by holding out some of the data, we can evaluate how well we predict on that held-out data, a process called cross-validation. If we succeed in this, we can then make predictions beyond our training data with some confidence in the accuracy. We're fortunate to have good training data here. A publication that came out at the very end of 2014 performed, as I mentioned, a Hi-C experiment, chromosome conformation capture at one-kilobase resolution, in several of the ENCODE cell lines. This is genome-wide and gives us the resolution we need to see individual promoters interacting with individual enhancers, but it was exceedingly expensive to generate: millions of dollars. By looking at active enhancers and active promoters and labeling them as positives if they are interacting in the Hi-C and negatives if they're not, we have a training data set. The features we used to try to predict the interactions were of three general types. One, we looked at evolutionary conservation, not of the sequence per se, but of the co-localization, the synteny, of the enhancer and the promoter: if we look across evolutionary time, is there a conserved sequence for that enhancer across species, and does it stay relatively close, or at a similar distance, to that gene? This had been shown to be very predictive of eQTLs, expression quantitative trait loci. It turned out not to be particularly predictive for us on this problem. Most of the data we used, and most of what was very predictive, were functional genomics experiments, primarily ChIP-seq for transcription factors, histone modifications, and various structural proteins. The key, and I'll jump a little ahead and tell you more about this in a minute, is that we looked at the enhancer and the promoter, which others have done; we heard about that from Matthew this morning.
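To make the supervised setup and cross-validation concrete, here's a minimal sketch. The data here are synthetic stand-ins for the real labeled enhancer-promoter pairs and ChIP-seq features, and the labeling rule is an invented assumption purely for illustration; this is not the published pipeline.

```python
# Minimal sketch of the supervised setup: rows are (enhancer, promoter)
# pairs, columns are synthetic stand-ins for functional genomics signals,
# and labels mark pairs as interacting (1) or not (0).
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n_pairs, n_features = 1000, 20
X = rng.normal(size=(n_pairs, n_features))
# Invented labeling rule: a pair "interacts" when two features combine,
# mimicking the idea that a combination of marks carries the signal.
y = ((X[:, 3] > 0) & (X[:, 7] > 0)).astype(int)

model = GradientBoostingClassifier(random_state=0)
# Ten-fold cross-validation: hold out a tenth of the pairs, train on the
# rest, score on the held-out tenth, repeat, then average.
scores = cross_val_score(model, X, y, cv=10, scoring="roc_auc")
print(round(scores.mean(), 2))
```

Because the held-out pairs never touch the training step, the averaged score estimates how the model would do on new pairs rather than how well it memorized the training data.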
The really interesting part, where we learned some really interesting biology and really improved our predictions, was to look at the window in between the enhancer and promoter, integrating the signal along that piece of looping chromatin. This is different from what I've seen others do before. We tried it, frankly, on a bit of a whim, and it turned out to be a really interesting and important thing to have done. Finally, we looked at the sequences themselves: for the upstream transcription factors predicted by motif analysis to bind the enhancer, are those annotated to be involved in similar functions as the potential target gene, and are there shared motifs at the enhancer and promoter? There was some evidence from others that these would be useful features, and there was some information there, but most of the signal turned out to be in the ChIP-seq experiments, as I'll show you in a minute, and specifically on the looping chromatin, more than at the enhancer and promoter. For those who are interested, I'll tell you about the computational algorithm. We decided to use decision trees. The motivation was that we thought these features might interact in complex combinations, which turned out to be true: you might want some event to happen, or another event, but not some third event. We knew from the browser shot I showed previously that this was going to be complex, that we needed to be able to model these Boolean combinations, and that decision trees might be a good way to do so. By decision trees I mean approaches such as random forests and gradient boosting. We tried several different algorithms, and within this family of ensemble methods there wasn't a big difference in performance across algorithms.
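The point about Boolean combinations can be illustrated with a toy example (entirely synthetic, not the talk's data): an XOR-style rule, "one mark or the other, but not both," is invisible to a linear classifier yet easy for tree ensembles.

```python
# Sketch: why tree methods for Boolean feature combinations. The label
# is the XOR of two binary features, a rule no linear model can fit.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(1)
X = rng.integers(0, 2, size=(2000, 10)).astype(float)
y = (X[:, 0] != X[:, 1]).astype(int)  # "this OR that, but not both"

# Linear model: stuck near chance. Trees: split on one feature, then the
# other, and recover the rule.
lin = cross_val_score(LogisticRegression(), X, y, cv=5).mean()
forest = cross_val_score(RandomForestClassifier(random_state=0), X, y, cv=5).mean()
print(round(lin, 2), round(forest, 2))
```

The same logic applies to real marks: a model that can only add up independent feature weights misses rules of the form "A and B but not C."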
We did get a lot of benefit, however, from the ensemble approach itself, which is essentially that you build many imperfect classifiers on random subsamples of your data and then combine them to get a predictor that does better than your single best classifier would. This is really important, because essentially what you do is overfit some little part of your data through that random subsampling, and by learning these different subsets of features that can sometimes, but not always, be important, you actually learn a more thorough model than if you just took your best classifier on the full data set. This gave us a real boost in performance. The results of doing this really surprised me. I knew this was a hard problem, and I knew there'd be some information here; I didn't expect to see such good performance. Here's a summary of what we found, on three ENCODE cell lines where we had a lot of data to mine for features, and where we had the high-resolution Hi-C from the experiments done at the end of last year at the Broad. These pictures are probably familiar to everyone: this is the false positive rate, and this is the true positive rate on the vertical axis. Perfect performance would be in the upper left-hand corner. Our algorithm outputs a score, and by thresholding that score you trace out a curve: up here you make many predictions and have a high false positive rate but recover most of your true positives; down here you have less power, finding fewer of your true positives, but also a much lower false positive rate, a stricter predictor. What you can see is that we do a great job by a number of different measures. The area under this curve is one measure of performance.
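Here's a small sketch of that ensemble benefit on synthetic noisy data (an illustrative assumption, not the talk's actual comparison): many trees, each deliberately overfit to a random subset of rows and features, averaged together, beat a single tree fit to everything.

```python
# Sketch of the ensemble idea: bag many imperfect trees and combine them.
import numpy as np
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(2)
X = rng.normal(size=(600, 15))
# Noisy labels: a single deep tree will partly memorize the noise.
y = ((X[:, 0] + X[:, 1] + rng.normal(scale=1.0, size=600)) > 0).astype(int)

single = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=5).mean()
bagged = cross_val_score(
    BaggingClassifier(
        DecisionTreeClassifier(random_state=0),
        n_estimators=200,   # many imperfect classifiers...
        max_samples=0.5,    # ...each overfit to a random half of the rows
        max_features=0.7,   # ...and a random subset of the features
        random_state=0,
    ),
    X, y, cv=5,
).mean()
print(round(single, 2), round(bagged, 2))
```

Each tree's mistakes are partly random, so averaging cancels them out; that variance reduction is where the performance boost comes from.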
I think it's very important in bioinformatics problems, where most of your data set is negatives, not just to report an AUC, the area under this curve, which reflects how high above the random-guessing line you are, but also to look at precision and recall. In a problem where most of the genome, or most of the enhancer-promoter pairs, are not physically interacting, a predictor that predicts no interaction most of the time will have a very low false positive rate, but you won't have very good precision: most of your predictions will be wrong. So, pleasingly, we also had very high precision, which I was very surprised to see. When Sean Whalen, the postdoc in my lab, showed this to me, I thought, well, maybe this is a mistake; it can't be true. So first of all: was there any bleeding between your cross-validation sets? A bunch of technical issues; we resolved that none of those were going on. Then I said, well, maybe these features are just encoding how far away the enhancer is from the promoter, because we know that at very short ranges, like 10 to 20 kilobases, there is a higher chance that an enhancer is interacting with a promoter. We looked, and it turns out there is no dependence of performance on the distance between the promoter and enhancer; if anything, we do a little better the further away the enhancer is from the promoter, despite the fact that many of these are up to two million base pairs away from the promoter they regulate. So we weren't just encoding distance with this complex feature set. What was encoded in the feature set, as I alluded to earlier: it turned out to be really important to look at the window between the enhancer and the promoter, at what proteins are decorating that looping chromatin.
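The point about AUC versus precision under class imbalance can be shown with simulated scores (synthetic numbers chosen for illustration, not the talk's results): a mediocre classifier on a 1%-positive problem can post a healthy AUC while almost all of its positive calls are wrong.

```python
# Sketch: with heavy class imbalance, ROC-AUC can look fine while
# precision is terrible, which is why both should be reported.
import numpy as np
from sklearn.metrics import precision_score, roc_auc_score

rng = np.random.default_rng(3)
n_neg, n_pos = 9900, 100          # ~1% of pairs truly interact
y = np.concatenate([np.zeros(n_neg), np.ones(n_pos)])
# A mediocre scorer: positives score somewhat higher on average.
scores = np.concatenate([rng.normal(0.0, 1.0, n_neg),
                         rng.normal(1.5, 1.0, n_pos)])

auc = roc_auc_score(y, scores)
precision = precision_score(y, scores > 1.0)  # threshold the score
print(round(auc, 2), round(precision, 2))
```

Even a small false positive *rate* on 9,900 negatives produces far more false positives than there are true positives, so precision collapses while the ROC curve still looks respectable.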
A nice aspect of using ensemble methods is that there are now some very good techniques for assessing feature importance, in other words, how important each of these very different data sets was for prediction accuracy, using techniques such as recursive feature elimination, for example. So you can get a measure of the importance of a feature, and here I'm making box plots showing the distribution in different cell lines, plus a combined model, in different colors for the enhancer, the window in between, and the promoter: how predictive the marks at each region are, on the vertical axis. What you see is that there is signal in all three regions. The promoter has a bit more information than the enhancer, but the window in between is actually where the most predictive information was. Then I thought, well, maybe this was because there were just more proteins binding there, more signal. But actually, this greater predictive accuracy is despite the fact that the signal, which I'm plotting here as the density of peaks, is actually lower on the looping chromatin. So there's not a lot going on there, but what is going on there is super important. So what is it? What's binding there but not elsewhere? What's happening to the DNA? As I alluded to, this is a complicated mixture of things; there's not a simple signature, but it's a very consistent story when we look at the sorts of features involved.
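As a sketch of summing per-feature importances by region, here's a toy version of that analysis. The data, the region assignments, and the labeling rule (in which the "window" features carry most of the signal by construction) are all invented for illustration.

```python
# Sketch: group a forest's per-feature importances by region
# (enhancer / window / promoter) and compare the totals.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(4)
X = rng.normal(size=(1500, 12))
groups = ["enhancer"] * 4 + ["window"] * 4 + ["promoter"] * 4
# Invented rule: label driven mainly by two "window" features,
# weakly by one enhancer and one promoter feature.
y = (X[:, 4] + X[:, 5] + 0.3 * X[:, 0] + 0.3 * X[:, 8] > 0).astype(int)

forest = RandomForestClassifier(n_estimators=300, random_state=0).fit(X, y)
importance = {g: 0.0 for g in set(groups)}
for g, imp in zip(groups, forest.feature_importances_):
    importance[g] += imp

for region in ("enhancer", "window", "promoter"):
    print(region, round(importance[region], 2))
```

The grouped totals recover the planted structure: the window features dominate, which is the kind of per-region comparison the box plots summarize.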
So if an enhancer and a promoter are looping to each other, and we look not just adjacent to the promoter and the enhancer but at the intervening window, we see other enhancers. This makes sense in terms of super-enhancers, or because we know that enhancers tend to cluster together: not right next to the enhancer, but nearby, are often other enhancers, so this is a very active region. We might have expected that, but what I wasn't necessarily expecting was that the loop carries a lot of marks, epigenetic marks, DNA methylation, and so on, of heterochromatin. First of all, this tells you that an intervening gene that is repressed is probably not the target of that enhancer. But in some cases there are actually little windows that aren't heterochromatinized, that have active genes in them, with this heterochromatin in between, so there may actually be something physical or structural going on where it's helpful to compact the chromatin and bring the enhancer and promoter closer together. The biophysical modeling literature that we've been reading has some spring models and other theories about heterochromatin and how it helps these sorts of interactions. Finally, I said there were some "active" promoters in the window, but frequently these are false signals, because what we actually see when we look at the gene bodies of those genes is that the polymerase, while loaded at the promoters, is not actively elongating and making transcripts. Now, what about the false interactions, the cases where a promoter and enhancer don't interact? The window in between often has the cohesin complex on it, including ZNF143, the zinc finger protein we heard about from Matthew this morning, suggesting that there is a chromatin loop, a pinching off by the cohesin complex of a real chromatin interaction, but with an intervening promoter, not the one you're considering. And there is some evidence that these loops are
actually connected to insulators as well. So this is giving information that there may be a different target gene, and there may also be a physical structure that prevents looping to a promoter further downstream. And then, mirroring what we saw, at true interactions we saw marks of open chromatin and elongation: active promoters and active gene bodies. Importantly, the meaning of these different features differed depending on whether we saw them at a promoter, at an enhancer, or in the window. The cohesin complex, which is a negative predictor of an interaction when it occurs in the window, is actually a positive predictor of an interaction when it's flanking the enhancer and promoter, as we heard from Matthew this morning. So it's important to split up the features by these different regions, and to keep in mind that a protein can serve a different function depending on where you see it physically along the DNA. We saw this not just for cohesin but for a number of other proteins too. What we started to see, and cohesin is an example of this, is that there seemed to be complexes forming on this looping chromatin: we would see several factors co-occurring, or being co-predictive. So we looked genome-wide at co-location; we made a map of the co-location of the predictive features. Here are some of the top features for the K562 cell line, and a dark color in this heat map means that, on this looping chromatin, two features occur at flanking or overlapping positions. Here's the cohesin complex, and those proteins are co-localized, as you would expect. But it's not just known complexes: when we form a network out of these co-localizations, we see some co-occurrences of different features that weren't previously known. In orange is our co-localization data; purple are known protein-protein interactions. So this suggests some potentially cooperative or interacting roles of some of these different
features that could be tested. And certainly, from the perspective of prediction, we need many of these variables in the model; no one of them alone is predictive. We need the combination of something co-occurring, or not co-occurring, with something else. So the big question for us and various collaborators, and certainly for studying human accelerated regions and their role in human development, is: can we do this outside of the ENCODE cell lines? Could we do this without the rich feature set? We put hundreds and hundreds of data sets into the machine learning algorithm; could we have done it with fewer? First we assumed that we still had some good training examples, some validated interactions and non-interactions to train on, and just asked: what if ENCODE had only ChIP'd five transcription factors, or ten, or fifteen, or twenty? How is the prediction accuracy affected? What's a minimal set of experiments? Pleasingly, we found that performance was very flat down to as few as sixteen data sets, totally flat, and still near-optimal with as few as eight. You can't just use one or two features; as I mentioned, it's a complex combination of things going on. If you look across examples in the genome, many of the non-interactions have, say, the cohesin complex on the looping chromatin, but not all of them, and the ones that don't might have some other feature, a different epigenetic mark. So you need several of these features, and it's not a random eight, but there are a good number of different sets of eight that give near-optimal performance. They are not the same features you would use for predicting promoters and enhancers, however; in most cases they're slightly different ones. But this does give some hope for moving into other cell lines, where you wouldn't need the time and budget and team of an ENCODE project. Now, what if I didn't have that high-resolution Hi-C, produced at the Broad for millions of
dollars? So I know that I can get away with fewer features; could I also get away with less, or no, training data? The worst-case scenario would be no training data: let's say I built the model on an ENCODE cell line and then plugged in the ChIP-seq from my cardiomyocytes. Can I make predictions? Is the model the same across different cell types? We tested that among the ENCODE cell lines, which are from totally different lineages, so it's sort of a worst-case scenario, to see if a model trained on one cell line could predict on another. We heard a little along these lines from Tyler last night; he also went across species. This is Fmax, a measure of predictive accuracy that is the harmonic mean of precision and recall at the best threshold. I already showed you these numbers, where we had a good balance of precision and recall when you train and test on data held out from the same cell line, and you can see performance does degrade when you go to a different cell line. So it's very helpful to have some training examples. It doesn't need to be genome-wide, high-resolution Hi-C, but you do need some good, unbiased training examples in your given cell type, and we're testing now whether some ChIA-PET, for example, might achieve this, which would be a less expensive experiment. But this is not horrible; it's still decent. We basically expect about 35% precision and 55% recall on a new cell line with only ten ChIP-seq data sets and no training data. That's kind of a worst-case scenario; we think that if you use a more closely related cell line it will be better than this, and that with a little training data, or a few more features, you can improve. So I thought this was a good place to start from. To summarize this TargetFinder project: our problem was to predict these interactions from things that are marking the DNA, and it improves significantly upon using the closest gene, which is frequently wrong and makes many false positive predictions.
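The Fmax measure mentioned above can be computed by sweeping thresholds over the classifier's score and taking the best F1 (harmonic mean of precision and recall). Here's a minimal sketch on simulated scores, which are an illustrative assumption rather than the talk's data.

```python
# Sketch of Fmax: best harmonic mean of precision and recall over
# all score thresholds.
import numpy as np
from sklearn.metrics import precision_recall_curve

rng = np.random.default_rng(5)
y = np.concatenate([np.zeros(500), np.ones(100)])
scores = np.concatenate([rng.normal(0.0, 1.0, 500),
                         rng.normal(2.0, 1.0, 100)])

precision, recall, _ = precision_recall_curve(y, scores)
# F1 at every threshold; guard against 0/0 at degenerate points.
with np.errstate(divide="ignore", invalid="ignore"):
    f1 = 2 * precision * recall / (precision + recall)
fmax = float(np.nanmax(f1))
print(round(fmax, 2))
```

Reporting the maximum over thresholds makes Fmax a single number for comparing classifiers whose precision/recall trade-offs sit at different operating points.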
The summary of our performance is that we can recover more than 90% of known pairs at a low false positive rate, and that if you did this on a different cell line with less data, it could be maybe as bad as 55%, as a worst-case scenario with very little data. The great thing, and probably the most important thing, is that the false positive rate is really low: our precision was high, and the false discovery rate was very low. In the last couple of minutes I just want to mention how you find the enhancers, because we've also used machine learning to work on this problem. It's published work, so I'll just briefly summarize it, but some of the same machine learning techniques have been helpful there. Which sequences function as these long-range enhancers? We've been particularly interested in development, because the bioinformatics tell us that many of the human accelerated regions function in development, and we also have a number of collaborations in heart and brain development at the Gladstone Institutes, where I work. The other reason to think about development is that there are many validated examples of enhancers, for example from the VISTA browser, and we'll hear from Len Pennacchio about that, I believe, in his talk tomorrow. These are pictures of mouse embryos where a candidate enhancer has been transiently transfected into the single-cell embryo, and you can see staining in the tissues, and at the time points during development, when that enhancer functions: either it turns on a reporter gene, or it was tested and there was no reporter gene expression. So this is a great proving ground; these are good training data, I should say, for doing supervised learning. We again used genomic features: evolutionary conservation, in this case of the sequence itself; functional genomics data, again at the potential enhancer location; and sequence motifs, both known binding sites, binding sites predicted from position-specific weight matrices, as well as simply enumerating all k-mers as a
way to get at binding sites for transcription factors that don't have a good motif model. In this case all three types of data were predictive. They predicted partially overlapping sets of enhancers, but each predicted some enhancers that the others did not, so it was helpful to combine them in a model, and the model that included all three types of data was the best-performing by far. This was a little different from the chromatin looping predictions, where we really got most of our power from the functional genomics data. Here we used support vector machines, and then a variant called multiple kernel learning, which allows you to build a separate kernel, or predictor, for each type of data and then take a linear, weighted combination of those for the overall predictor. This was helpful because we knew we needed all three types, and they're not on the same scale; they're very different types of data, and it can be hard to put them into a model together on a comparable, regularized scale. But I think there may be room for improvement trying other algorithms; we didn't do a lot of experimenting with different algorithms here. So, briefly, to summarize, here's our performance. Again we saw a very pleasing area under the curve: high power at a fairly low false positive rate. Importantly, this significantly improves upon using a few individual ChIP-seq data sets. In red, blue, and green are some of the typical enhancer marks, H3K4 monomethylation and H3K27 acetylation, as well as binding of the transcriptional co-activator p300. Each of those by itself is somewhat predictive; if we combine them and take the union of all of them, we get pretty high power but an exceedingly high false positive rate. So there was room for improving upon just doing intersections and unions of data sets, and we think that this is essentially the benefit of having some training data and using a machine learning framework
is that you can improve upon these sorts of simple bioinformatic combinations. Here the false discovery rate wasn't quite as good as in the looping predictions, which is interesting, because I actually thought this was an easier problem, but we still had pretty good recall at pretty high precision. So we made predictions across the human genome, for all of development, any tissue: about 84,000 predictions. They had many of the bioinformatic features of enhancers. Importantly for me, we predicted, fairly conservatively, that about a third of human accelerated regions were active during development. I'm not going to show you the results, but we did some of those mouse experiments ourselves, and 25 out of the 30 that we tested at just one developmental stage, embryonic day 11.5, were active enhancers in vivo. We suspect now that several others are active at other, later time points. Another piece of data supporting these predictions was work from Adam Siepel's lab looking at fitness effects of positions across the human genome, the fitCons scores. Even though they were looking in ENCODE cell lines and ours was trained on embryos, I was really surprised to see that we were doing almost as well as their method at predicting these sites with fitness effects. I don't understand exactly why, since they're totally different cell types, but I thought that was interesting. To conclude, just on where we're going from here: these mouse experiments are expensive and low-throughput; we can only test one enhancer at a time. But as many of you might be aware, this experiment can now be done in a high-throughput manner by taking this vector and putting a barcode downstream (putting the enhancer itself downstream is another variant), so that for every enhancer you're testing, you have a transcribed sequence that tells you that enhancer is working. Therefore you can assay the activity of many different enhancers by
RNA-seq. It's possible now to synthesize thousands of these, clone them, and then, in cell lines at least, screen thousands of enhancers, and specific mutations in enhancers, in parallel in the same cells. We are doing this in cell lines derived from induced pluripotent stem cells. My office is right next to Shinya Yamanaka, who's been referenced several times today; I feel really honored to work with him and Bruce Conklin and others at Gladstone, who are real whizzes at reprogramming cells and then differentiating them into different cell types, like neurons and, here, beating heart cells in the dish, that have a lot of the characteristics of the original tissues. This is fantastic, especially for human-chimp comparisons, because we could never get the tissues and cell lines to do direct comparisons with an ape, for various ethical reasons, even if we were able to obtain, say, human embryos; here we avoid those issues completely by reprogramming skin cells and deriving these various developmental cell types. So here's our approach: the computational work I've talked about today, the screening in iPS-derived cells, and then we still have to go back to animal models for real functional studies. I'll end there, thank our collaborators, especially Sean Whalen, who led the work on TargetFinder, and our funding sources, and I'm happy to take questions.

Question: Hi, do you think, with the looping studies, that the synteny would have been more predictive if you'd used species further apart?

Answer: We looked across all of mammalian evolution. If you go much further out, very few of the enhancers are conserved, and so it becomes really difficult to do that. So we looked about as far as we could in terms of being able to find a homologous promoter and enhancer. There was some signal there, but not as much as we had thought.

Question: And the signal that was there, was that mostly developmental genes?

Answer: Yes, actually, there is more synteny
for the developmental genes, yes.

Question: Great talk, thank you. Can you tell me anything about the resolution you end up with for your predictions? Because for the Hi-C it's like 1 kb, 4 kb, and there's often multiple enhancers...

Answer: The resolution is a single promoter and a single enhancer, so a kb or less.

Question: So you're able to parse out, within a given block called by Hi-C, those that are most likely to be interacting?

Answer: Yes, that's precisely it. A regular Hi-C experiment might be at something like 25 kb; by using the ChIP-seq peaks we can resolve it, in silico, down to a single promoter and enhancer.

Question: I was wondering if you could comment on how well your enhancer finder works in terms of enhancers being active in a tissue- or cell-specific fashion.

Answer: How cell-type-specific is it? So, besides just predicting any tissue in the developing embryo, we tried to then go on and predict the tissue, and that's a harder problem; the AUC is more like 60 to 70%, depending on the tissue. Heart enhancers were very easy to predict: they had a specific GC content, low evolutionary conservation, interestingly, and some very specific motifs. Some of the other tissues, like limb or brain, were a bit harder, and I think that's partially because we didn't have quite the right ChIP-seq data. I should have emphasized: we're predicting in the developing embryo, and we're using ENCODE and Roadmap Epigenomics and about everything we could get our hands on, basically everything that's ever been deposited, very little of which is from a developmental cell type. We looked at heart development because we do have collaborators who are studying differentiation into cardiomyocytes, and prediction did improve a little at getting specifically the embryonic heart by putting in the data sets from the iPS-derived cardiomyocytes, but it still wasn't quite as good as just predicting overall that something is a developmental enhancer. So I think there's still some room to improve on tissue specificity.

Question: That
was a great talk, thank you. A question about TargetFinder: could you talk about what kind of cross-validation you did?

Answer: Yes, cross-validation was incredibly important, because we wouldn't want to overfit, and we needed some measure of performance. We tried a number of things, but the results I showed, the AUC curves and the precision and recall values, were from tenfold cross-validation, repeated. It's ensemble learning, so within each step in the random forest we're performing that. It's very computationally intensive, but it makes sure that there's no bleeding from the training data into the test data. It's very important that you do that right, that you aren't, within your ensemble, having an example sometimes on the training side and sometimes on the test side; that can give a very rosy but inaccurate measure of performance. So, yes, those were the cross-validation error rates. Does that answer your question?

Question: Yeah, great.

Answer: Yeah. So, maybe lunchtime?