It's my great pleasure to introduce Anshul Kundaje from Stanford University. Anshul is a star in the field of machine learning and computational biology. He did his PhD at Columbia University, was a postdoctoral fellow at Stanford and a research scientist at MIT, before joining Stanford as an assistant professor of genetics and computer science. His particular focus is machine learning for regulatory genomics. He has won a number of major awards: the Chen Award of Excellence from the Human Genome Organization, the NIH Director's New Innovator Award, and the Alfred P. Sloan Foundation Research Fellowship, and he is also an advisor on the NIH Director's advisory committee on artificial intelligence and biomedical research. And if you remember Bernie's talk from yesterday, then you'll remember that he also said Anshul is one of the people who can do very interesting and clever things with deep learning in computational biology. And I think we'll learn more about this now. So we are happy to have you here, and the floor is yours, Anshul. I would also like to point out that he's on the West Coast, so he got up early in the morning to deliver this talk, which we all appreciate very much. So please, go ahead. Thanks, Anshul.

Thank you. So thank you for the kind introduction, Karsten, and thank you all for being here today. I wish I could be there in person, but I guess this is better than nothing. So today I'm going to talk to you about some recent advances that we have made in applying deep learning approaches to model various kinds of regulatory genomics data, with the goal of trying to discover cis-regulatory syntax. This work is actually in a preprint, which is right here; it has a similar title, so if you just Google for it, you'll probably find it. And before I begin, I just want to start with some acknowledgments. This was an incredible collaboration. Žiga Avsec, who was a graduate student at the Technical University of Munich in Julien Gagneur's lab, visited my lab about two years ago over the summer, just for three months. I had this idea at the time, and my students were all inundated with projects, so I just asked Žiga to give it a try. He is just a fantastic scientist; he took this and took it to heights that we could never have imagined. Avanti Shrikumar in my lab just graduated; in fact, she's leaving the lab in five days. She's been tremendous and has built a whole slew of very interesting methods for interpreting neural networks, general approaches, not just for genomics. Melanie Weilert is in Julia Zeitlinger's lab, so this was a very nice collaboration with Julia, who is a fantastic biologist and expert on the cis-regulatory code. She has also done really great assay development, and we use one of her assays in today's talk. And Amr was a new graduate student in my lab who helped with the project as well. So this was a really great collaborative effort from three labs, effectively, and I'll talk to you about their work.

So let's talk about the main goal of the project. As you know, you can now perform genome-wide high-throughput experiments that profile chromatin accessibility, for example ATAC-seq or DNase-seq experiments, or you can perform ChIP-seq experiments that profile hundreds of transcription factors in many different cell lines and tissues. These datasets are allowing us to get really detailed maps of regulatory elements in the genome across different cell types, tissues, and contexts.
The key question remains: how are regulatory proteins like transcription factors able to recognize these specific elements in specific cell types, giving rise to dynamic binding and regulatory landscapes? As you know, these transcription factors recognize short DNA sequence motifs in the genome, and they bind combinatorially. Traditionally, work has largely focused on motif discovery, that is, how individual transcription factors bind specific motifs. What we were interested in was moving one step further and trying to discover rules of motif syntax. By that we mean the composition of regulatory elements, the rules of arrangement of these motifs, preferred spacings and orientations, and effectively how these rules have quantitative effects on transcription factor binding: cooperative effects, additive effects, what the nature of the actual effects of these combinations of motifs coming together is. So we want to build quantitative models that can really make strong predictions and derive insights about syntax. That's really what I'll focus on today.

So just to go over how we translate this into a machine learning task: if you're given a transcription factor ChIP-seq dataset, or an ATAC-seq or DNase-seq dataset, effectively what you're getting is a readout across the whole genome, right? So let's say the human genome, it's three billion bases. Every base is associated with a corresponding estimate of binding, which could be continuous, so it could be counts of reads or some kind of normalized enrichment of reads, or it could be binary, for example bound/unbound or active/inactive based on some peak-calling strategy. And so what you end up with, if you bin the genome into, let's say, thousand-base-pair bins, is millions of sequences of length a thousand base pairs, a library of sequences, associated with corresponding labels, right? Again, these labels could be binary or continuous. So if you look at this kind of dataset, you've got a large collection of input sequences X_i, and your goal is to map them to some scalar label Y_i by learning some prediction function, so it fits very nicely into a classification or regression supervised machine learning task, right? There's been tons of work in this space over many years to solve this problem, ranging from linear models to probabilistic approaches to support vector machines and, more recently, neural networks.

So how do neural networks actually work? I'll give a quick 101 with a specific example. In this case, we have millions of DNA sequences, or we want to take any DNA sequence, as shown here at the bottom, and map it to, let's say, a binary value, answering the question: is this sequence bound by a particular TF or not? So what we can do is take the sequence and first represent it with what's called a one-hot encoding, which is simply a matrix of four rows and L columns with a one or a zero indicating which nucleotide is present at each position, okay? And the basic unit of a neural network is an artificial neuron. You can think of it as a simple pattern detector: it takes in a bunch of inputs, in this case five nucleotides, and tries to detect some pattern in them. So what is a neuron? Well, an artificial neuron is just a simple prediction function that takes in a bunch of inputs and produces some output, okay? In this case, this neuron takes 20 inputs, five positions by four bases, right? 20 binary values.
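As an aside, here is a minimal sketch of the one-hot encoding just described (illustrative Python, not the speaker's code; the helper name is hypothetical):

```python
import numpy as np

def one_hot_encode(seq):
    """Encode a DNA string as a 4 x L matrix (rows = A, C, G, T)."""
    mapping = {"A": 0, "C": 1, "G": 2, "T": 3}
    onehot = np.zeros((4, len(seq)), dtype=np.float32)
    for i, base in enumerate(seq.upper()):
        if base in mapping:          # unknown bases (e.g. N) stay all-zero
            onehot[mapping[base], i] = 1.0
    return onehot

x = one_hot_encode("ACGTG")
print(x.shape)  # (4, 5) -> the 20 binary inputs for a 5-bp window
```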
And the simplest kind of neuron, or prediction function, you can think of is a linear function. It just takes these inputs and performs a linear combination to produce an output, Z. And since we have 20 inputs, the neuron would have to have 20 weights, right? 20 parameters, excluding the bias, okay? We can arrange these 20 parameters in the same configuration as the input, so that would be a matrix of size four by five: A, C, G, T as the rows, five positions as the columns. And let's assume we knew the values of these weights, some positive and negative values. This neuron's parameters would look basically like a matrix, right? If you visualize this matrix, you'll notice that it effectively looks like a motif. So the key insight here is that for an artificial neuron operating on a DNA sequence, the weights of that neuron effectively encode a motif-like pattern detector, okay? This is a nice parallel between the classical sequence motifs that we've been using for years and the weight matrices encoded by artificial neurons in neural networks. If you want to score the sequence with this neuron, what you do is take these weights, multiply them element-wise by the inputs, and add them up, right? A simple element-wise multiplication followed by a sum. So in this particular case, we get a score Z of 11.6; that's the linear score. And what you typically do is take this linear output from the neuron and pass it through some non-linear function H. One class of non-linear functions typically used is the logistic, which basically takes Z and squashes it into a zero-to-one range, right? This is quite nice if you want to convert the outputs into probabilities; it's very similar to logistic regression, basically. The other commonly used non-linear function is the ReLU, the rectified linear unit, which just takes the output, 11.6, and if it's less than zero, thresholds it to zero. So think of it as a thresholding function, right? In this case, the output Y would be 11.6.

Now, this is how a single artificial neuron acts on the sequence, at a particular position in the sequence. But if our goal is to take any DNA sequence input and classify whether it's bound or not by a TF, obviously we need more than one pattern detector, right? Because the sequence could have many, many different motifs and many different patterns. So instead of having one neuron, which captures one type of pattern, imagine having a bank of neurons, in this case, let's say hundreds of neurons, with the goal that each neuron will try to learn or encode a different motif-like pattern. And then if you're given a motif, what do you typically do? You take the motif and you scan the sequence, right, to look for high-scoring matches. You can do exactly that operation by taking this bank of neurons (in this case I've shown you three, but imagine you've got a hundred) and replicating them one base at a time, such that all the neurons shown in the same color have the same weights. So what's effectively happening is you're taking each neuron and scanning the sequence with it, right? And each window in the sequence is being scored according to how good a match it is to the weights, the motif, encoded by the neuron. And so basically you've got a new matrix of match scores of all of these patterns across the sequence. Now what you can do is stack more neurons on top of the first layer.
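To make the neuron-as-motif-scanner idea concrete, here is a small illustrative sketch, assuming a random 4-by-5 weight matrix standing in for a learned motif; scanning the sequence one base at a time is exactly a one-dimensional convolution. This is not the talk's code, just the arithmetic it describes:

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(4, 5))     # hypothetical neuron weights, shaped like a 4 x 5 motif
b = 0.0                         # bias term

def neuron_score(onehot_window, W, b):
    """Element-wise multiply the weights with a 4 x 5 one-hot window and sum, then apply ReLU.
    (A logistic output would instead be 1 / (1 + exp(-z)).)"""
    z = float(np.sum(W * onehot_window)) + b
    return max(z, 0.0)          # ReLU: negative scores are thresholded to zero

def scan(onehot_seq, W, b):
    """Slide the neuron across the sequence, one base at a time (a 1-D convolution)."""
    L, w = onehot_seq.shape[1], W.shape[1]
    return np.array([neuron_score(onehot_seq[:, i:i + w], W, b) for i in range(L - w + 1)])
```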
And what these neurons are going to do is take the patterns learned by the first layer and combine them, do linear combinations of them, and create more complex patterns. As you stack more and more layers, you can learn arbitrarily complex patterns, until the final layer is learning some very complex non-linear representation of the DNA sequence. And then that final layer basically goes through a linear function or a logistic function to predict the output. So if your goal is to predict binary outputs, this is basically logistic regression on the final layer, right? We're inducing a whole bunch of novel features directly from the raw DNA sequence, and then the last layer is performing logistic regression or linear regression. This kind of model has been applied quite often to DNA sequence since 2015, starting with the famous DeepBind paper. And all of these neurons have weights and biases, and your goal is to learn these weights; that's the whole point of learning the model. So given millions of these sequence windows with the associated labels coming from the ChIP-seq experiment, you can now train this model. You give it millions of training examples, and an algorithm like stochastic gradient descent with backpropagation can effectively optimize these weights to minimize some loss function which compares the predicted outputs to the actual observed outputs and tries to minimize the difference, okay?

So this has been quite successful, but I'll argue that this approach is losing a lot of information, and the reason is as follows. If you take any of the classical experiments that are used to profile regulatory DNA, like ChIP-seq, ChIP-exo or ChIP-nexus, DNase-seq or ATAC-seq, in this current approach you're taking a large chunk of sequence and you're assigning it a scalar label, right? Either total counts or enrichment or binary scores. The problem with that is, if you actually look at the data at single-base resolution, you notice that these assays actually produce beautiful profiles around binding sites. ChIP-seq, for example, produces these stranded shifts in read coverage around binding sites. ChIP-exo and ChIP-nexus produce even higher-resolution footprints, these very specific mirrored read pileups right near the binding site. As you know, DNase-seq and ATAC-seq also produce footprints where proteins are bound, so at base resolution you will see these footprints with read pileups on either side. And all of this fine-scale structure of the reads actually carries very interesting information about how proteins bind DNA, right? The actual contacts. And so we don't want to lose this information by taking this entire region and just assigning some binary label to it, or a continuous label like a count. So our goal was to model this data at base resolution. And this is an example of some ChIP-nexus data from Julia's lab. Basically, ChIP-nexus is similar to ChIP-seq, except it uses exonucleases that digest the DNA all the way to the protein-DNA contact, and so you get very high-resolution footprints. This is one locus in the genome where you can see binding of four classic pluripotency transcription factors, Oct4, Sox2, Nanog, and Klf4. And you can see, even at base resolution at a single enhancer, beautiful combinatorial footprints of the four transcription factors.
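A minimal sketch of this kind of DeepBind-style binary classifier, written in PyTorch with illustrative layer sizes (an assumption for illustration, not the actual architecture from DeepBind or from this work):

```python
import torch
import torch.nn as nn

class TFBindingClassifier(nn.Module):
    """Toy DeepBind-style CNN: motif-scanning filters -> max pooling -> logistic output."""
    def __init__(self, n_filters=100, filter_width=19):
        super().__init__()
        self.conv = nn.Conv1d(4, n_filters, filter_width, padding=filter_width // 2)
        self.relu = nn.ReLU()
        self.pool = nn.AdaptiveMaxPool1d(1)   # strongest match of each motif detector
        self.fc = nn.Linear(n_filters, 1)     # final layer ~ logistic regression on learned features

    def forward(self, x):                      # x: (batch, 4, seq_len) one-hot sequences
        h = self.pool(self.relu(self.conv(x))).squeeze(-1)
        return torch.sigmoid(self.fc(h))       # probability of being bound

model = TFBindingClassifier()
loss_fn = nn.BCELoss()   # trained with SGD / backpropagation against the binary labels
```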
So this is the data we're going to model today, specifically this system. But the approach I'm going to talk to you about is very general and can be applied to all kinds of regulatory genomics data, not just ChIP-exo. You can apply it to DNase-seq, ATAC-seq, ChIP-seq, PRO-seq, and we've done a lot of this in the last few months. So here's the basic idea, or the basic tweak. Instead of predicting some scalar value, our goal is now to take a DNA sequence input and map it base by base to a corresponding profile. And we directly model count profiles. We don't even have to necessarily do peak calling: we just take the ChIP-seq or ChIP-exo data, map it to the genome, and count how many 5' ends of reads we get at each position on the plus and minus strands. Our goal is to predict the two strands separately because, as you saw, there is this interesting strand-shift information that can potentially give you insights into where the binding sites precisely are. So our goal is to take the sequence and map it to a profile. Think of it almost like a text-to-speech converter, except that we're not going to use fancy architectures; we're still going to stick to regular convolutional neural networks like I presented before. But our model is fully convolutional. That means, if you know something about CNNs, we're not using any pooling layers and so forth. It's basically just these motif detectors stacked on top of each other all the way up to the final layer. And we're going to try to map 1,000-base-pair sequences to 1,000-base-pair profiles across the entire genome. Our goal is also to simultaneously predict not just the plus and minus strands of one transcription factor; you can multitask, that is, simultaneously predict the binding profiles for many, many TFs. In this case, we will predict four transcription factors together, okay? So they share the parameters all the way up to the last layer, and then at the last layer they diverge into their respective outputs.

The way we do this is we try to model this data in two ways. If you think about it, a profile has two properties. One is the shape of the signal; think of it as the probability of observing reads at each position in the sequence. The second aspect is the total number of reads in that region. So you can think of it almost like a multinomial distribution, where you have a total number of reads that are mapped to that 1,000-base-pair sequence, and then you have to decide precisely how you distribute those n reads across each position in that 1,000-base-pair window, with some probability at each position. So there's a multinomial distribution running over the sequence, and you're trying to figure out not only how to predict the total number of reads, the scalar people typically used to predict before, but also precisely how that total number of reads is distributed across each position, based on some multinomial distribution. So we designed a loss function to optimize the model which couples these two outputs: we simultaneously predict the total counts in the region using a mean squared error on the log of the total counts, and we use a multinomial negative log-likelihood as the loss function for the profile prediction, which captures the probability of observing reads at each position, okay? These are conditioned on the sequence.
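A rough sketch of such a coupled loss, for a single strand of one TF; the actual BPNet loss, its parameterization, and the weighting between the two terms may differ in detail:

```python
import torch
import torch.nn.functional as F

def profile_plus_count_loss(profile_logits, pred_log_counts, obs_counts, count_weight=1.0):
    """Couple a per-base multinomial NLL (profile shape) with an MSE on log total counts.

    profile_logits:  (batch, L) unnormalized per-base scores for one strand/task
    pred_log_counts: (batch,)   predicted log(total counts) for the region
    obs_counts:      (batch, L) observed 5' read counts
    """
    log_p = F.log_softmax(profile_logits, dim=-1)        # multinomial probabilities over positions
    multinomial_nll = -(obs_counts * log_p).sum(dim=-1)  # NLL up to the multinomial coefficient
    obs_log_total = torch.log1p(obs_counts.sum(dim=-1))
    count_mse = (pred_log_counts - obs_log_total) ** 2
    return (multinomial_nll + count_weight * count_mse).mean()
```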
Now, the other nice thing we introduced: all of these assays have a lot of biases, right? As you know, enzyme biases and whatnot. Typically people correct for these biases before they fit models. What we decided to do is have an automatic bias-correction system, where we add some measure of bias, which could be, for example, input DNA or a DNase bias track or whatever; in our case, we use a PAtCh-Cap bias track. We add that as an auxiliary input in the final layer of the model, and so the model automatically learns to regress out the bias and fit the sequence to the residual. This all happens simultaneously. And lastly, the architecture of the model is just a fancier version of a convolutional neural network: we use something called dilated convolutions, which allow us to save parameters, and we also use residual connections, which allow us to build pretty deep networks, okay? So that's the model, and we train it across the whole genome.

The model does remarkably well at prediction. What we do is hold out some chromosomes and train on the others, and we do this in 10-fold cross-validation. These are two representative loci at base resolution: this is one enhancer, this is another enhancer. The top plots in each color represent the observed profiles from the actual data, and the bottom tracks represent the predicted profiles from the model, using sequence only, and these are on held-out chromosomes, so the sequence has never been seen in training. And you can see that at base resolution the models actually do quite remarkably well. In some cases they look almost indistinguishable from the data; in other cases, they in fact denoise the data. If you think about it carefully, you're predicting the expected value of a multinomial, so the model actually has the ability to correct dropout naturally. When you have sparse data at a particular locus, the model imputes the expected distribution, which is quite nice.

So that's at two loci; we want to know how we actually do across the whole genome. What we do is evaluate the performance of the model on both tasks: how do we do on total counts, and how do we do on the shape of the profile? Now, for ChIP-exo, what you really care about are the locations of these footprints, the precise footprints. So even though we predict the profile at base resolution as a continuous count profile, we can evaluate it in a slightly different way. We've used multiple metrics; I'm just showing you one of them. Think of taking the observed profile and marking each base as positive or negative. The positives represent the spikes, sort of the flanks of the footprints. So wherever you see the spikes, you label those positions as positive, and the rest of the sequence gets labeled negative, and this happens across all the peaks in the genome. Now you can take the predicted profile and evaluate it against these binary labels at each base, and the model should have a high area under the precision-recall curve, telling us that the predicted profiles produce footprint locations that line up with the observed footprint locations. Now, the data is obviously noisy, and so we want to allow for some wiggle room.
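As a toy version of this evaluation idea (the paper's exact labeling of footprint positions, and the bin-level smoothing discussed next, are not reproduced here; the top-fraction cutoff is an assumption for illustration):

```python
import numpy as np
from sklearn.metrics import average_precision_score

def footprint_auprc(obs_profile, pred_profile, top_frac=0.01):
    """Label the highest observed positions (the footprint spikes) as positives and
    everything else as negatives, then score the predicted profile against those labels."""
    cutoff = np.quantile(obs_profile, 1.0 - top_frac)   # illustrative way to pick 'spike' positions
    labels = (obs_profile >= cutoff).astype(int)
    return average_precision_score(labels, pred_profile)  # area under the precision-recall curve
```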
So what we do is smooth the predicted positions over different bin sizes, ranging from one base pair, which would be exact matching, up to 10 base pairs. So we allow for some smoothness. We also need a reference for how well we're doing, so we compare the model's performance to the similarity between replicates: if you use one replicate to predict another replicate, what would the concordance between the two replicates be? So the yellow points you see show the area under the precision-recall curve between replicate experiments, and these are very high-quality replicates. You can also pool the data and create sampling replicates, which are just pseudo-replicates, and that gives you an even higher upper bound. This is about as well as you could do; it basically reflects the data quality, the noise in the data, in some sense. So the yellow points represent the replicates, the blue points represent the model, and the green points represent what an average profile would do: if you just take all the peaks, average a footprint, and use that to predict binding, that's what the green points look like. And finally, the black points represent a random profile: if I just shuffle the profiles and try to predict the actual observed profile, what would it look like? So the black is the lower bound, the yellow is kind of an upper bound, and the green is what you would do naively if you just averaged, if you didn't have a model that actually learned from sequence. What you can see is that on average, for most of our TFs, we are almost as good as or slightly better than the replicates. The reason we're slightly better is that the model is denoising the data, and we are fitting the model to the pooled replicates, so we pool the data and fit to it. Effectively we're able to learn on a higher signal-to-noise-ratio dataset and really maximize how much we can learn. So this is very cool, because basically what we're seeing, at least in terms of profile shape and the locations of the footprints, is that the model is able to predict from raw DNA sequence, at base resolution, with precision as high as replicate experiments, which is very encouraging.

But we also want to look at the total count predictions, right? You not only want to get the profiles correct, you also want to get the total counts correct. In the total count prediction, what we observe is something quite different. The replicates show very high concordance, 0.96, 0.82, 0.97. The model does well, certainly better than anything you've seen before, but it's still not very high. And this is just the correlation on the peak regions, okay? If you do genome-wide correlation, this is going to look very high, because most of your genome is basically stuck at zero; that gives you artificially high correlations. We're only looking at the signal regime of the data, that is, the peak regions. And you can see that we're actually getting pretty good correlation even across peaks, but it's not anywhere close to the replicates. And we actually think this is not an artifact of an inferior model. You should note that we're only using 1,000 base pairs of sequence in a region to predict binding at that region, and obviously there are many other things that contribute to binding of a transcription factor.
For example, the chromatin state of the region has an influence: if the DNA is repressed, then obviously a protein would not bind there even if it has the right sequence. Similarly, long-range interactions, distal interactions, we know have an influence on local occupancy. For example, people have found distal QTLs, SNPs in other enhancers, that can drive quantitative changes in binding at some other enhancer. We are not modeling any of those aspects. So the conclusion we can draw is that the local sequence is sufficient to predict the shape, the footprinting profile, of a transcription factor quite precisely, but if you want to model the total counts, the total occupancy at a site, you need more than the 1,000 base pairs. And we leverage the difference between these two predictions when we interpret the model to figure out what's happening.

All right, so that's the modeling piece. We're very happy with our predictions, but are they actually useful? The answer is no, because we already have the data, right? So what's the point? Well, our goal is not so much to make predictions of data we already observe; it is to take the neural network, with the parameters it has learned that allow it to predict the profiles of unseen sequences with high accuracy, interpret those, and see what they are actually learning, okay? So we're going to talk about model interpretation now. Given this kind of model, the first question we might ask is: if I'm given a particular enhancer in the genome, and it has this beautiful binding profile for this transcription factor, which nucleotides in the DNA sequence are actually helping the model make this prediction? Avanti in my lab built an algorithm called DeepLIFT, which we published in 2017 at ICML. It can take the output of any neural network and recursively decompose the contributions of individual neurons all the way down the network using a backpropagation scheme, which is very efficient. What you end up with is a contribution score for every nucleotide, such that if you sum up these contribution scores, you get the difference of the output prediction with respect to some reference input. So imagine a dinucleotide-shuffled input and the output it would produce: the contribution of each base gives you a decomposed version of the difference between the prediction for the actual sequence and the prediction for that reference sequence. That's quite nice, because now you can interpret the contributions, the importance scores, of every nucleotide in the sequence.

Just to give you some examples of what this looks like: this is the Oct4 gene, and this is a distal enhancer that regulates Oct4, very well annotated. It is bound by all four transcription factors, as you can see here, with beautiful footprints predicted by the model. And if you look at the DeepLIFT scores of the same sequence with respect to each of these tasks (remember we have four tasks, so we can interpret the same sequence using four different outputs and get different scores), you get these really clean base-resolution scores. Here's one that looks like a motif that's firing. You can see this motif is firing specifically for Nanog, and then for Klf4 these motifs are firing specifically. And so these actually look like transcription factor binding sites.
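A crude stand-in for these per-base contribution scores, using gradient-times-input on the one-hot sequence; real DeepLIFT instead propagates differences from a reference (such as a dinucleotide-shuffled sequence) through the network, so treat this only as a sketch. The model is assumed to return one scalar output per task:

```python
import torch

def input_x_gradient(model, onehot_seq, task_index=0):
    """Approximate per-base contribution scores as gradient * input for one task output.
    onehot_seq: tensor of shape (1, 4, L)."""
    x = onehot_seq.clone().requires_grad_(True)
    out = model(x)[..., task_index].sum()     # scalar output for the chosen task
    out.backward()
    contrib = (x.grad * x).sum(dim=1)         # collapse the 4 channels to one score per position
    return contrib.detach().squeeze(0).numpy()
```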
So if you take known motifs and map them, you can get a sense for which motifs are activating, or potentially combinatorially interacting, to predict each of these different binding profiles. So we can take any single enhancer and really dissect, at base resolution, which pieces of the sequence, which nucleotides, are driving binding. This is quite nice, and we can do it for every enhancer in the genome. But we'd also like to learn consolidated motif models from the network itself. One approach is to take a library of motifs, if they're very good, scan the sequence, and look for high-importance hits. But we also want to see whether the network is learning other, more interesting motifs; that's something we want to figure out. So what Avanti did: she has another algorithm called TF-MoDISco, where she takes all the enhancers in the genome, or all the binding sites, all the peaks, and applies DeepLIFT to them. That highlights all the important bases in each of these sequences. We can throw away the nucleotides that are not important, as predicted by the model, and restrict ourselves to what we call seqlets, short subsequences with high importance. And then she has a very clever clustering algorithm, based on Louvain community detection with a novel distance function to handle these continuous scores. One thing to note is that we are not simply clustering sequences; these are sequences with weights on each nucleotide, right? So she developed a new similarity metric, a continuous Jaccard similarity between these seqlets, and then applies community detection to that, with many other interesting tricks, to consolidate these seqlets into beautiful motifs. And these look like position weight matrices, but they are not, and this will be important as we go forward. We call these contribution weight matrices, because they are not averaging the frequency of the nucleotides, which is what position frequency matrices or position weight matrices typically do. What you're seeing is the average contribution score of each nucleotide across all the aligned seqlets, okay? So it does look like a PWM, but the interpretation of each letter is different: the height of the letter is not the information content based on the normalized frequency of the nucleotides, it is the average predictive contribution of each base.

So what do we learn? Well, the first thing we learn is that to predict four transcription factors, even though the neural network has hundreds of thousands or millions of parameters, you can collapse that into a much smaller set of motifs. But that set is also not four: the four transcription factors are not bound by just four motifs. You actually find around 50 different motifs, and each of these is quite distinct. They look kind of similar, but I'll get to the nuances of why they are actually different and whether they represent biology or not. I'm just going to show you a few, the main ones we'll focus on today. As you can see, you pick up the famous Oct-Sox heterodimer motif; this is the CWM, and this is the corresponding PFM, the position frequency matrix, derived from the model, okay? This is the Oct4 monomer motif, this is the Oct-Oct homodimer, this is the Sox2 monomer. We learn three different Nanog motifs; I'll talk about this quite a bit.
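To illustrate just the final averaging step that distinguishes a CWM from a PFM (the seqlet extraction, the continuous Jaccard similarity, and the community-detection clustering that TF-MoDISco actually performs are not shown):

```python
import numpy as np

def pfm_and_cwm(seqlet_onehots, seqlet_contribs):
    """Given aligned seqlets (each a 4 x w one-hot matrix) and their per-base contribution
    scores (same shape), build a position frequency matrix and a contribution weight
    matrix by simple averaging."""
    onehots = np.stack(seqlet_onehots)     # (n_seqlets, 4, w)
    contribs = np.stack(seqlet_contribs)   # (n_seqlets, 4, w)
    pfm = onehots.mean(axis=0)             # average nucleotide frequency per position
    cwm = contribs.mean(axis=0)            # average predictive contribution per position
    return pfm, cwm
```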
There's the Klf4 motif, the Klf4 long motif, there's the B-box (how did that appear?), the Zic3 motif, and Esrrb. There are also, as you can see, motifs of transcription factors that we didn't even profile, and I'll get to that in a second as well. Each of these motifs has combinatorial contributions to binding of all four transcription factors. So for example, you can see the Oct-Sox motif is very important not only for Oct4 and Sox2 binding, but also for Nanog binding. So you see these nice TF-specific contributions of the same motifs, right? You also see that the footprints very strongly support each of these motifs, based on their instances. You can again see the Oct-Sox motif has beautiful footprints for Oct4, Sox2, and Nanog. And you can see the direct binding events: we know, for example, that Oct4 and Sox2 directly bind the Oct-Sox motif, and you see very sharp footprints. Here's Nanog's direct binding site; you can again see really sharp footprints. Here, where Nanog is being recruited at the Sox2 motif, presumably through tethered binding, you see fuzzy footprints. So we can actually distinguish direct binding from indirect binding very effectively by looking at the shapes of the footprints, the fuzziness or sharpness, and we have some nice ways of analyzing that in the paper.

Now we can also take all these motifs and look at the number of predictive instances in the genome, and we find, for example, 20,000 to 50,000 predictive instances of the Klf4 motif, and so forth. These are much cleaner versions of the motifs than what you get from just sequence scanning, because these are not just matches to the motif, they are predictive matches supported by footprints, right? So we get high-resolution predictions of directly bound or indirectly bound motifs that are actually driving the profiles. And we can take any single binding event, like a ChIP-seq peak, sorry, a ChIP-nexus peak, 1,000 base pairs, and dissect how many motifs we see in those 1,000 base pairs. And we usually see quite a few: the mode is about three or four, the median is about four, and in some cases you have up to eight motifs inside a single binding site, collaborating to bind a single transcription factor, which is really cool.

Just to give you some evidence of indirect, tethered binding: there are obviously the direct binding motifs, the known motifs, but we also see very interesting tethering. For example, Zic3, which we never profiled, its motif shows up as being very important for Nanog and Klf4. And we actually did a ChIP-nexus experiment for Zic3, and we do in fact see very sharp footprints precisely at these locations. So you can contrast the sharp footprint of direct Zic3 binding against the fuzzy footprints of Nanog and Klf4. We also see this TFIIIC B-box, and this was a very interesting story. We found this motif popping up specifically for Oct4, as you can see, and these instances are all localized at tRNA genes. And what's really interesting is where the B-box actually sits: if you take all the tRNA genes and look at their start and end positions, the TSS and TTS, you can see Oct4 binds really strongly at the start and the end and shows a tethered footprint at the B-box. So what we believe is happening is some kind of looping event where Oct4 is binding the start and the end of the tRNA gene, you've got TFIIIC, which transcribes the tRNA, and Oct4 interacts with it and creates an indirect footprint at the B-box.
That's our hypothesis; we've not proved it, but it's very interesting, indicating a potential role of Oct4 in regulating these tRNA genes in embryonic stem cells. We can also find multiple different versions of the Nanog motif. We not only find the classical Nanog motif, this very strong TCA or ATCA motif, that's the primary footprint as you can see right here, but we find two other versions of the Nanog motif. You might say, hmm, are these real, or are these just kind of random things that your model is finding? Luckily, we have a crystal structure of Nanog binding DNA as a monomer, and in fact the contact region in the crystal structure is exactly AATGGC, so it's actually this piece right here, and that produces a different footprint than the main footprint. And then we find a third footprint, contributed by this GGAATC, which we think is very likely coming from another partner. So we're actually picking up very interesting, distinct footprints of the Nanog transcription factor.

So that's great: we're able to learn interesting motifs, more complexity of motifs, and subtle differences in motifs driving these binding events. Now you want to figure out syntax, not just motifs; you want to start figuring out these syntactic rules. So how do we do that? We first asked a question about which there have been hypotheses: do these motifs in enhancers have strict spacing constraints? Does one motif like to be next to another motif at some fixed spacing? So we tried to find instances of pairs of motifs that showed very strong spacing constraints. The x-axis is just the distance between the motifs, and the y-axis is the frequency of how often you see them at exactly those distances, and what you see spiking right here are the motif pairs that show very strong spacing constraints. Some of them are classic composite motifs, like Nanog-Nanog homodimers; you also see Oct-Sox-Nanog, the triplet. These are very short spacings, which makes sense; they are probably direct protein-protein interactions binding DNA. But then we find a bunch of these which are very far apart, like 46 base pairs apart, 68 base pairs, 62, with really fixed spacing constraints. If we figure out where in the genome these fall, they are all sitting in retrotransposons, repeat elements, right? And so what we're seeing is that there's very little evidence of fixed spacing constraints as a cis-regulatory rule of syntax. Most of it is variable spacing; the fixed spacing is really coming from a background of retrotransposon elements that contain these motifs. And we can take these individual retrotransposons and beautifully dissect the precise positions of these motifs within the repeats, okay? Analyzing these retro elements is often quite difficult because they're very repetitive, and if you try to learn motifs directly from the data, you just learn the whole retrotransposon, because the entire sequence is replicated many times, so you learn these gigantic motifs, 46 base pairs, 50 base pairs. But if you look at the importance scores, if you construct this contribution weight matrix, you can pick out the precise bases that are actually bound, supported by the footprints. So we can use this quite nicely to dissect retrotransposons and figure out interesting evolutionary properties of these sequences.
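A toy sketch of the kind of spacing analysis described here, assuming hypothetical dictionaries that map each region to the motif-instance centers found in it (not the paper's code):

```python
import numpy as np
from collections import Counter

def spacing_histogram(motif_hits_a, motif_hits_b, max_dist=200):
    """Count center-to-center distances between instances of two motifs that fall in the
    same region; sharp spikes in this histogram suggest fixed spacing constraints."""
    counts = Counter()
    for region in set(motif_hits_a) & set(motif_hits_b):
        for pa in motif_hits_a[region]:        # dict: region -> list of motif centers
            for pb in motif_hits_b[region]:
                d = abs(pa - pb)
                if 0 < d <= max_dist:
                    counts[d] += 1
    return np.array([counts.get(d, 0) for d in range(1, max_dist + 1)])
```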
Now, we also find other very interesting, subtle properties. If you take the Nanog motif and look at its flanking region, the position frequency matrix shows nothing. But if you look at the contribution weight matrix, which, as I mentioned, captures not nucleotide frequencies but contribution scores, we actually see these really beautiful periodic spikes of importance, these TAAT spikes. You can visualize that here: each row corresponds to a different Nanog binding site, we've centered on this TCA motif, and we're just plotting the contribution scores in a heatmap. You can see the contribution scores rising and falling, decaying away from the motif, with a precise 10.5-base-pair periodicity. And the 10.5-base-pair periodicity, as you know, is the helical periodicity of DNA, right? And so what we've seen...

There is a question from the audience, if I may interrupt you, about the point you made earlier. The question is: how robust are the weights of a trained network across different runs? Do you get the same interpretations if you run DeepLIFT on retrained models?

Excellent question. In the paper, we actually did 10-fold cross-validation, so we explicitly tested the reproducibility of all of these results, and we see very good concordance across the 10 folds. So if you fit 10 different models to the data, the DeepLIFT as well as the MoDISco results are extremely stable. But we have another paper out on bioRxiv where we devised a new method to further stabilize these models, called attribution priors; you can take a look. We use basically a Fourier-transform-based constraint on the loss function to minimize high-frequency components in these DeepLIFT scores during training. It's a soft regularizer, and we show that it further dramatically improves reproducibility. So you can get extremely accurate and reproducible results, even more so than with this model, which did not use it, if you use the attribution priors. What we also see here is that because we have these base-resolution footprints, they really help localize signal. That's one of the nice advantages: if you just have a scalar for the whole sequence, the model can often make the same prediction using many different tricks, so if you fit 10 different models, you can get very unstable results. When you fit base-resolution profiles, there's more constraint in the output, and that naturally helps regularize the model. And on top of that, if you add our attribution priors, which are very simple, just a little trick added to the loss function, very efficient, you get an incredible boost in stability; the cross-validation folds almost look identical, which is shocking, because usually the models are quite unstable, okay? Thank you for the question, though. Stability is a very important issue in deep learning, and in interpretation in particular, and luckily, I think over the years we've learned how to stabilize these models, okay?

So this 10.5-base-pair periodicity is very likely coming from the fact that Nanog, like many other homeobox proteins, likes to bind nucleosomal DNA as a homodimer in this precise 10.5-base-pair configuration.
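A minimal sketch of how one might pull a dominant period out of an averaged flanking contribution-score profile with a Fourier transform; this shows only the idea, not the paper's exact spectral analysis or the attribution-prior regularizer mentioned above:

```python
import numpy as np

def dominant_period(avg_flank_contrib):
    """Estimate the dominant periodicity (in bp) of an averaged flanking
    contribution-score profile from its power spectrum."""
    signal = avg_flank_contrib - avg_flank_contrib.mean()
    power = np.abs(np.fft.rfft(signal)) ** 2
    freqs = np.fft.rfftfreq(len(signal), d=1.0)   # cycles per base pair
    k = np.argmax(power[1:]) + 1                  # skip the zero-frequency term
    return 1.0 / freqs[k]                         # a ~10.5 bp periodicity shows up as a peak here
```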
And this has been observed previously, even in vitro, with NCAP-SELEX data, which measures in vitro binding to nucleosomal DNA, okay? What's really interesting is that this 10.5-base-pair periodicity is seen for all the motifs that drive Nanog. Those same motifs also drive other factors, but when you look at the importance with respect to the other factors, you don't see that 10.5-base-pair periodicity. So this helical periodicity is a very specific property of Nanog and of all the motifs that drive Nanog binding directly or indirectly. We can also look for these soft spacing constraints between Nanog-Nanog homodimers: if we look for two Nanog motifs, do we see any preferred spacing? And for various configurations, whether they are on the same strand or on opposite strands, we again see that the contribution scores show beautiful 10.5-base-pair periodicities. So this, in some sense, supports the hypothesis that Nanog binds as a homodimer, where you have instances of the motif with soft helical spacing preferences. These are not hard preferences: it prefers multiples of 10.5, which strengthens binding. It's actually very hard to find this periodicity with regular motif scanning. If you just scan the genome for these motifs using PWMs, without using the contribution scores, it's quite hard to find the periodicity, and the reason is that the motifs are really short and you get lots of false-positive hits; it's difficult to separate the bound motifs from random hits. Once you train the model and get the contribution scores, you can really dissect this at fine resolution. You also see this periodicity between Nanog and other transcription factors, Nanog and Sox2, Nanog and Oct-Sox, and that's why we were seeing the 10.5-base-pair periodicity in the Fourier spectrum for the other TFs as well. And just to show you that this is not an artifact that the model is making up based on some parameters of the convolutional network: if you take these motif instances, line them up, and look at the raw reads from the data, you actually see this beautiful spiking, right? So what the model is doing is homing in on these spikes of reads. They're very subtle; you only see them once you align them really nicely, and the model is able to pick up this very subtle structure in the reads. Again, this highlights the importance of modeling the data at base resolution: if I just modeled this at 1,000-base-pair resolution without taking into account the shapes of the profiles, I would miss all this beautiful structure, which is what allows the model to learn this very subtle syntax.

So that's quite nice, but it doesn't tell us how the motifs interact to give rise to binding, right? Do we actually see cooperative effects? Do we see non-linear interactions? Does the motif of one factor affect binding of the other? So we developed a fairly powerful in silico framework for testing this. One of our strategies is to use synthetic sequences: we can create synthetic sequences, embed motifs in them, move these motifs towards and away from each other, and use the model as an oracle to predict what happens to binding. The other is to take the genome, actually mutate motifs, and predict what the model says would happen to binding.
So I'll show you examples of both of these in the next two minutes. This is an example of one of these simulations on synthetic DNA. We insert the Nanog motif right here and the Oct-Sox motif, and we move them towards each other. And what you're seeing is, in real time, the model predicting what the profiles will look like as these motifs move towards or away from each other. And we see something very beautiful. I'm plotting two curves here: the yellow curve is the response of Nanog, the change in the footprint height at this position as Oct-Sox moves towards it, and the red curve is the change in Oct-Sox binding as the motifs move towards each other. And you see something very striking. First of all, you see this very strong effect of Oct-Sox on Nanog, right? But the reverse is not true: Nanog has almost no influence on Oct-Sox binding. Also, the influence of Oct-Sox on Nanog decays out to about 150 base pairs, which is exactly the nucleosome range, and it oscillates with a 10.5-base-pair frequency. So again you see the cooperativity: the effect of the Oct-Sox motif on Nanog binding extends all the way up to a nucleosome length, with helical periodicity, with stabilization every 10.5 base pairs. And we do the same thing on the genome itself. You can take actual enhancers and, for example, delete the Oct-Sox motif, and you see binding of both transcription factors disappear; but if you delete the Nanog motif, the Oct-Sox pattern stays as it is while the Nanog pattern gets destroyed. We can again average across all the enhancers in the genome, and we again see this beautiful, asymmetric interaction effect between Oct-Sox and Nanog, with a 10.5-base-pair periodicity. And so we see all kinds of these beautiful asymmetric, directional effects of one motif on the other, one TF on the other. Some of these are short-range protein-protein interactions; others are nucleosome-mediated interactions extending all the way up to 150 base pairs.

And just to validate this, Julia actually performed beautiful CRISPR experiments at single loci. For example, here's the predicted profile for Sox2, here's the Sox2 motif, here's the observed profile; they look very similar. We can then mutate this motif, mutate these two base pairs, and predict what would happen. You can see that Sox2 binding disappears, and it also gets attenuated up to about 50 base pairs away. The observed data after the CRISPR mutation looks very similar, validating these medium-range Sox2-Sox2 cooperative effects. We also see this for Nanog: this is the same experiment, but now measuring Nanog binding. Again, Nanog binding gets destroyed at this position when you mutate the Sox2 motif, very similar to what you see in the actual observed data. And we can do the reverse and show that if you mutate the Nanog motif, it has short-range effects on Nanog binding but no effect on Sox2. So we have some nice CRISPR experiments validating this relationship. I'll skip this, but we have other nice validation against ATAC-seq experiments after TF depletion: the model actually predicts differential chromatin accessibility very nicely.
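A sketch of this kind of in silico spacing simulation, assuming a hypothetical model_predict function that maps a sequence string to a per-base predicted profile; the actual analysis averages over many random background sequences and reads out footprint heights more carefully than this:

```python
import numpy as np

def spacing_response(model_predict, background_seq, motif_a, motif_b,
                     pos_a=400, distances=range(10, 151)):
    """Embed motif_a at a fixed position and motif_b at increasing distances in a synthetic
    background sequence, and record the predicted footprint height at motif_a.
    background_seq is assumed to be long enough to hold both motifs at every distance."""
    responses = []
    for d in distances:
        seq = list(background_seq)
        seq[pos_a:pos_a + len(motif_a)] = motif_a
        pos_b = pos_a + len(motif_a) + d
        seq[pos_b:pos_b + len(motif_b)] = motif_b
        profile = model_predict("".join(seq))               # per-base predicted profile
        responses.append(profile[pos_a:pos_a + len(motif_a)].max())
    # Plotted against distance, periodic bumps every ~10.5 bp suggest helical phasing.
    return np.array(responses)
```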
Again, hinting that cooperative binding is very important for chromatin accessibility. And it also generalizes to massively parallel reporter experiments in the same mouse ESCs. So just to conclude: I presented BPNet, which is a general architecture to map DNA sequences to base-resolution profiles. You can apply this to ChIP-seq and ChIP-exo; in fact, in the paper we also have ChIP-seq. We've now applied it to CUT&RUN, to DNase-seq, to ATAC-seq, to single-cell ATAC-seq, to PRO-seq; it works really beautifully. We are expanding this to long-range histone ChIP-seq data. You can take these models and then use a suite of interpretation methods to reveal motifs, syntax, and cooperativity. And lastly, I showed you some examples of experimental validation at base resolution, really showing that the model's predictions are quite accurate with respect to binding, differential chromatin accessibility, and reporter expression. Right now we're working on expanding these methods, or testing them, for variant effect prediction. So again, thanks to all my collaborators and students who did all this work, and to the funders.

Thank you very much for a very exciting talk. So now it's time for questions. Let's start in the network. Damian Rokairo has a question. Damian, please.

Thank you, Karsten. Thank you very much for the talk. It's very exciting to see the models and the predictability and also the interpretability. I have a question about something you touched upon at the very beginning, and I don't want to get too technical, but when you discussed the idea of also considering the bias, I'm thinking of these ChIP-seq experiments where the peaks are actually called relative to a control DNA that you have. So I would think that that's a source of bias, and I was wondering, how do you model that?

Yeah, so the precise way we do it is we add the control track as an additional input to the model, but we only add it at the last layer, right before the output. So effectively, imagine you have two inputs: the sequence and the bias, and they're competing to explain the observed profile, and the model has to figure out which it prefers to use, right? You can also fit this in two steps: you can first ignore the sequence and fit a bias model, a very simple linear bias model that just takes the bias and tries to predict the profile; then you freeze those parameters and add the sequence, so the sequence has to fit the residual, right? And the nice thing about this is that we don't have to pre-correct the sequence, I'm sorry, the profiles. You just add the bias, and the neural network can learn even more complex transformations of the bias to explain the profile you see, and then use the sequence to explain only what you cannot explain with the bias.

I see.

That's the idea. And you can use all kinds of bias tracks. When you do DNase-seq, for example, we just take a DNase bias track and add it as an auxiliary input, and that regresses out all the enzyme bias, and then the sequence fits the residual.

Nice. Thank you.

Thank you. The second most popular question on Slido is the following: thanks for the talk; why a multinomial distribution when predicting binding sites from exo data?

I'm sorry, I didn't get that part. That's the question. Sure, no problem. So if you think about it, we are modeling the counts in their native state. We're not normalizing, we're not doing anything, because we have counts, and what is the best way to model counts?
So if you had counts over the 1,000 base pairs, you could imagine using a Poisson distribution or a negative binomial distribution, which captures the nature of the noise quite nicely; in fact, we did try that. But if you think about how counts are distributed across 1,000 base pairs, it's like having 1,000 bins and a bunch of balls that you're throwing into those bins. That's precisely what the reads are doing: there's some binding occupancy in the 1,000 base pairs, and then, depending on precisely where the TFs bind, that determines how those n reads are distributed across the 1,000 positions, the 1,000 bins. And the distribution that best captures the noise of that kind of event is a multinomial, right? You can think of other ones too, but the multinomial is a very natural fit. You can also see how the multinomial can correct for dropout events: you might observe zero reads at a position, but that might just be because the total number of reads you're distributing is very low. That base can still have a finite, not zero, probability of observing some reads, right? And since the multinomial fits the probabilities, you can actually impute expected values even when the data is very sparse. That's why we see examples where the observed reads across a sequence are very sparse, but the predicted profile is beautiful and continuous, with values at every position. That allows us to impute footprints at individual loci with very high confidence, even though the observed footprints are very noisy at individual locations. And the multinomial works even better for ChIP-seq, which is even noisier, right? The noisier the data gets, the more you want a loss function that is good at dealing with that kind of noise. Even for single-cell data, people nowadays use multinomial distributions quite a bit to model counts. That's part of the idea.

Thank you. There's another question from inside the network, Sukhi. Yes, go ahead.

Thank you for a very interesting presentation. My question is on the choice of a convolutional neural network. What do you think is its advantage over a recurrent neural network in your scenario? Because I think a recurrent network is more tailored to sequential data.

That's a very good question, and it's a common one. The way to think about it is that convolutional neural networks are also perfectly fine for sequential data; the question is how much structure you actually observe in the sequential data, right? If you're modeling protein sequences, I don't think you would have much hope with a plain CNN; you would want something like a recurrent model or a transformer, because the exact order really matters in protein-coding sequences. In regulatory DNA, as you know, motifs have a fuzzy syntax: you don't care exactly where they are, they can move around, and it's more of a translation-invariant problem. It's basically like looking at a one-dimensional image and scanning it for motifs, with mostly relative constraints between motifs. So the convolutional architecture fits that very well. We've tried recurrent architectures and have not seen any gains, and as you know, interpreting recurrent architectures is actually much harder.
So we have all these beautiful tools for interpreting CNNs, and we were basically able to push them as far as we could take them, and we're basically at replicate accuracy. So we don't see much of an advantage in switching to recurrent architectures. There will be situations where it might help. For example, if you're modeling very long-range interactions in the genome: right now we're looking at 1,000-base-pair chunks and modeling local sequence. As you start expanding, you might want to try more complex models that can capture very specific long-range dependencies, like, there was an event here, and then 5,000 base pairs later there's another event that interacts with it. For those kinds of much more structured sequential effects, things like transformers might be interesting. But at least for these local models, we don't see any advantage of recurrent architectures.

Yeah, awesome. Thank you.

Thank you very much. Unfortunately, we are running out of time, so I have to close the question session here. But thank you very much, Anshul, for this wonderful talk. We learned a lot about the power of deep learning on sequences. Thanks for joining our summer school, and also, again, for getting up so early to do so. Thanks a lot. Take care, bye-bye. Bye.