as well. So it's kind of a good time to talk about it. This is joint work with my students, Johnny Israeli right there, Anna and CS. I'm going to mostly talk about a case study where we use deep learning methods for learning patterns in regulatory DNA sequence.

So let's start off with some basic properties of regulatory DNA sequence. As you know, promoters, enhancers, and so on are usually bound by transcription factors. These are regulatory proteins that bind specific sequence elements, usually referred to as motifs. So regulatory sequences usually have well-defined sequence patterns, and these motif instances represent binding sites of transcription factors.

However, it's usually not as simple as that. Often transcription factors bind in the form of homotypic clusters, that is, multiple binding instances of the same TF, so the sequence will often show a density of hits of the same motif, more than one instance of the same motif. Transcription factors also work with each other, so regulatory DNA sequences will often show combinations of transcription factors binding, and the sequence will contain multiple instances of different types of motifs. These are referred to as heterotypic motif combinations, and we'll be focusing a bit on this kind of case study.

Alongside multiple binding events of multiple transcription factors, we also have very interesting spatial sequence grammars that are often encoded in regulatory DNA. So in this example, we have two transcription factors binding these two motifs, and you see that in these sequences they often bind together with a very specific spatial constraint, maybe five to seven base pairs apart. So these are just some of the rules that are encoded in regulatory DNA sequence, and these are the kinds of rules we'd like to learn from the raw sequence data.

To give you one example where this is relevant: as you know, ENCODE generates a ton of ChIP-seq data, transcription factor ChIP-seq data. You can think of taking any particular ChIP-seq data set and getting two types of sequences out of it. You get a bunch of sequences that are bound by the TF, so let's say that's the positive class of genomic sequences bound by your TF of interest, and then the rest of the genome is your negative class. Your main goal is to take these two classes of sequences and really understand what patterns exist in the DNA sequence that distinguish these two classes. And in the process, given a new sequence, you can predict whether the TF binds it or not. That's the motivating task right here.

So this is where we turn to supervised machine learning. The basic idea of supervised machine learning, in this case a classification task, is that, like I showed in the previous slide, you have two sets of regions, and they might have different properties. In this case, one class contains two TFs with fixed spacing, and the other class contains the same two TFs with variable spacing. Our goal is to take the sequences belonging to these two classes and learn a classification function that maps these sequences to their class labels. So take a sequence and predict whether it's class one or class negative one.
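To make that framing concrete, here is a minimal sketch in Python of how the two classes and the classification function fit together. The toy sequences, the motif string, and the placeholder classifier are all made up for illustration; they are not from the actual ChIP-seq data or from the model discussed here.

```python
# Minimal sketch of the supervised framing: positive sequences come from the
# TF's ChIP-seq peaks, negative sequences from the rest of the genome.
positives = ["GATTACAGGG", "TTGATTACAA"]   # toy "bound" sequences (hypothetical)
negatives = ["CCCCCCCCCC", "ATATATATAT"]   # toy "unbound" sequences (hypothetical)

X = positives + negatives
y = [+1] * len(positives) + [-1] * len(negatives)

# The goal is to learn a function f with f(sequence) ~ label.
# A trivial placeholder "classifier": does the sequence contain a candidate motif?
def f(seq, motif="GATTACA"):
    return +1 if motif in seq else -1

accuracy = sum(f(s) == label for s, label in zip(X, y)) / len(y)
print(accuracy)   # 1.0 on this toy data
```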
So training such a classification function in a supervised fashion means I'm given instances: I give my learning algorithm instances of sequences with their labels, so x, y pairs, and many such instances, and the algorithm automatically learns the function that maps the x's to the y's, the input sequences to the labels. That's called training the machine. Testing the machine is the case where, after you train the machine, you're given a new sequence x, you push it through the function, and it predicts the probability of that particular sequence belonging to class one or class negative one. That's testing.

So let's look at one of the simplest classifiers ever, which is an artificial neuron, which is actually the building block of neural networks. There's a little bit of math here, but it's pretty simple. Think of your sequence as, let's say, having three bases, x1, x2, and x3. You're trying to learn a function f that maps these bases of the sequence into an output value y that says one thing or the other. You can construct a very simple neuron of this sort, which simply takes these inputs x1, x2, x3, and weights each input by some values w1, w2, w3. You multiply each weight by its input, w1·x1, w2·x2, w3·x3, and you add these values up, plus a bias term b. So it's a simple linear combination, and that gives you your value z. You can think of this as the simplest possible neuron that could exist, a linear neuron. Training this neuron means learning these parameters: w1, w2, w3, and b are the parameters of the model, and by looking at the x, y pairs, you try to find the optimal w's. That's basically training the neuron.

Now we can do something much more interesting. The neuron I showed you before was a simple linear neuron; it just took a linear combination of the inputs. I can take this z, which is simply a linear combination of the inputs, and push it through another function, h, which is a nonlinear function. Here I'm showing you what the nonlinear function actually does. The x-axis is z. As you can see, z can take positive or negative values, because x1, x2, x3 can be positive or negative, and the weights associated with them can be positive or negative, so the z's you get could be anywhere on the real line, negative or positive. This particular nonlinear function, the sigmoid or logistic function, takes these z's and squashes them into a range between zero and one, so it's very useful for predicting probabilities. Basically, the sigmoid says something like: if my z is greater than five, I predict essentially one, no matter how large the value is; if my z is less than negative five, I predict essentially zero; and if I'm between negative five and five, I'm roughly linear. So it's a nice squashing function. Another kind of nonlinearity that's often used is called a rectified linear unit, and you can think of this as a simple thresholding function. The idea being: if my z is negative, I set it to zero; if my z is positive, I keep it as it is. So it basically just zeroes out the negative values. So these are three types of basic artificial neurons: a simple linear neuron, a sigmoidal neuron, and a ReLU neuron.
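As a concrete illustration of those three neuron types, here is a small numpy sketch. The weight and input values are arbitrary placeholders, chosen only to show the computation of z and the two nonlinearities.

```python
import numpy as np

w = np.array([0.5, -1.2, 2.0])   # weights w1, w2, w3 (illustrative values)
b = 0.1                          # bias term
x = np.array([1.0, 0.0, 1.0])    # inputs x1, x2, x3

z = np.dot(w, x) + b             # linear neuron: weighted sum of the inputs plus bias

def sigmoid(z):
    # squashes any real-valued z into the range (0, 1), useful as a probability
    return 1.0 / (1.0 + np.exp(-z))

def relu(z):
    # thresholding: negative values become zero, positive values pass through unchanged
    return np.maximum(0.0, z)

print(z, sigmoid(z), relu(z))
```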
So how does this relate to anything to do with sequences or motifs? You guys are all familiar with sequence motifs: you're given a bunch of sequences that a TF binds to, and you can convert them into a position weight matrix. Four rows, L columns, each value being the probability of observing a particular nucleotide at that position. So in this case, two Cs out of four gives a 50% probability of C, and likewise a 50% probability of G. And like Wouter showed in the slides before, you can represent it as a PWM logo.

Now, usually before you use a PWM, you also use a background model. Typically you take this position weight matrix and transform it into a log odds matrix, where you take the probability from the PWM, divide it by the background probability, whatever the background frequencies of the nucleotides are, and take the log. This gives you the log odds of that particular nucleotide at that position. And now you can see the log odds matrix can have positive or negative values. These are the weights of the matrix.

Now, how do you take such a matrix and scan a sequence? It's very simple, but I'll explain it in a slightly interesting way that relates to neural networks. You can take the sequence and transform it into this thing called a one-hot encoding. It's not complicated: you create four channels, A, C, G, and T, and at each position you put a one in the channel matching the nucleotide you observe at that position and zeros in the other three. So in this case you see an A, so the A channel gets the one, and so on. Those are your x's.

And now you have this log odds matrix, so think of its entries as the weights W. You simply take this matrix and scan it along the sequence, looking for matches. Say you're trying to compute the score at a particular position, in terms of how well the window starting there matches the weight matrix. You do a lookup: at each position of the window, you look at which nucleotide exists and look it up in the weight matrix. In this case it's a G, so 3.7; a C is negative 3.2; and so on. This is effectively just multiplying the x's, these one-hot encoded ones and zeros, by the corresponding values in the weight matrix W, so W times x, and then taking the sum to give you the score. So the score is simply the one-hot encoding of the window you're looking at, multiplied by the weights obtained from the log odds matrix and summed up. And then you scan across the sequence, going one nucleotide at a time. This operation of scanning and scoring is nothing but what's called a convolution, and a convolution is a basic unit of a convolutional neural network.

So I can take this PWM, scan the sequence, and I get these scores. Now I want to threshold these scores, because most of them are negative, indicating negative log odds, so I'll just set them to zero. All the negative scores will be set to zero, and I'll keep the positive scores as they are. This should remind you of a rectified linear unit: it sets things to zero if they're negative, and everything else remains as it is. What you then see is that you have essentially two hits in the sequence, one of which is a weak hit of value two and the other a strong hit of value 16.
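Here is a short numpy sketch of that scan-and-threshold operation. The 4 x 3 log odds matrix below is made up for illustration (it is not the matrix from the slide), but the mechanics are the same: one-hot encode the sequence, slide the weight matrix along it, sum W times x in each window, and zero out the negative scores.

```python
import numpy as np

BASES = "ACGT"

def one_hot(seq):
    # 4 x L matrix: one channel per base, with a 1 where that base occurs
    x = np.zeros((4, len(seq)))
    for i, base in enumerate(seq):
        x[BASES.index(base), i] = 1.0
    return x

# Toy 4 x 3 log odds matrix W (rows A, C, G, T; columns are motif positions)
W = np.array([
    [ 1.5, -3.2, -2.0],
    [-3.2,  2.0, -1.0],
    [ 3.7, -1.5,  2.2],
    [-2.0, -2.5,  1.8],
])

def scan(seq, W):
    x = one_hot(seq)
    L, m = x.shape[1], W.shape[1]
    # convolution: at each offset, multiply the window by W element-wise and sum
    scores = np.array([np.sum(W * x[:, i:i + m]) for i in range(L - m + 1)])
    return np.maximum(scores, 0.0)   # ReLU: set negative (below-background) scores to zero

print(scan("ACGGCTGAT", W))
```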
And so if you take the maximum value across this entire row of values, you get a value of 16. So I've taken a sequence, I've scanned it with a PWM, and I've got a single value telling me whether that sequence has the motif or not.

Now, how does this relate to a neural network? A ReLU neuron is in fact nothing but a motif scanner, so let me show you the exact one-to-one equivalence. This is a neuron with a rectified linear unit; it has these weights and so on. And this is my motif scanning the sequence. Note I have x1, x2, x3: these are my one-hot encoded sequences, and those are the inputs. The weights of the neuron are nothing but the weights of the position weight matrix. The z value, which is the linear combination W times x, is nothing but the multiplication of these x's with these w's, which gives you the match scores. So z is essentially the value you get by taking a position weight matrix and matching it to a sequence. And then you threshold: you zero out the negative values and keep the positive values. That's nothing but the ReLU. So an artificial ReLU neuron exactly captures a motif: it can score a sequence and threshold it. That's how you can see a one-to-one correspondence.
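To round out the equivalence, here is the same sketch collapsed into a single "motif neuron": a convolution with the log odds weights, a ReLU threshold, and a max over positions. The weight matrix and sequence are again hypothetical placeholders.

```python
import numpy as np

BASES = "ACGT"
W = np.array([                      # toy 4 x 3 log odds matrix (rows A, C, G, T)
    [ 1.5, -3.2, -2.0],
    [-3.2,  2.0, -1.0],
    [ 3.7, -1.5,  2.2],
    [-2.0, -2.5,  1.8],
])

def motif_neuron(seq, W):
    x = np.zeros((4, len(seq)))                        # one-hot encode the sequence
    for i, base in enumerate(seq):
        x[BASES.index(base), i] = 1.0
    m = W.shape[1]
    z = np.array([np.sum(W * x[:, i:i + m])            # z = W . x in each window (convolution)
                  for i in range(len(seq) - m + 1)])
    h = np.maximum(z, 0.0)                             # ReLU: zero out the negative scores
    return h.max()                                     # max over positions: one motif-presence score

print(motif_neuron("ACGGCTGAT", W))
```

So the one-hot window plays the role of the x's, the log odds entries play the role of the w's, and the ReLU plus the max turn the scan into a single thresholded output, which is exactly the correspondence described above.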