So, hi and welcome to the fourth module for machine learning. This is under the Creative Commons share-alike license. The focus for this module is on the use of neural networks for secondary structure prediction. This is the schedule. We've got about an hour, so I'll have to go through this fairly quickly because we have a lot of material, and apologies if I seem like I'm speeding, but there's a lot. I'll briefly talk about secondary structure, in case people have never heard of it; I'll try to explain it. Then I'll show you how neural networks can be used to handle and interpret sequence data. The key thing here is that we're looking at sequences. A lot of what you have to do with machine learning, and especially with neural nets, is figure out how to recode or encode your data so that it's more compatible with the neural net input. Some of that is thinking about the data in a novel way, or restructuring or reformatting it. That is, I think, the main take-home lesson for this particular module. We'll go through a neural network for predicting secondary structure called SSANN. And then we basically won't have enough time for the lab, so the lab will essentially be homework for people to do after the class is over today.

So, secondary structure relates to proteins, and that relates to polypeptides. Polypeptides are made up of amino acids, strings of amino acids that are strung together. They're sort of like chains or chain links, and they can pivot around each other through certain torsion angles. Proteins consist of longer stretches of amino acids, generally more than 40 amino acids. The sequence of amino acids in a protein is called the primary structure. The formation of coils or spring-like structures or linear strands, those are secondary structures. The three-dimensional structure of a protein, as it folds into collections of helices and beta strands, is called the tertiary structure. And then how proteins complex with each other to form aggregates, that's called the quaternary structure.

The picture shows schematic diagrams of a beta sheet. This is the anti-parallel beta sheet. On the left you can see four anti-parallel, well, I guess two sets of anti-parallel beta strands and one parallel beta strand. Beta strands are generally extended. They have hydrogen bonding between amide and carbonyl groups, and beta strands tend to form the central core, the hydrophobic core, of proteins. Helices are probably more familiar to people. They look like springs. They have, again, characteristic hydrogen bonding between the first and the fourth residue, in increments all the way up. They're very stable structures. So these represent the main building blocks of secondary structure. Those secondary structures, the red are helices, the yellow are beta strands, can be assembled into a three-dimensional structure. These are called ribbon diagrams. Some proteins are all helical, some proteins are mostly beta sheet, others are more mixed.

The structure of proteins is something that many people have been studying for a long, long time. When the first structures emerged in the 1960s, people immediately noticed this periodicity in helices and beta strands, and they started noticing a relationship between sequence and secondary structure. So, very early on, people actually tried to predict protein secondary structure from sequence, and this was an effort to try to predict protein structure.
And certain amino acids seem to prefer certain secondary structure elements: alanine, methionine, leucine, and glutamate tended to be in helices; isoleucine, valine, and threonine tended to be in beta strands. It's been a subject that's been published and written about for probably more than 40 or 45 years. It's sort of fading from being in vogue, partly because most protein structures have been largely solved by now. But I think it's a nice example, because it's an application of machine learning to do something that was intrinsically very hard to do. Understanding secondary structure helps us understand three-dimensional structures. It helps with things like threading and remote sequence similarity detection. People also use it to understand protein function. As most of you might gather, I'm pretty old, and in fact I've been involved in protein secondary structure prediction for the last 30 years. It's actually the reason why I got into bioinformatics, and so, again, that's one of the reasons why I'm using this as an example.

So this is an example where we've got a string of amino acids, a sequence, written at the top of this particular view. What's marked in yellow and blue or cyan are examples of secondary structures that have been predicted. You can see one helix stretching from about residue 1 to residue 14, another helix stretching from about residue 33 to residue 45, and a beta strand from residue, I don't know, 22 to 28. And there are different predictors. One is what's called the Chou-Fasman method. Another is the Garnier method. One uses hydrophobic moments. Another looks for motifs. And then you can come up with a consensus that combines these together. So this is an example of secondary structure prediction that's not necessarily using machine learning, but in this case using statistical propensities.

These statistical propensities were identified back in the 60s by Gerry Fasman and his student Chou, and they published this table, I guess, around 1969. It just shows that certain amino acids, like alanine (A), have a high helical propensity, 1.42, and you can see that proline (P) has a very high coil propensity of 1.88. Pc is the coil probability, P-beta is the beta sheet probability, and P-alpha is the helix probability. So different amino acids have different preferences, or probabilities, for being in certain secondary structures.

So, for example, you can come up with a simple mathematical approach. This is not machine learning, but a mathematical approach where you take a stretch of seven amino acids, calculate the numbers for helix, beta sheet, and coil, determine the average, and assign the middle residue, residue number four, that value. You do that for each of the alphas, the betas, and the coils. Then you can slide this window of seven residues along the entire length of the protein sequence and generate a plot, which is essentially the helical, beta sheet, and coil propensity. So here's a protein that's got maybe 61 or 62 residues, and we can see the plot, with the green as a beta sheet at the beginning because that's the highest value. We can see blue, that's the helix, and then we can see red that starts around residue 18 and goes to about residue, I don't know, 37; that's a coil region. Then there's another beta sheet, then there's another helix, and then there's another coil.
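To make the window-averaging idea concrete, here's a minimal sketch in Python, including the winner-take-all call described next. Only alanine's 1.42 and proline's 1.88 come from the slide; the other propensity values are placeholders for illustration, and a real predictor would use the full 20-residue table.

```python
# Minimal sketch of window-averaged secondary structure propensities
# (Chou-Fasman style). The tiny table below is partly made up for
# illustration -- a real predictor needs all 20 amino acids.
import numpy as np

# propensities[residue] = (P_alpha, P_beta, P_coil)
propensities = {
    "A": (1.42, 0.83, 0.75),   # alanine: strong helix former (value from the slide)
    "P": (0.57, 0.55, 1.88),   # proline: strong coil preference (value from the slide)
    "V": (1.06, 1.70, 0.50),   # valine: beta strand former (illustrative)
    "G": (0.57, 0.75, 1.56),   # glycine: coil preference (illustrative)
}

def window_propensities(seq, window=7):
    """Average P_alpha, P_beta, P_coil over a sliding window and assign
    the result to the central residue of each window."""
    half = window // 2
    scores = []
    for i in range(half, len(seq) - half):
        chunk = seq[i - half:i + half + 1]
        vals = np.array([propensities[aa] for aa in chunk])
        scores.append(vals.mean(axis=0))   # average over the window
    return np.array(scores)                # one (H, B, C) triple per central residue

seq = "AAVVPPGGAAVV"
scores = window_propensities(seq)
# Winner-take-all call: whichever propensity is highest wins
states = np.array(list("HBC"))[scores.argmax(axis=1)]
print("".join(states))
```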
And so this is just simply a plot, and then you use a threshold: if it's above this it's helix, if it's above that it's coil, or really, whichever value is highest, that's the winner. So that's an example of a heuristic method, one that can be calculated with a conventional computer, even with an Excel spreadsheet, to predict secondary structure. The Chou-Fasman method is really old, 50 years old at least. It doesn't take into account long-range information, it doesn't take into account the fact that some proteins have a preference for being in a particular structural class, and it assumes secondary structure probabilities are additive. It doesn't look for certain patterns of amino acids that are known to be helix caps or helix ends. And so with all those limitations, the method is only about 50% accurate. A random method would be about 33% accurate, so it's better than random, but it's not great.

Secondary structure prediction took a giant leap forward in the late 1980s and early 1990s when people started using neural networks. A program called PHD was developed, and this one basically took sequences, sequence alignments or sequence profiles, ran them through a fairly simple neural network, and predicted secondary structure. Its performance went from the standard at the time, which was around 50 to 55%, up to around 65 or 70% accuracy. That was massive, and that's what actually got me interested in neural nets and machine learning.

The way that secondary structure prediction is evaluated is sort of like a multiple choice exam, where if you had answers A, B, and C, you count the number of correct answers among the A's, B's, and C's. In secondary structure we have B for beta sheet, C for coil, and H for helix. So you can essentially build a confusion matrix that compares the predicted and observed values. In this case, on the diagonals, the beta sheet was predicted correctly 77% of the time, coil 81% of the time, and helix 88% of the time. And then we can compare where the method over-predicts or under-predicts, in terms of true positives and false positives. The result of combining the beta sheet, coil, and helix predictions is called a Q3 score, and what I'm showing here is the other way of presenting this, which is the confusion matrix.

So the idea of using artificial neural nets to predict secondary structure and identify patterns of residues is getting on to be about 25 years old now. We've talked about neural nets already, so I won't belabor the point, other than to say that they are ways of performing both classification and regression. For secondary structure prediction, what you're trying to do is take your sequence data, and you might have many examples of sequences. So we've got A's, C's, and G's. It could be DNA, it could be protein, whatever; in this case it's protein. We're passing it in as an input, or through an input layer. We have a hidden layer, and we have an output, and the output here is secondary structure. In this case these sequences are predicted to be mostly beta sheet. The connections between those nodes are the weight matrices, and just as we've learned before, it's modifying those weight matrices that allows us to make the predictions. So we think about encoding. We have a sequence, and we might one-hot encode the A's, C's, and G's just like we did with the flowers, like 001, 010. If we only had an alphabet of three letters, we could encode it this way.
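As a minimal sketch, one-hot encoding a sequence over a three-letter alphabet might look like this; the alphabet and the example sequence are just for illustration.

```python
# Minimal sketch of one-hot encoding a sequence over a three-letter alphabet.
import numpy as np

alphabet = "ACG"                      # toy alphabet, for illustration only
index = {ch: i for i, ch in enumerate(alphabet)}

def one_hot(seq):
    """Return a len(seq) x len(alphabet) matrix of 0s and 1s."""
    mat = np.zeros((len(seq), len(alphabet)), dtype=int)
    for row, ch in enumerate(seq):
        mat[row, index[ch]] = 1
    return mat

print(one_hot("CGA"))
# [[0 1 0]    C
#  [0 0 1]    G
#  [1 0 0]]   A
```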
So think of having a sliding window where maybe we're taking three or seven residues at a time. We then concatenate those three amino acids together to produce an input vector which is three times three, or nine, elements long. We have an output where we indicate the secondary structure, encoded through B, C, and H, and we might have a decision about what the central residue should be in terms of the desired output, whether it's, in this case, a coil. So for CGA, the central residue G is indicated to be, let's say, coil or something like that. So there's a formatting step where we're modifying how we do the one-hot encoding, and this is critical to actually having a successful neural net. If you don't do that kind of proper encoding, and sometimes you need to be creative or inventive, then your neural net may not work out as well as you want.

I think we've shown this same structure before, where we have a vector, a set of numbers that encodes the sequence, and we have a weight matrix. In this case, the weight matrix has to have the same number of rows as the length of the input vector; this is just general linear algebra. A nine-by-one input vector multiplied by a nine-by-three weight matrix gives us a three-by-one vector; we multiply that three-by-one vector by a three-by-two matrix, and that gives us a two-by-one vector, which is, let's say, our desired output. So here, say our initial feed-forward calculation gives 0.24 and 0.74. We compare it with the preferred or desired output, in this case say 0 and 1, and we can see we're slightly off, so we have to do some back propagation to adjust those numbers. We recalculate with the same kind of input vector, and we find that things have changed: the first number has dropped a little and the last number has increased, so it's a little closer to 0, 1. After a few rounds of training, we think it's converged, and we can carry on. We can put in a new input, we can modify and compare and iterate, and just as I showed before, after many iterations we create, in this case, two generalized weight matrices which maybe allow us to predict secondary structure with, say, high accuracy.

So that's the concept behind using neural nets for taking a sequence and predicting an output. That could be secondary structure, it could be binding, it could be the location of a promoter, it could be a gene. All kinds of things can be done through neural nets, or through hidden Markov models, where we're taking sequence data and converting it.

So we're going to try a real example, and this is one where we're going to try to predict secondary structure from protein sequence data. That's our problem, and now we're going to construct our data set. We're going to take a data set that we compiled many years ago called the Protein Property Prediction and Testing Database. It's a website where we have sequence information and the secondary structure for thousands of proteins. You can see the amino acid sequence uses the standard one-letter code, and the secondary structure uses C's for coils, B's for beta sheets, and H's for helices. So you can take that data and download it, and it has all the information about the protein name, the sequence, and the secondary structure. And so you've got, in this case, hundreds if not thousands of examples. So that's our data set.
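Going back to the feed-forward picture for a moment, here's a toy version of that matrix calculation, just to show the shapes involved. The weights here are random, not trained, and the sizes are the small nine-three-two example from the slide; this is only a sketch of the linear algebra, not the real network.

```python
# Toy feed-forward pass matching the shapes described above:
# a 9-element input (three one-hot encoded residues over a 3-letter alphabet),
# a 9x3 weight matrix into the hidden layer, and a 3x2 weight matrix to the output.
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)
x  = np.array([0, 1, 0,  0, 0, 1,  1, 0, 0], dtype=float)  # "CGA" one-hot, concatenated
W1 = rng.normal(scale=0.5, size=(9, 3))                    # input -> hidden weights
W2 = rng.normal(scale=0.5, size=(3, 2))                    # hidden -> output weights

hidden = sigmoid(x @ W1)           # 3-element hidden activation
output = sigmoid(hidden @ W2)      # 2-element output, compared against a desired (0, 1)
print(hidden.shape, output.shape)  # (3,) (2,)
print(output)
```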
It's considered a gold standard. It uses fairly sophisticated computational methods to identify the secondary structure; again, it's not machine learning, it's calculating certain features and summing them together, but it's a robust, useful data set. So we're going to try using the neural net and this training set to see if we can predict secondary structure.

As before, we could write out pseudocode, but in this case all we're going to do is open up the module's Python code. Again, it's also been written in R. You can open up the secondary structure program, SSANN, and work with it in Google Colab. It's very similar in design to the iris neural net program: we have to read the data, check for missing data, check for invalid amino acids, and split things into training and test sets. We have to do some encoding: we have to convert amino acids to some kind of numerical form, and we have to convert the secondary structure to some kind of numeric encoding, so this is where we do the one-hot encoding. We have windows of about, I think it's 21 amino acids, and we have to make sure that we have padded sequences at the beginning and at the end, because you want your window to be able to run to the end, and not start 10 or 11 residues into a sequence. So we put extra fake amino acids at the beginning and at the end, so that our window can slide through; we introduce what are called null amino acids. So we encode the amino acids, we encode the secondary structure, and we define some functions for those. We also have to define our activation functions, just like we did before: the sigmoid and softmax functions. And just like before, we initialize weights and biases, we determine the batches, we do forward propagation, error calculation, back propagation, and updates, and then iterate over many hundreds of epochs.

The first part imports NumPy and pandas, as we've done before for most of our other programs. Then we have to read our data. This is still in a CSV format; the reading is a little different, it's not the same as the iris data anymore, but this is what the data looks like in terms of the sequence and the secondary structure, and we can parse that out. We modify this a little bit to look for, or ensure, that we have standard amino acids, whether they have the right letters, if there are any X's in particular, and make sure that those are either cleared up or identified and sorted out. We're also going to check for any missing values. Again, this is just a standard thing to make sure that we don't have any missing sequences or missing secondary structure elements, and if we do, it flags that. So again, this is very standard. We're using a training and testing data set of about 1400 proteins. There are more than 100,000 in the protein database, but this is just to keep things reasonable. In this case, we've set our data size: there's a data fraction, and then a training fraction of 70% and then 30%. So we're still going to split it out so that we have a reasonable set. This is just the de rigueur setup: reading our data, setting up our training and testing sets, and working with a data size just to make sure we don't flood the computer and have to wait hours.

So, key to neural nets, and we emphasized this before with the iris example, and we have to do the same thing with sequence data, is encoding, one-hot encoding, so the input can be manipulated using matrix calculations, using dot products and vector products.
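Before getting into the encoding itself, here's a rough sketch of those read, check, and split steps. The file name and the "sequence" and "structure" column names are assumptions for illustration; the actual SSANN code may use different names.

```python
# Rough sketch of the read / validate / split steps. The file name and the
# "sequence" / "structure" column names are illustrative assumptions;
# the actual SSANN code may differ.
import pandas as pd

VALID_AA = set("ACDEFGHIKLMNPQRSTVWY")

df = pd.read_csv("converted_data.csv")          # one protein per row: sequence + secondary structure

# Drop rows with missing sequence or structure
df = df.dropna(subset=["sequence", "structure"])

# Flag proteins containing non-standard amino acids (e.g. X)
ok = df["sequence"].apply(lambda s: set(s) <= VALID_AA)
print(f"Dropping {(~ok).sum()} proteins with non-standard residues")
df = df[ok]

# Take a fraction of the data, then split 70/30 into training and test sets
data_fraction, train_fraction = 0.10, 0.70
df = df.sample(frac=data_fraction, random_state=1)
n_train = int(len(df) * train_fraction)
train_df, test_df = df.iloc[:n_train], df.iloc[n_train:]
print(len(train_df), "training proteins,", len(test_df), "test proteins")
```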
We have to change our amino acids from A and C and D, which are single-letter characters, to what we'll call amino acid binary. In this case there are 20 or 21 positions, because there are 20 amino acids, and then we've introduced a special amino acid called the null amino acid. This is sort of the invisible amino acid that we put at the beginning and at the end of the sequence, and this is typically done whenever you're doing some kind of windowing function where you're scanning through sequences, because your window function kind of runs off the front and runs off the end.

We also one-hot encode the secondary structure. We're setting B to be 100, C to be 010, and H to be 001, the same sort of thing that we did with setosa, virginica, and versicolor. So we have to make sure we convert each of our amino acids into a sort of binary number, and that's what we're doing in the encoding. We've got these 21 letters for the 21 amino acids, the standard 20 plus the null amino acid. And for the structures, we're also going to convert to B, C, and H and create the corresponding 100, 010, 001 numbers.

We have a windowing function. As I said, this is another place where you use intuition when you're trying to build out or create some kind of predictor: you take what you know, you take what perhaps has worked for other people, and certainly in the case of secondary structure, people know that when you collect information from nearest-neighbor residues, it helps with the quality of the secondary structure prediction. So we're trying to capture these pairwise or distant interactions, and so we group residues into windows, and we're going to try to calculate the secondary structure for the residue at the center of each window.

The first part, as I said, is padding, where we've got a protein that begins with proline and ends with proline, and we're putting these null amino acids at the beginning and at the end, so that we can still allow our window to start and pass through the entire sequence of the protein. And what we're doing is taking the sequence, and then this window, and predicting the secondary structure for the central residue in that window. So we've got this window of about 10 or 11 residues, I guess; the central residue is glutamate, that's the E, and it is predicted to be a helix at that central residue. Then we slide our window along by one residue, so we've moved from E to P, and we take the prediction, and that also is predicted to be helix. So we can just slide this along the length of the entire sequence.

So here we're encoding the sequence. The sequence starts, in this case, with isoleucine, glutamate, glutamate, glutamate, leucine, leucine. We've padded it with a bunch of null characters; these are null amino acids, and we're encoding them as 0000 all the way through to a one. Isoleucine has an encoding like 00001 0000, with a single one in the isoleucine position. Glutamate has an encoding where there are about 15 zeros and then a one. So this is how we're encoding. Now, the other thing we're doing is taking all these binary readouts of 21 zeros and ones, one per amino acid, and then we're taking a window of 17, which means that the ninth residue, the middle residue of the 17, is the one being predicted. So I've drawn a box covering 17 amino acids at the very beginning, and I have eight null residues.
The ninth residue is the first real residue. We're getting a table that's 17 by 21, which is 17 amino acids by 21 binary values per amino acid. Then I'm going to flatten this: I'm going to take the 17 by 21, which is 357, and make it into a single vector of 357 bits. This is another trick that's used a lot in neural networks for sequence data. This is what flattening looks like: I've taken all of those zeros and ones from all of those positions, and I've just strung them out into one very, very long 357-bit vector. And so when I'm doing the calculations, I've got my window of 17, I move it down by one, so I've moved from isoleucine to glutamate, which is the new center, and this is now residue two. Then I convert things, flatten things, and carry on. I do this for the entire length of the protein, so if the protein is 300 residues long, I'm going to have a lot of these really long vectors: 357 bits, 300 times. And we go all the way to the very end; this is the last residue, the tyrosine, and then I've padded it with another eight null amino acids. So here's the whole length of the protein, let's say 350 residues, and I now have 357 bits for each of the amino acids, which corresponds to information about the surrounding sequence.

I also encode not only the amino acid data, which is on the left; on the right I'm including the output data. Each amino acid has a secondary structure, so I feed that in as well. And recall the encoding that we use for secondary structures: 100 for beta sheet, 010 for coil, and 001 for helix. What I'm showing in the slides is essentially what we're encoding here; we're producing this encoding. So it creates this window size, we create the first set of nulls, and we pad based on the size of the window: how many columns we're going to have, how many residues long it is, and how we binarize everything. We also make sure that we've calculated those padding sequences for the beginning and the end. And now that we've encoded the padded protein sequence, we start flattening it, so we convert this to the 357-bit representation of the 17 amino acids and 21 amino acid columns. Then we take the secondary structure, which again I was showing previously, but this is how we encode it. This is a function called sst3 encode, the secondary structure encoding. We have H, C, and B, three different types, and we're going to encode that all the way through the whole length of the protein, and also convert that to the zeros and ones that are needed.

So now we've encoded. Unlike, say, the iris data, where we had to worry about numeric data regarding lengths and widths, we don't have numeric data here, we have basically zeros and ones, so we don't have to do any more normalization. We don't have to do L1 normalization; the data already ranges between zero and one, so we get to avoid the normalization process.

So here's what our model architecture looks like. We've got 357 inputs; I'm not going to show 357 in the diagram. We have three outputs, and we have a hidden layer, with the hidden layer size being something we can vary depending on what we want for the architecture. Just like we did for the iris problem, we have to choose an activation function, so we're using the sigmoid function and the sigmoid derivative function, which works really well.
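Putting the padding, windowing, and flattening steps just described together, a minimal sketch might look like this. The alphabet ordering, the "0" symbol for the null amino acid, and the function name are illustrative assumptions, not necessarily what the SSANN code uses.

```python
# Minimal sketch of padding, windowing, and flattening a protein sequence
# into fixed-length input vectors. Alphabet ordering, the "0" null symbol,
# and the function name are illustrative assumptions; SSANN may differ.
import numpy as np

AA = "ACDEFGHIKLMNPQRSTVWY" + "0"      # 20 standard amino acids plus "0" as the null amino acid
AA_INDEX = {aa: i for i, aa in enumerate(AA)}
WINDOW = 17
HALF = WINDOW // 2                     # 8 nulls of padding on each side

def encode_windows(seq):
    """Return one flattened 17 x 21 = 357-bit vector per residue of seq."""
    padded = "0" * HALF + seq + "0" * HALF
    onehot = np.zeros((len(padded), len(AA)), dtype=int)
    for i, aa in enumerate(padded):
        onehot[i, AA_INDEX[aa]] = 1
    vectors = []
    for center in range(HALF, HALF + len(seq)):
        window = onehot[center - HALF:center + HALF + 1]   # 17 x 21 slice
        vectors.append(window.flatten())                   # -> 357-bit vector
    return np.array(vectors)

X = encode_windows("IEEELL")           # toy 6-residue sequence
print(X.shape)                         # (6, 357): one 357-bit vector per residue
```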
Again, this is just some math reminders about the sigmoid function, and also about the softmax function, which is used for layer two. Again, this is the definition of the softmax function; we went through this last time, so I'm not going to repeat it, but this is just setting things up so that we can use the appropriate activation function for each of the layers.

Just as before, we have to initialize our weights and biases, so we call up some random numbers and create those random weights all the way through. We're going to calculate the number of batches. We've got about 1400 proteins, I guess, and we have to choose how many batches we're going to train with. That's partly decided by the user, but we have to make sure that the batch sizes come out to whole numbers. The training loop is exactly the same: this idea of working through the different batches, doing the forward propagation, error determination, back propagation, and updating the weights and biases. We do this for batch one, two, three, up to batch n, and once we've completed that, we've completed one epoch, and then we repeat for hundreds of epochs to make sure the training is thorough and complete. It's very similar to the slides we saw before, and in many respects this is just recycling a lot of the same architecture. We have to worry about the learning rate and batch size, and we want outputs which return the trained weights, the biases, and the error measurements.

The forward propagation is largely the same. The difference is that now we're dealing with a window size times the number of amino acids, plus the hidden layer size, and we have the dot products, and we're calculating the activation for layer one using a sigmoid function and for the second layer using the softmax function. Just as before, we have to determine the error once we've done the forward propagation. This is the difference between the output and the observed or known output, the predicted versus observed, and then we propagate that delta all the way through to the other layers. The back propagation goes from layer two to layer one, and layer one to layer zero. Again, these are the same methods, models, and approaches, in terms of both the weights and the biases, that we talked about last time, so I'm not going to go into a whole lot of detail. The back propagation, as I say, still continues through the different layers. We're looking at the sigmoid derivative, we're looking at the cost, and that function is marked in red and highlighted at those points. So this is again very similar to what we did before. The back propagation continues all the way from layer two to layer one to layer zero, again with the deltas for both the weights and the biases, and now we're looking from layer one to layer zero.

After we've completed the back propagation step, we update all of the weights, and we multiply by our learning rate, which is the Greek letter eta, and by the delta, the weight delta, that's marked here. We have the different layers marked and the different weights marked. So the weights are updated, the biases are updated, and we're looking at the same kind of formula, the same slides, basically the same interpretation that we used for the iris model. Once you've done one neural network, you've kind of done them all. There are subtle changes to the format, but the back propagation steps, the activation steps, the bias and weight adjustments, they're all pretty much the same. This is the same coding that you saw with the iris model.
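In outline, one training step for a single-hidden-layer network like this might look as follows. This is a generic sketch, not the SSANN code itself: the layer sizes and learning rate are placeholders, and it assumes a softmax output with the usual "output minus target" delta for the error term.

```python
# Generic sketch of one forward/backward/update step for a 357 -> hidden -> 3 network.
# Not the SSANN code itself; sizes, learning rate, and the error convention
# (softmax output with an "output minus target" delta) are illustrative assumptions.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    e = np.exp(z - z.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

n_in, n_hidden, n_out, lr = 357, 5, 3, 0.1
rng = np.random.default_rng(0)
W1, b1 = rng.normal(scale=0.1, size=(n_in, n_hidden)), np.zeros(n_hidden)
W2, b2 = rng.normal(scale=0.1, size=(n_hidden, n_out)), np.zeros(n_out)

def train_batch(X, Y):
    """One forward pass, error calculation, back propagation, and weight/bias update."""
    global W1, b1, W2, b2
    n = len(X)
    # forward propagation
    h = sigmoid(X @ W1 + b1)                  # layer 1 (sigmoid)
    out = softmax(h @ W2 + b2)                # layer 2 (softmax)
    # error and back propagation
    delta2 = (out - Y) / n                    # output-layer delta (predicted minus observed)
    delta1 = (delta2 @ W2.T) * h * (1 - h)    # propagated through the sigmoid derivative
    # weight and bias updates, scaled by the learning rate (eta)
    W2 -= lr * h.T @ delta2;  b2 -= lr * delta2.sum(axis=0)
    W1 -= lr * X.T @ delta1;  b1 -= lr * delta1.sum(axis=0)
    return -np.mean(np.sum(Y * np.log(out + 1e-12), axis=1))   # error measurement

# Looping train_batch over all batches once is one epoch; repeat for many epochs.
```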
It's taking a batch, doing forward propagation, calculating errors, calculating the derivatives, doing the back propagation and the weight and bias updates, and repeating that over and over again for each epoch.

This next part is a little different. This is an animation of what's happening with the different input layers. The 17 input units we're seeing across the graph are the 17 window positions, sort of concatenated or condensed, so that instead of seeing 357 we're only seeing 17 units. You can see their weightings, the numbers that are sort of averaged over these positions. We have the input units, we see the epochs changing, going from the first epoch to 170, 180, 190, and we also have the output units and their values. As we're training, you can see the numbers changing. Some initially change quite quickly, but as we get towards the end of, I don't know, I think it's about 1000 epochs, the numbers are only changing subtly: 0.1 here, minus 0.1, or 0.01. So it's starting to settle. You can see how the colors change as you go from the beginning to the end and back again. There's quite a bit of change in the first 200 epochs, and then it settles quite nicely over the last 100 or so of the 1000 epochs that it runs through.

This is just showing the specific weights: 17, I guess, illustrated as the inputs coming in; within the single hidden layer we've got these five hidden units; and then we've got three output units. So it's kind of a pseudo 17, 5, and 3, although 357, 5, and 3 is the real architecture; we just can't show 357. We're showing the weightings and the various connections based on where they are, with the numbers and the strengths of the weightings indicated by the lines. We could show darker, thicker blue lines and thinner blue lines based on their overall weights, or we could color them so that we have appropriately colored weights: light green would be lower weights, dark blue or black the heavier weights in this diagram.

In addition to the weights changing and eventually converging, the error also converges, and the error plot is pretty generic: over the 1000 epochs we trained for, the error drops quite significantly, by a factor of probably 40 or more in terms of what we see. That's telling us that as it settles out, we're getting good convergence and the program is performing well.

We have two versions of the secondary structure program, one written in Python and the other written in R. The one in Python uses NumPy and pandas. You can see that the Python one is much more compact in terms of the number of lines of code. They run about the same; as a general rule, the R programs run a little slower. Once we've assembled the program and trained it on our training set, then we can start testing it on the test data, and this is essentially how we do the testing: forward propagation. What we've got is a training set of 497 sequences and a test set of 213, so a grand total of, I guess, 710. We used about 10% of the entire data set, which had 7100 sequences in it. We kept this small just so that the program could finish in your lifetime, given how slow some of the Colab code can be. So what we've done here is calculated the Q3 score for the training, and we've assessed it over the number of amino acids.
What you can see is a diagonal: we've got the confusion matrix, and it predicts beta sheets correctly 48% of the time, coil 69% of the time, and helix 63% of the time. That's on the training set. On the testing set, which is about one third the size, the performance is 46, 69, and 65. It's not a perfect prediction; it's not like what we saw with the iris data set, where the errors were mostly zero or 0.05. The iris data set is almost a trivial one; secondary structure prediction is non-trivial. So this is more typical of what you'll see in, I guess we'll call it, a difficult prediction program or prediction challenge. We have larger off-diagonal elements, and the diagonal elements are not 1.00 or 0.99; they hover around 65%. Overall, the Q3 score, which is sort of what you would get on a multiple choice test where there are three answers for each question, is 61%. Both the training and testing numbers are consistent, so we can say this is not over-trained, and we're confident in terms of its performance.

So essentially what we have is a neural network program written in pure Python, and also one written in R, and we've trained it on fairly large data sets. It took a fair bit of time for us to find the right training set size; the first time we tried it was too large, the second time too small. So we're not showing you all the challenges we faced in terms of choosing a good training set size. But in principle, this could be used not only for secondary structure prediction: you could use it to predict membrane-spanning regions, you could do signal site prediction, and the same concept could even be applied to DNA for gene prediction, and in fact we'll show you how to do that. Similar ideas, just with the way that we've encoded the sequence data, the way that we've mapped sequence to some output information; it's a concept that can be reused in many, many formats.

So what you can do with this, obviously, is actually play around with it, and I think what I wanted to do, because of the time, was give people the opportunity to work on the lab as homework. So that's the end of the day for many people, or, for those of you who are ambitious, you can just follow the slides along. It's the same sort of thing: go to Module 4 and download SSANN. You can also take the R code if you prefer that over the Python code. As before, just look at the code as I've illustrated and explained it. We have a data set, this is called converted data, and it's a much larger data set than what we have in the iris example. And as before, you can run the program; it'll take a while to execute. You can change things and make adjustments, assess performance in terms of helix, coil, and beta sheet prediction, adjust your data fraction, adjust your training fraction. There are some other variations you can try: you can change your learning rate, you can plot your error plot and see how it performs, you can upload different sequences and see how they predict. So there are any number of things you could do with this in terms of testing it out and trying it out. And given that we still have a few minutes, I'd certainly encourage people to give it a try.
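If you do experiment with it, here's a minimal sketch of how you might compute a confusion matrix and Q3 score from predicted versus observed secondary structure strings, along the lines of the evaluation described above. The function and variable names are just for illustration.

```python
# Minimal sketch of scoring a secondary structure prediction: a 3x3 confusion
# matrix over B, C, H and the overall Q3 score (fraction of residues correct).
# Function and variable names are for illustration only.
import numpy as np

STATES = "BCH"
IDX = {s: i for i, s in enumerate(STATES)}

def confusion_and_q3(observed, predicted):
    cm = np.zeros((3, 3), dtype=int)      # rows: observed, columns: predicted
    for o, p in zip(observed, predicted):
        cm[IDX[o], IDX[p]] += 1
    q3 = np.trace(cm) / cm.sum()          # fraction of residues predicted correctly
    return cm, q3

obs  = "CCHHHHHHCCBBBBCC"                 # toy observed and predicted strings
pred = "CCHHHHHCCCBBBHCC"
cm, q3 = confusion_and_q3(obs, pred)
print(cm)
print(f"Q3 = {q3:.2f}")
```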