Welcome back, everyone. Hopefully everyone has had a chance to try the neural nets. Again, it's a bit of a toy problem; it's almost like hitting a walnut with a sledgehammer. You probably wouldn't do iris classification using either decision trees or neural nets; you'd probably just look at it with your own eyes and figure it out pretty simply. But these are examples of problems that can be solved in very short order, ones where you can see the result and understand it. We've compared decision trees to neural nets, which seem to be slightly better overall, but it also highlights that both approaches are powerful and very useful. We're now going to apply neural nets to secondary structure prediction, so this is a little more of a bioinformatics problem.

This is our last module for today, and we've got about an hour for it. It's a lecture, and then we're going to give you some homework. It's been a long day for some of you and we don't want to make you stick around to do the lab. Most of you have a good understanding of how to use Colab now, and that was the intent of the work in modules two and three. So this gives you a chance to look at code in a bit more detail, play around with both the R and the Python code, and try a few things if you're comfortable. Obviously you're not getting graded, so you don't have to do this if you don't want to.

We're introducing the concept of secondary structure and secondary structure prediction. Some of you might be aware of this or have heard of it; for others it might be completely new. Then I'm going to show you how neural nets can be used to handle sequence data. We're using protein sequence data; tomorrow you'll also see that you can use gene sequence data. We're using sequence data because that's historically how lots of neural nets can be and have been used, but you can also use them for other kinds of things; it doesn't have to be sequence, it can be numbers, it can be any sort of thing. We're going to explain the Python code for the secondary structure prediction, and then, as I said, the lab is something you can do after the lecture, so we're not going to have an allocated lab time; it's something you can do in the evening.

So, some introduction. We're talking about proteins. A small number of you are probably doing protein work; most of you, it sounds like, are doing work with transcriptomes or metagenomes or general genomics and SNPs. If you don't remember, polypeptides are made up of amino acids, and amino acids form peptide bonds. Peptide bonds are coupled or linked the way links in a chain are and can pivot around each other, so they have planar components. Each amino acid sits on a plane and has a pivot point about what's called the alpha carbon. That means that polypeptides have dihedral angles: one is the phi angle and another is the psi angle. That's how you measure the planar angles and how they rotate. That's been known since the 60s, and it has an important role in how polypeptides form structures. Polypeptides range from a few residues up to 40 residues and beyond; above about 40 is usually enough to be called a protein. We deal with the primary structure, which is the sequence; the secondary structure, which is helices that look like springs and beta sheets that look like ribbons; and those secondary structures assemble into the tertiary structure, the three-dimensional fold.
This is what AlphaFold figured out how to do. Folded proteins will sometimes aggregate into large collections; hemoglobin, for example, is a tetramer of alpha and beta subunits. Those are called quaternary structures. Now, AlphaFold does the tertiary structure; we're doing secondary structure, so this is a simpler task. To some extent the whole point of secondary structure prediction has been made moot because of AlphaFold, but there's a reason why we're doing this, which I'll explain a little later.

In terms of secondary structures for proteins, beta sheets are these ribbons; they form hydrogen bonds, they're often found in the interior of proteins, and they have a characteristic sequence in which branched-chain amino acids are usually predominant. Alpha helices look like springs; these are the purple images we're seeing here. They form a hydrogen bond network where the first residue pairs up with the fifth, the second with the sixth, and so on. The tertiary structure, of course, is how these secondary structures assemble. You can have proteins that are mostly helix-rich; those are called alpha-fold domains. Others are primarily beta; they belong to the beta-fold family. And then there are others that are mixed; many of these mixed alpha-helix/beta-sheet structures are found in enzymes.

The reason secondary structure is worth noting is partly historical: it's actually one of the very first fields in bioinformatics, and it's been around since the mid-60s. As with the field of machine learning, where Canadians played a key role, it turns out a Canadian-born biochemist named Gerald Fasman helped develop the field of secondary structure prediction. It grew from the observation, from solved protein structures (only about three at the time), that certain amino acids seem to prefer to be in helices and others prefer to be in beta sheets. There have probably been thousands of papers and dozens of books published on the field. And obviously, if you can predict secondary structure, it can help with 3D structure prediction, which helps with things like threading, homology modeling, and protein function prediction. When I started getting into the field of bioinformatics back in the 1980s, I got into it because secondary structure seemed like a really interesting and challenging problem.

This is an example of a secondary structure picture that we developed, one that used the Chou-Fasman method (that's the Canadian connection). There are others: a French group developed the Garnier (GOR) method, and there are methods that combine several of these. With them you can determine which regions in this protein sequence are helices, which are beta strands, and so on. Chou and Fasman calculated these propensities, where alanine has a high propensity of 1.42 for being in a helix, a lower propensity of 0.83 for being in a beta strand, and a value for coil as well. Some amino acids, like valine, have a very high beta-sheet propensity of 1.7 and a very low propensity of 0.24 for being in a coil or loop; proline has a very high propensity for being in a coil and very low propensities for regular secondary structure. You can use these tables of numbers, which are simply calculated from the frequencies of amino acids showing up in secondary structures in solved protein structures. What they suggested as a simple algorithm is: take a window of seven amino acids.
Calculate the average propensities for those seven amino acids: take the table of alpha, beta, and coil values for the 20 amino acids and look up the values for those seven residues. Calculate the average P-alpha over the seven-residue window and assign that value to the middle residue, residue number four of the seven-residue window. Do the same thing for P-beta, the beta-sheet propensity, and for coil, then slide the window along by one residue, repeat the calculation, and continue all the way along. You then plot these probabilities on a graph showing the likelihood of helix, beta sheet, or coil. This is the plot you get: blue is the probability for helix, green is for beta sheet, and red is for coil. You can see a beta sheet from residue one to residue seven, a helix from about residue seven or eight to residue 17, a long coil region from 19 to 38, and then another helix and another beta sheet pop up. These are the plots you generate, and this is what lets you identify and define your secondary structures. (I'll show a tiny code sketch of this window averaging at the end of this part.)

Now, this kind of works, but it has problems. It doesn't include information about residues that are more than three residues away. It doesn't take sequence context into account, or the fact that there are certain classes of structures, and that when a protein falls into a class it makes the other amino acids want to fall into that class, whether beta sheet or helix. It assumes additive probabilities, and that's not how amino acids work together. It doesn't identify certain patterns, like the fact that there are what are called N-terminal caps and C-terminal caps that start helices and beta strands. And when people tested it on many other proteins, they found it was about 50% accurate. A totally random guess would be about a third accurate, so it was better than random, but it certainly wasn't 100%.

So people continued developing it, and then in the 80s someone decided to finally apply neural nets to it, as part of their PhD. The person who developed it was named Burkhard Rost. It was his PhD thesis: he applied a neural net together with multiple sequence alignments, and he found he could get really impressive results for the time, something around 70 to 73% correct. This is the result, I guess for one example, where he's done a confusion matrix, with the observed classes on the x axis and the predicted classes on the y axis, and you can see 77% on the beta, 81% on the coil, and 88% on the helix. This is for one example, but it's very impressive. The application of neural nets to secondary structure got me, and a lot of other people, interested; in fact it was one of the very first applications of neural nets to a biological problem, and it came shortly after the early descriptions of neural nets in 1986.

We've already described what a neural net is, so I won't go through it again. In this case it's being used to classify: it's able to say this collection of amino acids is in a helix, this collection of amino acids is in a coil. So we've got three classes, but the input spans 20 different amino acid types across collections or groups of amino acids. What you do is give it examples, like here's a sequence of alanine, cysteine, glycine, alanine. That's your input; you have your input layer, hidden layer, and output layer, and you're trying to predict whether it's going to be a beta strand, which is B, or a coil, which is C, or a helix, which is H.
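Circling back to that Chou-Fasman window averaging for a moment, here's a minimal Python sketch of the idea. The only values quoted above are alanine's 1.42 and 0.83, valine's 1.70 and 0.24, and proline's coil preference; every other number in this little table, along with the window size and the function names, is an illustrative placeholder rather than the real Chou-Fasman values or the course code.

```python
# Minimal sketch of Chou-Fasman-style sliding-window averaging.
# Only a few amino acids are filled in; values not quoted in the lecture
# are marked as illustrative and should not be taken as the real table.
PROPENSITY = {
    "A": {"helix": 1.42, "sheet": 0.83, "coil": 0.66},  # alanine (coil value illustrative)
    "V": {"helix": 1.06, "sheet": 1.70, "coil": 0.24},  # valine (helix value illustrative)
    "P": {"helix": 0.57, "sheet": 0.55, "coil": 1.52},  # proline (all values illustrative)
}

def window_propensities(seq, window=7):
    """Average each propensity over a sliding window and assign the
    average to the centre residue, as in the Chou-Fasman scheme."""
    half = window // 2
    results = []
    for i in range(half, len(seq) - half):
        chunk = seq[i - half:i + half + 1]
        avg = {ss: sum(PROPENSITY.get(aa, {}).get(ss, 1.0) for aa in chunk) / window
               for ss in ("helix", "sheet", "coil")}
        results.append((i, max(avg, key=avg.get), avg))
    return results

for pos, best, avg in window_propensities("AVAVPAVAVA"):
    print(pos, best, {k: round(v, 2) for k, v in avg.items()})
```

The key point is just the mechanics: average over the window, assign the result to the middle residue, shift the window by one, and repeat.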
This is the general pictographic view of what we're trying to do: take sequence data, which could be a sequence alignment or just a single sequence, feed it in, and have the network tell us what the secondary structure is. In this case, think of just a three-residue window and one-hot encoding of the sequence data. If we had a sequence alphabet of three letters, we could use the 001, 010, 100 approach. There are three secondary structures, and we can one-hot encode those the same way. Then, if we've got a string of amino acids, we just string these one-hot encoded values together, so a three-letter string becomes nine binary values.

We take our input vector as it's been encoded, we have a weight matrix, we calculate the hidden output, then we have another weight matrix and calculate the final output, see how close it is to the desired answer, do the back propagation, repeat and adjust, do it thousands of times, and we get values like 0.1 and 0.91, which are close to 0 and 1, which is the desired output. We saw this before with the other example, but here we're seeing how the geometry of the weight matrix is different, partly defined by what our input vector is and how we're encoding the sequence, and also by what our output vector is and how we're encoding secondary structure. We do this for the first input, we do it for the second input, we repeat, and just like before we eventually end up with a generalized weight matrix which defines our model, which in principle would be able to predict beta sheets, coils, and helices. Does that make sense to people? I don't know if anyone has any questions about that general concept. Okay, if it's simple enough we'll move on.

As with everything, we still have to decide what our problem is, what our data set is, how we're going to transform it, what model we're going to choose (we've already said we're going to do a neural net), and how we're going to validate. Our problem here is: how do I predict secondary structure from protein sequence data? So I need a good data set, both for training and for testing. This is a data set we created quite a number of years ago, called the Protein Property Prediction and Testing Database. We have the sequence on the first line and the secondary structure on the second line: letters like C, B, and H tell you the secondary structure, and letters like S, A, P, G, K, V, I, L are the amino acid sequence. We also have the Protein Data Bank identifier, and there are hundreds and hundreds of these proteins compiled in this database. We can extract that and create a table, not unlike the one we had for the irises, but with different columns and different data: a protein name, the sequence, and then the secondary structure corresponding to that sequence. You can see big proteins and little proteins here. So that's our data set.

We can then start thinking about how we would program an artificial neural net. As we've done before for modules two and three, for module four we can go to our CBW machine learning page, choose the Module 4 data, and upload the secondary structure artificial neural net Python code (there's also the R code). Our program is not unlike what we were doing for the neural net with the iris data: we read the data, we check the data.
In this case, checking means looking for missing data or invalid amino acids. You shouldn't have Z's as amino acids, you shouldn't have X's, and you shouldn't have B's, so that's something to check. We have to create our training and testing sets, so we divide the data roughly 70% on one side and 30% on the other. We have to encode our amino acid alphabet, so that's one-hot encoding, and we have to one-hot encode the secondary structure and define what that encoding function will be. We also have to deal with the fact that there are null amino acids at the beginning and at the end: if you're going to have a windowing function where you're looking at maybe not seven but maybe 17 residues at a time, you're going to have some blanks at the N terminus and some blanks at the C terminus. We're also going to do something called flattening, which is a way of taking one-hot encoded data and making it a little more amenable to the network. We have the activation functions, and we've already dealt with these before: the sigmoid and softmax. We have to initialize weights and biases, we have the batches, we do the feed-forward, we calculate errors, do the back propagation, update, and recycle. So probably about two thirds of the algorithm from the iris one is being reused here, but there are some new things.

As we've had to do before, we need mathematical operations and data frame capabilities, so we import NumPy and pandas. Then we have to read the sequences; it's not the iris data, it's our data set of protein sequences and protein secondary structures, and this is the file-reading code for that. Then we have to check for invalid amino acids. Here we're looking to see if there are any X's; this won't find Z's or B's. It removes non-standard amino acids and makes sure things are cleaned up. We also look for any missing values; that's the verify-dataset function, which checks for missing columns and rows.

As we did before with the iris problem, we break it up into 70% and 30%. We have a data set here of 710 proteins, not quite a thousand, but it's a big number, with probably tens of thousands to hundreds of thousands of letters and secondary structure assignments. 70% of 710 proteins is 493 and 30% of 710 proteins is 217, so our training set is about 500 and our test set is about 200, and this code is just doing the split. It's slightly different from what we did for the iris.

Once you've got this, because it's a neural net, we have to think about how to transform the data and do any sort of selecting. Because we're dealing with letters, we're going to do one-hot encoding. It's the same thing we did when we one-hot encoded the iris species; here we're one-hot encoding the letters. The amino acid alanine is a one followed by zeros. Then we have these null characters, the empty amino acids at the beginning and end of the sequence, which are padded on so that we can do the window averaging. So we're changing the amino acids from characters to numbers. We could have done embedding instead of one-hot encoding, and it would have done much better if we had embedded the amino acids, because we could have embedded information about their hydrophobicity or their proximity to others. But embedding is complicated and would take too long to explain in a course like this. So we one-hot encode the secondary structure as well; we've already shown how you can do that with three secondary structures, helix, coil, and beta, using a three-bit encoding.
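Here's a rough sketch of what those cleaning and one-hot encoding steps might look like. The function and variable names are my own illustrative choices, not the names used in the course code.

```python
import numpy as np

# Hypothetical sketch of the cleaning and encoding steps described above.
VALID_AA = "ACDEFGHIKLMNPQRSTVWY"            # the 20 standard amino acids
ALPHABET = VALID_AA + "-"                    # plus a null/padding character

def clean_sequence(seq):
    """Drop non-standard residues such as X, Z, or B."""
    return "".join(aa for aa in seq if aa in VALID_AA)

def one_hot(residue):
    """21-bit one-hot vector: 20 amino acids plus the null character."""
    vec = np.zeros(len(ALPHABET), dtype=int)
    vec[ALPHABET.index(residue)] = 1
    return vec

print(clean_sequence("SAPGKXVIL"))   # the X is removed
print(one_hot("A"))                  # alanine: a 1 followed by zeros
```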
This code does the encoding: it creates the 20 letters plus the null character, as we've described, and it also creates the one-hot encoding for the secondary structure. What we're doing here, rather than embedding, is taking windowed values, because amino acids interact with each other: they engage with other amino acids that could be five or ten residues away, or more. So we use windows, similar to what was done with the Chou-Fasman method, and then we predict the secondary structure at the centre of each window. For the full protein sequence we put in these null characters, eight or nine at the beginning and eight or nine at the end, just so that we can window properly and assign secondary structures to the first eight or nine residues and the last eight or nine residues.

You can see what's done here. There's a window, which looks like about 13 residues in this picture; we take the window over the sequence and say that the middle residue is assigned the value of an H, or helix, and then we move the window down. This is very similar to the Chou-Fasman method, but we're not just doing additive probabilities; we're doing something that the neural net will be able to figure out, which might include multiplicative probabilities, fractional probabilities, and contextual information. You slide this window along one residue at a time: we did the residue E, or glutamate, we shift it over, and now the centre will be P, the next one will be G, the next one F, the next one P, as we slide the window along.

We've rewritten the amino acid sequence, which starts with these two, four, six, eight null characters; the first real residue in this sequence is an isoleucine, followed by the next residues, and we're assigning these 21-character binary encodings. You can see that they're different for each amino acid type: lots of zeros and a few ones. What we're actually doing, something that's common with both one-hot encoding and neural networks, is flattening the array. There are 17 amino acids in the window and 21 bits representing amino acid type, so it could be a 17-by-21 table, or we can flatten it to be 357 bits long, one vector of length 357. That's flattening: you take a square or rectangular table and just flatten it into one long array. That's what it looks like: hundreds of zeros and a few ones, and that defines this window of sequence data.

Then we shift one residue at a time. I've moved it down by one residue, one null residue has been removed, and now we're centred around the E, the glutamate, which is marked there. We repeat this; for a 100-residue protein you would repeat it 100 times. So each input is 357 bits, and the number of inputs is the length of the protein: if the protein is 300 residues long, it's 357 by 300, most of which are zeros. The output data is the set of beta sheet, helix, or coil values for each of the 300 residues in the protein sequence. So, to encode this, we've done our amino acid encoding, we've chosen the window size, and we've converted the windowed sequence into bits, zeros and ones, as needed.
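To make the padding, windowing, and flattening concrete, here's a small self-contained sketch. The names and the exact padding scheme are illustrative assumptions; the point is just that a 17-residue window of 21-bit one-hot codes flattens to a 357-bit vector.

```python
import numpy as np

# Sketch of the padding, windowing, and flattening steps (illustrative names).
ALPHABET = "ACDEFGHIKLMNPQRSTVWY" + "-"       # 20 amino acids plus a null character
WINDOW = 17                                   # residues per window
HALF = WINDOW // 2                            # 8 nulls padded onto each end

def one_hot(residue):
    vec = np.zeros(len(ALPHABET), dtype=int)  # 21-bit one-hot vector
    vec[ALPHABET.index(residue)] = 1
    return vec

def windows(seq):
    padded = "-" * HALF + seq + "-" * HALF    # pad the N and C termini with nulls
    for centre in range(len(seq)):
        chunk = padded[centre:centre + WINDOW]
        # flatten the 17 x 21 one-hot table into a single 357-bit vector
        yield centre, np.concatenate([one_hot(aa) for aa in chunk])

for centre, vec in windows("SAPGKVIL"):
    print(centre, vec.shape)                  # each window vector is (357,)
```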
Then we have these null amino acids, as I illustrated before, so that we can calculate things at the very ends, the beginning and the end of the sequence. We flatten it; there's a flattening process which converts a table of 21 by 17 to just a flat array of 357. And we encode the secondary structures as we talked about before, with codes like 010 and 100 and so on. So that transformation process is somewhat similar to what we did for the irises, but more elaborate, and you can see how the size of the input vectors is huge compared to what it was for the iris.

We're supposed to choose a model, and we've already chosen an artificial neural net. We could have encoded it for a convolutional neural net, or maybe a graph neural net, but we're keeping things simple, so we've chosen the plain artificial neural net. The architecture, and we've talked about this before, is chosen to match the dimensionality of the input and the output. Each sample has 357 bits and there are three possible secondary structures, so the hidden layer size has to be somewhere between 357 and three. Each input unit in the diagram corresponds to 21 amino acid inputs, one per amino acid type.

We've talked about this before; this is the same slide that was used for the iris one on the activation function. This is critical for any neural net, or any machine learning method that uses gradient descent optimization. You could use the sigmoid function or the softmax function, or a couple of others. This is the same thing we've seen before about what the math means, what these functions look like, and some of the reasons why softmax is preferred in many cases for neural nets, one being that its outputs sum to exactly one.

So we have to fill in the input layer, hidden layer, and output layer; there are two weight matrices, and there are biases as well. This is very similar to what was done for the iris neural net problem. Likewise, there are batches, and we've chosen to do batch learning just like we did with our own neural net; again, we try to make sure that the number of batches is a whole number and that the number of items in those batches is also non-fractional. We've also talked about the neural net training loop: forward propagation, error determination, back propagation, and weight and bias updating, and you do that within the training set, then within different batches, and then within different epochs. So we've got repeats on repeats. And that's the whole thing; there's no way anyone would want to do this manually. It's just for computers, because these are tedious, time-consuming, error-prone calculations where you're modifying your weights incrementally and doing it for dozens to hundreds of nodes, thousands of times. But it's very much like classical optimization, which has been done since the 1960s: gradient descent, Newton-Raphson methods, and all of that. We also have to deal with the learning rate, how big the training set should be, and how many epochs we need to do; those are all handled as part of the same function.

This again is a piece of code that we've already seen for the iris problem. Forward propagation is a similar step to what we did with the iris, but the code is slightly different, just because of the structure and size of the data. It's still essentially what we call a dot product. We also have to calculate the error.
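To make that concrete, here's a small sketch of the sigmoid and softmax activations and a single forward pass through a network with 357 inputs and 3 outputs, matching the dimensions above. The hidden layer size, the random weight initialization, and the variable names are illustrative assumptions, not the course's actual code.

```python
import numpy as np

# Sketch of the forward pass with the two activation functions mentioned.
def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def softmax(x):
    e = np.exp(x - np.max(x))                 # subtract the max for numerical stability
    return e / e.sum()

rng = np.random.default_rng(0)
n_in, n_hidden, n_out = 357, 40, 3            # hidden size chosen for illustration
W1, b1 = rng.normal(0, 0.1, (n_in, n_hidden)), np.zeros(n_hidden)
W2, b2 = rng.normal(0, 0.1, (n_hidden, n_out)), np.zeros(n_out)

def forward(x):
    hidden = sigmoid(x @ W1 + b1)             # input layer -> hidden layer
    output = softmax(hidden @ W2 + b2)        # hidden layer -> three-class output
    return hidden, output

x = rng.integers(0, 2, n_in)                  # a stand-in 357-bit window vector
hidden, output = forward(x)
print(output, output.sum())                   # three probabilities that sum to one
```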
Calculating the error is also very similar to what was done with the iris problem. Things simplify down to simple differences: we're looking at the forward-propagated output of layer two and doing a subtraction initially, but then we use gradient descent as we move from layer two to layer one and from layer one to layer zero. The formula that's used is embedded here, and it's also written in Python. We've talked about that cost function, how we get the partial derivatives, and that we're using the sigmoid function and its derivative; there's the cost function as well, and we propagate through.

We obviously have to update the weights. Those are the weight matrices, some of which can be quite large, and they're adjusted using the delta values we've determined. In addition to the weights, we update the biases and shift things accordingly; that's, again, just optimization. (There's a small code sketch of one of these update steps at the end of this part.) The same code is used, or largely reused, for repeating within the batches and then within the epochs: doing the forward propagation, error calculation, back propagation, and weight and bias updating, and repeat and repeat and repeat. A lot of it is reused from the iris neural net, but with some differences because we're dealing with a different style of input and a different style of output.

This is just showing how some of the hidden layers change. We've got output units, hidden units, input units and hidden values, and we're going through the different epochs; these are how the numbers change through the training period. I think we did it up to about 1,000 epochs. You can see some numbers are higher and some lower; after 1,000 you're just seeing it restart again, but if you watch long enough you can see the colors changing, which is really a reflection of their value or intensity. So this is how the weight matrices change over time. I'm just taking some numbers and showing what the weights are and pointing out what they correspond to: those are the lines that connect the nodes, and these are what we call the scoring matrices.

Just like with the iris neural net, we calculate errors, and these errors fall with the epochs. Because we did 1,000 epochs, the error goes from terrible to very good, well below probably 0.05 or whatever we're calculating, and eventually it flattens out. This is something you always try to track with neural nets: your target function, how good is my error, how close is my prediction getting, and is it starting to stabilize. Sometimes it stabilizes near perfection; in the case of secondary structure, the best you can typically get is around 60 or 65%, at least with the sort of naive model we're using. If we'd used embedding and multiple sequence alignments and other tricks, which are complicated, I'm sure we could have gotten to 85 or 90% accuracy.

To write this program we reused a lot of the iris code in Python, so it's relatively modest in size. But to port it to R, it's almost twice the length; there are certain limitations and restrictions with R when it comes to manipulating characters, especially large numbers of characters. In terms of overall speed, the Python program, at least run on Google Colab, is about the same speed as the R one. As a rule, though, most Python programs run independently on a standard Python interpreter are about 10 times faster than R, and in some cases they can be up to 100 times faster.
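Before moving on to validation, here's the update sketch I mentioned, continuing the forward-pass example above. The learning rate and the helix target are illustrative, and the output delta assumes a cross-entropy-style cost so that it reduces to output minus target; the hidden-layer delta uses the sigmoid derivative h(1 - h). This is a sketch of the general technique, not the course's actual back-propagation code.

```python
# Continuing the forward-pass sketch: one gradient-descent update for a
# single training window (illustrative learning rate and target).
learning_rate = 0.1
target = np.array([1.0, 0.0, 0.0])            # say the centre residue is a helix

hidden, output = forward(x)                   # forward propagation
delta_out = output - target                   # error at the output layer
delta_hidden = (delta_out @ W2.T) * hidden * (1 - hidden)   # back-propagated error

W2 -= learning_rate * np.outer(hidden, delta_out)   # update hidden -> output weights
b2 -= learning_rate * delta_out
W1 -= learning_rate * np.outer(x, delta_hidden)     # update input -> hidden weights
b1 -= learning_rate * delta_hidden
```

In the real training loop this update is wrapped inside the batch loop, which is wrapped inside the epoch loop, exactly as described above.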
So we've done our training on this, we've got convergence, and we're happy, at least with the error. As I said, it reused a lot of code from our previous Python neural net. But the key thing a lot of people forget is to validate their model: they don't test on testing data, they only report the performance on training data. So, just like we did for the iris one, we've added a little bit of code so that we can perform the test and see how well it does with new, never-before-seen data; it does a forward propagation to assess its performance.

We could have used the entire set of about 7,000 proteins, but that would have been a million to a million and a half amino acids, and that would have taken several hours of training time, which we don't have. So we reduced it to roughly 10% of the whole PPT-DB set, and we broke that up again into 70% of 710 and 30% of 710.

This is the result of the secondary structure predictor and what you want to look at. It's 48% correct on beta sheet prediction, 69% on coil prediction, and 63% on helix prediction; that's what we got on the training set. On the testing data set the numbers are slightly lower, for instance 46 instead of 48 and 63 instead of 65. This is not as good as the iris results, but the iris data is almost trivial, whereas secondary structure prediction was, until recently, an unsolved problem; AlphaFold solved it, but when we put this together as an example it was still kind of unsolved. We also did a very simple-minded evaluation with a simple-minded algorithm. What you do in secondary structure evaluation is calculate what's called a Q3 average performance. It's kind of like taking the diagonal, but also accounting for how frequent these things are, so it's not a simple percentage average; it's an average over the number of residues. (There's a small code sketch of this at the end of this part.) In the training set we got a 61% Q3 score, and in the testing set we got a 61.2% Q3 score, almost identical. As I said, you want your training and testing results to be within a few percent of each other; if one drops by 20%, or miraculously increases by 20%, there's something wrong with your model or with your testing and evaluation data.

This kind of performance, 61%, is better than the 50% of Chou-Fasman, but it's clearly not as good as the PhD work by Burkhard Rost, which made him famous back in the mid-80s, or I guess it was the early 90s actually. What we've done here is written a secondary structure prediction program in Python, trained it on hundreds of examples, not thousands, and tested it on a modest set. The code can be used for other types of secondary structure analysis. There are also tasks where you have to predict whether a protein has a membrane-spanning region, or whether there's a signal region, a signal peptide, and that's a similar concept: if you have enough training data, this model could learn it. You could also use it for signal site prediction and gene prediction, and in fact we'll take some of the same concepts we've used for secondary structure and apply them tomorrow to gene prediction. Obviously, if we used embedding, if we added more hidden layers, if we added other features, we could make this thing better.
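Here's the Q3 sketch I mentioned. This is a simplified, residue-pooled version: the fraction of residues whose predicted state (H, B, or C) matches the observed state, counted over all residues rather than averaged per class. It's an illustration of the idea, not the exact formula used in the course code.

```python
# Minimal sketch of a Q3-style score over residues (illustrative only).
def q3(observed, predicted):
    assert len(observed) == len(predicted)
    correct = sum(o == p for o, p in zip(observed, predicted))
    return 100.0 * correct / len(observed)

print(q3("HHHHCCCBBB", "HHHCCCCBBB"))   # one mismatch out of ten -> 90.0
```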
The thing is, because we're basically coding all the nitty-gritty stuff ourselves, it's just really tedious and really hard. This is why, you know, I've been teaching you how to climb the mountain on foot; what we want to do tomorrow is show you how to scale the mountain with a helicopter, and that will make it much, much easier. That's what Keras and scikit-learn are all about. When you've got a much easier system for putting your algorithms together, you can also start optimizing things, making them more complex, or testing different parameters. We don't have that luxury here because we're coding everything: all the derivatives by hand, all the dot product calculations, all the error calculations, all the input, output, and flattening processes have to be done by hand and checked and rechecked and recoded.
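Just to give a flavour of what that looks like, here's roughly the same network sketched in Keras. The layer sizes, optimizer, loss, and training settings are illustrative choices on my part, not the workshop's actual code for tomorrow.

```python
# A flavour of tomorrow's approach: roughly the same network in Keras.
from tensorflow import keras

model = keras.Sequential([
    keras.Input(shape=(357,)),                       # flattened 17 x 21 window
    keras.layers.Dense(40, activation="sigmoid"),    # hidden layer
    keras.layers.Dense(3, activation="softmax"),     # H, B, C probabilities
])
model.compile(optimizer="sgd", loss="categorical_crossentropy",
              metrics=["accuracy"])
model.summary()
# model.fit(X_train, y_train, epochs=100, batch_size=32,
#           validation_data=(X_test, y_test))
```

All the forward propagation, back propagation, weight updates, and batching that we coded by hand are handled for us here, which is exactly the point of tomorrow's material.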