Welcome to module eight of machine learning. This is the last module of this two-day course and we're going to try and wrap things up. As usual, everything in this module, and in all parts of this course, is under a Creative Commons license, which is essentially share-alike with attribution. This module, along with the previous one, focuses on using machine learning with Keras and scikit-learn. The idea is to demonstrate how you can use these powerful libraries, which are available through Google and elsewhere, to make your life easier when you're doing machine learning. We've already shown how we could use Keras and scikit-learn for the iris classification problem. Here, what we're going to do is demonstrate how we can use Keras for secondary structure prediction with an artificial neural net. Then we're also going to show how we can use hmmlearn, which is not really scikit-learn but is an offshoot of it, to do promoter motif recognition. So these are a little more advanced, a little more challenging than the simple iris examples, but again, the point is simply to show you the feasibility of taking a real bioinformatics problem, with real bioinformatics data, and applying scikit-learn, Keras, hmmlearn, and others. And you'll see this not only in the Python work, but also in R.

Now, what I also want to highlight, and Francis brought this up: we have about an hour here where we're going to go through both this lecture and lab, and then we're going to leave about the last 15 minutes, starting at 5:45, or about an hour from now, so that people can complete the survey and provide any feedback. It's also time to answer anyone's questions. I know that people have obligations, and we'll try to wrap everything up by six o'clock, or as close to that as possible.

So, in module four yesterday we looked at secondary structure prediction with artificial neural networks. We had a program called SAN.ipynb; this one is going to be the Keras version of SAN, as an .ipynb notebook. The structure is similar to what we did before: how do I predict protein secondary structure from sequence data? That's our question, that's our problem. We're going to use the same training and testing data that we talked about before, from the PPTDB. Here's an example of it: protein name, sequence, and secondary structure, and we can use it in this format. The actual code for this Keras module is in module eight; you can open it up, click on it, and see the Keras version. Make sure you look for "Keras" in the name; that's important to distinguish it from other versions of SAN. If you open it up and look inside, even as I'm speaking or maybe during the lab, you'll see that it's very similar. It uses NumPy and Pandas for the mathematical operations, array handling, and data frames. We read a converted data set as a comma-separated file, and we process it exactly the same way as we did last time: we take the amino acid sequence and secondary structure, and we check for any non-standard amino acids and clean up anything that's out of range. So all of that data checking, which we also did in module four, is being done here by this program. Missing-value and label checks are there too, just to make sure everything is complete. Again, nothing different from before. We create a training set of 70% and a testing set of 30%, and the code is very similar.
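To make that concrete, here's a minimal sketch of that read-check-split step. The file name and column names (converted_data.csv, seq) are hypothetical stand-ins, not necessarily what the actual notebook uses:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Read the converted data set (file and column names here are hypothetical)
df = pd.read_csv("converted_data.csv").dropna()  # drop incomplete records

# Keep only sequences built from the 20 standard amino acids
standard = set("ACDEFGHIKLMNPQRSTVWY")
df = df[df["seq"].apply(lambda s: set(s) <= standard)]

# 70% training, 30% testing
train_df, test_df = train_test_split(df, train_size=0.7, random_state=42)
```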
So it's the same split-dataset test/train function that we've used on all of our other data sets. That's all a repeat, and it's probably not worth going into detail. In terms of transforming the data set and selecting features, we're going to do exactly the same thing, because it still requires some one-hot encoding. This is something that's not automatic, and knowing how to do one-hot encoding, or coming up with a schema for one-hot encoding, is something you pretty much always have to do, regardless of whether you're using Keras or not. So this is how we're encoding, and it's 21 characters: we're converting characters into binary vectors, and we're dealing with the 20 amino acids plus a null amino acid, which gives us 21. We go from A, C, D, E, F, G, H, K and so on, all the way through the 20 amino acids. We also one-hot encode the secondary structure, using three binary vectors: 1-0-0, 0-1-0, 0-0-1. It's the same idea as the amino acid encoding alphabet: we give each symbol a unique binary vector. This is essentially how we write it out to make sure we get 21 different vectors, one for each amino acid, and the same thing is done to create the binary codes for the secondary structure.

So that's the code. We pad the sequence as we did yesterday in module four: we add eight extra amino acids at the N-terminus and eight extra amino acids at the C-terminus, all given the null character. This is because we're using a window of 17 residues, and the padding ensures that we can always predict a secondary structure, even at the first or the last residue. We have the window size we've settled on, 17; we transform the amino acids into their binary representation, which is what we talked about before; and the padding characters get the null binary code, all zeros except a one in the final (null) position. When running, we just slide the 17-residue window along, converting each window into its 17 x 21 = 357 elements; we fetch the protein sequence and assign each position to a secondary structure character. So that's the standard setup; that's what we had to do previously for secondary structure prediction. If you look at the code from module four, the first half of it will be essentially identical. And again, that just emphasizes the fact that yes, scikit-learn's great, yes, Keras is great, but you still have to do some coding to set things up, and if you don't design it properly, you can't really run these programs. That's why we spent a fair bit of time talking about one-hot encoding and converting things, and even some aspects of normalization and scaling, which we didn't have to do for this one, but which we did have to do for the flower work.

So the input is 357 units, that's 21 times 17. Where in the neural network from yesterday, in module four, we had relatively few hidden layers, with Keras we can actually play a little bit more. We can have more than one hidden layer; we can have three, five, ten if we want. And as I mentioned before with deep neural nets, because you can make use of the GPUs and some of the horsepower on Google's cloud, you can go a little further out.
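As a rough illustration of the scheme described above, here's a sketch of the 21-character one-hot encoding plus the 17-residue sliding window. The symbol ordering, the choice of '-' as the null character, and the function names are assumptions for illustration, not the course's exact code:

```python
import numpy as np

# 20 standard amino acids plus a null/padding symbol: 21 characters total
ALPHABET = "ACDEFGHIKLMNPQRSTVWY-"   # '-' is the null padding character
SST = "BCH"                          # beta strand, coil, helix
WINDOW = 17
PAD = WINDOW // 2                    # 8 residues on each side

def one_hot(symbol, alphabet):
    """Return a binary vector with a single 1 marking the symbol's position."""
    vec = np.zeros(len(alphabet))
    vec[alphabet.index(symbol)] = 1.0
    return vec

def encode_protein(seq, sst):
    """Slide a 17-residue window along the padded sequence; each window
    becomes a flat 17 x 21 = 357-element input vector, labelled with the
    secondary structure of the central residue."""
    padded = "-" * PAD + seq + "-" * PAD
    X, y = [], []
    for i in range(len(seq)):
        window = padded[i:i + WINDOW]
        X.append(np.concatenate([one_hot(aa, ALPHABET) for aa in window]))
        y.append(one_hot(sst[i], SST))
    return np.array(X), np.array(y)
```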
And you'll see that, although it's subtle, using extra layers, the architecture and strength of deep neural nets, gets you a modest performance improvement. Just as we did before with the neural net, we have to import some libraries. We import Dense and Sequential: Dense is the layer type and Sequential is the framework model. We import them, we call the Sequential function, which sets the classifier to a sequential model, and then we start adding layers. And like the iris example, we're also going to be using the ReLU activation function, at least for the first hidden layer. The more layers we add, the deeper the net; in this case I think there are six layers, so we're making this a fairly deep neural net. Then with compile and fit, we can choose what we want in terms of the batch size and the number of epochs, the loss function, which in this case is cross-entropy, how we're measuring accuracy or error, and the gradient descent optimization, which is the Adam function. All of those things can be invoked. So the neural net is essentially very similar, almost identical, to the one we used for the iris data; the key difference, obviously, is in how we structured the data: setting the data up so that it could be properly read and the output could be properly interpreted. We have the predict function: we give it our input data, and the resulting array is kicked out.

I'm not sure why, and I've seen this in a few slides, but the resolution on this image has sort of just vanished. Perhaps Life and Louisa, maybe after this course, you can go through and figure out what happened to some of these screenshots and why they suddenly went blurry. Anyways, what happens is that this produces a collection of probabilities: B, which is beta strand, C for coil, and H for helix, and we take as the prediction whichever wins. In this case, apparently at 90%, although it's hard to read, the structure associated with this particular residue is a coil. We can also generate the confusion matrix and determine how we did against the actual structures in the training or test set. So this is just a comparison between what we predicted and what we saw. Overall, outside of having to do the reading, checking, and one-hot encoding, the calls to the neural net using Keras certainly simplified the whole calculation: what would have been 100 lines is reduced to about 10 lines.

In case you don't remember, this is what we got on our test set from yesterday's secondary structure artificial neural net, or SAN. You're mostly interested in the diagonal: 46% for beta, 69% for coil, 65% for helix. Given the overall abundances, you can calculate a Q3 for this, which is just the percentage of all residues assigned to the correct one of the three states, and the Q3 I think is around 61 or 62%, which is okay, but not great. If we go to the next slide: beta isn't quite as good, dropping from 46% to 43%, but what has improved is that the coil percentage has gone up to 70%, and helix has gone up to 65% or 66%, and some of the off-diagonal predictions are slightly reduced. So with Keras, using a deeper neural net, there's a modest improvement, probably in the Q3 score, and this is, I guess, the advantage of deeper nets: more subtle patterns get detected.
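Here's a hedged sketch of what that Keras setup looks like. The hidden-layer sizes, batch size, and epoch count below are placeholders (the course version reportedly has about six layers), and X_train, y_train, and X_test are assumed to come from the window-encoding step above:

```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense

# Deep feed-forward classifier: 357 inputs (the 17 x 21 one-hot window),
# a few ReLU hidden layers, and 3 softmax outputs for B, C, H
classifier = Sequential()
classifier.add(Dense(128, activation="relu", input_dim=357))
classifier.add(Dense(64, activation="relu"))
classifier.add(Dense(32, activation="relu"))
classifier.add(Dense(3, activation="softmax"))

# Cross-entropy loss, Adam optimizer, accuracy as the reported metric
classifier.compile(loss="categorical_crossentropy",
                   optimizer="adam",
                   metrics=["accuracy"])

# Train, then predict per-residue class probabilities on the test windows
# (X_train, y_train, X_test come from the window-encoding step above)
classifier.fit(X_train, y_train, batch_size=32, epochs=30)
probs = classifier.predict(X_test)  # each row is [P(B), P(C), P(H)]
```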
Now, if we compare: the Python secondary structure net with Keras consists of about 270 lines, or about 240 actual coding lines, and it runs in about 54 seconds. The R version is actually a little shorter and much faster overall than the Python one. Compared to the original pure-Python version that we wrote, the code length is modestly shorter for the Python with Keras, and it's obviously simpler to implement. So what we've got now is both an R program, where we use the deep net R functions, and a Python program, where we use the Keras functions. They've been tested on a training set of about 490 proteins, just like the one we used in module four, and a test set of about 210. As I think we highlighted yesterday, and even today, you could reuse this code with the Keras functions or with the deep net functions, and apply it to membrane-spanning prediction, signal site prediction, and other things.

Now, like module seven, we're going to have a little lab in the middle of the lecture. Again, I just want people to pull out SAN and do what we've always done before. In this case, go to module eight and open up the code; you can get either the Keras version or the deep net version in R. Open them up, start browsing, and look to see what they're like. We'd also encourage you to look at module four to compare the code between the two: look at the differences, look at the similarities. Again, you can run the program and upload the data; in this case it's converted data.csv, which is the same data set we used yesterday. Again, run everything through the runtime menu. Then we have various instructions where you can go to different cells, change the layers, run things with different numbers of layers or with no hidden layers, and see what happens and what your optimal performance is.

The next part we wanted to look at, after the secondary structure prediction, is essentially what we did this morning: the hidden Markov motif analysis. In this case we're not using scikit-learn, we're using hmmlearn. It's a Python library, similar in spirit to Keras but built specifically for hidden Markov models. Just like Keras makes it really easy to build neural nets, hmmlearn does the same for hidden Markov models. Same pathway that we've always talked about, same colour scheme. In this case, we're going to try to identify motifs from unaligned sequence data. If you recall from the hidden Markov model module, we could use alignments, and that would save us time, but to make it more challenging we're just using a collection of sequences: 1,805 promoter sequences with known transcription start sites, covering the first 50 nucleotides before the transcription start site. It's the same set we used with our pure-Python HMM. Same data: we have forward strand data and reverse strand data, or rather the reverse complement, so that we can always read five prime to three prime to keep things consistent. If you want to take a look at the code, again, it's in module eight, and you can look at either the Python or the R version. So we have the HMM motif program with hmmlearn in Python. As before, we import NumPy and Pandas to help with the math and the data handling. We read our collection of motif sequences; it's a comma-separated file, and we use dropna to help sort things out. So again, that's pretty much identical.
So we've got our data set and we've read it in. Now we're going to transform it and select features. This is where we have to do the encoding, and we talked about that before: it's the same thing, converting letters to numbers, because computers handle numbers, and we do this for every sequence in our set. So pretty much everything to this point is what we did previously when handling the HMM in module five. We're using the same, arguably simplistic, topology where we've got 50 hidden states. We're not using insertion states or deletion states, which probably would have improved the model, and we have too many hidden states, which also makes the model a little messy. That's one of the reasons why the performance doesn't go as well as we'd hoped.

Now, when we used pure Python only, we had to write definitions for the forward algorithm and the backward algorithm, we had to combine the two to create Baum-Welch, and then we had to call the Viterbi dynamic programming algorithm. All of those had to be prepared up front. Then we had to initialize and train the model, and then do the decoding, which is the motif prediction. Those are all the steps, outlined here, that we had to build with pure Python, and it was a lot of coding. With hmmlearn, instead of all those other things, building Baum-Welch, Viterbi, and the initialization, all we have to do is basically initialize, train, and decode. We just tell it the number of hidden states we want to use, call the training function, and then run the decode to predict the motifs. This vastly simplifies the code construction.

Just like calling pip to bring in TensorFlow, we have to call pip to install hmmlearn; pip brings in all of those functions and modules. From hmmlearn, we import hmm, which is the package, and we indicate how many states we want to use. As before with our pure-Python model, we decided on 50 hidden states; I think there's lots to criticize in that, but that's what we went with. We call the MultinomialHMM constructor, which takes the number of components, the decoding algorithm, which we set to Viterbi, and the number of iterations. We have to decide how much data to train on, just like with the fit function we used with the neural net: we call fit on the selected data set, passing in our encoded sequences and the training size we've chosen. So this is essentially how we train the model.

For comparison, there is the forward algorithm, which we previously talked about for hidden Markov models: this defines the forward function, and it computes the alpha table, the forward probabilities we want. We also had to define the backward one, which computes the beta probabilities; the slide probably shouldn't say alpha table there, it should be the beta table. And then Baum-Welch, which iterates both A and B as well as the initial distribution, uses expectation maximization; those are called to generate both A and B. This is the Viterbi function: it computes the hidden states and generates the sequence of observables. Again, in this case most of the listing is actually comments rather than actual code. And then we have the HMM initialization, which is described here. So that's the code we have for the HMM. At that stage, we can start testing and validating with the test data set. This is the list of probabilities that are produced.
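A minimal sketch of that hmmlearn initialize-train-decode pattern, assuming the A=0, C=1, G=2, T=3 encoding and a hypothetical training_sequences list of 50-nucleotide promoter strings. Note that the class name depends on your hmmlearn version: recent releases use CategoricalHMM for integer-symbol data, where older releases used MultinomialHMM the way the slides describe:

```python
import numpy as np
from hmmlearn import hmm  # pip install hmmlearn

# Map nucleotides to integers, as in the course: A=0, C=1, G=2, T=3
NT = {"A": 0, "C": 1, "G": 2, "T": 3}

def encode(sequences):
    """Stack all sequences into one column vector of integer symbols,
    with a lengths list so hmmlearn knows where each sequence ends."""
    X = np.concatenate([[NT[c] for c in s] for s in sequences]).reshape(-1, 1)
    lengths = [len(s) for s in sequences]
    return X, lengths

# 50 hidden states, Viterbi decoding, up to 100 Baum-Welch (EM) iterations
model = hmm.CategoricalHMM(n_components=50, algorithm="viterbi", n_iter=100)

# training_sequences: a hypothetical list of 50-nt promoter strings
X, lengths = encode(training_sequences)
model.fit(X, lengths)

# Decode the first sequence: log-likelihood and most likely hidden-state path
logprob, states = model.decode(X[:lengths[0]], [lengths[0]])
```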
Once the model is trained, these are the trained emission probabilities in the different states. We've got a 50-base-long sequence that we're looking at, so the top row is state 0 (technically the last rows should be labelled states 48 and 49 rather than running up to 50, since the numbering starts at zero), and what's written in the columns are the probabilities. For state zero, or position one, we can run across and see that 0.548 is the highest probability, so A is most likely for that one. T is most likely for state 1, G is most likely for state 2, and so on all the way down. So these are the emission probabilities. We can also take a look at the transition probabilities, and we can look at the initial state probabilities. From these, we can pick the highest emission probability for each state and return a list that represents the sequence. Now, remember we converted A = 0, C = 1, G = 2, and T = 3, so what we produce is a numeric array, which we then have to convert back to A's, C's, G's, and T's. The decoding is essentially that conversion, turning the integers back into a sequence of nucleotides.

So compared with the pure-Python version, where we had to write the forward algorithm and the backward algorithm, where we had to write the Viterbi algorithm, where we had to do lots and lots of stuff, the actual line count with just Python was almost 350. With hmmlearn, it's still about 100 lines of code, so it's not trivially small, but it's about one third the length, and a lot is saved by using hmmlearn. Still, as you've seen, there are elements where we obviously have to write code to encode things, and we have to make sure the forward and backward algorithms get invoked, because we're still training; we're working with unaligned sequence data. So there's still the work, or if you like, the logic, that you have to be aware of in order to build this. It's not a single-line function like an addition or a subtraction; you can't just call HMM and have it do some magic. There's still a fair bit of coding involved.

The Python version with hmmlearn was 98 lines, and it runs in about seven seconds. We've written another version in R; it's a little longer, about 139 lines with 95 lines of actual coding, and it runs a little slower, but they all produce equivalent results and they're all reasonably quick. So using hmmlearn, we've been able to take something that was rather daunting in terms of the amount of code we had to write in pure Python, and we've now written it both in Python and in R. In principle, these programs can extract promoter motifs from unaligned sequences, so they don't need a prior multiple sequence alignment, and they're generally shorter. The performance is still not great, but it's generally a little bit better than the version written in pure Python or pure R. As I say, there's a flaw in our concept, our topology design, that makes this particular program perform somewhat less well than desired, but I think we'll certainly be able to work on that for next time.

So that really wraps things up before the lab. We can use the next 10 or 15 minutes for people to run through it. Same thing as we've done before: go to module eight and open things up in either Python or R. In this case, you want to upload some of the data files: there's the emissions text file, the promoter emissions, and the motif sequences that you have to upload.
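Continuing the sketch above, pulling that highest-emission-probability sequence out of the trained model might look like this (assuming the model object from the previous snippet):

```python
# Turn the trained emission table back into nucleotides: for each hidden
# state, take the symbol with the highest emission probability, then map
# the integers back to letters (A=0, C=1, G=2, T=3).
INT_TO_NT = "ACGT"

best = model.emissionprob_.argmax(axis=1)        # one integer per state
consensus = "".join(INT_TO_NT[i] for i in best)  # e.g. a 50-character string
print(consensus)

# The other trained tables are exposed the same way:
# model.transmat_ (transitions) and model.startprob_ (initial states)
```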
You can run it the same way you've always done, and then you can play around with the different cells, as described here, to check things out. So we've tried to make this real. We've tried to give you real biological problems, real bioinformatics problems, and hopefully I've given you examples of how you could translate that to your own work. Certainly we've looked at everything from hidden Markov models, which are really quite difficult, to decision trees, which are much, much simpler, much more rational, and in many cases can do just as well. We tried to code everything from scratch in basic Python, just so you can see under the hood, look at the engine, and see the nuts, bolts, and valves as they're pulled apart. Some of it's ugly, some of it's scary, but when we've used some of the other modules or libraries that are available, pandas, NumPy, SciPy, scikit-learn, Keras, hmmlearn, I think you can see how the coding is simplified. But I think you still fundamentally need to know how to code. It's not like punching numbers into a calculator: you still have to know something about the functions and something about the process to make the calls and to know how to run things in loops. So by giving you code, with as many comments as we possibly can to explain it, hopefully you've got a template that you could use in your own work.

The other point I want to highlight is that our TAs, Life and Louisa, are certainly at your beck and call. If you have machine learning problems that you'd like to tackle but you're afraid to start, get in touch with Life and Louisa; they can probably reuse, or show you how to reuse, some of the code that we've written for this course to help in your own work. At some level, maybe next year, we're going to try to convert some of the tools we've been building into web servers, and that way, rather than having to run the programs all the time, people could just simply upload their data. Life and Louisa are currently working on a web server called WEGAN, W-E-G-A-N, some of which will use a bit of machine learning, but a lot of multivariate statistics and other methods as well, and many of the things they've developed, including the interface and some of the back end, could probably be adapted to these programs. Obviously, there are types of problems we've dealt with, like secondary structure prediction, that are really, really specialized, and probably only one or two of you have any experience with that. But we'll keep them in as examples: they show you how to encode, how to one-hot encode, which I think is really important, and I think they show you how you can manipulate sequence data to help you interpret it.