It's always a strange thing getting this set up. There it is. So today we're going to continue with what we started just before the break. This is module seven, and we're going to continue with Keras and scikit-learn. We originally applied these to the iris problem, which isn't really a bioinformatics problem, so this time we're going to apply them to a real bioinformatics problem: secondary structure prediction. We last looked at secondary structure prediction in module four yesterday, and in that case we used machine learning with neural nets. We didn't really use decision trees for it, so here the Keras module is going to be the most useful. The program we're going to develop is called "secondary structure with artificial neural nets via Keras", and it's written in Python. As I say, this is mostly an extension of what we've already done with the iris example using Keras and scikit-learn, so it's not a big leap, and this module will be relatively short. We've scheduled an hour for it, but we'll probably be done in about 35 or 40 minutes. People can then play with it and run the exercises, but as I said, I'd like people to use this afternoon, and this break in particular, to talk with the TAs or with me about their own machine learning problems and the projects they want to do. So this is a segue, one we've used in the past: a short module that gives everyone a bit of time to talk about their problems during the breakout sessions. And because it's an online course, people can obviously come and go, stretch their legs, eat something, come back and chat. I'd almost call this an interlude or intermission, because it's not a great leap in terms of new material.

Now, the last session, on ChatGPT, is one some of you might find quite interesting. It's a new module for this year, and there are a lot of developments that have happened and are happening, so it may also generate a fair bit of discussion afterwards.

Okay, so that's a bit of a backgrounder. As I said, we've seen these flowcharts before. Hopefully everyone now has them memorized and remembers the process, in particular the points about testing and validating, thinking about how best to define the problem, and how to propose a solution. Everyone should also remember a little about secondary structure: we take the sequence shown at the top and try to predict helices, beta strands, and coils, with beta strands in blue, helices in yellow in the centre, and coil or unstructured regions in black. As we did before, when we constructed a data set we used the Protein Property Prediction and Testing Database, or PPTDB. It's been around for a while and hasn't been updated in a long time, but it still has thousands of protein sequences with their secondary structures annotated. We use the same data set that was used in the original pure-Python implementation, and this is the format of the table. We already know we're using an artificial neural net, and we also know we have to do some encoding. Again, the encoding is a little different, because the output isn't a single label; it's beta, coil, or helix.
So we've got a three-unit binary output. And then our amino acids: in the toy alphabet shown here there are only three amino acids, so we didn't have to do very much, but we're actually working with 20 amino acids, so the encoding is a little more elaborate than what's shown here.

If we were to do the programming as we've pretended to do in the past, now that we're in module seven we would do exactly what we've done before: go into Colaboratory and open up the file, in this case the SSANN Keras notebook. If you look at the code, or if we pretend you're writing it yourself, you would import numpy and pandas as before, and then import the data. This is the code we're using to do that import: we read the sequence and we read the secondary structure. We have the same code to check things and remove non-standard amino acids; this is just looking for X's, but there are obviously other non-standard amino acid codes like B's and Z's. We do some missing-value and column-length checks, the same as before, so the code is identical to what we've been using: that's the verify-dataset function. And we do the same split of the data set into training and test sets, so again roughly 70% for training and 30% for testing.

Then there's the whole point about transforming things. Remember, especially with character data, we have to do one-hot encoding or embedding. This is information we saw yesterday, but just to remind you: we're converting the letters to binary. We've got 21 characters, the 20 regular amino acids plus the null character shown here. And I talked to you about the encoding for the secondary structure: we've chosen three patterns, 1-0-0, 0-1-0, and 0-0-1, so a three-digit encoding. This is the same sort of code that was written last time to get the encoding done, and it's not unlike what we were doing with the neural net for iris; a lot of the same code, the one-hot encoding, the file reading, and the file verification were all written before. Here's the alphabet with the secondary structure characters, and here's the null padding we had to do, both at the beginning and at the end. Remember, we're using a window of 17, so we need eight nulls at the beginning and eight nulls at the end.

Just as a reminder, secondary structure typically arises from interactions among multiple amino acids, not just individual ones. So we window: we take a collection of 17 amino acids, scan along, and predict the property of the centre amino acid. That's illustrated here; the window in the figure is a little smaller than what we're actually using, but it's just to illustrate schematically what we're trying to do. There's a glutamate at the centre of the window, and it's a helix. We shift over to the next residue, a proline, which is also a helix, and we slide along through the entire sequence. Again, this is data we've shown before in terms of the one-hot encoding: the zeros and ones for the nulls, and how the different amino acids are represented in this binary amino acid alphabet. Because we're taking a window of 17, we prepare a flattened vector rather than a matrix, 357 bits long, which is 17 times 21. We couldn't fit all 357 numbers on the slide, but you get the idea. So that's the flattened array for those 17 amino acids as they're read through.
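Just to make that concrete, here's a minimal sketch of the kind of encoding and windowing code we're describing. The names (AA_ALPHABET, SS_CODES, encode_windows) and the toy sequence are illustrative, not necessarily what's in the course notebook, but the idea is the same: 21-letter one-hot vectors, null padding at both ends, and a 17-residue window flattened into a 357-bit input.

```python
# Illustrative sketch of the one-hot encoding and 17-residue windowing
# described above; names and the toy sequence are made up for this example.
import numpy as np

AA_ALPHABET = "ACDEFGHIKLMNPQRSTVWY-"   # 20 amino acids plus '-' as the null character
SS_CODES = {"H": [1, 0, 0], "B": [0, 1, 0], "C": [0, 0, 1]}  # helix, strand, coil
WINDOW = 17
HALF = WINDOW // 2                       # 8 nulls padded on each end

def one_hot_residue(aa):
    """Return a 21-bit one-hot vector for a single residue."""
    vec = np.zeros(len(AA_ALPHABET), dtype=int)
    vec[AA_ALPHABET.index(aa)] = 1
    return vec

def encode_windows(seq, ss):
    """Slide a 17-residue window along the padded sequence and return
    (n_residues x 357) inputs and (n_residues x 3) one-hot outputs."""
    padded = "-" * HALF + seq + "-" * HALF
    X, y = [], []
    for i in range(len(seq)):
        window = padded[i:i + WINDOW]
        # flatten 17 x 21 one-hot vectors into one 357-bit input vector
        X.append(np.concatenate([one_hot_residue(aa) for aa in window]))
        y.append(SS_CODES[ss[i]])        # label for the centre residue
    return np.array(X), np.array(y)

X, y = encode_windows("MKVLAEPH", "CCHHHHCC")
print(X.shape, y.shape)                  # (8, 357) (8, 3)
```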
And then the same shifting is done with that same 357-bit vector as we slide through. These are all slides I showed last time, but it's just a reminder, since it's been a long couple of days. Once we've done the encoding, we have inputs of this length, and we have the output schematically illustrated on the right: lots of zeros and very few ones. This is the code that does that transformation in Python: we set the window size of 17, we've got our alphabet, we concatenate the nulls onto the beginning and end, and we get the flattened array of 357 values. Same thing with the secondary structure: we're just converting the H's, B's, and C's to the different zero-one combinations. Same diagram, same process.

Then we have the neural net model. Again, this highlights the point that the architecture of the neural net is defined by the size of your input layer and the size of your output layer. In our previous model, where we wrote the artificial neural net ourselves, we had, as far as I recall, just one single hidden layer. But with Keras we can choose more hidden layers, and because Keras is a little smarter, we can also arbitrarily choose how many nodes are in those hidden layers. This figure shows two hidden layers with six nodes each; we could have 20 nodes in the first hidden layer and five in the second, or we could have three hidden layers. And as I think we've highlighted before, the more hidden layers you have, the more capable the neural net potentially is at solving difficult problems. It's not just the number of layers, either: there are models like transformer networks, graph neural nets, and recurrent neural nets with even more complex interactions, where connections don't just go from one node to the next; they can jump over other nodes, and the interactions between nodes are more than a simple weighted function. That would take us into a deep dive into some of the more complicated aspects of learning, but the nice part about Keras is that some of those models are ones you can simply call. You don't have to worry about the tough coding; it's just a function.

So this is where things get a little different from the original program and the original algorithm. Just like we did in the neural net for the iris problem, we have to bring in both the Dense and Sequential functions. Sequential is the framework for the model; Dense is the layer type, with full connections between the neurons in adjacent layers of the neural net. In terms of adding layers, we're adding hidden layers and we've chosen a hidden layer size of six, and the input dimension is the 17-residue window times the 21-letter alphabet, 357 in total. You're seeing classifier.add, which is the call that adds a layer, and in this example we're using ReLU rather than sigmoid as the activation function. If we wanted another hidden layer, we'd call add again to add a second one, and call it a third time to add a third layer. We can put this in a loop, with a maximum hidden layer size, in this case of six. So that's really easy to do.
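Here's a sketch of that model construction, assuming the 357-bit window encoding above; the layer sizes, the softmax output layer, and the variable names are illustrative choices rather than a copy of the course code.

```python
# Illustrative sketch of building the Keras classifier described above.
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense

INPUT_DIM = 357          # 17-residue window x 21-letter alphabet
HIDDEN_SIZE = 6          # nodes per hidden layer (arbitrary, tunable)
N_HIDDEN = 2             # number of hidden layers (also tunable)

classifier = Sequential()
# first hidden layer needs the input dimension
classifier.add(Dense(HIDDEN_SIZE, activation="relu", input_dim=INPUT_DIM))
# additional hidden layers can simply be added in a loop
for _ in range(N_HIDDEN - 1):
    classifier.add(Dense(HIDDEN_SIZE, activation="relu"))
# three output nodes, one per class (helix, strand, coil); softmax is an
# assumption here so the outputs behave as class probabilities
classifier.add(Dense(3, activation="softmax"))
```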
The same thing goes for training: just as with the iris model, we compile, which puts the layers together, and we specify an optimizer; this is Adam again. The loss function is the cross-entropy one, and the metric is the accuracy of the performance. Then when we make the classifier.fit call, that starts the training, based on the batch size and the number of epochs we've chosen. So again, it's the same code that was used in the iris example.

At this stage I haven't shown every detail of the code, but the essence is that we're able to reuse a lot of the code we wrote for iris, and you could do the same thing with just about any other neural network program. You can see there's flexibility in the number of hidden layers and the number of nodes per layer, and there are other things we can change, from batch size to epochs and other hyperparameters, that can all be adjusted. The real work, as I said, especially for neural nets, is in the data handling: how do I read the data, how do I encode or embed the data, how do I verify the data is correct, clean, and not missing values? The effort really goes into the construction, the transformation, and some aspects of the feature selection. But once you've chosen a neural net or a deep neural net, then with Keras it's literally half a dozen to a dozen lines to make the calls, all following the same structure and syntax we've just shown.

Now, I'm not sure exactly what happened with this slide or why the values have faded out; maybe that's something we can fix in the next iteration. But just like we did with the iris one, you call predict, and that's how we measure or assess things with the test data. y_pred is the output, the predicted probabilities for each sample in the test set. We can then take the most probable class in each row, compare observed versus predicted H, B, and C, and calculate the confusion matrix. So again, fairly simple to do.

If we recall how the original Python program from module four performed yesterday, it did okay on coils, reasonably well on helices, and not so well on beta strands, and these tended to get confused: helices get confused a little with beta strands, and beta strands get confused more with coils and also with helices. When we run this Keras version, we get slightly different numbers. If you flip back and forth: 0.46 with the old Python one versus 0.43 for the Keras one; 0.70 for the coil with Keras versus 0.69 with the original Python; and 0.65 in the pure Python versus 0.66 in Keras. Overall the numbers are about the same, with a Q3 score of around 61 or 62%. So there's no improvement in accuracy, which is the same thing we saw with the Keras neural net for iris, but we do see some slight differences in the numbers. That's because of the use of ReLU, whereas before we used a sigmoidal function, and because of different initialization of the values in the weight matrices; subtle differences, the same ones we talked about that led to slightly different results in the Keras version of the iris neural net.

In terms of the total number of lines, the Keras program is about 240 lines of code; we didn't put in too many comments. The time to train and run this is quite long, whereas the R version seemed to be much faster, which is a little odd.
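Pulling together the compile, fit, and prediction calls described a moment ago, here's roughly what they look like in code; this continues the earlier sketch, so the batch size and epoch count are illustrative, and X_train, y_train, X_test, and y_test are assumed to come from the 70/30 split described above.

```python
# Illustrative sketch of compiling, training, and evaluating the classifier.
import numpy as np
from sklearn.metrics import confusion_matrix

classifier.compile(optimizer="adam",
                   loss="categorical_crossentropy",
                   metrics=["accuracy"])

# X_train, y_train, X_test, y_test: assumed to come from the 70/30 split
classifier.fit(X_train, y_train, batch_size=32, epochs=50)

# predict() returns class probabilities; take the most probable class per row
y_prob = classifier.predict(X_test)
y_pred = np.argmax(y_prob, axis=1)
y_obs = np.argmax(y_test, axis=1)

# rows = observed H/B/C, columns = predicted H/B/C
print(confusion_matrix(y_obs, y_pred))
```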
I've sometimes wondered whether we flipped the two; maybe we'll have to double-check. In any case, what we've got is a neural net for predicting secondary structure using Keras functions, and then something we wrote in R using deepnet functions. Both are trained on the same set and tested on the same set, and they give essentially the same numbers.

What I wanted to do next is follow up, because we mentioned the utility of MetaboAnalyst for helping you do biomarker identification, and I suspect a lot of you have worked through most of the questions about your own data sets. So I wanted to show you an alternative route to doing a lot of the work we've been programming, by making use of a web server that has lots of machine learning and AI functions built in, and that is MetaboAnalyst. As the name suggests, the program was originally developed to work with metabolomics data, but it's agnostic to the data type: it can handle genomic data, transcriptomic data, clinical data, any omic data. It primarily does multivariate statistics, but it also does things like pathway analysis and functional analysis, and again it's not restricted to metabolomics.

We started developing MetaboAnalyst back in 2008. Jeff Xia was the one working on it; that was his PhD project with me. He upgraded it in 2012, we upgraded it again in 2015, then he got a position at McGill University and continued developing it, with another release in 2018, and version 5 was released in 2021. So Jeff, who's at McGill, has largely taken over the development of MetaboAnalyst, but it's one I can say I helped start and launch, because it was something I really wanted to have in my lab. As you're seeing, some of the coding gets a little tedious; you're doing the same thing over and over again, so can you generalize it? Jeff's a really good programmer, and he also understood the need for good interfaces and for using the web.

These are the different analyses you can do: statistical analysis, which is the multivariate statistics, enrichment analysis, pathway analysis, power analysis, biomarker analysis, and joint pathway analysis; some of these are more specific to metabolomics. There's also meta-analysis, where you can look across multiple transcriptomic or proteomic data sets. And MetaboAnalyst has been the platform used to produce a number of other tools that Jeff's lab has created: MicrobiomeAnalyst, one for transcriptomics, and some for systems biology. They're all very similar to MetaboAnalyst, because, as I say, it was originally designed to be a data-agnostic system.

When you get into MetaboAnalyst, and you can just type MetaboAnalyst into the web and it will take you there, you can navigate through and see the different functions. Some are specifically for mass spectra, which you don't need unless you're doing metabolomics, but the other things, the statistical analysis, biomarker analysis, time series, meta-analysis, power analysis, and network analysis, are pretty much data agnostic. And there are standard steps you always do. You saw the six steps we always went through for machine learning: there's always data pre-processing, there's always scaling and normalization, then analyzing your data, and interpreting at the end. Broadly speaking, that's exactly what MetaboAnalyst does.
So it has a really simple navigation structure: home, data, tutorials, FAQs, APIs, history. You can download some of it and run it locally. It gives you example data sets, but as I say, your data just needs to be in basically a CSV file format. Here's a table of compound concentrations, but it could also be protein concentration data, or transcript data, or clinical parameters like height and weight; it's just any labelled value. The examples given are largely intended for people in the metabolomics community, but it's a hidden gem that I think more and more people in other omics fields are starting to realize.

This is an example where I've chosen a data set and downloaded it. It's a study done on cattle, in this case whether or not they were fed grain, but you could do the same thing on plants or on humans; the grouping could be a treatment, it could be outcomes, it could be any number of things where you're producing a lot of omic data. In this case there are four groups, but you could have two groups, three groups, or twenty groups. You've already seen the code we wrote to check and validate data; you could always write that code yourself, but this is on the web. You do a data integrity check: it checks the content, makes sure things are in the right rows and columns, and checks that the file is comma separated and can be read and interpreted. It determines whether there are letters, numbers, and underscores, and if there are other characters they're removed. It checks that everything is numeric, determines how many missing values there are, and tells you how the missing values will be processed. And if you want to do missing value imputation or other methods, you can ask for that. A number of you have asked, what do I do with my missing data? Well, this has all those options; you just have to get online, upload your data, and it will do it for you, with no coding.

The other thing we've talked about, especially for neural nets but also for other analyses like SVMs, is that you often need to do scaling, normalization, and transformation, and sometimes you have to write it yourself. We talked about one-hot encoding, and we talked about scaling and normalization methods; if you have to write them, that takes several dozen lines of code. MetaboAnalyst offers a whole range of normalization methods. You can normalize by median or by a reference feature, or do quantile normalization. You can do log transformations or cube-root transformations. You can do scaling: mean centering, auto scaling, range scaling, Pareto scaling. We've talked about range scaling already, about log transformations, and about normalization by median or by sum. All of these are things you can just click on once you've loaded your data. So this is about how data gets transformed, how you do the normalization, scaling, and transformation, and it puts things in context: whether it's sample concentrations, transcriptomic values, or anything else you've collected as omic data, you're getting things onto the same magnitude, as we talked about before.
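To give a feel for what some of those clicks are doing under the hood, here's a hand-rolled sketch (not MetaboAnalyst's own code) of three of the steps it offers: normalization by sample median, log transformation, and auto scaling. The table here is made up, with rows as samples and columns as features.

```python
# Hand-rolled illustration of median normalization, log transformation,
# and auto scaling on a made-up samples-by-features table.
import numpy as np
import pandas as pd

# rows = samples, columns = features (metabolites, transcripts, proteins, ...)
df = pd.DataFrame(np.random.lognormal(mean=2.0, sigma=1.0, size=(10, 5)),
                  columns=[f"feature_{i}" for i in range(5)])

# normalization by sample median: divide each sample (row) by its own median
normalized = df.div(df.median(axis=1), axis=0)

# log transformation to pull in the long right tail
logged = np.log10(normalized)

# auto scaling: each feature (column) gets mean 0 and standard deviation 1
scaled = (logged - logged.mean(axis=0)) / logged.std(axis=0)

print(scaled.describe())
```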
Transformations make things more normal; a log transformation, for example, makes the data more Gaussian. The point of all of these is to make the data parametric, and I'll show you what that actually means. So we've chosen these options; with roughly seven or eight normalization choices, three transformation choices, and five or six scaling choices, there are on the order of a hundred different combinations you can pick. You'll also notice on the right side that MetaboAnalyst actually writes out its code in R, so if you don't know R and haven't seen it before, a lot of people use MetaboAnalyst to learn R.

Once you've chosen these things, it takes your data, which is on the left side, and it transforms, scales, and normalizes it. If you look at the data on the top, it's not unusual for a lot of data, whether metabolomic, transcriptomic, proteomic, or human physiological data: there's a bunch of values sitting near zero, and then some very large values popping up. That data is not normally distributed, and it means that if you were to try machine learning, logistic regression, or anything like that on it, you'd get crazy results. By choosing the values I chose here, normalize by pooled sample, no data transformation, but auto scaling, I was able to convert totally messy non-parametric data into parametric, Gaussian data. You can see the distribution of the data looks like a bell curve, and you can see how all these features have been lifted up from the baseline so they're now in the middle, with a nice distribution in the box-and-whisker plots.

Now, this scaling and normalization isn't something you can always know in advance, so people often have to interact with it. Here's another data set we tried. We tried a normalization with some scaling, some combination I can't remember exactly, and I'm not sure if an animation has been lost from this slide, but anyway, we tried it and it failed; it's not great, this is not Gaussian. So we go back and choose other values: instead of no normalization or normalization by a reference sample, we do quantile normalization; instead of no transformation, we do a log transformation; and instead of auto scaling, we do Pareto scaling. We make those changes, first apply the normalization and scaling, and then view the result. This is what we get: on the left it's kind of bad, very skewed, but by applying these transformations we get something that looks pretty Gaussian, certainly a lot better than the first attempt. We could play around some more; there's no absolute confirmation that one choice is the best. But this is data preparation that you often have to do for any data, and MetaboAnalyst supports it.

So, as I said, let's try some biomarker analysis. We can go straight to the biomarker module once we've uploaded our data. As we've discussed, biomarker analysis is a classification problem, so we're talking about sensitivity and specificity. We saw this slide before: there's a confusion matrix between your predicted and observed classes, with true positives, true negatives, false positives, and false negatives. From those we can calculate sensitivity and specificity, and we've given those formulas before.
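As a reminder of those formulas, here's a small sketch that computes sensitivity and specificity from confusion-matrix counts; the example numbers are made up.

```python
# Sensitivity and specificity from confusion-matrix counts.
def sensitivity_specificity(tp, fp, tn, fn):
    """Sensitivity = TP / (TP + FN); specificity = TN / (TN + FP)."""
    sensitivity = tp / (tp + fn)   # fraction of true positives correctly called
    specificity = tn / (tn + fp)   # fraction of true negatives correctly called
    return sensitivity, specificity

# e.g. 40 true positives, 5 false positives, 45 true negatives, 10 false negatives
print(sensitivity_specificity(tp=40, fp=5, tn=45, fn=10))   # (0.8, 0.9)
```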
We've talked about how you can measure sensitivity and specificity together with the ROC curve, how ROC curves are used to assess biomarker predictions, and how ROC curves were developed during World War II; this is actually some of the data originally used to assess whether the radar operators were detecting geese or German planes, so it tracked their radar's sensitivity and specificity. Good ROC curves have an upside-down-L shape, a sharp sigmoidal shape; poor ROC curves are close to a straight diagonal line. The area under the curve is the way we measure the quality of a ROC curve, and we saw this before: the lower right shows a bad ROC curve, and on the left-hand side are good ROC curves. When you do biomarker analysis, you're usually doing ROC curve analysis aiming for high sensitivity and high specificity together. That means maximizing the area under the curve while minimizing the number of analytes, which could be metabolites, proteins, genes, or all three together, or a physiological measurement, or a SNP, whatever you want.

MetaboAnalyst's biomarker module has three options. One is a single marker at a time: for instance, does glucose alone predict diabetes or not. The second is multivariate, which is what most people are interested in: I've got all this omic data, how do I find my biomarkers? The third is manual: some people have really good intuition and say, it's this protein, this metabolite, and this gene; let me see how that combination works.

So this is just an example with one data set; ideally you'd upload your own, and it doesn't have to be metabolomic. It could be transcriptomic, proteomic, multi-omic or mixed-omic, or physiological data, any of those things. In this one we had serum from 90 patients, some who went on to develop preeclampsia and some who had normal pregnancies, and we want to find biomarkers for predicting early preeclampsia. I think at least one of you is working in this area. Just like before, we upload the data and do the data integrity check, which we've seen before; everything looks fine. There are five missing values detected, and these are replaced by a small value. We sort of guessed here at whether we needed to do log transformation, data scaling, and other transformations; this is the before, which is kind of messy, and the after looks really nice, nicely Gaussian. That was done just by clicking, about two seconds' worth of work.

Then we have a choice: we can do the classic univariate analysis, the multivariate analysis, or the ROC curve evaluation. We can call several different types of machine learning methods: support vector machines, PLS-DA, logistic regression. You can choose the number of latent variables and the feature selection method; there are built-in methods you can use, I think including PCA-based selection. All of these things are just click and go. This is a result where we've taken the data and asked, do we get a good ROC curve? And with as few as two variables, two metabolites in this case, but it could be two genes, two SNPs, two proteins, or a protein and a metabolite, or a gene and a SNP, it calculated the ROC curve and its confidence interval, and you can see that this data showed a really, really strong prediction.
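For anyone who wants to reproduce this kind of curve outside the web server, here's a minimal sketch of computing a ROC curve and its AUC with scikit-learn; the labels and scores here are made up, and MetaboAnalyst does all of this for you automatically.

```python
# Minimal sketch of a ROC curve and AUC with scikit-learn on made-up scores.
from sklearn.metrics import roc_curve, roc_auc_score

# 1 = case (e.g. early preeclampsia), 0 = control; scores from any classifier
y_true  = [1, 1, 1, 1, 0, 0, 0, 0]
y_score = [0.9, 0.8, 0.7, 0.4, 0.6, 0.3, 0.2, 0.1]

fpr, tpr, thresholds = roc_curve(y_true, y_score)
print("AUC =", roc_auc_score(y_true, y_score))
```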
That is, we could predict who will develop preeclampsia in the first trimester, because the area under that ROC curve, whether it's built from two, three, or four variables, is up in the high 0.98 range. In this case we chose a model with ten features and then asked it to display the confidence interval, which is calculated automatically. And if you use the logistic regression option, it will actually calculate the logistic equation for you, with the coefficients you need. With the SVM you don't get an equation, which is a problem, but with logistic regression you have an equation you can use, publish, or put into some tool.

The other thing that comes out of this is that you can pull out the significant features. Again, the features could be metabolites, genes, proteins, physiological measurements, or any combination of those; you just have to make sure they're appropriately scaled in some way. What's showing here is that glycerol is the number one discriminator; it's low in, I think, preeclampsia and high in normals, I can't remember exactly, and then 3-hydroxyisovalerate is the opposite. That tells you that with just two variables you could do really, really well, because they trend in opposite directions; they're orthogonal. This plot is the variable importance in projection, or VIP, so it lets you pull out the really significant features. It also lets you refine them and say, well, maybe if I take glycerol divided by 3-hydroxyisovalerate as a ratio, things would be even better, and they probably are. If you know some of the metabolism, you might know that glycerol and acetate are closely related, as is propylene glycol, and that the body produces these through the breakdown of fatty acids and lipids. You might recognize that 3-hydroxyisovalerate is connected to branched-chain amino acid metabolism, so it might be connected to betaine or something like that. Just from what you know about the physiology, you might group things together. But again, this is all automatic; you didn't have to do any coding, and it produces a whole range of results that you can easily use.