I'm your instructor for the final module of the Cancer Genomics Analysis Workshop. My name is Shraddha Pai. I'm a principal investigator at the Ontario Institute for Cancer Research, and I work in precision oncology, the field where the goal is to match a patient's clinical treatment to their physiological, clinical, and genomic profile. In that context, today I will be talking to you about multimodal data integration, machine learning, and developing patient classifiers.

Very briefly: we are sharing these slides under the Creative Commons Attribution-ShareAlike (CC BY-SA) license. That means if you take these slides, change them, and distribute or share them, you need to attribute them to the creators, and you need to share them under a license similar to the one used here.

Okay, so let's get started with multimodal data integration. The learning objectives for this afternoon are as follows. I will motivate the problem of multimodal data integration for precision medicine, and we will take a brief detour into machine learning, where I will explain some key concepts you need to run your own algorithms and, importantly, to evaluate algorithms run by others, say in the scientific literature. We will talk briefly about why interpretability of a machine learning model is important for genomic applications. And finally, we will go through a method that lets you do multimodal data integration and build a patient classifier, and that will be our lab. Here is a brief outline that I will step through in the lecture.

So let's start with precision medicine and the need for data integration. There is growing recognition across biomedical domains, be it autoimmune disorders, cancer, or mental illness, that diseases we have traditionally given a single label to could really be several underlying entities: they have different molecular characteristics, different demographic characteristics, and these are reflected in the variability of patients' clinical course and outcome. Here is an example from a workshop I took a few years ago on autism spectrum disorders. This is a very heterogeneous condition, characterized in the clinic by the three main behavioral characteristics shown in the red, yellow, and blue circles. But these core symptoms present alongside a host of comorbidities and other clinical conditions, to the point where clinicians would say: if you've seen one kid with autism, you've seen one kid with autism. That is an extreme example, but it serves to highlight the problem. Even in cancer, across different types, breast cancer, lung cancer, pediatric brain cancer, you've got heterogeneity in clinical presentation and prognosis. Some people respond to medication, others don't. Precision medicine is concerned with the methods used to collect and examine these data and to build the kinds of models that help you deconstruct these diseases, with different disorders at different stages of this journey of molecular deconstruction toward changing clinical treatment.
For example, in invasive breast cancer, you've got a much more methodical, biomarker-driven approach, where a patient's histopathology is taken into account in addition to hormone receptor status and other predictors of cancer aggression, and that is used to devise therapy. That's the goal of precision medicine.

From a research point of view, when you consider the system where the endpoint is your patient presenting at a clinic, you are asking which layers of this biological system, from the genetics through to the external inputs that modulate the system, contribute to disease outcome. How do you deconstruct the disease? If you've got a disease with a single label like breast cancer, how many subtypes are there, and which layers contribute to each subtype?

Much progress has been made in some areas of cancer in dissecting molecular subtypes. One notable example is medulloblastoma, a pediatric brain malignancy. Systematic molecular characterization, taking gene expression data, genetic data, and DNA methylation data, and applying the kinds of techniques your instructor from this morning, Lauren, spoke about to identify subtypes, plus patient classifiers, which I will talk about, has led to the establishment of four subgroups of medulloblastoma, with further subtypes within them. They have very particular demographic and histopathological characteristics, and the different subtypes have different associations with genetic mutations. Some subtypes don't really have any strong genetic signal and are thought to be more epigenetic in origin. A typical example is a type of pediatric brain tumor called ependymoma, where there don't seem to be many recurring somatic mutations characterizing these tumors; but if you profile DNA methylation, you start to see subgroups, and then you can chart your way toward understanding which cellular pathways are being dysregulated, what kind of dysregulation the epigenetic pattern reflects, and how you go from that to devising a molecular treatment for such patients.

This patchwork of discoveries, one disease seems to have genetic subtypes, another epigenetic, and so forth, has led to the building of consortia to pool patients and profile the molecular level at several layers. The graph on the left shows, over time, the sample sizes being collected by consortia focused on particular disease types. Each consortium picks and chooses the layers it collects data for; after clinical data, the most prominent is genetic data from blood samples and so forth.
This field of precision medicine is at a point where we're starting to develop methods to integrate multimodal data and take on completely new problems: we don't know what the subtypes are, so we discover subtypes; we don't know what predicts clinical outcome or disease aggression, so we have to build classifiers; we have to build methods that take genomic data and build classifiers, and so forth. All of this paves the way toward a vision of the doctor's clinic of the future, where machine learning models have gone through repeated validations, you've done biomarker discovery and therapy discovery, and the models are implemented in the clinic. Over time, while they're being used to classify patients, diagnose, and predict prognosis, the system feeds back into itself: you collect more data and keep refining the models. That's the vision of the field.

Data integration is already important and common for clinical risk models and clinical decision making. Here you've got models that have been in clinical use for decades, and they tend to integrate data spanning not just questionnaires on family history but physiological measures as well, and in some cases, as you're well aware, such as breast cancer, genetic tests too. So data integration is pretty routine for building clinical predictive models. Today we're going to talk about a systematic framework for taking different kinds of data and building a model.

If you're a researcher in a lab focused on a particular cancer type, where do you go to find this multimodal data? Suppose you want to try out a new hypothesis, or you have a clinical outcome of interest and you want to see how well layer A versus layer B predicts it. There are a few places out there. A big one for cancer genomics is the International Cancer Genome Consortium (ICGC) dataset, where you can browse projects for tens of cancers. There are very nice summary tables telling you which genomic layers are available for a data type; you can filter by which site generated them and so forth, and download processed, anonymized data. If you want to access patient genomic sequence, there's an access protocol, and you'll have to fill out forms to get access to those data.

Cancer genomics is also ahead of the pack in the biomedical field, because it set the precedent with consortium projects such as The Cancer Genome Atlas (TCGA), which is the better part of a decade old at this point, if not older. Those data are processed and canned to the point where you can go to an R software repository like Bioconductor and just install the package that gives you access to all the multimodal processed TCGA data (a brief sketch of this follows below). If you've got data in the lab and want to integrate it with third-party data, this is a really easy way to access it. And then, of course, the field has its mainstays: the Gene Expression Omnibus (GEO) for genomic data, and dbGaP, the Database of Genotypes and Phenotypes, mainly for genetic sequencing data.
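To make the Bioconductor route concrete, here is a minimal sketch of pulling processed TCGA data with the curatedTCGAData package (the package we use in today's lab). The argument names follow the package vignette as I recall it; the interface has changed across releases (newer versions also take a `version` argument), so check `?curatedTCGAData` in your installed version.

```r
## Minimal sketch: processed, multimodal TCGA data from Bioconductor.
## Argument names may differ by package version; see ?curatedTCGAData.
if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install("curatedTCGAData")

library(curatedTCGAData)
library(MultiAssayExperiment)

# List the assays available for breast cancer without downloading anything
curatedTCGAData(diseaseCode = "BRCA", assays = "*", dry.run = TRUE)

# Download one expression assay as a MultiAssayExperiment object
brca <- curatedTCGAData(diseaseCode = "BRCA",
                        assays = "RNASeq2GeneNorm",
                        dry.run = FALSE)
```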
So that's where you would go, in addition to reading the latest papers and trying to form collaborations with research groups on a project.

Very briefly, what tools are out there? Now you've got your multimodal patient data and your outcome of interest defined, either through your research question or through a class-discovery approach, the clustering approach of the kind Lauren talked about this morning. What tools do you use to integrate these data and classify patients? Before I even talk about the ones on this slide, there's a standard approach: you could just take the variables, concatenate them, and use something like regularized regression. In a standard regression, all of your explanatory variables are assigned some weight at the end of the fit. In a regularized regression, you add a penalty term that forces most of your weights to be zero or very small, which forces the algorithm to prioritize a few variables to carry positive weights. You can do concatenation-based things like that (a small sketch follows below). But as Lauren discussed this morning, there is an advantage to methods that let each type of data retain its own internal structure before merging the data sources together, as opposed to concatenating them and losing that structure. That is the idea behind the similarity network fusion method she talked about this morning: each data type has its own view.

Both of the methods on this slide are in that vein: they don't involve concatenating all the variables; rather, they let you form a view of each data type by itself before you integrate across layers. The first is called netDx, and I will go into detail on it later in this talk. Very briefly, it works in a vein similar to similarity network fusion, and you folks are old hats at this by now: you have all these multimodal data, you send these tables to the software, and it converts each of them into a view, a patient similarity network. Where netDx differs is that it's not a clustering algorithm; it's a classifier algorithm, which means you need to give it labels to classify: responds to medication versus does not respond, or tumor types A, B, C, or D. What it does is identify features that are predictive of your labels, and once you've identified which features predict your labels, you can take the top-scoring features and classify held-out samples using this paradigm. I will go into more detail later, so we'll have time to go over this.

The other method, which I'm not going to talk about beyond this slide, is called mixOmics. It's actually a suite of genomic analysis tools, within which there is a method called DIABLO; it uses a partial least squares approach to condense each data layer before combining them all using latent variable analysis. This one has the advantage of identifying individual variables that are of interest, that are predictive of outcome.
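Here is the concatenation baseline sketch promised above: stack all variables from all layers into one matrix and fit a lasso-penalized logistic regression with the glmnet package, which pushes most coefficients to exactly zero. The data here are simulated purely for illustration.

```r
## Toy sketch of the concatenation + regularized regression baseline.
library(glmnet)

set.seed(42)
n <- 100                                  # patients
x_clin <- matrix(rnorm(n * 5), n, 5)      # e.g. 5 clinical variables
x_expr <- matrix(rnorm(n * 200), n, 200)  # e.g. 200 gene expression variables
x <- cbind(x_clin, x_expr)                # naive concatenation across layers
y <- rbinom(n, 1, 0.5)                    # binary label, e.g. responder vs not

# Cross-validated lasso (alpha = 1 is the pure L1 penalty)
fit <- cv.glmnet(x, y, family = "binomial", alpha = 1)

# Most coefficients are zero; the survivors are the prioritized variables
coefs <- coef(fit, s = "lambda.min")
sum(coefs != 0)
```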
So DIABLO is another one to try, and I would say if you have a project where you're trying to do patient classification from multimodal data, just try both, because they're going to give you complementary views and might feed into each other as well. And very briefly, both of these software packages are available in the R Bioconductor software repository.

So again, this should be very familiar, but let's repeat it for reinforcement. netDx is based on the idea of patient similarity networks. It takes data such as clinical data and, using a defined measure of similarity, converts the data into a network where the nodes are patients and the edges are weighted by how similar two patients are for that attribute. For example, with the clinical data here, you can see that the brown circles are strongly connected and the yellow circles are strongly connected, but the edges in between are weak. This is a network that does well at separating your two groups of interest. You can do the same thing with every layer of data, and when you do it with genomic data, say gene expression, the separation in this network is not as strong. So different networks separate patient classes to different extents, and the netDx framework formalizes and quantifies this, ranking networks by which ones separate this group best and which separate that group best; it extends to three or more groups as well.

And what do people mean by similarity? We have some go-to measures, like pairwise Pearson correlation: if you have two patient samples and gene expression data for both, you can take the Pearson correlation of those two vectors, get a number like 0.8, and that becomes your edge weight. Another metric is normalized difference, which you would use for something like similarity of ages: the closer two people are in age, the more similar they are, normalized by the range of ages in your cohort. And as with SNF this morning, there are other distance-based measures: you can take Euclidean distance and apply a scaling factor to accentuate stronger connections and down-weight weaker ones. So similarity can be defined by the user based on domain knowledge and mathematical robustness; nothing in the algorithm stops you from picking a particular similarity metric (a small sketch of two of these measures follows below).

So the way netDx works is that it uses similarity to predict outcome. If you have multimodal data, clinical data, genetic data, brain imaging data, physiological measures, then just as Netflix would say "find me movies similar to those I enjoy" using a recommender system, you say to netDx: I have a pattern that reflects treatment non-responders, and I want you to rank all the patients by how similar they are to the treatment non-responders. So if patient X is ranked by similarity to non-responders and ranked by similarity to responders, and he's closer to the non-responders, you say: chances are this is a non-responder, and I'm going to predict them as such.
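Here is the sketch of the two similarity measures promised above, in base R with made-up values. The normalized-difference formula shown is one common formulation (one minus the absolute difference divided by the cohort range); check the netDx documentation for its exact definition.

```r
## Two similarity measures, illustrated on toy data.

# Pearson-based similarity between two patients' expression profiles:
# the correlation of their gene expression vectors becomes the edge weight.
patientA <- rnorm(50)                     # toy expression vector, 50 genes
patientB <- patientA + rnorm(50, sd = 0.5)
edge_weight <- cor(patientA, patientB, method = "pearson")

# Normalized difference for a single variable like age: the closer two
# patients are, relative to the cohort's range, the higher the similarity.
norm_diff_sim <- function(a, b, all_values) {
    1 - abs(a - b) / (max(all_values) - min(all_values))
}
ages <- c(34, 41, 55, 62, 70)
norm_diff_sim(ages[1], ages[2], ages)     # close in age -> similarity near 1
```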
Okay, so it's a recommender-system-based model. The algorithm we're going into now, netDx, is general purpose, and it's what you're going to be using for your lab. General purpose means that as long as you have defined labels for your application, say you ran SNF and it told you there are five clusters in your multimodal data, then you can use those five labels as input to the classifier and use netDx to find which features are predictive of groups one through five. Of course, I'll say this with a big caveat: you want to keep your discovery set and your validation set apart. Otherwise your algorithm will do really well telling the groups apart in this particular set, because you found the five clusters in this particular set, and it may not generalize to others. So always keep in the back of your mind that just because you've got the dataset and can run these methods doesn't mean you're not vulnerable to your model doing well for artificial reasons; you pay for it down the line when your findings don't replicate.

The netDx method uses machine learning, which is a type of artificial intelligence, and at this point we'll briefly digress to talk about key concepts in machine learning. As I've said before, even if you don't use it in your day-to-day work, you will increasingly encounter scientific articles where machine learning is used, and it's good to know some basic concepts so you can critically evaluate the methods. We've all heard the buzz about machine learning. It's being used all over the place, in several everyday applications, from fraud detection in banks to face detection: Facebook says, "Is that your mom?" Yes, it is. How did you know? There's a classifier working underneath there. Traffic prediction algorithms tell you how much traffic to expect at a given time of day when you're heading from A to B. So, in a nutshell, what is machine learning? Machine learning is a class of computational algorithms that learn generalizable patterns from known, historical data and use this information to make predictions or choices for new data the algorithm has not seen before.

Broadly speaking, there are three classes of machine learning. The first is unsupervised learning; you might recognize this as the SNF schematic. In unsupervised learning, you give the model the data but you don't give it labels. You don't say this is a Luminal A tumor, this is a basal tumor, this is a responder, this is a non-responder. It's usually used to find structure in the data: you ask the algorithm to tell you, say, how many connected subnetworks are in this overall network. Clustering and similarity network fusion are examples. Supervised learning is the second type. It's called supervised because you tell the algorithm what the labels are; you want it to learn the patterns associated with each label and to predict the outcome. So if I have four types of breast tumor, I'm going to tell the algorithm I have four types of tumors, and here are all the genomic data associated with them.
The algorithm then learns to classify new, unknown samples based on the signature it found for each class. Examples of supervised learning algorithms, and you may have heard of these, are support vector machines, logistic regression, random forests, deep learning, and netDx as well. These are the classifier algorithms, but you must have labels for the classification. The third type I include just for completeness; we're not going to talk about it at all today. This is reinforcement learning, a system where the AI agent keeps taking actions that change the environment and monitors how the state of the environment changes. There are applications of reinforcement learning in clinical decision-making systems, but we won't cover it today.

Machine learning in medicine is still very much a new area. It is not used as much as it will be in the coming years, but for your information, there are already machine learning algorithms approved by authorities like the FDA. There's a website, The Medical Futurist, that seems to track all of these FDA-approved AI algorithms; you can find the full list, and if you look for cancer, these three pop up. They have to do with picking out suspicious lesions from radiology images; at this point they actually all seem to be related to medical imaging. So it's coming, but from the perspective of the field there are a lot of hurdles to overcome: method development, interpretability, which I'm going to talk about, and even establishing the kind of interdisciplinary collaborations you need to build a database that can serve as good input for these machine learning models. The Cancer Genome Atlas project was profiling from a basic research standpoint, with very limited phenotype collection that you can use for these machine learning algorithms. So new datasets need to be built now, defined around clinically formulated problems, and those kinds of partnerships between the clinical side, the genomic side, and the machine learning side are going to need to be built, and are starting to be built.

So, back to machine learning and the classifier workflow. At its heart, this is what happens in a classifier workflow. You start with a dataset and partition it into a training set and a test set. The training set is the one on which your model learns its signature, and the test set is the one on which you evaluate how well the model discriminates between samples. There are other terms for this test set: blind test, held-out set, and so forth. An important part of training a model is cross-validation. In cross-validation, you take your entire training set and partition it again. You learn some parameters from, in this example, four of the folds, then evaluate how well the model did on the fifth fold, make some changes, and repeat the process over and over, holding out different partitions of the data on different iterations. The algorithm then comes to a consensus, eventually learning the final weights that become the weights of your model (a small sketch of this loop follows below).
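Here is the cross-validation sketch promised above: a minimal base-R 5-fold loop on simulated data, repeatedly fitting on four folds and evaluating on the held-out fifth.

```r
## Minimal 5-fold cross-validation loop on toy data.
set.seed(1)
n <- 100
dat <- data.frame(x = rnorm(n))
dat$y <- rbinom(n, 1, plogis(dat$x))      # toy binary outcome

k <- 5
fold <- sample(rep(1:k, length.out = n))  # assign each sample to a fold

fold_acc <- sapply(1:k, function(i) {
    train <- dat[fold != i, ]             # four folds for fitting
    test  <- dat[fold == i, ]             # one held-out fold
    fit   <- glm(y ~ x, data = train, family = binomial)
    pred  <- ifelse(predict(fit, test, type = "response") > 0.5, 1, 0)
    mean(pred == test$y)                  # accuracy on the held-out fold
})
mean(fold_acc)                            # consensus estimate across folds
```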
Cross-validation is important for improving generalizability to the test set. If you did not have cross-validation, you might learn something very specific to the training set, without any give to generalize, and the model might fare worse on the test set.

The other concept for classifiers is features and feature design. A feature is any unit that goes into your machine learning method. It could be an individual variable such as age or tumor stage, or a single gene, or a mutation in a particular gene as opposed to that gene's expression. But it can be more complex than that, as we will shortly see. An important element of genomic model design is feature design: how do you select which features you're going to train the model on? We know genomics can generate hundreds of thousands of data points per sample, and if you throw all those variables in, you could have a model so swamped with variables, for a very small sample size, that it never generalizes well. So you use feature design to pick and transform your variables before you put them into the model. Then there is a step called feature selection, where you score the features based on how predictive they are of patient outcome, or whatever it is you're trying to predict. You use the cross-validation framework to arrive at final feature scores, which let you select what your final model will be, and then you apply that final model to your held-out test set. So: features, feature design, feature selection.

An important point in machine learning is overfitting. Overfitting is when you train a model to fit not just the signal in the data but also the variations unique to that particular dataset, which do not generalize to other datasets. When you have a dataset that's really small and you get really good performance, you want to make sure that whatever resampling your model does to get its weights is done as effectively as possible, to mitigate overfitting. A signature of overfitting is when your algorithm does really well on the training data and then does badly on the test data. Be aware of this even when you're looking at models in the literature.

So now you've got your model: how do you evaluate how well it's doing? There are a number of measures for this. Accuracy is the most obvious, how correctly it labels your test samples, but there are other facets of the model you need to consider to understand what to fix to improve it. At its heart, you give the model all the data and the model labels the data as something: treatment non-responder, poor prognosis, whatever your label of interest is. When a model labels a patient with the label of interest, that's called a positive, and anybody not so labeled is a negative. So you've got true positives, where the model was correct in labeling the patient as such, and false positives, where the labeling was erroneous: this was not a non-responder, this was a responder. And similarly, you've got true negatives and false negatives (a small worked example follows below).
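To make those four counts concrete, here is a toy set of true labels and model calls, the confusion table they produce, and the derived metrics that come up next.

```r
## Toy example: confusion counts and derived metrics.
truth <- c(1, 1, 1, 1, 0, 0, 0, 0, 0, 0)  # 1 = non-responder (positive class)
pred  <- c(1, 1, 0, 0, 1, 0, 0, 0, 0, 0)  # the model's calls

table(Predicted = pred, Truth = truth)    # the confusion matrix

TP <- sum(pred == 1 & truth == 1)         # correct positive calls
FP <- sum(pred == 1 & truth == 0)         # erroneous positive calls
TN <- sum(pred == 0 & truth == 0)
FN <- sum(pred == 0 & truth == 1)         # positives the model missed

precision <- TP / (TP + FP)  # of everything called positive, how much was?
recall    <- TP / (TP + FN)  # of the real positives, how many did we net?
accuracy  <- (TP + TN) / (TP + FP + TN + FN)
c(precision = precision, recall = recall, accuracy = accuracy)
```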
So those are your four numbers: true positives, false positives, true negatives, false negatives. From these you get a collection of measures that let you evaluate how well the model is doing. For example, you've got precision, which tells you, of all the items you called positive, how many were really positive; and you've got recall, which says what fraction of the true positives you netted. Maybe you're doing really well on precision, but recall is really bad because you're only catching 5% of the patients. There are similar measures, such as sensitivity, specificity, and the false positive rate, and these overlap a bit. Then you've got a measure that's very popular in the literature: the area under the receiver operating characteristic (ROC) curve, which is basically a comparison of the true positive rate versus the false positive rate. And there's the area under the precision-recall curve, which compares precision to recall.

The bottom line is: look at multiple metrics. Do not take a model's performance at face value by looking at just the AUC or just the precision-recall, because it depends on your particular application. For example, and we're going to see this in one of the labs: if you had a dataset in which 99% of your samples were label A and 1% were label B, your model can get 99% accuracy simply by calling everything type A, but you have missed all of your type B. This is the kind of problem that arises when you have unbalanced classes, which happens a lot in risk models, where the chance of a bad event happening might be very rare compared to it not happening. Your machine learning method and your choice of model evaluation need to take these characteristics of your dataset into account. If you want to look up the definitions of all these measures, I would definitely recommend the Wikipedia pages on them; the way they've laid out all of these formulas is really well done. So even if you're writing grants for machine learning, don't just say you're going to look at the AUC; consider all the other measures too: precision-recall, F1, et cetera.

Another very useful visualization for evaluating model performance is the confusion matrix. The confusion matrix is just a square matrix that lays out all the labels on both the rows and the columns, and shows you each combination of true label versus predicted label. If this algorithm were telling the difference between three species of some kind of plant, and if it were doing perfectly, the diagonal would be one and the off-diagonals would be zero, because whenever it sees a versicolor it calls it a versicolor. But you can see that's not the case here: it seems to be doing perfectly on two of the classes, but it's confusing the third class in a very specific way, with virginica. If you had used just the summary measures, you might not have picked up on this signature of the model's confusion. So when you see something like this, you need to ask yourself: why is the model confused?
What are the shared characteristics? Are they relevant to my design? Is this a true biological reflection, or is there a technical reason for it? You can use this kind of visualization to learn more about what your model is doing and how to fix the problem.

Because we're focusing on multimodal data integration and have only had time to go over the very basics of machine learning, I'll refer you to some high-quality online courses if you're interested in learning more about machine learning foundations. There's the Stanford course on Coursera, which is a bottom-up course: it starts from the mathematics and works its way up, it's very popular, and it uses MATLAB. Then there's another called fast.ai, which takes a top-down, very pragmatic attitude; it uses Python and Google Colab, works you through examples, and has a great library for visualizations. Take a look if you want to learn more. Also, I added the CBW machine learning course on Slack. (One of the organizers notes that this year that course ran all its programs in both R and Python, and students picked whichever one they wanted to work with. That's fantastic, good to know.)

All right. Now that we're familiar with some of the terminology around machine learning, let's go back to our patient classifier algorithm that does the multimodal data integration, and talk about the workflow. I showed you this image before. This algorithm, netDx, is a patient classifier, so you need the labels. You start with your patient data, and using user-specified rules you convert the data into patient similarity networks. Then you run the feature selection step, which lets you pick the top-scoring features, which you then use to create a single patient similarity network. When a new patient comes in, you build a sort of mega patient similarity network with the training data and the test data, and you see which class the patient is more similar to.

In netDx, each of these networks is considered a feature. So in this particular example, you've got the clinical data, and you can use something like the Pearson correlation measure or the Euclidean distance measure to convert all of the clinical data into one patient similarity network; all of this gene expression data becomes one gene expression network; and so forth. netDx can handle missing data, so if you've got mutation data with a lot of missingness, you can still use it to create your patient similarity network. Each of these networks is a feature, and this is what netDx assigns scores to.

So how it works: there's your classic workflow. You've got your training set and your test set. As I mentioned, you take your training data, use feature design to create your features, and convert them to similarity networks; you have to tell netDx what you mean by similarity. Then feature scoring works as follows: netDx takes a subsample of the training data and makes a patient similarity network.
Then it uses a kind of regression that works on networks. You know how a regular regression works on the different variables in the model; think of this regression as one where the networks are the variables. And it's a regularized regression, which means the algorithm is forced to set networks to zero unless they're important for separating patients of different classes. At the end, some networks have a weight of zero, because the regression did away with them, and others have a positive weight. Then netDx says: for this particular subsample of training data, these networks were important, so I'm going to boost their score by one. It repeats this with another subsample of training data, and whichever networks come up positive in the regularization get boosted by one, and so forth. You do this K times, where K is a number you specify; you might say, I want you to score all features from zero to 10, and then it does this process 10 times on different subsamples. At the end of the day, each network feature has a score from zero to 10, where 10 means that no matter which of the 10 subsamples of training data was taken, the network got a positive weight, and zero means it never did. Then you can say: all my features now have a score from zero to 10, so I'm going to apply a threshold, say T = 8; anything that scores eight or more in feature selection is a top-scoring feature.

Now this is where the test samples come in. I take the training and the test samples and make that mega patient similarity network. Then I use a technique called label propagation, which is basically like that Netflix recommender example I gave you: I know these are all my non-responder patients in the network, and I'm going to walk from them along the network topology to my test patients. Accordingly, if a test patient is closer to all the non-responders, they get a higher similarity score, and if they're farther away, they get a lower one. In this figure, F is more similar to the red nodes than B is, and C and G are not similar to them at all; they're in a different subnetwork. In this manner, your test patients get a similarity score for each of your labels of interest. If your labels of interest were tumor type A, tumor type B, and tumor type C, which you discovered using, say, SNF, then each test patient gets a similarity score for A, one for B, and one for C, and the algorithm says: whichever one you have the highest similarity score for, I'm going to assign you that label. That's how the test patients are labeled. Label propagation is a network-based technique; you can think of it as a heat diffusion model, where your labels of interest are the hottest nodes and the heat diffuses outward across the network edges, so the closer patients are to the hot nodes, the higher their temperature, so to speak (a toy sketch of this idea follows below).

So that's how netDx works in a nutshell. In practice, especially when you have a smaller dataset, we recommend doing this over and over again, to get a sense of how stable your accuracy measure is.
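Before moving on, here is the toy label-propagation sketch promised above. This is a generic version of the heat-diffusion idea on a random similarity matrix, not netDx's actual implementation: known patients start with their labels clamped, and the scores diffuse across weighted edges to the unlabeled test patients.

```r
## Toy label propagation over a patient similarity network.
set.seed(7)
n <- 8
W <- matrix(runif(n * n), n, n)
W <- (W + t(W)) / 2                      # symmetric similarity (edge weights)
diag(W) <- 0

labels <- c(1, 1, 1, 0, 0, NA, NA, NA)   # 3 known positives, 2 known
                                         # negatives, 3 test patients
f <- ifelse(is.na(labels), 0.5, labels)  # initialize unknowns at neutral 0.5

P <- W / rowSums(W)                      # row-normalize into diffusion weights
for (iter in 1:50) {
    f <- P %*% f                         # diffuse scores along edges
    f[!is.na(labels)] <- labels[!is.na(labels)]  # clamp the known "hot" nodes
}
round(f[is.na(labels)], 3)  # propagated similarity scores for test patients
```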
So if you split your training data 80% for training and 20% for testing, you want to make sure that the performance measure you got out of that one round of training and testing would have very limited variance if you did the exercise again and again. That gives you a sense of whether your model is robust or whether it has a lot of variability across the samples you've been training on. The long and short of it: this train/test scoring and classification, we do again and again. And remember, I told you that every time you do this you get feature scores; so if you do this 100 times, you get 100 sets of feature scores. From this you can ask: what are my consistently top-scoring features? For example, which features score 9 out of 10 or higher in 70% of the splits or more; that consistency threshold is a value the user can specify. Similarly, you get a performance measure for each of the splits, which lets you look at the variability. So in the big picture: you train this model, you run it over a bunch of splits, you see what your consistently top-scoring features and performances are, and based on that you make a judgment as to whether you're going to select those features, take a completely held-out validation set, and run the model on that. That's the overview of how this algorithm works.

A bit about the benchmarking for netDx. For this, we used data from a pan-cancer survival prediction project, and the benchmark included data for four different cancer types. These sample sizes are very small by traditional machine learning standards, but that's what we have to work with in genomics in some cases, so you've got to play with the other variables, training and test resampling, repeated feature selection over different samples, looking at consistency, and so forth; just an aside. The benchmark was a binary classification problem: classify each patient as good or poor prognosis. It had five types of genomic data, RNA, DNA methylation, proteomic, microRNA, and somatic copy number aberrations, plus clinical data, and you pick different model designs: clinical only, clinical plus RNA, clinical plus DNA methylation, and so forth. That was how the benchmark was set up. You train the model with feature selection, see how well it does for one design, and repeat for the other designs in the benchmark. (Someone asked what the proteomic data was: I think it was reverse-phase protein arrays, but I'm not certain; it's essentially presence or absence of certain proteins.)

And this was the performance of netDx compared to other machine learning algorithms. Each panel shows data for a particular cancer type; the pink box plot is netDx, and each of the other box plots is another machine learning algorithm. Each data point is the average performance for a particular design type, so each box plot is showing several model designs, I forget how many, maybe tens of designs per box plot. netDx consistently outperforms most of the algorithms most of the time.
There are some instances, like this lung cancer dataset with 77 patients, where a support vector machine found a highly non-linear separation boundary and did really well, and netDx couldn't compete with that. But it goes to show: if I had a patient classification problem, I would encourage you to set up a basic machine learning pipeline, such as scikit-learn in Python or caret in R, in addition to methods like netDx and DIABLO, so you can see how consistently well the different models are doing. I wouldn't just go to one particular machine learning tool and run it, because no one machine learning tool is guaranteed to do well in all scenarios, and your choice of method also depends on your particular classification problem; some models can only handle categorical outcomes, some can do continuous outcomes, and so forth. So that's the performance piece.

And now a brief plug for interpretability. For image classification, where there are no stakes around privacy invasion or anything like that, you might say: I don't care how the model does it. It's a black box, but it's doing really well; I'm happy with just the performance, and I don't need to know which features make it work so well. But that might not be the case for genomic classifiers. (Sorry, I'm getting a little distracted by the Slack channel; I'll minimize it. Ping me there and I'll answer questions at the end of the talk.) In genomics, as compared to image classification or self-driving-car data, we've got very small sample sizes, on the order of a few hundred, and tens of thousands to hundreds of thousands of measures in any particular layer. Take a 450K Illumina DNA methylation array: that's nearly half a million data points per sample. You're not just going to throw that into your model and say "it's doing well, I don't know why," if you're going to use that model to make judgments about biomarkers, follow it up with drug discovery, or ultimately make decisions about a patient's clinical care. If you have a model that's interpretable, you will do all of that with more confidence.

Suppose three models are doing equally well. One is a black box. The second shows you features that are completely unrelated to any prior knowledge of the disease. And the third is consistent with prior knowledge: it's pulling up the genes we know about, it's pulling up the pathways we know about, but it's also got some novel features that haven't been seen before. Then, for rational drug design, the third is the better choice, because it's consistent with prior knowledge and yet introduces some novelty. So there are research benefits to interpretability, and it increases confidence for high-risk decision making. For this reason, we argued in a review paper that genomic classifiers need to be interpretable, and we built a feature into netDx that allows some level of interpretability.
How do you do that? The way we demonstrated it in the netDx methods paper is to take something like gene expression data and, instead of creating a single patient similarity network out of the whole matrix, create one patient similarity network for every pathway of interest. Remember that so far, with SNF and netDx, we've only talked about taking a particular data layer and making one patient similarity network out of it. Now we're saying you can start with a data layer like gene expression and break it up into networks that reflect pathway-level patient similarity.

We used this in an application for breast cancer subtype classification: a simple binary classification, classify a breast tumor as Luminal A versus other subtypes. Luminal A tumors have very good prognosis, so perhaps you wouldn't monitor such a patient as intensively or as frequently as you would a patient with poor prognosis; it has some clinical relevance, and maybe you have drugs targeting one particular tumor type versus another. The way this works is you use only gene expression data and create one patient similarity network per subset of genes that makes up a pathway, getting your pathways from curated pathway databases such as Reactome, Panther, NCI, et cetera. You go through the netDx workflow, score your pathways from zero to 10, and classify your test patients. We did this for 100 train/test splits. This is a ROC curve, a receiver operating characteristic curve, and it did really well: the average performance was near perfect. The precision-recall curve is also really good: the area under it is 0.92.

But where this strategy really shines is in examining the scores of the features you designed. Now you can say: okay, netDx, I have run this for 100 train/test splits; what are the pathway features that are consistently predictive of tumor subtype? That is what this visualization shows. I believe you covered enrichment maps with Veronique yesterday; this is an enrichment map visualization of the pathway features that went into netDx. Each node is a pathway feature that went into the machine learning model, and the color of the node shows its maximum consistent score, where "consistent" is defined as the highest score attained in at least 70% of the train/test splits. The nodes colored for 10 scored 10 out of 10, and the less red colors got lower scores. Using tools like EnrichmentMap and AutoAnnotate, you can ask what themes are captured by the most predictive pathways, and they reflect known gene dysregulation in breast tumors: cell cycle related pathways, DNA damage repair related pathways, and so forth. Then you've got some other pathways that maybe don't seem directly related to the condition, and you can explore those further. This lets you take your gene expression data, build a patient classifier, and identify cellular processes that don't just correlate with outcome but are predictive of outcome in a machine learning framework. netDx provides native support for pathway-level features and allows you to fetch pathway definitions (a small sketch of the underlying idea follows below).
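Here is the sketch promised above: the pathway-level feature design idea in plain base R, not netDx's own API. Instead of one PSN from the whole expression matrix, you build one Pearson-based PSN per pathway by subsetting the matrix to that pathway's member genes. The gene and pathway names here are made up for illustration; in the lab, netDx fetches real pathway definitions for you.

```r
## Toy sketch of pathway-level feature design: one PSN per pathway.
set.seed(3)
genes <- paste0("gene", 1:100)
expr  <- matrix(rnorm(100 * 20), nrow = 100,
                dimnames = list(genes, paste0("patient", 1:20)))

# Hypothetical pathway definitions (gene sets); in practice these would
# come from curated databases such as Reactome or Panther.
pathways <- list(
    cell_cycle = paste0("gene", 1:15),
    dna_repair = paste0("gene", 16:40)
)

# One patient-by-patient Pearson similarity network per pathway
psn_list <- lapply(pathways, function(gset) {
    cor(expr[gset, ], method = "pearson")  # correlate patient columns
})
str(psn_list$cell_cycle)  # a 20 x 20 similarity matrix = one network feature
```

Each matrix in `psn_list` then becomes one scoreable feature in the classifier, which is what makes the resulting model interpretable at the pathway level.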
We're going to do that today in the lab. But it's a very generalizable framework, so it goes as far as your imagination does. If you have something like DNA methylation data, and you had a strategy for grouping methylated and unmethylated cytosines based on gene regulation in your tissue or disease of interest, you could group CpGs by regulatory elements, or by co-regulation of the same set of genes, or something like that. And if you have leads in the lab that you could pursue with your experimental tools, an analysis like this might tell you to prioritize, say, the DNA damage repair pathway, because it is predictive of outcome and may have mechanistic significance, as opposed to two other pathways for which you also have tools. So it feeds into hypothesis generation and goes beyond just "I can tell these patients apart."

So, this is what we've covered today. We've gone through precision medicine and the need for data integration. We've talked about where to find multimodal datasets: the cancer genomics consortium projects, curated TCGA data, dbGaP, and so forth. We've briefly touched on patient classifiers that integrate multimodal data. We took a brief detour through key machine learning concepts. Then we did a deep dive into netDx, a patient classifier that uses patient similarity networks to integrate multimodal data. We've talked about the need for interpretability, and finally we used that to talk about building pathway-level features into our machine learning model.

After the break, we'll go through the lab exercises, which involve using the netDx patient classifier in two exercises. In the first, we build a three-way tumor classifier using TCGA data brought in from the R package curatedTCGAData, integrating four genomic data types: gene expression, methylation, microRNA, and proteomic measures. I think we're going to classify breast tumors: Luminal A versus Luminal B versus basal. That's a very simple design, one data layer makes one PSN, and you can go through the workflow. In the second exercise, we build a binary classifier, working only with clinical data and gene expression data. We'll use pathway-level features from the gene expression data, and I'll show you how to give netDx custom similarity metrics for the clinical data; we're going to use normalized difference to quantify similarity for individual clinical variables. Those will be our two labs. We're going to use RStudio to do this, specifically RStudio Server, not the RStudio on your laptop the way you had done this morning. And with that, I will wrap up my lecture. I'm happy to take any questions.