Let's move on to the second part of this module, which will be devoted to pharmacogenomics. I will introduce the topic relatively briefly and give you a bit of history. Then, as I mentioned before, we will cover discrimination and classification procedures, and we will continue with an illustration of the application of these techniques in drug response studies. So, to remind you what biomarkers and therapeutic targets are, in case you forgot: a biomarker is a biological molecule that can be used to predict how well a patient will respond to treatment for a disease or condition. It can also be called a molecular marker or a signature molecule. A therapeutic target is again a biological molecule, an enzyme, receptor, or other protein, that can be modified or hit by, say, an antibody or an antisense molecule. That is essentially what a therapeutic target is. The area of biomarkers probably originated back in the 1950s, when pharmacogenetics and pharmacogenomics emerged. These fields study inherited variation in drug response in patient populations. This variation in drug response can be vast: it spans the whole spectrum of responses, from adverse drug reactions on one side to lack of therapeutic efficacy on the other. The earliest experimentally validated examples were published back in the 1950s and 60s, when researchers noticed large differences in response to standard drug doses. In this table, you see a couple of early examples. One was a short-acting muscle-relaxant drug that caused prolonged muscle paralysis in some patients. When researchers started to look into this problem, they found genetic variation in the gene BCHE, which is responsible for enzymatic hydrolysis of the drug.
So certain patients produced a dysfunctional form of the enzyme that was not able to hydrolyze the drug. Another example is an anti-tuberculosis drug given to patients: it was noticed that the plasma concentration differed vastly between patients, and this correlated with different risks for adverse effects of the drug. Again, studies demonstrated genetic variation in the population within the gene NAT2, which performs enzymatic acetylation of this drug. And this is probably one of the notorious examples of that era, the CYP2D6 story. CYP2D6 is a gene of the cytochrome P450 family of microsomal drug-metabolizing enzymes. It catalyzes the biotransformation of scores of drugs, including antidepressants and anti-arrhythmic drugs, and it even activates the prodrug codeine into morphine. Research clearly demonstrated that the variation in response to these drugs can be explained by genetic variation in CYP2D6. In some patients it was a non-synonymous coding SNP associated with decreased activity; in some patients it was a gene deletion; in some patients it was a gene duplication, up to 13 copies. This chart shows different populations of patients, and here you see the ratio of the drug to its metabolite. You clearly see three parts of this trimodal distribution, with this group being the poor metabolizers. The majority of patients are extensive metabolizers, and on the left are the ultrarapid metabolizers. You can imagine that poor metabolizers would have an excessive drug effect when treated with drugs that are supposed to be inactivated by CYP2D6, and at the same time codeine would be ineffective in these patients. Similarly, the ultrarapid metabolizers would achieve an inadequate therapeutic response to drugs that are inactivated by this gene.
And at the same time, they can be highly sensitive to drugs like codeine, as was published for one patient who went into respiratory arrest after a standard dose of a cough suppressant. So it very soon became clear that any given variation in response cannot be explained by just one factor, one gene. It is clear that most diseases are polygenic traits, and multiple factors affect drug response, including both pharmacokinetic and pharmacodynamic factors. Take, for instance, the warfarin story. Warfarin is the most widely prescribed oral anticoagulant, and it can have very serious side effects, such as hemorrhage and undesired coagulation. Warfarin is predominantly metabolized by CYP2C9, again from the cytochrome P450 family. Researchers found two common polymorphisms associated with decreased activity of this gene, down to 12% and 5% of wild-type activity. However, the frequency of these polymorphisms was only about 10%, so it was not possible to explain most of the variance in response by this pharmacokinetic genetic variation. A question from the audience: are those polymorphisms in the heterozygous state or in the homozygous state? And regarding the frequency of the polymorphisms, or at least the frequency of the metabolizing phenotypes, I find it difficult to interpret the numbers as presented: is it 12% activity of the resulting protein, and then 5%? If you are a heterozygote for the polymorphism with decreased activity, does that mean that the overall activity of the normal and variant enzymes together ends up being 12%, or is that the activity of the variant enzyme alone? To answer: these are two distinct polymorphisms, I think, in the case of this gene. Whether they were measured in the homozygous or heterozygous state, I do not remember, I'm sorry.
So these two polymorphisms resulted in a reduction of the functional activity of the protein, and those polymorphisms had a frequency of about 10%, both of them, while the variation in drug response was greater than that. So these variants could not explain all of the variation in drug response; that is the message I was trying to make here. And those were polymorphisms in pharmacokinetic factors, the metabolizers of the drug. Then people found the gene VKORC1, which is the target of this drug. This is now a pharmacodynamic factor, and a series of haplotypes were associated with the final dose of warfarin. So this picture shows, in summary, that there are multiple factors that can contribute to the response to any given therapy. And certainly cancer is a polygenic disease. Aberrations can take place on pretty much every level, starting from genomic aberrations such as breakpoints, inversions, deletions, et cetera; mutations that affect the functional capacity of proteins and/or their binding properties; changes on the transcriptional level; splicing changes; epigenetic changes; and even changes on the protein level itself. All of these types of aberrations can be used as biomarkers. For instance, this is an example of an amplicon structure in breast cancer. This was a big amplicon detected in breast cancer, and shotgun sequencing of that particular region revealed a very complex amplicon structure, with multiple distinct parts of the human genome concatenated to each other. These genomic breakpoints can themselves serve as a biomarker. And as you've seen, the expression profiles and expression signatures of tumor subtypes can also be valuable biomarkers, as has been shown many times during this course with breast cancer, where we have multiple distinct subtypes by expression profile.
It then seems beneficial to put different levels of information together, such as genome aberrations and transcriptome aberrations, to increase the amount of information and our power to discriminate between different subtypes of cancer, as was done in the Chin paper that is used for this course. Just to remind you, Sohrab probably showed you this yesterday: these are the survival experiences of subtypes defined by expression profiles. Within the luminal A expression subtype, the researchers also found two subtypes by genomic profile: those that had an amplification within certain regions of the genome and those that did not. Those two subtypes defined by genomic profile also had different survival experiences. So it seems beneficial to integrate different levels of information to increase our power of tumor classification. Now, I want to remind you of the HER2 and trastuzumab story, because this is probably one of the most prominent examples of going from basic research to the clinic. The HER2, or ErbB2, receptor is a cell-surface receptor tyrosine kinase of the HER family; its overexpression results in activation of intracellular signaling through pathways that promote cell division and cell growth and inhibit apoptosis. In 1987, Slamon published a study in Science showing overexpression of this gene, up to 20 copies, in up to 30 percent of breast cancer patients, associated with shorter survival and relapse times. Soon after that, in 1990, Genentech completed the formidable task of developing a humanized monoclonal antibody against the HER2 receptor within just one year. Initially it was effective in only a few percent of all breast cancer patients. But then it occurred to them that it would probably be beneficial to apply the antibody to the patients who are HER2 positive, who overexpress HER2.
So methods started to be developed to detect HER2 amplification with FISH and protein levels with IHC, and in 1992 clinical trials of Herceptin were commenced for HER2-positive cancers. Now it is a standard of care: a combination of testing for HER2 expression status and Herceptin treatment, in combination with other drugs, for HER2-positive cancers. This graph shows the efficacy, the response rate, of different therapies in different clinical trials combining Herceptin with different drugs. The yellow bars represent Herceptin alone as monotherapy, with a response rate reaching about 30 percent; in combination therapy, a very high response rate of up to 70 percent was achieved. It was definitely a huge success in this field. And remember, initially they had tested this antibody on all cancers, with and without HER2 overexpression. Good to have a clinician here. A question from the audience: I'm not a clinician, but what are the red bars? The red bars: this is the combination, and this is doxorubicin alone; for two of the trials they have this comparison of monotherapy versus combination. Okay, so there has been a plethora of studies showing the development of new and promising biomarkers and therapeutic targets; there is a really great number of publications over the last decade or so. Now, let's ask the question: how well are we actually managing cancer these days? It is very interesting to look at the Canadian Cancer Society website, at the statistics, say for 2009 in this case. If you look at each individual cancer type, on the left there is the incidence rate and on the right the mortality rate for different cancer types in males. What you can see, for instance for prostate cancer, is a peak in the incidence rate at exactly the time the PSA screening test was introduced.
And suddenly the incidence rate went way up, and then you see the incidence declined back, because people realized that a higher PSA is not really that indicative of malignant disease. So there was a peak here because of a high rate of overdiagnosis of prostate cancer. A comment from the audience: you captured a whole population of people in 1993 because you had this test that was more sensitive, and that's why you're not seeing a change in mortality, because there really haven't been more cases; you were just detecting the same cases at an earlier stage. Yes, exactly. And if you look at mortality, it went a little bit up and then a little bit down, which can probably be attributed to earlier diagnosis these days and mostly preventative care. But the change is not really that big, as you can see. So we're still not doing a great job at detecting aggressive cancers. Even if the PSA is really high, it is not necessarily the case that the patient will progress to metastatic disease; at the same time, some patients come into the clinic with low PSA and they already have metastases. Now, lung cancer in males is an interesting example: you see that incidence and mortality go down significantly over time, and that is probably because of the recognition of smoking as a risk factor; men basically heard the message, and here is the result. Now let's look at the female statistics. This is breast cancer, and you see that mammography was introduced here. We do see a little bit of a decline in the mortality rate for breast cancer, but indeed not that much, and this is because of exactly the same problem: we are not doing a great job at detecting high-risk patients. We cannot really predict that well how a patient will do in the future, so unfortunately we are missing a lot of really aggressive cancers. Now, the lung cancer picture for women is actually a bit ironic.
You see that the incidence and mortality rates for lung cancer in women are increasing over time, and that can certainly be attributed to the feminization of smoking: women took up smoking later than men. Now, if you look at all cancers together for males and females: on top there is the incidence rate and below the mortality rate for males. This is the actual growth in incidence; if you correct for the aging population and population growth, you get the corrected curve. You still see a slight increase in the rate for men, and mortality goes down a little, probably mostly due to the successes with prostate and lung cancers. In females there is also a slight increase in the rate across cancer types, and virtually no change in the mortality rate. So, unfortunately, we are not doing that great at managing cancer, and there is certainly still a pressing need for novel prognostic and therapeutic biomarkers. This slide summarizes what we've learned from this part. Microarray and next-generation sequencing technologies can be analyzed in similar ways for scanning whole genomes and transcriptomes for new and better prognostic markers and therapeutic targets. Both platforms have specific biases and considerations. Integration of multiple levels of data, genomic, transcriptomic, et cetera, increases power. I also touched a little on splicing and showed you that it is a really important layer of regulation of gene expression, and that it provides a rich source for biomarker and therapeutic development. The final conclusion is that there is still much to be done to put the principles of personalized medicine into practice: new biomarkers, new analytical methodology, and new legislation. This concludes part one of this module, and now we are moving into a more specific topic: discrimination and classification procedures.
There are three main statistical problems when it comes to tumor classification. The first is the identification of new, unknown classes using molecular profiles; that is called unsupervised learning. The second is classification into known classes; that is called discriminant analysis, supervised learning, or classification analysis. The third is the identification of marker genes that characterize the classes; that is variable selection. We will be talking about discriminant analysis, or classification analysis. This is a derivative of a slide from Sohrab's presentation, so you can recall quickly what he was talking about. What is the essence of discrimination and prediction? You have expression profiles for a group that you know has poor outcome and for another subgroup that has good outcome. Using this cohort of patients, which is comprised of two groups, say responders and non-responders, you learn a classifier. A classifier is a set of genes, together with a rule, that best describes the patterns within these two subgroups. This classifier is then applied to a new patient in order to predict the most likely response of that patient. The first part is called discrimination, and the second part is called prediction. Within the discrimination part, building the classifier implies a classification rule, which is comprised of two components: a classification method and feature selection. There is a great number of different classification algorithms available out there, probably more than a hundred. I am not going to go into a detailed explanation of these, but I will rather show you a comparison of the performance of different algorithms.
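To make the discrimination/prediction split concrete, here is a toy sketch. It assumes a nearest-centroid rule, one of the simplest possible classification methods (the lecture does not prescribe a specific method at this point), and the gene expression values and labels below are invented for illustration.

```python
# Toy illustration of "discrimination" (learn a classifier from a labeled
# cohort) followed by "prediction" (apply it to a new patient).
# Classifier: nearest centroid, chosen only for simplicity.

def learn_classifier(profiles, labels):
    """Discrimination step: average the expression profile of each class."""
    centroids = {}
    for label in set(labels):
        members = [p for p, l in zip(profiles, labels) if l == label]
        n = len(members)
        centroids[label] = [sum(vals) / n for vals in zip(*members)]
    return centroids

def predict(centroids, new_profile):
    """Prediction step: assign the new patient to the nearest class centroid."""
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(centroids, key=lambda label: dist(centroids[label], new_profile))

# Training cohort: expression of 3 genes in 4 patients with known outcome.
profiles = [[5.1, 0.9, 2.0], [4.8, 1.1, 2.2],   # responders
            [1.0, 3.9, 2.1], [1.2, 4.2, 1.9]]   # non-responders
labels = ["responder", "responder", "non-responder", "non-responder"]

classifier = learn_classifier(profiles, labels)
print(predict(classifier, [4.9, 1.0, 2.1]))  # → responder
```

The same learn-then-apply shape holds for any of the methods discussed next; only the internals of the two functions change.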
I suggest you go and read the specific literature if you want to know the details of each algorithm, but this slide names just a few of the most commonly used ones. These include linear discriminant analysis (LDA); Sohrab mentioned a paper yesterday that used a derivative of LDA called the weighted voting method, which you may encounter pretty often in the literature. Then there are maximum likelihood discriminant rules, nearest-neighbor classifiers, classification trees, and aggregating classifiers, and then the two bottom ones, neural networks and support vector machines, which are slightly different in terms of complexity: they are way more complex than the other methods and require more specialized mathematical training. People mostly use the top three approaches, and support vector machines seem to be quite popular; in comparison studies they show very good performance. Yes, this is the classification method for building a classifier; remember, there is a method and there is feature selection, and together they comprise the computational rule. For more details I will refer you to a very nice paper from 2002 that compared the performance of the simple methods, which require pretty much little training, using three published, publicly available data sets. The two major messages out of that comparison study are that KNN and DLDA, diagonal linear discriminant analysis, had the smallest error rates, and that aggregation improved the performance of tree classifiers. So if you are going to do drug response or classification studies, I would prompt you to go and read through those papers. Okay. Now, what is feature selection, and what is it for? The goal of feature selection is to reduce noise. There are three classes of feature selection methods.
These are filter methods, wrapper methods, and embedded methods. A filter method goes as follows: features, which are genes in our simplest scenario, are scored independently, ranked according to a certain score, and the top n are selected. What are the scores? A score may be a between-group to within-group variation measure, a test statistic, or a p-value. Basically, the goal is to subselect the genes, or features, that are most associated with the characteristic separating the compared groups: if it is an outcome, then with outcome; if it is survival time, then with survival time; if it is a drug treatment, then with drug treatment. These are the most frequently used feature selection methods, but there are certain problems with them. The first is redundancy: features are selected independently, without assessing whether they contribute any new information. Co-transcribed or co-regulated genes, for example, are indeed associated with the group characteristics but do not really bring any new information. Problem two is that interactions between features are not considered; this is information that we are losing. And problem three is that the classification procedure itself does not take part in feature selection, whereas it pretty much should: the classification should choose the best classifier. The wrapper method is an iterative approach that assesses the classification performance of many subsets of features and then selects the best performing subset. The problem with wrapper methods is that they are computationally intensive and very easy to overfit. The third type, embedded methods, performs classification and feature selection simultaneously. These are quite good methods; they achieve an improvement in classification accuracy. Examples are classifiers with built-in selection, such as classification trees.
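The filter method described above, score each gene independently, rank, keep the top n, can be sketched as follows. The between-group/within-group score is one of the scores mentioned in the lecture; the gene names and expression values are invented for illustration.

```python
# Sketch of filter-based feature selection: score each gene independently
# by a between-group to within-group variation ratio, rank, keep the top k.

def bw_score(values, groups):
    """Between-group over within-group sum of squares for one gene."""
    overall = sum(values) / len(values)
    between = within = 0.0
    for g in set(groups):
        member_vals = [v for v, gr in zip(values, groups) if gr == g]
        mean_g = sum(member_vals) / len(member_vals)
        between += len(member_vals) * (mean_g - overall) ** 2
        within += sum((v - mean_g) ** 2 for v in member_vals)
    return between / within if within > 0 else float("inf")

def filter_select(expr, genes, groups, k):
    """Rank genes by score; note each gene is scored independently,
    which is exactly the redundancy problem discussed above."""
    scored = sorted(genes, key=lambda g: bw_score(expr[g], groups), reverse=True)
    return scored[:k]

expr = {  # gene -> expression across 6 samples (3 per group), invented data
    "GENE_A": [5.0, 5.2, 4.9, 1.0, 1.2, 0.9],   # strongly group-associated
    "GENE_B": [2.0, 2.1, 1.9, 2.0, 2.2, 1.8],   # uninformative
    "GENE_C": [3.0, 3.3, 2.8, 6.1, 5.9, 6.3],   # strongly group-associated
}
groups = [1, 1, 1, 2, 2, 2]
print(filter_select(expr, list(expr), groups, k=2))  # → ['GENE_A', 'GENE_C']
```

Note that if GENE_A and GENE_C happened to be co-regulated copies of the same signal, this scoring would still select both, illustrating the redundancy problem.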
To illustrate the goal of feature selection, this is the correlation matrix of expression profiles between samples of group 1, group 2, and group 3. Here, the entire set of 3,500 genes was used to correlate the profiles of each pair of samples, and the resulting correlation matrix is fuzzy. Now, if you filter the genes based on, say, differential expression between the groups, you get a much crisper correlation structure, and this is what you want: you want to work with the genes that are most associated with these groups, as these will be most useful for building a classifier. So the goal of feature selection is to improve classification performance by removing genes that are not associated with outcome. It may also provide useful insights into the biology of the disease: differentially expressed genes between these interesting groups can shed light on the pathways and biological processes involved in the biology of the subtypes, and ultimately it can lead to the development of diagnostic tests, such as the breast cancer chip, for instance. This slide summarizes, very superficially, the overall strategy for developing a classifier. One starts with a learning or training set, say a patient cohort with expression profiles, and builds a classifier using a certain classification rule. One then applies this classifier to an ideally independent patient cohort, profiled at a different institution, maybe on a different platform, so that it is completely independent. The performance of the classifier is then assessed through the error rates of classification on both the training set and the independent set; you get error rates from both, and these are used to assess the performance of the classifier.
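This build-then-assess workflow can be sketched in miniature as follows. The classifier is again a simple nearest-centroid rule, chosen only for brevity, and all data are invented; the point is the two error rates, the optimistic resubstitution error on the learning set versus the more honest error on held-out samples.

```python
# Sketch of the classifier-development workflow: build on a learning set,
# then report the resubstitution error rate and the held-out error rate.

def build(profiles, labels):
    """Build a nearest-centroid classifier (mean profile per class)."""
    centroids = {}
    for label in set(labels):
        members = [p for p, l in zip(profiles, labels) if l == label]
        centroids[label] = [sum(v) / len(members) for v in zip(*members)]
    return centroids

def classify(centroids, profile):
    """Assign a profile to the class with the nearest centroid."""
    return min(centroids,
               key=lambda c: sum((x - y) ** 2
                                 for x, y in zip(centroids[c], profile)))

def error_rate(centroids, profiles, labels):
    """Fraction of misclassified samples."""
    wrong = sum(classify(centroids, p) != l for p, l in zip(profiles, labels))
    return wrong / len(labels)

# Single cohort split into a learning set and a held-out test set.
learn_x = [[5.0, 1.0], [4.7, 1.3], [1.1, 4.0], [0.9, 4.4]]
learn_y = ["good", "good", "poor", "poor"]
test_x = [[4.6, 1.2], [2.9, 2.8], [1.3, 3.8]]  # middle sample is borderline
test_y = ["good", "good", "poor"]

clf = build(learn_x, learn_y)
print("resubstitution error:", error_rate(clf, learn_x, learn_y))  # 0.0
print("held-out error:", error_rate(clf, test_x, test_y))          # 1/3
```

Here the classifier is perfect on its own learning set but misclassifies the borderline held-out sample, which is exactly why the resubstitution error rate is called optimistic.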
Now, this is the ideal scenario, but it is very seldom, especially at the onset of a study, that an independent cohort of patients is available. What happens more frequently is that a researcher has a single cohort of patients. In that case, this set is split in a certain proportion into a learning (training) set and a test set; the test set is then not totally independent. One builds a classifier, tests it on the test set, and the error rates give the performance assessment. This is the much more frequent scenario, but the downside of this approach, unfortunately, is that it reduces the effective sample size. Say you compile a cohort of some 30 patients: you still have to split it into a training set of, say, 20 samples and a test set of 10 samples, which effectively reduces the sample size, and that is a big problem for building classifiers.

Now, an essential part of classification and discrimination studies is the performance assessment of the built classifier, which consists of the following questions. How accurate is the classifier? For that purpose we use the confusion matrix and accuracy, for instance. How well did the classifier work on the learning (training) set? That is the error rate I showed you, called the resubstitution error rate. How well did the classifier work on the test set? That is the test set error rate. Another very important component of internal validation is cross-validation of the classifier, which I will tell you about later. And finally, one would wish to compare a given classifier with other existing classifiers, and for that purpose one uses ROC curves.

So what is the confusion matrix? Say we have a cohort of patients that were diagnosed with some other kind of test, say endoscopy, and some of them were positive and some were negative; this we may consider the gold standard. Now we are testing our new method, our diagnostic test, our classifier, and we want to see how it performs: it predicted a certain number of positives and negatives within the same cohort. The true positives are those correctly predicted as positive for the disease, and the true negatives are those correctly predicted as negative; correspondingly, the false positives were incorrectly predicted to be positive, and the false negatives were incorrectly predicted to be negative. Just note for the future that we should always consider the fraction of patients who are positive for the disease, referred to as the disease prevalence, which is very important for the application of a new diagnostic test or classifier. Keep it in mind; I will use it later on.

From this table we can derive a number of assessments. The sensitivity, or true positive rate, is the fraction of true positives out of the sum of the disease-positive column. The specificity, sometimes called the true negative rate, is the fraction of true negatives out of the sum of the disease-negative column. From the rows: the fraction of true positives in the predicted-positive row gives the positive predictive value, and the fraction of true negatives in the predicted-negative row gives the negative predictive value. Of these four statistics, sensitivity and specificity are not directly clinically relevant, but the positive and negative predictive values are, because they take into account the disease prevalence; I will talk about this a bit later. The accuracy is very often used for the assessment of a classifier; it can be expressed through the true positives and true negatives, and it can also be expressed through sensitivity and specificity together with the fractions of positives and negatives according to the gold standard.

Here is an example of a confusion matrix and the statistics derived from it. We have patients with bowel cancer as confirmed by endoscopy, the positives and negatives for the disease, and this was the evaluation of a new test predicting whether a patient is positive or negative for the disease. This is the number of true positives, this is the number of true negatives, and these are the false negatives and false positives. The calculation is simple: with 2 true positives and 18 false positives, the positive predictive value is 10 percent, the fraction of true positives out of the sum of the predicted-positive row, 2 / (2 + 18). The negative predictive value is computed the same way from the predicted-negative row. For sensitivity, you take the fraction of true positives out of the sum of the disease-positive column, 2 / (2 + 1), and specificity is computed similarly from the disease-negative column.

A question: what is the difference between the two sets? You build a classifier on a cohort of patients for which you know the outcome: you learn the classifier from your training set, the set used for building the classifier. The test set is another cohort of patients on which you test your classifier. You could test your classifier on your training set and say, aha, the accuracy is 100 percent, my classifier is the best in the world, but that is only because you learned the classifier from that very set. To assess the performance of a classifier, you have to test it on a test set that was not used for building it. Does it make sense? Now, which set is used for the accuracy measurement? The accuracy refers to the test set: imagine this is your cohort of patients and you have already developed your test using the training set; now you want to test it. The way you do this is to compile the confusion matrix on your test set, which was diagnosed with the gold standard technique; with regard to this performance assessment, it serves as the test set. You could calculate accuracy for both, but the most relevant one is that on the test set; and if you have poor accuracy on your learning set, then you probably don't have a good enough classifier to proceed to testing at all. To illustrate: the error rate that comes from testing the classifier on its own training set is very optimistic, and you would never use it; the relevant error rate is the one from the test set, and that is what is used for the confusion matrix. Of course, there is an intrinsic problem that the gold standard method can itself fail; it too has some error rate.

Okay, so what is the misclassification error rate that feeds into the performance assessment? It is just one minus the accuracy. Just to reiterate: the resubstitution error rate is very optimistic, because it shows how well the classifier worked on the same training (learning) set; the more relevant error rate is how well the classifier worked on the test set, the test set error rate. A question: is the misclassification rate the same as one minus the AUC? The ACC here is the accuracy; the AUC is the area under the curve, and no, they are different. Now, cross-validation is a very important internal validation of the classifier, and it is performed on the training set itself. There are two versions of cross-validation. The more general one is V-fold cross-validation estimation: the cases in the learning set are randomly divided into V subsets of nearly equal size; we leave one subset out to serve as a test set, and the remaining V minus one subsets comprise the learning set for building the classifier, and then
we iterate this process V times and we compute test set error rates for each iteration and then we average the error rate across these iterations this is the essence of a V fold cross validation now the the specific case of a V fold cross validation is a leave one out cross validation which is widely used in the literature and that is when V is equal to N where N is the number of samples in your training set so you basically at each and every iteration you leave one sample out you build a classifier on the rest of the samples and then you test it on the left one out sample and so the observation is that this kind of validation works pretty well for stable classifiers such as KNM, LDA, or SVMs I do understand the question so that was that was no this is not it no cross validation pertains the training set itself this this diagram actually shows you so you compare so yeah you do you do it all together and then all of these three components contribute to the overall performance assessment so here you see that you say have a learning set for which you build a classifier right you you test the error rate which it would be you estimate the error rate which is a resubstitution error rate then you you do a cross validation here using your training set this would be your V fold cross validation and you build a classifier V time and then you average the error rate and then it also goes to the performance assessment and then finally you do the external validation on the independent set of of this classifier and that's how you get your test set error rate and all of these error rates go into the overall performance assessment uh well yeah this is the most uh relevant one but unfortunately it's it's very often the case that people don't have any independent set and the training set is small enough to be split unfortunately this is a brutal reality and people trying to do leave one out cross validation I mean all sorts of manipulations which actually lead to false results and 
very poorly reproducible classifiers.

To put it in less abstract terms: suppose you have 300 blocks of prostate samples. You use 200 of them as your learning set for classification, and you save 100 of them. [Comment: They are all from the same collection, from the UBC pathology lab, so that second set of 100 would be your independent test set.] Well, it would be a test set, but it would not be completely independent, because there are a number of biases in each and every study that cannot be fully addressed during study design, conduct, and interpretation of the data; I will touch a little bit upon that kind of challenge later. That is why it is not entirely independent: the same biases may still be present, and the performance estimate will be a bit optimistic. There is some independence, in the sense that the classifier is not being validated on the very samples it was developed on, but the two sets are clinically dependent in the sense that they were collected under the same umbrella. I would say that is still pretty good; unfortunately, in reality you have to work with what you have, and you do the best you can.

One note I would like to emphasize is that the learning set and the test set have to be identically distributed. For instance, you cannot just impose an equal proportion of disease versus normal cases in your learning set and test set if that is not how the population looks. To put it in more general terms, both the learning set and the test set should be representative of the entire population of patients, and that means they should have similar distributions of age, race, status for certain hormones, and other baseline characteristics.

[Question: Does it not follow that your independent test set will always have to be a significant fraction of your total samples? Depending on the number of variables you are attempting to predict, is there a way to do a power calculation that says, if there are five major variables of interest, then your test set can be no smaller than 25 percent, or 20 percent, of your cases?] It is a very tough problem to produce a really solid and correct study design; it is a very hard task. There are certain approaches that people take, and there are nomograms out there that help you estimate what sample size, and what characteristics of the sample, you need in order to achieve a certain power. It is highly complicated; it is a whole distinct area and a big part of epidemiology itself. So it is a very important point that has to be thought through very thoroughly, and unfortunately it is not simple. Even if you try to account for all the intrinsic problems that may accompany your study design, there are still a certain number of threats that are very hard to address, and I will tell you a little about them later.

[Comment: Just a comment about the variables. We were talking about five variables, but sometimes you need to account for variables that you do not even know about, because they come from sample preparation, the humidity on the day, and so on.] I was not thinking of variables in that sense. You have a certain number of predictors coming out of your models; you say, these are my top five genes, the expression levels of which determine my subclassifier. But there will also be the genes that made the top 10 but did not make the top five, and so you may be faced with a choice where you've
got your top five, and they look pretty good, but maybe you actually need to focus on only the top three, so that you can keep your validation set, or your test set, a significant enough fraction of your cases. There is no rule of thumb for what the size of your classifier should be, that is, how many features you should include in it. Certainly it should be reasonable from a practical point of view: it should not be 250; it should be at most a couple of dozen, maybe 30 genes in your classifier. What the people who are more thorough with their research design tend to do is estimate the optimal size of the classifier to use in combination with a given classification algorithm, and I will show you some of the papers that did that.

All right, I see I have 30 minutes before lunch, so let me finish this section, and then I would suggest we take a break for lunch before we start on the drug response studies.

ROC curves are commonly used for comparing the performance of different diagnostic tests or classification procedures. The name comes from "receiver operating characteristic". An ROC curve is a graphical plot of the sensitivity, or true positive rate, versus the false positive rate for a binary classifier system as its discrimination threshold is varied. The plot on the left shows what this looks like: these are real ROC curves for a number of algorithms trying to correctly predict cases of alternative splicing, and this area is the ROC space. The graph on the right explains the ROC space a little. The red diagonal line is the no-discrimination line, meaning that curves falling below it would not effectively discriminate between our classes; the area above the no-discrimination line is of interest to us, and the upper-left corner is the most interesting, because that point represents perfect classification: a zero false positive rate and 100 percent true positive rate. From the graph on the left you can see that two of the methods, the black curve and the one next to it, perform better than the rest.

It is very common to summarize the performance of any given method using the area under the curve (AUC): the higher it is, the better the performance of the method, and its maximum value is one. Sometimes people instead report the area above the curve, which is simply the complement of the area under the curve; weirdly enough, some people do use that kind of summary.

If you are going to go into classification and discrimination studies, I would highly recommend that you read through these highly influential papers. They are very solid studies of different cancer types published from 1999 through 2002; all of them use a pretty good study design and a range of methods, and if you want to familiarize yourself with the field, you should be aware of these landmark papers in the area.

OK, so this is the end of this part; let's take a break for lunch.
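As a concrete illustration of two ideas from this section, leave-one-out cross-validation as an internal error-rate estimate and AUC as a performance summary, here is a minimal, self-contained Python sketch. The toy one-dimensional data and the nearest-centroid classifier are my own assumptions for illustration, not anything from the lecture or from the papers it cites:

```python
# Sketch: leave-one-out cross-validation (V = N) and AUC, on synthetic data.
# The classifier and data below are hypothetical toy choices.
import random

random.seed(0)

# Synthetic learning set: class 0 centered at 0.0, class 1 centered at 1.5.
data = [(random.gauss(0.0, 1.0), 0) for _ in range(30)] + \
       [(random.gauss(1.5, 1.0), 1) for _ in range(30)]

def nearest_centroid_score(train, x):
    """Score = distance to class-0 centroid minus distance to class-1
    centroid; a higher score favors class 1."""
    c0 = sum(v for v, y in train if y == 0) / sum(1 for _, y in train if y == 0)
    c1 = sum(v for v, y in train if y == 1) / sum(1 for _, y in train if y == 1)
    return abs(x - c0) - abs(x - c1)

# Leave-one-out CV: each sample serves exactly once as the test set,
# and the classifier is rebuilt on the remaining N - 1 samples each time.
errors = 0
scores = []  # (score, true label) pairs, reused for the AUC summary
for i, (x, y) in enumerate(data):
    train = data[:i] + data[i + 1:]
    s = nearest_centroid_score(train, x)
    scores.append((s, y))
    predicted = 1 if s > 0 else 0
    if predicted != y:
        errors += 1

loocv_error = errors / len(data)   # misclassification rate = 1 - accuracy
accuracy = 1 - loocv_error

# AUC via the rank-sum (Mann-Whitney) identity: the probability that a
# randomly chosen positive case scores higher than a random negative one.
pos = [s for s, y in scores if y == 1]
neg = [s for s, y in scores if y == 0]
wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
auc = wins / (len(pos) * len(neg))

print(f"LOOCV error: {loocv_error:.3f}  accuracy: {accuracy:.3f}  AUC: {auc:.3f}")
```

Note that the resubstitution error rate would be obtained by scoring each sample against centroids computed from the full data set; the leave-one-out loop above avoids exactly that optimism by always excluding the tested sample from training.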