Welcome to the MOOC course on Introduction to Proteogenomics. Today we have a hands-on session by Dr. David Fenyö. In this session, he will cover some basics, such as how to open a file in R, how to transpose your data and how to run a command. Some background knowledge of R will help you follow along more quickly. Next, he will discuss how to generate different kinds of plots and study correlation using R. Finally, he will discuss a binary classification model and how to plot a binarized copy number. So, let us welcome Dr. David Fenyö for today's hands-on session.
You should start RStudio. Everyone should have it installed from previously, and then you need to do two things. One is to import the data. I will show it quickly here, but you should have the PowerPoint too, so you can look at how you import the data. The data is the .txt file. The only important thing to do when you import the data file is to change the name and call it X, just so that it is short. You should click yes here on the heading, so that it takes the sample names as the column headings, and you should select "use the first column as the row names". Those are the gene names.
The data is a small data set. It is the 77 breast cancer samples from CPTAC that you have already seen before, and I have just done a small selection of genes. From the bottom here, there are five genes, and these are the protein levels that we measured, as log2 protein levels: ERBB2, ERBB4, GRB7, TBCB and TRIM65. Those are the five protein levels, and each has a value for each sample here. The other thing we have is copy number. We have copy number only for ERBB2, and I have also binarized the copy number, that is, divided it into low and high. That is all the data, and then you just click on import.
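The import dialog steps described above also have a scripted equivalent. The sketch below is only an illustration: it writes a tiny tab-separated file with invented sample names and values (the real file is not reproduced here), then reads it with the header as column names and the first column as row names.

```r
# Write a tiny tab-separated file so the example is self-contained.
# Sample names and values are invented; the real data has 77 samples.
tmp <- tempfile(fileext = ".txt")
writeLines(c(
  "Gene\tS1\tS2\tS3",
  "ERBB2\t1.2\t5.4\t0.8",
  "ERBB2.CNV\t0.3\t3.1\t0.1"
), tmp)

# header = TRUE: first line gives the sample names (column headings)
# row.names = 1: first column gives the gene names (row names)
X <- read.delim(tmp, header = TRUE, row.names = 1)

dim(X)       # rows are genes/features, columns are samples
rownames(X)
```

Naming the result X matches the name the session code expects.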
If you do not change the name to X, for example, it is going to be complicated for you later. After you import it, you should see X here, with 77 columns and 7 rows, and this is showing a small part of it. The next step is to open the code, which is the .R file, and when you open it you will see this, and we are going to go through it. Oh yeah, there is one more thing that is important: if you select a few lines and then click here (I have seen that some people have it displayed differently, maybe as "Run"), that will execute the lines you have marked, and you can mark several lines.
So now, can anyone explain what this first line means? We are applying a function called t to X. We can look at X; it looks like this. So what is going to happen when we apply the function t? Yes, yes. So everyone knows what transpose is; you guys have transposed it already, and I will do it now. Like that, and now it appeared here. We can look at it, and it now has the gene names as columns and the samples as rows.
Then what we are going to try to predict is copy number: we will try to predict which samples have a high copy number. We have this column now in the transposed table, and we are going to say that when the copy number is smaller than 2 it is low, and when it is larger than 2 it is high; this is the log version of it. We can run these two lines and see that in the high group we have 16. If you remember my slides from the presentation, there were a few dots, and it turns out that there were 16 that had high ERBB2 copy number, but most of them, 61, are low. You should definitely see here that there are 16 that have high copy number and 61 that have low. So now let us try our first plot.
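The transpose and the low/high split described above can be sketched as follows. The table here is a toy stand-in with invented values and only four samples; the cutoff of 2 is the one from the session, and treating a value of exactly 2 as high is an assumption.

```r
# Toy stand-in for the imported table X: rows are features, columns are samples.
# All values are invented for illustration.
X <- data.frame(
  S1 = c(4.1, 0.5),
  S2 = c(6.2, 3.1),
  S3 = c(3.8, 1.2),
  S4 = c(5.5, 2.4),
  row.names = c("ERBB4", "ERBB2.CNV")
)

xt <- t(X)   # after transposing, samples are rows and features are columns

# Keep only the rows (samples) that satisfy the condition on the CNV column.
# drop = FALSE keeps the result a matrix even if a single row is selected.
low  <- xt[xt[, "ERBB2.CNV"] <  2, , drop = FALSE]
high <- xt[xt[, "ERBB2.CNV"] >= 2, , drop = FALSE]

nrow(low)    # number of low copy number samples in the toy data
nrow(high)   # number of high copy number samples in the toy data
```

On the real data these two counts should come out as 61 and 16, as stated in the session.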
Taking a subset like this is very useful, and it is something you will need very often. What this notation means is that we take xt, but we are only interested in the rows where the ERBB2 copy number is less than 2, and those we assign to a new variable called low. Let us look at high, because it is a little bit smaller. In high we should now have only 16 samples, which we do. So this is a subset, a view of the overall table that has 77 samples, where we selected only the 16 that have high copy number.
Now we are going to plot. We defined gene1 as ERBB4; we just picked one out of the five proteomics tracks. On the x axis we are going to plot ERBB4, and on the y axis the copy number, and the plot is going to appear here. So now the x axis is ERBB4 and the y axis is the copy number for ERBB2. We know that there are 16 in our high group, which are these here that have a higher copy number.
We can do one more thing here, which is to color these in red. Let me run it and then I will explain it. Now all the ones we consider high are shown in red. What we did is first plot everything in black with the plot command, and then overlay in red just the 16 in the high group. So we have now separated them. You can search for something like "R points" and you will get some results; that is what I did yesterday. To define the color you write col, short for color, equal to "red", and the red has to be in quotation marks. There are lots of other things you can do: you can change the shape of the points, you can have them filled. All of that you can look up on the internet.
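The plot-then-overlay idea described above might look like this minimal sketch. The values are invented, and the axis labels and the filled point shape (pch = 19) are illustrative choices rather than what the session code necessarily used.

```r
# Invented toy values; in the session xt is the transposed 77-sample table.
xt <- cbind(
  ERBB4     = c(4.1, 6.2, 3.8, 5.5, 4.9),
  ERBB2.CNV = c(0.5, 3.1, 1.2, 2.4, 0.1)
)
high <- xt[xt[, "ERBB2.CNV"] >= 2, , drop = FALSE]

gene1 <- "ERBB4"

# First draw every sample in the default color (black) ...
plot(xt[, gene1], xt[, "ERBB2.CNV"],
     xlab = gene1, ylab = "ERBB2 copy number")

# ... then draw the high group again on top, in red and filled.
points(high[, gene1], high[, "ERBB2.CNV"], col = "red", pch = 19)
```

Because points() adds to the existing plot instead of starting a new one, the red points simply cover the black ones at the same positions.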
So we plotted copy number here, but we are really only interested in whether the copy number is low or high. There is one other column for that. Here we plotted the ERBB2.CNV column, but we have another one, ERBB2.CNV.binary, and now we are going to plot that. It is the same plot, but we change from the actual copy number to either 0 if the copy number is low or 1 if it is high. We run this, and now the y axis is either 0 or 1: all the red ones, which have high copy number, are up here at 1, and the others are at 0. You remember that in my lecture I showed some slides with exactly these kinds of plots.
What we want to do is find a threshold where most of the red points are above the threshold and most of the black points are below it. As you see, it is impossible to find a perfect one here, but let us just try one. There is a whole section here that we are going to run. We define the threshold at 5, because all the black points are below 5. We run this and get another plot with a line at 5, and we define these counts. "Low above" is the number of black points above the line, which is 0, and "low below" is 61, which is all the black points. From the high group, "high above" is only 3, which is a bit disappointing: 3 out of 16 were classified right, and 13 are below. So this is not a good choice of threshold, and what I want you to do now is find a good one.
Oh yeah, this is another Google search I did, on how to plot with R, and you get a lot of different answers. One very nice thing is that there is a graph gallery that shows different types of graphs. You can look through the graphs, see the kind of graph you want to make, and the code is there; you can just copy and paste it and modify it a little bit.
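The threshold line and the four counts described above can be sketched as follows. The six samples are invented, the variable names with dots are a guess at how the session code spells them, and counting a value equal to the threshold as "above" is an assumption.

```r
# Invented toy values: protein level plus the binarized copy number (0 = low, 1 = high).
xt <- cbind(
  ERBB4            = c(4.1, 6.2, 3.8, 5.5, 4.9, 4.4),
  ERBB2.CNV.binary = c(0,   1,   0,   1,   0,   1)
)
low  <- xt[xt[, "ERBB2.CNV.binary"] == 0, , drop = FALSE]
high <- xt[xt[, "ERBB2.CNV.binary"] == 1, , drop = FALSE]

gene1 <- "ERBB4"
th <- 5   # threshold on the protein level

plot(xt[, gene1], xt[, "ERBB2.CNV.binary"], xlab = gene1, ylab = "ERBB2 CNV (binary)")
points(high[, gene1], high[, "ERBB2.CNV.binary"], col = "red")
abline(v = th)   # the protein level is on the x axis, so the threshold is a vertical line

# Comparing a column to th gives TRUE/FALSE; sum() counts the TRUEs.
low.above  <- sum(low[, gene1]  >= th)
low.below  <- sum(low[, gene1]  <  th)
high.above <- sum(high[, gene1] >= th)
high.below <- sum(high[, gene1] <  th)

c(low.above = low.above, low.below = low.below,
  high.above = high.above, high.below = high.below)
```

Changing th and rerunning the last lines is all that is needed to try another threshold.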
These will now show exactly what we have done. If you remember one of the slides, what we have done is actually make a table like this. We had our actual groups, where 0 corresponds to the low group, when the ERBB2 copy number is low, and 1 is when the copy number is high, and then the predicted group is what we get when we set our threshold, now at 5. So we get this table.
What I want you to do now is change the threshold and find what you think is a good threshold to choose for this example, which is ERBB4. Then I want you to redo it for the ERBB2 protein values, which of course are going to be better; that is what we expect. Maybe also try TRIM65, and find the best threshold for each of these. If you have more time, you can also explore plotting the data in different ways by looking here at the gallery.
Yes. So every time you try a new threshold, you get a table like this. Maybe fill out the table with pen and paper, try a few thresholds, and then move on from ERBB4 to ERBB2. ERBB2 is going to give a better result because there is going to be more separation, not surprisingly, because the copy number affects the same gene more. But as I showed before, ERBB4 does not have a copy number change itself, yet it is still affected by the copy number change in an indirect way. TRIM65 is negatively correlated, but not that strongly, so there the red and the black will switch; it will be the other way around. So just explore, and definitely think about what it means for the threshold to be optimal.
We want to get these four values for every threshold. We have calculated them: for the copy number high group, the ones that are above the threshold and the ones that are below the threshold, and likewise for the copy number low group.
So there are three that are above the threshold, these three red points here from the copy number high group. Can anyone tell me where that three goes in the confusion matrix, where we have four different positions? Yes, true positives. So we put the three here. Then we have the copy number high ones below the threshold, which are the 13, and what are those? Yes, false negatives. So we put 3 in true positives and 13 in false negatives. The copy number low ones make up the other two cells. The low above, which is 0: where does that end up? True negative? No, sorry, false positive, yes. So we do not have any false positives, and the 61 low below are the true negatives. With this table we always want as many as possible on the diagonal and as few as possible off the diagonal.
Let us say we have an additional test, one that is rather quick and easy, that we do for all the positives. Then we do not worry so much about false positives, and maybe we can allow more of them, but we are worried about the false negatives. So it is not always the sum, but taking the sum is definitely one option. The other example: let us say the consequence of a false positive is that a PhD student will spend five years investigating this protein. Then of course we want very few false positives. So it always depends on the situation.
To summarize: the copy number high samples above the threshold are our true positives, and the copy number high samples below the threshold are the false negatives. The copy number low samples above the threshold are the false positives, and in this case, with this initial threshold, we had none. The copy number low samples below the threshold are the true negatives.
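The mapping from the four counts to the confusion matrix can be written down directly. This sketch uses the counts quoted in the session for a threshold of 5 on ERBB4 (3, 13, 0 and 61); the variable names and the row/column labels are illustrative.

```r
# Counts quoted in the session for threshold 5 on ERBB4.
high.above <- 3    # copy number high, above threshold -> true positives
high.below <- 13   # copy number high, below threshold -> false negatives
low.above  <- 0    # copy number low,  above threshold -> false positives
low.below  <- 61   # copy number low,  below threshold -> true negatives

# Rows are the actual group, columns are the predicted group.
confusion <- matrix(
  c(high.above, high.below,
    low.above,  low.below),
  nrow = 2, byrow = TRUE,
  dimnames = list(actual = c("high", "low"), predicted = c("high", "low"))
)
confusion

sum(diag(confusion))                    # correctly classified samples (on the diagonal)
sum(confusion) - sum(diag(confusion))   # false positives + false negatives (off the diagonal)
```

With these numbers, 64 of the 77 samples sit on the diagonal and 13 sit off it, all of them false negatives.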
What you should try to do is minimize the sum of the false positives and the false negatives. I think for this case we could call that the optimal case, but remember that this is not a general statement; it will vary from case to case. What you have done now is actually optimize a very simple machine learning algorithm. It is probably the simplest machine learning algorithm one can imagine, but it is still an algorithm, where we try to predict what is positive and what is negative. People talk a lot about machine learning, and there is a lot of hype, but there is no magic to it. This is it. If you understand this, the rest is just tricks to do it better.
So when you are calculating the optimal threshold based on that table, if you take the precision value and you want the most precise values, can that be used as the criterion for choosing the optimum? Yes, if you want to do that, definitely.
So I think this will kind of show the power of R. David went through how you can manually fill out that confusion matrix table, but programming languages like R let you fill out that table with a single line of code. Now, if you have to optimize this, you have to make the table, let's say, 10 times.
If you do it by hand it shouldn't take you a long time, or maybe you are faster, but you can do it very quickly with 10 lines of code, and if you learn this thing called a for loop you can do it with two lines of code, and it automatically goes through everything. If you want to calculate precision, accuracy, sensitivity or specificity, there are formulas based on that table, and in R there are functions that calculate all of those. You can do pretty much everything, and you can even plot all the values as your threshold changes and see where the maximum or minimum occurs. So that is the power of programming. David and I are basically introducing you to what it can do, but it provides power that goes way beyond what Excel or people can do. Like David said, it is all something we could do by hand, but if you do it by code it is much faster.
The more important thing is that two weeks later, when you are trying to figure out what exactly you did, the code will be very important. If you did it in Excel, you only have your own word for what you did before. So reproducibility, in terms of remembering what you did, showing others what you did, and making it available to others in a publication, is made much easier if you can code. I think that is where the power of programming comes from.
So I will just show you how to create that table automatically, and then I will let you and Google deal with the rest. To make the table you need two vectors: one that gives you the high and low based on copy number, and another that gives you the high and low based on the threshold you have chosen. When you have these two vectors, you can call a function that will create the table for you.
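The formulas based on the table that were mentioned above are the standard ones. As a hedged sketch, here they are written out by hand for the counts quoted in the session at threshold 5 (3 true positives, 13 false negatives, 0 false positives, 61 true negatives); the short variable names are illustrative.

```r
# Counts from the session's confusion matrix at threshold 5 on ERBB4.
tp <- 3; fn <- 13; fp <- 0; tn <- 61

accuracy    <- (tp + tn) / (tp + fn + fp + tn)  # fraction classified correctly: 64/77
sensitivity <- tp / (tp + fn)                   # fraction of actual highs found: 3/16
specificity <- tn / (tn + fp)                   # fraction of actual lows kept low: 61/61
precision   <- tp / (tp + fp)                   # fraction of predicted highs that are right: 3/3

c(accuracy = accuracy, sensitivity = sensitivity,
  specificity = specificity, precision = precision)
```

The numbers show why this threshold is a poor choice even though precision and specificity are perfect: only 3 of the 16 high copy number samples are found. Note that precision is undefined (0/0) for a threshold so high that nothing is predicted positive.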
So the first vector comes from the copy number, and then, with a threshold of five, you create another vector that says when the gene is high and when the gene is low. Once you have these two vectors, it is a single line of code that makes the table for you. Basically what you need to do is say that high is one. You can just take that line and copy it over. So I now have a variable called erbb2.cnv. I want to mark things that are 2 or more as 1. This statement is saying: there is this matrix called xt, I am picking the ERBB2.CNV column, and if that column is greater than or equal to 2 then the Boolean value is true. In other words, whenever that is true for a sample, that sample will have a 1; otherwise it will have a 0. It is just a different way of representing what David had before. If you look here, you can see that erbb2.cnv is logical, and it has FALSE, FALSE, FALSE, TRUE and so forth. For every sample it says whether it was at least 2 or less than 2. So you have taken continuous values and discretized them into true or false.
I will do the same for the other one. I am going to set my threshold to 5, and I will call it th. One of the tricks that R programmers use is to have names that are really short, so you do not have to type too much when you are writing code. That is not a good idea for code you want to keep; it is just for trying things out. When you are writing code to publish, if you have th, you will not know what it meant, so you need a longer name. But when you are just trying things out, it helps to have short names.
So now I am going to say erbb4... no, the error happened because I had an extra xt here. The arrow means take the result and put it into some name. So this line is basically saying: I have the number 5, and I want to call it th.
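The discretizing step and the assignment arrow described above can be tried on a few values. This is a minimal sketch with an invented four-sample vector standing in for the ERBB2.CNV column of xt.

```r
# Invented copy number values for four samples.
cnv <- c(S1 = 0.5, S2 = 3.1, S3 = 1.2, S4 = 2.4)

# A comparison on a vector returns one TRUE/FALSE per element.
erbb2.cnv <- cnv >= 2
erbb2.cnv

as.integer(erbb2.cnv)   # TRUE becomes 1, FALSE becomes 0
sum(erbb2.cnv)          # TRUE counts as 1, so this is the number of high samples

# The arrow stores the value on the right under the name on the left.
th <- 5
th
```

Keeping the threshold in a name like th means later lines can be rerun unchanged after editing only this one value.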
That way, when I write code using th, I can go and change it to 3, and the next time I run the same code it will do it for 3 instead of 5. I used the equals sign here. The equals sign does the same thing as the arrow. Actually, equals is the newer thing; they would not allow equals in R before, I think, but the later versions allow it. I am used to the arrow, less-than followed by minus, which is the same as equals. I used equals because I do not really know R. In many languages it is equals, and I think R realized that people were getting confused, so only recently have they allowed the equals sign; until then it was the less-than-minus arrow. So that shows that I am a little old.
For the ERBB4 gene, I am going to take the array xt, pick the ERBB4 column, and require it to be greater than or equal to the threshold in order for it to be called high. So now for erbb4.gene you again see here that there are 77 things, each either true or false. You can even look at it by just typing the name here. We saved the result in erbb4.gene, so you can see for which samples it is true and for which it is false; the whole thing is encoded in that one name.
Now all you need to do to get the table that you made manually before is to say: I want a table that compares the ERBB2 copy number and the ERBB4 gene. Yeah, I spelt it wrong the first time. These are things that you have to keep in mind. Many of you had issues when you loaded the table. For example, if you did not remove the column that said genes and make it the row names, the transpose does not behave, because transposing works properly only when you have a purely numeric matrix. There are all these quirks and debugging skills that you need to learn along the way. It is not as straightforward as one would think, because the error message that comes out does not really tell you what is happening or what is wrong.
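The single line that builds the table from the two logical vectors uses the base R function table(). The sketch below runs it on six invented samples; the argument labels actual and predicted are illustrative additions that only name the two dimensions of the result.

```r
# Invented toy values; in the session xt is the transposed 77-sample matrix.
xt <- cbind(
  ERBB4     = c(4.1, 6.2, 3.8, 5.5, 4.9, 4.4),
  ERBB2.CNV = c(0.5, 3.1, 1.2, 2.4, 0.1, 2.8)
)

erbb2.cnv  <- xt[, "ERBB2.CNV"] >= 2   # actual group: high copy number or not
th         <- 5
erbb4.gene <- xt[, "ERBB4"] >= th      # predicted group: protein level at or above threshold

# Cross-tabulate the two logical vectors: rows are actual, columns are predicted.
tab <- table(actual = erbb2.cnv, predicted = erbb4.gene)
tab
```

In the printed table, the cell at row TRUE and column TRUE holds the true positives, and the cell at row TRUE and column FALSE holds the false negatives.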
You need to think it through logically, but once you do it a couple of times you will get it. So there is your table. The TRUE and FALSE here are the predicted values, and the TRUE and FALSE on the rows are the actual values. In our case, this is for the copy number and this is for the gene, and this is the table that you created manually by counting dots before. Now let us say I want to change the threshold to 3. I set the threshold to 3, repeat the same command I had before, and print the table again. So if you set it to 3, this is the result you will get, and so on. There is a construct called a for loop where you can say: start with a threshold of 1 and go up to 10, and it will calculate this table automatically and print it for values from 1 to 10.
But I thought one should never use for loops in R. That is true; that is the more complicated part. There is a way to do it without for loops, in a single line of code, and I think that is for when you come to the advanced course next time. So you are allowed to use for loops until next year, but after that, no more.
I hope the session was informative for you all. You learned some basics of R, followed by the prediction analysis. Dr. Fenyö showed you how to set an optimal threshold and count the samples. We also learned how to color code the samples that have high copy number. Further, he showed you how a binary classification model can be used to classify the elements of a given set into two groups. There are many metrics that can be used to measure the performance of a classifier or predictor. Different fields have different preferences for specific metrics because of their different goals. For example, for clinical applications, sensitivity and specificity are often used.
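The threshold sweep from 1 to 10 mentioned in the session, first with a for loop and then without one, might look like the sketch below. The data are invented, the helper name errors_at is hypothetical, and sapply() is just one common loop-free option; the session did not say which approach it had in mind. The criterion is the sum of false positives and false negatives, which was the one suggested for this exercise.

```r
# Invented toy values for six samples.
xt <- cbind(
  ERBB4     = c(4.1, 6.2, 3.8, 5.5, 4.9, 4.4),
  ERBB2.CNV = c(0.5, 3.1, 1.2, 2.4, 0.1, 2.8)
)
actual <- xt[, "ERBB2.CNV"] >= 2

# Hypothetical helper: number of misclassified samples (FP + FN) at threshold th.
errors_at <- function(th) {
  predicted <- xt[, "ERBB4"] >= th
  sum(predicted != actual)
}

thresholds <- 1:10

# Version 1: an explicit for loop that fills a result vector.
errors_loop <- numeric(length(thresholds))
for (i in seq_along(thresholds)) {
  errors_loop[i] <- errors_at(thresholds[i])
}

# Version 2: the same sweep in one line, without writing a loop.
errors <- sapply(thresholds, errors_at)

# Threshold with the fewest misclassified samples (first one if there is a tie).
best <- thresholds[which.min(errors)]
best
```

For a different goal, such as favoring sensitivity over specificity, only the body of errors_at would need to change.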
I recommend that you all learn some basics of R, which will help you a lot in doing statistical analysis and prediction analysis easily. There are many publicly available resources where software and code are already available, but you need at least some basic ability to run that code in R. In the next supplementary lecture, a TA for this course, Deeptarup Biswas, will discuss pathway enrichment and network analysis. Thank you.