Hey there! Welcome to the Riffomonas channel here on YouTube. This is Code Club. What is Riffomonas, you ask? Well, thanks for asking! It's a play on the word riff. In music, a riff is kind of like a sample: a theme that gets repeated over and over, perhaps by different musicians, in different parts of a song, or in different parts of a musical. You hear that theme repeated in slightly different ways. And my approach when I teach is largely to riff. The way you'll see me teach is that I'll take code from one context, bring it into a new context, and show you how you can adapt it to get something done that you want to achieve. We see that time and again, and that's ultimately how I and many other people have learned to code. That's in contrast to starting with a blank screen in RStudio or wherever. That's not how we generally learn. Perhaps people in computer science are special and that's how they learn, but for most of us biologists, that is not how we learn, and it's certainly not how I learned R. So I'm trying to teach you the way I learned.

Now, what does that have to do with today's episode? Well, there are many places we can go to get code that we can then riff on. We could do a Google search: you could Google "how do I make a density plot in R?" and invariably you'll end up on a page at Stack Overflow, get someone else's solution, copy it into your instance of RStudio, and then adapt it for your problem. Alternatively, another approach that I often use is to look at the man page, the manual page, for a command I'm interested in, say geom_histogram or geom_density; at the very bottom are some examples of how to use that command. I've learned a lot by taking that code, plopping it into an R script, and then working with it to achieve what I'm trying to get. Another approach is that perhaps you're reading a paper and, oh my goodness, these people actually posted their code. You can go to that project, find the code, plop it into your RStudio window, and then go to work on it. Sometimes there's code that's been floating around a lab to make different types of plots or to do different types of statistical analyses. You could grab that code, again, put it into RStudio, and go to town with it. And sometimes you've been working on a project, you go away for a week or two, or longer, and you come back to your code thinking, I wish I knew what this code did.

In all of these cases, though, we're left with the question: what does that code mean? Perhaps the person didn't write their code the way you're used to writing yours. Perhaps you're not sure what you wrote yourself. That is the topic of today's episode of Code Club: what strategies can you and I use to better understand the code that we find out there in the wild and that we want to appropriate for our own analyses?

In the past week, I've been doing some more work on the project we've been developing in recent episodes of Code Club. As you'll recall, we've been building an L2-regularized logistic regression model that uses the relative abundances of microbial populations in a person's stool sample to predict whether or not they have a screen-relevant neoplasia. Instead of screen-relevant neoplasia, think SRN, or maybe just colon cancer. So I developed these two models.
I generated AUC values because we're doing 80-20 splits, where we take 80% of the data to train the model and then test it on the held-out 20%. We iterate this, say, 100 times, and for each iteration we get an AUC value. So I have 100 AUC values for the model with just the microbiome data and 100 AUC values for the microbiome data plus the FIT result. One question, then, is: how do we know whether the differences in the AUC values are significant? This is an important question because I'd like to do some more fine-tuning of my models. Perhaps I want to change how I do the pre-processing. Perhaps I'd like to add other features about the patients, say their gender, weight, or height. Perhaps I'd like to try other types of models: L1-regularized logistic regression, random forest, decision trees, or any of these other approaches. Perhaps I'd like to try a different taxonomic level. But I don't have a tool that lets me compare the distributions of these 100 AUCs from the 80-20 splits across different modeling approaches or different sets of parameters and tunings.

So this is a challenge that I'm not really sure how to tackle. But I remembered that a former postdoc in my lab, Begum, wrote a great paper in mBio, which you should all be reading. In it, she took the same data set, actually at the OTU level, and generated models using six or seven different machine learning approaches. One of her goals was to say: we should try multiple models, but then use the model that is significantly better than the others. Which approach gives us the best model? Perhaps it's random forest, but if logistic regression is better, then we should use that, even if it's a simpler modeling approach. I noticed that her plot had stars, and to me, stars mean significance. So I figured her repository must have code for testing significance. Because Begum was so well trained (I might have had something to do with that), I knew the code for calculating the p-values that say whether a comparison is significant or not would be there.

So I went to her GitHub repository for that paper, and wouldn't you know, she's got a nice directory structure that looks a lot like what we have for this project? Hmm, isn't that a coincidence? Anyway, code. That seems like a good place to put code, doesn't it? Going into code, she's got a couple of directories here that I'm not totally sure about, learning and testing. Maybe I'll go with learning, because I see something here about figures, and I see that she has a different R script for each of her figures. What an awesome practice! It's easy to figure out, then, where the code is to build each figure. Now, if I could only remember which figure it was where she generated or showed those p-values. So I went to her paper and found figure 2, where these five models were not significantly different from each other, but they were significantly different from the L1-regularized linear SVM and the decision tree. So I open the R script for figure 2. It's nicely written and laid out with all sorts of comments. Of course, all code could always have more comments. And as I scan through to look at what's going on, something catches my eye, which is this perm_p_value.
And if I look to the right here, I see that she's got the p-values listed as a comment. So this perm_p_value, I'm pretty confident, is the function that I need for my own project, where I could give it the data and the different conditions that I want to compare, you know, a model using just the genus-level data versus the model using the genus-level data plus the FIT result. So now I need to track down where perm_p_value lives. If I come back to the top, I can do a find on perm_p_value, and that gets me the places where it's called, but it doesn't seem like the function is defined in this script. So the next thing I do to find this code is come back up to the top and look for anything that indicates she's loading a library or some other external script. And lo and behold, she is sourcing a file in code/learning called functions.R. I bet it's there. So I'm going to head back to learning and look at functions.R, and again I'm going to look for that function, perm_p_value. And here we go: at line 150, down to 167, is this perm_p_value function, where you give it the data and the two different model names or conditions that we might want to compare. There aren't a lot of comments in here telling me what's going on, but I do see that it outputs a p-value, so I think this looks pretty good. One thing I notice, oddly, is that there's an install.packages for mosaic, a library(mosaic), and it's using a mosaic function in here. I'm not familiar with the mosaic package, so I'm not totally sure what's going on with that. It really would be nice to have some example data and output that I could use to run through perm_p_value to understand what's going on.

Since this project was finished, Begum has gone on and gotten an awesome bioinformatics job at Merck. See what kind of jobs you can get if you engage in reproducible research! Since Begum left, I've hired another postdoc, Dr. Courtney Armour, and Courtney has kind of inherited projects similar to what Begum was working on. I knew that she had code to compare different distributions of AUC values, and I thought maybe she could help walk me through this code. What she told me was that Begum actually gave her a zipped directory, and in that directory was an RStudio project with code, example data, and example output. That would be wonderful for helping me walk through what this chunk of code is doing. Now I have Begum's code, I have example data and example output from Courtney, and I've got somebody who has actually used the code. So these are all great resources for helping me understand what the code is doing and what the output looks like. I can then take that and, hopefully, adapt it to my code to compare these two sets of AUC values.

My goal for this episode is to show you the strategies that I use to better understand code that's new to me, and we're going to do that by going through this code that I got from Begum and Courtney. Go ahead, down in the comments below, let me know what strategies you use to try to better understand code that other people have written, even if that other person was maybe you, six months ago.

Here I am in RStudio. I've got Courtney's project opened up, the one she got from Begum. You'll see that there are four files here: a script to plot the AUCs with stats, a script for the p-values by condition, and the example data CSV. This is the type of output that we're getting from mikropml. I see I've got the CV metric, that's from cross-validation, and the AUC.
So I'm most interested in the AUC. And then there's a condition column at the very end. As I scroll through here, there's condition A, then I get to condition B, and if I jump to the bottom, condition C. So it looks like there are three different conditions. I'm really only interested in comparing two of them, A and B, say. So we'll hold on to that. But let's go ahead and look at these two R scripts and see if we can understand what exactly is going on.

They're using some packages in here, like ggpubr, which I had to install, as well as mosaic. As we look through here, we can read through and try to understand what's going on. It looks like we're reading in p-values by condition.csv. I don't see that file, and I'm not quite sure where it's coming from. There is a data read_csv, but this comes after that file was generated, so that's a little bit odd. And there's a plotting step here that seems to generate a couple of figures. So maybe this script is actually run afterwards; this is plot AUC with stats, so maybe we need to get the stats first.

So, the p-value condition script. Aha, there's our function, perm_p_value_cond. From line 5 to 21 is an updated version, with clearly more comments than what we had up on Begum's repository. Now we see example input data, a data read_csv, that is that example CSV file with the three different conditions. And then she's doing something here, I think making a table to fill with the different p-values comparing A to B, B to C, and A to C. That's some stuff I don't care tremendously about, but for now I'm going to go ahead and source this whole file, make sure that it works, and see what kind of output I get in p.table.condition.

That took a few moments to run. One thing I know about this procedure is that it's a permutation-based test, and as I look back up here, I see that it's doing something 10,000 times. I think it's doing a shuffle. Yeah, here we have a shuffle of those conditions to make a null model that it's then comparing a mean to. So anyway, we'll hold on to that; that's why it took a little bit of time to run. But if I look at p.table.condition, I get three comparisons between the three different conditions, as well as their p-values. That's all interesting, but again, I'm only interested in one comparison.

So why don't we trace the logic as we go through the data? We've got data, so I'll go ahead and run this data line, and again we see basically that CSV file we had before, with the three conditions. As it comes through here, maybe I'll hold off on running all of this because I'm not sure I totally care about it all. I think up through this step, the select, it's making, if I look at p.table.condition, a table that's got two columns for the two conditions being compared. Then it's doing a group_by and summarize to get the p-value for each comparison, and that's where it calls perm_p_value_cond with condition 1 and condition 2. So I'm going to come back up to the top, and I'm going to create a data filter for condition not equal to condition C. This gives me a data frame with 200 rows that I can then feed into perm_p_value_cond. I'll go ahead and make it simple and do a select on condition and AUC.
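Just to make that concrete, here's a minimal sketch of that filtering step, assuming the example CSV has a condition column with values A, B, and C and an AUC column; the file name is a placeholder for the example data file in the project:

library(tidyverse)

# Read the example data; the file name here is a stand-in for the CSV in the project
data <- read_csv("example_data.csv")

# Keep only the two conditions we want to compare and the columns the test needs
two_conditions <- data %>%
  filter(condition != "C") %>%
  select(condition, AUC)

two_conditions   # a 200-row, two-column data frame: 100 rows of A, 100 rows of B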
And so now I get this two-column data frame that's being fed in as data into perm_p_value_cond. I'll go ahead and call this data, and my first condition name will be condition A and my second condition name will be condition B. I'll go ahead and load both of those. These are the values, then, that I would use inside of this function. And, I guess, this is doing what I had already given it up ahead here: it's already doing a select and filter for those names, so this test sub sample will probably be the same as what I had up in data. Okay, so that's fine. And again, we've got comments in here telling us a little about what's going on.

What we're doing is quantifying the absolute value of the difference in mean AUC between the two conditions. This is the observed difference, which is obs. So the test first calculates the observed difference between the two means: the mean AUC for condition A and the mean AUC for condition B. That's the observed value, and then we want to know how extreme that difference is. What it does down here in this AUC.null step is, I think, as the comment says, shuffle the condition labels and calculate the difference in mean AUC 10,000 times. It takes this data frame, or really this AUC column, keeps it fixed, but randomly reassigns the condition label in the condition column, and then calculates the mean for condition A and the mean for condition B. That's the difference in means you'd expect if there were no real difference between condition A and condition B. It does that 10,000 times to get 10,000 differences in the means when you shuffle A and B. Then the question is how those compare to the observed difference: are those shuffled differences in mean AUC greater than the observed one? If they're greater than or equal to the observed difference, we count them, and then we can calculate a p-value as the fraction that are greater than or equal to the observed difference in means. That's what gets returned as the p-value.

Let's look at this AUC.null line, because there are a few things going on in here that I'm not totally sure about. I'm going to go ahead and add a line break so it doesn't all scroll off the screen. I'm not totally familiar with this do function; I'm going to guess that it's going to do this 10,000 times, but that's not something I'm super familiar with. To learn more about it, of course, I can look at the help page: "do things repeatedly." So this looks like it's coming from mosaic, and looking at the example to figure out how do works: do(3) * rnorm(1), so do something three times. Okay, it gets a bit confusing whether do is coming from dplyr or mosaic. Something you'll notice here for the mean function is that they grabbed it explicitly from mosaic; it perhaps would have been a little clearer to also write mosaic::do, so that's something to keep track of. It's basically saying: do whatever comes after the multiplication sign 10,000 times. So what's happening after the multiplication sign? That's what's important. There we have diff of mosaic's mean of AUC tilde shuffle of condition. I guess we never quite covered what mosaic's mean is doing that's different from a normal mean, right?
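Before digging into that, here's the do() idiom from the help page as a tiny runnable snippet; this is just a minimal sketch to show what the operator does, assuming the mosaic package is installed:

library(mosaic)

# do(n) * expression evaluates the expression n times and collects the results
# into a data frame, one row per repetition
do(3) * rnorm(1)   # three independent draws from a standard normal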
What it seems to be doing is taking mosaic's mean of AUC ~ condition, which reads as the mean of AUC as explained by condition, using test sub sample. I'm going to go ahead and run the part inside of the diff statement, and what we get is condition A and condition B with, I think, the mean AUC for those two conditions. What I'm curious about, though, is whether I'd get the same result with dplyr, tidyverse-type code: test sub sample, group_by on condition, summarize the mean of AUC. And I get the same values, with a little bit of rounding. Then what diff does is calculate the difference between condition A and condition B, and abs takes the absolute value of that, giving us 0.073.

So what's going on in the second version, where we have shuffle(condition)? I'm curious what shuffle does to condition. If I look at test sub sample, I've got condition A at the top and presumably condition B at the bottom. If I do shuffle on the condition column of test sub sample, that randomizes the order of that column. So it's randomizing the labels and then calculating the difference in the means with that shuffling. If we run the AUC.null line, it's going to take a moment or two because we're doing 10,000 iterations, and if I look at AUC.null, I get all of these permuted values.

Then n equals the length of AUC.null square bracket comma one. The square brackets are a way to pull values out of a data frame, and the comma one means get the first column out of AUC.null, which is these values here; length then tells us how many there are, which is going to be 10,000. Then there's the sum of the absolute value of AUC.null being greater than or equal to obs. I'm not sure why we need the absolute value here, except that it wasn't applied up on the line where AUC.null gets calculated; so the absolute value gets used on the obs line and then again down here, but not in between. If we look at the inside of that sum, what we get back is a series of TRUEs and FALSEs, and I need to make sure I've got obs loaded. So we get those TRUEs and FALSEs, and it looks like a lot of FALSEs. Numerically, FALSE has a value of zero and TRUE has a value of one, so if you sum them, you get the number of shuffled differences that were at least as big as the observed one. And there were zero. There were zero cases where the shuffled difference in mean AUC was equal to or greater than what we observed, which tells us that our observed value is actually pretty extreme. Then we have r plus one divided by n plus one, which is a bias correction; I run that, and our p-value is 9.99 times 10 to the minus 5, so basically minuscule. That, then, is our p-value.

So I think I have a pretty good handle now on how this function works. We give it a data frame that has a column for the AUC and a column for the condition, and we give it the two conditions we're interested in. It calculates the observed difference in mean AUC between the two conditions, then does the same calculation 10,000 times, shuffling the condition labels each time. That lets us determine, down here, the number of cases where the randomized difference in means is greater than or equal to the observed difference, which in turn lets us calculate the p-value.
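Putting the pieces we just walked through side by side, here's a minimal sketch of the whole permutation test in the mosaic style, as I understand it; test_sub_sample stands in for the two-condition data frame inside the function (the exact object name in the original code may differ), and this is my reconstruction rather than Begum's code verbatim:

library(mosaic)

# Observed absolute difference in mean AUC between the two conditions
obs <- abs(diff(mosaic::mean(AUC ~ condition, data = test_sub_sample)))

# Null distribution: shuffle the condition labels 10,000 times and recompute
# the difference in group means each time
AUC.null <- do(10000) * diff(mosaic::mean(AUC ~ shuffle(condition),
                                          data = test_sub_sample))

n <- length(AUC.null[, 1])            # number of permutations (10,000)
r <- sum(abs(AUC.null[, 1]) >= obs)   # permutations at least as extreme as observed
p_value <- (r + 1) / (n + 1)          # permutation p-value with the +1 bias correction
p_value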
So what I'd like to do, because I've been so heavy on dplyr and the tidyverse in this series of episodes, is do my best to change these lines to remove mosaic and rewrite them in dplyr. The advantage there isn't anything against mosaic; it's that dplyr and the tidyverse are far more widely understood by me, the people in my lab, and the people I'm teaching, versus having to teach them, and when I say teach, I also mean update people or get people up to speed on, functions from the mosaic package. Why add another package if something's already available elsewhere? There are tradeoffs, obviously.

Before I put this into my own project, what I'd like to do is modify the code that I got from Courtney and Begum, using the data they gave me to make my modifications, and then make sure that those modifications don't break anything. So the results that I have here in p-values by conditions, I should get something very similar to those once I'm done messing with the code. Doing that within this setting, with the code and the example data, is a really nice testing framework. There's this other approach called test-driven development that I really encourage you to look into if you're developing functions, where you always want to make sure that as you modify those functions you're not breaking the code. I'm doing something like that here, but it's a little bit messier.

Before I go too far, I'm going to delete the extra code that I used to make my own version of the test sub sample, and I'll use the data that's coming into the function. I will then look at what's going on here and think about how I would change the code to make it do what I need it to do. I'll put a line break in this comment so I don't have comments and text going off the right side of the screen.

The first thing I want to take on is this line 15, the observed difference. I want to replace this with tidyverse-type code. What I can do is, again, make sure we've got test sub sample loaded, take test sub sample and pipe that to a group_by on condition, then do a summarize with mean equal to the mean of AUC, and that gets us the mean for our two different conditions. Then I can do another summarize with diff equal to the diff of mean, and that gets us our negative 0.073. I can then wrap that in abs to get the absolute value. For our p-value calculation down here, I'm going to want that to be a single number; currently it's a table of one row and one column, so I'll do a pull on diff. That gets me the same thing that I had, believe it or not, right here; the original gives a named vector, but I don't think the condition B name matters. We can test that nothing broke by commenting out line 15 and putting our version of obs there. I'll save this, reload the function, reload all this other stuff, and go about regenerating those p-values to see if they've changed with my modification. So there's the output; 0.739 is the one I'm mostly looking at, and that looks pretty spot on. There's going to be some variation because we're using a random number generator; as we use a larger number of iterations, and currently we're using 10,000, the number will become more precise. We'll come back up.
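For reference, here's roughly what that tidyverse replacement for the observed difference looks like; it follows the steps I just described, and the object and column names (test_sub_sample, condition, AUC) mirror how I've been describing the data rather than being lifted verbatim from the original script:

library(tidyverse)

obs <- test_sub_sample %>%
  group_by(condition) %>%
  summarize(mean_auc = mean(AUC)) %>%       # mean AUC for each condition
  summarize(diff = abs(diff(mean_auc))) %>% # absolute difference between the two means
  pull(diff)                                # collapse the 1x1 tibble to a single number

obs   # should match the roughly 0.073 we saw with the mosaic version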
The next thing I want to take on is the shuffling and creating the null distribution. I'm basically going to do exactly what I did up here on lines 17 through 21, but shuffling the condition column. Let's put this on separate lines so it doesn't get all scrolly. What I'd like to do is encapsulate this as a function, because then I can call that function down here when shuffling the labels. What that will look like is having obs equal a function that I'll create called get_mean_diff, which I'll give test sub sample to, and that will then output my observed value.

So now I need to create that get_mean_diff function. I'll do get_mean_diff equals a function, and in the body we'll put that code that I lifted. I'm going to call the argument x, with x being the data frame that's coming in, where we'll group by condition and all that, and the result will automatically be returned. So if I do get_mean_diff on test sub sample, we should get that 0.073. We're in good shape there.

The next thing I want to do is generate the null distribution, and what I'm thinking is something like a map_dbl over one to 10,000. What this will do is output a double vector, a numerical vector, and it's going to iterate over the values from one to 10,000. I don't care about the individual values; I just want to repeat it 10,000 times. What I want it to do is run a function that would be something like get_rand_mean_diff on test sub sample. Okay, so I don't have that function, but I'm going to make one, so we'll comment this out for now. I need to make get_rand_mean_diff, and I'll do that back up here. We'll make it a function, and with x, I'm going to take the condition column and randomize it. So I'm going to use a mutate on the condition column, and I'll do sample on condition. And now we see that we do have a random shuffling of those two conditions. Just to double-check that we have the same number of condition A and condition B that we started with, we can do a count on condition, and indeed we get 100 and 100. Good. Now that we have this, we can pipe it to a summarize that calls our get_mean_diff; that gives us something like 0.01. I'll say diff equals that, and then we can pull diff, and we get the difference that we see. We get different values each time because we're doing this sampling, getting a different permutation of the condition column each time. Good.

So what we've done is encapsulate the workflow that was happening right in here into its own function, and it's more readable, where I say get the mean difference. And now what I have here is get the random mean difference on test sub sample, which I think should work, but we'll see. I'll define the function, load get_rand_mean_diff, and then rerun that, and I get, again, a small value. Good, so that works. I can give get_rand_mean_diff a data frame, a table, and it will permute the condition column and output the mean difference. So now what I can do is set AUC.null equal to all of this, and maybe for testing purposes I'll have it go from one to 10. If I look at AUC.null, I see my 10 values. So that's good, and I can then ramp this up to 10,000.
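Here's a minimal sketch of how those two helper functions and the map_dbl call fit together; the function and object names follow how I've described them, and the condition and AUC column names are assumed from the example data:

library(tidyverse)

# Absolute difference in mean AUC between the two conditions in a data frame
get_mean_diff <- function(x) {
  x %>%
    group_by(condition) %>%
    summarize(mean_auc = mean(AUC)) %>%
    summarize(diff = abs(diff(mean_auc))) %>%
    pull(diff)
}

# Shuffle the condition labels, then compute the same difference in means
get_rand_mean_diff <- function(x) {
  x %>%
    mutate(condition = sample(condition)) %>%
    get_mean_diff()
}

obs <- get_mean_diff(test_sub_sample)

# Null distribution: repeat the label-shuffled calculation 10,000 times
AUC.null <- map_dbl(1:10000, ~ get_rand_mean_diff(test_sub_sample))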
So we get that AUC.null, again, 10,000 values. Something that occurs to me is that the first-column indexing doesn't apply anymore: if I try the square bracket comma one, I get an "incorrect number of dimensions" error, because this is now a vector, so there's no column dimension. Now if we look at n, it's 10,000. And here, looking at r, we don't need the absolute value because we already took it inside get_mean_diff, so I can remove one set of parentheses, and I can get rid of that column-one indexing. Then I can look at r and see zero, and our p-value will again be very small. I'll go ahead and save that, and let's see if we can't get through without any error messages. The p-values that I get from my updated code look spot on with what I had previously, so I'm really happy with that. The code works.

One thing about it is that it's horribly slow. I did not time it, but I think all these pipes from dplyr tend to slow things down quite a bit. That's something that I might come back and play with later. But I've got something that works, that is readable, and that I understand. The speed is an issue that I'll probably come back and try to clean up later, to see if I can't make it a little bit faster.

The last thing that I want to do for this episode is take this code that I've refactored and plop it into a new R script as part of my mikropml demo project. I'll come back down here and grab all this code, create a new R script in my mikropml demo, and paste it in there. I also need to load the tidyverse library, and then save this in code as get p values. So that is good. Now I can come back to my compare genus AUCs script, and I'll need to do a source on code/get p values dot R. Make sure that's loaded up; I misspelled tidyverse, so let me fix that and re-source it. Then I can come down and use the function that I created, perm_p_value_cond, where I give it my AUC comparison data frame, which, just to see what it looks like, has a condition column and an AUC column. If I take the AUC comparison and pipe it to a count on condition, I've got 100 genus and 100 genus-plus-FIT AUC values. Again, those are coming from the 80-20 splits: the 80 percent we train on and the 20 percent we test on. We'll call the conditions genus and genus fit, run that, and see what kind of p-value we get.

So we got a really small p-value. There were no randomized differences in mean AUC that were as large or larger than what we observed. Looking at my values here, we see that the genus AUC averaged over those 100 splits was 0.642, and for genus plus FIT it was 0.721. We can now say that those models differ significantly in quality. It's interesting, because FIT alone, I believe, had an AUC of about 0.7, so the genus data adds a little bit of information on top of FIT. It would be interesting to go back and think about how we would test to determine whether genus plus FIT is better than FIT alone. I'm pretty sure that it is better, but practically speaking, is it that much better? I don't know. Clearly there are a lot of other permutations we could try on the model: how we pre-process the data, the taxonomic level we're looking at, the type of modeling approach we use. But now that we have the ability to calculate these p-values, we can compare each of those models and figure out which modeling approach is the best, so we can come up with the best possible diagnostic.
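And for reference, a minimal sketch of how that comparison gets called from my analysis script; the file path, object name, and condition labels here reflect how I set things up in my own project, so treat them as placeholders if you're adapting this:

library(tidyverse)

# Load the refactored permutation test functions
source("code/get_p_values.R")

# AUC_comparison has one row per 80/20 split: a condition column
# ("genus" or "genus_fit") and an AUC column
AUC_comparison %>%
  count(condition)   # expect 100 rows for each condition

# Compare the two distributions of AUC values
perm_p_value_cond(AUC_comparison, "genus", "genus_fit")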
I want to recap the types of strategies that I used with you today. First, we read through the code; we tried to trace where the data would go through each of those steps. Second, we had example data, or, you know, you could create example data if you don't get it with the code, so that we could work through each step of the pipeline. We also started to refactor the code, to rewrite it in a language, so to speak, that I understand. These are all strategies that we used to modify and understand the code within that demo project that I got from Courtney, and then I could plop it into my code for the analysis I'm doing to compare the genus-only model to the genus-plus-FIT model. In the end, I understand what's going on and I feel much better about the code.

Something else to highlight is that I went down this path because mikropml didn't have a function for comparing distributions like this, and I think that's pretty important. So I did go ahead and file an issue on the mikropml package on GitHub suggesting that we add this feature. I think we'll probably use something a little more like Begum's code, since it is so much faster than mine. But, you know, this is one of the nice things about having things on GitHub: we can have this discussion about code. The really nice thing about having things on GitHub, as you saw, is that I could go get that code and work with it; you could take code from any of my projects, bring it into your own work, go through the same process of trying to understand what's happening, and then adapt it to your project. Going back to the beginning of this episode, that is what Riffomonas is all about, and that's how I think you and I learn best how to program.

In light of all that, a question for you: what stumbling blocks do you still have to posting your code on GitHub for whatever project you're working on? When you write up the paper you're working on right now, are you going to post the code to GitHub? I really hope you do, because I'm sure you've got some gems of code in there that I could borrow for my own analyses. This social coding is going to make us all better. But even with all this effort to make the code better and more understandable to me, I can guarantee you that six months from now I might look back at this code and wonder, why did I do that? Or someone else might be watching the video and think, why did Pat do that? So it's a continual process, learning how to code better and writing better code. I really hope that you've liked this video and that you've subscribed to the channel, so you can be alerted when the next episode drops and we can all learn to program better.