 I promise you that it's like half our studio hex stickers, half bioconductor hex stickers. All right, I'm going to go ahead and share my screen. I know we're not supposed to start for another about five minutes, but I want to give people a chance to find the materials. Let me move all of these WebEx tools, things out of the way. Nope, I can't move that. Just kidding. Okay, so this is the workshop that we're going to be going through. I assume that everybody has seen a bioconductor vignette before, so it should look very, very similar. I want everybody to be able to find the read me. Yeah, I'm just going to put them on the top. That's all. I don't know if everybody can read this. I'm going to put this in the chat for folks on, if I can get to the chat. Let's see. Hover over this area to access. Yeah, if you hover over the top, there'll be a participants list and then a chat box. You should be able to open it up like a floating window once you do that. Okay, I don't know if folks in the room are able to find this repository, but it would be great if you could go ahead and find that now. And the little window is kind of small. So if someone, I guess, can shout at me or wave or give me a thumbs up that they're able to find this repository on GitHub. All right. Three, three thumbs up for thumbs now. Okay. Yes. Yes. Okay. All right. There's a read me here down at the bottom of the read me. There are two links. They're powered by get hack. One is to a development rendering of the vignette. One is to a production rendering of the vignette. The production one is supposed to work, but just in case something goes a little haywire, I included the dev link as well. So please find this and click on it. If you are incredibly GitHub savvy, you could just fork this repository and pull it down and you'll get all of the data sets. If you do not want to do that, you can just kind of read along and download the data sets later. I'm going to try to go through some of the code, as much of the code as I can live. There are three pieces that require a couple of minutes of computing and I have notes about that, but I just want to make sure that everybody that would like to run the code along with me is able to do that. So if you want to fork this, go ahead and do that. If you're not as comfortable with get and GitHub, that's okay. You don't have to fork the repository. You can still find this link, which is this right here. So can someone give me a couple of more thumbs up saying that you're able to find this document, the get hack link and you're able to read along with me? I'm looking to see if I see thumbs up or thumbs down. You guys are about an inch and a half wide and an inch tall. So okay, are those thumbs ups? I hope that's a thumbs up. You guys are like literally this big on my screen. That window is resizable if it helps. If you have a multiple monitors, you should be able to drag it off as well. I'm on a laptop because I'm in Washington DC going to the RStudio conference. They scheduled the two conferences I care about at the same time. So I had to make Sophie's choice and I went with the cheaper one because I can't fly to Seattle right now. I'm from Miami and Miami to Seattle is a very, very expensive plane ticket. But next year, hopefully Bioconductor comes back to the East Coast and I get to see everybody in person. So I have this document open. I also have, I was in the Daggerty webinar. So let me switch over to the repository for this project. Was anybody able to go to the causal inference webinar? I thought it was pretty cool. Okay, so here's the behind the curtain. I'm sure everybody is familiar with what R Markdown looks like. I'm going to create a new script just so that I can copy and paste the code out of the, out of the knit document. So is everyone able to see my RStudio screen? Fine. Do I need to zoom in? Would it be better if I zoomed in? Is this better? Do I need to go farther in? I'll go one more and hopefully that's not too big. Okay. Oh, thank you, Jared. Oh, I've got to pay attention to the video and I've got to pay attention to the chat box. All right, it is 6.30. Do I have permission to be introduced? Or is that, is this a good time? Do I introduce myself or does someone else introduce me? Gabe, can you hear me now? Yes. Is that the only microphone in the room? No, they're, well, I guess it's the only one that's hooked up. There are some portable mics that have, that they have that they could deploy, but this is the only one right now. But we can bring people up here if people want to ask questions. Okay. Yes, please. I would, I would like this to be more, a little bit interactive and less me just throwing information at everyone. That's great. So that's great. So do you want me to introduce you or do you kind of already jumped up and introduced yourself before I could even start? So I was just pleased and I was just letting you roll with it. But if you want an introduction, I'm happy to introduce Gabriel and this is, he's going to talk about his, his package, Co-Best DMR. Hi, everybody. Again, incredibly disappointed that I can't be in Seattle with all of you. Hopefully next year I get to get to hang out with everyone together. Honestly, bioconductors is, if not my favorite conference, it is at least in my top two or three favorite academic conferences or any sort of research conferences. So I am, I'm, I'm so, I get to see people's little names and faces and I'm like, oh my gosh, I haven't seen Charlotte so long or I haven't seen, you know, it's, it's, it's great. So I'm, I'm hoping that everybody's been able to find this, to find this document again for those of you that are, that are new or just walked in, I'll put the link in the chat again just in case. So that link will take you to this GitHub repository and then at the bottom of the, the read me in this GitHub repository, there is a link to this, this workshop document. On WebEx, is it like Zoom where if people join late, they can't see the chat that happened above them? It is not persistent chat like a Slack or Teams. So if people come later, the chat message will So we typically just repost chat messages using copy and paste. Is there, is there a, an official host to the meeting that could periodically just post a link to this, to the workshop? Whenever people come like people new come in? I mean, I can just You really like to have a conversation. It's, it's great. So here's the thing. There is normally for every session supposed to be a person online who is, you know, jotting down any questions that pop up into the chat. I'm probably going to do that in this session. And there's not a ton of people in here and producer stuff is being handled actually by our excellent AV team. So I can do that and I can post those questions and then we can ask them at the end. And if people have questions, I'll keep an eye on the audience and you can, if you see somebody raising their hand or something, you can, you know, have me make sure to have them come up and they can ask them question on this microphone here because we have a live mic. Okay. So what happens if they run the code and they see an error? Can we do that during the session or is that something that we need to wait until the very end? So the session has been set up as a meeting, which means that anybody in here should be able to share their screen if they want to, you know, if they feel brave and don't mind having everybody see their error, they can share their screen and you can talk about it. Oh, okay. Perfect. Perfect. I'm not sure. I've, for most of, well, for all of today, I've been on the other side of the camera, all of the amazing talks. So, okay. And there's something to, there's a link to a Google doc. What's the, is that for where the questions go? Yes, I'm going to go grab my computer. I'm just going to stay up here because otherwise I can't watch the room. Oh, okay. Can you still see the screen if you're interested in seeing this? I can never. All right. Don't worry about me. I'm basically not going to get, I'll be watching this again later. Okay. Okay. All right. All right. Well, I know that we've spent about five minutes talking about things. Not going to ask everyone to go around and introduce and tell me a fun fact about them and how many brownies they like to eat. But that's normally how I tortured my students during like the first day of a first day of class as we get to know each other. But that's, that's okay. So I want to talk to you about the title is Detect differentially methylated regions. A little bit of background for this. I did my postdoc at the University of Miami in the Sylvester Cancer Center. And I was trained to work with like doing a little bit of biostatistics, a little bit of bioinformatics, genic-based stuff. And then later on in my postdoc, as I was transitioning out of my postdoc into a faculty position, some of the, some of my, some of my colleagues and my mentors approached me about looking at Alzheimer's and neurodegeneration data. And I learned very quickly that genomics is often the right tool for Alzheimer's and neurodegeneration. Basically brain diseases, they sometimes have the gene expression signals or protein expression signals. But a lot of times some of the more fruitful areas of research when we're dealing with brain diseases is to look at these epigenetic changes. So modifications that happen through a combination and interaction with the gene and the environment. Now I don't want to belabor the point. I don't want to go through like incredibly basic stuff. I'm, I'm sort of assuming that most people in the room or at least I, if you're coming to a talk about differentially methylated regions you probably know what DNA methylation is. So I don't want to waste a whole lot of time. So go into the workflow here. The idea is you have a phenotype of interest in some of our research. It's Parkinson's. It's some of our research. It's Alzheimer's or neurodegeneration or dementia. This example was heroin use. So many of you, to give you a little bit of the epidemiological context, there's something called the opioid epidemic in the United States where many, many Americans were prescribed opioids for pain relief after surgeries or after or whatever. And they became addicted to these opioids. And some people, if they can't get access to these prescription opioids that they're taking, I mean they'll find like a street version of a prescription opioids. Many of them will turn to injectable opioids like heroin. And the concept of addiction, we're starting to really, I mean people, psychologists have been thinking about this, this way for quite a while, but it's important that we treat addiction as a disease of the brain. And so we started thinking about this and we said, well what if we take some of these same techniques that we were applying to look at genomic profiles of Alzheimer's disease and cancer, well not cancers, but Alzheimer's disease in Parkinson's and other neurodegeneration diseases and we apply it to addiction diseases. So we were able to partner and by we, this is a combination of team, a combination from folks at the University of Miami in Miami, obviously, Florida International University, which is where I am a faculty member now, which is also in Miami and Columbia University. And so we have some folks from Columbia University that are studying heroin users. And so we got a data set with DNA methylation data for healthy controls and heroin users. And we said, okay, let's take a look at what's going on, see if we can identify parts of the genome that get, if there's hypo methylation or hyper methylation, so hypo meaning under methylated, hyper, I mean over methylated, then what would happen by random chance that is related to the phenotype of interest in this case continual illicit addictive heroin use. So here's the workflow. We would have some methylation array data and as part of the workflow, I'm kind of going to kind of breeze over a little bit, but basically you take a given your methylation probe of interest, right now we're supporting Illumina 450k and Illumina Epic arrays. So given your version of the array, then you can take find roots or predefined regions in the genome already have an excess out of CPGs. So a little bit about a little bit about the DNA methylation, there's a ton of background noise from someone, can you please, yes, there is a little bit of background noises. I'm not sure, but yes, Jared Andrews has brought that up. Oh, that sounds a lot better, thank you. Yeah, yeah. So quick recap of DNA methylation, you've got the 3.2 billion base pairs in the genome and sometimes you'll see Cs and Gs that are right next to each other. And there's an interesting thing in the genome where these CG pairs often happen a lot more than they should if it was just a statistically random, some sort of randomly generating process that would generate these CG pairs, and they show up a lot more than they should. And so these are called CPGs, so there's the C of phosphate and G. And a lot of times these areas will attract excess addition of methyl groups, which is where the methylation comes from. So this section of the workflow, this is sort of a completely data agnostic step. So we've actually already calculated some of these results for any 450k data or any Epic data in the genic regions and in our genic regions, just because it's based on the locations in the genome where these extra CG pairs show up. And so that's this part right here, this close by CPGs function. I will not be talking about this part of the workflow because for almost all of your data, it's going to stay the exact same. It's just the idea that any time you see a CPG that's within about 200 base pairs of another CPG, also within about 200 base pairs of another CPG, we just call that a region. So the output of that would be something that looks like this. It's going to be a ragged list in R. And it's going to have the name of a region. So here we see chromosome 22 in these positions. And in this region, from this position to this position, we see a couple of these CPGs that the Illumina probe is looking at in that region. So we have this part. And then we're going to combine it with the methylation data that's been recorded. And we're going to find out if these regions have concurrent methylation or co-methylation. And we'll talk a little bit about why that's important later. So there's two stages of this workflow. There's this part of the workflow where we just look at regions of the genome. And if you supply some genomic data, like the DNA methylation data, we would say, okay, well, which of these regions have excess concurrent methylation and what would be expected by chance? And so that's like an unsupervised part of the workflow. Then second, there's a supervised part of the workflow. And that's down here. And we say, okay, well, let's take our phenotype of interest in our case heroin use and find out if this excess methylation in these regions with concurrent methylation is related to the phenotype of interest. So there's the unsupervised portion, which is here, where we just go through your methylation data and say, hey, here are some regions in your genome where there is excess concurrent methylation. So I'm going to fast forward and then come back to this idea. When we mean concurrent methylation, we mean a couple of the probes in that region or have this excess methylation, that the methylation is in a region is not driven by one super hyper methylated probe. And then the rest of them are just normal. But there's a couple of probes in that region that all have a little bit more methylation than normal. So that's the unsupervised portion because we don't take into account the phenotype data in our case the heroin use. And then there's the second portion down here where we incorporate the phenotype of interest. And then we find out, okay, well, are these regions of concurrent methylation is the methylation in those regions related to heroin use or the phenotype of interest? Okay, so far, I've talked a little bit about workflow. What questions do you have? So we've got the two main ideas. There's one section where we do some unsupervised learning. And then there's another section where we do a supervised that's a linear mix. Questions on this so far. And I'll be looking at the chat as well. How are regions of co-methylation different from differentially methylated regions? That's a great question. Okay, so commonly, we are interpreting the idea that a differentially methylated region says that the methylation in this region for cases has a different profile than the methylation in this region for control. So if you think about like, okay, I've got my methylation, I'm going to think about it in rows because I'm a little bit more like a mathematician. So let's transpose. So I've got my samples right here. And these samples are all of the samples of the cases. And then these samples are all of the samples in the controls. And there's a set of probes, there's a region right here where the methylation is different for cases than it is for controls. So that's differential methylation. Okay, when I say co-methylation or concurrent methylation, I'm ignoring up at the top whether or not they're a case or control because it's unsupervised. I just want to know does this region have a methylation profile where lots of the probes are all methylated rather than just one probe being really, really highly methylated. So a lot of times when you look at like, oh, well, what's the average methylation in this region? The reason that it's so high is because you've got an average methylate, you've got one probe that is like super highly methylated, and then the rest of them are just like normal or whatever. And then the average gets dragged up. But when you look at the region itself, you're actually building correlations. You're doing clustering based on correlation. And you're saying, is the methylation pattern at this probe correlated spatially with the methylation at the probe right next to it, which is correlated with this probe right next to this and so far and so forth? So Ian, does that answer your question? So co-methylation does not take into account the phenotype. Looking at the chat. Ian, got a thumbs? Yep, sort of like feature. Yeah, sort of like feature extraction. It's absolutely an unsupervised technique. We are not taking into account the heroin use yet. We just want to know is there some signal going on in this area where the probes have their methylation correlated with each other? All right. So I'm going to use the tidy verse because I'm kind of an R studio junkie, but that's okay. The packages that you will need are here. I'm using pathway PCA. This is another package that I wrote. This is just to transpose the data because normally we have methylation data where the samples are the columns and the probes are the rows. And I'm going to make a plot later. And in order to do the plot and GG plot, I actually have to have that data transposed. If you don't want to make the plot, that's fine. You don't need pathway PCA. Cable extra is to make the tables pretty. You don't need that one. To run the code though, so core plot, again, is for plots. To run the code in the vignette that don't have anything to do with the plots, really, you just need COMF DMR, which you can get that from bioconductor. Now for COMF DMR to work, you will need these two packages right here. If you have never, if you don't, a lot of you are going to have these packages installed already, but just in case. So I'm sure that you have these packages installed on your computer. And then I use the tidyverse, like the particularly deep liar and the pipe operator, that sort of workflow. So let's talk about the data. There is, if you are interested in following along in this data folder, there are a couple of data sets. One of them is this pheno underscore clean. And pheno underscore clean is going to look like this. This tissue sample is just, we don't need it, but it's still there. So these are kind of sample IDs. And then we have their sex assigned at birth. We have case versus control where case being zero means that they're one of the healthy controls. They don't use heroin than the person's age. And then this is something that we can talk about later. But some of the literature that we found in particularly coming out of, oh my goodness, Steve Horvath's group, the famous epigenetic clock. Their research was that the age effect on DNA methylation is often nonlinear. So we're actually including a square of age in our model as well. But we'll get to that shortly. So anyway, we've got our, we can run, if you're interested, you can run this code. If not, just follow along. And that's okay. Here's another caveat right here. Right now I have a, it's a, unfortunately, I would like to call it a feature, not a bug, but it's a bug. Where for whatever reason, I can't have phenotypes that have class logical in R. So it needs to either be numeric or integer. So just be aware of that. It is, it has an open issue. So really, I just add a zero to the logical indicator and the code, the code works. So I'm working on that. But anyway, that'll be the version on bio conductor right now does not have a fix. So this is what the phenotype looks like. And we're talking about, this is African American heroin users and African American healthy controls. So the DNA methylation, this is some boilerplate that we have. I hope that everybody kind of understands what methylation is. And this is Illumina Epic. So we've got on the order of like 870,000 probes. We have a subset of this. And that's right here. This is just from chromosome 16. Also, I have to apologize. I wanted to show you all the real data this afternoon. But I don't know who here works with the National Institute of Drug Abuse. But they have final say over when data sets get embargoed and when papers get embargoed, we were supposed to be able to have the paper submitted this summer so that we could, we could, I could actually show you the real chromosome 16 data, but they have not given us approval on the paper yet. So I have to give you a it's a slightly noise up version of the data. So I took the actual methylation data and then added some particularly spicy bits of Gaussian noise. So it's not quite the exact data, but it's close enough for what we need to do without me taking off the, taking off NIDA. So anyway, I, again, I apologize. I would love to be showing you real data, but I can't yet. So one other, one other thing right here. Later on, we will, I'm using the tidyverse function read underscore CSV. You can totally use read dot CSV and you can get around this tidyverse functions don't support row names. So if you want to use read dot CSV, you can, it'll, you'll still be able to use row names without a problem. So this bit of code is so that we can have row names in our data. The probe IDs should be the row names and the columns should be the samples. All right. Now I'm going to go back up and this is something that we talked about at the start of the workflow that I said it's not super important for what we need to talk about, but it's here. And that is this list of what we're calling close by genes. And this is independent of your methylation data. So this works for any methylation data that is in epic format with Tim before annotation for genic regions. So we decided to remove the intergenic regions just for, for computation, like for speed of compute, speed of computing. This list is also here. So you can have that again, this is in the data folder of that repository that I sent in. And so this is what it looks like. We've got a ragged list here. And then we have a region in chromosome 16. And so based on the Illuminate Epic profile for this data, we have that these five probes would all be each one of them would be within 200 base pairs of each other and there are at least three probes in this region. So that's what the three and the 200 mean. So there's at least three probes for it to be a region. So two probes, not enough. That's not a region, but at least three that makes it a region. And the farthest that each individual probe can be from each other is 200 base pairs. If you want to know a little bit more of the justification for that, the paper down below we published a paper in nucleic acids research, I think in 2019. It took a lot longer for me to get the package on Bioconductor because I'm, I'm just a terrible person. But anyway, this is that's the justification for this. If you're interested, there's a link later on in the in the document. So this is what this looks like. We've got these regions already calculated. They are agnostic to your data. It does not matter what your your actual methylation data is because this is just based on the Illumina manifest itself. Um, so what we do to this, well, I'm going to show a quick example of what the methylation would look like. So I picked a region right here. This is just a for, for example, this particular region. And I say, okay, give me the probes in this region. So here are the five probes. Now show me the DNA methylation for these five probes and give me the first five samples. So we've got our five probes. Here's our DNA methylation. And then this idea of co methylation is asking the question like, okay, well, does methylation for these 96 samples happen concurrently? Is there a relationship between the between the probes? So we can, this is where I said that we would need to transpose the data. So if you don't want to make this figure, you don't have to, you don't need pathway PCA if you're not interested in making this figure. But the idea is we take this data and then transpose it into quote unquote tidy form so that a ggplot code will work. Basically we see that here are my five probes. And then these are box plots for the 96 samples. So we can see, okay, this is what the methylation data looks like for these particular five probes in this region. Okay, then we would say, all right, we want to find out, is there co methylation that happens in this region? So there's a couple of things. And this is, we had talked about in the epigenetics for bioinformatics, the epidemiology for bioinformatics, bioinformatician session that just happened previously. This idea of are there confounders? Like what are things that could modify our analysis? And so in this idea, we don't want to we don't want to just cluster the methylation that happens without taking into account these things that could be modifying the methylation, but that aren't the phenotype. So things like the age of the subject and their sex assigned at birth. Because we say, well, you know what, the methylation could be related to these things. So what we do is we run this function out of comethium R called get residuals. And what it's going to do is it's going to take your DNA methylation data, and it's going to fit this model right here. And it's going to basically regress out the methylation effects that are related to these confounders. Now, if you want to run this on your own, on the laptop that I'm on right now, it takes 1.6 minutes on my PhD students Dell laptop that took 1.7 minutes. So this one should be a short example if you're interested in running it. Basically, a couple of things that you need to know, you need to have the phenodef data, the phenotype data, but notice I am not including the heroin use here. So this makes it unsupervised in this in this case, because I'm not including the outcome of interest. And I am doing a transformation here, I am saying I want to I am supplying beta values, which are bounded between zero and one, and I want to transform them to m values, which are the the log base two or logit base two transformations that go from negative infinity to positive infinity. And then I've given so that you don't have to calculate this, there is another data set. So anything that has already been calculated is here in the results folder. So there are some data sets right here for steps in this workshop where you're already doing which so you don't do the computing. So this one right here residuals by age and sex. So right here that is this, I'm sorry, residuals by age and sex right here. So if you would like to get this, it's a transformed methylation values data set, but it's the residuals of the methylation, not the not the original. And that's what we want to do the clustering on because we want to see, well, what's the clustering actually happen once we've controlled for how old they are and whether or not they're male or female or their sex assigned at birth. So this is, we can see here, these are not methylation beta values anymore, these are residuals. So they're the in the m value space so they can go from negative infinity up to positive infinity. And so we look for areas of co-methylation. Okay. And this is the code to do that. I'm gonna, so this is the code that you want right here. And I'm going to, so right here the output is CPG. So this is very, very important. This is the default. If you want to troubleshoot, so I'm going to show you a little bit of what the results mean. And that's why down here I'm running this analysis here for the output being the data frame. You don't want to do this because you can't use this part for the next step. You want your output to be CPGs. So if you want to do downstream analysis later on, leave this as CPGs. If you need to inspect or troubleshoot the results and find out what's really going on, then use outcome as data frame. But basically we take the residuals data frame here. I'm not trans, I'm not transforming from beta in beta to m values because it's already in value. I want to use Spearman rank correlations to do these clusterings of the probe. And I'm saying that it's an Illumina Epic array, not 450k. And then here is my, this is that list I showed you up above that's already been calculated for us. Let me go back to this. Just this list of all of the close by CPGs that are in the Illumina Epic array. So let me go back here. And then I want to do this in parallel. So I'm going to give it three cores to work on. And this takes seven to eight minutes. You may want to run this back at your hotel later. I don't recommend running it in class. If you're on Windows, it took 15 minutes. So with everything else. That's just to let you know, when we're doing these results in practice, I use a bit more powerful desktop machine and leave it running for a couple of hours. But these are short examples. Okay. So what do these results look like? And again, if you want to get to the results without running the code, because of course you do. So there's the co-methylated regions and then co-methylated region one. So this is the example that I'm about to show, the output for region one. And then these are the results for all of the regions. So again, this is in the co-meth DMR repository in the results folder. I've got about 10 minutes left. Is that right? Yes. Okay. I'm seeing nodding of heads. Okay. So when I'm going to cluster these things, and Ian, this is getting to what we talked about earlier. What we're going to do is we're going to go through each of these probes and find out the sort of like the leave one out correlation, if you will. So we say, okay, I want to grab this probe right here. And I want to see for the 96 subjects is the methylation at this probe correlated with the average methylation in all of the other probes in this region. And the correlation here is 0.44. Now we move on to the next one. So we take this probe out, the CG1937-8133. And we calculate whether or not we calculate the correlation, the spearman correlation between this probe and the other four. And we see that the correlation is 0.25 on and on down the line. Okay. So notice we see here, this one is less than our threshold. Our threshold is 0.4. If you want to know where the 0.4 came from, we did some rather extensive simulation studies also in the paper in nucleic acids research that I mentioned earlier. But for right now, if you want to change it, you can. But 0.4 seems to work for quite a few diseases and quite a few data sets, and it had properties and simulation. So as my PhD advisor would say, don't make a mockery of honest ad hocory. So yes, 0.4 is ad hoc, but it has lots of simulation results to back it up. Okay. So we see here that this one has a low correlation with all of the other probes around it. Now, these probes are ordered based on their spatial position on the genome, which means that this probe up here is moderately related to the other probes in the region, but this probe is not. So I'm going to show a core plot of what we're talking about here. So this was the probe that we're talking about. Notice it has this much lower correlation between this 1937-81-33 and the other probes down here. So what we're saying is, yes, I had originally a region of five probes, but the co-methylated region is going to be these three down here. And you can see this because these correlations are higher than our threshold of 0.4. So we're saying, okay, it's these three probes down here, not all five probes. I don't want to pay attention to all five probes now. I only want to look at these three probes and find out if these three probes in this region are related to the outcome of interest. So this is where the co-methylation comes in. This is a really, really big idea. I want to make sure that everybody is comfortable with this idea of co-methylation or concurrent methylation. Are we good on this or do I need to talk about a little bit more? Okay, Ian says it looks good. So again, big idea, these five regions are part of a region just based on where Illumina decided to put its probes. So they said, hey, let's put these three probes or these five probes all close to each other within 200 base pairs of each other. That was a decision made by Illumina. It has nothing to do with what's actually going on in your data. These three probes down here, having some sort of correlation structure, that is dependent on your data. And this is, again, unsupervised because we have not measured whether or not these three probes are related to heroin use that comes next. So we're kind of done with the first phase, the unsupervised phase, and we find a whole bunch of regions that are like this that have concurrent methylation. Then the next step is going to be, well, let's do some statistics. Find out if these regions are related to heroin use. So again, I want to show you that the results from this part of the code, the downstream analysis, it doesn't include the correlation plot. It doesn't include all of this output. This is only for diagnostics. Really, the only thing that's going to say is, oh, here's your region that you should pay attention to. And it's going to be a lot of them. It's going to be like a thousand or something or 500, whatever it is. So again, that's here in this list right here, co-methylated all of the regions that have been adjusted for agent sex. So that's this file right here. All right. Now, so we're done with the unsupervised part. We've got supervised part, and I've got like three minutes left. That's okay. So the inputs here, this is going to be the list of those regions of concurrent methylation that we get up above. So that's the unsupervised portion. And then a data frame. Currently, we're taking in beta values. This is something that we, in our research group, we talked about it. This will be being changed soon so that it would be able to take in the beta values or the m values or the residuals. But for right now, it just takes in the original beta values. And then you need a phenotype data set that has the column named sample with an uppercase S. It is case sensitive. I have that hard coded because it was giving me a lot of grief not having it hard coded. I apologize. It frustrates me probably as much as it frustrates you, if not more. But for now, just rename your sample ID to be the word sample with uppercase S, and it works. Again, sorry about that. All right. So we want to talk about, I mentioned to you that we're going to have a mixed model that associates or finds if there's an association between the methylation and the phenotype of interest. So that's what we're seeing. That's what we're going to see right here. So the treatment, if you will, remember the treatment is whether or not they're using heroin. That's going to go in this X right here. The U vector, this is going to be the random effect for the person because as we saw up above, there was a region with three probes in it that were co-methylated, but they're all inside of one person. So that's why we're doing a mixed effect model. So here's our effect for the person. And this is all of our control covariates. We need to include the age and the sex assigned at birth, and then we've got our error vectors. The random coefficient model, what's happening here? So notice we have a fixed intercept and we have a fixed slope or a fixed effect of heroin use. The random coefficient linear mixed model allows this to vary based on the probe. So each probe could have its own intercept and its own slope. And what that allows the model to do is it allows it to better detect when there's, in those co-methylated regions, if there's a cooperative or collaborative effect across the majority of the probes in that region to influence the heroin methylation. So this is kind of the big idea. We're not just saying, oh, is the average methylation for one probe really high in this region? And then we're going to find out whether or not that region is related to the outcome of interest. Because at that point, you're really just testing whether or not the single probe is related to the outcome of interest. And that's not what we want to do. So we want to take these probes that are all correlated together already and then find out if they are working together to influence or if them working together is related to the outcome of interest. And so that's why we've got this mixed effect model here where we're allowing the slope in the intercept for each probe to vary. And then we have some code. I'm not going to go through this. I explain in great detail, or I hope it's great detail, what all of these components are. Again, we're going to have so we're taking in the original beta values, not the residuals, which is why we're including the same things that we adjusted for again. But we're not adjusting a second time because this is the original beta values. And in this case, we have our phenotype now. And that's whether or not they're using heroin or not. So this one takes about five or it takes four to five minutes on the laptop, my laptop versus my student's laptop. And so again, if you want these results, they are here. This is the results file and it's a CSV file. And this is kind of what it looks like. So we've got our CSV file here. And so what we're saying is, okay, here's a chromosome region from start to end on this chromosome. It's got three probes in the region. And then here's our estimate. Here's our standard error. Here's the statistic. And then we've got the p-value and the false discovery rate because we're making a lot of comparisons. So that's what our results data would look like. We want to annotate this. So we have a function called annotate results. And it really just pays attention to whether or not it was a 450k array or an epic array. And then we're filtering out anything that doesn't have significance at FDR 0.2 or whatever. And so we get a couple of these results, a couple of these genes. Now again, this is synthetic data. So I'm going to kind of skip this part. Down here, hopefully, hopefully NIDA will let us publish the data and publish this manuscript soon because we actually found some really cool stuff in rote 2 and caskin 1 that were already found to be related to alcohol use disorder. So people who are addicted to alcohol, the same gene regions have this excess methylation as people who are addicted to heroin. So it's really, really interesting here. So I want to give you kind of this recap here. Step one, you want to find the unsupervised step. You want to find the regions of concurrent methylation. And this is independent of your phenotype of interest. And then two, you want to test if those regions of concurrent methylation are related to the outcome of interest. And then, yeah, our collaborators, the psychologists are honestly kind of excited about this. But anyway, here's a bit of information on the session info. So if you have questions about this, I know we have like literally two minutes left, feel free to drop an issue in here, say, hey, I was running your code and something went wrong. I don't understand. Please send me an issue. And I would love to talk to you. Also, I'm available via Zoom, probably not this week because my hair is on fire. But next week, I can meet with you individually on Zoom if you are stuck and you want some one on one coaching on how to use this on your own data. And I, yeah, it's, I actually kind of like the tool. We've been, we've been working on it for a couple of years. So if you want to, if you want to use this on your data, I want to help you. So thank you very much. I think that's the, that's the end of my, I'm not, I promise you I don't do cocaine. I just talk like this when I, when I talk really fast. So, okay, I hope I didn't completely lose everybody. Did anybody understand anything out of what I talked about today? I think Ian gave me a looks good. So I got a, I got a thumbs up from, from Mark. Okay. I know we're out of time. Do we have time for, for like one, one question? Anybody have a question? I think everyone's tired to be honest. Oh my gosh. Yeah, I think, I think they liked your talk. There's a positive vibe in the room. I can tell you that. Okay. All right. All right. I know it's, I know it's really anxious when you're on the other end of the line and you're not sure. Yeah. I see lots of black, black boxes and then people that look like the size of ants. So yep, it's hard to be right now, but I can tell you, I think everyone enjoyed your talk. It was really good. And okay. And yes, please reach out to me. I'm available on Twitter. I also have an email address that I don't remember right now. I'll put it in the chat. Gabriel dot odum at fiu.edu. Just put like bioc 2022 in the, in the subject line. Also on Twitter, our EV, DOC, Gabriel. Just say, Hey, you idiot. I have no idea how to use your tool, but I really want to. So thank you, Ian. Okay. Am I, am I keeping people from dinner? Because I definitely don't want to be that person. I don't think you need to worry about that. And we're going to close it. And, but thank you very much. That was great. All right. I will, I will log off. My email is in the chat. My Twitter is in the chat as well. Please don't be a stranger. I promise I don't bite. Bye everybody. Hopefully see you next year in person.