Hi everyone. So we're on to the next package demo. Today we have Wancen Mu and Eric Davis talking about nullranges, a modular workflow for overlap enrichment. They're both from the University of North Carolina at Chapel Hill. Take it away. Did you need a mic?

Hi, everyone. Welcome to our presentation. Eric and I will present together on the nullranges package, which aims to generate null feature sets using either block bootstrapping or a covariate-matching algorithm. A common question in genomic analysis is testing whether two sets of genomic intervals overlap each other more than expected by chance alone. Many studies have shown... oh, is it better? Sorry. Many studies have shown that features in genomics, transcriptomics, and epigenomics are not randomly distributed along the genome, so reliably detecting genomic ranges that are significantly enriched or depleted is biologically important. For example, in cancer genomics analysis, genomic regions enriched in somatic amplifications or deletions have been shown to contain key cancer-associated genes. Such tests rely heavily on the null distribution, so the goal of our package is to generate reliable, well-calibrated null distributions for this kind of null-hypothesis test.

In our package, there are two options for generating null distributions. Suppose you want to generate a null feature set for a set of features Y. You could subsample Y-prime from a pool of features Z while controlling certain characteristics, so that Y and Y-prime have similar distributions over one or more covariates. In the other case, when you don't have a pool of features Z, or you don't know which covariates to control for, you can subsample Y-prime from the original features Y by moving blocks of the genome with replacement, meaning that one feature can be sampled more than once, while preserving the local dependency structure.
That way, properties such as GC content or feature density are preserved. There is already a large body of work on enrichment and colocalization analysis, and many methods have been proposed, so if you are interested, feel free to visit the nullranges website or this Bioconductor tutorial. There are many real-data examples under the articles section for the matching, and for the block bootstrapping you can look into the tidy-ranges tutorial bookdown here. Note that our block bootstrap idea was motivated by the GSC method proposed by Bickel et al. in 2010; however, we implement efficient vectorized code that produces a genome-scale set of bootstrapped ranges, rather than generating the bootstrap data block by block as GSC does. If everyone is clear on the difference between matching and block bootstrapping, I will hand over to Eric to talk about matchRanges first.

I'm just going to go ahead and switch over to the... oh, sorry. Oh, it's in the tab. Okay. Thank you. I'm going to make this a little bit bigger; it's probably pretty small. Is that better for people in the back? Can you hear me? I'm just going to open up a blank R script here for pasting in code from the vignette. All right. So thanks, Wancen, for covering the beginning part, where we talked about the two major approaches of this package. First, I'm going to talk about the matching portion: matching covariates using the matchRanges functionality. I'm just going to follow along with this vignette; if everyone wants to follow along, you're welcome to. matchRanges allows users to subsample a pool such that the resulting matched set contains similar distributions of covariates or genomic features as a focal set of interest. As a visual example, we have this little diagram where we have a focal set of ranges here on the right.
What you'll notice about them in this little toy example is that they have different colors and lengths. On the left, we have a pool set, which has ranges in different positions, and those also have their own distribution of color and length. What we can do is use the matchRanges function, setting color and length of the genomic ranges as covariates in our model, and it will pull out a matched set of ranges: the same number of ranges as your focal set, but matched for those covariates of color and length. The nullranges package also comes with helpful functions for visualizing the distributions of your covariates in these different sets. The benefit is that the resulting sets can then be compared in whatever way you would like, without the potential confounding effects from these covariates.

My research is focused on 3D chromatin structure, and one feature of 3D chromatin structure is chromatin loops, so I'm going to use chromatin loops as a biological example for this vignette. If you don't know, most chromatin loops are formed in a process known as loop extrusion, where a ring-like cohesin complex extrudes chromatin until it is stopped at both ends by a bound CTCF transcription factor. Therefore, most chromatin loops tend to have CTCF bound at their loop anchors. Because CTCF is a transcription factor, these chromatin loop anchors also tend to fall in accessible chromatin regions, and this can act as a potential confounder. So suppose we wanted to compare CTCF occupancy between the anchors of looped and unlooped ranges. matchRanges can help by generating a set of null ranges that controls for this confounding by chromatin accessibility.
To do that in this demo, we've assembled an object called hg19_10kb_bins, where we've taken every 10 kb bin of the human hg19 genome and annotated it with CTCF sites, the number of DNase sites, and the strength of those DNase sites, along with whether or not the bin contains a loop anchor. With that introduction, we can go ahead and get started with the code. I'm going to go ahead and copy this code here and paste it in. First we can load the object and take a look at how it appears. Is that better? Sorry, a little bit of loading from the cache here. What you can see is we have a GRanges object, and for convenience we're renaming it to bins. What this has is every 10 kb region along the hg19 genome, with information in the metadata columns about the covariates that we're interested in: the number of CTCF sites, CTCF signal, the number of DNase sites, and DNase signal, as well as whether or not the bin is looped.

Using matchRanges is fairly simple. First you need to define the focal and pool sets, focal being the set that you're interested in. In this case that's going to be the bins that are looped, so I'm going to define focal here as the bins where looped is TRUE, and run that code. The pool set is going to be the bins with no loop anchors, where looped is FALSE. To use matchRanges, first we'll load our package. I'm going to copy this code here, with focal, pool, and our covariates, which are going to be DNase signal and the number of DNase sites. We can run this, and in just a second what we get out is a MatchedGRanges object. This is the result of matchRanges performing the matching operation: it finds the ranges with the correct covariate distributions and returns them as a MatchedGRanges object, which functions essentially as a normal GRanges object.
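For reference, the live-coded steps just described might look like the following sketch. The metadata column names follow the talk, while the exact loader for the dataset is an assumption:

```r
## Sketch of the matching demo above. Assumes the nullrangesData
## package provides the hg19_10kb_bins dataset with metadata columns
## named roughly as in the talk (looped, dnase_signal, n_dnase_sites).
library(nullranges)
library(nullrangesData)

bins <- hg19_10kb_bins()        # every 10 kb bin of hg19, annotated

## focal set: bins containing a loop anchor; pool: all other bins
focal <- bins[bins$looped]
pool  <- bins[!bins$looped]

## match the pool to the focal set on DNase signal and site count
mgr <- matchRanges(focal = focal,
                   pool  = pool,
                   covar = ~dnase_signal + n_dnase_sites)
mgr   # a MatchedGRanges object; behaves like a regular GRanges
```

The `covar` formula lists the covariates to balance; matchRanges returns as many ranges as there are in `focal`, drawn from `pool`.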
The next section shows how you can use the MatchedGRanges object just like a normal GRanges object, so we can use the packages GenomicRanges, plyranges, and ggplot2 to perform operations on these genomic ranges. Have you ever seen this before? Has anybody else seen this before? Sorry for the technical difficulties; it just refreshed the page due to a security threat. All right. With Mike's suggestion, we're going to ditch the live coding. Apologies for the difficulty. All right. Well, in the interest of time, I'm just going to go ahead and finish walking through this vignette. Really, all of this is in the vignette anyway, so there's not too much additional that I was going to demonstrate, but feel free to go through it later and play around with some of these functions as well.

Essentially, what I was showing here was that we can use packages like plyranges, which supplies functions for doing tidyverse-like operations on genomic ranges. It allows you to manipulate your MatchedGRanges object as if it were a regular GRanges object. Here we're grouping, summarizing, and then plotting the result. It doesn't really matter what we're plotting; it's more to demonstrate that it can be used for these sorts of operations. Once you've generated your MatchedGRanges object, one thing you'll want to do is assess the quality of your matching. You can do that using the overview() function, which provides a quick look at how well your covariates are matched. For every set, the focal, matched, and pool sets, it provides the number of ranges, the means and standard deviations for continuous variables, and the counts for categorical variables. Aside from looking at it with the overview() function, we can also use plots to visualize how well the matching was done.
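The group-summarize-plot pattern described above could be sketched like this; the grouping and summary columns are illustrative assumptions, not necessarily the exact ones from the vignette:

```r
## Sketch: manipulating the MatchedGRanges `mgr` with plyranges as if
## it were a plain GRanges, then plotting with ggplot2. The columns
## n_ctcf_sites and ctcf_signal are assumed for illustration.
library(plyranges)
library(ggplot2)

mgr %>%
  group_by(n_ctcf_sites) %>%                       # tidy-style grouping
  summarize(mean_signal = mean(ctcf_signal)) %>%   # per-group summary
  as.data.frame() %>%
  ggplot(aes(n_ctcf_sites, mean_signal)) +
  geom_col()
```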
I'm going to skip over this plot in the interest of time, but you can plot the covariates using patchwork to visualize all the covariates that were matched, and we provide a number of different plot types for visualizing these. In this case, we have a density plot, or this sort of stacked bar plot for categorical variables. And if you don't like the way these visualizations are made, you can extract the matched data with the matchedData() accessor function as well. Finally, once we have our matched ranges, we can investigate the question that we were interested in: comparing CTCF sites between looped and unlooped genomic regions. As you can see here, the looped set contains a high percentage of CTCF sites, and the unlooped, or pool, set contains fewer. But once we perform our matching, controlling for the effect of these potential confounders, you can see that there is still an increase in CTCF relative to what you would expect. Essentially, this means we obtain a more meaningful difference due to looping when we match on these covariates. If anyone has any questions, yes?

I guess it's a very simple question. Do you recommend doing this matching and generating the references multiple times, to account for random sampling?

Yes, you can do that. In this example, we've set a seed for reproducibility, but you're welcome to iterate through these matched samples until you get one that is well matched, because visualizing these distributions is an important part of assessing that the covariate distributions are appropriately balanced.

And I guess, how do you balance that with not cherry-picking the best result?

Yeah, so this is done upstream of your inference. You do your matching before you do your statistical test, so it's appropriate to iterate at that stage. I just want to make sure that I've understood how it's working.
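The quality-assessment steps mentioned above (overview, per-covariate plots, and the matchedData accessor) might be sketched as follows, assuming `mgr` is the MatchedGRanges object from earlier:

```r
## Sketch of assessing matching quality for a MatchedGRanges `mgr`.
library(nullranges)
library(patchwork)

overview(mgr)       # per-set counts, means/SDs of the covariates

## one balance plot per matched covariate, combined via patchwork
plts <- lapply(covariates(mgr), plotCovariate, x = mgr)
Reduce("+", plts)

## or extract the underlying matched data for custom plotting
md <- matchedData(mgr)
head(md)
```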
It seems like you're adjusting the bin sizes until you get the right parameters to match your null distribution of parameters. Is that correct?

Not quite. I think that's maybe a mixture of the two examples I was showing. The first was showing the distribution of genomic lengths, but the second example isn't adjusting the bin sizes per se; it's selecting ranges from a pool that are matched on your covariates, not on the size of the bins itself. Okay, so there's a subsetting component that you can drop. Okay, that's awesome. Thank you. And I'll go ahead and pass it off to Wancen, who's going to talk about the bootstrapping, which is the other portion of functionality in the package.

Yes, thanks, Eric. I will now talk about the bootRanges part. When your data don't have that focal/pool structure, you may want to generate artificial ranges from the original set, and one strategy for that is naive permutation, or shuffling. However, genomic features often exhibit a complex dependency structure, both in their placement and in the local correlation of their metadata. The example here is a snapshot from the Genome Browser, where GC content, gene density, and CRE locations show clumping properties at a scale of around 500 kilobases. If we do a naive permutation, we lose these natural clumping properties and break the correlation of the metadata. Here is a histogram reproduced from Bickel's paper, where the true null distribution, shown on top, comes from sequences generated by a known random process. If we compare the statistic distributions from permutation and from the block bootstrap, we find that block bootstrapping does a much better job of estimating the standard deviation of the true null distribution than permutation does. In conclusion, block bootstrapping does a better job of keeping the clumping properties as well as estimating the bootstrap statistic.
In our simulations, we found that permutation gives a smaller variance, so your test will have a much smaller p-value, which may cause false positives. Then, moving on: this figure shows our method's process. In panel A, you can see that the features fall into several homogeneous segmentation states along the genome. The different colors may represent different feature density, for example high, middle, and low gene density. The algorithm does block bootstrapping within each segmentation state separately. For example, here we randomly select a block of length Lb from a given state and move it to another block position within the same state, possibly across chromosomes. The workflow for bootRanges is: first, compute a statistic by overlapping ranges Y with ranges X; then, after you derive the bootstrapped ranges, given an optional segmentation and exclude ranges, repeat step one to derive a bootstrap distribution of the statistic. Given enough bootstrap iterations for the normality assumption to hold, you can do a z-test for the hypothesis of whether there is true biological enrichment between the two feature sets.

For this demo, I will first load DNase hypersensitive sites (DHS) from the ENCODE project, which have been pre-processed and stored in the nullrangesData package. First, we can look at an overview of the data. It has more than 700,000 ranges, but keep in mind that your bootstrap data will be many times larger than your original data, so filtering and trimming extra metadata can help make the analysis more efficient. Here we filter based on a metadata column to remove the noisy ranges, and because afterward we will calculate the statistic as the number of overlaps, we only mutate an ID number here and select that metadata column. We first have to load the plyranges library. After this step, we will have around 6,000 ranges left, falling on all 24 chromosomes of hg38.
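The loading and trimming steps just described might be sketched as follows; the accessor name and the signal cutoff are assumptions based on the talk:

```r
## Sketch of preparing the DHS data for bootstrapping. Assumes the
## nullrangesData accessor DHSA549Hg38() and a signalValue metadata
## column; the cutoff of 100 is illustrative.
library(nullrangesData)
library(plyranges)

dhs <- DHSA549Hg38()                  # DNase hypersensitive sites, hg38
dhs <- dhs %>%
  filter(signalValue > 100) %>%       # drop low-signal, noisy ranges
  mutate(id = seq_along(.)) %>%       # keep only an ID for overlap counting
  plyranges::select(id)
length(dhs)                           # ranges remaining after filtering
```

Trimming the metadata matters because every bootstrap iteration copies the whole range set, so any extra columns are duplicated R times.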
Then we import exclude ranges: regions where we typically won't have features located. Here, with the pre-combined data, the exclude ranges include the ENCODE exclusion list plus telomeres and centromeres, from ExperimentHub. For segmentation, we have two options: either perform a de novo segmentation based on feature density, or download an existing segmentation, like ChromHMM, from AnnotationHub or other databases. In this demo, we use the segmentDensity() function from nullranges, which can use either circular binary segmentation (CBS) or a hidden Markov model, based on gene density, to segment this dataset with a segment length of two million bases. It is pre-stored in ExperimentHub as well; here I loaded the CBS one. There is also a plotSegment() function to evaluate the segmentation performance. First we load nullranges and call plotSegment(): given the segmentation GRanges, the exclude ranges, and the type of plot we want, suppose we want a ranges plot, it will show the segmentation states across the whole chromosome. Ranges are connected together if they belong to the same state, and the breakpoints here may be caused by the exclude ranges or by transitions between different states.

Then, given the block length, we are ready to run bootRanges. Here I run 20 iterations, but normally you would do 100; depending on your statistic, you can change R here. Looking at the return object, it is actually a subclass of GRanges, with the iteration number recorded in a factor-Rle format and the block length as an integer. Then we can assess the quality of our bootstrap samples. First, we combine the original set with the bootstrap set, using mutate() to set the iteration number to zero to mark the original set. Then we can easily use the summarize() function from plyranges; here I am calculating the number of features per iteration.
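Put together, the segmentation plot, the bootstrap itself, and the per-iteration check just described might look like this sketch; `seg`, `exclude`, and `dhs` are the objects from the demo, and the block length is an assumed value:

```r
## Sketch of the bootstrap steps. Assumes `seg` (segmentation GRanges),
## `exclude` (excluded regions), and `dhs` (filtered DHS) exist as in
## the demo; blockLength = 5e5 is an assumption for illustration.
library(nullranges)
library(plyranges)

plotSegment(seg, exclude, type = "ranges")   # inspect segmentation states

boots <- bootRanges(dhs,
                    blockLength = 5e5,  # assumed block length
                    R = 20,             # 20 iterations, as in the demo
                    seg = seg,
                    exclude = exclude)

## quality check: number of features per iteration (iter 0 = original)
combined <- dhs %>%
  mutate(iter = 0) %>%
  bind_ranges(boots) %>%
  plyranges::select(iter)
combined %>% group_by(iter) %>% summarize(n = n())
```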
In the result shown here, remember that iteration zero is the original set, so the bootstrapped sets have a similar number of features to the original set. Remember that the advantage of block bootstrapping over permutation is that it keeps the local dependency structure, so to assess that, we also calculate the inter-feature distance to see whether it is close to that of the original set. First, we define the inter-feature distance function; it returns the distance between neighboring ranges if they are on the same chromosome, and NA if not. Then we use the nest() function and map the inter-feature distance function over every iteration's data with the purrr package, then unnest to draw the density plot. The plot shows that these three bootstrap iterations have inter-feature distance densities really close to that of the original set.

Then we are ready to derive the statistic of interest and perform the z-test. Imagine we are evaluating the enrichment of a feature set X with these DHS. We first simulate a feature set X of 50 ranges, each one million bases wide, and decide to use the total observed number of overlaps as the statistic. For this arbitrary feature set X, 64 DHS overlap with X. Now we repeat the same step to overlap X with the bootstrap data, grouping by the iteration and summarizing the overlap count. Note that we need to use the complete() function to fill in zeros when there are no overlaps. Then we are ready to draw the histograms. Remember that for the original set there were 64 overlaps, so it falls within this histogram, because it's an arbitrary, simulated dataset. Although I'm not showing the z-test result here, it fails to reject the null hypothesis, meaning there is no significant overlap between those two feature sets. But since the downstream analysis is really flexible depending on the statistic, we can also use the mean number of overlaps per feature.
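The sum-of-overlaps test just described can be sketched as follows, assuming `x` is the simulated feature set, `dhs` the original ranges, and `boots` the bootstrapped ranges from above:

```r
## Sketch of the overlap statistic and z-test. Assumes `x`, `dhs`,
## and `boots` exist as in the demo.
library(plyranges)
library(tidyr)

## observed statistic: total number of overlaps of x with the DHS
obs <- x %>% join_overlap_inner(dhs) %>% summarize(n = n())

## bootstrap null distribution: overlaps per iteration, zeros filled
## in with complete() for iterations that have no overlaps at all
null_dist <- x %>%
  join_overlap_inner(boots) %>%
  group_by(iter) %>%
  summarize(n = n()) %>%
  as.data.frame() %>%
  tidyr::complete(iter, fill = list(n = 0))

## z-test of the observed statistic against the bootstrap distribution
z <- (obs$n - mean(null_dist$n)) / sd(null_dist$n)
pval <- 2 * pnorm(-abs(z))
```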
For that, we group by the X ID and the iteration, and also complete() on the X ID. We can look at the intermediate step: if we summarize grouped by X ID and iteration, it returns the number of overlaps for each ID and each iteration, but after we call complete(), it returns zero for the combinations of ID and iteration that have no overlap. Then, to derive the mean number of overlaps per iteration, we add another group_by() on the iteration and summarize again, likewise using plyranges. Because the statistic is now the mean overlap, the histogram changes as well. Similarly, the observed mean overlap is 1.28; it still falls within the distribution, and we fail to reject the null hypothesis.

I just want to mention that bootRanges can do more than derive summary statistics; we have a lot of extensions. For example, to model the bootstrap data, you can use penalized splines to derive an optimized log fold change threshold for differentially expressed genes, like this, or you can even use the count matrix from a SummarizedExperiment or SingleCellExperiment to calculate the correlation between ATAC-seq and genes. Our workflow is also flexible with respect to data format: you can either do this with plyranges or with a tidy SummarizedExperiment object. This is the pseudocode; if you are interested, feel free to look at it.

In the example provided for the ATAC and RNA-seq block bootstrap, the RNA GRanges object is not found. Leonardo had the same issue, in terms of where it would be located for that example.

Yeah, these are pseudocode; I don't have time to run them now, but you can change the SummarizedExperiment to a GRanges and extract the count matrix as a numeric list to give to bootRanges. Are there any other questions? If not, I'd like to thank our advisors, Dr. Love and Dr. Phanstiel, and all the collaborators who gave really great comments on this project. Thank you.

Thank you, guys. I think there are no questions in the chat and no questions from the audience either.
There are a few more sessions here, but I think there's something in the auditorium as well. Do we break, or... I think the next session is at 3:30? Yeah, the next session is at 3:30. So thank you.