Can people online hear us? Please put yes or something in the chat. Okay, awesome, thumbs up. All right, great. Welcome to another package demo. To remind all of the in-person people: all these little round buttons are microphones for the whole room. They are all on and there's no muting them, so try not to talk or make too much noise. Today's presentation is by Kristyna Kupkova, who is going to talk about GenomicDistributions.
Hi everyone, thank you for joining me. My name is Kristyna and I'm a PhD student at the University of Virginia, where I'm co-mentored by Nathan Sheffield, who is sitting here with us, and by David Auble. Today I'm going to present our Bioconductor package named GenomicDistributions, for summary and visualization of genomic regions, which we created in the Sheffield lab. At the beginning I would like to say: don't be scared to ask questions. I don't want this to be a monologue where I go through things and, if something is unclear, you don't want to ask because I have a speech prepared. If we don't cover everything, it's okay. I would rather you ask questions if something is unclear.
First I'm going to give a short intro to the package, and I'll start by introducing the genomic region sets that we're actually summarizing. I am aware that everyone at this conference is probably familiar with what genomic regions and genomic region sets are, but just to make sure, please bear with me. So what are genomic region sets? They are basically products of epigenomic experiments and genetic analyses, and the regions in them carry a certain property.
Here I gave an example of a ChIP-seq experiment, because I work mostly with ChIP-seq data. You have your nuclei, you crosslink the proteins, you pull down the proteins of interest, and then you send the sequences that carry your given histone mark or transcription factor for sequencing. Next you map your sequences to your reference genome and find enriched regions, which carry the property that you're looking for. We call those peaks. Then you can perform differential analysis, where you're basically asking: are the levels of my histone mark, that is, the number of reads mapped to those regions, different between healthy and disease, or do they scale with a certain factor? What you end up with is a table of your regions of interest, your genomic regions, which are practically just genomic coordinates with a given property, for example whether the region is upregulated or downregulated.
But in the end these are just genomic coordinates. They don't really tell you much. I like to view them like coordinates on a map. If I give you two locations on a map without any additional information, there isn't much you can tell me about the possible differences between those two locations, right? But once you start adding layers to the map, you can start drawing conclusions about what the differences between those two locations could be, and about the sources of whatever you might be looking for. That is basically what we're trying to accomplish with GenomicDistributions in relation to genomic regions. So let's say you have genomic region sets.
So you have a BED file with your genomic coordinates, or even better, you want to compare two BED files and see the differences between them. You simply plug those into GenomicDistributions and you can start annotating. These are some examples of what GenomicDistributions can do. You can get distributions over chromosomes. You can infer the cell type specificity of your regions if you're doing a bulk experiment; I'm going to get into details in a bit. You can calculate distances from transcription start sites, the distribution across genomic partitions (are the regions in promoters, exons, introns, whatever), distances between closest neighbors, and GC content, and I don't think this is actually the full list. In this workshop I'll show you how to produce these kinds of plots. For some of these plots you also need reference data. For example, for the partition plots, the distribution across genomic partitions, you need to provide a reference: we need to know where the promoters are and where the exons are. So I'll show you how to build those references too.
But before we get to the actual workshop, where I'll be showing you the functionality, I would like to highlight a few strengths of GenomicDistributions over other packages. I am aware that there are packages out there with similar functionality, so what is so great about GenomicDistributions? We put a lot of effort into its design, and here I've highlighted the key strengths. First, all of our functions can take single or multiple inputs. If you have one genomic region set, so one BED file, or two BED files, or 10 BED files that you want to compare, you can simply pass those to GenomicDistributions.
You're using the same set of functions whether you have single or multiple inputs. I'll show you how to do that in the workshop.
The second strength is that GenomicDistributions was designed with a modular approach. You first pass your genomic regions to calc functions. Those are functions whose names start with calc, and they calculate the summary properties, or features, or whatever you might want to call them. If you want, you can then plot the result in your own way. Or, if you want a quick job and use our plot functions, you just take the output from the calc function, plug it into our plot functions, and you get a summary of either a single genomic region set or a comparison of multiple region sets. In addition, the outputs of our plot functions are ggplot objects, so if you don't like the graphics, you can simply edit them. I'll show you how to do that in the workshop as well.
The third advantage is that we really tried to put anything we could think of into GenomicDistributions. We tried to make it as rich in functionality as possible. So if you want to compare your genomic regions, you just use GenomicDistributions and, in a few lines of code, you get this rich summary. You don't need to find a package for calculating the distance from transcription start sites and learn how to use it, and then, if you want the distribution across chromosomes, find another package and learn how to use that one.
The last, but not least, advantage is that we put real effort into writing GenomicDistributions so that it's fast.
We use data.table in most of the functions, and when we compared it to other packages, we found that GenomicDistributions is really fast. So you can process large regions, or a large number of regions, pretty fast.
Before we really dive in, I would like to talk a little about naming, so that you understand the names of the functions. A feature is, let's say, distance to the nearest neighbor. First, we have functions that start with calc. Those are the functions that calculate the summary properties. We also provide functions that start with calc and end with Ref. As I told you, for some of the functions you do need genome annotations, and getting those can sometimes be annoying. But don't worry: if you are working with human or mouse data, we provide them precompiled as part of the GenomicDistributionsData package. All you need to do in these calc<Feature>Ref functions is provide your query and tell us which genome assembly you're using, and we provide the genome annotations. The last set of functions are the plot functions, to which you provide the output of the calc functions to plot your results.
Before we get into the workshop, this is the last slide of the presentation, so I would like to acknowledge everyone who has worked on this package, all of the members of the Sheffield lab. And I guess now we can actually get to the workshop.
If you go to the workshops, it is under "GenomicDistributions: fast, easy, and flexible summary and visualization of genomic regions". I am just going to go through the website, through the vignette.
You go to Articles, then the GenomicDistributions demo. If you want, you can go to Orchestra and run it there, or you can copy and paste from here. In this workshop we will go through all of the functionality offered by GenomicDistributions that I showed you, and I would also like to show you how to build your reference data in case you're not working with human or mouse data. You don't have to be scared: creating those annotations isn't hard. You just need to give us a FASTA file or a GTF file, and we have functions that do everything for you.
So let's start by loading data. GenomicDistributions accepts either a GRanges object or, if you want to compare multiple data sets, a GRangesList object that combines them. Here I'm providing a path to the regions and using the rtracklayer import function: you just provide a path and it automatically creates a GRanges object. I like to trim my chromosomes to keep just the standard chromosomes, to avoid any kind of problems. When we look at the loaded data sets, you can see that each is really a GRanges object with our coordinates. As examples in this package I provided H3K27 acetylation from B cells, H3K27 trimethylation from B cells, and FGF2 from induced pluripotent stem cells, just so we have something to compare. As I said, all of the functions can also compare multiple region sets, so you can combine individual GRanges objects into a GRangesList with a simple command, and you have your GRangesList object with all the GRanges within it.
Okay, now that we have the data loaded, let's start.
Let's start looking at the individual functions. The first feature that we're offering is the distribution over chromosomes: where do the individual regions fall on the chromosomes? For that we'll be using the calcChromBinsRef function. Notice that it ends with Ref. First I'm going to show an example of how to run this function with a single query. I decided to look at FGF2 from induced pluripotent stem cells. You just provide your query, and all you really need to do is set refAssembly to hg38. This calculates how many regions fall into uniformly distributed bins across chromosomes. When you look at this table, you first have the individual bin coordinates, then the ranking of the bin within the whole genome, then the ranking of the bin within its chromosome, and last, how many regions fell into each bin. Once we take the output of this function and plug it into plotChromBins (notice that the functions are named alike), we get the distribution of the regions across chromosomes.
As I said, we can do the same for a multiple query. Before, I created the GRangesList object, which combines all three data sets. You can see that I'm using the same function, but I'm plugging in the GRangesList and specifying the reference assembly. Now we have the same table as before, with an extra column called name, which specifies the data set each row originated from. Then I plug this table into the same plot function; it is still called plotChromBins.
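The binning idea behind this step can be sketched in a few lines of Python. This is only an illustration of counting regions per fixed-width bin, not the package's R implementation: the function name `count_regions_per_bin`, the choice to assign each region by its midpoint, and the toy coordinates are all assumptions made for the example.

```python
from collections import Counter

def count_regions_per_bin(regions, bin_size):
    """Count regions per (chromosome, bin index).

    Assumption for this sketch: a region belongs to the bin containing its midpoint.
    """
    counts = Counter()
    for chrom, start, end in regions:
        midpoint = (start + end) // 2
        counts[(chrom, midpoint // bin_size)] += 1
    return counts

# Toy region set (hypothetical coordinates)
regions = [
    ("chr1", 100, 300),
    ("chr1", 950, 1050),
    ("chr1", 1200, 1400),
    ("chr2", 10, 50),
]

bin_size = 1000
bin_counts = count_regions_per_bin(regions, bin_size)

# One row per non-empty bin: chromosome, bin start, bin end, region count
for (chrom, index), n in sorted(bin_counts.items()):
    print(chrom, index * bin_size, (index + 1) * bin_size, n)
```

Adding a data set name to the key would give the per-set counts that a multi-set comparison needs, analogous to the extra name column mentioned in the talk.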
The plot function doesn't have a different name for multiple region sets. I can now do a simple comparison between the individual region sets, where the regions of each set are drawn in a different color.
Our next functionality is functions for calculating distances from transcription start sites. For that we're using the calcFeatureDist function, which is designed to calculate distances between GRanges objects. But we realized that it's pretty common to look for distances from transcription start sites, so our Ref function also has TSS at the end. When you use this function, we again provide the coordinates of transcription start sites automatically. So first you calculate the distances to TSSs, and then you run plotFeatureDist, and here you can see a comparison of the distances to the nearest transcription start sites.
I was reading in the workshop instructions that a few exercises should be involved, so I included an exercise here: using the previously calculated TSS distance object, plot a histogram of distances to TSSs within 5 kb. Here we're looking at distances within a 100 kb region, so let's zoom in to 5 kb; that is the argument size. Regions farther than plus or minus 5 kb will be accumulated into infinite bins; that is the argument infBins. I do have the exercise solutions here. Voilà. So our plot functions also have additional features. It's not only that you plug in an object and then edit the result as a ggplot object; we do provide some plotting options.
You can see that once we set size to 5 kb and ask for infinite bins, we get these plots, where the H3K27 acetylation and H3K27 trimethylation regions are farther away than the transcription factor that we have here. It's nice to see that most of those regions actually fall farther than 5 kb away.
Okay, let's go to the next function offered by GenomicDistributions, which is the distribution over genomic partitions. By genomic partitions we mean genomic annotation classes such as exons, introns, promoters, etc. Again we provide a Ref function; the specific name of this one is calcPartitionsRef. Once you then run the plotPartitions function, you see the results here: basically the number of regions falling into a given partition. The default in the plotting function is to plot frequencies, which makes it easier to compare if, say, one region set has 10 regions and another has 1,000. As you can see, the Ref function provides annotations for exons, 5' UTRs, 3' UTRs, introns, and core promoters versus proximal promoters, and the rest is classified as intergenic regions. Core promoters are the 100 kb from transcription start sites, and proximal promoters we set as 2 kb. Did I say 100 kb? I meant 100 base pairs, sorry. Proximal promoters are 2 kb. Yes, there is a question?
Is the function using a TxDb object behind the scenes? How does the function know
the location of exons and genes?
That's what I was saying: we provide annotation data that we built. I believe we built it from GTF files directly, right? Or did we use some annotation package? Yeah. So we built these annotations, and we do have functions to build them from a GTF; I'll show you that if you give us a GTF file, we'll build this for you. I think we did choose Ensembl. We put it in the GenomicDistributionsData package; there is a data package associated with GenomicDistributions, so that we can make these Ref functions and provide the annotations.
In addition to the simple distribution over genomic partitions, we also provide a function named calcExpectedPartitionsRef. If you think about it, your exons are way smaller than introns, so the expected partitions functions correct for these sizes. We calculate the expected number of overlaps based on the contribution of each partition to the size of the genome, assuming a uniform distribution of the regions across the genome. With this function you can then see whether your number of overlaps is higher or lower than you would expect. If it's higher than you would expect, the log of observed over expected has positive values. When you look at this here, FGF2, which is a transcription factor, is highly enriched in core promoters, which really makes sense, right?
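The size correction described here can be written out as a small Python sketch. It follows the stated idea only (expected counts proportional to each partition's share of the genome, under a uniform distribution) and is not the package's code. The function name `log_observed_over_expected`, the use of a base-10 logarithm, and all toy numbers are assumptions.

```python
import math

def log_observed_over_expected(observed_counts, partition_bp, genome_bp):
    """log10(observed / expected) per partition.

    Expected counts assume regions are spread uniformly over the genome,
    so each partition receives a share proportional to its size in bp.
    Assumes every observed count is greater than zero.
    """
    total_regions = sum(observed_counts.values())
    scores = {}
    for name, observed in observed_counts.items():
        expected = total_regions * partition_bp[name] / genome_bp
        scores[name] = math.log10(observed / expected)
    return scores

# Toy genome of 1,000 bp split into four partitions (hypothetical sizes)
partition_bp = {"promoter": 10, "exon": 90, "intron": 400, "intergenic": 500}
# Hypothetical number of query regions observed in each partition
observed_counts = {"promoter": 10, "exon": 9, "intron": 31, "intergenic": 50}

scores = log_observed_over_expected(observed_counts, partition_bp, genome_bp=1000)
for name, value in scores.items():
    print(f"{name}: {value:+.2f}")
```

Positive values mean more overlaps than the partition's size alone would predict, negative values mean fewer, and zero means the count matches the uniform expectation.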
A transcription factor should be close to transcription start sites.
The next function that we're offering is for what we now call signal summary in regions. This function requires you to provide your own matrix with signal values. In this matrix, each row is a genomic region somewhere in the genome, and each column is a condition. The motivation for this function is that I work with bulk data. I had a ChIP-seq experiment and I was wondering: are these regions important for specific cell types? So we built this matrix, where I took basically all of the defined regions across the genome that might have open chromatin, looked at ATAC-seq data on ENCODE, and assigned normalized ATAC-seq signal across different cell types. But in the end you can build whatever matrix you want. The calculation function then basically summarizes the signal within your regions of interest. With the calcSummarySignal function you provide the query, and here we're providing the open signal matrix, which is again part of the GenomicDistributionsData package. We provide those for the hg38, hg19, and mm10 genomes, and this matrix has normalized chromatin accessibility values across different cell types. Once you run this calcSummarySignal function, you get a list containing a signal summary matrix, which returns the signal values within your regions of interest for each cell type. It also gives you a summary of those values: box plot statistics.
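As a rough illustration of the summary step just described, the Python sketch below keeps the matrix rows that overlap a query and summarizes each condition. It is a simplified stand-in, not the package's method: quartiles from the standard library replace proper box plot hinges and whiskers, and the names `overlaps`, `summarize_signal`, the cell type labels, and the signal values are hypothetical. It also assumes at least two matrix rows overlap the query.

```python
import statistics

def overlaps(a, b):
    """True if two (chrom, start, end) half-open intervals share at least one base."""
    return a[0] == b[0] and a[1] < b[2] and b[1] < a[2]

def summarize_signal(query, signal_matrix):
    """Per condition, summarize the signal of matrix rows overlapping any query region."""
    kept = [values for region, values in signal_matrix.items()
            if any(overlaps(region, q) for q in query)]
    summary = {}
    for condition in kept[0]:
        column = sorted(row[condition] for row in kept)
        q1, median, q3 = statistics.quantiles(column, n=4, method="inclusive")
        summary[condition] = {"min": column[0], "q1": q1, "median": median,
                              "q3": q3, "max": column[-1]}
    return summary

# Rows are regions, columns are conditions (toy accessibility values)
signal_matrix = {
    ("chr1", 0, 100): {"B_cell": 0.9, "iPSC": 0.2},
    ("chr1", 200, 300): {"B_cell": 0.7, "iPSC": 0.4},
    ("chr1", 500, 600): {"B_cell": 0.1, "iPSC": 0.8},
    ("chr2", 0, 100): {"B_cell": 0.5, "iPSC": 0.6},
}
query = [("chr1", 50, 250), ("chr2", 20, 30)]

summary = summarize_signal(query, signal_matrix)
for condition, stats in summary.items():
    print(condition, stats)
```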
In the output, for H3K27 acetylation we have the lower whisker, lower hinge, median, upper hinge, and upper whisker. Once you plug this object into plotSummarySignal, you get a summary of the signal values across the different cell types. If you plug it in without metadata, this is all going to be black. So we provide metadata. You first have to have a column called colName, which has the same names as the columns of your signal matrix, and then you can provide whatever you want. Here I'm assigning the individual cell types to tissue types, and then I say that my metadata is in this cell type metadata object and I want to color my groups based on tissue type. Here I'm plotting bar plots showing the medians of all the signal values within the individual cell types. When you look at that, here we have H3K27 trimethylation, which marks repressed regions, so we expect the signal values to be low there, since the signal is open chromatin. Here we have H3K27 acetylation from B cells, and as you can see, the median really is highest in B cells. And here are the iPS cells.
So we would expect high signal anywhere.
Our next function looks at region widths. Obviously you don't need any annotation for this, so we're just providing the calcWidth function, with no Ref extension, which calculates the widths of your regions. The unique functionality here is the plotQTHist function, which basically eliminates long tails. If you usually plot the widths of your regions as histograms, you get these huge outliers that skew your histogram. So here we're setting a threshold based on your input parameter (I think the default is 2%), and the smallest and the largest regions, the bottom and the top, are accumulated in these end bins. This way you overcome the long tails and can see the nice distribution.
The next functionality is distances to the nearest neighbor. For that we're using the calcNeighborDist function. You take your region of interest and look at how far away the closest region is, and then how far the closest region is from that one, and so on. From this you can infer properties of your data. You often see this kind of bimodal distribution, where there are regions that are possibly clustering close to each other, and on the other hand regions that are very far away.
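A minimal Python sketch of the neighbor-distance idea follows. It measures the gap between consecutive regions on each chromosome after sorting, which is an assumption about how "nearest neighbor" is defined here rather than a description of the package's R code. The function name `neighbor_distances` and the coordinates are hypothetical, and nested intervals are not handled specially.

```python
from collections import defaultdict

def neighbor_distances(regions):
    """Gaps between consecutive regions on each chromosome.

    Regions are (chrom, start, end); overlapping or touching neighbors give 0.
    Simplification: a region nested inside an earlier, longer region is not
    treated specially.
    """
    by_chrom = defaultdict(list)
    for chrom, start, end in regions:
        by_chrom[chrom].append((start, end))

    distances = []
    for chrom in sorted(by_chrom):
        intervals = sorted(by_chrom[chrom])
        for (_, prev_end), (next_start, _) in zip(intervals, intervals[1:]):
            distances.append(max(0, next_start - prev_end))
    return distances

regions = [
    ("chr1", 100, 200),
    ("chr1", 250, 300),      # 50 bp from the previous region: a tight cluster
    ("chr1", 10000, 10100),  # far from the cluster
    ("chr2", 5, 10),         # the only region on chr2, so no neighbor
]

gaps = neighbor_distances(regions)
print(gaps)
```

A histogram of such gaps, usually on a log scale, is where a bimodal shape of clustered versus isolated regions would show up.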
So from these distances you can infer the clustering properties of your data sets.
The last two functions look at GC content and dinucleotide frequencies in your genomic regions. These two functions are a little different, because you need to provide a BSgenome object. We might eventually change that so it works with just a FASTA file, but at this point you do need BSgenome objects. Here we have hg19, so I load the BSgenome for hg19 and, along with my query, I provide the BSgenome object to the calcGCContent function. Here is the final output, where we're looking at the frequency of GC within individual regions. The next function is very similar. Again you need a BSgenome object, but instead of GC content you're looking at dinucleotide frequencies, so you can compare dinucleotides across your genomic regions.
This is basically the summary of all the functions that we provide in GenomicDistributions. As you can see, it's easy and straightforward to use these functions. It's usually just two lines of code and you get these plots. I use it pretty often; it's pretty cool. But if I weren't working with human or mouse data, I would be annoyed. I would think: oh my god, I have to get the list of my transcription start sites to provide to the functions? Never mind, I'm not really interested in those kinds of plots. But as I said, you don't have to worry. All you need in order to run all of the functions, except for GC content and dinucleotide frequencies, is a GTF file and a FASTA file.
So let's see how we can build those references. Let's start with the FASTA file. In this package I provided a FASTA file and a GTF file from C. elegans. You basically just need to give us a path to the FASTA file locally on your computer, or you can even give us a URL, and then use our functions. So what do you need? In our functions you might need chromosome sizes. To get those, you use getChromSizesFromFasta and give it the path to the FASTA source. You can also use the convertEnsemblUCSC argument: if you have a FASTA from Ensembl but your regions are in UCSC format, you can set convertEnsemblUCSC and you will have UCSC-format annotation. So then you have your chromosome sizes.
To build the distributions across chromosomes, you need to provide uniformly sized bins across chromosomes. To get those, you pass your chromosome sizes to the getGenomeBins function. You can specify how many bins you want to create across the whole genome, and you get a GRangesList object in which each chromosome is separated into uniformly sized bins. I'll show you in a second how to plug that into our calc functions.
The rest of the annotation classes are extracted from the GTF file. Again, you just provide a path to your GTF file, and to get the list of transcription start sites you use getTssFromGTF, and that's it.
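What extracting transcription start sites from a GTF involves can be sketched in Python as follows. This is a conceptual example, not the package's function: it relies on the standard nine-column, tab-separated, 1-based GTF layout, takes the start of a plus-strand feature and the end of a minus-strand feature as the TSS, and uses `gene` records as an assumption (a real workflow might use transcripts instead). The function name `tss_from_gtf_lines` and the records are hypothetical.

```python
def tss_from_gtf_lines(lines, feature_type="gene"):
    """Return (chrom, position, strand) for each record of the chosen feature type.

    GTF columns: seqname, source, feature, start, end, score, strand, frame, attributes.
    Coordinates are 1-based and inclusive.
    """
    tss = []
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comments
        fields = line.rstrip("\n").split("\t")
        chrom, _source, feature, start, end, _score, strand = fields[:7]
        if feature != feature_type:
            continue
        # Plus strand: transcription begins at the start; minus strand: at the end
        position = int(start) if strand == "+" else int(end)
        tss.append((chrom, position, strand))
    return tss

# Hypothetical GTF records in the nine-column layout
gtf_lines = [
    "# toy annotation",
    'I\tWormBase\tgene\t1000\t2000\t.\t+\t.\tgene_id "g1";',
    'I\tWormBase\texon\t1000\t1200\t.\t+\t.\tgene_id "g1";',
    'II\tWormBase\tgene\t500\t900\t.\t-\t.\tgene_id "g2";',
]

tss = tss_from_gtf_lines(gtf_lines)
print(tss)
```

Each result is a single position, which matches the one-base-pair regions mentioned in the talk.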
What you're given is a GRanges object with the coordinates of the TSSs. As you can see, these are regions one base pair in size.
With the next function you can extract your annotation classes: promoters, exons, introns, etc. You first need to use getGeneModelsFromGTF, where you specify which features you want to extract. Here we're extracting genes, exons, and 3' and 5' UTRs. As you can see, when you run this function you again get a GRangesList object with coordinates for genes, exons, and 3' UTRs. But here we still don't have the promoters and introns. For that, you take the gene model list created here and pass it to the function genomePartitionList, where you say which of the gene models are your genes, and so on. Then you can set whether you want to get both core promoters and proximal promoters, and what your core promoter sizes are. You might not want to use the default and get proximal promoters of 2 kb, which I did try in yeast, and it then told me that I have partitions larger than the whole genome. So you can really set it up, and what you end up with is the partition list with promoters, 3' UTRs, and all of the annotation classes you might need.
And that's it. This is how you build all of the annotations needed for all of the functions that I have shown you. I don't think we'll have time to go through the rest, so you can go through that yourselves. You just pass the annotations to the calc function without the Ref extension.
But I would like to address one last thing. As I said, if you want to customize our plotting, the outputs of the plot functions are really just ggplot objects.
So let's say you don't like the color, or you don't like the title, or whatever. Let's say this is the original plot. You just run the plotNeighborDist function, but save its output into a variable, say p. Then you can set scale_fill_manual and scale_color_manual to whatever values you want, and you can add ggtitle and rewrite the title. You can change the whole thing, basically, by adding additional layers to the ggplot object. That is if you're comparing multiple region sets. If you have just one region set, you need to reset the properties of your original ggplot. So once you save the ggplot object, you can either add additional layers or change the color settings the way I'm showing you here. It's really easy. We tried to make all the functions very easy to use and also flexible, so you don't have to plot something and then be annoyed that you have to replot it because you want it a different way in your publication. So, I don't know if there are any questions.
I'm working with bacteria. Could I use this library with prokaryotic genomes, or would there be some issues?
Please try it, and if you run into issues, just leave an issue on our GitHub page and I'll try to help you, or just shoot me an email. I think it should be possible, but if it's a circular genome, I'm not sure; I have never worked with one. As for exons and introns, that could be a problem, but you can actually specify what you want to extract from the GTF file, so I don't think it should be a problem. Okay.
Yeah. Really, if you run into issues, shoot me an email or leave an issue on GitHub. I try to be pretty efficient and answer quickly, but sometimes it takes a day.
We work with a lot of different species, and I was wondering, if we were to build our own references, whether there are any recommendations for different bin counts based on genome size. You can get a really small number for a really big genome if you have the same bin size across the whole thing.
I just do it empirically. Here I'm just showing an example. You can get high resolution by setting a large number of bins, or you can set a small number of bins. It very much depends on what you're trying to get out of it. So I try different settings and pick the plot that looks best.
There are a couple of questions online. One is from Rohit: does your package use annotatr, a package that has similar functionality?
I don't think so.
Okay. Another question from Rohit: do you have functionality to compare the enrichment of user-set regions in a particular genomic region, say exons, introns, 5' UTR, 3' UTR, as opposed to randomly sampled regions of the same nucleotide composition?
We do not have that now, but that might be nice functionality that we could potentially add. At this point we do have the function that shows the observed versus expected frequency, but that only looks at the sizes. At this point we do not offer anything like this, but I think the idea is kind of similar.
And one more from Rohit: how do you handle regions in user sets that overlap with more than one feature?
Like overlapping with an intron and a UTR.
Okay. We have two ways to do that. In the first one, you need to provide the partition list in a sorted way. If a region overlaps the first class, it is tossed out of the region set and is not considered in the next round. So basically, yeah, it's priority. I was looking for the word, thank you. So you have to give us a priority list. But in the calc function we also have an option for bp overlap. If you set it to true, it looks at the proportional overlap. So if a region overlaps both introns and exons, let's say one third falls in exons and two thirds in introns, then it is counted as one third for exons and two thirds for introns. I actually use the bp overlap option a lot.
I have one question. I do like that you've got the custom references there at the bottom. But you say you can use hg19 or hg38 or mm10, and the chromosome designations are very different depending on which database the data comes from. So is the default UCSC, or is it Ensembl? Can you switch to NCBI?
You can just provide your GTF file and do whatever you want.
But the default is UCSC; the names start with chr. These annotations come from a package whose name I forget right now. I know it, excuse me. I think we just converted it. I'm blanking right now, but I can get back to you, don't worry.
I might just make a recommendation that you be explicit in your materials that it's not just hg19, it's chromosome IDs assuming either UCSC or Ensembl, or here's how you convert them, and if you've got NCBI you need to do a custom reference, or something like that, because that's a place where people can get tripped up.
Okay. Yeah, that's a good recommendation. Thank you.
Please join me in thanking Kristyna again. We've got about 12 minutes until our final, well, penultimate session, I guess. Okay.