Hello, everyone. My name is Franziska Bohnert, and I'm the host for today. With us is Edmund Miller, a PhD student at the University of Texas at Dallas. He's going to talk today about the nf-core pipeline called nascent. Over to you.

Okay. Good morning, everyone. I'm Edmund. Let's get started. So, a quick overview of what we're going to talk about today. First, a quick background on nascent transcript identification, because I'm not sure it's as common as some other assays such as ChIP-seq or ATAC-seq. Then a brief history of the development of the pipeline. And lastly, we're going to talk about the pipeline itself and give a brief overview of that.

So, a quick background on nascent transcript identification. The goal is to identify changes in the transcription of RNA and what's going on in the cell at that specific point in time, rather than, say, RNA-seq, which isolates all the RNA in the cell at steady state. That would be your mRNA and things that have matured, versus what's actually being transcribed. So you can get an actual response to things like heat shock or viral infection. The assays pull out the sites of transcriptional activity through metabolic labeling. We won't go into that too much today, but I'm happy to discuss it with anyone in the future.

The problem is that we're covering a lot of different assays, not just one or some slight variations of one. We're covering lots of different chemistries, lots of different steps, etc. Because of that, some slight variation in the computational pipeline can actually lead to a 25% change in the results of the transcript calling. We'll talk about why that is later.

Specifically, what I'm interested in are enhancers. There are a lot of different things that you can call with nascent transcripts, such as miRNAs and long non-coding RNAs, and you can call the gene sequences as well.
But specifically, what you can pick up that you really can't pick up in RNA-seq is enhancers. These are cis-acting DNA sequences that can increase the transcription of genes. A lot of people are probably familiar with promoters; enhancers act in tandem with promoters. Part of the problem with identifying enhancers is that there are hundreds of thousands of them. We have evidence that enhancers and promoters interact from various other assays such as 3C. We also have evidence from these nascent transcript assays that enhancer RNAs are produced at these enhancers, and that they have a very short half-life and are in low abundance. That's why we don't usually pick them up in general RNA-seq.

Here's a quick infographic of what's happening in enhancer-promoter looping. Over here on the right, you can see that we have the promoter and Pol II, and then we have the mRNA coming off, which is what everyone's probably very familiar with. This is what you pick up in RNA-seq. You also have transcription factors and cofactors, but what we're really interested in, or what I am specifically, is this enhancer on the other side and the RNAs coming off of it with Pol II activity. Those are thought to pull in all these transcription factors and cofactors and various other things.

So what do the reads actually look like, since we're talking about bioinformatics here? These are just a couple of the various assays, and this is a recent paper that I thought was really good that summarizes all of them. You can see we have GRO-cap, csRNA-seq, NET-CAGE, STRIPE-seq, PRO-seq, and so on. But let's start here at the bottom with total RNA-seq. Just to orient everyone, we have the known enhancer here in yellow, and then we have the reads over here on the left, along the known gene.
In total RNA-seq here, you can see that we have a strong peak on the antisense strand, and then as we go along we have some reads coming from there. The main point is that we don't pick up the known enhancer in total RNA-seq. There just aren't enough reads and not enough mature RNAs. Whereas in something like GRO-cap, for example, you can see that we really pick up the known enhancer and have a lot of signal coming from there. However, we don't pick up the entirety of the transcript in GRO-cap. You can also see that we pick up this opposite transcriptional start site, going in the other direction from the gene body itself. And then there are other things like PRO-seq, which actually are nascent transcript assays, where you can see we pick up a little bit of the known enhancer. We don't have such a pronounced peak, perhaps, but we also pick up the entirety of the gene body and things that are being transcribed all the way along.

So, as I was just saying, in nascent we have two different kinds of assays that we're supporting: nascent transcript assays and transcriptional start site assays. I think this image from the same paper does a great job of illustrating this as well. Part of the problem is that there are 13-plus assays for nascent transcript identification and transcriptional start sites. And as I said before, minor changes in the sample processing could lead to a greater than 20% change in the final results. That's what they found, and I felt validated by that. I'll talk a little bit more about it in the history.

Let's start down here at the bottom, where you can see the promoter and the transcriptional start site. The blue is the TSS assay, like the GRO-cap that I was just talking about, whereas the nascent transcript assay would be PRO-seq. Neither of these is generic RNA-seq, either, in case that's what you're thinking of.
So you can see that in the TSS assay we get a very pronounced peak, and this is actually at the very beginning of the promoter. Then we have a slight break, and this is a CpG island here. And then you have the nascent transcript assay, which picks up the entirety of the gene body and the elongation along it. So these are the two different types of assays that we're supporting. The interesting part, then, is that we're picking up enhancers as well, based on those. With a TSS assay, we're picking up the initial transcription start site. With a nascent transcript assay, we can also pick up the entirety of the transcript, where Pol II is actually working along its whole length. And this is just talking about the directionality of these and whether we're pulling them down with a cap or without a cap. I highly recommend the paper if you're interested in that.

A quick history of the development. Version 1.0 was developed by Ignacio Tripodi and Margaret Gruca and was released on April 16, 2019. In parallel, in 2017, the Tae Hoon Kim lab at UTD started working to reproduce a paper that came out in 2018 on a second data set, and I was mostly responsible for that. This is where I got my start with bioinformatics and reproducible research, because I struggled to build a reproducible pipeline and reproduce the results from that paper. That's where I kept running into that 20% variance; these things can really make or break the transcript calling. I didn't understand that at the time, but now, after being validated, it feels great to know that it's so different from some other assays, where the bioinformatics pipeline doesn't affect the results that much. So I started creating my own CI/CD workflows and templates for Snakemake around January 2020.
And then, just as we were holding a little lab hackathon to introduce it to everybody, I had found nf-core the week before, and I started looking to move everything over to that, because I was excited to work with others who are doing a lot of great work here.

Let's talk a little bit about the pipeline. This is how far we've come. This is a Snakemake DAG, because there wasn't a DAG of the 1.0 version of the nf-core pipeline, but this is what I had in 2018. You can see the original presentation, where I'm boring my lab with things like Docker and other things as well. You can see that the majority of this is just handling HOMER and alignment, and that's pretty much all, plus maybe an intersection with histone data and the GM data, and handling those two cell lines. Very rudimentary.

And so, the obligatory metro map, which I finished last night, and James already has some feedback for me. I would like to thank all those who have worked on that. That was a great template and really easy to get going with.

So let's start over here with the FASTQ, and then we can pretty much zoom through everything here, because we're really standing on the shoulders of giants and reusing a lot from RNA-seq. That's great, because this is a much less used pipeline with far fewer users, but we benefit from all those bug reports now through subworkflows and modules and all those other things. So we can really jump all the way to transcript identification. We just make some genome maps up here, which is the only thing unique to us compared with RNA-seq, and we support a few different aligners.

The first thing is that we group all the replicates. Basically, that's anything that's a technical replicate that we want to group to increase the signal, and biological replicates. And then there's GRO-seq over here, specifically, because that's what I've been so interested in.
We feed that into HOMER and, optionally, groHMM, and we'll talk a little bit more about that. Everything else, transcriptional start site assays as well as GRO-seq and others, we feed into PINTS as well. Then we go into BEDTools, where we can intersect the two of these with a filter and without a filter, and basically only call regions that we're interested in and drop the regions that we're not. For example, we can drop the regions that are gene bodies and promoters, because we know those aren't going to be eRNAs or other interesting RNAs. We can also make sure that we keep only regions that we're interested in, such as those with histone modifications that indicate eRNAs. Then we do some quick quantification, and then we move into MultiQC.

Another little added benefit that we were interested in was supporting CHM13, a new reference genome that came out recently. I highly recommend you all look into it if you're interested. I'll be adding it to the template soon. The main thing in this infographic is that they were specifically looking at methylation data and how the new reference improved calls. Over here on the left is the number of MACS peaks. The blue is the old reference, and the CHM13 portion shows the additional calls that were made by using this reference. These may not be many, and may not be of interest for things that are well known and well understood, but they are very relevant for nascent transcript calling. We have support for that in our igenomes config, so you can just use CHM13 and align to that.

So let's talk a little bit about transcript identification, because that's the most interesting part of the pipeline and what makes it unique.
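As a rough illustration of the BEDTools filtering step described above (drop called regions that overlap gene bodies or promoters, and optionally keep only regions that overlap histone marks), here is a minimal Python sketch. The interval tuples, the function names and the example coordinates are all hypothetical and made up for illustration. The pipeline itself does this with BEDTools, not with this code.

```python
from typing import List, Optional, Tuple

# A BED-style interval: (chromosome, start, end), half-open [start, end).
Interval = Tuple[str, int, int]


def overlaps(a: Interval, b: Interval) -> bool:
    """Return True if two intervals share at least one base."""
    return a[0] == b[0] and a[1] < b[2] and b[1] < a[2]


def filter_regions(
    candidates: List[Interval],
    exclude: List[Interval],
    require: Optional[List[Interval]] = None,
) -> List[Interval]:
    """Drop candidates overlapping `exclude`; if `require` is given,
    keep only candidates that also overlap something in `require`."""
    kept = []
    for region in candidates:
        if any(overlaps(region, x) for x in exclude):
            continue  # e.g. falls in a gene body or promoter
        if require is not None and not any(overlaps(region, r) for r in require):
            continue  # e.g. no supporting histone mark
        kept.append(region)
    return kept


# Hypothetical example data.
candidates = [("chr1", 100, 200), ("chr1", 500, 600), ("chr1", 900, 1000)]
gene_bodies = [("chr1", 150, 400)]
histone_marks = [("chr1", 550, 650)]

print(filter_regions(candidates, gene_bodies, histone_marks))
```

Calling `filter_regions` without the `require` argument corresponds to the "without a filter" case, where regions are only removed, not required to overlap anything.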
So, there are a couple of different options, as I said. First, if you're doing GRO-seq, we have some great support for that. If anyone would like to support other assays, or would like to see them supported, please open an issue.

First is groHMM. This is what sparked us getting into Nextflow and into bigger pipelines: the difficulty of reproducing this and running it on machines big enough to actually use it. It's an R package, and it was released in 2015 by Minho Chae, Charles Danko and Lee Kraus, actually just down the street at UT Southwestern. As you can see from the graphic up here, groHMM greatly outperforms HOMER on just about all of these metrics. SICER is actually an old ChIP-seq peak-calling algorithm, and it actually kind of outperformed HOMER, which we thought was interesting, looking at this graphic. There are a couple of drawbacks to groHMM, though. It's very time consuming because it requires tuning, and it's also quite memory hungry when you're running on a bunch of samples. We also reached out to the authors, and Charles Danko recommended that we use tunits, which is an unpublished R package that doesn't require tuning. So stay tuned on that, but right now groHMM works and it does perform very well. As you can see, this is calling the entirety of the transcript, though. It's just up there on the left.

Oops, I think I missed HOMER. So I'll just talk a little bit about HOMER without the slide. It uses a somewhat more naive peak-calling method. It's just looking for the transcript and the difference in the peak itself. It was released in 2010 out of the Glass lab and is now maintained by Chris Benner. That was what we originally used in our paper, and it works pretty well.
The problem is that it was made in a land before Docker, so it has a couple of problems, like the way it wants to pull the references for you. But I finally realized you can just pass a FASTA in. It's like one line in the documentation, and that works amazingly. So we just run that on everything, because if you're going to run groHMM and wait a couple of hours, and HOMER takes 20 minutes, you might as well get some results from that as well. So, again, I missed the slide on that.

Let's now jump into PINTS identification. This is a new tool that just came out in 2022, and it's very exciting. I left this in up at the top, in figure A. It also illustrates the difficulty in reproducing these. This is on the exact same data sets, and you can see the difference between the HOMER and the groHMM results and just how much they vary. With a slight tweak in a tool you'd maybe expect better performance, but you wouldn't expect completely different results based on what tool you're using.

Down here at the bottom is the PINTS identification method, in a rudimentary way. It works very similarly to MACS2. Here is a potential true peak, based on the density of reads, and it's very easy to pick out. Then the algorithm picks up the local background noise, shown in light blue. And in this example, this is a potential peak that it needs to test: is it actually a peak, or is it just more noise from the assay itself? What PINTS is really doing is picking up these transcription start sites, as you can see from the read pileup here. It's picking up just the TSS rather than the entirety of the transcript, which might actually extend all the way along here. It was released in 2022, so it's a little more recent than 2010 and 2015, and it came from the Yu and Lis labs.
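The general idea of testing a candidate peak against its local background, as described above, can be sketched in a few lines of Python. This is only a sketch under stated assumptions: it uses a plain Poisson model with the mean of the neighbouring windows as the expected count, which is chosen here for illustration and is not necessarily the statistical model that PINTS or MACS2 actually use. The function names, the window counts and the `alpha` threshold are all hypothetical.

```python
import math
from typing import Sequence


def poisson_sf(k: int, lam: float) -> float:
    """P(X >= k) for X ~ Poisson(lam), computed from the exact CDF."""
    if k <= 0:
        return 1.0
    cdf = sum(math.exp(-lam) * lam**i / math.factorial(i) for i in range(k))
    return max(0.0, 1.0 - cdf)


def is_peak(
    candidate_count: int,
    background_counts: Sequence[int],
    alpha: float = 0.01,
) -> bool:
    """Call a peak if the candidate window holds more reads than expected
    from the local background (assumed: equal-width neighbouring windows)."""
    lam = sum(background_counts) / len(background_counts)
    return poisson_sf(candidate_count, lam) < alpha


# Hypothetical read counts in windows flanking the candidate.
background = [1, 2, 1, 0, 2, 1]

print(is_peak(20, background))  # strong pileup relative to background
print(is_peak(2, background))   # indistinguishable from background
```

A real caller would also correct for multiple testing across the genome and handle strand information, both of which are left out here.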
What it's really doing is determining the TSS, as opposed to the entire transcriptional unit, because it's mainly focused on TSS assays. So it gives the optimal balance, and this is from their paper, among resolution, robustness, sensitivity, specificity and computational resources required. There are a couple of other tools that can also be used, such as dREG, but those require GPUs, and you start getting into all kinds of difficulty for users and specific machinery. For TSS assays, PINTS just works out of the box. So that's a quick win, and we can support all of those assays by using PINTS and just handling most of the upstream and downstream processing.

I'm citing Cunningham's Law here: the best way to get the right answer on the internet is not to ask a question, it's to post the wrong answer. So if you think that any of this information isn't correct, or that we should be doing things differently, please open an issue or let me know. I know there's not a lot of consensus on nascent transcript identification, but I'd love to help the community build a kind of group ideal workflow for this. And so with that, I'll take any questions.

Thank you very much. I have now allowed everyone to unmute themselves, so if there are any questions, you can ask them.

Oh, I'm in the shower. Let me remove the background or change it. There we go. Hi Edmund, thanks, great talk. I think for this pipeline in particular, as we're realizing a lot now in nf-core, we've got really nice pipelines, but we need to be able to validate the results between releases and so on, and this is the kind of thing that came up during the summit. I think this is a really nice example of that because, as you mentioned, you tweak some parameters or you run the pipeline in a different way and you get all of that variability in the results, so it's really important to be able to reproduce the results.
Have you thought about full-size test data sets, and how we can validate whether the results are actually optimal across releases? So that, for example, when you or someone else comes to tweak the pipeline, we're not negatively impacting the results that you should be getting.

Exactly, that is something I've thought about. I haven't gotten an AWS full test going for GRO-seq yet. I do have two tests that came from the PINTS authors. They created some test data examples that I asked for, because they didn't have any examples of actual usage. From those, we can call the peaks on CoPRO and the other one. I'm forgetting the other one, but there are two test data sets already that are full data sets that have run, and I have regression tests of those that I'm saving as well to compare against. They actually have an entire element matrix, so we can probably pick a few of those and see if we can reproduce them each time, or at least benchmark where the nascent pipeline is and make sure that we're not changing drastically on those.

We will see. Yeah. A second question: so no controls, right? You don't have controls for GRO-seq.

Well, the control sample is kind of included in that. They talk about this a little in the PINTS paper: some tools require you to have controls, but the tools that we're using don't really require controls. The background model is built up, and then the caller will call the peaks based on some random distribution in the genome.

And last question: why not use MACS and other conventional callers? HOMER almost seems quite primitive, I guess, in terms of peak calling. Why not something more sophisticated like MACS? Are there more false positives? Is it legacy?
With HOMER you can also tweak some of the important things. I missed the image, but basically it picks up on the peak and then it picks up on the trailing tail of it, which is actually the piece that's really important. Here, I'll just pull it up. This is what HOMER is actually doing. Whereas with MACS you might just pick up the peak, here you're actually picking up this downstream transcript, which is why HOMER is unique in that.

Okay, and SICER presumably does something similar, because it was made for larger peaks as well, right? So it's able to call these sorts of things. Okay, cool. Thanks a lot, man.

Yeah.

There's also another question in the chat: why do you use featureCounts and not other quantification methods, as in RNA-seq?

featureCounts is just what I've always used for that. I'm open to other ideas on it. It's not exactly the same as RNA-seq, and most of those methods are RNA-seq specific, which is part of the issue with quantification here. The difference is that we pass in the genes and count with those, and then we also count with the identified transcripts and identified transcriptional start sites, and give you counts for all of those. Then downstream you kind of have to do your own math and stats behind the scenes, because it's not exactly the same as RNA-seq in terms of how the math works out.

And it's also not well defined. With RNA-seq you're counting things that overlap spliced transcripts, whereas with GRO-seq you're looking at the entire gene body, where splicing isn't important. So featureCounts can do that in this case, whereas with RNA-seq, as we've known and had previous discussions about, it's not ideal for the transcript splicing type of quantification.

Exactly, exactly, well said. It's simple and it works. It can work in a very simple way, and that's the reason we're using featureCounts.

Okay, thank you. I don't see any more questions.
I want to thank you, of course, Edmund, but also the Chan Zuckerberg Initiative for funding the bytesize talks. As usual, if there are any questions, you can always go to the nf-core workspace on Slack and ask them in the nascent channel. Thank you very much.