Welcome, I'm Francesca, today's host, and with me is Chris Cheshire, who's giving us an intro to the newest developments in the nf-core/cutandrun pipeline. Over to you now. Thank you very much, I'll just share my screen. Is that all good? Can you see my screen? Yes, we can. Okay. Hello everyone, and thanks for the introduction. My name's Chris Cheshire and it's a pleasure to be here. This is probably the first time I've presented this pipeline, and it's consumed about a year and a half of my life on and off, so I'm extra happy to be showing it to someone today. Today I'm going to go through the concept of the pipeline and the reason it was developed, walk through some of the key features with you and discuss some of the more interesting points of the pipeline, though it won't be exhaustive. Then I want to go through some of the new features in the version 2.0 release, which I'm going to shamelessly plug throughout this presentation, some of the testing and automation features I developed, and of course some of the future plans for the pipeline. So without further ado, a little bit about me. I'm a postdoctoral research fellow in James Briscoe's lab at the Francis Crick Institute in London, and my focus is single-cell multiomics. I'm trying to develop a rapid-prototyping single-cell system that targets ATAC, multi-target CUT&RUN or CUT&Tag, and the transcriptome, all simultaneously from the same cell. But when I first started this project I knew nothing about CUT&RUN, despite the fact that it's an integral part of the project, beyond the technique itself and the kind of results it produces. So I thought the easiest way in would be to find a project that got me into CUT&RUN and CUT&Tag data analysis.
And this was it. I'd been using nf-core for a while on other projects, and I realized there was no nf-core pipeline for this experimental protocol. It went from there really, and it just snowballed, but it's been a really interesting project to work on and it's still ongoing. So, a brief overview of CUT&RUN and CUT&Tag. The idea is that they're successors to the ChIP-seq assay, and the main difference from ChIP-seq is that you get much lower background binding and non-specific cutting. The basic protocol is that you wash antibodies over either a target transcription factor or a histone mark, and then you add an enzyme with a Protein A domain attached to it. The Protein A binds to the antibody, so the enzyme hangs around, localized to the targeted area where the TF is bound or the histone is marked. Then you add an ion, calcium in the case of MNase, which activates the enzyme and causes it to cut the open chromatin or the nucleosomes around the target. You can then wash everything else away and sequence the products, and you get a very accurate position, down to the nucleosome level, of where the TF or histone was. The difference between the two assays is that CUT&RUN uses MNase as the cutting enzyme, while CUT&Tag uses a transposase, and magnesium instead of calcium as the activating ion. They have different advantages, which I won't go into; there was some evidence suggesting CUT&Tag is better for transcription factors than CUT&RUN, but that's all outside the scope of this presentation. The key thing to note is that the upstream bioinformatic processing is exactly the same for both protocols.
So that's another reason I wanted to do this pipeline: it kills two birds with one stone, two pipelines for the price of one, which is pretty cool. These approaches are really growing in popularity, especially at the Crick but I think globally as well, and as I said, there was no nf-core pipeline for them. So, an overview of the pipeline. This tube-map style of diagram is becoming popular now, and it shows the general flow of the pipeline. I'm going to go through it bit by bit, so I won't spend too much time on this slide. In general we have trimming and QC at the beginning, then alignment in the middle, then we gather the reads up, remove duplicates and filter, and at the end we finally call the peaks, with reporting generated all along the way. The first thing I want to talk about is the sample sheet. I wouldn't normally discuss this, but it's actually one of the new features in the pipeline, so I wanted to touch on it. This is the new version of the sample sheet, which basically lets you define where your samples are and what the structure of the experiment is going to be. It's very similar to the other nf-core pipelines; it's semi-standardized, I feel. You can merge technical replicates, or merge data from multiple lanes, which is a feature of most nf-core pipelines; you do that by giving rows the same sample ID and the same replicate number, like the top two rows here. Those two will then automatically be merged as one sample, which is really useful when you get sequencing from multiple lanes, and that happens a lot.
The other main feature is that we can assign control groups. Controls are really important in CUT&RUN and CUT&Tag; you almost always have an IgG background control in these experiments, so the ability to assign one in various ways really matters. The pipeline can auto-detect when a control is being provided, by the fact that it's named as a control in the final column of one of the other samples. The other thing to note is that controls are automatically assigned per replicate. So here the wild type has replicates one and two, and explicitly assigned control replicates one and two will be matched to them, which is quite useful. If you only supply one control replicate, that control will be applied to both target replicates, which is also useful because sometimes you don't have multiple replicates of IgG. The final feature is robust error checking on the sample sheet, which again is a requirement and a feature of most nf-core pipelines. I went over this because it's changed from the previous version. The next stage of the pipeline is trimming and the initial quality control, as well as merging the samples together. This is standard for a lot of genomics pipelines: Illumina sequencing, which is what most people use, requires adapters, and these need to be trimmed off, with QC before and afterwards. But the reason I wanted to touch on it is to talk about the design principles of the pipeline.
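As a sketch of the replicate-wise control matching just described, here is a minimal Python illustration. The column names and data structures are assumptions for the example, not the pipeline's actual implementation.

```python
def assign_controls(samples):
    """samples: list of dicts with keys group, replicate, control.
    A group counts as a control if any other row names it in its
    'control' column (the auto-detection described in the talk).
    Controls are matched replicate-to-replicate; if only one control
    replicate exists, it is applied to every target replicate."""
    control_groups = {s["control"] for s in samples if s["control"]}
    # Index control rows by (group, replicate)
    ctrl_index = {}
    for s in samples:
        if s["group"] in control_groups:
            ctrl_index[(s["group"], s["replicate"])] = s
    assignments = {}
    for s in samples:
        if not s["control"]:
            continue  # this row is itself a control
        match = ctrl_index.get((s["control"], s["replicate"]))
        if match is None:
            # Fall back to any replicate of the named control group
            candidates = [c for (g, _), c in ctrl_index.items()
                          if g == s["control"]]
            match = candidates[0] if candidates else None
        assignments[(s["group"], s["replicate"])] = (
            (match["group"], match["replicate"]) if match else None)
    return assignments
```

For the two-replicate layout in the slide, wild-type replicates 1 and 2 each pick up the IgG replicate with the same number; delete one IgG row and both targets fall back to the remaining one.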
There are many paths for downstream analysis, as we all know in genomics. Once you get to a critical point, the paths of analysis diverge depending on the scientific questions you want to answer, but the upstream analysis before that divergence point is always the same. So with this pipeline, instead of providing a load of downstream analysis features that are difficult to test because they're situation-specific, I wanted to focus on upstream data quality. For this pipeline, that critical point is when the peaks are called. I wanted to produce really robust peaks that you can trust, supported by a lot of quality control and transparency around how those peaks were calculated; that was the main aim of the pipeline. This also enabled a proper development cycle, where we can test and integrate new features, do maintenance, analyze how new features behave out in the world, and then design and implement the next ones, in a loop. If we had to test downstream analysis routes all the time, that cycle would break down. So, the principles the pipeline was designed around. Repeatability: it needs to not fail, and to do the same thing over and over again so you can trust it. Reproducibility, which is a core principle of nf-core and what Nextflow enables: we can run this pipeline on clusters or laptops, it doesn't matter, and it should run the same as long as you have some minimum installation requirements. And the other principle, which I was just talking about, is transparency: we need to know where the results came from.
We need insight into those results, and the way I've done that is by providing lots and lots of reporting. You can see on the diagram the little stops with a pie chart in them: these are all the points in the pipeline where reporting is produced, in the form of charts, tables and MultiQC reports, if you're familiar with those. They let you get a really good view of exactly what's going on at every stage of the pipeline, and if something isn't clear, that's a problem and we try to fix it as quickly as possible. Onwards to the main functions of the pipeline. The next stage is alignment; I won't go too deep into it. It uses Bowtie 2 and a standard alignment procedure. There are some interesting parameters, described in the documentation, for how Bowtie 2 is run with the kind of reads you get from this type of experiment, but that's outside the scope of this presentation. After that we go on to filtering: we filter out reads below a minimum q-score, and we also remove duplicates from some of the reads, but not all of them. One key thing to know is that normally you would just remove duplicates regardless, because you want to get rid of all the PCR duplicates. The trouble with CUT&RUN is that, because it's targeted, you get very close stacking of reads over the same sites. That's valid data, but depending on the deduplication parameters you may find it gets filtered out when it shouldn't. So we don't remove duplicates on the target samples, unless there's clear evidence of PCR duplication that's too heavy to ignore, in which case you can of course turn deduplication on in the pipeline.
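The duplicate-handling logic just described can be sketched roughly as follows. The duplicate-rate heuristic and the 0.5 threshold are illustrative assumptions, not the pipeline's actual rule.

```python
from collections import Counter

def duplicate_rate(fragments):
    """Fraction of fragments sharing identical (chrom, start, end)
    coordinates with an earlier fragment: a rough PCR-duplication proxy,
    noting that in CUT&RUN genuine reads also stack over cut sites."""
    counts = Counter(fragments)
    dup = sum(c - 1 for c in counts.values())
    return dup / len(fragments) if fragments else 0.0

def should_dedup(is_control, fragments, heavy_threshold=0.5):
    """IgG controls are deduplicated as usual; target samples keep
    duplicates unless duplication looks too heavy to be explained by
    targeted read stacking. The threshold here is purely illustrative."""
    return is_control or duplicate_rate(fragments) > heavy_threshold
```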
That's all fairly standard stuff; what I really want to talk about is read normalization, which is something we changed in version 2.0. One of the main stages of the pipeline is that the aligned reads are stacked up, and we get a bedGraph out of it, which basically shows, for each region, how many reads stacked on top of each other. You can imagine this creates a histogram, and this histogram is what's used when we call peaks with the various peak callers, MACS2 or SEACR. These counts need to be normalized in some way, and there are quite a few different sources of normalization error in these experiments. There are experimental batch effects: different enzymes, different antibodies, different batches of antibodies and so on can all produce different results. That's outside the scope of the pipeline; it's quite difficult to fix once you get to the bioinformatics stage of this class of experiment, so I'll move on from that. The other really big thing we need to account for is epitope abundance. Some epitopes you target, such as some histone marks, are really quite ubiquitous across the genome, while others, like rare transcription factors, are much more targeted. You're going to get fewer reads associated with those low-abundance epitopes, yet they're just as important if you're trying to compare them. So one of the main tasks is to normalize between them, so that we don't get tiny peaks, or no peaks at all, called for a low-abundance transcription factor when we actually do want to detect those sites. The original way to do this was spike-in normalization, which goes back to the ChIP-seq days.
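The read stacking just described can be shown with a toy example: turning aligned fragment intervals into per-base coverage, the "histogram" that a bedGraph summarizes. In practice this is done with dedicated tools such as bedtools genomecov; this sketch only illustrates the idea.

```python
def pileup(fragments, chrom_len):
    """fragments: list of (start, end) half-open intervals on one
    chromosome. Returns per-base coverage via a difference array."""
    diff = [0] * (chrom_len + 1)
    for start, end in fragments:
        diff[start] += 1
        diff[end] -= 1
    coverage, running = [], 0
    for d in diff[:chrom_len]:
        running += d
        coverage.append(running)
    return coverage

def to_bedgraph(coverage):
    """Collapse per-base coverage into (start, end, depth) runs,
    i.e. bedGraph-style intervals."""
    runs, start = [], 0
    for i in range(1, len(coverage) + 1):
        if i == len(coverage) or coverage[i] != coverage[start]:
            runs.append((start, i, coverage[start]))
            start = i
    return runs
```

Two overlapping fragments produce a depth-2 plateau where they stack, which is exactly the signal shape the peak callers consume.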
The spike-in is some E. coli DNA left over from the process of producing the enzyme, either the MNase or the transposase. If you keep the amount of enzyme constant, the ratio of the epitope to the spike-in DNA determines how many cuts you get on the E. coli DNA versus how many you get on your target genome, and you can use this to normalize for how much of the epitope was present. One big problem is that the newer CUT&RUN and CUT&Tag kits are processed so that there's barely any spike-in left over; it's all been cleaned out. We started seeing this a lot with the pipeline: people were coming to me with projects where they couldn't normalize properly against the spike-in because there wasn't any. Some people have realized this and started spiking in their own DNA, but that comes with its own problems of getting the correct amount spiked in. The other requirement for spike-in normalization of epitope abundance is that you need the same amount of material, the same number of cells, in each experiment for the normalization to work. That's just not the case in a lot of experiments, especially with tissue samples; you just can't guarantee it. So again, this kind of normalization is really hard to achieve. In the new version 2.0, we've started to include options for normalizing against read counts and read depths across the genome, using deepTools.
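Under the assumptions just stated (constant enzyme, equal cell input), the spike-in scaling idea can be sketched as below. The constant C is an arbitrary illustrative choice, and this is not the pipeline's actual implementation.

```python
def spikein_scale_factor(spikein_reads, C=10000):
    """Scale factor inversely proportional to the spike-in read count:
    more spike-in reads imply less target epitope per sequenced read,
    so the target signal is scaled down correspondingly."""
    if spikein_reads == 0:
        raise ValueError("no spike-in alignments: spike-in "
                         "normalization is not possible")
    return C / spikein_reads

def normalize_depth(target_depths, spikein_reads):
    """Multiply a sample's per-bin read depths by its spike-in factor."""
    f = spikein_scale_factor(spikein_reads)
    return [d * f for d in target_depths]
```

The zero-reads error branch is exactly the failure mode described above: with the newer kits there may be essentially no spike-in alignments, and the method simply has nothing to normalize against.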
We've found this quite successful so far. It's not as principled as normalizing against spike-in DNA; you're literally just normalizing against the read depths between different samples, which, if you've got different abundances of epitopes, will cause other problems. But it's better than no normalization, and it's proving successful so far, though there's quite a lot of manual tweaking involved. I just wanted to highlight that these are the main questions we're thinking about with this pipeline, and it's not finished; we're going to carry on working out the best way of getting the most robust, trustworthy peaks out of the pipeline. Now, I'm aware I'm probably running out of time, yes I am, so I'll move a little quicker. The final major stage of the pipeline is calling the peaks. I wanted to highlight this because the old peak caller was SEACR, which was produced by the Henikoff lab, who developed CUT&RUN and CUT&Tag. Some people were having issues with it, or just wanted to use MACS2, which is the standard peak caller for higher-background experiments like ChIP-seq and ATAC-seq. So we included MACS2 as an extra peak caller, and you can actually run both peak callers in parallel in the pipeline to compare the results. That's another major change. The last stage of the pipeline is about giving you peaks you can really trust. We've tried to normalize as best we can; when we call the peaks, we call them against the IgG background if it's provided, which is another form of normalization; and we can also compute consensus peaks, that is, how many of these peaks are present across our replicates, and we can be stringent if we want and require a peak to be present in all replicates before it's trusted.
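The consensus-peak idea can be sketched with a toy sweep-line implementation: keep only the regions supported by at least a minimum number of replicates. Real pipelines do this interval arithmetic with tools like bedtools; this is just the logic.

```python
def consensus_peaks(replicate_peaks, min_reps=2):
    """replicate_peaks: one list of (start, end) half-open intervals
    per replicate, all on one chromosome. Returns merged intervals
    where at least min_reps replicates have an overlapping peak."""
    # Sweep over interval endpoints, tracking replicate support
    events = []
    for rep in replicate_peaks:
        for start, end in rep:
            events.append((start, 1))
            events.append((end, -1))
    events.sort()  # ends (-1) sort before starts (+1) at the same position
    consensus, depth, run_start = [], 0, None
    for pos, delta in events:
        prev = depth
        depth += delta
        if prev < min_reps <= depth:
            run_start = pos          # support rises to the threshold
        elif prev >= min_reps > depth:
            consensus.append((run_start, pos))  # support drops below it
    return consensus
```

Setting `min_reps` to the number of replicates gives the stringent "present in all replicates" mode mentioned above.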
As you can see, that's what we really concentrate on: trusted peaks, and transparency through the reporting. So, a key-features summary for version 2.0. This version is not out yet; it should be out in the next few days, hopefully. I'm trying to find time to go through the final changes before it can be approved for release, but hopefully it will be out next week. There's the sample sheet system redesign, the additional read normalization options I've just been through, additional peak caller options, and loads of bug fixes, performance optimizations and things like that. So, another shameless plug for version 2.0: go ahead and use it, and please do let me know about any problems. I'll just touch on this very briefly because I have to finish; it's a note on testing, mostly for the pipeline developers out there. I took the testing that we do on nf-core modules, the YAML-based testing with pytest, and applied it to the whole pipeline, and we now have 213 tests that run with pytest for every code change we make. I think it's really made the pipeline a lot more robust, especially because it's just me, or just a couple of us, working on it. I really think that was important, and if you're developing pipelines and have questions, please come and contact me, because I do think this is a real advantage. And finally, news and the future. The version 2.0 release is imminent, as I've said. We really need developers; it's just me and Tamara working on it, and we really want to push these features forward, but we need you in the community to suggest features and help with the coding if you possibly can.
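For a flavor of the testing approach just mentioned: the real suite drives full pipeline runs through YAML-defined pytest cases, but the core pattern is asserting on the files a stage produces. This self-contained pytest-style sketch uses hypothetical file names and is not taken from the pipeline's actual test suite.

```python
import os
import tempfile

def check_outputs(outdir, expected):
    """Return the list of expected files missing from outdir."""
    return [f for f in expected
            if not os.path.isfile(os.path.join(outdir, f))]

def test_peak_calling_outputs():
    """A pytest-style regression check: after a (simulated) pipeline
    run, every expected peak file must exist."""
    with tempfile.TemporaryDirectory() as outdir:
        # Simulate a pipeline stage emitting its peak files
        for name in ("sample1.peaks.bed", "sample2.peaks.bed"):
            open(os.path.join(outdir, name), "w").close()
        assert check_outputs(outdir, ["sample1.peaks.bed",
                                      "sample2.peaks.bed"]) == []
```

Running checks like this on every code change is what catches regressions early when only one or two people maintain a pipeline.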
We're also going to look at more options for peak calling, and some very constrained downstream options such as nucleosome positioning and transcription factor footprinting; we're looking into those, and while we don't want to go too far downstream, these look quite good. Finally, my main project is of course single cell, and I really want to adapt this pipeline to work with single-cell data at some point. There's a lot of discussion to be had around that, but I really would like to have a robust single-cell CUT&Tag pipeline, because I think that's the future. So, thanks for listening. I wanted to thank everyone in the Luscombe and Briscoe labs; Charlotte West, who I think is on the call, because she was the original co-developer of this pipeline and has now left; and also Tamara Hodgets, who's the new co-developer on the project. Thanks everyone, and thanks for listening. Thank you very much. I've now enabled people to unmute themselves for Q&A; you can of course also write in the chat and I'll read out the questions. So, are there any questions? I have one. Go ahead. Okay, yeah, so you've introduced these two new normalization options, normalizing against read counts or read depth. Do you have specific scenarios in mind as to when it's better to use read counts and when it's better to use read depths? Yeah, not at the moment. Basically, deepTools has some normalization options available that are really rooted in the RNA-seq world; there's a bunch of normalization against, you know, kilobase length of the transcripts and things like that, the classical RNA-seq normalization techniques.
We've taken some of those options and applied them to set regions of the genome. At the moment we basically have a bin size of one on the genome; we calculate the read depth at that bin size and normalize each region against the other samples. You can widen the bin if you want, to cover a larger number of features. So it's really just to get the samples a little more in line with each other, and we're waiting to see how helpful those options are downstream. The other part of this feature was being able to turn spike-in normalization off, because it was on by default the whole time and you couldn't change that; so there's now an option to turn it off entirely. The idea is that we provide these extra options, people start playing with them, and they come back to us about how useful they are. As a case in point, for a group I'm working with at the Crick, we turned off spike-in normalization and just did a bin-of-one read-depth normalization, and that resulted in the samples looking a lot better. But the IgG background was super high, because the relative read depth on the IgG samples is low; you get fewer reads with IgG because the signal is spread out more. So we included an extra parameter in the pipeline to scale the IgG background back down before using it to call peaks, so we can now basically use the IgG scaling to change how many peaks are called on a sample. We're now in a situation where we run with IgG scale factors of 0.2, 0.4, 0.6, 0.8 and 1, look at how many peaks are called for each sample, and tune it to the experimental question we're looking at.
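The IgG scaling sweep just described can be sketched with a toy model: scale the IgG background track before subtracting it from the target signal, then count how many peaks survive at each scale factor. The threshold-based "peak" definition here is purely illustrative, not how MACS2 or SEACR actually work.

```python
def call_peaks(target, igg, igg_scale=1.0, min_signal=2.0):
    """Per-bin background subtraction: a bin is inside a peak when
    target depth minus the scaled IgG depth exceeds min_signal."""
    return [t - igg_scale * g > min_signal for t, g in zip(target, igg)]

def peak_count(mask):
    """Count contiguous runs of above-threshold bins."""
    return sum(1 for i, m in enumerate(mask)
               if m and (i == 0 or not mask[i - 1]))

def sweep_igg_scales(target, igg, scales=(0.2, 0.4, 0.6, 0.8, 1.0)):
    """Return {scale: number of peaks}, the tuning sweep from the talk."""
    return {s: peak_count(call_peaks(target, igg, s)) for s in scales}
```

Lowering the IgG scale lets weaker regions clear the background, so the peak count rises, which is exactly the knob being tuned to the experimental question.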
With transcription factors, you may want more peaks called so you can pick up more binding sites, whereas with histones you might want to raise the threshold for peak calling so that fewer peaks are called. So it's really interesting; it's an active area of development. What I'd suggest, if you're going to turn spike-in normalization off, is to use the CPM normalization mode, which is what's recommended in the documentation, then try an IgG background threshold, the five different values being 0.2, 0.4, 0.6, 0.8 and 1, and see how it looks in IGV or however you view your peaks. Thank you very much. Artemis? Artemis Gordas? Yes, thank you. I want to ask about the QC. You've stressed that you provide so many QC reports for the user to assess, but I'm coming from the perspective of a person who has never done preprocessing for peaks, and it's very hard, after you get the reports, to assess whether they're good or not. Could you provide some kind of representative series of QCs from different data, since you're communicating with the users, just to show how good quality would look versus bad quality? It's really non-intuitive, and I've tried to find documentation somewhere on how it should look, or in the papers, but people don't write about it. That's a really excellent question, and a great idea. I shall absolutely do that; I'll create a new section of the documentation showing some examples. That's a really good idea, thanks a lot. And then we have another question, from Harshil Patel. Great talk, thank you. My question goes back to the normalization. When you normalize in the way you mentioned, how do you factor in global changes?
Say, for example, you have a control group with a certain level of signal, and then a treatment group with a systematic change, an uplift everywhere in that signal. Normalization at the base-pair level, or per region, would essentially cancel that out, in which case you wouldn't really see a signal, even though there could be a real change there, right? So, there are options to do global normalization as well. Because it's a bit experimental, you do have to have exactly the understanding you just described; you have to know which normalization options you need because of that, the pipeline doesn't decide for you. But there are global normalization options; if you want to, you can literally just normalize against the total read count. Yeah, but even in that scenario, I think it would cancel out. So, ironically, the only way you could really detect proper global changes is via spike-ins, even though they're unreliable for this type of experiment, because they give you some sort of reference point for how much things are changing across your sample groups. Yeah, I couldn't agree more. I don't pretend that read-count normalization is the answer; there's a reason people don't do read-count normalization, because it's not particularly accurate, and I'd agree that spike-in is a far better normalization option when it's available. But we were just getting so many projects where there were, you know, ten spike-in reads or something, no alignment whatsoever, so there was no option to do it. It was also giving us some really screwy results even when reads were found, because it relies on you having the same cell counts; I think that was mainly the problem, just differing amounts of material.
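The total-read-count normalization just mentioned can be sketched in a few lines (a CPM-style scaling; the function name and the per-million constant are illustrative choices, roughly matching deepTools' CPM mode).

```python
def cpm_normalize(depths):
    """Counts-per-million: scale each per-bin depth by
    1e6 / (total reads in the sample), so samples sequenced to
    different depths become comparable."""
    total = sum(depths)
    if total == 0:
        raise ValueError("empty sample: nothing to normalize")
    return [d * 1_000_000 / total for d in depths]
```

Note that this sketch also demonstrates the cancellation problem raised in the question: a sample with a uniform two-fold uplift everywhere normalizes to exactly the same track as the original, so a genuine global change is invisible without an external reference such as a spike-in.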
That's okay. I'm always surprised it works at all for this type of experiment; RNA-seq is very different, but here you have lots of background, variability in your antibodies, variability in cell count, variability in pull-down. Yeah, I'm surprised it even works, to be honest. And for the next version, honestly, we really need to start a community-based project on this: find people who are interested in CUT&RUN and try to find a good way of normalizing this data that takes these factors into account, because none of it is a magic bullet. It really affects the results; you have to run it with different parameters, and the number of peaks called is completely different depending on how you parameterize the normalization. Okay, if there are no other questions from the audience, then I want to close by thanking Chris again for a great talk. I'd also like to thank the Chan Zuckerberg Initiative for funding us, and anyone who has further questions later on can always reach us on the Slack channels for either cutandrun or bytesize. Thank you very much, everyone.