I'm Jose Espinosa-Carrasco from the Comparative Bioinformatics group at the CRG, and today I will introduce you to the nf-core/chipseq pipeline. First, a little bit of background about me. I'm currently a postdoctoral research fellow in Cedric Notredame's lab at the CRG, the Comparative Bioinformatics group. As some of you may know, this is the group where Nextflow was created by Paolo Di Tommaso, and this is actually my second time there — I was already in the lab when Nextflow was being developed. I also put these two logos here because we are actively contributing to BovReg, a consortium to annotate the genome of the cow; my boss likes to say it's "ENCODE for cows". This sits under the broader umbrella of EuroFAANG, part of the Functional Annotation of Animal Genomes project, which aims to annotate animal genomes. We are essentially using nf-core pipelines for this, both in BovReg and in EuroFAANG. And I'm also a core member of nf-core.

Now, a little bit of background about ChIP-seq. Probably all of you know about this, but what we want to obtain from a ChIP-seq experiment are these kinds of peaks, which show us where our transcription factors are binding in the genome, or where the histone modifications are. This is normally how the experimental procedure goes. First, the transcription factors, which are proteins, are cross-linked to the DNA at the places where they are sitting; this is normally done with formaldehyde. Then there is a sonication step to shear the DNA into fragments. Then comes the immunoprecipitation step, which is what gives the technique its name: this is how we pull down the transcription factor we are interested in. Finally, the DNA is purified and the library is prepared. I'm not a wet-lab guy, so probably many of you can explain this better than myself.
And yes, as I said before, these are the kinds of results we obtain after running the nf-core/chipseq pipeline, or other peak-calling pipelines. OK, so some figures about the nf-core/chipseq pipeline. I was looking at it yesterday, and in terms of GitHub stars it's a fairly popular pipeline, although it has not been updated for a long time, as we will discuss today. So it's a quite popular, quite widely used pipeline. It was originally developed by Chuan Wang and Phil Ewels, and then adapted for nf-core by Harshil Patel. As this timeline shows — I was checking this yesterday because I thought chipseq was one of the first pipelines released in nf-core; it's not the very first, but it's among the first ten — it was first released in June 2019. This is the release cycle of the chipseq pipeline itself: first released, as I said, in June 2019, then updated in November 2019, and version 1.2 was released in July 2020, followed by two minor releases. This means that since that point there has not been any real big update to the pipeline. We are now working on the development of the DSL2 version of the pipeline. Actually, most of the things I will discuss today apply to both the DSL2 and the DSL1 pipeline; where something only applies to one of the versions, it will be the DSL2 one, even though it's not yet the stable version. We have been dying to release this version for a long time — in this sense we are approaching Sarek, or even worse than Sarek — and we have not yet released version 2.0, although we are very, very close to it. So here is the pipeline overview.
So the pipeline starts with your FASTQ files and an input sample sheet that I will discuss during the presentation. There are some quality control processes, like FastQC here, and the adapters — well, this is not quite quality control — are removed with Trim Galore!. Then the alignments are performed. In version 1.2 the only aligner available was BWA; in the new version, three other aligners become available: Bowtie 2, STAR, and Chromap. After the alignment, some alignment statistics are calculated with SAMtools. Then there are these other processes shown here. If there are replicates, they are merged with Picard. Duplicates are marked, also with Picard. There is some quality control at the alignment level using Picard and Preseq. The BAM files are then filtered to remove the duplicates that we marked in the previous step, and the blacklisted regions are also removed — there are some regions of the genome that are difficult to align to, these are known, and they are filtered out here. After all these processes, alignment statistics are calculated again.

And then here we have some of the analyses that the pipeline performs. We produce this fingerprint plot and these read distribution profiles; these can also be seen as quality control plots, because you can see the distribution of the profiles of the peaks, for instance how they bind to the DNA. Also, this strand cross-correlation plot is produced with phantompeakqualtools. BigWig files are produced along with the peaks, so that they can be used downstream and for visualization. And then, of course, we call broad or narrow peaks — the pipeline allows these two modes.
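In MACS2 terms (the peak caller the pipeline uses), these two modes roughly correspond to calling with or without the `--broad` flag. A minimal sketch, not taken from the pipeline itself — the file names, sample name, output directories, and the effective genome size value are all hypothetical:

```shell
# Narrow peaks, e.g. for transcription factors:
macs2 callpeak -t ip.bam -c control.bam -g 2.7e9 -n my_sample --outdir macs2_narrow

# Broad peaks, e.g. for histone marks:
macs2 callpeak -t ip.bam -c control.bam -g 2.7e9 -n my_sample --broad --outdir macs2_broad
```

Here `-t` is the IP (treatment) BAM, `-c` the control, and `-g` the effective genome size that I will come back to later.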
So normally, narrow peaks are called for transcription factors and broad peaks for histone modifications, because those regions tend to be much wider. Then we run HOMER to annotate the peaks that are produced — to see how they are located relative to genomic features, for instance genes. There is also this process with MACS2, which is to call consensus peaks across a given antibody. We also run Subread featureCounts to get the number of reads found per peak, for example. And this is something we will also discuss: we run DESeq2 only for quality control, so only the PCA is produced. In the previous version some differential binding analysis was done, but we agreed that these downstream processes should not be in the main pipeline; that's why they have been removed.

So here I'm listing the main DSL2 updates and features. Of course, the pipeline has been ported to DSL2 syntax. This means that all the modules that were not yet available in nf-core/modules have been implemented; we also needed to implement some new modules for steps that run several tools in one process — you are probably familiar with this. More specific to the pipeline: the files containing the blacklisted regions that I mentioned before have been updated. We have included the new aligners — BWA is still the default, but you can choose from the others. Actually, there is something I'm not completely sure about: whether Chromap is working as expected; here I would need help from someone more familiar with that aligner. Then the effective genome size logic has been refactored — this is a parameter that MACS2 needs to call peaks, and we have changed the logic. The input sample sheet format has been modified. As I mentioned before, the differential binding analysis on the consensus peaks has been removed from the pipeline. And of course, we have fixed some bugs.
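To make the effective genome size idea concrete, here is a toy sketch of the underlying concept — the number of distinct k-mers of length equal to the read length, skipping ambiguous bases. This is only an illustration; the function name is made up, and the pipeline itself delegates the real calculation to khmer on full genomes:

```python
def effective_genome_size(sequence: str, read_length: int) -> int:
    """Count distinct k-mers of length `read_length`, skipping any
    window that contains an ambiguous base (N)."""
    kmers = set()
    for i in range(len(sequence) - read_length + 1):
        kmer = sequence[i : i + read_length]
        if "N" not in kmer:
            kmers.add(kmer)
    return len(kmers)

# Toy 12 bp "genome" with one ambiguous base; of the nine 4 bp
# windows, one is a duplicate and four contain an N, leaving
# four distinct k-mers: ACGT, CGTA, GTAC, TACG.
print(effective_genome_size("ACGTACGTNACG", 4))  # → 4
```

The longer the reads, the more of the genome becomes uniquely mappable, which is why the value depends on the read length.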
So this is just to show you the situation with these blacklisted regions. As you can see, the issue here is closed, because this has already been implemented. But I just want to throw out a warning: if you are still using 1.2, you probably need to update the blacklist — in the case that you are using one of the genomes where these lists are available — using the blacklist parameter of the pipeline. If you are using the development version, you don't need to worry about this.

Then, as I mentioned before, MACS2 needs this effective genome size — the macs_gsize parameter, which is encoded in the pipeline. We have now included in the iGenomes configuration the macs_gsize values for the corresponding read lengths; we calculated them based on the link here. If your genome is in the iGenomes file and you provide the read length, the value will be taken automatically from these maps — that's why we need the new read_length parameter. If that's not the case, the pipeline will calculate the value for your genome in the same way we calculated these values, using the khmer unique-kmers module that has been implemented.

Then — I think I'm running quite late — this is how the input looks. You have the sample, fastq_1, fastq_2, antibody, and control columns. We have seen similar formats several times during the bytesize talks. As you can see here, samples will be merged: for instance, these two samples will be merged — when everything before the rep1 and rep2 suffixes is identical, this tells the pipeline to merge the samples. If you have single-end reads, like in this case, you just provide one file here; if you have paired-end reads, you also provide the second file here. This is the IP and this is its control; the control, as you can see, is then listed here as its own sample and, of course, does not have the control field filled in.
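As a concrete illustration, a sample sheet along these lines might look as follows. The column layout follows what is described above, but the sample names, file names, and antibody are hypothetical — check the pipeline's usage docs for the exact format of the version you run:

```csv
sample,fastq_1,fastq_2,antibody,control
WT_IP_REP1,wt_ip_rep1.fastq.gz,,BCATENIN,WT_INPUT
WT_IP_REP2,wt_ip_rep2_R1.fastq.gz,wt_ip_rep2_R2.fastq.gz,BCATENIN,WT_INPUT
WT_INPUT_REP1,wt_input_rep1.fastq.gz,,,
```

The first IP replicate is single-end (empty fastq_2), the second is paired-end, and the control row leaves the antibody and control fields empty.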
So for running the pipeline, once you have this sample sheet, you just need these parameters. This is for running the full test — in fact, this is taken from this link; it's what the test_full profile uses to run the full-size test data. You provide the genome, and now you also have to provide the read length so that the value of the macs_gsize parameter can be taken from the map in the iGenomes configuration. With this command you will be able to run the pipeline. In this case I put the dev branch because, as I mentioned, it's almost in production, and it should be quite safe to use it, with the new features it has. And yes, probably it will be released before the end of the summer. There are more parameters that you can look up in the parameters documentation to configure your run of the pipeline — please take a look there, and if you have any questions, just drop us a line on Slack.

And this is something that pops up many times on Slack, which is why I put it here: you need controls for running the chipseq pipeline. We know there are old experiments that maybe did not have controls, but the pipeline is currently designed to be used with controls. There is a kind of hack — that's why I put this answer from Harshil; I don't know if he's connected — so you can use the chipseq pipeline without controls using this parameterization, and in principle it should work. But ideally, the best thing is to use controls: if you are designing your experiments, of course you should have your controls.

And then, this is the output of the pipeline with the command line that I previously mentioned. This is also available on the website, so you can go there and see all the results for the full test data set. It still corresponds to version 1.2, but hopefully it will be updated soon.
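Putting the run together, the command described above looks roughly like this. This is a sketch, assuming the dev branch and Docker; the read length, file paths, and genome key here are hypothetical placeholders, and the exact parameter set may differ, so check the parameters documentation:

```shell
nextflow run nf-core/chipseq -r dev \
    -profile docker \
    --input samplesheet.csv \
    --genome GRCh37 \
    --read_length 50 \
    --outdir results
```

Any iGenomes key can be substituted for the genome, as comes up in the questions below.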
And then, we already have plans for future releases, because we want to get this 2.0 out, and we want it to be quite similar to version 1.2.2 so that we can identify any bug or any problem that we have. From there, we can start growing version 2.0 if there are features that are needed by the community. Two of the things planned for version 2.1 are, first, to include the metro map — as you have seen, I drew a schematic earlier, which was not very nice; I didn't have time to look at James' talk and create one — and, second, to add the irreproducible discovery rate (IDR), which is used to check consistency between replicates and is kind of a standard, because it's the measure that was used by ENCODE. And of course, we are open to ideas, and if you find a bug, please tell us. And with this, I'm done. We now have a summer break in terms of bytesize talks — I think the next one is on the 13th of September, but probably Franziska knows this better than me. So if you have any questions, yes, tell me. That's all, thank you very much.

[Host] I have now enabled everyone to unmute themselves, so if there are any questions, you can ask them directly or put them in the chat.

[Question] OK, thanks. I have a question — that was a great talk, by the way, thanks for the effort. I was wondering whether you can change the genome build while supplying the command-line arguments; it looks like you had hardcoded hg19 into the command.

[Answer] So, about what you can provide: in the iGenomes configuration file there are several genomes. One of those is hg19, but there are more, and not only human — there are also mouse genomes and so on. You can just change the key, and this way all the files that you need — the genome FASTA file, the genome sizes that I told you about — are automatically retrieved by the pipeline.
But in the case that you don't have them — I mean, in the case that you are running a genome that is not there — you can provide these parameters and files to the pipeline, and it will run. It's just for simplicity that I included this genome in the command.

[Question] OK, yeah, thank you.

[Host] Are there any more questions? OK, if there are no questions, then I would like to thank Jose, of course, and the Chan Zuckerberg Initiative for funding these talks. As usual, this talk will be uploaded to YouTube. If you have any questions later on, you can always come to the chipseq Slack channel or the bytesize channel and ask your questions there. Thank you very much.