So, what I have to do here is so much easier than what Ben just did that I'm almost embarrassed to go back to slides, because the live demo is incredibly nerve-wracking. So we've been working on the RNA-seq pipeline, and at the beginning we told you that we have implemented an RNA-seq processing pipeline. We've also implemented a ChIP-seq processing pipeline, for both transcription factors and for histone modifications, as well as a whole-genome bisulfite pipeline, and a DNase-seq pipeline is coming.

I'm not going to do a live demo of the ChIP-seq pipeline, but I did want to tell you what the architecture of the pipeline is and make you aware that it exists, so that if you're also interested in processing ChIP-seq data, you should be able to do that. The deployment of the ChIP-seq pipeline is identical to the RNA-seq pipeline: it's on DNAnexus, and exactly the same process that you would go through to add data to the beginning of the RNA-seq pipeline applies to ChIP-seq as well.

Like many of the pipelines that we deploy, this one has a mapping step, then a peak-calling step, and then a statistical framework that is applied to the replicated peaks at the end to assess the concordance of biological replicates. All ENCODE experiments are replicated, so this last piece, called IDR (Irreproducible Discovery Rate), is something that we run on all of the ENCODE experiments. If your experiments are not replicated you can't run IDR, but you can still call peaks.

Briefly: for both transcription factors and histone modifications, the mapping step is done with BWA, and duplicates are marked and removed. The peak-calling step for transcription factors uses SPP; the peak-calling step for histone modifications uses MACS2. MACS2 is also used to generate the signal tracks for both histone modifications and transcription factors, and I'm going to tell you about the difference between the peak calls and the signal tracks in just a moment. And then, as I mentioned, IDR is a statistical framework that allows assessment of the concordance of two replicates.
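To make that architecture concrete, here is a minimal sketch of the three stages strung together from Python. This is an illustration under assumptions, not the pipeline's actual code: it assumes bwa, samtools, Picard, and MACS2 are installed, and every file name in it is hypothetical.

```python
# Hypothetical sketch of the mapping and peak-calling stages; not the
# actual ENCODE pipeline code. Assumes bwa, samtools, java/picard, and
# macs2 are on PATH and that the reference and FASTQ files exist.
import subprocess

def run(cmd):
    """Run one shell command and fail loudly if it errors."""
    subprocess.run(cmd, shell=True, check=True)

# 1. Mapping: align reads with BWA and produce a sorted, indexed BAM.
run("bwa mem hg38.fa chip_rep1.fastq.gz | samtools sort -o rep1.bam -")
run("samtools index rep1.bam")

# 2. Mark and remove duplicates (Picard shown here as one option).
run("java -jar picard.jar MarkDuplicates I=rep1.bam O=rep1.nodup.bam "
    "M=rep1.dup_metrics.txt REMOVE_DUPLICATES=true")

# 3. Peak calling: MACS2 for histone marks (the TF branch of the
#    pipeline uses SPP at this step instead).
run("macs2 callpeak -t rep1.nodup.bam -c control.nodup.bam "
    "-f BAM -g hs -n rep1 --broad")
```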
Something that we haven't talked about yet, but which is an important advantage of deploying the pipelines in the way that we have, is that we can generate all sorts of quality assurance metrics. All ENCODE experiments have target read depths, they have target library complexities, they have goals for data quality that have to be reached for those experiments to be accessioned and distributed as ENCODE products. The calculation of those quality assurance metrics matters to us because that's how we figure out whether the experiments were any good or not, but it will also matter for you when you run your experiments through these pipelines, because you can compare your data to ENCODE data and see how it stacks up. I wanted to point you to resources where you can learn about these QC metrics, rather than step through the math, which I probably couldn't do justice to anyway. We calculate four general categories of quality assurance metrics for the ChIP-seq experiments, and some of these also apply to DNase experiments.

Of course we calculate the read depth, and there's an excellent paper that I refer you to here that talks about target read depths for ChIP-seq experiments and what read depths you want to achieve for different histone modifications or different transcription factors, depending on how often they bind to the genome. We also calculate estimates of the complexity of the library that you sequenced; those are called the NRF (non-redundant fraction) and the PCR bottleneck coefficient, and they are documented in this paper here. There's also a strand cross-correlation method, documented here, that we calculate as part of the pipeline; that's a measure of the quality of the ChIP itself. And then, as I mentioned, IDR is the way that we quantify the concordance between replicates. Rather than step through in great detail what all of these metrics are or what they mean, I just wanted to put this in the slide deck so that you can go back, look them up, and read about them if you want to.
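To give a feel for the library-complexity metrics, here is a minimal sketch of how the NRF and the first PCR bottleneck coefficient (PBC1) can be computed from the 5' mapping positions of the reads. The function name and the tuple representation of a position are hypothetical; the definitions used (distinct positions over total reads, and single-read positions over distinct positions) follow the published descriptions of these metrics.

```python
# Minimal sketch of two library-complexity metrics computed from 5'-end
# mapping positions; `positions` is a hypothetical list of
# (chrom, strand, pos) tuples, one per mapped read.
from collections import Counter

def complexity_metrics(positions):
    """NRF  = distinct mapping positions / total mapped reads.
    PBC1 = positions seen exactly once / distinct positions."""
    counts = Counter(positions)
    total = sum(counts.values())           # total mapped reads
    distinct = len(counts)                 # distinct mapping locations
    once = sum(1 for c in counts.values() if c == 1)
    return {"NRF": distinct / total, "PBC1": once / distinct}

# Toy example: 5 reads, 4 distinct positions, 3 seen exactly once.
reads = [("chr1", "+", 100), ("chr1", "+", 100),
         ("chr1", "-", 250), ("chr2", "+", 42), ("chr2", "+", 99)]
print(complexity_metrics(reads))  # {'NRF': 0.8, 'PBC1': 0.75}
```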
So you've already seen this, most of you, I hope, but this is just what the histone ChIP-seq pipeline looks like running on DNAnexus. It looks just like the RNA-seq pipeline: there is a workflow that is composed of steps. Some of those steps run concurrently, some run by themselves, and others run at the end, and this is the display that you see when you run one of these pipelines to completion on the platform.

I also wanted to show you that after we run these pipelines at the DCC, we accession all the output at the ENCODE portal, and I thought I would quickly show you what that looks like: an experiment that has been run on DNAnexus and then accessioned back into the ENCODE portal. Here I am at encodeproject.org. This is the experiment page for an experiment that has been run through the pipeline; this is what Yuri showed you yesterday. You can access the metadata that describes the experiment, but now, if we scroll down, we see the files that are generated by the pipeline, and this is a graphical representation of what just happened on DNAnexus. Files are yellow bubbles and software steps are blue rectangles, and you can follow the trajectory that the raw data takes through the pipeline by following the arrows through this graph. I won't step through it, but what I want you to see is that on the portal, without ever going to DNAnexus at all, you can see exactly what the relationships are between the input files, the intermediate files that are generated, and the final output. That's what we're trying to depict in this graph: the relationships between the files and the software steps that generated them. Again, this is all on the ENCODE portal, so this is accessioned metadata about a processing pipeline that was run. You can click on each one of these; I'll just click on this one in the middle at random and scroll down, and you'll get additional metadata about that file as well as a link to download it. What we're trying to accomplish on this page is to show you how the files were generated and to give you direct access to them, one by one.

Yes? Okay, so the question was that one of these files says "signal over control," and what does that mean. I'm going to go back to the slides, because this is important if you care about ChIP-seq. This ChIP-seq pipeline actually creates quite a number of outputs. It generates peak calls, which are these blocks that you see on tracks: they have a definite start and a definite stop, and they're generated based on the raw signal. But we also generate the continuous tracks that you see on the browser, showing where the ChIP signal was high and where it was low. All of those signal tracks are normalized to the control experiment for the ChIP; that could be input DNA, or it could be a mock IP. The signal tracks that you see output from the uniform pipeline are normalized to the control, so if you see a positive-going trend in that track, you know that it came from the experiment and not from the control. Did I answer your question? Okay. Well, I answered some question; I hope someone had it.
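For a sense of how a control-normalized track can be produced mechanically, here is a minimal sketch using MACS2's bedGraph comparison step to build a fold-enrichment track of ChIP over control. These are not necessarily the exact commands the uniform pipeline runs; the file names are hypothetical, and macs2 and UCSC's bedGraphToBigWig are assumed to be on the PATH.

```python
# Hypothetical sketch of building a control-normalized signal track
# with MACS2; assumes macs2 and bedGraphToBigWig are installed.
import subprocess

def run(cmd):
    subprocess.run(cmd, shell=True, check=True)

# -B --SPMR writes depth-normalized bedGraph pileups for both the ChIP
# sample (treat_pileup) and the control (control_lambda).
run("macs2 callpeak -t chip.nodup.bam -c control.nodup.bam "
    "-f BAM -g hs -n exp -B --SPMR")

# Divide the ChIP pileup by the local control estimate: a positive-going
# excursion in this track reflects the experiment, not the control.
run("macs2 bdgcmp -t exp_treat_pileup.bdg -c exp_control_lambda.bdg "
    "-m FE -o exp_fold_enrichment.bdg")

# Convert to bigWig for genome-browser display.
run("bedGraphToBigWig exp_fold_enrichment.bdg hg38.chrom.sizes "
    "exp_fold_enrichment.bw")
```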
Yes, yes, that's correct. Okay, let me repeat your question. The question was about exactly how you input the control files into the pipeline; you didn't see anything like that for RNA-seq. The ChIP-seq pipeline takes these types of controls, and you add the FASTQs from a control experiment in exactly the same way that you add the input FASTQs from the experiment itself. So the input to the pipeline is FASTQs from your experiment and also FASTQs from the control. Yes, so typically we match controls to experiments: if you have two replicates, you will also have two control replicates. However, the pipeline will run if you submit the same reads as both controls. We do a certain amount of read normalization between the two controls; if one control is very shallow and the other is very deep, for example, we'll pool them and use that pooled control for both replicates.

Yes? "Do you have any way of aligning the peaks, so that we can say that this is actually the same peak in different samples?" That's a good question. From two replicates, or from two different experiments? From different experiments, no, we do not. This actually brings up an important design criterion for all of our pipelines. Our pipelines are designed to take one experiment, usually a replicated experiment, and produce a uniform output from that one experiment that is then comparable across many experiments. But that comparison across experiments, that's for you to do. Our pipelines are designed to take the primary experiment data and process it into output that can be consumed by any analysis algorithm you might want to apply to compare experiments. So most of our pipelines work within one replicated experiment.

Yes? "Comparing two different time points or two different samples, that's a major hurdle." It is hard to do, but that's not why we didn't do it; it's really because our role as a data coordinating center is to give uniform output from each experiment that can then be used for subsequent analysis. So we would consider that a subsequent analysis, if you want. Happy to take any other questions. Sometimes it takes a second for the microphone; there it is, you're good now.

"So all of this data is being generated at different centers, with possibly different instruments, different flow cells, lanes, and all that. To follow up on the question that was just asked, how do you normalize across all of those things? It sounds like maybe you don't, that it's a downstream thing, but can you give us any idea how we would do that? Because those effects can be kind of significant."

Okay, that's a good question. This is one of the reasons why we take primary reads and not, for example, mapped reads. We could build our pipelines to take BAM files, but you might not map your reads in exactly the same way that we would have mapped them, and that difference can propagate through to the end. Then when you do your PCA, PC1 is the lab, right? Which is not what you want, and I think that's what you're concerned about. What we have found is that within the consortium there are working groups that set standards for how experiments are performed, and those are documented on the portal; you pointed that out yesterday. If those guidelines are followed, and, for example for ChIP, the antibodies have been characterized to the same levels and the ChIP is performed in the same way, then even for data from multiple locations run at different times, the FASTQs, if you put them all into the same pipeline, are comparable. What isn't necessarily comparable is the read depth or the libraries themselves, and that's what I was talking about with the QC metrics that we calculate. Those are definitely not uniform; they're not uniform within a lab, and neither are they uniform across labs. That's one of the reasons why we generate all of those QC metrics: they should all fall into target ranges in order for you to be able to compare the data in the end. So I haven't really given you a checklist that you might go down to ensure that an experiment you want to compare to ENCODE is comparable, but you definitely want to calculate the same QC metrics through the pipeline and compare them to experiments that were done within ENCODE. If they're very different, then it's unlikely that the results will be comparable. Thank you for that question.

Okay, yes, go ahead. Yes? "Maybe I missed it, but could you explain this step from the BAMs to the pseudo-replicates and the pooling?" Right, I'd be happy to. I'm going to give you the answer for histone ChIP first, because it's simpler; the answer for TF, transcription factor, ChIP is slightly different. For histone ChIP, what happens here is that we call peaks for each actual replicate. Say an experiment has two replicates: we call peaks on replicate 1, and we call peaks on replicate 2. We then take the reads from both of those replicates and pool them, and we call peaks on the pool. So now I've called peaks three times: on each of my true replicates, and on all the reads pooled together. That's different from concatenating the peak lists, right? It's an independent peak calling on pooled reads. Then we back up, take that set of pooled reads, and split it in half; we call those pseudo-replicates. The reads are chosen at random, without replacement. So we split the pooled reads in half, and then we call peaks on each of those pseudo-replicates. So I've called peaks on five read sets: true replicate 1, true replicate 2, the pool, pseudo-replicate 1 of the pool, and pseudo-replicate 2 of the pool. Five sets of reads we've called peaks on.

In the end, when we report the replicated peaks, which you'll see when you bring up the experiment page, there will be rep 1 peaks, rep 2 peaks, and then there will be replicated peaks. The replicated peaks are those peaks which appear in both true replicates. That's good, right? You've replicated your peak, it's in both places. Or, if a peak doesn't do that, it has one last chance to get into this set: if it appears in both of the pseudo-replicates of the pool, then it also qualifies as a replicated peak. So that's what's happening here: we pool the replicates, we subsample into pseudo-replicates, and we call peaks on all of those. That's why this list here is long. All of this is in order to generate the subsampled pools from which we decide whether peaks are in fact replicated.

That's for histone. For TF ChIP, where we actually run a full IDR protocol, there are additional pseudo-replicates that are generated: pseudo-replicates of the true replicates are also generated and fed into the IDR framework. Those are not accessioned on the portal, so you'll never see those files; they exist only within the pipeline. But they contribute to the IDR-thresholded peaks that you get in a TF ChIP experiment. So it's subsampling and pseudo-replication within the true replicates, run through this framework, that gives an unbiased, quantitative way of determining whether a peak came from both replicates. I hope that was helpful. Thank you. Yes? "For the replicated peaks, are the coordinates based on the pool or based on the true replicates?" That's a great question, and yes, they're from the pool.
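As a minimal sketch of what was just described, the following toy code pools the reads from two replicates, splits the pool in half at random without replacement to make the pseudo-replicates, and then applies the replicated-peaks rule: a pooled peak is kept if it overlaps a peak in both true replicates or, failing that, a peak in both pseudo-replicates of the pool, with coordinates coming from the pooled call. All names and data structures here are hypothetical; the real pipeline operates on BAM files and uses proper interval tools.

```python
import random

def make_pseudoreplicates(rep1_reads, rep2_reads, seed=0):
    """Pool the reads from both true replicates, shuffle, and split
    the pool in half at random without replacement."""
    pooled = rep1_reads + rep2_reads
    rng = random.Random(seed)
    rng.shuffle(pooled)
    half = len(pooled) // 2
    return pooled, pooled[:half], pooled[half:]

def overlaps(a, b):
    """Two (chrom, start, end) intervals overlap on the same chromosome."""
    return a[0] == b[0] and a[1] < b[2] and b[1] < a[2]

def hit(peak, peak_set):
    return any(overlaps(peak, p) for p in peak_set)

def replicated_peaks(pool_pk, rep1_pk, rep2_pk, ppr1_pk, ppr2_pk):
    """Keep a pooled peak if it is found in both true replicates,
    or in both pseudo-replicates of the pool; report pooled coordinates."""
    return [p for p in pool_pk
            if (hit(p, rep1_pk) and hit(p, rep2_pk))
            or (hit(p, ppr1_pk) and hit(p, ppr2_pk))]

# Toy example: strings stand in for reads, tuples for called peaks.
rep1 = [f"rep1_read{i}" for i in range(100)]
rep2 = [f"rep2_read{i}" for i in range(100)]
pool, ppr1, ppr2 = make_pseudoreplicates(rep1, rep2)
# In the pipeline, peaks would now be called on all five read sets:
# rep1, rep2, pool, ppr1, ppr2. Here we just hard-code toy peak calls.
pool_pk = [("chr1", 100, 200), ("chr1", 500, 600), ("chr2", 50, 150)]
rep1_pk = [("chr1", 110, 190)]
rep2_pk = [("chr1", 120, 210), ("chr2", 60, 140)]
ppr1_pk = [("chr2", 40, 160)]
ppr2_pk = [("chr2", 55, 145)]
print(replicated_peaks(pool_pk, rep1_pk, rep2_pk, ppr1_pk, ppr2_pk))
# -> [('chr1', 100, 200), ('chr2', 50, 150)]
```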
Okay, this will not take long. I've shown you the graph of file relationships, and the only other thing that I wanted to show you is that each of these files is also available, not only through the graph in the way I showed you, by clicking on an individual file, but also in a list of files down here at the bottom of the experiment page on the portal. What I wanted to make clear through these slides is that we've spent a lot of time talking about this platform where we actually run the experiments on the cloud, but all of the results of those runs are distributed through the portal. So they could have been generated anywhere, I suppose, but we do, in fact, use these pipelines that we're sharing with you, and the results are accessioned and distributed through the portal. So I think I'll stop there, and now that Ben has had a chance to catch his breath, we'll see if we can't visualize the results of your pipeline.