Hi, I'm Eric, from Gene Yeo's group at UCSD. My email is there; anyone who wants to email us afterwards with questions, I'm happy to answer. Brent and I talked. Brent is talking tomorrow, and I think Friday as well, so he's going to cover the integrated efforts and many of the data types in his talk, and I'm going to focus on our eCLIP data, because it's the newest and the most different from any of the other data types on the ENCODE portal, and we figured it would be useful to describe it really well for people. Our group is a little unusual relative to everyone else here, in that the vast majority of ENCODE is focused on transcription regulation and regulation of the genome at the DNA level: epigenetics, transcription factor binding, chromatin regulation. Our group is almost exclusively focused on RNA. There's potentially a little bit of co-transcriptional regulation, but really the question is: once an RNA is transcribed, how is it regulated, how is it processed, and how is that controlled? It turns out that every aspect of RNA processing is very highly regulated. That includes things that change the sequence of the RNA, so RNA splicing actually changes the spliced sequence that ultimately gets translated into a protein. But also the export of RNAs from the nucleus; localization, either to specific organelles in the cell or, for example, of RNAs to axons in neurons; the stability of RNAs and how quickly they're turned over; and ultimately even the rate and initiation of translation. All of these are controlled by the activity of RNA-binding proteins.
In the same way that transcription factors bind DNA through sequence motifs and regulate transcription, RNA-binding proteins bind RNA through RNA motifs, either primary sequence motifs or structural motifs, and regulate these different steps of RNA processing. There have been a number of efforts in the past few years to estimate how many RNA-binding proteins there are. The best estimates now are at least 1,000, and some estimates run to almost 2,000 RNA-binding proteins in the human genome. And in basically every process, every developmental stage, every disease you study, you eventually discover RNA processing events that are misregulated in those systems. Because of that, in this last ENCODE round, ENCODE created a subgroup for RNA regulation that we call ENCORE. There are really five groups directly involved. Brent Graveley leads our group and is doing RNAi knockdown followed by RNA sequencing of RNA-binding proteins. Chris Burge's lab is doing RNA Bind-n-Seq, an in vitro assay to look at the in vitro binding motifs of RNA-binding proteins. Xiang-Dong Fu's lab at UCSD is doing ChIP-seq of a small number of RNA-binding proteins to look at co-transcriptional regulation. We're doing CLIP-seq, and I'll talk a lot more about that. Missing from this slide is Grace Xiao, whose lab is largely computational and is working on RNA editing and allele-specific RNA processing. And there's one other group — that got a little messed up on the slide — Eric Lécuyer's group, which is not officially part of ENCODE but has been a great collaborator, taking the antibody resources that we generate and doing immunofluorescence, so that for every RNA-binding protein we profile, we also know its subcellular localization.
Okay, so just to give you a very broad summary of our group's efforts: as of yesterday, when Brent checked, there are about 400 RNA-binding protein knockdown RNA-seq data sets, 135 eCLIP data sets released, about 50 RNA Bind-n-Seq data sets, and a number of ChIP-seq data sets. These are all released and publicly accessible; anyone can go download them, and in the rest of this talk I'm going to go through those data sets in more detail. Just to give you an idea, we've generally tried — and you can judge for yourself how successful we were — to do the same RNA-binding proteins with as many assays as possible, to create the most useful data set of overlapping assays for the same factors. We're doing okay; we're trying to fill in many of those gaps, but that really is the goal: to get all of these different data types for every factor. So what I'm going to talk about today is, first, the eCLIP method, with the intent that everyone can understand what files are available, how we process the data to get those files, and why we do some of the steps we do, because some of them are not obvious if you're not familiar with the data type. Then I'll give some quick examples of the kinds of analyses that can be done, and some of the tools we're developing that should be available very soon. To summarize: if you're familiar with transcription factor ChIP-seq, CLIP-seq is a very similar idea; there are just more steps, because it's RNA, and it tends to be a little messier. We start the same way: we cross-link RNA-binding proteins to RNA. We use UV, which tends not to cross-link protein-protein interactions, so it's a little cleaner at capturing single proteins rather than complexes.
Then there are a number of steps to purify that protein: standard immunoprecipitation and washing, but we also run these samples out on protein gels and cut out a size range from the protein's size to about 75 kilodaltons above it, which corresponds to about 150 bases of RNA. The reason is that we found that for many RNA-binding proteins, when you pull them down, in certain size ranges you also get ribosomes, or Pol II complexes. So if you don't run this gel, some RNA-binding proteins work perfectly fine, but for others you get enormous background from these very abundant complexes. This is obviously more time-consuming and kind of a pain, but it definitely improves the signal. Then there are steps that are basically library preparation: we have RNA, so we have to ligate adapters, reverse transcribe, and PCR-amplify. One key thing I'll mention: if you're familiar with Illumina sequencing, there are adapters on both sides of the fragment in the middle. In our case, we also attach a randomer, and that randomer is attached before PCR amplification. The idea is that reads mapping to the same position but carrying different randomers can be distinguished as unique RNA molecules, while reads mapping to the same position with the same randomer are PCR duplicates and can be discarded. That's actually very important, and I'll come back to it in a minute. So we do that, we sequence, we do data processing and peak calling, and — as was shown earlier — this is an example for RBFOX2, where we get a binding site right near RBFOX2 motifs. So it actually works reasonably well. The overall processing pipeline has many steps.
Very quickly, I'll go over them and point out which files exist, and then go through some of these in more detail in the next few slides. There's a lot of detail in the slides if you download them that I'm not going to discuss, so I'll just hit the high notes. We have FASTQ files — we actually sequence paired-end, and I'll explain why. We do adapter trimming, we remove repetitive elements, we map to the genome to get uniquely mapped reads, we remove PCR duplicates to get usable reads, and then we take read 2 and perform peak calling. Okay, let me go back one step. It turns out that when you do the reverse transcription, RT enzymes tend to stop at the cross-link site, because even after proteinase K cleavage you still have an amino acid cross-linked there. So the reverse-transcription end — which in our library design is the first base of read 2 — is typically the exact position where the RNA-binding protein is cross-linked, or at least is enriched for it. That's why peak calling, for historical reasons, is done from single-read data, in this case from read 2. If you go to the DCC, these are the four files that exist: the raw FASTQ files, mapped usable reads, all called peaks, and input-normalized peaks. If you search specifically for eCLIP, you get 270 data sets. That's because each RNA-binding protein has a paired control, so if you just search for eCLIP, you get both. Right now there are 135 eCLIP data sets released, and more are coming very soon. They cover both K562 and HepG2 cells roughly equally, and we're starting to do a few in the ENTEx tissue samples as part of that collaboration.
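Since reverse transcription stalls at the cross-link, the 5′ end of read 2 approximates the cross-link site. A minimal sketch of extracting that position from a strand-aware alignment, using 0-based half-open (BED/BAM-style) coordinates — this is illustrative only, not the ENCODE pipeline code:

```python
def crosslink_site(start, end, strand):
    """Putative cross-link position from a read-2 alignment.

    On the + strand the 5' end of the read is the leftmost coordinate;
    on the - strand it is the rightmost. Coordinates are 0-based,
    half-open, as in BED/BAM.
    """
    if strand == "+":
        return start
    if strand == "-":
        return end - 1
    raise ValueError("strand must be '+' or '-'")
```

For example, a read-2 alignment spanning 1000-1050 on the minus strand would place the putative cross-link at position 1049, the rightmost base.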
So you'll start to see some human tissues as well. If you search on the DCC website for an RBP — in this case RBFOX2 — you get a whole bunch of data sets: the RNA Bind-n-Seq that's been done on RBFOX2, some antibody information, and this is a knockdown RNA-seq data set from Brent. But if you go to 'experiment' up here and choose eCLIP, you now see two listed entries. One is eCLIP targeting RBFOX2, and one is 'eCLIP RBFOX2 mock input' — an old name, but it's the paired input for RBFOX2. As I mentioned earlier, because we cut the gel, we actually have to do a paired input for each protein: the size range is different, and that changes what the background is. So in each case, for each CLIP experiment, there are two biosamples. Each biosample has a CLIP replicate, and one of the biosamples has a paired input. I can talk more about that if you're interested, but it's for historical reasons. If you click on this RBFOX2 CLIP experiment here, you get to an experiment page with a bunch of experimental details: the size range of the RNAs, some protocol information. This link here goes to the paired control data set — the links are there, so even through the REST API you can programmatically pair up the control experiments. If you scroll down, there are the raw FASTQ files: for replicate one and replicate two, there's a read 1 and a read 2 FASTQ file each. Then there's a bunch of derived files. The key ones, I'd say, are the BAM files, which are the usable, uniquely mapped paired-end mapping results, and the BED narrowPeak files, which contain input-normalized peaks; I'll talk about those more. One other useful thing: at the bottom of every one of these CLIP experiment pages, there are two documents.
There's the assay SOP, and an analysis SOP that goes through all the processing steps. You can freely download those on your own time and look at them. This is what they look like: they list all of the scripts, and all of the scripts we use are available on GitHub, so you can download and use them. There's a static link at the ENCODE portal, and it's also published as part of a recent paper of ours. So I'm going to quickly skim through the steps, hitting the high notes of what's important without going through many of the details. One thing I will mention is that we do multiplex libraries. This has already been handled for the files on the ENCODE DCC, so you don't have to worry about it. There are inline barcodes that we use in some cases for high-throughput pooling; demultiplexing has already been done for the data sets on the DCC. So the FASTQ files you download from the DCC are standard FASTQ files, with one exception: we take the randomer sequence and append it to the beginning of the read name. You can see here that a data set has a read 1 and a read 2 FASTQ file, and the read IDs are normal Illumina read IDs except that we stuck the randomer at the beginning. That's so we can keep it through all the mapping and processing steps and parse it back out at the end for PCR duplicate identification. Then we do adapter trimming. You can see the commands — they're kind of ridiculous. The problem is that we've observed, at a very low frequency (0.1% or even less), adapter dimers that make it through into our libraries. The problem is that 500 reads is not very many, but if they all map to the same place in the genome, they create a really convincing false-positive peak.
So we wound up erring on the side of over-trimming, doing multiple trimming passes using shifted versions of the adapters, because we also found that Cutadapt by default is very bad at dealing with 5′ truncations when you're looking for double adapters. I'm happy to talk about the details offline if anyone wants them — just be careful with adapter trimming, because it can be a problem. The next step is repetitive element removal. The key point here is that even for a really good CLIP data set, probably 10 to 20% of the reads are going to be ribosomal RNA, and in many cases it's even higher. In an ideal world, those would just map to the ribosomal RNA transcript and you'd be done with them. In reality, they wind up mapping to pseudogenes all over the place and, again, create really convincing false-positive signals. To alleviate this, for general processing we first map to RepBase, remove all the reads that map there, and take only the remaining reads through to genome mapping. We have some repetitive-element-specific analysis tools that I could talk about later if anyone's interested, but for general processing that's what we do: we get rid of repetitive elements to avoid artifacts. Then we do standard paired-end mapping with STAR, the same way you would map an RNA-seq data set: we map to the genome plus a splice-junction database to get uniquely mapped reads. Then, for PCR duplicate removal: if you're familiar with ChIP or RNA-seq, many people just use the start and stop positions of mapping, but because we have the randomers, we can actually do better than that. We take reads that map to the same location and have the same randomer sequence, and discard all but one of them.
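The randomer-aware deduplication described above can be sketched in a few lines. Purely for illustration, I assume the randomer is the leading colon-separated field of the read name, as described for the FASTQ files earlier (check the analysis SOP for the exact delimiter and format):

```python
def parse_umi(read_name, sep=":"):
    """Return (umi, original_name).  Assumes the randomer was prepended
    to the read name with `sep` as the delimiter -- an assumption, so
    verify against the analysis SOP for the files you download."""
    umi, _, rest = read_name.partition(sep)
    return umi, rest

def dedupe(alignments):
    """Collapse PCR duplicates.  `alignments` is an iterable of
    (read_name, chrom, start, end, strand) tuples.  Reads at the same
    coordinates with the same randomer are PCR duplicates of one RNA
    fragment; the same coordinates with different randomers are kept
    as distinct RNA molecules.  A sketch of the idea, not the ENCODE
    implementation."""
    seen, unique = set(), []
    for name, chrom, start, end, strand in alignments:
        umi, _ = parse_umi(name)
        key = (chrom, start, end, strand, umi)
        if key not in seen:
            seen.add(key)
            unique.append((name, chrom, start, end, strand))
    return unique
```

So three reads at identical coordinates, two sharing one randomer and one carrying a different randomer, collapse to two unique molecules.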
So the BAM file that is on the DCC is the output of this PCR duplicate removal: it contains only uniquely mapped, non-repetitive, non-PCR-duplicate reads. To give you an idea of what that looks like for an average data set: as with some of the older CLIP methods, you always get some reads that are too short to do anything with when you sequence them, and those are discarded. We usually see on the order of 10 to 20 percent mapping to repetitive elements, and maybe 10 percent that doesn't uniquely map to the genome. But the real advantage of the improved eCLIP methodology is that the number of PCR duplicates we now see is very low, whereas with a lot of older CLIP protocols you got an enormous level of PCR duplication. Across — I think about 300 — ENCODE eCLIP experiments, the fraction of usable reads (basically this green over this green plus this gray) is now 80-90 percent unique most of the time, whereas most published CLIP data sets are more like 5 or 10 percent non-PCR-duplicates. Again, very quickly: the BAM files on the DCC are standard BAM files, except that the randomer is still in the read name in case you want to use it for something. Okay, so peak calling — this is still a bit of a work in progress. You can run this data through many standard ChIP-seq peak callers — they have to be strand-aware, so you have to mess with them a little — but the way we actually call peaks is with a peak-calling algorithm called CLIPper that was developed in our lab a few years ago. The idea is that it identifies regions that are significantly enriched above the transcript they're in. Transcripts can vary by 1,000-fold or more in expression, so we're looking for regions that show higher binding than the average level of that transcript. CLIPper does that.
We then take the CLIPper output and compare those peaks against the paired size-matched input. Very simply, we literally take the peak region, get the number of reads in the CLIP, get the number of reads in the input, and do a very simple enrichment test for significance. This was not usually done in CLIP, but for ChIP, you're all very familiar with why it's important. For CLIP, we saw that it was, again, enormously important. Even for something like SLBP — which exclusively binds histone RNAs and has been very well studied and characterized — we actually see RNA-seq-like coverage at abundant genes, in this case EEF2; the SLBP CLIP looks almost exactly the same. A different RBP, LIN28, actually does show a little binding at one of these exons. So it's not that every experiment looks exactly the same; it's that you always get some level of background. But if you take the SLBP CLIP and plot the read density in the input versus the fold enrichment in CLIP, you can see that the histones are all very clearly enriched. So there is a flat level of background in every CLIP experiment, and you really just want to normalize it away and keep the regions that are enriched in the CLIP relative to the input. Here are two examples for individual peaks. One is a region that's actually not enriched: the U12 small RNA shows up as a fairly common artifact in almost every CLIP data set. It has a lot of reads — reads per million of 500 — but if you look at the input, it's actually depleted in this CLIP, whereas the 3′ end of this histone RNA is something like 2⁹-fold enriched. So it actually comes out very clean; you just have to do the proper normalization to get rid of those false positives.
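The input normalization just described is conceptually simple: count reads over the peak in CLIP and in the size-matched input, normalize for library depth, and test for enrichment. Here is a sketch of the fold-enrichment part; the ENCODE pipeline also computes a significance test (e.g. Fisher's exact), which I omit, and the pseudocount is my own addition to guard against zero counts:

```python
import math

def log2_enrichment(clip_peak_reads, clip_total_reads,
                    input_peak_reads, input_total_reads,
                    pseudocount=1):
    """Log2 fold enrichment of a peak in CLIP over the paired input,
    after reads-per-million normalization of each library."""
    clip_rpm = (clip_peak_reads + pseudocount) / clip_total_reads * 1e6
    input_rpm = (input_peak_reads + pseudocount) / input_total_reads * 1e6
    return math.log2(clip_rpm / input_rpm)
```

With equal library sizes, 255 CLIP reads over a peak versus 31 input reads gives log2(256/32) = 3, i.e. the eight-fold enrichment used as a stringent cutoff.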
So the key thing to note, then, is that if you download these narrowPeak BED files from the DCC, they are input-normalized peaks but actually contain every CLIPper-called peak, including the ones that are not enriched and even ones that are depleted in the CLIP relative to the input. It's a standard BED file with the region information, plus these two columns of log2 fold enrichment and -log10 P value. The ones shown here all happen to be enriched, but if you go down the sorted list, you'll start to hit peaks where the fold enrichment is actually negative. We provide all of those for analysis purposes. In some cases — for example, for IDR — they're very useful, because you need non-enriched peaks to run something like IDR. We also provide them so that people can set their own cutoffs and make their own trade-offs between sensitivity and specificity; we leave that up to you. As a stringent criterion we standardly use a P value of 10⁻⁵ and eight-fold enrichment. That seems to work pretty well at removing false positives. There are definitely real signals below those cutoffs, but it works pretty well for us, giving a good, high-quality set of peaks — usually on the order of hundreds to thousands per RBP — with very few false positives. Okay, so very quickly, I'm going to hit a couple of highlights of the kinds of analyses people can think about using this data for. The simplest is individual-RBP analysis, and this is an example again with RBFOX2. We can ask questions like: what motifs are enriched at all binding sites, or only at binding sites in introns, in proximal introns, in 3′ UTRs? We can ask where the peaks for this RBP are. In this case, you can see that RBFOX2 is very heavily enriched in proximal and distal introns relative to the mRNA or genomic background.
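Filtering the downloaded peaks at those stringent cutoffs is straightforward. This sketch assumes the log2 fold enrichment sits in the narrowPeak signalValue column (7th field) and the -log10 P value in the pValue column (8th field); confirm the column layout against the analysis SOP before relying on it:

```python
def stringent_peaks(lines, min_log2fc=3.0, min_neglog10p=5.0):
    """Keep input-normalized peaks with >= 8-fold enrichment
    (log2 >= 3) and P <= 1e-5 (-log10 P >= 5).  `lines` are
    tab-separated narrowPeak records."""
    kept = []
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        log2fc = float(fields[6])       # signalValue: log2 fold enrichment
        neglog10p = float(fields[7])    # pValue: -log10(P)
        if log2fc >= min_log2fc and neglog10p >= min_neglog10p:
            kept.append(line)
    return kept
```

Relaxing `min_log2fc` and `min_neglog10p` trades specificity for sensitivity, which is exactly the trade-off left to the user.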
We can even ask where along the pre-mRNA or mRNA an RBP is bound. In this case it has a slightly 5′ distribution — you can see it in red here on the pre-mRNA. Exons are fairly flat, but in introns it tends to be very enriched near the 5′ splice site. So we can ask these very simple questions, and then, in addition to the CLIP, we can start to integrate other resources. We can integrate this with RBP localization — again, from Eric Lécuyer's lab — and as you can see, RBFOX2 in their data is slightly cytoplasmic but really heavily nuclear, and actually the exact opposite of a nucleolar signal. We can also integrate with RNA-seq data to create these so-called RBP maps or splicing maps, where the idea is to ask: if we look at exons that are differentially regulated by RBFOX2, where does RBFOX2 bind near them? In this case, in blue, exons that are included upon RBFOX2 knockdown tend to have more RBFOX2 binding here, whereas skipped exons that are excluded upon RBFOX2 knockdown show an enrichment of binding, for example, here. So we can build these splicing maps — or all sorts of other maps — for individual RBPs. In the lab, we're also building RNA-centric views of RNA processing. The idea here is that we can do an in silico screen to identify RBPs that potentially regulate an RNA of interest. Here are two examples. 7SK is a very well-studied small nuclear RNA, and if we take all of the CLIP data sets, the most enriched is LARP7, a known member of that complex — so it works very nicely. We did this for Xist, and it turned out that the top four hits for Xist all matched recently published results from ChIRP studies, that is, pull-down and mass spec of proteins associated with Xist. You can see that just by sorting by CLIP enrichment, we immediately pull out RBPs that have very specific localization patterns on Xist.
So this actually works very nicely for pulling out potential regulators of an RNA of interest, and it's something we're building as a resource for the community. Here's another tool we're building: a genome browser track that literally just shows every binding site for every RBP, each colored differently. This is, again, the entire Xist transcript. If we zoom in here to the 3′ end of Xist, you can see a whole bunch of RBPs binding. This is one example, but all of these different colors are different RBPs that each show a significantly enriched peak at that region. So I think these will be very useful tools for people to ask questions of the ENCODE data. And finally, we can ask very global questions about RBP binding. For example, here we can just cluster all of the data sets by what types of RNAs they bind, and we very quickly pull out clusters of, for example, ribosomal RNA binders, intronic binders, 3′ UTR binders, et cetera. So we can start to infer what an RBP is doing based on both the specific and the general properties of its binding profiles. Okay, so what's available? All of the data, obviously, is on the ENCODE DCC; you can go download it and use it. We're in the process of moving our processing pipeline onto DNAnexus. That should be done around July, so it should be available for people to use very soon. The initial processing pipeline is the first step; it should be followed very quickly by our IDR and QC pipelines, which we developed to QC data sets rapidly as part of ENCODE. We're also building a browser to do these RNA-centric queries. There's a website that can handle about 10 people looking at it right now, so it's definitely not ready for this crowd.
But that should be available soon, and it will let people do basically what I was showing: type in an RNA or a region of interest and get the RBPs that bind there. We're also in the process of integrating with the ENCODE Encyclopedia and building Factorbook-style summaries for RBPs, so that for each RBP there's a very quick summary of what the CLIP data shows. So with that, I'll thank Gene; our ENCORE group — Brent, Chris, Eric, and Xiang-Dong Fu; and, in our lab, some great computational and experimental people, especially Gabe, who is a really great computational grad student. I'll throw out a pitch, because he's graduating soon: if anyone wants a really great computational person, you should email him. All right, thank you. Any questions? I'll be around the rest of the conference, so if you have specific questions, feel free to come chat with me or send me an email. Thanks.

Hey, I was just curious how you're defining repetitive elements, and whether you throw that data away or keep it, or whether you're planning on including it in the binding site sets.

So we define repetitive elements as RepBase plus the full ribosomal RNA transcript, which is actually not in RepBase but gets a lot of reads. We have built another pipeline to deal with those reads. It's a somewhat complex mapping problem, because you want to be able to map to families of RNAs, not just individual RNAs. We're still working out the kinks and trying to figure out the best way to make it available. But it's something we're very interested in: binding to Alu and LINE elements and the like is very interesting. Thanks.

You showed one example of the experiment information on the website — for example, which experiment it is, whether it was sequenced paired-end or single-end, and also the replicate information.
But most of the time, when we download the data, the technical or biological replicate pairing isn't there, and when we run software, we do need to distinguish which sample is actually paired with which. I wonder whether this information could be provided for download, in the metadata file or some other resource.

So let me repeat that and see if I'm understanding: you're asking whether the information about which control is paired with which sample is easy to get somewhere from the website?

Right, because right now that information is really not available for batch download — it's just for a single experiment on the website. So suppose you download hundreds of samples; it's not convenient.

That's a good point. The DCC people could probably comment more on how to make that easier on the DCC. The way that we actually usually do it — the way we, for example, downloaded some of Brent's data — is to go through the JSON objects. Each of the experiments has the biosample identifier for the CLIP and the input that come from the same biosample; they have the exact same biosample. So you can identify which CLIP is paired with which input by their being paired in that entry, and then, of the two replicates, which one is paired with which by which has the exact same biosample. So it is doable — the information is there — but I agree it's not necessarily the easiest format to access, and that's something we should think about. Good point.

I was just going to respond directly to that.
The answer is that in the experiment, if you go file.replicate.library.biosample, that biosample value will be exactly identical, or you can go experiment.replicate.library the same way. But I'm hearing from you that you're trying to access the data in bulk, and either you're unsatisfied with using the JSON model to get to it, or it was just unclear where it was in the data model?

I think the major issue right now is that the data set hasn't been deployed in a way that a script can automatically recognize based on the file names, and sometimes those names really have no meaning — one file name ends one way and then suddenly another ends a different way, so which goes first? Which one is the forward read and which is the reverse read? And is this replicate one or replicate two?

So it's frustrating to you that the metadata is not in the file name — you would like the file name to carry more metadata. That's something we at the DCC learned through experience in ENCODE 2: we were trying to pack too much information into the file name. That's why every single file and every single experiment has a rich, deep JSON object with it, so that you can get all of that information — whether it's rep 1 or rep 2, read 1 or read 2, whether or not there's a UMI in the file, whether the library has the exact same biosample or just the same biosample term ID. I would love to have more conversation with you, or any of our wranglers, afterwards to help you navigate that JSON object. However, I don't see us moving toward packing our 75-odd fields into a file name, if that's what you're asking — or if you're asking something else, we can take it offline.

I know that internally, for our group, we have a spreadsheet that just lists, for CLIP replicate 1, here's the ENCODE accession number, and so on — and looking up the file given an accession ID is very simple through the ENCODE portal. So it may be easiest for us to just post that spreadsheet to solve that question: here's explicitly rep 1 and rep 2, here's the paired input for rep 1, and so on. That might be the easiest way to go.
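The biosample-based pairing logic described in the exchange above can be sketched in a few lines. The field names here ('accession', 'assay', 'biosample') are hypothetical simplifications of the much richer ENCODE JSON objects (experiment → replicate → library → biosample); this only illustrates the matching idea:

```python
def pair_clip_with_input(experiments):
    """Pair eCLIP experiments with their size-matched input controls
    by shared biosample accession.

    `experiments` is a list of dicts with simplified, hypothetical
    fields: 'accession', 'assay' ('eCLIP' or 'input'), and 'biosample'.
    A CLIP replicate and its paired input share the exact same
    biosample, so matching on that field recovers the pairing."""
    inputs = {e["biosample"]: e["accession"]
              for e in experiments if e["assay"] == "input"}
    pairs = {}
    for e in experiments:
        if e["assay"] == "eCLIP" and e["biosample"] in inputs:
            pairs[e["accession"]] = inputs[e["biosample"]]
    return pairs
```

In practice you would populate the dicts by walking the per-experiment JSON from the portal's REST API down to each replicate's library biosample, then apply the same matching.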