Hi, everyone. I'm Ben. I'm going to do a live demo for you. So a couple of things just to start off. In order to do this live demo, as we said repeatedly, you have to have an account set up. I'm going to run it through as if I am you. I'm going to log in to a brand new account that's never been used before and do it that way. The steps of the demo are outlined in the handouts, which you can download. So if you ever get lost, you can refer back to that. I am going to first kind of zip through the example that we're going to do, which is going to be a little tiny analysis of half a chromosome of RNA-seq data, which we are going to run to completion through the entire ENCODE pipeline. And it's going to take about 50 minutes to run all the way through. So that's why I'm going to sort of whip through how to set up a job and start it. And then we'll move back a little bit and explain a little bit more about how the interface works. And if anyone gets stuck while they're trying to run it, please raise your hand and someone will come and help you. So here we are, logged in. So this is the login panel. All right. So when you log in, you start here. You'll see it's my free trial. I have nothing in here. This is sort of the project panel. So you organize all your things in projects however you want. They're like a folder, effectively. So to start here, we're going to go to the ENCODE Uniform Processing Pipelines project. So what we're going to do is wait for this to load. And we're going to copy the long RNA-seq and the reference files folders to a new project in your account. So this little icon here is to create a new project, because you don't have any projects. So just call it demo or something. All right. So you actually have to click Copy into this folder. Done. OK. So now note that I'm still in the ENCODE Uniform Processing Pipelines project. So I'm going to click this back arrow here. I have a new project. Click that. Oh.
The whole setup to run takes about three minutes. So I can go first. I'm in my demo project, in this long RNA-seq folder. We use the word long to mean we're getting long transcripts, as opposed to microRNAs. So inside this folder, there's a couple of folders here and a couple of little boxes, which I'll go into. Again, I'm going to explain most of this in detail once we get some stuff running. This little symbol is a workflow. A workflow is what DNAnexus calls what we might call a pipeline. A workflow is a bunch of things stuck together. So if you click on that and open it up, you'll get this image that you were shown before, hopefully. And specifically, the one-replicate paired-end one is the example that we're going to use. So there's four in there. Some of them are for multiple replicates. Some of them are for single-ended. So the way the workflow is organized here is that this is a step. This is a step. This is a step. This is a step. This is a step. These are the inputs on the left. These are the outputs on the right. And the outputs are connected to the inputs. And again, we're going to come back and do this again if people have trouble. In order for this to run, to click the Run button, we need to turn all these little guys in the middle, which we're calling applets, green. So let's just start. This should be set up. So this is a paired-end experiment. So we need two FASTQs, one for each pair.

So I'm just going to go back to the beginning for people who still haven't created a project with the files in it, okay? So the first thing I'm going to do is click on the featured project, ENCODE Uniform Processing Pipelines. So this is our sort of master copy of the pipelines and the reference files. So I'm going to select the long RNA-seq and the reference files by clicking the checkboxes. I'm going to go over to the left side and click the Copy button.
So that's going to open an interface in order to copy to another project, which for our purposes is your personal project that you are maintaining. So we're just going to copy the software and the data onto there. So this little funny-looking suitcase-plus link here is create a new project. If you haven't created a project yet, you will need to create one. So if you click on that suitcase link, you'll have to enter a name for that project. So I'm going to just type, in this case, ENCODE demo into the box, hit Enter, right? It's churning a little bit. So now it should say something like no data available here. On the right-hand side of this, it says Copy into this folder. So if you click on that, it should copy the data that I selected in the first place. You might have to click it twice. Okay, so let's spin it off. There it goes. So you get a little 100%. It's done. People still working on that? It's fine if you are. So now this leaves me at this project, which is a project that you guys can't really edit because it's our DCC project. And this bird beak over in the corner that points to the left will take you up a level, back to your personal account and the list of all the projects that you have. So because I did this twice, I have two, and some of you may have two too. They're just little files on the cloud, so feel free to delete them if you don't need them later. So you can see, actually, both of these have some data in them, 348 gigs. So I'm just going to click on this one to enter that project. I'll wait and give it a second. Spin it a little bit, and I'll hit Helix. So is anyone who's trying to work through this demo not at this page yet? One, two, that's pretty good. So from this screen, the project that you've created, there'll be a little folder. It looks like a little folder guy there. One of them is long RNA-seq. That's where the workflows reside. A workflow is DNAnexus's term.
This is a way of explaining which processing steps to run in what order. So I'm going to make this a little wider so you guys can see it, but you should be able to see it on your screen. So we have four here. One pair of them is for single-ended RNA-seq data and one pair is for paired-ended. And within each pair there are two separate ones, depending on whether you have one input replicate for your experiment or two. We're just going to do the one-replicate one. So you can click on that. It should open up a window like this. Is anyone who was... Is this where we are now? We're good? Okay. So just to repeat here, the way these are organized is that the outputs of the previous step in a pipeline are connected to the inputs of another step. So for example, this star_anno_bam, when I highlight it, it highlights it there. That means that this BAM file, which is one of the alignment files that we're going to create, gets passed on to this applet here, which is the quantification applet. The applets in the middle here represent software that's going to be run. So in order to get this started, and I think we pretty much have to go through this once and do it so that we will finish, you start clicking on these items here and filling in the boxes. So if you click on that, it should show you a little menu, which shows you the valid gzipped FASTQ files that you could select for there. And it's going to take a second to find them. But we're going to pick the one that is called like hemi. Maybe yours is faster than mine. There we go. So there's two here that are described as hemi, chromosome 21 hemi. So this is the middle of chromosome 21, which is just meant to be a small dataset that we can finish in the workshop time. And this is the ENCODE file that these are extracted from. So we are going to take the one with a one, the hemi version. Click on that, and it puts it there. This also inputs it to the next step here, which we'll go over. Oops, don't right-click.
You'll get that menu. So now we do the same thing with the other pair of the reads. Same menu. Just make sure you pick the one with the two. Does everyone get how to put files into the system? Anyone confused? So notice how they're both different, one and two. Now, when you're doing the mapping, you have to actually pick the correct genome. So these are reads that I happen to know are from a mixed human sample. And we have the genomes pre-indexed for the various pieces of software. So this is the STAR alignment. So we're going to click on that, and it's going to give us a bunch of choices of indexes to choose from. So there's a mouse one. There's several human ones. The way STAR works is that it takes not just a genome, but also a transcriptome, and we use spike-ins for some quantifications. So that's what all this spinach is. But if you just pick the hg19 male v19 ERCC STAR index .tgz file and click on that, it fills that in. So we're going to go to the next step now. So scroll down. Here, this is another alignment program, which you may be more familiar with. This is TopHat. The reads are automatically set up to be the same reads that you put in in the first place. So you see how when you highlight them, they sort of light each other up. And we're going to get the TopHat index. Now it's important that you pick the same index, right? Like I don't want to have one aligned to mouse and one aligned to female. And you should know which version of the genome you're mapping to, in this case hg19, because when you do a visualization, you don't want to have it drawn on the wrong genome, right? So that's that one. Now the next step is the quantifications. So this is a program called RSEM from Colin Dewey's lab. So that uses a slightly different flavor of the index.
And you'll notice when it comes up that there's no female version of the genome, and that has to do with the fact that we want to write out zeros in our quantification file for stuff on the Y chromosome. So it doesn't really matter if we use the male or not. Click that link. That's there. The last input is for these two steps here, which are going to take our alignment BAM files and make wiggle files out of them, bigWig files that we're going to visualize. And this requires a chromosome length file. So let's just take this one here. That's the right one. So now this one's runnable. It's green. This one's runnable. It's green. This one's not runnable. What that means is you have to go to the input, Configure Parameters, params, I guess. So if you click on this black box in the middle, this option isn't checked. So it needs to be checked that the library is paired-ended. So we'll save that. Now it's green. Okay. Oh, here, what did you click? Sorry, I could... Oh, uh... male hg19 chrom.sizes. Sorry, I didn't read it out, did I? This one? So these other two have just a little text you need to put in. You can just put it in here. This is a required field. It's just used to name some of the files. So just type in demo in this box here. Save it. And the top one, the STAR align, it's the same thing. So look, I'm all runnable. And I can run my analysis. So, yep. Also, this whole process should be on the handout if you want to refer back to it, which you can download from the agenda page. But I will be happy to go through it again. All right, so I'll just... Okay, so starting from the pipeline, let's see if this should be blank. I don't think I saved my stuff. Okay. So what we're doing is... Sorry for the... What is this? Sorry? No, that's totally a fair question. I just want to say that our plan was that, because it takes a while for it to run, we would try to get people started and then we would describe the RNA-seq pipeline. So I wonder if we should...
So what you're going to do is put in the inputs to the various pieces of software, which are the black boxes in the RNA-seq pipeline. Okay, so I'll try to... Oh, okay. Sure, sure. I guess I can do that. Let me just... Okay. So people who are at this step, I want to go back for some of the people who haven't done the original thing. So I'm going to go way back to the beginning. But you should be able to go download the handout from the encode2015.org website. And that can help you follow the details while I help the people who are even behind you, okay? Sorry for that, but that's just how we do it. Okay, so you should be at a page something like this that says projects, right? You should have nothing here. So what you do is go to the ENCODE Uniform Processing Pipelines. What? No, that's why I didn't tell them to run it yet. Oh, well, then that'll be fine. Okay, so those of you who clicked Run, that's perfectly fine. There is one little subtlety, and that's why I didn't say actually run it, which is that you can specify where the output files go, like in another folder. So the people who are running it, you'll just get a pile of files in a folder, but it will work fine. It won't make any difference. It's just that when you run two or three of them, then they start piling up with all these crazy file names. All right, so for these people who just got internet, you've got the processing pipelines here. This is the master copy of the thing, right? You copy the long RNA-seq and the reference files. So you hit this Copy in the left-hand corner. So you're not going to have these two, so you need to create a new project, which is this crazy thing. See this? Just name it whatever you want. I'll just type in demo two in the box there. So click Copy, done. Okay, so now you're still where you were when you started the copy. So you have to get to the new project you created. So you have to go back on this arrow here. You should see the project you created there, right?
So you go there, and then in here there's a folder called long RNA-seq. Click on that. Then if you scroll down a bit, there's these guys, which is the universal symbol for workflow. How do you get files? So I don't think we're going to have time, but what you can do, like if you want an ENCODE file, yeah... Basically there's an Add Data button if you ever want to put FASTQs in your system. And Add Data will let you copy a file from a URL or a file from your computer. Please don't upload any files in this room now. No idea what it will do to the internet. But that's how you get files in. It's pretty easy if you just poke around a little bit. All right, so... Oh, this is the wrong one. Don't click on that one. See, this is the two-replicate one. Don't do that. Missed it. So this is the one... Here, I'll make it wider again. One-replicate paired-end for you guys, and you should get to where we are inputting stuff. Okay? Press... It's got it. Okay, I'm going to go through the inputs again. That's... Yeah. I... You're welcome to try to design the UI. It's not my system, so... Interesting that it's not... Yeah, I mean, some people like using the command line and just writing a script, so you can do that, too, if you want, but not in the demo. All right, so... You get the read one. So this is a list of possible FASTQs you can input. I probably should have deleted some of these. We're going to look at the two that say chromosome 21 hemi. So for pair one, just pick the one. All right, I'll do it again. This one... ENCFF646CCF underscore 1, chr21 hemi. Okay? And now in the other box, we're going to do the same thing, but get pair number two. And now we need an indexed genome for STAR to align to. So we click on that box. We don't want a mouse one, from mm10, so we want a human.
We're going to use the male, and we're going to use the version that is ERCC, which is just a code name for the spike-ins. So I think I'm going to try to do this slightly perpendicularly for people who are following along, so we're just going to finish this top step here. So we're going to just click Configure Params, and this star here means that this is a required parameter, and that's why it's orange and not green. You can put any sort of text string in here as an identifier. We used this for a while for something like ENCBS111, but it doesn't matter, you can put any string in there, so demo is fine. Save that. That step is done, right? So this alignment step, just because I think we're not going to have time to go over the whole pipeline very much, runs an alignment of your paired FASTQs against the genome that you selected. It creates essentially some QC numbers that you may see at the end if we get there. It creates a regular alignment file to the whole genome. It creates an alignment file to the transcriptome, which is included in this genome index. It's GENCODE version 19. And it has a log file which has other numbers in it, right? So we're just going to repeat that partially for the other four steps. This is the alignment with TopHat, which is just creating an alignment BAM file. The inputs are already set for you, so we just need to set the index, and again we're going to take the hg19 male v19 ERCC TopHat index. Put that in there. We're going to click Configure Params. We're going to put demo into the identifier for biosample library box. Now that one's green. Anyone having trouble turning these two guys green? Great. Yeah. So the next step is the RSEM quantification, which takes the annotation BAM. Let me see if I can show you this here, because I want to do it anyway. So see how this output on the right is linked to that input on the left.
So when this file is created, it's going to go directly into this step. So we don't have to put in that input at all. That's part of the pipeline, that it's all plumbed together. We're going to take the index for RSEM. So here we just need the hg19 male v19 ERCC RSEM index. That's good. Still not green yet. Configure the parameters. Now this one, we need to tell RSEM how we're going to use it for quantification. We use this piece of software for both the paired-end and the single-ended, so it's got to know which one you're inputting. The outputs of that step are what we are just calling these results files. They're just tab-delimited files with genes as rows and FPKM, TPM and other numbers there. The last two steps are going to take the output files, one from STAR and one from TopHat. They're going to convert them to bigWig files, and the bigWig conversion script needs a chromosome name and length file. So we click on that. It's going to open another window. We get to pick our genome again, which is going to be male hg19 chrom.sizes. I should point out that you can arrange these in such a way that they're coordinated, but I wanted people to go through the steps of actually doing it. So our things are all green now. Is everyone there at this point? All right. You still need help? Nope. Okay, we're good. So set the output folder. This is not a critical step, but it will keep your files managed if you put the output in the right place. So if you go to this right side, there's a little file browser, like a Windows Explorer thing. So you click on that arrow, and down in examples there's input and output. So you just click output, for example, or you can give it any folder. It selects. There's nothing in it. That's why it says no data available. It's an empty folder. That's what you want. So now I've got a folder in there. I can click Run. Trumpets. Okay, so when you run it, it takes you to something called the Monitor tab.
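The rule walked through above, that Run only becomes usable once every applet is green, can be sketched in a few lines of Python. This is an illustration only, not the DNAnexus API: the `Step` class, the input names and the file names are all hypothetical, and the only behavior taken from the demo is that a step needs its required files (unless an upstream output is plumbed in) and its required parameters.

```python
from dataclasses import dataclass, field


@dataclass
class Step:
    """One applet box in a workflow (class and field names are hypothetical)."""
    name: str
    required_inputs: tuple
    required_params: tuple = ()
    linked_inputs: frozenset = frozenset()      # fed by an upstream step's output
    inputs: dict = field(default_factory=dict)  # files chosen by hand
    params: dict = field(default_factory=dict)  # values typed into the parameter form

    def missing(self):
        """Return the required inputs and parameters that are still unset."""
        unset_inputs = [
            key for key in self.required_inputs
            if key not in self.inputs and key not in self.linked_inputs
        ]
        unset_params = [key for key in self.required_params if key not in self.params]
        return unset_inputs + unset_params

    def is_green(self):
        return not self.missing()


def workflow_runnable(steps):
    """The workflow can run only when every step is green."""
    return all(step.is_green() for step in steps)


# Two of the five steps, with made-up input and file names.
star = Step("align-star", ("reads_1", "reads_2", "star_index"), ("library_id",))
rsem = Step("quant-rsem", ("annotation_bam", "rsem_index"), ("paired_end",),
            linked_inputs=frozenset({"annotation_bam"}))

star.inputs.update(reads_1="pair1.fastq.gz", reads_2="pair2.fastq.gz")
missing_before = star.missing()                    # index and identifier still unset
runnable_before = workflow_runnable([star, rsem])

star.inputs["star_index"] = "star_index.tgz"
star.params["library_id"] = "demo"
rsem.inputs["rsem_index"] = "rsem_index.tgz"
rsem.params["paired_end"] = True
runnable_after = workflow_runnable([star, rsem])
```

Note that `annotation_bam` never has to be filled in by hand, which mirrors the point that the quantification input is already plumbed to the STAR output.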
So before, we were working in sort of the Manage tab, and the Monitor tab will show you all the jobs that you have running at any given time, as well as all the jobs that have finished, all the jobs that may have crashed, and all the jobs that you turned off. It can go back as far as you want. So this is the master job. If you notice this plus sign here, it will show you all the sub-jobs that this job is going to create. Hopefully. One second. Right? So these little lines here correspond to those steps in the workflow that we started. So there's two alignment steps. A quantitation step. It's a little bit hard to see. And two BAM to bigWig stranded steps, which is the correct one. So it requires the quantity... Let's start with these. The BAM to bigWig, I'm sorry, I'm a little thrown off. The BAM to bigWig steps require the BAM to be created, otherwise they can't create a bigWig, right? So these will run automatically when their respective alignment steps are completed, which will take on the order of 25 to 30 minutes with this little example. If you run a bigger example it can take longer. Similarly, to quantitate the genes and the transcripts we require the output of STAR, which is what we're going to quantitate. The ENCODE consortium decided that we only needed one set of quantifications, but we wanted to do two sets of alignments, both to compare to previous experiments that had been done and also because TopHat is sort of the industry standard RNA-seq aligner, but we found in our hands that STAR actually performed a lot better, so we wanted to use that as well. Okay, it is now. So I think people have gotten the hang of raising their hand if they need help, but I'm going to check here anyway to see where people are.
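The dependencies just described (each BAM to bigWig step waits for its own aligner, and quantification waits for STAR) can be written down as a small graph. The sketch below uses Python's standard `graphlib` module (Python 3.9 or later); the step names are made up for illustration, and grouping the steps into waves is a simplification, since on the platform each downstream job starts as soon as its own upstream job finishes.

```python
from graphlib import TopologicalSorter

# Each step maps to the set of steps whose output it needs (names are hypothetical).
dependencies = {
    "align-star": set(),
    "align-tophat": set(),
    "quant-rsem": {"align-star"},
    "bam-to-bigwig-star": {"align-star"},
    "bam-to-bigwig-tophat": {"align-tophat"},
}


def run_in_waves(deps):
    """Group steps into waves; every step in a wave has all of its inputs ready."""
    sorter = TopologicalSorter(deps)
    sorter.prepare()
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # steps that can start now
        waves.append(ready)
        sorter.done(*ready)                 # pretend the whole wave finished
    return waves


waves = run_in_waves(dependencies)
for number, wave in enumerate(waves, start=1):
    print(f"wave {number}: {', '.join(wave)}")
```

The two aligners land in the first wave because they share the same input and do not depend on each other; everything else lands in the second.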
Especially with the time constraint, what will happen here is that if one of these jobs actually dies or something, and this can easily happen, for example if you gave it the wrong input, right? It's hard, but if you really go out of your way, you can do this. You put the TopHat index into the STAR step, which you shouldn't really do, but it will run for a minute or two and then throw an error, and it will turn red. And from this interface, right, look here, so I clicked on one of the jobs that's actually running. It's coming up a little slow here, but I can actually view the log file that it's creating, so you can effectively peek into the virtual machine that's running this job, and it shows you all the output of what's going on. I'm not going to do that, because no one wants to see a bunch of text stream by, but you can do that there if you want. I think, yeah, the RSEM, the STAR alignments, which... it's a great question. I can show you also, like, I think I was trying to demonstrate it, but I'll trick it in here. Actually, I think, you have a question? Okay. Well, you can go back here. So I'm skipping some assets because he's busy. Skip that. So here's sort of a one-page view of what you guys just ran. So, I mean, there were good reasons to do this first, but I did want to try to get it started. The code is available. So each of those applets or those steps here is just a little shell script that runs a program that you can download, like TopHat or STAR or RSEM. Actually, these are sort of the steps here. This maps with STAR, this maps with TopHat, this maps with STAR. This converts the BAMs to bigWigs. There is a whole forest of bigWigs that get produced.
So for a stranded data set, it will give you the plus strand and the minus strand. It also gives you all the reads, including the non-uniquely mapping reads, and there's a file that's just the uniquely mapping reads. So that's four per alignment program. The STAR BAM outputs are also shoved off through this quantification here. So at ENCODE, we have a rule that we always run everything twice. So what I've drawn here is that for every experiment we have some form of replication, either a biological replicate or a technical replicate, ideally a biological replicate. We talked about how we had a multiple-replicate pipeline or workflow you could use. So effectively, for the same experiment, sorry, I'm losing my mouse pointer here, there's one more step in that double-replicate pipeline, which takes the RSEM gene quantification file from one replicate and the one from the other and runs some QC calculations based on them. We are currently working on the IDR calculation for RNA-seq, but we don't have it implemented yet. But we do do something from Rafa Irizarry, which measures the dispersion of the log2s of the ratios. We can tell you more about that. From each replicate we get two BAM files. If it's paired, we get four bigWig files, times two is eight. And then we also get two quantification files, of which the gene-level quantifications are much more reliable. The transcript-level ones are more for reference, but it's hard to actually judge isoforms from these methods. So you had a question. What a great question. There is indeed such a way. I was waiting for someone to have a failed job, so I could show them. Did anyone's job fail? Yours did? Okay, so then this is for you. So go to Monitor.
I can't really do it because my job hasn't failed, but it will sort of tell you. Here, go to Monitor and let it spin for a little bit. So when you click on the plus symbol here, it will show you all the jobs. So the whole thing will fail if any one of its sub-jobs fails, but you probably want to check to see which sub-job failed and look at that. But for the whole thing, what you can do is you can... I don't have an example, do I? I can maybe switch to one, but what it will show you is that on a failed job, when you click on this... Actually, let me do it anyway. There should be a button that will show up here. This shows up, which basically just says rerun. So rerun with the same input. Now, obviously, if you rerun with the exact same input, then it will probably give you the same error, though not necessarily. But it will rerun with what you loaded, and then you can look through it and change what you need to change. The files that are specified as outputs, the ones on the right-hand side of that screen, are not cached. They're permanently saved until you delete them. So the DNAnexus system keeps track of what job was used to create them and what parameters were used to create them. So, for example, and this can happen, certainly when developing your own pipelines in the system, let's say my alignment step works really well, but I have some bug in my quantification script. So I run my whole pipeline, the alignment works perfectly and the quantification fails. So now it's not because I had bad input, it's because there's actually a bug in my applet code. We developed all the applet code with the help of the ENCODE data analysis center. And so we had to effectively, not really compile, but ship that code to the DNAnexus platform. And so if there was a bug in that code, of which there were many, and we have fixed them all, there are no more bugs, what you can do is remake that applet and then rerun that step using the same previous input.
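The fix-and-rerun idea just described can be sketched as follows. This is a toy model, not how the DNAnexus platform is implemented: `run_pipeline`, the step functions and the file names are hypothetical, and the only behavior borrowed from the talk is that outputs of steps that succeeded are kept, so a rerun only executes the steps that have no saved output yet.

```python
def run_pipeline(steps, saved_outputs):
    """Run steps in order, skipping any step whose output is already saved.

    steps: list of (name, function, upstream_names) tuples.
    saved_outputs: dict of step name -> output, kept between runs.
    Returns the names of the steps that were actually executed.
    """
    executed = []
    for name, func, upstream in steps:
        if name in saved_outputs:
            continue  # reuse the stored output from an earlier run
        inputs = [saved_outputs[dep] for dep in upstream]
        saved_outputs[name] = func(*inputs)  # an exception here aborts the run
        executed.append(name)
    return executed


def align():
    return "aligned.bam"


def buggy_quantify(bam):
    raise RuntimeError("bug in the quantification applet")


def fixed_quantify(bam):
    return bam.replace(".bam", ".genes.tsv")


saved = {}

# First run: alignment succeeds and is saved, quantification fails.
try:
    run_pipeline([("align", align, []), ("quantify", buggy_quantify, ["align"])], saved)
except RuntimeError:
    pass
saved_after_failure = dict(saved)

# After remaking the applet, only the failed step runs again.
rerun = run_pipeline([("align", align, []), ("quantify", fixed_quantify, ["align"])], saved)
```

In the rerun the alignment is not repeated, which is where the time saving comes from when only a downstream step had a bug.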
So that's one of the advantages of having this, and I think that type of technology is very difficult to implement, which is one of the reasons why we use this system. It's not that it's impossible. There's a... you have another question? Okay, there's, sure. Yes, it depends on how you've implemented the applets, right? So for our purposes, we do want to keep them as efficient as possible, but we gain a lot of throughput by basically running a thousand, like literally a thousand. We can submit a thousand experiments at once, and if the whole thing takes five hours, they will be done in five hours, because we have effectively an infinite supply of cores at Amazon. But you can also change the number of threads. You can optimize your applet so that it speeds things up. And last question: at some point, can you show us how to upload FASTQ files and download other kinds of files? Yes, that's the second time that's been asked. I don't know when we should do that, but we'll definitely try to squeeze it in, or find me if I don't get to it. So another question in the back. Yes, so the question is, are your TopHat and STAR steps parallel or serial? So one after another, with the unaligned reads from one step being taken to the other? That's my first question. And the second question is, the STAR index is very sensitive to the length of your reads. So if you have shorter reads, how are you handling that? Is the STAR index getting generated again, or do you just have a standard STAR index based on 100 bases that you're basically using for everything? So the first question I think is pretty simple. If I got it right, it's what is the dependency between the STAR and the TopHat alignment steps? So they're completely independent. They use the same input and they have completely different outputs and they don't interact. Okay, so what do you use the TopHat output for, then?
Because in the flow chart you're saying that you're using RSEM only with the STAR output and you're not merging them together or... No. So what are you doing with the TopHat output? Well, the idea is just to have a comparison for previous data that was done. I mean, it has to do with the fact that the consortium, or the RNA working group as a whole, really liked how the STAR program performed, and to be fair it was written by Alex Dobin in Tom Gingeras's lab, but other groups who use TopHat mostly wanted to compare the results directly at the BAM level, and they weren't worried about producing the quantifications for it. So your second question is a technical STAR detail about the length of the reads, right? So I don't think we take that into account, but Alex helped us write the pipelines, and also almost all of the current RNA-seq data we're running through it is pretty much the same length because it's done by the same people. But that's a good point. We might have to remake the indexes for that. Yeah, for the users, you know, because for you it's the same length, but not for the users. So like you think it's different at 36 and 100? Yes. Okay. Yes, anything above 100 should be okay, but anything below 100, if you have 50 or 60 bases, you know, around that range the STAR index really. Thanks. That's really useful to know. I did not know that. So there's a question that's been asked by a couple of people, so I wanted to bring it up for everyone because you may have the same question. The question basically is, do we have a pipeline for paired-end unstranded libraries? And as a defined ENCODE uniform processing pipeline, we do not, because all the ENCODE RNA-seq data that's been generated by the consortium is either single-end unstranded or paired-end stranded. And so that's why the pipelines that you see in the ENCODE Uniform Processing Pipelines folder are those. However, you can take a look at the parameters for the single-end unstranded.
Take a look at those. It is essentially the same software components. You can look at the parameters. You can modify the apps, the applets. We actually did have a few test examples that were those, the other two. So paired unstranded and unpaired stranded. Sorry. But we didn't actually... It was already four pipelines. It was already too many. And actually, I don't even think you need to do any coding. I think you can just rearrange the steps in those pipelines and make a new workflow, and it would work if you needed that. If someone has a specific use case, I'll show you how to do it. I guess we could publish them too, if we could organize it in such a way. Yeah, but I think the point that Ben addressed was also the one that I wanted to bring up, which is that part of the beauty of the DNAnexus platform is that you can mix and match applets that are public to make the workflow that you want to run your data through. So approximately 40 minutes ago we were supposed to have Seth talk about ChIP-seq. It's okay. Actually, we have this built in. So I think you want to go back and do that, or you want to skip out? Let's see how long my job has been running. Let's do that. Because I think we can analyze it. So it's been running 10 minutes. So I believe that the earliest we can visualize is about 30 minutes, which is probably going to finish us up. So there's a couple of options. One, I have an example of this already completed in a different account. So I can show you how to take the bigWig file and draw it on the UC Santa Cruz browser, which is kind of cool. Or we can have Seth come up and we can talk more about the pipelines in general, in sort of a less workshoppy way. But... Okay, sounds good to me.