Hello everyone and welcome to the Introduction to Galaxy tutorial, part of the GTN Smorgasbord, an event in February 2021 where we are presenting lots and lots of Galaxy tutorials. I'm Dave Clements. I am with Johns Hopkins University and part of the Galaxy community. Slides and links to the tutorials we'll be using today are at that URL. So, goals for this workshop. This may be the first tutorial you're participating in as part of Smorgasbord. So the goal is basically to learn the basics of how to use Galaxy. When we're done you will know how to get data into Galaxy, you'll know how to run tools, you'll know how to create repeatable workflows. So you'll have a basic idea of the capabilities of Galaxy. We're going to do this without requiring you to be a biologist or computer programmer. And we mean this. So one of the core strengths of Galaxy is that it enables scientists to do sophisticated data analysis without having to learn command line interfaces or Linux system administration. We also mean it about the biology part, at least for this tutorial. Galaxy is developed for the life sciences, but it is widely used outside of that. And if you want to learn how to use Galaxy, this tutorial does use a biological example, but it's pretty straightforward. And we explain all the concepts. I'm also not a biologist, as will become clear. So I think both of these statements are true. You don't have to be a computer programmer, you don't have to be a biologist for this tutorial. So this session's tutorials are introduction to genomics in Galaxy, and about two-thirds of the way through that we'll make a slight dip into extracting workflows from histories. So let's go, well, let's see, I'll open that up and then I'll come back here to the slides. Okay. So those are the two tutorials we'll be walking through today. But there are lots of other intro tutorial options. Anything in these topics. So the GTN is grouped by topic. 
And these are the two introductory ones, introduction to Galaxy analysis. One of today's tutorials is from there. And using Galaxy and managing your data and the workflow tutorial is taken from that one. As you do more with Galaxy, I particularly recommend name tags for following complex histories. That's a great tutorial, which will teach you things that are very useful once you get multi-input, multi-output let's see analyses. And using dataset collections. If you use Galaxy for large-scale experiments for things where you have 20 conditions over 20 time points, you will want dataset collections. You may want it even for smaller experiments. Without dataset collections, things become quickly intractable. So if you find yourself thinking, boy, this is really hard to keep track of. Take a look at both of these tutorials. Okay. So before we get started, we're going to introduce biology. Biology one, molecular biology, biology in three slides. So four years of college right here, three slides. Who knew? So things that are alive have chromosomes. In humans, they're long strands of DNA. We have 23 different chromosomes and we have two copies of each. So the DNA is, well, let's keep going. So chromosomes themselves have two strands. They have a forward strand and a reverse strand. So humans make this up. There's nothing inherent about each strand. But by convention, we call one strand the forward, one strand the reverse. Genes are used by your body in a particular way. So when your body is building proteins created by genes, on the forward strand, it walks left to right. And on the reverse strand, it walks right to left. So as your body uses your DNA to create itself, maintain itself, it walks in a particular direction on each of the genes. And what are genes? That is actually a huge controversial question as to what a gene is. You wouldn't think so, but it is. For our purposes today, genes are parts of your DNA that produce active molecules in your body. 
So particularly proteins. So genes are the blueprints for the proteins that make up your skin and your hair and your eyes and your bones. That's what genes are. And they exist on both strands. And as mentioned, they go in a particular direction when they are copied to build parts of your body. So that's molecular biology in three slides. Let's get to our driving question. And it's what we're going to use today to learn how to use Galaxy. And it's a pretty simple question. Do genes on different strands ever overlap? So do we ever see this situation where we have a gene on the forward strand and we have a gene on the reverse strand and they occupy the same space opposite each other? Okay. Now, there's a reason why we're asking this question. Your DNA consists of four nucleotides, just four, and we represent them with letters, A, C, T, G. Okay. Yeah, I don't know why I said it that way. A, C, T, G. Okay. And they constrain each other. So if you have a particular letter on the forward strand, you must have another particular letter on the reverse strand. So for instance, if you have an A on the forward, you must have a T on the reverse and vice versa. If you have a T on the forward, you must have an A on the reverse. Same thing with Cs and Gs. So the genes are recipes for building proteins. And it's unlikely that if you were to take, say, a sentence, write it down and then come up with some encoding scheme where A's became Z's and B's became Y's, if you were to write down a sentence and then reverse that sentence and then switch all the A's to Z's and all the B's to Y's and so on, that you would get anything but gibberish on the other side. And that's what we're doing here: the other strand is reversed and it's complemented. We've done this translation. And so the one side produces a working functional protein. The other side, almost always, maybe always, is going to produce nothing. It's not going to make any sense. 
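The reverse-and-complement idea just described can be sketched in a few lines of Python. This is only an illustration of the pairing rule (A with T, C with G); the function name `reverse_complement` and the example sequence are made up for this sketch and are not part of the tutorial.

```python
# Pairing rule from the talk: A pairs with T, C pairs with G.
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def reverse_complement(seq: str) -> str:
    """Return the opposite strand, read in its own direction.

    The sequence is reversed, then every letter is swapped for its partner.
    """
    return "".join(COMPLEMENT[base] for base in reversed(seq.upper()))


forward = "ATGGCC"  # hypothetical stretch of the forward strand
reverse = reverse_complement(forward)
print(forward, "->", reverse)
```

Applying the function twice gives back the original sequence, which is one way to see that the two strands carry the same information even though they read very differently.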
You're not going to be able to read that sentence or any part of it. So we're asking, does this ever happen where we get parts at least of coherent sentences on both strands? So that's our question for the day. Okay, so let's go to the tutorial. Let me zoom in a little. Maybe a lot. Okay, so this is our tutorial, an introduction to Galaxy and genomics. Yeah, let's scroll down. This is a typical GTN tutorial. Talks about a bunch of stuff. What our motivating question is, there's all the definitions we talked about. Here's the relative lengths of the 23 pairs of chromosomes in humans. We're going to look at chromosome 22, which is this one right here. We're going to look at that one because it's short and this is a workshop. Okay, so same information. Do they ever overlap? Okay, so how are we going to ask this question? We need information about where genes are. And today we're going to get that information from something called UCSC. We'll get to that in a bit. So we're going to get that information about where the genes are. And that information should have what strand it's on and where it starts and stops on each strand. Okay, and then we have to figure out, okay, how do we calculate where they overlap? So that's our question. But first let's get our data. First thing we need to do is log in. So each of you should pick a Galaxy instance. The usegalaxy.* instances. This tutorial will work on all of those. It will work on most Galaxy instances because it's using pretty basic tools. I'm going to pick usegalaxy.org. I'm in the US and that one is based in the US. But you can pick whichever one you want. So here's usegalaxy.org. I'm going to zoom in on this guy. This is usegalaxy.org. And this is where I'm going to do my work today. Again, I encourage you to do it on whichever one's closer to you, but you can pick whichever you want. All the tools we have are available on all three of those. Okay, so three panel design. Pretty standard for Galaxy. 
On the, this is my left hand. On the left, there is the tool panel. And it has a series of toolboxes which contain related tools that work on various different things. Some basic stuff like get data, text manipulation, but also some pretty specific stuff like variant calling or mapped data, SAM/BAM, or you name it, how to assemble a genome, how to annotate a genome. Lots and lots of tools. If you click on any of these, it expands out to show you the tools that are here. Okay. So this is the left hand tool panel. We'll explore some of those tools today. If we go all the way over to the right hand side, this is our history panel. And this shows us what work we have done so far. As we do work, it will show up in this panel as a box. Hopefully as a green box, which means it was successful. And each time we do something new, we'll get new boxes here. And new work is inserted at the top. And older work gets pushed down. So if you're a computer person, this is a stack. Okay. So older things will move towards the bottom. They'll stay on the bottom. And newer things will show up at the top. So this is a linear view of our analysis. In the middle is where we look at data and where we set up and run tools. So we'll get to that in just a bit. The first thing that we want to do is log in. Or register. If you don't have an account yet, you'll want to register. I have an account on the system which I will use. So I'm going to log in. So click on log in or register. Takes you to this panel. And I'm going to use my Clements account because I think I can remember the password. Let's try it. If you don't have an account, you want to do the registration. Okay. And it'll send you an email, and you'll need to confirm your email address. So now I am logged in. Instead of saying log in or register, it says user. And if I click on that, there's a whole bunch of things. I can log out. I can set preferences, custom builds, lots of things here. Okay. 
I'm not going to do any of those just yet. So so far, you've gone to a Galaxy server, and you have either created a login or logged in. The next thing you want to do is have an empty history. So if you have used Galaxy before, it will show you a history, probably the last history you had here. You'll want to clear that out by clicking on plus, which will create a new unnamed history. There are many advantages of using Galaxy after you log in. You can use it without logging in, but it means you can only have one history. That history will be lost when you close your browser, and you can't use things like workflows. So you can do a lot, but not the full experience. So I really recommend creating an account. So I've already got a blank history here, unnamed history. Okay. Let's go back to tutorial. Okay. So we've done that. Walked through that, walked through that, started with an empty history, got that. Okay. Let's get data. So we're here. We're going to get data from UCSC. So how do we get data into Galaxy in general? A couple of ways. One is the upload data. We're not using that today, but I'll show it to you briefly. Click on that, and it brings up this form. And you can get data from lots of different places. You can upload from your laptop. You can also use FTP, in which case, let's say you would tend to use FTP if you have very large files, and you don't want to wait for them to upload through this form. You would use FTP to transfer it to your Galaxy account and then import it into your history using this. There's also paste and fetch data here. If you have very short data, you can paste it. A much more common use of this is pasting URLs. So if you have a URL of a dataset, you paste that here, and that will import it. Today, we're not doing that. Today, we're going to get data from one of the tools that Galaxy knows about. So I clicked on the get data toolbox, which expanded out into this. 
If a tool is listed here, it means that tool knows about Galaxy, and it's going to know which Galaxy server called it. So all of these tools, if you go to them, you will generate a dataset, and at the end, you will say, hey, transfer this back to Galaxy. That's what we're going to do today, and it will know again which Galaxy to transfer to. So today, I'm going to use UCSC main table browser. The UCSC browser is a visual view of genomes. So it's called a genome browser. This is not that. What this takes us to is the database that backs the UCSC genome browser. We'll see the genome browser at the end of the tutorial. But this takes us to the database, and it allows us to get any of the data that UCSC has that it uses to generate its genome browser. It's a really useful resource. So I clicked on that, and it will take me, I hope, to UCSC. It does, and it's very tiny. Okay. So again, table browser. What is this? So we're trying to get information about genes on human, and chromosome 22 specifically. So let's keep that in mind. It's kind of a big form. It's got a bunch of stuff. Fortunately, the defaults are pretty close to what we want. It has information about mammals, vertebrates, that, whatever that is, insects, and nematodes, lots of things. Okay. We want mammal, human, which is what's preselected. That's great. But if we were working with mouse, we could get data about that, or armadillo, if we're doing research in leprosy. All sorts of things, Chinese hamster, pangolin, micro bats, naked mole rat, you name it, it's here. So if you're doing research in any species that UCSC has data about, this is an excellent resource. So we're happy with human. What about assembly? What does that mean? Click on that. Okay. We got five choices here. And this one is the most recent. I'm not going to go into what that means, but the most recent is generally good. Okay. I come over here group. Don't know what that means, but let's click on that. Okay. 
So different types of information. We want information about genes. So we're going to select that. And then which track? So GenCode V32 is the default. That's telling me something about what UCSC thinks most people want. And then I click on that, and it shows me this whole list of options. Okay. A whole bunch of versions of GenCode, but a whole bunch of other things too. This is rather daunting and it reveals an unpleasant fact in genomics. And human, especially because it's so well studied, is we don't have 100% agreement yet on what are genes. And this reflects that. And so they agree on 99% of the stuff, but the corner cases, we're still hashing out what's what. For our purposes, we're going to take the default, which is GenCode V32. Okay. Table known gene. I don't know what that means. And there's a bunch of other stuff there. I don't know what that means. But if I want to know, and I recommend doing this at some point, I could find out what known gene means by clicking on this. And this will take me to the database documentation for UCSC and tell me exactly what's in known gene. And if I do that and read it, I will figure out this is actually what I want. Okay. So far, we have not changed anything. All this is still the defaults. What do we want? Well, eventually we want the whole genome because we're asking a big question. But right now, we're not sure how we're going to do this and we're trying to figure out what steps to do. When you're figuring out what your pipeline should look like, what tools you should use and what options you should use. You often don't want to do it on the whole genome. The whole genome is 3 billion bases. And I don't know, somewhere between 20 and 30,000 genes, we just want a small portion of that so we can do a proof of concept. And then we'll come back and do everything. So we're going to work out our pipeline first with a smaller data set. And then to answer our actual question, we'll come back and get the whole genome. Okay. 
So I'm going to switch that to position. And then I'm going to tell it what I want. And as I said before, we're going to use chromosome 22, chr22. In UCSC, the terminology is lowercase chr followed by the number. Okay. In other places, it's just the number. In other places, it's capital C. Yeah. Lots of different places. Here, we're going to use chromosome 22. We're not going to specify a part of it, which we could do by saying colon, get me bases one through 300. We want all 50 million bases of chromosome 22. So any genes on chromosome 22 is what we're going to get. Okay. And then I'm going to skip the rest of this until we get down to here. We also want the defaults. We want it in BED format, which we'll describe shortly. And we want to send it back to Galaxy. Now UCSC did this. It checked this off because it knows that it was called from Galaxy. So now it's going to send it back. So that's good. Plain text. Yeah. I want plain text. And then get output. Yes. I want the output. Okay. So we're going to click get output. Again, we only changed position to chromosome 22. Get output. And it doesn't get us the output. Instead it takes us to this form, and custom track. I don't know. I'm going to ignore that for now. And I come down here and it says create one BED record per whole gene. Okay. I'm asking a question about genes. So I'm going to go with that, whole gene. And then it says send query to Galaxy. So I didn't change anything here. Send query to Galaxy. And now it's actually going to send the query to Galaxy. So I clicked on that. Okay. And now I'm back in Galaxy. So we saw a big green box. And then we saw a gray box. And now it's this, whatever color that is. And then hopefully it will go to green. What does all that mean? The big green box means the handshake between UCSC and Galaxy was successful. Meaning UCSC said I want to send you some data. And Galaxy said, okay, let's do that. 
It then went to gray, which means okay, I've scheduled it, but I'm not running it yet. It then went to peachy orange. And that says I'm actively getting data from UCSC right now. And then fortunately it went to green, which says, hey, we think we successfully transferred the data. So you'll see that as steps run. You'll often see a big green box in the middle that says, hey, okay, the first handshake worked. And then the job in the history panel will go through a three step process of gray, peach, green. Okay. So what have we got here? We got our data. Let's take a look at it. I'm going to click on that. So there's a couple of ways to look at data. You can do a quick preview where you click on the name, and it opens up and it tells you some facts about the data. Let's see. We have almost 5,000 regions here. That might mean genes. Okay. So we got almost 5,000 genes on chromosome 22. Format is BED. Database is hg38. When we were in UCSC, it asked us which build we wanted. And we picked the one from 2013. That's hg38. And I think that was actually listed in that selection as hg38. Galaxy also supports previous versions as well, although not all of them. Okay. And then a couple of other things here, and then we get a short preview of the data set. And this shows five records, the first five records, and columns. It looks like there are 12 columns. Okay. Chromosome, start, and it knows that. Let's actually see the whole data set. So to do that, I'm going to poke it in the eye right here, the view data. And by doing that, it brings up the data set in the center panel. And we could count these columns. Sure enough, there are 12. We're going to focus on the first six. Okay. So BED is a widely used format. The first column is always the chromosome, or yeah, the chromosome. So here it says chromosome 22. Second column is where does this feature start on this? So it starts at 10,736,170 on chromosome 22. Where does it end? It ends there. What's the ID? 
Now, this is an ID. It's completely, it's almost completely uninformative. It doesn't tell us much at all. But it is unique. Okay. And what it's unique for actually is not a gene. It's unique for a transcript. So this whole analysis we're doing today, we're saying gene, but the data we're actually using is transcripts. And what's the difference? So here's another fact about biology is genes can produce different things. So one gene can produce slightly different versions of itself, which result in slightly different proteins in your body. Okay. So maybe you're not getting enough salt today. And so your body makes one version. Tomorrow you're eating lots of French fries with lots of salt. And so your body generates another version. Okay. This is common. And so what we've actually got from UCSC here is transcripts, not genes, which means for some genes, we're going to have multiple transcripts. Okay. That's a compromise we're willing to make to get our basic question answered. In the long run, we might want to address that if we actually want to say genes. Okay. And we'll talk about that at the end. But this is basically a transcript ID. There are other ways to get a gene ID, which would not show us transcripts per se. But anyway, this is a transcript ID. We're going to treat it as a gene ID. The fifth column is the score. Here UCSC is not telling us anything about how confident it is that this is actually a gene. Do we have 10 tracks or 20 tracks supporting it? Nothing. It just says this is a gene, trust us. Okay. Sometimes this is valuable, very valuable. Here it's not. Column six, the last one we're paying attention to says what strand are we on? Are we on the forward strand or the reverse strand? Forward is plus, reverse is minus. Okay. This matters for our question. Okay. So that's our data. We've got 5,000 genes, 5,000 transcripts. And let's see. We know where the pluses and minuses are. We know the chromosomes. We know the start and stop. 
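To make the six columns we just walked through concrete, here is a minimal Python sketch that reads the first six fields of one BED line. The class name `BedRecord`, the function `parse_bed_line`, and the example line are all hypothetical; the coordinates and ID are made-up values, not data from UCSC.

```python
from dataclasses import dataclass


@dataclass
class BedRecord:
    chrom: str   # column 1: chromosome, e.g. "chr22"
    start: int   # column 2: where the feature starts
    end: int     # column 3: where the feature ends
    name: str    # column 4: ID (here a transcript ID)
    score: int   # column 5: score (not informative in this dataset)
    strand: str  # column 6: "+" for forward, "-" for reverse


def parse_bed_line(line: str) -> BedRecord:
    """Parse the first six tab-separated columns and ignore the rest."""
    fields = line.rstrip("\n").split("\t")
    chrom, start, end, name, score, strand = fields[:6]
    return BedRecord(chrom, int(start), int(end), name, int(score), strand)


# Made-up example line; a real line from this dataset has 12 columns.
example = "chr22\t1000\t5000\texample_transcript\t0\t+\textra\tcolumns"
record = parse_bed_line(example)
print(record.chrom, record.start, record.end, record.strand)
```

Keeping the strand as its own named field is what makes the later steps easy to express: splitting by strand only ever looks at column six.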
That's going to be useful, too, because we want to check to see what overlaps. Does this guy overlap with this guy? So before we start doing our analysis, there's a couple of best practices we want to follow. One is we probably want to rename this dataset. So everything here is true. It came from UCSC main. It's about human. It's about chromosome 22. It's about genes. Great. But when I come back a week from now and look at that, it's like, that's not nearly as helpful as it could be. So I clicked on the pencil, edit attributes. Come over here. I'm going to select all of this, move it to info. So I hang on to it. And I'm going to give it a more meaningful name for me personally. And what is that? That is genes, oops, genes, strands, 22. Okay. So I got genes from both strands. That's a pretty good name. And I save it. And now it changed. So when I come back a week from now, I'll look at this and I go, okay, I know what that is. Great. Okay. So that's one best practice: rename your inputs. Also rename your outputs. Intermediate datasets, that's a judgment call if you want to do that or not. Today, I'm going to do that just to keep things clear and because we're new at this. But eventually you probably won't rename most of your intermediates. You'll just rename your inputs and outputs. Okay. Second best practice, unnamed history. That's true. It is unnamed. But Galaxy is quite happy to have you have hundreds of unnamed histories. And when you come back a week from now and you look for, yeah, what was that thing? That name is not helpful at all. So we want to give this a helpful name, a gift to our future selves that says, what is this history about? And it's about overlapping genes. And this one is about chr22 in particular. Okay. So overlapping genes, chr22, return. Good. Okay. So now when I look at all my histories, which I can do by clicking up here, I will see overlapping genes, chr22. Good. Okay. Let's go back to the tutorial. See what's next. 
We got the genes. Did that. More definitions, reference genome. Good. Take a look at the tutorial. It goes into more detail than I'm going into today. We did that. Talked about that. Great. Looked at the data. Also great. Okay. Talked about that. I'm doing good. Okay. Let's see. Naming. Talked about that. Oh, we're so on it. Okay. Key point. So we've got the data. What's our plan for answering the question? And our question right here, how often does this happen? Okay. We've seen we have genes on both sides. How often does this actually happen? Okay. And the way we're going to answer that is we're going to split our data set in two. So right now our data set contains genes on both strands. We're going to split that into just genes on the forward, just genes on the reverse. And then we're going to compare those two and see which genes from those two data sets overlap with each other. Okay. And then we're going to check out how many of those actually overlap. Is it common? Is it rare? What's going on? Okay. And fortunately it turns out all these steps are easy in Galaxy. Okay. So let's go back in. Let's run our very first tool. So we have this data set. Now what do we want to do? The first thing we want to do is split it. Okay. And we want to split it based on the value of column six. So we got all these tools, right? And they're in toolboxes. And the toolboxes are well-named. Okay. Collection operations, text manipulation, filter and sort. Join, subtract, and group, you know. And how do I find a tool that does this? Okay. Now in my opinion, this is still a pain point in Galaxy. It's gotten better. But we have, I don't think there's a thousand tools, but I think there's almost 800 tools on the server. If you're on the European server, there's even more. So there are a lot of tools. How do you find tools that do things when you don't know where to look? Well, we have some idea where to look and we will get to that eventually. But what do I want to do? 
I want to split this file. I want to do what? I want to filter it and only keep the, you know. So I would try searching for a bunch of things like split file, filter. I would see what comes up. For all of those things, we're going to get a lot of tools. Okay. And, at least as of today, hopefully in the next release, but not as of today, not in a particularly helpful order, but getting better. Okay. I'm going to go to, this is either text manipulation or filter and sort. I'm going to try that angle instead. Okay. Text manipulation, click on that. I got filter tabular. Wow. Okay. That might be what I need. Genes on both strands as the input. Filter. Wow. Okay. So I didn't know this tool was here until right now. I practiced this before today, but anyway, I didn't know this was right here. This might work. Okay. This might work. It's not what the tutorial does. And I read filter tabular. Filter a tabular dataset, that is, tab formatted, by applying line filters as it is being read. Multiple filters may be used, with each filter using the result of the previous filter. This sounds really promising. Let's come down here. Line filtering example. So here we have one, two, three, four, five, six. Okay. Six lines in our input. So most tools will look like this. They'll have, let's see, inputs, you know, like what are the parameters, and an execute button, and then it will describe them. And here it says, okay, so filter one, append a line number column. Oh, so you can actually add columns with this tool. By regular expressions. If you're a computer scientist, go for it. Add a number column, replace the value, normalize list, append a line. Okay. So it looks like I can use this to modify tabular files, but not to get rid of all the minuses and keep only the pluses. So filter tabular, not what we want. Come down here. Split file. Okay. That will actually work. Okay. The tutorial does not use it. I'm not going to use it today, but this tool will actually do what we want. 
It will do it in one step instead of two. Today, I'm going to use a different tool because that's what the tutorial uses, but you could use split file as well. Okay. I come down here. I come down here. I go through these, and actually I would stop when I got to split file because that will do what I want. But the tutorial does not use split file. It didn't exist when it was written. What we want to do is just use filter. And filter is just a plain useful tool you should get to know. So we're going to use filter, filter data on any column using simple expressions. So what this does is it looks at your dataset and it says only keep rows where this is true. And what is this? It's a simple expression. So here, which file do I want? I've only got one input data set, so there's only one there. And what's my condition? Okay. Well, there's some conventions for this condition. One is if we want to specify a column, column one, we say c1. If we want to specify column six, which is what we want, we say c6. Okay. And we don't want it where it's chromosome 22. We want it where it's a plus. Okay. And you see this equal, equal? Congratulations. Once you hit execute, you're now a Python programmer as well as an expert genomic data analyst. That's Python, and that's how Python tests for equality. So what this is, you can make this a complex Python expression that says, okay, I want it where column six is equal to plus or where column two is greater than 500 and the score is zero. You can make this arbitrarily complex with ors and nots, all you want. Okay. We just want the simple one, you know, just keep the ones where column six is equal to plus. We have no header lines to skip, so just run it. And I don't want an email notification. I could get one. If this file were gigabytes or terabytes or petabytes, I might want to know when it's done in three hours, but it's not going to take three hours. It's going to take a minute. So I'm just going to click execute. 
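For anyone curious what a condition like `c6=='+'` amounts to, here is a minimal Python sketch of the same idea, assuming each row is just a list of column values. The helper `filter_rows` and the three rows are invented for illustration; this is not how the Galaxy Filter tool is implemented. Note that column six in the tool is index 5 in a Python list, because Python counts from zero.

```python
def filter_rows(rows, condition):
    """Keep only the rows for which condition(row) is true."""
    return [row for row in rows if condition(row)]


# Made-up rows with the first six BED columns:
# chrom, start, end, name, score, strand
rows = [
    ["chr22", "100", "500", "tx_a", "0", "+"],
    ["chr22", "700", "900", "tx_b", "0", "-"],
    ["chr22", "200", "400", "tx_c", "0", "-"],
]

# Equivalent of the condition c6=='+' (column six is index 5).
forward = filter_rows(rows, lambda c: c[5] == "+")

# A more complex condition, like the one mentioned in the talk:
# column six is plus, OR column two is over 500 AND the score is zero.
compound = filter_rows(
    rows,
    lambda c: c[5] == "+" or (int(c[1]) > 500 and int(c[4]) == 0),
)

print([row[3] for row in forward])
print([row[3] for row in compound])
```

Swapping `"+"` for `"-"` in the first condition gives the reverse-strand set, which is the only change needed when the tool is run a second time.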
And if I'm lucky, this will produce only the pluses. While this is running, I'm going to rename it. Again, you may or may not want to rename your intermediate data sets, but since we're new, let's do it. We think this is going to be the forward strand genes. So I'm going to call it that. Save. Okay. Now, a key best practice. We know we started with almost 5,000 genes. We just did an operation which is supposed to throw out all the minuses, all the genes on the reverse strand. Therefore, we expect, what do we expect this to look like? If I open this right now, I will see what it is, and I may not challenge it. I want to think before I see what my results are. What's plausible? What's not plausible? If I get 4,920, it means I didn't get rid of anything, and I don't think that's plausible because I saw pluses there. If I get zero, or sorry, because I saw minuses there, if I get zero, that's not plausible either because I know there are forward there. If I get more than 4,920, something is really wrong. I'm not doing what I think. So try to think about it before you see it, and it's plausible. So let's poke it. We now have, yeah, about half of 5,000, a little bit more on the forward than on the reverse, or at least that's what we think. Poke it in the eye. We should see that it looks just like it did before, except we only have pluses now. Okay. This is good. We got our forward set. Now we need a reverse set. A couple of things we could do here. We could go over here and run filter again. Okay. Type in everything. Select what we want. Okay. We could do this. We're not going to do that. Instead, we're going to use the run this job again, this looping arrow. This does not run the job again. What it does is it sets you up to run the job again. So I'm going to click on that. And it comes up here with exactly the same parameters that I used to generate this data set. Okay. So I look at that, and there's genes both strands. Good. That's what I want. 
I don't want to use this because this doesn't have the negative, the reverse strand on it. If I run this on the forward strand, I'm going to get nothing, or on this one, I'm going to get nothing. I want to run it on this. Good. The only thing I want to change is minus. That's it. Everything else stays the same. Now, for this tool, this loop didn't save us much time. But imagine we're using a really sophisticated tool, which has potentially up to 50 parameters. Now, all those parameters have defaults. But maybe we changed 10 of them. And we ran it. And now we're going to run it again. And we just want to change one of those 10. We want to see what happens. If I go over here and launch the form again, I have to remember what the nine settings were that I did the first time, plus the thing that's different. If I use this, I just have to remember the thing that's different. So here, not a big time win, but some, but imagine much more complicated, sophisticated tools. Okay. Here it just meant we had to type a minus. So we're going to execute this. And what we expect out here is the reverse strand genes. So as soon as I can edit that, I will. Reverse strand genes. Save that. And again, we want to play the expectations game. We ended up with a little bit more than half here. So we should end up with a little bit less than half here. So 2,300 maybe in reverse. And yeah, 2,290. Okay. So 2,290, 2,630. This is all looking good. Okay. So it passes the plausibility test. We poke it. And yes, they're all minuses. Okay. So we've now got the first part of our process. We've split it. Okay. We now want to check for overlaps. When Galaxy was young, so Galaxy has been around for 15 years, one of the very first sets of tools to be on it was operate on genomic intervals. And it does exactly what we're looking for. So genes are genomic intervals. They're on a chromosome. They have a start and stop. It's an interval on something that's genomic. It's a genomic interval. 
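Going back to the expectations game for a moment: the check we did in our heads on the forward and reverse datasets can be written down as a short Python sketch. The function name `is_plausible_split` is made up; the counts are the ones read off the history panel in this session (about 4,920 in total, 2,630 forward, 2,290 reverse).

```python
def is_plausible_split(total: int, forward: int, reverse: int) -> bool:
    """Check the expectations we stated before looking at the results.

    - Every record has exactly one strand, so the two parts add up to the total.
    - We saw both pluses and minuses, so neither part can be empty
      and neither part can be the whole dataset.
    """
    adds_up = forward + reverse == total
    both_present = 0 < forward < total and 0 < reverse < total
    return adds_up and both_present


# Counts as read off the history panel during this session.
print(is_plausible_split(total=4920, forward=2630, reverse=2290))
```

The point is less the code than the habit: write down what would count as a surprising result before opening the dataset.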
And there is a set of tools in Galaxy that do lots of different types of set-theoretic operations on different data sets. And here we're going to do set-theoretic operations on the reverse and the forward, comparing them to see where they have some intersection, if they overlap at all. Okay. So that's what we're doing. So we're going to go down here and look for Operate on Genomic Intervals. And it should be in common genomic tools. Yeah. Okay. So this is actually a common thing. There are a lot of papers that still use Galaxy only for this. So I click on Operate on Genomic Intervals. I'm going to come down here. Subtract, maybe? No, that's not what I want. I want intersect, in set-theoretic terms. I don't want subtract. Okay. And there's Intersect, two items later. Let's click on that and see if that's what we want. It's got three things. I don't know what this means. I scroll down. I get here. Does it tell me what it does? Okay. Well, okay. It has examples of what it does. So if I select overlapping, and apparently there are two choices, overlapping pieces and overlapping intervals. First data set, second data set. Okay. If this is my forward and this is my reverse, it looks like this tool is going to find genes in my forward data set that overlap with genes in my reverse data set. It's not going to find both of these. It's just going to find this one. So it looks like I'm going to have to run this twice as well: once where the forward is the first data set, and once where the reverse is the first data set. There's another option here, which is overlapping pieces. I don't think I want that because I want to preserve the whole gene. Okay. I don't want to lose the information that I have. If I do this, it looks like I'll just end up with a chromosome, a start, and a stop, and I will lose everything else. Okay. That's what that looks like. It's not what I want.
So I think I want overlapping intervals and I think I want to run this twice. So I scroll back to the top overlapping intervals. Yep. That's what I want. And my first one, this is the one I'm going to keep things from and that's the forward. Is that right? Yes. And then the other one is the reverse. So whenever there's an overlap between these two, keep the overlapping forward. Bravo. Okay. Now, how much do I want them to overlap? So the default is one. I'm going to argue that given my limited knowledge of biology, that one is not significant. Okay. One is a single letter. And you can easily imagine, if we use our sentence example, that one sentence starts with an E and another sentence ends with an E. That's got to be common. So I want to look at more than just one letter. The question is how much more? Well, in biology, especially for coding genes, three is a very significant number. So each three letters produces a part of a protein. And so whatever I pick here, it's going to be a unit of three. So it could be three, could be six, could be nine. If I say nine, I'm saying I want three units of a protein. Proteins often have lots and lots of units. Excuse me. Coffee. Wonderful. Okay. Proteins often have lots and lots of units. Three is not that many. Okay. So I'm going to go for nine, but you can go for three or six or nine or 12. I don't know what happens if you go higher than that. Okay. I'm going to go for nine, which says three protein units. Okay. So they have to overlap by at least three. It's not enough to overlap just by the first letter. It's got to overlap by three words, let's say. Okay. Good. And then I'm going to click execute. Going to check my settings one more time. Forward first, reverse second. Overlapping intervals. Looks good. I'm going to execute. Boom. Okay. Intermediate data set. Okay. I'm going to rename it because we're on this, you know, we're doing this. Excuse me. Okay. So what I expect to have here are overlapping genes. 
Forward strand. That's not what I meant to do. Share. Do that again. Okay. I'm sharing. That's what I meant to do. I had a little bit of a cough break there. Okay. We're back. Overlapping genes, forward strand. Now, remember, you know, best practice, think about what we expect this to be. Given what we know, how often do we expect sentences to have three words in the opposite order at the end of another sentence? Okay. Is that common, rare? I don't know. I can imagine it happening. So my expectation is there'll be some here. But let's see how many. Okay. So here we go. 986, which is a huge number. We started out with what? 2,600? So what is that? That's almost 40. That's a third, maybe. Between a third and 40 percent of our genes on the forward strand overlap with some gene on the reverse strand. Okay. 40 percent. So that's higher than I expected. Okay. But we're going to keep going. And then we're going to take a look at this and see what's actually happening. And see if what we asked for is what we're getting. Okay. So let's keep going. We've got it here on the forward. Let's do it again for the reverse. Okay. So we're going to use the run this job again. And we're going to flip these and say, I actually want the reverse first and I want the forward second. I want to get the reverse genes. Okay. And this will produce a list of that. Execute. So click on execute. And as soon as it lets me, I will change the name. Let's try that. Save. Now, expectations game, if this is correct between a third and 40 percent, this should be a third and 40 percent, 834. And it was slightly smaller to begin with. So yeah, this, if this is right, then this is also plausible. Okay. So got really high numbers here compared to what I was expecting. So let's look at the data a couple of ways. And I'm going to highlight some tools to do that. So the first thing, let's, let's see. Yeah. Okay. I'm going to poke this guy in the eye. This is the overlapping genes on the reverse. 
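To make the "overlapping intervals, at least nine bases" setting concrete, here is a minimal Python sketch of the idea. It is not the Intersect tool's actual implementation: `overlap_length` and `overlapping_intervals` are hypothetical helpers, the intervals are invented, and everything is assumed to be on the same chromosome with half-open (start, end) coordinates.

```python
def overlap_length(a, b):
    """Number of shared bases between two half-open (start, end) intervals."""
    return max(0, min(a[1], b[1]) - max(a[0], b[0]))


def overlapping_intervals(first, second, min_overlap=9):
    """Return whole intervals from `first` that overlap at least one
    interval in `second` by `min_overlap` bases or more."""
    return [
        a for a in first
        if any(overlap_length(a, b) >= min_overlap for b in second)
    ]


# Invented intervals, all assumed to be on the same chromosome.
forward = [(100, 500), (1200, 1800), (2500, 3000)]
reverse = [(300, 900), (2000, 2504)]

# The tool keeps intervals from the FIRST input only, so it is run twice
# with the inputs swapped, exactly as in the walkthrough.
kept_forward = overlapping_intervals(forward, reverse)
kept_reverse = overlapping_intervals(reverse, forward)

print(kept_forward)
print(kept_reverse)
```

With these toy values, the forward interval (2500, 3000) shares only four bases with (2000, 2504), so it is dropped at a threshold of nine but would be kept at the default of one. That is the difference the threshold is meant to make.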
Now, I look at this and it looks like it's still sorted by starting position. It is. But if I wanted to be absolutely sure this was still sorted by starting position, I would use the sort. Okay. So filter and sort, sort right there. And I would say, hey, sort this by column one and then by column two and maybe by column three as well. But it's in position order. Okay. So which one is this? It turns out I'm looking at the reverse. I can tell because of this little bar. Okay. So it starts at 1555. Let's take a look at the forward now. 1554. Okay. So the forward starts at 1554. So it starts before the reverse. Okay. So the forward starts and then the reverse starts, and then, let's see, where does the forward end? It ends at 1557. That's where the reverse ends. Sorry. And then where does this one end? Poke it in the eye. Okay. Transferring data. What you're thinking, I hope you're thinking is, Dave, this is crazy. I don't want to bounce back and forth doing this, this, this to see. And yeah, you're right. You don't want to do it. Okay. So I'm going to take you one step up from that, which is a generally useful tool. And that is scratch book. So yeah, toggling back and forth between views and trying to visualize this in your head with nine digit numbers, eight digit numbers is crazy. Don't do that. Okay. For a quick sanity check, you can use scratch book. So I'm going to enable scratch book by clicking on it. And it doesn't appear to do anything. Okay. It changes it to yellow and adds a check. But what it does is change how view data works. So I'm going to poke it in the eye and something different is now going to happen. So it now pops it up in its own window, which I can resize. Okay. So I do that. And which one is this? This is reverse. Okay. So let's poke this one in the eye, the forward. And now I have the forward and the reverse up at the same time. And I don't have to toggle back and forth. The math is still brutal. So 1554, this starts before this one.
So now I know though it's 1555 within this one and it is 1557. So 1555 is between these two. So I know that this starts and it looks like it ends. This one is completely contained in this one. Okay. And that was obvious right off from just looking at that in the scratch book view. Whereas before it, yeah, it's insanity. So use scratch book. Okay. But this still doesn't give us a ton of insight about what this looks like, how it's going. I don't want to do this for every one of these. There's like 900 of them, right? But I want some better view. So what we're going to do next to get that better view is send it to UCSC, the genome browser. Okay. And we're actually going to use the genome browser this time to view it in the context of the human genome. And this will be very helpful. So I'm going to get out of that. Okay. And what I'm going to do now is I'm going to reunite these two back into one file. They started in one file. But now that one file will only have genes that overlap with something on the opposite strand. And the way we're going to do that is with a concatenate tool, which is actually named in a way that makes sense. And it's in text manipulation, which also makes sense. So concatenate multiple data sets tail to head. That sounds right. That basically means I'll stack this one on top of this one into one much longer data set. So 834 plus 986 is what I expect the number of genes to be in the result. Okay. And if I come down here and I look at this and it's like, okay, I got two here, three here. And I can do more than two data sets. I got two here and I can concatenate them all. They all end up in one. Bravo. Okay. So this is what I want. I need to select more than one here. And on my Mac, that's a shift select. Okay. And overlapping genes, overlapping genes for, okay, good. That's what I want. I'm going to execute. Okay. And off it goes. And again, this, this may be my final output. 
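The concatenate step and its expected row count can be written down as a tiny Python sketch. The `concatenate` helper and the placeholder rows are hypothetical; only the two counts, 986 and 834, come from the walkthrough.

```python
def concatenate(*datasets):
    """Stack datasets tail to head, in the order given, into one list."""
    combined = []
    for dataset in datasets:
        combined.extend(dataset)
    return combined


# Placeholder rows standing in for the two intersect outputs.
overlapping_forward = [("+", i) for i in range(986)]
overlapping_reverse = [("-", i) for i in range(834)]

both_strands = concatenate(overlapping_forward, overlapping_reverse)

# Expectation stated before looking: the result has exactly the sum of the parts.
expected = len(overlapping_forward) + len(overlapping_reverse)
assert len(both_strands) == expected
print(expected)
```

Because the rows are simply stacked, all the forward rows come first and all the reverse rows follow, which is why the result is no longer in position order.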
So I really want a good name for this, because this does list the genes that overlap, if we're doing it right. Okay. So I'm going to say, okay, overlapping genes, both strands, chr22. Oops, save. Okay. Save. And again, our expectation was it's this plus this. Four plus six ends in zero. So that should end in a zero, at least. And it does. 1820 is about, yeah, it's about that. Okay. So looks like we got it. If we poke it in the eye to see it, we will see that, in fact, it did just stack them with all the pluses on top. And at the bottom are all the minuses. Yeah, I could scroll down. Trust me, it's there. Okay. Good. So we now have something that we can ship to UCSC to visualize. And it's going to have both the forward and the reverse genes in it. We have other options for visualizing this. We can view it in IGB or IGV. These are both genome browsers. They're standalone tools. We could do it. But we're going to go back to UCSC. So to do that, I'm going to click on this, UCSC main. So it's going to take us into the genome browser, again, the visual representation of what we saw before. Okay. A couple of things to note. Right at the top, user track. This is what we sent. And all these little things are from our file. They're all genes that we passed on to UCSC. Okay. And then down here is all sorts of information that UCSC has about this particular region of the genome. OMIM allelic variant phenotypes, gene expression. This is from, yeah, GTEx. Okay. Really colorful graphs that tell you where each gene is expressed. Lots of other information. And if you want even more information, you can turn on all sorts of tracks. So really a phenomenal resource. If UCSC has your species, there's a huge amount of data. Okay. But right now, this huge amount of data is on the overwhelming side. Okay. So how do I tell if stuff is overlapping? How do I find anything overlapping? And what you would do is you would zoom in, potentially. Okay. That's one way to do it.
Using these things, or you can zoom in if you click and drag on this, or even click and drag on this. Okay. It's a very useful resource. It's a separate tutorial. So I'm not going to cover how to use UCSC in depth today. I'm just going to show you what we need to to find what we're looking for. Okay. So I looked at these results ahead of time. And after wandering around some, I found a pretty clear example of what highlights what we found. And it is the DGCR2. It's the same one that's in the tutorial. So gene is DGCR2. If you're not a biologist, that doesn't look any more useful than ENS000721. But if you are a biologist, that's a gene name. Okay. And that's much more helpful. And sure enough, it found it. Okay. And I want to take the browser to go look at that gene. Okay. Now, I originally found this gene by just zooming in and wandering around. Let's do that. So I click on it and I select and it takes me to it. Right there. DGCR2. Okay. And what have I got here? I've got a bunch of stuff in the top. The complexity is starting to show up. So each of these things here, if I zoom out by 1.5x, it'll show me the DGCR2. And it's three different, well, just one variation right here goes, no, one, two, three variations goes from there to there, but it has additional information like these arrows. Well, I can tell you right now, those arrows are which strand is it on. So DGCR2 is on the reverse strand. Right. And then what else? But what are these thick, thin boxes in the middle and at the ends? Well, let's get to that. Okay. This doesn't show me anything useful yet. So I'm going to zoom in again and I'm going to zoom in on the very end portion here. Let's see. Now, this user track is not displayed how I want it to be either. I'm hoping it gets better on the next one. Let's say it doesn't get better on the next one. Okay. I'm right clicking there. And for some reason, so it picks from any of these possible ways to display this information. It can hide it all together. 
It can be dense or squish or pack or full. And each of these brings more levels of detail. I think I want to see this information in the same level of detail as it is down here. So I right click again and it's pack. So let's try right clicking here and set it to pack and see what happens. Okay. Now, I don't know why UCSC picked that. Maybe it has a threshold where it says, okay, if there's more than one's doing, whatever. Don't display it as pack. Display it as whatever it was. Okay. But now I see, okay, there's actually more going on here. Remember that we talked about transcripts versus genes. This is all actually the same gene right here. Okay. That's shown down here. It's all DGCR2, but it's different versions of that gene. Okay. But now we see there's actually something over here that might overlap. So I'm going to zoom in on the end because that's the most informative. Now, to do that, I can't really use this because I'm really zoomed in on the whole chromosome. That's what this represents. But I can zoom in on this region of the chromosome. The way I'm going to do that is click and drag. Okay. So I've clicked and dragged and I'm covering the end of DGCR2. And I release and it comes up and says, hey, Dave, what do you want? And it shows in the future, you can do this instead of what you did. And I have yet to actually read this and pay attention, but it would make things more efficient. I just click drag and say zoom in. Okay. And it re-renders that. And now it's just showing me the end of DGCR2 because that's what I wanted to see. Keep in mind, this is the user supplied track. So this is what we shipped it. And we're looking at this now and we see, okay. So this is all DGCR2. This is AC004471. That's not very usefully named either. But it's another gene. So we're showing two genes, this one DGCR and this one. 
And if we look at this, it looks like both here and there in our own supplied data that this one is on the forward strand, this one is on the reverse strand. So they do overlap. Something to note is they don't overlap where the boxes are. Okay. They only overlap where the thin lines are. Okay. What does this mean? So I'll tell you what this means. It means I lied. Okay. It actually takes four slides to learn molecular biology. Okay. So four slides. Our motivating question was do genes ever overlap? And we found, it looks like we found that, yeah, they do. They overlap a lot like a third of the time. Okay. So this is the question we asked, the one on the top. And we found out, okay, they do. It's quite common. But the question we actually want to ask is do exons overlap? It turns out that genes have complicated structure. It's not just a start and a stop. There are different components of it called exons, introns, and a whole bunch of other things. And our question is, okay, do things that produce functioning proteins that produce skin and hair, do they ever overlap? That's the question we meant to ask. It's not the question we asked. The question we asked was do genes ever overlap? And the answer is, yeah, they overlap a lot. But this is not really constrained by this because this is an intron. It largely doesn't matter what's in here. Okay. It matters a lot what's in here. So this can constrain this all at once and this can constrain this all at once. It's not really answering our question. What we want to see is where these exons actually overlap. We want to see this condition, okay, this one, not this one, where, yes, these genes do overlap, but they only overlap in the introns, not in the exons. So this is not good news because it means we just ran this experiment, which it looks like we ran correctly. But we weren't asking the question we thought we were asking, okay. Okay. So, but all is not lost if we go back. 
So this is where we dive into the other tutorial about extracting workflows from histories. Our recipe here worked, right? We started with genes on both strands. We pulled out the reverse and the forward. We did the overlaps and then we reunited them. And it worked. We found all the genes that overlap. We now know what we need to do to find exons that overlap. It's the same set of steps. We just need different inputs. And this is what workflows in Galaxy are for: if you want to repeat a process on different data or with slightly different parameters, you can do this with workflows. This is a history. You can think of histories as a baked cake. Okay. This is what we've done. It's a gene cake. Okay. It's a vanilla cake. It's what we've done. It's right here. With Galaxy, we can ask this cake, hey, tell us the recipe that was used to make you. And then we can use that recipe to create an exon cake, a chocolate cake, using the same recipe, just a different input. So that's what we're going to do. So we did spend a lot of time doing this. Okay. And we figured out what we did and we figured out the logic was good. Even if the inputs were not what we thought. So this work was not wasted. First of all, we learned a lot. Another cough break. Sorry. We learned a lot, but we also produced a recipe that's going to work. This protocol is going to work if we run it again on exons, or at least we think it will. So let's try that. First step is to create a workflow from it. And the way we do that is we go to the top of the history panel. Okay. We can refresh the history, which just reloads it. We can create a new history. Don't want to do that yet. We can look at all the histories we've ever created. Don't want to do that right now. We can look at history options. And that's what we want. So click on the cog. All sorts of things we can do. We can copy. We can share or publish. This one's really important.
If I'm going to put this in a paper, if I'm going to cite this experiment in a paper, I can actually put a link to this history in the paper and I can put a link to the workflow we're going to create as well. And that allows anybody reading the paper to go see exactly what steps we took and exactly what our inputs and outputs were. And this reproducibility is done by default in Galaxy. So if I want to do that, I would share it, which creates a public URL for this history. And you can put that inside a paper. Show structure, extract workflow. There we are. Extract workflow. That's what I want. So I'm going to click on that. Okay. And that brings up this in the middle. If we look at this, this corresponds to this. What does this say? It says genes both strands. Okay. This exactly corresponds to that. Okay. Treat as input dataset. Yep. So I have one input in this and then I have five other files, five other datasets that get created as we go. Did I have any missteps in this process? I don't think I did. There's nothing here I did that I don't want to keep. Usually there is, because you're trying different tools. You're seeing if this works. Something doesn't work. So usually you'll have other things here that, okay, were dead ends. You would unselect those over here. Okay. You would change that and say, oh, that didn't work. So take it out of the workflow. But here, I think everything worked. So we're going to leave everything in. Okay. It's going to have one input. Bravo. And I'm going to give it a name. And instead of that name, the default name, I'm going to call this what? Overlapping. I don't want to call it genes or exons because it can be used with either. What's a generic term for that? Features. We call these things features. Overlapping features on opposite strands. So they have to be on opposite strands. Can't be on the same strand. And I think that's good. So I'm going to create that workflow. I created it. Look at that. Okay.
Workflow overlapping features on opposite strands created from current history. You can edit or run the workflow. Okay. I'm going to edit the workflow to begin with because that's going to give us some, you know, some idea of structure. It's a really good way to see the structure of what you've done, by the way. So I'm going to click on edit. Oh, and it comes up and says this Dave, if you leave, you're going to lose something. And it's like, what am I going to lose? I look at this and I can't tell what I'm going to lose. So I go back to stay on page and I look again and I don't see anything I'm going to lose. And I try it again and it says, no, you're still okay. What's happening is this, the scratch book. When I leave this page, I'm going to lose my scratch book formatting. Okay. So if I had spent a lot of time arranging this, you know, in very handy ways, I'm going to lose that because we don't save that beyond your session. Okay. So I could say, I don't care and just lose it. You can do that. Or if that makes you uncomfortable, you can close all these and then you won't get that warning. Okay. So I just closed all those and scratch book scratch book is still active, but I'm not going to do it. I'm going to click on edit and it should just let me go. Yes. Okay. This is the workflow editor. Let's see. This is the default layout. And then expand it just a tad. So we have genes. So that was our input. We ran two filter operations. We ran two intersect operations. I'm going to try and make this pretty because I'm going to publish this someday. And then the concatenate. And so that was another cough break. Just like in live workshops, in person workshops, instructor messes up his voice. So I'm going to take a cough drop and I'm going to be talking with a marble in my mouth for a while, but hopefully it will be better. Okay. Excuse me. So we've got our workflow here, which is our history. It's a graphical representation. 
And in the history, it's a linear representation. So, you know, it's very simple, but you start to get lost if you get long, long analyses. You can always create a workflow from it and view it here. And in my opinion, it's a very nice view. So a couple of things I'm going to do here. I'm going to add some documentation. So let's see. I'm going to click on that one. And I'm going to say, now, genes, not so good. This is our input. So features on both strands. And this is the, I don't know, this is either the plus or the minus. Let's see which one this is. This is the forward. So this is the forward strand features. And I come down here and this should be the reverse strand features. Look at that. Yes, it is. This is worth doing with workflows. The whole point of workflows is to reuse them. So it's worth putting time into documenting it. It's a gift to your future self. If you come back a week from now, you may have a hard time remembering this because you've run 35 other analyses since then, and all these comments are going to help you. Okay. So you might want to put a comment on each intersect. I'm going to put a comment on this one, which is that this is the overlapping features, forward and reverse. Okay. So that's good. Okay, good. So I've made all these notes to my future self so that when I try and run this workflow in a week, it's going to make sense. Now I'm going to save it. So click on the save icon. And now it's saved. I could have run it before, but it wouldn't be as helpful as it is now. If I run it now, it'll have some useful information to tell me when I run it. Before I leave the editor, let's talk about a couple ways to use it. So what we did right now is we ran a history and we extracted a workflow out from our work. And that's what I always do because I'm not clever enough the first time I do an analysis to get all the steps right. I'll have dead ends, you know.
But some people are clever enough to get it right the first time. And for that, you can come into the workflow editor and create workflows from scratch. And you can edit them in ways I haven't done here. So for instance, maybe I want my end result to be sorted. Okay. So I click over here on filter and sort and I click on sort. And sort shows up. And I will now take the output of that and sort it. And I will actually change the name here to be this because that's more meaningful. And now it will have these features in sorted order instead of forward then reverse. Okay. So I could modify that here. I'm not going to do that today. So I'm not going to click save at this point. Let's see. Now what are we going to do next? So we've got our workflow. We've saved it. I can find it in Galaxy anytime I want. It's here in the workflow tab. What do I need to do? Okay. So I'm going to run it again on the exons. Well, I don't have the exons yet. So I should go get exons. There are a couple of ways to get exons. The tutorial, well, it mentions both of them. It covers one of them. I'm going to do the other. Okay. So I'm going to go back to my Analyze Data view, the main view. And it's going to say, Dave, you have some unsaved work. It's the sort. Okay. And I don't want to save it. So I'm going to leave page. And it's just taking me back here to Analyze Data. What the tutorial page does is it uses a tool to actually extract information. Okay. From this file. From the file we first got from UCSC. So we only paid attention to the first six columns, right? Through there. But there are six more columns out here. And these columns actually define the intron exon structure. They define what we saw in the browser. Where the lines are thick, where the lines are thin, where it's only the really thin line. This defines what the structure of the gene looks like. So the information about exons is in here.
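To see how those six extra columns can encode the exon structure, here is a minimal Python sketch that walks one 12-column BED line and emits one six-column row per block. The function name `bed12_to_exons` and the example transcript are hypothetical, and this is not the Galaxy tool the tutorial uses. It relies on the standard BED layout, where column 10 is the block count, column 11 the block sizes, and column 12 the block starts relative to the feature start.

```python
def bed12_to_exons(line):
    """Split one 12-column BED line into one 6-column row per block (exon)."""
    fields = line.rstrip("\n").split("\t")
    chrom, chrom_start = fields[0], int(fields[1])
    name, score, strand = fields[3], fields[4], fields[5]
    block_count = int(fields[9])
    # Sizes and starts are comma-separated lists, often with a trailing comma.
    sizes = [int(x) for x in fields[10].rstrip(",").split(",")]
    starts = [int(x) for x in fields[11].rstrip(",").split(",")]

    exons = []
    for size, rel_start in zip(sizes[:block_count], starts[:block_count]):
        start = chrom_start + rel_start  # block starts are relative offsets
        exons.append((chrom, start, start + size, name, score, strand))
    return exons


# An invented transcript with three blocks on the reverse strand.
example = "\t".join([
    "chr22", "1000", "5000", "toy_transcript", "0", "-",
    "1200", "4800", "0", "3", "300,200,500,", "0,1500,3500,",
])

exons = bed12_to_exons(example)
for exon in exons:
    print(exon)
```

Note that this yields every block, including untranslated ends. Restricting the output to coding exons would additionally mean clipping each block to the thick region given in columns 7 and 8, which the sketch leaves out.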
And there is a tool that will walk this and actually create a file which contains exons. When would we want to do that? Well, we could do it now. I'm not going to. I'm going to show you the other option. But you would really want to do this if this file were 60 gigs. And you downloaded it from somewhere. And now you have to go back and get just the exons, if that's available. Well, there's going to be one or more exons per gene. So the dataset we get for exons is only going to be bigger. You don't want to wait for that 60 gigs. You just want to extract it from here. So in that case, you definitely would want to run the tool that walks this and extracts the exon information. We're just going to go back to UCSC and get the exons because it's easy. So let's do that. We're going to get data. And again, all these tools know about Galaxy. If you remember, UCSC main, right there, table browser. We're going to go there. Click on that. Okay. Now, same view as before. All the settings we had before, except, look at this. Chromosome 22, which before was just chr22, has now become this. Why? Well, UCSC remembers what part of the genome we were looking at in the genome browser when we were looking at the gene structures. And it has saved it and put it there for us, rather conveniently, except that's not what we want. We want the whole chromosome. So that's the only thing we need to change. Everything else is the same. Position, we want all of chromosome 22. BED, Galaxy, great. I don't see anything here about exons. Okay. And it turns out it's on the next page. So click get output. And right there, coding exons. So click on that. Okay. Coding exons. Before, we got whole gene. This time we just want the bits that end up in proteins, and send query to Galaxy. So we do that. We get the big green box. Hopefully then we go from gray to peachy to green. So there's a couple of things to note. I can rename that. Okay. Let's rename it. Yeah. Okay. Exons. Exons on chr22.
I save that. Okay. And our expectation here is that this is going to be bigger than this. Because again, genes have one or more exons. And I go there. And yeah, we went from 5,000 to 16,000. So more than tripled in size. If I poke it in the eye and look at it, what do we get? We only have six columns now because the exons don't have internal structure. They just have a start and stop. So all the information that was out here is gone. And a whole lot of pluses going on. But right at the bottom there's a minus. So we have that minus. That means we got, we think we got, what we wanted. We got both strands. Good. And now we want to rerun our workflow. Okay. But I'm going to back up just a little. So when I got this information, I didn't set up a new history. I forgot. And so now I've got this exons file at the end of this genes history. Right. And you know, I could leave it there as a reminder to myself, like, Dave, you messed up. Or I could delete it. But, you know, I got the file. I want to hang on to it. So what I want to do is copy it to a new history. And maybe I should have done that to begin with. That is, open a new history and then go to UCSC. That's what I should have done. But I didn't. But all is not lost. What I'm going to do is open a new history and copy it. Copy this to that new history and then delete it from this history. Okay. So I could create a new history and then figure out how to copy it. Or I could go here. I could copy the complete history. That would work. Not what I want because I would get all the extra stuff. I just want to copy some data sets. So we do that. And what data sets do I want to copy from which history? I have lots of histories. So it picks the current one. That's correct. And I want to copy that one. And which history do I want to copy it to? I could copy it to any of mine. Well, I actually want to put it in a new history. I want to call that overlapping exons chr22. Okay. And it says it's just going to copy that one.
And I'm going to do that. Click it. Beautiful. Okay. One data set copied. And if I click on that, it'll take me there. And I could do more copies if I wanted. I don't. But before I go there to that history, I want to get rid of this guy. And so I'm going to click delete. Boom. And it's gone, because this is not about exons. It's about genes. Okay. But it's still using up my disk allocation. So what I want to say is data set actions, purge deleted data sets. So I said delete, but Galaxy, like most systems these days, doesn't actually delete it right then. You can say, okay, I'm sure I want to get rid of this. So I'm going to click purge. And it's going to delete it from this history but not the other one. Okay. And so now it's gone. Great. Okay. And let's go here. View all histories. Okay. So again, I've been doing this for a while. I have lots of histories. This is the one I'm currently in, about genes. I want to be in this one. So switch. Okay. I'm now there. And I now want to start my analyses. Whenever you want to go to your analyses, you click on analyze data, and that'll take you back to your Galaxy home. Okay. And now we have an empty history except for our input data set. Yep. 16,000. It's all there. Things are looking good. Okay. So let's see. Let's run a workflow on this. So we go to our workflows tab and I have lots. Here's the one we just created. So I click on that and it says what? I can edit it, copy it, download it, rename it, share it, view. But run's not there. But there's a lot of stuff I can do. So what I actually want is this. I want to run the workflow. So I'm going to click on that and say run the workflow. Okay. Now, we built this from the genes history. We're going to run it on the exons. It's saying which, you know, what's your input data set? I only have one data set over here. So there's only one listed. Expand the full workflow form. I'm going to do that. Okay. So there's that. Send it to a new history. No, because I got this one. Okay.
Forward strand. Boom, boom, boom. Click on that. It says, okay, it's a filter. It's great. Okay. Something to note is I'm still doing nine, okay, for my overlap. I want to have three codons, three, three words. Okay. I could change this if I wanted to. I could rerun this analysis, you know, on the same data set or a different data set. And I could say, well, what happens if I do 15, five words? I'm not going to do that, but you can do that with workflows. Okay. So we're here. I'm going to run it. And it's going to come up and it's going to say successfully invoked, good, big green box, necessary, not sufficient. Okay. And it's saying, let's see, you can refresh the history pane. Well, how do I do that? I do this refresh history. Okay. Well, I don't need to. Good. Because it puts stuff in for me. Okay. And it has now scheduled these one, two, three, you know, five things to do. And it knows it can't run these two until this one, it can't run this one until this one is done and so on. And then there's this one that it can't run until these two are done. And so hopefully these all go through the state where they go from gray to peachy to green. Okay. What have we got here? Something I didn't show in the workflow form is you can configure the steps to be hidden when you run them. And what that means is that they would initially show up here when they're gray and when they're peachy. But once they go to green, they will go to green and then they'll disappear. They will become a hidden data set. So it means that what you would have here is you would say, okay, I have exons and then I run a workflow. And the only thing I care about from that workflow is this guy. And so we would see this while it's running. But once it's done, we would go direct from data set one to data set six. And then if we wanted to see the hidden data sets, we would use the cog. I didn't do that. But we could. So I'm going to update this one, change the name. Assuming we got it right. Okay. 
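As an aside, the waiting behaviour described above (a step stays gray until the steps it depends on are done) can be sketched as a dependency ordering. This is only an illustration, not how Galaxy schedules jobs internally. The step names are hypothetical labels for the five steps plus the input data set, and the exact wiring between them is an assumption based on the description.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical step names. Each step maps to the steps it has to wait for.
# The wiring is an assumption, not read from the actual workflow file.
depends_on = {
    "filter_forward": {"input_exons"},
    "filter_reverse": {"input_exons"},
    "overlap_forward": {"filter_forward", "filter_reverse"},
    "overlap_reverse": {"filter_forward", "filter_reverse"},
    "concatenate": {"overlap_forward", "overlap_reverse"},
}

# static_order() yields every step only after all of its prerequisites.
order = list(TopologicalSorter(depends_on).static_order())
print(order)
```

Any order this produces starts with the input data set and ends with the final concatenate step, which matches what you see in the history pane as the boxes turn green.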
Say exons on opposite strands, chr22. Okay, save. Now, how did this go? So we ran that filter. Okay, we had 16,000 and a little bit more than half on the forward. We expect a little bit less than half on the reverse. 7221. So if we add 7221 plus 8925, hopefully we get 16,146. And we do. If I scroll over, spot check it. Yep, they're all on the minus strand. Good. Things are looking promising. But we don't yet know how common it is to have overlapping exons. And let's take a look here. Let's see how many, I think this is the forward strand. How many did we get? So we went from one out of three, two out of five genes to a total of five exons on chromosome 22 that overlap with something on the reverse strand. We look at this one, we get eight. So there are eight exons on the reverse strand, five on the forward that overlap. So when we changed our question to actually be the question we wanted to ask, we found out, if we did this right, that it's pretty rare for coding exons to overlap on opposite strands, which is what we would expect. Okay. If we look at this, overlapping exons on opposite strands, we have 13 regions. We could view it in UCSC main again and actually visually verify that. We could also verify it by sorting these results and then eyeballing it. Okay. If we do that and we figure out, yeah, we're doing what we think we are, we would then rerun this on the entire human genome and find out which coding exons have overlaps. So are we done? And the answer is mostly. Okay. We've answered our basic question, is it common, uncommon? And we found out it's pretty uncommon. We started with 16,000 and we went down to 13. So that's less than one out of a thousand. Okay. So what is that? Less than 0.1% of exons overlap. Okay. So it looks pretty rare. If we run this on the whole genome and that trend holds, then it's, okay, that's pretty rare. 
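For readers who want to see the idea outside of Galaxy, here is a minimal sketch of the question the workflow asks: which exons share at least nine bases with an exon on the opposite strand. It is not the Galaxy tools' implementation. The `Exon` type, the function names, and the toy coordinates are all made up for illustration. Only the counts at the end (8925, 7221, 5, and 8) come from the run above.

```python
from typing import NamedTuple

class Exon(NamedTuple):
    chrom: str
    start: int   # 0-based start, as in BED files
    end: int     # end position, exclusive
    name: str
    strand: str  # "+" or "-"

def overlap_len(a: Exon, b: Exon) -> int:
    """Number of bases two intervals share (0 if on different chromosomes)."""
    if a.chrom != b.chrom:
        return 0
    return max(0, min(a.end, b.end) - max(a.start, b.start))

def overlapping_opposite(exons, min_overlap=9):
    """Exons sharing at least min_overlap bases with an opposite-strand exon."""
    forward = [e for e in exons if e.strand == "+"]
    reverse = [e for e in exons if e.strand == "-"]
    hits = set()
    for f in forward:          # simple all-pairs check, fine for a toy example
        for r in reverse:
            if overlap_len(f, r) >= min_overlap:
                hits.update((f, r))
    return sorted(hits, key=lambda e: (e.chrom, e.start))

# Made-up toy coordinates, not real chr22 exons.
toy = [
    Exon("chr22", 100, 200, "txA_exon1", "+"),
    Exon("chr22", 190, 260, "txB_exon1", "-"),  # shares 10 bases with txA_exon1
    Exon("chr22", 300, 400, "txC_exon1", "+"),
    Exon("chr22", 395, 450, "txD_exon1", "-"),  # shares only 5 bases, too few
]
hits = overlapping_opposite(toy)
print([e.name for e in hits])

# Sanity check of the numbers read off the history in this run.
forward_count, reverse_count = 8925, 7221
total = forward_count + reverse_count   # should match the input exon count
overlapping = 5 + 8                     # forward hits plus reverse hits
fraction = overlapping / total
print(f"{overlapping} of {total} exons overlap: {fraction:.2%}")
```

Changing `min_overlap` to 15 here corresponds to the "five words" variation mentioned when the workflow was run.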
If we care more, if we want to find out more than just the general answer to the question, if we want to find out, say, what genes overlap, we have more work to do. So right now we just have exons and we have transcript names. We can get more than that. We would have to do some extra work. And if we're new to this type of work, it's going to take a while and it's going to be frustrating because we have to figure out, okay, how do I get meaningful gene names and how do I coalesce these exons down into genes? Because the question I really want answered is how many genes overlap. And once you have the number of genes or once you have how many genes overlap, you can ask questions. Is there anything about these genes that's interesting? Are they mostly in the immune system, for instance? Is there something about genes that overlap that is unique? Maybe there's not. Maybe this is all just random chance. Or maybe there is something, that it has to do with cell division, that it has to do with something. We don't know. We've never run the experiment through. But we have an answer to our first question. If we want to answer the deeper question or the more specific question, it's going to take more work. And we're going to make some mistakes along the way. My point here is that we're spiraling in on the answer. At this point, I'm going to declare victory. We found out it's pretty rare. Less than 1%, a lot less than 1%. If we want to do deeper analysis, we're going to have to do more work. And it's going to be frustrating to begin with. Eventually tedious. But eventually, you're going to get good at that. And you're going to know, okay, I know how to do this because I know where to get the real gene names. And I know how to map exons back to the transcripts they came from. And how to map transcripts back to the genes they're for. And I can do this. It takes a bit to get there. But you're going to get frustrated at the beginning. Get to the point where it's at least just tedious to do. 
And then eventually get to the point where it's just poetry in motion. You just know, okay, I know how to do this because I've done this before. And if I don't know how to do this, I know where I can get other datasets that will help me do this. And it gets easier and easier as you go. You make fewer and fewer mistakes as you go. Yeah, it can be quite a rewarding experience. So with that, I want you to go forth and analyze and enjoy the rest of the Smorgasbord. I encourage you to get help on the Slack. So please join us there if you have questions. Each workshop, I believe, has a different Slack channel. And I'll be on Slack Monday, Tuesday and Friday. And I'm on the west coast of the US. So keep in mind those time zones. I have some meetings on Monday and Tuesday, but I'll be there outside of that. But with these questions on this tutorial, anybody can help. So again, welcome to the Smorgasbord. And I hope this was time well spent. And thanks for sharing your time with me. Cheers.