Hello, everybody. Welcome to the GTN, or Galaxy Training Network, Smörgåsbord. This is our week-long, asynchronous, follow-along-at-your-own-pace virtual training, offered this week in May 2023. My name is Natalie Coucher. I'm part of the Galaxy community, and I'm based at Johns Hopkins University in Baltimore, Maryland, in the US. Today I'm really excited to, first of all, welcome you to the Smörgåsbord and walk you through a tutorial that will introduce you to Galaxy. So thanks, everybody. Let's get started. The goals of this tutorial are to learn the basics of using Galaxy. You'll be able to explore the tool and walk through answering a scientific question without needing to be a biologist or a computer programmer. The tutorial we're doing today is focused on a question in genomics, but there are many other tutorials on the Galaxy Training Network that cover other scientific topics, and they provide the background in the field that you'll need to follow along with the analysis. At the same time, Galaxy lets you answer these questions computationally without having to know how to use the command line or code in a specific programming language. You'll be able to do that by clicking around in the interface, so hopefully that's something you can follow. In this session I'm going to walk through two tutorials. The first is Introduction to Genomics and Galaxy, and I'll walk through a little bit of the background for that. Then we're going to use the Extracting Workflows from Histories tutorial, and I'll explain what all those words mean. But as a preview, we're going to do an analysis where we bring in data.
And then we might want to think about this: now that I've done this cool analysis, how can I do it again and again on other data sets without having to go through all the steps of setting everything up the same way? That way you can pick your data and then run the same analysis. We'll get to that when we arrive at that section. But there are a lot of other great tutorials if you start here and want to learn more or dive into a specific topic. In the Galaxy Training Network, the tutorials are organized into topics. There are really great tutorials in Introduction to Galaxy Analyses, as well as Using Galaxy and Managing your Data. These give you the foundation for how you can use Galaxy for these tasks. Then there are more scientifically oriented tutorials that help you learn how to do a specific analysis in a specific field. And as you do more analysis with Galaxy, bring in more data, and follow more steps, there's more of a need for organizing and tracking your data through the history (I'll talk about what that means when we get to it), and there are a couple of tutorials I'd recommend for that. OK, so with that set, let's talk about some of the background in genomics that you might need for today's tutorial. Every living organism has DNA, which the molecules in our cells use as the instructions for making all of the proteins, enzymes, and other compounds that our bodies need to function, and, for plants, their different structures and parts. All of this information is encoded in our DNA, and our DNA is organized into structures called chromosomes. DNA is made up of four bases: A, C, T and G.
These represent nucleotides. The letters stand for longer chemical names, but to abbreviate them we'll stick to A, C, T and G. The genome, all of our DNA, is organized into long strands. Because these strands are so long, they're organized within our cells into chromosomes. Humans have 23 pairs of chromosomes, and each chromosome has two strands of DNA. What that means is we have one strand that reads A, C, T, G in different orders, for many, many bases, or nucleotides, as I was just calling them. And that strand is also complemented: every A matches up with a T, and every C matches up with a G, so there are two strands bonded together. You can imagine the forward strand like reading a book, with sentences going forward from left to right, as we read in the U.S. But the DNA is also complemented. Imagine every letter of the alphabet had a different letter that it corresponds to. It would be a pretty weird book to read, but our DNA is matched like that: every time you have an A, it matches up with a T, and every time you have a C, it matches up with a G. This way we get a forward strand and a reverse strand, and these two are bonded together, which is represented by this ladder-looking structure here. So, where are we again? We have our DNA, it's organized into chromosomes, and each chromosome has two strands: the forward strand and the reverse strand. But what kind of information is encoded on these strands? Well, we have genes. Our DNA is where our genes are.
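The base-pairing rule described above (A with T, C with G) can be sketched in a few lines of Python. This is only an illustration, not part of the tutorial or of Galaxy; the function names `complement` and `reverse_complement` and the example sequence are assumptions made up for this sketch.

```python
# Pairing rule from the talk: A pairs with T, and C pairs with G.
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def complement(strand: str) -> str:
    """Return the base that pairs with each position of the strand."""
    return "".join(COMPLEMENT[base] for base in strand)


def reverse_complement(strand: str) -> str:
    """Return the paired strand as it reads in its own direction."""
    return complement(strand)[::-1]


forward = "ACTGACTG"  # hypothetical example sequence
print(complement(forward))          # TGACTGAC
print(reverse_complement(forward))  # CAGTCAGT
```

Reading the paired strand in its own direction means reversing it, which is why a run of bases on the reverse strand can spell out something entirely different from the forward strand above it.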
And we have genes that exist both on the forward strand, as we see here represented on the top, and on the reverse strand. In the book example it might be a little funky, but you might get a scenario where a run of letters has a complementary, reversed version that actually makes its own word, or, in our case, a gene. That's a really cool observation that has been made. But one question is: are there genes on the forward and reverse strands that overlap? It might be pretty crazy to imagine in the English-language example, but in our DNA, the way it's organized chemically, are there genes on opposite strands in sections that overlap? That question is what we're going to investigate in our tutorial today. To do that, we're going to need a couple of things. We're going to use Galaxy as our platform to run this analysis. We're going to need some data, so in this example we'll use human genomics data. And we're going to need some tools to do the analysis. The human genome, across all of those chromosomes, is about 3 billion bases long. So that's something we could try to do by hand, but it would be greatly facilitated by an analysis platform. Right. Now that we have our motivating question, let's get into Galaxy. I'm going to open up my tab and use usegalaxy.org, since that's a public server based in the US. There are a couple of other main servers that look very similar: usegalaxy.eu, for those based in Europe, and usegalaxy.au, which is a public instance based in Australia. I'd suggest using one of those three main instances, and the one that's closest to you. So what we have here is the Galaxy interface.
So now we're in the Galaxy platform, and there's a bit going on here, so I'm going to walk through the different sections. Up top here, in the masthead, we have a few different options: a way to see our workflows, visualize our data, find shared data, get help, and log in, and then a couple of icons. Here you can explore the release notes of the latest Galaxy release; these happen about three times a year. You can access the Galaxy training material, and once we get into the tutorial I'll highlight how this can be really useful. And then there's this window manager. Say, for example, you're trying to look at a couple of different files at a time and want to see them in the same view: that's what this button lets you do. OK, now here on the left side we have the tool panel. There are a few things we can do here. You can search for a tool, if you know there's a specific tool you want to run on your data. You can upload data, so you can bring data into Galaxy. With Send Data, you can push data out of Galaxy. And there are a lot of different types of tools here: text tools, genomic file manipulation, common genomics tools, genomics analysis tools. So there's a lot to explore. Each of these is a header that you can click to expand and see what tools are inside. Some of the names are really informative, so you may be able to tell which tool you're looking for from that, but there's also a help section within each tool that gives you a better understanding of how to use it. We'll look at that in the tutorial as well. And then I'm going to go to the far right side.
So here we have our history, which I've hinted at already in the introduction. The history is where all of the operations, all of the jobs and tools used at every step of the analysis, are tracked, each as an item. Our first step in the tutorial is going to be to upload data. Once that data is in here, there's going to be a block similar to this blue one, showing us the item with our data. And as we do each next step, another item will go on top of that item, so they build up. That'll become a lot clearer once we actually start working in the history. Then this final section, the middle panel: when you log in, you'll see some announcements that are relevant to the Galaxy community, which is really handy to check for recent news. This is also the place where, when we pick a tool to run on our data, we set parameters: specifications like what data you want the tool to run on, and sometimes what files you want as output. Do you want just the summary? Do you want all the intermediate files? Those are options you'll be able to set in the center panel. And this is also the place where we'll explore our data. So without further ado, let me show you how the magic happens. First, I'm going to log in. You can use Galaxy logged out, but there are more limitations: you won't be able to save your work after you leave, and you won't be able to have multiple histories. I usually have one history per analysis or per tutorial, so if I'm doing multiple analyses, I couldn't keep track of that without logging in. So I'm going to log in. If you don't have an account, you can also create one this way. I'll sign in with my email and my password. Let's see if I can get that right.
Sometimes. There we go. I think I got a little excited there. All right, so now we're logged in. The tutorial we're going through today, which I have in this tab, is Introduction to Genomics and Galaxy. One way to follow this tutorial, either by yourself or working along with me, is to flip back and forth between these two tabs, the GTN training and Galaxy. But another way that's really handy is to flip over to the Galaxy tab and, in the masthead, click on this graduation cap, Galaxy Training Materials. If we click on that, it already remembers the tutorial I was on. If I scroll back up to the top, this is the screen you'll typically see, the main landing page for Galaxy training. Today I'm giving an introductory tutorial, so that's in this topic, Introduction to Galaxy Analyses, and the tutorial is Introduction to Genomics and Galaxy. I'll click on that here. Now the same page I was looking at in the other tab is shown here, and it'll remember the place I was at in the tutorial. If we scroll through, we've covered the introduction and the motivating question. Here are some more definitions, if you'd like to review those again, and a little bit of the background that we covered. OK, so now we're at the next step, called Get human data. I can explore here and read a little about the background of what we're doing. We've talked about the main panels, and now we're at the step to get data into Galaxy. We'll use the Get Data toolbox in the tool panel on the left. This screenshot looks just like what we have on the left here. So we're going to click on the Get Data toolbox to expand it. I can click outside of the tutorial, and then click on Get Data to expand it.
And wow, there are a lot of options. So let's go back to the tutorial to see exactly what we're looking for. And it remembered exactly where I was. OK. The Get Data toolbox contains a list of data sources that this Galaxy instance can get data from directly. Upload File is quite useful for getting data from your computer or from the web, and there's another tutorial where you can learn how to do that, but today we're going to use the UCSC Main table browser. Here I could even click on this tool in the tutorial and it'll take me right there, or I can click on that tool in the left side panel, and that gets me to the same place. It brings us into the UCSC Table Browser to find the data we want to bring in for our tutorial. I know the steps we're going to follow, but if you're following this or other tutorials, you can keep the tutorial pulled up as well. First, we're going to use the UCSC Table Browser to find our data set. For this example we'll use human data. The clade, Mammal, we'll keep the same. For the sake of exploration, it looks like UCSC also has vertebrate, deuterostome, insect, nematode, and viral data sets. So if human isn't your cup of tea, if you're interested in studying maybe other mammals, there are a lot of other data sets available here. But we'll stick with human. Next is this piece called the assembly. Here the browser has already preselected some things for me. I see there's this assembly from December 2013 called GRCh38. Let's click on this to see what else is here. It looks like there are a few other options we can select.
So, there are past versions, earlier than 2013, and then, very excitingly, most recently there's an updated assembly, an updated version of the full human genome, that's been created and released. It's really cool scientifically, because there are some parts of the genome that were really hard for the existing sequencing technologies to elucidate and put in the correct order. With newer technologies that support reading longer and longer fragments of DNA, we're now able to create a more complete assembly. T2T here stands for telomere to telomere, the telomeres being the sections at the ends of each of our chromosomes. So there's a telomere-to-telomere assembly of the human genome, which is really exciting to use, and I suggest running through this tutorial with it, maybe after completing this version. So let's stick with the December 2013 GRCh38. Then, for what we're looking for, we have some options here. Genes and Gene Predictions is selected. Thinking back to our question, we want to identify areas of the chromosome where genes overlap, so I think this group, Genes and Gene Predictions, is going to help us understand which genes are on the forward strand and which are on the reverse strand. I'll keep that. Then we'll use this track, GENCODE V43, which gives us the annotations, the information about where each gene starts and ends. Table: knownGene. I think that works for us. So far I've done a bit of explaining but haven't really changed anything that's selected. Now we define the region of interest. For the purpose of the tutorial, we could look across the whole genome, all three billion bases.
But for this purpose we're going to want to look at a smaller section. This can be our hypothesis: we test it on one chromosome, and then go beyond that to a larger data set if we find something confusing, or something really exciting, and want to see whether it holds across the whole genome. For this example we're going to use chromosome 22, which is denoted chr22. So I'm going to change the position field so that it only says chr22. All right, that looks good. Then there's a subsetting section, which is optional, and then Retrieve and display data. The output format is going to be this file format called BED, or browser extensible data. That works for us. And it's set to send output to Galaxy: since we came from Galaxy, it knows we want the data to go back into Galaxy to analyze. I think that all looks good. So again, on this page, the only thing we've changed is the position, to chr22. OK, let's click on Get output. Now that we're here, there's one more step. There's an option to include a custom track header; we're going to skip that for now. Then, create one BED record per Whole Gene, so one record, one row, per whole gene. I think that works for us, so let's stick with Whole Gene and now send the query to Galaxy. OK, the job has been successfully added to the queue, and it sent us right back to our Galaxy platform. This is starting to look a bit more familiar. The difference now is that we have this item in our history, named for the tool we used: UCSC Main on Human, knownGene, chr22, colon, and then base 1 through base 50 million or so in that chromosome. So that's going to be a lot of bases, and I'm guessing there are going to be a lot of genes in there. So now we have this data. The box started out gray, which means the job is in the queue, waiting to run.
And now that it's yellow, it's been submitted and the job is running. After a moment, once the data is loaded, the box is going to turn green, and that means the job was successful and the data was brought in successfully. Sometimes you'll have a job that runs and turns green but didn't do exactly what you expected, so it's good to ask questions of your data: did the tool work the way I expected on my data? And sometimes it will turn red. That means the data did not load successfully, or the tool did not run successfully; there was some error. A lot of the time I'd recommend rerunning the job, and I'll show you how to do that a little later. It might also mean that you need to look into the issue a bit more to see what's going on. OK, so now the box has turned green: my data has been successfully uploaded. What I can do in the history is click on this item to get a preview of my data set. There's also some metadata, some data about my data, shown in this preview. Here we see we have 5,353 regions. This number might be a little different from the tutorial because an updated version of GENCODE is being used, but it's still around the same amount, a little higher, I think. Then we have some information: the format of this file is BED, and the database is hg38. So the assembly we selected on the last page is that human genome version 38. And we get a little preview of our data set here. It looks like a table, and there are a number of columns, some of which have names. There's this one column, Chrom, and because the value is chr22, that looks like chromosome, so chromosome 22. That looks right.
Now we have a start and an end. These are the base numbers, the positions where the gene in this row starts and ends on chromosome 22. This one starts at position 10,736,170 and ends at 10,736,283. So 170 to 283: it's a little over 100 bases long. That's the size of our first gene. There's some other information about the gene encoded in this row, and then possibly a quality score. And this next column, column six, gives us the strand: which strand is this gene on? The first gene we have is on the minus strand, which is just notation for the reverse strand. One row on the reverse, another on the reverse, one on the forward, another on the forward, another on the reverse. So this is going to be a big file full of all the genes on chromosome 22, and there are going to be over 5,000 of them, actually. We got a bit of information from this preview, but if we want a deeper look at the data, we can click on this eyeball, poke it in the eyeball. Now we see a fuller view of the table we were just looking at, here in the main panel. We have that same information: chromosome 22. I'd expect that if I scrolled through this entire file, all the rows would say chromosome 22; you can check that yourself if you'd like. We have the start position and end position again, and then the strand information, which is going to be really helpful to us. OK, so now we've brought in our data. The name, UCSC Main, isn't super informative for me to look at, so I'm going to make the name of my data set a bit more informative. To do that, I click on this pencil, and it takes me to a form where I can update the name of my data set. I'm going to type in, let's call this, Genes on chromosome 22. So here I'm able to update the name.
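To make the column layout described above concrete, here is a small Python sketch that reads one tab-delimited BED row and computes the gene's length. The start, end, and strand are the ones read out for the first row; the gene name `example_gene` and the score `0` are placeholders, since the talk does not read those values out, and `parse_bed_line` is a name made up for this sketch.

```python
from typing import NamedTuple


class BedRecord(NamedTuple):
    """The first six columns of a BED row, in order."""
    chrom: str
    start: int
    end: int
    name: str
    score: str
    strand: str


def parse_bed_line(line: str) -> BedRecord:
    """Split a tab-delimited BED row and convert the positions to integers."""
    chrom, start, end, name, score, strand = line.rstrip("\n").split("\t")[:6]
    return BedRecord(chrom, int(start), int(end), name, score, strand)


# Name and score are placeholders; positions and strand come from the first row.
row = "chr22\t10736170\t10736283\texample_gene\t0\t-"
record = parse_bed_line(row)

print(record.end - record.start)  # 113
print(record.strand)              # -
```

The length works out to 113 bases, which matches the "a little over 100 bases" estimate, and the `-` in column six marks the reverse strand.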
There's also other information I can update here. This all looks good to me, so I'm going to click Save, and now I see that it's updated here. That looks great. Now, thinking again about our question: we want to see which genes on the forward strand overlap with genes on the reverse strand. So what I might do next is split this file, separating the genes on the forward strand from the genes on the reverse strand. There are a few different ways we can do this, so this is our opportunity to look for a tool to operate on our data set, on this table. I'm going to look in the tool panel on the left-hand side. The tutorial already tells us exactly which tool to use, but maybe you're doing an analysis and you don't know exactly which tool makes sense, so let's look through that a little. We're thinking of splitting our table, right, to split out the genes on the reverse and the genes on the forward. So maybe something like Filter makes sense, where I can filter the reverse-strand genes into one file and the forward into another. Let's expand this and see if anything here sounds useful. Extract features from GFF data: not familiar with that. Filter data on any column using simple expressions: that sounds about right, so let's click on this tool and see what it shows us. We have the name of the tool here, and a section on the tool parameters. Filter: this is, I guess, where we select the data set we want to filter. OK, that sounds fair. With following condition: c1=='chr22'. Number of header lines to skip: zero. And then there may be some additional options. So let's look under the help section to see if we can learn a bit about how this tool operates.
Maybe let's look under the Syntax section. The filter tool allows you to restrict the data set using simple conditional statements. Columns are referenced with c and a number; for example, c1 refers to the first column of a tab-delimited file. We have what looks like a tab-delimited file, so we're going to want to filter on column six, for a plus and then for a minus. Make sure that multi-character operators contain no white space. OK. When using the equal-to operator, a double equal sign must be used. OK. Non-numerical values must be in single or double quotes. So if it's a number it doesn't have to be in quotes, but if it's not a number it should be. And filtering conditions can include logical operators, so it's getting a little more into computer science, where you can build more complex expressions. Let's see an example: c1=='chr1' selects lines in which the first column is chr1. So it sounds like, with a little modification, we can say we're looking at column six and check whether that equals plus or minus. Let's give that a try. It looks like there are some more complicated examples further on, but for our case let's see if we can use this. So, With following condition: c1. Let's change this to c6, and instead of equal to chr22, let's filter first for the genes on the positive strand, so I'm going to change that to a plus sign. Number of header lines to skip: we'll keep that at zero. OK, I think we're ready to run the tool. Now we have this nice big green check mark, and on the right, in my history, I see we have a job that's queued. If I click on that, now it says the job is currently running. Great. I'm going to close that up again. It's telling us the tool uses this input, our Genes on chromosome 22, and produces this output, which is Filter on data 1. Great.
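For anyone curious what a condition like c6=='+' amounts to, here is a minimal Python sketch of the same idea: keep the rows whose sixth tab-separated column equals a given value. It is not how Galaxy's Filter tool is implemented; the helper names `keep_row` and `filter_rows` and the three example rows are invented for illustration.

```python
from typing import List


def keep_row(line: str, column: int, value: str) -> bool:
    """True when the given 1-based column of a tab-delimited row equals value."""
    columns = line.rstrip("\n").split("\t")
    return columns[column - 1] == value


def filter_rows(lines: List[str], column: int, value: str) -> List[str]:
    """Keep only the rows that satisfy the column == value condition."""
    return [line for line in lines if keep_row(line, column, value)]


# Made-up rows in the same six-column layout as the BED preview.
rows = [
    "chr22\t100\t200\tgene_a\t0\t-",
    "chr22\t300\t400\tgene_b\t0\t+",
    "chr22\t500\t600\tgene_c\t0\t+",
]

forward = filter_rows(rows, 6, "+")  # like the condition c6=='+'
reverse = filter_rows(rows, 6, "-")  # like the condition c6=='-'
print(len(forward), len(reverse))    # 2 1
```

The column number is 1-based here to match the c1, c2, ... naming in the tool's help text.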
So, if we're splitting these genes out, we might expect that roughly half are on the forward strand and roughly half on the reverse strand, so let's keep that in mind as we look at this data set. Moment of truth: I'm going to click on that data. Let's see, maybe it needs another moment, or maybe we can refresh. I'm not able to see a preview here for some reason, so let's poke the data in the eye and take a look that way. Here we can do another sanity check to see whether this tool actually ran the way we expected. In the preview of our first data set, the strand column had minuses and pluses, but here I'm just seeing pluses all the way down. You could go and manually check, but so far everything we have here is a plus, so I think that was a successful run of the Filter tool to get all the genes on the forward strand. To make sure I remember what we did and what data set this is, I'm going to change the name. Again, click on the pencil, and then from Filter on data 1, I'm going to call this Forward genes from chromosome 22. Now I'll scroll down and save. OK, the preview did end up popping up now, so we can see that we have 2,831 regions, from the 5,353 regions we started with. So we do have fewer; it looks like a subset of our original data set. In this box we have some information on the job we just ran: filtering with column six equal to plus, for the forward strand. And we get a nice summary that it kept about 52 to 53% of the 5,000-some lines in our original data set. That's really exciting; it looks like it was successful. Now let's do this operation again, but looking for the genes on the reverse strand in our original data set.
So this is a really handy feature that I was referencing earlier: in a situation where a job doesn't run successfully and you end up with a red box, a handy tip is sometimes just to try it again. In this situation, though, we want to set up the tool with everything the same except that, instead of the plus, we want a minus to get the reverse strand. To do that, we click on this curved arrow; the tooltip that comes up is Run job again. Instead of going ahead and running right away, it lets us set the parameters again. Here I still want to use the full list of genes from chromosome 22, but now I'm going to change the condition so that column six is equal-equal to minus, for the reverse strand. Now I'll click Run Tool. Great, that was also a success. Thinking through what this should be doing, I'd expect to get everything that wasn't in the first subset. Of the 5,300 or so we started with, we got 2,800 here, about 52-and-some percent, so I expect that once this runs we'll get the remaining 47 or so percent. Once the job finishes, we can double-check that and know we've captured all of the genes on chromosome 22. It needs another moment or so to run. Great, our next job ran. Here we have Filter on data 1. Let's open that up. We have 2,500 or so regions, and there it is: the remaining 47% is what we have in this file. Super. Again, I'm going to rename this. I'll click on the pencil and call this Reverse genes from chromosome 22, and I'll save that. Now that that's updated, I'll click on this item again to collapse it, and we have a fuller view of our history. So now we have the forward strand genes and the reverse strand genes. Next, we want to look at where they overlap.
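The sanity check on the two filtered files is simple arithmetic, sketched below in Python. The totals 5,353 and 2,831 are the region counts shown in the history; the reverse count is derived here under the assumption that every row carries either a plus or a minus in column six, so treat it as an expectation rather than a reported result.

```python
total_regions = 5353    # regions in Genes on chromosome 22
forward_regions = 2831  # regions kept by the c6=='+' filter

# Assumption: every row is either '+' or '-', so the rest are reverse-strand genes.
expected_reverse = total_regions - forward_regions

forward_percent = round(100 * forward_regions / total_regions, 1)
reverse_percent = round(100 * expected_reverse / total_regions, 1)

print(expected_reverse)  # 2522
print(forward_percent)   # 52.9
print(reverse_percent)   # 47.1
```

If the two filtered files did not add back up to the original count, that would be a hint that some rows have an unexpected value in the strand column.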
So one thing we can do is look back at the tutorial. What did we do? Get human data: we've got the data. What's the plan for answering the question? Split the genes into forward and reverse data sets: done. Next, we're going to check for overlaps. What we're reading here is that genes are an example of a genomic interval, something that spans part of the chromosome. We want to see whether these intervals, these sections of the chromosome, have any overlap, or intersection, which is another word for the places where they overlap. Going back to our tools: in the Operate on Genomic Intervals toolbox, Join and Intersect have the most promise. Let's try Intersect. I'm going to click back into Galaxy and close this category. Ah, all right: under Common Genomics Tools, Operate on Genomic Intervals. Let's find that Intersect tool: Intersect the intervals of two datasets. Super. Let's take a look at this. Again we have our section for tool parameters. Let's look at the help to make sure we're doing the operation we expect with this tool. It shows some examples of how we can use it: we can set the parameter to return overlapping intervals or overlapping pieces of intervals. We'll have our first data set, here the forward genes, and the second data set, the reverse genes. What Overlapping Intervals returns is the whole gene from the forward list wherever there's any overlap with the second data set. Under the second selection, Overlapping pieces of Intervals, we would only get the segment where the overlap occurs, so we're going to lose some information in the first gene.
Or we're going to lose the information across that whole gene. What we would want to do here is select the overlapping intervals option, so here we can maybe run this two times. So first, the first data set will be the forward genes where they intersect with the reverse genes, so we'll get a list of all the forward genes. And then maybe we can flip it so that the first data set turns into our reverse genes, so that we'll get that whole reverse gene's information. Okay, so I'm going to scroll up to the top again. We're going to want to select the overlapping intervals of. Let's start with the forward genes that intersect the reverse genes. So those forward genes are going to be our first data set, the reverse genes will be our second data set. And we want to include in the results any genes that overlap for at least one base pair. Sure, that looks good. So now we can run this tool. So here we're going to get that intersection between the forward and the reverse, and what it spits out is going to be the whole gene wherever any part of it overlaps. So, we have got that tool started. We have that here in our history now. So this is going to take a moment for the job to load and to be submitted. Once we complete that step, we're going to want to do the same thing, this time with the reverse genes, so that we can collect all that information about the reverse genes. Okay, so now that that tool has run, I'm going to click on the preview. Maybe I clicked on it a little too early, so we won't get the preview, so I'm going to click on the eye icon and see what output we have. So here we have a list of genes. All of these look to be on the forward strand, with that plus sign in column six. And I would expect the number of regions, or the number of rows in this table, to be fewer than our forward genes list, because I don't know how common it is that the forward overlap with the reverse, but I'm going to guess that maybe it's not that common.
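The difference between the two options, overlapping intervals and overlapping pieces of intervals, may be easier to see in a small sketch. This is an illustration under simplifying assumptions: intervals are plain (start, end) pairs on a single chromosome with half-open coordinates, and the function names are hypothetical, not the Galaxy tool's internals.

```python
def overlap_length(a, b):
    """Number of bases shared by two half-open (start, end) intervals."""
    return max(0, min(a[1], b[1]) - max(a[0], b[0]))


def overlapping_intervals(first, second, min_overlap=1):
    """Whole intervals from `first` that overlap anything in `second`."""
    return [
        a for a in first
        if any(overlap_length(a, b) >= min_overlap for b in second)
    ]


def overlapping_pieces(first, second, min_overlap=1):
    """Only the shared segments, so the rest of each interval is lost."""
    pieces = []
    for a in first:
        for b in second:
            if overlap_length(a, b) >= min_overlap:
                pieces.append((max(a[0], b[0]), min(a[1], b[1])))
    return pieces


# Made-up coordinates for illustration.
forward = [(100, 500), (1200, 1500)]
reverse = [(400, 900), (2000, 2600)]

print(overlapping_intervals(forward, reverse))  # whole forward genes
print(overlapping_intervals(reverse, forward))  # whole reverse genes (flipped run)
print(overlapping_pieces(forward, reverse))     # just the shared segment
```

Running the whole-interval version twice with the inputs swapped corresponds to the two runs planned in the session: one returns complete forward genes, the other complete reverse genes.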
So back to this data set. I'm going to name it overlapping forward genes, and I'm going to save that here. So that updated and showed us the preview. Yeah, so now, as opposed to our original where we had, I think it was like 2800 regions, we have over 1000 regions, so that's still a bit higher than I expected. So I guess that's maybe a little under half of those genes on the forward strand that overlap with the reverse. So let's run that again, this time getting the information on the reverse genes. And we can do the shortcut to run the job again, where in this situation we can swap these two data sets, so I'm going to make this the reverse genes. And I'm going to make this the forward genes. And now I'm going to click run. So now this job has been submitted. Here it is, it's being scheduled. I can close that up. And so now we're going to be able to analyze where that intersection is happening. And this list that we're going to get is the overlapping reverse genes. So now we have the overlapping reverse genes from this run, so I'm going to click on this item, and we have again about half, so 937 of that original 1,300, I think, from the reverse genes that we had. So let's again change the name of this data set. And we can call this the overlapping reverse genes, so we don't lose track. Okay, so now, once that name updates, we have, I think, a lot of the information that we need to be able to answer our question. So we were able to bring in the data from a human chromosome. And that data was the genes that are on that human chromosome 22. So we were able to pull out which genes are on the forward strand and pull out which genes are on the reverse strand. And then we were able to do an analysis where we found out where that forward strand overlaps with the reverse strand, and how many genes from the forward strand overlap and how many genes from the reverse strand overlap.
And we got a bit of a big answer, so it's a pretty high number: there are 1,122 on the forward strand, and 937 on the reverse strand. So we might want to now combine these data sets together so that, you know, if you're using this analysis in a publication, we can bring those data sets together. So I'm going to use a tool in the toolbox, maybe under join. So compare two data sets, group data by column, join two data sets side by side. I want to kind of join them top to bottom, so subtract, that doesn't look right. Okay, maybe under text manipulation. So we've got add column, change case, concatenate. Okay, so we can concatenate the data sets tail to head. So, just paste them together into a new file. So let's click on this tool. Okay, let's see if this gives us a handy explanation of what this tool is going to do. What it does: concatenates data sets. So in this example, we're concatenating this data set that has a couple of regions from chromosome X with data set one, with genes on chromosome 1, and data set two, with some genes on chromosome 2, and this tool is going to result in the following, so we have chromosome X, 1, and 2. So that looks right. We're going to want to combine those data sets together, so let's go ahead and use this tool to do that. So we're going to want to concatenate the data sets. So let's pick the forward regions, which will be on top for no particular reason, I guess, and then under insert data set we'll want to combine that with the overlapping reverse genes. Okay, now let's run that tool. So we're going to combine our data sets together. And once it runs we'll probably also want to rename the data set so that it's a little bit more informative. So now what we have is our list of these overlapping genes.
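Concatenating tail to head just means stacking one data set's rows underneath the other's. A minimal sketch, with hypothetical one-row data sets standing in for the two overlapping gene lists:

```python
def concatenate(*datasets):
    """Stack the rows of each data set, in the order given, into one list."""
    combined = []
    for dataset in datasets:
        combined.extend(dataset)
    return combined


# Hypothetical stand-ins for the two lists; real rows come from the history.
overlapping_forward = ["chr22\t100\t500\tgeneA\t0\t+"]
overlapping_reverse = ["chr22\t400\t900\tgeneB\t0\t-"]

overlapping_genes = concatenate(overlapping_forward, overlapping_reverse)

# The combined row count is always the sum of the inputs.
assert len(overlapping_genes) == len(overlapping_forward) + len(overlapping_reverse)
print(len(overlapping_genes))
```

By the same arithmetic, concatenating the 1,122 forward and 937 reverse regions reported in the session should give a combined data set of 2,059 regions, which is a quick way to confirm nothing was dropped.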
The next step that we can think about is actually taking a look at the data, which is a really important tool in our tool belt for genomics and for data analysis in general. We've been doing a little bit of that by looking at these previews, but we can take another look with some tools that have been developed to show what this data looks like on the genome. So before we do that, I'm going to rename this data set. And I'm going to call this overlapping genes. And then I'm going to save this. And now close these overlapping genes. So here, I am going to use one of these icons. This one that looks a little bit like a bar graph is the visualize option. So here I'm going to click on this button. And now I see a bunch of different options that I have for visualization. So there are maybe different tables that I can create. But I am looking for, okay, so let's scroll down or scroll up, I'm looking for the UCSC genome browser. So this is a tool developed by the folks at UCSC, where we brought in our data from, actually. And this is a tool that's going to let us look visually at what the genome looks like. So I'm going to click on main. And this is going to open another tab with that genome browser. So now it's looking like our genome browser is loaded. There's a lot of information here, so maybe we can start from the top and work our way down. So we can see here that we're looking at the genome browser on human with the specific reference that we selected. So that means that what we're looking at is being compared to this version of the genome that we have a lot of information about. So whatever we bring in is going to be compared to this reference, and we're going to be able to compare what we have to what that reference is. So then we have a couple of different options for how we can explore.
So here the region that we're looking at is pretty big within chromosome 22. This red box is showing what is being highlighted in this section. And here is the user track. So this is what we have supplied. This is the data that we brought in from our Galaxy instance. And here below is all that information from GENCODE about what genes and what transcripts are included in these areas on chromosome 22, so we can see that there's some overlap with some of these larger black boxes and sections. So we're looking across a really big section. Here we're looking from the 16 millionth base to the 19.5 millionth base, so this is a really huge span, and there are a lot of nucleotides that make up this whole sequence. So if we scroll down we can get a better picture of some of the data that's provided, so here we're seeing all of these different versions of genes and transcripts along this section of the chromosome. There are some variants and phenotypes that are observed, so the way that the gene is expressed and what the result is, so those main variants where maybe there's a base or a couple of sections that are different, and that results in a different expression of the gene. And here we get some information from GTEx on the tissues where many of these genes are most expressed. So GTEx is the Genotype-Tissue Expression project, where they've collected lots of different tissues across human samples. And they're able to measure how much each of these genes is expressed, and then you can look across all the tissues that they collected to see what kinds of tissues, what parts of the body, these genes are most active in, where they're the most produced, so there's a lot of really cool information that we can understand from that as well. And then here below we can select a lot of other custom tracks. So if you're interested in learning a lot more about specific types of expression, there are a lot of tracks that you can activate here.
So this looks like a really good resource, if they have this for your organism. So I'm going to scroll all the way back up to the top, and we'll zoom in on a specific gene. So the gene that we're going to want to look at is called DGCR2. And this is the gene that is looked at in the tutorial, but you can also scroll around and kind of play around with UCSC yourself, it's a really cool resource. There are tutorials that teach you how to use it, but for today's tutorial we're just going to zoom in on DGCR2. And so UCSC tells me this is a gene for DiGeorge syndrome. So I'm going to click on here. So here we've got some really detailed information, a little bit more zoomed in than what we saw before. So here we see DGCR2. It looks like there are three versions of this gene. It looks like there are some thin boxes here and some thicker boxes here, so maybe let's zoom out a little bit, so we can see it in context. So here are a couple of different things around it, some other genes perhaps, maybe a couple of other transcripts. And so the information that we get from these arrows on these lines is that this gene is on the reverse strand, because the arrows are pointing left. And on this transcript here the arrows are pointing to the right, so that's the forward strand, so this is probably one of the results that we got in that result table about where the genes overlap. I want to take a look here because I want to see whether this box overlaps with this one here. So to zoom in I can select a specific section: up top here I'm going to click and drag across the area that I want to zoom in on. And then there's some more information here about other ways that we can interact with the genome browser in terms of moving it around. I'm just going to click zoom in to look closer at that section. And we get a lot more refined information here.
I'm still seeing that the user track looks a little bit different than this track that's provided from GENCODE, so one way that we can adjust that is, I'm going to right click here to the left of this GENCODE section, and I see that the option that's selected here for the way that it's displayed is pack. And I really like that. So I'm going to see if the user track looks like that, and see if we can change that. Yeah, so here the representation looks like it's dense, so to make that match up I'm going to click on pack here. And that's something you can play around with as well. So it looks like there are a couple of different regions here where we see these genes are overlapping. But I think the important piece for us to look at here is that these boxes actually represent the exons. So for each gene, once a copy of it is made so that it can go and be produced into a protein or another type of molecule, the whole entire gene isn't always used. A lot of the time the genes get cut up into these different sections. Some parts are removed and they're rearranged. And that's why we often see multiple versions of genes here, because they have different expressions as they're transcribed and translated, which are the biological terms for how genes are processed and converted into these products. So one of the ways that it's processed is that chunks are removed, and the chunks that end up getting kept are called exons. So here the exons are represented. And here we can see in that tooltip below that this is exon one of two in this transcript. So this part is kept. So then I'm seeing here that these two exons are not overlapping, which means that ultimately, in this part in between where there is this overlap, there isn't anything that's necessarily going from the gene to a product that is expressed or created out of that gene.
So I think this leads us into maybe a little bit of a different question that we want to ask, now that we did this analysis and saw this result. We learned that there are genes that overlap, but now maybe we can get more specific and ask, are there exons that overlap? Are there pieces of the genes that are coding for a product that overlap? And what does that look like? So, first, we were looking at this question, do the genes overlap, and we found that they do, that happens pretty often. We might want to see, do the exons overlap, so do these sections that create actual molecules overlap? So what we just saw was something like this, where we have exons that maybe are lined up but there's a gap between them, where there's that part of the gene that gets discarded and not expressed. So we would want to maybe redo our analysis to look at whether or not the exons overlap, so these sections that are coding. So what does this mean for us? Do we have to start from scratch? Do we have to, you know, walk through every single step again, and this hour of work that we've done? Thankfully, no. And we can do that in Galaxy using what are called workflows. So, we did all of these steps here, where we brought in the genes, we extracted which were on the forward and on the reverse strand, we checked which of those were overlapping, and ultimately got this list of the overlapping genes on both the forward and the reverse strands. And that was really informative to us, right, we learned that this overlap happens a lot. And so how can we do this again, looking at the exons? Like I said, the way that we can do that is using workflows. And because we did all this work and set up all these steps that we want to run, the next thing that we'll want to do, and what is really handy in Galaxy, is that we can make a workflow out of this history.
So the way that I'm going to do that is up at the right in this history panel. We could either create a new history, switch between some of the histories that we have, or we can look at the history options, so let's click on that. So, let's see, delete, export, export history to file, extract workflow. Let's do that. So we're taken to this page where I see everything that was in my history. And now I get the option to see what tool I ran and the history items created. So since we were working from a pretty well developed tutorial and we knew the workflow that we were running through, all the tools did what we wanted. We want to keep all the steps that we did, and keep them in our workflow. When I'm doing an analysis for the first time, a lot of times there are steps where, oh, that tool didn't give me the result that I really expected, or I kind of went down a direction that didn't end up leading me to my answer. So we could then uncheck the boxes for steps that didn't lead us to the result that we wanted, so that when we run this workflow, we can just tell it, start with this data set and get me to this output, without having to do any extra analysis. So I'm going to rename my workflow. And I'm going to call this overlapping. Well, I could call it genes. And this could be pretty specific to genes, but all the steps that we did are pretty generalizable. So we could run this on a data set of exons, and then we'd also get the result that we're looking for, so a more general word that we can use that would include things like genes and exons is features. And something about this workflow too is that it runs on opposite strands. So that's going to help me when I look at this workflow in the future to remember, okay, this is looking for features on opposite strands, I'm not just looking at one strand. Okay, so now I'm going to create the workflow.
So it says the workflow "overlapping features on opposite strands" has been created from the current history, and you can edit or run the workflow. Before we run it again, I'm going to click on edit. And this is going to help us see the structure of our workflow. Yeah, a kind of compact view first. So you can drag and expand and move things around to make it a little bit prettier, a little bit easier to see which data is going where. Let's pull this out a little bit. Awesome. So now I can follow the workflow here in a visual way, so the genes go through these filters. These filters were filtering on the forward strand and the reverse strand, and then we saw the intersect in both of those directions that led us to our final data set. And so because I want to keep this workflow and use it again and again, it's going to help me to label each of these steps, so that when I'm running it the next time, the results will be a little bit easier for me to follow, rather than having to investigate every data set. So to do that, I'm going to click on this box for my first step with the genes, and I'm going to call this instead features on overlapping strands. So my input can be genes, exons, introns if we're interested. So we can make that selection. So this filter step was looking for features on the forward strand. So the next one was looking at the features on the reverse strand. So now we had the intersect of the forward and reverse, those we can keep the same, and then concatenate data sets, so this is going to be overlapping features on both strands. So now we have our workflow. It's nicely labeled. We could also shrink each of these panels so that we can get a better view of the workflow, or zoom out. So, you know, if you have some monster workflow that's working on tons of inputs and lots of steps and everything's going everywhere, you can get a really nice overview of what's going on. Also, you could continue building out your workflow: you can add additional steps right in this editor.
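One way to think about what the extracted workflow captures is as a single reusable function: the same filter, intersect, and concatenate steps, with the input left open so it can be genes, exons, or any other features. The sketch below is only an analogy in Python, with invented coordinates and hypothetical names; a Galaxy workflow is a saved graph of tool steps, not a script like this.

```python
def overlap_length(a, b):
    """Bases shared by two features (chrom, start, end, name, score, strand)."""
    if a[0] != b[0]:  # features on different chromosomes never overlap
        return 0
    return max(0, min(a[2], b[2]) - max(a[1], b[1]))


def overlapping_features_on_opposite_strands(features, min_overlap=1):
    """Filter by strand, intersect in both directions, then concatenate."""
    forward = [f for f in features if f[5] == "+"]
    reverse = [f for f in features if f[5] == "-"]
    forward_hits = [
        f for f in forward
        if any(overlap_length(f, r) >= min_overlap for r in reverse)
    ]
    reverse_hits = [
        r for r in reverse
        if any(overlap_length(r, f) >= min_overlap for f in forward)
    ]
    return forward_hits + reverse_hits


# Made-up genes: geneA (forward) and geneB (reverse) share bases 400-500.
genes = [
    ("chr22", 100, 500, "geneA", 0, "+"),
    ("chr22", 400, 900, "geneB", 0, "-"),
    ("chr22", 1200, 1500, "geneC", 0, "+"),
]
# Made-up exons of those genes: none of them fall in the shared region.
exons = [
    ("chr22", 100, 200, "exonA1", 0, "+"),
    ("chr22", 450, 500, "exonA2", 0, "+"),
    ("chr22", 400, 430, "exonB1", 0, "-"),
    ("chr22", 800, 900, "exonB2", 0, "-"),
]

print([f[3] for f in overlapping_features_on_opposite_strands(genes)])
print([f[3] for f in overlapping_features_on_opposite_strands(exons)])
```

The toy data is chosen so that the genes overlap while their exons do not, which is the same situation seen for DGCR2 in the genome browser.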
So say for example, maybe at the end, once I get my overlapping features, I might want to sort those features. So in the tool panel on the left, I can click on filter and sort, and I can click on sort, and then add it as the last step in my workflow, connect this output and set it as the input to my sort data set step. And then rename this sort to my final version, so I'd call that overlapping features on both strands. So this is a really handy way to have a high level view of the workflow, rather than what we saw linearly in the history. And so because I don't necessarily want to sort at the end, I'm going to remove this, and I'm going to click on save workflow, so that now I can rerun this workflow on any kind of input data that I want. So the next thing that we'll want to do, now that we have our workflow set up, is think about our question, which was, you know, we looked at the genes that were overlapping, but now we want to take a look at the exons that were overlapping. So to get back to see my history, I'm going to click on this home button. So that's going to take me back. And now let's bring in some data from UCSC again, looking at the exons. So the tutorial covers a different way that you can get this data. Say for example, you were looking across the entire genome, and you already spent a lot of time getting that data uploaded into Galaxy. So this data set actually does have information on the exons, in the columns beyond that column six that we were originally looking at. So there are some tools that you can use, it's described in the tutorial, to extract these details about the exons, so that you don't have to go through that process of bringing in the data again. But because we're looking just at one chromosome, I'm going to go back to UCSC to get that data. So in the tool panel, I'm going to click get data. I'm going to find the UCSC main table browser.
So, at this point we're going to keep things all the same. But now one thing that is different is the position. It remembers from the genome browser that we were looking at this specific section. So if there was other information that we wanted to get, or we wanted to organize it in a different way, it's really handy that it'll remind us and put in those coordinates within the genome already. So, we want to look for the exons that are across the whole chromosome 22, so I'm just going to remove that selection. On the second page, we are going to select that we want to get not the whole gene, but the coding exons, so those exons that are going to end up as part of the RNA that will get translated into a protein, so I'm going to click that. And then to get those specific exons I'll send my query to Galaxy. All right, job successfully added to the queue. So now we have another data set that we've brought in. It's added as another item to our history. This time, this is going to be the exons, so we'll give that a couple of moments to run. And then we'll want to make sure that we rename it so that we can keep track of the difference between these two data sets that we have. Now, what I did here was bring this data set into my current history, which is unnamed so far. In this history, in this analysis, I was looking at the overlapping genes on chromosome 22, so I might want to rename this history for this specific analysis to say, this is where I found those overlapping genes. Did I save that correctly? There it goes. Okay, now it's updated, so overlapping genes chromosome 22. And here I just brought in a data set of exons, so I'm going to rename that as well. So this is exons chromosome 22. And click save. So I could keep this analysis in the same history. But it might help me in the future, if I'm looking at specific steps, to actually put this in a different history. So I'm going to do that.
And I could have started a new history and brought that data set directly into that new history, but you know, I was a little excited about bringing in these exons, so I didn't quite do that. So I can actually move this data set from this history to a new history. So to do that, I'm going to click on create new history. I can see that this history is empty, so I'm going to first name this overlapping exons chromosome 22 and click save, so I know which history to move this to. And then I can click on history options, show histories side by side. So I'm going to see all of my previous histories, so I'm going to select histories here. This is the current history that I'm in. So I want to get that data set, those exons, from this overlapping genes history. I'm going to add that selected history to my view. And now I can simply drag and drop this exons data set, and it's going to get copied into this history. So now I have two copies of these exons, and one is taking up some of the room in this history that I have. So what I'm going to do is delete it from this history, and it's still going to stay in my new history here, the overlapping exons. So we see that this number for how much space this history is taking up didn't change. So the way that we can go and make sure that data set really gets deleted is, you know, like with your email, you might delete something but then it goes into your trash folder. So I'm going to actually go and do that final delete. So I'm going to switch to my overlapping genes history. And then here I get to see that I have six data sets, and one deleted data set. So I'm going to click on that. And here, the deleted data set that I have is my exons. So here what I can do is click on this cog and click purge all deleted content. So do I really want to delete this data? Yes, because I have a copy in my other history. So I'm going to click okay.
That removed this data set, so I'm back under two megabytes of data. So now I'm not using unnecessary disk space. Okay, so let's switch back to this overlapping exons history so now we can run our workflow. Okay. So no data found for selected filter. I think the selected filter was still the deleted data sets. So I'm going to click here to show... okay, we're good. We're in good shape. So now what we can do is run our workflow. Finally. So I'm going to come up to the top here in the masthead and click on workflow. And right here at the top, I see this workflow that I made a couple of minutes ago. So here I could click on the name. There are a couple of things that I can do: edit, copy, see the invocations, the previous times that I've run this workflow. But what I want to do here is run it. So I'll click on run. So now I have this similar form that we've seen before, where we're going to look for features on the overlapping strands. And the option that I get to run this on is the data set that's in my history. So I want to run it on the exons on chromosome 22. So there's this nice way that I don't have to see all the steps. But for the sake of seeing what that looks like, now we get to see all these other steps that are going to be part of this workflow. We have an option to send the results to a new history. But I created this history specifically to look at the exons, so I'm going to say no. Now, when we were looking at that intersect on the genes, we were looking at where there was an intersect of at least one base pair. So another note about exons is that once that DNA is processed, all of the exons are chopped out of the DNA and they're combined into what's called RNA. When that RNA gets translated into a product like a protein, each part of the exon is read in chunks three bases long. The products created from each of these chunks of three are called amino acids, and these are the building blocks of our protein.
So it's these amino acids that then get created, all lined up, and they get formed into our protein. So each of those amino acids, again, is made from three nucleotides, so we can even adjust our workflow here and edit it so that, instead of looking for just one base pair that overlaps, maybe we look for a place where there are three, maybe nine nucleotides that overlap, so that we get three of these amino acids in a row. So I'm going to put nine here. And then I'm going to do the same thing on the opposite intersect. So I can click edit here and change that to a nine. And that way we're looking for sections in the exons that give us a little bit of a longer overlap, so that we see these amino acids that actually turn into products. We're able to change that for this run of the workflow. So you can still customize the type of analysis that you're doing. You can make these adjustments in the parameters, especially if you're looking for something a little bit more specific. So now that I've made those adjustments, what I'm going to do is scroll back up to the top and click run workflow. So we successfully invoked our workflow, overlapping features on opposite strands. It says you can check the status of queued jobs and view the resulting data in the history panel. So right now, we see under this summary section that it's waiting to complete invocation one. So everything at this point is still being loaded. And as the workflow completes we'll have more items that are added to the top of our history. Wow, they all just popped up. So we have our original data set, and now it's pretty similar to what our history looked like before, except that there are some jobs that can run at the same time. So starting with that initial data set, we have that filter where we're looking for the forward strand exons and the reverse strand exons.
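The change from a one-base-pair threshold to nine can be sketched as a single parameter. In this illustration the coordinates are invented, the names are hypothetical, and reading frame is ignored: it only counts how many whole three-base chunks fit in an overlap, which is a simplification of the biology described above.

```python
def overlap_length(a, b):
    """Bases shared by two half-open (start, end) intervals."""
    return max(0, min(a[1], b[1]) - max(a[0], b[0]))


def overlapping_intervals(first, second, min_overlap=1):
    """Whole intervals from `first` sharing at least `min_overlap` bases."""
    return [
        a for a in first
        if any(overlap_length(a, b) >= min_overlap for b in second)
    ]


# Made-up exons: the first pair shares 4 bases, the second pair shares 12.
forward_exons = [(100, 200), (300, 360)]
reverse_exons = [(196, 250), (348, 400)]

loose = overlapping_intervals(forward_exons, reverse_exons, min_overlap=1)
strict = overlapping_intervals(forward_exons, reverse_exons, min_overlap=9)

print(loose)   # both forward exons pass with a 1 bp threshold
print(strict)  # only the exon with the 12-base overlap passes at 9 bp

# Whole three-base chunks that fit in the 12-base overlap (frame ignored).
chunks = overlap_length(forward_exons[1], reverse_exons[1]) // 3
print(chunks)
```

Raising the threshold can only shrink the result, so a run at nine base pairs returns a subset of what a run at one base pair would return.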
So we can't process the intersect steps until we get these exons on one strand and on the opposite strand filtered out. So once that's complete, both of those intersects get done, and only when those intersect files are created can we concatenate them. So here we are able to go through the whole process that we did before with our workflow, but now that our workflow is set up, we can run this really quickly, in a matter of a couple of minutes. So, our workflow has finished running, and I got so excited about running it again that I didn't even take a look at the data sets that we had. So that's kind of one temptation with workflows: it's just really exciting that I can save a lot of time and run all these steps. But it's still really important to keep in mind, for the data that I'm running through this workflow, is this workflow really going to get me the result that I want, or process the data in the way that I want it to? So let's do a little bit of exploration of our data. So, like we talked about with the genes, there are going to be many exons within those genes, right, so the exon is going to be a section of that gene. So if we started out with 5,000 genes, we see that there are 18,473 exons that are within those 5,000 plus genes that we looked at in the last data set. So, we would expect maybe close to half to be on the forward strand and maybe half on the reverse strand. So here we have almost 53% on the forward strand. And then on the reverse strand about 47% of the exons are on there. So that's pretty similar to what we saw previously. And then what we saw on the intersect is that actually only five exons on the positive strand overlap with exons on the reverse strand. And only seven on the reverse strand overlap with those on the positive strand. So, wow, last time we had about, what, 1000 on each, so we saw that the gene overlap was really, really common.
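The ordering described at the start of this part, filters first, then both intersects, then concatenate, with some jobs able to run at the same time, is a dependency graph. As a rough sketch, Python's standard-library `graphlib` module can group the steps into batches that are ready together. The step labels are the ones used in this session; how Galaxy actually schedules jobs is not shown here and is not assumed to work this way.

```python
from graphlib import TopologicalSorter

# Each step maps to the set of steps whose output it needs.
workflow = {
    "filter forward": {"input features"},
    "filter reverse": {"input features"},
    "intersect forward with reverse": {"filter forward", "filter reverse"},
    "intersect reverse with forward": {"filter forward", "filter reverse"},
    "concatenate": {
        "intersect forward with reverse",
        "intersect reverse with forward",
    },
}


def ready_batches(graph):
    """Group steps into batches whose dependencies are all complete."""
    sorter = TopologicalSorter(graph)
    sorter.prepare()
    batches = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # sorted for a stable printout
        batches.append(ready)
        sorter.done(*ready)
    return batches


for number, batch in enumerate(ready_batches(workflow), start=1):
    print(number, batch)
```

Steps that land in the same batch do not depend on each other, which is why the two filters, and later the two intersects, can run side by side.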
But here, we're seeing in the combination that there are only 12 regions, there are only 12 exons that overlap on chromosome 22. So we did a lot of work, and we found out that this is actually really rare, you could say, in this chromosome. So, now that we've done this, you might want to see, well, what does this look like across the entire human genome, or maybe, is this more common in other organisms than it is in humans? You could also play around with the versions of things that we used, right, so here we used the December 2013 assembly as the reference of the human genome. Maybe you could say, wow, okay, so now we have a lot more data about the human genome. What does this look like on the most complete version that we have today? So, in those sections of the genome that were really complicated to try to put together, does this happen more, does this happen less, or about the same amount? So there are a lot of cool directions that we can take this, especially with the workflow, right, because that only took a couple of minutes. So, there are a lot of cool directions that we could take this in to understand where the exons overlap. We can also try to understand more information about these exons. So what genes are they from, or what proteins do they make? Is this specifically in immune products, or in proteins that are important for different types of functions? And is this more common in certain ones than others? So, there's a lot more information that we can try to understand here, but the tutorial that we've worked through today really nicely tells us each of the steps that we have to do, all the parameters that we have to set, and the specific kind of data that we need.
But once we get into this place where we're asking more questions, it's really exciting, because there's a lot we can understand, but it can also be more frustrating, because then the job is on us to look for different tools and make sure we play with the parameters just right to get that information. So hopefully this is a nice introduction to Galaxy for you, and hopefully a nice introduction to genomics. I hope you enjoy the rest of this Smörgåsbord week. If you have any questions, feel free to ask in the Slack. There are going to be lots of folks available to help you, and I'd be happy to help you through any places where you get stuck as well. So, happy Smörgåsbord. Thanks everybody.