Hi, everyone. Welcome to this tutorial about using R in Galaxy. This is a continuation of the previous video, where we covered the basics of using R through Galaxy. Just a quick overview: we're going to be using the interactive RStudio tool provided by the Galaxy interface. Since we already have a session running here, you can go to User, Active InteractiveTools, and you'll see that it is actually running. If I click on the RStudio link, it will pop up the interface that we were working in. In our previous session, we had a look at some basic operations in R: how to create variables, the different ways to apply mathematical operations, how to assign values and how to remove them. We talked a bit about how to work with vectors and how to do subsetting, and finally a few things about lists. In this session we'll take a few more steps in dealing with data, specifically how to do data manipulation for tabular data from Galaxy in R. We will check how to load and explore the shape and content of a tabular dataset using base R functions. We'll see a few things about factors and how they can be used to store information. And we'll use one of the most commonly used libraries in R, dplyr, to manipulate the data, and see a few things about working with that. An interesting point to keep in mind is that a substantial amount of the data we work with in science is tabular data, so data organized in rows and columns. So there are some principles that are good to keep in mind when working with such data. One of them is to keep the raw data separate from the analyzed data. In principle, you load the data and then you don't touch it anymore. Even when you write back, you write your analyzed, refined or normalized data into different files, so you don't run the risk of accidentally changing the original data.
The second point is to keep the spreadsheet data tidy. A simple way of thinking about this is to have one row in the spreadsheet for every observation or sample, and one column for every variable that we want to measure or report on. Although it's quite simple to explain, it is one of the easiest principles to violate; it is worth keeping in mind that a vast amount of a scientist's time is dedicated to tidying data for analysis. And finally, a crucial aspect: trust the data, but always verify. You don't need to be paranoid about the state the data is in, but you should always have a plan to assess and verify it, and that is one of the focuses of this particular lesson. In many cases you have assumptions and expectations about the data: the range of values, how many there are, and what the different observations should be. As the data grow in size, these assumptions may no longer reflect the actual content, and they can't easily be verified by eye. For this reason it is good practice at the beginning to actually check that the data is correct, and we will see how this works. So one of the first things we're going to do now is create a new R script. I'm going to close this one, go to File, New File, R Script, and save it as R_advanced. There we go. And now we are ready to import some tabular data into R. The easiest way to do that is to use a function called read.csv. This can take a lot of different things as input: a file, a stream, or in our case a URL. Essentially, what we'll be using is the output of the annotated differentially expressed genes that we created in the RNA-seq lesson just before. So I'm going to save this as a new variable for the annotated differentially expressed genes, and I'm going to assign it here.
So again: the name of our variable, the function that is going to read this in, and the full URL. I'm going to execute this, and as you can see we have a new data object here. If I have a quick look, it might not look as nice as we would have liked: it sort of worked, but not in the best way. To check what happened, if I were to download the file and open it locally, I would see that it is tab-delimited instead of comma-separated. So what I can do is use an additional parameter here called sep, which I set to "\t". If I rerun this command, you see that I now actually have 130 observations, where observations are equivalent to rows, and 13 variables, so 13 columns. And if I drop down this list, I can see the different columns that this table actually has, which is quite interesting to see. I can also use the console to have a look at the names of this variable. I'm going to use colnames and run this, and you can see the same names that I see in the Environment panel; I can also see them as a vector now. So congratulations, you've successfully loaded your data into RStudio. Now that we've loaded the data, the next thing we need to do is have a quick look at what it actually contains, to get some summary of the information here. To do that, we can use a function called summary, which is very convenient because we don't really need to remember a lot of things. Let's run this, with Ctrl+Enter, and you see in the output in the console a lot of information. We can also use the function we saw last time, str (structure), which allows us to get a better understanding of what the structure of this variable is. All right, so we have a lot of information here.
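As a minimal sketch of this loading step, the Galaxy URL is replaced here by a small temporary tab-separated file, and the variable and column names are made up for illustration:

```r
# Stand-in for the Galaxy URL: a tiny tab-delimited file on disk
tsv <- tempfile(fileext = ".tsv")
writeLines(c("GeneID\tBase.mean\tlog2.FC",
             "FBgn0000001\t1500.2\t2.1",
             "FBgn0000002\t22.7\t-1.4"),
           tsv)

# sep = "\t" tells read.csv() the file is tab-delimited, not comma-separated
annotatedDEgenes <- read.csv(tsv, sep = "\t")

dim(annotatedDEgenes)        # number of observations (rows) and variables (columns)
colnames(annotatedDEgenes)   # the column names, as a character vector
```

With the real dataset you would pass the URL as the first argument to read.csv() in exactly the same way.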
So let's have a better look. In the summary, you can see that every column is represented as a block: there's a block here, a block here, and so forth. If the column is numeric, what it gives us are some basic statistics: the minimum, the maximum, the median, the mean, the first quartile and the third quartile, so we get a basic indication of the information in it. If instead the column is of character type, and you can see character here and for these last ones, you only see a few things: the length of the column, so how many elements it has, its class, and its mode, and that's it. Another point to keep in mind is that str shows us the individual structure: in other words, we can see that GeneID is a column of character type and these are the values it has, this one is an integer, and so forth. So using both summary and str, we can get a better understanding of the content and how it is organized. An interesting point here is that a lot of the variables, like Base.mean, the log2 fold change and the P-value, are numerical data and get these summary statistics, while some others are treated as character data, for which we only get length, class and mode. It is worth having this in mind, because it helps us understand how the data can be used. Another interesting point, and this is actually one of the new items in this lesson, is that by default all of these columns are treated as either character or numerical. In some cases, like Start and End, which are whole numbers, they are considered integers instead of numerics, the difference being whether or not they can hold floating-point values.
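A small sketch of what summary() and str() report for different column types, using a toy data frame with invented column names (assuming R 4.0 or later, where strings stay character by default):

```r
# One character, one integer and one floating-point (numeric) column
df <- data.frame(GeneID  = c("gene_a", "gene_b", "gene_c"),
                 Start   = c(100L, 500L, 900L),
                 P.value = c(0.01, 0.20, 0.50))

summary(df)  # numeric columns: min/quartiles/median/mean/max;
             # character columns: only length, class and mode
str(df)      # one line per column, with its type and first values
```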
However, there are some cases where the information provided here is more useful as categorical data. What does this mean? There is a parameter when loading the file which is called stringsAsFactors. Until fairly recently this parameter was set to TRUE by default, and with a newer version of R it has been changed to FALSE. I will rerun the read.csv command with stringsAsFactors set to TRUE, and I'm going to also rerun summary and str, and you're going to see some changes compared to the previous case. Going back to the structure, you can see that all the character-based columns, instead of being character, have now been changed to factor. The factor is one of the major data structures used in R, because it allows us to work with categories; factors are a special case of character-type vectors. And if we go to the summary, you can see that instead of saying that GeneID, for example, has length 130, it now gives us a basic breakdown of how the different values, the different categories within each column, are distributed across the cases. The easiest example to see is Strand, where we have 72 cases of plus and 58 cases of minus, or Feature, where we have 126 protein_coding, 3 lncRNA (long non-coding RNA), and 1 pseudogene. Sometimes you need to treat data as a factor, and other times you may want to keep it as character. For this lesson I've explicitly asked everything to be changed to factors so that we can continue with this process and show exactly how factors work. So the first thing that we're going to do, let me change the size of the table again, is to extract the Feature column from the annotated differentially expressed genes table, and I'm going to save it into a new variable called feature. Let's add it like that.
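The stringsAsFactors behavior can be sketched with a throwaway one-column CSV (assuming R 4.0 or later, where the default is FALSE):

```r
csv <- tempfile(fileext = ".csv")
writeLines(c("Feature",
             "protein_coding",
             "pseudogene",
             "protein_coding"),
           csv)

df_chr <- read.csv(csv)                           # Feature stays character
df_fct <- read.csv(csv, stringsAsFactors = TRUE)  # Feature becomes a factor

class(df_chr$Feature)   # "character"
class(df_fct$Feature)   # "factor"
levels(df_fct$Feature)  # the categories found in the column
```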
So as you can see now, we have our original data frame, our table, and we have yet another variable. If I use a function called head, which shows us the first few values of a variable, you see that it prints out that the first few values of the vector are all protein_coding. So everything is the same, but it also gives us another piece of information: levels. Levels are essentially the different categories that are supported by this particular factor. An easy way to think of a factor is as a dropdown list: there is a limited number of options, and you cannot have a different value, a different type of information there, except the ones that are already defined. And if I run str on feature, it gives us the information that we have a factor with three levels, these are the levels, and then there's some more information here, which is numbers. So why are we seeing numbers here? For the sake of efficiency and storing less information, R stores the content of a factor as a vector of integers, where each integer is assigned to one of the possible values in alphabetical order. In other words, the first element in alphabetical order among our feature levels is lncRNA, because if you look at the alphabetical order, L is the first letter; then we have protein_coding, and then pseudogene, because 'pr' comes before 'ps'. So R assigns 1 to lncRNA, 2 to protein_coding, and 3 to pseudogene. By using the str function, we see what the different levels are, plus the first few values, and as we've seen before, the first few values are all protein_coding.
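The integer coding just described can be seen directly on a toy factor (the values here are invented):

```r
feature <- factor(c("protein_coding", "protein_coding",
                    "lncRNA", "pseudogene"))

levels(feature)      # alphabetical: "lncRNA" "protein_coding" "pseudogene"
as.integer(feature)  # the internal codes: 2 2 1 3
```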
So if I count the first few values, which are the numbers we have here, you can see that it's two, two, two, two, two. In other words, we have the internal representation of the different factor values as integers. One of the most common uses of factors is to plot categorical values, so let's actually try to do that. We'll be using the base function plot. I'm going to type plot(feature) and execute it. As you can see down here, let me zoom in a bit, it produced a simple enough plot. We'll be seeing much more about how to create nice visualizations in a later tutorial, but for the time being it's good to know how this can work. In other words, factors are an efficient way of storing categorical data, with the added benefit of making this kind of information easy to summarize and plot. So now that we've seen factors, and we will return to them in a bit, another point to check out is how to do subsetting. We have a table here, but we might need to extract a particular piece of information. We're going to use the exact same structure as we did with vectors, so the square brackets. Let's try a few ways of subsetting. Let's say, for example, that we want to extract the first value of the first observation. In other words, if we see these as different columns, and I can use this particular button to view this as an actual table, what I want to extract is this one: my first row and my first column, this particular value. I can use the index [1, 1]. If I run this, it gives me my first value. It actually gives me more information here, because I've extracted a particular piece of information from a factor column: it reminds me that this is a factor and lists the different levels that exist here. I can do the exact same thing with different indexes, let's say [2, 4].
If I run this, I get this particular value. If I check the actual table and go to the second row, fourth column, there it is: this is the value we've just retrieved. And as we did with vectors, we can also specify ranges. I can use, for example, [1:4, 1], and if I run this, it gives me the first four values of my first column: rows one up to four, for column one. Again, if I go here, my first column, one, two, three, four, it gives me those four values. I can combine this in both directions, so I can say 1:10 for the rows and 1:5 for the columns. If I run this, I get a subset of the same table. Similar to vectors again, I can be more explicit: I can say that I want the first 10 rows, but I also want to capture the columns named Feature and Gene.name. In this case, okay, I've created an error, because I mistyped one of the column names, and I can check exactly what the names are. There we go, that was my problem: I had a typo. You see that I wrote the name with a capital N, whereas the actual column name has a lower-case n. I'm going to change this, run it, and now it gives me the correct subset: the first 10 rows, for the columns Feature and Gene.name. In the same way as last time, I can also disregard columns: I can say, give me everything but the first column. In this case I've requested all the rows, because the row part is empty, and all the columns except the first one; as you can see, instead of starting from position one, this starts from position two. I can do the opposite as well: I can ask for all the columns and only the second row. Or I can combine them.
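The bracket-subsetting patterns above, condensed on a small made-up data frame:

```r
df <- data.frame(a = 1:5, b = letters[1:5], c = 6:10)

df[1, 1]              # first row, first column: a single value
df[2, 3]              # second row, third column
df[1:4, 1]            # rows 1 to 4 of the first column
df[1:2, c("a", "c")]  # rows by index, columns by name
df[, -1]              # every row, every column except the first
df[2, ]               # every column, only the second row
```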
And I can ask for rows two and three and all the columns, and vice versa: I'm going to change that and ask for all the rows, but only columns one to three. Here, as you can see, I get the 130 observations, but only the first three columns, which are GeneID, Base.mean and log2.FC. A final point, which is very specific to data frames, is that you can use the dollar sign that we saw earlier to retrieve a particular column. For example, in this case I've requested the whole GeneID column, and this is how it works. Another final point, again specific to data frames, though we kind of saw this with vectors as well: I can use this notation to access, for example, the Feature column, and say that I want to extract only the rows of the table that have pseudogene as a feature. As you can see, it gives me all the columns, because I have absolutely nothing in the column part, but only the rows for which the Feature column contains pseudogene. I can take it one step further and store the subset here, and I can continue even further and add a column selection to the subset. There are a lot of exercises on the Galaxy training material, so I would suggest you have a look and try them out; they hopefully provide a bit more context on how subsetting works. So now we've seen how subsetting works in data frames. Let's move one step further and go back to the factors. Say that we've loaded the original data table, and you see that here we have factors and numbers and so forth, but we might not actually want all of these to be factors; we might want them as they originally were, as characters. One of the common activities is to change the type of the values in a column from one to another: basically, what we're talking about is coercion. Let's try to do this with Gene.name.
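Dollar-sign extraction and logical row-filtering can be sketched like this (column names and values are invented):

```r
df <- data.frame(Feature = c("protein_coding", "pseudogene", "protein_coding"),
                 GeneID  = c("g1", "g2", "g3"))

df$GeneID                         # a whole column, returned as a vector
df[df$Feature == "pseudogene", ]  # all columns, only the matching rows
```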
So we don't want the gene names to be factors; we want them to be plain character values. Let's check the structure of the Gene.name column. Let me scroll down a bit, there it is. If I check this out, it tells us that it is basically a factor with 130 levels, and these are the different integers associated with each level. So let's try to do something weird. We know that this column contains characters: if we check the table, and I'm going to do it right now, I open the table, scroll to the end, and I see characters here. So I want characters. And we saw previously with vectors that if I try to coerce strings to numbers, R just gives a warning and basically introduces a lot of missing values. So let's try that here: let's try as.numeric with the annotated Gene.name column as input. It will presumably give an error. Huh, this is interesting: it actually ran without warnings, and for some reason the characters have been changed to numbers. So it works, but it actually doesn't. Instead of giving an error message, R returns numeric values, which in this case are the integers assigned to the levels of this factor, as you can see here. For example, Ama was level 88, and you can see 88 here, and so forth. This is the kind of behavior that can sometimes lead to hard-to-find bugs. For example, if you load a table and a column is accidentally converted to a factor, and you then try to do some numerical operations on it, it will silently change to numbers. You see no errors, so you assume that everything is going well, but actually this is a problem, and if you don't look rather carefully, you may not notice it. So how do we do it properly? We coerce explicitly: we make an explicit conversion of this particular column to characters with as.character. If I try this and run it, you can see that this time around it works as we want, and these are the actual values.
And now that we know this works, I'm going to overwrite, and this is important to highlight, I'm overwriting my original column in the data frame that I loaded with the changed information. You see now that in my table this has been changed from a factor to a character. However, bear in mind that I have not saved over my original data: this change exists only within RStudio. If need be, I can easily reload my original data and repeat the exact same process, but this is something to keep in mind. So again, just as a reminder: when loading a data table, by default, as of recent versions of R, the stringsAsFactors parameter is set to FALSE, so no column is changed to a factor. However, if you set it to TRUE, or if you explicitly convert a column to a factor, always remember what happens when you apply coercion, especially where numbers are involved. If a column should contain numbers but by accident also contains some characters, the whole column will be converted to a factor, and the factor values will correspond to the integers of the different levels. By coercing this to numeric, it is those level codes that will be retrieved. You will still see numbers, and without actually checking that the numbers correspond to what you expect, you might miss the problem altogether. So it's a common enough mistake, and it can be summarized in a single point: when dealing with factors, and you want to apply a function to a factor, always keep in mind what the data is that you're actually trying to apply it to. All right, so now we've captured all this information about factors; let's try to check a bit how we can also apply numerical functions.
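The coercion pitfall just described, in its most compact form: numbers that were accidentally stored as a factor.

```r
f <- factor(c("10", "20", "10"))  # numeric values stored as a factor by accident

as.numeric(f)                # 1 2 1  -- the level codes, NOT the values!
as.numeric(as.character(f))  # 10 20 10 -- coerce to character first, then to numeric
```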
For example, if we look at this table, we also have the Base.mean column, which corresponds to the mean of the normalized counts of all the samples, normalized for sequencing depth. Let's say that we want to check some numerical information there. We can apply functions, as we've just seen, directly to a column. For example, we can use mean, which gives us the average value, min, which gives us the minimum, max, which gives the maximum value, and so forth. Let's try finding, for example, the maximum value that this particular column has. We're going to use the exact same approach: the name of the variable, the dollar sign, and then selecting Base.mean. Ctrl+Enter, and as you can see, it gives us the maximum value here. This is useful, and we can also sort by this information. To do that, we can use the subsetting approach, in a sense: inside the square brackets I will use the order function, asking R to order the rows based on the value of Base.mean, and I put a comma after it because I want all the columns. I literally want to reorder my entire table by ordering my rows based on this particular value. And I can assign this sorted result to a new variable, for example sorted_by_base_mean, and I'm going to assign it here. If I run this, you see it has been executed, and the information is available here. I can do a quick check by running head on sorted_by_base_mean: here we see that the values start at 19, then 23, 24, 26, and so forth. If I do the exact same thing on my original table, we'll see that the information looks rather randomized: we have minus four, then two, then minus two, and so forth, exactly in the order we received it.
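A sketch of applying max() and sorting with order() inside the brackets, on a toy table with invented values:

```r
df <- data.frame(GeneID    = c("g1", "g2", "g3"),
                 Base.mean = c(300, 10, 150))

max(df$Base.mean)                 # the largest value: 300

# order() returns the row indices that sort Base.mean ascending;
# the trailing comma keeps all columns
sorted_by_base_mean <- df[order(df$Base.mean), ]
sorted_by_base_mean$Base.mean     # 10 150 300
```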
Sorry, I was looking at a different column there. If you check Base.mean in the original table, you see 1,000, then 65,000, then 2,000, so it goes up and down, while in the sorted table it is in ascending order, starting at 19, from the smallest to the largest number. There is an option in order that allows us to change that, and the parameter is called decreasing. By default it is FALSE, but we can set it to TRUE. So if I run the exact same command again and check the head of sorted_by_base_mean, we see that it now starts with the maximum value that we already found earlier, so we know this is the maximum value, and then continues in decreasing order. By doing that, we can play around with the different options and create a version of the table in the form we find most appropriate. Now that we've done that, the next step we might want is to save this new, changed table into a new file so that we can reuse it. The function to do that is called write.csv. What it expects as input is, first of all, what we want to save, so this particular table, and then the file we want to save it into; I'll name the file after the annotated differentially expressed genes table, with a suffix so we don't confuse it with the original, and the .csv extension. If I run this and go to the Files panel, we see that indeed the brand new file has been created here. And there is an option to push information from RStudio back to Galaxy; you can find this in the basic tutorial, so you can really bring all the outputs here back into Galaxy and continue your pipeline there. So now we've covered some of the basic functions for working with tables.
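The decreasing sort plus write.csv() can be sketched like this, writing to a temporary file (the file name here is made up):

```r
df <- data.frame(GeneID    = c("g1", "g2", "g3"),
                 Base.mean = c(10, 300, 150))

# decreasing = TRUE flips the sort to largest-first
sorted_desc <- df[order(df$Base.mean, decreasing = TRUE), ]

out <- file.path(tempdir(), "annotatedDEgenes_sorted.csv")  # hypothetical name
write.csv(sorted_desc, file = out)
file.exists(out)   # the file is now on disk
```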
And again, real fast: we talked about what the features are, how we can subset a table, the different ways we can inspect its structure, how factors work, the caveats we saw about automatic coercion accidentally converting something into numbers that it should not be, and then how to apply functions. Selecting columns and rows, subsetting, playing around with the data to reformat it, is one of the most common things done before the actual analysis at the end. And if we want to do a lot of these operations one after the other, we might end up creating a rather complex set of commands. There is a particular package called dplyr, created around 2014, that provides an additional level of functionality in R, specifically allowing us to aggregate, combine and analyze tabular data in a much more efficient way. An important point is also that it addresses the data directly where it is located, so it is generally quite memory-efficient. So how do we use it? The first thing that we need to do is load its functionality into what we are doing. To do that, basically to load dplyr, we write library(dplyr). If I press Enter, it tells us that R has attached the package dplyr, and lists some changes, masked functions, that we need to be aware of, and so forth. So let's see a few things that can now be done for subsetting, but using dplyr functionality, and you can see how much more intuitive and convenient this process is. First of all, let's say that we want to select columns and filter some rows, which is basically subsetting. dplyr provides a function called select, and what it does is expect as input the table that we want to use, so the annotated genes table, and then we specify by name which columns we need.
So if we open this up, we can see the columns that we might want: GeneID, Start, End, and we also might want Strand. If I run this now, you see that the output we get is a table that has only the columns we specified. And as you can see, we didn't need to specify indexes or anything else: just the table itself, which we already know, and the names of the columns we need, as they are listed. Similar to standard subsetting, we can also select everything except some of the columns, using the minus symbol. Let's say that we want everything except, for example, Chromosome. If I run this, you can see that it prints out all the columns; scrolling a bit further, after P.adjust, where Chromosome should be, you see it is now missing, and then it continues from Start and so forth. So it is a very easy way to extract information. Plus, it gives additional functionality. Let's put back the same structure as at the beginning, and say, for argument's sake, that we know we want to extract some columns which all start with "P.": we want only P.value and P.adjust, for example. We can either specify P.value and P.adjust explicitly, or we can use, sorry, I've put a double s here, starts_with, where we give exactly the string we want the column names to start with, so "P.". If I run this, you see that dplyr is clever enough to do a quick pattern check across all the columns and give us only the ones starting with "P.". You can get even more complex: there is more functionality here for more elaborate selections, for example whether a name contains a character or ends with a character, and you can even use regular expressions for that.
So select, and I'll put this as a comment, allows us to select columns, basically the features of the information. Subsetting, as we did before, also needs to select rows, and filter is the dplyr function that allows us to filter rows. How do we do that? Again, the same structure as before: we call the function with the table we want it applied to, and then the question is how we want our rows to be filtered, so we need a statement. For example, let's say that we want to keep only the rows for which the strand is plus. If I run this, we see that it prints out all the columns, but instead of the 130 observations that we have here, it only brings back 72, and if we check Strand, you see that everything is plus. The equivalent in base R that we saw before is to use the annotated genes table again, with the condition inside the brackets that the annotated genes Strand column equals plus, and then a comma; in other words, give me all the columns, and only the rows where this particular column has a plus sign. If I run this, it gives me the exact same information. But as you can see, in terms of reading the code and understanding what we're trying to do, the dplyr version is much more convenient. Let's try something more: we can filter, for example, by chromosome. Let's do the same thing again, filter the rows, but make it a bit more complex: say that we want to select only chromosome X and chromosome 2R. If I run this, and I scroll a bit more, you can see in this column here, Chromosome, we only have chromosome X and chromosome 2R. In other words, here I want to filter for all the rows where the value of Chromosome is either X or 2R. And we can include even more logical conditions here.
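The select()/filter() calls above, plus the base-R equivalent, sketched on a toy table (this assumes the dplyr package is installed; table and values are invented):

```r
library(dplyr)

genes <- data.frame(GeneID     = c("g1", "g2", "g3"),
                    Strand     = c("+", "-", "+"),
                    Chromosome = c("X", "2R", "3L"))

select(genes, GeneID, Strand)  # keep columns by name
select(genes, -Chromosome)     # keep everything except one column
filter(genes, Strand == "+")   # keep rows matching a condition

# base-R equivalent of the filter() call above
genes[genes$Strand == "+", ]
```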
So I'm going to do the same thing again with the annotated genes, and let's say that we want to filter only for log2.FC greater than or equal to two. If I run this, it gives us only the six rows for which the log2.FC value is at least two. And here is one of the most useful functionalities, and it's important to highlight: you can do combinations of those conditions. Let's say that we want to combine the two. I can say, and I'm going to copy this part directly, that I want to filter for rows that have chromosome X or 2R, but where I also want log2.FC to be greater than or equal to two. To combine the two, I put a logical AND between them. If I run this, it gives me only the two rows, as you can see; this is our entire result: chromosome X and chromosome 2R. So basically it filtered on both aspects, both the chromosome and the log2 fold change being greater than or equal to two. This is one way of chaining multiple conditions. Another interesting point: if you have multiple criteria, you can easily imagine that this gets quite extensive. You'd have multiple different logical operations, one after the other, and eventually it might end up being a very, very long line. dplyr actually provides a very nice additional functionality called piping. In other words, it's a way of coding where the output of one function is provided as input to the next one, and so forth; it's very similar to what is done in the Unix environment. Let me show you how this works. The actual symbol of the pipe is percent, greater-than, percent: %>%. This is the pipe in dplyr. So what I'm doing here is saying: I have this command, and, let me remove the pipe first, the command by itself, if I run it, actually prints everything out; the output of this command is basically the entire table.
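Combining conditions in one filter() call might look like this (again assuming dplyr is installed, with an invented table):

```r
library(dplyr)

genes <- data.frame(Chromosome = c("X", "2R", "3L", "X"),
                    log2.FC    = c(2.5, 1.0, 3.0, -2.0))

# chromosome X or 2R AND a log2 fold change of at least 2
hits <- filter(genes, Chromosome %in% c("X", "2R") & log2.FC >= 2)
hits
```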
So I'm piping the entire table in, and then I'm asking filter to keep only the plus strand. The output of the whole first part is being passed into the next function. And this is important to highlight: I don't need to name the table as input again. As you can see, filter normally expects an input table as its first argument, and then you have the rest of the parameters; but because it's now part of a pipe, it doesn't require an input, it doesn't expect one, because it's already given by the pipe. So if I run this, it filters the rows for the plus strand, and if you check here, it's all plus. And I can continue doing that. I can put another pipe here: so I've selected the rows, and now I can select some columns. Let's say I want to keep only GeneID, Start, and Chromosome. If I run this, you see that I now have only those few columns. And more importantly, notice that I filtered based on strand, but strand doesn't need to exist in my second part; I've already removed it with select. So if I swapped those two commands, it wouldn't work, because in that case the strand column would already be missing when filter runs. I would urge you to try this and see why it actually fails. And just for verification purposes, you see that this command gave us 72 lines; by filtering only on columns, we still maintain the same number of rows. And I can continue the pipe: let's say we want to see only the top few lines, so I pipe this to head, and it gives me just the top lines so I can quickly check how this looks. So this is a very convenient way, as you can hopefully see, to create subsets from your original table based on your criteria and your purpose, without having to do a lot of complex operations within the same command. You can split your commands one after the other.
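A sketch of that filter-then-select pipe chain, using a toy table with assumed column names; the dplyr chain from the video is in the comment, and the runnable lines use the native |> pipe with base R's subset(), which can do both steps at once:

```r
# Toy stand-in for annotatedDEgenes (assumed columns and values)
annotatedDEgenes <- data.frame(
  GeneID     = c("g1", "g2", "g3", "g4", "g5"),
  Chromosome = c("X",  "2R", "2L", "X",  "3R"),
  Start      = c(100, 200, 300, 400, 500),
  Strand     = c("+",  "-",  "+",  "+",  "-")
)

# dplyr chain from the video:
#   annotatedDEgenes %>%
#     filter(Strand == "+") %>%
#     select(GeneID, Start, Chromosome) %>%
#     head()
# Base-R sketch: filter rows first, then keep three columns, then head()
result <- annotatedDEgenes |>
  subset(Strand == "+", select = c(GeneID, Start, Chromosome)) |>
  head(n = 2)

# Note: just as in the video, filtering on Strand AFTER the column
# selection would fail, because Strand has already been dropped.
result
```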
This also allows us essentially to create a new object, a new variable, that we can then save. I will remove the head, because I added it only to show how many lines we have, and I'm going to assign this to a variable called plus strand genes. So I'm going to run this entire thing, and you see that there is a new table here called plus strand genes; let's check, it has only the columns we selected. And now if I do head of plus strand genes and run it, it gives me the exact same result as before. So by starting with annotatedDEgenes, the original table, we can subset, we can filter, we can retain only the columns we need, and we can eventually save this information as a new table to use downstream. So that's how we can select columns and filter rows. Another key and quite useful functionality is to create a new column. Right now our original table has the log2 fold change, but that's the logarithmic version. Let's say we want the plain fold change, not the log of the fold change. To do that, we will create a new column, sorry, a new table with a new column, based on this particular calculation, and I will be using pipes now to be a bit more explicit. I'm going to pipe the table into a new function called mutate. Mutate expects, first of all, the table, which I provide through the pipe; second, the name of my new column, which I'm going to call FoldChange; and finally how it is going to be calculated: two raised to the power of my original column, log2.FC. Let's check this out. I'm going to pipe this to head so we can see only the first few rows. And it's interesting to see now: all of the original columns are the same.
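The mutate step can be sketched like this; transform() is base R's rough analogue of dplyr's mutate, and the column names here are assumptions on a toy table:

```r
# Toy stand-in for annotatedDEgenes (assumed columns and values)
annotatedDEgenes <- data.frame(
  GeneID  = c("g1", "g2", "g3"),
  log2.FC = c(2.5, -1.2, 3.1)
)

# dplyr: annotatedDEgenes %>% mutate(FoldChange = 2 ^ log2.FC)
# base-R analogue: transform() appends the computed column at the end
with_fc <- transform(annotatedDEgenes, FoldChange = 2 ^ log2.FC)

# Taking log2 of the new column recovers the original values
all.equal(log2(with_fc$FoldChange), with_fc$log2.FC)  # TRUE
```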
Nothing has been changed, but now there is an additional column at the end called FoldChange. And if you take the log2 of that value, you get back the original log2.FC. So this is how you can create a new column. A final point to keep in mind: now that we've created a new column, selected, and filtered, it might be extremely useful to think about more complex situations. For example, let's say we want to answer the question: how many genes do we have per chromosome? This is a very natural question. If we think about it, what we are asking is to create one subset of annotatedDEgenes per chromosome, and then basically count how many genes we have in each. This process I just described, splitting our initial data into groups and then applying an operation to each group, is called split-apply-combine, and it's a functionality offered directly by dplyr through the group_by and summarize functions. Let's put this title in: group and summarize. And let's see how this works. I'll start again with annotatedDEgenes and address this particular question: how many genes do we have per chromosome? First we need to group, so I'll call a function named group_by, and I want to group my table by chromosome. Now that we have our groups, note that nothing is actually printed out as groups; this is internal functionality of dplyr. It says that any function I provide from now on will be applied considering these as groups. And the second step would be: okay, I want to summarize, and I want to summarize the groups by counting them. So if we group our original data per chromosome, how many rows does each group have, which basically corresponds to how many genes? That's done by the function n. So n basically counts rows per group.
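The split-apply-combine step can be sketched like this; aggregate() is base R's way of splitting by a grouping column and applying a function per group, with the dplyr version from the video as a comment. Toy data with assumed column names:

```r
# Toy stand-in for annotatedDEgenes (assumed columns and values)
annotatedDEgenes <- data.frame(
  GeneID     = paste0("g", 1:6),
  Chromosome = c("2L", "2L", "2R", "2R", "2R", "X")
)

# dplyr: annotatedDEgenes %>% group_by(Chromosome) %>% summarize(n = n())
# base-R split-apply-combine: split rows by Chromosome, count with length()
per_chrom <- aggregate(GeneID ~ Chromosome, data = annotatedDEgenes,
                       FUN = length)
names(per_chrom)[2] <- "n"   # rename the count column to n

per_chrom  # one row per chromosome: 2L has 2 genes, 2R has 3, X has 1
```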
So if I run this, it's a very short piece of code, but you can see that it addresses a very interesting question, and one that is quite often asked in research. You see that it produces two columns. One is the chromosome, which is what we grouped by; the second column is n, because this is what we asked for here. And you see that chromosome 2L has 24 rows, chromosome 2R has 31 rows, and so forth. It is a very common operation, and we can do this even more concisely. I can copy this entire thing and use the equivalent version: because grouping by chromosome and counting the rows per group is such a common operation, dplyr actually provides a shorthand function for this particular thing, called count. So group by chromosome followed by summarize with n is equivalent to doing count on chromosome. If I run this, you see that it provides the same kind of information, and if I want to name the counting column, I can set n in there. So this is one way of applying the same functionality to subgroups. And you don't necessarily need to have just one grouping variable. I may want to ask the same thing not by chromosome only, but also by strand. If I run this, it creates additional groups, because now there is one line for each combination of chromosome and strand: chromosome 2L strand minus, chromosome 2L strand plus, and so on, and it provides the count for each of those combinations.
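Counting by two grouping variables can be sketched in base R with table(); the dplyr shorthand from the video is in the comment. One caveat worth a comment: table() also reports zero-count combinations, which dplyr's count drops. Toy data with assumed columns:

```r
# Toy stand-in for annotatedDEgenes (assumed columns and values)
annotatedDEgenes <- data.frame(
  Chromosome = c("2L", "2L", "2L", "2R", "2R"),
  Strand     = c("+",  "+",  "-",  "+",  "-")
)

# dplyr shorthand: count(annotatedDEgenes, Chromosome, Strand)
# base-R sketch: cross-tabulate the two columns, flatten to a data frame.
# Unlike dplyr's count, table() also lists combinations with zero rows.
counts <- as.data.frame(table(Chromosome = annotatedDEgenes$Chromosome,
                              Strand     = annotatedDEgenes$Strand))
names(counts)[3] <- "n"   # rename the default "Freq" column to n

counts  # one row per Chromosome/Strand combination
```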
So this is a very versatile command that allows us to create quick summaries of the information we have, and we can use different functions here as well: we can compute an average, an absolute number, or, if we want something more advanced, a completely different mathematical operation on this piece of information. Right, so we've seen how we can massage, aggregate, and summarize information using dplyr. Another thing that is quite useful to keep in mind, especially when we start thinking about plotting information, which we'll see shortly in the next video, is how to reshape the data. I'm going to use the exact same example here, copying it down, so we'll have more information to work with. Let me actually name the count column n: if I rerun this, you see that the name has now been changed to n. So here we have a quite traditional way of presenting information: we have the chromosome, the strand, and how many genes this particular combination has, that is, how many rows of my original table correspond to this particular set of criteria. In some cases, however, we might need a different type of presentation. For example, you might need a table that has the chromosome, then plus as one column and minus as another column. For this, there is a library in R called, let me type it here, tidyr, which allows us, let me run this, and now it's loaded, to transform and reshape a table from one format to another; changing the way the information is represented without actually changing the information itself. And this is the key distinction between dplyr and tidyr: dplyr aims to aggregate, summarize, filter, and select on a particular table.
tidyr aims at changing the shape in which this information is presented. So here again, let me run this command; we see this kind of information. Let's say that what we want is the presentation: chromosome, then how many on the plus strand, how many on the minus strand. I'm going to use a new function here from tidyr called spread. What spread expects is two pieces of information: which column should be taken into consideration, and which values will be split across different columns. Let me expand this a bit more. It expects us to select which column's values should be changed from a single vector into multiple columns. Strand, in this case, has two values, plus and minus, and those two will become separate columns. And then, which column holds the values that will be distributed into those new columns. If I run this, you see a new representation of the information: we have chromosome, and then we have a plus column and a minus column. If we do the math, we'll see that the two representations completely match. For 2L, the minus is 12 and the plus is also 12; there we go. If we check 2R, the minus is 17 and the plus is 14, and indeed the wide table shows minus 17, plus 14. So it is exactly the same information, transformed from one shape to another, and tidyr allows us to seamlessly transition between them. As you can imagine, this might be quite easy to represent visually, for example as a heat map, with these two pieces of information shown as columns. Let's save this one; to save some typing, I'm going to call it the wide representation. It's called wide because you take a column and you spread it across multiple columns. Let me run this, and we see here that we have a brand new table.
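The long-to-wide step can be sketched with base R's reshape(); the tidyr call from the video is in the comment (newer tidyr versions spell it pivot_wider). The 2L/2R counts follow the numbers mentioned in the video; everything else is a toy assumption:

```r
# Long-format counts, as produced by the group-and-summarize step
# (the 2L/2R numbers follow the video)
long <- data.frame(
  Chromosome = c("2L", "2L", "2R", "2R"),
  Strand     = c("+",  "-",  "+",  "-"),
  n          = c(12,   12,   14,   17)
)

# tidyr: spread(long, key = Strand, value = n)
# (newer tidyr: pivot_wider(long, names_from = Strand, values_from = n))
# base-R sketch: one row per Chromosome, one count column per Strand value
wide <- reshape(long, idvar = "Chromosome", timevar = "Strand",
                direction = "wide")

wide  # columns: Chromosome, n.+ and n.-
```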
The opposite operation of spread, as you might imagine, is a function called gather. I'm going to use the exact same approach and pipe this into gather, and it works in a similar fashion. I'm going to gather the column names into a new column called strand, put the values that are already there into a new column called n, and I'm passing chromosome with a minus sign, meaning that this column should not be affected at all by this operation. If I run this, it gives me back essentially my original table. I'm going to save this as the long representation, so that the difference between the two versions is clear, and you can see here that we have the wide case and the long case. So we can use these functions from dplyr and tidyr, as we've seen, to go from one representation to another, to do some analysis, and to save each individual table that is produced, so that the rest of the analysis can continue from there onwards. This highlights one of the principles we stated earlier in this video: we don't touch the original data again. We load it once, we continue working on it, and everything is repeatable; we haven't overwritten our original dataset. If we want, we can save any of those tables into our local folder using the write.csv function, and then use them from there onwards. If you go to the Galaxy training material, you will be able to find some exercises on this. tidyr and dplyr have many, many more functions than the few I've just shown you, but these are the most basic ones, and as you've seen they're quite versatile and allow a lot of things to be done. I hope you found this useful, and please feel free to check out the tutorial itself for more information and exercises that you can do.
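The wide-to-long round trip plus the final write.csv can be sketched like this; the tidyr call is in the comment, and reshape() does the base-R equivalent. The wide column names are simplified to plus/minus here (a toy assumption), and the output goes to a temp file so nothing in the working directory is touched:

```r
# Wide table as produced by the spread step (toy values; column names
# simplified to plus/minus so they stay syntactic in base R)
wide <- data.frame(
  Chromosome = c("2L", "2R"),
  plus       = c(12, 14),
  minus      = c(12, 17)
)

# tidyr: gather(wide, key = "Strand", value = "n", -Chromosome)
# (newer tidyr: pivot_longer(wide, cols = -Chromosome))
# base-R sketch: stack the plus/minus columns back into Strand and n
long <- reshape(wide,
                varying   = c("plus", "minus"),
                v.names   = "n",
                timevar   = "Strand",
                times     = c("plus", "minus"),
                idvar     = "Chromosome",
                direction = "long")

# Writing the derived table to a NEW file keeps the raw data untouched,
# as per the principle stated at the start of the lesson
out <- file.path(tempdir(), "strand_counts.csv")
write.csv(long, out, row.names = FALSE)

nrow(long)  # 4 rows: one per chromosome/strand combination
```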