Hi everyone. In this video, I'll be showing you a few of the basic structures of R and how to start coding in it. In the previous video, you saw how to start RStudio through Galaxy. This is the instance that I launched before, so it's still running. If you are using a Galaxy instance that does not support RStudio at this point, an alternative for running these exercises is the RStudio Cloud service, which is provided by RStudio: you can create a free account, and you will see exactly the same environment, which you can use to run these exercises. For our purposes, I will continue with the RStudio instance that I have spun up on Galaxy. To open the interface again, I click on the RStudio link here. This is the script that I created in the previous video. If I open it, you can see some information here, but in order to have a fresh place to work, I will create a new script: go to File, New File, R Script, and a new empty script appears here.
Just a few things about R, in case you're not aware. R is a free and open source programming language. It has been growing in popularity for quite some time now. It is widely used, and it has a broad community that continuously supports base R and provides a whole set of packages and libraries that extend and enhance its functionality. It's quite powerful. It runs on many environments and platforms, including Windows, macOS and Unix, and, as you can see, other platforms including Galaxy can run R and RStudio directly in the background. In this tutorial, I'll talk about a few of the basic aspects of R, and I will start with one main thing, which is how to create variables.
A variable is basically a name that holds a value that is useful for R to remember and to use. Let's create a variable, or an object if you like, called a. I've written this as a comment first. Actually, let me make this a bit clearer by putting the brackets there. So we have the name, then the assignment operator, which looks like an arrow (the less-than symbol followed by a dash), and then the value. Before I run this, it would be nice to save: as you can see, the script is still untitled, so it's unsaved. I press Ctrl+S as a shortcut and I name it R basics. As you can see, our new script appears down here. To run the line, I press Ctrl+Enter. Now you see on the console that the command has just run. It's also useful to look at the environment: RStudio is clever enough to tell us that we have defined a new variable, and what its value is. So the environment is basically a space where we can always keep a good overview of what information is there.
In this instance, we've named our variable a. There are a few points that are useful to remember about names. For example, we should avoid spaces. Say we want to create a new variable here: this is not a good name, because it contains spaces. To avoid that, we tend to use the underscore to connect the different words together into a single string. It's also very good practice to avoid symbols like the exclamation mark, the at sign or the hash as part of a name, because each of those symbols has its own function in R.
Using them will create a problem. For example, with the hash, you can see that RStudio has already identified the rest of the line as a comment, so it's not a variable as I intended. Also, we cannot start a variable name with a number. If we try a name that begins with a digit, this is not going to work: you can see that there is already an indication here saying that you're doing something wrong. This is a very good rule of thumb.
Let's create another variable. I'm going to call it human_chr_number, and I'm going to assign 23 to it. There is a problem here, as you can see, and the interesting thing is that RStudio tells us that what we've typed might contain an error. The problem is that by accident I left a blank here, so R sees one piece of text called human and then a second one, and the second one starts with a symbol that it doesn't understand. As I said, a name should always start with a letter, and the leading underscore creates an issue. After deleting the blank, this works absolutely fine.
Now that we've done that, let's create yet another one, under a new comment: assign object names. Let's say that we want to create a new variable called gene_name, and I'm going to put the value of 53 here. If I run this with Ctrl+Enter, you see that the command has been executed, and I now have yet another variable here. If I want to see the value, I can type gene_name, and you see that it prints out exactly this content. So we've seen how to put a value into the environment and how to retrieve it from there. If I'm done with this variable and don't need it anymore, I can use the function called remove, rm, and provide the name of the variable as the argument.
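As a recap of the steps described so far, here is a minimal sketch of assigning, printing and removing variables. The value given to a and the string stored in gene_name are placeholders chosen for illustration, not necessarily what is typed in the video.

```r
# Assign values with the assignment operator <-
a <- 1                    # placeholder value, chosen for illustration
human_chr_number <- 23    # words joined with underscores, no spaces
gene_name <- "TP53"       # placeholder value (assumption)

# Typing or printing a name shows the stored value
print(gene_name)

# Names that R rejects (kept as comments so the script still runs):
# human chr number <- 23    # spaces are not allowed in a name
# 1number <- 3              # a name cannot start with a digit
# human _chr_number <- 23   # a stray blank splits the name in two

# Remove a variable from the environment
rm(gene_name)
exists("gene_name")       # FALSE once the object has been removed
```

After rm(), typing gene_name again produces an "object not found" error, which is the expected behaviour.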
The command has been executed, and now you see in my environment that the variable no longer exists. The same is true if I try to access the value of gene_name: if I run this, it gives us the error that the object gene_name is not found. In other words, it was created and now it's been deleted, so if I try to access it again I get an error, which is exactly what we expect. So this is how we define variables, how we assign values and how we remove them.
Let's now have a look at some of the properties that those variables have. Every object in R has two main properties. The first is the length, in other words how many distinct values are held within the object. The second is the mode, in other words the classification, the type, of the object. There are several different modes. The most common one is numeric, which corresponds to types like floats, integers, decimals and so forth. Then there is character, which represents a sequence of letters and numbers together, like names, tags and so forth. And we also have logical, which are Boolean values: TRUE and FALSE, that's it. There are a few more, which I will not go into because they require some additional context. The main idea to remember is that length gives us the number of values contained in the object, and mode is the type of information there.
So let's try this out, under a new comment: mode and length. I'm going to first define a new variable. Let's call it chromosome_name, and I'm going to assign it the value chr02. I run this, and now we see it in our environment. Then I type mode and the chromosome name.
You see that RStudio is clever enough: after I typed a few characters, it identified what I was trying to type, so I let it autocomplete. If I run mode, you see that this object is of type character. And if I type length of chromosome_name, you see that it contains only one value, which is exactly what we should expect. So this is how mode and length work. If you go to the corresponding tutorial in the Galaxy Training Network material, you'll see some exercises, and I definitely encourage you to go through them to familiarize yourself with how mode and length actually work.
The second point worth discussing now is that, beyond assigning values, we can also do operations with them, and one of the main kinds is mathematical operations. All the basic math operations are available in R. We have plus and minus for addition and subtraction, the asterisk for multiplication and the forward slash for division, and either the caret or a double asterisk corresponds to the exponent. There is also the double percent sign: for two values a and b, a %% b is the modulo, the remainder of the integer division. This is how we can use them, for example on 5 and 1. And as in math, we can use parentheses to prioritize the application of one operation over another. Let's say that we want to compute (1 + 5^0.5) / 2: in other words, five to the power of 0.5, plus one, and the whole thing divided by two. I can run this, and it gives me the number 1.618034. This is one way of working, but given that we already have some variables, we can apply the operations not only to numbers but also to variables that contain numbers.
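The operators listed above can be sketched as follows. The operands are arbitrary illustration values, and human_chr_number is redefined here only so that the block runs on its own.

```r
5 + 1     # addition
5 - 1     # subtraction
5 * 2     # multiplication
5 / 2     # division
5 ^ 2     # exponent (5 ** 2 is accepted as an equivalent spelling)
5 %% 2    # modulo: the remainder of the integer division

# Parentheses control the order in which operations are applied
golden <- (1 + 5^0.5) / 2
golden    # prints 1.618034 with the default number of digits

# Operators work on variables that hold numbers as well
human_chr_number <- 23
doubled <- human_chr_number * 2
doubled   # 46
```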
In our environment here we have human_chr_number, which is 23. So we can try human_chr_number times two. If we run this, it gives us 46: R takes the name of the variable, replaces it with its content at execution time, in this case 23, multiplies it by two, and prints out the result. So this is how the whole thing works.
Another point that is extremely useful in R is how to use multiple values at the same time. Vectors are one of the most commonly used object types in R. A vector is basically a collection of values that, importantly, are all of the same type. So, for example, we have a vector of numbers, a vector of characters and so forth. This allows us to put many pieces of information of the same type into a single object, a single variable, so we can access them together and do operations with them.
So let's start working with vectors. I'm going to create a new vector called snp_genes, and to do that I'm going to use a new function called c, which stands for combine. I'm going to put a few genes here, let's say OXTR, ACTN3, AR and OPRM1. So, four genes here. If I run this, you see that the snp_genes object now exists here, and this time it is represented a bit differently in the environment: it tells us that it is of type character, that it has elements one to four, and it lists the elements. If we have a longer vector with more values in it, it will not show everything, just the first few, but it's always a good idea to have a look at what happened in the environment. Let's connect this to what we saw earlier and look at the properties of this object. As we said, we have the mode and the length. So let's try mode on snp_genes, as you can see.
Again, mode gives us the type of the values that are contained, and because we've put in something that is basically text, it correctly gives us character. We can also use length on this new variable, snp_genes. As you might expect, this gives us four, which can also be seen here in the environment. So both mode and length give us the information that we expect. There is another extremely useful function that basically combines the output of mode and length and gives both pieces of information. It's called structure, str. If I run str on snp_genes, it gives us the same information: the mode is character, the elements run from one to four, so four pieces of information, and these are some of the values there. If it were a longer vector, it would give us only the first few values instead of all of them. I think the cutoff is around 1,000 when printed out. This can be changed, though printing 1,000 values at the same time is rarely useful.
So this is how we create a vector and how we can see its properties. Once we've created a vector, we actually want to start working with it. A few things we might need to do are to get a value from the vector, or to subset the vector to a range, and so forth. Let's create a few more vectors of different types so that we can see how this works. Let's start with a new object called snps. We're going to use combine again, and I'm going to put a few SNP identifiers here: rs53576, then rs1815739, then rs6152 and finally rs1799971. So these are four SNP identifiers from dbSNP. I pressed Enter, not Ctrl+Enter, so the command hasn't been executed yet. Let's also create another one, which is the chromosomes.
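The three inspection functions mentioned above can be sketched like this. The scalar examples at the end are added only to show the other two common modes.

```r
# A character vector built with c() ("combine")
snp_genes <- c("OXTR", "ACTN3", "AR", "OPRM1")

mode(snp_genes)      # "character": the type of the values
length(snp_genes)    # 4: how many values the vector holds
str(snp_genes)       # compact summary of type, index range and first values

# A single value is simply a vector of length one
chromosome_name <- "chr02"
mode(chromosome_name)
length(chromosome_name)

# The other two common modes
mode(23)             # "numeric"
mode(TRUE)           # "logical"
```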
The idea here is to capture which chromosome each SNP comes from. So let's say chromosome 3, chromosome 11, chromosome X and chromosome 6. Let's also put in the SNP positions. Sorry, I forgot the dash in my assignment. Again I'm going to use combine, and this time I'm going to enter only numeric values: 8762685, 66560624, 67545785 and 154039662. So we have defined those three variables here, but we haven't run them yet, which is why we don't see them in the environment. I'm going to highlight them and click on Run, which runs all those commands at once. Now you can see that I have four vectors here: the one that we created before, snp_genes, and now also snp_chromosomes, snp_positions and snps. And you can see that, with the exception of snp_positions, which is numeric because we literally put in numbers, all of them are character vectors.
Now that we have them, let's say that we want to access a value from one of them, for example the third value of the snp_genes vector. To do that, I type snp_genes, then square brackets, and inside them the index that I'm looking for. In this case, we want the gene in position three. If I run that, you see that it's AR. Bear in mind, and this is an important point, that in R indexes start at one. So here, as you can see, OXTR is one, ACTN3 is two, and AR is three. That's why, when I retrieve the third element, it gives me AR. In addition to retrieving a single element, I can retrieve multiple ones. I can use, for example, a range. Let's say that we want to retrieve all values from one to three. What I can type is one, colon, three.
If I run this, it gives me a subset containing only values one, two and three, the same as here. If I don't want the values to be sequential, I can use the same approach, but instead of specifying a range I use c to combine specific indexes, for example positions one, three and four. With this command, which I'm going to run here, I get the first, the third and the fourth value as a single new vector. I can also combine these two notations. For example, I can take snp_genes and say that I want the elements one to three and also four. What I'm actually doing is creating a vector of elements one, two and three, and to this vector I also add the fourth one. If I run this... okay, as you can see, I made a typo: it tells me that spn_genes is not found, because I swapped two letters. So I'm going to change this to snp_genes. If I run it now, it gives me the error "incorrect number of dimensions", because I put the four outside the call to c. What I meant was to have the closing parenthesis not here but here, because I want to combine the range one to three with four inside c. I'm showing this because it is quite a common mistake. In the wrong version, the comma inside the square brackets means that I'm trying to access multiple dimensions: instead of saying that I want the combined vector of elements one to three and four, I said that I want to index along two different dimensions. So I'm going to change this back to the correct form. I run it, and it gives me essentially my original vector. So this is how we can access elements or subsets of the elements in a vector. But the question is, what happens if we want to add new elements to the vector? Let's say that I want to add more genes to the snp_genes vector.
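The indexing forms walked through above can be sketched as follows. The names third, first_three, picked, all_four and msg are introduced here only for illustration, and tryCatch is used so that the deliberate mistake does not stop the script.

```r
snp_genes <- c("OXTR", "ACTN3", "AR", "OPRM1")

third       <- snp_genes[3]            # single element; indexing starts at 1
first_three <- snp_genes[1:3]          # a range of positions
picked      <- snp_genes[c(1, 3, 4)]   # specific, non-sequential positions
all_four    <- snp_genes[c(1:3, 4)]    # a range combined with one more index

# The common mistake: a comma directly inside [ ] asks for a second dimension,
# which a plain vector does not have. The error message is captured as text.
msg <- tryCatch(snp_genes[1:3, 4], error = function(e) conditionMessage(e))
msg
```

Keeping every index inside a single call to c() is what makes the combined form work.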
This is similar to what we did here: as you can see, we had a vector of elements one, two and three, and we extended it with element four. In the same way, I can say that I want to combine the existing snp_genes vector with a few more genes: CYP1A1, and let's also add APOA5. If I run this, it gives me a new vector of one, two, three, four, five, six elements, because we already had four in snp_genes and we added two more. However, bear in mind that if I print the content of snp_genes again, only my original four elements are there. I combined them, but I did not save the result back to my variable. To do that, I have to overwrite my original variable, snp_genes, with the output produced by this command. So again, I'm combining the contents of the snp_genes vector with these two additional elements. If I run this now, it doesn't print anything, but you can see here that snp_genes now has a range of one to six, and I have the additional values here: the four original ones plus the two I've just added. Please keep in mind that you are changing your original vector here, so this should only be done when you are aware that you are updating your original data by adding new elements to your vector.
So, as you can see, by using positive indexes I can access elements in a vector. Let's say that we want to do the opposite: we don't want to add elements, we want to remove them. To do that, we use negative values. I'm going to type snp_genes again, but this time I put minus six in the square brackets. If I run this, it will give us... sorry, the interface froze for a bit there.
If we run this, you'll see that from the vector that contained six elements, from OXTR up to APOA5, the sixth element (one, two, three, four, five, six, which is this one) is now removed. So by putting minus six in the square brackets, we've dropped that particular value from the result. Let's say that we want to save this change. In the same way as before, I'm going to overwrite the original snp_genes with snp_genes indexed by minus six. If I run this, I can check that my vector has changed from a length of six to a length of five.
Another point that you should be aware of is that you can always explicitly add a value at a specific position, and here we use double brackets. For example, I can say that in snp_genes I want to put the element APOA5 in position seven. If I run this, it works, and now you see that my vector has seven elements: from five, it went to seven. Let's have a look at what this looks like. I print it here, and as you can see the original five elements are still here, up to CYP1A1. Then we have something called NA, and then the one that we just defined. What happened is that, because we explicitly asked R to put this element in position seven, it created an NA, a missing value (NA stands for not available), for position six, and then it added the element in position seven. This might be a good or a not so good thing, depending on what you're trying to do, but it is something to be aware of at all times.
So, what we've seen so far with vectors: we create a vector with combine, putting in the elements that we want. We can do this for numerics like this one. If we want to access a particular element, we request the value by its index number, with indexes starting at one, and we can use a range like this one, or a combination of ranges like this one.
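Adding, removing and placing elements, as described above, can be sketched as follows. The double-bracket assignment follows the video; a single bracket, snp_genes[7], would give the same result here.

```r
snp_genes <- c("OXTR", "ACTN3", "AR", "OPRM1")

# Combining without assigning only prints a longer vector
c(snp_genes, "CYP1A1", "APOA5")
length(snp_genes)                    # still 4: nothing was saved

# Overwrite the variable to keep the added elements
snp_genes <- c(snp_genes, "CYP1A1", "APOA5")

# A negative index drops the element at that position
snp_genes[-6]                        # shows five genes; snp_genes is unchanged
snp_genes <- snp_genes[-6]           # overwrite to keep the removal

# Assigning past the end of the vector fills the gap with NA
snp_genes[[7]] <- "APOA5"
snp_genes                            # position 6 is now NA
```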
If we want to remove an element, we use a negative index, as when we removed the element in position six, and we can add an element at a particular position with the double brackets, as you've seen here.
Another way of extracting information, of subsetting, is logical subsetting. Let's use snp_positions, which is numeric. I can put an index here, so give me the value in position three, Ctrl+Enter, and it gives me the correct value. But let's say that I want to retrieve all values that are greater than, let's say, one million... what number is this? It's 10 million... no, 100 million, sorry. So, greater than 100 million. This gives me a single element, because if we look at our original vector, the only value that is big enough is the fourth one. So I can use this kind of logical operation to retrieve information.
Just to provide the context, the logical operators are: less than (<), less than or equal to (<=), greater than (>), greater than or equal to (>=), and exactly equal to (==), which is a double equals sign, not a single one. Please be aware of that, as it is one of the common mistakes. Not equal is an exclamation mark followed by an equals sign (!=). Then we have the logical or, which is the vertical bar (|), and the logical and, which is the ampersand (&). It is good to keep these in mind.
So how did this actually work? This is a nice structure to have in mind. Here, this is a vector, right? Let me copy this command right here. What I've put in as the index is basically the application of a logical operation to a vector of numbers. If I run this by itself, you see that what it produces is a logical vector: FALSE, FALSE, FALSE, TRUE. Basically, it applies the operation to each individual element of my original vector.
So it checks whether the first value is greater than 100 million, whether the second is greater than 100 million, and so forth, and depending on the outcome it gives us FALSE or TRUE. The only TRUE value is for the last element, the fourth one. So what happened here is that we said: give me all the positions for which the original value is greater than 100 million. R creates a logical vector and returns only the elements for which the operation gave TRUE. This answers the question, which positions are greater than 100 million? But I can also ask, what is the index of those? I might be interested to know where in the vector this particular position is. To do that, I use a function called which, and I put exactly the same operation inside it: which index, or which indices, of my original vector have values that are greater than 100 million? And you can see here that it's four.
Why is this important, and why do I stress it a bit more? Because when we program and build up a script, we might not always know the inputs, the values that are going to be used when our code runs. So instead of a predetermined value like the 100 million that I typed here, we can use variables as parameters. In this case, I can define snp_marker_cutoff with the value of 100 million; I'm going to copy it right here. I run this, and you can see that we have a new variable here. Then I ask for the snp_positions for which snp_positions is greater than snp_marker_cutoff.
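Logical subsetting, which() and a cutoff stored in a variable can be sketched as follows. The position values are assumed to match the Galaxy Training Network tutorial; the names is_large, large, where and large_again are introduced only for illustration.

```r
# Assumed SNP positions (numeric vector)
snp_positions <- c(8762685, 66560624, 67545785, 154039662)

# The comparison is applied to every element and yields a logical vector
is_large <- snp_positions > 100000000
is_large                                 # FALSE FALSE FALSE TRUE

# Using the logical vector as an index keeps only the TRUE positions
large <- snp_positions[snp_positions > 100000000]

# which() returns the indices instead of the values
where <- which(snp_positions > 100000000)

# Keeping the threshold in a variable means it is changed in one place only
snp_marker_cutoff <- 100000000
large_again <- snp_positions[snp_positions > snp_marker_cutoff]
```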
If I run this, it gives me exactly the same value. Done this way, I don't need to change the value at every point in my code; I change it once, in line 79 in this instance, and from there onward I only refer to it through the variable snp_marker_cutoff.
One more point to close with vectors is how we can check whether we have missing values. For that there is a function called is.na. It checks whatever you give it as input, for example snp_genes, for missing values. As you can see, if I run this, the output is FALSE, FALSE, FALSE, FALSE, FALSE, TRUE, FALSE. If you remember, because we inserted a new gene in position seven, the sixth position became an NA. So by using is.na we can get this kind of information back. This is a good tip to remember.
Another point is that we might want to look up specific values by name. For that we can use another operator, called in, which is typed as percent, in, percent (%in%). Let's say that we want to check whether two particular genes, one of them APOA5, are present in snp_genes. If I run this, it gives me TRUE, TRUE. If I also add, let's say, TP53 and run it, it gives me TRUE, TRUE, FALSE: TRUE for the first two, and for the last one it says, okay, I cannot find this one in snp_genes. So this is an operator that takes every element of the vector on the left and checks whether it is present in the vector on the right, here snp_genes.
Continuing with vectors, if you go again to the training material, you'll see that there are a few exercises there.
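The two checks described above can be sketched as follows. The vector is written out with its NA so that the block runs on its own, and ACTN3 is an assumed choice for the first gene in the lookup, since the video names only APOA5 and TP53.

```r
# snp_genes as it stands after the earlier steps, with a gap at position 6
snp_genes <- c("OXTR", "ACTN3", "AR", "OPRM1", "CYP1A1", NA, "APOA5")

# is.na() flags missing values element by element
missing <- is.na(snp_genes)
missing            # only the sixth element is TRUE
which(missing)     # 6

# %in% tests each element on the left for membership in the vector on the right
present <- c("ACTN3", "APOA5", "TP53") %in% snp_genes
present            # TRUE TRUE FALSE
```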
I definitely encourage you to have a closer look and try them out, so that you get a better understanding of how vectors work.
I'm going to continue with another key point in R, which is coercing values. Coercion means asking R, in some cases, to change the type of the data in a vector to a different type. For example, we might have a vector that holds the positions of SNPs, but by some accident some text was thrown in there. For this reason, everything is going to be changed to character. Let's first see how this works, and then I'll show you how we can change a vector from one type to another.
If you remember, we have the chromosomes, so let's check snp_chromosomes again. If I run this, we have 3, 11, X and 6. If I check the mode of snp_chromosomes, we see that it is character. This is because, if I scroll up a bit to where we defined the chromosomes, I explicitly entered them as characters: I put quotation marks before and after each one. Let's create a second variable and call it snp_chromosomes_2. This time I'm going to enter them as numbers: 3, 11, then X, which is not actually a number, so I put it in quotation marks as a symbol, and 6. So here I have a mix of numbers and strings. If I run this, you see that it works absolutely fine. But let's check the mode. Mode of snp_chromosomes_2: if I run this, it's still character. So what happened? Because a vector has to hold a single type of information, R needs to transform all the contents to the most general common type. In other words, it changes all of the values into a single type, whichever one can represent all of them. 3, 11 and 6 are numbers.
So they are easily converted to text, by saying that this is the symbol 3 instead of the actual number 3. X, on the other hand, is not easy to change into a number, because R does not know what number X corresponds to. So in order to have all of these as a single type, R changes everything into character. This is coercion: R automatically changes the type of the values into the most general common type. However, we can also force R to change, where possible, from one type to another, but we have to be explicit about it.
I'm going to copy the positions from here, but this time I'm going to save them into a different variable called snp_positions_2, and instead of entering them as numbers, I'm going to put everything in quotation marks. So instead of dealing with them as numbers, I'm going to deal with them as characters. If I run this, we see it here, and you can see that it's character. If I run mode on snp_positions_2, I can verify that it is character. I can also access one of the elements, say element one of snp_positions_2, and it gives me the first one as a character string. However, this is not a very convenient way of dealing with them, because we know, since we put the data there, that all of these should be numbers. This is where coercion comes in. We say that we want to change snp_positions_2 by using one of the as. functions, where we can select the target type; here it is as.numeric. So we convert snp_positions_2 from character to numeric, and this is done explicitly by us. Let's look at the mode of snp_positions_2 again: now you see that it has been converted to numeric. And if I check it here, you can see that indeed all of the values are treated as numbers. This was fairly straightforward, because the positions were all characters that are basically numbers, so the conversion was not difficult.
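Implicit coercion and a clean explicit conversion can be sketched as follows. The variable names follow the Galaxy Training Network naming and the position strings are assumed values; mode_before and mode_after are introduced only for illustration.

```r
# Mixing numbers and text: R silently coerces everything to character
snp_chromosomes_2 <- c(3, 11, "X", 6)
mode(snp_chromosomes_2)             # "character"
snp_chromosomes_2                   # "3" "11" "X" "6"

# Numbers typed inside quotation marks are stored as character strings
snp_positions_2 <- c("8762685", "66560624", "67545785", "154039662")
mode_before <- mode(snp_positions_2)
snp_positions_2[1]                  # "8762685", a string rather than a number

# Explicit coercion: every string is a valid number, so nothing is lost
snp_positions_2 <- as.numeric(snp_positions_2)
mode_after <- mode(snp_positions_2)
```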
However, let's try to do the same thing for the chromosomes. Let's convert snp_chromosomes_2 with as.numeric and save it to snp_chromosomes_2 again. As a reminder, we have numbers here, 3, 11 and 6, but we also have something that is not easy to convert to a number; we cannot think of how. If I run this, it executes, but you see that R puts out a warning, which says: because you explicitly asked me to change the type of the values from characters to numbers, I've done the best I could, but the ones I could not convert I've changed to NAs. So if we check the mode of our vector, it is numeric. But if we print it, you can see that we have the numbers 3, 11 and 6, but also a missing value. So it works, but at a cost. This again might be a good or a bad thing, but it's something to always be aware of.
To summarize this a bit before we move on to lists: it's always important to be careful, and to check the results, when explicitly coercing one data type into another. Then there is implicit coercion, the change that R makes by itself, as it did here. This is a safe conversion, because no loss of information actually happens. It's also always a good plan to use the structure function, str, on a variable before using it and before applying a conversion, so that we are always aware of what we have. Implicit coercion, again, is fairly safe, but imagine a vector of 10,000 numbers and one character: looking briefly through the data, you might have the misconception that it is a numeric vector, but because there was, by accident, a character in there somewhere, R will implicitly coerce everything to character.
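The lossy conversion described above, together with the check that follows it, can be sketched like this. The names converted and lost are introduced only for illustration, and the conversion is stored in a new variable so that the original stays available for comparison.

```r
snp_chromosomes_2 <- c(3, 11, "X", 6)

# Inspect the structure first: the vector is character because of "X"
str(snp_chromosomes_2)

# Explicit coercion runs, but R warns that NAs were introduced
converted <- as.numeric(snp_chromosomes_2)
converted                    # 3 11 NA 6

# Locate the values that could not be converted
lost <- which(is.na(converted))
snp_chromosomes_2[lost]      # "X"
```

Comparing is.na() before and after a conversion is one simple way to see exactly what was lost.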
So checking the actual structure right after loading or creating a vector will help you identify such issues easily.
The final point to keep in mind is another type of structure that R provides, called lists. As opposed to vectors, lists are able to contain multiple different data types. This is extremely useful because it allows us to store many pieces of information together. If you look through the Galaxy Training Network tutorial, you'll see a few links with additional material about lists. One of the easiest ways to convey how this works is an example: let's say that we want to combine all the pieces of information that we have so far by adding them into a list, and let's call it snp_data. To make it clearer what is happening here, I'm going to split this over multiple lines, so all of this is within the list function. I'm going to have an element in my list called genes, which will contain snp_genes, comma. I'm going to have the reference SNPs in a second element, and these are going to be my snps. Let me check: I have snps, there it is. I'm just making sure that everything is in place. We also want to have the chromosome, which takes its information from snp_chromosomes. And finally, we also want to have a position, which is going to be snp_positions. I'm going to highlight and run everything. Now you see that we have a different kind of entry here: so that we can see them separately, RStudio shows this information in a different section called Data. As opposed to the entries under Values, each of which holds a single type, a list can hold multiple different types. Here you can see that we have a character vector, a character vector, a character vector and a numeric vector. And we can reach this information directly.
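Building such a list and reaching into it can be sketched as follows. The element name reference_snp and the position values are assumptions, and the original four-gene version of snp_genes is used to keep the example short.

```r
snp_genes       <- c("OXTR", "ACTN3", "AR", "OPRM1")
snps            <- c("rs53576", "rs1815739", "rs6152", "rs1799971")
snp_chromosomes <- c("3", "11", "X", "6")
snp_positions   <- c(8762685, 66560624, 67545785, 154039662)  # assumed values

# A list can hold vectors of different types side by side
snp_data <- list(
  genes         = snp_genes,
  reference_snp = snps,            # hypothetical element name
  chromosome    = snp_chromosomes,
  position      = snp_positions
)

str(snp_data)                      # three character vectors and one numeric

# The dollar sign extracts one element, which is again an ordinary vector
snp_data$position
second_position <- snp_data$position[2]
```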
If I want to access this data, I can use the dollar sign: I can say, okay, give me the position. By using that, it gives me the vector of elements there. And just as with vectors, because this is the position vector, I can ask for the element in position two, for example. If I run this, it gives me the second element of my position vector. So accessing the contents of a list is done with the dollar sign, and as soon as we get into a list we actually have vectors, so we can apply exactly the same processes we've seen before. In other words, lists are quite an elegant way of combining information so that it is packed into a single entity that can be accessed at any point.
I'm going to save this so that we have the script ready for the next video. I will, again, definitely encourage you to go through the tutorial on the Galaxy Training Network; there are a few exercises and additional links there. I hope you found this useful. Bye.