Francis wanted me to share these notes with you, first of all to thank the bioinformatics workshop, but also to say that all the lectures are going to be made available. It's going to be open access: you can share them and everything, but you've got to make sure that you follow these rules if you do so.

So we're talking about missing values and how to deal with them in R. But before we go on, I'd like to say a few words. During the break I was asked a couple of questions about forums and where you can ask questions about R, and I was also asked: what about MATLAB?

For those who don't know MATLAB, it is a very popular software package, I would say more geared towards engineers and applied mathematicians. It's commercial software and it's very powerful. You can do lots of things with it, you can also use it for bioinformatics, and they also have packages and tools. Don't get me wrong, it's a very, very good package, and at some level it's more powerful than R: you can do things that R can't. If you're looking for software to do image analysis, MATLAB is probably still the best thing you can use for that.

On the other hand, it is commercial and it's fairly expensive. Maybe that's not a big deal for you: you can afford to buy a license, or your company, your institution or your school probably has licenses you can use. But the downside is that, first of all, we couldn't really do a workshop like this, because we couldn't expect everyone to have a license. Also, fewer people are going to be using MATLAB than R, because it is more expensive, and it is not open source, so you cannot know exactly what it does. So the community around you is going to be a lot smaller, because fewer people are going to contribute new things and new packages for you to use.
Of course you're going to have full customer support with MATLAB, right? You can call them and say, it's not working, fix it for me. I'm not sure they will do it, and if they do, it's probably going to take a while. You don't have that with R, but you have your community around you. There are tons of mailing lists for R users, and I can assure you that if you ask a question on a mailing list you're going to get an answer, probably in less than five minutes. It is truly amazing how many people working with R are willing to help you.

Of course, if you subscribe to a mailing list today and ask, how do I create a vector, people are going to be very angry at you, just because a lot of people have probably asked the same question in the past, and you can search the old mailing list archives and get the answer without asking it again.

So if you decide to subscribe, and I encourage you to do it if you don't mind receiving about 10,000 emails per day, then go ahead. The good thing is that not only will you be able to ask questions that are not already answered in the archive, but you're also going to be able to help people who ask those same questions in the future. If someone is asking how to do something and you know how to do it, you can answer right away. So even though you don't have customer support, I think you have much better support coming from the community, people who are just like you, working with R.

If you go to the R website you will see that there are several mailing lists. Some of them are more targeted than others, and there's also just a general R mailing list.
I'm subscribed to the R Mac mailing list, because it tells you more about the things that are specific to R on the Mac, like the GUI. I'm also subscribed to one of the Bioconductor mailing lists, the one on high-throughput sequencing. This is something that I'm not really going to talk about during this workshop, just because the data sets generated by high-throughput sequencers are very large and can be cumbersome for an introduction to R. But there's been tons of support in Bioconductor for high-throughput sequencing. There are lots of packages already, and there's something like five people at the Fred Hutchinson Cancer Research Center working on these packages all the time.

I receive so many emails a day on that list about what's going on and how to do things. People say, hey, by the way, can we do that with this package? No, but let me do it. And one hour later you can do it. It is truly amazing, so I really encourage you to subscribe. This is just to say that R is maybe not supported by a company, but there's the community around you, and I think that is actually much better.
Okay, so going back to missing values. In statistics and in bioinformatics you're going to have lots of missing values. If you do a gene expression experiment and you have, let's say, mice, and you need to take some samples from these mice, some of the mice may die sooner, and therefore you will have some missing data. You need to deal with that. You're not just going to throw away the experiment because one of the values is missing.

One way you can deal with that is using this symbol, NA, which means that there's a missing value. So here we can create a vector where one of the values is missing. Let's say these are a few patients that you had weights on: you have the weights for all the patients you could actually weigh, but there's one that is missing, so you just put an NA.

The bad thing about missing values is that if you don't know how to deal with them, R is going to tell you: there's a missing value, what do I do? So let's see how that works. Here's the example with the missing value. Copy and paste that into the console, look at the values, and you can see that there's an NA. It is just a numerical vector, just numbers, but because there's a missing value there's an NA.

Now if I want to compute the mean, and we're going to see that later on in the exploratory data analysis, we can just do mean(weight). Well, R gives back NA: there's a missing value, I don't know how to deal with it, so I'm not going to compute the mean of that sample.

However, there are ways to deal with it. If you look at ?mean, you will see that there's an option for how to deal with missing values. One way is to set na.rm = TRUE, meaning remove the missing values and compute the mean. So if you do that, what it's going to do
is just remove that missing value and compute the mean of the remaining numbers. That's really nice, because there are ways to deal with missing values, and I would say most of the functions in R have that. The option you're going to find in a lot of functions that compute summary statistics on a sample is na.rm, equal to TRUE or FALSE. FALSE is the default, in which case it doesn't compute anything because there's a missing value. If you turn it to TRUE, it's just going to remove the missing values and work with the remaining data points.

Yes? Because sometimes you should not remove them. There are many cases where, if you remove them, you're going to get a biased sample. Let's say you're looking at survival rate and some people are going to die. You say, well, just remove them, and of course the survival rate is great, because you removed all the missing values. So you shouldn't do that. You need to be a little bit careful about the way you handle missing values, and in fact that is why the default is FALSE.

Yes? So that's a good question. There are a lot of things, and let me remind you we only have a couple of days here, so we're not going to know everything after two days. Try to ask the questions that you want answered in these two days, but if not, you can always use the question mark, help, and so forth.

We're going to see how we can read in data sets. One of the commands is read.table, and one of its options is how to deal with missing data. By default it will do something, but maybe it's not exactly what you want: maybe you've got a special code for your missing data. I would say that in many software packages NA is pretty standard, so typically R will know what to do, but if not, you can tune that to your data set. Okay, so this is just to tell you
that there's another special kind of value that you can see now, which is NA for missing data. We'll talk a little bit more about missing data when we encounter it, but for now we're not really going to worry about that.

Okay, so we've talked about vectors. Vectors are nice, but it would be nice to have another way to summarize or store data in R, and that is what we call matrices or arrays. A matrix is just a two-dimensional array where you're going to put numbers. Of course, because R is a statistical and therefore kind of a mathematical package, you can work on matrices: if you know what a matrix is, you can do matrix multiplication, you can do inversion, you can do lots of fancy things that we're not really going to talk about. However, a matrix is also just a way to hold a table, so we're going to play with that a little bit.

So here we're going to use a command that you've never seen before. This is another way to create a vector: 1:12 says, create a vector going from 1 to 12, spaced by 1. You can also do 3 to 12, or 4 to 12, or 5 to 12, for example. So you can start at something that's not one, you can go to whatever you want, and the vector just goes one by one.

Yes, that's another way to create a vector: you don't have to use c(). However, with this one you can only generate a sequence that goes in steps of one. With c() you can concatenate any numbers you want, so it's more flexible, but this is a quick way to create a vector.

By the way, we're going to see a lot of functions, and of course I'm not going to give you definitions for all of them. Sometimes we'll see them through examples; I think the best way to learn the language is to work through examples. So let's go back to creating that vector x, 1 to 12. So we know that we can
query the length of x by typing length(x), and it's 12, because it's just 1 to 12. For a matrix or a table you have dimensions: you need to know how many rows and how many columns you have. Here x is a vector, so if I type dim(x), which is what you would typically do for a matrix, it's going to give me nothing back (NULL). It's not a matrix, it doesn't have a dimension. It's just a vector, so it has a length.

However, I can actually force it to be a matrix. This is not really the way you would typically do it, but it's just to make you understand something. I can say I want the dimension of x to be three rows and four columns. If we look at x now, you're going to see three rows and four columns. We did not change the values in x; they are still the same. We just told R to display it as a matrix.

This is just to tell you that for R, x is really just a bunch of numbers that are somewhere in memory, and we can tell R to display these numbers in specific ways. When we created the vector, we told it to display them as a vector. Now I tell it to change the dimensions of x to make it a matrix, and it displays it as a matrix, even though we do not change the actual values.

This is one way to create a matrix, but it's not the best way to do it. It's just to show you that a matrix is really nothing more complicated than a vector: it's just a vector that's arranged by rows and columns.

Typically you would create a matrix using the matrix function. In that function you put a bunch of numbers, you specify how many rows you want, and then you can specify whether you want to arrange these numbers by rows or by columns. So let's look at that. Here I'm doing the exact same thing. I say, take the
numbers 1 to 12, arrange them in a matrix with three rows, and go by row. What going by row means is that you fill the first row with 1, 2, 3, 4, then the next with 5, 6, 7, 8, and so forth. And if you set byrow equal to FALSE, well, guess what, you go by column: 1, 2, 3 down the first column, then 4, 5, 6, then 7, 8, 9, then 10, 11, 12. So it's just the way you arrange the numbers in your matrix. A matrix is just a two-dimensional array of numbers, and you can arrange them as you want.

It's good because you can also give names to rows and columns. Here let's say I want to give some names to the rows of x, and I'm going to put some names on the columns of x. If I look at x now, you're going to see A, B, C and 1, 2, x, y, so I get some names. This is good because you can arrange and summarize numbers in a table and put names on it. Here you could put, say, this is patient one on day A, on day B, on day C, or something like that. So it's a nice way to summarize numbers. Any questions about this? It's pretty easy, pretty standard.

There are a lot of ways in R to put numbers in memory and display them. We've seen how to create matrices and arrays, and there are other ways, just like for vectors, where we could use c() or rep() or 1:12. Why do we need that many ways to create vectors and matrices? Just because some ways are more convenient than others. They're going to be faster, not in terms of the time it takes R to create the vector or the matrix, but in terms of how much typing you need to do to create the object.

Matrices can also be formed by gluing rows or columns together using the cbind and rbind functions. This is the equivalent of c() for vectors: cbind meaning
I'm going to bind columns, rbind meaning I'm going to bind rows. So let's look at an example. I'm going to create two vectors, x1 and x2. If I look at x1 and x2, these are just vectors. Then I can create a matrix called my matrix by binding x1 and x2 by row. So you've got 1, 2, 3, 4 and 5, 6, 7, 8, and it's binding by row because x1 is the first row and x2 is the second row. Let's try the same thing with cbind and, guess what, you get the same thing but bound as columns. And guys, if you've got questions, please ask me.

What's really nice too is that you can not only bind vectors with vectors, you can also bind a matrix with another vector, as long as the dimensions match. So let's go back to this one. We take x1 and x2, we create another vector y1, and we have my matrix. Let's copy and paste that. We've got this matrix, and now I've got another vector called y1, and what I'd like to do is paste that vector y1 right here, as another column of my matrix.

This is very similar to what I was asked earlier: can we add another number to a vector that already exists? The answer is yes. Here we can just create a new matrix that will be the previous matrix with another vector glued at the end. We can do that, display my new matrix, and this is what we get: we glued that vector at the end of the matrix. What if I want to glue that vector at the beginning of the matrix, as the first column? You can do the same thing, just put it first. So you can put it wherever you want. Does that make sense?

Yes, we're going to see that there's a very easy way to do it. You can do that too; you need to do a little bit more, but you can do it, and we're going to see that when we talk more about indexing.
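To make the matrix steps above concrete, here is a minimal sketch in R. The object names (my.matrix, my.new.matrix and so on) and the values of y1 are assumptions for illustration; the transcript does not record exactly what was typed during the session.

```r
# A vector built with the colon operator
x <- 1:12
length(x)   # 12
dim(x)      # NULL: a plain vector has no dimensions

# Filling a 3-row matrix by row, or by column (the default)
m.by.row <- matrix(1:12, nrow = 3, byrow = TRUE)
m.by.col <- matrix(1:12, nrow = 3, byrow = FALSE)

# Naming rows and columns (labels are illustrative)
rownames(m.by.row) <- c("A", "B", "C")
colnames(m.by.row) <- c("1", "2", "x", "y")

# Gluing two vectors together as rows or as columns
x1 <- c(1, 2, 3, 4)
x2 <- c(5, 6, 7, 8)
my.matrix <- rbind(x1, x2)   # 2 rows, 4 columns
my.cols   <- cbind(x1, x2)   # 4 rows, 2 columns

# Adding one more column at the end or at the beginning
y1 <- c(9, 10)               # hypothetical values; length must equal nrow(my.matrix)
my.new.matrix   <- cbind(my.matrix, y1)
my.new.matrix.2 <- cbind(y1, my.matrix)
colnames(my.new.matrix)      # only the glued column carries a name, "y1"
```

Only the column that came from a named variable gets a column name automatically; the others stay unnamed until you set them with colnames.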
There are various ways to do indexing, and basically what you need to do is index the right column or the right row, and then you can play with it.

Yeah, because here you created a variable that's called y1, so by default R knows that this one is called y1 and adds a column called y1. The other ones didn't have any names, so it doesn't put anything. This is the default, but afterwards you can change the names just as we've done before, with row names, column names, and so forth.

Okay, so in statistics it's very common to have what we call categorical data. Such variables would be, for example, male or female, or a group: placebo, treatment one, treatment two, this drug, that drug, and so forth. It's very common to have these sorts of things, and we want a way to handle these kinds of variables very easily. For example, it would be nice if we could just condition on a specific variable and say, I just want to look at the males, or I just want to look at the females.

The way we're going to deal with this is by creating variables that are called factors. A factor has a set of levels. If we create a factor for sex, it will be male or female, for example: two levels. The good thing about factors, once again, is that R is going to know how to deal with these kinds of variables, because it knows they are categorical, and you can also assign meaningful names to the categories.

We're going to see that through an example. Here I'm going to use a categorical variable called pain, and it's going to be the level of pain. Again, this example is taken from the book I gave you the reference for. Here you could have basically no pain, a little bit of pain, more pain, a lot of pain. So far we've just created a vector: if we look at pain, this is just a vector, nothing special. So we need to tell R to turn that into a factor. One way to do that is with factor or as.factor. So we're
going to create a new variable called fpain, f for factor, and we're going to say, take that vector (in fact I could have just put pain here if I wanted to) as a factor. If we type it, you're going to see it's different: it's going to say levels 0, 1, 2, 3, because it knows it's a factor and there are four levels, 0, 1, 2, 3.

Yes? Yes, that's what I said: you could have put just pain here, exactly the same. Yes? as.factor will try to convert an existing vector into a factor; with factor you can create factors from scratch. In R there are lots of functions like that. It could be a vector, for example: if you want to convert something to something else, you're going to use as.vector, as.matrix, and so forth. If you want to create something from scratch, you're just going to use vector, matrix, etc. So the as. prefix is to say, convert it to something.

Now the good thing about these factors is that we can give them meaningful names. So now if we type fpain, it's going to say none, severe, medium, medium, mild. So we've got meaningful names.

When I've got something, how do I know whether it is a vector, a factor, whatever? This is related to the as.factor versus factor question. We can just test whether it is a factor: is.factor on fpain says TRUE. If I use pain it's going to be FALSE, because it's a vector. And I can do the same thing with vector: is.vector of pain is TRUE, but for fpain it's FALSE. So a factor is not a vector anymore for R, even though it looks just like a vector. It's just coded in a different way, and R knows to deal with a factor differently than with a vector.

In fact you can use the is. functions for a lot of things: you can do that for matrices and so forth. Typically, for any type of object, you can ask whether it is of that type, and that's very helpful, because sometimes you forget. You know, you've got lots of
variables, you're using a function, you don't understand why it doesn't really work, and you can just test whether it's the right type. We're not really going to go into that, but I just wanted to point out the difference between the two. I don't know if Boris is going to talk about that, but factors can be very helpful when you're looking at linear regression and these types of things, because you can condition on something.

Okay, so we've seen vectors, we've seen matrices, we've seen vectors of numerics, logicals, characters. It's great, but you always have to be very consistent in the way you store the objects: it has to be a vector of numerics, a matrix of numerics, a vector of characters. Sometimes it's nice to have a way to just bundle a bunch of objects together that you can work with, and that way is called a list. A list lets you combine objects of possibly different kinds or sizes into a larger composite object.

We're going to see an example of that. You can really think of a list as a complex vector: it's kind of like a vector, but you can put a bunch of stuff in it, and it doesn't matter if it's not the same kind of thing or not the same size. So let's create a vector x, a factor y, and a vector of characters z. Now I'm going to create a list where the first element is going to be age, the second one is going to be sex, and the third one is just going to be some metadata related to the other two variables. You can see here that x and y do not have the exact same length, and z is completely unrelated, but we can actually bundle all three of these variables into one object called a list.

If we look at my list, you can see the elements are age, sex and metadata. When you create a list, the names that you give the variables are used by R as the names of the elements in your list. Then you can access each specific element by typing its name. If I want to look at age, I can
just do dollar sign and the name of the variable. This is the way to access specific elements in the list using their names. If you want to look at meta, you just do the same thing. So this is really a great way to put a lot of objects together, and in fact it's one of the beauties of R that you can create lists like that. For those who know programming a little bit, a list is kind of like a structure: you can put a bunch of elements of different kinds into a bundle.

Okay, so here's another great type of object in R: it's called a data frame. A data frame is just like a data matrix or a data set. We're dealing with data sets, so it would be nice to have a way to store them. We've seen the two-dimensional array, which is almost like a data frame, but a data frame is a little bit more flexible. In fact a data frame is almost like a list. It's used when you want to put together objects that have the same length, and where the elements at the same position across the different vectors come from the same experimental unit. Let's say you've got several patients, and on each of these patients you take several measurements: then you can put that into what's called a data frame.

You're going to understand very quickly what a data frame can be used for. Before, we had the variable sex. Here I'm going to create a data frame with age and sex, and let's display that. You can see that in a data frame there is a grouping structure: this is one patient, this is another patient, this is the third patient, and so forth. The age and the sex are paired together, coming from the same unit, the same person.

Just like a list, you can access each of the elements of your data frame using its name, and we can try that. That wasn't it; that should be my list. Okay, so it did not really give you an error, but sometimes you need to be a little bit
careful, because the objects should have the same size. It's like what I talked about earlier: sometimes R will say, that's fine, I can do it, I'm just going to recycle the elements of that vector. So of course R is great and does a lot of things, but you need to be careful about what you do.

We're going to be working with data frames because it's a nice way to summarize things. For example, if you're looking at a gene expression data set, you're going to have genes, which could be the units of your experiment, and then you're going to have all the variables you're looking at: expression in the first array, at time one, time two, time three, under this condition, and so forth. So we're going to be using data frames a lot.

So why do we need data frames if a data frame is simply a list? If we go back to this, it looks just like a list in the sense that I can access an individual variable using the dollar sign and the name, which is what we've done here. Well, to begin with it's a more efficient way to store things, because there's a relationship between the elements of each variable. But also, and more importantly, it makes it much easier to index things. For example, if you want to get all the measurements from patient one, you can do that very efficiently. In a list you would need to have each variable in a different element of the list, and it would be a lot more difficult, because you don't know which one goes with which. So when you've got a data set with rows and columns, a data frame is the natural way.

And why is it different from just a data matrix? Because the columns can have different types. Here you could have a factor, a vector of characters, a vector of numerics, and so forth, something that you couldn't really do with a matrix. A matrix needs to be all numeric, or all of the same kind, just like a vector.
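The contrast between a list, a data frame and a matrix can be sketched as follows. The values and the names my.list and my.data are hypothetical, chosen only to mirror the age, sex and metadata example described in the lecture.

```r
# Hypothetical measurements on three patients
age  <- c(31, 45, 27)
sex  <- factor(c("F", "M", "F"))
meta <- c("collected at baseline", "age in years")   # unrelated length

# A list bundles objects of different kinds and lengths
my.list <- list(age = age, sex = sex, meta = meta)
my.list$age

# A data frame needs equal-length columns; each row is one patient
my.data <- data.frame(age = age, sex = sex)
my.data$age          # column access by name, as with a list
my.data[1, ]         # all measurements on patient one
is.list(my.data)     # TRUE: a data frame is a special kind of list
is.factor(my.data$sex)

# A matrix holds a single type, so mixing types coerces everything
m <- cbind(age, as.character(sex))
is.character(m)      # TRUE: the ages have become character strings
```

The last two lines show why a matrix is not enough for a mixed data set: once a character column is bound in, the numeric ages are stored as text.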
A data frame is just like a special kind of list, which is more efficient when you've got rectangular data.

Okay, so here we're going to talk about something that's pretty important, called indexing. We've seen indexing a little bit with vectors, how we can get to specific elements, but now we're going to see some examples of how to do very efficient indexing to get to the elements you really care about.

So let's go to that example. We've got the pain vector. It's still in memory, but I'm going to recreate it here. We've seen how to look at the first and the second element; let's get the third. One thing: what if you ask for pain[10]? Here you can see that it gives you NA, because the length of the vector is less than 10. It's telling you there's no element there, there's nothing. You might expect an error because you're going out of bounds, but once again R is clever enough that, even though you're out of bounds, it just gives you an NA and keeps working.

You need to be a little bit careful. Sometimes it's nice, because other languages will typically just break down when you go out of bounds. Here it says, okay, it works, but it gives you a sort of weird value, the missing value type. So it can be good or bad depending on what you do.

Now let's say I want elements one and two. I can just do 1:2, and it gives me these. From one to three, from one to four or five, and so forth. It doesn't have to start at one: you can go from two to three, two to four, two to five, etc. It doesn't have to be contiguous either: you can do c(1, 3). What this does is create a vector of indices, one and three, and then you say, get me elements one and three from the vector pain; one and four,
and so forth. Okay, and you can put more than two indices, in any order, and it's going to give them to you in the order you asked for: first the first element, then the fourth element, then the fifth, and then the third. Does that make sense? Pretty simple.

Someone asked earlier, what if we just want to remove a column or remove an element? You can do that very easily. You do the same thing, but you put a minus sign in front of the index, and that's going to tell R, I want the vector but with the fifth element removed. We do that, bang, here we are: the vector without the last element. You can do that with any of the elements: you can remove the first one and the second one, you can remove two and three, you can remove one and three. You get the idea. It's very powerful. We can also do what's called conditional indexing, which is probably what you meant; we'll get to that.

Now for a matrix. We still have my matrix that we created earlier. What if we want the first row, first column? We index it in the same way, but we put a comma to separate rows and columns, and here we should get 1. Let's say we want first row, third column. Fifth column: oh, it's out of bounds, because there are only four columns here. If we ask for 2 and 4, it will give you the respective element.

You can do the same thing as with vectors. You can do 1:2 for the rows, and this says, in the fourth column I want the first and second elements, so it's 4 and 8. You can also leave either the row or the column empty. Here it takes the first row and the second row, and of course that's the whole matrix, since there are only two rows. You can take the first column, or the first row, the second row, the second column, and so forth.

So indexing is very powerful, and that's why it's so nice to work with matrices or data frames, because right away you can get to
the row or the column that you want, or the elements that you want, very quickly. We can do the same thing as before: let's say we want the matrix but we don't want the second column. We put a minus two in the column position, and that gives you the matrix without that column. Does that make sense?

We can also index lists. We created my list before, so let's look at it again. We've seen that to index variables in the list we can use their names, so that's one way. We can also index using brackets: let's take only the first element of the list, or let's take 1 to 2. That's kind of a neat way to do it too, because you can just select a couple of variables in your list. Now if you want to access the elements of a variable within that list, you take, let's say, age, and then of course you can look at the first element of that, the second, the third, and so forth.

Already too much, I know. There's a bit of information here, and indexing is pretty tricky sometimes. I don't even remember all the ways you can index something. You're going to learn through the examples we're going to visit this afternoon, and probably later this morning.

Now let's say I really want to extract the variable from the list. I would use double brackets, and you're going to see the difference from just one bracket. With one bracket it still gives me the name along with meta. With double brackets it removes the name and just gives the actual content of that variable. That's the difference. If I use a single bracket and then want to index into the variable, it's not going to work, because it gives me the meta variable as a whole. If I use double brackets and then index, this gives me the first element of the variable. So this is equivalent to what we did with the name; that's just another way to do it.
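A minimal sketch of the indexing forms covered so far, for vectors, matrices and lists. The pain values follow the example used in the session; the matrix and list contents are assumptions standing in for the objects built earlier.

```r
# Vector indexing
pain <- c(0, 3, 2, 2, 1)
pain[3]            # third element
pain[10]           # NA: out of bounds gives a missing value, not an error
pain[1:3]          # a contiguous range
pain[c(1, 3)]      # elements picked by a vector of indices
pain[-5]           # everything except the fifth element

# Matrix indexing: rows before the comma, columns after it
my.matrix <- rbind(x1 = 1:4, x2 = 5:8)
my.matrix[1, 1]    # row 1, column 1
my.matrix[1:2, 4]  # rows 1 and 2 of column 4
my.matrix[1, ]     # empty column index: the whole first row
my.matrix[, -2]    # the matrix without its second column

# List indexing (contents are hypothetical)
my.list <- list(age = c(31, 45, 27), meta = "study notes")
my.list$age        # by name
my.list[1]         # single bracket: still a list, of length one
my.list[[1]]       # double bracket: the content itself
my.list[[1]][1]    # first element of that content
```

The single versus double bracket distinction is the one that tends to cause trouble: my.list[1] returns a one-element list, so indexing into it again does not reach the individual ages, whereas my.list[[1]] returns the numeric vector itself.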
OK, sometimes the elements in your list might not have names, so that's the only way you can access these elements. Don't worry too much, we'll revisit some of that. We've created a data frame earlier. OK, so a data frame is really just like a matrix, except that the variable type can be different across columns in the data matrix. So we can just index the first row. That's nice because, you know, you can see right away why it's more efficient than a list: it's because we've got the natural pairing of observations across rows. We can just extract one patient right away and say, this one is 31, is female, and this is the metadata. We can do the same too: we can extract a variable, OK, using the column, and so forth. That makes sense? It's really easy; we just have to remember how to do all these things. OK, we're going to see something that's very, very important and very interesting in R: it's called conditional indexing. It's a very quick way to get to the numbers you really care about. So the trick is that indexing can be conditional on another variable. Let's say you've got a very complex data set and you'd like to extract the variables or the measurements that were taken on Friday before seven o'clock and where sex is female, whatever. This is the sort of thing you can do with conditional indexing. So we're going to see an example. Once again we create pain, a sex vector of the same length, and age. OK, so let's look at this: we've got pain and sex. The first one is just a numerical vector, this one is a factor, and this one is a numerical vector. Let's say I want to look at the pain for only the males. So what I'm going to say is: look at pain, but such that sex is equal to M, so that it's male. OK, so you can do it like that. What you're going to get, you can see, is that this one is zero, this one is three, and the last one is one. So you're only going to get the elements for which the patient was a male. Now you can see that this is pretty
straightforward; it's very natural in the way you do the conditioning. But you might ask yourself: why are there two equal signs, why not just one? OK, does anyone know that? Yeah, but why? Why do we have two equal signs here, not just one? Good. So in many programming languages the way you assign a value to something is with the equal sign. In R, at first it wasn't with the equal sign; it's with the left arrow, as we've seen. And in fact that's the best way to do it, because it's less confusing: you can really see it's different from the equal sign. But in fact in R you can also assign with the equal sign. I did not tell you that because I don't want you to do it, because it's not really good, right? It's very confusing: you don't know if it's really equality or assignment. So it's safer to just use the left arrow for assignment, even though a lot of people, you know, are kind of lazy, just like me, and prefer to use the equal sign because it's just one character instead of two. Well, guess what: it doesn't make a very big difference whether you use two characters or one, but it's going to make a lot of difference when you start coding and you have to read your code, and there's an error and you don't know where it comes from, and so forth. Because if you do that, let's try to do that with one equal sign: what's going to happen? And you don't really understand why, right? Because what this is doing, let's look at the variable sex. If you do that, right, you've assigned a new value to your variable. OK, so it's a bit confusing, and that's why, when you put the two equal signs, R is going to know that you're making a comparison between two things. It's just to make it very clear, to you and to R, that there's no assignment, it's just a comparison. You can do things similarly with greater or less, for example. So here, let's say I want to, well, I'm just going to recreate the sex
variable first. And let's say I want all the pain for the patients that are at least 32, well, strictly greater than 32. OK, so I do that, and you can see it's going to extract all the elements except the one who's 32, which is that one. So you get 0, 3, 2, 1. That makes sense? The thing is that you can actually combine things. So you can say, OK, I want age greater than 32 and sex is male. If you look at the two, it's going to be 0, 3 and 1. You could do less than 32, or less or equal than 32, and sex is female. OK, so you can play with these sorts of things very quickly. Or you could do either: it is less or equal than 32, or it's less or equal than 40, or it is a female. So you can play with and, or, things like that. I'm doing that on vectors because actually it's three variables; they don't have to be in a list, they're separate variables. Yeah, as long as they have the same length it makes sense to do these sorts of things, and if they don't, R will sort of give you an error or a warning. OK, so try to do the same thing: indexing with female, and then do the same thing with age less than 80, separately, and then combine the two. You say, I want all the females that are less than 80, and I want to look at the values of the pain. So try to do that. I trust that you should be able to do it; it's pretty straightforward. So first we want less than 80, so do that. OK, and then you want sex is female. OK, and then you want to combine the two. You need to be a little bit careful here: this means that it's strictly less than 80. If you want to do less or equal than 80, you put it that way: less or equal, greater or equal. OK, so one thing you can also do to check whether your indexing is doing the right thing: you can actually take the variable you're trying to condition upon, sex, and you do the same, right? Nothing stops you from doing that, and that way you will check that this is actually grabbing the right patients, the right indices of the variable. Right, here they should all be
female. If you do age such that age is less or equal than 80, they should all be less or equal. OK, so it's done the right thing, yes? No? OK. So we've seen conditional indexing, and we've seen pretty much all the types we're going to use and we're going to need. The first thing we need to know now, while we're doing statistics, excuse me, is that we need to read data into R. How are we going to get our data into R, right? Actually someone asked me a question on how we get data from Excel into R, because sometimes people will send you these Excel files, which I find really annoying, because you need to have Excel and you need to know how to deal with these sorts of things. But it's actually not too difficult to get your data from Excel into R. The best way to do it is to open your Excel file using Excel, and then you can do Save As, either as a text file or CSV, and then you can read that into R with the read.table function. OK, so in R it's pretty easy to read data. One thing you have to know is that it's pretty easy unless you've got very complicated data, lots of characters and things, very messy. Probably you don't, because if you can open your data in Excel it's probably clean enough anyway. Sometimes it can get slightly more complicated, but for typical data sets it's very easy. So here we're going to try to read in the GvHD data set. It's a flow cytometry data set looking at graft-versus-host disease in a patient that actually had that disease. OK, so here's going to be a tricky part. Let's try to do that right away. What do I get? Well, it can't find the data set, right? Because you need to tell R where it's going to look for that data set. So one way to do it, and I don't know how it works on Windows, but there's a way where you can actually tell him where to look for things: you can change the directory. So here on my Mac, if I go into Misc and I look for Change Working Directory, I can do that, and I'm going to tell R
I want my directory, my working directory, where I'm going to put all of my data, where I'm going to put all of my output and so forth, to be just the desktop. OK, and here I can see, oh, here's the data, that's fine. In Windows you probably have something very similar, where you can change the directory under File. So if you go under File you're going to have a command that says Change Directory; set that to be your desktop, assuming that you put the data sets on your desktop. OK for everyone? If you've got a problem, raise your hand. You can really raise your hand if you've got a problem, that's OK, or you can ask your neighbor if you want to. You know, it probably seems like a lot of things to do, you've got to tell R to do this and that, but you've got to think that, you know, R is probably very intelligent, and so is your computer, but if you don't tell him where the data is and what data you're talking about, it's going to be very difficult for him to work with the data set, right? And it's not very difficult to do: you just go and say, look on the desktop. So now that you've done that, let's try to input that command again. And now it's working, right, because he knows that the file you're looking for is on the desktop. In fact there's another way to do that: you could also give the full path of where exactly the data is. But the syntax can sometimes be a bit tricky, because, you know, Windows and Mac are slightly different, so it's much easier to just change the working directory. And in fact, in general, it's a good way to do it: you create a directory somewhere, let's say on your desktop, you put all your data in it, and you work in it, so that you remember that all the results and everything will be in the same place where you have your data. Is that OK for everyone? OK, so now we've read in the data set with read.table. By default this is going to give you a data frame. So when you read a data set like that, R
assumes that it is a data frame. There are other ways to read data, to create lists and things when it's not exactly of the same size, but for all of what we're going to do we're just going to deal with data frames. Because it is a data frame, we can actually look at things like that: we can look at the first 10 rows of that data frame. You can see that here I put an option in the function that says header is equal to TRUE. It's because there's a header: the variables have names, and when you read the data set you're going to tell R to look for these names, and R is going to use them when creating the data frame. OK, we look at the first 10 rows and you can see there are numbers. Try it. OK. So, exactly. No, it's not lost, you can just scroll up the window. But it could be that it filled up that window, so you've sort of lost it. Then you've lost it, yeah; it's because there's only a maximum number of lines you can have in your window. But there's one thing that you can do when you're looking for commands that are very far back, or that you don't remember: there's history. You can actually do ?history and it will tell you a little bit about how to go back to previous commands. OK, and because I wanted the first 10 rows, like here: it depends. Well, if it's a matrix you need to put a comma, because there are rows and columns; if it's just a vector you don't need the comma. OK, so going back to your question, well, someone is still getting set up at the back. You can go in the history and you can say, OK, I don't remember, I know I've typed something like that, I don't remember exactly what it was. There's a history in R; it's going to help you remember what you've done, and you can actually look for a specific pattern. So I could say, OK, I want to look at the history that contains age. So I think it was pattern. OK, well, it doesn't work on my machine here, but that's because I'm using the GUI, which I've never used before. But typically you can do that and
go back to the history. I don't know, if you try on your machine, whether that actually works, but that's one way to do it. OK, so we did read in the data frame, and here we can look at each of the rows, like the first 10 rows. We can also look, no, it shouldn't be. Yeah, it's just because I've got a special setup on my machine: I never use the GUI, so I just installed the GUI this morning to be able to show you. Good. Are we OK in the back? Guess not. No? OK. Yes. Yeah, so you're going to be able to save what's in memory, things like that, but not what's typed there. You're going to be able to save the commands you've typed and the variables, but you're not going to be able to save the output. Or if you really want to save it, just do a print or something, or copy and paste into another document. But typically there's no real reason to save it, because if you save all of the variables and all of the commands you can just re-execute everything. OK, so we can do the same thing, grabbing the first column, the second column, and so forth. Yes, absolutely, it's always better to have a file that contains the commands. So here we are, we are on the desktop. Good, everyone? OK, so we can read the file, this is what it looks like, it contains all the information we need, and we can work with it. Another thing that you should know as well is that R comes with lots of packages and functions, and often these functions have help files, and it's very useful to have data sets to use in these help files. So typically you can also get lots of data sets that are in R already. For example, there's a data set called iris, which is about a bunch of flowers, measurements on specific flowers. You can just type data(iris) and that will load that data set. So you can try that. OK, and that will load the data; you see, it worked. And then if you type iris, let's look at the first element, you will see something. First row: you can see that it's looking at the length of the sepal and the petal. So it's nice because you can
get some data sets already in R. You can create packages and bundle data sets in them, so it's very good, because it makes it easier to do reproducible research: you can package your code with your data, you know, and people can just reproduce exactly what you've done. OK, is that OK for the input of data? Yeah? Yes, so that one is actually part of the base distribution. There are other data sets that are within specific packages. So if you just type data, and actually here's an example, if you type data(crabs), that's not going to work, because R doesn't know it. It's actually part of another package, which is called MASS, and we're going to see that. So I can load that package, and then I can load the data, and it's working. OK, so when specific data sets are within a package, you first need to load the package. Iris is part of the base distribution, so you don't need to do anything. For some of them, the packages have to be on your computer for you to be able to load them, but you can download them from somewhere else, install them, and then use them. There are some packages that come with the distribution, so that one actually comes with R: when you install R you will have that package as well. OK, that's just because you don't need all of them at once; there's no need to load everything, right? It's better to just load packages for your specific needs. So that's a good question: when do we use small or capital letters? R is actually case sensitive, so when you have a variable, when you've got an object, a package, whatever, you need to make sure that you've got the right small or capital letters. And it's nice because you can create more variables and put specific names and things, but sometimes it can create a bit of confusion, so you need to be a bit careful about that. So here it's just because the package is called MASS; often when we've got acronyms it's easier to put capital letters, right, so that's why. OK, did you have another question?
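To recap the data input steps, here is a minimal sketch. It assumes a small made-up patient table written to a temporary file, so that it runs without the workshop's GvHD file or a change of working directory; the file contents and the name patients are illustrative only.

```r
# Write a small, made-up table to a temporary file so the example is self-contained.
tmp <- tempfile(fileext = ".txt")
writeLines(c("age sex pain", "31 F 2", "45 M 0", "28 F 3"), tmp)

# header = TRUE tells read.table that the first line holds the variable names.
patients <- read.table(tmp, header = TRUE)
patients[1:2, ]        # first two rows (note the comma: rows, then columns)
patients[, 1]          # first column

# Conditional indexing on the data just read: pain of the females aged 80 or less.
# Two equal signs for the comparison, & to combine the conditions.
female.pain <- patients$pain[patients$sex == "F" & patients$age <= 80]

# Data sets shipped with base R can be loaded directly.
data(iris)
iris[1, ]              # first row: sepal and petal measurements

# crabs lives in the MASS package, so the package has to be loaded first.
# Names are case sensitive: library(mass) would fail.
library(MASS)
data(crabs)
head(crabs)
```

With your own files the only change is the first argument of read.table: either a file name relative to the working directory or a full path.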
OK, that's a basic question, but it's a good question, because there are obviously lots of formats you can have for files, right? You could have a tab-delimited file, you could have spaces in between numbers, you could have lots of various things. So R tries to be very clever: when you use read.table, it will try to guess which format it is and then how to read the table. But I'm sure very soon you will encounter cases where it doesn't do what you want, so you can specify that. If you don't know, just do ?read.table, OK, and you're going to see that there are a bunch of options here. One of them is called sep. So the separation could be a tab, it could be a space, could be whatever. By default it will take any white space, tabs, spaces, things like that, to be a delimiter between two numbers or two measurements of variables. You can have a header, you can have no header, you can do lots of various things. OK? Yes, a factor. But typically it will put a missing value if one of them is missing or is not a number or something. Yes. Yes, each column is like a vector, if you want, and it has to have the same type, either character or numeric or logical, whatever. Excel does a lot of bad things; again, don't get me started. OK, so you didn't know about functions and arguments, but we've used functions since the beginning of the workshop, right? Everything we do in R is a specific command, and there are some arguments that we put in. Like read.table: we put in an argument, the file name, header equal to FALSE, whatever. read.table is a function, and depending on the arguments it will do different things. So a function is just like a mathematical formula or function applied to one or more arguments. So here's just one function, log: it will just take the log of x. But you can do other things, so you can plot the weight versus height, and we're going to see some plotting functions. So something you need to be very
careful about when you use an R function is the order of the variables that you use. So when you plot, when you do plot of weight and height, R assumes that the first argument is the x variable and the second is the y variable. OK, if you want to do it the other way around, you will have to do height and then weight. If you do not know how to specify the arguments, you should look at ?plot: what are the arguments, what are the possible options I can use, et cetera. In some of the functions that we've used, like read.table, if you do ?read.table you can see there are tons of arguments, but we did not specify that many arguments; we just gave the name of the file and header equal to TRUE. It's because by default R is going to fill in the values of all the arguments you don't specify. So for most of the functions, arguments come with sensible defaults, and therefore you can omit these arguments. For example, if you want to do a scatter plot of weight versus height, and we're going to see that, by default the color is going to be black, which is coded as one. So when you do just plot of weight and height, it's basically doing the same thing as this. OK, so there's a default for that parameter that you don't have to care about, except if you want to change it. If you want to change the color, then you can do it very easily: do col equal to red, or equal to whatever. Going back to the arguments: if you do not specify the names of the arguments, just like here, weight and height, the order is very important, right, because R will assume that the first one is the first argument of the function, which is x, and the second one is the second argument of the function, which is y. So sometimes it's better to be very explicit. Here you could do the exact same thing, but do plot with x equal to weight and y equal to height. OK, that way R will know for sure that this one is x and this one is y. And in fact you should always be very, very careful to specify the name of
the variables, and try to use the same order as R, so x and then y. Even though, if you do y and then x, it will work as long as you specify the names, you can run into a lot of trouble, you know, if you don't keep the same order, because sometimes you will forget the names and you won't realize that you did, and therefore you will get something completely different from what you expected. So try to be careful in putting the names of the variables and keeping them in the same order; that will save you a lot of time later. We'll see some examples of that. Libraries: so this is one of the strengths of R, that there are all the other libraries that all the other guys in the world have written and that you can use for free. These are available on some specific websites. So there are a couple of websites where you can download libraries. One of them is called CRAN, which is just the repository for R packages, just typical R packages. There's Bioconductor, which is geared towards genomics, where you can download packages. And these things can be done very easily within R directly. And of course some of the packages are also distributed with R, as I just said earlier. So let's try to do something: if you do library(survival), which is a package to do survival analysis and which is included with R, it works, OK, because it's included. Note also that when you try to load a package that requires other packages, it will load these packages automatically. So here it says, I'm loading the package splines, because survival uses some functions of splines. And this is great, because you can customize packages: you can write your own package, but you don't have to rewrite the function that does this and that. You can just use, you know, that package from John and that function that Mark wrote in the other package; you can just load these functions from the other packages when you need them, and therefore it's very easy to write packages that are very customizable with respect to what you want to do. Now let's
try to use another package. That package is called samr, which we're going to use later. Try to type that on your machine. Does it work? You don't get the same thing here? That's bizarre. No, it's because it's not a default package, right? So you need to install that package before you can load it. For me, I installed it yesterday because I knew we were going to use it, but you can just do it right away. So again, on the Mac you can just go into Package Manager, and on Windows I don't know how you do it; well, we'll figure it out. Actually it's not Package Manager, it's Package Installer, and there are various sources for your packages: there's CRAN, there's Bioconductor, and other things. So here, just in CRAN, you can search for the name of the package, so it's samr, and you search for it. OK, if you don't find it, there's a very easy way to do it through the command line as well, so don't worry too much about it. OK, you can select it and you do Install Selected; it's that easy. The other thing you can do as well is just to do install.packages and the name of the package. By default R knows that it comes from CRAN, so you don't have to worry about it, and if you do that it should install the package for you. See, it's downloading the package, it's installing the package, all automatically. Yeah, there are a couple of warnings and things when you install a package, but it's not really bad. So it's another way to do it. In fact I always install my packages this way, because it's easier for me: I have scripts to install all the packages that I want, all the time. But for you it's probably easier to use the GUI, and you can just do that very efficiently. If you don't know how to do it, if you're running into some problems, just raise your hand and we'll get to you and we'll install the package. As with everything in R, the first time you do it, it takes forever, because you don't know how to do it. Once you know how to do it, once you've done it, it's easy; it's always going to be the same thing. So again, if you need help,
raise your hand.
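The command-line route can be sketched like this. It is a minimal sketch, not a definitive recipe: the check-before-install pattern, the variable names and the mirror URL in the commented line are assumptions added for illustration, and the actual install is left commented out because it needs network access.

```r
# Check whether a package is available, and load it only if it is.
# "samr" is the package named in the session.
pkg <- "samr"
installed <- requireNamespace(pkg, quietly = TRUE)

if (!installed) {
  message("Not installed yet; run: install.packages(\"", pkg, "\")")
  # install.packages(pkg, repos = "https://cloud.r-project.org")  # needs network access
} else {
  library(pkg, character.only = TRUE)   # character.only: pkg holds the name as a string
}

# Packages that ship with R, such as survival, load without any installation.
library(survival)
```

Keeping a handful of install.packages lines in a script is the same idea as the scripts mentioned above: on a new machine you run the script once and every package you need is there.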