During the coffee break I was asked for more details about strsplit(), and I wanted to share that with all of you, just to make sure. We use strsplit() a lot, so let's illustrate it. I'm putting this into code snippets and uploading them, just to show you. Define a vector which contains a single element: the three letters a, b, c, separated by colons. Incidentally, when I have an assignment or an expression like that and put parentheses around it, the effect is that whatever is inside gets executed, and the result also gets printed. So it's basically a shorthand for saying "assign that to x, and then show me the contents of x", done in one line. If you find expressions with parentheses around them on a line in a script, that's what that means. Okay, so I have a vector that contains one element, "a:b:c". There are no separate scalar types in R: a single value is a vector of length one. So this is a vector containing "a:b:c". Now if I apply strsplit() to x, as we've seen before, we split on the colon and get a vector of three elements, "a", "b", "c", and that vector is contained in a list with one element. If I pull that element out with the double-bracket subsetting operator, which extracts a single item from a list, I get only the vector, and that's normally what I want to be working with. I get a similar result with unlist(), because unlist() takes everything it finds in the list and puts it into a single vector, so that looks exactly the same. In this case, extracting the first list element and unlisting the whole thing are exactly identical. But now consider the case where we don't have a vector of one string, but a vector of two strings, or potentially more, and the elements have different lengths.
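The single-string case just described can be sketched like this (a minimal example, not the exact snippet from the course materials):

```r
# A one-element character vector; the outer parentheses also print the result
(x <- "a:b:c")

# strsplit() always returns a list -- here, a list with one element
s <- strsplit(x, ":")

# For a single input string, [[1]] and unlist() give the same vector
s[[1]]      # "a" "b" "c"
unlist(s)   # "a" "b" "c"
identical(s[[1]], unlist(s))
```

With one input string the two forms are interchangeable; the difference only shows up with longer input vectors, as discussed next.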
So now we have a vector whose first element is "a:b:c" and whose second element is "d:e". If we strsplit() that, we get a list of two elements: the first holds the a, b, c vector and the second holds the d, e vector. And now it matters whether we use the double-bracket extraction or unlist(), because the double-bracket extraction applies only to the first list element, of course, while unlist() applies to the whole thing and gives a, b, c, d, e. So depending on what the input can potentially be and what we want the output to be, one, the other, or both are possible. That's the difference between the subsetting extractor and unlist(). All right. So far, to get this in as data, we put the whole text into a single character element, split it on the line-break character, and then either unlisted or subsetted it and assigned the result to the right variable. Most frequently, though, what you will encounter is data that is large and structured in some way, and for that we would read it with some variant of R's built-in read functions. The most frequently used ones are readLines(), read.csv(), and read.delim(). readLines() produces a vector of lines from a text file, one element per line. In our case, that's exactly what we wanted: it reads the file, every line goes into its own element, and the result is a vector we can assign. We could also use read.csv() or read.delim(); here it doesn't matter whether we treat the file as tab-separated or comma-separated, because there is only one element per line. That works in almost the same way and looks quite similar, except that now we don't get a vector, but a data frame instead.
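The two-string distinction above can be made concrete with a short sketch:

```r
y <- c("a:b:c", "d:e")
s <- strsplit(y, ":")   # a list of two vectors of different lengths

s[[1]]      # only the first list element: "a" "b" "c"
unlist(s)   # everything flattened:        "a" "b" "c" "d" "e"
```

So `[[1]]` silently drops the second element, while `unlist()` merges both; which one is right depends on what the downstream code expects.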
In this case, the result is a data frame with a single column. Since our file had no header (header = FALSE), R automatically generated a column name, V1; the data was read in rows, and we have the same values. Now, this is a data frame, but the specification called for a single vector, so how do we turn a data frame into one? Right: we extract the column. How do we do that? We could use the $V1 column directly, since that's what it's called, which gives the vector we want; or select all rows and the column named "V1", which is the same thing. In this special case we can actually also use unlist(), which collects all the data from the data frame into a single vector, and what that data is determines the type of the vector. So all three, the $V1 selection, the all-rows-and-column-"V1" selection, and unlist() on the data frame, do the same thing here. I could even have applied $V1 directly to the read.csv() call; I don't need an intermediate assignment to a variable. More often than not, however, our data is going to be two-dimensional anyway, rows and columns of different types, and we'll put it into a spreadsheet-like structure to begin with. All right. Good, so we've talked about how to get very simple data into R. How many genes do we have? How do we know? Right: here in the Environment pane we see 46 observations of one variable. That's still the data-frame version, but we can change it back to a vector, and then it's a character vector of 46 elements. If we weren't looking at the Environment pane, how would we find that programmatically? length(). The length() function gives us the number of elements in a vector, which is something we often need, because the length is also the index of the last element. The index of the first element, of course, is one.
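The column-extraction options just listed can be sketched as follows; the file contents and gene names here are made up for illustration:

```r
# Write a tiny one-column file to stand in for the gene list
f <- tempfile(fileext = ".csv")
writeLines(c("geneA", "geneB", "geneC"), f)

df <- read.csv(f, header = FALSE, stringsAsFactors = FALSE)

# Three equivalent ways to get the single column back as a vector
a <- df$V1
b <- df[ , "V1"]
d <- unlist(df, use.names = FALSE)  # use.names = FALSE drops the V1* names

length(a)      # number of elements; also the index of the last element
a[length(a)]   # the last element
```

Note that plain `unlist(df)` keeps generated names like "V11", "V12"; `use.names = FALSE` gives a clean unnamed vector identical to `df$V1`.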
The index of the last element is length(vector). Okay, here's a task. On the Science website, as on other journal websites, there's a section of the article called supplementary materials. The supplementary materials list tables, usually additional methodological documentation, and hopefully other data. In this case, Table S3 summarizes data that underlies one of the plots they produced. The data contains expression profiles of different cell types under lipopolysaccharide-stimulated and control conditions; these are differential expression data from single cells. So if we find something like that on a website, we click on the table, it's an Excel spreadsheet, and it gets downloaded to our download folder. We don't need to do that here because I've already put it into the data folder, table S3.xls. The first task is to open it and try to understand what's in the table and what we would possibly need to do to read the data into R. The second task, then, is to actually read the supplementary data. Now, it's very common to share data in Excel spreadsheet formats. A word on that. Excel is an excellent spreadsheet program. It is really, really bad for statistics. Do not use Excel for statistics. Not only does it make horrible, garish, hard-to-interpret, really ugly plots, it is also often wrong: it has its own idea of statistics that is not quite canonical. There are two links, which I hope still work, one on practical statistics problems with Excel and one at the Burns Statistics site, that comment on why it's a poor idea to use Excel for statistics. So by all means use it to share your data and keep your data; as long as the data doesn't get too large, it's often quite useful. Don't use it for programming.
Excel "programs" are usually messy, and do not use it for statistics. You're learning to use R, and that's the Right Way to do it, with a capital R and a capital W. Now, even when you export things from Excel, as you should, into something like comma-separated or tab-separated values, you need to be cautious, because one of Excel's problems is that it truncates numeric precision. It thinks all those digits after the decimal point are overkill: why would you need seven digits? Well, sometimes you do, and whether I do or not, I do not want Excel to make that decision. I want to be the one who decides, so please, just leave my data alone. And to have it leave my data alone, simply convert all cells to text before export. That exports the actual values that are in the cells, not some interpretation of what Excel thinks the values are. In principle there's a package, xlsReadWrite, available via CRAN, but I'm not actually sure that using it is sound practice. The default Excel format is closed; people have reverse-engineered it, after a fashion, to understand what's in these files, but there's no guarantee that that's actually correct. On top of that, Excel spreadsheets can be complicated: they have multiple sheets, their values can be explicit or implicitly calculated from formulas, and you can mix various tables on the same page. So using a function like those in xlsReadWrite is calling for trouble. What you want to do is make sure that you see what you are exporting, and export it into a plain-text, comma-separated or tab-separated value format. Make things explicit. I can't overstate that. It's really, really important to keep explicit track of what your data is doing and to validate every single step, and exporting from Excel is no exception.
So what we do is open our Excel spreadsheet, export it as a .csv file, a comma-separated value file, and then read that into R. Your task is to open the data table like you would after downloading it from the supplementary materials on a journal website, and save it in comma-separated value format. When you have that file, put up a blue post-it, and if you're stuck, put up a red post-it. [Discussion with participants.] I see only a very small number of blue post-its; I hope that doesn't mean it's difficult. This is the table when I open it in Excel. It's largish, it has a lot of data, and it has a few header lines. File. Save As. Format: many options here. I want comma-separated values, CSV. The alternative is tab-delimited text, which we usually call a TSV file. However, if you get Excel to write tab-delimited text, for some reason it insists on calling it .txt and not .tsv, and if you name it .tsv instead, it appends .txt to your extension anyway. CSV, at least, actually generates a .csv. So I choose table S3.csv and click Save. It complains again that this workbook contains features that will not work or may be removed: don't do this, it may hurt your little brain. No, no, no. Continue. And then it finally saves.
Incidentally, Excel now refers to this file, which I had opened as an Excel file, by the new name, because I've saved it. However, if I simply press Save, I believe it changes it back into an Excel format anyway. So Excel does lots of things that I'm not explicitly asking it to do; I'll throw that away and just name it differently. So there it is. What's inside? Always look at what's in your data before you work with it. It's a text file. It has information at the top. It's comma-separated. So we should be able to read it with the R function read.csv(). And what I'd like you to do, your task, is to read the table into R and assign it to a variable, with a few caveats. Use the read.csv() function. After you read it in, use the head() function to look at the beginning of the object, to see what you have and whether it's what you want. There may be header rows you don't want; there may be odd names in the headers you don't want. So give the resulting object column names that reflect the cell types, just like in Figure 2C of the paper. Here we have different cell types: B cells, macrophages, natural killer cells, monocytes. And pDCs are what kind of dendritic cells? Plasmacytoid. Plasmacytoid dendritic cells. The abbreviations the paper uses are B; MF for macrophages; NK for natural killer cells; MO for monocytes; pDC; DC1; and DC2. So it's a good idea to use these column names, but as the header possibly explains, some of these columns are controls and some are lipopolysaccharide-stimulated. So distinguish CTRL and LPS conditions in your column names. And call the last column "cluster", because they've clustered the cells and the numbers in the last column correspond to the cluster identifiers. There's a function loaded here called objectInfo(), which takes a few of the data items we can retrieve for an object and prints them to the screen.
So once you've read the file in, or along the way, you can use objectInfo() on whatever your object is called to find how many rows you have, how many columns, what the column names are, and so on, and what the column types are. Do you have strings? Do you have factors? Which columns are numeric? And validate that. One thing I omitted to say, and should add here: call the final object LPSDAT. Do call it LPSDAT, because we'll be referring to that variable name often in later parts of the script. You're of course free to call it anything you want, but then you will need to edit a lot of lines in the script to make them actually work. Call it LPSDAT; that's what's expected here. So again, the task is to read your CSV file and make sure that what you get as a result of the read operation is actually something useful to work with: the genes in that file, the expression values, and appropriate column names. When you're done, put up a blue post-it. If you run into trouble and can't continue, put up a red post-it. [Discussion with participants.]
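The course's objectInfo() helper isn't reproduced in this transcript; as a rough stand-in, a function with the behavior described above (rows, columns, names, column classes) might look like this. The function name and output format here are assumptions, not the actual course code:

```r
# Minimal sketch of an objectInfo()-style helper for data frames
objectInfoSketch <- function(x) {
  cat("rows:    ", nrow(x), "\n")
  cat("columns: ", ncol(x), "\n")
  cat("names:   ", paste(colnames(x), collapse = ", "), "\n")
  cat("classes: ", paste(sapply(x, class), collapse = ", "), "\n")
}

df <- data.frame(gene  = c("g1", "g2"),
                 value = c(0.5, 1.2),
                 stringsAsFactors = FALSE)
objectInfoSketch(df)
```

Checking column classes this way is exactly how you catch the strings-read-as-factors problem discussed below.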
[Discussion with participants.] So let's discuss a little of what's been happening here. One thing you probably noticed is that trying to read this header information in a sensible way doesn't really work. It's extremely hard to somehow recover usable information from a header that's built like this. This was not made to be used programmatically; it was someone's idea of how to fluff the table up for visualization. So my strategy would be to just skip reading these first six lines; there's a skip parameter in read.csv(). Start reading at line seven, and tell read.csv() that this file does not contain a header. Otherwise line seven, i.e. your first data line, will be interpreted as column headers and lost from your data. So skip the first six.
Start reading at line seven. Tell it that the file has no header. And then, after it's read in, simply give it a vector of column names, which you'll probably just have to write out by hand, and assign that to colnames() of your data frame. "So then, were row names assigned?" Yeah, we'll get to that in a moment. Maybe I'll just walk through the sample solution here. We take the whole thing and assign it to a raw-data object, with header = FALSE and stringsAsFactors = FALSE. Many of you omitted, forgot, or didn't realize that you really have to say stringsAsFactors = FALSE for something like this; otherwise we get factors all over the place. In particular, since the first elements of each column are text anyway when reading the whole file, the entire numeric data column also gets converted into factors, and we'd need to recover from that. It gets messy: just say stringsAsFactors = FALSE and we'll clean it up later. If we look at the result, this is where things go wrong. R automatically assigns V1, V2, V3 (or X1, X2, X3, apparently, with some readers) as column names. So all columns are named V-something or X-something, rows 1 to 6 do not contain data, there's not a single row that could be used as column names, and all columns are characters. All of this needs to be fixed. What I'm illustrating now is how to recover from a sloppy read; we could read this in a more intelligent way and save some of the work, but how do we recover here? The first thing we do is drop the first six rows, using a negative range. Remember that for subsetting, negative indices remove items from the output. And be very careful: the parentheses here are absolutely necessary. What I have is -(1:6), which is -1, -2, -3, ..., -6. If I don't use the parentheses, what does this give me?
It gives -1:6, which is -1, 0, 1, 2, 3, 4, 5, 6. That's not what's intended. Be careful: in this case the parentheses are really important. It's one of the cases of operator precedence: the unary minus has higher precedence than the range operator, the colon, so the minus applies to the one before everything is turned into a range. If you, like me, cannot for the life of you remember operator precedence, just put parentheses everywhere there's potentially a conflict. If you don't want to remember that operators of equal priority are evaluated from left to right, or whether exponentiation is evaluated before or after multiplication, that kind of thing, just use parentheses. Having parentheses never hurts; I can't remember a single case where it would cause problems. So we remove those rows and see what we get: we now start with row seven, but things are still messy. Next we define column names. As I said, here we just have to write them out by hand. If we were very savvy with the paste() function, we could make our lives easier and paste them together without spelling everything out line by line, but being explicit has the advantage that in our scripts we can also comment on what we actually have: gene names; cell types taken from the figure, from B cells through macrophages; and CTRL and LPS referring to control and LPS challenge. Okay, so this defines my column names, and they now read B.control, B.LPS, MF.control, MF.LPS, and so on. Now there's one more thing. Since I have deleted the first six lines, note what the row name of my first row is: seven. The row names were generated when the file was read. By default, if nothing else is specified, the row name is the same as the index of the row. Almost the same: an index is a number, but a row name is a string.
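Both pitfalls just described, the precedence of the unary minus and the surviving row names, can be reproduced in a few lines (a sketch with made-up data, not the LPSDAT solution itself):

```r
# Operator precedence: unary minus binds tighter than the colon
-(1:6)   # -1 -2 -3 -4 -5 -6  : drops rows 1..6 when used as an index
-1:6     # -1  0  1  2  3 ... : a range from -1 to 6, not what we want

df <- data.frame(v = 1:9)
df <- df[ -(1:6), , drop = FALSE ]   # drop the first six rows
rownames(df)                          # "7" "8" "9": old row names survive

# Re-index so row names match row positions again
rownames(df) <- 1:nrow(df)
rownames(df)                          # "1" "2" "3"
```

Without the re-indexing step, `df["1", ]` would fail to find a row, because the name "1" no longer exists after the subset.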
So these look like numbers, but they're actually the strings "7", "8", "9", "10", and so on. And when I delete rows from my data set or subset it, the row names don't get changed, so now my row names run from "7" upward. That can be very surprising: if I want the first row and use the row name "1", that row name doesn't exist anymore; the first row name is "7". So to keep the row names, which get printed whenever I print some columns, aligned with the row indices, I re-index the rows. This is why I set rownames(LPSDAT) to the range 1:nrow(LPSDAT), so that they read one, two, three, four, five, six, seven, as they ought to. A residual problem is that all the values are still characters, and I need to change them into numbers. I can't do that with a single expression, at least not one I'd discuss here; you could do it with an apply statement. So I iterate over the second through last columns and convert them to numeric values: column i of LPSDAT becomes as.numeric() of whatever string it was. If I look at the beginning now, it looks correct: I have genes, I have values for the cells, and the rows have the correct names. And if I run str(), I see that the first column is a character column of gene names and all the remaining columns are numbers, not factors. So remember: by default, if we don't turn this off, R's default read functions turn strings into factors, so we always need to use stringsAsFactors = FALSE. Let's talk for a moment about factors. Why factors in data frames? What are factors in the first place? When are they useful? Why do we have them? Factors are special types. On the surface they are strings, like male/female, or agree/neutral/disagree, or something like that.
But underneath, they are coded as integers that identify the levels of the factor. For example, take genders: factor(c("M", "F", "F", "M", "F")), probably a very 1950s view of what genders could be, as a factor. This shows me the values M, F, F, M, F and two levels, F and M. Internally this is coded as 2, 1, 1, 2, 1, and those numbers refer to the levels: second level, first level, first level, second level, first level. Note that the order of the levels doesn't necessarily have anything to do with the order in which they originally appeared in the vector. It may be alphabetic, but it could also be something else. If we look at the structure in more detail, it is five numbers plus two levels. Especially if the level strings are long, this can be useful and compress the data a little. We can also define ordered factors. If we have something like agree, neutral, disagree, these are not unordered categories: agree is stronger than neutral, which is stronger than disagree. If we want to use the data in that way, we set the parameter ordered = TRUE when we define them, and we can define the levels. In this example, sample grades: I have grades like G1, G3, and so on; I define the levels G1, G2, G3, G4 and declare them ordered. Now my sample grades are what I defined them to be, and note the levels string: G1 < G2 < G3 < G4. This is how you identify an ordered factor. And when you use factors, for example in regression analysis, i.e. to see whether your values correlate with some categorical variable that you have coded as factors, it's important to keep this order intact for the regression analysis. "Can you give us an example of what that might look like in data? Like, why you would have..." Well, like here: quality grades in some kind of analysis.
My pathologist tells me that some of the samples we have are pretty badly decayed and some are very fresh, and then we could perhaps order them accordingly. Or tumor classification might be another example: the more severe the diagnosis, the worse the prognosis, so you might order the factors along those lines. Depending on what you want to do, you could order them this way or that way; it's up to how you define it. So factors are useful because they also support a number of analysis methods, such as keeping box plots in order or calculating regressions. I've linked to a factor tutorial and a discussion of their use; one is by Jenny Bryan, who also teaches for us at times. She's in Vancouver and quite active in our community. Now, for our purposes, and I think for everything we're going to discuss in this workshop, R's default behavior of treating all strings as factors is simply unwanted and needs to be turned off. So when we read something, we always turn it off. And I'd really like to illustrate why this happens. Suppose I have a data frame that I get from an Excel spreadsheet, and the first value was not available, so in my Excel spreadsheet somebody typed "n/a", not available. The rest are numbers, and I read that into a data frame. (Wherever the script says typeInfo(), I just forgot to change the name; it's now called objectInfo(), so replace that or ignore it.) Now, this data frame holds a factor of six levels. It doesn't hold numbers. It's a factor because vectors all need to be of the same type, so everything in the column was converted into strings to match that one string, and therefore we got a factor. Incidentally, that's a common problem with reading CSV files: if someone somewhere in your spreadsheet put in a single blank character, that's very hard to catch, but it will turn your entire column into strings, not numbers.
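The "one stray cell poisons the column" effect can be reproduced directly. One caveat: since R 4.0, stringsAsFactors defaults to FALSE, so to see the old behavior the workshop describes you now have to request it explicitly, as this sketch does:

```r
# One non-numeric cell ("n/a") forces the whole column to strings,
# and with stringsAsFactors = TRUE, to a factor
f <- tempfile(fileext = ".csv")
writeLines(c("n/a", "1", "1", "2", "3", "5"), f)

df  <- read.csv(f, header = FALSE, stringsAsFactors = TRUE)
class(df$V1)    # "factor"    -- not numeric!

df2 <- read.csv(f, header = FALSE, stringsAsFactors = FALSE)
class(df2$V1)   # "character" -- still needs conversion, but safely
```

Either way the column is not numeric, which is why checking column classes after every read matters.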
So once again: whenever you read something in, make sure that you have the right data types. Okay. So can we turn them back into numbers? For example, I remove the first "n/a", so the values are all numbers like 1, 1, 2, 3, 5, 8, and I turn them into numeric by assigning as.numeric() of my first data frame to a second one, and see what that gives me. Is that the right result? It looks correct, except that if you look more closely, as Francesca noticed, there's no 8. There's actually no 4 either in the original data frame. So instead of 1, 1, 2, 3, 5, 8 we get 1, 1, 2, 3, 4, 5, which looks quite similar but would give us an entirely wrong result were we to interpret these as Fibonacci numbers. So what just happened? We converted to numeric, but that just takes the internal representation of the factors, the level codes, and uses those as numbers, and that gives the wrong result. So if our numbers inadvertently get converted into factors, things can go horribly wrong when we turn them back into numbers. It's one of the most dangerous errors, and in this case really hard to spot: the kind of subtle error that can creep in and really ruin your day. What you need to do instead is first turn the factor into characters. That gives me the correct characters: internally the factor codes are indeed 1, 1, 2, 3, 4, 5, but those are indices into the levels, not the actual contents of the factor. If I turn the factor into character and then into numeric, I get 1, 1, 2, 3, 5, 8. So if you ever need to convert something from factor back to what it should be, turn it into character first, with as.character() of whatever the factor is; then you can process it further.
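The whole pitfall fits in three lines; this sketch uses the same 1, 1, 2, 3, 5, 8 values as the example above:

```r
x <- factor(c(1, 1, 2, 3, 5, 8))

as.numeric(x)                 # 1 1 2 3 4 5  -- the internal level codes!
as.numeric(as.character(x))   # 1 1 2 3 5 8  -- the actual values
```

The levels of `x` are "1", "2", "3", "5", "8", so the value 5 is stored as code 4 and the value 8 as code 5, which is exactly where the wrong 4 and missing 8 come from.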
Either you wanted strings to begin with, or you wanted numbers, or logicals, or whatever; then cast the characters to that. Okay. At the end of the day, after everything is said and done, we have the LPSDAT object: genes as characters, expression values as numbers, and clusters as integers. If your LPSDAT object looks exactly the same, that's good. If it doesn't, you can load the one I have in the data folder. R has a very efficient mechanism to save and load intermediate data: you just save whatever objects you want into an .RData file. They are compressed, so if it's a large object of, say, textual data, you'll be surprised how small and compact the binary file on your computer is. To load it back, you simply say something like load("data/LPS.RData"). Note that we are not specifying the object name and we are not assigning anything: the object name is whatever was saved in the .RData file, and load() reconstructs that object in the workspace. This would only be necessary if we had removed LPSDAT; right now it is active in our workspace, so we would not do this. But if I had removed LPSDAT so it no longer existed, I could simply recreate it by loading my saved copy, which should look exactly the same, because that's how I solved the exercise. If your local version of LPSDAT looks substantially different, you might want to do this anyway, simply so we all have the same data to work with; that will help us avoid some surprises. Now, to work with this, something we always and frequently need is subsetting. Is there any volunteer from yesterday and the day before who would like to teach the next 10 minutes on subsetting? Anybody who really spent a lot of time on the subsetting part of the pre-work tutorial? I'm not surprised. So if this is repetitive, that's good: you really need to internalize subsetting.
It's really crucially important. Any kind of data analysis, exploratory data analysis in particular, starts with reading in data and then subsetting parts of it and comparing data as needed. So let's reconsider the different ways of subsetting. I'll start with a synthetic dataset, randomly picked, and let's see what we get here. I give things a name; each has a number of legs; depending on how many legs it has, it can be a fish, a spider, a beast, a crab, or a centipede; and there are some values that I've measured on my objects. Now, the simplest kind of subsetting is subsetting by index. This is a two-dimensional object, so subsetting by index means I specify rows and columns in square brackets. If I specify one particular row and one particular column, I get one element: "spider". If I leave the row position empty, that means all of my rows; for the type column, that is incidentally the same thing as if I had said dat$type. If I leave the first position empty, I get all rows; if I leave the second position empty, I get all columns. If we want a particular set of rows and columns, we pass vectors of positive integers. So, for example, we can say dat[c(2, 3), c(1, 2, 3)]: rows two and three, columns one to three. Any function that returns a vector of integers can be used. Most frequently, we use the range operator when retrieving ranges of rows and columns from a matrix or data frame; this is sometimes called slicing, a slice of a matrix. So dat[1:4, 1:3] works. But it can also be in reverse order: if we specify dat[4:1, 1:3], we get the same thing, ordered the other way around. We can subset only the even rows or only the odd rows. Or we can subset every 100th row, or a random sample of 100 rows.
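The index-subsetting patterns above can be collected in one sketch. The data frame here is a made-up stand-in for the course's dat object, with invented names and values:

```r
dat <- data.frame(name = c("ann", "bob", "cy", "dee", "eve", "fay"),
                  legs = c(0, 8, 4, 0, 8, 100),
                  val  = c(1.2, 0.4, 2.2, 0.9, 3.1, 0.5),
                  stringsAsFactors = FALSE)

dat[2, 1]                          # one element
dat[ , "legs"]                     # all rows of one column (same as dat$legs)
dat[2:3, 1:3]                      # a slice: rows 2-3, columns 1-3
dat[4:1, ]                         # rows in reverse order
dat[seq(2, nrow(dat), by = 2), ]   # every even row
dat[c(1, 1, 1, 2, 2, 3), ]         # indices may repeat
```

Note the `seq()` call: any function that returns integers can supply the indices, which is what makes this mechanism so general.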
So, for example, suppose we have a data file of protein-protein interactions that we downloaded from the STRING database, and it has something like 500,000 rows. If we want to develop some analysis, working with all 500,000 rows might not be very efficient, so we might subsample it, taking only every 1,000th row. That leaves us with 500 data items, essentially randomly sampled, which we can use to develop our analysis before applying it to the entire data set. So, especially if your data sets are large, randomly subsampling a small subset will be useful for analysis. Sometimes, if the data is really large and your analysis is really expensive, you actually have to do that: working with a random sample of your data, obtained by subsetting, is the only way to analyze it. Now, the indices don't have to be unique or in any particular order; we can repeat things. This here repeats the first row three times, the second row two times, and the third row once. We usually think of index subsets as slices, because that is what we most often do, but any ordering is fine. In particular, we can select random subsets with the sample function: sample(1:n, 3) gives us 6, 3, 4, or 1, 5, 4, or 4, 2, 5, or whatever. Or we can sort the data frame. Now, sorting a data frame works somewhat differently than you might expect: you can't just sort the data frame. If you sort a particular set of values, for example the values in column two, you get the sorted numbers. But to get the entire data frame into the order that we want, we don't sort the data frame itself; we look for the order in which we should be presenting our rows. That is not the sort function but the order function. If we look at the second column, it is natively in the order 0, 8, 4, 0, 8, 8, 2, 0, 10, 4.
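The subsampling idea can be sketched like this; the table here is a made-up stand-in for a large download, not the actual STRING file:

```r
# Stand-in for a large downloaded table (e.g. 500,000 interaction rows)
big <- data.frame(id = 1:500000, score = runif(500000))

# Every 1,000th row via seq(): 500,000 / 1,000 = 500 rows remain
thin <- big[seq(1, nrow(big), by = 1000), ]
nrow(thin)   # 500

# Alternatively, a random sample of 500 rows with sample()
rand <- big[sample(nrow(big), 500), ]

# Indices may repeat and come in any order
big[c(1, 1, 1, 2, 2, 3), ]   # row 1 three times, row 2 twice, row 3 once
```

If the input file happens to be sorted, the every-1,000th-row approach is not truly random, so `sample()` is the safer choice when the ordering might carry structure.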
If we apply order to that, we get the vector 1, 4, 8, 7, 3, 10, 2, 5, 6, 9. What does that mean? These are the indices at which we would need to take the numbers to get them into sorted order. So 1 means the first element, which is 0; 4 means the fourth element, which is also 0; 8 means the eighth element, which is also 0. That is how we get the 0, 0, 0 at the start of the ordering. Let's use a slightly smaller example to illustrate this. If we have the vector 8, 13, 5 and we sort it, we get 5, 8, 13. If we order it, we get 3, 1, 2. This 3, 1, 2 is the set of indices into our original vector that puts it in sorted order: the 3rd, 1st, and 2nd elements are 5, 8, 13. Applying the result of order to our data frame is, again, just sub-setting: order creates an index vector, and we use that index vector to pull the rows out of our data frame in the correct order. So our data frame, sorted by number of legs, which is column 2, puts this vector into the row position, and then just takes columns 1, 2, 3. Now we have them ordered by number of legs: three fishes, one bird, two beasts, three spiders, and a crab. We could also order them by lexical order of the names, ordering by column 1: C, I, J, R, S, S, T, U, W, and X. So ordering is very versatile. When I look at an ordered vector, though, it always takes me a little while to figure out what's going on; take the time to become familiar with it. If you specify a negative index, that element is excluded. So dat[-1, ] means the whole thing without the first row, dat[-n, ] everything without the n-th row (for example the last one), and so on. So that's the first principle: positive and negative indices. The second principle of sub-setting is logical vectors. Instead of indices, we specify sets of rows or columns by Boolean values, TRUE or FALSE, and if we place a vector of logicals into the square brackets, only the rows or columns for which the value is TRUE are returned.
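The sort-versus-order distinction, and its use for sorting a data frame, can be sketched as follows (the small data frame is made up for illustration):

```r
v <- c(8, 13, 5)
sort(v)       # 5  8 13 : the sorted values themselves
order(v)      # 3  1  2 : the indices that put v into sorted order
v[order(v)]   # equivalent to sort(v)

# To sort a whole data frame, use order() to build a row-index vector
dat <- data.frame(name = c("Sue", "Ann", "Bob"),
                  legs = c(8, 0, 6))
dat[order(dat[, 2]), ]   # rows ordered by number of legs: Ann, Bob, Sue
dat[-1, ]                # negative index: everything except the first row
```

The key point is that `order()` does not return sorted values; it returns the permutation of row indices, which is exactly what the square-bracket subsetting needs.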
So a logical vector over the columns that is TRUE only for the name and type columns and FALSE for everything else gives me, for rows 1, 2, 3, the name and the type of my collected specimens. You need to take care that the number of elements in your logical vector is exactly the same as the number of rows or columns, respectively. These logical expressions can be combined with the AND and the OR operators. And I'm skipping ahead a little bit. So: sub-setting by index, sub-setting by logical expressions. How about filtering? Filtering means selecting from a data set depending on what the values are. If we take all values that are greater than 100, we're filtering our data set. If we filter by string matching, we can use grep or the %in% operator to subset. For example, a grep over the third column, i.e. the type, for entries that contain an "r" tells me that types with an "r" are found in rows 2, 5, 6, 7, and 9. What are these? Well, I take the result of that grep and use it as the row index, together with columns 1 to 3, and we see these are spiders, birds, and crabs, all with "r"s. Types that begin with a "c", using the regular expression "^c", where the caret is the special character that anchors the match to the beginning of the string, are crabs. I had also defined centipedes, but no centipedes happened to end up in my collection. As for the %in% operator, let's see; I'll review with you how that works. So then: we have indices, we have logicals, we have filtering by some kind of logical expression, either grep for string matching or other logical expressions. Then we have subsetting by name. If row names and column names have been defined, we can use those for subsetting. For example, I can ask for rows 1 to 5 of the column called name, or rows 1 to 5 of the columns called name and legs; in that case, I pass a vector of the row names or the column names. So vectors of row names and column names also work. And finally, the dollar operator.
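These four remaining principles — logical vectors, filtering with grep and %in%, subsetting by name, and the dollar operator — can be sketched on a small made-up data frame (the specimens here are invented, not the instructor's data):

```r
dat <- data.frame(name = c("Tim", "Sue", "Cleo", "Rex"),
                  legs = c(8, 0, 10, 4),
                  type = c("spider", "fish", "crab", "beast"))

dat[dat$legs > 5, ]                     # filter rows by a logical expression
grep("r", dat$type)                     # 1 3 : types containing an "r"
dat[grep("^c", dat$type), ]             # regex "^c": types beginning with "c"
dat[dat$type %in% c("fish", "crab"), ]  # set membership with %in%

# Subsetting by name, once row names are defined
rownames(dat) <- dat$name
dat["Sue", c("name", "legs")]

dat$legs                                # the dollar operator: one column
```

Note that `grep()` returns integer indices, so its result drops straight into the row position of the square brackets, while `%in%` returns a logical vector of the same length as its left-hand side.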
I often use the dollar operator whenever I want to extract a single column from a data frame. In this case, if we want to extract the column named legs, I would write dat$legs. So in my code, the dollar operator indicates that I am taking a single, individual column of a data frame. Okay, now we have our object LPSDAT; let's do something more real-world. As a task, I'd like you to write R sub-setting expressions that get the following data from LPSDAT. The first one: rows 1 to 10 of the first two columns, in reverse order. The second one: the expression values for LPS-stimulated monocytes, in the column mo.lps, for the top 10 expression values only. The third one: all genes for which B cells are stimulated by LPS by more than two log units. The last one: expression values for all genes whose gene names appear in figure 3B in our data set. Remember, those are the car genes; the strings are in car genes, and you need to figure out how to subset the LPS data by the genes in there. Hint: this is where you use the %in% operator. And that's it, these four tasks. In six minutes we'll be breaking for lunch, so you have five minutes to do this. No, actually, it's four tasks, so four minutes. Try to get as far as you can; you should not need more than 20 seconds for the first one. The others are a little more involved, but also a little more real-world. Now, note one thing: this is, after all, not just an R introduction; this is exploratory data analysis. With these operations we're doing our first data-exploration tasks, specifically finding the top 10 expression values in a large data set. This is the kind of thing you would do when you actually explore your data, and we'll get to more of that very soon. We're already imposing on our lunch break. If you feel obliged to keep working: no, no such thing. You need a break.
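One of the task patterns above, finding the top 10 values and filtering by a gene list, can be sketched like this. The data frame is a synthetic stand-in for LPSDAT (the real object is not reproduced here); only the column name mo.lps comes from the task description, and the gene list is made up:

```r
# Synthetic stand-in for LPSDAT, for illustration only
set.seed(42)
lpsdat <- data.frame(genes  = paste0("gene", 1:100),
                     mo.lps = rnorm(100, mean = 8, sd = 2))

# Top 10 expression values: order decreasing, then keep the first ten rows
top10 <- lpsdat[order(lpsdat$mo.lps, decreasing = TRUE), ][1:10, ]

# The %in% pattern for the figure-3B style task, with a made-up gene list
figGenes <- c("gene3", "gene17", "gene99")
lpsdat[lpsdat$genes %in% figGenes, ]
```

Chaining the two subsets (order first, then take rows 1 to 10) is the idiomatic way to get "the top n" in base R; `head(lpsdat[order(lpsdat$mo.lps, decreasing = TRUE), ], 10)` is an equivalent spelling.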
You need to ventilate your brain cells from time to time. So I'll see you back after lunch.