Welcome back. I've noted down the issues we talked about here. I don't think I'd like to spend more time addressing them, though, because I'd rather move on and finally get to the real data. What I would like to do, however, is spend just a short while forming some memories. So let's have a little pop quiz. In the next three minutes, write into your myScript.R file as many functions as you can remember that we discussed this morning. You should come up with at least 37. I'm not kidding. If you have a hard time remembering them, a good way is to think about what we were doing: we were reading data in various ways, we were exploring the features of objects, we were constructing mnemonic codes from binomial nomenclatures, we were doing things to configure our R session. Time is up. Pencils down. Don't you hate these words? Okay. So what did we have? Let's start with configuring R. What functions did we come across there? Technically yes, but that's not part of our base R installation. What else? cat(). What else? Something quite similar to cat()? print(). Do you remember what init() did? It copied something, yes. So there's a command for copying files. What would you guess it's called? No, that's for reading data. A command for copying files -- how could it be named, if you think there ought to be a command for copying files? copy? What does RStudio tell us when we type "copy"? No, it doesn't recognize that. file.copy(). There are a lot of different things we can do with files. source() -- yes, we used source(). Anything else? substring(). strsplit(). seq() and its relatives. rep(). Anything else? Come again? typeInfo() -- yes, like init(), not in base R, but we used it. And there were a number of functions within that: things like mode() and attributes(), typeof() and class(), I believe, and the very useful and important str(). Anything else that comes to mind? A really important one: c(), our vector constructor. You suggest one we never used -- where, what line, do you remember? history() -- oh, of course. And you can use it with selection, and match case, and whole word, and wrap, and so on, so that's quite versatile; that's an important one. Did I forget something? return() -- a function used within a function. I think that's basically it. Now, for most of these, do you think you could write one example? Just one example? How about this: this table here, write one example for substring() -- just any substring of something that comes to mind. The next table in the back, write one example for strsplit(). Then the table in the back, one example for print(). The table in the back over here, an example for seq(). The table here in the middle, an example for tolower(). And the front table here, something for rep(). Just come up with something that doesn't create a syntax error. That's a great exercise to consolidate these function words. Any problems? Anything that, to your surprise, didn't work? Ask Greg or Lauren. So that's a great way to consolidate these things. You can only learn them through repetition. One way to repeat, of course, is to actively work with them. But for the most basic and most frequent commands, it's good to write yourself a little script of just the command names, and then in the morning after you brush your teeth and get ready to go, or in the subway when you actually get a seat, just write a small example for each. And be active in writing.
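For reference, here is the kind of one-liner each table might produce -- a minimal sketch; these are my examples, not the ones written in class:

    substring("transcriptome", 1, 5)   # "trans"
    strsplit("Mus musculus", " ")      # list containing "Mus" "musculus"
    print("Hello")                     # [1] "Hello"
    seq(1, 10, by = 2)                 # 1 3 5 7 9
    tolower("GAPDH")                   # "gapdh"
    rep("ACGT", 3)                     # "ACGT" "ACGT" "ACGT"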
If you only listen to me talking, it's not going to help that much. If you only read the code, it's also not going to help that much. Things become most clear when you actively write them. Now, we've been dancing around this business of actually entering data from the supplementary material of that paper. We've already entered a few gene names in an Excel spreadsheet, but that was kind of cheating, because that was a text file already. In order to import one of the main datasets that the authors have published, we'll talk briefly about ways to store data in principle. And that brings us to lists and to data frames. We've mentioned before that elements of a vector must all be of the same type: we can't put characters and numbers and logicals all into the same vector. Rows of a matrix must all have the same length, and columns of a matrix must all have the same length, so we can't have rows or columns where one is half the size of another. The most general data structure in R, which has none of these limitations, is the list. Lists are just collections of elements, and these elements can be lists, and vectors, and matrices, and data frames, and anything, even functions. They can have any type, and they can have any length. As a very brief example: to construct a vector we use the c() function; for lists, we use the list() function. And if we construct a list for a common laboratory plasmid, we could do something like this here. A list for pUC19 would perhaps have an element for its size, an element for the selection marker, the origin of replication, the GenBank accession number, whether we actually have it in the lab or not, and another list of restriction sites, where each individual element within that nested list could be a single scalar if it's a unique-cutting restriction enzyme, or a vector of sites. Executing all that defines our list. Now, the difference between double square brackets and single square brackets is the following. Double square brackets always access only one element, and if you want to access list elements that way, you have to do it one by one; you can't go through them in ranges. So in order to access the first list element here, we put 1 in double brackets. The first list element is what we defined as the size; the third list element is what we defined as the ori. But there's a slightly different notation as well: the dollar notation. Instead of asking for the third element, we can use the dollar notation to extract the value that we associated with that name when we constructed the list. So if our ori was ColE1, then pUC19$ori gives us that value. pUC19$sites gives us an entire list of all the restriction sites, like EcoRI and AclI. And using the dollar operator again on that nested list, pUC19$sites$BanI gives us all the BanI restriction sites, and we can ask, for example, what the second restriction site is. So lists are very versatile; we can put any kind of data into a list. Brandon? Do the double square brackets actually do anything different? Because if you put the same index in with just a single square bracket, it still returns the same thing -- is that just a notation you wrote? Honestly, I just remember it by convention: I always access single list elements with double square brackets, and for what's inside them, the longer vectors, I use single square brackets. There's a long and formal reason why we do that. Greg, do you remember? Lauren?
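A minimal sketch of such a plasmid list. The size, marker, ori and accession number are standard facts about pUC19; the inLab element and the restriction-site positions are illustrative values I've made up for the example:

    pUC19 <- list(
        size      = 2686,           # base pairs
        marker    = "ampicillin",   # selection marker
        ori       = "ColE1",        # origin of replication
        accession = "L09137",       # GenBank accession number
        inLab     = TRUE,           # do we have it in the lab?
        sites     = list(           # nested list of restriction sites
            EcoRI = 396,                      # unique cutter: a scalar
            AclI  = c(1120, 2245),            # positions illustrative
            BanI  = c(235, 408, 550, 1647)))  # positions illustrative

    pUC19[[1]]           # first element: the size
    pUC19[[3]]           # third element: the ori
    pUC19$ori            # "ColE1" -- same element, accessed by name
    pUC19$sites          # the whole nested list of restriction sites
    pUC19$sites$BanI     # all BanI sites
    pUC19$sites$BanI[2]  # the second BanI site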
Right. So it doesn't seem to be very important in practice. But you do see a slight difference here: if I extract that element with double brackets, I actually extract the contents of that list item; if I extract it with single brackets, I extract it together with its name, as a list. Oh, I wouldn't even know how to properly express that, and it doesn't seem very important in practice. But okay, tell us why. So: if I simply extract with a single bracket, my typeInfo() shows me that what I have is still a list -- it's just a part of the list. But if I use the double bracket, I actually get the contents. That's the difference. Okay. Now, I think by far the most frequently used data structure in R analyses is some variant of the data frame. Data frames map very naturally to your idea of the structure of a spreadsheet: they have rows and they have columns, and we can access them by rows and by columns. The important part is that the columns are essentially vectors, so all the elements within one column must be of the same type, but the types of the individual columns can differ. My first column can be a character column, say for gene names, and my remaining columns can be numeric columns, say for expression values. That's the typical way I could store things like that. So, for example, if I construct a data frame similar to the way I've done before, I could have a column called genes with some gene names, a column with expression values, and a column with logical values that could be TRUE or FALSE. That's the structure this would have: R gives me the column names, the values, and the row names. And we can access them by [row, column]. For example, myDF[-2, ] means everything in myDF except row number two: a negative index means exclude that element along that dimension. So this excludes row number two; if I wanted to exclude column number three, I would type myDF[ , -3] instead. Right? The syntax within these brackets is always rows, comma, columns. Whatever you put in the rows position can be used to subset the data frame, and whatever you put in the columns position can be used to subset it as well. We'll speak more about subsetting later. Now let's look at this here. The data frame has the values we gave it: three observations of three variables, i.e. three rows and three columns. My first column is called $genes, and it's not character -- it's a factor. So these are no longer strings, and that's something you have to be careful about. When you construct data frames in R, by default everything that's a string is treated as a factor. We'll talk more about factors later; very briefly, we use factors to encode categorical data, like male/female, or induced/uninduced. If we have a textual column of strings and it is converted into factors, that's usually not what we want. So in order to turn off this behavior, you have to make it a habit to always type, as the last argument when you make a data frame, stringsAsFactors = FALSE. Now, that's a bit of a nuisance -- I think it's a great nuisance -- and you can also turn it off globally. But if you ever then encounter a package that assumes strings are automatically read as factors, and you've turned that off globally, the results from your package might be wrong. So that's not a good idea.
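A sketch of both points, reusing the list from above plus an illustrative data frame (the gene names and values are mine; note that the factor behavior described here is the pre-R-4.0 default):

    str(pUC19[1])    # single bracket: a list of length 1 -- still a list
    str(pUC19[[1]])  # double bracket: the contents of the element itself

    myDF <- data.frame(genes   = c("CD69", "Ly6C", "Emr4"),  # illustrative
                       expr    = c(-2.5, 0.7, 1.3),
                       induced = c(TRUE, FALSE, TRUE))
    str(myDF)    # in older R, genes comes back as a factor by default
    myDF[-2, ]   # everything except row two
    myDF[ , -3]  # everything except column three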
But if we recreate this data frame with exactly the same values and stringsAsFactors = FALSE, the typeInfo() now shows us that the first column, genes, is no longer a factor but a character column, i.e. a column of strings. Now, one thing we could be asking about the paper is whether the known markers of Figure 2D or Table S3 are actually enriched in the data. Can we see their results in the data? Can we confirm that? In order to do that, we need to import their data -- the data that is posted in the supplementary material section of the Science website. I've downloaded this for you; it's exactly the way it comes from Science, and it's called table_S3.xls. Can you open that in Excel, or in OpenOffice if you don't have Excel, or whatever other spreadsheet you have? Let's have a look at what's in there. Okay, any issues? Is this immediately usable? Of course, the question is: usable for what? We could ask a biological question: find me the gene that is most significantly or most strongly induced by LPS stimulus across all cell types. Is that in the data? And what is the data anyway? Numbers, yes, but what do these numbers mean? How do we know? Okay. So we have normalized data: expression enrichment, log2. It's the log2 enrichment of CD11c-positive cell populations from mice. Anybody here into hematology? CD11c: a pan-myeloid marker. The mice were treated or not treated with LPS, the cells are classified into seven cell types, and the clustering of Figure S3B is based on this table. So this is where we see the individual genes and their responses to stimulation in the different cell types -- the enrichment in each cell type. Right. So what's a high expression or a high enrichment versus a low enrichment? Is it an increasing number or a decreasing number? These values are all negative, so the enrichment ratio is smaller than one. CD69 in cell type one, after two hours of LPS treatment: is it enriched or depleted? Yes, that's enriched. The larger the numbers are -- remember, these are negative numbers -- the more enriched this is; a value of 2.5 is a log2 factor of 2.5. I think you're probably all essentially familiar with this type of data. Right. Okay. So, once again, the question: how do we get it into R, so we can start playing with it? Are we going to copy this and paste it into a text document and start adding commas and quotation marks? No way. We could ask our summer student to do that if we really, really, really hate him, but we can be sure the results would not be correct. There's a lot of data here. Okay, long story short: in order to get it into R, apart from using one of the read-XLS packages, we should save it as comma-separated values and use read.csv(). And table_S3 as comma-separated values -- this is what it looks like. When Excel produces a comma-separated values file, it figures out that many of these cells are empty, so they're just sequences of commas. Every line corresponds to one row. There's some text here, there's information about cell types, there's an empty row here, and then there are the actual values that we want. Okay, so we have to read this. Right, so we've opened this in Excel. A cautionary word about Excel: it's an excellent spreadsheet program -- I use it for spreadsheets a lot -- but it's a terrible, terrible statistics program.
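One way to inspect the exported file from inside R, as a sketch (the file name follows the transcript):

    # Look at the first lines of the CSV exactly as text --
    # empty cells show up as runs of commas:
    readLines("table_S3.csv", n = 10)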
It's incredibly poor as a statistics program; it makes real statistical errors. There are two links I've given you here, if you want to read up on how Excel's statistics are broken in many ways. And moreover, Excel makes really, really atrociously horrible plots -- they look awful. So it's okay to keep data in Excel spreadsheets; if it doesn't get too big, you can keep the data for an entire project in Excel spreadsheets. It's one way to store it: you can easily visualize it, you can do some checks on data integrity, and you can do some basic calculations. But ultimately, you'll want to write scripts that import your data into R for real analysis. Now, as I mentioned this morning, you have to be cautious: one of the problems of Excel is that it truncates numeric precision. So convert all cells to text before you export them; then that doesn't happen. All right. Now I would like you to read this supplementary table from Excel. The task steps are here: load the data; save it as a comma-separated values file; examine the file and understand what it looks like; read it into R and assign it to a variable; use the function head(). head() by default gives you the first six elements of whatever the object is, to look at the beginning of the object. You'll want to see more than the first six lines, though, so that you get an idea of how many extraneous lines -- lines without data, lines that only describe the data -- have been introduced, and so that you can then remove any unneeded header rows. Then you somehow have to give the columns names that reflect the cell type and the stimulus status, i.e. whether it's with LPS or without LPS. Finally, you can use typeInfo() to analyze the object you have created. This kind of work is our everyday task in data analysis: taking the data, reading it into R, and, as we read it in, cleaning it up. It's sometimes a major nuisance to have to clean up the data; you can essentially never assume that a dataset you get from somewhere on the internet can be used just as is. So we'll do that now. In principle, I think we've covered the functions that you need, so I'd like you to try this yourself. And the way you try writing code like this is simply to use comments to write through the steps that you need to do. "Open file and assign to variable." No, that won't work. "Use read.csv() to open file and assign to variable. Figure out correct options." Something like that. Structure your work; put it into steps. And then, after you've written your comments on how you would go about solving this problem, you start implementing them in code. One of the problems when you get stuck in coding is that you write code too soon. Before the problem is entirely clear, you start writing some code, and maybe it isn't quite the right thing to do, but you've already invested so much time looking up this function and that function that you don't want to go back, throw it all out, and start again, and so you start hacking things together. That's not good. Try as much as possible to figure it out conceptually first, and then write the code as the last cherry on top of your conceptual construct. So write it down as comments, point by point. The goal is to take this data file and end up with a data object that we can use. And in order for us to be able to use it, it should contain the gene names in the first column. Let's write down our requirements -- that's a good way to start.
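As a sketch, the comment-first plan might look like this -- the step wording is mine, not dictated in class:

    # Plan -- comments first, code later:
    # 1. Save table_S3.xls from Excel as table_S3.csv
    # 2. Use read.csv() to open the file and assign it to a variable;
    #    figure out the correct options (header? stringsAsFactors?)
    # 3. head(rawDat, 10) -- how many non-data lines are at the top?
    # 4. Remove the unneeded header rows
    # 5. Set column names reflecting cell type + LPS status
    # 6. Check the result with str()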
Gene names in the first column; expressive column names -- the column names should in some way relate to the cell type and the status of whether the cells are LPS-induced or not. What else? Data columns: only numbers, no factors. I think that's about it. See where that gets you. If you're completely lost and don't even know how to begin, put up a red post-it and we'll help you out. So, just to focus your thinking here: if we use the function read.csv() with the file name table_S3.csv, we can assign the result to a data frame called rawDat. Now, if we look at rawDat, we can immediately see that this doesn't look very good. Here, this is the first column name, this is the second column name, this is the third column name -- that's just nonsensical. This is the value of the first row in the first column, this is the value of the second row in the first column, and so on. Again, that doesn't look very good; a lot didn't actually work. Moreover, some labels here appear for one column and then something is missing for the next, because, if you remember the Excel spreadsheet, they didn't individually label the cell type and the LPS stimulus status: they had one row where a single merged cell was responsible for labeling two columns hierarchically. So they're not individually labeled; you basically have to reconstruct these labels. That's not very good either. There's a lot of messiness going on. One thing we can do here is specify that we should not be using a header -- there's no header line in this file that we can use for our table -- so presumably there's an option in read.csv() where we can turn on or off whether our dataset contains a header line. And then, after reading the data, we should probably get rid of the rows that contain these comments, because we obviously can't use them as data values. That's another thing we have to do to clean up our raw data. Moving on from there: we can simply turn off the header by saying header = FALSE. That looks a little better: now the first column is not called "Table S3 clustering genes" but V1, the second column is called V2, then V3, V4. But we still have all this information here that we don't need: row one, row two, row three, five, six, and so on. Moreover, if we look at the structure of that data frame, we notice it's all factors: factor, factor, factor. And we said we don't want factors; we want real numbers. So apart from header = FALSE, we need to say stringsAsFactors = FALSE as well. The next thing we want to do is get rid of the rows we don't need. So how many rows at the top do we have to exclude? All we can see with the head() function is that it's more than six, because we don't yet see any of the numeric values we need. So let's see how many there are. If we add a second, numeric argument to head() -- or, incidentally, to tail() -- we can increase the number of elements, or in this case the number of rows, that we see. Here's what this looks like somewhere in the middle: we see that rows one, two, three, four, five, and six contain all this header information, and the following rows contain the numbers. So we need to take our raw data and remove rows one through six. And we can simply do that how? Removing rows? We could use skip when we read -- that's a good option. Or, generally, we can just use -(1:6). I'll show you.
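A sketch of the reading attempts just described, with the options as discussed:

    rawDat <- read.csv("table_S3.csv")  # naive attempt: nonsensical column names
    rawDat <- read.csv("table_S3.csv",
                       header = FALSE,            # no usable header line
                       stringsAsFactors = FALSE)  # keep strings as strings
    head(rawDat, 10)  # the second argument shows more than the default six rows
    str(rawDat)       # check the column types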
So this vector here is minus one, minus two, minus three, minus four, minus five, minus six. And it says: as we copy our raw data into LPSdat, which will contain our real data, remove these rows. Right. But we're still going to have to make our own column names; these strings should definitely not be in the data section of our object. Okay. Once we remove rows one through six, it looks more like a data frame we could use. Column one contains our gene names; then we get a number of expression levels; and then we have column 16 at the end, which is a cluster membership label -- which cluster each gene belongs to. Okay. Now, you mentioned the option skip. Let's look at that in the read.csv() help: "integer: the number of lines of the data file to skip before beginning to read data" -- exactly what we need here. So let's try skip = 6 and see what that looks like. Right, we get the same thing: skipping the first six rows. We can either remove them after reading, as before, or skip them explicitly while we read. Okay. So in this case I assign this to LPSdat, and then I check what we actually have: we have a character column, numeric columns, more numeric columns, and integer columns. Now, my sample solution -- which does not use skip, but drops the rows with negative indices as I did before -- has the effect that the columns all end up as character values, because there were character strings at the top of each column. And then, in my sample solution, I have to go through column by column and convert the individual columns to numeric. However, the excellent suggestion of using skip obviates that problem: since those lines are skipped, R determines as it reads that all of the data within a column is actually numeric, and so makes the column type numeric. I actually think that's the better solution; I've changed my sample solution here. Then I need to change my comments too -- I'm not going to do that now, maybe later. All right. What next? We need to change the column names. To set the column names, in this case I simply define a vector that is as long as the number of columns I have, and I basically define the names by hand: the first column will be called genes, the second column B.control, the third column B.LPS, the next column MF for macrophages, and so on and so on, and then the genes' cluster assignments. A question from the floor: in the CSV file, the column headers run across the top and the IDs down the left, and the headers in row six were all unique -- could I just pick those up rather than typing them manually? Skip five rows and then use row six as the header? Yes, you could do that. Or you could edit the names into your Excel file, skip six rows, and use whatever you put there; that works just as well. So you can do it either in R or in Excel. Which is the preferred way, and why? What do you mean, "more reproducible"? Because if, let's say, we need to look at this dataset again, we'll have to remember that we made the change in the Excel file. Exactly. Once you do that, the file that you have on your disk is no longer the original file that you downloaded from the scientists: you've changed it, and you have no record of how you changed it, because Excel doesn't keep a record. So the correct way is to read the file in as it was, and then do all your manipulations in your script file, because then you can read your script and exactly reproduce and understand what happened.
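A sketch of both routes, following the discussion above:

    # Route 1: read everything, then drop the six header rows
    LPSdat <- rawDat[-(1:6), ]   # but now every column is character

    # Route 2 (better): skip the six lines while reading, so R can
    # recognize the numeric columns by itself
    LPSdat <- read.csv("table_S3.csv",
                       skip = 6,
                       header = FALSE,
                       stringsAsFactors = FALSE)
    str(LPSdat)   # a character column, numeric columns, an integer column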
And if it's something complex, you can put in a comment why you did it and what you were thinking when you did it -- "what did I ever think when I did that?" -- that's very, very useful. So you're exactly right: the better way is to leave the data that you downloaded from the website as it is. Okay, so this vector here defines our new column names. The function colnames() -- and its cognate rownames() -- allows us to set row names and column names to whatever we want. This says: the column names of the object LPSdat are going to be that. It's a function that can either set or get data. Before this assignment, the column names are simply V1 to V16; after I've assigned them, the column names are these here: genes, and so on and so on. That's also rather important, because we don't actually need to refer to these columns by their names or whatever strings we assign to them -- we could just use square brackets and call them 1, 2, 3, 4, 5, 6. However, if we incorporate the semantics of what these columns contain into our script files, it is much, much easier to avoid errors. If I intend to do something with, say, macrophage control minus macrophage LPS stimulation, and I write LPSdat$MF.control - LPSdat$MF.LPS, that's completely explicit in the script. But if I write something like LPSdat[ , 4] - LPSdat[ , 5], how do I know that's right? I need to go back and check it, and the chances of actually making errors at that point are significant. That's why we use column names and row names that capture the semantics. And that's really important, because here we can do the wrong thing and it will not generate a syntax error; it will just produce erroneous results. So whenever you can: put in checks and balances, make things explicit, write comments, name and label things, and take no shortcuts with that. All right, so now we have the right column names. What does the whole thing look like now? Perfect: column names here, and so on. Incidentally, using skip relieves us from resetting the row names, because after we use skip, our first row is called 1, our second row is called 2, and so on. With what we did previously -- deleting six rows -- our first row was called 7, our second row 8, our third row 9, and that, of course, can be quite confusing. So we should make sure that the row names correspond to the actual row numbers and are not offset. We could also take the genes and assign them as row names; that could be useful under certain circumstances -- it would be useful for subsetting. Maybe we'll look at that. But: column names -- and then we're done. If the structure looks like this, you have a data frame; it has 1,341 observations of 16 variables; the first column is called genes, is of type character, and has the gene names; all other columns are numeric, and these numeric columns are either expression values or gene clusters. If it looks like this, we should save our beautifully cleaned and imported data. To save a piece of data in R, we can use the save() command. The arguments for save() are the object that you want to save -- it can be more than one, it can be a list of objects -- and the file name. So in this case we execute save(LPSdat, file = "LPSdat.RData"). Once I've done that, I can remove the object, and now it no longer exists. But then I can reload it.
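A sketch of the naming and saving steps. The cell-type abbreviations here are illustrative placeholders, not the authors' actual labels:

    # Build the 16 column names programmatically: one "genes" column,
    # seven cell types x (control, LPS), and a final cluster column.
    cellTypes <- c("B", "MF", "NK", "Mo", "pDC", "DC1", "DC2")  # illustrative
    myNames <- c("genes",
                 paste(rep(cellTypes, each = 2),
                       c("control", "LPS"), sep = "."),
                 "cluster")
    colnames(LPSdat) <- myNames   # colnames() can get or set

    # Save, remove, reload:
    save(LPSdat, file = "LPSdat.RData")
    rm(LPSdat)            # gone ...
    load("LPSdat.RData")  # ... and back, under its original name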
One thing that's important to notice about reloading is that the name of the object is automatically restored from the stored file: we don't specify what the name is supposed to be, we only specify the file name. So LPSdat is automatically recreated, and we get the whole thing back. All right. Now, does everybody have a data frame that looks like this? You guys are awesome. So let's play with the data. But let me first briefly mention this thing about factors. Factors are really useful -- just not here. We use factors, for example, in regression analysis, if we want to see whether a particular treatment is correlated with the status of male or female probands. There, these things are used as factors, and they support a number of analysis methods, such as ordering box plots, for example, or calculating things. I've listed two tutorials on factors, and they're quite useful. But in our case, we really need to turn the factors off, because the way factors are stored is not necessarily the way you think they are stored. Let me briefly go through this example. I have a data frame, and my data looks like this: it has the numbers 1, 1, 2, 3, 5, 8, but at the beginning somebody typed the string "NA" -- not a missing value, but the literal string "NA". The result is that this column becomes character data, and once this column is character data, if I don't turn the behavior off, the whole column is a factor. So this is what it looks like: it's a data frame, seven observations of one variable; data is a factor with six levels, and the levels are 1, 2, 3, 5, 8, and NA. Now, what happens if we want to get our numbers back? Since the first row is not available, we can remove it. Now this is what it looks like: we have the values 1, 1, 2, 3, 5, and 8. But remember, these are not numbers -- they are factors, and I recognize that because R tells me what the factor levels are: 1, 2, 3, 5, and 8. Notice that 1 as a factor appears twice in my data, and NA still appears as a factor level, because it was initially present, even though it's not in the data anymore. So there's information stored about the data frame that is separate from what's still in it. Okay, now I say: well, these are all numbers, so I really want them as numbers. So I apply as.numeric() to my data and assign the result to myDF2. What does myDF2 look like? 1, 1, 2, 3, 4, 5. Whoa -- how did it change from 1, 1, 2, 3, 5, 8 to 1, 1, 2, 3, 4, 5? That's a nasty error, because it's subtle: it's not a syntax error. What we've done here is a perfectly valid conversion of factors to numbers; you won't get an error or a warning, but your results will be wrong. Does anybody understand why I got these numbers? Exactly. A factor is essentially an ordered, enumerated list. The first factor level is the string "1", the second level is the string "2", the third level is the string "3", and so on. The sixth level is the string "NA", which doesn't exist in the data anymore, and the fifth level is the string "8". So if I convert that to numeric, I don't convert the string "8" into a number -- I get the position of that string in the order of levels, which is somewhat arbitrary, determined by the way the factor was constructed. Even though "NA" was in our first row, it ended up being the sixth level.
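A minimal sketch of the trap, forcing the old default explicitly so it reproduces in any R version:

    x <- c("NA", "1", "1", "2", "3", "5", "8")  # someone typed the string "NA"
    myDF <- data.frame(data = x, stringsAsFactors = TRUE)
    str(myDF)             # Factor w/ 6 levels: "1","2","3","5","8","NA"

    myDF <- myDF[-1, , drop = FALSE]  # remove the "NA" row, keep a data frame
    as.numeric(myDF$data)             # 1 1 2 3 4 5 -- level positions, WRONG!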
So what we need to do is: before we turn them into numbers, we turn them into characters. Converting these factors to characters preserves what they are. So we convert to character first, then we convert to numeric, and then everything is as it should be: these are now numbers, and not factors anymore. So you really have to be very, very, very careful. And instead of being very, very, very careful, the better way is to avoid factors unless you actually want to work with factors. Question: so the first two entries were both the same string, and so they got the same factor level twice? Exactly, and the second distinct string is the second level. Imagine these had not been 1, 1, 2, 3 but male, male, female, neutral -- it's the same thing; our values just happen to look like numbers. In fact, these were not numbers when they came in, they were numerals, because of that "NA" at the beginning. Yep, exactly. So we can convert them back, but you have to be very careful about how you do it. Better to avoid factors unless you really want to work with them. And the way to avoid them, of course, is to always, religiously, make sure that when you create a data frame you use stringsAsFactors = FALSE. It's not even that much additional typing, because of command completion: data.frame, then just type "str", and stringsAsFactors gets completed. So don't avoid it. Yes? If you have a problem with that, you can also explicitly set the type of any individual column that comes in: you give read.csv() an argument listing the column types, and then you can treat things that look like numbers as numbers -- or actually treat them as factors, if you need them to be factors. Yep, exactly. And you can define a vector holding your column types and just pass that variable in, so you don't have to type it all out in that big long read.csv() call with all the options. Yes -- colClasses, that's the argument where you define what the individual columns are. Okay, so much for factors. Now let's see what we have. What will this expression do? The first 10 -- the first 10 what? The first 10 rows? Does it? It does. Yay. Okay. We've looked at head(); we can also have a look at tail(), the cognate of head(), which gives us the end. The function nrow() gives us the number of rows: 1341. The function ncol() gives us the number of columns: 16. And we can use the dollar syntax: for example, to get the first 10 genes, I can say LPSdat$genes[1:10]. That's actually my preferred way of working with data frames, at least when I use columns one by one. If I need a range of columns -- for example, to calculate means -- then I use a range with the colon operator. But for an individual column, I tend to always write LPSdat$genes[1:10] rather than LPSdat[1:10, 1], which is exactly the same thing in terms of its output. Sorry, this is wrong. Oh, nice -- what did I do wrong? What's my mistake here? Why is this not the same thing? It should have been, but why is it not? Where is the error? "If you use 1 there, does that mean that for row one, you want the first 10 variables?" Exactly: what I wrote says "give me the first 10 columns of row one", but what I actually want is the first 10 rows of column one. There we go. Yeah, I should update that. And here's how I do this.
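A sketch of the fix and of the exploration commands, using the names from above (the colClasses values are illustrative: one character column, fifteen numeric ones):

    # The safe conversion: factor -> character -> numeric
    as.numeric(as.character(myDF$data))   # 1 1 2 3 5 8 -- correct

    # Or sidestep the problem at import time with colClasses:
    dat <- read.csv("table_S3.csv", skip = 6, header = FALSE,
                    colClasses = c("character", rep("numeric", 15)))

    # Exploring the cleaned data frame:
    nrow(LPSdat)          # 1341
    ncol(LPSdat)          # 16
    head(LPSdat)          # first six rows; tail() gives the last six
    LPSdat$genes[1:10]    # first ten gene names ...
    LPSdat[1:10, 1]       # ... the same thing: rows 1-10 of column 1
    # LPSdat[1, 1:10]     # NOT the same: the first ten columns of row 1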
In principle, with your own project you can of course update things you have on GitHub. With this one it won't work for you, because you don't have write access to that repository. But if I want to commit this change here: I select my file in the commit window, I see the old version, I see the new version, I put in a little message saying what I did, and I commit. Now the file is changed locally, and I can push it to GitHub. If you then pull, the file should be updated on your computers as well -- presumably. I hope all hell doesn't break loose now because of merge conflicts. In the most favorable case, all that will happen is that line 527 will now be correct. Where is the pull option? There's this little icon here -- the green plus, the red minus, and the stacked icon with the arrow -- that means version control. Behind this icon you can pull branches, and that pulls the most recent version of the uploaded data. That's what makes these projects so useful, especially if you work together in a workgroup: it's very easy to keep code consistent across many different computers by frequently updating from GitHub -- making sure everybody has the latest version, knowing who made what change at what time (so you have an audit trail over your scripts), and being able to revert errors, which in collaborative coding can often happen. So: we will have a short coffee break, and then we'll go more deeply into subsetting -- extracting and looking at data. How do we write commands that extract subsets of the data? How do we filter them so we can work with them? We will reconvene at 3:30.