Welcome, everyone, to the seventh lecture of the bioinformatics course. We're halfway there, and today we will be talking about R, the language for statistical computing. It's one of those tools that you use a lot as a bioinformatician, because programming is more or less fundamental. I have a lot of slides today; we're at slide two of a hundred. I changed the introduction a little bit. Normally we start with the answers to the assignments, but I thought it would be better to put those all the way at the back so we can see how many slides we get through. If we have time left, we will do the answers for the previous lecture. If not, that just has to wait until next week, which means that you guys have a week more to do the assignments.
Today I wanted to talk about R because I think that when you do bioinformatics you definitely need to learn how to program. So I wanted to take one of these languages which is relatively easy to get into, and I think R is a perfect language for a beginner. If you're a beginner and have no programming experience whatsoever, during the next three hours I will try to convince you that programming is easy. Well, it's not easy; it's as hard as you want it to be. We will start off with some basic programming techniques, like how to use R as a calculator. Then we will talk about more specific things, like types and variables. Afterwards we will talk about control structures and how to output things, like writing a file or a log file. I also wanted to say a little bit about string escaping. Part two will be linear regression, because I think linear regression is one of those tools that is really useful for modelling your data and getting more insight into what is happening. We will go through an example using one of the basic data sets in R.
I will show you guys how to do single linear regression, multiple linear regression and quadratic regression in R. And like I said, if we have any time left we will do the answers to the assignments of the previous lecture, which I put all the way at the back. If anyone really objects and says, no, I spent the whole week working on it and I want to know the answers now, then we will do them now, but I don't think it's the biggest deal in the world.
All right, let's start with R as a calculator. Let me first show you guys R. R is a tool which you can easily install. I don't have a Firefox window open, so let me show you the Firefox window. I hope it's not too big; well, it's actually a little too small, so let me make it a bit bigger. If you want to install R, you just search for "R download" and Google knows what you want. If you're on Windows, you go here, click the download R button, open the executable that you downloaded and press next, next, next, finish, like you do when installing any program. And then you have R installed. Let me close the Firefox window so that you can see the R window. This is what it looks like. I have a slightly older version: the newest version is 4.1.2 and I'm still at 4.0.2. We can use R as a basic calculator. If I want to know what 5 times 7 is, I can just type it in and R will tell me that it's 35. The main thing you have to remember is that the decimal separator in R is always a period and never a comma. Depending on which country you're from, people sometimes use a comma, so they write 1,5 for one and a half, but in R it's always 1.5. So here you can type in things like 5 divided by 10, or 1 plus 4.
There are some special operators, which is good because you want to do things like 5 to the power of 2. Exponents can be done in two ways: you can use the caret operator, 5^2, or you can use the double multiplication sign, 5**2. Multiplying is just a single multiplication sign, but using it twice gives you 5 to the power of 2. You can do Euclidean division as well. Euclidean division is very important in bioinformatics because it tells you how often a number fits fully inside another number and how much remains. I have an example of that in a moment. R also has special numerical constants like Inf, which means infinity, NaN, which means not a number, and NA, which stands for a missing value.
A little bit more about Euclidean division. Say I want to divide 100 by 39. When I was young and still in elementary school, there was this thing called long division, where you take the number that you want to divide and put it in the middle, take the number that you're dividing by and put it in front, and draw lines around it. The first thing you do is figure out how often 39 fits in 100. It fits once, it fits twice, but it doesn't fit three times, because 3 times 39 is 117 and then you're over 100. So 39 fits wholly in 100 two times. You subtract 2 times 39, which is 78, and you are left with 22, which is the remainder of the Euclidean division. This is very useful when you start batching things up or dividing into groups. Say I have 100 samples and boxes that hold 39 each: then I get two full boxes and I need a third box, which will hold 22 samples.
It allows you to reason about batches, and for some reason it comes back all the time in mathematics, so I just wanted to mention that you can do it. In R, if you want the Euclidean division you type 100 %/% 39, which gives you 2, and if you want the remainder you type 100 %% 39, which gives you 22.
Good. Besides being a calculator, R also has a bunch of built-in constants. For example, if you type LETTERS in uppercase, R knows that this means the 26 uppercase letters of the Roman alphabet, A to Z. Typing letters in lowercase will of course give you the lowercase letters. That's one of the nice things about R: it lets you use these built-in variables. R also knows about the months: month.abb holds the abbreviations of the English month names, and month.name holds the full names. Furthermore, it has pi built in, the ratio of the circumference of a circle to its diameter, so if you want to calculate something using pi you don't have to type in 3.1415. This is not true for all languages. The C programming language itself, for example, has no knowledge of pi, and it has no knowledge of letters, months or abbreviations either. In R you get these things for free. There are more constants, but I think these are the main ones.
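To make the calculator part concrete, here is a small sketch of the operators and constants mentioned so far. The variable names full_boxes and left_over are only illustrative; everything else is built into R.

```r
# Basic arithmetic; the decimal separator is always a period
5 * 7          # 35
5 / 10         # 0.5
1 + 4          # 5

# Exponents: ^ and ** do the same thing
5 ^ 2          # 25
5 ** 2         # 25

# Euclidean division of 100 by 39
full_boxes <- 100 %/% 39   # 2 full boxes
left_over  <- 100 %% 39    # 22 samples left for the third box

# Built-in constants
LETTERS[1:3]     # "A" "B" "C"
letters[1:3]     # "a" "b" "c"
month.abb[1:2]   # "Jan" "Feb"
month.name[1:2]  # "January" "February"
pi               # 3.141593

# Special values
1 / 0            # Inf
0 / 0            # NaN
NA               # a missing value
```

Note that 39 * full_boxes + left_over gives back 100, which is a quick way to check a Euclidean division.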
Furthermore, you can use R for imaginary numbers. Imaginary numbers come in when you take the square root of a negative number. We've all been taught in mathematics that when you square a real number the result is never negative: 3 times 3 is 9, and minus 3 times minus 3 is also 9. But there's a field of mathematics where, by definition, the square root of minus 1 is something called i, the imaginary unit, and R supports this natively, which is really handy. Well, it supports it natively, but you have to tell R that you want to use complex numbers. When you type sqrt(-1), R will just say NaN, not a number. However, if you type sqrt(-1+0i), so minus 1 plus zero times i, it understands that you want to use imaginary numbers and it tells you that the answer is 0+1i, one imaginary unit. This is really useful when you're doing modelling, for example when you're modelling a spring. In mathematics imaginary numbers pop up everywhere, and in biology they sometimes do too.
R has the basic trigonometry functions built in. If you want to calculate the sine, cosine, tangent, arc sine, arc cosine or arc tangent, you don't have to write functions for this; they're just available as sin, cos, tan, asin, acos and atan. The same goes for logarithms. The only thing I want to highlight is that log(5) gives you the natural logarithm of 5, what we would normally write as ln: the default base of the log function is e. If you want the base 10 logarithm of 5, you use log10(5), and if you want the opposite of the logarithm, the exponential, you use exp, so exp(1) gives you e.
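A short sketch of the complex number, trigonometry and logarithm functions just described. The variable name z is only there so the result can be inspected afterwards.

```r
# Real input: R does not switch to complex numbers by itself
sqrt(-1)             # NaN (R also prints a warning)

# Complex input: add a zero imaginary part
z <- sqrt(-1 + 0i)   # 0+1i
Re(z)                # real part
Im(z)                # imaginary part

# Trigonometry works in radians
sin(pi / 2)          # 1
cos(0)               # 1
atan(1) * 4          # 3.141593, the same as pi

# Logarithms and the exponential
log(5)               # natural logarithm, 1.609438
log10(5)             # base 10 logarithm, 0.69897
exp(1)               # 2.718282
log(exp(1))          # 1
```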
Of course this is really useful for things like p-values, which are often something like 1 times 10 to the minus 6 or 1 times 10 to the minus 7. When you plot p-values, for example across a chromosome, or you have a hundred p-values that you want to show in a graph, you generally don't show the p-values themselves but the minus log10 of the p-value. You do this because then the difference becomes very clear. The absolute difference between 10 to the minus 5 and 10 to the minus 7 is tiny, but the difference between the minus log10 of these numbers is easy to see: the minus log10 of 10 to the minus 5 is 5, and the minus log10 of 10 to the minus 7 is 7. So the difference is just much more visible in a plot.
The standard order of operations applies in R. Take something like 10 minus 3 times 2. These things come around on Facebook once in a while and people start arguing about them, but there is nothing to argue about, because the order of operations is fixed by PEMDAS: Please Excuse My Dear Aunt Sally, so parentheses, exponents, multiplication and division, addition and subtraction. I bet every language has this kind of donkey bridge, as we call a mnemonic in Dutch, for operator precedence. After anything in parentheses, we first do the exponents and roots, so 5 to the power of 2 or the square root of something, next multiplication and division, and lastly addition and subtraction. So here we first do the multiplication, 3 times 2 is 6, and then 10 minus 6, so the answer is 4 and not 14. On Facebook people always shout that the answer is 14, which of course it isn't, because you have to adhere to the order of operations.
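Both points can be checked directly in R. The two p-values below are made-up illustration values, not results from any data set.

```r
# Two illustrative p-values and their -log10 transform
p_values <- c(1e-5, 1e-7)
scores <- -log10(p_values)
scores                  # 5 7

# Order of operations
10 - 3 * 2              # 4: multiplication comes first
(10 - 3) * 2            # 14: brackets force the subtraction first
2 * 3 ^ 2               # 18: the exponent comes before the multiplication
```

The untransformed p-values differ by less than 0.00001, while the transformed values differ by 2, which is why the transformed scale is easier to read in a plot.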
Of course this can always be overridden by using brackets. If you really want to do 10 minus 3 and then multiply by 2, you put brackets around the subtraction, and R will first do the subtraction and then the multiplication.
In R everything is in memory, and this is a drawback of R. If you load a data set which is 5 gigabytes big, those 5 gigabytes are loaded into your RAM, and the RAM of your computer is limited. Nowadays computers generally come with 8, 16 or 32 gigabytes of RAM, depending on how much money you spend. Be aware that this is one of the severe limitations of R: if you have a massive data set which is tens of gigabytes big, R will struggle, because it cannot hold everything in memory at once. Your computer has to keep part of it on the hard drive and start swapping, and things get very, very slow. There are some functions to manage your session.
So what is a session in R? If you open up your R window, everything that I type in here is in my session. For example, if I assign the number 15 to a variable called Danny, I now have a new variable in my session with the value 15. It takes a little bit of memory for R to remember that the variable Danny contains the number 15. If I start assigning millions and millions of values to a single variable, this will of course take much more memory. Everything I do within the session is kept until I close R. If I close my R window and open it up again, the variable Danny will be gone, because R doesn't remember things between sessions unless you save them.
All right. If we want to know where we are on the hard drive, and this is the same as in the terminal, we need to figure out our working directory. When I open up R it sets the working directory to wherever R was started from, which may be something like C, Program Files, R, but generally we want to move to the location on the hard drive where our data is. I can use getwd(), get working directory, to see where I currently am. If I go to R and call getwd(), we can see that I'm currently reading and writing files at this location on my hard drive, in C, Users, addons, Documents. If I call dir() there, so a directory listing, it shows all of the files that are there; you see that I have a whole bunch of photos and a Valentine's Day PNG located there. Normally my data is stored on my D drive and not on my C drive, because I use the C drive for Windows and all of the data that I get goes onto the D drive. I can change my working directory with the setwd() command, set working directory, and say: move to the D drive. It is very similar to the cd command in your terminal. If we then call dir() again, you can see that there are now all kinds of different files, like Zoom MP4s and PP1 and my Recycle Bin. So you have to manage where you are; you have to tell R explicitly to go to a different directory to load or save files there.
To summarise: getwd() tells you where you are on your hard drive, dir() shows which files are there, and setwd() moves you from where you are now to somewhere else.
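The three functions can be tried out safely like this. As an assumption for the sketch, tempdir() stands in for the D drive from the lecture, so that the code runs on any computer; on your own machine you would pass a path such as "D:/" to setwd().

```r
old_dir <- getwd()     # remember where we are right now
dir()                  # list the files in the current working directory

setwd(tempdir())       # move somewhere else (stand-in for e.g. "D:/")
getwd()                # confirm that the working directory changed
dir()                  # the listing now shows the files in the new location

setwd(old_dir)         # and move back to where we started
```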
If you want to see what is in your current session, so which variables you have defined, you can use the ls() function. The name is a little bit unusual, but if I execute ls() in R now, it says Danny, because we only defined a single variable, the one with the value 15. If I then put the value 29 in a variable called Anna and call ls() again, I see that two variables are defined. This way you don't have to scroll all the way up thinking, what did I name my variable? You can just use ls() to see all of the variables which are defined. Those are the main things to control where R loads data from, where it stores data and how to move around.
Next, external packages, which are really important for R. If you want to do things like machine learning, or use some fancy new regression algorithm, those are generally not installed in R by default. R is just the programming language, and it has a massive repository of packages which are maintained by other people. If you want to install, for example, the package called qtl, you do that with the install.packages function, and you give it the name surrounded by quotes, meaning that it is a character value in R. Installing the package doesn't make it usable yet: it just contacts the external server, downloads the package to your hard drive and then stops. If you want to use the package, you first have to activate it, and activating a package is done with the library function. So after the install nothing seems to have happened, but when I do library(qtl), all of the functions and data stored in the qtl package become available and I can use them. The qtl library might give me a function which is called scanone, or some other function.
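A sketch of listing the session and of the install-then-activate pattern. The install line is commented out because it needs an internet connection, and the requireNamespace() check is an addition of this sketch so that the code also runs when qtl is not installed.

```r
# Two variables in the session, as in the lecture
Danny <- 15
Anna <- 29
ls()                     # the listing now includes "Anna" and "Danny"

# Step 1: download the package once (needs internet, so commented out here)
# install.packages("qtl")

# Step 2: activate it in every new session before using it
if (requireNamespace("qtl", quietly = TRUE)) {
  library(qtl)           # only runs when the package has been installed
}

# Packages that ship with R are activated the same way
library(stats)
```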
If I want to save a certain R object, something like a variable, to my hard drive, I can use the save function. When I say save, Danny, and then give it a file name, it saves this variable and its content to the hard drive under the name that I specify. If I want to load it again, I use load with the file name, your.RData or whatever name I gave it, and it reloads the variable from the hard drive. So when you have something in your session and you need to run out because there was a car crash or something, this is one way to save certain objects for later. If you want to save just everything, imagine I've been using R the whole day and in the end I have 50 or 60 variables defined, and I want to dump them all to disk because I want to continue tomorrow, or someone shouts that the power will be turned off in three minutes, then you can use save.image. You give save.image a file name, and everything which is currently in the session is saved to disk.
Now, one of those tips that almost no one tells you when you start learning R: when you quit R, never say yes, always say no. When you quit the R session, it asks whether you want to save the workspace image. If you say yes, it saves it into a sneaky hidden file called .RData in your working directory, usually somewhere under C, Users, in your home folder, and the next time you start R there it loads all of the stuff you had in the session the last time. In itself this is not bad. It is only bad when you, for example, loaded a 10 gigabyte file, quit R and just pressed yes. Then everything you had loaded is saved to this sneaky file, and the next time you start R it loads the file again, so it can take something like 10 minutes before R is actually available, because it needs to load all of the data.
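Saving and loading can be sketched as follows. The file names danny.RData and session.RData are hypothetical, and the files are written to tempdir() only so that the example does not clutter your own folders.

```r
Danny <- 15

# Save a single object and load it back
one_file <- file.path(tempdir(), "danny.RData")   # hypothetical file name
save(Danny, file = one_file)
rm(Danny)                 # remove it from the session
load(one_file)            # and bring it back from disk
Danny                     # 15

# Save everything that is currently in the session
all_file <- file.path(tempdir(), "session.RData") # hypothetical file name
save.image(file = all_file)

# In a later session you would restore it with:
# load(all_file)

# Quit without the automatic workspace save (commented out so the script keeps running)
# q("no")
```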
So my tip: when you quit R in Windows by clicking the X in the top right corner, it asks you whether you want to save your session. Always click no. If you want to quit R from the command line, you can call q(), which is the quit function, but give it the parameter "no", so q("no"), to make sure that it doesn't save the whole session. This is a very useful tip, because otherwise you might start depending on stuff which you defined previously but which is not available when you switch to a different computer. So, a tip from me: when you quit R, never ever let it save the session automatically. If you want to save the session, do it yourself. Call save.image with a file name, and when you start R the next time, call load with that file name, and it will reload everything on command.
The nice thing about R is that people who write packages for R are forced to document every function they write, so help is available for anything you want. Every function has a help file, even something as simple as the addition operator. If you don't know what the plus symbol means, you can look at the help file for the plus symbol. If you just want to know what kind of functions there are, imagine that you're interested in machine learning, you can type two question marks followed by the search term in quotes, ??"machine learning". It will show you all of the functions where the words machine learning are mentioned in the help files. If you want to know which functions there are for quantum computing, it's the same thing: type ??quantum, and all of the help files that mention the word quantum will pop up.
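As a rough sketch, the same lookups can be written with the functions behind the question mark shortcuts, help() and help.search(). Storing the results in variables (plus_help and hits are illustrative names) avoids opening a viewer; which search hits you get depends on the packages installed on your computer.

```r
# Help page for the addition operator; typing ?"+" at the prompt does the same
plus_help <- help("+")

# Search the installed help files; typing ??"machine learning" does the same
hits <- help.search("machine learning")

# Printing either object shows the help page or the list of matches
# print(plus_help)
# print(hits)
```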
If you know which function you want to use, you can type a question mark followed by the function name, and that directly opens the help file for this function. The nice thing about R is that it has a lot of built-in help. If you want to know what the seq function does, you type ?seq and it gives you about two A4 pages of description plus examples. The examples are all the way at the bottom of the file and show how to use the function. Everyone who writes a function for R and submits it to the repository is required to write a help file, and this help file is expected to have an example. So, for you guys: use the question mark operator a lot. If you don't know exactly what to do, use the double question mark to search for the thing you want, like standard deviation or quantiles, and it will open a list of help files in which you can look for what you need. That's more or less the first couple of things I wanted to tell you about R.
The next thing is the types of data. In a lot of programming languages you have to distinguish between integer values, so whole numbers like 5, and floating point numbers. In R you normally don't have to make that distinction. 5 is a numeric value, but 7.9 is also a numeric value and 10.6 is also a numeric value. So in everyday R, a number is just a number. Besides that, R knows what logicals are: it knows TRUE and it knows FALSE. This is very useful, because when you start branching with if statements, or when you start looping, you use logicals. We also have characters in R. We already saw this when we did install.packages: anything surrounded by quotes is regarded as a character value. And there is a layering in these types.
It means that a logical can be represented as a numeric, and a numeric can be represented as a character, but not the other way around. The data types just become bigger and bigger.
We have vectors in R. A vector is a sequence of numbers, of characters or of logicals, and a vector always contains a single type. We can make a vector using the c function, which is the combine function; it combines everything you give it into one vector. On the slide we're making a vector which has five elements: the first element is 1, the third element is 5.3. We can do the same thing with characters and the same thing with logicals. However, because of this layering, if I start mixing things together, the vector will be of the highest type. Let me quickly give you guys an example. If I say c(1, 2, 3), it just gives me a vector of 1, 2 and 3. However, when I say c(1, 2, 3, "a"), all of a sudden everything is transformed into a character, because character is a higher type than numeric. A numeric value can be represented as a character, but a character cannot in general be represented as a numeric value. So when I combine stuff, R always casts it to the highest type in the vector. That's something that will screw you over a lot of times. You assume you're working with numeric values, but because one of the numbers was not a real number, it had a space in there or a comma, R makes all of the values in your vector characters, because characters can represent more or less any type.
Furthermore, R knows what a matrix is. It understands two dimensional matrices and also three dimensional arrays. You can create a matrix with the matrix function: you say matrix and fill it with the numbers 1 to 20, written 1:20.
The colon means "to", so 1:20 just makes a vector of the numbers 1 to 20. Then you give the number of rows and the number of columns, and you store the result in a variable called y. This makes a standard two dimensional matrix, like an Excel sheet.
There are some functions you need to be aware of when you're working with types in R. When I'm working with vectors, I often need to know how many elements there are, so I can use the length function: length on an object tells you how many items are in a vector. The str function tells you the structure of an object, because objects can become very complex. So far we only saw vectors and matrices, but we can have things like lists with vectors in there, and matrices, and other lists. You can make objects as complex as you want, and str gives you a compact printed overview of the structure. It will tell you, for example: you have a list, the first element of the list is a numeric vector of length 15, the second element is a matrix with dimensions six by six. If you want to know the class or type of an object, you can use class. If I have a vector and I want to know whether it is a numeric, a character or a logical vector, I can just ask for the class of the vector, and it will tell me: the vector that you gave me is numeric.
Then names. You can give the elements of a vector names, and you can also get the names of an object. Let me quickly show you guys an example. If I have this vector, 1, 2, 3 and "a", and I store it in v1, I can ask for the names of v1. They don't exist at the moment, because v1 does not have names, so I get NULL. But I can assign names.
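The type layering and the inspection functions can be sketched like this. The 4 by 5 shape of the matrix is an assumption, since the lecture only says it is filled with 1 to 20; v_num and v_mixed are illustrative names.

```r
# Vectors hold a single type; mixing types promotes everything to the highest one
v_num <- c(1, 2, 3)
class(v_num)                 # "numeric"

v_mixed <- c(1, 2, 3, "a")
class(v_mixed)               # "character"
v_mixed                      # "1" "2" "3" "a"

c(TRUE, 2)                   # 1 2: the logical is promoted to numeric

# A matrix filled with 1 to 20 (assumed shape: 4 rows, 5 columns)
y <- matrix(1:20, nrow = 4, ncol = 5)
dim(y)                       # 4 5

# Inspecting objects
length(v_num)                # 3
length(y)                    # 20: every cell counts
str(y)                       # compact overview of type, dimensions and first values
class(y)                     # tells you that y is a matrix
```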
I can say names(v1) gets c("observation1", "observation2", "observation3", "observation4"), four names, and then close the bracket. When I now look at v1, it has these names. And when I want only the names and not the values, I can use names(v1), which gives me the names that I just assigned. Note that for a matrix you use colnames and rownames to get the column and row names.
Just as a tip: you can force things from one type to another type, and you can ask whether something is of a certain type. You can use as.logical, as.numeric and as.character; as.numeric, for example, will transform a character vector into a numeric vector. Of course, the things that it cannot convert, like the "a" in our example, will become NA, a missing value. The is functions are used when you want to check the type: if you want to know whether a vector is logical, you use is.logical, and it tells you TRUE, it is, or FALSE, it isn't.
Good. Creating vectors is relatively easy. We can use the c function to combine several objects together, like numbers, logicals or characters. We can also use the seq function to make a numerical sequence. It takes three parameters: we say, for example, make a sequence from 1 to 100, stepping by 7 every time. Let me show you in the R window. I can ask for a sequence from 1 to 1000 stepping by 250, and it says: start at 1, add 250, add 250, add 250, so 1, 251, 501, 751. That is the seq function. We also have the rep function, the repeat function, which repeats a certain object a given number of times. If I say rep("a", 5), I get a vector which has five a's in it.
Matrices can be made in two different ways. The first one we already saw: I can use the matrix function and give it a vector of numbers that it should put into the matrix.
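The naming, conversion and sequence functions from this part, as a runnable sketch. The variable names nums and steps are only illustrative.

```r
# Names on a vector
v1 <- c(1, 2, 3, "a")
names(v1)                      # NULL: no names yet
names(v1) <- c("observation1", "observation2", "observation3", "observation4")
v1                             # values are now printed with their names
names(v1)                      # only the names

# Converting and checking types
nums <- as.numeric(v1)         # 1 2 3 NA (R warns that an NA was introduced)
is.character(v1)               # TRUE
is.numeric(v1)                 # FALSE
is.numeric(nums)               # TRUE

# Sequences and repetition
seq(1, 100, by = 7)            # 1 8 15 ... 99
steps <- seq(1, 1000, by = 250)
steps                          # 1 251 501 751
rep("a", 5)                    # "a" "a" "a" "a" "a"
```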
Then I tell it the number of rows and the number of columns. I can also use the cbind function or the rbind function. cbind means: take a vector and another vector and combine them into a matrix column by column. So if I have two vectors of the same length, vector one becomes the first column and vector two the second column. rbind does the same thing, but instead of binding column-wise it binds two or three vectors together row-wise: the first vector becomes the first row, the second vector becomes the second row. A quick example: let me define a vector V1 which repeats the letter A 10 times, and another vector V2 which repeats the letter X 10 times. Now I can do cbind(V1, V2), and you can see that the first column contains the first vector and the second column contains the second vector. If I had used rbind, I would have ended up with a matrix which looks like this: it has 10 columns, the first row contains V1 and the second row contains V2.
All right, here's the overview slide. We didn't really talk about the colon operator yet. The colon operator is shorthand for the seq function with a step of one; you cannot give it a different step size. So 1:4 just means 1 to 4: it creates a vector with 1, 2, 3 and 4 in it. rep we talked about, and matrix is more or less what I just showed you guys.
When I print a matrix in R, let me show you here, you can see these labels with a comma in the margins, like [,1], [,2], [,3], [,4] along the top and [1,], [2,] down the side. These are the indexes. A number before the comma, like [1,], means the first row, then the second row, and a number after the comma, like [,3], means the third column.
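The cbind and rbind example from the lecture, plus the colon operator, as a short sketch. The names by_col and by_row are illustrative.

```r
V1 <- rep("A", 10)
V2 <- rep("X", 10)

by_col <- cbind(V1, V2)    # V1 and V2 become the two columns
dim(by_col)                # 10 2

by_row <- rbind(V1, V2)    # V1 and V2 become the two rows
dim(by_row)                # 2 10
by_row                     # printed with [,1] ... [,10] as column labels

# The colon operator always steps by one
1:4                        # 1 2 3 4
all(1:4 == seq(1, 4, by = 1))   # TRUE
```

Because the vectors were passed in by name, cbind uses V1 and V2 as column names and rbind uses them as row names.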
This allows you to select things from your matrix. Imagine that from this little matrix I want to get the third column. I store the matrix in a variable, I just call it y, and then I can say y[,3], and this gives me back the third column. So R automatically indexes the columns and the rows for you. If you have names, you can also use the names. Say that from y I want the first row, which is called V1. I can say y["V1", 1:5]: select the row called V1 and from it columns 1 through 5, including the fifth one. So this is how R shows the indexing for a matrix, and for a vector you see the position number in the same way. Just so that you don't worry about what this [1,] and [,1] thing is: it shows you that [1,] is the first row and [,2] is the second column.
So how do we index a vector? If we have a vector and we want to get stuff out, we use square brackets. Imagine that I have a vector v which contains the letters a through i; I could have called it v1 or v2 or whatever. The square brackets allow you to index into the vector. If I want to select the fifth element, I say v[5]. If I want elements two to five, the highlighted part here, I say v[2:5]. And sometimes I want a disjoint set, not everything in a row, for example the eighth element as well.
So for the letters that are located at index two to five and the one at eight, I can use the combine function, c. And then I can create a vector which has the indexes, right? So I can say c(2:5, 8). So make a vector which contains two, three, four, five and eight. And then use that to select from V, so V[c(2:5, 8)]. I hope that's clear, because this is something that you do a lot. Matrices work more or less the same way. But now you of course have to give it the rows and the columns that you want to select. So for example, if I want to select from the first column the first three rows, I am saying M[1:3, 1]. So everything before the comma signifies the rows that you want to select, and everything after the comma is the columns. So remember: first rows, then columns. So M[1:3, 1] selects from the first column the elements one, two and three. If I want to say from the fifth row, select elements three to six, so M[5, 3:6], then it selects this part. I can also select just a single element. So I'm going to say from the matrix, give me the eighth row, seventh column, so M[8, 7]. And I can also select the whole column, right? So to select a whole column you say M, then you don't fill in which rows you want. You just say comma nine, so M[, 9]. So give me the ninth column of the matrix, and then it will take the whole thing, no matter how long it is. All right. So just as a reminder, these are the types of data. So we have logicals, TRUE and FALSE. We have numerical values, which are things like 5, 7.9 and 10.6, which is slightly different from other programming languages. We have character vectors, which are called strings in many other languages, which are things like 123 with double quotes surrounding it. We have vectors and we have matrices. So R also has some advanced types, because a vector can only be of a single type, right? The same thing holds for a matrix. So a matrix can hold only numerical values or only logical values or only character values.
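The indexing rules described above can be checked with a small sketch. The vector of letters a through i follows the lecture; the 10 by 10 matrix filled with 1 to 100 is an assumption, chosen only so that every selection has an easily verified result.

```r
# A vector with the letters a through i
v <- letters[1:9]

fifth       <- v[5]           # a single element
two_to_five <- v[2:5]         # a consecutive range
disjoint    <- v[c(2:5, 8)]   # a range plus one extra index, built with c()

# A 10 x 10 matrix filled column by column with 1..100 (size is an assumption)
m <- matrix(1:100, nrow = 10, ncol = 10)

first_three  <- m[1:3, 1]   # rows 1 to 3 of column 1
part_of_row  <- m[5, 3:6]   # row 5, columns 3 to 6
single_value <- m[8, 7]     # row 8, column 7
ninth_column <- m[, 9]      # leaving the row part empty selects all rows
```

The rule to remember is always rows before the comma and columns after it; leaving one side empty means "everything" on that side.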
However, in many cases we actually have a table in which every column is of a different type, right? If I'm measuring plants, then the first column might be the name of the plant, the second column might be the height, which is a numeric value. Then the third column might be the color, which is a different data type. Then the fourth column might be a logical, right? Whether it was watered or not, so a TRUE or FALSE vector. So to represent this, R has something which is called the data frame. So a data frame is not a matrix. It looks very similar to a matrix, but now every column can be different. So the first column can be numeric, the second column can be character, the third column could be logical. And that is okay. So you create a data frame by defining the three vectors, so the three columns, that you want. So in this case, we have V1 being one to four. We have V2, which is character plus a missing value, and V3 is a logical vector. So when we say data.frame of V1, V2 and V3, and we store it in a variable called D, then we make a data frame, which is similar to a matrix. But the first column is numeric, the second column is character, and the third column is logical. So a list is very similar to a vector. Again, it's just a list of things. But it can contain anything, right? So every element of the list can have a different class or a different type. So here I make a list. The first element of the list is a named element, called name, and it contains the character value Fred. Then the second element of this list is something which is called numbers. And this holds not a single number, but vector V1. So the second element of the list contains a vector, which is of length four. We can also say, for example, that the third element is called age, and it has the value 5.3. So a single numeric value. So a list is very similar to a vector. It's just that a vector has to be of a certain type.
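As a minimal sketch of the data frame and list just described: the exact contents of V2 and V3 are not given in the lecture, so the letters and logical values below are placeholders, and the list name info is made up for illustration.

```r
v1 <- c(1, 2, 3, 4)                 # numeric vector, one to four
v2 <- c("a", "b", NA, "d")          # character with a missing value (placeholder letters)
v3 <- c(TRUE, FALSE, TRUE, FALSE)   # logical (placeholder values)

# stringsAsFactors = FALSE keeps v2 as character on R versions before 4.0 too
d <- data.frame(v1, v2, v3, stringsAsFactors = FALSE)

# One type per column, but the columns differ from each other
column_types <- sapply(d, class)
column_types

# A list can mix types and lengths freely
info <- list(name = "Fred", numbers = v1, age = 5.3)
names(info)
length(info$numbers)
```

A data frame is in fact a list of equal-length columns, which is why each column can carry its own type.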
In a list, every element can be a different type. Besides that, in R we have something which is called a factor. So R, because it is a statistical language, knows about categorical variables, right? So categorical variables are things like male and female. So I can use the factor function in R to say that this is something which is a categorical variable, right? So here I'm just saying, well, repeat male 20 times, repeat female 30 times, combine this together. So now what I have is a vector of 20 plus 30, so a vector which is of length 50. First there's 20 times the word male, and then there's 30 times the word female. And by now using factor, R looks into the vector, sees which unique values there are, and then says these are the only values that are allowed. So in this case, if I want to add another gender, then it says that you cannot, and you get a warning and a missing value instead, because the factor forces it to be one of two categories, either male or female. And this is very useful when you're dealing with categorical variables. And R understands this also at a statistical level, because a categorical variable needs to be analyzed differently from a numerical variable. So that is why R has a built-in factor type, which represents categories. So besides that, we have comments. So comments start with a hashtag, and everything after the hashtag will be ignored. So if you make a script, then use comments a lot, because R will ignore the comments anyway, but for the people that might read your script in the future, or for yourself reading it in the future, it is really important to know what you were doing and what you were trying to do. So adding a lot of comments helps a lot with understanding code. I've been working now in bioinformatics for around 12 to 16 years, I think. And it has helped me a lot.
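The factor example can be sketched as follows; the variable names sex and sex_counts are assumptions, the 20 and 30 repeats follow the lecture.

```r
# 20 times "male" followed by 30 times "female", turned into a factor
sex <- factor(c(rep("male", 20), rep("female", 30)))

length(sex)    # total number of observations
levels(sex)    # the allowed categories, sorted alphabetically

# Count the observations per category
sex_counts <- table(sex)
sex_counts

# Assigning a value that is not one of the levels does not work:
# R gives a warning and stores a missing value (NA) instead.
# suppressWarnings() is only used here to keep the output quiet.
suppressWarnings(sex[1] <- "other")
is.na(sex[1])

# Everything after a hash sign is a comment and is ignored by R
```

If a new category really is needed, the levels have to be extended first, for example with levels(sex) <- c(levels(sex), "other").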
If I look at old code that I wrote like five years ago, then the only reason why I still understand what's going on is because past me left messages for future me. So use comments for yourself in the future, so you know what happened and what you did. All right, so a quick self-test for you guys at home, and I'm probably not going to wait for you guys to answer it, but this is generally one of the questions in the R course. So when I give the R course in the summer semester, then there's one question in the exam which is very similar to this. And so it asks you what is the type of "TRUE", so the first element. So you guys want to have a guess? Then I can take a sip of coffee and you guys can guess what type this first thing is. So what is the type of the first element? Is this logical, numeric, character, matrix, vector, data frame? Logical? Yes, Xanaxin, that is the trick in the trick question. Because it is surrounded by these double quotes, it's a character. Everything surrounded by double quotes is a character. So this is a character value. String, yeah, in other languages you would call it a string value, but in R you call it character. All right, so the same thing holds for the second one. Also a trick question: it's not a numeric value, it is again a character value. This is a numeric value, 1e+11, so this is 1 times 10 to the power of 11. 0x89 is also a numerical value, it's just a hexadecimal value. So this is a different way of writing down numbers, where you don't use base 10, but you use base 16. This one is a trick question as well, because a lot of people see this and then say, oh, this is a color. In R, this is not a color. In R, this is a comment. It starts with a hashtag, and everything after the hashtag is ignored. So this is just a comment. This is a logical value, but as.factor(TRUE) turns this logical value into a factor value. We force it to be a factor. So this is a factor.
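The self-test items discussed so far can be checked with class(). This is only a sketch: the exact quoted number and the exact color code on the slide are not in the transcript, so "1" and FF0000 below are stand-ins.

```r
# list() is used instead of c(), because c() would coerce everything to one type
self_test <- list("TRUE", "1", 1e+11, 0x89, as.factor(TRUE))

# Ask R for the type of each item
types <- sapply(self_test, class)
types

# Hexadecimal notation is just another way of writing a number
hex_value <- 0x89
hex_value

# FF0000 : this line starts with a hash, so R treats it as a comment, not a color
```

Typing a value followed by Enter in the console, or wrapping it in class(), is usually the quickest way to settle this kind of question.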
The is.character function will test if the thing that you throw in is a character value. So is 1 times 10 to the power 11 a character? No, this is not a character. So the answer of this is FALSE, but the type of FALSE is a logical. So the is-something functions always return a logical value. So always a nice question in the R course. I always love coming up with ways to trick people in that sense. And this is generally the only question where I do that, because I generally don't like trick questions. But here it's nice to see if you can put people on the wrong track. All right, so a little bit more about lists. If we have a list and we want to take something from the list, since a list is different from a vector, we have to use the double square brackets. So imagine that I make a list, and this list is called W. It has four elements. The first element is a character vector of length one. The second one is a numeric vector, which has a length of four. The third one is again a numeric vector of length one. And the fourth element in this list is actually a matrix, right? And that is perfectly fine because a list can contain anything. So every element of the list can be a different type. So how do I now select the first element from the list, right, Fred? How do I get that? Well, I have to say from list W. So if you type W in R, it shows you this, right? So the names are shown using dollar signs. So it says $name, $numbers, $age, because I assigned a name to each element, right? So if I would use the names function on W, it will tell me name, numbers, age and matrix. So if I want to get Fred out of this list, right, then I have to say from W, select the first element of the list, so W[[1]]. And then from what you get back, select the first element again, so W[[1]][1]. And this is because in R, characters and numerics are automatically interpreted as vectors, because in R you generally always deal with vectors.
So a vector of length one is not a different type. So if I want to just get Fred back, I have to say from W, take the first element of W, and then from the thing that you get back, take the first element, right? Because it is a character vector of length one, but I still have to specify that I want to have the first thing. I could also use the dollar operator, and instead of having to do double square brackets and giving it the index, I can just say from W, select the numbers vector, and then from this numbers vector, select the second and the third element, so W$numbers[2:3]. Then it will give me back two, three. So this part of the vector. Of course, this can only be done when you use names, right? If you don't assign names to the elements, then you have to use the double square brackets. So I could use W$matrix[1, ], which will give you the first row of the matrix which is stored in W. And you can get the first column the same way. But here I'm using indexing, so here I use the number: W[[4]][, 1]. So from the fourth element of W, give me the first column, right? And these two are more or less equivalent, except for the fact that here I'm asking for the first row and here I'm asking for the first column. So if you can, and you're using lists, use named lists. So make sure that you assign a name to each element of the list, then you don't have to use the double square bracket thing. You can just say W dollar, then the thing that you want to select from. So in this case numbers, which is the second element, right? And if at any time in the future you reorder your list or you add another element to the list, then the name will still be the same while the index might change, right? If I add something in front of the list, then everything would move. And W[[1]][1] would not select Fred, but it would select the thing that I just added at the front of the list. So try and use named elements.
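A short sketch of the list selections walked through above. The list w follows the lecture; the contents of the matrix element and the extra id element added at the front are made up for illustration.

```r
w <- list(name    = "Fred",
          numbers = 1:4,
          age     = 5.3,
          matrix  = matrix(1:6, nrow = 2))  # matrix contents are an assumption

names(w)

by_index  <- w[[1]][1]       # first element of the list, then its first value
by_name   <- w$numbers[2:3]  # second and third value of the numbers vector
first_row <- w$matrix[1, ]   # first row of the stored matrix
first_col <- w[[4]][, 1]     # first column, reached by position instead of name

# Adding an element at the front shifts every position, but not the names
w2 <- c(list(id = 42), w)
shifted    <- w2[[1]][1]     # now the new element, no longer "Fred"
still_fred <- w2$name        # the name still finds "Fred"
```

Note that w[[1]] on its own already returns the length-one vector "Fred"; the extra [1] only makes the "first value of that vector" step explicit.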
All right, so if I want to work with matrices and data frames, then of course sometimes I want to know how many rows a matrix has, and then I can use the nrow function. Sometimes I want to know how many columns a matrix has, so I can use the ncol function. And often I want to do something for each row of the matrix, or I want to do something for each column of the matrix. So I can use nrow for the number of rows and ncol for the number of columns. I can also have row names. So if the matrix has row names or column names, then I can get those by typing rownames of the matrix or colnames of the matrix. But I can also use the arrow operator to assign the row names. So if I have a matrix which has three rows, right, then I can name these rows by saying rownames of the matrix gets A, B and C. So I can just directly assign into this function. If I would type rownames of the matrix, I would get back the names, but by assigning to it, it will update the row names of the matrix, or the column names of the matrix. In many cases you run into issues with matrices. For example, if I want to make box plots in R, then the boxplot function takes every column of the matrix and makes it into a box plot. If I by chance have my data structured differently, right, and I have my data in the rows and not in the columns, then I can use the transpose function. So what the transpose function does is this: the first row of the matrix becomes the first column, the second row of the matrix becomes the second column, and the third row of the matrix becomes the third column. So it just takes the matrix and puts it on its side. So it switches rows and columns. And this is very useful because a lot of times the function that you're dealing with does not know exactly how your data is laid out.
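The helper functions mentioned above, in a minimal sketch. The 3 by 2 matrix and the column names x and y are assumptions; the row names A, B and C follow the lecture.

```r
# A small matrix with three rows and two columns (contents are an assumption)
m <- matrix(1:6, nrow = 3, ncol = 2)

n_rows <- nrow(m)   # number of rows
n_cols <- ncol(m)   # number of columns

# Assigning into rownames() / colnames() updates the names of the matrix
rownames(m) <- c("A", "B", "C")
colnames(m) <- c("x", "y")
rownames(m)

# t() transposes: rows become columns and columns become rows
tm <- t(m)
dim(tm)

# The same value, addressed by name before and after transposing
m["B", "x"]
tm["x", "B"]
```

With this layout, boxplot(m) would draw one box per column of m, and boxplot(t(m)) one box per row of m.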
Like I said, the boxplot function assumes that when you call it on a matrix, you want to have every column of the matrix represented as a box plot. But if you use the heatmap function, it assumes that you want to use every row. So often you have a matrix, the matrix is formatted in a certain way, but the function that you want to use just expects the matrix to be more or less the other way around. So by using the t function, you can transpose a matrix. So its rows become the columns, and the columns become the rows. All right, 1:55. Let's do one or two more slides, right? So we're a little bit behind, like slide 24 out of 100. So we're getting there. Okay, so variables we've already seen, right? So variables in my mind are boxes. So you can put things in the box. You can use the arrow, or you can use the equals sign as the assignment operator. But you can use this box without knowing what's in it. And that's the nice thing about variables, right? So variables are kind of boxes. So in the mind, they're like a layer of abstraction. So you can just assign something to a variable. And then you can use the properties of the variable, saying for every element in this variable do something, right? And then it doesn't matter if there's one element in the variable or if there's 100, it will just do it for each element. So we've already seen variables, right? So variables can hold single numerical values like 1.5. Here we define a variable called can, which has two elements. So the first one is TRUE, so this is a logical vector. Here we're assigning a vector of length 4 to a variable called half. Here we again assign a logical vector of length two, this time to many. And then here we assign the value of minus 5 to names. And you are free to choose the variable names. So when you come up with a variable name, make it a good name, right? So don't use a variable called x, right?
That doesn't mean anything to anyone. Just say average temperature, right? And the nice thing about R is that you have tab completion. So if you have a variable called temperature measurements on day five, right? And this would be just some temperatures that you measured. So let's just put something in, right? This is a very long variable name, but it is a very meaningful name, right? Because I know exactly what it means. If I now want to use this in R, I don't have to type it in. I can just say T E M P, then press Tab. And then it will tell me all of the possibilities. So pressing Tab twice will show that, okay, when you start with temp, there's something called tempdir, then temperature measurements on day five, and tempfile. So tempdir and tempfile are two functions that are built in. But as soon as I say temper and then I press Tab, it just fills in the whole name for me. So don't restrict yourself, don't make your variable names too short, make them meaningful, right? So that's a tip: if you define a variable, give it a good name, right? And if temperature is a relatively good variable name, temperature in Celsius is a better one. Temperature in Fahrenheit is also a better one, right? Because now it has the unit, so people know what you mean. All right. So variables, you can give them names yourself. I would always advise people to use meaningful names, speaking names, which tell people what this variable is doing. All right. So I always advise people to code clean. And that means create scripts, right? So use a new file for a new lecture, right? So if you're starting to answer assignments, and that's the same way as the assignments are uploaded to Moodle, the answers go in a single file. So for lecture one, or the assignments to lecture one, in my mind there should only be a single file which contains the answers to that.
But then when you start doing the answers to the second lecture, those go into a next file, right? So a file is also an encapsulation unit. Stuff which belongs together is in a single file, and these files have logical names. So when you start coding, one of the things that you should kind of force yourself to do is to always add a header to each file. So a comment section where you state the name of the file, the name of the author, the date at which the file was created, the purpose of the file, and always add something about copyright. Just say at the top of the file, this is mine, and no one is allowed to use it. Or say, I don't care, anyone can use it, right? Just state the purpose of the file, so what it is going to do, and state what you think the copyright is. So it is mine, no one is allowed to use it unless they say my name 15 times. And of course, use a lot of comments in each of the files that you create. So a little example that I generally use, so this is generally the header that I make. So here we see the purpose of the file, right? So the purpose of this file is the analysis of Hardy-Weinberg equilibrium. Then the copyright: this file was written in 2015, so the copyright is 2015. And of course, since I'm working for the Humboldt University, the copyright is not mine, right? I made this file because I work for the Humboldt University, and the Humboldt University is kind of the owner of the copyright. Everything that I do at work does not belong to me, but belongs to the university I work for. But of course, it is written by me, so, right, written by Danny Aaron. So anyone seeing this file in the future knows, okay, if there's anything wrong with this file, and it does not compute the Hardy-Weinberg equilibrium, then complain to this guy.
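A header along the lines described here could look like the sketch below. The exact layout of the slide is not in the transcript, so the field order, the file name and the placeholders in angle brackets are assumptions; only the purpose, the year and the copyright holder come from the lecture.

```r
# <file_name>.R
#
# Purpose:   Analysis of Hardy-Weinberg equilibrium
# Copyright: (c) 2015, Humboldt University
# Author:    <your name>
# Created:   <date the file was created>
#
# State here who may use this file and under which conditions.
```

Because every line starts with a hash, R ignores the whole header, so it can be pasted at the top of any script and adjusted per file.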
I always try to add a first written and a last modified date to each of the files that I create, just so that I can kind of look back in history and say, okay, so what did I do in April? Well, in April of 2015, I modified this file. This is not entirely required, right? When you use version control like Git or GitHub, then it will also track this data for you: when was it first written, when was it last modified? But for people looking at the file, it's just handy to have this information at the top. And it doesn't cost that much time. So I just have a standard header, and every time that I open up a new file and create a new script, I just copy-paste in the header and adjust it and just say, well, instead of analysis of Hardy-Weinberg, this is something else. And of course, these first written and last modified dates, when you use version control, are not that important, but it is something that you have to keep in mind, that people want to know how old or how new a certain script is. And then generally, all my scripts start with a set working directory, so setwd, because I need to move somewhere, right? The data does not live in R. So I have a set working directory. So where do I go? Well, in this case, I go to my D drive, to the folder called R course, and then to the folder called Assignments. So this is more or less an example of a header. And these kinds of headers I put in each one of my files. All right. So it's 2:02. Let's do a short break. Yeah, let's do a short break. Then I can get some more coffee. I still have a little bit, but perhaps you guys want to get some coffee as well. So thank you guys for being here the first hour. I will stop the recording. So people on YouTube, see you in the next part of the lecture.