Recording has started and I'm back, so welcome back. I think last Tuesday was really nice; we had a lot of questions about the assignments. One thing that came up was that people would like something they can hold in their hands or read on their Kindle. Because of the pandemic, Springer decided to make a whole bunch of their books available for free, and these are three of them. They have a lot more, and not just about R programming, but these are the three I would point you to. People have asked me whether there is a book they can buy or use. You don't have to buy anything: you can download the PDFs and read them on your e-reader, or on your phone when you're on the train.
My advice is to start with the first one, A Beginner's Guide to R, which is a really good book if you are starting out in R. If you want to jump directly into the statistical part, which we will come to in lecture six, I think, then Introductory Statistics with R is really good. That second book is for when you already know what an ANOVA is and just want to know how to do it in R. If you don't know what an ANOVA is and want to learn by using R to get a feeling for things like an ANOVA or a t-test, then the third book, Understanding Statistics Using R, is really good. I forgot who asked last Tuesday, but there are a lot of good resources out there, and these three are free. There are many other books too; just Google "Springer free textbooks".
Someone in chat says they weren't able to get access to the book for free. Oh, is that so? Let me see. Can I click the link? I probably could.
Let's open it up. I have my Firefox somewhere. When I click on the link here, I just get "Download book". It might be that the EPUB is still paid but the PDF is free. Someone suggests it works through the university VPN. Yeah, that might be it; it might be because I'm at the university. You know what? Since they're free, I will download all three and put them on Moodle as well. I'm not sure I'm officially allowed to do that, but I will put them online so that you can get them from there. Yay, free books! Free shit! That's what it's all about, getting free shit. And there's more: if you Google "Springer free textbooks", they have literally 150 books across many different fields. There's also one if you want to learn HTML and PHP and make your own website.
Okay, the overview for today. Today we're going to spend some time reading data. We've been doing a lot with generating random numbers, but when you have your own data set, you want to read in your own data. First, I'm going to show you how to read tables, so two-dimensional structures. Then there are text files: plain text files, like a lorem ipsum file, that contain text without a 2D structure. We will talk about binary files, actually quite a lot, because I love binary files, especially images; they're really fun to use. And then we will read text files line by line, which is really useful when you're dealing with big data, such as sequencing data with millions or billions of lines.
Are you recording? Yes, I am. Good.
I should update my overlay so that it has a little red thing above my head to show you that I'm recording; then you'll know something is off when you don't see it. Next, compressed files. R doesn't just read plain data files; it can also read a zip file directly. It can also read directly from a URL, so if you want to read Google search results, for example, you can do that in R. We will talk a little about managing your data: if I have a matrix loaded, how do I subset it? I want you to know the which function, the %in% operator, and the subset function. We will talk a little about saving data. And since we are doing data analysis with R in the biology department, I also want to introduce you to biomaRt, because biomaRt is really handy when you do biology. It gives you access to databases like Ensembl, but also KEGG and others, and you can get data from them directly into R. You can do queries like "give me the genome sequence of mouse chromosome 1 from base pair 1000 to 1500", and it will download this for you automatically.
A quick reminder of lecture one. We used R as a calculator. I talked about the different types of data: what a logical, a numeric, a character and so on are. Indexing works for all types using single square brackets, except for lists, which are special: there you need double square brackets to get at an element. We covered the seq function, the rep function, and 1:20, so the colon operator. In lecture two, we talked about variables, control structures, functions, and the different brackets. Again, if you write an if statement, a for loop, or a function, always use the curly brackets. It is much clearer for anyone else reading your code, and it's also going to be easier for you when you reread your code in a year or a year and a half.
At some point you will think: now I have a big data file and I want to load it line by line, because I can't load it in one go. Then go back to lecture number three, get the code from there, and make sure that you can use it. We also talked about escaping the inevitable, about randomness, and about clean and reusable code, so that you can look at your code in four years and still be proud of it.
If we talk about R and data, there are several sources we can get data from. We already saw random numbers: we talked about the uniform, the Gaussian, and the Poisson distribution, but there are many more. If you want to use the inverse beta or the gamma distribution, R has you covered, because those are in there as well. One source of data you should never underestimate is R packages. A lot of R packages come with example data, and this is a really good source. There's probably enough data in R packages to write some really nifty publications, because the data is just sitting there as example data. I will show you how to list all the data sets that are already in R and how to use them. Of course we can use files, like comma-separated files. We can load Excel files directly if we want to, and tab-separated files, which are very similar to comma-separated files. But images also contain data. A lot of people will in the future do things like take photos through a microscope for cell counting: you get an image and you want to identify how many cells are on it. So images can be used as a data input source in R as well. Furthermore, you can get data directly online. You can load web pages directly in R, but you can also load data from things like FTP sites. That's what we will be discussing in the coming 50 minutes.
So, random numbers. We already discussed these and I'm not going to discuss them further, but they are one of the sources.
Like I told you, R packages contain example data, and not just toy examples: some contain very intricate data sets. If you want to load a data set that is available in R via a package, you can use the data function. If you want to see all available data sets across all the packages you have installed, you can use this command: data(package = .packages(all.available = TRUE)). It will scroll down a list of all the data sets that are available in R. For example, the USArrests data set is used in many different R courses. It contains arrest data for the United States, so the number of arrests in the different states for different... Oh, Mustafa Kang, thank you for following. Sound not too loud? I had disabled the sound. All right, let's re-enable it, so the next follower will get the nice sound. I hear the sound in my headphones. All right. You can load a data set like USArrests using the data function. The first thing I do when I load data is look at the head of the data, so just the first five to ten lines. That shows you that the data I just loaded is a data frame with four columns and some number of rows. You can see that Alabama had 13.2 murder arrests per 100,000 residents in a certain year, which for this data set is 1973. There were 236 assault arrests per 100,000, then the percentage of the population living in urban areas, and then the rate of rape arrests.
Someone asks: where are all the R packages stored, and who pays for the storage space? There are many different repositories. There is CRAN, which is the standard R repository where packages can be stored, and it has the highest quality standard. Who pays for the storage space?
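The data function described here can be tried out with a few lines. This is a minimal sketch that only uses base R; the variable name available is just a label picked for this example, and the listing can be long if many packages are installed.

```r
# List the data sets of every installed package.
available <- data(package = .packages(all.available = TRUE))
head(available$results[, c("Package", "Item")])

# Load one data set by name and look at its first rows.
data(USArrests)
head(USArrests, n = 5)
dim(USArrests)   # number of rows and columns
```

Looking at the first rows right after loading is the same habit the lecture recommends for your own files.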
As for who pays for the storage space, that is an interesting question. I think it's a collaboration of different universities. If you work at a university and you think, I use R a lot and I want to support it, then you can run your own mirror. If you call install.packages in R and you haven't filled in a repository, it pops up a list of all available mirrors, and many of them are run by universities. So a lot of universities support R by running their own mirror. Furthermore, you have Bioconductor, which is the big repository for biological packages. And Bioconductor is paid for by... let me see, there is an annual report. That's by Robert Gentleman. Who is paying for Bioconductor? I don't know. Again, there are a lot of universities on the list, but someone pays for it. I think you can also support R by making donations, so part of it is probably paid for by donations. Something like Bioconductor might be supported by the NIH as well.
All right, Drigoletti redeemed "next slide in German". Good, I'll finish this slide first.
When we want to load our own data, for example a table from a file, and we have a structured data file that is tab-separated or comma-separated, we can use the read.table and read.csv functions. They load tabular data, and you have to specify the separator: which character separates column one from column two from column three. You can add the na.strings parameter to specify how missing values are coded. In the data sets I've seen, some use a dash, like a minus sign, some use an X, some people write NA, and some use an empty string to indicate an empty value. There's support for a header and row names, because in R a matrix can have column names and row names. So you can also specify whether the file has a header.
The header parameter is a TRUE/FALSE value; TRUE means that the first line of the file is interpreted as the column names. The row names can be specified using a numeric value: a value of 1 means that the first column contains the row names, but sometimes the row names are in column three or four. The function returns a data frame, because, as you know, a matrix in R can only have one type: a matrix is either numeric or character or logical. Many data sets have a mixture of those. The first column is a character, the second column is a numeric value, and the third column has something like male/female, which is a factor. So read.table gives you a data frame even when the file contains only numeric values; in that case you can convert it with as.matrix afterwards.
Okay, good. So here you see the whole function call... what's "function" in German? Here you see the whole function call from R and everything you can specify. You can see that there are a lot of parameters for the read.table function, and read.csv takes the same options. The difference between read.csv and read.table is in the defaults, above all the separator, so the character that separates the columns from each other. Of all these function parameters, the highlighted ones are the ones you use often: header is the first, then the separator and the quote character. You have row.names and col.names, there is also na.strings, and you can use colClasses to specify for the different columns whether they hold logical or numeric values.
Besides those you also have stringsAsFactors, and that is a tricky parameter. You can see here that there is a default value, and that default can be different for me than for other people, because it depends on how R is set up on your system. So when people start reading tables with R, I always say: set stringsAsFactors to FALSE, because you don't want R to convert characters into factors automatically.
Okay, a quick recap in English for everyone who doesn't speak German. The highlighted parameters are the most important ones, and we will go through them one by one. The first one is header. It defines whether we have column names and can be TRUE or FALSE: TRUE means the first line is interpreted as column names; with FALSE, R will auto-name the columns V1, V2, V3 and so forth. The quote parameter defines which quote character is used. Oh, there's a massive spelling error here; let me write that down so that I can fix it. You can specify the quote character yourself. I often say don't quote stuff in your data file, but some people like it. Then stringsAsFactors: should the character columns be turned into factors? The default used to be TRUE, and it has been FALSE since R 4.0.0, but I always advise people to set it to FALSE explicitly. Factors are something you only want when you do statistics, and you want to have control over whether a column is a character value or a factor, because only you know whether something should be interpreted as a grouping variable or is just a character value, like a free-text field with notes.
When it is set to TRUE, R takes every character column and converts it into a factor, even though the column might not contain a categorical variable. The sep parameter is the separator character that is used in the file. The row.names parameter can be used to pass a vector containing the row names, so if you know the row names you can give them to R directly. More commonly you give it a single number, which is the column of the table that contains the row names. So you say row.names = 1 or row.names = 2, and R takes the first or the second column and interprets it as the row names. Besides that, you can specify col.names, which you use when header is FALSE. If you say that there is no header in the file, you can use col.names to specify your own column names. You just use the c function: c("column name one", "column name two", "column name three") and so on, and R will use these when loading the table. Of course you have to specify as many column names as there are columns in the table. The same holds for row.names when you pass a vector: it has to have the same length as the number of rows in your file.
The check.names parameter makes sure that every column name is a valid variable name. That means a column name is not allowed to start with a number, so a column cannot be called 1something or 15something. One of the things I see a lot in our data is columns called 21days, 35days, or 2months. As a variable name in R that's not allowed, and it matters because R has the with function, which takes a data frame and makes every column available as a variable.
That is why check.names is there, and why you can disable it. When R reads a column name that is invalid, because it starts with a number for example, it changes the column name. So check.names is not just a check; it actually modifies the column names. Something like 21days becomes X21days when you load it with check.names = TRUE. It also replaces spaces in column names: a column called "21 days" ends up as X21.days in R. With check.names = FALSE, the names are read in as they are, but this makes some parts of R awkward; for example, you can no longer refer to such a column by its plain name inside with.
The skip parameter is one you use a lot, because a lot of data does not start at the first line of the file. Sometimes you have five or six header lines that you want to skip, with information like "this data was collected by Denny, it was collected on blah blah". You can skip these first five, ten, or hundred lines that contain no tabular data. That's very common in data sets.
Okay, colClasses lets you specify what is in the different columns. It takes a vector, and this vector is applied to the columns in order. If we have a data set that looks like this, then the first column is of course character. The second column, called chromosome, is a factor, because there is only a limited number of chromosomes and they are a grouping variable. Then we see two columns that are numeric. Then a column called strand, which is a factor again: it's either -1 or +1 for the negative and positive strand. Then the MGI ID, which in this case is the Mouse Genome Informatics database identifier, and that is of course a character. The symbol is also a character, and so is the description. You can specify all of this yourself.
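To see the parameters discussed so far working together, here is a small sketch. The file content is made up for illustration (it is not the data set from the slides), and the names tmp, checked and asis are only labels for this example.

```r
# Write a small tab-separated file with two free-text lines on top.
tmp <- tempfile(fileext = ".txt")
writeLines(c("Weights collected in the lab",
             "Units: gram",
             "animal\t21 days\t35days\tsex",
             "m1\t10.2\t14.8\tmale",
             "m2\t-\t15.1\tfemale"), tmp)

checked <- read.table(tmp,
                      skip = 2,                  # ignore the two free-text lines
                      header = TRUE,             # next line holds the column names
                      sep = "\t",                # tab separates the columns
                      na.strings = c("-", "NA"), # codes used for missing values
                      row.names = 1,             # column 1 holds the row names
                      stringsAsFactors = FALSE)  # keep text as character
colnames(checked)   # "X21.days" "X35days" "sex"

# Same call, but keep the column names exactly as they are in the file.
asis <- read.table(tmp, skip = 2, header = TRUE, sep = "\t",
                   na.strings = c("-", "NA"), row.names = 1,
                   stringsAsFactors = FALSE, check.names = FALSE)
colnames(asis)      # "21 days" "35days" "sex"
asis[["21 days"]]   # invalid names need [[ ]] or backticks
```

The dash in the second data row becomes NA, so the column stays numeric instead of turning into text.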
When you open your file in a text editor, you can just look at it and say: I'm going to explicitly tell R what type of data is in which column. In this case that would be character, factor, numeric, numeric, and so on. It takes a bit of typing, because you can't abbreviate to just "c" or "char"; you have to type the class names out in full. But it helps, because the data is loaded using these types and you don't have to start modifying types afterwards.
Reading data from files is often a trial and error process. You never get it right the first time. What I do is open the file in a text editor, look at its content, and then take a couple of tries to load the data correctly. Once your data is loaded, you can use the dim function to get its dimensions. If your text editor shows 50,000 data lines in the file, then dim should also give you 50,000 for the number of rows. The head function you can use to check the first few lines that you loaded. Personally I don't use head a lot. What I generally do when I load a matrix is type the name of the matrix and select the first ten rows myself, so 1 to 10. I do that across all of my code. Every time I modify a matrix or add something to it, there's a statement that prints the first five lines, because I want to check visually that the data is still correct and that I haven't done anything weird with it, like accidentally throwing away or overwriting a column. So my code is generally littered with these checks: I do a manipulation of some kind, and then I look at the first five or ten lines of the matrix I'm working on to make sure it still looks okay.
We can also load files that have no table structure.
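The colClasses idea and the checks afterwards might look like this in practice. The gene table below is invented for the example (the identifiers and symbols are placeholders, not real annotations), and it has fewer columns than the one on the slide.

```r
# A tiny made-up gene table, written to a temporary file.
tmp <- tempfile(fileext = ".txt")
writeLines(c("gene_id\tchromosome\tstart\tend\tstrand\tsymbol",
             "g1\t1\t1000\t1500\t1\tAbc1",
             "g2\tX\t2000\t2600\t-1\tDef2",
             "g3\t1\t3000\t3900\t1\tGhi3"), tmp)

# One class per column, typed out in full and in column order.
genes <- read.table(tmp, header = TRUE, sep = "\t",
                    colClasses = c("character", "factor", "numeric",
                                   "numeric", "factor", "character"))

dim(genes)            # compare the row count with what the text editor shows
genes[1:3, ]          # look at the first rows after every manipulation
sapply(genes, class)  # confirm each column got the intended type
```

Without colClasses, the chromosome column here would be read as text (because of the X) and strand as an integer, so the types would need fixing afterwards.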
Think of text files that do not have a table structure: files containing plain text, or FASTA files, or VCF files. For those we can use the readLines function, which reads text on a line-by-line basis. You can load any text file with it: plain text, CSV, FASTA, VCF. If you want to read, for example, ten lines, you specify n = 10. If you want to read the whole file, all the way to the end, you use n = -1. The first parameter is a kind of double parameter. If it is a string, so a character value like this, R interprets it as a file name: it looks in the current working directory for a file called mytext.txt, opens the file, and closes it afterwards. But you don't have to use a string; you can also pass a file object, which is called a connection. With that you can read a file line by line, or ten lines, then another ten lines, and then another ten. So you can process one line of your file at a time, which is really useful when you have a large amount of row-wise data, for example sequencing data that comes with millions and millions of reads, each read on its own line. The same goes for gene sequences: if you're reading a big FASTA file that has 20,000 genes in it, and for each gene you have the first 60 base pairs, then you may not be able to load it in one go because it's just too big. And the same holds for lists of single nucleotide polymorphisms, so SNPs. If you look at a typical genome, like the cow genome, you're dealing with 26 million SNPs, and you don't want to read a data frame or matrix with 26 million rows, because that will be massive in memory.
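A short sketch of the two ways of calling readLines with a file name. The temporary file stands in for something like mytext.txt; its content is generated just for this example.

```r
# Create a text file with 25 lines to read from.
tmp <- tempfile(fileext = ".txt")
writeLines(paste("line", 1:25), tmp)

first_ten <- readLines(tmp, n = 10)   # only the first ten lines
all_lines <- readLines(tmp, n = -1)   # everything up to the end of the file

length(first_ten)  # 10
length(all_lines)  # 25
```

Because a file name was passed, R opens and closes the file by itself on each call, which is also why the second call starts again at the top of the file.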
So then you load your text file line by line. I will just show you how to do it; this is really for you, so that if you need it in the future, you know that you can use a connection object. I don't expect you to learn this by heart. I just want you to be aware: if I have big data, I can use a connection object. A connection can be opened in different modes. I can open a file for reading, for writing, and for appending. The difference between writing and appending is this: if I open a file using "w", so I say file, the name of the file, and then "w", the first thing that happens is that the file is emptied. You told R that you want to write to this file, so it throws away the entire content. If you want to keep everything that's in the file and just add stuff at the bottom, you have to use "a" to append to the file. Furthermore, you have "r+", which opens a file for both reading and writing while keeping the existing content. You have "w+", which also allows reading and writing, but again, "w+" deletes all the content that is in the file. And then you have "a+", which opens the file for reading and appending.
How does this work? Here is a little example code. I first keep track of which line I am on, because I want to go through the file and see what I have and how many lines there are. Then I define a connection: I say, take this file called textfile.txt and open it in reading mode, so I am not allowed to write to it; I can only read from it. That gives me this thing called tfile, which is my file object.
Then you have this magic incantation, which you don't have to learn by heart. Someone in chat wonders how much precious data got lost because of "w". Yeah, a lot. It happens. That's why, when I get expensive data files from a company or something like that, the first thing I do is put them in backup and set the copy on my computer to read-only. In Windows you can right-click a file and mark it as read-only, and that's what I set for files that I receive as input, just to make sure that if I do something wrong, it doesn't throw away my data. So when you download a file, you can say that it can only ever be read and never written to. Windows has an option for that, and Linux and macOS also allow you to make a file read-only.
So what happens here? I'm saying while, and with these big statements I like reading them from the inside out. On the inside, I call the readLines function on the connection, saying: read a single line and put it in a variable called line. Then I want to know the length of what I just read. If the length is larger than zero, I got data. Only when you get to the end of the file, and there are no more lines left, will it be zero: if I'm at the end of the file and call readLines on this connection, it gives me back an empty result, so length zero, and then I know I'm at the end of the file. That's why the while statement checks whether the length of the returned line is larger than zero; if it is zero, we're done. Inside the loop, I just cat the line to the screen, so it gets printed in the console, and I increase my line number by one. And when I'm done, I have to close the connection. I have to explicitly close the connection.
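The loop described above can be reconstructed roughly as follows. The slide itself is not part of the transcript, so the variable names (con, line_number) and the temporary file are assumptions made for this sketch.

```r
# A small stand-in for the text file on the slide.
tmp <- tempfile(fileext = ".txt")
writeLines(c("first line", "second line", "third line"), tmp)

line_number <- 1
con <- file(tmp, "r")                          # open the file read-only
# Read one line per iteration; an empty result means end of file.
while (length(line <- readLines(con, n = 1)) > 0) {
  cat(line_number, ":", line, "\n")            # print the line to the console
  line_number <- line_number + 1
}
close(con)                                     # always release the file again
```

An empty line in the file still comes back as one (empty) string, so the loop only stops at the real end of the file.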
If you don't close the connection, Windows will start complaining that you cannot open this file because it is still in use by another program. So make sure that when you use the file function, you always close the connection that you made. This lets you read through a file line by line, and you can then do what you want with each line. Normally you would split it by a certain separator, or look whether it contains a certain word that you're interested in. If you want to do something with the Bible, you can use this to read through the whole Bible line by line and say: if the word Jesus occurs in this line, remember that, or add one to a counter. That way you can count the number of occurrences of Jesus in the Bible, or in some other book that you're interested in.
You can also use tar.gz and zip files directly as a connection. If you have a compressed file that contains a single text file, you can load directly from it without having to extract it. This is really useful for sequencing data, because sequencing data is so big: it's generally in the order of 200 gigabytes of plain text. From the company you get it as a tar.gz file, so you get something like a 25 gigabyte compressed file, and if you extracted it there would be 200 gigabytes of text data in there. You don't want to deal with those 200 gigabytes of text; you just want to deal with the 25 gigabyte file and read through it. With R, you don't have to extract the whole file. How do you do that? You use the gzfile function. Here we do exactly the same thing as before, but now we read through a compressed text file called textfile.gz, and we open it in reading mode.
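A sketch of the compressed-file loop, combined with the word-counting idea. The example first writes its own small gzip file so that it is self-contained, which also shows that a gzfile connection can be opened with "w". The file content, the search word and all variable names are made up for illustration. Note that gzfile handles a single gzip-compressed file; for a file inside a zip archive, base R offers unz instead.

```r
# Create a small gzip-compressed text file to read from.
tmp <- tempfile(fileext = ".gz")
out <- gzfile(tmp, "w")
writeLines(c("the quick brown fox", "jumps over the lazy dog",
             "no match here", "the end"), out)
close(out)

word <- "the"
hits <- 0      # number of lines that contain the word
n_words <- 0   # total number of space-separated words
con <- gzfile(tmp, "r")                      # read without extracting first
while (length(line <- readLines(con, n = 1)) > 0) {
  if (grepl(word, line, fixed = TRUE)) {
    hits <- hits + 1
  }
  n_words <- n_words + length(strsplit(line, " ", fixed = TRUE)[[1]])
}
close(con)

hits     # 3
n_words  # 14
```

Only one line is held in memory at a time, which is the point of this pattern for very large files.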
We have the same magic incantation to take one line at a time while we are not at the end of the file, and then we do the same thing: we cat the line, and at the end we close the connection. If you don't close the connection, the file stays locked until you quit R. You can also use a URL as a connection, so you don't need to have the data on your hard drive.
Someone asks whether you can still make a tar.gz file read-only. Yeah. For any file in Windows, you can right-click it, tick the read-only box, and make it read-only, and then you can never write to it. I think gzfile doesn't allow you to specify writing mode to begin with, so if you said gzfile, something, "w", I think it would just throw an error. Well, it might actually be able to write directly to a compressed file; you would have to check. I never got tempted to try writing to a very expensive gz file, and I never tested it, because generally you don't want to write to the files that you load anyway. For me, you have data that is input and stuff that is output, and the output can be the input of a next step. I always work like this: set my working directory, load my file, do some manipulations, and then write out the results. That's the structure I try to follow in my scripts.
Like I said, you can also use a URL as a connection. If you want to read directly what Google gives you back, you can say url("http://google.com"). That should probably be HTTPS by now; I don't even think they serve plain HTTP anymore. Then you can do the same thing: read a single line from google.com and then close the connection. A word of advice from my own experience: don't put this in a for loop. Don't send a thousand requests to Google within a couple of minutes. They will block you. They will block your IP address.
And that means that you, and anyone else on your network, cannot use Google anymore. So if you want to load directly from a website, be very, very careful in what you do. Don't say: for x in 1 to a million, open the URL google.com, read one line, close the connection, and do it again. That will definitely get you banned from Google or Facebook. The only one that really allows it is Bing; Microsoft is really relaxed about that kind of thing. You can kind of blow up their servers and they don't really care. But Google cares, Facebook cares, and they will ban you. And then it takes a lot of emails, which you can't send from your Gmail address anymore, to get unbanned. So don't do that. But sometimes you do want to load from a URL, because there's a nice data set online. One example where I do this is the COVID-19 data. I've been tracking the COVID-19 pandemic since January 2020, and there is a really nice URL where you can download all the current data. Once in a while I hit that URL and download the whole data set, without having to store it on my hard drive, where it would take up a lot of space.
All right. So you can load from text files, from URLs, and from compressed text files as well. If you want to load an Excel file, then unfortunately there is no native support for Excel files in R. Excel files are a format owned by Microsoft, so there are some limitations on what open source programs can do with it. There are some libraries available. There's the xlsx package, which uses Java to read and write Excel files, and then there's openxlsx, which is the one I would advise. If you want to read an Excel file in R directly, use the openxlsx package. It uses C++, so it's much faster.
And it can read and write Excel files directly. But be aware that both of these packages are relatively slow and relatively unreliable. They might work, they might not work, depending on your Excel file. If your Excel file is a little bit more complex and has references from one sheet to another, many of these packages don't deal with that correctly.
The xlsx package is not working on my R. Yeah, that's what I told you, slow and unreliable. So the thing that I do when I have data in Excel is: I open it up in Excel, I do Ctrl+A, Ctrl+C, then I open up a text editor, do Ctrl+V to paste the whole sheet into the text editor, and save it as a comma-separated file. CSV is king. Yeah, CSV is definitely king. It has its drawbacks, but it is a lot better than Excel.
If you really want to do something with Excel, then you can do install.packages with xlsx. After installing it you need to load it, so you say library, xlsx. And then you can call read.xlsx on my Excel file, tell it to read sheet number one, and store the result in sheet one data. But your mileage may vary. You might get stuck on the install.packages step. You might get stuck on the read step, where it won't load your sheet at all, or the data that you get back is just garbled. That happens. So I always export it.
Yeah, so you need to install Java in that case. That's the JAVA_HOME thing. You need to install Java from either Oracle or the OpenJDK, because xlsx wants to load the rJava package, and the rJava package needs JAVA_HOME to be set. But that's a whole other story. These are just external packages. Probably the openxlsx package would work for you, because that uses C++, so it doesn't rely on Java being there. But then you might need to have a C++ compiler installed on your system. So your mileage may vary, but there are packages that you can use to read and write Excel files.
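Before moving on to binary files, here is a rough sketch of the line-by-line connection pattern from earlier. It is only an illustration: the temporary file, its three lines, and the variable names (tmp.file, in.con, fields) are made up for this example. It also suggests an answer to the gz question, since gzfile does seem to accept a write mode.

```r
# Write a small gzip-compressed text file to a temporary location
# (tmp.file and the example lines are made up for this sketch)
tmp.file <- tempfile(fileext = ".txt.gz")
out.con <- gzfile(tmp.file, open = "w")
writeLines(c("id,value", "a,1", "b,2"), out.con)
close(out.con)

# Read it back one line at a time until readLines returns nothing
in.con <- gzfile(tmp.file, open = "r")
n.lines <- 0
fields <- list()
while (length(line <- readLines(in.con, n = 1)) > 0) {
  n.lines <- n.lines + 1
  fields[[n.lines]] <- strsplit(line, ",")[[1]]  # split the line into columns
}
close(in.con)  # always close the connection when you are done

# The same pattern works with a URL connection, for example:
#   con <- url("https://www.google.com"); readLines(con, n = 1); close(con)
# (not run here, and never inside a big loop)
```

Only one line is in memory at a time, which is the reason this pattern is worth the extra typing for very large files.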
All right, so I told you that the next slides will be a lot more advanced. This is just to show you guys what is possible. Let me see, we still have around 20 slides to go, and I've been recording for 42 minutes. Okay, so I'm just going to do a couple of slides and then we will have another break.
So readBin, loading binary files. You can load any image on your hard drive. But you can also load any executable file if you wanted to. So in case you want to do some cracking of software, for example you have a video game which has some protection so that you can only run it when there's a CD in the drive, then you could use R to rewrite the executable file. You would load in the file, do some manipulations, and then write it out again. In theory, this is all possible in R. But I will just show you a very basic example of loading a BMP file, so a bitmap file. It's a relatively old format, but it's a very clear format. It's much better than JPEG or PNG if you want to explain how files work.
So you can use the readBin function to read binary files. In this case, I can say readBin, my.bmp, and n is one. So it will read one element. So what is one element? That is what you still have to specify. If you want to read the whole file, then you can use this incantation: readBin, my BMP file, and the number of bytes that I want to read from the file is the size that I get from file.info. That gives me back the file size in bytes, which I can then use as the n parameter in my readBin function. Every time that we're going to do this, we're going to read the whole file, because of course an image is nothing when you only have like one byte of it. Although sometimes a couple of bytes is enough. So readBin loads binary files, and it has a function parameter which is called what.
Because when you're loading binary files, you have to tell R how you want to interpret the bytes that it reads, because when it reads binary, it just reads ones and zeros, right? So you tell it to, for example, read a numeric value. A numeric value is a 64-bit value, so it will read 64 ones and zeros, and after reading them it will interpret this as a numeric number, like you specified. The same thing holds for a double: a double will read 64 bits, so eight bytes, from the input, and then interpret the sequence that it gets back as a double value. The integer value is a little bit different. By default an integer will read four bytes from the input and interpret this as a signed whole number, roughly between minus 2.1 billion and plus 2.1 billion. You can also say int, which means the same thing. If you want smaller integers, you can use the size parameter, for example two bytes, which read as unsigned gives you a number between zero and 65,535. It can also read logicals and complex numbers. A logical is stored like an integer, so by default it reads four bytes from the input, and then if it's one, it's TRUE, and if it's zero, it's FALSE. You have complex, you have character, and you have raw.
During the whole reading, we will just use the raw type, because we want to interpret the file in our own way, because we're loading an image file. R doesn't know about the BMP format. So we're just reading raw bytes, which R shows as hexadecimal numbers, from the input stream into R, and then we will start doing our own interpretation on it.
So a little bit of background for you guys on BMP images. As you all know, an image is like the thing that you're looking at currently, which is a movie, but a movie is of course nothing more than a sequence of images. So images are a two-dimensional array of pixels, right? Here we see an image, and this image is six pixels wide and two pixels high. So you have pixel number one, two, three, four, five, six, and then pixel number seven is on line number two.
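Going back to the what parameter for a moment, here is a rough sketch of how the same bytes can be read with different interpretations. The file name tmp.bin and the two values are made up for this example, and the sizes in the comments are the defaults on the usual platforms.

```r
# Write two known values to a temporary binary file
# (tmp.bin and the values are made up for this sketch)
tmp.bin <- tempfile(fileext = ".bin")
out.con <- file(tmp.bin, open = "wb")
writeBin(42L, out.con)   # one integer, 4 bytes by default
writeBin(3.5, out.con)   # one double, 8 bytes by default
close(out.con)

n.bytes <- file.info(tmp.bin)$size   # 4 + 8 bytes in total

# Read the values back, telling readBin how to interpret the bytes
in.con <- file(tmp.bin, open = "rb")
an.int <- readBin(in.con, what = "integer", n = 1)
a.dbl  <- readBin(in.con, what = "double", n = 1)
close(in.con)

# Or skip the interpretation and take every byte as a raw value
raw.bytes <- readBin(tmp.bin, what = "raw", n = n.bytes)
```

Reading as raw never fails on a format R does not know, which is why it is the starting point for the BMP example.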
So remember that every time we read a BMP file, we need to know what the width of the image is, because we have to split it at a certain point, right? We have to end the current line. If you look at your monitor, the monitor has pixels, and the pixels start from here, this is pixel number one, and then it goes all the way to here. Oh, it can't go any further. Depending on what kind of monitor you have, you might have 1,024 pixels across, or you might have 2,000, or you might be watching on a high-resolution screen with around 4,000 pixels. But for this example, we will just be using a very, very small image. This image is six pixels wide and two pixels high, meaning that in total it has 12 pixels.
Now, a file is not a two-dimensional array. A file is just a linear sequence of bytes. The way that a BMP file is structured is that it has a header. The first 54 bytes of the file are the BMP header. It starts with the letters BM, and then it has some fields which encode things like the size of the file, but we're not going to use that. The nice thing about the BMP file is that after these 54 bytes, the next three bytes determine the color of the first pixel, and the three bytes after that determine the color of the second pixel. Normally we would code colors as RGB, red, green, blue, but the BMP format does it the other way around: the first value of a pixel is the blue value, then we have a green value, and then we have a red value. Each value is one byte, so it's a hexadecimal number of two digits, which ranges from 0 to 255. So at position 55 we have our first value, which is the blue channel for the first pixel, at 56 we have the green, at 57 we have the red channel, and at 58 we have the blue channel for the second pixel.
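The layout just described can be sketched in a few lines of R. This follows the simplified picture from the lecture (a 54-byte header, then three bytes per pixel, with no padding at the end of a row); the helper names pixel.bytes, pixel.row and pixel.col are made up for this example.

```r
header.size <- 54   # bytes before the pixel data
width <- 6          # pixels per row in the toy image
height <- 2

# 1-based positions of the blue, green and red byte of pixel k
pixel.bytes <- function(k) {
  first <- header.size + 3 * (k - 1) + 1
  c(blue = first, green = first + 1, red = first + 2)
}

pixel.bytes(1)   # blue = 55, green = 56, red = 57
pixel.bytes(2)   # blue = 58, green = 59, red = 60

# Row and column of pixel k when every row holds 'width' pixels
pixel.row <- function(k) (k - 1) %/% width + 1
pixel.col <- function(k) (k - 1) %% width + 1
c(pixel.row(7), pixel.col(7))   # pixel 7 is the first pixel of row 2

# Two hexadecimal digits per byte: 00 is 0, 4f is 79, ff is 255
hex.values <- strtoi(c("00", "4f", "ff"), base = 16L)
```

This is also why the width matters: without it there is no way to tell where one row of pixels ends and the next one begins.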
So by mixing these three color values we create the color of the pixel, and this depends on how intense the blue channel is versus how intense the green and the red channels are. If all of them are 255 we will have a white pixel, if all three of them are zero we will have a black pixel, and if we have a blue pixel then that means it's 255, 0, 0. Is that clear so far? Can everyone get an impression of how it looks on the hard drive? The file is nothing more than a whole bunch of bytes. The first 54 bytes carry no color information, and then the color information starts from byte 55: bytes 55 to 57 are the first pixel, bytes 58 to 60 are the second pixel.
All right, if that's clear, how are we going to do this in R? The first thing that I'm going to specify is which image file I'm going to load. In this case I'm going to load image1.bmp from the assignment three data, so that is the name of the file. I get the information of the file because I need to know how big the file is, so I'm going to say file.info of the image file. And then I'm going to read the whole file in raw mode, so I'm going to say readBin, the image file, which is just the string with the file name, and I say n is as.numeric of the size which I get from the image info object that I just created. And how am I going to read it?
Well, I'm going to read it as raw values, and then when I type image.data in R, it scrolls a big vector in front of my nose, right? So this is a vector. This little image here, image1.bmp, has like 97,901 bytes or raw values in there, and you can see that these raw values are hexadecimal. That means that the last digit ranges from 0 to f, and the first digit also ranges from 0 to f. So ff means 15 times 16 plus 15, which is 255, 00 means zero, 01 means 1, and 4f means 4 times 16 plus 15, which is 79. Anyway, it's just the representation that it uses. If you want to learn how to count in hexadecimal, then there's probably a good book that you can use for that.
But it's just a big vector, and this vector contains a bunch of numbers, so now we want to do something with it, right? The first thing that we're going to do is remove the header, because the header doesn't contain any image data. We're just going to throw it away. So I'm going to say my image data, minus, bytes number 1 to 54. You can remove elements from a vector by using a minus in the index, and here we're throwing away the first 54 bytes. So the file looked like this, with the first 54 bytes being the header. Then we remove it, and now the first value that we get in our vector is the blue value for the first pixel, the second value is the green value, and the third value is the red channel, right? So image.colorData will contain the same as image.data, except that it will not contain the header, because we threw away the first 54 bytes using minus.
All right, so now I want to get the different color components. Imagine that I want to get the blue color component. I can use the seq function, so I'd say make a sequence from
1 to the length of the color data, so from 1 to the end of the vector, and now step by 3, because I can see here that if I take 1, 4, 7, 10, then I get all the blue values, right? This is something which I now call sequence. So the sequence here specifies 1, 4, 7, 10, all the way until the end of the file, and every time this thing points to the blue channel.
So now I'm saying, okay, I want to make a matrix, because my image is a 200 by 200 pixel photo of Obama. I say make a matrix, take from my image.colorData the sequence, so only the blue intensity values, and transform these to numeric, because the color data is still in raw hexadecimal format but we want to have it in numeric format. And since my image is 200 by 200 pixels, I have to specify that the matrix will have 200 rows and 200 columns. Then I call this blue, and if I now do image of blue, then you see here what R will draw for you in the plot window. Everywhere where the picture was blue, you have a high blue intensity, and everywhere where the picture was red, you see that there's a very low blue intensity.
So this is the way that you can use the readBin function to read in any binary file, like a BMP file, and display it in the R plot window. Of course you can see here that the axes just say from zero to one, from zero to one, but we can also specify how big the plot should be and where the points should be. And now we can start doing really, really fun stuff with it. For example, we can make the image move by just moving the first column of the matrix to the back and doing that continuously. Then the image will start shifting and will move across the plot.
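The steps from the last few slides can be put together as a rough sketch. Since the Obama image is not available here, the sketch builds a tiny fake file first: 54 header bytes followed by blue, green, red bytes for a 4 by 4 image. The file, the pixel values, and the names fake.file, blue.in, blue.idx and shifted are all made up for this example, and it ignores the row padding and row order of real BMP files. It uses as.integer for the conversion, which does the same job here as the as.numeric call mentioned above.

```r
# Build a fake image file: 54 header bytes, then B, G, R per pixel
n.side <- 4
blue.in <- as.integer(seq(0, 255, by = 17))               # 16 blue values
pixels <- as.vector(rbind(blue.in, 0L, 255L - blue.in))   # interleave B, G, R
fake.file <- tempfile(fileext = ".bmp")
writeBin(as.raw(c(rep(0L, 54), pixels)), fake.file)

# Read the whole file as raw bytes
image.info <- file.info(fake.file)
image.data <- readBin(fake.file, what = "raw", n = as.numeric(image.info$size))

# Throw away the 54 header bytes
image.colorData <- image.data[-(1:54)]

# Every third byte, starting at 1, is a blue value
blue.idx <- seq(1, length(image.colorData), by = 3)
blue <- matrix(as.integer(image.colorData[blue.idx]),
               nrow = n.side, ncol = n.side)

# image(blue) would draw the blue channel in the plot window (not run here)

# Move the first column to the back to shift the picture by one step
shifted <- cbind(blue[, -1], blue[, 1])
```

Starting the sequence at 2 or 3 instead of 1 gives the green or red channel in the same way, and calling image on a freshly shifted matrix in a loop gives the moving picture.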
All right, so that's another hour of recording, so we will take a quick break. Remember, you don't have to do this during the exam. I'm just showing you this because at a certain point I want to show you guys how to make a really nice world map and plot, for example, the COVID-19 data on it, so that you can take an image of the world map and show how many infections there are. That way you can make really nice visualizations, which you can then use in a publication. The same thing holds for my own work: I'm measuring fish, and these fish come from different locations in Germany. So what I want to do is show a map of Germany, and for each of the locations at which I sampled fish, I want to show how many fish are of a certain species and how many are of a different species. So at each of these points, I want to have a pie chart showing the distribution of these fish. All right, so I will stop my recording here.