So welcome back, also to the people on Moodle. Any questions about the assignments? Try doing them yourself before you look at the answers, which are posted on Moodle. If you get stuck somewhere, send me an email or take a quick peek at the answers, but try doing them yourself first. You learn more by getting stuck and frustrated than by copy-pasting code from the answers. All right, so today: files and common file types. So let's just start. Files come in two major types. I think everyone knows that there are plain text files, so files holding text data or data in a textual format. Here you just have to be aware that there are many different encodings. The standard letters are more or less covered by the ASCII character set: the normal Latin letters a to z and capital A to Z, but also things like the newline character \n, the carriage return \r, and a couple of other control characters defined in ASCII. If you are German and you are using a German keyboard, then of course you have letters like the umlauts (ä, ö, ü), and those are not defined in ASCII. So when you type an umlaut and save a text file, the character encoding of the file changes. R does a pretty OK job of trying to figure out which kind of text file you are loading, but sometimes this can be problematic. Sometimes R thinks you are loading an ASCII file, but you are actually loading a file which has UTF-8 in there. UTF-8 has many more available characters: in UTF-8 you can also have Chinese symbols or symbols from other scripts like Farsi. The problem is that R might not recognize them properly. So when you load your file, look at your data, and see weird characters that you do not expect, it is usually an issue with the character encoding.
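The encoding problem above is easy to reproduce. This is a small sketch in Python (in R you would use the fileEncoding argument of read.table); the file name is made up for the illustration, and the point is only that the same bytes decode differently depending on the declared encoding.

```python
# The same bytes decode differently depending on the declared encoding.
data = "Müller".encode("latin-1")   # a German name saved by a Latin-1 editor

with open("example.txt", "wb") as fh:
    fh.write(data)

# Reading with the wrong encoding gives the "weird characters" from the lecture:
wrong = open("example.txt", encoding="utf-8", errors="replace").read()
print(wrong)        # the ü becomes a replacement character: not valid UTF-8

# Reading with the correct encoding recovers the text:
right = open("example.txt", encoding="latin-1").read()
print(right)        # Müller
```

The same data, two different results: this is exactly the "weird characters" symptom described above.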
And you can set the character encoding in functions like read.table to make sure that it loads the characters in a format that it understands. One thing about text files is that every line ends with an end-of-line character. After every line of text there is an end-of-line marker, which is \n under Linux, but \r\n under Windows. So there are some nuances when you load a text file that was made, for example, under Linux into Windows, or the other way around: you might have an additional character at the end of each line. These are sometimes difficult to get rid of. Again, R does its best to figure out what your end-of-line character is, but sometimes it makes a mistake. Then when you look at your data, you see a little square, usually with a question mark in it, and that means R did not read the end-of-line character properly. Besides text files there are of course binary files: things like .exe files and DLLs under Windows, or what under Linux are normally called .so files (shared object files), or just standard executables. A binary file is simply a sequence of bytes. Binary files usually contain some kind of header, and the header usually contains, in the first 16 to 32 bits, a magic number, which Windows or Linux uses to figure out what type of file you are looking at. If you open a binary file in a text editor, you generally get something that looks like garbage, but in the first couple of characters there will be something like PNG, and then you know: OK, I am actually trying to open a PNG file with a text editor. So there is a major difference between text files and binary files. R allows you to read text files using read.table, and you can also read binary files in R.
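The magic-number idea can be sketched in a few lines of Python. The signatures below are the real published ones; the file name "mystery.bin" and the function name are made up for this example.

```python
# Identify a file type from its magic number, as described above.
MAGIC = {
    b"\x89PNG\r\n\x1a\n": "PNG image",
    b"GIF89a": "GIF image",
    b"%PDF": "PDF document",
    b"\x1f\x8b": "gzip-compressed data",
}

def sniff(path):
    """Return a guess of the file type based on the first few bytes."""
    with open(path, "rb") as fh:
        head = fh.read(8)
    for signature, name in MAGIC.items():
        if head.startswith(signature):
            return name
    return "unknown (maybe plain text)"

# Write the first bytes of a gzip header and sniff them back:
with open("mystery.bin", "wb") as fh:
    fh.write(b"\x1f\x8b\x08\x00")

print(sniff("mystery.bin"))   # gzip-compressed data
```

This is exactly what a text editor cannot do for you: it shows the bytes as garbage, while the operating system (or a small sniffer like this) reads the header to decide the type.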
So if you're interested in learning how to read BMP files and such, follow the R course; there will be a lecture about reading data coming from different sources. But I won't go into detail here. Just be aware that there are two fundamentally different file types. So the most common file format used in bioinformatics is the TSV or CSV file. TSV stands for tab-separated values and CSV stands for comma-separated values. This means you have just a simple text file: a format which stores data in a tabular structure. Every line of the file is a row, and the columns are defined by a separator, so each field of a record is separated from the next by a delimiter. And that's where TSV differs from CSV: TSV uses the tab as the separator, while CSV uses a comma. Of course, when you're reading TSV or CSV files, always be aware of something called delimiter collision. That is: when I'm using commas to separate my fields and one of the values itself contains a comma, then R, or whatever other program you use to load the file, will see that comma as the start of the next column, right? So the first line will have, for example, 11 columns and the next line will have 12, because that row of your file is not well formed. This is a very fundamental issue, and R will give you a warning or an error when it occurs; for example, when you're trying to load a tab-separated file and one of your fields has a tab in it, R will warn you saying that, well, at line 1100-something it found 12 elements, but all of the lines before that only had 11 elements. This is called a delimiter collision error, and it happens a lot with TSV and CSV files. The next very common file format is also a text format: the FASTA format.
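Delimiter collision is easiest to see with a tiny sketch. This Python example (the gene name and description are invented) shows how a comma inside a field breaks naive splitting, and how quoting, which most CSV writers apply automatically, avoids it.

```python
# Delimiter collision, and how quoting avoids it.
import csv
import io

# A value that itself contains the delimiter:
row = ["GeneX", "involved in growth, development", "chr3"]

# Naive joining causes a collision: the comma inside the description
# makes this line look like it has four columns instead of three.
naive = ",".join(row)
print(naive.count(",") + 1)   # 4 apparent columns

# The csv module quotes the offending field, so it parses back correctly:
buf = io.StringIO()
csv.writer(buf).writerow(row)
parsed = next(csv.reader(io.StringIO(buf.getvalue())))
print(len(parsed))            # 3 columns, collision avoided
```

In R, read.table has quote and sep arguments for the same reason; TSV files dodge the comma problem but run into exactly the same issue as soon as a field contains a tab.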
So FASTA is the common format for representing nucleotide sequences or peptide and protein sequences. You have special start-of-line symbols. If a line starts with a semicolon, this means that it is a comment line: after this symbol you can type anything that you want, and a FASTA reader, like the ones available in several R libraries, will just ignore this line altogether. Someone in the chat ("Zebra with no principles") says: "FASTA should die already." Why? What other format is better than FASTA? FASTA is a great format in a way. It's text-based, so you can just open it in a text editor. There are other options, but I think gzipped FASTA is pretty good. The only drawback of FASTA is that it's kind of heavy: if you look at DNA sequences, you only have four possible bases, but you're using a whole character, which can hold many different values, to represent each one. So you're wasting a lot of space when you're looking at FASTA files. But in a way, FASTA is a pretty good format. If you know of a better format to store things, then use that, but FASTA is the common one. Another chat comment: "More like slow, am I right, guys?" How do you mean? FASTA can be really, really quick; you can read text files at an insane speed. And another: "Sibir is a troll." Yeah, I don't know, it's a fair comment: there are many people that don't like FASTA, and there are some alternatives to FASTA as well. It's just that it's a very, very common format. So the next special line character is the greater-than symbol. Lines starting with > are description lines: every sequence can have a name or a description, which describes what kind of sequence you have. And of course we've seen FASTA already a couple of times, when we downloaded sequence data from Ensembl. After the header line and comments, one or more lines may follow containing the sequence itself. Each line of sequence should have fewer than 80 characters.
And this is not a very hard rule: if you go to Ensembl, Ensembl puts 60 characters on a line, and many, many FASTA readers will not enforce this 80-character limit. But the sequences are expected to be represented in the standard IUPAC amino acid and nucleic acid codes. So you can also say there has to be a C or a T at this position, or there has to be any nucleotide: you can use an N character as well. Right, so if you're encoding DNA, then ACTN is perfectly valid, because the N character just stands for any possible base. The NCBI defines the standards for the unique identifiers used as the sequence identifier in the header line. So when sequences come from different databases, you can see from the description, right after the greater-than symbol, which database they came from. These identifiers have more or less the format database|accession|locus, where the first two to three letters indicate the database: gb stands for GenBank, emb stands for EMBL, and dbj stands for DDBJ, the DNA Data Bank of Japan. Then you have a vertical bar, the accession number, another bar, and then the locus. But the sequence identifier in the header is only standardized if you choose to follow the standard; you don't have to adhere to it, right? You can make any file that you want. As long as you have a description line, some number of comment lines (might be one, might be zero, might be a hundred), and then the sequence data, that is a valid FASTA file. All right, so FASTA is a format which allows you to store sequence data. If you want to store the data which comes out of a sequencer, then normally you get a FASTQ file.
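The rules above (";" for comments, ">" for descriptions, everything else sequence) are simple enough that a minimal FASTA reader fits in a dozen lines. This is a sketch in Python; the record names and sequences are made up.

```python
# A minimal FASTA parser following the rules above.
def read_fasta(lines):
    """Return a dict mapping description -> sequence."""
    records, name = {}, None
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(";") or not line:      # comment or blank: ignore
            continue
        if line.startswith(">"):                  # description line
            name = line[1:]
            records[name] = []
        else:                                     # sequence line (< 80 chars)
            records[name].append(line)
    return {k: "".join(v) for k, v in records.items()}

fasta = [
    "; an example file, made up for this sketch",
    ">seq1 demo sequence",
    "ACTGACTG",
    "ACTN",          # N = any nucleotide, still valid IUPAC
]
print(read_fasta(fasta))   # {'seq1 demo sequence': 'ACTGACTGACTN'}
```

Note how the multi-line sequence is simply concatenated: that is why the 80-character line limit is a convention rather than a hard rule.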
So a FASTQ file is very similar to a FASTA file, but the thing that FASTQ adds is the Q, which stands for quality. So it's a FASTA-like file format with a quality score. That means that for every base you can encode how certain the sequencer was that it really is that base. FASTQ normally uses four lines per sequence. The first line starts with an @ character, which is followed by a sequence identifier and an optional description. So this is very similar to the FASTA format, right? In FASTA you have the greater-than symbol, but in FASTQ you have the @. The second line contains the raw sequence letters, so ATCG when you're looking at DNA. The next line starts with a plus character, optionally followed again by the same sequence identifier as after the @, but generally it's empty: generally you have nothing on the third line, just the plus character, to indicate that the quality part of the record follows. And then you have the encoded quality values for the sequence on line two. Here, of course, you must have the same number of quality values as there are bases on line two; that is enforced in the FASTQ format. The quality scores are of course not written down as plain numbers one, two, three, four, five; they are encoded using Phred scores. A Phred score is based on the error probability p. So if a sequencer reads, for example, an A, and it is 99% sure that it really read an A, then there is a one-in-a-hundred chance that it's not an A, which means the Phred quality score of this base would be 20. If there's more noise in the machine and it only has a 90% certainty about the base, then of course the Phred score is 10.
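The four-line structure can be checked mechanically. A sketch in Python, with a made-up read; the two assertions encode the two rules just described (the "+" separator, and equal lengths of sequence and quality lines).

```python
# Reading one four-line FASTQ record, as described above.
record = [
    "@read1 an optional description",
    "ACTGACTG",
    "+",
    "IIIIHHHH",
]

header, sequence, plus, quality = record
assert plus.startswith("+")                # third line: just the separator
assert len(sequence) == len(quality)       # enforced by the FASTQ format

print(header[1:])   # read1 an optional description
print(sequence)     # ACTGACTG
```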
So Phred scores more or less range from zero, which means that there is no certainty about the base at all, basically a random read; in practice, real scores tend to start from around five. And they go up to around 60, where 60 means that there's only a one-in-a-million chance that this base is not the base that is written down. The Phred score Q is minus 10 times the log10 of the error probability p, and it is encoded into ASCII. The encoding allows 94 levels in total, so it doesn't just go up to 60: the scores run from 0 all the way up to 93, using the ASCII characters 33 to 126. That means that if I want to encode a quality score of zero, I write an exclamation mark (ASCII 33); the next quality score, a score of one, is the double quote (ASCII 34); and so on. You can work out what a given character means by taking its ASCII number and subtracting 33. A capital A is ASCII number 65, so 65 minus 33: a capital A encodes a quality score of 32. And a lowercase a (ASCII 97) encodes another, much higher quality score, because it's all the way further along, right? So it is just taking the error probability, converting it to this Phred Q score, and then looking this Q score up in this array of characters. And that is of course because you cannot write down 10 or 20 or 100 in one position: you want to write down a single letter per base. So it uses the fact that a character can take many different values to write down the quality score in a very, very compact format. All right, so that's the FASTQ format. FASTQ looks very similar to FASTA; it just also allows you to store the quality of the reads in the same file. Another very common file format is the GFF format, which is called the general feature format.
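The probability-to-character conversion above can be written in two tiny functions. This is a sketch in Python; the function names are made up, but the formula Q = -10 log10(p) and the offset of 33 are the standard ones.

```python
# Converting between error probabilities and Phred-encoded characters.
import math

def phred_char(p):
    """Encode an error probability p as a FASTQ quality character (offset 33)."""
    q = round(-10 * math.log10(p))
    return chr(q + 33)

def phred_quality(ch):
    """Decode a FASTQ quality character back to its Phred score."""
    return ord(ch) - 33

print(phred_char(0.01))        # p = 1% -> Q20 -> '5'
print(phred_quality("A"))      # 'A' is ASCII 65 -> quality 32
print(phred_quality("!"))      # '!' is ASCII 33 -> quality 0
```

So one character per base is enough to store the whole quality range from 0 to 93.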
And that is a file format that describes genes and other features of DNA, RNA and protein sequences. There are two versions in use, GFF2 and GFF3. Nowadays most files that you find online or that you download are in GFF3 format, but the two formats are very, very similar. In essence, it's just a tab-separated file, but the trick here is that every line has exactly nine entries, and these nine entries are always separated by a tab. So how does this look? The first entry on a line is the sequence: the name of the sequence where the feature is located. Then we have number two, which is the source, so a keyword identifying the source, like the program or database it came from. Then you have the feature, and this is the feature type name. These can be things like gene, exon, SNP, CDS, but you can only pick something which is in the controlled vocabulary, right? So there's a dictionary of allowed terms which says: this is a gene, this is an exon, and if it is an exon, then it has to be written exactly this way. Then the fourth element of each line, after the third tab, is the start position: the genomic start of the feature, and this uses 1-based coordinates. And this is a little bit tricky, because there's always this clash between biologists and computer scientists: computer scientists start counting from zero, while biologists start counting from one. So if you look at chromosome one, a biologist would say that the first base, for example a T, is base number one, at position number one. But a bioinformatician, someone from the informatics side, will say no, this is the first base and it is stored at position number zero. So there's this off-by-one error, which is sometimes a very big issue in biology. You then have the end position in column number five, which is the genomic end of the feature, also with 1-based coordinates.
And then you have the score: a numeric value indicating the confidence of the source in the annotated feature. This can be something like: well, we have a coding sequence here and we are 87% sure that this is a real coding sequence. Then the seventh position is the strand. This is a single character that indicates the sense. It can be plus, which means five prime to three prime; it can be minus, which means three prime to five prime; or it can be a dot, which means that it is undetermined. And this, of course, is important when you're dealing with things like SNPs, because a SNP can be a G-to-T SNP on the one strand, the positive strand, right? But on the negative strand, a G-to-T SNP is of course a C-to-A SNP. So when you're dealing with SNPs, it makes a big difference whether you're looking at the positive or the negative strand of the DNA. Then position number eight is where the difference between GFF2 and GFF3 lies, because in GFF2 it is called the frame, and in GFF3 it's called the phase. Frame and phase relate to amino acids, right? Every amino acid is coded by three base pairs, so a base can be in frame, meaning zero, or out of frame, meaning one or two. It is the shift relative to the reading frame of the coding sequence being used. And then at position number nine you have the attributes, and this is all other information pertaining to this feature. This is generally a list which is collapsed and put into the last column, so the ninth column is more or less an open field where people can read and write whatever they want. I don't have an example of a GFF file on the slides; let me just search my hard drive for a GFF file. I probably should have one, yeah. Those are really, really big. So this is one of them, and this one is gzipped. Let me open this one.
Yeah, so this is more or less how it looks. Let me show you my Notepad++ window. Right, so here we see that we have a GFF version 3 file, and there's some annotation at the top. Then here in the first column we see the sequence name. This is the source: this is a gene, predicted by Gnomon. This is a gene; it ranges from here to there. What is the next one? The next one is the feature type. All right: start, stop, score. So it has no score. It is on the positive strand, and it doesn't have a phase assigned, right? And then here is the rest of the information: it has the ID of the gene, and then this just continues on. And you can see that this last part is not a tab-separated entry; these attributes are separated by semicolons. So you can pull the ninth column out and then split it on the semicolons, and then of course you can get things like the biotype and see if it's a pseudogene, these kinds of things. And this allows you to store, in a very compact format, a gene, all of the different exons that belong to the gene, and of course also things like coding sequences, messenger RNA or tRNA, right? And you can see that these are from different sources. So if you start in bioinformatics and you were going to work on mouse, the first thing you would do is download the FASTA sequence of the genome, and the next step would be to download the GFF file, because that contains the annotation of that sequence. Then you know where a certain gene starts and where it ends. The only drawback that I have found with this type of file format, and this is relatively obvious when you look at the description of the format, is that it doesn't store the chromosome. I've run into that a couple of times, and I don't know why, right?
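Pulling the ninth column apart is exactly a two-step split, first on tabs, then on semicolons. A sketch in Python; the GFF3 line below is made up to resemble the mouse example, so the coordinates and attribute values are illustrative only.

```python
# Parsing one GFF3 line into its nine columns and splitting the
# attribute column on semicolons.
line = ("1\tGnomon\tgene\t3054233\t3054733\t.\t+\t.\t"
        "ID=gene0;biotype=protein_coding;Name=Xkr4")

(seqid, source, feature, start, end,
 score, strand, phase, attributes) = line.split("\t")

# The ninth column is a semicolon-separated list of key=value pairs:
attrs = dict(field.split("=", 1) for field in attributes.split(";"))

print(feature, start, end, strand)   # gene 3054233 3054733 +
print(attrs["biotype"])              # protein_coding
```

Note the unpacking into exactly nine names: a GFF line with more or fewer tab-separated fields would raise an error here, which is a useful sanity check when parsing real files.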
Because it has the start position, the end position, the strand, but there is no dedicated column for the chromosome, which I always found a little bit weird. So that's just something for you to remember, and it might be a question on the exam: which column is missing from the GFF format? Well, I would really want to see a chromosome column here, because of course sequences are not just floating around; you need to know on which chromosome you are to know exactly where something is located. All right, so that's the GFF format. The next one is the VCF file. So VCF files are variant call format files, and it's again a text format, and it stores gene sequence variation. The variant call format only lists the variations, however, so we need to know which reference genome was used, right? If I take the mouse reference genome, this reference genome changes every two to five years, and then there's a new genome build. The VCF file usually has a link in the header to the file that was used to do the calling, but there's no hard-coded link. So you can make a VCF file, and then when the genome build changes, you have to remember to remake your file or update the positions based on the differences between the old genome build and the new one. The current standard of VCF is 4.3, so there have been many iterations, and all of these iterations are slightly different from each other. Mostly, though, VCF files are files which start with a lot of header lines. A header line in VCF starts with a hash character. The header provides metadata describing the body of the file, so it's kind of a self-describing file, right? The header describes what you can expect next. So it's not like a GFF file, where each of the columns has a fixed feature type or a fixed set of possibilities, right?
But here the VCF format has a description in the header saying that, well, in the sample columns there will be the genotype, the probabilities and the depth, or there will be only the genotype, or only the depth. So the header has the metadata describing what the rest of the file looks like, and there are special keywords in the header, denoted with a double hash, which contain this metadata description. So double hash is metadata; a single hash is generally just the header line, the single line which holds, for example, the names of the individuals. The body of the VCF file follows the header and is tab-separated into eight fixed columns, very similar to the GFF format, but an unlimited number of further columns give optional information about the samples, and the way these further columns are coded is described in the FORMAT column, the ninth column. So the ninth column describes what the 10th column looks like, and the 11th column as well. In the VCF format, the first column is the chromosome, so the name of the sequence on which the variant was called. Then you have the position, again the 1-based position of the variant; the ID, which is the identifier of the variant, for example an rsID, or a dot when it's not known; then you have the reference allele, so the allele (or base) as it is in the reference genome, the genome which was used to call the variant; then you have the ALT column, which describes the variant, so the mutation at this position; then the sixth column holds the quality score; the seventh column is the filter, a column where there can be a flag saying pass or fail, based on some kind of filtering that you did; and then you have the next one, which is the INFO column, an extensible list of key-value pairs describing the variation. It's very much like the last column in the GFF format, because it's a field which can have many, many entries, all of them separated by semicolons. And then the ninth column is the format description, which describes what the following columns will look like. I don't have an example here, so let me search my hard drive for *.vcf, and I will show you a VCF file as well; let's open up one which is relatively small. All right, so let me show you my notepad. Here we see how this VCF is built up. The first line tells you which format was used, so this is the 4.2 format. There was a filter applied, right, so the filter can take a value called LowQual, which means that the quality of this SNP was too low to be reliable. Then we have a whole bunch of FORMAT definitions: you can specify things like AD, which is the allelic depth; DP, which is the approximate read depth; GQ, which is the genotype quality; GT, which is the genotype call; and then PL, which holds the Phred-scaled genotype likelihoods. And then you see that there's a whole bunch of additional fields that can appear in the INFO field, because all of these things are defined in the header: the header defines what is allowed in the data itself. We then have a whole bunch of these contig lines describing all of the different chromosomes, right? So in the contig line where the ID is 1, this is chromosome one, which has a certain length, so you can't have a position bigger than this length. I know that this is a file based on mice, so we see all of the chromosomes which are available in mouse: 1 up to 19, plus the X, the Y and the mitochondria, and then you see all of the other ones, and those are sequences which have still not been incorporated into the mouse genome.
Last time we talked about scaffolding of genomes, or no, not last time, the time before that, and we talked about what happens when you do a genome assembly: you go from having reads, these reads are then put together into contigs, and then these contigs are scaffolded together into a genome. But of course there are some contigs which you were unable to place. These are part of the mouse genome, but we just don't know yet where they belong, whether they are located on chromosome one or on chromosome X. And of course they are in the file, because there can be variants found in these contigs. So you see that there's a long list of contigs, and generally they're relatively small, like 40,000 base pairs, but the biggest contig in mouse is almost a million base pairs. So that's a million base pairs of sequence where we do not know where it belongs in the genome yet. In a new genome build, it could be that they figure out that this piece is part of chromosome two; then of course it will disappear from this list, and chromosome two will all of a sudden be a million base pairs longer. So that's why there's more than just the 19 chromosomes, X, Y and MT in there: we are dealing with a genome build which is not optimal, right? It's not finished. There are still sequences that we are unable to place in the genome, just because we don't know where they belong yet. Then here you see that there's a reference line, so this is the FASTA file which holds the genomic sequence on which these variants have been called. And then you have the next line, which gives you the column headers: chromosome, position, ID, reference, alt, quality, filter, info, format, and after that come the names of the different samples.
So you can again see here that we are looking at mice. Here's the standard AKR mouse, which is a certain type of inbred mouse, and then of course we see one of our BFMI substrains, substrain number 12, and then we have a BFMI861-S1, which is a different type of Berlin Fat Mouse. And then here you can see what it looks like: we see that in the reference there's GAA, however some mice have just a G here, so they have a two-base-pair deletion: the two As in the reference genome are deleted in some of these mice. This deletion was called with a quality score of 488, and of course what counts as a good quality score depends on your own threshold. The filter was a dot, so this one was not filtered out; the quality here was high enough to stay in the data set. And then we have the INFO column, which tells us something about this variant. For example, we see here again DP, and DP is the depth of sequencing, so at this position we had 207 reads covering this variant across all of the different samples. And then when we go all the way to the back, we see the genotypes. We see that, for example, the genotype of the AKR mouse, because you see GT, was 0/0, meaning that it had reference/reference. The next BFMI mouse also had reference/reference, and here we see that at this position the BFMI mouse has a 0/1, meaning that one of the alleles was GAA and on the other DNA strand there was only a G. And it can be 1/1 as well: 1/1 means alternative/alternative, so in this case G/G. Right, is that clear? VCF files are one of the most common file formats you will run into as soon as you start doing things like genotyping, so I'm hoping that it's understandable; otherwise just read through it again and look at a couple of VCF files.
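The column layout just described can be sketched as a tiny parser. This is Python, and the VCF body line is invented to mimic the mouse example (a GAA-to-G deletion with three samples); the key idea is that the FORMAT column tells you how to decode every sample column that follows.

```python
# Parsing one VCF body line, using the FORMAT column (9th) to decode
# the per-sample columns that follow.
line = ("1\t3206310\t.\tGAA\tG\t488\t.\tDP=207\t"
        "GT:DP\t0/0:30\t0/0:25\t0/1:41")

fields = line.split("\t")
chrom, pos, vid, ref, alt, qual, filt, info = fields[:8]
fmt_keys = fields[8].split(":")          # e.g. ['GT', 'DP']

samples = []
for sample in fields[9:]:
    samples.append(dict(zip(fmt_keys, sample.split(":"))))

print(ref, "->", alt)          # GAA -> G  (a two-base-pair deletion)
print(samples[2]["GT"])        # 0/1: heterozygous reference/alternative
```

Because fmt_keys can differ from row to row, you really do have to re-read the FORMAT column for every line rather than assuming a fixed layout.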
All right, so an unlimited number of columns can be given, and the format of these is described in the ninth column. So when you write a parser for the VCF format, it's not as simple as just doing a read.table or a string split and being done with it, because every row in the table can have a different combination of things: the first row might mention the genotype and the depth, the second row might only mention the genotype and not the depth. But that is described in the FORMAT field. All right, then some of the other files which I encounter very frequently are PED and MAP files, and these are related to when we do QTL mapping. The PED format contains family-based pedigree information together with the genotype data, popularized by PLINK; PLINK is a tool which allows you to do QTL mapping. PED files come with a corresponding MAP file. The PED file contains the information on the individuals and the genotype calls made on each individual, very similar to a VCF format; however, the PED file does not have any column storing the location of the variants, so the locations are stored in the MAP file. The PED file has six fields at the beginning. The first field is the family ID: to which family does an individual belong? Then there's a within-family ID, the so-called IID, and this cannot be zero; this is just the individual ID. So in our mouse house, right, we have mice which are born into a certain family, and within this family you can be individual one or individual two. Then you have the within-family ID of the father and the within-family ID of the mother, and this is very important when you do QTL mapping, because QTL mapping is really dependent on the pedigree that exists: if individuals come from the same father or the same mother, they share more of their genome, and you have to compensate for that when you do an association test, because an association test could be significant just based on the fact that everyone had the same father or a lot of mice shared the same mother. So that's why you have to write down the father and the mother. Then the fifth column is the sex code: one for male, two for female, zero for unknown. In mice, of course, you know if a mouse is a male or a female, and this of course matters for things like body weight, which is very different between males and females. And the sixth column is the phenotype value. PED files allow you to store a single phenotype and not more, and the original coding was one for a control sample and two for a case, because PLINK used to be a tool for doing case-control studies, and zero here means missing. So you can code the phenotype as being unaffected, affected or unknown: one is control, two is case, and zero is missing. After this the genotypes are encoded, and there are two columns used per variant. So it's not like the VCF file where it's 0/1; in this case a heterozygous animal would be encoded by having a G in the first column and then a T in the second column, a homozygous reference mouse would be coded G in the first column and G in the second column, and a homozygous alternative-allele mouse would be coded T in the first column and T in the second column, if we assume that the variant is a G/T SNP. So how does this look? Well, this is one of the PED/MAP files, of which I made a little screenshot. The first column is the family ID: family ID one, family ID two, three, four, five, six, so all of these samples come from different families.
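The six fixed fields plus the two-columns-per-variant encoding can be decoded with a small sketch. The PED line below is made up (a female mouse with a numeric phenotype and three variants: one heterozygous, one homozygous, one missing).

```python
# Decoding one PED line into the six fixed fields plus genotypes,
# two allele columns per variant.
line = "1 mouse1 0 0 2 1.33 A G A A 0 0"

fields = line.split()
fid, iid, father, mother, sex, phenotype = fields[:6]

# Remaining columns come in pairs: one pair of alleles per variant.
alleles = fields[6:]
genotypes = [tuple(alleles[i:i + 2]) for i in range(0, len(alleles), 2)]

print(sex)         # '2' -> female
print(genotypes)   # [('A', 'G'), ('A', 'A'), ('0', '0')] - het, hom, missing
```

Notice that nothing here says where those variants are on the genome; that is exactly what the accompanying MAP file provides.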
Then you have the sample name, so this is sample one. The father ID is zero, so the father is not in the data set, and the next one is the mother ID, so the mother is not in the data set either. Then we have the next column, which, let me look back, is the sex code, so here the sex of this individual is unknown. The next one is the phenotype, and here you can see that this is a newer-format PED file, because newer versions also allow you to just have a numerical value there. Normally you would have a case-control coding, one being control, two being case, but nowadays you can also store a phenotypic value, like the body weight: this individual might have weighed one gram, the second individual 1.33 grams, the third individual 1.8 grams. And then of course we see the genotypes. The first one is A G, so this is a heterozygous animal; the second individual here is A A, so this is homozygous; the third individual is also homozygous for one of the alleles; and the next individual is 0 0, meaning that for this individual we did not get a genotyping result at this position. PED files, like I said, come with a MAP file, because you can see here that there is no description of where this A/G SNP is located in the genome. So there is another file which has the same ordering but contains four columns: the first column is the chromosome, then the SNP identifier, then the genetic distance in centimorgans, and then the position in base pairs. You have to supply a MAP file together with a PED file to something like Plink to do the mapping. All right, so a couple of words about software and source code. Source code is a collection of computer instructions. It is written using a human-readable programming language, it is normally stored as an ordinary text file, and it possibly contains comments, and I would always say that there should be comments. 
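To make the PED/MAP pairing described above concrete, here is a minimal sketch in Python (rather than R) that decodes one PED line and its MAP companion; the sample names, IDs, and positions are made up for illustration:

```python
# Decode one hypothetical PED line: six fixed fields, then two allele
# columns per variant. "0 0" in the allele columns means a missing call.
ped_line = "1 sample1 0 0 0 1.0 A G A A 0 0"
fields = ped_line.split()
fid, iid, father, mother, sex, phenotype = fields[:6]
alleles = fields[6:]
genotypes = [(alleles[i], alleles[i + 1]) for i in range(0, len(alleles), 2)]

def call_type(a, b):
    """Classify a two-allele genotype call."""
    if a == "0" or b == "0":
        return "missing"
    return "homozygous" if a == b else "heterozygous"

print([call_type(a, b) for a, b in genotypes])
# → ['heterozygous', 'homozygous', 'missing']

# The matching (hypothetical) MAP line supplies the location of each variant:
# chromosome, SNP identifier, genetic distance (cM), base-pair position.
map_line = "1 rs0001 0 1200"
chrom, snp_id, cm, bp = map_line.split()
```

Note that the two files only line up by ordering: the first allele pair in the PED line belongs to the first MAP row, the second pair to the second row, and so on, which is why Plink requires both files together.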
And so source code is something that you write, but the computer cannot execute your source code directly. So there is a program which takes you from source code to machine code. Source code can be compiled, which means that it is translated into machine code and then executed, or it can be interpreted. What R does is interpretation, because it does not create an executable, right? It doesn't create a .exe file; your code only lives within the R interpreter. So source code can also be parsed and executed directly, and this is called an interpreted language: the source is translated into an intermediate representation, and this intermediate representation is then executed. This is what happens when you write R. If you write something like C or C++, then the source code is taken by a program called a compiler. The compiler creates a file which contains machine code, and this machine code can be executed by your computer directly, without the help of an external program. An interpreted language always needs an interpreter to be present and active, like R: when I open up an R window, I'm talking to the R interpreter, which takes my source code, builds some kind of internal representation, and then executes that representation. If I take a bunch of C code, the C code is compiled, creating a machine-code file which is more or less specific to my CPU, and I can execute that directly; I don't need an additional program. A compiler and an interpreter are structured in very much the same way, whether the language is C, C++, Ada, or Java. First you have a front end. The front end is the first step: it reads the source code and verifies that the syntax and the semantics are correct, that is, that they conform to the specification of the source language. 
Then you have the middle end, the middle part of the program, which takes the verified source code from the front end and performs optimizations on it. Generally what it does is transform your code into an abstract syntax tree, which is more or less a tree representation of the code, and then it starts modifying this tree based on certain rules, without changing the meaning of the code. For example, you might write x = 1 + 2 + 3, and the middle end will see that 1 + 2 + 3 is always the same value, so it can replace that part of the source code with x = 6. Instead of performing the same operations every time, adding one, adding two, and then adding three, it replaces them with a single constant. So it does a kind of analysis on your code to compute constants and remove unnecessary steps. And then the back end, the third part of the program, be it a compiler or an interpreter, is the part which generates your target-dependent assembly code or machine code. Of course, the back end of an interpreter is different from that of a compiler: a compiler generates machine code, while an interpreter will just walk the abstract syntax tree that it gets from the middle end. So the middle end hands its result to the back end, and the back end will more or less execute it directly if it is an interpreter, or generate machine code for later execution if it is a compiler. This is a very schematic, high-level view, and there are many different ways of doing this, because nowadays you have just-in-time compilers, which are interpreters that, when a piece of code is used a lot, generate machine code for it instead of interpreting it, to make your code faster. 
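The x = 1 + 2 + 3 example above is classic constant folding. As an illustration (in Python rather than R, simply because Python exposes its abstract syntax tree in the standard library), here is a minimal sketch of a middle-end-style folding pass; it only handles addition of literal constants:

```python
import ast

class ConstantFolder(ast.NodeTransformer):
    """Fold additions of literal constants, bottom-up through the tree."""
    def visit_BinOp(self, node):
        self.generic_visit(node)  # fold the children first
        if (isinstance(node.op, ast.Add)
                and isinstance(node.left, ast.Constant)
                and isinstance(node.right, ast.Constant)):
            # Replace the whole subtree with a single constant node.
            return ast.copy_location(
                ast.Constant(node.left.value + node.right.value), node)
        return node

tree = ast.parse("x = 1 + 2 + 3")
folded = ConstantFolder().visit(tree)
print(ast.unparse(folded))  # → x = 6
```

CPython's own compiler performs this same fold when it turns the syntax tree into bytecode, so the interpreter never actually adds 1, 2, and 3 at run time.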
And there are all kinds of mixed forms between a compiler and an interpreter nowadays. But strictly speaking, R is an interpreted language, meaning that it does not create .exe files or machine code; it creates something which is directly interpreted by the R interpreter, while a C++ compiler will take your source code, build an abstract syntax tree, do all kinds of optimizations, and generate an executable file which holds machine code. All right, so how long have I been talking? 45 minutes. So I can go another five minutes or so before we take a break. When we talk about software or code development, there are different stages at which you develop your code. This is not that important in academia, but when you are working for a company, there are certain stages of development that you go through. The first stage a piece of code is in is the alpha stage. You write some code and this is your initial draft: it lacks certain features, there are still a lot of bugs in there, but at least it does the core of what you want it to do, which can be doing an analysis, for example. An alpha version is generally not given to the public; it is made internally to see if the feature it implements is implemented correctly. There are a couple of tests written, and there might be one or two testers who look at the code to see if it does what you want. The next step in the software life cycle is taking this alpha code, polishing it, adding most of the features that you want it to have, and making it suitable for user testing. This is called a beta version. A beta version of software has most of the features that the eventual software will have, and it is suitable for user testing. 
And often you have closed betas, meaning the beta testing is done within the company itself, or you have an open beta, where members of the public are given the program or a game and can test it to see if there are any bugs in there. If everything goes well, the beta eventually becomes a release candidate. The release candidate is the beta version with more or less all known bugs removed. The release candidate is usually given to the public, and the public uses this version to find any remaining bugs. If bugs are found, the release candidate is updated, and as soon as more or less all of the bugs are fixed, you go to an RTM version. RTM stands for release to manufacturing, and this used to be the version you would get on a floppy disk or a CD. Nowadays a lot of software is distributed via the internet, so there's no real difference between giving people a release version and giving them a beta version, but in the old days software would come on diskettes or CDs. The RTM version was the version which would be pressed onto a CD, and this CD would be shipped to the customer. And then there is still the release, stable, or final version, which is the release-to-manufacturing version where the final bugs have been fixed. Even though the software had already been pressed onto a CD, there might still be bugs in there. So in the old days, what would happen is that you would get a CD with a game, and at a certain point people would figure out, oh, there are still bugs in the game. What the developers would then do is provide you with a little download, a patch, to fix certain errors that were still left in the software. 
And this is the kind of classical software development cycle: you write initial software, making an alpha version; you go to a beta version; you have a release candidate; then you get a release to manufacturing, which is more or less the stage at which you cannot change the software anymore, because you're sending it out to the public on CD. And then the latest version, the stable version, would be the release-to-manufacturing version plus a very small download which would patch some critical bugs, for example that on a certain video card your game wouldn't start, and with that patch the RTM version would become a release or stable version. All right, so that's more or less what I wanted to say about software development. The point is that you get an idea that in an academic setting you're mostly writing software for yourself or for a very small group of people, but there is a lot of software out there, things like Plink or R/qtl or other software packages, which is used by thousands of people and is actively used in research. So when you are looking at software, you always have to think about which version of the software you have: do I have an alpha version or do I have a release candidate? Because there is a lot of difference in the quality of the software tools that you're getting and using. Nowadays this cycle is not that clear anymore because of the internet, and anyone can download updates and patches at any point. You even have pre-ordering now, so you can pre-order software or games that have not been released yet, or not even developed yet; there might be a little alpha version, but that's it. But in the old days it used to be a very structured process where things would be physically pressed onto a CD or a disc, and at that point in time the developer could not change anything about it anymore. 
All right, so of course when you code stuff, you have to write tests, because software testing is an investigation conducted to provide stakeholders with information about the quality of the product or service under test. If you ever end up writing code for something critical, like an MRI machine, then there will be a lot of tests involved. It cannot be that there's a patient in your MRI machine and the machine goes completely haywire, ups the magnetic field by a factor of ten, and the scanning software goes completely crazy: you're in there and all of a sudden the entire machine starts buzzing and whirring, and the screen the operator is viewing goes completely blank, or you get a blue screen of death. To prevent these things from going wrong in the real world, software is tested, and depending on how critical the software is, there are either a lot of tests or almost none. A lot of academic software is written without many tests, and this is one of the ways in which you can judge the quality of software. If you are ever asked to review a software paper, where the paper provides software for users to use and it comes from an academic field, just counting the number of tests that are in the software will already give you an idea of whether the software is good or whether it's just complete crap. And believe me, 90% of academic software comes without any tests whatsoever. That is software which will only work on the machine of the developer and not on anyone else's computer, just because there are no tests. So software testing involves the execution of software or system components to evaluate one or more properties of interest, and there are three ways of testing software, three different strategies: the first strategy is integration testing, then you have unit testing, and then you have regression testing. 
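As a small preview of what unit testing looks like, here is a sketch in Python (the same idea carries over to R packages, which we will build later); the helper function and its expected values are made up for illustration, and the test pins down both the normal case and an edge case:

```python
import unittest

def het_rate(genotypes):
    """Fraction of non-missing calls that are heterozygous (toy example)."""
    calls = [(a, b) for a, b in genotypes if a != "0" and b != "0"]
    if not calls:
        return 0.0
    return sum(1 for a, b in calls if a != b) / len(calls)

class HetRateTest(unittest.TestCase):
    def test_mixed_calls(self):
        # One heterozygous and one homozygous call -> rate of 0.5.
        self.assertAlmostEqual(het_rate([("A", "G"), ("A", "A")]), 0.5)

    def test_missing_only(self):
        # All-missing input must not crash or divide by zero.
        self.assertEqual(het_rate([("0", "0")]), 0.0)

# exit=False lets the script continue after the tests have run.
unittest.main(argv=["het_rate_tests"], exit=False)
```

A test like this is exactly what catches the "works only on the developer's machine" problem: if a change to het_rate breaks one of these expectations, the test fails on every machine it runs on, not just yours.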
So I've now been talking for 55 minutes again, and I think this is a good point to take a break, and then we will talk through these three different types of testing. The last part of the lecture will be me teaching you how to make an R package and add tests to it, so that you can provide users with software which is stable, and where you can have confidence that when you change lines of code, the software will still continue to function, even on different machines or different operating systems. All right, so I will stop here.