Welcome back, everyone. If you're watching this on YouTube or on Twitch, thank you for still being here. Today we will be talking about standards for analysis, which I think is a very important topic. First I want to talk about different file formats that are very common in bioinformatics. We're not going to cover all of them, just the main ones that I encounter on a day-to-day basis. After that, I want to talk about bug prevention by testing code, because testing is really important, and it ties in nicely with the second hour, which is about making your own R package. One of the advantages of R packages is that they come with built-in testing.

So with that out of the way, let's start with some file formats. The first thing people should know is that files come in two major types: text files and binary files. If you browse around your hard drive a little, you'll find that a lot of the files on there are binary files. A text file is just a plain file that holds textual data: letters, numbers, and these kinds of things. There are different encodings for that data. The most common one is the ASCII character set, which covers, more or less, the letters of the Latin alphabet. But of course there are many other encodings, because not everyone uses Latin letters; if you're from China, you have a completely different writing system. To support these things, there are standards like Unicode and the UTF-8 encoding. So be aware that a text file is not just a text file: there are many, many kinds of text files because of the different encodings. What they all have in common is that they hold textual data, and generally there is an end-of-line (EOL) character that separates the different lines: you have a line, then a line break, and then the next line. Historically there was also an end-of-file (EOF) character marking where the file stops.

Binary files are completely different. Binary files are things like the executable files on your hard drive, for example an EXE or a DLL file on Windows. Binary files generally contain computer instructions, so they are not human readable, and they contain a header. This header contains a kind of magic number that identifies the file format. So just be aware: all of the file types used in bioinformatics that we will talk about today are text files, because text files are easy for bioinformaticians. We are not computer scientists; computer scientists like to work more with binary files and binary formats. But a lot of the formats in bioinformatics are actually just plain text files, or plain text files run through an encoder, like zipping them.

The first file format, which we've seen many, many times, is the TSV or CSV file, which stands for tab-separated or comma-separated values. It is a simple text format for storing data that has a tabular structure, so a single table. Each line is a row, which is one record, and within a line each field value, so each different column, is separated by a delimiter: a tab (then you call it a tab-separated file), a comma, or a semicolon. The main issue with these types of file format is something called delimiter collision. Imagine that I have a comma-separated file. Then, within a single cell, so within a single row at a certain column, I cannot simply store a value that itself contains a comma. That is just a big issue with these formats: delimiter collision is a real problem, and it will mess up the parsing of the data from that point on. Generally the delimiter is chosen in such a way as to prevent the issue, and many writers additionally protect against it by quoting fields, though not every tool handles that consistently. So imagine that you have your data in R or some other programming language and you write it out to disk as a tab-separated file: the data you write should then not contain raw tabs. Or if you're writing a comma-separated file, a raw comma should not occur in your data.
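To make that concrete, here is a minimal sketch in base R, with made-up example data, showing how write.csv avoids delimiter collision by quoting fields that contain the delimiter, and how read.csv understands those quotes on the way back in:

df <- data.frame(
  gene        = c("Tp53", "Brca1"),
  description = c("tumor protein p53", "breast cancer 1, early onset")  # note the comma
)
write.csv(df, "genes.csv", row.names = FALSE)
readLines("genes.csv")   # "Brca1","breast cancer 1, early onset" -- the field is
                         # quoted, so the inner comma does not split it
read.csv("genes.csv")    # parses back into two columns, not three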
These are very common files, and we get them all the time. For example, our animal house sends me Excel files or plain comma-separated files, which we then use.

The next file format, which we also already saw a couple of times, is of course the FASTA format. FASTA is a very simple format for storing nucleotide and peptide sequences. There are two special symbols that can occur at the first position of a line. If you see a semicolon, it is a comment line: everything behind it is more or less ignored, so you can add a note saying, for example, that there is something special about this sequence. And as we have already seen multiple times, a sequence entry starts with the greater-than symbol (>), followed by the name or description of the sequence. Then comes the sequence itself, and then another description line, so you know that the second sequence begins. So after a header line and any comments, one or more lines follow containing the sequence. Each line of a sequence should have fewer than 80 characters, and generally we go for 60 characters per line. The sequence itself is encoded, again, using the IUPAC amino acid and nucleotide codes, because we need to have a standard. So we can write A, T, C, G; a gap is coded with a little dash; any nucleotide, so it doesn't matter which, is coded with an N. We also have W for weak, so an A or a T, and S for a strong base pair, which is a C or a G. Weak and strong refer to the number of hydrogen bonds within the base pair.

Of course, there are also conventions for the sequence identifier. The header line in FASTA is more or less free to choose when you make your own FASTA files, but depending on which provider gave you a FASTA file, you get very standard unique identifiers for the sequence. If I download a sequence from GenBank, the header will start with the greater-than symbol, then say gb, then give the accession number, which is the GenBank identifier for this entry, and then the locus, so the location on the genome where it is located; that might be chromosome 3 from some position to another position. All the different databases we have provide different data, and they use different accession numbers. Some don't use accession numbers at all, but just have entries or names. Based on this, you can see from which database a sequence came, and sometimes you can also see whether it is the reference sequence or a different one. So with FASTA: if you make the file yourself, you are free to choose the description, but when you download it from a database, they of course use their own standardized way of describing it.
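Because the format is this simple, you can parse it with nothing but base R. Here is a minimal sketch, assuming a small hypothetical file (for real work, a package like Biostrings is the usual choice):

# A minimal FASTA reader in base R; file name and sequences are made up.
read_fasta <- function(path) {
  lines <- readLines(path)
  lines <- lines[!startsWith(lines, ";")]   # drop comment lines
  headers <- grep("^>", lines)              # header lines start with '>'
  starts  <- headers + 1
  ends    <- c(headers[-1] - 1, length(lines))
  seqs <- mapply(function(s, e) paste(lines[s:e], collapse = ""), starts, ends)
  names(seqs) <- sub("^>", "", lines[headers])
  seqs
}
# Example:
# writeLines(c(">seq1 demo", "ATCGATCG", "GATTACA"), "demo.fasta")
# read_fasta("demo.fasta")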
Besides the FASTA format, which is used when we transfer sequence data from A to B, we also have the FASTQ format. FASTQ is again a text format, very similar to FASTA, but instead of having a description line followed by a whole bunch of sequence lines, FASTQ is the common format that comes out of DNA sequencers. A DNA sequencer takes pictures, and from these pictures it calls the bases and the quality of those calls, and this is then stored in this text format. The nice thing about FASTQ is that each entry, so more or less each spot on the sequencer, has exactly four lines. The first line starts with an at sign (@), not a greater-than symbol; FASTQ uses the @ symbol. Then comes the sequence identifier, say sequence one, and then an optional description; generally the sequencer puts in the x and y positions of where on the flow cell this was measured. On the second line you have the raw sequence letters, so the A, C, T, and G calls. The third line starts with a plus character, optionally followed by the same sequence identifier, but generally the rest is empty. So you have the identifier line, then the sequence, then a line that starts with a plus and is otherwise empty (or repeats the first line). And then in line four we have the quality values: the encoding of how certain the sequencer was in determining each base.

All right, a question from the chat: can you convert formats, or are you stuck with the format the file is made in? No, no, you're not stuck. For example, if I have a FASTQ file, I can of course convert it to a FASTA file. What I would do is read the four lines of each record, change the @ symbol to a greater-than symbol, then take the raw sequence letters, make sure they are at most 80 per line, and put them below. Lines three and four I would simply not be able to keep, because FASTA, unlike FASTQ, has no way to store quality data; FASTQ is essentially FASTA plus the option to add quality data. And in theory, you could also take a FASTA file and convert it to a CSV file. You can do such conversions manually, but there are a lot of tools out there; I would bet that there is a tool in Python or some other language that you can run to convert a FASTQ file into a FASTA file.
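That recipe is simple enough to sketch in a few lines of R. The function below is a hypothetical illustration of exactly those steps, not a production converter:

# A minimal FASTQ-to-FASTA conversion sketch in base R; file names hypothetical.
fastq_to_fasta <- function(fastq, fasta, width = 60) {
  lines <- readLines(fastq)
  out <- file(fasta, "w")
  for (i in seq(1, length(lines), by = 4)) {    # four lines per FASTQ record
    header <- sub("^@", ">", lines[i])          # the '@' becomes a '>'
    bases  <- lines[i + 1]                      # line two: the raw sequence
    # line three ('+') and line four (qualities) are simply dropped,
    # because FASTA has no place to store quality scores
    first  <- seq(1, nchar(bases), by = width)  # wrap the sequence at 60 chars
    chunks <- substring(bases, first, pmin(first + width - 1, nchar(bases)))
    writeLines(c(header, chunks), out)
  }
  close(out)
}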
The fourth line is a little bit interesting, because for each base in line two there is a quality value, and here we run into a problem: plain numbers are not a good way to store quality values. A quality of 40 is hard to encode that way, because 40 takes two characters while the base it belongs to takes only one.

The way people solved this is with Phred scores. A Phred score is a quality score for a read. Here you see the different quality scores: a quality of 10 means there is a 1 in 10 chance of the base being wrong, so a 10% error rate; a quality of 20 is a 1% error rate, 30 is a 0.1% error rate, and so on. The best quality score you can more or less get from a sequencer is 60, which means it is almost, but not quite, 100% sure that the base is correct: there is a one in a million chance that it is not the base that was read. So the machine gives you an error probability. The machine says: this base that I read, I am 99.99% confident about, and that corresponds to a score of 40. How do we get there? We take minus 10 times the log base 10 of the error probability P, and we call the result Q, the Phred quality score. That is why the top score is 60: an error probability of one in a million is 1 times 10 to the minus 6, the log base 10 of that is minus 6, and multiplying by minus 10 gives 60.

Then we encode the score into ASCII, because we want exactly one character per quality score, just like in the sequence. With the common offset of 33, a base with a Phred quality of 10 gets the character at ASCII 33 plus 10, which is ASCII 43, a plus sign. In that sense we have a one-to-one correspondence: every letter in the second line has a single character in the fourth line denoting its quality. So a quality character like P here is a very good quality, while a 1 is a fairly bad quality, around Phred 16, so with a real chance of being wrong, while at the other end of the spectrum we are really certain, because those characters correspond to quality scores around 60. This is how FASTQ encodes quality, because of the rule that every base in the sequence gets a quality score, and that quality score may only take a single character, to stay compact. There are actually two different Phred encodings in use, with offsets 33 and 64, but I won't bother you with that; just remember that a Phred score is a quality score, and the advantage is that instead of writing down a number, you write down a single ASCII character. The printable ASCII characters run from 33 to 126, which gives you 94 levels, so you can encode everything from quality score 0, which is an exclamation mark, to quality score 93, which is the tilde.
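Here is that arithmetic as a minimal R sketch, assuming the common Phred+33 offset:

p <- c(0.1, 0.01, 0.001, 1e-4, 1e-6)  # error probabilities from the machine
q <- -10 * log10(p)                   # Phred score: Q = -10 * log10(P)
q                                     # 10 20 30 40 60
# Encode with Phred+33: quality 0 = '!' (ASCII 33), quality 93 = '~' (ASCII 126),
# one character per base:
rawToChar(as.raw(q + 33))             # "+5?I]"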
Good. A different file format that is used a lot is the GFF format, the general feature format. The general feature format is a file format used to describe genes and other features of DNA, RNA, and protein sequences. Every line holds one feature. For example, the first line might say: there is a feature on the DNA that I looked at, and it is an exon. The next line might say: there is a feature that I looked at, and it is a promoter region, or there is a feature, and it is an intron. Anything you can annotate on the DNA, be it an intron, an exon, a microRNA, a long non-coding RNA, you can store in this GFF format.

There are two versions in use, GFF2 and GFF3, and both are very similar; these two file types are actually just a special case of tab-separated files, because they have nine columns. So how do these nine columns look? The main thing that differs between GFF2 and GFF3 is the format of the ninth column; all the other columns are shared. The first column is the sequence: the name of the sequence where the feature is located. That might be gb for GenBank, then the accession number, then the chromosome and positions. The second column contains the source, which is a keyword identifying the source of the feature: did I use a program to annotate it, or was it annotated by a certain organization? It can say, for example, BLAST, if I used BLAST to identify the feature, but it can also say GenBank, if GenBank is the organization that annotated it. In position three we have the feature name: gene, exon, microRNA, long non-coding RNA, intron, deletion, whatever you want. There is a structured vocabulary here to choose from, and this gives you the name of the feature you are annotating; it is also a somewhat free field, because if you come up with some new type of feature, you can add it and give it your own name. Then you have the start and the end position, which is of course very logical; you just have to remember that these use a one-based offset.

This has to do with the fact that computer scientists and people who do normal mathematics count differently. A computer scientist counts from zero: zero, one, two, three, all the way up to nine. A mathematician counts one, two, three, all the way up to 10. These are two very different ways of counting, and it is behind the old joke that there are two hard things in programming, and then the punchline list includes an off-by-one error. You can see the difference in programming languages too: in R, the first element of a vector is v[1], but in a computer-science language like C or C++, the first element of an array is v[0]. That is a very big difference, and in many cases the formats in bioinformatics follow statistics and mathematics rather than programming, so there is always a one-based offset: the first base pair on a chromosome is called one, while a computer scientist would say, no, I start counting from zero: I see an A, that's zero; then I see a T, that's one. But we as bioinformaticians count from one, just like people in statistics and mathematics.
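A tiny R sketch of the same point; the BED format in the comment is my own example of a 0-based convention, not something from the slides:

v <- c("A", "T", "C", "G")
v[1]   # "A" -- R, like GFF, starts counting at 1; C would call this index 0
# Converting a 0-based start coordinate (as some formats, like BED, use)
# to a 1-based GFF-style coordinate:
zero_based <- 99
one_based  <- zero_based + 1   # base 100 in GFF coordinates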
Then there is a score in the sixth column, which is the confidence you have in the feature. You could encode these with Phred scores if you want, but most commonly people put a p-value-like score here. For example, if I used BLAST to locate a primer in the genome, I would put the E-value that BLAST gave me in this column. Position number seven is the strand: was the feature on the positive strand of the DNA, five prime to three prime, or on the negative strand, three prime to five prime? You can also use a dot when you did not determine the strand. Furthermore, in the eighth column, we have the frame, or the phase. This has to do with annotating things like coding sequences: a codon is three base pairs, and a feature can be in line with the codon, so in frame, meaning the feature starts at the first base of a codon, but it can also be out of frame, starting at the second or third base. That is what is encoded in the eighth column: the frame or phase of a coding sequence feature. Here you see a zero, a one, or a two; for some reason, here we do count from zero again. Zero means it is in frame, one means it is shifted by one, so the feature starts at the middle base of the codon, and two means the feature starts at the third base of the codon. And then column nine is the attributes, which is more or less a free field where you can write down anything you want about the feature you just annotated.

Of course, GFF files are used a lot in practice, so let me show you one, just so you have an idea of how it looks and what kinds of things are encoded. Let me search my hard drive for GFF... yeah, here we go. Ooh, that one is 500 MB, I shouldn't open that one. All right, here in the Notepad++ window: this is a GFF. It says GFF version 3 at the top, you have some annotation, and then here is the first feature, based on GRCm38, so this is the mouse genome. The first feature is the chromosome itself, because that is the top-level feature overlaying everything. The chromosome starts at base pair one and ends at around 160 million base pairs. Then the next one, so the first thing annotated on chromosome one, is a microRNA gene, which goes from here to there, with a certain description, on the negative strand. And this is how you encode everything on the genome: if I look at this, it contains every region, every annotation there is for mouse chromosome one, and you can see that it is a massive file with a lot of different regions: coding sequences, exons, biological regions (no idea what those are), microRNAs, and all of these things. So it is just a file format to encode all of this in a standard way. You can download these from Ensembl as well: if you want to know, for example, which genes are located on chromosome three of humans, you just go to Ensembl and download the GFF file with all of the annotation. This is the standard way in bioinformatics to store sequence annotation: the general feature format.
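Since a GFF is just a tab-separated file with nine columns, a minimal sketch in base R is enough to load one. The file name here is hypothetical:

gff_cols <- c("seqname", "source", "feature", "start", "end",
              "score", "strand", "frame", "attribute")
gff <- read.delim("Mus_musculus.chr1.gff3",       # hypothetical file name
                  header = FALSE, comment.char = "#",  # skip the '#' header lines
                  col.names = gff_cols)
table(gff$feature)                                # count exons, genes, miRNAs, ...
head(subset(gff, feature == "gene" & strand == "-"))  # genes on the negative strand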
All right, then the next one is the variant call format, VCF. The variant call format is a way to store sequence variations: for example, instead of an A at this position, some individuals have a T. The VCF only lists the variations, so we also need to know which reference genome was used, because variations are relative to something. The current standard is version 4.3. VCF files start with a header; header lines start with a hash. Headers provide metadata, and there are special keywords in the header, denoted with a double hash, that describe how the rest of the file will look. VCF is very similar to a GFF file in that it is again a tab-separated file. It has eight fixed columns, plus a format column, and then an unlimited number of columns for samples.

Let me also search for one of these files and open one up, so we can talk about it. Oh, that is a 600 MB one; that is a big mistake, loading a 600 MB file, I should have taken one of the smaller ones. Well, it actually loaded, that's good. All right, let's first go through the columns. The first column is the chromosome. The second column is the position. Then we have an identifier. We have REF, the reference: what does the reference genome have at this position? We have ALT, the alternative, which is the variation we are encoding: the reference might be an A, the alternative might be a T. We have QUAL, the quality score associated with the variant. We have a FILTER column, which can tell you whether this variant passes a certain filter: if you set up a filter saying I only want to see single-nucleotide changes, or only insertions and deletions, then based on that it flags this seventh column. The eighth column is the INFO field, which holds annotations describing the variant. Then we have the FORMAT field, which describes how the per-sample fields that follow are encoded, and after that come the different samples.

So let me show you one of these files, because that makes it a lot clearer. Here we see the file format line: this is a relatively old version, 4.1, but that is fine. It tells you which filters were used; we did use a low-quality filter. Then there is a whole bunch of other INFO definitions, and also the command that was used to create the file, which chromosomes there were, and which version of the genome was used. The actual information in the file starts here, after around 240 lines of annotation. Here we see that we are looking at chromosome three. At this position there is a known SNP, so it has an rs identifier. The reference at this position is a C.
Some individuals were found to have an A, and the quality with which this variant was detected is very high, 47,000. Then we see... no, this is the FILTER field: it says the variant passed the low-quality filter. Next comes the INFO field, where we see all kinds of information on the variant, like the inbreeding coefficient, and that we had 2,571 reads at this position. But these are not that interesting, because the interesting part comes at the back. If we go all the way to the end of the line, we see the different animals and what they had. Starting after the FORMAT column, this is the first animal, and this is the second animal. The first animal is coded 0/0, which means this animal had reference, reference at this position in the genome. The second animal had alternative, alternative. So there are different codings: 0/0 means reference, reference; 0/1 means heterozygote, so with an A-to-T SNP that would be an AT; and 1/1 means alternative, alternative, so that would be TT. This is the general way we encode SNPs and deletions. Here, for example, you can see a deletion: at this position the reference sequence is C, T, T, T, T, T; some animals have a C with two Ts, some animals have a C with three Ts, some animals only have a T, and some were detected to have additional Ts. It is a complex variant where the reference genome has a run of Ts, but there are different variants in the population we are looking at. This is how you would describe it, and that is the VCF format. It is very similar to the GFF format, but the GFF format describes features, while the VCF format describes variants: changes relative to a standard reference genome.
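As a minimal sketch with a hypothetical file name, you can peek at the genotype columns with base R; for serious work, a dedicated parser such as the Bioconductor package VariantAnnotation is the safer route:

vcf <- read.delim("variants.vcf",            # hypothetical file name
                  header = FALSE, comment.char = "#")  # drops the '#' header lines
names(vcf)[1:9] <- c("CHROM", "POS", "ID", "REF", "ALT",
                     "QUAL", "FILTER", "INFO", "FORMAT")
# The genotype is the first ':'-separated field of each sample column;
# column 10 is the first sample:
gt <- sub(":.*$", "", vcf[[10]])
table(gt)                                    # e.g. 0/0 (ref/ref), 0/1, 1/1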
Furthermore, we have PED and MAP files. The PED format contains both family-based and regular genotype data. The format comes from PLINK, and a PED file is again a tab-separated file in which each row starts with six fields, followed by the genotypes for each marker. For example, imagine we have a population of a hundred animals and we ran a microarray on them: with the microarray we measure 10,000 locations across the genome, and at each of these locations we learn whether the individual is homozygous reference, heterozygous, or homozygous for the alternative allele, just like in the VCF format. But the VCF format and the PED format are organized differently, because the PED format does not store all of that additional information; it stores information on the individual level. It does not list the variants in the rows, it lists them in the columns: each row in a PED file is an individual. So it is kind of the VCF format turned on its side, without all of the INFO columns and the quality columns and these kinds of things.

The first of the six fields is the family ID to which the individual belongs; then we have the within-family ID (the IID), the within-family ID of the father, and the within-family ID of the mother, because you can have multiple families. This format is used a lot in human genome-wide association studies, where you have different populations that you look at: for example, a population in Groningen, a population in the U.S., or a population in Berlin. Those are more or less family trees, and within each of the families you can have individual IDs. (Let me just ban that chat message.) All right, so: the family ID, the within-family ID, the ID of the father, so we know who the father of this individual was, and the ID of the mother; we can fill in zeros if the mother or the father is not in the data set. Then we have the sex code, coded as one for male, two for female, and zero for unknown. And then we have the phenotypic value. Since this format is used for human genome-wide association, we generally have a case-control study, so the phenotype might be: this individual is diseased, or this individual is healthy, depending on what phenotype we are looking at. The phenotypic value is one for being part of the control group and two for being part of the case group; that could be healthy tissue versus diseased tissue. After this, the genotypes are encoded with two columns per variant, because every individual in our data set, if we are looking at humans, has two copies of each chromosome: you can have an A on one chromosome and an A on the second, or an A and a T, or a T and a T, since we are looking at single nucleotide polymorphisms. So the PED format is used a lot for coding human genotypes for GWAS.

Here you see a little example of a PED file. This is family ID one, sample one. The father was not in the data set, the mother was not in the data set, and the next zero is the sex code, so this individual had an unknown sex. The phenotype is coded one, meaning that it is a control and not a case. And then we see that at the first marker, the first chromosome carries an A and the second a G, so this is an AG; then you see GG and AA, going across the different markers along the chromosome.

Of course, every PED file has to come with a corresponding MAP file, because for this A and this G we need to know where in the genome they are located: which SNP did we measure? The MAP file annotates the genotype columns, and the MAP file itself has four columns: the chromosome, the RS or SNP identifier, the genetic distance in centimorgans, and the position in base pairs. In many cases the genetic distance in centimorgans is not used and there will only be a base pair position; this has to do with the fact that when you look at humans, you don't have a structured population, so you don't have genetic distances like we discussed in the GWAS versus QTL lecture. So PED and MAP are very common file types for storing association study data, where we have genotypes on one side and phenotypes on the other. So that's how it works.
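A minimal sketch of reading such a pair in base R, with hypothetical file names; in practice PLINK itself or a dedicated package does this for you:

ped <- read.table("study.ped")               # hypothetical PED file
names(ped)[1:6] <- c("FID", "IID", "PatID", "MatID", "Sex", "Phenotype")
map <- read.table("study.map",               # hypothetical MAP file
                  col.names = c("chr", "snp_id", "cM", "bp"))
# Two PED columns per marker: the alleles of marker j sit in columns
# 6 + 2j - 1 and 6 + 2j, so the MAP file labels them like this:
j <- seq_len(nrow(map))
names(ped)[6 + 2 * j - 1] <- paste0(map$snp_id, "_1")
names(ped)[6 + 2 * j]     <- paste0(map$snp_id, "_2")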
Good. Those were just a couple of very basic file formats that I wanted to discuss with you. There are many, many more, like BAM files and SAM files, but you can generally just Google each of these formats and see how they are described. But remember that in bioinformatics, most file formats are plain text formats that you can open with a text editor to look at how the thing is structured, and many of these formats are based on tab-separated files.

Now I want to give you a little more information about programming and things like source code. Source code is a collection of computer instructions. It is written in a human-readable programming language and normally stored as ordinary text files. I say normally, but this is almost always the case; there are a couple of esoteric programming languages that store their source code as, for example, an image. There is a well-known esoteric language called Piet, after Piet Mondrian, where the program code is actually just an image file. The image is fed to the Piet interpreter, which looks at each pixel, sees what the color is, and based on the color executes an instruction. But generally we have files like R code files: ordinary text files with comments in them.

You can take source code and execute it on a machine in two different ways. You can compile a file, which means you take your source file and run it through a computer program called a compiler, and this compiler translates your source code into instructions for your machine. This is of course machine specific: a new Pentium processor will have different instructions than a very old AMD, and the smartphone that I have has a different CPU than my computer. So for each of these platforms, when you compile code, you are compiling for a certain CPU, because you are generating machine code instructions, and machine code instructions are different on Android phones compared to iPhones because they have a different chip in there; even two Android phones do not necessarily have the same chip. To get around this limitation, that you have to compile your code for every different platform you want to run on, a lot of programming languages nowadays are interpreted. That is, for example, how R does it. R does not take your source file and generate machine code that you then run. Instead, R takes your source code, parses it into a kind of intermediate structure, and executes that directly. In the end there is of course machine code being executed, but this machine code is generated on the fly; it is not a fixed static binary like you would get from compiling C or C++ code, where you say: take this source file and generate a .exe file under Windows. So interpreted means that source code is translated into an intermediate representation, which is then executed.

So how does this look? Every compiler and interpreter looks more or less the same. You have different programming languages, like C or C++, Ada, or Java. What happens first is the front end: the front end verifies that the syntax and the semantics are according to the language you are writing in, so that you did not make an error; if you write a while loop, it checks that you really typed while and not a misspelling like whle. That is what the front end does for you. The front end then takes your source code and produces the intermediate representation.
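You can actually watch R do this step. As a minimal sketch, the built-in compiler package byte-compiles code to that intermediate representation and lets you inspect it:

library(compiler)
bc <- compile(quote(x * 2))    # parse and byte-compile an expression
disassemble(bc)                # inspect the intermediate byte-code instructions
f <- cmpfun(function(x) x * 2) # byte-compile a whole function
f(21)                          # 42 -- it behaves exactly like the original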
Many compilers and interpreters then have something called a middle end, and the middle end performs optimizations on this intermediate representation. For example, suppose you have the statement x equals 10 plus 5. The middle end sees this statement and notices: you are adding two constants, so the result is also constant. Instead of keeping the addition in and having the machine compute 10 plus 5 every time, it just replaces 10 plus 5 by 15. So it looks at the code, understands that certain things you are doing are suboptimal, and replaces them by something smarter; a lot of the optimization of your source code is done in the middle end. And then you have the back end, and the back end generates the machine-dependent, or target-dependent, assembly code. This is the code that your computer can understand. Different CPUs have different instruction sets: modern computers use 64-bit instructions, or 32-bit instructions if you have a relatively old computer, but if you are compiling for your smartphone, that is an ARM machine, so a RISC machine, which has a completely different instruction language. So the back end takes the intermediate representation, which has been cleaned up and made more or less optimal by the middle end, and translates it to the correct language that your CPU can understand.

When we do software or code development, we generally define a number of stages. In academia this is a little bit different, because in academia we almost never get further than the alpha stage. The alpha stage is the initial version: it lacks a whole bunch of features and there are plenty of bugs in there. It is me writing R code for the first time: I'm just writing and thinking, oh, I want to do this and that, and I don't pay attention to all of the details; I just want to get a certain analysis done, and after the analysis is done, we have an alpha version of my code. The beta version is when you start adding more features to your code. For example, my alpha version just uses a text file as input; a beta version could instead take the name of a gene as input and then retrieve the sequence automatically from NCBI. It has more features, and it is suitable for other people to use and test. This is where we have the closed beta, when users from within our own group test it, or the open beta, when we give the code to people outside our circle of confidence.
After we've reached the beta stage, we have code that more or less has all the features we wanted, and users have tested it. Then we end up with what we call the release candidate: the candidate version that we are going to ship to other people, like the compiled binary where we say, okay, you can use our program now. So the alpha means I am the only one using it, I wrote the code, it is full of bugs, and it only works for my data set. The beta is when I have made it more generic so it can work on any data set and we send it out for other people to test. The release candidate is what we end up with after fixing all the bugs that came out of the beta version, because all kinds of people will say: Danny, that code you wrote, if I input a minus one, it crashes my whole computer. In the release candidate, these initial bugs are fixed.

When no more bugs are found, we have an RTM version: release to manufacturing. This is not really done anymore nowadays. In the old days, software was distributed on CD-ROMs, and once the software was on a CD-ROM, the software was done. The release candidate was a version distributed, for example, by email: at Microsoft they would send it out to all of the people they knew, those people would comment on it, and at a certain point they would say, okay, Windows 98 is ready now. Then the RTM version was the version that went to the guy with the massive machine hall where they pressed all of those CDs and DVDs. Nowadays we mostly skip this, because the internet is there and it is really easy to distribute software across it, so the RTM step is kind of gone. After the release to manufacturing, the version ready for distribution to the customer and the public, we have the release: the stable or final version, which is the version that people get. Nowadays the RC, the RTM, and the stable final release are more or less squished into one. Unless you are on Steam: games that get released on Steam still make a release candidate, then have this RTM moment of sending their code to Steam, and Steam will test it on a certain number of users, and when everything works, that becomes the stable version. So in some cases you still do this release-to-manufacturing step. This is just to give you an understanding that source code comes in different stages, with different levels of confidence that you can have in the code: if I am using a release candidate, I can be more certain that the thing will work and do the analysis I want than when I am working with an alpha version of the code.

All right. Of course, all of this, defining the steps in this software manufacturing process, is only possible when you test your code. Software testing is an investigation conducted to provide stakeholders with information about the quality of the product or service under test. Again, in academic software this is one of those fields that people generally ignore: they don't test their code. The code runs for them, so it is fine; the analysis is done, so they write their paper and submit it. But as a bioinformatician, your result is the code that you write, so you want your code tested and evaluated. There are three ways of doing this: integration testing, unit testing, and regression testing. These three kinds of testing overlap with each other, but each has its own unique features.
I just want to show you the difference between integration, unit, and regression testing, because a lot of people throw these terms around without making clear what exactly they mean. Integration testing is when you take individual software modules and test them as a group, and there are three ways of doing it. The first is big bang testing: you just compile the whole thing, run it, and see if it works. Say I have a program that requires a local database and also requires an output window and a printer attached to the system, because the program retrieves something from the local database, shows it on the screen, and also prints it out. Big bang testing means: compile the code, run it, and check that the thing on your screen is what you expect it to be and that the printer prints what you want. But we can do this smarter, because we can also use bottom-up testing. What we do then is look at our program, at what it does, and test the smallest component first. It might be that the smallest amount of code is in retrieving the data from the database, so the first thing we do is compile not the whole program but just the database part, and test that first: can it retrieve data from the database, without using the printer, without using the screen? We start with the component that has the least amount of code, and then we add the next component: first test retrieval from the database, then retrieval plus sending it to the printer, and the third step would be retrieving the data from the database, sending it to the printer, and sending it to the screen. Top-down testing goes the other way: you start by testing the global system and then break it down, testing step by step along the branches of each module, so it is kind of big bang testing, but then progressively narrowing down the individual modules that you are testing.

Unit testing is a different approach. The goal of a unit test is to isolate each part of the program and show that the individual parts are correct. Unit testing is generally used when you write functions: I have a function, and when I give the function the number five, it should always print hello world; when I give the function the number seven, it should always print hello Mars. So you test the individual functions; generally it is a function that you test, or an object, a class, or a module. A unit test is a kind of written contract: it says that if I give this function the number 10, it should always produce 12. That puts a test harness around your code, and it gives you confidence when you start changing your code, because if at any point I run my tests and see a test fail, say I input a 10 and it no longer gives me 12 but 20, then I know exactly where the problem is: in this function, I changed some code, and now it no longer does what I expect it to do.
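In R, this written contract is usually expressed with the testthat package, the same framework that R packages use for their built-in tests. A minimal sketch, with a made-up function:

library(testthat)

double_it <- function(x) x * 2     # the unit under test (made-up example)

test_that("double_it keeps its contract", {
  expect_equal(double_it(5), 10)   # give it 5, it must return 10
  expect_equal(double_it(15), 30)  # give it 15, it must return 30
  expect_error(double_it("a"))     # non-numeric input must raise an error
})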
I have used unit testing a lot myself, especially when dealing with legacy code. One of my first projects as a bioinformatician was to take software written at the beginning of the 1990s and rewrite it in a slightly different language so that it would be suitable for modern computers; we wanted to add things like parallel computing. The first thing I did was take the old code, compile it, and start writing tests, to make sure I knew exactly what each function does. I would generate some input, throw it into the function, and see what the output was, and then I would write a test saying: if I give this input to the function, I expect this output. When you then start porting a function from, for example, C to a new language, you can use your tests to make sure that your translation from one language to the other went correctly.

A unit is defined as the smallest testable part of an application, which is generally at the function level: you write a single function that does one thing, or an object that does one thing. Ideally each test case is independent from the others, but that is not a hard requirement. The advantages of unit testing are, first, that you find problems early: by writing a thorough set of tests, you force yourself to think about what the input is, what the output is, and what can go wrong. Second, it allows you to change code: like I told you, I had C code from the 1990s that I actually wanted in R, and by writing tests against the C code and applying the same tests to the R code, I could run the tests at the end of my translation and see whether I had translated everything correctly from one language to the other. Third, it simplifies integration testing, because you know the individual parts are correct: when I have five functions and I have tested all five, I can also assume that the combination of the five does what I expect. And fourth, unit testing provides documentation: it is a kind of living documentation of a system, because from a test you can see what a piece of the system should do. If the function is as simple as multiplying a number by two, then from the unit test you can see: we throw in five, the answer should be 10; we throw in 15, the answer should be 30. So from the unit tests it becomes clear what the code is doing, or at least what it should be doing.

There are also some disadvantages to unit testing. First, there is a decision problem: we cannot evaluate every input and every output. If we have a function that multiplies a number by two, then to be sure the function is correct we should test all numbers, and testing all numbers is not possible, because it would take an infinite amount of time. Second, it is not integration testing: if the printer is off, or the printer is not responding, your unit tests might pass, but the printer will still never print anything. That is why you need integration testing even when the whole system is littered with unit tests. And third, there is the problem of efficiency.
For every line of code that you write, you typically write three to five lines of test code, and that of course is very expensive. So unit testing is used heavily in areas where nothing is allowed to go wrong. The James Webb telescope, for example: every part of that telescope, every line of code, is littered with tests, because if something goes wrong, if some function in that telescope does not work the way it should, the whole project fails. The same holds in the aviation industry: you can't have airplanes falling out of the sky because of coding errors. That is why there is a lot of investment in creating these tests for the individual parts of such systems. There are also realism problems: sometimes it is very difficult to come up with realistic and useful tests. And of course there are platform differences: you develop software on a Windows machine and all the unit tests pass; then you go to your smartphone and some unit tests start failing, just because the smartphone handles floating-point numbers differently than the Windows system. It keeps fewer digits after the decimal point, and your unit test says no, the answer should be 3.1415 and so on, but the answer comes out shorter.

All right, and then we have regression testing. Regression testing is in a way very similar to unit testing, but you write the tests based on what you observe. In the translation example, it is genuinely difficult to judge whether I was writing regression tests or unit tests. A unit test generally tests an individual part; regression testing means you look at the whole system, give it some input, and then look at what the output is. You treat the whole program as a single unit, and for the rest it is very similar: the same as unit testing, but instead of testing individual functions, you test the entire program as one unit. One of the nice things about regression testing is that it is very useful when you find a bug in your code: you fix the bug, and the bug you found is then turned into a test, so that the next time you change your code, you cannot reintroduce it. It allows you to fix bugs forever, so to speak: once I found that giving the program 15 should produce 45 but didn't, I write a test saying that with this input, that should be the output of the whole program.
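A minimal sketch of that idea in R, with a hypothetical input file and a stand-in for the real program:

run_pipeline <- function(infile) {
  # stand-in for the whole program; in real life this might
  # source("analysis.R") or call a command-line tool via system2()
  x <- scan(infile, quiet = TRUE)
  x * 3
}

expected <- c(45, 9, 30)                    # known-good output, captured once
got <- run_pipeline("regression_input.txt") # the fixed input file holds: 15 3 10
stopifnot(identical(got, expected))         # fails loudly if the behavior regresses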
One way to organize this is test-driven development, a software development process used by larger bioinformatics groups and larger software development teams, where you write code based on failing tests. If you want to add a new feature to your code, the first thing you do is write a test: if I click on this button in my software, this should happen. You write the test, you run all your tests, and the new test you just added should fail, because the feature has not been implemented yet. Then you write the code implementing the desired behavior, run your tests again, and keep iterating until the new test passes; then you start again from step one. So for the next feature: you add a test, you make sure the test fails, you write the code, you run the tests, and if a test still fails, you refactor and update your code until it passes. This is a very, very good way of developing software, because it gives you a lot of assurance that the code you are writing does exactly what you intend. It also makes it really easy to refactor code, to take part of the code and say: I want to make this more efficient, so I am going to change this and this and this; when you then run all of the tests and they all pass, you know that the behavior of the software hasn't changed. Test-driven development is used a lot in fields like the aviation industry and at places like NASA: when there is a lot of money on the line, you do test-driven development.

All right, then documentation, because every piece of software you write has to come with some documentation; no one will understand what to do if there is no readme or help file. Documentation is defined as the text or illustrations that accompany computer software and explain how to operate or use it. I think everyone knows this, but there are different types of documentation. The first thing written in any software project is the requirements. The requirements state what the thing should do: what attributes should it have, what capabilities, what quality, so for example how accurate it should be. The next type of document is generally the architecture and design document, which tells people what needs to be installed for the software to work. One of the things Aimee talked about is that to run her new labeling system, she needed to have TensorFlow installed; the architecture document tells you: to run this code, you need to have this and that installed, you need a monitor, or you have to have a graphics card with this much memory, and these kinds of things. Then you have the technical documentation, which is generally the documentation that you as a bioinformatician write: the documentation of your code saying, this is my code, I made it, it implements this algorithm, for example multiple sequence alignment, and this is the interface. To do a multiple sequence alignment, the first parameter should be the sequences you want to align, the second parameter is the alignment procedure you want to use, and the third parameter is the number of iterations, something like that.
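In R, this kind of technical documentation is typically written as roxygen2 comments directly above the function; we will meet this again in the R package hour. The sketch below is a made-up alignment interface mirroring the description above, not a real package function:

#' Align a set of sequences.
#'
#' @param sequences Character vector with the sequences to align.
#' @param method Alignment procedure to use, e.g. "muscle" or "clustal".
#' @param iterations Number of refinement iterations to run.
#' @return A character matrix with one aligned sequence per row.
align_sequences <- function(sequences, method = "muscle", iterations = 10) {
  # ... the actual implementation would go here ...
}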
Then there is the end-user documentation, which as a bioinformatician you generally also write. For many of the software packages and code I made in the past, I have also written tutorials, so that you can give a starting master's student your code together with a PDF document, the tutorial; that is the end-user documentation. And then we have the marketing documentation, which in bioinformatics is generally not that interesting, but in a lot of cases, when you work for a company doing bioinformatics or software development, you also write a marketing document that explains to non-technical users what your product does, what the unique selling points are, and what the demand is: which group should you sell it to? So if you have a good software development team, they use something like test-driven development and they write all of these types of documentation, and that makes for high-quality software. Good.

So that's it. Are there any questions so far about this part of the lecture? About test-driven development, how it differs from Agile or from other types of software development? If not, then after one break we will do the next round of animal GIFs, which I think is going to be raccoons. Afterwards I will give you a very, very short introduction on how to write an R package, and where, for example, the manuals go, where the tests go, and these kinds of things. I will run through it relatively quickly, because I think I already talked about this in the R course, but I just wanted to show you how to write an R package very quickly. We will do that after the break. So thank you guys for watching here; no, thank you guys for watching on YouTube. If you're here on Twitch, stick around, enjoy the beautiful GIFs, and I will see you in 10 minutes.