Welcome everyone. This is the last lecture of bioinformatics, the summary. If you've not watched any of the previous videos, this is going to be the video for you, because I'm going to summarize everything in about two hours. So it's going to be good. Last stream, it's like puppy eyes. It's such a shame. It's not going to be the last stream ever, because the next lecture series is going to be online as well. I hope so. It might be that the university will force me to do it in person, but we'll have to see. I'm very sad about it. I like being here every week, talking to myself, reading chat and answering questions. It's really a shame that we're done with this series. But you guys had 14 good lectures, and I hope everyone does really well on the exam, so that we don't have to have a makeup exam or something like that.

Anyway, this is going to be the overview. I will talk through all of the lectures, and all the way at the end I have four example exam questions. They're actually not example questions, they're real exam questions from previous years, so that you get a feeling for what kind of questions I ask and how I ask them. I will be answering them, so you can see what I think is important.

With that out of the way, let's start with lecture one, the introduction. In the introduction, we talked a lot about what bioinformatics is: a discipline that uses tools from computer science to answer biological questions. I also gave you a whole bunch of these kinds of definitions. But if I ask a question like "what is bioinformatics?", then I want you to at least mention computer science and biological questions, because those are the two core elements in the definition.
So you can write it down any way that you want, as long as it mentions computer science or information technology or something analogous to that. And of course, the biological questions need to be in there as well. That's the way I do the slides: many of them have this blue highlight in the definitions, and that's the important part. So for lecture one, know what an algorithm is, know what data is, know what knowledge is, know the difference between data and knowledge, and also know the difference between in vivo, in vitro and in silico, because those things are pretty important for bioinformatics. We also talked about DNA sequence. Sequence is the fundamental data type in bioinformatics. That's the thing that started it all. At a certain point we started doing protein sequencing and DNA and RNA sequencing, and that's where the whole field of bioinformatics comes from. DNA, RNA and protein sequences are more or less the reason why the field exists, because people needed to store them in databases and then analyze them. Know that sequence is also the entry point for many in silico studies. If we think about the current coronavirus situation, it all started with people sequencing the virus, figuring out that it's a sarbecovirus, and then making phylogenetic trees and tracking how the virus mutates and spreads across the world. Know what whole genome shotgun sequencing is, and also know that sequence alignment is one of the most fundamental algorithms in bioinformatics, so the alignment of two sequences against each other. We spent a whole lecture on that. We also went quite quickly through the microarray workflow in lecture one. You don't have to be able to reproduce the whole thing. I don't expect you to know point by point what the microarray workflow is.
But I want to be able to ask a question like: in which parts of microarray analysis is a bioinformatician involved? Then you could say, well, a bioinformatician is involved in creating the arrays, but also in data storage and data normalization. Generally I will ask something like "give me four steps" or "give me three things".

All right, then lecture number two, phenotypes. We talked about qualitative properties and quantitative properties. Qualitative is something like: it tastes good, it smells bad, I like it or I don't like it. It's something that is really hard to put a number on, so it's not measured with numerical results. Quantitative properties are properties that exist with a magnitude or multitude, which means that they can be measured using SI units. We also talked about Mendelian traits and complex phenotypes. Mendelian traits are traits where a single gene causes the difference in the phenotype, while complex phenotypes are controlled by many genes. One example of a Mendelian trait is earwax: there's dry earwax and there's wet earwax, and there's a single gene in the genome that controls which one you have. So it's a single mutation in a single gene. Complex phenotypes are things like human height and intelligence, which are determined by many, many different genes. We also talked about this mixing flowers thing. If the genetics underneath the phenotype is additive, then you get mixing. That means that when we cross a red flower and a white flower, looking at the gametes, you get the corresponding Mendelian inheritance diagram with an intermediate phenotype. While if we have dominance, so one of the phenotypes dominates, then we get a different proportion, and this is because the two can't mix together.
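The additive versus dominant proportions can be checked with a small sketch. This is only an illustration under stated assumptions: the allele letters `R` and `r`, the colour names (including "pink" for the mixed phenotype) and the function names are made up for the example and do not come from the lecture slides.

```python
from collections import Counter
from itertools import product


def cross(parent1, parent2):
    """Count offspring genotypes for a one-gene cross, e.g. cross("Rr", "Rr")."""
    # Each parent passes on one of its two alleles; sorting makes "rR" and "Rr" one genotype.
    return Counter("".join(sorted(a + b)) for a, b in product(parent1, parent2))


def phenotypes(genotype_counts, mode):
    """Translate genotype counts into flower colours for an additive or dominant model."""
    colours = Counter()
    for genotype, n in genotype_counts.items():
        red_alleles = genotype.count("R")
        if mode == "dominant":
            # One red allele is enough to be red.
            colour = "red" if red_alleles >= 1 else "white"
        else:
            # Additive: the heterozygote shows the mixed (hypothetical "pink") phenotype.
            colour = {2: "red", 1: "pink", 0: "white"}[red_alleles]
        colours[colour] += n
    return colours


offspring = cross("Rr", "Rr")
print(phenotypes(offspring, "additive"))   # 1 red : 2 pink : 1 white
print(phenotypes(offspring, "dominant"))   # 3 red : 1 white
```

So a diagram showing a 1:2:1 split with an intermediate class points to an additive phenotype, and a 3:1 split points to dominance.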
So if you have a red allele, you will always be red. Also be able to read these kinds of diagrams. I might ask a question where I show you a diagram and then ask you: is this an additive or a dominant phenotype?

Furthermore, we talked about the concept of linkage. Because genes are located on a chromosome, the closer they are together, the more often they are inherited together. If they are very far apart, then there's a high chance that there will be a recombination in between, so a homologous recombination when the gametes are produced, separating the two phenotypes from each other. Linkage is a very difficult concept, and I just want you to be able to tell me in your own words what linkage is. We also talked about two-point and three-point crosses, which are very much the same. These are used to determine if genes are linked or if they're independent. Linked means that they are on the same chromosome, and then the question is how close together they are. Independent means that gene one is on chromosome one and gene two is on a different chromosome, for example chromosome 11. The advantage of a three-point cross compared to a two-point cross is that in a three-point cross you can also infer the order of the genes on the chromosome. So it allows you to build a genetic map where you can say: if we start at the beginning of chromosome one, then we first see the phenotype for, for example, broken wings, then the phenotype for eyes, and then a phenotype for antenna. So we can determine the order of the genes on the genome. That is only possible with a three-point cross, because we can figure out if A is closer to B than it is to C, and these kinds of things.

Good. We talked a little bit about phenotypes in lecture two as well. So we talked about visual analysis like box plots and histograms.
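Going back to the three-point cross: the reasoning "is A closer to B than it is to C" can be written down as a small sketch. The recombination fractions below are invented numbers for illustration, and the function names are hypothetical. The idea is that the two genes with the largest recombination fraction between them are the outer ones, so the remaining gene sits in the middle.

```python
def recombination_fraction(recombinants, total):
    """Fraction of offspring in which two traits were separated by recombination."""
    return recombinants / total


def gene_order(pairwise):
    """Infer the order of three linked genes from their pairwise recombination fractions.

    pairwise maps a (gene1, gene2) tuple to the recombination fraction between them.
    """
    # The pair that recombines most often is furthest apart: these are the outer genes.
    outer = max(pairwise, key=pairwise.get)
    all_genes = {gene for pair in pairwise for gene in pair}
    middle = (all_genes - set(outer)).pop()
    return [outer[0], middle, outer[1]]


# Hypothetical data for the three phenotypes mentioned in the lecture.
fractions = {
    ("wings", "eyes"): 0.10,
    ("eyes", "antenna"): 0.15,
    ("wings", "antenna"): 0.23,
}
print(gene_order(fractions))  # ['wings', 'eyes', 'antenna']
```

With only two genes you get a single distance and no order, which is why the two-point cross cannot build the map on its own.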
So be able to tell me things about box plots or histograms. A box plot generally shows the median value and then it shows the quantiles: the box holds 50% of the data, and then generally something like 95% of the data falls within the whiskers. And then we have things like histograms that we talked about. But there probably really be a question about that. We talked about multiple testing, so definitely know the difference between a type one and a type two error. We also talked about descriptive statistics. What is an outlier, and how do you deal with outliers? You can winsorize them away. Generally outliers are values which are very, very far away from the rest of the distribution, and they can be caused by things like decimal point errors, when you write down the decimal point in the wrong place. So instead of writing 3.0, you write 30.0. These kinds of things happen. We also talked a little bit about exploratory data analysis and how to decide which model to use on the data. If you have a really nice normal distribution, then of course you want to go with parametric statistics. But if you have a lot of outliers in your data, then it's probably better to switch to non-parametric statistics. And a little bit about hypothesis testing.

Good. Then in lecture three, we talked about DNA. We talked about how DNA is used for diagnostics, and used a lot in biotechnology, forensic biology and virology. So DNA is used to catch criminals, and to find things like whether you carry a BRCA mutation and so have a high chance of developing breast cancer. DNA research is also used a lot in biotechnology. If we want to make algae produce fuel, then we look at the DNA of these algae and try to optimize them for producing biofuels. Virology, I think that speaks for itself. We also talked about the old, more or less classical ways of sequencing DNA.
So we talked about Maxam-Gilbert sequencing, pyrosequencing, Sanger sequencing, and next generation sequencing. What I want from you is that you are able to read these kinds of plots. Here you see a plot, which is, yeah, I think this is pyrosequencing. You add the nucleotides in order, and when a nucleotide gets incorporated, you see a little flash of light, and the height of the flash tells you how many bases were incorporated. If I show you a figure like this and tell you the order in which the nucleotides were added, then I want you to be able to say: okay, this is then the resulting sequence. The same thing for this one, Maxam-Gilbert sequencing, where you use four different chemical cleavage reactions which cut at different bases: one which cuts at A plus G, one which cuts at G, one which cuts at C, and one which cuts at C plus T. This causes the DNA to fragment, and then these fragments are run on the gel. One of the things that I saw a lot in recent years when we did the exam is that people read it the wrong way around. The sequence here you read from the bottom to the top, because here we see that it's CTA, CGTA, and here you see CTA. So you don't read it from the top down, and that's what often goes wrong. People are able to figure out which base there was at each of the different positions. But of course, the first position is the lowest one, because the smallest fragment runs the furthest, so the smallest fragment is the first base. So remember that when you see a sequencing gel with Maxam-Gilbert sequencing, you have to read it from the bottom to the top. That goes wrong a lot, so that's a tip for you guys.
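Both read-outs can be sketched in a few lines. This is a toy illustration, not a reconstruction of the figures on the slides: the peak heights, the gel rows, the lane labels (`"A+G"`, `"G"`, `"C"`, `"C+T"`) and the function names are all assumptions made for the example.

```python
def read_pyrogram(dispensation_order, peak_heights):
    """Turn a pyrogram into a sequence.

    peak_heights[i] is the light signal after adding dispensation_order[i],
    expressed in units of a single-base flash (0 means nothing was incorporated).
    """
    return "".join(
        base * round(height)
        for base, height in zip(dispensation_order, peak_heights)
    )


def read_maxam_gilbert(gel_rows_top_to_bottom):
    """Read a Maxam-Gilbert gel given as rows from top to bottom.

    Each row is the set of lanes showing a band at that fragment size.
    """
    calls = {
        frozenset({"A+G"}): "A",       # band only in the A+G lane
        frozenset({"A+G", "G"}): "G",  # band in both the A+G and the G lane
        frozenset({"C+T"}): "T",       # band only in the C+T lane
        frozenset({"C+T", "C"}): "C",  # band in both the C+T and the C lane
    }
    # The smallest fragment runs furthest, so the first base is at the bottom of the gel.
    return "".join(calls[frozenset(row)] for row in reversed(gel_rows_top_to_bottom))


print(read_pyrogram("ACGTACGT", [1, 0, 2, 1, 0, 1, 0, 0]))  # AGGTC
gel = [{"A+G"}, {"C+T"}, {"A+G", "G"}, {"C", "C+T"}]        # top row first
print(read_maxam_gilbert(gel))                                # CGTA
```

Dropping the `reversed` call reproduces exactly the exam mistake of reading the gel from the top down.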
We also talked about the workflow for next generation sequencing: you do sample preparation, then you do DNA sequencing. Generally, as a bioinformatician, you're not involved in that. Sample preparation is done by a postdoc, a PhD student or a master's student in the lab. The DNA sequencing is generally done by an external company, because in academia we almost never do our own sequencing. What you get from the company is these FASTQ files, and then we need to do all kinds of steps before we end up with a list of our single nucleotide polymorphisms. First we need to trim the reads, which means getting rid of the ends, because the quality of sequencing drops: the more base pairs I sequence, the lower my confidence that the base pairs are actually correctly sequenced. So in the workflow you have read trimming, where you say: my read is 150 base pairs long, but I see that the quality drops off after about 110, so I'm just going to cut the read there and ignore the last 40 base pairs. Read trimming again yields a FASTQ file, the same format as we had. Then we do alignment. Alignment is just taking the read, scanning across the genome, and seeing where it fits. Then you get a BAM file, which is the file format used for next generation sequencing data. It is similar to SAM, but binary. After alignment, we have to handle duplicates, because in the sequencing process we generally have a PCR step when we do our sample preparation. We also have optical duplicates, which are caused by how the machine works. So we have to remove duplicates, which means that if we have a read starting at a certain position and ending at a certain position, but we see the same read over and over and over again, then we just ignore all the duplicates.
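The trimming and duplicate steps above can be sketched as follows. This is a simplification under stated assumptions: real trimmers typically use sliding windows rather than cutting at the first bad base, the quality threshold of 20 is an arbitrary choice, the Phred+33 quality encoding is assumed, and the dictionary keys used for alignments are hypothetical.

```python
def trim_read(sequence, quality_string, min_quality=20, phred_offset=33):
    """Cut a read at the first base whose Phred quality falls below min_quality."""
    for i, char in enumerate(quality_string):
        # FASTQ stores each quality score as a character: score = ASCII code - offset.
        if ord(char) - phred_offset < min_quality:
            return sequence[:i], quality_string[:i]
    return sequence, quality_string


def remove_duplicates(alignments):
    """Keep only the first read for every (chromosome, start, end, strand)."""
    seen = set()
    unique = []
    for aln in alignments:
        key = (aln["chrom"], aln["start"], aln["end"], aln["strand"])
        if key not in seen:
            seen.add(key)
            unique.append(aln)
    return unique


# "I" encodes quality 40, "#" encodes quality 2.
print(trim_read("ACGTACGT", "IIIII##I"))  # ('ACGTA', 'IIIII')

reads = [
    {"chrom": "chr1", "start": 100, "end": 250, "strand": "+"},
    {"chrom": "chr1", "start": 100, "end": 250, "strand": "+"},  # duplicate
    {"chrom": "chr1", "start": 180, "end": 330, "strand": "+"},
]
print(len(remove_duplicates(reads)))  # 2
```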
And we just say, no, we had one read at this position instead of having 100 or 1000. These optical duplicates are very common, so you have to remove those. Then the next step is indel realignment. Indel realignment means that you use known variation in the genome. We've already sequenced hundreds and hundreds of humans, probably more in the order of hundreds of thousands of humans by now. So if we align a read to the human reference genome, then of course it might be that where the read fits, there is a variant, meaning a position which is different in some individuals. We don't want to penalize the alignment for having this variant, and that's what indel realignment does. It looks at known little insertions and deletions and then says: I'm not going to penalize the read for this, because this is a known variant in the human genome. And of course, we don't have this just for humans, but also for mice and rats and other model organisms. Then we have the base recalibration step. The base recalibration step is very similar to the indel realignment step, but it just looks at single nucleotide polymorphisms. So the indel step is for short deletions and insertions, and base recalibration is the same thing, but now for single base pair variants. And then in the end, what we generally do with DNA data is that we don't look at the whole genome that we have. Imagine that we have a human, then we want to summarize where this human is different from the reference sequence. So we do single nucleotide polymorphism calling, or SNP and indel calling, to find the positions in the genome where our sample is different from the reference. Also know that there are drawbacks to doing next generation sequencing.
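The idea of SNP calling, finding the positions where the sample differs from the reference, can be shown with a toy sketch. This is not how production variant callers work: it ignores heterozygous sites, base qualities and indels, and the `min_depth` and `min_fraction` thresholds as well as the pileup representation are assumptions chosen for the example.

```python
from collections import Counter


def call_snps(reference, pileup, min_depth=4, min_fraction=0.8):
    """Report positions where the reads consistently disagree with the reference.

    pileup maps a 0-based reference position to a string with the bases
    observed in all reads covering that position.
    Returns a list of (1-based position, reference base, sample base).
    """
    calls = []
    for pos, bases in sorted(pileup.items()):
        if len(bases) < min_depth:
            continue  # too few reads to trust this position
        base, count = Counter(bases).most_common(1)[0]
        if base != reference[pos] and count / len(bases) >= min_fraction:
            calls.append((pos + 1, reference[pos], base))
    return calls


reference = "ACGTAC"
pileup = {
    1: "CCCCC",  # matches the reference, no call
    3: "AAAAT",  # four of five reads show A where the reference has T
    4: "GG",     # differs, but only two reads cover it
}
print(call_snps(reference, pileup))  # [(4, 'T', 'A')]
```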
You need a lot of computing time. It's getting better and better because the tools become better, of course. But in the end, there's a lot of computational time involved in analyzing DNA-seq data. You need a lot of hard drive storage. You need a lot of random access memory. And you need file management, so you need to keep track of all of the different files that are being produced. In the whole pipeline, we start off with one file we get from the company, which turns into two files, three files, four, five, six, seven. So it's like seven or eight files that you have in the end. You need to manage those, they need to be stored, you have to have backups, and these kinds of things.

Also know that there's a difference in the definition of what a gene is. In the previous lecture, when we talked about genes in the context of phenotypes, a gene was a unit of inheritance. But in molecular biology and in sequencing, a gene is not a unit of inheritance. A gene is a union of genomic sequences encoding a coherent set of potentially overlapping functional products. Nowadays in biochemistry or molecular biology, we see a gene as having introns and exons, and a single gene can produce different proteins or different variants of the same protein. So it's a much more complex definition in molecular biology than it is in genetics. In genetics, a gene is very basically a unit of inheritance. Generally it comes in two forms: you have a red gene and a white gene, you get one of the gametes from your father and one from your mother, and these mix. But in molecular biology, things become much more complicated, because you have a gene which encodes color, and this color gene can have like 10 different variants.
Some of them are white, some of them are red, some are purple, some are blue, and all of these can mix and match with each other in an additive or in a dominant way. So in molecular biology, a gene has a fixed definition, and it's a very good definition, but it's a different definition than the one we use in genetics. Be aware of that.

We also talked about transposable elements. There will definitely be a question about transposable elements. Remember that they were discovered by Barbara McClintock, one of my favorite molecular biologists ever. So there might be a question about that. Transposable elements, also called jumping genes, come in two different classes: retrotransposons and DNA transposons. The retrotransposons, class one, have an intermediate RNA form. So in the DNA, they get more or less transcribed into RNA, and the RNA then gets built back into the DNA. The class two transposons are DNA transposons, and they don't have this intermediate form. Each of these classes is subdivided in two, autonomous and non-autonomous. So you have autonomous retrotransposons and autonomous DNA transposons. Autonomous means that they don't need anything else to move. Everything that they need to move from one position in the genome to another, or to copy themselves from one position to another, they carry with them. Non-autonomous means that the element needs something from the host cell to move from one position to the other. So not all of the proteins that it needs to jump around are encoded on the transposable element itself. We also talked a lot about different regulatory elements. Just read through them, and know that there are different types of regulatory elements like insulators and enhancers. You have TATA boxes, and you have things like metal-sensing elements in the DNA.
But I'm not going to ask about that in too much detail. I like the transposable elements much more, so there's more likely to be a question about transposable elements than about regulatory elements. Also know the difference between a mitochondrion and a chloroplast, and know their function. The mitochondria are the powerhouse of the cell, which means that they produce ATP. Chloroplasts are a similar thing, but in plants: they do photosynthesis and produce energy for the plant that way.

Lecture 4, RNA. I think this is generally the most boring lecture for everyone, also for me, because there are so many different types of RNA. You have messenger RNA, which comes as pre-messenger RNA, called hnRNA, and then you have the mature messenger RNA, called mRNA. We have transfer RNAs, which transfer amino acids. They form the link between the messenger RNA sequence and the protein sequence: on one side they have the anticodon, which matches the codon in the messenger RNA, and on the other side they have the amino acid attached to them, in a cloverleaf structure. You have ribosomal RNA, which is RNA inside the ribosomes that helps the ribosome produce proteins. We have small nuclear RNAs, which sit in the spliceosomes in the nucleus and do things like splicing. We have catalytic RNAs like ribozymes, which have a function themselves. Proteins generally have a biological function, but catalytic RNAs like ribozymes also have a catalytic function, so they are involved in biological processes. We have microRNAs, which are there to regulate gene expression. They generally bind messenger RNA, which then gets degraded because the cell doesn't like double-stranded RNA. We have small interfering RNAs, which are kind of like microRNAs, but brought into the cell by humans, for example by microinjection.
And we have non-coding RNA, which is RNA that does not code for a protein, but where we don't know exactly what it does. Generally it's like the microRNAs, but much longer, so long non-coding RNAs or lncRNAs. So there are a lot of different types of RNA and a lot of different definitions. I don't really like to ask very specific questions about it, but I do think it's important that you know that RNA is divided into all kinds of different subgroups.

Again, the workflow for RNA sequencing is the same as for DNA sequencing. The only big difference is that when you acquire your samples, you extract the RNA instead of the DNA, and there is an additional step where you do reverse transcription of RNA to DNA. And of course, because we do RNA-seq, in the end we're generally not interested in the SNPs, so the variations in the genome. We generally want to extract the expression levels at the end. So instead of saying that at this position my sample is different from the reference genome, what you do here is say: I look at my gene of interest and count the number of reads that are there, and then I take the number of reads in sample one and compare it to the number of reads in sample two, to see if there's a difference in expression level. So the goal of RNA-seq is different from the goal of DNA sequencing: you want to get the expression of the different genes in the genome, while in DNA sequencing you generally want to look for variations in the genome.

We also looked at tools to predict the secondary structure of RNA. Generally you take the sequence and annotate groups of secondary structures, and this is all based on the lowest free energy structure. So it tries to fold the RNA in such a way that there's the least stress on the molecule.
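To get a feeling for what "folding" means computationally, here is a toy sketch. Note the swap: real tools minimise free energy with a thermodynamic model, while this example uses the much simpler Nussinov-style dynamic programming, which only maximises the number of base pairs. The function name, the `min_loop` value of 3 and the decision to allow G-U wobble pairs are assumptions for the illustration.

```python
def max_base_pairs(seq, min_loop=3):
    """Nussinov-style dynamic programming: maximum number of nested base pairs.

    min_loop is the minimum number of unpaired bases inside a hairpin loop.
    """
    pairs = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}
    n = len(seq)
    if n == 0:
        return 0
    # dp[i][j] holds the best score for the subsequence seq[i..j]; short spans stay 0.
    dp = [[0] * n for _ in range(n)]
    for span in range(min_loop + 1, n):
        for i in range(n - span):
            j = i + span
            best = dp[i][j - 1]  # option 1: base j stays unpaired
            # option 2: base j pairs with some base k, leaving a loop of at least min_loop
            for k in range(i, j - min_loop):
                if (seq[k], seq[j]) in pairs:
                    left = dp[i][k - 1] if k > i else 0
                    best = max(best, left + 1 + dp[k + 1][j - 1])
            dp[i][j] = best
    return dp[0][n - 1]


print(max_base_pairs("GGGAAAUCC"))  # 3: a small hairpin with a three-base loop
```

An energy-based predictor replaces the "+ 1 per pair" score with measured stacking and loop energies, but the structure of the recursion is similar.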
So we have things like RNAfold, which I think we had an example of, but there's also ContextFold and RNAshapes, and there are a lot of different tools. In the RNA lecture we also said that if you look at these short RNAs or at these long non-coding RNAs, they generally have a modular structure. That means that if you have a long RNA molecule, then part of it can, for example, bind RNA or DNA, and part of it can bind proteins, but parts of the RNA can also be conformational switches, and these things are built up in a modular way. So you can have a long non-coding RNA which has two conformational switches and a protein binding domain, or one which has a DNA binding domain, a conformational switch and then an RNA binding domain. Based on which kinds of structures we find in the RNA, we can figure out what the function of this RNA is. If a piece of the RNA is predicted to bind a protein, then of course we can infer that this RNA has something to do with proteins. So these long non-coding RNAs are modular: they're built up from different modules which are more or less mixed and matched together.

All right, in lecture five we talked about proteins. Here we have some nomenclature which I want you to know. An amino acid is a single building block. A polypeptide is a chain of several amino acids. Then we talk about an apoprotein, which is one or more polypeptides but without the cofactor, so for example without the zinc ion that is needed to bind the thing that it needs to bind, or the iron to bind oxygen when we think about hemoglobin. And when we talk about proteins, we talk about apoproteins with their cofactors. So hemoglobin is a protein, and when we say the hemoglobin protein we mean the four chains or the eight chains of hemoglobin. I think it has four.
So it has four chains, so four of these polypeptides, and within these polypeptides you have iron atoms which bind oxygen. Only then do we talk about a protein.

We also talked about chirality. Almost all amino acids are chiral, because the molecule in 3D cannot be put on top of its mirror image. That is what chirality means: you have a molecule and you have the mirror image of the molecule, and although these two have the same structural formula, they do not have the same 3D structure. Because of that, you can have one of them being very toxic and the other one being very beneficial. We also talked about the fact that in nature most amino acids are found in the left form, the L form, and the D form is generally not seen or not produced. Chirality becomes a big issue when you do chemical synthesis of medication, because in chemical synthesis the chirality doesn't matter to the reaction. The chemical process that we use is just A plus B gives C, but when C is produced, it's produced in both forms. So when you talk about amino acids, remember that they are chiral. Also remember that there is one amino acid which is not chiral, and that is glycine, because glycine has an H as the R group. The R group is the side chain which determines which amino acid we're looking at, and when R is an H, we are able to turn the molecule in such a way that we end up with the mirror image. So glycine, the smallest one, where the side chain is just a single hydrogen atom, is not chiral. You can draw it and then try it. There are also these boxes, actually.
So you have these snappy atoms, snappums or something like that they're called, and with those you can just build these amino acids. You have carbon atoms and you can stick things into them. So if you are interested in chirality, pick up one of these boxes of snappums or snap atoms. I don't know exactly what they're called, but you can build these molecules yourself, which is really fun.

When we talked about proteins, we talked about the fact that you have the primary sequence. The primary structure is just the amino acids in a row: glycine, valine, valine, leucine, isoleucine. What I pointed out in the lecture is that the primary structure of proteins is based on atomic bonds, and because some amino acids are able to form sulfur bonds, you can have primary structures which are not just a single line of letters. I think I showed you in the lecture two or three more complex primary structures, where you have two polypeptide chains which are connected together by a sulfur-sulfur bond. That is the difficulty in primary structures for proteins: unlike DNA and RNA, which are just a single, more or less straight line of letters, in proteins the primary structure already has other interactions. So primary structure is based on atomic bonding, and secondary structure is based on hydrogen bridging.
Then we have the tertiary and the quaternary structure. The tertiary structure is based on more or less all forces working on the molecule, and quaternary just means that we take the whole protein, so the different polypeptides together. Know that there are different computational tools to predict protein structure. There's ab initio prediction, where you just take the primary sequence and then try to predict secondary, tertiary and quaternary structure. We also have dedicated tools for secondary structure prediction, because that is something that we can do very well, while determining the tertiary structure from the primary structure is really hard. There are really good tools out there which can predict if there will be an alpha helix and if this alpha helix will go through a membrane, because these things are very common, so we know exactly what transmembrane alpha helices look like. There's threading and fold recognition, and there's homology modeling. So know that there are more or less five different schools of thought about how to predict secondary, tertiary and quaternary structure of proteins from primary structure. We also talked about the new AlphaFold from Google's DeepMind, which uses machine learning to do it. But again, in my view machine learning is kind of a form of homology modeling, because machines look at all kinds of examples and then learn how a protein folds based on those examples, and learning from an example means that you use homology.

We also talked about how you can separate proteins. There's 2D gel electrophoresis, which allows us to separate protein mixtures, and we separate using two different methods. The standard method is used for the y-axis of the gel, and that's the same for RNA gels, DNA gels and protein gels: here we separate based on size using an electric charge
and then on the other axis, the x-axis, we separate using a pH gradient. So we start off with a very low pH of something like 2, and then we end up with a high pH of something like 14. Water is around seven, so in the middle. Every protein comes with a charge, because the side chains give the protein an intrinsic charge, and that charge depends on the pH of the environment. The protein moves along the gradient until it reaches the pH where its net charge is zero, and that is its isoelectric point. So this second axis is based on the isoelectric point, and a protein has an isoelectric point because of its side chains.

We also talked about orthologs, paralogs, in-paralogs, out-paralogs and xenologs, and I want you to know what they are. I hope I explained it well, but it was at the end of the lecture, after a two and a half hour stream. So if I didn't explain it properly in the lectures, then do look it up online, because it is important. There will definitely be a question about the difference between an ortholog and a paralog. This has to do with gene duplication events and speciation events, so when a species splits into two species, or when a gene duplicates itself within the genome. I can't explain everything perfectly, because if I've been streaming for two and a half hours, then sometimes the quality of my thinking goes down. Besides that we have of course the xenologs. Xenologs are more or less pieces of DNA or proteins which are transferred from one species to another, so it's a horizontal gene transfer mechanism. We talked about something like four of them, and one of them of
course is just cloning or genetic engineering of bacteria. But bacteria also exchange DNA with other bacteria, so they make these little tubes and then they just exchange parts of their DNA with each other to increase survival for both of them.

All right, lecture six was about metabolites. We talked about endogenous metabolites and exogenous metabolites, so know the difference between the two. We also talked about primary metabolites and secondary metabolites. Primary metabolites mean that if you don't have them you more or less die instantly, while secondary metabolites are metabolites which you can go without. We also talked a lot about the mass spectrometry workflow. Mass spectrometry is four different steps. The first step is compound separation, which can be done using three different techniques. Two of them are chromatography techniques, using either a liquid or a gas as the mobile phase, and then we also have capillary electrophoresis, which again is very similar to how we separate proteins and how we separate DNA by their size. But here we use electrophoresis in a very narrow capillary, and in the capillary we slow down big proteins because they are big, and small proteins go through relatively quickly. After we have done the compound separation, we go to fragmentation and ionization, which means that the protein or metabolite that we're looking at gets fragmented into little pieces, and then each of these little pieces gets ionized, so they get a charge put on them. This can be one positive charge, or two, or three, or four. And of course we do this to be able to have the thing fly through the mass spectrometer, right? Because it needs to be charged to be attracted or to be shot out. And then we have the separation by mass over charge, right? We can do this using a sector instrument or a time-of-flight
instrument, and then we have the detection. The detection part is generally just the charged molecule flying against a metal plate, and then this is detected using a computer.

We also talked about KEGG. KEGG has pathway information, and it's based on a kind of input, protein, output model, right? So you have a metabolite, and then a protein working on that metabolite, transforming it into another metabolite. So it has these compounds and reactions. Yes, they have genes and proteins in there, but the unique selling point of KEGG is that it allows you to reason about which types of metabolites an animal can make and which types of metabolites the animal cannot make. We also have Reactome, which is a different database which is very similar, right? It also contains pathway information, it also has many different organisms. This one is open source; KEGG actually has a paid version and also a free version. But the difference between KEGG and Reactome is that KEGG is very much based in more or less chemistry, right? So we have a metabolite and a protein working on the metabolite, transforming it into something else, while Reactome is more holistic in a way, right? They have a pathway for RNA transcription or DNA duplication. So the pathways in Reactome are very similar to the pathways in KEGG, but they look at a slightly higher level. It's not metabolite, protein, metabolite; it's more conceptual.

We also talked about Cytoscape, one of these open source tools that allows you to visualize complex networks and integrate them with any type of attribute data, which of course means that it's used a lot in bioinformatics to show large gene networks or large protein networks. But it's also used in social network analysis, which means things like Facebook. You can use Cytoscape to visualize your friends and who are their friends, and
you can then use different attributes, so you can say, well, everyone living in Germany, color them green, everyone living in Poland, color them blue, and all of these things, right? So you can overlay all types of different data on top of your network, and that is why Cytoscape is really useful. It is also used a lot in the semantic web. When websites are presented to you it's just plain text, but you can use HTML tags to give meaning to parts of the text, right? So you can tell, for example, the search engine that Denny is a name, right, and it's a first name, while Arans is a family name, and then the search engine starts to understand what's going on, and it can build up a kind of internal network, saying: okay, so Denny Arans is a person and he works at this department, and this department has other persons working there. And then it can form a more comprehensive image of what is being displayed on a website, and this is called the semantic web, or web 2.0, I think. Nowadays people are talking about web 4.0. I got lost at web 3.0. For me it's all HTML, CSS and JavaScript, but there's apparently a difference between the World Wide Web now and like 10 years ago. I think the main difference is just that the spying has increased a lot.

So then we had lecture number seven, the introduction to R, and there will be no questions about this on the exam, because this is just a lecture to show you guys that if you want to have a career in bioinformatics you should definitely pick up at least one data science programming language. So it's really good to learn something like R or Python, which are more or less the two main languages that are being used in bioinformatics. But for you guys there will be no questions about this on the exam, so that's good, that means that you can just skip the lecture when you are studying for the exam.

All right, and then we went
all the way back, right? Because now we had discussed all of the different biomolecular levels. We started off on the lowest level, which is the DNA, then the RNA, then the proteins, then the metabolites, right? And then lecture eight was about phenotypes and how we do QTL mapping. So I talked to you about quantitative traits, so a little bit of a repeat of the first lecture, and about qualitative traits, which are more or less measured subjectively. I showed you this picture where we say that quantitative traits are a subset of all traits out there. So all traits out there are qualitative and quantitative together, but of course quantitative traits are a subset of the qualitative traits, and this subset is growing, right? The more machines we build, the better we are at expressing qualitative things in quantitative units. An example of this would be the taste or the quality of wine. That used to be a very qualitative trait, right? You would have a panel of wine tasters, everyone would taste the glass, and then they would score the wine, saying this is a good or a bad wine. But nowadays you just have a robot that does that, right? A couple of drops of the wine get put in the robot, and the robot analyzes the composition of the wine and then just gives it its score. So quantitative is growing, while qualitative is more or less shrinking. And again, I talked to you guys about Mendelian and complex phenotypes.

When we talked about phenotypes and QTL mapping, I taught you guys about the crossover events, right? So that we have meiosis one and meiosis two, and that we can do things like associate a region of the genome with a certain phenotype, or find a region of the genome where a phenotype is more or less controlled from, and that is only possible because we have these chromosomal crossovers, right? So that in meiosis
one, what we get is a duplication of the genomes that we have, and then we have the homologous chromosomes, which are more or less bound together, and then we have this crossover, where parts of one chromosome are exchanged with the other one, right? And then of course we have meiosis two, where we go from having two copies of each chromosome to having only a single copy of the chromosome, and these are generally called gametes. I also had two links there in the lecture, to two little movies where it is explained in much more detail and graphically, right? Because a movie can show you guys how this happens, that they align together and that they then get swapped around, and that's very difficult to catch in a picture.

When we talked about QTL mapping, I told you guys that you can only do QTL, so quantitative trait locus, analysis when you use an experimental cross, right? So you start off with, for example, two inbred founders who get crossed together. Then we get a generation which is called the F1 generation, and in the F1 generation everyone has one chromosome from the father and one chromosome from the mother, right? And there's no crossover here, or no recombination, because of the fact that the parental line had two exactly identical chromosomes. So the father had two exactly identical chromosomes, and for the mother the same thing. So of course crossover occurred, but this crossover had no effect. I think I even made a little drawing during the lecture showing how this works. But I want you guys to know the advantages and disadvantages of the different types of crosses that we discussed, so for example what a recombinant inbred line is, where you do this cross between the two founders and then create these funnels, and within the funnel you start brother-sister mating to make sure that you get immortal animals, which you can use forever and
ever. Well, they're not immortal, but they're like clones, right? So within a single RIL, so recombinant inbred line, one of these lines, you can mate a male with a female and then the children of these will be exactly identical. But there are of course problems there, because if you have a recombinant inbred line then there are only two states, either being AA or BB, so there are no heterozygous animals within the population. And because there are no heterozygous animals in a recombinant inbred line, you cannot estimate things like additive and dominance effects. You can only see that there's a difference between the two homozygous groups, but you get no information about the heterozygous group in the middle.

The backcross is more or less the same thing, but the backcross is really quick to make, because you cross two inbred animals, you get an F1 generation, and then this F1 generation is crossed back to one of the two parentals. So we have the advantage that it's really, really quick to do, because you only need two generations. But the problem with a backcross is that when you do the association you get large parts of the genome which are associated, and you have this imbalance between having only 25 percent of one allele and 75 percent of the other. And again, because individuals are only AA or AB, you get no information about dominance and additivity. The F2 cross more or less solves these things. It has the disadvantage of still having large regions, but it allows you to investigate additive and dominance effects.

Then QTL mapping and GWAS: we talked about the difference, and we talked about how they are very similar, right? They're both methods to find regions of the genome which control phenotypes, right? And the difference is that in QTL mapping you are able to map between the markers, because of the fact that you have a structured population, while in a GWAS you just have an outbred population, generally of humans, and
there of course you cannot know what is between the markers, but in an F2, for example, you can map in between the markers. And of course there's another big difference, and that's the way these results are displayed. In a QTL analysis you have a smooth line plot across the chromosome, and in a genome-wide association study you generally have the results presented to you as a Manhattan plot. During the lecture we saw examples of that. I also talked about effect size versus likelihood: the effect size is the difference between the AA and the BB group, and the likelihood is the statistical test when you compare individuals having AA versus individuals having BB. And of course here we also have to think about multiple testing, but that came back, I think, in another lecture as well. Multiple testing is of course the issue that when you do a lot of statistical tests, you have to compensate for the fact that you did a lot of tests.

Good. So I've been talking for around an hour, so we'll do a quick break, and then we will do the remaining five lectures and then go to the four example questions, so that you guys have an idea of what I ask and when. So let me set up, not the audio but the music. So yeah, I'll be back in like 10 minutes and then we'll just continue with discussing the different lectures. So five lectures left and then we're done, so it's going to be a very short lecture. I will see you guys in around 10 minutes. If you're watching this on YouTube, then probably see you tomorrow. So bye-bye for now.
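
As a small aside on the effect size and multiple testing points from the QTL and GWAS part, here is a minimal Python sketch. It is not from the lecture: the function names, the phenotype values and the p-values are all made up for illustration, and it assumes the simplest possible correction, Bonferroni, where the significance threshold is divided by the number of tests. The lecture only says that you have to compensate for doing many tests, not which correction to use.

```python
from statistics import mean


def effect_size(pheno_aa, pheno_bb):
    """Effect size at one marker: difference between the mean phenotype
    of the BB group and the mean phenotype of the AA group."""
    return mean(pheno_bb) - mean(pheno_aa)


def bonferroni_threshold(alpha, n_tests):
    """Bonferroni correction (an assumption here, other corrections exist):
    divide the significance level by the number of tests performed."""
    if n_tests < 1:
        raise ValueError("n_tests must be at least 1")
    return alpha / n_tests


def significant_markers(p_values, alpha=0.05):
    """Return the markers whose p-value is below the corrected threshold."""
    threshold = bonferroni_threshold(alpha, len(p_values))
    return [marker for marker, p in p_values.items() if p < threshold]


# Hypothetical phenotype measurements for the two homozygous groups at one marker.
pheno_aa = [20.1, 21.4, 19.8, 20.7]
pheno_bb = [24.9, 25.3, 26.1, 24.5]

# Hypothetical p-values, one statistical test per marker.
p_values = {
    "marker1": 0.04,
    "marker2": 0.0004,
    "marker3": 0.2,
    "marker4": 0.011,
    "marker5": 0.009,
}

print("effect size:", effect_size(pheno_aa, pheno_bb))
print("corrected threshold:", bonferroni_threshold(0.05, len(p_values)))
print("significant after correction:", significant_markers(p_values))
```

With five tests the threshold drops from 0.05 to 0.01, so a marker with a p-value of 0.04 would look significant on its own but no longer passes once you account for the number of tests. That is the compensation the multiple testing remark refers to.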