And now we should be all good to go. Let me check if the recording is running. Yes, very good. All right, so welcome, everyone, to the first lecture of Bioinformatics for Plants and Animals, something like that. It has a German name, but I always just call it the bioinformatics course. Today will be a short lecture. We didn't start at 8, we start at 9, which is good; I don't like starting at 8 in the morning, and I think for you guys it's not a very good time to start either. So what are we going to do today? There will be some general course announcements, which is really just some announcements. We will also have a little poll, so that you guys can decide on a new time slot for the lecture. Then there will be an introduction to bioinformatics, which will be a kind of short history. For people like Power of the Weakness, who actually followed the R course in the summer semester, there will be some repetition, which is fine, I think, because people only really learn by repetition. After the introduction we will have a short break, and then in the second hour we will do a microarray overview, where I will explain to you where a bioinformatician is involved in a microarray experiment. And then we will have an overview of the coming lectures, which will just show you what you can learn when you continue following the course. All right, so like I announced, we will first have a poll. I think that 8 to 11 is just a horrible time; people have to wake up early. Hey, Commando, welcome to the stream. So 8 to 11 is just a horrible time. I actually don't know why they cut my course by an hour. Normally we would have four-hour lectures, but apparently when it went onto Agnes, people made it a three-hour lecture. But that's fine with me; three hours is long enough, and usually the slides will only take two hours.
And then there will be an hour for exercises, so that we can do some exercises together. The way I think we should do this is: since I have no one in the Zoom meeting, I can just ignore that altogether. So just throw your preferred time slot into the chat, and I will tabulate the votes; then we will see which time slot is best for everyone. So let's just throw in some times. All right, that's the first vote for 14 to 17. A vote for 10 to 13. 14 to 17, second vote. 10 to 13. Good morning, Sandra: 10 to 13. Salma: 12 to 15. Commando: 10 to 13. Any more? You can do it, people; there should be more people in the chat. I'm registering 15 viewers at the moment. But if you want to remain the silent majority and not vote, that's fine with me. So currently, tabulating, we have four votes for the 10 to 13 slot. Misha doesn't really have an opinion. Okay, so then I think it is decided and we will go for the 10 to 13 time slot, which is perfectly fine with me; then at least I can have a sandwich. Power of the Weakness: "easier than the US election." All right, there's one person who has a lecture until 12 o'clock on Thursdays. Well, we won't change the day; I think that would be a little too much, unless everyone says, oh, I'm free on Tuesday, and then we can actually move it to Tuesday. All right, the fish guys have lectures until 12 o'clock. And I have lectures as well. So now it gets interesting again; never say that it's easier than the US elections, because it's actually turning out to be a little more difficult. So, the people that voted for 10 to 13, like Power of the Weakness, Lydia, and Commando: are you okay with moving it to the afternoon, to the 12 to 15 slot or the 14 to 17 slot?
All right, Power of the Weakness says yes, no problem. Then I'm making an executive decision here: it will be from two to five, which is really nice, because then when I finish streaming I can go home directly. Thank you all for the votes. 14 to 17 it is, and we'll keep it at that. Good, next slide. Like every time, the slides and the recording will be made available on Moodle. I always say that attendance is obligatory, and of course it's not really obligatory, because I can't force you guys to be here, but I think it's good to earn a bonus when you show up for most of the lectures. Generally, by following all the lectures, you will get one wrong answer on the exam ignored, which I think is a nice bonus. There are generally around 40 questions, and having one of those ignored can mean that you can still get a one (the top grade), which is really nice. Some lectures will contain practical exercises. The exercise for today is very easy; it shouldn't take you more than 10 minutes. But the other lectures will have many more practical exercises, and I'm still a little unsure how I want to do that, because from my experience with the R course, people who don't do the exercises either drop out halfway or fail the exam. So I'm very interested in people doing the assignments, the practical exercises which belong to the lectures. I might actually make quizzes on Moodle: you get a PDF with the assignments, and on Moodle you have to fill in, say, three of the answers, and after filling in three of the answers you unlock the rest of the lecture.
I don't like gating stuff behind these things, though, so it might be that you just have to fill in the answers to three out of five questions on Moodle and there's no penalty for not doing it. Or I might decide on a bonus instead: if you do all the assignments, you get a bonus, in place of having a wrong question on the exam ignored. But I think it's very important that people do the practical exercises. We're having them because this is a practical course: I want people to learn something and to have a new skill set when you're done, because that's what bioinformatics is, it's just a different skill set. All right, since we decided and I think everything is clear here, I'm not going to take attendance formally, but we will write down who's in the lectures and make sure that people who follow the lectures get the bonus. All right, so a question for you guys: what do you think bioinformatics is? Because bioinformatics is a massively broad... let me see, this is annoying, because now the bottom part of the slide is not really visible; let me move it a little. So, bioinformatics is a really, really broad field, and there are many, many different bioinformatics subfields, and I actually want to know from you guys, more or less, what your experience with bioinformatics is. Have you done any bioinformatics before? For example, did you follow the R course? The R course itself is of course just a programming course, but it's good that I get at least an idea of whether people have some programming experience. Do people, for example, know how to work with things like Ensembl? Are there people who did protein-protein docking predictions or those kinds of things? So it would be nice if you could just throw into the chat any experience that you have, or what you think bioinformatics is.
I really think that helps me plan the next lectures. If everyone says, well, I know what DNA is, I know how to use Ensembl, I know how to automatically download sequences and those kinds of things, then we don't have to go into that in detail. So it helps me streamline the coming lectures, so that I know what people already know. It stays eerily quiet in chat, so: no experience. Ah, someone mentions problem solving on a biological level, and statistical methods for biological data. So that's what you want to learn, or that's something you already know how to do. In my view, bioinformatics is more or less the use of tools that come from software development: using software tools, or big databases with biological data, to answer biological questions. And yes, I think statistics is a big part, but it's not the only part. There are a lot of fields in bioinformatics where statistics does not really play a role. For example, if I'm making video recordings of animals in a forest, then that is part of bioinformatics too, right? I have, say, 10 to 20 cameras, I have to get the videos from them, and then have a software program which analyzes which animals are in the forest. All right, so someone already has some experience with R and statistical methods. The thing I usually see is that when you ask people what a bioinformatician does, people think that bioinformaticians are always like this gorilla here, just using a computer. And of course not every biologist that uses a computer is a bioinformatician, just like not every bioinformatician is directly a biologist. All right, Otto Bay participated in the R course, has a bit of experience with R, and wants to learn to get information from Ensembl. Okay, that's good. I will write that down: get info from Ensembl.
And then you probably want to get that into R, right? All right, another one: R, and learning Python. That's really, really good; I think Python is a very strong second language to learn besides R. So, coming back to the slide: what does a bioinformatician do? Well, a bioinformatician is always sitting behind a computer, like this nice photo of me sitting behind my old setup. So there is a big computer component in bioinformatics, but there are fields of bioinformatics where you don't necessarily use a computer. Coming up with new algorithms, or with new ways to analyze data, can be done on paper, and a lot of this can be done before you even touch your computer. So, a couple of other slides. Like I said, bioinformatics is a very broad field, and I'm really interested in what you guys want to learn. If you have, for example, a project that you're currently working on, say microRNA discovery, then of course we can have a lecture about that. If you say, I'm working on protein-protein interactions, we can have a specific lecture about that. You don't have to come up with this now, but keep it in mind for when you hit a topic where you think, oh, this would be a really nice topic to have a lecture about. All right, Skorita: "I work with DNA sequences, mainly, but not much with proteins." So do you want a lecture that is more about proteins and how to work with things like protein sequences? Because we can accommodate that, right? That's the idea behind the course: since we're streaming it and doing it live, we can just talk about these things. All right, "Verwendung für Züchtung", the use of bioinformatics in breeding; that's a very good one, and that's also largely what the course is about. So we will do that.
There's already a lecture about QTL mapping, so how to associate parts of the genome with traits that are important, for example in the milk industry, like milk production, or in plants, to make plants more resistant to pests or make them yield more. So there is already a big breeding component, but I will write it down, breeding, and I will make sure that we focus a bit more on the breeding part. Good, Skorita answered: we also want to do something about proteins. There's already a lecture about proteins, but I will make it a little more applied, and I will make sure the protein lecture couples back to the DNA lecture. Because both are sequences, many of the tools that you use for DNA sequence analysis have very similar counterparts for protein sequence analysis. Of course, the special thing about proteins is that you also have to deal with their 3D structure, which is not at all obvious if you just look at the primary sequence. When we analyze DNA, we generally don't care about the 3D structure unless very weird things are going on; generally, we assume that when you talk about DNA, you're talking about a double helix. In proteins that's not the case, because proteins can take all kinds of shapes and forms, and that's interesting, and very important for the function of proteins as well. All right, so we have a couple of topics. I will go through the list of topics that I had set up at the end of the lecture. Those are all old lectures, and of course we can always change something when we want to. All right, so now a more formal definition of bioinformatics.
I just got this from Wikipedia, which is of course not the best source of information, but if you're just looking for a general idea of what something is, Wikipedia tends to be a very good tool. So I will just read it: bioinformatics is an interdisciplinary field. That means it connects things like biology with computer science, or proteomics with AI. So in bioinformatics there's a big interdisciplinary component. Bioinformaticians generally have a bachelor with a major in biology and a minor in, for example, computer science, but there are also a lot of people from physics, with a bachelor in physics, who do a master in bioinformatics. Physicists are generally very good programmers, or at least they get taught how to program during their physics studies, and they tend to end up in bioinformatics because physics is a really hard field in which to make a name for yourself, while bioinformatics is generally considered to be a little easier. The nice thing about bioinformatics is that you don't have to be the best programmer in the world to be a very good bioinformatician, and you don't have to be the best biologist in the world either. You just have to be able to connect these two fields and know how to talk to computer scientists on the one hand and to biologists on the other. So you're an intermediary between fields like biology, mathematics, engineering, and statistics. But the most important part is that you use tools that were developed in computer science to answer biological questions. And one of the core things in bioinformatics is that you're generally dealing with a tremendous amount of data.
So when we talk about bioinformatic analysis, we generally talk about data sets which are not easily analyzed in tools like Microsoft Excel or GraphPad, which is a nice tool in itself, but it's really just a way to make graphs. And likewise, Excel is a good way of storing some intermediate data, but not a very good way to run massive pipelines where you analyze sequencing data; you could, but it's generally not advised. So, some definitions, just so you guys know what I'm talking about. When I say the word algorithm, I generally mean something like a cooking recipe that you can follow to answer a certain question. In the R programming course that we had in the summer semester we used a much broader definition, but I think this one is sufficient when we're talking about bioinformatics. Then we have data, which is a set of values of quantitative or qualitative variables. We'll go into more detail about these definitions in other lectures, but qualitative means that you are ascribing a quality to something: this tastes good or bad, this smells good or bad, this has a certain color. And quantitative is something that you can measure, for example the weight of a mouse or the amount of yield that you get from a certain crop. Knowledge is something different from data, and that is something that often goes wrong in biology: many biologists think that data is knowledge, that as long as you collect enough data, knowledge comes by itself. But that's not the case; there's a big difference. Knowledge is generally described as an awareness or understanding of something. All right, Florian says: how to analyze ChIP-seq data would be really interesting. Okay, I will put ChIP-seq on the list.
There's currently no ChIP-seq lecture, because the current course structure follows the central dogma of molecular biology: DNA gives rise to RNA, which gives rise to proteins. When you talk about ChIP-seq, you generally talk about epigenetics, so stuff around the DNA, modifications of the DNA. So I will see if I can make a nice lecture about ChIP-seq data. All right, so knowledge: familiarity or awareness of, or an understanding of, a process or a system. Then, when we talk about bioinformatics, we often use the term in silico. When you do a prediction on the computer, it's an in silico prediction: something that is tested, or can be tested, on a computer. We can compare this to terms like in vivo, doing an experiment in a living animal, and in vitro, doing an experiment in a lab setting. So in silico means in a computer, in vivo means in a living animal, and in vitro means that you do it more or less in a lab setting: you remove the live part from the equation. All right, when we talk about bioinformatics, we always have to talk a little bit about history. If I hadn't become a bioinformatician, I would probably have become a history teacher, because I love history and I think history is important. There's this quote from John of Salisbury: standing on the shoulders of giants, we can see more than they. It's very important to know where we are coming from. So when we talk about bioinformatics, we have two histories. One is the history of computers. In the R course we went much further back; we started around 2000 years before Christ when we talked about computers. But for bioinformatics, I think we should keep it simple, and we start around 1800, and of course we start with Charles Babbage. Charles Babbage is someone who is called the father of the modern computer.
And that is because he designed something called the analytical engine. The analytical engine is a computer made out of cog wheels and ropes and pulleys, and it is a fully functioning computer: it has all the parts that a modern computer more or less has. It is not a machine that was built during his lifetime; it was a machine that he designed on paper, a theoretical machine. But Ada Lovelace, one of the first computer programmers, wrote algorithms for this computer. She took his schematic drawings of the analytical engine and designed computer programs to run on this machine. One of the nice things is that she is considered to be the first computer programmer in the world, and she also designed some of the first algorithms, including recursive algorithms. Recursive algorithms are, more or less, algorithms which call themselves. Recursion will pop up again in this course, and recursion is a very efficient way of solving certain problems. People that followed the R course already had more or less a recursion lecture, and I will point out in the coming lectures where recursion can really help us speed things up compared to just using standard iterative loops. Of course, the third person you have to mention when you talk about the history of computers is Alan Turing. Alan Turing is more of a household name, and he is the father of theoretical computer science. Theoretical computer science is the science of what you can compute and what you can't compute. This is very abstract, but to make it more physical, he developed something called a Turing machine. And again, a Turing machine, like the analytical engine, is not a real machine; it doesn't exist in real life, because you need an infinite tape for it, and of course infinity doesn't exist in the real world.
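Since recursion will come back in later lectures, here is a minimal sketch of the idea. The course itself uses R, but the concept is identical in any language; this example is in Python:

```python
def factorial(n):
    """Compute n! recursively: the function calls itself on a smaller input."""
    if n <= 1:                        # base case: stops the self-calls
        return 1
    return n * factorial(n - 1)       # recursive case: n! = n * (n-1)!

def factorial_iterative(n):
    """The same computation with a standard iterative loop, for comparison."""
    result = 1
    for i in range(2, n + 1):
        result *= i
    return result
```

Both return the same value, for example factorial(5) == 120; the point is that the recursive version mirrors the mathematical definition directly.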
But by using this Turing machine, we can reason about which things we can compute and which things we cannot. Alan Turing is probably mostly known for cracking the Enigma machine during the Second World War, but in computer science that is just one of his minor accomplishments; his major accomplishment is the invention of the Turing machine, this machine which allows us to reason about whether we can compute something or not. And of course there are questions which you are never able to compute. "What is the meaning of life, the universe and everything" is one of these, which is incomputable because it's not a well-defined question. So to show you these things: the analytical engine is on the left side. This is a replica made in the 1990s. In the time when Charles Babbage was living, they were not able to make this machine, because they didn't have nice fancy CNC machines, and the quality of iron and steel was not like the quality we have today, so it was very difficult for them to cut cog wheels small and precise enough to fit in this machine. And on the right side we see a kind of Turing machine. A Turing machine is a very simple machine which has a head that reads symbols on a tape, a memory to recall some of these symbols, and a kind of programmable component. Based on the input which you read on the tape, you do certain computations which are pre-programmed into the machine. But both of these machines are more or less theoretical machines. The engine was only built about 150 years after Babbage designed it, and it turns out that it actually works: if you build one physically, you can run the old algorithms that Ada Lovelace wrote for this machine and get an answer.
And for the Turing machine, an infinite tape is of course not possible, but there are very nice simulations online where you can simulate a Turing machine, and that will allow you to reason about what you can and cannot compute. All right, when we talk about the first real computer, there's a little bit of disagreement. If you ask Americans who made the first computer, the Americans will say: well, that is the ENIAC. But if you ask anyone else in the world, they should give you the correct answer, and the correct answer is that the first computer was built by Konrad Zuse in 1941. He's the inventor of the first working, programmable, fully automatic digital computer. You can see here Konrad Zuse, and you can see the Z3 replica in the Deutsches Museum in Munich. This machine was built to study wing flutter. Wing flutter is what happens when you fly very fast with an airplane: as you approach the speed of sound, the wings start to oscillate if you use the standard wing designs which were used before and during the Second World War. So they built this machine specifically to be able to design jet fighters that could go faster than the speed of sound. Of course there were also a Z1 and a Z2, but those don't count as real computers, because they weren't fully automatic and weren't fully programmable: they had part of their programming pre-baked into the machine, so they weren't general-purpose machines. The original Z3 was destroyed during an Allied bombing run on Berlin in 1943, so it only existed for two years. So how would you communicate with this machine? Communication happens with these kinds of things: punch cards. Punch cards are something which, if you are old enough, you may still remember from when you went to the hospital; they would give you a little card which contained your information and the things that would be done.
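Those online simulators all follow the same scheme. As a rough sketch, a toy Turing-machine simulator fits in a few lines of Python; the rule table and the little bit-flipping machine below are made up for illustration, and a dictionary stands in for the "infinite" tape by treating unset cells as blank:

```python
def run_turing_machine(tape, rules, state="start", blank="_", max_steps=1000):
    """Simulate a one-tape Turing machine.

    tape  : dict mapping tape position -> symbol (unset cells read as blank)
    rules : dict mapping (state, read symbol) -> (next state, write symbol, move)
    """
    tape = dict(tape)                     # don't mutate the caller's tape
    pos = 0
    for _ in range(max_steps):            # a real TM has no step limit
        if state == "halt":
            break
        symbol = tape.get(pos, blank)     # lazily-grown "infinite" tape
        state, write, move = rules[(state, symbol)]
        tape[pos] = write
        pos += 1 if move == "R" else -1
    return tape

# A toy machine that walks right, inverting bits, and halts at the first blank.
rules = {
    ("start", "0"): ("start", "1", "R"),
    ("start", "1"): ("start", "0", "R"),
    ("start", "_"): ("halt", "_", "R"),
}
tape = {i: bit for i, bit in enumerate("1011")}
final = run_turing_machine(tape, rules)
inverted = "".join(final[i] for i in range(4))   # "0100"
```

The rule table is the "pre-programmed component" from the slide; the head position and current state are the machine's entire working memory.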
So it's just a little piece of paper or plastic with little holes in it. You stick it into the machine, the machine has a mechanical read-out, and by pulling the card through the machine it reads the information on the card. I don't know if anyone is still old enough to remember this. The first time I went to the hospital, I can still remember having this card and taking it with me from station to station, and everywhere they would put it in one of these slots, the machine would read it, and then they would have an overview of my medical data. I don't think anyone uses punch cards anymore, but it was for a long time the main way of interacting with computers: before we had keyboards and mice, you would make these punch cards and then run them through the machine. There are some really funny YouTube videos of people using these old punch-card machines to do things like play Return to Castle Wolfenstein or Doom. So it's a really interesting way of programming. It's good to see that at least one person in chat still remembers the old punch-card systems. All right, the next very important person in the history of bioinformatics, or rather of computing, is John von Neumann. John von Neumann is very famous because he designed the architecture of the modern computer. Every modern computer that you have on your desk, including your mobile phone, should actually be called a von Neumann machine, but we don't call them von Neumann machines because "computer" is just easier. His idea was that a computer needs to consist of four fundamental parts. The picture that you see here, with the central processing unit, the input device and the output device: in the 1950s, this was enough for a very high-impact publication.
Just by drawing this little figure with a couple of boxes, he was able to make a name for himself. The four parts are these: you have to have an input device and an output device, which are there to communicate with the user, and besides that you need a central processing unit and a memory unit. The memory unit stores data, including intermediate data, and the central processing unit has a control unit, which tells the computer what the next operation to execute is. And there's an arithmetic logic unit, which on the one hand allows the computer to compute things like "what is five plus five", and on the other hand contains the logic part, which allows it to reason using Boolean logic. Boolean logic is like: if it rains, then there are clouds; but if there are clouds, it doesn't have to rain. You have different Boolean operators to reason about logic, to make predictions. We will come back to how the logic unit and the arithmetic unit work. Just remember: computers should really be called von Neumann machines, because of this man saying that a computer should consist of four parts. All right, that's about as far as I want to go; it's just a very brief overview of the history, and there are of course many more people involved in the development of computers. But I think that when you talk about computers and bioinformatics, you should at least mention these five people: Babbage, Ada Lovelace, Turing, John von Neumann, and Konrad Zuse. Of course there are many, many other things you can say about the history of computers, and there are also very exciting things happening right now which might greatly impact the future, like the development of the D-Wave machine, which is a quantum computer.
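To make the connection between the logic part and the arithmetic part concrete, here is a small illustrative sketch in Python (not how the lecture formally defines it, just an illustration): binary addition can be built entirely out of Boolean gates, which is essentially what an arithmetic logic unit does in hardware.

```python
def half_adder(a, b):
    """Add two single bits using only Boolean gates:
    XOR produces the sum bit, AND produces the carry bit."""
    return a ^ b, a & b

def full_adder(a, b, carry_in):
    """Chain two half-adders so a carry from a lower bit can be included."""
    s1, c1 = half_adder(a, b)
    s2, c2 = half_adder(s1, carry_in)
    return s2, c1 | c2     # OR: a carry can come from either half-adder

# 1 + 1 in binary: sum bit 0, carry bit 1 (i.e. binary 10 = decimal 2)
print(half_adder(1, 1))    # (0, 1)
```

Chaining full adders bit by bit gives multi-bit addition, so "five plus five" really is nothing more than Boolean logic applied many times.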
But I think that in general, by learning these five people, what they did, and how they contributed to the field of computer science, you get a good general idea. So do look up these people when you are studying the lecture and read a little about their lives on Wikipedia, because there might be exam questions about what they did and why they are so important. Now, bioinformatics is a much, much younger field. Computer science in general is considered to be a relatively young field as well, because it's only around 250 years old; in the scientific community, fields like chemistry and physics are considered very old, similar to mathematics, which is two to three thousand years old. But bioinformatics is very, very young. The birth of bioinformatics is more or less in the 1960s, when computers started to become available to people at universities. That's also when we first see the term bioinformatics appearing: the term was coined around 1960, when people in biology were thinking about using these newfangled machines that they had available at their universities to analyze data. So: biology together with informatics. All right, there are a couple of molecular techniques and accomplishments I want to go through which are really important. In 1972, we have the first bacteriophage sequenced; this is an RNA bacteriophage (MS2). A bacteriophage is a virus that infects bacteria, and this one is a little single-stranded RNA virus. Generally, when you search for what a bacteriophage is, you see this nice icosahedral structure with little legs. It docks to the bacterium and injects its RNA. The RNA replicates, proteins are produced, the RNA is duplicated, and at a certain point the bacterium is full of phages and bursts open, and all the new phages spread into the surrounding area and infect new bacteria.
So this was one of the first things that we sequenced completely. The whole genome of this bacteriophage is very small, only around three and a half thousand nucleotides, but it is the first completely sequenced genome, the first sequenced RNA genome. Then, in 1977, people sequenced the first DNA virus, bacteriophage phiX174. This is the first sequenced DNA genome, around five and a half thousand nucleotides long. This one is very interesting because it is still used every day, all across the world, as a positive control in sequencing experiments. If you have a big Illumina sequencer which can sequence something like ten human genomes a day, the positive control is still this little bacteriophage that was first sequenced in 1977. I always find it interesting that, almost 50 years after it was first sequenced, people still use the DNA of this phage as a positive spike-in for sequencing runs. Then we have to wait ten years for a very new thing to appear, and those are the spotted microarrays. The spotted microarrays are a major step forward in biology, because they allow us to get a read on genes and the expression of genes. If you have an organism, you have the DNA level, which encodes all the information, and then you have the RNA level, which transports this information into the protein world, where proteins are the effector molecules. Using a spotted microarray, we gained the ability to measure the activity of many genes at once. They used to be spotted at the university itself: my old university had an old microarray spotting machine in the basement, and this machine would take little pieces of DNA, put them on a little glass plate, and then you could use this glass plate to analyze the expression level of genes.
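On a classic two-colour spotted array, each spot is hybridised with two samples labelled with different dyes, and the quantity of interest per gene is usually the log2 ratio of the two intensities. A toy sketch in Python, where the gene names and intensity values are made up for illustration:

```python
import math

def log2_ratios(treatment, control):
    """Per-gene expression change on a two-colour array: log2(treatment / control).
    Positive values mean up-regulation, negative values down-regulation."""
    return {gene: math.log2(treatment[gene] / control[gene]) for gene in treatment}

# Hypothetical background-corrected spot intensities for two genes
treatment = {"geneA": 800.0, "geneB": 200.0}
control   = {"geneA": 400.0, "geneB": 400.0}

ratios = log2_ratios(treatment, control)
# geneA: log2(800/400) = +1.0 (twice as active); geneB: log2(200/400) = -1.0 (half)
```

The log2 scale is convenient because doubling and halving become +1 and -1, so up- and down-regulation are symmetric around zero.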
So this was a major step forward, because now we could really see the activity of the genome. Instead of looking at static DNA and protein sequences, we were now able to see what a genome was doing live. Under different conditions you can do microarrays and get an idea of which genes are more active when, for example, there is salt around, or when the temperature is up, and this of course has led to a massive improvement in our understanding of how biology works and what the functions of many genes are. In the 1990s, DNA shotgun sequencing was developed, and shotgun sequencing is one of the most used technologies nowadays. It consists of chopping up a massive genome, like a human genome, into very small pieces, sequencing all of these small pieces, and then reassembling them. This can be done in parallel, and that is the big advantage: normally, when you do sequencing, you sequence base pair by base pair, but here you chop the genome up into small pieces and sequence all of those small pieces at the same time. So instead of having to go base pair by base pair through the whole genome, you have a million little pieces, and each of these million little pieces is read base pair by base pair, meaning that you can go a million times faster than with normal sequencing methods. That is also why it became the default sequencing method. Then in 1995, the bacterium Haemophilus influenzae was sequenced, which was of course a major advance, also for vaccine research. It is the first free-living organism to have its entire genome sequenced, and that genome consists of around 1.8 million nucleotides, DNA of course.
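The shotgun idea of chopping, sequencing, and reassembling can be sketched on a toy scale. This is only an illustration of the principle, not a real assembler: the "genome" is a made-up 24-base string, the "reads" are overlapping 8-mers, and the reassembly is a simple greedy merge by the largest suffix/prefix overlap.

```python
import random

# Toy shotgun sequencing: chop a tiny invented "genome" into overlapping
# reads, shuffle them, then greedily merge reads by largest overlap.
genome = "ATGGCGTACGTTAGCCGATAACGT"

# Simulated reads: 8-mers starting every 4 bases, shuffled so the
# assembler cannot rely on their original order.
reads = [genome[i:i + 8] for i in range(0, len(genome) - 7, 4)]
random.seed(0)
random.shuffle(reads)

def overlap(a, b, min_olap=3):
    """Length of the longest suffix of a that equals a prefix of b."""
    for k in range(min(len(a), len(b)), min_olap - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0

def greedy_assemble(reads):
    """Merge the pair of contigs with the largest overlap until none remain."""
    contigs = list(reads)
    while len(contigs) > 1:
        best_k, best_i, best_j = 0, -1, -1
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    k = overlap(a, b)
                    if k > best_k:
                        best_k, best_i, best_j = k, i, j
        if best_k == 0:
            break  # no overlaps left: keep the contigs separate
        merged = contigs[best_i] + contigs[best_j][best_k:]
        contigs = [c for n, c in enumerate(contigs)
                   if n not in (best_i, best_j)] + [merged]
    return contigs

print(greedy_assemble(reads))  # reconstructs the original toy genome
```

On real data, repeats longer than the read length make this greedy approach misassemble, which is exactly why genome assembly is a hard bioinformatics problem rather than a ten-line script.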
1995 also saw the miniaturization of microarrays for real genome-wide expression studies, which means that instead of measuring 100 or 150 genes, you are now able to measure 20,000 to 50,000 genes in one go. It also allows you to look at, for example, things like splicing, where a single gene produces different mRNAs which produce different proteins; by looking at the different parts of the gene, you can get an idea of which of the different transcripts of a certain gene are expressed. Then 2003 brings the next milestone: the Human Genome Project was completed, and this is one of the major advances in biology. I cannot overstate how important this is. If you think about the Human Genome Project, it is on par with something like the Manhattan Project: it had more or less the same impact on biology as the Manhattan Project had on physics. Over the course of 13 years, sequencing and assembly of around three billion nucleotides was done. Before that, we did not know how many genes humans had, but after sequencing and annotation we identified around 20,500 genes in the human genome, which means that a human is very comparable to a mouse when it comes to the number of genes. The total cost of the whole project is estimated to be somewhere between 3 billion and 10 billion, with 3 billion being the lower-bound estimate, but it is one of the main accomplishments of the new era. So when we talk about the 2000s and up, the completion of the Human Genome Project is one of the major, major advances, and it is something that is being used every day. When you talk about bioinformatics and human medicine, having the Human Genome Project completed allowed us to do things we could not have done before. All right, so why do we need bioinformatics?
Well, in my view we need bioinformatics because currently biologists can be considered data gatherers. If I look at biologists, they generally tend to do experiments, experiments, experiments, so they are collecting a lot of data, but they are unable to deal with these large amounts of data. All right, a question from the audience: would quantum processors have a huge impact on the cost of genome sequencing? No. The major cost of genome sequencing at the moment is the chemicals; the bioinformatics part after sequencing is relatively cheap. The only thing quantum processors could help us with is processing the amount of data more efficiently, but that won't really change the cost, because the major cost of sequencing is the chemicals you need, which I think is around 80 to 85% of the total. The machine itself is very expensive, but it lasts for many years, so you can spread its cost over many, many different sequencing runs. So I really think quantum processors would not help the cost of sequencing. They would help with the speed, and probably with the accuracy, but they would not really impact the cost, and the cost of sequencing is already very low. If you want to get yourself sequenced, you can probably do that for 200 to 250 euros currently, which is a steal if you want your whole genome done. Things like ancestry.com or 23andMe offer to SNP-chip your genome, which means measuring variations, generally 200,000 to 500,000 variants across the genome. But you can have your whole genome sequenced for more or less two or three times that cost, and then you have all the base pairs, all roughly three billion of them, which is more information, although things like structural variations also come into play then.
So, like I said, biologists currently are data gatherers: they collect a lot of data. There are many biologists who say, oh, I want to do genome sequencing, and then we are going to do microarrays for gene expression analysis in five different conditions, using six different species, but they have no real idea of how to analyze it. So they always have to collaborate, either with people from computer science or with bioinformatics groups within the biology department. Large amounts of biological data are collected, and I am thinking about things like DNA sequencing and RNA-based expression analysis. Nowadays, with proteomics it is also possible to measure thousands of proteins at the same time, and with metabolomics, the stuff which is not protein, not RNA, and not DNA but still part of the cellular machinery, we are able to automatically measure thousands of metabolites in a single run. And of course we nowadays have automated phenotyping. That means, for example, that if I am a plant scientist, I have a conveyor belt; I put my plant on the conveyor belt, the plant is scanned using cameras, and traits such as the size of the leaves and the size of the plant are automatically determined. All of these things are automatically captured and dumped into a massive database. So we are talking about gigabytes of data being collected from a single sample, which means that when you are analyzing a thousand plants, you end up with a couple of terabytes of data. Generally, as a normal biologist, you are not able to deal with hard drive after hard drive full of data being thrown on your desk. You need tools and skills to analyze all of this data and to go from data on the one hand to knowledge on the other, which is what we want, because in the end it is about pushing knowledge forward. So why do we need bioinformatics?
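The gigabytes-per-sample claim is easy to turn into a back-of-the-envelope calculation. The per-plant data size below is an assumed, illustrative number, not a figure from the lecture.

```python
# Back-of-the-envelope data volume for an automated phenotyping pipeline.
# The per-sample size is an assumption chosen for illustration.
GB = 10**9
TB = 10**12

per_sample_bytes = 2 * GB   # assume ~2 GB of camera scans per plant
n_samples = 1000            # a thousand plants on the conveyor belt

total_bytes = per_sample_bytes * n_samples
print(f"total: {total_bytes / TB:.1f} TB")  # a couple of terabytes
```

Even at these modest assumptions, a single experiment lands in the terabyte range, which is exactly the scale a spreadsheet-based workflow can no longer handle.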
Well, there is a lot of automation in biological research. I always say: if you are studying biology now and you want to work in a lab, then that job will probably end in around 20 years. There is so much robotization going on in big pharmaceutical companies and also biology companies. If I think about major breeding companies like KeyGene, there are not that many people in the lab anymore; 90% of the research in these big companies is already done by robots. So if your major skill is "I am very good at pipetting", that is a skill which is still useful, but in 20 years it will probably be obsolete, because robots will just be much better, much more precise, and much more reliable than a human can ever be. We also need bioinformatics to distribute all of this data via the internet, which is becoming a bigger and bigger problem. Nowadays, when you have large-scale sequencing done in China, they will send you hard drives; they won't give you an FTP site where you can download the data. They do not transmit this data over the internet because it is simply too big, and it is an active field of research in bioinformatics how to prepare and change the internet in such a way that we can transmit these massive amounts of data instead of having to ship hard drives around. One of the other things which is really useful is that computing power is getting fast and very cheap at the moment: you can get a 48-core machine with terabytes of memory for less than 10,000 euros, so there is a lot of computing power available. People always point to Moore's law, which says that computer speed doubles every 18 months, but this is not true anymore; in the last couple of years, companies like Intel and AMD have said that Moore's law no longer holds. One of the other laws that exist is the law of sequencing, and that shows a more-than-exponential growth: we are able to
sequence more today than we were able to in the entire five years before. So there is a more-than-exponential growth in the amount of sequencing done, while computing power only doubles every 18 months, meaning that as time progresses we are less and less able to analyze all of the data coming out of sequencers, simply because computers are not evolving fast enough compared to sequencing technology. So, coming back to the earlier question: this is where quantum processors would help a lot, to catch up in this battle between sequencing on the one hand and the analysis of sequencing data on the other, where the analysis part lags further and further behind as time continues. I have talked a lot about sequences, and the sequence is the origin of most of bioinformatics. In bioinformatics, the sequence is the fundamental data type, just as the integer, the float, or the pointer are the fundamental data types of programming. In bioinformatics we always start with the sequence; the sequence is what we hang everything else on. The entry point for many in silico studies is simply getting the sequence of the organism you are interested in, and this is of course because the sequence drives everything else: the DNA sequence you have determines which RNAs you can produce, which determine which proteins you can produce in the end. That is why DNA, RNA, and protein sequences are fundamental to any analysis you do in bioinformatics. And of course, whole-genome shotgun sequencing means that sequence alignment is the fundamental algorithm. Comparing sequences and saying these sequences are similar to each other, or these sequences are different from each other, that is in a way what bioinformatics does. So sequence alignment is one of the major fundamental algorithms, and we will have a whole lecture about sequence alignment: how it is done, and what the drawbacks and advantages are of some
of the algorithms used there. All right, I am looking at the clock; we are at 55 minutes, and I know I have to stop the recording at around 60 minutes, otherwise the file will be too big for Moodle. So what is a DNA sequence? Let's get a couple more slides done before we take a little break. For me, a DNA sequence is an electropherogram. When I started, this was the way we looked at sequencing; of course we already had whole-genome sequencing, but when I started off and sent something in for sequencing, this is more or less what I got back: a profile where on one axis you have the intensity of each dye color and on the other axis the position along the read. You can see here a very common thing, a little peak: this means there was binding of a C, meaning that the original sequence here had a G. This is how DNA sequencing was being done when I started, and you still get these things: if you do an experiment and you have a little piece of DNA after your PCR reaction and you want to send it in for sequencing, many things are still sequenced this way. You just get one of these pictures back, and in this picture you can see, okay, this had to be a G, an A, a T, an A, and that is how you read a sequence. The height of these peaks also more or less determines the confidence that you have, and there are many drawbacks here as well, but even whole-genome massively parallel sequencing still uses colors and optics to determine what the sequence of the DNA is. So it is still very relevant, not as relevant as it was in 2010 when I started, but it is still the way to look at DNA sequences and at the quality of sequencing results. All right: since 1982 there is something called GenBank, and GenBank is the place to store sequence data. Just as a little comparison: in December 1982 we had 606 DNA
sequences stored, and the total storage in GenBank was around 680,000 base pairs. If we look at GenBank now, the last statistics from October 2020, we have around 219 million sequences stored there, with a total of around 700 billion base pairs (700,000 million), which is just an insane amount of sequencing data. Just to get a feeling for how much this is: stored as plain text at one byte per base, 700 billion bases is about 700 gigabytes, so you need a full hard drive just for the raw bases (and in principle a base only needs two bits, so you could even pack four bases into a byte). That may not seem like a lot, but remember that all of these base pairs also come with things like quality scores and annotations, and all of that needs to be stored too. This is really one of those numbers that grows exponentially; if you want to see the curve or want more information, you can click on the link there and see how this exponential curve grows from 1982 up until 2020. It is really a big wave, and if you then plot Moore's law, the speed of a computer core, you can see that GenBank is growing much, much faster than the average speed of the CPUs used to analyze this data. Just an overview of which species are in GenBank: we have humans, mouse, rat, and cow, and those are the main species in GenBank. Of course, all of this data, all of these stored base pairs, mean nothing if you do not have annotation; if you do not have knowledge projected onto DNA sequences, DNA sequences are more or less useless. But it is a nice little overview: humans and mice are the most important animals, closely followed by rats and cows; rats because they are very much used in medical research, and cows because they are one of the main species used in agriculture; and also Zea mays and Sus scrofa
So Sus scrofa are pigs; they are a very important species for our production systems, so a big share of the sequences target these species. All right, I am going to stop the recording and take a little break. We are not exactly at the point where I expected to be, so we are a little behind, but I think we can catch up in the second hour. Are there any questions so far? Then just throw them in the chat and I will answer them when I get back. In the meantime, you can enjoy the first break slideshow that I have for you, and I will be back in 10 to 15 minutes.
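Since sequence alignment, mentioned above as the fundamental algorithm, will get its own lecture, here is only a minimal sketch of the core idea behind global alignment: Needleman-Wunsch style dynamic programming. The scoring scheme (match +1, mismatch -1, gap -2) is an illustrative choice, not one taken from the lecture.

```python
def nw_score(a, b, match=1, mismatch=-1, gap=-2):
    """Global alignment score of sequences a and b via dynamic programming."""
    rows, cols = len(a) + 1, len(b) + 1
    dp = [[0] * cols for _ in range(rows)]
    # Aligning against an empty sequence costs one gap per character.
    for i in range(1, rows):
        dp[i][0] = i * gap
    for j in range(1, cols):
        dp[0][j] = j * gap
    # Each cell: best of diagonal (match/mismatch), up (gap), left (gap).
    for i in range(1, rows):
        for j in range(1, cols):
            diag = dp[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            dp[i][j] = max(diag, dp[i - 1][j] + gap, dp[i][j - 1] + gap)
    return dp[-1][-1]

print(nw_score("GATTACA", "GATTACA"))  # 7 matches -> 7
print(nw_score("GATTACA", "GATTA"))    # 5 matches, 2 gaps -> 1
```

The full lecture on alignment will cover the traceback that recovers the actual alignment, plus the drawbacks (quadratic time and memory) that motivate the heuristic aligners used on whole genomes.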