Welcome everyone, also if you're watching this on YouTube or viewing it later on Moodle. Welcome to the first lecture of Bioinformatics for Plant and Animal Sciences this year. So today we will just have a general introduction and an overview of the course, and then at the end we will have our assignment. The assignment for today will be setting up your software stack, which means installing all kinds of software that we need for bioinformatics. The main thing will be getting you guys to use a proper text editor, because we will do a little bit of coding. So for that you need a proper text editor, and we will be setting up version control. I hope that goes smoothly. We didn't do that last year, but I think it will be good that we have a project that we work on together. And if you want to work together with multiple people, you need version control and you need a way to collaborate. So I'm hoping that most of the students have joined by now, since it's two o'clock. If you're a student, just let me know in chat, then at least I have an idea of who the students are. I know who is not a student, but if you are here and you're a student, just say hi in chat or something like that, so I know which users are the students. All right, so let's start with slide number one. This is the overview for today. We will have some general course announcements and, "I'm a student, I'm a student," very good, very good, thank you for joining the stream. And that's three. I'm expecting a couple more, but we'll see. So for today we will do general course announcements, and I wanted to have a poll for the date and the time. "I'm Melima, I'm a student," welcome to the lecture. I got a couple of emails from people who said that the time is not really suited for them, and I think there's a course in parallel that some people might want to follow. So we'll have a little poll to set the date and time.
I don't mind switching to a little bit earlier or a little bit later. I want to give you a very short introduction to bioinformatics, just to get you guys warmed up on what we will be teaching and discussing in the coming lectures. And I want to give you a little example of microarrays and how bioinformatics uses microarrays to measure the expression of tens of thousands of genes. Then we will have a very quick overview of the coming lectures, so I will tell you what you can expect next week and the week after that. And then we will of course have the assignment for today, which, like I said, will be setting up your software stack, and this will be hard. I don't know a lot of people who are actually able to set up something like the version control system that we're using. It's a complicated system and it has a lot of little details that you have to get into. But I made a little introduction so that you guys can hopefully follow along and set it up. The problem is that I already have an account and all of the other things, so it's hard for me to show you how to set it up, because that would mean removing all of my setup. So I decided not to do that and instead just made a couple of slides to explain to you guys how to do it. And of course, if you have any questions, just throw them into chat and we can discuss them. That makes the lecture more fun, and it makes the lecture more interactive as well. So just some general announcements: slides and recordings will be made available on Moodle, and I will probably also put them on YouTube, and I might put the assignments on my website for people who are not students and do not have access to Moodle. I always say attendance is "obligatory", but I use air quotes, right? Because I can't force you guys to be here, and you're master students, I think, most of you.
But not all of you; I think we have a couple of PhD students attending as well. So I won't be taking attendance. But attendance is nice, right? And it will earn you a bonus. I don't know what bonus it will be, but if I realize that someone showed up for all of the lectures, I will probably not be as harsh grading the exam in the end. So like I said before, the lectures will contain practical exercises, and when possible we'll be doing them as part of the lecture. For the R course we did it the other way around: I would discuss the answers to the assignments for lecture one at the beginning of lecture two, and I don't think that worked out very well. I think a lot of people just tuned in for the lecture and didn't tune in for the assignments. So I just want to go through the assignments with you, and hope that that way, when you have questions, you can directly ask them instead of having to struggle with the assignments and then wait a week to get the answers. And of course the grading for the course will be by examination. So if you're here as a student, you will have a written exam at the end. It is still unclear if this will be an in-person exam or a digital exam. I have no idea what will be allowed. I know that it's allowed to do in-person lectures again, but that comes with a lot of additional work for us: we have to book a room and use QR codes to scan everyone, and I don't like that. So grading will be by examination, and I'm betting that by then we're at the end of February, beginning of March, I think. So then hopefully we won't be in lockdown, or the situation will just not get worse. But just so that you're aware. All right, so the first order of business for today is picking a date and time which is suitable for everyone. I would rather not change the day, because I like Thursdays. So I made some time slots that people can vote on.
Of course we can just keep the current time slot, which is option number three. So just throw into the chat which option you like. Option number one is from 10 to 1, option number two is from 12 to 3, and the last option is from 2 to 5, which is what we're currently doing. Lectures will be three hours; sometimes it's a lot less, sometimes the lecture is only two hours and the assignments are a little bit more, but we'll have to see. All right, so I'll start, and I vote for option number three, keeping it where it is. But if there are people who have scheduling conflicts, then this is your time to throw it in the chat and say, no, I definitely want option one, because there's another course I want to follow as well and it's an in-person lecture. All right, I'm not seeing anyone coming up with a suggestion yet, but we're just going to wait a little bit. I think there's a little bit of a delay on Twitch, so I will just keep talking. All right, the first vote is for 12 to 3, so that's from 12:00 to 15:00. That's one vote. Then we have another vote for options two and three, and one vote for option three. So currently option number two is probably the best, because for me it doesn't really matter: I can set up the computer and have my lunch an hour earlier if that suits you guys better. "Option two works for me," good. That's another vote for option two. All right, then I think it's clear-ish that people prefer 12 to 3. Option number two, okay, thank you. Perfect. All right, then it's settled: the next lectures will all be from 12:00 to 15:00. I will write it down and I will send it around. "I can't at 12, I got class from nine to 13." All right, so how about we move it one hour then? Is that an option for everyone? So from 1 to 4? Is anyone against 1 to 4? At least I have some feedback.
So I will mail it around, and I can make a poll on Moodle as well so that everyone can vote. All right, so from one, okay. Cool, then we do from 13:00 to 16:00, and if anyone really objects or says, well, I really can't... it's always hard picking a date with multiple people. But it's fine for me; I can do anything. And of course, you can always watch the lectures back later, but I do want you guys to have the option to ask questions. And then the next big question is: do you want to do the assignments in person? I would be okay with that, but it's a lot of extra work. We do have a room here in the building which is certified for the number of students that we have. But that would mean that everyone needs to be vaccinated or recovered from COVID; I think that the university is going for a 2G strategy. So that would be more difficult, but I'm not against it. And of course, if you have any questions about the assignments, you can always ask me by email; everyone knows my email by now. And not only that: if you want to talk about the assignments in person, that's also possible. Just send me an email and we can plan a time somewhere during the week to meet and do the assignments. Because there will be some assignments which are difficult, especially the programming ones. You don't have to have any previous programming experience, but of course it's a bioinformatics course, so you can't really get away with not programming at all. Which of course is the reason why we're here, right? You guys want to learn something new, and programming is one of the fundamental parts of bioinformatics. So it's not just using websites and databases and clicking around. But I would be okay with doing in-person assignments as well; it's just difficult to set up. And I think that it would work fine. But think about it, and if you have a strong opinion on it, then let me know.
All right, so a question to you guys. What do you think bioinformatics is? Because I always start with that question. And I want to make this a double question: what do you think bioinformatics is, and do you have any previous experience with bioinformatics? If you have any previous experience, like you followed a course before, or you did a programming course, these kinds of things, then just let me know. Because I do want to get an idea of who you are and at what kind of level I should pitch the course. If everyone has programming experience and everyone is perfect at programming Python or Java or something else, or everyone has done the SAS course during their bachelor, that makes it easier for me. Because if I know what you guys already know, then for you it will also be better, because I can drop slides covering what everyone knows. So I'm just going to wait a little bit; we're not in a rush today. The lecture today is actually very short. So just throw it in chat if you have any previous experience with bioinformatics or with programming, so then we can have a little bit of a discussion about what you think bioinformatics is or what you want to learn from the course, right? Because that's important for me. All right, first response: Xanaxin, no experience in programming and no experience in bioinformatics. Gini88, just some R and SAS in the bachelor. That's good, that will definitely help you, having a little bit of R experience, because I like using R more than SAS. R is free and open source, so you don't have to pay for a license and you can use it at home, while SAS you have to install at the university on the university license, so you can't use it once you finish. All right, Mainly Mina, am I pronouncing that right? SAS in the master's in biometry? OK, SAS and a little bit of R. All right, so at least people have a little bit of a basis in statistics then. That's good.
Then I don't have to tell you guys in detail what a multiple testing correction is, or how to do a t-test, or what a t-test does altogether. That's good to know. So that will help me a little bit to get rid of some slides and add some more advanced slides. Good, good. What does a bioinformatician do? In my mind, bioinformatics is very simple: you sit behind a computer, like the monkey here, and you help people analyze their data. And their data is generally data that they obtain from experiments on plants or on animals. So this is a picture of the monkey, and this is a picture of me behind my computer a couple of years ago; I have a bigger screen now. And of course, I have a webcam and other stuff set up to do this. But what a bioinformatician does is more of a support role, in a way. Of course, there are groups which focus purely on bioinformatics and develop novel algorithms, but that is rare. Generally, as a bioinformatician, you're embedded in a biology group. For example, me: I'm here in the molecular biology and breeding group. Most of our group are people who do wet lab biology. They go to the lab every day, they do experiments, they collect data. And with that data, they are either able to do their own analysis or not. And generally, when it comes to things like massively parallel sequencing, so novel techniques like DNA sequencing or proteomics, people who are wet lab biologists do not know how to handle that data. They have a lot of experience, and they know a lot about the model organism that they're working with or about a specific gene, but when it comes to more novel technologies, they don't know how to handle the amount of data that comes from them. And this amount of data varies a lot.
People always talk about the big data problem in bioinformatics, and the big data problem in bioinformatics is that you can do a sequencing run and end up with a couple of gigabytes of data. And of course, that's just for a single sample. Recently, in our group, we sequenced around 3,000 cows, and that ended up being around 10 terabytes of data. And analyzing 10 terabytes of data is a challenge on its own. There are a lot of tools out there, but still, just managing this amount of data and pushing it through all of the tools that are out there requires skill and finesse, and it requires training. So I think that's more or less what a bioinformatician does. A bioinformatician can be a pure bioinformatician who focuses on new method development, which is relatively rare and generally happens in groups focused purely on bioinformatics. But most bioinformaticians are people like me, who are embedded in a biology group and help the PhD students or the other postdocs handle the data that they are collecting. So it's a broad field. And there is always an option to suggest a topic for one or two lectures, because I know that you guys are master students and you are either very close to writing your master's thesis or you are starting a PhD program soon. So if you have a very specific topic and you think, I definitely want to learn more about that, then throw it in chat or send me an email with your suggestions. There are two lectures at the end, so one to two lectures; it depends a little bit on how Christmas falls, how many days we have off, and whether there's an overlap with holidays. But I always try to keep one or two lectures free. So if you have a very specific topic and say, well, for my master project I will have to analyze epigenetic sequencing data, or we are doing a ChIP-on-chip experiment.
So we'll go through all of the topics that we will be discussing during this lecture, where I have already prepared lectures. But if you're really interested in something and you think, that's something I definitely want to learn and it's not on the list, then send me an email with your suggestion, and this slide will come back a couple of times, just to remind you guys that at the end we have two lectures where you can come up with a topic. And generally, I like it when people provide a data set with it. So this year during the summer semester we had the R programming course, and the additional lecture that we did was based on fish data: a lot of data from fish caught in different swamps all over Germany. We just went through this data set and showed its peculiarities. So if you have a suggestion, then definitely don't be scared; just send me an email and let me know. All right, so some definitions, right? Because we're starting off with bioinformatics. The formal definition of bioinformatics according to Wikipedia is that it's an interdisciplinary field, which means that there will be a lot of biology in the lectures. There will be a lot of software tools. There will be software development, things like version control. All of these things tie into bioinformatics. Bioinformatics is really being able to talk to biologists and computer scientists, but also being good at just installing software and figuring out the errors. So the most basic definition that I can give of bioinformatics is that bioinformatics uses tools developed by computer science to answer biological questions. And of course, this involves processing tremendous amounts of data. And here you can form your own idea of what a tremendous amount of data is. A tremendous amount of data, for me, is when it doesn't fit on a hard drive anymore. So if you buy one of the newest hard drives, it's a four terabyte hard drive.
And as soon as the amount of data that you get from an experiment goes over four terabytes, then generally, people are lost. The data can't fit on their own computer, so they need someone to put the data somewhere in the cloud or on a server, or to use cloud tools or other things to analyze their data. And that's generally the point at which biologists get lost, which is OK, because as a biologist, you're not expected to analyze data or to be a professional data analyst or use data analysis tools. So just as a reminder: if in the exam I ask you to give a definition of bioinformatics, then the words that I write in bold should come back. So the answer should contain something like computer science, biological questions, large amounts of data. That's how I build up my slides. So some more definitions. We will be talking a lot about things like algorithms. For me, an algorithm is like a cooking recipe that we can follow to answer a certain question. Then there is data. So data is a set of values. Somewhat21, thank you for following; I actually forgot that that also comes with a little bit of a sound effect. I hope it's not too loud; otherwise, we can tone it down a little bit. So data is a set of values of quantitative or qualitative variables. And we will be spending a whole lecture talking about the difference between quantitative and qualitative. You didn't hear it, moderator? All right, then I will put the sound up a little, and I can give you guys a little bit of a sound effect. So let's just test it. Let's do something like, is that audible? Just testing. So data: we will be talking a lot about data and data acquisition. We will talk about the difference between qualitative traits and quantitative traits, and how this influences everything. OK, so people hear that. That's good. I like that. Free sound effects, I love them. Was that supposed to be a brown noise? I hope not. I hope not.
But it's one of them; I have crickets and birds as well. So if I ask a question, I'm probably just going to put on the cricket tune until someone answers, just to force you guys to give me an answer to the questions that I ask. So, there's a difference between data and knowledge. Knowledge is a familiarity, an awareness, and an understanding of something. And that's what a bioinformatician does: we use data, and we apply algorithms to get knowledge. And of course, we do this in silico, so we do this on a computer. So as a bioinformatician, you don't do things in an animal. You're not going to a lab and doing experiments. You're just sitting behind your computer gathering all kinds of data from all kinds of different sources, and then you use algorithms to combine these data sources and extract knowledge from already obtained data. All right, just some basic definitions so that we talk about the same things. I always say that if I hadn't become a bioinformatician, I would have done something like history. So if I weren't a full-time bioinformatician, I would probably be a history teacher and be teaching history. I love history, and I think it is really, really important to know where we come from and where we're going. So starting at the very, very early stages of the computer, we can say that the father, the person who invented the modern computer, is Charles Babbage. Charles Babbage lived from 1791 until 1871, and he is the inventor of the analytical engine. The analytical engine is a computer which never got built during his life. So he never had a computer; he just wrote down how a computer should work, and he made schematics which used cog wheels and these kinds of things to build a computer, an analog computer, not a digital one, to do computation.
So Ada Lovelace, and I always like mentioning her because she is the first ever computer programmer: she wrote algorithms for this nonexistent computer that Charles Babbage dreamt up. He dreamt up a machine which was able to do computation, and she wrote the algorithms for his computer. And Ada Lovelace is also the inventor of many programming techniques. A lot of people don't realize that up until the 1960s and 70s, computer programming was actually thought to be more in line with how women think. Because originally people thought, and I think this still is true, that computer programming is very creative, because you have to come up with new things, right? The computer itself is a static object. You know how it works; we know exactly how the thing functions: if you do this, the thing does that. But writing programs and coming up with novel algorithms is something that requires a lot of creativity, and this creativity was, at the time, generally more ascribed to women. And I do love the fact that she's generally considered the first computer programmer. She's credited with the first recursive algorithm, and she never touched a computer in her life, because there were no computers. It's like DNA: in genetics, we had an understanding of how inheritance worked, but we didn't even know what DNA was, which came like 60 years later with the discovery of DNA. So another very famous person is Alan Turing. There's a really good movie about his life; the movie focuses more on the cracking of the code of the Nazi cipher machine, and he was leading the group that did that. I think Cumberbatch played him in the movie. He's the father of theoretical computer science as well. In history, he's mostly known for cracking the German Enigma machine, but he is generally considered by computer scientists to be the father of theoretical computer science.
So of course, by the 1950s, there were... ah, The Imitation Game, thank you, moderator. So yeah, The Imitation Game, a really good movie to watch, but they don't really go into his contributions to the field of computer science. He is more or less the guy who defined what we can compute in a reasonable amount of time and what we will never be able to compute. And he did this by inventing the Turing machine. So the analytical engine was a machine that existed only in the minds of Charles Babbage and Ada Lovelace, who wrote algorithms for it. Turing machines are machines that you can actually build, although there are some limitations: the Turing machine needs an infinite tape of input, and of course there's no infinity in real life. But based on this schematic that he dreamt up, the Turing machine, we know what we can compute and what we will never be able to compute. So he defined what is a solvable problem and what is an unsolvable problem. Just to show you guys: the analytical engine was actually built in the 1990s, and there is a version of the analytical engine somewhere in the UK in one of the museums. You can see it there. And as you can see, it doesn't look like a computer, right? It doesn't look like what we nowadays consider a computer. But in the time of Charles Babbage, steel wasn't as good, so we couldn't make cog wheels small enough and strong enough to build this machine. With modern technology, we were able to build an analytical engine and to show that it actually works, and that it actually runs the programs developed by Ada Lovelace. So almost 200 years after the invention of the machine, we were able to build it in the real world. That's a really nice achievement. Next to it, you see a Turing machine, and a Turing machine is really, really simple.
It's an infinite amount of tape which contains symbols, zeros and ones. Besides that, the machine has a memory and it has a register for operations. So operations can be: move the tape forward, move the tape backward, read the number from the tape, write a number to the tape. And it's a very simple machine, but it allows you to reason about what you can compute and what you can never compute. So if you want to learn more about these subjects, I'm not going to go into much detail, but watch the movie with Benedict Cumberbatch if you want to learn more about his life, and there's always Wikipedia and other sources out there if you want to learn more about the history of computers. But we've been able to achieve so much in the last 50 years that had never been dreamt possible before, right? Even a hundred years ago, in the 1920s, computers weren't there, and the explosion of computing power that we have now is just incredible. And it continues up until this day. Nowadays people are working on things like quantum computers, and all of these things once were an idea in the mind of someone, until someone actually took up the job to make it a real thing. There's also a TV movie called Codebreaker which highlights more about the Turing machine. Oh, that's interesting, I didn't see that movie, but definitely watch that one. So, if we talk about computers, we always have to talk about the first computer. And depending on where I give my lecture, whether at a German university or at a university outside of Germany, I always have to switch this slide. Because if you're talking to Americans, Americans claim that they invented the first computer. They say that the ENIAC is the first real computer, which is not really true.
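The tape-plus-operations description above can be sketched in a few lines of Python. This is a minimal illustration, not any standard formulation: the rule-table format, the blank symbol "_", and the function name are all my own choices here.

```python
# A minimal sketch of a Turing machine as described above: a (conceptually
# infinite) tape, a head that moves left or right, and a table of rules.

def run_turing_machine(tape, rules, state, max_steps=1000):
    """tape  : dict position -> symbol (sparse, so the tape is 'infinite')
    rules : dict (state, symbol) -> (symbol_to_write, move, next_state)
            where move is +1 (right), -1 (left), or 0 (stay)."""
    pos = 0
    for _ in range(max_steps):
        if state == "halt":
            break
        symbol = tape.get(pos, "_")            # unwritten cells read as blank
        write, move, state = rules[(state, symbol)]
        tape[pos] = write                      # write to the current cell
        pos += move                            # move the head
    return tape

# Example program: walk right, flipping 0 <-> 1, halt at the first blank.
rules = {
    ("flip", "1"): ("0", +1, "flip"),
    ("flip", "0"): ("1", +1, "flip"),
    ("flip", "_"): ("_", 0, "halt"),
}
tape = run_turing_machine({0: "1", 1: "0", 2: "1"}, rules, state="flip")
print(tape[0], tape[1], tape[2])   # 0 1 0
```

Despite being this simple, a machine of exactly this shape is enough to reason about everything that is computable at all, which is what made it such a powerful theoretical tool.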
So the first real working, fully automated digital computer was built in 1941 by Konrad Zuse; you see a photo of him here working in his lab. And it is called the Z3. So not the ENIAC: the ENIAC was built in 1946, I think, so just after the war, to do ballistics calculations. But the Germans were already building computers in 1941, and this computer was used to study wing flutter. Wing flutter is the thing that occurs when you have an airplane and you go really fast, approaching the speed of sound: the wings of the aircraft start flapping up and down. This computer was built so that the Germans could study in more detail why this occurred and how they could modify the shape of the wings of fighter aircraft so that the aircraft could become quicker. It was destroyed in 1943 by an Allied bombing run on Berlin. A replica of the machine itself is on display in the Deutsches Museum in Munich. And it's called the Z3, so there was also a Z1 and a Z2, and I think one of the museums here in Berlin also has a Z2 replica; it's the museum with the airplane on its roof. So if you're interested in computer science and computers, definitely go and check out these machines. They're really, really interesting, just to walk around and see how everything is connected. To program this machine, you used punch cards. And depending on how old you are, you may still remember punch cards. When I went to the hospital as a little kid, the information that the hospital had on me would be on these little cards: the card contains punches, and these punches signify data and code. The Deutsches Technik Museum, yes, the Deutsches Technik Museum in Berlin, I think, has a Z2, which is not a fully automated digital computer, but it's very, very close. And it's an impressive machine to see. It fills almost a room, and it has very, very little memory.
Nowadays, a teddy bear which plays a little bit of music, or a card that you open which plays music, has more computational power than the original Z1 or the original Z3. All right, so the original was destroyed, and it worked just via punch cards. All right, my moderator says a restored but non-functional Z1 is on display in the German Museum of Technology. All right, good. So yeah, if you want a little trip to look at old computer stuff, then definitely go to the German Museum of Technology here in Berlin; and the Deutsches Museum in Munich actually has a Z3 and a Z4, which is already a very advanced machine. John von Neumann is another of the very famous people. Von Neumann lived from 1903 to 1957, and we see here a picture of von Neumann, but we also see his big invention: nowadays all modern computers are von Neumann machines. A von Neumann machine is a machine which follows this general design: it contains an input device, a central processing unit, a memory unit, and an output device. Of course, output devices and input devices have grown and are different now from how they were in the time of von Neumann, but he was the first one to say that you need to separate the memory unit, so the memory of the computer, from the central processing unit; and the term central processing unit was also coined by von Neumann. The control unit is the unit which controls the operation, so which operation goes when, and then we have an ALU, an arithmetic/logic unit, which does either computation or Boolean logic. So that was his big invention. In the 1950s, you could become a famous scientist just by coming up with a diagram like this.
I always think it's interesting that nowadays there's so much complexity, but in the 1950s the world was still simple enough that you could become famous with an idea like this: saying that the memory should be separated from the CPU, and that the CPU consists of two parts internally, a control unit which decides which operations go in which order, and an ALU which does arithmetic, so plus, minus, multiplication, and these kinds of things, as well as Boolean logic. Boolean logic is: true AND false is false, and true OR false is true, right? So it defines the AND, the OR, the NAND, and these kinds of operations. All right, that's all I wanted to give you as a history overview. These people, I think, have had a massive influence on our computers nowadays, so on how we see computers, how computers are built, how they are structured, and they're definitely worth checking out further. And of course, they're in the presentation so that I can ask you guys, for example, who is generally considered the first computer programmer, when we have the exam. Of course, bioinformatics is a merger of two fields, right? On the one side, we have computer science, and on the other side, we have molecular biology, and the advancements in molecular biology pushed computer science to do more and perform better, because we are currently generating more data every year, at an almost double-exponential rate. So that is becoming a problem, because biology is generating so much data that even with the current advances in computer technology, we can't keep up. But the term bioinformatics was actually coined in the 1960s. So people think bioinformatics is a very new field, and within science, bioinformatics is probably one of the newest fields, but it is already 60 years old. So in biology, sequencing is the molecular technique that drove bioinformatics.
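The Boolean and arithmetic halves of the ALU mentioned above can be sketched in a few lines of Python. This is a toy illustration only; a real ALU is a hardware circuit, and the function names here are just my own labels.

```python
# Toy sketch of the two ALU halves described above: Boolean logic gates
# and a one-bit arithmetic unit (a half adder).

def AND(a, b):
    return a and b

def OR(a, b):
    return a or b

def NAND(a, b):
    # NAND is simply "not AND"; every other gate can be built from NAND.
    return not (a and b)

def half_adder(a, b):
    """Add two single bits; returns (sum_bit, carry_bit)."""
    return (a != b, a and b)       # sum is XOR, carry is AND

print(AND(True, False))            # False  ("true and false is false")
print(OR(True, False))             # True   ("true or false is true")
print(half_adder(True, True))      # (False, True): 1 + 1 = binary 10
```

The half adder shows how the "arithmetic" side reduces to the "logic" side: addition of bits is nothing more than an XOR for the sum and an AND for the carry.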
Once we started more or less collecting sequence data of DNA sequences, RNA sequences, and protein sequences, we ran into a problem, because if you have a DNA sequence which is a million base pairs long, how are you going to publish that, right? How are you going to put that in a journal? Because in the end, in science, it's all about writing publications. So to support this, the computer scientists were kind of pushed by the biologists, saying, well, we generate so much data, is there no way for us to store this data? Some of the advancements in biology which pushed bioinformatics forward, or made bioinformatics a necessary field in biology, are, for example, the first sequencing projects. The first genome ever sequenced is that of bacteriophage MS2; a phage is a virus which infects bacteria. It is a very small single-stranded RNA virus, and it is the first completed genome sequence in the world. The genome sequence was completed in 1972, and the whole sequence of the phage, so this little virus which infects bacteria, is around three and a half thousand nucleotides. So it's very short. In 1977, bacteriophage Phi X174 was sequenced, and this is the first sequenced DNA genome. So strangely enough, although DNA is much more stable than RNA, in the beginning we were much more capable of sequencing RNA than sequencing DNA. It is around 5,300 base pairs long, and it is still used nowadays as a positive control in DNA sequencing runs all over the world. So in sequencing runs everywhere, and there are probably tens of thousands of sequencers active at any given point, the DNA from this little bacteriophage is added to the sample to serve as a positive control, to make sure that the sequencer sequences this phage properly. So that's kind of our gold standard.
In 1987, we have the invention of spotted microarrays, and later in this lecture we will talk about microarrays more, so I won't go into detail, but microarrays are tools which allow us to measure the activity of the DNA, so they allow you to measure how much a gene is expressed, and all of these things. In the 1990s, we have the development of the DNA shotgun sequencing technique, which means that we could handle longer and longer molecules, because based on the chemistry, you can start reading a DNA strand, but if you read 500 to 600 base pairs in, the quality becomes rather bad. DNA shotgun sequencing solves that by taking a whole DNA molecule, for example the human genome, chopping it up into little pieces which are 200 to 300 base pairs long, and then just sequencing these little fragments. So you first chop them up, and then later you reassemble them into a complete genome, and of course this is computationally very, very expensive, right? Just sequencing from the beginning to the end is easy, because you just get the sequence, but if you get millions and millions of little reads, then reconstructing the genome becomes a combinatorial puzzle, which of course you need a computer to solve. In 1995, we sequenced Haemophilus influenzae, a bacterium, which was the first free-living organism to have its entire genome sequenced, with around 1.8 million base pairs. And in 1995, the same year, microarrays were also more or less scaled down to single little glass plates, which allow you to measure not just a couple of genes, but all genes in the genome at the same time. So the miniaturization again pushed the data quantities obtainable by biology to new heights, which also meant that computers had to adjust and had to be able to cope with this new influx of data. In 2003, we have a major milestone, which is the Human Genome Project.
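That reassembly puzzle can be sketched in a few lines; this is my own toy illustration, a greedy merge of the longest overlaps, and real assemblers are of course far more sophisticated.

```python
def merge(a, b):
    """Join a and b at the longest overlap of a's suffix with b's prefix."""
    for k in range(min(len(a), len(b)), 0, -1):
        if a[-k:] == b[:k]:
            return a + b[k:]
    return a + b  # no overlap: just concatenate

def greedy_assemble(reads):
    """Repeatedly merge the pair of reads with the largest overlap."""
    reads = list(reads)
    while len(reads) > 1:
        best = None
        for i in range(len(reads)):
            for j in range(len(reads)):
                if i == j:
                    continue
                m = merge(reads[i], reads[j])
                overlap = len(reads[i]) + len(reads[j]) - len(m)
                if best is None or overlap > best[0]:
                    best = (overlap, i, j, m)
        _, i, j, m = best
        reads = [r for k, r in enumerate(reads) if k not in (i, j)] + [m]
    return reads[0]
```

For example, `greedy_assemble(["ATGGCG", "GCGTGC", "TGCA"])` reconstructs `"ATGGCGTGCA"` from its overlapping fragments; with millions of reads and sequencing errors, doing this well is exactly the computational challenge described above.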
So, the sequencing and assembly of roughly 3 billion nucleotides of the human genome. In total, we identified around 20,500 genes, which is very comparable to mice, which are my standard study subject. And the total cost was around 3 billion dollars. The cost here is a little bit of a discussion, because the Human Genome Project was originally started as a collaboration between many universities across the world, but in the end the company Celera actually finished much earlier, because they used this new whole-genome shotgun technology. So instead of putting long fragments into clones, then cloning them and sequencing them using standard sequencing techniques, they just fragmented the whole thing, threw it into a sequencer, and more or less reconstructed it later. A lot of people don't know this, but there was actually a kind of second human genome project, which is called the ENCODE project. The ENCODE project is one of the biggest achievements of the last 15 years or so. It is the Encyclopedia of DNA Elements, and it is the follow-up to the Human Genome Project, because data is not knowledge. Data is just data. And if you just have a genome sequence, if you just have 3 billion ACTGs in a row, then you still don't know how a human works, and you still can't really do anything with it. So to follow up on the Human Genome Project, the ENCODE project was there to look at how genes are expressed, and how variation in the expression of genes leads to the development of certain diseases. You can look the ENCODE project up online; they have a really good website where you can get access to all of this data for free. They provided a lot of the annotation of the human genome, and it is the reason why the human genome is currently one of the best annotated genomes in the world. A lot of genomes are just base pairs; if you're lucky, there's a prediction of certain genes.
If you're unlucky, you actually just get base pairs and there's no real annotation. But thanks to the ENCODE project, we know exactly which gene is located where in humans, and we also have a pretty good idea of things like: if the expression of this gene goes up, your chance of cancer goes down. So that's a really good knowledge base to work on. And it's the biggest achievement, I think, next to the Human Genome Project. The Human Genome Project itself was, of course, a major achievement, comparable to landing a man on the moon. But the ENCODE project actually makes it so that all of this data, which people paid three billion for, is useful and can be used. All right, so to summarize, as we're almost at the end of the first part of the lecture: why do we need bioinformatics? I told you that nowadays biologists are more or less data gatherers. They go into the lab, they extract DNA, send the DNA in for sequencing, and then they get a hard drive back with ACTGs on it which they don't really know what to do with. There are large amounts of biological data being collected; we now call this the big data avalanche or the big data problem. And remember that every year new technologies come to light, or new technologies are developed, which can collect much more data than the technologies before, right? We started off with spotted microarrays, which could measure like 10 or 100 genes, but nowadays we measure 20,000 genes with ease, across hundreds and hundreds of samples. And that's just in a time span of about 20 years. So the explosion of the available data is just massive. We need bioinformatics to do sequencing, to do expression analysis, proteomics, metabolomics, but also something called automated phenotyping, which is becoming more and more popular.
Automated phenotyping means, for example, that you have a system where a plant is put into a chamber and continuously every aspect of this plant is measured during its growth: the number of leaves, how green it is, how much water it uses; photos are taken. So automated phenotyping is just continuous surveillance of a plant or an animal, which of course also leads to billions and billions of data points during the life of that plant or animal. We also need bioinformatics to automate biological research. A lot of biological research used to be a single person in a lab underneath a hood doing their experiment, but nowadays we have pipetting robots and automated machines which do a lot of the steps which would normally have been done by a human. Furthermore, we're able to distribute all this data very easily via the internet. So as a bioinformatician, you can pump gigabytes of data from A to B, and of course this transfer of data, how data is transferred and how we make sure that data did not change during the transfer, is also part of bioinformatics. We also do large-scale statistical analysis nowadays on multivariate data sets. By that I mean that in the old days, you would measure one or two phenotypes, like the amount of grain that you got from a grain plant, and then you would find in the genome one of the regions which controls the yield of that plant. But nowadays we don't look at a single phenotype; we look at hundreds, if not thousands, of phenotypes at the same time. And all of these covariates that we are measuring are something that we have to deal with, and that makes for a massively complex analysis. So bioinformatics is not just computer science, and it's not just biology; there's also a lot of statistics that you will need to understand.
And of course, fast and cheap computing power has become available. Computing power becomes cheap because of something called Moore's Law, which says that the speed of a CPU doubles around every 18 months. To show you this, I have this really nice graph, from the 1970s all the way to 2020, listing all of the different CPUs which have been brought out. The scale here is a logarithmic scale, and you can see that the points form more or less a straight line on that logarithmic scale, which means exponential growth, right? So around every 18 months, the amount of available computational power doubles. The amount of sequencing data, however, grows double exponentially, so much, much faster than the amount of computing power. So there's a kind of conflict between the two: computers don't become faster fast enough for bioinformatics. Like I told you, the origin of modern bioinformatics is sequence. Sequence is the fundamental data type; everything that we do hangs on DNA, RNA, or protein sequence. And the sequence itself is the entry point for many in silico studies, right? We use sequence to figure out if there are differences in a gene, for example, or if a protein has mutations. In silico studies generally use the sequence to predict things like phenotypes of plants and animals. And the first algorithm, the main algorithm in bioinformatics, used to be sequence alignment. More and more algorithms are coming out, of course, but sequence alignment is still very, very fundamental, because if you do DNA or RNA sequencing using whole-genome shotgun sequencing, so by cutting the genome up into little pieces, then of course you need to align these sequences to a reference genome. The same thing holds for proteomics, right?
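As a sketch of what a sequence alignment algorithm actually computes, here is a minimal Needleman-Wunsch scoring function; this is my own toy example with arbitrary match/mismatch/gap scores, not a production aligner.

```python
def nw_score(a, b, match=1, mismatch=-1, gap=-1):
    """Needleman-Wunsch global alignment score via dynamic programming."""
    rows, cols = len(a) + 1, len(b) + 1
    # score[i][j] = best score for aligning a[:i] with b[:j]
    score = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        score[i][0] = i * gap          # align a[:i] against nothing
    for j in range(1, cols):
        score[0][j] = j * gap          # align b[:j] against nothing
    for i in range(1, rows):
        for j in range(1, cols):
            diag = score[i-1][j-1] + (match if a[i-1] == b[j-1] else mismatch)
            score[i][j] = max(diag,                 # match or mismatch
                              score[i-1][j] + gap,  # gap in b
                              score[i][j-1] + gap)  # gap in a
    return score[-1][-1]
```

For identical sequences the score is simply the length (`nw_score("GATTACA", "GATTACA")` gives 7), and the same dynamic programming table, with traceback added, yields the alignment itself.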
We can determine the order of the amino acids in a protein, and then look in the DNA of an animal to see where it is encoded, if it is encoded, and whether there are changes between the protein and the DNA level. So sequence is fundamental, and sequence alignment is the original algorithm in bioinformatics, which kind of made the point to biologists that we need to use computers, we need to use alignment algorithms. All right. I think that everyone knows what DNA is and what a DNA sequence is, but for a bioinformatician, this is a DNA sequence, right? What we see here is an electropherogram. If you send a little piece of DNA in for Sanger sequencing, you still get one of these. Even modern sequencers still work in a similar way: little dots on a glass plate which give you a color, and this color tells you which base was there. So here we can see on the bottom that we have a sequence which goes G-A-T-A-A, and so on. But in the end, this data is not exact, right? Data is generated by machines, and machines make errors. So when you have a DNA sequence, we generally store it in something called a FASTA or FASTQ file, and this file comes with information about the uncertainty. Because in the end, modern sequencers also use colors, and use lasers to excite the dyes; then they measure the color, and based on the intensity, they make a judgment call saying, well, there's a G here, or there's an A here. But of course the sequence is never perfect; you always have these areas where it is uncertain, where two colors run into each other and you cannot really decide which one it is. So we have something called the Phred score, which is a very old scoring scheme that allows us to encode how reliably a certain base pair was read, and how confident we are that this base call is really true. So this figure shows you Phred scores.
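This quality score, the Phred score, maps directly to an error probability via Q = −10·log10(P), and in FASTQ files each score is stored as a single ASCII character with an offset of 33 (the standard Sanger/Illumina 1.8+ convention). A small sketch:

```python
import math

def error_probability(q):
    """Phred: Q = -10 * log10(P_error), so P_error = 10 ** (-Q / 10)."""
    return 10 ** (-q / 10)

def phred(p_error):
    """Inverse direction: error probability -> Phred score."""
    return -10 * math.log10(p_error)

def decode_qualities(quality_line, offset=33):
    """Turn a FASTQ quality string (Phred+33) into per-base Phred scores."""
    return [ord(c) - offset for c in quality_line]
```

So a base whose quality character is `I` has Phred score 40, i.e. a 1-in-10,000 chance of being wrong.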
Phred scores work on a logarithmic scale, and when we have the DNA sequencing lecture, we will talk more about what Phred scores are and how they work. For now, I just want to tell you that even though we work on sequences, these sequences are not 100% correct. There are always things that we don't know, or areas where we're uncertain: could be an A, could be a G, right? And the Phred score tells us how reliable a call is. If we have a Phred score of 50, it means that we are 99.999% sure about the base pair, but if we have a Phred score of 10 for a certain region in the genome, this means that we are only 90% certain what the base pair sequence is there. And if we're completely uncertain, then we just include an N. We'll talk more about Phred scores later, but I wanted to make the point that everything we do is not set in stone; even though we write algorithms as if the data we get is 100% correct, everything comes with a kind of probability distribution. So where do we store all of this data? All of this data goes into GenBank. GenBank is the sequence database of the world. It is open access, and it contains an annotated collection of all publicly available nucleotide sequences and their translations into protein. Just to give you an example of how massive this data is and how small it used to be: in 1982, when GenBank started, there were 606 DNA sequences in there, around 680,000 base pairs in total. In October of 2020, we had 219 million sequences stored in GenBank, containing 698 billion base pairs (and yes, billion, not million; mind that the number scales differ between German and English). The data really has exploded: in roughly 40 years, we go from having 680,000 base pairs of sequence to having almost 700 billion base pairs of sequence.
Of course, this is divided across different species, but there are only DNA sequences in GenBank, and this is already a very old slide that I took. Here you can see that humans are the main thing being sequenced. The second most sequenced organism is the mouse, and then we have the rat; both of these are model organisms for humans. Only then do we find the first more or less agricultural species, which is the cow. So Bos taurus is the fourth most sequenced species in the world, and of course this is because cows are economically very important. Then we have maize, we have pigs, and we go all the way down to things like chickens, which are still sequenced a lot. But DNA sequencing focuses primarily on humans and on model organisms to understand humans. All of this data needs annotation. Annotation is additional information about a DNA sequence, like the location of a gene, the name of a gene, its structure, its function, and so on. Here, for example, we see the DNA sequence of myostatin, a gene that regulates muscle growth, but of course this alone won't teach us anything; no one can do anything with just a plain old DNA sequence. For that we need something like Ensembl. Besides GenBank, which just stores all the data plus some annotation on where it was collected, by whom, and from what species, Ensembl is the entry point to genomic annotation data. If you ever want to learn anything about a certain gene in a certain species, go to Ensembl: it is the place to find out where your gene is encoded, how many exons it has, in which tissues it is expressed, which proteins are encoded by it, and which variants are known. So Ensembl is your friend, and of course we will hit Ensembl multiple times during the lecture series.
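Ensembl can also be queried programmatically through its REST API at rest.ensembl.org. As a small sketch, this builds the URL for looking up a gene symbol; the `lookup/symbol` endpoint and the `content-type` parameter exist, but do check the Ensembl REST documentation for the details before relying on them.

```python
from urllib.parse import urljoin

ENSEMBL_REST = "https://rest.ensembl.org/"

def lookup_url(species, symbol):
    """URL for Ensembl's gene-symbol lookup endpoint. Fetching it
    (e.g. with urllib.request.urlopen) returns JSON describing the
    gene: its Ensembl ID, chromosome location, and description."""
    return (urljoin(ENSEMBL_REST, f"lookup/symbol/{species}/{symbol}")
            + "?content-type=application/json")

# Myostatin (MSTN) in cattle (Bos taurus):
url = lookup_url("bos_taurus", "MSTN")
```

The same pattern works for any species/symbol pair that Ensembl knows about, which makes it easy to script annotation lookups instead of clicking through the website.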
And of course, even more data is being generated. Nowadays we are in an era which we call the omics era, which means that we do genomics, so DNA sequence; epigenomics, modifications on the DNA; transcriptomics, which is messenger RNA and the expression of genes; proteomics, the whole world of proteins and associated things; metabolomics, which is metabolites; and now phenomics is coming in, which is the automated measurement of many, many different phenotypes. So these are the levels that you as a bioinformatician have to deal with, and from each of these fields you need to know a little bit, right? It's an interdisciplinary field, so you need to be able to understand computers, and you need to be able to understand biology, which of course has all of these different subfields; these are the most important ones, and all of them are coupled to sequence, which is the main data type. So I hope that in this hour I have convinced you of why we need bioinformatics: it is essential to understand how organisms function, and it is essential for connecting all of these different levels of sequencing data and protein data together, to create a holistic view of how an organism does what it does, and how input from outside, like the environment, acts on the organism and then creates things like disease or other economically interesting phenotypes. So the key tasks for a bioinformatician: data analysis and interpretation, storing data, managing data, and doing data visualization. That is what we do on a day-by-day basis. I told you data is not knowledge; analysis means going from raw machine output to understandable results, and we need to use statistics to know what is different and what has changed. But there's also a lot of data interpretation in bioinformatics, so using previous work and previous knowledge.
There are a lot of algorithms out there which do literature research and interpret literature; that too is part of bioinformatics: going through all of the publications automatically, having machine learning learn from all of the publications that are out there, and then in the end giving us an indication of what's going on at a holistic level. Data needs to be stored and managed. Don't be like this; be clean, right? Bioinformatics means that you are a professional, which means that you work in a structured way: like a biologist works in a structured way in a lab, a bioinformatician needs to work in a structured way on a computer. Why do we need that? We need reproducible research, we want to quickly find all of the relevant data again, and we want to make connections between different sources of data, right? We did some sequencing in the past, we now measured some proteins, and we want to know how these two things relate. But the most important thing, and it's a term that I don't think a lot of biologists are familiar with, is something called the bus factor. The bus factor is, very basically, how many people within a team need to get hit by a bus before all research stops. Generally this bus factor is not very high in science, right? There's the professor, and there might be one or two postdocs who know what's going on within an experiment. But hey, if the professor gets hit by a bus and dies, then there's no one to continue this research. So what a bioinformatician tries to do is prepare data in such a way that other people can continue working on it, to increase the bus factor, so that more people need to be hit by a bus before research stops. A little quote from my own PhD thesis: there is no good communication on big data without proper visualizations. So as a bioinformatician, you do a lot of visualization.
You make visualizations to communicate results to biologists, to bioinformaticians, and also to the general public, right? You're kind of an intermediary between different groups who all have their own specialties; you are not part of any of these groups, but you're the thing that glues them all together. All right, that's what I wanted to tell you for the first hour. I went a little bit over time; I hope you guys don't mind. Are there any questions so far? Remarks? Is everyone still awake? Sleeping? If not, then we'll have a short break. I will be back at 3:15, and in the meantime I've prepared some fun animated GIFs for you so that you can spend the time. I'm also going to put on some audio, just to make sure that the audio doesn't come back in the VOD track on Twitch, and so that I can learn a little bit more about how to use Twitch and how to stream. All right, if there are no questions, then I will see you in around 9 to 10 minutes. Enjoy the animated GIFs.