So welcome everyone. If you're watching this on YouTube, or on Moodle later or in a couple of years, welcome back to the phenotype lecture for bioinformatics, part number three. We looked a little bit at the IMPC database. It's a really great database if you have a favorite gene, or if you are going to work in a group which is interested in a specific gene, and it's just a generally interesting database to browse around. All of the data, like I said, is available for free, making it one of the most valuable resources for studying mouse genetics. But not just mouse genetics, of course: if a gene does a certain thing in mice, the homologous gene will often do a very similar thing in humans.

I told you guys that my favorite gene is Bbs7, because that's the gene we study the most, so just a quick overview. When we look at the Bbs7 gene, we see that there is a very significant association with this little skull icon: knocking out this gene has a significant influence on mortality and aging. That is because mice which are homozygous for the knockout are not born. In the embryonic phase of the mouse this gene plays a very important role, and if the gene is broken, the mice don't get born. So the IMPC only has phenotypic data for heterozygous mice, which have one working Bbs7 copy and one broken copy. You can see that they haven't measured a lot of phenotypes yet, but one that they have measured is, for example, the eye and vision. If you remember the picture that I just showed you guys, let me pull it up again, the fat mouse: you can actually see that the fat mouse has these kind of reddish eyes. That is something we never really studied, because it's a mouse model that we use for juvenile obesity. But it turns out, and we kind
of knew this but weren't sure, that this mouse also has problems with its eyes; we actually figured that out from the IMPC database. So it's not just a mouse model that you can use to study obesity, you can also use it to study the eye and vision. The Berlin Fat Mouse is really interesting, because the full knockout of the gene is lethal, so no homozygous mice with two broken copies are born. The Berlin Fat Mouse, however, has a couple of mutations in the gene which do not destroy the whole gene function but modify it. We always found that very interesting. And of course the gene also has a significant influence on adipose tissue. That's something we already knew from our mice, but it's good to see it confirmed by an independent group. We found it by doing a genome-wide association study: by generating a big population of mice and then mapping the phenotype back to the genome, we discovered this gene. The IMPC did it the other way around: they knocked out the gene and then saw that these mice have abnormal lens morphology, but also an abnormal fat mass.

Okay, so that's all I wanted to say about the IMPC. There will be an assignment today about the IMPC database, where I will just ask you some questions like: look up this gene, what does it do? That's just to get you guys a little familiar with browsing through the database and seeing what all of the different buttons do. All right, let me close Firefox again, and on to the next database.

The next database is Online Mendelian Inheritance in Man, OMIM. This is a massive database. The slide says it was updated October 20th, 2014. I would just ignore that, because by now we've more or less discovered all of the Mendelian phenotypes in humans that there are, so the database doesn't get updated a
lot. But there are a lot of Mendelian diseases, or Mendelian traits, in humans. The database describes itself as a comprehensive, authoritative compendium of human genes and genetic phenotypes that is freely available and updated daily, kind of. The nice thing is that it provides full-text, referenced overviews, and it contains information on all known Mendelian disorders and over 12,000 genes. OMIM focuses on the relationship between phenotypes on the one hand and genetics on the other. So if you're interested in a Mendelian trait and you want to know, okay, is this on chromosome 10 or on chromosome 12, then OMIM can tell you. But it is really one of these starting points. If you start a PhD project in a group that studies earwax, the first thing you can do is go to OMIM and look up everything which has been published in the last 40 years about earwax: the genes involved, who did which studies, and which papers you should read. So it's a really good database for this kind of exploratory research at the beginning of a PhD project.

So let's just quickly look at it. No, I don't want to donate. Let me show you Firefox again. So it actually was updated very recently; the picture on my slide is just relatively old. Again it's just a single search box, so you can fill in whatever you're interested in. So what is your favorite Mendelian disease, or your favorite Mendelian trait? I can search for something like obesity again, but it would be interesting to have a suggestion from you guys so that we can look at that. If not... well, obesity is not a Mendelian disease, so let's search for something Mendelian. Let's just search for earwax, because that was one of the examples. When we search for earwax, it says here that there is an ATP-binding cassette gene, ABCC11, which has been implicated, and we see that there's an
apocrine gland secretion variation entry, and there's a kidney... I can't pronounce that word. But if we go here, it says that the alternative title is wet/dry earwax. This is variation in a secretory pathway: it has to do with the secretion of fluid in the ear, which protects the ear. We can see the phenotype-gene relationship here. The location is 16q12.1, so that is chromosome 16. The gene which causes the difference between wet and dry earwax is called ABCC11, and the inheritance is AD, so autosomal dominant: one of the two variants, the dry or the wet, dominates over the other copy.

There's some basic text on where it is located, and then here are the clinical features. You see that it starts in 1962 with an observation by Matsunaga, then there's a paper published in 1967, and it gives you a whole historical overview of what was studied for this gene. It gives you the mapping: how did they figure out where it was located? They did that with microsatellite markers, and then they fine-mapped it to this region. Further down there's a part on the molecular genetics: how do variations in this gene affect whether the earwax is dry or wet? And you also have population genetics. This tells you that the original Ainu population of the Japanese island of Hokkaido has an exceptionally high frequency of the dominant wet earwax phenotype. Like I told you guys, earwax type differs between populations: people of East Asian descent mostly have dry earwax, while people of European descent mostly have wet earwax. And it gives you full citations for this phenotype. So it is a perfect starting point if you ever have to write a PhD
thesis or a master's thesis and you are studying a phenotype which is Mendelian: check out OMIM. It will definitely help you write the introduction. Of course you can't just copy-paste, because that would be plagiarism, but it gives you a lot of starting information: which groups around the world are researching it, what is known about the molecular biology, these kinds of things. So it's a really, really good starting point. If you have a favorite Mendelian phenotype, like the phenylthiocarbamide one that I generally try to promote (I like Brussels sprouts a lot, I like the bitter taste, and I love coffee as well, so I definitely can taste phenylthiocarbamide), and you are wondering what this tells you about your genome, then it's a very good starting point. We will of course have a question about OMIM as well, and that question will be about color blindness. Red-green color blindness, which is relatively common in males, has a lot of history as well. So it's one of these databases which is just perfect for an initial cursory glance at what is known, and it literally provides citations which go back to the 1960s, so you can really drill down and get a good overview of everything that's known about a certain Mendelian phenotype.

All right, so after OMIM: GeneNetwork. I actually dropped the Gen2Phen database. I mentioned Gen2Phen on the overview, and it's there as a link, and we can look at it if you want. But I think the GeneNetwork database is a little more interesting, because I actually worked on it in the past: I contributed code and other things to this database, and it's just a generally nice group of programmers to work with. This is the old version, GeneNetwork 1. Currently we also have GeneNetwork 2, which is kind of a newer
version, brought into the 21st century. The old database was built in 1980-something, so it's one of the older databases in molecular biology, containing phenotype data on different mouse strains. They describe themselves as a group of linked data sets and tools to study complex networks of genes, molecules, and higher-order gene function and phenotypes. The nice thing about this database is that it contains more than 25 years of data generated by hundreds of scientists, and it gives you access to genotype data as well as expression data sets. So not just classical phenotypes but also endophenotypes.

I never got back to endophenotypes. In the old days, up until about 40 years ago, when we talked about phenotypes we meant phenotypes that were observable from the outside. Nowadays we also talk about endophenotypes: phenotypes which are, for example, measured within an organ. The amount of glucose in the blood is of course also a phenotype; it's a trait, you can measure it. But generally the word phenotype is reserved for classical phenotypes, things that you can observe from the outside, while endophenotypes are phenotypes which are inside the body, inside an organ, or even inside a cell. There's a little bit of discussion on what exactly to call them.

All right, so let's take a look at GeneNetwork. I'm hoping it's not down, because that happens a lot. No, it's actually up, so that's good. This is how the new version of GeneNetwork looks. The nice thing is that it does not only provide information on mouse. If you click here you can see that there's also information on humans, rats, monkeys, Drosophila, barley, Arabidopsis, poplar, which is a type of tree, soybeans, tomatoes, and Oryzias latipes, which is the Japanese medaka. I have
no idea what that is. Anyone is free to contribute data to GeneNetwork. A lot of times when you are measuring phenotype data or genetic data, journals will ask you upon publication of your article to make the data publicly available, so that people can redo your work and you have reproducible research. GeneNetwork is one of these databases which is very open to accepting data. Even if you collected a couple of gigabytes of data, they will say: we will host it for you, no problem, and they will just add your species to it. Of course originally it was for the mouse genetics community, so there's a lot of mouse data in there.

The way it's structured is that it goes by species, and then you can select your group. When you click on the group here... you can't see the drop-down, that is annoying. Why does it not capture? Let me see if I can fix that for you guys. All right, then you will have to go to the database yourself and click on it. But if you click on the group, you can see that it has different types of reference populations in there. The BXD data set is probably the biggest data set that they have. The BXD is the Black 6 (C57BL/6), the reference mouse that I showed you guys, crossed with a DBA mouse. DBA is just another mouse strain which has been in laboratories for many decades. It's an inbred mouse, which means that the offspring are genetically the same as their parents, so it's kind of a clone, and that is achieved by inbreeding. New follower, thank you, Sophie, thank you for following. So the BXD family is the biggest data set that they have, but they have many more. For example, you can look at adult BXDs. Sophie, hello, welcome to the chat and welcome to the stream. So it's a database which contains a lot of phenotypic
data. You can select different data sets. In this case, because we're looking at the eye data, there's only one data set. But if we look at the standard BXD family, no, not BXH, the standard BXD family, they have things like traits, so phenotypes, and covariates, but they also provide molecular traits like messenger RNA measured in the adrenal gland, in the eyes of the mice, or in the pituitary gland. So if you are very interested in a certain part of the body of mice, or of the mouse as a model for humans, then it's a really good database for an initial dig.

If we just search for obesity again, it will take a little while because there's a lot of data in there, and then we can see that they have different phenotypes measured. Here there's obesity based on a high-fat diet, measured in Germany; this was actually done by our group. What they did is give the mice a certain amount of food, and the thing they measured is FAT42, which is kind of cryptic, but FAT42 is just the amount of fat mass at 42 days of age. This was done in females, and it was also done at 56 days. All of these phenotypes come with information such as what the diet was, which of course is very important if you look at fat mass. But they also show you where the highest likelihood is that there is a gene controlling this phenotype. Here, this phenotype seems to be controlled from chromosome 5 at about 39 centimorgans, so measured from the beginning of the chromosome. It also shows you the effect size: one mouse strain had about 0.35 more fat mass at 112 days compared to the other. You can click on the phenotypes, of course, and then you drill down: you get the phenotype, then who published or wrote about it, and the title if it was published; in this case
the data has never been published. And then here you can see all of the measurements that they did. You see the founder strains, so the Black 6 mouse and the DBA/2 mouse: the Black 6 mice have a little over 3 grams of fat, the other founder has 2 grams. Then here are the F1s, which you get when you cross these two strains. You can do this in two ways, because you can take a Black 6 female and cross it with a DBA male, or the other way around. The naming here is, as always in genetics, first the female, then the male. So this is a Black 6 crossed with a DBA: a Black 6 female with a D2 male. And this is when you do it the other way around; they didn't do that, so they have no measurement for it. Then you can see the offspring, the BXDs, which are more or less inbred strains. For each of the inbred strains you can see the measurements that they have, and you can actually directly order them: if you're interested in BXD1, you can click the link and go to the site where you can order these mice, in case you want to study them yourself in your own lab.

Besides that, they have all kinds of statistical tools, where you can see how many mice were measured, what the average measurement was, the standard error, these kinds of things. You can look at a histogram; in this case it's not a very interesting histogram, because there's not a lot of variation in this phenotype. You can look at a bar chart, where you see all of the different strains with their phenotype measurements. And you can make a probability plot to see if it's a normal distribution; there seems to be an outlier here, and this one as well. So they have all kinds of tools to drill down into your phenotype of interest. And if you want to see how this complex phenotype is controlled, because fat mass at a certain age is of course not controlled by a single gene, then they also
provide these things called mapping tools. Let's just run one. It allows you to do a scan across the genome, and for each marker in the genome it calculates how likely it is that something at that position is controlling the phenotype. If we zoom in, we see that for this phenotype there is some likelihood that something is located at around 125 megabases on chromosome 4, but we also see that there's probably a gene here at the beginning of chromosome 5 which is controlling the fat mass in these mice at this age. So it links genotype to phenotype information.

It's a big database. Like I said, I once tried to download it, and it's almost a terabyte of data, I think. So it's really big: 25 to 40 years of data collected by other people, and the data just becomes better and better over time because more people start adding their own measurements. So, GeneNetwork: I'm a contributor, so I should be all the way at the bottom, being thanked somewhere. Rob... where am I? Here, see, I did something. All right, so that's one of the reasons why I wanted to show it to you guys, promoting my own work a little bit. But it's a really interesting data set. If you want to know something about your phenotype of interest, just go there, fill it in, and see if anything comes up. A really good starting point for your own analysis.

All right, so those are a couple of databases that we looked at. As a bioinformatician like me, you generally don't go to the database websites, because clicking around on a website is not something which scales very well. If you do bioinformatics you want to analyze data; you don't want to click through websites and then download little Excel files. The best way to talk to an external server or an external database is through an API. API stands for application programming interface, and it is a set of functions through which you can run queries and, for example, update data or
download data from within a programming language. I generally use R, but there are a lot of different languages which you can use to talk directly to the database. So instead of having to go through the website, typing something in, and then copy-pasting data off the website, this allows you to do your query directly and retrieve the data in a format that you can program against.

So a lot of data is available. I just showed you three databases, and these three databases contain a massive wealth of data: OMIM goes back to the 1940s, 1950s, GeneNetwork started in 1982, and everyone in science uses these databases and contributes data to them. So what can you do with all of this data as a bioinformatician? Well, you can develop new hypotheses without doing any experiments. You can say: I know from OMIM that this gene is involved in this Mendelian phenotype, and I know from the IMPC that this gene also has an influence on that phenotype. It allows you to get an overview of what's out there and to say: my hypothesis would be that X actually functions slightly differently, or that X is controlling Y, and that is because of B. So you can develop hypotheses without even doing experiments.

Another reason why you want to use databases is reproducible research. In the end you are here to become a scientist, and science means doing experiments, but also writing down exactly what you did. Documentation is important. Why? Because other people are going to redo the work that you did. When you write a paper, you have to write down the steps that you took in a way that allows other people to redo what you did and get exactly the same answers. That is one of these things which a lot of people kind of ignore, and we have a big reproducibility problem in biology. A lot of things which are published
in biology are not reproducible, either because the data is not there, since people didn't make the raw data available, or because they didn't do a good job of writing down exactly what they did. But reproducible research is core, because we are working in a field where in a hundred or two hundred years people will still be working on the same things, and these people need to be able to know what you did and to redo it, to validate that your results are really true.

Not only that: bioinformatics is also concerned with putting the results that you obtain from your experiments into a historical context. That's one of the reasons why I like mentioning OMIM: you can see if what you found today is still the same as what people found in the 1970s. Of course there will be differences; nothing in biology is static. The fact that dry earwax is generally found in people of East Asian descent might not be true in 50 years, because populations change: people have children, and their children will be different again, just like in Europe. So there might be a shift.

And of course you can use all of this data to validate results that have been obtained by others. If in the future you are doing a PhD or a postdoc, you will be asked by your professor to review papers written by other people, and these papers might not be directly related to your own work or knowledge. Databases like OMIM and IMPC give you a very good way of getting some basic knowledge, and when you are reading a paper and trying to see what they did in the experiment, they give you a historical context. Not only that, but you can say: in IMPC it says that this gene is involved in skeletal development, but you guys are showing that there's no influence on the skeleton. How is that possible? So it helps you to ask questions, to validate the results of others, but also to review
papers, which as a scientist you need to do, because we need to review the work of other people.

And of course, to do anything with this data we need statistics. So just a very brief overview of statistics. It's a short introduction, and we'll revisit these topics in later lectures, because there's so much to say about statistics that it's just too much for one lecture. But I do want to make you guys aware that there are different ways of doing statistics. My favorite way is visual analysis. I am a very visual person: if I see a graph I can understand it, and if you show me a table with numbers I'm less enthusiastic. Descriptive statistics are very telling for some people: they can read a mean, a certain variance, and a correlation, and build a picture in their mind of how the distribution or the correlation looks. That's not the way I work. I like seeing things like scatter plots, box plots, histograms, or heat maps; that is just more telling to me. But remember that descriptive statistics are something you always have to add to your paper, and visual analysis is a way of showing your data to other people in a form that is easily digestible, instead of tables upon tables upon tables.

When we talk about descriptive statistics there are two flavors. One of them is univariate analysis, where you look at a single variable, or a single phenotype. If I've measured body weight, I can do all kinds of statistics on this body weight: the mean of the people that I looked at was this, in males it was that, and in females it was that. And then I can look at the dispersion: what's the variance, is it a normal distribution or a different type of
distribution? So univariate statistics is when we look at a single variable or a single phenotype. When we talk about bivariate analysis, we talk about the relationship between two phenotypes, for example correlation. If your body weight goes up, then generally your fat mass goes up as well. Then you're talking about two phenotypes and the relationship between them, and that is called bivariate analysis or bivariate statistics, while if I'm analyzing a single thing it's called univariate.

One of the things that I always ask people in the R lectures is: how many means are there? Everyone knows how to calculate the arithmetic mean: you add up all of the numbers that you have and then you divide by how many numbers there are. I hope everyone had a high school education, so you know how to calculate the average, and the average generally means the arithmetic mean. But of course that's not the only type of mean there is; there are three or four different types. You also have, for example, the geometric mean and the harmonic mean, which is also known as the subcontrary mean, and those are all different ways of calculating an average. So when people write that the average is 15, you always have to wonder: did they calculate the arithmetic mean, or is this the harmonic mean? The mean is a univariate statistic, because you're only looking at a single thing: body weight, you have a hundred measurements, and you calculate the average body weight.

The median: I think everyone knows what a median is, but I'm going to discuss it as well. It's the numerical value separating the higher half of the data sample from the lower half. The median of a finite list of numbers can be found by arranging all the observations from lowest value to highest value and picking the middle one. That is the definition of the median. Of
course, there are slight differences in how you calculate the median as well (with an even number of observations, for example, you usually take the mean of the two middle values), but again it is a univariate statistic. Which of the two you present depends very much on the distribution of your data. Average income, for example, is generally not represented very well by the arithmetic mean, because some people make millions and millions, and then there's a big group of people who earn between, say, 20,000 and 120,000 euros. If you calculate the arithmetic mean, these few people earning a couple of million will pull the average up, because they are outliers. So when you talk about income in a population, the median is generally considered a more representative measure. Depending on the distribution, it is sometimes better to show the arithmetic mean and sometimes better to show the median.

Variance describes how numbers are spread out across a range of measurements. When you have a normal distribution, the variance tells you how wide the distribution is. A variance of zero means that all the values that you measured are identical. Remember that variance is always non-negative: it can be zero, but never below zero. A small variance, say 0.0001, means that all of the data points are very close to the mean, while a big variance means you have a very wide distribution. So variance is a univariate statistic that shows how spread out your data is.

When we talk about bivariate statistics, we talk about things like correlation. Correlation is a measure of dependence between two variables: if one variable goes up, what does the other one do? If the weather is getting warmer and warmer, what happens to the sale of ice cream? If the body weight of a cow goes up, what happens to the milk yield? These kinds of measurements, where you are
looking to see the dependence: when one variable increases, does the other one increase or decrease? That is correlation. A lot of people say that correlation is not causation. That is true: correlation by itself does not point to causality. But it is very hard to have causality without having correlation. A lot of people say that correlation is kind of useless, but I love correlation. A big part of my PhD thesis was investigating and measuring correlations, and I still have a half-finished paper somewhere on my hard drive about some work that we did to calculate large-scale correlations. So correlations, in my mind, are very useful. Why? Because you can use this dependence between two things as a predictive relationship which you can exploit. If you know that the price of ice cream will rise when the temperature goes up, then you can use this knowledge to buy ice cream before the temperature starts rising. Or if you know that stock prices are higher on Mondays because of a certain effect, then you can buy stocks on Friday and sell them on Monday. Of course there's no causal relationship between it being Monday and the stock price being higher, but you can still exploit this. So if you know that two things are correlated, you can generally find a way to exploit this dependence between the two measured values.

I wanted to run through a quick example on Arabian horses, just for you guys to see how I generally analyze data: when I look at data, what do I do? We have physical measurements on Arabian horses. In this case we're looking at three different strains of Arabian horses, called Kahlawi, Saqlawi, and Hamdani. Here we first explore the phenotypes, and we want to look for outliers and for relationships between them. So the first thing that I
do is actually... oh, this is very bad, this is not white. Let me fix this slide very quickly for you guys, so that you can actually see what we're looking at. Here we have the body length; let me just make this white for you guys, because otherwise you can't read it. All right, let's continue. It's still a little bit shifted; I will fix it when I upload it.

The first thing that I do when I get data from someone I'm collaborating with is make some scatter plots: show two variables against each other and color the points, for example, by the type of horse. Here we see the chest width of the horse versus the length of the body, and what you can directly see is that indeed the bigger the horse, the bigger the chest, which makes sense. This is a crucial part of bioinformatics and of data analysis: making sure that the relationships that you expect to be there really are there. If this were not the case, you would have to wonder what's going on: why do bigger horses not have bigger chests? So you look at the data and just plot different things against each other. I also looked at the body length versus the neck girth, and there you see that the relationship is not really there. There are some animals out here, but otherwise it's kind of a circle; there doesn't seem to be a correlation between the girth of the neck and the length of the body. That is also explainable: when you think about grown horses, the body length by itself doesn't say much about the neck, but it does say something about the chest, because the bigger you are, the more oxygen you use and the bigger the chest you have. And of course I color the points by the different types of horses to see if there's any clustering, if some types are bigger on average than others. If we look here at the chest width, we can see that the highest chest
width is found in Kahlawi horses and the lowest chest width is found in the Saqlawi Arabian horses. So you get a little bit of knowledge from just plotting things.

Besides scatter plots, I always do some type of visual analysis, and visual analysis for me is making things like box plots. What we see here is the body length, and the body length notched. In R, people generally look at box plots like this, without notching them, but I think notching them is always useful, because the notches are more or less the error margins surrounding the median. Here you can see that all of these horses are on average about 160 centimeters long. But if you then look at the notches, you see that the notch for the green type of horse, the Saqlawi horse, is much bigger than the notch which you see here for the Kahlawi horse. That means that we either measured more Kahlawi horses, so that the uncertainty becomes smaller, or we have a real difference in the spread around the median. We can also more or less see that from the range: the smallest Saqlawi horse in our experiment was 1 meter 20 and the biggest one was 1 meter 70, but if we look at the Hamdani ones here, the smallest horse was about 1 meter 35 and the biggest was about 1 meter 70. So you can see the spread of the data.

I like the notches because they allow me to reason. The nice thing is that, without doing any statistics, when two notches overlap, as here, where this one overlaps with the other one, you can say there's no clear difference between these two types of horses on average. There's also no clear difference between these two. But if the notches do not overlap, if this one were further down and showed a notch here, then you would know that the Saqlawi is probably significantly smaller than the other two types of horses. So it allows you to
reason and do a little bit of statistical interpretation without having to do all of the t-tests or whatever test you use.

Of course we also want to look at histograms, which were introduced by Karl Pearson. Histograms are a good way to get an overview of the data and see whether it has a normal distribution. Here we see the body length and the chest girth; I think this is just for one horse type. We get the idea that it's not really a nice normal distribution, so we might not have sampled the population correctly. Generally, when you sample a biological phenotype, you get a really nice normal distribution, and that is because most animals are around the average, and some animals are of course smaller or bigger, but the further you get from the mean, the fewer animals there are. This one looks a little bit bimodal: there seems to be a normal distribution here with a mean of around 140, and then a second normal distribution with a mean of around 165. So that's how I look at these plots, and based on these plots I can then decide: if it's a normal distribution, I can use parametric statistics; if it's not a normal distribution, then I have to switch to non-parametric statistics.

And just as a bit of an overview: heat maps are something that I love to visualize data with. A lot of people find it really hard to read heat maps, but I like heat maps a lot; I'm a visual person, like I said. Here we have all of the different phenotype measurements that we had, from withers height to croup width to the hind cannon length, and we color the differences between the different horse types. We can see that most horses have very similar phenotypes: if we look at the croup width, none of the horse types are really different. But if we look at the chest depth, we see that the Saqlawi is relatively low, while the Kahlawi is relatively high, and the Hamdani is also relatively low. So you
can directly see where the differences are across a wide range of phenotypes. A really good way of seeing it. Hey testosterone, welcome to the lecture, welcome to the mood box as well. So heat maps are a very good way of visualizing data, especially when you have multiple phenotypes measured on multiple genotypes or multiple animals.

All right, when we do some more advanced statistics, we are talking about hypothesis testing. The descriptive statistics are there just to describe certain characteristics of the distribution of the data, but generally you want to test a certain hypothesis. Hypothesis testing is something that we do to see if a result that we get is statistically significant. Sorry, sorry, I'm sorry, testosterone, it was you appearing that made me, all of a sudden... all right.

When we talk about hypothesis testing, we want to see if a result is significantly different, and significantly different means that the difference is unlikely to be due to random chance alone. If we take a hundred people from there and a hundred people from there, then of course there is a certain chance that we will get something which is different. There are many different ways to do hypothesis testing. The easiest way is to just do a t-test, but you can also do an ANOVA or other linear modeling techniques, and of course you can also use machine learning. All of these things will come back in future lectures, so don't worry. I just don't have a lecture about machine learning, so if someone is listening and says "I really want to have a lecture about machine learning", remember there are two lectures which are more or less free for you guys to choose. If someone is really interested in learning a little bit about machine learning, then I can just make a lecture about that. But we will come back to the t-test, and we will come back to ANOVA.

Of course, we have a problem here, because when we test for
differences between, for example, these different horses that we're looking at, if we want to ask the question "is the body length significantly different between the three different strains of horses?", then we set a p-value threshold. We say that if the p-value of the statistical test is below a certain value, the difference is significant, and in biology this threshold is set to 1 in 20, so 0.05. That means the observed difference is called significant if a difference at least this large would happen less than one in 20 times by chance alone.

So think about what that would mean for gravity. Biology is very bad here, because we have a very lax threshold. If you would take a stone and release it, and the stone would drop to the ground 19 times and one time it would fly up into space, then we as biologists would still conclude that gravity is real and gravity always exists. So 1 in 20 is not that uncommon: for gravity, if you drop a stone 20 times and one time it flies into space, you would usually start doubting whether gravity really exists. But we have our p-value threshold, and that's what we agree on.

Of course, when we start testing multiple things, we have to adjust our threshold to make sure that we don't make more mistakes than we normally would. In our case we do multiple tests: we test the first horse type versus the second, wait, this is wrong, no: the first versus the second, the first versus the third, and the second versus the third. So in this case we need to preserve our 1 in 20 threshold, and we have to correct for the number of tests that we do. We do that using the Bonferroni correction: the p-value that counts as significant should now be below 0.05 divided by 3, because we do three tests. So the p-value is only significant in this case when it is smaller than about 0.0167.

In many cases, if you think about phenotypes, we're talking about
thousands of phenotypes that are measured, especially if we think about automated phenotyping, with plants going around on conveyor belts and being tested. So we often have to deal with thousands of statistical tests.

There are two types of errors that we can make. We make a type 1 error when we say that something is significantly different while it is not, and we can guard against this by using the Bonferroni correction: just take the p-value threshold and divide it by the number of tests that we do. This protects us from saying that something is different while it actually is not. On the other side we have the type 2 error, which is saying that something is not different while it actually is. We can reduce these by using the Benjamini-Hochberg false discovery rate procedure, which is less strict than Bonferroni. It's just a different way of looking at it: we can optimize against calling something different while it is not, or we can optimize against calling something not different while it actually is. In other words, the type 1 error is a false positive, and the type 2 error is a false negative, where we call something not changed while it actually is.

So, plots and statistics: using the R programming language you can create... A question from chat: what kind of significance indicator do you use when you compare genes? When I compare genes, it depends a lot on the question that you ask, on what you want to do with the list of results. If I have the money to test five genes, then I would optimize against false positives, so keep them as few as possible, because from those five I want to be sure that they are really true. But if I have the money to test 10,000 genes, then I would minimize false negatives, because from these 10,000 I would not care if a hundred were not different; I have the money to test 10,000 of them. So if 9,000 are real, then I'm happy about that, and I don't really care about the negative ones, right?
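To make the two corrections mentioned above concrete, here is a minimal sketch in Python (it is not code from the lecture). The three p-values are made up for illustration and stand in for the three pairwise horse comparisons; the function names are hypothetical.

```python
def bonferroni_threshold(alpha, n_tests):
    """Per-test significance threshold under the Bonferroni correction."""
    return alpha / n_tests


def benjamini_hochberg(p_values, fdr=0.05):
    """Benjamini-Hochberg step-up procedure.

    Returns a list of booleans in the original order:
    True where the test is called significant at the given FDR.
    """
    n = len(p_values)
    # Indices of the tests, ordered from smallest to largest p-value
    order = sorted(range(n), key=lambda i: p_values[i])
    # Find the largest rank k for which p_(k) <= (k / n) * fdr
    max_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / n * fdr:
            max_rank = rank
    # Every test ranked at or below max_rank is called significant
    significant = [False] * n
    for rank, idx in enumerate(order, start=1):
        if rank <= max_rank:
            significant[idx] = True
    return significant


# Made-up p-values for three pairwise comparisons (illustration only)
p_values = [0.010, 0.020, 0.030]

threshold = bonferroni_threshold(0.05, len(p_values))  # 0.05 / 3
bonferroni_significant = [p < threshold for p in p_values]
bh_significant = benjamini_hochberg(p_values, fdr=0.05)

print("Bonferroni threshold:", round(threshold, 4))
print("Bonferroni:", bonferroni_significant)
print("Benjamini-Hochberg:", bh_significant)
```

With these made-up values, Bonferroni keeps only the first comparison, while Benjamini-Hochberg keeps all three. That is the trade-off in a nutshell: Bonferroni guards strictly against false positives, while the false discovery rate procedure misses fewer real differences at the cost of tolerating a controlled fraction of false positives.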
If I have very little money, I would generally minimize false positives, because I can only test a couple of candidates in the lab. But if I'm able to test a lot of them in the lab, then I would rather minimize false negatives. So it really depends on what you want to do.

It's the same as with medical tests. If you do a corona test, you generally want to minimize false negatives, because missing someone who has corona causes a new outbreak, while false positives are not that bad: you just send someone home for two weeks, they stay there, and that's fine. That's just one person affected. But missing someone with corona might infect another hundred people, and the pandemic just continues. So it really depends on what your goal is. On the other hand, if you think about HIV, you want to optimize the other way around, because missing someone with HIV is bad, but telling someone that they have HIV while they actually don't is also bad. You don't want people to start writing their will and saying goodbye to all of their children. So it really depends on what the goal is and what you want to do with it. Sometimes you want to minimize false positives, and sometimes you want to minimize false negatives.

So, testosterone, if you have a very specific example, then I could answer that, but in general it really depends on what the consequences are of allowing a false positive or allowing a false negative, and these consequences can be very different in different circumstances. "That's enough, thank you." You're welcome.

It's 4:07. I think we have to do a short break. I have five slides left, so we could continue, but I do want to have a break because I want to cough a little bit. And this is just some promotion for myself: learn how to program. You're in a bioinformatics course, and trust me, bioinformatics is only possible when you learn at least one programming language. You can't really be a
bioinformatician if you can't handle a large amount of data. I teach an R programming introduction course in the summer semester, so you're more than welcome to join that if you want to learn how to program.

Statistics is hard, and even I talk to other people who are statisticians to make sure that what I'm doing is correct. No one actually knows exactly what is correct. Statistics is a broad field and a difficult field, and there are people who have opposite opinions, like with the false positive and false negative thing. Some people say no, you should always minimize false positives, so make the number of false positives as small as possible. Other people just zealously believe in always minimizing false negatives. So it really depends.

But we're gonna do a short break. I think there were also some questions about assignments from last week, so I really want to give people the opportunity to ask some questions about that too. Like I said, I have five slides left, which is actually four slides plus the question slide. But I didn't prepare a break. Oh, that's so bad, that is so bad, I didn't prepare a third break for you guys. Okay, let me do that directly then. So I will stop the