So the recording is working, so welcome back everyone. We were talking about application programming interfaces, because we want to query a database not just single query by single query; we generally want to talk to a database from a programming language. An API, in computer programming, is an application programming interface: a set of subroutines that allows you to query or update a database from within a programming language. That means that within R you connect to the database and then you just query it from R, for example in a for loop, which allows you to query hundreds of genes in a row without having to go through the web interface and click through it every time. We will get back to application programming interfaces when we talk about biomaRt, which we use to connect to the main biological databases such as Ensembl.

So, like I told you, a lot of data is available. What does a bioinformatician do with all of this data? The main thing you can use databases for is to develop new hypotheses: this is my gene of interest, but which other genes should I be looking at as well? A database also gives you a way to do reproducible research, because you can query it, save the results and store them. It also helps to put the results you have obtained into a historical context. Think back to the OMIM database, which has data going back 80 to 90 years; you can see whether your experimental results are in line with experiments that people did in the 1970s or 1980s, and whether your results make sense or show something different. And you can validate results obtained by others by checking whether what they came up with is consistent with what people found before. So it's a very, very interesting topic, and there are literally gigabytes and
gigabytes of data out there which you can use. But for all of this we need tools, both to do reproducible research and to compare results, to see if the significance that we got, or the direction of the effect, is the same as what other people have found.

For the statistical analysis we will just go through a couple of little things. I will show you a couple of examples from real research that we did in the past couple of years. There will be a whole lecture about statistics, about how we can do statistics using R and what these things mean, but today I just want to give a bit of an overview of descriptive statistics, like the mean, the variance and correlation. And I want to show you some different types of visual analysis, which I think is really useful, because only when you visualize your results do you get an idea of what's really going on. So we will talk about scatter plots, box plots, histograms and heat maps, which are kind of the four main visualization techniques when we're talking about data.

When we talk about descriptive statistics, there are two flavors. One of them is univariate analysis, which means that you are looking at a single variable or a single kind of measurement. For example, we have measured body weight, and then we want to show other people how the body weight of our animals was distributed: what was the mean, what was the variation, what was the standard deviation?
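As a minimal sketch of such a univariate summary, the snippet below computes the mean, variance and standard deviation of a handful of body weights. It uses Python's standard library rather than R, and the weights are made-up numbers, not measurements from the lecture.

```python
import statistics

# Hypothetical body weights in grams; not data from the lecture.
body_weights = [21.5, 22.0, 23.1, 22.4, 24.0, 21.9, 22.7]

# Arithmetic mean: sum of the values divided by how many there are.
mean_weight = statistics.mean(body_weights)

# Sample variance (divides by n - 1) and its square root, the standard deviation.
variance = statistics.variance(body_weights)
std_dev = statistics.stdev(body_weights)

print(f"mean = {mean_weight:.2f}")
print(f"variance = {variance:.2f}")
print(f"standard deviation = {std_dev:.2f}")
```

The variance is never negative, and it is exactly zero only when all measurements are identical.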
And then you have bivariate analysis. Bivariate analysis is something like correlation, where you take two phenotypes and look to see if there is a relationship between them and, if there is, what kind of relationship it is.

When we talk about the mean, we generally mean the arithmetic mean. When we come back to the mean later on I will also show you different types of means, because the arithmetic mean is the most common one but sometimes not the best one to use. The arithmetic mean is just summing up all the numbers and dividing by n, the number of numbers. I think everyone knows how to calculate an arithmetic mean.

The median is closely related to the mean. It is the numerical value separating the higher half of the data from the lower half. The median of a finite list of numbers can be found by arranging all the observations from lowest to highest value and picking the middle one. When you have an even number of observations, you take the two middle ones and take the arithmetic mean of those. So the mean tells you what the average is, and the median tells you more or less where the middle of the data is, which doesn't have to be the same thing.

Variance is a measurement of how far apart, how spread out, your numbers are. When you have a variance of zero, all the measurements that you did are identical. The nice thing about variance is that it's always non-negative, and the bigger the variance, the wider the spread of the data, so the more difference there is between the lowest and the highest animal. It gives you an idea of what the distribution of your data more or less looks like: when you talk about a normal distribution, the mean is in the middle and the variance tells you how wide the curve is. When we talk about
correlation, we mean a measurement of the relationship between two variables. Correlations are very useful because they can indicate a predictive relationship that can be exploited in practice, and that's one of the reasons why correlations are used so much. You often hear people say that correlation does not equal causation. That is of course true, but the reverse usually does hold: when there is causation between two things, they are very often correlated. And even though two things might not have a causative relationship, if there is a correlation you can still exploit it.

Think for example about the stock market. Say you know that when the temperature outside goes up, the stocks of ice cream manufacturers rise as well, because during hot weather people tend to buy more ice cream. So here there is a correlation which is not directly a causative relation: the temperature outside does not make stock prices go up, it's indirect. But still, when you know that, you can earn a lot of money. When you see the weather report saying that the temperature will rise in the coming days, it's a good time to buy stocks of ice cream manufacturers, because there's a big probability that they will rise as well.

All right, so a little bit of an example from some work on Arabian horses that we did. I just wanted to go through the different phenotype measurements that we took and how we can relate these to each other. Arabian horses come in five or six different strains, depending on how you define them, and the main ones here are Kalawi, Saklawi and Hamdami. These are physical measurements that we took on the different types of Arabian horses. The first thing that you have to do when you look at phenotypic data is look for outliers. There are things like
comma errors, where someone put the decimal comma in the wrong place, which means that all of a sudden one of the individuals is either ten times higher or ten times lower just because the comma is at the wrong position. Someone wanted to write down 37.6 but wrote down 3.76. If you see that, you can easily correct the number or leave it out.

With a scatter plot? Yes, yes. When I look at my data, I generally do something like this. I take two phenotypes, like the width of the horse's chest, and plot that against the body length, and then you get a scatter plot. Here you see that some individuals are relatively small: the body length is relatively small, but the chest width is also relatively small. So here I do not really suspect that someone made a comma error.

You also see the different types of horses: the color of the dots tells you whether we're looking at a Kalawi, a Saklawi or a Hamdami. In this plot the Kalawi seem to have a slightly bigger chest compared to, for example, the Hamdami horses, while their body length is more or less in the same range; it runs from around 1 meter 30 to around 1 meter 70 for both the red dots and the black dots. But you see that the red dots on average have a bigger chest, even though they are of the same length. At 1 meter 60 the black dots are more or less around here, while the red dots are a little bit further apart.

And of course you can do the same thing for different phenotypes, like the girth of the neck, to see if there are relationships. When you have an outlier, it will be very far away, especially in this case, where we're talking about something measured in centimeters. When someone makes a comma error, the horse suddenly has a body length of 12 centimeters, which for a horse is of course way too
small. But the first thing that you do is make some scatter plots and plot the different phenotypes against each other.

So on the left a positive correlation? Yeah, you can more or less see a positive correlation here: the bigger your body, the bigger your chest. But it doesn't really seem like the neck girth has anything to do with the body length, right? That looks more like a single ball, and there are about four individuals here which have a very low neck girth and a very low body length. They seem to be outliers, so strange in this distribution, because most of the horses are here and these four are kind of outside. So on the one side you see a positive correlation, which makes sense, and on the other the body length seems to be not highly correlated with the neck girth. Probably the correlation here is slightly positive, driven by a couple of outliers. But just visualizing your data will give you an initial impression of whether people made errors when collecting the data.

You can do a visual analysis in a different way. Here we're looking at box plots, which come in two flavors. On the left side you see the more or less standard box plots of the body length. You don't see much difference in the median of the body length, but you can see that the Kalawi, the red ones, seem to have less variance. They seem to be more uniform, while the other two breeds have a much bigger range of body length. On the other side we see a box plot which is notched, and notched box plots tell you something about whether two groups are significantly different. What you see here is the notch ranging from here to here for the gray box, so for the Hamdami, and then here the notch for the Kalawi. And you see that these two notches overlap.
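How such a notch can be computed, as a rough sketch: a common convention (the one described in the documentation of R's boxplot statistics) places the notch at the median plus or minus 1.58 times the interquartile range divided by the square root of n. The group names and body lengths below are hypothetical, and the quartile method is an assumption, so the numbers will not exactly match any particular plotting package.

```python
import math
import statistics


def notch_interval(values):
    """Approximate notch limits: median +/- 1.58 * IQR / sqrt(n)."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    median = statistics.median(values)
    half_width = 1.58 * (q3 - q1) / math.sqrt(len(values))
    return median - half_width, median + half_width


def notches_overlap(group_a, group_b):
    """True when the two notch intervals share at least one value."""
    low_a, high_a = notch_interval(group_a)
    low_b, high_b = notch_interval(group_b)
    return low_a <= high_b and low_b <= high_a


# Made-up body lengths in centimeters, not the lecture's horse data.
group_a = [150, 152, 155, 157, 158, 160, 161, 163]
group_b = [138, 145, 150, 156, 159, 162, 166, 170]
group_c = [120, 122, 124, 125, 126, 127, 129, 130]

print("a vs b overlap:", notches_overlap(group_a, group_b))
print("a vs c overlap:", notches_overlap(group_a, group_c))
```

In this toy data the first two groups have the same median but different spread, so their notches overlap, while the third group sits far lower and its notch is clearly separated.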
So if they overlap, it means that there's no significant difference between the body length of the Kalawi and the body length of the Hamdami. It's a little bit of statistics, more a kind of by-eye statistics. But if one of these boxes were much, much lower and the notches did not overlap, that would be a really good indication that there is something different between the horse types that we're looking at.

The spacing between the different parts of the box indicates the degree of dispersion, the spread of the data, so it's a kind of measurement of the variance and a little bit of the skewness. Yeah, the notches show the 95% confidence interval surrounding the median. A box plot works like this: this line is the median, the box itself contains 50% of the data, and the whiskers cover roughly 95% of the data. So 95% of the measurements, or in this case a hundred percent of the measurements, fall between 1 meter 20 and 1 meter 70 for the Saklawi, while the Hamdami are bigger on average, so they run from about 1 meter 35 up to 1 meter 70.

So the notches show a 95% confidence interval, but in many cases you can change that. You could say give me 70% notches or give me 99% notches, but the default is generally the 95% confidence interval. It's just a way of representing your data. When I look at box plots I generally look at them notched, because if there is a significant difference you can see it directly: the notches do not overlap. In this case all of the notches overlap, because the notch here is captured within the notch of the green box. So these two distributions are not different from each other, or at least the medians are not. But again, it's just a visual inspection of your data before you start doing any analysis. Another
thing that I always do is make histograms. For example, when I look at the body length, and here I haven't separated it out by the three breeds, you can see that the body length looks more or less like a bimodal distribution. There seems to be a little normal distribution here with a mean, or median, around 140, and then there seems to be a second normal distribution on top, with horses which on average are about 160 centimeters long. For chest girth the distribution looks different, but again, it's not a perfect normal distribution, it just kind of looks like one. It gets very strange when there is a stretch with no data, and then you see one big peak and then nothing. Then you really have to wonder what's going on, and why some animals are this small while other animals are relatively big. Normally, when you measure phenotypes, especially classical phenotypes on animals or plants, you expect most of them to follow either a normal distribution, so a Gaussian distribution, or a bimodal Gaussian distribution.

Did you take the median because of the outliers? In this case there were no real outliers. Generally I prefer the median over the mean in almost all cases. Since half of the data is lower than the median and half of the data is higher, it gives you a truer estimate of what a typical value looks like. The mean is heavily influenced by high numbers, by outliers. The average salary in Germany is much higher than the median salary, and that's because some people earn something like 20 million a year. Those are only a few people, but they drag the mean up so much that it stops describing a typical salary. The median is much better.
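A small sketch of why one very high value pulls the mean but hardly moves the median. The salaries are invented for illustration (they are not German income statistics), and the code uses Python's standard library rather than R.

```python
import statistics

# Hypothetical yearly salaries in euros.
salaries = [30_000, 35_000, 40_000, 45_000, 50_000]

# The same group plus one person earning 20 million a year.
with_outlier = salaries + [20_000_000]

mean_before = statistics.mean(salaries)
median_before = statistics.median(salaries)
mean_after = statistics.mean(with_outlier)
median_after = statistics.median(with_outlier)

print(f"without outlier: mean = {mean_before:,.0f}, median = {median_before:,.0f}")
print(f"with outlier:    mean = {mean_after:,.0f}, median = {median_after:,.0f}")
```

With the outlier added, the mean jumps into the millions while the median only shifts from 40,000 to 42,500, which is still a salary that somebody in the group could plausibly earn.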
So the median income per country is a much better measurement than the average income, so the mean income. I generally prefer the median over the mean, but that differs from person to person.

Histograms are actually an invention of Karl Pearson. Karl Pearson is very well known for the Pearson correlation, and it's a funny story, because the Pearson correlation was actually not invented by Karl Pearson; Karl Pearson is the inventor of the histogram. It happens a lot in science that someone gets credited for an invention which is actually not their invention. So the Pearson correlation was not invented by Karl Pearson, the histogram was, but for some reason the correlation is what got ascribed to him. There is actually a name for this, Stigler's law: all the main inventions in science are credited to other people. So if you invent something, there's a big chance that someone else's name will be on it in the end instead of yours. Anyway, histograms, invented by Karl Pearson, are very good for getting a quick overview of what the distribution of your data looks like.

All right, and then heat maps. I like looking at heat maps, but they are more difficult to read. I've noticed that especially with people who are not from a computational field.
So people who are really, and it sounds derogatory, what I call butterfly biologists: biologists who go out into a field and catch butterflies, who are really doing biology instead of being in a lab or doing data analysis. I don't mean it in a bad way, because you need butterfly biologists as well, but they tend not to really understand heat maps. They find them difficult to reason about.

The nice thing about a heat map is that you can visualize three dimensions. You have one dimension, the horses, on the y-axis here. On the x-axis you have all of the different phenotypes that we measured. And then the color is the height, or in this case the difference, for that phenotype and that horse. What you see here is that for most of the measurements the horses don't seem to be that different, but when we look at the chest measurements there seems to be a massive difference: the Kalawi seem to have a much wider chest compared to the other horse breeds in this case.
What is plotted here is a LOD score, which is a p-value kind of measure, and that is what I visualized. The Kalawi seem to have a much bigger or deeper chest than the other horses, which makes sense, because the Kalawi are the racing type of Arabian horses, while the Hamdami are more the show horses and the Saklawi are the working type. When you're a race horse you need a lot of air to be able to perform well, while if you're more of a workhorse it doesn't matter that much, because it's more about endurance and not so much about peak performance. So a heat map is a very good way of visualizing data, especially when you have multiple dimensions. When you look at multiple phenotypes across multiple species or subspecies, it's a very good way of showing where the differences are.

All right, so of course there are some more advanced statistics that we will go into later. I just wanted to tell you that one of the main things you do as a bioinformatician is hypothesis testing. You have a certain hypothesis and you want to know if a result is statistically significantly different. Normally, when we use the term significant, we mean something which is unlikely to have occurred by chance alone. There are many different methods to define or calculate significance. The most basic significance test is a t-test, where you just have two groups and you want to know if the mean of the one group is different from the mean of the other. We have much more advanced statistical tests if you want to correct for all kinds of other factors, like the sex of an animal, its age or its body weight. Then you can do things like an ANOVA, which is very similar to a t-test: you still want to know if there's a difference between two groups or three groups when you correct for different
environmental factors like the sex or the farm at which they grew up. And nowadays it's very hip to use machine learning, where you have a training set and a test set and do a kind of assignment. That is not really asking whether something is significantly different; it's more about which features differ between observations. But there will be a whole lecture about statistics. I just wanted to mention that hypothesis testing is one of the main things that you do as a bioinformatician.

When you talk about testing, you always have to mention multiple testing. If we look at our data, we have three different horse strains, and we want to test if there are any significant differences between them. Normally we agree that when a p-value is lower than 0.05, it is significant. Why is that? Because if something is less likely than one in 20, we say it's unlikely to have occurred by chance.
Yes, so that's where the 0.05 threshold comes from. And of course this is a very weird threshold. Think about gravity: if you would take an object and release it, and 19 out of 20 times it would fall to the ground and one time it would fly up into space, then we would still say gravity exists, because one out of 20 is not enough to discredit gravity. That's why other fields, like physics, work with a much stricter p-value threshold, something like 1 × 10^-5, so 0.00001. There they require much more evidence before they say something is significantly different. But in biology we generally take the 0.05 threshold, because one in 20 is kind of our bread and butter: if we find an effect 19 out of 20 times, we believe that this effect is real.

Of course, we always have to deal with multiple testing. In this case, because we have three horse strains, we can test the Hamdami versus the Saklawi, the Hamdami versus the Kalawi, and the Kalawi versus the Saklawi. So we always do three tests: for every phenotype, like body length, we test all three of these possibilities. But we need to preserve our one in 20 threshold. If we do three tests, then instead of allowing 19 out of 20 to be similar, we now demand more to be similar.
So we now demand one in 60, which means that our p-value threshold goes from 0.05 to about 0.0167, and only if the p-value is below this do we say that something is significantly different. So you always have to correct for the number of tests that you do. This comes from the fact that if you do a hundred statistical tests on data with no real effect, about five of them will still come out significantly different, simply because we have an alpha level of five percent: we accept being wrong in five percent of the cases. We will come back to multiple testing a lot; I have one more slide on it.

In bioinformatics we're often dealing with thousands and thousands of statistical tests. Generally we want to know if a gene is different in the one species versus the other, and there are about 20,000 genes in the genome, so the chance of making a mistake, of saying that something is significantly changed while it is not, is really high when you do 20,000 tests. This is called a type one error: calling something significantly changed while it's not, while it's just due to chance. You can guard against type one errors by doing a Bonferroni correction. Then there's the type two error, which is saying that something is not significantly different while it actually is, so you are missing a real change. A strict correction like Bonferroni makes these more likely, and you can limit that by using the less conservative Benjamini-Hochberg false discovery rate procedure instead. But these will come back. I always find the names a little bit difficult, so I call the type one error the false positive.
So something which you say is positive, but which actually is not. And the type two error is a false negative: something that you say is not changed is actually different between the groups that you are dealing with. We will get back to multiple testing and how to deal with it properly. There are multiple strategies, and this is just the simplest one: the Bonferroni correction is just taking your p-value threshold, dividing it by the number of tests, and that is your new p-value threshold. Benjamini-Hochberg is a little bit more difficult, but these are the simplest ways of dealing with it.

Is it an alpha or a beta error? The type one error, calling something significant while it is not, is the alpha error. The type two error, calling something not significant while it is, is the beta error, and that comes in when you are dealing with true positive and true negative rates. Here we're mainly talking about alpha, so the alpha error rate. But we will have a whole discussion about this.
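The two corrections mentioned here can be sketched in a few lines. This is an illustrative Python version under stated assumptions, not the procedure used for the horse data: the p-values are invented, and the function names are hypothetical helpers written for this example.

```python
def bonferroni_threshold(alpha, n_tests):
    """Bonferroni: divide the significance threshold by the number of tests."""
    return alpha / n_tests


def benjamini_hochberg(p_values, fdr=0.05):
    """Benjamini-Hochberg step-up procedure.

    Returns a list of booleans, True where a test is called significant.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest rank k whose p-value is at most (k / m) * fdr.
    max_rank = 0
    for rank, index in enumerate(order, start=1):
        if p_values[index] <= rank / m * fdr:
            max_rank = rank
    # Every test ranked at or below k is called significant.
    significant = [False] * m
    for rank, index in enumerate(order, start=1):
        if rank <= max_rank:
            significant[index] = True
    return significant


# Three pairwise breed comparisons per phenotype, as in the lecture.
print(round(bonferroni_threshold(0.05, 3), 4))

# Expected number of false positives at alpha = 0.05 when nothing is real.
print(0.05 * 100, 0.05 * 20_000)

# Invented p-values to compare the two corrections.
p_values = [0.001, 0.012, 0.025, 0.045, 0.2]
cutoff = bonferroni_threshold(0.05, len(p_values))
bonferroni_calls = [p <= cutoff for p in p_values]
bh_calls = benjamini_hochberg(p_values)
print(bonferroni_calls)
print(bh_calls)
```

On these invented p-values Bonferroni keeps only the smallest one, while Benjamini-Hochberg keeps the three smallest, which illustrates the trade-off: Bonferroni is strict about false positives, and Benjamini-Hochberg gives up a little of that strictness to miss fewer real differences.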
I think there's a lecture, well, not a whole lecture, but there is a lecture with more about statistics. I just wanted to mention it in this context, since when we're looking at three horse breeds there's a chance of saying that something is significantly different between the breeds just because you have three tests for every phenotype.

All right, so a couple of words about plots and statistics, for example when you are using R. Programming languages such as R can create these plots and do the statistical analysis for you. However, make sure that you know what you are doing. Just because you are programming it and feeding it information does not guarantee that the results you are getting are correct. Always know what you're trying to do, or consult a statistician. I am a pretty good statistician, but I do have people that I call when I need help. So always make sure that you have someone as a backup who can help you and advise you on what the best statistical test is. Is it better to do a t-test, or should I go to a non-parametric test because my distribution is not really normal?

And if you really want to learn how to program and do statistics, I am teaching an R course in the summer semester, and you're more than welcome to join. It's a very introductory course.
So we start off very basic, but from lecture three or four on we ramp up the difficulty quite tremendously. In the end the idea is that people can make their own plots, look at their own data and do some statistical analysis to see if there are any differences between the measurements that they got. So if you're really interested in statistics, in making plots and in learning how to program, then, and it's a little bit of a plug, follow the R introduction course. Just subscribing or following me on the channel will mean that you get an email when I'm starting the summer course.

All right, so what does statistics tell us? Descriptive statistics is very useful to detect outliers and to do an exploratory data analysis to see if anything is wrong, and it can help us decide which model or distribution is suitable for our data. If you are doing hypothesis testing, then the test gives you a probability, the p-value, of seeing data at least as extreme as yours if there were no real effect. Strictly speaking that is not the probability that your hypothesis is true, but it tells you how well the data you give to the algorithm fit with chance alone. These are really the things that statistics can tell us. Statistics is just a tool, one of many tools in the toolbox that you need to analyze data.

All right, I always like to end more or less with project planning. If you're talking about statistics, you always have to cite Ronald Fisher, and Ronald Fisher has this really nice quote: "To consult the statistician after an experiment is finished is often merely to ask him to conduct a post mortem examination. He can perhaps say what the experiment died of." And this is really true in science.
I see a lot of people getting a lot of money for projects, doing the project, and then realizing that they actually had 50 samples too few. If they had done 50 more samples, they could really have shown what they wanted to show, but now their p-value is just not good enough because they didn't have the sample size. So if you're planning a big project proposal and you want to ask the European Union for a million euros for a research project, do make sure that you ask a statistician before you start your experiment. That can really open your eyes to what you want to do, how many samples you would need, or whether you need to increase the dosage of your treatment versus the control group, and these kinds of things.

On that note I actually made a statement myself: why should you consult a bioinformatician nowadays? I think besides a statistician you should always consult a bioinformatician, especially when you are doing things like DNA or RNA sequencing experiments. You're talking about data storage, like 60 gigabytes to a terabyte of data per sample, and about massive amounts of computer analysis time, like 20 to 200 hours of computation per sample. A bioinformatician can also help you with things like sample size estimates. If you're dealing with the German Tierschutz, so animal welfare: what is the minimum number of samples that I need to get a significant result?
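To make those per-sample figures concrete, here is a back-of-the-envelope sketch. It assumes a hypothetical project of 100 samples and reads the lower storage bound as 60 gigabytes, which is an interpretation of the spoken figure; the function name is made up for this example.

```python
def project_estimate(n_samples, gb_per_sample, cpu_hours_per_sample):
    """Scale per-sample storage and compute needs up to a whole project."""
    return {
        "storage_tb": n_samples * gb_per_sample / 1000,  # 1 TB taken as 1000 GB
        "cpu_hours": n_samples * cpu_hours_per_sample,
    }


# Hypothetical project size; not a number from the lecture.
n_samples = 100

# Lower and upper per-sample figures: 60 GB to 1 TB, 20 to 200 hours.
low = project_estimate(n_samples, gb_per_sample=60, cpu_hours_per_sample=20)
high = project_estimate(n_samples, gb_per_sample=1000, cpu_hours_per_sample=200)

print(f"storage: {low['storage_tb']:.0f} to {high['storage_tb']:.0f} TB")
print(f"compute: {low['cpu_hours']:,} to {high['cpu_hours']:,} hours")
```

Even at the low end this comes to several terabytes and a couple of thousand compute hours, which is why storage and analysis time belong in the budget alongside the sequencing itself.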
A bioinformatician can also help you get informed about which statistics you might be able to use, or how to properly randomize samples, for example when you're doing a microarray experiment. These things are kind of logical, but I see them being overlooked in a lot of the experiments being done nowadays, especially DNA and RNA sequencing experiments. I've run into biologists who say: I have 500 of my Arabidopsis plants growing in a greenhouse, we're going to sequence them all, and I have the money to sequence them all. And then you ask: but how much money do you have to buy a massive server to store all the data? And do you have a cluster somewhere to analyze all of it? And they never thought about it. They write their project proposal and ask for money for the experiments, for the plants, for the sequencing, but they forget that the data has to live somewhere and has to be analyzed, and all of these things cost money as well. Computer time is not free: if you want 100 or 200 hours of computer time and you buy that at Amazon, it will cost you a certain amount of money.

So when you plan a project, consult a statistician for the basic statistics, but also consult a bioinformatician, especially when it comes to things like data storage and analysis time, which you have to calculate into your project proposal and into how much money you need to apply for. Of course this is still far away for many of you, although when you're in your master's phase, or when you start or think about doing a PhD, these things become more and more of a day-to-day issue.

All right, so that's it. It's 4:59, so we're one minute early. What I told you about today: phenotypes, traits, what is a qualitative trait, what is a quantitative trait?
I told you about Mendelian traits and quantitative statistics. We talked a very little bit about statistical analysis. We looked at a couple of phenotypic databases which are out there and which I think are useful. We had a couple of words about multiple testing and a little bit about project planning. I think that's all of it for today. There's nothing more that I wanted to say, unless there are any questions.

The homework for today is on Moodle. It is five multiple-choice questions, A to D, and most of them have you go to the IMPC and the OMIM databases, find some things there and see what is known about them.

All right, good. Let me quickly check how many people we are left with: one, two, three, four, five, six, seven, eight, nine, ten. Good, so ten people left at the end of the lecture, that's pretty good.

So if there are any questions... When should we submit the homework? You should not submit the homework. You should just do it and write down the answers for yourself, and at the beginning of the next lecture I will go through the questions and show you where you can find the answers. I'm not going to check homework; that's something that people do in high school. I'm just here to show you what's possible, and it's up to you to be interested and do the questions. If you have any issues with the questions or the homework, just send me an email.

And yes, Sundry, I will send you a PDF. I'm still setting up something on my website so that I can mirror the Moodle there for people who don't have access to it. Although I think I might actually be able to invite you on Moodle as a guest lecturer or with a guest account, so I will look into that. If I can get the Moodle to work for you, I will just send you an email. And Olexander, I hope I'm pronouncing your name right. Thanks. Yeah.
Thank you guys for watching. It's only fun when there are people asking questions. I could sit here for three hours and talk to myself as well, but it's more fun when people are here. All right, Commando, see you next week. How do you get the little crown in front of your name, Commando? Is that something that you can share with us? Thank you guys, I will stop the recording.