So welcome everyone, also on YouTube and Moodle if you're watching this today. We will be doing regression analysis, and I'm really excited about it, because regressions and ANOVAs and these kinds of things are kind of my specialty; it's what I did my PhD in, so I'm quite familiar with how to do regression. I think a lot of you have been wondering when the statistics will start. We already did a little bit of statistics, like the t-test and the nonparametric t-test, but now we're starting with regression, so it becomes a little more serious on the statistical side. Let me see what we have for today. So, first off, the exam will be on the 21st of July at 2. You can't register yet in Agnes, I think, because I got a mail from someone after the last lecture saying "I looked in Agnes and it's not there yet". That is because you can only register in Agnes four weeks before the exam, and four weeks before the exam means that from the 2nd of July it should be possible to register. So if you've looked in Agnes and haven't found the exam yet, this is the reason. The re-exam will be on the 9th of the 9th, again via Zoom and Moodle, and the best grade will count. So for everyone who says "I can't really study for the first exam, so I'll just do the re-exam": don't, just do the exam, because if you do the exam you can always do the re-exam to improve your score. If you only do the re-exam, then of course you only have one chance, and having two chances is better than one chance. So yeah, it will be online on Moodle, and from the 2nd of July, I hope, you should all be able to register. I will keep track of it, and people who can't register, just send me an email and I will let you know how to register. So, the assignments from last week: I hope people were able to do them. Hands up in chat, or just shout something in chat, if you were able to. Someone says: I have registered for the exam
yesterday. Oh nice, so then it is already in Agnes. Okay, good. Then the guy who mailed me, I will mail him back to make sure that he checks Agnes again, because he checked Agnes last week and it wasn't there then. But good, that's very good news, thank you for telling me. Perfect. So yeah, shout out in chat if you were able to make a package. Did you build your first R package and succeed, or did it fail because of some error? I'm really curious to see how it went. "My name is Mausie." You're also not a VIP? What the hell is happening, why are people that are here week after week not VIPs? But you were able to build a package, that is very good. I like your name by the way, Mausie. There you go, and now you are also a VIP, so you got your little diamond. Good, so two people were able to build the package. I know Daniel tried as well, failed halfway through and asked me some questions by email. Are you here, Daniel? You told me you should be here, so shout out in chat, Daniel, where are you? You're finished with your work, right? So you should be able to attend the lectures now. But I can show you guys how I made my package, and I made it exactly the same way as I told you during the lecture. So let me open up the correct files. So it is docx, then we go to our course, and then we have your package name. First off, the DESCRIPTION file. This is the DESCRIPTION file that I used, very similar to the DESCRIPTION file that I showed you during the lecture. It just tells you, or rather it tells R, what you need to run this package, right?
So it gives a name, a version, when it was created, a title, the author and the maintainer; it says you need R at least version 3.0; it has a description and a license. And of course it doesn't have to be in this order, so you can change the order around if you want to, and that should all be perfectly fine. Then there's a NAMESPACE file, which initially used to be empty. But during the lecture I told you that if you want to give a function to the user, then you should export it. So my package is exporting two functions: one which is called "my first package function", just like in the lecture, and then we export the "call C test from R" function, which is the example I showed you guys on how to call C or C++ code. And since we are using C code which gets compiled, we have to load the dynamic library, so we have to load the compiled code first, and that is also done via the NAMESPACE file. This loads the compiled DLL under Windows, or the .so file when you're under Linux, and then we export the two functions. I can show you the functions that I wrote. I have one file which is called "my first package function dot R", which looks like this. I also made an internal function, I see now, just so that I have an internal function to annotate. I just made a function which does nothing, and of course this one just prints something to the screen, but this one is not exported to the user.
So the user cannot call it, or gets an error when they try to call it, saying that no such function is found. There are some ways that you can still call it. And then we have the "call C from R" test file, and this is again the same file that I showed you during the lecture. We have some manual files, of course, so we have the one for the call-C-test-from-R function, again very similar to what I showed you, and of course we have the one for "my first package function", like this. And this is kind of the minimal amount of things that you need to mention: the name, the alias, the title; you have to give a description, even though the description can be very short. You can have a usage section, arguments, details, values and examples, and then the keyword has to be "method". So I'm hoping that everyone was able to build it, because these are the four minimal kinds of files that you need to build a package for R. And like I said, being able to build a package in R is very useful, because there's only a limited number of people who can program R, and of the people who can program R, there's only a limited number who can actually build a package. So you can build a package for other people when they ask you. If people know that you can program in R, you can tell them "but I can also build a package", and then in the future, when they have some nice code that they wrote to do an analysis, you can help them build their own package. It's a nice way to get co-authorships on publications: just by contributing your skills as a package author you can get some nice publications or co-authorships.
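To summarize the two metadata files just walked through, here is a rough sketch. The package and function names below are placeholders for illustration, not the exact ones used in the lecture. A DESCRIPTION file could look roughly like this:

```
Package: mypackage
Version: 0.1.0
Date: 2024-01-01
Title: My First Package
Author: Your Name
Maintainer: Your Name <you@example.com>
Depends: R (>= 3.0)
Description: A minimal example package with one R function and one C routine.
License: GPL-3
```

And a matching NAMESPACE file, where the compiled code is loaded first and the two user-facing functions are exported:

```
useDynLib(mypackage)
export(myFirstPackageFunction)
export(callCTestFromR)
```

Internal functions are simply not listed in `export()`, so they stay invisible to the user.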
You won't be first author, of course. All right, so that's everything that I wanted to say about how to build a package. You need a DESCRIPTION file, a NAMESPACE file, some R files where there's some code, and some manual files to describe the code that you are going to give to the user. And if you did everything correctly and put everything in the right folders, then when you run R CMD check, it should just run through without any issues. You might get one or two notes about minor things, but if everything went correctly, that's how you do it. So if there are no other questions about package building, then I think we should just start with the lecture. I'll wait a little bit so that people can ask questions in chat, but I hope that everyone was able to build their own first package. I think it's a nice halfway point: last week we were halfway through the lectures, and it's really nice that at the halfway point you can write your own code, write your own little functions, and have the ability to give this code to other people. Of course, there's much more to it if you want to get your package on CRAN: then you have to follow their system. You have to submit an email first saying "I want to upload a package with this name", then they send you back a link, and on that link you have to upload your package. But that is all quite straightforward, it's just following their submission procedure. All right, I see nothing in chat, so everyone was able to build their package, which is really good, and I'm really happy about that. There will be some questions about package building on the exam. All right, so for today: regression analysis. I have an overview, so what will we be discussing today?
So we will be discussing basic regression. First we will start with simple linear regression, and talk about things like confidence intervals, how to plot confidence intervals, residuals, and these kinds of things. Besides that, I wanted to talk to you guys about multiple linear regression, which is actually my PhD study field, and then say a few words about quadratic regression, because it fits really nicely onto the data set that we will be analyzing, and then a few words about model selection: how can you compare the different models that you have? So you have a phenomenon, for example precipitation, that you measured, and besides that you have all kinds of variables which are predictors for the precipitation: for example, are there clouds, what's the temperature, what's the air humidity. All of these things you combine into a single model, but of course you can combine them in many different ways, and you want to have a way to decide that this model is better than the other models. That also means that there's one slide in Latin for me today. That's a slide that I'm going to give you for free, so you can listen to my beautiful Latin speaking, and I'm curious to see how it will work out. I officially had Latin in high school, but since it's a dead language, no one knows how it's spoken, so it should be fine: no matter how you pronounce it, it's always correct, right? So these are the topics for today, but I want you to ask questions. This lecture is an introduction for you guys on how to use the lm function and how to use the anova function and some of the helper functions around them. Like I said, I have a lot of experience in linear modeling, but I try to keep the lecture short. "Latin died, my condolences." Well, yeah, the only one who speaks it is the Pope, right?
And he's the authority on it. Officially, there's no way to know: we don't have any recordings of people speaking Latin in the old days. So today we will have a very short lecture, about 41 slides, and I'm hoping to fill at least one and a half hours with it. But if you ask questions, we can have a much longer discussion today; I'm here for you guys until five, so ask unrelated questions if you want more details. So today we will do basic linear modeling, but don't worry: next week we will be talking about linear mixed models, and the week after we will be talking about generalized linear mixed models, because linear models are a massive field. So I made a little bit of an overview slide for you guys. This is more or less everything, right? Here we have what we are going to talk about today: linear regression, ANOVA, and multiple linear regression. The analysis of covariance I will kind of skip over, because it's very similar to the analysis of variance. These are called general linear models, so that will be the topic for today. They are really nice, but they have very severe limitations.
So Things like repeated measurements cannot be handled by standard general linear models So next week we will be talking about when you have repeated measurements or when you have time series data And how to use mixed models to kind of analyze these kind of structures have where there's Structure between the individual measurements that you have But for today, we will only be talking about when you have Randomly measured things from a population and you want to calculate things which are valid for the whole population Well, and you can do the same thing of course with mixed or linear mixed models but the linear mixed models allow you to kind of Optimize your PV or optimize your p-value sounds a little bit like p-hacking, but it allows you to deal with things like repeated measurements Things like random effects where for example certain individuals have a or share a father or they share a mother How to deal with those? And then the week after we will be talking about generalized linear models And we will be talking about things like logistic regression Binomial and log linear models So in the case that the input variable that you have is not a Kind of a continuous value variable. So when it is for example a yes, no answer or When you have very specific amount of levels or if you have a very very weird Finer type structure or a very weird measurement structure that you have but for today general linear models, so very basic very short introduction and Ask as much questions as you want I can talk about this shit for like hours and hours on end Which I actually do during my work But since it's a lecture and I don't want to overload you guys with all kinds of like little details I try to keep it kind of high level, but if you're Interested in think like oh, but have the data that I have has this kind of a quirk to it How do I deal with it? 
Then just let me know, because then I can just make an example in R and we can talk about it, and since I only have a limited number of slides, we have enough time to deal with your questions. All right. So, very basically, regression analysis is the statistical process for estimating, or kind of finding, relationships among several variables that you have, and there are many techniques for analyzing these several-variable models. But the focus is on the relationship between the dependent variable and one or more independent variables. Just terminology-wise, the dependent variable is the output or the effect, the thing that you have measured and want to explain, and the independent variables are things like the input or the cause. Those are the things to which you want to distribute the variance, right? So, in my mind, there is variance in the dependent variable: you did, say, a hundred measurements of a certain thing, and now, based on other variables, you want to distribute the variance that you see in the dependent variable onto the independent variables, to kind of assign variance. So it's kind of an assign-the-blame game, because you want to say: well, this genetic marker is responsible for a change in this gene's expression, and not only is this genetic marker responsible for it, but there's also some environmental influence on it, right?
So you are trying to take all of the variance that is in your dependent variable and assign this variance somewhere. The basic regression model looks like this: we have y, which is our dependent variable, the response, and that is then predicted by a function. In this function we have our independent variables, and these independent variables are coupled to something called beta, and beta is the regression coefficient for X. So if you have five independent variables, then you will have five beta values. There may also be some constants which are not mentioned in this model. Then you have n, which is the number of independent measurements, so that is the length of your vector y, of your dependent variable. So I might have measured milk yield from a hundred cows, and then n is a hundred. And then k is the number of unknown parameters, so that is the number of independent variables. So say I have measured milk yield in cows, and I have five other measurements, like the body weight, the length of the tail, does it have horns,
and other measurements, and I want to assign the variance in the milk yield to these different independent variables. I hope that's clear. So we have beta, which is the thing that we are going to compute. Then we have X, which is more or less a matrix; if you have only one independent variable, then of course it's just a vector, but X tends to be a matrix of different things. And then we have y, which is the dependent variable, the thing that we are trying to predict, the thing whose variance we are taking and distributing across the different columns of X. All right, so regression comes with a lot of assumptions, and I kind of want you guys to know all of them, because if any of these assumptions does not hold, then your regression model is not valid. But of course there are some which are more important than others, so I highlighted in red the things which often go wrong and which are really important for a valid regression model. The first assumption is that the sample is representative of the population for the inference or prediction. That means that if I want to make a statement in the end, right?
So I do my modeling, and in the end I want to make a statement like "humans who have this gene tend to be bigger". Then I have to have randomly sampled from the human population. That means that just doing the study in Germany is not good enough, because "humans" is much broader than just Germany: there are people from Africa as well, and people from Asia, and people from America, and they are not German. This is one of the things which, in science, goes wrong most of the time. People draw conclusions and make it seem like these conclusions are valid for all cows on the planet, or all humans on the planet, but when you look at how they sampled, they did not randomly sample across the planet: they only sampled Eastern Europe, or only Western Europe, or a very small area. So the sample that you are taking has to be representative of the population that you are going to do the inference on, and this is the thing that goes wrong the most in science. In many scientific papers (really nicely written papers, very good data, relatively good results) the conclusions are way too broad, and this is because of the first assumption underlying linear regression or underlying ANOVA analysis. The second one is very important, and that is about the error. The error is the thing which remains: if I take the variance in the thing that I want to predict, and I have assigned variance to all the different things that I have used to kind of catch this variance, then in the end I'm always left with some kind of error, because of course no measurement is perfect.
No measurement machine is perfect. The assumption is that the error is a random variable with a mean of zero, conditional on the explanatory variables. To rephrase it: if I take my phenotype of interest and fit all the different things whose effect on the measured phenotype I want to look at, then I'm left with an additional term, and this is my error term, also called the residuals of the model. The residuals need to have a mean of zero, conditional on the explanatory variables; they have to be normally distributed with a mean of zero, that's kind of the idea behind it. Next: the independent variables are measured with no error. This also goes wrong a lot of the time, but it is not that important, because a little bit of error never hurt anyone, that's kind of what I think. But the independent variables normally should not be measured with error. For example, if one of my predictors is the sex of the animal that we're analyzing, then of course I cannot make mistakes there: I cannot say this is a male while it's actually a female, and I cannot say this is a female while it's actually a male. So I have to be very careful when I define my independent variables, because the model is much better when you have no error in these measurements.
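As a quick sketch of the zero-mean residual assumption just described (on simulated data, not the course data set): when you fit a linear model with an intercept, the residuals average out to essentially zero by construction, and you can inspect their distribution yourself.

```r
set.seed(1)
x <- rnorm(100)
y <- 5 + 2 * x + rnorm(100)   # simulated data with well-behaved errors
fit <- lm(y ~ x)

mean(residuals(fit))          # essentially zero (up to floating-point rounding)
hist(residuals(fit))          # should look roughly normal, centered at zero
```

Because the intercept soaks up any constant offset, the sample mean of the residuals is zero by construction; the more informative check is whether the residuals stay centered across the whole range of the fitted values, for example via `plot(fit, which = 1)`.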
Of course errors always occur, so you can't guarantee that there are none, but one of the assumptions underlying the regression model is that there is at most a minimal amount of error there. Then, one of the things that often goes wrong as well is that the predictors have to be linearly independent: it must not be possible to express any predictor as a linear combination of the others. When it is possible, that's called collinearity, so two things are collinear, and this is something that is really hard, generally almost impossible, to avoid completely. There will always be some linear dependency; the question is just how much. We can go back to this later, but this is one of those things that is really important, and it is something that you can actually check by building your model, taking your independent variables and switching them around. Because if you have, say, three independent explanatory variables, with p-values and beta coefficients for each, and when you start moving them around the beta coefficients change by a lot (hey Florian, welcome to the lecture), then the problem becomes that you cannot assign the variance, because you cannot uniquely say "being clouded has a certain effect on the raininess", since the cloudiness is coupled to the temperature. Next: the errors are uncorrelated, which means that the errors cannot be correlated to any of your predictor variables. Generally, people don't even test this, but it is one of the assumptions that is there. And then the last one, which is very important but a lot of times overlooked because the word is just too hard to pronounce, is that the variance of your data should be approximately equal across the range of your predicted variable values. This is called homoscedasticity, or the absence of heteroscedasticity. I'm not even going to say it again. If
this is not the case, a log transform or other methods might be used instead. So what does that mean? Let's just make a little graph of it. I have my board here, so why not use it once in a while; I haven't really used it in a long time, I don't think we've used it yet. So let's go to full screen and I will draw. Oh, that's strange, that's not how this should be. Filter, delete that, yes, make it big, very good. So now you can finally see the rest, like my bottle collection: that's my pension fund right there. I'm in Germany, so you get money for bringing back bottles. All right, so homoscedasticity versus heteroscedasticity. Let me remove this text as well and just make a little graph. So linear regression looks a little bit like this: on one axis we have our independent variable, and here we have our dependent variable. And what we want to see when we do something like this is that there is a relationship. So imagine that my independent variable, the thing that I measured, has five levels. For example, human height: let's say that we divide human height into five groups. We have people who are less than 140 centimeters, then people who are 140 to 150 centimeters, 150 to 160, and so on, and then you have people who are larger than two meters. Now, if we look at our dependent variable, for example the body weight, then of course the first group has a certain variance: we have some measurement points, and then we have more measurement points here as well, and more here, so that's three groups, then four groups, and then five groups. Right, so heteroscedasticity... "Do you use RStudio?" Well, for the lectures we generally use the standard R GUI, but I have used RStudio in the past. It's just R, right?
It doesn't really matter which one you use. But heteroscedasticity (it's such a bad word), versus homoscedasticity, is about the following: when you fit your model, we are going to fit the best-fitting line, a regression line, and the thing that we don't want to see is that the variance in the first group is much smaller than the variance in the last group. If the measurements in the last group spread all the way out like this, and the variance here is much bigger than in the first group, then your model is not valid when you see a pattern like this. And this is something that people rarely check; they rarely check for heteroscedasticity, and there are ways to solve it. But generally you have the standard pattern that the bigger your measurements become, the bigger the variance becomes. This means that the regression line cannot be estimated very precisely, because in this case a regression line like this would also fit quite well, and a regression line like this would fit very well too. That is because the points out at the edge of the distribution have a much larger influence on the regression line than the points here. Points here are much closer to the regression line, so they don't really influence it that much, but in a heteroscedastic situation these points here, because the variance is much bigger in this group, pull the line around, and this is very detrimental when you are doing a regression model. It kind of invalidates all of the predictions, because when I now start to predict things for people who are larger than three meters, you can see that these lines fan out very broadly, so the uncertainty is very big. "You see this in fermentation experiments when you measure the OD600 of cells: the error from the OD measurement is effectively smaller at a lower value compared to a larger value." Yeah, yeah.
No, that's true. You see this in a whole bunch of measurements, and actually every kind of measuring machine suffers from this, because you have a certain dynamic range, and large numbers tend to be associated with larger variances. So it's not something that will completely invalidate your regression model; it's just something to be aware of. And Florian is pressing buttons again. Florian, I can't do the next slide in three different languages, pick one: which one do you want, Dutch or German? And you're not getting your points back for the other one; you were dumb enough to click the button for both. And Flemish? No, I'm not going to do it in Flemish, I have certain standards here. All right, let's get back to the lecture and let me move myself out of the way, so face cam off. That's better, right? So you guys can see me again. All right, so these are the assumptions for linear regression. There are a lot of assumptions, and we generally need to check them all; the ones in red are the ones which often go wrong. So if you ever need to do a journal club and you have a paper where they use linear regression, make sure that you check these assumptions, and the first one is the most important one, because this is what very often goes wrong in research: people draw conclusions which are overly broad. They only measured people in a very narrow slice of the population and then they try to generalize their results to everyone, which is often what you see. This is also probably the major issue with phase three trials for medicine, which are usually done in one country, or two or three countries, and then they conclude that the vaccine is safe for everyone on the planet. Which actually... is it, right? "You can log-transform?"
Yeah, yeah, so it also says here, in the last part: for heteroscedasticity you can log-transform to get rid of this effect. There are other transformations that you can use as well to minimize the effect of these extreme points on your regression line, but the log transformation is one of the most common methods. There are others, like square root transformations, and some other transformations which also compress the higher range much more than the lower range. And you can have it the other way around as well: you could have a large variance in this group and a small variance here, which means that you're unable to predict downwards, but you're actually pretty able to predict upwards. That really depends on your measurement machine and where the error is, but making a plot like this is always good, to check whether your data is valid. All right, so those are the assumptions. I'm going to do this one in Dutch for Florian, so Florian, here we go: when n is bigger than k (and this goes back to the slide here, where n is the number of independent measurements and k is the number of unknown parameters that we want to estimate, so the number of betas), so when you have more measurements than independent variables, and the measured error is normally distributed, a Gaussian distribution, then we say that there is an excess of information, an overflow of information. And this can be calculated by just saying: we have a hundred measurements and five betas that we want to estimate, and if we subtract those from each other, then we have ninety-five degrees of freedom, more or less (it's not exactly degrees of freedom). And if the assumption is that everything is linear, so that we don't have groups but just a straight linear effect, then we can use the overflow of information to make statistical statements about the unknown parameters. So that was in Dutch, and I will now read it in English.
So, n and k: the number of independent measurements that I have, related to the number of parameters that I want to estimate. The difference between these two is the amount of excess information, and the excess of information is expressed as n minus k. Of course, this assumes that you are looking at a linear relationship and not at a grouping relationship like I drew here. So if the independent variable is just a continuous variable, and your dependent variable is also a continuous variable, then every beta that you estimate takes away one degree of freedom. So when I have a hundred measurements and I'm trying to estimate five independent variables, then I have 95 degrees of freedom, and the more degrees of freedom I have, the more certainty, the more statistical power I have to make statements about the beta parameters that I'm trying to estimate, about the unknown parameters. That is where the power of regression comes from. This also means that when I'm trying to estimate too many unknown parameters with a very limited number of measurements, for example if I have only 20 measurements, then of course I cannot estimate 30 different parameters, because 20 minus 30 is minus 10, which means that I actually have a lack of information. I don't have enough measurements to say anything statistically valid about my unknown parameters, about my independent variables. All right. Linear regression in R is done via the lm function. I'm here using the airquality data set, which we already saw during the plotting lecture. Just to remind you, the airquality data set has ozone, temperature, I think wind speed, and... what's the last one? Because it has one more.
data(airquality) — let me actually just load it, and then the data set is called airquality, of course. Solar radiation, that's the last one. So it has the ozone concentration, the solar radiation, the wind speed, the temperature, and then the month and the day at which it was measured. So very basically, if I want to predict the ozone from the temperature, like I'm going to do here, right, so I'm saying that the ozone in the air is predicted by the temperature: ozone is my dependent variable, temperature is my independent variable. Of course, you can directly see that one of the assumptions is broken in this model, and if anyone can tell me which assumption is broken in this model, you will have my, like, eternal gratitude — or no, not gratitude, like, I will be very, very proud of you guys. So just very basically think about the airquality data set, right? It has a couple of different measurements. We can make the model by just saying linear model, take ozone as our dependent variable, temperature as our independent variable, and then say data is airquality, and then we just store this model. If we want to see what's in the model, then we just use the summary function and throw the model in. So let's go back to the assumptions. So which assumption is broken? There are 16 people viewing, so at least one of them should be able to make a guess, and there are no wrong answers. Like, you won't get points subtracted on the exam when you don't know the answer. It's just something like, hey, if you just think about how it is, which one of these might not be valid for the current model, when we say that the ozone is dependent on the temperature? Go go go. No one? Come on, people. Is it number one, is it number two, is it number three, is it number four, number five, or number six? No one? Number one, maybe. Number four. Okay.
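Written out, the model from this part of the lecture looks like this (lm_temp is the variable name used later in the lecture):

```r
# Load the built-in air quality data set
data(airquality)

# Ozone is the dependent (response) variable,
# Temp is the independent (predictor) variable
lm_temp <- lm(Ozone ~ Temp, data = airquality)

# Inspect the fitted model
summary(lm_temp)
```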
There come the guesses. All right, so for number one there is something to say, but the problem is that for this data set we cannot know what we are doing an inference for. Number four, the predictors are linearly independent: we only have one predictor, and if we have one predictor, the predictor is always linearly independent, right? Because there's no other predictor to depend on. No, the issue here is number three: you cannot measure temperature without any error. That means that, like, the temperature can be forty-five point three three three seven two one one eight seven three, but no matter how you write down your temperature, you're always wrong, because there's always one more digit behind the comma that you could have measured but didn't, so there's always a little bit of an error. Right, so "the independent variables are measured with no error" — but that's impossible, because you cannot measure temperature without an error. You can measure sex without an error, but temperature, no. Because you can say, well, it's forty degrees, but it could be thirty-nine point nine eight, and then there's zero point zero two of error in your measurement. So no matter where you put the digits, no matter how many digits you write down, there is always a small error, right? And that's why I highlighted this one in black, because this one is of course almost impossible to satisfy: there are no machines that have no error in measurement at all; even a basic thing like a thermometer has a little bit of error. So the assumption which is broken is assumption number three: no error in the measurements. Of course, the linear independence doesn't come into play here, because we only have one independent variable and it cannot be linearly dependent on any of the other ones, but it's good that you think about it, right?
Alexander asks: is there an assumption for the error in the model? Well, yeah, the assumption for the error, right, the error term in our model here. If you would write this down on a board, right, and you would want to publish your model, then the model that you're fitting looks something like this. So — is this visible, by the way, if I just do it here? — you would write down something like: ozone is predicted by the temperature plus an error. Right, so ozone_i equals beta times temperature_i plus error_i. This is the way that you would write down your linear model in a paper. So the error term here, which you almost always hide or almost never mention — those are the residuals. So when I regress ozone on temperature, then there is of course variance left over, and this variance goes into the error term. And when I plot the error term, it should look like a normal distribution with a mean of around zero. But if we look at the summary of this model — a very basic model where we say that the ozone concentration in the air is dependent on the temperature that we measure — of course it doesn't fulfill all the assumptions, but that's okay. You can never fulfill all the assumptions of a model. It doesn't make the model invalid; it just means that the model is less valid than it could be. And of course, if you take a better thermometer, right?
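A quick way to look at that error term yourself is to pull the residuals out of the fitted model (a minimal sketch):

```r
lm_temp <- lm(Ozone ~ Temp, data = airquality)

# The leftover variance ends up in the residuals; if the assumption
# holds, they should look roughly Gaussian with a mean around zero
hist(residuals(lm_temp))
mean(residuals(lm_temp))  # essentially zero by construction (OLS with intercept)
```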
If you have a thermometer which is only accurate to five degrees, then of course changing your thermometer for one which is accurate to two digits behind the comma will already improve this third assumption, right? The less error the better; you can never avoid all errors altogether. But if we do the summary in R, then this is what R tells us. So the first thing that R does when you do the summary of the model is give you back the formula that you used, which is useful in some cases, because if you build, like, six different models in R and you store them all in variables called a, b, c, d, e — which you should not do, right, because you should give variables good names, but people generally don't — it's good to be able to see which model you fit. Here you see some information about the residuals, and you see that the residuals here are, like Alexander said, not actually centered on zero, right: the median is slightly negative. And you see that the first quartile is at minus 17 and the third quartile at plus 11, so this cannot really be a good Gaussian model, right? So I think Alexander is a little bit right that the residuals of the model are not a Gaussian distribution. We can't really see this from the slide, but based on the numbers here we can assume that the median is definitely not zero — it's slightly negative — and it seems that the negative side of the distribution is longer than the positive side, because the first quartile is at minus 17 while the third quartile is at plus 11, so there's about a six-point difference between the two tails. The thing which I want to show you guys is actually the estimates, right, because the estimates are the beta estimates, so the estimates that we want to predict, right?
So what do we learn? We learn that if the temperature were zero, the ozone would have a value of minus 147, rounded, right? And then for every degree of temperature increase, the ozone seems to increase by 2.4 points. And this is of course a little bit strange, because that would mean that at a temperature of around, like, 20 degrees you would still have kind of a negative ozone concentration, which is a little bit weird, but this is just the data; it's just the regression model. I don't actually know why it does that, because it should not, since the ozone values are always positive — but it's just a bad model. That's the thing that we will conclude in the end: that this is not a very good model and that it's kind of misbehaving at the lower end. Yeah, but these are the things that we learn. So the intercept is the location where the temperature is zero — so all the independent variables are zero — and then this is what the prediction is for the ozone concentration. And then temperature itself gets a beta of 2.4; that means that every degree of temperature increase leads to around a 2.4-point increase in the ozone concentration. And then here we see the multiple R squared. The multiple R squared is 0.48, which means that around 48 percent of the variance in the ozone concentration is determined by the temperature, which is relatively high. So it's not that the model explains little — it explains a lot; it's just that the model probably isn't really valid. So in R, regression you can do via the lm function, and if you want to show the results, you have to use the summary function. All right, so if we now want to plot this model, right, then we can do that very easily, because R, when you do a plot, allows you to use more or less the same formula notation as a linear model. Oh crap, someone's doing R on Twitch. It's not Maddie two shoes.
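The numbers discussed here can be pulled straight out of the model object (values are approximate):

```r
lm_temp <- lm(Ozone ~ Temp, data = airquality)

coef(lm_temp)               # intercept around -147, Temp slope around 2.4
summary(lm_temp)$r.squared  # around 0.48: share of variance explained
```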
Thank you for subscribing. Why wouldn't you be doing R on Twitch, right? Like, it looks so beautiful — let's open up the R window; it's R as in awesome. Oh, okay, not "oh crap, I just got away from R and now I'm on Twitch and there's even more R." Yeah, well, we've been doing this for quite some time; this is actually lecture number nine, so there are eight more lectures on Twitch that you can look back on and learn from. All right, so hey, in R we can use more or less the same way of writing down a formula to also make a plot, right? So if we want to make a plot which has ozone as the dependent variable and temperature as the — I always mix them up; so this is the predictor and this is the response, the terminology is a little bit strange. "I hadn't found you yet, but I'm glad." Yeah, no, well, welcome to the lectures; we're also on YouTube. But no, no, we're in season three already. Yeah, yeah, we also did a bioinformatics course last winter semester, and this is the second time that I'm doing the R course on Twitch. So yeah, we can use the plot function. We can just use more or less the same strategy as before. The only thing that is different: for the model we say data is airquality, but in the case of the plot we cannot add data =; we just have to say take from airquality the column Ozone, and regress this on — from the airquality data set — the column Temp. I use pch = 19 because I want to have these filled dots, right, because I don't like the open dots, which are the default in R, and then I say col is blue because I just like blue, so we make a blue plot. "I missed everything." Oh, you didn't miss anything; like, if you know what R is, then you're already ahead. Okay, and now when I want to plot the regression line, R also makes that very, very easy. So I can just take it from lm_temp, right, which is my model that I built here.
So from lm_temp I take the coefficients, and then I say take the intercept, right? So the intercept is my a coefficient, and I hope that everyone knows what the formula is for a straight line in a plot. I'm going to just write it down for you guys, because, like, I know a lot of people don't do a lot of mathematics. So a straight line, in R or in any mathematical formula, says that y — right, which is the y position of my point — is determined by something called a, so the intercept, plus b times x. Right, so x here is my x coordinate, b is my directional coefficient, and a is my intercept. So if I want to plot a straight line in R, I have to get the a coefficient, which is the intercept, and then of course I also have to get the temperature coefficient, which is my b: that is how much I increase with every increase in my air temperature. All right, and then I just use the abline function, say a equals a, b equals b, color is red, line width is 2, and then it will plot the regression line in my plot. And here we can also see why all of a sudden my intercept becomes so massively negative: it's because at a temperature of around 60 degrees Fahrenheit there's almost no ozone in the air. But of course the linear model doesn't know that negative ozone concentrations cannot occur, so it just extrapolates a straight line, and here you can directly see one of the drawbacks of linear modeling. They — holy shit, no, I don't want to become famous, block all these people, help, where's my moderator? No, don't become famous, I'm already famous enough. Anyway, so here you can see the intercept is negative just because the linear model doesn't understand that negative ozone values cannot occur, right? There's always some ozone. "Chill, I'm on it." Yeah, thank you.
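Putting the plot and the regression line together, the code sketched on screen looks roughly like this:

```r
# Scatter plot: ozone against temperature, filled blue dots
plot(airquality$Ozone ~ airquality$Temp, pch = 19, col = "blue")

# Straight line y = a + b*x, built from the model coefficients
lm_temp <- lm(Ozone ~ Temp, data = airquality)
a <- coef(lm_temp)["(Intercept)"]  # intercept (a)
b <- coef(lm_temp)["Temp"]         # slope / directional coefficient (b)
abline(a = a, b = b, col = "red", lwd = 2)
```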
Thank you, you're the best moderator ever. But just plotting the line — and this is generally what you want to do in a publication, right? Of course, you want to make the plot look a little bit better; hey, you might want to remove a couple of these outliers, you might want to use ggplot or something like that, but in the end, just use the abline function, use the plot function, and you can make really nice plots. And I like the fact that you can actually specify your dot plot very similarly to the lm function. The only drawback is that you have to say airquality$Ozone, airquality$Temp every time. But of course, I've told you guys that if you use the with function, you can just say with airquality, plot ozone tilde temperature, and then of course you can do the same thing. And in the plotting lecture we already showed you how you don't have to repeat airquality over and over again. All right, so: confidence intervals, right? So if we look back at the model, we see here that every estimate has a standard error, right? Because we are just estimating a certain relationship, we cannot be sure that the ozone rises with exactly 2.4287 points for every degree of temperature — sometimes it will be a little bit more, sometimes a little bit less. So every variable, or every predictor variable, that you estimate comes with an error. So how do we do this? When we calculate our confidence interval, right, where we have 95% of our data in, the first thing that we want to do is calculate the margin of error. So in standard regression we can use the t statistic, right? We need the standard error from the lm summary — that is given to us — and then we need to find the critical value, right?
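The with() variant mentioned here would look like this (same plot, less typing):

```r
# No need to repeat airquality$ in front of every column name
with(airquality, plot(Ozone ~ Temp, pch = 19, col = "blue"))

# abline() also accepts a fitted lm object directly
abline(lm(Ozone ~ Temp, data = airquality), col = "red", lwd = 2)
```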
Because the critical value is the value which we define because we want to have a 95% confidence interval, or a 90% confidence interval, or a 99% confidence interval — just to give the researcher some wiggle room in saying how certain he wants to be, right? So the first thing we do is define our critical value, and generally in biology our alpha is 0.05, very similar to the alpha value in testing. But in this case, since we are looking at confidence intervals, we have to do this two-sided. Because it is two-sided, we have to define our probability boundary as 1 minus alpha divided by 2. So that means that in this case our probability boundary for our 95% confidence interval is 0.975. I also need to know how many degrees of freedom there are; fortunately, the number of degrees of freedom is also mentioned — well, not in the summary, but in the ANOVA table. But we can get the number of degrees of freedom quite easily, because it is the n value, so the number of measurements that we have, minus 2. Why minus 2? Well, we lose one degree of freedom for estimating the intercept and one degree of freedom for estimating the temperature component, right, because we're estimating two parameters: not just the beta for the temperature, but also the intercept. And then I can just use the qt function — this is the quantile function of the t distribution — and say: give me qt of 0.975 with n minus 2 degrees of freedom, and then we get our critical value. And now the margin of error at each point of the line is just calculated as the critical value times the standard error. This is really difficult, so I made a little example. But it's three o'clock, so I will take a very, very short break.
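The steps just described boil down to the following sketch (confint() at the end is R's built-in shortcut for the same calculation):

```r
lm_temp <- lm(Ozone ~ Temp, data = airquality)

# Degrees of freedom: n - 2 (intercept and slope each cost one)
df <- df.residual(lm_temp)

# Two-sided 95% interval -> probability boundary 1 - alpha/2 = 0.975
crit <- qt(0.975, df)

# Margin of error for the Temp estimate = critical value * standard error
se <- summary(lm_temp)$coefficients["Temp", "Std. Error"]
crit * se

# R can also compute the whole interval in one call:
confint(lm_temp, level = 0.95)
```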
I have some coffee somewhere, but yeah, you guys take a break as well. This week's break-number-one animated GIFs I did not select; the animated GIFs were selected by Daniel. I'm really happy that people are actually interested in the lecture and actually want to contribute something, like animated GIFs. "Did you record it?" Yeah, yeah, I'm recording, so everything's fine. Chill, I'm on it. Anyway, you guys can enjoy some animated GIFs for around five to ten minutes. I will be back at ten past three, and then we will continue with an example of how to calculate the confidence interval and how we can then make a nice plot. Because of course, in a regression, just plotting the regression line is not enough; we want to have a nice confidence interval surrounding it to show people how good our model is. And after that we will start expanding our model and making it more and more complex. All right, so I will see you guys at ten past three — enjoy the animated GIFs.