All right, so the stream will start soon. Yes, we can, but you're speaking Dutch. Yeah, I'm in Holland. It's difficult, or I'm getting used to speaking Dutch again, which is strange, because normally I speak English all day, or English mixed with German, but the last six days have been completely Dutch. So that's good. All right, the stream has already started, so let's go. We have a lot to do today, because this is more or less the last lecture, and I will explain why: we'll still have a lecture next week, but that will not be part of the exam. So this is the last lecture before the exam, because I do want to give you guys some time off as well. Today we'll be doing logistic regression, we'll be talking about some common R programming idioms — some common ways of doing things — and some other things which I think you guys should know, especially because I've been working hard on the lecture for next week. And of course I've been practicing a lot with my drawing board, so I've also made some things that I want to show you guys for next year, for pandemic edition three, if we get one. All right, before we jump into it: the exam dates. I think everyone knows by now, but I still want to repeat it: the 21st of July and the 9th of September. The best grade will count, so you can do both. I would definitely advise everyone to do the first, and if you don't pass, the second, but you're more than welcome to do both of them; that's perfectly fine, the best grade will count. So good luck with that. We'll have a short overview at the end of the lecture to discuss all of the lectures that we had, and I want to highlight what I think is important. That should be interesting for you guys, because then you know more or less what the questions will be about. All right, so: the assignments. I had one question sent to me via email.
Let me open up my email. But if you have any other questions after reading the PDF — as was the assignment — then I'm more than happy to hear them, and otherwise we'll start with the first question that I got. I will read the question quickly: "Dear Professor Denny" — it's always nice that people call me a professor, although I'm just a doctor — "I'm watching the last lecture and I want to ask you something. When you were explaining how to correct data for a specific effect: after correcting, you've done a t-test to compare among the different strains. My question is: is it the right approach to do a t-test to compare three groups?" So, it depends a little bit on how we think about linear models. Let me see, I'm going to switch to the drawing — it's actually working with the pandemic-edition layout. So what we have is: we want to build a model, right? We want to model all kinds of effects, and there's a whole bunch of nuisance variables which we're not really interested in, so we want to get rid of those. After we've gotten rid of those — I think that's very basic, right — we for example have some kind of a linear effect. Let me put it in a color that you guys can see. All right, very good. So if we have a data distribution, and we have some data points scattered around, then in the first instance we want to look at all of the effects that we have, and we want to draw a straight line to get rid of some nuisance variables, right? So we have nuisance — yeah, I have to work on my handwriting more. So we remove the nuisance variable, and then we are left with a data distribution, and in our case we had three different strains, right? We had the BFMI homozygous, then we had the heterozygous, and then we had the B6 animals. And the idea is that these things you can of course code in different ways. So when you do a linear regression, what you could say is: well, we interpret these three groups as being linear, right?
Because we are interested in an additive effect, where for example the B6 allele — going from having no B6 allele, to having one B6 allele, to having two B6 alleles — and we code it like that, right? So we say: this is 0, this is 1, and this is 2. And now, when we do a linear regression, what we see is that we have our data points — let me make those a little bigger. So we have our data points here in the first group; then we have our second group, which might be a little bit higher on average; and then we have some other data points in the last group, which again is a little bit higher. So if we code our regression model like this, 0, 1, 2, what happens? R will just draw a straight line through it. So what we get is a single beta coefficient, and this beta coefficient tells us: having a single B6 allele increases your phenotype by this much, and having two of them is of course two times better. However, when we do a normal linear model, we might want to not code these as numeric, because we might be interested in other effects as well, not just the additive linear effect which comes from having a B6 allele. Because we can think of situations where we have a different structure. For example, we have the BFMI individuals all being relatively high, we have the heterozygous all being low, and then we have the B6 individuals being slightly higher than the BFMI again. Right? So we have the same three groups on the x-axis, but now, when we draw a single straight line, this single straight line is not going to capture this effect, because we'd have three groups, but the straight line tries to optimize the distance to each of these groups. So you see that we get a very poor beta coefficient, which is kind of around zero, right?
So to circumvent this, we can code it not as 0, 1, and 2, but as a factor. So when we do an as.factor, what we tell the model is: take one of the groups as being the base, and then compare the other groups to the base. So what happens is that we have a model, for example, like this, and we remove the straight line, because if we model it as a factor, what will happen is it will take the first group that it encounters as being the base — so the mean — and then it will do a test to see if this group is different from the group here. And then of course we have the other situation, where we have this group, which is also being tested against the base group. So what we get now, when we use as.factor, is two beta coefficients: a beta for group one versus group zero, and a beta for group two versus group zero. And the p-value tells us if any of this is true, right? So if group one is different from group zero, or if group two is different from group zero. But it doesn't tell us which group is different. So in that case we have to do something which is called a post hoc test. After we did the association and we found that there is a big difference at this marker, we still have to figure out which of the alleles carry the effect, because in our situation, being heterozygous — so having one allele from BFMI and one allele from B6 — makes you significantly smaller than the first group. But for the second group, that's not really true: the second group is more or less the same as the first group.
So then we start using the t-test. For the t-test, we use it as a post hoc test. Once we've figured out that there is something going on at this marker, we then want to do a test to see which one of the groups is significantly different from the other ones. So we do a post hoc test: we do a t-test for group 0 against group 1, a t-test for group 0 against group 2, and of course we have to do the t-test for group 1 against group 2 as well, right? So in this case we have to do three tests afterwards to figure out exactly what is going on at this marker. So that was the thing that I was trying to explain: first you get rid of the nuisance variables — you remove them from the phenotype that you're interested in — and then you start doing the t-tests, if you find that there is a difference in one of the groups. Of course, this only works when you model it as a factor, right? Because if you model it as a factor, then you get the two betas. While if you model it as a numeric effect — so just coding it as 0, 1, and 2, or -1, 0, 1 — then R will treat it as a single linear line, and it will only estimate a single beta. And in the case where we have a pattern like this, we will not get a significant effect, even though one of the groups is significantly different. So linear modeling is very sensitive to the question that you ask, in a way. So the way that you model your data: either as a numeric linear effect, where you say, well, we're looking at an increase; or as a categorical variable, where you say, well, I have three different groups, or four, or five. But if you treat it as a factor, then you always have to do a post hoc test afterwards, comparing the different groups and making sure that the group which you think is different really is different, because, depending on the variance, this does not have to be the case.
It is not always the case that the group with the lowest beta or the highest beta is the group which is the most different, because that goes together with the variance. A big difference in means doesn't automatically signify that there is a significant difference between the groups: it could be that the variance in one of the groups is really high. All right, so that's my answer to "why do you want to do the t-test?" Well, the t-test here we use as a post hoc test. After we did the initial removal of covariates — the nuisance variables — we do the association, using either a linear effect, if we want to look at additive genetic effects, saying that having a single allele from your mother will increase your phenotype, and having two maternal alleles will double the effect; while we can also look at different types of effects — it doesn't have to be a pure additive linear effect, you can also just use different groups. "So wouldn't it be better to do an ANOVA including them all?" Yes, you would first do the ANOVA as a test to see if there's a difference, and then, after the ANOVA says: well, at this position in the genome there might be a difference, you use the t-test to figure out exactly which group is different. Of course, this also has a bit of an effect on the p-value, because the p-value from the ANOVA will not be exactly the same as the p-value that you get for the post hoc test, right? You're doing three post hoc tests, and from these three post hoc tests generally one is significantly different, or two, but the p-value of this t-test will not be the same as the p-value that you have for the linear model. "Why is the third group not considered?" I think I was just lazy: I compared group 0 to 1 and 0 to 2, and I forgot the comparison between 1 and 2. "Sorry for the bunch of questions."
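The numeric-versus-factor coding and the post hoc t-tests can be sketched in R. The genotype and phenotype values below are invented for illustration; they only mimic the lecture's three-group mouse setup (0 = BFMI homozygous, 1 = heterozygous, 2 = B6 homozygous):

```r
set.seed(42)
# Hypothetical phenotype data for three genotype groups
genotype  <- rep(c(0, 1, 2), each = 20)
phenotype <- c(rnorm(20, mean = 10), rnorm(20, mean = 8), rnorm(20, mean = 10.5))

# Numeric coding: a single beta, assuming a purely additive (linear) effect
additive <- lm(phenotype ~ genotype)

# Factor coding: one beta per group, each relative to the base group
grouped <- lm(phenotype ~ as.factor(genotype))

# Post hoc t-tests to find which groups actually differ
t.test(phenotype[genotype == 0], phenotype[genotype == 1])
t.test(phenotype[genotype == 0], phenotype[genotype == 2])
t.test(phenotype[genotype == 1], phenotype[genotype == 2])
```

With the numeric coding, summary(additive) reports one slope; with as.factor, summary(grouped) reports one beta per non-base group, and the pairwise t-tests then show which groups carry the effect.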
No, questions are important — if you have more questions, then let me know. All right, so that's the question that I got. I also got one remark about the exam: if you need me to write you a letter saying that you can join the exam, make sure that I have the request before Monday. On Monday I'm going to sit down and make letters for everyone; I already made one, so it's just copy-paste. But I know that there are some people who need a letter from me certifying that they are allowed to join. Good. So if there are no further questions, I will switch back from my drawing to the assignments. In lecture number nine, I would say, we did the linear model — so the lm function, together with the anova function. We did mixed models last week, using lmer to model different groups. So we have a really nice picture of a generalized linear mixed model, where we see that every group has a different intercept, more or less at point zero, every group has a different value where it's hitting that, and we see here that this model also has a random intercept for each of these groups, because we can see that the directional coefficient of each group is slightly different. So this is a really nice example of how you can use mixed models to model multiple groups: when we correct for the groups, every group has its own directional coefficient and its own intercept. So today we will be talking more about generalized linear models, and generalized linear models is when your resulting variable —
so the thing that you want to predict — is not normally distributed, or not a continuous variable. It might, for example, be a case-control study, where you have cases and you have controls, and this is the thing that you want to predict: for example, based on someone's genome, or based on some other tests that you did, whether an individual is either affected or unaffected. So a zero/one phenotype. That's an extension that you can make to general linear models, and then we are talking about generalized linear models, because now the output — the thing that we are predicting — can be anything that we want, and it doesn't have to be a continuous variable. So it doesn't have to have a value which can range from zero to 20; it can be just two possible values, or three possible values. All right, so the overview for today. We'll start by doing regression — so again, more regression: what if the response is not a continuous variable? What if our response is a zero/one? I also want to talk to you guys about the long versus the wide format, since that comes up quite a lot and everyone has their own opinion on it. "Florian: looks better than your office, does it?" I don't know, I don't have a drawing board there, but that's why I have the drawing board here. I like it, it's nice and green. I hope it doesn't start raining, because when it starts raining it gets really, really noisy.
So I hope that that will be okay. So, long versus wide format: it's just two different ways of writing your data down. Both of them are fine, and you can go from one to the other, but I just want to explain to you why sometimes it's better to use the one format versus the other. Like I told you, I want to talk about some idioms today, and one of these idioms is, for example, executing external programs, which can really help R become a lot more powerful. If you have the ability to use R to execute things like sequencing programs, you can build up a pipeline where R just calls out to different programs and then reads in the results from these programs. That will help you in the end if you want to build a pipeline which has multiple steps. Or, for example, if you have your data — which are images — in a certain folder, and you want to go from the images, using some external tool like ImageJ, and then go from ImageJ output files to comma-separated files on which we can do statistics. I will also be talking a little bit about how to run scripts from the command line. This ties in, of course, with the executing of external programs, and also how you can supply parameters to your R scripts using command-line arguments, so that you can write a script and then give the script, for example, a file name as input. So, say, our script:
So say our script Resize this photo and then give the name of the photo that you want to resize And like I told you guys an overview at the end So the overview at the end will just be going through all of the lectures from 1 to 10 and Highlighting what I think is important and I hope that with that information Every one of you should be able to do the exam Because you kind of know where the questions will be about Um, like I told you guys next week will be the last lecture So next week we will be analyzing some very very very fishy data So I've decided to do the whole thing in the new style which already cost me like three days of drawing To make really really nice slides for you guys, which are hand drawn and like better than What you've ever seen before or at least I I like them a lot So next week we'll be talking about the bugger say project Which is a project where fish have been captured and measured Across different lakes in northern germany, and the data was provided by Professor Adlinghouse So I figured out that with the new drawing board. I can do a lot of things I have like these pencils which are like galaxy pencils and stuff and they look really nice So I'm I'm I'm planning on using it a lot more But I still have to practice my handwriting a lot A lot because it takes a lot of time making just one of those slides I spend like two days making slides And I have written the code like two hours on the day before and then making the slides explaining everything Yeah, so 39 slides cost me two days, which is a massive amount of time But I hope it's worth it and I hope that you guys will like it next week So next week fishy data And there's a lot of fishy things going on there. So Let's see what we can find out I think that for next week. 
I will actually make the data available, so that people can just download parts of the data and work with it themselves — so that we can play with the data together, have some nice drawing, and have some fun. I'm really excited about it; I like the data set. So I hope that you guys will also be interested in how to analyze some real-world data. All right, so: regression for today. It's good that we don't have any assignments that we have to go through in detail, so I think we can be done quite quickly. I have 55 slides; we're now at slide number seven, so we're making great time. When we talk about regression — and we are talking about the final form of regression, so generalized linear models — then we're dealing with different response types. So, for example, we have a dichotomous outcome variable, or a binary outcome variable: something happens. We have a certain amount of measurements, and in the end we have someone who becomes sick or someone who stays healthy, or someone who passes an exam or someone who fails an exam. When we are dealing with these types of regression, we're doing logistic regression, and that means that our response is not a continuous variable, like in the case of body weight or body length, but that we just have a dichotomous outcome: either pass or fail. So we will be using the data set that is listed here, and it is a data set which contains admissions to UCLA. They measure all kinds of things; students apply to go to UCLA, and they either get admitted or they don't get in. So we have a single response variable, which is called admit, which can be one for people who got into the school and zero for people who were not able to get in. And of course, for all of the students that applied, we have some predictor variables.
They have their GRE, which is the Graduate Record Examination score; they have the GPA, which is the grade point average; and then they have the rank. Rank here is a little bit of a funny variable, because that is the prestige of the high school that the student went to before applying to UCLA. So it's a ranking where schools ranked one are the really, really good high schools, and the schools with the highest rank number are the poorer high schools, where they don't have any lecture materials or online teaching. So they rated all the schools, and they have these three predictors which they want to use to figure out what makes a successful application to UCLA. All right, so we can load in the data using the read.csv function. Fortunately, we don't have to specify anything, because the data is properly formatted: we don't have to specify that we need a header or where the row names are — that's all taken care of. So we can just use read.csv, directly load the data from online, and store it in the variable called mydata. So now we're all set. First things first: if we want to do logistic regression, we use factors, right? The GRE and the GPA are of course just numbers, so they are continuous variables — you can have a GPA of 4.8 or 4.7, and you can also have a GRE which is a specific number — but rank, the prestige of the school, is again one of these categorical variables, one of these factors. And the thing in logistic regression is that none of the factor groups is allowed to have a count of zero, because we are assigning groups. So the first thing that we have to do is use xtabs to make two-way contingency tables, and we only have to do this for the rank. So how do we do this?
Well, we make the call saying: use the xtabs function to create a two-way contingency table. What do I want to have in the rows of the table? — Thank you for following! Can you guys hear that? I think I muted the desktop audio, but I still get a message when someone subscribes. So thank you for subscribing. — So we have the admission, and then we want to tabulate the admission by the rank of the school. And we definitely need to make sure that none of these cells is zero, because if any of them is zero, then strange things will happen in logistic regression: that rank will always come out massively significant. And that just has to do with the fact that, because we're looking at a true/false situation, when there are no observations in one of the groups, the modeling just doesn't work as it should. So we have to look and see if there are any zeros. Fortunately, there are none: at admit zero — so for the people that weren't admitted — 28 came from a rank-one school and 55 came from a rank-four school. And for the people who were admitted, you can see that the rank-one schools have a higher admission rate, because more than 50% of the people who applied got in, while when you are from a rank-four school, you directly see that that's not the case: there, for every six people who apply, only one gets accepted into UCLA. So we can directly see that there is some kind of structure, and that the better the rank of your school, the bigger the chance that you get into UCLA. But none of these cells is zero.
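The contingency-table check can be sketched like this; the tiny admit/rank data frame is made up so the example is self-contained (the lecture loads the real UCLA admissions CSV instead):

```r
# Toy admissions data: one observation per admit/rank combination
mydata <- data.frame(
  admit = rep(c(0, 1), times = 4),
  rank  = rep(1:4, each = 2)
)

# Two-way contingency table: rows = admit (0/1), columns = rank
tab <- xtabs(~ admit + rank, data = mydata)
print(tab)

# Every cell should be non-zero (ideally >= 5) before fitting the model
all(tab > 0)
```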
So that means that we can fit a model, because if we had a zero somewhere, then we should have dropped that rank altogether. And this is very important, especially in logistic regression: because you're working with a zero/one variable, you want to have at least some observations in each of the factor groups. I would say that if a group had only two observations, I would also throw it out; I want to have at least five or six observations in each of the groups, to make sure that when we are fitting our model, the model will be really accurate in that sense, and then it will be correct. So xtabs is a very useful function: you can make two-way contingency tables, and I think you can even make three-way contingency tables, when you just say plus rank plus something else. But for two-way contingency tables it is really good. All right, so after we've checked all the possible contingency tables — in this case there's only one, because there's only one categorical variable — then we can do logit regression. All right, Anna Margareta redeemed "next slide in German". All right, the next slide will be in German then. But after we've done the contingency tables, and we've made sure that there's no group which is zero or very small, then we use logistic regression. So how do we do that?
Well, first things first: we have to make the rank variable a factor, making sure that it's not a numeric variable, because the coding of the rank is one, two, three, four, and we don't want R to interpret this as a numeric variable by accident. We don't know — it might be that there's a different pattern than just a linear pattern in the rank. So we have to make sure that it's a factor, and then we just make our model. In this case we use glm, for generalized linear models. So instead of the lm function, which you use for standard regression, we now use the glm function, and we just build the model as we always do: we model the admissions by the first score, so the GRE score, then we have the GPA score, and then we have the rank, which is a factor variable. We give it the data — and because we give it the data, we can specify the columns of the data frame directly in the model formula — and we just add family = binomial, saying that the output variable follows a binomial distribution, so it's either zero or one. And then we have fitted our model — it will be almost instant, because there's not that much data — and then we print the summary of the model, so we print the summary of my logistic regression. Good. So, next slide in German. Logistic regression, once again — how would you say that in German? Logistische Regression. Yeah, that doesn't sound very German, but okay. For this slide — I have to get used to speaking German again. So, what we observe from the model, of course: first, it
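A sketch of the glm call just described: since the real data is downloaded from the UCLA site, the data frame here is simulated, and the column names (admit, gre, gpa, rank) simply mirror the lecture's:

```r
set.seed(1)
n <- 200
# Invented admissions-style data, mimicking the UCLA example
mydata <- data.frame(
  gre  = round(rnorm(n, mean = 580, sd = 100)),
  gpa  = rnorm(n, mean = 3.4, sd = 0.4),
  rank = sample(1:4, n, replace = TRUE)
)
# Simulate a 0/1 admit outcome that depends on the predictors
mydata$admit <- rbinom(n, 1, plogis(-3 + 0.005 * mydata$gre +
                                     0.8 * mydata$gpa - 0.5 * mydata$rank))

mydata$rank <- factor(mydata$rank)           # rank is categorical, not numeric
mylogit <- glm(admit ~ gre + gpa + rank,     # family = binomial -> logistic regression
               data = mydata, family = "binomial")
summary(mylogit)
```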
So was wir observieren natürlich von das model is am ersten Seigt es uns welches model wir genutzt haben so wenn ich mein summary von mein variable von mein model mehr angucke Then sehe ich dass den formula wieder aufgelistet ist und Es gibt auch unsere coefficient so den coefficient ist natürlich wieder den coefficienten Wo wir in interessiert sein so es gibt unsere estimate so unsere better coefficients Daneben gibt es auch den standard error von den coefficient Und wir haben auch hier den probability und diese probability jetzt ist auf ein set distribution hat sich ist mit ein set distribution berechnet Und das ist dann das ist dann unterschiedlich weil her wenn man eine normale regression macht Mach man ein f test oder ein t test aber mit ein binomial test hat man ein set score den man ausrechnet und da von den set score kann man dann nach den probability gehen Was man seht ist dass den g re und gpa variablen beide ein ganz hohe influt einfluss haben auf das zukangsgeschehen an den ukulee universität Aber was man auch zählt ist das wieder hier sehen wir rank 2 3 und 4 Und das ist um das all diese betas wieder Abhängig sind oder den sind relativ von den rank 1 gruppen Und was man seht ist dass ja den ersten rank rank 2 ist nicht so significant unterschiedlich von rank 1 rank 3 ist unterschiedlich her und an rank 4 ist viel unterschiedlich Und das haben wir auch gesehen hat das wenn man nach den tabellen Guckt hat das den tabellen auch ganz deutlich zeigt dass wie höre den rank von in school wie größer den chance um zugelassen zu werden um Aber was man hier ganz genau seht ist dass es Hier in logistische regression nicht so etwas gibt wie einen einzelnen p-wert normal wenn wir das mit einem continue variablen gemacht haben Dann hat man nur eine bete estimate bekommen oder man hat verschiedene bete aber jedes war in den Summary zusammengefasst in eine p-wert so den frage jetzt ist natürlich wie unterschiedlich oder wie Significant ist den einfluss von rank auf den 
admission to the university? Because it's different from a normal lm function, where you get one p-value for each predictor variable. Here, because it's logistic regression, you get a separate p-value — a bit like a post hoc test, you get the p-values relative to the first group. But if we want to know how significant rank is as a whole, we can find that out too, and for that we can use the aod package. Because if you want to summarize different coefficients into a single p-value, a single likelihood, you can't just add up the p-values, right? It's not that you add up three p-values, divide by three, and then say: well, this is the average likelihood of rank. There's a formal way of doing this, and it requires a Wald test. The Wald test is not available in R by default, so we first have to install the aod package. We can install it using install.packages, and then we can say library(aod), and by loading the library we get access to all of the functions which are provided in there. Now we can use the wald.test function to compute the significance of the rank variable. But we first need to know which of the coefficients we need to combine: when we do coef(mylogit), what we see is the coefficient of the intercept, the coefficients for the GRE and the GPA, and then the three coefficients for the rank, which are relative to rank number one. So we can just count: this is the first coefficient, the second, the third — so we want to summarize coefficients number four, five and six into a single value. And that's the thing that we have to remember: it's coefficients four, five and six that we want to summarize. So how do we combine them?
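As a preview of what this combination does: aod::wald.test(b = coef(mylogit), Sigma = vcov(mylogit), Terms = 4:6) computes a chi-square statistic from the selected coefficients and their variance-covariance block. The base-R sketch below shows the same statistic on a simulated model, so it runs without the aod package; note that in this toy model the factor coefficients sit at positions 3:5, not 4:6 as in the lecture:

```r
set.seed(1)
# Toy model: binary outcome, one continuous predictor, one 4-level factor
d <- data.frame(
  x    = rnorm(120),
  rank = factor(sample(1:4, 120, replace = TRUE))
)
d$y <- rbinom(120, 1, plogis(0.3 * d$x))

m <- glm(y ~ x + rank, data = d, family = "binomial")

terms <- 3:5                      # positions of the three rank coefficients here
b <- coef(m)[terms]               # their estimates
V <- vcov(m)[terms, terms]        # their variance-covariance block
chi2 <- as.numeric(t(b) %*% solve(V) %*% b)          # Wald chi-square statistic
pval <- pchisq(chi2, df = length(terms), lower.tail = FALSE)
```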
Well, we can do the Wald test, where we say that the b that we want to test is the coefficients of my logit model. So I give it all of the coefficients of the model in b, and I give it the variance-covariance structure of the model in Sigma. This is because, to group different coefficients together, it needs to know the standard errors — it needs to know how the different betas are related to each other — and you provide that with the vcov structure. This is just the variance-covariance matrix, so it gives the wald.test function the ability to see: well, this coefficient has a very small standard error, while some of the other coefficients might have larger standard errors. And then we specify which terms we want to summarize: in this case, terms four, five and six, and we want a single value for how significant these three groups are, relative to the rank-one schools, when we group them all together. When we do that, we see that we get a chi-square score with three degrees of freedom — because we're grouping three different levels together — and we see that the probability of the combined effect of these three groups, relative to the first group, is 0.001. All right, so that's how we do logistic regression. The model itself is again very, very similar: you just use the glm function instead of the lm function, and you have to specify the family — in our case binomial, because we have a zero/one outcome. We can then combine the terms if we have a factor variable, and we get a single p-value for the combined factor. And of course, in logistic regression, often when you are dealing with biological data, or data collected in a hospital on sick people and healthy people, we are doing a kind of case-
control study, and in a case-control study we always report results not as beta coefficients, but as odds ratios. The odds ratio is more or less the increase in the odds of becoming sick in one group relative to the other group. So it's not that you have a beta of plus 1.5 sickness units; when we talk about case-control studies, we generally express the difference from one group to the other as a difference in odds, a ratio. We want to say: well, if you are smoking, then you have 50 percent higher odds of this disease, while if you are eating tomatoes, this decreases your odds by 20 percent. So how do we calculate the odds ratio? The odds ratio is very easy to calculate: we invert the natural logarithm, so we take the exponent. We say exp, and then we take the coefficients of the model that we just fitted, and these give us the odds ratios. We might also want to add the 95% confidence interval: what are the boundaries within which the odds ratio lies? If we take the confidence interval, what we know is that in 95 percent of the cases the real value — the real increase — will be between these two bounds, and if this is significantly different from no effect, then there is generally a significant influence of your factor. So if we do that, we see that the odds ratio of the GRE is relatively low: it has no big effect, only a 0.2 percent increase in your odds of being admitted. But you see that the GPA has a massive effect: that's 123 percent additional odds.
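These odds ratios come from exponentiating the model coefficients; a minimal sketch on simulated data (exp(coef(...)) is the lecture's call, while confint.default gives Wald-type intervals, which stand in here for the profile-likelihood confint):

```r
set.seed(7)
# Toy case-control data: binary outcome depending on one predictor
d <- data.frame(x = rnorm(150))
d$y <- rbinom(150, 1, plogis(0.8 * d$x))
m <- glm(y ~ x, data = d, family = "binomial")

exp(coef(m))            # odds ratios: multiplicative change in the odds per unit of x
exp(confint.default(m)) # Wald-type 95% confidence intervals on the odds-ratio scale
```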
So if you have a high GPA, that counts much more than having a higher GRE. Then we see the different ranks: the odds ratios of the school ranks are below one, which means that people from a rank-two school have about 50 percent of the odds of getting admitted compared to people from a rank-one school, people from a rank-three school around 26 percent, and people from a rank-four school around 21 percent, which is more or less similar to what we saw before. And here we see the confidence interval, so the real value will probably be somewhere between 9 percent and 37 percent of the odds of a rank-one individual. Right, so that's the case-control side: in case-control studies we generally prefer to report odds ratios instead of raw beta estimates, and we can take the exponent of the coefficients to get the odds ratios. I see my moderator is not helping me delete messages, so I will delete them myself. All right, next slide. Now, I just showed you how to do this with a variable which is zero or one, but of course there can be many different distributions. With a continuous, normally distributed variable we can of course use the glm function too; in fact, the lm function does nothing but call the glm function with a Gaussian family. But you can use glm not just with gaussian: you can use it with the binomial distribution, so zero/one; with Gamma distributions, which are similar to a normal distribution but slightly compressed or stretched; with the inverse Gaussian distribution, which is not a normal distribution but a skewed one, with most of the values to one side; and with the Poisson distribution, which is count data.
When I think of a Poisson distribution, I always think of the number of bees on a flower. If you see a flower, generally there's no bee, or there's one bee; sometimes there are two, sometimes three, but there are not going to be 50 bees on one flower. So the Poisson distribution starts off with high counts of zero and one, and then very low counts of two, three, four and five. It's a bit like a normal distribution where you only look at one side, but Poisson data are always counts, never continuous, so a value cannot be 2.5. And there are other families, like quasibinomial and quasipoisson, which are very similar to the binomial or Poisson distributions but not exactly the standard ones. You can just specify these in the family argument: where we said family = binomial before, we can instead say family = gaussian or family = poisson, and glm will deal with the response variable accordingly and also check whether the response really matches the chosen family. All right, that's everything for regression so far. We talked about general linear models, where you can use the lm function; we talked about mixed models and repeated-measurement models, where we can use the lme4 package and its lmer function; and we have generalized linear models, where the response may be binomial, Poisson or Gaussian. You can decide which model you want to use, and of course you first have to look at your data and how it is distributed. If you have any grouping in your data or repeated measurements, then you are forced to use mixed models; and there is also a GLMM, which is a mixture.
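The family argument mentioned above can be sketched like this; y, y01, counts, x and mouse are hypothetical variables, just to show the shape of the calls:

```r
# Selecting the response distribution via the family argument
# (all variable names here are made up for illustration).
m_gauss <- glm(y ~ x, family = gaussian)          # continuous response; same fit as lm(y ~ x)
m_binom <- glm(y01 ~ x, family = binomial)        # 0/1 outcome, logistic regression
m_pois  <- glm(counts ~ x, family = poisson)      # count data (0, 1, 2, ...)
m_qpois <- glm(counts ~ x, family = quasipoisson) # Poisson-like, non-standard variance

# A GLMM combines a family with random effects, via lme4:
library(lme4)
m_glmm <- glmer(y01 ~ x + (1 | mouse), family = binomial)
```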
So it's a generalized linear mixed model So it's the same as a general generalized linear model, but it also allows you to do repeated measurements in there all right Good, so I've been talking for 42 minutes How many slides is this this is a couple of slides Let's just do a couple because it's now 242 so we can do like 10 minutes more of long versus white Good, so when we're when we're thinking about data and data in data frames or data in matrices Then we can have different ways of writing data down and It is always difficult to go from one format to the other format or it's not difficult, but it's it's always a little bit tricky because some Algorithms they require you to have data in a white format while other algorithms require you to Data format that you generally have for a linear model is the white format, right because we have An observation So we have for example mouse number one Which has been measured on a certain date And then we have the first measurements like the body weight or in the case of the students We have the admission The gre the gpa and all of these measurements on a single individual are mentioned in a single row of the matrix So we call this white format when we've measured different variables Which are described by one or two columns. So this is called white because there's multiple measurements In the same row of the table The long format is slightly different because the long format has this additional column called variable And this is what has been measured. 
So it always has only a single value per row: instead of one row with three values, we now have three rows, each with only one value, plus a column called variable which describes what was measured. That is the long format, and the other is the wide format. If you want to go from one to the other, there's the reshape2 package, which lets you go from wide format to long format and back without having to manually copy-paste your data or build new data frames yourself. It's a really handy package for when you have your data in one format and the algorithm you're trying to use requires the other. So, a question for you first: if we look at the airquality data set, is it in long or in wide format? I'll wait a little bit, because I think there's some delay; just throw it in chat whether you think it is long or wide. I'm just going to sit here a little while, I might actually do my high voice, but we'll wait on this slide until someone answers. Someone in chat says it's wide; okay, and more people are going for wide. Do you want to explain why it is wide? "Because it is wide, of course." That's very logical, and I think anyone can see that, so we'll keep it there. The reason why it is wide is of course that for a single measurement occasion, a month and a day, you have four different measurements next to each other. Right.
So this is a dead giveaway that it's a wide format. And of course we can take the airquality data set and go from the wide-format version to a long-format version. The easiest check is to look for a variable column: if there's a variable column and a single measurement per row, then it's definitely long format. In this case it's wide format, because we have four different measurements and then two columns describing the measurements: two columns saying when the measurement was done, and then four measured values. So how do you recognize these things? You always have identifier columns, because data has been measured, so there are always columns needed to uniquely identify a row: for example the sample name, the date, the month or the year at which it was measured. When we look at the airquality data set, we have two identifier (ID) columns: we need the Day and the Month columns to uniquely identify a single measurement. That is the first thing we have to do ourselves: identify that these are the two columns needed to uniquely identify a measurement. And then we have the measured variables: the columns that contain the measured values.
Those are Ozone, Solar.R (solar radiation), Wind and Temp. Those are the things we measured, and these are the four columns that we now need to squish into the variable column, with all of the values going into the value column. Melting is going from wide to long, so we can melt the airquality data set, as it is now, into the long format. The other way around is called casting: going from the long format back to the wide format. The function names are based on this terminology. So, if we want to take our airquality data set and melt it from the wide format into the long format, how do we do that? We call the melt function, give it the airquality data set, and then say that the id variables, id.vars, are the Month and the Day, because it needs to know that these two columns are identification, not measurements. Then we specify which variables were measured: in our case the four measured variables Ozone, Solar.R, Wind and Temp. I store the result of this call in a new variable called molten, and when I look at molten, this is how it looks: we have the Month, the Day, then a variable column saying which column the measurement originally came from, and then value, the measured value on that month and day. So that's going to long format. There's one thing which is a little bit annoying here: on the fifth day of the fifth month, the ozone concentration had a missing value, and we could already see that in the wide format. But in the long format these rows are completely unnecessary, because they don't add any information; there wasn't a measurement, so why have the row at all?
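The melt call just described might look like this, using the built-in airquality data set; molten is the variable name used in the lecture:

```r
# Melt airquality from wide to long format.
library(reshape2)

molten <- melt(airquality,
               id.vars = c("Month", "Day"),                      # identifier columns
               measure.vars = c("Ozone", "Solar.R", "Wind", "Temp"))  # measured columns

# One row per measurement, with `variable` naming the original column:
head(molten)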
The nice thing is that we can automatically get rid of those rows, because we can say na.rm = TRUE. Then it will just leave out the NA values: once it sees that one of the measured variables has an NA, it will not add that line. When we then sample some random rows, say 10 of them, we see that no NA values occur anymore. And yes, it's better nowadays to spell out TRUE rather than abbreviate it to T, because R doesn't like you specifying TRUE and FALSE by T and F anymore. So na.rm will just remove all of the non-informative rows when you are melting from the wide format into the long format. Of course, we can also go back; this is called casting, and there's the function dcast for it. We call dcast on our molten data set, because we went from wide to long, so now we give it the long format, and we specify in a kind of formula how our data is structured: we say that we have our Month plus Day, and then, what was measured on each month and day?
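In code, the full round trip being described, melting with na.rm = TRUE and then casting back with a formula, might look like this (molten and wide_again are just illustrative names):

```r
library(reshape2)

# Melt, dropping rows where the measurement is NA:
molten <- melt(airquality, id.vars = c("Month", "Day"), na.rm = TRUE)

# Cast back to wide: one row per Month/Day combination,
# one column per level of the `variable` column.
wide_again <- dcast(molten, Month + Day ~ variable)
head(wide_again)
```

The left-hand side of the formula lists the identifier columns; the right-hand side names the column whose levels become the new wide columns.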
Well, the variable column: for every month and day, it makes one column per measured variable. If we store the result and look at the head, we get more or less the original data set back. The only thing that has changed is the column order: the Month and Day columns come first, because the id columns get mentioned first, and then it goes through the levels of variable, sees for example Wind, makes that a column, and fills in the numbers. So this is a very easy package which allows you to go from wide format to long format and back with a single call, whenever an algorithm happens to need the other format. And the package is quite well optimized, so even if you have millions of measurements it will not take too long, whereas doing it by hand using for loops might be a big issue, because that could take a long time. All right, we'll take a short break here; I will be back at 3:05, and for you there will be animated GIFs, of course. I just forget whether it's goats or birds or a combination of the two; well, it doesn't matter. I will see you guys in around 10 minutes; get some coffee, take a breath, and then we will continue with some more common idioms. See you soon.