All right, let's see if this works now. Testing, one two three, testing, one two three. You should be able to hear me. If you don't hear me, put it in chat. Which is impossible. So if you do hear me, put it in chat as well. Oh, Thursdays, Thursdays, Thursdays again. So I hope everything's fine; I'm not getting any feedback from chat. All right, good. My moderator says that she hears me, so that works. Perfect, perfect, perfect. Anyway, Thursdays again. So warm, so warm. It's going to be even warmer over the weekend; they're saying that it will be above 35 degrees Celsius, and it's already very, very warm. So in theory you should be able to see me now as well. I don't know why the YouTube streaming preview is giving me such a hassle. It also pops up this weird suggestion every time, saying that the audio stream's current bitrate is zero. Well, if you can hear me, it's definitely not zero. But welcome, welcome. We'll wait another minute before we start, right, because we want to start at two. Let's hope everything works out.
Good, good, good. So I hope everyone had a good week. I hope everyone was able to do the assignments. The assignments are difficult, I think. I looked over the assignments and I actually cheated on one of them: I had to look at my old answers to see how I did it.
"Audio and video seem to be out of sync." Yeah, that's what I was thinking as well. I have no idea what's going on there. So it's really out of sync? Let me actually hear the playback for myself. Yeah, that's a little bit out of sync indeed. It's not really... no, no, but we try and be professional here, right? That's why I have the influencer lamp and stuff, and the nice Stream Deck setup and everything. I have no idea what's going on; the stream settings are the same as what I did before. The only thing which I did is... "Looks like you're a foreign show broadcast on German television." All right, let's quit the stream and restart the stream then.
That's the only thing that I can come up with. I shouldn't have changed the latency while the stream was already connected. I know that that gives issues sometimes for YouTube. It's not an issue on Twitch, though. So I'm just going to stop the stream, restart it, and I hope I don't have to fill in all of the stupid stream information again. "There's more latency now." Shouldn't be, shouldn't be. It might be that, because I connected the streaming software already, you are a little bit further behind. Let me just stop the stream and then start it again and see if that fixes it.
Okay, I hope I'm back. Hope it's better. There's not a lot of things that I can do on my end; when YouTube messes up, I have only four buttons that I can click, and these four buttons are not controlling any of the latency things. So I'm sorry if it's still bad. "Improvement." Okay, that's good, that's good. All right, good. So let's start then, before YouTube starts messing up again.
So, regression today. I'm very excited about it; regression is really my topic. I've been doing regression now for almost 14 years professionally. "Try the YMCA." So, yeah, no, I've been doing regression now for 14 years professionally and I have a lot of experience with it. So I hope that people are excited and want to ask a lot of questions. But of course, like always, we're not going to start with the regression lecture. We're going to start with the answers to the assignments. And like I said, I found the assignments really hard. I actually had to cheat on one of them and look at my old answers to figure out what's going on and what I actually wanted. So I hope it was doable for the students and that everyone was able to do the assignments.
So let me pull up my Notepad++ window, and I also need an R window, I think. Yeah, and then I can do this to switch between the two. That all seems to work. Very good, very good. So I already made the file, right?
So first things first: if you open up a file and you're going to code, put in a header. Write down what you're going to code; in this case, it will be the answers to the assignments. Put in your name, and if you are working for a company, also add the company name. Because you're being paid by the company, that means that they own half of the intellectual property that you are creating. But the other half is yours, so that's always nice.
Good. So the nice thing is that this time we didn't have to load in any data set, so I don't have to set a working directory or anything. We just start off with some first assignments that are taken from Project Euler. Project Euler is a really, really good resource if you want to learn how to program. They have assignments, and these assignments go from really easy to really, really hard. Generally, the hardest part about programming is figuring out what you want to program. I found this out myself, because sometimes I want to program something but I have no idea what I want to do. Then you just go to Project Euler and look at one of their questions, and generally that helps you, because then you have something that you can work towards.
I have a question here: "Wouldn't it be possible to have this course as an official ÜWP? Because I'm a biology bachelor student, and I think many of my fellow students would be interested in this course as well." The big issue at the Humboldt University is that as a bachelor student you are officially not allowed to do master courses. That is not entirely true.
You are not allowed to do master courses within the same faculty. So as a biology student you can go to the political science faculty, follow a master course there, and get it accredited in your ÜWP. But since the Albrecht Daniel Thaer Institute and the Institute of Biology are more or less one thing, because they both fall under the Faculty of Life Sciences, you are not allowed to do that. That is because they don't want the hassle of checking who did certain courses in their bachelor and in their master, because in theory you could follow the same course twice, once for your bachelor and once for your master. Of course, it would be easy for them to check, but as people have probably figured out, the examination office (Prüfungsbüro) here is very, very overloaded.
And yeah, I would agree. For me, the earlier you start with programming, the better. I have argued many, many times, when we had the student reform meetings, that people should start learning how to program directly in their bachelor. Yeah, it is very sad. I don't understand why the university has such a big issue with bachelor students following master courses. It makes no sense to me, because the courses should be harder, right? A bachelor course generally is easier than a master course. And in my opinion, the earlier you start with programming, the better it is. So if it were up to me, everyone would have to follow an R programming course in their bachelor, not as an ÜWP but as a Pflichtmodul, so as a required course for your bachelor. But we'll have to see; my influence here at the institute is virtually zero.
"I've dropped the letter weeks ago and the Prüfungsbüro is still not replying." Hopefully they do. Hopefully they do by the second exam date at least. Let me actually show you the exam dates. Yeah, the course is on YouTube, but a lot of people want to follow the course and want to get the credits for it, right?
Because as a student you only have so much time, and based on that time you want to follow interesting courses, and of course it's... if you don't get any points for it. But there are some ways around it, right? If you are in the last year of your bachelor, you could follow the course in your bachelor and not do the exam, but do the exam a year later when you are a master student. So you don't have to follow the whole course again; you just have to do the exam at that point in time. And I am perfectly fine with that. I really don't care about these kinds of constructions. If people follow the course and they learn how to program, I'm more than willing to give people points. But some teachers don't allow that either.
So, yeah, the first exam is going to be on the 28th of July, and then the re-exam is going to be on the 23rd of September. I think that's the ninth month, right?
"Points are more important than improving skills." If you want to get your degree, then yes, points are more important than skills. You can be the best programmer in the world, but if you only have 36 out of 40 ECTS, then they won't give you your bachelor degree. It's just that simple. You have to get a certain amount of points, and that's the way that it works. But yeah, I would totally agree. To all of the bachelor students that try to sign up for the course in their ÜWP, I always say: just follow the course on YouTube.
You can follow it this year and then just sign up for the course once you hit your master. And in theory, I can give you a Leistungsschein (a credit certificate) already, and then you could give the Leistungsschein to the Prüfungsbüro when you are a master student. So there are a lot of ways around it. But none of these ways are really appreciated by the Prüfungsbüro, because they are already overworked and they don't want to check if people follow the same course twice. That's their main fear: that you do the course in your bachelor, get the points for your bachelor, and also do the same course in your master and get the points a second time. Which of course would be unfair, right, because you shouldn't do that.
So you have the exam and then you have the re-exam, and I always say the best grade counts. So if you do the exam and you get a 2.3 and you're not happy with that, then you can do the re-exam. If by any chance you mess up on the re-exam and get a 4, then the original 2.3 of course counts, because it makes no sense to do a re-exam and then score worse than the original exam.
Good. So these are the exam dates, so put them in your agenda. They will be via Zoom or Moodle; we might do it in person, at least the first exam. But the second exam will at least not be given by me. That will be done by Paula, because my contract here ends on the 31st of July, and then on the first of September I will start a new job, if everything goes well, at Northumbria University in the UK. They decided to make me an associate professor.
"So graduating sounds wise." Yeah, yeah, in the end ECTS is what counts, right? You can get 4.7s on all of your courses, but in the end, if you have enough credits, then you still get your bachelor. In the end it doesn't matter. "Oh my god, news drop." Yeah, well, I hadn't said it before. So, very, very sad about it.
I had a really good time here at HU. I've been here for almost eight years, or more than eight years actually. The university didn't want to make me a professor, and the other university did want to make me a professor, so I'm going to go. "Where are you getting your professorship?" Oh, I'm going to Northumbria University. So Northumbria University in the UK, which is located in Newcastle. So Newcastle United, go go go. I don't like football, but since you're living in Newcastle, you should support the local football club. "Newcastle upon Tyne." Yeah, that's the official name.
Good. But with that out of the way, I think we should start programming a little bit, right? Because that's what we're here for, instead of complaining about the HU and... Let's not complain. Let's keep it positive, right? "Congratulations." Yeah, thanks, thanks. I'm really, really happy about it. So, lots of things to still arrange. "Go sports." Yeah, yeah, any sports. They probably have a good rowing team as well, right? Rowing is very popular at universities in the UK, and it is also very popular at the university where I originally come from.
Good. So like I said, Project Euler is a very good resource if you don't know what to program. So I took three of their questions. I credited them, so they get the credit for it.
So, first question. The first question is about integer numbers, so it's about whole numbers. I hope that that was clear from the question itself. If we list all the natural numbers below 10 that are multiples of three or five, we get three, of course, five, six, which is two times three, and we get nine, which is three times three. And then if you sum all of these numbers up, you get 23, because three plus five plus six plus nine equals 23. And then the question for you guys was: find the sum of all of the multiples of three or five below a thousand. So it's not three and five.
It's three or five. So there are two major ways to answer this. We can answer this using the seq function, or we can answer this using a built-in function which is there, but I always like going for the sequence first, right?
So we want all of the numbers that are multiples of three below a thousand. That is actually relatively easy. I'm just going to create a vector called x3, and I'm going to assign to x3 a sequence from 3, which is the first multiple of three, all the way up to 999, which is the largest number below a thousand, and I'm just going to step by three. So I'm just going to make a list of all of the numbers which are multiples of three. Then I'm going to duplicate this line and do the same thing for five: start at 5, go all the way to 999, step by five.
Now a lot of people think: okay, so now I can just combine x3 and x5 and then get the sum. But of course this doesn't entirely work, because the same numbers can be in x3 and in x5. For example, the number 15, which is divisible by three and divisible by five, will be in x3, but it will also be in x5. So the question is, how do we solve this? Because we only have to count 15 once and not twice.
So we can modify this a little bit and say: I want to remove the numbers from x3 that are also in x5. How do I do that? Well, I can just ask which numbers of x3 are located in x5, like this, and I can ask which ones those are, and then I'm going to throw them away. So I'm going to say minus, and I'm going to throw them away from x3. Then I'm going to store this as a new variable, or I'm just going to overwrite the variable x3. And now I can do what I wanted to do: combine x3 with x5, since now there are no duplicate numbers anymore, and then give me the sum. All right, that's question number one. So let's go to R and see what the answer actually is.
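The steps just described can be sketched as follows. The variable names `x3`, `x5` and `total` are assumptions based on what is said in the stream, and the last line is an optional cross-check with a built-in function rather than something typed at this point in the lecture.

```r
# Q1: sum of all multiples of 3 or 5 below 1000 (sketch, names assumed)
x3 <- seq(3, 999, by = 3)   # multiples of 3 below 1000
x5 <- seq(5, 999, by = 5)   # multiples of 5 below 1000

# Drop the numbers from x3 that also occur in x5 (15, 30, ...),
# so that they are counted only once
x3 <- x3[-which(x3 %in% x5)]

total <- sum(c(x3, x5))
total

# Cross-check: keep only the unique values of the combined vectors
check <- sum(unique(c(seq(3, 999, by = 3), seq(5, 999, by = 5))))
```

One caveat on the design: `x[-which(cond)]` returns an empty vector when nothing matches, so `x[!cond]` is the safer general form. Here there is always overlap, so both behave the same.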
So when we go to R... where's my R window? There's my R window. So the sum should be two three three one six eight, so 233,168. And of course, if you do this question on Project Euler, you would have a little input box. You would fill in your answer and then you would click check, and it would check if the number is really the number that it should be. We can actually compare this with other sources, but this is really the correct number. So, just throw them away.
Of course, you can actually do this a little bit easier. If we go back to Notepad, the alternative is to use the built-in unique function. So I'm going to recreate x3 and x5, and now I'm just going to combine these two together. So I'm going to take x3 and x5 and I'm only going to take the unique values, so if a value occurs twice, only take the first one, and then I'm going to sum them up. So this also works and is a little bit cleaner. But now you're using the unique function, and you can just do it yourself if you want to. And this of course gives you the exact same answer.
All right, second question. Each new term in the Fibonacci sequence is generated by adding the previous two terms. By starting with one and two, the first ten terms will be one, two, three, five, eight and so forth. We did the Fibonacci sequence at the end of the last lecture. By considering the terms of the Fibonacci sequence whose values do not exceed one million, find the sum of the even-valued terms. So there are a lot of things in this question. The first thing that we need to do is find all of the Fibonacci numbers which are smaller than a million. And then what do we want to do? We want to find all of them that are even, so divisible by two without any remainder, and then we want to sum all of these up.
So I'm just going to write down a Q2: hashtag Q2. First things first, we need to do the Fibonacci sequence, right? I already gave you the code last time, so I'm just going to write the fib function. In this case, fib is a function that takes a number x, which is the index, so the nth, or the xth, Fibonacci number that we're going to compute. If x == 1, then we are going to return 1. If x == 2, we are going to return 1 as well. And otherwise we are going to return the Fibonacci number of x minus 1 plus the Fibonacci number of x minus 2. So we're just going to ask for the previous two numbers. This is a very bad function. We saw last time that it takes a very long time to compute large Fibonacci numbers this way, but it's good enough.
We're of course going to test it. So we're going to say fib(1), fib(2), 3, 4, and we're going to do 5 as well. And I'm going to put them all on one line using the semicolon, so that I can combine multiple statements on one line, just to check if they match up with the numbers that we saw in the example. All right, so I'm just going to copy-paste this in, and then we're going to look at the numbers and see if they match up to what we expect. So it's 1, 1, 2, 3 and 5. That seems to be correct, right? We could check a couple more, but this is the Fibonacci sequence, so I know that this is correct.
Okay, so now we have to find the sum of the even Fibonacci numbers. In this case, if we look at R, then 1 is not even; 2 is even, so 2 needs to be summed up; 3 doesn't need to be summed up; 5 doesn't need to be summed up. The next one is going to be 8, and 8 is of course even again, so it needs to be summed up. So how are we going to do this? Well, in this case we don't know which index is the last one below 1 million, right?
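The recursive Fibonacci function and its quick check, as described above, might look like this. The name `fib` follows the stream; the exact layout is an assumption.

```r
# Q2, step 1: naive recursive Fibonacci (fib(1) = 1, fib(2) = 1)
fib <- function(x) {
  if (x == 1) return(1)            # base case
  if (x == 2) return(1)            # base case
  return(fib(x - 1) + fib(x - 2))  # sum of the previous two numbers
}

# Several statements on one line, separated by semicolons
fib(1); fib(2); fib(3); fib(4); fib(5)
```

The function recomputes the same values over and over, so its running time grows exponentially with `x`. That is fine for the small indices used here.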
So we have to use a while loop. We're just going to say x equals 1; this is the index of the Fibonacci number that we're currently looking at. I'm going to say fib sum is zero, because initially we haven't added up any numbers. And then I'm going to say that the Fibonacci number that we are looking at, fib num, is going to be fib of x, because we need to initialize the number that we're looking at.
And now I'm just going to say: while this fib num, so the Fibonacci number, is smaller than 1 million, so 1, 2, 3, 1, 2, 3 zeros, what am I going to do? Well, I have to check if the number is even. So I can say: if the fib num that we are currently looking at is divisible by 2 without any remainder, so when the remainder after dividing by 2 is 0, then the number is even. So at this point I know that the number is even. I'm going to print it, because I want to have some feedback from the algorithm while it's running. So I'm going to say cat x, then "number is", and then fib num. And then I'm going to add a backslash-n newline, so that every time that it finds a new...
"Is there a way to make large numbers more readable?" Not in R. I find that very annoying as well. There are programming languages which allow you to do this, so to use underscores to separate the thousands, but R doesn't allow you to do that. But you can use scientific notation, right? 1 million is 1 times 10 to the power of 6, so I could just say 1e6, and this of course is clearer in a way. For anything under 10 million I just write it out, but you can use scientific notation. With 1e6, R figures out that this is 1 times 10 to the 6th, which is a million. So you could write it down like this, but I don't like it, also because code highlighting goes a little bit haywire. "Maybe for you." How do you mean, maybe for you? I just like this. But I always make sure that it's like this, right?
So I just put in a couple of spaces, and then once you check and make sure that it's really the number, you just remove them. All right.
So now we're going to go through all of the Fibonacci numbers. If the Fibonacci number is divisible by two without any remainder, it means that it's even. I'm then going to print the even number to the screen using the cat function, and I'm going to update my fib sum. So I'm going to say that the fib sum is the fib sum plus the Fibonacci number that I'm currently looking at, and...
Actually, someone's calling me on the phone. I am very sorry. I probably have to take this. I will be right back. Interesting, interesting. I'm not even going to say what they called for; it's just too silly.
All right, so if the number is even, then we just add it to the current fib sum. Note that it's sum and num, so there's not a lot of difference in the variable names, but it's good enough to see, I think.
All right. So what do we always need to do? We do not want to create a for loop or a while loop which goes on forever and ever. So I need to make sure that I move to the next Fibonacci number. I'm going to say x equals x plus one, and then I'm going to compute the new number: fib num is now the fib of x. So I'm going to add comments: hashtag, increase the index, the nth number, and then compute the Fibonacci number. All right, so now it should end at a certain point, because fib num will become higher and higher. It will print the numbers out, and if it figures out that the number is divisible by two, it will add it to the sum. We can actually have the running sum in there as well.
So let's do that. So cat "sum", so the current sum is, and then we can just say fib sum, to make it clear that we have the index of the Fibonacci number, the value of the number, plus the sum that we currently have.
"Maybe 1 times 10 to the 6 is clearer for you." It's everyone's personal choice. I like looking at numbers in scientific notation. If you deal with p-values, which are like 1 times 10 to the minus 7 or 1 times 10 to the minus 8, then you quickly get forced to, because no one wants to see 0.0000001, right? Because then you have to count the number of zeros, add one to that, and then figure out what the probability was.
Good. So we just built up our solution. We define our Fibonacci function. We test it to make sure that it's really what we expect it to be. Then we define the variables that we require. We write a little while loop saying: do this as long as the Fibonacci number is below 1 million. Check if it's divisible by two. But even if it is not divisible, we move on, because this part here is outside of the if statement. We always have to do that: whether the number is even or odd, we still need to move on to the next number.
All right, so let's go to R and copy-paste this in, just to make sure that everything works. Here it starts doing the Fibonacci numbers. The third Fibonacci number is two, and the sum at that point is zero. Then the next one that we find which is even is eight, and the sum is two, so of course eight plus two equals ten for the next time. The next even Fibonacci number is the ninth one, and that's 34, so you can see that this is regular, right?
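A sketch of the loop as described. The names `x`, `fib_num` and `fib_sum` are assumed spellings of the variables mentioned in the stream. One deliberate substitution, not what was typed in the lecture: instead of calling the slow recursive `fib(x)` on every pass, this version carries the next Fibonacci number along in a helper variable `fib_next`, so it finishes instantly.

```r
# Q2, step 2: sum the even Fibonacci numbers below one million (sketch)
x        <- 1   # index of the Fibonacci number we are looking at
fib_num  <- 1   # value of the x-th Fibonacci number, fib(1) = 1
fib_next <- 1   # value of the (x + 1)-th number (hypothetical helper)
fib_sum  <- 0   # nothing has been added up yet

while (fib_num < 1e6) {              # 1e6 is 1 000 000
  if (fib_num %% 2 == 0) {           # remainder 0 after dividing by 2: even
    cat(x, "number is", fib_num, "sum so far is", fib_sum, "\n")
    fib_sum <- fib_sum + fib_num
  }
  # Always move on to the next number, whether this one was even or odd;
  # otherwise the loop would never end
  x <- x + 1
  tmp      <- fib_num + fib_next
  fib_num  <- fib_next
  fib_next <- tmp
}

fib_sum
```

Replacing the three update lines with `fib_num <- fib(x)` gives the version built in the lecture, at the cost of a noticeably longer run.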
Because every third Fibonacci number is an even number. And now, if we want to know the fib sum, we can just type the variable name, and then we see that the fib sum in total is 1,089,154. Again, it's a Project Euler question, so if you want to check that this number is really the answer, you just put it into Project Euler, click check, and then it will tell you whether this is correct or wrong.
Next one: palindromes. Palindromes are really, really interesting and also really biology-related. Palindromes are sequences which, when you read them from the front to the back, read the same as when you read them from the back to the front. So, in chat, throw in your favorite palindrome. There are a lot, and I think everyone has their favorite palindrome.
So the question that we have is: a palindromic number reads the same both ways. The largest palindrome made from the product of two two-digit numbers is 9009, which is 91 times 99. Find the largest palindrome made from the product of two three-digit numbers. Of course, my moderator goes for "Anna". That's such an open door, but it's a good palindrome, so I like the palindrome Anna as well.
Good, so let's find the largest palindrome for three-digit numbers. The thing that we do here is take two numbers, multiply them together, and then see if the result is a palindrome. So we need a function to check if a number is a palindrome. So is palindrome is going to be a function which takes a value x. The first thing that we need to do is make sure that the number that we get is really going to be a string. When we multiply two numbers together, we have a numeric value, so I'm first going to change this numeric value to be a character value. I'm going to say x, and I'm just going to reuse the variable.
I'm going to override it So but I'm going to make sure that I say as character X right so that even if I would input a number the value of x would be a character So how do we check if something is a Palindrome? Well, the easiest way is to just Separate all of the letters from each other and then just reverse it, right? Because if the reverse is the same as the original one Then it's a Palindrome right because you can read it the same both ways So fortunately r has a reverse function and we already seen that we can Split a character string By using the string split function. So I'm just going to say string split x, right? So split the string into its individual numbers individual letters. So I'm going to split it by nothing Right so every every time that you find nothing split it which means that we within either Within either that it will split them all Another Palindrome. I knew a kid. His name was Tim Schmidt. Yes, Tim Smith is indeed a Palindrome if you don't include the space, of course String split expects that we have not one string, but multiple strings So it will always return a list. So in this case, we want to have the first element from the list Right because we are only going to give it a single number We're not going to give it like five numbers or a hundred numbers that it needs to split It only needs to split ones one number. 
So I'm just going to take the first element as being the answer. Then I'm going to reverse it: take the vector that comes out and reverse it. And then I'm going to use the paste0 function to paste all of these characters back together. I'm going to say the separator is nothing, and I'm going to collapse the string also using nothing, and then this should be the reverse of x.
And now I can check: if x == the reverse of x, then I need to return TRUE, and of course if this is not the case, then return FALSE. This part is a little bit duplicated in a way, because I can just return the value of this test. If this test is true, then it's a palindrome, and if this test is false, then it's not a palindrome. So I can just say: return x == the reverse of x.
All right, so let's try our is palindrome function. We are going to use Anna, and we are going to use Tim Smit, because we know that those are palindromes, and then we are going to use my name as well, which is not a palindrome, just to check. You have to check the positives, but you also have to check the negatives when you write code.
All right, so let's see if this works. Let's go to R. So we have our is palindrome function, and Anna is indeed determined to be a palindrome. Tim Smit is also determined to be a palindrome. And then we have Danny, which is definitely not a palindrome. So it seems to work. All right, so now we have this function which can tell us if a number is a palindrome. Let's actually try a number as well. Let's try 9009, and that is indeed a palindrome. 9006 is not a palindrome. So it checks out: the function that we wrote works and is functional. Good, so now we have to do the second part, right?
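The palindrome check described above might be written like this. The function name `is_palindrome` and the lowercase, space-free test strings are assumptions; the comparison is case-sensitive, so "Anna" with a capital A would not pass.

```r
# Q3, step 1: check whether x reads the same forwards and backwards (sketch)
is_palindrome <- function(x) {
  x <- as.character(x)                # numbers become character strings
  chars <- strsplit(x, "")[[1]]       # split into single characters;
                                      # strsplit returns a list, take element 1
  rev_x <- paste0(rev(chars), collapse = "")  # reverse and glue back together
  return(x == rev_x)                  # TRUE if palindrome, FALSE otherwise
}

# Check the positives and the negatives
is_palindrome("anna"); is_palindrome("timsmit"); is_palindrome("danny")
is_palindrome(9009); is_palindrome(9006)
```

Returning the result of the comparison directly avoids the duplicated `if (...) TRUE else FALSE` pattern mentioned in the lecture.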
So the idea is to find the largest palindrome made from the product, so the multiplication, of two three-digit numbers. We're just going to use brute force. So I'm going to say for... oh, I'm not typing. So we're going to say for x in 100 to 999, because these are all of the three-digit numbers. Then I'm going to say for y, which is my second number, in 100 to 999. What do we want to do? Well, we want to multiply them together, so x times y, and then we want to ask: is this a palindrome? So is palindrome of x times y.
And if this is the case, then we need to remember this one. We need to remember the palindrome, but we also need to remember the two numbers that created it. So I'm going to say xi and yi; initially this one is zero and this one is zero. And I am going to keep the largest palindrome so far, the lpsf, and the largest palindrome so far is also zero.
So if the product of these numbers is a palindrome, and the largest palindrome so far is smaller than x times y, then I know that I found a number which is larger than the current palindrome. So now I need to remember x, so I'm going to put x into xi, I'm going to put y into yi, and I'm going to remember my palindrome: the largest palindrome so far is going to be x times y. And for good measure, I am also going to cat it to the screen. So I'm going to say x, comma, y, comma, largest palindrome so far, comma, newline. And I'm going to add some text to it with spaces, so a multiplication sign and an equals sign, so that it reads: x multiplied by y equals the largest palindrome so far, right?
So I'm just going to brute force it. I'm going to go through all of the three-digit numbers for the first number, and I'm going to go through all of the three-digit numbers for the second one. But I have to make sure that every time I find a number which is a palindrome, it is actually larger than the largest palindrome that I found so far. Then I'm going to remember x, remember y, update the largest palindrome that I found, and cat it out to the screen. And this is of course hashtag Q3, for question three.
All right, so let's see if this works. Let's go to R and see if I typed it correctly. Here we see that it starts multiplying numbers together, and you see that the further we get, the bigger the palindromes we find. For some reason it actually did something weird here. No, it's just still computing. And of course we can check, and we can see that all of these numbers are palindromes. In the end we end up with this being the largest palindrome: the largest palindrome that we can make from two three-digit numbers is 906609, and we get that when we multiply 913 by 993.
Is that clear? I know I'm going through them quite quickly, but we have a lot of exercises. So I'm hoping that this is clear, and that everyone understood that we first have to write a function which checks if something is a palindrome, which is just done by reversing the input. Then we go through all of the three-digit numbers and multiply them together. We check if the product is a palindrome, and if it is a palindrome and it's larger than the largest palindrome so far, then we remember the numbers and update.
Good. I hope that people were able to do this. These are, I think, the first three questions from Project Euler, so they are relatively easy in a way, but they are really hard, right?
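The brute-force search can be sketched as below. The names `xi`, `yi` and `lpsf` follow the stream; `is_palindrome` is repeated so the block runs on its own. One small reordering is my own: the cheap size comparison is tested before the palindrome check, so `&&` can skip most of the string work. The result is the same as testing the palindrome first.

```r
# Q3, step 2: largest palindrome from the product of two 3-digit numbers
is_palindrome <- function(x) {
  x <- as.character(x)
  rev_x <- paste0(rev(strsplit(x, "")[[1]]), collapse = "")
  return(x == rev_x)
}

xi <- 0; yi <- 0   # the two numbers that produced the best palindrome
lpsf <- 0          # largest palindrome so far

for (x in 100:999) {
  for (y in 100:999) {
    p <- x * y
    # Only accept palindromes that beat the best one found so far
    if (p > lpsf && is_palindrome(p)) {
      xi <- x; yi <- y; lpsf <- p
      cat(x, "*", y, "=", lpsf, "\n")
    }
  }
}

lpsf
```

The loops visit every pair twice (x times y and y times x). Starting the inner loop at `x` instead of 100 would halve the work, at the cost of reporting the factors in the other order.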
You can already see that you have to read the question really carefully, and there's a lot hidden in this two-sentence question: the largest palindrome made from a product of two three-digit numbers. But now we know: 906609. Good.
Question number four, hashtag Q4. Write a recursive function to count towards zero. E.g. when starting from a hundred it counts down to zero, and when starting at minus a hundred it counts up to zero. All right, so we are going to call this function count to zero, which is a function and it takes a number. Because it's a recursive function, we need a base case, and the base case here is zero. So if x is zero, we say that we're done and we just return from the function.
Otherwise, if x is not zero, we need to figure out if the number is larger or smaller than zero, because in the one case we need to count up and in the other case we need to count down. If x is larger than zero, we need to count down, so we return count to zero of x minus one. If x is smaller than zero, we have to count up, so we return count to zero of x plus one.
And of course we want to have a little bit of feedback. So before we do anything, we cat the current number that we're at, followed by a new line. To make it a little bit more fun, I'm going to add a special print to the base case. When it reaches the base case it will still print the number, because the cat statement at the top is always executed; it's not guarded by an if statement. But if we reach zero, I'm also going to say something silly like blastoff, right? Like a rocket. I'm going to use a semicolon to keep these two statements on the same line, and I added curly brackets to make sure that both statements are within the if. I'm going to do that for all of the ifs, just so that it's a little bit clearer, right?
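Put together, the function described above might be sketched as follows. The name count_to_zero and the exact message are assumptions. The sketch also assumes a whole number as input; a value such as 0.5 would step past zero forever.

```r
# Count towards zero from either side, printing each number on the way.
count_to_zero <- function(x) {
  cat(x, "\n", sep = "")                          # always runs: not guarded by an if
  if (x == 0) { cat("Blastoff!\n"); return() }    # base case: stop recursing
  if (x > 0) { return(count_to_zero(x - 1)) }     # positive: count down
  if (x < 0) { return(count_to_zero(x + 1)) }     # negative: count up
}

count_to_zero(3)
```

Each branch ends in a return, so at most one of the three if statements ever does any work for a given x.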
In theory, we don't need this last if statement at all, because if x is not equal to zero and it is not larger than zero, then at this point in the code we know that it's going to be lower than zero. Good. So let's execute the function: count to zero of 10, and count to zero of minus 100. Let's see if this works. We go to R and we just copy paste it in. We can see that count to zero indeed starts at 10 and goes all the way down and then says blastoff. And then you have a NULL here, which is a little bit ugly. The NULL is the thing that is returned, because if we look at the function, we see that we return an empty thing from the function. For this special situation R has something which is called invisible. You can use invisible to hide the return value, so the function will still return a value.
Question from chat: what is the difference between using two if statements and one if with an else if? There is no real difference, it's just clearer to use an else if. But of course there are some special circumstances. Here I'm using a return, and the return means that the function gives something back, so it will never get to the code after it. If it did not return, if it just said count to zero of x minus one, then it would execute that call, but afterwards it would just continue. So the else if is kind of a guard against these things.
Let me see if I can manufacture an example for that. We put a number in x, so let's draw one random number from zero to a hundred, and we round it so that there's nothing behind the decimal point. Now we could say: if x is larger than 10 and x is smaller than 20, so if x is 11 through 19, we want to do something, for example cat "between 10 and 20". And now we add another if statement, right?
It says that if x is larger than 5 and x is smaller than 15, we cat "between 5 and 15". So here I'm using two if statements, and both of these can be true at the same time. If we use an else if, that is not possible. So I write the same thing again, and now I have to use curly brackets of course: if x is between 10 and 20 I cat like this, and then I write an else if with the second condition and the same cat statement. These two constructs are not equivalent. The first one can fire twice, the second one can only fire once, because if the if statement is true, it will not look at the else if. If one of them is true, that's it: it fires and then execution just continues after the whole if statement.
So if the number that I'm drawing is something like 12, you will see that the first construct gives you a slightly different answer than the second one. With two ifs, the first one is true, so it fires, and the second one is also true, so it fires too. But when I use the else if, it only fires for the if, because the if is true, so it doesn't look at the else if statements that follow. You as a programmer should be aware that only the first condition that matches gets executed. So it's slightly different, but similar. You can use an if and an else, or you can use an if and an else if, but the question is always: do you need to deal with things like overlap or not? If there's no overlap, then two if statements and an if with an else if are equivalent to each other. But if there is overlap between the ranges that you're testing, then they are not equivalent, because the else if version only fires once.
So what is the benefit of else? The else is just there to make things very clear: if this, do something, else do something else, right?
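The overlap example can be tried out with a small sketch like the one below. The function names are made up for illustration, and instead of cat the branches collect their labels in a vector so that the two variants are easy to compare.

```r
# Two independent ifs: both conditions are checked, so both can fire.
two_ifs <- function(x) {
  hits <- character(0)
  if (x > 10 && x < 20) hits <- c(hits, "between 10 and 20")
  if (x > 5 && x < 15) hits <- c(hits, "between 5 and 15")
  hits
}

# if / else if: the second condition is only checked when the first is FALSE.
if_else_if <- function(x) {
  hits <- character(0)
  if (x > 10 && x < 20) {
    hits <- c(hits, "between 10 and 20")
  } else if (x > 5 && x < 15) {
    hits <- c(hits, "between 5 and 15")
  }
  hits
}

two_ifs(12)      # both ranges contain 12, so two labels come back
if_else_if(12)   # only the first matching branch fires
```

For a number outside the overlap, such as 7 or 17, both variants give the same answer, which is the sense in which they are equivalent when the ranges do not overlap.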
So the else is there to make it clear, and the only reason why in this case we can get away with using separate ifs here is because I'm returning from the function. I'm physically ending the function at this point. If I were not returning from the function, then I should have used an else. In theory, whenever you write an if you also want to write an else, unless nothing happens for the other case. But it's a little bit of a feeling. You will get a feeling for it when you write a lot of if and else statements.
You can test unrelated things in the if and the else if as well. You can draw two numbers and say: if x is within some range do something, else if y is within some range do something else. Then if the first one is true, it will never look at the y value, it will only look at the first one. And the else statement itself is kind of meaningless.
Someone in chat says: I always use multiple ifs. Yeah, and sometimes that is correct, but sometimes that's not really what you want. Generally people dislike writing else if statements, so people tend to use multiple if statements. But that is not always correct, because multiple if statements with overlapping ranges can fire multiple times, and sometimes you don't want that. You just want to check something, and if that is true, not look at the rest of the statements, because there might be overlap and you might not be interested in the overlap. You might want to find the largest range in which a number is located, and check consecutively smaller ranges which overlap with the original range, right?
So then you want to return at the point that it's true. But it's really a little bit of a feeling, because in theory you can always write multiple if statements that do the same thing as an if, else if statement by using else. But don't layer them too deeply. If you want to check like 10 different things at the same time, don't say if something, if something, if something, if something, because you don't want to go too deep into your indentation. It's something that you just have to get a little bit of a feeling for.
So yeah, for the blastoff function I use invisible. Invisible is the same as return, but it will not print the value to the screen. So if we run it like this, count to zero, and we go to R, then we see that... oh, I'm doing something wrong. Count to zero of 10. Yeah, so interestingly enough, I broke my own code. That's interesting, because of the if and the else if statement that we were just talking about, it now starts flipping here, because it doesn't return. So I have to return invisible. Invisible is not exactly the same as return. It just makes the value invisible, but it does not leave the function, and now we were no longer returning. So what happens when this return is not here, right?
If I just say blastoff, then it carries on checking, so it just falls through. And then it starts doing this thing in R where you can see that it says zero, then it says blastoff, but then if you look at the code: it says blastoff, this condition doesn't hold because x is not larger than zero, but this last statement does run, because I always do this, since I'm not returning from the function. So it ends up in an infinite loop going between zero and one: it adds one, then it sees, oh, it's one, so I have to subtract one, and then it prints blastoff again.
In this case it would be much clearer to use the else if. So to say something like: if x is zero do this, else if x is larger than zero return this, and else return that. Because now, even though I'm not returning at the blastoff, it will not go into this infinite loop. So if we now run count to zero and go to R, then now it goes into the infinite loop. Why is that? It shouldn't. Oh wait, I have a typo. Yeah, okay, because of the typo R still had the old function. So this one needs to go. All right, let's try this again. We have count to zero. No, don't go into the blastoff loop.
All right, so now we're using if x equals zero, else if x is larger than zero, and else. Now when I call count to zero and give it the number five, it will say five, four, three, two, one, zero, blastoff, and it will quit. The two return statements are now guarded, because if the first condition is true, it will never look at the other ones. It doesn't matter what the later branches would do: if x is zero at this point, I just cat blastoff and then skip the rest of the code, and by skipping the rest of the code it ends up at the end of the function.
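The guarded version can be sketched like this. It is an illustration rather than the exact code from the stream: the name count_to_zero is assumed, and the final invisible(NULL) is one way to keep the NULL from being printed at the prompt.

```r
# if / else if / else: exactly one branch runs per call, so the base case
# needs no return to keep the function from falling through.
count_to_zero <- function(x) {
  cat(x, "\n", sep = "")
  if (x == 0) {
    cat("Blastoff!\n")        # the chain ends here; nothing below the chain recurses
  } else if (x > 0) {
    count_to_zero(x - 1)      # positive: count down
  } else {
    count_to_zero(x + 1)      # negative: count up
  }
  invisible(NULL)             # return NULL without printing it
}

count_to_zero(5)
```

Without the else chain, an unguarded recursive call after the blastoff line would run for x equal to zero as well, which is the flipping between zero and one seen on the stream.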
So it will just end the function there. Good. It's something that you just have to play around with to get a feeling for how it works out, but there is a good reason why there are if and else, and if and else if statements, and that is this problem of multiple overlapping conditions.
Good. All right, question number five: write your own version of lapply. Make sure to check if the input is a list, and throw an error using the stop function if it is not a list. The function signature should look like this, so let me just give you guys the function signature. This is hashtag Q5. I am going to make my lapply, which is a function. It takes x as an input, it takes a function f as an input, and then it uses dot dot dot, so variadic arguments.
The first thing that we need to do is make sure that x is a list. How do we figure out if something is a list? Well, we can check the class. So if the class of x is not equal to list, then I stop and say: not a list. And now the question is how I am going to implement my own lapply. There are two ways of doing this. My approach was to just do it. So we have xf, which is x after applying f to it. Then I say for i in 1 to the length of x, I'm going to do something. What am I going to do? Well, apply the function to x at position i, and call it with the three dots. Then I put the result in xf, so xf at position i is going to be this. And I preallocate the whole thing: xf is a vector of type list, and how many elements does it have? The length of x. I put that in xf, and then once I'm done.
So once I've iterated through the whole thing, I just return xf. And I can test this. I can call my lapply on, for example, a list of one, two, three, four, five. And what do I want to do with this list? I want to apply a function, so I want to apply the plus function and add six, something like this. So let's go to R and see if this works. No idea if this is going to work. Could not find function "fun". Oh, is that... ah, yeah, because it's a character string in this case, so I can't use the infix operator like that.
When I wrote this code, I was really puzzling over how to do that. And then one year I had a student, and the student just said: but why not just call the real lapply function? So what the student did was say: return lapply, the real lapply, of x, using the function, using dot dot dot. Don't deal with the implementation, because there is already an lapply function. Why would I rewrite something that already exists? And I think this is the smarter approach, because now you're just leveraging the fact that there is already an lapply, and the code is really tiny. It's much better to do it like this than to spend a whole bunch of time trying to write your own my lapply function. So my lapply is just going to call lapply.
Of course I can now do the thing that I wanted to do in R, where I call my lapply in the same way. And now we can see if this works, because I'm just leveraging the lapply function, and you see that it works. Right, so instead of being smart and trying to implement it myself,
I should have just realized that there is already an lapply function. So this is actually my preferred way of doing it now: if I want to write a my lapply function, just leverage the fact that there is already an lapply function. Of course, I do need to do the check first, because that was explicitly in the assignment: make sure to check if the input is a list and throw an error if it's not a list. So just check if it's a list and then call the lapply function.
Question from chat: what is the difference again between lapply and sapply? I actually have no idea, I never use sapply. I am going to look this one up for you. I go to R and ask what sapply does. If I show you guys my Firefox window, this is how it is described by the R help: lapply returns a list of the same length as X, each element of which is the result of applying FUN to the corresponding element of X. sapply is a user-friendly version and wrapper of lapply, by default returning a vector, matrix or, if simplify = "array", an array if appropriate, by applying simplify2array(). sapply(x, f, simplify = FALSE, USE.NAMES = FALSE) is the same as lapply(x, f).
So the only thing that it does is that the sapply function calls the unlist function for you, and simplify2array in case you say that you want an array. The difference looks like this in R. If I use lapply of, no, not a list, of c(1, 2, 3, 4, 5), and I want to call the plus function and add six to every element, then you see that lapply always returns a list. The input was a vector, but it returns a list. If I call sapply, then you see that it returns a vector. So it's the same thing.
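Both approaches to question five, and the lapply versus sapply difference, can be sketched as below. The names my_lapply and my_lapply_wrapper are assumptions, and so are two details that were not shown on the stream: match.fun is used so that a function name given as a string, such as "+", can be called, and is.list is used for the check (note that is.list also accepts data frames, unlike a strict comparison of the class with "list").

```r
# Hand-written version: loop over the elements and collect the results.
my_lapply <- function(x, f, ...) {
  if (!is.list(x)) stop("Not a list")
  f <- match.fun(f)                    # accepts a function or a name such as "+"
  xf <- vector("list", length(x))      # preallocate the result list
  for (i in seq_along(x)) {
    xf[i] <- list(f(x[[i]], ...))      # wrapping in list() keeps NULL results in place
  }
  names(xf) <- names(x)
  xf
}

# The student's version: do the required check, then defer to the real lapply.
my_lapply_wrapper <- function(x, f, ...) {
  if (!is.list(x)) stop("Not a list")
  lapply(x, f, ...)
}

my_lapply(list(1, 2, 3, 4, 5), "+", 6)
my_lapply_wrapper(list(1, 2, 3, 4, 5), "+", 6)

# lapply always returns a list; sapply simplifies the result to a vector here.
lapply(c(1, 2, 3, 4, 5), "+", 6)
sapply(c(1, 2, 3, 4, 5), "+", 6)
```

The wrapper is the shorter of the two and inherits whatever lapply already handles, which is the argument made for it above.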
It's just that in this case the only thing that sapply does is call unlist for you; you get the same if you do an unlist of the lapply. So it's the same as lapply, because internally it uses lapply, but it simplifies the result: if the return value doesn't need to be a list, because the input was a vector, it will make the output a vector as well. So it's just a convenience function. There's no real reason to use one or the other; sapply might be more convenient. Here's obey, welcome to the chat.
So, same kind of thing. I never use sapply, actually. I always use lapply and then deal with the fact that it's a list, and often you actually want it to be a list, because it might be that some of the list elements that you get are vectors or matrices. But in theory they're the same thing. You can see from the help file that sapply is just a user-friendly wrapper of lapply, which by default returns a vector, a matrix, or, if simplify is "array", an array if appropriate.
Good. So we talked about the fact that you should be smart. In this case, don't start writing your own lapply. Just use the fact that lapply is there. When I saw a student doing this, I was like, yeah, this is much, much smarter, because I had a really complex function that dealt with characters and different operands, and I thought: no, it's much better to just say the lapply function exists, so my lapply is just going to call lapply.
Good. Question number six, hashtag Q6: make a function which as an input takes only dot dot dot. This function, when called, should print out all the parameters passed to it. Did I give you guys any hints? No, I didn't give you guys any hints here. So I'm just going to call this thing dots, right?
It's going to be a function, and the only input parameter is dot dot dot. And what do we want to do? We want to print out all the parameters passed to it. Well, with the dot dot dot we can actually do a list of dot dot dot, I think, and then we can just return this, so that we can test a little bit. So let's go to R and make our function. I call dots with a equals 12 and c equals 16. Now you can see that it returns a list, a named list, with a having the value 12 and c having the value 16. We can of course make it more complex by including more numbers, so make a a vector of three twelves, and we can see that this works as well.
We want to print this out to the screen. So instead of returning the list of the dot dot dot, we store this list; I call it ml. Then I say for i in 1 to the length of my list. What do I want to print? I want to cat. First off, this thing might have a name, so I say names of ml at position i, then a comma, then a space, an equals sign and a space, then ml at position i, and then a new line. I have to remember that this can actually be a vector, so I need to paste these things together, using for example the separator equal to comma. No, the separator is nothing; collapse is the comma. So make sure that if I have multiple numbers, you put commas between them.
All right, so let's try this dots function. I'm just going to use the old call that I have. So now it says a = 12, 12, 12 and c = 16. Exactly what I want. Let's use the other one as well, where a is 12 and c is 16. We can add more parameters, right?
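A version of the dots function along these lines might look as follows. It is a sketch: the exact spacing of the output and the use of paste0 with collapse are assumptions based on the description above.

```r
# Print every argument passed through ... as "name = value(s)".
dots <- function(...) {
  ml <- list(...)                                   # capture the arguments as a list
  for (i in seq_along(ml)) {
    # collapse joins the elements of a vector argument with commas
    value <- paste0(ml[[i]], collapse = ", ")
    cat(names(ml)[i], " = ", value, "\n", sep = "")
  }
}

dots(a = c(12, 12, 12), c = 16)
```

Arguments passed without a name still print; in that case the part before the equals sign is simply empty.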
We can add more, and we can say c of 16, 16, 16 for example, and that's the way that we want it. It's a very simple function, just to get you guys to practice using this dot dot dot parameter. This dot dot dot parameter is important because it is used a lot in plotting functions and variadic functions. Good.
So those were the assignments so far. There were two additional assignments. We have the recursive function to find the greatest common divisor. I hope that people were able to do that. Just read up on it on Wikipedia, that's what I said: just google the algorithm and then implement it in R, because it's on there.
But I want to show you guys my answer to question number eight. Question number eight is a hard question which you can spend a long time on: a recursive function to draw a tree. Because that's what recursion is really good at, making fractals, trees and these kinds of things. So I just want to show you my answer. Let me open up the old answers from last year, and I'm going to show you both of them. Let me copy paste the code from last year into the Notepad window so that you guys can see it. So, hashtag Q8.
They are complex functions. It's a draw tree function which takes an x1 and a y1, which is the position that we are currently at. Then it takes an angle, which is the angle at which the current line should be drawn. Let me actually draw this for you to show how it does this. I need to open up PowerPoint. You know what, we're first going to have a small break. I just noticed that it's already 10 past three.
So we're going to have a 10 minute break. When we come back, I will make a really nice drawing for you guys to show you how this recursive draw tree function works. I am going to show you the prettified tree already, so that you guys know how beautiful it can look. Here we see a really beautiful fractal tree, drawn with brown and blue colors. Of course we can define the colors ourselves, I just like more alien-looking trees.
And you can add more trees. This one starts at (300, 0), but I could add a tree at (100, 0) as well. The nice thing is that I can just call the function again, and if I do that it will draw a second tree, but this tree will look completely different from the first one. Why is that? Because it uses random numbers, so it will never draw the same tree twice. And I can add more trees. I can put one at (500, 0) as well, so let's do that. And again you see that this tree looks completely different from the other two.
But I'm going to explain to you using my drawing board how this works exactly, and we're going to go through the code, because I think it's a really fun exercise. People love fractals, and these are fractal trees. This is actually used a lot in game development to generate forests in which none of the trees look alike, so they're all a little bit different. By doing this you get an almost infinite amount of variation just by doing one or two iterations. And I just like how pretty they are. You can change the colors, you can have the line colors be slightly different. Here we have blue, but we could do the same thing for a green tree, because these look pretty alien, right?
But here we can use green to draw a green tree, and then we can draw another green tree here. They all look a little bit different and unique, which makes them very nice. I really love doing these kinds of graphical displays, and people love fractals, so they always look good.
First little break. I've been talking for one hour and 11 minutes, so 10 minute break. I hope I'm back before the music ends. I think the first break is going to be birds, and I hope that we didn't have the birds already, but I didn't look back through all of the old lectures to make sure that I didn't select the same animated GIFs again. So I will be back in 10 minutes. I will start the music, and I will see you guys soon, before the end of the song.
All right, so let's see how we can use recursion to draw these beautiful trees in R. In theory it is simple, but in practice it's a little bit harder. Let's take the drawing board. The first thing that we're going to do is set up a plot window, which we always do. So let's draw a basic plot window, if the drawing thing works. We have a straight line. Oh, that's not a straight line at all. So we have a straight line and we have another straight line. If you draw a tree, or a stick man for that matter, you start off with a single line. Generally when you draw a tree the first line is going to be brownish, which I don't have, so I'm just going to take red. You just draw a line which goes like this. Once you have this first line, you need another line. But we're going to write a function which draws one line, and because of recursion we can then call the same line drawing again. So we have the start point, which is x comma y, right?
And then we have the end point, which is going to be x2 comma y2. The end position of the first line is then going to be the start position of the next line, because the next line is going to be something like this. And we don't have to limit ourselves to calling the function once; we can call it twice or three times with the same start point. If we do it three times, we get something which looks like this. So now we have different end positions again, and for each of these end positions we repeat the process: the end position is going to be the start position of three new lines.
And we are going to swap the colors. We start with the brownish color, and after doing the first two or three layers we switch to something which is green. Then we get a tree which looks like this. If we repeat this often enough, we end up with something which already kind of looks like a tree, after just three or four iterations. And if we continue drawing lines at the end of each other line, it becomes what you see here in R. So the thing to remember is: we're going to write a function which draws a single line, and after drawing this line it calls itself again two or three or four times, but now with the end position as the start position.
All right, so let's go to the code. I made two versions. The first one is the easiest one, because it's not prettified; it doesn't use any colors. So I have x1 and y1... You guys are not looking at the code. So you have x1 and y1. It's a function.
These are the start x position and the start y position. Then we have something which is called angle, because we need multiple different angles; we can't fix the angle. If we look at the drawing, you see that the first line would be drawn at a certain angle, and this one would have a slightly different angle, and this one a slightly different angle as well. So I need a parameter where I can set the angle at which I'm going to draw the line. That's where the angle parameter comes in.
Besides the angle parameter, we also have a depth parameter. The depth parameter is how many of these lines I already had. The first one is at a depth of one, then this is a depth of two, this is a depth of three, this is a depth of four, and so on. I also define a maximum depth, which is how many times I want to repeat the process before I stop. Together, the depth and the maximum depth determine my recursion invariant. And then I have n branch, which is the number of branches that should come out after I draw a single line. So each line is going to have three branches coming out in this case.
The code, when we look at it, is relatively simple. If the depth is not zero, because if the depth is zero we have to stop, what am I going to do? Well, I take my angle, which is just a value between 0 and 360, a full circle. I multiply my angle by pi divided by 180, because I need to go from angles in degrees to angles in radians, right?
The sine function works in radians and not in degrees, so it takes values from zero to two pi. So I take the sine of the angle multiplied by pi divided by 180 and add that to x1, and I do the same with the cosine for y1. So x2 is x1 plus the sine and y2 is y1 plus the cosine. Because sine squared plus cosine squared is always one, this means that the length of the branch will always be one. So here I make the two end positions, the x end position and the y end position.
And then I just draw the line: x goes from x1 to x2 and y goes from y1 to y2. Then, for each of the branches, I call the function draw tree again. I say: draw the tree, start at the end position, so at x2, y2, and use 120 times a random value. I do runif(1), which is a value between zero and one, and then I subtract a half, so this goes from minus 0.5 to plus 0.5, times 120. This means that the angle goes from minus 60 to plus 60 degrees. This is to make sure that the branches don't go entirely back, because I want to have a 60 degree window in which the branch is allowed to go up.
If we look at the drawing, the thing that I'm doing is limiting the spread at each point to a maximum of 60 degrees, and the same thing here: I limit it to 60 degrees relative to the angle that the previous line had. That's just to make sure that I don't get trees which go like this, because a tree which looks like this is not going to be a good tree. So I want the first couple of branches to always go up, and of course after I've had three 60 degree angles, right?
Because I can go like this, 60 degrees, 60 degrees and 60 degrees, at a certain point it can go down. But in the first three rounds I want the tree to grow up, and that's why I'm limiting myself to 60 degrees. All right, then I subtract one from my depth, because I started at a depth of, not zero, no, I started at a depth of 15. I pass in the max depth, and for the next round I draw a random number for the number of branches. Initially I start with three branches, and here I say: draw a random number, multiply it by five, round it, and add one. So I can have anywhere from, not zero, from one to six branches for the next iteration.
So let me show you guys how this looks. In this case the setup is a little bit different, and it's not going to be as pretty as the other ones, but I think the idea is very clear. Here we see the most basic tree that we can draw: a single trunk with three branches branching off. Now when we make a new plot and increase our maximum depth to three, we see that we get another iteration. Here it was unlucky and it only drew one branch, here we have three branches, and here we have four branches. We can reset the plot and say: go to six levels of iteration. Then you see that it starts to look like a tree. And of course we can just continue this. We can play a little bit with the numbers. We can say that we don't want to start at an angle of 15 degrees, but at an angle of 35 degrees, right?
And then you can see that the first one goes like this and then it draws two branches, so it gets this weird-ish shape. But that's kind of how the tree drawing function works, and you can spend a lot of time making this really pretty. You can play with the angles. And then there is the other version that I had, so if we go back to Notepad: the secondary version, which is prettified with colors. It also draws a random color, but if the depth is larger than five, then I'm going to make it brown and I'm going to draw slightly thicker lines. But in theory it's the exact same thing. I limit myself here to 30-degree angles, which is also based on the number of branches, so the more branches I draw, the more freedom I allow for the tree. And in the end you can spend a lot of time on it. You can turn all of these into parameters as well. You can say, for example, I want to have blue trees or orange trees or green trees. But the nice thing is that they look absolutely stunning. At least I think that they look absolutely beautiful. And in this case, I could update the plot window a little bit, saying go to like 400, and draw my tree, and you can see that it draws a tree. It looks like a tree. I can put another tree into the same figure, and I can create a really nice looking forest with nice looking trees.
So this is a very common technique in computer graphics and in games to create random looking structures, which are more or less dynamically generated and never look the same twice. Of course, I can make them look exactly the same by setting my seed. But this is a very, very common technique. So if you didn't do this assignment, that's perfectly fine, but do spend a little bit of time trying to understand how it works. I hope that my drawing made it a little bit clearer how you want to build it up, right?
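The recursive tree drawing described above can be sketched roughly as follows. This is a minimal reconstruction and not the lecturer's actual file: the function name draw_tree, its argument names, the draw switch and the returned segment count are assumptions added for illustration. The sine/cosine step, the plus or minus 60 degree spread and the one to six random branches follow the description in the lecture.

```r
# Hypothetical reconstruction of the recursive tree; names are assumptions.
draw_tree <- function(x1, y1, angle, depth, n_branches = 3, draw = TRUE) {
  # Stopping condition: without it the recursion would never end.
  if (depth <= 0) return(0L)

  # sin()/cos() work in radians, so convert degrees first.
  # sin^2 + cos^2 = 1, so every segment has length one.
  x2 <- x1 + sin(angle * pi / 180)
  y2 <- y1 + cos(angle * pi / 180)
  if (draw) lines(c(x1, x2), c(y1, y2))

  n_drawn <- 1L
  for (b in seq_len(n_branches)) {
    # runif(1) - 0.5 lies in [-0.5, 0.5]; times 120 gives -60..+60 degrees.
    new_angle <- angle + 120 * (runif(1) - 0.5)
    # Between one and six branches for the next level.
    next_branches <- round(runif(1) * 5) + 1
    n_drawn <- n_drawn +
      draw_tree(x2, y2, new_angle, depth - 1, next_branches, draw)
  }
  n_drawn
}

set.seed(1)  # same seed, same tree
plot(c(-4, 4), c(0, 4.5), type = "n", xlab = "", ylab = "", asp = 1)
n_segments <- draw_tree(0, 0, angle = 0, depth = 4)
```

Returning the number of segments is only there to make the function easy to check; a purely graphical version would not need it.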
So you just make a function which draws a single line using a random angle, and then it calls the same function x times, once for each of the branches, and each of the branches does the same thing. But of course we have to stop somewhere, so that's why we need this recursion invariant, the stopping condition. Very fun, very fun. I really love it. Good.
On to the lecture for today. So today we are going to talk about regression. Like I said, I have a lot of experience with regression, so feel free to ask any questions. The regression lecture from last year is online as well. And I do love doing this lecture. Regression is one of these core techniques in machine learning, in AI. It is a core technique in doing predictions. So if you ever want to do stock market predictions or other things, then you're forced to deal with regression. I already gave you guys the exam dates.
So what am I going to talk about? I'm going to talk a little bit about the basics of regression, so: when can we do regression? Then I want to show you guys single linear regression and go into a little bit more detail on how to calculate things like critical values and p-values. And then we will discuss different types of linear regression, like multiple linear regression and quadratic regression. And I want to say some things about model selection, since when you are building models to describe reality, you have to have a way to compare these models against each other in an unbiased way. So like I said, ask questions. It's just a basic introduction on how to do basic linear regression in R using the linear model function and the ANOVA. So we're going to build linear models and we are going to use ANOVA, analysis of variance, to figure out how well our model fits the data that we have. I have a lot of experience.
I already said that. I tried to keep the lecture short, since otherwise we would sit here until like 10 in the evening. So next week we will be doing linear mixed models, and the week after we will be doing generalized linear mixed models. And of course feel free to ask any questions if you want more details.
So here you see the field of regression. All of these things are called linear models or regression models. We have things like ANOVA, analysis of covariance, multiple linear regression. We have time series analysis, which is called ARIMA. We have non-linear time series. We have generalized linear models for when things are not normally distributed. Today we will be talking about this part of the whole field. Next week we will be talking about repeated measurements models, and the week afterwards we will be talking about generalized linear models. So three whole lectures about regression.
So, very basically, regression is used to estimate the relationship among variables, among measurements that you did, and it is used to do predictions. That is the goal of regression: to figure out how variables are connected to each other and how we can exploit these correlations between variables, or these covariances between variables, to do predictions, so that we know when to buy stocks to make money, or so that we know when to do a treatment for the biggest effect. There are many techniques for analyzing models with several variables. Like I showed you guys, the field of regression is really big. But the focus is always on the relationship between the dependent variable and one or more independent variables. The dependent variable is the outcome or the effect. It is the thing that we want to predict. And the independent variables are the input or the cause. They are the variables that we assume are controlling the dependent variable, so they are more or less the things that we use for our prediction.
So in the regression model we have several variables. The basic regression model looks like this: we have y, which is our dependent variable, which is linearly related to some function of x, which are the independent variables, and the betas, the unknown parameters. The unknown parameters are the things that we want to know. We want to know: if the temperature increases by five degrees Celsius, how does the likelihood of getting rain change? Or if the temperature rises two degrees Celsius, how does this affect the sale of ice cream? We might be interested in investing in ice cream stocks.
We have a couple of constants as well. We have n, which is the number of independent measurements that we did. It is, for example, the number of individuals that we measured, or the number of time points at which we measured ice cream sales. And then we have k, and k is the number of unknown parameters that we are estimating. So the independent measurements are what gives us power, and k is what takes away power. For example, if we only look at the temperature to predict ice cream sales, we have one unknown parameter. So k is one, and n is the number of days on which we measured our ice cream sales and the temperature. But we can include much more. We can look at the temperature, but also at whether it is raining or not, whether it is a Monday, and any other things that we might think would affect sales of ice cream, right?
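Written out, the model described above might look as follows. This is a sketch in standard notation: the error term \varepsilon_i, the observation index i and the function name f are assumptions added here, while y, x, \beta, n and k are the quantities named in the lecture. An intercept, when included, is one more estimated parameter.

```latex
% General form: y depends on the independent variables x through unknown parameters beta
y_i = f(x_{i1}, \dots, x_{ik};\ \beta_1, \dots, \beta_k) + \varepsilon_i,
\qquad i = 1, \dots, n

% Linear case with k independent variables
y_i = \beta_1 x_{i1} + \beta_2 x_{i2} + \dots + \beta_k x_{ik} + \varepsilon_i

% Ice cream example with temperature as the only predictor (k = 1)
\text{sales}_i = \beta_1 \, \text{temperature}_i + \varepsilon_i
```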
So the most basic regression model means that we are looking at a dependent variable which is controlled, or which we think is controlled, by a number of independent variables, so generally things that we can influence. And for each of these independent variables we want to estimate something called beta, and beta is more or less the directional coefficient, the slope. So it is the relationship you see when you put the independent variable on the x-axis and the dependent variable on the y-axis.
In regression we always talk about statistical power. When n, the number of observations that we have, is larger than k, and the measurement errors follow a normal distribution, then we call this the excess of information. So the statistical power that we have to estimate these beta parameters depends on two things and two things only: the number of individuals that we have observed, and the number of parameters that we are trying to estimate. So n minus k is called the excess of information. It gives us the power to make statistical predictions about the unknown parameters.
When we do regression, especially when we do basic linear regression with a normality assumption, we have a lot of assumptions. In blue I highlight the most important assumptions that need to hold for a model to be valid. All of them have to hold for the model to be valid, but the blue ones are the ones that go wrong the most often. So when I'm reviewing papers where people do linear regression, I always check if these three or four major assumptions hold, and I will also check if the other assumptions are more or less satisfied. But I'm not going to complain about those in much detail. I am going to complain if one of the blue ones is not satisfied. So the first one is the one that goes wrong the most often, right?
So the assumption here is that the sample you are looking at is representative of the population that you are trying to make a prediction for. When I think about human genetics, I see this going wrong all of the time. They will do a certain study and they will look at patients that arrived at the hospital, because if you're a human geneticist, you generally work at a hospital. So the people that are in your random sample are patients walking into the hospital. But people walking into the hospital are not a representative subset of the population. It is not a random draw: people going to the hospital are sick. And generally you want to make predictions for people whether they are sick or not sick. Because the initial random sample that you take is people walking into the hospital who are sick, or who think that they are sick, all of the statistics that you are using are only going to tell you something about people walking into your hospital.
It's the same as when people do studies in a certain area. For example, they want to know if a certain drug has an influence on a certain disease. So they are going to have a hospital where some people get the drug and other people get the placebo. But this hospital is, for example, located in Germany. And then the conclusion in their paper is: this drug increases your chance of surviving cancer. But they fail to mention that their sample is only representative of people living near this one hospital in Germany where they did this study. If you look at big studies that are being done by, for example, Pfizer or other big companies, they measure these effects by giving their drugs and their placebos to people all over the world, people who have different cultural backgrounds and different heritage. And that is required, because you cannot look only at a small German population and then extrapolate your findings to a larger population like all humans, right?
Humans are very diverse, and Germans, or Europeans in general, are just a subset of all of them. So the first one is the one that goes wrong the most often: the sample that you are doing your prediction on needs to be representative of the population.
The next assumption is about the error. In linear modeling we have no assumption on the input, so our dependent variable, our y, can have any distribution that we want. The normality assumption in linear regression is on the error term. The error is a random variable with a mean of zero, conditional on the explanatory variables. So after we fit all of the different parameters in our model, we look at the error, and the error needs to follow a Gaussian distribution with a mean of zero. The original input distribution doesn't have to. This is different from a t-test: in a t-test the original distribution that you input needs to be a normal distribution. In linear regression there is a normality assumption, but the normality assumption does not hold for the dependent variable. It holds, or it should hold, for the error term. So after removing all of the effects that we are trying to estimate, the remaining, the residual variance, should follow a normal distribution with a mean of zero.
The next one is not that hard: the independent variables are measured with no error. And of course this is impossible. You cannot measure things without an error, but it is an assumption for linear modeling. So the only thing that you can learn from that is that all models are wrong, because you cannot measure something without an error. If I'm just measuring body weight, then the scale that I'm using doesn't have infinite precision. And that is not bad. It only means that you have to do your utmost to keep the error as small as possible.
So when you are measuring people, you should not use a scale which is accurate to 50 kilograms. You should use a scale which is accurate to like two digits behind the decimal point. If you're measuring mice, the body weight of mice is much, much less. You need more than two digits behind the decimal point, because mice are generally in the order of 10 to 40 grams. So having a scale which tells you that this is somewhere between zero and 100 grams is not going to be proper. But you cannot measure independent variables with no error, while the assumption is that they are measured without any error. So you cannot satisfy this assumption even if you wanted to.
The predictors are linearly independent: it is not possible to express any predictor as a linear combination of the others. And this goes wrong all of the time, because we live in the real world and things are correlated with each other. For example, the temperature outside and the amount of precipitation from a cloud are correlated with each other. So when you're predicting something about atmospheric conditions and you are using the temperature and the amount of precipitation in the same model, for example you say the number of hailstones I observed on the ground is determined by the temperature plus the precipitation plus some other factors, then these two things are not completely uncorrelated, because the precipitation is tied to the temperature that you are observing.
Errors are uncorrelated. That means that when I look at the error terms, the error term is also not correlated with any of the predictors or with the dependent variable.
The variance of the data should be approximately equal across the range of your predicted values. This is called homoscedasticity, or the absence of heteroscedasticity. And if this is not the case, you can log transform your input data. So how does this look?
Let's just go back to the drawing board, since I've not been using it too often. Let me just get a new drawing. Drawing time. So homoscedasticity or heteroscedasticity is a relatively difficult concept, but with one picture I can explain to you what it means. When we have our measurement values, what we don't want to happen is that there is a bigger error the higher the values become. Here we can see that the more x increases, the bigger the variance in y becomes. This is called heteroscedasticity, and we don't want that. Because if we now fit the best fitting line, and let me use another color for the best fitting line, let's use green, like this, then we see that the errors here are very small but the errors here are very large. So there is a correlation between the error term that we observe and x, our predictor variable. This is called heteroscedasticity, and we don't want that. If we see something like this happening, generally what you will do is take a log transformation of your data. And that is actually the reason why, for example, microarray data is log transformed: you see this heteroscedasticity if you look at the larger values on the microarray, because the larger intensities just have more variance than the lower intensities on the array. And this just has to do with big numbers: the bigger a number, the harder it is to measure this number exactly.
So those are the assumptions. But check the blue ones. If you ever need to review a paper and they do linear regression, make sure that you check at least the three blue ones, and make sure that those hold. Good.
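To make the heteroscedasticity picture concrete, here is a small sketch on simulated data. Everything in it is an assumption for illustration: the data are made up with multiplicative noise so that the spread grows with x, and the helper spread_ratio is a hypothetical name, not a standard function. It compares the residual spread in the upper half of x to the lower half, before and after a log transformation.

```r
set.seed(42)
x <- runif(500, 1, 10)
# Simulated (made-up) data: multiplicative noise, so larger values vary more.
y <- exp(0.3 * x + rnorm(500, sd = 0.3))

fit_raw <- lm(y ~ x)       # heteroscedastic: errors grow with x
fit_log <- lm(log(y) ~ x)  # log transform stabilizes the variance

# Hypothetical helper: residual spread for large x divided by spread for small x.
# A value near 1 suggests roughly equal variance across the range.
spread_ratio <- function(fit, x) {
  r <- resid(fit)
  sd(r[x > median(x)]) / sd(r[x <= median(x)])
}

ratio_raw <- spread_ratio(fit_raw, x)
ratio_log <- spread_ratio(fit_log, x)
ratio_raw
ratio_log

# Visual check: residuals against the predictor, raw versus log scale.
par(mfrow = c(1, 2))
plot(x, resid(fit_raw), main = "raw y", ylab = "residual")
plot(x, resid(fit_log), main = "log(y)", ylab = "residual")
```

On the raw scale the residuals fan out as x grows, while on the log scale the spread stays roughly constant.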
So in R, regression is done via the lm function. Or not so much the regression itself, but the linear model is fitted via the lm function. Here we are using the airquality data set. I'm first loading the airquality data, and airquality is a data set which has ozone measurements, temperature measurements, and some other measurements as well: it has the wind speed and some other measurements that we can look at. The most basic model that I can build using linear regression in this case is to say that the ozone concentration which is observed in the air depends on the temperature that we have on that day. And then I do my linear model. I say: make a linear model where we regress the ozone concentration on the air temperature, and then I'm going to say: comma, data is airquality. Here I'm specifying where the ozone column is and where the temperature column is. So airquality just has a column called Ozone and a column called Temp. And then I have my lm.temp, and I'm just going to type summary(lm.temp), and it will give me the overview, the summary of the model. So how does this look? Well, it will print out the formula that you used, it will give you the residuals, and it will give you the coefficients. You see here that there are two coefficients: there is the intercept and there is the temperature.
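The model fit described above can be sketched as follows. The object names lm.temp and lm.sum follow the names used in the lecture; the exact layout of the lecturer's script is not shown in the transcript, so treat this as an approximation.

```r
# airquality ships with base R (datasets package).
data(airquality)

# Regress the ozone concentration on the air temperature.
lm.temp <- lm(Ozone ~ Temp, data = airquality)

# Prints the formula, residual quantiles, coefficients and R squared.
lm.sum <- summary(lm.temp)
lm.sum

# The two estimated coefficients: intercept and temperature slope.
coef(lm.temp)
```

Rows with a missing Ozone value are dropped by lm automatically, which the summary reports as observations deleted due to missingness.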
So when the temperature is zero, the ozone concentration is expected to be about minus 147. And then for every degree that the temperature rises, the ozone concentration will increase by 2.4 units. This is what the estimate means. And we see here that both of these effects are very significant, with the three stars, and it tells you here that the multiple R squared of the model is 0.48. That means that around 48 percent of the variance in the ozone concentration is captured by this temperature coefficient.
So how do we now plot this, if we want to see what's going on? We can plot the airquality ozone column versus the temperature column, and I'm plotting them here in blue using solid points. And I'm getting the coefficients out. So I'm getting the intercept, which is where the line crosses the y axis, at x equals zero. You can see that when the temperature is around 60 degrees Fahrenheit, the ozone concentration is very low, around zero. When the temperature is 90 degrees Fahrenheit, the ozone concentration is around 60 to 70. And you see that there's also variance in our data. So if I want to plot this best fit line, I'm just going to take the intercept and the temperature out of the two coefficients. Here the a coefficient will be about minus 147 and the b coefficient will be 2.43. And then I'm just going to add the best fitting line to my data, saying: draw a line, the alpha coefficient of the line, the intercept, is going to be a, and the beta coefficient of my line, the slope, is going to be b, make the line red and plot it in my data. And you can see that this model fits the data relatively well. A lot of the variance is explained by this one line that we draw through the data.
Of course, when we look at the model, it tells us that we have standard errors, right?
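The plotting steps described above can be sketched like this. The variable names a and b follow the lecture; axis labels are additions for readability.

```r
data(airquality)
lm.temp <- lm(Ozone ~ Temp, data = airquality)

# Observed data: blue solid points.
plot(airquality$Temp, airquality$Ozone, col = "blue", pch = 19,
     xlab = "Temperature (degrees F)", ylab = "Ozone")

# Pull the intercept (a) and the temperature slope (b) out of the model.
a <- coef(lm.temp)[["(Intercept)"]]
b <- coef(lm.temp)[["Temp"]]

# Add the best fitting line y = a + b * x in red.
abline(a = a, b = b, col = "red")

# Predicted ozone at 60 and 90 degrees Fahrenheit, straight from a and b.
a + b * c(60, 90)
```

Calling abline(lm.temp, col = "red") would draw the same line without extracting the coefficients by hand.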
So it is not able to identify with 100 percent certainty where the intercept is going to be, or what the exact influence of temperature is going to be. We always have an error in the prediction. So we want to calculate the confidence interval, and the confidence interval means that, with 95 percent confidence, the real value of the parameter should be inside of the interval. We have to do a couple of steps in computing it ourselves. Of course R can do this for you as well, but I actually wanted to show you guys how you can do this by hand.
So what we want to do is calculate the margin of error. In standard single linear regression we can use the t statistic, so this is the t distribution from the t-test. It needs the standard error, which we can get from the lm summary, because it just gives us the standard error for both of the parameters. What we also need is the critical value. The critical value is the probability boundary that we are assigning to it. In our case, because we work in biology, our standard probability boundary is an alpha value of 5 percent. However, if we don't know whether the parameter that we are estimating is going to have a positive influence or a negative influence, we have to do a two-sided test, just like with the single-sided t-test versus the two-sided t-test. So we have to do it two-sided, which means that we divide our alpha by two, and then we say one minus that. So we are dealing with 0.975, which is more or less the two-sided version of the t-test.
We need to know our degrees of freedom, which is n, again the number of independent observations, minus two. Why minus two? That's just the case. I'm not going to explain degrees of freedom in detail, they will come back. But the degrees of freedom are more or less the statistical power that you have, and we do minus two here because we have two parameters that we estimated.
So the first parameter that we estimated is the intercept, and the second parameter that we estimated is the slope, the temperature coefficient. That is why we subtract two: after fitting these two values, the intercept and the slope, only n minus two observations are independent. Two of them are fixed, because we calculated two numbers.
The critical value we can calculate. We take the quantile function of the t distribution, we fill in the probability boundary that we want to have, which is now 0.975, and we have n minus two degrees of freedom, and this call in R will just give us the critical value. And the margin of error is defined as the critical value times the standard error. The standard error we can get from the summary, and the critical value we can calculate based on the number of observations that we have and the probability boundary that we want to have. So we can calculate a 95 percent confidence interval, but we can also calculate a 99.9 percent confidence interval. That would just mean that instead of having an alpha of 0.05, we go to an alpha of 0.001. So we can be as stringent as we want.
So the margin of error is the critical value times the standard error. In our example we can take the summary, which I'm calling lm.sum, and I'm going to take the coefficients, and from those the standard error of the temperature coefficient. That is the one that I want to have, and it has a standard error of 0.233, so that's this number here. I'm not going to look at the intercept at this point. I just want to know what the confidence interval of the temperature is. We can do the same thing for the intercept, but we're not really interested in the intercept that much.
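The by-hand recipe above (critical value times standard error) can be sketched as follows. One deliberate deviation, flagged as an assumption: the lecture takes n from the number of rows in airquality, whereas this sketch uses df.residual(), which counts only the rows lm actually used after dropping missing Ozone values. For this data set the two choices give nearly the same critical value. The last line cross-checks against R's built-in confint.

```r
data(airquality)
lm.temp <- lm(Ozone ~ Temp, data = airquality)
lm.sum  <- summary(lm.temp)

# Estimate and standard error of the temperature coefficient.
est <- lm.sum$coefficients["Temp", "Estimate"]
se  <- lm.sum$coefficients["Temp", "Std. Error"]

alpha <- 0.05                  # 5 percent, two-sided
# Assumption: use the residual degrees of freedom of the fitted model
# (rows with missing Ozone were dropped) rather than nrow(airquality) - 2.
df   <- df.residual(lm.temp)
crit <- qt(1 - alpha / 2, df)  # critical value at 0.975
moe  <- crit * se              # margin of error

ci <- c(lower = est - moe, upper = est + moe)
ci

# Built-in equivalent for comparison.
confint(lm.temp, "Temp", level = 0.95)
```

For a 99.9 percent interval, setting alpha to 0.001 is the only change needed.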
I can now calculate my own critical value. The n is just the number of rows in the airquality data set, because every row is an independent observation. I subtract 2, because we calculated the beta for the temperature and we calculated the intercept. And then this gives me that the critical value of the t statistic at this point is 1.975. So I can calculate my margin of error on the coefficient that I computed: I just multiply the standard error by the critical value, and that gives 0.46. So the confidence interval that I have means that the real value of the temperature coefficient is going to be as low as 1.97, all the way up to 2.89. And of course the estimated parameter was 2.43, which is exactly in the middle. We have just as much error on the bottom as we have on the top. Good. So the real value is going to be between these two numbers. Now we know.
So I can actually plot the confidence interval as well. I can use a prediction to predict the values of the ozone using the temperature. I'm just going to make a range on which I want to predict, and this is going to be from the minimum temperature in my data to the maximum temperature in my data. If we look at the plot, this is going to range from around 55 all the way up to almost a hundred. So I'm just going to say: do a prediction over the whole range in which I have observed data. That's what I'm doing here. Then I'm making a data frame of it, because this is a vector, but the predict function takes a data frame with the data in the columns. So I'm just going to say: make a data frame where the temperature is the p range, by is one. That's not that interesting.
Oh wait, let me actually fix the slide, because that's wrong. Something went wrong here, because the by parameter of course belongs to the seq function. So I'm going to say: from the minimum to the maximum, by one. All right, let's go back.
So for each degree Fahrenheit I'm going to do a prediction: from the minimum temperature, around 55 degrees, I'm going to do a prediction for 56, 57, 58 and so on. It's just a sequence. And I'm going to then make a data frame which has one column called Temp, and then I'm just going to call the predict function. The predict function takes as its first parameter the linear model on which it needs to do the prediction. So it takes the linear model, and it takes the new values, which in our case is the range from the minimum to the maximum. I'm going to say I want to have the confidence interval, so interval is c for confidence, and I want to have the 95 percent confidence interval. I'm going to make a plot, and I'm just going to plot the values of the temperature versus the ozone. And then I'm going to add the regression line, so the fit, which I'm going to get from my prediction. And I'm going to have the upper limit and the lower limit, which I'm going to plot as well.
Of course this is a lot of work. You can see that we have to do our own prediction, and then using this prediction we can plot the confidence interval on top of the data that we have. We could also just use an external library. There are many libraries that can do this, so you can use the visreg library in R to do the same thing. So when we then look at it, we see now our regression line.
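The prediction and confidence band steps above can be sketched as follows. The names p.range and new.data are assumptions based on what is said in the lecture. The data frame column has to be named Temp so that it matches the predictor name in the model formula.

```r
data(airquality)
lm.temp <- lm(Ozone ~ Temp, data = airquality)

# One prediction per degree Fahrenheit over the observed temperature range.
p.range  <- seq(min(airquality$Temp), max(airquality$Temp), by = 1)
new.data <- data.frame(Temp = p.range)

# Matrix with columns fit (regression line), lwr and upr (95% confidence band).
pred <- predict(lm.temp, newdata = new.data,
                interval = "confidence", level = 0.95)

plot(airquality$Temp, airquality$Ozone, col = "blue", pch = 19,
     xlab = "Temperature (degrees F)", ylab = "Ozone")
lines(p.range, pred[, "fit"], col = "red")            # regression line
lines(p.range, pred[, "lwr"], col = "blue", lty = 2)  # lower limit
lines(p.range, pred[, "upr"], col = "blue", lty = 2)  # upper limit

# Width of the band at each temperature; it is narrowest near the mean.
band.width <- pred[, "upr"] - pred[, "lwr"]
p.range[which.min(band.width)]
```

predict also accepts interval = "prediction", which gives a wider band intended to cover individual new observations rather than the regression line itself.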
We see our confidence interval. And now you might expect 95 percent of the data to be within this confidence interval. And then you guys will say: but that's not the case, Danny, you are completely crazy, 95 percent of the data is not within the blue lines, or here within the gray area. And why is this not the case? Because my model is not perfect. My model had an R squared of only 48 percent, so temperature explains only about half of the variance in ozone, and the confidence interval describes how certain we are about the regression line itself, not where the individual observations fall. That's important, because people always ask me: but your confidence intervals are way too narrow, they don't contain 95 percent of the data. They do not contain 95 percent of the observed data, because the observed data is of course only explained by this model for 48 percent. So we expect far fewer than 95 percent of the dots to be within the confidence interval.
And you can see that the confidence interval is not the same everywhere. The width of the confidence interval is smaller in the middle than it is at the edges. And why is that? Well, that is because of how we estimate the parameter. Let me just sketch this out. We have our x values and we have our y values. In this case we have the temperature on the one axis, still writing with green, and here we have our ozone. And we have our values, which kind of go like this. These are our observed values, and then we have our regression line. But the regression line has a directional coefficient, and we just learned that this directional coefficient can be as high as about 2.9 but as low as about 2.0,
I think. And when you would draw all possible regression lines in your data, then you see that what starts happening is that in the middle the confidence interval is smaller than it is at the edges, because of the variance in slope. So the confidence interval that you observe in your data is nothing more than an infinite number of regression lines within the margin of error that we originally calculated, so between 1.97 all the way up to 2.89. It's just all of these lines, and because of the way that lines work, that creates a smaller confidence interval in the middle than at the edges. Good.
How do we now figure out how good our model is? We can look at the residuals. The residuals are the variance which is left over after fitting the effects, so this is a measure of how well the regression line fits the data. The aim of regression is to minimize the sum of squares of the residuals. In general a sum of squares is the squared difference between the observations and the mean of the observations. In our case it's the sum of squares of the observed value relative to the predicted value that we have, and our predicted value is just a straight line. So minimizing the sum of squares means that we want to have a line which is as close to the individual dots as possible. We want to minimize this sum of squares, and if we do this, then what we are doing is fitting a maximum likelihood model.
So how do we visualize the residuals? Well, residuals are easiest on clean data, which means that no NAs should be in the data. Because of course, if we have an ozone observation which is NA, we cannot calculate the sum of squares, because we cannot take the observation minus the predicted value. But the same thing holds when, for example, the temperature is missing, right?
We cannot do a prediction when we have a missing value. So the first thing that I'm going to do is remove all of the NAs from the ozone, because the ozone measurement has some missing values. I'm just going to make a new clean data set which has no rows with missing values. Then I'm going to do my linear model based on my clean data, and I'm going to do my prediction. So I have an lm.temp model and I have an lm.temp.clean. And then what I'm going to do is plot the airquality clean temperature versus the ozone. That's just the observed data that we already saw. I'm going to put in the regression line from my model. And then I'm going to say: for i in one to the number of rows of the airquality clean, I want to draw a line from the observed data point towards the predicted data point. The prediction that I'm doing here does not have an additional input. When you don't provide a range to predict across, it will just do a prediction for each of the observed temperature values. So I'm just going to draw lines for each of the temperature points, from the real observed ozone value to the predicted value, and I'm going to do that in blue.
So this is what our residuals look like. We have an observation here, while the prediction was all the way here. We have some observations here, while the prediction was here. So minimizing the sum of squares means that we take the square of this number, because all of these things are in theory numbers. Here we go from like 40 all the way down to like zero, so this is 40. So here we have 40 squared, 20 squared, 10 squared. Here we have minus 20, and if we square that it becomes positive. So we sum up all of these numbers together, and that is the sum of squares. It is the amount of variance which is left after we fitted our line.
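The residual plot and the sum of squares described above can be sketched as follows. The names aq.clean and lm.temp.clean approximate the names mentioned in the lecture; the exact cleaning step in the lecturer's script is not shown, so removing rows with a missing Ozone value is an assumption.

```r
data(airquality)

# Keep only rows where Ozone was actually measured (Temp has no missing values).
aq.clean <- airquality[!is.na(airquality$Ozone), ]
lm.temp.clean <- lm(Ozone ~ Temp, data = aq.clean)

# Without new data, predict() returns one fitted value per observed row.
pred <- predict(lm.temp.clean)

plot(aq.clean$Temp, aq.clean$Ozone, pch = 19,
     xlab = "Temperature (degrees F)", ylab = "Ozone")
abline(lm.temp.clean, col = "red")

# One vertical blue line per observation: observed value to predicted value.
for (i in seq_len(nrow(aq.clean))) {
  lines(c(aq.clean$Temp[i], aq.clean$Temp[i]),
        c(aq.clean$Ozone[i], pred[i]), col = "blue")
}

# Sum of squares of the residuals: the quantity regression minimizes.
rss <- sum((aq.clean$Ozone - pred)^2)
rss
```

The same number is available directly as sum(resid(lm.temp.clean)^2) or deviance(lm.temp.clean).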
It's the squared variation. Good, let's take a break here. Are there any questions so far about single linear regression? Because we already discussed a lot, right? We discussed how you can fit it using the lm function, and how you can use the predict function to do a prediction based on your model. We talked about confidence intervals, how these relate to the data that we have, and how these relate to the estimated parameters. And we also talked about residuals and what the residuals of our data are.

So I'm going to wait a little bit, because there's always a little bit of a delay on YouTube. If there are no questions, which is fine, then we are going to go to the second break. I forgot which animal I picked for the second break, but you'll figure it out soon enough when I start the break.

Good, so there are no questions so far, so I hope that everything's clear. If I ask what a residual is, or what we want to minimize, then you guys can answer these questions on the exam, which is perfectly good, because regression is a relatively difficult concept to deal with. And it's not a solved issue at all. There are people that work in data analysis and on optimizing regression their entire life, so it's one of these active fields of research within science.

Good, but that being said, enjoy the second break. I'm going to start the music, I'm going to switch to the second break, and we'll be back in like five to ten minutes.

All right, I made it back again. I hope everyone's still awake. I'm cooking inside, the temperature is just getting crazy here. All right, so around 40 minutes left.
We still have more than half of the slides to go, so I'm going to speed up a little bit. But I think that once you understand the basics of regression, multiple linear regression is actually quite easy. Because of course, we don't just have the temperature measured. We might have the wind speed measured, the month, the day, the solar radiation. So there's a lot of different things that can predict the ozone concentration that we have. When I look at the head of the air quality data set, we can see that we have our ozone, but we also have different things that we can use for our predictions.

So when we have multiple linear regression, we have a mathematical model with, for example, two factors. We have the alpha here, which is our intercept. Then we have our first directional coefficient, beta 1, for the first parameter that we're estimating, for example temperature. And then we have our second unknown parameter, our beta 2, which gives us the influence of the second variable. You can of course generalize this to as many variables as you have.

So for example, if we model the ozone as a function of temperature and wind, then ozone is our dependent variable, we have the intercept, and we have the temperature and the wind as the two independent variables that we are looking at. And of course, we now have to estimate two betas and a single intercept.

In R nothing much changes, we can use the lm function again. We say that the ozone is a function, using the tilde, of temperature plus wind. We fit it using the air quality data set and then we just look at our summary, right?
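In R, a minimal sketch of this fit could look as follows, assuming the built-in airquality data set; the names lm_temp and lm_temp_wind are only illustrative.

```r
# Assumption: built-in airquality data; object names are illustrative.
lm_temp      <- lm(Ozone ~ Temp, data = airquality)          # single regression, for reference
lm_temp_wind <- lm(Ozone ~ Temp + Wind, data = airquality)   # multiple regression

summary(lm_temp_wind)               # estimates, p-values and R squared

coef(lm_temp_wind)                  # intercept, beta for Temp, beta for Wind
summary(lm_temp)$r.squared          # variance explained by temperature alone
summary(lm_temp_wind)$r.squared     # variance explained by temperature plus wind
```

By default lm silently drops rows that contain an NA in any of the variables used, so both models here are fitted on the same rows with an observed ozone value.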
And again we get our estimated intercept, we get the parameter for temperature, which has changed, and we now have an estimate for the wind. Furthermore, we see that the model fits better: 56.8 percent of the variance in our ozone concentration is explained by this model. So compared to our first model, which explained 48 percent, this model explains almost 57 percent of the observed variation.

The thing to notice here is that the estimate for temperature changed slightly. The initial estimate was that for every degree Fahrenheit increase we have 2.4 units more ozone, but now we have an estimate of 1.84. And this is because there is of course a relationship between wind and temperature as well. So the wind takes a part of the variance which was explained by the temperature, and the explained variance is distributed over the two terms.

Again, we can plot the observed values versus our estimated values if we want to. In this case, I'm using the with function so that I don't have to type airquality all of the time. I'm just going to do a prediction using our new linear model, and I'm going to make a list in which I provide the temperature and the wind speeds. So I'm only going to do a prediction where we had a temperature and a wind measurement. I'm going to plot and then I'm just going to add points. The plot gives the black dots, so those are the observed values, and in red I'm going to add the predicted values to our graphic, right?
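A sketch of that plot, assuming the built-in airquality data set. The object names are illustrative, and a data.frame is passed as newdata here where the lecture mentions a list; that is my choice for the sketch, not necessarily what the slide does.

```r
# Assumption: built-in airquality data; object names are illustrative.
lm_temp_wind <- lm(Ozone ~ Temp + Wind, data = airquality)

# with() lets us write Temp and Wind instead of airquality$Temp and airquality$Wind
predicted <- with(airquality,
                  predict(lm_temp_wind, newdata = data.frame(Temp = Temp, Wind = Wind)))

# Observed values as black dots, predicted values in red
with(airquality, plot(Temp, Ozone,
                      xlab = "Temperature (degrees F)", ylab = "Ozone (ppb)"))
with(airquality, points(Temp, predicted, col = "red", pch = 19))
```

Because the wind differs between days with the same temperature, the red points scatter around a line instead of lying exactly on one.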
So again, we look at the temperature axis and we look at the ozone axis. Of course, I could have looked at the wind axis as well, but in this case I think it's more logical to come from the temperature side, because the original model that we used was also based only on the temperature. But these are the predicted values when we have the wind and the temperature in our model. You can see that this already seems to fit a lot better than the single straight line that we originally had.

We can also add interactions to our model. An interaction means that we have an x1, which might be temperature, and an x2, which might be the wind observation, and then we want to say that there's an interaction between the two. It might be that at higher wind speeds the temperature has a bigger effect, or that at lower wind speeds the temperature has a bigger effect. So we just have an additional unknown parameter, which is multiplied with the interaction term, x1 multiplied by x2.

In our example here, when we use temperature and wind, we can add an interaction term. In R we use the colon to put interaction terms into the model. So again, is there an interaction between wind and temperature? We just pose the following model, saying we have a linear model where the ozone is controlled by the temperature, by the wind speed, and by the interaction between temperature and wind. We do the summary function and again we see that our estimates have changed. The temperature now has a much bigger influence, all of a sudden the wind also has an influence, of 14, and the interaction between the two is slightly negative. We also see the probability values.
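As a minimal sketch, assuming the built-in airquality data set (the object names are illustrative), the interaction model can be written with the colon, or with the star as a shorthand for both main effects plus their interaction:

```r
# Assumption: built-in airquality data; object names are illustrative.
lm_interaction <- lm(Ozone ~ Temp + Wind + Temp:Wind, data = airquality)
summary(lm_interaction)             # estimates and p-values for all four terms

# Temp * Wind expands to Temp + Wind + Temp:Wind, so this is the same model
lm_star <- lm(Ozone ~ Temp * Wind, data = airquality)
all.equal(coef(lm_interaction), coef(lm_star))
```

The coefficient labelled Temp:Wind is the extra unknown parameter that multiplies temperature times wind.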
So all of these terms are highly significant. That means that yes, there is a significant interaction between temperature and wind when we are predicting our ozone concentration in the air. Again, we have a better multiple R squared, because of course the more parameters we put in our model, the better the model fits. In this case our model explains 62.6 percent. So compared to our first model, which explains 48 percent, the second model explains 56 percent, and this model with the interaction term now explains 62 percent of the observed variance.

Again, the estimate for the temperature has changed. The original estimate was 1.84 and now the estimate is 4.07. And this is of course because the interaction term is very collinear with both the wind and the temperature. So some of the variance in the data does not get assigned to the main effects, but to the interaction between the two main effects. Good.

When we talk about linear models, people always think about straight lines. But that is not true, because we can also talk about quadratic linear regression. Linearity, or linear regression, means that a change in one of the independent variables always yields a corresponding change in the response variable. More precisely, the unknown parameters, the betas, have to enter the model linearly. People always think about the standard formula for a straight line, saying y equals the intercept plus a certain unknown parameter, a certain beta, times a main effect that we are looking at. However, the following functions are also linear functions, because every increase in x will have a corresponding increase in y.

So look at the following formula, where our ozone is predicted by an intercept plus a beta coefficient times x to the power of two. We take the second power of x, and x itself is now not linear in the sense of a straight line, because it's now a curved line. It is now x to the power of two.
This is still a linear model. The same thing holds when we do something like y equals the intercept plus e, the base of the natural logarithm, to the power of beta one times x. Here we see beta one times x, which is a linear effect, and if we take e to the power of a linear effect, the effect itself still continues to be linear. (Strictly speaking, lm can only fit this kind of model after you transform the data so that the beta enters the formula linearly again, for example by regressing log(y) on x.)

So in R we can also do quadratic regression. We can fit quadratic regression, but we can also fit this kind of e to the power of a regression coefficient. So how do we do x to the power of two? Well, imagine that in our mathematical model we assume that the temperature has an effect, but the second power of temperature also has an effect. Then we can write down the following model, saying that the ozone is the intercept, plus a beta that we estimate for temperature to the power of one, plus a second beta that we estimate for temperature to the power of two.

In R we use this, well, floating hat thingy, the caret, for quadratic regression. However, we need to surround the statement using the I function. This just has to do with the way formulas for linear regression work in R, it's nothing magical. But just remember that every time you want to fit a quadratic term, or a term to the power of three or to the power of four, you have to wrap this term in the I function before you put it into the model.

So how do we specify a model like this in R? Well, again, we take a summary of a linear model where we say that the ozone concentration in the air is determined by the temperature plus I of temperature to the power of two. Again, we use the air quality data, and now we see that we get the formula back, we get our coefficients with the estimates, and we see that there's a significant effect of the temperature to the power of two. Which we kind of already saw when we looked at the data, right?
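A minimal sketch of the quadratic fit, assuming the built-in airquality data set; the object names are illustrative. It also shows what happens when I() is forgotten, and draws the fitted curve from the estimated coefficients instead of typing the numbers in by hand.

```r
# Assumption: built-in airquality data; object names are illustrative.
lm_quad <- lm(Ozone ~ Temp + I(Temp^2), data = airquality)
summary(lm_quad)

# Without I(), ^ is formula syntax (crossing terms), so Temp^2 collapses to Temp
lm_no_I <- lm(Ozone ~ Temp + Temp^2, data = airquality)
length(coef(lm_quad))    # 3: intercept, Temp, I(Temp^2)
length(coef(lm_no_I))    # 2: the quadratic term was silently dropped

# Plot the observations and add the fitted quadratic curve
b <- coef(lm_quad)
plot(airquality$Temp, airquality$Ozone,
     xlab = "Temperature (degrees F)", ylab = "Ozone (ppb)")
curve(b[1] + b[2] * x + b[3] * x^2, add = TRUE, col = "red")
```

Using coef() for the curve avoids rounding or typing errors when copying the estimates from the summary output.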
The data does not seem to be an exact straight line, it seems to be kind of a quadratic curve. Again, the model fits better than the original model, which explains 48 percent. This model explains 54.4 percent of the variance in the data, so it explains more, and the quadratic term is very significant. So if you wrote this down in a publication, you would say that the ozone concentration is 305.5 minus 9.6 times the temperature plus 0.078 times the temperature to the power of two, because these are just the parameters that we get from the summary.

So let's plot this regression curve, because we can easily plot this as well now that we have a quadratic term. We just plot the ozone versus the temperature, so we plot the dots, and then we say: make a curve. And here we say the curve is 305.5, which is the intercept at zero, for every unit in x you go 9.6 units down, and for every squared unit of x you go up 0.078. And then here we see our linear regression line. You can see that this is not a straight line anymore, but this line fits the data pretty well. You actually see that it goes down a little bit in the beginning, because the minus 9.6 x term has a much bigger influence there than the quadratic term. But in the long run the quadratic term wins out over the linear minus 9.6 times x.

Good, so this is how you do quadratic regression in R, very similar to doing standard linear regression. The question now becomes: we made a lot of models, right? We made a model where we say that the temperature is the only thing influencing the ozone concentration. We made a model where we say the temperature and the wind have an influence. So which one is true?

None of them are true. But the one that is the best model is given to us by Occam's razor. Occam's razor is a statement which was originally made in Latin, and in English it translates to: it is futile to do with more things
that which can be done with fewer. And this is a basic rule in science. Science always prefers the simplest explanation that is consistent with the data available at a given time. So we should not use parameters unless those parameters significantly improve the fit of our predictions to our observations.

So how can we now have a formal approach to decide which model is the best? Well, model selection is the task of selecting a statistical model from a set of candidate models. In this case we have three or four candidate models that we made: temperature, temperature plus wind, temperature plus wind plus the interaction. We can now use the Akaike information criterion, also called AIC, which is a relative comparison of models. But you have to remember that the AIC, just like the BIC or the log likelihood, is not a statistical test. It is a guideline, like the pirates' guideline is a guideline, right?

What the AIC does is reward the goodness of fit, as assessed by the likelihood function, but it also includes a penalty that is an increasing function of the number of estimated parameters. So the more parameters you fit in your model, the more the model is penalized for having multiple parameters. Just like Occam's razor: it's better to have a very simple model which fits relatively well than a very complex model which fits just slightly better. In the end we want to have the best model, which is the simplest model which explains our observations.

The AIC is defined as two times k, where k is again the number of estimated parameters in the model, and from that we subtract two times the natural logarithm of the likelihood L. You can see that k here is positive, and I told you guys that the AIC is penalizing based on the number of estimated parameters. This works because when you look at an AIC number, a lower AIC number is better than a higher number.
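Written out, the definition is AIC = 2k - 2 ln(L). It can be checked by hand in R; the sketch below assumes the built-in airquality data set and uses illustrative model names (m1, m2, m3) for the temperature model, the interaction model and the quadratic model. Base R's logLik and AIC functions do the actual work.

```r
# Assumption: built-in airquality data; m1, m2, m3 are illustrative names.
m1 <- lm(Ozone ~ Temp, data = airquality)
m2 <- lm(Ozone ~ Temp + Wind + Temp:Wind, data = airquality)
m3 <- lm(Ozone ~ Temp + I(Temp^2), data = airquality)

# AIC by hand for the first model: 2k - 2 * log-likelihood
ll <- logLik(m1)
k  <- attr(ll, "df")                  # estimated parameters (R also counts the residual variance)
aic_manual <- 2 * k - 2 * as.numeric(ll)
all.equal(aic_manual, AIC(m1))        # matches the built-in AIC function

# Compare several candidate models at once; lower AIC is better
aic_table <- AIC(m1, m2, m3)
aic_table[order(aic_table$AIC), ]
```

Note that R counts the residual variance as an estimated parameter as well, so k is one higher than the number of regression coefficients. All three models are fitted on the same rows, which is a precondition for comparing their AIC values.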
So L is the maximum value of the likelihood function, so it's a goodness of fit estimate, and k is the number of estimated parameters in the model. The preferred model is the one with the minimum AIC value, not the maximum AIC value. And for a model to be better than another model, it needs to drop at least 10 AIC points. This is very much under discussion among statisticians. Some statisticians say: well, if it drops like five points, I already find that a significant improvement. Some people say: well, if it drops two units, I already find it a significant improvement. So this is very much up for debate. However, in my mind 10 units is the absolute minimum drop before I'm going to consider a model which is more complex over a model which is simpler.

In R we can just use the AIC function very easily. It takes a number of models and then it compares these models. It calculates the likelihood, it looks at the number of estimated parameters, and then it just calculates the AIC for you. So for example here, we have the model where we say ozone by temperature, then ozone by temperature plus wind plus the interaction, and then we have our last model, where we said that the ozone concentration is determined by the temperature plus the temperature squared.

If we then do the AIC of the three models that we looked at, we get an overview which looks like this. It tells us that the worst model is actually model one. The model that is slightly better is model number three. But the best model is model number two. And you can see that the drop from the worst model to the best model is around 32 points. So we are going to prefer model two even though it is more complex: it includes the temperature, the wind, and the interaction between the two. But it drops enough to be considered the winning model in our case. Good.

So in conclusion: all models that you make are wrong. And this statement, all models are wrong,
it's just that some of them are useful, is generally attributed to George Box. And with this I would like to thank you all for your attention. If there are any questions about linear models, standard linear models, quadratic regression or using interaction terms, then just throw them in chat. But that's what I wanted to tell you guys for today.

Next week we will talk about linear mixed models, so models where we have repeated observations of the same individual. And the week afterwards we will be talking extensively about generalized linear models, or generalized linear mixed models. This is for when the distribution of your input data is not, well, not Gaussian, when it is not a continuous variable. For example, when you have a binary outcome, someone dies or survives, or a Poisson variable, where you have count data like the number of bees on a flower. If you want to do a prediction for that, then you cannot use standard linear models.

Davide: I had to leave, great lectures as always, have the best weekend. Yeah, Davide, you too. Thanks for being here, thanks for being here until the end. I can understand people not spending the whole afternoon just watching the lecture. The weather is so beautiful that if I could, I would be outside near a puddle of water as well. But unfortunately, or fortunately, I can give you guys the lecture. So thank you for being here and I hope you guys learned something.

So, linear regression: one of the core algorithms in more or less all of predictive biology, but also in machine learning. When you think about artificial neurons in machine learning, they use linear regression internally to kind of find the best fit of the input towards the output.

Leonardo Fairy, thank you very much. Yeah, thank you guys for being here. I always like doing this lecture.
It's a little bit intensive, I think, because you're trying to cram so much into very little time. But thank you guys for being here. This course wouldn't be possible without you guys, right? If there were no one to ask questions, it would just be me doing the exact same thing as last year. So your questions are the thing that makes the lecture different from last year and makes it more interesting, I think.

All right, so with that, 20 minutes before time. So I didn't really have to speed up, but that's good. All right. See you next week, time to cook. Yeah, enjoy your dinner, Amisha, and already enjoy your weekend. We'll definitely talk over the weekend.

Good, then enjoy the rest of the day and I will see you guys next time. Thank you for being here, thank you for listening, and enjoy your pizza. Yeah, it'll probably be pizza again, because it's Thursday, and Thursday is pizza day. Okay, thank you guys and see you next time.