Okay. So I have a quick presentation about linear models, in case people aren't as familiar with them. We're looking at these two columns in our data: average glucose level and BMI. Both are columns of continuous data, meaning we have a bunch of numbers spanning a very wide range. In general, relationships are very hard to see if we're just visually scanning over this data. We could perhaps try to pick out that 70 is the lowest average glucose. No, wait, there's 102. It's very tedious and not very effective. What we want to do instead is visualize the data and then try to draw a line to see if there's a correlation between these two. So we're going to take our data, run a linear model, and then visualize it. When I say we're working with linear models, we're trying to see if there's a relationship between two variables. This can be a confusing topic because variables have a lot of names that mean the same thing. We might have a predictor variable and a response variable. For example, biomarkers and cancer are a very hot topic. Say we can find a level, such that if the level of insulin in your blood increases, that indicates you have cancer. It's a lot easier to measure your blood level than to directly find the cancer and measure its size, so we have an indirect measurement. In that situation, we would use the blood level of a certain protein as a predictor for the response, which is perhaps tumor size. It's a lot more invasive to measure tumor size, and a lot easier to measure our predictor variable. Our predictor variable is generally our X variable, and can also be referred to as the input or independent variable. Our response variable goes on the Y axis, and can also be called the output or dependent variable. In our linear models, we have observations.
So for the data that we're working with, we have observations: person number one had this measurement for glucose and this measurement for BMI; for observation number two, we have this level and this level. We have a bunch of observations, and everything is wrapped up in noise, both technical noise and biological noise. What we want to see when we plot our X variable against our Y variable is something that looks like this: the points scatter in a trend, and we can draw a regression line that models the relationship. Even though we don't have a dot exactly at this point, once I can quantify the trend, I can start predicting and say, well, if I find somebody with a plasma level of X, can I give them a prediction of how likely they are to have a large tumor? When we're modeling the relationship between our dependent and independent variables, we're not likely to have a one-to-one relationship, so we need some modifiers. First we have our slope coefficient. If we have a positive slope, that means as X increases, Y increases. If we have a negative slope, as X increases, Y decreases. And if we have no slope, that's what we saw with our data: it's noise, you have dots everywhere, and there's no good line you can draw to show a correlation between your X and your Y variable. The other thing you can modify is the Y intercept, so Y intercept one is higher than Y intercept two. This takes us back to our high school formula of Y = MX + B. There are different names and notations for this, but the fundamentals we want to walk away with are that we can modify the slope and we can modify the Y intercept. We also have an element of error. In most cases we try to pretend this doesn't exist, because it's hard to model something so stochastic, but as biologists working with real data, we have to keep in mind that there is going to be noise that the model itself can't capture.
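The fit-a-line-then-predict idea can be sketched in a few lines. The lecture's actual analysis is in R on the heart dataset, so this is a minimal Python/numpy stand-in with made-up BMI and glucose numbers, purely to show the mechanics of fitting Y = MX + B and predicting from it:

```python
import numpy as np

# Hypothetical data standing in for the lecture's heart dataset:
# BMI on the X axis, average glucose level on the Y axis.
bmi = np.array([22.0, 25.5, 28.1, 31.0, 33.4, 36.2])
glucose = np.array([85.0, 90.5, 95.2, 99.8, 104.1, 110.0])

# Fit the straight line Y = m*X + b (a degree-1 polynomial is exactly
# the simple linear model).
m, b = np.polyfit(bmi, glucose, 1)

# A positive slope means glucose rises as BMI rises.
assert m > 0

# Once we have the line, we can predict Y for an X we never observed.
predicted_glucose = m * 30.0 + b
```

The prediction step is the payoff the lecture describes: with the line quantified, a new person's BMI alone yields an estimated glucose level.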
So when we create our linear model, it's going to give us a whole lot of numbers. Don't panic, we'll go through each of these pieces together. The first part of the output is the formula: how is it looking for a relationship between these two things? We have data = heart, so we're using the heart dataset that we've been working with, and we have a bit of a function over here with our two column names, average glucose level and BMI. The way we might read this is that we're interpreting average glucose level as a function of BMI. We may be able to record our BMI more easily than we can determine our own glucose level, right? We can step on a scale, we can go and measure a height, but at home it's difficult to measure glucose level if you don't have the right technology. Next up is residuals. When we're drawing a line of best fit, we know that not all data points are going to land exactly on the line. We get as close as possible, and the residual is the difference between the observed and the expected. Here the expected values are on the line that we drew, and the observed value is our little data point. So here we have a positive residual, and if we have a data point way over there, that's a much bigger residual. When we're fitting our line of best fit to our data, we're trying to minimize the residuals across all our data points, so that we have a line that best models what we have. The residuals section is a summary of how good our line is: how far our observed values sit from our expected values. We're generally not going to touch this too much; it's just returning some facts to you, but if the residuals are very big, that's an indication that it's not a very good line. One thing to note: some residuals are positive and some are negative.
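The observed-versus-expected idea is easy to make concrete. A small sketch with hypothetical points and a hypothetical candidate line (slope 1.5, intercept 50, not the lecture's fitted values), again in Python/numpy since the original R session isn't shown:

```python
import numpy as np

# Hypothetical observed data points (e.g. glucose readings at given BMIs).
x = np.array([20.0, 25.0, 30.0, 35.0])
observed = np.array([82.0, 86.0, 97.0, 101.0])

# Expected values: where a candidate line Y = 1.5*X + 50 says the points
# should land.
expected = 1.5 * x + 50.0

# The residual is observed minus expected, one per data point.
residuals = observed - expected

# Positive residual: the dot sits above the line; negative: below it.
# A good line of best fit keeps these small across all points.
```

Fitting the line amounts to choosing the slope and intercept that make these residuals collectively as small as possible.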
If you just add them together, they'll negate each other, so you could have a situation where the residuals sum to zero and the line looks very good when it isn't. So what we do is square them, so that everything turns positive. A consequence is that the data points furthest away, once squared, have an even bigger influence, telling us we have an even worse model. What we care most about is the coefficients. When we're talking about Y = MX + B, the coefficients are this M and this B. We have our intercept value returned to us, estimated at 88, and the coefficient of BMI, which we'll put over here. We're working with a simple linear model, meaning there is one predictor and one output. But you can imagine that most diseases aren't this simple, they're complex, so you might want to be looking at height, or the number of meals, or perhaps the fat in your diet. We can build a more complex model that takes in multiple variables for a single prediction of the output, but it gets pretty complicated to balance each of those, so we're just going to look at a very basic simple model: one predictor, one output. The output format sets things up, though, so that we could have multiple predictors. So we slot in the intercept as 88, and our slope as 0.1. Notice that our slope is very small right now, close to zero, but it's positive. So we can show that for every increase in BMI, there's an increase in glucose level, because it's positive, however small the effect is. And notice that we have these triple stars over here next to our P value. Our P value is on the order of 10 to the power of negative 16, and the triple stars mean the P value is essentially zero, so the result is highly significant.
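Both ideas on this slide, squaring the residuals and the coefficients M and B, can be shown together. This is a sketch of ordinary least squares in closed form with hypothetical numbers (the lecture's real fit returned intercept ≈ 88 and slope ≈ 0.1; these invented points give different values):

```python
import numpy as np

# Hypothetical data: BMI (x) and average glucose level (y).
x = np.array([18.0, 24.0, 27.0, 33.0, 40.0])
y = np.array([90.0, 91.0, 92.0, 92.5, 94.0])

# Ordinary least squares in closed form: the slope is cov(x, y) / var(x),
# and the intercept makes the line pass through the point of means.
slope = np.cov(x, y, bias=True)[0, 1] / np.var(x)
intercept = y.mean() - slope * x.mean()

# Squaring the residuals turns them all positive so they can't cancel out,
# and it penalizes far-away points more heavily.
ss_res = np.sum((y - (slope * x + intercept)) ** 2)
```

These closed-form values are exactly what a least-squares fitter returns for one predictor; multiple-predictor models generalize the same idea with matrix algebra.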
So, keep in mind that this means the results we see are statistically significant, but whether this is meaningful for the biology you're looking at is for you to decide. If eating five kilograms of carrots increases your test mark by 2%, is it still meaningful for you to eat five kilograms of carrots every day? There is a link, a statistical link, but whether it's biologically relevant to you is another call. Remember, we are working with numbers at the end of the day. At the bottom, we have our multiple R squared. This is the percentage of variance in Y explained by X. Here, with 0.02, that means 2% of the variation in our average glucose is explained by BMI. So, not a lot. You don't need to use the adjusted R squared until you get into complex linear models where you need to correct for multiple inputs to a single output. So this is a summary slide. I like the rule of three, so when we're walking away from interpreting a linear model, there are three key things we want to take note of. First, the direction of the effect: whether it's positive (as X increases, Y increases), negative (as X increases, Y decreases), or nothing. For us, the direction of the effect is positive. Second, the size of the effect: although we have statistical significance, is 2% very meaningful for us? Is this going to change the way that we practice, or is this enough to introduce a new intervention? And third, the statistical significance itself. These are the three things I would summarize as important when we get the output of a linear model. So I think that is the end of this linear model section. I know I went through a lot. Do we have any questions about this? [Audience question about continuous variables.] You mean, like, there is no correlation between average glucose level for people shorter than six feet, but people taller than six feet show the effect? But then that's not a linear thing, so you wouldn't use a linear model.
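The multiple R squared the slide describes can be computed by hand from the residuals, which makes the "percentage of variance explained" reading concrete. A sketch with hypothetical observed and fitted values (not the lecture's actual numbers):

```python
import numpy as np

# Hypothetical observed glucose values and the values the fitted line
# predicts for the same people.
y = np.array([85.0, 90.0, 96.0, 99.0, 104.0])
predicted = np.array([86.0, 91.0, 94.0, 100.0, 103.0])

ss_res = np.sum((y - predicted) ** 2)   # residual sum of squares
ss_tot = np.sum((y - y.mean()) ** 2)    # total variation around the mean of Y

# R squared: the fraction of the variation in Y the line accounts for.
r_squared = 1 - ss_res / ss_tot

# Near 1: the line explains most of the variation.
# Near 0 (like the lecture's 0.02): the predictor explains very little.
```

This also shows why a model can be highly significant yet explain almost nothing: significance comes from the p-value on the slope, while R squared measures how much variance the line actually captures.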
It would have to be a different model. More complex models can fit things that aren't a straight line, but when you're working with a simple linear model, it is always going to be Y = MX + B. It's always going to be a straight line. If you have data that is not linear, that is okay, but you cannot use a linear model on it directly. It depends on the shape of the data: if you can transform a skewed dataset back toward a normal distribution, then you can use it. So if you apply a transformation and the data moves back into a normal distribution, that's okay. But otherwise there are other statistical models you can apply. Again, this isn't a stats course, so I'm not going too deeply into the assumptions behind these tests, but it sounds like you have a situation in which the assumptions of a linear model are not met, and therefore you cannot use a linear model. Sure, but then, yeah, no matter what you do, you can transform the data, you can subset the data, but then... yeah, it depends on your use case. I don't know if that would be the best way of handling it, but it would be a way of handling it.
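The transform-then-fit idea mentioned in the answer can be sketched quickly. With hypothetical data following a curved (roughly exponential) trend, a straight line fits poorly, but a log transform of Y straightens it so a linear model applies; the numbers here are invented to make the pattern obvious:

```python
import numpy as np

# Hypothetical curved data: y grows roughly like e**x, so y against x
# is not a straight line.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.7, 7.4, 20.1, 54.6, 148.4])

# Transform: log(y) against x IS roughly a straight line,
# log(y) = m*x + b, so the simple linear model becomes appropriate.
m, b = np.polyfit(x, np.log(y), 1)

# The slope on the log scale recovers the exponential growth rate
# (about 1 for this made-up data).
```

Whether such a transformation is appropriate depends on the data and on checking the model's assumptions afterwards, as the answer above cautions.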