Now that we've calculated our model, the pressing question is: is it any good, and can we use it for something useful? That is the question of residuals and the question of confidence limits that we're looking at. Let's speak briefly about residuals first. What is a residual in the first place? That's illustrated here. A residual is, in principle, the distance of an observed value — the actual data point — from the value where it should be if the model were perfect. And there are some assumptions that go into the residuals, into the distribution of the epsilon terms, that influence whether a meaningful model can be calculated in the first place. The error term should, in principle, have zero mean, because if it doesn't have zero mean, it will shift your fitted curve away from where it should be — and this will be perfectly and invisibly compensated in your parameters. There's an intercept parameter: if your error term pulls your data up, the intercept goes up, and you will never notice it. You'd think the data would give it away, but there's no real good way to detect that. The error term should also have constant variance, i.e. the variance should be the same for large and small values. This is the assumption we violated in our model, because it was biologically meaningful to do so — and note that in our case it wasn't a very dramatic violation; the parameters were recovered anyway. If the violation were large, it would essentially mean that we're weighting the deviations at the far end of our range more strongly than the deviations at the near end, because there's simply more variability there. So the regression line is going to be more influenced by the distribution of points at the far end than at the near end of the range. And in order to draw conclusions for prediction, the error terms should be normally distributed, because only if we know what kind of error to expect can we say whether a new point is going to lie close to or far from the line we've just calculated. So here are some examples. This is the ideal case: the error terms have zero mean and constant variance. This is the situation we actually modeled, with a smaller variance at one end and a larger variance at the other. It still has zero mean, so you're going to get good parameters, but the predictions are not going to be as good out here, where the variance is larger. And since the prediction is based on the averaged error terms, we might not notice this — we might be over-confident about our model's predictive properties. This third case is what you're actually looking for when you plot your independent variable against the residuals, because it says there is another relationship, some kind of systematic structure, that is not captured by our linear model. Our linear model cannot explain everything systematic that's going on in the data — and this tells us the model itself is incorrect, never mind the fit, the coefficient of correlation and all the other numbers.
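To make these diagnostics concrete, here is a minimal sketch — my own illustration, not the course code — that simulates data whose scatter grows along x and then plots residuals against fitted values; a funnel shape in that plot is the signature of the constant-variance violation just described.

    # A minimal sketch (not the course code): simulate data whose scatter
    # grows with x, then inspect the residuals against the fitted values.
    set.seed(112358)
    x <- runif(50, 1, 10)
    y <- 2 + 0.5 * x + rnorm(50, mean = 0, sd = 0.1 * x)  # sd grows with x
    fit <- lm(y ~ x)
    plot(fitted(fit), resid(fit),
         xlab = "fitted values", ylab = "residuals")
    abline(h = 0, lty = 2)  # a funnel shape here flags non-constant variance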
Just qualitatively, we see there's some reality — some physical, biological reality — that we're missing, because there is still some relationship in the data that remains in the residuals, i.e. after we've subtracted our best possible model. — So that means there must be some non-linear model that would explain the data better? — Exactly, and that's what I would try. It could be a quadratic relationship or an exponential relationship, to begin with, or a superposition of a non-linear curve with a linear model — simply an addition — and so on. I need to rethink: somehow my observations are violating the fundamental assumption of the model, namely that there is a linear relationship. Okay. Now, this is an interesting plot. It simply draws the residuals onto the regression line, and it shows you each deviation. It's kind of nice to look at — it may not be especially meaningful, but it's a pretty plot, and if you're trying to illustrate residuals, it's very useful. I've put it here because it shows some interesting R syntax that I'd actually like you to go through. So we'll take a little break from just listening and spacing out, and you can reproduce this on your R consoles. Go back to the wiki: Statistics, Module 6, Regression. Here is our data function. You don't actually need to source it and store it — that's how I would normally do it, to save variants — you can just paste it in here. So let's do that: paste, return. Now the function makeData is defined. In fact, if we key in the function name without parentheses, it is not executed; instead R gives us the definition of the function. That's what R does for all functions: if I key in lm without parentheses, I get the complete source code of everything that's behind generating and fitting these linear models — which is not so much, if you think about it, just a hundred lines of code or so. Anyway, makeData is now defined. The next step is to use it to make 40 data points, and we can list these 40 data points: random heights and weights, with height in column one. This person is two meters tall and weighs 78 kilos — that's a bit skinny, two meters and 78 kilos. And each of you will have different values, because we didn't use the set.seed() function this time. — Could you use set.seed() here? — Yes, maybe just put the set.seed() call into the function itself. What we then did was plot this data matrix, with height on the x-axis — the label says centimeters, which is actually wrong, it's meters — and weight in kilograms on the y-axis. I see nothing, because I don't see my quartz window — there it is. That's what it looks like. Everybody got that? The next step is to evaluate this with a linear regression analysis. This is the call: the model formula and the data. The fit is not as good as the one we had last time — but remember, when we looked at the summary, we saw that there is some variability in there. And we plot the regression line — I think I need to redo the plot — without actually typing in the parameters: abline() of the linear model. Oh, and I forgot something here, which I'm sure you've noticed.
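The actual makeData() function is on the course wiki; as a stand-in, here is a hedged reconstruction of the whole workflow, assuming heights drawn uniformly and weights following a linear trend plus noise — the constants are mine, purely for illustration.

    # Hypothetical stand-in for the course's makeData() function — the
    # real one is on the wiki; the constants here are my assumptions.
    makeData <- function(n = 40) {
      height <- runif(n, 1.5, 2.1)                  # heights in meters
      weight <- -80 + 85 * height + rnorm(n, 0, 8)  # linear trend plus noise
      cbind(height, weight)
    }

    data <- makeData(40)
    plot(data, xlab = "height (m)", ylab = "weight (kg)")

    model <- lm(data[ , 2] ~ data[ , 1])  # regress weight on height
    summary(model)
    abline(model, col = "red")  # the regression line, without typing the
                                # coefficients in by hand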
There's one character missing, and R asks me to complete it with the missing characters: a closing square bracket, two closing parentheses. There we go. Okay. And now this is the regression line that we get in this case. Different random values, different distribution, different solution for the best description of the data. Now, let's calculate the residuals and the idealized values, and we just put those into vectors. The residuals come from the resid() function applied to the linear model. The idealized values come from the fitted() function: the data produced by the model, where each x value is taken exactly through the parameters — i.e. all the points generated in this fitted vector lie perfectly on the line. I put them into two vectors named res and fit, and that's that. Nothing visible happens, because all this does is assign to a vector, but I can look at a vector and get, say, its first ten values. Now, if you look at these values — the idealized, fitted values — you notice that they are not sequential; they jump around, eighty-something, seventy, sixty, back up to eighty, and so on. That's because these are the values that correspond to the x values in our data matrix, and those are not ordered. And that's kind of important, because if we start plotting the residuals or the fit as lines, and they're not ordered, the lines would jump left and right and back again across the entire plot. The plot would become very messy. So what we should do is order — sort — our data on the x-axis before we proceed. It's slightly intricate to understand how you achieve that, but it's actually beautifully general and simple. Before we start, let me just plot this quickly: fitted values against residuals, as we'd expect — the ideal values on one axis, and the residuals showing the magnitude of the actual observed deviations. As you see, there's a trend: the residuals get larger as you go to larger values, which is simply the way we generated the data. Raphael talked about QQ plots earlier, and if you look at the QQ plot of these residuals, it actually looks pretty good. So if they were drawn from a normal distribution — and they don't violate that assumption in any systematic way — that's good.
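In code, these two extractor functions and the diagnostic plots look like this — a minimal sketch, reusing the model object fitted above:

    # The two extractor functions, applied to the model object from above:
    res <- resid(model)   # the residuals
    fit <- fitted(model)  # the idealized values, lying exactly on the line

    head(fit, 10)         # not sequential: they follow the unsorted x values

    plot(fit, res, xlab = "fitted values", ylab = "residuals")
    abline(h = 0, lty = 2)

    qqnorm(res)           # residual quantiles against normal quantiles
    qqline(res)           # points near this line: no systematic violation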
Now, let's look at the prediction confidence limits. In principle, prediction confidence is easily calculated: with the predict() function and its interval parameter — "prediction" or "confidence" — you can get different ideas of where a new point lies with respect to the old data. But, as I said, you have to order the data first; otherwise the lines, when you plot them, will be connected all over the plot. And I need to put in a little intermezzo here to show you how that is done. Maybe I'll just type this in — bear with me. If we sort a vector, it's obvious what we're doing: we arrange the elements from smallest to largest. If we sort a matrix, it's not obvious, because that depends on which column you're sorting on. So in order to sort a matrix, you have to specify which column you're sorting on, and the way this is achieved in R is with a function similar to sort() that spits out the order of the sorted elements — order() — which you then apply to the matrix row by row. So we have our data matrix: column one is our x-axis, column two is our y-axis. If we make a vector calculated as the order of the first column, it gives me the indices in the right sorted order. So if I make that vector, this is what I get: it says the smallest value is in row 26, the second smallest in row 22, the third smallest in row 24, the fourth smallest in row 32, and so on — and indeed the values come out as 1.51, 1.56, 1.59 and so on. How is this vector useful? It gives me the order. Well, I can address my matrix cell by cell: I can say, give me the value that's in row 2, like that; or I can look at row 26, or row 22, and so on. The point is, since I can address it point by point, if instead of giving it explicit indices I feed it this entire vector, I get the entire matrix back in sorted fashion — because these are the indices, in the order that I generated by ordering on column 1. To do the same thing on y, I would just put order() on column 2 into this vector, plug that in, and get the result. Simply typing o in here now gives me the matrix sorted on the x values. Trivially, we can also sort on the y values: I want the rows of my data matrix in the order that sorts them on the second column — that's one expression. And you get the light ones first; and as you go down, notice that the first column is still sort of all over the place, constrained only by the relationship in the data — the order is not perfect. Once you understand this, it's really powerful, general and versatile. You use the order() function to generate a list of indices — a vector of indices — and you use that vector of indices to address your data matrix row by row, to get the data in a different sort order. So I do that here and store the result in data2, which is now sorted on the x values. And now, if we plot anything about residuals or about confidence limits, we go nicely and monotonically from left to right through the entire plot, and the lines are not all over the place. So I calculate the prediction values, and I add the lines from the pc matrix and the pp matrix to the plot — so again, this is simply the plot, and let's add these lines here and these lines here. Okay: the limits of confidence and the limits of prediction. The outer boundaries basically describe the prediction interval. If you have a new data point that was not in the set but comes from the same distribution, it is going to lie within these boundaries. This is basically telling you how good your model will be at predicting: it predicts that new values will lie somewhere within these bounds. If you think this is pretty broad around the linear model — yes, it is, because there's a lot of variation in the data, and it's a nice, honest plot that tells you exactly where the confidence limits of your model are. What you can also do is look at individual points and identify them as outliers relative to these boundaries. The "p" and the "c" stand for prediction and confidence limits — where's my R window? — and all the parameters are nicely explained in the manual: here, ?predict.lm.
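Putting the intermezzo together in code — a sketch, assuming data is the two-column height/weight matrix from above:

    # The intermezzo in code: sort the matrix on column 1 with order(),
    # then compute and draw the confidence and prediction bands.
    o <- order(data[ , 1])  # row indices, from smallest to largest height
    data2 <- data[o, ]      # the whole matrix, re-ordered row by row

    model2 <- lm(data2[ , 2] ~ data2[ , 1])
    pc <- predict(model2, interval = "confidence")  # where the line lies
    pp <- predict(model2, interval = "prediction")  # where new points lie
    # (R warns that predicting on the fitting data is circular; that's
    #  fine for display purposes.)

    plot(data2, xlab = "height (m)", ylab = "weight (kg)")
    abline(model2)
    lines(data2[ , 1], pc[ , "lwr"], lty = 2)  # inner bounds: confidence
    lines(data2[ , 1], pc[ , "upr"], lty = 2)
    lines(data2[ , 1], pp[ , "lwr"], lty = 3)  # outer bounds: prediction
    lines(data2[ , 1], pp[ , "upr"], lty = 3)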
Okay — anyway, what's important to know from the documentation is that the inner bounds basically refer to your data as it is, while the outer bounds refer to predictions of data that you haven't observed yet. So it's a sort of — yes: we've gone out into the wild, we observe a new point, and we ask whether it differs from its expectation. Something like that. And this should coincide with your t-test: significant outliers should also lie outside these bounds if you're plotting control against experiment — because if there's no effect, if they're all the same, the points should all lie on a line, and only then start deviating. Okay. Now, just a very brief note, because in the end it's more of the same. Multiple regression assumes a model like this: y is a constant intercept, plus some scaling factor operating on one set of x coordinates, plus another scaling factor operating on a second set of x coordinates, plus another scaling factor operating on a third — y = b0 + b1*x1 + b2*x2 + b3*x3 — all of these adding up to the one observed result. Just a note here: you set this up with more parameters, and to get good results you usually need a lot of data points, because there are additional model parameters involved. But you can, in principle, apply similar sum-of-squared-errors techniques to find the same kind of minima as in the linear models. Essentially it works in a similar way, except that this time you're combining several different effects. This is the kind of modeling you would use, for instance, if you have caloric intake, height and weight, and you're trying to establish the joint relationship of caloric intake and height on the weight of the individual, and tease apart the relative contributions each makes to the weight. Now, sometimes you have a priori knowledge about the functional form from first principles, and you know it's not a linear model to begin with. In some cases you can transform your model: simply take all the y values, transform them with a function, obtain a linear model from that, solve the linear model, and transform the parameters back. But sometimes that transformation isn't obvious or possible, and you're then in the position of needing to do so-called non-linear regression analysis. This is usually a situation where a closed-form, analytical minimization is not possible. For linear regression analysis we can write down a formula, plug in the observed data points, and come up with the solution directly. Non-linear minimization is almost always a numeric procedure: you come up with some start values and try out how well they fit; you calculate the sum-of-squares estimate; you tweak the parameters; you recalculate the sum-of-squares estimate; you accept the new parameters if they've improved your model, and you try something else if they haven't. In principle, you imagine a surface spanned by the different parameters, you look for the gradient on that surface from a certain point, and you try to slide down that gradient until you end up in a minimum of the error for the parameters you're optimizing. The problem is, there's no guarantee that this minimum is not merely a local minimum. Only under some circumstances can you guarantee that you have found a global minimum.
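A minimal multiple-regression sketch in R for the calories/height/weight example — the variables and numbers here are invented for illustration, not the course's data:

    # Simulated example: weight depends jointly on caloric intake and height.
    set.seed(42)
    n        <- 100
    calories <- runif(n, 1500, 3500)  # kcal per day
    height   <- runif(n, 1.5, 2.0)    # meters
    weight   <- -60 + 0.01 * calories + 60 * height + rnorm(n, 0, 5)

    fit <- lm(weight ~ calories + height)  # y = b0 + b1*x1 + b2*x2 + error
    summary(fit)  # tease apart the relative contribution of each predictor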
And sometimes, when you minimize from your starting parameters, you can get stuck — especially if your functions are complicated, if they have poles somewhere, i.e. regions where they go to infinity; and it's especially difficult if the parameters are not independent, i.e. if a change in one parameter can be compensated by a change in another. If they're not independent, you can choose them in whatever way you want and your analysis will never converge. So there's a certain amount of trial and error that goes into this. One part of the trial and error is choosing good starting parameters, and in order to choose good starting parameters it's often useful to simulate your model, work with it a little, and try to establish what your parameters — and changes to your parameters — actually do. So here's an example of that. I'm considering a function called the logistic function, because the logistic function is useful for many other things. Originally, the logistic function was formulated as a way to characterize population growth. Consider a population with N members — say, gerbils in a cage; a large cage, in large numbers, a couple of thousand. What gerbils do, of course, is multiply, so the population grows, and this term characterizes the rate of growth. It says that the difference in the number of gerbils between the current time step and the next — N(i+1) − N(i) — is r · N(i) · (1 − N(i)/K): the number of gerbils added to the population from one time step to the next. Here r is simply the reproductive rate of the gerbils — I don't know what the gestation time of gerbils is; it's been ages since I owned one as a kid, maybe six weeks for them to multiply. — About two days from when you bring them home. — Two days from when you bring them home, okay; you can calculate r from that. Under ideal conditions, the population at time i would grow, in principle, by r · N(i). But there's a limiting term here, because at some point — the food runs out, the cage gets too small, they get stressed and heated, parents eating their young and all of that. So there's a limit, and in population statistics it's called the carrying capacity, K: as much as the system will support. When your population reaches that carrying capacity, N/K goes to one; when N/K goes to one, (1 − N/K) goes to zero, and the growth term goes to zero: there is no more growth, and the population is at equilibrium. No more gerbils are added — or rather, equilibrium doesn't mean no more gerbils are born; it means they die as quickly as new ones arrive. So this is the discrete expression. You can also write it as a continuous expression: dN/dt = r · N · (1 − N/K) is the differential equation. And this is a rate — it doesn't yet tell you anything about your population. To get the population you solve the equation — Michelle's machine has a different font set, never mind, this is what counts — and this, in principle, is the functional form of the population that arises from it: N(t) = K / (1 + (K/N0 − 1) · e^(−rt)).
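A minimal sketch of this curve in R, with assumed values for r, K and the starting population N0:

    # Plotting the logistic growth law; r, K and N0 are assumed values.
    K  <- 2000  # carrying capacity
    N0 <- 10    # starting population
    r  <- 0.15  # reproductive rate

    curve(K / (1 + (K / N0 - 1) * exp(-r * x)),
          from = 0, to = 100,
          xlab = "time", ylab = "population size N(t)")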
So, writing this in R gives you this nice kind of curve: initially the gerbils multiply very, very fast; then, as they approach the carrying capacity, the growth levels off, until the growth of the population basically peters out — you have a lot of them, but you're not getting any more. Now, we can use something like that, and it does get used like that: it looks like a classic dose-response curve. We're not just interested in gerbils in a cage. We're interested in things like: if somebody has been smoking all their life, what's the probability that they get lung cancer at a certain age? If they have high cholesterol, what's the probability that they'll have a heart attack after a certain number of years? We have dose-response curves of this shape all over our data. So we can use it for a simulation model, and this is the simple simulation model we applied here: the logistic function, for our disease, as a function of age. Again, it's basically the same procedure as when we generated our height and weight distributions: we start with an empty vector and we calculate random numbers distributed according to this expression. Now, this is the core, and I need to describe it a little, because what I'm going to show you is a completely general technique for generating arbitrary probability distributions. Let's first look at the expression I've used to generate this curve: 1/(1 + e^(−z)), with z scaled over the range from −15 to 15. This is simply the formulation of the logistic function we had before: its value is 1/(1 + e^(−z)), where z is some quantity we vary — it could be age, it could be the dose of a medication. Now, if we want random numbers that are distributed like that logistic curve, the easiest thing is if R has that built in — in fact, that's what you do when you use runif() for uniform deviates or rnorm() for normal deviates. But if R doesn't have it — I don't even know offhand whether there's a logistic random deviate function — we can use the code I'm showing you to generate such random deviates in a completely general way. This is the core of the function; if you plug in a different function here, you get different random deviates. The idea is this: first, you choose a random number in this interval on the x-axis — any random number. Once you have it, you calculate a second random number on this vertical interval. If the second random number falls below the curve, we accept the first one; if it falls above the curve, we reject it. Imagine we're at the inflection point: there, candidates should be accepted with 50% probability — and that's exactly what we get, because there's a 50% chance of falling below the curve and a 50% chance of falling above it. If we're somewhere down here, where the curve is low, we accept candidates with only a small probability: only rarely does one go into our distribution, and frequently they're rejected — and at the top, exactly the opposite. So basically, by throwing random points onto this plot and accepting only the instances that fall beneath the curve, we're doing something like a numerical integration, because in the end the number of points we accept depends on the area under the curve.
The smaller the area, the fewer points we accept; the larger the area, the more we accept. So calculating a random number this way is equivalent to numerically integrating over the probability density function, and we can use it to generate numbers that are distributed in exactly this way. That's what we're doing here. First, a uniform draw — one sample from 1 to 100 — which gives us a real-valued number, and we take the floor of it: we drop everything behind the decimal point and keep the integer part. That gives us a candidate age — basically a random point along the x-axis. We then calculate a second value, s: a uniform draw with no parameters specified, therefore by default between 0 and 1 — where a random point would fall on the y-axis. And then we test: if s is smaller than the age taken through this transformation — the logistic function evaluated at that point — we append the age to our vector, i.e. we accept the value. We had this wrong in the notes: we have to update the counter, otherwise — this is a while loop; it tests whether we already have n estimates, and if the counter never gets increased it just runs away and you have to hit the escape key. So this is the update. We add to the vector x, element after element, ages that are distributed according to this logistic function. And when we run this — source it or paste it — we do it, say, 10,000 times. You don't want to look at the entire vector then, because it's large; you use the head() function on the vector, which gives the first elements, 20 for example, or the tail() function for the last 20 — heads and tails, and no, that's not the statistics meaning, even though it's head and tail. — Whatever the name of the function is, you're right. — Okay. And this is the distribution of ages at which, under this model, our proband would have become ill. So we have 10,000 ages, and now we're interested in how many of each age we actually have. We could calculate a histogram for that: just put the vector into hist() and you get a histogram of the distribution. But there's also tabulate(): a function similar to a histogram, except that every single integer value gets its own bin. Since my ages run from 1 to 99, I get 99 possible values, and the resulting vector holds the frequency of observing each particular age. And this is the distribution: my simulated data — if you treat it as a general non-linear function, in this case of one single risk factor, applied to a proband. And again the question is: well, this is nice, we have something that speaks to us, but can we use our tools to recover the parameters? The parameters were: age minus 50, which takes the function — which is symmetric around zero — and shifts it so that the inflection point lies at 50 years; and a scaling factor, because the plain logistic function goes from near zero to near one between about −6 and +6, so scaling by 0.1 stretches that over roughly our age range. Those are the scaling factors we put into the model — can we get them back with the non-linear modeling technique?
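Before the fit, here is the simulation itself, reconstructed as runnable code — a sketch that follows the description above; the runaway-counter bug in the notes is sidestepped by testing the vector length directly:

    # Rejection sampling: draw ages whose probability of disease follows a
    # logistic dose-response curve with inflection at 50 and scale 0.1.
    n <- 10000
    x <- numeric(0)                 # start with an empty vector
    while (length(x) < n) {         # loop until we have n accepted ages
      age <- floor(runif(1, 1, 100))  # candidate age, integer 1..99
      s   <- runif(1)                 # uniform height in [0, 1]
      if (s < 1 / (1 + exp(-0.1 * (age - 50)))) {
        x <- c(x, age)                # below the curve: accept
      }                               # above the curve: reject
    }

    head(x, 20)                      # the first 20 accepted ages
    counts <- tabulate(x, nbins = 99)  # one bin per integer age
    plot(counts, xlab = "age", ylab = "frequency")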
Non-linear least squares fitting in R is a generalized version of linear least squares fitting. The function is nls(), for non-linear least squares, and you have to supply a formula — the formula you actually fit on — plus some data, and some starting parameters: your initial guess at what the parameters should be. For our example, the formula can be written in one line: fz is a function of the x value, of a scale s, of a midpoint m, and of a parameter b that generates the slope — if b is large you get a very steep response; if b is small you get a shallow dose-response. This is the same formula I used before, with different parameter names: the formula that simulated my data to begin with, plus the random noise we applied. Once this is defined, we can try some reasonable starting parameters and say: please draw me a curve of the formula fz — and the curve() function is generally useful whenever you want to overlay model curves on data — specified with some value for s, a value for the midpoint, and a value for b; we know the midpoint should be somewhere around 50. This is the curve we get: it's sort of similar to our data — of course not perfect, not identical — but as long as it approximately models the data, the fit is going to converge. If any of these parameters has the wrong sign, it would have to pass through zero and invert during the minimization, which is numerically problematic; the same holds if they're off by many orders of magnitude. Actually, it's interesting to see from how wide a range of starting parameters it still converges. So this is approximate. Then we save our counts per age — the tabulation of our x vector — as the y values, we generate simply the sequence 1 to 99 as the x values, and we invoke nls(): the counts are modeled by the function fz with the arguments age, s, m and b, with starting parameters of 180 for s and rough guesses for the midpoint and slope. This generates the fit object, which holds the parameters of the non-linear regression that we can then use. We run this, and after it converges — which it does, after seven iterations, to a very low tolerance — we find that the 200 is well reproduced, the 50 is well reproduced, and the 0.1 we had in our original model is also well reproduced. Then we can take the curve() function again, plug in the fitted parameters, and re-plot onto our graph to see how well this explains the data: the red line was our first guess at the parameters; the blue line is the non-linear least squares fit of a function of this form — the form from which we generated the data points in the first place. So, if you've ever had to fit protein folding stability curves or Michaelis-Menten kinetics in some other package: I find doing this in R exceedingly simple, straightforward and robust, and it gives very good results. It's a very flexible way to specify the formula and then just plug in the parameters. But to make this work, you have to have some idea of what shape your data has — some idea of the functional form of the model that's beneath your data.
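A sketch of the whole fit in code, continuing from the counts tabulated above — fz and the start values follow the lecture's description; the exact numbers on the slide may have differed:

    # The model template: a scaled logistic with amplitude s, midpoint m,
    # slope b.
    fz <- function(x, s, m, b) s / (1 + exp(-b * (x - m)))

    age <- 1:99   # x values, one per integer age; counts are the y values

    # First guess, overlaid in red, to check it roughly resembles the data:
    plot(age, counts)
    curve(fz(x, s = 180, m = 40, b = 0.3), add = TRUE, col = "red")

    # The non-linear least squares fit itself:
    fit <- nls(counts ~ fz(age, s, m, b),
               start = list(s = 180, m = 40, b = 0.3))
    summary(fit)  # should recover roughly 200, 50 and 0.1

    # Overlay the converged model in blue:
    co <- coef(fit)
    curve(fz(x, co["s"], co["m"], co["b"]), add = TRUE, col = "blue")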
If you wonder about the syntax, a nice simple exercise: simply write y = a*x + b into the formula we've had here and see how that can be fit. You don't need to use R's linear model procedure — you can use the non-linear least squares fit with a linear function, and that should give you the same result. But you need some idea of what the functional form is — and let me not trip over that important point: you basically need to have some idea of your functional form. I'm not aware of programs — I'm not saying they don't exist, I'm just not aware of them — that derive functional forms from raw data; it's exceedingly hard to come up with a program that does that. It's not impossible: I think six or eight weeks ago there was a Nature paper where researchers built a computational inference machine that could take physical data and deduce the functional form of the underlying guiding principles from nothing but the data itself. What they applied it to was a chaotic system — basically a coupled pendulum: a pendulum with a second pendulum attached to it will undergo chaotic motion. They fed their system a time series, and the system actually came up with the correct functional description of the data, even though there was noise — on the fly deriving Newton's laws of motion and the associated conservation laws from it. But I'm not aware that this is available for everyday, everybody's use. It would be beautiful if you could just feed your data to a black box and the black box tells you what it is — that's not in general use right now, so you're not all out of a job right away: there's some biology inside the black box, and thinking is still required. So yes, for practical purposes you have to know, or guess, the functional form: you have to have some idea of what plausible functional forms exist, and choose the one that fits your data with the smallest possible number of parameters. As with clustering, it's a trade-off: you can fit almost anything well, by the coefficient of correlation, by adding a large number of parameters — but a large number of parameters will widen your confidence limits; the fewer the parameters, the tighter your confidence limits are going to be. Okay. Now, this really had nothing much to do with logistic regression yet, because I just used a logistic function as an example to fit with non-linear least squares. Logistic regression proper — and much the same extractor functions apply to it as to the linear model; see the documentation — is for data with a binomial outcome: dead or alive, infected and healthy, cancer and cancer-free. Such outcomes can be modeled by linear combinations inside the logistic function. The starting parameters are an educated guess: in this case, I reasoned that since I know approximately what a plain logistic function looks like —
it goes to 0, or close to 0, at −6, and close to 1 at +6 — you just try to scale it in between. It's a linear scaling; you try some reasonable parameters. So it's basically an educated guess, and then I draw my educated guess over the data with the curve() function and see whether it somehow vaguely resembles my data, or whether it shoots off into space. When it vaguely resembles the data, I can be pretty confident that the fit will converge on the correct solution. So the choice is basically an educated guess, based on your knowledge of the mathematics of the function you're fitting. Okay. So we have possible outcomes — dead and alive, infected and healthy, cancer and cancer-free, and so on — and we can model them with something called logistic regression. If we assume a set of risk factors that each contribute to the logistic function — so the functional form is not just e^(−z) with one term, but z = β0 + β1·x1 + β2·x2 + β3·x3 and so on, all contributing — then we can look at data, model what these individual risk factors are, and quantify them. So this is the formula, and we can rearrange it — maybe I can show this — by multiplying by this term here and dividing by that term. Do you remember how to do that? It comes out as P/(1 − P) = e^z, and in order to get z out of the exponent into something we can actually work with, we take the log: log(P/(1 − P)) = β0 + β1·x1 + β2·x2 + … And that's kind of nice, because this is now a linear function of the individual components. Once we have it in that form, it's exactly the same thing as multiple linear regression, and we can apply multiple linear regression to isolate and analyze the individual factors. I'm not going to go through that step by step right now. There are abundant examples on the web that you can easily find, with sets of data points relating cardiac risk factors, cholesterol levels, age and so on, modeled in exactly this way — there's R code for it if you google, and I think it's also explicitly spelled out in Dalgaard's book, Introductory Statistics with R. That's good homework. What I hope you can take home, however, is some appreciation of how this works in the first place: that the individual terms represent a summation over risk factors inside a logistic function, and that this superposition of risk factors can be modeled and solved within R.
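R's glm() function performs exactly this logit-transformed fit; here is a minimal sketch on simulated data — the risk factors and coefficients are invented for illustration, not the course's example:

    # Simulate a binomial outcome driven by two risk factors via a logistic:
    set.seed(1)
    n    <- 500
    age  <- runif(n, 20, 80)
    chol <- runif(n, 150, 300)              # cholesterol level
    z    <- -10 + 0.08 * age + 0.02 * chol  # linear predictor
    p    <- 1 / (1 + exp(-z))               # logistic transformation
    ill  <- rbinom(n, size = 1, prob = p)   # 0/1 outcome

    # Logistic regression: glm() with a binomial family uses the logit
    # link, i.e. it fits log(p / (1 - p)) = b0 + b1*age + b2*chol.
    fit <- glm(ill ~ age + chol, family = binomial)
    summary(fit)  # should approximately recover -10, 0.08 and 0.02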
In summary: regression analysis is a statistical tool for modeling relationships. Whereas most of what we heard about previously — exploratory data analysis, PCA, clustering — had to do with looking at relationships within the data, describing the data with statistics, regression analysis doesn't just look within the data: it compares the data to an external, statistical model. Modeling the relationship allows parameter estimation, to see which specific kind of model the data represents; hypothesis testing — linear model or not; and use of the model for prediction: what's the likelihood that somebody will get a heart attack if they're 70 years old and have been smoking all their life? It's a powerful framework that generalizes readily, but in order to apply it well you need to be familiar with the data, and I cannot over-emphasize how important it is to simulate: try it with different variations, different assumptions, generate synthetic data. Sometimes people call this toy data — I think that's completely the wrong term. This is not toy data; this is the data you use as your positive control. This is the data that reflects — in a little computer model, which is easy to set up in R — your ideas, your knowledge of the data; you can then create and analyze it and see whether your analytic routines retrieve what you've put into it. Check that, play around with it, and check the model assumptions carefully. So where do you go from here? Again, I'd emphasize some more reading, and trying the examples out. I think the examples we've gone over this afternoon are easily generalizable; they can be templated to different applications once you play around with them. If you don't actually get your hands on this — get your feet wet with it and start playing around — you're going to forget it very quickly. I always think that what we can do in a workshop like this is not actually teach you a whole lot of how to do things, but break down some of the activation barriers that inhibit you from ever trying something out. First of all, you've all at least copied and pasted that code into your windows, so I hope the activation barriers have come down somewhat and you can start actually trying it yourself. But no amount of copying and pasting will help you as much as closing that other window and trying to type it yourself: generate a few random deviates, compute statistics on them, and become familiar. I can't repeat this enough: simulate your own data. And especially if you look for simulations of logistic regression, you'll find a couple of examples on the internet that simply use the built-in probability functions to generate something that you then take apart again — that's not really what I mean by simulating. When I speak about simulating, I mean simulating from first principles: taking observations, applying probabilities to the observations, and storing the result as data — really running a computer experiment, and not just drawing from a distribution whose parameters you already know. Most importantly, have fun with it. I think it's huge fun to have this power over your data, to start manipulating it and looking at it in different and several ways, and I hope that's something you can take home from this workshop — which we're very glad you attended. Before you all go away, I need to pass my stick on to Michelle, because she has the closing. So: thanks from us for today, and thanks to all of you for sticking it out, because I know it's a lot of information to take in and absorb. But that's why we record this. It will take a few weeks, because Raphael actually took his entire file of voice recordings with him, so I don't have it — but within two weeks, check back on the bioinformatics.ca website for the voice-over lecture material, so you can listen to it again. Before you leave, a couple of things on the wiki: in the books section, the picture that we took yesterday is up, and if you scroll down a little bit more, there is a link to the survey. I know — another
survey, fill it out, blah blah blah. But at the CBW we actually take it very seriously: if you compare the course content for statistics from last year with the content this year, it has been almost completely revamped based on the feedback. We take the feedback very seriously, and we have an annual meeting to discuss what worked and what didn't work, and what the student group needs for the work they're doing in research. So take your time to fill it out. And Francis is going to hand out certificates for everybody who's still here — the tenacity certificate. The workshop is a bit light on the individual topics, but it's a very, very good integrated introduction if you haven't ever done any of them before. Finally, on the advertising: we announce the workshop in a number of ways, but in the end the best way is you telling your friends and them telling their friends. So if you tell your friends and your colleagues, your lab mates, your mentors, that's how people become interested in registering, and that ensures that we can make it happen again. As you know, this one was all filled up, and there were other people who were not able to come to this workshop. So thank you all very much — you're very, very welcome.