This is the last video in the seminar series on linear models, in which I discuss these four fundamental model types. I call them fundamental, once again, to remind you that if you understand these four, you can go on to more complex models. In this one we're going to change things up a little, because things are quite different from linear regression, analysis of variance, and analysis of covariance, so let me show you why.

I've opened a Jupyter notebook inside of Visual Studio Code. You can see that I'm running Python 3.9.10, and these are the packages that we're going to use. As always, we import pandas so that we can create a DataFrame, we import the scipy stats module, and we import numpy and patsy. The only namespace abbreviation that I'm going to use is sm, for statsmodels.api. And then something brand new: from statsmodels.api I import Logit. It's actually a class that gets instantiated, and we're going to fit our data with it. So let's run that; later on we'll see how it works. Here's our old friend, the Plotly plotting package, and nothing you haven't seen before.

I think it's quite important to remember that this is the fourth in a series, and you really should watch video tutorials one, two, and three (linear regression, analysis of variance, and analysis of covariance) so that you understand what is going on here. There'll be a link in the description down below to those three videos. You can read about them on the website, or, if you're watching on a computer, I'll put a little card right up there in the left-hand corner.

So here we are: we are busy with binary logistic regression. It is a regression technique, and in our familiar old table you see it in italics right at the bottom: binary logistic regression. Instead of an interval-type, numerical variable as our dependent variable, we're going to have a binary categorical variable: on or off, zero or one, yes or no, any categorical variable that has two levels, or two classes. As far as our independent variables are concerned, they can be numerical or categorical; it doesn't matter. We're going to try to understand the relationship between our independent variables and these two binary values, and we have to figure out a way to express that relationship.

Now it might become a little difficult. I could encode the two levels of my dependent variable with any numbers: three and forty-two, seven and eight, zero and one, one and two. So we had better choose carefully, such that the encoding makes sense, and the way we do it is what I've written down here: we have two levels, which we sometimes also refer to as classes of the dependent variable, and we encode them with a zero and a one. There are two of them, and you have to decide which one you encode with zero and which one with one. That is a very important decision, and you have to make it based on your research question.
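For reference, here is a sketch of that import cell as narrated; the exact plotly import line is my assumption, since the video only says the Plotly package is loaded:

```python
import pandas
import numpy
from scipy import stats            # the scipy stats module
import patsy
import statsmodels.api as sm       # the only namespace abbreviation used
from statsmodels.api import Logit  # the class we'll fit our data with
import plotly.express as px        # assumption: the video doesn't show which plotly interface is used
```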
We sometimes refer to the zero as the failure and the one as the success. You have to choose which of the two values of this binary categorical dependent variable is the failure and which is the success, and it has nothing to do with the actual definitions of the English words failure and success. The success is simply the outcome we are after: the one whose relationship with the independent variables, and the extent of that relationship, we are trying to examine. It might very well be that the success is death, or recurrence of disease. So, as I said, it has nothing to do with the definition of the word success; it is just the one of the two values that we are interested in examining. We can choose either of the two levels, or classes, as the success, and we encode it with a one. As I write there, this should be our class of interest.

Once we've done that, with zero for failure and one for success, we can view these encoded values as probabilities. If we have a subject in our sample with a failure outcome for the binary dependent variable, we code that with the value zero, and we can think of it as saying that this specific subject had a zero percent probability of being in the success, or one, class. Whatever the word was that served as the value of the dependent variable, if it was the failure, that subject had a zero percent probability of being in the success class. That should be very clear. And a subject in the sample with a success, or one, value for the dependent variable had a hundred percent probability of being in the success class, or level. So now we can build a model that uses the independent variables to estimate a probability of being a success, a one, because now we have this very nice interval from zero to one and we express everything as probabilities: what is the probability of being a one, a success? One problem we have immediately, based on this decision, is that our model cannot be a straight line on a graph. We have to come up with something new, and indeed we will.

Now for the data. In the first two video tutorials we designed our own random values, and in the third one, on analysis of covariance, I showed you how to import a CSV file, but here in video number four, on logistic regression, we're going to generate our own data once again. Our research scenario considers whether a participant required a second operation; this is all simulated data. We're investigating surgery of the abdomen for necrotic bowel: a piece of the small bowel becomes devoid of blood flow, becomes ischemic, and eventually dies, causing sepsis and all sorts of severe complications. It is managed by operating on the abdomen and removing the ischemic bowel, and there are various ways to go about restoring continuity or creating stomas, et cetera. But some of those unfortunate patients require a second-look laparotomy, or what we call a re-look laparotomy: the index procedure, that first surgery, wasn't enough, and they might require a follow-up procedure. There are various ways to go about that. We can do it as a planned procedure or on demand, in other words, only if the patient doesn't do well do they go back for their second surgery.
So that is going to be our binary dependent variable: yes, they needed a re-look laparotomy, or no, they didn't. We're interested in making the success class the ones that do need a re-look laparotomy, and you can once again see that this has nothing to do with the meaning of the term success; in actual fact that would be a failure, in some sense. We are choosing yes to a re-look laparotomy as our success class and encoding it as one, and we want to know from our independent variables, given values for them, what the probability of the success class, of needing a second-look laparotomy, is.

For our independent variables we're going to look at the length of bowel involved, termed the ischemic bowel length, as well as the seniority of the primary surgeon at that first, index procedure. That categorical independent variable will have three levels, or classes: a senior resident, an attending surgeon, or an acute care surgery specialist, someone who, over and above being an attending, has undergone extra training and is an acute care specialist. This depends on what part of the world you are in; there isn't always this differentiation between surgeons, but it's what we'll use for the simulated study: a senior resident, an attending surgeon, or an acute care surgery subspecialist, or super-specialist, depending on which term you use.

So let's create the random values. First, I create a numpy array with 30 values of yes and then 30 of no (those are strings) and assign it to the variable dependent. Then I create a numpy array of 30 string elements that are each either senior, attending, or specialist, drawn with probabilities of 0.5, 0.3, and 0.2; you can see they sum to 1, which they must. These 30 will be attached to the yeses, so, pre-empting things as I simulate this data, those who did need a re-look have a higher probability of having had a senior resident. I'm deliberately fudging the creation of this data, and that's why I say it's important to know how to simulate data, especially when you start learning these things: you can take control of the data and of the specific outcomes you want, and it helps you understand what's happening. Then we have another numpy array, also of 30 elements, also either senior, attending, or specialist, but with the probabilities reversed: 0.2 for senior, 0.3 for attending, and 0.5 for specialist, so that when those random values are selected it's more likely to select specialist. Then I create a numpy array of 30 random values from a normal distribution with a mean of 120 and a standard deviation of 20, and another numpy array for which I drop the mean a bit, to 100. Let's look at the pseudo-random number generation. There is numpy.repeat, which repeats yes 30 times and then no 30 times.
For the seniority it's the numpy.random.choice function: I pass it a list of strings, I ask for 30 of them, and it's most probable that it selects senior, with probability 0.5, then 0.3 for attending and only 0.2 for specialist, and I assign that to the variable seniority_yes. For seniority_no I just change the probabilities, so now I have 30 values and another 30 values. I do the same for length with the stats.norm.rvs function, where loc is the mean and scale is the standard deviation, and I want 30 of them; I put that as the first argument of the numpy.round function with zero decimals, just so that I have integer values.

Now we can put all of that in a pandas DataFrame. I'm going to have three columns, and I pass these to the pandas.DataFrame function as a dictionary of key-value pairs: each key is a column header, the name of a statistical variable, and each value holds the actual values that we just simulated. There's a relook-required column with those 60 values. For seniority I make a numpy array from the list holding the 30 seniority-yes values and the 30 seniority-no values and call the flatten method, because I have two separate numpy arrays that I want to combine into one flat array; that's what flatten does. For the ischemic bowel length I do exactly the same: a numpy array of a Python list containing two numpy arrays, flattened so that the 60 values are all in a row. So there's my data, and it is now in a pandas DataFrame.

Next we have to think about our base classes, the class of each categorical variable against which the others are measured. Both the re-look laparotomy and seniority are categorical variables, so we use the pandas Categorical function to indicate that, and we state the base class. When it comes to the seniority of the surgeon, we choose the senior resident, the most junior of the three categories, as the base class, and we measure being an attending or a subspecialist against that junior surgeon. You have to choose that base class; we saw this before when we did dummy variables, where one of them has to be the base class. And for our dependent variable we choose yes as our success: the class of our binary dependent variable that we're interested in.

So this is what we do. df.relook_required gives me back a pandas Series object, which I overwrite: pandas.Categorical of df.relook_required with the categories no and yes. The one you state first is going to be the zero and the next one the one, and because yes, a re-look was required, is the success class (again, success has nothing to do with the definition of the word), it must be encoded with a one, so I've got to put them in that order. When it comes to seniority, I overwrite df.seniority in the same way, with the categories senior, for senior resident, then attending, then specialist. Senior is listed first, so it becomes the base class, and we measure the other two against it.
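Putting the narration together, a sketch of the simulation cell might look like the following. The exact variable, column, and category names are my guesses from the audio, the seed is my addition for reproducibility, and the standard deviation of 20 for the second length array is an assumption (the video only says the mean drops to 100):

```python
numpy.random.seed(12)  # assumption: any seed will do, just for reproducibility

# 30 'Yes' values followed by 30 'No' values
dependent = numpy.repeat(['Yes', 'No'], 30)

# Seniority: more likely 'Senior' for the Yes group ...
seniority_yes = numpy.random.choice(
    ['Senior', 'Attending', 'Specialist'], size=30, p=[0.5, 0.3, 0.2])
# ... and more likely 'Specialist' for the No group
seniority_no = numpy.random.choice(
    ['Senior', 'Attending', 'Specialist'], size=30, p=[0.2, 0.3, 0.5])

# Ischemic bowel length: higher mean for the Yes group, rounded to integers
length_yes = numpy.round(stats.norm.rvs(loc=120, scale=20, size=30), 0)
length_no = numpy.round(stats.norm.rvs(loc=100, scale=20, size=30), 0)

df = pandas.DataFrame({
    'relook_required': dependent,
    'seniority': numpy.array([seniority_yes, seniority_no]).flatten(),
    'ischemic_bowel_length': numpy.array([length_yes, length_no]).flatten()
})

# Declare the categorical variables and state the categories explicitly:
# the level listed first is encoded 0, so 'No' is the failure and 'Yes'
# the success class, and 'Senior' is the base class for seniority
df.relook_required = pandas.Categorical(df.relook_required,
                                        categories=['No', 'Yes'])
df.seniority = pandas.Categorical(df.seniority,
                                  categories=['Senior', 'Attending', 'Specialist'])
```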
That's a very important decision to make. So let's call the info method to see what our DataFrame looks like. We've got three columns, relook required, seniority, and ischemic bowel length, each with 60 values. Relook required is a category and seniority is a category (that stands for categorical variables; if we hadn't changed them using the pandas.Categorical function they'd be objects, which is not what we want), and there's a 64-bit float for our continuous numerical variable.

Let's do some exploratory analysis of the data; it's always good to understand your data before you start working on it. We start with relook required and use the value_counts method, which counts each of the unique values in that pandas Series, df.relook_required, and we find 30 no and 30 yes, exactly how we designed it, so that works out well for us. By the way, as you can see, this type of research is not an experimental design: we're taking data that already exists and analyzing it.

If we use the pandas.crosstab function and pass the two Series objects, the df.relook_required column and the df.seniority column, we get a nice contingency table of observed values. We see relook required, no and yes, against seniority, senior, attending, and specialist. It is a frequency table, so it counts, for instance, how many people fall in the intersection of not requiring a re-look and having had a senior resident as their primary surgeon.

On this contingency table we can do a chi-square test for independence, with the null hypothesis that there is no dependence between whether a participant had a re-look and the level of seniority of their principal surgeon at the index procedure. It's really easy to do in Python: using the stats module in scipy, stats.chi2_contingency, I just pass in the table of observed values. We get back a chi-square test statistic and a p-value which, at an alpha value of 0.05, means we fail to reject the null hypothesis, and we also get back a table of expected values. Comparing our table of observed values with this contingency table of expected values, the two are not that different, inasmuch as we fail to reject the null hypothesis, so we state that these two variables are independent of each other.

Let's look at ischemic bowel length. We use the describe method on ischemic bowel length, grouped by the two levels of relook required, no and yes. We see 30 in each, as we expected and designed, and the mean for those that did not require a re-look was 97.9 centimeters, while the average ischemic bowel length for those that did was 121 centimeters, exactly as we designed it. Here's a box-and-whisker plot that shows the difference between the two; once again, with Plotly, you can hover nicely over the boxes to see the results.
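The exploratory steps just described, sketched with the column names assumed above:

```python
print(df.relook_required.value_counts())   # 30 No and 30 Yes, as designed

# Contingency table of observed frequencies
observed = pandas.crosstab(df.relook_required, df.seniority)
print(observed)

# Chi-square test for independence: returns the statistic, the p-value,
# the degrees of freedom, and the table of expected frequencies
statistic, p_value, dof, expected = stats.chi2_contingency(observed)
print(statistic, p_value)
print(expected)

# Summary statistics of ischemic bowel length per outcome group
print(df.groupby('relook_required').ischemic_bowel_length.describe())
```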
By the way, if you think about it, we can take the two groups formed by the categorical dependent variable, use those two values to divide our data set in two, and compare the mean of the numerical variable between the two groups: that's just a t-test. So what I've done here is use stats.ttest_ind, an independent-samples Student's t-test. I say df.relook_required equals-equals yes, in other words, go down all the rows in that column, find all the values that are yes, and take their ischemic bowel length; then, for the second group, take all the ones that were no as far as relook required is concerned and get their ischemic bowel length. So I'm just doing a t-test (you can always think of analyzing your data in this way), and we see a p-value much, much smaller than our alpha value, so we reject the null hypothesis, which states that there is no difference in mean ischemic bowel length between these two groups, and accept the alternative hypothesis that there is a difference. Just a little segue, a little something different for you there.
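That t-test, sketched with the same assumed names:

```python
# Compare mean ischemic bowel length between the relook and no-relook groups
result = stats.ttest_ind(
    df[df.relook_required == 'Yes'].ischemic_bowel_length,
    df[df.relook_required == 'No'].ischemic_bowel_length)
print(result)   # the p-value is far below 0.05, so reject the null hypothesis
```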
Now that we've created our data and understand it, there are a few things we have to touch on before we get to logistic regression: the ideas of probability, odds, and odds ratios.

Probability is a value that we give for the occurrence of an event. The event either occurs, with a certain probability, or it doesn't occur; there is nothing in between. Since it is a probability, we constrain it to the interval from zero to one (multiply by 100 and that's zero percent to a hundred percent), and we use the letter p to express the probability of the event occurring. If the event occurs with probability p, constrained to the interval from zero to one, the probability of it not occurring must be 1 minus p. That should be simple enough to see.

This gives us the idea of the odds of the event occurring: the ratio of the probability that the event occurs to the probability that it does not occur, p over 1 minus p. That is the odds. Look at this example: if a specific event occurs five times in 15 attempts, so it occurred in five and didn't occur in 10, it has a probability of a third, because five divided by 15, the total, is a third, so the probability of it not occurring is two thirds. There's our p. The odds of it occurring are p over 1 minus p, so let's save the value p, a third, and compute p over 1 minus p, which is a half. On screen we see 0.4999..., which is just round-off error; you'll always see that in Python, and in other languages too, but clearly it is 0.5. So the result is a half, or one to two, and that's what you'd expect, isn't it: it occurred five times and didn't occur 10 times, that's five to 10, which is one to two. You can see the solution there: a third over one minus a third, a little bit of algebra, and we come out at a half.

Now that we understand probability and odds, the more difficult of the three is the odds ratio. That is a ratio of odds: not of probabilities, but of odds. An easy way to see it is to consider an unfair coin with a probability of heads of 0.7, and a fair coin with a probability of heads of 0.5. Let's save those, because the odds ratio is going to be for heads given the unfair coin over the fair coin. So p_unfair is 0.7 and p_fair is 0.5. We work out the odds for each, again just p over 1 minus p for each one, and save them as odds_unfair and odds_fair. Now we can work out the odds ratio, which is the odds of the unfair coin over the odds of the fair coin (you can see it's a ratio of odds), and that gives us 2.33. So we can say that the odds ratio is two and a third to one, 2.33 to one, for heads given the unfair coin over the fair coin.

Since this odds ratio (we abbreviate it OR) is greater than 1.0, we subtract 1.0 from it: 2.333 minus 1.0 is 1.333, and we multiply by 100, because in words we can now state that the odds of heads given the unfair coin are 133 and a third percent higher than given the fair coin. The fair coin is our base, and we increase the odds by 133 and a third percent. We can verify this. The odds increase is the odds ratio minus one, 2.333 minus one, which I save. If I take the odds of the fair coin (remember, that was one to one) plus the odds increase times the odds of the fair coin, I get back to 2.333, and we can say we've increased the odds by a factor of 2.333, or by 133 and a third percent. Come back to this often as you start working with odds ratios; the coin example makes it very easy to see. Once again, I take the odds increase times the odds of the fair coin, add it to what the odds of the fair coin already were, and I get back to 2.33. That's what it means when the odds ratio is 2.33.
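Both worked examples, exactly as narrated:

```python
# An event that occurs 5 times in 15 attempts
p = 5 / 15            # probability of occurrence: one third
odds = p / (1 - p)    # prints as 0.4999... from round-off: it is 0.5, or 1-to-2
print(p, odds)

# The unfair versus fair coin
p_unfair = 0.7
p_fair = 0.5
odds_unfair = p_unfair / (1 - p_unfair)   # 2.333...
odds_fair = p_fair / (1 - p_fair)         # 1.0

odds_ratio = odds_unfair / odds_fair      # 2.333..., i.e. 2.33-to-1
odds_increase = odds_ratio - 1            # 1.333..., a 133.3% increase

# Verify: the fair odds plus the increase times the fair odds
# recovers the unfair odds
print(odds_fair + odds_increase * odds_fair)   # 2.333...
```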
Now, for the first of the three logistic regression models that we're going to build, we start by selecting just the continuous numerical variable, the ischemic bowel length, as a predictor of whether a participant required a re-look laparotomy. First I use the get_dummies function: pandas.get_dummies of df.relook_required, and then .Yes. Remember, I can just write .Yes because there are no illegal characters in the name, so I don't have to put it inside square brackets and quotation marks as a string. What pandas does is treat yes as the success, encoded with a one. Let's look at the DataFrame: where relook required was yes, the relook-encoded value is just one. I don't need a column for no, the zero, because if the encoded value is one there was a re-look and if it's zero there wasn't; as with dummy variables, I only need this one dummy column. And the reason I've done this here is not for our analysis but because I want to plot a scatter plot of what is going on. There are the zeros, all the participants, or patients, who did not require a re-look, and there are those who did, who had a hundred percent probability of a re-look while the others had zero, plotted against ischemic bowel length. The reason I encoded it with the zero and the one and used that column is so that I get this nice y-axis; otherwise the plot was going to flip them around for me, with the yeses at the bottom and the noes at the top.

What I want you to see here is that we can't draw a straight line through this, and we also can't use means as we did with analysis of variance. I have to come up with a different strategy to build a model that gives me the probability of a re-look. We certainly can't draw a straight line, because if we did, we would get probabilities of less than zero and more than one, and we can't have that; there's no way to have that kind of line.

Our aim, though, is still to create a linear model, beta sub 0 plus beta sub 1 times x, where x, remember, is our ischemic bowel length. The problem is that we don't have a numerical variable as our dependent variable, so we have to come up with something different. Here is the magic word: link. We want to link the linear model on the right-hand side, beta sub 0 hat plus beta sub 1 hat times x, to a probability. We can't equate it to a continuous numerical variable, so we link it to a probability, and for that we need a link function. We still want a linear model, but now we link it to a probability; it is not an equation with a continuous numerical variable. There you can see the logit function in equation three: the natural log, base e, Euler's number, of the estimated probability over one minus the estimated probability. And what is that? That's the odds, so we also call this the log odds. This is the link between the probability and the linear model on the right-hand side: we still have our linear model, with an intercept and a slope, and we link it to the probability by equating the linear model to this log odds. Then we can solve for the estimated p. Just as our model gave us estimated values before, when the dependent variable was continuous and numerical, it can give us this estimated probability.

I show you the little bit of algebra here; it's not difficult. We exponentiate both sides, and by the laws of logarithms we can bring the odds out of e to the power of the natural log, which means we have e to the power of our linear model on the right-hand side. I multiply both sides by one minus the estimated p, multiply it out, isolate the estimated probability on the left-hand side by taking it out as a common factor, divide both sides by one plus e to the power of the linear model, and there we get our estimate of the probability.
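Written out, the link function and the algebra just described are:

```latex
\operatorname{logit}(\hat{p})
  = \ln\!\left(\frac{\hat{p}}{1-\hat{p}}\right)
  = \hat{\beta}_0 + \hat{\beta}_1 x
\;\Longrightarrow\;
\frac{\hat{p}}{1-\hat{p}} = e^{\hat{\beta}_0 + \hat{\beta}_1 x}
\;\Longrightarrow\;
\hat{p} = \frac{e^{\hat{\beta}_0 + \hat{\beta}_1 x}}
               {1 + e^{\hat{\beta}_0 + \hat{\beta}_1 x}}
```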
So where before we simply had a continuous numerical estimate that equaled the linear model on the right-hand side, we still have our linear model, but you can see the big difference now. Let's use our Logit function to create a model. First we need the design matrices; remember, we've used them before: y, X equals patsy.dmatrices, with the little formula relook required given ischemic bowel length, with the data from the DataFrame, and then I convert them to numpy arrays. Let's look at the dependent variable: it's just ones, with zeros where they didn't require a re-look. As for the design matrix, we have our column of constant ones, and the second column is the ischemic bowel length. You should be very comfortable with this by now; we're still going for the right-hand side, which is basically nothing other than a linear model.

Now we can use the Logit function just as we used OLS before, although we can't use OLS here, and you can see why. What is actually used is maximum likelihood estimation; that's how the best values are found in this instance. We call the fit method, and there we have a model. Let's look at the summary of this model; we just need to learn how to interpret it. We still see coefficients, and a standard error for each coefficient. The coefficient divided by the standard error gives, in this instance, a z statistic, and we can express a probability for that z statistic given the parameters of the z distribution, and then we see the 95 percent confidence intervals. But we have to be very careful with the values we have here for the coefficients. This is how we would write them: the logit function on the left-hand side, the log odds, equals beta sub 0, which is -5.7102, plus beta sub 1, which is 0.0519, times ischemic bowel length. That linear model on the right-hand side equates to the link function, so we still have to solve, as we saw, for the estimated probability.

Now that we've done that, let's take a participant and suggest that they have an ischemic bowel length of 120. If I pass this, with the constant 1 of course, so that my Python list has two elements, the constant and the ischemic bowel length, to the predict method of my new logistic regression model, I see a probability of 0.6275. So given a 120-centimeter ischemic bowel length, there's an estimated probability of 0.628, or 62.8 percent, of a re-look laparotomy. That's very, very nice.

What I want to do next (you don't have to worry about the code) is add a constant to a whole range of values. I'm using numpy.linspace to go from the minimum bowel length to the maximum bowel length in my data set, with 15 numbers in between, which gives me an array; I add a column of constants with sm.add_constant in front, and then I use the predict method for that whole range of ischemic bowel lengths, creating an array that contains all these probabilities, because now I can do a nice little scatter plot. And look at this: this is what we have instead of our straight line. That is maximum likelihood estimation calculating all these values for us, so we can see our model, and as the bowel length gets longer and longer, you can see the probability of the re-look laparotomy being required rise.

Something that becomes very important, and as you learn more about these there are techniques for it, is where we make the cutoff, because what this model gives us is a probability, say 0.6407. Do we put that in the one class as a prediction, or in the zero class? We have to decide what our cutoff is. Generically we could put it at 0.5; that's at about 110 centimeters, where the curve flips over 0.5, so above that the model predicts a re-look and below it, it doesn't. But it really depends. If you don't want to make mistakes, if you want to be conservative, you might choose something like 0.25, so that anything above 0.25 predicts that a re-look laparotomy would be required. You can decide where you want to place your cutoff, but that's a somewhat more advanced topic.
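A sketch of model one as narrated. The formula string uses my assumed column names, and the two-line guard is mine: depending on the patsy version, a categorical outcome may come back as one indicator column per level, in which case we keep only the 'Yes' column:

```python
y, X = patsy.dmatrices('relook_required ~ ischemic_bowel_length', df)
y = numpy.asarray(y)
X = numpy.asarray(X)
if y.ndim > 1 and y.shape[1] == 2:   # guard: keep only the 'Yes' indicator
    y = y[:, 1]

log_model_1 = Logit(y, X).fit()      # fitted by maximum likelihood
print(log_model_1.summary())

# Estimated probability of a re-look for a 120 cm ischemic bowel length;
# the leading 1 is the constant in the design matrix
print(log_model_1.predict([1, 120]))   # about 0.6275

# Predicted probabilities over the observed range, for the sigmoid plot
lengths = numpy.linspace(df.ischemic_bowel_length.min(),
                         df.ischemic_bowel_length.max(), 15)
probabilities = log_model_1.predict(sm.add_constant(lengths))
```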
Now we can look at our parameters. We call the params attribute of our model and see beta sub 0 and beta sub 1, so we can write a solution to our research equation: the estimated probability is e to the power of the linear model (you see the linear model at the top there, in the exponent) over one plus e to the same power, so it's very easy to see how we answer our research question. By the way, remember that in equation seven you can see how the confidence intervals for these are calculated, using the standard error.

But what do these values mean? The coefficient sits in an exponent, as part of this logit link function, so how do we understand the values beta sub 0 and beta sub 1? What we do is exponentiate them. You can see numpy.exp there: I take log_model_1, the model we've just built, its params, and the second one, beta sub 1 (remember, Python is zero-indexed and square brackets index, so I want the second value back), and I exponentiate it, and I get a value of 1.0533. That is my odds ratio. You can look back at the mathematics to see why this is so, but the important thing to remember is that if we exponentiate beta sub 1, we get the odds ratio.

So, 1.0533: how do we interpret that? In the case of a continuous numerical variable such as this, we say that for every one-unit increase in that variable (the unit of measurement for ischemic bowel length was the centimeter, so every one-centimeter increase, and it doesn't matter whether it's from 120 to 121 or from 150 to 151), the odds of a re-look are multiplied by 1.0533. So 121 centimeters over 120 centimeters increases the odds of a re-look by a factor of 1.0533, and we can subtract one from that and multiply by 100, which gives 5.33 percent: every one-centimeter increase raises the odds, over the next-lower centimeter, by 5.33 percent. That's a very powerful thing to be able to say.

Let's create a user-defined function (you don't have to do this). I use the def keyword and call my function prob, with a parameter x that I pass to it; it returns log_model_1.predict of 1 comma x, so it returns the probability if I give it an ischemic bowel length value. Then we use that inside a second function, which I call odds. I still pass in the ischemic bowel length, and I assign to p the result of calling prob with that x, so that it returns a probability from which I can calculate the odds; you can have a look at that code. Now that I have this odds function, I can pass it 120 and it spits out the odds of a re-look given 120 centimeters. Remember, we said the odds ratio was 1.0533, and I want to show you that it is so.
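The two helper functions as described, with the names spoken in the video:

```python
def prob(x):
    """Estimated probability of a re-look for a given ischemic bowel length."""
    return log_model_1.predict([1, x])

def odds(x):
    """Odds of a re-look for a given ischemic bowel length."""
    p = prob(x)
    return p / (1 - p)

print(odds(120))               # the odds of a re-look at 120 cm
print(odds(121) / odds(120))   # 1.0533: the odds ratio
print(odds(156) / odds(155))   # 1.0533 again: it holds for any one-unit step
```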
Let's take the odds of 121 centimeters over the odds of 120 centimeters, and lo and behold, I see 1.0533: that's the odds ratio. Let's do the odds of 156 over 155: 1.0533 again. Every one-unit increase increases the odds of a re-look by a factor of 1.0533, and since that's above one, I can subtract one from it, giving 0.0533, multiply by 100, and that gives us 5.33 percent: a 5.33 percent increase in the odds, not the probability, of a re-look for every one-centimeter increase.

By the way, just for clarity and completeness, if the odds ratio were below one, say 0.8, I would subtract it from one, not one from it: 1 minus 0.8 is 0.2, which is 20 percent, and I would say it decreases the odds of a re-look by 20 percent. So just check which way you have to subtract: the ratio from one if it's below one, or one from the ratio if it's above one; multiply by 100, and that gives you the change in the odds.

We can do exactly the same thing with the confidence intervals around the coefficients. I get them with log_model_1.conf_int, but I pass the result to numpy.exp so that I get the 95 percent confidence intervals around the odds ratio, the exponent of the coefficient: you take e to the power of the coefficient, which gives the odds ratio, and there you see the 95 percent confidence intervals around it. That means we can express p-values for these, and we see here for beta sub 1 a p-value of 0.000652, less than an alpha of 0.05, so we reject the null hypothesis, which here would be that beta sub 1 equals zero, or in words, that ischemic bowel length is not a predictor of the need for a re-look laparotomy.

I also want to show you this visually. I create a z score; we get it from our table, and it was 3.406, the coefficient divided by the standard error. I then create a bunch of x values and use them to plot the z distribution. You can see the dotted lines, our critical z values, and our z value is way beyond them; that's why we got a p-value so much smaller than 0.05. There's some code for you to have a look at; it's nice to show visually why we get this significant value.

Now let's ramp it up and use a categorical variable as our independent variable: the seniority of the principal surgeon at the index procedure, so that we can understand the relationship between that independent variable and the probability of a re-look. There we go: we create our design matrices, y, X equals patsy.dmatrices, with the little formula relook required given seniority, the data coming from the df DataFrame, and let's print some of them out. You can see the constant for my design matrix X, the column of ones, as I always have, and then the two dummy variables, one for attending and one for subspecialist, because remember the base class is the senior resident: we don't need a column for that one, only these two. Here I've printed it against the DataFrame itself: it was attending, attending, which is why it reads 1 0, 1 0; then 0 0, a senior resident; then attending again, 1 0; and if it were 0 1, that would be a subspecialist. We convert those to numpy arrays, as we always do, because I want to use the Logit function once again.
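Model two, sketched the same way (same assumed names and the same guard as before):

```python
y, X = patsy.dmatrices('relook_required ~ seniority', df)
y = numpy.asarray(y)
X = numpy.asarray(X)
if y.ndim > 1 and y.shape[1] == 2:   # guard: keep only the 'Yes' indicator
    y = y[:, 1]

# The base class 'Senior' is absorbed into the intercept; the two dummy
# columns compare attending and specialist against it
log_model_2 = Logit(y, X).fit()
print(log_model_2.summary())
```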
I call my new model log_model_2, and let's have a look at its summary, because we need to learn how to interpret it. I still get my coefficients: beta sub 0 is 0.5596, beta sub 1 is -0.221, and beta sub 2 is -1.19. I see their standard errors; the coefficient divided by the standard error gives the z statistic, from which a p-value is calculated; and then we see the 95 percent confidence interval values around each coefficient. Remember what this all means: I can fill in my research equation. The log odds, my link function, equals beta sub 0 plus beta sub 1 plus beta sub 2, straight from these coefficients, where beta sub 1 is multiplied by attending and beta sub 2 by specialist. If it was a senior resident, both dummies are zero and the log odds are always 0.5596. If it was an attending, the first dummy is one and the second zero; if it was a specialist, the first is zero and the second one. So you can see I can only get three different probabilities.

Let's take the exponent of those coefficients; remember, they're in the params attribute, and once I take e to the power of these coefficients, I get the odds ratios. For beta sub 1 I see an odds ratio of 0.8, that is, for an attending, and for a subspecialist we find an odds ratio of 0.302. How do we interpret these? Both are below one, so remember, we subtract the ratio from one: 1 minus 0.8 is 0.2, times 100 percent is 20 percent, so we can say there is a 20 percent decrease in the odds of a re-look laparotomy if an attending was the main surgeon, as opposed to a senior resident. You can see how crucial it was to choose one of the three levels as our base class, because we are comparing; this is an odds-ratio comparison, just as with the unfair coin over the fair coin, and here it is the attending over the senior resident. And if it's a subspecialist: 1.0 minus 0.303 gives 69.7 percent, a 69.7 percent decrease in the odds (not the probability) of a re-look laparotomy if the main surgeon was a specialist, as opposed to, once again, the base class, a senior resident.

We can do the same for the confidence intervals: I exponentiate all of them and get the 95 percent confidence intervals around the odds ratios (they're exponentiated, so they are odds ratios now). Now have a little look at this; something's going on here. Look at beta sub 1: the lower bound of the odds-ratio interval is 0.18, below 1, so that's a decrease in the odds, while the upper bound is 3.37, above 1, an increase in the odds. Within the 95 percent confidence limits there is both a decrease and an increase in the odds, and once you have that, you cannot have a significant p-value. Go back to the model and look at x sub 1, our beta sub 1: the p-value is 0.76, and the confidence intervals, still expressed for the coefficient, run from negative to positive, so within that range lies everything from a decrease to an increase in the odds.
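The exponentiation steps just described:

```python
# e to the power of the coefficients gives the odds ratios
print(numpy.exp(log_model_2.params))       # roughly 0.80 and 0.30 for the dummies

# e to the power of the confidence-interval bounds gives the interval
# around each odds ratio; an interval straddling 1 cannot be significant
print(numpy.exp(log_model_2.conf_int()))
```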
Now look at what happens with beta sub 2. We have a p-value of 0.048, which is less than 0.05, and the confidence interval around the coefficient is negative at both bounds, which, when you exponentiate, means both bounds of the odds ratio fall below 1. So the 95 percent confidence interval in the case of a specialist shows a decrease at both ends: we subtract from 1, and 1 minus 0.99 is 0.01, so it runs from a 1 percent to a 91 percent decrease. Because both bounds fall below 1, the interval shows a decrease throughout, and we have a significant p-value, if only barely. Those are all very important factors.

Finally, we're going to use both the continuous numerical variable and the categorical variable as our independent variables: the seniority of the surgeon at the index procedure and the ischemic bowel length together, as predictors of whether a re-look is required. Once again we create our design matrices, with the formula relook required given ischemic bowel length plus seniority. A note on the order here: the order we give in the formula does not determine which is beta sub 1, which is beta sub 2, and which is beta sub 3; we have to be alert when we look at the values. Let me show you what they look like. There's my column of constants, and then a 1 and a 0: it has made attending beta sub 1 and subspecialist beta sub 2, and it has made ischemic bowel length beta sub 3, even though I put ischemic bowel length first. So just make sure of that order.

That means, and I want to write it out for you here because we haven't done so before, my research equation is the following: the estimated probability of a re-look equals e to the power of beta sub 0 plus beta sub 1 times attending plus beta sub 2 times specialist (those dummies can each be 0 or 1) plus beta sub 3 times bowel length, so I've got four parameters, all over 1 plus e to the same power. Or, if you want to keep thinking of it as a linear model, beta sub 0 hat, beta sub 1 hat, beta sub 2 hat, and beta sub 3 hat together equal the log odds of this estimated probability. The null hypothesis is still going to be the same, and that's why I show you these four models all in a row; it's the same idea as with linear regression, analysis of variance, and analysis of covariance: that the beta sub 1, beta sub 2, and beta sub 3 estimates are all equal to 0, or in words, that seniority and ischemic bowel length are not predictors of the need for a re-look laparotomy.

So let's do it. We create numpy arrays from our design matrices, we use the Logit function, and this time we assign the result to the variable log_model_3; this is our third model. Lo and behold, we can call the summary method, and there we have beta sub 0; beta sub 1, which remember is attending; beta sub 2, which is subspecialist; and beta sub 3, which is ischemic bowel length; so watch out for the order. I still have a standard error and z statistic for each, a probability for each, and 95 percent confidence intervals around the coefficients. I've filled the coefficients in for you here, taking the values straight from the table, and there is the solution to our research question. And you understand now what you have to do: you exponentiate those coefficients to get the odds ratios, and you exponentiate the 95 percent confidence-interval lower- and upper-bound values to get the 95 percent confidence intervals around the odds ratios.
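And model three, with both predictors. The comment about column order repeats the video's observation: check the fitted design matrix rather than assuming it follows the formula:

```python
y, X = patsy.dmatrices('relook_required ~ ischemic_bowel_length + seniority', df)
y = numpy.asarray(y)
X = numpy.asarray(X)
if y.ndim > 1 and y.shape[1] == 2:   # guard: keep only the 'Yes' indicator
    y = y[:, 1]

# Per the video, the columns came out as: constant, attending, specialist,
# bowel length, so attending is beta 1, specialist beta 2, bowel length beta 3
log_model_3 = Logit(y, X).fit()
print(log_model_3.summary())

# Odds ratios and their 95% confidence intervals
print(numpy.exp(log_model_3.params))
print(numpy.exp(log_model_3.conf_int()))
```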
You know now that if an odds ratio is more than one, it's an increase, and if it's less than one, a decrease; if the confidence interval straddles one, you're not going to find a significant p-value, and if it doesn't straddle one, you are. You can see here that the p-value for the attending was not significant, 0.67, and for the subspecialist it was 0.53, while for ischemic bowel length it was significant. Once again you can see the bounds: negative to positive, negative to positive, and then positive to positive, so you can see why the first two were not significant but ischemic bowel length, given the other two, was. You have to understand that in this model we're not treating these as separate effects; they are combined into a single model, so it's given attending or not, given subspecialist or not, over a senior resident, together with the ischemic bowel length, that we get this significant value.

What I wanted to do for you, finally (you can have a look at the code), is create a bunch of values so that I can express this nice model. There's the result of our model: for each of the resident, the attending, and the specialist, we see the probability of a re-look given the ischemic bowel length, so they each have their own line, and you can see this beautiful sigmoid-shaped curve, constrained between 0 and 1, exactly what we needed.

Finally, if you want to save these plots, remember this is Plotly, so you can really just click on that button right there and it'll download a PNG for you. You can zoom in, pan around, zoom in further, zoom out, go back home, even zoom way in; you can switch the traces off one by one and decide which ones to show in isolation. That's why I absolutely love Plotly. And instead of using the save button, here I say fig.write_image, give it a name (it will see from the name that it is a PNG file, and my format, png, is explicitly stated there), and I can specify a width and a height. If I run that block of code, it creates that nice PNG file in this folder, or directory, ready for use in a report.

I really hope you enjoyed the seminar. We looked at linear regression, analysis of variance, analysis of covariance, and now logistic regression, and I hope you saw the progression, how one just builds on the other, and why I call them the four fundamental linear model types, in quotation marks. If you understand the basics of what I've shown you in this seminar series of four video tutorials, you're well on your way to making use of these types of models and to looking at more complex ones.