All right. Hello everybody. Welcome to today's topic, which is R for logistic regression. Thank you for coming. Let me go to the next slide here. So those of you who know me, welcome back, glad to see you again. And I've got the chat open, so if you can't hear me or if there's anything wrong, let me know, and I'll eventually look at the chat. I'm not great about this. Excuse me. So some of you know me already. My name is Monica Wahee, and I'm trained as an epidemiologist and biostatistician, but I do informatics and all kinds of other stuff, and so therefore I am now a data scientist. Sorry, I've got the driest throat today. So I've been doing these little lecture series because I'm really into software integration, and now is the time, right? Because in my field of public health, everybody's embraced SAS as sort of the standard. But SAS users are now embracing open source like Python and R, and not just open source things, other things too, like Tableau. And the thing is, if you are brought up in the world of data analytics through SAS, you often don't really learn anything else, and other tools run a little differently. So that's what this lecture is about. If you're in public health, well, it depends on what you're doing, but logistic regression is kind of the main tool we have in biostatistics, and SAS is really good at it. But open source R does logistic regression too. It's just a little more do-it-yourself, a little more nuanced. So there are problems with SAS, right? SAS is not perfect, like I have on the slide. I like a lot of things about SAS, especially PROC LOGISTIC, but one thing I don't like about SAS is that it's hard to get the results out. I mean, you can make a PDF of the output, but it's hard to get the results out in, say, a CSV format. Any results: model fit, ORs, CIs, whatever.
And also SAS doesn't really want you to make a coefficient plot. You don't see a plot of your slopes. I think you can ask it for plots, but they're not the plots I want, not the ones that help me interpret the model. So SAS is perfectly fine for logistic regression, but I learned some tricks in R that I'm going to show you today that can be really cool depending on what you're doing. And actually, I have to say, I'm showing you my sort of secret sauce for doing logistic regression, because now when I analyze BRFSS data or any other surveillance data, where I'm going to be doing a lot of logistic regression, I kind of do what I'm going to show you today. All right. So before we continue, I just wanted to tell all of you about this free online workshop I'm holding at the end of the month. The workshop is called Application Basics, and it's aimed at people like me in public health, or in other words, people who are data scientists or learning data science but did not come up through some sort of business school or computer programming school. They come through health or some other domain where you don't really learn about application development, like computer applications or business applications. So the problem is, in public health they want us to analyze data from applications, they want us to connect applications together, you know, SAS wants to stack with other applications. So it's really helpful to get a crash course in application architecture: how applications are built, who builds them, and how they get designed. That's what the workshop is about. So if you sign up for the workshop, which is in the link on this event, you get free access to the online course that goes along with it, which has all the didactic information.
And we'll be working through that in our workshop. It'll be on Zoom, and everybody will join, and there are a few challenges in there. I'm hoping enough people sign up so I get kind of a big group, and then what I can do is split you into a few groups. I learned how to do that in Zoom, breakout groups, and have you all do the challenges, and then you can come back together and we'll debrief. It'll be really interesting, and don't be scared. I put a solution on there from my own channel, like what the teacher does, but my solutions are boring. I want to see what you guys have to say, so please, if you're interested in this, sign up for the workshop. It'll be Monday, Wednesday and Friday, September 25, 27 and 29. Each session will be about two to three hours depending on how many people sign up, and it's at noon Eastern time. Alrighty, and then after these three sessions you'll sign up for a private wrap-up with me on Zoom, just 30 minutes. And if you have any questions about the workshop, let me know. I'm going to try to remember to look at the chat. All right, back to our regularly scheduled program. So what are we doing today? We're learning about R for logistic regression. Even though SAS is perfectly okay, let's just see how we can do it in R. Okay, so I expect sort of a mix of people here with different backgrounds, so I want to make sure those of you who are not very familiar with R understand that R actually has two, sort of, interfaces: one is called R GUI and the other is called RStudio. And if you make code in R GUI it runs in RStudio and vice versa. They're not different programs. What's different is the interface you're using. RStudio is an integrated development environment, an IDE. What does that mean? It means my colleague who likes to make dashboards prefers to use RStudio.
And then when she runs her code, a window opens with the dashboard, and it's just easier for developing on the web. Well, old-fashioned epidemiologist me, I just like to use R GUI. It looks sort of SAS-like. I'm so old-fashioned, I get all upset if there's too much cognitive load in my visual stream. But it's totally up to you what you do. When I've taken my colleague's dashboard code that she's making in RStudio and I run it in R GUI, it just opens a browser window, like, I use Chrome, so it opens a Chrome window and starts acting like it's RStudio, so it's no problem. So, today's resources: on the right side of the slide you can see there's a blog post I made about this; we'll go to it later. And you can download these slides, so you have these links; just go to the link in the description. And then you can download this demonstration data set I'm using. It's just a little piece of BRFSS data, just so I can demonstrate something with a lot of data, like with thousands of rows, right? But I made it a smaller data set. BRFSS is huge. It's got a lot of columns and a lot of rows, so it doesn't work so well for demonstrations, but a piece of it works well for pretending it's big data in a demonstration. So yeah, those are all the resources I'm going to demonstrate with. Going back to the left side of the slide. For the SAS users, or for the non-SAS users wondering what goes on in SAS: in SAS, what you do when you're getting ready for logistic regression is you prepare an analytic data set. And what you're doing is you're predicting the log odds of the probability of an occurrence of a binary event, like dead or alive, right? So you've got this column with a one or a zero in it as to whether or not they got the outcome. That's your dependent variable. That's why you're doing logistic regression.
And of course you have independent variables, so you've got to figure those out. You've got to do all your hypothesis work and cook up your independent variables, you know, are they going to be linear, do you have categories, you've just got to do all that work, right? If you don't know how to do it, take my LinkedIn Learning course on how to do it. But anyway, you design it, and then you create your analytic data set, and if you're using SAS you put it into SAS. There are a few different commands you can use to run logistic regression. The old-schoolers will run PROC GENMOD, GENMOD for generalized model, and then you set the link as binomial. I think more modern people will use PROC LOGISTIC. Either way, I like PROC LOGISTIC, but you have to add the DESCENDING option, because everybody codes one as the outcome, and PROC LOGISTIC models zero as the outcome by default. Why? I don't know, ask SAS. So if you want to model the right thing, you have to put the DESCENDING option on PROC LOGISTIC. And the default output is pretty nice. So what actually comes out in logistic regression output, just to remind you: we're predicting the log odds of the probability, not the probability and not the odds. We're predicting the log odds of the probability, so you get these slopes and the intercept, right? So if you put in three covariates as your independent variables, you'll get three slopes and an intercept. And those slopes will be on the log-odds scale. If you're in engineering, I guess they don't mind that, but in public health we can't have that. We need to turn it into an odds ratio and a 95% confidence interval. So why we love SAS is that SAS does it for you. I think the option is RISKLIMITS, and it won't just do the odds ratio, it'll do all kinds of stuff. You know, it's SAS; you pretty much have to suppress output rather than tell it what to do.
And so you get all of this output in SAS in a public health sort of style, and it's really nice, you can look at it. But the thing is, like I said on the last slide, you can't export it very well and you can't graph it very easily. In R, the whole thing is a fragmented affair. The analytic data set is the same, your hypothesis is the same, but once you get to R, you're kind of duct-taping this whole thing together. You're kind of creating a Rube Goldberg machine, a pipeline. You're basically creating a pipeline to do what SAS just does for you, right? So the bad news is that creating a pipeline means you've got to tape together a whole bunch of pieces. The good news is you get to shop for each of those pieces. You get to pick your favorites, and there are a lot of options. So that's why I'm giving this presentation today, to show you how I do mine, right? So this is my approach. We're going to go over now to R. If anybody has any questions, you can put them in the chat here. As I said, let me put this over here, I'm going to be demonstrating using a little piece of a data set from the BRFSS. For those of you who don't know, the Behavioral Risk Factor Surveillance System is a cross-sectional anonymous phone survey in the US. They just call numbers, they don't know who's on the other side, and then they ask them how much they smoke and stuff, and it's cross-sectional, so they do it every year. And they post the data online, so it makes a really good demonstration data set for public health. So I took a piece of it and I made it into an RDS, which is R's native format. It's like a sas7bdat for R. Oh, and for those of you who are not used to this: this is the R GUI, and the console is up here. This acts a lot like the log file in SAS, and it also does other things that are usually in a different place in SAS.
And then if I get output, like tables, if I run a frequency, it'll show up in here, but if I run a plot, it'll open a new window. So that's the console, and you can move the console around, but I try to keep it up here, because otherwise I get confused about what's going on. And you can open as many code windows as you want over here. So this is our code; we just have one code snippet today. Okay, and as you can see, the comments use this pound sign, which is an octothorpe. I like to be smart and use long words. Okay. So our first command is readRDS, because it's an RDS file. Oh, I want to tell you, I've mapped this directory. If you're clicked on the console and you choose, see down here, Change dir, you can choose a directory, basically a folder, which is where I put this RDS data set. So when R tries to read it, it's only looking there; I forced it to only look there. So this readRDS command is going to read the file, and this is an arrow, so it's going to put it in an object. Now if I just ran the readRDS part alone, which is a bad idea, it would just print the data to the console. But I don't want that; I want to read this in and put it into this data frame. That's what it's called in R, a data frame. So let's run this code. I highlight and I do Ctrl+R, and what it does is it transfers this to the console. I guess this isn't a very big data set. So now I'm going to run colnames, which is kind of like PROC CONTENTS, only just the column names. And the name of the data frame is brfss. So here we go. All right. You can kind of recognize some of these column names if you're used to using the BRFSS data set. Let's click on this here. You've got smoke day, and remember this, the age group is over here. Good. So there are a bunch of native variables in here and a bunch of transformed variables.
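To sketch those first steps in code (the folder path and file name here are just placeholders for wherever you saved the demonstration RDS, and brfss is the data frame name used in this walkthrough):

```r
# Point R at the folder where the demonstration file lives
# (the same thing as choosing Change dir in the R GUI)
setwd("C:/demo")  # placeholder path

# readRDS plus the assignment arrow puts the data into a data frame
brfss <- readRDS("brfss_demo.rds")  # placeholder file name

# Kind of like PROC CONTENTS, only just the column names
colnames(brfss)
```

Running readRDS without the arrow would just print the whole data set to the console, which is the bad idea mentioned above.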
So what I've done is, a lot of these, like age group for instance, have higher cardinality than I want; there are too many groups. So I'll collapse them into a smaller number of groups, but then I'll make a set of indicator variables, ones and zeros, for them. Okay. So to demonstrate logistic regression, it was just a little easier to make a Mickey Mouse hypothesis so I could show you what I was doing. Oh, and let me show you how many rows are in this, the number of rows in our demonstration data set. It's almost 60,000. So what I'm doing makes sense; it's kind of like small big data. All right. So I came up with this Mickey Mouse hypothesis, because in reality what you would really do is have a real hypothesis. You'd have this outcome, you'd have shopped around in the BRFSS and chosen candidate covariates that you wanted to put in there, and you would do some sort of stepwise modeling process to choose what actually stays in the model, because you're going to get collinearity if you use a data set like this. I'm just pretending we have one model, and I'm just showing you how to run it. It's just a simple model. Okay. And so as you can see here, this is what the model means. The hypothesis is that having no insurance plan, which is no plan, yes/no, if you have a one for no plan, it means you have no health insurance plan, this is the US, so having no health insurance plan is the exposure in this little pretend analysis. It's a risk factor for poor health. So the outcome is poor health, yes/no, and that actually is just one level from the general health question. You were asked, how would you rate your health? And people go from excellent all the way down to poor. So I just flagged the poors. That's the outcome, which is really bad, by the way. If you rate yourself as poor on that question, all kinds of research says you really are sick, right?
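As a sketch of that row count and the kind of one/zero indicator variables described (the source column names and level codes here are placeholders, not the actual BRFSS codings, which you'd check in the codebook):

```r
# Number of rows: almost 60,000 in the demonstration data set
nrow(brfss)

# Collapsing higher-cardinality categoricals into indicator flags.
# Placeholder codings: genhlth level 5 = "poor", hlthpln 2 = "no plan".
brfss$poorhlth <- ifelse(brfss$genhlth == 5, 1, 0)  # flag the "poor" level
brfss$noplan   <- ifelse(brfss$hlthpln == 2, 1, 0)  # 1 = no insurance plan
```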
So it makes kind of sense, right, to have a hypothesis that not having an insurance plan would be associated with poor health. But we also know that education is a confounder. So I had a several-level categorical variable about education, and I turned it into just two indicator variables: one called low ed and one called some coll, for some college. So this one is the lowest education, I forget exactly what it is, and this one is some college. And the reference group, the one that's not in there, the one these ORs are going to be compared to, is people who completed college. Okay. So in our equation, the dependent variable is poor health, and then this is not an equals, it's actually a tilde. That's how we specify the model: no plan, which is our exposure, plus our confounders, low ed and some coll. Okay. And people ask me, why do you do that, Monica? Why didn't you just call it education and put it in there as a continuous variable? And I'm like, because it's not a continuous variable; this is the way to do it, so each level can have an independent slope. You're looking for a dose response; how else are you going to see it? I just get so irritated, sorry. There are reviewers, and I'm like, do you even know what you're doing? Like once, reviewers came back after I said we ran iterative logistic regression models, and they said, don't you mean interactive? And I'm like, no, I mean iterative. But anyway, how you formulate this: remember PROC GLM? This is like PROC GLM, only it's the glm function. And we've got our equation here, then a comma, we specify our data, which is our data frame up here, and family equals binomial. So do you feel like PROC GENMOD coming on here? It's very similar to PROC GENMOD code. And if I run this, it'll print to the screen. Actually, let's see, I'm just going to do that Ctrl+R here and look at what we get. Okay. And this is on the console.
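A sketch of that model call, with the tilde formula and the binomial family (the variable and object names here are stand-ins matching this walkthrough, not the exact names in the demonstration file):

```r
# Like PROC GENMOD with a binomial family:
# outcome ~ exposure + confounders
logistic_model <- glm(poorhlth ~ noplan + low_ed + some_coll,
                      data   = brfss,
                      family = binomial)

# The SAS-style summary: log-odds estimates, standard errors,
# z values, and p values with significance stars
summary(logistic_model)
```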
So it shows you what it ran; it says the call. Isn't this awkward? You get the intercept and then you get the log-odds slopes, and each of the slopes is so hard to look at. You don't get much in the way of model fit statistics, but you can. You can do the negative two log likelihood; you can get a package for that, and you can get other packages for model fit statistics. But for free, you get the AIC. And I like the AIC, but the AIC is only a sort of comparative model fit statistic. Only if you're fitting iterative models, not interactive, iterative, can you really use the AIC to choose models. So you might want to go with the negative two log likelihood approach, if you know it. I actually haven't met many people who know it, but I don't fight with these models that much; it's too much trouble. But you see how awkward this looks. Well, actually, if I run this, see how it says logistic model? That's just a name. I could have called it, you know, Suzanne. I could call it anything. I just chose the words logistic model. So here I'm making an object. This output, I'm going to turn it into this object. Okay, so now it's an object called logistic model. Now if I run summary on this logistic model, it starts looking more SAS-y. Yeah, doesn't that look like SAS, right? Now it's all nice. You have the intercept and the three covariates. You've got the log-odds estimate, standard error, z value, and this means p value. This is really ugly, but see these three asterisks? There's this little legend down here, and that's a code for significance, right? So you're thinking, no plan is actually not significant. Our exposure is not significant. Like, we have a null study; we should quit. I'm just kidding. Actually, I don't know exactly why that is, but in the US, people who are poor tend to be on Medicaid, and so they tend not to be quite that sick.
And so it's hard to find people with no plan who actually participate in the BRFSS. I mean, there's just a bunch of selection bias and other issues going on there. But anyway, this is not even a real model, but this is what you see. So this is sort of your result of your model, but it's not really that easy to report. And I want to show you what kind of object we just made. We just made this glm, lm object. Okay. So this is not a data frame. It's this thing. Okay. So how do we get the odds ratios? What we're going to do is turn this logistic model thing into an actual data set. We won't turn it into a data frame; that's one way of having data sets. We're going to turn it into a tibble, which is a different way. Now, there's a really popular package in R called dplyr, which is very SQL-like for managing data, and that operates on tibbles, and people like that. I don't like tibbles; I've never liked them. I guess you just have different ways you like things, but what you'll be seeing here is the tibble. What you've been seeing up to now is the data frame version of data. You'll see this now in what I'm going to do. So I say, what are the ORs? So library() in R loads packages. Just to be clear, you know how in SAS you have Base SAS, you have SAS/STAT, and then you can add components depending on how much money you have and are willing to pay SAS? In R, you have base R, but base R is really, really lean. It's much leaner than Base SAS because it's open source; it doesn't have to have much. And then the community makes packages. There's a server called CRAN, and you get packages off of that; in fact, R GUI is automatically connected to it. That's how I put these packages in. I just refreshed my R because I was helping a customer, so before I did this today, I went and made sure I downloaded all the packages.
So when I run this library(devtools) and library(broom), it actually loads something. See that? If I ran that and it didn't load, I would have to go and install the package into my instance. Basically it's like saying, oh, I've got to load my component into SAS. Well, this is free, so I love it. It's my favorite price. Okay. So now that we've loaded these libraries, we're going to use the tidy command from one of them, broom, on this logistic model. So remember, this logistic model is not a data frame or a tibble. It's just this weird glm/lm thing that looks nice, but it's not something we can take out of the environment and graph or make a table out of and put in a journal or a dissertation or anything. So what we're going to do is tidy it up. We're going to run the tidy command on it and turn it into this object called tidy model. I just called it that. So I'm going to run this. And now we can look at the tidy model. I'm going to print the tidy model just so you can see it. And you're like, okay, it kind of looks like a data set, doesn't it? See the chr and dbl notation, and see how it says 'A tibble: 4 x 5'? That's how you know you're looking at a tibble, just from the way it prints to the screen. But it's a data set. So the first column is called term, and it's the intercept, no plan, low ed and some coll. So those are the values of that. And then the estimate; this column is called estimate. So if I wanted to add something to this, I would refer to the tibble tidy model and then the variable estimate. And this variable is called std.error, and this is called statistic, and this is called p.value. So this is a little baby data set. Okay. And actually, let's look at the class, which tells you the type of object. And see, it says tbl_df, tbl, data.frame. So this is why it's called a tibble, tbl.
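Those steps can be sketched like this (assuming the model object from earlier is called logistic_model and the result is called tidy_model):

```r
library(broom)  # provides the tidy() command

# Convert the glm/lm object into a tibble, one row per model term
tidy_model <- tidy(logistic_model)

# Columns: term, estimate, std.error, statistic, p.value
tidy_model

# The type of object: "tbl_df" "tbl" "data.frame"
class(tidy_model)
```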
Tibble. I didn't even know that before I did this. But I guess it can hold a whole bunch of objects in it. That's what's kind of confusing to me about R. A lot of people, especially mathematicians, like it, because you can hold lists in there, where there are a bunch of members of the same thing, but each member has different numbers of child members. It's super confusing, but I guess that happens in math a lot, so some people like it. I get confused. So now we have our tibble. So what have we done? We ran a logistic regression model. We created the summary model object. It was beautiful, but we wanted to take it out of the environment in, like, a CSV format. So we loaded broom and we converted the model into this tibble format, and now let's get it out of the environment. But let's stop for public health first, right? Because I don't like these log-odds slopes. I want to know the odds ratios and the 95% confidence intervals. So remember, we have the standard error and the estimate. Let's go work with those. So I'm going to add some calculations before I export this tidy model. Remember, that's the name of it, right? So tidy model with a dollar sign and then OR simply refers to that field in this table. Now, a lot of times when we're doing things like this in SAS, we're doing a DATA step, so we've already declared the output data set up top, like DATA B; SET A; and then we just start talking about variables in data set A. Here, you know, R is very transactional. It's not like starting a DATA step, where you're like, okay, this is going to be a big DATA step, we're going to be hanging around together for a while. R is very transactional. It's like, you want to add a variable? Just edit it. Just do it. Leave me out of it.
You don't run loops and stuff like that. So here I specify the output variable. I'm going to call it OR, right? It's an odds ratio. So how am I going to calculate that? Well, I'm going to take the exp of the estimate, this thing over here. And yeah, it's going to calculate it for the intercept too; I'll just ignore the intercept. But it's going to do it for the other ones. So I'm going to get this column called OR with the odds ratios in it. And while I'm there, I might as well also add the lower limit, see, the lower limit, and the upper limit. And I'm using a 95% confidence interval; you see the statistic here I'm using. And you see what I'm doing: for the lower one, I'm taking the estimate minus the margin of error and exponentiating that. You see what I'm doing. All right. So I create those things. Actually, let me run it before I forget. And then let's look at it again. Oh, see? So remember that we were actually running a model. We're predicting poor health, which is really bad. Like I said, for low ed the 95% confidence interval is like 2.3 to 2.8. So they have between 2.3 and 2.8 times the odds of having poor health compared to college graduates. And with some college, it's only 1.5 to 1.8. So that's a dose response, isn't it? All right. But here's our no plan, where it crosses one, it crosses unity. We already knew all that, and then just ignore the intercept. Okay. So now, awesome, we have what we want. Let's export it. We're going to export it as a CSV, right? Because that's easy to put in Excel. I'm just going to call it tidymodel. When you export stuff from R, it goes to the mapped directory, so let me go to where I exported it and actually open it to show it to you. Okay. So this is what the CSV looks like. It's probably what you thought it would be. It's a little hard to see, so I'll do Format Cells and just make these numbers.
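Those calculations and the export can be sketched like this (the column names OR, lower_limit, and upper_limit are stand-ins for whatever you want to call the new fields; 1.96 is the usual z value for a 95% confidence interval):

```r
# Exponentiate to move from the log-odds scale to odds ratios
tidy_model$OR <- exp(tidy_model$estimate)

# 95% confidence limits: exponentiate estimate +/- margin of error
tidy_model$lower_limit <- exp(tidy_model$estimate - 1.96 * tidy_model$std.error)
tidy_model$upper_limit <- exp(tidy_model$estimate + 1.96 * tidy_model$std.error)

# Export as a CSV so the results open easily in Excel;
# the file lands in the mapped working directory
write.csv(tidy_model, "tidymodel.csv", row.names = FALSE)
```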
See, these are the odds ratios. Remember, I was just freaking out over this one and over this one, and see, these are all out here. And the values look, I love the Format Painter. Don't you love the Format Painter? Yeah. See, these are really teeny p values, and this one is a terrible p value. But you can copy this around in Excel, you can do all kinds of stuff with it. It just makes it so easy to get your model results out. Okay. You stuck with me this long; you deserve a special bonus. So what happens to me a lot is, let's say I'm making a model like this, but I have like 20 candidate covariates, or 15, and I fit the model and I end up with maybe 10 that survive the modeling process. I love using that terminology; it's so harsh. Then I have to write about this in, like, the results. I have to figure out how to interpret it, and remember, some of those results might be risk factors and some of them might be protective factors, so it's hard to interpret. Oh, somebody's got a question here. 'Do we have any data set in R where we have the results of logistic regression, in tidy model?' No, we wouldn't, but we could. I'm, of course, creating this situation here. But you know how I used this write.csv command? If I wanted an RDS, like, if I wanted to turn this data set into an RDS, I wouldn't use that. I would use saveRDS, with tidy model and then tidy model dot RDS. Wait, does that answer your question? I hope it does. I chose CSV here because then you can get it into Excel easily. If you choose RDS, really only R can read it in. But the advantage of choosing RDS is that the next time you go to read it into R, it goes really fast and R knows what you're talking about. Just like when you use SAS and you bring in a sas7bdat, it's so happy, whereas if you bring in an Excel file, it asks you a bunch of questions. So you don't have to not use CSV. Hopefully that answers your question.
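A sketch of that saveRDS answer (the file name is just an example):

```r
# RDS is R's native format: only R reads it, but it round-trips fast
# and preserves column types, like a sas7bdat does for SAS
saveRDS(tidy_model, "tidy_model.rds")

# Reading it back in later is one call
tidy_model <- readRDS("tidy_model.rds")
```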
Getting back to the issue I have: when I have these really big final models and I have to interpret them, some of the odds ratios are risky, like that 2.3 to 2.8 or whatever, but some might be protective. Let's pretend one is taking a multivitamin or being on a vegetarian diet or something. That might have been protective; it might have been statistically significant, and the odds ratio might have been between 0.7 and 0.8, meaning that people on the vegetarian diet have 0.7 to 0.8, or 70 to 80%, of the odds of the people who are not on that diet. So they have only 70% of the odds of the outcome. And so how do you keep this all in your head? I mean, this is not a bad table, but just imagine columns and columns of that, right? And I'm trying to interpret it; it's on one of my screens and I'm writing on the other. I'm like, what is going on here? So this is what I did: I found this package in R called arm, like your arm. Somebody will have to tell me; I never know why they name these packages what they do, and then people explain the pun to me, like dplyr, like pliers, you can do anything with data. Thanks. I'm just so naive when it comes to this stuff. But anyway, this arm library, the only thing I've ever used it for, and I love it for, is this coefficient plot. So before I make the plot: you're going to see confidence limits graphed on the plot. They are not the 95% confidence interval I calculated. They are two SEs, like, plus or minus two standard errors. That's just the way the plot is built. So it's not the most beautiful plot, and I probably wouldn't put it in a journal, but oh my god, it's so easy to interpret your resulting model by looking at this plot. So first, actually, let me just run the plot, and then I'll go back and tell you what I did with the code to make it run.
So I'm going to run the plot. Okay, here it is. So let's go look at this plot. You'll see here it says 'Risk factors for poor health.' Obviously I put that title there, and poor health was what I was modeling. Oh, thank you: arm is 'Data Analysis Using Regression and Multilevel/Hierarchical Models.' Where's the arm part? Analysis, Regression, Multilevel, maybe that's it. Thank you, that's the right title. There's probably more in there, like those hierarchical models; I might want to check that out. I'm always looking for a good hierarchical model package, so maybe I'll have to look at that one. But anyway, I've just raided this package for the coefficient plot. So you'll remember that we've got an intercept in here, and the coefficient plot command needs the intercept to be in the underlying model, but you're not going to see the intercept here. Okay, so that's the first thing I want to say. The second thing is that you'll notice this is still on the log-odds scale. But it doesn't matter, because I'm just using this for interpretation, to write the results and discussion, and all I really need to know is the relative position of all of these. So you remember that no insurance was our exposure, and that low education, high school diploma, was our first confounder, I guess, and then some college was our second confounder. And as you can see, you can see the dose response here so easily. And just to be clear, what's being graphed here in the center is the log odds. And if you look closely, you'll see that there's this thick line and then a thinner line. That's the standard error: the thick one is plus or minus one standard error, and the thin one is plus or minus two. So this one is straddling unity. We knew that, right? We have a null study on our hands. And we knew that these two were very risky, right? And they're looking great.
So using your imagination, imagine I had like 20 covariates in here. For instance, let's say I had a bunch of age groups and I had chunked them up into independent binary variables — flags, right? I'd have a whole clump of those, and clumps of income flags, all these series of flags, and this plot would get really long, but what I'll do is color-code them and so forth. Let me show you the code for the plot. So I call library(arm) and then I create a vector called var_labels. It's just a vector: I said intercept, no insurance, high school diploma, some college. They're basically my labels. That's the vector. And then let me just get this all the way down here. So as you can see, this is var_labels; let me run var_labels. Now, I created var_labels with an equals sign here — I usually use the assignment arrow, so I think I must have copied this from the coefplot documentation, and I may even have misspelled it too. Well, it's on GitHub all updated; you'll see my commit, right? Okay. So these are the variable labels, but nothing's happened yet — the plot hasn't happened yet. This is just a vector of labels. Then these are parameters that set up how the output plot comes out; to be honest with you, I forget the arguments for this. Here is the main thing I want to show: this coefplot function that comes out of arm. I just made indentations, ggplot2 style, so you could see what's on each line. So first I pass in the logistic model — notice I'm passing the actual logistic model object, not the tibble that we made with tidy; I'm passing the original one. I'm setting vertical = FALSE because I wanted it horizontal. Then y — what is that? The y limits, right? I have negative one to 1.5. I don't know why I picked that. I guess I forgot that I was on the log scale — or no, that makes sense. No, I must have known I was on the log scale. I don't know.
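Putting those pieces together, here is a minimal sketch of what that coefplot call can look like. This is my reconstruction from the talk, not the presenter's exact script: the data are simulated, and the argument names (`vertical`, `varnames`, `col.pts` for point colors) are my best recollection of arm's coefplot API.

```r
library(arm)   # provides coefplot()

# Simulated stand-in for the surveillance data (purely illustrative)
set.seed(1)
dat <- data.frame(
  poor_health  = rbinom(500, 1, 0.3),
  no_insurance = rbinom(500, 1, 0.15),
  hs_diploma   = rbinom(500, 1, 0.4),
  some_college = rbinom(500, 1, 0.3)
)
logistic_model <- glm(poor_health ~ no_insurance + hs_diploma + some_college,
                      data = dat, family = binomial)

# Labels for each coefficient, intercept included
var_labels <- c("Intercept", "No insurance",
                "High school diploma", "Some college")

# Horizontal coefficient plot on the log odds scale;
# thick bars are +/- 1 SE, thin bars are +/- 2 SE
coefplot(logistic_model,
         vertical = FALSE,
         main     = "Risk factors for poor health",
         varnames = var_labels,
         col.pts  = c("darkblue", "darkorange", "blueviolet", "darkgreen"))
```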
I just picked those limits, I guess, because I wanted the plot framed nicely. I put that title. And then varnames is the option for the labels — see how I handle these two options differently. Then there's the color option. For the color option I just put in the whole vector — the letter c, I think, stands for combine. So I put the whole vector in: I said dark blue, dark orange, blue violet, dark green. And if you're curious about these colors, there are different ways you can specify colors in R. I took one of the ways, which is that R has some color names programmed in. I don't know why I picked these particular names, but you can look these palettes up online — what "blueviolet" really means, or what "darkgreen" really means. So I did that here. You know what I could have done is take this whole vector, name it something like color_list, and then just say col = color_list. I could have done that. The reason I'll sometimes move it up top and call it as a vector is that I don't know what colors I want and I want to fuss around with them, and it's just easier when the vector is up there than digging around in the code. But anyway, that's my annotation of how I got this coefficient plot and what it is for. So let's go back to our slideshow here, because I'm already over time. In summary, these are the steps I did in R, and if you go to my blog post or download this, you'll see that's what I did. First, I prepared the analytic data set in R — I used an RDS file — and I specified the candidate independent variables and the dependent variable. Then I used glm, which is in base R, to create a model object. Then I used the packages devtools and broom to run the tidy command on that object to create a tidy object, which is really a tibble, to make the results into a data frame.
Well, I said data frame — it's really a tibble, a data set you can edit. Step four was using plain data-editing commands: I added calculations — the OR, and then the confidence limits for the odds ratio. Then I exported the results as a CSV. I could have exported them as RDS, but as a CSV it's easy to open them in Excel and copy them out, or maybe use them in Tableau or somewhere else for graphing. And as a bonus, using the original glm object, I used the coefplot command from the package arm to make this coefficient plot. Again, it's not really publication quality, but it's a great way to interpret the results of the model when I'm writing about it. All right. So that was our presentation today. Let me look over at the chat; if you have any questions, I'm happy to answer them. And if you download the slides, you'll get all these links to my blog and those blog posts. Those of you who stayed this whole time — I'm so happy. If you missed it in the beginning: on September 25th, Wednesday, September 27th, and Friday, September 29th, I'm running a workshop. It's a workshop in application basics — how applications are made, how they are architected — and I'm talking about computer applications or business applications. You know how we data analysts are always being asked to analyze Twitter data — I guess it's called X now — or data from whatever kind of application? You really have to understand application architecture, and they don't teach you that in public health or in biostatistics. They teach you that if you come through a business school or learn computer science, but not everybody does that. So if you fall in the category of people who've never really been taught that, this is perfect for you. It's free, and it's set up with the online course, so all your materials are there.
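The five steps just summarized can be sketched end to end like this. This is a hedged reconstruction, not the presenter's exact code: the file names, variable names, and the 1.96 z-value for the 95% limits are my assumptions.

```r
library(broom)  # tidy(); installed via devtools in the talk

# Step 1: load the prepared analytic data set from an RDS file
analytic <- readRDS("analytic.rds")

# Step 2: fit the logistic model with base R's glm()
logistic_model <- glm(poor_health ~ no_insurance + hs_diploma + some_college,
                      data = analytic, family = binomial)

# Step 3: turn the model object into a tibble of results
results <- tidy(logistic_model)

# Step 4: add the odds ratios and 95% confidence limits by exponentiating
results$or       <- exp(results$estimate)
results$or_lower <- exp(results$estimate - 1.96 * results$std.error)
results$or_upper <- exp(results$estimate + 1.96 * results$std.error)

# Step 5: export as CSV so the results open cleanly in Excel or Tableau
write.csv(results, "logistic_results.csv", row.names = FALSE)
```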
I'll give you free access to it. Oh, there's a question. Niraj: thank you for the session — do we have any workshop or session on sample size calculation for epidemiologic studies? No, but I'll take that into consideration; I can add it to my next series. I usually do a few of these in each series. Just as a quick answer, Niraj, I like to use the application called G*Power. It was developed at a European university — in Düsseldorf, Germany, I believe. If you Google for G*Power, you'll find a lot of tutorials, and it's free. It's a sweet, cute little app. All it does is power calculations, and it's so good at them. You do have to really know what you're doing — it's like a wizard — but it's a really neat little program. To be honest, Niraj, I don't use R or SAS for calculating power or sample size; I always use G*Power. And the reason why is I like to calculate multiple scenarios. What if the effect size is small? What if it's big? What if the standard deviation is small or big? What if we don't get very much sample, or we get a lot? What if we don't have the power? I like to lay out this huge set of scenarios and then decide where I want to be in it. And G*Power is so nice because you can take a screenshot of how you set it up — what you're entering each time you make it calculate. So I would say that if you use G*Power and you make sure you know what you're doing, you're basically doing best practices. It's a really good tool, so I would advocate everybody use it. It's only for power and sample size calculations, but if you do it and keep your notes right, you're probably doing it right. SAS has PROC POWER, and I'm sure there are a million things in R, but I can never remember if I'm doing it right.
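For an R-side analogue of that multiple-scenario habit (the presenter uses G*Power rather than R, so this is just a sketch using base R's power.prop.test, with made-up proportions):

```r
# Sketch: sample size needed for a two-group comparison of proportions
# across several effect-size scenarios, mimicking the "what if" habit
# described for G*Power (all values are made up)
scenarios <- expand.grid(p1 = 0.30, p2 = c(0.35, 0.40, 0.45))

scenarios$n_per_group <- mapply(function(p1, p2) {
  ceiling(power.prop.test(p1 = p1, p2 = p2,
                          sig.level = 0.05, power = 0.80)$n)
}, scenarios$p1, scenarios$p2)

# Larger effects (p2 further from p1) need fewer subjects per group
print(scenarios)
```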
With G*Power, you're doing it right. But yeah, you should be able to find a lot. Actually, if you go on my blog and search for G*Power, something tells me I posted something on it — my blog's gotten kind of big, so I don't remember. Either that or maybe something on YouTube; just look on my YouTube channel or my blog and maybe there's something there. But getting back to this workshop: it'll be great because you'll be using my course management system, but I'll be teaching the workshop, and we have several challenges. Let's see if Ebenezer is still here — my assistant Ebenezer has been helping me get the workshops ready and everything. A lot of people have signed up, so that means when you come to the workshop and I teach you this stuff and we do these challenges, I can put you in groups and we can come back together. But don't be scared, because I also made a teacher's version of the challenges, so you can see what my solution was. That's what's fun about applications: they're actually things you design. That's part of why I want to run this workshop — I want people who are creative in the biostatistics space to also be empowered with the knowledge to be creative in the application space. All right, everybody. If you have any more questions, let me know. Thank you very much for showing up today and watching my demonstration of how to do logistic regression in R, and I look forward to seeing you at another event or at the workshop. Thank you for watching this video, which is part of the Public Health to Data Science rebrand program. If you are interested in joining the program, please sign up for a 30-minute Zoom interview using the link in the description. Thank you very much.