Well, hello everybody. Good afternoon. I'm so happy you showed up today. I'm Monica Wahee, and it gives me great pleasure and joy today to share with you some great R packages that I've run into in my work with my customers for health data analytics. Thank you so much for showing up. And I apologize in advance: I think I might have tried to cram too much into one lecture. I'm not doing any software or programming demonstrations, but I've got code for you. You can download these slides, and the links and the code are in the slides. As you can see on the left side of the slide, in the first half of my presentation I'm going to talk to you about the R packages I use when I do factor analysis. Now, I'm aware I've got a lot of SAS users here. And what's wrong with SAS for factor analysis? Nothing. In fact, SAS is probably much easier to use for factor analysis. But if you want to use R, I'll show you what I do when I use R. Then, in the second part, I'm going to show you some R visualization packages which are not ggplot2. They use ggplot2, but they're special ones that I don't see people using a lot. And what I'm kind of famous for is that I'll explain the use case. Most of the time you're making bar charts and time series plots, or it's a Q-Q plot or whatever. But once in a while you've got a special visualization need, and just matching that need up with the weird or unique plot is half the battle, right? So that's what I like to do in my blog posts: make it really clear what the cases are in which you would use this plot. Now, before I go on: next week I'm holding my online workshop, Application Basics, working with R. It's a free workshop and it's online.
So please, if you want to go sign up now. It's Monday, Wednesday and Friday of next week from 2pm to, or I'm sorry, from 12pm Eastern time. I'm sorry if you're not an Eastern time, you'll have to translate this into your time zone. But 12pm Eastern time, I'll have you block off your calendar for three hours. It depends on how many people show up because it is a workshop. What would happen is, I'm going to lecture, it's based on an online course I made, which you'll get free access to. This is a real deal. You can, and you can sign up in the link on the event, on this LinkedIn event that you obviously signed up for or you wouldn't be here, right? But please sign up if you want to come, if you have time to come, because what we'll do is I'll teach you about applications. I'll teach you about how to, like sort of the big picture of like making application pipelines. And the way I'm going to teach it this time is we're going to concentrate on R. Like when you do your breakout and your group assignments, we're going to be thinking of ways of getting R into our application pipeline, which actually is so natural for me. I remember back like 20 years ago, I hate to say that, but it was still S plus then or whatever. I got, I got kind of mad because I was trying to make a forest plot. And I remember I was, I ended up turning to R and it just never stopped after that. So I just want to emphasize this workshop is free and it's next week and it's online. So please register if you want to go, please, yes, sign up for the workshop and I'll send you the materials to get started. All right, enough of that commercial and on to the program. So first we're going to talk about factor analysis. Now, you want to download these slides if you want to link to all these, these packages, but. Oh, and also the abstract from the paper published. Here is what I'm going to, I'm going to talk about when an actual use case I had when I was doing factor analysis. 
And I was doing it in R. But before we go too far, I just want to remind everybody what factor analysis is. There are other cases where you use factor analysis, but the case I had was a customer who was making a psychometric instrument. That's one of those instruments where, if you fill it out, you end up with subscales, and there are scores on the subscales, and they mean something. You might be familiar with that from management courses: they'll make you fill out something, like the Myers-Briggs, which I was talking about with one of my customers. You know, are you an E or an I? You get a score, right? Well, if you've ever taken one of those things, this is the back-engineering. This is how the E and I thing got there, and how all those subscales got there. What happened was people started out with just a whole bunch of statements that you have to agree or disagree with. Imagine a Likert scale of one to five, where five is strongly agree, then somewhat agree, all the way down to strongly disagree, and the statements are like, "I'm the kind of person who likes to show up early to class," or "I'm the kind of person who puts off doing my homework to the last minute." Okay, so I'm kind of joking. I don't know where I'm going with this, but imagine you're making a psychometric instrument about how students behave. You might have those kinds of statements, right? So let's carry this analogy a little further. Let's say I am making a psychometric instrument to figure out what kind of student you are. I pick three pre-specified domains. I might say, okay, how good are you at homework? How good are you in class? Because homework and in class are different. And then maybe how good are you one on one? I don't know, I'm making stuff up.
So if you're making a psychometric instrument and you do what I just did, you select your domains; I just chose three. A good friend of mine told me, "Monica, first time around, put 10 on each domain." I'm like, 10? Then you can't pick that many domains. So let's say we've got 30 statements, 10 on each domain. I've got 10 homework statements and 10 in-class statements. You see where I'm going with this. And I give it to a bunch of students. Okay, now I've got this data, and the question is, are my three factors going to come out? Is homework going to hang together? Is in class going to hang together? That is what factor analysis is. And what I'm talking about happens to be confirmatory factor analysis, because we had pre-specified domains. Earlier in my career, one day, I just grabbed a bunch of statements from a bigger quality of life instrument and threw them at some medical students. But I didn't have any pre-specified domains. I kind of didn't know what I was doing. And that's how I met the person who taught me this. They said, well, you don't have any pre-specified domains, so we're doing EFA, exploratory factor analysis. And I was like, all right, let's see what happens. And we did. We got some domains out of it. But in any case, if you're in SAS and you've got my customer's data, what it looks like is a bunch of columns with values one through five in them. And I know what the columns mean. We have documentation. But our question is, are the homework ones, or whatever, all falling out on the same factors? So it's just a bunch of internal correlations. I'm bad at math, but that's the best I can say. And if you've got SAS and you put it in PROC FACTOR, you know, SAS is like a full-service restaurant. They come and give you all the information you'd ever need, all these plots. And you can sit and go, oh, do I like my factors or don't I?
But in R, you've got to tell it exactly what you want. Now, in the future, if you hang out with me and connect with me on LinkedIn or whatever, stick around, because I'll probably turn what I'm going to show you into a blog post eventually and give you some data and stuff to download. But for now, I just have it in this slide presentation. Okay, so here it is. What's the first thing that happens when I've got my customer's data, and she's got three domains, and they're a bunch of Likert scales? Well, one of the domains was IDENT. It was like identity something. And for IDENT, if you look at the slide, we had one, two, three, four, five. We had five items on IDENT. So what I did here is make a vector of the names of the variables, and I put it in a vector named IDENT vars. These are just the names. Why I did that is there's a tricky trick you can do in R. See, up here, the data set I read in, I call it analytic. That's the name of it. Well, see down here where I write analytic, then these brackets, and I put IDENT vars in them? If you just list the variables you want to keep in there, it'll keep just those. So I listed the names of the variables I want to keep, I kept them in IDENT vars, and down here I just made IDENT. And why am I doing that? Well, first, I'm going to want to do a Cronbach's alpha. So what's a Cronbach's alpha? It's a correlation test. Now, I always feel bad because I'm so bad at this. It's a correlation test that looks at how intercorrelated all the answers to those items are. So I have to pick out the items and say, okay, give me the Cronbach's alpha. And the rule is, if the Cronbach's alpha is less than 0.7, they're not very intercorrelated. And that's an issue. That means you don't have a subscale. Game over. Go back to the drawing board. So that was the first thing for my customer.
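As a rough sketch of the two steps just described (keeping only the named columns, then asking for Cronbach's alpha), something like the following would work. The data here is simulated, and the object names `ident_vars`, `ident`, and the item names `IDENT1` to `IDENT5` are illustrative assumptions, not the names on the slides. The `alpha()` function is from the psych package.

```r
library(psych)

# Simulated stand-in for the `analytic` data set (made-up data, not the
# customer's): five 1-to-5 items driven by one shared trait, plus one
# unrelated column that should be dropped by the subsetting step.
set.seed(42)
n <- 200
trait <- rnorm(n)
to_likert <- function(x) {
  as.integer(cut(x, breaks = c(-Inf, -1.5, -0.5, 0.5, 1.5, Inf)))
}
items <- sapply(1:5, function(i) to_likert(trait + rnorm(n, sd = 0.7)))
colnames(items) <- paste0("IDENT", 1:5)
analytic <- data.frame(items, OTHER = sample(1:5, n, replace = TRUE))

# A character vector holding only the names of the items to keep
ident_vars <- c("IDENT1", "IDENT2", "IDENT3", "IDENT4", "IDENT5")

# Putting that vector in brackets keeps just those columns
ident <- analytic[ident_vars]

# Cronbach's alpha for the subscale; the rule of thumb from the talk is
# that a value below 0.7 means the items are not hanging together
ident_alpha <- psych::alpha(ident)
ident_alpha$total$raw_alpha
```

The same pattern would be repeated for each domain's vector of item names.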
You'll see there are three domains here, IDENT, TC and BI, and I really don't remember what they mean. This was years ago. So first I was going to try to figure out the Cronbach's alpha for each of them. And you'll see here that I'm doing that with IDENT. I picked out the IDENT vars from analytic, and I'm using the library psych. Now, I just want to step back and say the library psych has a lot of good stuff in it. Throughout my life I bump into it kind of often, and this is one of those times. I think it's got some plots in it. It's a really good package. Okay, so now, see this code, alpha? I'm running alpha on IDENT, and IDENT is just a data frame with the IDENT vars in it. Okay, I'll bet you're dying to see the alpha. I was dying to see the alpha. Here's the alpha. This is from the console here. 0.7. Remember I said I'd throw it away if it wasn't good enough. Well, here's the alpha. So we did that for each of them. Now, I've got some bad news for you. Alpha is a little bit loosey-goosey. You have to have a pretty bad subscale before alpha doesn't like you. This was a kind of high alpha, you know, and 0.7 is kind of loosey-goosey. I don't know. It's the industry standard. Okay, moving on. The other thing we wanted to see was, when we did the factor analysis, we hoped that all the IDENT variables were intercorrelated with each other and not correlated with the TC variables, which would be intercorrelated with each other and not correlated with the BI variables. That's what we really hoped to confirm in our confirmatory factor analysis. It doesn't always come out the way you think it will, but that's what we're going to try to do. So what you'll see here is the names of all the items, right? These are the IDENT items, the TC items, there's like four of them, and the BI items, okay?
And so I put those into this one big vector; this is just a vector of names. Remember that tricky trick I told you about? SAS is so jealous about this. Can you imagine, you'd have data B, set A, and then you'd have to list the variables you want to keep, or, I don't know. Yeah, KEEP equals, remember all those? Well, here I just made this vector of all the variable names I want to keep. I put it in these brackets, and analytic is in here. And then I created, I know this is kind of an ugly name, factor_p_df. The df stands for data frame. This is just to remind me it's a data frame. Guess who comes next? Psych, right? I love psych. It's such a good package. Okay, so this is our data frame that has all of our Likert scale variables from all of our different domains, three domains that we pre-planned, so this is confirmatory, right? So I'm going to run the principal command from the psych package on this data frame. In this code, I set nfactors to three. I'm forcing it into three factors because I thought we had three factors. But I'm going to tell you the truth about what I do a lot with this customer, and she does a lot of work making psychometric measurements. I usually try three, four and five factors, and I see which fits the best, because we're trying to make an instrument. We're trying to make a measurement. We're not trying to just game it, you know; we're trying to really do a good job. And then this rotate equals varimax: sometimes the reviewers don't like that I'm doing that. They say you should only do that when you're doing exploratory, like if you didn't plan. But the problem is we do a bad job. Me and this customer are terrible. I just do a terrible job. I mean, she's brilliant, and I'm smart, but we just do a really bad job of planning these factors. We make domains. We put items on the domain.
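A minimal sketch of that `principal()` call, assuming simulated data in place of the customer's. The item counts (five IDENT, four TC, four BI), the item names, and the object names `factor_vars` and `fit3` are assumptions for illustration; `nfactors` and `rotate` are real arguments of `psych::principal()`.

```r
library(psych)

# Simulated 1-to-5 items for three made-up domains (not the real data).
set.seed(1)
n <- 300
to_likert <- function(x) {
  as.integer(cut(x, breaks = c(-Inf, -1.5, -0.5, 0.5, 1.5, Inf)))
}
make_items <- function(prefix, k) {
  trait <- rnorm(n)  # one latent trait per domain
  items <- sapply(seq_len(k), function(i) to_likert(trait + rnorm(n, sd = 0.7)))
  colnames(items) <- paste0(prefix, seq_len(k))
  items
}
analytic <- data.frame(make_items("IDENT", 5),
                       make_items("TC", 4),
                       make_items("BI", 4))

# One big vector of item names, then the bracket trick to keep them
factor_vars <- c(paste0("IDENT", 1:5), paste0("TC", 1:4), paste0("BI", 1:4))
factor_p_df <- analytic[factor_vars]

# Force three components and apply a varimax rotation
fit3 <- principal(factor_p_df, nfactors = 3, rotate = "varimax")

# Show the loadings, hiding small ones so the pattern is easier to read
print(fit3$loadings, cutoff = 0.3)
```

To compare solutions as described, the same call could be repeated with `nfactors = 4` and `nfactors = 5`.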
Then we do this and we get totally different domains. You'll see what happens here. So we run the principal command on this. I force it into three factors, but remember, in real life I look at three, four and five. I throw in the varimax, and I create this object for three factors, and we'll look at that next. So as you can see, this is the output. RC1, RC3 and RC2 are the three factors I forced it into, with their loadings, and you can easily see what I want to load together. Like, I want all these. This one says .80. I wish all of them said .80, but this one says .24. Okay. So this one, obviously, maybe wasn't whatever I thought it was. But then what's going on down here? See, these are all high, and I thought these would be high and it would be low here. And then where it says TC, I thought these would be high, but TC is nothing like that. TC is actually over here. It's just a mess. All right, but that's why you do research, right? So we had to pick through this and figure out, well, what do we have? Do we only have three? Do we have four? We just had to do something about it. Now, another thing you can do when you're trying to do what we're doing is make a scree plot. I'm going to just admit to you I don't understand all this code. I went and stole it from the internet, right? Gotta love the internet. Gotta love open source R. So the package I'm using is nFactors, and remember this data frame from a minute ago? Well, I'm using that same data frame, and we're getting eigenvalues. Okay. The first step is to put them in this object, and then run this super complicated thing I totally don't understand. You just follow this code, just do what they say, and voilà, you get this. Isn't this gorgeous? This is one of the most gorgeous things I've ever seen. I love it. I don't even want to touch it. I don't even know how it got there. But what you care about is this line.
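For reference, a widely circulated pattern for drawing a scree plot with nFactors looks roughly like the sketch below. This is not necessarily the exact code on the slides. The data is simulated, and the object names `ev`, `ap`, and `ns` are assumptions.

```r
library(nFactors)

# Simulated 1-to-5 items for three made-up domains (not the real data).
set.seed(1)
n <- 300
to_likert <- function(x) {
  as.integer(cut(x, breaks = c(-Inf, -1.5, -0.5, 0.5, 1.5, Inf)))
}
make_items <- function(prefix, k) {
  trait <- rnorm(n)
  items <- sapply(seq_len(k), function(i) to_likert(trait + rnorm(n, sd = 0.7)))
  colnames(items) <- paste0(prefix, seq_len(k))
  items
}
factor_p_df <- data.frame(make_items("IDENT", 5),
                          make_items("TC", 4),
                          make_items("BI", 4))

# Step 1: eigenvalues of the item correlation matrix
ev <- eigen(cor(factor_p_df))

# Step 2: parallel analysis on random data of the same size, used as a
# reference for which eigenvalues are larger than chance
ap <- nFactors::parallel(subject = nrow(factor_p_df),
                         var = ncol(factor_p_df),
                         rep = 100, cent = 0.05)

# Step 3: combine both into the scree criteria and draw the plot
ns <- nScree(x = ev$values, aparallel = ap$eigen$qevpea)
plotnScree(ns)
```

The plotted line of eigenvalues is the part to read: the number of components before the curve flattens out is the candidate number of factors.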
Okay, so my customer and I thought we had three domains, and if this had looked the way we thought it would, it would be one, two, three, and then there would be an inflection, right? But here, I don't know. It looks like either there's a big inflection here, remember, one and two were kind of together, or maybe you could argue the inflection is here. But this just didn't come out as cleanly as we wanted it to. All right, so does anybody have any questions about that? Because that was my demonstration of how to do the factor analysis portion of this. Next, I'm going to go on to the ggplot2 visualizations. This is kind of a grab bag today. Okay, so you'll see on the slides I've got three plots for you. And they're special because they're not ggplot2 plots per se. I mean, they probably use ggplot2, but they're not the same ggplot2 plots we always make, which are time series and bar charts, and you always have to make a histogram. I'm just going to go over three, and the links to these blog posts are in the slides, and we also have the code here. I'll show you how to make these. So the first one I'm going to talk about is the upset plot. First, let me just go down here and explain what it is, because I hadn't even really heard of it. The upset plot is really good, and you don't have to publish it all the time, but it's really good when you're wondering how things are grouping together. Okay. So in this case, the problem was we had this data set of people who had one or more chronic diseases. And I was trying to figure out, do they all have one chronic disease, or do they have patterns of chronic disease? And this really answered it. The way you read this slide is, these are the chronic diseases. Okay. So for the first one here, arthritis, there's just a dot. This represents the group that only has arthritis. Okay.
And this is the total number of people who have arthritis here. And this is the total number of people who have this combination, which is only one thing. So in this data set, the top three situations were having just arthritis, having just depression, or having just diabetes. So this answered my question. I'm like, are all these people having multiple things, or just one thing? But as you would guess, since depression and arthritis are really common in this data set, the fourth most common pattern was depression and arthritis together, here. And you can see under here, this is the raw depression count and the raw arthritis count. And here is the count of the pattern. Okay, and I'm just going to tell you, this upset plot has changed my life. I'll tell you where it changed my life. It changed my life when I was trying to visualize microbiome data, where you can have multiple bacteria or multiple pathogens. One experimental unit can have just one pathogen, or just another pathogen, or both of them, or organisms, you know, those kinds of situations. Let me see, another one is when people are taking drugs, like taking just lipid-lowering drugs or just hypertension drugs, whatever. This is so useful for that. And I'm using the package UpSetR, which is easy to use. It's just that you've got to put on your thinking cap, because you've got to format your data in a way that makes it easy. Actually, if you just go to my blog post and follow what it says, it's good. You can specify all these colors and stuff, but don't bend your brain like I did. As soon as I figured this out, I made this blog post so you can follow it, and then you know what to do. All right, so that was the first one. That's the upset plot. So don't be upset. All right, here's the second thing. Now, you're probably like, okay, Monica obviously does a lot with Likert scale stuff.
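To give a sense of the data shape UpSetR expects, here is a small sketch with simulated data: one row per person and one 0/1 column per condition. The prevalences, the colors, and the object name `conditions` are made up for illustration. `upset()` and its `sets`, `order.by`, `main.bar.color`, and `sets.bar.color` arguments come from the UpSetR package.

```r
library(UpSetR)

# Made-up data: one row per person, one 0/1 indicator column per
# chronic condition (the prevalences are invented for illustration).
set.seed(7)
n <- 500
conditions <- data.frame(
  Arthritis  = rbinom(n, 1, 0.35),
  Depression = rbinom(n, 1, 0.30),
  Diabetes   = rbinom(n, 1, 0.20)
)

# Keep only people with one or more conditions, as in the talk's data set
conditions <- conditions[rowSums(conditions) > 0, ]

# Bars across the top count each exact combination (the dot patterns);
# bars on the side give the raw count for each condition.
upset(conditions,
      sets = c("Arthritis", "Depression", "Diabetes"),
      order.by = "freq",
      main.bar.color = "steelblue",
      sets.bar.color = "gray40")
```

Ordering by frequency puts the most common pattern first, which is what makes it easy to see whether single conditions or combinations dominate.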
And I certainly do. So here's the problem. Remember the customer I was just talking about with the factor analysis? Okay, well, let's say we want to look at the actual raw answers that people gave. Like the homework example I was giving: "I'm the kind of person who shows up early for class," or whatever. How many strongly agrees are there? How many strongly disagrees? Well, this visualization is so awesome. And I didn't invent it. It's the likert package. Okay, so let me just go through how you interpret this visualization. In my data, I often encourage my customers to use five levels in their Likert scales. Once in a while I might bend that, but most of the time people just think that way. And I encourage them, in whatever language they're using, to have the top one be five and be strongly agree, the next one be somewhat agree, then neither agree nor disagree. That's just really clear. It's clearer than "neutral." Then somewhat disagree and strongly disagree. And if you format your data that way, then this package works really well, because here are just five items, right? Now, these items could have been in any order in the native data set. What the likert package does is order them on the plot by the most agreement. So it adds up strongly agree and somewhat agree. And actually, I don't know, you probably can't see the numbers here. Okay, so down here on the x-axis, you'll see this is zero here in the middle. This is the three. And then the percentage out here is the percentage of people on the agree side, and this is the percentage of people on the disagree side. I just made this data up. But whatever this is, you can see that there's a large percentage of people strongly disagreeing here. And also this middle part here: you can see there are numbers for the middle part. This is how many people are in the middle, you know, who just chose the middle one.
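A small sketch of how data coded this way could be handed to the likert package. The survey data, the item names, and the helper `to_factor` are made up for illustration. One item deliberately has no "Strongly disagree" answers, and declaring all five levels up front is one possible way to keep every item on the same set of factor levels.

```r
library(likert)

# The five response labels, from 1 (bottom) to 5 (top)
agree_levels <- c("Strongly disagree", "Somewhat disagree",
                  "Neither agree nor disagree",
                  "Somewhat agree", "Strongly agree")

# Turn 1-to-5 codes into a factor that always carries all five levels,
# even when nobody picked one of them
to_factor <- function(x) factor(x, levels = 1:5, labels = agree_levels)

# Made-up survey answers; item names are invented for illustration
set.seed(3)
n <- 150
survey <- data.frame(
  early_to_class   = to_factor(sample(1:5, n, replace = TRUE,
                                      prob = c(0.05, 0.10, 0.15, 0.30, 0.40))),
  homework_on_time = to_factor(sample(1:5, n, replace = TRUE)),
  # nobody answered 1 here, but the factor still has five levels
  asks_questions   = to_factor(sample(2:5, n, replace = TRUE))
)

# Summarize the items and draw the centered stacked-bar plot
lik <- likert(survey)
plot(lik)
```

The plot stacks the agree side to the right of center and the disagree side to the left, with the middle category split across the center line.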
You see these numbers over here? These are the strongly agree and somewhat agree numbers together. And these are the strongly disagree and somewhat disagree numbers together. So it gives you this really easy view; you can just sit and stare at this. Like in the example I was just giving you about the factor analysis, we had a whole bunch of items in there. I could break them apart into smaller data sets of just the domains I was looking at, or whatever, and make this chart. The problem with making this chart is oftentimes you'll do a survey, and nobody will say one, like strongly disagree, for something. And if that happens, it breaks this Likert thing. So I've got this hack on here. And it is a hack. It's really ugly. In fact, if you go to the associated YouTube video, you'll see somebody made a comment with some code in it, which is the elegant way of doing it. Mine is the I-don't-know-how-to-code way of doing it. So my way works, right, but it's not elegant. But in any case, it gets around the fact that there might have been a missing value, like nobody said four for one of them, or nobody said three for one of them. And this fixes it. It doesn't put in any fake data or impute any data. It just tricks R into not freaking out and just making your plot. To be more specific, this plot relies on factor levels for ordinal variables, which is an R thing and not a SAS thing. And if we can't get those factor levels right, it doesn't make the plot right. So I encourage you to go to that one. If you've got Likert data, try that one out. And then finally, the last one I've got for you is a dumbbell plot. Where I've seen it is when you have an opinion and you have two political parties, right? So let's say that the blue was Democrats and the red was Republicans, right?
You might see something like, here is what percentage of people think you should have universal healthcare. And this might be something like 80%, and this might be something like 50%, or something like that. And so this just shows that, and then see, this one has flipped around. What I did is I just looked up some ratings online, on some different axes, about two different colleges, just to show you how you can use it for ratings. Because actually, what was happening to me in real life was I was looking for a hotel. I don't even remember why. I think I was going to a wedding in Cleveland. And there was one of those areas near the wedding where a whole bunch of hotels are in the same area. And I was trying to get the best value hotel, so I was on Travelocity or something. And they'd say overall they're all like 4.5 or whatever. But when I went to look at each of them, they had... It's not Travelocity, what is it? I'm forgetting. But they had multiple ratings, like convenience and value. And I wanted to compare those individual ratings, and that's where I sort of came up with this. So that's one more, and this is using, I forgot, it's some library here. Yeah, it's using ggplot2, but with this ggalt. I really like this ggalt. I think I've done something else with it before. You get to know these things. And then see, you get this dumbbell. Oh, and I even made my own legend on here. Sometimes I do that. You know, R is so flexible. You can just do anything you want with it. That's what I love about it. So if you want to do any of those plots, just get the slides here and go to those blog posts, because I've actually got code there so you can try it yourself. So that's what I had to present for you today. I had the factor analysis packages, and I had these plot packages.
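For anyone who wants the gist of the dumbbell plot before going to the blog post, here is a minimal sketch using `geom_dumbbell()` from ggalt. The two hotels, the rating categories beyond convenience and value, and all of the numbers are made up for illustration. Depending on the installed ggplot2 version, ggalt may print warnings.

```r
library(ggplot2)
library(ggalt)

# Made-up ratings for two hypothetical hotels on several axes
ratings <- data.frame(
  category = c("Convenience", "Value", "Cleanliness", "Service"),
  hotel_a  = c(4.5, 3.8, 4.2, 4.0),
  hotel_b  = c(4.0, 4.4, 4.1, 3.6)
)

# One row per category: a dot at each hotel's rating, joined by a bar.
# Where the red dot sits left of the blue one, the comparison has flipped.
p <- ggplot(ratings, aes(y = category, x = hotel_a, xend = hotel_b)) +
  geom_dumbbell(colour = "grey60", size = 2,
                colour_x = "blue", colour_xend = "red",
                size_x = 4, size_xend = 4) +
  labs(x = "Rating (blue = hotel A, red = hotel B)", y = NULL,
       title = "Ratings by category for two hotels")
p
```

The length of each bar shows the size of the gap, so categories where the two options really differ stand out even when the overall ratings look the same.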
And again, I want to remind you I'm Monica Wahee. I hope you're having a good week, and next week I'm having my online workshop. What we're going to be talking about is application basics: how do we put applications together, and how do we make them work together? And I'm going to have a special focus on R, on how to work R into your application pipelines. So it'll be a real hoot. You'll have fun. And even though we'll be going off of an online course that I have on that, which you'll get free access to, we're going to be doing stuff interactively and talking about different application solutions. So you're going to love it. Please make sure to sign up. Thank you very much to everybody who showed up today. I hope you enjoyed what I was here to tell you, and I hope you try out some of these awesome packages and have an enjoyable week. Thank you for watching this video, which is part of the Public Health to Data Science Rebrand program. If you are interested in joining the program, please sign up for a 30-minute Zoom interview using the link in the description.