Hi, everyone, and welcome to the Survey Reporting and Its Application workshop, conducted by Kevin Fommelant. I'm Sue Boffman from ARL, and I'm very pleased to welcome everyone here this afternoon. As we were just talking about chrysanthemums and pumpkins, I hope everyone is having a good fall. Today's workshop is part of our series of training opportunities for the Research Library Impact Framework initiative. It is the last in the quantitative research series, and our thanks to all of you who have attended the workshops since the beginning of 2021; it's hard to believe it's been ten months since we started. As is our practice, the session is being recorded, and we will share the recording, slides, and any other documentation from Kevin with everyone following the second session, which will be held on Thursday, October 14th. As always, you are welcome to share the materials with colleagues who could not attend one of the sessions. So with that brief intro, Kevin, let me turn the virtual podium over to you.

Thank you for the introduction, Sue. Before I get started, I want to remind everyone that if you have a question, feel free to interrupt me verbally. I'll be monitoring the chat, but sometimes it's hard for me to focus on it while presenting, so go ahead and jump in with any questions.

Today I'll be covering survey reporting and its application, with some advanced methods and statistical analysis. If you were able to attend the first three sessions, this is the capstone where we put everything together. I'll review the workflows I use to get a project started, develop a survey instrument, field the survey, and conduct the initial analysis. In the last part we get deeper into statistics, so we can think about how to report to our different stakeholders, what their interests are, and how to represent our data most effectively.

This is a life cycle. I work on a yearly schedule: typically I'm asking the same population for their opinion every single year, which could be the case for you too. It could be more often or less often, depending on whether you're interested in learning about a specific intervention or event you've planned. We've also done projects where I match respondent data at the level of the individual across ten or twelve events over a year-long period for a seminar series. All of this applies to any sort of timeline, but I think the life-cycle approach is helpful because it keeps you thinking about the end of your project: as you reach your conclusions, you're already thinking about changes to make and ways to preserve the data you have, so you can use it to drive the survey effectively in the next year.

The first session covered web survey design, and then data collection and cleaning, which is how the project gets started. Then comes some exploratory data analysis: I'll take you through how we put thematic composites together, along with some simpler statistics that are still important and that I actually use most often in my survey reporting. There's a little bit about dashboards, as a reminder of visualization techniques used in survey projects, and then more time spent on advanced statistical methods, particularly weighting versus imputation, and then some regression analysis work.
I have a case study for you about survey response rates that I've used in my own work. It could be of use to you, and it's nice to see how regression is used in an applied fashion, along with some other methods that I shy away from in survey research.

For web survey design, this is the part of the project where we're thinking about the research questions. If we're on the yearly cycle, we're thinking about what's important to our stakeholders and what we want to monitor opinions about internally. This is when we operationalize our different areas of interest. If you're interested in library services, you can divide that up into separate areas to get at in your survey: interactions with students and staff, a seminar series, opening times. You could also ask questions specific to different age groups and their interests. So you have your general concepts, but at the beginning of the life cycle you're turning those concepts into a series of questions, hopefully in a logical order, from most general to most specific, so you can guide your respondent through the survey in a way that's understandable.

Part of doing that logically is creating gate questions. Those are the questions where you ask whether a respondent has experienced a service or has interacted with a staff member. Their purpose is to narrow the population for a question to those who are able and eligible to answer it. Gate questions are an important way to guide respondents through your survey so that you collect the most useful information. They also reduce the burden on the respondent, so they're not answering questions they have no experience with or no opinion on. Sometimes respondents will feel the need to tick a box even when they know they haven't experienced the service, because they're in the survey and they're completists; they like to answer everything. Gate questions are the way to manage that and steer them through the survey in the right way.

Also use expository text. When you're designing your survey, scales may be intuitive to you, but it's important to describe them in detail to the respondent in the survey instrument. If you give them a one-to-ten scale, some people will be confused about which number corresponds to the most positive feeling. If you have something like "never, sometimes, usually, always," you can go into a little more description about what each of those means. Or if you want to orient the respondent to a specific time period or a specific event for a set of questions, expository text is how you tell them: this is what I would like you to think about when you're answering these questions.

Skip logic is the other tool we use to manage respondents as they go through the survey. If a respondent gives an answer to a question revealing that they would not have a relevant answer to a following question, say a gate question asks whether they've ever visited the library front desk and they say no, then skip logic skips them past those questions to the next item on your agenda, which might be the software programs they use in the library.
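To make that concrete, here's a minimal sketch in R of enforcing a gate question at analysis time, so answers that slipped past the skip logic don't count toward the follow-up question. The column names (visited_library, q6_satisfaction) are hypothetical.

```r
# Minimal sketch: enforcing a gate question during analysis.
library(dplyr)

responses <- tibble::tribble(
  ~id, ~visited_library,        ~q6_satisfaction,
  1,   "Yes",                   4,
  2,   "No",                    3,   # answered despite failing the gate
  3,   "No, but I intended to", NA,
  4,   "Yes",                   5
)

# Only respondents who passed the gate should count toward question 6;
# blank out answers from anyone the skip logic should have routed past.
cleaned <- responses %>%
  mutate(q6_satisfaction = ifelse(visited_library == "Yes",
                                  q6_satisfaction, NA))

mean(cleaned$q6_satisfaction, na.rm = TRUE)  # mean among eligible respondents only
```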
When we're thinking about survey research, we're thinking about reliability and validity. Reliability is the ability to get the same responses from the same survey over different periods of time: if a respondent sees a question one week and sees it again the next week, they'll interpret it in the same way. Validity means that we are measuring what we think we're measuring. I like to focus on both of these topics when we're dealing with diverse populations, because they may have different interpretations of questions. That's why the expository text is so useful: you can only do so much in the question text itself to guide the respondent, but the expository text is there to help them.

Then there are answer choice scales. Best practice in survey research is to limit the number of answer choice scales you use in a survey, but of course it's not practical to have just one or two. Sometimes you want a finer opinion about a topic, like a warmth or coolness rating on a one-to-ten scale, and sometimes you want a less fine scale, when you just want a yes or no on whether they're satisfied or unsatisfied. That can depend on your preference, and the wording of a question may dictate its answer choice scale. So it's okay to have different answer choice scales in your survey, but best practice dictates that you limit them as much as possible.

I have a few examples of questions that I've used. The top one, question five, is just a gate question: "Have you visited the library in the past year?" with the choices "Yes," "No," and "No, but I intended to." Depending on your interpretation, or what you want the next series of questions to be, you could allow "No, but I intended to" to answer the next question, but typically I really need that person to have visited the library to be able to answer the next question, as in this case. You don't want people who haven't visited the library answering question six, so we force the people who answered "No" or "No, but I intended to" to skip question six.

We talked a bit in earlier sessions about data collection and cleaning, but one thing that tends to escape attention when designing a survey project is the fielding timeline. Typically I'll leave a survey open for at least a month and send three reminder emails, since I usually work with web surveys and with people for whom I have email addresses. In that fielding timeline, I'm deliberately sending emails at different times of day, on different days of the week, or even on the weekend, so that I'm catching different segments of the population at times convenient for them. If I sent at the same time every week, it might be the case that I'm only catching the subpopulation that isn't busy at that time and is at their email. So I definitely encourage you to send reminder emails at different times, to do the best you can so that everyone has a chance to participate in your survey, since you want your respondent sample to resemble your population as much as possible, in age and other demographic characteristics.
Data cleaning is another important point, since you want to preserve the data that's most relevant to your project. The first thing to do is disqualify respondents who are not in the sample population. That happens more often than you'd think: someone gets forwarded an email with a link to your survey, and they've never visited the library, or they aren't even from your institution, and it's possible they went ahead and answered the survey anyway. A lot of times in my surveys I ask respondents to verify their email address. It may also be the case that you don't have any identifying information; in that case you could use the IP address, which is an option in SurveyMonkey and other web platforms. Just know that with web surveys you will occasionally get someone from outside your sample population answering the survey, so do your best to disqualify those respondents.

Another thing that happens quite frequently is that a respondent submits more than one response. They'll click on the link right when it comes to them, get halfway through the survey, and maybe get to the end and submit with only some of the questions answered. Then they forget that they did it, or they change their mind about something and decide to take the survey again, click the link again, and go through the survey entirely. The typical procedure is to keep the most complete survey, and in cases where the respondent has answered all the questions twice, to keep the most recently completed one. That's the rule I use to deduplicate. This is important because you don't want duplicate responses in your data, doubling the weight of one person's opinion with respect to everyone else's.

The last point on data collection and cleaning is identifying survey completes. This really depends on your needs and what you want to get out of the project. You could take the broad view that you want absolutely everyone's participation, and that as long as someone answered a single question, you include them in the data: you have their email address, they answered one question, you know they're at your institution, and they might have an opinion relevant to your work, so you keep it. On the other hand, you might be more strict: you want someone who has visited the library and done a few activities, so you require them to answer several gate questions (yes, they've been to the library; yes, they've done this; yes, they've done that), and those three gate questions can be enough to qualify them for the survey. Other methods require answering half the questions in the survey, which is pretty strict. I encourage you to think about how much participation you need to get useful information from your respondents, and to set a threshold that works for you. Typically I use a certain number of gate questions; for this project I set the threshold at answering at least two gate questions, since I discovered that those who answered zero or one were answering so few questions, less than 10% of the entire survey, that I couldn't justify to the client that they had had a significant experience with the client's services.
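Here's a minimal sketch in R, with dplyr, of the deduplication and completes rules described above, assuming a hypothetical data frame raw_responses with an email column, a submission timestamp, question columns named q1, q2, and so on, and three gate questions.

```r
library(dplyr)

# count answered items per response (question columns assumed to start with "q")
raw_responses$n_answered <-
  rowSums(!is.na(raw_responses[grep("^q", names(raw_responses))]))

deduped <- raw_responses %>%
  group_by(email) %>%
  # keep the most complete response; break ties with the most recent
  arrange(desc(n_answered), desc(submitted_at), .by_group = TRUE) %>%
  slice(1) %>%   # one response per person
  ungroup()

# survey completes: answered at least two of the three gate questions
gate_cols <- c("gate1", "gate2", "gate3")
deduped$n_gates <- rowSums(!is.na(deduped[gate_cols]))
completes <- deduped %>% filter(n_gates >= 2)
```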
Exploratory analysis is actually my favorite part of survey research. This is when you start to dive into your data and discover what it is that you found in your research project. I do want to stress that I run some statistical analyses in this exploratory work, because if I have enough information, if I'm doing a more standard survey report, my project could end here at exploratory analysis. I'm doing my t-tests, I'm looking for correlations, I'm giving a descriptive summary of the mean score for each question, and that can be enough for the report. A lot of my work never goes beyond this exploratory analysis into the more advanced statistical methods, and that's definitely a viable path in survey research. We'll get into the more advanced topics shortly, but for your purposes, some exploratory analysis could be enough, and you can move on to more advanced statistical methods if you want to do more internal quality control, as opposed to just reporting.

In exploratory analysis, I go back to my research questions and remind myself why it is that I set out to field the survey. In my work, it's typically to find out whether opinions have changed over time, since a lot of the work I do is timeline-based: there's a particular service that is new, or that has changed, or that my client simply wants to track, and they want to know how respondents' opinions of that service are changing over time. That's my standard research question. For you it might be something like a seminar series, or some other point of interest. I know from reviewing some of your questionnaires that there's a lot of interest in learning about the effect of COVID on library operating procedures, and I think that's a very useful application of survey research: take a current event and capture that data as it comes in. It's really valuable to see how services work in times when things aren't running as usual, and then, once things return to normal, to see whether there's any change in opinion attributable to that event, or whether, somewhat counterintuitively, opinion remained the same. What different priorities did students have during COVID and after COVID? So timeline research questions can be useful.

The second point here is item analysis, which is just a fancy term for developing composites. Composites are groups of questions that are thematically related. Think about your survey structure: we're going from most general to most specific, but there are different topics we cover in the survey itself. The first part might be overall services, the next part might be opening hours, and the third part might be software availability.
So we can create composites of thematically related questions. When we design the survey we already have a sense of what the themes are, and we use a statistic called Cronbach's alpha to form these composites, so that we can report scores thematically instead of at the question level, which is less digestible. It really helps people who haven't been involved in designing the survey research project to get a better understanding of what the purpose of the work is, and to map from the narrow questions back to the research questions, the themes you're interested in discussing with your stakeholders.

So I'll take you through some initial statistical tests in exploratory analysis; the purpose of doing this work is so that you can formulate a workflow for the next step in your process after visualization, and then dive into the advanced statistics. There I'm thinking about subtler issues: how to report respondent opinion among different subgroups, how to deal with missing data, and the biases in nonresponse. Nonresponse bias comes up a lot in survey research since, as you know, there are subgroups who traditionally have answered surveys less often than other groups; in my line of work it's particularly challenging to get survey responses from Latino respondents, so we're always trying to come up with ways to get broader participation beyond the obvious ones, like providing the survey in different languages. Nonetheless, it's still often the case that I get different response rates from different subgroups. There are ways of dealing with this, and some of them are well intended but, I think, may not actually make the problem better, so I want to cover that in some detail in this presentation, using regression to figure out which method to use and how to employ it when you're developing your final report, so you can use the right language to describe how you've gone about something like imputation, which is a method we use to deal with missing data. This presentation really goes from broader topics to very specific ones: common problems in survey research, and methods to resolve them, particularly when you're reporting to different stakeholders with different interests.

The first thing I do, beyond descriptive statistics, is look at my mean scores. Mean scores are just your question results, say a "very satisfied" to "very dissatisfied" scale, translated onto a one-to-five numeric scale. If most people answered "very satisfied," your mean score will be close to five; a perfect score, for example, would be exactly five. That's the first thing I do, but after that I don't make any decisions based on mean scores alone.
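A minimal sketch of that mean-score step in R, assuming a hypothetical data frame survey with a question column q11 holding scale labels:

```r
# Translating scale labels into numeric mean scores (labels are hypothetical).
scale_map <- c("Very dissatisfied" = 1, "Dissatisfied" = 2, "Neutral" = 3,
               "Satisfied" = 4, "Very satisfied" = 5)

survey$q11_score <- scale_map[as.character(survey$q11)]  # label -> 1..5
mean(survey$q11_score, na.rm = TRUE)                     # the reported mean score
```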
Item analysis, which I typically do in R, is what I use to develop composites. As I mentioned before, these are just thematically related questions, and Cronbach's alpha is not a statistical test; it's not a hypothesis test. It's really a heuristic that we use to select which questions go into which composite, so that when you're reporting your results you can talk about things like customer service, or overall satisfaction, or any other concept that maps to your interests. Reporting "we got a mean score of 2.5 on question 11 about this specific service" is harder for people to digest, so composites are both statistically helpful and conceptually helpful.

When you do this statistical work, R (and most other statistics programs) will give you a nice table with the alpha statistic for a group of questions: the overall Cronbach's alpha for the group, and what would happen to the alpha if you deleted each question. In my imaginary six-question example, if you delete question four, the alpha goes much higher, which suggests that question shouldn't be included in that composite; either it could be a standalone question, or it could belong in a different composite. I do want to point out that gate questions typically aren't included in these thematic composites. The ones that ask whether they attended, or whether they interacted with the service, you can eliminate from the composite immediately, and then run your statistics on the rest of the questions. As for guidelines on the alpha: the idealized questions one through six have very high alpha scores, and anything above 0.5 is typically within the realm of including a question in a composite. But again, it is a heuristic, not a decision maker; you can't just rely on the number itself. It's really your own judgment in combination with the guidance from the alpha statistic.
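Here's a minimal sketch of that item analysis in R using the psych package's alpha() function, assuming hypothetical numeric-scored columns q1 through q6 with the gate questions already removed. The "alpha if item deleted" output is the column described above.

```r
library(psych)

# items: the numeric-scored questions in one candidate composite
items <- survey[, c("q1", "q2", "q3", "q4", "q5", "q6")]

res <- psych::alpha(items)
res$total$raw_alpha        # overall alpha for the candidate composite
res$alpha.drop$raw_alpha   # alpha if each item is deleted; a jump for one
                           # question suggests it belongs elsewhere
```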
Once I've identified my composites, I can move on to some other statistical tests. I didn't get to cover this in the last workshop or two with you all, so I wanted to bring it back: Student's t-test and what it is used for in survey research. I've gotten questions on this because, of course, when we do a survey project we want to find out what is a significant result. Sometimes I'm just reporting mean scores and trends, and that can be enough, but especially when you're doing year-on-year comparisons, there's always interest in finding a significant difference instead of one that's just a trend. I do want to caution that it's difficult to run the t-test between questions within the survey itself, since I've rarely seen a survey where every single question has the same scale, and that's one of the assumptions needed to run a t-test. If your questions are differently scaled, that really disqualifies the t-test. That's why the more useful applications of the t-test in survey research are to compare year-over-year changes in a certain question or composite, or to make comparisons between subgroups within a single survey year. If you want to know the difference in responses between younger and older respondents, say students versus faculty members, that's a great use of the t-test, as is making comparisons from last year's survey to this year's.

Even if some questions on your survey do share the same scale, it's hard to report within-survey t-tests, because the result is very scattershot; it really depends on the scales used in each question, and if you're only reporting a subset of those questions with a t-test, it can make things a little more confusing for someone reading your report, which is why I tend not to include that in my work. I also want to point out that this test involves the rejection of a null hypothesis, something I always like to say explicitly: our assumption is that there is no significant difference, and as I'm sure many of you know, we use this test to find out whether we can reject the null hypothesis that the mean scores are the same.

That leads to the next slide, a mean score comparison. On my survey, the outcome metric is overall satisfaction, and I have it divided into two age groups. A key point in survey research is that even if you haven't identified certain groups at the beginning of your survey work, your analysis is really when you can identify the subgroups. For race and ethnicity that's pretty intuitive: whichever race or ethnicity they choose is what it is. Whereas with an age group, you can bin it. If respondents give an exact age, like 23, you might find you have an enormous number of respondents in their twenties, depending on your population, while a similarly sized group is 30 and over. In that case, that would be the appropriate binning for running the t-test and finding whether there's any significant difference in the mean scores.

I will also say that in survey research, when I'm thinking about race and ethnicity, there are ways to bin different groups together. For example, for Health and Human Services, where I have some contracts with the government, they prefer to report white respondents and underrepresented-minority students separately (I work with training programs, so the respondents are students as well). They'll include white respondents and Asian respondents, and then anyone else who identifies as a different race, as multiple races, or with the Latino ethnicity gets included in the designation "underrepresented minority." This is common in survey research: if you have a question about race, and then another question about ethnicity, you don't have to think about it too much ahead of time, as long as you put everything on the survey that you think you'll need. At the end of the survey, you can map those race and ethnicity values together so that you can report them in a way that is standard for your institution or, in my case, for a regulatory authority. Or you might find it useful to report them entirely separately, because you have a very diverse population and you're getting a large number of responses for each subgroup.
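A minimal sketch of that workflow in R, binning a hypothetical age column and running the t-test on overall satisfaction; the column names are assumptions.

```r
# Bin exact ages into two groups of comparable size.
survey$age_group <- cut(survey$age,
                        breaks = c(0, 29, Inf),
                        labels = c("Under 30", "30 and over"))

# Welch two-sample t-test: can we reject the null hypothesis
# that the two groups have the same mean satisfaction?
t.test(overall_satisfaction ~ age_group, data = survey)
```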
One of the issues that I typically run into is that I can't report very small subgroups for privacy reasons: if I get fewer than ten respondents in a certain race or ethnicity subgroup, it can't be reported, which means their opinion isn't being heard on that particular topic. So there are two things at work here. What I end up doing is binning that population with another race or ethnicity subgroup under the rubric of underrepresented minority, so that their opinion is counted and they aren't excluded from the survey process. They took the time to answer the survey, and I want to represent everyone's opinion as accurately as possible. So, to state it clearly: when you design your race and ethnicity questions, you don't have to have the reporting already mapped out, as long as you collect all of the answers you need in advance.

One thing I want to stress about the t-test is that it doesn't help you establish which questions are most important. That's a different task, which we'll come to, but it's worth emphasizing that the main use of the t-test is mean score comparison across time or across groups.

The next element of the exploratory analysis is the Pearson correlation coefficient, and I find this very useful; it's probably the statistical test I use most often in survey research, since it really builds on the work you've done with composite development. You have those thematic composites, and you typically have some outcome measure, like overall satisfaction. You, and your stakeholders, probably want to know which aspects of your work are driving that overall satisfaction, that outcome measure, and that's really what the Pearson coefficient can do. Its square, the R-squared value, which I'll get to in a second, tells you the share of variance that a thematic composite contributes to the outcome. I'm not going to go into the equation here, but the important thing to remember is that it is a hypothesis test, so it's different from the alpha: it's not just a heuristic. You can look at significance measures and translate the result into a p-value, but I like to report the correlation coefficients themselves across all the questions together, so we can make comparisons among the questions and see what the biggest driver of our outcome measure is.

In this case that I worked on, you can see the thematic composites that I developed, and then the Pearson coefficients for each of the pairwise comparisons. Customer service was the biggest driver of overall satisfaction, and the least important was software availability. I do encourage reporting this to stakeholders. Their eyes may glaze over if you report mean scores, which can be hard for them to conceptualize, but if they're able to see the names of the thematic composites and understand how each relates to the outcome measure, you're really doing their homework for them, and it makes your work a lot easier.
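Here's a minimal sketch of that correlation work in R. The composite columns (customer_service, opening_hours, software) and the outcome column are hypothetical per-respondent scores.

```r
composites <- c("customer_service", "opening_hours", "software")

# Correlate each thematic composite with the outcome and report R^2
# as the share of variance it accounts for.
for (comp in composites) {
  r <- cor(survey[[comp]], survey$overall_satisfaction,
           use = "pairwise.complete.obs")
  cat(sprintf("%s: r = %.2f, R^2 = %.0f%% of variance\n",
              comp, r, 100 * r^2))
}

# cor.test() gives the p-value if you want the formal hypothesis test.
cor.test(survey$customer_service, survey$overall_satisfaction)
```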
Let me emphasize the different purposes of these two statistics. Cronbach's alpha is a heuristic; the Pearson correlation is used in hypothesis testing. Pearson is used to determine covariance between questions, and the alpha is used to make a decision about thematic composites. The other major difference is that Cronbach's alpha is a comparison among multiple questions, which is how you develop your composite. Historically, in survey research, you would actually put all the survey questions together, because surveys used to be a lot shorter; Cronbach's alpha was originally developed as if one survey were one composite, so you would find the inter-item reliability for all the items within the survey. Now surveys are typically longer, and we basically have little mini-surveys within the survey that get at different concepts, so we develop several composites of four or five questions, or whatever the number is, based on the alpha statistic. The Pearson correlation, by contrast, is a comparison between two questions.

Let me pull out an example from my work with health plan clients of how this can work. I have my outcome question, which is the rating of the health plan, and in my question text I explain what the scale means: "Using any number from 0 to 10, where 0 is the worst health plan possible and 10 is the best health plan possible, what number would you use to rate your plan?" As always, I'm interested in finding out which composite is the strongest driver of the variance in that overall rating question. The subgroups in this survey project were types of health plans, PPO plans versus HMO plans, but you could use any other subgroup: race or ethnicity, age groups, whatever fits your work. And you can see the difference it makes when you divide up the subgroups. For the PPO plans, claims processing and the questions related to that topic were the strongest driver of the overall rating: if someone rated the claims-processing questions highly, they were also likely to give a high overall rating. For the less correlated composites, like the questions about submitting claims to the plan, there wasn't as strong a correlation between a high score on those questions and the rating-of-health-plan question. For the HMO and POS plans, the rating of specialists happened to be the most important.

I do report the R-squared value as a percentage for clients, to help them understand what percentage of the variance can be explained by answers to a certain composite. It's just a way to mentally compare how important something is, and I think it's useful for people who haven't done the survey research along with you but are simply reading your report: this way they can see that claims processing accounts for 20% of the variance in overall satisfaction for the PPO and EPO group.

I want to pause there for a few seconds to ask for any questions before I jump into visualization and then into some advanced statistical methods.
If there are questions about the presentation itself, we have a little more time in the presentation, so if you have questions that relate to your individual project, you can ask them now, or feel free to wait until the end.

Our earlier session was about visualization and tabulation. I use a few different software programs, typically Tableau and R. The point here is that when we're making visualizations, there are two separate approaches. One is dashboarding, where the dashboard doesn't have a lot of text and you're simply reporting mean scores, maybe a few other descriptive metrics, probably composite scores, maybe even the Pearson correlation scores, but there isn't a whole lot of expository text; the focus is on the visualization. In other reports, you're embedding visualizations in a text report, where you have more of an opportunity to explain what's in the visualization itself.

I find that my tendency, which I think is a pretty common one, is to create a lot of bar charts in my reports. So one of my recommendations is to vary the types of visualizations you use. I have an example of a radial graph here, which is just a different way to represent the respondent age-group population (there's a sketch of one way to draw one in R below). In this case, there's actually a very large subgroup in the 55-to-64 age group. The inner ring marks 25%: for age groups whose dot is below that ring, they're at less than 25%, whereas this age group is at nearly 50% of the entire population, broken out by the plan groups I talked about on the previous slide. I like to test out visualizations to see what kind of reaction they get and whether they're intuitive; some people really rely on bar graphs, and others will get more exploratory with things like radial graphs.

This is an example of a dashboard that I put together in Tableau. It's been a few months since we went over Tableau together, so if you want a reminder and you're about to start your visualization procedures, feel free to reach out to me and I can help you with the data-processing functions and protocols you need to implement to get started. The idea of a dashboard is that you have several visualizations together, so there's minimal need for the user to click through beyond the existing menus.
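For what it's worth, here's a sketch of one way to draw a radial bar chart in R with ggplot2, using made-up age-mix numbers; coord_polar() is what wraps an ordinary bar chart into a ring.

```r
library(ggplot2)

age_mix <- data.frame(
  age_group = c("18-34", "35-54", "55-64", "65+"),
  pct       = c(18, 27, 48, 7)   # hypothetical shares of respondents
)

ggplot(age_mix, aes(x = age_group, y = pct)) +
  geom_col(width = 0.9, fill = "steelblue") +
  coord_polar() +                                     # wrap the bars into a ring
  geom_hline(yintercept = 25, linetype = "dashed") +  # the 25% reference ring
  labs(title = "Respondent age mix", y = "% of respondents", x = NULL)
```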
Okay, so: statistical analysis. I'd say maybe about half of my projects actually get to this level; the other half stop at reporting mean scores, getting Pearson correlations, and finding out which questions are the most important drivers of the overall satisfaction metric. But sometimes I need to do a deeper dive into imputation and some regression methods. I have a case study for you here about regression for response rates, and then I'll explain in some detail how to use imputation to deal with missing data in a way that is sensitive to everyone who participated in the survey.

To back up a bit before I get into the statistics itself: we're in a transitional period in survey research. A lot of the initial work in the field comes, my guess is, from market research, and much of the statistics we've covered today is derived from that field and applied to things you may see in the media, like polling and other forms of opinion measurement. At the same time, data science is a growing field with its own methodology and its own approach. The main difference between the two is that survey research is more about reporting and analysis of results that have been collected, while data science is more about modeling: finding the model that best represents the public's opinion on a certain topic. Survey research has shied away from some advanced methods that quantify opinion at a further remove from the actual answers given on the survey; data science, in other words, is interested in putting together a simulation based on the responses that were given. They're two different approaches, but there is overlap: survey research and data science both use imputation methods to deal with missing values, and the fields have become less distinguishable and borrow methods from each other.

On multiple linear regression: I actually don't use this too much. When I'm thinking about how questions drive overall satisfaction, I stick with the Pearson correlation, which gets directly at the covariance between each question and the outcome. With linear regression, one of the assumptions is problematic for survey research, and I think it's the one that disqualifies it from being used to model the outcome measure, overall satisfaction: there can't be correlation among the independent variables. I have yet to work on a survey where there wasn't significant multicollinearity among the independent variables. A customer service measure will typically be highly correlated with claims processing if I'm doing a health plan survey, and I'm sure you can think of aspects of your own work, different concepts, that end up with similar patterns of responses. This is problematic for linear regression, because you won't be able to tell which one is really a key driver of that overall satisfaction figure. So I typically don't use it to find the most important individual question, or the most important individual composite, that drives overall satisfaction, because that fundamental assumption about uncorrelated independent variables is broken. I will get to some methods for dealing with this in the lasso and ridge regression section.
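One way to check whether that assumption is broken before trusting a regression is the variance inflation factor. Here's a minimal sketch in R with the car package, using the same hypothetical composite columns as before:

```r
library(car)  # for vif()

fit <- lm(overall_satisfaction ~ customer_service + claims_processing +
            opening_hours, data = survey)

car::vif(fit)  # variance inflation factors: values well above ~5 suggest
               # correlated predictors whose coefficients can't be trusted
```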
In previous sessions I've talked about weighting and imputation, but I've gotten several questions about this since, if you're coming from outside survey research, there can be some anxiety about missing values; for any sort of analysis, you want complete data. The term in survey research and in statistics is imputation, which means replacing a missing value with an estimated one. Of course, the devil is in the details of how we estimate that value. But first, more generally, and I've talked a bit about this before: polling is something a lot of people are familiar with, since it's in the news a lot, along with how polling has gone wrong in recent elections and deviated from the actual election results. One thing that has gotten a lot of media attention is weighting survey responses. In weighting, what you're actually doing is expanding a subgroup's composition among respondents to resemble its composition in the population. You might know that 55% of voters are women and 45% are men, but the survey you fielded came back closer to 50/50. To correct for that, you take that subgroup's responses and expand them: the 50% of respondents who are female get weighted so that they count for 55% of the overall survey result, as shown in the sketch below.

This is a method to counteract nonresponse bias, and it can work, if the people who haven't responded to the survey are similar to those who have. If you would expect the women who didn't answer the survey to answer in a similar way to those who did, then weighting will have little negative impact on your results. However, if there is some specific characteristic of those who did not answer the survey that makes them different from those who did, and you weight them as if they were the same, you compound the error, since you're effectively ignoring the behavioral characteristic that is driving their non-participation in the survey. That's why I don't necessarily recommend weighting. I know this topic comes up a lot with respect to racial, ethnic, and age groups, and even if you know the actual population demographics, I suggest you don't weight in this way, since you might be magnifying the opinions of the people who were responsive to your survey over those who weren't.
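To make the mechanics concrete, here's a minimal sketch of that post-stratification weighting in R, using the 55/45 example; the gender and satisfaction columns are hypothetical. I'm showing it so you can see exactly what the adjustment does, not because I recommend it.

```r
pop_share    <- c(Female = 0.55, Male = 0.45)  # known population mix
sample_share <- c(Female = 0.50, Male = 0.50)  # mix among respondents

weights  <- pop_share / sample_share           # Female 1.1, Male 0.9
survey$w <- weights[survey$gender]             # weight attached to each response

# Weighted mean: each response counts in proportion to its weight.
weighted.mean(survey$overall_satisfaction, survey$w, na.rm = TRUE)
```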
However, I think there is a better method out there, and that's imputation. It avoids the problem that weighting has, since it doesn't assume that those who didn't respond to the survey have exactly the same characteristics as those who did. If you know several characteristics about those who didn't answer, you can regress those characteristics on the overall respondent population and impute a value based on that regression, which is closer to what you would expect them to answer had they actually responded to the survey. The difference is that in weighting, you're effectively relying on one variable to explain the entirety of their opinion, whereas in regression imputation you can take in all of these variables and regress their opinion on all of them, which has much more explanatory power. That way you can more accurately represent those who haven't responded to your survey.

There are a few methods that I typically use in imputation; I'll get to the regression equation for imputation in a second. The things you want to think about when you're talking about imputation are the types of variables you have, the quantity of missing values, and the randomness. Typically in survey research there's a decent amount of randomness to the missing values, although it may well be biased in one direction. When I talk about randomness, I mean: is there an entire population that's totally missing? That would be a problem with randomness, and you might shy away from some of these imputation methods. Then there's how much is missing at the record level: are you missing 90% of the values within a certain record? In that case, you might be beyond the scope of what imputation can do for you.

In mean imputation, you take the mean score for the entire population, without respect to other individual characteristics, and apply that mean score to the person who didn't answer the question. That can be a problem because you're losing information that you might otherwise use. In regression imputation, skipping ahead to the next slide so you can see this, you take all the variables you have knowledge about and impute the missing value by regression, building on everything available to you: age, race, ethnicity, and so on. The imputed value could even be the answer to a question in one of the thematic composites, whereas my case study covers imputation specifically for response rates.

I don't have too much time, but I'll jump ahead here for a minute, because talking about diversity in survey responses, it's important to use imputation in a sensitive way, so that we're not imputing values that are incorrect or not reflective of that subgroup's opinion. I found a nice research paper that came out earlier this year from the Urban Institute with best practices for imputation; in their case, they're imputing racial data based on answers to surveys. It's a nice guide showing that imputation can be done around race, ethnicity, and gender; it just has to be thought out a little more carefully, and I think it provides a nice framework to get started.
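Here's a minimal sketch of regression imputation in R with a plain linear model, assuming hypothetical columns age, region, and a q11_score with some missing values. (Packages such as mice wrap this kind of procedure, but the bare version shows what's happening.)

```r
known <- !is.na(survey$q11_score)

# Regress the item on the characteristics we do observe...
fit <- lm(q11_score ~ age + region, data = survey[known, ])

# ...and replace each missing response with its predicted value.
survey$q11_imputed <- survey$q11_score
survey$q11_imputed[!known] <- predict(fit, newdata = survey[!known, ])
```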
In my case study for survey research, I actually used regression to predict response rates. I have a very large survey population; I'll often be sending surveys to a couple hundred thousand people. One client in particular, a large state, wanted a certain number of responses from each HMO; I think there were about 50 HMOs in the state for Medicaid members. To have a sufficient amount of data to analyze on their end, they needed us to project the response rate for the 2018 project, building on the response rates from the 2017 project.

Unusually, I had demographic information for almost all of the 2017 survey participants, so I was able to create a regression equation that predicted the probability that an individual would respond to the survey based on their age, race, ethnicity, and, I think, their region in the state as well; that was one of my variables. I'll point out that for this project I actually didn't do any variable selection: there was no process to eliminate variables from the regression equation. That was largely because I wasn't as concerned about multicollinearity; I was just trying to model the response rate, not modeling opinion, so if there was some multicollinearity, I could live with it even if it affected my outcome variable a little. If I had eliminated a variable, I thought the risk was higher that the outcome measure would be more affected.

I used the same equation in 2018, but the problem was that they had a lot of new members in 2018 for whom they didn't have demographic variables. What I did was take census information down to the zip code level, and based on the composition of that census block, the number of people by age, by race, by gender, I imputed the mean value for each person with missing data in that zip code. So if the average age in the census block was 34 and a person had a missing age, their imputed age would be 34. In this way I was able to stabilize and project response rates for the 2018 project, with a response probability for each individual even when I didn't have specific demographic information for each of them. It was a combination of evidence from a previous project combined with imputation from an outside source; there's a sketch of the workflow below.

One of the problems, though, because I didn't do variable selection, was multicollinearity. I knew that going in, but the effect it ended up having was to make the variance in the predictions larger than it was in real life. For a group that was very unlikely to respond, the model overcompensated and projected a very, very low response rate, while for a population that was much more likely to respond, it over-predicted how much they would respond. This turned out to be manageable: I was more nervous about the low-response-rate groups, so I ended up getting more responses from them than I needed, and fewer from the higher-response groups. But it led me to the next step, which was to reduce the number of variables in my regression equation. There are a few methods that deal with multicollinearity problems, which cause overfitting; that's really what you can see here: if there's more variance in the predicted response rates than in the actual ones, you have an overfitting problem in your regression model.
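Here's a minimal sketch of that workflow in R. The talk doesn't pin down the exact model, so I've written the response model as a logistic regression on a 0/1 responded flag, which is one natural choice; all of the data frame and column names are hypothetical.

```r
# 1) Fit on last year's sample, where 'responded' is 0/1.
fit <- glm(responded ~ age + race_ethnicity + region,
           data = last_year, family = binomial)

# 2) Impute missing ages for new members from census means by zip code.
miss <- is.na(members$age)
members$age[miss] <- census_means$mean_age[match(members$zip[miss],
                                                 census_means$zip)]

# 3) Project this year's response probability for every individual.
members$p_respond <- predict(fit, newdata = members, type = "response")
```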
These are intro machine-learning techniques, really, that you can use to reduce the number of variables in your regression equation; I actually used the lasso. The quick-and-dirty explanation of these two methods, lasso and ridge regression, is that they're a response to overfitting your model, which was the problem I had because I used too many variables. The model was too confident in its predictions, and these methods make the equation less confident, through two different mechanisms. I used the lasso because it actually drops the variables you don't need. Since this was response-rate prediction, I wasn't worried about people's opinions being affected, so I could drop, say, a race variable at the end of my analysis and still end up with a more accurate equation, so that when I ran the project the next year, my predicted values more closely matched the variance in the real response probabilities. Feel free to ask at the end if you have more questions about that, or about how to apply these models; I just wanted to give you some techniques to fix a problem that comes up a lot with regression models and variable selection. It also comes up in data science approaches to survey research. What I was taught originally was to do backward selection in linear regression, the kind of method that pays attention to p-value changes as you add and remove variables from your model, whereas the field is moving toward these newer methods, which use matrix algebra to shrink and eliminate variables while still keeping predictive value that matches the variance in your actual response probabilities.
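A minimal sketch of lasso variable selection in R with the glmnet package, applied to that response model; the names are hypothetical, and alpha = 1 selects the lasso penalty (alpha = 0 would be ridge).

```r
library(glmnet)

# Dummy-code the predictors; drop the intercept column.
x <- model.matrix(responded ~ age + race_ethnicity + region,
                  data = last_year)[, -1]
y <- last_year$responded

cvfit <- cv.glmnet(x, y, family = "binomial", alpha = 1)  # cross-validated lasso
coef(cvfit, s = "lambda.1se")  # coefficients shrunk to zero are dropped,
                               # giving a less confident, less overfit model
```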
Okay. So, survey reporting: I really think of it as two separate fields. There's data reporting, with mean scores and correlations, which is what I use most of the time. The second part of survey reporting is modeling opinion, which involves imputation, estimating a value that's not available to you, and advanced regression approaches that try to get at public opinion one step removed from the actual answers, but that may actually give a better estimate of what your whole population thinks, rather than just the smaller sample you've collected. Ideally that sample would be perfectly representative of your population, and survey researchers definitely lean that way, while the data science approach goes in the opposite direction: modeling what the population thinks based on the likely-biased sample you've collected.

This is the last workshop we have in quantitative survey work, so it's been great chatting with you all about your survey projects, especially at a time when so much is changing, and hearing about the different methods you employ to gather student opinion. I appreciate your interesting projects and the feedback I've gotten so far.

There are a few takeaway points for survey research that I like to emphasize, five key things. First, design a survey that operationalizes your concepts of interest: you don't want to go right into your survey and write questions without thinking about what concepts you want to study; think about the concepts first and design the questions on top of that. Second, target the survey to your target population: the people who have visited your library or who are interested in your services. Third, create the descriptive visualizations you use in your exploratory work, so you can understand what you're finding in exploratory analysis. Fourth, create thematic composites through item analysis and identify the composites that drive the overall outcome score; these composites should map back to the concepts you were thinking about when you operationalized your original research project, and you want to run those correlations, which I find most useful for finding out what drives overall satisfaction. And finally, be aware of when to implement imputation strategies to represent missing values, depending on whether you want to model opinion or report the opinions you've gathered from your survey work. Thank you for your attention during this session, and during the previous ones if you were able to attend. I think we have a few minutes for questions, so feel free to jump in.

Kevin, this is Sue. I wanted to thank you for your discussion and presentation this afternoon. I imagine in your work, and I know colleagues in this training have done a variety of reporting and writing of papers, do you have an opinion or advice about the number of visualizations, tables, charts, and so on to put into a report or paper? You want to tell the story, whatever the story is, and do it in a variety of ways, but do you have an opinion about that piece of writing and presenting reports or papers?

Yes. I've seen a transition in the way reports are developed in survey research. Before, they would start with large descriptive visualizations, and only at the end of the report would you get to the more meaningful results. I've seen a reversal of that: give the more meaningful results first, show them at the beginning, and then provide the cross-tabs, or the very large visualizations that describe the answers to each question, at the end. I think that's been an improvement; at least among the people I work with, I've noticed they have a better grasp of the main outcomes. So my preference is to front-load the executive-summary version of those visualizations at the front of the report, and then provide the larger descriptive visualizations at the end that people can page through if they have additional questions.

Do other colleagues on the session have a question for Kevin? While colleagues might be thinking: Kevin, would you share your email address again, so that if colleagues want to circle back to you about putting data together into a report or paper, they can contact you for a consultation?

Absolutely.
And we'll provide the link to that study from the Urban Institute; I think it's useful for imputation, and more generally as a way to grapple with issues of diversity and representation in survey work.

Right, thank you. Kevin, did you share the link with attendees? Oh, yes, I should share it now. Thank you. Okay, I think that's the correct link to the Urban Institute study.

Right, thank you. If there are no questions for Kevin at the moment, we can let you go a couple of minutes early. Thank you, Kevin. I look forward to seeing you on Thursday. And a reminder to everyone: I know a number of people have signed up for the workshop on Thursday, but if you have other colleagues you think might be interested, please let them know to register, and if you need the registration link, shoot me an email and I can send it out. So thank you all. It's good to see you, and we'll see you again. Thank you. Thanks very much, everyone.