My name is Blake Telles. I'm the associate vice president for research, and the research office, in conjunction with the data specialists at the library, has organized most of today's event. We hope you'll have a good experience and learn something. The goal was to provide broad coverage of a lot of topics, because (first of all, congratulations on being researchers, with the opportunity to discover new things) the way you approach your research and the type of research you do will vary significantly based on your interests, your department, and so on. So we're going to cover a lot of different things today and hopefully give you some resources, so that if some of this is of particular interest to you, you'll be able to go learn more and contact the speakers or presenters. One small philosophical comment: when I was a student, and it's been a while, it seemed like the goal was always just to get done, get to graduation. I was always trying to get to whatever was next and not necessarily enjoying what I was doing at the moment. Research can be like that: your goal is to get a degree, your goal is to get a job, but it's also a great opportunity to learn how to think critically, how to manage data, how to write papers, how to communicate, and how to discover things and be observant. So make sure you make the most of your research opportunity. We hope that by giving you some tips today on data management, we can help you be a little more efficient as you enjoy that process. Here's what we're going to do today. We have three sessions. The first session is "Don't Become a Data Horror Story: Best Practices, Tips, and Tricks." We have two presenters: Jessica Schad, an assistant professor in sociology, who goes first, and then Tyson Barrett, a research assistant professor of psychology and, I believe,
managing director of the Data Science and Discovery Unit. They'll each take 20 minutes, and then we have a 20-minute question-and-answer period and a short break. Session two is going to be "Do It Right from the Start: Ensuring Reproducibility." We'll have some speakers I'll introduce then, and we'll end with kind of a fun data game. We're scheduled to go from one to five, with some breaks in between and hopefully a chance to network a little bit and interact. Any questions before we start? Thank you for being here. And Jessica, before you begin, let me just say one more thing. We have a file in Box that has all the takeaways. We don't have it on a slide, but we will post it on the research data management website; we have a page for it, and we'll link resources to that page. And when we send you the survey, which all of you will gladly complete, there will be information about it there too. So we'll have lots of resources for you to take home, so you can remember where to go. And I'm Betty, your data librarian, everyone's personal data librarian; you can call me anytime between 8 and 5, and I go to bed by 9. We've tried to set this up to be successful for you as an audience. We could just wander all over talking about data, but we've asked the presenters to come up with three takeaway messages, and in that Box folder each presenter has a short summary of their takeaways. So pay attention and see if you can pick up what those takeaways are. They should be obvious, but if you don't get all the notes down, the materials will be available on the website. Okay, you're on the mic. All right, hi everyone. I am Jessica Schad, as was already announced, and I'm really excited to be here today to talk about doing effective surveys. A lot of what I'm going to focus on today is the data collection side of things, and in doing so I
hope that when you get to the point of analyzing, cleaning, or presenting your data, you have better data to work with. Surveys are one of the most common ways that social scientists, at least, gather data, and I know that people in other disciplines use them as well. While I was getting ready today, I thought of three things I didn't include in my slides that you don't want to do with survey research, so I'll start by telling you those three things. First, you don't want to make any assertions with your data that are not accurate; that's where I'll talk a little about sampling and measurement, and that's one thing you really want to avoid if at all possible. You also don't want to gather any data that's not useful for you in your analysis, or for other people who might be interested in your data. And you want to be really careful about keeping the confidentiality of the people who participate in your research. All three of those are the things with surveys you want to be most careful about, in terms of not becoming a data horror story. I do have quite a bit of experience doing this kind of research. I'm a natural resource social scientist, and a lot of my research focuses on getting information, usually through surveys, from farmers about how they make decisions to use conservation practices. I also do a lot of research in rural communities on quality of life, so I've done a lot of mail surveys, but also some telephone surveys and even drop-off/pick-up surveys, which I'll talk about later. So, to get started: what are surveys? It does seem like there's sometimes a little confusion about what surveys actually are. The word "survey" is sometimes even used by politicians or marketers to gather data that's not really the type of data that we as researchers use. When I'm referring to surveys, I'm talking about collecting information from a sample of individuals and using
their responses to standardized questions. A key thing there is that you're asking people questions worded in the exact same way, so that you can draw comparisons across different groups in your sample. A lot of times these are closed-ended questions, which makes it a lot easier to compare the responses, since there's only a select number of ways people can answer whatever question you're asking them. Some surveys do include open-ended questions, but it's more time-consuming and challenging to draw comparisons then, so a lot of survey research focuses mostly on closed-ended questions. Case in point: this is a set of survey questions I used when I was looking at people's perceptions of crime up in the Bakken area of North Dakota and Montana, to see how the recent boom had impacted their daily lives. You can see that I'm asking people to what degree they attribute the following problems to the recent oil and gas boom in their community, and they select one response. This is very different from asking an open-ended question and letting them tell me all the things they may have experienced since the boom; it's much easier for me to compare. There are really three main advantages to collecting data through surveys. First, it's a very versatile method: you can focus on a lot of different topics, and you can use a lot of different ways to gather your data, contacting people via phone, online, or by mail, and asking them to respond in those different ways as well. Second, it's a very efficient way of gathering data, particularly if you use online surveys; you can gather data quite quickly and relatively cheaply, though that of course varies by what mode you use to collect it. And finally, usually but not always, your results are generalizable: what you find from just a sample of the group you're studying is hopefully able to tell
you about that group as a whole. That of course depends, and this is really, really important, on how you do your sampling. If you use a convenience sample to gather survey data, you're not going to have very generalizable data. If I stand outside the library and ask students about their study habits, I'm going to have a very biased sample, right? That would be a convenience sample; I'm not giving every single student at USU the chance to tell me about their study habits. So the degree of generalizability depends a lot on how you do your sampling, or how you get people to participate in your survey. The key issue in survey research today, the one I'd say most experts are most concerned about, is that response rates are down. If I contacted a hundred people, it used to be that a pretty good response rate meant I could get 70 to 80 of them to respond to my survey. Does anybody have any idea what response rates look like today? Yeah, I'd say 30% is what I've been getting when doing surveys with farmers, and that's generally what a lot of people are experiencing. Again, it varies by mode, so how you're contacting people, how you're doing your sampling, and how many resources you have for contacting people, but people are quite concerned with response rates going down. In a recent study, Stedman et al. found that across the surveys conducted in their lab at Cornell University over the last 30 or so years, the response rate had gone down on average about 1% per year. That's not a big deal from one year to the next, but over a 30-year period it can be quite substantial. Here is their plot of response rates over time: in the 1970s they were hovering anywhere between 55 and 90%, and today it's closer to the 30 to 50% range. So a big decline over time, and the big problem with this is that it can hurt the
representativeness of your findings. If you only have certain types of people responding to your survey, those people might be different from your non-respondents, so if you're trying to say something about a population as a whole and only a certain type of person is responding, that can be problematic. Increasingly, we need to do non-respondent tests: basically, you keep trying to contact those non-respondents and see if you can figure out why they didn't take your survey and how they differ from the people who responded. Sometimes we can also use secondary data, maybe the census data or the ag census data, compare our respondents to that population, and then adjust our data accordingly. But a low response rate can hurt representativeness. So response rates are declining, and people have different explanations for it. Does anybody want to throw one out there, of why they think responses to surveys are going down? Because we've been sent too many? Yes, burden, and not just from academic researchers, right? Who else is sending surveys? Where else are you getting them from? The corporate world, Amazon, yeah, customer surveys. Right now I'm getting a lot of political surveys, and they're really not looking for valid data from me; they're more looking to get my email address or to confirm a stance they already hold on a particular issue. So overburdening is a big thing. It has also become increasingly easy for people to isolate themselves from researchers conducting surveys. If you get a phone call on your smartphone from a number you don't know, what do you do? Right, you don't answer it. They've even tried switching to numbers that might mean something to you, and now you know that trick, so you don't answer those either. It used to be that if you had a landline at your house, which a lot of people don't have anymore, you would pick it up
because you wanted to know who called. The landscape for doing this sort of research has changed tremendously, and right now survey researchers are scrambling to figure out the best new methods for doing survey research given these issues. So I'm going to talk a little bit about gathering valid (that is, accurate), reliable, and generalizable survey data, and I'll walk through that in terms of planning, collecting, and then presenting your data. There's a picture of some of my grad students last year doing a massive mailing of surveys to farmers in South Dakota; we sent out a total of 6,000 surveys across the state. A lot of work and resources go into doing survey research, so it's very important to do the planning ahead of time. First of all, you want to plan exactly how you're going to use your survey data before you collect it. This sounds obvious, but even I, having done a number of surveys, always say, "I wish I would have added a question about..." or "I wish I would have asked the question this way instead of how I asked it." By putting the thought in before you get to the data analysis and data cleaning stage, you can really save yourself from your own horror story of wishing you had different data. So ask yourself: what kind of analysis do I plan to do with the data once I get it? What are my independent and dependent variables? If you want to look at age differences, you'd better make sure you have an age question on your survey; you can't go back and add that in later. What you collect in a survey is what you collect. Some of you might do qualitative research where, after a couple of interviews, you might say, "I'm going to start asking these questions," as your project evolves. You can't do that with survey research. What you send out is the
data you're going to get; you can't change it retroactively. Here's an example of my own small horror story, from that same survey in North Dakota and Montana. We asked a series of questions about people's perceptions of crime. As we started the analysis, somebody asked, "Do we have a question about people being victims of crime, or having experienced crime?" No, we didn't, even though that would be a really good predictor (I'm not a criminologist, so it's not totally my fault). We didn't have an indicator in our data set that would tell us whether an individual respondent had ever experienced crime, yet we were trying to understand their perceptions of crime. By thinking that through a little more before we collected the data, we would have had a better data set to work with later on. You also want to think about what level of measurement you need for each variable. What do I mean by that? Make sure that if you're looking for a continuous or numeric response, that's what you ask for: if you're looking for a number, ask for a number; if you're looking for a category, ask for a category. For example, and this should hopefully make it clearer: I always include on my surveys, "What year were you born?" That way, later on I can convert that into respondents who are old (however I define old) or young. If I had instead asked, "Do you consider yourself to be a young adult, an adult, or elderly?", I can't go back and recover what year somebody was born, right? So think about what sort of analysis you want to do and how you're going to ask your question before you go about collecting your data. You also want to put yourself in the shoes of potential respondents; that's my other key tip for you. Questions need to make sense and be answerable, and I'll give you some examples of that in just a minute. And you don't want your survey to be too long.
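To make that level-of-measurement point concrete, here's a minimal sketch (in Python with pandas, though any analysis tool works; the birth years and cutoffs are invented): collecting birth year as a number lets you derive whatever age categories you need at analysis time, while the reverse, recovering birth year from a category, is impossible.

```python
import pandas as pd

# Hypothetical respondents: birth year was collected as a number.
df = pd.DataFrame({"birth_year": [1950, 1985, 2001, 1972]})

survey_year = 2019  # assumed year the survey was fielded
df["age"] = survey_year - df["birth_year"]

# The continuous variable can be collapsed into categories later,
# using whatever cutoffs the research question calls for.
df["age_group"] = pd.cut(
    df["age"],
    bins=[0, 35, 65, 120],
    labels=["young adult", "adult", "older adult"],
)
# If only "young adult / adult / elderly" had been asked on the survey,
# there would be no way to recover birth year or to try different cutoffs.
```

Changing the `bins` is all it takes to redefine "old" and "young" after the fact, which is exactly the flexibility a categorical question throws away.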
Who has received a really long survey and quit taking it in the middle? Yeah. Or decided not to take the survey at all? We want to be really cognizant about overburdening the people whose time we're requesting, and this can be really hard to do as researchers, right? We have so many great ideas of all the things we want to ask about, so planning ahead and reducing the number of questions can be really important to boosting that response rate. Timing is really important too. I do a lot of research with farmers; when might I want to avoid sending surveys to farmers? Spring planting, yes, and fall harvest, yes. If I send my surveys during those periods, I'm not going to get very many respondents. In fact, I had a bit of bad luck this summer: I sent out a survey, and (this slide is a little hard to read) there was massive flooding in South Dakota. So even though I had tried to plan around when planting was normally done, because of external circumstances a lot of farmers were still trying to get out on their fields and unable to take the survey, and I definitely noticed that in my response rate. And, you know, don't send out surveys over Christmas, those sorts of things. You also want to make sure that your topic, whatever you're focusing on, has salience for the people taking your survey; that can also increase your response rate. So tell the people who might take your survey how the responses could impact them; that can help too. Also, do a pilot study: getting a sample of people from the population to take your survey before you send it out can be really important to gathering valid data later. If you can't do that (it does sometimes take a lot of resources and time), at the very least have experts in the field review your survey, so you're not sending out something that doesn't make sense to the population you're trying to study. Now, a couple of tips on writing good survey questions. You
want to make sure you're using short, non-academic words and sentences. Take a look at this question: "Do you believe in anthropogenic climate change?" Does everybody know what anthropogenic means? Does anybody want to tell us? Human-caused, yeah. But not everybody knows that, right? So if you're asking people questions, they need to be able to answer them; otherwise they're going to skip the question, or worse, quit taking your survey, or, also not very good, answer it without really knowing what they're talking about. Another tip I'd give you: avoid double-barreled questions, that is, questions that really have two questions embedded within them. Here's an example: "In your opinion, how would you rate the speed and accuracy of your work?" Speed and accuracy are very different things, right? Someone can be very fast but very inaccurate. So make sure you're not doing that; separate them out. You also want to minimize bias or leading: "More people have attended the movie Gone with the Wind than any other motion picture produced this century. Have you ever seen this movie?" Is this sort of making people feel bad for not having seen Gone with the Wind? Yeah, exactly. A biased lead-in to your question is going to bias your responses. Keep it simple. Allow for disagreement: this seems minor, but experiments have shown that if you don't do what I'm about to show you, it can lead to people just agreeing with whatever you say. Here's a not-perfect question: "Do you agree that this community is a good place to live?" To make it better, you should ask, "Do you agree or disagree that this community is a good place to live?" Don't ask questions people can't answer: "How much money did you spend on groceries in the last year?" Anybody? Anybody? No? Okay, how much did you spend in the last week? Make response categories exhaustive, so there's a response for everybody. If you're asking
about families, for example, you're surely leaving some people out; having an "other" option is one way to handle that. Also make sure your response categories are mutually exclusive, so there's no overlap, unless it's a check-all-that-apply question. If I typically spend three hours on social media and your categories overlap, I'm not going to know how to respond to your survey. So make sure there's no overlap. Incentives are another nice way, if you have the resources, to increase response rates. Social exchange theory shows that if we give people something ahead of time, they feel obligated to participate in our survey. I often include $2 bills in my surveys, and you can see I've gotten notes back from people appreciating them. I've done experiments showing that the $2 incentive can boost my response rate significantly: with no incentive I had a 25% response rate, and with the incentive I had 32%. Finally, my last tip for you: know the limitations of your survey design. Every single research project is going to have some sort of weakness, and that includes survey methods, so make sure you know what those weaknesses are. They could come from how you designed your survey, or simply from a lack of resources. Regardless, being aware of these limitations helps you address or temper your findings when you're discussing or publishing your research. For example, if you're using a non-probability or convenience sample, make sure not to claim that your findings generalize to the population; and there are methods you can use, like weighting, to make your responses look more like the population you're interested in. A couple of emerging approaches in survey research are online panels (I'm not going to talk about those, but if you'd like to ask me more about them, I can) and using multiple modes both for contacting people and for letting them respond.
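The weighting idea mentioned a moment ago can be sketched in a few lines (all numbers here are invented for illustration): each respondent group gets a weight equal to its known population share, for example from census data, divided by its share of the sample, so under-represented groups count more in weighted estimates.

```python
# Known population shares vs. who actually responded (hypothetical numbers).
population_share = {"under_40": 0.45, "40_and_over": 0.55}
sample_share = {"under_40": 0.25, "40_and_over": 0.75}

# Post-stratification weight: population share divided by sample share.
# Under-40 respondents are up-weighted (1.8); older ones are down-weighted.
weights = {g: population_share[g] / sample_share[g] for g in population_share}

# Applying the weights: a weighted mean of some survey response.
responses = [("under_40", 3.0), ("40_and_over", 4.0),
             ("40_and_over", 5.0), ("under_40", 2.0)]
weighted_mean = (
    sum(weights[g] * y for g, y in responses)
    / sum(weights[g] for g, _ in responses)
)
```

This is only the simplest form of post-stratification; it assumes you know the population shares and that nonresponse is related mainly to the grouping variable.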
So, those are the key takeaways. If you're interested in learning more about survey design, I teach a full class on this; it will be offered a year from now, Sociology 7100. I don't know the day or time it will be offered yet, but feel free to contact me if you'd like to see the syllabus or if you're wondering how it might work with some research that you have. Thank you. Well, thank you for having me. I'm Tyson, Tyson B.; I grew up with a bunch of Tysons, so I learned to go by Tyson B. My talk, I want to call it "finding data balance." We're going to come back to these rocks: I think they're beautiful, for one thing, but the stack is also precarious. How many of you are really comfortable with Excel or Google Sheets? Okay, most of you have used those plenty. What about SPSS, jamovi, JASP, R, or SAS? A decent amount, but way fewer than for spreadsheets, which is what I expected. Regardless of where you come from on this, almost everyone is using some form of spreadsheet software. Even if you're like, "Man, I'm a hardcore R user, I try not to use spreadsheets," they come to you; they just pop up. So when you're thinking of those rocks in the corner, there are a million ways to knock them over, right? There are some rude ways, just kicking them; there are some natural ways; and there are only a couple of ways you can actually stack them. With spreadsheets it's the same: there are a million ways to make your life harder. Has anyone seen this in their research experience? Yeah. I'm curious: you're confronted with this situation, you come back to a project (you can see I put "date modified"; you can tell I'm really in the data mood), the date modified and then the file names. Any ideas for how to handle this situation? You come back to a project and you need to answer some questions; what's your natural tendency here? Look at the date modified? So here, this one looks like
the last one that was added: data_final_final_version2. But there is a data_final_final_final that was modified just a week earlier. So are we getting concerned? Maybe one of these was edited by accident, or someone came in, moved something, decided not to keep the change, and saved it anyway, so the date modified changed. This is a scary situation. And then there's always one of those files where someone just wrote their name on it because they were doing something with it; I have no idea what's happening in data_Jim. There are also issues with copying and pasting data, if you've done that before: sometimes you didn't copy all the cells you anticipated copying, so you thought you brought over all the data but really brought over a third of it, and that can be quite shocking when it comes to the analysis. Does anyone use color coding to help you understand your spreadsheet? If nothing else, it makes you feel good: you look at the spreadsheet and it's colors, not just numbers. But the issue is that color isn't data you can easily use. It looks pleasant, but it doesn't tell you anything, so unless you have an indicator column that goes along with the color coding, it's not going to do you any favors down the road. Also, Excel is aggressive when it comes to dates. Has anyone typed something like "1-3" in a cell, and Excel goes, "Oh, January 3rd, that's what you mean," and you're like, "No, I meant 1 to 3; that's the category"? I don't know why, but Excel is just ultra-aggressive like that, so you have to watch out: anything that looks like a date, Excel is going to treat as a date. You may have a column of data that's someone's response about how often they do something, and pretty quickly that turns into January 3rd, and if you try to turn it back into a number it's going to be something like 43,000, and you're like, no one's going to the store 43,000 times a week.
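One defensive habit against that kind of silent coercion, sketched here in Python with pandas and made-up data, is to read every column of a spreadsheet export as plain text first, and only then convert the columns you know are genuinely numeric.

```python
import io
import pandas as pd

# Made-up export: "times_per_week" contains range responses like "1-3" that
# a spreadsheet would happily reinterpret as January 3rd.
csv = io.StringIO("id,times_per_week\n1,1-3\n2,4\n3,5-7\n")

# dtype=str keeps every value exactly as written; nothing is coerced.
df = pd.read_csv(csv, dtype=str)

print(df["times_per_week"].tolist())  # the "1-3" response survives intact
```

From there you can decide deliberately how to encode range responses, rather than discovering after the fact that they became dates or date serials.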
So what does that response actually mean? You have to be very careful about spreadsheets' default behavior. Another one I come across a lot is hiding columns. This can be really nice to help you not feel overwhelmed by a really big spreadsheet, but it can really throw you off when you start working with your data. I'm an applied statistician in the College of Education; most of my work is helping other people deal with their data problems, and this one sneaks up all the time. Someone will hand me an Excel spreadsheet and say, "There are only three columns in there. I have looked at it; there are only three columns." I'm like, okay, this will be easy. Then I try to bring it into R, my preferred tool for statistical analysis, and suddenly there are 5,000 columns, and I'm like, where did those come from? Oh yeah, there are hidden columns through the whole spreadsheet. So really quickly, hiding columns can turn something that looks quite pleasant into a nightmare. I would say (I haven't given you the clue yet; I love Michael Scott) there are always issues with these things: if you apply these habits, you're going to have problems down the road. There's an article by Broman and Woo, a preprint; look it up. They provide some principles for working with data in spreadsheets, probably over 15 in there; I'm going to go over just a handful. The first one is: be consistent. The first nightmare you can create for yourself is changing your mind about how to handle something halfway through your data management or collection or analysis. You can do a bunch of things wrong, but if you're consistent about it, it's a lot easier to fix. So when you're thinking, "I don't know what the right way to do this is," the first principle is just be consistent, because then you can always at least backtrack a little. Another one is: write dates as year, month, day. I know, if you're from America, you're
like, "That is totally unnatural, why would we do year-month-day?", even though it totally makes sense. It's like the metric system: we just don't care; we'd rather do month, day, year, which has never made sense. If you write dates as year-month-day, then when a date shows up in another format, you immediately know: that wasn't me, something's wrong. Next, don't leave any cells empty. It's really natural in a spreadsheet to say, "Oh, it's missing, I'm just going to leave it empty." It's actually best to have some sort of indicator that it's missing, whether that's NA or 999 or -99, whatever it is: something that tells you it wasn't overlooked and wasn't an accidental deletion when you were going through the spreadsheet. Because sometimes stuff like that happens: you're typing something out, you go, "Wait, where did that go?", and you don't realize you deleted something. Next, put just one thing in a cell. This one I don't see as often, but it still comes up: people write multiple things into a single cell because it looks nice. The same is true of cell merging; it looks really nice, but it can quickly throw data analysis off, because which column does that merged cell actually belong to? So put one thing in a cell, and don't combine cells. We call the next one rectangulating: you want your data in a rectangle. If you have different things copied and pasted all around it, it's going to be hard to go in and analyze it later; it's going to be a lot more confusing. You want every column to be a variable and every row to be an observation, whatever that means for your data. A single observation could be a school, an individual, a rat, a property, a single plot of land, whatever it is; a single row needs to be a single observation.
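Here's a quick sketch of the year-month-day and no-empty-cells habits together (Python with pandas, invented data): ISO-formatted dates parse unambiguously in any tool, and an explicit NA marker makes missingness visible instead of silent.

```python
import io
import pandas as pd

# Made-up data file following the habits above.
csv = io.StringIO(
    "participant,visit_date,score\n"
    "p01,2019-10-02,14\n"
    "p02,2019-10-03,NA\n"   # explicitly marked missing, not left blank
    "p03,2019-10-05,11\n"
)
# YYYY-MM-DD parses with no month/day ambiguity; "NA" is declared as the
# missing-value code so it can't be mistaken for a real response.
df = pd.read_csv(csv, na_values=["NA"], parse_dates=["visit_date"])

print(df["score"].isna().sum())  # the missing value is flagged, not lost
```

Whichever code you pick (NA, 999, -99), the important part is that it's documented and used consistently, so a blank cell can only ever mean a mistake.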
Has anyone created a data dictionary? Raise your hands proudly if so, because that's great. Yeah, it's not practiced very much, because it is a lot of extra work. But when you come across new data, one of the first things you want to do is create a data dictionary, because at some point future you is going to come back to this and ask, "What is that?" If you don't have a data dictionary, you're going to have to do a lot of looking around; with one, the question is answered right there. The core pieces you need in a data dictionary are the variable name (that's tied to your data, so the column name), what it is (what it's trying to measure), and what the responses mean. If you have those three things, you'll be able to come back to your data much more easily than if you're wondering, "What were my possible options here? I can't remember. These are what people said, but I don't know what was presented to them." Notably, REDCap, a data collection software the College of Education has provided, creates a data dictionary for you, so there are some benefits to looking into REDCap. It's similar to Qualtrics in a lot of ways, but REDCap will create the dictionary for you, which is a nice time saver. Another one: don't use color as data. If you want an indicator that these rows are all together, grouped in some way, create a new column that tells you that. You can make it a group indicator and put ones for all those people and zeros for the others; that's a great approach. You can still use the color; just don't forget the indicator. And don't include calculations in raw data. Those can change abruptly when someone is unaware that these columns rely on those other columns, and that's a quick way to lose a bunch of data and not know why. I'll make a recommendation about how to handle that one a little better, but in raw data, you just don't want calculations.
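The data dictionary described above doesn't need special software; it can be as simple as a small table saved next to the raw data. A minimal sketch (the variables here are invented, purely for illustration):

```python
import pandas as pd

# One row per column in the data set: the variable name, what it measures,
# and what the recorded values mean.
dictionary = pd.DataFrame({
    "variable": ["treat", "store_visits", "satisfaction"],
    "description": [
        "Treatment group indicator",
        "Self-reported store visits in the past week",
        "Satisfaction with service",
    ],
    "values": [
        "0 = control, 1 = treatment",
        "integer count; NA = not reported",
        "1 = very dissatisfied ... 5 = very satisfied",
    ],
})
dictionary.to_csv("data_dictionary.csv", index=False)  # lives beside the raw data
```

Ten minutes of typing when the data are fresh saves future you from reverse-engineering every column, and the same file doubles as documentation for collaborators.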
The thing I like to say is: always be kind to your future self. You may have collaborators too, and it's also important to be nice to them, of course, but at some point future you is going to come back to this. If you're using this data to publish, or using it to present, at some point someone's going to ask a question and you're going to think, "I can't remember; that was a while ago; I actually need to go back and look." This happens to me a lot with one of those slower journals you've submitted a paper to: you start working on another project, you aren't thinking about that paper anymore, and then the revisions finally come back, they're asking for some things, and you can't remember what you did. So when you do come back, if you've done all these things, you're going to say, "Tyson when he was 31 years old, thank you for doing that for me." What these practices help you do is find that data balance, so you don't feel like you're just waiting to kick over those rocks. You'll come back and know the rocks are balanced. It's always precarious, you can always mess up data, but they're going to be there, and you're going to understand how to work with them. When it comes to organizing all this, here is one approach; there are many, but this one I particularly like. When you use spreadsheets, you want to be very particular about where things live: spreadsheet analyses and spreadsheet data should always be kept separate and obvious. One way to do this, if this is a folder on your computer with your project name, whatever it is, is that one folder inside can just be called "data," and you know that whatever's in there is just data: no analysis happens in there. In that folder might be raw_data.csv, and then maybe a clean_data file where you went in and cleaned things up. Having both of those can be really important. I always
recommend that you always have a raw data file that you never change. It's always housed in your data folder, and you never do anything to it; you may bring data from it, but you're never going to change data inside of it. The next folder I would call analyses. In that one, go ahead and have spreadsheet data, do analysis with it, mess things up, try stuff out. The nice thing is you can always come back to your original data, because it's housed nicely in your data folder; it's not housed in that spreadsheet where, if you make a mistake, suddenly it's gone. And then I always like having a manuscript folder. I like keeping my analyses and manuscripts somewhat separate; manuscripts tend to get really messy, because you're going back and forth between collaborators, and there are comments, there are old versions, stuff like that. So you want to keep that away from where your data actually is, so you can find your data and analyses without having to go through all the manuscripts.

All right, with the remaining time I just wanted to go over a few terms that can help you communicate what you want to do in a spreadsheet. Does anyone know Hadley Wickham? If you're in our world, maybe you've heard of him; he's a big deal there. A lot of these terms come from his work and from SQL, the database management language; dplyr is from Hadley. One of the issues with data is that what we're doing is very abstract, and there are a bunch of synonyms for the same stuff, so when you're trying to tell a collaborator what you did with the data, or what you want them to do with the data, it can be very helpful to have a grammar that actually describes what is happening, and one that's not going to be changing a whole lot over time. I'm just going to give you a handful to consider, and then you can look into SQL and dplyr to learn more. The first one I want to talk about is data wrangling; anyone heard that term before?
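As a quick aside, the folder setup just described, an untouched raw file inside a dedicated data folder with analyses and manuscripts kept separate, can be sketched in a few lines of Python; the project and file names here are just placeholders, not anything from the talk.

```python
# Sketch of the suggested project layout; all names are placeholders.
from pathlib import Path

project = Path("my_project")
for sub in ("data", "analyses", "manuscript"):
    # Create each top-level folder if it does not already exist.
    (project / sub).mkdir(parents=True, exist_ok=True)

# The raw file lives in data/ and is never edited after collection;
# cleaning steps read from it and write a separate cleaned copy.
raw_file = project / "data" / "raw_data.csv"
clean_file = project / "data" / "clean_data.csv"
```

The point is not the code itself but the convention: anything under data/ is data only, and no analysis ever overwrites the raw file.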
So it's the idea of bringing in data and cleaning it up; cleaning and manipulating are all synonyms in some way. It's about obtaining and cleaning data to get it ready for the analysis. One thing I find with my students is that when I say cleaning data, it sounds kind of stressful, like, I don't even know what that means, how do you clean data? It's all about getting the data ready for your analysis, whatever that means: if that's recoding a variable, if that's dropping columns you don't need, those are all data cleaning, data wrangling things.

So you've mentioned raw data and clean data being separate files, but what if you're collecting data in the middle, you know, stage 2, stage 3? You've already cleaned it and done the recoding and that kind of stuff. Would you put that new data back into the raw data file, or do you just start adding to your clean data file, now that there's a better organization to it?

What I personally would do is have a subfolder in my data folder, something like first round and then second round, and then I may clean both and combine them into a complete data set. It's scary when you have two different collections come in; things can start to go awry, so it's nice to maintain both of them.

Another term is selecting. Whenever you say the word selecting, it refers to getting columns, so each variable: I'm selecting these columns for this analysis. Filtering is about grabbing rows. Sometimes you only want to analyze certain individuals, certain groups maybe, or maybe there's an analysis for just the females in your sample; that's filtering. Filtering observations means you grab certain observations but keep all the columns. Another one is mutating; I like this term. It's either creating a new variable or adjusting an existing one, so it's all about taking what you currently have and recoding or transforming
the data in some way, like doing a log transform; that's a common one. When you're changing a variable in some way or creating a brand-new one, mutating is a term for that. You also have summarizing; a synonym for that one is aggregating. It's usually taking lots of information and somehow reducing it to some summary statistics: common ones are means and medians. Anything that takes a lot of information and condenses it into less information is some form of summarizing or aggregating. Another one is pivoting. This one has a lot of synonyms: there's reshaping, there's lengthening, there's widening. All of these are not about changing what's in the data but the way it is shaped. Often we collect data in a wide format, where we have one row per person with this variable measured at time one, time two, and time three, but sometimes we actually need those to be shaped long, and that's called pivoting. And then the last one I want to talk about is called tidying; that's a really broad term for when you're cleaning your data set, or pivoting it, or rectangling it. These are all really abstract things, but if you have time to look into them, I recommend it. Ultimately, our goal at the end of getting ready for an analysis with a spreadsheet is to have our data tidy: all columns are variables and all rows are observations. If you have that, the data should be ready for you to actually do analysis, whether with a spreadsheet or otherwise.

If you are interested in learning more, here are some links. I always call this the picture slide, so if you need to remember anything, this is always a good one to take a picture of. I am providing these slides on my website if you want them. Thank you.

You mentioned suggestions for organizing your raw data, clean data, and analyses, but say I'm not doing R analysis, I'm doing spreadsheet analysis. How do you keep the mutations, like formulas, or how
you processed it, visible for future you?

That's a great question. Spreadsheets are notorious because they don't have the steps written out; when you use a coding program, you have all that written out, so you can come right back to it. Honestly, the best that I've seen people do is keep a record themselves: instead of code, they write down the things that they did, just like writing code, and keep those notes separate. The notes you take about what you did to the data should go in the analysis folder, so you stay organized in terms of what you actually did with that data.

You recommended not to leave cells blank; why was that again? Because I use mostly SAS for my statistical analysis, and I always leave my cells blank, because then SAS recognizes them as missing, right?

That's a great question, because most statistical software will recognize a blank cell as missing. The issue is that if you're going to be working with your data in the spreadsheet at all, a blank could be an accident or it could be meaningfully missing, and it's going to be a lot harder for you to put 999 in a cell by accident. Does that make sense? Yeah, great question.

This one is more for Jessica. Maybe it's a long question, but I'll make it concise and clear: if we want to collect survey data, how many participants do we have to get in order for the data to be generalizable, as general advice?

Yeah, so that's a really good question, and it really goes to thinking about how you do the sampling, and whether you're using some sort of probability sampling. There's probability sampling, where you give everyone at least some known chance of being in your research, and there's non-probability sampling, where maybe it has more to do with convenience, or you're purposely selecting some people. But depending on how much heterogeneity there is in the population you're studying, that can play a role in how many
people you need to participate in your study. If you think about the polling that political organizations are doing right now to try to understand how people are going to vote, they say they can do it with approximately 1,000 people if their sampling strategy makes sure they get a proper number of people fitting certain categories. So I guess my answer is it's not set in stone; it depends on how varied your population is and what you're trying to do with the data once you get it collected. In general, though, with a couple thousand respondents you can have enough data to say something about most populations using probability sampling, but it can vary.

You might be able to answer that better; would you just tell us quickly what the online panels were about?

Sure. It's really interesting. I believe we have Qualtrics here at USU, correct? I've been contacted a couple of times by Qualtrics and, increasingly, by these different survey organizations, like SurveyMonkey, which you've probably heard of, or QuestionPro, which is another one like Qualtrics with some differences; even Amazon has their own. It's basically people who sign up and say they're willing to take surveys to get some sort of benefit. It could be monetary compensation, it could be coupons; marketing companies might give coupons if you take their surveys. So they have this pool of people, and if I contact QuestionPro and say I need a panel of Utah residents to take my survey about people's views on climate change, they will have a set of people they can send my survey to who are already willing to take it, and I can offer some sort of compensation for them to take it. There are a lot of problems with doing surveys that way, as you can imagine: there are going to be certain types of people who are going to be part of those panels, and you can also think about what happens after a person
takes a whole bunch of surveys, right, and does it as a sort of way to earn money: they may be taking them quickly, assuming they know what you're asking in a question. I looked into doing it last year for a general population survey of North Dakota, and I contacted QuestionPro, which was our provider there, and they were like, sorry, we've only got about 15 people in the whole state who are part of our panel. So it's just not going to work in certain areas either. But, I forgot to mention, for the people who are on the panel, they already have certain information, like your gender and your age, so I can say to them, I want women who are 18 to 25, and if they have enough people, they'll ask those to take your survey. But with 15 people in the whole of North Dakota, I was like, that's not going to work. Thanks for the question.

If you're doing statistical analysis on data, will the statistical functions built into Excel or other software give you different results? Should you be aware of which software you're using to do that?

Yes, the short answer is yes; each one approaches it slightly differently. The more mainstream software will give you the same answer for standard statistics regardless, but as you branch out into more complicated methods, they start to differ. Excel especially; people have found some issues with some of the ways it does things, so my recommendation would be not to do statistical analysis in Excel when possible. There's free point-and-click software like jamovi or JASP that's pretty straightforward to use; they function kind of like a spreadsheet for most of it, and then you just point and click, and those ones are well accepted. That's a great question. That's jamovi, J-A-M-O-V-I, it's free, or JASP, J-A-S-P; weird names for software.
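To make the wrangling vocabulary from earlier concrete, here is a minimal sketch in Python with pandas, a rough analogue of R's dplyr and tidyr. The data and column names are invented for illustration, and 999 stands in for the explicit missing-value code discussed above.

```python
import pandas as pd

# Tiny wide-format data set: one row per person, one column per time point.
raw = pd.DataFrame({
    "id": [1, 2, 3],
    "sex": ["F", "M", "F"],
    "score_t1": [10, 999, 12],   # 999 = deliberate missing-value code
    "score_t2": [11, 14, 999],
})

# Turn the 999 sentinel into a real missing value before analyzing.
df = raw.replace(999, float("nan"))

# Selecting: grab columns.
subset_cols = df[["id", "score_t1"]]

# Filtering: grab rows, e.g. just the females in the sample.
females = df[df["sex"] == "F"]

# Mutating: create a new variable from existing ones.
df["score_change"] = df["score_t2"] - df["score_t1"]

# Summarizing/aggregating: condense many values into one statistic.
mean_t1 = df["score_t1"].mean()   # skips the missing value

# Pivoting/reshaping: wide to long, one row per person-by-time observation.
long = df.melt(id_vars=["id", "sex"],
               value_vars=["score_t1", "score_t2"],
               var_name="time", value_name="score")
```

The result of the melt is tidy in the sense described earlier: every column is a variable and every row is an observation, which is the shape most analyses want.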
This is an unsolicited comment from an engineering perspective. I use spreadsheets a lot, and I use them in the classes that I teach, and one of the things I ask my students to do is what I'd call having rows as observations and columns as variables, and then to do a set of hand calculations for one row of data. It does a couple of things. Number one, it checks my spreadsheet against my hand calculations; if those match up, I feel confident that my spreadsheet is processing the way it should be. But again, for future self: if I come back, and for me it's like five minutes later, what does the spreadsheet do? I don't remember. But if I have a set of hand calculations, in just a couple of minutes I see all my equations, I see what it does, and now I know what my spreadsheet does. So it's a nice tool for a particular type of analysis, one you might come back to and use again and again as a professional; that's something I encourage my students to do.

Sometimes I pull data out of the census, and I don't know how to work with their raw files. So, well, sometimes I use their raw files, but there's a lot of data cleaning when you're trying to make a rectangular file, where every row is an observation and every column is a variable, is that right, Eric? Sometimes, for example, if I'm extracting data for a couple of different cities and I've got a lot of dates, then for each observation I have to hand-enter the date to make it a date variable, and then I have to copy and paste that date because I only have it once, and then I've got all the, you know what I'm saying? Is there a way to clean data that's messed up that way, like, fast?
I've never been able to do it, and I'm just really worried.

So, I am biased, but programs like R can generally do things like that in a moment; it doesn't take a whole lot of work, or it might take me an hour depending on how big it is. And census data, anything from the government, is going to be kind of messy, if you're lucky enough not to get it in a PDF. I haven't found any really good shortcuts in a spreadsheet for that; it's usually another programming-type language like R. R has some simplified code now; if you looked at it years ago, it looks very different today, so it is something to look into. There are classes up here that teach it; I teach one of those, and we cover how to do things like that, because those things happen everywhere. It can be worth spending 30 hours learning something like R and then never having to do that in a spreadsheet again. It's trade-offs, but yeah, it's a great question.

There's also a tool called OpenRefine; does anyone here use OpenRefine? It'll clean data and do some simple things like that, and it's easier to learn than R. If it's something where you just have some dates you have to get cleaned up, it'll go like that. OpenRefine is open source, and if you have questions about it, you can stop by and see me sometime and I can show you how to do stuff.

That's cool. I have a question. Aside from the number of samples, how do you report heterogeneity or generalizability of the data? Like, if you had a thousand samples, you could say you assume it's generalizable, but besides the number of responses, are there other ways to report the quality of data?

Let's see, sure. One thing that I do a lot, and this is why if you get the census this year you should take it, is compare our data. When I did that survey up in North Dakota and Montana, these
were county-level samples, so I could compare my respondents to the percentage of, say, males and females that live there. In that survey, for example, I found that I had a higher percentage of males responding to my survey than actually lived there in the population, and so I weighted it accordingly. Yeah, comparing to other data sources is important. I don't know, does that answer your question?

Checking the quality is an important thing to report, right? As a reader of your research, I'd like to know something about that.

Yeah, certainly, and increasingly, as we've seen response rates go down, reviewers of journal articles are also looking for you to report the testing you've done between the respondents and the non-respondents. So one thing I've actually moved to that makes my life a little easier is getting sampling frames that have some data on all of the people in my population. For instance, when I was doing surveys of farmers, I used to request from the government lists of farmers who had participated in government programs, and that would really just give you the names and addresses of farmers. If I buy my sampling frame from a private organization, they'll give me the names, addresses, emails, and then some characteristics of them, like the number of acres they have and their gross farm income from last year. Then, with my own data set, I can see, oh shoot, my respondents were much higher income than the population. At least making sure you know those things and report on them helps people assess the quality of your data too.

For sure; if there are valuable coupons, I'd be happy to take your census surveys. Any other questions? All right, let's thank our speakers, all of them. Thank you.