So this is session two, Do Right from the Start: ensuring reproducibility and confirmation. Now this one's going to be a bit of a challenge, because each presentation only gets 10 minutes, and we're going to work hard to keep them to that. But again, we've asked the presenters to come up with key takeaways, so watch for those; they're also available in the Box folder. Our presenters are Susan Durham, a statistical consultant in watershed sciences and the College of Natural Resources, who will start; Dave Bolton, assistant professor of kinesiology and health science; Richard Cutler, professor of mathematics and statistics; and David Rosenberg, civil and environmental engineering.

All of my friendly faces left. We could take a break, but that's okay; they see a lot of me. I'm Susan Durham. I'm based in the Ecology Center, and I consult with graduate students and faculty on the design, analysis, interpretation, and presentation of their research. My clientele comes from the College of Natural Resources, the Department of Biology, and the Ecology Center, so I am specifically hired to help those people with their work. I have a background in biology and ecology, which has been a big help in communicating with my clients, and I have a master's in applied statistics from Utah State University. I have been at this for about 30 years now.

So I'm the first to lead off. I allegedly have 10 minutes; I've already told Ann she's not allowed to pull me off. Each of us will be presenting our key takeaway points, and I have the advantage of getting the one that is everybody's first takeaway point: planning ahead. I rearranged my slides but then forgot to give you the new version; anyway, let me flip forward real fast. Here.

What I'm going to do in my time is go through this overall flow chart for a scientific research project. You'll note this lovely quote at the top. It's really very true: you can reanalyze a set of data, but you can't recollect it. Throughout all of this, well, I'm a statistician, so a lot of this is focused on data, but since they invited the statistician, I'm going to talk about being statistical. Statistics is a science, and that statistical science runs through this flow chart at just about every step. Hopefully you'll see that as we go through.

So, my takeaway points. Number one: plan ahead. Everybody's really surprised about that one. I will say, if you do plan ahead, you will reduce, although probably not eliminate, the possibility of some very unpleasant surprises: like having your statistician tell you, "Oh, isn't that lovely? You have no replication." Or finding yourself two months out from your defense and realizing that you need to learn and implement a really complicated statistical analysis, and you don't really have the time to do that. So you're going to save yourself a lot of hassle.

The second takeaway point: don't underestimate how much effort this is going to be. That's true at every step in the flow chart, which we'll go through one by one real quickly, but in particular the data screening, the data cleaning, the data exploration, and the statistical analysis. That is going to take you a whole lot longer than you think. Most of the burden will fall to you to do this work. I am a statistical consultant; I am not your personal data analyst.
And so you will be the one conducting these analyses and running the software. So embrace this challenge and plan for the time it's going to take. If you have planned ahead, then you have acquired the methodology that you need to know. We were talking a little bit earlier about whether you do analysis in Excel; I'm really draconian about that. No. All your data manipulation, all your data cleaning, is done in a scripting language like R or Python or whatever you want to use. But you also need to learn data manipulation methodology, data visualization, and data analysis.

The third point: draw upon the expertise of others. Do not try to do this all yourself. We are all here in a community to support each other, so don't be shy about going to talk to people: your advisor and your committee; your statistician, if you are lucky enough to have one; qualitative and quantitative colleagues; the research data management people at the library; the Writing Center. Most of us are probably scientists; if we wrote well, we would be in another field, okay? So develop collaborations early with statisticians, content experts, and people who can help you with reproducibility. Get those established way early, and work with these people throughout the whole of your career. For those of you with access to statistical consultants, that would be my people, people associated with the Agricultural Experiment Station, and people associated with the College of Education and Human Services, like Tyson who spoke earlier. For the rest of you, it's going to be a bit more challenging to find a statistician.

So we will step through the flow chart. First: define the scientific question. Seek a lot of input from a lot of different sources, in particular statisticians (I'm a little biased). At the next step, you identify what metrics you're going to collect that address your research question. You're going to define the data collection protocol, the sampling protocol or the experimental protocol, obviously a highly statistically dependent step. The next one: what analysis methods are you going to apply to your data? That depends on the study design, depends on the data characteristics, and it needs to answer the question that you posed at the start. You want to be effective, not dangerous: you want to understand the methodology that you're using, and you're still pretty early in your research process at this point.

The next step is one of my favorites. I will admit I don't make people do this a lot, but I may change my mind. Many years ago, on an old stats newsgroup, Ronald Crozier wrote, "No one should ever do an experiment without analyzing it first." So how do we do that? We simulate data, data that looks like the data you are expecting, and then you analyze it with these new tools that you've learned how to use, and you see whether the results come out the way you expect. If something about the process is broken, you go back to the first step or the second step and run through it again; you redesign your study. But if everything looks good, then you get to collect your data, the really fun part. The advantage of doing this with simulated data is that you have to think extensively about the data that you're going to be collecting. You get to be the scientist, and you need to think about what it is going to look like.
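To make that simulate-then-analyze rehearsal concrete, here is a minimal sketch in Python (the talk mentions R or Python as scripting options). The two-group design, effect size, sample size, and t-test are placeholder assumptions, not anything from Susan's actual examples; the point is only the workflow: invent data that looks like what you expect, run the analysis you plan to run, and see how often it finds the effect.

```python
# Hypothetical rehearsal: simulate the data you expect, run the planned
# analysis on it, and estimate how often the design detects the effect.
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

n_per_group = 20      # assumed sample size per treatment group
true_effect = 0.5     # assumed difference between group means
sd = 1.0              # assumed within-group standard deviation
n_sims = 2000         # number of simulated experiments

detections = 0
for _ in range(n_sims):
    control = rng.normal(0.0, sd, n_per_group)
    treated = rng.normal(true_effect, sd, n_per_group)
    # the analysis you actually plan to run on the real data
    t_stat, p_value = stats.ttest_ind(treated, control)
    detections += p_value < 0.05

print(f"Estimated power: {detections / n_sims:.2f}")
# If this number is too low, redesign the study (more samples, a different
# protocol) before any real data are collected.
```

If the rehearsal itself is awkward, say the model won't fit or the data layout turns out to be wrong, that is exactly the kind of problem you want to discover before fieldwork, not after.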
You also learn all the tools that you're going to need to work with the data, and you assess the power of your study. You're accomplishing a lot of things right there, and it's really easy to simulate data now. So that's cool.

Okay, collect data. Don't underestimate the amount of time it's going to take you to get those data into electronic form. And do not underestimate the amount of time it's going to take to clean, screen, and explore your data. Just this year there's an early-career researcher who has had to retract three papers, early in her career, three papers formally retracted, because the data set she got from a collaborator had serious and unresolvable problems. You do not want to be that person. I know, it's sad. And from personal experience I can tell you, you actually spend more time if you don't screen and then have to come back around and do all your analysis again.

Okay, so here's the really fun part: you get to analyze your data. You're also checking assumptions and goodness of fit. With the actual data in hand, you may find out that your plans aren't going to work, so you go back, fix those, and go forward again. And then you're at the end, sort of. Communication is the goal: there's no point in doing this work if nobody ever learns about it, so you need to hone your verbal and written skills. That's why I listed the Writing Center on that one slide. At this step you are translating your statistical results into a story, a really good story, straightforward and clear, while still making sure that the reader understands the uncertainty that remains in your study. You're telling an honest story.

So, in my ten minutes, my point is: if you plan ahead, devote sufficient time and effort to it, and work extensively with other people, you will have increased the likelihood that your study will be successful. So go forth and do science. That's the end of my comments. I think Dave Bolton is up next. Anyway, thank you.

So this is just a simple forward and back? Yeah. Sweet. Now, this part I don't know how to work. Okay, good. All right. Well, I was coming back from Ireland, at about 30 hours without sleep, when I gave these slides to Mike, so I hope they still make sense; I'll try to work through them. One of the things I'm going to discuss today: how many people here have heard of a registered report? So, a few people. I've actually published a registered report; it came out in the journal Cortex last year. So what I thought I would talk about today, since we're all going to be hitting on the bullet of plan ahead, is that a registered report is a formalized version of plan ahead. The journal forces you to get all of your ducks in a row. This is the process for Cortex, the journal I published with. Let me try this out. Oh, wrong one. Got it. Okay.

So this is the editorial triage, and you start at the beginning with stage one: you submit your introduction and your methods. The reviewers ask: do you have a sound rationale for doing this study? Is it an important question? Are your hypotheses clear? And are your methods appropriate for answering that question? Based on that, the journal will decide to accept or reject your paper.
And one of the nice things about this: I can tell you, this was a pretty long process. It was the first time I had gone through it, and it took a couple of years of going back and forth with different versions, so I learned a lot. But the nice thing is that all of the effort is on the front end. Once they've accepted the study at stage one, so long as you do what you say you are going to do, even if you get null results, they will still publish it, and a null result is actually a pretty important finding.

So I'll use my study as the example, and I'll try not to dig into the details too much, only the ones that are relevant to understanding what I did. I'm a neuroscientist, and like a lot of neuroscientists I use t-tests and ANOVAs; anything more complicated, I just run to the stats department and quickly bring someone on. In this case, right off the bat, enlisting the help of professionals, I brought on Sarah Schwartz, one of the statisticians with the College of Education, to help us design this at the front end. For this kind of registered report you have to do the power calculations that determine what sample size you need. A common benchmark is 80% power; here you actually have to have 90% power, meaning you need a fairly large sample size to ensure that if there's an effect, your analysis is going to capture it.

The other thing is that you have to have very clearly established hypotheses going in. They want to make sure you're not sending in 20 primary outcome measures and then picking and choosing a couple afterwards; they want you to commit to what you are expecting with which particular outcome measures. We actually included a figure at stage one to show visually what our expectation was for the data. And even though I'm showing you a registered report, this is something I like doing for any paper anyway: I always like to visualize what I would expect if the study runs the way I expect.

Now, as a neuroscientist, I study the neural control of balance, which is to say, what does the brain do to keep us from falling down? It sounds easy; it's actually very complicated. What we do is take people, throw them off balance, and look at what the brain does. I'm going to dig into a couple of details of my work just to show you one of the things we had to do for this registered report: the idea of building in a positive control. In this study we had people on a harness; the harness is attached by a cable to a magnet on the back wall. They're leaning, I can release them at a certain point, and we can look at the reactive step to avoid a fall. We want to see what role brain processes play in avoiding the fall. So we occlude vision, and we play a bit of a shell game on them, meaning we'll either put a leg block out there and uncover a handle, or vice versa. We want to see what the brain does in advance of a fall to actually help you avoid that fall. To look at that, we use a technique called transcranial magnetic stimulation.
In a nutshell, this allows us to take a quick snapshot of the connectivity from the part of the brain that controls the hand muscles. We want to see whether that gets facilitated just by seeing something that's going to help you avoid the fall, like a support handle. We have people in this setting where they're occasionally about to get released; the moment the goggles we use to occlude vision open up, we take this snapshot to get an idea of what's going on in the brain. Our outcome measure is simply the amplitude of what's called the motor evoked potential. That's the hypothesis slide you saw earlier: it was simply the amplitude of that motor evoked potential.

Now, what we had to convince the reviewers of is that we were actually going to find something in our study; we weren't pulling this out of thin air. We had a foundation from prior work leading us to suspect that the brain does this predictive priming based on vision. What had been shown before is that if people are purely relaxed, just sitting there looking at an object that could be picked up with a pinch grip, then within about a tenth of a second of seeing whether the object is there or not, if it's there, you get a facilitation of the muscles that would normally interact with it. It's an immediate priming associated just with looking at that object. We thought this would have implications for the world of balance control: you get this automatic priming of something that would help you avoid a fall, if you did fall.

What we had to build upon is the fact that this prior finding showed people would get that facilitation just from looking. In our case, we're looking at how this plays out with a whole-body postural reaction targeted to a handle to avoid a fall. But what we had to build into our study, the positive control, was this: purely sitting down, people would either see that handle or not. Again, we'd occlude vision, then all of a sudden they'd get vision, and the handle is either there or covered. Based on the visibility of that handle, we look for a facilitation of these intrinsic hand muscles. That is the effect we have to replicate in order to then test our experimental condition, the fancy new exciting thing we want to test. We have to build upon the foundation of what's established. For the registered report this is actually a required thing: you have to have an outcome-neutral positive control built into your study, so that if we did get null findings in our experimental condition, we could at least show that we have basic competence in what we're doing, that we can replicate the established effect we're building upon. Even though that was formalized by the registered report, I think it's a good idea you can add to any study you're going to do. (Two minutes left? Excellent.) And it saves you work down the road if you get this correct up front.

So I think my three points were: plan ahead; and, if possible, use a preregistration process. A lot of journals now have this format, and I think it's becoming more popular, so I encourage you to do it.
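As a side note on the 90% power requirement Dave mentions, here is one hedged illustration of what moving from 80% to 90% power does to the required sample size, using a generic two-group comparison and a made-up effect size; the actual power analysis for a study like his, done with a statistician, would be tailored to the specific design and outcome measure.

```python
# Hypothetical comparison of the sample size needed at 80% vs 90% power
# for a simple two-group t-test; the effect size is a placeholder.
from statsmodels.stats.power import TTestIndPower

power_calc = TTestIndPower()
effect_size = 0.6   # assumed standardized effect (Cohen's d) from prior work
alpha = 0.05

for target_power in (0.80, 0.90):
    n = power_calc.solve_power(effect_size=effect_size, alpha=alpha,
                               power=target_power, alternative="two-sided")
    print(f"{target_power:.0%} power: about {n:.0f} participants per group")
```

The jump from 80% to 90% power typically means a noticeably larger sample, which is part of why the front-end planning for a registered report takes real effort.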
You probably can't do every study like this, because it does take a lot of time up front, but if you can get some major studies in the pipeline that build on this, I think it's a good idea. And even if you're not going to do a preregistration, the idea of building in a positive control alongside your new experimental effect is something that will safeguard you in the review process later on. And that's it for me.

All right. Well, greetings, everyone. Thank you for attending. I just want to say it's been a real pleasure for the last 30 years interacting and talking with Susan, and listening to what she presented, I recognized just how much we have in common in our viewpoint of the universe. There really is a statistician viewpoint. Now, PowerPoint: I'm wondering what button I pressed to get rid of these lines here. Never mind.

Okay. I thought I'd put up something to tell you a little bit about me, because it tells you where I'm coming from. I'm one of the professors in the math and stat department, and I'm a statistician. I was hired in 1988, so I've been here quite a long time. I've been working collaboratively with researchers in many different disciplines for a number of years, but what's perhaps most relevant to our discussion today is that I spent three years running the statistical consulting center in the Department of Mathematics and Statistics. In that three-year period I worked on some 250 problems from every single college at Utah State University. Going into it, there were places I expected to see people coming from: the College of Natural Resources, the College of Education, the College of Engineering. What I was struck by was how pervasive the analysis of data is. It also gave me a fine appreciation for the range of research that goes on at Utah State. Coming over to the consulting center, sometimes I'd see faculty, but mainly it was graduate students sent up by their advisors, so I got to speak with a lot of different people.

I work on the application of statistical methods in a bunch of different areas, the largest of which is probably ecological and environmental statistics, and I'm also someone who knows quite a lot about the design and analysis of experiments. The area I'm most conversant with at this point in time is what we call statistical learning or machine learning (the terms are essentially interchangeable), and those are plumb in the center of what we call data science these days. All statistics is data science; the converse is not necessarily true.

Unlike the other presenters, I want to get right down in the weeds: I've got a trowel and I'm digging up the cheatgrass and the other stuff in the paddock. So here's an example. It's a cautionary tale, something that actually came in to the statistical consulting center when I was working there. Twelve soil cores were taken; you go down into the ground and pull out a core of soil. Each core was split into three levels: the top was two to five centimeters down, the middle was seven to ten centimeters down, and the bottom was twelve to fifteen centimeters down; fifteen centimeters is about six inches. The twelve portions at each depth were then combined, because it's twelve different cores, and thoroughly mixed.
Then 36 samples were drawn from each depth and randomly assigned to the treatments. There are only two treatments, so that's 18 to each treatment at each depth. For each treatment and each depth, at each of six different times, three samples are randomly selected and tested; the testing involves figuring out chemical composition by leaching the elements out with acids, so it's a destructive sampling process. It's obviously a very sophisticated experimental design, and some of you may have seen things similar to this. Clearly some people put a tremendous amount of thought into constructing this design. I spoke with the person who actually did the work, a graduate student, and a staggering amount of effort went into implementing the design and collecting the data. The student's advisor and committee signed off on this design. And it's actually a terrible design. Along the way, there was one level of replication that never was in the design of the experiment, and another level that was completely exploded by mixing all the samples together. There was a logistical reason for doing that: if they used just one particular core, there simply wasn't enough material to come up with 36 different samples. And so many of the traditional methods that we would apply to data like these, traditional analysis-of-variance techniques and so on, simply weren't valid.

We cobbled together something, but it's a real euphemism to say that the analysis did not meet the student's desires or expectations on this particular problem. It was really sad, and that was one of the big messages of my time in the consulting center: it was really hard to have to tell someone, your study was flawed from the beginning. As we've heard before, you can't fix it by doing some statistics after the fact. I encountered maybe 10 or 12 out of those 250 problems where I really had to tell the person, look, there's not much I can do for you except summaries and pictures and things along those lines; your study was flawed at the beginning and we can't do any inferential statistics. A couple of years later, I had the extant data and a graduate student working with me, and she worked really hard on this problem. We came up with a better analysis for it, but boy, we had to make some heroic assumptions, and it still wasn't satisfactory; a proper analysis simply wasn't possible with that design. And really, if the person had come to see Susan or me, or anyone who could have advised him on this, he could have avoided the whole problem. A very unfortunate situation.

So, a couple of thoughts about data analysis. Sometimes it's easy: you have a nicely designed study, you put it into PROC ANOVA in SAS, and boom, out comes the answer. That happens about once every 20,000 studies. The data analysis can be much harder than you think, and it behooves you to come to grips with that before you actually conduct the study. This gets back to plan ahead, the topic we all agree on, so that you don't get these nasty, potentially unfixable surprises right at the end. It's very important to know the details of the analysis. Susan said she doesn't often make people generate fake data first.
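For what it's worth, here is a hedged sketch, with entirely made-up numbers and a deliberately simplified structure loosely inspired by the soil-core story, of what that fake-data rehearsal can look like for a nested design. Fitting the model you intend to use to simulated data forces you to spell out where the replication actually lives (here, the cores), which is exactly the kind of issue that sank the real study.

```python
# Hypothetical rehearsal of a nested design: cores are the experimental
# units, depths are measured within each core. All values are invented.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)

rows = []
for treatment in ("A", "B"):
    for core in range(6):                    # assumed: 6 cores per treatment
        core_effect = rng.normal(0, 0.5)     # core-to-core variation
        for depth in ("top", "middle", "bottom"):
            y = (1.0 if treatment == "B" else 0.0) + core_effect + rng.normal(0, 0.3)
            rows.append({"treatment": treatment, "core": f"{treatment}{core}",
                         "depth": depth, "y": y})
fake = pd.DataFrame(rows)

# Mixed model: fixed effects for treatment and depth, a random intercept per
# core, so cores (not individual depth samples) carry the replication.
fit = smf.mixedlm("y ~ treatment * depth", fake, groups="core").fit()
print(fit.summary())
# If the cores had been pooled and mixed before subsampling, there would be
# no core-level replication left for this model to use: a problem this kind
# of rehearsal exposes before any soil is ever collected.
```

Roughly speaking, in the composited version of the design the core column would no longer exist; all the subsamples at a depth would come from one mixed batch, leaving no independent experimental units to supply replication.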
I've been lucky in some of the examples I've had: my collaborators and I have looked around and said, you know, this existing data set looks quite a lot like what we're going to collect, so we've been able to practice on a real data set. But the idea of actually getting down into the nuts and bolts and the weeds and generating some fake data so that you can practice your analysis is, I think, very important.

And on this business of the details (let me try this; I love new technology): there are a lot of data science techniques these days that people hear about, classification trees, gradient boosting machines, random forests, and they sound so sexy that of course we want to apply them to our data set. I've read so many passages in proposals saying, this is what I want to do, I'm going to apply random forests to this, and it's not good enough, because these techniques have requirements and limitations on the data, and you have to be aware of those. Usually there are multiple methods, multiple approaches, for analyzing the same data, reinforcing something Susan said, and you should learn as much as you can about the techniques you're going to use. You need to own the data analyses that are part of your thesis or dissertation, and you can do that by taking classes, by looking at online resources, and by talking to people who have expert knowledge in this area: faculty, consultants like Susan, and others. And I think that's all I want to say for the moment.

Great, can you all hear me? My name is David Rosenberg. I'm an associate professor in civil and environmental engineering, and my research focuses on water management, which is all about how we operate reservoirs, dams, and diversions, and how we try to get into the heads of water users, all of us, to conserve water, for a bunch of different purposes. For example, in one application, using tree rings we reconstructed stream flows going all the way back to 1480, and we've done that at a monthly level even though trees only grow one ring per year. And Nor Akela, whose photo is on your far right, is setting up a meter and register at the water lab that records water use every five seconds. When you can record water use every five seconds (now you're starting to get the idea of big data), you can see how long people take showers, how many times a day we flush the toilets, how many gallons each flush uses, outdoor water use, and lots of other really cool stuff.

With this type of data we do a lot of modeling as well, and that's gotten me into the area of reproducibility, because reproducibility is at the forefront of every science field. What is reproducibility? The idea that we collected the data, we make that data and the model code available, someone else can use those exact artifacts and reproduce the results, they can get the same results that we got, and we can actually verify that what's going on is good. So, my first main takeaway point: that's actually the far end of the spectrum, way up the continuum, and the reality of science in 2020, in my field and actually in many other fields, is that we're not even at the first step yet. How many of you have finished a research project?
Raise your hand. Now keep your hand raised if you also made all the data you collected in that project, all the models and code, and the results available for others to use, publicly, on the internet. I'm not seeing any hands, and that is also the reality, as I'll show you in a second. So the instructions and all the other digital artifacts: making that stuff available is just step one. Then, if the stuff is available, we can come to step two, which is what I started my few minutes talking about: can we actually reproduce the results? Can you get the same results shown in the figures and tables and other outputs; is that process reproducible? And then finally, I'd say we're probably a decade or more away from step three, which is: can we replicate the findings? That means, can we use new data sets, in new locations, maybe at different points in time, and come up with the same general conclusions?

So this is a continuum. I've walked the wrong way along it, and I'm still going the wrong way, and I just want you to think about how you can push your work up the continuum. It's not about getting all the way to reproducible results or replicable findings; it's about how you can even get on the continuum, or move up a step. Every good idea you've heard today, making our spreadsheets more readable, being able to keep track of which file we should be opening, these are all efforts moving up that continuum. Let me make sure I go the right direction this time.

Okay, so how can I make my results reproducible? This is the question you should be asking, or that I want you to be asking. Number one: build reproducibility into the project from the start. You've heard this one at least twice today, but I want to add a few more things: you have to budget the time to do all this, because it does take time, and you have to have money to pay people to do it, or sometimes to buy storage, like cloud storage, to host all your data. You have to think through registering your research design, as in the previous talk. And if your research involves human subjects, you have to think about how to go through the Institutional Review Board process, which governs human subjects research, in a way that lets you potentially share some of the data you're collecting, even confidential data. You need to think about that ahead of time; you can't do it at the end of the study, because by then all the agreements have already been signed, with all their restrictions. And think about what other tools, software and data management, can help.

Okay, number two: put all your materials in a repository, preferably one repository for the whole project, so they're all co-located in a place people can access. You choose the repository and the location: maybe it's Digital Commons, the university's online repository; maybe it's a repository specific to your field. In my field we have HydroShare, which is like social media for hydrologists, and that's a good option; we also use GitHub a lot, which helps with sharing code. One of the cool things about these repositories is that it's not just a bin you dump stuff in at the end of the project: you're building this container as you go through your project, everything goes into it, and at the end you hopefully don't have to do a huge amount of work in order to share it.
Okay, number three: make all your inputs and outputs available, particularly around proprietary work, that is, software where you have to buy licenses that someone wanting to reproduce your work may not have, or data that's private under the IRB, or steps that are computationally intensive. You want to make sure all the materials both before and after those steps are available: for the steps people won't be able to reproduce themselves, give them everything that led up to that step and everything that resulted from it.

Number four: how many of you, when you write a paper, an article, or a report, ask someone else to look at it and give you feedback? Good, a lot of yeses. We need to do the same thing for our repositories. They're actually quite complicated; they have a lot of text and a lot of different materials in them. Can someone follow through all the steps? That's the gold standard: you know you actually have reproducible work if someone else can get through it and get the same results. We've been increasingly building that into my own research group's practice, meaning that I have one graduate student check the work of another graduate student before we submit.

Okay, number five: train students and employees in reproducible practices. This is why I'm here today, right now, this minute, talking to you. We have to do a lot of work to get everyone up to speed, to have the skills and the training and the intuition for how to do it. If you want more tips, we just put out a new editorial titled "Making Research More Reproducible." It appears in the Journal of Water Resources Planning and Management, which is a flagship journal of the American Society of Civil Engineers, the big society that governs civil engineering; they publish 30 other journals as well.

Oh, I went the wrong way. Okay, I have a minute left, so I'm going to skip to this slide. Doing all of this was a ton of work and a ton of effort, and it requires a lot of people who care. To create a culture where we actually start reproducing results, it's going to take all of us, and not just us as authors: the journals publishing the work, for example with registered reports; institutions like Utah State University and other universities around the country and the world; federal agencies; and funders. Together we have to coordinate these things so that we can have science that is reproducible and, hopefully, replicable in the long run. So with that I'll finish. Where are we headed next? Questions.

Okay, we'll come back up. Do you have any questions you'd like to ask?

I only know about this because I know somebody who's in the program, but the College of Education here has a program for teachers of STEM subjects, like engineering teaching. Are you working at that level, trying to get high school students thinking about this, ingraining the idea that this is part of the scientific method? Because when it comes down to it, this is part of the scientific method.

I don't have much interaction with high school students in that sense. I have run and participated in code camps where we bring high school students to the university.
We have them code, in ICON or other languages, on water resources problems; that's been my main interaction with high school students. At that stage the primary motivation is to get people interested and excited about coding, working with data, and wrestling with problems. The problem we gave the students was Pineview Reservoir on the Weber River, about 40 miles from here: how should they construct release rules, how much water to release from the reservoir to supply different users and prevent people from getting flooded? The groups came up with some pretty cool examples; one group decided it was really cool to flood everyone, and because it's a computer model in a coding environment, of course, no harm done.

We have a fair amount of interaction between our faculty and staff and the high schools, presenting and trying to get students interested in taking statistics and so on, but it's not at this level. It's really at the level of trying to get them interested in data and the things we do with data, and then if they go much further, they're going to start encountering some of these issues of reproducibility and replication.

So I have a question for Susan again. You've talked about the importance of screening and exploring the data before analysis. What exactly do you mean by screening and exploring it?

To some extent it's the cleaning that Tyson mentioned: you look for unusual data values; they might be typographical errors, they might be cut-and-paste errors; you verify that all of the data in your data set are legitimate. But I also do tabulations. People bring me data sets all the time, and most of them have errors; they've gotten things wrong in their Excel data set, so they've got two observations for a treatment combination where they should only have one. So you're screening and cleaning things like that, and then summarizing: you start to do histograms, descriptive statistics, minimums, maximums, plots that illustrate the relationships you are looking for in your data set. The first step is to get good numbers, to where you're pretty sure that every number in that data file is not wrong.
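As a rough illustration of the kind of screening Susan describes (unusual values, tabulations, duplicate observations, quick summaries and plots), here is a small hedged sketch in pandas; the file name, column names, and expected design are invented for the example.

```python
# Hypothetical first-pass screening of a small experiment file; the column
# names and the expected design are made up for illustration.
import pandas as pd

df = pd.read_csv("experiment.csv")   # assumed columns: plot, treatment, response

# Unusual values: check ranges and obvious typos.
print(df["response"].describe())             # count, min/max, quartiles
print(df[df["response"] < 0])                # rows that should not be possible

# Tabulations: does every treatment have the observations it should?
print(df.groupby("treatment").size())

# Duplicates: two observations for a plot/treatment combination that should have one.
print(df[df.duplicated(subset=["plot", "treatment"], keep=False)])

# Quick pictures before any modeling.
df["response"].hist()
df.boxplot(column="response", by="treatment")
```

None of this replaces knowing your own data; it is just the sort of routine check that catches typos, duplicates, and impossible values before they reach the analysis.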
Then we move forward into exploration. There's a quote in one of Frank Harrell's books that goes something along the lines of: failing to look at your data before you do the analysis is about as bad a mistake as you can make. Isn't looking at your data first just cheating? No, it's not. But it is also hard: if your data set is really big, you can't actually ever look at all the values, so you have to figure out ways to wrangle it, because you can't necessarily inspect every single data value. Take the example I gave of the woman who retracted the papers: there was probably some data falsification going on in those data sets, and she did an extensive screening, but it didn't show up. Basically she was clued into it by somebody else, and it didn't show up until she looked for exactly that pattern: too many little blocks of data with all the same value. She had already summarized the data, because it was a really big data set, so she'd already done means and things like that, and the problem didn't show up there. You can't be too careful. Be skeptical; basically, when in doubt, assume the worst. But that's what I mean: I can answer your question, and it's going to be specific to your context and the nature of your data, but the first point is to get good, clean data, good data quality, and then start to look for the patterns in the data that you are ultimately interested in finding. To be honest, there's nothing cheeky about it; it's crazy not to do that.

Do you want to add anything? Well, just an anecdote. We did a hip fracture project with elderly people and asked them how many hours a week of hard physical activity they do; that could include working in the yard, cleaning house, and so on. When we summarized the data, because we were conscientious about this, the largest value was 98. Ninety-eight hours a week: that's 14 hours of hard physical labor a day, 7 days a week, for an elderly person.

No, go ahead. I think you have to flip the question. It's not "the data are okay until I prove them wrong"; you have to prove that your data are good. You start from the assumption that what you have is basically crap, even though you spent a lot of time and effort, maybe a lot of model work, maybe a lot of time programming. You have to come from the perspective of "I don't trust this," and the data have to earn your trust, rather than the reverse.

What if the data aren't your own, if the government produced the data that you're pulling from? Then you are in for it, seriously, because as you've discovered, the formats are all over the place, the quality of the data is highly inconsistent, they haven't documented it like they need to, and sometimes they don't even know where it is. I always warn my clients who are relying on collaborators, especially bureaucracies, that it's going to be hard and very time-consuming to pull the data together.

One point I can add to that: the reason I was in Ireland is that they have a data set there with thousands of older adults. I look at falls in older adults; they have brain scans from lots and lots of older adults, and their brain markers might be predictive of fall risk. They've actually got lots and lots of data, including things like grip strength. What we found out during these meetings is that they were quite variable in how they collected grip strength. Weirdly, grip strength is a marker of frailty and even of cognitive function, not just a basic strength measure, but if people do it with the arm supported versus unsupported, you get a different answer, because producing maximal strength while controlling all those joint torques is more challenging. So you'll get a different answer depending on how you do it.
We wouldn't have known that unless we had dug into the weeds and found out it was a problem. So we're not going to use grip strength from that period; the other measures are usable. But that's an example of how you find these things, and it's something to be cautious about.

You can also see that this is the core question of reproducibility: can you actually access and use the data or results that are available, in this case from the government? But you also become "the government," because at some point you're going to publish your results, and you want other people to be able to share and use them; that's part of a successful scientific endeavor. So my suggestion would be: as you struggle with using other people's data, which is a real struggle (it's hard, and we wrestle with it too; we're not the primary data collectors in most cases, the water conservation project being a rare exception for us), pay attention to what you actually need from those data providers and study providers in order to use their materials in your own work, and then hopefully carry that forward and address those issues when you put your own results out.

I'm sorry, if I could just say: I've had good experiences with government data services; they are meticulous in documenting stuff. You might wonder why in the world they did something, and the answer may be buried in two-point type in a footnote on page 693, but it's there; the information is actually there, and the quality of the data is very high, as good as it could possibly be. I was recently working with CDC data on flu typing, and again the quality of the data is not bad, and neither is the quality of the documentation.

I guess I was just thinking about this reproducibility thing in the case of human subjects research, where a lot of data sometimes comes from videographic sources: watching videos of people who have signed informed consents and IRB documents. I can't see a case where those videos could ever be made public. So my question is: will we ever get so serious about reproducibility that human subjects will be required to give consent to make video of themselves publicly available, or will it never reach that level?

Great question. My work intersects with the IRB when we put sensors into people's houses or facilities and we want to know how long their showers are, things like that, so I've had a chance to think about this a lot. First of all, for an IRB study or protocol that's already been done in the past, it's going to be very hard to change that, to come back after the fact and get people's consent to do something; it's already hard enough to get them to consent to the study to begin with, so coming back after the fact is really tough. But I do think this is changing, and it's changing because we're living in a digital world where I reveal more data and information about myself to Google, to my cell phone provider, to political parties. I'm looking at the video camera right here behind me: where is that video going, right?
Who is going to use that? This is just the world we're moving into. So I see it changing; as people become more comfortable with this, the IRB protocols can relax. Another thing, from talking with Courtney, who's in the CHaSS college and does a lot of sociology work related to water use: you can sometimes ask for people's consent afterwards. Initially you sign a consent form that lays out the terms of participating, but then after the study has been done you can ask, would you be willing, as an additional add-on, to consent to sharing your video? The video already happened; the person knows what's in it; they know there's nothing harmful there. So you can get consent on those most controversial, troublesome, or worrisome pieces afterwards, whereas if you asked someone up front, they would say no.

Excuse me, may I add a follow-up comment to that? I think the question that needs to be asked when working with human subjects is: what risk does this pose to the participants? For your case, I don't know what your research is or what your video captures, but that's really the crux of it. And to follow up, at some institutions you can have additional yes/no sign-offs, as if it were collected as a sub-study, or you could even say, "We would like to add this video to a repository," and the participant can say yes or no to that. One other follow-up: with the advancements in visual recognition and blurring, it's becoming relatively easy to mask people's identity, at least visually, and there's voice changing and things like that, so there are ways we can limit some of the potential risk. Just a thought.

Yeah. We did a study that was part of the iUTAH project, which is a big multi-institutional water project for Utah. They surveyed some 2,000 households up and down the Wasatch Front and then connected that to the participants' billed water use. My group wasn't involved in either of those parts, but we were able to access the data in an anonymized form, and that let us bring in landscape coverage data, from which we could tell how much turf area people had and how many trees, to get a better sense of their outdoor water use. We also got permission after the fact to share the anonymized version of that study with everyone, posted online. To actually follow through on those terms, we had to mask things (you were talking about blurring video images): we could measure the grass turf area on every plot down to two square meters, but we had to mask it up to 30 square meters so that there were at least three or four other households that matched; there are roughly 40,000 people, about 10,000 households, in Logan, so you couldn't identify which house we were talking about. So I think that sort of thing is very doable.

One thing to keep in mind, too, with human subjects data: depending on the population you're working with, and especially if it's a very sensitive population, some of this data is really valuable, and there are places for it. We're a member of ICPSR, which is a repository, and there are ways this data can be secured so that it cannot be accessed by just anybody; you would actually have to go to Michigan to access it. The reason this is important is that if you're dealing with some of these very sensitive populations, you don't want to be re-surveying them; they're already being surveyed too much.
By making sure that this data is kept in a place where others can utilize it, either to reproduce the results or to build upon those results, you make sure that that population isn't tested again and that the data is kept for as long as possible. So there are lots of options, when you have human subjects, to keep those subjects safe and protected and still allow the data to be reused. Anonymity, anonymity. Let's thank our panel again.