Okay, good morning, everyone. Thank you for coming to the second day of the workshop that we are organizing in the context of the María de Maeztu strategic program. We are starting with a very special keynote by Victoria Stodden. In fact, we were looking for someone to really introduce us to the world of reproducible research. When we started the program, we really wanted reproducible research to be a fundamental topic, and we thought that we could just get a few examples of best practices and then implement those best practices in the department. But it was not that easy, and maybe from today's talk we will realize why it's not that easy. So we had to go to the source, to people who really know what all the issues of reproducible research are, and so naturally we ran into Victoria. So let me just read her bio from her website, and then we can go from there. She's definitely a leading figure in the area of reproducibility in computational science, exploring how we can better ensure the reliability and usefulness of scientific results in the face of increasingly sophisticated computational approaches to research. Her work addresses a wide range of topics, including standards of openness for data and code sharing, legal and policy barriers to disseminating reproducible research, robustness in replicated findings, cyberinfrastructure to enable reproducibility, and scientific publishing practices. Stodden co-chairs the NSF Advisory Committee for Cyberinfrastructure and is a member of the NSF Directorate for Computer and Information Science and Engineering advisory committee. She also serves on the National Academies' committee on Responsible Science: Ensuring the Integrity of the Research Process. She's now faculty at the University of Illinois, in the Graduate School of Information Sciences, but before that she was an assistant professor of statistics at Columbia University, where she taught courses in data science, reproducible research,
and statistical theory, and was affiliated with the Institute for Data Sciences and Engineering. She co-edited two books released in 2014: Privacy, Big Data, and the Public Good: Frameworks for Engagement, published by Cambridge University Press, and Implementing Reproducible Research, published by Taylor and Francis. Stodden earned both her PhD in statistics and her law degree from Stanford University. She also holds a master's degree in economics from the University of British Columbia and a bachelor's degree in economics from the University of Ottawa. So I think from this it's clear that we have the person who can tell us what reproducible research is, and who can then enlighten us on a very interesting topic. So thank you for coming. It's a pleasure.

Well, it's really a huge pleasure to be here, and I feel like now I know who I am. So what I did is I put some slides together. I found your discussion yesterday very interesting, and I tried to reflect some of that in the slides a little bit. I'm also going to try not to take the whole hour, so hopefully we have time for discussion, and I am very comfortable if you have questions on a slide or in the middle; just put your hand up, and we can even start the discussion right away, because I think this is actually what you're most interested in, sort of learning and engagement. Okay, so I called the talk reproducibility in computational research, and what I thought I might do is parse through different ways to think about the term reproducibility. One of the first things that you run into, I think, when you start trying to imagine what this might mean for your work is that many different people have many different interpretations.
So one of the things that I'm going to do is take notions of reproducibility and try to ground them in some of the technological changes that everybody's working in and everybody knows have happened, but really ask: how do we tie those changes to reproducibility? And then I wanted to talk about scientific norms. Again, this is to help understand what reproducibility might mean when you actually implement it in your work. It's one thing to say this in an abstract sense; it's another thing to think about what we are actually trying to do when we carry out the research, and how our implementations of reproducibility can really support what we're fundamentally trying to do in terms of discovery and in terms of the scientific method. Then I added, especially for this group, best practice principles. It's not exactly what people want; I don't have a sort of flotilla of examples. However, I do have some nice examples, so I think they can support this. And then the discussion is really around the principles to keep in mind when implementing reproducibility in any particular situation. One thing that I've learned in the course of my research on this topic is that the type of research people do is so highly varied, even in one particular subject area, that it can be extraordinarily different, and that has implications for what it means to actually produce work that other people can reproduce. Then I have a little bit of discussion at the end, if we get there, around what it means for the scholarly record: what mindset can we use to imagine how we might start to behave in this world of radical transparency? Much of the discussion over the last few years around reproducibility has really been along the lines of: well, we have a serious problem in terms of being able to understand even what our colleagues are doing when they publish a paper. But I am optimistic that we'll solve this problem.
So five or ten years from now, we need to think ahead: what if we have all the code, all the data, all the methods, and everything is transparent, and this is the normal way that everybody does their research? How should we think about what we disseminate, about what the scholarly record should look like? If we keep this in mind, I think we do a little bit better at sharing and at preparing ourselves for that future. Okay, so everybody here knows these changes, but I think it's very useful to pull them out in segments. The first change that I like to pull out, which really drives why we have problems in reproducibility now, is enormous amounts of data: discovery that's driven from data, less driven by hypothesis, working with, say, very high-dimensional data where you have many more variables than observations. This has revolutionized, and I don't think that's really too strong a word, a number of fields, including, for example, the social sciences, which have now become to a large degree a fully quantitative endeavor, as opposed to where they might have been 20 years ago, with an enormous amount of theory involved as well. It has really reconstructed how we think about doing the research. Then there's computational power, another change that people are aware of. It's a lucky coincidence that we happen to have enormous increases in computational power just when we happen to get increasing amounts of data.
So we're able to do not just analysis of these large data sets in very sophisticated ways, but also much more sophisticated simulations as part of the discovery, which is what some fields are built around, for example. So I think when you mention changes in technology that have really changed how we carry out scientific research, people will mention those two. What I almost never hear anybody mention, although I think there was a little bit of it yesterday, is another fundamental change: our scientific, intellectual contributions are now appearing only in the software that's written to support scientific results. This isn't something that appears in the publication itself; the contributions are sitting in the software. That makes me really consider software as a first-class scholarly object with real intellectual contributions in it, contributions that don't appear anywhere else in the dissemination, in the publication. So I put a little screenshot there. That's Lior Pachter, who is a professor at Berkeley in, I believe, mathematics and biology. He's giving a keynote, and I just took this screenshot off YouTube as he was speaking. He's saying how, for him, around his research, the software contains ideas that enable biology: not just in the paper, not just sitting, for example, in the data, but really in the software itself. Okay, two more that I'll mention as well, which are in a sense also obvious but very important to think about when we start reconceptualizing dissemination. The first one, of course, is that we are moving things onto the internet now. That's very obvious, but it's worth keeping in the back of our minds as one of the technological changes when we start thinking about how we are going to reinvent the scholarly record and dissemination as we become radically transparent. The other, often overlooked in my experience, is that when we start sharing things openly on the internet, we run immediately
into intellectual property problems. So one of the things that I've done in my research, and I won't talk about it too much this morning, is a number of papers on how to share your work in such a way that others actually can use it, can build on it and extend it in a legal sense. The default is that you can't, right? Copyright stands in the way, and other intellectual property barriers stand in the way. So whenever you're delivering these digital scholarly objects, you need to make sure that they're permissioned or openly licensed so that other people can actually go ahead and use them. Okay, I'll come back to all these points; they float through the entire talk. So if I asked each of you what reproducibility would mean in your work, I think I would get some quite different answers. One of the things in discussions around reproducibility that you might have run into here already is that when people have different ideas of what the concept under discussion means, it's very difficult to move to a standard or a resolution. So one of the things that I have found useful is to try to pull out pieces of what reproducibility actually means, because each of these pieces has very different remedies associated with it. So here is the first type of reproducibility that I've identified, and maybe you can come up with others or criticize my three, but these are the ones that I have found useful. The first one is empirical reproducibility.
There is a lot of press and attention being given internationally, and all over the US that I'm aware of, to irreproducibility issues in the life sciences, and this is a huge concern to people, of course, because people want to know that the drugs they're taking work, that the research is reliable, and so on. So it really hits the popular media, because that's something people can really understand. I've pulled those issues out as empirical reproducibility in the sense of: can I go to a similar wet lab bench, say for example in biology, and can I physically carry out the same experiment that you actually carried out? So in a sense it's empirical; you're doing things in an empirical way. And it's the same idea of reproducibility that we've had throughout the sciences all the way through. So I think in those discussions we don't actually have anything new that we're talking about; we're really thinking about why the existing system, which has been around for hundreds of years, broke on those studies that were done in an empirical way. I don't focus my research so much on empirical reproducibility, but this is probably 80% of the hype that you hear around reproducibility. Another way of thinking about reproducibility that I find useful, and I have examples of all of these coming up, is statistical reproducibility: things like experimental design, or power, or how you might expect your statistical conclusions to generalize to a new sample. That's of course very different from, say, someone carrying out a biology experiment at the bench. And then the last one, where I really do focus a lot of my research, is computational reproducibility. How do those technological changes that I just outlined a few minutes ago impact the type of research we do,
how we carry it out, how we disseminate it, what our standards are, and so on? If people have these different notions in mind when they're talking, we end up with a very confused conversation, so I have found those distinctions to be fairly useful. So here's an example of empirical reproducibility. There was a short article published in Cell Reports called "Sorting Out the FACS." I don't think I put the year on there; I think it's 2013. What happened was this: there was a well-funded collaboration between Harvard and Berkeley, two top labs that do biological research. They process cells to understand how cancer is advancing in cells, particularly breast cancer cells. In the grant that they got from the National Institutes of Health for this collaboration, they specified that they would spend two months at the beginning of the grant making sure that each lab, the Harvard lab and the Berkeley lab, was producing identical cell outputs that were indistinguishable, so that they had a baseline for their research and there was no signature associated with the lab where the cells were processed. It took them two years to get to the point where the processes in the two labs were actually close enough that you couldn't tell which lab the output had come from, and they were shocked, enough so that they actually wrote this article saying, you know, this type of reproducibility is much harder than you think, and we're probably introducing all sorts of error into our research unwittingly, because we think the outputs from the labs are the same, but really they're quite different. And the part that I highlighted there, which I think is impossible for you to read,
was when they figured out what the actual issue was: one lab took their cell slurry and stirred it in a beaker; the other lab put that beaker in a centrifuge. It took two years to unpack the process enough to understand what those differences were. Another example that I just like to throw up, because the different types of reproducibility have quite different problems associated with them: there was a workshop about a year and a half ago on reproducibility issues in research with animals and animal models. In the discussion yesterday and today we have always taken reproducibility to be a good thing, but if reproducibility means carrying out an experiment a second time, that means you're going to be killing animals. Suddenly you have costs and benefits involved in reproducibility; it's not necessarily always good to be reproducing an experiment. So those are the types of issues that come up in empirical reproducibility. Okay, statistical reproducibility. Again, I won't spend too much time on it, but just to give you an idea: using, for example, hypothesis testing and the entire machinery of p-values to actually discover your hypothesis instead of test your hypothesis. I'm sure people who work in that domain are quite familiar with this idea of p-hacking, looking through the data until you get something that matches your preconceptions of what the results should look like. Designing low-power experiments, non-random sampling, for example: all of these things stop your results from generalizing. How have you treated outliers, especially when you're combining data sets? What are you reporting? This goes right to the heart of reproducibility.
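The point about outlier treatment is easy to make concrete with a small sketch. The numbers and the trimming rule below are invented for illustration, not taken from any study mentioned in the talk; the idea is just that two equally defensible outlier thresholds give noticeably different summary statistics.

```python
import statistics

# Invented measurements with one extreme value.
data = [2.1, 2.4, 1.9, 2.2, 2.0, 2.3, 9.7]

def trimmed_mean(values, threshold):
    """Mean after dropping values more than `threshold` above the median."""
    med = statistics.median(values)
    kept = [v for v in values if v - med <= threshold]
    return statistics.mean(kept)

# A tight threshold drops 9.7; a loose one keeps it.
print(trimmed_mean(data, threshold=1.0))   # 2.15
print(trimmed_mean(data, threshold=10.0))  # about 3.23
```

Unless the paper reports the threshold and the trimming rule, a reader has no way of knowing which of these numbers the analysis actually produced.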
I can't understand exactly how you've extracted that information from the data, what types of methods you've used, if I don't even know how you've treated outliers, or what your thresholds were, or what kind of data pre-processing you've done. And very rarely is that information ever really included in the paper, right? Of course, as you all know, small changes in how you treat outliers can dramatically change the results; they can be quite sensitive. Then there are all the regular problems: poor model design, misspecification, very sensitive models whose output changes substantially for small changes in the parameter settings that go into the model. All of these things make the results more fragile and less likely to actually replicate. So it's a whole different set of questions from what the Berkeley and Harvard labs were looking at when they were trying to deal with their centrifuge. Okay, and then what I'm going to spend most of my time on is the notion of computational reproducibility. This one is new over the last at most 20 years, maybe 15 years, in which we've really started to embed computation as the normal way that we carry out research. I would say, as a rough guess, upwards of 85% of all scientific output uses computation somewhere.
Maybe upwards of 90%. It's just very difficult to publish work that is either wholly theoretical or doesn't somehow use at least some computational aspects in deriving the results. Okay, so the way I like to think about it is that we've traditionally had two branches of the scientific method. We started with the deductive branch, including mathematics, logic, and reasoning that could be deduced from axioms, and then this was expanded to the empirical, or inductive, branch, which started to include statistical analysis of controlled experiments. Deductive logic, the Pythagorean theorem and so on, didn't help you when you wanted to understand how close to the Nile to plant your crops so they wouldn't get flooded but were still in the flood plain; you had to start to make these predictions. Now, there's a lot of talk about how the technological changes I outlined at the beginning are creating new branches of the scientific method. So you've probably heard of a third branch of the scientific method: we can carry out high-powered simulations, extract knowledge, and ask new scientific questions within the context of a simulation, simulating an entire physical system, for example, changing the parameters, simulating again, and trying to understand our world through means that are wholly computational. That often gets labeled the third branch of the scientific method. The fourth branch of the scientific method is data-driven discovery. People think that we can really take these enormous amounts of data that are everywhere, landing in our lap, and it's almost embarrassing how much data we have as researchers, and ask new questions, develop new types of methods, and really get results that we wouldn't have gotten otherwise. And so people start to say, well, that's a fourth branch of the scientific method.
In fact, we've, you know, doubled the number of branches. So the main thesis that I'm going to present to you today concerns these third and fourth branches. I put a little question mark in there, which you probably noticed, because computation presents only a potential third or fourth branch of the scientific method. What we need to do is all the hard work on standards for dissemination to really bring simulation and data-driven discovery up to third and fourth branches of the scientific method. Let me explain that a little bit more. If we think about why we have a scientific method that guides our research, the reason is that all of our processes are fraught with error, and when we discover something, the first question is: well, do we know it's right? How confident are we in this? Of course, there's no gold standard, no answers in the back of the book where I can check. All I can do is present the result to the community and try to convince my peers that this is actually something to which we should attach a very high probability of being true. In the deductive branch, we didn't just present the Pythagorean theorem; it traveled with its proof, so other people could actually go through and verify why we should believe it or use it. In the empirical branch, there's an entire established machinery of hypothesis testing. You use appropriate statistical methods, and then there's a very structured way that you communicate the results. If you try to publish a paper and just leave the methods section blank, of course they won't publish that.
Of course, they won't publish that So We need comparable standards Whatever the it means to have a proof in computational science Whatever that same sort of dissemination standards and protocols are in computational science in my opinion That's what we need to develop before we can really start calling Simulation and data-driven discovery and so on third and fourth branches of the scientific method Okay, so this quote actually gets Mentioned quite a bit when people talk about reproducibility and I like it so I'll I'll go ahead and mention it So my advisor actually is David Donahoe that you can see in the middle of the slide there He was inspired by John Clair about who was a geophysics professor. I believe he's emeritus now at Stanford who in 1992 He actually started refusing to sign a student's thesis unless he could Reproduce it in the computational sense. So run basically they would give him a make file and they would it would chunk through and then Eventually generate the thesis including all the figures and so on and unless he was able to reproduce the thesis on his system He wouldn't sign it and the student wouldn't graduate So he was very heavy-handed about it On the other hand he really developed a lot of the core notions around reproducible research that we rely on today And I think it's very interesting that Clair about didn't talk about integrity or Fourth branch of the scientific method or any of this stuff. 
He actually talked about one of the reasons that was mentioned yesterday: he had new students coming into his lab, and it would take them about two years to produce new original results, because they had to spend so much time rewriting code or trying to understand what previous students had done before they could contribute in a novel way. When he started refusing to sign a thesis unless it was reproducible on his system, that went down to two weeks. Students could immediately rerun work that had been done previously and start extending it almost trivially and contribute new results. So that was his motivation, and it was very practical. Donoho summarizes these ideas of really reproducible research, which is the term that Claerbout used, by saying that an article about computational science in a scientific publication is not the scholarship itself (and of course we are all rewarded for, and get fixated on, that publication); it is merely advertising of the scholarship. The actual scholarship is the complete set of instructions and data that generated the figures. Okay, so one question that I often get at this point that I'll just mention: people tease out two different ways to interpret reproducibility in the computational context, and they're exactly right. One is what I just mentioned in the thesis example: can I actually just run your scripts, run your code, using your data, using the same input parameters and the same settings, and can I get figure four, or whatever it is that you're putting in your paper, or your tables? Can I get the same results? People will say: but who cares, really? That's a sort of waste of time; you're not extending scientific knowledge that way. All you're doing is verifying a computational system. And that's true.
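Even this seemingly mechanical rerun matters, because a reimplementation will rarely match bit for bit. That can be seen at the level of basic arithmetic: floating-point addition is not associative, so merely reordering the same computation changes the last digits of the result. This is a generic Python illustration, not an example from any particular paper.

```python
import math

# The same three numbers, summed in two different orders.
a = (0.1 + 0.2) + 0.3
b = 0.1 + (0.2 + 0.3)

print(a == b)  # False: rounding differs with the order of operations
print(a)       # 0.6000000000000001
print(b)       # 0.6

# math.fsum compensates for rounding error, giving an
# order-independent result.
print(math.fsum([0.1, 0.2, 0.3]) == math.fsum([0.3, 0.2, 0.1]))  # True
```

If two pipelines disagree in the tenth decimal place, being able to rerun the original scripts is what lets you decide whether the discrepancy is this kind of numerical noise or a real difference in method.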
So they say: well, what would be really interesting is suppose I took your results from figure four, and I understood in an abstract way what you had done to get them, and then I recoded it all, and I started from new data and new collections, and I independently tried to generate the figures that are in your paper. If I get the same thing, then this is a real scientific contribution, and it's very interesting, and so on. And I think that's true, and that's right. I'm not arguing that running someone else's code and making sure you can get their results, or even rerunning your own code and re-getting your own results, is a new scientific contribution. However, I can guarantee you: if you go through and you code up someone else's work and you try to get the same results, even if you fully understood what they had done, you won't get exactly the same results. The pipeline is just too complex these days; there's just too much in the systems that we can't control. You will get something a little bit different. So then the question is: is that difference real? Are you actually providing counter-evidence
for the findings that were published, or is that difference just noise in the discovery pipeline that we should ignore, in which case you've essentially come up with the same results? The only way we're going to understand how to reconcile those differences, whether they're real or apparent, is if we can actually run through the scripts and generate those results in that less exciting way. So I think we really need both. We need that level of transparency that allows us to do the diff between the different methods that produced the result. Of course, that doesn't mean the results are correct, right? We can reproduce terrible, wrong results all day, and it doesn't mean they're actually right. But with independent reproduction, and being able to understand how those two approaches differed, we can really start to say: okay, that corroborates our results, or that's counter-evidence, and then we can really have that discussion. Okay, so norms. I think it's useful, in thinking about this, to consider why we have reproducibility, where it came from, and which responses are appropriate. There's an enormous number of responses that can be taken when trying to make your work reproducible, with varying degrees of work falling on the researcher, so we all have a bias, maybe, as to which ones we might prefer. Thinking about the norms can help parse out what's important and what's not, or what's idealistic and what's important for us to actually do. So Merton in 1942 came up with these five norms of scientific research. They are not uncontroversial, even though I'm setting them up here as if they are, but I have found them helpful in guiding my thinking around these issues.
So: communalism, scientific results are the common property of the community. Universalism: all scientists can contribute to science regardless of race, nationality, culture, gender, and so on. Disinterestedness: as researchers we are acting for the benefit of a common scientific enterprise rather than for our own personal gain. Originality: scientific claims must contribute something new in order to get the attention of the community. And skepticism: scientific claims must be exposed to critical scrutiny before being accepted. The two that I rely on most are probably skepticism and, a little bit, communalism. The research that we're doing is changing in ways beyond the scope of this talk: we have far more interaction and collaboration with industry, and we are thinking about new funding models, for example at the government level, for how we actually produce the research. I have found those norms useful to bear in mind, about what we're about and what we're doing, when we start crafting these new ways of carrying out our research and getting it funded. It's very easy to slide away from things like scientific results being the common property of the community, for example, when we start dealing with people who aren't scientists. Okay, so skepticism, the last one there. I'll just mention a historical note on the idea that we can independently verify the claim, because this is the one that I think really drives a lot of our need for transparency, why it's an issue, and what's going on behind reproducibility. This is Robert Boyle.
He initiated this idea in the 1660s, actually, with the advent of the first journal at the Royal Society. He wanted to be able to reproduce experiments without having to write a letter to the author of the original paper or somehow have the author involved. So that's where we get this idea that you shouldn't have to email the author if you want to understand their work; it should, in theory, be contained in the paper, according to Boyle. Then the question is: he, of course, wasn't dealing with computational methods; he was dealing with air pumps and vacuums, what I was calling empirical research. So how do we take that notion, which worked for the empirical setting, and apply it to this setting where we have deeply computational research? Okay, so here's the part everybody has been waiting for: best practice principles. And now I've built it up too much, because it's not going to solve all the problems, but, as with the norms, maybe there are things here that can guide decisions. I actually have a paper on best practices for computational science, so I dug it out and put it up there: "Best Practices for Computational Science: Software Infrastructure and Environments for Reproducible and Extensible Research." This is from 2013.
So already it's three years old, and things move so rapidly in this area that I would probably write a slightly different paper today. But what I did in this article is trace through a history of some of the discussions that have been happening in the community and then try to come up with some guiding ideas for best practices for researchers. One of the things that I was alluding to at the beginning is that the type of research you're actually doing comes with very different prescriptions around reproducibility. So, in the computational context: I actually did this Yale roundtable in 2009, which you can see in the bluish writing up there, "Reproducible Research: Addressing the Need for Data and Code Sharing in Computational Science." That was, I think, one of the first gatherings that brought researchers, stakeholders, and funding agency folks together and said: we've got a problem in terms of the transparency of computational work, but what are we going to do? We came up with that document, which we published, saying here are some ideas that people can think about using in their work, which I'll get to in a little bit. And then we had the next section, which was the dreams: things we would really like to see, like tools, or support, or changes in how we saw funding happening, and so on. At the end of 2012, Brown University, which runs these week-long workshops that you apply for at ICERM, hosted one, and we produced a report there. The ICERM discussion was much bigger; people were extremely interested, very agitated and lively, and these were people across different disciplines who didn't necessarily know each other, coming together around reproducibility. The workshop report there is much more comprehensive, and if you're interested, I'm happy to send links and so on. It has some of these definitions and levels of reproducibility, and it's one stab at the problem. You may
read it and say, well, this part doesn't apply to my work, that part does, and I would change this other part a little bit. That's fine; that's what we intended. But the key idea was setting the default to reproducible in computational research. Instead of saying, well, I can't do it for this reason, or my data has privacy issues so I can't do this, make everything open, transparent, and reproducible, and then deal with the exceptions. So you have an exception because you have, say, human subjects in your data; clearly you're not going to be putting that data on the web. But everybody else who doesn't have that exception goes forward and tries to make their work more transparent.

I put this one here too. XSEDE is a tool that I think I heard mentioned yesterday that some people use. It's an interface between high-end, high-performance computing resources and researchers, a software interface that gives them access to the big, powerful machines. They have an annual gathering, and we did a workshop on reproducibility there (the slide is a little bit blurry), and again we came up with a workshop report. The notion was: if XSEDE is creating a software middle layer that assists access to these computational resources, maybe that software is really the right place to start building in some of these things, like capturing what functions were submitted, in what order, with what parameter settings. That could all be automated, right? And I think it's interesting, because one of the arguments against reproducibility, also mentioned yesterday, is the underlying hardware.
The computational system itself can change. So even if I have your scripts, even if they've been bundled in a virtual machine as much as possible, actually running them on a different system is a whole different kettle of fish, and painful, and it's not clear that's necessarily a valuable use of a researcher's time. But it's very interesting, because the high-performance computing community, in my opinion, is really making the greatest strides here, and they probably have the hardest challenge, because they really do have unique pieces of hardware. A supercomputer: there's just one, and the software is so customized for that particular piece of technology. So it's a very challenging case for reproducibility, and yet they're taking it on.

For example, supercomputing has a huge annual conference called Supercomputing, although they've just changed the name to SC, and there's a student competition every year. In the steering committee for this conference, they started thinking about what their requirements around reproducibility could be, exactly the same questions you've all been asking yourselves, and they ran into the same issue: they didn't know exactly what they should be requiring of researchers. And it's a competitive issue; people really want to publish in this conference and be part of the proceedings, so it has to be fair. They can't arbitrarily ask for incredibly difficult things.
They don't want it to be a joke, either. So the approach they took, since they don't know the answer and this is really high-impact on people's careers, is to run a reproducibility study as part of their student cluster competition. The student cluster competition happens every year, and this year, for the first time, they've chosen a paper that the students will actually replicate. That becomes their learning opportunity: what worked, what didn't work, what was hard, what did the students need? And that's where they're going to start building their standards for the requirements for the conference proceedings themselves. So that's one approach, and very interesting, and they're doing it for the first time this year, so you'll have to stay tuned on that one. These things just take time.

Okay, so this is a wall of text, but I'll summarize it for you. What I wanted to do was pull out text from a 2003 report, just to let you know how long these issues have been at the forefront of some people's minds and under discussion. We're going to be close to twenty years of discussion on the topic in about five. This is from the National Academy of Sciences; the publication is called "Sharing Publication-Related Data and Materials: Responsibilities of Authorship in the Life Sciences". So they were focused on the life sciences; however, I think what they're saying does generalize. Here are the principles from that report.

Principle one: authors should include in their publications the data, algorithms, or other information that is central or integral to the publication, that is, whatever is necessary to support the major claims of the paper and would enable one skilled in the art to verify and replicate the claims. You'd think I just wrote that this morning, right?
But that's actually 2003; I didn't write that. And it's coming from the life sciences, and you can see how they're quite sensitive to the computational aspects we're talking about: algorithms. In many of the discussions that I've seen around reproducibility in the empirical sciences, you almost never see code mentioned. They might talk about data sharing at most, but they somehow skip over code. So I was very happy to see algorithms mentioned in there. Okay: in exchange for the credit and acknowledgement that comes with publishing in a peer-reviewed journal, authors are expected to provide the information essential to their published findings. We don't have those standards today, still. However, a number of journals are really moving toward those requirements, and rapidly.

Okay, principle two: if central or integral information can't be included in the publication for practical reasons, large data sets, human subjects, there are real reasons, it should be made freely and readily accessible through other means. Well, okay, you can see it's 2003 when they say "for example, online". When necessary to enable further research, integral information should be made available in a form that enables it to be manipulated, analyzed, and combined with other scientific data. So they're already layering additional work on people. I haven't really mentioned that issue yet; I've just said "data and code", but that's very abstract, and when you actually think about sharing your data or your code, things come up like formats, interoperability, what kind of metadata standards you need. Do you want to make this machine-readable or discoverable? Where do you actually share this?
All these issues start to float up to the surface.

Principle three: if publicly accessible repositories for data have been agreed on by a community of researchers, and there are many communities where this is really the case, and they are in general use, the relevant data should be deposited in one of these repositories. So we have domain-specific repositories, many of them around, say, the life sciences and genomics, funded by NIH, and other repositories that started at a particular institution and became the standard for the domain. Those came organically from the scientific community. We don't have any similar thing for code, in the sense of something that came organically from the community. All these repositories will tell you that they'll accept code, but what they're talking about is a bundle of bits, like a zip file, that sits there, static, like a data set. They're imagining code as data, which is very unsatisfying. Something more like GitHub or Bitbucket is much closer to how people want to share and deal with code. But GitHub and Bitbucket aren't from the scientific community: GitHub is a commercial enterprise, a closed-source platform, and it's largely becoming the standard now for how scientific code gets shared. When we think back to Merton's norms, though, and about that engagement with people who have different sets of incentives than we do... I don't have better options at this point.
I think GitHub's great, but it isn't organic to our community the way some of these data repositories were, and people knew about those back in 2003. So it's just something to think about. If you think about it, we have cycles in code-sharing platforms. People probably remember SourceForge, right? Ten years ago. Gone, essentially; no one uses it. Then people used Google Code quite a bit and moved their code over from SourceForge; now people have moved on to GitHub, and they'll move somewhere else. It's about a five-year cycle. So the permanency of these objects is something to think about as well. The idea that you could read a paper that's five years old and still be able to reproduce the results seems desirable, right? It's not easy. Or a paper that's ten years old, and I'm leaving aside the issue of actually running the code, since ten-year-old code is probably never going to run. However, it can still be useful to look at what the parameter settings were, how they implemented the algorithms, and what the decisions were.
Okay, so here are the best practices that I came up with in the paper. The first one: open licensing for data and code, the legal terms that travel with the object, for the code and for the data, so you pre-permission these objects, and I can talk more about this if you're interested, so that people can legally just click, download, and use. I would put citation recommendations with anything that I shared, so hopefully people go ahead and cite the data and the code however you prefer them to.

Workflow tracking. I think it's very difficult to produce reproducible research if it's something you decide to do after you've written the paper. You really need to start at the beginning of the research process and collect, as you're going through, the pieces of information that you think you're going to need to share at the end: the additional information that tells people what order to run your functions in and what the parameter settings were, all those little pieces they need to actually run your work and produce the final results.

Data available and accessible, and version control for data. People generally don't think about this one; I think it's very important. If you think about the published claim as the primary object that you're sharing, with the code and the data supporting that claim, and this is really the notion of reproducibility I've been building, then if you have a data set that got fixed or updated between publication and when a user reads the paper, you've broken reproducibility, right? People find that counterintuitive.
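The workflow-tracking practice just described, recording which functions ran, in what order, with what parameter settings, can be sketched in a few lines. This is a minimal illustration, not a real tool; the analysis functions and parameter values are invented for the example.

```python
import functools
import json

PROVENANCE = []  # ordered record of every analysis step


def tracked(func):
    """Append the function name and its parameters to the log on every call."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        PROVENANCE.append({"step": len(PROVENANCE) + 1,
                           "function": func.__name__,
                           "args": repr(args),
                           "kwargs": kwargs})
        return func(*args, **kwargs)
    return wrapper


# Hypothetical analysis steps, just to show the idea.
@tracked
def remove_outliers(data, threshold=3.0):
    mean = sum(data) / len(data)
    return [x for x in data if abs(x - mean) <= threshold]


@tracked
def normalize(data):
    top = max(data)
    return [x / top for x in data]


if __name__ == "__main__":
    result = normalize(remove_outliers([1.0, 2.0, 3.0, 50.0], threshold=20.0))
    # Dump the workflow so it can travel with the paper.
    print(json.dumps(PROVENANCE, indent=2))
```

The point is only that the log is built as the analysis runs, not reconstructed afterwards; at publication time the JSON dump can be shared alongside the scripts.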
Well, yeah, there are mistakes in the data set, and of course I want to fix the data. On the other hand, the way we think about results in the scholarly record is that they are fixed in the scholarly record, mistakes and all. Code is always buggy; all these mistakes exist. We need that full snapshot to persist. If you want to change the data set or the code, the way we normally do that is as a separate entry into the scholarly record. Many, many times I've run into people who run repositories, or data managers who are wonderful at their jobs, and there's a difference between ensuring that work can be reproduced from a particular data set and the ongoing revision of data sets. GitHub has a file size limit of two gigabytes; it's not appropriate for data. And we don't have, although people are working at it and nibbling at the problem, an organized version control system for data. It seems pretty clear we wouldn't want to re-copy the data sets every time we make a small change, so we need some more intelligent system.

Making the raw data available. In some discussions, people like to make the processed data available with the scripts they used to generate the figures. However, as I mentioned at the beginning, that goes back to the statistical reproducibility issues: what did you do with your outliers? What did you do when you processed the data? That can be really important, so I would say make the raw data available. Again, you run into size issues, you run into all sorts of things, but in the ideal, in the abstract: make the raw data available and have the processing scripts available as well, so people can see all the decisions that were made. And data types: data, of course, is so different, from tiny little files running toy examples up to very large streaming or dynamic databases. So how does this happen?
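Short of a full version-control system for data, one lightweight way to pin down exactly which snapshot of a data set a published result came from is to record a cryptographic fingerprint of the file alongside the results. This is a sketch of that idea, not a prescribed tool; the file and manifest names are hypothetical.

```python
import hashlib
import json


def dataset_fingerprint(path):
    """SHA-256 of the file's bytes: identifies this exact snapshot of the data."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        # Read in chunks so large files don't have to fit in memory.
        for chunk in iter(lambda: f.read(1 << 16), b""):
            h.update(chunk)
    return h.hexdigest()


def record_snapshot(data_path, manifest_path="manifest.json"):
    """Write the fingerprint next to the results, so a later reader can check
    that the data they downloaded is the version the paper actually used."""
    manifest = {"data_file": data_path,
                "sha256": dataset_fingerprint(data_path)}
    with open(manifest_path, "w") as f:
        json.dump(manifest, f, indent=2)
    return manifest
```

If the curated copy of the data is later "fixed", the hash no longer matches, and the reader at least knows they are not looking at the published snapshot.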
Again, a system that works perfectly for small data sets just immediately breaks for larger data sets, and I don't have a silver-bullet answer on this; it just makes the problem more interesting and more fun.

Code and methods being available and accessible, and version control for code; a huge amount of work has been done on that. Making the code available externally: I personally believe that the time for exposure and transparency is at the point of publication. When a researcher decides, these results are ready, I have confidence in them, I'm going to bring them out to the community, that's when all the additional aspects flow with the results: data, code, workflow information, and so on. Some researchers don't agree with me. They think we should be exposing our research as we're going through, and they cite things like, well, what about negative results and dead ends and p-hacking and so on; shouldn't there be a more elaborate trail of the research that was actually done? There isn't a right answer on this. I just feel that while the research is under deliberation and consideration, it's a very creative, private way that you're interacting with the work, and the point of publication is where you're saying to the community: I've got this to a level where I'm confident sharing it all.

Version control for environments: so, for example, Docker and some of the virtual machines and containers that were mentioned yesterday.
So, ensuring that those are shareable. Code samples and, well, I put "test data" there, but what I'm really talking about is thinking about testing in the context of software and code for science. If you're coming from the open-source software community, when you contribute code to a project you would never contribute just the executable statements; you would also contribute tests, so people know when the code is running correctly and when it's not: unit tests for pieces of the code, regression tests, and so on. We don't think about that much when, as scientists, we share our software. There is an implicit test when we share: can we produce figure four, or whatever it is? But thinking about what it means to understand when code is running correctly and when it's not is, I think, a new area for us. Maybe if I throw in an appropriately sized matrix of zeros, I get some predictable result, and we can start thinking about tests that should travel with the software. It's not always going to be possible to inspect what's actually in the software; at some point we're going to need to rely on tests to really understand it. We certainly have to for proprietary codes used in research, which are closed. And what to do with the big code bases is an interesting problem, especially ones that have been around twenty years already. How do we start exposing and open-sourcing some of those code bases?
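As a sketch of what "tests that travel with the software" might look like, here is a toy mean-filter routine (hypothetical, not from any real paper) with exactly the kind of predictable-input checks described above: a matrix of zeros should come back as zeros, and a constant image should come back unchanged.

```python
def smooth(image):
    """Toy 'scientific' routine: average each cell of a 2-D list with its
    4-neighbourhood (edge cells just use the neighbours that exist)."""
    rows, cols = len(image), len(image[0])
    out = [[0.0] * cols for _ in range(rows)]
    for i in range(rows):
        for j in range(cols):
            neigh = [image[i][j]]
            if i > 0:
                neigh.append(image[i - 1][j])
            if i < rows - 1:
                neigh.append(image[i + 1][j])
            if j > 0:
                neigh.append(image[i][j - 1])
            if j < cols - 1:
                neigh.append(image[i][j + 1])
            out[i][j] = sum(neigh) / len(neigh)
    return out


# Tests that travel with the code: known inputs with predictable outputs.
def test_zeros_stay_zero():
    zeros = [[0.0] * 4 for _ in range(4)]
    assert smooth(zeros) == zeros


def test_constant_image_is_unchanged():
    flat = [[7.0] * 3 for _ in range(3)]
    assert smooth(flat) == flat


if __name__ == "__main__":
    test_zeros_stay_zero()
    test_constant_image_is_unchanged()
    print("all tests pass")
```

The tests document the routine's invariants, so a reader who cannot (or will not) read every line of the implementation still learns when it is behaving correctly.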
Citation: third-party data and software should be cited. In your paper, cite your own code that you used; cite your own data. Get the ball rolling; set an example. And then, of course, it almost goes without saying that if there are requirements you're subject to, in your department, for example, or from your funder, the person who's given you grant money of some sort to do the research, obviously you're going to comply with those. They're never going to be, well, I don't want to say never, but in the next ten to fifteen years they won't be as strict as this anyway, so you almost get that for free.

Okay. All right, so we're almost done. The last thing I wanted to show you, and by the way, the slides are on my website and I'll make sure they're linked, because all of these tools and efforts are hot-linked on there, and I think they're kind of cool to play with. Almost all of these are just researchers who recognized the problem and decided to start solving it themselves, and on their own time built a tool. The one I especially wanted to mention as an example is IPOL. Let's see if this works. Okay, so I'll just briefly go through it and let you play with it on your own. This is actually an open-source platform, so they have their code on GitHub and you can spin up your own IPOL-looking thing if you want. IPOL stands for Image Processing On Line, and the approach they took was to create a new journal, have submissions reviewed, and have each article appear organically in this online journal. It's a whole other question, which I can talk about, whether to start a new journal or try to change existing journals; I think it's better to stick with the existing journals, but they did it in such a way that it works organically.
So let me just choose a paper here and you can see what they've done. They said: here's the article, and by the way, the full-text manuscript, PDFs, everything traditional is here. The article, and then the source code, which you can see down here. Image processing is very lucky, in a sense, in that they generally deal with images of maybe a couple of gigabytes at the most, and most of the impact is in the actual algorithm they've developed, so in the source code. They have the source code here, so you can grab the pieces that go with the article. Notice they have "demo" and then "archive" here. In the demo, you can take images from the paper and go ahead and run it, right? So they've implemented their algorithm organically in the journal, and the really cool thing is you can just check the results; you don't have to actually install all their code. And you can upload your own data here. So why believe them on their hand-crafted, chosen images for denoising? Try it out. That's what's in the archive. Last time I clicked on archive, and you never know, these are just random people, I don't know what I'm going to get here uploading images, I got something a little weird from a student, and it was not a good situation. But it's interesting: you can see people are just playing with it, running it, trying out the algorithm. So that's one approach. You asked for some examples; it's one example that this group of researchers, not even a whole community, has taken in developing their journal.

Okay, there we go. I will not go through an approach that I came up with, a research compendium trying to link data and code to journals. I wanted to mention this article very briefly because of the conversation yesterday about code sharing being associated with research impact. In image processing, people have done some research on this, and there is a bump that you get for code sharing.
I assume it's the same for data sharing. You might say, well, there are confounding factors: maybe the best researchers are more likely to share their code. Sure, but okay, you're still getting that impact effect.

And then I wanted to leave you, this is my last slide, with a few questions about what I talked about at the beginning, about this world of radical transparency. We query the scholarly record now because we're looking for a particular paper, or we're looking for an author, or maybe we can query by keywords. That's about the extent of it, right? One of the talks yesterday started pushing on these issues. So what can I do if I want to make more intelligent queries of the scholarly record? For example: show me a table of effect sizes and p-values in all phase three clinical trials for melanoma published after 1994. It's basically impossible to do that query today. It's kind of an obvious query to want to do, though. Name all the image-denoising algorithms ever used to remove white noise from the famous Barbara image, with citations. Actually, I don't know if you noticed, I went over it really quickly, but in that random IPOL paper I chose, Barbara was in there. So presumably we would pull that paper as well as whatever else. Impossible, but I would want to know, right? What's happened around Barbara? Who's manipulated what? List all the classifiers applied to the famous acute lymphoblastic leukemia data set, along with their type 1 and type 2 errors. With a student, we actually tried to do that manually, just to see how hard it is and what kind of papers we could come up with. We found about thirteen papers and a little over twenty classifiers, but it was by hand. It's not a query
I can actually do. I'd like to be able to: if I want to apply a particular classifier to the data set, I'd like to know that I'm not duplicating someone's work, and it's hard to figure that out. Create a unified data set containing all the published whole-genome sequences identified with a mutation in the gene BRCA1; so, come up with your best breast-cancer data set. It's hard to generate that; you have to go to the sources and start piecing the data together. These are all kind of obvious scientific queries, though. Randomly reassign treatment and control labels to cases in some particular published clinical trial, calculate the effect size, repeat many times, and create a histogram of these effect sizes; do it for every clinical trial published in, say, 2003, listing trial and histogram side by side. Are the effects real or not real in these clinical trials? Very difficult to do these queries. So I leave it there as my last slide, to spur thinking about where we can actually take this, and how core this is to the kinds of things we're doing as scientists and to the way we're disseminating our knowledge. So I'm happy to take questions. Thank you.

Hello, I had read about p-hacking before, and about reproducibility problems in statistical science, and one of the things that came up was people discussing the possible inadequacy of p-values and the difficulty of understanding them. Is there any alternative method being pushed?

That's a very good question. I know some of those discussions, and they tend to come from the psychology community, where there has been a lot of discussion; actually, the term p-hacking was coined by Uri Simonsohn, who is a professor of psychology. The approach they've taken is actually not what's in your question, alternative methods. The approach that some journals are starting to take, and I think I know one that's doing it, is to say: we're not going to publish p-values anymore.
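The permutation exercise described on that last slide, reshuffling the treatment and control labels and recomputing the effect size many times to build a null distribution, can be sketched in a few lines. The numbers below are invented for illustration; this is the generic recipe, not any particular trial's analysis.

```python
import random


def effect_size(treated, control):
    """Difference in group means: a simple effect-size measure."""
    return sum(treated) / len(treated) - sum(control) / len(control)


def permutation_effects(treated, control, n_permutations=1000, seed=0):
    """Pool all outcomes, shuffle the treatment/control labels, and recompute
    the effect size each time: the null distribution for the histogram."""
    rng = random.Random(seed)
    pooled = list(treated) + list(control)
    n_t = len(treated)
    effects = []
    for _ in range(n_permutations):
        rng.shuffle(pooled)
        effects.append(effect_size(pooled[:n_t], pooled[n_t:]))
    return effects


def p_value(observed, null_effects):
    """Fraction of permuted effects at least as extreme as the observed one."""
    extreme = sum(1 for e in null_effects if abs(e) >= abs(observed))
    return extreme / len(null_effects)
```

If the observed effect sits far in the tail of the permuted histogram, the effect is unlikely to be a labeling artifact; if it sits in the bulk, the published effect looks no different from a random relabeling.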
Just don't talk about it. Which seems to me worse: at least with p-values I have something I can structure and try to understand, but without that, I think you're just going to be publishing unintelligible junk. With other journals, what I've heard discussed is: we'll publish confidence intervals, we won't publish p-values. But of course they're dual, so that doesn't actually do anything different, and it just highlights what you're saying about the difficulty of understanding p-values, because clearly they didn't really understand what goes on in the determination of a p-value. That's the level of the discussion that I've seen. The way I like to think of it, as a statistician, is that it's a call for more statisticians and more statistical research, so that we can actually adapt p-values to the modern context. There is, of course, a lot of work going on with multiple comparisons; Benjamini has been doing work with the false discovery rate, for example. Those are much more intelligent and structured approaches to dealing with the p-hacking problem. But the main problem that isn't being addressed is this idea of carrying out test after test after test and then reporting the one that happened to give you the results you wanted. That's almost a social problem, not really a statistics problem, or maybe a tools problem. That's a great question.

Thank you. I have a question: what should be done if, for some reason, you're not allowed to share the data? For example, with Twitter you can only share the IDs of the tweets, and once you try to retrieve them, you will never get the same set the original researcher had. Or you have privacy issues. So what is the workaround?

There's work being done on that. That's a great question with a very difficult answer.
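The false-discovery-rate work mentioned in that answer, the Benjamini-Hochberg step-up procedure, is simple enough to sketch directly. This is the textbook version, written out here for illustration rather than taken from any particular paper.

```python
def benjamini_hochberg(p_values, alpha=0.05):
    """Benjamini-Hochberg step-up procedure: return the indices of the
    hypotheses rejected while controlling the false discovery rate at alpha."""
    m = len(p_values)
    # Sort p-values, remembering each one's original position.
    order = sorted(range(m), key=lambda i: p_values[i])
    k_max = 0
    for rank, idx in enumerate(order, start=1):
        # BH condition: p_(rank) <= (rank / m) * alpha.
        if p_values[idx] <= rank / m * alpha:
            k_max = rank  # largest rank satisfying the condition
    # Reject everything up to and including rank k_max (step-up property).
    return sorted(order[:k_max])
```

Note the step-up behavior: a p-value that fails its own threshold can still be rejected if a larger p-value later in the ranking passes, which is what distinguishes this from naive per-test cutoffs like Bonferroni.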
So it depends on why you can't share the data. For example, with human-subjects data, obviously there are laws; you're not going to put people's medical records out on the web, you'd probably end up in jail. On the other hand, people are starting to be more sensitive to issues around reproducibility. Prior to the recent discussion, if you had any kind of identifiable personal information in the data sets, you just didn't share it, and it either got destroyed or probably just got destroyed. Now, I think there's a lot more research coming up: can we do something that shares some of it and still protects confidentiality? There's some very sophisticated research around differential privacy, which was actually mentioned a little bit yesterday, that sometimes helps. Can we query the data set in such a way that I don't learn the characteristics of the people in the data set, but, maybe with some noise, I can get some confidence in the results that someone derived who did have access to it? So maybe there are sophisticated ways we can do this. Maybe we can be more clever about authorizing other researchers into a pool that has access, with controls, and make that more seamless. We're also talking about cultural issues too. With the quantified-self movement, people have a lot more sense of ownership over the data about their own bodies, and they'll say things like: actually, I want you to share my data with other clinical trials, and I don't want you to have to ask my permission every single time. Right now, we don't really have a mechanism to facilitate that; we just have this machinery that says no, it's bad for you. With all the linking that was talked about yesterday, we actually cannot guarantee your privacy, and people will say, "I don't care."
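The noisy-query idea in that answer can be illustrated with the classic Laplace mechanism for a counting query, a standard construction from the differential privacy literature (the records and the predicate below are invented). A count changes by at most one when any single person is added or removed, so adding Laplace noise with scale 1/epsilon gives epsilon-differential privacy for that query.

```python
import random


def noisy_count(records, predicate, epsilon=0.5, rng=None):
    """Laplace mechanism for a counting query. A count has sensitivity 1,
    so Laplace(1/epsilon) noise yields epsilon-differential privacy."""
    rng = rng or random.Random()
    true_count = sum(1 for r in records if predicate(r))
    # The difference of two Exp(epsilon) draws is Laplace(0, 1/epsilon).
    noise = rng.expovariate(epsilon) - rng.expovariate(epsilon)
    return true_count + noise
```

Any single answer is noisy, but because the noise has mean zero, an analyst can still get confidence about aggregate questions, which is exactly the trade-off described above: no individual's presence is revealed, yet the result remains usable.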
Well, no. So I think we'll see this cultural evolution as well, as we become more comfortable with people who actually want to share their data. So that's just privacy. In other areas I also have optimism. For example, if you work with Google or Facebook, the flashy types of companies your students collaborate with, there's no way you're sharing the data that work is based on, right? In fact, if you're working with Facebook, you've got the Facebook laptop with their stack on it, and you're probably, I think, even physically at Facebook doing your fellowship there. I think there's a real culture clash we have to reconcile, between the transparency in research that we're demanding as computational scientists, and those types of collaborations, where the data is not available and the methods are really not available in how they're instantiated; you might get a high-level description. But I think if you want to contribute to the scientific community and the discussion, you've got to be making all of this available: how do I know what exactly you did and why it's right? So those are also cultural issues I think we can start to push on. We have some leverage as scientists; we don't necessarily need to sign all the NDAs right off. I'm not saying it's easy with Facebook and Google, the hardest cases, but if we think cleverly about how we engage with people who are very important to science but maybe aren't scientists, then we can start to plan these things out better from the beginning. So that's a very complicated answer to what seemed like a straightforward question: it really depends on what the barriers are. And you see it with code too: people develop code in a collaboration, and then, say, the company wants to use the code and doesn't want it shared openly, because they don't want to advantage their competitors. How do we start negotiating and navigating all these new
interactions that are coming up, which twenty years ago we barely had to think about?

Thank you for a great talk. I don't know about US regulations, but I know roughly about European law, and talking about data protection, and about data sets that contain faces and voice or speech: are there any practices for letting third parties, I mean, having contracts for third parties, use these data for scientific purposes only? And how can you control and track all those third parties you've given some permission to use data that participants granted? It's not your data, and you're trying to give it to the community. So how is this going in the US?

That's a great question. I think the legal issues that attend this discussion are completely under-researched and underpowered, so if people are interested in them, I think you can make a career easily on these issues. I don't know of any canonical examples of contracts that could serve as, for example, templates or guides, but that's an opportunity to start thinking about them and developing them. I am involved in a project which is based in the US, so it has more US law around it and less European law; that's actually why I didn't go into the legal research that I do, it's so US-based, and privacy is quite different there, and copyright is quite different. What's going on in this project I'm involved in is that state governments in the US, say New York State or Illinois or whatnot, have an enormous amount of person-identifiable information: school records, welfare case files, what's going on with the children, health records, and so on. And they have incentives to share this, even for things like just getting their state government to work better. Can they understand the cases better?
Can they identify children who are possibly being exposed to abuse better if they're able to, say, look at the school records and tie them into welfare payments and so on? So they have very concrete reasons to want to share the data, but that also exposes it, like you pointed out, to potential use by researchers. Once these data are being made available and shared across different agencies, can they also be shared with researchers? The project that I'm involved in is exactly templating what the contracts should look like. We have two sets: one agreement for when you're putting your data into the pool or repository and blending it with the other research data, and the other contract is for the researcher, the user on the other side, covering what they're actually agreeing to, sharing, downstream tracking, and so on. It's all very new, so you're right on the cutting edge of what people are doing and thinking about and what they need. But that's exactly the right question.

Thanks for the great talk. I'm interested in the case where the paper has an open license, but then there's the license for your code and your data. For example, if you're working for Microsoft and you cannot share your code, but you're still publishing in a journal that is completely open, isn't there a contradiction with the fact, which I agree with, that your paper is more like a snapshot, the advertising of your research? Do you know of particular cases in which there is some workaround for that, when we want to reproduce the work of some Yamaha or Microsoft project, some mechanism that allows sharing part of the code in an open journal?
Okay, so that's a very interesting question, because normally it's the other way around: normally the journal is very tight around copyright, and nobody really notices or cares about the code. The researcher shares the code because they just wrote some MATLAB scripts, or R scripts, or whatever it is, themselves, and they just share them. You're talking about the reverse, where you have a permissive open-access journal, and the work relies on some proprietary code that isn't shareable. I don't think the journal would actually try to exert leverage over the code; maybe they're starting to, but I don't think they would. But we can do things like this: if I really, really can't share the code, if I've tried all the charm and negotiation with Microsoft that I can and still can't get the code open, then the next thing I would do is set up a series of tests that help me understand how the code is running and how it works, so I can start to glean what the core operations in that code actually are. It's maybe not as good as having it open to play with myself, but at least it's a step in that direction. And then the other issue is cultural: thinking about how we as a community engage with our collaborators, making sure from the beginning that we're going to be able to share the products, or at least do the best we can in terms of sharing them. Right now we don't really do that. You're probably running into papers where someone else has just made it difficult to get hold of the code. It doesn't take Microsoft or a company to make it hard to get hold of the code.
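The black-box testing idea described above can be sketched roughly as follows. This is a minimal illustration, not the speaker's actual method: `opaque_score` is a hypothetical stand-in for the closed implementation being probed (in practice it would wrap a call to the proprietary binary or service), and the probes check simple properties such as determinism and linearity that help characterize how the hidden code behaves.

```python
# A minimal sketch of "characterization tests" against an opaque routine.
# `opaque_score` is a hypothetical placeholder for closed code we cannot
# inspect; in a real setting it would invoke the proprietary binary.

import math

def opaque_score(x):
    # Placeholder standing in for the closed implementation.
    return 2.0 * x + 1.0

def is_deterministic(f, x, trials=5):
    """Does the same input give the same output every time?"""
    first = f(x)
    return all(f(x) == first for _ in range(trials))

def looks_linear(f, xs, tol=1e-9):
    """After removing the offset f(0), does f(a+b) equal f(a) + f(b)?"""
    f0 = f(0.0)
    for a in xs:
        for b in xs:
            lhs = f(a + b) - f0
            rhs = (f(a) - f0) + (f(b) - f0)
            if not math.isclose(lhs, rhs, abs_tol=tol):
                return False
    return True

probes = [0.5, 1.0, 2.0, -3.0]
print("deterministic:", is_deterministic(opaque_score, 1.0))
print("affine in its input:", looks_linear(opaque_score, probes))
```

Each probe narrows down what the hidden code could be doing; accumulating enough of them gives a partial functional specification even when the source stays closed.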
Just try it on any old paper; it's really hard. But we're changing, we're moving, and I think that education around these issues, and discussions, and the journal standards around what journals accept in terms of code and data, are rapidly changing. It's not yet at the level of full openness that you're describing, but it's rapidly getting there. Here's my prediction: in five to seven years it won't be acceptable to publish this way anymore. But we've got this black hole right now before we get there.

This is becoming more common, and an issue is the instrument that is used to do the analysis of the data. In computer science we have a big problem right now replicating, let's say, the research of Google, because we don't have the resources. In science this also exists because, well, you have a huge telescope or something. So what is the direction this is going in?

Interesting question. That worries me, actually. I've heard these discussions in computer science, for example in the CISE directorate at NSF. It used to be the case twenty years ago that people came into computer science in academia because that was where the coolest machines were, the most cutting-edge implementations, the biggest scale. Now we can't hold a candle to Google and Facebook. When I teach things like data science, we don't even have the scale for a real implementation of Hadoop, for example. So people say things like, "I'd like to know Hadoop at scale because I want to go work at Google," and even the training falls short: I can't train them on a Google-like system before they go. So there's this whole question of access to resources, and I don't see that changing, actually.
In academia, when you have a unique instrument, there isn't a known solution that I know of, at least. But my approach would be to try to see what I can do that's similar: basically build analogies, and ask how different it is, and where I expect error to creep in when I'm mimicking that larger instrument. Instrument availability is a problem, though. It's actually one of the reasons I stay away from empirical reproducibility: you need the bench or the tools and so on, and that makes reproducibility much harder than if I'm just passing around software or data. So it's a very hard problem.

Through governments, could we replicate, at least partially, these instruments that are needed, maybe?

It's an interesting question, but also think about the architectures: they're totally different in the two settings. If you think about what Google and Facebook are doing, they're not using supercomputing; maybe IBM has a couple, or something, but they're all doing distributed computing.

Are there programs to develop these?

No. There are supercomputer centers, data centers, but I don't see that. I have an affiliation with NCSA at Illinois, and there are communities around these supercomputers. I don't think they're going anywhere, but it is something I think about a lot.