Welcome, everyone — let me welcome you all to the spring 2016 CNI membership meeting, and what looks like a really nice day in San Antonio, which we arranged specially for you. I am delighted you're all here, and I hope your travels have been relatively painless. I'd like to make a special welcome to a couple of folks. We have with us the new cadre of CLIR fellows at this meeting — could you all just wave your hands in the air? Yes, we have quite a group of them over there in the back. You'll have an opportunity to meet them at sessions and at the reception tonight, and we're delighted they're here with us. I'd also like to extend a particular welcome to our foreign visitors. Getting here from abroad can be quite an adventure, and we are very, very pleased that you're here with us.

I'd like to welcome four new members who have joined us mid-program year, and I think at least three of them are physically with us at this meeting. Those four new members are Skidmore College, the Crest Foundation, Index Data, and Swarthmore College. Welcome, new members.

Changes in the program happen, and I want to direct your attention to the message board out by the registration desk. Thus far we have had a couple of changes of speaker, which I won't go through with you, but the only actual session change I'm aware of is that in the 5:15 slot today we've had to cancel the breakout on the Open Library of Humanities. Martin Eve unfortunately is not well and can't be with us, but we're hoping we can reschedule that for a future meeting. I urge you to keep an eye on that board for changes as we go along. Now, with those bits of news out of the way, I'll just note that there are instructions for how to get on the Wi-Fi in your registration packet. If you have trouble with that, just let one of us know and we'll try to figure out what's going on. There are also some announcements of upcoming meetings, and I'll say a word or two more about those when we gather tomorrow afternoon for the closing session.

So with that out of the way, I get to do something really good: I get to introduce Victoria Stodden. I am so pleased to have her here doing the opening plenary today. We go back a while now — I think we probably first met at AAAS a few years back, though I wouldn't absolutely swear to that — and I think we've been at a number of the same events dealing with various aspects of cyberinfrastructure, the future of scholarly communication, and things like that. Victoria has been a pioneer in really trying to think through issues about the replicability of scholarship in the digital age, which is a tremendously important topic. It's important not only to the quality of scholarship itself, but also, I think, to the public support of, and the public honoring of, scholarly work. The public sinks a lot of money into underwriting research of various kinds, and replicability of research results — and understanding what that means — is integral to having confidence in the record and the outcomes of research, and to earning and maintaining public support for the research enterprise. So the stakes here are very high.
Victoria, I think, brings a really powerful set of insights into how the great shift toward computational, data-intensive, and networked technologies is changing the entire nature of the scholarly enterprise. She's going to share some of that thinking with us, and also help us think about the implications for the communication and the documentation of scholarly work, and for the scholarly record itself, going forward into the digital age. So please join me in welcoming Victoria Stodden.

It's such a pleasure to be here. I was so delighted to get the invitation to come and speak, and I feel like this is a perfect audience for the types of things I love to talk about. I've done a couple of things to help facilitate our discussion: my slides are on my website, if you prefer to follow along on your iPad or click on some of the links in the slides, and I'm going to try very hard to leave time at the end for questions. I'll also be around for the entire meeting, so if you don't get a chance to ask your question, please feel free to just grab me and talk to me.

Defining the scholarly record for computational research: as Clifford very generously mentioned, I have spent the last few years of my career trying to think about what it means to do research when you have a computer involved — something I'm loosely calling high-integrity research. One of the things I'm going to try to do this afternoon is build up what that means, or how we might conceptualize it, when we're using computers to drive our results, building all sorts of software around scientific findings, and working with massive data sets that are complicated and coming at us at high velocity. What does that mean for things like transparency in the research process, and second-order verification of the results if you are a reader or a user of these research outputs? And how do we start thinking — because this is all really very new — about how we share and disseminate research?

Okay, so I'm going to touch on three aspects in my talk today. The first two I'll go through fairly quickly, but they set the stage for the rest of the talk. The first is really trying to pull apart and understand the technological changes that are impacting, for example, our conception of what the scholarly record could be. If we can tease those apart, we can start to understand how they're interacting with research and reuse, and what each of these changes might actually demand from us. Then I want to touch on norms: if we're thinking about what we should be doing, I've introduced a normative dimension to our discussion, and in research we have long-standing norms, so I want to talk a little about those norms and whether we can lean on them for strength in thinking about where we should be going with all these changes. Then the bulk of my talk will be the impact on the scholarly record: I'll take the different threads in this framework we've been building up and see if we can start to understand the ways we might reconceptualize, or think through, what the scholarly record might look like in this age of digital, deeply computational, high-impact, high-velocity-data research.
Okay — conceptualizing the technological change. I think there are several ways to break down how technology is turning the way we do research on its head. The first one I think everybody knows: we've got enormous amounts of data. And it's not just more data of the same kind, where we need to scale what we've always been doing — it's totally different types of data, so we're doing all sorts of different types of research. We have very high-dimensional data, which is new, and streaming data, and other very different types of data. So it's not really a question of size; it's a question of the character of these different types of data sets.

The second one — people in this room have probably thought about this one deeply as well — is that we also have much faster computers. Separate from the increase in the size and availability of data, very fast computers have allowed us to do different types of scientific research. For example, we're asking questions of a physical system where what we want to do is simulate the physical system on the computer, change the parameters, rerun it, change the parameters again, rerun it, and ask questions about our world using that computational technology. It may or may not have data involved.

The third aspect that I think is causing very deep changes in research and in dissemination — and that I hear almost nobody talking about — is the role of software. My background is as a statistician, and if you think about the methods that are applied to data when you're doing research: all sorts of munging of the data sets, preparation of the data sets — you need to understand where there are outliers, what's good data and bad data, just to get it ready for the analysis — and then you apply the analysis and the models and do the actual inference. All of this is being done in software, and it's not in the publication. So the question becomes: we're actually making contributions to research, to discovery, to how we do inference in this digital research world, that appear only in the software. How do we make that part of transparency or reproducibility, and how do we communicate that aspect to others in the research community? It used to be that we could include this in the methods section; now, given the scale of the work happening on the computational side and the production of software around particular scientific results, it's not possible to include a complete description in the methods section. I have a little photo here — a screen grab of Lior Pachter, who is a professor of mathematics and biology at Berkeley, giving a keynote in 2013 — and in the course of the keynote he happens to mention that the software being generated, say in biological research, "contains ideas that enable biology." We all understand that the scholarly record contains ideas that enable biology, or whatever your field of interest is; however, now those ideas are appearing in the software as well. So what do we do about that?
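Just to illustrate how much of the result-shaping work lives in that invisible code, here is a hypothetical fragment of the kind of data preparation that typically never appears in a methods section. The file name, column names, and every threshold below are invented for illustration — but each one is a substantive research decision:

    # A hypothetical fragment of data munging -- the kind of code that
    # shapes the eventual result but rarely appears in a methods section.
    import pandas as pd

    raw = pd.read_csv("sensor_readings.csv")   # hypothetical data set

    # Decision 1: what counts as "bad data"? Here, readings flagged by
    # the instrument or outside its physical range are treated as missing.
    bad = (raw["flag"] != 0) | ~raw["reading"].between(-50, 150)
    raw.loc[bad, "reading"] = float("nan")

    # Decision 2: what counts as an outlier? Here, anything more than
    # three standard deviations from the mean also becomes missing.
    mu, sigma = raw["reading"].mean(), raw["reading"].std()
    raw.loc[(raw["reading"] - mu).abs() > 3 * sigma, "reading"] = float("nan")

    # Decision 3: how are gaps handled? Short runs of missing values are
    # linearly interpolated; longer gaps stay missing.
    clean = raw.sort_values("timestamp")
    clean["reading"] = clean["reading"].interpolate(limit=3)

    # Only now does the "analysis" the paper describes actually begin.

None of these choices would be recoverable from a typical published methods section, which is exactly the transparency gap being described here.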
Okay, another couple of comments about technology. Of course everybody's aware of the changes in communication due to the internet and network technology — and it's not just a question of digitization but also of access — and I won't go into the myriad examples that are well known. The last thing I'll mention about changes in technology is the role of intellectual property law. This is something that, traditionally in research, we haven't thought much about. Even when I first became a grad student, you could still find professors who would write a letter to another professor asking for a preprint, and the preprints would get mailed around — I'm sure people remember those days — and we didn't think much about what that meant in terms of intellectual property. Now that we're making things available on the web, intellectual property is everywhere. So one of the things that travels with this discussion, as a subtext all the way through, is: what intellectual property rights adhere? How can I make work accessible and reusable — something people can use, say, to reproduce the work and extend it? Those are big barriers, and I'll come back to them toward the end of the talk.

Okay, now grounding these changes in scientific norms. Hopefully you feel very motivated: there are lots of deep changes happening in the way research is carried out, and in the opportunities we have for dissemination of that research. But if we think about what researchers are trying to do — what mindset they have, what the end goal actually is — then we end up in this discussion of norms. One of the things I like to do in framing a discussion around reproducibility is to bin the concept into three rough groups. There's overlap among all of them, but I feel that parsing out these different aspects of reproducibility is very helpful for framing the discussion.

The first thing people can mean when they talk about reproducibility in research is empirical reproducibility. The way I think about this is that it's the type of reproducibility we've had since the 1660s, when we started to communicate results and findings with the notion that you should be able to read the paper and — independently, without having to contact the original researchers — reproduce and verify those results. Empirical reproducibility covers things like work at the bench in a biology lab — actually doing things in physical space. That's our traditional notion, and there are many interesting questions around it; it has actually garnered a lot of attention in the last couple of years. However, nothing has actually changed within that discussion: all those technological affordances I just mentioned are not impacting empirical reproducibility. It concerns things like: can I get the reagents you used in your experiment, can I access your stem cell lines, and so on.

Contrast this with the notion of statistical reproducibility. In doing inference, have I done things like design the experiment in such a way that I would expect it to give the same results when replicated in a different setting? I won't go much into statistical reproducibility here, mostly in the interest of time, but there are many mistakes and errors that can be made just in the statistical aspects of the research that would cause the research not to replicate. If you have your ducks in a row statistically, then I think you can satisfy statistical reproducibility.

What I will spend time on today is the idea of computational reproducibility. The idea there is: can I regenerate your results? Say you have digital raw data; as I mentioned, you may have done some preprocessing; you have analysis steps and software that you applied to the data; and then you've published some findings. The question becomes: can I actually replicate the computational aspects of your work — can I regenerate the figures that are in your paper, or the tables, or the output?
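To make that concrete, here is a minimal sketch of what a computationally reproducible result could look like: a single script, deposited alongside the paper, that regenerates a published figure from the deposited raw data. The file names, column names, and the analysis step are all hypothetical stand-ins, not anything from an actual compendium:

    # reproduce.py -- a hypothetical, minimal example of computational
    # reproducibility: one script regenerates a published figure from
    # the raw data that shipped alongside the paper.
    import pandas as pd
    import matplotlib
    matplotlib.use("Agg")  # render without a display, e.g. on a server
    import matplotlib.pyplot as plt

    # Step 1: load the exact raw data deposited with the publication.
    data = pd.read_csv("raw_measurements.csv")   # hypothetical file name

    # Step 2: the preprocessing that otherwise lives only "in the
    # software" -- here, dropping outliers beyond three standard deviations.
    mu, sigma = data["value"].mean(), data["value"].std()
    clean = data[(data["value"] - mu).abs() <= 3 * sigma]

    # Step 3: the analysis -- here, just group means with standard errors.
    summary = clean.groupby("condition")["value"].agg(["mean", "sem"])

    # Step 4: regenerate "Figure 1" exactly as published.
    fig, ax = plt.subplots()
    ax.bar(summary.index, summary["mean"], yerr=summary["sem"])
    ax.set_ylabel("value")
    fig.savefig("figure1.png")

A reader can run the script and compare the output against the figure in the paper — which is exactly the baseline check being described here.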
Probably most of you are thinking right now: okay, maybe I can regenerate results using the same software and the same data, but I haven't really said anything about the correctness of those results — are they scientifically valid? Maybe I can get the same wrong tables and the same wrong figures that are in your paper. That's all true. However, the way I think about it is that this is, in a sense, a baseline level of what we should be shooting for. With computational reproducibility, at least I can check that the machine ran the code the right way on your data and got the results in the publication. After that, I can do things like independently re-implement your methods and see if I get similar results. If I can't rerun your code on your data and actually regenerate your findings, how am I going to understand why our results differ when I do an independent replication? And they will definitely differ. So in a sense we need both types of work, but I do need that deep level of transparency — being able to inspect all the computational steps you went through to get to your final published results. Right now there are researchers who are taking this on and making it a priority in their work, but generally speaking, you just can't do that from the traditional publication — because the traditional publication was, of course, defined for empirical reproducibility. So the question in my talk is: what do we need for computational reproducibility? What does that publication look like?

On empirical reproducibility, I won't go too much further — I feel I've explained it pretty well — but I have a couple of examples here about how biology labs have tried to reconcile results around empirical reproducibility, and the point is that it's not easy either. It's not a question of any of this being easy or difficult; the computational aspects are simply the new aspects I've been focusing on.

Okay, so — statistical and computational. I wanted to peel back the covers on computational reproducibility a little, leaning on a development of norms around computational reproducibility to try to understand what would be appropriate in this context. Traditionally we've thought of scientific research as having two branches of the scientific method: the deductive branch of scientific reasoning, encompassing formal logic and mathematics and all that type of deductive reasoning; and the inductive branch — for example, the statistical analysis of controlled experiments, where you're certainly not applying deductive logic. There's an enormous amount of chatter about how technology, as I described it at the beginning of the talk, is creating new branches of the scientific method. You've probably all heard of a third branch of the scientific method around simulation and intensive computation, and a fourth branch around data-driven discovery and big data. I put a provocative question mark on branch three and branch four, and the argument I'm submitting to you is that computation presents only a potential third or fourth branch of the scientific method.
So why do we even have a scientific method? We kind of have an idea of what science is, or what research is, and what we need to do — why do we really need a formalized method around this? The presupposition that led us to the idea of a scientific method is that everything we do in research and in discovery — we're in new worlds, trying to generate new information that explains more about our world — is fraught with error. We're just people trying to make sense of things; we're going to make mistakes all the time, and we do make mistakes all the time. We never get to a notion of certainty in research; we just feel we have a better grasp on things, that things are more likely to be right — or that we don't have a particularly good grasp — and we're always trying to move toward that feeling of greater certainty in our understanding of the world.

In the deductive branch — the branch around mathematics and formal logic — the way this hunt to root out error has manifested itself is in the notion of the proof. You wouldn't publish a mathematical finding without telling the community how you got there, through that formal notion of the proof. Similarly, the empirical branch has an analogous notion: we have an entire machinery of hypothesis testing — a way you set up the problem, apply appropriate statistical methods, and then communicate those results in a very structured way through the methods section. I challenge anyone here to publish results with the methods section left blank — it won't be published — and it's very, very important for transparency, for communicating how you actually got to those results. What I think we need is the development of comparable standards — standards similar to what we have for the first and second branches of the scientific method — for computational research, the third and fourth branches. What's our idea of the proof? What do we need to communicate so that other people in the community feel satisfied that we rooted out error as best we could?

There are notions around this — many people doing research have been bugged by this and have thought about it; I'm certainly not the only one interested in this issue. There's a notion of "really reproducible research" that was developed, or promulgated, in 1992 by the now-emeritus Stanford professor Jon Claerbout. David Donoho, who wrote the quote in the middle of the slide and was my thesis advisor, paraphrased Claerbout as follows: the idea of really reproducible research is that an article about computational science in a scientific publication is not the scholarship itself — the PDF is not the scholarship; it is merely advertising of the scholarship. The actual scholarship is the complete set of instructions and data which generated the figures and tables and results in the paper. And as I've touched on already, there's this difference between running the same code on the same data and regenerating the results, versus an independent implementation. What Donoho and Claerbout are saying here is that we need to be able to deliver the software steps and the data that underlie a scientific result to the community through the scholarly record, along with the result itself.

Okay — so there are all these changes happening in the research context, and as we know, research is governed by norms.
So, thinking about what norms could actually guide our responses — how we instantiate the delivery of these digital objects as part of the scholarly record — there are many different norms you could lean on. I looked at Merton's scientific norms from 1942, and I'm presenting them somewhat uncritically here; I should mention that there is quite a bit of criticism of these norms, and further development and so on, but I think the germ of these norms is correct, and also useful for the discussions we're having today. Merton postulated five norms. Communalism — which I think he actually called communism, but its sense evolved into communalism: scientific results are the common property of the community; I might even push on that and say the common property of the public. Universalism: all scientists can contribute to science regardless of race, nationality, culture, or gender. Disinterestedness: as researchers, we act for the benefit of the scientific enterprise rather than for our own personal gain. Originality: a research contribution must be original, adding something new to the discussion. And skepticism: scientific claims must be exposed to critical scrutiny before being accepted. The two I focus on most in my research, and have found most useful, are communalism — being able to share results in such a way that they're broadly available — and skepticism — exposing claims to critical scrutiny.

Skepticism: you can trace this lineage back to the 1660s — this is a picture of Robert Boyle, as many of you I'm sure recognized. Skepticism requires that a claim can be independently verified; you shouldn't have to email the author, for example, to try to understand what they actually did in their publication. What this means is that we need a notion of transparency in the communication of the research process — what went into generating those results. And like I said, that's from the 1660s.

Okay — so what does all this mean for the scholarly record? I have a couple of ideas, if we're thinking about what the scholarly record might mean in this computational context. We need to be able to use the scholarly record to access or regenerate scientific research findings: I need to be able to get hold of items that were relied on in the generation of those published results, and I need whatever information is material for an independent replication or for reproducibility. The reason I give those two options is that every publication around a scientific discovery is, in some sense, a stylized narrative. You never get a diary of everything that was tried, everything that went wrong, all the dead ends the researcher went down. If you read these articles, it's always a smooth progression right to the results — and in a sense that's right; we don't need to be bothered by the mistakes and the typos and things like that. On the other hand, we do miss the intellectual growth the researchers went through, and — more important to statistical reproducibility — the avenues that didn't work out can be extremely material for understanding the significance of the ones that did. So, in a sense, we've got some thinking to do about what parts of the story are relevant for inclusion when we're telling others about the results we've found.
So what do I mean by these items we'd want to share? We can think about articles or manuscripts — text. Code and software. Data. Workflow information: how did I actually apply those scripts and that software to the data? What were my parameter settings? What order were the scripts applied in? How do I knit all those pieces together to get the results that are in the publication? Research environment details: what did my computational setting look like when I did that inference on the data? And then other items — you'll recognize these as part of the discussion around empirical reproducibility that I'm not emphasizing so much in this talk — material objects like reagents, lab equipment, instruments, texts, historical artifacts, and so on: just as important for reproducibility, just not the type of reproducibility I'm emphasizing today. And notice that we're sharing items that can change. Data sets can be corrected, updated, renormalized; software is extremely dynamic and fluid when people are actively using it, changing very frequently. So this brings in the notion of versioning as absolutely crucial to everything we're doing when we share these types of information.
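Much of this is exactly the kind of thing a machine can record for us. As a rough sketch — with hypothetical script and file names, and using only the Python standard library — here is what capturing workflow, data version, and environment details alongside a result could look like:

    # capture.py -- a hypothetical sketch of recording workflow and
    # environment details next to a result: which scripts ran, in what
    # order, with what parameters, on exactly which version of the data.
    import hashlib
    import json
    import platform
    import sys
    from datetime import datetime, timezone

    def sha256(path):
        """Fingerprint a file so the exact data version is on record."""
        h = hashlib.sha256()
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(1 << 20), b""):
                h.update(chunk)
        return h.hexdigest()

    record = {
        "captured_at": datetime.now(timezone.utc).isoformat(),
        # Workflow: the scripts, in the order applied, with the
        # parameter settings used for the published run.
        "workflow": [
            {"script": "clean_data.py",   "params": {"outlier_sd": 3}},
            {"script": "fit_model.py",    "params": {"seed": 42, "iterations": 10000}},
            {"script": "make_figures.py", "params": {}},
        ],
        # Data: a checksum pins down exactly which version was used,
        # even if the data set is later corrected or renormalized.
        "data": {"raw_measurements.csv": sha256("raw_measurements.csv")},
        # Environment: enough detail to start rebuilding the setting.
        "environment": {
            "python": sys.version,
            "platform": platform.platform(),
        },
    }

    with open("provenance.json", "w") as f:
        json.dump(record, f, indent=2)

A small record like this, deposited with the publication, is what makes the versioning problem tractable.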
Okay — as I mentioned, I'm not the only one who's noticed these problems; a number of researchers have noticed as well, so I put together this slide of some of the tools people have built to try to help. One thing you may think is: well, if you're talking about computational reproducibility, this is research that's happening in silico, so why don't we build computational tools that would do things like capture the version of the data set, or what software you actually used, and bundle it all up and make it shareable in a more automated way? I think that's a very natural response, and an incredibly important part of the solution, and people have been doing this type of thing. I'll note that, for most of the research tools I'm pointing out on this slide — and it's a hopelessly incomplete, almost haphazard list, so I apologize in advance to anyone I forgot — what's notable is that these were built by faculty members who developed tools and instruments to help with research and with reproducibility on the side of their day job, because they think it's an incredibly important thing and a gap that's missing. One thing I've noticed as this discussion progresses — and I'm starting to see this more and more — is that commercial interests are now coming in and providing support tools as well; that's a dimension of the discussion worth considering.

I roughly group these tools into three categories. First, dissemination platforms: I work on ResearchCompendia, the first one there, and I'll talk about that in a moment; also IPOL (Image Processing On Line), machine learning open source software, and others. These are post-publication: I want to deliver you some digital object that I think should be traveling with my publication, and these are solutions trying to fill that gap — finding ways to get data sets, software, and workflow information to you as the reader and consumer of the publication. Second, pre-publication, there are a number of tools being created to help capture what might be important for sharing before you actually get to the point of publication: workflow tracking in research environments — many tools with a lot of deep work behind them — to capture those environments, what variables were actually used, what your inputs to the software were, and deliver that as part of the publication as well, if the author chooses. The final category — and these are all a little overlapping — is embedded publishing. Our currency in the scholarly record is still the static PDF; can we do something more congruent with the types of research we're actually doing in the computational sphere? For example, could I deliver you a PDF where you can click into images and grab the code, or click in and find the data, or maybe even have a figure regenerate within the PDF? I've been calling those embedded publishing, and there are a number of efforts there as well. I think the fact that we're seeing these tools popping up organically, many from the grassroots level, really speaks to the importance of the problem: researchers are stepping in and trying to build these infrastructure responses on their own, to try to do better research — what I started the talk loosely calling research with integrity.

Okay, I'll say a little more about ResearchCompendia, the one I've been working on. The fundamental idea is this: the scholarly record keeps growing, with more publications being added to it practically every second, and many of those publications have data that underlie the results and code that underlies the results — I'd actually argue that the majority do at this point. So the question is: is there a way to persistently house the data, software, and workflow information — the computational artifacts that, in my opinion, should be traveling with the publication — in such a way that they're persistently available to the reader of the original publication? ResearchCompendia was really a pilot project, so we could start to understand what researchers need, how difficult this type of sharing is, and what problems and barriers we start to run into. We're trying to link the data and code to the published claims, enable reuse, and develop a better understanding, or guide, for people doing this research and promulgating these additional artifacts. We also realized that once you have the code and the data, we could actually run it ourselves and certify results, even though we're not necessarily the intended recipients. A lot of the work we've supported on ResearchCompendia is from fields where I don't have expertise, for example, but we can still get the code running on their data, and so we're able to say: we were able to get the same figures and tables as in your paper. So we can do that baseline verification of at least the computational soundness of the results — not saying they're correct, but at least saying, I could run the digital objects you gave me and get the results.

It also opens up other opportunities. In a particular area, around a research question, if people are sharing the data and software associated with their publications, we could start putting the data together, running the software we have on a much larger data set, and validating those results in a much more powerful way.
We could do things like stability checks or sensitivity checks on the methods and on the data itself — things that aren't the traditional way we'd think about the reason, or the purpose, of sharing scholarly information.

Okay, here's an example of the pages I was talking about — I just grabbed a page on ResearchCompendia. The blue title at the top links back to the published article: unless it's an open access journal, we're not taking local copies of the publication and making them available, for obvious copyright reasons, but at least we can link back to the article. This one happens to be an open access article, so in those buttons about two thirds of the way down you can click and grab the article directly from the compendium page, and you can grab the code, and you can grab the data. The idea here is that the abstract describes the code and data — the digital objects traveling with the publication. The authors listed might be programmers or data curators, for example: people who have contributed deeply to the success of the paper yet may not be listed in the traditional authorship role on the publication — there are a lot of norms and standards about how those authors are decided on. And then we have additional information about how the intellectual property is licensed, which I'll get to in a moment. It's all open source — ResearchCompendia is sitting there on GitHub. We do, by the way, assign DOIs to the code, to the data, and to the compendium page itself, and the idea is to assign these DOIs in such a way that they're linked, hopefully, with the publication — doing that before publication involves more engagement and discussion with publishers — but at a minimum they're hierarchically linked among themselves, so that they're discoverable as a bundle. And we applied the MIT license — a very open, permissive license — to our platform, the software that's on GitHub, so all of you can just grab a copy, stand up your own version of ResearchCompendia, and work with it or extend it.
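To give a feel for what that bundle could look like, here is a hypothetical metadata record for one compendium. The DOIs and field names are invented for illustration — this is not the actual ResearchCompendia schema — but the point is the hierarchical linking: the compendium, its code, and its data each get an identifier, and all of them point back at the published article:

    # A hypothetical metadata record for one research compendium.
    # All DOIs and field names below are invented for illustration.
    compendium = {
        "doi": "10.1234/compendium.0001",           # the compendium page itself
        "article": {
            "title": "A hypothetical computational result",
            "doi": "10.5678/journal.2016.042",       # links back to the publication
        },
        "code": {
            "doi": "10.1234/compendium.0001/code",   # child DOI, discoverable as a bundle
            "license": "MIT",
            "version": "v1.0.2",
        },
        "data": {
            "doi": "10.1234/compendium.0001/data",
            "license": "CC0-1.0",
            "version": "2016-03-15",
        },
        # Contributors beyond the paper's author list -- programmers,
        # data curators -- can be credited here.
        "contributors": ["analysis code: J. Doe", "data curation: R. Roe"],
    }

Whatever the exact schema, the design choice that matters is that the pieces resolve as a linked bundle rather than as orphaned files.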
Okay, I'd like to finish with a few comments on intellectual property. One of the things that happens in the interaction between software and intellectual property is that software falls under copyright by default. Copyright, as you probably know, extends to any original expression of an underlying idea — the underlying idea itself is never barred by intellectual property, but as soon as you have it in fixed form, the fixed form is copyright to the original author. There's a nice phrase that copyright follows the author's pen across the page: you don't have to register copyright; it's the default. We're very used to thinking about copyright in the context of text, but it's equally applicable in the context of software, where we have an idea of what we'd like the machine to do and then we go ahead and code it in some language — again putting an original expression of an underlying idea in fixed form. So copyright applies.

Stallman was one of the first people to notice that this created a big problem for the sharing of software. For example, if you're the copyright holder, I have to ask you if I want to make a copy or reproduce your work. That's very burdensome: I can't just go and download software that's sitting out there on the web unless I ask you whether it's okay to use it, or maybe change it a bit, or apply it in a new setting. And copyright lasts, as we know, the life of the author plus 70 years — probably something to do with the longevity of Mickey Mouse as well, since the 70 years keeps extending and extending — so I consider copyright essentially to be in perpetuity; certainly for software and these computational objects, 70 years is essentially infinite anyway. What Stallman did — his big innovation — was to say: instead of having people email me or contact me to ask my permission to use the software I've put on the web, why don't I just attach a little bit of text that says, you've got my permission already, and here's the permission I'm giving you — you can use it under these circumstances. That was the advent of open source licensing. Stallman's aim was to foment and create an open source software community, and his insight around licensing is often credited with the successful creation of the entire open source software movement and community.

However, the norms of the open source software community don't map perfectly to the norms of the research community — there are gaps. So one of the things I thought about was: leaning on the enormous amount of work that's already happened in open licensing, what would be appropriate for making these digital objects that travel with research available? We go back to things like communalism — let's make things as broadly available as we can — and the fact that research operates by citation; that's how we get our credit. There are many licenses that will do exactly that: make software, or other copyrighted objects, available to do whatever you'd like with, as long as you attribute the original author — like the MIT license we saw on ResearchCompendia, for example. So my recommendation is: take the software objects associated with research and make them available with an attribution-only license, or make them available in the public domain. These are the ways of structuring intellectual property that most closely match the norms we already have in the research community. For media components — the article, figures, tables — make those available under an attribution-only license for text, not a software license: for example the Creative Commons Attribution license. That would match our long-standing norms. Now, I've just swept aside the whole issue of publishers and copyright — we're talking in ideals right now.

The final object is data, and data is trickier to license, because there is no copyright on raw facts in the US — and of course licenses adhere to copyright. However, as to the notion of what a raw fact actually is in research, I bet every single person here would give me a different answer, and the courts haven't really answered that question either. I think, arguably, if you have done some "original selection and arrangement" of the raw data — in the Supreme Court's words — there may be copyright that adheres to at least that original selection and arrangement, although not to the raw facts. So perhaps an open license could travel with data; or, if you want to sidestep that whole conversation, putting data out in the public domain is another way to make it broadly available.

So the idea of the Reproducible Research Standard is to take away the barrier introduced by copyright to the promulgation of Claerbout's notion of really reproducible research — where I can take computational findings and get the code, get the original data, presumably rerun it and verify those results, extend those results, take the code and put it on my new data, or take the data and mix it with something I'm doing and extend the research — in what would, I think, seem to everyone in this room like a very natural thing to do. It's taking that IP framework and aligning it with the long-standing norms I discussed earlier.
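Put together, the standard amounts to a simple mapping from each component of a compendium to an attribution-only (or public-domain) instrument. A minimal sketch, following the recommendations just described — other attribution-only licenses would serve equally well:

    # The Reproducible Research Standard, sketched as a mapping from
    # compendium component to a recommended licensing instrument.
    RRS = {
        # Media: article text, figures, tables -- attribution-only, for text.
        "media": "CC-BY",
        # Code: an attribution-only software license (not a text license).
        "code": "MIT",               # a BSD license is another option
        # Data: no copyright in raw facts, so dedicate to the public
        # domain (or openly license any original selection/arrangement).
        "data": "CC0 / public domain",
    }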
Okay, so I wanted to finish with a couple of slides of open questions and queries, to make us think as imaginatively as we can about what this scholarly record could look like. Here are queries I would like to be able to make on scholarly computational publications. I'd like to be able to ask the scholarly record: show me a table of effect sizes and p-values — statistical output — for all phase 3 clinical trials for melanoma published after 1994. That's a very hard query today. I would like to query the scholarly record and ask for all the image denoising algorithms that have been applied to remove white noise from the famous Barbara image — it's used all over signal processing and image processing — so tell me what all those publications were, and actually give me the citations for where those algorithms were used and introduced. Give me all the classification algorithms that were used on the famous acute lymphoblastic leukemia data set — one that's also very common in a certain strain of the literature — along with Type I and Type II error rates and other statistical output; the output would all be in those papers, so we're not really computing anything new — I just want what's been published delivered to me. Give me a unified data set containing all the published whole-genome sequences identified with a mutation in the gene BRCA1 — that's also a very hard query to do. And: I would like to randomly reassign treatment and control labels to the published cases of a particular clinical trial, and then calculate the effect size; then repeat this multiple times, and do it for every clinical trial published in the year 2003, listing the trial name and a histogram giving these re-evaluated results side by side, so we could get a sense of the effectiveness of the clinical trials in the published scholarly record. These are hard today, but I would like to move to a reality where we can routinely make these kinds of queries.
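That last query is essentially a permutation test, and it's worth seeing how little code it takes once the underlying trial data are actually accessible. A rough sketch, with made-up outcome numbers standing in for a published trial:

    # A sketch of the last query: randomly reassign treatment and control
    # labels, recompute the effect size each time, and compare the observed
    # effect against that reference distribution. The outcome data here
    # are invented; the point is what becomes routine once trial data are
    # genuinely accessible in the scholarly record.
    import random

    treatment = [7.1, 6.8, 7.9, 8.2, 7.5, 8.0]   # hypothetical outcomes
    control   = [6.2, 6.9, 6.4, 7.0, 6.1, 6.5]

    def effect_size(t, c):
        return sum(t) / len(t) - sum(c) / len(c)   # difference in means

    observed = effect_size(treatment, control)

    pooled = treatment + control
    n_t = len(treatment)
    reps = 10000
    count = 0
    rng = random.Random(0)   # fixed seed: this analysis is itself reproducible
    for _ in range(reps):
        rng.shuffle(pooled)
        if effect_size(pooled[:n_t], pooled[n_t:]) >= observed:
            count += 1

    # One-sided permutation p-value for the observed effect.
    p_value = (count + 1) / (reps + 1)
    print(f"observed effect: {observed:.2f}, permutation p-value: {p_value:.4f}")

Running this across every trial published in a given year is trivial computationally; the barrier is access to the data, not the analysis.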
Okay — cyberinfrastructure, thinking about tools. I think successful tools need to minimize the amount of time the researcher has to spend learning and using the tool; I don't think a successful tool is something that takes a huge amount of time and effort and energy. We need to think about automating as much of the discovery and dissemination process as possible — not the intellectual aspects, but things like: what version of the data set was used to produce Figure 4? That's something the machine knows, and we can capture it from the machine with the appropriate tools. What particular frozen form of the software was actually used — can I grab that? This doesn't stop us from fixing bugs in software — all software is buggy — and it doesn't stop us from fixing mistakes and updating data sets, but we still need the versions that produced the results in the scholarly record, mistakes and all, warts and all. Then, as I mentioned, facilitating queries across the scholarly record that go to the computational aspects — data, software, algorithms, for example — and capturing all the information needed in the research process to allow people in the field to verify and assess findings. It's all really about delivering the evidence for those results.

Okay, there have been a couple of community responses that I'll just touch on. At Yale Law School we did a roundtable in 2009 with different stakeholders, and we published a workshop report called Addressing the Need for Data and Code Sharing in Computational Science. At Brown University we had a similar roundtable as part of a week-long ICERM workshop in 2012, and there we titled our workshop report Setting the Default to Reproducible in computational science and research — getting the code out there, getting the data out there, getting these artifacts out there accompanying their results. Then in the summer of 2014, with the XSEDE project — a project meant to facilitate access to high-performance computing technologies, machines, and systems for users who may not typically work there — they thought: since XSEDE acts as middleware, with software already wrapped around these processes, maybe that's somewhere we could really enable some of the cyberinfrastructure aspects of reproducibility, and capture things like code and data versions, machine state, and so on.

I won't spend time on it, but of course we're being pushed, as everybody knows, by the White House, with its mandates around access to publications and access to data. Notice that access to code and access to workflows wasn't mentioned — I think it's up to us as a community to wrap those into the discussion and have a more holistic way of presenting the results; really it's about evidence, and it's about reproducibility. The federal agencies are moving beyond what OSTP is requiring them to do: last September the National Science Foundation held a workshop called Reliable Science: the path to robust research results — going beyond just dissemination of data, for example — and NIH has a Rigor and Reproducibility page describing its efforts to make data available, software available, and so on. And then journal requirements: journals, another stakeholder, are also kicking in. I did a study a few years ago on what journals are saying about code requirements and data requirements at publication, and they are not at the point where this solves the problem, but they are taking nibbles at it, making data available and making code available.

Okay, so there are two different ways to think about this. On the production side, we have things like crowdsourcing and public engagement in science: we're putting data online and software online with our published results. This is really different, from a societal perspective, from one researcher writing a letter to another asking them to mail a preprint — people from all walks of life and all backgrounds are now looking at this; it's not an internal dialogue anymore. So this pipeline — opening up access to these "coherent" digital scholarly objects (coherent is in quotes because I'm trying to communicate how they should be attached to a published narrative, with usability and licensing that enable reuse) — also requires mechanisms for evaluating new findings.
What happens when people start playing with our work in the scholarly community and come up with really interesting new insights and research? Do we have a way of folding that into the larger discussion, or are we just expecting them to submit their publication to a journal like all of us do? And then, as I mentioned, there are the legal issues around reuse and privacy. On the other side — in a sort of opposition to crowdsourcing and this public engagement — is the use of the output of our research, more broadly than ever. The findings and publications coming from what I've been calling third-branch and fourth-branch computational research are feeding into all sorts of policymaking. Everybody has heard "evidence-based" whatever, right? Evidence-based health care, evidence-based policy, evidence-based medicine. This is becoming part of the currency of our policy dialogue itself. So how do we know those findings are really right, unless we can go in, look at the code, look at the data, rerun things, and do the verification checks that have always been part of science — checks that now require more thought and more effort to extend to the computational aspects?

Okay, so — cyberinfrastructure dreams and wishes. Data access; software access; persistent linking to publications, as I've mentioned — and I'm including the workflow parts of the conversation in that linking. DOI assignment on articles, data, code, and workflows, making this entire compendium the thing we publish, rather than the static PDF. I've talked about data access, but I haven't talked about the barriers to data and code access. For example, can we be more innovative when we have legal or ethical barriers to the sharing of data? If you have human subjects in your data, we're not just going to be able to make those data available; if you do education research, we're not able to make student data publicly available. But can we be more innovative about how we might provide access, rather than thinking of it as a binary — closed or open? I think yes. Are we using robust methods, producing stable results and findings from these data? Do we have our emphasis correctly on reliability and reproducibility of the findings we're garnering from the data? And I think the cyberinfrastructure itself should be open source — inspectable, reusable, and so on.

Okay, and I'll just make a quick note of the Google Flu Trends experience. If you remember, Google started making announcements and predictions based on its search trend data that were actually better at predicting flu trends in the US than the CDC, for example. They got a lot of press and attention for this, because it looked like people's search terms in Google were providing a better picture of what's happening in our society than the doctor reporting and hospital reporting to the CDC that we've long relied on. People were very excited about this. But if our conversation today is of interest to you, then you might have had questions like: how do we know Google is getting that right? What are their prediction methods, what are their models, what's the underlying data? All of that was opaque. And then in 2013–2014, Google Flu Trends suddenly stopped working — the predictions weren't as good — and to this day I don't think anybody outside a small group within Google knows why.
So the question becomes: when can we rely on this output and this information, and when can we not? And shouldn't we have better mechanisms than just "trust me"?

So, here are principles for cyberinfrastructure. Supporting the scientific norms: I've put a few out there today — not, as I said, in a definitive way — but there are long-standing norms that characterize research itself, and we need to support them. Supporting best practices in science: this is, I think, a more open discussion — what should best practices look like in the computational arena? Taking a holistic approach to cyberinfrastructure: many of the examples I've used in this talk would easily fall under the paradigm of data science, so what does it mean to do data science well? It means understanding what your data generation mechanism was, where you got the data, what metadata are associated with the data, what you've done to prepare that data for analysis, and what analysis you've done — as I mentioned, all of that is encoded in software and almost never shared — and then: why should I believe this? What are you doing to convince me that the machine learning algorithm, or whatever analysis you've applied to the data, is really right? Can I even take a look at it? Did you tweak the parameters until you got really good results and then publish those, or is this something more broadly stable and applicable? Right now that's difficult to know for many published results. So it's about understanding this end-to-end research pipeline, right through to reuse, sharing, and how it's handed off to the next project. And I'll just mention, as a teaser, the social and political environment: all these discussions — who pays for cyberinfrastructure, for example — start to involve government entities and funders, and these discussions are always embedded in a larger political and social context that we need to be aware of.

So I'll finish with some open questions. Who funds and supports, say, tool development and cyberinfrastructure? Who owns data, code, and research outputs? I've been loosely talking throughout the talk as if the researcher generated the data and is making it available, but oftentimes researchers are using data that were generated somewhere else, by other groups. Do they want to put it in a trusted repository their community uses? So we have a generator; we have a user who's created results; we have a repository adding value through metadata, persistence, and discoverability. What about the funder who funded the project? Wouldn't the university say, I also supported that — those are my data? We have all these tugs around who owns data, and we haven't even gotten to the conversation about who owns code, which is similar; I think that's a conversation that's coming. Who's controlling access? Who's controlling the gateways? Having that discussion before it's a calamity is, I think, important. What are the community standards around documentation, citation standards, best practices? Who's going to enforce them? Should they be different for different research groups, or do we have larger principles we can communicate to the scientific community and then act on and embed in the cyberinfrastructure we're developing? How do you cite the cyberinfrastructure itself? How do you reward this? That goes back to number one: who's going to fund and support it?
How do you know what's working and what's not working? What are the incentives to do this development of cyberinfrastructure, and how should we be thinking about a healthy system that actually encourages this type of behavior? I'll stop here — I think we have maybe not enough time for questions, but a minute or two — so I'll open the floor to questions, and of course, as I said, I'll be around through the whole workshop if you have other questions. So thank you.

Hi — I was interested in your work on research compendia, and whether that sounds like something that could be automated as a confirmation in the future; and secondly, whether there are any success stories hidden in failures, where something didn't pass?

Those are great questions. One of the things we've been working on that I didn't show you here is having small virtual machines around the computational deliverable — around the code and around the data. We're not doing this yet, but you could imagine having a minimal virtual machine that a researcher might even work in, and then just deliver that as part of their publication; then you would know that things worked, and you'd have that confirmation in advance from the researcher. We're not at that stage yet, but I can certainly see that level of automation — and it's a great question; I'd really like to see ResearchCompendia even more automated than it actually is. Right now we're just layering on top of the existing system, so it's clunky in a way, and we can certainly be smoother in the future.

Dave Rosenthal from Stanford. Two quick points on that. Firstly, availability of the source does not, of course, guarantee reproducible execution. We have the example of the emulation of the CHASTE software that was done by the team at CMU, where the code was deposited with the paper, but it requires a specific version of Ubuntu and a specific set of libraries in order to run. The other is: while people are starting to deliver virtual machines, we noticed that the feed we get from ACM for the CLOCKSS archive suddenly became bigger last fall than the feed we get from Elsevier, which was quite a surprise. The reason was twofold. One was that it started being filled up with videos of people giving presentations, which take a lot of space. The other was that it started being filled up with VMs — and most of these are VMware virtual machines, which are huge, which there is no guarantee that we can run in the future, and which contain all sorts of potentially licensed software that we have no idea about. So we were forced to consult our agreement with ACM and point out that they were indemnifying us against copyright violations by the material they were delivering to us, and that we were going to depend on that.

Yes — those are excellent points. I could have spent the whole talk just on software sharing and all its pitfalls and issues. One of the points you're raising that's extremely important is that just having the source code doesn't necessarily mean anything: maybe you can inspect it, but if it's hundreds of thousands of lines long, how do you even inspect something that big? It'll take you years just to read it. So the way I've been thinking about solutions around that is, again, leaning on the traditions of the open source software community, where they deliver software with tests. We do not do that at all in the research community — we have no notion of software testing. If I wanted to understand how a piece of software worked, and maybe whether or not I should trust it, I would immediately start leaning on tests, and that's a conversation that's just starting.
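To make that concrete: in the open source world, even a small analysis function typically ships with tests that pin down its expected behavior. A minimal sketch of what that could look like for research code — the function and the expected values here are hypothetical, and the tests are written in the style runnable with pytest:

    # test_analysis.py -- a minimal example of shipping research code
    # with tests, in the style of the open source community. The
    # analysis function and expected values are hypothetical.

    def normalize(values):
        """Scale a list of numbers linearly onto the range [0, 1]."""
        lo, hi = min(values), max(values)
        return [(v - lo) / (hi - lo) for v in values]

    def test_normalize_range():
        # The output should always lie in [0, 1], with both extremes hit.
        out = normalize([2.0, 5.0, 11.0])
        assert min(out) == 0.0 and max(out) == 1.0

    def test_normalize_known_values():
        # Pin down an exact expected answer, so silent changes in the
        # code's behavior are caught before they reach a publication.
        assert normalize([0.0, 5.0, 10.0]) == [0.0, 0.5, 1.0]

A reader confronted with a large research code base can run the tests first: they document intended behavior and give at least some basis for trust without reading every line.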
So certainly source code is neither necessary nor sufficient for reproducibility; however, I think in the majority of cases it's extraordinarily helpful. And of course the ideal would be running the software — I sort of skimmed over that, as if we would just run it all, and that's non-trivial as well, in terms of what needs to be delivered to actually run the software. As one person said at one of our workshops, you can't virtualize Blue Waters: sometimes you're on very specialized hardware, and what does reproducibility mean in that context? So there are fascinating edge cases around these issues. However, I think what I'm talking about probably applies to upwards of 80 or 90 percent of the research being done, where you have relatively short scripts on static databases, and we could do a better job sharing that openly, as a default — and then start to think about, say, very lightweight virtual machines, and start to innovate around this. I don't think these problems are insoluble; I think there will always be very challenging cases, though.

Yes — oftentimes in web development, things that are just good enough tend to take off and win out over more worked-out, scholarly, formally devised systems. JupyterHub strikes me as one example of that, because it seems to embody a lot of these elements. I'm just curious about your reaction to it as a submission mechanism for journals and things of that sort.

Yes — this is taking off, and I think it may become the next standard. My slides aren't on the screen, but on that slide I had with the infrastructure responses, one of the things I point to is Jupyter, and this is being used everywhere. If you remember the breakthrough on the detection of gravitational waves: the computational aspects are reproducible, and they're shared in a Jupyter notebook, for example. That's something where we start the conversation in the right way, and then we can fail fast. ResearchCompendia is an example where I'm building on an existing system — it's certainly not beautiful or highly developed software — but we're trying different things out, and we're prepared to make mistakes and have an agile response. My understanding of Jupyter is that it's similar, and extensible: Fernando Pérez evolved it from Python out to, say, R, Julia, and other pieces of software being used all over science as well. So it's certainly the case that we're not going to hit perfect the first time, and I'm fine with not having the beautifully engineered solution, as you mentioned — taking what we can get and building on it. We just need to be sensitive to the brittleness of software: once standards have evolved and are in place, particularly with software, it is difficult to change course. So we should think about what it would mean to maximize our ability to stay agile.
And that's part of the reason underlying my push for open source in the platforms, not just in the research as well. Okay — it looks like we are out of questions and out of time, so thank you very much.

Thank you very much, Victoria — that was a really illuminating talk, and there's a lot to think about. I suspect you're going to be having a whole lot of conversations in the next 24 hours or so. Thank you again.