I'd like to talk about workflows and gateways, and how workflows tie heavily into reproducibility. We'll very briefly survey our resources, which are available at no charge to everyone in this room doing research, and beyond. Then some other initiatives, one of which we call Compass; we'll hand back and forth, getting into community data, and close with a few examples of things we're doing at PSC and with the very broad community, both in Pittsburgh and nationally.

Okay, so first: PSC is here because we are a joint institute of Carnegie Mellon and the University of Pittsburgh. We're around 32 years old, and we do a number of things. We have large systems, as in the top left, where we support open science. We also lead research projects; right now we have about 30 active grants, of which only five are supporting the big systems, so these collaborations are ways that we work with the community on leading and enabling science. We lead national training: we have a workshop calendar on our website, distributed through MCS, that rotates through topics, one of which is big data and artificial intelligence; that one is very well attended, and everyone here is welcome. We also support the Pitt and CMU communities; that's the top right. We provide networking to the region, which of course is essential for moving data around; we've helped some people in this room get data from their labs to really high-end data and compute resources. And we also work with industry.

I want to jump very quickly to just one slide. It's a cartoon representing workflows. To me, scientific workflows and data workflows are a way of capturing what you've done, maintaining provenance, and enabling reproducibility, because now you have a record and you can re-execute that workflow. You have a complete record of the data provenance, the tools, perhaps even the software and the binaries, such that you can reproduce what you need at will. This example is from the Center for Causal Discovery, which was an NIH Big Data to Knowledge center of excellence. It's a collaboration between the University of Pittsburgh, Carnegie Mellon (philosophy in particular), PSC, and Yale, with various other collaborators for different pieces. The gist is that we deployed an open web client that runs on everything from tablets to laptops. It talks to our big systems, and most users explore cause-and-effect relationships in big data, looking at lung disease, genomics, and brain dysfunction, without ever knowing they're actually using what we call a supercomputer. That's all behind the scenes. It's software as a service, it's very cloud-like, and it is democratizing: it lets everybody use these capabilities without having to become programmers, and it really supports reproducibility.

Our big machine right now is called Bridges. We designed it to support data-intensive science and all the things we've been talking about. You can read what's here; these slides will be available. It alone has about 29,000 cores, and we're in the process right now of adding a number of GPUs that Paola will talk about. It really does support the integration of high-performance computing, data, and artificial intelligence; it was the first machine in the world to do that, and it's now been copied in several other places. This is for research: you can have it for research or for coursework,
with allocations anywhere from small (you can get one in a day) through millions of hours, all at no charge. We designed it to operate in very easy-to-use ways, so it supports what you already do: Jupyter, Python, Anaconda, MATLAB, everything people already use. You do not have to learn a different way of working, even though it's a thousand times bigger than what you're used to. And so the data is uniquely enabling. Let me transition over to Paola for some new pieces of Bridges.

Thank you, Nick. What I'm going to talk about now is one of the latest expansions we're making to Bridges. We are planning, this week and next week, to incorporate a new element into our big system: one NVIDIA DGX-2 box. For those of you who know, this is the latest and greatest system for doing deep learning at scale. When you talk about which GPUs to use for training your models or doing inferencing, if you're talking about a single GPU, the best option on a cost-benefit basis might not be a Volta (we can talk about that later), but when you're talking about scale, doing data parallelism and model parallelism and using many GPUs at the same time, there is no better system at the moment than a DGX-2. It's highly interconnected to make all the GPUs work at once, and it brings a total of two petaflops of mixed-precision performance. So we are very excited about putting this out there. As Nick mentioned, all of the services we have at the center are free to open science; open science, and enabling open science, is at the core of our mission, and this will be too. We are starting an early user period this year that goes until December, and we are welcoming new users who think they can leverage these resources to do deep learning at scale; we are welcoming data, and we are offering guidance through the process. In total it's one DGX-2 plus nine servers, each with eight Voltas, for a total of 88 Voltas: a very, very powerful resource that we are very excited to make available.

Another initiative we have at the center to help open science, and in particular to help advance AI at scale, is called Compass. Compass is about letting users in the open science community get access to the latest and greatest technologies designed for artificial intelligence: not only getting access and being able to get their hands dirty with them, but also getting support and guidance on which technologies, which accelerators, which storage and interconnects are the best options depending on the model. Depending on the type of work you're trying to do, your bottleneck might be compute, or data transfer, or the memory of the accelerator; those are the kinds of things we are exploring. It's a six-front initiative. There is the Compass lab, which holds the actual processors, interconnects, and technologies, FPGAs as well. There is Open Compass, where we work with a selected group of representative science projects, benchmarking these technologies and coming up with best practices that we can disseminate. There is the Compass consortium, through which we collaborate with the private sector on representative use cases that are needed in production. There is the educational and training component, where we disseminate those best practices. We also support community data sets that are necessary for deep learning and for executing these projects. And finally there is Compass research, which studies the scaling side of artificial intelligence.
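To make the multi-GPU training Paola describes concrete, here is a minimal sketch (mine, not from the talk) of single-node data parallelism in PyTorch. The model and data are placeholders, and nn.DataParallel is only the simplest mechanism; a DGX-2-class machine would more often be driven with distributed variants.

```python
# Minimal single-node data parallelism sketch. Assumes PyTorch is
# installed; falls back to CPU if no GPU is visible. Model and data
# are placeholders, not anything from the talk.
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(512, 256), nn.ReLU(), nn.Linear(256, 10))
if torch.cuda.device_count() > 1:
    # Replicates the model on each GPU and splits every batch across them.
    model = nn.DataParallel(model)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)

optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
loss_fn = nn.CrossEntropyLoss()

for step in range(10):
    x = torch.randn(256, 512, device=device)         # fake batch
    y = torch.randint(0, 10, (256,), device=device)  # fake labels
    optimizer.zero_grad()
    loss = loss_fn(model(x), y)
    loss.backward()   # gradients are gathered/averaged across replicas
    optimizer.step()
```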
Open Compass is the part where we work together with projects. If you have a project that you would like to have benchmarked on different deep learning technologies, different GPUs, or you want your code to scale beyond one GPU, or to use more than one processing node (it doesn't have to be a GPU; some machine learning applications benefit from regular processing cores), you're more than welcome to talk to us and see in what ways we can work together.

In the community data sets part, what we do is offer, for certain mature corpora, to host them on our systems. We are mainly interested in community data sets that are relevant to a specific field: say, ImageNet or MNIST; another example is the GDELT project, which is constantly scraping the world's news, identifying people, events, and locations and building graphs; or the Common Crawl. These are data sets that are big, that are relevant to a community, and that want to be shared. We take care of hosting the data, moving the data, and all of that, allowing users to focus on their science. Also, if you have a data set that you want to make available to your own community, we can consider hosting that too, and you can limit access however you want: completely public, or available only to specific groups. I just want to mention that the Common Crawl, for instance, is a very big data set, two petabytes and growing, currently hosted, for example, on Amazon. When you want to query that amount of data, it's very pricey. One of the benefits of this initiative is that the data is hosted here at PSC, where it interoperates with high-performance computing capabilities: we have a huge machine that can query the data for free, and we have all the tools, Hadoop, Spark, and frameworks for deep learning. That's game-changing; it really lets you actually use the data.
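As a concrete illustration of why co-locating data and compute matters, here is a minimal PySpark sketch (mine, not from the talk) of a full-scan query over a locally hosted community corpus. The file path and column name are hypothetical.

```python
# Query a locally hosted community data set with Spark. The Parquet
# path and the "domain" column are hypothetical stand-ins for, e.g.,
# an extract of a large web-crawl corpus.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("community-dataset-query").getOrCreate()

df = spark.read.parquet("/data/community/crawl-extract.parquet")

# A full scan like this is cheap when the data sits on the local
# parallel filesystem, but would incur per-byte egress charges if the
# bytes had to leave a commercial cloud.
(df.groupBy("domain")
   .count()
   .orderBy("count", ascending=False)
   .show(20))

spark.stop()
```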
I'm going to pass it back to Nick, who's going to talk about some examples of other data sets.

Okay, so I'm very quickly going to wrap this up with two great examples. The first one is the Brain Image Library. This is an open science project we host at PSC. Alexander Ropelewski, our biomedical director, is the PI, along with Marcel Bruchez in Biology at CMU and Simon Watkins at the Pitt Center for Biologic Imaging. The objective is to amass the world's best, and only, compendium of confocal fluorescence microscopy data across species including mice, marmosets, and rats. They're shooting for 10 petabytes in this first phase. It's a great resource for people here at CMU and Pitt and beyond; it really is a national, even global, resource for doing neuroscience research.

The other thing we have just begun is the Human BioMolecular Atlas Program, HuBMAP. This is a large NIH initiative aimed at obtaining very high-resolution imaging, omics, and single-cell sequencing for all major tissues in the human body except the brain (and that's fine, because brain is done in a separate program). We applied for one of the infrastructure components. We are doing a hybrid solution: HPC here in Pittsburgh combined with cloud resources such as Amazon, giving researchers the best of both worlds. This will be tissue data from tissue mapping centers at five different sites, Caltech, Stanford, Harvard, and Vanderbilt among them, and we'll be supporting other researchers across the country doing tool development. One of those projects is here at Carnegie Mellon, led by Ziv Bar-Joseph in computer science and computational biology, but there are other tool developers at Harvard and elsewhere. People do mapping, where mapping in this case means: what's the coordinate system, and how do we do queries across the body over all these things coming in in different modalities? We're looking at combinations of standard imaging, mass cytometry, single-cell sequencing, RNA-seq, proteomics, transcriptomics, all these different kinds of tissue data, and asking how we map a coordinate system across different humans for tissue visualization and analysis in a meaningful way. This is just the beginning; we had a kick-off meeting last week at NIH. Our best guess is that this data will grow to between 10 and 100 petabytes; we'll see as we go through the next few months. It's a four-year project to build, with at least three years of production after that. It's a really exciting opportunity for CMU and the Pittsburgh region.

I think we're out of time, so I'll take a couple of questions. The last thing I want to say is that we're having a workshop tomorrow, two to four, which is hands-on on using Bridges, for anyone who wants to figure out how it works, see if it works for them, and walk through how to get an allocation at no charge. And a final call: if anyone has data that could fit this community data sets idea (we're looking at an initiative for building scientific research data management in Pittsburgh in a very large way), if you have data you would like to have close to compute and supported, please talk to us.

Great, does anybody have any questions for Nick and Paola?
Go ahead.

I have a question. You were talking about these hybrid solutions where people can move between local hardware, the basement somewhere around here, and the cloud, and I wonder if you could say a little bit about the technology that allows people to move easily between those, because usually moving between those kinds of things is actually really hard.

First of all, I'm jealous: you apparently have a very big basement. But seriously, for the hybrid piece there is a combination of workflow tools and connectors, nice connectors between what we do and the cloud. To be clear, what we do has advantages in that the compute is free (it's already funded by national tax dollars), the storage is also free for research, and we can do things at big scale: we can use the GPU resources that Paola mentioned for AI, and we have large-memory nodes with up to 12 terabytes of RAM each. So we do an awful lot here that doesn't have a business case in the cloud. The cloud, in turn, has a lot of community data sets that are right out there, and it has massive scale, so there are reasons to use both. For this project we really see a good back and forth, with Globus moving data: data computed in the cloud comes back here to be stored for the workflows Nick described. At the center we have a networking group dedicated to finding the better, or best, ways to move a particular data set; with very big data sets it depends on where the data is and which technologies are supported at each end, and we work that out, usually with Globus if it's big.

I'll just add that, in general, when you have a lot of data movement and the cloud is in the picture, given that the cloud charges for egress of data, you have to plan your flows in a way that minimizes that burden; it can get very high very quickly, depending on the intensity of your work. Usually we work on a case-by-case basis, and we can talk about a specific project. In that same line, that's why having the data very close to the compute, if that's an option, is very beneficial: there is the time delay, but also the cost of moving the data. And we support containers on Bridges, so if you containerize your application, you can move it back and forth.

Actually, I have a question here: do you actually provide support for data analysis?
On Bridges we have a very mature environment, and we take care of keeping things like Spark and Hadoop updated and working, so there are a number of tools out there. If there is a specific project that is interesting for the center, we can have some people, employees or interns, dedicated to working with that project on the actual data analysis and execution, the algorithms and whatnot. Data analysis by default is not an element of this initiative, but it can be done, provided there is interest on both sides.

What I would like to mention is that when you get an allocation on this system, which as I said is free, you can also request what's called extended collaborative support. What that gets you is up to 20% (normally; it could be more) of a person who can work with you on approaches, maybe on some data analysis, or on scaling; normally that's 20% of a person for a year. So, all in all, there are many things we are doing. The core of our mission, the reason the center exists, is to enable data science by providing infrastructure, storage, and also support. So if you have a project where you think we might be able to help, come and talk to us; we'll help you find out which of these initiatives or services would fit your needs.

Okay, our next speaker is Dan Valen, who works on strategic partnerships at figshare.

Thanks, everyone, and thanks to the organizers for inviting me; I'm excited to be here. I just wanted to give a quick overview of figshare for all of y'all who have not heard of it. This is our slogan: figshare is a platform that allows you to store, share, and discover research. We loosely define digital research to encompass all the outputs that result from research, so this could be anything from raw data outputs all the way up to 3D printable images. You'll notice on the left here it says 'discover research': on figshare there are featured categories, and they are cross-disciplinary, so we are what is called a generalist repository; we accept content from across all disciplines. Some of the features of figshare: once we publish content, we assign open licenses for reuse of that content; we assign digital object identifiers, a persistent identifier that aids in the ability to cite that research; we allow versioning of that content; and we ensure the content stays available, with a robust infrastructure to facilitate that, which I'll get into in just a bit. To stick with the theme of our slogan, on the store side it's a way to host your data publicly and collaborate privately. In the back image there you'll see the dashboard you get once you log into the platform; it allows you to create projects and collaborate privately prior to publishing your research. You can also annotate and assign metadata to that research, giving as many descriptors as possible so people can understand what it is you're publishing. And once you do make it openly available: this one, I actually looked it up earlier, is a sea worm, and those are its jaws, in a 3D printable STL file. If you were to visit it on the site it would be spinning, and you could 3D print it if you had the capability. Once you make your research public, we aim to make it as visible as possible. Speaking of the STL previews, we also aim to create previews of your data: once you publish content to figshare, we actually have support for
over 1,200 file type extensions, to really highlight your content, and that's the first thing you see when you land on an article page. There's actually a screen grab of a Jupyter Notebook: we have preview functionality for Jupyter Notebooks, for STL files (the 3D printable files I mentioned), and I think the most recent addition was FITS, a deep-space image format. So we really try to bring your research to life. I also wanted to highlight this little piece here: we support the versioning of your content, so once it is publicly available you can version and iterate it, and the DOI will resolve to those versions as well.

And the final piece of our slogan: discoverability. We took a look at how researchers, and the public, come to the site, and 60% of traffic comes from Google. I know there are some librarians in the room; we realize the importance of Google in driving traffic and landing people on the site, so we make sure that all of our content is indexed and marked up appropriately for discoverability on Google and, where appropriate, Google Scholar; we have relationships with the teams there. I think the most recent and exciting development is Google Dataset Search. Has anyone heard of this? Yeah. This was a recent announcement from Google, and we were actually an alpha tester, so any figshare content that is considered a data set is indexed by Google Dataset Search; that's what the screen grab is down here. But to steer away from Google for a second: we are indexed across a lot of the community initiatives. I mentioned here DataCite, SHARE, and the Data Citation Index; we also work with bioCADDIE, and with DataONE for earth sciences research. So we try to make the research discoverable in the relevant places, where researchers actually look.
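For context on how that indexing works (my illustration, not Dan's): Google Dataset Search discovers records through schema.org 'Dataset' markup embedded in landing pages. Repositories like figshare generate this automatically; a hand-rolled equivalent, with placeholder values throughout, might look like this.

```python
# Sketch of schema.org "Dataset" JSON-LD, the markup Google Dataset
# Search crawls. All field values are placeholders.
import json

dataset_jsonld = {
    "@context": "https://schema.org/",
    "@type": "Dataset",
    "name": "Example confocal imaging data set",       # placeholder
    "description": "Raw images underlying Figure 2.",  # placeholder
    "identifier": "https://doi.org/10.0000/example",   # placeholder DOI
    "license": "https://creativecommons.org/licenses/by/4.0/",
}

# A landing page embeds this inside:
#   <script type="application/ld+json"> ... </script>
print(json.dumps(dataset_jsonld, indent=2))
```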
Just a little background on where we've come from and the communities we're working with. figshare is about a six-and-a-half-year-old company. We started out of our founder's PhD, as a way to share these non-traditional research objects for credit, so it was initially built for end users; if you go on the Wayback Machine and look at early iterations of figshare, it's essentially a WordPress blog, and I recommend you do that. But we still, at the core of our mission, want a free service that allows individual researchers to publish their content. If you go to figshare.com you can create an account and upload as much content as you would like; there are some limitations on file sizes, but other than that we really do try to incentivize researchers to make their content publicly available. In building out that infrastructure, we've started to serve multiple communities, the first of which were academic publishers. Back in 2013, PLOS, the Public Library of Science, came out with an open data policy: if you publish with PLOS, you have to have a data availability statement saying where the underlying research lives. PLOS came to us and asked whether we could extend our infrastructure to support that initiative, so we host all the supplementary material for PLOS and do a lot of the visualizations of their content, and we've since started working with, I think, over 25 academic publishers, providing a suite of services. The most exciting moment was being approached by the American Chemical Society to act as the infrastructure for ChemRxiv, their chemistry preprint server; we're also working with SAGE on Advance. And to really stick with that initial theme of ensuring that your research is as openly available as possible, we also work with institutions. I would be remiss not to mention KiltHub, which is Carnegie Mellon's repository, powered by figshare; I have a screen grab with Janelia up there, sorry, it should be KiltHub. We work with universities globally, as well as institutional labs; we also support conferences and proceedings; and we've recently started working more closely with foundations and with governments and government agencies, supporting their initiatives.

Some of the ways that users interact with figshare: on the top part you see a public profile, and on the bottom is the dashboard you see once you log in. It's a way for individual researchers to publish their content for credit (I'll get into how we capture that). Data is hosted in any file format and is accessible from anywhere: we're built on Amazon Web Services, so if you upload your content, then as long as you have internet access you can download and view it. There are public and private storage spaces, which I touched on, and collaboration tools, and all published items receive that DataCite DOI. We have a bunch of case studies looking at multiple disciplines. We also have a really exciting new podcast called School of Batman, where the premise is that Batman is trying to solve a problem, and he can only solve it with the help of a specific scientist, so individual researchers talk about their research and how it could help Batman. It's everything from researchers building lasers to linguists solving language problems. You can learn about our case studies on the site, and I recommend everybody give School of Batman a listen; it's pretty professional, and it even made it to the Fringe Festival this summer, which is kind of cool.

I also talked about getting credit for your research. One of the ways we do this (we really work closely with the community) is that we were an ORCID launch partner. Is everyone familiar with ORCID? I'm getting nods, okay, cool. You can sync your ORCID account with your figshare account, so anything you publish on the platform is synced to your ORCID and you can get credit there. You'll also note the little blue box in the top right corner: this is the public profile of a faculty member from Imperial College London, who I like to use in my examples because he has created a tool that pulls content directly from his local machines as he runs experiments. He has X-ray crystallography machines (he's a chemist), and the output gets pushed directly to figshare, auto-populating the metadata fields. He has over 2,500 uploads, and you can see that reflected in the views, downloads, and even citations of the content he's made public on the platform. A good example; I may circle back to him in a bit. Another way that we track attention is through Altmetric.com. They're a company that looks at engagement with your content once you've made it publicly available: tweets, pickup by news outlets and blogs, and that gets reflected on the actual item page. This example is 'Excellence R Us', a preprint on university research and the fetishization of excellence, which, aptly, has really excellent altmetrics: it was picked up by a bunch of news outlets, blogged about 11 times, posted to three Facebook pages. So, in lieu of a citation,
you can see the immediate attention your research is getting once it is published to the world. I also mentioned citations; this is a new development. You can ignore this fantastic title, 'Supplemental materials to submitted paper' (maybe that's why it doesn't have a very high Altmetric score), but it has been cited seven times. So there is some value in this, and we're trying to show that as a way to engage researchers and give them incentives to make their research publicly available.

And this came up last night, so I wanted to talk about it: our API is fully open and documented. You can find it at docs.figshare.com, and you can actually do more programmatically through the API than you can through the figshare GUI. In the spirit of open science, and open data in general, we want to make it as open as we possibly can, and as such we adhere to the OpenAPI Initiative; everything is documented in Swagger, so it will look familiar. It also works really well with the community: I mentioned we have a push and pull with GitHub, where you can sync your GitHub account; if you program in R and use RStudio, you can publish directly to figshare from RStudio; we work with the Open Science Framework; and there's a whole bunch more. We're actually going to set up a marketplace where you can look at tools that people have built on the API, which is really neat.
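As a quick illustration of that API (a sketch of mine, not Dan's; see docs.figshare.com for the authoritative reference), listing a few public items takes only a couple of lines.

```python
# Hit the public figshare v2 API and print a few records. Field names
# ("doi", "title") follow the public articles listing as documented at
# docs.figshare.com.
import requests

resp = requests.get(
    "https://api.figshare.com/v2/articles",
    params={"page_size": 5},  # just the first five public items
    timeout=30,
)
resp.raise_for_status()

for article in resp.json():
    print(article["doi"], "-", article["title"])
```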
So, this is my last slide; it's actually an article on figshare about how to make your research open access. To sum up: figshare is a platform that allows you to publish your content in any format, and we try to make it as discoverable as possible so you can get the attention and credit you deserve for your research. This is where you can find me, or you can tweet us directly at figshare. Thanks so much.

Thanks, Dan. As Dan mentioned, if anyone here is interested in getting content onto figshare, we're having a session tomorrow at noon to help people get started. Any questions for Dan?

One of the things I've seen with infrastructure like figshare is that it tends to be around for a few years and then sort of fade away; think of things like Google's research data offerings. So I wanted to ask about the sustainability model. Are you doing subscriptions from universities? How are you making sure that you're going to be here in five years; who pays the bills?

So, I mentioned all the different communities that we serve, and that's essentially how. We want to keep the tool free for end users; that's the founder Mark Hahnel's mission, and it's his belief that this should be available as a resource globally. What we started doing, when we were approached first by publishers and shortly thereafter by institutions, was to extend that infrastructure as an enterprise solution: we sell services to foundations and governments, to academic publishers, and to institutions, as a way to keep the company moving forward. So we are a commercial company with free components.

Can I follow up on that question? If you're self-supporting through the efforts you're describing, then what is the relationship with the Springer Nature family?

Right, okay. figshare is part of the Digital Science portfolio, and we receive funding from Digital Science. Digital Science itself was born out of Nature, founded, I believe, by one of the head editors at Nature, and when the merger happened they kept Digital Science separate from Springer Nature, because we were our own entity; but we still have that dotted line to one of the parent companies of Nature. figshare is unique in that (I'm allowed to say this because I actually saw the papers) we still have autonomy, because Digital Science does not hold a majority stake. So we are still allowed to do things like keep figshare.com free, and we get to make a lot of the decisions around business models and why we do what we do. There is of course support from Digital Science, which helped us get started as a small company. Does that answer your question?

Okay. What's the ideal workflow for a scientist who has, say, a preprint and code on GitHub? Would they use figshare for the DOI? What is the way a scientist would use this?

It's entirely up to the researcher. We'll see on Twitter that exact example: somebody will have a preprint on arXiv, code on GitHub, and figures with specific licenses on figshare; in that case they just wanted to retain licenses for some of the supplementary content alongside their preprint. Or it could be that you've published all of your work, and at the end of the grant you have all of these potentially negative results that you want to make openly available. It's really up to the researcher; we don't say you have to do it this way, but we do have mechanisms in place to support whatever you choose. We like to say figshare is as open as possible and as closed as necessary: you can publish confidential files, where the file itself is backed up on the infrastructure; you can release files under embargo; or you can link to files that exist elsewhere but still need a DOI. It's about meeting researchers where they're at, and wherever they're comfortable releasing that content.

How do you normalize the data? If individual researchers submit data, how can you make sure that one researcher's search results surface another's because they did something similar? Let me try again: assume for a moment that you have two researchers doing brain research, and one submits results using one term for the cells he's studying, while the other uses a different term for the same cells. Is there any way to normalize this on your end?

Yeah, so this is a question
we get asked a lot, and it involves curation, actually working with the content as it comes in. On figshare.com we just do a light-touch review, to make sure that the content being uploaded is of an academic nature. With our enterprise tool we do have curation modules, where subject specialists can look at content prior to publication: once you click publish, it goes into a curation queue, subject experts at your university weigh in, provide best practices around metadata assignment, and try to aid in that normalization. We aren't as restrictive on the end-user platform, just because of the volume that comes in, but we are taking a closer look at actually curating content and adding tiers. You know, my fun example from earlier, 'Supplemental materials to submitted paper': maybe that is not as complete as it could be, so it would be a tier two, whereas something of actual academic publication quality would be a tier three and could be indexed at a different level.

Okay, our last speaker for this panel is Victoria Stodden; she is associate professor of statistics at the University of Illinois at Urbana-Champaign.

Thanks for the invitation; it's really a pleasure to be here, and it's been a fantastic event so far. A slight correction: I'm in the School of Information Sciences, though I do have a courtesy appointment in the Department of Statistics at Illinois. So, I updated the title of this session to insert the word 'code', to include it in the discussion along with data. They often get considered as quite separate objects, data and code, but they're inextricably linked, right? You never access data without code, and typically when you have code you are using it to access some kind of data; not necessarily, but typically. I also put 'computational' in there, for reasons I'm going to explain in just a moment. Before I get started: I put my slides on my website, so if I go too quickly, or you want to explore a link or whatnot, they're there, and you can grab them on a device and go at your own speed. I'm going to move pretty quickly, because I have 10 to 12 minutes and a number of things I want to say. Okay, so let me go right to notions of reproducibility. What?
Okay, great; in that case I'll go a little slower. So, I wanted to think about how we think about notions of reproducibility, and what the word actually means. Already, through the talks we've seen this morning, people have different lenses on the word and on what it means for their research, so I wanted to pull that apart a little bit, because I think it makes a difference in how we operationalize it in our work as researchers, institutionally, and in infrastructure. I'm going to take some time to do that, then talk a little bit about tools and cyberinfrastructure, some of the emerging efforts that are coming out, including things I've been working on and some of the ideas behind them, and then touch on artifact availability, if I can, by the end.

All right. We use this word 'reproducibility' all the time, which I think is great; it's really important as part of our discourse. But how it relates to computational and data-enabled research has certain nuances, and I've often seen the word used in conversations where people mean totally different things. So I have found it helpful to pull out three different ways of thinking about reproducibility. You may or may not agree, you might have others, but this is something I've found useful, and I think it maps to the larger national conversations going on around reproducibility, across communities that are likewise actually talking about different things.

The first one is empirical reproducibility. Empirical reproducibility, in my mind, is our traditional idea of reproducing a physical experiment. Before we had computational affordances augmenting and leveraging our research, we would have recorded in a lab notebook the steps that we took, and the expectation would be that another group could get to the same results. This is how we've done science for hundreds of years. It's talked about a lot today in the reproducibility debate, and I think it has its own set of remedies.

Another form is statistical reproducibility. We heard Rebecca this morning talking about her history as a statistician and how that maps to reproducibility issues, which indeed it does, and they're separate from the empirical reproducibility issues. Statistical reproducibility covers questions like: do you expect the inferences you drew from the data to also appear in another sample? How much do you expect your work to generalize across samples? Ideally, all the time; and when it doesn't, and the statistics are breaking, as in the discussion around p-hacking and so on, that, to me, falls in the bucket of statistical reproducibility.

The last one, which has been the primary discussion in this session, is computational reproducibility, a separate animal. Here I'm thinking about how we make the computational steps that allowed us to make inferences from the data, the scientific inferences and claims, transparent, so that they can be inspected, verified, reused, and so on.
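To make the statistical-reproducibility point concrete, here is a toy simulation (illustrative only, not from the talk) of the mechanism behind p-hacking: run enough tests on pure-noise data and some will look significant by chance.

```python
# Multiple-comparisons toy: many two-sample tests where the null is
# TRUE by construction, so every "significant" result is a false
# positive. Standard library only.
import random

random.seed(0)
N_TESTS, N, Z_CRIT = 100, 30, 1.96   # 1.96 ~ two-sided alpha = 0.05

false_positives = 0
for _ in range(N_TESTS):
    # Both samples come from the SAME N(0, 1) distribution.
    a = [random.gauss(0, 1) for _ in range(N)]
    b = [random.gauss(0, 1) for _ in range(N)]
    mean_diff = sum(a) / N - sum(b) / N
    se = (2 / N) ** 0.5              # known unit variances
    if abs(mean_diff / se) > Z_CRIT:
        false_positives += 1

# Expect roughly 5 of 100 "discoveries" from noise alone.
print(f"{false_positives} of {N_TESTS} null tests came out 'significant'")
```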
Right now, as we all know, when we publish work that is computationally enabled, which is just about everything we publish, the computational aspects are, most of the time, very opaque and inaccessible. It's very difficult to describe them in words in the text, and words are just not that useful if you want to see how things were actually implemented on a machine. So we've got a lot of work to do in computational reproducibility, and that's where I tend to focus my research.

Okay, so here's a frame I put on computational reproducibility, and one of the ways I like to think about why it matters relative to how we have done science for hundreds of years. We typically think of the scientific method as having two branches. The first branch is deduction: mathematics, formal logic, deductive reasoning that allows us to develop new knowledge. This fails at some point; if we want to understand where we should be planting our crops on the floodplains of the Nile, deductive logic can only get us so far. So we have the second, empirical or inductive, branch, where we do statistical analysis of controlled experiments. Now, there's been an enormous amount of talk over the last ten-plus years about how computation and data have enabled third and fourth branches of the scientific method, so I put a question mark here, branches 3 and 4 at the bottom, and my argument is: do we really have these branches of the scientific method in the same way we have the first and second, if we're not enabling the transparency, verifiability, and reproducibility in them that are embedded in the first two?

Why do we even have a scientific method? It's because, by definition, in research we don't know the right answer, and every step we take to try to learn more about our world is fraught with error. We could be off track at any point; we don't know, by definition, because we don't know the right answer, and we are fallible as researchers, who knows. So this is one of the reasons science is open, and the methods are open to scrutiny: to root error out of the research and discovery process, all the way through, as much as we can. This is built into the first and second branches. You don't go to, say, a mathematical journal and say: I have a new theorem, it's awesome, there's no proof, I just want you to believe me. They would say: no, we're not going to publish this without the proof; I want to know why you think it's true, why you think it doesn't have errors in it, and we'll expose it to the community. Similarly with the empirical branch, there's a long-standing, entire way that you conduct and disseminate the research: there's a machinery of hypothesis testing with appropriate statistical methods, and a very structured way it's communicated in the methods section of the paper. If you submit a paper and leave your methods section blank, it will be immediately rejected, no question.

So here's the challenge I think is before us on the computational side: right now it's a potential third or fourth branch of the scientific method, until we develop standards comparable to those the first and second branches already adhere to. I don't mean this as criticism of the computational branches and their ambition; we've been thinking about this for maybe 20 years, while the other branches have had a couple of hundred years to
come up with their standards. But in computational research, what's our analogy to the proof? What allows people to really check how the results were derived, to buy in that this is very likely correct, as the author asserts, and to understand and support the claim?

Okay, there's a quote I'll read quickly about this, and I want you to know the sentiment was expressed as early as 1992, even though the quote itself came a few years later. It talks about what it means to do computationally enabled research, and what reproducibility means in that context. The idea is this: an article about computational science (so, any computationally enabled research) in a scientific publication is not the scholarship itself, it is merely advertising of the scholarship. The way I like to think about it: if all your work is a big iceberg, the tip is what shows in the scholarly record, and underneath is all the computational work, a lot of real scholarship that is never exposed by the typical way we publish today. The quote goes on: the actual scholarship is the complete set of instructions and data that generated the figures, the tables, the findings and results in the paper. That's something we can start to consider as part of the natural publication of the scholarly record.

I have a little note here, because I often get a question about reproducing these computational steps, and essentially it goes like this: who cares if you can reproduce it? You're not actually advancing science. I'm much more interested in whether someone re-implements the entire experiment, say by rewriting the software, and comes to similar or the same results. And I completely agree; however, you will only come to similar results. You won't get the same results, to whatever decimal point, as the other experiment. So I would argue you still need this level of computational transparency to understand why the results differ: is this within some noise threshold, just variability in the data, or is there a meaningful difference in the implementations that we want to reconcile?

So I think we are moving towards this world of radical transparency in the service of science. A couple of things to mention quickly. This article came out at the end of 2016 in Science, and in it we put together seven recommendations, at the community level, for taking steps towards realizing this; I shouldn't call it radical transparency, really, but scientific transparency. One of the things we wanted to do, the same as I did with the session title, was bring the discussion around code, workflows, and instructions about the data up to the same level as the discussion around data itself. We hear a lot about open data now, which is great, but I don't want that to sit in isolation; it's part of a larger context around research, around reproducibility, around the software and the environment in which the data are embedded. I'll go through these very fast. The first recommendation: share your stuff, the data, the software, the workflows, the details that document what you did, in trusted open repositories. The second: link to the stuff in your paper, so share it and set up
persistent links; the repositories you've been hearing about in the other talks can help establish them. Put them in your paper before you publish, so people aren't googling around on your name and paper title hoping to find something you put somewhere, hopefully not just on your web page; get those links actually embedded in the paper. And cite yourself: if you use code and data that you have shared as artifacts supporting your claims, don't just give the link in your publication, give the citation too. We don't have established citation standards for these artifacts, or for whatever you think you need to share for computational transparency, but do your best and we'll keep iterating. And of course cite anybody else's work that you've leaned on, beyond ideas, at the artifact level too; then, if someone uses your data or extends your software, there's a ready-made citation they can copy out of your paper, which helps chip away at the citation problem. Documentation: journals can check all of this at the point of publication. Then use open licensing, an area I've done a lot of research in, to enable reuse; I probably don't have time to go into it, but make sure all of your stuff is licensed appropriately. (Is that 10 minutes? 10 to 15, okay.) And the last recommendation: support research in this area. It's not obvious how to do all of this in every research context, so treat it as a research area: understanding what reproducibility means in computational settings where there may be privacy implications with the data, and other aspects; let's do research on what that actually means.

Okay, lots of infrastructure solutions are out there; these are all linked on the slide and they're all very cool, but I don't have time to talk through them. Two things I'll mention in my last 90 seconds. I've been working on a collaborative project called the Whole Tale project, where we are trying to create environments that support the entire discovery pipeline for a researcher, enabled with tools that meet researchers where they are, like RStudio, Jupyter notebooks, and so on. The idea is that when researchers get to the end of their pipeline, they can publish what's called a 'tale' (meaning a story, and also the long tail of research): a bundle that lets people discover the code and data linked strategically with the actual results. I'll leave you to explore this; the slide is basically what I just said, with details about tales. It's actually open now (we released our first public version in July), so please use it, test it out, and tell us what's wrong; we have ways to capture feedback and bug reports, so report issues and anything you run into on Whole Tale. It's at wholetale.org, and you can get in with your institutional ID; CMU, Pitt, and so on will work.

The final thing I'll mention is another project, with totally different collaborators, called ezDMP, for easy data management plans. We're trying to operationalize how data management plans are constructed, so that they bring attention to more artifacts than just data, for example software, workflows, and so on; and we're also trying to make them understandable by machines instead of just free text, so that, say, a funding agency can understand what's going on in the community: which resources are people using, are there repositories that are missing, are people gravitating towards certain infrastructure that needs more support, and so on. Right now that's basically impossible with current data management plans. It's also released, if you're interested; the URL is here, dev.ezdmp.org, and we have a form to collect feedback on that one too, because it was also newly released this summer. We would love for people to play around with these tools and give us feedback; these are just small ways we can chip away at what it means to be computationally transparent in the research that we're doing.
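Purely to illustrate the 'understandable by machines' idea (this is my sketch, not ezDMP's actual schema), a structured DMP entry might look like this; structured fields are what would let a funder aggregate and query plans across thousands of awards.

```python
# Hypothetical machine-readable data management plan entry. Field
# names and values are invented for illustration; they are NOT the
# ezDMP schema.
import json

dmp_entry = {
    "project": "Example imaging study",
    "artifacts": [
        {"type": "data", "repository": "figshare",
         "license": "CC-BY-4.0", "retention_years": 10},
        {"type": "software", "repository": "GitHub, archived with a DOI",
         "license": "MIT"},
        {"type": "workflow", "repository": "wholetale.org"},
    ],
}
print(json.dumps(dmp_entry, indent=2))
```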
Okay, that is it. My conclusion is that we will become far more deeply and massively computational in our research. This means we have to become more transparent, in order to manage the size of these jobs and the scale of the research; and in turn, this transparency will allow us to become more effective and more ambitious in the computational projects that we actually attempt to design. Thank you.

Thanks very much for that interesting talk. My question is about when you distinguish between replication and reproducibility. You used the phrase 'noise threshold' to describe how we would compare them, and one implication is that computational reproducibility is a property of the results of a paper; you would say 'this figure is reproducible.' But an earlier definition you used, and the one I use, is that code is computationally reproducible. So when we describe something as computationally reproducible, what is that a property of?

It's a great question, and you can extend it beyond what it's a property of to what it even means: if I do have your code, and I am able to run things and get some output, how do I even judge whether this is evidence supporting your results (close enough to them) or counter-evidence (far enough away)? We want to be able to do that on the processes, right? I always talk about reproducibility in terms of the results. I don't talk about data reproducibility, code reproducibility, and so on; I think those are second-order considerations. What I care about is the claims, and I want to reproduce those claims. I do want the code that you ran, as published in the scholarly record, bugs and all, and I do want that exact data snapshot, the exact version that you used as an input, mistakes and all. There may be later versions that update and improve it, but if it's in the scholarly record, that's the one I want snapshotted. So that's the way I think about it. Does that answer the question?

Thank you for the talk; it was very interesting. What about the special challenges that arise when you're trying to address these goals and you're dealing with huge data or huge computing? I have seen many problems: the data is not available, the results cannot be reproduced, papers being more like advertising instead of really showing how you got there.

This is a very good question, and reproducibility is not equally easy for all groups or all types of research; some have an easier time than others, and one of the places you start running into barriers is very large data sets.
CERN, for example, has been espousing openness for more than a decade: they want to make their data open, so they are making decisions about at what level, at what point in the filtering and reduction of the data, to share it, and that's great and admirable. This talk was really more about what goals we should be going for. There are also questions like: how long does the code need to keep running? No one expects code to run indefinitely; I think that's unreasonable. But then people say, okay, we can set up different kinds of virtualization and so on to keep things runnable; well, how does that work in an HPC context? And so on. That was really my recommendation 7: there are super interesting research questions here. It's not obvious that we have all the answers, and it's not a matter of flipping a switch in a research pipeline to make it open; there are cultural changes as well, incentive problems that we addressed in some of those recommendations, and challenging infrastructural problems.

Having said that, one of the things I've noticed is that in places where I would have thought the problems were easier, shorter scripts, smaller data sets, and so on, like where I come from in statistics, I haven't always seen the same uptake as in communities where I would have thought things were really hard. For example, in the HPC community there are many advances going on around reproducibility, even requirements now on computational transparency for accepted papers; this has been a few years in the works, and more than half the papers accepted now have their artifacts available. And if you wanted to pick a community with a real challenge, it's this one: they often have unique hardware, they're working on bespoke software that goes with that hardware, and standard tools like Docker are often difficult to use in that environment. Yet somehow this community is really running with this important part of their research. So a lot of it comes down to culture. It's not that you're a bad researcher for not being transparent; it's about taking steps, doing what's possible within the constraints of your own research environment. And as the HPC community is showing us, amazing things can happen even in very challenging communities with very challenging research problems.

We're about out of time, so one last question.

How do you feel about the relationship between reproducibility and openness? For example, if you need a supercomputer to reproduce the experiment, or if you need proprietary software?

That's a really great question. The first thing: I think there's value in being able to inspect code even if you don't intend to run it. It may be code for a supercomputer, where nobody wants to put the allocation time into re-running the experiments, but inspecting even key parts of the code can be really useful for understanding the implementation. So I think there's an argument there. Again, it's hard, because many of these code bases have been developed, not in an open source way, over decades, and you can't just flip a switch; but maybe it's an opportunity to refactor and look at the code, maybe it's a new way of doing research. That leads to the first part of your question, the relationship between openness and reproducibility: reproducibility is a small subset of openness, and my personal opinion
is that openness is too big a concept, and too ill-defined, to be useful to a scientist trying to understand how to actually, tomorrow, do a better job releasing their work. Reproducibility, especially on the computational side, is something scientists get, in my experience. Right? If I make this claim and I used this code, this code needs to go open, along with the data that I used. That's a much smaller problem than 'open': do I make all my data available, do I put up every version of every piece of software, what does openness even mean? It can be intimidating. So one of the things I've said in the past is that the notion of reproducibility can scope openness in a way that is actionable by scientists and institutions, because it's a little more clear exactly what you would do; not fully clear and solved, but more clear than the concept of openness. That seems to me the lower-hanging fruit, and I like the way it knits the pieces together: it knits the code and the data and the claims together, versus a data set just hanging out there, or a piece of code just sort of hanging there, which doesn't get at our vision of what it's all for, right?

Okay, great, thank you. So now I'm going to invite our speakers to come back up here for the panel. Let's start off our questions. A lot of people here have talked about data sets that involve human subject or patient data, and Victoria touched on the privacy point, so I wanted to ask if any of you could comment on the potential barriers this creates for sharing data, and how we fit those ethics into open science.

The thing I think we should aspire to is to maximize access and transparency given the constraints. I don't think anybody would seriously recommend breaking the law; that's beyond an ethical issue, that's a going-to-jail kind of issue. But you can think about things like this: if we believe the research can benefit from people accessing the data, maybe we can operationalize how people are authorized to access it. They would likely go through an IRB or something similar, but we can streamline that. We can think about who has data ownership, and about the agency of the subjects: there are many people, especially in the rare disease communities, who would actually like their data shared, but there isn't a way for them to really act on that agency. And there's a lot of ongoing research into, for example, how we can query private data while maintaining privacy; that's developing too. So we should open what we can, but think hard about the gray areas, so it's no longer a binary between open and closed.

Yes, and let me add that those are all perfectly good points, but another thing: a lot of what we're doing is working with people who want that support, and we work with them on it. And with things like the HuBMAP project, the Human BioMolecular Atlas, where they're collecting data anew, the goal is to get extremely open consenting from all the participants up front. That can do a lot for open science: get the consent up front that lets people do whatever they want with this data on the research front. Granted, this is easier here because these are, quote, normal patients,
Let me add that those are all perfectly good points, but the other thing is what we're doing in a lot of our projects; we work with a lot of people who want that support. With things like HuBMAP, the Human BioMolecular Atlas Program, where they're collecting tissue data, the goal is to get extremely open consent from all the participants up front. Consent up front that lets people do whatever they want with the data on the research front can do a lot for open science. Granted, this is easier here because these are, quote, normal patients, though that's still not to say we won't find things in their single-cell genomics; they think they're normal, but we may learn otherwise. I think that's one way to really avoid some of the legal challenges.

I just have a comment; it's not really a question. I've been encountering this issue a lot. I do a lot of applied health research and prospective longitudinal studies, and one of the studies I work with has followed 1,500 kids from the 1920s through their deaths. Some of them are publicly identified; most are not, and we have death certificates and causes of death collected. There are a couple of wrinkles with this. Of course I'm very interested in sharing data, but to share my data, people would get date-of-birth and date-of-death information as well as sensitive psychological information. I'll just highlight that one participant in my studies was in the Manhattan Project: very public people, for whom it would be consequential if that information got out. So I feel ethically I couldn't share the health information at the same time as any kind of psychological information, because that could lead not only to clear identification of people but also to clear identification of sensitive information. Some wrinkles on top of this: the feedback I usually get from psychology journals is that if I require people to go through my IRB, then it's not truly open access; but I also can't just give people open access to personal information like that. And on the flip side, the director of NIH has told me that he believes the right to privacy ends at death, so because all of my participants are dead, they no longer have a right to privacy over that kind of information. So I'm on the very cautious end of this, respectfully, because these people gave this information in the '50s and '60s and couldn't have imagined computers where this stuff would be online. I certainly get a lot of different perspectives from other scientists, and it's a hard line to walk. Anyway, just some food for thought: compared with data from, say, cell lines or computer modeling, these wrinkles become really complicated.

Awesome talks. I was trying to think of a question that might bind a lot of this together; maybe it's a really stupid question, I'm not sure. Is anyone thinking about this statistically? Multiple comparisons are a thing. Do open data sets expire, or are there ways of keeping track of usage that would make sense of this? There are a lot of repositories and a lot of places this data is going; is there any sense of implementing something like this going forward?

Good idea, but not one I've ever seen or heard of in the repository community. I have heard people in the stats community who do research on this wondering about it. They didn't use the word "expiration," but I like that characterization of the data. Suppose you run a test and get, say, a null result, and you happen to have run the millionth test on that data set, but you just don't know it; you think it's the first test run on it. What do you do in that situation? The points being surfaced in these comments show a very rich area of research, but I haven't seen it come up in the repository world in a structured way. This is still an area where statisticians are doing research just to understand how to resolve multiple comparisons with many tests, let alone apply it to a data set sitting in a repository as an institutionalized structure.
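To make the multiple-comparisons point concrete: the standard statistical tool for many tests on the same data is a correction such as the Benjamini-Hochberg procedure, which controls the false discovery rate. Here is a minimal sketch; nothing like this is wired into repositories today, which is exactly the gap the panelist describes, and the function name is ours for illustration.

```python
import numpy as np

def benjamini_hochberg(pvals, alpha=0.05):
    """Return a boolean mask of tests rejected at false discovery rate alpha."""
    p = np.asarray(pvals, dtype=float)
    m = p.size
    order = np.argsort(p)                      # indices of p-values, ascending
    thresholds = (np.arange(1, m + 1) / m) * alpha
    below = p[order] <= thresholds             # p_(k) <= (k/m) * alpha
    reject = np.zeros(m, dtype=bool)
    if below.any():
        k = np.max(np.nonzero(below)[0])       # largest rank meeting the bound
        reject[order[: k + 1]] = True          # reject all tests up to rank k
    return reject

# e.g. benjamini_hochberg([0.001, 0.02, 0.03, 0.4])
# -> array([ True,  True,  True, False])
```

The unsolved part the panel raises is organizational, not mathematical: a repository would have to track every test ever run against a data set for a correction like this to apply across independent users.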
I will say that sponsors are thinking about that. There's a discussion in Washington where they're saying we can't keep all scientific data; it's just too much. Where do we draw the line? Do we keep curated or preprocessed data and ditch the raw data? And when do we decide that a community data set is expired or obsolete? Those questions are very much of interest to us too, because we host community data sets on our resources. Everything is finite, so at some point you have to say you can't do that anymore. We had the same discussion for the HIVE last week. We would very much like to get all the raw data from the sequencing centers and tissue mapping centers and so on. It's not clear they'll give us all of it, but we'd like to have the raw data, because we know everyone will want to reprocess it in different ways as the field moves forward. Then the question is how long we keep it, and that's something we'll be wrestling with.

The question also came up for the new high-energy physics experiments: do they just keep all the data? They need to keep all the raw data. There is a pipeline that generates a lot of derived data in the process, and that they let expire after a certain number of years, but the experiments from the '70s and '80s, that's data that has to be stored. I just wanted to add that.

The answer, I would say, is different case by case. For the community data sets initiative that we have, the criteria are whether a data set is being used by the community and what constraints we have; we do have a limit on how big the data sets we host at the center can be.

Pretty much everything I was going to say has already been answered, but I will build on that last point about dating from last use. That's one we see a lot, especially in the UK: many policies at the institutions we work with say they are willing to delete data after ten years from last use. Those are the kinds of things we're being asked to actually measure for them, but of course that policy varies between institutions, countries, and disciplines. With our own restrictions, it's based on storage, so we're thinking about that, but at the same time we're inclined to just keep it, especially when it has already been paid for.

I just wanted to draw a subtle distinction between what you were asking and the conversation we've been having here. My impression is that you were interested in the statistical properties of these data sets and how much they can be generalized from, and I think something like what you're suggesting could inform those types of discussions. I don't think it does at this point, but it could. One thing I also wanted to mention: you could ask the same question about how long these sorts of artifacts hold up. When is it worth reproducing a result? Some results the community is very interested in, and people really want a lot of understanding and certainty around them; other results probably aren't worth the effort of a full reproduction. These aren't clear answers at this point. For cutting-edge questions like yours, we're pushing the envelope and learning the landscape of issues, so we don't have answers, but I think the trajectory, as a direction, is really clear.
Thank you very much. I wanted to ask about something else, but I do want to say: please keep the raw data, in cloud-based storage as well. There's a question of who makes the judgment about what's good and bad, and how we know what we're going to need later on, but you're right, the issues are coming out as they come.

One thing you should also think about is: when is de-identified data sufficient? I really love the idea of being proactive and getting good consent; that's absolutely great. But looking back, is just not providing the subject's name sufficiently de-identified? I know a lot of brain imaging data, for example, sometimes contains enough information to reconstruct part of the face, and so forth. That's another issue that perhaps needs to be looked at and prioritized, because we want to be able to share data that pertains to human subjects. So those two are very well known: the consenting part, and how to de-identify the data.

I was one of four authors who got together and co-edited a book, Privacy, Big Data, and the Public Good, and one of the things we did was solicit chapters, including one from Helen Nissenbaum and Solon Barocas. Essentially their chapter said: we don't have informed consent, and the reason we don't have it is that we don't know how these data are going to link to other data in the future. So we really can't tell people, in any reasonable way, what the risks are or aren't, because we don't know. Their argument was that we're doing an end run around informed consent, and we're better off being honest with people that things could be discovered in more ways than we understand yet. I would argue that if there's any complexity in the data, almost surely we'll be able to tease out what's going on in there. That's set against a landscape of changing norms around patients' and subjects' sense of what's appropriate to do with their data. But the idea of informed consent as it's traditionally understood is coming under a lot of challenge because of future data linking.

That's absolutely true, and I think the challenge is that as data sets get linked, they can be used in ways no one foresees. That's the real issue. The data I mentioned looks at just one biomedical apparatus, but when you start linking it with other data sets, things change. In our view, or my view, a real value of community data is pulling together disparate data sets to enable interdisciplinary research and ask new questions. That's really good, but it also exposes a lot more of these weaknesses in privacy.

Over here. Thank you so much for your talks. I love this idea of open data and open sharing, and I have a practical question about how we speak the same language. The way I index my data is going to be different from the way Josh indexes data, and when we send data to each other, which we have, we have to learn each other's languages. That makes it very challenging, especially with these huge data sets. There's also a practical limitation in reviewing all these papers: the expectations for each paper are so enormous, and there's so much data that I don't even know how to evaluate it when given the raw data and the source code. So how do we put this into practice?

It comes down to ontologies, to metadata about the data: what words do we use interchangeably across each other's data sets, what can be mapped and what can't? If one paper refers to age in bins of 0 to 30, 30 to 60, and above, and others use raw numbers, how do you put all of those on the same footing?
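To make that age-bin example concrete, here is a minimal sketch assuming two hypothetical studies: one that recorded exact ages and one that only recorded the coarse bins. Pooling them generally means coarsening the numeric data down to the shared bins; the column names and values are invented for illustration.

```python
import pandas as pd

# Shared coding: the coarsest scheme both studies can express.
bins = [0, 30, 60, 200]
labels = ["0-30", "30-60", "60+"]

# Study A recorded exact ages; coarsen them into the shared bins.
study_a = pd.DataFrame({"age": [22, 45, 71]})
study_a["age_bin"] = pd.cut(study_a["age"], bins=bins, labels=labels, right=False)

# Study B only recorded the bins in the first place.
study_b = pd.DataFrame({"age_bin": pd.Categorical(["0-30", "60+"], categories=labels)})

# Pooled data set on the common (coarser) footing.
pooled = pd.concat([study_a[["age_bin"]], study_b], ignore_index=True)
print(pooled)
```

The design point is that harmonization can only go down to the coarsest coding any contributing data set used, which is exactly why shared ontologies agreed on up front are so valuable.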
We got NSF funding for this: there was a conference here at CMU in May (myself, Paula, Keith, and Melanie, we're all here in the room, which is amazing) on applying machine learning techniques to scientific data discovery and re-use. There's a lot that can happen there. Google Dataset Search is one example; it's still early in the field, but if you want to do something now, we certainly expect Google to be there. Really, though, we're looking at how those techniques can help with discoverability, with automatic discovery of ontologies, to help us understand how to put data sets on the same footing so we can work with them together. That's where we really need the help, because communities have been developing community-centric data sets around different domains, different countries have done data collections, and now we have hundreds of different formats that still aren't talking to each other correctly. Somebody said 1,200 formats were supported in their fixtures; why do we have 1,200? We don't need that many. And that's not even getting to your question of reconciling variables; it's just getting the formats to talk. As a community, we shouldn't tolerate yet another format. We need things in open, machine-readable form, at as low a level as we can possibly get, and that's where we have work to do as we interact with the equipment that produces these formats: what do we tolerate, and what are our standards in terms of openness? But yeah, I thought you might have something to say about that, because that seems very real: the proprietary formats of instruments.

Yes, proprietary formats are a thing that happens, especially in scientific research; people create their own file formats, which is really fun. I think it goes back to this whole idea of a research object. There are two things I usually try to speak to when it comes to understanding other people's data. First, the research object: some of the best metadata you can have is in the form of a publication. I'm not trying to put the article on a pedestal, we know its limitations, but it does explain the methodology. And if you have the raw data, the cleaned data, and the actual data calls that led to the claim you're making in your article, that sheds a lot more light than just sharing the data alone. We've been thinking really hard about ways to link all of that information together and present it so it's as understandable as it can be. Second, there are larger community initiatives. The one I'm surprised hasn't been mentioned yet today is FAIR: making data findable, accessible, interoperable, and reusable, for both humans and machines. That circles back to this whole idea of using artificial intelligence and machine learning to understand the data you're being presented with. At a high level, it's starting to get there.

Great, thank you to our panelists. We have lunch out in the lobby. Sorry, I brought mine into the classroom; lunch is out in the lobby, but you're welcome to bring your food back in here if you need more space for eating
and conversation. We will reconvene for our third panel, which is about open tools, at 1:30. Thank you.