Thank you, Jess. I'll get started. This is quite a mouthful of a title when we put it together, so over the next 20 minutes, and I'm going to try to time this right, I'll break it down into its constituent parts, and hopefully at the end of it you'll see some of the directions we are headed in terms of how we think about reproducible analysis, and how that might actually help us figure out answers to questions that we care about as scientists.

The four areas I'm going to talk about today are knowledge generation and reproducible analysis; provenance and semantic web tools, which is an area I'm fairly interested in these days; a prototype platform we've been building based on these tools; and some of the challenges and future directions as we move forward in integrating a lot of the information that is coming out in the scientific literature. I'm not going to go deep into what knowledge is; I think that's a separate discussion. If you want to talk about it over a beer this evening, I'll be happy to.

I think many of us find ourselves in Calvin's footsteps: we look at something in the scientific literature and we have no idea where it came from. The Mayan calendar apparently just ran out of space on the rock; that's why it stopped at 2012. The apocalypse came and went. We find the same thing in the context of scientific information, and the same ability to get at why we see a result would help us answer questions like this: from journal articles published between 2008 and 2010, retrieve all brain volumes and ADOS scores of persons with autism spectrum disorder who are right-handed and under the age of 10. I've purposely highlighted some of those things in italics because they could be substituted by other, conceptually related components. These are the kinds of questions that we, as scientists, would like to ask Google someday, but we can't right now, and the question is: how can we get to a state where we can answer these questions through computational queries?

Well, the reason we can't do this yet is that in many cases we need a formal vocabulary. On that previous slide I said ADOS scores; for those of you who don't work with autism, ADOS scores mean nothing to you, and we need a way of communicating that to a computer, let alone a human being. So we need formal representation, and there isn't a formal vocabulary to describe all the entities, activities, and agents in a domain. Vocabulary creation is a very time-consuming process; many of the people here are much bigger experts at this than I am, and they'll all tell you that it takes a while to get a good vocabulary down.

Standardized provenance tracking tools — I need to know how something came about — are typically not integrated into existing infrastructure; that's something we also need to make available. Then there are the binary data formats we deal with: we get a new machine, we create a new format, and we put it out there. Those formats often come about in a very ad hoc manner, and they're not easily searchable without something else interpreting what's in them. The actual data that we deal with in neuroscience can vary from a single bit — yes or no — to terabytes of data, and we need a way to relate to that data. And in many research laboratories, after a paper is published, 90% of the data is deleted: "I don't have space."
Hopefully that's changing, but that's not quite the case yet. Finally, we run a lot of computational tools, everything from hacked-up MATLAB scripts all the way to actual software that's been produced in a proper way, but there are no standards for computational platforms that we follow; we just create things because we like the flexibility of our own systems. So things are hard.

But why should we do this? We're trying to integrate things across the different scales of neuroscience: spatial scales, temporal scales, different kinds of instruments, and various disorders. But the data sets that we play around with contain ad hoc metadata; there's no standardization. We process them with methods that are specific to the subdomain, and even when two groups use the same kind of algorithm, just because of the tools they use we don't necessarily know that they used the same kind of architecture. This limits integration. We also don't share relevant metadata: the fact that on that rock somebody didn't write on the back side, "Hey, I ran out of space, I could do more, but here we are," is something we don't know, because we don't have a lot of information that comes along with the data. And finally, right now we rely on a significant amount of human effort to answer the queries I listed earlier, and because humans make errors, anything that we read has to be taken with a grain of salt, sometimes a large grain of salt. We see that in routine retractions in various fields of science.

I'm going to focus mainly on brain imaging here, and one of the things I'm highlighting is a workflow. You've seen this slide many times over the past few conferences, but it highlights the complexity of what we need to capture, and the speakers following me will talk about different aspects of this. So let's go through this process step by step. First, in order to reproduce something, I need to reproduce the participants: the screening, the inclusion and exclusion criteria, the demographics that were used to match the participants, and the experimental setup, which involves the stimuli and the experimental control software. Then we move on to gathering the data: the MR scanner, pulse sequences, and reconstruction algorithms, and the cognitive or neuropsychological measures that we may or may not have captured. Then comes the data analysis: software tools, environments, quality assurance methods — which may or may not have been automated; it could also have been done by an individual, in which case we kind of need to dig into the brain of the person doing it — the analysis scripts, and finally the figures and statistics that come out. So there's a lot along the way.

But reproducibility is necessary, and David Donoho has been talking about reproducibility for a long time: error is ubiquitous in scientific computing, and despite all our efforts, error creeps in. The only way we tackle it is to capture things and keep improving our methods; since error is ubiquitous, we just have to do our best to capture every kind of information that we can.

So what are the aims of reproducible analysis?
We want to reproduce an analysis. Very often you hear of a great new finding, you have your own data, and you want to be able to see whether it works for your data. If it doesn't reproduce with the same analysis on your data, well, you might reconsider whether that analysis really supports the statement it made at the end of the day. We want to increase accuracy, verify that the analysis was consistent with the person's intentions, and review the analysis choices: there are too many parameters that we tweak in science, and not everybody has the time or the energy to go through the entire parameter space, so you want to know what sort of choices somebody made in doing the analysis.

Clarity of communication: I would argue that the methods sections of today's articles should be tossed out. They give no information, they could easily have been a paragraph somewhere else, and they're really not useful enough to replicate the analysis. They should be replaced with structured information, which is what I'm going to argue for over the next few minutes. Trustworthiness: you see a result, you want to appreciate it, you want to use it as a fact, but you want to make sure that it was done well. You also want to be able to take somebody's prior analysis and run your own analysis with the same methods, and that should be easy to do. And finally, we're always dealing with parameters, so you want to contextualize a result that you see. If somebody says people with obsessive-compulsive disorder have a smaller caudate volume, you want to take that and ask: is it across all cases? Is it a subgroup of obsessive-compulsive disorder? Is it because the processing was done with one particular piece of software? You want to be able to do all of that.

But capturing research today is fairly hard. We have the laboratory notebook, and luckily I think in most cases people have at least transitioned to Google Docs and Dropbox to share it; it's no longer on paper, because I don't know how many people go back and read that. The code exists on file systems, and some people have moved to GitHub and other source code repositories to store their code; that's all good. Data is getting into databases and archives, and there are various environments that people are working on to make analyses reproducible, in the sense that somebody else can rerun them. Finally, much more supplementary material is coming online, despite the fact that some journals have decided not to allow supplementary material; we are seeing that people are putting things out there. So that's all good.

But that brings us to provenance and semantic web tools. We can do this better; we can get to a place where we have structured information quite nicely. The central theme in all of this is capturing information.
How do we encode information? I want to give you a preview — I'll show you a little bit of a prototype platform later — but first, what should this platform look like? I consider it something that has to be decentralized: many people can put data out there, many people can put computational services out there, and all of these need to be linked up, much like the internet and the web services that we see today. The encoded information has to be in a standardized and machine-accessible form; it shouldn't have to go through a human filter to figure out what is in it. The other thing that I think is important, going back to that first slide, is that we have to start viewing data from a provenance perspective: how did it get there? I think we want to do this; we just don't have the information currently in many of our articles to do it. Finally, with the amount of data being generated, we want to integrate, and therefore to discover and share, data. We need ways to query resources to get at information, to get at data that allows us to do something with it. We also want to be able to immediately retest an algorithm, revalidate results, or test a new hypothesis on new data. All of this should become as automatic as a Google search is today, but we're not quite there yet.

So, some definitions. I've used the term provenance; the W3C formally defines it, and I'm just borrowing their definition here: information about entities, activities, and people involved in producing a piece of data or thing, which can be used to form assessments about its quality, reliability, or trustworthiness. What is a data model? This is something we'll need if we have to encode provenance: a data model is an abstract conceptual formulation of information that explicitly determines the structure of data and allows software and people to communicate and interpret data precisely. And finally, the provenance data model, PROV-DM, is a conceptual data model that forms the basis for the W3C PROV family of specifications. It's a generic basis that captures relationships associated with the creation and modification of entities by activities and agents.

I'm going to skip over this slide, but I've left some of the definitions in there so you can go through and look at the different components of PROV-DM. I'll highlight some of them in the context of this diagram. What we are really looking at are entities, activities, and agents, and relations between them. This is not something I expect you to digest right now — nor this one, which expands it even further — but the idea is that we have a formal way of representing information, and relations between pieces of information, and we capture it at a simple level. Agents, for example, are things like persons, organizations, or software agents, including machines, so we can throw in the MR scanner right here as an agent. Under entities, at least in brain imaging, we have things like our data files, as well as various pieces of information like demographic data and coordinates that come out of an analysis. Activities are processes, so that could be the activity of scanning a person, or the process of analyzing data. All of this is formalized in the provenance model defined by the W3C.
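To make that concrete for readers, here is a minimal sketch using the prov Python library (mentioned later in the talk) of how a scan file, a scanning activity, and a scanner agent might be encoded; the namespace and identifiers are made up for illustration, not part of any actual data model described here.

```python
from prov.model import ProvDocument

# Build a small PROV document: one entity, one activity, one agent.
doc = ProvDocument()
doc.add_namespace('ex', 'http://example.org/neuro/')  # hypothetical namespace

scanner = doc.agent('ex:mr-scanner-1', {'prov:type': 'prov:SoftwareAgent'})
scan = doc.activity('ex:scan-session-001')
image = doc.entity('ex:anatomical-image-001.nii.gz')

# Relate them: the image was generated by the scan, which was run by the scanner.
doc.wasGeneratedBy(image, scan)
doc.wasAssociatedWith(scan, scanner)

# Serialize to PROV-JSON (the library also supports PROV-N and RDF output).
print(doc.serialize(indent=2))
```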
I think one of the key things here is that provenance is not an afterthought. In the models we had before, provenance was kind of tacked on; here, it exists directly with the data and the metadata. It's a formal, technology-agnostic representation of machine-accessible structured information, and I think that's important, because as data gets bigger we're going to see fewer Mechanical Turk-like situations of humans having to deal with the data; we're going to see machines doing much of our querying for us. Federated queries use SPARQL when the information is represented as RDF; it's a W3C recommendation that simplifies application development and allows integration with other future services.

Okay, semantic web tools. I'm going to skip this for the sake of time, but it's a framework that allows us to describe data in a consistent manner, and it allows us to represent data as relations in a graph, so we're basically building up nodes in a graph. This is simply the notion of somebody talking about a person whose full name is Eric Miller, whose mailbox is this email address, and whose personal title happens to be Doctor, represented as a graph. That's what we are building, and the more nodes and edges we connect, the more richly connected a set of information we create. SPARQL is a protocol and an RDF query language that allows us to query such information, and one of its key features is that it allows us to write unambiguous queries; things like federation are built into SPARQL by default, so you can create these RDF stores anywhere and query across them. Doing that with a standard MySQL database is kind of hard. I'll get back to SPARQL examples later, but let me talk about a prototype platform that we've been developing: a standardized data model, provenance tracking, decentralized content creation and storage, and federated query.

The standardized data model we've developed is based on PROV-DM and hence derives from the provenance ontology. We encode information in a structured manner, we use a consistent vocabulary, and we are trying to develop metadata standards via domain-specific object models. A few pieces here. Terms: a lexicon of all things brain imaging. We don't have it yet, but we're getting there; right now we've encoded a fair number of DICOM terms, some software-specific terms, some statistics terms, and some paradigm terms. Object models: many of us have various directory structures with structured information in them, and we can represent that in a more formal way. We can also take CSV and tab-delimited files and encode those. Brain imaging file formats are another case where we can get metadata out of the files and represent it in a more standard manner. And then we put provenance into all of this.

So this is the general picture: a researcher sitting in one place can query different things in a federated manner, and each person can set up their own triple store to hold different kinds of data. In the future, some of that could be integrated into things like the INCF Dataspace. Currently, of course, provenance tracking is often done by hand — it makes for a nice PhD Comics strip about the different ways in which we try to track provenance, and this is definitely one of them — and maybe we should just stick with it. But we can make things better.
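As a rough illustration of the Eric Miller graph and the kind of SPARQL query you can run over it, here is a small sketch using the rdflib Python library; the subject URI follows the classic W3C example, and the vocabulary choice is mine for the example, not part of the platform described in the talk.

```python
from rdflib import Graph, Literal, Namespace, URIRef

# The PIM "contact" vocabulary used in the W3C RDF primer example.
CONTACT = Namespace("http://www.w3.org/2000/10/swap/pim/contact#")

g = Graph()
person = URIRef("http://www.w3.org/People/EM/contact#me")

# Three statements (triples) about one person, forming a tiny graph.
g.add((person, CONTACT.fullName, Literal("Eric Miller")))
g.add((person, CONTACT.mailbox, URIRef("mailto:em@example.org")))
g.add((person, CONTACT.personalTitle, Literal("Dr.")))

# A small SPARQL query over the same graph.
query = """
PREFIX contact: <http://www.w3.org/2000/10/swap/pim/contact#>
SELECT ?name ?mbox
WHERE { ?p contact:fullName ?name ; contact:mailbox ?mbox . }
"""
for row in g.query(query):
    print(row.name, row.mbox)
```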
So there are tools we could use to track it: something like the lab notebook, or Sumatra, which Andrew Davison — who is around here — has been building, or Synapse, a web resource which allows you to track provenance, and then there's the PROV Python library with RDF extensions, which is what we've been using to build some of our services. The other part of all of this is integrating these with workflow tools. If you cannot instrument your tools to track provenance, it's not going to happen: nobody is going to go back and reconstruct it afterwards, and if they do, it's going to be fairly error-prone. So tools have to be integrated with these things. I've been building a set of tools for brain imaging called the Nipype workflow environment, and it has provenance tracking embedded in it.

We went down this path by first logging to a file, and that broke down when we went to a cluster, because shared writing over NFS broke down. We then decided to take every function and write out a reStructuredText output, and that broke down because machines can't quite read that: yes, they can render it, but they can't read it in a formal manner. Then we exported the script as Python; maybe that's a formal language, but now you need to know Python to read it. We then went to IPython notebooks, which at least captured the flow of the analysis, but they didn't tell me about all the branches that I tried but didn't record. So finally, using PROV as an integrated library, we can actually track all the branches of analysis that we went through — not just the final one that I may have reported on; I also know all the things that had been tried.

Decentralized content creation and storage kind of comes out of this: we can create and expose metadata where we do the analysis, not necessarily in a centralized space, although registering the data set with a central authority can't hurt, and we can use robots, much like Google has done to mine the web, to mine these data sets.

As for storage of provenance, this is just from one analysis, to give a sense of how these triples and statements can be stored and what kind of information we get from brain imaging. We get things like runtime dependencies, inputs and outputs, and MD5 hashes — all the good things that you would actually want to capture in order to reproduce an analysis. In terms of a workflow, we get relations between processes and links to shared input and output entities. I'll come back to this in a second with an example application.
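As a small sketch of the kind of file-level record this implies — not the actual Nipype implementation — here is one way you might attach an MD5 hash and basic runtime information to an input file and an analysis step as PROV attributes, again with made-up identifiers and a hypothetical input file name.

```python
import hashlib
import platform
from prov.model import ProvDocument

def md5sum(path):
    """Compute the MD5 hash of a file, reading it in chunks."""
    h = hashlib.md5()
    with open(path, 'rb') as f:
        for chunk in iter(lambda: f.read(8192), b''):
            h.update(chunk)
    return h.hexdigest()

doc = ProvDocument()
doc.add_namespace('ex', 'http://example.org/neuro/')  # hypothetical namespace

# Record an input file with its hash, and the environment it was used in.
doc.entity('ex:T1.nii.gz', {'ex:md5': md5sum('T1.nii.gz')})
doc.activity('ex:segmentation-run-01',
             other_attributes={'ex:hostname': platform.node(),
                               'ex:python': platform.python_version()})
```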
So what can we do here? We can write a SPARQL query. I'm not expecting you to parse it, but what this query is effectively giving me is the right amygdala volume from a FreeSurfer analysis — FreeSurfer being a package used in brain imaging. I could feed the result of that query straight into a D3 rendering, which gave me a histogram, and it was very easy to see that the right amygdala volume was identical in six of my 973 subjects — and it actually happens to be identical in many, many pairs of subjects. For a volume to be identical across subjects is somewhat fishy, so I can go back and look at things, but this lets me get at that information quickly, just because it was encoded in a structured form, without having to go and redefine my model.

Another, similar thing: I have two data sets, one stored on a computer at the INCF and one sitting here at MIT, and I can make a federated query that joins across the resources on those computers and pipe that into a web application, which shows me group information; here I'm plotting the age of my subjects against verbal IQ. Web applications like this are much easier to develop these days, and we can get very lightweight applications dealing with information derived from this provenance data model.

Okay, so what are the challenges and future directions? First, let's see what all of this has bought us. We have a common vocabulary for communication, if we can implement it well and get it adopted. It provides rich structured information, including provenance, with domain-specific object models embedded in the common structure, and data and content can be repurposed: we don't have to write data out to serve just one purpose; we can use the same data for different applications. Somebody could say, "I have a cluster, and I can just use the execution time of these functions to decide which node of the cluster a function should be sent to."

But where are we headed? Well, we want to formalize the domain object models as extensions of PROV. This is more on the formal ontology side of things; we don't have to get there to be practically useful, but it will make things more useful. We need a common vocabulary across software: there are many people developing things, and we need them to standardize some of these terms. We need to relate this to the rest of the linked data web — to publications, authors, and grants — reuse existing vocabularies and ontologies, and integrate with existing databases; we don't want to throw away all the content that has been generated over the last few centuries. We need to instrument applications and build them with provenance tracking embedded in them, and we really need to publish more structured information; any time we have it, we need to get it out there.
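For readers wondering what a query like the right-amygdala one above might look like, here is a rough sketch against a hypothetical SPARQL endpoint using the SPARQLWrapper Python package; the endpoint URL and the fs: predicate names are invented for illustration and are not the actual vocabulary used in the talk.

```python
from SPARQLWrapper import SPARQLWrapper, JSON

# Hypothetical endpoint and predicates, purely for illustration.
endpoint = SPARQLWrapper("http://example.org/sparql")
endpoint.setQuery("""
PREFIX fs: <http://example.org/freesurfer/>
SELECT ?subject ?volume
WHERE {
  ?stat fs:structure "Right-Amygdala" ;
        fs:volume    ?volume ;
        fs:subject   ?subject .
}
""")
endpoint.setReturnFormat(JSON)

results = endpoint.query().convert()
for row in results["results"]["bindings"]:
    print(row["subject"]["value"], row["volume"]["value"])
```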
Reproducible analysis on a VM of an existing pipeline is something that's still very much a work in progress, but we'll get there once we have standardized environments for many of these tools. There's a lightweight, decentralized architecture that I've proposed that's out there; we can talk about it later.

I want to thank some of the folks who have put in a lot of work. This includes the data sharing task force of the International Neuroinformatics Coordinating Facility and the BIRN derived data working group; the Neuroimaging in Python community, which has provided the Nipype workflow development tools; the W3C PROV working group, which has been a tremendous help in refining some of the standards and in working with us on coordinating and improving the Python provenance library; and finally, support from the NIH and the INCF.

I want to leave you with a picture of the future. This is a picture from a group at the University of Southampton, and it's the idea that all information, especially research output, should be connected. We need to get there, and to get there we need to follow standards and adopt them as a community. We cannot do it alone; we cannot have ten developers do it for the rest of the community while we carry on with our old practices. The developers can help instrument tools, but we need to think about how we put this together in the context of our publications and research output. So I'll leave it there. Thank you.

[Audience question] Thank you, and congratulations: "decentralized data" — that's the first time I've heard that phrase here. Decentralized data is of course important, but how do you centralize the queries? Because if someone needs to find some data, they need to know where it is.

I'll use this — can people hear me okay? That's where the metadata part comes in. What we are saying is that decentralized data allows the actual, physically large-scale data to be stored in different places. We're also saying that the metadata, the information about that data, can also be decentralized, but you can have registries that say, "These are the places that have registered with you," much like internet addresses are registered through DNS servers, so you can route things through. So yes, you need a common registry to be able to query, but the fact that people can put up their own data services means you don't have to have a central platform for this.

[Audience question] Hello, my name is Ahmed, and I would like to ask two questions. One is regarding the technology you're using — it's just a small opinion — what's new in it? What's the significant part of your presentation with respect to the technology you're presenting? Because to me it's nothing new. Second, you've asked the community to contribute. Normally, what you're talking about regarding putting the publications and these things online is not something one person can do; it's the organizations who are publishing the data, and they are organizing it. Don't you think they are already structuring the information in a way that at least you can find it? It's not completely unstructured now; we are publishing the text. So what's your contribution in this?
All right. In terms of the tools I'm using: nothing new, really. It's just that some of the tools have evolved over time — SPARQL 1.1 only got officially accepted as a recommendation in March of 2013, so the ability to do some of these things has evolved, and the concept of RDF was initially established in 1997 — so there is nothing new from the tool perspective. In terms of the community perspective, what we are hoping is this: currently, people still share a lot of data as zipped directories. It still requires human intervention to go in, match up a PDF with what's inside the directory, and find out what things are, and all of that can be resolved purely through structured metadata that tells you what the things in that zipped directory structure are. That's what we've done: we've created these object models that say, here you have a FreeSurfer directory with these hierarchies of information, and now there's a SPARQL store holding all of the metadata about it, so I can run a SPARQL query to get at any single file and its properties in that directory structure. That is what is still missing in terms of community distribution of information: many of the data sets distributed today have PDFs associated with them that describe how things are laid out, and to me that's not structured information — you need a human in there to read that PDF to figure out what the layout is.