to write tests that cover all possible circumstances, all the different things that can go wrong. It's all about designing for what you can imagine and what you can think about, which is no different in the experimental world from having a good set of controls: knowing that your instrument is behaving the way it should, running the right kinds of calibration processes, putting the subject in the right place in the MRI machine, the things that ensure that everything is going to work for you.

So in a sense it's interesting, from my perspective, that as an experimental scientist I always looked at computational science and thought: you guys have got it easy. You build this software and you know it's going to work every time. You know that the data is reliable, that everything is in the right place, and that everything is going to continue to work in exactly the same way every time you run it. That's what computers do, right? And then I met these computational scientists who kept saying to me: you experimental scientists really do this properly. Your standards of publication are all about reproducibility, all about making sure that someone else can come along and repeat your experiment.

And of course many of you will have seen the paper in Nature a couple of months ago, where a group of people from Amgen took primary research results that were of interest to them as a drug development company. Their first step in taking these potential drug targets down the road of development was, of course, to check that they could reproduce those experiments, published in those kind of middling, fairly reliable journals like Nature and Science. They could replicate the experiments in about 12% of cases.
This is slightly problematic. And of course many of you will know some of the horror stories from computational science: things that cannot be replicated, code that cannot be recovered, code that cannot be run. So it's very interesting: we've got a situation where the grass appears to be greener on both sides of the fence. What I want to suggest is essentially that both camps can draw inspiration from the best practice on the other side of that fence.

When we talk about best practice in experimental science, we talk very much about reproducibility: providing the detail of exactly what reagents were used, how the experiment was run, recording the conditions, telling someone about the calibration. We talk a lot about the documentation, about how detailed it needs to be, and it's always more detailed than you think, to make it possible for someone to have even a fair chance of actually taking your experimental results and repeating them. And those are things that, at least anecdotally, are not done terribly well in the majority of the computational sciences. People don't tell you a lot about the dependencies.
They very often don't provide the code, certainly very often don't provide the source code that would enable you to reproduce things, and probably the less said about the documentation the better.

On the other hand, we have these truly amazing tools in computational science. We have things like continuous integration: the notion that every time you check in a piece of code, every time you add a piece of code to a system, to a whole body of work, perhaps to the work of a whole group, you don't just run the tests on that code and make sure that they pass. You recompile the entire set of code with all of its dependencies. You identify all the places where it breaks; you find all of the things that could go wrong, that this person who's added another two lines, or maybe taken away ten, has potentially broken.

Imagine if we could do that with the research literature. Imagine, as an editor or a referee, you have this paper come in. You're looking at it, and you're not entirely convinced. You're a bit worried, because this is perhaps going to upset the apple cart, or maybe it confirms things you don't want confirmed. I can't imagine that happening, of course. But one can imagine that it could be possible. If you could just run the paper, compile the paper against the rest of the world's literature, against the data from the rest of the world's literature, and see whether, if this were true, what would that mean about the rest of the literature? What would we have to reassess?
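For anyone less familiar with the software side, the continuous-integration loop described above can be sketched in a few lines of Python. Everything here, the toy functions, the test registry, the `integrate()` helper, is hypothetical and for illustration only; real CI services rebuild and retest an entire project on every commit, but the essential logic is this:

```python
# Minimal sketch of continuous integration: a change to any one
# component triggers the WHOLE project's tests, so breakage in
# downstream code is caught immediately. All names are invented.

def parse_length(text):
    """Toy 'library' function: parse a length like '12m' into metres."""
    return float(text.rstrip("m"))

def report_total(lengths):
    """Toy 'downstream' code that depends on parse_length."""
    return sum(parse_length(t) for t in lengths)

# The whole project's tests, not just the ones for the code you touched.
TESTS = {
    "parse_length": lambda: abs(parse_length("12m") - 12.0) < 1e-9,
    "report_total": lambda: abs(report_total(["1m", "2m"]) - 3.0) < 1e-9,
}

def integrate():
    """Run every test in the project and report everything that broke."""
    failures = []
    for name, test in TESTS.items():
        try:
            ok = test()
        except Exception:
            ok = False
        if not ok:
            failures.append(name)
    return failures

if __name__ == "__main__":
    broken = integrate()
    print("all green" if not broken else f"broken: {broken}")
```

The point of running the whole table, rather than only the tests for the code you changed, is that a change to `parse_length` that quietly altered its units would fail the `report_total` test as well, flagging downstream breakage the author never thought about.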
What would we have to look at? And imagine you had that capability from the perspective of a researcher coming into a new space, trying to understand which papers you need to trust and which ones you need to steer clear of, whether in terms of the methodology, the data, or the conclusions.

So I think there are lots of these parallels, and I could draw this out by talking about lots and lots of different examples: source code in Subversion, continuous integration is a great example; on the experimental side, things like running controls properly and good documentation. But also things about how we effectively share results: how, in the end, it often becomes important to bring someone into a lab to teach them how to actually run a process, in a way that's very similar to the reasons why pair programming is an effective mechanism for identifying issues in code.

And what I want to suggest is that these are not analogies. They're not just similarities. They are expressions of a deeper underlying sameness: that both of these are, today, fundamentally information businesses. They are systems in which we take data, process it, and draw conclusions from it; that in both cases systems of provenance, of labelling and tagging and integration, are critical; and that in both areas of science we don't do a terribly good job of that.

That raises some interesting questions. If these are information systems, what platform are we building on? Are those platforms stable? Are they reliable? What are the dependencies of the peer review process? What are the dependencies of code that runs only in Fortran 77? What are the dependencies of code that's written in Python, although it's actually written in Fortran? How many of us have come across that? I saw a wonderful example: I couldn't understand this code at all.
I figured out eventually that it was someone writing Fortran in Python, which is what happens when you take a Fortran programmer who's been writing Fortran for 30 years and tell them they have to use Python. They think about the problem in a different way, and it creates all sorts of interesting dependencies. So there are lots of different things to mine there.

But what I want to pick up on, and what I think will be picked up by the other speakers, is to go back to some of the things that Jon Udell focuses on when he talks about web thinking, because Jon Udell's web thinking, and computational thinking, and all of these things are effectively expressions, again, of the same thing: we are all in the business of managing data and information, and the processing of that data. And in this context Jon says a couple of things that I think are really important. Probably the most important, and the thing which we're really failing to translate into the world of science, is to copy by reference, not by value.

That sounds like an abstruse kind of thing. What does it matter whether I go through some torturous process of linking this cell in this spreadsheet back to that other cell in that other spreadsheet, building a workflow, versus the easy thing, which is just copying and pasting? The answer is that we lose the provenance, and in losing the provenance we lose the opportunities for integration. So we talk about things that sound dull: data citation; URLs for objects, whether they be people or papers or physical reagents or data; the process of linking them up; and really dull things like W3C standards and ontologies and all of the rest of it.
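To make the copy-by-reference point concrete, here is a minimal sketch in Python. The dataset, the DOI-style identifier, and the `Ref` class are all invented for illustration; in a real system the address would be a resolvable URL or DOI:

```python
# Copy by value: paste the number, lose where it came from.
pasted = 42.7  # a figure copied out of someone's spreadsheet; no provenance

# Copy by reference: keep an address for the source and resolve on demand.
# 'datasets' stands in for anything addressable: a URL, a DOI, a cell reference.
datasets = {
    "doi:10.0000/example-dataset": {"mean_signal": 42.7},
}

class Ref:
    """A value known only by the address of its source."""
    def __init__(self, source, field):
        self.source = source  # where it came from: the provenance
        self.field = field

    def resolve(self):
        return datasets[self.source][self.field]

linked = Ref("doi:10.0000/example-dataset", "mean_signal")

# Both give the same number today...
assert linked.resolve() == pasted

# ...but only the reference can answer "where did this come from?",
# and only the reference follows a correction at the source.
datasets["doi:10.0000/example-dataset"]["mean_signal"] = 43.1
print(linked.resolve())  # follows the correction
print(pasted)            # silently stale
```

Both variables hold the same number at first, but only `linked` can say where the number came from, and only `linked` updates when the source is corrected. Copying and pasting severs exactly that link, which is the integration opportunity we lose.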
I know I'm preaching to the choir here. But the bottom line is that all of these things are links and addresses, and in linking and addressing things, in copying by reference, we start the slow process of working towards a world where continuous integration for the scientific literature will be an assumed part of what we do. It will be the basis for how we think about connecting and communicating our research before we think about submitting it to a journal or sharing it with anyone else. And it will be the basis for how we assure ourselves that what we've done is up to standard, because at the end of the day, the person you mostly collaborate with is yourself, in six months' time, when you've forgotten what you did. And so the value we can create for ourselves in doing these things properly is enormous.

And it's always hard. It's always the thing that you will do tomorrow: I'll do the documentation tomorrow, I'll sort the tests out tomorrow, I'll put the data in the database tomorrow, because there's always a more important thing to do. But usually that's replying to the referees on the paper that you've already lost the data for. So if we can focus on this notion of ourselves as managing information, with provenance and linking and citation as the core of that process, then yes, we're going to have to put in some more work. But in six months' time you're going to appreciate that work, and in five years' time we're going to have a much better system for managing the information which is already overwhelming us. And I'll stop there.