So, the next keynote presentation is by Fernando Perez on open science and reproducible research with Jupyter.

All right. Well, folks, thank you so much for the invitation to speak here. It's a pleasure. I want to start with a little bit of framing, and this framing will probably be no surprise to many of you in the audience, but I think it's useful to set out my own mindset around why I followed the path of getting involved with building these kinds of tools. There are a variety of aspects and perspectives to it. One is an ethical one: for me, working openly is a way of trying to build fairer access to scientific data and scientific research. I come originally from Colombia. When I came to the U.S. to do my PhD in physics, I did have access, obviously, at an R1 institution in the U.S., to proprietary software. But that was the kind of software that I knew would be difficult for my former mentors back in Colombia to work with. And if I wanted to share my work with them, treating them as equal peers, it would be much harder to do so if it depended on a highly proprietary pipeline. There's also a human and social aspect, which is the fact that working openly makes it much easier to collaborate with people and to build relationships in new communities, in ways that are much harder when you're working in a highly proprietary model. And for me that has been true to the point where, by building these open communities, I've also built amazing personal friendships in addition to productive professional collaborations. There is an epistemological angle to it as well: if the mission of scientific research is, in a sense, to pry open the black box of nature and understand how it works, I find it very hard to justify doing that with tools which I am legally prevented, under threat of a lawsuit, from opening and understanding. I believe those things are just fundamentally incompatible. And finally, it was just that I'm a geek, Python was a cool language, and I really got sucked into building things with it.

All of this began for me back in 2001, when I was a graduate student trying to build a better interactive workflow for the kind of analysis processes that are familiar to everyone here but that are a little different from those used in industry: we are effectively trying to iteratively explore a problem by running a little bit of code, looking at results, plots, etc., and gradually we refine our analysis. That is in contrast to building a library to solve a problem that has been well specified in advance, as is maybe a bit more the case in an industrial software engineering setting. And this began as a student procrastinating, telling his advisor, "I'll be back tomorrow, this is just going to take an afternoon." It's been 17, 18 years now and I'm still doing it. So never believe me when I give you a time estimate.

I do want to clarify, though, that from now on everything I'm going to describe is not my work alone; the reason we've been able to accomplish some of these things is that we have an incredible team of people who work on it, and all the credit goes to them. And I do want to flag the fact that this is a team that has folks from industry, folks from academia, folks from government labs, and folks who have joined us as volunteers in the open community. So it's an incredibly diverse and interesting collaboration.
And this is the kind of thing that has also been a challenge to build while trying to stay within the confines of academia. That's a longer conversation that I don't have time for in this talk, but I'm happy to have it over coffee or in the breaks with anyone who's interested.

Just as a quick show of hands, who here has used the IPython/Jupyter notebook tools? Okay, so a good chunk of you, but not all. For those of you who are not familiar with it, I'm going to give you a very, very brief CliffsNotes version. You can think of it, to first approximation, as Google Docs with a brain, if you will. It's a web-accessed environment that allows you to create documents. Those documents can contain text, formatted text, and mathematical equations, but they also include blocks of code. Those blocks of code are executed in line, and the results of that execution are stored in the same document. So you can build interactive documents that combine narrative, human natural-language input with computational code and the results of those computations, all within one live document. And it's accessed through a web browser, which means you can use it locally or on whatever remote resource you're accessing (supercomputer, cloud, cluster, et cetera) using the exact same interface and the exact same UI.

Very briefly, I want to give an analogy, because I think it's useful, especially in a context like INCF, where people care a lot about the notion of standards and protocols and ways of building open tools that actually facilitate collaboration rather than just one team doing something: what are the core, fundamental ideas in the Jupyter machinery? Those ideas can perhaps be seen through the analogy of how the web operates. This is a highly simplified view of the web, but we can think of two key pieces of its architecture. First, the HTTP(S) family of protocols specifies how two actors communicate and transfer data between them. That interaction is typically between a web server and a web browser, but it doesn't have to be, and that's an important point: by defining a protocol, you can build services over HTTP, reusing that protocol, where the client is not necessarily a web browser with a human reading a web page in front of it but maybe something completely different, and you're still reusing the same infrastructure and architecture. Second, when you want to encode that traffic in a manner that is useful for human consumption and representation, HTML fulfills that purpose, and you can represent pages and documents in that format. This is highly simplified (this is not a network architecture talk), but I think it's useful for our purposes, because we have a similar parallel in Jupyter. On the one hand, we have a well-specified protocol, an interactive computing protocol. The details of it don't matter; there's a bunch of network connections and traffic. The important points are that the data of that traffic is encoded as JSON packets, and that the underlying transport is done with a specific networking library called ZeroMQ, with the protocol encoding the communication patterns that made sense for representing the tasks of interactive computing. By which I mean: we sat down and actually looked at what humans do when they execute code interactively in an exploratory manner. What do you do? You type code, you get results, you get graphics, you communicate queries over your objects.
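To make the idea of a message-based protocol concrete, here is a simplified sketch of what one of those actions looks like on the wire: an execute_request, written as a Python dictionary. The real Jupyter messaging specification has additional fields, message signatures, and multi-part ZeroMQ framing, so treat this as illustrative rather than the exact format.

    # Simplified sketch of a Jupyter "execute_request" message (illustrative only;
    # the real spec adds HMAC signatures and multi-part ZeroMQ framing).
    execute_request = {
        "header": {
            "msg_id": "d3a1...",            # unique id for this message
            "msg_type": "execute_request",  # which action this message represents
            "session": "9f2b...",           # id of the client session
            "username": "fernando",
            "version": "5.3",               # messaging protocol version
        },
        "parent_header": {},                # on replies, links the response to its request
        "metadata": {},
        "content": {
            "code": "2 + 2",                # the code the human typed
            "silent": False,
            "store_history": True,
        },
    }
    # The kernel answers with an "execute_reply" plus output messages such as
    # "stream" (printed text) or "display_data" (rich results keyed by MIME type).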
We literally sat down and listed what happens, and encoded all of those things as messages and communication patterns. ZeroMQ is a networking library written in C++; it has a very liberal license, it's very simple to compile, and it's very fast. By building a protocol on top of these two ideas, it meant that pretty much any programming language you could think of, even though this project was born in the Python world, could use these ideas, because ZeroMQ has bindings for just about anything: you can bind ZeroMQ to just about any language you want and use it from there. And JSON is also a standard that, even though it comes from the JavaScript world, every language has bindings for these days. So that's how we represent the communication between a human typing code and something executing that code, and how that transfer of data should happen. And if that transfer of data is meant to represent a session that gets encoded in a document, then we also built a document format, the notebook document format, which is a capture of those JSON packets. So the document format is itself a JSON data structure, which has many benefits and some important drawbacks that we acknowledge and are working on.

This protocol is effectively a web-age capture and formalization of the very basic ideas of interactive computing at the REPL, at the terminal if you will, at the read-eval-print loop. But what we did was try to represent any possible output that modern computational processes could produce. So it's not just printed output text: your computations may produce images, and those are a first-class citizen. You may produce objects that are mathematical objects, and those are represented as first-class objects. You may produce HTML or JavaScript, things that web browsers are good at, so those are also first-class citizens, and even live, interactive computation: the idea that the output of a computation could be something that maintains live interactive controls, where as you operate it in the browser, calls are made back to a computational engine and new data is computed. This is the kind of thing we wanted to make very, very easy for working scientists, because working scientists should not, for the most part (unless you really want to become a software engineer), be in the business of building complex graphical software interfaces. That's typically a very time-consuming, complex, and difficult job; you need a really good software engineer to do it well. But it's perfectly sensible for a working scientist to say, "I would like to explore what happens as I drag this particular parameter," and they would like to see that quickly and easily, with minimal amounts of code. So what we've done is write a library called ipywidgets as part of all of this, which in its simplest form gives you a single line of code: you add that one line to a function and it turns the function into a live, interactive computational object, so that you can explore parameters while maintaining good code-based tracking of what that mini GUI is doing, for the purposes of reproducibility. And I'm happy to talk more about that offline if anyone is interested.
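A minimal sketch of that one-line pattern, using the real ipywidgets interact function; the example function and parameter range here are just illustrative.

    # One line turns an ordinary function into a live, slider-controlled exploration.
    import numpy as np
    import matplotlib.pyplot as plt
    from ipywidgets import interact

    def plot_wave(frequency=1.0):
        # Recompute and redraw whenever the slider moves.
        x = np.linspace(0, 2 * np.pi, 500)
        plt.plot(x, np.sin(frequency * x))
        plt.show()

    # The one line: a slider for `frequency` appears in the notebook output,
    # and the code driving the mini GUI stays recorded in the document.
    interact(plot_wave, frequency=(0.5, 10.0, 0.5))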
The point is, we encoded in that protocol all the generic actions we thought would be relevant, and we took the time to document and present it as a formal, independently specified protocol that could, yes, be used for Python, because this was born out of the IPython project, but that could potentially be used by other languages. Early in the process we invited the team building the Julia programming language to spend a week with us while the IPython team was working on our code, to try to implement those same ideas in Julia, and they built a Julia kernel that lets you do this exact same process with no Python in the picture and a Julia backend instead. By the way, I want to congratulate the Julia folks, because yesterday they released 1.0 of the language. Does anybody here in the room use Julia? Okay, a couple of you. It's a great language. If you haven't taken a look at Julia, as much of a Python fan as I am, I think this is a team doing an amazing job of rethinking high-level, high-productivity scientific programming languages: basically taking lessons from Matlab, from Ruby, from Python, taking the very best of the last 10 to 15 years of compiler and type-inference research, and building a really amazing language on top of a sustainable community that develops tools around it. I don't have any specific project on Julia right now, but I want to give them a shout-out because yesterday they reached the important milestone of 1.0 stability for the language.

Anyway, after Julia had a functioning kernel, we had one of the IPython devs at the time work with other folks to build R support, so that you could have R in the back end instead of Python and use the exact same process with R. Eventually this really became a standardization of the idea, one that has been widely adopted by the community, and today there are over 100 different programming languages supported in the back end. The advantage of this is that the only thing a community has to do, if they want support for a new programming language, is write an implementation of one tool, this thing called a kernel, and after that everything else in the Jupyter ecosystem is available to them. All these languages are equally first-class citizens; there's nothing special about Python anymore. That's what led to IPython effectively becoming Jupyter, and now, while IPython still exists, it is just the Python bit in this larger ecosystem.

I want to briefly talk about where this part of the project is headed. So far I've been talking about this thing I mentioned, the notebook. It turns out that when you open that web application you also get a file manager, a terminal emulator, and a text editor. These all seem a little bit outside of our core competency, but it turns out they are all things you need, especially if you're working remotely: on a remote server, just having that document-type interface is not enough for real-world scientific research. So we ended up having to build all these other things, and in the original code base all of that was Frankenstein-style, glommed one on top of another, not particularly modular or clean from an architectural perspective.
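Going back for a moment to the kernel idea, here is a sense of how small the per-language piece is: each kernel is registered with a short kernel.json spec telling the front end how to launch it. Below is a sketch of roughly what the standard Python kernel's spec contains, shown here as a Python dict; exact paths and details vary by installation.

    # Roughly the contents of a kernel.json spec for the standard Python kernel;
    # a new language supplies its own launch command, display name, and language id.
    kernel_spec = {
        "argv": ["python", "-m", "ipykernel_launcher", "-f", "{connection_file}"],
        "display_name": "Python 3",
        "language": "python",
    }
    # Installed kernels can be listed from a terminal with: jupyter kernelspec list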
So for the last few years we've had a long-running collaboration, especially with Bloomberg and also other companies, to reimagine these same ideas in a highly modular architecture that lets you build new interfaces in which all these pieces can talk to each other, communicate with each other, and be much more extensible and easy to adapt to new scientific workflows. It's called JupyterLab. It's currently in a high-quality beta stage. We haven't completely stopped developing the main notebook interface, and it's still being maintained, but we're gradually transitioning to making JupyterLab the new interface. One important point is that it's a new interface only: there are no changes to the underlying file formats, so your data and your existing documents don't change at all. This new interface still has notebooks, but it puts those other interface elements on the same footing as the original notebooks.

And now that we have an architecture where these protocols are spoken by the pieces of the UI, and those pieces are available to you as a developer as TypeScript components, then yes, you can have a notebook over here, but you can also have, for example, a new view onto a specific piece of output from a computation, represented and synchronized in a different part of the screen. So if, for example, you have a document with pieces of a visualization, you can keep those separately visible while the computational links underneath remain. You can have that same document viewed as a PDF in the environment. You can go beyond notebooks. Now that the architecture has been broken up into pieces, if you have, say, a markdown document, that document can be rendered live (most text editors these days are capable of doing that), but in Jupyter you can connect a computational console to that document and say: this document actually has code attached to it, and as I hit Shift-Enter in my markdown document, not only do I want to see it rendered, I actually want to see that code executed. So you can begin recomposing pieces of your own workflow for your needs out of the underlying components, without them necessarily being a notebook document.

The same idea applies to data. Within the interface you can view data: if you open an image file, you can obviously view it as an image, but the system is modular, so if you have a CSV file you may not want to view it as raw CSV; maybe a tabular representation is more convenient, so you can view it that way. You can open a JSON file, and by default JSON looks like a bunch of nested curly braces, typically pretty hard to read. But it may turn out that this is JSON corresponding to a specific schema called GeoJSON, for encoding geospatial data. If you have the GeoJSON viewer loaded, you can open that same file in its more natural, human-readable representation, which is a map with the encoded data shown on a live, zoomable view. And this is an example of opening a FASTA file, which is a genome assembly format; this particular FASTA file has multiple assemblies of a Zika genome, and the community had built a third-party viewer for it. It was very easy to say: for this use case, the most useful thing for this community to see is the alignment of all of these particular genomes, which are meant to represent the same organism.
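A sketch of how output like this reaches the front end from code: results are published as a MIME bundle keyed by type, and whichever renderer extension is installed (here, assuming a GeoJSON viewer) picks the richest representation it understands. The MIME type shown is the one commonly used for GeoJSON, and the coordinates are made up for illustration.

    # Publish a raw MIME bundle from Python; the front end chooses the best
    # renderer it has for "application/geo+json", falling back to the plain text.
    # The point below (roughly Berkeley) is made-up illustration data.
    from IPython.display import display

    display(
        {
            "application/geo+json": {
                "type": "Feature",
                "geometry": {"type": "Point", "coordinates": [-122.26, 37.87]},
                "properties": {"name": "Berkeley"},
            },
            "text/plain": "<GeoJSON point>",
        },
        raw=True,
    )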
So by wrapping the original viewer for this data format, you can now view it as a standalone, properly scrollable interface. But importantly, you don't just get a live viewer for the data: once the protocols are understood, you can write a tiny wrapper (this much code, really a couple of lines of new code) that makes the same tool available as something you can call from your own code. This is a recurring pattern in the design: it's not just about building graphical tools, it's about building tools that may be used interactively but that must also be usable as recorded, scripted, executable code, for the purposes of better practices around reproducibility. I presented this a week ago at Neurohackademy in Seattle, which is still ongoing, and made the point that this is meant to be a community-extensible tool, and within 24 hours a team went out and, by the next morning, had wrapped the Papaya NIfTI viewer so that you could load NIfTI images within the same interface and have all of Papaya's interactive functionality for viewing NIfTI in this format. This was done by Anisha Keshavan, Nate Back, and Chris (I don't know if Chris is here or in the other session), and they did it in a matter of 24 hours; they had a wrapper built. So this was a good illustration of the value of documenting and building these things on open standards and a well-defined set of interfaces for these pieces to communicate: that team saw the presentation, and by the next morning they had a wrapper up and running.

So that's the section on the technical machinery of the underlying interface. I want to talk a little bit about reproducibility, about reproducible research, and specifically about a project called Binder, which has been in development for a few years and which tries to give us tools to make a reality of a quote I use quite often, from Buckheit and Donoho back in 1995, where they argued that at this point we should really be thinking about something beyond PDFs for the publication of computationally intensive research. Donoho is an applied mathematician or statistician, however you want to slice it, but I think this quote is highly relevant these days in just about any scientific field that is computationally intensive. I subscribe absolutely to this philosophy, and over the years this is what we've tried to build. This is actually a project that had its origins in neuroscience: the first viable version of Binder, and the name of the original project, came from Jeremy Freeman back when he was a neuroscientist working at Janelia Farm. Jeremy is now at the Chan Zuckerberg Initiative. What Binder offers is a very simple web interface that helps working scientists give others a completely self-contained, reproducible entity that can be shared and used to re-execute an entire working pipeline with minimal effort, as long as they follow certain guidelines: having their code in a repository (by default GitHub, but it could be others); encoding the interface to the code and its execution with Jupyter or related tools (we'll see more about that in a minute); and, very importantly, documenting the dependencies that their code needs to run.
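As a concrete illustration of those guidelines, here is a minimal, hypothetical repository that Binder could build. The file names are the conventional ones its build tooling looks for, but the repository and its contents are invented for this example.

    my-analysis-repo/
        analysis.ipynb       (the main narrative notebook)
        data/sample.csv      (a small, repo-sized slice of the data)
        environment.yml      (machine-readable dependency specification)

    contents of environment.yml:
        dependencies:
          - python=3.6
          - numpy
          - matplotlib

Pasting the repository URL into mybinder.org then builds the container and launches a live session; the same build can also be reproduced locally with the repo2docker command-line tool (for example, jupyter-repo2docker followed by the repository URL).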
So it's not just that you give me a repo: you tell me, in a machine-readable format, what dependencies the code needs; the system will package all of that up in a Docker container; we'll ship it over to Kubernetes; we'll do all the work for you in the background. So as a scientist, as long as you follow the right guidelines, all you have to do is give us the URL of your GitHub repo, click a button, and you get a live, executable instance of it deployed for you automatically.

We've had some very satisfying examples of scientists adopting these practices. You may have seen that in 2017 the discovery of gravitational waves was awarded the Nobel Prize in physics. For those of you who are not physicists: this was a prediction made by Einstein about a hundred years prior, that large masses accelerating anywhere in the universe would radiate energy away in the form of waves, waves that travel at the speed of light just like electromagnetic waves, but that are distortions not in electric and magnetic fields but in spacetime itself. That was a theoretical prediction; detecting it experimentally is a massively challenging problem, a feat of engineering that took 30 years against somewhat impossible odds. For the experimentalists in the room, the detection problem is at the level of about one part in 10^21: imagine sensing a shift of roughly a thousandth of the diameter of a proton over a four-kilometer distance. That's a brutally, brutally challenging problem. It's akin to finding a person in an image of the entire Milky Way, to give you a sense of the orders of magnitude; it's about that hard. What they did was build two detectors, one in eastern Washington and one in Louisiana, four-kilometer-long interferometers that bounce lasers back and forth; any perturbation in spacetime produces a detectable interference pattern. The point is, in September of 2015 these two interferometers detected a signal that was unmistakably the merger of two black holes, roughly 30 solar masses each, about 1.3 billion light years away. This is the first figure of the main Physical Review Letters paper from that discovery, and first of all I want to flag that all of this paper has been built with open source tools.
So we've been working for 15, 17 years on building NumPy, SciPy, Matplotlib, etc., and for a long time I remember going to conferences and saying: it's okay to use open source tools for doing science, and by the way this is what Python is, not just a geeky toy but a real tool. Here is one of the most high-profile scientific efforts in the world today, and for them it was completely a given that they would use these tools to build all of their analyses. Furthermore, the LIGO Open Science Center has a page where all of their analyses are available as Jupyter notebooks, and you can download and run them yourself, or, without downloading anything, you can click on them at the LIGO Open Science Center and run the analysis yourself on the web, because it has been packaged in this Binder format. So anyone in the world who has a web browser can run it; a kid in Colombia who wants to learn about gravitational waves, like I was when I was 16, could do it with just an internet connection and a web browser. Listen to this. I don't know if you heard, at the very, very end, that little whoop: that is the chirp of the final collapse of two black holes merging 1.3 billion light years away, the vibration turned into sound. That is the real sonification of the actual signal of what two black holes collapsing into each other sound like, and I think it's fantastic that we can build open tools to make this possible for anyone in the world.

So, Binder itself, from a technical standpoint (I think I have about 10 minutes; 10, okay): what Binder tries to do is take repositories and turn them into reproducible containers. It then serves those containers to the user using JupyterHub. It then provides an interface, called BinderHub, to make it easy to share those things with others. And finally there is a service that runs those for the public, a free service that we operate thanks to funding from the Moore Foundation. The first layer of that, repo2docker, is something which, given a repo, will build a Docker image. It supports existing environment specifications in Python and R and many other languages, and it is very specifically designed to be useful for the scientific community. Yes, this kind of thing can be useful for others, but we're not trying to build something for the web industry at large; we're trying to remain simply a tool that makes it possible for scientists to easily and practically share their workflows with others. The Googles and the Amazons of the world have enough engineering resources to build tools like this for startups and whatnot; our use case is the scientific workflows of everyday research. How do we make it possible to share those things in the cloud, on the web, with minimal effort, for working scientists who don't want to become software engineers? JupyterHub, the next layer on top of that, was originally designed to deploy the Jupyter notebook on the cloud, but it can actually run non-Jupyter environments too. It basically allows you to authenticate users and spawn web services for them. So if you want to run it on your laptop you can, but if you want to run the same thing and instead say, "I want to authenticate my students with my campus authentication," you swap how they're authenticated and it runs. If, instead of running services on that server, you say, "When they authenticate, I want them to run Docker containers in our
cluster on campus," then you change how those processes are started, and you can get that. And on top of it, JupyterHub, sorry, JupyterHub together with Kubernetes, which is open-source technology coming out of Google, is designed to put this easily on the cloud, with a scalable, industry-standard set of tools that make management and dynamic scaling much easier. And finally, on top of that, the BinderHub interface gives you both the web UI and the underlying APIs to manage these things, share them with others, and do all of the launching, and I'm happy to talk about the details. That itself is available to the public as a free, open service; we have funding from the Moore Foundation. As the underlying infrastructure has gotten better, usage has been skyrocketing; these days we're on the order of 10 to 15,000 daily sessions. And one important point is that today, or at least as of about a month ago, we are seeing usage from virtually the entire planet. This is very satisfying to us: when we look at the data on user sessions, virtually every country in the world has by now accessed some of these Binders, and has accessed the kind of content that people are sharing through these interfaces. So it demonstrates that, yes, this is of value to everyone, not just the people who have access to expensive, first-world scientific resources. And I want to note that behind Binder there is a highly instrumented interface. We've been collecting a ton of data, and all of that data is available to the public: how much it costs, what the performance profiles are, what resources are used. We want this to be something that ultimately gets pushed into publishing pipelines and federated out. Our intent with running this public service was never to run a service that could satisfy the needs of everyone; it was to have a proof of concept that would help us learn what the cost and resource profile of something like this is, so that it can be federated, implemented, and propagated across the scientific community, including potentially commercial publishers, because we want something where the standards are open and under the control of the scientific community but that hopefully also changes the behavior of commercial publishers.

In the last few minutes I want to talk a little bit about education, and about how we've been trying at Berkeley to teach some of these things all the way down to the undergraduate level, so that hopefully the next generation is better than all of us in its practices. Last fall I taught a course called Collaborative and Reproducible Data Science at UC Berkeley. It was listed as a statistics course, but I actually had a mix of graduate students and undergraduates, and a fantastic GSI who did a ton of crazy work. The goal of this course was to teach students what collaborative and open science is, what it means to work reproducibly, and why that is important, from an ethical, epistemological, and social standpoint. So we had weekly readings and lectures and discussions on the what and the why, the ideas, but also, importantly, on how to do it, how to actually do this. We spent a lot of time on the practical skills, and especially on making it an everyday practice: making sure the students realized this was something that had to become a habit and a manner of working, rather than something they tell themselves they will do once the manuscript is submitted: go back and
clean it up and make it reproducible and documented and tested. Because no, you won't; by that time it's way too late. The skills that we built upon (and this is just one way; you can do the same thing with different choices of specific tools, but I think the underlying ideas are all relevant) were the use of version control, writing code in open programming languages, automating your processes rather than doing them manually, documenting everything with good open tools, testing the work you do, implementing continuous integration for your processes, and wrapping it all up in these kinds of open containers. Importantly, using Git and Python was an everyday part of the course: the course itself was built as a GitHub repository, and it was built with Sphinx, so that the students would get used to using these tools in everyday practice. The sense is that this is something they should do like brushing their teeth. It's not about getting a root canal every few years; it's about brushing your teeth so you don't need painful procedures done by the dentist. Git is kind of the toothbrush of science. And one way of making that stick was not only to teach them how to use it, but to do all the homework using something called GitHub Classroom, which basically facilitates turning in homework through GitHub. Over the course, as a matter of routine, the students had to create many, many repositories and collaborate with their peers using these tools. At the end of the course they had to turn in a project, an original analysis, like a mini research project. They had to find their own data and include it in the repo (or link to it if it was too large); write code with tests; write supporting analysis notebooks that, going back to Donoho's ideas, would provide the rest of the background; a main narrative notebook that would be, in essence, the paper; all of the necessary dependencies, automation, and reproducibility support; and follow best practices in terms of legal licensing, code sharing, contributions, et cetera, so that the complete body of work would follow what I think is a reasonable first-order standard playbook for this. And I'm really satisfied to say that it worked. All the students were able to do this. They turned in projects with well-documented repositories, with continuous integration builds that were green, and with Binder buttons, so I could basically review their work by clicking one link and getting a live version of their entire analysis. They included supporting analyses, supporting code, supporting test suites, and endpoint PDF summaries. So this is possible: it is possible to teach these things in a way that gets adopted.

And finally, to close, in the one minute I probably have: by building these kinds of open tools we really can support infrastructure that is used by the community. I could talk for a long time about this, but in the interest of time I only want to flag two points. One is that at Berkeley we've invested very heavily in this. We have JupyterHubs that are accessible campus-wide, and today those support the teaching of a new major in data science at Berkeley. The two backbone courses of that major are the lower-division one, called Data 8, and the upper-division one, called Data 100. This was me teaching Data 100 in the spring; I had about 650 students. In the fall I'm
going to have 800, and Data 8 in the coming session is going to have 1,300 students. So we're teaching the entire campus, in the largest halls on campus; they don't even all fit in there anymore, so many of them have to watch the lectures online. We can teach the entire university these courses, using these free resources, by hosting it all in the open. And for the Canadians in the room, pat yourselves on the back: your country is doing a phenomenal job of building national infrastructure with these tools. The Compute Canada team, which provides national HPC infrastructure, has deployed Jupyter services for Canadian researchers. Syzygy is a project to provide basically one-click access to JupyterHub and Jupyter resources on Canadian national infrastructure, and they also have projects, in collaboration with the Alberta government, to bring these same exact tool chains down to K-12 education. I'm really, really excited by the things I've seen from the folks here in Canada, and I hope the rest of us, in countries that are less enlightened, can learn from Canada.

Anyway, to wrap up: in addition to thanking the team, obviously, I want to thank the people who have funded us, and I also want to flag again, as I mentioned on the team picture, that this has been made possible by a very interesting partnership between private foundations (the Simons Foundation was mentioned earlier, and we've received funding from the Sloan, Moore, and Helmsley trusts), some federal government funding, and also a lot of industry funding. It has taken a ton of work to make this kind of complicated, multi-stakeholder process work, but it's been very valuable, because we've been able to achieve things that would have been, I think, very difficult under a purely government-funded effort; we would never have gotten here. Private foundation money is difficult to come by, and even though they've been incredibly generous with us, it's the kind of thing you can't always count on. And purely industrial efforts, even when companies work on open source, would not have gotten us here either. So I think there's a lot of value in trying to make this kind of partnership work. In closing: we've talked about a little bit of technology, and hopefully these are interfaces that will give you some ideas to work with; these tools provide infrastructure that can be used by other organizations; and hopefully I've shown you that it's possible to train a new generation of scientists in these ideas. Thank you very much, and hopefully I'm not too much over time. There you go.

So, first we have time for a couple of questions on this talk, and then in the program we've been allocated a bit of time for an overall discussion of, effectively, neural data science topics. So, yes.

Hey there, thanks so much, both for your work and for the talk. So, what are your thoughts: you have mybinder, which, being on GitHub, is in some ways repo-centric, maybe not GitHub-centric. What is your sense of best practices for integrating large data sets, maybe ones that a whole classroom is using, rather than having them be pulled? Is there a way to have a sort of data-centric approach, and how does that jibe with GitHub? What's your best practice for that?

So, that's a very good question. I don't think we have converged on a single solution for that. I think there are two slightly different patterns. One is the question of how to share and make accessible data
which is larger than what you can reasonably shove into a repo, for the purposes of reproducible, long-term publication. For that, my thinking follows a little bit the approach the particle physicists have taken over the years. The LHC has level-zero data which, well, first of all, is mostly thrown away by the level-zero triggers, but then there's the first layer of data, which really is only replicated on worldwide infrastructure; you're not going to copy that anywhere. And then there are successively smaller levels of triaged data. You could imagine taking some level which is maybe too big to fit into a repo but a bit smaller than, say, 10 petabytes, and putting it in an S3 bucket or some public bucket, persistent cold storage, which the repo can pull from, so that if somebody wants to start the analysis at that level they can pull from it. But I would also suggest including a next level of pre-processed data in the repo itself, which is where most people would start. So most people would start from a level that is maybe a couple of hundred megs and can live in the repo; if they really want to validate the processing one step earlier, they can pull, say, a few gigs, 10 gigs, 50 gigs, from an S3 bucket; and if they really need the five petabytes, they need to talk to you. Something like that. I think that's a pattern we can converge on good practices for, for scientific reproducibility. The classroom or large-team use case I think is a little different. There I would suggest deploying on a JupyterHub, and that's the kind of thing we are doing, whether cloud-based or on a campus cluster or on a supercomputer. That's where you should have the equivalent of your S3 buckets, but with the advantage of a little more control over what exactly the right approach is: do you use a shared file system, do you use buckets, whatever technology, and you do it in that environment. So I think those two patterns are slightly different, and that's what we're doing for both. Does that answer your question?

So, a question about reproducibility of the notebooks themselves. I think Binder is a wonderful solution to this, and the education is also a great way to get everybody on board with the right process, but we're not going to be able to teach everyone, and very often we still have people who publish just the .ipynb files. The question is: is there a way we can get just a dump of the imports, so that you have a hope of reconstructing the environment that somebody was running that in?
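The answer below mentions the watermark extension; as a sketch of what that one-line habit looks like in a notebook cell (the extension is real, but the exact flags and output shown here are approximate):

    # After `pip install watermark`, in a notebook:
    %load_ext watermark

    # One magic call prints the versions of the imported libraries into the
    # cell output, so the record is stored in the .ipynb file itself.
    %watermark --iversions
    # (prints, for example, numpy, scipy, and matplotlib with their version numbers)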
Yes, that's a really good point. I don't think we've written that kind of tool, but it's actually quite straightforward to do. First of all, there are a few really easy, low-effort things we can educate people about. There's a little Jupyter extension called watermark: if you pip install it and load it with %load_ext watermark, then a single call, one magic, will produce a little text dump in the output that says, these are the versions of all the libraries imported. So at least it's printed there, something you can read; even if they didn't build an environment and a Docker container, at least it's right there, and that really is one line of code. I had my students do that; it was one of the things they had to do. Getting people to do that helps. And worst comes to worst, yes, pulling the code out and pulling the imports out of a notebook is a 20-minute scripting exercise. We haven't done it, but we could sit down over lunch and write that script.

Excellent talk. Two points. One is, we've talked a lot about FAIR at this conference, and scholarship, research included, requires both of these things: these types of dynamic environments, where people can very quickly go back and say, "oh look, in five minutes I put this together," and also what the publishing community has generally provided, a level of stability, which means you can find these things again in the future. Are you paying attention now to whether this is an enduring scientific artifact, or something that should be available for some period of time? I don't think you can ever guarantee that these things are available forever in the same way that perhaps paper is, but how are you interacting with FAIR? That's my first question. My second is just a plug, because we're launching a new journal, I'm launching a new journal called Neuro Commons, dedicated to exactly this type of science, because I think it's about time we have a platform where all of this can actually be integrated with the publishing industry. So I just wanted to put a plug in for that; I think we paid for the drinks last night, but I'm not sure. But, your comments on FAIR.

So, yes, absolutely. Let's see. I think we're trying to honor those ideas, and actually, with this little playbook I suggested here, I was hoping this summer to find the time to write it up as a short commentary paper with my GSI, and I may still have a long enough flight to do it, because I think it's useful to document this process and our experience, what worked and what didn't. But my point is that I think the presentation of this as a digital artifact tries to meet some of those layers, in the sense that there is something that is sort of the paper: they were meant to include a PDF of the main narrative that doesn't have all the details and the computations and all the noise, but does have the scientific narrative. The point is that it is attached to these other artifacts, which live together and are well interconnected. And yes, if in 50 years Docker formats have changed enough that we can't even run this thing, hopefully the rest will still remain, so we are no worse off than we are today; but hopefully we are better, in that all of the layers are there to keep digging one step further as your needs dictate, so that if you really care more, you're willing to run the
code; if you really care more, you're going to study it and download it and install it. So we're trying, I think we're genuinely trying, to honor those principles. If you do feel there are pieces of this that are missing, and I'm completely open to that idea, this is definitely work in progress, I'd love to talk to you more, because I'm sure there's a lot we can do better. But I think we are trying to honor those principles.