So my name is Chris Holdgraf, I'm down at UC Berkeley, and I also work with the Jupyter project. I'm going to talk a little bit about some stuff that we've been working on over the last couple of years around making it possible to take interactive computational workflows into the cloud on shared infrastructure. I like to get the thank-yous in ahead of time, because I always run over time and then end up skipping them at the end. I'm going to talk about a bunch of stuff that I personally have not built. I've contributed to it, but all of these tools are created by a really awesome community of open developers all across the world, and this is just a small sampling of them. Really, everything that we're doing is a community-driven effort, and the credit should go to all of these people. And I also don't want to forget the people that have funded a lot of this work; thank you to all of those folks as well.

Okay, so my background: I have a degree in neuroscience. You may have heard from Tian's talk that I used to work a lot with the MNE-Python project, which kind of sucked me into the broader open-source data science world. And now I exist at the intersection of three different groups on campus: the Berkeley Institute for Data Science, which is an organization dedicated to helping research and science and creating open tools for them; the Division of Data Science, which has a similar kind of flavor but is more focused on education and teaching; and the Jupyter group at Berkeley, which is a group of us building Jupyter tools to help with those two use cases at the university.

So people often ask me, what is the Jupyter project? It's been around for a while, and a lot of people have at least heard the term, but many people think it's the Jupyter notebook, and a lot of people think it's a company. So to begin, I'd like to give at least my definition of Jupyter, which is that I think of it as a community of people and an ecosystem of tools that all revolves around the use case of interactive computing. It has a couple of guiding principles: in particular, being agnostic to the language, and in general to people's workflows, and trying to build tools that are modular and composable so that they can be pieced together in different interesting ways. People often describe Jupyter as the connective tissue of data analytics workflows. We're trying to make it easier for people to utilize other open tools, and to be the kind of backbone that makes your life easier in the context of interactive computing.

I've found that the best way to explain the design principles behind how the Jupyter community operates is to think about something called the last mile problem, so I'm going to take a brief aside and see if that helps clear things up a little bit. Say that I want to get from my home in Berkeley down to San Jose for this tasty coffee shop called Voltaire Coffee that I really like; it's one of the only reasons I ever go to San Jose. If I wanted to get there, I have a couple of different options. I could walk there by myself, which would get me there, but it would take a while. I could pay somebody to drive me down, which is certainly possible, but it would be kind of expensive. But my preferred option is that, generally speaking, I'm going to use public infrastructure to get down there.
And fortunately, I live in a society that, at least to some degree, invests in public infrastructure, and that makes my life much easier. I can walk from my house to the BART station, take BART down to Fremont, switch over onto a bus, take that down to San Jose, and then walk to Voltaire Coffee. Those two waypoints along the way, BART and the bus, are public infrastructure that I can utilize to get myself closer to my destination. But at the very end of it, I still have to walk to exactly where I'm trying to go. In the world of public transportation and infrastructure, that's often called the last mile problem: that last little bit that I need to do myself, because it doesn't make sense for the government to build a bus system just for Chris Holdgraf, right? It needs to build systems for lots of different people, but then we want the ability to go our own direction after that in order to reach our destinations. So public infrastructure's job is to get us closer to our goal and to make that last mile as short as possible.

So how does Jupyter fit into all of this? Well, if we think about our destination not as a geographical location but as some kind of goal that we have, like "I want to create an interactive report that shows off a scientific analysis I've conducted and share it with my friends," how can we think about this in a similar way? We have one option, which is to build the entire stack from scratch: write a bunch of custom HTML and D3 and interactive visualizations and stuff like that. Or I could pay somebody for a product that they give to me, and I'll just use that. For various reasons, those are both kind of non-preferred options for me. I like to take the same approach to software that I do to transportation, which is to use public infrastructure: open, modular tools that I can build on top of.

So taking this example of an interactive report, Jupyter decomposes that problem into a bunch of building blocks that all exist separately from one another. It utilizes open ecosystems of packages, such as the SciPy ecosystem. It has a document format specification that lets you interweave narrative text along with your code. It uses a server protocol that allows you to control resources on your machine and spin up different user sessions. It has a whole selection of interactive kernels in a bunch of different languages that people can create and extend for themselves. And it has a bunch of user interfaces built on top of that stack, such as the traditional notebook interface, JupyterLab, nteract, and a bunch of others that utilize all of those pieces and standards to show you different views into the same underlying structure and data. And at the end of that, you have this report. Each of these pieces along the way is relatively modular; they don't make a whole lot of assumptions about the things that come after them. But what they allow you to do is build, step by step, on top of the pieces that came before, in order to accomplish a goal like creating an interactive report.
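To make that document format piece concrete, here's a minimal sketch of what a Jupyter notebook file contains under the hood: a single JSON document (the nbformat 4 specification) interleaving markdown and code cells. The cell contents here are made up purely for illustration.

```bash
# A minimal sketch of the Jupyter notebook document format (nbformat 4).
# Cell contents are illustrative; real notebooks also carry saved outputs.
cat > minimal.ipynb <<'EOF'
{
  "nbformat": 4,
  "nbformat_minor": 2,
  "metadata": {},
  "cells": [
    {"cell_type": "markdown", "metadata": {},
     "source": ["# An interactive report\n", "Narrative text lives next to code."]},
    {"cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [],
     "source": ["print('and this cell is runnable')"]}
  ]
}
EOF
```

Because every interface reads and writes this same format, a notebook authored in one tool can be rendered or executed by another.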
OK, so now I'm going to go back to the original goal of this talk and discuss, specifically, scientific reproducibility and shareability using shared infrastructure in the cloud. To me, the most compelling quote I've found around scientific reproducibility and interactive computation is this paraphrasing from Buckheit and Donoho, and I'm just going to read it verbatim, because I think it's a really good quote: "An article about computational science in a scientific publication is not the scholarship itself, it is merely advertising of the scholarship. The actual scholarship is the complete software development environment and the complete set of instructions which generated the figures." What's really interesting to me is that this was written in 1995, back before data science was super sexy and everybody was trying to get involved in reproducibility and computation. I think that in a lot of ways it was an extremely forward-thinking way of approaching science and computation. Fortunately, we now have a lot of tools and technology that we didn't have in 1995. So how can we utilize that technology to get closer to being able to share the actual environments and the computation itself?

So our mission for this talk is to start with a kind of base Jupyter ecosystem (Jupyter notebooks, the notebook format, say the SciPy ecosystem) and get to our final goal of having an open, reproducible, and shareable environment for our work. To begin, I'm going to talk about how you can go from your local computer onto some kind of shared infrastructure. I'm going to focus on the cloud, although in general this could be any sort of shared infrastructure: a machine that you own locally and physically, or a machine belonging to your group or your organization.

A lot of this is guided by one of the main principles of how we've been thinking about education at Berkeley around data analytics, which is that some amount of data science should be taught to everyone at the undergraduate level. By that I mean not just the future statisticians and the future computer science majors, but also history majors and rhetoric majors and English majors and poli-sci majors, because data-driven thinking is really important in all aspects of society. And I should say, I'm not here talking about "I want to teach you how to use this pandas API"; it's more the conceptual ideas around statistical thinking and uncertainty and things like that. The challenge is that at UC Berkeley, when you want to teach a course for everyone, you get situations like this. This is the first day of a gigantic introduction to data science course called Data 8. It's about 1,400 students large; it doesn't fit into the biggest auditorium we have on campus. And for anyone who's ever taught a 15- to 20-person Data Carpentry or Software Carpentry class, you know that if we asked all of those students to install their own environments on their computers, we would waste the first three months of the course just debugging that. So how can we connect these students with some kind of shared computational environment that we curate for them, so that we can skip over all of that stuff and don't have to spend every waking minute talking about paths and conda distributions and environments?

For that, we're using something called JupyterHub. JupyterHub allows you, as a deployer, to create a preconfigured analytics environment that you define (the packages, the computational resources, everything like that) and then provide access to that environment on shared infrastructure. So for example, say you have access to some sort of fancy machine in the cloud.
Maybe this is something that you've asked for from AWS or Google Cloud, or a fancy machine on campus that you have available to you. You can put a JupyterHub on that machine, expose it at an arbitrary URL like myhub.org, and then provide access to students or researchers or whomever you'd like, so they can do their work on the machine. Because it's a Jupyter environment, it's language agnostic: you can define whatever data analytics environment you want, and a lot of people use Julia, Python, or R. You can also define an arbitrary interface that you want people to have access to. Maybe if you're teaching a course in R, you want students to use RStudio; if you're teaching a course in Python, you want students to use the Jupyter Notebook interface; or you could even get things like VS Code running in this sort of environment. You can also connect it to pedagogical resources that live in public repositories, on GitHub for example. Data 8 stores all of its course content in a GitHub repository and pulls that content into the JupyterHub so that students can interact with it. And maybe you want to use this to provide access to really complicated computing infrastructure, like a Spark cluster, or to really large or complex datasets that would be difficult to ship around to students individually. Finally, it also allows you to authenticate users. Beyond restricting who can access the hub, if you're concerned about that, authentication gives you the concept of a user identity, and that lets you do a lot of interesting things around tracking what students are doing, and letting students submit homework for grading, and things like that.

OK, so I'm going to try to show you what this looks like in the wild. And, right, I don't have a mouse here; the mouse is only on the screen on the opposite side of the room, so this might be interesting. OK, so this is the website that Data 8 uses to host its course textbook; yeah, you can see that. All of these pages have a Jupyter notebook underneath the hood, but it just looks like a regular old book: you can scroll through it, you've got a table of contents, you can scroll within a page to different sections. And it looks like a Jupyter notebook because it was created from a Jupyter notebook. When students read this in the context of a course, they often then want to interact with the material that's there. For that, the instructors have put together what they call "interact links". Basically, these are links that, when you click them, fire up a process on a JupyterHub that the Berkeley team manages. It goes to GitHub and asks, OK, which Jupyter notebook lives underneath this page? It then spins up a user session for the student and dumps them into a live version of the page, where they can start to interact with the content on their own. So this is running, as you can see up here, at datahub.berkeley.edu, which is where the JupyterHub for Data 8 lives.

And I also just want to give a quick shout-out to Open Humans, who gave a talk yesterday, I think (yeah, Bastian), and who also use JupyterHub to allow people to collaborate on personally collected data and explore and run different analyses about their own lives. OK, so we've added one extra step to this journey of getting to open, reproducible, shareable environments: we have infrastructure that lets us collaborate on the cloud.
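To give a flavor of what "putting a JupyterHub on a machine" involves, here's a minimal single-machine sketch. The package names and flags are the real ones, but treat it as illustrative: a production deployment like the Data 8 hub runs on Kubernetes, behind TLS, with a custom authenticator and container spawner configured.

```bash
# Minimal single-machine JupyterHub sketch (illustrative; a real deployment
# would sit behind TLS and configure an authenticator and spawner).
python3 -m pip install jupyterhub notebook
npm install -g configurable-http-proxy   # proxy that routes each user to their own server
jupyterhub --generate-config             # writes jupyterhub_config.py for customization
jupyterhub --ip 0.0.0.0 --port 8000      # serve the hub on the machine's public interface
```

Out of the box this spawns notebook servers as local system users; everything beyond that, like alternative interfaces or pulling course content from GitHub, gets layered on through the config file.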
So once you have some sort of shared infrastructure, the next thing you need is the ability to package your work, define the environment that you need, and then share it with other people. The tool that we're working on to make that possible is called repo2docker. repo2docker, in essence, is a fairly straightforward tool. It tries to do one thing relatively well: convert a repository of code and content into a Docker image that can run that content. That's it.

As an example, let's go to this page. This is a GitHub repository with one of the analyses from the recent gravitational wave detection with LIGO. If you look inside, you'll see there's a Jupyter notebook in here that I'm not going to dive into, just for time's sake. But what you'll also see is a file here called environment.yml. This is a common file in the Python world for defining the packages you need in order to run whatever is inside, and that's going to be important when I show you what repo2docker does in a second.

If I call repo2docker from the command line, I give it a path to a repository, either locally or remotely on something like GitHub; in this case, I'm passing it that LIGO binder repo. And what does it do once I do that? It gives me a bunch of gobbledygook, machine-looking output. So let me dig into each of these steps and tell you what's happening underneath the hood. First of all, it fetches the repository that you give it: it either moves into that directory locally or clones it from GitHub. It then looks inside that repository for what we call configuration files. These are files that are already common in the various communities supported by repo2docker, and they define the environment needed to run what's inside. In this case it's requirements.txt; it might be something like environment.yml if you want an Anaconda installation. It can also handle other languages, like R: you can specify a version of R and a date on CRAN that you want to pull packages from. The goal of all of this is to weave those configuration files into a Dockerfile that we know will produce the environment needed to run that code. So it generates the runtime environment needed to run whatever is inside, and it actually installs those dependencies based on the configuration files that are there. Finally, it builds that Dockerfile into an image and pushes it to a Docker registry that you can pull from, either to send it to your friends, to keep for yourself for reproducibility reasons, or to use with another tool, which I'm going to talk about next. If you look back through this gobbledygook, you can see each of these pieces in there: it clones the repository, it finds the environment needed, it builds and installs that environment, and then finally it pushes it up to the cloud.
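As a concrete sketch of that workflow, here's roughly what the command-line invocation looks like. The flags shown are standard repo2docker options; the repository URL is the binder-examples repo I'll use again in a minute, standing in for the LIGO one.

```bash
# Install and run repo2docker on a remote repository (sketch).
python3 -m pip install jupyter-repo2docker

# Clone the repo, detect its configuration files (requirements.txt here),
# build a Docker image, and launch a Jupyter server inside it:
repo2docker https://github.com/binder-examples/requirements

# Or build only (no server), with a name you choose, so you can push
# the image to a registry yourself:
repo2docker --no-run --image-name mygroup/my-analysis:v1 \
    https://github.com/binder-examples/requirements
```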
You can support a bunch of different data analytics workflows with this, and our goal is to make it relatively extensible and, again, modular. So if you're from a different community and you want to build in the ability to generate reproducible images with repo2docker, there's a relatively straightforward way to do so. The guiding principles of the tool are that it needs to create things that are both human- and machine-readable, and that it utilizes preexisting specifications and standards as much as possible. We don't want to create a new standard; what we'd like to do is piggyback off of preexisting ones. But we do want to support many languages, while staying tightly scoped around this workflow of interactive, reproducible data analytics. And we've built in a lot of extensibility to make it easier for you to adapt this to your own workflow.

OK, so that's one more step that gets us a little bit closer. But what we need now is something that ties these two pieces together: shared infrastructure that can be paired with the reproducible image-building process of something like repo2docker. And that's where Binder and BinderHub come into play. So Binder is another open source tool. It deploys a web app that allows people to create interactive, reproducible, shareable environments from a Git repository that is available publicly, or locally on the server on which it's running. The easiest way to think about BinderHub is that it's basically just repo2docker plus a JupyterHub: a particular configuration of JupyterHub that makes it easy to launch repo2docker in order to build the image needed to run a repository. It's built on Kubernetes, it's cloud agnostic, and it's very scalable; I'll show you in a second, but we run a BinderHub service that easily gets 100,000 users every week. It's also entirely community driven, and it's meant to be deployable by anyone. I'm going to focus on one particular deployment, but really the goal is for this to be generic tech that people can use for their own groups or their own communities as they wish.

One example, and probably the longest-lasting example, of a BinderHub lives at mybinder.org. What I'd like to do is show you how this works with another, hopefully successful, live demo. OK. If I go to mybinder.org, I get a pretty simple form here. It just says, give me a path to a Git repository. This is a repository that we maintain, so I know that it's there: it's binder-examples/requirements, and it lives on GitHub, so I'm just going to type that in. When it launches, it's going to say, oh, OK, I've actually already built the image needed for this repository, so I'm just going to spin up a server on a JupyterHub and dump you into that server so you can start interacting with whatever is inside. So, as you just saw, it's now running on a JupyterHub, and I have access to the Jupyter notebook that we put together to demonstrate what's going on inside. If I go to github.com and show you that repo, you can see it has basically the same kind of structure that I showed you before: there's this requirements.txt file, and if I look inside, it just specifies a couple of dependencies that I need. So what happens if I edit that file? I'm going to go in and add, let's see, how about pandas? I'll add one extra line that says, hey, I actually need pandas as well in order to run this thing. And I'm going to commit directly to master, because I like to show bad practices in public talks.
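For reference, here's a sketch of the two pieces in that demo: the edited requirements file and the shareable launch URL that mybinder.org generates. The exact file contents are assumed for illustration; the /v2/gh/ URL scheme is the real one.

```bash
# The repo's requirements.txt after the demo edit (contents assumed):
printf 'numpy\nmatplotlib\npandas\n' > requirements.txt

# mybinder.org hands back a shareable launch link following this scheme:
#   https://mybinder.org/v2/gh/<org>/<repo>/<branch-or-commit>
# Anyone opening it gets a fresh session built from that exact revision:
xdg-open "https://mybinder.org/v2/gh/binder-examples/requirements/master"
```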
OK, so now I go back to mybinder.org and type in the same org and repo. And now when I hit launch, it says, OK, I've noticed that the commit hash has changed for this repository, so I'm going to spawn a repo2docker process that rebuilds the environment, and then I'll give you a link that you can share with other people to interact with it. I'm not going to wait for all of this to happen, but it's going through that whole process of finding the environment, defining the packages needed, building it into a Docker image, and then letting you share that link with others so they can interact with it.

mybinder.org is also run as an open service. BinderHub is open tech, and we're trying to run mybinder.org in the same philosophy, so you can get a ton of information about what's going on there. If you go to grafana.mybinder.org, you'll get a really interesting view of lots of different activity, the kinds of repositories people are launching, and things like that. We also publish our billing data, in case you're interested in seeing the gory details there. We have an operations guide with some best practices around maintaining shared infrastructure on Kubernetes; a lot of the people that run this are not Kubernetes DevOps people, we are learning as we go along, and so we're trying to adhere to best practices in knowledge and skill sharing. And we also publish incident reports for all of the fires, so you can see lots of angry, coded language like "oh God, I had to wake up at six o'clock in the morning because mybinder.org is down today." If you want to read all of that, it's there in its full glory.

The service itself, and this kind of Binder platform approach, has been much more successful than we were expecting. At this point, we're launching something like 100,000 Binder sessions every week. The reason I highlight that is that the team maintaining this infrastructure is a skeleton crew, something like 10% of five people's time, split around the world; nobody is really getting paid in a full-time sense to work on this, and I think that's a testament to the robustness of the technology that's there, because again, our goal is to allow other people to deploy these kinds of things for themselves. The thing that's really exciting to me is seeing the launches across the world, just in the last month; it's amazing the coverage you're able to get across the globe by providing access to people in this way. A lot of these are folks coming from countries where it would be really hard for them to run the same kinds of material on their own laptops. There's a whole other conversation to be had about access to the internet and how that is a huge blocker here, and this isn't going to solve that problem, except to say that most of what Jupyter does is render a front end, and the only thing it sends back over the internet are little snippets of text. So it's not a heavy I/O kind of process, but that's still something we need to find ways around.

And then lastly, I want to mention that a bunch of other BinderHubs have started popping into existence. Kirstie runs a really cool project called The Turing Way that I think she'll probably talk about a little later. There's also a really cool project called Pangeo that runs a BinderHub for earth analytics. And GESIS, spelled G-E-S-I-S (my first attempt at pronouncing it sounded too much like "Jesus"), is, I think, the Leibniz Institute dedicated to the social sciences.
And this one, S-Y-Z-Y-G-Y, which I'm not even going to try to pronounce, is a nationwide deployment of JupyterHubs and BinderHubs across Canadian research institutions.

Okay, so I'm going to blow through these last ones super fast to make sure we have a little bit of time for questions, and I'm going to spend about five seconds on each, so hopefully you can all keep me to that pace. Jupyter Book allows you to create an interactive book from a collection of Jupyter notebooks. It's what I showed you before: it renders those notebooks as quote-unquote static HTML pages that can do things like render widgets powered by a Binder or JupyterHub kernel running in the cloud. Recently, Pandoc, which is a really awesome arbitrary-document-to-arbitrary-document conversion tool, added support for Jupyter notebooks as a first-class citizen. So what we're really hoping is that we can start to convert notebooks into actual publishable documents that you could submit to a journal, for example, with citations and references and cross-references and things like that, but that also have some kind of fancy interactive HTML-based view, if that's what you want as well. There's a really cool project called Voilà out of a group called QuantStack, who are kind of friends of Jupyter; they're French, so they like clever French names for their projects. Voilà allows you to define a dashboard using a Jupyter notebook that then gets rendered into an interactive HTML-based page, which lets you do some really cool things like intentionally hiding the code and exposing only the interactive elements that you want to. And there's a bunch of other stuff happening as well; I'm happy to chat with you about it, but there's way too much in Jupyter land to fit into a 25-minute talk.

So to finish up, here are the four things I mentioned in this talk. One, Jupyter makes the last mile problem easier by getting you closer to your goal, building public infrastructure that you can then build on top of to get to your final destination. Two, JupyterHub lets you create interactive environments on shared infrastructure that you can provide to lots of different people, around the world or within your group. Three, repo2docker lets you create a reproducible Docker image from a repository that you put online. And four, BinderHub weaves all of these components together into an application that lets you repeat this process on the web. And then finally, here's this Get Involved page. As I mentioned, all of these are open projects, and Jupyter is a very large and open community. No matter what your skill set or interests are, we would love to have your participation, and I'm happy to talk more after this with anyone who's interested. So thanks very much.
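For anyone who wants to try those last two tools after the talk, here are rough one-liners. The package names and commands are the real ones; the notebook and book directory paths are placeholders.

```bash
# Build an HTML book from a directory of notebooks with Jupyter Book
# ('mybook/' is a placeholder directory of notebooks plus configuration):
python3 -m pip install jupyter-book
jupyter-book build mybook/

# Serve a notebook as a code-hidden interactive dashboard with Voilà:
python3 -m pip install voila
voila my_dashboard.ipynb
```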