Welcome to another edition of RCE. Again, this is Brock Palen. You can find us online at RCE-cast.com where you can find links to our Twitter, our blogs, all that fun stuff. Once again, I have Jeff Squyres from Cisco Systems, one of the authors of Open MPI. Jeff, thanks again for your time. Hey Brock, how are you? It's getting to be fall here, which means it's getting to be the ramp-up time for supercomputing. Right, so once again, I believe we both have booths this year. I am not directly participating in anything, but a number of us will be participating in panels on Arm for scientific computing, machine learning, GPU computing, and a number of other little things that we've dabbled in over the last couple of years. So this is probably the most involved year we've ever been with supercomputing for my group. Yeah, that spans a wide gamut there. And I'll be having the usual Open MPI Birds of a Feather session with Dr. George Bosilca from the University of Tennessee, giving the state of the union of where we are in Open MPI, where it's going, all those things. So we'd love to see all of you there. But enough about supercomputing. Let's talk about today, Brock. Who do we have? Okay, so our topic today, we have Dr. Brian Granger to talk to us about Jupyter. So Brian, why don't you take a moment to introduce yourself? Yeah, thanks so much for having me here today, Jeff and Brock. I'm a professor of physics at Cal Poly State University in San Luis Obispo. My background is in theoretical atomic physics, but for the last, I don't know, decade or decade and a half, I've gotten very involved in open source software development, originally in the scientific computing space, sort of before data science was a thing. And then as the entire sort of universe has shifted around this new idea of data science, a lot of the open source projects I'm working on are right in the middle of data science, but they also continue to be relevant in traditional scientific computing as well. That includes Project Jupyter. I'm one of the leaders and core developers of Jupyter and IPython. And then also in the last year, I've been working with Jake VanderPlas on a new data visualization library called Altair. So that's kind of funny, because actually my degree is in nuclear engineering. I got into scientific computing, started off with the classic scientific computing, and in the last couple of years, with the rise of data science, I'm doing a lot more data science infrastructure for people in social science, health, IoT, and engineering, and then everything else. So it's funny, kind of parallel tracks. Yeah. I think there's a lot of us who've taken very parallel tracks in that respect. This is amusing. I'm the odd man out in this podcast here, where I'm the only pure computer engineer. We just consume what you do. Yeah, apparently so. Yeah. So, Brian, can you give us a little bit of detail? We're here to talk about Jupyter. What is Jupyter? Yeah. So, Jupyter is a sort of offspring of IPython, so I can start with a brief history of IPython. It started in 2001. Fernando Perez, a classmate of mine in graduate school at the University of Colorado, started IPython originally as an enhanced interactive shell for Python. This was sort of in the very early days of Python starting to be used in scientific computing. And Fernando had long been a user of Mathematica, as I was, and really missed a lot of the sort of niceties that Mathematica offers for working interactively with code.
And so he started IPython in 2001 to bring that to Python. And for the first roughly decade, it remained a terminal-based interactive shell for Python. And then in 2011, we built a web-based notebook. It was part of IPython at the time. And over the following years, we abstracted out an architecture that allowed other languages to plug into that web-based notebook. And Jupyter was born as sort of the language-independent part of the overall effort. And so today we refer to that notebook as the Jupyter notebook. IPython continues to exist as one of the language extensions, or as we call it, a kernel for Jupyter, which provides basically the Python language support for Jupyter. So where did the name Jupyter come from? And particularly, it's got a slightly odd spelling. There must be a story behind that. Yeah, we spent a long time sort of hunting and trying to find a name that met the basic constraints of something that was open on domain names, open on GitHub, Twitter, et cetera. And once you impose all of those constraints, there's not much left. We also wanted a name that sort of nodded in the direction of our scientific computing heritage, and so you've got Jupiter and Galileo sort of built into the name. And then also the fact that, even though Jupyter is language-independent and we support many different languages, it reflects our Python heritage as well. And so changing the I in the name Jupiter to a Y is where we ended up. At times we've said the name also sort of has fragments of some of the main languages we support, namely Julia, Python, and R. So you could think of Ju, Py, teR. I think that's a little bit tongue-in-cheek, but if your listeners want to take that seriously, that's also completely fine. For some people in our community, that's the folklore that's emerged around the name, and I'm completely fine with that. So for those who haven't seen this before, why would you not just use a Python, Julia, or R IDE? Yeah, that's a great question. And it's a question that is becoming more difficult for us to answer. So let me first give the answer looking back. The main thing that the Jupyter notebook provides is interactive computing. So when someone is running a computation, while it's still running, they're writing more code, submitting it, and looking at output. And so it's really optimized around that type of workflow, in combination with more of a narrative format. So the Jupyter notebook is actually a document that a user would create that mixes live code with narrative text, LaTeX equations, and visualizations. And so at the end of it, not only have you sort of been able to do your interactive computing yourself, you end up with an artifact that can be used to reproduce the work, share the work with other people, et cetera. And so traditional IDEs don't offer those two aspects: sort of reproducible interactive computing, and what we refer to as a computational narrative that can be shared with other people.
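To make that concrete, here is a minimal sketch of what that document artifact looks like under the hood. A notebook is just a JSON file of cells, and the nbformat library that ships with Jupyter can build one programmatically; the cell contents below are made up for illustration.

    # A notebook is a JSON document made of markdown and code cells.
    # nbformat is the reference library for reading/writing that format.
    import nbformat
    from nbformat.v4 import new_notebook, new_markdown_cell, new_code_cell

    nb = new_notebook()
    nb.cells = [
        # narrative text, with LaTeX math embedded in Markdown
        new_markdown_cell("# Damped oscillator\nWe plot $x(t) = e^{-t/10}\\cos t$."),
        # live code that a kernel will execute, with outputs stored alongside
        new_code_cell("import numpy as np\nt = np.linspace(0, 10, 500)\nx = np.exp(-t / 10) * np.cos(t)"),
    ]
    nbformat.write(nb, "narrative.ipynb")  # a shareable, re-runnable artifact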
This is the concept of literate programming, right? The idea that you can have all of your documentation and everything embedded, but it's more for learning rather than actual documentation. Yeah, we like to use the phrase computational narrative rather than literate programming. The original vision of literate programming was slightly different, in that it wasn't designed to be an interactive experience. It was more that you would have a single file that contains the source code and the documentation, and then later run various post-processing steps to generate the documentation or the source code. Whereas what we're talking about here is an actual live interactive computation in the middle of a document that honestly looks more like a Google Drive document or a Microsoft Word document than a traditional IDE. So this actually kind of strikes me, and I'm going to date myself a little bit by making this reference, but when I was doing research for this interview and looking into what Jupyter is and how it works and whatnot, it kind of struck me that this seemed like a Google Wave for science, particularly with the timing back when you started this in 2001 or so. Google Wave was the predecessor of Google Docs and Google Sheets and things like that, where they kind of introduced a lot of this technology where you could even have multiple authors working in a single workspace at the same time, and everybody gets a simulcast of exactly what's happening all throughout, and stuff like that. But you've taken that and run with it even further to actually put computation in there. Was there any relation to the Google Wave ideas at all, or were you before that, or did it happen at the same time? Or is that even an apt analogy, or am I just speaking nonsense here? That's a great question. I don't remember the exact year of Google Wave to be able to sort of put it in the context of our thinking, but I got involved with IPython development sometime around late 2004, 2005, and originally I had started to do work more in parallel and distributed computing in Python. And part of what Fernando and I started to realize was that the interactive computing experience was really something that we loved and other people loved, but companies and efforts such as Google, with Gmail, were starting to come out and show that the web could be used for more than just static content. And so the idea of a web-based notebook was something that we started to talk about very early on, in 2005. And it wasn't a single thing, but it was really a constellation of new collaborative web applications that people were starting to build. Google Docs is one of those, Gmail another. The social media side of it was probably less of an influence. I think for us at the time social media was more entertainment than sort of a productive work tool. But that sort of broad new direction that many people and companies were taking, of building rich collaborative interactive web applications, was definitely something that informed our thinking and planning and direction of IPython and Jupyter. So while you were speaking there, forgive me, I did go off and research it. Google Wave looks like it came around in 2009, so you were several years ahead of that, so quite the pioneer. But let's flash forward back to today. From your definition here, it sounds like you're not even intending to be a competitor to an IDE. This is an entirely different paradigm of doing work. Is that an accurate assessment? Increasingly, those lines are being blurred, and let me sort of describe what we've found over the years. So we've really focused on interactive computing, and what we've seen from ourselves and our users is that eventually those interactive computations start to look more like traditional software engineering.
So eventually you need to pull a function out of a cell in a notebook, put it in a standalone file, and start to write documentation and a test suite and package it. And it's been pretty painful to go through that transition from an interactive notebook to more full-blown software engineering. In response to that, we're actually building a new user interface for Jupyter called JupyterLab. And honestly, when most people see it, they say, hey, wait, that's an IDE. And so it's definitely something that is becoming much blurrier in terms of traditional IDEs versus interactive computing. And I think the way we're casting it now is that if by IDE you mean an interactive development environment, then yes, we're willing to commit to that. The traditional notion of an integrated development environment that's focused on more software-engineering types of workflows, that's not there; there are many other IDEs that are much better at that. We're really focused on the interactive portion of the development process. So if it's being used for more than just interactive work and people are really kind of using it for everything, does running a kernel inside of Jupyter introduce any performance overhead? No, not really. So for your listeners, a Jupyter kernel is what we call the separate process that we send network messages to, and that process runs code, the user's code, in a particular language. So there's a Python kernel, a Julia kernel, an R kernel. And the only thing we're sending to the kernel is a string of source code that that process can then interpret and run. There is a small amount of overhead in just sending that small string of source code over, but once it starts running that code, there's essentially zero overhead. We have, for example, a C++ kernel, and once it gets that network message that has the source code, it compiles it and runs it, and it's full-blown native C++ performance. So what would you say is the most common way people start with Jupyter and end with Jupyter when they're using it in a project? Yeah, I think the sweet spot for Jupyter right now is people that are doing a wide range of tasks in scientific computing and data science, where maybe they're running a simulation, or they're loading a data set, cleaning the data set, processing the results of a simulation, doing statistical analysis, doing machine learning based on data sets, and then doing visualization in an interactive context. And then at the end of it, wanting to have something that they can share with other people to communicate their results. That's sort of the core use case for Jupyter, even as we support more traditional software engineering type workflows. So you kind of already talked about the history between Jupyter and IPython. Is Jupyter still primarily used with Python, or has it evolved, with one of these other kernels becoming more popular? Yeah, that's a great question. There's a research group at UCSD that has recently scraped all of the public notebooks off of GitHub. I think there's 1.2 million Jupyter notebooks on GitHub. And they're starting to look at these notebooks to help learn about how people are doing interactive computing. And I'm pretty sure they were the ones that mentioned that of the existing notebooks on GitHub, something like 97 percent, mid-90s percent, were still Python. Now, it's very possible there's some sampling bias there, that other communities are not putting their notebooks on GitHub like the Python community is.
But based on our observations, a very large fraction of our user base is still Python. That's also helped by the overall popularity of Python in this space. Is another use case, because you mentioned LaTeX in there as well: have you seen anybody write a paper specifically in Jupyter and have their graphs and charts and whatnot be active computations, so that they could produce, say, a PDF that actually represents an integrated set of work? Rather than, oh, I've got to have my scripts over here that generate my PDFs of graphs that then get slurped into LaTeX, and blah, blah, blah, that whole kind of thing. Has it been used to create publishable results like this? Yes and no. So the narrative text in a Jupyter notebook is Markdown, and even though we support LaTeX in the Markdown cells, Markdown is a little bit too limited to offer sort of full-blown publication content. It lacks a lot of features that you need for that. And because of that, the main way we're seeing people use this in a publication context is as sort of accompanying material for a formal academic publication. That's being done quite often. A great example of that, and it's a perfect day to mention it, is the LIGO collaboration, which discovered, or observed, gravitational waves, and actually won the Nobel Prize for that; that was announced just today. Anytime they have an observational event, they actually publish a Jupyter notebook that reproduces all of the analysis that goes into the associated peer-reviewed publication. And that type of usage pattern is something that we're seeing quite often and that a lot of academic publishers are quite interested in. So let's talk about the guts, the way this works, a little bit. You mentioned a couple of languages, but out of the box, if I install Jupyter, what languages slash kernels does it support? Actually, that's a really good question. Today, I think if you, for example, installed Jupyter with pip or conda, the only kernel we will install is the Python kernel that we build, the IPython kernel. And then any other kernels that you would install beyond that, you would have to install separately. And the reason we've done this is that we ourselves, sort of the core Jupyter and IPython team, only maintain a very small number of kernels. Most of the kernels built for Jupyter are developed by third parties, and so it's completely up to those third parties how you would install them. Also, many of those other kernels are written in other languages that have completely different packaging systems. And so it wouldn't make sense to pip install an R kernel; R has its own packaging system, and the R kernel is shipped using that packaging system. So what are some of the common kernels that are out there? Yeah, so the Julia kernel is quite popular. Julia was actually the second language, other than Python, to have support for Jupyter. So that kernel has been around a long time, it's fairly mature, and the core Julia team has sort of been using and promoting Jupyter for quite a long time. Among other popular kernels, there's an open source R kernel that people are definitely using. Another sort of broad area we're seeing is a movement towards JVM-based languages that a lot of people are interested in for tools like Spark, and there are a number of Scala kernels. And then also, actually, JavaScript kernels. There's a couple of different JavaScript kernels for Jupyter.
And JavaScript is a great language for working on the web. Being a web application, we offer users a lot of the niceties of a web-based environment, so you can use libraries such as D3.js to do a visualization if you want. Have any of the commercial languages come up with support for Jupyter? There are things like MATLAB, SAS, S-PLUS. Anybody like that? So do any of those have Jupyter kernels? Is that more the question you're asking? Yeah. Yes, actually. So I'm pretty sure that IDL, which is sort of an old-school interactive computing environment used a lot in the astronomy community, ships a Jupyter kernel, as far as I know. And then SAS as well has a Jupyter kernel that they're shipping. There is an open source MATLAB kernel that is available. I've not used that myself. I've had some students that have tried it and said, it's okay, you can use it, but it's definitely not sort of a first-class kernel. We would love to see MathWorks take that on and build a really nice, robust Jupyter kernel. And that is something we're hearing from our users: a lot of people are still using MATLAB and want to keep using MATLAB, but they want to integrate with Jupyter and get the Jupyter notebook format and a lot of the other benefits of the overall ecosystem. So for creating these kernels, you said they're a separate process. What are the mechanics? How do you get from the web front end to the kernel and back? Yeah, that's a great question. The kernel is defined by the network protocol that it speaks. And the transport layer that we use for kernels is called ZeroMQ. ZeroMQ is a message-oriented layer on top of TCP/IP. And the way that kernels talk over ZeroMQ is basically through JSON messages. We have a formal specification for the types of messages that the front end would send to a kernel, and also for the types of messages that a kernel would send back to the front end. And as long as a process uses ZeroMQ in that way and sends and receives the right JSON messages, it can be a valid Jupyter kernel. There's a lot of flexibility within that, but that's sort of the minimal notion of what a kernel is.
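For a sense of what that request/reply cycle looks like in practice, here is a rough sketch from the client side using the jupyter_client library, which is the same machinery the notebook server builds on. It assumes the standard IPython kernel is installed under the kernelspec name "python3".

    # Sketch: drive a kernel over the Jupyter messaging protocol.
    # jupyter_client hides the raw ZeroMQ sockets and JSON framing.
    from jupyter_client import KernelManager

    km = KernelManager(kernel_name="python3")
    km.start_kernel()                    # kernel runs as a separate process
    kc = km.client()
    kc.start_channels()                  # open the ZeroMQ channels

    kc.execute("print(6 * 7)")           # sends an 'execute_request' JSON message
    reply = kc.get_shell_msg(timeout=5)  # the 'execute_reply' comes back as JSON
    print(reply["content"]["status"])    # e.g. 'ok'

    kc.stop_channels()
    km.shutdown_kernel()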
So then a kernel is not even a plug-in. It's just a standalone entity, and as long as it listens and speaks in the right way, and you just tell the Jupyter core what TCP address and port it's listening on, you're good to go. Is that correct? Pretty much, yeah. The kernels are registered with the notebook server by dropping a small JSON file in one of a couple of different configuration directories. And that JSON file essentially has the command-line program to run to start that kernel. And so there's no sort of language-to-language calling. You literally just tell us what process to start, and we will start that process and assume that it speaks the right network protocols. There is a way for kernels and the notebook server to agree upon which ports are being used as part of that. But it's all a fairly simple system.
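As an illustration, the small JSON file in question is conventionally named kernel.json, and the sketch below writes one the way the IPython kernel's own installer does. The exact kernelspec directory varies by platform; the per-user Linux path used here is an assumption.

    # Sketch: register a kernel by dropping a kernel.json into a
    # kernelspec directory that the notebook server searches.
    import json, os

    spec = {
        # command line used to start the kernel; the server substitutes
        # {connection_file} with a JSON file listing ports and auth keys
        "argv": ["python", "-m", "ipykernel_launcher", "-f", "{connection_file}"],
        "display_name": "Python 3",
        "language": "python",
    }

    d = os.path.expanduser("~/.local/share/jupyter/kernels/python3")
    os.makedirs(d, exist_ok=True)
    with open(os.path.join(d, "kernel.json"), "w") as f:
        json.dump(spec, f, indent=2)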
So then let me ask from my own bias here, being an HPC MPI kind of guy: has anybody written kernels that front a back-end HPC cluster using MPI or some other parallel technology, so that you actually have a Jupyter notebook either launching or controlling or directing some larger computation that's running as a small or large HPC job? Yes, definitely. And there's a couple of different ways you can architect that. One is that there's no constraint on the type of code you run in a kernel. And so, for example, if you're running the C++ interactive kernel and you want to just start using MPI in that context, you could do that, and it should work fine. Now, with that said, Jeff, I'm sure you know there could be a lot of subtleties about how MPI processes get started. And so, if you wanted a kernel that really did that well, you'd need to think about that sort of bootstrapping phase that MPI does. But there are examples of that. One other project that exists within the IPython organization is something we call IPython Parallel, or ipyparallel. And it actually exposes a Python API for talking to basically MPI clusters that are separate from the kernel. A typical use case we see is someone running a Jupyter notebook on the head node of a large supercomputing cluster. The kernel would be running interactively on that head node, so it's not doing anything computationally demanding, but then the user might start a large parallel job with Python and MPI and be able to steer that interactively from the notebook. So that is one of the very early use cases that we had in mind for this.
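A hedged sketch of that head-node workflow with ipyparallel might look like the following; it assumes a set of engines has already been started as MPI processes (for example with something like "ipcluster start -n 64 --engines=MPI") and that mpi4py is installed on the engines.

    # Sketch: steer an MPI job interactively from a notebook via ipyparallel.
    import ipyparallel as ipp

    rc = ipp.Client()   # connect to the cluster's controller from the head node
    view = rc[:]        # a DirectView over all engines (one per MPI rank)

    def rank_and_total(x):
        # runs on each engine; every engine is an MPI rank
        from mpi4py import MPI
        comm = MPI.COMM_WORLD
        return comm.rank, comm.allreduce(x)

    results = view.apply_sync(rank_and_total, 1)  # blocks until all ranks reply
    print(results[:4])  # inspect partial state, then submit more work interactively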
What about something using one of these new web-stack orchestration engines, like Kubernetes or Rancher or Mesos, where you could actually say, start my big thing over here, it listens on this port, and I can run Jupyter locally? Is anybody kind of doing that? Almost like, I run Jupyter locally and, when I'm ready, fire up this thing in the cloud or something like that to do the heavy lifting? Yes, definitely. The biggest place we're seeing that is in the Spark community. So there's a PySpark client library, and that library knows how to communicate with Spark clusters. And there's a couple of different ways of doing it: either the kernel can be started as part of the Spark cluster, or there's a new REST protocol for talking to Spark called Livy, L-I-V-Y, I think it is. But that's definitely one of the usage cases we see happening a lot: the large companies that are offering sort of turnkey Spark deployments are packaging that around a Jupyter-notebook-based front end for their users. So what's JupyterHub then? How does that come into play with all of this? Yeah, so the original Jupyter notebook is a single-user web application. It's something that users tended to start just on their local machine. So they would type jupyter, space, notebook at the terminal on their local laptop, and that starts the Jupyter notebook server. And then they use the software through their web browser, but it's just talking to this local server. JupyterHub is an organizational, multi-user version of this that basically takes care of spawning single-user notebook servers on behalf of different users. It handles authentication, and then there's a proxy layer that routes the traffic to the appropriate single-user process. So think of JupyterHub as a multi-user, organizational implementation of Jupyter. Does JupyterHub understand batch systems or cloud orchestration APIs or anything like that, so we can kind of spin these things up dynamically? Because I know other tools already do this; the TACC visualization hub at the Texas Advanced Computing Center allows you to submit a job that spawns Jupyter for you and reverse-proxies it back. And we actually support that at Michigan too on our cluster, so people don't have to make their own script or anything. So does JupyterHub have that built in? Yeah, really, how I look at JupyterHub is as a set of building blocks that you can assemble in different ways for particular types of deployments. For example, one of those building blocks handles authentication, and it's an extensible API. So if you want to authenticate with OAuth, you can plug in whatever OAuth system you have at that point. Another building block takes care of spawning individual single-user notebook servers. That's also extensible, and so the simple default one just starts a local subprocess on the server, but different people have written spawners for different batch systems, for Kubernetes, for Docker, and so on. And so that's something that a lot of work has been put into. And I think the largest-scale JupyterHub deployments that I know of these days support many thousands of concurrent users, and I think the largest ones right now are using Kubernetes to manage the sort of spawning and load balancing and auto-scaling of the system.
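As a concrete illustration of those building blocks, here is a minimal sketch of a jupyterhub_config.py. The authenticator and spawner classes shown come from separate plugin packages (oauthenticator and dockerspawner), so treat the specific class names as assumptions about what a given deployment has installed.

    # Sketch: JupyterHub assembled from pluggable building blocks.
    c = get_config()  # JupyterHub provides this when loading the config file

    # Authentication building block: e.g. GitHub OAuth instead of local accounts
    c.JupyterHub.authenticator_class = "oauthenticator.GitHubOAuthenticator"

    # Spawner building block: run each user's notebook server in a container;
    # batch-system spawners (e.g. batchspawner's SlurmSpawner) plug in the same way
    c.JupyterHub.spawner_class = "dockerspawner.DockerSpawner"
    c.Spawner.mem_limit = "2G"  # per-user resource cap enforced by the spawner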
Now, something you mentioned and alluded to earlier in the conversation was the effort you guys have embarked upon to develop a community around Jupyter. What can you tell us about that? For example, you just had JupyterCon in August. Can you tell us a little bit about that? Yeah. So over the last few years, we had started to experiment with different events to bring together Jupyter users. We had had a number of Jupyter Day events, as we were calling them, all over the world, really. And we had started to observe that there were a lot of users, and our users have really amazing things to share about how they're using Jupyter. Part of what's fun about it is the really diverse ways that people are using Jupyter, ranging from social sciences and humanities to traditional physical sciences to data journalism. And JupyterCon was our first sort of larger conference to bring as many Jupyter users together as we could. So yeah, that was just this past August, I guess just over a month ago now, in New York City, and we had around 700 attendees. It was co-organized with O'Reilly Media and also the non-profit organization for Jupyter, which is the NumFOCUS Foundation. So what's the easiest way for someone to get started with Jupyter? Yeah, most of our users probably install Jupyter through the Anaconda Python distribution. That's really the easiest way to get a working Jupyter installation that includes all the other dependencies that you will likely want to use along with it: different visualization libraries, scientific computing libraries, machine learning. And so the Anaconda Python distribution is probably the most common way that people get started. Another increasingly common way is organizational deployments, where someone within an organization deploys it on behalf of the rest of the organization. And at that point, the users are typically pointed towards the URL for that deployment, where they can log on with whatever credentials are set up for the deployment. What's the strangest thing that you've seen Jupyter used for? Yeah, the strangest thing. Let me think about that a little. We usually like to ask this question of most of our guests to kind of emphasize the way in which software, and even science itself, escapes out into the world and then gets used in these sometimes wacky or crazy imaginative ways that the authors and developers just didn't intend at all. Yeah, I mean, I don't know if strange is quite the right word, but one usage case that I don't think we had in mind back in the mid-2000s, when we got going on this journey, was usage in data journalism. I think I was ignorant of any work in data journalism happening at that time. There may have been; again, it could be just that I wasn't aware of it. And so the idea that journalism teams would have computational folks involved who are using a tool like Jupyter and doing machine learning and data science and data visualization is something that we've been extremely happy to see. But it's also something that I think has surprised us in the best possible way of being surprised. And honestly, I think part of the fun of it is that the organization that has done this most successfully is BuzzFeed, which is not usually pictured by folks as being a serious news organization. But there's a fantastic data journalism team at BuzzFeed News. Jeremy Singer-Vine is one of the folks that we've interacted with a lot. And at this point, as far as I know, anytime they publish an article that has data behind it, they share their analysis and the data set on GitHub, and they're publishing that as Jupyter notebooks. So they're really setting a very high bar for openness and reproducibility in data journalism. So that actually raises a fascinating question. What license do you distribute Jupyter under? And does that carry through to the work that is published as Jupyter notebooks? Yeah, we use the three-clause revised BSD license. It's a very liberal license, and that's a choice that we made very early on. We wanted people to be able to use Jupyter in pretty much any way they wanted, whether it's for non-profit work or academic research, or even for building for-profit companies and products around it. And so there are no constraints on how people license Jupyter notebooks themselves. They can license those notebooks using essentially any open source license, or not even an open source one; you could write completely proprietary Jupyter notebooks, and that's completely fine. Okay, so what about the Jupyter organization as a whole? You said you started as IPython and you kind of abstracted it out. How are you guys organized? Yeah, so we are now a part of the NumFOCUS non-profit foundation. NumFOCUS is a 501(c)(3) non-profit that's home to a number of open source projects in the Python, R, and Julia communities. So a lot of the other open source projects that users are using when they're using the Jupyter notebook are also part of NumFOCUS. And we have a fantastic development team working on Jupyter. Our project is led by a steering council, as we call it, of 12 individuals that have made long-term, significant contributions to the project. And then Fernando Perez continues to be the BDFL for the project. But it's a very large and significant effort by a lot of different people. I think we have somewhere on the order of 25 full-time people, plus hundreds of other part-time and occasional contributors. So it's really a large community effort at this point, with many, many people making contributions, and we're really grateful for all the work that everyone's doing. So thanks very much for your time. Where can people find out more about Jupyter and get started? Yeah, we have a website at jupyter.org, and that's probably the best place to start. There's links there to installation instructions as well as our documentation. The other place that would be great to go to learn more about the project and how it's being used is the JupyterCon YouTube channel. We have videos of all the keynotes and all the sessions there. I'm not sure all of the sessions are uploaded yet, but they're in the process of finishing those uploads over the last week. And there are many really good talks there, on YouTube for free, that anyone can watch to learn more about the project. Okay, Brian, thank you very much for your time. Thanks so much for having me on, Brock and Jeff. And yeah, thanks for what you do. All right, thank you.