from Galvanize, San Francisco, extracting signal from the noise. It's theCUBE, covering the Apache Spark community event, brought to you by IBM. Now your hosts, John Furrier and George Gilbert.

Welcome back everybody, Jeff Frick here with theCUBE. We are at the Apache Spark community event, running concurrently with Spark Summit just across the street. We're at Galvanize, a pretty important place we're learning more about, one that's really taken on a mission to help develop more data scientists, which we know the world desperately needs as we move to this more data-based economy. I'm joined in this next segment by George Gilbert, my co-host from Wikibon, and we're really excited to have our guest, Fernando Perez. He's got a really long title, we made him chop it down: a scientist from Lawrence Berkeley Lab and also the Berkeley Institute for Data Science. Fernando, welcome.

Thanks for having me here, it's a pleasure.

Absolutely. So when you sat down, you said, you know, I kind of have a different take on this whole scene. You're not part of IBM, you're not part of the original Spark thing. So what is your take, coming from a really heavy academic data science background?

Yes, so as I said, I'm a scientist at Lawrence Berkeley National Laboratory, and I'm one of the founding investigators of the Berkeley Institute for Data Science. I'm a particle physicist by training, and I've been heavily involved with the open source scientific computing movement for a long time. I started writing open source tools for scientific computing when I was a graduate student: what became known as the IPython project, and has evolved into what is today known as the Jupyter project. It's now part of the Spark ecosystem, since you can use Jupyter as the environment for analyzing your data with Spark. The tools we've created on this path from physics, mathematics, and open source scientific computing are used in our environment for many different kinds of research. At the Berkeley Institute for Data Science we have biologists, particle physicists, astronomers, social scientists, people from many different contexts, all of whom use these tools for their academic research. Some of them use Spark as the bottom layer that does the heavy duty processing, and then on top of it they add other tools from the open source community to do the higher level analysis, visualization, et cetera, to communicate insights. That's where I come in, because as I said, not having been part of the origination of Spark, I come from the creation of the scientific Python ecosystem: IPython, NumPy, SciPy, Matplotlib, et cetera. That's the part of the ecosystem where I have been working for the past 15 years.

Right, right. Good stuff. It's funny that you say the use cases run so broadly across all those disciplines, when we've just had IBM sit down and, they didn't say it in so many words, but they're almost replatforming all their analytics applications around this technology, which really speaks to how broad and powerful a technology it is. Good stuff.
And actually Spark is underneath your thing; it's almost like that fable of the pea underneath all the mattresses, all your tools are building on top of it. But since you come from outside, from such a different context, from the people making the most bleeding edge use of these data science tools, can you take us back a couple of generations and tell us how the tools have evolved to make data scientists more productive? Maybe that can help inform where we are today and where we're going.

Sure, at least I can recall my take on it. As I said, I started as a particle physicist, and when I started developing IPython, the project that brought me to all of this, what I had in front of me as a grad student was, one day I looked at my desk and I had a stack of books on probably seven or eight different programming languages. I was writing Perl code; I had books about Bash, awk, sed, C++, Mathematica, IDL, and I realized I was probably spending more time switching between programming languages than actually doing any work, right? Then I discovered Python and realized I could probably replace most of those tools with it. And what is a grad student going to do, between writing more fun code and doing his dissertation? I embarked on this task of writing a new environment to do my work in Python, and I was able to start collapsing a lot of those tools into new open source tools in Python. I did graduate, but what ended up happening is that many, many other scientists were trying to do similar things. They were trying to build better open source tools, and many of us ended up doing it in Python, for scientific research, for data analysis, and for what has become known today as data science.

In the Python programming language? I'm sorry to interrupt, let me pick up on that one point. What made everyone choose Python, and what higher level abstraction or usability gain did that create, besides not having to learn seven different languages or switch between them? What did the community of everyone contributing to Python help foster?

I think what was central was that we were all able to interoperate very quickly. We were all able to work interactively, and we could bring our legacy tools. I was able to pull in the data I had from my legacy supercomputing codes written in C and Fortran, all of my numerical data, and quickly interact with it. The visualization library I started using was developed by neuroscientists in Chicago; the numerical tools I was working with had been developed at Lawrence Livermore Lab and at MIT, and were contributed to by a postdoc at the Mayo Clinic and by scientists at NASA; there were scientific libraries developed by a guy who was at the time, I think, a postdoc at Duke University. Basically, scientists from many communities started rapidly aggregating around these tools, contributing to them, and using common abstractions. These were tools that obviously predated the Spark revolution we're seeing now; the stories converge in the last few years. We're talking about 2001, 2002, 2003.

So was it that people agreed on Python as a single language, and then everyone turned their contributions into libraries?

Yes.

So you had a rich ecosystem, sort of like R for statistical programming.
And this was the equivalent for numerical programming, for array-oriented programming.

All the science, arrays.

Exactly. Array-oriented programming, numerical programming, the kinds of things many of us had been doing with MATLAB and IDL in the decades before, we started doing with Python. And it was the ethos of open source, the same ethos that is propelling Spark now, the ethos of open source contribution from around the year 2000 onwards. Many of us began building all of this as libraries that cooperated, coexisted, and grew together.

So then what was the big advance of Spark? We hear about the unified programming model, and then these personalities on top, each of which can call the others. Is the value it provides, relative to say Python and all its libraries, that the Python libraries you would use in conjunction and consecutively, whereas they couldn't really call on each other the way Spark personalities or libraries can?

I think what Spark has brought to the game is really an additional layer of enterprise level analytics. It's not so much for the everyday numerical computing workloads that many people in the physical sciences had; but when you look at running enterprise level analytics on large clusters in the data center, the kinds of numerical tools that we physical scientists were developing were not very well suited for those workloads, right? And Spark really made a killing in that space. What is happening now is that those two streams of tools are converging in a really interesting way, and we're seeing a very interesting synergy in the last few years. That's probably why I was invited to participate in these discussions: in the last few years, the folks at the AMPLab built PySpark, the Python layer on top of Spark. Basically, it's a layer that allows you to call Spark with a Python API. Spark runs on Java and Scala, right? That's what the actual Spark engine runs on, but you can drive it through Python APIs, calling it from Python. And then, once you have run all of your large scale analytics in Spark, you can import all of these Python libraries that the physical scientists have been writing for the last 10, 15 years, for numerical computing, machine learning, visualization, pandas data frames, et cetera, and use them interactively, with the IPython notebook and all of the interactive facilities we have been building and battle testing for 15 years. Those complement each other very well.
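[A minimal sketch of the handoff Fernando describes, assuming a working PySpark installation; the file and column names here are hypothetical placeholders, not from the interview. Spark does the distributed aggregation, and the much smaller result crosses into pandas and Matplotlib for the interactive "last mile":]

    # Sketch: Spark handles the cluster-scale aggregation; the reduced
    # result is handed to the scientific Python stack for interactive work.
    # "events.parquet" and its columns are hypothetical.
    from pyspark.sql import SparkSession
    import matplotlib.pyplot as plt

    spark = SparkSession.builder.appName("handoff-sketch").getOrCreate()

    # Heavy lifting happens in the Spark engine (JVM), driven from Python.
    events = spark.read.parquet("events.parquet")
    daily = events.groupBy("day").count()        # distributed aggregation

    # Cross over to pandas for the interactive "last mile".
    pdf = daily.toPandas()                       # small, local pandas DataFrame
    pdf.sort_values("day").plot(x="day", y="count")
    plt.show()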
I'm just curious, when you built all those tools out, it was a very interesting tale you just told, about this guy here who did this, and this guy over there who used to be in that field and did something else. The spirit of collaboration, which you really see coming through now in open source, is a way to drive innovation much more quickly than any individual person or company could on their own. When you guys were doing that, it wasn't really to develop an open source project to put out every place; it was more just, I've got a cool tool, I need some help, I heard you've got it, can I share? It wasn't necessarily built around "let's build an open source thing and innovate." Is that accurate?

No, I think we also saw the value in building those projects and those communities, from early on. I actually remember in 2003, when I had just graduated and was a first year postdoc, I was invited to Caltech to present at the SciPy conference, which at the time was a nascent, tiny conference organized by a young company, Enthought. The founder of Enthought, one of the fellows I just mentioned, basically said, why don't you come and give a talk about your IPython tool? I'm sure people would love to hear about it. And I said, look, I'm a starting postdoc, my boss isn't going to fund me to go talk about this little interactive Python tool I've written. And he said, no, just come, we'll support you, we'll fund your flight. There's a community of people here who care about what you're doing. And he did support me, and said, we'll host your project. At the time we didn't have GitHub, we didn't have any of the amazing tooling around the open source ecosystem that we have today. He said, we'll host your source code repositories and the mailing list, and we'll fund you to come here. And I realized that there was value in these things, in treating them as open source projects. Many of us realized we should actually work to get these things funded, and begin convincing the funding agencies, because most of us were academic scientists, people who worked at NASA, people who worked at the Hubble Space Telescope, some of us recently graduated postdocs. My colleague who is now basically my main collaborator on the IPython and Jupyter projects is a professor in the physics department at Cal Poly San Luis Obispo; at the time he was starting a postdoc at Harvard. We were all beginning to say, we probably should work with the academic funding agencies to convince them that it's important to fund this stuff. And at the time, that was a really tricky proposition; the federal funding agencies in 2002 didn't want to hear about it. Today, if you look at what's being funded at the federal agencies, the NIH, the NSF, the Department of Energy, DARPA, et cetera, they're actually funding a lot of these things. I think they do fund some Spark development; DARPA has put money into Spark, if I'm not mistaken. And I do know that a lot of these important scientific projects, including scientific Python and R and many others, are now being funded.

That's an interesting point that I don't think people drill down into enough: how much of this basic research at academic institutions, funded by the government, eventually trickles out into significant commercial products, whether open source or not.

Absolutely, absolutely.
All of our ecosystem... that's another area where the story I told, about how the open source scientific Python ecosystem that IPython and Jupyter grew out of, and that I'm part of, is now merging, that river joining in some ways with the Spark river, has a similar path in terms of its licensing story. Spark is part of the Apache ecosystem, which uses the Apache licensing model. All of the Python ecosystem is licensed under the BSD model, which also makes it very industry friendly. It's a licensing model that makes it very easy for industry to reuse these tools. So even though a lot of it has been developed, as you were mentioning, under academic funding models, it is very easy for industry to use it in industrial contexts.

I want to go back to something you said about PySpark bringing together Python, and the libraries the academic community uses, with the core Spark engine. Our last guest, the chief architect for analytic computing at IBM, was talking about data frames and RDDs, without boring our viewers who might not have grokked down to that level. That's an enterprise computing construct; it's like a record. And you had mentioned that scientists work with arrays, which are very different. How do you tie the two together? Is there like a thin straw between them, where you extract the data and turn it into another format, where you pull out the records and turn them into arrays? Because what I'm trying to understand is, can Spark be that core for scientific computing without losing fidelity?

No, I think there are certain classes of data for which the array continues to be the natural abstraction. The data frame is a complementary data structure that fits certain other problems very naturally. Those two are core abstractions for data science that both need to coexist, and I think it's important that we don't try to force one into the other. A screwdriver and a hammer are both really, really useful things to have in your toolbox; you don't necessarily want to drive nails with a screwdriver, and you really shouldn't be pounding screws with a hammer.
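[To make the "thin straw" concrete, a hedged sketch, with hypothetical column names, of crossing from Spark's record-oriented data frame to NumPy's array abstraction once the data has been reduced to a manageable size:]

    # Sketch of the bridge between abstractions, not an excerpt from the
    # interview: a record-oriented Spark DataFrame on one side, an
    # array-oriented NumPy array on the other. Column names are hypothetical.
    import numpy as np
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("bridge-sketch").getOrCreate()

    # Record-oriented side: rows of (sensor, reading) pairs.
    df = spark.createDataFrame(
        [("a", 1.0), ("a", 2.5), ("b", 0.7)], ["sensor", "reading"])

    # The "straw": collect one (already reduced) column into a NumPy array.
    readings = np.array([row.reading for row in df.select("reading").collect()])
    print(readings.mean(), readings.std())   # array-oriented math from here on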
Okay, so one related question. We keep hearing in scientific computing, which is so data centric now, about "pick your science"-informatics, and I guess simulation is the opposite, or the complementary form. One is measuring the Large Hadron Collider putting out, literally, a petabyte a second; the other is running a simulation that produces a ton of data. Does Python, and its coexistence with Spark, help create reproducible experiments? Data sets that before were just one big file that got lost in the original scientist's laboratory and couldn't be reproduced elsewhere?

The issue of reproducibility is both technical and, we must not forget, social. The question of reproducibility in science is very dear to my heart, and I've personally invested a lot of effort in working on it, so I care deeply about it, but I try to stress that it is not just a technological issue. We need to build better tools to enable science to be more reproducible, but it's also a social issue. The point is that we need to give you the necessary tools to make each step of a scientific workflow reproducible, but we also need to incentivize people to keep all of those steps together so that others can reproduce them. Even if you have all of those pieces, it's perfectly possible to throw them away, to toss them out, if you don't have the incentives to keep them. And yes, some of these tools do make it easier. For example, we've seen the Jupyter notebooks that we provide being used by scientists to publish analyses in which the data sits together with the narrative that presents the conclusions, so that others can go back, re-execute the analysis, and reproduce it from beginning to end; by having the conclusions together with the code, it is easier to see the whole story, from beginning to end, all together. Jeremy Freeman is a neuroscientist who is doing beautiful work showing how to couple Spark, doing the raw analysis of his data, all the way through to his final scientific conclusions, with notebooks that go from raw data on zebrafish and mice...

I was going to ask, the zebrafish, he zaps their brains.

Exactly. So this is Jeremy showing how he uses Spark to do the raw analysis and the intermediate analysis, and our Jupyter notebooks to weave in the human narrative, because there's the raw analysis piece, but then there's telling the scientific story, right? And how to reproduce the entire process, so that the final conclusion is reproducible. This is why I'm making the point that it's a combination of a technological story and a social piece; they have to go hand in hand, right?

What you're telling us, though, is that you're at the leading edge of what ultimately we want to do in the enterprise, which is tell stories in data, reach conclusions, have them be reproducible, and have this be a collaborative environment for working with other groups.

That's precisely what we're building with our entire effort, with what began as the IPython notebook and what we now call the Jupyter notebook. The reason we transitioned the name of the project from IPython to Jupyter is that even though it was born as a Python specific project, we abstracted much of the tool away from being Python specific, and it now supports over 40 different programming languages, so we decided to call it something that wasn't Python specific. Even though we continue developing a lot of Python tools within it, the project now works with any language, and the language agnostic parts of the project are called Jupyter. That is precisely a lot of our focus: the Jupyter notebook and the Jupyter environment are about building an environment where you can put code and narrative together, where you have the narrative, you have the collaboration, and you have the code, all in one place. That's exactly what Jeremy uses to build those stories with the zebrafish, where he has the code, and it's actually talking live, interactively, to a Spark cluster, but he's presenting the results right there.

And so wait, say the three components again? The code?

There's code, but that code is executing live. It's not static, it's not a PowerPoint of code; it's code executing live.

Against the data set.

Against the data set, right. And there are the results of the code, which are live. And then there's the narrative, the English narrative, the human language narrative. Those things are presented together, and that's what constitutes a Jupyter notebook. That's what we have built, and that architecture works across over 40 programming languages; in fact, the IBM team built a native Spark kernel for that architecture.
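[A hedged sketch of those three components as they live on disk, built with the nbformat library that Jupyter itself uses; the cell contents below are illustrative, not Jeremy Freeman's actual analysis. A notebook is just a JSON document whose cells carry the narrative (markdown) and the code, with each code cell storing its executed outputs alongside the source:]

    # Sketch of a notebook's structure via the nbformat library.
    import nbformat
    from nbformat.v4 import new_notebook, new_markdown_cell, new_code_cell

    nb = new_notebook(cells=[
        # Component 1: the human narrative, as a markdown cell.
        new_markdown_cell("## Zebrafish analysis\nWhy we ran it, what we found."),
        # Component 2: live code. When executed against a kernel (for example
        # a Spark-connected one), component 3, the results, is stored in this
        # cell's "outputs" field alongside the source.
        new_code_cell("counts = sc.parallelize(raw).map(clean).countByKey()"),
    ])
    nbformat.write(nb, "analysis.ipynb")   # the file itself is plain JSON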
Well Fernando, we're getting the hook. I was going to ask you for the summary, but I think you just gave it to us, so thank you very much for providing that great insight.

Well, thanks for having me here.

Absolutely, thanks for stopping by. I'm Jeff Frick with my co-host George Gilbert, and you're watching theCUBE. We're at the Apache Spark community event here at Galvanize in downtown San Francisco. We'll be back with our next segment after this short break. Thanks for watching.