This is George Gilbert on the ground at Spark Summit 2015. We're with Matei Zaharia, creator of Spark and, in a sense, master of ceremonies. A couple of topics we want to cover with Matei: the notebooks that are becoming the user interface to Spark and the Spark operating system, and then what IBM told us officially yesterday, what came out unofficially, and how that relates to all the work Spark is doing. So first, let's talk about what IBM said and what they didn't say. In the official announcement, what are they bringing over in terms of libraries and notebooks?

My understanding is that IBM has been working on connecting Spark to notebooks for interactive data exploration and visualization, and that's actually something we've seen and supported in Spark from very early on. The open source project has supported IPython notebook, the standard tool for this, ever since we added Python support, and at Databricks we built our cloud platform from the ground up around notebooks, using IPython notebook because it's the most widely used one. I think it's a great environment for data scientists and engineers to work in. It's still evolving, but it's a very nice way to integrate visualization and exploration, and it's nice to see that others believe in it too.

Fernando Perez, one of the people behind IPython, which is now the Jupyter notebook, was on a panel last night, and what he told us was that it's a perfect way for the data modeler, the data scientist, and the business analyst to collaborate, almost at a GUI level, or part GUI, part code.

That's exactly it.
It's integrating code with the results of that code. In some notebook platforms, in particular the one we've built at Databricks, it's also very easy to take a notebook, turn it into a presentation or a dashboard, and publish it to the wider business. That matters because data exploration is very iterative: you always ask new questions and get new answers back.

So that collaborative nature, and the appeal to multiple roles, pushes accessibility out into the user community.

Exactly. And even though we talk so much about big data, in the more traditional small-data world notebooks are extremely widely used and extremely popular. Everyone wants to bring that same experience to big data, so it's something users already know how to use.

Okay, now let's talk about machine learning. Yesterday IBM stated its intent to bring SystemML, its machine learning platform, over to Spark. What's its sweet spot relative to MLlib on Spark?

It's an exciting thing to see, and SystemML is a very interesting approach to machine learning. The way it works right now, SystemML is essentially a separate programming language for writing machine learning algorithms, and it has special optimizations for machine learning. The first thing they'll probably do is keep it as a separate layer that compiles down to Spark, but I think the exciting opportunity is to integrate it with the machine learning and other APIs in Spark. Spark has a really nice way to get data in from multiple sources, and it's not out of the question that we'll see that integrated with SystemML.

Okay, so now let's talk about integrating with other sources and other APIs.
One of the unique things about Spark is that you have SQL, you have machine learning, you have graph processing, and I'm missing one.

Streaming?

Streaming, of course. Now, what IBM didn't say, but what you can read between the lines, is that they're going to bring their 35 years of query optimization experience with SQL and make that a library; they're obviously going to bring SystemML; they're going to bring InfoSphere Streams; and again I'm forgetting one thing. These are deeply optimized libraries on top of your core analytics Spark OS. How might they interoperate with each other and with your libraries?

That's a really good question. It still remains to be seen which things they bring, but it would obviously be really exciting if all of this IBM IP could run on top of Spark. It really validates the platform, and it's a very forward-thinking move by IBM. When we designed Spark, from very early on we focused on the generality of the engine, because data analysis requires a mix of techniques, and a lot of the pain we saw real people have, both in usability and performance, came from having to stitch together many different systems. So we always asked how we could make this composition of workloads extremely fast. We've been using these ideas in Spark's standard library, and over the past couple of years we've been working on very standard plug-in interfaces between Spark and other software. Similar to how anyone can plug a device driver into Windows or into the Linux kernel, we want people to plug data sources and libraries into Spark, and we hope IBM is able to use some of these interfaces.
But with your libraries, say you have the beginnings of a query optimizer for SQL, and with streaming it's not quite as, shall we say, pure as the IBM streaming product. There seems to be a trade-off between building on a common core and specializing for peak performance or deep functionality. How do you make that trade-off?

That's a good question, and I think we're still figuring it out. People find a lot of value in the libraries tying into each other and being combinable with each other.

Sorry to jump in, but to have the libraries combined means you're making the choice of generality over specialization.

Exactly, of generality. But if you do it correctly, you can have generality and still have excellent performance for each workload. A lot of programming languages, operating systems, and things like that achieve this once they're well designed; they find a common abstraction to base the system on. The bigger question is how easy the APIs make it for someone to combine them, and I think everyone is still learning there. There are some benefits to IBM contributing their technology into the existing Spark libraries, and it's also easier, and perhaps commercially beneficial, to keep it separate and layered on top.

Okay, so that's an unknown.

It's a bit of an unknown, but letting you combine these workloads is something we've been trying to do and iterating on since we started, and a lot of what's happening in Spark tries to improve that.

With Matei Zaharia, creator of Spark, this is George Gilbert on the ground at Spark Summit 2015.