Live from New York, it's theCUBE. Covering Big Data New York City 2016. Brought to you by headline sponsors Cisco, IBM, NVIDIA, and our ecosystem sponsors. Now, here are your hosts, Dave Vellante and Peter Burris.

Welcome back to New York City, everybody. This is theCUBE, the worldwide leader in live tech coverage. Holden Karau is here, Principal Software Engineer with IBM. Welcome to theCUBE.

Thank you for having me.

Oh, you're very welcome.

It's nice to be back.

So, what's with Boo?

So, Boo is my stuffed dog that I bring with me.

You've gotta hold Boo up; I can't even see Boo from here.

So, this is Boo. Boo comes with me to all of my conferences in case I get stressed out, and she also normally hangs out on the podium while I'm giving a talk, just in case people get bored, you know, so they can look at Boo.

So, Boo is not some new open source project.

No, no, Boo is not an open source project.

But Boo is really cute, so that counts for something. All right, so what's new in your world of Spark and machine learning?

So, there are a lot of really exciting things, right? Spark 2.0 came out, and that's really exciting because we finally got to get rid of some of the junkier APIs, and Datasets are becoming sort of the core base of everything going forward in Spark. This is bringing the Spark SQL engine to all sorts of places: the machine learning APIs are built on top of the Dataset API now, and the streaming APIs are being built on top of the Dataset APIs. And this is starting to make it a lot easier for people to work together, I think. That's one of the things I really enjoy, when we can have people from different sorts of profiles or roles work together. The support for Datasets being everywhere in Spark now lets people with more of a SQL background still write stuff that's going to be used directly in a production pipeline, and the engineers can build whatever production-ready stuff they need on top of the SQL expressions from the analysts, and do some really cool stuff there.

So, junkier APIs, what does that mean to a layperson?

Sure. It means, for example, there's this thing in Spark where one of the things you want to do is shuffle a whole bunch of data around and then look at all of the records associated with a given key, right? But when the APIs were first made, they were made by university students, very smart university students, but it started out as a grad school project. So finally, with 2.0, we were able to get rid of things like the places where we used traits like iterables rather than iterators. We had to keep supporting those minor little junkie things in the old API because you can't break people's code in a minor release, but when you do a big release like Spark 2.0, you can say, okay, you need to change your stuff now to start using Spark 2. And as a result of changing that in this one place, we're actually able to better support spilling to disk, which matters for people that have too much data to fit in memory even on the individual executors. Being able to spill to disk more effectively is really important from a performance point of view, so there's been a lot of cleanup, getting rid of things which were sort of holding us back performance-wise.
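To make that concrete, here is a minimal, hypothetical Scala sketch of the Spark 2.0 pattern described above: an analyst's SQL expression and an engineer's typed code running against the same Dataset, with the grouped operator handing back an Iterator rather than a materialized Iterable, which is what lets Spark spill a large key group to disk. The case class, data, and names are all invented for illustration.

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical record type for the example.
case class Sale(store: String, amount: Double)

object DatasetSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("dataset-sketch")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    val sales = Seq(Sale("nyc", 10.0), Sale("nyc", 5.0), Sale("sfo", 7.0)).toDS()

    // An analyst can express the logic as SQL...
    sales.createOrReplaceTempView("sales")
    val bySql = spark.sql("SELECT store, SUM(amount) AS total FROM sales GROUP BY store")

    // ...while an engineer works against the same data with typed operators.
    // mapGroups hands the rows over as an Iterator, not an Iterable, so a
    // single key's records never have to be materialized in memory at once.
    val byCode = sales.groupByKey(_.store)
      .mapGroups { (store, rows: Iterator[Sale]) =>
        (store, rows.map(_.amount).sum)
      }

    bySql.show()
    byCode.show()
    spark.stop()
  }
}
```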
So the value is there, enough value to break the...

Yeah, enough value to break the APIs. And 1.6 will continue to be updated for people that aren't ready to migrate right away, but for the people that are looking at it, it's definitely worth it. You get a whole bunch of really cool optimizations.

One of the themes at this event the last couple of years has been complexity. You guys wrote an article recently in SiliconANGLE on some of the broken promises of open source, the root of it really being complexity. So Spark addresses that to a large degree?

I think so.

Maybe you could talk about that and explain to us how, and what the impact could be for businesses.

So I think Spark does a really good job of being user-friendly. It has a SQL engine for people that aren't comfortable writing Scala or Java or Python code, but then on top of that, there are a lot of analysts who are really familiar with Python, and Spark exposes Python APIs and is working on exposing R APIs. This is making it so that if you're working on Spark, you don't have to understand the internals in a lot of depth. There are some other streaming systems where, to make them perform really well, you have to have a really deep mental model of what you're doing. But with Spark it's much simpler, the APIs are cleaner, and they're exposed in the ways that people are already used to working with their data. Because of that, they don't have to relearn large amounts of complexity; they just have to learn it in the few cases where they run into problems, because it'll work most of the time with the techniques they're used to. So I think it's really cool, especially structured streaming, which is new in Spark 2.0. Structured streaming makes it so that you can write sort of arbitrary SQL expressions on streaming data, which is really awesome. You can do aggregations without having to sit around and think about how to effectively do an aggregation over different micro-batches. That's not a problem for you to worry about; that's a problem for the Spark developers to worry about. Which, unfortunately, is sometimes a problem for me to worry about, but you know, not too often. Boo helps out when it gets too stressful.
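As an illustration of that structured streaming point, here is a hedged sketch of a streaming aggregation in Spark 2.0, assuming a local socket source on port 9999 as a stand-in for a real stream. The user writes the same groupBy/count they would write on a static DataFrame; maintaining the aggregate across micro-batches is Spark's problem.

```scala
import org.apache.spark.sql.SparkSession

object StreamingAggSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("structured-streaming-sketch")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // Each line arriving on the socket is treated as one word.
    val words = spark.readStream
      .format("socket")
      .option("host", "localhost")
      .option("port", 9999)
      .load()
      .as[String]

    // An ordinary-looking aggregation; Spark keeps the running counts
    // up to date across micro-batches behind the scenes.
    val counts = words.groupBy("value").count()

    val query = counts.writeStream
      .outputMode("complete") // re-emit the full aggregate each trigger
      .format("console")
      .start()

    query.awaitTermination()
  }
}
```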
First of all, there's a lot to learn. But there's been some great research done in places like Cornell and Penn and others about how the open source community collaborates and works together. And I'm wondering, in the open source community that's building things like Spark, especially in a domain like big data, where the use cases themselves are so complex and so important, are we starting to take some of the knowledge the contributors are developing about how to collaborate and how to work together, and finding that it makes its way into the tools, so that the whole thing starts to collaborate better?

Yeah, I think actually if you look at Spark, you can see that there are a lot of tools being built on top of Spark which are also being built with similar models. The Apache Software Foundation is a really good tool for managing projects of a certain scale, and you can see a lot of Spark-related projects that have also decided that becoming part of the Apache foundation is a good way to manage their governance and collaborate with different people.

But then there are other people who look at Spark and go, wow, there's a lot of overhead here; I don't think I'm going to have 500 people working on this project, so I'm going to model my project after something a bit simpler. And I think both of those are really valid ways of building open source tools on top of Spark. But it's really interesting: there's essentially a Spark components page, a Spark Packages list, for the community to publish the work that they're doing on top of Spark. And it's really interesting to see all of the collaborations that are happening there, sometimes even between vendors. You'll see people make tools which help everyone's data access go faster, and because it's open source, you'll see it start to get contributed into other people's data access layers as well.

But is the pedagogy of how the open source community works starting to find its way into the tools, so that people who aren't in the community but are focused on the outcomes are able to gain not only the experience of how the big data works, but also of how people working toward complex outcomes need to work?

Right, I think that's definitely happening. You can see that a lot with the collaboration layers that different people are building on top of Spark, like the different notebook solutions, which are all very focused on enabling collaboration. If you're an analyst writing some Python code on your local machine, you're probably not going to set up a GitHub repo to share it with everyone, right? But if you have a notebook, you can just send the link to your friend and be like, hey, can you take a look at this? You can share your results more easily, and you can work together a lot more collaboratively. Databricks is doing some great things there, IBM as well, and I'm sure there are other companies building great notebook solutions who I'm forgetting, but the notebooks, I think, are really empowering people to collaborate in ways that we haven't traditionally seen in the big data space before.

Collaboration, let's stay on that theme. We had eight data scientists on a panel the other night, and collaboration came up. The question, specifically from an application developer standpoint as data becomes the new development kit: how much of a data scientist do you have to become, or are you becoming, as a developer?

Right, so my role is a bit different, because I focus mostly on tools. So my data science is mostly making sure that what I'm doing is actually useful to other people, because a lot of the people that consume my stuff are data scientists. For me personally, the answer is: not a whole lot. But a lot of my friends working in more traditional data engineering roles, where they're empowering specific use cases, find themselves working really closely with data scientists, asking, okay, what are your requirements, what data do I need to get to you so you can do your job? And sometimes, if they find themselves blocked on the data scientist, they're like, well, how hard could it be? And it turns out statistics is actually pretty complicated. But sometimes they go ahead and pick up some of the tools on their own, and we get to see really cool things with really, really ugly graphs, because they do not know how to use graphing libraries. But it's really exciting.

That's good.
Machine learning is another big theme at this conference. Maybe you could share with us your perspectives on ML and what's happening there.

So I really think machine learning is very powerful, and I think machine learning in Spark is also super powerful. The traditional thing is, you down-sample your data and you train a bunch of your models, and then eventually you're like, okay, I think this is the model I want to build for real. And then you go and get your engineer to help you train it on your giant data set. But Spark, and the notebooks that are built on top of it, mean that it's entirely reasonable for data scientists to take the tools which are traditionally used in the data engineering roles and start directly applying them during their exploration phase. And so we're seeing a lot of more interesting models come to light. If you're always working with down-sampled data, it's okay; you can do reasonable exploration on down-sampled data. But once you're working with your full data set, you can find some really cool features that you wouldn't normally find, because they're just not going to show up in your down-sampled data.

And I think streaming machine learning is a really interesting thing too, because there are a lot of IoT devices and things like that, and the traditional machine learning approach is: I'm going to build a model, then I'm going to deploy it, and maybe a week later I'll consider building a new model and deploy that. It looks a lot like the old software release processes, as opposed to the more agile ones. I think streaming machine learning can look a lot more like agile software development: cool, I've got a bunch of new labeled data from our contractors, I'm going to integrate it right away, and if I don't see any regression on my cross-validation set, we're going to go ahead and deploy that today. And I think that's really exciting. I'm obviously a little biased, because some of my work right now is on enabling machine learning with structured streaming in Spark, so I obviously think my work is useful; otherwise I would be doing something else. It's entirely possible everyone will be like, "Holden, your work is terrible." But I hope not. I hope people find it useful.
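That retrain-as-data-arrives pattern can be sketched with the DStream-based streaming models that already ship in MLlib; the structured-streaming ML work mentioned above was still in progress at the time of this interview. The directories, batch interval, and feature count below are hypothetical.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.mllib.classification.StreamingLogisticRegressionWithSGD
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.streaming.{Seconds, StreamingContext}

object StreamingMLSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("streaming-ml-sketch").setMaster("local[*]")
    val ssc = new StreamingContext(conf, Seconds(10))

    // Newly labeled examples dropped into this directory get folded into
    // the model on every batch, rather than on a weekly release cycle.
    val training = ssc.textFileStream("/tmp/labeled-data").map(LabeledPoint.parse)

    val model = new StreamingLogisticRegressionWithSGD()
      .setInitialWeights(Vectors.zeros(10)) // 10 features, hypothetical

    model.trainOn(training)

    // Score a second stream with whatever the latest model happens to be;
    // keeping the label alongside lets you watch for regressions.
    val holdout = ssc.textFileStream("/tmp/holdout-data").map(LabeledPoint.parse)
    model.predictOnValues(holdout.map(lp => (lp.label, lp.features))).print()

    ssc.start()
    ssc.awaitTermination()
  }
}
```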
You were talking about sampling. When we held our first Hadoop World in 2010, Abhi Mehta was on, and he stopped by again today, of course. And he made the statement then that sampling's dead.

Yes. Sampling didn't quite die, but I think we're getting really close to killing it. Sampling will only be dead once all of the data scientists in an organization have access to the same tools that the data engineers have been using, because otherwise you'll still be sampling, and you'll still implicitly be doing your model selection on down-sampled data. And we'll probably always find an excuse to sample data, because I'm lazy and sometimes I want to just develop on my laptop. But, you know, I think we're getting close to killing a lot more of sampling.

Do you see an opportunity to start utilizing many of these tools to actually improve the process of building models, finding data sources, identifying the individuals that need access to the data? Are we going to start turning big data on the problem of big data?

Oh, that's really exciting. And, okay, sorry, this is something I find really enjoyable. Traditionally, when everyone's doing their development on their laptop, you don't get to collect a lot of metrics about what they're doing. But once you start moving everyone into a more integrated notebook environment, you can say, okay, these are the data sets that these different people are accessing, these are the things that I know about them, and you can actually train a recommendation algorithm on that to recommend other data sets to people. There are people starting to do this, and I think it's really powerful. In small companies it's maybe not super important, because I can just go and ask my coworker, hey, which data sets do I want to use? But if you're at a company of Google or IBM scale, or even maybe a 500-person company, you're not going to know all of the data sets that are available for you to work with, and the machine will actually be able to make some really interesting recommendations there.

All right, we have to leave it there; we're out of time. Holden, thanks very much.

Thank you so much for having me, and for having Boo.

Pleasure, all right, anytime. Keep right there, everybody, we'll be back with our next guest. This is theCUBE, we're live from New York City. We'll be right back.
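As a closing illustration of the recommendation idea in that last answer, here is a hedged Scala sketch: (analyst, dataset) access counts from a notebook environment treated as implicit feedback for ALS in Spark ML. The access-log schema and numbers are invented.

```scala
import org.apache.spark.ml.recommendation.ALS
import org.apache.spark.sql.SparkSession

object DatasetRecommenderSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("dataset-recommender-sketch")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // Hypothetical access log: which analyst touched which data set, how often.
    val accessLog = Seq(
      (1, 101, 12), (1, 102, 3), (2, 101, 7), (2, 103, 9), (3, 102, 4)
    ).toDF("analystId", "datasetId", "accesses")

    // Access counts are implicit feedback, not explicit ratings.
    val als = new ALS()
      .setImplicitPrefs(true)
      .setUserCol("analystId")
      .setItemCol("datasetId")
      .setRatingCol("accesses")
      .setRank(8)

    val model = als.fit(accessLog)

    // Score a candidate pair: would analyst 3 find data set 101 useful?
    val candidates = Seq((3, 101)).toDF("analystId", "datasetId")
    model.transform(candidates).show()

    spark.stop()
  }
}
```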