from San Jose in the heart of Silicon Valley, it's theCUBE, covering Big Data SV 2016. Now your hosts, John Furrier and George Gilbert.

Okay, welcome back everyone. We are here live in Silicon Valley for day three of coverage of Big Data Week, which comprises our event, Big Data SV, and Strata Hadoop, right across the street. This is theCUBE, SiliconANGLE's flagship program. We go out to the events and extract the signal from the noise. I'm John Furrier, with my co-host George Gilbert, analyst at wikibon.org. Our next guest is Holden Karau. She's a principal software engineer at IBM, working on big data, and the co-author of Learning Spark. Welcome to theCUBE.

Thank you for having me.

So Spark is hot. It cut the head off the Hadoop beast, as we heard earlier on theCUBE. Cloud's taking the legs out, but Hadoop is still rocking and rolling. A lot of innovation going on, and Spark's one of them. So give us the update. Learning Spark, the book's out; you co-authored that. You've got office hours coming up here this afternoon for folks who are watching, so you might be around. What's going on? I mean, how advanced is Spark? Where's the progress bar? Certainly Spark Summit East was interesting, and now the west one's coming up. What's the update?

So I think there's a lot of really exciting things happening in Spark. And the big thing is Spark 2.0 is coming this year, right, and that's really exciting, because it's an opportunity to get rid of some of the dead weight, some of the cruft that has built up, and also to add a lot of really exciting new things, right? Like going from the RDD model to the Dataset model, and allowing people to mix functional and relational queries together really, really easily and bring their expertise together, so that maybe a more traditional business analyst can more easily work with Spark and have that work productionized by traditional data engineers.
Is that a polite way of saying the hardcore Spark developers have to kind of mainstream it? Because Spark has been an example of what I call the developers that eat glass and spit nails, the DevOps guys that pioneered a lot of the stuff we saw in the early cloud days. So you've kind of got to socialize it into the mainstream enterprise. A lot of heavy lifting's got to get done, fill the gaps in, right?

Yeah, I think so. With any technology, right, your first version is great and it's wonderful; it's very lean, and it doesn't have a lot of security or other things like that. And as time goes on, you have to add all of these things to make it a really good enterprise product. And that's part of where companies like IBM come in, adding the things that the enterprise needs so that they can adopt this. And Spark SQL is also very much opening up the kinds of people that can make Spark programs, right? There are tons of business analysts who, if I asked them to write Scala code, would be like, oh no, that's quite all right. You can handle that, you can take that one. But I don't have the time to help all those business analysts, right? And so it's really powerful to be able to give them the tools that they're used to working with, and have them work with actually really, really large-scale data at the speeds that Spark delivers. So Spark SQL is really great there. And the Dataset API especially makes it a lot easier for us to mix the more traditional relational queries that the business analysts might come up with, and then in the few cases where it's like, oh, this part's really slow, I need to redo it for performance, it's really easy to do the nitty-gritty engineering parts without having to shift between systems and different paradigms.

There's a lot of other tooling getting integrated, too; it's a lot of integration.
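The mixing of relational and functional steps Karau describes can be sketched, very loosely, in plain Python, with stdlib sqlite3 standing in for Spark SQL. This is not Spark code; the table and the 10% adjustment are invented for illustration. The point is the shape of the workflow: a declarative query an analyst would write, followed by arbitrary code an engineer can later tune, without switching systems.

```python
import sqlite3

# Toy table standing in for a large dataset (in Spark, a DataFrame/Dataset).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.executemany("INSERT INTO sales VALUES (?, ?)",
                 [("west", 100.0), ("west", 250.0), ("east", 75.0)])

# Relational step: the kind of declarative query a business analyst writes.
rows = conn.execute(
    "SELECT region, SUM(amount) FROM sales GROUP BY region ORDER BY region"
).fetchall()

# Functional step: ordinary code applied to the query results, the part an
# engineer might later rewrite for performance. The 10% markup is arbitrary.
totals = {region: round(total * 1.1, 2) for region, total in rows}
```

In Spark the same two steps would run against one distributed dataset; here the handoff between SQL and code is the thing being illustrated.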
Obviously IBM has the big Spark investment, and you're involved in that. What's your take on that? Where is that investment? Because a lot of folks see IBM as validation, as a big player, but they're donating a lot to the community.

They are. And so I work at the Spark Technology Center in San Francisco, where we're focused on just open source Spark. And it's a really great thing. There are people also working on libraries on top of Spark, right? Like SystemML brings a lot more machine learning capabilities on top of Spark as well. I don't work on it personally, but they're in the same office with me, so they're clearly cool people.

They're in the elevators.

Yeah, yeah. And we plan to steal some of their code and bring it into Spark when they're not looking. But it's a lot of really smart people doing a lot of productionizing of Spark. And if you look at where the contributions are coming from, you can see that a lot of them from IBM have especially been focused on Spark SQL, and after that, the machine learning libraries are the next area of focus in terms of the number of contributions that we've been getting into Spark. So those are the two really key things that we've been working on.

Let me pop up a level on that. The contributions around SQL and machine learning: we were talking earlier about the huge investment IBM made 16, no, maybe 17 years ago in Linux, all the software they ported to it, and how it legitimized Linux against the proprietary flavors of Unix at the time, and then there was Windows, which was still teething. Is this a comparable investment?

I think in a lot of ways there are many parallels, right? Like IBM started the Linux Technology Center inside of IBM to focus on just open source Linux, and then they've done the same thing with the Spark Technology Center. But it goes beyond the parts that are just focused on the open source, right?
There's this massive shift internally to bring a lot of things to run on top of Spark. And so I field, slash ignore, a fair amount of questions from product teams about how to get their things working better on Spark. So there are a lot of IBM products which are starting to move towards using Spark as their execution engine, in the same way that we're seeing a lot of the open source Hadoop products switching from traditional MapReduce as their execution engine to using Spark, right? Like you can't make new machine learning algorithms inside of Mahout anymore if it's not in Spark. And Hive on Spark is coming, right? All these things.

It's a bandwagon kind of thing going on with Spark big time right now.

Yeah, everyone's essentially gone, well, my old execution engine, that was nice, but this new thing is so much better; it's worth it. And it's a huge investment to port all of this code, but the returns are just amazing, in terms of being able to handle problems of a completely different scale.

Any examples of scale points? Because that's what everyone's kind of betting on, that order of magnitude. What are some examples of order of magnitude you've seen?

So I work in kind of a weird spot, right? I mostly work on the machine learning and core stuff, with a bit of Python thrown in. And actually I think the Python's really cool. But a lot of what you're seeing is people that have traditionally been stuck doing single-machine processing. And, you know, you have to downsample your data, or if you're doing MapReduce, right, you can do large-scale compute, but it's no longer interactive, right?
And as a data scientist, your workflow is so different when you have to come up with a problem, think, is this maybe a solution, and then sleep on it and come back and find the answer, versus, oh, this probably doesn't work, but I'll just give it a quick shot, right? You find so many more things when you're able to do your large-scale stuff interactively.

So basically the environment matches the cadence of how someone actually thinks.

Maybe not quite, right? Like you still might have to go get a cup of coffee for a really complex thing. But it's...

You don't lose a day, though. You don't lose your step, basically. You kind of get a little bit of pause.

Can you give us some more concrete examples of products that are now moving their execution engines over to Spark, and the change in the cycle time? The responsiveness?

Yeah. Since I just work on the engine itself, I actually don't work on the products that much, so I don't have too much experience with what the changes to the products themselves look like. But there's the pandas-style stuff, where essentially people have been having to run hour-long MapReduce jobs to manually try and get back some of the functionality that they want, to do some complex analysis. And essentially they're going from hour-long or day-long jobs to like 30 minutes, or 10 minutes if their data is nice. And it's really awesome, because they're able to actually view the whole data and not have to sample.

I wanna ask you about machine learning; this is a fascinating area for us. We've been playing a lot with our data science platform that we have. Looking at it from our angle, people get confused between supervised versus unsupervised, which is a concept I would explore with you. And then also we heard last night from our analyst panel this notion of algorithms policing algorithms, because the quality of the algorithms is becoming an issue.
The data is just part of it; the envelope around the data is the algorithm, which is data as well. So who's policing the algorithms, you know?

Yeah. So there's a bunch of things policing the algorithms. Spark's pipeline model actually makes it a lot easier to do things like hyperparameter tuning, where instead of, I made my model, it's, oh wait, no, I'm actually gonna have my algorithm try and figure out what the best parameters are to make my model for me, because I don't have the time to do that. But in terms of policing the quality of the actual algorithms between releases of Spark, it's just traditional unit testing and all of those good practices that you really need in any project. For actually validating that your stuff is working really well, though, there's not a lot of great tooling there. One of the projects that I work on is Spark Validator. It's a very small project, and it's not something that people should use today. Don't install it; it's very early stage. But it's the kind of thing where you hook it into your Spark job, and maybe it builds up a historical understanding of what your Spark job looks like. And then, if the number of records you read today went down from the number of records you read yesterday, it's gonna be like, hey, maybe don't publish this model; the results are probably junk. And of course, if you're an engineer, you can come in and be like, no, no, it's cool, we actually lost 10% of our user base, I'm really sad, but go ahead and publish the model. Or you come in and you're like, oh dear God, someone dropped the table last night, we're screwed, we should go and fix that. And avoid that.

It's a notification, really a ping, if you will. A heads up.
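The record-count check Karau describes can be sketched in a few lines. This is a hypothetical illustration of the idea, not the actual Spark Validator API: keep a history of per-run counters, and refuse to auto-publish when today's numbers crater relative to the baseline. The `max_drop` threshold is an invented parameter.

```python
def should_publish(history, todays_count, max_drop=0.5):
    """Gate an automated deploy: is todays_count plausible given history?

    history     -- record counts from previous runs of the job
    todays_count -- records read by today's run
    max_drop    -- hypothetical knob: lowest acceptable fraction of baseline
    """
    if not history:
        return True  # no baseline yet; let a human decide
    baseline = sum(history) / len(history)
    return todays_count >= baseline * max_drop

history = [1000, 1040, 980]            # records read on previous runs
publish_ok = should_publish(history, 990)   # a normal day
publish_bad = should_publish(history, 120)  # counts cratered: halt the deploy
```

On the bad day the function returns False, which is exactly the "maybe don't publish this model" ping; an engineer can still override it, as she says, if the drop turns out to be real.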
It's a ping, but it's also a ping which halts an automated deploy, right? Because a lot of people, and I did a survey of how people were using Spark, about a quarter of people were automatically deploying the results of their Spark jobs into production. Which is great, right? I mean, that means it's really well tooled.

Love automation, if it's good.

Yeah, if it's good, it's great. But when it explodes, it's pretty terrible, right? Because it explodes at two o'clock in the morning, and no one likes getting paged when things hit the fan.

So in some of the research we've been doing, we talk about this progression of applications: from data lakes, where people are just getting their feet wet, often with machine learning; to intelligent systems of engagement, where there's machine learning behind the interactions to help anticipate and influence the user. But then there's a third phase we talk about, which is where machine learning is put on the design-time and run-time chain itself, improving and benchmarking, coming up with better machine learning algorithms. Where are we on that spectrum? Like, how far out is that? Is it rocket science?

So I think different organizations are in very different places, right? As you mentioned, a lot of people are still just getting their feet wet with the data lake. But I think a lot of people are at the data science on data science stage, right? Where they're collecting all of these metrics, and they have all of these analysts, and then they realize maybe some of the stuff that they're doing could be useful for each other. And so they start to do meta-analysis to figure out where their data's coming from, which pieces can be shared across the organization, and how to be good in this department. And for automating the models, we're there-ish, I guess, would be the expression.
I mean, you can turn on hyperparameter tuning and be like, good luck, have fun. But it's not a thing which a lot of people are doing yet, right? And a lot of the pipeline stuff, like the feature selection: you can just make a giant set of features, right? One of the bug reports I had was someone who had a million features and was running into some problems with that, because they were using it in Python, and Python has some corner cases. But essentially we're at the point where some people are in that phase, but I would say most people aren't really doing machine learning on their machine learning models themselves.

And to put that in perspective, it sounds like IBM is experimenting with it, which means it might be mainstream five years from now, or sooner than five years?

I think we'll see a lot more people using machine learning to do a lot of the tuning for their machine learning models much sooner than five years, maybe two. But I'm also an optimist, so who knows?

That pipeline that you're talking about is multiple machine learning algorithms. In other words, just to level set for those of us who are, what's the word, having one of those brain cramps, early onset Alzheimer's. We'll get more coffee after this. Yeah, I do need that energy drink. George does the pipeline of energy drinks. I pound them. Where were we? If you're using a million features, you've got a whole bunch of models in an ensemble, and it's hard for one person to hold in their head how to tune that.

That's very true. Ensemble training is not something where you can really sit down and think about. Or if you have a million features, it's really difficult to manually do feature selection. You're not gonna wanna do that; that's just not feasible. You're gonna use some regularization techniques to narrow that down.
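The "have my algorithm figure out the best parameters" idea is, at its core, a search over a parameter grid. Here is a deliberately tiny stdlib sketch of that search; in Spark this would be `ParamGridBuilder` plus `CrossValidator`, and the threshold classifier and toy data below are invented purely for illustration.

```python
# Toy labeled data: (feature value, label). In practice this would be a
# held-out validation set, not the training data itself.
data = [(0.1, 0), (0.4, 0), (0.6, 1), (0.9, 1)]

def accuracy(threshold, data):
    """Score a hypothetical one-parameter model: predict 1 if x >= threshold."""
    return sum((x >= threshold) == bool(y) for x, y in data) / len(data)

# The grid of candidate hyperparameters; the tuner tries each and keeps
# the best, so no human has to hand-pick the threshold.
grid = [t / 10 for t in range(1, 10)]
best_threshold = max(grid, key=lambda t: accuracy(t, data))
```

With an ensemble or a million features the grid explodes combinatorially, which is why this gets automated (and often replaced by smarter search) rather than done by hand.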
And these pipelines, so there's a sort of artificial separation, which is going away in Spark, between what's called, you know, estimators and models. But there are all of these different machine learning components that are fed together. And it's not always a straight line, right? Sometimes there are branches and they come back, right? It doesn't have to be, this is the text processing that I do, and then I'm gonna do this, and then I'm gonna do this. It can be very, very complex, and that entire thing builds our model, which is really cool, but also runs into the problem of, how the hell do I save this damn thing? Because with the whole pipeline, right, you don't just have a linear model; you have like six different models together that generate the inputs that you're using for your linear model, and your...

Well, take that one step further, which is, we wonder, in addition to those classes of apps, before I had that brain cramp: how repeatable can we make that? In other words, in the past, for decades, we've had packaged applications, but we don't see big packaged applications emerging anytime soon for this class of apps. Can you help shed some light on either why that is, or why that might be wrong in a few years' time?

I think we'll see more packaged applications to make this kind of thing easy, right? Not everyone wants to write Python or Scala code to generate their machine learning model. R is becoming increasingly popular, and there are actual, more BI-type tools that people can use to generate models. And those will probably become increasingly popular with time, but the repeatability is actually a really hard problem.
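The branching-pipeline shape she describes, several components whose outputs merge and feed a final model, can be sketched as plain function composition. Every stage here is hypothetical and single-machine; it only illustrates why "the whole pipeline" (not just the last model) is the thing you have to save.

```python
# Two feature branches computed from the same input...
def tokenize(text):
    return text.lower().split()

def length_feature(text):
    return len(text)

# ...which come back together into one feature vector...
def combine(tokens, length):
    return {"n_tokens": len(tokens), "n_chars": length}

# ...consumed by a stand-in for the final linear model (made-up weights).
def linear_model(features):
    return 2 * features["n_tokens"] + 0.1 * features["n_chars"]

text = "Spark pipelines can branch"
score = linear_model(combine(tokenize(text), length_feature(text)))
```

Serializing just `linear_model` would be useless at serving time; without the upstream branches you cannot reproduce its inputs, which is the save-the-damn-thing problem in miniature.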
And this can be especially frustrating when you're trying to improve the code that you're using to generate your models, because the new version of Spark maybe has a better optimizer, but I still wanna be able to reproduce my results from last year in case the auditors show up. It'd be really nice if I wasn't just like, oh yeah, that's how it worked last year, don't worry. I wanna be able to show them what I did. And I think that goes back somewhat to export support, right? With good export support, I think we'll also get better at storing this additional state information so that we can actually reproducibly train our models, and we can be like, okay, if I just changed the data but kept everything else the same, how would things be? Or if I keep all my data the same and just change my model, how are things? And that's maybe not something which works super well today, but I think we're getting there really quickly. And there have been a lot of improvements in this exporting and keeping all the information that we need.

Hold on, I wanna get your take on the marketplace right now. Obviously you're in the trenches; it's great to see you getting down and dirty there with the tech. But a lot of people have been complaining that there's not enough machine learning going on at the event, that we need more tech faster, especially around machine learning. Do you agree with that? Do you think we're okay? What are the areas that need to be improved? What's working? What are people doubling down on? What are your thoughts on the status of where we're at with ML?

So I think there's a lot of room for improvement in machine learning, right? It's just huge, right? There are so many possible models. And we could just spend our time trying to implement all of these models in a distributed fashion, but I don't think that would be a really good use of our time, right?
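The audit scenario above boils down to recording enough state to answer "what exactly produced this model?" A minimal stdlib sketch of that idea, not any Spark export format: persist the training parameters alongside a fingerprint of the training data, so you can show an auditor the inputs and detect when either one changed. The manifest fields and the `regParam` name are illustrative assumptions.

```python
import hashlib
import json

def training_manifest(params, records):
    """Record what went into a training run: params + a data fingerprint."""
    data_bytes = json.dumps(records, sort_keys=True).encode()
    return {
        "params": params,
        "data_sha256": hashlib.sha256(data_bytes).hexdigest(),
    }

# Same params, same data: the runs are provably identical inputs.
run1 = training_manifest({"regParam": 0.01}, [[1.0, 0], [2.0, 1]])
run2 = training_manifest({"regParam": 0.01}, [[1.0, 0], [2.0, 1]])
# Someone dropped a row: the manifest flags the changed data.
run3 = training_manifest({"regParam": 0.01}, [[1.0, 0]])
```

This supports exactly the two experiments she mentions, change the data while holding the model fixed, or change the model while holding the data fixed, because each side is fingerprinted independently.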
I think what would be really great is if the pipeline APIs and these things got to the point where making them pluggable was really easy for other people to do. And to some extent, that's part of what SystemML aims to accomplish. Even though I don't work on it, I'll...

That was a big donation, by the way. Fantastic.

Oh yeah, wonderful. And a lot of really smart people came over with that, so I'm very, very happy that I get to pick their brains. But it makes it a lot easier for the people who are just true machine learning experts, right; they don't necessarily spend a lot of time thinking about distributed systems. And SystemML tries to hide some of the distributed systems part from the machine learning implementers themselves, so they can implement things in more of an R-like...

So you're seeing a wave of this open source now, with ML out there, with SystemML, a lot more action going on. You're seeing a lot more coming.

Yeah, I think there's a lot more in the pipeline. And I think it's a really exciting time to be in this space.

Okay, two personal questions. Show the shirt there. It says, I will cut you as a unicorn. Certainly the unicorns are being cut themselves; valuations are being slashed. And second question: is there another book on the horizon?

There is. I'm working on High Performance Spark with my co-author, Rachel, who's wonderful. And we just got an early release; the first four chapters are out as of this week. So if anyone feels like giving me money, that's pretty cool. Especially if it's a corporate expense account: buy one for home and one for the office. If you're searching for it, though, you're gonna have to add O'Reilly to the end of it, because there's apparently such a thing as high performance spark plugs, and I didn't think very well when we were coming up with the title. And those are for some reason more popular than my book. It's very sad.

Well, good luck with that.
Thanks for coming on theCUBE. Sharing the insight has been fantastic.

Thank you so much for having me. It's great.

This is theCUBE. We're live at Silicon Valley, day three, wall-to-wall coverage. We go out to the events to extract the signal from the noise. And we will be machine learning our butts off in Ireland as we go to Hadoop Summit next week. We'll be right back after this short break.