Covering Big Data Silicon Valley 2017. Hey, welcome back everybody. Jeff Frick here with theCUBE. We are live at the historic Pagoda Lounge in San Jose for Big Data SV, which is associated with Strata + Hadoop World across the street, as well as Big Data Week. So everything Big Data is happening in San Jose, and we're happy to be here. Love the new venue. If you're around, stop by the back of the Fairmont, the Pagoda Lounge. We're excited to be joined for this next segment by what's now become a regular: anytime we're at a Big Data event or a Spark event, Holden always stops by. Holden Karau, she's a principal software engineer at IBM. Holden, great to see you. Thank you, it's wonderful to be back yet again. Absolutely, so the Big Data meme just keeps rolling. Google Cloud Next was last week, with a lot of talk about AI and ML, and of course you're very involved in Spark, so what are you excited about these days? I'm sure you've got a couple of presentations going on across the street. Yeah, so my two presentations this week, oh wow, I should remember them. So the one that I'm doing today is with my coworker Seth Hendrickson, also at IBM, and we're going to be focused on how to use structured streaming for machine learning. I think that's really interesting, because streaming machine learning is something a lot of people seem to want to do but aren't yet doing in production, so I wanted to talk to people before they've built their systems. And then tomorrow I'm going to be talking with Joey about how to debug Spark, which is something a lot of people ask questions about but which I tend not to talk about, because it tends to scare people away. And so, you know, I try and keep the happy going. Bugs are never fun. No, no, never fun. Just picking up on that structured streaming and machine learning. Yeah.
So there's this issue of, as we move more and more towards industrial and internet of things, having to process events as they come in and make a decision. Right. There's a range of latency that's required. Totally. Where does structured streaming and ML fit today, and where might that go? So structured streaming today, latency-wise, is probably not something I would use for something like that right now. It's in the sub-second range, which is nice, but it's not what you want for live serving of decisions for your car, right? That's just not going to be feasible. But I think it certainly has the potential to get a lot faster. We've seen a lot of renewed interest in MLlib Local, which is really about making it so that we can take the models that we've trained in Spark and push them out to the edge, serve them at the edge, and apply our models on end devices. And so I'm really excited about where that's going. To be fair, part of my excitement is that someone else is doing that work, so I'm very excited that they're doing this work for me. Let me clarify on that just to make sure I understand. So there's a lot of overhead in Spark, because it runs on a cluster, because you have an optimizer, because you have the high availability or the resilience. And so you're saying we can preserve the predict, and maybe the serve part, and carve out all the other overhead for running in a very small environment. So I think for a lot of these IoT devices and things like that, it actually makes a lot more sense to do the predictions on the device itself. These models are generally megabytes in size, and we don't need a cluster to do predictions with them. We really need the cluster to train them, but for a lot of cases I think pushing the prediction out to the edge node is actually a pretty reasonable use case. And so I'm really excited that we've got some work going on there.
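The point about megabyte-sized models not needing a cluster can be made concrete with a toy sketch. This is not the MLlib Local API (which was still being developed at the time of this interview); it is just an illustration, with made-up weights, of how a model exported to an edge device can reduce to a small vector of parameters and a per-event scoring function:

```python
# Conceptual sketch only: an exported linear model is just a small bundle
# of numbers, and scoring one event is a dot product plus a threshold.
# No cluster involved; the cluster's job was training these weights.

def predict(weights, bias, features):
    """Score a single incoming event locally on the device."""
    score = bias + sum(w * x for w, x in zip(weights, features))
    return 1 if score > 0 else 0

# Hypothetical weights, as if trained in Spark and copied to the device.
weights = [0.8, -0.5, 0.3]
bias = -0.1

event = [1.0, 2.0, 0.5]   # one sensor reading arriving at the edge
label = predict(weights, bias, event)
```

The "copy the model out" step Holden mentions is exactly the step of shipping `weights` and `bias` to the device, which is why it can be left to whatever transport the deployment uses.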
Taking that one step further, we've talked to a bunch of people, both at GE at their Minds and Machines show and at IBM's Genius of Things, where you want to be able to train the models up in the cloud, where you're getting data from all the different devices, and then push the retrained model out to the edge. Can that happen in Spark, or do we have to have something else orchestrating all that? So actually pushing the model out isn't something that I would do in Spark itself. I think that's better served by other tools; Spark is not really well suited to large amounts of internet traffic, right? But it's really well suited to the training, and I think with MLlib Local it'll essentially be able to provide both sides of it, and the copy part will be left up to whoever it is that's doing the work, right? Because if you're copying over a cell network you need to do something very different than if you're broadcasting over something like XM, and something very different again for satellite. If you're at the edge on a device, would you actually be running, like you were saying earlier, structured streaming with the prediction? Right, I don't think you would use structured streaming per se on the edge device, but essentially there would be a lot of code shared between structured streaming and the code that you'd be using on the edge device, and it's being factored out now so that we can have this code sharing in Spark machine learning. And you would use structured streaming maybe on the training side, and then on the serving side you would use your custom local code. Okay, so tell us a little more about Spark ML today and how we can democratize machine learning for a bigger audience. Right, I think machine learning is great, but right now you really need a strong statistical background to be able to apply it effectively.
And we probably can't get rid of that for all problems, but I think for a lot of problems, doing things like hyperparameter tuning can actually give really powerful tools to regular engineering folks, who are smart but maybe don't have a strong machine learning background. And Spark's ML pipelines make it really easy to construct multiple stages and then just say, okay, I don't know what these parameters should be; I want you to do a search over what these different parameters could be for me. And it makes it really easy to do this as a regular engineer with less of an ML background. Could you explain that for those of us who don't know what hyperparameter tuning is? Oh, okay, sure. So the knobs, the variables. Yeah, it's going to spin the knobs on, say, our regularization parameter on our regression, and it could also spin some knobs on, say, the n-gram sizes that we were using on the inputs to something else, right? And it can compare how these knobs interact with each other, because often you don't just tune one knob, you actually have six different knobs that you want to tune, and if you just explore each one individually, you're not going to find the best setting for them working together. So this would make it easier for, as you're saying, someone who's not a data scientist to set up a pipeline that lets you predict? I think so, very much. I think it brings a lot of the benefits from the scikit-learn world to the big data world. And scikit-learn is really wonderful about making machine learning really accessible, but it's just not ready for big data. And I think this does a good job of bringing the same concepts, if not the code, to big data. Scikit-learn, if I understand, is something you'd run in a notebook, essentially on one machine? Scikit-learn can be used in a notebook environment, and generally it would run on a single machine.
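The "search over combinations of knobs" idea can be sketched in a few lines of plain Python. In Spark itself this is what `ParamGridBuilder` plus `CrossValidator` do over real pipeline stages; here the objective function and the two knobs (a regularization strength and an n-gram size) are made up purely for illustration:

```python
# Toy grid search over two interacting "knobs". The loss function is
# hypothetical; in practice it would be a cross-validated model metric.
import itertools

def loss(reg, ngram):
    # Invented objective: cheapest when reg == 0.1 and ngram == 2 together.
    return abs(reg - 0.1) + abs(ngram - 2) + 5 * abs(reg - 0.1) * abs(ngram - 2)

reg_values = [0.01, 0.1, 1.0]
ngram_values = [1, 2, 3]

# Evaluate every combination jointly rather than one knob at a time.
best = min(itertools.product(reg_values, ngram_values),
           key=lambda params: loss(*params))
# best is the (reg, ngram) pair with the lowest loss over the whole grid
```

The grid grows multiplicatively with each knob you add, which is exactly why running the search on a cluster, as Spark's tuning tools do, becomes attractive.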
And so to make that sit on Spark means you could then run it on a cluster for doing your... So this isn't actually taking scikit-learn and distributing it. This is just stealing the good concepts from scikit-learn and making them available for big data people, because scikit-learn has done a really good job of making a very intuitive machine learning interface. So just to put a fine qualifier on one thing: if you're doing the internet of things and you have Spark at the edge and you're running the model there, it's the programming model. So structured streaming is one way of programming Spark. But if you don't have structured streaming at the edge, would you just be using the core batch Spark programming model? So at the edge you wouldn't even be using batch, because you're trying to predict on individual events. You'd just be calling predict with every new event that you're getting in, and you might have a queue mechanism of some type. But essentially, if we had this batched, we would be adding additional latency, and the reason we're moving the models to the edge is to avoid that latency. So just to be clear then, on the programming model: it wouldn't be structured streaming, and we're taking out all the overhead that forced us to use batch with Spark. The reason I'm trying to clarify is that a lot of people have had this question for a long time, which is, are we going to have a different programming model at the edge from what we have at the center? Yeah, that's a great question. And I don't think the answer is finished yet, but I think the work is being done to try and make it look the same. Of course, trying to make it look the same... this is Boo; she's not actually barking at us right now, even though she looks like she might. But there will always be things which are a little bit different from the edge to your cluster.
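The latency argument above, that batching forces each event to wait for its batch to fill, can be shown with a small toy model. Everything here is illustrative arithmetic, not a measurement of any real system:

```python
# Toy model of batching delay: in a batch of size n, the i-th event of the
# batch sits in the queue until n - 1 - i later events arrive. Per-event
# serving is the batch_size == 1 case, where nothing ever waits.

def events_waited_for(event_index, batch_size):
    """How many later arrivals an event must wait for before its batch fills."""
    position_in_batch = event_index % batch_size
    return batch_size - 1 - position_in_batch

per_event   = [events_waited_for(i, 1) for i in range(6)]  # no waiting at all
micro_batch = [events_waited_for(i, 3) for i in range(6)]  # most events wait
```

This is why calling predict on each event as it arrives, rather than reusing a batch-oriented engine, is the natural fit for edge serving.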
But I think Spark has done a really good job of making things look very similar in single-node cases and multi-node cases, and I think we can probably bring the same thing to ML. Okay, so it's almost like we're coming back around. Spark took us from a single machine to a cluster, and now we have to essentially bring it back for an edge device that's really lightweight. Yeah, I think at the end of the day, just from a latency point of view, that's what we have to do for serving, for some models, not for all of them, right? If you're building a website with a recommendation system, you don't need to serve that model on an edge node, that's fine. But if you've got a car device, you can't depend on cell latency, right? You have to serve that in the car. So what are some of the other things that IBM is contributing to the ecosystem that you see having a big impact over the next couple of years? So there's a lot of really exciting things coming out of IBM, and I'm obviously pretty biased. I spend a lot of time focused on Python support in Spark, and one of the most exciting things is coming from my coworker Brian; I'm not going to say his last name in case I get it wrong. But Brian is amazing, and he's been working on integrating Arrow with Spark. And this can make it a lot easier to interoperate between the JVM languages and Python and R. So I'm really optimistic about the Python and R interfaces improving a lot in Spark, and getting a lot faster as well. And in addition to the Arrow work, we've got some work around making it a lot easier for people in R and Python to get started. The R stuff is mostly actually the Microsoft people. Thanks, Felix, you're awesome. I don't actually know which camera I should have done that to, but that's okay. Perfect, I think you got it. Cool, so Felix is amazing, and the other people working on R are too.
But I think we've both been pursuing making it so that people who are in the R or Python spaces can just use pip install, conda install, or whatever tool it is they're used to working with, to bring Spark onto their machine really easily, just like they would any other software package that they're using. Because right now, for someone getting started in Spark, if you're in the Java space it's pretty easy, but if you're in R or Python you have to do a lot of weird setup work, and it's worth it, but if we can get rid of that friction, I think we can get a lot more people in these communities using Spark. Let me ask, just as a scenario: R Server is getting fairly well integrated into SQL Server. So would you be able to use R as the language, with the Spark execution engine somehow integrated into SQL Server, as an execution engine for doing the machine learning and predicting? You definitely, well, I shouldn't say definitely, you probably could do that. I don't necessarily know if that's a good idea, but that's the kind of stuff that this would enable. It'll make it so that people who are making tools in R or Python can just use Spark as another library, right? It doesn't have to be this really special setup; it can just be a library, and they point it at the cluster and it can do whatever work they want it to do. That being said, with the SQL Server R integration, if you find yourself using that to do distributed computing, you should probably take a step back and rethink what you're doing, because it's not really set up for scale-out. You might be better off connecting your Spark cluster to your SQL Server instance using JDBC or a special driver and doing it that way, but you definitely could do it in the inverted way as well. So, last question from me.
If you look out a couple of years, how will we make machine learning accessible to a bigger and bigger audience? And I know you touched on the tuning of the knobs, hyperparameter tuning. What will it look like ultimately? I think ML pipelines are probably what things are going to end up looking like. But I think the other part is that we'll see a lot more examples of how to work with certain kinds of data. Because right now I know what I need to do when I'm ingesting some textual data, but I know that because I spent about a week trying to figure out what the hell I was doing once. And I didn't bother to write it down, and it looks like no one else bothered to write it down either. So I think we'll see a lot of tools that look very similar to the tools we have today; they'll have more options and they'll be a bit easier to use. But I think the main thing that we're really lacking right now is good documentation, good books, and just good resources for people to figure out how to use these tools. Now of course, I'm biased, because I work on these tools, so I'm like, yeah, they're pretty great. So there might be other people who are like, Holden, no, you're wrong, we need to rethink everything. But I think we can go very far with the pipeline concept. And that's good, right? The democratization of these things opens it up to more people, and more creative people solving more different problems makes the whole thing go. You can install Spark easily, you can set up an ML pipeline, you can train your model, you can start doing predictions. People who just haven't been able to do machine learning at scale can get started super easily and build a recommendation system for their small little online shop and be like, hey, you bought this, you might want to also buy Boo. She's really cute. But you can't have this one. She's not for sale though. No, no, no, not this one. Come on. No. It's a tease.
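The pipeline concept that comes up throughout this conversation can be sketched in a few lines. This is modeled loosely on the shape of Spark ML's Pipeline of stages, but the class and method names here are invented for illustration, not Spark's actual API:

```python
# Minimal pipeline sketch: each stage is fit to the data and then
# transforms it, and a Pipeline just runs its stages in order.

class Scale:
    """Rescale values into [0, 1] by the maximum seen during fit."""
    def fit(self, data):
        self.max = max(data)
        return self
    def transform(self, data):
        return [x / self.max for x in data]

class Threshold:
    """Turn scores into 0/1 predictions at a fixed cutoff."""
    def __init__(self, cutoff):
        self.cutoff = cutoff
    def fit(self, data):
        return self
    def transform(self, data):
        return [1 if x >= self.cutoff else 0 for x in data]

class Pipeline:
    def __init__(self, stages):
        self.stages = stages
    def fit(self, data):
        for stage in self.stages:
            data = stage.fit(data).transform(data)
        return self
    def transform(self, data):
        for stage in self.stages:
            data = stage.transform(data)
        return data

pipe = Pipeline([Scale(), Threshold(0.5)]).fit([2.0, 4.0, 8.0])
preds = pipe.transform([2.0, 4.0, 8.0])   # [0, 1, 1]
```

Swapping a stage, or letting a tuner search over `cutoff`, leaves the rest of the pipeline untouched, which is the composability that makes the abstraction friendly to engineers without a deep ML background.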
I'm sorry, I'm sorry. All right, all right. Well, Holden, with that we'll say goodbye for now. I'm sure we will see you in June in San Francisco at Spark Summit, and I look forward to the update. I look forward to sharing with you then. Thanks so much. And break a leg this afternoon at your presentation. Thank you. She's Holden Karau, I'm Jeff Frick, he's George Gilbert. You're watching theCUBE. We're at Big Data SV. Thanks for watching.