From San Jose, California, it's theCUBE, covering Big Data Silicon Valley 2017. Hey, welcome back everyone. We're here live in Silicon Valley in San Jose for Big Data SV in conjunction with Strata Hadoop, our three days of coverage here in Silicon Valley on big data. It's our eighth year covering Hadoop World and the Hadoop ecosystem, now expanding beyond just Hadoop into AI, machine learning, IoT, and cloud computing, where all this compute is really making it happen. I'm John Furrier with my co-host, George Gilbert. Our next guest is Eric Pelkey, who's the senior director of Product Marketing at Pentaho, a company we've covered many times, including their event at Pentaho World. Thanks for joining us. Thank you for having me. So in following you guys, obviously Pentaho was once an independent company, bought by Hitachi, but still an independent group within Hitachi. That's right, very much so. Okay, so you guys have some news. Let's just jump into the news. This is hard news. You guys announced some machine learning capabilities. Exactly, yeah, yeah. So Eric Pelkey, Pentaho, we are a data integration and analytics software company. You mentioned you've been here doing this for eight years. We have been at big data for the past eight years as well. In fact, we were one of the first vendors to support Hadoop back in the day, so we've been along for the journey ever since then. What we're announcing today is really exciting. It's a set of machine learning orchestration capabilities which allows data scientists, data engineers, and data analysts to really streamline their data science process: everything from ingesting new data sources through data preparation and feature engineering, which is where a lot of data scientists spend their time, through tuning their models, which can still be programmed in R and Weka and Python and any other data science tool of choice. But what we do is we help them deploy those models as a step inside of Pentaho.
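The pattern described here, a pre-trained model dropped in as one step of a larger data workflow, can be sketched in plain Python. This is a hedged illustration only: the class names (`PipelineStep`, `ModelScoringStep`, etc.) are invented for the example and are not Pentaho's actual API.

```python
# Illustrative sketch of "a model as a step inside a pipeline".
# All names here are hypothetical, not Pentaho's real interfaces.

class PipelineStep:
    """One stage in a data workflow; steps are chained in order."""
    def run(self, rows):
        raise NotImplementedError

class FeaturePrepStep(PipelineStep):
    """Derive model features from raw records (the 'feature engineering' stage)."""
    def run(self, rows):
        return [{**r, "usage_per_day": r["usage"] / max(r["days"], 1)} for r in rows]

class ModelScoringStep(PipelineStep):
    """Apply a pre-trained model (e.g. built in R, Python, or Weka) to each row."""
    def __init__(self, model):
        self.model = model  # any callable: row of features -> prediction
    def run(self, rows):
        return [{**r, "score": self.model(r)} for r in rows]

def run_pipeline(steps, rows):
    """Push the data through each step in sequence."""
    for step in steps:
        rows = step.run(rows)
    return rows
```

Used this way, the trained model is just another stage, so prep and scoring live in one workflow:

```python
model = lambda r: 1 if r["usage_per_day"] > 10 else 0  # stand-in for a real model
out = run_pipeline([FeaturePrepStep(), ModelScoringStep(model)],
                   [{"usage": 55, "days": 5}])
```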
And then we help them update those models as time goes on. So really what this is doing is it's streamlining, it's making them more productive so that they can focus their time on things like model building rather than data preparation and feature engineering. You know, it's interesting. The market is really active right now around machine learning. Even just last week at Google Next, which is their cloud event, they made the acquisition of Kaggle, which is an open data science community. But you mentioned the three categories, data engineer, data scientist, data analyst, almost on a progression, you know, super geek to business-facing, and there's different approaches. And one of the comments from the CEO of Kaggle on the acquisition, which we wrote up at SiliconANGLE, and I found this fascinating, I want to get your commentary and reaction to it, is he says data science tools today are as early as software development tools were generations ago, meaning that all the advances in open source and tooling and software development are far along, but data science is still at that early stage and it's going to get better. So what's your reaction to that? Because this is really the demand we're seeing. There's a lot of heavy lifting going on in the data science world, yet there's a lot of runway of more stuff to do. What is that more stuff? Right, yeah, we're seeing the same thing. So last week I was at the Gartner data and analytics conference, and that was kind of the take there from one of their lead machine learning analysts: this is still really early days for data science software. So there are a lot of Apache projects out there, there's a lot of other open source activity going on, but there are very few vendors that bring to the table an integrated, full-platform approach to the data science workflow. And that's what we're bringing to market today.
But let me be clear, we're not trying to replace R or Python or MLlib, because those are the tools of the data scientists, and they're not going anywhere. They spent eight years in their PhD programs working with these tools. We're not trying to change that. And they're fluent with those tools. Very much so. But they're also spending a lot of time doing feature engineering, some research reports say between 70 and 80% of their time. What we bring to the table is a visual drag-and-drop environment to do feature engineering in a much faster, more efficient way than before. But yeah, there are a lot of disparate, siloed applications out there that all do interesting things on their own. What we're doing is bringing all of those together. And the trends are: reduce the time it takes to do stuff, and take away some of those tasks that you can use machine learning for. What unique capabilities do you guys have? Talk for a minute about just what Pentaho is doing that's unique and adds value for those guys. Yeah, sure. So the big thing is, I keep going back to the data preparation part. I mean, that's 80% of their time. That's still a really big challenge. There are other vendors out there that focus on just the data science workflow, but where we're really unique is around being able to accommodate very complex data environments and being able to onboard data. Can you give an example of those environments? So, like, geospatial data combined with data from your ERP or your CRM system, in all kinds of different formats. It might be 15 different data formats that need to be blended together and standardized before any machine learning can really happen. So the complexity is in the data.
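The "blend and standardize" step described here, records arriving from different systems with different field names and units, mapped onto one common schema, can be sketched as follows. The source systems, field names, and unit conversions are all invented for illustration.

```python
# Illustrative sketch of blending heterogeneous sources into one schema
# before any machine learning happens. Field names are hypothetical.

def from_erp(rec):
    # ERP export: asset number plus temperature in Fahrenheit
    return {"asset_id": rec["AssetNo"], "temp_c": (rec["TempF"] - 32) * 5 / 9}

def from_sensor_feed(rec):
    # Sensor feed: already metric, but with different key names
    return {"asset_id": rec["id"], "temp_c": rec["celsius"]}

def standardize(records_by_source, adapters):
    """Blend records from many sources into one uniform list of dicts."""
    out = []
    for source, recs in records_by_source.items():
        out.extend(adapters[source](r) for r in recs)
    return out
```

With one adapter per format, adding a 15th source is one more small function rather than a rewrite of the downstream model code:

```python
rows = standardize(
    {"erp": [{"AssetNo": "A1", "TempF": 212}],
     "sensor": [{"id": "A2", "celsius": 40}]},
    {"erp": from_erp, "sensor": from_sensor_feed},
)
```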
And so Pentaho, very consistent with everything else that we do outside of machine learning, is all about helping our customers solve those very complex data challenges before doing any kind of machine learning. You know, one example is a customer called Caterpillar Marine Asset Intelligence. They're doing predictive maintenance onboard container ships and ferries. They're taking data from hundreds and hundreds of sensors onboard these ships, combining that kind of operational sensor data together with geospatial data, and then they're serving up predictive maintenance alerts, if you will, giving signals when it's time to replace an engine or replace a compressor or something like that. Versus waiting for it to break. Versus waiting for it to break, exactly. So yeah, that's one of the real differentiators: that very complex data environment. And then I was starting to move toward the other differentiator, which is our end-to-end platform, which allows customers to deliver these analytics in an embedded fashion. So, kind of full circle, being able to send that signal back out to an operational system, which is sometimes a challenge because you might have to rewrite the code; deploying models is a really big challenge. Within Pentaho, because it is this fully integrated application, you can deploy the models within Pentaho and not have to jump out into a mainframe environment or something like that. So I would say the differentiators are very complex data environments and then this end-to-end approach where deploying models is much easier than ever before. So perhaps, let's talk about alternatives that customers might see. You know, you have a tool suite, and others might have to put together a suite of tools. Maybe tell us some of those. The geeky version would be the impedance mismatch, you know, the chasms you find between each tool where you have to glue them together. So what are some of those pitfalls?
Yeah, I mean, one of the challenges is you have these data scientists working in silos, you have data analysts working in silos, and you might have data engineers working in silos. One of the big pitfalls is not collaborating enough, not collaborating to the point where they can do all of this together. So that's a really big area that we see as a problem. Is it binary, not collaborating, or is it that the round trip takes so long that the quality or number of collaborations is so drastically reduced that the output is of lower quality? I think it's probably a little bit of both. I think, you know, they want to collaborate, but one person might sit in Dearborn, Michigan and the other person might sit in Silicon Valley, so there's a location challenge as well. The other challenge is some of the data analysts might sit in IT and some of the data scientists might sit in an analytics department somewhere. So it cuts across both location and functional area. So let me ask from the point of view of, you know, we've been doing these shows for a number of years, and most people have their first data lakes up and running and their first maybe one or two use cases in production; very, very sophisticated customers have done more. But what seems to be clear is the highest value coming from those projects isn't to put a BI tool in front of them so much as to do advanced analytics on that data and apply those analytics to inform a decision, whether by a person or a machine. That's exactly right. And so how do you help customers over that hump, and what are some other examples that you can share? Sure. Yeah. So speaking of transformative, I mean, that's what machine learning is all about. It helps companies transform their businesses. We like to talk about that at Pentaho. One customer industry example that I'll share is a company called IMS.
IMS is in the business of providing data and analytics to insurance companies so that the insurance companies can price insurance policies based on usage. So it's a usage model. IMS has a technology platform where they put sensors in a car, and then, using your mobile phone, they can track your driving behavior, and then your insurance premium that month reflects the driving behavior that you had during that month. So, you know, in terms of transformative, this is completely upending the insurance industry, which has always had a very fixed approach to pricing risk. Now they understand, you know, everything about your behavior. Are you turning too fast? Are you braking too fast? And they're taking it further than that, too. They're able to now do kind of a retroactive look at an accident. So after an accident, they can go back and kind of... Reconstruct it? Yeah, exactly. They can go back and kind of decompose what happened in the accident and determine whether or not it was your fault, or whether it was, in fact, the ice on the street. And so transformative, I mean, this is just changing things. And I really... So I want to get your thoughts on this. I'm just looking at some of the research. You know, we always have the good data, but there's also other data out there. But on your news, 92% of organizations plan to deploy more predictive analytics. However, 50% of organizations have difficulty integrating predictive analytics into their information architecture, which is what the research is showing. So my question to you is, there's a huge gap between the technology landscapes of front-end BI tools and then complex data integration tools. That seems to be the sweet spot where the value is created. So you have the demand, and then front-end BI is kind of sexy and cool, wow, I can power my business, but the complexity is really hard on the backend. Who's accessing it? What are the data sources? What's the governance? All these things are complicated.
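The usage-based pricing idea in the IMS example above can be sketched as a toy scoring function: a month of driving events is reduced to a risk score, and the premium scales with it. Every threshold, weight, and rate below is invented purely for illustration; this is not IMS's actual model.

```python
# Toy sketch of usage-based insurance pricing. All numbers are made up.

def risk_score(events):
    """Count risky behaviors in a month of telematics events:
    hard braking and fast cornering."""
    score = 0
    for e in events:
        if e["type"] == "brake" and e["decel_g"] > 0.4:
            score += 2  # hard braking weighted more heavily
        elif e["type"] == "turn" and e["lateral_g"] > 0.5:
            score += 1
    return score

def monthly_premium(base_rate, events):
    """Premium = base rate plus a surcharge per risk point, capped at 2x base."""
    surcharge = min(risk_score(events) * 0.03, 1.0)
    return round(base_rate * (1 + surcharge), 2)
```

A smooth month costs the base rate; a month with hard braking and fast corners costs more, which is the "premium reflects that month's behavior" point made above.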
So how do you guys reconcile the front-end BI tools and the backend complexity of integrations? Yeah, our story from the beginning has always been this one integrated platform, both for complex data integration challenges together with visualizations. And that's very similar to what this announcement is all about for the data science market. So, yeah, we're very much in line with that. So is the cart before the horse? Is it that the BI tools are really good but limited by the data? I mean, it makes sense that the data has to be key, but front-end BI could be easy if you have the data. Yeah, it's funny you say that. So I presented at the Gartner conference last week, and my topic was "This just in: it's not about analytics." Kind of in jest. Yeah, kind of cheeky. Yeah. But it drew a really big crowd. It's not about the analytics, it's about the data, right? And it's about solving the data problem before you solve the analytics problem, whether it's a simple visualization or a complex fraud machine learning problem. I mean, it's about solving the data problem first. And then to that quote, I think one of the things they were referencing was the challenging information architectures into which companies are trying to deploy models. Part of that is, when you build a machine learning model, you use R and Python and all these other tools that we're familiar with; in order to deploy that into a mainframe environment, someone has to then recode it in C++ or COBOL or something else. That can take a really long time. With our integrated approach, once you've done the feature engineering and the data preparation using our drag-and-drop environment, what's really interesting is that you're like 90% of the way there in terms of making that model production-ready. So you don't have to go back and change all of that code. It's already there, because you used it in Pentaho.
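The deployment point made here, that models break when feature logic has to be re-coded for production, comes down to sharing one feature-preparation function between training and scoring. A minimal sketch of that discipline, with an invented stand-in for real model training:

```python
# Sketch of "no recoding at deployment": the *same* feature-prep function
# runs at training time and at scoring time, so nothing is hand-ported
# to C++ or COBOL. The "training" below is a trivial stand-in.

def prepare_features(raw):
    """Shared feature logic, used identically at training and scoring time."""
    return [raw["amount"] / 100.0, 1.0 if raw["overseas"] else 0.0]

def train(rows):
    """Trivial stand-in for real R/Python training: learn a threshold
    on the first feature (the mean over the training set)."""
    feats = [prepare_features(r)[0] for r in rows]
    threshold = sum(feats) / len(feats)
    # The returned model calls prepare_features itself, so production
    # scoring never needs a second, hand-written copy of that logic.
    return lambda raw: 1 if prepare_features(raw)[0] > threshold else 0
```

Because the scoring closure reuses `prepare_features`, there is only one copy of the feature code to keep in sync, which is the "90% of the way to production-ready" claim in miniature.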
Okay, so for those two technology groups I just mentioned, it's pretty obvious. I think you had a good story there. But it creates problems. You've got product gaps, you've got organizational gaps, you have process gaps between the two. Are you guys going to solve that, or are you currently solving that today? I mean, there's a lot of little questions in there, but that seems to be the disconnect, right? I can do this, I can do that, do I do them together? Yeah, I mean, I'm sticking to my story: one integrated approach to being able to do the entire data science workflow from beginning to end, and that's where we've really excelled. So to the extent that more and more data engineers and data analysts and data scientists can get on this one platform, even if they're using R and Weka and Python, then part of it... But you guys want to close those gaps down, that's what you guys are doing, right? We want to make the process more collaborative and more efficient. So Dave Vellante has a question on CrowdChat. Dave Vellante was in the snowstorm in Boston. Dave, good to see you, hope you're doing well, shoveling out the driveway. Thanks for coming in digitally. His question is: HDS has been known for mainframes and storage, but Hitachi is an industrial giant. How is Pentaho leveraging Hitachi's IoT chops? Great question. Yeah, great question, thanks for asking. So Hitachi acquired Pentaho about two years ago. This is before my time; I've been with Pentaho for about 10 months. But one of the reasons they acquired Pentaho is because of a platform that they've announced called Lumada, which is their IoT platform. Pentaho is the analytics engine that drives that IoT platform, Lumada. Lumada is about solving more of the hardware and sensor side, bringing data in from the edge to where you can do the analytics.
So it's an incredibly great partnership between Lumada and Pentaho, and Pentaho is... A nice internal customer, too. Yeah, a nice internal customer. That's a big data solution. It's a $90 billion conglomerate. So yeah, the acquisition's been great, and we're still very much an independent company going to market on our own, but we now have a much larger channel through Hitachi's reps around the world. And you've got an IoT use case right there in front of you. Exactly, yeah. I mean, so... But you are leveraging it big time, that's what you're saying. Oh yeah, absolutely. We're a very big part of their IoT strategy. I mean, it's the analytics, like the Caterpillar example. Both of the examples that I shared with you are, in fact, IoT, not by design, but because there's a lot of... You guys seeing a lot of IoT right now? Oh yeah, we're seeing a lot of companies coming to us who have just hired a director or a vice president of IoT to go out and figure out the IoT strategy. A lot of these are manufacturing companies, or coming from industries that are inefficient and... Digitizing the business model. Yeah, yeah. What you've been talking about. So the other point about Hitachi that I'll make is that, as it relates to data science, as a $90 billion manufacturing and otherwise giant, we have a very deep bench of PhD data scientists that we can go to when there are very complex data science problems to solve at a customer site. So if a customer is struggling with some of the basics, how do I get up and running on doing machine learning, we can bring our bench of data scientists at Hitachi to bear in those engagements. And that's a really big differentiator for us. And just to be clear on one last point, you've talked about how you handle the entire life cycle of modeling, from acquiring the data and prepping it all the way through to building a model, deploying it, and updating it, which is a continuous process.
But, and I think as we've talked about before, data scientists, or just the DevOps community, have had trouble operationalizing that end of the model life cycle, where you deploy it and update it. Tell us how Pentaho helps with that. Yeah, it's a really big problem, and it's a very simple solution inside of Pentaho. It's basically a step inside of Pentaho. So in the case of fraud, let's say, for example, a prediction might say fraud, not fraud, fraud, not fraud, whatever it is. We can then bring those results, kind of full life cycle, back into the data workflow at the beginning. It's a simple drag-and-drop step inside of Pentaho to take those results. To say which were right and which were wrong. To say which were right and which were wrong and feed that back into the next prediction. We could also take it one step further, where there has to be a manual part of this too: it goes to the customer service center, they investigate, they say yes fraud, no fraud, and then that gets funneled back into the next prediction. So yeah, it's a big challenge, and it's something that's relatively easy for us to do just as part of the data science workflow inside of Pentaho. Well Eric, thanks for coming on theCUBE. Really appreciate it. Good luck with the rest of the week here. Thank you, yeah, very exciting Big Data SV week. Thanks for coming on. Yeah, thank you for having me. Okay, you're watching theCUBE here live in Silicon Valley, covering Strata Hadoop and of course our Big Data SV event. We also have a companion event called Big Data NYC. We program with O'Reilly Strata Hadoop, and of course we've been covering Hadoop World really since it was founded. This is theCUBE. I'm John Furrier, with George Gilbert. We're back with more live coverage for the next three days here inside theCUBE after this short break.