Live from the San Jose Convention Center, extracting the signal from the noise, it's theCUBE, covering Hadoop Summit 2015, brought to you by headline sponsor Hortonworks, and by EMC, Pivotal, IBM, Pentaho, Teradata, Syncsort, and by Attunity. And now your hosts, John Furrier and George Gilbert.

Okay, welcome back, everyone. We are live in Silicon Valley for Hadoop Summit 2015. This is theCUBE, our flagship program. We go out to the events and extract the signal from the noise. I'm John Furrier, the founder of SiliconANGLE, joined by my co-host, George Gilbert, our big data analyst at wikibon.com. And our next guest is Donna Prlich, VP of Product and Solutions Marketing at Pentaho. Welcome to theCUBE.

Thank you very much. Great to see you again.

Great to see you as well. So, you guys are now part of Hitachi. That's official.

That's right. As of June 4th, we are officially Pentaho, a Hitachi Data Systems company.

Okay, so that's good. Check that box. We couldn't talk about it last time. Check that box. You have some deep pockets behind you. Big news, you've got some new releases, stuff with Amazon. Give us the update on the quick news so we can get into some of the chat.

Yeah, it's actually really appropriate given what I've been hearing in the keynotes about enterprise scaling of Hadoop, et cetera. The big news for us is really all around deploying big data in the cloud. We're seeing that a lot. So we have support for Amazon Web Services, and we're adding support for SAP HANA. We've also released a scalability study that we did at Pentaho, just to give our customers a sense of how their environments would scale, and we had some really interesting results there. And then we had pre-announced support for Spark, and in this release we actually support the orchestration of Spark jobs within Pentaho Data Integration. So, lots of big news around emerging technologies, the cloud, and this whole world of big data.
So you added some Elastic MapReduce on Amazon. That's one announcement. SAP, so large-enterprise-type stuff, which is the theme of the show. Next is the eye candy, if you will: Spark, which is really hot right now. A lot of people are digging in. So what's going on with Spark? What's your take on that? I mean, Spark is certainly emerging. We had a debate on the CrowdChat this morning with some of the folks at IBM and SiliconANGLE, some saying, hey, it's not ready for prime time. George has an opinion on Spark. People love it. The developers are going nuts for it. Apache Zeppelin, shown here, sits on top of it. What's your take?

Yeah, so in usual Pentaho fashion, we're always after the use cases. In the beginning with Hadoop, we did the same thing. We really tried to identify some key use cases where we're seeing people get value. So our Pentaho Labs folks went out and started to talk to a bunch of customers, and really where they're seeing a lot of uptick is with the data scientists and the developers. There's some real value there. Beyond that, we're not seeing too much yet. But we did realize that in cases where Pentaho has customers who want to have Spark applications running, we can now orchestrate those within Pentaho Data Integration, which speaks to that whole idea of Pentaho managing the data pipeline, or the data flow. And that just adds one more capability there.

So amazing to see. So you guys love the data operating system messaging, right? I mean, you hear that on stage.

It's perfect for us. I was listening this morning and it's really all about: what's that flow, whether it's traditional data or unstructured data, whether it's into a data warehouse or the enterprise or into Hadoop, so that we can really manage that whole flow out to analytics.
But you're giving us the, it's almost like the Frosted Mini-Wheats version, which is, you know, you guys do a lot more under the covers that makes that orchestration valuable. Others can move the data and they can transform it, but you bring along information about that data that makes it coherent and meaningful from place to place. Can you tell us a little more about that?

Yeah, absolutely. So one of the keys of Pentaho Data Integration is that if you're thinking about ingesting data, for instance, into Hadoop, it's not just about getting the data in, right? Pentaho is going to allow you to actually cleanse and transform that data and provide a level of governance. And then as you start to think about Hadoop in an environment where maybe you also have data in an enterprise data warehouse, the transformations that go on where you want to be able to blend data together, and the orchestration of all of that, is all managed through Pentaho Data Integration. So if you think about starting back from the data at the source and then moving through those different data stores that I mentioned, Pentaho is going to manage all of that and provide the governance around it, which is really what's important.

So pick a favorite Apache tool that would do this otherwise, and tell us what a developer or a DBA or a data steward would have to do in the absence of all that metadata that keeps it glued together.

The short answer is it would really be writing code. It would be writing Pig scripts or Java or MapReduce jobs to get that data where you need it to be and process it. And that's a hand-coding process. Pentaho is going to give you the visual tools to do that, to eliminate that overhead. And then even if you had all of that written, something has to orchestrate it, right? To say run that job then, this goes here, that goes there. That's really where Pentaho adds the value.
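To make the "hand-coding" point concrete, here is a minimal sketch of what a developer would be writing by hand without a visual tool: the classic mapper/shuffle/reducer contract, done in plain Python rather than actual Pig or Hadoop MapReduce code. This is purely an illustration of the pattern, not Pentaho's engine or Hadoop's API.

```python
from itertools import groupby
from operator import itemgetter

def map_phase(records):
    # A hand-written mapper: emit (key, 1) pairs, one per word.
    for line in records:
        for word in line.split():
            yield (word.lower(), 1)

def reduce_phase(pairs):
    # The shuffle: sort by key so equal words are adjacent,
    # then the reducer sums the counts for each key.
    counts = {}
    for word, group in groupby(sorted(pairs, key=itemgetter(0)), key=itemgetter(0)):
        counts[word] = sum(n for _, n in group)
    return counts

records = ["the quick brown fox", "the lazy dog"]
print(reduce_phase(map_phase(records)))
# → {'brown': 1, 'dog': 1, 'fox': 1, 'lazy': 1, 'quick': 1, 'the': 2}
```

Every job like this also needs something to schedule it, feed it inputs, and route its outputs onward, which is the orchestration layer the conversation is describing.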
And those scripts would also be describing, say, you know, when I'm talking about unique customers, I have to say where I got that data from, or if I'm talking about dates, that it's consistent with the last place I talked about dates. I assume that's also part of the script.

Exactly, yeah, exactly. So in our previous release, we announced this concept of the streamlined data refinery. As an example, we have a large financial regulatory agency, and basically what they do is recreate scenarios. Let's say they're trying to identify fraud. They need to go back through, you know, millions of transactions and figure out what happened on a particular day at a particular time. The overhead in trying to get to that data and then recreate it is really difficult. So what they've done with Pentaho is build a front end to that which says: show me these dates in this time window, and this is what I'm looking for in terms of a stock symbol. When the end user presses go, all of the transformations behind that, whether the data is from Hadoop or an enterprise data warehouse, are going to run, and that data set is delivered with the appropriate data model as well. So the accuracy and the governance are really incredible. I mean, the amount of time you can save in doing that.

Is that an example where you're embedded in a bigger app?

Yeah, that's an example where that customer has an application they've built on the front end, and then we provide the capabilities to run those transformations. But we also have customers like, for instance, Landmark, who are collecting sensor data and bringing that data together with other information about parts and maintenance, et cetera. They bring that data together and then deliver it out through applications. For instance, the guy who's standing out on the oil rig, he's got an embedded Pentaho dashboard and he's looking at this data.
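The regulator's "recreate a scenario" query boils down to a parameterized filter over a transaction store. A toy sketch of that idea, with entirely hypothetical field names and records (the source doesn't describe the agency's actual schema):

```python
from datetime import date

# Hypothetical transaction records; the field names are illustrative only.
transactions = [
    {"symbol": "ACME", "date": date(2015, 6, 9), "qty": 100},
    {"symbol": "ACME", "date": date(2015, 6, 10), "qty": 250},
    {"symbol": "INIT", "date": date(2015, 6, 10), "qty": 75},
]

def recreate_scenario(txns, symbol, start, end):
    """Return the transactions for one stock symbol inside a date window,
    the way a front end might parameterize a refinery run when the user
    presses 'go'."""
    return [t for t in txns
            if t["symbol"] == symbol and start <= t["date"] <= end]

result = recreate_scenario(transactions, "ACME",
                           date(2015, 6, 10), date(2015, 6, 10))
```

In the real deployment described above, pressing "go" would kick off the governed transformations across Hadoop and the warehouse rather than a list comprehension; the point is that the end user only supplies the parameters.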
He has no clue about the complex transformations going on behind it. But that's really the power of having that embedded for that specific user. We talk a lot about embedding at the point of impact, right? Like, where does the data need to be to actually be valuable and help that end user do their job?

So it's like you're taking stewardship of the data and guaranteeing its integrity, from the source all the way to the point of consumption.

Exactly, exactly.

So that relieves the application developer from having to do it. And you get the storage angle covered with Hitachi. You guys are pipelining the data, managing that. Steward of the data, whatever you called it, sounds great to me. My question is, we were debating earlier, and I'd like to get your opinion on this: there's an interesting debate around who's running faster and who needs to slow down. The Hadoop market exploded. Cloud was kind of like Amazon; enterprises weren't really adopting. Then private cloud became hybrid cloud, then it seems to be private cloud back again, but cloud is really accelerating in the enterprise, and Amazon's putting a lot of pressure on that. Does Hadoop have to slow down for the cloud to catch up? Or is the cloud catching up? How do you see the dynamic? Because the cloud is an engine. Because for an app developer, integration is all about writing software. The economics of software mean they don't want to write a connector for some XYZ legacy system, or custom software just to shoehorn into an enterprise, which might be a one-off. So this integration challenge is the elephant in the room, if you will, in the Hadoop space. Cloud orchestration can handle that. Containers and all this new stuff happening. Where's this coming together? What's your take on that?

Yeah, so for us, and this release is a lot about that, it's this idea of future-proofing, right?
So Pentaho, a long time ago, when we decided to support the different Hadoop distributions, built an adaptive big data layer that insulated us from the intricacies of the different distributions. As we've moved forward, and coming out of an open source heritage, that has allowed us to be really flexible. For example, we have a customer we were talking about in our press release today, the Lucky Group. They built out their initial big data environment with us, and then this year they moved all of it to the cloud. They just moved all of it to Amazon, and they use Amazon Redshift. So it was this ability, as you're talking about, right? Like, has Hadoop slowed down? Is the cloud taking off? Which one? You need to be able to be insulated from a lot of that. I think that's where a lot of the conversation is today.

So abstract away the complexities and the nuances of the configuration, and containers come first. So you guys have an open source background; the heritage does come into play.

Yeah, it does.

I mean, what's going on in the cloud is pretty heavy right now.

Yeah, and I think that's really where, you know, it's funny, that's kind of how we got that first leadership in big data with Hadoop and things like Spark, right? Being able to experiment and look at where the use cases are and then deliver some early product. The innovation model is great for us, because we've got a community of people banging on things and creating things. And so it's really helpful.

So I've got to ask you, because, you know, you light up when I talk about Spark, which is really awesome. But you highlight something important, which is, when you make a good bet and it pays off, you're proud. You guys have made some good bets. What does it look like for someone who hasn't made the good bet and is trying to wash over it?
So how does a customer look at the different companies out there and say: they're the real winners, they've made the bet, they have the trajectory, there are no diseconomies of scale, they're not trying to copy or whitewash something? How does a customer identify that?

Yeah, I'll just speak to what we've seen with our customers. We talked about these blueprints before, right? These common use cases: customer 360, data warehouse optimization, the streamlined data refinery. What we've found is that when we go in and talk to customers about, here's what we're seeing, this is what's been successful, they go, oh, that's my world. And a lot of it is pairing the emerging tech with what's established, right? If we can stay in the world of: what's the reality of what the customer's actually trying to get in terms of business value? Then we can start to say, well, here are some things we've seen, here are some common patterns. That's where that connection happens. I think we get through a lot of the hype that way, and it's less about the product and more about the business problem you're trying to solve.

And then having the technologies just work. So whether it's converged infrastructure and/or software layers, ultimately it's the app and workload people care about, and then how the data interacts, right?

Yeah, and for instance, that's why we did this scalability study for Hadoop: most customers aren't going to start with a 129-node cluster. What we did is we said, hey, Pentaho Visual MapReduce gives us some performance improvements. What if we could talk with a customer about 10 nodes and say, but when you get to 120, and I can't remember which one of the speakers this morning was talking about how much data they're adding per day, what if we could show in a scalability study how we scale?
And we actually had linear scalability. The performance, it was weird, got to a certain point where it actually started to almost improve with the amount of data you added. So it's things like that which I think are just really practical.

So having blueprints, essentially reference architectures, is key.

Yeah, I mean, I think in this kind of a market it's about understanding what people are actually doing, what they're accomplishing, and then working backwards from that and talking to your customers like that.

Those key usage scenarios, customer 360, data warehouse optimization, data refinery, can you tell us, are customers starting small on those? Because they don't want to take on a big risky project, but they can see the economic value. And then in the process of implementing it, where are they hitting the pain points, and how do you help them?

Yeah, that's a really good question. What's interesting about Pentaho, too, is that because of that open source heritage, we had a lot of really early adopters. When you think about the curve, we were getting skinned knees and all that with Hadoop, really working with early customers.

They were very early, day one.

Yeah, and one of our great customers, RichRelevance, builds recommendation engines for the retail industry. When I look at their environment, they started small in terms of the amount of data, but now every day they have to update thousands of retailers' catalogs, right? So they were like, wow, this data volume is getting huge. They did some early work with us with YARN, so when we had our YARN support, they were one of the first customers that actually went into production with it. Now they're running YARN, and they're doing some very early work with Spark right now. So I think it is that start-small approach.
And also, if you can start to add a technology without really putting the customer at risk, that's a huge place to be.

Explain Spark to the people out there who know of it, but don't know it and want to know it. I mean, certainly it's been identified as a relevant hotbed with developers. It's being operationalized now into software and into deployments. What is Spark? Why is it hot? Is it replacing Hadoop? What's all that stuff?

Yeah, so in my opinion, it's a processing engine, right? It's going to make things faster. It's going to eliminate the need for MapReduce, which is complex and creates some bottlenecks in terms of processing power. So I think it's going to change the whole concept of thinking about Hadoop as a batch-processing data story. You think about it and you go, well, if you could remove some of that and really speed things up, and allow things like real time to start to come into that world, wow, what would that do for us? So I think it's the promise of that processing power. I mentioned some of our customers whose volumes are just scaling, and then you think about machine-generated data and the Internet of Things, and real time is going to become a real consideration. I think Spark holds a lot of promise there too.

I'm going to put you on the spot.

Sure.

Give me the hottest trend that you like, in big data or at Pentaho, that's getting you really motivated and pumped up, and the most overhyped buzzword.

Well, that's a good one. I guess I would say Spark is probably one of my favorites, as you said.

Yeah, you're lighting up. Lighting up, got a spark in your eye.

You know, I like to spend a lot of time with our engineers and our product managers. I sit at a whiteboard and go, okay, explain this to me, and how I can articulate it. And I'm like, wow, this is a great concept. If this works, this changes the world of databases.
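A quick way to see why Spark felt different from batch MapReduce is its chained, lazily evaluated transformations that stay in memory. Below is a toy, pure-Python stand-in for that style; it is not Spark's actual RDD implementation, just a sketch of the lazy-lineage idea the conversation alludes to.

```python
class MiniRDD:
    """Toy in-memory stand-in for a Spark-style RDD: transformations only
    record a lineage of operations; nothing runs until an action is called."""

    def __init__(self, data, ops=None):
        self.data = list(data)
        self.ops = ops or []

    def map(self, f):
        # Lazy: return a new dataset that remembers the op, don't compute yet.
        return MiniRDD(self.data, self.ops + [("map", f)])

    def filter(self, f):
        return MiniRDD(self.data, self.ops + [("filter", f)])

    def collect(self):
        # The "action": replay the whole lineage in memory, in one pass
        # per op, with no intermediate writes to disk between stages.
        out = self.data
        for kind, f in self.ops:
            out = [f(x) for x in out] if kind == "map" else [x for x in out if f(x)]
        return out

squares = MiniRDD(range(10)).map(lambda x: x * x).filter(lambda x: x % 2 == 0).collect()
print(squares)  # → [0, 4, 16, 36, 64]
```

In classic MapReduce, each of those stages would typically be a separate job with intermediate results written to disk, which is the bottleneck the speakers are contrasting Spark against.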
And that's the other reason I think it's exciting: it's been a long time since we've had a lot of really cool technologies come into the database and analytics space.

And storage has now got flash. You've got in-memory. This is really cool, right?

Yeah, exactly. And in terms of overhyped, I guess I would have to say, well, not overhyped exactly, but one that we thought had a lot of promise and then, as we worked with customers coming out of Labs, didn't see as much interest in: Storm. It might be because Spark came on the scene. With Storm, we just didn't see the uptake in terms of the use cases. And we were really excited about it, because we were doing early work with YARN and it had some promise. So maybe it wasn't overhyped, but it was hyped, and we just didn't see it.

It looked good off the tee, as they say in the golf analogy.

Exactly. But it didn't really, yeah.

Still kind of bouncing around.

Yes.

That's what we do. So, okay, applications. What's going to happen in the application market, in your mind? Because that's a hot area right now. The workloads are driving, dictating policy, literally, to the network. DevOps is certainly part of the whole ethos of open source and Hadoop and analytics. What's missing? What needs to happen, or happen faster, with apps and software?

And those that would embed you. You know, where you're going to add value, but in a larger context.

Yeah. Well, thank you, George, because that's exactly what I was trying to say. You said that perfectly for me.

George, you didn't have to make it that easy for her.

No, but it's true. I mean, I think that's really where embedding is going to become important. We've seen it a lot with some of our customers that initially got a Hadoop cluster up and running. Maybe they're pushing some data out to their internal users.
And then suddenly it's, wow, we could be providing that data as a service via an embedded app. And then even internally, embedding analytics for other organizations into apps. So I think that's one. And that's really that point-of-impact idea: the notion that I want to switch to a tool is going to go by the wayside, right? None of us are going to want to go out of this experience and into a tool. And then I think what will be interesting is, as we start to see predictive become more prevalent, and I know one of the keynote speakers was talking about that this morning, a lot of that just becomes automatic, right? When do we just start to trigger things? Things become embedded into apps and processes and triggers. And that's really about connecting people and things. That's what it is.

Well, the ease of use and automation. I mean, one of the folks from Teradata was on earlier, talking about automating intelligence, extracting the insight. So that comes back down to your exciting area, Spark: real time, streaming. It's dynamic. It's not like a lake. I mean, the data lake's good for business intelligence and data warehousing, but when you talk about real time, there's a lot of stuff going on, a lot of new parts.

Exactly. So I think it'll be interesting to see how that technology develops. In a lot of ways, what I'm excited about is, I was working with Hadoop five years ago, when we did a lot of educating, not a lot of selling. You're like, Hadoop? What is that? And now, Spark is like the new Hadoop in a way. It's this new, exciting technology with a lot of promise.

Okay, final question. Open source. The ODP was a big part of our conversation at Big Data SV, where you were on theCUBE. Open source is evolving. I mean, it's got the pure plays.
And now you've got the big companies around cloud. Certainly you saw with Cloud Foundry, you've got the IBMs, the Pivotals, the HPs, the VMwares, all trying to get some sort of de facto, but not pure, open source play going on. Is that good or bad? Is it going to have legs? How do you see that?

Yeah, I think for us, open is always good, right? And in this world where we've got things like Spark emerging and Hadoop has taken hold, the more we can support the idea of protecting customers, future-proofing, looking at things in an open environment, I just think that's the direction we're moving in. So in that sense, it's a great thing. I mean, the old proprietary-stack vendor concept, I think it's just going to go by the wayside.

And then having some SLA-based stuff is key too.

Exactly.

Having future-proofed open source and yet some stability.

Yeah, exactly.

Donna Prlich, VP of Product and Solutions Marketing at Pentaho, here on theCUBE, live in Silicon Valley. We'll be right back with more from Silicon Valley after this short break.