Live from New York, it's theCUBE, covering Big Data New York City 2016. Brought to you by headline sponsors Cisco, IBM, NVIDIA, and our ecosystem sponsors. Now, here are your hosts, Dave Vellante and George Gilbert.

Welcome back to New York City, everybody. This is theCUBE, the worldwide leader in live tech coverage. Chuck Yarbrough is here; he's the senior director of solutions marketing and management at Pentaho. Good to see you again, Chuck. Thanks for coming on theCUBE.

Hey, thanks, I appreciate being here. It's always a pleasure.

Yeah, so it's a big week. Everybody sort of aligns their product announcements and a lot of marketing activity around this big data week, with Big Data NYC and Strata + Hadoop World going on right around the corner. What's new with Pentaho?

Well, I guess the new thing is that we announced a big data enhancement release yesterday. It went out, and basically it includes a whole bunch of things we've added to the Pentaho platform to make big data, particularly Hadoop, easier. Just some of the things: additional Spark enhancements, and we can talk a little more about what that is, things like SQL on Spark, really enabling that data pipeline that you and I have talked about so many times, easier, faster, better. We added additional support around Kafka, and there's our metadata injection, which is something I think we talked about the last time I was on. Metadata injection is really about automating transformation processes, think ETL-type processes, and making them highly dynamic. And probably the biggest thing we added is around security; Hadoop is a challenge when it comes to security. So we've added additional functionality around Kerberos, which we continue to extend and make better, and we've also integrated with Apache Sentry. Those are some of the things we've got going. We've also made it easier and more effective to ultimately get data into Hadoop. We've looked at Hadoop for a long time; our customers use it, they love it, but it still tends to be hard.

Well, the Pentaho story has always been about taking a lot of this complexity and building out a data pipeline that can be operationalized. That story hasn't changed, but what you're saying, I think, is that as the industry keeps coming out with new innovations, your challenge is, okay, which ones do we pick and include in that data pipeline, if you will. So how do you do that? How do you decide? How do you prioritize?

Well, great question. Largely, it's customers, right? What are the customers demanding? What do they need to do? Spark, we've been working with Spark for a couple of years, several years now. We've added different types of functionality, and in the latest release we're making it easier to bring SQL on Spark into that data pipeline. Now, why? Partly because all these new technologies that come in are great in and of themselves. Take Kafka, it's a great pub/sub kind of concept. It takes some skills to work with, but productionalizing it in an enterprise, that's where the challenge is. So bringing that into the data pipeline and allowing it to be another data source, another process that can be managed, orchestrated, and operationalized, that's what our customers are asking for: help us make it easier. SQL on Spark is an example, right? It's easier in many ways than what some of our customers have been dealing with in the past, with Hadoop, MapReduce, things like that.
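(Editor's note: as a rough illustration of what bringing SQL on Spark into a managed pipeline looks like, here is a minimal, generic PySpark sketch; it is not Pentaho's implementation, and the path, table name, and columns are hypothetical.)

```python
from pyspark.sql import SparkSession

# Hypothetical example: query a dataset that an ingest job has already landed in HDFS.
spark = SparkSession.builder.appName("sql-on-spark-step").getOrCreate()

claims = spark.read.parquet("hdfs:///data/landed/claims")
claims.createOrReplaceTempView("claims")

# A SQL step that later pipeline stages can consume like any other data source.
daily_totals = spark.sql("""
    SELECT claim_date, COUNT(*) AS claim_count
    FROM claims
    GROUP BY claim_date
    ORDER BY claim_date
""")
daily_totals.show()
```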
So what are customers' choices generally? They could sort of become experts in all these different tool sets and build out their own; some people do that. They could acquire a platform like yours. The cloud is increasingly becoming an option, and you play in the cloud as well. Help us understand the landscape of options customers have.

Yeah, good question. We have supported cloud environments for years and continue to do that. We see a hybrid approach, right? We're not seeing a whole lot of people doing everything in a cloud environment; they'll do some there and keep some on premise and in other third-party locations. I think the real goal, and the thing we're hearing most from our customers, is help me get data into the data lake. Help me ingest it easier, faster, and do it in a production kind of way. Simplify that process. Give me greater access to the technology to manage that data going through the pipeline, and help me secure it, because let's be honest, that's always been a challenge within the Hadoop world. And it continues to be a challenge; it's not the easiest thing to manage. So we're trying to help there, and then ultimately do this in a way that keeps your investments future-proofed, right? What you're investing in today, let's leverage that into the future, because things are going to change. Execution engines are going to continue to evolve and become better and faster, with more functionality. That's what we see.

So you mentioned security a few times. What's the secret sauce to securing the data pipeline? With Hadoop, by its very nature, data is everywhere, and the data sources are many. You can't just put a moat around it; it is everywhere. So how do you secure the data?

Well, that's right. That's why we've invested a lot into what our customers are asking for, which is extended Kerberos support. One of the things we've done is make it much easier to have multiple users accessing the Hadoop cluster and to recognize all of those users, as opposed to what many organizations will do, which is just go down to one and have only one user hitting the cluster. Little things like that may not sound like the most incredible thing, but in reality they're really hard to do. It's really difficult, it's a challenge. And then Apache Sentry, being able to really control access to specific data assets, that's what enterprises need. That's what they're asking for.

One gets the impression, listening to how far the pipeline has come, not in terms of what you've built but in terms of what's being consumed by customers, that they're at the stage where they're sort of hydrating the data lake, and then the next step is, let's do some analysis on it, and after that, let's operationalize that analysis. What are some of the barriers to getting to that analytics step right now?

Yeah, it's funny you called it hydrating the lake, I like that. I'm actually presenting a session at the conference tomorrow entitled Filling the Data Lake. So it's all about, and this is a challenge we're seeing, it's all about getting data in. Getting data into the data lake is easy, right? You could do it probably 15 different ways. There's tools, there's code, there's all kinds of things you can do. What people are asking for is, help me reduce the insanity. Help me get a handle on what I'm trying to do and make more sense out of it.
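(Editor's note: on the Kerberos point above, the practical value of recognizing each user, rather than funnelling everything through one shared account, is that access policies, for example Apache Sentry rules, can be enforced per user on specific data assets. Below is a minimal sketch of Kerberos-authenticated HDFS access from Python, assuming the `hdfs` package's Kerberos extension, a valid ticket obtained via kinit, and WebHDFS enabled on a secured cluster; the hostname and path are hypothetical, and this is not Pentaho's implementation.)

```python
from hdfs.ext.kerberos import KerberosClient

# Assumes the calling user already holds a Kerberos ticket (kinit) and that
# WebHDFS is enabled on the secured cluster. Hostname and path are hypothetical.
client = KerberosClient("https://namenode.example.com:50470")

# Because each request is authenticated as the real end user, per-user policies
# (for example, Apache Sentry rules) can be applied to specific data assets.
for entry in client.list("/data/landed"):
    print(entry)
```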
Let me give you an example, particularly in the financial services space. In fact, we called it out: one of our customers, USAble Life, is an insurance company, and in our press release we talked a little bit about that. They had a problem where they needed to onboard data into the data lake quickly and effectively, and you've got thousands of customers, thousands of different data types and data sources. I've got a similar situation with a bank in Europe; a couple of banks in Europe have the same situation. They literally have data sets that are as simple as CSVs. The problem is, every CSV is different. If you think about the traditional way of onboarding data, you could do it through code or through a tool. If you use a tool, a data integration tool like Pentaho Data Integration, typically you would have to create a transformation process for every file, and that's how we've always done it. But with what we call metadata injection, we actually allow the process to be intelligent, to interrogate that data set. Every data set being different, it will interrogate it and figure out where the metadata is, or, if it's not in the file, it will actually consume metadata from another source, apply it to the transformation at execution time, and land the data in Hadoop in exactly the format you need. Now, why is that a big deal? Again, loading data is pretty easy, but it's a big deal because when you have 100 of those and you have to maintain all of those jobs, that's a pain. When you have 6,000, that's a huge pain, and that's what we've heard. I can tell you, we've got one customer that says a simple ingest process takes about a day to create. On its own, that's not that big of a deal; times 6,000, that's a big deal. And it costs, and I think these numbers are a little low, but they described it as about $1,000 per productionalized load process. That's real money.
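(Editor's note: the metadata injection pattern described above, one template transformation driven by metadata discovered at run time, can be sketched outside of Pentaho as well. Here is a rough Python illustration, assuming each CSV either carries a header row or has its columns registered in a separate metadata store; the file paths, the registry file, and the output layout are hypothetical, and this is not Pentaho's implementation.)

```python
import csv
import json
from pathlib import Path

def load_external_schema(source_name):
    """Fetch column metadata from a separate source when the file carries none.
    The registry file is a hypothetical stand-in for a metadata store."""
    registry = json.loads(Path("schema_registry.json").read_text())
    return registry[source_name]

def ingest(csv_path, target_dir):
    """One generic transformation that adapts to each file's metadata at run time,
    rather than a hand-built job per file."""
    with open(csv_path, newline="") as f:
        sample = f.read(4096)
        f.seek(0)
        reader = csv.reader(f)
        if csv.Sniffer().has_header(sample):
            columns = next(reader)                               # metadata found in the file
        else:
            columns = load_external_schema(Path(csv_path).stem)  # injected from elsewhere
        rows = [dict(zip(columns, row)) for row in reader]

    # Land the data in one conformed layout (JSON lines here, purely for brevity).
    Path(target_dir).mkdir(parents=True, exist_ok=True)
    out_path = Path(target_dir) / (Path(csv_path).stem + ".jsonl")
    with open(out_path, "w") as out:
        for row in rows:
            out.write(json.dumps(row) + "\n")

# Usage: one loop handles many differently shaped CSVs, with no per-file job to maintain.
for path in Path("incoming").glob("*.csv"):
    ingest(path, "landed")
```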
The wonderful thing about data warehouses was that they were so curated; it was like Louis XIV's garden, it was beautiful. But at the same time, the data lake was, I'm going to put it this way, the compost pile. Now, is there a way to marry the best of each, where the compost pile has everything I could possibly ever need, but I can discover it and navigate it a little more easily? Is that something you help with, or something you partner on to make the data lake transition better?

So we do that really well. Now, we will partner with others around some types of data quality and management, but your point is really well taken. I'll tell you, I talk to people all over, and I ask them about their data warehouse; I'm an old data warehouse guy. I ask the simple question: how perfect is your data warehouse? And usually there's a little bit of a chuckle, because people are like, yeah, well, it's not always 100%. It should be, but it's not. That's the real world. But to your point, it is highly curated; it should be perfect. The data lake, by definition, what did we say when Hadoop started coming around? We said, hey, this thing's really good, you could build a data lake, and all you have to do is load all your data in the truck, back the truck up, and dump it in the lake. What did we create? A dump, right? I mean, we use those terms, I use those terms: yeah, just dump your data. In my mind I always thought, well, I'm going to think logically about this; I'm not just going to dump it and ignore it, I'm going to do some sort of management of it.

Well, that's what we're seeing. That's why so many people are talking about data lakes, and not always in a positive way. I think of a data lake as the clean, pure, pristine lake that I want to jump into; I can consume that data, perfect. Many people think of it as a swamp. So Pentaho, with our capability to ingest, simplifies that process with metadata injection and makes it highly dynamic and effective, so one transformation process can handle literally thousands of very disparate data sources. But at the same time, you have to be able to manage that data, to know more about it. You mentioned, okay, we can get the data in and then we kind of ignore it, or forget it, or just let it sit there. And many people want the source data to go in; we're seeing that with lots of organizations. Let's say it's as simple as a CSV file: copy it, put it in, piece of cake. The problem is, to make it more consumable, oftentimes we want to convert it, to conform it into something that is more curated. So what we've seen is a lot of customers adopting a process where they'll take that file, land it in the lake, and then conform that data set into Avro, or even Parquet, depending on the use case. These are things that Pentaho is supporting as part of what we're announcing in this big data enhancements release, and it's all about operationalizing that, making it easier for the enterprise, easier for the developers and the users.

So it's not necessarily Louis XIV's garden, but it's got some kind of structure to it that can then be turned into value. Is that what I'm hearing?

Exactly, exactly. If you're not paying attention to what you're dumping, what you're putting into the data lake, you're going to get a swamp, and you're not going to get the value out of it.

We're out of time, Chuck, but I wanted to give you an opportunity, since you have solutions in your title. How does Pentaho think about solutions? Are they solutions to help people sort of organize the mess in the dump? Talk about that a little bit.

Yeah, thank you. I've got the best job at Pentaho, the best job in the industry. My team looks at how customers use our products and how our products fit into the entire ecosystem, and then we look at what's repeatable. In some cases it's purely a repeatable design pattern that Pentaho can enable, such as filling the data lake. That's something we can deliver as a solution, and customers can reap the benefit quicker, faster, and cheaper than having to build it themselves. In other cases, we work with our parent; we were acquired by Hitachi Data Systems, part of the Hitachi Group, and they have some very interesting data products that we can integrate with. Plus, there are third parties where we can build solutions to, again, simplify and automate the data pipeline and give customers more value: easier, faster, cheaper, all of that.

All right, Chuck, well, thanks for stopping by theCUBE. It was great to see you again.

Hey, thanks so much. I always enjoy it.

Our pleasure. All right, keep it right there, everybody. George and I will be back with our next guest. This is theCUBE; we're live from New York City. We'll be right back.