Live from San Jose, California, in the heart of Silicon Valley, it's theCUBE, covering Hadoop Summit 2016. Brought to you by Hortonworks. Now, here are your hosts, John Furrier and George Gilbert.
Okay, welcome back everyone. We are here live on day one of three days of wall-to-wall coverage, winding down day one. This is theCUBE, SiliconANGLE's flagship program. We go out to the events and extract the signal from the noise. I'm John Furrier, with my co-host George Gilbert, Wikibon big data analyst at SiliconANGLE Media. Our next guest is Chuck Yarbrough, Senior Director of Solutions Marketing at Pentaho, a Hitachi Group company. Congratulations, the merger's complete. You guys are now part of the big mothership.
Almost a year ago, yep.
Talk about the changes, what's happening. Certainly the big data market hasn't slowed down. Donna was on last year, and she really impressed everybody with the mojo of Pentaho. Obviously, and now the acquisition and putting the combination together. What's the update?
Yeah, you know, it's all good. Hitachi, a lot of people looked at it like, that's an interesting move. The reality is, it's all about the internet of things, right? And the industrial internet. And if you look at companies that do a lot in the industrial space, Hitachi is one of those. And so they're doing some very interesting things. From a Pentaho perspective, we're still Pentaho. And that was the agreement from day one, that we weren't going to change who we were. We're going to, you know...
And you have freedom too. They're not shackling you guys at all.
No, we're still Pentaho.
You were running hard before.
That's right. And running hard now. And the investment is behind us. In fact, we're continuing to grow our team, grow the numbers. Literally, I just came from a new-hire training here in Santa Clara, and a huge number of new people. So it's an exciting time for Pentaho.
Yeah, I was really impressed with you guys. We first interviewed you when we did your event. You guys were running hard, but the focus was on the value proposition, which now is obvious to everyone, right? Which was: data's got to move fast, data's going to be built into applications, there are going to be unknown things that are going to come in, and data's got to be enabled to be successful. But you've got to set it up. So you guys really made a nice business on that. So what's next, what's the next chapter for Pentaho? I mean, that's not the messaging, but for the most part.
No, so it's important to understand kind of where we came from, right? We were open source BI, or at least that's the way we were positioned. The reality is, early on our founders looked at what analytics was all about from a business intelligence perspective, and it always started with a business, or a data, problem, right? So from an early point, it was: do data integration, and do it right, to prepare data for the analytic process. And that's exactly what we do today. In fact, some of the analysts had a hard time figuring out where to put Pentaho. Because, well, you do some data integration, you do some front-end analytics, where do you fit? Well, the reality is we always had a vision that it was important to manage that entire data pipeline for an analytic purpose, to then do something with it. So if you don't prepare the data to be used right, then you're not going to get what you need out.
So let's tie that together.
So now the rage today is, you have to be horizontally scalable for the data, but yet you've got to be pre-packaged, or focused with some domain expertise, in the analytics area. So that's what everyone's talking about. So that kind of brings up the next question: if you really don't fit into a specific magic quadrant, because those quadrants are siloed and you're cutting across them, what does the solution look like? Because you guys now are enabling that pipeline for analytics.
I'll tell you.
How do you sort that out?
The interesting thing is, what we do enable are very high-scale, complex use cases that require that entire orchestration and management of data throughout what I call the analytic data pipeline, to properly conform that data, to do some of the ETL kinds of things we used to think about. In fact, we were talking about data warehousing earlier, right? So we would go to an operational data store, grab some data, do some transformations, and put it in the data warehouse, and then we'd analyze it. The reality is the data is coming too fast, too varied. It's not just operational data anymore. It's coming from all over the place, devices. And now blending that on the fly as it's moving, and delivering that in a governed approach, that's the key, right? That's what we can do. So being able to do that and deliver that intelligence is the value. One of our big customers is FINRA, the watchdog of the stock market, who has huge data assets.
So I've got to ask you the question. So this is the thing back in the day. So in big data land, George and I were basically, you know, talking about this earlier: you had the whole set of industry players running around the industry, trying to figure out, oh, open source is going to win the day, open source Hadoop is going to save the day. That was the good education for everyone with Hadoop. But Hadoop didn't go away. It morphed. Certainly Spark has taken a front seat and everything, but it's not just about open source anymore and there's a variety of choices. So you're seeing kind of a trend, I was calling it BYOT, bring your own tool to work, tool as in a tool to work on a product. So there's a diversity of tools now available, and that's a good thing. You guys have an announcement called Filling the Data Lake. I want you to take a minute to explain that, because I think that speaks to why people jumped on Hadoop. I've got to grab the data. Why am I going to throw it away? I've got compliance issues, I've got to keep it. Or they'll throw it away only because they don't want to say they have it; otherwise they store it and say, we'll figure it out later. So that then morphed into why we're storing it. That became the data lake evolution. What are you guys doing specifically around filling the data lake?
So the data lake is an interesting topic, and it's in a lot of the messaging out here. A friend of mine and founder of Pentaho, James Dixon, is the guy who kind of came up with that. He coined the term, right? So the data lake term has been around a while, but the problem is people think about the data lake and they take one side or the other: they go, oh, that's good, or oh, that's bad. And the reason why they say it's bad, I think, is because of the way we describe Hadoop: hey, it's a great environment that you can just dump all your data in. And you know what? The easiest thing to do in Hadoop is dump your data in.
But what happens is, if you just dump it in, you're literally going to create a swamp, because you're not thinking about managing that data. So our latest blueprint that we announced is Filling the Data Lake. It's really taking that very simple topic of how do I get data in, but in a managed approach? How do I reduce the complexity that happens? So I've got a couple of customers in the financial services space, very large banks, that are dealing with not hundreds of data sources. They're dealing with thousands of data sources. And typically what would happen is, George could probably name 15 different tools that could take something as simple as a CSV file and push that into Hadoop. That's easy, right? You can do that lots of different ways. The problem is when that becomes 1,000 different sources, you typically have to create a process for each one of those, right? You have to know what the metadata is, you have to define that, and you have to build that transformation process or that load process. How do you do that when it's at scale, when you have 1,000 or, in one case, 6,000?
Yeah, and flow too, so diverse connections, but also flow.
And these data files don't tend to be the same, right? They're unique. They don't all have the metadata applied. So we have some technology inside Pentaho Data Integration, which we refer to as metadata injection, but think of it as making your transformation processes highly dynamic, enabling you to inject metadata at any step along the path. So if something as simple as a CSV file doesn't have a header record, well, you've got to figure out what that metadata is. And the normal way to do that is you build it, whether it's in code or using an ETL-type tool: you would configure it, you would build it, and apply the metadata at the time you build it. Well, then you get 6,000 of those transformations, and that's the complexity, right? What was simple, meaning loading data into the lake, that was simple, but when it becomes 6,000 of those, that's complex.
Because something's going to go down.
Yeah, it's like streams, all these different streams coming into the lake, or whatever metaphor you want to use, each would have to have its own process. In some cases there'd be new data types that they don't have a process for. So do you guys have the ability to take any new flow coming in, a new data type, and do that on the fly?
Yeah, and so what this blueprint does, the reference architecture, is leverage this concept of metadata injection and making these transformation processes dynamic, so you can literally have as few as one transformation process. So somebody goes in, they think about how they build that whole ingest process, and they're able to derive the metadata at the time of execution.
So what's the impact for customers? What does that mean, what's the impact for the customer?
Bottom line is, you know, when you have a large, high-scale number of files, you can build your ingest processes much, much quicker. Because literally you could do it once, and do it intelligently, and it can interrogate the data or use a template to inject the metadata at execution time, so that you're only doing it once and supporting thousands of little files coming in that are in different formats and different flavors.
So this is a big pain point, taken away.
It's a huge, you know, it becomes painful.
How about getting data out?
So data-in management, I get it. You guys automate that.
So that's what our blueprint is, what we announced: this concept of filling the data lake.
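To make the metadata injection idea concrete, here is a minimal sketch of the pattern Chuck describes, in plain Python rather than Pentaho Data Integration: one generic ingest routine that resolves each file's column metadata at execution time, from the file's own header when it has one, or from a template when it doesn't. The directory names, the template registry, and the JSON-lines landing format are hypothetical, purely for illustration.

```python
import csv
import json
from pathlib import Path

# Hypothetical template registry for headerless sources: filename pattern -> column names.
TEMPLATES = {
    "trades_*.csv": ["trade_id", "symbol", "price", "quantity", "ts"],
}

def derive_columns(path: Path):
    """Return template columns for a known headerless file, else None (use the header row)."""
    for pattern, columns in TEMPLATES.items():
        if path.match(pattern):
            return columns
    return None

def ingest(source: Path, lake_dir: Path) -> None:
    """One generic transformation: the metadata is resolved at execution time, not build time."""
    columns = derive_columns(source)
    out = lake_dir / (source.stem + ".jsonl")        # land in the raw zone as JSON lines
    with source.open(newline="") as f, out.open("w") as o:
        reader = csv.reader(f)
        if columns is None:
            columns = next(reader, [])               # header present: read the metadata from the file
        for row in reader:
            o.write(json.dumps(dict(zip(columns, row))) + "\n")

if __name__ == "__main__":
    lake = Path("datalake/raw")
    lake.mkdir(parents=True, exist_ok=True)
    for csv_file in Path("incoming").glob("*.csv"):  # assumes a hypothetical incoming/ directory
        ingest(csv_file, lake)                       # thousands of layouts, one process
```

The point is the one made in the interview: the per-source work collapses into a piece of metadata (a template entry) instead of thousands of hand-built transformation processes.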
We're helping you just manage that.
And that was announced today.
That, and the point there is: manage what you're doing, right? Don't just dump data in, right? That's the wrong idea. That's what creates the swamp. And I like to use, I've got a picture of a beautiful lake, right? Because that's the way I think of the data lake, if I'm going to put time in to put data somewhere.
It's the 4th of July, Lake Tahoe's beautiful this time of year.
Exactly. It should be clean water. I want that clean, pristine lake, you know, to then get my governed analytics out.
I wrote a blog post called Dirty Data, and that was around Twitter data, which I was playing with at the time. So I just want to quote some research that you guys put out in your press release, from Ventana Research: big data projects require organizations to spend 46% of their time preparing the data and 52% of their time checking data quality and consistency. That's 98% of their time. Okay, I'm assuming it's not the same time, that maybe there's some overlap there. But still, a great amount of time.
Significant.
So data cleanliness is a key part of this.
Yep, yep. And that's, you know, again, part of our...
Not just the ingestion, I mean, the putting it in. Well, what do you call it? What do you call it? Injection, injection.
That's really important. So, an ingest process, again, simple, right?
Inject or ingest. So, it's a little confusing.
So, we have technology called metadata injection. It's about dynamically putting the right metadata into a process. Filling the data lake is about the ingest. The lake is ingesting the data.
Okay, just want to make sure, okay.
So, you know, having that in a controlled way helps you to, one, not get that swamp, and two, ensure that you're delivering on the governed data that, you know, you're promising your users, right? And so, if you don't have the right data... and I think you'll also see some research from Forrester in there as well, on the sheer number of data sources that are being blended together.
George, weigh in on anything in here. What's your take on this? You're not quoted in here. What's your take on this, the data lake?
Well, I was actually going to ask, there's a spectrum from dump it all in, separate, essentially silos of data, to take five years to, you know, design a schema for every last, you know, eventuality. And then there's, you know, somewhere along that spectrum. And I'm guessing, you know, no one really wants to be on either end. But there are domains of data that can be clustered together.
Yeah.
And then you iterate, you know, and add structure over time to the whole thing.
Yeah, what we're seeing repeated amongst our customers, and really what drove us into creating a blueprint, the reference architecture, implementation guide, and services offerings to support it, was really this idea: typically they want that raw data put into Hadoop, and a lot of that is, again, simple stuff, some of it's relational, some of it's coming from CSVs, some of it's other sources. But in the example of CSV, they want to copy that data in and preserve that raw source. Okay, great. But the challenge, and I think where you were going with that, is how do I use that data? If I'm going to use Hadoop effectively, you know, I probably want to then take that data and format it in something different. A lot of our customers are saying, you know what, put it in the Avro format, a format that really takes advantage of the cluster and makes it easier to do the downstream analytics on.
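To give a concrete picture of that raw-CSV-to-Avro hop, here is a minimal sketch, assuming the fastavro Python library and entirely hypothetical file names; it derives a simple all-string schema from the CSV's header and writes the records out as Avro. This is only an illustration of the conversion step, not the blueprint's actual tooling.

```python
import csv
from fastavro import parse_schema, writer

def csv_to_avro(csv_path: str, avro_path: str) -> None:
    """Convert a raw CSV into Avro, a schema-carrying, splittable format for the cluster."""
    with open(csv_path, newline="") as f:
        rows = list(csv.DictReader(f))               # assumes the raw file has a header row
    if not rows:
        return                                       # nothing to convert

    # Derive a simple all-string schema from the CSV's columns (record name is hypothetical).
    schema = parse_schema({
        "name": "raw_ingest",
        "type": "record",
        "fields": [{"name": col, "type": "string"} for col in rows[0]],
    })

    with open(avro_path, "wb") as out:
        writer(out, schema, rows)                    # rows are dicts keyed by column name

if __name__ == "__main__":
    csv_to_avro("incoming/accounts.csv", "datalake/refined/accounts.avro")
```

Avro files carry their schema with them and split cleanly across a cluster, which is the property being pointed at for downstream analytics.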
And so, this blueprint that we've released does exactly that. You know, you can bring in the raw file, or you can leave the raw file and just bring in the Avro format. But it's about enabling, and simplifying, the process to get that data in and convert it into something that's more usable, Avro.
But even with the Avro format, it's nested data, so it's sort of like, you know, hierarchical joins, and I'm sounding really techy. But you're not doing anything like the intergalactic data warehouse data modeling problem that caused so many data warehouses to fail.
No, no, the idea is to fill the lake, right?
Okay, so you fill the lake with, it's almost like, not guppies, but, you know, trout. And guppies.
I like fishing for trout a lot more than guppies.
Guppies are the bait. No, guppies are the metadata injection. That gets the trout to the lake.
Yeah, you know, so yeah, it's about, we're talking about, you know...
Raw fish.
Large, at-scale data movement and management, right? Ensuring that it's a process that can be simple.
There's all kinds of lakes. I mean, the species, if you will, getting to the fish analogy is kind of funny, but it's interesting because of the species of the data, whatever type of data. There are huge lakes, just trout in some little lakes, and then you've got massive lakes with currents and everything. So the lake size is not an issue. Or is it?
No, I think it comes down to, it always comes down to what the use case is, right? What are you actually trying to get out of it? If you're doing this just to...
Are we talking about Lake Michigan here? I mean, give me some order of magnitude, like the size of lakes you're talking about that you've had experience with, filling the lake.
Well, I mean, I can give you some idea from some of our customers. You know, FINRA, an application that runs around seven petabytes in size. That's an extremely large data lake, you know, but they're doing some unique things, looking at data, you know, stock market data.
Yeah, so it kind of depends, but it's not the size of the lake that matters, it really is...
No, no, managing that lake, ensuring that you don't turn it into a swamp, is what matters.
Yeah, but in terms of limitations, you don't have any limitations on lake size. You fill up whatever lake volume the customer wants.
Okay, so, Chuck, final question for you: explain what you're working on with Pentaho, your group, the Solutions group specifically, and why should someone want to work with you guys? What do you guys work on, so the folks watching can get a feel for some of the things, if they're not already a customer, they could be a customer for? What should they know about you guys?
Well, great. You know, it's a great time to be at Pentaho. My team focuses on solutions. We're talking about anything where the Pentaho platform can be leveraged with other things, services packages, things like the Filling the Data Lake blueprint. We're doing some interesting things. We really look at what our customers are doing. You know, where are they finding value? We've got a bunch of customers that are-
So you engage with them, it's just a process you guys have built in.
Absolutely, absolutely.
Can you take us through what that would look like, how do you just sit down?
Yeah, so, you know, I'll tell you what, there's one that I've got my team working on around cyber security, you know, analytics around what's going on on my servers. It's very interesting. And you know, there are some connections with Hitachi there, because Hitachi is an infrastructure provider. So there are some infrastructure analytics things that we're beginning to see more and more customers ask for, and that's pretty exciting. It's really interesting stuff. And it's some areas that, traditionally, from maybe a data warehousing perspective, we haven't always been involved in. And now we're seeing the need and recognizing, you know, how quickly we at Pentaho can add value.
I'll give you the final word: what's the one thing that you'd want people to know about Pentaho that they may not know?
Well, you know, I think it's important to understand that we're about big data integration and simplifying everything you have to do to prepare data for analytics. And, and this is the important part, we can then help you do what you need to do with that data, whether that's predictive analytics, driving it through an R process, you know. So prepare the data, do the analysis, and deliver that, whether that comes back in a visualization.
And scale is not an issue.
And scale is what we do well.
All right, Chuck, thanks for spending the time to share. That's a serious injection of content here on theCUBE, metadata injection and ingestion for you. Thanks for watching theCUBE. We are bringing the live content to you here at Hadoop Summit 2016. I'm John Furrier with George Gilbert. We'll be right back with more right after this short break, day one coverage.