Welcome back to the Big Apple, everybody. This is Dave Vellante with George Gilbert. We're here at Pillars 37, just down the street from the Javits Center, where Strata + Hadoop World is going on all week. This is our sixth year at Hadoop World, and this is Big Data NYC. Donna Prlich is here. She is the Vice President of Product and Solutions at Pentaho. Donna, it's great to see you back in theCUBE again. Thanks for coming on.

Thank you, great to be here again.

So it's fun. It's a big week: all the data scientists are here, and the customers and the vendors, and it's become the place to be in the fall on the East Coast. So it's exciting. You guys have been through an amazing transformation. We remember having you on early, when you were just getting started, and of course there was the big acquisition by Hitachi. So give us the update. What's happening with Pentaho? Big announcement this week.

Yeah, great stuff. There's a lot of excitement, because we're getting close to the first day of Strata and we've announced Pentaho 6.0, a major release for Pentaho. It's really all about what we're all here to talk about this week: how do we manage data, and how do we manage the data flow? We've put some really nice capabilities into the platform to manage and automate this data pipeline. Really cool stuff.

We had Abhi Mehta on earlier, he's a popular speaker, and he was saying that automating the data pipeline is the future. When we first talked to him six years ago, he was at a big bank in New York, B of A, and he talked about building the data pipeline; now it's all about automating it. So talk about how Pentaho approaches that. Why is it so important, why are you guys good at it, and what differentiates you?

Yeah, so I think it's really interesting. If you look back to when Pentaho came into the picture, we always looked at analytics and data integration as coming together in one long process. If you think about that pipeline, the end goal is the analytics. So what we've done in Pentaho 6.0 is ask: what do we need to put in, given how people are getting value from their data today? And that's really about blending data and putting it to work in different places. We hear a lot about the data lake, the data refinery, the data warehouse: how do you bring data across those different systems and think about it as a data flow, as opposed to capturing data and keeping it stagnant? There are two areas. One is the big blend: how do we blend all that data? The second is how we manage that pipeline. You can't just look at one point here and another point over there. You have to think about how the data flows, how you can govern it, and what capabilities you need to know where that data is at any given point.

Historically, everybody has known that, generally speaking, the value of data declines over time, so that's nothing new. What is new is our ability to get to near real time, and the value opportunity now before us is much greater than it ever used to be.
It used to be accepted that we wouldn't get to the data right away; we'd get to it within a couple of days or weeks or months, whatever it was. Now you're seeing organizations really start to invest in getting to that near-real-time piece. How does Pentaho play in that? What are you seeing?

So George and I were talking earlier about this concept of moving data out of where you've traditionally done data engineering, which we would think of as ETL, the batch-oriented side of that. What's happened is that as we've built new capabilities to refine data and manage it in that center process, bringing the user closer to the data, the need for that data and those transformations to run in near real time has become even more important. So some of the capabilities we've put in, inline modeling and auto-preparation of data, let a user generate data sets on demand. You get that real-time capability, but the back end is still governed the same way we would have governed it before.

Okay, go ahead, George.

Just to elaborate on that, for comparison, for the viewers, the Gen Xers who haven't been exposed to old-style ETL: when you push out, or democratize, access to this transformation, or big blend, what did it used to look like? What were the roles of the people who did it, and how can you now make it simpler and accessible to more people?

Yeah, so with old-school ETL we're talking about data developers, people who in the old days wrote scripts to blend and bring data together. Now a lot of that, with Pentaho, is drag and drop. It's simple, so the skill level you need is not as sophisticated. Then we brought in a whole new set of data sources, Hadoop and NoSQL, and all of a sudden the skills we had before weren't the skills we needed, whether that was MapReduce or Spark or any of the other technologies. So the complexity of what you need to know has gone up, but the tools have gotten better, and what we've brought into the picture is the drag-and-drop piece, making that really simple.

On the other side, users are becoming much more comfortable with data, and I think that really started with the emergence of the analytic tools that let users do visualization and analysis on their own. In the middle, between those two, we hear this word "data scientist," but a very sophisticated data analyst, or even somebody less sophisticated who understands data, can now accomplish tasks that used to be relegated to "I've got to go ask IT." Now there's this middle ground that says, I don't have to ask; I can do it myself. The really cool thing about Pentaho is that this analytic pipeline, with the capabilities we've added, still has the same governance that used to go on, which was really the reason you waited: you needed to know that the data you got was accurate. That's where these new capabilities, blending and inline modeling, are really important.
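To make the "big blend" idea concrete, here is a minimal sketch of blending two sources into an analyst-ready data set, the kind of step that once required hand-written ETL scripts. This is illustrative Python/pandas only; the tables, columns, and roll-up logic are assumptions made up for the example, not Pentaho's tooling, which is drag-and-drop rather than code:

```python
import pandas as pd

# Hypothetical stand-ins for two sources an analyst might blend:
# CRM records (relational) and clickstream events (semi-structured).
customers = pd.DataFrame({"customer_id": [1, 2],
                          "region": ["NY", "UK"],
                          "segment": ["retail", "premium"]})
events = pd.DataFrame({"customer_id": [1, 1, 2],
                       "page": ["home", "pricing", "home"],
                       "ts": pd.to_datetime(["2015-09-29 09:00",
                                             "2015-09-29 09:05",
                                             "2015-09-29 10:00"])})

# Blend: join the operational data to the behavioral data.
blended = events.merge(customers, on="customer_id", how="left")

# Refine: roll raw events up into an analyst-ready data set.
activity = (blended.groupby(["customer_id", "region", "segment"])
                   .agg(visits=("page", "count"), last_seen=("ts", "max"))
                   .reset_index())
print(activity)
```

The point of the "democratization" Donna describes is that this join-and-aggregate step becomes a visual, governed operation rather than a script only a data developer can write or verify.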
So just to recap, there are two key capabilities: democratizing access to shaping and blending the data, building on the accessibility of the visualization tools and users' comfort with data; and the integration with governance, so you can trust what comes out of it.

Exactly, exactly. And that's where, if you look at 6.0 and we talk about managing the data pipeline, you've got to start thinking about things like lineage, lifecycle management, monitoring, and security, because that whole process has to be looked at as a whole, as opposed to as point areas.

Well, I think that governance piece is a big blind spot for a lot of organizations. They spin up some little experimental Hadoop projects, and they get bigger and bigger, and then the marketing guys do it, the sales guys do it, the logistics guys do it, and all of a sudden you've got a data quality issue, you've got a governance issue, there's no chief data officer to pull it all together, maybe we need one, and it becomes this big mess. What are you seeing in terms of the customer journey in that context?

Yeah, well, we've got a number of customers who are really pushing the boundaries there, using the capabilities I was talking about for refining and modeling data in the middle of the pipeline. We have a very large financial regulatory organization that processes something like 75 billion transactions a day, and they have to get that data to analysts who are looking for the needle in the haystack: fraud, for instance. Maybe there's suspicion of fraud in a particular area, and they have to dig in and figure out what happened on that day, what was traded, and who was involved. So how do you do that in an automated way? You have to make sure the data's accurate; there's no in-between, it's either right or it's not. And if the government says "reproduce that," you have to be able to reproduce it. So they took this concept of putting an application in the front: the end users choose the data they want to see, kick off a job, and that data set is brought back to them, and if needed it can be reproduced over and over again. That changes the whole concept of having to reproduce data and know that it's trusted. If it weren't governed on the back end, you'd have no way to do that.

So fraud is an interesting example, because we've gone from a world of sampling, where I'll let you know in four or five months whether or not something bad happened, to being able to look at all the data and make decisions in near real time. And now there's this question of balance: between notifying somebody that there might be fraud, and maybe even declining a transaction. What are you seeing there, and how is Pentaho helping move that forward? I think there's a lot of trial and error and experimentation going on with the banks.

Yeah, absolutely. And I think you can take that and apply it to a lot of other use cases that fall into this view of your customer.
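To illustrate the reproducibility pattern in the regulatory example above, here is a small, hypothetical sketch: a user's request is captured as explicit parameters, the extraction runs from those parameters, and an audit record lets the identical data set be regenerated on demand. It is a generic illustration of governed, reproducible data-set generation, not FINRA's or Pentaho's actual system; every name and field here is invented:

```python
import hashlib
import json

AUDIT_LOG = []  # in practice: a governed, append-only store

def run_request(params: dict, source_rows: list) -> list:
    """Produce a data set from explicit parameters and log how it was made."""
    result = [r for r in source_rows
              if r["trade_date"] == params["trade_date"]
              and r["symbol"] == params["symbol"]]
    # Record the exact parameters, so the identical data set can be
    # regenerated later if a regulator says "reproduce that."
    request_id = hashlib.sha256(
        json.dumps(params, sort_keys=True).encode()).hexdigest()[:12]
    AUDIT_LOG.append({"request_id": request_id, "params": params,
                      "row_count": len(result)})
    return result

rows = [{"trade_date": "2015-09-29", "symbol": "XYZ", "qty": 100},
        {"trade_date": "2015-09-29", "symbol": "ABC", "qty": 50}]
data_set = run_request({"trade_date": "2015-09-29", "symbol": "XYZ"}, rows)
print(data_set, AUDIT_LOG)
```

Because the request is data rather than an ad hoc query, replaying the audit record reproduces the exact same data set, which is the back-end governance Donna says the self-service front end depends on.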
I mean, if you're able to bring in the different data sources you need to understand your customer's behavior, then it's not a surprise that David was in New York, and then he was in London, and then he was back home and went to CVS. The fraud alerts become so closely tied to your behavior that the false alarms start to go away. But the only way you can do that is to bring in those different data sources and know that they're accurate.

Of course, the data integration piece is what you guys are working on, but there's also what I'd call a classification issue. You might be on a plane, and the transaction from the airline shows up as a fast food restaurant, and you say, wait a minute, what's that? Merchants really have to start rethinking how they communicate data to the system. So there's the underlying technology piece, and there's also business process change that seems to be occurring. I don't know if you're seeing that.

Absolutely, yeah, absolutely. And then there's that whole line of where we like being profiled and feel good about it: "wow, this company really knows me, they're going to offer Donna something special because they know she was in that store yesterday and really liked that shirt," versus "I'm starting to feel a little creeped out, because I just went online over here and got an offer for this, and that means people know a lot about what's going on." So, as you said, there's a process piece, and that's where human intervention is still going to have to be there, to make sure we're not automating things that make people feel uncomfortable.

Right. So, please.

Just along the lines of Dave's comment about Abhi Mehta coming on earlier from Tresata and saying it's all about automating the pipeline, and talking about the journey: you've integrated three broad pieces that used to be separate, data engineering, data prep or wrangling, and analytics, and once those are in place you can put governance underneath all of them, so at the end of the pipeline you get trusted data. How does that accelerate the customer journey? What did customers do before all those things were integrated, and what can they do now?

Yeah, a lot of it is time, right? It's time and speed, and then the ability to profile to the point where your marketing gets better and your sales improve, all of those things, because you've got that data and you know it's accurate, as we were just discussing. But part of what's become important, and why I think it's a bigger problem than it was in the past, is all the machine-generated data out there and the growing volumes of data. We have a customer, IMS, that does telematics, the connected car. They collect data from the cars and provide that information to insurance providers, who bring that data together to say, what kind of policy should we give George based on driving patterns in his area, et cetera. If you think about where that data is coming from and what its origins are, that's a very different problem to solve, and that's when the whole pipeline becomes so critical, across data engineering all the way through to the end analytics.
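As a hedged illustration of the telematics flow Donna describes, here is a tiny sketch that rolls raw, machine-generated connected-car events up into driver-level features of the kind an insurer might price against. The event fields and feature names are invented for the example; this is not IMS's or Pentaho's actual processing:

```python
from statistics import mean

# Hypothetical connected-car telemetry: machine-generated, high volume.
events = [
    {"driver": "george", "speed_mph": 61, "hard_brake": False},
    {"driver": "george", "speed_mph": 78, "hard_brake": True},
    {"driver": "george", "speed_mph": 55, "hard_brake": False},
]

def driver_features(evts):
    """Roll raw telemetry up into summary features for a single driver."""
    return {
        "avg_speed": mean(e["speed_mph"] for e in evts),
        "hard_brakes": sum(e["hard_brake"] for e in evts),
        "events": len(evts),
    }

print(driver_features(events))
```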
So it's critical because you're not just using one or a few operational applications. You have many applications, some internal, some external, plus data feeds, and the only way to tackle that effectively is with end-to-end integration; then you have the rich context to make a decision.

Absolutely, absolutely. And I think that's the pressure point we continually see: there are more and more data sources, and how do we begin to manage that? The technologies we have today to do it are great, and they're working, but they're new, and they come into an environment where they're not the same as the standard data warehouse technologies of the past. So how do you start to manage that, but also think ahead and future-proof it, so that if something changes tomorrow it doesn't disrupt the whole system?

So ideally you're data-source agnostic, right? That's what I'm hearing.

Yeah, we have to be.

And there's some secret sauce that allows you to do that. But if I think of the three Vs, volume, velocity, and variety, are you really solving the variety problem, or do you also cover the table stakes of the other two?

The other two as well, because part of what we do really well, if you think about that pipeline, is ingest huge volumes of data into Hadoop, for instance, and then refine that data down into analytic data sets. We can auto-model that data. So then you get to that space we were talking about, where analysts who need access to data can auto-model, auto-prepare, and deliver data sets. And on the analytic side, as you mentioned, George, there are multiple applications that data is going to serve. It's not a single tool on the other end; it's not even just Pentaho's analytics. That's where embedded analytics is huge for us. It's a huge advantage, because once our customers solve that big problem on the back end, pushing it out to their applications is easy: they can embed the analytics on the front.

So where are your customers? It's a spectrum, I know, but how do you advise them, beyond getting started? Let's not do the getting started. How do they keep pushing this forward? Because they have to differentiate; what they do with your technology is really what matters and where the value is. So how are you helping advance them? What are they doing? Are they pulling you? Are you pushing them? I wonder if you could describe that.

I think it's a little bit of both. We have our user conference, Pentaho World, coming up in a couple of weeks, October 14th, and we're excited about that.

theCUBE will be there.

Yeah, it'll be great. We have customers that presented last year; RichRelevance was one of them. They build their own recommendation engines, but they use Pentaho to ingest huge volumes of catalog data every day from the different retailers they serve, and their recommendation engines serve retailers like L'Oréal and Kmart. RichRelevance has been a customer for a couple of years now, and they were really pioneering in this space, among the first organizations to say, hey, we could get value from this and gain a competitive advantage.
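The ingest-and-refine pattern described above, landing huge volumes in Hadoop and refining them down into analytic data sets, might look something like the following PySpark sketch. The paths, fields, and filters are hypothetical placeholders, and this is a generic illustration rather than RichRelevance's or Pentaho's actual jobs:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("refine-events").getOrCreate()

# Ingest: raw, high-volume event data landed in Hadoop (path is hypothetical).
raw = spark.read.json("hdfs:///data/raw/transactions/2015-09-29/")

# Refine: filter and roll the raw feed down into a small analytic data set.
daily = (
    raw.filter(F.col("status") == "settled")
       .groupBy("account_id", "instrument")
       .agg(F.count("*").alias("trades"),
            F.sum("amount").alias("notional"))
)

# Deliver: write the analyst-facing data set back out, where BI or
# embedded-analytics tools can model and visualize it.
daily.write.mode("overwrite").parquet("hdfs:///data/refined/daily_activity/")
```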
They presented, and then we had other customers in the insurance space and financial services who said, wow, we had no idea what we could do with your platform; it's amazing to see this. So we have the early adopters who are pushing us hard, and then folks who are coming along now as the market has matured. They know they have to get behind this, and they're saying, okay, tell us how to get started. We've homed in on three core big data blueprints that help customers get started, because we've seen the patterns. That's the advantage of being an open source, early vendor in this space: we got to go through that process with the customers and really learn from them what worked and what didn't.

So how about the competitive landscape? I wonder if we could talk about that a little. I mean, you run products, so obviously you're...

It's a little crazy.

Yeah. And you guys are, I presume, the next gen; you think of yourselves that way. But there are a lot of players in this space. It's a pretty big TAM, I don't know how you look at that, but I wonder if you could size it up for us, give us the competitive landscape, and tell us how you differentiate.

Yeah, it changes every day, so it's one of those things you're constantly trying to keep up on. For us, I think the biggest space is that middle area I was describing, the one people try to put a wrapper around: the data preparation tools that are coming out, some of the data wrangling. I think that's really a function of users wanting to get their hands on the data without being constrained by IT, and then there's the other force pushing back. That's really the space where things are changing the most. There are a lot of tools in there, and a lot of them are really great, but they're parts of that pipeline. How we differentiate is by saying: you need those tools, and you need to think about how you're going to get data to whoever needs it, whether that's somebody sitting in an application trying to look at a dashboard every day, a restaurant owner, or somebody using a deep analytic discovery tool. They're going to need to get to that data, but how do you keep the governance? The only way to do that is the bigger pipeline, and that's where we've had a lot of expertise, so we really differentiate on that bigger picture.

And how about the Hitachi acquisition? What has that brought you? Obviously you get a bigger sales force and a bigger distribution channel, and there are some complementary products as well. Talk about that a little bit.

Yeah, that's obviously been great for us in terms of scale. In a couple of days we went from X number of salespeople to over 1,500, which was wonderful, along with a lot of resources and headcount.

An instant global footprint.

And data scientists: I think we now have access to something like 500 data scientists at HDS, which is wonderful. So for the kinds of customers I was talking about, IMS, for instance, doing the connected car, HDS has this whole initiative around the internet of things that matter, social innovation.
And so what we're going to be doing with them is looking at this landscape for the opportunities to really connect people and things, and start to build out solutions with them. That's what's next for Pentaho, which is exciting for us, because if you think about the internet of things, it's all about connecting things, and Pentaho will be the kind of core capability that connects the data.

And of course Pentaho World is coming up in a couple of weeks. What can we expect there?

Oh, it's going to be great. We're excited. We've got Cloudera keynoting, and Caterpillar Marine Asset Management keynoting with us; they'll be talking about the cool internet-of-things, predictive-maintenance deployment they've done with Pentaho. We've got our CEO. So that's going to be exciting. We've got about 50 different sessions, which, I'm excited to say, my team has been responsible for putting together. There are tracks on advanced analytics, business analytics in general, and data integration, and we'll have a whole section on social innovation. So it's going to be a really amazing time for our customers.

Customers presenting?

Oh, absolutely. Yeah.

Talk about that a little bit. I'm inferring it's a lot of customer content.

Absolutely. We have the Pentaho Excellence Awards, which we do every year, well, it's our second year, where we acknowledge the customers that have done really innovative and cool things. A couple that I'll mention: one is FINRA, which obviously does a lot of work in financial services. They've done some really interesting things in terms of giving their analysts access to data in a very governed way while making it very self-service. And CERN. I love to talk about CERN, and I know everybody on my team is like, can you stop talking about CERN?

Oracle's biggest customer.

Yeah, but it's fascinating, right? They have an entire international village of scientists in Switzerland, and they're running their day-to-day operations across all the different parts of the organization with Pentaho. They've extended the platform, they've pushed it, and it's really great to see these kinds of customers presenting and talking about their use cases. Those are just two.

And what have you got going on here at Strata + Hadoop World?

Come by and see Pentaho 6.0, see the big blend, see how we manage the pipeline. We'll have people there, and it's all about the big blend, so we'll be showcasing that. And we've got a session coming up on Thursday, on some work we've done with Forrester around this concept of governance and data. I think it's at 11 o'clock, and we'll be talking about some of what we've learned from that early research.

Can you share some of it with us? What's the basic story?

Yeah, I'll share one thing, because it's not published yet, but I'll go out on a limb here. There's really interesting data on how many different data sources customers manage in their environments, in terms of blending. What we found is that the majority have 50 or more, and some of them have thousands.
So when you talk about being data agnostic, it's absolutely critical for us not to be limited to specific data sources, and this big blend is really what's going to become important, of course, in the context of governance.

That's a hard problem, being data agnostic. What's the secret sauce behind that? I mean, Stonebraker just came out with his company, Tamr, trying to solve that problem, and made a big deal of it. And I said, eh, Pentaho kind of does that, plus the things we were talking about before. So what's the secret sauce behind that?

Yeah, I think there are really two things. One is the open source piece. If you're open, everything is easier, because it's an API, it's a plugin, it's a connector; it's not a big development effort where, as with proprietary software, you've got to go redo everything to make something work. And the second piece is something we actually built several years ago, early on when we came into the big data space: what we call an adaptive big data layer. It's essentially an insulating layer between us and the different data sources out there, the different Hadoop distributions, the NoSQL data stores like MongoDB. We built that layer so that when new versions come out, it's simply plug and play. We don't have to recode every time a new version ships, which makes it really easy for us to support customers.

So the architectural philosophy of openness, and open source in that layer, is really the secret sauce that allows you to skate to where the puck is going, if you will.

Exactly. IoT, real time, Spark.

Yeah, the open source piece is really huge for us. If you think about where we started in 2010, just first looking at Hadoop, it's been huge for our success. It contributes to why HDS acquired us: where they really see value is that open play as they build out solutions. If you're going to build out those kinds of solutions, you want the piece that manages the data pipeline to be very, very open.

So you're building on open source. Does a lot go back to the community?

Quite a bit, yeah. We obviously have our commercial capabilities that sit on top, and that's become important as we've gone more and more into the enterprise; there are certain things that are expected, and we've built those on top. But we release the open source version at the same time as the commercial version, and it's a big, big part of our philosophy and of where we came from as a company.

Great.
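A brief aside on the adaptive big data layer Donna describes: architecturally it resembles the classic adapter pattern, an insulating interface between pipeline logic and the specific data stores behind it. The sketch below is a minimal, hypothetical Python illustration of that general technique; the class names and connectors are invented, and this is not Pentaho's actual implementation:

```python
from abc import ABC, abstractmethod
from typing import Iterator

class DataSourceAdapter(ABC):
    """One adapter per backing store; the pipeline sees only this interface."""

    @abstractmethod
    def read(self, query: str) -> Iterator[dict]:
        ...

class MongoAdapter(DataSourceAdapter):
    def read(self, query: str) -> Iterator[dict]:
        # Real code would call a MongoDB driver here; a new server
        # version only requires updating this adapter, not the pipeline.
        yield {"source": "mongodb", "query": query}

class HiveAdapter(DataSourceAdapter):
    def read(self, query: str) -> Iterator[dict]:
        # Likewise, swapping Hadoop distributions touches only this class.
        yield {"source": "hive", "query": query}

# Registry of plug-and-play connectors, keyed by a config string.
ADAPTERS = {"mongodb": MongoAdapter(), "hive": HiveAdapter()}

def run_step(source: str, query: str) -> list:
    """A pipeline step that is agnostic to which store backs it."""
    return list(ADAPTERS[source].read(query))

print(run_step("mongodb", "{}"))
```

The design point is that a new data store or version changes only the matching adapter, never the pipeline logic that calls `run_step`, which is the plug-and-play behavior described above.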
We were talking earlier about the two pipelines. The first focus we're discussing is bringing the data all the way from the raw source to the analytic end product, and having that in a single product, whereas many competitors have it in separate, distinct ones. I imagine that cycle time gives you more agility when you want to improve the modeling and the analytic decision-making process. Now, once you've done that in your tool, and the data scientists and the data engineers are happy with that process, where do you take it? How do you put it into production, so that the system of record feeds a system of intelligence really fast, to make a recommendation or a prediction?

Yeah, so if I've understood your question correctly: say I'm a data scientist and I've built some type of algorithm, or a model for scoring or forecasting.

And all the sourcing of the data.

And the data, right.

You've used the development environment to build something that helps make the decision-making machinery smart, and now you want to put it into production so that it happens really fast. What happens then?

So if you've built out one of these models, it's simply this: as you're building the transformation, you drop your data model into that data transformation. If I've got an existing model, say I've written it in R, I can drop that model into the transformation, and it will run as part of the whole data orchestration process you were just referencing. There's something of value that can be dropped in there.

So it's the runtime in addition to the design environment.

Exactly. And that's really where we talked about this word orchestration. Within that pipeline there's a lot of orchestration that has to go on: this transformation runs now, this one has to run tomorrow, this one runs again. That piece of it is huge. So again, you've got to look at the bigger picture to solve that problem.

All right, we have to leave it there, but last question. We've been here since 2010, I guess this is our sixth year at Hadoop World, and early on it was always "the year of" something: first the year of tire kicking, then the year of the enterprise. What's 2015 in this big data world the year of?

Yeah, I think this is the year of the machine. We're going to see some fantastic, really cool things happening with machine-generated data. We're seeing it in our customer base: suddenly we have this blend going on, and we're seeing all kinds of companies. I mentioned IMS; we've got Caterpillar Marine Asset Management. It's happening, and it's happening very quickly. So the real challenge will be how quickly all of us as an industry can innovate to keep up with that and really take advantage of it. It's going to be exciting.

The second machine age is upon us, and a tip of the cap to our friends at MIT. Donna Prlich, thanks very much for coming on theCUBE and sharing Pentaho's vision.

Thanks, yeah.

And good luck going forward, and congratulations on the 6.0 announcement. Big deal.

Thank you.

All right, keep it right there, everybody. We'll be back with our next guest. This is theCUBE. We're live from the Big Apple, New York City, Pillars 37. We'll be right back.