Live from New York, it's theCUBE, covering Big Data NYC 2015. Brought to you by Hortonworks, IBM, EMC and Pivotal. Now your hosts, John Furrier and George Gilbert.

Hey, welcome back everyone. We are here live in New York City for SiliconANGLE's theCUBE, our flagship program. We go out to the events and extract the signal from the noise. We're here as part of our Big Data NYC event in conjunction with Strata Hadoop, one block, 100 yards away from the Javits Center in New York City. I'm John Furrier, the founder of SiliconANGLE, with my co-host George Gilbert. Our next guest is Ben Sharma, the CEO of Zaloni. Welcome to theCUBE.

Thank you, John. Thank you, George.

Great story of your company. Self-funded, been around since 2007, with Hadoop since 2009. Self-funded, huge client list: American Express, GE, Verizon, not only blue chips but large-scale companies. So tell us about what you guys do and why the success.

Sure, and thanks for having me here. Zaloni focuses on data management, enterprise data management on Hadoop. In doing so, we're seeing a lot of our customers trying to migrate to a Hadoop data lake architecture, so we help them build and manage a data lake. That includes some of the problems you see in the enterprise in terms of managed data ingestion, metadata management, data governance, cataloging, and the other features needed to build and deploy use cases in production.

So I hear you guys have a big announcement to share exclusively here on theCUBE. We have, George, an exclusive product announcement. Launch your product.

Thank you. We're really excited to talk about Mica, which is our enterprise-wide business self-service tool, with a data catalog and self-service data preparation, on top of a managed and governed data lake built on Bedrock. Basically, what we're enabling business users to do is wrangle the data, come up with self-service refinements and enrichments, and then operationalize them using our Bedrock platform, so that the result becomes part of a managed data pipeline.

So let me drill into that just a bit. The data wrangling and the preparation of the data, would that be done by a data engineer or a data scientist?

It could be done by business analysts or data scientists. It makes their process much more efficient, so they don't have to spend all their time depending on IT and can do these things themselves.

I'm referring more to Bedrock right now; I want to get to the catalog in a moment.

Sure. What Bedrock does is enable our customers to build and manage the data lake. We're able to ingest the data, and while ingesting it we capture metadata and ensure the quality of the data, and then we allow the transformations that need to happen when you're bringing in a lot of raw data sets and creating refined data sets. By doing so, we're already capturing enough metadata that we can surface it at the Mica level, so that business users can get value out of the data lake once you've built it.

In other words, it's sort of like the way a hardcore data developer might go into a database system catalog to find what tables and queries are there. This is a higher-level construct, so a business analyst could go in and see what in the data lake has been cleaned up and is presentable and consumable.

That is correct.
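To make that catalog idea concrete, here is a minimal sketch in Python of what a business-facing catalog entry and a simple search over it might look like. Mica's actual interface isn't public, so every class, field, and value below is hypothetical.

```python
from dataclasses import dataclass, field

# Hypothetical sketch of a business-facing data catalog entry.
# Zaloni Mica's real interface is not shown here; names are illustrative.

@dataclass
class CatalogEntry:
    name: str                  # business-friendly dataset name
    location: str              # where the refined data lives in the lake
    description: str           # plain-language summary for analysts
    quality_score: float       # KPI computed at ingestion time (0.0 to 1.0)
    certified: bool            # has IT blessed this set for consumption?
    tags: list = field(default_factory=list)

class Catalog:
    """In-memory stand-in for an enterprise-wide catalog with search."""
    def __init__(self):
        self.entries = []

    def register(self, entry: CatalogEntry):
        self.entries.append(entry)

    def search(self, term: str):
        # A real catalog would use a search index; a linear scan
        # is enough to show the analyst's discovery workflow.
        term = term.lower()
        return [e for e in self.entries
                if term in e.name.lower()
                or term in e.description.lower()
                or any(term in t.lower() for t in e.tags)]

catalog = Catalog()
catalog.register(CatalogEntry(
    name="customer_transactions_refined",
    location="/lake/refined/customer_transactions",
    description="Cleansed card transactions, deduplicated daily",
    quality_score=0.97,
    certified=True,
    tags=["transactions", "customer"],
))
print(catalog.search("transactions"))  # analyst-style discovery
```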
And our approach is to use search and other technologies to quickly find what data assets exist across your enterprise. In doing so, we're covering data in on-prem environments and in cloud-based environments, so we're providing a hybrid approach that gives you an end-to-end, enterprise-wide view of what data sets exist in your logical data lake, if you will. Then, once you know about the data, we provide self-service refinement and enrichment on it, which we operationalize back in the Bedrock platform so that it becomes a managed data pipeline, in a governed way.

So one of the things that comes up all the time on theCUBE is this ability to do self-service cataloging with data. The data wrangling process, as Jeff Hammerbacher used to say, is like a sport for the elite data scientists. But not everyone can be a big-time data jockey, so to speak. So you guys make that easy with this new tool. How does that work? And talk about some of the challenges involved in data wrangling. I mean, it's really difficult. How do you guys do it? What are the complexities? What did you abstract away? And what is the architecture that makes this scale?

Yeah, our approach is to create a managed data pipeline to begin with. So as you're deploying a data lake environment: being able to ingest data in a managed way; being able to organize the data, where you're actually tying the data to metadata and capturing the operational metadata; being able to enrich the data so that it goes from a raw format to a refined format for specific use cases, with other refinements built later; and then being able to extract the data to bring it to different systems. That is how we think about a managed data pipeline, and we try to automate the whole process so that data scientists, and anyone else consuming the data, don't have to go through this tedious process over and over again.

And so what is the value then? Is it targeted towards the non-math or non-programming data scientists, or is it a way to prep for visualization, for programmers, or for the analysts? What specific segment are you going after?

Sure. The core platform, Bedrock, is mainly geared towards IT users who are creating and building a data lake, and Mica is geared towards business users who want to get value out of the data lake. So it serves data scientists, who may be bringing the data into a notebook-style interaction where they're using Spark and MLlib and other things to run their algorithms, and it serves business analysts, who are finding these different data assets, doing enrichment across them, and then using a BI tool like Tableau or Qlik to do the visualization and the slicing and dicing of the data.

So what you seem to be talking about is an iterative process of enrichment. It's not a monolithic pipeline; it's people exploring in a process of self-service, where the refinement happens in stages. And it sounds like once you've got a reusable view of the data, you register it in the catalog, Mica, and it can be reused by data scientists, business analysts, whoever.
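The answer above describes four stages: ingest, organize, enrich, extract. Below is a minimal sketch of such a managed pipeline, assuming a local transactions.csv as the raw source; the function names and file layout are invented for illustration and are not Bedrock's implementation.

```python
# Illustrative managed-pipeline skeleton: ingest -> organize -> enrich -> extract.
# A generic sketch, not Bedrock's actual code.
import csv, json, time

def ingest(path):
    """Bring raw records in under management, noting when and from where."""
    with open(path, newline="") as f:
        records = list(csv.DictReader(f))
    ops_metadata = {"source": path, "ingested_at": time.time(),
                    "record_count": len(records)}
    return records, ops_metadata

def organize(records, ops_metadata):
    """Tie the data to its metadata so downstream steps stay governed."""
    return {"data": records, "metadata": ops_metadata}

def enrich(dataset):
    """Refine raw records for a specific use case (here: drop empty rows)."""
    refined = [r for r in dataset["data"] if any(v.strip() for v in r.values())]
    dataset["metadata"]["refined_count"] = len(refined)
    return {"data": refined, "metadata": dataset["metadata"]}

def extract(dataset, out_path):
    """Hand the refined set to a downstream system, metadata included."""
    with open(out_path, "w") as f:
        json.dump(dataset, f, indent=2)

# Each run is repeatable: the same managed steps, no one-off wrangling.
raw, meta = ingest("transactions.csv")
extract(enrich(organize(raw, meta)), "transactions_refined.json")
```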
The part that's been shipping, and that I'd like to get a better picture of, is Bedrock. When you're bringing the data into Hadoop, you said you're extracting metadata, so that you have, I guess, lineage and governance information. Can you talk a little more about that?

Sure. As you bring in the data, there are various things we capture as part of the ingestion process itself: where the data came from, what type of data it is, how many records there are, and what the quality of the data is, so you can get some KPIs on the data itself. We capture all of that operational metadata, along with the business metadata or technical metadata that may be provided with these incoming data sets. Then, as you take these raw data sets and transform them into refined data sets for your specific use cases, we capture all the lineage information, so that for compliance-related reasons you have the provenance you need to go back to where the data came from and what it was in its raw format.

Okay, I want to get your comments on a remark made here on theCUBE this morning by Rob Thomas at IBM. He said most data lakes are turning into data swamps, mainly referring to people just putting data lakes together and storing data away. Hadoop has become one of those things that made it easy to store data without acting on it. That's not quite the point he's making, but my point is: okay, Hadoop's great, you store it, I'll get to it later. What he's referring to is that that practice has yielded a pile of bad or just unusable data, essentially swamp-like. What are your thoughts?

Rob is a great partner; we work with IBM quite a bit, and we agree 100% that the data lake concept turns into a data swamp very quickly without the proper management and governance controls. That's why we think having a managed process, an end-to-end data pipeline, is so important as you build this next-generation data lake architecture.

It's like a wastewater treatment plant, almost.

Yes, exactly.

Before it gets to the lake. It's dirty, it gets cleaned, it stays clean. All kidding aside, it's an interesting metaphor. But why does it happen like that? Is it just mismanagement, bad process, bad data sources? Why does something become a data swamp?

Sure. What we have seen from our experience is that people start with POCs as they begin their Hadoop journey, and very soon they try to graduate those POCs into a production environment without the proper data management and data governance controls. When that happens, as soon as the first line of business with the first use cases shows some success, all the other lines of business come online bringing their own data sets, and without proper governance and management it very quickly turns into an unmanageable platform, where you have issues certifying applications for production build-out and production deployment.
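Those controls come down to capturing quality KPIs and provenance at ingestion time, as described a moment ago. Here is a rough sketch of the idea, with invented field names and a simple in-memory lineage store rather than anything Bedrock-specific:

```python
# Hypothetical sketch of ingestion-time KPI and lineage capture.
# Field names and the lineage store are invented for illustration.
import hashlib, time

def quality_kpis(records, required_fields):
    """Simple KPIs: record count and completeness of required fields."""
    total = len(records)
    complete = sum(1 for r in records
                   if all(r.get(f) not in (None, "") for f in required_fields))
    return {"record_count": total,
            "completeness": complete / total if total else 0.0}

def ingest_with_provenance(records, source, required_fields, lineage):
    """Register a raw data set and return an ID later refinements link to."""
    dataset_id = hashlib.sha1(f"{source}-{time.time()}".encode()).hexdigest()[:12]
    lineage[dataset_id] = {"source": source, "parent": None,
                           "kpis": quality_kpis(records, required_fields)}
    return dataset_id

def refine_with_provenance(parent_id, refined_records, lineage):
    """Each refined set records its parent, so provenance walks back to raw."""
    dataset_id = hashlib.sha1(f"{parent_id}-refined".encode()).hexdigest()[:12]
    lineage[dataset_id] = {"source": "derived", "parent": parent_id,
                           "kpis": {"record_count": len(refined_records)}}
    return dataset_id

def provenance(dataset_id, lineage):
    """Walk the lineage chain back to the original raw ingestion."""
    chain = []
    while dataset_id is not None:
        chain.append(dataset_id)
        dataset_id = lineage[dataset_id]["parent"]
    return chain  # refined -> ... -> raw, for a compliance review
```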
Ironically, one of your customers is Waste Management, I noticed on your website. They've got to stay away from data lakes becoming data swamps. But I've got to ask you, in all seriousness: congratulations on your company. You've been operating for maybe eight years now, no outside funding, self-funded. Congratulations, that's very hard to do, and those kinds of customers are a real testament. So how did you do it? What was the strategy? You put your own cash in, you did some consulting, did you build it up over time? What happened?

All of the above. I came from a very traditional storage background, and we were already seeing a lot of large enterprise customers looking at alternative storage platforms. So we started building use cases for our customers and showing business value to the lines of business, and we were very successful at that. We were working with some of the tier-one telcos, some of the large health insurance payers, and financial services companies, on both operational-efficiency use cases, where they're looking for a cost-effective platform and Hadoop is one, and net-new revenue-generating use cases, like campaign management, targeted marketing, and loyalty. We recently worked with a large global credit card issuer where we built an end-to-end cross-merchant loyalty program, a net-new revenue stream for them. We saw success by building those use cases for our customers.

And you were generating revenue right away?

Yes, and we were cash-flow positive. We have been growing at 100% year over year for the last four or five years.

How many employees do you have?

Close to 200 right now.

200?

Yes.

So that's a significant cash expense, and you're covering it with revenue. You do the math, everyone out there; you guys have got a good investment opportunity.

We have been very fortunate with some marquee customers who have supported us all along, and our mission is to make them successful in their Hadoop journey, so they can deploy these environments in production.

Talk about your growth strategy and your innovation strategy. Are they the same? Are they intersecting? Where does one help the other, and vice versa?

Yeah, that's a very good question. We're now seeing a lot of our customers, and these are large Fortune 50 and Fortune 100 customers, doing not just on-prem Hadoop but cloud-based Hadoop clusters as well. So they may have an on-prem Hadoop cluster with all the sensitive data, but they're also standing up agile analytics platforms in the cloud, and what we see is that providing a consistent set of data management and data governance capabilities across those environments is very important.

Including tooling and methodology?

Exactly. As part of Bedrock 4.0, which we're announcing, we're supporting multiple cloud platforms, as natively as we can, for onboarding data, managing data, and governing data.

Does that mean that with Bedrock, the rules for keeping a data lake from turning into a data swamp are essentially common across on-premise and the cloud, so you have a hybrid platform that's consistent in terms of making the data consumable?

That is exactly right. We were already seeing concerns from a lot of our customers about how, as they stand up these cloud-based infrastructures, they can translate what they have done on-prem to a cloud-based model and keep the same set of governance rules.
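One way to picture "the same governance rules on-prem and in the cloud" is a single policy definition validated against data sets wherever they live. A minimal sketch follows, with invented policy fields and data set descriptions, since Bedrock's actual policy model isn't public:

```python
# Hypothetical governance policy applied uniformly across environments.
# Policy fields and dataset shapes are invented for illustration.

POLICY = {
    "required_metadata": ["owner", "source", "ingested_at"],
    "min_completeness": 0.95,       # quality KPI threshold
    "pii_fields_masked": True,      # sensitive columns must be masked
}

def validate(dataset_meta, policy=POLICY):
    """Return a list of violations; an empty list means the set is compliant."""
    violations = []
    for name in policy["required_metadata"]:
        if name not in dataset_meta:
            violations.append(f"missing metadata field: {name}")
    if dataset_meta.get("completeness", 0.0) < policy["min_completeness"]:
        violations.append("completeness below threshold")
    if policy["pii_fields_masked"] and not dataset_meta.get("pii_masked", False):
        violations.append("PII fields not masked")
    return violations

# The same check runs no matter where the data lake lives.
for env, meta in {
    "on_prem": {"owner": "finance", "source": "mainframe", "ingested_at": 1,
                "completeness": 0.99, "pii_masked": True},
    "cloud":   {"owner": "marketing", "source": "clickstream",
                "completeness": 0.90, "pii_masked": False},
}.items():
    print(env, validate(meta) or "compliant")
```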
Now, the use cases when they're going from on-prem to hybrid: is it cloud bursting, for extra temporary capacity, or is it that they might have data they're ingesting in the cloud that they want to process there?

Both. We have some customers using an elastic model in the cloud to burst capacity, but we also have customers who are generating a lot of data outside their own environment, from their customers' environments, with sensors and machine data, that they're trying to bring in to create new data products, for example. Those customers want an agile platform on which they can enable their lines of business to build and deploy those data products.

So let me ask: governance has become a rather broad term for keeping my data honest and trustworthy, but the Hadoop vendors all seem to feel they have a role or responsibility to deliver that capability. Help us distinguish between what might get built into a specific Hadoop distro and what you might be layering on top, perhaps across distros, across private and public, in a hybrid environment.

Sure. Our approach is to leverage the Hadoop ecosystem components; we're not trying to reinvent the wheel in terms of the core functionality of the ecosystem.

So, like Navigator?

Yeah, like the different frameworks, Spark or HCatalog or other frameworks that are part of the ecosystem. We leverage those, and we provide a layer of abstraction so that you don't have to worry about whether you're using one distribution today and another distribution tomorrow. We transparently map what you're doing at the data management layer to the distro-specific project or ecosystem framework that's available.

Okay, let me ask you about one of the big themes that's happening literally in real time: there's an explosion, not just on the management and security periphery of the Hadoop core processing and storage, but a splintering of the analytic processing and even the storage layers, so that no one distro is a second source for another. How is that impacting you, and how can you make that easier for customers who are weighing one against another?

Yeah, our philosophy there is that we'll use things that are well tried out. There are a lot of new projects coming up that are not well tested for deployment in an enterprise environment, so we make sure that the projects we leverage in an end-to-end platform are well tested, well followed, and well developed.

Okay. I really appreciate you coming on theCUBE. Congratulations on your product launch, launched here on theCUBE exclusively. Another product launch on theCUBE. Congratulations on your success.

Thank you so much, I appreciate it, and thanks for having me here.

Great. We'll be right back with more, live in New York City, as part of Big Data NYC in conjunction with Strata Hadoop. This is theCUBE, live, one block from the Javits Center. We'll be right back after this short break. It's live, it's spontaneous.