Live from San Jose, California, it's theCUBE, covering Big Data Silicon Valley 2017.

Hey, welcome back everyone. We're live in Silicon Valley for Big Data SV, Big Data Silicon Valley, in conjunction with Strata Hadoop. This is the week where it all happens in Silicon Valley around the emergence of big data as it goes to the next level, and theCUBE is obviously on the ground covering it like a blanket. I'm John Furrier, with my co-host George Gilbert of Wikibon, and our next guests are two executives from Zaloni: Ben Sharma, the founder and CEO, and Tony Fisher, SVP of strategy. Guys, welcome back to theCUBE. Good to see you.

Thank you for having us back.

You guys are great. You were in New York for Big Data NYC, and a lot is going on, certainly here, and it's just getting kicked off with Strata Hadoop. They've got the sessions today, but you guys already have some news out there. Give us the update. What's the big discussion at the show?

So yeah, 2016 was a great year for us. A lot of growth. We tripled our customer base, and there's a lot of interest in the data lake as customers go from pilots and POCs into production implementations of Hadoop. In conjunction with that, this week we launched a solution named, appropriately, Data Lake in a Box. What that means is that we're bringing the full stack to customers so that we can get a data lake up and running in an eight-week time frame, with enterprise-grade data ingestion from their source systems, hydrated into the data lake and ready for analytics.

So is that a pretty big box, and is it waterproof? I mean, that's the big discussion now, pun intended, but the data lake is evolving, so I want to get your take on this. This has been a theme leading up to the show, and it's now front and center here on theCUBE. The data lake has changed. Obviously we've heard — Dave Vellante in New York said "data swamp" — but using the data that's in the data lake is what's critical.
So as it goes to a more mature model of leveraging the data lake, what are the key trends right now? What are you guys seeing? Because this is a hot topic that everyone's talking about.

Well, that's a good distinction that we like to make: the difference between a data swamp and a data lake. A data lake is much more governed. It has the rigor, it has the automation, it has a lot of the concepts that people are used to from traditional architectures, only we apply them in a scale-out architecture. So we put together a maturity model that maps out a customer's journey through the big data and data lake experience. At each phase we can see what the customer is doing, what their trends are, and where they want to go, and we can advise them on the right way to move forward. A lot of the customers we see are in what we call the ignore stage. I'd say most of the people we talk to are just ignoring — they don't have things active, but they're doing a lot of research and trying to figure out what's next. We want to move them from there. The next stage up is called store, and store is basically just a sandbox environment: we'll stick stuff in there and hope something comes out of it. No collaboration. Moving forward, there's the managed phase, the automated phase, and the optimized phase, and our goal is to move customers up into those phases as quickly as possible. Data Lake in a Box is an effort to do that, to leapfrog them into a truly managed data lake environment.

So that's where the swamp analogy comes in, because the data lake in the swamp stage is kind of dirty. You can almost think, okay, the first step is store it, and then they either get busy or they're trying to figure out how to operationalize it. So you guys get them set up and then move them quickly to value. Is that the approach?
Yeah, so time to value is critical, right? How do you reduce the time to insight from the moment the data is produced by the data producer until you can make the data available to the data consumers for analytics and downstream use cases? That's our core focus in bringing these solutions to market.

You know, David Roth and I were talking about this, and George and I always talk about it: the value of data at the right time, in the right place, is the critical linchpin for value, whether it's app-driven or whatever. With a data lake, you never know what data will need to be pulled out and put into either a real-time system or an app, so you have to assume that at any given moment there's going to be data value.

Sure.

Conceptually, people can get that, but how do you make it happen? Because that's a really hard problem. How do you guys tackle it when a customer says, hey, I want to do the data lake, I've got to have the governance, I've got to know who's accessing stuff, but at the end of the day I've got to move the data to where it's valuable?

Sure. The approach we have taken is an integrated platform with a common metadata layer. Metadata is the key. Using this common metadata layer — being able to do managed ingestion from various sources, being able to do data validation and data quality, being able to manage the life cycle of the data, being able to generate insights about the data itself so that you can use it effectively for data science or for downstream applications and use cases — is critical, based on our experience taking these applications from a POC or pilot phase into a production phase.

And what's the next step once you get to that point with the metadata? Because I get that part — everyone's got the metadata focus.
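The managed-ingestion flow Ben describes — every load validated and registered in a common metadata layer — could be sketched roughly like this. All class and function names here are invented for illustration, not Zaloni's actual API:

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

# Hypothetical sketch: a common metadata layer that records every
# ingestion event, so downstream consumers can see lineage, quality,
# and freshness for each data set in the lake.
@dataclass
class DatasetMetadata:
    name: str
    source: str
    row_count: int = 0
    quality_checks: dict = field(default_factory=dict)
    ingested_at: str = ""

class MetadataCatalog:
    def __init__(self):
        self._entries = {}

    def register(self, meta: DatasetMetadata):
        # Stamp ingestion time so consumers can judge freshness.
        meta.ingested_at = datetime.now(timezone.utc).isoformat()
        self._entries[meta.name] = meta

    def lookup(self, name: str) -> DatasetMetadata:
        return self._entries[name]

def managed_ingest(catalog, name, source, records):
    # Validate as we ingest: a simple not-null check stands in for a
    # real data-quality rule set.
    valid = [r for r in records if all(v is not None for v in r.values())]
    meta = DatasetMetadata(
        name=name, source=source, row_count=len(valid),
        quality_checks={"non_null_ratio": len(valid) / max(len(records), 1)},
    )
    catalog.register(meta)
    return valid

catalog = MetadataCatalog()
rows = [{"id": 1, "v": 10}, {"id": 2, "v": None}]
clean = managed_ingest(catalog, "orders", "erp_extract", rows)
print(catalog.lookup("orders").row_count)  # 1
```

The point is the single registration path: every load, from any source, lands in the same catalog with the same quality traits attached.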
Now, whether I'm the data engineer — the geek, the super geek — or the data scientist, or the analyst, and there will probably be a new category, a bot or something, AI will do something — you can have a whole spectrum of applications on the data side. How do they get access to the metadata? Is it through machine learning? Do you guys have anything unique there that makes it seamless, or is that the end goal?

Sure. You want to take that?

Yeah, sure. It's a multi-pronged answer, but I'll start and you can jump in. One of the things we provide as part of our overall platform is a product called Mica, and Mica is really the on-ramp to the data. All those people that you just named — we love them all — get their access to the data through a self-service data preparation product, and key to that is the metadata repository. All the metadata is out there; we call it a catalog at that point. So they can go in, look at the catalog, get a sense for the data, get an understanding of the form and function of the data, see who uses it, see where it's used, and determine if that's the data they want. If it is, they have the ability to refine it further, or they can put it in a shopping cart. If they have access to it, they can get it immediately and refine it; if they don't have access, an automatic request goes out for it. So it's this on-ramp concept: a card catalog of all the information that's out there, how it's being used, and how it's been refined, which lets end users make sure they've got the right data, positioned for their ultimate application.
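That on-ramp and shopping-cart flow could look roughly like this toy sketch — the names, data-set traits, and access model are invented for illustration, not Mica's real interface:

```python
# Hypothetical sketch of a self-service catalog: a consumer browses a
# data set's traits, then either gets it immediately or files an
# automatic access request, as described above.
class CatalogEntry:
    def __init__(self, name, quality, freshness_days, allowed_users):
        self.name = name
        self.quality = quality                  # e.g. fraction of rows passing checks
        self.freshness_days = freshness_days    # days since last load
        self.allowed_users = set(allowed_users)

class DataMarketplace:
    def __init__(self, entries):
        self.entries = {e.name: e for e in entries}
        self.pending_requests = []

    def browse(self, name):
        # "Card catalog" view: form, function, and traits of the data set.
        e = self.entries[name]
        return {"quality": e.quality, "freshness_days": e.freshness_days}

    def checkout(self, user, name):
        e = self.entries[name]
        if user in e.allowed_users:
            return f"provisioned sandbox for {name}"
        # No access yet: file the request automatically instead of failing.
        self.pending_requests.append((user, name))
        return "access requested"

mp = DataMarketplace(
    [CatalogEntry("claims", quality=0.97, freshness_days=1, allowed_users=["ana"])]
)
print(mp.browse("claims"))          # {'quality': 0.97, 'freshness_days': 1}
print(mp.checkout("ana", "claims"))  # provisioned sandbox for claims
print(mp.checkout("bob", "claims"))  # access requested
```

The design choice worth noting is that denial of access is not a dead end — it becomes a tracked request, which is what keeps the self-service loop moving.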
Yeah, and just to add to what Tony said: because we are using this common metadata layer and capturing metadata at every instance, if you will, we serve it up to data consumers through a rich catalog. A lot of our enterprise customers are now starting to create what they consider a data marketplace or data portal within their organization, cataloging not just the data that's in the data lake but also data in other data stores, and providing one single unified view of these data sets. Your data scientist can come in and see: is this a data set I can use for my model building? What are the different attributes of this data set? What is the quality of the data? How fresh is the data? Those kinds of traits, so that they're effective in their analytical journey.

I think the key thing that's interesting to me is that you've seen the big data explosion over the past eight or ten years — we've been covering it on theCUBE since Hadoop World started — but now it's a data set world. The data sets are the key, because that's what data scientists want to wrangle, with whatever tooling they want to use. Is that the same trend you guys see?

That is correct, and what we're also seeing in the marketplace is that customers are moving from a single architecture to a distributed architecture, where they may have a hybrid environment with some things instantiated in the cloud and some things on-prem. So how do you provide a unified interface across these multiple environments, in a governed way, so that the right people have access to the right data and it's not a data swamp?

Okay, so let's go back to the maturity model. I like that framework.
Now you've just complicated the heck out of it, because now you've got cloud and on-prem. How do you put the prism of a maturity model on hybrid — how does that cross-connect? And a second follow-up: where are the customers on this progress bar? I'm sure it differs by customer, but — maturity model against hybrid, and then trends in the customer base that you're seeing.

All right, I'll take the second one and then you can take the first one. So the vast majority of the people we work with — the prospects, customers, and analysts we've talked to, other industry dignitaries — put the vast majority of customers in the ignore stage, really just doing their research. A good 50 percent plus of most organizations are still in that stage. Then there's the data swamp environment — the store stage: I'm storing stuff and hopefully I'll get something good out of it. That's another 25 percent of the population. So most customers are there, and we're trying to move them rapidly up into a managed and automated data lake environment. The other trend along these lines that we're seeing, which is pretty interesting, is the emergence of IT in the big data world. It used to be a business users' world: business users built these sandboxes and did what they wanted to. But now we see organizations really starting to bring IT into the fold, because they need the governance, they need the automation, they need the type of rigor that they're used to in other data environments and that has been lacking in big data.

And then they've got IoT — cracking the code on the IoT side, which is great, another dimension of complexity. On the numbers, the 50 percent that are in the ignore stage — is that profile more Fortune 1000?

It's larger companies. It's Fortune — yeah, Global 2000.

Got it, okay. All right, and in terms of the hybrid maturity model, how does that work? And there's a third dimension now, IoT.
You've got a multi-dimensional chess game going on here.

So the way we think about it is that there are different patterns of data sets coming in. They could be batch — files or database extracts — or they could be streams, right? As long as you think about a converged architecture that can handle these different patterns, you can map different use cases onto it, whether they're IoT and streaming use cases or — what we're also seeing is that a lot of companies are trying to replace their operational analytics platforms with a data lake environment, and they're building their operational analytics on top of the data lake, right? So you need to think more in terms of an abstraction layer. How do you abstract it out? One of the challenges we see customers facing is that they don't want to get sticky with one cloud service provider, because they may have multiple cloud service providers. It's a multi-cloud world right now. How do you handle having one cloud service provider in one geo and another cloud service provider in another geo, and still have an abstraction layer on top so that you're building applications against it?

So do you guys provide that data layer across that abstraction? Is that the strategy?

That is correct, yeah. We leverage the ecosystem, but what we do is the data management and data governance layer. We provide that abstraction so that you can be on-prem, or in cloud service provider one, or cloud service provider two, and you still have the same controls and the same governance functions as you build out your data lake.

And this is consistent with some of the conversations we've had all day today and in other CUBE interviews: with the cloud you're renting, basically, but you own your data. You've got to have a nice — and that metadata seems to be the key.

That's the key, right?

That's right, yeah, yeah.
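One way to picture that abstraction layer is a governance wrapper that doesn't care which back end it's talking to. This is a rough sketch under stated assumptions: the storage classes are stand-ins, not real cloud SDK calls, and the ACL model is invented for illustration:

```python
from abc import ABC, abstractmethod

# Hypothetical sketch: one storage interface, many back ends, so the
# same governance code runs on-prem or against either cloud provider.
class ObjectStore(ABC):
    @abstractmethod
    def write(self, key, data): ...

    @abstractmethod
    def read(self, key): ...

class InMemoryStore(ObjectStore):
    # Stands in for HDFS on-prem or an object bucket in either cloud geo.
    def __init__(self):
        self._blobs = {}

    def write(self, key, data):
        self._blobs[key] = data

    def read(self, key):
        return self._blobs[key]

class GovernedLake:
    """Applies the same access controls regardless of the back end."""
    def __init__(self, store, acl):
        self.store, self.acl = store, acl

    def put(self, user, key, data):
        if key not in self.acl.get(user, set()):
            raise PermissionError(user)
        self.store.write(key, data)

    def get(self, user, key):
        if key not in self.acl.get(user, set()):
            raise PermissionError(user)
        return self.store.read(key)

# The governance layer is identical whichever store you plug in.
lake = GovernedLake(InMemoryStore(), acl={"ben": {"sales/2017"}})
lake.put("ben", "sales/2017", b"rows")
print(lake.get("ben", "sales/2017"))  # b'rows'
```

Swapping `InMemoryStore` for a provider-specific implementation changes nothing above it, which is the "don't get sticky with one cloud" point.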
And now what we're seeing is that a lot of our enterprise customers are looking at bringing some of the public cloud infrastructure into their on-prem environment, as it becomes available in appliances and things like that, right? So how do you then make sure that whatever you're doing in the public cloud environment, you're able to also extend to the enterprise?

And the consequence for the enterprise, if they don't have a consistent data layer, is that they've got to run multiple jobs — it's more redundancy.

Exactly.

Not redundancy — duplication, I should say.

Yeah, duplication, and the difficulty of rationalizing it all together.

So let me drill down into a little more detail on the transition between these maturity phases and the movement into production apps. I'm curious to know: we've heard about Tableau, Excel, Power BI, Qlik, I guess, adapting to being front ends to big data, but for their experience to work, they can't really handle big data sets, so you need the MPP SQL databases on the data lake. And I guess the question is: is there measurable value to be gotten just from turning the data lake into an interactive BI kind of platform, as the first step along that maturity model?

So one of the patterns we are seeing is that the serving layer in the data lake is becoming more and more mature. Earlier it was mainly batch-type workloads; now, with MPP engines running on the data lake itself, you are able to connect your existing BI applications — whether it's Tableau, Qlik, Power BI, or others — to these engines, so that you get low-latency query response times and are able to slice and dice your data sets in the data lake itself, right?
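That serving-layer pattern — existing BI tools just speaking SQL to an engine sitting over the lake — can be shown with a toy example. Here Python's built-in sqlite3 stands in for an MPP engine such as Impala or Presto; the schema and query are invented for illustration:

```python
import sqlite3

# Sketch: a BI dashboard connects over SQL to the lake's serving layer.
# sqlite3 is only a stand-in; the point is that the BI tool neither
# knows nor cares that the data lives in a data lake.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO events VALUES (?, ?)",
    [("west", 10.0), ("west", 5.0), ("east", 7.5)],
)

# A slice-and-dice aggregation of the kind a dashboard would issue.
cur = conn.execute(
    "SELECT region, SUM(amount) FROM events GROUP BY region ORDER BY region"
)
print(cur.fetchall())  # [('east', 7.5), ('west', 15.0)]
```

Because the interface is plain SQL, the enterprise keeps its onboarded BI platform and only the engine underneath changes.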
But you're essentially still — you have to sample the data. You can't handle the full data set unless you're working with something like a Zoomdata.

Yeah, so there are physical limitations, obviously, and then there is also this next generation of BI tools that work in a converged manner in the data lake itself, right? There's Zoomdata, Arcadia, and others that are able to run inside the data lake itself, instead of you having to maintain an external environment like the other BI tools. We see that as a pattern. But if you're an enterprise and you've already onboarded a BI platform, how you leverage that with the data lake as part of the next-generation data architecture is a key trend that we are seeing.

So your metadata helps make that move from swamp to curated data lake.

That's right. And not only that — as Tony was mentioning, in our Mica product we have a self-service catalog, and we provide a shopping cart experience where you can source data sets into the shopping cart. We let users provision a sandbox, and when they provision the sandbox, they can actually launch Tableau, or whatever their BI tool of choice is, on that sandbox. And that sandbox could exist in the data lake, or it could exist on a relational data store or an MPP data store outside the data lake that's part of your modern data architecture.

But further to your point, if people have to throw out all of their decision support applications and their BI applications in order to change their data infrastructure, they're not going to do it. So you have to make that environment work, and that's what Ben's referring to with a lot of the new accelerator tools and things that will sit on top of the data lake.

Guys, thanks so much for coming on theCUBE. Really appreciate it. I'll give you guys the final word in the segment. What do you expect this week?
I mean, obviously we've been seeing the consolidation. You're starting to see the swim lanes with Spark and open source, you see the cloud and IoT colliding, and there's a huge intersection with deep learning. AI is certainly hyped up now beyond all recognition, but it's essentially deep learning — neural networks meets machine learning, which has been around before but is now freely available with cloud and compute. So it's an interesting dynamic that's rocking the big data world. Your thoughts on what we're going to see this week and how that relates to the industry?

Yeah, I'll take a stab at it and then turn it over for you to jump in. I think what we'll see is that a lot of customers that have been playing with big data for a couple of years are now getting to a point where what worked for one or two use cases needs to be scaled out and provided at enterprise scale. So they're looking at a managed and governed layer to put on top of the platform, so they can enable machine learning and AI and all those use cases, because the business is asking for them, right? The business is asking how they can bring in TensorFlow and run it on the data lake itself. We see those kinds of requirements coming up more and more frequently.

Awesome.

What he said.

Okay, got it. And enterprise readiness — there are a lot of table stakes in the enterprise; it's not easy to get into. You can see Google just putting their toe in the water with Google Cloud: TensorFlow is a great highlight, they've got Spanner, but all these other things, like latency, you see rearing their heads again. These are all table stakes.

Yeah, and the other thing moving forward, with respect to machine learning and some of the advanced algorithms: what we're doing now in some of our research is actually using machine learning to manage the data lake, which is a new concept.
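As a loose illustration of that idea — machine learning managing the lake itself — even a simple statistical check on ingestion metrics can flag a load that needs self-correction. This is a hypothetical example of the concept, not Zaloni's actual research:

```python
import statistics

# Sketch: flag an ingestion run whose row count deviates sharply from
# history, so the platform can self-correct (quarantine the load,
# rerun it, or alert an operator) instead of polluting the lake.
def is_anomalous(history, latest, threshold=3.0):
    mean = statistics.mean(history)
    stdev = statistics.pstdev(history)
    if stdev == 0:
        # No historical variation: any deviation at all is suspect.
        return latest != mean
    # Simple z-score test against the historical distribution.
    return abs(latest - mean) / stdev > threshold

daily_rows = [1000, 1020, 980, 1010, 995]
print(is_anomalous(daily_rows, 1005))  # False: within the normal range
print(is_anomalous(daily_rows, 120))   # True: load probably failed upstream
```

In the optimized phase of the maturity model, checks like this would run continuously over the same metadata layer that powers the catalog.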
So when we get to the optimized phase of our maturity model, a lot of that has to do with self-correcting and self-automating the data lake.

I need some machine learning and some AI. So does George. And we need machine learning to watch the machines learn, and then algorithms for algorithms — it's a crazy world, an exciting time. Are we going to have a bot next time when we come here?

We're working on a chatbot for Messenger. We just came from South by Southwest.

Guys, thanks for coming on theCUBE. Great insight, and congratulations on the continued momentum at Zaloni. This is theCUBE, breaking it down with experts, CEOs, and entrepreneurs, all here inside theCUBE, breaking down Big Data SV. I'm John Furrier, with George Gilbert. We'll be back after this short break. Thanks.