Live from New York, it's theCUBE. Covering Big Data New York City 2016. Brought to you by headline sponsors Cisco, IBM, NVIDIA, and our ecosystem sponsors. Now, here are your hosts, Dave Vellante and George Gilbert.

Welcome back to New York City, everybody. Nenshad Bardoliwala is here. He's the co-founder and chief product officer of Paxata, a company that, I want to say three years ago, came out of stealth on theCUBE. October 27th, 2013. Right. And we were at the Warwick Hotel, across the street from the Hilton. And yeah, Prakash came on theCUBE. He did. Came out of stealth, and welcome back. Thank you very much. Great to see you guys on your way to taking the world by storm. Great to be here. And of course, Prakash sends his apologies. He couldn't be here, so he sent a stunt double. Great.

So give us the update. What's the latest? So there are a lot of great things going on in our space. The thing that we announced here at the show is what we're calling Paxata Connect. Just as we created the self-service data preparation category, and now there are 50 companies that claim they do self-service data prep, we are moving the industry to the next phase: what we're calling our business information platform. And Paxata Connect is one of the first major milestones in getting to that vision of the business information platform.

What Paxata Connect allows our customers to do is, number one, to have visual, completely declarative, point-and-click browsing access to a variety of different data sources in the enterprise. So for example, we are the only company that we know of that supports connecting to multiple, simultaneous, different Hadoop distributions in one system. So a Paxata customer can connect to MapR, they can connect to Hortonworks, they can connect to Cloudera, and they can federate across all of them, which is a very powerful aspect of the system.

And part of this involves, when you say declarative, it means you don't have to write a program to retrieve the data? Exactly right. Okay, that's... Exactly right. So... And is this going into HDFS, into Hive, or...? So, yes it is. And in fact, this multi-source Hadoop capability is one part of Paxata Connect. The second is, as we've moved into this information platform world, our customers are telling us they want read and write access to more than just Hadoop. Hadoop is obviously a very important part, but we're also supporting read and write for NoSQL data sources like Cloudant and MongoDB, and, for the first time, write to relational databases; we already supported read. So Paxata is really becoming a business-centric information fabric that allows people to move data from anywhere to any destination, and transform it, profile it, explore it along the way.
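To make the federation idea concrete for readers: below is a minimal sketch, not Paxata's actual API, of reading one logical table from several Hive-fronted Hadoop clusters and unioning the results. The gateway hostnames, the port, and the choice of the PyHive client are assumptions for illustration.

```python
# A minimal sketch (NOT Paxata's actual API) of federating reads across
# multiple Hadoop distributions: pull the same logical table from each
# cluster's HiveServer2 endpoint and union the results.
import pandas as pd
from pyhive import hive  # assumed client; any DB-API Hive driver would do

CLUSTERS = {                                  # placeholder gateway hosts
    "cloudera":    "cdh-gateway.example.com",
    "hortonworks": "hdp-gateway.example.com",
    "mapr":        "mapr-gateway.example.com",
}

def federated_read(table: str) -> pd.DataFrame:
    """Read one table from every cluster, tagging each row with its origin."""
    frames = []
    for name, host in CLUSTERS.items():
        conn = hive.Connection(host=host, port=10000)  # default HiveServer2 port
        df = pd.read_sql(f"SELECT * FROM {table}", conn)
        df["source_cluster"] = name                    # provenance column
        conn.close()
        frames.append(df)
    return pd.concat(frames, ignore_index=True)

transactions = federated_read("transactions")          # hypothetical table name
```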
Excellent. Well, let's get into some of the use cases. Yeah, tell us where the banks are. The sense at the conference is that everyone has sort of got their data lakes, to some extent, up and running. Right. So now, where are they pushing to go next? Sure, that's an excellent question. So we have really focused on the enterprise segment, as you know. So among the customers that are working with Paxata, from an industry perspective, banking is of course a very important one.

We were really proud to share the stage yesterday with both Citi and Standard Chartered Bank, two of our flagship banking customers. But Paxata is also heavily used in the United States government and in the intelligence community; I won't say any more about that. It's used heavily in retail and consumer products, in the high-tech space, and by data service providers, that is, companies whose entire business is based on data.

But to answer your question specifically, what's happening in the data lake world is that a lot of folks, the early adopters, have jumped onto the data lake bandwagon, right? And so they're pouring terabytes and petabytes of data into the data lake. And then the next question that the business asks is, okay, now what? Where's the data, right? So one of the simplest use cases, but actually one that's very pervasive for our customers, is they say, look, our business people don't even know what's in Hadoop right now. And by the way, I will also say that the data lake is not just Hadoop: Amazon S3 is also serving as a data lake, and the capabilities inside Microsoft's cloud are also serving as a data lake. So even the notion of a data lake is becoming this sort of polymorphic, distributed thing.

So what they want is to be able to get what we like to say is first eyes on data. With Paxata, especially with the release of Connect, we let people just point and click their way around and actually explore the data in all of the native systems before they even bring it into something like Paxata. So they can sneak-preview thousands of database tables, or thousands of compressed data sets inside of Amazon S3, or thousands of data sets inside of Hadoop. And now the business people, for the first time, can point and click and actually see what is in the data lake in the first place.
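As an illustration of that "first eyes on data" step, here is a rough sketch of browsing an S3-backed lake: list everything under a prefix, then peek at just the first rows of an object without ingesting it. The bucket name, prefix, and object key are placeholders, and it assumes plain CSV objects for simplicity.

```python
# A rough sketch of previewing a data lake in S3: enumerate objects, then
# read only the first rows of one -- enough to see columns and sample values.
import boto3
import pandas as pd

s3 = boto3.client("s3")
BUCKET = "corp-data-lake"                     # placeholder bucket name

def list_datasets(prefix=""):
    """Yield (key, size) for every object, paginating past the 1,000-key limit."""
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=BUCKET, Prefix=prefix):
        for obj in page.get("Contents", []):
            yield obj["Key"], obj["Size"]

def preview(key, rows=20):
    """Fetch a sneak preview: just the first few rows, not the whole object."""
    body = s3.get_object(Bucket=BUCKET, Key=key)["Body"]
    return pd.read_csv(body, nrows=rows)

for key, size in list_datasets("raw/"):
    print(key, size)
print(preview("raw/customers.csv"))           # hypothetical object key
```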
So, step number one: so far in the industry, a lot of IT-driven use cases have motivated people to go to the data lake approach. But now all of our companies obviously want to show business value, and so tools and platforms like Paxata, which sit on top of the data lake, federate across multiple data lakes, and provide business-centric access to that information, are the first significant use-case pattern we're seeing.

Just a clarification. Sure. Could there be two roles, where one, a slightly more technical business user, exposes views, summarizing so that the ultimate end user doesn't have to see the thousands of tables? Absolutely, that's a great question. So when you look at self-service, right, if somebody wants to roll out a self-service strategy, there are multiple roles in an organization that actually need to intersect with self-service. There is a pattern in organizations where people say, we want our people to get access to all the data. Of course it's governed, they have to have the right passwords and SSO and all that, but those are the companies who say, yes, the users really need to be able to see all of the data across these different tables. But there's a different role who also uses Paxata extensively: the data curators, right? These are the people who say, okay, look, I'm going to provision the raw data, provide the views, provide even some normalization or transformation, and then land that data back into another layer, as people call it, right? They go from layer zero to layer one to layer two. There are different directory structures, but the point is there's a natural processing frame that they're going through with their data, and then from the curated data that's created by the data stewards, the analysts can go pick it up.

One of the other big challenges that our research shows chief data officers express is, they get this data into a data lake, so they've got the data sources, and you're providing access to it and the other pieces; now they want to trust that data. Obviously there's a governance piece, but then there's a data quality piece; maybe you could talk about that. Absolutely. So use case number one is about access. The second is about trust. So why are people doing data prep in the first place? They're trying to make information-driven decisions that actually help move their business forward, right? So if you look at researchers from firms like Forrester, they'll say there are two things that slow down the latency of going from raw data to decision. Number one is access to data; that's the use case we just talked about. Number two is the trustworthiness of data.

So our approach is very different on that. Once people can actually find the data they're looking for, the big paradigm shift in the self-service world is that instead of trying to process data by transforming the metadata attributes, like, I'm going to draw a workflow diagram that says, bring in this table, aggregate it with this operator, then split it this way, filter it, which is the classic ETL paradigm, the, I don't want to say profound, but maybe the very obvious thing that we did is to say, well, what if people could actually look at the data in the first place? Sort of program it by example. That's right, because our eyes and our brains let us evaluate a data set immediately, right? You look at an age column, let's say, and there are values in the age column of 150 years, right? Now, maybe 20 years from now there may be someone on earth who lives to 150 years, but pretty much- Highly unlikely. The customers of the banks we work with are not 150 years old, right?

So it starts with just being able to look at the data. And, to get to the point that you're asking, quality is about data being fit for a specific purpose. And in order for data to be fit for a specific purpose, the person who needs the data needs to make the decision about what is quality data. So both of you may have access to the same transactional data, the raw data that the IT team has landed in the Hadoop cluster. But you pull it up for one use case, and you pull it up for another use case, and because your needs are different, what constitutes quality to you, and where you want to make the investment, is going to be very different. So by putting the power of that capability into the hands of the person who actually knows what they want, that is how we are able to change the paradigm and really compress the latency from here's my raw data to here's the decision I want to make on that data.
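To make the "fit for a specific purpose" point concrete, here is a toy illustration in pandas: the same raw table, two consumers, two different quality rules. The table, column names, and thresholds are all made up for the example, echoing the implausible-age case above.

```python
# A toy illustration of quality as fitness for purpose: the same raw data
# fails different quality checks depending on who is consuming it.
import pandas as pd

customers = pd.DataFrame({
    "customer_id": [1, 2, 3, 4],
    "age":         [34, 150, 28, -1],      # 150 and -1 are the glaring outliers
    "balance":     [1200.0, 50.0, None, 310.0],
})

# Analyst A segments by age, so implausible ages are the quality problem.
implausible_age = customers[(customers["age"] < 0) | (customers["age"] > 120)]
print(implausible_age)

# Analyst B sums balances, so null balances matter and ages do not.
missing_balance = customers[customers["balance"].isna()]
print(missing_balance)
```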
Let me ask this. So it sounds like, having put all the self-service capabilities together, you've democratized access to this data. Now, what happens in terms of governance, or more broadly just trust, when the pipeline has to go beyond where you're working on it, to some of the analytics or some of the basic ingest, to say: I know this data came from here, and it's going there? How do we verify the fidelity of these data sources? It's a fantastic question.

So in my career, having worked in BI for a couple of decades, I know I look much younger, but it has actually been a couple of decades. And remember, the camera adds about 15 pounds, for those of you watching at home. But you've lost it already. Well, thank you very much. So you've lost a net 30. Or maybe I'm back to where I'm supposed to be.

But what I have seen is the two models of governance in the enterprise when it comes to analytics and information management, right? There's model one, which is: we're going to build an enterprise data warehouse. We're going to know all the possible questions that people are going to ask in advance. We're going to pre-program the ETL routines. We're going to put in something like a MicroStrategy or BusinessObjects as an enterprise reporting factory tool, right? And then you spend $10 million on that project. The users come in, and the first time they use the system, they say, well, I kind of want to change this, I want to add this calculation. It takes them about five minutes to determine that they can't do it, for whatever reason. And what is the first feature they look for in the product in order to move forward? Download to Excel, right? So you invested $15 million to build a download-to-Excel capability, which they already had before. So if you lock things down too much, the end users will go around you. They've been doing it for 30 years, and they'll keep doing it.

Then we have model two. Model two is Excel spreadsheet hell, right? Or spreadmarts; there are lots of words for these things. You have a version of the data, you have a version of the data, I have a version of the data. We all started from the same transactional data, yet you're the head of sales, so suddenly your forecast looks really rosy. You're the head of finance, so you really don't like what the forecast looks like. And I'm the product guy, so why am I even looking at the forecast in the first place? But somehow I got access to the data, right? So these are the two polarities of the enterprise that we've worked with for the last 30 years.

We wanted to find sort of the middle path, which is to say: let's give people the freedom and flexibility to do the transformations they need. If they want to add a column, let them add a column. If they want to change a calculation, let them change the calculation. But every single step in the process must be recorded. It must be versioned. It must be auditable. It must be governed in that way. And so the reason the large banks and the intelligence community and the large enterprise customers are attracted to Paxata is that they have perfect retraceability for every decision they make. I can actually sit next to you and say, this is why the data looks like this. This is how this value, which started at 1 million, became 1.5 million.
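As a sketch of that "recorded, versioned, auditable" middle path, here is a toy Python model, not Paxata's internals: transformations only happen through an `apply()` method that appends an immutable step to an audit trail, so any value can be traced back through the operations that produced it. All names here are hypothetical.

```python
# A toy sketch of auditable self-service transformation: every change goes
# through apply(), which records who did what, when, in an append-only history.
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass(frozen=True)
class Step:
    user: str
    operation: str        # e.g. "add_column", "filter", "edit_value"
    detail: str
    at: datetime

@dataclass
class AuditedDataset:
    rows: list
    history: list = field(default_factory=list)   # append-only audit trail

    def apply(self, user, operation, detail, fn):
        self.rows = fn(self.rows)
        self.history.append(Step(user, operation, detail,
                                 datetime.now(timezone.utc)))

ds = AuditedDataset(rows=[{"forecast": 1_000_000}])
ds.apply("analyst_a", "edit_value", "applied 1.5x uplift to forecast",
         lambda rows: [{**r, "forecast": int(r["forecast"] * 1.5)} for r in rows])

for step in ds.history:   # retraceability: how 1 million became 1.5 million
    print(step.at.isoformat(), step.user, step.operation, step.detail)
```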
So that covers the Paxata part. But then the question you asked is, how do you even extend that to a broader ecosystem? And I think that's really about some of the metadata interchange initiatives that a lot of the vendors in the Hadoop space, but also in the traditional enterprise space, have had going for many years. So if you look at something like Apache Atlas, right, or Cloudera Navigator, they are systems designed to collect, aggregate, and connect these different metadata steps, so that you can see, in an end-to-end flow: this is the raw data that got ingested into Hadoop. These are the transformations that the end user did in Paxata in order to make it ready for analytics. This is how it's getting consumed in something like Zoomdata. And you actually have the entire life cycle of data now manifested as a software asset. In other words, those are not just managing within the perimeter of Hadoop; they're managers of managers. That's right. Okay, that's clear. Because the data is coming from anywhere and going to anywhere. And then you can add another dimension of complexity, which is that it's not just one Hadoop cluster. It's 10 Hadoop clusters. And of those 10 Hadoop clusters, three of them are in Amazon, four of them are in Microsoft, three of them are in Google Cloud Platform, and how do you know what people are doing with data then?

How is this all presented to the user? What does the user see? Great question. So the trick to all of this self-service is, first, you have to know very clearly who is the person you're trying to serve. What are their technical skills and capabilities, and how can you get them productive as fast as possible? So when we created this category, our key notion was that we were going to go after analysts. Now, that is a very generic term, because we're all, in some sense, analysts in our day-to-day lives. But in Paxata, a business analyst in an enterprise organizational context is somebody who has the ability to use Microsoft Excel. They have to have that skill or they won't be successful with today's Paxata. They have to know what a VLOOKUP is, because a VLOOKUP is a way to pull data from a second data source into the first, which we would all know as a join or a lookup. And the third thing is they have to know what a pivot table is and how a pivot table works. Because the key insight we had is that for the hundreds of millions of analysts, the people who use Excel on a day-to-day basis, a lot of their work is data prep. But Excel, being an amazing generic tool, is actually quite bad for doing data prep.

So the person we target, when I go to a customer and they say, are we a good candidate to use Paxata, and we're talking to the actual person who's going to use the software, I ask: do you know what a VLOOKUP is, yes or no? Do you know what a pivot table is, yes or no? If they have that skill, when they come into Paxata, we designed Paxata to be very attractive to those people. It's completely point and click. It's completely visual. It's completely interactive. There's no scripting inside of that whole process, because do you think the average Microsoft Excel analyst wants to script, or wants to use a proprietary wrangling language? I'm sorry, but analysts don't want to wrangle, right? Data scientists, the 1% of the 1%, maybe they like to wrangle, but you don't have that with the broader analyst community, and that is a much larger market opportunity that we have targeted to this point. Well, very large. I mean, a lot of people are familiar with those concepts in Excel, and if they're not, they're relatively easy to learn. That's right. Excellent.

All right, Nenshad, we have to leave it there. Thanks very much for coming on theCUBE. Appreciate it. Congratulations on all the success. Thank you. All right, keep it right there. We'll be back with our next guest. This is theCUBE. We're live from New York City at Big Data NYC. We'll be right back.
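For readers mapping the Excel skills discussed above onto data-prep operations: a VLOOKUP corresponds to a join, and a pivot table to a grouped aggregation. A minimal pandas sketch with made-up data:

```python
# VLOOKUP-as-join and pivot-table-as-aggregation, the two Excel skills the
# interview names, expressed in pandas. All tables and values are illustrative.
import pandas as pd

orders = pd.DataFrame({"order_id":    [1, 2, 3],
                       "region_code": ["E", "W", "E"],
                       "amount":      [100, 250, 75]})
regions = pd.DataFrame({"region_code": ["E", "W"],
                        "region_name": ["East", "West"]})

# VLOOKUP equivalent: pull region_name into orders by matching region_code.
joined = orders.merge(regions, on="region_code", how="left")

# Pivot-table equivalent: total amount per region.
print(joined.pivot_table(index="region_name", values="amount", aggfunc="sum"))
```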