from San Jose in the heart of Silicon Valley. It's theCUBE covering Big Data SV 2016. Hey, welcome back everybody. Jeff Frick here with theCUBE. We are live in San Jose, California at Big Data SV, which is part of Big Data Week, which is concurrent with StructureConf. It's really ground zero for Big Data this week in San Jose at the convention center. We're at the Fairmont in the Gold Room. Got a little event tomorrow night, Wednesday night. If you're in the area, come on by about 4:30. Promises to be entertaining, and we're really excited about this next guest. This is a special company in the history of theCUBE. We do about 1,100, 1,200 interviews a year, and every once in a while we get a special one. We've had book signings, and I don't think we've had any animals yet, but we've had launch companies, and this company launched at Big Data NYC, our first big data event, in 2013. So we're really happy to welcome Nenshad Bardoliwalla. Did I get it? Close enough. Bardoliwalla. I'm very sorry. No worries. Co-founder and Chief Product Officer of Paxata. So welcome. Thank you very much for having me. Yeah, so hopefully you pulled up that clip and you saw it again. We watch it every day religiously in the office. Very good. So give us an update. Things have obviously changed since 2013. Actually, I think we had someone from Paxata on at the Spark Summit in San Francisco last year as well. That sounds right. That sounds right. So, obviously we're very proud. We have a long history. Our company was launched at the Strata Conference on theCUBE, so that's very exciting. And look, when we started the company in 2012 as just four guys in a basement in Redwood City, we said to ourselves, the market had clearly changed. We were going from a world where people were using relational database technologies, and we were clearly seeing the move towards Hadoop and other NoSQL technologies like Mongo and Couchbase and many others.
And on the other hand, we were also seeing the move on the end user side towards self-service. People were tired of waiting for IT organizations to give them what they needed. They wanted to be able to have the power in their own hands. And so it was with those two insights that we decided that there was a missing layer. The myth is that you can just take Tableau and slap it on Hadoop and then magic happens. And I can most assuredly tell you that magic does not happen unless you have Paxata. And so the real breakthrough for us was knowing that there needed to be a third leg of the stool, where being able to empower somebody to turn raw data that was being landed in a system like Hadoop into information on the fly, so that they could power their analytics, would be the next major wave. And here we are two and a half years later, and we have no shortage of companies who have suddenly found self-service data prep religion, and we welcome that. But as the pioneer of the space and the category, it's been a really exciting journey so far. So give us the update: funding, customers, employees, kind of the quick 4-1-1. Sure, so from a funding perspective, we are a Series C funded company. We did our Series C funding last year; we announced it in the middle of the year. We are very fortunate to be funded by Accel Partners, which is our original investor. And then we've added EDBI, the investment arm of the Economic Development Board of the Government of Singapore. So we made a very strategic decision to partner with them, because as you can imagine there are really exciting business opportunities in Asia and the broader market, and being able to partner with a fund of that caliber allowed us to have an anchor point in the Asia-Pac region. So we completed the Series C in the middle of 2015. We have been expanding extremely rapidly. We have doubled the number of people we have in the organization.
Very happy to say that I'm not the only guy running around doing demos on a MacBook. Took more than two years to get out of that mode. That was not fun. But we now have a very mature field organization, both here in the United States with folks in seven different regions, and we have people on the ground in Singapore and in Korea. Very soon we'll have more news on the Europe front, but we're expanding from a people perspective very significantly. We're at about 70 people right now. In terms of customers, obviously the customers who are using our software get a significant competitive advantage, so I won't go into details or names, but we're very proud to say that the top three banks in the United States are all Paxata customers. We have a very strong foothold in the United States government. Our technology, which allows analysts to be able to pull data together, turns out to be very useful for very specific mission-critical use cases, which I will not go into any detail about here, lest this conversation end very abruptly. Then we're also very fortunate to have the world's number one semiconductor company, the world's number one company in audit and assurance. We've been very fortunate to really put a stake in the ground as being the enterprise-class, mission-critical data preparation platform. If you look at the market, you're really seeing a bifurcation, right? There are a number of very good and useful desktop-type tools, people who have freemium models who are going that route, and we saw that very early on, and we made the decision in 2013 that we were gonna be the enterprise-class standard, and that bet has paid off with the caliber and types of customers that we've been able to bring in. So, Nenshad, let's dig into that a little bit. One of the things we've learned in sort of following the big data marketplace was that the platform itself, in the form of Hadoop, needed a lot of simplification.
And we're slowly getting better on that, sort of bending management tools around the expanding zoo of animals, including the use of ZooKeeper. But the other challenge customers face is the complexity of the data, which is one of the attributes of Hadoop: pour it all in and then let's figure it out. That's right. So tell us the progress we've made since your launch in making that easier to do. Great question. So it's a very interesting juxtaposition of two things that we've decided to do. On the one hand, you have the big data environment. And at a conference like Strata, you're gonna hear people talking about key-value stores and "oh, my nested JSON didn't behave the way I wanted it to when I parsed it," which is a very nice technical conversation, but it's not something that you hear business people talking about. So on the one hand, we did bet technologically, and by the way, we were also a pioneer here in 2013: we bet on Apache Spark way before all the Spark mania really took off. So we knew that the underpinnings of our system, if we were gonna be relevant in the enterprise environment of today, had to be based on big data technology. However, what we juxtaposed that with was the ability for the mere mortals among us, the average user, to have a point-and-click declarative environment where they could actually manipulate the data interactively. So instead of writing Pig scripts or instead of building a script generator, what our end users wanna be able to do is press a button that says, I want to pivot these columns, or I want to profile this data and create a filter. So what we've been able to do in terms of expanding the market of people who have access to data is to combine a really easy front end, where frankly the way we qualify our end users is I ask them, do you know what a VLOOKUP is in Excel? And they'll give you that knowing look, right? Yes, this person clearly knows what a VLOOKUP is. Do you work with pivot tables in Excel?
Yeah, I work with pivot tables too. Great. If you know those two techniques in Excel, you will be a very successful Paxata user. So it's really the yin and yang between a very fluid, easy-to-use interactive experience combined with a very powerful, Spark-based back-end platform that we have been building over the last two and a half years. It just strikes me, Peter Burris on an earlier interview talked about how these cars will never be successful because there just aren't enough chauffeurs. I'm thinking of the whole data science meme and the data scientist meme. No, there aren't enough chauffeurs. No, there aren't enough data scientists, but that's not the end game. That's not the path to success. So Prakash Nanduri is my co-founder and our CEO. He likes to say there's a data scientist in all of us. We identified very early on in the market that there's a hierarchy of skill sets that people have in the enterprise. At the very top of the pyramid are the 200,000 to 500,000 data scientists. These are extremely skilled people that understand statistics, they understand how to program, and they have domain expertise. There are not enough of them to build an enterprise software company that can reach really, really big scale, which is why what you see happening in the data science community is that it's largely driven by open source. So building products exclusively for data scientists did not seem like a sensible thing for us. Then there's the developer community, or the data engineering community, and what's happening there is that there are traditional tools for developers that have existed for a long time and have, I'd say, reached the end of their useful life, but the point is it's a very crowded space. So the insight that we had in 2013 that's really paid off is that we needed to go one level below on the pyramid and really expand the market of people who can do this.
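For readers who want the Excel analogy made concrete: the two techniques the guest uses to qualify users map directly to standard data-prep operations, a keyed lookup (VLOOKUP) and a group-and-aggregate (pivot table). Here is a minimal illustrative sketch in plain Python with made-up data; Paxata itself exposes these as point-and-click steps, not code, so nothing below is its actual API:

```python
# VLOOKUP ~ a keyed lookup/join: enrich each row from a lookup table.
# Pivot table ~ group rows by a key and aggregate a value column.
# (Illustrative stdlib sketch; data and names are invented for the example.)
from collections import defaultdict

regions = {"R1": "West", "R2": "East"}  # the lookup table (the VLOOKUP range)
transactions = [
    {"id": 1, "region_id": "R1", "amount": 120.0},
    {"id": 2, "region_id": "R2", "amount": 75.0},
    {"id": 3, "region_id": "R1", "amount": 40.0},
]

# "VLOOKUP": pull the region name into each row, with #N/A for misses.
for row in transactions:
    row["region"] = regions.get(row["region_id"], "#N/A")

# "Pivot table": total amount per region.
pivot = defaultdict(float)
for row in transactions:
    pivot[row["region"]] += row["amount"]

print(dict(pivot))  # {'West': 160.0, 'East': 75.0}
```

If an analyst can follow those two moves in a spreadsheet, they already have the mental model for self-service data prep; the platform's job is to run the same moves over billions of rows instead of thousands.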
There are hundreds of millions of people who know how to use Excel, and if they can prepare data in the same way that they can build visualizations intelligently in Tableau, we could have a really successful company. That's funny, so much like Christian Chabot, whom we interviewed at Tableau, I think in 2013, who again really said it's the Excel people, right? There's a lot of people that know Excel. That's really a rich opportunity to give them a new tool beyond what they've had. So if you could take all the Excel-skilled people in an organization and give them access to the underlying traditional data sources and big data sources, what kind of opportunity would you have to transform those businesses? And the tool, right? That's right. The data and the tool. So you've been around long enough now to have some customers make it, you know, farther along down the journey than just sort of trialing and doing proofs. Absolutely. So what are some of the applications that you've seen? Or first, how have some of these sort of self-service customers or users been able to refine data to where it's consumable by analytics? Right. You know, what have they done so far? Great question. So we have been successful in building a customer base across a number of different industries, right? Financial services, the government sector, healthcare, consumer products, et cetera. So I'll pick some of my favorite use cases just to walk through. One of the most interesting statistics that we like to talk about is the amount of money that's been spent on fines due to violating regulations in the financial services industry. Do either of you wanna guess how much money in aggregate banks have spent in paying off fines since 2008? Probably not enough. But that's a different conversation. I might get myself in trouble by guessing, but I think I've seen the numbers being in the billions. Hundreds of billions. Hundreds of billions in penalties; $250 billion. Oh, that I didn't know. 2009?
2008, when we had the financial crisis. So when you're in that kind of environment, right, the banks are looking for any way possible to make sure that they can keep the money in the bank and not give it to the regulators. So why are the top three banks in the United States Paxata customers? Because one of the regulations that they have, and there are many, is called CCAR. It's otherwise known as the stress test, right? And in the stress test, you have to monitor not only the quality of the data that you're feeding into your models, but also the quality of the models themselves, right? So what they do is they actually use Paxata to allow the business domain experts, the analysts, who actually understand "that is a correct counterparty transaction, that is not a correct counterparty transaction." You can't expect IT people to understand that; it's not their expertise. But if you put the power into the hands of the business domain experts, and they can in a point-and-click fashion actually highlight the exceptions, highlight which data elements look like they could be suspect, they can completely collapse the amount of time it takes for them to go from detecting an issue to being able to remediate it. So why are industries where regulatory pressures are high adopting Paxata? Because they can, at interactive speed, bring a billion-plus rows into a Spark cluster that has Paxata's technology on it, and point and click and find the needles in the haystack, take those needles, and immediately start processing them. So there's tremendous value in putting high-volume, interactive data preparation at scale into the hands of the analysts who actually understand the data. So that's one example, okay? So, wait, hold on, I gotta stop you there. One of my favorite questions is: with a billion data points, how do you pump it into a visualization and find actionable data?
So how does that actually work? How does the needle surface itself in that scenario that you just outlined? Great question. So we have a capability, or technology, that we call Filtergrams, right? And we made the bet very early on, because we went with Spark, that we wanted to be the vendor that could do interactive data prep at very large scale. So point number one is: to find the needles, you have to have all the data. If you're sampling data and then running batch jobs, you're no better off than you were in the traditional ETL world, right? So our vision, and now what we've delivered, is that you can load all the data into a cluster and you actually have random access to that data. You're looking at a user interface and you can scroll to literally any point in this multi-billion-row dataset and actually get access to it. So number one is you have to have all the data. Number two is you have to have both visual as well as algorithmic tools that help you find the needle in the haystack, right? So on the visual tools side, we're announcing here at the conference our spring 2016 release, and one of the investments we've made is in our Filtergram capability, which is basically visual histograms. And those histograms are really interesting because, again, they're single click. You don't have to code anything. You click on a column and you'll immediately see the distribution of all the values. So when you're looking for that needle, you'll see a really nice distribution of transactions in a bank. And suddenly, all the way out here, like seven standard deviations to the right, you see that there are multiple transactions that have taken place, and you can immediately zoom in on those and then go ahead and actually flag those transactions and start to investigate them. So that's a visual technique.
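The "seven standard deviations to the right" pattern described above is, in code terms, z-score outlier flagging over a column's distribution: the same anomaly a Filtergram-style histogram makes visible at a glance. A hedged, illustrative sketch in plain Python with made-up amounts and an arbitrary threshold (this is not Paxata's actual API):

```python
# Flag values that sit far out in a column's distribution -- the outliers
# a histogram of the column would show as a lone bar off to the right.
# Data and threshold are invented for the example.
from statistics import mean, stdev

amounts = [100.0, 102.0, 98.0, 101.0, 99.0, 103.0, 97.0, 100.0, 5000.0]

mu = mean(amounts)
sigma = stdev(amounts)
THRESHOLD = 2.5  # flag anything more than 2.5 standard deviations from the mean

outliers = [a for a in amounts if abs(a - mu) / sigma > THRESHOLD]
print(outliers)  # [5000.0] -- the transaction worth investigating
```

Note that with a small sample, a single extreme value inflates the standard deviation itself, which is why production tools often use robust statistics (median and MAD) instead of the plain mean and standard deviation shown here.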
Then algorithmically, we also figured out a number of very interesting ways to look for deviations, in terms of repeated values that show up and look aberrant, or to join different data sets in a way where the interesting aspect of the join is not the things that actually connected, but the things that didn't connect, because that's how you know that there's an anomaly in the way the data is structured. So the combination of scale, visual techniques, and algorithmic techniques allows people to iterate in a very fast manner to find those needles in the haystack, which is why those banks are going with Paxata as an enterprise-class platform. Great story. So once you've, I mean, this is a very, very clear example with hard ROI. Is it difficult to translate that into other use cases in different industries, or do you just go deeper into the bank? Great question. So I think our vision has always been to be a horizontal information platform for the enterprise, right? That's very clearly where we're going, and the technology that we have built is a horizontal technology. So our goal has been to become the de facto next-generation information platform across multiple different industry segments, and if you look at our customers, not just in banking or in the government sector, but if you look in consumer products, for example, Del Monte is a very well-known customer of ours. Obviously nothing to do with banking. And, you know, that very large semiconductor company that many people know has nothing to do with banking. So the approach that we've taken is to build a horizontal set of data transformation capabilities, governance capabilities, collaboration capabilities, and then over time to layer notions of solution constructs on top of it, right? So taking the core platform and then adding connectivity and models and that kind of thing to make them richer and more packaged from a solution perspective.
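The "things that didn't connect" technique mentioned above is what SQL practitioners call an anti-join: keep only the rows with no match on the other side, because in reconciliation work the unmatched rows are often the anomalies. A minimal illustrative sketch in plain Python with invented transaction data (not Paxata's actual API):

```python
# Anti-join: find rows in one dataset that have no match in the other.
# The unmatched rows are the suspects worth investigating.
# (Illustrative stdlib sketch; datasets and keys are made up.)

ledger = [
    {"txn_id": "T1", "amount": 500},
    {"txn_id": "T2", "amount": 250},
    {"txn_id": "T3", "amount": 75},
]
bank_feed = [
    {"txn_id": "T1", "amount": 500},
    {"txn_id": "T3", "amount": 75},
]

matched = {row["txn_id"] for row in bank_feed}
unmatched = [row for row in ledger if row["txn_id"] not in matched]

print(unmatched)  # [{'txn_id': 'T2', 'amount': 250}] -- never hit the bank feed
```

An inner join would have discarded T2 silently; flipping the question to "what failed to join" is what turns a routine merge into an anomaly detector.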
So let me ask you, I mean, the pipeline from data prep all the way to and through the analytic function is met by multiple tools today, because, you know, we just don't have that level of integration. But if one of the key attributes that people need in terms of governance is lineage, what would you plug into? For a customer who's got a multi-vendor sort of pipeline, what do they have to manage? Sure, great question. So there's both within-product lineage and then the broader ecosystem lineage, right? First, in order to play in the broader ecosystem, your product has to have very deep lineage and governance capabilities. So when we established the data preparation category, we said that there were five pillars to building an enterprise-class data preparation platform: there was integration, there was quality, there was enrichment, there was governance, and there was collaboration, right? Those all have to be part of one unified platform, not 15 different tools. So if you start with that mentality, from day one we recorded every single step that the end user takes on the data. Everything we do in our system is versioned, right? So you can actually roll back the clock and look at exactly what George or somebody else did in the system and see what steps they took, when they took those steps, whether other people were involved, et cetera. We version all of the data sets that actually flow through the system, both the imported data sets and the exported data sets. So point number one, from a lineage perspective, is that in order to be a good system, or a good citizen, in the whole pipeline, you yourself have to have very deep governance capabilities for the part of the pipeline that you own. And that is something that we have had from day one. However, to your point, we don't cover the entire end-to-end spectrum of all data sources. If they don't flow through Paxata, well, first of all, shame on you.
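The step recording and versioning described above can be sketched as an append-only log of transformation steps per dataset, where each recorded step is a new version that can be inspected or replayed. This is an illustrative design in Python; the class and field names are invented for the example and are not Paxata's internal model:

```python
# A minimal sketch of step-level lineage: every transformation a user
# applies is recorded with who, what, and when, so the history can be
# audited and rolled back. (Illustrative design, not Paxata's model.)
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class Step:
    user: str
    action: str   # e.g. "filter", "pivot", "join"
    params: dict
    at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))

@dataclass
class DatasetLineage:
    name: str
    steps: list = field(default_factory=list)

    def record(self, user, action, **params):
        self.steps.append(Step(user, action, params))

    def version(self):
        return len(self.steps)  # each recorded step is a new version

lineage = DatasetLineage("counterparty_transactions")
lineage.record("george", "filter", column="amount", op=">", value=0)
lineage.record("george", "pivot", columns=["region"])

print(lineage.version())        # 2
print(lineage.steps[0].action)  # filter
```

Because the log is append-only, "rolling back the clock" is just replaying the first N steps against the original imported dataset rather than mutating data in place.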
They should flow through Paxata, but if they don't, obviously you want to be able to connect to other systems. And because we chose to go with a REST API-based approach, where the entire platform is enabled through REST, we can use the REST API to connect to other centralized lineage systems, like Cloudera Navigator, like the Atlas project that the Hortonworks folks are pushing, and that sort of thing. And of course, also to more traditional systems as well. All right. Unfortunately, we're out of time, which is the bad news. The good news is, of course, you're coming back. So when we see you at Spark Summit, you're back on again. That's right. I'm just a stunt double. Prakash takes care of the customers, and that makes a lot of sense. So that's great. So thank you, great insight. You know, some great stories. It's really important that people understand how their peers get started and where they're using this technology. So it's not just a bunch of guys talking tech. It's actually people solving business problems. Yes. There's huge value in this whole space and we're very grateful to have the opportunity to speak with you gentlemen on theCUBE. Absolutely. Glad you came. Hopefully we'll see you. Thank you. Maybe at Spark Summit in San Francisco. We will be at Spark Summit West. But want to make sure you log on to Twitter. The Twitter handle for theCUBE is @theCUBE. You'll see all of our CUBE gems and our Twitter cards and our CUBE cards and all of the great conversations that come out of theCUBE interviews. We're really excited because we get really smart people on who are inventing this technology, implementing this technology, changing the world with the technology. And it's great for them to come on unscripted, share their insight with you, our audience, and our community. And so we're really, really thankful, and thankful to our sponsors as well. Paxata sponsors theCUBE.
We've got a whole crew here. You can see we've got a lot of gear, we've got a lot of people. People got kids, they gotta eat, shoes, the whole story. So without sponsors, we couldn't do it. So thanks a lot for watching. We are live in San Jose at Big Data SV 2016. We'll be back with our next guest after this short break. Thanks for watching.