From New York, extracting the signal from the noise, it's theCUBE, covering Spark Summit East, brought to you by Spark Summit. Now, your hosts, Dave Vellante and George Gilbert.

At the Hilton Hotel, this is Spark Summit East, and this is theCUBE. theCUBE goes out to the events; we extract the signal from the noise. Jack Norris is here. He's the CMO of MapR and a longtime CUBE alum. Jack, it's great to see you again.

Hey, thank you, Dave.

You've been here since the beginning of this whole big data meme. It might have started here.

I don't know, I think we've come full circle.

I think you're right. I mean, it really did start here; I think this building hosted our first big data show, the original Hadoop World, and you guys, like I say, have been there from the start. You were kind of impatient early on. You said, you know, we're just going to go build solutions and ignore the noise, and you built a really nice business. You've been growing, you're growing your sales force, things are good, and all of a sudden, boom, the Spark thing comes in. So we're seeing the evolution. I remember saying to George, in the early days of Hadoop we were geeking out, talking about all the bits and bytes, and then it turned into a business discussion. It's like we're back to the hardcore bits and bytes. So give us the update from MapR's point of view. Where are we in the whole big data space?

Well, I think it has transitioned. I mean, if you look at the typical large Fortune company or the Web 2.0s, it's really: how do we best leverage our data, and how do we leverage it so that we can make decisions much faster? That high-frequency decision-making process typically involves taking production data and analytics and joining them together so that you're impacting the business as it happens, and doing that effectively requires innovation. So the exciting thing about Spark is having a distributed compute engine that's much easier to develop on and much faster.

So remember, in the early days we'd be at these shows and the big question was, can you take the humans out of the equation? It's like, no, no, humans are the last mile. Is that changing, or do we still need that human interaction?

Well, I think humans are an important part of the process, but increasingly, if you can make small algorithmic decisions, and make them at that moment of truth, you get big impact, and I'll give you a few examples. Ad platforms: the Rubicon Project runs over 100 billion ad auctions a day. Humans are part of that process in terms of setting it up and reviewing it, but each supply and demand decision is automated, and optimizing that has a huge impact on the bottom line. Fraud: swiping a credit card and deciding whether that transaction is fraudulent, avoiding false positives, et cetera, is a big leveraged item. We're seeing things like that across manufacturing, retail, and healthcare. And it isn't about asking bigger questions or running reports that look back at what happened last week; it's about having an infrastructure in place that allows the organization to be agile. Because it's not the companies with the most data that are going to win, it's the companies that are the most agile in making intelligent adjustments.
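As a toy illustration of the "small algorithmic decisions at the moment of truth" Norris describes, here is a minimal sketch in Scala of scoring a single ad-auction bid request with a pre-trained model. The feature names, weights, and thresholds are invented for illustration, not taken from any vendor; the point is only that humans build and review the model offline, while each of the billions of daily decisions executes with no human in the loop.

```scala
// Sketch: one automated bid decision at the moment of truth.
// Weights are hypothetical logistic-regression parameters learned
// offline (the part humans set up and review).
object BidDecision {
  val weights = Map("mobile" -> 0.8, "evening" -> 0.3, "repeatVisitor" -> 1.2)
  val bias = -1.5

  // Probability the user clicks, via a standard sigmoid.
  def clickProbability(features: Set[String]): Double = {
    val z = bias + features.toSeq.map(f => weights.getOrElse(f, 0.0)).sum
    1.0 / (1.0 + math.exp(-z))
  }

  def main(args: Array[String]): Unit = {
    // One incoming auction request, described by its features.
    val request = Set("mobile", "repeatVisitor")
    val p = clickProbability(request)
    // Bid only when the predicted value clears a threshold;
    // no human is consulted for any individual auction.
    if (p > 0.4) println(f"bid at $$${p * 2.0}%.2f CPM")
    else println("pass")
  }
}
```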
So much data, humans can't ingest it any faster. I mean, we just can't keep up. So the world needs data scientists, it needs trained developers. You've got some news I want to talk about on the training side, but even there, we can only throw so many bodies at the problem. It's really software that's going to allow us to scale, and software's hard, software takes time. We've seen a lot of the spend in this analytics and big data world go to services, and obviously you guys and others have been working hard to shift it toward software. I want to come back to that training issue. We heard this morning about Databricks' launch of a MOOC. They've trained 20,000 people, which is a lot, but there's still a long way to go. You guys are putting some investment into training. Talk about that news.

Yeah. Well, it starts with the underlying software. You can do things on the platform to make it much easier, and do things that are hard to surround with services, like data protection, right? If you've lost data, it doesn't matter how many people you throw at it, you can't recover it. So that's kind of the starting point.

And you're going to get fired.

The approach we've taken is a software product approach to the training as well. So we rolled out on-demand training. It's free, it's on-demand, you work at your own pace, and it's got different modules, with some hands-on labs, if you will, associated with them. We launched that last January, so we recently celebrated the one-year anniversary, and we've trained 50,000 people on Hadoop and big data. Today we're announcing an expansion into Spark classes. We've got a full curriculum around Spark, including a certification, so you can get Spark certified through this MapR on-demand training.

Got it. George?

You said something really, really intriguing that I want to dive into a little bit, where we were talking about the small decisions that can be made really, really fast without a human in the loop. A human might have to train the models, but not at runtime. Now, you said it's not about asking bigger questions, it's about finding faster answers. What had to change in your platform, or in the underlying technology, to make that possible?

You know, there's a lot that goes into it. It's typically a series of functions, a kind of breadth that needs to be brought to the problem, as well as squeezing out latencies. In the traditional approach, different applications and different analytic techniques each dictate a separate silo and a separate schema of data, you've got those scattered around the organization, the data travels, and you get an answer at the end of some period of time. Instead, it's converging all of that onto a single platform and squeezing out those latencies, so that you can take an informed action at the speed of business, if you will.

And let's say Spark never came along. Would that be possible?

Yes, yes.

How would you do it?

If you look at the different architectures that are out there, there's typically deep analytics, the "let's go look at the trends over the last seven years, what happened" piece; then actions on a streaming set, with, say, Storm; and then a real-time database operation, which you can do with HBase or MapR-DB, and you stitch all of that together. What Spark has really done is make that whole development process much easier and much more streamlined, and that's where a lot of the excitement's happened.
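To make that contrast concrete, here is a minimal sketch in Scala of the kind of convergence Norris describes: a single Spark job, using the micro-batch DStream API of this era, that joins a live transaction stream against historical batch data, work that previously meant wiring Storm to a separate database. The socket source, file path, and five-times-average threshold are illustrative assumptions, not anything MapR or a customer has described.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Sketch: streaming ingest, historical lookup, and filtering in
// one engine, instead of Storm plus a separate database silo.
object FraudScoreStream {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("fraud-score-stream")
    val ssc = new StreamingContext(conf, Seconds(1)) // 1s micro-batches

    // Hypothetical socket source emitting "cardId,amount" lines.
    val txns = ssc.socketTextStream("localhost", 9999)
      .map(_.split(","))
      .map(f => (f(0), f(1).toDouble))

    // Historical per-card average spend, loaded as a batch RDD;
    // the same engine serves the batch and streaming sides.
    val history = ssc.sparkContext
      .textFile("hdfs:///data/card_averages") // path is illustrative
      .map(_.split(","))
      .map(f => (f(0), f(1).toDouble))

    // Flag transactions far above a card's historical average.
    txns.transform(_.join(history))
      .filter { case (_, (amount, avg)) => amount > 5 * avg }
      .print()

    ssc.start()
    ssc.awaitTermination()
  }
}
```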
So you mentioned two use cases earlier, ad tech and fraud detection, and I want to ask you about the state of those. Ad tech has obviously come a long way, but it's still got a ways to go. I mean, you look at who's making money in ads: obviously Google, they're making tons of money, and everybody else is sort of chasing them. Facebook is making money, probably because they didn't let Google in, right? So how will Spark affect that business, and what's MapR's role in evolving it to the next level?

There are different kinds of compute and different types of things you can do on the data. Increasingly, we're seeing streaming analytics, making those decisions as the data arrives. And then there's the whole ecosystem question of how you coordinate those flows of data. It's not just a simple "here's the origin, here's the destination"; there's typically a complex data flow. That's where we've focused with MapR Streams, this huge publish-and-subscribe infrastructure, so that you can get real-time data to the appropriate location and then do the right operations. A lot of that involves Spark, but not exclusively.

Okay, and then on fraud detection. It's obviously come a long way; sampling kind of died. And now we're getting too many false positives. You get the call. I mean, I get a lot of calls because we buy so much equipment. But what about the next level? What are you guys doing to take fraud detection to the next level, so that when I get on a plane in Boston and land in London, it knows? Is that a database problem? Is it an integration problem, a systems problem? What role are you guys playing in solving that?

Well, there are a lot of details and techniques that go beyond what we or our customers will share publicly. But in general, the more data you can apply to a problem, the more context, the better off you are; that's the way I'd summarize it. So instead of sampling, or instead of "boy, that's a strange purchase over there," it's understanding: this is Dave Vellante, this is the full body of expenditures he's made and the types of things he buys, here's who he frequently purchases from, and here's his transaction trend: started in San Francisco, went to New York, et cetera. In context, it would make more sense. Part of that is more data, and the other part is better algorithms and better learnings, applied on a continuous basis.

How are your customers dealing with that constraint? They've got $100 to spend, and they can only spend so much on each piece: gathering more data, cleaning the data, where they spend so much time getting it ready, versus working on their machine learning algorithms or whatever other techniques they use. What are you seeing there as best practice? It probably varies, I'm sure, but give us some color on that.

I'll actually go back to Google; some excellent insights have come from Google. They wrote a paper called "The Unreasonable Effectiveness of Data," and in it they squarely address that problem: given the choice to invest in either a more complex model and algorithm or in applying more data, applying more data had a huge impact. My simple explanation is that if you're sampling the data, you have to have a model that tries to recreate reality; if you're looking at all of the data, the anomalies can pop up and be more apparent. And the more context you can bring, the more data from other sources, the better your picture of what's happening and the better off you are. That requires scale, it requires speed, and it requires different techniques that can be brought to bear: here's a database operation, here's a streaming operation, here's a deep machine learning algorithm over files.
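A few minutes earlier, Norris pointed to MapR Streams' publish-and-subscribe infrastructure as the way real-time data reaches the right location. As a rough illustration of that pattern, here is a minimal producer sketch in Scala. It assumes MapR Streams' Kafka-compatible producer API and a hypothetical "/apps/finance/txns:card-swipes" stream-and-topic path; treat it as a sketch of the pub/sub pattern under those assumptions, not the product's documented usage.

```scala
import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

// Sketch: publish one transaction event to a topic that several
// downstream consumers can subscribe to independently.
object TxnPublisher {
  def main(args: Array[String]): Unit = {
    val props = new Properties()
    props.put("key.serializer",
      "org.apache.kafka.common.serialization.StringSerializer")
    props.put("value.serializer",
      "org.apache.kafka.common.serialization.StringSerializer")
    // Stock Kafka would also need "bootstrap.servers"; MapR's client
    // resolves the stream from the topic path instead (assumption).

    val producer = new KafkaProducer[String, String](props)

    val record = new ProducerRecord[String, String](
      "/apps/finance/txns:card-swipes", // hypothetical stream:topic path
      "card-1234",                      // key: card id
      "amount=250.00,city=NYC"          // value: event payload
    )
    producer.send(record)
    producer.close()
  }
}
```

Downstream consumers, such as a fraud scorer, an ad-decision engine, and an archival job, would each subscribe to the same topic independently, which is what lets one event feed the several operations Norris lists.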
Now, a lot of vendors in the big data ecosystem are coming at Spark from different angles, trying to add value to it and sort of bathe themselves in its halo. You guys took some time up front to build a converged platform so that you weren't trying to wrap your arms around 37 different projects. Can you tell us how, even without having anticipated Spark, this converged platform lets you add more value to it than other approaches do?

To simplify: if you look at the Hadoop ecosystem, it's basically separated into components for compute and management on top of the data layer, the Hadoop Distributed File System. How do you scale data, how do you protect it? That's very simply what's going on. Spark does a great job at that top layer, but it doesn't do anything about defining the underlying storage layer, and in the Hadoop community that underlying storage layer is a batch system. So you're trying to do micro-batch streaming operations on top of batch-oriented data. What we did was take that whole data layer, make it real time, make it random read-write, and converge enterprise storage together with Hadoop and Spark support on a single platform. That's basically the difference: we made it enterprise grade.

You guys were really the first to lead that.

Absolutely.

You were; everybody started talking about enterprise grade after you were already delivering it. So you've had a lead there. Do you feel like you still have that lead, or is this the kind of thing where you hit the top of the S-curve and start innovating elsewhere?

NC State did a study just this past year which found that only 25% of data corruption issues are identified and properly handled by the Hadoop Distributed File System, and 42% of those issues are silent. So there's a huge gap between quote-unquote enterprise-grade features and what we provide. Silent data corruption has been a problem for decades, and it's no different in the Hadoop ecosystem, especially as mainstream businesses start to adopt it.

What's happening in the Valley? We see it in the Wall Street Journal every day: down rounds, flat rounds, people who can't get B rounds. You guys are funded, you're growing, you're talking about investments. What do you see? Do you feel like you're achieving escape velocity? Maybe give us an update on the state of the business.

Yeah, I think the state of the business is best represented by the customers, and the customers vote. They vote in terms of how well this technology drives their business. We've got a recent study that shows the returns customers are getting. We've got 1% churn, a 99% retention rate, with our customers. We've got an expansion rate that's unbelievable. We've got multimillion-dollar customers in seven of the top verticals, and nine of our top ten are million-dollar customers. So we're seeing significant investments and, more importantly, significant returns on the part of customers, where they're not just running a single application on the platform but multiple applications.

Jack Norris of MapR, always focused. Always a pleasure having you on theCUBE. Thanks very much for coming on.

Thanks, guys, appreciate it.

Thank you. Keep it right there; we'll be back with our next guest. This is theCUBE, live from Spark Summit East. We'll be right back.