Live from New York, it's theCUBE, covering Big Data NYC 2015. Brought to you by Hortonworks, IBM, EMC, and Pivotal.

Welcome back to Big Data NYC, everybody. This is theCUBE, SiliconANGLE Wikibon's continuous coverage of Strata + Hadoop World. This is our event within the event, and this is day three for us. Arsalan Tavakoli is here, he's the Vice President of Customer Engagement at Databricks. Hot company. You guys must be really happy about what's happening in the industry. Everybody's talking about you. They should call this Spark World. So congratulations on all the momentum, you guys have been doing a phenomenal job. And George, I know you're excited. We're excited. So how do you feel?

It feels great. I mean, it's also interesting to see. I joined Databricks about two years ago, and at that point Spark was an interesting project, but only a couple of people were paying much attention. I think we had maybe 50, 75 contributors back then. Fast forward to now, and you're looking at, I think, 800 contributors at last count. To your point, small companies, large companies, everybody is getting involved with Spark. So it is, frankly, an amazing thing to see.

And we were saying IBM won the contest for most mentions of Spark on theCUBE this week. We might beat that in this segment. I'll try. Just keep a count. I'll try. Having those guys put an emphasis behind it doesn't hurt. So we're hearing a couple of themes this week, and a big one, obviously, is real-time, near real-time, the ability to ingest that data. We're hearing a lot about the data store, that's kind of an interesting topic as well. But what are you hearing at the show?

Yeah, there's a lot of things. One of the great things to see, anytime you have a new project, is people asking, well, is it just the new shiny toy? What makes it special, right? Okay, we have contributors, we have X. What people really look for is, are people deploying it? Are people actually getting value out of it? So it's great to see there are now over 1,000 production deployments. So one is seeing that people across industries are actually using Spark. And the second question, which is where the conversation matures, is what are people using it for? That's where you get things like, okay, what about the data store? What about streaming in real time, and how are we doing it? So one of the most interesting things for us is to look at all the different types of use cases people have, and to compare, since Databricks is a company, what we see from our own product and offering with how people are using the broader open-source Spark project.

Yeah. So go ahead, George.

Well, in that survey that came out last week from Databricks, of all Spark users, there was one piece of data that stood out above all else: 48% of users are not on YARN, 40% are on YARN, and the rest, I think, Mesos. And what was astonishing is that implies almost half are essentially outside the Hadoop ecosystem. Tell us what they're doing where they're running independently of that. And then, for those that are in the Hadoop ecosystem, is that because they were already invested in Hadoop, and the ones who are independent were greenfield? Is that what it looks like?

So there's a couple of things there. Let me try to unpack them.
One, it's always interesting to see how these surveys have evolved over time, to be honest. Very early on, more than 50% of users had deployed Spark using the standalone mode, partly because it existed before support for YARN and Mesos and so forth. In the very early days, the numbers were actually something like 50% standalone, 40% Mesos, and a much smaller number on YARN. Because if you go back to the history of Spark, the first project was Mesos, created at Berkeley. Matei had built this resource manager and needed something to run on it in addition to Hadoop, so he created a project called Spark, right? That's where it came from. Now, as you moved forward, all the Hadoop vendors jumped on it and Spark basically became part of the Hadoop stack. You then had a commercially supported version of Spark sitting on YARN, while before Mesosphere there was no real commercially supported version of Mesos for people to adopt. That's one.

Second, for a lot of the people deploying it, the question is what they're using Spark for, right? And you touched on this, George: what's the door they came in through? From Databricks' perspective, it's split. 50% of our customers came from the Hadoop world and have used Hadoop; 50% of our customers have never used Hadoop before, right? When somebody has made an investment in Hadoop, they're now looking at Spark to supplement what they have. Some of the use cases we hear are people saying: one, I had an ETL job and I'm looking to speed it up, a very common one. Two, I'm looking to add interactive analysis where most of my workloads were batch beforehand. And three, I'm looking to bring together, as part of this data warehouse notion, multiple kinds of advanced analytics, like streaming and so forth, and it's just easier with Spark to do so, right? And when your data is sitting in HDFS, most likely you've invested in other elements of the Hadoop ecosystem, and Spark was designed to play well with that.

Then there's a completely separate camp of people. If you look at it, what's the number of Hadoop deployments worldwide? I've lost count, maybe a couple thousand at this point. Compare that to SAP, which has 300,000 customers, and so forth, and as one of your last guests was saying, a lot of people still haven't gotten there yet, and they're looking at Spark. Use cases we hear, especially in data science: people who are big Python or, you know, MATLAB users doing single-node data science, now looking at how to scale it up. We see a lot of customers from the MPP world, the Greenplums, Netezzas, and Verticas of the world, who were doing SQL and are now looking to add advanced analytics, like gaming companies doing likelihood-of-churn analysis. They turn to Spark. And then finally, one of the more interesting things, and not one of Hadoop's strengths: over 50% of our customers have data in more than one data source. It's in HDFS and S3, or S3 and Elasticsearch and Cassandra and so forth, so a lot of people approach Spark as the narrow waist for computation when your data is spread out across a lot of different sources.

So we just did a recent survey, and a lot of the findings were consistent with your survey.
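[To make that "narrow waist" idea concrete, here's a minimal sketch of a single Spark job computing over two of the data sources mentioned above, HDFS and S3, using the Spark 1.x-era Scala API current at the time of this interview. The paths, bucket, and column names are hypothetical, and the S3 read assumes the Hadoop S3 connector and credentials are configured.]

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

object NarrowWaistSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("multi-source-join"))
    val sqlContext = new SQLContext(sc)

    // Event logs landed in HDFS by an ingest pipeline (hypothetical path).
    val events = sqlContext.read.json("hdfs:///data/events/2015/09/")

    // Customer records exported to S3 (hypothetical bucket).
    val customers = sqlContext.read.parquet("s3n://example-bucket/customers/")

    // One computation layer over both sources: join and aggregate.
    events.join(customers, "customerId")
      .groupBy("region")
      .count()
      .show()
  }
}
```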
Different audience, sort of a random sample of organizations, but huge interest in Spark, specifically saying we're going to replace a lot of existing things we would have done in Hadoop, you know, very, very high intentions. But then take some of the workloads, which I just wanted to run by you: this was sort of analytic workloads that need real time, IT operations support, and I presume a lot of that is Splunk territory, okay, fine, but also data transformation, and you mentioned fraud detection, risk management, workflow optimization. These were the big ones coming up, among others. Are you seeing similar overlaps there?

Yeah, and in general we do see this maturity curve as people adopt Spark. The first stage is just the data warehousing notion: I have a lot of data, I don't know exactly what I want to do with it, and it includes ETL. Everybody's always surprised, but 100% of our customers actually use SQL. They use other things too, but SQL is still a very common entry point. The next phase in that evolution is predictive analytics, machine learning, advanced analytics, some of the things you talked about: fraud detection, anomaly detection, risk detection all get into machine learning, which was one of the original things Spark was built for. And the final piece we're starting to see in the evolution is users getting into streaming, right? So we're seeing a lot more internet of things, real-time-type analytics: how do we do real-time anomaly detection and so forth as well.

So the integration of all those capabilities, and the increasing integration in Spark, that sounds like a path. When customers want to use those capabilities together, it sounds like it can happen in Spark, and it's harder to make happen elsewhere. It almost sounds like you're going to draw more customers in just because you can do those together and you're integrating those APIs, whereas they're separate products or projects in the Hadoop ecosystem.

Yeah, so the short answer is yes. If you look way back, two years ago when we were having the conversation about Spark, all the terminology was "in memory" and "fast," and you almost never hear anybody mention that anymore, because it's taken for granted. The main things people ask for are: is it easy to use, and can I do what I want with it at interactive speeds? Easy to use means, is it a language I'm comfortable with? Spark just added SparkR, and when we had a SparkR talk at Strata, the room hit fire code, because no matter how much you hear that everybody loves Scala and the newer languages, R is an extremely popular language. So it's, give me a language I'm comfortable actually using.

Right, and easy to write.

And then the final notion: from a technology vendor's perspective, you say I have the best machine learning system, I have the best streaming system. Talk to an actual enterprise, and they will never talk to you that way. They will come at you with: this is my use case. My use case is basically real-time anomaly detection, which means that under the covers what I need to do is use SQL to grab the data, train a model on it, and then put it on a streaming pipeline. And every time you add a new system or a new API, you create another source of friction, and that's very difficult for them.
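[As an illustration of the use case he describes, here is a hedged sketch of that three-step pipeline, SQL to grab the data, train a model on it, then score a live stream, using the Spark 1.x Scala APIs of the period (SQLContext, MLlib, DStreams). The table name, feature columns, socket source, and anomaly threshold are hypothetical stand-ins, not anything Databricks described.]

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext
import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.streaming.{Seconds, StreamingContext}

object AnomalyPipelineSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("anomaly-pipeline"))
    val sqlContext = new SQLContext(sc)

    // Step 1, SQL: grab historical training data (hypothetical table).
    val history = sqlContext
      .sql("SELECT duration, bytes FROM events")
      .rdd
      .map(r => Vectors.dense(r.getDouble(0), r.getDouble(1)))
      .cache()

    // Step 2, train: fit a clustering model offline (k=5, 20 iterations).
    val model = KMeans.train(history, 5, 20)

    // Step 3, stream: score records as they arrive, flagging points far
    // from their cluster center (threshold is made up for illustration).
    val ssc = new StreamingContext(sc, Seconds(10))
    ssc.socketTextStream("localhost", 9999)  // stand-in streaming source
      .map { line =>
        val Array(d, b) = line.split(",").map(_.toDouble)
        Vectors.dense(d, b)
      }
      .filter { v =>
        val center = model.clusterCenters(model.predict(v))
        math.sqrt(Vectors.sqdist(v, center)) > 100.0  // "far" = anomaly
      }
      .print()

    ssc.start()
    ssc.awaitTermination()
  }
}
```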
So that ease of use, the integrated platform where you can do all the processing you need, end to end, in one place, has been one of the things that's resonated a lot.

With ease of use. Yeah, some big emphasis on ease of use. You said earlier "easier," and I inferred you meant easier than what we're used to with Hadoop. Is it easy enough?

That's a relative question, right? I'll be honest, one of the disservices we do as an industry is saying it's easy: why don't you get the head of marketing to come and write code? Have you actually tried telling the head of marketing they need to learn to write even Python? What we aim for is easier, to the point that developers find the paradigms and the languages they build with much more natural. And it's one of the major focuses we have at Databricks, both on Spark and on our product platform: how do we make Spark itself easier, and how do we make creating a cluster and actually using it easier? The analogy I like to give is that Spark is a fantastic engine, and we keep making it better and better, but most people don't buy an engine. They drive a car. So how do we move closer and closer toward giving them that full car?

On that notion of driving the car and not seeing the engine, are notebooks strategic in terms of being able to bring all those capabilities together? I want to get at my data, I want to clean it up, I want to interrogate it, and then I want to learn from it and do anomaly detection or fraud prevention. Are the interactive tools critical to that story?

I think they absolutely are. I have this theory that may sound either naive or obvious: anytime you see a technology take off as rapidly as Spark did, as notebooks did, in retrospect it looks obvious. It means there was huge demand for something that was hard to do, and somebody finally found a technology that makes it easier. If you sit back and look at the workflow for somebody actually developing a business use case, it's almost always: I'm going to iteratively figure out how to solve this. I need to look at the data. I need to see, am I finding the anomaly? I trained it on this model, I need to tweak, I need to do X, Y, and Z. Notebooks provide a phenomenal environment where you can interactively step through and understand what works and what doesn't. But notebooks alone can't be the whole story, because very few people will say, I'm going to bet my business on something you explored in a notebook. The key is how you take what you did in notebooks and figure out a way to put it into production, into operation.

You can put the notebook into production and make it repeatable.

So one of the key things, again, stepping back and looking at what we did at Databricks: what are the biggest issues people have today in actually being successful? Because you guys have put out many of the same surveys, how many big data projects fail to show value, fail to get deployed. The things people told us were threefold. One, it's actually really hard to get a cluster up and down; it sometimes takes six to nine months, and people want it elastic, running an up-to-date version. The second thing is, I want an easy-to-use environment that's iterative. And then finally, I want to be able to put it in production.
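[For a sense of what that iterative, notebook-style exploration looks like in practice, here's a minimal sketch written as notebook cells, with the cell boundaries as comments. It assumes a notebook environment like Databricks or Zeppelin that pre-defines `sqlContext`; the mount path, columns, and the anomaly rule are hypothetical.]

```scala
// Cell 1: look at the data first.
val df = sqlContext.read.json("/mnt/logs/2015/")  // hypothetical mount path
df.printSchema()
df.show(5)

// Cell 2: clean it up interactively and sanity-check the distribution.
val cleaned = df.filter("bytes IS NOT NULL AND duration > 0")
cleaned.describe("bytes", "duration").show()

// Cell 3: try a first anomaly rule, eyeball the hits, tweak the
// threshold, and re-run this cell until it looks right.
cleaned.filter("bytes > 1000000").show(20)  // too many hits? raise it
```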
So we built that: you can take a notebook and say, run it in production, which means run it every day, every hour, every night: create the infrastructure for it, run it, then shut it down, and notify me if it goes down. That productionizing has probably been one of the most used and demanded features we have, for exactly that reason. People want to get from exploration to production.

Okay, so is that really the next wave of emphasis in terms of where the innovation is? Or do you guys feel like you're there?

I think it's a dangerous thing to ever say you're there. What I would say is, it's exciting to see the number of people who get real value out of the product right now. One metric I'm pretty proud of: I looked at our first batch of customers across the first six months, and over 50% of them have already upgraded to a larger cluster. What that says to me is that they used it, they got value, there was a business use case for it, and they came back for more, right? And that's the key question: are you getting actual value? Now, where do we invest more? There's both the Spark side and the Databricks side: what about other clouds? How do we make security easier? How do we make collaboration and operationalizing easier? So there's always work to do.

Do you measure how long on average it takes to go from proof of concept to pilot to production?

Yeah, and a bit of that depends on the size of the company, right? One of the things we looked at very early on, the metric that's cited pretty commonly, depending on who you ask, is that it takes six to 12 months to get from POC to production. The nice thing from a Databricks perspective is that when somebody starts, the time between saying, look, I'm interested in getting going, and having a production-ready environment up, is an hour or less. That's it, it's an hour, deployed in their AWS account and so forth. Almost always we then say, let's do a basically two-week POC. And they're always like, I can't. I promise you, try it in two weeks, as long as... and everybody says, I have 50 use cases. I don't want 50. Give me one, one with an actual metric for how you define success: I want to do this, with this data, this transformation, and if you can do it at this speed or this easily, we've succeeded. Almost always, at the end of two weeks, maybe four weeks, they've moved over to being a paying customer with a much larger pilot moving toward production.

What about, I'm hearing about notebooks on top of Hadoop, like Zeppelin, and I don't know if that's the only one. Is there a limitation in the usability and accessibility of the notebook when the capabilities underneath it, the APIs, are part of separate products?

Yeah. So the one thing I will say is twofold. What's important to people, again going back to my point about friction, is the least amount of friction, being able to use something in a natural way. That's a lot of the appeal of Zeppelin, a great project, and Project Jupyter, the IPython notebook lineage, another great project, and so forth. But with what we built, people refer to it as a notebook, but we generally say it's a workspace, because there are a couple of things you need. Just take a simple example: you have a cluster. What happens when an organization has hundreds of users on it?
You want to be able to multiplex that cluster. Do you want 100 notebooks attached to it, each of them fair-sharing jobs on it? That's not a built-in capability in any of the other ones; they have to do basically static partitioning of memory, which means underutilized space. That's one. Second, there are things like: what about collaboration? What about version control? What about integration with tables, and integration with libraries and dependency management? So the notebook is the narrower piece, the interface you work through. To be successful, I think it's about an integrated workspace that takes all the things you need to be successful and puts them in one place.

So, to keep the infrastructure sort of transparent. In other words, if you're working on top of fragmented products or projects or infrastructure, you can't bring that usability into the workspace itself. Is that a fair way of putting it?

Yeah. It's hard to get the full value out of it, right? Inevitably, if the pieces weren't designed to work together, you're either going to have to contort things to make them work together, or there will be cases where benefits of the underlying system just aren't exposed through the notebook. Everything we built was built entirely for a Spark cloud environment, right? So there were a lot of design decisions we made early on that have paid off in what we see as useful for customers, and that are harder to do when you're gluing together disjoint systems.

We're about out of time, but last question: what should we be watching from you guys over the next six, nine, 12 months? What are your goals, maybe?

It's a great question. Over the next six, nine, 12 months, the things to watch are a couple-fold. On the product side, keep seeing the product improve, and keep looking at new clouds: we're on AWS today, but we're exploring things like Azure, SoftLayer, and Google Compute Engine. There are a lot of really interesting security features we're rolling out, which are key as we look at sectors from the federal government and public sector, which are very stringent, to some of the larger Fortune 500 enterprise customers we have. But what I really care about coming out of it is hearing interesting use cases. I think people are a little tired of hearing, oh, big data is a lot of hype, there's a lot of interest, there's a lot of vendors. People want to know: what are people actually doing with it? What is the value? Because almost every company you talk to first says, great, what are my peers doing that I should be looking at? Having more and more of those stories, which we're seeing through Databricks, I think will be great.

The fourth V, value. Are you guys going to be at re:Invent next week?

Yes, we will.

We'll be there too, we'll have theCUBE.

Fantastic.

Thanks very much for coming.

Thanks for having me. Appreciate it.

All right, keep it right there everybody, we'll be back with our next guest, wrapping up day three of Big Data NYC. Right back.