from New York, extracting the signal from the noise. It's theCUBE, covering Spark Summit East, brought to you by Spark Summit. Now your hosts, Dave Vellante and George Gilbert.

Welcome to New York City everybody, this is theCUBE. I'm here with George Gilbert; this is Dave Vellante. This is our second big data show this year, and it just so happens it's in New York. We had the RapidMiner show earlier this year, George, and of course we're here at Spark Summit East. This is the second CUBE day that we've done at Spark Summit. We did the West Coast show in San Francisco earlier this year. We didn't do the European show in Amsterdam, but the growth of the Spark community is quite impressive. We saw some numbers earlier today, and we were here last night at the meetup; there were close to 300 or 400 people, hardcore data scientists geeking out.

George, you were pointing out in our prep calls earlier this week that the East Coast community is substantially more of what we call the doers, the practitioners who are actually applying big data, whereas on the West Coast you get a lot more of the technologist and vendor community. A lot of doers out there as well, but many, many more doers here, primarily because of the financial services industry.

So Spark is really beginning to take the world by storm. We heard a lot last night about Spark 1.0, and of course 1.6 and Tungsten, and how that was about bringing Spark closer to the bare metal; now, in Spark 2.0, it's about optimizations, code optimizations and improved streaming, and we're going to talk about all that. You know, heavy, heavy geekdom going on here. It reminds me of the early days of Hadoop World, and of course one of the things we really want to explore today on theCUBE is the business impact of all this stuff. We saw this in the early days of Hadoop, George, where a lot of the discussion early on was about bringing the code to the data, not the data to the code, and how to deal with things like the complexities of MapReduce and all these projects like Sqoop and Flume and Hive and Pig, all these acronyms that we really didn't understand at the time and have sort of learned as the ecosystem grows. You're hearing a lot of similar discussions, at really detailed, granular technology levels, around Spark, and we want to explore that, but we also really want to look at the business impact.

So we heard this morning from Andy Konwinski, who's one of the co-founders of Databricks and also responsible for the training, the online training, and from Matei Zaharia, the creator of Spark, who came out and gave a meaty, meaty presentation. And then of course Ali Ghodsi, the co-founder and CEO of Databricks; both of those individuals are coming on theCUBE later. He really talked about why Databricks was founded, which gets me to the bottom line here: Databricks was founded to simplify big data. They created Spark, open sourced it to the Apache community, and it's really been taking off since.

George, you've been doing a lot of work in this area. You're just about to release the industry's first ever Spark forecast, in the context of our overall big data forecast. So you kind of helped us get into this business, right? You were high on Databricks, what they were doing, the whole notion of the importance of in-memory and eventually real time.
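To ground the "bring the code to the data" point before George weighs in: here's a minimal sketch, assuming a hypothetical HDFS input path, of the classic word count. Under hand-written MapReduce this took separate mapper and reducer programs; in Spark it collapses into a few chained transformations that are shipped out to the nodes holding the data.

```python
# A minimal sketch (not from the discussion): word count in PySpark.
# The input path is a hypothetical placeholder.
from pyspark import SparkContext

sc = SparkContext(appName="WordCountSketch")

counts = (
    sc.textFile("hdfs:///data/logs/sample.txt")  # code ships to where the data lives
      .flatMap(lambda line: line.split())        # "map" phase: emit individual words
      .map(lambda word: (word, 1))
      .reduceByKey(lambda a, b: a + b)           # "reduce" phase: sum counts per word
)

for word, n in counts.take(10):
    print(word, n)

sc.stop()
```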
So what's your take on what's happening in the world of big data generally, and specifically Spark's impact?

Okay, so that's great teeing it up. What we're really seeing, like what you saw with Hadoop, is that any enterprise software has to go through this teething phase, where you work out the kinks as you're trying to get into production. You hit these, I wouldn't call them roadblocks, but things you have to work around, and then the vendors have to make the stuff production ready. You were naming some of the situations you would find yourself in with Hadoop, and we're sort of in that phase with Spark, where you heard the talk about Project Tungsten, which is trying to get to bare metal performance. We want the ease of use of Spark with the performance of software that's written all the way down to the metal.

But more importantly, Spark came about as a reaction to, and a replacement for, MapReduce, which is the core processing engine of Hadoop. The idea was to take better advantage of memory and build something that's higher level and more usable, and now it has also integrated more capabilities: you can use it as if it were a SQL query engine, you can use it for machine learning, you can use it for graph processing, which is helpful for things like recommendation engines. By putting all those things together, you've really got an engine for the next five-plus years of big data applications.

Let me just qualify that and say where Hadoop got us. If we look at systems of intelligence as a journey, we have three stages in that. Stage one is the data lake, which is a better, faster, cheaper sort of data warehouse, plus some things, minus some things. Stage two is personalized digital experiences for a consumer dealing with a vendor. The third stage is real time, deep integration with systems of record for things like fraud detection or even the internet of things. Hadoop got us through phase one.

Okay, so let's talk about that a little bit. The epiphany of Hadoop, back in the 2008, 2009 timeframe, was that you can bring, as I said earlier, five megabytes of code to a petabyte of data: leave the data where it is, distribute the code, act on it accordingly, map it and reduce it, so to speak. And that began, particularly in the financial services industry; we saw a lot of folks, a lot of the big banks and other industries for sure, starting to build out data pipelines, what our friend Abhi Mehta called at the time the data factory. And one of the points he made at the time was that sampling is dead; no longer do we need to do sampling. And the reason was that Hadoop dramatically lowered the cost of the dreaded storage container.

Right. In fact, Jeff Hammerbacher said in the early days of Hadoop that one of his primary missions when he was at Facebook was to obliterate the expensive storage container, sort of a shot at the traditional EMC, NetApp sort of boxes, because they're too expensive. That's why you had to sample data. Now, with Hadoop dramatically lowering the cost of storing data, you could actually operate on all that data. So we started to build data lakes, and essentially, as you've pointed out, it was a cheaper version of the data warehouse. The return on investment in the early days of Hadoop was actually reduction on investment, lowering the denominator. Okay, so now, Hadoop, of course, as everybody knows, is largely batch, and there have been some projects that try to deal with that.
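A hedged illustration of George's "one engine, many workloads" point above: the same DataFrame is queried with SQL and then handed to MLlib, with no data movement between systems. The file path, table name, and column names here are hypothetical.

```python
# A sketch, assuming hypothetical event data: SQL and machine learning
# running against the same data on the same Spark engine.
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.clustering import KMeans

spark = SparkSession.builder.appName("UnifiedEngineSketch").getOrCreate()

df = spark.read.json("hdfs:///data/events.json")  # hypothetical event log
df.createOrReplaceTempView("events")

# Workload 1: use Spark as a SQL query engine
visits = spark.sql(
    "SELECT user_id, COUNT(*) AS visits FROM events GROUP BY user_id"
)

# Workload 2: machine learning on the same data, same cluster
features = VectorAssembler(
    inputCols=["visits"], outputCol="features"
).transform(visits)
model = KMeans(k=3, featuresCol="features").fit(features)

model.transform(features).show(5)
spark.stop()
```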
Now, enter Spark. So you've just gone out, George, and done probably the most comprehensive study of the big data business certainly ever done. You've talked to how many companies in the last month? At least 40. So, 40 companies in the last couple of months, and a number of practitioners, doers as we call them, and obviously a number of technologists. You touched on these in your earlier remarks, but let's explore them in a little more detail. What are the key takeaways that you want our community to understand from your findings?

Okay, so it's pretty simple, and I started touching on these. There are three stages in the customer journey. That's the first and most important way for both vendors and doers to understand how they're gonna apply the technology. The data lake, we could do that with classic Hadoop; we can do some elements of it better with Spark. The second stage after that, as I was saying, is these more real-time, personalized systems. And the third stage is where we've got essentially autonomous software, software that can accept or reject a credit application or a credit card transaction, or take action on the internet of things, without a human in the loop. That's the key thing, so it needs to be near real time. So the first takeaway is, those are the stages in the customer journey. The second...

So just to clarify, you're saying Spark is essentially taking the baton from traditional Hadoop on the data warehouse, the data lake, because the data lake is kind of a mess; Hadoop is very fragmented and complicated, so the founders of Databricks started the company to simplify big data. Is that right?

Yes, that's an excellent analogy. It's not that Hadoop can't do scenario two, which is the personalized sort of digital experiences. It's just that it's hard. It's hard on the administrators, it's hard on the developers, and it's hard to get it to be in real enough time. Spark, when you get the kinks worked out, can help a lot there. And when we get to the real, real time stuff in scenario three, Hadoop as it's currently constituted can't really...

Okay, so you're talking about three main findings. One is that the Hadoop data lake is evolving and becoming simplified around Spark. And there are two vectors there as well: those who have big Hadoop investments and a lot of skills, basically evolving their data lakes to accommodate or bring in Spark to simplify things, speed things up, et cetera; and those who are basically starting from scratch without Hadoop, because that's simpler. And that makes a difference in terms of what type of technology commitments they'll make. And then there are all kinds of detailed vectors we can go down there. The second is personalized real time services.

These are all within the first of the three takeaways: this journey, which you're articulating. And as you're saying, it's a gradual handoff from the data lake, to the real time personalized services, to the sort of autonomous IoT-type applications. As you need deeper and deeper integration with systems of record and faster response, you move more and more away from classic Hadoop towards Spark-like systems.

Okay, so that's takeaway number one. What's takeaway number two?

Okay, so takeaway number two is the real big surprise, as far as I was concerned.
And even, I guess, institutionally, as far as we were concerned. Because last year we said in the study that there's no way this whole market can take off without applications. Because if you can't make this stuff repeatable, then you'll have armies of consultants. Picture, every time you want to do a fraud detection or prevention app or a recommendation engine, having to do 85% of it from scratch. That's just not gonna scale. So last year we made an assumption that the only way this works at scale is if we have apps. Then we did a huge amount of research, and we came away with a conclusion that very few people are sharing: we're not gonna have packaged apps anytime soon. There are just too many things that are unknown and unsolved. So that means we're gonna see handcrafted solutions for several years out, and that means slower growth for the overall market. It also means we're not gonna see the rise of a big new class of ISVs.

So heavy, heavy, heavy customization. Mike Olson five years ago said this'll be the year of the app; it never transpired, because of the complexity. That's what you're saying. And then the third tentpole takeaway is really around real time. So talk about that a little bit.

Okay, so if we look at a vector, that's a techie term, if we look at the changes we're going through in those customer use cases in the journey, we're getting tighter and tighter integration with the systems of record, the core backend apps like your order processing and inventory control, anything that is controlling access to your resources. And in the second journey, you've got the apps at the edge that are supporting your digital experience. The tighter that coupling gets, the more you need something like Spark, so that you can make a claim on resources, you can get a real time price, you can say, do we have resources, whether it's an airline seat or a widget to offer, or decide which widget to upsell. So it's tighter integration and faster response, and that's where we need Spark-like systems to take us.

Okay, good. We're going to unpack this throughout the day today. George, later on today, will be presenting our view, Wikibon's view, of the big data space generally and then Spark specifically in context. So keep it right there. We've got a number of guests coming up today. This is theCUBE. We're live from Spark Summit East in Manhattan. We'll be right back.
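Finally, a speculative sketch of the scenario-three idea discussed above: software acting on events as they arrive, with no human in the loop. It uses Spark Structured Streaming, which arrived around Spark 2.0; the socket source, the two-field schema, and the flagging threshold are illustrative assumptions, not anything from the segment.

```python
# A sketch, under assumed inputs: scoring a stream of transactions in near
# real time. A real system would call a trained model; a fixed threshold
# stands in for one here.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, split

spark = SparkSession.builder.appName("RealTimeScoringSketch").getOrCreate()

# Hypothetical stream of "card_id,amount" lines arriving on a socket
lines = (
    spark.readStream.format("socket")
         .option("host", "localhost")
         .option("port", 9999)
         .load()
)

# Parse each raw line into typed columns
txns = lines.select(
    split(col("value"), ",").getItem(0).alias("card_id"),
    split(col("value"), ",").getItem(1).cast("double").alias("amount"),
)

# Trivial stand-in for a fraud model: flag large transactions immediately
flagged = txns.where(col("amount") > 10000.0)

# Act on flagged events as they arrive; here they are just printed
query = flagged.writeStream.format("console").outputMode("append").start()
query.awaitTermination()
```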