Live from New York, extracting the signal from the noise. It's theCUBE, covering Spark Summit East. Brought to you by Spark Summit. Now your hosts, Dave Vellante and George Gilbert.

Hi everybody, welcome back to Spark Summit East. This is Dave Vellante here at the Hilton Hotel with George Gilbert. We're very excited. We've been working on big data forecasts for a number of years now. I want to take you through some of what we're doing, and today we're going to release our first-ever Spark forecast. We're in the process of updating our big data analytics report and market study. We first released a big data study in 2011, and we were the first analyst firm to do that. It was developed by our colleague Jeff Kelly, who did a great job putting big data quantification on the map, showing market shares and market sizing.

A couple of things we noticed right off the bat. First of all, the market was small. We saw it growing very steeply in an ogive curve, hitting its peak sometime in the middle part of this decade and then flattening out in the latter part. And one of the things we noticed was that software as a contributor to big data revenues was actually quite a bit smaller than you would expect in normal markets. The reason was the preponderance of open source software. The second thing we noted is that services dominated the spend, and the reason was that big data was hard. Hadoop was complicated, so people had to spend a lot of money on implementation. That's one of the reasons why we saw the ROI to be somewhat limited, and the return on investment tended to focus on cost reduction relative to traditional enterprise data warehouses.

So we're here today with George Gilbert. George, first of all, welcome. Thanks very much for all the hard work you've been doing over the last several months. You've been out talking to doers, what we call practitioners. These are people who are applying big data to solve business problems. You've been talking to technologists. I think you told me off camera you've talked to roughly 40 companies in the last couple of months? Yep, and through them, all their customers and what they're hearing. Okay, so in addition, we've done a number of surveys in the big data space over the last couple of years, and we're drawing from that data. We've had a big team doing this. David Floyer has put together some of the forecasts. We're going to show you today a first look at some of them, the sausage being made, as they say. So this is early, we're still shaping some of it, but you'll be able to take away some of the initial conclusions.

So George, let's get into it. I want you to start by talking about some of the tentpole takeaways that you've learned, not only in the last couple of months but in the last year. Okay, so what we call systems of intelligence, that's a journey, not a destination. It's about applications that make ever more intelligent decisions on behalf of organizations or people. And the key takeaway is that this journey can be grouped into three stages. The first is where you have rather loose integration between the analytics and decision making and the applications that actually do things. And so the journey is to get ever greater speed and tighter integration with those core systems of record. That's takeaway number one.
Takeaway number two is that large-scale applications built around machine learning are years away. In other words, we came to know ERP and CRM and these large horizontal applications that defined a multi-decade era. We're not at the cusp of those. Number three, Spark is taking big data by storm right now, but to stay there it has to push its boundaries out to real time. Many believe that real-time processing is a critical boundary it must cross. And we got feedback from Matei Zaharia himself today on just that issue when he was a guest on theCUBE.

So let's jump in. You're going to take us through the journey, but one of the things I said at the top was that initially, Hadoop was really focused on targeting the traditional data warehouse. I often quote Jeff Hammerbacher saying that when he was at Facebook, he was trying to attack the overly expensive storage container. So it was a way to collect and amass data less expensively; that was one piece. The other was that with all this volume of data increasing, it was a way to bring code to data and not have to move data through a pipe to some kind of god box, some Unix box with Oracle running on it, and if you had any money left over, you'd try to do something with it. So Hadoop really brought down the cost of storing and analyzing data, but it was still batch. The data lake has emerged out of that. We talked today about how Spark is really simplifying that data lake, and then the journey extends into personalized services and then IoT. So take us back to the beginning, take us to that data lake and the extension of the EDW. Start there.

Okay, and I'm going to do this at a high level right now, because the key thing is that after this, I'm going to drill down into each scenario. So this journey that you described, Dave, has three stages. The first is the Hadoop data lake, which is the better, faster, cheaper enterprise data warehouse. Stage number one, and we'll talk about its attributes in a second. The second stage, scenario number two in the journey, is personalized real-time services. This actually builds on the data lake in the sense that you know a lot about your customers. You have a 360-degree view of your products and, essentially, a 360-degree view of the capabilities of your organization, so you can serve up goods and services in an individualized way. That's step two.

Sorry to interrupt, but that was the promise of the decision support world. In your opinion, has Hadoop and big data lived up to that promise? Great question. The way I look at it is a bit cynical, which is we never fulfill 100% of the promise at any one stage, but we get closer. And I think that's... All right, great, carry on, if you would. Okay. So the third scenario, which our eminent CTO and Chief Research Officer David Floyer has described as where the real hard work happens, is when we integrate these autonomous applications with our core systems of record so that, for instance, when you're taking a customer order, you can, in real time, do fraud prevention and fraud detection, and, at the same time, autonomous internet of things applications as well. Those are the hardest to do. Those will be the last to be done.

Now, let's drill down into each one of these. So, taking the data lake: this is the better, faster, cheaper data warehouse, plus-plus, minus-minus.
So it's better in some respects and less good in others. On the left side, you'll see we take traditional back-office system-of-record applications and the data from them in batch mode, the operational data, and we combine it with the new systems of engagement, the customer breadcrumbs, the digital experiences. We send it over a batch link, meaning periodically, into this better, faster, cheaper data warehouse. What it does better than traditional data warehouses is that you deposit the data and then figure out what structure you need in it. The benefit is you get to decide over time what questions you want to ask. With a traditional data warehouse, you have to determine upfront, before you put any data in, what sort of questions you're going to ask. That's a big difference. So let's leave it at that.

Let me just ask you something about that, because we talked to Anjul Bhambhri at IBM about this, and she was talking about ETL moving to ELT: extract, load, transform. So you're doing schema on read rather than schema on write, if you will. Is the elapsed time, though, George, of getting to the answers actually decreasing? And if so, why? Or are we just spending more time building schemas after we dump all the data into the data lake? And amazingly, and I'm not pandering, but that is the question. Okay, so what's the answer? So I believe the answer is, as our tools get better, we are shortening the amount of time to get to insight, partly because we're having a series of roles where the data scientists will apply some structure, the data engineers a little bit more, and then the business analysts still more, and then they can iterate. In the data warehouse world, the problem was that if you wanted a new question answered, you had to go back, unravel that pipeline, go to the source systems, get that data, rewrite that pipeline, and then stick it back into the data warehouse. All that's gone. Okay, so it is compressing, and the tools are getting better, a combination of things.

All right, take us to scenario two. Okay, scenario two, personalized real-time services. Before, in scenario one, I had that bridge with the stopwatch; that was the batch timeline. You had a big elapsed amount of time to get the data into the decision system. Now there's a tighter coupling. I wouldn't call it deep integration, but you've got information from your back-office systems and your systems of engagement, your digital experience systems. They come into the data warehouse fast enough that you can create machine learning models that understand, anticipate, and influence your customer in their journey. And that's where you can serve up a personalized experience in the digital experience part of the application. You can engage the customer in a back-and-forth conversation. It's not deep, deep integration, but it's a better experience.

What role does Spark play in this scenario? Spark's role is that it's a more unified analytic application platform, so you don't have to go from silo to silo to silo to get an answer. You can extract the data from your data lake with SQL, you can filter it, and you can feed it through a machine learning pipeline to get the predictive model. In fact, you could even stream data in before you filter it with SQL. The point is, you're not going from tool to tool to tool. So it simplifies the technology component. And it speeds it up, because you're not in different engines.
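To make that "one engine" point concrete, here is a minimal PySpark sketch of what George describes: raw data deposited in the lake with no upfront modeling, structure applied at read time, then SQL filtering and a machine learning pipeline in the same engine. The lake path, column names, and model choice are illustrative assumptions, not anything from the conversation itself.

```python
# A minimal sketch, assuming a hypothetical clickstream feed. The lake path,
# column names, and model choice are illustrative, not from the interview.
from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import LogisticRegression

spark = SparkSession.builder.appName("unified-pipeline-sketch").getOrCreate()

# Schema on read: the raw JSON was dumped into the lake with no upfront
# modeling; Spark applies structure only when we come asking questions.
raw = spark.read.json("s3://example-lake/clickstream/")  # hypothetical path
raw.createOrReplaceTempView("clickstream")

# Extract and filter with SQL...
visits = spark.sql("""
    SELECT customer_id, page_views, minutes_on_site, converted
    FROM clickstream
    WHERE event_date >= '2017-01-01'
""")

# ...then feed the same DataFrame through an ML pipeline in the same engine,
# with no batch handoff to a separate analytics tool.
assembler = VectorAssembler(inputCols=["page_views", "minutes_on_site"],
                            outputCol="features")
lr = LogisticRegression(labelCol="converted", featuresCol="features")
model = Pipeline(stages=[assembler, lr]).fit(visits)
```

The single SparkSession is the whole argument: SQL extraction, filtering, and model training without crossing engine boundaries, which is exactly the handoff elimination discussed above.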
It accelerates it, and it presumably has an impact on data quality, is that right? Yes, data quality. Doesn't it give me more time to worry about, or at least affect, data quality in a positive way? That's a tough question; I'm not tracking it at that level. Okay, so maybe. So what's the bottom-line business impact? The bottom line is that you have one engine with multiple capabilities, so you don't have to do that batch handoff between sections, between app and app. Okay, but I'm a business person. What does that mean for me? That means shorter time to insight and higher quality, in the sense that there's less chance of errors in handoffs between the engines. Okay, great.

So IoT's all the rage. Everybody's talking about real time. Take us through the third scenario. Okay, the third scenario. This is the hardest one: deep integration between your systems of record and these new front ends. And when I say front ends, you've still got your operational data, you've still got the customer breadcrumbs, and you've got now, say, fraud detection. Fraud detection's really difficult because you've got someone coming into a retail store, swiping his card, or calling in to the call center, or on their mobile phone, and you have less than 100 milliseconds to approve or decline that credit card. You can't say, oh, we'll get back to you in an hour after we've checked our historical database, because the fraudster has already absconded with the money. Then you shut off the card, but you've lost the $1,000 spent on the Armani jacket. Well, no, this one isn't Armani, but in any case, that deep integration and that short latency is where all the money's made, because that's the hard problem to solve, and because when you tie those systems together, that's where you unlock the value. You're making a sales order entry decision much higher value because you're saying, hey, don't take this order, and I'm telling you this at the moment you're trying to take that order.

Well, we've been talking about this all day here at Spark Summit East. The fundamental problem people are trying to solve is one of demand, and what we mean by that, and we've been talking to our colleague Peter Burris about this, is that information has become so widespread that the power has shifted from the brands to the consumers. The consumers now have great information on pricing. They have information on product quality: reviews on Amazon, we all look at them, we look at the lousy reviews before we buy to see if the product fits our particular use case, et cetera. The premise we've been putting forth is that big data analytics is largely about brands trying to regain some of that proprietary information and value so they know what's going to happen maybe before the consumer does. Granted, it's not the only use case, but certainly dealing with the problems of demand is a major factor. Do you buy that premise? What do you make of that? I think you're spot on in terms of getting better context and understanding of the customer. But here's the thing: you're never done adding to your understanding of the customer. You're always adding more feeds, and that is something we'll touch on, but it's partly why it's very, very difficult to build packaged applications for these types of predictive systems, at least right now.
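The fraud scenario George sketches above maps naturally onto Spark Structured Streaming: score each incoming card swipe against a model trained offline on history. Here is a hedged sketch of that pattern; the Kafka topic, fields, and model path are invented for illustration, and it assumes the spark-sql-kafka connector is on the classpath.

```python
# A hedged sketch of the low-latency scoring pattern described above, assuming
# swipes arrive on a Kafka topic and a fraud model was trained offline and
# saved. Topic names, fields, and paths are invented placeholders.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json
from pyspark.sql.types import StructType, StructField, StringType, DoubleType
from pyspark.ml import PipelineModel

spark = SparkSession.builder.appName("fraud-scoring-sketch").getOrCreate()

swipe_schema = StructType([
    StructField("card_id", StringType()),
    StructField("amount", DoubleType()),
    StructField("merchant_risk", DoubleType()),
])

# Each incoming swipe is parsed from the Kafka message payload.
swipes = (spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "broker:9092")  # hypothetical
          .option("subscribe", "card-swipes")                # hypothetical
          .load()
          .select(from_json(col("value").cast("string"), swipe_schema)
                  .alias("swipe"))
          .select("swipe.*"))

# Score against the historical model without a round trip to a batch system.
fraud_model = PipelineModel.load("s3://example-models/fraud")  # hypothetical
scored = fraud_model.transform(swipes)

# In production this would feed an approve/decline service, not the console.
query = scored.writeStream.outputMode("append").format("console").start()
query.awaitTermination()
```

Note that Spark's micro-batch execution is precisely why hard sub-100-millisecond budgets remain the boundary it still has to push, the real-time challenge raised with Matei Zaharia earlier in the day.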
If you're always adding new sources of information and richer analytics, it's not like I can ship a stack of CDs to you and you can install them, and it's not even like I can hook you up to a SaaS application, because your sources, your new information feeds, are going to be different from your neighbor's information feeds. So that's one of the big problems, and where you plug those into the apps, that's another big problem. But let's get to that when we do the forecast.

Okay, well, let's get into it. If we go to the next slide. I want to just set this up. When we started Wikibon, the whole idea was open source research. Use the power of community. Don't necessarily be the smartest person in the room; leverage the community and the knowledge in the community, and publish fast. And so what you see us producing is really three types of research: what it is, how it works, and what's the business impact. The what-it-is and how-it-works research we make open source to everybody. Over the years we've had demand for greater quantification of markets, of business impact, the real value of applying technology to create a business capability. That's what we put behind our paywall. The other thing that we do within SiliconANGLE Wikibon is leverage the power of community. We're tracking many, many millions of people online. We narrow those down and build our own communities from which we extract knowledge. And our job is to present that information in context. So what we're really trying to do is help people understand the shifts that are going on in the market as they move toward so-called systems of intelligence.

So what this chart shows is our big data forecast broken down by hardware, software, and services, from 2014 all the way to 2026. We like to do these long-term forecasts, and as you can see, we projected the market in total for 2015 at about $22 billion, and you can see the corresponding compound annual growth rates. That's $22 billion for everything. Services are still the largest part of that, at $9 billion. So $22 billion, growing at about 13% on a compound annual growth rate basis.

A couple of points I want to share with you here. Number one is you can see software is growing faster than anything else. And our premise here is, really, think about three things with big data. With all this volume of data coming in, either humans can get smarter, or you can throw data scientists and services at the problem, or you can develop software that scales. Number one ain't happening: data is growing faster than we can absorb it. Number two, we can't train people fast enough. You heard today that Databricks trained 20,000 people last year; that's barely scratching the surface. Services are not the answer; they don't scale well. Software is the answer. So our premise is that software, ultimately codifying these processes around that data, is going to drive the market forward. And that's really the message behind this chart, isn't it? Anything you want to add? Yes, I would add one thing. I grew up on the vendor side at a company called Lotus, which has been forgotten in the ash heap of history, but once upon a time, we did cool stuff. And all the forecasts I saw from market research firms were straight lines up and to the right. And we do things differently here.
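For reference, here is what the headline numbers above imply if you simply compound them. A quick back-of-envelope: the $22 billion base and roughly 13% CAGR come from the discussion, but the 2026 endpoint below is just the arithmetic, not Wikibon's published figure.

```python
# Back-of-envelope only: the $22B 2015 base and ~13% CAGR are from the
# discussion; the 2026 endpoint is just compounding arithmetic, not
# Wikibon's published forecast number.
base_2015 = 22.0           # total big data market, $B
cagr = 0.13
years = 2026 - 2015        # 11 years of compounding

projected_2026 = base_2015 * (1 + cagr) ** years
print(f"Implied 2026 market: ${projected_2026:.0f}B")  # ~$84B
```

That smooth compounding is exactly the "straight line up and to the right" George contrasts with curves that have knees in them.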
What we do differently is we identify the things that cause a knee in the curve, something that says the market is going to shift now in this direction because of this set of reasons. That's what we're going to drill into, and that's why on this slide you don't see straight lines up and to the right. And I should caution: these are preliminary numbers, a first look. All right, so let's go on to the next one.

I love this chart. This shows big data apps, analytics, and tools in context. The purple is big data apps; the blue is analytics and tools. And what this speaks to is that the market today is highly, highly customized. We debated this a little with Shaun Connolly, and he actually brought up a good point: things like Spotify are new big data apps that we may or may not be counting in here. So they're coming from new places. One of the things that Peter Burris predicted years ago was that there would be more SaaS companies coming out of non-technology companies than technology companies. So maybe that's a good point that we have to consider. But what this chart really shows is that today, in the enterprise, a lot of the development is customized, but over time we expect packaged apps to emerge. Talk about that.

Okay, and in fact, it's probably fair to say that we have a good number of B2C consumer SaaS apps, like Spotify, like Uber, but we're focusing here on enterprise apps. And frankly, we were conditioned over a period of several decades to say, okay, we have these known processes, like placing an order, like onboarding an employee, sending an invoice. Those were well known. The transactions were well known. The workflows were well known. We built very big application companies based on that knowledge. Today our applications are oriented mostly around interactions between people, or where a person is interacting, let's say in some non-deterministic way, with, say, a piece of equipment. The point here is we don't know these processes, and that makes it hard for them to be packaged in a repeatable way, at least today. And in fact, from all the research we've done over the last year, we don't see that changing in a major way anytime soon. There are pockets, like B2B marketing apps, where frankly there are a bunch of very, very good ones. But broad horizontal big data applications, they're not there yet. So let me just make the last point, which is that's why the purple, the applications that are going to emerge, we see emerging in the 2020-and-beyond timeframe.

All right, so now let's go on to the next one; we have to set up for the panel. One of the things that John Furrier talks about is that Spark is sort of like the NIC card was in the early days: it extended the connectivity of the devices, it didn't subsume them. So what this chart shows is just the software component of our big data forecast, and it breaks it down into big data 1.0, 2.0, and 3.0, according to that journey you talked about. And then it shows the proportion of that spending that involves some type of Spark and/or streaming, not just Spark Streaming. You can see it's small today and then growing. There's some debate internally as to what it will do in the out years. Some of this stuff is hard to forecast because it's sort of embedded; it's an extension of the existing environment. But increasingly, Spark is propelling big data and Hadoop into that 2.0 realm.
And as you point out, to get to 3.0, it's really got to get to real time, and that's a major challenge. We're hearing a lot of talk today about how it's going to get there. But give us your final thoughts on this slide and then we'll wrap. I guess the key takeaway is, if you look at this slide, you might look at the very last data point on the green line, see $31 billion for Spark slash streaming, and chuckle. The answer is we're imputing a value for Spark and real-time or near-real-time technologies by asking how many cycles they're consuming on the clusters that support these apps. And yes, this is our first pass, and we're going to refine it. But the main point is you can see it slowly picking up in journey scenario one, which is the dark blue. You can see it accelerating as we get to scenario number two, and then beginning to tail off as it reaches the upper limits of scenario three, which is the real-time apps. So conceptually it makes sense, but on the degree, we're going to refine it a little more.

Okay, George, we've got to wrap. I'll give you the last word. Summarize in 30 seconds what you learned in the last month or two. Okay, the big tentpole takeaways. We're still going to be driven, for the most part, by mostly custom apps. That's going to put a bit of a lid on the growth rate. We were seeing triple-digit growth rates in enterprise apps for almost the entire decade of the '90s, because you could just sell and ship a bunch of CDs to companies and they could install them. We're not going to see that right now. Then there's the issue that this journey to systems of intelligence is not one where you can skip a step. There are skills you build, processes you learn, and technologies that have to mature. This is nothing new; people, process, technology has always been the foundation of how we evolve with technology. And so that journey is something all companies will go through. Some are farther along than others, typically those who live in the tech bubble here in the Bay Area or who are principally tech companies. And the last one is that Spark really is taking the industry by storm. There are areas Spark is staying away from, namely the database layer and the management layer, and different companies are taking different approaches to fill that gap. But Spark is here to stay for many years to come.

All right, George, we've got to wrap, so please, everybody, check out wikibon.com. We'll be releasing this data this month in a couple of forms: one in the open, and the other piece, behind the paywall, will be some of the detailed forecasts and market shares. So if you want more information on that, hit us up. You can tweet us, I'm @dvellante, he's @ggilbert41, or you can go old school and call us on the phone. Generally, people know how to get in touch. But definitely also check out siliconangle.com, where we'll be releasing some of this information, and always check out siliconangle.tv and theCUBE for replays of this event and other events. All right, George, thank you very much for setting that up and all the hard work that you've been doing. Thanks to Ralph Finos and David Floyer for your participation and contribution, and thanks to the Wikibon community for helping us get this done in such a fast timeframe. All right, keep it right there, everybody. We'll be back with our panel from Spark Summit East in Midtown Manhattan.
This is theCUBE, we're live.
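As a closing footnote, here is a toy version of the cycles-based imputation George describes for the Spark/streaming line: credit a share of each scenario's cluster spend to Spark and streaming in proportion to the cycles those workloads consume. Every number below is an invented placeholder, not an actual Wikibon input.

```python
# A toy version of the cycles-based imputation described in the interview:
# attribute a share of each scenario's cluster spend to Spark/streaming in
# proportion to the cycles those workloads consume. All numbers below are
# invented placeholders, not Wikibon's actual forecast inputs.
def impute_spark_value(segment_spend_b, spark_cycle_share):
    """Dollars (in $B) attributed to Spark/streaming for one scenario."""
    return segment_spend_b * spark_cycle_share

scenarios = {                    # ($B cluster spend, Spark share of cycles)
    "1.0 data lake":       (10.0, 0.10),
    "2.0 personalization": (8.0, 0.35),
    "3.0 real-time/IoT":   (4.0, 0.50),
}

total = sum(impute_spark_value(spend, share)
            for spend, share in scenarios.values())
print(f"Imputed Spark/streaming value: ${total:.1f}B")  # $5.8B with these inputs
```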