Live from Boston, Massachusetts, it's theCUBE, covering Spark Summit East 2017, brought to you by Databricks. Now, here are your hosts, Dave Vellante and George Gilbert. We're back, welcome to Boston everybody. This is a special presentation that George Gilbert and I are going to provide to you now. SiliconANGLE Media is the umbrella brand of our company and we've got three sub-brands. One of them is Wikibon, the research organization that George works in, and then of course we have theCUBE, and then SiliconANGLE, which is the tech publication. And then we extensively, as you may know, use CrowdChat and other social data, but we want to drill down now on the Wikibon research side of things. Wikibon was the first research company ever to do a big data forecast. Many years ago, our friend Jeff Kelly produced that for several years. We open sourced it and I think it really helped the industry, framing the big data opportunity. And then last year George did the first Spark forecast, really Spark adoption. So what we want to do now is talk about some of the trends in the marketplace. This is going to be done in two parts. Today is part one, where we're going to talk about the overall market trends and conditions, and then tomorrow in part two you're going to release some of the numbers, right? And we'll share some of the numbers today as well. So we want to start on the first slide here. We're going to share with you some slides, the Wikibon forecast review, and George, I'm going to ask you to talk about where we're at with big data apps. Everybody's saying big data has peaked, that it's now going mainstream. So where are we at with big data apps? Okay, so just to provide context, I want to quote the former CTO of VMware, Steve Herrod. He said, in the end, it wasn't big data, it was big analytics.
And what's interesting is that when we start thinking about it, there have traditionally been two classes of workloads: one is batch, which in the context of analytics means running reports in the background, doing offline business intelligence; and then there was the interactive type of work. What's emerging is something that's continuously happening. It doesn't mean that all apps are going to be always-on; it means that all apps will have a batch component, an interactive component, like with a user, and then a streaming or continuous component. So it's a new type of workload? Yes. Okay, anything else you want to point out here? Yeah, it's worth mentioning that it's not going to burst fully formed out of the clouds and become the new standard. There are two things that have to happen. First, the technology has to mature. Right now you have some pretty tough trade-offs between integration, which provides simplicity, and choice and optimization, which gives you fragmentation. And second, the skill set has to develop. All right, we're going to talk about both of those a little bit later in the segment. Let's go to the next slide, which really talks to some of the high-level forecasts that we released last year. So these are last year's numbers, correct? Yes, yes. Okay, so what's changed? You've got the ogive curve, which is the streaming penetration, Spark slash streaming; that's what it was last year. This is now reflective of continuous, and you'll be updating that. How is this changing? What do you want us to know here? Okay, so the key takeaways here are, first, we took three application patterns, the first being the data lake, which is sort of the original canonical repository of all your data.
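The batch-versus-continuous distinction George draws here can be sketched in a few lines of plain Python. This is a minimal illustration of the pattern, not anything from the slides: a batch job makes one pass over a bounded data set and produces one answer at the end, while a continuous job keeps a rolling window over an unbounded stream and produces an answer per event, as data arrives.

```python
from collections import deque

def batch_report(events):
    """Batch component: one pass over a bounded data set, one answer at the end."""
    return sum(e["value"] for e in events) / len(events)

def continuous_monitor(stream, window=3, threshold=10.0):
    """Continuous component: a rolling window over an unbounded stream,
    emitting a result for every event as it arrives."""
    recent = deque(maxlen=window)
    for event in stream:
        recent.append(event["value"])
        avg = sum(recent) / len(recent)
        yield {"avg": round(avg, 2), "alert": avg > threshold}

events = [{"value": v} for v in (4, 8, 12, 20, 2)]
print(batch_report(events))                              # 9.2 -- one answer at the end
alerts = [r["alert"] for r in continuous_monitor(iter(events))]
print(alerts)                                            # [False, False, False, True, True]
```

The window size and threshold are invented for the example; the point is only the shape of the two workloads, not the numbers.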
That never goes away, but on top of it you layer what we were calling last year systems of engagement, which is where you've got the interactive machine learning component helping to anticipate and influence a user's decision. And then on top of that, which was the aqua color on the slide, were the self-tuning systems, which is probably more the IIoT stuff, where you've got a whole ecosystem of devices and intelligence in the cloud and at the edge, and you don't necessarily need a human in the loop. But when you look at these now, you can break them down as having three types of workloads: the batch, the interactive, and the continuous. Okay, and that continuous piece is the new workload here, and a real big theme of your research now. I remember punch cards, right, the ultimate batch, and then of course the terminals were interactive, which you think of as closer to real time. But now there's this notion of continuous. If you go to the next slide, Patrick, we can take a look at how workloads are changing. So George, take us through that dynamic. Okay, so to understand where we're going, sometimes it helps to look at where we've come from. Traditional workloads, if we talk about applications, were divided into batch versus interactive, as we discussed, but they were also divided into online transaction processing, the operational applications, the systems of record, and then the analytic side, which was reporting on them, but that was backward-looking reporting. And we began to see some convergence between the two with web and mobile apps, where a user was interacting with analytics that informed a decision they might make. That's looking backwards, and we're going to take a quick look at some of the new technologies that augmented those older application patterns. Then we're going to look at the emerging workloads and what they look like.
Okay, so let's have a quick conversation about this before we go on to the next segment. Hadoop obviously was batch. It really was a way, as we've talked about today and on many other days on theCUBE, to reduce the expense of doing data warehousing and business intelligence. I remember we were interviewing Jeff Hammerbacher and he said, when I was at Facebook, my mission was to break the dependency on the container, the storage container. So it was about reducing cost; we saw that infrastructure needed to change. So if you look at the next slide, which really talks to Hadoop doing batch and traditional BI, take us through that and then we'll evolve to the future. Okay, so this is an example of traditional workloads, batch business intelligence, because Hadoop has not really gotten to the maturity point where you can really do interactive business intelligence. It's going to take a little more work. But here you've basically put in a repository more data than you could possibly ever fit in a data warehouse. And the key is this environment was very fragmented. There were many different engines involved, and so there was high developer complexity and high operational complexity. We're getting to the point where we can do somewhat better on the integration, and where we might be able to do interactive business intelligence and start doing a little bit of advanced analytics like machine learning. Okay, let's talk a little bit about why we're here. We're here because it's Spark Summit. Spark was designed to simplify big data, to simplify a lot of the complexity in Hadoop. On the next slide you've got this red line of Spark. So what is Spark's role? What does that red line represent? Okay, so the key takeaway from this slide is a couple of things.
One, it's interesting, but when you listen to Matei Zaharia, who is the creator of Spark, he said, you know, I built this to be a better MapReduce than MapReduce, which was the old, creaky heart of Hadoop. And of course they've stretched it far beyond its original intentions, but it's not the panacea yet. If you put it in the context of a data lake, it can help you with what a data engineer does, exploring and munging the data, and what a data scientist might do in terms of processing the data and getting it ready for more advanced analytics, but it doesn't give you an end-to-end solution, not even within the data lake. Explaining this is important because we want to show how, even in the newer workloads, Spark isn't yet mature enough to handle the end-to-end integration. By making that point, we'll show where it still needs more work and where you have to substitute other products. Okay, so let's have a quick discussion about those workloads. Workloads really drive everything, a lot of decisions for organizations: where to put things, how to protect data, where the value is. So in this next slide, you're juxtaposing traditional workloads with emerging workloads. Let's talk about these new continuous apps. Okay, so that teed it up well, because we've focused on the traditional workloads. The emerging ones are where data is always coming in. You could take a big flow of data and chunk it and bucket it and turn it into a batch process. But now that we have the capability to keep processing it, and you want answers from it in very near real time, you don't want to stop it from flowing. So the first workload that took off like this was collecting telemetry about the operation and performance of your apps and your infrastructure, and Splunk sort of conquered that workload first.
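The "chunk it and bucket it" option George mentions, turning an unbounded flow into bounded batches rather than processing it continuously, is easy to sketch. This is an illustration of the idea in plain Python, not anything from the presentation:

```python
def tumbling_batches(stream, size=4):
    """Bucket an unbounded flow into fixed-size batches -- the 'chunk it
    and bucket it' alternative to processing the stream as it flows."""
    batch = []
    for event in stream:
        batch.append(event)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:          # flush the partial final bucket
        yield batch

chunks = list(tumbling_batches(range(10), size=4))
print(chunks)  # [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9]]
```

The cost of this approach is exactly what the discussion points out: answers only arrive when a bucket fills, not in near real time.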
And then the second one, the one that everyone's talking about now, is the Internet of Things, or more accurately the industrial Internet of Things. That stream of data is, again, something you'll want to analyze and act on with as little delay as possible. The third one is interesting: asynchronous microservices. This is difficult because it doesn't necessarily require a lot of new technology so much as a new skill set for developers, and that's going to mean it takes off fairly slowly. Maybe new developers coming out of school will adopt it whole cloth, but this is where you don't rely on a big central database; you break things into little pieces and each piece manages itself. So the components of these arrows that you're showing, ingest, explore, process, serve, these are all discrete elements of the data flow that you then have to integrate as a customer? Yes. Frankly, these are all steps that could be an end-to-end integrated process, but it's not yet mature enough to really do it end-to-end. For example, we don't even have a data store that can go all the way from ingest to serve. By ingest, I mean taking in potentially millions or more events per second coming from your Internet of Things devices. Explore would be, in that same data store, letting you visualize what's there. Process is doing the analysis, and serve, then, is that same data store letting your industrial devices or your business intelligence workloads get real-time updates. For this to work as one whole, we need a data store, for example, that can go from end to end, in addition to compute and analytic capabilities that go end to end. The point is that for continuous workloads we do want to get to this integrated point somehow, sometime, but we're not there yet. Okay, let's go deeper and take a look at the next slide. You've got this data feedback loop and you've got this prediction on top of it. What does all that mean?
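The ingest-explore-process-serve chain George describes can be sketched as composable stages. This is a toy illustration in plain Python, with the explore step omitted since it is interactive by nature; every function name and stand-in (Kafka-style consumer, Spark-style job, fast serving store) is our invention, not a real API from any of those products:

```python
def ingest(raw_lines):
    """Stand-in for a high-volume consumer (think Kafka): parse raw telemetry."""
    for line in raw_lines:
        device, temp = line.split(",")
        yield {"device": device, "temp": float(temp)}

def process(events, limit=90.0):
    """Stand-in for the analysis engine (think Spark): flag hot readings."""
    for e in events:
        yield {**e, "hot": e["temp"] > limit}

def serve(results):
    """Stand-in for a fast serving store: index results for low-latency lookup."""
    return {r["device"]: r for r in results}

store = serve(process(ingest(["turbine1,85.0", "turbine2,97.5"])))
print(store["turbine2"]["hot"])  # True
```

In this sketch the stages share one in-memory structure; George's point is precisely that in real deployments each stage today is a separate product with its own data store, which is what makes end-to-end integration hard.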
Let's double-click on that. Okay, so now we're unpacking the slide we just looked at into two different elements: one is what you're doing when you're running the system, and the next will be what you're doing when you're designing it. For this one, what you're doing when you're running the system, I've grayed out where the data is coming from and where it's going to, just to focus on how we're operating on the data. And again, to repeat, the green part, which is storage, we don't have an end-to-end integrated store that could cost-effectively, scalably handle this whole chain of steps. But what we do have in the runtime is that you're going to ingest the data, then process it and make it ready for prediction. Then there's a step called DevOps for data science. We know DevOps for developers, but DevOps for data scientists, we're going to see, actually unpacks a whole other level of complexity. This DevOps for data science step is where you get the prediction: okay, if this turbine is vibrating and has a heat spike, it means shut it down, because something's going to fail. That's the prediction component, and the serve part then takes that prediction and makes sure that the device gets it fast. So you're putting that capability in the hands of the data science component so they can affect that outcome virtually instantaneously? Yes, but in this case the data scientist will have done that at design time. We're still at runtime, so once the data scientist has built that model, here it's the engineer who's keeping it running. Yeah, but it's designed into the process. That's the DevOps analogy. Okay, great. Well, let's go to that next piece, which is design. How does this all affect design? What are the implications there? So before, we had ingest, process, then prediction with DevOps for data science, and then serving.
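The turbine example above, a model built at design time and applied in the runtime serve path, can be reduced to a few lines. This is our illustration only; the thresholds and function names are invented, and a real deployment would load a trained model rather than a hand-written rule:

```python
def predict(reading, vib_limit=0.8, temp_limit=95.0):
    """Stand-in for the model the data scientist built at design time:
    vibration plus a heat spike together signal a likely failure."""
    return reading["vibration"] > vib_limit and reading["temp"] > temp_limit

def serve_action(reading):
    """Serve step: turn the prediction into a command pushed back to the
    device with as little delay as possible."""
    return "SHUT_DOWN" if predict(reading) else "OK"

print(serve_action({"vibration": 0.9, "temp": 101.0}))  # SHUT_DOWN
print(serve_action({"vibration": 0.2, "temp": 70.0}))   # OK
```

Note the division of labor the discussion describes: the data scientist owns `predict` at design time, while at runtime the engineer keeps the serve path running.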
Now, when you're at design time, you ingest the data, and there's a whole unpacking of steps which requires a handful or two fistfuls of tools right now to make operate: acquire the data, explore it, prepare it, model it, assess it, distribute it. All of those are today handled by a collection of tools that you have to stitch together. Then you have the processing, which could typically be done in Spark, where you do the analysis, and then serving it. Spark isn't ready to serve; that's typically a high-speed database, one that either holds tons of data for history or gets very, very fast updates, like a Redis that's almost like a cache. So the point is we can't yet take Spark as gospel from end to end. Okay, so there's a lot of complexity here. Right, that's the trade-off. So let's take a look at the next slide, which talks to where that complexity comes from. Let's look at it first from the developer side, and then we'll look at the admin side. So on the next slide, we're looking at the complexity from the dev perspective. Explain the axes. Okay, so there are two axes. On the x-axis at the bottom, there's ingest, explore, process, serve. Those were the steps, at a high level, that we said a developer has to master, and they're going to be in separate products because we don't have the maturity today. Then on the y-axis we have some, but not all, this is not an exhaustive list, of the different things a developer has to deal with for each product. So the complexity is multiplying all the steps on the y-axis, you know, data model, addressing, programming model, persistence, by all the products he needs on the x-axis. It's a mess, which is why it's very, very hard to build these types of systems today. Well, and why everybody's pushing on this whole unified integration story. Right. That was a major theme that we heard throughout the day today. Right. All right, what about from the admin side?
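The multiplication George describes is easy to make concrete. A quick sketch, using the axis labels from the slide, of why the burden grows as the product of the two lists rather than their sum:

```python
from itertools import product

# Each developer concern (y-axis) must be handled separately for each
# product in the pipeline (x-axis), so the learning burden multiplies.
steps = ["ingest", "explore", "process", "serve"]
concerns = ["data model", "addressing", "programming model", "persistence"]

pairings = list(product(steps, concerns))
print(len(pairings))               # 16 product-specific things to master
print(len(steps) + len(concerns))  # 8, if one integrated engine covered every step
```

And since both slides note their lists are not exhaustive, the real gap between the multiplied and the summed counts is wider still.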
Let's take a look at the next slide, which is our last slide, in terms of the operational complexity. Take us through that. Okay, so the admin side is when the system is running, and reading out, or inferring, the complexity follows the same process. On the y-axis there's a separate set of tasks. These are admin-related: governance, scheduling and orchestration, high availability, all the different types of security, resource isolation. Each of these is done differently for each product, and the products are on the x-axis, ingest, explore, process, serve, so that when you multiply those out, and again, this isn't exhaustive, you get, again, essentially a mess of complexity. Okay, so we get the message. If you are a practitioner of these so-called big data technologies, you're going to be dealing with more complexity, despite the industry's pace of trying to address it. You see new projects pop up, but nonetheless it feels like the complexity curve is growing faster than customers' ability to absorb that complexity. Okay, well, so is there hope? Yes, but here's where we've had this conundrum. The Apache open source community has been the most amazing source of innovation I think we've ever seen in the industry. But the problem is, going back to the amazing book The Cathedral and the Bazaar, about open source innovation versus top-down: the cathedral has a central architecture that makes everything fit together harmoniously, with a sort of simplicity, but the bazaar is so much faster because it's this free market of innovation. The Apache ecosystem is the bazaar, and the burden is on the developer and the administrator to make it work together. And it was most appropriate for the big internet companies that had the skills to do that.
Now, the companies that are distributing these Apache open source components are doing a Herculean job of putting them together, but they weren't designed to fit together. On the other hand, you've got the cloud service providers who are building, to some extent, services with standard APIs, ones that might have been supported by some of the Apache products, but with proprietary implementations, so you have lock-in. They have more of the cathedral-type architecture. And they're delivering them as services, even though, as you point out, many of those data services are discrete APIs with proprietary implementations. Okay, so very useful, George, thank you. If you have questions on this presentation, you can hit wikibon.com and fire off a question to us. We'll make sure it gets to George and gets answered. This is part one. In part two tomorrow we're going to dig into some of the numbers, right? So if you care about where the trends are, what the numbers look like, what the market size looks like, we'll be sharing that with you tomorrow. All this, of course, will be available on demand, and we'll be doing CrowdChats on this. George, excellent job. Thank you very much for taking us through this. All right, thanks for watching today, everybody. This is a wrap of day one of Spark Summit East. We'll be back live tomorrow from Boston. This is theCUBE, so check out siliconangle.com for a review of all the action today, all the news, and check out wikibon.com for all the research. SiliconANGLE.tv is where we house all these videos, so check that out. We start again tomorrow at 11 o'clock East Coast time, right after the keynotes. This is theCUBE, we're at Spark Summit. Hashtag SparkSummit, we're out. See you tomorrow.