Live from Boston, Massachusetts, it's theCUBE, covering Spark Summit East 2017, brought to you by Databricks. Now, here are your hosts, Dave Vellante and George Gilbert.

Everybody, the euphoria is still palpable here. We're in downtown Boston at the Hynes Convention Center for Spark Summit East, hashtag SparkSummit. My co-host George Gilbert and I will be unpacking what's going on here for the next two days. George, it's good to be working with you again. I always love working with my man George Gilbert. We go deep; George goes deeper. So, really fantastic action going on here in Boston. Actually, quite a good crowd here; it was packed this morning in the keynotes. The rage is streaming. Everybody's talking about streaming.

Let's go back a little bit, though, George. When Spark first came onto the scene, you saw these projects coming out of Berkeley. It was the hope of bringing real-timeness to big data, dealing with some of the memory constraints that we found, going from batch to kind of real-time interactive, and now streaming, which you're going to talk about a lot. And then you had IBM come in and put a lot of dough behind Spark, basically giving it a stamp, IBM's imprimatur, much in the same way it did with Linux, kind of elbowing its way into the marketplace and gaining a foothold. Many people at the time thought that Hadoop needed Spark more than Spark needed Hadoop. A lot of people thought that Spark was going to replace Hadoop. Where are we today? What's the state of big data?

Okay, so to set some context: when Hadoop V1, classic Hadoop, came out, it was the file system, a commodity file system. Keep everything really cheap; you don't have to worry about shared storage, which was very expensive. And the processing model, the execution engine for munging through the data, was MapReduce. We're all familiar with those terms. Complicated, but dirt cheap relative to a traditional data warehouse.

Yes, so: don't buy a big Oracle Unix box or a Linux box; buy this new file system, figure out how to make it work, and you'll save a ton of money. But unlike the traditional RDBMSs, it wasn't really that great for doing interactive business intelligence and things like that. It was really good for big batch jobs that would run overnight, or over periods of hours, things like that.

And the irony is, when Matei Zaharia, the creator of Spark and co-founder of Databricks, which is the steward of Spark, created the language and the execution environment, his objective was to do a better MapReduce than MapReduce: make it faster, take advantage of memory. But he did such a good job of it that he was able to extend it to be a uniform engine, not just for MapReduce-type batch stuff, but for streaming stuff as well.

So originally they started out thinking, if I get this right, it was sort of micro-batch, leveraging memory more effectively, and then it extended beyond that original design. Micro-batch is their current way to address the streaming stuff: it takes MapReduce, which would be big, long-running jobs, and slices them up, so each little slice turns into an element in a stream (sketched in code below).

Okay, so the point was, it was an improvement upon these big, long batch jobs, moving the thinking from batch to interactive and real time. But let's go back to big data for a moment. Big data was the hottest topic in the world three or four years ago, and now it's sort of waned as a buzzword. But big data is now becoming more mainstream.
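To ground the micro-batch point George just made, here's a minimal Structured Streaming sketch; this is not code discussed in the segment, just a standard Spark word count, assuming a recent Spark version and a socket source on localhost:9999 purely for illustration. The query is written exactly like a batch job, but the engine executes it as a series of small slices, one per trigger:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.streaming.Trigger

object MicroBatchWordCount {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("micro-batch-sketch")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // An unbounded source: lines arriving on a local socket.
    val lines = spark.readStream
      .format("socket")
      .option("host", "localhost")
      .option("port", 9999)
      .load()

    // The same word count you would write for a batch job...
    val counts = lines.as[String]
      .flatMap(_.split("\\s+"))
      .groupBy("value")
      .count()

    // ...run incrementally: each second, the new slice of input is
    // processed as one small batch and the running counts are updated.
    counts.writeStream
      .outputMode("complete")
      .format("console")
      .trigger(Trigger.ProcessingTime("1 second"))
      .start()
      .awaitTermination()
  }
}
```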
We've talked about that a lot. A lot of people think it's done. Is big data done?

No. It's more that it's become boring for us pundits to talk about, because it's becoming part of the fabric, and the use cases are what's interesting. It started out as a way to collect all your data into this really cheap storage repository: the data you couldn't afford to put in your Teradata data warehouse at $25,000 per terabyte, or, with running costs, a multiple of that. So you put all your data in there. Your data scientists and data engineers started munging the data. You started taking workloads off your data warehouse, like ETL, things that didn't belong there. And now people are beginning to experiment with business intelligence, sort of exploration and reporting, on Hadoop, taking more workloads off the data warehouse. There are limitations there, and they'll get solved by putting MPP SQL back ends on it. We're working on that step, but the step that comes after that is making it easier for data scientists to use this data to create predictive models.

Okay, so I often joke that the ROI on big data was reduction on investment, lowering the denominator in the expense equation, and I think it's fair to say that big data and Hadoop succeeded in achieving that. But then the question becomes, okay, what's the real business impact? Clearly big data has not, except in some edge cases, and there are a number of edge cases and examples, but it's not yet, anyway, lived up to the promise of real time: affecting outcomes, taking the human out of the decision, bringing transactions and analytics together. Now, we're hearing a lot of that talk around AI and machine learning, and of course IoT is the next big thing; that's where streaming fits in. Is it same wine, new bottle, or is it the evolution of the data meme?

It's an evolution, but it's not just a technology evolution. We've been talking about big data as efficiency, cost reduction for the existing type of infrastructure, but when it starts going into machine learning, you're doing applications that are more strategic and more top-line focused. And that means your C-level execs actually have to get involved, because they have to talk about the strategic objectives, like growth versus profitability, or which markets you want to target first.

So has Spark been a headwind or a tailwind to Hadoop?

I think it's very much been a tailwind, because it simplified a lot of things that took many, many engines in Hadoop. And that's something that Matei, the creator of Spark, has been talking about for a while.

Okay, something I learned today, and actually I'd heard this before, but the way I phrased it in my tweet was: genomics is kicking Moore's Law's ass. The price performance of sequencing a gene improves three x every year, compared to what is essentially a doubling every 18 months from Moore's Law. And then the amount of data that's being created is just enormous. I think we heard from the Broad Institute that they create 17 terabytes a day, as compared to YouTube, which is 24 terabytes a day. It'll be dwarfing YouTube, and of course Twitter you couldn't even see on that scale. So what do you make of that? Is that just a fun fact? Is that a new use case? Is that really where this whole market is headed?
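Before George's answer, a quick aside putting numbers on "kicking Moore's Law's ass." Compounding the two rates Dave quotes over the same three-year window gives:

$$3^{3} = 27\times \;\text{(sequencing, } 3\times\text{/year)} \qquad \text{vs.} \qquad 2^{36/18} = 2^{2} = 4\times \;\text{(Moore's Law, } 2\times\text{/18 months)}$$

so the gap widens by roughly a factor of seven every three years.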
It's not a fun fact, because we've been hearing for years and years this claim about data doubling every 18 to 24 months. That's coming from the legacy storage guys, who can only double their capacity every 18 to 24 months. The reality is that when we take what was analog data and make it digitally accessible, the only thing preventing us from capturing all this data is the cost to acquire and manage it. So the available data is growing much, much faster than 40% every 18 months.

Well, so what you're saying is, this industry has marched to the cadence of Moore's Law for decades, and that sort of linear curve is actually reshaping; it's becoming exponential.

For data, yes. And so the pressure is on for compute, which is now the bottleneck, to get cleverer and cleverer about how to process it.

So that says innovation has to come from elsewhere, not just Moore's Law. It's got to come from a combination of things. Thomas Friedman talks a lot about Moore's Law being one of the fundamentals, but there are others. So from a data perspective, what are those combinatorial effects that are going to drive innovation forward?

Well, there was a big meetup for Spark last night, and the focus was this new database called SnappyData, which spun out of Pivotal and is being mentored by Paul Maritz, ex-head of development at Microsoft in the nineties and former CEO of VMware. The interesting thing about this database, and we'll start seeing it in others, is that you don't necessarily want to query and analyze petabytes at once; it'll take too long, sort of like munging through data of that size on Hadoop took too long. You can do things that approximate the answer and get it much faster. We're going to see more tricks like that.

You know, it's interesting you mention Maritz. I heard a lot of messaging this morning about essentially real-time analysis: being able to make decisions on data that you've never seen before and actually affect outcomes. This narrative I first heard from Maritz many, many years ago when they launched Pivotal. He launched Pivotal to be this platform for building big data apps, and now you're seeing Databricks and others usurp that messaging and actually seem to be at the center of that trend. What's going on there?

I think there are two centers of gravity, and our CTO, David Floyer, talks about this. The edge is becoming more intelligent, because there's a huge bandwidth and latency gap between these smart devices at the edge, whether the smart device is a car or a drone or just a bunch of sensors on a turbine. Those things need to analyze and respond in near real time or hard real time, like how to tune themselves, things like that. But they also have to send a lot of data back to the cloud to learn about how these things evolve. To learn, in other words; it would be like sending the data to the cloud to figure out how the weather patterns are changing. That's the analogy; you need them both. And so Spark right now is really good in the cloud, but they're doing work so that they can take a lighter-weight version and put it at the edge. We've also seen Amazon put some stuff at the edge, and Azure as well.

I want you to comment on that; we're going to talk about it later. George and I are going to do a two-part series at this event.
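On the "approximate the answer" trick George mentions: SnappyData's own APIs aren't shown here, but plain Spark SQL has the same idea built in, which makes for a simple illustration. A minimal, spark-shell-style sketch, with dataset path and column names hypothetical:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.approx_count_distinct

val spark = SparkSession.builder().appName("approx-sketch").getOrCreate()

val events = spark.read.parquet("/data/events")  // hypothetical dataset

// An exact distinct count shuffles every value; the approximate version
// uses a HyperLogLog++ sketch instead, trading a small, bounded error
// (here, 5% relative standard deviation) for a much faster answer.
val approxUsers = events.agg(approx_count_distinct("user_id", rsd = 0.05))
approxUsers.show()

// approxQuantile does the same for percentiles: the third argument is
// the allowed relative error, so 0.01 means within 1% of the true quantile.
val p95 = events.stat.approxQuantile("latency_ms", Array(0.95), 0.01)
```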
We're going to talk about the state of the market, and then we're going to release a glimpse of our big data numbers: our Spark forecast, our streaming forecast. I mention streaming because we talk about batch, and we talk about interactive slash real time, you know, you're at a terminal; anybody as old as I am remembers that. But now you're talking about streaming. Streaming is a new workload type. You call these things continuous apps, like streams of events coming into a call center, which is one example that you use. Add some color to that. Talk about that new workload type and the role of streaming, and really, potentially, how it fits into IoT.

Okay, so for the last 60 years, since the birth of digital computing, we've had one of two workloads. They were either batch, which is jobs that ran offline; you put your punch cards in and sometime later the answer comes out. Or we've had interactive, which originally was green screens and now is PCs and mobile devices. The third one coming up now is continuous, or streaming, data that you act on in near real time. And it's not that those apps will replace the previous ones; it's that you'll have apps that mix continuous processing, batch processing, and interactive. An example would be: today, all the information about how your applications and your data center infrastructure are operating, that's a lot of streams of data that Splunk first took aim at and did very well with, so that you're looking in real time and able to figure out if something goes wrong. That type of stuff, all the telemetry from your data center, that's the training wheels for the Internet of Things, where you've got lots of stuff out at the edge.

It's interesting you mention Splunk. Splunk actually doesn't use the big data term in its marketing, but they actually are big data, and they are streaming. They're not talking about it; they're just doing it. But anyway, all right, George, great. Thanks for that overview. We're going to break now and bring back our first guest, Arun Murthy, co-founder of Hortonworks. So keep it right there, everybody. This is theCUBE, we're live from Spark Summit East, hashtag SparkSummit, and we'll be right back.
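A closing note on the "continuous apps" idea from that last exchange: the point that a single app can mix batch and streaming is easy to show in Spark. Here is a minimal, hypothetical sketch of George's call-center example, joining a continuous Kafka stream of call events against a batch-loaded agent table. The broker address, topic, paths, and schema are all made up for illustration, and it assumes the spark-sql-kafka connector is on the classpath:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("continuous-app-sketch").getOrCreate()

// Batch side: a static reference table of call-center agents.
val agents = spark.read.parquet("/data/agents")  // hypothetical: agent_id, name, team

// Continuous side: call events arriving on a Kafka topic.
val calls = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker:9092")
  .option("subscribe", "call-events")
  .load()
  .selectExpr("CAST(value AS STRING) AS agent_id")  // simplified payload

// A stream-static join: every micro-batch of events is enriched
// against the batch table, inside one application.
val enriched = calls.join(agents, "agent_id")

enriched.writeStream
  .format("console")
  .outputMode("append")
  .start()
  .awaitTermination()
```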