Live from New York, extracting the signal from the noise. It's theCUBE, covering Spark Summit East, brought to you by Spark Summit. Now your hosts, Jeff Frick and George Gilbert.

Hey, welcome back everybody. Jeff Frick here with theCUBE. We're live in Midtown Manhattan at Spark Summit East. A lot of people are talking about Spark. It's really the next big thing in big data, so we, of course, have to be here, bring theCUBE and the whole set, and really talk to the smartest people we can find. We've got the hecklers in the audience, Matthew Hunt over there giving us some grief, so that's always good. But we're really excited for this next segment, to get the co-founder and chief architect of Spark, Reynold Xin. Welcome to theCUBE. Thank you.

So you've been doing this for a while. You've seen the community grow up. How do you feel? I mean, you've got to be like a proud parent with what's going on with this thing. It's definitely very exciting. In many ways this exceeded our own expectations, even our wildest imaginations from five years ago. And it's pretty exciting to see the community grow.

But still very techy. You were up in today's keynote, just diving right in, getting into the... Yup, at Spark Summit in particular I typically do the second-day keynote and give more of a technical vision of the future. In some other venues it might be less technical, but at Spark Summit in particular, a lot of the audience are developers, so we want to cater to them as well. As a result, I just put my technical hat on.

Excellent, let's jump in. So, press and analysts always like to cook up a fight. Okay, so we had Hadoop, the full ecosystem and MapReduce, and it had a great run from academia, or from the papers, for 10 years, 10 years plus. Now we all agree that Spark is not replacing that ecosystem but carving out a chunk of it. And then of course everyone has to throw up the next thing on the horizon that could threaten it: all these streaming technologies. Before we dive too deep, help us frame where streaming fits relative to where we've come from, say batch and interactive, and what some of the choices are in terms of pursuing a path into the future.

Yeah, so this is a very good question. I think streaming in particular comes as a response to the demand that a lot of applications want to react to data faster. For example, if you're doing fraud detection on credit card transactions, you'd better flag it as soon as possible, rather than waiting for the guy to get all the money and then fly out of the country. So in that sense, I think streaming is a very important piece of the whole big data ecosystem. What we have seen is, since we released Spark Streaming about three years ago, Spark Streaming was first released in Spark 0.7, I think, in February 2013, almost exactly three years ago, we worked with a lot of different customers and organizations to build out their streaming stacks. What we realized is that typically they need a lot of integration between streaming and their existing technical infrastructure, and that covers a lot of what you would typically call the batch stack. The other thing is, in order to build out the streaming stack, you actually have to hire and find the right people to build it, and both of these turn out to be incredibly difficult and frustrating.
One reason is it's great to have streaming systems being built, but there's a mistake engineers tend to make, and sometimes I make it too: I'm building this new thing, this is the best thing ever, I just want this thing to work, and you forget about all the other pieces you actually need to make it work. So that's the integration part. The other part is that streaming is very difficult, because it has a lot of characteristics that are different from batch. In batch, you know your data is there. Streaming data is continuously arriving; sometimes it might be late, sometimes it might arrive very quickly, sometimes it might be two days late, and sometimes the data distribution actually changes over time. Maybe an algorithm that worked when you were creating your application won't work half a year from now. All of this leads to a very complicated model, the way people have to think about streaming.

Now, in the industry right now there are many different approaches to this. One approach is what's called the Lambda architecture, basically coined by Nathan Marz a while ago, which says we have a batch stack, we have a streaming stack, and the two will basically converge in some form or another. The other one, which I think is getting increasingly popular, maybe championed by Google Cloud Dataflow, is to treat everything as a stream. So I use a streaming approach to tackle my batch data, and batch becomes a special case of streaming. Now, it sounds great, but I think it actually makes some of the pieces unnecessarily complicated, because batch used to be simple, and now I have to use a harder way to deal with batch. So what we've been working on lately, which is part of my keynote this morning, is: can we actually make streaming as simple as plain batch, and let the system do the hard work of figuring out how to run your application in a streaming fashion, while you just think about the analytics you want to perform?

How do you deal with some of those issues you raised, about out-of-order data and algorithms being out of date? It's almost like you're introducing complexity that's appropriate for streaming, but how do you backfill it onto batch without it showing up? Yeah, exactly. There are many pieces. One thing you mentioned is out-of-order data. I think the system should just be smart enough to correct itself. When you specify your application logic or business logic, what you should really say is, for example, I want to do some aggregation counting how many users I have. The system should give you the best result it can get at any given moment, and if late data arrives, it should actually correct your historic result. I think that's what you get when you raise the level of abstraction and ask the user to specify intent rather than how to perform a certain computation.
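To make that concrete, here is a minimal PySpark sketch of that kind of intent-driven aggregation in Structured Streaming. It is an illustration of the model being described, not code from the keynote; the input path, schema, and column names are hypothetical.

```python
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, TimestampType

spark = SparkSession.builder.appName("intent-not-mechanics").getOrCreate()

# Hypothetical schema for a stream of user events.
schema = StructType([
    StructField("user", StringType()),
    StructField("event_time", TimestampType()),
])

# Read the stream much like you would read a static table (hypothetical directory).
events = spark.readStream.schema(schema).json("/data/events")

# Declare the intent: count events per user. Spark plans the incremental,
# streaming execution and keeps revising the result as data continues to arrive.
counts = events.groupBy("user").count()

query = (counts.writeStream
         .outputMode("complete")   # emit the full, updated aggregate on each trigger
         .format("console")
         .start())
query.awaitTermination()
```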
So this actually sounds kind of profound, and consistent with what we're hearing in other areas, like with applications, that you don't get definitive answers now. The new programming paradigm is probabilistic: this is the most likely answer. Are you saying that we take that now down to the very data layer, the very processing layer for the data? Yeah, I think basically we are raising the level of abstraction in Spark so that you, as the end user, do two things. One is you think about what you want to do, not necessarily how exactly to do it, and Spark figures out how to do it and gives you the very best effort. Now, if you have all your data in a consistent manner, then of course at any given moment you get the best, ideal result. But if you have missing data and you're analyzing based on that, you will miss some of the result. By the time the missing or late-arriving data actually arrives, Spark will correct itself.

So it's almost more of a statistical approach, as opposed to the classic computational approach, which is yes, no, one, zero. Now it's really a kind of confidence level, the best effort or best result as of now, with a willingness to change based on new data coming in. Yeah, I think for certain classes of applications that's actually probably a pretty good approach. And that's hidden from the user? I think we want the users to just think in terms of how they've always been programming. Most of our users already understand how to program against batch data. They understand you have a static data set, you have a table in a relational database, how do I run some analytics on it? And we want them to apply the same mindset and make it transfer directly to dealing with live streaming data.

Okay, am I dropping too low if I ask how you account for the accuracy of time on events that are coming in that are widely distributed? How do you know event time is consistent for things that transpired over great geographic distances, where the clocks aren't synchronized? Yeah, that's actually a great question. This is basically part of the problem with natural data: when you have data from many different sources, they don't always arrive in order. And this is the term, out-of-order data. But you don't even know whether it's out of order? You don't even know, but the system would know, because in our new Structured Streaming framework you actually declare that there's a field in your data that indicates true time, basically what you would call the event time of the data. The system actually observes those times. And as a user, one way you could say it is, for example, I have a maximum tolerance that is 20 minutes. If my data hasn't arrived in 20 minutes, just give me some result. And Spark, once past the 20-minute threshold, will actually give you the result. Now, that might not actually be correct, because if Spark doesn't actually have the data, it's very hard for it to give you an absolutely correct result. But when the data actually arrives, maybe one day later, Spark can actually send you a signal and say, hey, I've gotten something else, and this is what the new result should really be.

So you made a trade-off, which is: I'll give you what's most likely very close to a final answer, and in return I'll take the burden of coordination, of knowledge, off the programmer. Yeah, absolutely.
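The maximum tolerance described here corresponds to what Spark later exposed as a watermark on the event-time column. A rough sketch of declaring the event-time field and a 20-minute lateness bound, with a hypothetical schema and input path:

```python
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.types import StructType, StructField, StringType, TimestampType

spark = SparkSession.builder.appName("event-time-tolerance").getOrCreate()

schema = StructType([
    StructField("user", StringType()),
    StructField("event_time", TimestampType()),   # the field carrying true event time
])
events = spark.readStream.schema(schema).json("/data/events")   # hypothetical path

# Declare which column is event time and how long to keep waiting for stragglers.
# Results are produced as data crosses the 20-minute threshold, and state for
# sufficiently old windows can eventually be dropped.
windowed = (events
            .withWatermark("event_time", "20 minutes")
            .groupBy(F.window("event_time", "10 minutes"), "user")
            .count())

query = (windowed.writeStream
         .outputMode("update")     # emit only the rows whose counts have changed
         .format("console")
         .start())
query.awaitTermination()
```

Note that with a watermark the engine is allowed to stop waiting for data older than the declared bound, so the "one day later" correction described above would need a larger tolerance or separate handling of very late records.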
Okay, so now help us unpack: someone who's been programming relational databases for 10, 20-plus years, and how they look at the world, and now they confront this structured streaming, this new data structure, I guess, which is how you're dealing with this new programming model. What do they learn and what do they unlearn? Yeah, so the most interesting part is, I think there's almost nothing they have to unlearn. It's almost the identical model they've been learning from SQL. This ties back to how we've been thinking about Spark development in the past couple of years. We realized there's a huge class of database programmers, or just average programmers, that actually understand relational databases very well. They understand SQL, they understand all of this. What we have done is take a lot of the semantics of SQL, generalize it a little bit more, make it more expressive and more powerful, and embed it into Spark. So essentially, in many ways, the internals of Spark actually look similar to what you would call a massively parallel, MPP database, except we're making it a lot more general by opening up the API. So you have many different ways of programming it: UDFs, your own functions, and, for example, the DataFrame operations, the RDDs, and all of this. And same thing now with structured streaming. We basically opened up the internals to enable a wider range of applications that was not possible before with just relational databases.

When you say a wider range of applications, meaning beyond what the SQL operators would give you, so either ones that you built, or perhaps user-defined ones? Yes, absolutely. And also dealing with unstructured data, dealing with live streaming data, and also building mathematical operators directly into this, so you can do machine learning, for example. Okay, so SQL was, let's say, this, and now you've expanded it to something much broader. And structured streaming is like a table without an end? Exactly. That's the easiest way to understand it. It's basically a table that grows over time, infinite. And then the developer who's grown up on SQL for 20, 30 years can say, okay, I want to deal with the first 10 rows, or I want to deal with the first five minutes of rows, something like that. Exactly.

Basically, how it works is you view a stream as just a table, one that doesn't end, that grows over time. And then you specify a query. What's going to happen is, at any given moment when that query is executed, logically you get some result on that table, because at any given moment the table is actually finite. Right, and you run a query against that finite table. This is exactly the same semantics as your normal relational queries: you're just running SQL, you get some result, and then you can decide what to do with the result. For example, you can output it, you can write it to a relational database outside Spark. And then we generalize this to a basically infinite time span, so you can run it at any given point, and Spark would just automatically run it at any given point, except the optimizer now figures out: how should I parallelize it, how should I incrementalize it, how should I run this in a streaming fashion, so I don't have to physically look at all the data from the beginning of time to the end of time?

So you're almost taking a snapshot, right? It's kind of the classic acceleration, for any moment in time you're actually not moving, as you keep slicing smaller and smaller and smaller. So that's basically how the users think about it. It's not actually how it's executed, because if we always snapshotted it, it would be very expensive. So we'll be doing a lot more incremental computation under the hood, without the user having to worry about it.
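One way to see the table-that-never-ends framing in code: the same relational logic can be written once and pointed at either a finite table or a stream, and only the read/write plumbing changes. A minimal sketch, with hypothetical paths and columns:

```python
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, LongType

spark = SparkSession.builder.appName("bounded-vs-unbounded").getOrCreate()

schema = StructType([
    StructField("user", StringType()),
    StructField("amount", LongType()),
])

def spend_per_user(df):
    # Ordinary relational logic; it neither knows nor cares whether df is bounded.
    return df.groupBy("user").sum("amount")

# Batch: a finite table, queried once.
history = spark.read.schema(schema).json("/data/history")
spend_per_user(history).show()

# Streaming: logically the same table, just one that keeps growing. The optimizer
# works out how to run the query incrementally rather than rescanning from the
# beginning of time on every trigger.
live = spark.readStream.schema(schema).json("/data/incoming")
(spend_per_user(live).writeStream
    .outputMode("complete")
    .format("console")
    .start()
    .awaitTermination())
```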
Now, what happens in the case where you're joining multiple structured streams, from different sources, on a cluster? Who's responsible for making sure the right data items show up in the right places? Because as I understand it right now, the query optimizer doesn't know where data belongs, in terms of separating what I want from how to get it. Yeah, the query optimizer actually knows where the data is in general. It doesn't always know everything about the data, but it roughly knows where it is, what the data is, where it's coming from; it's coming from Kafka, it's coming from, for example, Cassandra. The reason I ask is because my impression was that with Cassandra, the developer has to know how to shuffle the data when they want to join across nodes in a cluster. That's not the case in Spark. In Spark, you just say, you just describe it. I was thinking of the Spark connector for Cassandra. No, I think even with the Spark connector for Cassandra, there are different ways of doing it. If you do it through the normal data source API, basically if you're writing DataFrame functions, all you do is say, for example, DataFrame one join DataFrame two. Same thing with SQL, just normal SQL join operations. And then Spark will pull the data out of Cassandra, regardless of which node it's on, and it will be responsible for handling all the shuffling and all of that. Oh, OK. All right, so if you have feeds, structured streams coming in from different directions, you can express a query that joins, filters, and then you can send it through a machine learning library. Absolutely.

And now, maybe this is science fiction: how computationally expensive would it be to continue to refine the machine learning model while you're taking live data that's streaming in? This is an excellent question. It turns out not every machine learning algorithm can be done in a streaming fashion. There's one class of machine learning algorithms, what you would call streaming algorithms, that look at each data point only once and update themselves. This is the class of machine learning algorithms really applicable to streaming. And there's the other class, where you have to look at the data more than once, and it becomes very expensive to do in a streaming fashion. So it really depends on the kind of algorithm. The expensive ones are the ones that aren't acyclic, meaning they go around and around? They go around and around, exactly. For the very expensive ones to run, you typically have to go from the beginning of time to current, and then go back to the beginning of time and come back to current time again, and you have to keep going, and that's very expensive. But there's a new class of machine learning algorithms, and this is actually a pretty active area of research in machine learning: can we create algorithms that look at each data point only once? Those are the ones that are really good for streaming, and we've actually been building a lot of them into Spark now.
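One example of that single-pass family that shipped in Spark's MLlib is streaming k-means, which folds each new mini-batch of points into the cluster centers without revisiting old data. A rough sketch using the DStream-based API of that era; the input directory, dimensionality, and parameter values are hypothetical:

```python
from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.mllib.clustering import StreamingKMeans
from pyspark.mllib.linalg import Vectors

sc = SparkContext(appName="streaming-kmeans-sketch")
ssc = StreamingContext(sc, 10)   # 10-second micro-batches

# Training points arrive as text lines of space-separated numbers (hypothetical dir).
training = ssc.textFileStream("/data/train").map(
    lambda line: Vectors.dense([float(x) for x in line.split()]))

# Each batch updates the cluster centers with a single pass over the new points;
# decayFactor controls how quickly older data is forgotten.
# setRandomCenters arguments: (dim, weight, seed).
model = StreamingKMeans(k=3, decayFactor=1.0).setRandomCenters(2, 0.0, 42)
model.trainOn(training)

ssc.start()
ssc.awaitTermination()
```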
So, Reynold, we're getting low on time, and I know you and George and Matei could go for probably three hours, and it would actually probably be pretty good. It would feel like 15 minutes. It'd feel like 15 minutes, get the red wine out. We'll have to do that in San Francisco sometime. But I just want to get your take on what's next. Clearly we saw it, we were at the meetup, and 2.0 is coming out, and it's all about the 10x improvement in speed, which is just scary to me, to think that this stuff is going to go faster, faster, faster. But what else is next? What's on the horizon when we come back? Obviously we'll probably see you in the summertime in San Francisco, and then back here a year from now. What are we going to be talking about? What are you getting excited about? What's next on your radar?

Yeah, so beyond just Spark 2.0 and all the performance improvements and structured streaming, I think one thing that I'm fairly excited about, especially as a guy that used to spend a lot of time in academia, is actually the Community Edition of Databricks. That's the one that I think truly makes the Spark learning experience easy. Because I think it's great that we're building all this great technology and software, but if people don't understand how to use it, and can't be trained to use it, it's not going to actually be that valuable. And having that piece that people can use as a way to actually learn about Spark, and a lot of other technologies also on that platform, is a pretty cool thing to me. I think the best part about it is that every member of the community can actually create new content and publish it, so I look forward to what kind of innovative results and notebooks they'll be writing.

I wanted to ask about notebooks. I worked for a spreadsheet company way, way back. It was called Lotus. You probably weren't born yet. I've used Lotus Notes. Oh no, yeah, but this was Lotus 1-2-3. I was actually the second product manager on Notes. But the question I have is: notebooks can be these computational documents, which spreadsheets were at one point in time. Right now, when we ask people about them, they talk about empowering data scientists, empowering business analysts. Can we extend it to an even less technical audience, where they interact with data in a storytelling form, where SQL isn't part of the arcana? So, we always experiment with new technology ideas here. One thing we've been discussing is some sort of basic drag-and-drop wizard on top of notebooks. There's no commitment to it yet, but there's a lot of discussion about it. I think notebooks are a great form for storytelling, and probably the best form so far, because you can actually interweave the methods of exploration and analysis directly into the results. A less technical audience maybe doesn't care as much about the methodologies, but cares more about the end results. And that's the other thing we've been developing, which is basically a different view of the notebook: turning a notebook into a dashboard. So having a dashboard, for example one just listing KPIs for executives to consume, is also just the other side of notebooks.

Exciting times. Reynold, thanks for stopping by. Again, congratulations on all the success. A lot of good energy here. Thank you. We look forward to grilling you, or giving you the, what is it, the fifth degree next year. All right. Jeff Frick here with George Gilbert. We are live in Midtown Manhattan at Spark Summit East. We'll be back with our next guest after this short break. Thanks for watching.