from New York, extracting the signal from the noise. It's theCUBE, covering Spark Summit East. Brought to you by Spark Summit. Now your hosts, Jeff Frick and George Gilbert.

Hey, welcome back everybody. Jeff Frick here, you're watching theCUBE. We're at Spark Summit East at the Hilton in Midtown Manhattan, really the home of big data events on the East Coast. We're really excited to be here, and a little bit later George Gilbert from Wikibon is going to be presenting his first-ever Spark forecast, so we're excited for that. But we've got a really special guest, and I'll let George introduce him. Okay, so it is a distinct honor to welcome Matei Zaharia, who is the creator of Spark and who, we like to say, looks a good deal like a younger Bill Gates when he lets his hair grow out. Anyway, welcome, Matei. Thanks George and Jeff, looking forward to chatting.

So let's start at a big-picture level. It's pretty clear now that Hadoop has established itself as an ecosystem, but that pieces of it are sort of fading from relevance. Most people, at least within the Hadoop ecosystem (we'll talk about outside it later), see Spark taking on the role of the compute engine. Now, there's a debate as to whether that's a three-year or five-year kind of role before we see walls or limitations, or whether it can go well beyond that. So the first thing some people talk about is that everyone wants real time, and streaming, and they don't always know the difference between the two. Maybe start by telling us how you define the two, and then let's talk about how you implement support for them.

Yeah, good question. These things aren't always super well defined. The way I view it, real time is a broader class than streaming. Streaming means you get a sequence of events and you output another sequence of records. For example, for each record you want to look it up in a database and attach something to it, or you want to group the records and count the number of each type. So the input is a stream and the output is a stream. For me, real time touches a broader class of topics. Real time could also mean I just want to serve some data: there will be queries coming in and I want to answer them really quickly. It could mean ad hoc queries, the time from the moment I have a question and send it to the system to the moment it answers. So that's why people include things like interactive SQL in real time, and of course it also includes streaming. That's the distinction as I view it.

Okay, so let's drill down into streaming, because people look to Spark as this general-purpose platform to solve lots of big data and fast data problems, but some people say, well, it doesn't really do real-time streaming. Tell us what that edge condition is, where you are today, and how you might solve it going into the future. Yeah, so interestingly, this is not a thing we really hear from users of Spark Streaming. Basically around half of the users of Spark also use Spark Streaming, and it's not something they bring up. Most often this comes up when people talk about the internals of the engine: are the operators built so that each time they receive a record they send out another record, or do they operate on batches of records at once?
And Spark is set up more the second way, to operate on batches at once. Now, what this means in practice is that Spark is designed for high-volume, distributed processing of big data. A lot of this data arrives over a period of time anyway, so you need to wait a while just for the data to come in, and you're coordinating across a bunch of machines, so you need to wait a while for that to happen too. The batching of records is optimized for that, and the latency you can get with Spark ranges from a few hundred milliseconds to a second. For most applications, I think that is real time. If you want lower latency, that's obviously something we're looking at, and there's nothing in the engine stopping it from doing lower-latency work. The main question is which applications need both that and the huge volume, and there aren't really that many of them.

Even in IoT applications where you might want a couple-millisecond response at an edge device? Yeah, so the way that happens is, if you want a response at an edge device, often that's a local decision. People make it just on that device, and that's way easier to build than sending the data all across the network to somewhere else and having it send something back. So that's what I'm saying: there are existing solutions, like most people just stand up a web server and hit it with a request and get something back, or they have logic in the edge device. With Spark, it would be nice to extend the programming model to cover that, and that might make sense in the future, but we decided to first focus on the thing no one could do at all with existing architectures, which was doing more complex processing of lots of data. That's what we did.

Because real time, as we often say, really means in time to do something about it, right? And depending on how granularly you slice real time, eventually it becomes batches of fractions and fractions of milliseconds. So it really depends on the application. Yeah, and there are a couple of things. If you think of applications, there are many reasons why your application might be set up to output data periodically rather than all the time. One reason is you're doing sliding-window aggregation: you want the number of visitors to your website in the past five minutes. Well, to figure that out, you're going to wait five minutes, or however often you slide that window. It doesn't make sense to output that faster. And the second thing is, if you're working with data centrally, then you have to deal with late data, and by definition late data takes some time to arrive, so you need to wait a while before outputting something. So this is where that comes up.

So you have this tradeoff: if you're central, you need some latency just because it's a distributed system, and if you're at the edge, you have less context. Yeah, if you're at the edge there are also other, often simpler, solutions, like a small local application that does that. But we do always look at these things, and we design our roadmap and the engine based on what people want, and we've been pretty careful not to lock people into this. So it's almost like, if there's enough demand, we could see a Raspberry Pi type of thing, not necessarily running on a Raspberry Pi, but a way to push some of that computation down to a device like that. Yeah, something like that. That would be pretty interesting.
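To make the micro-batch and sliding-window ideas above concrete, here is a minimal sketch using the Spark Streaming API in Python. The socket source, one-second batch interval, and five-minute window are illustrative assumptions, not details from the conversation.

```python
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext(appName="VisitorsPerWindow")
# One-second micro-batches: incoming records are grouped into small batches
# rather than processed one record at a time.
ssc = StreamingContext(sc, 1)
ssc.checkpoint("/tmp/streaming-checkpoint")  # needed for windowed state

# Assume each line arriving on the socket is one page-view event.
lines = ssc.socketTextStream("localhost", 9999)

# Number of events seen in the past 5 minutes, recomputed every 60 seconds,
# i.e. the sliding-window aggregation described above.
counts = lines.countByWindow(300, 60)
counts.pprint()

ssc.start()
ssc.awaitTermination()
```

The point is not the specific source but the model: latency is bounded by the batch interval and the window slide, which is why a few hundred milliseconds to a second is the natural operating range mentioned here.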
Okay, so now the SQL optimizer, Catalyst. Yeah. Oracle is fond of saying, of all the MPP SQL databases sprouting up on Hadoop, oh, that's nice, but what do you think we've been doing for 40 years? There's a lot of intelligence that's gone into that. Sure. So tell us the sweet spot of your use cases right now for the query optimizer, and let's talk more broadly about its objectives.

Yeah, definitely. The goal of Spark SQL, much like the rest of Spark, is to be able to connect to a very diverse set of storage engines, wherever your data is. Big data in most organizations is distributed across many storage systems, and because it's big, it's by definition really hard to move. So the goal is to connect to those and let you run queries easily across them. The thing that's different from, say, a database like Oracle is that in those databases you spend a ton of time importing the data, and while you import it, the database figures out statistics about it, figures out how to lay it out, and can optimize for certain queries. In big data, the data is often sitting there in a file somewhere, and it would take a super long time to import and transform it. So you just need to work with what's out there, in whatever files you've collected, say a bunch of JSON files. Spark SQL is designed to work directly on those. For sources that have more information, it is actually able to use statistics; some of the file formats on Hadoop and some of the storage systems have that. Like Parquet, maybe. Like Parquet, yeah. But basically the sweet spot where it started out is for data that's just lying out there, where you want to compute something or ask something about it quickly.

So you said something interesting. If we distill it down, traditional databases figure out what's in there as you put it in. Exactly, when you put it in, yeah. And you figure it out when you're using it, because it's coming from other sources that you don't control. Yeah, exactly. So you can design the assumptions in your engine differently. Does that mean at large scale, when you include the time to ingest and to query, the performance tilts in your favor? Yeah, exactly, the performance wins out. This is an interesting thing: even back in the days of Hadoop and MapReduce, there were these discussions from database vendors along the lines of, we ran this query and it was this many times faster on our database than on MapReduce. But in those same discussions, they showed the loading time, and the time to load the data into that database was 10 or 50 times longer than it took to run the query on MapReduce. So unless you're going to do 10 or 50 such queries, the MapReduce approach was actually way better. It was schema on read: you just compute the thing you need and get out of there. And if you want, later you can load the data into something else and compute those statistics.

So this is why it was important to have a format people could agree on that was fairly structured. Exactly, yeah. You don't have to go through that transformation.
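As a rough illustration of the schema-on-read approach described here, this is what pointing Spark SQL directly at raw files might look like in Python. The paths, field names, and query are invented for the example.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("SchemaOnRead").getOrCreate()

# Schema on read: point Spark SQL at raw JSON files and let it infer the
# fields, instead of importing the data into a database first.
events = spark.read.json("hdfs:///logs/events/*.json")
events.printSchema()

events.createOrReplaceTempView("events")
top_pages = spark.sql("""
    SELECT page, COUNT(*) AS views
    FROM events
    GROUP BY page
    ORDER BY views DESC
    LIMIT 10
""")
top_pages.show()

# If the same data will be queried many times, it can be written out once as
# Parquet, a columnar format that carries its own schema and statistics.
events.write.mode("overwrite").parquet("hdfs:///logs/events_parquet")
```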
So now in the Hadoop and Spark world, what's happened is there's a wide variety of formats in between a totally unstructured flat text file and what you'd have in a database. Things like Parquet, for example, which is a columnar on-disk format used in Hive and elsewhere, have a lot of the statistics and indexes you'd find in a database, but they're also very easy to update incrementally. So unlike having to wait for your whole database to reorganize, you can easily add data and start querying it. There's this kind of spectrum, and we want Spark to support both ends of it. And again, there's no reason why we couldn't do it, especially because the hardware out there is so new, clusters with lots of memory and SSDs and so on, that a lot of database engines will have to be rewritten anyway, and we might as well write a good one for this kind of environment.

Okay, so let's turn to DataFrames, and I'm going to ask you to define them. You've previously said DataFrames were the biggest thing added last year, because you could connect them to all the back-end performance work that goes down to the bare metal in the hardware. Tell us what that means and how it works.

Sure. DataFrames are an API in Spark, a slightly higher-level API for working with data, but one that lets you do the same things you normally could. It lets you do basic operations on data, like grouping it, joining it, selecting certain fields, and so on, and also run your own custom code on it. They're based on the very popular data frame APIs in R and in Python. So if you're familiar with those small-data... Which we naturally are. Yeah, those small-data, single-machine tools. It's pretty cool, because you write essentially the same syntax and you get it to run at scale. The part that's different from normal Spark is, first, that the data model is more restricted. We know the fields in the data; they're not just arbitrary, say, Python objects. We know there are, say, five fields: a name, an age, whatever else. And a lot of the operations turn into what are called relational operators, like in SQL, which we can pass through the SQL optimizer. The one we talked about before. Yeah, exactly, Catalyst, which figures out the data as you're using it. Exactly. So you get many of the benefits of a more declarative data access layer. For example, if we see you're only using some fields of the data, we never need to read the other ones, things like that. But it's still easy to integrate into a program, in a language like Python or Java, and have a real program around it, not just a giant page of SQL. So that's what they are.
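Here is a brief, hypothetical sketch of what that DataFrame API looks like in Python; the column names and data are made up purely to show the shape of the code.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("DataFrameSketch").getOrCreate()

# A small DataFrame with known fields, much like the single-machine data
# frames in R or pandas, but distributed across the cluster.
people = spark.createDataFrame(
    [("Alice", 34, "NY"), ("Bob", 41, "CA"), ("Carol", 29, "NY")],
    ["name", "age", "state"])

# Relational-style operations (filter, group, aggregate) are expressed as
# operators that Catalyst can optimize, e.g. reading only the columns used.
by_state = (people
            .filter(F.col("age") > 30)
            .groupBy("state")
            .agg(F.avg("age").alias("avg_age")))

by_state.show()
```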
So we only have a couple of minutes left, but let's talk about machine learning. Many of us non-technical folks just think of it as this big black box: you put stuff in and good answers come out, or a model comes out that generates answers. Now, IBM contributed some technology, and there's something interesting coming out of Berkeley, KeystoneML. Tell us whether Spark can become host to something that essentially combines different models; I think the term is ensemble machine learning. I see, yeah. That's specific to a particular set of problems. Yeah, so that's a good question, and there are a few things. First, there are a ton of machine learning systems out there, and many of them are actually being written as libraries for Spark, which is really great, because it means that in a Spark application, however you load and prepare your data, you can pass it to all of these. So SystemML is one, KeystoneML, H2O; a lot of these kinds of engines can run as libraries.

In terms of ensembles, ensembles are a very powerful technique where you take many models, none of which is maybe that good on its own, but together, by voting and averaging out their predictions, you end up with something much better than any single one. We do have some methods like that in MLlib already. We have, for example, ensembles of trees, which are called random forests or gradient-boosted trees, but we'd love to have that across these libraries. I think the way to do that is to have a standard API for what a model is, and then you can build these ensembles, train them, and assign the weights in the right way. So we are trying to make it easy to plug these models into the same interface in MLlib, so external people can write something and we can say, oh, that looks like a classification model, it can predict whether something is spam or not, and we just combine it with the other ones. I wouldn't say this is done yet; it's still very early.

And then the last comment, I guess, would be that the value is that it ties in with the rest of the APIs in that one engine, so you can use it with streaming, with SQL. Yeah, and even with other ML libraries and so on. They need to agree on a common way of saying what the data is, what my predictions are, and how I express them. And once you establish that, different people can build the different pieces. Okay, yeah.
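For readers who want to see one of those built-in ensembles, here is a minimal, hypothetical MLlib example in Python using a random forest, which combines many decision trees by voting. The tiny dataset and column names are invented for illustration.

```python
from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import RandomForestClassifier

spark = SparkSession.builder.appName("EnsembleSketch").getOrCreate()

# Tiny made-up dataset: two numeric features and a 0/1 label (e.g. spam or not).
data = spark.createDataFrame(
    [(0.1, 1.2, 0.0), (1.5, 0.3, 1.0), (0.2, 0.9, 0.0), (2.1, 0.1, 1.0)],
    ["f1", "f2", "label"])

assembler = VectorAssembler(inputCols=["f1", "f2"], outputCol="features")

# A random forest is itself an ensemble: many decision trees vote on each
# prediction, which usually beats any single tree on its own.
rf = RandomForestClassifier(labelCol="label", featuresCol="features",
                            numTrees=20)

model = Pipeline(stages=[assembler, rf]).fit(data)
model.transform(data).select("f1", "f2", "label", "prediction").show()
```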
So I'll tell you, we were at the meetup last night. Huge turnout, 500 people, I couldn't believe it. Very, very techie, but I was fascinated that, A, you guys are building a product to really take advantage of all these advances in hardware, and B, you're immediately going for a 10x improvement over the last iteration. How fast is this thing going to grow? What are we going to be talking about a year from now? Yeah, so it's really hard to tell how things will go; it's very hard to predict. But so far we've been super thrilled by the growth of Spark, and it's definitely exceeded our expectations. Growth in contributors is one thing, but growth in users is even higher; most metrics of growth in users are up by about a factor of four in the last year, which is quite a bit. The thing we've focused on is making it very easy for people to contribute to Spark, with very little friction, so different vendors, different users, different companies can do their own thing, and all this stuff works together and moves forward with new releases of Spark. We've really tried to set up the process to let this continue to grow. Yeah, just to share some numbers from the meetup, I think they said 144 groups around the world and 66,000 members, so really aggressive growth rates and a really active community. Yeah, it's cool to see, and there's still a lot to do, but hopefully we have the pieces in place for that.

And just a question on engaging a new class of users: I came from a spreadsheet company way, way back, and I even had a pirated copy of VisiCalc before that. I'm going to admit this now on video. But that was like the original computational document. Sure. What can you do with notebooks now that might bring data to a new class of users? Yeah, that's a really good question, and this is a thing we've been thinking about a lot at Databricks. There are actually a couple of things, but one thing we announced today is basically Spark-powered dashboards. The idea here is you have a data scientist who knows some programming, or at least knows some SQL, and can create some reports. Dashboards are a way to take those, add parameters to them, like say, you can change which customer this query is about, and expose it as, kind of, a web page. That's a dashboard, and you give that to someone who isn't familiar with the underlying tools. So it's not quite a spreadsheet; it's a bit more restricted, kind of like a pivot table, but it is a way to expose that computation, backed by Spark. That's something we launched just today that will let people do that. And we'd love to see other interfaces too. We focus a lot, even for the data scientist's sake, on letting them expose their work to other people without having to be a bottleneck on the path to people finding things out.

All right, Matei, we could go all day; I know you and George could go all day. Next time we're going to order pizza and beer and coffee and do it like a regular meetup. So thanks for taking the time, and a Red Bull for George, the rock star of the event; there's a lot of energy here. We're theCUBE. We're here for two days of wall-to-wall coverage from Spark Summit East. I'm Jeff Frick with George Gilbert. Thanks for watching. We'll be back with our next segment after this short break.