George Gilbert: This is George Gilbert. We're here at Databricks talking with Patrick Wendell, who is engineering vice president at Databricks and leads the Spark release process. Patrick, let's get started on the evolution of Spark's most popular apps, and how they change with the Spark 2.x features. Databricks did a survey last fall where the most popular apps were business intelligence, data warehousing, recommendation engines, log processing, user-facing services, fraud detection, and security. Talk about how the capabilities customers were building with those apps a year ago can change as they start applying the 2.x features, like Tungsten Phase 2, Structured Streaming, DataFrames, which I guess are pervasive, GraphFrames, as those move into the product.

Patrick Wendell: Yeah, great. So maybe I'll start by talking a little bit about how we approached the whole design of Spark 2.0: how did we even get this feedback, what did we decide to work on, how did we think about changing these applications? There are a few ways we get input. One is that Databricks is, in some ways, the community steward of Spark. We started the project several years ago at UC Berkeley, and we get a lot of feedback from the community through those channels. That's the survey you're looking at right now; we had thousands of people tell us what they like and what they think could be improved. We also, as a SaaS company, operate Spark. So we're not only the developers of Spark; we actually operate a service with thousands of users using Spark every day. So we have a really interesting vantage point from which to see what people are looking for, what they're doing with Spark, and where we can improve.

And we spent a lot of time over the last six to twelve months thinking about the main things we could really move the needle on in Spark 2.0, the release that's just shipped now. I think there were two classes of use cases we really tried to improve on, and they manifest in the list you're looking at from that survey.

The first class I would describe as data warehousing-style use cases. That was, I think, number two on that list, but many of those workloads could in some ways be construed as data warehousing, and business intelligence is in that same camp. There, the main feature request we had from users was performance, performance, performance. They're trying to use Spark to replace more specialized, custom tools, maybe even offload from an MPP database, which is a very mature system for ad hoc SQL querying. So a major thrust in Spark 2.0 is performance for these ad hoc query, analyst-type workloads; that encapsulates the first two things you mentioned in that list. And there it's a lot of sophisticated database optimization techniques. Can we take better advantage of the hardware? Can we avoid bottlenecks by doing things like code generation and other engineering techniques that keep this nice API everyone likes, but make everything way faster under the hood?
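To make the code generation point concrete, here is a minimal sketch, assuming a Spark 2.x spark-shell session where a SparkSession named `spark` and its implicits are already in scope. Operators that Tungsten compiles together appear with a `*` prefix in the physical plan:

```scala
import spark.implicits._

// A simple scan-filter-aggregate query. With whole-stage code
// generation (Tungsten Phase 2), Spark 2.x compiles this pipeline
// into a single generated Java function rather than interpreting
// each operator row by row.
val query = spark.range(0L, 100000000L)   // Dataset of Longs
  .filter($"id" % 2 === 0)                // keep even ids
  .selectExpr("sum(id)")                  // aggregate in the same stage

query.explain()   // stages marked '*' run inside WholeStageCodegen
query.show()
```

The DataFrame and Dataset APIs are what give the engine this opportunity: because the query's structure is visible to Spark, it can generate tight code for the whole pipeline, which is harder with opaque RDD functions.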
George: Drilling down a little bit, that sounds like leveraging some of the progress made in Tungsten so far to take advantage of new hardware architectures.

Patrick: Absolutely.

George: More and more cores, whether processor cores or GPU cores, bigger memory pools, and storage-class memory approaching.

Patrick: Exactly. I think you really hit the nail on the head there. The whole Tungsten initiative is about getting closer to the hardware, in some sense. Spark is written on the JVM, which is a slightly higher-level abstraction, and what we're trying to do is eke every bit of performance out of the underlying hardware. The hardware trends, as you've observed, are more and more cores and more and more memory; the individual speed of cores is not necessarily getting much faster. So it's much more about having better parallelism and taking better advantage of the cycles within a given core. That's the data warehousing umbrella, and there are a bunch of cross-cutting improvements there in Spark 2.0.

George: I was saving this for Michael, but it seems relevant now. The query engine itself is another important part of that, and from what I gather it's a composable query engine where you can add rules, almost like you had slots in hardware. Oracle will say, hey, we've been at this for 40 years, don't even pretend to mess with us. How do you combine the extensibility of the query engine with getting close to the hardware? Does that give you a chance to leapfrog in some use cases?

Patrick: Yeah, that's a great question. I think Michael will have all the details on that, so I'll give a high-level answer. What we're really focused on in optimization is two things. One is the hardware: can we eke out as much performance as possible in I/O and query processing? But the other is what I would call software optimization. That's having a logical optimizer that can sit there and say, hold on, I've analyzed this query, and we can actually avoid reading a whole segment of the data and speed up this query by 100x, just by using some deductive reasoning about what's happening inside the query. That's an area we've innovated in as well. This optimizer we have is called Catalyst, and it's the thing that powers the core optimization inside Spark. It's the result of the PhD thesis of Michael, who you'll meet in a minute: his work on how to build a more composable optimizer. Can we have an optimizer that lets you plug in, exactly as you said, different rules? It turns out that that design, which is fairly novel, lets you get quite a lot of performance while keeping the optimizer very simple. And, you know, Spark is not a 20-year-old database; it's a young project with a handful of very bright engineers working on it. So anything that improves the efficiency of us making those queries faster is a huge win for us.
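As an illustration of that pluggability, here is a toy sketch against Spark 2.x. The rule itself (rewriting `x * 1` to just `x`) is hypothetical, and the Catalyst classes involved are developer-level internals rather than a stable public API:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.catalyst.expressions.{Literal, Multiply}
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
import org.apache.spark.sql.catalyst.rules.Rule

// A Catalyst rule is just a transformation over the logical plan tree.
object SimplifyMultiplyByOne extends Rule[LogicalPlan] {
  def apply(plan: LogicalPlan): LogicalPlan = plan transformAllExpressions {
    case Multiply(child, Literal(1, _)) => child   // x * 1 => x
  }
}

val spark = SparkSession.builder().appName("catalyst-demo").getOrCreate()

// Register the extra rule; Catalyst applies it alongside its
// built-in optimizations when planning subsequent queries.
spark.experimental.extraOptimizations = Seq(SimplifyMultiplyByOne)
```

Each rule stays small and independent, which is what keeps the optimizer simple while still letting it accumulate sophisticated behavior.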
George: Now, you said the data warehousing-related workloads include offloading from an MPP data warehouse. From what I gather, the things those do really well are query optimization and workload management at the same time, balancing a lot of queries, and doing the two together is a difficult, non-trivial task. Are you essentially trying to take out big chunks of data, because we have storage-class memory or we will, and then analyze that data set almost in isolation from what others are doing?

Patrick: Yeah, that's a great observation. There's really the multi-tenancy aspect, in some ways. Our customers may have hundreds or thousands of people trying to concurrently use Spark to do queries in the traditional business intelligence and data warehousing model. And there, I think the most interesting aspect is that Databricks is primarily a cloud service, so the way you think about multi-tenancy in that environment is really different. In a traditional model, you have a fixed set of machines, and that's not changing anytime soon, so what you're trying to do is, as efficiently as possible, let hundreds or thousands of users share all that hardware. In the cloud world, the underlying hardware is elastic. So it's possible for our customers, and really anyone running Spark in the cloud, which is now the majority of use cases for Spark, to elastically scale up their own instance of Spark independently of other business units inside the company. So the way we think about multi-tenancy and performance is a lot more about allowing for elasticity, allowing folks, when they're willing to pay for it, to acquire more resources, as opposed to traditional workload management, which is: I have a box of machines and I've got to pack what's in there as well as possible. It's a whole different challenge in the cloud, and in many ways it's better for the end user, because if they want more performance, they just add more machines. They can have them for a minute, for ten minutes, for an hour, and then get rid of them if they want to.

George: And I know we're not trying to endorse any tools here, but I would imagine there are tools designed around that sort of multi-tenancy, or essentially lack of multi-tenancy, using elasticity. Would one example be a Zoomdata, whereas a Tableau is more like taking a snapshot out of the core database and visualizing that?

Patrick: Yeah, I think that comparison is a good example. The Spark problem is more at the data infrastructure layer, actually executing queries; Zoomdata and Tableau are more at the BI layer. But it's a similar analogy in that sense. I think we're both focused more on elasticity and plug-and-play, and a little bit less on a monolithic architecture.
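In open-source Spark terms, the elasticity Patrick describes maps to dynamic allocation. Here is a minimal sketch with assumed values; a hosted platform such as Databricks handles this scaling at the platform level, so these knobs are illustrative rather than a description of its service:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("elastic-analytics")
  // Let the cluster grow and shrink with query load.
  .config("spark.dynamicAllocation.enabled", "true")
  .config("spark.dynamicAllocation.minExecutors", "2")    // floor when idle
  .config("spark.dynamicAllocation.maxExecutors", "200")  // ceiling under load
  // Required so shuffle output survives executors being removed.
  .config("spark.shuffle.service.enabled", "true")
  .getOrCreate()
```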
Patrick: Okay, so maybe I can talk for a minute about the second big thing. The first thing was data warehousing, and making that faster and easier and elastic, making it work really well. The second big set of questions and requests we had from users came from seeing that continuous, stream-style jobs were becoming much more common. Two or three years ago it was something you'd see here and there, and now many more companies both have streaming data ingest, so they've set up Kafka or some other type of message broker with streaming data coming in, and also have problems they're thinking about in terms of continuous, end-to-end delivery of that data, whether it's model serving or fraud detection or these types of things. So I think the second big theme, which I know you covered a lot with Matei because I was eavesdropping earlier, is how we can build better primitives for thinking about applications in a continuous fashion, instead of a more static view of the world where we just run a single job and might run it every night. It's more of an online, real-time view.

George: And this goes back, I assume, to that old notion of micro-batches, what's now hidden underneath DataFrames. So it's: let's incrementally update the stream, and we can query a certain window of it live. Or we can... I don't know, you don't do online learning yet, do you?

Patrick: So that actually is a feature in some of the newest releases of Spark. A great example: a lot of our customers, and Spark users in general, are using Spark for some type of model building. They're using a model to decide whether a message is spam, or to predict whether a customer is going to churn, or which ads they should show to which customers. They're building some type of mathematical model, and a lot of times those models become more precise if you update them incrementally as new data is flowing in. In some cases that's actually the difference between a model that works really well and a model that doesn't. So starting in the newer releases of Spark, we actually have support for this kind of incremental model building in the machine learning library. Under the hood, that uses the basic streaming support inside Spark. Someone could have written that on their own, but we provide built-in primitives for doing it in different places.

George: Okay, let me drill down on that a little bit to make sure I understood it, because I had understood that with Structured Streaming, the online learning use case for models was actually not in 2.0 but was coming in 2.1.

Patrick: Yeah, I think the initial pieces of it are in 2.0, but it will be something that evolves much more over the 2.x series. I think what's there now would be considered a preview or beta of the feature. And you're also talking, I think, with Joseph Bradley later today. He's the expert in this area, so he can give you the roadmap and timelines.

George: I keep getting sort of dragged in; you guys are so good at explaining it, I can't hold off for the guy who's the expert.

Patrick: Well, we're a kind of nerdy group, so we like to get into the nitty-gritty details.

George: All right, let's put a pause right there, and we'll come back and drill down. This is George Gilbert. We're with Patrick Wendell, and we're at Databricks. This is theCUBE, on the ground.
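As a closing illustration of the incremental model building discussed above, here is a minimal sketch using the DStream-based StreamingKMeans from spark.mllib. This is one existing example of incremental learning in Spark, not necessarily the exact feature Patrick refers to, and the directory path and parameters are hypothetical:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.mllib.clustering.StreamingKMeans
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setAppName("incremental-model")
val ssc = new StreamingContext(conf, Seconds(10))   // 10-second micro-batches

// Hypothetical input: each new file in the directory contains one
// 2-dimensional, space-separated vector per line.
val trainingData = ssc.textFileStream("/tmp/training")
  .map(line => Vectors.dense(line.split(' ').map(_.toDouble)))

val model = new StreamingKMeans()
  .setK(3)                    // number of clusters
  .setDecayFactor(0.9)        // how quickly older data is forgotten
  .setRandomCenters(2, 0.0)   // 2-dimensional random initial centers

// Cluster centers are updated incrementally with every micro-batch,
// so the model keeps improving as new data flows in.
model.trainOn(trainingData)

ssc.start()
ssc.awaitTermination()
```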