This is George Gilbert. We're at Databricks, on the ground, interviewing Michael Armbrust, the creator of the Catalyst query optimizer, which as we'll see is a very fundamental piece of Spark. He also has responsibility for Spark SQL and Spark Streaming, which taken together underpin much of what's new in Spark 2.0. So Michael, welcome.

Yeah, thanks for having me.

All right. Let's start with how the work you've done with Catalyst, and what comes on top of it, does so much to unify and simplify both the API in Spark and the engine. What was the motivation behind Catalyst? Tell us where it fits and why you built it the way you did.

Yeah, that's a fun story, actually. Catalyst started as a somewhat out-there research project. I was working on another relational database and realized that the query optimizer, a very important part of the system, was very rigid and very difficult to change. I was trying to implement some of my thesis research on top of it, and I took a step back and asked: why is it so difficult to add new concepts to the query optimizer? Maybe there's an easier way to do it. So we looked at a bunch of research on composable query optimizers and decided to try to build one. Once we had this really cool tool, applying it to Spark was the next logical step. There was already this system called Shark, but it was borrowing the query optimizer from Hive, and that was really starting to become the limitation. So we took the next step of giving Spark its own query optimizer, and that was how Catalyst began.

Okay, and just for some of our viewers who haven't gotten under the covers: the query optimizer is the core value-add that takes a query and turns it into something high-performance that gives you an answer back really quickly.

Yeah, exactly. And a key part of this is that it also comes with a logical language for describing the data flow you're trying to accomplish.

Okay, elaborate on the data flow.

You can say things like "I want to do a join" or "I want to do an aggregation," and that is a higher level than the RDD API, which is really more like individual code operators that you implement yourself.

And RDD was the original API for Spark. So you raised the level of abstraction so that you could look across a bunch of operations and optimize them.

Exactly. The key difference is what we like to call declarative programming, where you're telling the system what you want to do but not necessarily how to do it. Once you have a language for saying what you want, the system has all this new flexibility to try different ways to accomplish it; it can actually explore the space and find the best way to do it.

Okay, so still on the database and SQL track: take the composable capability, which I assume means you could add rules without breaking the engine apart and rewriting it over a period of ten years, as Oracle would probably tell you, and combine that with this declarative capability. What did that allow you to start expressing that wasn't possible before?

Yeah, so when we first started we were focused primarily on SQL, and we realized that was a great use case. But there were a lot of people who wanted to express things that are difficult to do in SQL.
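To make the contrast concrete, here is a minimal sketch, not from the interview, of the same per-group average computed the RDD way and the declarative DataFrame way in Spark's Scala API; the column names and sample rows are made up for illustration:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.avg

val spark = SparkSession.builder.appName("declarative-vs-rdd").getOrCreate()
import spark.implicits._

// Hypothetical data: (department, salary) pairs.
val df = Seq(("eng", 100.0), ("eng", 120.0), ("sales", 90.0)).toDF("dept", "salary")

// RDD style: you spell out *how* to compute the average yourself.
val rddAvg = df.rdd
  .map(row => (row.getString(0), (row.getDouble(1), 1L)))
  .reduceByKey { case ((s1, c1), (s2, c2)) => (s1 + s2, c1 + c2) }
  .mapValues { case (sum, count) => sum / count }

// DataFrame style: you declare *what* you want; Catalyst decides how to run it.
val dfAvg = df.groupBy("dept").agg(avg("salary"))
```

Because the second version is a logical plan rather than opaque user code, the optimizer is free to reorder, combine, and code-generate the work.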
They had complicated user-defined functions, they wanted to do machine learning; they had their own code that they wanted to inject into the query engine. And what we found was that the RDD API was actually a very natural way to do this. So we wanted to come up with some way to marry the declarative programming of SQL with the low-level language integration that RDDs gave you, and that's where DataFrames came from.

Okay, just to recap: declarative means "here, I'm just going to say what I want to do, not how," so that's a lot simpler. It's like punching a destination into my navigation system: I want to go here, you tell me how.

Exactly. You say "compute an average," not "add them up and count them."

Okay, and then the RDD gave you the ability to bring your own functions and data types. But in RDDs you couldn't have that declarative capability, other than if you were working with the Hive engine that was borrowed from Hadoop.

Exactly, exactly.

So okay, the two came together and that created DataFrames. Tell us then, starting in SQL, what did that allow you to do, including the richness of the UDFs and the UDTs?

I think what it allowed developers to do is, when they want to do relatively simple but very important tasks like filtering, aggregating, and joining, the bread and butter of SQL, they can do that without having to re-express that logic on their own. But when they want to do something that doesn't fit in that box, they can seamlessly transition into their own code without switching languages, without switching frameworks. Before, you'd actually have to take multiple tools and compose them together.

So this was the old complaint with SQL: even though it was great at letting you say what you want without requiring you to say how to get it, there was a lot you couldn't fit into the SQL language. Now you were able to surface those capabilities, or allow the language to be extended, with user-defined types and functions.

Exactly.

So for people who are familiar with Microsoft LINQ, it was basically one language that gave you access to the data and anything else you wanted to do to process the data. Is that fair to say?

Yeah, absolutely. I think we took a lot of ideas from LINQ and then implemented them on Spark's distributed runtime, which is where a lot of the power comes from.

Okay, so how did that capability migrate into all the other Spark APIs?

Yeah, that's a great question. Once we had this more general framework for piping data through, integrating with different data sources, and expressing all of these transformations, things like MLlib naturally wanted to take advantage of it. They had spent a bunch of time building up their own ways to let users flow data through a machine learning pipeline and say things like "I want to use this column as features" and "I want to transform this column in this way." Having the DataFrame API, and the logical plan structure that Catalyst provides, was a natural way for them to take advantage of a lot of that without having to reimplement it inside of machine learning.

So DataFrames became a structure for people to represent machine learning pipelines. And underneath DataFrames, you could also use Catalyst to optimize how that data flow worked.

Exactly.
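As a rough illustration of both points, here is a hedged sketch, with a made-up data set and column names, of a Scala UDF dropped into a declarative DataFrame query and then fed into an MLlib pipeline that picks columns as features:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.udf
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.ml.classification.LogisticRegression

val spark = SparkSession.builder.appName("df-udf-pipeline").getOrCreate()
import spark.implicits._

// Hypothetical training data: (clicks, seconds on page, label).
val raw = Seq((3, 40.0, 1.0), (0, 5.0, 0.0), (7, 120.0, 1.0))
  .toDF("clicks", "seconds", "label")

// Custom logic that plain SQL would not express naturally, written as
// ordinary Scala and registered as a user-defined function.
val logSeconds = udf((s: Double) => math.log1p(s))
val prepared = raw.withColumn("logSeconds", logSeconds($"seconds"))

// An MLlib pipeline consuming the DataFrame: declare which columns are
// features, then train a model on the result.
val assembler = new VectorAssembler()
  .setInputCols(Array("clicks", "logSeconds"))
  .setOutputCol("features")
val lr = new LogisticRegression().setLabelCol("label").setFeaturesCol("features")

val model = new Pipeline().setStages(Array(assembler, lr)).fit(prepared)
```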
So the same performance optimizations you applied to, essentially, SQL queries you could now apply to machine learning pipelines.

Yep, exactly. I would think of DataFrames as the skinny waist for expressing data transformations inside of Spark.

Okay, and then how did that work with Structured Streaming? How did that help make Structured Streaming so powerful?

Yeah, so maybe before I dive into that, a little bit of history. There was another project before Structured Streaming, just called Spark Streaming, which used the DStream API. It was a very low-level API, very similar to RDDs, where you express exactly how you want the streaming computation to proceed. After hearing a lot of feedback from users, we found they liked the power of that API, but they didn't like having to express these complicated concepts on their own. If you have a stream and you want to do a count, or you want to compute an average, you just want to say that. So adding DataFrames to streaming was a natural next step: once you've answered the question once, you want to answer the question over and over, in real time, incrementally.

Okay, so in other words, DataFrames elevated, for lack of a better term, the abstraction level of the entire Spark API.

Exactly. Once you have this high-level way of expressing things that is very general, that lets you go back and forth between declarative and functional programming, you want to use it in many different places, not just for batch queries.

And for many of the popular Spark applications, whether data warehousing, business intelligence, or recommendation engines, how did you see their functionality changing, or how much simpler and more accessible did they become for developers?

Yes, I mean, we're just scratching the surface with Structured Streaming. It was just released in 2.0, I think, two weeks ago. But what we're seeing already with customers is that cases where they would have had to write a significant amount of DStream code, both for integrating with different data sources and for expressing the transformations once the data is inside Spark, become significantly simpler. Instead of having to rewrite and rethink it in terms of streaming, you take exactly the same code that you wrote for your batch job and turn it into a continuous application that incrementally computes the answer as new data arrives.
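A minimal sketch of that batch-to-streaming reuse, assuming a hypothetical directory of JSON events with an "action" field; only the read and write sides change, while the transformation stays the same:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.appName("batch-to-stream").getOrCreate()

// Batch version: read a static directory of JSON files (path is hypothetical)
// and count events per action.
val batch = spark.read.json("/data/events")
val batchCounts = batch.groupBy("action").count()

// Streaming version: the same transformation, but the source is treated as an
// ever-growing table and the answer is updated incrementally.
val stream = spark.readStream.schema(batch.schema).json("/data/events")
val streamCounts = stream.groupBy("action").count()

// Starting the sink turns the batch query into a continuous application;
// "complete" mode re-emits the full, up-to-date aggregate.
val running = streamCounts.writeStream
  .outputMode("complete")
  .format("console")
  .start()
```

The streaming query keeps running and refreshes the counts as new files arrive, which is the "continuous application" idea described above.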
Oh, so the idea is that you go from a table which is finite, when you're dealing with batch, to an infinite or never-ending table. Would it be fair to say that when you're dealing with streaming, you're also pretty much giving up the notion that you're going to get a definitive answer, because you might have stragglers among the items coming in and you can't wait infinitely?

Sure.

And one of the things that's sort of a feature of big data and fast data is that you don't get a definitive answer; you get a best-effort answer or a probabilistic answer. Is that a trade-off you made to make DataFrames and Structured Streaming accessible to people who just understood batch?

Yeah, so we spent a lot of time thinking about the model. Traditional streaming systems often make you actually think about streaming: think about late data, think about how you're going to adjust things as data arrives out of sequence. With the DataFrame API, in Structured Streaming, we wanted to take a slightly different tack. We came up with this notion of triggers, and I think the right way to think about it is that you have an ever-growing table as data arrives. You execute a query at certain trigger intervals, and we give you the best answer as of the time the trigger fires. The nice thing about this is that the trigger then fires again in five seconds and now you have a new, up-to-date answer that is logically consistent. It doesn't backtrack from things that you saw before. It may continually update if more data has arrived since the last trigger interval, but you can still think about it as a batch query that ran at that time on some set of data.

Okay, and now that we have the DataFrame as a higher-level API, the whole issue of micro-batches versus per-event processing is now under the covers.

Absolutely.

And sometime in the future you might have a lightweight, per-event structured stream processor out at a gateway near IoT devices.

Yep. So this was another pain point that we heard from users of DStreams: in particular, the correctness of your answer could actually depend on the micro-batching interval that you chose. That was a key part of the user-facing API. If you look at DataFrames, since it is the same API for batch and streaming, there is no notion of exactly how the computation happens under the covers. You just express the computation you want, and we figure out how to execute it incrementally. As a result, it's very easy for us in the future to change exactly how the execution is done under the covers without you having to rewrite your application.

Okay, so that sounds important, and it sounds like the trigger corresponds to the window?

Yeah, see, this is actually an important thing: triggers and windows have nothing to do with each other. Windows are based on event time, the time stated inside the stream of data, based on the semantics of your application. Triggers are just the intervals at which we're going to produce answers. You can change the trigger interval without changing the results of the query; it only changes when you see the results of the query.
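A small sketch of that distinction, using Spark's built-in "rate" test source as a stand-in for a real stream; the window is declared on the event-time column, while the trigger only sets how often results appear:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.window
import org.apache.spark.sql.streaming.Trigger

val spark = SparkSession.builder.appName("windows-vs-triggers").getOrCreate()
import spark.implicits._

// The "rate" test source emits rows of (timestamp, value); here we treat the
// timestamp as the event time embedded in the data itself.
val events = spark.readStream
  .format("rate")
  .load()
  .withColumnRenamed("timestamp", "eventTime")

// The window is part of the query's semantics: it groups by event time.
val counts = events
  .groupBy(window($"eventTime", "10 minutes"))
  .count()

// The trigger only controls how often an updated answer is produced.
// Changing it changes when you see results, not what the results are.
val query = counts.writeStream
  .outputMode("complete")
  .format("console")
  .trigger(Trigger.ProcessingTime("5 seconds"))
  .start()
```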
Okay, okay, that sounds... all right, I think I understand. So how will Structured Streaming roll out within the product over time, to affect or integrate with all the different APIs?

Yeah, that's a good question. I think in 2.0 our primary focus was putting a lot of thought into this model and the API we were presenting to users. We wanted to make sure it was going to be a solid foundation to build on, so that when people started building applications, they wouldn't have to rewrite them with each new release of Spark. Moving forward, I think you're going to see a lot of progress in a couple of different areas. One is integrating with different sources: right now we can read files from HDFS, but we want tight integration with S3 and Kinesis and Kafka and many of the other pieces of the streaming ecosystem. Another thing we're working on is integration with other parts of Spark, in particular machine learning. Spark MLlib is great if you have a batch of data and you want to train a model on that batch, but usually you actually want to incrementally improve the model as new data arrives. I'm actually working with an intern here at Databricks this summer on building model updating into the Structured Streaming framework.

Okay, and so I assume that's the sort of online learning. Would there be a corresponding sort of persistence, and then instantiation, for online predicting in a parallel stream?

Absolutely, yeah. I mean, once you've trained the model, you actually want to be able to use it to do things. We're looking both at model serving and at being able to do this learning or scoring directly on streams of data.

Which would be when you take the model that you've updated and put it into production on, potentially, the same stream. But there's got to be some work there; I assume if you're updating the model you might even add new features, potentially unsupervised. And then somehow you might have to evaluate a bunch of different models before you pick one and put it back into production.

Yeah, I mean, we're just scratching the surface here.

Okay.

I think there are a lot of interesting questions to answer. Right now what we're looking at is just how do you even do this, and how do you express it in terms of the APIs that we have, just to get a continually updated model. Using multiple models, doing ensembles, those are all still open questions.

Okay. What about this: I imagine that streams, potentially high-volume streams with online learning, can be very computationally expensive. Can that work in a clustered environment if the underlying algorithms scale out?

Absolutely. That's why we're working closely with the MLlib team, who are experts at building these scale-out models, so that any of the training we're doing can be done on a distributed cluster and scale up as the volume of incoming traffic goes up.

Okay. With that, Michael, let's break and return to database scenarios. This is George Gilbert. We're at Databricks with Michael Armbrust, on the ground at Spark Central. We'll be back.