This is George Gilbert. We are on the ground at Databricks, joined by Michael Armbrust, the creator of the Catalyst query engine, which, as we're learning, enables most of the functionality in Spark, so I expect we'll see him perhaps as Matei's successor. That's just slightly tongue-in-cheek. We are talking now about database integration scenarios. So, Michael, tell us about the use case the Data Source API was designed for, why Spark and Databricks stayed out of the database and storage sphere, and how this lets you work with those systems.

Yeah, so the primary motivation behind the Data Source API is that data has gravity. People are going to have existing systems where the data lives, and no matter how much programming I do, I'm never going to be able to integrate with all of them. So we wanted to open it up so that anyone could plug their data source of choice into Spark and still take advantage of the query optimization that Catalyst is able to do. We rolled this out in Spark 1.3, so I guess over a year ago now, and we wanted to let developers who use this API plug in at different levels. At the simplest level, all it's doing is pulling all of the data out of some table and putting it into Spark; that's just a standard table scan. But of course, many of these systems have indexes and the ability to do more complicated processing, so we wanted to give them the ability to do some pre-work before passing the data on to Spark in order to be more efficient. So we also push down filters and column pruning. We'll say: the user applied this filter, you only need to give us data that matches it. Or maybe you have 100 columns, but the user only cares about two of them. We pass all of that information down into the data source so that it can execute the computation more efficiently.

So it sounds like the burden is on the developer who's building a driver for that data source to understand how much intelligence is there. And as you say, this assumes all the data is elsewhere. So how might that evolve if we want applications like Microsoft LINQ, one of the original inspirations for Spark, where you can perform SQL queries that are not just scans, where you might be performing transactions, but you're also integrating it into the Spark language?

Yeah. I think what you end up with is something that's very complementary to these more transactional data stores. You use SQL Server and these other systems for point lookups, updates, and relatively small queries. What those systems aren't particularly good at is doing large analytics: joining with other data sources stored in Parquet or JSON or HDFS and doing these more complicated computations. So what I see a lot of users doing is they'll have an operational data store that they keep up to date and do their transactional work on, and in Spark they'll do the more complicated analytics as well as the integration with other things.
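To make that concrete, here is a minimal Scala sketch of the pattern Michael describes: reading from an operational store with filters and column pruning pushed down through the Data Source API, then joining with a large Parquet log. The table names, connection string, and columns are hypothetical placeholders, not anything from the interview.

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical "orders" table in Postgres; credentials omitted for brevity.
val spark = SparkSession.builder.appName("jdbc-plus-parquet").getOrCreate()
import spark.implicits._

val orders = spark.read
  .format("jdbc")
  .option("url", "jdbc:postgresql://db-host:5432/shop") // placeholder URL
  .option("dbtable", "orders")
  .load()

// Catalyst pushes the filter and the two-column projection down through the
// Data Source API, so the database only ships matching rows and columns.
val recent = orders
  .filter($"order_date" >= "2016-01-01")
  .select($"customer_id", $"total")

recent.explain() // the physical scan lists the pushed filters and pruned columns

// The analytics Spark is good at: join the operational data with an event
// log that would never fit in the transactional store.
val events = spark.read.parquet("hdfs:///logs/events.parquet")
val joined = recent.join(events, "customer_id")
```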
So what would be some of the use cases? With the Data Source API you might have data streaming in, you look up some historical data, and that informs an analysis. But when you need transactional capabilities, where you still have data streaming in and you might look up some historical data, but you also do these point queries and an update, maybe like a loyalty system, where would that be valuable?

Yeah, so typically what you'd end up doing is using the SQL system for the actual point lookups. But when you want to do something more complicated, like train the recommendation engine for the loyalty model, you want to pull all of the data out, do multiple passes and iterations over it, and possibly join it with transaction logs that are too big to fit in your transactional store. That's really where the power of Spark and the Data Source API comes in.

Okay, so using your transactional store to do a couple of lookups, make a decision, and then actually store the decision back in the transactional database, that's a bit beyond the corner case?

Actually, that kind of use case is more popular in Structured Streaming. Something we'll see is that there will be a stream of data coming in, you'll be computing some aggregates or windows over it, and one of the possible sinks for Structured Streaming is a transactional store. So what you end up doing is Structured Streaming does the computation on a huge volume of data, boils it down into the aggregates you're particularly interested in, and then updates the transactional store in real time with the latest results so they can be queried by other operational systems.

Oh, okay. So in other words, it could be looking things up through the Data Source API, using the intelligence of the underlying OLTP system, and then, when it makes its decision using the Spark-native analytics, it can write the update back at the end of the structured stream. Okay, that's very clear now.

Yeah, exactly. So what you end up with is this vast amount of data that you wouldn't be able to process otherwise, but in the end you also get an operational store that you can serve dashboards, reports, and these more operational things from.

Okay, so in this scenario it's almost like you are surrounding SQL and making it part of the Spark language, so you're not dropping into some callable interface or a second language where SQL has an impedance mismatch. This makes Spark your end-to-end data processing and analysis language.

Exactly, and you can actually also write SQL. We'll parse that as well.
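Here is a minimal sketch of that Structured Streaming pattern, assuming purchase events arrive as JSON files and a hypothetical customer_totals table lives in Postgres; the foreach sink shown is one way to push aggregates into a transactional store, and the schema and SQL are placeholders.

```scala
import java.sql.DriverManager
import org.apache.spark.sql.{ForeachWriter, Row, SparkSession}
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._

// Hypothetical stream of purchase events arriving as JSON files.
val spark = SparkSession.builder.appName("stream-to-oltp").getOrCreate()
import spark.implicits._

val schema = new StructType()
  .add("customer_id", LongType)
  .add("amount", DoubleType)
  .add("ts", TimestampType)

val purchases = spark.readStream.schema(schema).json("hdfs:///incoming/purchases")

// Boil the stream down to a running total per customer
// (windowed aggregates work the same way).
val totals = purchases
  .groupBy($"customer_id")
  .agg(sum($"amount").as("total"))

// Push each updated aggregate into the operational store so dashboards and
// other operational systems can query it in real time.
val query = totals.writeStream
  .outputMode("update")
  .foreach(new ForeachWriter[Row] {
    var conn: java.sql.Connection = _
    def open(partitionId: Long, version: Long): Boolean = {
      conn = DriverManager.getConnection("jdbc:postgresql://db-host:5432/shop") // credentials omitted
      true
    }
    def process(row: Row): Unit = {
      val stmt = conn.prepareStatement(
        "UPDATE customer_totals SET total = ? WHERE customer_id = ?")
      stmt.setDouble(1, row.getAs[Double]("total"))
      stmt.setLong(2, row.getAs[Long]("customer_id"))
      stmt.executeUpdate()
    }
    def close(errorOrNull: Throwable): Unit = if (conn != null) conn.close()
  })
  .start()
```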
Okay, that's very powerful. So we talked a little bit, I think it was with Matei, about Tungsten and Catalyst, the new hardware changes coming together, and how Catalyst is open in a way that traditional DBMS query optimizers are not. What sort of capabilities can we look for over the next couple of years, where the ecosystem can contribute to the evolution of the query optimizer, Catalyst, through its rules, and where Tungsten gives us access to hardware in the form of storage-class memory and potentially thousands of cores through GPUs? What becomes possible?

Yeah, so I think the real power here is exactly what we started the interview talking about. We have this logical language where users can express their computation, and then we have a pluggable system that takes their expression of the computation and turns it into actual execution. Spending all that time building this infrastructure has been very valuable, because we've actually completely rewritten the execution engine multiple times now, even in the relatively short history of Spark SQL. We started with a really simple version that was just imperative code that I wrote, but when we realized that performance and CPU were really starting to become the bottleneck, we rewrote it as the Tungsten engine you were talking about, where we use low-level operators and generate bytecode on demand for the specific query you're trying to run. So I think it's this separation between the language the users write in and the way we do execution, that interface, that gives us a lot of power to swap the engine out, without users having to do any extra work, as we continually improve what's going on under the covers.

So would this be an example where incumbent DBMS vendors, who can run either OLTP or decision-support workloads, have code bases built up over decades that make it difficult for them to achieve the capabilities you have in your execution engine and query engine?

Yeah, I think that's one limitation. The other is that these systems are traditionally SQL-only, and when you want to do more advanced analytics you often need to drop into Scala, Java, Python, or R code for the more advanced use cases, after you've done the bread-and-butter joining, filtering, and aggregating.

And those aren't part of that core engine; they're potentially a different execution engine.

Yeah, exactly. Our logical plan structure is able to describe both these relational operations and tight integration with custom user code.

So that means you can build pipelines over hundreds of petabytes, on these huge clusters, with in-memory performance, using not just the cluster-level parallelism but the cores within the cluster nodes. So we could be seeing performance from Spark that the others just can't compete with.

Yeah, exactly. And in addition to this parallelism, the real focus of Tungsten, beyond spreading the work across many cores, is single-core efficiency, so that you get more out of each core you have available to you.

Okay, on that note, it's kind of hard to beat that note. This is George Gilbert. We're with Michael Armbrust, on the ground at Databricks, talking about Spark futures. We'll be back.
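As a closing illustration of the logical-plan point Michael makes, here is a small sketch that mixes relational operators with arbitrary Scala code in one Catalyst plan; the data set and the riskScore function are made up for the example.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.udf

// Hypothetical customer data plus an arbitrary Scala scoring function.
val spark = SparkSession.builder.appName("catalyst-plan").getOrCreate()
import spark.implicits._

val riskScore = udf((total: Double, visits: Long) => total / (visits + 1))

val customers = spark.read.parquet("hdfs:///warehouse/customers.parquet")

val scored = customers
  .filter($"active")                                   // relational: filter
  .select($"id", $"total", $"visits")                  // relational: projection
  .withColumn("risk", riskScore($"total", $"visits"))  // custom user code

// explain(true) prints the parsed, analyzed, optimized, and physical plans;
// the operators marked in the physical plan are the ones Tungsten compiles
// to bytecode at runtime via whole-stage code generation.
scored.explain(true)
```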