George: I'm George Gilbert. We're here in downtown San Francisco at the Databricks office, and we've got a special guest with us today: Matei Zaharia, creator of Apache Spark and co-founder of Databricks. Matei, welcome.

Matei: Thanks, George.

George: So let's start at a high level. A lot of the discussion with you and other members of the Spark team is about these amazing APIs you've built. Let's go up a level and talk about the types of apps people have been building for 60 years: batch and interactive. And now there's a new type, continuous applications. Tell us where that fits in.

Matei: Sure. After seeing a lot of the use cases of Apache Spark and of other data processing frameworks, we introduced the name "continuous applications" to cover something everyone is already trying to do: applications that integrate real-time data, as it arrives, into a complete end-to-end system. So it's not just computing on a stream and outputting another stream. It's integrating the stream into a system like, say, a recommendation engine, or some kind of alerting system, or credit card fraud detection, where many people interact through the same computer system.

George: OK, so let's drill down a little and talk about how you still need batch processing and interactive, or request-response, processing. With the addition of stream processing, how does that make a continuous app?

Matei: The main idea is that applications that process real-time data still involve many forms of processing. Even if you have real-time data coming in, that's usually not all your data. You also have other data sitting on the side that's static, or that you update every night, and you combine the two.
And likewise, even though you might be computing some things purely on the stream (for each item that comes in, compute something that goes out), you probably also have people in the loop doing interactive queries that change the workload: "Oh, we should be running this other thing on the stream," and so on. Our observation was that a lot of the systems people used to build these focused on just one aspect: just the streaming part, or just the interactive queries, or just the batch part. But really, the developers and organizations using them are trying to build an end-to-end application. So we want to design APIs where you can combine these pieces end to end and have something like a recommendation system where people can interactively tweak the recommendations.

George: To extend that example, it would mean the recommendation system updates itself in real time, rather than overnight saying, "OK, here are some new movies for you."

Matei: Yeah, it would update itself in real time, but it would also be integrated with the other things you want to do with that system. If someone comes in and just wants to ask a question about its current state, like how many recommendations we made for this item, they could do that through standard interactive query tools. Or if someone wanted to bring in a static data set that's relevant to the recommendations, even though it's not real-time data, it would be possible to put that in as well.

George: All right, so summing up at the level of the APIs that make Spark so great: you're adding an API that goes along with batch and with interactive, so that these applications can be built on a coherent foundation.

Matei: Yes, exactly.
Matei: And in fact, for people familiar with the Spark APIs, we're using DataFrames and Spark SQL as the single API that extends across batch, streaming, and interactive processing. Anything that makes sense in that API, the engine will know how to run at these three levels of latency, all the results will be consistent, and all of them will be possible to combine in your application.

George: OK, and I'm going to drill down on that with Michael, I think. Now let's get into some terminology that may or may not be familiar to a lot of people. We've heard for years about the Lambda architecture, which at the time was an attempt to combine stream processing and batch processing, and which is now, somewhat pejoratively, termed a bit of a hack. How does Structured Streaming in Spark 2.x get around that?

Matei: That's a great question. The Lambda architecture came out when there were many batch processing systems for large datasets and not many real-time ones. The first real-time systems were starting to be built, but they didn't necessarily have strong guarantees about what they would compute or how they would react to faults. So the idea of the Lambda architecture was: you receive data periodically, and we know how to run batch jobs to get a result. The batch jobs are consistent, correct, and fault tolerant, and they'll always give the right answer, but they're really slow. So we'll run them, say, every few hours, and then we'll have a second layer, a second copy of the computation, that uses one of these fast but less accurate streaming systems to give us streaming results. We fill in with those streaming results while the data is new, and later we replace them with the batch results. That was the idea.
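To make the "one API, three latencies" point concrete, here is a minimal PySpark sketch showing the same DataFrame aggregation written once over a batch source and once over a streaming source. The file paths and the schema are hypothetical placeholders, and running this requires a Spark deployment, so treat it as an illustration of the API shape rather than a finished program.

```python
# Sketch: the same DataFrame query over batch and streaming sources.
# Paths and schema below are hypothetical; a running Spark cluster is assumed.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("unified-api-sketch").getOrCreate()

event_schema = "item_id STRING, amount DOUBLE"  # hypothetical schema

# Batch version: read static files and count events per item.
batch_counts = (
    spark.read.schema(event_schema).json("/data/events/")  # hypothetical path
         .groupBy("item_id")
         .count()
)

# Streaming version: identical query logic, but the source is a stream of
# files arriving in the same directory. Spark runs it incrementally and
# keeps the result consistent with what the batch query would produce.
stream_counts = (
    spark.readStream.schema(event_schema).json("/data/events/")
         .groupBy("item_id")
         .count()
)

# The streaming result is materialized to an in-memory table that
# interactive SQL queries can read while the stream keeps running.
query = (
    stream_counts.writeStream
        .outputMode("complete")
        .format("memory")
        .queryName("item_counts")
        .start()
)
```

The point is that `groupBy("item_id").count()` is written once; only the `read` versus `readStream` entry point changes.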
Matei: And the question became: how do we write applications so we can reuse some of the code across these two layers? How do we make the switch when the batch version replaces the streaming one, and so on? Does that make sense?

George: So there was a fair amount of complexity, and a fair amount of error, because the most recent data was coming from the fast system, which didn't have enterprise-grade availability and resilience.

Matei: Yeah, exactly. It could change under you, or it could miscount things because of the trade-offs those systems made. If you're just doing a rough approximation, like how many people are looking at this thing on social media, which is actually where this architecture came from, then it's probably fine; it's an easy way to put these together. But when people start doing that, they quickly run into challenges when the results don't match. Something we saw at a lot of companies is that they built a real-time streaming pipeline and told customers, "Hey, now you can see this metric in real time, isn't that cool?" And then they had a batch pipeline that calculated the exact results and sent a report every night, or sent a bill. And the customer would say, "Hey, I was looking at your streaming dashboard at five o'clock. It said there were 10,000 users on my video, but now you're charging me for 11,000 users. What's up with that?" That's the kind of problem that comes up.

George: OK, so let's take that into a particular application. Let's talk about fraud prevention, where you don't have to build the two separate pipelines, fast and big. How would it change how you build a fraud prevention app? And actually, not just how you would build it, but its capabilities.

Matei: That's a good question.
Matei: When you use continuous applications and Structured Streaming in Spark 2.0, the main thing is that they're designed to give consistent results across batch and streaming. Regardless of whether you have failures, or some data arrives late, they'll give you the same consistent result. And in the same application you can combine the two, and no matter which one a result came from, you can build on top of it. To go to the fraud detection case, what that lets you do is take some complex algorithm or report that you run every night and run it in a streaming fashion as data arrives, and the system will make sure you get the same results the batch job would have, or it will tell you, "I don't understand how to run this in a streaming fashion." The second thing it lets you do is easily combine static data, which you update rarely using batch jobs, with the real-time data. The same way you programmed an application against that static data, you can now apply it to the real-time data.

George: So as we look out several quarters and see the maturation of Structured Streaming, and perhaps the integration of machine learning, would that mean we're able to do a better job of catching new fraud patterns and rolling them into the application?

Matei: Yeah, basically there are two or three things it enables. First, you'll be able to take the same sophisticated algorithms you would use on static data, run them on a stream, and get the same results. You don't have to worry about whether your results mean something different in the streaming version than in the batch version. Second, you'll be able to use the other pieces of Spark, such as interactive queries, on the state of your real-time application.
Matei: A really common scenario in fraud detection is: we denied some credit card transaction, or some application, for a customer, and then the customer calls and asks why. Today, most organizations have a totally different system where analysts on the phone drill into the data and say, "I think it's because of this or that." With a continuous application, you can build that into the same application and get the same consistent view. You don't have to worry about whether the data you used to deny the application made it into your customer response database, or whether your customer will be confused because the analyst can't see why it was denied.

George: So it's almost like there were two distinct parts to the application. One was the big data and one was the fast data, the current stuff.

Matei: Yeah, exactly.

George: And now they're the same.

Matei: Exactly. And in some cases I even see a third, interactive part: "Let's periodically put this stuff into a data warehouse, and then the people on the phone can fill in some forms and request the records for this customer." Logically, what the credit card company wants is a single application with these different facets: streaming, interactive, and batch. But because they only have systems that can do one at a time, they're forced to build separate systems, and then it's their job to keep them in sync. That's really hard, and when it breaks, it breaks in ways that are very hard to diagnose.

George: OK. With that, let's close out our look at Spark 2.0 and continuous apps, and then we'll come back with a look at the roadmap.
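The "analyst on the phone queries the live state" scenario can be sketched with Structured Streaming's memory sink, which registers the running aggregation as an in-memory table that ordinary Spark SQL can query. The denial schema, path, and reason values are hypothetical, and a running Spark session is assumed.

```python
# Sketch: exposing the live state of a streaming aggregation to
# interactive SQL. Schema, path, and reason codes are hypothetical.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("denials-sketch").getOrCreate()

denial_schema = "card_id STRING, reason STRING"
denials = spark.readStream.schema(denial_schema).json("/incoming/denials/")

# Continuously maintained count of denials per reason code.
per_reason = denials.groupBy("reason").count()

# The memory sink keeps the full result as a queryable in-memory table.
(per_reason.writeStream
    .outputMode("complete")
    .format("memory")
    .queryName("denials_by_reason")
    .start())

# An analyst tool can now run interactive SQL against the same state the
# streaming job is maintaining, with no separate warehouse copy:
spark.sql(
    "SELECT * FROM denials_by_reason WHERE reason = 'risky_merchant'"
).show()
```

This is the consistency argument from the interview in miniature: the interactive query and the streaming job read one state, so the answer on the phone matches what the real-time pipeline actually did.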