This is George Gilbert. We're with Patrick Wendell of Databricks. He is VP of Engineering, and we are talking about specialization versus integration. We all know that Spark has done a better job than pretty much anyone else of integrating its APIs to make a new class of applications possible. But traditional trade-offs have made specialized products generally faster or more feature-rich. So let's, at a top level, talk about how you made trade-offs that may make that traditional distinction less relevant.

Yeah, it's a really great question. I think there are different approaches you can take when assembling a technology stack. One is the best-of-breed, highly diversified chaining together of many different types of tools, and that's actually not the approach we've taken in the Spark ecosystem. The other approach is: can you have one platform, one runtime, that gives you a large amount of capability but is still relatively integrated, with coherent and well-specified APIs that work well together?

In Spark, we've actually found that taking the unified approach has had a lot of benefits. One of them is that, from what I've seen from our customers and our users, there are common themes in what a particular big data user wants to see in the platform they're investing in. They want to see security that's coherent and holistic in the way they're thinking about it. They want to be able to train their employees to understand the broad details of the platform. And they also want to exchange and move data quickly, with high performance, between different types of problems. They might have a data ingest or ETL problem, where they're just collecting data from 100 sensors and trying to clean it up and put it in the right format. They might have a querying problem, where some analysts want to run SQL queries on that data. And then they have a data modeling and machine learning team that's trying to do some predictive analytics on that data. When we go into companies, we see all of these different problems, and the approach we took with Spark was to build a unified, coherent engine that can solve all of those problems for the customer.

I think we've also found that the things you might traditionally give up in that model, maybe some performance, maybe a little bit of expressivity, we haven't really given up, because of the way we've designed the APIs in Spark. It's performance-competitive with many of the specialized tools out there.

So other than owing to the collective IQ of the development team, what are some of the things you did to do that?

Yeah, I wish I could say it's just that we're smarter, but that's not really it. The benefit of having a shared engine is that you get a lot of engineering investment in these key primitives. Take the things I mentioned before, Tungsten, for instance. One part of Tungsten is asking, how can we better use memory inside of a Spark application? It turns out that streaming applications, query processing, and machine learning are all very, very memory intensive. So if we can optimize that core primitive, we can have 10 or 15 really bright engineers think about how to make that one part work really well, and it benefits all of these use cases broadly.
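To make the unified-engine point concrete, here is a minimal sketch of the kind of workflow described above, with ingest/ETL, SQL, and machine learning all running against the same engine, so an engine-level optimization like Tungsten's memory management benefits every step. The file path, column names, and label column are hypothetical, not from the interview.

```scala
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("unified-pipeline").getOrCreate()

// ETL: ingest raw sensor readings (hypothetical location and schema) and clean them up
val readings = spark.read.json("s3a://example-bucket/sensors/raw/")
  .filter("value IS NOT NULL AND temperature IS NOT NULL")
readings.createOrReplaceTempView("readings")

// SQL: analysts query the same cleaned data on the same engine
val perSensor = spark.sql(
  "SELECT sensorId, avg(value) AS avgValue FROM readings GROUP BY sensorId")

// ML: the modeling team trains a predictive model on the same data, same engine
val assembled = new VectorAssembler()
  .setInputCols(Array("value", "temperature"))   // hypothetical feature columns
  .setOutputCol("features")
  .transform(readings)
val model = new LogisticRegression()
  .setLabelCol("failed")                         // hypothetical label column
  .fit(assembled)
```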
A good, older-school example of this type of thing is the Java Virtual Machine. The Java Virtual Machine was trying to solve the problem, back in the day, of building one application platform that was not quite as specialized as something written directly against specific hardware, but that presented really nice APIs for users, nice ways of debugging, and a way to make applications portable across many different environments. Today, the Java Virtual Machine has had so much investment poured into that low-level engine that in many cases it actually outperforms applications written specifically against particular hardware, because it can do runtime profiling and all kinds of other things. You have hundreds of engineers spending 10 or 15 years trying to make that thing really, really performant. So the idea that you give up something in order to use a more generic platform actually isn't true, because that kind of ecosystem effect around a platform is very, very powerful. Spark is a newer example of the same phenomenon. We have so much investment in these core parts of Spark that replacing or competing with that level of optimization would be very, very hard for a specialized system.

Okay. In the previous segment we talked about some of the trade-offs you made to be competitive with MPP SQL engines, or to be competitive in a complementary way, not doing exactly what they do the same way. Another topical feature area is stream processing, and there are some engines that say, this is what we focus on and we are really performant; we can handle per-event processing, which might be relevant at the edge. What are some of the trade-offs you make so that Spark can be competitive with them?

Yeah, it's a great question. A lot of the optimizations in Spark 2.0 are related to reducing the latency around stream processing. The space we target is latencies in the hundreds of milliseconds and higher. If someone's doing high-frequency trading or something like that, that's often something where they're using custom hardware; it's a very, very specialized application. What we target is what we think are the vast majority of stream processing jobs, where they're trying to do some reaction in the sub-second range, and that includes even credit card fraud detection. Is it fraud or is it not? The window they have for that swipe of a credit card is about three seconds, so that's something that falls within the space we target. That's been a big focus of ours in Spark 2.0.

Do you see a change in architecture where more analytics is being done at the edge, and there are therefore different demands on latency, memory footprint, and processor?

Yeah, I think there are a few architectural changes that are interesting. Relating to what you said, we do see a proliferation of this kind of sensor-device environment, where you might have hundreds or thousands of small, very low-powered devices that are collecting data and aggregating it. Spark doesn't usually push out quite that far, so we don't have people running Spark on IoT chips in random devices, but we get pretty close to the front: usually that data can either be ingested directly into Spark or go through a message broker kind of system and then right into Spark.
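A minimal sketch of that ingest path in structured streaming, reading sensor events from a message broker (Kafka here) and processing them in micro-batches a few hundred milliseconds apart, the latency band described above. The broker address and topic name are hypothetical, and the Kafka source assumes the spark-sql-kafka package is on the classpath.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.streaming.Trigger

val spark = SparkSession.builder().appName("sensor-ingest").getOrCreate()

// Events arrive from devices through a message broker and go right into Spark
val events = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker-1:9092")   // hypothetical broker
  .option("subscribe", "sensor-events")                  // hypothetical topic
  .load()
  .selectExpr("CAST(value AS STRING) AS payload", "timestamp")

// Micro-batches fired every few hundred milliseconds: sub-second reactions,
// not per-event processing
val query = events.writeStream
  .format("console")
  .trigger(Trigger.ProcessingTime("500 milliseconds"))
  .start()
```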
The main thing I've seen is people trying to reduce that latency as much as possible, and it's very challenging, because even if you can get the average case to be really good, say on average it takes less than 20 seconds to get data from your end devices over a network, through some message broker, to Spark, in practice it's the worst-case latency that's the problem. If you have a thousand devices, some of them are going to be slow, some of them might be turned off, some of them might not have cell service for a little while. So the problem we see around this type of application is largely about how you reason about giving results when you don't even know whether the data is complete, because you still have data coming in.

And structured streaming has that event time.

Yes.

It's just that then the latency of the analytics is going to be really slow, or it's going to be continually updated.

Yeah, you're getting it exactly right, so you have two options in this world. One is that you give results right away, but they're maybe less accurate, because they may get incrementally updated later.

Best-effort results.

Best effort, yeah. The other is that you wait a certain amount of time, so the results take longer to deliver, but they're more accurate. But because of this type of architecture, it's rarely the case that one millisecond versus 500 milliseconds really matters, because the data ingest time tends to be at least in the seconds, and in many cases actually in multiple minutes. So that's the design space we were targeting.

Are you at all targeting moving Spark down to the sort of gateway devices that are near the edge? Not the hardwired SCADA board that's in the wind turbine or the gas generator or whatever, but, say, an x86 machine that's on the factory floor, on the network.

Yeah, that's a great question. That's a little sci-fi for us right now. We tend to assume that the data has been somewhat aggregated already by the time it gets to Spark. I've seen specific applications where they have some earlier aggregation tier doing that logic, and there are other tools for that; people use Flume and Kafka and some of these other tools. So right now that shipping of bits and aggregation is not in scope for Spark, but in a future world I could see it happening. It's probably not on the very short-term roadmap, though.
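Going back to the event-time point for a moment: the choice between best-effort results that get updated later and waiting for more complete results maps onto watermarks and output modes in structured streaming. A minimal sketch, assuming a streaming DataFrame named events with hypothetical deviceId and eventTime columns (for example, parsed out of the payload in the earlier ingest sketch):

```scala
import org.apache.spark.sql.functions.{col, window}

// The watermark bounds how long we wait for stragglers: slow devices,
// powered-off devices, devices without cell service.
val counts = events
  .withWatermark("eventTime", "10 minutes")
  .groupBy(window(col("eventTime"), "1 minute"), col("deviceId"))
  .count()

// Option 1: best-effort results right away, incrementally corrected as late data arrives
counts.writeStream.outputMode("update").format("console").start()

// Option 2: wait for the watermark to pass, then emit one final, more complete result per window
counts.writeStream.outputMode("append").format("console").start()
```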
Okay. I want to come back to some of the popular applications that we talked about.

Sure.

Recommendations. Now that we have continuous processing, how might those change? And compared with a system that relies on batch processing, how much richer would a continuous processing system be?

Yeah, this is also something you should talk with Joseph about, because it's a very machine-learning kind of question. But I think the biggest benefit of the continuous processing world, as I've said before, is being able to enrich the model with new data as it's coming in, and that's something that will be coming over the Spark 2.x line. To give an example, recommendations are all about understanding every dimension of the user, learning as much information as you can in order to tell them what to do next. But often the data you're learning about the user is arriving right as you're trying to make the recommendation. So for instance, I'm browsing products on Amazon. They would love to have that latency be very low, so they could actually use the products I've browsed in the last 10 or 20 seconds to give me a recommendation the next time I load a page and say, you were just reviewing these products, and based on those products, your entire history, what we know about you, and what your potential buying patterns are, here's what we're recommending to you. If that's a batch workload, they need to wait 24 hours for the most recent data to influence the model. You might have noticed that sometimes you're shopping online for something, then you log in a day later and see an ad for the thing you were shopping for, but now it's 24 hours later, you've kind of forgotten about it, and you don't really care about it much anymore. If that latency can get down to the order of seconds or minutes, it's part of the user session experience, and that can significantly improve the response rate for those recommendations.

Okay, let me switch tacks for a moment. We've talked about Databricks' implementation of Spark as an end-to-end, coherent, integrated experience. But suppose you want to add some services: you want persistence, not just a file system but something like a database, and you want ingest. So you've got Kafka, let's say you've got Cassandra, and I've read that many people would add Akka; then you need ZooKeeper, and you need three of each. So already you're at 12 servers. Even though Spark, and the Databricks version of it, is more comprehensive than older architectures, it still has a lot of moving parts. How might we expect Databricks and Spark to evolve over time to have fewer moving parts?

Yeah, it's a great question. The one thing we don't really do is storage. That's a big decision that both Spark and now Databricks made early on, and you're right that there are moving parts, but from what we've heard, the ability to compose multiple data sources is one of the most powerful and well-liked aspects of Spark. The old data warehousing model is that you have a golden set of servers, it's very hard to get data in there, and that's where everything sits; it's kind of the cathedral. What we see a lot more, to your point, is that they might have some data in Cassandra, they might have some data in Redshift, and they have some streaming data coming through Kafka, and that separation of concerns actually enables them to be very agile in the way they're building applications. So I don't necessarily think it would be a good thing if Spark went into the storage business, for instance, and that's not really our plan in the short term. I think the right move is to have a very powerful, general query engine that integrates the user interface and the whole experience of solving the analytics problem, plus one or two storage systems that are popular, robust, and work really well. I think that's about as simple as it can possibly get unless you want to sacrifice a lot of agility.

And when you talk about one or two storage systems, are you saying those might be native options, or just that you'd do deep integration with them through the data source APIs?

Yeah, it's more likely that we'll do deep integration with them.
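As a sketch of what that composition looks like in practice, here is one query spanning S3, Cassandra, and Redshift through Spark's data source APIs. The bucket, keyspace, table names, connection string, and join keys are hypothetical, and the Cassandra and Redshift reads assume the spark-cassandra-connector and a Redshift JDBC driver are on the classpath.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("multi-source").getOrCreate()

// Historical purchase data sitting in S3 as Parquet
val purchases = spark.read.parquet("s3a://example-bucket/purchases/")

// Operational profiles in Cassandra, read through the data source API
val profiles = spark.read
  .format("org.apache.spark.sql.cassandra")
  .options(Map("keyspace" -> "crm", "table" -> "profiles"))
  .load()

// A warehouse table in Redshift, read over JDBC
val orders = spark.read
  .format("jdbc")
  .option("url", "jdbc:redshift://example-cluster:5439/warehouse")
  .option("dbtable", "orders")
  .load()

// One engine, one query plan across all three sources
val enriched = purchases
  .join(profiles, "userId")
  .join(orders, "orderId")
enriched.createOrReplaceTempView("enriched")
```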
The beautiful thing about the way people consume software in the cloud is that they largely consume services. You mentioned you might have one or two servers running for this or that, but very few people are running their own servers, and that's becoming less and less common over time. I'll give you an example. By far the number one storage system used with Databricks is Amazon's S3. Think about how disruptive that's been in the market. It has nine nines of reliability; it's the most reliable storage system ever built. They've basically never lost a byte of data on that system, and it's extremely cheap. So good luck competing with that in the marketplace. And the way people consume S3 is as a service. They don't think about which servers they're spinning up or down. They just have some APIs, give me this range of a file, list the files, these very basic primitives, and that works super well. It's a very, very successful project for Amazon. So the way we see the software world going in the future is consumers thinking about two or three main services, some storage, and hopefully Databricks in there with Spark for all of the analytics, and putting them together in a way that enables them to very quickly build new applications.

Okay, really quick. Last word. Do you see notebooks playing a role in democratizing access to all the analytics?

Yeah, it's hard to answer that quickly, but I'll do my best. From what we've seen, and our product is completely notebook-based; by the way, if you walk around you'll see some on the walls, on some TV screens around here, it has definitely really changed the way that folks interact with data. What I like to see is that you have tens of thousands of students and other folks coming out of school who really understand this abstraction. I think it has a chance to supplant many of the traditional interfaces you see, one being a SQL shell, a very basic shell, and another being a BI tool. It's somewhere in between those two.

A computational document.

Exactly, yeah. So I think it has a chance to be really big, and that's why we chose those abstractions as our key interface in the product.

Okay, this is George Gilbert. We're on the ground at Databricks. We've just been talking with Patrick Wendell, VP of Engineering, and we will be back with some more segments.