Live from San Francisco, it's theCUBE, covering Spark Summit 2017, brought to you by Databricks. Welcome back to theCUBE. We're talking about data science and engineering at scale, and we're having a great time, aren't we, George? We are. Well, we have another guest now we're going to talk to. I'm very pleased to introduce Matt Hunt, who's a technologist at Bloomberg. Matt, thanks for joining us. My pleasure. All right, we're going to talk about a lot of exciting stuff today, but I want to first start with, you're a long-time member of the Spark community, right? How many Spark summits have you been to? Almost all of them, actually. Really? It's amazing to see the 10th one, yes. All right, and you're pretty actively involved with the user group on the East Coast? Yeah, I run the New York users group. All right, well, what's that all about? Well, you know, we have some 2,000 people in New York who are interested in finding out what goes on, which technologies to use, and what people are working on. All right, so hopefully you saw the keynote this morning with Matei. Yes. All right, any comments or reactions to the things that he talked about as priorities? Well, you know, I've always loved the keynotes at the Spark summits because they announce something that you don't already know is coming in advance, at least for most people. The second Spark summit actually had people gasping in the audience at what they were demoing, with a lot of senior people there. Well, the one millisecond today was kind of a wow. Exactly, and I would say the one thing to pick out of the keynote that really stood out for me was the changes and improvements they've made for streaming, including potentially being able to do sub-millisecond times for some workloads. Well, maybe talk to us about some of the apps that you're building at Bloomberg, and then I want you to join in, George, and drill down into some of the details.
Sure, you know, I mean, Bloomberg is a large company with 4,000-plus developers. We've been working on apps for 30 years, so we actually have a wide range of applications, almost all of which are for news and the financial industry. We have a lot of homegrown technology that we've had to adapt over time, starting from when we built our own hardware. But there are some significant things that some of these technologies can potentially really help simplify over time. Some recent ones, I guess: trade anomaly detection would be one. How can you look for patterns of insider trading? How can you look for bad trades or attempts to spoof? There's a huge volume of trade data that comes in. That's a natural application. Another one would be regulatory. There's a regulatory regime called MiFID, or MiFID II, a regulation required in Europe. You have to be able to record every trade for seven years and provide daily reports. There's clearly a lot around that. And then I would also just say our other internal databases have significant analytics that can be done, which is just kind of scratching the surface. These applications sound like they're oriented towards streaming solutions and really low latency. Has that been a constraint on what you can build so far? I would definitely say that we have some things that are latency constrained. It tends to be not like high-frequency trading, where you care about microseconds, but milliseconds are important. How long does it take to get an answer? But I would say equally important with latency is efficiency. And those two often wind up being coupled together, though not always. And so when you say coupled, is it because it's a trade-off or because you need both? Right, so it's a little bit of both. For a number of things, there's an upper threshold for the latency that we can accept. Certain architectural changes imply higher latencies, but often greater efficiencies.
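The trade-anomaly use case described above can be made concrete with a toy example. This is a minimal pure-Python sketch, not Bloomberg's actual system; the window size, warm-up length, and z-score threshold are all invented for illustration. It flags trades whose size deviates sharply from a rolling baseline:

```python
from collections import deque
import math

class RollingAnomalyDetector:
    """Toy detector: flag trades far outside a rolling size baseline."""

    def __init__(self, window=100, z_threshold=4.0):
        self.window = deque(maxlen=window)  # recent trade sizes
        self.z_threshold = z_threshold      # assumed cutoff, illustrative

    def observe(self, trade_size):
        """Return True if this trade looks anomalous versus recent history."""
        anomalous = False
        if len(self.window) >= 30:  # require a minimal baseline first
            mean = sum(self.window) / len(self.window)
            var = sum((x - mean) ** 2 for x in self.window) / len(self.window)
            std = math.sqrt(var) or 1e-9  # guard against zero variance
            anomalous = abs(trade_size - mean) / std > self.z_threshold
        self.window.append(trade_size)
        return anomalous

detector = RollingAnomalyDetector()
normal = [100 + (i % 7) for i in range(60)]       # unremarkable flow
flags = [detector.observe(s) for s in normal]
print(any(flags))                # False: baseline traffic passes quietly
print(detector.observe(50_000))  # True: a wildly outsized trade is flagged
```

In a real pipeline the same per-trade check would run over a stream rather than a list, but the shape of the computation, a rolling statistic plus a threshold, is the same.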
Micro-batching often means that you can simplify and get greater throughput, but at a cost of higher latency. On the other hand, if you have a really large volume of things coming in and your method of processing them isn't efficient enough, it gets too slow simply from that. And that's why it's not just one or the other. So in getting down to one millisecond or below, can they expose knobs where you can choose the trade-offs between efficiency and latency? And is that relevant for the apps that you're building? I mean, clearly if you can choose between micro-batching and not micro-batching, that's a knob that you can have. So that's one explicit one. But part of what's useful is, often when you sit down to try and determine what the main cause of latency is, you have to look at the full profile of the stack it's going through, and then you discover other inefficiencies that could be ironed out. And so it just makes it faster overall. I would say a lot of what the Databricks guys and the Spark community have worked on over the years is connected to that, Project Tungsten and so on. There are all these things that make things much slower and much less efficient than they need to be, and we can close that gap a lot. I would say that's been true from the very beginning. This brings up something that we were talking about earlier, which is, Matei has talked for a long time about wanting to take end-to-end control of continuous apps for simplicity and performance, so that they will write with transactional consistency, assuring the customer exactly-once semantics when they write to a file system or a database or something like that. But Spark has never really done native storage, whereas Matei came here on the show earlier today and said, well, Databricks as a company is going to have to do something in that area.
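The micro-batching trade-off described above, throughput versus latency, can be sketched with a toy cost model. This is a hedged illustration in plain Python, not Spark's actual scheduler; the fixed-overhead and per-event costs are invented numbers chosen only to show the shape of the trade-off:

```python
def batch_cost_model(events, batch_size, fixed_overhead_ms=5.0, per_event_ms=0.01):
    """Return (avg_latency_ms, total_time_ms) under assumed, illustrative costs."""
    batches = -(-events // batch_size)  # ceiling division
    # Total work: every batch pays a fixed overhead, every event a small cost.
    total_time = batches * fixed_overhead_ms + events * per_event_ms
    # A record waits, on average, half a batch to be picked up, then pays
    # the batch's fixed overhead plus the batch's processing time.
    avg_latency = (batch_size * per_event_ms) / 2 \
                  + fixed_overhead_ms + batch_size * per_event_ms
    return avg_latency, total_time

# Per-record processing: lowest latency, but the fixed overhead is paid
# on every single event, so total work balloons.
lat_single, total_single = batch_cost_model(100_000, batch_size=1)
# Micro-batches of 1,000: higher latency, far less total work.
lat_batch, total_batch = batch_cost_model(100_000, batch_size=1_000)

print(f"batch=1     latency={lat_single:.2f} ms, total={total_single:.0f} ms")
print(f"batch=1000  latency={lat_batch:.2f} ms, total={total_batch:.0f} ms")
```

This is also why the point about efficiency matters: shrinking the fixed per-invocation overhead (which is roughly what work like Project Tungsten targets) moves the whole curve, letting you run smaller batches, and hence lower latency, at the same total cost.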
And he talks specifically about databases, and he implied that Apache Spark, separate from Databricks, would also have to do more in state management. I don't know if he was saying key-value store, but how would that open up a broader class of apps? How would it make your life simpler as a developer? Right, interesting and great question. This is kind of a subject that's near and dear to my own heart, I would say. So part of that, to take a step back, is about some of the potential promise of what Spark could be, or what they've always wanted it to be, which is a form of a universal computation engine. So there's a lot of value if you can learn one small skill set, but it can work in a wide variety of use cases, whether it's streaming or graphs or analytics, and plug other things in. As always, there's a gap in any such system between theory and reality, and how much can you close that gap. But as for storage systems, I mean, this is something that you and I have talked about before, and I've written about it a fair amount too. Spark is historically an analytics system, right? So you have a bunch of data and you can do analytics on it. Where does that data come from? Either it's streaming in or you're reading from files, but most people need essentially an actual database. So what constitutes the universal system? You need a distributed file store. You need a database, generally with transactional semantics, because the other forms are too hard for people to understand. You need analytics that are extensible, and you need a way to stream data in. And then there's how close you can get to that versus how much you have to fit other parts together. Very interesting question. So far they've sort of outsourced that to DIY, do it yourself.
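To make the four-piece checklist above concrete, here is a minimal sketch of the interfaces a "universal system" would have to cover, expressed as Python abstract classes. Every class and method name here is invented purely for illustration; this is not Spark's API or anyone else's:

```python
from abc import ABC, abstractmethod

class DistributedFileStore(ABC):
    """Piece 1: bulk data at rest."""
    @abstractmethod
    def read(self, path: str) -> bytes: ...
    @abstractmethod
    def write(self, path: str, data: bytes) -> None: ...

class TransactionalDatabase(ABC):
    """Piece 2: state with transactional semantics."""
    @abstractmethod
    def begin(self): ...
    @abstractmethod
    def commit(self, txn) -> bool: ...

class AnalyticsEngine(ABC):
    """Piece 3: extensible analytics over the data."""
    @abstractmethod
    def query(self, expression: str): ...

class StreamIngest(ABC):
    """Piece 4: a way to stream data in."""
    @abstractmethod
    def subscribe(self, topic: str): ...

class UniversalSystem:
    """One skill set spanning storage, transactions, analytics, streaming."""
    def __init__(self, files, db, analytics, ingest):
        self.files, self.db = files, db
        self.analytics, self.ingest = analytics, ingest
```

The gap between theory and reality that the conversation mentions is exactly the gap between declaring these four interfaces and having mature, efficient implementations that compose well.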
But if they can find a sufficiently scalable relational database, they can do the sort of analytical queries and they can sort of maintain state with transactions for some amount of the data flowing through. My impression is that something like Cassandra would be the sort of database that would handle all updates, and then some amount of those would be filtered through to sort of a multi-model DBMS. When I say multi-model, I mean one that handles transactions and analytics. Knowing that, you would have the option to drop that out. What applications would you undertake that you couldn't build right now? The theme this year was that we're going to take big data apps into production, and the competition that they show, like for streaming, is Kafka and Flink. So what does that do to that competitive balance? Right, so how many pieces do you need and how well do they fit together is maybe the essence of that question, and people ask that all the time. And one of the limits has been how mature each piece is, how efficient it is, and whether they work together. And if you have to master 5,000 skills and 200 different products, that's a huge impediment to real-world usage. I think we're coalescing around a smaller set of options. Kafka, for example, has a lot of usage, and the industry seems to be settling on it as what people use for inbound streaming data, for ingest. I see that everywhere I go. But what happens when you move from Kafka into Spark, or Spark has to read from a database? This is partly a question of maturity. Relational databases are very hard to get right. The ones that we have have been under development for decades, right? I mean, DB2 has been around for a really long time with very, very smart people working on it, or Oracle, or lots of other databases. So at Bloomberg, we actually developed our own relational database, designed for low latency and very high reliability.
So we actually just open sourced that a few weeks ago. It's called Comdb2, and the reason we had to build it was that the industry solutions at the time when we started working on it were inadequate to our needs. But we look at how long that took to develop, or these other systems, and think that's really hard for someone else to get right. And so if you need a database, which everyone does, how can you make that work better with Spark? And I think there are a number of very interesting developments that can make that a lot better, short of Spark integrating a database directly, although there are interesting possibilities with that too, right? How do you make them work well together? We could talk about that for a while, because that's a fascinating question. On that one topic, maybe the Databricks guys don't want to assume responsibility for the development, because then they're picking a winner, perhaps. Maybe, as Matei told us earlier, they can make the APIs easier for a database vendor to integrate with, but we've seen Splice Machine and SnappyData do the work themselves, taking DataFrames, the core data structure in Spark, and giving them transactional semantics. Does that sound promising? There are multiple avenues for potential success, and who can use which in a way depends on the audience. If you look at things like Cassandra and HBase, they're distributed key-value stores that additional things are being built on. So they started as distributed and they're moving towards more encompassing systems, versus relational databases, which generally started as single-image on a single machine and are moving towards federation and distribution, and there's been a lot of that with Postgres, for example. One of the questions would be, is it just knobs, or why don't they work well together? And there are a number of reasons. One is what can be pushed down.
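The idea raised above of layering transactional semantics over an analytics-oriented data structure can be sketched with a toy optimistic-concurrency store. This is a hedged, pure-Python illustration, not Cassandra, HBase, or any real multi-model DBMS; the class and field names are invented. A transaction records the version of everything it reads, and a commit succeeds only if none of those versions moved underneath it:

```python
class OptimisticStore:
    """Toy key-value store with optimistic transactions."""

    def __init__(self):
        self._data = {}      # key -> current value
        self._version = {}   # key -> integer version counter

    def begin(self):
        return {"reads": {}, "writes": {}}

    def read(self, txn, key):
        # Remember which version this transaction saw.
        txn["reads"][key] = self._version.get(key, 0)
        return self._data.get(key)

    def write(self, txn, key, value):
        txn["writes"][key] = value  # buffered until commit

    def commit(self, txn):
        # Validate: fail if any key we read has changed since we read it.
        for key, seen in txn["reads"].items():
            if self._version.get(key, 0) != seen:
                return False
        for key, value in txn["writes"].items():
            self._data[key] = value
            self._version[key] = self._version.get(key, 0) + 1
        return True

store = OptimisticStore()
t1 = store.begin()
store.read(t1, "position")        # t1 reads version 0
t2 = store.begin()
store.write(t2, "position", 100)
assert store.commit(t2)           # t2 commits first, bumping the version
store.write(t1, "position", 200)
print(store.commit(t1))           # False: t1's read is now stale
```

Real systems add durability, locking or multi-version storage, and distribution, but the validate-then-apply shape of the commit is the essential contract being grafted onto the analytics side.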
How much knowledge do you have to have to make that decision? Optimizing that, I think, is actually one of the really interesting things that could be done: just as we have database query optimizers, can you determine the best way to execute down a chain? In order to do that well, there are two things you need that haven't yet been widely adopted but are coming. One is very efficient copying of data between systems, and Apache Arrow, for example, is very, very interesting. It's nearing the time when I think it's just going to explode, because it lets you connect these systems radically more efficiently in a standardized way, and that's one of the things that was missing. As soon as you hop from one system to another, all of a sudden you have this immense computational expense. That's a problem we can fix. The other is that the next level of integration requires basically exposing more hooks. In order to know where a query should be executed and which operators should be pushed down, you need something that I think of as a meta-optimizer, and also knowledge about the shape of the data, or the underlying statistics, and ways to exchange that back and forth to be able to do it well. Wow, Matt, a lot of great questions there. We're coming up on a break, so we have to wrap things up, and I wanted to give you at least 30 seconds to maybe sum up what you'd like to see your user community, the Spark community, do over the next year. What are the top issues or things you'd love to see worked on? Right, it's an exciting time for Spark, because as time goes by, it gets more and more mature, and more real-world applications are viable.
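The meta-optimizer idea just described, using statistics to decide where a piece of a query should execute, lends itself to a small sketch. This is a toy cost model in plain Python, not any real optimizer; the per-row costs and selectivity figures are invented solely to show how statistics drive the push-down decision:

```python
def choose_plan(row_count, selectivity, transfer_cost_per_row=1.0,
                db_filter_cost_per_row=0.2, engine_filter_cost_per_row=0.1):
    """Pick the cheaper of two plans given rough table statistics."""
    # Plan A: push the predicate down. The database scans everything,
    # but only the matching fraction of rows crosses the wire.
    pushdown = (row_count * db_filter_cost_per_row
                + row_count * selectivity * transfer_cost_per_row)
    # Plan B: ship every row to the engine and filter there.
    pull = (row_count * transfer_cost_per_row
            + row_count * engine_filter_cost_per_row)
    return ("pushdown", pushdown) if pushdown <= pull else ("pull", pull)

# A highly selective predicate strongly favors pushing down,
# because almost nothing has to cross between the systems.
print(choose_plan(1_000_000, selectivity=0.01))
# A predicate that keeps nearly every row does not.
print(choose_plan(1_000_000, selectivity=0.99))
```

Note how both inputs to the decision, the row count and the selectivity, are exactly the "shape of the data" statistics the conversation says the systems need a way to exchange; and the transfer cost term is what something like Arrow is meant to shrink.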
The hardest thing of all, anywhere you go and in any organization, is to get people working together. But the more people work together to enable these pieces, how do I efficiently work with databases, how do we build these better optimizations, how do we make streaming more mature, the more people can use it in practice. And that's why people develop software: to actually tackle these real-world problems. So I would love to see more of that. Can we all get along? Can't we all? Well, that's going to be the last word of this segment. Matt, thank you so much for coming and spending some time with us here to share the story. My pleasure. Thank you. Thank you so much. Thank you, George. And thank you all for watching this segment of theCUBE. Please stay with us at Spark Summit 2017. We'll be back in a few moments.