Live from New York, extracting the signal from the noise. It's theCUBE, covering Spark Summit East, brought to you by Spark Summit. Now your hosts, Jeff Frick and George Gilbert.

Welcome back everybody, Jeff Frick here with theCUBE. We are live in Midtown Manhattan at the Spark Summit East conference. I don't know if you've gotten a number, George. How many people are here?

Somewhere between 1,500 and 2,000 people.

All talking about Spark. It's early days for Spark. It's the newest, most exciting thing happening in big data, so of course we had to come out and get the signal for you, bring it home to you if you weren't able to make it to Manhattan, although it was a beautiful day today. It was snowing when we arrived, and beautiful and sunny today. But we're really excited to be joined by our next guest, Matthew Hunt. And I want to make sure I get the title right: technical fellow at Bloomberg. Matthew, good to see you again.

Thank you, nice to see you too, Jeff.

So first off, I want to congratulate you. The night before the conference, they had a meetup. We were able to go stick our noses in. Unbelievable attendance, enthusiasm, and just a really rapt audience. So you kicked the thing off. I think you're kind of at the epicenter of all things Spark here in Manhattan. So first off, congratulations, and give us kind of an update on what the community is doing around Spark, and why is there this enthusiasm and this energy?

Sure. There's definitely a lot of enthusiasm, as you could see. There were close to 600 people there, and every meetup we've held has actually been like that. Part of it is that there are problems people are trying to solve for which these technologies clearly offer better solutions, even though they're not all mature yet. But there's a lot going on that's fundamental, not just for computer science and business, but for a lot of applications like cancer research, statistics, and weather prediction. This is really at the forefront of a lot of fields, including machine learning.

And you go to a lot of conferences. We've talked to you at HBaseCon, we've talked to you at Hadoop Summit, now we're talking to you at Spark Summit. There's a lot of confusion as to where these things all fit, how they work together, where the overlap is. I wonder if you can give your summary of where Spark fits, how it builds on the technologies that have come before, and why you think some of this confusion is out there. Especially, I guess, between Hadoop and Spark, which is where I see the most confusion out in the wild.

Right, great question. So part of it is that when there are new things and they're less well understood, that creates confusion in its own right. Maybe the best way to understand them is to understand a little bit about their origins, which tells you a little about how they fit together, and then also to talk about where they're going. Some of the confusion is about what they do today relative to what people actually want and their own mental models.

So Hadoop and its pieces arose from a very fundamental problem, really an economics problem: I'd like to download and index the web, and do that economically. And so you need a lot of cheap computers to be able to solve that problem together. If you have a lot of cheap computers, how do you actually make them work, manage failures, and make it really reliable? And that's where the system came from.
Not some sort of magical ivory-tower thing, but engineers rolling up their sleeves at places like Google in particular to solve a basic engineering challenge. Other people have elements of that problem too. And other pieces have been layered on top of Hadoop. There's HBase, which is why we've talked at HBaseCon, for being able to retrieve things faster, and that's also based on a Google system called Bigtable. Bigtable and MapReduce are really designed for very reliable, large batch processes. But if you want to do it faster and do things a little more interactively, what can you do? And that's really where Spark started to come in.

People are excited about Spark because it's easier to program for. It's a lot faster for a certain set of use cases, and it covers more of them. So instead of learning 50 different tools, you've got one, right? You can use Spark for streaming, for batch kinds of computations, and for interactive work. And by the way, it's got machine learning libraries. Whereas inside of sort of classic Hadoop, there's a separate product for each one of those, which I used to call a noun explosion. Somebody had a much cleverer phrase, which I'm now forgetting, but it's the same basic idea.

The reason Spark is faster is really that it has a slightly more complicated instruction set. MapReduce basically has four principal instructions and was designed to work on machines with a single core and mechanical hard drives. So it reads in from disk and writes back out with each pass, using very low-level instructions, which takes a lot of time. Well, that's also why it's really reliable: do something small, write it back; do something small, write it back. But machines have moved on. Memory's a lot cheaper, so you can read the data in, do a whole set of things in a row, and it's a lot more efficient. That's a huge part of where Spark's performance comes from.

So that's part of why people are excited about Spark. It does subsume a number of elements of the classic Hadoop ecosystem, but they still work well in conjunction with one another. Spark is a way to do streaming, a way to do batch computations, a way to build these libraries, but there are other things you need also. You still need a database to store things in the first place. You still need a file system to store things in. So that's another part of the confusion: if you say, well, here's this fast computation engine, people have assumptions about what will be there automatically and what they want to do. Oh, I have streaming data coming in, of course I want to write it to a database and then do some processing. That's actually not the way Spark works out of the box. And that's part of the confusion too. There's a mental model shift, and there are pieces that haven't fully come together to make that seamless yet. So that's a long answer.
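To make the chained-computation point from that answer concrete, here is a minimal PySpark sketch, assuming a local Spark installation; the input file and log format are purely illustrative. The transformations are fused into a single in-memory pass over the data instead of being written back to disk between steps, which is where much of the speedup over classic MapReduce comes from.

```python
# A minimal sketch of chained in-memory computation; the input file
# and log format are hypothetical, not any specific production system.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("chained-steps").getOrCreate()

lines = spark.sparkContext.textFile("access.log")  # illustrative input

# Each step below is a transformation; Spark fuses them into one pass
# over the data rather than materializing intermediate results to disk
# the way MapReduce does between jobs.
errors = (lines.filter(lambda l: "ERROR" in l)
               .map(lambda l: (l.split()[0], 1))
               .reduceByKey(lambda a, b: a + b))

print(errors.take(10))  # only this action triggers the actual work
spark.stop()
```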
That was a great answer. That really was nice and high level. But let's take that a step further. You're saying people expect processing to happen, and processing assumes you're going to store data either before you process it or store the results afterward. But Spark SQL doesn't have native persistence or a database layer. So tell us how Spark SQL was designed so that it can operate without having its own bottom layer, and how they rethought the problem to do that.

Good question. Well, first of all, I have to say I have tremendous respect for Michael Armbrust, who was really the founder of a lot of the Spark SQL effort. Usually when you think of SQL, it does a couple of things. You can write things to a database, and you can also read from it. Spark SQL has a couple of secrets. As Michael will say himself, one of the secrets is that it's not about SQL at all. The other is that, at a high level, it's about being able to write high-level expressions, things like SQL, to do the read. We have decades and decades of research and experience with relational databases, and reasons why something like SQL is useful for retrieval, in addition to where it's useful for storage, like the ACID properties of durability and transactions and so on. Spark SQL lets you express the SQL, which then gets turned into Spark instructions underneath the hood, so you don't have to know the primitive instructions the way you would with MapReduce. That's useful.

One part of database technology that's very well studied is optimization techniques. Given a SQL command that says, join three tables with a where clause, there are many different ways you could execute it. You could try to pull stuff from one table and then look things up in another table, but which way should you do it? It really depends on what data is in which table, how big the tables are, and how good your indexes are. If you take the wrong path, it's many, many orders of magnitude more expensive to execute, and this is what optimization technology is all about. So Spark SQL essentially allows just-in-time optimization techniques to be applied to the problem as well, including how it's connected to the storage layer, I think, although Michael and the other folks would be the real experts on that.

But the other thing is, we've always used things like SQL to express and optimize the retrieval of data and some basic computations, like, I'd like to aggregate and do a count or a sum. There are more complicated numerical computations, which is why people use R or Mathematica or something else. Can we actually apply the same kinds of optimization techniques when you express something that way? That is the real secret of Spark SQL: take a data frame, from NumPy or Pandas or the equivalent, for a complicated mathematical calculation, and figure out how to run it efficiently based on where the data is, without you having to know how to do all of that. That's a pretty fundamental thing.
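A small sketch of that idea, assuming an existing SparkSession named `spark`; the dataset, view name, and columns are illustrative. The same query, written either as SQL or as DataFrame calls, compiles down to essentially the same optimized plan underneath:

```python
# Assumes the `spark` session from the earlier sketch; the parquet
# file, view name, and columns are hypothetical.
from pyspark.sql import functions as F

df = spark.read.parquet("people.parquet")
df.createOrReplaceTempView("people")

via_sql = spark.sql(
    "SELECT last_name, COUNT(*) AS n FROM people GROUP BY last_name")
via_api = df.groupBy("last_name").agg(F.count("*").alias("n"))

# Both print essentially the same physical plan: the high-level
# expression has been compiled into Spark's lower-level instructions,
# much as a compiler turns source code into machine instructions.
via_sql.explain()
via_api.explain()
```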
We asked Matei about that, to the extent I understood it, which was: SQL is fairly well understood and well bounded. As you said, we've applied decades of research to optimizing it. But if you take a general-purpose language, and you've got more functions and user-defined types, wouldn't it be a good deal harder to optimize that bigger set? And he said, well, we try to map it to the SQL-like primitives that we know how to optimize.

That's right. I think the correct mental model for that is to think of what happens when you compile a program, right? You write some code, you hit a button, and the machine turns it into machine-level instructions. If I have a different computer language, it doesn't mean the chip suddenly has 10,000 more instructions. It still has the same set of a few dozen, depending on the chip. Spark has a certain instruction set underneath the hood. So whatever you're writing in, whether it's Spark SQL or DataFrames or anything else, you can think of it as absolutely analogous to that: take what was written in a high-level language, compile it into Spark instructions, and then you can apply optimization techniques to that. That's why he called it, or you heard when he said, it's like a just-in-time compiler.

That's right.

Okay, that's a good analogy, probably better than mine, no surprise. So that's a way of taking language constructs and making them performant. And we were talking earlier about how, even without the storage layer, when we have lots and lots of memory and abundant CPU or graphics processing units, we can figure out what's in the data sort of while we're doing it, the just-in-time part. So I assume that's perhaps a new way of building analytic SQL databases.

I would say it's an old way applied with new techniques. Part of what makes optimizers work is having some knowledge about the structure of the data that you have. For instance, if it's a table in a database that has people's names, you know something about the number of names. The table could have a billion people in it, or every American, but the total number of distinct names will be a lot smaller. So if I use the last name as a key, it's likely to narrow things down a lot. First name plus last name is a much smaller set than the 300 million people you start with. But that's knowing something about the statistical properties of the data you're pulling. Database optimization is really about understanding things like that: gosh, is this likely to narrow down the amount of data I have to pull a whole lot, before I do the join? So you need to actually have those kinds of statistics. How do you apply that to a distributed system, and how quickly can you calculate them? That may in fact be one of the very interesting properties of these systems. But in some ways I would think of it as an old technique applied with a new and thoughtful veneer: instead of just for SQL queries, can we use it for mathematical expressions and a generic breadth of things?
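A minimal illustration of that narrowing, continuing the hypothetical `df` from the earlier sketch; the column and value are made up. A selective predicate gets pushed down into the scan so that most of the data is never read, the kind of decision a statistics-aware optimizer makes automatically:

```python
# Continues the hypothetical `df` from the earlier sketch; the column
# name and filter value are illustrative.
narrowed = df.filter(df.last_name == "Hunt").select("first_name", "last_name")

# The physical plan shows the predicate pushed down into the Parquet
# scan (look for PushedFilters), so only matching data is read.
# Later Spark versions can also collect table statistics (e.g.
# ANALYZE TABLE ... COMPUTE STATISTICS) for cost-based choices such
# as join ordering.
narrowed.explain()
```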
It's pretty cool. Okay, let's switch gears to streaming, which is now everyone's hot new topic, even though we don't have a huge number of apps yet, but everyone expects to see huge growth in this type of workload. There seems to be a big debate between, do we extend our existing programming models, or do we need a new one? What are your thoughts on that?

It's hard to pick which one to start with. It's a buffet. So the first thing is, streaming has been around for a really long time. It's all around us, right? Stocks ticking is streaming. Twitter is streaming. There are lots of things that are streaming feeds that have been with us all along. A lot of it, just like database techniques and optimizers, is mature, well-understood technology, along with the trade-offs. Streaming, and how you can pass messages in, is also a very well studied area of computer science. So batch versus not batch, it's not that one is better than the other; it's that there's a set of trade-offs that are understood, and you must choose between them. And there are some things within that which are hard. If you ask people, so, can you guarantee once-and-only-once message delivery, in order? You'll probably get a lot of hemming and hawing.

But more broadly, I think part of people's excitement about streaming stems from a perception and education problem on the one hand, and a technology maturity problem on the other. So, you said there were a lot of people at this Spark meetup. At the last one we did, around Strata last year, literally 80% of the questions from the audience were about streaming. Now, I had six or seven people on the panel: Reynold Xin, who is the chief architect of Spark, and a bunch of people of that caliber, including Tathagata Das, who's actually in charge of Spark Streaming. And why is it that everyone has all these questions about streaming? It's not because suddenly, magically, everyone wants to do streaming. It's about what people simply expect: I have a database, new data comes in, it goes in the database, and I want to express calculations on top of that. That's been the case for a long time; we already have technology that does that. I have a relational database, I'm writing stuff to it, and my SQL queries pull the most recent stuff.

In my opinion, a significant chunk of the questions about streaming are really that people assume it's a database. You're reading from the database, so of course you can write to it, and now I've got changes coming in, and what do you mean I can't apply them? That's where the confusion comes from, and in some ways that's a maturity problem. Streaming needs to be better integrated with the fundamental database technology, along with an optimizer for Spark to talk to, and that's part of the Holy Grail, in my assessment.

Okay. That one I'm going to have to go home and take the video and do the replay a couple of times to make sure I grok it all.

Now I'll have to go back and watch it to see whether what I said actually makes any sense at all, so we'll see.
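That table-like mental model is exactly what Spark's Structured Streaming API later formalized: treat the incoming stream as a table that rows keep getting appended to, and express a standing query over it. A minimal sketch, assuming a Spark release that ships Structured Streaming and the `spark` session from earlier; the socket source and port are illustrative, not anyone's production setup:

```python
# A minimal sketch of the stream-as-table model, not the speaker's
# system: new rows are appended to a logical table, and a standing
# query is maintained on top of it.
ticks = (spark.readStream
              .format("socket")
              .option("host", "localhost")
              .option("port", 9999)
              .load())  # a streaming DataFrame with a single `value` column

# A continuously maintained aggregate: a count per distinct line seen.
counts = ticks.groupBy("value").count()

query = (counts.writeStream
               .outputMode("complete")  # re-emit the full aggregate each trigger
               .format("console")
               .start())
query.awaitTermination()
```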
So let's switch, perhaps, to Bloomberg now. Tell us about some of the classes of problems you're trying to attack now that we've gotten past the siloed execution engines in Hadoop, but where you're still using part of the Hadoop ecosystem for management or storage. What new things are possible, and where are you hitting roadblocks?

Yeah. So, rewinding, you have to keep in mind, you know, Bloomberg has tens of thousands of machines. For many years it operated the world's largest private network, and that may still be the case; I don't actually know. We have 4,000 engineers, and hundreds of thousands of people who pay us several thousand dollars a month for our service, where they expect accurate and timely access to information, which really runs the gamut within finance, and of course news also. We have one of the largest news departments in the world. You know, we just hired the editor-in-chief of the Economist to oversee news.

Oh, Micklethwait.

Yeah, Micklethwait, right. So there's a lot of breadth there. We certainly have the streaming problem in terms of ticks coming into what we call the ticker plant for stocks. That's a streaming problem, but we get 60 billion ticks a day, right? That's a lot. That's still a big data problem when you have low-latency requirements, even though you can solve it much more easily now than was once possible. These solutions are still not nearly fast enough or reliable enough for that kind of market data. However, we have a lot of other applications where we can use, you know, commodity hardware and commodity open-source software to subsume more and more of that, for the purposes of simplicity. And we think the direction these technologies are going in will offer increasing breadth for a lot of our needs there. Part of the reason we've been at these prior events is that we've been trying to offer and assist with elements of these projects. We added, essentially, high availability to HBase. There are other things, like context lifetime for Spark. These are things that are required for real production use in an interactive context.

So, unfortunately, we're about out of time. You had a quick one?

A moderately quick one. A friend at a bank I used to work at, when I was an equity research analyst, said one of the things they were thinking about with Bloomberg was almost as a SaaS provider: offloading some of the calculation-intensive applications or functions to Bloomberg. Is that the sort of thing where you would use commodity infrastructure to do the analytics and then feed that out as extensions, through the Bloomberg network and terminal as a distribution channel?

I can neither confirm nor deny it.

Well, if not a short question, we have a very short answer. So with that, unfortunately, Matthew, we are out of time. We look forward to catching up with you again. I do want to make sure, if you've never seen Jeremy Siegel at Wharton play his Bloomberg terminal like a maestro, it's worth the trip down to go sit in on one of his economics classes. It's a phenomenal thing, to catch up on the news of the day. So, last word: are people keeping up? Where do people fit?

You know, people matter to everything, and part of what's happening is, you have to lower the frictional effort to be able to use these things. You have to take away the pain. That's part of the secret of infrastructure. So I think, in some ways, one of the most interesting announcements here was the Community Edition that Databricks has. And I told Ion Stoica that yesterday, because if you can say, here's everything in a bundle, and people can use that with training classes, I think that will make it a lot easier for people to learn. People cannot learn a hundred different technologies when each one is complicated. You have to make it simple. Simplicity is the key to everything.

Awesome. Well, Matthew Hunt, thank you very much for stopping by. I'm Jeff Frick with George Gilbert. You're watching theCUBE. We are live from Midtown Manhattan at Spark Summit East. We'll be back with our next guest after this short break. Thanks for watching.