Okay, we're back here live inside theCUBE, SiliconANGLE's flagship program. We go out to the events, extract the signal from the noise. I'm John Furrier, the founder of SiliconANGLE, and I'm joined by my co-host. I'm Dave Vellante of wikibon.org. We're here with Marcel Kornacker, who's the lead engineer for Cloudera, specifically focused on one of the hottest products — or features, or capabilities — we've heard about: Impala. Welcome. Thank you. So, you must be excited. I am. You've been hush-hush about Impala for a while. I am very excited. I've been developing it for well over a year now, and I've been wanting to talk about it, but today is the day. We've been poking at it, asking all these questions about how you're going to bring these worlds together, and, well, you know, we were told you had your best people working on it, so now we're here. So congratulations. Thank you. Marcel, so Impala is your new announcement. We want to talk to Mike Olson about it, and everyone else, but we don't want to go into the whole, you know, messaging the PR people are telling you to say. Hamer Bakker gave a great talk — he was all lit up like a Christmas tree, all passionate — but let's get into some of the conversations. So here at Hadoop World, you introduced it. Hallway conversations have been very positive. People love it; it's getting a standing ovation, packed house at the session. What is the rationale behind Impala for Cloudera, from a technical perspective? Because I want to get into your background at Google a little bit — the press had a great story about you guys, mentioning the Google engineer angle, that kind of thing — but let's get into some of the practical matters of the product, the platform. Let's do that. So why is it important to Cloudera? Well, SQL in general is sort of the main conduit to data in the industry.
And we just felt — I felt in particular — that the existing Hadoop-native solution, which was Hive, was severely lacking, right? It was very inefficient, with a high degree of latency associated with it. So I thought it was basically high time to develop a native parallel query engine inside Hadoop, and this is basically what this is. This brings parallel database technology to the Hadoop ecosystem. We just had a guest on, Camille — she's a geek, an alpha geek, but she's on the business side. And she was saying the whole show is about how Hadoop is now recognized: everyone has to at least deal with it and use it; it's practical and it's good. The goodness of Hadoop is everywhere. And so it's evolving out into somewhat mainstream. Tomer thinks it won't be mainstream until there's a VMworld-size conference, which is like 30,000 people. But okay, maybe that's mainstream. Okay, that's fine. But let's talk about how we got here. So your background was at Google. You spent a lot of time at Google working on some pretty big stuff — Spanner, you worked a little bit on? I did not work on Spanner. You didn't work on Spanner, but you did a lot of query engine work. I did — I was the architect of the query engine component of the F1 project, which has been published now, so I can actually mention it. Let's talk about that real quick, because that's good context for where we want to go. Sure. It is a combination of an OLTP and an analytical system, and it uses the same query engine technology that is now available through Impala. So the basic technical approach is the same. I don't know to what extent I can talk about the details of Google internal technology. Come on, you're on a roll — I can just see it coming out. Okay, I'll just simplify it: it's some serious, fast, highly geeky code that works well on large systems. So it's large scale, and you're dealing with large-scale stuff.
So we'll just kind of leave it at that — kind of oversimplified, but there's some complexity involved. So Hadoop going real time is important, because there's a need for slicing and dicing of data, and there are efficiency questions. So talk about, one, the efficiencies that you solve with Impala, specifically with real-time query and some of the things you're doing in there. Of course, yeah. So actually, contrasting it again with Hive: Hive uses MapReduce. It breaks a query down into a sequence of MapReduce jobs, and all of these jobs have to materialize their output. So Hive generates a lot of IO in this process. Impala doesn't do that. Impala does not materialize intermediate output. Impala does process-to-process data communication, so you're circumventing a lot of the things that have made Hive very slow. So you're basically reducing the net amount of processing that happens to the minimum required to run a particular query. And you're also doing a full distribution, meaning the query runs on all the nodes that contain data relevant to the query, and they do as much pre-processing as possible — they compute joins, they compute aggregations — and then send only the minimum amount of data on to the other nodes, where it gets combined, et cetera. So Impala is obviously less mature than the SQL engines that we know and love. What about things like columnar storage and compression — where do they fit into your roadmap? They are absolutely complementary, and they are on the roadmap. We have already been developing, with Doug Cutting, a columnar storage format called Trevni. And that is outside of Impala in the sense that it is not tied into Impala — it will probably also be available in Hive — but it's definitely on the roadmap for Impala, and probably even for the GA.
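The Hive-versus-Impala execution contrast Marcel describes can be sketched in a few lines. This is a toy model, not real Hive or Impala code: the "Hive" path materializes every stage's full output (simulated as intermediate IO units), while the "Impala" path chains operators lazily so nothing intermediate is ever written out.

```python
# Toy illustration of the execution difference described above
# (an assumed simplification, not Hive/Impala internals).

def hive_style(rows, stages):
    """Each simulated MapReduce job fully materializes its output
    before the next job can read it, generating intermediate IO."""
    materialized = 0
    data = list(rows)
    for stage in stages:
        data = [stage(r) for r in data]
        materialized += len(data)  # pretend one IO unit per intermediate row
    return data, materialized

def impala_style(rows, stages):
    """Operators stream rows process-to-process; intermediate
    results are never written out."""
    data = iter(rows)
    for stage in stages:
        data = map(stage, data)  # lazy: rows flow through the whole pipeline
    return list(data), 0

stages = [lambda r: r * 2, lambda r: r + 1]
print(hive_style([1, 2, 3], stages))    # ([3, 5, 7], 6)
print(impala_style([1, 2, 3], stages))  # ([3, 5, 7], 0)
```

Both paths compute the same answer; only the simulated intermediate IO differs, which is the point of the contrast.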
So this is definitely something that we need to do in order to achieve the same efficiency that commercial analytical columnar database systems can accomplish. And GA is when? We don't have a date yet — sometime early next year. Q1? We'll say Q1. So let me get this right. The old way, you'd write a SQL query, it would get chopped up by Hive into MapReduce jobs to run on Hadoop, right? That's the old way. And the new way is what? The new way is — so the old way was one or multiple MapReduce jobs. The new way is the query gets turned into a logical query plan that then gets partitioned, and these plan partitions get sent out to the individual nodes that have the data. All of these nodes also run an Impala daemon that does all the processing and interacts with the local data. So it uses the most efficient path to read the local data, and then does the in-memory processing. So is this the concept of pushing things out to where the data is? Yes, exactly, that's the concept. And that's an important concept, because more and more data actually comes out of main memory, right? We're seeing much bigger hardware configurations with much more memory, so a lot of the data will come out of main memory. And so the local processing is going to be much more efficient than taking the data and reading it remotely over the network, which is what some commercial competitors are proposing. So Impala is Hive compatible, yes? Can I say that? To a large extent — there are things in Hive that we don't have yet. Conceptually? Yes, conceptually. And so you're not persisting intermediates to disk, is that correct? No, not at all. Okay, so in that sense, it's similar. And from a practitioner standpoint, it's familiar. Oh yeah, absolutely. I mean, we made an effort to utilize the same metadata that you have in Hive, and also utilize the same ODBC interface.
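Marcel's "send the plan to the data" point — each node pre-aggregates its local slice, and only small partial results cross the network to be combined — can be sketched like this. This is a hypothetical illustration of the general technique, not the Impala daemon's actual logic:

```python
# Sketch of distributed partial aggregation (an assumed illustration,
# not Impala code): pre-aggregate locally, ship only the partials.
from collections import Counter

def local_preaggregate(local_rows):
    """Runs on each node, next to its data slice: count rows per key locally."""
    return Counter(key for key, _value in local_rows)

def combine(partials):
    """Coordinator merges only the small partial aggregates
    shipped over the network."""
    total = Counter()
    for partial in partials:
        total += partial
    return dict(total)

# Three hypothetical nodes, each holding a slice of the table.
node_data = [
    [("a", 1), ("b", 2)],
    [("a", 3)],
    [("b", 4), ("b", 5)],
]
partials = [local_preaggregate(rows) for rows in node_data]
print(combine(partials))  # {'a': 2, 'b': 3}
```

Only three tiny Counters travel between processes here, rather than the five underlying rows — the same reason full distribution keeps network traffic to the minimum the query requires.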
So that's why we're partnering with some BI vendors — they can immediately retarget against Impala. Well, so what are the infrastructure requirements to run this? Obviously it's scale-out on, you know, commodity hardware — in a good way, I mean, industry-standard hardware. But what's the footprint requirement? Does it reduce the footprint for managing the queries? What's the impact of the query efficiency relative to the infrastructure? It actually does reduce the IO footprint, in the sense that Impala doesn't materialize intermediate data, whereas Hive materializes intermediate data to disk, meaning Hive adds additional IO requirements onto the system that Impala does not have. So you get to utilize more of your IO bandwidth to actually get the data out, rather than, you know, doing all the intermediate stuff. I wish David Floyer were here from Wikibon, because we would be geeking out on some serious IO conversations, but I'll do my best, David Floyer, to ask this next question. So obviously in-memory databases are great — the in-memory movement with databases is great — and you can do a lot with not a lot of compute, but you still need a lot of compute to go through the data. Okay, so you have in-memory. If I add more stuff onto the database, like interactivity and analytics, I've then put pressure on the IO. How do you guys resolve that? That seems to be a common question. Maybe I'm phrasing it wrong, but the notion is: okay, I've got some compute, I'm maximizing my compute with my in-memory database, but if I want to add more — maximize the cores, add analytics and more things on top — that also puts pressure on the IO-constrained bottleneck. Well, I mean, in the end, it's always a matter of finding a balance, right?
If you're running a system where the IO is overutilized — where you're bottlenecked on IO — well, then either you should probably move things into main memory, or you need to add more IO capacity. Impala has been written to be very efficient in terms of doing the processing of the data itself — the parsing and the joins, et cetera. And so we put a lot less pressure onto the cores themselves, which means that you can actually fully utilize the available disk bandwidth. Let's talk about performance for a second, because the skeptics have been saying, oh, Impala — HDFS can't handle it, it's not going to be high performance. So, one, talk about some of the skeptical comments that have been made about Impala, and then talk about some of the performance benchmarks that you've seen — Charles, the VP of product, told us there are some significant performance numbers. Yes. So what we're actually seeing, with the recent improvements in HDFS that are present in CDH 4.1 — which is what we used internally to test Impala — is that you can actually get the full raw disk bandwidth out of the disks. We're doing local reads, which means you're reading effectively from the local file system — there's no indirection through any DataNode processes — and we are getting in excess of 100 megabytes a second out of standard SATA disks, which is as good as it gets. So that's a Hadoop integration challenge that you've solved, correct? Yes, exactly — the integration work for HDFS to expose the full disk bandwidth has been accomplished, meaning you can't go beyond the hardware, obviously. Now, won't you get an additional performance boost when you do columnar? Of course you will — yes, you will. Because you further reduce the amount of IO you have to do; that's going to be a very big win.
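That 100-megabytes-per-second-per-disk figure translates directly into scan-time estimates. A back-of-the-envelope sketch, where every number except the 100 MB/s from the interview is an assumed example value:

```python
# Rough scan-time arithmetic; cluster sizes below are illustrative assumptions.

def full_scan_seconds(table_gb, nodes, disks_per_node, mb_per_sec_per_disk=100):
    """Lower-bound time to scan a table when every disk delivers
    its full raw bandwidth in parallel."""
    total_bandwidth_mb = nodes * disks_per_node * mb_per_sec_per_disk
    return (table_gb * 1024) / total_bandwidth_mb

# A hypothetical 1 TB table on 20 nodes with 12 SATA disks each:
print(round(full_scan_seconds(table_gb=1024, nodes=20, disks_per_node=12), 1))  # 43.7
```

This is also why columnar formats like Trevni matter: reading only the needed columns shrinks `table_gb` in this formula, often by a large factor.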
Right, and that's what all the commercial columnar storage managers depend on, obviously. So I'm trying to build a business case in my mind. It's open source, so I don't have to pay for it. Right. It's simpler, right? It's one less thing to administer, presumably. Right. Because of the integration that you've done. It runs on the same hardware, and it's faster. And it's going to be a lot faster when you deliver Trevni? Yeah, Trevni, the columnar format, yes. That's right, yeah. So if I had to do my little business case there, I could quantify that, depending on my size — do my before and after, and boom, do the ROI. I can't quantify it quite yet, but it should work out. We'll call David Floyer; he'll be able to have it. So the rumor is that the name Impala came from you. Is that true? That is true, yes. So give us the history of how the name came about. What happened was, I was fishing around for a name, and I asked a colleague at work, Eli — Eli Collins, that is — and he threw out some names, one of which was Gazelle. And I immediately thought Impala, because I had lived in South Africa for a while. And also I was listening to the band Tame Impala quite a lot at the time, so I'd like to plug them here — they just came out with a new album three weeks ago or something. We're going to get sued for royalties on theCUBE. Give attribution. So, yeah. South Africa, where they eat impala. They probably do. Yeah, I've never had impala, actually. So congratulations on all that — really awesome work. Impala's got some great buzz, and the whole Cloudera team, which has a lot of smart people in it, is getting behind it. Dusty Jeff is passionate about it. You've had a great track record with what you've done at Google. So I want to ask you more geeky questions and get your perspective on things.
What do you think of the current database market right now, in terms of some of the technical challenges that are being solved and some of the cloud-based solutions that are rolling out with Hadoop? Because right now, Hadoop still stands up better on bare metal for performance; cloud is not really resonating well. Yeah, confidence is out there, you've got platforms out there, but still, you're seeing it deployed specifically in data centers and on-premise. What do you see as core challenges for the folks who are engineering out there on databases and whatnot? In terms of — you mean engineers as in the people who write the databases, or who utilize the databases? What are the top coders who are pioneering some of the enhancements in Hadoop working on, relative to solving and scaling the next generation? Well, I can talk about what we're doing in-house in more detail, because I know more about it; I don't really know what our competitors are doing. Yeah, just from your perspective as a geek. No, no, of course. I think caching is definitely going to be very important. Like I said, main memory is very important, and for Hadoop to utilize available main memory more effectively is going to be important. Right now, it relies on the OS buffer cache, which is ineffective. And there's a lot of processing overhead associated with it, because it does all the checksumming — every time it reads something, even out of the buffer cache, it re-checksums the data. So it's very expensive. And if you're doing three-way replication, you need to have your data cached everywhere — basically three times — in order to get the full performance benefit. So I think there's still a gap that will need to be bridged. What about coherency in the database? Is it aging gracefully, not gracefully? Is coherency in the database changing? Coherency — what do you mean? Dave's giving me the hook here. My final question is: what's next for you now?
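The caching gap Marcel points out has simple arithmetic behind it: with a plain OS buffer cache under three-way replication, a hot dataset can end up cached once per replica. A rough sketch with illustrative numbers — the replica-aware alternative here is a hypothetical contrast, not a shipped feature:

```python
# Illustrative arithmetic for the caching gap described above;
# dataset sizes are assumed example values.

def cache_needed_gb(hot_data_gb, replication=3, replica_aware_cache=False):
    """Memory needed to keep a hot dataset warm: one copy per replica
    with a plain per-node OS buffer cache, versus a single copy
    cluster-wide if the cache were replica-aware."""
    return hot_data_gb if replica_aware_cache else hot_data_gb * replication

print(cache_needed_gb(100))                            # 300 GB: OS buffer cache, 3 replicas
print(cache_needed_gb(100, replica_aware_cache=True))  # 100 GB: one cached copy cluster-wide
```

That threefold difference, plus the re-checksumming on every read, is the "gap that will need to be bridged."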
We've got our next guest coming, so I don't want to run over, and we don't have time to go into the coherency question. But final question: what's up next for you? What are you going to be working on? Definitely continuing to work on Impala and turning it into a real product. Right now we have a beta out, which is something for people to try out, and the goal is to turn this into a really useful — basically, the query engine portion of a database system, right? Something that has commercial quality and can compete with commercial offerings. Final, final question — I have to ask you this, and I want you to tell the people out there: obviously, when Cloudera does something this bold, there's going to be some FUD from other vendors. Just tell the folks out there why Impala's the real deal and why it's so important, from your technical perspective. Right, yeah — it is the real deal because we are utilizing standard parallel database technology that has been available commercially and has been very successful, right? It has demonstrated its ability to scale and its ability to handle diverse workloads. And this is finally available in the Hadoop ecosystem in a non-compromised way. That's basically my summary. Big data platform — Marcel, thanks for coming on theCUBE. Congratulations. Thank you. Maybe it's a rock band and not a car, but great stuff, congratulations. This is theCUBE, we'll be back with our next guest after this short break. All the signal from the noise is right here in theCUBE — extracting that signal, sharing that with you. This is SiliconANGLE.com, this is theCUBE, we'll be right back after this short break.