We are covering all the action. SiliconANGLE.com, Wikibon.org. This is our flagship telecast. We'll go out to the event and extract the signal from the noise. Talk to the alpha geeks, talk to the CEOs, talk to the startups, talk to the data scientists. I'm joined by my co-host Dave Vellante, and we're here with Jeff Hammerbacher. I guess your title is Chief Data Scientist at Cloudera, but you're a data geek. You were on last year in one of the most popular CUBE interviews; I think you used the phrase "gym rats" for data geeks. We're going to look for data geeks, data's hot, and exploring the data is a big part of the themes this year. So welcome back to theCUBE.

Thank you for having me.

And we love having you on here because we want to have these kinds of provocative conversations. Last year you talked a lot about the data scientists and playing with data. This year the conversation is making the most out of the data, getting that understanding gap, as Tim Estes of Digital Reasoning talks about, closed. So what's your take on this trend from a data science standpoint?

Sorry, somebody just about walked into the screen there. I'm sorry, which trend?

Data science, relative to extracting the data. The insight side of the market seems to be hot right now.

Yeah, sure. I mean, the more data you plow into this infrastructure, the bigger the bottleneck is going to be on turning the data you've collected into something which can actually improve outcomes for your business. So yeah, there's a lot of work that's happening there. Obviously, we're very excited about taking Hadoop from the existing workloads that it's able to do well today, which are mostly data preparation related, and allowing people to do interactive queries over them using Impala. We've been working on this for two years, and I think it's the most meaningful change to the infrastructure that I use on a day-to-day basis for data analysis since Hadoop.
We just talked to Mike Dauber, who's at Battery Ventures. He made this metaphor: if you could type one query into Google a day, that would kind of suck. So imagine that, and you have different kinds of queries, but you want to be more real-time, you want reflection on data sets, you want to pull data in, you just don't want to do one thing. There's a lot of human and/or machine learning based techniques you could apply to data that allow you to do things, and you need to do things fast, because you don't always get that right question the first time. Can you talk about that dynamic and what that means to the infrastructure and some of the software and some of the data analysis?

Yeah, you bet. So I think that when you're doing a data analysis project, you'd like to be able to issue a query and receive the results back in the time in which your working memory is able to hold the motivation for that query. Everyone has a pretty finite limit on their working memory, and so that was the goal of what we built with Impala: to allow people to iterate on the questions that they're asking of their data. You submit one query, you get the result back, and you immediately submit another query based on what you see in those results, and so you follow this path when you're performing your data analysis.

And the trends out there right now that are hot are obviously Hadoop and data and data science. Real time has been a big thing. What should people know about real time? Because that's a big part of Impala, the whole real-time aspect of it. What does that mean to people? Because there's real time for financial traders, which means milliseconds, to minutes.

Right, so when I hear people say real time, I try to break it down and understand exactly what they mean. So there's one problem that we've solved from a real-time perspective at Cloudera with a system called Flume.
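The submit-a-query, look-at-the-results, query-again loop Jeff describes can be sketched conceptually. This is an illustration only, using Python's built-in SQLite as a stand-in for an interactive engine like Impala (it is not the Impala client API), and the events table and its rows are invented:

```python
import sqlite3

# Conceptual sketch of the iterate-on-your-question loop described above,
# using SQLite (Python stdlib) as a stand-in for an interactive SQL engine.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (user TEXT, action TEXT, ms INTEGER)")
conn.executemany("INSERT INTO events VALUES (?, ?, ?)", [
    ("alice", "click", 120), ("bob", "click", 340),
    ("alice", "purchase", 95), ("bob", "click", 410),
])

# First question: which actions are most common?
q1 = conn.execute(
    "SELECT action, COUNT(*) FROM events GROUP BY action ORDER BY 2 DESC"
).fetchall()

# The answer immediately motivates a follow-up, asked while the motivation
# is still in working memory: who generates the dominant action?
q2 = conn.execute(
    "SELECT user, COUNT(*) FROM events WHERE action = ? GROUP BY user",
    (q1[0][0],),
).fetchall()
print(q1, q2)
```

The point is the latency budget: each round trip has to come back before the analyst loses the thread of why they asked.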
And that's the time it takes for data to go from the place where it's generated to the place where it's going to be analyzed. So Flume is trying to shorten that gap between data generation and data being ready for analysis. That's one aspect of real time. Another aspect of real time is just saying, when I want a small piece of data, I can get it back really quickly. HBase is a really useful tool for saying, hey, if I can specify the row and column and version of where my data lives, then I can get a small amount of data back in a millisecond. And Impala is trying to solve a problem which is more about how I can aggregate data in a single or multiple columns, or across multiple tables, in a very fast fashion. And it's a very, very difficult problem to solve. You have to do a lot of work to make that query very efficient, and a lot of times at scale you're bound by the speed of disks to be able to get data off disk. But that's probably the most important class of queries that we see our analysts performing, these kinds of aggregation and join queries. So that's the class of queries for which Impala is real time.

Now there's another set of problems for which Hadoop is not yet real time. There are some integrations with Solr, for example. HBase is very good if I can specify the small amount of data that I need based on a row and a column. But Solr goes beyond that and allows you to specify what data you need based on perhaps the full-text contents of one of those fields, or many, many columns. So faceted search is very good at being able to pull back data in real time across multiple columns or within a field of free text. And there's another set of real-time problems which are solved by stream processing or complex event processing engines.
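The first two real-time classes above, HBase-style point gets versus Impala-style aggregations, differ mainly in how much data a query must touch. A toy Python contrast (this is not the HBase or Impala API, and the cell data is invented):

```python
# Toy contrast of two access patterns: a point get addresses one cell by
# (row, column); an aggregation must scan every row holding the column.
store = {
    ("user42", "name"): "alice",
    ("user42", "spend"): 120,
    ("user99", "name"): "bob",
    ("user99", "spend"): 340,
}

def point_get(row, col):
    # Touches only the requested cell: this is why row/column lookups
    # can come back in milliseconds regardless of table size.
    return store[(row, col)]

def aggregate(col):
    # Scans the whole store: at scale this is bound by disk scan speed,
    # which is the class of query Impala works to make fast.
    return sum(v for (r, c), v in store.items() if c == col)

print(point_get("user42", "name"))  # alice
print(aggregate("spend"))           # 460
```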
So you know there was a lot of database research on this 10, 15 years ago, which turned into some commercial vendors who didn't necessarily get as much market traction as they anticipated. And now there's kind of a new generation of open source offerings which are trying to do this stream processing or complex event processing. This is taking tuples in flight and trying to do manipulations or aggregations over them before they even hit persistent storage. So I think that's another interesting class of real-time computations. We're constantly trying to take the problem sets for which Hadoop is useful and expand into the next most important problem. From our perspective, Impala was the next most important problem set for which Hadoop could be used. We definitely see those additional use cases as ones that we'd like to solve in the future.

You mentioned spinning disk problems. Does flash take care of that, or is that just expense?

Unfortunately no, and it's not the expense issue. Flash is actually only about two to three X faster than spinning disk for serial scans, and most of our analytical queries involve serial scans of data, so you don't really get that big of an improvement from flash. If you look at a 10X price increase for a 2X performance increase, it doesn't really make that much sense. There are certain workloads for which flash could be useful, but for the Impala-style queries, the only way to get there honestly is more efficient query processing and more sophisticated algorithms for joining.

Jeff, tell us about HBase, 'cause HBase has an interesting scanning feature, the scanner within HBase. Does that play into it? I mean, why is HBase so hot right now? Obviously a big part of Impala.

Yeah, so Impala can run over data stored in HDFS or in HBase. And as you guys are probably aware, we recently hired Michael Stack and his team.

Great hire.
Jean-Daniel Cryans and Elliott Clark away from StumbleUpon to come work at Cloudera. So you know, we now have eight people within engineering full time, and dozens more outside of engineering, in solutions architecture and sales engineering, training and support, who are all working on HBase. So it's clearly very important to us. And I think that the scan features, though, actually could use a lot of work. HBase as a data store is sort of a write-optimized mutable column store. It's very, very good: it receives all of its writes into memory, and it can keep up with a very, very high write volume. But today, if you're trying to do a serial scan of data stored in HBase, you're gonna see a pretty big degradation in performance compared to data stored on HDFS in a file format like Avro or Trevni, which is a column format that we're working on. So I think there are a lot of improvements that could be made to HBase for these scan-intensive, read-mostly workloads, and you'll see us doing a lot of work on that in the next six to 12 months.

You guys put a lot of resource obviously behind HBase.

Oh, for sure.

I'm looking at the HBase arrow as opposed to the...

Yeah, I think that we're very pragmatic in our product philosophy at Cloudera. We looked into our customer base and we saw that HBase had a tremendous amount of traction. You know, I think it was very instructive from my perspective, having been part of the team that built and shipped Cassandra at Facebook, to then watch Facebook actually make the decision not to deploy Cassandra into production but to work with HBase instead. That was very instructive to me: well, actually, the design considerations that we used when we were building Cassandra may not have been as important for the majority of workloads. I think they are important for a subset of workloads, but in reality I think HBase is the better-designed system for a vast majority of workloads.
So then if you look at the variety of HBase clones and competitors, the reality is that HBase has been deployed in production for several years. It has multi-data center replication. Snapshots are actively being worked on and are basically complete in trunk. It's got rolling upgrades. A lot of the work that's happening on mean time to recovery for region servers is getting pushed into trunk over the next couple of months. So HBase is actually pretty darn good for operations, and it's pretty darn good from a performance perspective these days for its primary workload, which is, like I said, that random access to small amounts of data. I think it still has a lot of improvements that can be made for scan-intensive workloads, but when I look into our customer base, it's what everyone's using and it's working. So it's confusing to me when people say we've rewritten HBase, because I look at it and I'm like, it seems like you could have done something better with your time, because HBase works.

So what are you looking at now? I thought that was a little bit of a comment there.

Sure, I just think there are other problems to be solved. So if I had an engineering budget to point somewhere, and MapR's got some very strong engineers. I have a lot of respect for Srivas and Tomer; I know those guys pretty well. And if I was thinking about what I wanted to build, I would probably augment the platform rather than rewrite it and take on the burden of maintaining a piece of software which basically works and is in production at thousands of nodes within the largest web properties and some of the largest cloud vendors. I guess maybe the advantage we have is that we actually see people running their systems in production. So other people might look from the outside and say, oh, I could do this, that, and this better. But the reality is you don't know until you've put it into production.
And I've seen these systems running in production and they work. So I would spend my energy elsewhere.

What are you seeing in customer demand? You mentioned the Cassandra thing as a great example, instructive to see how HBase kind of won that use case. What are you seeing now, outside of your Impala purview, that's interesting to you? Mike Olson brought up YARN as an example. What are you seeing out there where you're like, wow, okay, we're watching it? We're not overlooking it, but we're watching it closely.

I mean, in all honesty, Impala is the most interesting thing that's happened to me in the Hadoop world in a while; everything else has been kind of noise. Outside of the Hadoop world, I'm very excited about what JJ Allaire is doing with RStudio. I think that's a really interesting project. They're kind of inserting themselves using an open source IDE, which uses WebKit under the hood, which is fascinating, so that it can be deployed on the desktop or in the browser pretty easily either way. And they're building a lot of features in there that I actually use on basically a daily basis in RStudio. I think that what Wes McKinney is doing with pandas in Python is very cool. So part of my workflow today is: I take a Google Spreadsheet, I dump whatever medium-sized data that I have in there, I clean it all up, and then I have a little script that uses the Google Spreadsheets API to pull that into a pandas data frame. And then I do all my aggregation analysis in pandas. So in terms of what I'm personally using: Impala, like I said, has really been the first project within the Hadoop ecosystem which actually changes what I do on a day-to-day basis for data analysis. I've been kind of mostly bored for the last four years. I see everybody trying to make their product work with Hadoop, and I see different things happening.
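The spreadsheet-to-pandas workflow Jeff describes can be sketched. Here the Google Spreadsheets API pull is replaced by an invented in-memory table, since the point of the sketch is the clean-up and aggregation steps that follow it:

```python
import pandas as pd

# Sketch of the workflow described above, with the Google Spreadsheets API
# step replaced by an invented in-memory table.
raw = pd.DataFrame({
    "region": ["east", "west", "east", "west"],
    "revenue": ["1,200", "900", "1,500", "1,100"],  # strings, as spreadsheets deliver
})

# Clean it all up: strip thousands separators, convert to numbers.
raw["revenue"] = raw["revenue"].str.replace(",", "", regex=False).astype(int)

# Then do the aggregation analysis in pandas.
totals = raw.groupby("region")["revenue"].sum()
print(totals)
```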
Yeah, but I didn't think anything fundamentally new was happening until Impala. So I actually do data analysis with Impala on a regular basis, because it's interactive. It's using the SQL dialect that we built at Facebook with Hive. And it's got great integration: it's got an ODBC driver, and it can plug into Tableau and MicroStrategy and QlikTech and Pentaho.

So what are the good tools out there? I mean, for example, I showed you our little HBase app, and we're having a really hard time getting the... oh thanks, coming from you that's a compliment, but it's early for us, and we're trying to get data out there. We have all this data and we want to try to pull it out fast, and we're building some SQL on top of it. So for people playing with something like HBase, like us, what tools can they use? Because SPSS is out there, but you gotta pay money for it. What tools?

Actually, there's one other project that I get really excited about that I didn't mention, which is a set of research papers that came out of Joe Hellerstein and Jeff Heer's work on kind of data preparation and data integration. One was called Wrangler, one was called Profiler. And I think I talked about Wrangler last year. But that to me is kind of the great, and I think we talked about this last year too, the great unsolved problem for me as someone who does data analysis on a daily basis: okay, I have obtained my raw data and I have put it into my repository, and now I would like to beat it into shape and turn it into something that I can actually do model fitting or graphing off of.

Yeah, absolutely. Closing that gap is very important.

And I think the stuff that they're doing, so they founded a company called Trifacta to commercialize it.

Joe's new company.

Yeah, precisely. Yeah, yeah.
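The "beat it into shape" step can be illustrated with a small hand-rolled sketch; the raw log lines and their format are hypothetical, and tools like Wrangler infer such transforms interactively rather than requiring code like this:

```python
# Hand-rolled sketch of the data-preparation step discussed above: turning
# raw, inconsistently formatted text into rows fit for modeling or graphing.
# The log format here is hypothetical.
raw_lines = [
    "2012-10-24 | alice | purchase | $12.50",
    "2012-10-24|bob|click|",            # inconsistent delimiters, missing amount
    "2012-10-25 | alice | click |  ",
]

def wrangle(line):
    date, user, action, amount = [p.strip() for p in line.split("|")]
    return {
        "date": date,
        "user": user,
        "action": action,
        "amount": float(amount.lstrip("$")) if amount else None,
    }

rows = [wrangle(line) for line in raw_lines]
print(rows)
```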
So I'm not really an advisor, he hasn't called me in to do any work for him yet, and I never take advisory roles, but to be completely honest, the quality of that team and the quality of the research which backs it up reminds me a lot of the early days of Tableau, where you had great...

Bunch of good gym rats working on some data problems.

Yeah, exactly. So the core research which undergirds Tableau was done almost 10 years ago by Pat Hanrahan and Jock Mackinlay and other folks. And I see a lot of that in what happened with Tableau, and obviously Tableau in the market today is a real behemoth. And with Trifacta, if I look at the people who are trying to say, how do I take what people have put into Cloudera clusters and allow others to make use of it, that's the one I think has the most promise from my perspective.

How does someone... yeah, go ahead. Tableau pre-dates Hadoop, right?

Yeah, for sure. So I think that Impala actually gives the existing vendors a really powerful way to take advantage of the data that's stored in Hadoop. If you took a Tableau or a MicroStrategy or a QlikView or a Pentaho and you tried to put it against data stored in Hadoop prior to Impala, it wasn't that useful, because you had these batch query response times. For the first time, with Impala, you can actually have interactive query response, so people can build charts on the fly. So actually, if Tableau was a publicly traded company, I would have bought some stock in those guys. I think they're in a very good position. Trying to develop novel tools for visualization over Hadoop is what you would have had to do prior to Impala, and I think it made complete sense for the people who were building those tools to do what they did prior to Impala existing.
But now that Impala exists, I think that people like MicroStrategy and Tableau and Pentaho and QlikTech and Spotfire and others, Cognos and Business Objects, are going to be enabled to take their existing tools and make them work against Hadoop.

I'd love to put you on the spot here, because you can give perspective on that: others are essentially solving the same problem, so talk about it.

Well, so Daniel Abadi, I met at VLDB back in 2006, and I've stayed in contact with him since then, and I have a tremendous amount of respect for him as a researcher. But I think Impala's the right approach. I mean, I emailed him two years ago and basically said, hey Daniel, I see what you're doing with HadoopDB; here's the way that I think I would do it if I wanted to get interactive query response out of Hadoop. And I think that he basically didn't think that HDFS could be made to perform in a way that would enable interactive query response. But having the world's leading HDFS experts at Cloudera meant that we were able to put in modifications to HDFS. And to be completely honest, talking with Marcel when he was building Impala, I asked him how big of an impediment HDFS was to getting great performance out of Impala, and he said, not really an impediment at all. We've done so much performance work. Our competitors will constantly say that they're faster, but literally never once have we lost a bake-off. We're able to outperform them every single time. So there's a lot of great work that's gone in to enable that. And on Hadapt: in fact, we recently hired a guy named Mark Miller, who was one of the lead architects for SolrCloud, and I was talking to him about some of the work that he's doing with HDFS. I said, have you guys seen any issues with performance out of HDFS, getting Solr to read indexes out of HDFS? And he said, no. He said, I was very surprised.
He said, looking from the outside, I never wanted to get started with HDFS because I'd heard all these really bad things. And then once I started using it, the performance was actually almost equivalent to what we were getting out of a local file system. So from my perspective, I feel like I tried to tell Daniel what we were building two years ago. I would have loved to work with him on Impala, and I hope that we still one day will be able to. But I don't think that architecture is something that I'd run.

Jeff, final question, we're getting up against a break on time here. What's your vision next for Impala? Obviously a good announcement, everyone's buzzing about it, it's a great direction, we love it. It's a platform, it's the big data platform, and that's a good position to have. Obviously you're passionate about this, and you're the most excited you've been since Hadoop, which is a testament. So what's next? Where are you gonna take this thing? What's the trajectory of it?

So with Impala, I think the next steps are all fairly well understood. We have an ODBC driver; we need to get a JDBC driver. We need to put in things like improved application-level caching. We need to do things like approximate query results. We need to add analytic functions. There's a slew of additional join algorithms. There's fault tolerance. There are things that we can do to eliminate stragglers. So it's a very, very well understood roadmap with Impala. In terms of what problem I'm thinking about on kind of a two-year time horizon, I'm very interested in scalable model fitting. So right now, we'd like you to be able to have one container to do all the work that you're doing for analytical data management. And we're exceptional at data ingest. We're very, very good at data preparation. And with Impala, we're now good at ad hoc queries and reporting.
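Of the roadmap items above, approximate query results is a generic technique that can be sketched: answer an aggregate from a sample rather than a full scan, trading a little accuracy for a lot of I/O. This is an illustration of the general idea, not Impala's design, and the column of values is synthetic:

```python
import random

# Sampling-based approximate aggregation: read 1% of a column instead of
# all of it, and accept a small, quantifiable error in the answer.
random.seed(7)
column = list(range(1_000_000))          # stand-in for a column on disk

sample = random.sample(column, 10_000)   # touch only 1% of the values
approx_mean = sum(sample) / len(sample)
exact_mean = sum(column) / len(column)

# The sampled answer typically lands within about a percent of the exact one.
print(approx_mean, exact_mean)
```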
So the big workload that I see next is: once I have my data all cleaned up and ready to go, I wanna be able to actually fit a model over that data, whether it's a regression model or a decision tree or, you know, a support vector machine or what have you. And what that really boils down to is optimization. So we've been kind of whiteboarding a new project that basically allows you to specify optimization algorithms using a DSL. You specify: this is the function I wanna optimize, these are the constraints that I have for it. And then we'll try to parallelize that across the cluster. And then we'd love to build interfaces into R and SAS and SPSS with that, so that you just don't have to leave the tools that you're familiar with. With Impala, it's very important that it's interactive and low latency, but another incredibly important aspect of Impala is that you don't have to leave the tools that you're familiar with. It plugs right into your existing BI layer, and it plugs right into the existing SQL interface tool that you're used to using. And that's the same approach we'd like to take for model fitting and machine learning. So that's where I'm spending most of my time these days.

Okay, a conversation with Jeff Hammerbacher inside theCUBE, live here at Strata Conference + Hadoop World 2012. This is a great conversation and great insight, and we got the roadmap laid out for Impala. It's exciting, we're excited to have you, and obviously we're big fans of your work and have been following you and will continue to follow you. Great job with Cloudera, appreciate your help. And we'll be right back with our next guest right after this break.
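The specify-a-function-and-optimize idea Jeff describes for model fitting can be sketched in miniature. This is a sequential, single-machine toy under stated assumptions (the project discussed would parallelize across a cluster, and the dataset here is invented), shown for a least-squares regression:

```python
# Generic gradient-descent minimizer: you supply the gradient of the
# function you want to optimize; the routine knows nothing about models.
def minimize(grad, x0, lr=0.1, steps=500):
    x = list(x0)
    for _ in range(steps):
        x = [xi - lr * gi for xi, gi in zip(x, grad(x))]
    return x

# Example objective: least-squares fit of y = a*x + b to a tiny dataset
# generated from y = 2x + 1, so the optimum is a=2, b=1.
data = [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0)]

def grad(params):
    # Gradient of mean squared error with respect to (a, b).
    a, b = params
    ga = sum(2 * (a * x + b - y) * x for x, y in data) / len(data)
    gb = sum(2 * (a * x + b - y) for x, y in data) / len(data)
    return [ga, gb]

a, b = minimize(grad, [0.0, 0.0])
print(round(a, 2), round(b, 2))  # 2.0 1.0
```

A DSL of the kind described would let the analyst state only the objective and constraints, with the system choosing and distributing the solver.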