theCUBE at Hadoop Summit 2014 is brought to you by anchor sponsor Hortonworks. We do Hadoop. And headline sponsor WANdisco. We make Hadoop invincible.

Welcome back, everybody. You're watching theCUBE live here at Hadoop Summit. My next guest is John O'Brien, Principal Analyst and CEO at Radiant Advisors, and Mark Milani, who's the SVP of Product Engineering at Actian. Welcome to theCUBE. Appreciate you coming on.

Thank you. Glad to be here.

So John, we were talking a little bit before we went on the air about some of the hot topics here at Hadoop Summit, and of course SQL on Hadoop is one of them. Why don't you start off, if you could, tell us a little bit about yourself and Radiant Advisors, and then let's dive into what you're seeing in the SQL on Hadoop market and what you make of all the different options out there.

Okay. So Radiant Advisors is a research and advisory firm. We work with a large client base, and a lot of our work is tracking big data architectures and adoption strategies as companies move toward wanting to understand implementation services and best practices. We also do independent research and benchmarks. We've been tracking several things since last year. We definitely saw the integration aspect with BI evolving, and the accessibility aspect, which is where we saw SQL on Hadoop coming up last year. Of course, several products hit the market, and we've done our own independent benchmark of the different products, which is available on our website for anybody to download. In it, we really tried to break down, from our clients' perspective, how you decide what's going to be a good SQL engine on Hadoop. What should you be looking for? What are the criteria for comparing them?
And we just continue to see, even this week here at Hadoop Summit, that it's gaining more and more momentum as a really important topic.

Yeah. Well, before we start talking about the different options out there, why don't we talk about what you're seeing from your clients in terms of the drivers. Why are they so interested in this?

So one of the main drivers, and we go to different summits and think tanks with companies all the time, is that we always hear about the shortage of data scientists and MapReduce programmers. We just don't have enough of those resources yet. Meanwhile, there's a lot of people in the enterprise already enabled with BI, already doing a lot of SQL, and this is a large group of people we just need to enable. They already work with data, but they do it with SQL. So it's a way of solving that talent and skill shortage by bringing those people into the Hadoop environment, right? Doing that means introducing schema on read and some other capabilities and new processes. But the main driver has been this: if you want value out of your Hadoop system, you have to open it up to as many people as possible. Not just data scientist groups or specialized MapReduce teams, but everybody in the company has to be able to go in there, work with data, and pull out insights. And in order to do that, SQL has already proven to be the de facto standard. So how do we incorporate that into the equation? That seems to be the main driver.

Mark, I want to bring you into the conversation. So Actian this week announced its entry into the SQL on Hadoop market. Tell us a little bit about the announcement and the drivers, from your perspective, for taking this step.
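As a quick aside for readers: the "schema on read" idea John mentions means raw files stay untyped on disk, and a table definition is applied only at query time rather than at load time. Here is a minimal, purely conceptual Python sketch of that idea; the field names and sample data are invented for illustration:

```python
# Schema on read: raw, untyped lines stay as-is on "disk"; a schema is
# applied only when the data is read, not when it is loaded.
import io

raw_file = io.StringIO("2014-06-04,clicks,1042\n2014-06-04,views,99871\n")

# The "table definition": column names plus parse functions, applied at read time.
schema = [("event_date", str), ("metric", str), ("value", int)]

def read_with_schema(f, schema):
    """Project each delimited line through the schema, yielding typed rows."""
    for line in f:
        fields = line.rstrip("\n").split(",")
        yield {name: cast(v) for (name, cast), v in zip(schema, fields)}

rows = list(read_with_schema(raw_file, schema))
print(rows[0]["value"] + rows[1]["value"])  # typed arithmetic over raw text
```

The same raw file could be read again later under a different schema, which is the flexibility that makes Hadoop attractive to the SQL-literate BI audience John describes.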
Yeah, so we announced a SQL on Hadoop solution based on our product formerly called VectorWise, now the Actian Vector product. I heard John talk about SQL access, but the actual genesis of the project wasn't SQL access at all. It was about how to scale Vector itself, which has already held single-node database records. As we looked at alternatives, HDFS was an obvious one. What we found interesting once we got into it was that it wasn't just a file system we could provide easier access to; it had a lot of new capabilities coming with YARN for workload management. Some of the things we do uniquely for updates, what we call positional delta trees, are unique and patented to us, and we felt we could work with an append-only file system in a way that I think will take a while for others to catch up to. And not only that, I think we can extend it quite a bit from where we are today.

So talk a little bit about how it fits into the larger approach from Actian. I mean, you've made some acquisitions; you've also got what was known as ParAccel, and you've got the Pervasive data integration component as well. Where does it fit into your approach to big data? It sounds like you've got a real end-to-end solution set. Is the idea to build this platform for customers?

Yeah, that's a good question. We have the Vector database. We have the former ParAccel, which we now call the Actian Matrix database. And we have the former Pervasive DataRush product, which is now called DataFlow. All the pieces are coming together in our roadmap. This is just the first part: taking Vector and bringing it to HDFS. DataFlow was already implemented directly on HDFS, so you can do visual analytics directly on HDFS. Those jobs will soon be executed directly from the Actian Matrix product itself. So you can start seeing the integration happening, and you can rest assured the different optimizers are coming together already too.
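For readers curious about the positional delta trees Mark mentions: the published idea from the VectorWise lineage is to keep base data immutable and append-only, buffer updates and deletes in a side structure keyed by row position, and merge the two at scan time. The toy Python sketch below only illustrates that merge concept; it is in no way Actian's actual implementation, and the data is invented:

```python
# Toy illustration of update handling over append-only storage:
# the base column is never modified in place; updates and deletes are
# buffered by row position and merged into the stream at scan time.

base = ["alice", "bob", "carol", "dave"]   # immutable, append-only base data
updates = {1: "robert"}                    # position -> replacement value
deletes = {2}                              # positions logically removed

def scan(base, updates, deletes):
    """Yield the logical table: base data with positional deltas applied."""
    for pos, value in enumerate(base):
        if pos in deletes:
            continue
        yield updates.get(pos, value)

print(list(scan(base, updates, deletes)))
```

The design point is that this gives read-optimized, append-only files the appearance of an updatable table, which is why it maps naturally onto HDFS.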
The interesting thing when I came into Actian was that it wasn't just a bunch of different databases that didn't fit together. There were a lot of interface layers, so as an engineer it presented a lot of opportunity. These are really nice complementary pieces. As the roadmap rolls out, there'll be Vector on HDFS, you'll see the optimizers come together, and you'll see UDF frameworks that execute through our DataFlow framework. And then the platform really starts taking shape as we progress through next year.

So John, you mentioned you've done some benchmarks. Let's dig into some of that data. What did you find in terms of all the different options out there and how they perform?

So the performance numbers were more or less as expected, but we had so many independent runs going on three different clusters. One thing I'd say about our approach is that rather than just running a bunch of queries across different workloads, reporting workloads, analytic workloads, ad hoc workloads, we wanted to frame it in a way that told readers: if you're going to approach this, it's not only about performance. Just because an engine has the fastest response time doesn't mean it's always the best answer. So we came up with criteria in about three areas. One was speed; speed always matters. The second was SQL capability, because the engines themselves differ. When we talk about SQL capability coming into the NoSQL space, some are SQL-92, some are SQL-99, some have analytic functions on their roadmap and are adding them as customers need them, one at a time. Some leverage UDFs the same way. And the third component was really around scalability. Not all of these engines run across the entire cluster.
If you've got some pretty large clusters, do all the engines, or at least the ones you're interested in, actually run on the entire cluster, or just a subset? We also see architecture patterns at companies that may have a couple-hundred-node Hadoop cluster running but carve out some smaller areas for structured data, and those parameters limit the scale. For example, if you have a large SQL workload across a big cluster, is it more important that you run across all thousand nodes, or that you have the SQL capabilities you know you need? That's part of the comparison.

Now, what we found interesting in the report, and have been talking about as well, is this: we ran the SQL engines Hive 0.11, Hive 0.12, Presto 0.57, Impala, and InfiniDB, and we looked at them on Hadoop 1.x, Hadoop 2.0, and the Cloudera CDH 5 beta. But the biggest influencer of performance was actually the data file format you loaded your data into. Whether you were looking at ORC files, RCFiles, Parquet files, or InfiniDB's IDB files, what we found was that columnar formats, compression, all of that had a huge influence on how the engine performed. The same engine, Hive or Presto, on two different file formats would almost flip its performance in some cases. So if you drop Presto onto your cluster, take a look at how it works, and find yourself saying, well, I'm not getting the performance I expected, you have to come back and pay a lot of attention to the data file formats.

Well, that's interesting, because there are so many options out there right now. So the question is, are they all competing, or will some find a niche? Will we see some consolidation, where some win and some lose, or will they all have a place depending on the particular use case?
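To make the file-format point concrete for readers: the row-oriented versus columnar distinction John describes comes down to how much data a query has to touch. A small self-contained Python sketch (synthetic data, not taken from the Radiant Advisors benchmark):

```python
# Why columnar formats change performance: a query that needs one column
# must scan every field of a row-oriented file, but only one contiguous
# slice of a columnar one (which also compresses better, values being alike).

rows = [(i, "user%d" % i, i % 5) for i in range(1000)]  # (id, name, rating)

# Row-oriented layout: fields interleaved, whole rows scanned together.
row_store = [field for row in rows for field in row]

# Columnar layout: each column stored contiguously and independently.
col_store = {
    "id":     [r[0] for r in rows],
    "name":   [r[1] for r in rows],
    "rating": [r[2] for r in rows],
}

# "SELECT avg(rating)": the column store touches 1,000 values,
# the row store 3,000.
ratings = col_store["rating"]
print(sum(ratings) / len(ratings), len(ratings), len(row_store))
```

Which is why, as John notes, the same engine can look fast or slow depending purely on the format its tables were loaded into.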
I think they're all going to have a place. What we found is that from plain text all the way down to the really columnar ones, it's a spectrum, and the spectrum gives you increasing compression and increasing performance. But as you move toward more performance, you're also giving up flexibility, because you're organizing around columns, access paths, and the SQL you want to hit it with. So if you know your workload, if it's operational, SQL-based reporting and you're going to offload some data in there, then you might want something more columnar. But if it's more discovery and data-science-type algorithms, you might want more flexibility in something more open. So I think they're all going to coexist, and users really need to know and manage where they have which format and for what reason. There shouldn't just be one file-format standard.

Right, so a right-tool-for-the-right-job kind of approach. So Mark, talk a little bit about the future roadmap, to the extent that you can. I'm also curious how you see the SQL on Hadoop movement, or trend if you will, impacting the more traditional data warehouse world. There's a lot of talk about complementing; will Hadoop complement the data warehouse? I think we can all agree it's not going to replace it, but there are some overlaps happening, certainly when you start adding SQL capabilities. What's your take? Are those two worlds going to butt up against each other, or is it going to remain complementary? How do you see that playing out?

Yeah, I think it's going to depend on the use case. It's going to depend on your business case, what kind of infrastructure you want, what kind of outcome you want. I do believe it's going to be complementary.
I think you can run your warehouse on a Hadoop cluster using SQL, and I think you can run your warehouse on a traditional OLTP system. A lot depends on how structured or unstructured your data is, whether you're going to move your data, transform your data, check the quality of your data, how you want to do that, and what's the most effective way to do it. I think the sheer volume of data raises interesting problems that will lend themselves to Hadoop file systems and things that can scale at that level. And I think that's why we're having such a big turnout at this conference and others: the volume of data is creating some interesting technical issues. So I think some of the things you've seen in the traditional data warehouse world will start moving into HDFS file systems, if you will. The key to it, though, is the simplification of access, the simplification of being able to work with the data, more so than just being a repository for data. Those layers are being built by a number of people, including ourselves. SQL gives you that ease of access, so it's not a MapReduce algorithm, where you need a parallel programming understanding, to get at the data. I think that will be a huge enabler going forward for HDFS warehouses, if you will.

So if we could talk a little bit about actual use cases you're seeing with your clients. I mean, the announcement was just today, but I suspect you've got some customers in beta that you've been working with. What are some of the real use cases, at least the early ones, where they're going to leverage your SQL on Hadoop capabilities?

Yeah, the traditional use cases: data lake, ETL, and if you need to do low-latency analytics, MPP, some sort of fast analytic database.
A lot of the customers I talk to still do that, and we talk a lot about our DataFlow product that can parallelize through that kind of ecosystem; I think that's going to be a very popular ecosystem. But for SQL on Hadoop, what they say to me is: okay, I've got that, but I have security constraints. I have all these things. I'm not opening up my Hadoop system to the world, particularly in financial services or healthcare. But I have super users. I have high-value analytics I need to run, and there's no reason to set up another cluster for that. The outcome for the business is valuable enough that they might consider it, but at that level, they have super users who want to get directly at the data. They can trust them, or they can ring-fence the data in a way that those users can get to it safely, and then they can provide better, more efficient service. They don't have to say, okay, I need my quants to run something against this data, so let's put up a whole other cluster, with all the security, all the administration, all the management. They say that's exorbitantly expensive. So this is where I'm seeing a lot of interest right now: where they can open up a window into that system for the super users, or for the high-value use cases, where there's some trusted connection they can give access to.

In terms of Actian's interaction with the community, can you talk a little bit about whether you're contributing to certain projects? What's the level of commitment, and contribution I should say, and activity with this larger community behind us, in terms of Actian's approach to that?

Yeah, we're working with partners like Hortonworks, particularly around YARN; that's super important to us. How much do we contribute so far? Well, we've been so focused on this solution.
We haven't gotten to the contribution part yet, but YARN is going to be incredibly important to a system that uses as much CPU as it can. One of the things about Actian Vector, as anybody who saw Peter Boncz in one of the earlier discussions knows, is that it uses all the cores, uses as much as it can, and gets as much throughput as it needs. And while workload management is not a new story in computer science, it becomes a much more critical thing here. So I see a lot of our involvement in the YARN community at some point in the future. Right now, it's merely about getting enabled and certified, and we work with the various vendors here to do that.

So John, talk a little bit about how you see this evolving. I mean, as we discussed, there are a lot of competing approaches to this. Sure. And YARN opens up a whole slew of different types of processing that can now take place on Hadoop. YARN's a big breakthrough, but it still seems pretty early in terms of actually seeing production workloads out there. What are you seeing, and how do you see YARN impacting the development of Hadoop going forward?

I think it's really significant. A lot of the framework we deliver is really a three-tiered kind of architecture with Hadoop. One of the things that's so valuable with Hadoop on the SQL side, of course, is the flexibility and scalability. But we still have an analytic tier, because that's a specialized workload with different databases, and the enterprise warehouses and the MDM still live in a very structured reference-data piece. So when you start with that today, with Hadoop out there gaining adoption, and then you look at Hadoop 2 over time, and we're looking 5, 10, and 15 years out, I think it's really clear. We advise most of the companies we work with: with Hadoop 2, if you're not adopting it today, you will be. It's a great data operating system platform.
We've studied major data architectures and service-oriented architectures for decades, and this is part of that whole natural evolution: the persistence layer separated from the operating system layer, which is what YARN is. You want to be on that. So when the SQL engines, or any of the other components, come out today, one of the things we advise companies is to make sure they have a YARN story. Because even if you're not there today, you're going to be; that's the direction without a doubt. And you want to bring things into your ecosystem that all work together. The SQL engines are no different. They need to be on YARN today, or YARN-compatible, or have it in their roadmap within this year in order to keep pace, because it is a really fast-moving environment. This is where we're saying the data warehouses aren't going away, nor the MDMs, the analytic MPPs, and the columnar stores. But as you identify workloads and maximize your overall architecture, that's where Hadoop plays a bigger and bigger role over time. So Hadoop with YARN is a newer thing, and it takes time for adoption, but I think it's not a passing thing at all. I think it's a very, very significant stride in data architectures.

Absolutely. So tell our listeners out there where they can go and find some of the benchmarks you mentioned earlier.

RadiantAdvisors.com. It's on our homepage; download the ebook. It has all the benchmark summaries, the approaches, and the numbers. Take a look at it, and definitely give us your feedback. We're getting ready to do our second benchmark, because more engines have come out in the last six months; this one was from January of this year. Newer versions, like Hive 0.13 with Tez, as well. And Actian's engine, we want to include.
So we're looking for feedback on what you like, or how we should do it differently, so we can make another meaningful benchmark for the industry that's independent and that everybody can turn to, with a little bit of thought and framework around it as well, not just speeds and numbers.

Right. So Mark, I'm going to give you the last word. For our audience out there, practitioners trying to wrap their heads around this world of Hadoop and some of the new capabilities, what advice would you have for those looking to get started? Maybe they're in the traditional world. As John said, there's no time to wait; you kind of just have to get moving. What one or two key pieces of advice would you have for those practitioners?

Well, I think the SQL on Hadoop thing is going to help people quite a bit. I would encourage people, at the end of the month when it's available, to download it. Even before then, I would download Vector ahead of that. You can start your Hadoop applications now; now that we've gotten to databases where Hadoop is an integral part of the infrastructure, it doesn't really matter whether it's running on Hadoop or not. So yeah, I definitely think the SQL on Hadoop thing is not going away. There's a reason there's big interest in it: it provides access. It opens up the world to developers on Hadoop, and there are quite a few of those out there.

There certainly are. All right, John O'Brien, Mark Milani, thank you guys so much for joining us on theCUBE. We'd love to have you back at a future event. You're watching theCUBE at Hadoop Summit 2014. Stay tuned, we'll be right back with our next guest.