theCUBE at Hadoop Summit 2014 is brought to you by anchor sponsor Hortonworks. We do Hadoop. And headline sponsor WANdisco. We make Hadoop invincible. Welcome back to theCUBE. We're here live at the Hadoop Summit in San Jose. My name is Jeff Kelly. I'm with Wikibon. My next guests are two partners here in the Hadoop ecosystem, and you'll see that theme at this show. A lot of partnerships happening. We've got John Santaferraro, Vice President of Marketing for Actian Analytics Platform at Actian. And of course, Cube alum Jim Walker, Director, Product Marketing at Hortonworks. Welcome, guys. Glad to be here, Jeff. We were just talking a little bit before we went on. This is a fun but hectic and a little bit tiring few days here at this show. There's a lot happening, huh? Yeah. It's a marathon for sure. Indeed. I think it's a sprint. Yeah. Go with that. So Jim, I wanted to start with you, talk a little bit about one of the things that's top of mind here at the show for pretty much everybody, and that's YARN. You know, it was announced last fall, so it's been out in the market for a little while now. Really, tell the people out there that aren't familiar with YARN why it's so important to the Hadoop ecosystem. Yeah. I mean, it's been such a theme of this show. I mean, it's definitely all over the place. I would say it's definitely YARN heavy. If YARN is slapped around with YARN, I guess, and the Shaun Connolly keynote had a great graphic on that one, right? Yeah. YARN really is, it's kind of the inception, I think, of Hadoop 2. And this is a whole new generation of really the way we think about linear scale storage and compute. I mean, the main value of Hadoop is to provide this great system, this data platform, to do all these great things. The problem was, with traditional Hadoop, it was largely a batch system. It's typically a single siloed cluster with a single data set.
And so with YARN, really, it's the inception of really expanding Hadoop now into more of a multi-purpose, multi-use platform where you're storing many different sets of data and then setting up multiple applications and multiple different engines that are all accessing the data in different ways, all on that single set of data. Be it batch, which traditionally, of course, Hadoop's fantastic at, right? But the interactive and real-time use cases and running all those at the same time, it's not just opening up to new applications. We're seeing people start to see end-to-end kind of Lambda architectures of the way that they actually deal with data. And so I think it was a function of time before this happened. I think early on, the Hadoop team at Yahoo saw this. A lot of those guys are now with Hortonworks, of course, and they saw the limitations of traditional Hadoop, and YARN started back in, I guess, 2009 or so, and it is the delivery of that. And so our customers are looking at it as kind of this vehicle to a wholly different way of looking at data. I think its introduction in October literally has completely changed this marketplace. We'll talk a little bit more about that. What are you hearing from customers, and where are they in terms of actually implementing these new data processing frameworks on top of YARN? Yeah. I mean, it's still very early, but what are you actually seeing out in the field? Yeah. So, I mean, we're seeing lots of stuff. And so lots of interest in stream processing and stream analytics using things like Storm, so Apache Storm on Hadoop, and people are extremely interested in that use case. So looking at these massive sets of data that are streaming in from, say, machines and sensors and these sort of things, and they want to pick off items, they want to pick off events, they want to do real-time analytics across data that's flowing into an overall system, right?
And so that workload in particular, we definitely see a huge increase. I think the other one is very straightforward. You know, HBase is fairly well known as a resource hog, I would say, you know? Okay. So when you were running HBase in a cluster, it kind of took over the whole cluster. If you have more granular resource management, which YARN allows you to do, you can now control how much of a cluster is being used for what workload while still having a single dataset. And so you're starting to see now, you know, stream processing, HBase with low-latency NoSQL, and then ultimately, I think, this whole interactive query via SQL, which is, you know, the lingua franca of data. Yeah. So it's always important. Well, yeah. So, you know, SQL on Hadoop, that is really one of the most popular ways that people are looking to take advantage of YARN. As you said, there's how many SQL developers out there that understand the language. That's how they access data. So it makes sense to bring that to the Hadoop environment. So John, talk a little bit about Actian's approach to this, because, you know, there are several SQL on Hadoop options out there. Actian has just announced their entry to this market. Tell us why Actian did that, why we need another SQL on Hadoop, and really where is the value, the real value add that you bring to the table? Yeah, absolutely. So this whole idea of YARN, by the way, is that YARN changes everything. And I agree with that. That's what Rob Bearden said in the keynote, and I absolutely agree. And we've been running stuff on HDFS clusters before YARN was even available, because we believed that. We knew it was coming. And so, while everybody else is trying to do what they're calling SQL on Hadoop, we're actually doing SQL in Hadoop, inside of Hadoop. And there's a lot of advantages to that. One advantage is performance.
We're coming out of the gates, for example, up to 30 times faster than Impala. We're able to offer, because we're inside of Hadoop, a single management system for all the clusters where everything is happening via Hadoop and YARN management. So you don't have to create a separate cluster and separate management system. We use all the replication capabilities of HDFS as a file system. And so it eliminates a lot of complexity. And then ultimately, our approach is that we've taken the industry's fastest, most hardened, most mature vector processing engine and put it inside of Hadoop. So everybody, Jeff, is really trying to do vector processing. There's agreement across the board: the Stinger Initiative, Impala. They're all working at doing vector processing on or inside of Hadoop. Well, guess what? When Peter Boncz did that research on vector processing back in the early 2000s and built out the Vectorwise database, we bought that technology. So it's been in development since 2004. It went into production in 2010. It's been hardened. It has ACID compliance. It's got all of the failover capabilities, full SQL support, including a bunch of analytic functions. And all of this is now running natively inside of Hadoop. So we've essentially turned Hadoop into a fully functional, high-performance analytic database, a massively parallel, MPP analytic database all running inside of Hadoop. So you take your existing BI tools, your reports, your operational dashboards, your operational BI. You point it at Hadoop now and it runs like any database you've ever interacted with in the past. So this is a huge step forward. We think, because we're using this hardened technology created by the father of vector processing, that we're actually entering the market about four to five years in advance of the rest of the providers. Yeah, we had Peter on. We talked a little bit about some of the technical symbiosis, if you will, between the Actian database and Hadoop.
The scale-out nature on commodity machines. And it actually is quite a natural fit, as opposed to some of the other types of databases out there that may be built on different approaches that don't necessarily match up quite so seamlessly with the scale-out approach of Hadoop. So let's translate all that, though, to solid business value. What is this going to enable your customers to do that maybe they couldn't do before? Is it simply going to make things move faster? Or is it going to give them new capabilities? Or a little bit of both? Yeah, I think it's actually the acceleration of time to value for analytics and for big data investments. So it was interesting in the keynote here at the Hadoop Summit. Merv gave a figure. He said that 70% of the existing big data projects are not yet in production. There's also about 60% of those projects that don't even have a business case. And the reason is, if you think about it, every business unit has a conduit to the data and the analytics, and that conduit is a business analyst. And 90-something percent of those business analysts use a SQL tool or write their own SQL to be able to access the data and run the analytics they're providing to the business. And up until now, they've been given very limited access to data in Hadoop. With the SQL support, the door's been opened slightly. We're flinging the doors open wide. And so the most important thing is that we're literally opening the door to millions of business-savvy SQL users to be able to access this data that sits in Hadoop, to be able to provide the business case, to be able to pick the models that are going to work the best, and to be able to translate those into operational BI and embed those analytics in business processes. So what we're doing is we're actually speeding the process so that people can now move these Hadoop projects out of the laboratory into production, begin creating value, and ultimately transform the way companies use data and analytics.
Well, so let me push on that a little bit. So there's going to be somewhat of a change management issue taking place. You know, the way companies do things has got to change from their traditional systems to adopting Hadoop and then adopting something like SQL in Hadoop. So I'd like to get both of your perspectives. What are you coming up against when you're in customer situations in terms of making some of those mindset changes about how you look at data management? And how are you pushing those conversations forward, Jim? I mean, I think what John just said is a tribute to that. It's, you know, mainstream adoption depends on skills. 60%, I would say, maybe 70% of a buy process for anybody in technology is, do I have the skills to actually implement something? And so, you know, when we're sitting there and I'm going forth, I'm saying, hey, Hadoop, Hadoop, you've got to learn MapReduce. That's a tough language. That's a tough sell, right? And so what we're seeing is really this mainstream Hadoop emerging. And mainstream Hadoop, it has to have all those tools. You know, Rob Bearden was on the keynote yesterday talking about how the enterprise has set a high bar for what enterprise means. We know what that is, right? We've been through this for 20 years, right? Or 30 or whatever it's been, right? And so, you know, I don't think we can hide from any of that stuff. And so, for me, you look at security, you look at governance, you look at operations, and you look at these interfaces. You know, what's that development framework that is going to enable and empower companies like Actian to basically directly integrate with YARN? I mean, it's a multiplier effect for Hadoop. I mean, Hortonworks, you know, Jeff, you've been talking to us for a long time, you've had everybody on here. You know, we're a platform.
I don't want to compete up the stack. I don't think our place is above that stack, right? And so, I want to be linear scale, compute and storage, and provide all the right frameworks and all the right APIs so that, you know, great partners like Actian can build and use Hadoop and we can go to market together and do all these different things. So, you know, that's everything we're seeing. So, John. Yeah, I think the beauty of our partnership is that when you combine all of the modern data platform that you get with Hortonworks and the Actian analytics platform, we're really looking at bringing together that entire end-to-end analytics process. So, there is the SQL on Hadoop. But remember, in our platform, we also have a data science workbench where, you know, our business analysts can choose from 1,500 operators to do all of their data blending and enrichment, data science to be able to do analytics, machine learning, all of the building and testing of models. And all of that runs natively inside of Hadoop via YARN as well. And no coding required, right? Drag and drop interface. So, we're essentially removing all barriers for business access to this data that sits in Hadoop. And we're opening it up now to the entire spectrum of users of big data analytics. So, the data scientist, you know, does all of this discovery kind of work. And they figure out, you know, what are the candidate models that we can be creating? What are all the possibilities? And then the data miner brings in these data mining algorithms and tests those models and figures out, you know, which are the two or three contender models? And then they pass it on to the business analyst. And the business analyst will understand the business, right? They live in it and they'll figure out, which one am I going to put into production? And then all of that gets passed on to the casual user who takes that, puts it into the business and operationalizes it.
And so, between us, we've got a platform that brings that entire process together in one single place, incredibly powerful for companies wanting to move forward with Hadoop. Yeah, I mean, we've done some survey work, and one of the things we find, and it's not surprising, I mean, it kind of validated what we thought, is that one of the biggest challenges is scaling and productionizing the insights and the models that you develop in Hadoop or whatever platform. Actually extending that to larger datasets and building some kind of data product that someone can consume on a regular basis is a real challenge. And so anything you can do to break down some of those barriers is going to be hugely valuable to customers. So it sounds like that's one of the advantages of having a complete end-to-end platform approach. Yeah, absolutely, and there's cooperation as well, you know? I think one of the great things about Hortonworks, and Jim, I'll let you talk more about this, is having Hive, where you can address petabytes of data. You can run a query through Hive and you can ask a question of all of the petabytes of data that sit in Hadoop and you'll get a response back, and then you have a data science workbench or a SQL interface for the business users that want to do higher-performance kinds of analytics. It's supporting all of those workloads. That's the combination of Hortonworks and Actian together. You're exactly right, John. People ask me, you know, where's this thing going? Where are we going to end up? This is exactly where it ends up. And, you know, look, yeah, we have vectorized query in Hive. We've had long conversations about this, John, right? And so, getting interactive query at scale across a broad range of semantics is what we want to do in Hive. So, what am I going to use Hive for versus Actian, right?
And that's the question. It's like, well, I'm not going to build the stuff up the stack, right? I mean, I'm still looking at a SQL interface. That's where we stop, right? We build this great SQL interface called Hive. All of the intellectual property that Actian brings to the table for data science and all the, you know, how do we do that? That's not our job as a vendor. You know, that's why we partner with people to do these things, and from day one, that is absolutely where we've been making sure that we fit, and that's why partnerships with us work extremely well. I don't want to obviate. I don't want to block out the sun. I want to work well with everybody, because the data center is the data center. And we're all going to live. Well, we've talked a lot on theCUBE about how Hortonworks has really, you know, stuck with their strategy from day one. There's been no wavering on that, and it's critical, you know, for Hortonworks' long-term strategy to remain open. I mean, the partnerships are critical to your long-term success, and enabling the ecosystem. And I think that's why there's 88 companies out here. It's part of it, Jeff. I mean, you know, my friend Mitch Ferguson is our VP of Business Development at Hortonworks. On day one at Hortonworks, we had 24 engineers. 24 engineers came out of Yahoo. There were three other employees on day one: Rob Bearden, our CEO; a VP of Finance, I guess you've got to have a finance guy, right? And our BD guy. And so really, I'm not being figurative, literally from day one, it was a main focus of ours to actually, you know, light up the ecosystem. And so I think that's why we haven't faltered. We know that that's important, and I think we all march to that same beat internally at Hortonworks for sure. And I think our partners witness it, I mean, the way that we interact with them. It's what I hear time and time again.
Great company to work with. We were able to get X, Y, and Z done. I mean, it's critical for us. Let's, you know, let's size up the market a little bit. When you're talking about, you know, competitor approaches. So you've got Cloudera taking an approach where they're trying to build out both the Hadoop data platform layer and some of the things that Actian is trying to do, where you're taking more of a partner strategy. To put on your objective hat, what are the pros and cons of that approach versus your approach, do you think? Look, everybody has a business strategy, Jeff. I'm going to stick with the Hortonworks strategy. I believe in it. You know, I bleed Hortonworks green, obviously. You've known me for a while. But it's not just that. I believe that an open community needs a very open approach to partnerships. And I'm not going to comment on my competition and their strategy. I've got to try. I've got to try, Jim. You're going to bait me all day long, Jeff. I mean, I'll let John talk about it. But, you know, I really like having a very strong competitor. I, yeah. To me, you know, Doug and Arun were on stage earlier today. I saw them last week at a panel, and, well, I don't think we would be where we are today as a market without this competition that's happening right now. And so, you know, to me, it's absolutely paramount for any market to have great competition in it in order to move fast, because we have moved very quickly. And I think that's technically and business-wise. Look, this is a great big show and I think that's a tribute to that. I think I'd have to say the same about SQL on Hadoop, or our SQL in Hadoop. There's a lot of vendors out there that are trying, they're scrambling, right? And the race is on to provide SQL access to big data. There's no question about that. And that allows us to enter the market with a solution. I mean, it is really addressing a major need.
I think the other figure that Merv put up in the keynote was this idea that for 53% of the people doing big data projects, the next big thing that they want to do is SQL on Hadoop. And so, there's a need for that market, but the market's being created because there's a number of people pumping into the open source side of SQL in Hadoop. There's vendors like Actian who are contributing the best vector processing kind of technology to provide that for customers. And so, it's because of that need and the competitors that are out there that we can enter the market and be able to attract very quick attention around this particular hole that we're filling. Yeah, I mean, one company doesn't make a market, right? That was the way it was for quite some time, as a matter of fact, but clearly not anymore. So, well, time for just one more question. What are your thoughts about the show itself? As we mentioned, 88 companies here, but I think 3,200 plus attendees from over 1,000 companies. Just talk about the show. I have a unique position, Jeff. I'm content chair for the show. And so, when I took this over this year, it was really important to make sure that this was a community show. We went out and we attracted some really great talent to, you know, curate the tracks. We have track chairs, you know, Ron Bodkin from Think Big. We've got Eric Sammer, who's now at ScalingData. Used to be at Cloudera, you know. Sanjay Radia, a Hortonworker. Jonathan Gray from Continuuity. Oh man. This is going to be like if you have six kids, I forget. And Andy Feng from Yahoo. And then there's the sixth, it's, anyway. John Santaferraro from Actian. But they did a great job in being very fair. Look, you know, we had 600 abstracts submitted for what was going to be 126 sessions. We ended up placing, you know, a subset of that. Of course, we have sponsors and everything. Great representation across the community.
There's a fair amount of Hortonworks slides and sessions in there. And honestly, those were chosen by the track chairs. And I think what we're seeing is a true community event. I mean, Jeff, I can't thank you enough for going on stage with Arun and Doug this morning. And it just shows exactly how this is a beautiful community of developers. And I think that was a great representative point in time of, you know what, there's a lot of vendors here. And there's a lot at stake. And everybody's got their thing going on. There's a battle for SQL and whatever that is, John. But ultimately, you know, there's a lot of developers. And you know what, at the end of the day, they all know each other. And they are respectful of each other. And to me, that's why, I mean, 3,200 attendees, 88 sponsors, we're over-the-moon delighted about the interest. My personal thing is about this really being a community event. Absolutely. Yeah, I think the one thing I've seen: last time we talked, we talked about this shift that was happening from Big Data 1.0 to 2.0. And the major part of the shift was that companies had collected massive amounts of data. They've invested millions of dollars in Hadoop and big data platforms. And a lot of them were stuck in the laboratory. And with this move, people are actually getting pressure from their CFO and from their CEO to now start generating value out of the data. And so I think this idea of removing all barriers for business access to data is resonating with the crowd here. Because there's a number of people who have made the investments. They need to move this into production, and they're ready, and it's going to happen. In the next six to 12 months, I think next year when we come back, instead of 70% waiting for production, I bet it's going to be around 40 or 50%. There's going to be a big push to get this into production. Big data analytics are going to become operationalized, and those are the conversations that I'm having with people at this show.
I need to make that next step, let's go and let's do it quickly. Yeah, I mean, I think you definitely get a sense we're close to a tipping point, with a lot of the proofs of concept and experiments starting to graduate to more production deployments, and that's where we're going to see some real value creation. So it's going to be fun to watch. We'll have you both on again next year for sure. We're looking forward to it. Thanks, guys. John Santaferraro from Actian and Jim Walker from Hortonworks, guys, thanks so much for coming on. Thanks for watching. Stay tuned. We'll be back with our next guest in just a few minutes.