Live from Midtown Manhattan, theCUBE's live coverage of Big Data NYC, a SiliconANGLE Wikibon production, made possible by Hortonworks, "We Do Hadoop," and WANdisco, "Hadoop Made Invincible." And now your co-hosts, John Furrier and Dave Vellante.

Okay, welcome back. We're live here inside theCUBE, live at Big Data NYC. This is where all the action's happening in the Big Apple, with a lot of action at the Hadoop World Strata Conference right behind us at the Hilton. I'm John Furrier, the founder of SiliconANGLE, and I'm joined by my co-host Dave Vellante. Our special guest is Merv Adrian, Gartner analyst, big data guru, well followed. I'm sure you were quoted in the keynote today. Welcome back to theCUBE.

Thank you, good to be back. And I missed the quote, I wish I'd seen it.

"Why isn't Merv on stage?" That was the quote. But you were on stage, you had a picture up there.

I'm there in spirit, always.

Okay, so this is the fun part. The three of us get to pontificate. Well, I'll pontificate, you guys want to analyze. What is the take? Merv, let's start with you. What's your take on this year? If you had to put this in the books, 2013, go back a couple years, Hadoop was a baby, it's growing up, POCs, pilots. How would you categorize this year in terms of a chapter of Hadoop?

Yeah, the baby's certainly growing up. We've gone from why and where to how and who this year. This conference was very much about real deployment: organizations that have in some cases already kicked the tires a little bit, now ready to think about production readiness and enterprise-class features. And more and more, thank God, at last, governance and security are issues that are on the lips and on the minds of the people who are getting ready to make investments here.
We at Gartner still have concerns about many of those issues, and I think they've been inhibitors so far. Our data says 7% of enterprises have actually made real investments that have led to a deployment for a big data project of some sort. That's not confined to Hadoop only, but big data of any form that meets the classic Gartner definition of the three Vs that I know you're familiar with. That's an early, nascent market, but we have, I believe, turned the corner here. We're moving into the real mainstream.

So you've got that great hype cycle, very detailed. Did we just plow through the trough of disillusionment? Did we hit that? What happened there?

Well, there are two dots on there that I think are interesting. For those of you not familiar with the Gartner Hype Cycle, it looks like that. The trough of disillusionment that Jeff just referred to is that stage where we get past that early enthusiasm and early adopters hit a wall for one reason or another. Maybe the product isn't good enough. Maybe they chose it for the wrong reasons. Maybe they just aren't doing it very well. But for some reason, after the bloom is off the rose, that's when the inevitable stories begin to appear: this thing isn't all it's cracked up to be, it was oversold, it was hyped. All of that may be true, but generally speaking, we move through that period into what we like to call the plateau of productivity once balance begins to reassert itself. Hadoop distributions are well down into the trough at this point, and that's not a negative, it's a positive, because you want to move from left to right. So we think we're pretty close to beginning to pull out of that and start to move towards a more balanced and more informed deployment. By contrast, one of the big themes here was SQL on or in Hadoop, depending on who you talk to, and that one is all the way over at the beginning, just past its trigger point, starting to climb the curve.
There's a lot of enthusiasm for it now, but a lot of people still don't know, haven't got the message, that it's going to be possible to do interactive query.

So, the enterprise story. We also had Jack Norris from MapR on earlier. They've been doing enterprise from the beginning, which people pooh-poohed initially, but now everyone is doing it, and that's the trending story. But a lot of the software guys, especially in Silicon Valley, don't understand the enterprise. I think they build great software, but they realize when they go to the enterprise, it's like going on an airplane: you've got to go through TSA first, take your belt off, take your shoes off, put stuff through the conveyor belt, and get stopped and wanded, and ultimately you want to get on the plane, which is getting the solution deployed and scaled, right? So with that metaphor, what is that TSA checkpoint? You mentioned governance, that's one trip wire right there. What are those mandatory screening variables? What would you say to the guys out there: hey, pay attention, this is where you're going to get flagged?

You've posed a very interesting conundrum here. How do I map to, like, what's a body, what's a full pat-down?

Yeah.

Well, it depends which project in Apache you're talking about. Apache project, Drill?

Yeah, I don't know. There you go, I'm not going there. This is a family show.

The first question is, can I secure this at all? Can I secure it with respect to access? That problem, at least, is largely solved. Even from its initial release, we had Kerberos support in the Apache stack, but it wasn't trivial in the early days to actually implement it if you didn't know how to, and Hadoop itself wasn't particularly useful in helping you do that. This was one of the things Cloudera Manager was good at right from the outset, and that helped them a lot. A step beyond that is, how private can I make this data? And that becomes a question of, do I encrypt it?
If I'm not encrypting it, do I perhaps have policy-based access rules that let me see certain things and not others? These are fundamental expectations with a DBMS, but they haven't been true for Hadoop. So one of the interesting emergent topics here was the rise of Accumulo. Accumulo is a data store that was originally built by our friends, some people like to think of them as friends, at the NSA, and it was one of the largest deployed clusters of data that could be secured in that fashion. Somehow they managed to open source it. That still surprises me. There was evidently a Senate investigation about that, but they're all still alive, so.

Yeah, they started Sqrrl.

So we're good. They started Sqrrl, but Sqrrl is no longer alone, and that's interesting. Other distributors are now saying we're going to support Accumulo, and some of them are contributing, as the open source community is good at, extra code helping to expand its use. With Accumulo, you can get right down to what we would think of in the relational world as a cell inside a table and determine security based on that. And others are talking about doing that in things like HBase as well. So we have some interesting implications in terms of...

Now it becomes a race, right? So who can get through security first? Who's got the CLEAR card? Who's got TSA Pre?

Some of us will have the technology we need. Some consumers will want that right away. In some cases, the regulators will insist on it. A lot of things aren't going to go into Hadoop because the regulatory bodies need to be convinced that it's secure enough and that it's private enough, especially in the case of things like healthcare. So the differentiation among the vendors around this issue of security will be, I think, one of the more interesting developments in the next year: the different approaches people take. And meanwhile, back in Apache land, projects are being done right now that aren't in any distributions yet today.
But Knox, for example, is the Apache project that will tackle security there. And when it's ready, then we'll see who puts it into their distributions. That's a variable.

So, SQL: on or in Hadoop? What?

You know, I try to avoid that conversation in time.

Does it not matter?

I think there's some posturing and semantic tweaking going on about whether I'm on or in. Frankly, users could care less. What users want to know is: my tools speak SQL. Can I put them in front of Hadoop and get some kind of interactive response, the way I have been able to do with an RDBMS? Realistically, the answer for a while is no. I may get interactive, and it's getting better than it has been, but an RDBMS is good at doing this and HDFS is not. It's not what it's designed for. There's a lot of talk about schema on read and schemaless, and I would like people to start thinking about deferred schema. If you think about what you do with Hadoop for the first time, you go in and start running barefoot through the data. You go into a discovery process. You go into an exploration process where you look for things that are going to have business value. Once you find them, the question becomes, what do I do with this? If I'm going to do this once a year, then this brute force approach of batch-oriented, file-oriented, brute force programming is just fine. Once a year, I'll run my big data mining scoring algorithm over 10 years of data. Great, that's fine. What if I want to use this thing every week or every day? It makes no sense to continue to do it that way. What you're going to do at that point is choose a secondary data store that can be optimized, that can get updated easily, to which I can provide random access. That's called a database. So it might be HBase, if I'm just going to go with what's in my Hadoop distribution. Everybody has HBase, and that's a steadily improving alternative. It might go into my existing relational database.
The guys at Oracle would love you to use their high-speed connector and push it back over into an Oracle DBMS. Andy Mendelsohn was quoted about that not too long ago.

Running on Exadata. So what's your take there? But there's also NoSQL.

Just to finish the thought, for some of the possible uses of that data, a document store might be the best model. So it might be a MongoDB or a MarkLogic. It might be observational data, like instrument reads, for which a key-value store like Basho is going to be the ideal thing. There are lots of different choices. And really, that's the big story here. The big story here is choices: many, many more alternatives than there have ever been, and that's bewildering for people. It's not as simple as Hadoop. Hadoop is a platform, which is something I wrote a blog about recently, on which many other things are going to be constructed, and next to which many other things are going to be deployed. And those choices are going to define how people do data processing for the next decade.

Data processing, decision support. These are mainframe terms that we are used to, now coming into the distributed world, right? So, the software mainframe. We always joke about the VMworld model, you know, four years ago: Maritz, the software mainframe, and we pulled that quote back. But let's talk about that, the data platform. Will there be a data OS? I mean, is this becoming a reality, or is that just positioning of how to frame out these data hubs that Cloudera has, the data platform that Hortonworks has? You have SAP endorsing, coming into the show saying, hey, we have HANA. That's a Ferrari. What are you going to call it?

But not a bus. Maybe it's a slow school bus. It's useful for a certain thing.

Very useful for a certain thing. I mean, you can't put 20 people on that bus, and it's more of a Ferrari, I call it. But this is validation, right? It's a bigger picture than just saying a platform, right?
It's the whole picture. We're talking about the whole picture here. What we're saying is that we need an environment where the choice of locality versus cloudiness, the choice of optimization versus schema on read, the choice of preconfiguration versus discovery-based is a meaningful one. Sometimes we can make it in advance and sometimes we can't. There's always going to be a place where we have optimized stores for particular use cases, and there are more of them than there have ever been. So now we've got relational databases. We're going to have graph databases. We've got document stores. We've got key-value stores. They all have their role. The challenge for the implementer of the data fabric for any given organization is going to be, which of these do I use when? How many of these things can I decide in advance? And it becomes very interesting to think about whether maybe the system could help me do that. Maybe I have a true operating platform that can parse the incoming question and determine which would be the best way to answer it. And that itself is a variable answer, because on any given day, maybe it's the end of the quarter and my biggest, most powerful engine is busy cranking through the quarterly SEC reports that I need to have out on time. So let's use some of this other capacity over here. That's a virtualization idea. But virtualization is not just about resources. It could be about the nature of the platform and the task as well. That's a grand challenge in computer science. That's years from now. But we are now building out the components for that operating system you talk about.

That is the distributed mainframe. That's the distributed data center.

Distributed data center, exactly.

So we asked Amr Awadallah about Cloudera, the name of the company. Cloud era. How about that? They pivoted. The cloud wasn't ready when they came out. His original vision was a platform, because he was at Yahoo and he understood these big platforms.
He saw what the web scale was building. They just did Hadoop. But now, interesting, Cloudera, so Dave confirmed that they're not changing their name to On-Premise Era.

Doesn't roll nearly as trippingly on the tongue.

I mean, so is the software-defined data center intersecting with big data? I know you guys cover that. It's kind of a little bit different, an orthogonal thought. But if you look at the engine, you're essentially talking about a bigger picture. That is the data center, whether it's containerized, from guys like IO, to others that are going to be full-on SDDC, software-defined data centers, which hasn't even hit the hype cycle yet. It's still kind of a buzzword, but it's going there fast. So that's on-premise, that's hybrid cloud, and maybe some public.

The world is going to be hybrid. It can't be any simpler than that. Although I think people who are cloud-native today will be cloud-native tomorrow. I don't think anybody wakes up one morning and says, hey, we're a $50 million company now, okay, let's build a data center, right? They're happy doing business in the cloud. Those costs are factored into their business model. They're comfortable with it. Companies that are on-premise today are gingerly putting their first applications out into the cloud, and they will put more of them out over time. So that deployment question remains up in the air, but the direction is clear. We know which way it's going. There's going to be some stuff here, some stuff there. How are we going to keep it synchronized? How are we going to distribute work across the different environments in such a way that we deliver on the SLAs we've committed to our users? Those are big questions, and we're talking about the rise of the machines here. We're not at the Skynet moment yet, but when the machines start to run themselves, things get really interesting.
So, Merv, one of the things you do, obviously, in your role as a Gartner analyst and consultant to your clients is you help them reduce their risk but also identify opportunities. You were talking before about all this diversity of data stores. At the same time, earlier in the conversation, you were talking about governance, security, all this enterprise-ready stuff, which is maturing, but people would all agree it's not there yet, right? So what's the balancing act that you're seeing with your customers in terms of the opportunities? Is it lowest common denominator for enterprise readiness to go after those opportunities, or do those opportunities have to wait until the enterprise readiness is there?

All of those questions intersect with the budget question. So for example, we see the opportunities for significant cost savings in data center modernization, but even with a good ROI case, those things are usually number five on the list of the top three, because somebody else comes along with a top-line project and that pushes everything down the stack.

Perpetually, yeah.

So organizations are confronting the competitive opportunities and factoring them through the requirement to keep things stable and supportable and secure. That's the dance they have to do. They have to be opportunistic enough to take advantage of these things when they come along, but careful enough that they don't break anything in the process, and that's what we spend a lot of our time doing.

Okay, Merv, thanks so much for coming on theCUBE. We've got to break there. We've got our next guest up here and we're a little bit over time, but I want you to put the bumper sticker on Big Data NYC this year. What does that bumper sticker say on the car? What's it saying this year about Hadoop?

From whether and why to how and who.

There it is. Merv Adrian of Gartner here, doing interviews with the press, talking to theCUBE, talking to customers, getting the lowdown. Total thought leader.
Great to have you on theCUBE and involved in the CrowdChat. It's fantastic.

Great speaking with you.

Go follow his blog at Gartner. This is theCUBE. We'll be right back after this short break.