Live from New York, it's theCUBE, covering Big Data NYC 2015. Brought to you by Hortonworks, IBM, EMC, and Pivotal. Okay, welcome back everyone. We are live in New York City, here at Strata Hadoop, our event, Big Data NYC in conjunction with Strata Hadoop. I'm John Furrier with Dave Vellante. This is theCUBE, our flagship program, where we go out to the events and extract the signal from the noise. This is day three of three days of wall-to-wall live interviews, doing videos every day, having a big event last night. There's been some new research. Day three coverage, live in New York City. Again, one block from the Javits Center, this is our new location, our studio, got the streets behind us. Here with Dave Vellante, kicking off day three with our special analyst segment here on theCUBE, a CUBE alumni, Merv Adrian, the chief analyst for big data at Gartner. I added chief, boss, man, bad-ass, all great stuff. Just one of the troops, just one of the troops. Great to see you. Welcome back. Dave, great to see you again today. Guys, so kicking off day three, love to do the editorial roundtable, analyzing the horses on the track, analyzing the data, because this is a big data show, and there's a lot of data. So, Merv, I got with you last night, we were talking at the event about some of the data. You have survey data, multiple surveys. You're out, you are the busiest person I've seen in the analyst community. You do meetings all the time. I do. And you're back to back to back. So, couple things: the vibe of the show, who you're talking to in terms of company size, what are some of the things on people's minds that they're asking about, and the things that they're pushing and peddling. And two, then we'll get into the survey data. So, sitting where we sit, I should quote the old Saturday Night Live line: you ask a lot of questions for somebody not far from New Jersey. But yeah, so, it's a big show, and it's astonishing how it continues to get big.
And there's this new piece of Javits, Javits North, which is kind of an odd construction. It's this giant, like, empty hangar where they're holding the keynotes. It's a little odd, but the crowd is, I think, trending as it has in the last couple of years to a much more business-focused audience. There's a lot of technical content here, but there's an awful lot of business people. So, there are a couple of trends, I think, that go along with that. Business people don't really want to find out whether this is version 0.1.4 or 0.1.5 of this project that's part of the stack, that's part of the project, that's part of the overall architecture. They want to know what you can do for me and how easy it is going to be. So, for me, one of the more interesting trends here has been the number of vendors. I'm not going to name them because I'd leave a couple out and I don't want to disadvantage some over others, but there are several vendors here whose entire reason for being, and entire reason for being at this show, is "we make deploying Hadoop and Spark easier." And the reason is because it's still really hard. You mentioned data, here's one. We've been asking the Gartner Research Circle for several years, what's the biggest barrier to adoption, and I told you several years ago for the first time, skills was number one. All the vendors have been investing very aggressively in attempting to solve that problem. They're doing training, they're doing certification. Even the guys at O'Reilly are working with, I guess, Databricks to do Spark certification here. But in three years, the number hasn't gone down. It dipped a little bit last year, came right back up this year. Skills is still the number one problem. So we see vendors now who are coming into the market saying we can make this easier for you. But you also, I saw a tweet on the first day you had mentioned, and we'll get into the survey data, I want to ask that specifically. Sure. It's that the POCs are certainly on the rise.
There's been some growth. So it's the classic hockey stick, where's the inflection point. But then you made a comment that the production deploys are not massively growing. So let's unpack that. Let's dig into the numbers. But before we go there, so does that suggest that the complexity of the ecosystem is increasing faster than customers' ability to deal with it? Yes, I would say so. The complexification of the Hadoop stack is not diminishing. I've been tracking, you guys have seen it, the number of projects in the stacks supported by the distro vendors year on year. And it's not slowing down. We've got more new projects, that mostly are not even supported yet, to talk about this year than we had last year. Everybody's running around talking about Flink and Apex and Gora and, you know. There's so many numbers and names you don't even know yet. We won't even, we don't. But yes. So they're filling the holes in the white space. Same time, the demand on the market side is interesting. So we're kind of getting a feeling, kind of a surround sound buzz, like okay, we can't move fast enough in the ecosystem to plug in functionality when the demand from the customer base is "I want product now." So there's almost like a supply problem. Well, there's an impedance mismatch here. You've got an audience that's increasingly focused on solutions, not technology. But the technology is different already this year than it was last year. And so the people who have to deliver it haven't even got their story together. And the skills of the people who have to deliver it are not yet in place. So these things are colliding- There's operational issues on operationalizing something. I think that slows the market a little bit. But again, I said this last night on the panel, if you begin with a focus on what solution you're trying to deliver, not which technology you're going to build it with, you can get where you need to go. That doesn't mean it's easy.
And part of the problem with the skills thing I mentioned is you get people trained in 2014. Those skills are useless this year. There's three new things that are going to have to be in this stack that they weren't trained on. Anyway, let's talk this first. Yeah, let's get into the data. So a little preamble. Gartner has this thing called the Research Circle. It's a sizable panel. It is thousands of companies that have volunteered to be interviewed on a regular basis about a variety of technologies. It's representative across verticals. It's representative across geographies. And it's representative across company sizes as a panel. For any individual survey, we get some subset of those people responding. So they may vary a little bit one-to-one, but we're very careful about what we can say and what we can't say based on statistical significance. Okay, that's the preamble. There are two surveys that are relevant here. We do one on big data. And starting this year, we did one on Hadoop. They are two distinct surveys. So this is where the numbers get interesting. I'm going to refer to a piece of paper here. I apologize, just so I get this exactly right. Bring the data, bring the signal. All right, so the big data investment is continuing to rise. And it is now up to 76% of organizations investing already or planning to invest in big data. That's a great number. A lot of people have been talking about the number that came out of the Hadoop survey, which said 42%. That's because not everybody thinks Hadoop equals big data. And they're right. Gartner has always said they're not the same thing. In fact, when you ask people what technology they want to use for big data, the number one answer, and it has been for three years, is enterprise data warehouse. The number two answer is cloud. The number three answer is Hadoop. And that's about 40%. Now that's kind of interesting. Here's another one. This year we added a new item to the stack that they could select from.
That item said Spark/Flink, because we're not quite sure, although it certainly looks like Spark has the lead. Only 9% of people even knew what we were talking about and picked it. So it's still early days in the technology stack. The solution question, the problem statement of big data, is well ahead of the technology selection statement, which is Hadoop. So that hopefully unpacks the numbers a little bit. Yeah, so that's interesting, data on Spark. Like I said, this is survey week. Well, you saw Databricks came out with a survey, and it's "Spark is going to take over the world." Of course it is. The diseases company came out with a survey that said the exact opposite. So that's why we like talking to you, because you're independent. You don't have a dog in the hunt other than you want to help practitioners. So what do you make of all the Spark buzz? People talk about Spark simplifying things, but people also say Spark's still really complicated. You've got to know Python. I mean, is it just moving the needle a little bit, or is it more major in your opinion? Well, I think you can get some insight from that Spark survey by asking people what language they were using. Because that's one of the more interesting facts in their survey: the number one and number two languages were Scala and Python. Go find out, out there in the broad market, how many people are working with Scala and Python. Go to one of their conferences and they'll tell you that Spark is the best thing on earth. So it depends on who you're asking. Our surveys are of executives who wouldn't know Scala from Python if you put the code in front of them. So it's not the same people asking and answering the question. Spark is clearly a superior technology to MapReduce for a very large subset of the things you want to do. MapReduce is fundamentally batch and Spark is not.
In fact, the biggest pivot here, in my view, if you're looking for the headlines of the show, is we are pivoting the entire Hadoop community's focus from data at rest, downstream after stuff is done, to data in motion. Let's work on it while it's flowing by. Everybody's talking about ingest, everybody's talking about streaming. The projects we're talking about, the products we're talking about, are increasingly trying to answer the question while things are happening rather than afterwards. We think that parallels a development that we've been tracking in the DBMS market that we call hybrid transaction analytic processing. We talked about that yesterday, HTAP. So everybody's trying to go after this same question. How do I move upstream and do things when it makes a difference rather than describing it after it happened? And it depends on which market you start from, how you describe that problem. That's not just a database issue, too. We had a quote on theCUBE that said not all data will hit a database, to your point of data in motion, streaming. You've got flow theory, you have all kinds of real-time stuff where machine learning and some of the systems intelligence stuff that we've been covering kind of points to cognitive, to use an IBM term, this notion of letting machines do more than the human can actually handle. So this comes back down to the human and the machine. Well, the cognitive thing is fun. Is it going to be Watson, Cortana, Siri? Who's going to be the face of my cognitive computer? And if that pivot would turn- Siri? I mean, right now. I mean, Siri is a touch point for Apple users. I mean, that's an interface. Well, I would encourage anybody who's listening to this to go Google or Bing or whatever "Cortana, Siri, Arsenio," because there's a wonderful dialogue between the two of them that Arsenio staged that's hilarious. So that pivot is interesting. You know, they're talking about ingest and dealing with the data in real time.
It also suggests as well that there's another process that companies have to invest in. I wonder if you could comment on this, one that just makes it better. Your fraud example is great, right? You're flying across. You do something, a transaction on the plane. The next day you get a fraud alert. It says, did you do something in a fast food, you know, location? I was like, okay. I bought a sandwich on the airplane. Let me think about that. Exactly. And so there's got to be a process at the back end that sort of improves whatever that is, machine learning and skill sets, et cetera. What happens when technology goes wrong? It's a big question. The adaptivity of these things is dependent on whether they're working from a set of fixed rules that we based on a predictive data mining algorithm we ran a year ago, or whether they're operating on real-time data that adapts to real-time conditions. That's the problem we have to solve. The processing is also hard. If you've got a huge amount of data coming through, say, in real time, and let's say a Facebook or any large-scale transactional system, whether it's from different databases or not is irrelevant, you might miss something. So the math has to be really on the money. And databases aren't enough. Things don't get into the databases until we're done. Right? So we need both the database and the stream. The word hybrid gets used for a lot of things, so I won't use it here. You get a fabric of compute, a fabric of engines and a fabric of stores. And in fact, arguably, one of the other big themes of this event is the store at Strata. Yeah, I like that one. We've got, we've got. Roll my eyes on that one. You can roll your eyes if you want. Hey, I'm not a headline writer, but I play one on television. Well, the story, come on, the storage has been there out there on the TV. No, no, no, no, no, no, no. Listen, you've got Cloudera introducing an entirely new store called Kudu, okay? What's interesting about this?
They're doing what DBMS vendors did years ago. Anybody who knows DBMS knows that they don't store their stuff in the file system. They bypass the file system and they write the storage themselves. In the Hadoop community, there's been an article of faith that if there's anything that defines what Hadoop is, it's MapReduce. Well, maybe not. We're off to Spark already. And HDFS. And maybe not. Maybe we're going to bypass that because it has its limitations. Now the folks at Hortonworks will tell you they're working very aggressively on the next gen of HDFS, which will address some of its limitations. But Cloudera's opted to bypass it entirely. Maybe we keep the APIs so everything else works. MapR is pushing on theirs, MapR-DB, which is essentially a variant of HBase. And they've added JSON to it now, so it's more of a multi-model store. The store is really interesting. Amazon? Yeah, Amazon's got S3. And oh, by the way, we should talk about numbers there in a minute. But the point is that different stores and different engines are going to proliferate across the fabric because they all have their uses. So it's not going to be monolithic. And the environment is going to get increasingly chaotic. There's lots of moving parts. And at this stage of the developing market, vendors who are playing here see an opportunity for differentiation by picking one highly valuable use case and saying, we've optimized for that one. If you need to do this, come to me. We're back to engineered systems. I want to go, this is kind of like the whole monolithic purpose-built stack. But you mentioned something earlier. I want to drill down on it real quick because we don't have an awful lot of time. But you said the problem statement is far ahead of the technology stack slash technology, which would kind of say, okay. Not quite what I said. The technology is moving at a different rate than the problem statements are. Many problem statements can be solved with today's technology.
Some can't be solved until we get tomorrow's technology. And it's hard to tell the difference. All right, well, I'll just say, as a matter of fact, customers want the solutions, right? And so there's this mix and match. So the question I have for you is, we've been covering this territory since it started, present at creation. We had a headline yesterday in SiliconANGLE: it's whale season. All the big guys are in, IBM, EMC, Cisco, HP, everyone's in here, right? So how does an ecosystem, how does a Cloudera, they're worth $4 billion on paper, how do they break out? If the technology is moving really fast, and now storage, Kudu for Cloudera, you've got this proliferating in here. Can't the big guys just co-opt this whole thing? I mean, how do the companies grow to be billion dollar companies? Or do the rich get richer? We've seen this movie before. Last time in this space, we saw it was the analytic DBMSs. You remember that: Vertica, Netezza, Aster, Greenplum. There was a period where they were all still there, and 18 months later they were all somewhere else, right? Does that happen this time? I think there's a fundamental difference in the market from that period. Every one of them had a protected IP stack that was different. Most of these guys have IP stacks that are mostly the same, and it's open source. So what does that do to the valuation equation for people who do M&A at the big vendors or at the venture capitalists who are occupying them? It makes it go poof. Acqui-hires. You buy customer lists, you buy organizations, support organizations and sales organizations. Those are the things that are valuable. How sustainable is the technology? Is it well engineered? Those are all very serious questions. They take a very different kind of due diligence than I think used to be done when it was a protected IP stack and all we had to do was see if the customers were basically happy. And that's oversimplifying.
But I think the model will be different this time. Is there a shout out from AWS on this ecosystem? I'm glad you asked that question. So let's cloudify this a little bit. Or Azure. Yeah, well we- Go Google. We've all been tracking this market for several years and we all sort of had this bubble over in the corner that we didn't talk about too much called the cloud because all the numbers were invisible. Everything was very opaque. Amazon didn't report any number except a single one for several years on its entire revenue. Gartner has arrived at some estimates that we use internally in the hundreds of millions of dollars for Hadoop. For all we know, Amazon's making more money on this than all the independent distros who are selling on-premises put together. I don't know that for a fact, but I know they're selling a hell of a lot of it. And when we asked the question of the Gartner Research Circle, we found 50% of the customers who responded said they were on-premises. 50% of them said they were in the cloud. About a third of them are in both places. This market has been dominated by conversations about on-premises. So the cloud conversation is really just beginning. But you had the 50% numbers, interesting. It certainly is. If it's just beginning, that's kind of like denial. Well, it's not denial. It's the fact that we didn't have visibility into the data, but it's a dramatic impact on the market. And technically, technologically, the question of how do we live in a hybrid world is just as it is in the DBMS market, a question that's just beginning to be asked, people are going to have a foot in both worlds. The cloud-to-ground problem, the ground-to-cloud problem, whether it's moving or synchronizing, is going to be a formidable one. And to your point, Dave, as the whales come into the game, some of them have already spent a lot of time on this problem, maybe not in the Hadoop stack. But if you ask Microsoft about living in both worlds, they've got a great story. 
They've got stretch tables, they've got SQL Server. And the cloud guys are simplifying and making it easier to consume that data pipeline as a service. You want to move data back and forth and synchronize it? Oracle's got a great story with GoldenGate. It's like a Game of Thrones episode. They've got muscle and they have an army of people, both tech and sales. The big guys are not flat-footed in this game, it's weird. I mean, they can pivot, too. So to your question about M&A, maybe the red wedding is coming. I don't know. Got to have a Game of Thrones reference. Merv Adrian from Gartner here inside theCUBE, breaking down the analyst session where we extract the signal from the noise. Great to have you on. And of course, always great to see you. And I know you're working hard and you've got a busy schedule. Thanks for taking time out of your busy day. Let's end this in our traditional way. Let's talk about the bumper sticker for Strata Hadoop, part of Big Data NYC, our event, which we're running. So summarize what's going on in Big Data NYC this week. A lot of action. Don't roll your eyes again. All right. The store and simplicity are the story. It's about what's happening at the data layer and it's about whether I can deploy easily. Those are the biggest barriers. The market is growing. Last time we talked, I think I held up a glass and said this glass is half empty. That was the conversation about the numbers. I gave you the numbers already. It's clear what direction we're moving. We're going up as long as we don't screw it up and derail it somehow. If we can keep it simple and make it deployable and remain solution focused, this market has a lot of legs. Final comment I want to ask you, since we're in New York City, Hell's Kitchen is nearby and actually Hell's Kitchen is hot. I've got to ask you about that store and simplicity thing. Data layer, how hot is it in the kitchen in this world? Is it burning up? Is it really competitive?
What's your peg in terms of degree of competitiveness for that data layer? Because store and simplicity, that's a secret formula. If you can make something simple, elegant and easy, and it integrates well in the cloud, AKA what you're saying, that's a winning formula. How competitive is that action? It's extraordinarily competitive. The fact is that several different players are all introducing major updates here, some of them entirely new things. Hortonworks has an entirely new business line with DataFlow to tackle this question of streaming. Does that remain a separate thing? Or does it become just part of, quote, the Hadoop stack? You've got people who are doing metadata crawling of your lake because so many people are running around saying just dump it in the lake, we'll figure it out later. Ask Tamr about that. They're selling people, actually excuse me, they're giving away a free crawler so you can go crawl your lake after you've dumped everything into it and figure out what's there in some semantically useful way. There are so many facets to this that it's going to remain interesting for those of us who do this for a living for years to come. It's hot as hell in the kitchen here in the Cloudera, Hortonworks, Big Data, Hadoop, all breaking out in its own way. Merv, thanks for sharing the data. This is theCUBE breaking down all the analysis. Horses on the track, whatever you want to call it. Competition is hot. Horses are next door. Horses are next door. We'll be right back after this short break. Thanks for watching live in New York City. We'll be right back at theCUBE. I'm John Furrier with Dave Vellante and Merv Adrian from Gartner. We'll be right back. Thank you.