 at Big Data SV 2014 is brought to you by headline sponsors WANdisco, We Make Hadoop Invincible, and Actian, Accelerating Big Data 2.0.

Okay, welcome back everyone. This is Big Data Silicon Valley. Hashtag BigDataSV. This is a continuation of our in-depth coverage of the big data landscape. We were in Big Data NYC just a few months ago. Now we're in Silicon Valley, where all the innovation is happening, and this is theCUBE, crowdsourcing innovation, streaming the data live. This is what we do: extract the signal from the noise. I'm John Furrier, the founder of SiliconANGLE. I'm joined by my co-host, Dave Vellante, co-founder of wikibon.org. We have a very special guest in the house here in theCUBE, CUBE alumni, Kaz from WANdisco. Welcome back again. Always a pleasure to have you. It's a dynamic conversation, great personality, and you know what you're talking about. And you don't hold back, so we love it. So welcome back.

It's a lot to carry on. Thank you, John.

You are a tech athlete and you're not afraid to throw a few punches, and we love the controversy, but let's get down to it. So what's changed since New York City? Obviously New York City is very financial services oriented, very much business, a lot of tech being discussed. In Silicon Valley, growth is happening, valuations are high, a lot of things are happening in the marketplace. What is your view of Silicon Valley, big data, the Strata Conference? What is happening here? What is the core story?

I think what is going on is that big data is catching on more and more with, I would say, the big companies, right, the big enterprises. And as always with any new technology or set of technologies, there's a bunch of hype, there's a bunch of nonsense and noise, but there's a certain amount of real deal in it, right? And among the companies that actually bring the real deal to the table are the companies that stand behind the open source movement, essentially the Hadoop and HBase technologies.
And of course, real-time processing, or close to real-time processing, such as Spark and Shark. And we see a huge momentum gain, essentially, in these companies. As you know, Databricks got funding recently, so it shows that the investment community believes that this segment of the market will go up and keep growing. And Cloudera has actually taken Spark under their wing, so it's an officially supported platform for them as well. But what's interesting is that all these new analytical toolkits still, I believe, struggle to reach the level of real-time responsiveness of HBase, for instance, which has been around for a while. And the community that works behind it has made it pretty much the moving cog of any data warehouse for anyone who needs fast access to data, right? So HBase is there, HBase is growing at a fast pace, and we see quite a bit of momentum, again, from the customers and the like, who are asking about HBase solutions. And again, we being the company that did Non-Stop Hadoop last year, actually, interestingly, about a year ago. At that time, we announced Non-Stop HDFS, and now we have brought to the table the next offering, which is Non-Stop HBase, because we see quite a bit of momentum gain.

So how's the Non-Stop working right now? I mean, as you continue to roll out more and more features for HBase, is it well received by customers from a technical standpoint? Are there specific proof points you can now point to and say, hey, these are the use cases that are exploding in a big way, and here are some new things around the corner that we're attacking?

Actually, as you know, the original Non-Stop HDFS comes in two different flavors, right? It comes for a single data center, the so-called LAN edition, and for cross-data-center applications, which is called the WAN edition.
And what we see mostly in terms of interest from the companies is that the WAN edition is in significantly higher demand. So people actually need to have a way to solve the disaster recovery problem between data centers, especially considering the fact that the traditional ways of doing this are very costly and have limited capabilities, both in range and in data volumes. So Non-Stop WAN is being received very well. We're going into a number of high-profile POCs and it's catching on pretty well.

So HBase is interesting. I mean, obviously it's got a lot of momentum in the marketplace. It's kind of the de facto standard in Hadoop, but there are a lot of complaints about HBase. It's hard, so there's a lot of heavy lifting involved, and it's not non-stop. So you're addressing the latter of those two complaints. First of all, do you think that's valid, what I'm saying? I mean, you hear complaints from maybe not so much developers, but people that are paying developers. Is that a concern of yours? And is the market addressing that? I wonder if you could talk about that a little bit.

Yeah, so as many of the listeners might know already, HBase has redundant failover capabilities built in. Basically HBase contains two main parts, where the administrative piece of HBase, called the master, manages the information about the region servers and tracks the metadata about data placement, table spreads and whatnot. So the redundant failover capability, sorry, I shouldn't say redundant, the failover capabilities of HBase are actually focused on the region servers. So there's a way to fail over a region server if the active region server fails. But the problem, as usual in any distributed system, is that if you're lucky, failover can happen very quickly. If you're not so lucky, the failover can take a few hours.
And that means service interruption, essentially. Another part of the problem is that the HBase client is not HA-capable, unlike the HDFS client. In HDFS 2, for instance, the client can fail over quickly because there is a notion of a proxy provider that helps you essentially bounce from one node to the standby node. And we took advantage of that in our implementation back in HDFS. HBase doesn't have such a capability, so the client is not aware of the possible multiplicity of region servers. So essentially the problem here is that you have a bunch of clients writing data at high speed into the region servers, okay? Region servers, essentially the HBase tables. And a couple of things could happen. If a region server goes down, the HA capability, the failover capability that HBase has right now, can help you essentially re-register the region server on a different server, on a different host. And hopefully everything will go okay after that. But the clients will essentially fail the transmission because they have no way of switching over quickly, right? So what we're trying to solve in the offering that was announced yesterday is essentially this part, right? We're trying to solve the failover, or make multiple active region servers available, similarly to what we've done for HDFS.

So you're saying the problem is that the recovery is very variable and unpredictable for clients today, pre-WANdisco.

Operational complexity is there as well, right? I mean, if you recall our conversation from last year, we mentioned that the operational manual for QJM with ZooKeeper is whatever, 40 pages, 24 pages. I don't remember the number, but it's complex and huge. And HBase, as I said, is even more hardcore and even more sophisticated. So the operational headache is actually bigger.

Yeah, so the big chunk of that operational headache that I was referring to in my question relates to recovery and the whole failover procedure.
And so you're dealing with that sort of out of the box, is that right?

Yeah, we essentially give the clients the ability to run multiple region servers, right? And they are all active; they can serve clients at the same time pretty much without any overhead. I mean, there's some overhead on writes because we need to reach consensus and all that stuff, but it would actually benefit from the multiple region server architecture because natural load balancing will help bring up the speed.

So Kaz, what kind of apps, use cases, examples do you see your solution being most popular in? What do you expect the uptake to be?

It's sort of an interesting question. I'd say, again, my guess is as good as any, right? So it's probably better than most, anyway. Better than John and James. We're both, if I'm not wrong. But I think what's going on essentially is that most of the real-time or close to real-time oriented applications would benefit tremendously from this, because for these guys even 10, 20, 30 seconds of downtime might mean a huge loss of data. So that would be my guess. That applies to anybody who is using HBase in production. For instance, a good example would be eBay search. I mean, they're using HBase to power the search engine. So imagine HBase goes down at eBay. And I know for a fact that eBay, for instance, has a pretty sophisticated harness around their HBase setup in order to provide that failover capability as quickly as possible. So I'd say guys like this would be actually very efficient.

So you mentioned search. And that's sort of an e-commerce play. I would think ad serving, maybe. Is that right? Maybe fraud detection? I'm not an ad guy at all. But you would think HBase is being used. I would think in certain situations it would be, yeah. Maybe, maybe not. I mean, you see some other new databases emerging, but still.

True. Yeah, every single day, there is a new technology coming around.
And this year at Strata, I found probably another 20% of companies I had never heard about, 20% more compared to last year. I mean, they might be here for the whole next year, or they might not. They might develop something that will catch on in the market. So we don't know. But what I'm saying is that there's always competition.

We heard that yesterday from the CEO, John Schroeder at MapR. He's like, I walked through the hall. I saw companies I've never heard of before. And half of them won't even be around. So that was my comment. It's like, look to your left, look to your right. They may not be there. It's kind of like that when you go to school. But more importantly, I've got to ask you. We heard some great quotes. I remember the first HBaseCon; we covered it on theCUBE with Cloudera when they launched that. The quote that I liked there was, HBase is like a fine tailored suit. Suit it up. It's perfect. But if you want to try to put it on somebody else, it's really difficult, which kind of points to the use case of what HBase did at the time. The quote we just heard from a CEO, an ex-Microsoft guy, was that it's like a big bodybuilder: big built-up muscles and thin legs. Professional services, and you have skinny legs on Hadoop, or HBase in this case. So what's your take on that? Because you're seeing that HBase is a very viable product. It's growing fast. Is it expanding its use cases? What are people doing with it? What are some of the trends around HBase that you can share with the folks, both technically and just from a relevance standpoint? Is it evolving faster than you expected? Slower than you expected? Are the ecosystem and the contributors growing? What's your take on HBase? Give us the update.

I think HBase, and again, I'm not a committer to HBase or whatever, so I cannot speak for the community. But what I observe from the outside is essentially that the community is very focused on bringing a stable, API-complete version of HBase into the game.
So for instance, right now, there is very active work on the 0.96 release line of HBase, which they call the singularity release, which would be the last incompatible sort of release. I mean, it's a huge jump. Everything's going to break. Everything's going to be incompatible. But after that, they promise to be stable and stay at the same spot forever. So the community is very busy working on that. And it's great to see that there is no stagnation. There are no slowdowns. People are full of ideas, people are full of creativity, essentially, and they're moving forward very fast. And with the stability release in place, I think this fast move forward will be oriented toward better performance, better usability, better operational procedures, and that kind of stuff. But as with any open source product, there is always this last mile between the open source release and the customer. So somebody needs to cover this gap. System integrators need to step in and make it usable for everybody out of the box, essentially. Because as we mentioned a couple of times before, operational complexity is what's killing a lot of these software offerings, in my opinion.

So I want to ask you, Kaz. We were talking before about the 20% of the companies on the show floor that you hadn't heard of, presumably new startups. You have been in the inner technical circle of Hadoop since the beginning. And some of the other companies that you see on the Strata floor are ones that maybe weren't here at the beginning either. You see EMC, a lot of the big whales coming into the community. And so you see them trying to co-opt the big data theme and doing a great job of marketing. My specific question is, what's their contribution to the core of the technical community? Are there certain companies that are stepping up and bringing their capabilities to open source? Is it mostly pure marketing?
What do you see there? And what's the community's response to these big companies coming in? It's positive on one hand. Are they delivering the kind of contributions that you guys would like to see?

Well, I want to compare, and I'm not the first one to do this, open source development with a sort of evolutionary process. I mean, nobody drives it to a particular goal. So we're sort of trying thousands of different approaches and seeing what works better and what sticks. It's like throwing something at the wall and seeing if it falls off. And the big companies coming in apparently have a certain agenda: they want to achieve certain results performance-wise, feature-wise, and that stuff. And I think the open source community is actually thrilled to see this, because the more people contribute to the product, the better the product and the better the use cases the user base will get out of it. Marketing-wise, as you said, yeah, there's quite a bit of hype here and there. Some companies just want to ride this wave: OK, we are big data guys because we said so. In terms of contributions, the original, I would say, core teams that were focused on Hadoop and HBase and so on are still around, they're still very actively working. And they sit, I'd say, in probably three, four, five companies, mostly around Silicon Valley, right? I believe what happens is that the companies where most of the open source contributors are manage to create this climate of engineering creativity, a cultural kind of paradise, if you will, right? Where people can contribute most effectively. I don't know if the big companies can do this. And again, I've worked for big companies in my life, such as Sun Microsystems, for instance, rest in peace. But I think that Silicon Valley startup spirit should be there, right?
I mean, you have to innovate, and for innovation, you have to collaborate very quickly. You need to bounce ideas off your colleagues and your peers and people in other companies and all this stuff. And I think, actually, at WANdisco, one of the reasons, again, self-plug, but one of the reasons I like the company is that we're always bouncing ideas off each other. We're always trying to help. Remember Cloudera back in the day when you guys were sitting on the floor there? Yeah. Something similar. So I don't know if that spirit was at Cloudera. Remember the old days at Cloudera? Well, in the early days we were there. How many like that? Fewer than 30 people. Yeah, it was a small company at first.

But Sun's a good example. I mean, Sun always had an open culture, an open source culture. I mean, even though Linux became Sun's snake oil anyway. But are there large companies, in particular, that stand out? I mean, for example, John and I always talk about IBM's open source mojo. And I don't want to put you on the spot, but is that illusory? Is that real? Are the big companies sort of all together in this? Or are there some that really stand out?

I think, again, each company has an agenda in there, right? So for instance, if you take a look at Intel, I know that Intel is very active in the Spark community. If you look at Pivotal, Pivotal seems to be very active in Groovy, SpringSource, that sort of community. And now they're ramping up their contributions to Bigtop and Hadoop and whatnot, right? I don't think you can paint them all with one brush, if I understand the question correctly.

Yeah, yeah. So you're saying it's aligned with their agenda?

Right. So for instance, Microsoft is very interested in bringing the Microsoft bits and pieces into Hadoop, right? And I'm not sure if they're doing this in good faith or not. Again, not my call.
You've kind of got to trust that they do and hope that they do, right? Right. I would ask them. I'll fill in the ones. Yeah, right, exactly. So anyway, but yeah, the modern media, anyway. Yeah, we'll leave it. We'll leave that.

Kaz, great to have you on theCUBE, great to hear from you. Give us the final word on what's happening here this year at Strata Conference and Big Data SV. What's the big revelation from your standpoint?

I think what I'm really thrilled to see is, as we just discussed, that a lot of big companies are there on the floor. And they're literally trying to start providing services around big data. It might be Cassandra. It might be NoSQL, Mongo, HDFS, Hadoop, whatever. But they're getting on that bandwagon, which means that customers will have more choices. Customers will have more to pick from. More competition, fiercer competition, will surface the best solutions on the market. So I mean, I like the momentum. I like the speed with which the market is moving forward. And there could be different explanations for that. But still, I believe that at the end of the day we will all gain from that high-speed craze, if I can put it this way, where everybody's getting on board every day. Every day, new names are popping up.

And the marketing dollars, too. The big companies want to do market development for the little guys? All for it. And then the little guys have to compete on the capabilities that they bring, the differentiation, and solving problems that the big guys don't.

OK, this is theCUBE. We'll be right back with our next guest here. Live in Silicon Valley, this is Big Data SV. The hashtag is hashtag BigDataSV. Go to crowdchat.net/BigDataSV and join the conversation. We're posting photos there. You can post videos. You can post photos, post commentary. And it automatically loads the hashtag. So join us and ask us anything. We'll be happy to answer your questions.
I'm John Furrier with Dave Vellante. We'll be right back after this short break.