Okay, welcome back. We are here live in Silicon Valley, the heart of Silicon Valley, at the San Jose Convention Center. This is SiliconANGLE.com and Wikibon.org's exclusive coverage of Hadoop Summit 2013. We're wrapping up day two of two days of amazing technology, innovation, developers, companies, entrepreneurs, building an industry all around big data, and Hadoop Summit is where the action is. This is theCUBE, our flagship program where we go out to events, extract the signal from the noise, and talk to the best and brightest, and we have that here today. Our good friend, who knows the architectures that matter, and we're going to hear about that today. I'm John Furrier, the founder of SiliconANGLE, and I'm joined by my co-host. I'm Dave Vellante of Wikibon.org. Konstantin Boudnik is here, and he is the director of engineering, big data, for WANdisco. Super alpha geek, working on Apache Bigtop, an Apache Hadoop committer. You know Dr. Oz; we call him Dr. Cos. Welcome to theCUBE. Thank you, guys. It's a real pleasure, actually, to be on the show. I am a big fan of yours, and I've been following SiliconANGLE since pretty much day two, when you guys were sitting in the same offices as I did, across the floor, and it was actually amazing to see how this whole, you know, number one media coverage in technology today company came about. So it's actually an honor. Well, thank you. We're really proud of that, and it's people like you, Konstantin, that have really helped educate us on markets and point us in the right direction. You know, I called you an alpha geek. I hope you accept that as a compliment. But seriously, John Furrier in particular has a nose for watching guys like you and marking those trends, and that's what theCUBE's tried to do. It's tried to lead that. And highlight, Dave, the tech athletes, right? 
So to us, you know, we love theCUBE because it allows us to have a conversation, you know that, but more importantly, more than ever, the social media, the social web is about people, and we believe that to be true. And ultimately, the tech athletes are the guys who are sprinting the marathons: it's the entrepreneurs, it's the guys building the technology. That's what this Hadoop Summit's about. Konstantin, you know, we were talking about that the other night, and we appreciate the comments, and we'll use that number one and put a press release out tomorrow: we're number one in technology coverage, and you're the source. But in all seriousness, WANdisco, you guys are at the center of the conversation here at Hadoop Summit around your technology. You guys are doing some things that are pretty compelling, a little bit different, but yet getting the attention. Please explain to the folks here what you guys are doing and why it is attracting so much attention. Yeah, everybody's talking about it, you know, making Hadoop enterprise ready, enterprise grade, and this is what you guys do, right? Yes, essentially, yes. So one of our key, I would say the most key, technologies that we bring to the Hadoop space is the solution for the single point of failure in HDFS. It's a big problem for any system, but specifically for a system that can handle tens and literally hundreds of petabytes of information on your enterprise infrastructure. If a file system of that size goes down and becomes unavailable, it's actually a big problem, because essentially everything around it becomes frozen, right? So you cannot run your analytics jobs, you cannot run your MapReduce jobs, you cannot run your Hive queries. It's done; your data is unavailable. 
So what we're trying to do, and actually what we did, is use one of the quite old technologies called the Paxos algorithm, a distributed coordination approach, actually, to bring multiple NameNodes to a single Hadoop file system. So basically, the traditional high availability architecture in HDFS allows you to have a main master and what is called a standby node, where all the journal transactions are written in case something bad happens to the master. Then you can copy over the edit logs and try to spawn a new NameNode. But it usually comes with downtime. It could be five minutes, it could be half an hour, it could be a minute and a half, but some time anyway. So if five nines of availability is not enough for you, and you absolutely have to have 100% availability, the traditional HA is not an answer for you, because you're going to be losing time. So essentially the only game in town that can help you is WANdisco's system, where we guarantee 100% uptime availability, literally 100%. That's a pretty bold claim. So let's break that down. You're talking 100% uptime. That's obviously the talk of the show. Just put a tweet out there. How do you guys do that? I mean, that's a really bold claim. Yeah, so this is the bold claim, and of course we cannot guarantee that your power would be 100% up, right? I mean, we do not deliver magic. We cannot guarantee that a switch would not blow up in a rack, right? We cannot control that. But when it comes to availability of the NameNodes and availability of the metadata in HDFS, we guarantee that, because we run multiple masters, and every single client of yours can work with any of the masters. And if one of the masters dies for whatever reason, the rest of them actually keep going. So as long as you have a majority of NameNodes around, you're totally fine. So if you have three of them, you can kill one. 
If you have seven of them, you can kill up to three, and so on and so forth. And it actually depends on your use case how many NameNodes you want to have around. And as we talked about the other night, you don't do this using, like, bit slicing and... No, no, no. If I understand it right, in Hadoop the metadata is separated from the data. Now what typically happens at a distance is you would copy that metadata, and you're paying the typical asynchronous RPO penalty: the amount of data that you might lose if in fact that write has not occurred at the remote location. You got it completely right. So basically our approach is share nothing. Our NameNodes do not share the storage; every single NameNode has its own copy of the metadata. And essentially our approach is very simple, right? Instead of dealing with issues once they happen, we actually guarantee that the issues do not happen, right? So if you try to create a file, the NameNodes come to a consensus first, before the file is created. And once the consensus is reached, the information about the state of the file system across all the NameNodes becomes actually the same. And every one of them keeps its copy separately. So this is why, when one NameNode dies, the others don't care. So I kind of foresee your next question, right? So you can say, okay, what if the... He'd be a good Cube commentator, too. Just give him the mic. Why even be a guest? Why don't you be a host? I think you could foresee. Are you making me an offer? No, you're making more money where you're at now. Stay where you are, doctor. Okay. But you might say, okay, so what if the dead NameNode actually came back to life, essentially, right? And wanted to try to steal the clients and give them the wrong information. Right. So essentially the split-brain problem, right? 
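As an aside, the fault-tolerance numbers Konstantin cites (kill one of three NameNodes, up to three of seven) are just majority-quorum arithmetic. Here is a minimal sketch of that math in Python; it assumes nothing about WANdisco's actual implementation, only the general Paxos-style rule that a majority must survive:

```python
# Majority-quorum arithmetic for a Paxos-style coordinated cluster.
# Illustrative only -- not WANdisco code. With N coordinated nodes,
# the system stays available as long as a strict majority is alive.

def max_tolerable_failures(n_nodes: int) -> int:
    """How many nodes can fail while a majority still survives."""
    return (n_nodes - 1) // 2

def has_quorum(n_total: int, n_alive: int) -> bool:
    """True if the surviving nodes still form a strict majority."""
    return n_alive > n_total // 2

# The examples from the conversation:
assert max_tolerable_failures(3) == 1   # three NameNodes: kill one
assert max_tolerable_failures(7) == 3   # seven NameNodes: kill up to three
assert has_quorum(3, 2) and not has_quorum(3, 1)
```

This is also why such deployments use an odd number of nodes: going from three to four raises the quorum size without tolerating any additional failures.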
So actually, no, we don't have the split-brain problem, because once your offline NameNode comes back, it won't become active and won't be serving clients until it actually reconciles the information it has with the majority of the NameNodes. Until that knowledge propagates. So you use the Paxos algorithm, which has been around for, you know, a long time. 30 years. 30 years, I was going to say 20 plus. But okay. So it's very well understood, and you use this concept of eventual knowledge, right? And then the application sees one copy of that metadata, right? And so that is unique in the marketplace. Now, we talked the other night about, you know, other examples of this. I threw out Google Spanner, you know, as a technique to have sort of globally distributed, you know, coherent data, right? But share with us what you told me in terms of the differences. Well, the first and foremost difference that most people would probably care about is that, well, how many companies like Google are there in the world, right? Okay, here we go. So how many companies can, okay. How many companies can actually afford to build something like Spanner, right? Maybe Yahoo, pretty much the same, right? However, there are Fortune 500 companies, there are Fortune 1000 companies, and maybe 30% or 50% of those companies need continuous availability, right? They cannot afford to hire the staff that Google has, right? So that's actually the most important business case from my standpoint, yeah. Technology-wise, we can debate whether Spanner is better or Paxos is better. It's not what really matters, seriously, because we allow you, for a very reasonable price, to guarantee the absence of the SPOF in HDFS, which has actually been haunting this community for quite a long time, right? And there are other attempts to solve the problem; MapR is trying to do this and Cloudera is trying to do this important work. But as I said, those are highly available systems. 
They are not continuously available, and this is a very important distinction. Well, I mean, conceptually, the way the data center world has solved this problem in the past is essentially by brute force: making copies and making mirrors. Mirrors and three-site data centers are very expensive. And another problem with mirrors, actually, is that in many cases you cannot guarantee the sync-up between the mirrors, because it's one thing when you do the mirror for backup purposes, and another thing when you're trying to actually mirror a number of the data sets and let people use them and update them. And then you face the problem of how you will sync them back up, right? And that's actually, essentially, very hard. So talk a little bit about Bigtop. You're obviously heavily involved in that. Share with our audience. Yeah, Bigtop is another Apache Software Foundation project. I was one of the co-founders of it, back when I was working at Cloudera. And it has been open sourced, went through the incubation period, and became a top-level project recently. Essentially, Bigtop is a framework that guarantees and allows you to build software stacks, not necessarily Hadoop, but software stacks with preset characteristics, which means: I want to have Hadoop of that particular version, I want to have HBase of this particular version, I want to have Pig, Hive, and the rest at these particular versions. And then, when I've got all of them, I want to build RPM packages, and I want to guarantee that these bits will work together nicely. So do some sort of integration validation, system integration testing, right? So Bigtop gives you the ability to define the stack, build the stack, deploy the stack using Puppet, and actually test it with its own integration test framework. So it's essentially an all-around solution for software stack developers. Talk a little bit about, you know, John mentioned tech athletes. We love that term. 
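To make the Bigtop workflow just described a bit more concrete (pin every component to a version, then validate the set as one stack before building and testing), here is a toy sketch. The component names and version numbers are illustrative only, and this is not Bigtop's real stack-definition format, which lives in the project's own build files:

```python
# Toy illustration of the "software stack with preset characteristics"
# idea behind Apache Bigtop: pin each component to a version, then run
# a trivial sanity check before packaging/deployment/testing would begin.
# Versions below are examples, not a real Bigtop bill of materials.

stack = {
    "hadoop": "2.0.5",
    "hbase": "0.94.5",
    "pig": "0.11.1",
    "hive": "0.10.0",
}

def unpinned_components(stack: dict) -> list:
    """Return components that are missing a version pin."""
    return [name for name, version in stack.items() if not version]

# Every component is pinned, so the stack definition passes the check;
# a real pipeline would go on to build packages, deploy, and run
# integration tests against exactly these bits together.
assert unpinned_components(stack) == []
```

The design point is the one Konstantin makes: the unit of validation is the whole stack, not any single component release.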
Just watching the folks at WANdisco. I mean, you guys obviously made an acquisition. You've got people like yourself that came from Cloudera. How is it that this little company, a self-funded company that is obviously now a public company, was able to attract such talent? What is it about the culture of WANdisco that's quite unique? I think it might be one of those cultural things where, as you said, it's a self-funded company, like you guys actually, which fascinates me. I've tried to start four companies in my life. All of them were started on self-funded money. None of them actually went through. So I know how hard that is. So I actually have deep respect for guys like you and for guys like David Richards, who were able to find the right mix of innovation, development, and sales technique to literally start selling product since, like, week three. So that's one of the aspects that has my very deep, deep respect. The second aspect, I think, I believe, and again, I've been in the software business for about 20 years, I hold 14 US patents on distributed technologies, I think what these guys are doing is the most bleeding edge of continuous availability and distributed systems. Seriously, I mean. It's not a shameless plug. That excites you right there. That's what I'm trying to do. I'm very excited. I'm very excited. And of course, these guys are 15 minutes from my door, so. That's very innovative. That's good innovation right there. That's innovation in your lifestyle. It's a societal benefit to you, and the gas that you don't have to spend traveling, commuting. So I want to ask you also about your perspective. Obviously, you have experience, got a lot of patents and tech. So you certainly have chops there. But open source is changing. We're now maturing as an industry, right? 
I remember when I first came out of college, open source was really post-Unix, and then Linux, even before Linux, as Linux was hitting the scene, you had that movement, early-stage pioneers. And now it's maturing, and it's standard operating procedure to use open source. What do you think needs to happen in the community? You're involved in the projects. Dave and I were saying all this summer tour with theCUBE that the new standards bodies are the open source community, being ratified with code. Actions and adoption are ultimately the proof points. The de facto standard is the ratification of these standards by the communities. No more governing bodies. So with that being said, what do you think open source needs to do to continue to accelerate and, at the same time, innovate and be constructive? Well, it's a very loaded question, apparently. I think it's a lot to unpack. Take it wherever you want. Okay, so I think the beauty of open source, and essentially the power of open source, is that it is an evolutionary development, right? Think of how our species came around, right? And we can argue about that, right? But technically speaking, our species kind of came around as a series of small successes and small mistakes, okay? The mistakes or errors never built up into a fat-tail distribution problem, right? Open source goes exactly this way. Open source takes a lot of small steps, and some of them are faulty, some of them are successful, but the overall movement is very positive and always forward, okay? So that's what fascinates me about open source. What needs to happen in open source in terms of being more acceptable, more successful? Well, I mean, probably only a few people are actually doing open source day to day just because they love open source. Open source, in my opinion, is essentially financed by the companies who are actually trying to build products around open source, right? 
Like Red Hat, right? It's a good example. SUSE, Intel, WANdisco. WANdisco is a good example, actually, because these guys were doing an open source project, in the form of Subversion, pretty much since day one, right? So they know very well what open source means. The Apache Subversion project has actually been funded by WANdisco, and now we're actually committed to being 100% open source, except for this one little library, you know, the Paxos implementation, that we actually keep proprietary. But everything else, everything else is 100% open source. I'm proud to say that our distribution is actually based 100% on Hadoop 2.0.5, which was released about a month ago. All other components are actually 100% releases from Apache open source. So we don't do any tricky modifications and stuff. So I think what needs to be done in order to succeed is that open source needs to get more and more companies, commercial bodies, who would come and start contributing to the movement, to help it grow, essentially. Well, Konstantin, it's been a great show for you guys at WANdisco; I just want to give a congratulations out to you guys. Thank you. One, we've been following you for a while; certainly the technology's solid, but you guys really broke above the noise this week by having the most excellent product out there, 100% uptime, a great message, obviously ratified by a lot of the success you had at the show. Congratulations, and we really enjoyed seeing you and your team. Thanks for coming on theCUBE. Stay tuned, guys, because we have a great in-memory technology that's actually coming on Hadoop, from the Berkeley AMPLab. So another open source project. Stay tuned. We'll be there. Thanks a lot for having me. Great citizenship, you guys, and great stuff. This is theCUBE. This is the flagship program. We've got the events. I'm John Furrier with Dave Vellante. 
We'll be right back with the Hortonworks guys to get a summary of what happened these past two days and get their take on what was expected, what wasn't expected, and find out what really happened here outside theCUBE and, of course, in the lounge areas. We'll be right back with that right after this short break. Then Dave and I will wrap up the show and put a bow on these two days. This is SiliconANGLE's theCUBE. We'll be right back.