The Cube at Hadoop Summit 2014 is brought to you by anchor sponsor Hortonworks. We do Hadoop. And headline sponsor WANdisco. We make Hadoop invincible.

Welcome back everyone, live in Silicon Valley in San Jose. This is the Cube, our flagship program. We go out to the events and extract the signal from the noise. I'm John Furrier, the founder of SiliconANGLE. I'm joined by big data analyst Jeff Kelly from wikibon.org, one of the top analysts in the business. Jeff, welcome back with our next guest here, Robert Hodges, CEO of Continuent. There you go. Get the energy. Welcome to the Cube, first time on. Thanks for coming.

Thank you very much. It's a pleasure to be here.

So talk about your company. We're really laying out the horses on the track. You've got the growing companies, you've got the pre-IPOs, you've got the big whales. Where do you guys fit into that ecosystem, and what are you guys doing here at the Hadoop Summit?

Sure, well, we do clustering and replication, and we kind of grew up out of the MySQL community. So we started out doing products specifically for MySQL, focused on high availability. Starting seven years ago, we built what's now called Database as a Service. We have about 100 customers using it, some of them running hundreds of millions of transactions daily. And to do that, we ended up building replication and getting into the data movement business. And that's what's actually bringing us to the conference today.

You know, databases are a hot topic now. But structured databases had schema issues and were hard to scale out. Now you get unstructured. How does that fit in for your company, with the unstructured versus structured conversation, with the cloud, with Hadoop and all that business?

Well, so here's the interesting thing. For enterprises, there's actually very little unstructured data at some level. And a tremendous amount of processing is happening in highly structured systems.
For example, relational databases. Just to give you an example, one of our customers who runs MySQL handles up to eight billion transactions daily, spread across about 50 systems. So they have a tremendous information flow through these systems. This is something where they need to be highly reactive. They have small transactions, and relational databases handle this very, very well. And the interesting question is, given that you have this relatively high-value data that's core to their business, how do they actually perform their analytics on it? As Hadoop has evolved, one of the questions customers have come to us with is, hey, I've got all this data in MySQL, I've got it in Oracle. How do I now get it over to Hadoop, where I can participate in this broader set of analytics and, for example, merge it with sentiment data, merge it with logs or sensor data? That's what they're looking for.

So data scientists really want this product; you're seeing that as key?

Absolutely, I'll give you an example. We have a number of customers that do marketing campaign management. The campaigns are defined inside relational databases. They're kicked off inside relational databases. Every time somebody touches a webpage, for example, at a company like Marketo, you see that little MKTO URL briefly pop up. Well, a transaction just got logged on a MySQL database. So that's critical information about their business. And they want to be able to take that information and get it into large-scale analytic engines that can, for example, look across the entire dataset, which on the transaction processing side is sometimes spread across dozens of clusters. So you want to be able to see the breadth of the dataset. You also want to be able to combine it with other kinds of data in the enterprise.
I think one of the things we found in our recent research is that integrating Hadoop into your current ecosystem, your current infrastructure, is critical, because you don't want it to live in a silo. That defeats a lot of the purpose of big data and Hadoop, which is to break down those silos so you can bring all your data to a central place and do all your processing. And it sounds like one of the things you address is that ability to connect your existing systems to Hadoop.

Exactly, exactly. In fact, what really brings us to the show is that we do real-time replication out of MySQL and Oracle into Hadoop. The way this happened was that two or three years ago we mastered the art of loading into data warehouses. There's kind of an impedance mismatch: you can get the data out of a system like MySQL at sub-second latency, but you need to buffer it up to load it into systems like Vertica, the column stores. So starting last year, we started to have customers coming to us saying, hey, we've got a bunch of MySQL servers, we've got a Hadoop cluster that's already processing a lot of in-house data. Can you connect these two things up? Because the tools to do that, and particularly to do it quickly, with low latency, seeing the whole transaction log, are badly lacking at this point. And that's where we jumped in.

Well, certainly, there's been a big discussion this week about the relationship between Hadoop and the data warehouse, and whether Hadoop is going to overtake the data warehouse. But the reality is there are a lot of analytic capabilities now available in Hadoop. And similar to your data warehouse, you want to get that data in there as quickly as possible to give your business analysts, your data scientists, whoever the user is, the most up-to-date view possible.

Absolutely. I think it helps to put numbers on that.
So, for example, when we're talking about latency for loading data in traditional batch ETL, it could be a day to actually get a snapshot of the data. In fact, for the datasets that we deal with, we have customers with seven or eight terabytes of data in a single transaction processing system, and they may have dozens of those. So if they were just going to do snapshots, the sort of scraping batch ETL, their analysts would actually see that data days later. And that's leaving out the possibility that jobs might fail or other things might go wrong. People really need to see the data in much shorter timeframes. And I guess the real point, the thing we're trying to achieve, is that people can tolerate different kinds of latencies for their businesses. We want them to choose latencies that are relevant to their business problems, as opposed to being limited by the software. That's really what we're trying to solve.

He just hit the nail on the head. Real time isn't necessary for every type of workload. There are some cases where you're doing maybe deeper historical analysis, and it's not critical that you have data from the last hour. But there are other use cases where you're going to need the most up-to-date data. And as you said, rather than being limited by the technology capabilities, it's better to have unlimited capabilities and pick the one that's appropriate for each use case.

Exactly. And a better technical word than real time is incremental. So you get the transactions as they occur. And there are a couple of other things that are important for businesses, particularly as you start to look at the IT underpinnings. One is that you don't change the applications that are generating this information. That's actually very important, because these are running systems; they've often been going for years.
They don't want to have to do major software upgrades to integrate this data into Hadoop. That should just be done; it should be orthogonal, a separate solution. So that's one thing. The second thing is not to put load on these systems. So again, being able to do change data capture straight out of the logs is obviously low latency, but it's also high performance in the sense that it puts low load on the systems from which we're extracting.

So let's take a step back. We've got 88 vendors back here, and there are over 3,200 attendees. What's your take on this show, and what does it say about the maturity of this market?

Well, that's a really great question. One of the most striking things about this show, for me, is just how early on we are in this whole process. And I think there are a couple of points about that. There's the obvious thing that it seems like every day there's a new project in Hadoop. There's just a tremendous number of things appearing just in the area of data movement; there are multiple solutions. There are things like streaming query, where we have Tez, we have Spark; we have much, much faster query processing with things like Impala and Stinger. There's a tremendous flux. I think the other thing that's really striking is how much innovation is actually being driven by the users as opposed to the vendors. And to my mind, that's almost the most interesting thing about this whole ecosystem right now.

Well, the open source community is obviously critical to Hadoop. It has been from its inception.
And I actually talked a little bit about this with Doug Cutting and Arun Murthy yesterday on stage: now that we've got all these vendors in here, especially some of the really big whales in the industry, and a lot of money being pumped into this market, both from them and from the investment community, well, there's the saying about the influence of money on politics. What's the influence of money on Hadoop? And it sounds like for you, it's critical that this open source community remains a vital part of the development.

I agree. And in fact, our replication software is 100% open source. We do this for a couple of reasons. It obviously helps us market; these are communities where the early adopters are strong open source advocates, so there's sort of a built-in preference for it. But I think it also keeps us really tapped into this innovation cycle, which is being driven by collaborative groups of users. And I think what's interesting about Hadoop is, as you say, the whole kernel of this was a business problem at Google, lo these 12 or 15 years ago, which was then of course open sourced as Hadoop when Doug Cutting and his associates did it. And it continues to be driven by projects. For example, a project that's been mentioned quite a bit at this conference is Kafka. Well, that's LinkedIn. They needed to move messages, and they needed to move them fast. They created an open source project which they're now bringing out into the community. And that's one more contribution to this whole problem of how to move data quickly into Hadoop.

Yeah, absolutely. So in terms of the players out there, competition, who do you guys look at as competition?

Well, that's an interesting question. In the MySQL-to-Hadoop world, we don't have much competition, I'm happy to say. I expect that won't be the case for too long, because it is open source.
People can see what we're doing, and there are going to be other people doing this, whether it's low-level projects or sponsored projects from vendors. Going over to the Oracle side, the folks that we look at for competition are people like Informatica and Attunity. Not so much GoldenGate; they don't really have, at this time, a very strong mechanism for moving data in real time out of Oracle into Hadoop. So I think the Oracle market, and in fact the commercial databases as a whole, are a really interesting opportunity, because open source has yet to really penetrate those environments, and there's obviously a lot of need for software that has a different economic model. You know, low activation energy to pull it down, play around with it, try it out, and also an economic model that allows you to scale horizontally without the huge capex that characterizes some of the enterprise solutions right now.

So a question from CrowdChat: please describe the importance of real time, the real-time factor, if you will, in the big data and Continuent model?

Well, I think it's interesting. It's very important for some people; it really depends on the business. For us, it's something we're focused on because, as I was saying to Jeff, it actually sums up a number of things. When we say real time, obviously we can get the data out and loaded into Hadoop as quickly as users need. In my experience, the numbers actually vary from five minutes to 24 hours, but the point is people can pick. Also, in doing it that way and loading the data incrementally, we're putting low load on the servers. That's tremendously important with a transaction processing system; DBAs don't even want us touching these hosts. So to be able to do this with minimum touch is very important, and not have to change applications.
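That pick-your-latency, incremental loading can be sketched as a micro-batcher that buffers change events and flushes on whichever comes first, a row-count cap or a latency deadline. This is an illustrative toy, not Continuent's actual implementation; all names and parameters here are made up:

```python
import time

def micro_batches(events, flush_interval=300.0, max_rows=10_000):
    """Group a stream of change events into batches for bulk loading.

    flush_interval: seconds of latency the business will accept
    (five minutes here; it could just as well be 24 hours).
    max_rows: cap on batch size so one bulk load stays manageable.
    """
    batch = []
    deadline = time.monotonic() + flush_interval
    for event in events:
        batch.append(event)
        if len(batch) >= max_rows or time.monotonic() >= deadline:
            yield batch                  # hand the batch to the bulk loader
            batch = []
            deadline = time.monotonic() + flush_interval
    if batch:                            # flush the remainder at end of stream
        yield batch
```

Because the source is read incrementally from the replication log rather than scraped in snapshots, the latency becomes a tuning knob rather than a property baked into the pipeline.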
So if you put those three things together, the speed, the low performance hit, and the minimal impact on applications, those three things taken together are critical.

So we've been talking about big data and the cloud colliding.

Yes.

Now there's certainly OpenStack, the open source version of the cloud; you've got Red Hat; everyone's competing with Cloud Foundry and everyone else at the platform-as-a-service layer. So how do you see those coming together? Because you know about database as a service, and you're essentially saying, hey, we're going to provide cloud functionality in a Hadoop-like environment. It's very compelling. So talk about that collision.

Yeah, sure. And since you've surprised me with this, let me try to say something coherent about it. I think there are two computing models that are actually colliding here. I come from a computing model which is much more hardware based. With database systems, MySQL for example, people tend to think of it as a whole-stack problem: we're building database systems where we start with the storage and look all the way up to the interfaces to the applications. If you look at the deployments in places like Google and Facebook, for example, they're very focused on managing hardware directly. And Hadoop definitely comes out of that tradition, where people bought the hardware and you have applications running on that hardware, spreading load across it. The cloud really takes a different cut at this, where we're basically using virtualization: instead of applications running on a single piece of hardware, with a cloud we have virtualized apps which appear to run on individual hosts. And I think there's overlap between the problems that they're solving.
And for Hadoop and for data management, I actually think the Hadoop model is in many ways better, because it's a more efficient use of resources. You don't have the overhead of virtualization. You can see the file systems more or less directly. And these are all things that are very, very important to understand when you're managing data.

My final question for you: for the folks out there, share in your opinion the importance of this industry's moment right now. The moment in time where there are so many forces going on. You mentioned you come from a hardware computing background; some people have more of a data background, some have software, some have cloud DevOps. Why is this moment in time in the technology business so weird, exciting, and intoxicating at so many levels?

I think that's a great question. And I'd just like to say to folks who are listening: we have reached a fundamental inflection point. Sometime during the 90s, it became cheaper to store data on disk than on paper. We have now reached another inflection point where, for many businesses, it is possible to store data infinitely. And this is a kind of interesting proposition. For example, consider an enterprise system that generates data at 1,000 transactions per second for seven years. You can now store that entire seven years of data. And that's infinity for a lot of business systems, because that's as long as the lawyers want it. You never have to throw data away. And I think a lot of what Hadoop is doing is capitalizing on this declining cost of storage and saying, yes, we want to store everything. That opens up a completely new vista in the way you think about analytics, because you store the raw transactions. They're basically there forever. You can continue to ask questions on them and build value on them for as long as you want.
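The arithmetic behind that claim is worth spelling out. At a sustained 1,000 transactions per second, seven years of history is roughly 220 billion rows; assuming an average of 200 bytes per row (an illustrative figure, not from the interview), the whole archive is on the order of tens of terabytes, which is well within reach of a commodity Hadoop cluster:

```python
tps = 1_000                          # sustained transactions per second
seconds_per_year = 365 * 24 * 3600   # 31,536,000 seconds
years = 7

rows = tps * seconds_per_year * years      # total transactions retained
bytes_per_row = 200                        # assumed average row size
total_tb = rows * bytes_per_row / 10**12   # archive size in terabytes

print(f"{rows:,} rows, about {total_tb:.0f} TB")  # 220,752,000,000 rows, about 44 TB
```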
It's a very, very different computing model and one that I'm just really, really psyched to be part of.

It's really exciting. Robert, thanks for coming on. I really appreciate it. Good luck with the company. I'll give you the final word. Just share with the folks out there: what's next for your company? What are the key objectives for you and your business?

Sure. Right now what we want to do is be really successful with loading data into Hadoop. We have deployments going on with big customers and big challenges, and we want to make those successful. We're primarily focused on MySQL, and we'd really like to extend that to other database systems as soon as possible and share that value proposition with new sets of users.

Robert Hodges, thanks for coming on theCUBE. Really appreciate it. It's day three and we're keeping the energy up. This is theCUBE with wall-to-wall coverage at Hadoop Summit. We'll be right back with our next guest after this short break.