theCUBE at Hadoop Summit 2014 is brought to you by anchor sponsor Hortonworks. We do Hadoop. And headline sponsor WANdisco. We make Hadoop invincible.

Okay, welcome back everyone. We're here live in Silicon Valley for Hadoop Summit 2014. This is theCUBE, our flagship program. We go out to the events and extract the signal from the noise. I'm John Furrier, the founder of SiliconANGLE. Wikibon.org big data analyst Jeff Kelly was on stage earlier, doing the big teaser of the survey and also interviewing Doug Cutting and all the gurus of the Hadoop world. So excited. Our next guest is Lonne Jaffe, CEO of Syncsort, a Cube alumnus. Welcome back.

Thanks for having me. Great to be here.

I always enjoy it when you come on theCUBE, because we get to geek out. Your company has so much tech. I want to say it's the world's best-kept secret, but you guys are doing some pretty compelling work where big data meets legacy meets real computing. You've modernized infrastructure that people already have.

That's right.

So explain where you're at right now.

So, it's a 46-year-old company. However, we just looked at a piece of analysis. I got back from Japan last week, and there was a piece of work done there scanning the Hadoop code base to identify which companies have contributed the most lines of code to Hadoop, and we were seventh last year. We've been all in on Hadoop for the last couple of years.

That's pretty impressive. Some big names in there.

Yeah, that's for sure. And part of this was serendipitous. The nature of our technology, the dozens of algorithms we have and the optimizer, was just very well suited to Hadoop. When we contributed MAPREDUCE-2454, which got committed into the Hadoop trunk early in 2013, it allowed us to infuse our optimizer and algorithms onto every node in the Hadoop cluster for handling merges and joins and aggregations. Hadoop is very sort-intensive, and most of the processing frameworks that run on it are also very sort-intensive. So this played to our strengths here at Syncsort.
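To make the sort-intensive point concrete: Hadoop's shuffle phase merges the pre-sorted output of map tasks before reduction, which is why a pluggable sort engine of the kind MAPREDUCE-2454 enables can speed up joins and aggregations across the cluster. Below is a minimal Python sketch of that k-way merge over already-sorted runs; the function name and the (key, value) data are hypothetical stand-ins, not Syncsort's actual engine.

```python
import heapq

def merge_sorted_runs(runs):
    """Lazily merge pre-sorted runs into one sorted stream, the core
    operation when combining spilled map output for a reducer."""
    # heapq.merge performs an O(n log k) k-way merge of sorted iterables.
    return heapq.merge(*runs)

# Hypothetical spilled runs of (key, value) pairs, each sorted by key.
run1 = [("apple", 1), ("cherry", 2)]
run2 = [("banana", 3), ("date", 1)]
run3 = [("apple", 5), ("banana", 2)]

for key, value in merge_sorted_runs([run1, run2, run3]):
    print(key, value)
```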
So I've got to ask you. One of the big things that's been really compelling about the big data event here is that everyone's succeeding. There are so many use cases, so many diverse use cases. We had Merv Adrian and others from Gartner, and again, we're reiterating, one, how early it is and, two, how diverse the use is. So it's not a one-trick pony. And with resource scheduling, you're seeing all kinds of system management concepts; it's an operating environment now. So I want to ask you: YARN and the data platforms on top of the stack abstract away some of the complexities of the things that you guys do. How do you see that playing out for the industry, yourselves, and customers?

So the most common use case we're seeing for Hadoop, in terms of our large production Hadoop customers, is still offloading legacy workloads from systems that are more expensive into Hadoop. What they'll do is take complex SQL queries that are running in a data warehouse, which can be 40 to 60% of a typical data warehouse environment, and recreate those in Hadoop in a way that performs just as well, so that they also then have all of their data in Hadoop. And because we run within YARN, as new paradigms for processing the data emerge, they are now running on a platform that already has all of their data in it. So it ends up becoming a two-pronged strategy for these large enterprise customers. One piece is: let's save lots of money. The other piece is: let's get all of our data into a platform that's getting better faster than pretty much anything else in the technology industry.

One of the things we came out with here at the event, we're calling it SILQ, is in technology preview, and it's something I don't think the industry has ever seen before. It scans your legacy SQL, complex SQL queries which can be 100 lines or more, visualizes them for you graphically, and then gives you recommendations and assists in recreating those workloads most efficiently within the processing frameworks that run on Hadoop.

Let's talk about that a little bit more, because one of the biggest use cases we see, certainly initially, for something like Hadoop is offloading some of those workloads from legacy, expensive systems. But that's not a trivial exercise. Some of these workloads are extremely important to these companies, and you can't just take a chance that they're not going to continue running at scale and at the performance level you need. So talk a little bit about the challenges you're seeing among your customers.

Some of the customers will have large amounts of SQL that was written by people who may not even work for the company anymore, and they don't know what it does. So simply scanning the legacy SQL, visualizing it, and showing that you're doing a join here, it's followed by an aggregation, and here are the tables it's accessing: that alone is incredibly valuable. And then we realized pretty quickly, working with some of the most enterprise-scale production deployments of Hadoop in the world, that anything we could do to make it even modestly easier to recreate those legacy workloads in Hadoop would dramatically accelerate the adoption of the platform. So that's why we went all in over the last couple of years on this SILQ technology that we just released here.

Well, we had Jack Norris from MapR on yesterday, with whom you guys have formed a partnership, and he was talking about how this is indeed one of the number one things their customers want to do, but so many of them have to try again and again because they're having trouble actually getting it right. So this seems like a solution to that problem.

Yeah, and there's also been a proliferation of SQL-on-Hadoop engines of various kinds. One of the things people will sometimes do is take the SQL and try to run it in Hive, and Hive performance has improved pretty dramatically with some of the new DAG-based processing frameworks that are emerging. But we also designed our products so that even if you're using Hive or Pig or one of the other languages for creating large batch ETL in Hadoop, we will still accelerate those pretty dramatically. Our product will install on all the nodes in the cluster and speed up your Hive jobs. And then, one by one, you can take the really complex workloads and use our SQL analyzer to recreate them in something that's a little bit more native to the Hadoop framework.
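SILQ itself is proprietary, but the core idea described here, scanning legacy SQL and surfacing its structure (tables, joins, aggregations) so it can be visualized and recreated on Hadoop, can be sketched. The following deliberately naive Python illustration uses only the standard library and a made-up query; a real offload tool would use a full SQL parser rather than regular expressions.

```python
import re

def sketch_query_structure(sql):
    """Naive scan of a SQL statement: list the tables it touches and
    flag the join and aggregation steps a visualizer would graph."""
    tables = re.findall(r"(?:FROM|JOIN)\s+([\w.]+)", sql, re.IGNORECASE)
    joins = len(re.findall(r"\bJOIN\b", sql, re.IGNORECASE))
    aggs = re.findall(r"\b(SUM|COUNT|AVG|MIN|MAX)\s*\(", sql, re.IGNORECASE)
    return {"tables": sorted(set(tables)),
            "joins": joins,
            "aggregations": [a.upper() for a in aggs]}

# Hypothetical legacy warehouse query.
legacy_sql = """
SELECT c.region, SUM(o.amount) AS revenue
FROM warehouse.orders o
JOIN warehouse.customers c ON o.cust_id = c.id
GROUP BY c.region
"""
print(sketch_query_structure(legacy_sql))
```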
So when we last spoke on theCUBE, we were at, I believe, the AWS Summit, or sorry, re:Invent. We were talking then about the breadth of data products that AWS has. But one of the challenges, of course, is getting them all to talk to one another and getting them together, and I think you pointed out that they have the products but not really a platform. So tell us a little bit about your strategy there and how you're helping AWS users knit those together so that they can realize the full value of those very powerful AWS products.

So back in November, we released our Amazon Web Services Elastic MapReduce-based Hadoop product. And actually, just a couple of weeks ago, we made it completely free, partially to generate some ubiquity with the power of our engine. Then, also a few weeks ago, we released the non-Hadoop version of our product on EC2, full function and also free at pretty much all capacity levels; the exception is the highest capacity level, which is only $2 an hour even at the largest scale, so it's almost free. The goal there is, first of all, to help people who are moving workloads among Kinesis, DynamoDB, Redshift, and Elastic MapReduce on Amazon; it'll allow them to do that fairly seamlessly. The other thing it does, which is pretty compelling, is help people who are using legacy transformation tools, ETL products where they're paying a lot of money every year; they can move to a free version that's available over the internet. A couple of the earlier guests on theCUBE have talked about the nature of data gravity. Of course, a lot of data is still on premise, but more and more data is being generated in the cloud, and as it's getting generated in Amazon, it's nice to have compute paradigms that run natively on Amazon. So if you do a search on the AWS Marketplace for ETL, our product will come up; you can click and get it for free, and now you have a powerful ETL tool running natively within the Amazon Web Services environment.
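To make the run-it-where-the-data-lives idea concrete, here is a hedged sketch of submitting a transformation step to an existing Elastic MapReduce cluster using today's boto3 SDK (the 2014-era product predates boto3, so this is illustrative only); the cluster ID, JAR path, and arguments are hypothetical.

```python
import boto3

emr = boto3.client("emr", region_name="us-east-1")

# Run a hypothetical ETL step next to the data on an EMR cluster,
# instead of shipping the data to a third-party environment and back.
response = emr.add_job_flow_steps(
    JobFlowId="j-EXAMPLECLUSTER",  # hypothetical cluster ID
    Steps=[{
        "Name": "transform-clickstream",
        "ActionOnFailure": "CONTINUE",
        "HadoopJarStep": {
            "Jar": "s3://example-bucket/etl/transform.jar",  # hypothetical
            "Args": ["--input", "s3://example-bucket/raw/",
                     "--output", "s3://example-bucket/clean/"],
        },
    }],
)
print(response["StepIds"])
```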
Yeah, talk about that a little bit more: the impact cloud has. We talk about the impact big data has on data integration, but what is the impact cloud is having on data integration? You've got a lot of legacy providers out there that were born in the on-premise world, and that made sense. But as we move to this more cloud-centric world, especially at the data volumes we're talking about, how does that impact data integration? What pain points are you seeing from customers about moving data into the cloud and moving data around once it's in there? What does that look like today?

So, people doing data integration of various kinds, transforming data and moving it from one place to another: if you're generating the data in Amazon, you don't want to take the data off of Amazon, move it into some third-party cloud environment, do a transformation, and then move it back into Amazon. First of all, there are a lot of network I/O problems with that. Second of all, it ends up slowing down the processing pretty dramatically. So having something that runs natively within Amazon is very important. The other thing we get by running natively on the platform is that we inherit the benefits of commoditization at the cloud platform level, right? Amazon is already well above a million servers, and as their infrastructure gets cheaper and cheaper, our product becomes cheaper as well, whereas the entities running on-premise or in third-party cloud environments don't benefit from that commoditization the way we do.

Yeah, it's interesting, because you're seeing the legacy data integration vendors come out with their quote-unquote cloud products. And they're cloud in the sense that, yes, they're deployed in the cloud, but they're not deployed in a way that takes advantage of, as you said, the scale and the power of AWS. It's really a separate environment, and that almost defeats the purpose.

That's right. One of the things we came out with recently was a JSON connector for the purposes of ingesting social media data. More and more of the data being generated in the world is coming natively from the cloud, in these new data formats, both on the source side and on the destination side. So within Hadoop, we've started supporting new file formats like Parquet and Avro and others. People want this capability to be turnkey, and they want it to be native on the platform where they're running their data-generation systems.

From an acquisition perspective, I think we talked a little bit last time about acquisitions. You're starting to see the Hadoop ecosystem wake up to acquisitions and how important they are, both as a way to generate return on invested capital for their really large capital raises and as another tool in the toolkit for innovation. We think of innovation as having multiple time horizons. Horizon one is your existing products. Horizon two is near-adjacent areas. Horizon three is things like our SILQ project, right? Inventing something the world has never seen before. And if you have large quantities of capital, you can acquire technology that's somewhat disruptive and that needs a route to market. You're starting to see the Hadoop distributions do that, in the security space in particular and a few others. And we're going to continue to do that, with the idea of contributing capabilities that make Hadoop as powerful as possible, versus parking ourselves near Hadoop, generating code, and sending it to the cluster.

So one of the things we did recently is release a patch that we're contributing to Sqoop. People are using Sqoop to ingest relational data into Hadoop, and what we contributed is a capability that allows you to natively access mainframe data from your Hadoop cluster. The early adopters of Hadoop, like Google and Facebook and Yahoo, didn't have mainframes, so this wasn't an issue for them. But the grown-up companies we work with actually have substantial portions of their corporate data on the mainframe, and this will make the Hadoop platform more functional in terms of being able to access that mainframe data.
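Two short sketches follow for the capabilities just described. First, the JSON-to-columnar pattern mentioned above (ingest cloud-native JSON, land it as Parquet) can be illustrated with the pyarrow library; the records, field names, and file path here are hypothetical.

```python
import json
import pyarrow as pa
import pyarrow.parquet as pq

# Hypothetical social-media records arriving as JSON lines.
raw = [
    '{"user": "alice", "ts": 1402000000, "text": "loving #hadoopsummit"}',
    '{"user": "bob", "ts": 1402000060, "text": "the YARN talk was great"}',
]
records = [json.loads(line) for line in raw]

# Pivot the row-oriented JSON into columns and write columnar Parquet.
table = pa.Table.from_pydict({
    "user": [r["user"] for r in records],
    "ts": [r["ts"] for r in records],
    "text": [r["text"] for r in records],
})
pq.write_table(table, "tweets.parquet")
```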
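Second, the Sqoop mainframe contribution described above eventually surfaced upstream as an import-mainframe tool; a hedged sketch of driving it from Python follows. The host, dataset, and target directory are hypothetical, and only flags the tool is known to support are used.

```python
import subprocess

# Pull a hypothetical partitioned mainframe dataset into HDFS via
# Sqoop's mainframe import tool. "-P" prompts for the password.
subprocess.run(
    ["sqoop", "import-mainframe",
     "--connect", "zos.example.com",
     "--dataset", "SALES.MONTHLY",
     "--username", "mfuser",
     "-P",
     "--target-dir", "/data/mainframe/sales"],
    check=True,
)
```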
Yeah, so let's talk a little bit more about this show and the Hadoop community and what you guys are doing. You mentioned you're contributing to Sqoop, so you're being good community members. But what's your take on this? By my count there were 88 vendors out here, which is a pretty big ecosystem. We've got small companies, we've got large companies, every flavor of enterprise IT company. What's your take on the state of the ecosystem? It's interesting, because it's open source, meaning these vendors are competing, but a lot of the engineers are working together across companies.

Yeah, I credit a lot of that level of collaboration to the Apache Software Foundation, which has really done a remarkable job of helping to progress the platform itself. I can't think of a technology stack that has had this level of universal support from pretty much every single large technology company, as well as from a number of these pure plays with extraordinarily large capital raises. Even Linux didn't have every large software vendor supporting it initially. To some extent, Hadoop is becoming like TCP/IP or some other standard, where everyone is building to the same exact stack. And what we're seeing with the large companies is that they're making decisions about Hadoop based not necessarily on its maturity today but on its pace of improvement. I don't think there's any question at this point that it's improving faster than anything else in the industry. As a result, you have lots of venture capital flowing into the space, which is generating a tremendous amount of innovation. That innovation results in new capabilities coming online every day, and also in new acquisition targets getting created for companies that are well capitalized, like we are, so we can have multiple innovation tools in our toolkit.

Right, it's remarkable how quickly it's evolving. So tell us what you're seeing in the field from customers, in terms of not just adoption of Hadoop but the different players. You've got Hortonworks, you've got Cloudera, MapR, IBM to some extent, of course Pivotal. We recently conducted a big data survey, and one of the things we found that was kind of interesting was that of the respondents using Hadoop, only about 20% are actually paying customers of one of these distribution vendors. Are you seeing something similar in terms of percentages, or what do you see out there? Or maybe do they come to you only once they're in production and perhaps paying a Cloudera or a Hortonworks? What does that look like to you?

We've adopted a bit of an open core model ourselves, contributing basic functionality that we've invented, or that's within our products, into the Hadoop stack itself. A lot of people are using Hadoop essentially as an ETL capability. They may not be calling it that, but what they're doing is taking data from somewhere, bringing it into Hadoop, transforming it, and then loading it somewhere, whether that's Hadoop itself in Parquet or Avro, or some other system like a downstream data warehouse. And so our thinking is: make Hadoop a more complete ETL capability by continuing to contribute code to it, and then move up the stack and add higher-value functionality for more advanced use cases like the offload use case. The offload use case has a tremendous return, because within a quarter, or even a couple of months, you're able to start saving lots of money on the legacy system once you've recreated the workloads and the data in Hadoop, whether that's ELT in a data warehouse, offload from legacy ETL products or suites, or batch processing from the mainframe. You then have your data there. So as the 80-plus companies here at Hadoop Summit start rolling out more functionality, machine learning, graph analysis, multivariate predictive analytics, new processing paradigms like Spark and Tez, what you have is new functionality coming online in a place where your data already is. That's really exciting, I think, for a lot of these customers. It's rare that you're able to do something that saves money and also gives you tremendous additional functionality.
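Since the speaker frames Hadoop here as an extract-transform-load engine (pull data in, transform it, land it in Parquet or a downstream warehouse), a minimal sketch of that shape in PySpark, one of the newer processing paradigms he names, may help; the paths, schema, and column names are hypothetical.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("offload-etl-sketch").getOrCreate()

# Extract: read raw JSON events already landed in HDFS (hypothetical path).
events = spark.read.json("hdfs:///raw/events/")

# Transform: the kind of filter-and-aggregate work offloaded from a warehouse.
daily = (events
         .filter(F.col("amount") > 0)
         .groupBy("region", "day")
         .agg(F.sum("amount").alias("revenue")))

# Load: write columnar Parquet back into Hadoop for downstream consumers.
daily.write.mode("overwrite").parquet("hdfs:///warehouse/daily_revenue/")
```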
That's a very good point. But you mentioned all the money being pumped into this market, and of course that can enable a lot of innovation, but it could potentially... well, I don't want to put words in your mouth. Could there be negative impacts? You've got all these different vendors who want to push the platform in whichever direction best suits them. With money in politics, people say, oh, that's a corrupting influence. Does money in Hadoop have any negative consequences?

So, there has been almost a Cambrian explosion of innovation around Hadoop. Not everything will work, but that's okay. We actually think about our own portfolio this way: as we invest in Horizon 3 initiatives, new products we're investing in, some of them won't work, but we still want to dedicate a certain amount of spend to those next-generation disruptive technologies. If something doesn't work, you shut down that initiative, but you roll that spend into another Horizon 3 disruptive initiative. I think a lot of the venture capital funds are thinking about Hadoop in the same way: let's spread our bets among a number of things. Some of them will work, some of them won't. For the ones that don't, they'll take that funding, and essentially those technical resources, who are incredible, and reallocate them to some other emerging paradigm, whether that's a new processing framework, a new SQL-on-Hadoop engine, or a new graph analysis capability.

So I've got to ask you about the automation piece. One of the things we're hearing in the industry is that when things start going well, you start to see automation. You mentioned the SQL thing, which is really interesting; that's the trend that makes things easier. Tony Baer wrote a blog post this morning that said SQL is the gateway drug to the enterprise. Kind of a clever title, gets you a little bit of link bait, but it's legit. Comment on that, and talk about some of the automation trends you see that get your attention.

Yeah, so as Hadoop has matured as a platform, it's become clear that it needs many of the same automation capabilities as some of the more mature data warehousing environments and even the online transactional systems. One of the things we did recently is Dockerize our product. There was actually a lot of discussion of Docker yesterday. You can do some really neat things: in some experiments, we were able to run an entire Hadoop cluster on a single machine, and then, as you add more nodes, you can farm out the various data nodes to actual separate machines.
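The single-machine cluster experiment described above can be approximated with the Docker SDK for Python. This is a sketch under stated assumptions, not Syncsort's actual setup: the image name, role environment variables, and container layout are all hypothetical.

```python
import docker

client = docker.from_env()

# One bridge network so the pseudo-cluster containers can reach each other.
client.networks.create("hadoop-net", driver="bridge")

# Hypothetical image providing Hadoop daemons: one namenode plus two
# datanodes, all running on a single machine.
client.containers.run("example/hadoop:2.4", name="namenode",
                      network="hadoop-net", detach=True,
                      environment={"HADOOP_ROLE": "namenode"})
for i in (1, 2):
    client.containers.run("example/hadoop:2.4", name=f"datanode{i}",
                          network="hadoop-net", detach=True,
                          environment={"HADOOP_ROLE": "datanode",
                                       "NAMENODE_HOST": "namenode"})
```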
There is a lot of disruptive innovation going on in the DevOps space as well, and most of the DevOps players who think about automation are actually thinking Hadoop first, because they know it's going to be one of the largest, most widely adopted platforms in the space.

And the DevOps movement has also really driven a lot of the consumerization trends along with the cloud. How does an organization get into DevOps in a way that's seamless and agile but not too disruptive? We were talking earlier about how agile is a great thing, but at the end of the day, you've still got the process.

Yeah. I mean, you mentioned cloud. One of the design points for our cloud product, the one that runs on Elastic MapReduce, was that people can get started with high-performance processing in Hadoop with our cloud product. And it's the same transformations, so if they then want to move those workloads on premise, they can do that seamlessly. If they start on premise and want to move into the cloud, they can do that as well. That's been an important part of the value proposition. People don't want to get locked in. One of the things that's attracted people to Hadoop is its open core, and a lot of them are a little frustrated with how locked in they are to legacy systems, where the vendors are sometimes running almost extortion-based business models, raising the prices.

No, really?

And so they want to avoid doing that again. Having the portability, and the feeling that the core Hadoop platform itself will stay open so they don't have to worry about getting locked into a paradigm like that, is very important.

Well, the new extortion is shifting to openness. What's beautiful about open source is that now you have, in a way, a democratization of the software business, because you have more choices. But I think lock-in is a relative term. Lock-in could be simply functionality; people don't mind that. The new lock-in is actually a good lock-in, right? If you have a good product, that's the lock-in, and the switching costs come down to how much value you're getting.

Yeah, and there's a bit of a moving-up-the-stack dynamic that happens, right? The innovation happens at higher and higher value as the core capabilities become more commoditized. They also tend to get open sourced, so people lose the lock-in, and all the vendors then have to move up the stack to provide more value. That's really good for the customers, for partners, and for pretty much the entire ecosystem.

Lonne Jaffe, CEO of Syncsort. Check out Syncsort; we really like this company. All joking aside, I say it's the best-kept secret in the industry. There's a lot of tech there, a lot of great work. Final question. I want to get you to tell the folks out there, in your own words: what's going on at this show? Behind us are all the folks that are contributing and being successful. Give us your take on this booming big data market, the Hadoop market.

So the level of innovation is incredible. I look at the summit and I see a large number of acquisition targets, honestly, both for us and for some of the other players in the space, and I'm really excited about the level of resource that's flowing into the space. It's just going to make the platform even better.

All the big whales are here, looking for the tunas and the minnows. If you're a Series A or Series B company, it's a great market for going to the next level, Series C and pre-IPO, and certainly from growth to public offering. They're all here.

That's right, exactly.

Hey, this is theCUBE. Of course, we're extracting that signal from the noise; we're vetting the hype against the reality. We'll be right back with more, right after this short break.