Announcer: Live from the Fairmont Hotel in San Jose, California, it's theCUBE at Big Data SV 2015.

Jeff Kelly: Hi everybody, welcome back to theCUBE. We are live in San Jose at Big Data SV during Big Data Week. I'm Jeff Kelly with Wikibon. One of the big things we've been talking about this week is making Hadoop and big data consumable for the enterprise, and a lot of that has to do with integrated capabilities around high availability, security, and other things like governance. In this segment we're going to talk with a company that is taking some big steps in that space, WANdisco. Joining me is Jagane Sundar, CTO of WANdisco. Welcome back to theCUBE, a longtime, frequent guest.

Jagane Sundar: Thank you, Jeff.

Jeff Kelly: So as I just mentioned, making Hadoop enterprise-grade, making it consumable, making it safe and accepted in the enterprise is what you're all about. Talk a little bit about your approach, what you're doing, and the latest and greatest from WANdisco.

Jagane Sundar: Sure. As you know, availability is one of the big concerns enterprises have when it comes to deploying Hadoop, so that was the first place we started. Then we found that customers had a slew of surrounding problems. They needed the ability to run different types of applications on the same data, preferably managed as one cluster. So we came up with the concept of zones: different application profiles run in different sub-clusters, if you will, but it's all managed as one cluster. The data is consistently replicated across the zones, so you can have memory-intensive applications in one zone and regular batch-processing applications in another.

We also found that customers were really unhappy that their DR clusters sat mostly idle as backup. That's a lot of money. Our active-active replication gives them the ability to use the compute and storage resources on all parts of their cluster, and that was a huge win for them. That's the product we started with, Non-Stop Hadoop, and from working with customers, as you probably know, three of the top ten financial institutions in the world have signed up for our software.

From those customers we learned about other challenges, namely that they had different types of Hadoop and different types of storage in the same environment. So we've come up with a new product, which we'll go into in more depth a little later, that integrates these different types of storage and lets you run Hadoop applications on top of it. We call it AltoStor. The ability to do strongly consistent replication of data across different types of storage in different parts of the world is very attractive to our customers.
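To make the zone idea concrete: Hadoop applications reach storage through the standard FileSystem API, which is what lets a replication layer sit underneath without any application changes. The sketch below is illustrative only; the cluster URI and paths are made up, and it shows ordinary HDFS client code rather than WANdisco's actual implementation.

```java
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ZoneTransparentWrite {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Standard HDFS client code; a coordination layer underneath can
        // replicate this write consistently to every zone in the cluster.
        // "bigdata-cluster" is a hypothetical cluster name.
        FileSystem fs = FileSystem.get(URI.create("hdfs://bigdata-cluster/"), conf);
        Path file = new Path("/warehouse/events/2015-02-18/part-00000");
        try (FSDataOutputStream out = fs.create(file)) {
            out.write("event-record\n".getBytes("UTF-8"));
        }
        // Batch jobs in one zone and memory-intensive jobs in another can
        // both resolve this same path; neither needs to know where the
        // physical replicas live.
        System.out.println("len=" + fs.getFileStatus(file).getLen());
    }
}
```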
Jeff Kelly: So you mentioned banks. Let's talk a little bit about that. I definitely want to get to the product announcement, but I'm curious: why do you think you're gaining traction at some of these large banks? What's the profile of the workloads they're trying to run, and how does that fit with your software?

Jagane Sundar: Right. The first thing to remember is that banks operate in different parts of the world, so their data is generated all around the world, and most of the analytics they want to run are aggregations of all the data that's available. It's not very interesting to analyze data from just your own location. That was the first reason they engaged with us: we gave them the ability to do strongly consistent replication across the wide area network. But once we got in there, we explained how our active-active setup lets them run their entire cluster rather than leave half their hardware idling, and that was another big win. A lot of them are intensely focused on keeping costs down, and Hadoop costs can really climb if you've got vast amounts of idle hardware. So that was our entry point: almost all of our conversations start with the ability to use the wide area network and lead into the other interesting features we bring to the table.

Jeff Kelly: Well, this is interesting, because in our last segment we were talking about where the money is going to be made in big data. One theory is that it will be made on the infrastructure side and on the application side, and that it will be much more difficult to make money in the middle, with the databases and the algorithm tools, which are all necessary but are becoming commoditized and open-sourced in a lot of cases. So the infrastructure end, where you play, covers a couple of important areas: making it enterprise-grade, with the high availability you talked about and the ability to bring in data from across the world, and being able to spin up new applications on that infrastructure very quickly. It's interesting how this market is playing out a bit differently from the traditional world. But let's go back to banking for a second, because another big issue there is compliance and security, which is obviously key in a banking environment. You mentioned Hadoop can get expensive on the hardware side, and one way to alleviate that might be a cloud environment, but the banks are never going to go to the public cloud. So talk a little bit about security and governance and how you address them in a banking environment, which is obviously highly regulated.

Jagane Sundar: Right. The first thing we encountered when we started engaging with banks is that several countries do not allow you to take their data outside the country, yet analytics on that data is both relevant and allowed. A system like ours gives you selective replication: it's a global file system that runs across different parts of the world, but it's selectively replicated. So you can run your MapReduce or other analytics in the location where the data lives, replicate only the results of that compute to other parts of the world, and then run your aggregate queries there. That's a very valuable tool for banks that operate around the world, and it keeps them in strict compliance; some of these compliance violations are very expensive, so folks at the banks are particularly careful. And we enable this at the infrastructure layer. This is not some batch-processing DistCp job that has to be carefully monitored to make sure you don't copy the wrong data to the wrong place. It's done at the infrastructure level and guaranteed, much as the file system in any operating system guarantees certain behavior. That capability is a lot more attractive than hiring an army of administrators to watch over your DistCp jobs, and I think it's one reason banks were attracted to our software.
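A rough sketch of what that selective-replication pattern looks like from an application's point of view, under stated assumptions: the per-path replication policy (/raw/de staying in-country, /analytics/global replicating worldwide) is hypothetical, the paths and cluster name are made up, and WANdisco's actual policy mechanism isn't shown here; only the shape of the pattern is.

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class SelectiveReplicationSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Hypothetical in-country cluster; raw records never leave it.
        FileSystem fs = FileSystem.get(URI.create("hdfs://frankfurt-cluster/"), conf);

        // Assume a (hypothetical) replication policy marks /raw/de as
        // local-only, so this data stays inside the country.
        Path raw = new Path("/raw/de/transactions/part-00000");
        long count = 0;
        try (BufferedReader r = new BufferedReader(new InputStreamReader(fs.open(raw)))) {
            while (r.readLine() != null) count++;
        }

        // Only the aggregate leaves the country: assume paths under
        // /analytics/global are configured for worldwide replication,
        // so remote sites can run their aggregate queries over it.
        Path result = new Path("/analytics/global/de/tx-count");
        try (FSDataOutputStream out = fs.create(result, true)) {
            out.writeBytes("de\t" + count + "\n");
        }
    }
}
```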
Jeff Kelly: Is it fair to say that concerns around governance, and around falling out of compliance, are holding back deployments, not just in banking but in other sectors as well?

Jagane Sundar: I don't know if I'd call it holding back. I would say it's shifting the focus from the traditional internet-company features that went into Hadoop to something different, something more interesting not just to banks but to other enterprises as well. We're starting to make headway in some biotech firms, and they have similar concerns: you really do want to be careful about where your data originates and where it's accessible. So the nature of feature additions is changing because these players are in the marketplace now, and they come with big dollars; these are not small industry players. The features are changing, and the product itself is taking on a whole new look with enhanced capabilities.

Jeff Kelly: So as the economics change, and you said you're enabling them to use a lot more of their hardware since it's no longer sitting idle for replication, are they taking advantage of that by putting in data they didn't capture before? And are they getting enough application benefit that it effectively finances bringing in more of the raw data?

Jagane Sundar: The answer is both. They start, of course, by pouring more data into the system, because they have such fine control and much more hardware at their beck and call now. And once the data is in there, the folks using it get a lot more interested, so the number of applications grows very quickly. People start out finding the data inconvenient to use because it's in a different place and needs to be collated, but suddenly, with our sort of capabilities, they can run simple Hive queries that give them quick answers. So first we see growth in storage, because the whole cluster is now usable, and then we see growth in applications.

Jeff Kelly: So I've got to get your opinion on some of the big news this week. The big news everyone's talking about is the Open Data Platform, the announcement Pivotal made with some of its partners, IBM, Hortonworks, and others. They're positioning it as cooperation to accelerate adoption of Hadoop. On the other side, people are saying this is just another pay-to-play industry consortium, and that it should really be all about code. What's your take? Is the ODP, the Open Data Platform, a good thing or a bad thing for the industry?

Jagane Sundar: So I've known a lot of open source developers for a long time now. These are passionate folks; most of the developers I know personally value their open source contributions and their position in that community more highly than the jobs that pay their salaries. It will be very interesting to see whether a consortium of companies that have, I think, a sum total of two committers among them can dictate terms to the open source developers. From my experience, I don't think that's going to happen. Consortia can be formed and decrees can be made, but the open source community will do what is best for the software. It's something they feel deeply about. I have very close friends in the open source community, and they really do love the work they've done and care deeply about it.
So I question the viability of a consortium that tries to dictate terms to the open source developers.

Jeff Kelly: Well, that's one of my biggest questions: what is the Open Data Platform going to be able to do that the open source community can't? The stated reason they think the Open Data Platform is needed is the fragmentation they see in the Hadoop space. I think that's somewhat accurate, though I don't necessarily agree, and frankly I don't know whether this approach is going to solve that problem. Time will tell, and they'll have to live up to their ideals and what they said they're going to focus on. So we'll see. Now, we're running close on time, but you mentioned you've got a new release, AltoStor, which is a name that sounds a little familiar. That's where you came from originally, when WANdisco made the AltoStor acquisition, and now you've got a product with that name. Tell us a little about the announcement.

Jagane Sundar: So AltoStor is a product that came about from our interactions with our customers. We've got this wonderful platform called Non-Stop Hadoop that's perfect for pure Hadoop deployments, where folks have Hadoop in all of their big data centers. But what we found is that there are other storage systems that bring slightly different, but still useful, characteristics. One of those is EMC Isilon, which we encounter a lot at customer sites and which has some interesting attributes. Then there's Amazon S3, which is always the big unknown. Our customers' desire to make full use of all of these different storage types led to the invention of AltoStor. It's a Hadoop-compatible file system on which you can run any Hadoop-compatible application: MapReduce, YARN, Spark, HBase, whatever. The underlying storage can be Isilon, the HDFS in Cloudera's or Hortonworks' distribution, or Amazon S3, and we do strongly consistent replication across them. That opens up some interesting possibilities. One thing you can consider doing is replicating part of your data into Amazon S3. Say you have a compute job that needs maybe a thousand VMs but runs just once a week, and you don't have capacity for that in your in-house data center. With the data strongly replicated onto S3, you spin up a thousand VMs, run the job there, and shut the VMs down when it finishes; that's the most cost-effective use of resources. Or perhaps you have an application that needs very high shuffle performance, which storage systems such as Isilon offer, so you build a small part of your cluster from that type of storage. What we guarantee is strongly consistent replication across all of it; you're not mucking with a bunch of shell scripts that run DistCp and trying to keep everything consistent. That's AltoStor. And we're seeing tremendous interest; users are coming to us with different use cases. They tell us, "We don't have different types of Hadoop, but we do have different versions of Hadoop, and we want to use your software to upgrade from one version to the other." You bring down one cluster while the rest keep running, bring up the new version of the software, run your applications, test that everything is fine, and then decommission the old one. We do that with AltoStor as the platform, and you end up with a really easy path for migrating between versions of the software. So we're very excited about that.
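The "Hadoop-compatible file system" point is worth unpacking: any store that implements Hadoop's FileSystem contract can sit under the same application code, which is what makes mixing HDFS, Isilon, and S3 plausible. Below is a minimal sketch with illustrative cluster and bucket names; the s3a connector and its fs.s3a.* credential settings are standard Hadoop, while the strongly consistent replication between the two stores is the layer described above and not shown here.

```java
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HadoopCompatibleStores {
    // The same listing logic works on any Hadoop-compatible file system,
    // because it is written against the FileSystem abstraction.
    static void list(FileSystem fs, Path dir) throws Exception {
        for (FileStatus s : fs.listStatus(dir)) {
            System.out.println(s.getPath() + "\t" + s.getLen());
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // On-premises HDFS (cluster name is illustrative)...
        list(FileSystem.get(URI.create("hdfs://onprem/"), conf), new Path("/data"));
        // ...and Amazon S3 via the standard s3a connector (bucket name is
        // illustrative; credentials come from the usual fs.s3a.* settings).
        list(FileSystem.get(URI.create("s3a://example-bucket/"), conf), new Path("/data"));
        // A replication layer like the one described in this interview would
        // keep the two /data trees consistent, so a once-a-week burst of
        // compute can run against the S3 copy and then shut its VMs down.
    }
}
```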
Jeff Kelly: Is that a big pain point you hear about, migrating from one version to the next?

Jagane Sundar: Indeed, it's a big pain point, because it's not just the infrastructure. The applications have their own dependencies, and then people write Hive jobs and Oozie jobs that run on top of those. Validating and upgrading the entire stack is a painful process. It was a non-issue in years past, because people were moving slowly from POC to production. Now there are a lot of large data lakes built from Hadoop components, with a lot of applications running on them, so you can't flip the switch on a whim. You really need to plan an upgrade between Hadoop versions. And this isn't even about moving from one distribution of Hadoop to another; just moving between versions of the same vendor's distribution is a painful process.

Jeff Kelly: I think that's a good illustration of what we talked about: YARN got a lot of the attention last year, and now you can run a lot of different applications on your Hadoop cluster, so you have to take all of that into consideration. That's one of the issues enterprises are struggling with as they try to manage this environment while the platform itself is evolving so rapidly.

Jagane Sundar: That's very true. The other thing we keep encountering at customers is that they value both Cloudera and Hortonworks as distributions, but a lot of them are making sure their applications run well on both. I'll leave that to the interpretation of the viewers.

Jeff Kelly: Absolutely. All right, well, Jagane Sundar from WANdisco, thanks so much for joining us again on theCUBE. Great conversation as always. Thanks everybody for watching. Stick around; we'll be right back with our next segment, live at Big Data SV.