Live from New York, it's theCUBE, covering Big Data NYC 2015. Brought to you by Hortonworks, IBM, EMC, and Pivotal.

Good afternoon, everyone. We're back at Pillars 37 in New York City, running Big Data New York City in conjunction with Strata Fall 2015. I'm here with Rajeev Madhavan, founding investor and chairman of Robin Systems, as well as Premal Buch, CEO, and we're happy to have them on the show. Guys, for our audience who may not be familiar with you, what's the problem set that Robin Systems addresses? What's the pain point that customers have today with standing up big data infrastructure?

So we are a data infrastructure company. The reason we started Robin was that we looked at the IT infrastructure out there and realized that it was all designed for legacy monolithic applications, and today's distributed applications really require something different. The pain we are trying to solve is that, because of these infrastructure gaps, today we have data sprawl and cluster sprawl: a lot of data duplication, data pipelining, and the need for physical, dedicated clusters, which creates problems with agility. People cannot get their applications up and running quickly; it takes them weeks, sometimes months, to get new applications and new hardware added to their IT farms. On cost, people are not able to keep up with the rising IT investment needed to keep up with their data demands. Solving all these problems without compromising on performance is really what we're all about.

Okay, so take us one level further, because the most uninformed of us would be thinking, okay, that's perhaps a deployment problem, or you might use a virtual machine, which carves up one machine into many machines. And at one point, VMware talked about provisioning Hadoop clusters with virtual machines. Help us understand your approach in relation to that approach.

So let me take that question here, George.
Thanks for having us. If you look at the tenets of virtualization when VMware came into the picture, the whole idea was that you had one application, say a Windows application, and you made one server look like four Windows servers, right? That's what virtualization achieved. People have since taken that virtualization into hyper-converged infrastructure, and taken applications like VDI, where you need six to ten of these servers to virtualize. That was all fine when your application stood up on six nodes. But today you have hundreds and thousands of nodes, and you have to get down to process-level cluster formation. It cannot be this hyper-converged device where compute and storage are tied together, because what happens in a big data environment is that your needs may be growing in storage but not in compute, or growing in compute but not in storage. So how do you put together a virtual infrastructure with containers, which is what we have chosen in our solution? What we have done, essentially, is store the data only once, but allow multiple virtual views of it using containers.

Okay, you've said some very interesting things there. One of the tenets of big data and Hadoop is to put compute and data together, because it's expensive to move data, in terms of data gravity and I/O throughput limitations. Now you've talked about, at least as I've heard you say, separating compute and storage, either physically or virtually, so that you have more flexibility in creating the infrastructure, and I assume in this case we're talking about clusters. So this makes it easier to stand up clusters.

So we're decoupling compute and storage. In the traditional world of Hadoop, for example — and that's just one example; it applies to many of these analytical application scenarios —
the data and compute are in one place. Robin's belief is to take compute not to where data is stored, but to where data is cached. So we have a pool where everything is stored, but in the host layer we cache the data that is currently being used. So we have a host layer and a storage-pool layer, which inherently means you get the best of both worlds: the cost advantage, because you can use the lowest-cost media to store your data on, and the highest performance, because you can use the SSD and RAM on the host side to get higher performance. So technically, we have cut the umbilical cord between compute and storage.

Would it be fair to say that, essentially, by caching the data from the storage layer on the compute layer, you're creating ephemeral storage, which some of the solutions that put Hadoop on Amazon, for instance, might do, where they make use of memory and SSD only? How do you decide what's the hot data that needs to be cached and what's the cold data that can remain on the storage tier?

Yeah. So there are a couple of things you need to solve in order to solve this problem. First, figuring out what stays in the hot layer and what's cold — that's one. Second, even if you solve that problem, one fundamental issue with Hadoop, and this particularly applies when you start adding more applications, is that you end up copying data into each cluster. Each Hadoop cluster is a data silo. It is not sharing data with other Hadoop clusters or with other applications. So what we are really doing is creating a data lake where there is a single source of truth, and then giving all the different applications a view into that data — a virtual view. That really allows you to eliminate all of the data duplication, and the pressure that you put on the network and on disk I/O.
So now, when you're trying to get at the data, that's where the caching comes in. And you can't really solve one problem by itself. You could just do caching with a very hot, expensive tier, but...

It would almost be like the dumb caching an operating system does.

That's right, that's right. So really there are three tiers to this. It is very intelligent caching and intelligent cluster generation, and we are trying to solve three things together: performance, cost, and agility. There are people who can solve any one of these problems by itself. You can be very agile and quickly provision things, but not really address the storage-side issues. You can be low-cost and pool a lot of data on cheap devices, but then suffer on performance. Or you can have a really hot tier and blow up your budget. And you can't just slap these three individual solutions together to get everything. That's where you have to rethink the whole infrastructure and put together a fundamentally new solution that covers all of these things simultaneously. Because we have this host-side caching layer, you can write a rule in our system saying anything that's older than six months automatically moves to the lower-cost tier. It could be Ceph in a hybrid cloud infrastructure that you're using. So the data does not even have to be on your premises; it can be on-prem or off-prem in a hybrid cloud environment. We will let you tier it across the hot tier, the warm tier, and the cold tier. Different tiers can be created and automatically managed by the system.

Just to be down in the weeds a bit: are you doing this based on time-to-live, or are you making more intelligent decisions about what's hot and what's not?

So, time-to-live is a rule. You can override what we do intelligently in our system. What we have is, we know the access pattern.
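The age-based tiering rule described here — "anything older than six months moves to the lower-cost tier," with user rules overriding the automatic behavior — can be sketched roughly as below. The `TieringPolicy` class, tier names, and `pin` override are hypothetical illustrations for this interview, not Robin Systems' actual API:

```python
from datetime import datetime, timedelta

class TieringPolicy:
    """Toy sketch of an age-based tiering rule with a manual override,
    illustrating the 'older than six months moves to cold storage' example."""

    def __init__(self, cold_after_days=180):
        self.cold_after_days = cold_after_days
        self.overrides = {}  # dataset name -> tier pinned by a user-written rule

    def pin(self, dataset, tier):
        """A user rule that overrides the automatic placement."""
        self.overrides[dataset] = tier

    def tier_for(self, dataset, last_access):
        # Explicit user rules win over the automatic time-to-live policy.
        if dataset in self.overrides:
            return self.overrides[dataset]
        age = datetime.now() - last_access
        return "cold" if age > timedelta(days=self.cold_after_days) else "hot"

policy = TieringPolicy()
policy.pin("fraud-model-features", "hot")  # keep on the hot tier regardless of age
old = datetime.now() - timedelta(days=400)
print(policy.tier_for("2014-clickstream", old))       # cold: past the 180-day rule
print(policy.tier_for("fraud-model-features", old))   # hot: pinned by the override
```

A real system would drive this from observed access patterns rather than a single timestamp, which is the "intelligent" part the speakers describe layering under the rule.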
So we are basically using our own analytics to determine what the usage patterns of this data are, and based on that, making it available in the right tier in the system, right? Now, you can override that by writing your own rules that say, no, for this one I want to use that particular infrastructure.

Okay. Let me go back now to the storage layer — the base storage layer that's shared, it sounds like, across clusters, or virtual clusters if necessary. Is this like a single namespace? And does that mean you can use something other than the three-way replication that is native to Hadoop?

That's a good point. It is indeed virtualizing the entire storage and data with a single namespace. We have a different way of implementing redundancy compared to Hadoop, so we are much more efficient with it. Instead of three-way replication, we can cut that down by a factor of two. And this is key to our solution: how we virtualize not just the storage but also the data across all applications. And it's not just Hadoop; we could have any file systems underneath our system. We can provide you RAID-6-style erasure coding across this virtual pool of storage that you have defined, right?

So you do the redundancy the way high-end storage systems used to do it?

That's correct. It is more like what EMC does on its high-end storage appliances, applied to enterprise big data and analytical applications.

Okay. So, all right, then I think I'm with you on that storage layer. And I think I understand the intelligent cache. Now, is the third layer the application itself?

The third layer is basically, you take an existing application and you provide an image. It could be a Docker image — we natively support Docker — or LXC; it does not really matter. And then there's a cluster orchestration capability.
So you can essentially set rules saying, I want to create a new cluster with two terabytes, et cetera, of storage. Or you can leave the system to make intelligent choices for you as and when required, right? So it's much more of a completely intelligent platform that we're building, where it can span not just your local enterprise cloud infrastructure but your hybrid cloud infrastructure, across multiple racks and tiers of racks, right?

Okay, but by hybrid in this case, you mean virtual clusters on a common physical infrastructure? You don't mean partly private cloud and partly public?

I actually mean partly private, partly public. Because the beauty is this, right? We are not moving the entire data set; we're only moving the data that's currently under compute into a cache layer. Hence we can actually prevent data sprawl: data can be stored in a single virtual storage pool, and only the data that's currently being used is moved into the host tier. From a chief data officer's point of view, this is very important, because their worst fear is duplication of data, because then they have that whole problem of trust, you know, who touched this versus who touched that. So you can give them a single copy — one that's redundant, but not in the replicated-redundant sense.

And then help me understand the scenarios where you might stand up different clusters for different applications.

So, let's say you have production, dev, and QA clusters working on the same data. Today, you would have to create three different clusters and three copies of the data before they can go off and run. That means not just duplication of data — a capacity hit — but also an agility issue, because now you have to copy data over before you can start a new cluster. This is where what we are doing is a single source of truth: the clusters that you spin up have only a virtual window into the data.
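The "virtual window into the data" idea for production/dev/QA clusters resembles copy-on-write snapshots. This toy sketch (hypothetical classes, not Robin's implementation) shows how three clusters can share one stored copy while any writes stay private to the cluster that made them:

```python
class DataLake:
    """Single source of truth: each block is stored exactly once."""
    def __init__(self, blocks):
        self.blocks = dict(blocks)

class VirtualView:
    """Copy-on-write window: reads fall through to the shared lake,
    writes are materialized only in the view's private overlay."""
    def __init__(self, lake):
        self.lake = lake
        self.local = {}  # only modified blocks consume extra space

    def read(self, key):
        return self.local.get(key, self.lake.blocks.get(key))

    def write(self, key, value):
        self.local[key] = value  # never touches the shared copy

lake = DataLake({"orders.parquet": b"...production data..."})
prod, dev, qa = (VirtualView(lake) for _ in range(3))

dev.write("orders.parquet", b"...scrubbed test data...")
print(prod.read("orders.parquet") == qa.read("orders.parquet"))   # True: one shared copy
print(dev.read("orders.parquet") == prod.read("orders.parquet"))  # False: dev's write is private
```

Provisioning a new cluster under this model is just creating another `VirtualView`, which is why the speakers can describe it as near-instantaneous: no data is copied.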
So, therefore, when you start a new cluster, you don't really need to copy all the data over. So it's not just single-click provisioning; it is almost instantaneous provisioning, and that's really the difference — a lot of people talk about single-click.

It's instantaneous because you don't have to move data.

Exactly.

Okay. And then it's interesting, right? There's a 3x just from the scenario you gave of QA, dev, and production. And there's another 3x on top of that, because Hadoop requires 3x replication of the data, right? So instead of 9x, you have basically 1.5x: 1x for the data plus a half for the RAID-6-style encoding. So the amount of storage and compute cost that we can save in the system we're providing is huge. In reality, as Premal pointed out early on, it's not just about agility. It's a combination: can you give the agility in a split second, because the storage is already there, to get a new cluster up and running? Can you, at the same time, improve your performance by 2-4x? And can you reduce your cost by 2-3x?

The improved performance was the cached data.

That's the cached data.

And the cost?

Cost is because, one, the duplication and replication and all those things are gone.

Those storage savings.

Storage savings. And the utilization of the compute clusters is much better, because we are using container-based technology where every storage container, compute container, and the network are put together into a virtual cluster. So within the same machine, you could have several sets of containers being put together to form a particular orchestrated cluster that gives you the performance you need.

Are customers today using this for dev/test and pilots? Where are they in the maturity of their journey towards production applications?
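The capacity arithmetic in this exchange works out as follows, using only the figures quoted in the conversation (three full clusters at 3x replication versus one shared copy at roughly 1.5x erasure-coded overhead):

```python
clusters = 3                 # production, dev, and QA each get a full copy today
hdfs_replication = 3.0       # Hadoop's default 3-way replication
traditional = clusters * hdfs_replication       # 9x raw storage per byte of data

shared_copies = 1            # single source of truth, shared via virtual views
erasure_overhead = 1.5       # 1x data + ~0.5x RAID-6-style parity
robin_style = shared_copies * erasure_overhead  # 1.5x raw storage

print(traditional)                    # 9.0
print(robin_style)                    # 1.5
print(traditional / robin_style)      # 6.0 -> six-fold reduction in raw capacity
```

The 2-3x cost-reduction claim in the interview is smaller than this raw 6x because compute, networking, and operations don't shrink in the same proportion as storage.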
So we have three customers. One is a very large retailer with 30 petabytes of data. In that particular case, we are still in the development-environment phase; we go into production in the next couple of months, which is when we will launch the company. We have another one, which is a smaller retailer, and a banking and insurance giant as a third customer, right? So we have all of them going into production over the next couple of months. By then, some 60 petabytes of data will have gone through our system. So it's not, you know — we've spent two years of funding making the storage platform very robust. One of the beauties of this container-based flow we have put together is the reliability. For example, even with all the duplication that exists today, when a job dies — let's say a storage hard disk dies, or a node dies — you have to restart it, copy the actual data, and restart the job in today's big data environments. In the case of Robin, what ends up happening is that we know, in the metadata layer, what the stage of each container is. We would just say, okay, this container on that node died; we form another container on another node, move that particular compute job over, and be up and running — and the data is always available. There is no stopping to copy the data to get the thing back up and running. You don't lose that 30 or 40 minutes, or whatever time it takes. Your job continues unabated.

So this sounds, to those of us who grew up with this concept with VMware — you know, at first they made it easy to have maybe a dev, a test, and maybe a runtime environment on a single workstation, and then they were doing dev/test for different operating system platforms on a server — what might this look like in production?
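The failover behavior described here — reschedule the container on another node and resume from its recorded stage, rather than re-copying data and restarting the job — might be sketched as follows. All names are hypothetical; the real system's metadata layer is not public:

```python
def recover(failed_container, nodes, metadata):
    """Toy failover: because data lives in the shared pool and each container's
    stage is tracked in metadata, recovery is re-placement, not re-copying."""
    healthy = [n for n in nodes
               if n["alive"] and n is not failed_container["node"]]
    target = max(healthy, key=lambda n: n["free_slots"])  # pick least-loaded node
    stage = metadata[failed_container["id"]]              # resume point, not a restart
    return {"id": failed_container["id"], "node": target, "stage": stage}

nodes = [{"name": "n1", "alive": False, "free_slots": 0},
         {"name": "n2", "alive": True,  "free_slots": 4}]
failed = {"id": "c-42", "node": nodes[0]}

new = recover(failed, nodes, {"c-42": "reduce-phase"})
print(new["node"]["name"], new["stage"])  # n2 reduce-phase
```

The key contrast with plain Hadoop recovery is the last line: the replacement container resumes at `reduce-phase` because the shared storage pool means no data had to move first.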
Is it that you want greater resource isolation than YARN might give you? Or is it really, you know, the shared storage? I guess my question is, we have resource-sharing solutions and orchestration — it's almost a remedial question — but where do they come up short? You've given me an answer, but help frame it again relative to those.

George, you could be using YARN, and we will give YARN the benefits we are talking about too, by the way. So it's not like we are handicapping YARN. But YARN cannot do some of the things we are talking about, which is all this sharing of data and the storage value propositions we are providing. It cannot give you this whole capability of container-based isolation and running jobs across multiple of these clusters. It cannot share the data across production, QA, and dev constructs. We give you all those benefits. We also have intelligence in the system around orchestration, and it's always relative. Let's say you have 10 jobs running on a 1,000-node cluster, and you're going to create different subsets of virtual clusters. This job is more important — it needs 30 milliseconds per operation; that job is not as important. You can specify that in our SLA, and we will control the applications so that the network, compute, and everything give the important one the priority needed to hit that performance level. So I think the way to look at it is, there are resource managers out there today, but we still have this problem of low utilization of data centers. And why?

I'm sorry — you have this from...

The low-utilization problem with all these big data applications. And when you go to the core of it, managing resources is one thing, but you need to solve the data problem behind it. And that's where we really come in. So we can work with any orchestration resource manager at the top level.
But behind it, you need a deeper technology than just managing containers and starting off jobs. And I think that's really where we offer a lot more value beyond the existing resource managers out there. You can use any existing resource manager with our system and you would still get the benefits we're talking about, right? Because underneath, we bring in an intelligence layer that knows the SLA requirements of each of the jobs you're bringing into a virtual cluster. One of the problems with multi-tenancy is that if you now have 10 people sharing the system, you want to ensure certain applications get certain performance levels. We can ensure that.

Okay, so the current resource managers are kind of static, in the sense that you assign, I guess, a certain amount of CPU and memory, and probably can't isolate a certain amount of I/O. So you could have multi-tenancy, but the tenants are going to step on each other. One, we know there's the duplicated storage. And we can't guarantee the I/O, so with multi-tenancy you have that problem of not guaranteeing I/O quality of service, because tenants can step on each other.

What if you have one guy who's spinning up ten applications and screwing up the whole system in its entirety? And in fact, if an application was written assuming it's streaming writes to SSD, and two programs on the same node start streaming and get slices of time from a resource manager like YARN, then you lose the whole benefit of writing a stream to a solid-state disk.

That's right. And you also have Hadoop, MongoDB, Cassandra — there are multiple applications. So that's where we provide a layer that allows you to control and isolate access to resources in an application-agnostic way. And that's where we add value, no matter what your top layer is.
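The SLA-driven prioritization discussed in this exchange — a latency-sensitive tenant gets a guaranteed slice of I/O, while best-effort tenants split what remains so noisy neighbors can't starve it — might be sketched as a simple allocation rule. This is illustrative only; the SLA fields and tenant names are assumptions, not Robin's configuration format:

```python
def allocate(total_iops, tenants):
    """Reserve guaranteed IOPS for tenants with an SLA first, then split the
    remainder evenly among best-effort tenants."""
    guaranteed = {t["name"]: t["min_iops"] for t in tenants if "min_iops" in t}
    remaining = total_iops - sum(guaranteed.values())
    assert remaining >= 0, "SLA guarantees oversubscribe the device"
    best_effort = [t for t in tenants if "min_iops" not in t]
    share = remaining // max(len(best_effort), 1)  # even split of the leftover
    return {**guaranteed, **{t["name"]: share for t in best_effort}}

tenants = [{"name": "fraud-scoring", "min_iops": 60000},  # latency-sensitive SLA job
           {"name": "nightly-etl"},
           {"name": "adhoc-sql"}]
print(allocate(100000, tenants))
# fraud-scoring keeps its 60000; the other two get 20000 each
```

A static resource manager that only time-slices would instead let all three tenants contend freely, which is exactly the streaming-to-SSD failure mode George describes.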
George, the funny thing about Robin is we don't need you to change one line of application code. Even if the application was never written to use SSD, we'll give you SSD performance.

Okay, so...

That's because, since we are installed in the system, it hits our metadata layer, we move the data, and you just get that benefit from how the cluster is set up, right? There's no change in application code, and no change in whatever resource manager you're using: whether you're using YARN or Mesos, you can continue to use YARN or Mesos. We just add the intelligence and the storage-container abstraction to create a virtual cluster in which all of that operates in a controlled SLA environment.

So I'm thinking about this, and I'm wondering — it sounds like all the Hadoop distro vendors would be all over this, and maybe not even just the Hadoop distro vendors, but maybe Databricks or anyone else who's got a data-intensive compute framework and needs to run multiple applications that share resources. Is that part of the plan — once you're in production, to show them?

So we've already handled three of the Hadoop distros in different customer scenarios. We have Databricks, which we have obviously supported with Spark, in our fourth POC, which is just commencing. And we have joined the partnership programs of two of these companies, but the focus has been on the three to six customers we're working with today, right? Over the next few months, we'll be adding the resources to scale this and put more effort into working through these partnerships. But I want to emphasize this: this is not a Hadoop-only solution. It can do NoSQL; it can do any application that is distributed. No change in the application, no changes in terms of the data flow. And it can be block storage, file storage, or object storage — it could be Ceph for certain clusters; we don't care.
We allow you to provide a cloud for distributed stateful applications that gives you the performance objectives we talked about.

So it seems like a fair way to describe this would be: we took VMware and server virtualization to the limit when we were trying to improve the efficiency of what were essentially single-machine applications. And now, when almost all applications are distributed by default, this would be the VMware for that era.

For distributed stateful applications, we are the VMware of that world, no doubt about that.

Okay, that's a very powerful value proposition. All right, gentlemen, thanks for being on — Robin Systems, quite a story. Rajeev Madhavan, founding investor and chairman, and Premal Buch, CEO. We expect to hear more from you and hope to see you at our next show, and to hear progress with not just the first three production customers but many more, and big partnerships.

Absolutely, thanks for having us.

This is George Gilbert, reporting from Big Data New York City. We're at Pillars 37, right outside the Javits Center and Strata Fall 2015. Thanks for joining us. Live from New York, it's theCUBE, covering Big Data and...