...from San Jose in the heart of Silicon Valley, it's theCUBE, covering Big Data SV 2016.

Hello and welcome back to theCUBE. We're here at Strata + Hadoop for our Big Data SV event. We've been doing this for a number of years, and we're very lucky to have with us a couple of excellent folks to talk about what's happening in the database market: Rajiv Madhavan and Ratnakar Lavu, along with George Gilbert. We're going to talk about what's happening in the database world, specifically as it pertains to increasing the productivity of the various assets associated with databases. The problem we're trying to address here is this: for the past 15 or 20 years, we've spent an enormous amount of time trying to virtualize virtually every resource except, it seems, the database manager. Why is that?

So essentially, when you virtualize, you give up some element of performance. Databases are not like a VDI application running on a handful of nodes; they run on thousands of nodes. When you apply virtualization to thousands of those nodes and try to bring in database applications like Hadoop, Spark, Oracle, or MySQL, you take a big hit in performance. It's the antithesis of the VDI model, where you try to hyperconverge. Here the resources are not hyperconverged; they're distributed across thousands of nodes. Storage could be on one set of nodes, compute on another, and even the compute and storage may have different performance tiers: you may call one a hot tier and one a cold tier, because it has an SSD layer or an NVMe layer. So when data is that vastly distributed and you apply virtualization techniques, you take a hit in IO performance. However people work around it, you get a 10 to 20 percent degradation in performance. So you could not bring databases into a virtualized world. What ended up happening is that virtualization stayed an application for VDI, operating systems, et cetera, while databases tended to run standalone on bare metal, at low efficiency. Some people ran them on VMs, but without actually getting any of the benefits of virtualization.

The second problem databases bring up is sharing. You may have a set of applications, I may have a set of applications; in a company, you may have 10 or 20 applications that need to share a common set of data. How do we do that? One Hadoop cluster used by 20 different sets of applications, in a massively distributed environment, while still making sure each application runs at the performance tier you want? Say George's application is the most important to us as a team: George needs to run at 90 percent utilization of the node, and the free 10 percent is split between me and you, for example. How do you do that across thousands of nodes, with compute and storage distributed across such a highly distributed environment? That's the problem databases bring, especially in the world of virtualization.
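To make that 90/10 priority split concrete, here is a minimal sketch using the Docker SDK for Python (docker-py). The image names and share values are illustrative assumptions, not Robin's actual mechanism; cpu_shares is a relative cgroup weight, so the split only takes effect when the two workloads actually contend for CPU.

```python
# Minimal sketch: weighted CPU priority between two database containers,
# using the Docker SDK for Python (pip install docker).
# Images and share values are illustrative, not Robin's actual API.
import docker

client = docker.from_env()

# cpu_shares is a relative weight (Docker default is 1024). Under
# contention, a 9216-vs-1024 split approximates the "90% / 10%" priority.
georges_app = client.containers.run(
    "mysql:5.7",            # hypothetical high-priority database workload
    detach=True,
    cpu_shares=9216,        # ~90% of CPU when both containers are busy
    environment={"MYSQL_ALLOW_EMPTY_PASSWORD": "yes"},
)

batch_app = client.containers.run(
    "mysql:5.7",            # hypothetical low-priority workload
    detach=True,
    cpu_shares=1024,        # ~10% of CPU when both containers are busy
    environment={"MYSQL_ALLOW_EMPTY_PASSWORD": "yes"},
)
```

Because the weight is relative rather than a hard cap, the low-priority container can still use idle CPU, which lines up with the goal of maximizing utilization rather than fencing off resources.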
So generally speaking, as you said, people have come up with suboptimal approaches to this, sometimes introducing extremely unnatural ways of partitioning the database, segmenting it in various ways. How, at Robin Systems in particular, do you think the solution is evolving? How are we going to solve those problems, especially as it pertains to this big data universe?

So this is where we latched onto the opportunity; this is where containers are an unfair advantage. There are no layers of operating systems and virtualization, no operating system running on top of a virtualized layer running on top of another operating system. You don't need that, and there's no IO overhead; you can use containers. The question is how you use container technology in such a fashion that you can distribute and run stateful applications. What Robin does is let you decouple compute and storage nodes, with different tiers of each, meaning one could be high performance and one could be low performance, yet bring the database together and share it among multiple applications, with guaranteed performance for each of those applications, all sharing and running on a common set of machines using container technology.

Are you sustaining state as you do this? Yes, we are sustaining state. It's a completely containerized database application, and you have different cost models. The application in my example, which needs a certain performance level, might be given the best compute resources and the best SSD performance tier. This is where containers are beautiful in what they let a company do. Now, when you think about containers, you basically think Docker, but Docker wasn't written for stateful, enterprise-type applications. So what Robin provides is stateful containers that can commingle and operate completely within a Docker environment. You can take Docker images, operate within what looks like a single environment, and orchestrate the entire set of application images. Applications can be diversely distributed, all sharing data across multiple storage and compute nodes.

So, Ratnakar, you're the CIO of Kohl's. The CTO, actually. Or CTO, yes. Is it real?

So, yes, it is. One of the big problems, as Rajiv pointed out, is that in order to actually run databases well across multiple applications, you generally had to shard the data across those applications. When you come to big data and think about leveraging it across multiple applications, the fundamental problem is how to do that well and effectively, using all the resources, without having to shard the data. That's when we saw Robin Systems as a solution, and Rajiv is absolutely right: it's this stateful container technology, leveraged across multiple applications. So we have piloted with Robin, and we are testing it across multiple applications. One is a recommendations application that powers our digital properties, along with analytics applications at the back end. We're leveraging Robin to see, one, is it true? And two, is it optimized in terms of resources and compute power?
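As a rough illustration of what "stateful" means at the Docker level, here is a minimal sketch using the Docker SDK for Python. The volume name, image, and password are hypothetical stand-ins; Robin's own storage layer works differently, but the lifecycle idea is the same: the data outlives any individual container.

```python
# Minimal sketch of a "stateful" container: database state lives in a named
# volume that survives the container, so the container itself is disposable.
# This shows the general Docker mechanism, not Robin's implementation.
import docker

client = docker.from_env()

# A named volume decouples the data's lifecycle from the container's.
client.volumes.create(name="pgdata")

db = client.containers.run(
    "postgres:9.6",
    detach=True,
    volumes={"pgdata": {"bind": "/var/lib/postgresql/data", "mode": "rw"}},
    environment={"POSTGRES_PASSWORD": "example"},
)

# Kill and replace the container; the data in "pgdata" is untouched.
db.stop()
db.remove()
db_replacement = client.containers.run(
    "postgres:9.6",
    detach=True,
    volumes={"pgdata": {"bind": "/var/lib/postgresql/data", "mode": "rw"}},
    environment={"POSTGRES_PASSWORD": "example"},
)
```

The point of the pattern is that the container becomes disposable compute, while the volume plays the role of the shared, persistent storage tier.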
And what we have found is that it is absolutely true in our tests. So now what we're looking at is how we scale that out across all of our big data infrastructure.

Just to key back on something you said earlier, Rajiv, about separating the compute from the storage: that's one of the main attractions of moving to the cloud, that you can spin your compute down to zero and just leave your data stored in S3. Is that something you could do on-prem without containerization? In other words, are you bringing one of the big economic benefits of the cloud down to on-prem environments?

It's very interesting. You can do that on cloud, or you can even do that across cloud and bare metal with integrated performance. For example, about halfway through our engagement we actually moved from a pure internal Kohl's private cloud to a hosted environment, and now we're deploying in a third hosted deployment environment. And the ability to share data across multiple of these environments is the other benefit you get. Think of it this way: you could keep the cheapest data in a very low-cost layer, while the hottest 5 percent of the data, the part used every day by, say, that recommendation engine, sits in the highest layer. The engine is probably pulling the latest and greatest of the data, but you still want to use the rest of the data for the analytical functions. Being able to move data to the right place at the right time, form a cluster for George's application, and, by the way, give George's application 90 percent priority with guaranteed performance: that's what we enable you to do.

But just to be clear, are you moving the data, or are you just providing an SLA for the IO bandwidth to data that's at rest somewhere? We're providing an SLA tier that lets you go across multiple of those tiers. So you are separating the compute and the storage? That's right, because when you have hundreds or thousands of nodes, that's what you need to do. Okay. That is correct.

So as you think about the challenges of doing this, especially as you move into production, who ends up administering the system? If you're not in the cloud, if you're on premises, is the database administrator now running the system, or is it the systems administrator? Who runs the system for you?

In our case, it's the systems administrator managing it, because one of our big purposes for Robin Systems was not only this ability to virtualize data but also to manage resources effectively. With Hadoop clusters, a big issue is that as we increase the amount of data and the number of applications, the number of nodes keeps increasing significantly, whether on-prem or in the cloud. One of the reasons I liked Robin Systems when I looked at it is how it optimizes workloads across multiple nodes and multiple workloads. So we are putting a lot more control in the system administrator's hands: if we assume all the required data is already there, then it's largely a matter of which applications are running on the nodes, how much of the data they need, and how much compute power they need.
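A toy sketch of the hot/cold placement idea described above, in Python. Everything here is a hypothetical assumption (the dataset names, the threshold, the tier labels); real systems like Robin make this decision with richer policies and SLAs, but the shape of the logic is the same.

```python
# Toy sketch of tiered placement: route datasets to a hot (SSD/NVMe)
# tier or a cold tier based on how often they are read.
from dataclasses import dataclass

@dataclass
class Dataset:
    name: str
    reads_per_day: int

def choose_tier(ds: Dataset, hot_threshold: int = 1000) -> str:
    """Place frequently read data on the hot tier, the rest on cold."""
    return "hot-ssd" if ds.reads_per_day >= hot_threshold else "cold-hdd"

datasets = [
    Dataset("recommendations-latest", reads_per_day=50_000),  # the "hot 5%"
    Dataset("clickstream-archive", reads_per_day=12),         # analytics data
]

for ds in datasets:
    print(f"{ds.name} -> {choose_tier(ds)}")
```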
So is it really changing the way we think about provisioning resources for databases? How does that work?

So let me give you an example. You typically see a 40 to 60 percent reduction in actual hardware resources, because right now you're not actually using that many of them: there are workloads that run and then die, but you've locked up 80 nodes for your application when they could have been used for my application during a peak load. Here, you create a cluster with real agility: you spin a cluster together, get a workload up and running, and once it's done, it's abandoned (see the sketch below). The data stays in the storage layer, and you're not welding the data and the compute together into a hyperconverged box, wasting resources. You allow that integration to happen on the fly as the application cluster is put together. Each cluster can be spun up very quickly, and after you're done it goes away: the nodes are simply released, and the storage goes back into the storage pool. That gives you about a 40 to 60 percent reduction in the compute and storage resources you would otherwise need, because today the only way to get any assurance on bare metal is for me to copy your data and run it in a separate cluster. But the moment I copy the data, he's changed it, and we're back to arguing about whether my data or your data is the right data. All of that is eliminated if you can share the data in such a fashion that all applications can use it and still get the performance they really need.

So Kohl's is obviously a large retailer, with enormous amounts of data coming in from a lot of different places. Are you also seeing uptake in some other industries as well?

Yes, we have engagements going on in finance, in retail, and in a number of segments in the communications space. Anything on the machine learning and big data side of things is happening, but it's not limited to that; it extends to Oracle and other databases as well. So it's going to go from a pure Hadoop kind of environment to Oracle, MySQL, all of that data. In fact, on the show floor we're running demos that show all of those applications running.

So should we look at this as helping customers migrate to a hybrid environment? Say they've gotten their feet wet with an on-prem deployment and found a good return, but they also want a sort of elastic capability out into the cloud while keeping their data on-prem if possible. Is this...

That's an applicable use case, but it's applicable for anybody just starting on-premises as well. You could be doing it purely on-premises, and the value would not change. You can use it on-prem, you can use it in the cloud, you can use it in a hybrid environment, and you still get the benefits of agility: spinning up clusters, having every resource utilized to the maximum rather than just buying hardware and copying things around. Those benefits don't depend on whether you're in the cloud or inside the enterprise; it's completely transparent to those issues. It's basically providing on-the-fly creation of a cluster of machines for an application, with a guaranteed SLA.

All right, we've got to wrap this up. Rajiv from Robin Systems, Ratnakar from Kohl's, thank you very much. Great session, a lot to learn.
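Here is a minimal sketch of the spin-up and tear-down pattern Rajiv describes above: an ephemeral compute cluster of containers mounted on shared storage, using the Docker SDK for Python. The cluster size, image, and volume names are hypothetical, not Robin's API; the point is that the compute is released while the data stays behind.

```python
# Minimal sketch: an ephemeral "cluster" of worker containers that all
# mount the same shared volume. Spin up, run, tear down; storage persists.
import docker

client = docker.from_env()

# Named volume standing in for the shared, persistent storage layer.
SHARED = {"shared-data": {"bind": "/data", "mode": "rw"}}

def spin_up(n: int):
    """Start n worker containers that all see the same shared volume."""
    return [
        client.containers.run(
            "python:3.9-slim",
            ["python", "-c", "print('worker ready')"],
            detach=True,
            volumes=SHARED,
        )
        for _ in range(n)
    ]

def tear_down(workers):
    """Free the compute; the shared volume (the data) stays behind."""
    for w in workers:
        w.stop()
        w.remove()

workers = spin_up(4)   # the cluster exists only while the workload runs
tear_down(workers)     # nodes released back to the pool, data untouched
```

The contrast with the bare-metal approach in the transcript is that no data is copied: every ephemeral cluster mounts the same volume, so there is one authoritative copy of the data.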
There's still a lot of technology ahead in the world of databases; it's a relatively mature technology, but there's a long way to go. All right, so for George Gilbert, once again, this is theCUBE here at Strata + Hadoop during Big Data Week in Silicon Valley, at the Big Data Silicon Valley CUBE event, and we'll be back in a few minutes after a short break.