and Square in the heart of San Francisco. It's theCUBE, covering Spark Summit 2016, brought to you by Databricks and IBM. Now here are your hosts, John Walls and George Gilbert. I think it's pretty safe to say the sparks are flying here at the Spark Summit. A lot of energy here on the Expo floor as we continue our coverage here on theCUBE. 3,500 attendees, the largest Spark Summit by far in the three-year history of this event. It really shows you how this has continued to generate a lot of enthusiasm within this community. Along with George Gilbert, I'm John Walls, and we're joined now by Ashish Thusoo, who is the CEO of Qubole. Ashish, thank you for being with us, we appreciate the time. Thank you for having me here, glad to be here. Let's just start off at the 30,000-foot level here. Obviously you deal with big data in the cloud, trying to introduce clients to a variety of services there. How does cloud-based infrastructure empower people, or maybe engage people or entice people, to get involved with big data? Yeah, no, that's a great question. The big thing about cloud-based infrastructure is that it helps in two areas. One, it helps in making things self-service, and second, it helps in automation of infrastructure. So what that means is that when you're using a cloud-based solution for big data, the lifecycle management of clusters, or lifecycle management of the infrastructure, can be automated, because cloud is a very dynamic environment. You don't have to think about deploying infrastructure, then using it, and then getting your results. You can deploy infrastructure on the fly. You can say, I need a certain type of computation done, and the infrastructure can be created on the fly and dynamically scaled up and down. So that automation significantly reduces the dependency on experts that a company needs for running the infrastructure, because all of that is automated out in the software. 
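The create-on-the-fly lifecycle Thusoo describes can be sketched in a few lines. This is a minimal toy, not a real cloud SDK: the `provision` and `terminate` calls are hypothetical stand-ins for whatever API a given platform exposes.

```python
# Minimal sketch of the cluster lifecycle described above: in the cloud
# model, infrastructure is created for a computation and torn down when it
# finishes, instead of being deployed first and used later. The provision/
# terminate calls are hypothetical stand-ins for a cloud SDK.

class EphemeralCluster:
    def __init__(self, nodes):
        self.nodes = nodes
        self.state = "pending"

    def provision(self):
        # Stand-in for launching instances on demand.
        self.state = "running"

    def terminate(self):
        # Nothing persists: compute exists only for the job's duration.
        self.nodes = 0
        self.state = "terminated"


def run_computation(job, nodes=4):
    """Create a cluster for one job, run it, and tear the cluster down."""
    cluster = EphemeralCluster(nodes)
    cluster.provision()
    result = job()                  # the actual Spark/Hadoop work
    cluster.terminate()
    return result, cluster.state

result, state = run_computation(lambda: sum(range(10)))
```

The point of the pattern is that no standing infrastructure outlives the job, which is what removes the operational-expert dependency he mentions.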
The second thing that the cloud infrastructure does is it helps make things much more self-service. The cloud itself has been built on self-service principles, which are very different from your typical enterprise infrastructure, the data center side of things, where you have to deploy, and then there's a central team which can use that infrastructure, and everybody's dependent on that central team. But in the cloud, you can create things which are much more self-service, and when you tie the self-service interfaces with automation on the back end, you get something that is usable at a much broader scale and with a very, very small operational footprint that companies need to deploy. The net-net of that is that this essentially leads to an explosion of how big data can be used within the enterprise, who can use it, and how enterprises can operationalize it in a much, much easier way. So let me hit that "who can use it" part. I mean, who is the optimal user then? You've described this really quite capable scenario, right, a pretty rich environment for data transaction and computation, but who in your mind, who's the prime beneficiary here? So, you know, it's a complex environment. Data can be used by a lot of different types of people. The primary personas that we see big data being used for are either the developers who are trying to build applications on big data platforms; analysts who are trying to look for certain patterns, I would say, you know, I'll call them much more SQL-based analysts, doing ad hoc analysis on data; and data scientists who are trying to, you know, develop deep learning, machine learning sort of applications using big data. And then the fourth persona is more like ETL engineers who are trying to, you know, put data into a place, into a form, so that all of these other groups can use it in a productive manner. 
So, it's a very complex, or, you know, a very diverse environment as to who can use the data. The critical thing there is that for all of these groups to be able to use the data, you need to make infrastructure self-service so that, you know, these groups can come in through either the interfaces that the infrastructure provides or through third-party interfaces. And for the central data team that is, you know, in control of this infrastructure, you have to make it very simple for them to allow this broad-based access. But the usage of data can be very, very diverse. There's almost a gradation or a spectrum of the technical capabilities of the users who are using this data, you know, all the way from people who are just consuming it through canned interfaces and apps to the people who are actually, you know, dynamically playing with data, exploring it and creating applications out of it. So, it's a very, very diverse environment, in the enterprise especially. In the smaller companies, it tends to be a lot more developer-centric, but as you start to mature as an organization, it starts moving from there to a very broad-based set of users who can use that data. So, let me key off that, Ashish. Maybe it's hard to find, you know, statistically significant data, but perhaps with anecdotes: if a company starts off with an on-prem Hadoop deployment, you know, starts with a POC and then a pilot and then gets into production, what type of experience curve is that, and how does that compare with, you know, starting with Qubole in the cloud and going through that same lifecycle, if that lifecycle is the same? Yeah, so I think the experience is completely different. We have seen market research that suggests that for on-prem deployments, it takes anywhere between six to 18 months for those deployments to get into production. 
It is one thing to download something and, you know, play on a few small nodes to test out things, but when you're, you know, deploying production-ready clusters, it takes six to 18 months, according to market research that we have seen, in order to get to production. And even there we see, you know, some of the estimates range from 13 to 25% success rates for those types of deployments. So what takes so long? What are some of those thorny tasks, whether it's the fact that, you know, each component has a different security model or, you know, who knows what, and then why the low success rate? So there are multiple, you know, if you look at the supply chain of creating, you know, a big data infrastructure on-prem, there are multiple bottlenecks. You know, the first bottleneck is typically, you know, getting the hardware and certifying it and, you know, deploying the infrastructure there. The second bottleneck is expertise. You know, all these projects, I mean, we are talking about Spark. You know, four years back, five years back, Spark was still in its infancy. So the ecosystem of these projects is highly evolving, and it's very difficult to find expertise that can manage and deploy all of these in a way, you know, that makes optimal use of these engines. And then from there, once you start expanding and scaling, the bottleneck becomes capacity. You know, I ran the data infrastructure at Facebook before I started Qubole. We were always running to, you know, add capacity to these clusters, because as it starts getting into production, the plethora of use cases and personas and users who start using the system put a lot of load on the system. So all these three things make it very complex in an on-prem environment. So when you're actually in production, you've got all these bottlenecks and you have to chase them down. Yes. So, because you're not generally just making everything bigger, you're trying to find the critical sections. 
Yes, you're doing that, and then, you know, in an on-prem environment, you are basically bounded by laws of physics. There's a certain size of your infrastructure, and everything has to fit in that. Now, if you take that, you know, coming to your original question around the experience in the cloud, the experience on the cloud is very different, because there you don't think about setting up the infrastructure and then doing your processing. In the Qubole model, for example, people, you know, send in the processing, and on the basis of the processing, we are creating the infrastructure on the fly. So not only do you get that elasticity and on-demandness, making the infrastructure fit your, you know, your computation, not only does that take away some of the risks, the second thing that takes away risk is that in the Qubole model, everything is production. You know, you are getting the same SaaS platform that is already running, you know, already processing more than 300 petabytes of data every month. And you get access to that from day one. It's not like, you know, in an on-prem model, where you're the first deployment and have to, you know, grow to that scale. You're already getting access to that type of a scale. So, but you can't completely hide the seams between the components in Hadoop, despite heroic engineering. So what shows through to the customer who's using a Qubole service? So I think in the Qubole service, the take that we have taken is, you know, bring the best of great tools for each particular use case and provide that in Qubole, supported by a common infrastructure, a common orchestration layer that deploys infrastructure. For example, for Spark, Notebooks is becoming very, very popular. So we provide Notebooks interfaces for Spark. We don't try to say, you know, this interface is now going to get used for all the other use cases. Similarly, for developers, we have API integration. 
For SQL users, we have SQL workbenches and ODBC connectivity. So the take that you have to take is, you know, provide a plethora of different tools. Every persona has a particular tool that they like, because that tool does the 80% of their work very efficiently. And we make sure that that tool is coming through a common metadata plane or a common control plane into Qubole, and the orchestration of the infrastructure is automated, whether you're running machine learning Spark jobs, or you're, you know, running Presto queries for SQL analysis, or you're, you know, doing Hadoop jobs. All of that comes through the same central control plane. So I mean, when you look at it, you have a lot of options in systems. What is it about Spark? You've talked a little bit about it, I mean, what is it about Spark that you find most appealing? Yeah. And ultimately, what are your accounts doing with it? What do they find to be the optimal use of it? Yeah, so great question. So Spark started really taking off about two years back in terms of usage. The thing that we find our customers are saying about Spark is the APIs are extremely rich. So the programmability and the API experience is very, very good. The second is, especially, you know, a lot of machine learning jobs are about iterating on the same data set again and again. And in that paradigm, you know, Spark started off as an in-memory platform and has diversified into a much broader platform from that. But in that paradigm especially, holding things in memory and having a great memory architecture, like Spark has, has significant advantages. And that is what we see a lot of our customers doing. They're using Spark for a lot of machine learning use cases, trying to do deep learning stuff. The libraries like MLlib and stuff also help in that area. So that use case is essentially emergent. 
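The workload shape he describes, iterating on the same data set again and again, is easy to see in miniature. The toy below is plain single-machine Python, not Spark: it loads the data once and rescans it every iteration, which is exactly the access pattern that Spark's in-memory caching (`cache()`/`persist()`) is built to serve at cluster scale.

```python
# Toy illustration of the workload shape described above: machine-learning
# jobs iterate over the same dataset many times, so holding it in memory
# pays off. Plain Python, single machine -- the point is the access
# pattern, not Spark's API.

def gradient_descent(points, lr=0.01, iterations=100):
    """Fit y = w * x by least squares; every iteration rescans `points`."""
    w = 0.0
    n = len(points)
    for _ in range(iterations):
        # The full dataset is re-read each pass -- from memory here; on a
        # disk-based engine each pass could mean re-reading from storage.
        grad = sum(2 * (w * x - y) * x for x, y in points) / n
        w -= lr * grad
    return w

data = [(x, 3.0 * x) for x in range(1, 11)]  # points lying exactly on y = 3x
w = gradient_descent(data, lr=0.005, iterations=200)
```

Two hundred passes over one dataset is routine for this kind of job, which is why keeping the working set in memory, rather than re-reading it per pass, dominates the runtime.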
We see that data science persona, people who are doing, you know, deep learning, and they might also use SQL for some bit of exploration and stuff. That persona really likes Spark because of these two reasons. You know, the speed of iteration, because of the in-memory, you know, background. And second is the richness in the APIs that they can use to, you know, drive higher productivity, to make their job more productive. To get to what they're trying to do quicker. To get to the point where they can model. Right, so they want to get there quicker. I want to jump in on one question before we have to go, which is, Hadoop was designed with the assumption that it's deployed on-prem. And, you know, it assumes a certain amount, a certain type of infrastructure. And then there's the cloud-native infrastructure from Amazon or Azure that assumes a different type of infrastructure. How does that show through in terms of the administrative demands, you know? And then the third category is where you're trying to combine the best of both. Yeah, so I think the cloud infrastructure and the on-prem infrastructure, and the way you deal with Hadoop in these two environments, is fundamentally very different. And the biggest difference in my mind comes from, you know, on-prem infrastructure, and Hadoop has been harping on this, was built on this premise that compute and storage are converged. You know, the same clusters have HDFS, which is storing the data, and on the same clusters you are running the compute. On the cloud, the infrastructure that we have built is on a completely different assumption, where the storage is actually on an object store. For example, in AWS it'll be S3, in Azure it'll be Blob Storage, in Google it'll be Google Cloud Storage. So that frees up compute to become dynamic. So compute is not required to be persistent in the cloud, and that allows you to do all this, you know, dynamic scaling and stuff like that. 
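The separation he describes shows up concretely in where data lives: each cloud exposes an object store behind a URI scheme, and transient compute clusters are pointed at it rather than at cluster-local HDFS. A small illustrative helper, under stated assumptions: the schemes are the ones commonly used by Hadoop-compatible clients, but the function itself is just a sketch, and real Azure WASB URIs also carry a container and storage-account host that are omitted here.

```python
# Illustration of the storage side of the compute/storage split described
# above: data lives in each cloud's object store, addressed by URI scheme,
# while compute clusters come and go. The schemes are those commonly used
# by Hadoop-compatible clients; the helper function is just a sketch.

OBJECT_STORE_SCHEMES = {
    "aws": "s3a",     # Amazon S3 via the Hadoop S3A connector
    "azure": "wasbs", # Azure Blob Storage (real URIs also name a container
                      # and storage account; simplified here)
    "gcp": "gs",      # Google Cloud Storage
}

def table_uri(cloud, bucket, path):
    """Build the object-store URI a transient cluster would read from."""
    scheme = OBJECT_STORE_SCHEMES[cloud]
    return f"{scheme}://{bucket}/{path.lstrip('/')}"
```

Because the data's address is independent of any cluster, a cluster can be created, pointed at the same URI, and destroyed without the data moving.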
Now, some of the cloud vendors themselves, essentially they had taken this on-prem, you know, Hadoop and, you know, put that in a similar sort of, you know, architecture to what we have done. The place where we have done better than those cloud vendors is, number one, we are agnostic to all these clouds. The same Qubole interface runs on Azure or Google or AWS, as opposed to being specific to that particular cloud. The second thing that we have done is automation on the clouds to a much better degree. For example, auto-scaling is something that we have had in Qubole for a very long time, right from the start, which assumes a deep knowledge of both Hadoop as well as a deep knowledge of, you know, what the cloud infrastructure can provide, and combining those together to get these benefits. These are not the benefits that you see in the on-prem world. You can't, you know, think about auto-scaling and, you know, dynamic clusters and things like that in the on-prem world. One more point. The second thing is also the security models on the cloud are different from what you see in, you know, the on-prem world. You know, in the on-prem world, you're talking about, you know, directory integration, Kerberos and stuff like that. When you come to the cloud, since cloud was built from the outside in, a different type of- There's no perimeter. There is no perimeter. So it's like logical security, you know, security groups and stuff is all logically created. So you can do things in a much different way there than the very static security model that you have on-prem. So would you say that Google Cloud Dataproc, which is their Spark and Hadoop service, and then HDInsight on Azure, and then Amazon EMR, are they really pretty much straight ports from the on-prem code, or have they done some adaptation to be cloud-native? 
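The auto-scaling he credits to combining Hadoop knowledge with cloud knowledge boils down to a decision loop: observe the workload (queued tasks, per-node capacity), then resize within configured bounds. A deliberately simplified decision function, purely illustrative — a real implementation would also weigh node launch times, billing boundaries, and data locality before shrinking:

```python
# Simplified sketch of the auto-scaling decision described above: look at
# the workload (queued tasks, per-node capacity) and resize within bounds.
# Purely illustrative -- a production policy also considers node launch
# latency, billing-hour boundaries, and data locality before shrinking.

def target_cluster_size(queued_tasks, slots_per_node, min_nodes=2, max_nodes=50):
    """Return how many nodes the cluster should scale to."""
    if slots_per_node <= 0:
        raise ValueError("slots_per_node must be positive")
    needed = -(-queued_tasks // slots_per_node)  # ceiling division
    # Never shrink below the floor or burst above the ceiling.
    return max(min_nodes, min(max_nodes, needed))
```

Running this periodically against live queue metrics, and acting on the delta with cloud launch/terminate calls, gives the elastic behavior that has no on-prem equivalent, where the fleet size is fixed by the hardware you bought.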
So I think the adaptation that they've done is essentially separating the compute and storage part. That is the primary adaptation that all three of them have sort of done in order to get there. But to us, you know, that is one portion, that is the baseline to get working in the cloud. Table stakes. It's table stakes. After that, you have to add a lot of automation, which none of them have done, like auto-scaling and stuff. After that, you also have to add, you know, we have a SaaS model in Qubole, we have added integration to tools, and we have certain tools which make access to data much more enterprise-wide, which none of these folks have done. You know, their focus has been primarily, you know, they approach it from an angle of saying, hey, this Hadoop and Spark is interesting, let's make it easy to run. Would it be fair to say that they make Hadoop and Spark a managed service, and you're trying to make data a service? I would say we are trying to make a data platform within the enterprise, so that not just, you know, the developers use it, but other folks can also use it. And I would say these folks are generally targeting the developers, and, you know, that's why the amount of automation that is needed maybe is not necessary for that crowd, but much more so in our platform. Okay. Just before we let you go, just briefly, I mean, your thoughts about the summit being here, being present and seeing the community build and buzz. I mean, just your reaction to that. As you said, the sparks are flying all around. So it's great. There's a lot of great vibe here. I'm very, very excited to see that there's a lot of developer crowd, a lot of, you know, grassroots crowd that we don't get to see at a lot of other, you know, conferences. So it's been great. It's fantastic. 
You know, for us, Spark is an integral part of our platform, and being at this conference and seeing the amount of, you know, interest is just great. It's good. Well, thank you for being with us. Good observations, and best of luck down the road to Qubole. Thank you for having me. Thank you. We're back with more on theCUBE here in San Francisco right after this.