Live from the Julia Morgan Ballroom in San Francisco, extracting the signal from the noise, it's theCUBE, covering Structure 2015. Now, your host, George Gilbert.

And we're back. This is George Gilbert. We're at the Julia Morgan Ballroom in downtown San Francisco at Structure 2015. We have a very special guest with us, Bob Muglia. Bob, welcome.

Good to see you, George.

At the risk of being politically incorrect: for the five people in Bangladesh who don't know who you are, can you give us a quick answer?

I think it's a few more than that. I spent 23 years at Microsoft, seven years or so as president of Server and Tools, focusing on Windows Server, SQL Server, and Azure at the end, and now Snowflake Computing.

Okay. Very illustrious. Okay, let's dive right in. So we've been talking about infrastructure as a service, platform as a service, software as a service. Let's focus in on databases, since Snowflake is a database, an analytic database. What's the difference between a database as a service and a managed service?

It's all about squirrels.

All right. Elaborate.

The difference with a true software-as-a-service or database-as-a-service offering is that it's a service that can serve many, many different customers very effectively. It can run in an automated fashion. A managed service is typically built by taking an on-premises system, hosting it in the cloud, and just having people manage it. I call it squirrels, right? You've got humans on the treadmill keeping the thing going, and it's very, very different from having software behind it. So it's about software versus people.

So let me drill into that one level down. If you had squirrels and you standardized the operation of a managed service, could you automate many of those human-intensive things?

Not really, because what you're doing is stamping out one of these systems after another, each typically dedicated to a single customer. The other part of this is really about multi-tenancy versus single-tenancy. Most of these managed services are single-tenant, so they're designed to run for one organization, whereas a true database as a service like Snowflake is multi-tenant. We support many, many customers in a shared environment. Now, one thing we do do is isolate customer data, we encrypt customer data, and we actually give people essentially dedicated clusters to run and compute on the data. So that's very isolated, but the system is designed to operate at scale for many, many organizations in a multi-tenant way.

So what would be some of the other things that you do as a database service that you couldn't do as a managed service? In other words, what have you wired or designed differently so that a user or developer doesn't have to worry about, say, partitioning or failover?

In fact, the way I would put it is what we don't do. The software is designed so we don't need to build indices, we don't need to do partitioning and build partitioning keys, we don't need to vacuum and clean things up. The software automates all of that; it's just part of the way we run. And what we've done is eliminate all those knobs that, in an on-premises environment, typically fall to the DBA. Ultimately, there's a DBA involved in solving those problems for all of the other databases except Snowflake. We just don't require it because of our architecture.
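To make the "no knobs" point concrete, here is a minimal sketch of what creating and querying a table looks like through the snowflake-connector-python client; the account settings, warehouse, and table names are hypothetical and not from the conversation.

```python
import os
import snowflake.connector  # pip install snowflake-connector-python

# Hypothetical account, warehouse, database, and table names for illustration.
conn = snowflake.connector.connect(
    account=os.environ["SNOWFLAKE_ACCOUNT"],
    user=os.environ["SNOWFLAKE_USER"],
    password=os.environ["SNOWFLAKE_PASSWORD"],
    warehouse="ANALYTICS_WH",
    database="DEMO_DB",
    schema="PUBLIC",
)
cur = conn.cursor()

# Note what is absent from the DDL: no indexes, no distribution or
# partitioning keys, no vacuum schedule -- physical layout is managed
# by the service rather than by a DBA.
cur.execute("""
    CREATE TABLE IF NOT EXISTS events (
        event_id   NUMBER,
        user_id    NUMBER,
        event_time TIMESTAMP_NTZ,
        payload    VARIANT
    )
""")
cur.execute("SELECT COUNT(*) FROM events")
print(cur.fetchone()[0])
```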
Would that prevent someone who started out with an on-premise database from migrating to this form, in the sense that if they automated some of these knobs, they'd break the processes that, say, an Oracle DBA has assumed would be there?

Well, yeah. I think the key is that the architecture of all these other systems, which typically dates back 20 or 30 years, right, the code is old in these systems, requires these knobs. And you can put some algorithms and things in front of it. In fact, our CTO and one of our founders, Benoit Dageville, that's what he did at Oracle. His job at Oracle was to take this mass of different knobs, and Oracle is the king of knobs, right? It always has been. His job for many years was to try and automate all of that, and he recognized it was a hopeless task. And he realized that if he started from scratch, without some of these knobs, and changed the assumptions fundamentally, you could have a very different thing. And that's the core of what I would call database as a service in Snowflake.

Okay, all right, that's pretty clear. So now let's put together two other concepts. We see the data lake on one side of the spectrum, which some people call the data swamp, where you put the data without having to decide up front how it's organized. And at the other extreme is Teradata, where everything's perfectly curated. So they're not direct substitutes for one another. But perhaps there's an analogy between what we call Hadoop 2.0, you know, at its core HDFS and YARN, and then Big Data 3.0, which is like a converged analytics platform. Help us make sense of that.

Sure. I think what you're seeing is the alternatives that people have traditionally had available. Okay. On the one end you have traditional relational data warehouses, of which Teradata, Netezza, and Oracle are all examples, and typically that's structured data. It's data that is highly curated. On the other hand, in today's world there's a lot of data that is generated by machines, and it takes on what we would typically call a semi-structured form. It's very dynamic in its content, it changes all the time, and it doesn't have a fixed schema. That's the key: it does not have a fixed schema. Up to now, the only choice people have had to work with that data is to put it in, you know, a Hadoop-based solution. And that's the data lake, data swamp, whatever you want to call it. Right. And from there you have this relatively unorganized mess, and you have to work to get it out. Anytime you want to get data out, you have to write something specific to pull it out, and it makes it relatively difficult to work with the information.

What we've done with Snowflake is totally different, in that we certainly support a full relational database, a full SQL, really as complete as what you'd find in a Teradata or Netezza from a SQL capability. So we have the ability to work beautifully with structured data. But we can apply all that same technology to semi-structured data that may be coming in in the form of JSON or Avro. And so you can essentially use Snowflake as a data lake, where the data is stored in almost its native format, but then we infer the actual structure from the data and allow you to run standard queries against it.
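As a rough illustration of that schema-on-read idea, continuing the hypothetical cursor from the earlier sketch: raw JSON can be loaded into a VARIANT column and queried with ordinary SQL path expressions. The table name and sample record below are made up for the example.

```python
# `cur` is the snowflake-connector-python cursor from the earlier sketch.
# Raw JSON lands in a single VARIANT column; the structure is inferred and
# stored internally, so standard SQL works against it with no fixed schema.
cur.execute("CREATE TABLE IF NOT EXISTS raw_events (v VARIANT)")

cur.execute("""
    INSERT INTO raw_events
    SELECT PARSE_JSON(column1)
    FROM VALUES ('{"device": {"id": "d-17"}, "readings": [{"temp": 71.2}, {"temp": 73.5}]}')
""")

# Path expressions plus casts give relational output; LATERAL FLATTEN
# unnests the embedded array -- all expressed in ordinary SQL.
cur.execute("""
    SELECT v:device.id::string AS device_id,
           r.value:temp::float AS temp
    FROM raw_events,
         LATERAL FLATTEN(input => v:readings) r
""")
for device_id, temp in cur.fetchall():
    print(device_id, temp)
```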
So this sort of leads into my next question, which is: if we wanted to go from the world of Hadoop 2.0, where we have this pipeline of many coarse-grained processes with a lot of latency, and we want to move to the next generation, which is some sort of converged platform, what might that pipeline look like, and how can you simplify it?

Well, I sort of hate to say that it just looks like Snowflake. I mean, it is the solution that we built. See, the thing about Snowflake...

Oh, you mean it handles sort of...

We handle that. See, we're an incredible transform engine. Snowflake is an incredibly effective transform engine, and we're able to ingest, in a native format, this semi-structured, typically JSON-based data. And what we literally do is, as the data is being read in, we see and infer the structure that's in there, and then we apply the same kind of technology that we would apply to structured data. We columnarize it and build metadata about it so we can prune it effectively. And so our queries are ultra-fast against it, and what that allows people to do is use one system, just with SQL, to actually curate their data. So you can take data in in a raw form and then put it in a form that's very easy for business analysts to use, all just using standard SQL. So all those steps merge together.

So now let's take it one more step, which is: I want to embed a machine learning model, a predictive model, a prescriptive model that machine learning has extracted from the signal in your repository. So that's still one more step in the pipeline. You've collapsed several steps, but for some applications we need, let's say, one more.

No, it's a good question. So for most of what people do, they're working with data in a direct format, and what they really just want to do is put Tableau or a standardized BI tool on top of it. They have all this data, and they want to give their business analysts access to it through a standard set of tools, which is very straightforward to do with our solution. Now you're saying, hey, but what I really want to do is some additional algorithmic processing on it. I want to use some sort of iterative processing which might look like machine learning, and how would I do that? And, you know, some people use R directly on it; they hook R up to a tool like Snowflake, and they work typically on a single computer. If you need to do high levels of machine learning, now you need to do iterative parallel processing, and there a tool like Spark is really, really effective. And the thing about Spark is that sometimes people think Spark and Hadoop are coupled together, but they're not, they're definitely not coupled. In fact, Spark can operate on independent data sources, and Snowflake is a perfect data source, and we're connecting to Spark, and, you know, Spark is parallel.

So Spark could be an alternative compute engine on top of Snowflake.

On top of Snowflake.

And also, would R work in a...

Inside Spark, if you can; within the context of Spark, now you can choose the environment.

So you get the scale-out of...

You get the parallelism and scale-out that Spark provides, and Spark is inherently iterative and inherently parallel. Snowflake is inherently parallel. So we can output data incredibly quickly to Spark, which can then be operated on, machine learning algorithms can be run, and then, frankly, the results can be injected back into Snowflake in a parallel way.
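A sketch of that Spark-on-Snowflake pattern, assuming the spark-snowflake connector is available to PySpark; the connection options, query, and table names below are hypothetical. The idea is that Snowflake does the heavy SQL work and hands Spark a parallel result set, and the scored results flow back the same way.

```python
from pyspark.sql import SparkSession

# Assumes the spark-snowflake connector is on the classpath; connection
# options and table names are illustrative only.
spark = SparkSession.builder.appName("snowflake-spark-sketch").getOrCreate()

sf_options = {
    "sfURL":       "my_account.snowflakecomputing.com",
    "sfUser":      "my_user",
    "sfPassword":  "********",
    "sfDatabase":  "DEMO_DB",
    "sfSchema":    "PUBLIC",
    "sfWarehouse": "ANALYTICS_WH",
}

# The connector pushes the query down to Snowflake, which performs the
# scan and filtering and returns a parallel result set for Spark.
df = (spark.read
      .format("net.snowflake.spark.snowflake")
      .options(**sf_options)
      .option("query", "SELECT device_id, temp FROM raw_events WHERE temp > 70")
      .load())

# ... iterative / MLlib processing on `df` would go here ...

# Results can be written back to Snowflake in parallel the same way.
(df.write
   .format("net.snowflake.spark.snowflake")
   .options(**sf_options)
   .option("dbtable", "scored_events")
   .mode("overwrite")
   .save())
```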
It's almost like those Postgres or Illustra, you know, sort of extensions, I forgot what they were called, but Spark could be your extensibility story.

And what you do in Spark is issue essentially a SQL command to extract the data, and Spark and Spark SQL are not an efficient way of processing terabytes and terabytes of data. We are. So if you just pass the SQL down to us, we can pass out a result set in parallel, which is exactly what Spark would want to operate on.

The predicate pushdown.

Exactly, do a predicate pushdown. Exactly.

On that interesting note, Bob Muglia, we have to leave it there. This is George Gilbert. We're at the Julia Morgan... I'm forgetting where we are. Last interview of the day.

It's late in the day, isn't it?

George Gilbert, Julia Morgan Ballroom, downtown San Francisco. We are at Structure 2015, and we will be back in a few moments.