Live from New York, it's theCUBE. Covering Big Data New York City 2016. Brought to you by headline sponsors Cisco, IBM, NVIDIA, and our ecosystem sponsors. Now, here are your hosts, Dave Vellante and George Gilbert.

Welcome back to New York City, everybody. This is theCUBE, the worldwide leader in live tech coverage. We're here, concurrent with Strata and Hadoop World. James Markarian is here, he's the CTO of SnapLogic, long-time industry participant, watcher, technologist. James, welcome to theCUBE, thanks for coming on.

Thanks for having me, guys.

So, we were talking off camera, and you've got a good historical perspective, not only on Hadoop World, but on the industry in general. So, what's happening in data management? How is it evolving? What's changed over the last five to seven years?

Well, what isn't changing? So, when I think about it, as you know, I used to be CTO of another integration company, and I kind of view that now as prehistoric times. There were a few creatures roaming the earth: a few ERP systems, a few database systems. There were a few flavors of Unix, that was a little spicy for the time. But now, in the last few years, everything's just completely exploded. So, lots of SaaS applications, still a lot of on-premise applications and a lot of legacy. And now we're seeing this trend around changes in the database market that we hadn't seen for a while. When you look back even further than the question, say 10 or 20 years, you could pick any flavor of database that you wanted, as long as it was Oracle, IBM, or Microsoft. And now you have relational databases, you have NoSQL databases, and then you have purpose-built databases for doing analytics, everything from Vertica and Redshift to, of course, Hadoop. And it's exciting for those of us involved in technology, it's like new. And when you look at customers, they're like, well, that's great, now what? So I think almost everything has changed: the platforms have changed, people's expectations around what they can do with their data have changed, and the complexity has gone up considerably as well. And that's what I think is both challenging and exciting about this.

Yeah, it's certainly challenging for customers, right, to try to keep up with all this. I mean, the people part of the equation has made it difficult for a lot of people to get value out of their data. Now, it's taken maybe longer than a lot of people thought, but what's your point of view on that, in terms of just the people skills that are now available to exploit data?

I think it's evolving; it's something that doesn't change overnight. All of a sudden, everything in IT seemed to need so-called data scientists. We were talking about Jeff Hammerbacher earlier, and there's just not a lot of guys like that floating around who, intuitively through experience, have the technical skill and the business angle to actually take advantage of it. So I kind of describe the early days of Hadoop, I've been breaking everything down into equations lately, which makes me very fun at parties, and basically the equation that everyone was sold on for Hadoop was: look, you have cheaper hardware and you have cheaper software than pick-your-favorite legacy analytic database or data warehousing database. This is obvious. It's like, well, okay, let's add a few other parts to the equation: what about the people cost, and what about the churn or risk cost as technologies continue to evolve in the Hadoop ecosystem?
So first it's MapReduce, then it's Pig and Hive and Tez and Storm and Spark, and you can pick your week and name a technology that was the "it" technology. Then you need the people that are masters of those technologies, and they have to keep up with it. So I think that the people part is not close to being solved. There is a potential solution coming, though. I don't know if you've heard about this thing, it's called the cloud. It's gonna be big. And so I think that a lot of the improvements that we've seen in the app space for transactional applications are actually gonna be moving to the cloud. And so some of these problems, not all of them, end up getting simpler as you move things to the cloud. At least that's kind of my working premise nowadays.

What is it that, I mean, you can sort of forklift, lift and shift, what you've got on-prem to the cloud and you'll have the same operational challenges. But obviously you're in a new distribution channel and you have the ability to automate the operation, so give us what you're thinking about, with the technologies that SnapLogic would be integrating with, where you think cloud deployments have the potential to reduce that operational overhead.

Yeah, so it's a good question. So you look at the cost of operating a Hadoop environment, there's kind of two dimensions to that, right? So one would be the upfront cost, like the pure data center cost. So what are the promises of Hadoop? Well, elastic scalability. Now, your first day on the job, I say, okay, we're gonna be elastic, now go set up your data center to be elastic. It's like, well, we still need all these physical nodes. Even though we're very virtualized, containerized, lots of "-izes", we need to set up our physical machines, rack and stack everything. I need to be a master of all the administrative tools, so I need to be into ZooKeeper and all my management tools. I need to know how to diagnose and fix problems as they come up, because I have nodes that go down and disks that go down, et cetera, or things that are underperforming. So in some ways, a blackout is easier than a brownout. It's like, well, this thing's underperforming for some reason, I'm not getting the disk read speed that I think I should be, I gotta go fix it. Why would I want to deal with any of that, right? So the whole premise is: look, you want elasticity at possibly a much greater scale than you can get by building it out on premise. If my nominal run rate is, say, 10 nodes, which actually for most people I would say is sufficient, every once in a while I wanna burst to 100 nodes or 1,000 nodes. That's very difficult for most enterprises to take on. That's a lot of racks, that's a lot of electricity, that's a lot of everything. So you have to first deal with those operational issues. Then there's the thing that actually makes it all possible, given this separation of interests, NoSQL databases for quickly persisting data, and Hadoop and analytic databases for doing analytics: the thing that makes all that work is something like SnapLogic, right? So you need glue that ties it together. You need on-premise things connected to the cloud in the case of cloud analytics, or other cloud applications connected to cloud analytics. Or you need things that tie your different instances of transactional databases and analytic databases together, and that's kind of where we see ourselves fitting in. So we try to make it all easy.
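To make the "burst to 100 or 1,000 nodes" premise concrete, here is a minimal sketch of requesting a transient, elastic cluster from a cloud provider, in this case AWS EMR through boto3. The cluster size, bucket path, job script, and role names are hypothetical placeholders, and this illustrates only the general elasticity idea being described, not anything SnapLogic-specific.

```python
import boto3

# Ask the cloud for a short-lived 100-node Spark cluster, run one job, tear it down.
# Region, bucket, script, and role names below are hypothetical placeholders.
emr = boto3.client("emr", region_name="us-east-1")

response = emr.run_job_flow(
    Name="burst-analytics",
    ReleaseLabel="emr-6.9.0",
    Applications=[{"Name": "Spark"}],
    Instances={
        "MasterInstanceType": "m5.xlarge",
        "SlaveInstanceType": "m5.xlarge",
        "InstanceCount": 100,                   # burst well past the 10-node baseline
        "KeepJobFlowAliveWhenNoSteps": False,   # cluster terminates when the work is done
    },
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
    Steps=[{
        "Name": "nightly-aggregation",
        "ActionOnFailure": "TERMINATE_CLUSTER",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": ["spark-submit", "s3://example-bucket/jobs/aggregate.py"],
        },
    }],
)
print("Launched burst cluster:", response["JobFlowId"])
```

Nothing in this sketch requires rack-and-stack work or pre-provisioned capacity, which is the operational contrast being drawn with on-premise Hadoop.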
So a couple of years ago, maybe three or four years ago now, somebody came up with the concept of data lakes. Great, put all your data in the data lake, no schema on write, just dump it in there, and then you'll get access to it and all the wonderful things will happen. And what seemed to happen was, first of all, you ended up with lots of data lakes, but there was value in that it was cheaper than sticking it in your enterprise data warehouse.

Modulo the previous discussion, yes.

Yes, right, right, exactly. So, okay, what's the right strategy for customers to take in terms of data lakes, or whatever you want to call them? Some people hate the term, some people love the term. What should people do?

Yeah, and I'm somewhere in between. So I think, first of all, like a lot of things we're seeing, what's old is new again. So this idea of a data lake, if you turn the clock back even further, it's like when IBM first popularized data warehousing. The idea was so big it almost killed the whole industry before it even got started, this idea of the galactic data warehouse. And customers were sold, and maybe it was actually more practical back in the mainframe days, where you had co-location of all your transactional applications: why not build something that sits right next to it that's actually used for analysis? But it turned out we didn't really have the horsepower to do it. So this idea of dumping everything in one place is seductive in a lot of ways. Hey, wouldn't it be great if I had all of my information in one place, and then all these insights are just gonna come bursting out of it, I won't even be able to contain it. And then you see what the reality is, which is, how do people really work? So they put it into a data lake, or whatever they call a data lake, and all of a sudden the data starts kind of evolving. So I have this data that's derived from this data that's derived from this data, and then it forks, et cetera. And I tell you to go build a report on this dataset, and you say, well, there's 50 of them, which one do I actually pick? It's like, well, that one was created by this maniac, and that one was created by a guy in another group, and that one was the right one for now, but then tomorrow it might be different. So what we think is that you actually need to really think about the organization within your data lake. If you just treat it as, you know, an NFS drive or something, and you dump everything there with no metadata and no organization, it's gonna be lousy. So you need, first of all, tools to help you organize it. And I think you can think about it as zones within your data lake. Like the raw data, then, I don't have a good terminology so I'll give you my crummy terminology, purified, and what I call bottled. And it doesn't matter how many zones you have, but you need rules, like data SLAs, that indicate how structured things are. Some should be completely schema on read. There are parts of it where you don't wanna keep re-applying the schema every time you read the data, which a lot of analytics tools actually want and need, and the type system actually is your friend in some cases and a foe in other cases. And you need tools that help you define and marshal the data through and allow you to subscribe to the data. So when I think of kind of our vision for how we build out the data lake, it's not just populating it. That's fun, it's important. Brokering data out of the data lake is fun and important, but actually management within the data lake, I think, is gonna be critically important.
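As a rough illustration of the zone idea described here (raw, purified, bottled, each with its own data SLA and metadata), the following is a minimal Python sketch of landing data and promoting it from zone to zone while carrying lineage and schema information along. The lake path, zone names, and metadata fields are assumptions drawn from the conversation, not a description of SnapLogic's product.

```python
import json
import shutil
from datetime import datetime, timezone
from pathlib import Path

# Zone names borrow the "crummy terminology" from the conversation:
# raw -> purified -> bottled, each with a stricter data SLA.
ZONES = ["raw", "purified", "bottled"]
LAKE = Path("/data/lake")  # hypothetical lake root


def land(dataset: str, src: Path) -> Path:
    """Drop a new file into the raw zone and record minimal lineage metadata."""
    dest = LAKE / "raw" / dataset / src.name
    dest.parent.mkdir(parents=True, exist_ok=True)
    shutil.copy(src, dest)
    _write_meta(dest, {"source": str(src), "zone": "raw", "schema": None})
    return dest


def promote(path: Path, schema: dict) -> Path:
    """Copy a dataset into the next zone, attaching the schema it now conforms to."""
    zone = path.relative_to(LAKE).parts[0]
    nxt = ZONES[ZONES.index(zone) + 1]          # raw -> purified -> bottled
    dest = LAKE / nxt / path.relative_to(LAKE / zone)
    dest.parent.mkdir(parents=True, exist_ok=True)
    shutil.copy(path, dest)
    _write_meta(dest, {"derived_from": str(path), "zone": nxt, "schema": schema})
    return dest


def _write_meta(path: Path, meta: dict) -> None:
    """Keep a small JSON metadata record next to every file, so "which of the
    50 copies do I pick?" has an answer."""
    meta["written_at"] = datetime.now(timezone.utc).isoformat()
    path.with_name(path.name + ".meta.json").write_text(json.dumps(meta, indent=2))
```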
Let me ask about a couple of different dimensions of what you were talking about. One of the benefits of the data lake was schema on read, you know, let's just put it all in there and then we'll figure out what questions we wanna ask. And we're also trying to collapse the amount of time between when we ingest the data and when we can analyze it and make a decision. So in that respect it's very different from the traditional ETL tool. So tell us more about how the origin story came about that was different, not just cloud versus on-prem, but taking advantage of these other changes.

Yeah, maybe we can tease apart the two things. I think schema on read is kind of big, and I think the SnapLogic origin is pretty big, so let me at least deal with schema on read, which might be more generally applicable, and I'm happy to talk about SnapLogic. So schema on read I think is amazing. When you think about impedance in data warehousing, one of the reasons why we didn't have these cool collections of data all along was, well, how did it work before? Go back to early versions of data warehousing, like DB2 on the mainframe or other things, and they were very strongly typed and you had strong schema. And the general pattern was you'd have a bunch of, like, the data architects, they probably weren't called that back in the day, I don't know what they were called, I do have gray hair but not that much. And they would go away and they would come up with the schema that was supposed to span everything, because you needed a schema to slot things in, and we were coming out of the VSAM world where typing wasn't as important and you could have these generalized buckets of things. And so everybody felt like everything had to be typed, primarily so you could query it later. So it was great, except by the time that they actually came up with that schema, your business problem had changed, your source systems had changed, everything had changed, and it was completely fricking worthless, right? So schema on read is good because in some cases the time value of that data is so low that if you take the time to create a schema for it, it's done. Think about IoT and other things. The other thing is that in the time it takes to come up with that schema, even when the time value of the data isn't low, your business requirements might have changed, and schema used to really define queryability and what you could do with it. So a few things have changed. Now, schema on read gives you this fast on-load mechanism, but it's a little bit of kicking the can down the road, because you have to face these problems at some point. Reconciling 120 different finance systems doesn't get solved by schema on read, you just push the problem, so you avoid doing it now.
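A minimal sketch of what schema on read means in practice: the raw events below are landed exactly as they arrive, and a schema, here a hypothetical reconciliation of two temperature encodings, is applied only when a question is asked, not when the data is loaded. The device names, fields, and values are made up for illustration.

```python
import json
from datetime import datetime

# Raw events landed as-is: nobody agreed on a schema up front, and the two
# sources don't even encode temperature the same way. (Made-up sample data.)
raw_events = [
    '{"device": "sensor-7", "temp_f": "71.3", "ts": "2016-09-28T14:02:11Z"}',
    '{"device": "sensor-9", "temp_c": 22.4, "ts": "2016-09-28T14:02:12Z"}',
]


def read_with_schema(line: str) -> dict:
    """Apply the schema the current question needs, at read time."""
    rec = json.loads(line)
    # Reconcile the two encodings only now, when the question is asked.
    temp_c = (float(rec["temp_f"]) - 32) * 5 / 9 if "temp_f" in rec else float(rec["temp_c"])
    return {
        "device": rec["device"],
        "temp_c": round(temp_c, 1),
        "ts": datetime.strptime(rec["ts"], "%Y-%m-%dT%H:%M:%SZ"),
    }


for line in raw_events:
    print(read_with_schema(line))
```

The trade-off described in the conversation is visible here: loading is instant, but the reconciliation work has been pushed to every reader rather than eliminated.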
Yeah, you just sort of solve it somewhere else. And there's value in doing that, because if you did it in the warehouse, where you first landed the decision-support data, you've sort of foreclosed a whole bunch of questions that, if you want to go back and ask them later, you have to bring in all the new stuff.

I couldn't agree more. There's so much value in having all the data, and actually you can credit even guys like Teradata for saying this all along. It's like: don't compress your data, don't abbreviate your data, don't aggregate your data. Just have the data and then figure it out. Now you can look at the economics and everything, but that part was absolutely right, and that's exactly what you're saying. And sometimes you want to understand the differences between the systems, and sometimes you want it all unified, and there's really value in both. So just on the SnapLogic side, I think there's a few things that we figured out. One was that everything is moving to the cloud, first the operational stuff, which was where we got our start. So we wanted a completely cloud-first experience. We wanted to do a great job of integrating cloud applications. It turns out that they're all document-based, for the most part, through their interfaces. So you have a REST interface that spits out a JSON document, and when you looked at kind of previous-generation technologies, they were all row and column, and so they would be great tools as long as you didn't mind cramming all of your documents into a row-and-column model. And so when you talk about this impedance mismatch, you had it right out of the box with these technologies, because I had to take my data that I understood, put it into a format that I no longer understood, and then maybe actually put it back into some third format later. And so there was this huge impedance mismatch, in my view, at every step of the way. It takes longer to build connectors in that model. It takes longer for customers to deploy because they would have this on-premise deployment model, like I have to install software on everybody's desktop. It's like, no, just say no to all these things. If you have an impedance mismatch, fix that. If you have a deployment problem, fix that. SnapLogic just went head-on at these problems and said we're gonna change the way everybody thinks about integration. So that's the story.

And so what do you sell, a platform of tools?

Yeah, so the message would be, we have an integrated solution. It's an iPaaS solution, integration platform as a service, meaning that our design experience is delivered from the cloud. You can deploy on-premise or in the cloud. And we integrate. So we do application integration, like say you wanna attach NetSuite to salesforce.com, we do that, and that was actually kind of the origin of the company. Now we're getting more into what you'd consider data integration, or looking at sets of data for analytic purposes. So, taking my Twitter feed and my ERP data and some mainframe data, pulling it all together into Hadoop and saying, okay, what do I really have? So now we're looking at doing Hadoop integration both in the cloud and on-premise in your Hadoop instance.

And I access that in your cloud, or whoever's cloud?

Well, it depends on what you mean by access. In terms of design experience, it's our cloud. So we host on Amazon, you get it from there, and again, no download of anything. But your data stays wherever you want it to be. So if you never wanted to touch the cloud, it never touches the cloud. It runs completely behind your firewall. If you wanna run on AWS or in other cloud platforms, yeah, we can deploy to AWS, or we have our own integration service that also runs in Amazon as a hosted service. You can run it wherever you want.

For data privacy reasons.

That's right.
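Returning to the document-versus-rows impedance mismatch described a moment ago, here is a small, self-contained Python sketch with a made-up order document of the kind a REST interface returns. The field names are hypothetical; the point is only how much reshaping a row-and-column model forces onto nested data.

```python
import json

# A typical REST response: an order document with nested line items. (Made-up data.)
order_doc = json.loads("""
{
  "order_id": "SO-1042",
  "customer": {"id": "C-77", "name": "Acme Corp"},
  "lines": [
    {"sku": "WIDGET-A", "qty": 3, "price": 19.99},
    {"sku": "WIDGET-B", "qty": 1, "price": 42.00}
  ]
}
""")

# Forcing the document into rows means exploding the nested array and repeating
# the header fields on every row, which is the mismatch described above. A
# document-oriented pipeline can instead pass order_doc along unchanged and let
# each step read only the fields it cares about.
rows = [
    {
        "order_id": order_doc["order_id"],
        "customer_id": order_doc["customer"]["id"],
        "sku": line["sku"],
        "qty": line["qty"],
        "price": line["price"],
    }
    for line in order_doc["lines"]
]

for row in rows:
    print(row)
```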
All right, James, we're out of time. I'll give you the last word on whatever you like. I mean, Hadoop World, SnapLogic, your vision, give us a bumper sticker.

All right, well, that's pretty broad, I like it. Yeah, no schema for this talk. Well, I think the big question is always: what do we wanna do with data? At various times, it's easy to get mired in the myopic issues of, you know, how do I operate this, what's the latest machine learning environment that I need to learn. It's all exciting stuff, and we have this toolbox like we've never had before. Now, what are we really trying to do at the end of it? If we do our old analytics more cheaply and quicker than we used to, is that good? Yeah, that's good, but it's not that exciting. I always ask the political question: are we better off now than we were four years ago, or whenever, you know, the Hadoop investment started. You remember that debate.

Sorry, yeah.

Yeah, I was told not to mention names, but at least that idea is good. And so, I don't know, after billions of dollars of VC money and other money that has poured into this, are we really better off yet? So the question is, what would make us better off? I think non-deterministic analytics, around machine learning and AI, are excellent. I actually still have this weird utopian vision for where everything goes, which is, when you really think about data utopia, what do you think about? Just think about questions that you might ask: air quality, or statistics about the passengers you're flying next to on the airplane that you're allowed to know about. There's so much every-day, every-minute, every-second information that you don't really have easy access to. And then you look at enterprise questions about their customers or their products, and all these questions that we should have better answers for, and we don't. And I think a lot of it has to do with data availability and our ability to combine that data together. So when I think about the vision of data, you think about kind of third-party stores for data, all subject to privacy and everything, and then the ability to combine it with data that you know something about, and really thinking about solving more problems with data, and how we further eliminate the impedance behind some of those challenges. And I think that's kind of where things are headed. I actually think Hadoop plays a part in this, and I think that integration technologies like SnapLogic have a big role in getting that data together and combining it in meaningful ways so that we can answer those questions.

A key step to realizing that vision. James, thanks very much for coming on theCUBE. We appreciate it.

Great, thanks for having me, guys.

You're welcome. All right, keep it right there, everybody. We'll be back with our next guest.
This is theCUBE, we're live from the Big Apple. We're right back.