Live from New York, it's theCUBE, covering Big Data NYC 2015. Brought to you by Hortonworks, IBM, EMC, and Pivotal. Now, your hosts, Dave Vellante and George Gilbert.

Welcome back to New York City, everybody. We're here at Pillars 37, just down the street from the Javits Center on 37th Street. This is Big Data NYC; we run this in conjunction with Strata and Hadoop World. Dan Baskette is here, he's the director of technical marketing at Pivotal. Pivotal, of course, is that innovative spin-out that EMC and VMware did, the collection of assets they created under the direction of Paul Maritz, who has since moved on to become chairman. Dan, great to see you, thanks for coming on.

Thank you very much, first time here.

So yeah, a theCUBE newbie, we love it. So Pivotal has been quite an interesting journey. You guys started out, you created critical mass with all these sort of bespoke assets that people weren't sure what you were going to do with, and then you started to build out this big data and cloud platform. Cloud Foundry obviously has had some great traction and great success, with some big names like IBM building on top of it, and obviously the HAWQ announcements and you guys putting a stake in the ground on performance, so let's talk about some of that. First of all, what's your role as director of technical marketing, what's your focus, and then we'll get into it.

Okay, as technical marketing director, I build a lot of presentations for the field. I do a lot of benchmarking and testing. I roll feedback from customers back to engineering and help try to drive some of the product direction.

So what's going on in the field these days? You're asked to come help evangelize, explain, you know, you're the guy the customer trusts. You're not a sales guy, you're not a pure marketing guy, you've got technical marketing in there, so you've got some credibility. What's the field asking for these days? What are the customers asking for?

Typically they're asking really for proof points around the technology. So they want to see how does it perform, how are other customers using it, right? How are they using it and building net new architectures to help get some of their other issues solved?

They're all from Missouri. All right, they're all from Missouri, so.

One of the things that we were talking about earlier, and it's a theme that we've been looking at, is that the Hadoop ecosystem is a center of such hyper-innovation that it's sort of spreading beyond the control of any one distro vendor, or all the distro vendors. But there's always a trade-off: that hyper-innovation, you know, shows up as a little more complexity in development and operations. You guys have taken what started out as disparate tools and engines and you're fashioning them into an integrated suite. Tell us what that looks like and tell us how that makes a difference in simplicity of development and operation.

Okay, I think really the big thing is we started from a different place than a lot of the other vendors, right? We had sort of these well-established products, right? And really the thing that wasn't well-established was the Hadoop ecosystem at the time. It's since grown up a lot, right? So we helped found the ODPi initiative to help get the Hadoop ecosystem in line, and then we're taking our other products, right? As we build integrations between those, we're also taking those open source into Apache, right? So we can build this entire open source ecosystem.
But starting from all the products as really well-established, with 10-plus years of engineering on these products, has really helped us build a nice group, or set, of products that we can integrate really well.

Tell us about what those products are and how they integrate in a way that's difficult to do with independent Apache projects.

All right, well the Big Data Suite really is the suite of products on the data side. And that's the Greenplum Database, a traditional MPP scale-out database. HAWQ, which is really the Hadoop implementation of the Greenplum Database. Apache Geode, or GemFire, which is really for fast data, so in-memory data grid technology. And then Spring XD, which is all about data flows and bringing data into that ecosystem. So we've built integrations so that Spring XD can push data into the data engines or pull data from the data engines. It can also talk to things like Spark for doing things around data science with MLlib. And then HAWQ can also reach out into the other systems and pull data in from those as well.

So I remember when Maritz first started talking about HAWQ, he really played up performance in a big way. In fact, he's not one known for hyperbole, and he was basically saying, we smoke anything else that's out there. That was several years ago. Give us the update on performance, and then I want to talk about the architecture as well.

Okay, well we've recently updated the performance benchmarks, and really what we wanted to look at was how our query optimizer, which really came out of a joint engineering effort around Greenplum and HAWQ, affects performance, and what it can do in that big data ecosystem. So we ran the traditional TPC-DS type queries, just to see and get a feel for where we were in the marketplace. So we went and looked at Hive and Impala as well, and tested all those, not only from a performance standpoint, but also to see where people were on SQL completeness. Because as we see customers out leveraging BI tools, the traditional tools they've used for analytics and data science, they want those tools to be able to come into the ecosystem and work without a bunch of changes. So we used those query sets to test performance and that SQL completeness. And really what we saw when we tested performance is that while the other guys have definitely gained some, that optimizer really plays to our advantage in very complex queries. And we're still seeing 4-5x, 10x, 20x query performance really across the board. And we support all 99 queries. Testing the other guys, they're typically supporting maybe two-thirds to three-quarters of the queries in that benchmark, so they can't actually run all of it.

So is that a function of just the efficiency of the code? Are you leveraging memory better than the other guys? Maybe talk about that a little.

It's really a function of how well the optimizer does in deciding what data to actually go and query, right?

So it's smart.

Yeah, it's smart. We're not pulling back as much data. And it's also a function of how we leverage some of the memory and disk across the system, and pipelining data between nodes. So we're very efficient about that.

That's your IP. That's proprietary to you guys, right? And that's your secret sauce.

Well, that piece of it was proprietary really until now, since we've open sourced it and gone to Apache with HAWQ.

Right, so basically that was your IP that you've now given to the open source community.

Definitely.
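To make the SQL-completeness point a bit more concrete: HAWQ, like Greenplum, is PostgreSQL-compatible on the wire, so an ordinary PostgreSQL client can submit the kind of multi-table, TPC-DS-style query being discussed here and leave the join planning to the cost-based optimizer. The snippet below is only a minimal sketch; the host, database, credentials, and table names are illustrative assumptions, not a Pivotal-supplied example.

```python
# Minimal sketch: HAWQ (like Greenplum) speaks the PostgreSQL wire protocol,
# so a standard PostgreSQL client such as psycopg2 can submit queries to it.
# Host, port, database, user, and table names below are illustrative assumptions.
import psycopg2

conn = psycopg2.connect(host="hawq-master", port=5432,
                        dbname="tpcds", user="gpadmin")
cur = conn.cursor()

# A simple star-join in the spirit of the TPC-DS queries discussed above;
# the cost-based optimizer decides the join order and which data to scan.
cur.execute("""
    SELECT d.d_year, SUM(ss.ss_net_paid) AS revenue
    FROM   store_sales ss
    JOIN   date_dim d ON ss.ss_sold_date_sk = d.d_date_sk
    GROUP  BY d.d_year
    ORDER  BY d.d_year
""")
for year, revenue in cur.fetchall():
    print(year, revenue)

cur.close()
conn.close()
```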
Okay, and then you're now building on top of that, right? So talk about that. Originally the business model was, okay, we have this sort of unique advantage that we can sell, and then you realized, okay, as you say, there's this slow, steady decline in infrastructure software pricing. So you guys have said, okay, we're going to sort of move up the stack. We're going to sell services around big data and obviously cloud. So talk about that a little bit.

Right, and we'll still have a commercial offering of the HAWQ product, right? We've given HAWQ, its logo and its name, to Apache, right? So we'll still have a production version of that. But basically all the technology has gone to open source. And really the driving factor was customers coming to us and saying, hey, open source is the number one item on our checklist: if we want to look at your software for this solution, it has to be open source, right? So we looked at that and that became important. The other piece that was important was taking something like HAWQ and making it what we call a Hadoop-native product.

Right, so what does Hadoop native mean?

It means, well, number one, it needs to be Apache licensed like Hadoop, right? Because you don't want to have different licensing across the open source technologies in that ecosystem. And then second, it needs to work with the other products in that Hadoop ecosystem. So the one released to Apache leverages YARN, right? Which is different than our production version, the actual version we're selling today; the version we've actually given to Apache is a much more advanced version. It works with HCatalog to be able to query external data sources, right? So it really fits inside that Hadoop ecosystem; it becomes a Hadoop-native SQL product.

You know, when I listen to this, it sounds very much like what Impala and Hive originally pitched, which was, we're going to take advantage of the core catalog, which is, you know, HCatalog coming out of Hive, which Impala uses or extends. There are the file formats, Parquet, Avro. But the folks who built Presto at Facebook, who also originated Hive, said that they really had to go back to the drawing board because there were a lot of learnings they'd had since they built Hive. So my question is, if those learnings, you know, required a new product at Facebook, might HAWQ at some point become the native MPP SQL engine for the ODPi initiative?

I think that's what we hope it does, right? Not looking too far forward, I mean, but that's what we really hope, because it addresses sort of that different market, data science, machine learning, analytics, where Hive doesn't really play today, right? So Hive is a great engine for processing lots and lots of data across the Hadoop cluster: ETL, some analytics, reporting, dashboarding, things like that. But when you get into that higher-level function, it just doesn't fit the bill. And something like Impala tried to address that, but it's definitely difficult to build a database system from the first line of code, whereas we're starting with 10-plus years of Greenplum engineering and coming at it from a different angle.
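As a rough illustration of the HCatalog integration mentioned above, HAWQ can reference tables that Hive has registered in the metastore without first copying the data into HAWQ. The sketch below assumes a reachable HAWQ master and a Hive-managed table named web_clickstream, both hypothetical; the hcatalog.<db>.<table> naming follows the Apache HAWQ documentation, but the exact syntax can vary by version.

```python
# Sketch of what "Hadoop native" querying can look like in practice: reaching a
# Hive table through HAWQ's HCatalog integration. Host, database, user, and
# table names are illustrative assumptions.
import psycopg2

conn = psycopg2.connect(host="hawq-master", port=5432,
                        dbname="analytics", user="gpadmin")
cur = conn.cursor()

# Query a Hive-managed table through the HCatalog integration; the planner
# treats it like any other table, so it can be joined with native HAWQ tables.
cur.execute("SELECT COUNT(*) FROM hcatalog.default.web_clickstream")
print(cur.fetchone()[0])

cur.close()
conn.close()
```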
Okay, there are two angles on that. We know from Oracle, you know, everyone may disagree with, or suffer from, their pricing levels, but we do know that with 37 years building a query optimizer and workload manager, they did pretty well on that. Right. And you've had time to mature, a decade to mature, the Greenplum query optimizer and workload manager. But I guess the next question is, what do you need to surround that with, in terms of a platform that makes it easier to develop and operate? You know, not just the database by itself, but a suite of products that are, you know, a data management platform. You know, combine Cloud Foundry, you know, GemFire, Geode, Spring XD. Tell us how that fits together.

Right, so the idea for the data products is to be complete, right? You have the full suite to be able to create something like a Lambda architecture, right, if I want to be able to do that. Fully integrated.

For the viewers, just a quick reminder on Lambda?

The ability to have, like, long-term data storage and fast data storage all integrated in a single system, right? So I can query my most recent data, but also use the older data to be able to train my models for machine learning. Right. So you have all those products, but then you take that and make it what we call Hadoop native, and then take that next to Cloud-native architectures, right? So if I go to something like Cloud Foundry, it really makes it easy for people to develop and deploy these solutions that leverage that data on the back end. So what we're trying to do is combine those two, so that from an operator standpoint, right, all the development, all that effort of deploying systems, that's done. Deploying all the data systems for developers to leverage is done. Making the connections out to all your data systems is done. You just need to build the apps to leverage that, right? So that's the ultimate goal.

So how much of the development and operational life cycle can you simplify today? And then what's the vision?

The operational, we can deploy today, right? So we have...

Beyond deployment.

Well, we sort of start there. So we have the deployment pieces done. The connections to those from a developer standpoint are all done, right? What's not done, and is sort of the future game, is maybe reimagining what it means to be a data architecture in the Cloud, right? Because today we've got these monolithic HDFS implementations and data processing engines, and today we're just sort of taking those and dropping them into these big systems. But what if we could break out some of that functionality to make it scale at different levels? And that will be sort of the end game.

Can you elaborate on that? That sounds intriguing.

I can't really say a lot today, but it's about just sort of reimagining what it would mean to be in the Cloud, right? Because the Hadoop infrastructure, a lot of people want to take that to the Cloud today, but if you've got a thousand nodes, right, moving that to Amazon is not necessarily practical. So smaller implementations are practical today. But what if I could do it and be able to grow things one node at a time, or one service at a time, right? So if I look at Cloud Foundry, things are service-oriented, right? So microservices. So if I can start to have data products available as microservices, then if I want to be able to leverage some kind of Spark functionality, I can do that through some microservice.

So the services today are too monolithic and hard to break down?

Yes.

And we need to get to microservices, throwing out a buzzword, containers, where we can orchestrate hybrid Clouds.

Exactly, exactly.

That are composable and sort of agnostic to the underlying infrastructure.

Build your data architecture on the fly, and not have to just drop in this big chunk of data products.
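For readers who want the Lambda idea pinned down a bit more than the quick reminder above: a query is answered by merging a precomputed batch view over historical data with a small real-time view over the newest events. The sketch below is a toy illustration in plain Python, with in-memory dicts standing in for what the conversation maps to Greenplum/HAWQ on the batch side and GemFire/Geode on the fast-data side; the names and numbers are illustrative assumptions, not Pivotal APIs.

```python
# Toy sketch of the Lambda pattern: merge a precomputed batch view over
# historical data with a small real-time (speed) view over recent events.

batch_view = {"sensor-1": 10_250, "sensor-2": 7_431}   # counts up to the last batch run
speed_view = {"sensor-1": 12, "sensor-3": 4}            # counts since the last batch run

def merged_count(key: str) -> int:
    """Serve a query by combining the batch and speed layers."""
    return batch_view.get(key, 0) + speed_view.get(key, 0)

def absorb_batch_run(new_batch_view: dict) -> None:
    """When a new batch run finishes, swap it in and reset the speed layer."""
    global batch_view, speed_view
    batch_view = new_batch_view
    speed_view = {}

print(merged_count("sensor-1"))   # 10262: history plus the most recent events
```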
So obviously, well, obviously, conceptually, you'd think there's a lot of demand from customers for that vision.

There's definitely a lot of demand to move towards public Cloud types of infrastructure. But today, sometimes it's just a challenge. So we're looking at ways to make that a little more consumable.

Yeah, and the reality is, like you said, not everybody's just going to shove everything into the public Cloud, probably ever. And so we're going to live in a hybrid world. See, earlier we talked about proof points. That's what customers want. That's what your field wants, because that's what customers are asking for. Do you have proof points at this point for what you just described? Is it still early days?

It's still early days on that, right? So the proof points that customers are interested in today are more just around that data architecture. So we have customers leveraging the data products for Internet of Things type applications, connected cars, things like that. So they want to see that type of information; it gets back to us and then we distribute it out to the field. Because things like connected cars, that's pretty specific to an industry, but if you actually peel back the onion a little bit and figure it out, all right, the things we're doing in a car are maybe not that much different than what I'm doing with a fitness band, right? It's still pulling data in, processing it for a large number of items, and then pushing that data back out.

So you think about, historically, the way applications have been developed. You have this sort of infrastructure silo, and then you're developing an application that connects all the business rules and the business processes and delivers some kind of value, as sort of the traditional model. And now we have this distributed, mobile, all kinds of stuff happening at the edge, data's everywhere, data sources. It's a big change for customers. So how are they adapting to that change? Where are they investing? How are they sort of moving the steamship, if you will? And where should they be investing?

I think the big change is really making Hadoop and these big data architectures more of a reality for enterprise customers. I mean, I've been coming to these Hadoop conferences for quite a few years now, and it was traditionally the Spotifys and these big internet-type companies that were always the ones presenting, and you never...

Twitter and LinkedIn, right?

You never saw, sort of, the big orange retailer selling hardware, and things like that, right? Big companies, but they really weren't looking at that stuff. I think they're finally starting to realize the value of not only their internal data, but data they can get from some public sources, and they're starting to integrate these to really build a much better model for making money over time.

And where are they in that journey? In other words, are they in pilot mode? How long does it take to get from pilot to production, or proof of concept to production? We've heard a lot of different takes on those early apps. Tell us what you see.

I think the majority of customers I talk to are fairly far down the journey, right? So they've got Hadoop clusters deployed and they're starting to use them, right? Are they using them for very advanced applications yet? There are far fewer of those. Still a lot in POC mode for those more advanced applications, and those typically take three to six months, right?
And then they sort of take that information and maybe start to build an architecture of, all right, this is how we could really use this type of data.

And with the POCs for some of the more advanced apps, what were those apps that you have in mind?

I mean, a lot of IoT-type apps, or bringing specific data in from public sources, be it Twitter or Facebook, things like that, and bringing them in. It's really marketing-type use cases that are still a big piece.

Well, early on in Pivotal's career, if you will, GE made a big investment in the company, and that's sort of IoT related. They call it the industrial internet. What can you tell us about that relationship or any activities that are going on?

I mean, it's still going strong. A big part of that relationship is not only on the data side, but also the Pivotal Labs and advanced development piece, for building an application or a suite of applications like that. But for a lot of the things they're doing around the industrial internet, they need all those base data products and fast access to that data. So they definitely leverage those products.

Pivotal Labs is interesting. When EMC purchased Pivotal Labs, they were like, what? But it's turned out to be sort of this tip of the spear, leading edge, show-the-customers-the-way piece. I mean, it's not necessarily the big P&L driver, I presume, at least in the future, but is that the right way to think about Pivotal Labs?

It's definitely how we like to leverage them in the enterprise space. Before, they were doing a lot of that internet-company type thing, right? So re-architecting parts of Twitter, things like that. But they've started moving toward that enterprise space and really showing people how to leverage things like Cloud Foundry and our data products to rapidly develop applications that can bring value to their company. Become a much more data-driven enterprise, which is what people are trying to get to, right? They have all these data sources, and they need to process that data and move their company forward based on the data they have.

Just to raise an issue that seems to be coming up everywhere, before the conference, during the conference, with every vendor: how is Pivotal embracing or working with Spark? I mean, because you guys are strong at the storage layer and at the analytics layer, but not necessarily competing one-to-one with all the Spark functions.

We're definitely interested in Spark's progress, and in taking parts of Spark and integrating them with our product line, right? And hopefully in some unique ways, right? So Spark is definitely a big win in terms of batch processing of data in memory to make things faster, right? You get that immediate win over something like MapReduce. So that piece isn't necessarily as interesting, but leveraging things like MLlib, right, for some of the data science, machine learning capabilities. And now that we're open sourcing our MADlib product and giving it to Apache, right, basically moving some technologies between those to make them both stronger. But also being able to take data from our in-memory data grid, Apache Geode, right, and maybe make that available as Spark RDDs, or the other way around, and then leveraging something like HAWQ to be able to read and write data to and from in-memory sources with Spark. So there's definitely a lot of different ways we're looking at integrating it. We definitely see it as a partner in the ecosystem, and it's important when customers are starting to become interested in it.
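One of the integration paths just described, pulling a HAWQ table into Spark and handing the columns to MLlib, can be sketched roughly as follows. Because HAWQ is PostgreSQL-compatible, a stock PostgreSQL JDBC driver is assumed; the JDBC URL, table, column names, and driver coordinates are illustrative assumptions rather than a documented Pivotal recipe.

```python
# Sketch: read a HAWQ table over JDBC into a Spark DataFrame and cluster it
# with MLlib's KMeans. Connection details, table, and columns are assumptions.
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.clustering import KMeans

spark = (SparkSession.builder
         .appName("hawq-to-mllib")
         .config("spark.jars.packages", "org.postgresql:postgresql:42.7.3")
         .getOrCreate())

df = (spark.read.format("jdbc")
      .option("url", "jdbc:postgresql://hawq-master:5432/analytics")
      .option("dbtable", "measurements")
      .option("user", "gpadmin")
      .load())

# Assemble the (hypothetical) numeric columns into a feature vector and fit a model.
features = VectorAssembler(inputCols=["temp", "vibration"],
                           outputCol="features").transform(df)
model = KMeans(k=3, featuresCol="features").fit(features)
model.transform(features).select("temp", "vibration", "prediction").show()

spark.stop()
```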
You talk about IoT, Internet of Things apps. One of the use cases that's emerging as a sweet spot for Spark is the combination of streaming machine data, or time-series data, integrated with the ability to filter, join, and aggregate that data using SQL and feed it into a machine learning pipeline. Does that complement what you're doing?

It can complement it, or we can take over part of that. So from our product suite, we can handle a lot of the stream processing, and we can plug into something like Kafka as a message broker. And then once we get the data into a pipeline, we can push it into any of the data products, and also push it directly into Spark for MLlib processing and bring that back in. Because a lot of people still want that real-time, truly streaming data feed, whereas something like Spark is really a micro-batch type thing. So it's interesting. For some people that causes issues, but it's definitely an area where we can play at any part of the pipeline. That's one of the nice things about this open ecosystem we've built around Hadoop: I can usually plug and play any piece to meet my requirements.

So are you ready to predict what's beyond Spark?

You never know. I was on the walk over here talking about some of the interesting companies we're starting to see in the market. The problem I think right now with this ecosystem is that as soon as an interesting product gets out there and we identify, all right, it has this weakness, well, another product comes in to address that weakness instead of growing that product. So it'll be interesting to see if Spark continues to grow and sort of builds itself into a bigger product, or does something else come along and take over?

Well, there was a gentleman in theCUBE a couple of months ago, and I won't mention his name, who popped out of Google and said, yeah, you know, when I was at Google, with MapReduce we went way beyond that; yeah, Spark, we figured that out and then we went on to something else.

Yes, and so...

And Flink from the Apache side.

It's a great part of our industry, I guess; it's always something new. It's always somebody trying to solve a niche problem that the other niche product didn't quite solve, so it's difficult for customers, right? So they're definitely looking to their vendors to provide that overall architecture solution, right? And I think that's really what Pivotal provides: the big data, the fast data, the streaming data, and then the development around Cloud on top of that. That gives us a unique position in the market that I don't think the other guys really have.
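As a rough sketch of the pipeline shape described above, Kafka as the message broker with events drained in micro-batches that feed a scoring step and then get written back out, here is a minimal consumer loop using the kafka-python client. The broker address, topic, event fields, and the stand-in score() function are all illustrative assumptions; this is not Spring XD or Spark Streaming itself.

```python
# Sketch: drain Kafka in small micro-batches, run a scoring step, and (in a
# real system) write results back to a data engine. Names are assumptions.
import json
from kafka import KafkaConsumer   # kafka-python

consumer = KafkaConsumer(
    "sensor-events",
    bootstrap_servers="kafka:9092",
    value_deserializer=lambda raw: json.loads(raw.decode("utf-8")),
    auto_offset_reset="earliest",
)

def score(batch):
    """Stand-in for the machine-learning step (e.g. a pre-trained MLlib model)."""
    return [dict(event, anomaly=event.get("vibration", 0) > 0.8) for event in batch]

while True:
    # Drain whatever arrived in roughly the last second as one micro-batch.
    polled = consumer.poll(timeout_ms=1000, max_records=500)
    batch = [record.value for records in polled.values() for record in records]
    if batch:
        for result in score(batch):
            print(result)   # in practice: write back to GemFire/Geode, HAWQ, etc.
```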
So Dan, we're out of time, but I have to say, you've been coming to Hadoop World for a while; we were talking about that when we first met. You've seen a lot of changes, you know, it's evolved. Some of it's good, maybe some of it's not so good, but from a nostalgic standpoint, give us the bottom line: what's your take on sort of where we are in the ecosystem and where we're headed?

I think, from the conference standpoint, part of the problem is that as your conference grows you have to fill a lot of spots, so you tend to have a lot of product positioning at these conferences, which, coming in as a user of the products, is not really what interests me, right? I want to see, right, how do you use this open source project to solve a particular problem, and that becomes interesting. If you really look, the heavily attended sessions are generally those types of sessions, right? Hey, how did I use Flink to solve my streaming problem? So I think customers definitely want to see many more of those, and we try to do that with our sessions. But it's fun to see Hadoop grow to this level and have all these conferences every year, right? I think two years ago I went to the first Hadoop Summit Europe, and it was sort of Hadoop Summit minus three; the European market was sort of that far behind. It was really fun to go back to that level and see people saying, this is a great, exciting new thing that's fun to play with. And I think we've gotten past that now, and it's really about enterprise use cases now.

Excellent, thanks for coming on theCUBE and sharing Pivotal's vision, your vision. We appreciate your time. Great, thank you.

All right, keep it right there, everybody, we'll be back from the Big Apple, New York City. This is theCUBE, we're live from Pillars 37, we'll be right back.