Live from New York, it's theCUBE covering Big Data NYC 2015. Brought to you by Hortonworks, IBM, EMC, and Pivotal. Welcome back to New York City, everybody. This is theCUBE, and we're here at Strata and Hadoop World. This is Big Data NYC, our event within the event. Scott Gnau is here as the CTO of Hortonworks and he's joined by Ron Bodkin, who's the founder and president of Think Big Analytics. Gentlemen, welcome to theCUBE. Good to see you both again. Thanks, Sam. Nice to be here. So, Scott, let's start with you. What's happening over in Strata? I saw the keynotes this morning. I saw Mike Olson say basically I predicted we'd make Hadoop disappear, you know? And I'm not sure that's quite happening, but what's the buzz over there? Well, the buzz and the energy is amazing and just the size of the group shows how much interest there really is in the market. I think the biggest thing that I'm seeing is that we've moved from the early adopter stage into the what's in it for me, show me some value phase of the business, which is where it's really interesting and sustainable. And we're talking less about core technology and more about what are we doing with it and how are we getting customers really to find the benefit of the data that they're storing. So, Ron, I mean, you and I have talked about this before. People have struggled to get value out of their Big Data analytics generally, Hadoop specifically. It's been complicated, lack of data science skills. And we've recently done some surveys that continue to show people are doing okay, but there's a select group that's hitting home runs and then there's a big fat middle, I'll call it, that needs help. Do you see the same thing? Yeah, I mean, I think definitely the market continues to mature and evolve. It's growing. The early adopters typically are doing useful production work, working with new data sets, doing analytics on top of Hadoop and Big Data platforms.
I think the early majority is at this stage of really trying to get their arms around the data lake. How do you build an industrial scale data lake? How do you have the right governance so you have confidence in the data and can use it for anything? So we see that's a really big topic of conversation for most of the companies that are moving to adoption, but typically they want to have the right approach. How do we really know if the data's accurate? So that's a big thing I'm talking about in my talk tomorrow on patterns for success in the data lake. What are the right ways to do it? So is that a data quality issue? You mentioned governance, is it an organizational role? I mean, it's all of the above. I mean, I think there's still a lot of tradecraft, right, the patterns and practice of how do you do this right aren't written down. It's different. A lot of the principles of data management from the data warehouse world still apply, but by definition, you don't want to have all the data fully governed and curated in the data lake, right? So what techniques do you use to have confidence in the data? Often we're dealing with distributed systems where lots of data's coming in from different places, so it's easy to have a data quality problem of incomplete data and not realize it, right? When you have a single source of data, it's easy to verify it's there or it's not. So it's that, I mean, security's another factor because people are putting a lot of data together and wanting to work with richer data sets, they have security and privacy concerns. So all of these are factors that have to be addressed and I think there's a big shift, in that the early crowd often were more willing to go out as gunslingers and just start implementing things and not worry so much about the messy details of is it accurate, is it right? And you're seeing a big shift to, this is going to be an asset in the enterprise.
So it's a good step towards maturity, but a different mindset is prevailing. So I want to get into the partnership, but before we get into that, Scott, I wanted to ask you, so you come from a world that's been around for a long time, Teradata, but you were in Teradata Labs doing some advanced work. So a lot of the things that Ron just talked about basically fall under the umbrella of enterprise ready, you know? We've seen this movie before, so I wonder if you could give us your perspective in seeing the evolution of Hadoop and the big data ecosystem, which is really exploding now and is somewhat unwieldy. What's your take on it? And I think you're right. Having seen this movie before, one of the things that I love about being part of this industry in general is, I think we have the opportunity collectively to really define the future of data and analytics and it's a whole new horizon. So just like in the 90s, right? Coming off of ERP standardization, data warehousing and BI became a really big thing and added a lot of value, to the point where now, without it, you wouldn't have survived, right? We're now living in a world of not only ERP digitized systems but the internet, 3G, 4G, cell phone networks and really the internet of things, creating a whole new context of data to collect and analyze and being able to define that fabric is a really cool place to be. And so what Ron mentioned, I think, is what you're seeing as kind of the first phase of maturity. Okay, I've invented the technology, I've found some really cool things. Now let's go worry about governance, security, all the enterprise class opportunities and the community and certainly Hortonworks and our partnership with Teradata, we have responded with adding in data governance with Apache Atlas and adding in new security features to make the data trusted, to make the data lake bulletproof, to build a big perimeter but also protect data and privacy inside.
So those are all kind of maturity pieces that you would expect to show up at this point in the phase. So those are some of the similarities. There are some differences. I remember when Teradata came out in the 1980s, I was at IDC at the time, they came and did their roadshow and I said, that's pretty interesting. They're sort of simplifying things, and Larry Ellison mentioned at one point, Teradata actually had it right, they put it into an appliance and they made it easy to do all this complex data management stuff. As the CTO of Hortonworks, you have this open ecosystem that you have to herd and I wonder if you could just describe that dynamic as well. And I think that really gets to what I was describing as kind of the new world of moving away from, and I'm oversimplifying, right? But moving away from an ERP centric world where you have digital transactions that are rows and columns and spreadsheets and at massive scale and that kind of thing. Into the world we live in today, where data is being created just all over the place. It's a different context and that different context has a couple of impacts. One, just the variety and diversity of the data being created is huge, right? So it's no longer rows and columns and transactions, but it's web logs and it's network packets and things that are very different in terms of the contents, how you read them, how you understand the content and the intelligence that's built into the data. So that's one piece. The second piece is really there's this explosion in volume, but not an explosion in value, right? So the net value per byte is going down and the variety of data is going up. So this creates a divergence and in that context, I think you really move less from a consolidation play and more to an ecosystem play where not one tool is going to be able to bridge that span effectively, but having multiple tools that are specialized and having them orchestrated together is where the value is going to be created.
I think, building on that, that really drives a couple of things. One is, because it's open source communities, no one really controls it, right? There are strong players that influence some communities more than others, but the movement is bigger than any of the companies and that's actually really different than innovation in a commercial ecosystem where you get warring keiretsus fighting each other, but they kind of circle the wagons and have their own camp, right? There's this co-opetition of sharing contributions to projects, of adopting projects that came from left field. Innovation being led in open source, that's new, right? That's new in the big data space, but also the other point around that diversity is that it's increasing, it means an even bigger role for having the right services, the right expertise of how to do this, right? I think the first companies that succeeded with these technologies had a very strong engineering mentality, but that's giving way to companies that don't want to do heavy engineering, but there's still a real onus in understanding how to put these pieces together and open source ecosystems, the other thing that I think is pretty clear is they don't optimize for simplicity, right? You probably end up with a lot less design and a lot more complexity in the system as well. So it's interesting comments because it does seem that the various constituencies are trying to influence, obviously, and you mentioned innovation. I think there's innovation and there's invention. There's a lot of invention going on and innovation seems to be noticed anyway or come to the fore when it scales. And that's not trivial getting this stuff to scale when you've got so many competing, confusing options out there. So I guess my question around that is, talking about the partnership, maybe we can go there, what's the objective of the partnership specifically as it relates to inventing and innovating, i.e. at scale?
Well, I think the irony is that really Scott should speak to it, maybe he should, because he was a key mover in creating it when he was at Teradata. And of course now he's at Hortonworks. So I'll have some editorial, but I think Scott is in a uniquely good position to comment. So what about that, Scott? Yeah, what about that? So at Teradata, Teradata created the unified data architecture and the notion behind unified data architecture is really that ecosystem of different specialized tools for different specialized purposes, but being able to orchestrate those tools together to get the benefit of a broader data fabric, right? And so recognizing that structured transactional, relational, high volume, high service level data belong in a certain specialized technology and other data can belong and fit better into other technologies is really the basis of UDA. And the value proposition is the orchestration of the multiple components. So really being able to create analytics that traverse different systems and being able to combine the intelligence and the analytic engines that are contained in those systems creates great value. It's kind of like going from standard definition TV to high definition TV. You get more data, you get more access to more data, more access to more analytics and the value is really in combining them. So that's kind of one core component. The second core component is, almost having seen this movie again, the whole open source Hadoop infrastructure is really a key enablement of being able to store all this different variety of data and all of the volume of data at the right cost point without actually having to transform or modify the data. So keeping it in its raw format. And one of the things that we found over and again through any of the different phases of analytics is the best analytics you can derive are when you have access to the unadulterated base data as it came in without any changes to it.
If you make changes, you start to change the data, you apply business rules, the analytics will yield the rule that you applied. So part of having this broader data fabric where you've got the Hadoop infrastructure and landscape as well as core data warehousing technology and other analytic technology combined is you can get the advantage of all of the data unadulterated with large volume at the right cost point, multiple different analytic engines and a delivery vehicle that fits into existing infrastructure that companies have already deployed. So the genesis, wait, you said you had some editorial on that, I'd love to hear what you have to say before I follow up, please. Certainly from our perspective, it's funny because of course our journey at Think Big was this pure play focused on Hadoop and continuing in that focus, but now part of Teradata because of the importance of Hadoop, right? I mean, so from Teradata's standpoint, there is a huge value in unified data architecture and extending the capabilities of an industry leading data warehouse to a broader set of capabilities in a data fabric, right? I would say at Think Big, from our very beginning, we were typically integrating with data warehouses and we were helping customers add value by working with new data that wasn't fitting well, it didn't have the value density to work with the warehouse, or the access patterns, the analytics, were not a good fit, right? So I think that partnership is a really natural one that you're going to see, as we're doing so much more and becoming more data-driven as a society, with the digital revolution continuing, it's not a surprise you're seeing more specialization in the tools and techniques that we're using to work with data and do analysis.
Okay, and then Scott, you mentioned a couple of things, the unified data architecture and essentially leveraging the ecosystem. I presume the epiphany there was, we can't do it all ourselves, we have all this innovation going on, let's leverage it, that's the future. Is that what came out of your work at Teradata Labs? It did, right? And again, I think it's addressing a broader problem in a much bigger market with the right tools in the right place, number one, and then number two, in the preamble where I said, instead of centralization, moving to an ecosystem model, I think, will be defined by the marketplace requirements for some period of time. Those things combined together, I said, okay, let's expand our reach by creating this architecture. And then you have this, of course, this kind of, I call it no schema on write, and it leads me to something that I read on Twitter today and it was Merv was chatting with somebody and he talked about how the fragmentation of metadata in Hadoop is the same as it was in the BI world. And Merv said it could be worse because people are just dumping it into the data lake and trying to figure it out later. So what do you make of that comment? Is it a fair characterization? And part of me says, what's wrong with that? As long as you've got some processes in the back end to meet your business objectives, but what do you think about statements like that? Yeah, well I think there are a lot of people that don't know how to govern data in Hadoop and sort of back to the comment around gunslingers going off of ad hoc approaches. The risk of not having the metadata, not having the process and patterns of how do you track it is you don't have confidence in your data, you don't know if it's complete, you don't know if it's accurate, you don't know if it's properly secured, you won't get consistency.
I mean, it's very easy to have different ways of working with data that produce inconsistent results and that creates a lot of problems, a lot of challenges in operating a business and in making decisions. So there's some big stakes around doing it right, but to us, it's about having this balance, that you want to have a Goldilocks governance, just the right amount. You don't want to go so far as to say, well it has to all be fully curated and parsed and organized in third normal form, classic warehouse strategy. You want to have confidence that there's production flows and they're accurate. And I think the other thing, I mean, the subtitle of my talk is Beyond Schema on Read. It's not enough just to say, hey, we can put schema on something when we read it. Some stuff you do want to curate and structure. There's foundational elements like keys and timestamps where you want structured data and you want them parsed in almost any system. But there is this nice property of being more agile and saying, well, we're going to parse out and structure data just the right amount, right? That as we see value in something, we're going to promote it to a more structured form and create analytic products, but maybe we can start with a more raw form of data, and just where the value has been proven we'll harvest it and make it a more repeatable approach. I love this idea of Goldilocks, you know, governance, Goldilocks structure, whatever you want to talk about. Because people are struggling. When you talk to, we do our surveys, we talk to IT people, we talk to business people, IT people are saying, yeah, check, we did it. It worked. And business people say, not really sure where the value is. And so, you can- And a lot of times those systems, what happens is it's like, when everything's running, it's good. And then something goes wrong and all hell breaks loose. Yeah, yeah, yeah. So, you can dump it into the data lake.
That's cool, easy, get it up and running, but the hard work is getting the value out of it. I wanted to ask you, you guys got a unique perspective because you're both former Teradata, and now you're Hortonworks and a Teradata company. So last year, there was a lot of talk and our surveys sort of indicated this too, that there was this big sucking sound, that the ROI of Hadoop was reduction on investment. Are we going to be able to reduce my investment in my existing data warehouse and lower the denominator and increase my value? And this year's data is like a 180. Everybody's saying, yeah, well, no, my data warehouse is critical. I actually can't do some of the things that I want it to do in Hadoop. Could be a flaw in the survey, could be a change in mindset. What are you seeing in the marketplace? I'm going to- Absolutely, we're seeing a maturation of customers really appreciating what each technology is good for. There was certainly a wave of inflated expectations, where people believed that Hadoop and the ecosystem would quickly offer a superset of what you could do in data warehousing and analytic grids. And I think customers have come to realize that that's far from the truth. So I think we've seen a lot more of the balancing of recognizing, hey, there's a balance. Like we both, there's a balance. These two things work well together and there's value for organizations in adopting both of them. We see a lot more customers therefore focusing on what's the value we get net new by using these new data sets in new ways on this new platform. So we're out of time, but I wonder if I could ask each of you, so maybe start with Scott, kind of the objectives of the partnership generally and then specifically what you want to accomplish here over the next, say, 12 to 18 months. What should we be watching? I think we obviously want to continue to, and we will continue to invest in the partnership making our customers successful.
And again, I think the ecosystem approach is the easiest way to success versus this versus that versus just cost takeout. Cost takeout is constant, right? I've been in the industry for 30 years. Every customer always wants to take cost out. That's not new. What is new is the capability, the capacity and the access to the data that didn't exist before. And that's really, really the most interesting part of the partnership. How about you, Ron? Well, I think the partnership's moving into the next phase as we continue to have more and more, a large number of shared customers and a lot of new capabilities, right? As customers are building lakes, we're excited by, we see the Atlas standard for metadata is nascent, but an encouraging direction of having a reliable place to capture metadata in the community and the ability to go into doing analytics on top of some of that data that's in Hadoop. I think there's a lot of opportunity for customers of the two organizations together. That's great. Thank you very much for coming on theCUBE. It was a great discussion. The economics are always important, Scott, as you said, but to make it sustainable, you've got to find that value. So, well, thanks again for coming on and sharing your perspective. I appreciate it. All right, keep it right there, we'll be back with our next guest. This is theCUBE, we're live from Big Data NYC in the Big Apple. We'll be right back.