Live from the San Jose Convention Center, extracting the signal from the noise, it's theCUBE, covering Hadoop Summit 2015, brought to you by headline sponsor Hortonworks, and by EMC, Pivotal, IBM, Pentaho, Teradata, Syncsort, and by Attunity. Now your hosts, John Furrier and George Gilbert. Okay, welcome back, everyone. We are here live in Silicon Valley for Hadoop Summit 2015. I'm John Furrier. This is theCUBE, our flagship program, where we go out to the events and extract the signal from the noise. My co-host this week is Wikibon's new big data analyst, putting out a whole new cutting-edge research agenda, with some new pieces just out: George Gilbert with wikibon.com. Our next guest is Scott Gnau, CTO of Hortonworks. VP of strategy Shaun Connolly, he's up next. Scott, good to see you. Welcome back to theCUBE. You came over from Teradata to Hortonworks. You know, there were so many tweets this morning; everybody's like, where are the orange shoes? But I have a green shirt on today. Thank you for noticing. We've had great chats in the past, certainly technical conversations around databases, Teradata, big data, open source, all great stuff. Just a lot of stuff happening here now. When we had Rob Bearden on, he had a spring in his step. We're going to have all the Hortonworks folks coming on. But the big thing that's happening is that you have an industry that's growing up. The big theme behind us is big data. Not so much Hadoop, though that's a big part of it. And now the cloud is coming over the top. You're seeing a lot of migration, a lot of pressure from the enterprise with cloud in the data center, which will impact a lot of this. So what's your take on the landscape right now, from a technical perspective? Is the Hadoop ecosystem ready for this melting pot of innovation with non-Hadoop? You've got Spark out there. You've got some stuff out there. You've got data centers changing with cloud. What's your view of the landscape?
Well, it certainly changes every minute, like the clouds in the sky today here in the Bay. I think that, frankly, a lot of the transition and change out there is actually about being in concert with all the other things that you mentioned. A couple of things, I think, are just really big, and we're never going back. One of them certainly is the cloud, which gives ease of deployment, flexibility, CapEx to OpEx, just a lot easier-to-consume compute resource. That's fantastic. When you look at the Hadoop landscape, with it being open source, a very quick, rapid innovation cycle, and also a low cost point as well, it opens up more avenues to data storage that didn't exist before. And so that all comes in concert with just all this data flying around. So it's a perfect storm where there's a lot of data flying around, and Hadoop, cloud, and all these other technologies are, frankly, ways of coping with that, and not only coping with it, but turning it into something of value. And the thing that I'm most thrilled about, being here for my fourth time, is how quickly this industry has turned to showing value. In this morning's opening comments, I talked all about customer value. It's not just about the data. It's not just about being able to land it. It's not about the science project anymore. It's about how do I turn this into making a better customer experience, running my business more efficiently, and competing in the market that I want to win. I've got to ask you, because, you know, you have the geek conversation, which is intoxicating in and of itself, a lot of open source projects going great, great software being developed in communities, but the customers now have a new way to operate their businesses. You're talking about, you know, the ability to do something experimentally with data, more data, if you will. So they have existing data.
So to run these experiments, you don't have to build the factory the way you did with, you know, the old software. You can get in with cloud, get in with open source, get something up and running. I mean, can you share some things that you've seen in that area? We were talking before we came on camera about, you know, how you can just get up and running and get that value. You can see it fast. And once you get it, then you double down. That's agile. That's DevOps. Those are the things we talk about. Yeah, and I think that that's really a key thing that all of these forces coming together are enabling. In the past, when you were going to go build analytics, right, you had to plan for it. You had to have a need. You had to have data. You had to go do a plan. It took time. And it darn well better work out. And of course, I've been around for a long time, and it worked out. Business intelligence and analytics are, you know, de facto for any scaled-out business these days, right? So when you fast-forward into all of the new data, and all of the volume and variety of the new data, you kind of look at it and say, wow, this is huge. I don't have time to plan for it. I probably couldn't afford to plan for it. But can I use some of these tools to go take a snippet, find out what the relationships are, and then make a decision to roll forward with some analytic solution? So in my mind, it's not like we're replacing the landscape of BI as much as we're moving from standard definition to high definition. And getting to high definition implies lots more data, lots more pixels, right? And a lot of new ways to deploy tools to capture those pixels, to really understand and crisp up your relationship with your customers, or the relationships inside of the business and how you operate it.
And frankly, the technology stack that comes with Hadoop, and the ecosystem around Hadoop, is really the enabler for that, because you can quickly spin something up. You don't need a schema and ETL; you can dump some data and send some data scientists at it. If they don't find anything, you fail fast and go get the next thing, right? And you can do it all with a cost paradigm that makes it very, very appealing. So it's disposable, it's agile, it's got a great cost paradigm. And like I said, this year at the summit we're actually hearing more of the successes. I've known for years that there were going to be diamonds in the rough somewhere, and people are finding those diamonds. They're putting them to use. And they actually have proof points behind this. They're not manufacturing success; they're real successes. And I love the standard TV to HDTV concept, because in a way it's the same medium, it's just getting better. And actually, everyone will be moved over to a new normal format. So that's totally cool. I've got to ask you a question as a CTO, just in general as a CTO, not just for Hortonworks, but as a technical leader in the industry, or tech athlete, as we say. A lot of customers we talk to are trying to put their arms around, their minds around, platforms versus tools. We're in a very tool-centric world right now, because tools get the job done, but there are also platforms. And then the lines are blurring; there's some noise and some confusion. Tools are hot, but they coexist. Take us through your vision of tools and platforms in this Hadoop ecosystem for customers, and how they should think about it. Because when a customer hears platform, they go, oh man, that's a big expenditure, training. Tools, I like tools. I can buy a tool and hammer some nails in. You know what I'm saying? So take us through that, your perspective, technically, on tools and platforms.
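The "dump some data, no schema and ETL" pattern he describes is what's often called schema-on-read: land the raw records as-is and apply structure only at query time. A minimal sketch of the idea in plain Python (the event names and fields here are illustrative, not from the interview):

```python
import json

# Raw events landed as-is: no schema was declared before ingest,
# and sparse or irregular records are accepted without rejection.
raw_events = [
    '{"user": "scott", "action": "view", "sku": "nails-8d"}',
    '{"user": "scott", "action": "buy", "sku": "nails-8d", "price": 8.0}',
    '{"user": "anon", "action": "view"}',  # missing fields are fine
]

def query(events, **filters):
    """Apply a 'schema' at read time: parse each record and keep the
    ones whose fields match the filters; records lacking a filtered
    field simply don't match."""
    for line in events:
        record = json.loads(line)
        if all(record.get(k) == v for k, v in filters.items()):
            yield record

purchases = list(query(raw_events, action="buy"))
print(purchases)
```

If the exploration finds nothing useful, you throw the data scientists' query away and move on, which is exactly the fail-fast economics he's pointing at.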
Yeah, so I think you're right, by the way. I think we as an industry should come up with a better term for platform than platform, because it implies all of the things that you said. And I have felt this way for 30 years, and big data, the Hadoop stack, and all the tools do not change my perspective on this. I think the most important thing for any company to do is really think about and put a lot of thought into their data architecture. How am I going to architect the data? How am I going to tier data based on where the data came from and what the use cases are? How am I going to tier that across different platforms for storage and processing? And come up with a roadmap for how that's going to be deployed. Because once that's defined, then tools can come and go and be interchangeable, versus being application- or tool-centric. If the application or tool changes, and we've seen that happen repeatedly over the last 30 years, then you're kind of stuck and you have to do a do-over. But if you build out the proper data fabric and data architecture, then you're foolproof, right? You've always got it, and as the tools change, you can leverage them very quickly without a do-over. Because the do-over, that's costly and painful. It's costly and painful, but even more importantly, with a do-over, if you lose the data for three years, five years, or something, that's a timeout in your corporate intelligence. Can you really afford that? Forget the cost of storage and the rehosting. So how do they do that? Storage is one aspect I was talking about. I mean, I like EMC's approach right now, I've got to say, having some sort of agnostic storage, whether it's EMC or storage in general. I don't want to stuff everything into a tool, because, as you said, then I have to pull it back out, and there's the do-over again, right? You can have other things on premise. So do the apps drive the tools and the data? If it's an app-centric workload, how do you think through that?
Just a generic fabric on the data layer? Yeah, so again, you define your data infrastructure, and this is more of a logical concept than a physical concept. Thinking one level down from that data fabric, you can then get into service levels and deployment and corporate standards and physical architecture and so on. So the first brush at data architecture really is logical. Now, I think once you start to look at the applications that the business needs, and you've got that logical framework built, then you can start to interchange the physical deployments underneath it. Can you drill into that a little more? On the logical level versus the physical level, the thing I picture is different storage tiers, but I don't think that's what you mean. It's actually not what I mean. I think that's the phase-two question. That's the physical stuff. Yeah, once I've defined my logical architecture, which means this kind of data is going to be in this kind of file system, with these kinds of use cases and these kinds of service levels, then you can define the physical behind it based on: do I need disaster recovery? Do I need full-blown, high business continuity solutions? Do I need solid state? Can I afford a lower service level with cheaper storage? So once you've defined that logical architecture, you can then, by subject area and the different threads in your data fabric, determine the physical deployment based on the service level that you want. And by the way, I also believe that over time those physical requirements will change, right? So at least having that logical view, and then being able to interchange the physical implementation, creates a more sustainable approach over time, because one thing that I've learned after all of these years is that every business changes, and every business user changes his or her mind many, many times over the course of an implementation.
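The logical-to-physical separation he describes can be pictured as a small mapping: each data class declares the service level it needs, and a separate function decides where it physically lives. This is just an illustrative sketch; all the names, thresholds, and tier labels below are made up for the example, not taken from the interview:

```python
# Logical architecture: each data class states what it needs,
# not where it lives. (Names and numbers are illustrative only.)
logical_tiers = {
    "transactions": {"availability": 0.99999, "dr": True,  "retention_yrs": 7},
    "weblogs":      {"availability": 0.99,    "dr": False, "retention_yrs": 1},
    "sensor_raw":   {"availability": 0.99,    "dr": False, "retention_yrs": 3},
}

def physical_deployment(requirements):
    """Pick a physical tier from the logical service level. Swapping
    this function out re-homes the data over time without touching
    the logical model above it."""
    if requirements["dr"] or requirements["availability"] >= 0.9999:
        return "ssd-replicated-with-dr"
    return "commodity-hdfs"

plan = {name: physical_deployment(req) for name, req in logical_tiers.items()}
print(plan)
```

The point of the split is the one he makes: when physical requirements change, only `physical_deployment` changes, and there's no do-over on the logical architecture.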
I think that foolproof is a good angle. And now I remember our conversation from the last time you were on theCUBE. I don't know if this is the exact way to put it, but you talked about how getting something started is easy; you can implement it in a variety of ways. The challenge of going deep is the integration and the proliferation, if you will, of the growth. That's the challenge. So technically, in Hadoop, we're in that implement, integrate, grow phase, past the chasm, if you will. And then you're in the nuances of connectors. Where are we in that implement, integrate, grow cycle? Which one are we in? If those are three phases, obviously growth is in all three areas, but where's the work to be done? Where are the white spaces, and how would you peg the stage, if you will? Yeah, well, I mean, obviously I think we're moving from science project to redeployable asset very quickly. And so we see, certainly with Hortonworks and what we're delivering, ease of use and user experience, just trying to make deployments easy to spin up and spin down, and make the user experience better, so that you don't need specialized training to actually operate it and get value. So certainly that is a piece, and I'm seeing that across the different vendors that are represented here at the conference. And I think the other thing that we're seeing, and I've been a big believer in this, is that this isn't Hadoop versus the world. This is an ecosystem, and it's going to be an ecosystem for a very long time, I think for two reasons. One is because the orders-of-magnitude differences between service level requirements, investment required, and the variety and velocity of data are so huge that there's not going to be one uber thing that just fixes it all. No big bang moment; it's an evolution too.
So the ecosystem is here, and the implications of that include common APIs, so that the components can become interchangeable, so that when you want to change your mind, you can change your mind without a do-over. And I do think that there is value to be added in filling in the blanks between the different applications and the technologies that fit them together. When you talk about common APIs, are you talking about file systems meeting common APIs, or are you talking about maybe a layer above that? Yes, and a layer above that: application APIs, specific ways for different applications to communicate with the different storage media, and then for the file system and the technology to communicate with the physical media, and really create an end-to-end use case for application development. That's great. I wanted to go back and recap a little bit about this ecosystem allowing us to go after the three Vs: volume, variety, velocity. There's also this notion that in the Hadoop ecosystem, it seems like we're deconstructing the database. We used to do Oracle, with all the capabilities in one engine, and now we can have query, we can have machine learning, we can have pick-your-capability and string it together. What does that allow us to do, beyond just low cost, that we couldn't do before? Well, I think the low cost, or at least the optimized cost, is really, really important. So here's the easiest example, and this is not high tech or particularly scientific, but it's an easy example that people seem to at least kind of get. Think about moving from the old world: if you're a retailer and I walk into your store and I buy a box of nails, you create a transaction of a hundred bytes. And that hundred bytes has value, because you know that Scott on that day bought this thing; you can do upsell, cross-sell, you can study it with all the other transactions to get price elasticity, all that kind of stuff. That's really great.
So it's high-density, low-volume data, very structured. For the same retailer, if I go to their website and I browse a bunch of stuff, and I find the nails that I want and finally buy that box of nails, now I've created 100K to a megabyte of data for the same relative value as when I bought that box of nails in the store. So this is six, seven orders of magnitude different. And then if you further consider that less than 1% of the people who go to a website actually buy something, right? Now you're at nine orders of magnitude difference between the density of the value and the density of the data. Forget the variety, right? And that's going to continue when you add in sensors and the Internet of Things and things talking to other things. The data is all interesting, but it's not a one-size-fits-all proposition as it relates to the density, the structure or lack thereof, and the value of the data. So in other words, you needed a lower-cost pipeline to do the analysis, because to get that same value, you couldn't have done it with the high-cost platform. Yeah, and again, it's not specific to cost; that's just an easy one to understand. It's that delta: I would never store 100,000 megabytes of data, in the same platform and at the same cost, for the same value I get from 1K of data. It makes absolutely no sense to do that. So then you get into the notion of physically tiered storage, or logically the data fabric, of, hey, by the way, on that weblog data, I don't need five nines. I don't need to have it backed up and have tapes sent to a mountain somewhere, because it's not that critical, right? If I get 99% of it, that's probably okay. That different service level means, okay, now I can use a different technology platform that has a different cost, and maybe a different set of tools that lets me traverse that log information to effectively extract what I need, to add it to the high-definition view of my business, right?
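His back-of-the-envelope comparison can be sketched as arithmetic. Using the figures quoted above (100 bytes per in-store transaction, up to a megabyte per web session, under 1% of sessions converting), every web purchase drags along the sessions of the non-buyers too; the exact order-of-magnitude count depends on which end of his ranges you assume, but the gap is enormous either way:

```python
import math

# Figures quoted in the conversation; session size and conversion
# rate are taken at illustrative point values within his ranges.
store_bytes_per_purchase = 100          # one POS transaction record
web_bytes_per_session = 1_000_000       # "100K to a megabyte", high end
conversion_rate = 0.01                  # "less than 1% ... actually buy"

# Each web purchase carries the data of ~100 non-buying sessions too.
web_bytes_per_purchase = web_bytes_per_session / conversion_rate

ratio = web_bytes_per_purchase / store_bytes_per_purchase
orders = math.log10(ratio)
print(f"{ratio:,.0f}x more data per purchase (~{orders:.0f} orders of magnitude)")
```

Whatever the precise exponent, the value per byte collapses, which is his argument for why one storage cost point and one service level can't fit all of it.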
But I just believe it's not a one-size-fits-all kind of thing, and it's not going to be for a very long time. So this whole ecosystem affords us the ability to do analytics that I believe otherwise would not have happened. But just to be clear, it's more than cost. It's more than cost. Cost is easy to understand, but it's more than cost; it's a different tool, right? You're going to traverse that weblog to find out that Scott, at this IP address, bought this box of nails and paid $8 for it, right? Out of that one meg, right? That's not rows and columns. That's a different analytic to pull that out. That's relevant. So it's variety as well as volume and cost, and all those things combined make for something that is very important. So how about storage for a second, before we get to the last question? Who's doing it right in storage? To get back to the point of cost and future-proofing and getting that foolproof data fabric layer, you've got to have a non-lock-in storage layer, because you want to have the data accessible, right? I'd rather have the data be agile and able to move around anywhere I need it, than have to rip it out of a tool or another platform. So who's doing it right? Are all the storage vendors positioned well? Yeah, EMC's got a new open-source approach. Yeah, and obviously we've got a new partnership with EMC as well, the Isilon partnership; I can maybe talk to you about that later. I think that... Was that just recently announced, or was that in general? Okay, good, great.
So I think that the storage vendors are all understanding this. From what I've seen, and I'm not the world's expert on the storage market, with the tools they're creating, they're creating this huge variety in terms of service level, density, and overall performance, and then building value-add in their own way, where they can help a customer standardize on a specific thing, take costs out, and provide those different tiers of storage. But I think, you know, the storage vendors themselves also understand that it's not one size fits all, and that there's going to be a combination of technologies required. Yeah, I'm impressed with EMC, for instance; I'm impressed with this whole DevOps mindset, because now they've got a little bit of open-source mojo going, newly renewed. But they have this idea of making it easy for the app guys. To me, in other words, I don't want to... Don't want to worry about it. Just write more software instead of building connectors and stuff. If stuff can just be available, I don't want to have to do migrations of data; all that stuff is kind of crazy. And automating that process is a very valuable thing. All right, final question; we have the hook here. John Chambers gave his last speech as CEO at Cisco Live this week, and he said 40% of the companies out there right now are going to go out of business if they don't, of course, be a disruptor. That's the classic cliche: be a disruptor versus being disrupted. In that vein, I want to ask you a question from a technical perspective, less aggressive than what he said. What should companies do as they think about the future? You mentioned the project you do with Teradata; Presto is a great example of getting into changing the game from standard definition to high definition. What about the big companies, the enterprises that are out there? What should they be doing?
What should the vendors out there be doing to go from today's era into the next generation of agile cloud, big data, Hadoop? What's the approach, the mindset, and how should they tackle that problem? You know, it's an interesting comment, and I think the more things change, the more they stay the same. So in my opening comments this morning, I talked about the value of enterprise-class Hadoop, and enterprise includes governance and security. Those sound like old topics, but they need to be rethought in this new paradigm. Governance today is not about a waterfall project in IT that takes six months, but about enabling self-service and being able to govern it with those tools. So I think a lot of it is really a rethinking of concepts that we all know very well in the industry, but looking at them... In a new architecture. I'm sorry? In a new architecture. Yeah, and thinking about it more as self-service than as an IT-led project. Self-service is everywhere: self-service consumers, self-service business users; you know, self-service is the way we need to apply it. And a lot of the new technology, and frankly a lot of the big data stuff that's going on, is due to data being created by all of that self-service. All right, Scott Gnau, CTO of Hortonworks: self-service is everywhere. Of course, you can self-serve yourself to the videos; go to SiliconANGLE.tv. They'll all be on demand there after this live event ends. This is wall-to-wall coverage for three days of Hadoop Summit 2015. This is theCUBE. We'll be right back after this short break.