Live from New York, it's theCUBE. Covering Big Data New York City 2016. Brought to you by headline sponsors, Cisco, IBM, NVIDIA, and our ecosystem sponsors. Now, here are your hosts, Dave Vellante and George Gilbert.

We're back in New York City, everybody. This is theCUBE, the worldwide leader in live tech coverage. Matt Morgan is here. He's the Vice President of Product and Alliance Marketing at Hortonworks, and he's joined by Wei Wang, who's a Senior Director of Product Marketing at Hortonworks. Folks, welcome to theCUBE. Good to see you again.

It's good to see you, Dave.

So, big week. You guys had a major announcement today, a very meaty announcement. We're going to unpack that, but Matt, let me start with you. Give us the update on Hortonworks. What's new since we saw you in June?

So, we've been having tremendous success rolling out the connected data platform strategy. And the announcement that we've made today, and the one that we're going to be showcasing this week, is how much meat is behind the strategy in terms of customer adoption and overall uptake. Scott Gnau, our Chief Technology Officer, is going to be talking about connected data platforms and how people are adopting both a cloud and a data center approach for data in motion and data at rest. And we're going to be talking about the new technology that we've released in HDP and HDF to make that real.

So Wei, what can you tell us about the announcement, the details, what are the key highlights, and then we can get into it?

So I think that in summary, we certainly have made all the technology available not only on premises, but also in the cloud. We certainly have enabled our customers to actually run both the workloads in multiple versions of Spark, those were actually the highlights. And I think that you will see a lot of it showcased on the show floor at our booth.

So when you talk about cloud, we do a lot of shows and sometimes it's hard to understand what cloud really is. We have a CUBE cloud.
So when we say cloud, we're talking about the public cloud, is that right? Are there specific partners that you're working with there? Or is it also sort of cloud-like activity on-prem? Can you unpack that?

So let's talk a little bit about the cloud. A lot of people immediately run home to the concept that cloud means infrastructure hosted on someone else's machine. Hortonworks has a very different perspective on the cloud than any of these other big data platform providers. We believe that the cloud represents the appropriate levels of compute and connectivity and storage at the point of data inception. And the cloud's purpose is to facilitate the needs of these modern data apps as required by their architecture. So in other words, we see the cloud not as just the ability to say, hey, I need something on demand, because that is part of it. But we see it as a required part of a true data architecture. And our philosophy on this is being validated by the customers that I had mentioned. The idea is very simple. If I'm going to be deploying a next generation modern data app, let's take Prescient, for example. They built an application that does traveler safety. It automatically does push notifications based on 50,000 sources of data. And it crawls the internet constantly to bring this information in, aggregate it, combine it with your location, let you know if you're safe or not. That technology requires reaching out into the cloud, having data platforms in the cloud. So our strategy is to facilitate that. And we look kind of heterogeneously at the whole perspective of cloud. You're going to have a variety of cloud providers. Obviously, Hortonworks has a strategic relationship with Microsoft, with HDInsight. And that's a logical extension with Azure, which happens to be the most common, most frequently asked for solution in terms of data center proliferation across the planet. They have more data centers than others.
But that's just one example of what customers are doing for cloud technology when it comes to big data.

So, Wei, when you talk about a connected data architecture, what comprises a connected data architecture? What are the piece parts?

So connected data architecture is really, you think about it, we call it connected data platforms, right? With data in motion, as well as data at rest. We're not just talking about a transformation of the modern data architecture itself. We talk about actually bringing the different components and technologies, as Matt has referred to, on premises and in the cloud, into what we call a data plane. So, when people have talked about a data lake for years now, I think what people really are referring to is a data plane in which they can gather the data, either on premises or in the cloud, and then move the data to where they would like to process it. In the data in motion case, they are actually processing the data, doing some real-time, near real-time analysis right at the edge, right? And then also moving the data into the closest data center, in which they can basically store it and do long-term predictive analysis on it.

You used the term data plane, so it implies flatter, right, in a channel, and maybe more accessible than a data lake, or?

It's, well, if I can jump in. The idea on the data lake was like this ultimate evolution. We're going to have lots of data sources. We're going to put them into a lake. It's going to have a central architecture. We could do better correlation. We can get causality there. It's more efficient. The economics are better. What we've learned being in this business is people are going to have more than one. There's going to be more than one lake. People could set one up for cybersecurity, to do their cybersecurity analytics. They could do one for customers. And what Wei's talking about is these things will also exist in the cloud.
So you're going to have lakes behind the firewall. You're going to have lakes in the cloud. And you need some way to look at this holistically, right? Well, this is a combination of data in motion and data at rest. And this is what we call a plane, a data plane. Looking at data in the reality that exists today of multiple lakes, both behind the firewall and in the cloud, facilitating the needs of these modern data apps.

So let me make sure I understand. I can see where data that might be originating in the cloud probably stays in the cloud just because of data gravity. It's hard to move. And it's clear there's a lot of data that originates on prem or in a private cloud. But give us a use case where you've got those two repositories and you want to have the data in motion, which I assume is when it's originating, it doesn't have time to be persisted. You want to do the analysis in real time, but also enrich it with data from either the on prem data lake or the one in the cloud. What would be an example of that?

So that's a great question. And there are so many examples, whether I'm talking about healthcare or manufacturing, but the one that a lot of people identify with is the idea of the connected car. It's a very simple concept. You're going to have an automobile. It's going to be generating data from a variety of sensors. It needs to compute on that data as the data is conceived. So the idea of having analytics at the edge is a very big portion of a connected car. Now if you look holistically though, that car could have autonomous driving capabilities and fleet learning. You can't have that in the car, but it's best facilitated close to the car, which is in the cloud. You have dynamic scale. You have computational analysis. You have new capabilities brought up by lakes that exist in the cloud. So that's very, very valuable. But that manufacturer is going to need to do some deep historical analytics on whether or not that connected car is performing.
They need to get the vendors in line. They need to get the insurance companies getting the right quotes for their cut. There's all kinds of reasons why they want to do deep historical analysis. That exists in lakes that could be behind the firewall, because it's somewhat sensitive. It's customer information. This is an example of why connected data platforms, platforms with an S, exist as an architecture. You can have multiple platforms existing on the device at the edge, in the cloud, and in the data center, right? And data in motion is managed to move data as appropriate between those two variables.

So let's explore that a little more. So let's say you have data-rich technologies that are going to tell you, oh, are you risking a lane change? Or are your brakes at risk of locking? Things like that. Does that mean you want to have a mini sort of data lake or a data in motion implementation in the car? I mean, are you looking to have a Hadoop cluster? I'm exaggerating a little bit. But there's got to be stuff that you don't want to risk going over a network because you need immediate latency.

Okay, so it varies based on use case, but let's stay on the car for a moment. The idea that the edge endpoint is going to have broadband connectivity to the cloud is now common sense. This is the connected in connected car. It's going to have LTE connectivity. So there is data that can be collected over time that will inform fleet learning, which instructs the cars on what to do. Now, at the same time, it's not going to use that connection every time it wants to turn left. It's going to have to do dynamic analysis or collect perishable insights from the sensors on the car in real time without going out to the cloud to do the computation. But those are two different things. The fleet learning may instruct the vehicle how to take a corner on 101 due to construction over time, versus the sensors, which do a much better job of saying, there's a car in front of me.
I don't want to hit that car. I'm going to apply my brake. So data plays in both variables, but that's why there are two platforms, right? To kind of give you both, does that make sense?

Yeah, I read two different stats recently, from two different analyst firms, I suppose. One said that 50% of the data will stay at the edge and the other one said 95% of the data will stay at the edge. I don't know if we know yet. Do you have a point of view on that?

So it's use cases. Use cases will set you free, right? Look at the different variables of the real-time nature of the connected car, and let's take connected healthcare as another great example. There's enormous information that's collected right at the edge. You're talking about sensors that are on the patient that are constantly measuring their health. That information is then conveyed and communicated directly to their caregivers at the hospital bed. But by the same token, if you're diagnosing a complex illness, that information is not going to stay at the edge. It's going to be centralized where there's more information analyzed. By the same token, we've got this great ASU case study, which is all about curing cancer. Take that data, times it by 10 million, create a massive data lake, and then do this deep, rich historical analysis. So it goes back to the use cases, but in all examples, what is universal is the need to have connected data platforms, platforms with an S, multiple platforms, facilitating the data needs of the specific area, working together in tandem, going back to Wei's point, to have a data plane.

So I think that you guys had an opening, we were sitting there listening to your opening segment. I think Peter talked about a few things about that. The two things, one is, Matt talked about the use cases, right? Certainly everybody pointed out that on the road to getting the real benefit out of Hadoop, the slope is slightly steeper than a lot of people thought.
And we are certainly in the forefront in this journey with a lot of our customers. I think Matt pointed out, again, time after time, ASU, Prescient, connected car. We have seen the real benefits out of these Hadoop deployments. It's not even Hadoop anymore. It's data in motion and data at rest, a connected data platform solution. But at the same time, we do recognize the challenge. To overcome that, to be able to get the real benefits in a shorter period of time, we think the answer lies in the cloud, right? Lies in both the use cases, you can deploy it certainly in your data centers, but more and more so to be able to do some analysis. And also the deep machine learning and AI, you guys actually had a segment last night. So these are going to be all interconnected very soon. We are a leading vendor in this area. What we need to do is bring out the innovation faster for our customers and at the same time really tie it with security and governance, with operations, with basically the enterprise readiness to empower our customers to bring their unique use cases and bring out the real benefits in their scenario.

So when you talk about the slope, I presume you're talking about the slope, the degree of complexity, okay, that was underestimated by a lot of folks. And then value, you probably heard us talking about, oftentimes it was, I can get a cheaper place to store this stuff, okay, fine. Let's talk about customer value beyond that, right? Because there's this holy grail that the data warehouse never lived up to. Matt, what are you seeing in terms of customers being able to move up that value slope, if you will?

So value may have been elusive to some players in big data. We're not seeing that problem, not at all. So if you look at Summit, the keynotes were all customers. You look at the business track. This is the first time we've ever had a business track. It was oversubscribed.
You can look on our website and listen to these transformational studies. We talked about curing cancer at ASU, that's one example. Let's take something closer to home, Symantec. Symantec is closing the zero day window based on work leveraging Hadoop and connected data platforms. They're doing this both by leveraging cloud technology and data center technology working in concert. They're able to do this because they're able to read and analyze information in motion as well as burst compute as required to the cloud. This is an amazing use case, but it is universal that every one of these customers is talking about business value. I think that there have been new technologies introduced that have accelerated that. I think the work that's been done with Spark has accelerated that curve. I think the work that's been done with SQL at scale has accelerated that curve. We've found that the objections that exist, and there still are some, are being whittled away very rapidly. So the innovations that Wei has spoken about that we are showcasing at this event, the work around Spark 2.0, the work that we're doing on security by integrating governance and security, the work that we're doing around ensuring that access to SQL analytics is as fast as humanly possible, or machine possible, so that we can keep up with these big data sets, are eliminating objections. And you have to remember just 18 months ago, there wasn't even a user experience on the ops software, on Ambari. Look how far we've come in a very short period of time. So I believe that the customer value is there today. I think that we're getting more and more business cases being used to acquire or leverage acquisition of the tech. And I think that those showcases are going to accelerate as more and more of this tech comes along.

And isn't it fair to say, Matt and Wei, that a lot of the value was in pockets and then the governance police came in and said, whoa.
Wait a minute, security, governance, compliance, we have to, and that slowed the market down a little bit and it took a while for the industry to get sort of, call it, enterprise ready. You're saying we're there now. And now the slope, the S curve in terms of value, you're saying is hitting.

And we really think that the adoption rate is going to accelerate in the next 18 months. And when we're having a conversation next year at Hadoop World or whatever it's going to be called, right, a new branding, we will certainly talk more about acceleration of adoption, not just, again, the technologies and innovations as Matt pointed out, but also give you real examples, of which we have a lot more today on the cloud.

Well, again, and even the nomenclature is evolving, you're right, it's not, you know, we still use Hadoop World, you know, you guys are evolving beyond Hadoop Summit, it's data world now and it's all about the data value. It's not just Uber getting value out of data, folks. All right, we'll have to leave it there. Thanks very much for coming to theCUBE. Really appreciate it. All right, keep it right there. We'll be back live from New York City right after this.