Live from New York, it's theCUBE, covering Big Data NYC 2015. Brought to you by Hortonworks, IBM, EMC, and Pivotal.

Wikibon and theCUBE are at Strata 2015, hosting Big Data NYC. We have our two special guests from Hortonworks: Matthew Morgan, VP of Product and Alliance Marketing, and Wei Wang, Senior Director of Product Marketing. We're happy to have you guys. Two big topics that are of interest quite broadly: the customer journey, and the new DataFlow product, from the company formerly known as Onyara. So, Matthew, why don't you tell us, I mean, we've heard a lot about data lakes. We know that's still very early in the journey toward systems of intelligence. But tell us, for those who are dipping their toes in the water now, what should they have learned from others who've gone before, and what does the path forward look like?

Thanks, George. It really is an amazing year to be at Hortonworks. People are transitioning the conversation. A year ago, a lot of people were starting to explore what Big Data meant. They were looking at ways they could archive more data. They certainly wanted to take care of the unstructured data problem. But today the conversation is all about what we can do with it. And we're seeing some amazing innovations, driven both by business leaders and by technology leaders, in bringing the true value of Open Enterprise Hadoop and Big Data to the marketplace. Now, from our point of view, folks ask us not only for the technology, they ask us for guidance. They ask us, hey, of your five, six hundred customers, how are they going about maximizing the true value of data? At the beginning of this year, we had an exploratory conversation with some of our best customers, asking them about their specific use cases and how they brought things to market, and we found some incredible correlations.

Okay, that's a great segue. Tell us what some of those common threads were, so those who are trying to learn from best practices can hear.

Okay, so it's simple. There are two distinct schools of thought on value. The first one is all about cost savings. This is the most obvious, most easily accessible value proposition for CIOs to leverage.

Is that cost savings related to the data processing infrastructure?

Absolutely, so let me go through them in particular. There are three different kinds. The first one is active archive, which is about taking cold storage, or deep cold storage, meaning tape storage, and bringing it back online, hence the active part, and doing it for pennies on the dollar. People are able to take the archival processes they already have, expand them, make them active, and save 70, 80% of the cost. That alone drives enormous initial deployments and expansions of Open Enterprise Hadoop. The second one is ETL offload. So let's talk a little bit about ETL processes. Organizations spend millions of dollars to get at high-value ETL output. That's where the analytics come from. This is where an enterprise data warehouse adds its most value. What ETL offload allows you to do is add Hadoop to these enterprise data warehouses, concentrate the high-value, high-intensity work on the EDW, and move the lower-value work off to Hadoop.
Have your customers measured, let's say, either the CAPEX difference between running the ETL on the data warehouse versus on Hadoop, or, better yet, the running-cost differential between the two?

So let's take the running costs, for example. You're dealing with an enormous cost basis to deploy an enterprise data warehouse, and oftentimes it combines storage and compute into one entity, so it can be amazingly costly for an organization to scale capacity, and everybody wants that capacity. So by taking the lower-value ETL and running it in Hadoop, you can cut your costs north of 80%, which is a lot, right?

So this sounds partly based on CAPEX, the upfront cost, because of the scale-up nature of many of the data warehouses, or at least the storage associated with them. Okay, so 80% combining CAPEX and running costs?

Yeah, that's a blended rate.

Okay, and what's the third one?

The third one, okay. So let's talk about the EDW for a moment. It's important for the EDW to capture all the insights, including insights from third-party data that's in the cloud. And as a result, you need to add the capacity to handle unstructured data. Photos, videos, sensor data, geolocation data: all of this needs to be part of your analytics. Well, we can add that capacity at a very low cost. We see that as cost savings because the overall output gets a heck of a lot richer. It's got more fidelity, and you're able to do it in a highly cost-effective way. That outlines our cost-reduction stream, which is a collection of use cases, as I indicated.

Okay, let me just grab that thread a little bit before we move on. If you're taking some of this unstructured, semi-structured data that is externally sourced, whether it arrives in a feed or from particular applications you haven't used before, how do you integrate it with what's in the Hadoop data lake, and how would you have done it in a data warehouse, if at all?

Well, the truth of the matter is this is an augmentation strategy. People want to take this unstructured data and extend their enterprise data warehouse, and Hadoop gives you that capacity, right? If you think about your classic EDW, it's rows and columns: structured data, important, critical structured data, but structured data. The critical information that comes in semi-structured and unstructured form, like I mentioned, can be added, and you can extend the enterprise data warehouse analytics to include it. That's really the value there.

Okay, so there's a little work that has to go on in terms of essentially joining, you know, logically joining the semi-structured data with the highly curated, structured data.

You got it, you got it. Now, those are the cost savings areas. But more than half of our conversations this year aren't about cost savings.

Okay, go ahead.

They're about business transformation. And we've been trying to identify where that line is, because it's important to have the cost savings conversation, because it's so immense. But the business transformation side is where the possibilities are. And what we're finding is that people use the initial cost-savings implementations to fund broader projects, which can drive real business transformations.

In other words, does that mean they had to justify to the executive who had the budget that they've got an ROI fairly quickly, so they can free up more budget? Is that how it works?

Yes, it's flipping the bit.
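To put rough numbers on the blended rate and the quick ROI Matthew describes, here is a back-of-the-envelope sketch in Python. The per-terabyte figures are invented for illustration and are not Hortonworks numbers; the point is only the shape of the calculation.

```python
# Hypothetical ETL-offload savings: all dollar figures are assumptions
# for illustration, not vendor pricing.
edw_cost_per_tb = 30_000     # assumed fully loaded EDW cost, $/TB/year
hadoop_cost_per_tb = 2_000   # assumed fully loaded Hadoop cost, $/TB/year
offloaded_tb = 100           # TB of lower-value ETL moved off the EDW

annual_savings = offloaded_tb * (edw_cost_per_tb - hadoop_cost_per_tb)
savings_rate = 1 - hadoop_cost_per_tb / edw_cost_per_tb

print(f"Annual savings: ${annual_savings:,} "
      f"({savings_rate:.0%} on the offloaded workload)")
# -> Annual savings: $2,800,000 (93% on the offloaded workload)
```

Under these assumed rates the saving on the offloaded slice exceeds the "north of 80%" blended figure quoted above; real numbers depend on how much work can actually move.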
People want to say, hey, I have all of these dollars tied up in keeping the lights on. I want to take some of that and start applying it so I can gain the value of the big data, right? So I have something I'd like to show you. This is the customer journey. Internally, we call it the placemat, because it helps articulate to our customers and prospects the different types of value associated with the different use cases. Now, you and I just walked through this bottom list here. This bottom list is all about the cost savings. The top swim lane, if you will, focuses on the business transformations, and there are three distinct classes of use cases that we see. Data discovery, which is brain-dead simple: get all of your data in one place, and then a data scientist can pull amazing insights from it. They can do all the analytics just by having the power of the data lake. This is what we've been talking about for so long. Single view, which is in the middle, is now becoming probably the most popular use case. A single view means that an organization knows everything about its customer, or everything about its asset. And once it knows everything about a customer, asset, product, whatever, what actionable activities can it take?

That's a perfect segue.

I'll give you an example: Mercy. This is a massive healthcare organization, a million patients annually. They had three distinct sources of data about their patients. They had their ops data, they had their medical data, and they had their finance data.

The ops data, that was operations, internal operations? Oh, like the activities to support the patient's stay or whatever.

You got it, exactly. These three distinct repositories were all stored in Epic, which is the most popular healthcare system out there. And they wanted to get the single view.

And they couldn't get the single view even though it was essentially one ERP system?

Yeah, they had three distinct repositories.

Just to be clear, those three repositories were all Epic systems?

Yes, they were.

Okay.

So what we were able to do was deploy a central data lake that the Epic systems also write to, and now Mercy has that single record, the single patient view. We have an amazing video on our website where they talk about this and the value it has driven, but it gives the finance guys the ability to access the medical notes from bedside and articulate exactly what the treatment was, in a much more context-sensitive way. At the end of the day, it's better patient care and a more efficient operation.

Okay, I know we don't have time to go too deep, but maybe paint us a slightly more concrete picture. Would the finance guys be looking at patient outcomes and the different costs, essentially the costs associated with those outcomes and the treatments that go with them?

So I have personal experience with this, and this is the case at Mercy. When you get a medical bill, and let's say it's not covered by your insurance, which happens more and more nowadays, you always have the simple question: why was this charged? What is this line item? That's the same type of information that would be presented to a Mercy telephone representative answering a patient's concerns. Now they can have a much cleaner answer to the why, right?
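As a minimal sketch of what that single patient view looks like once the three feeds land in one place: the repository names, fields, and patient ID below are invented for illustration, and a real deployment would assemble this in the data lake (for example with Hive or Spark) rather than in plain Python.

```python
# Three hypothetical per-patient repositories, merged into one record
# keyed by patient ID. All names and values are illustrative.
ops = {"p1": {"admitted": "2015-09-01", "ward": "cardiology"}}
medical = {"p1": {"diagnosis": "arrhythmia", "notes": "bedside notes ..."}}
finance = {"p1": {"billed": 12400.00, "line_item": "telemetry monitoring"}}

def single_view(patient_id):
    """Combine all repositories into one patient record."""
    record = {"patient_id": patient_id}
    for name, source in (("ops", ops), ("medical", medical),
                         ("finance", finance)):
        record[name] = source.get(patient_id, {})
    return record

# A phone representative answering "why was this charged?" now sees the
# bedside medical context right next to the billing line item:
print(single_view("p1"))
```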
Let me hit this last point, right? This last one is about predictive analytics. Predictive analytics is the ultimate use case. Everything that I showed you builds up to the opportunity to get predictive, and, as our CTO often says, prescriptive about what's next. A great case study on this one is Progressive Insurance. Progressive introduced something called, what was it, what is it called? It's a telematics device that plugs into the car and allows you to instantly instrument everything about the customer's driving behavior.

Yeah, Snapshot.

I was trying to bring you into the conversation. Snapshot. They were able to take that data and predict outcomes. Now, again, it's important to understand this is an opt-in service, so it's not like they're Big Brother forcing anyone to participate. But what they were able to do is instrument these drivers, which netted out to $2.4 billion of revenue. This is big scale. By associating higher premiums with higher-risk drivers, they're able to offer better rates to the better drivers, identify opportunities around their driver pool, and overall create a more efficient operation. And when I use the 2.4, that's not net new; that's just a size conversation. It's significant, right? A multi-billion-dollar business using this type of data. So this is the power of big data.
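To give a flavor of how opt-in telematics events might turn into a premium decision, here is a deliberately simplified sketch. The features, weights, and tiers are invented for illustration; Progressive's actual models are not public.

```python
# Toy risk scoring over opt-in telematics events. Every feature, weight,
# and threshold here is a made-up assumption for illustration only.

def risk_score(events):
    """Score a driver from per-trip event records (higher = riskier)."""
    total_miles = sum(e.get("miles", 0) for e in events) or 1
    hard_brakes = sum(1 for e in events if e["type"] == "hard_brake")
    night_miles = sum(e.get("miles", 0) for e in events if e.get("night"))
    # Blend hard-braking frequency with the share of night driving.
    return (0.5 * min(hard_brakes / total_miles * 100, 1.0)
            + 0.5 * (night_miles / total_miles))

def premium_tier(score):
    if score < 0.1:
        return "discount"
    return "standard" if score < 0.3 else "surcharge"

trips = [
    {"type": "trip", "miles": 120, "night": False},
    {"type": "hard_brake"},
    {"type": "trip", "miles": 30, "night": True},
]
print(premium_tier(risk_score(trips)))  # -> "surcharge"
```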
Okay. All right, Wei, your turn. Hortonworks made some waves with a recent acquisition, a product renamed DataFlow, in the class of streaming analytics products that is getting a lot of attention. That's the whole category. Tell us where this fits, and how Hortonworks is positioning it relative to the existing product line.

Okay. Let me first actually make an announcement: Hortonworks DataFlow, that's the product name, is now available as a new subscription, this week at the Strata conference.

You heard it here first.

Big Data in New York, absolutely. So this is coming out of our acquisition of a company called Onyara. And, in our traditional way of doing things, it's based on an Apache project, Apache NiFi, that originally started eight years ago inside the NSA. So this is not brand-new tech, right? It truly has been tested and used by a large number of people within the federal government, and now the NSA has decided to open source it and transfer it. We happened to buy the startup that is behind this Apache project, Apache NiFi.

So tell us, I guess a lot of people are familiar with the early use cases of the data lake, or Hadoop's data hub. Tell us what you see as the early use cases for a streaming product like DataFlow.

Okay. So I think the traditional way you think about Hadoop technology is often as transforming the way people do things within the data center. But this is way outside the data center, maybe just right outside the data center or way out at what we call the jagged edge, right? We're talking about metal that moves. Oftentimes that's automobiles, trains, planes that have telematics, that have sensors, that have things gathering data and able to transmit it through a device as small as a Raspberry Pi.

Okay, a Raspberry Pi being like a single-board computer.

Yes, one that you can usually buy, last time I checked, for less than 50 bucks, right?

Now, correct me if I'm wrong, but my understanding is most stream processors today are sort of one-way collection devices, and the analytics happen in the center. DataFlow is different. Tell us how.

Yeah, so DataFlow can actually do the analytics on the edge as well. I'll tell you the reason through a use case, since you did ask about early use cases. I'm going to use the traditional IoT use case, and then I'm going to talk about another use case that sits much closer to a data center and combines the two products together. So let's go through the first one, the oil rig. When you are gathering data on a remote oil rig, the amount of data you gather can far exceed the bandwidth you have, which is also the most expensive bandwidth you will ever pay for.

Because it's at the edge?

Because it's at the edge, and usually it's actually a satellite link. You're paying for a very expensive link to transmit. So why do we need some analytics on the edge? Because you want to be able to do some initial digesting and integrating, and then pick the right data to send over your most expensive link back to your data center.

Okay, here's a question. So you've pushed analytics out to the edge to, essentially, only surface the things that get past a filter. How does that filter change with time? How do you know how to change that filter?

So let me stay with the oil rig for a moment. Say you're transmitting 1,000 events a second.

And put it in the context of the oil rig, just so it's concrete for me.

The oil rig, yeah. So on the oil rig, you're now transmitting 1,000 events a second on your most expensive link.

Like, this is the drill bit, you know, when it's getting hot or running into something too hard.

Right. So most of the time everything runs well, and you're just collecting data once in a while. But let's say the satellite link goes down for 10 minutes. Now you have a 10-minute gap, and when the link comes back up, the oil rig is on fire. Now you not only want the data that's currently being transmitted, you want to go back and pull all the data from that gap. Hortonworks DataFlow, and the Apache NiFi technology underneath it, allows you to do that.

Okay, but there's also a channel, a control channel.

Which most of the other stream processors don't have.

Which is, now this control channel might say, okay, ring all the alarms and turn on all the fire extinguishers.

Exactly. That's why we call it a two-way, bi-directional data flow. It allows you not only to control the information coming in, but, based on the analysis, to send some control back out.
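Here is a minimal sketch of the edge pattern Wei describes: keep everything in a local buffer, send only high-value events over the expensive link, backfill the gap on request, and accept control commands coming the other way. Apache NiFi expresses this with flows of processors, prioritized queues, and back pressure; the plain-Python agent below only illustrates the pattern and is not NiFi's API. The thresholds, link object, and alarm hook are all assumptions.

```python
# Toy edge agent illustrating filter-at-the-edge, store-and-forward,
# backfill, and a bi-directional control channel. Not NiFi code.
from collections import deque

class SatelliteLink:
    """Stand-in for the expensive uplink; real transport is assumed."""
    def send(self, event):
        print("uplink:", event)

BUFFER = deque(maxlen=100_000)  # bounded local history

def is_high_value(event):
    # Assumed filter: only anomalous readings justify satellite cost.
    return event["temp_c"] > 90 or event["vibration_g"] > 0.8

def on_event(event, link):
    BUFFER.append(event)          # keep everything locally
    if is_high_value(event):
        link.send(event)          # spend the expensive link sparingly

def on_command(cmd, link):
    # Commands arrive *from* the data center: the two-way channel.
    if cmd["type"] == "backfill":     # replay a gap after an outage
        for event in list(BUFFER):
            if cmd["start"] <= event["ts"] <= cmd["end"]:
                link.send(event)
    elif cmd["type"] == "alarm":      # act locally on central analysis
        print("ALARM: fire suppression engaged")

# Example: a quiet reading stays local; a hot one goes up immediately;
# the data center then asks for the raw history around the incident.
link = SatelliteLink()
on_event({"ts": 100, "temp_c": 60, "vibration_g": 0.1}, link)
on_event({"ts": 101, "temp_c": 95, "vibration_g": 0.2}, link)
on_command({"type": "backfill", "start": 95, "end": 105}, link)
```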
Okay, just speaking commercially, do you see the primary early customers being those who are already on the data hub customer journey, or is this a new set of customers?

So I think it's actually very complementary. I gave the oil rig as my first example, a classic Internet of Things case, but the second one truly sits at the edge of the data center. It's a perfect example of Hortonworks DataFlow working with the Hortonworks Data Platform. You have the perishable data, the data in flight, whether on the oil rig or on other things that are moving, or just flowing in: streaming data analytics. You do some analysis right away, because there's a time sensitivity. But at the same time, the data flows into your data lake, and now you have months of data, years of data, for historical analysis, which you can link, for example, to predictive analysis. You already did some analysis at the oil rig that helped you right there, but maybe data scientists back in the data center will analyze the whole operation over the last year and give you a better way of making sure the same fire doesn't happen next time.

So that's how you combine the value of recency with the value of context, or history.

That's right. Absolutely.

Okay, all right, with that we have to take a break. We've had Wei Wang and Matthew, my notes just disappeared, Matthew Morgan, VP of Product and Alliance Marketing, and Wei Wang, Senior Director of Product Marketing, telling us some very exciting stories about customer use cases and new products. This is George Gilbert, Big Data NYC. We'll be back shortly.