Okay, we're back here live at SiliconAngle.tv's theCUBE, our flagship program, where we go out to the events and extract the signal from the noise. We are live in New York City at Strata + Hadoop World, put on by O'Reilly and Cloudera. We have a special demo walk-on presentation we grabbed from the hallway. As always, we go where the action is, and the action is in New York City for Big Data Week, and we're excited to bring you what was really the talk of the hallway: a big demo Microsoft did yesterday. A lot of people came out raving about how awesome it was, and we checked it out and brought in Microsoft to do a kick-ass big data demo, showing how easy it is for mere mortals to use big data, which has been a big theme of this year: analytics, higher processing power, ingesting data, all that tech geek stuff under the hood. That hard-to-do stuff is getting easier and easier, and we're going to bring that demo to you. I'm John Furrier, the founder of SiliconAngle.com, and I'm joined by my co-host. I'm Dave Vellante of Wikibon.org, and we're here with Mike Flasco of Microsoft. As John said, Mike yesterday gave the hot demo of the show, and everybody was swamping him after the demo; he's out of business cards. So Mike, welcome to theCUBE. Thank you, thanks a lot. That's a good sign when you're out of business cards, Dave. It's like, business is good, you know, when you give out all your cards, it's a sign of good insight there. So take us through the demo. Set it up for the folks out there: what happened yesterday at the event, what the program was, and then we'll go right into the demo. Sure, sure. So what we did was, we announced a couple of things. One was a preview service in the cloud for Hadoop, as well as kind of the complementary thing that runs on premise.
And the demo that we did was what I considered to be kind of the wheelhouse of Hadoop: we took a large amount of log files, processed them in the cloud using Hadoop, you know, did some aggregation on them, did some shaping on them, et cetera, brought them down to, let's say, 20,000, 30,000 records. And then we pulled them into Excel so that we could do a little more ad hoc analysis, kind of self-service analysis. And the whole idea was that we had log files tracking our online business. In our demo, it was an online bike shop. And what we were doing is we wanted to see the traffic patterns. We wanted to then mash that up with data we had in our enterprise databases about some of our sales records and the promotions we were running on those days. And then finally, once we'd done that, enrich the data a little bit more with demographic information, so we could understand how our traffic was going versus the kind of marketing impact we wanted to have versus the demographic of the user that was seeing our site. And the idea was to show how easily you can do that, you know, just getting up to speed with a cluster in the cloud and just using Excel. So do you want me to just take you through it? Just take us through it, all right, great. So what I've got up on my screen is HadoopOnAzure.com, which is a preview service that we announced yesterday. It lets you get up and running with a Hadoop service in the cloud in just a couple of clicks. But then the other thing that we talked about was, you know, sometimes after you set up a cluster, or sometimes your team's running their production cluster or whatnot, it's nice as a developer to be able to just get up and running really quickly, have a really nice debugging experience, have a really nice iterative experience as you're refining your query and getting going. And so what we did is we shipped the exact same thing that we run in the cloud as a one-box setup for a developer.
So literally in two clicks, a developer can be up and running with a pseudo-distributed Hadoop cluster on their laptop, like I've got here. You can see the address here of my dashboard page is localhost. So I'm running the same experience we run in the cloud on my laptop. So as a developer, I can test everything out very quickly, get a nice debugging experience, et cetera. And then when I'm ready, push it up to my company's production cluster or whatnot. So what we did yesterday is we kind of went through that loop, and then what I used is this thing called the interactive console. It's something that we have that lets you kind of easily interact with a Hadoop cluster across the web. Think of this as like a web-based command line for what we're doing. And let me just jump over into our Hive console. This is just a really simple kind of web UI over top of Hive. And what I had done, and I'm not gonna show you all of the lead-up demo, we'll quickly get into the Excel stuff, but basically what we had done up to this point yesterday is we had a bunch of log files. We had them up in the cloud. We created some Hive tables over top of them. We processed them in the cloud, and that brought us down to kind of some aggregated data off of those weblog files. And so what I showed people was, I said, select star from weblog results, let's say limit 10, just so everybody can see what our results looked like. And let's evaluate that. Actually, let's do this first: select star from weblog month results. I have to remember what I actually did yesterday. And run that. And basically what we had done is we aggregated all this data into a set of data that represented the log hits from this month for particular countries we were interested in and particular product categories we were interested in. So you can see what we're getting back is country information, category: components, or bikes, or clothing, whatnot. So the category of product we were looking at.
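The rollup Mike is dictating here can be sketched in Python. The table name, field names, and sample rows below are illustrative stand-ins, not the demo's actual data, and the HiveQL in the comment is only an assumption about what such a query might look like.

```python
from collections import Counter

# Roughly what the Hive aggregation described above produces, sketched
# locally: raw weblog hits rolled up into counts per (country, category).
# In HiveQL this would look something like (an assumption, not the demo's
# actual query):
#   SELECT country, category, COUNT(*) FROM weblog GROUP BY country, category;
raw_hits = [
    {"country": "United States", "category": "Components", "ip": "203.0.113.7"},
    {"country": "United States", "category": "Bikes", "ip": "203.0.113.9"},
    {"country": "Canada", "category": "Components", "ip": "198.51.100.4"},
    {"country": "United States", "category": "Components", "ip": "203.0.113.12"},
]

month_results = Counter((h["country"], h["category"]) for h in raw_hits)
# month_results[("United States", "Components")] == 2
```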
And then an IP address showing where the hit came from on our website. And so we had used Hive in the cloud to kind of aggregate all this. And then I said, well, as a developer, I want to iterate on this data really quickly. I want to understand what I want to do kind of on a regular basis for my team. So I'm running all this locally. And then my next step was to say, okay, let's take this, let's mash it up in Excel, and let's figure out what insight we might be able to get out of this. So what I did yesterday on stage is we opened up Excel 2013. And what you're looking at right now is a set of records that I've pulled in from a database running locally on my machine. I figured everybody would trust me that Microsoft knows how to get data from SQL Server into Excel, so I didn't show that step yesterday. I just said, look, here's a bunch of data. It's normalized data, so it's got that shape to it, right? And what this data is, is per day, what type of discount we were running on clothing versus bikes versus whatnot. So we were running a 25% discount on 10-17 on clothing on our website. And then also, per country, we have what kind of channels we were using to advertise those discounts. But then kind of the net new thing was, okay, so I got my data from my enterprise database. How do I combine that with all this kind of data I'm getting out of Hadoop, all this unstructured data? How do I bring this together? So what we did on stage yesterday was we went to Data Explorer, and we said, look, you can go from other sources, and now you've got an option directly from Hadoop Distributed File System. And so if you say from Hadoop Distributed File System, this brings up this kind of explorer window. And again, because I've got all this running as a developer on my local box, I can iterate really quickly without relying on anything else to be set up around me.
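The connector being described reads from HDFS over its REST interface. A minimal sketch of the kind of WebHDFS request involved; the host, port, and file path here are assumptions for a local one-box setup, not values from the demo:

```python
from urllib.parse import urlencode

def webhdfs_url(host, port, path, op):
    """Build a WebHDFS v1 URL for the given operation.

    op=LISTSTATUS enumerates a directory; op=OPEN reads a file.
    """
    return f"http://{host}:{port}/webhdfs/v1{path}?{urlencode({'op': op})}"

# Hypothetical path for the aggregated results sitting in local HDFS.
list_url = webhdfs_url("localhost", 50070, "/demo/monthresults", "LISTSTATUS")
# http://localhost:50070/webhdfs/v1/demo/monthresults?op=LISTSTATUS
```

Because WebHDFS is a standard Hadoop interface, any client that can issue HTTP requests can read the same files this way, which is the portability point Mike makes later in the interview.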
And I asked the system to go and interrogate my local HDFS using WebHDFS, which is just one of the standard REST interfaces of Hadoop. So it works against anybody who's running WebHDFS. I can see I've got a result set called month results here. And let me just hit the binary tabs so we can look at this a little bit better. And what it did is it said, look, we just saw that you've got some text files under the covers. It looks kind of like this. It took a sampling of it for me. And I can say, okay, that's what I want to work with, but I don't really want to work with it like that. That's hard. So parse this into a table. The delimiter I had was a comma. And you know how semi-structured data is, right? It's not always a nice square. And so you get to say what you want to do if you don't see a nice square and you're trying to treat it that way. So I said, well, you know, I know that the first few columns are what I want, so just truncate the rest if it's kind of jagged. I don't need that. So I say apply. That goes, okay, now you've got it roughly as a table. So let's make this bigger so we can see what's going on. We don't want this column, so just drop that. We know that this data represents our country. It's coming out of a bunch of text files. So what we did is we renamed this to country. We know our column here looks like our category type, so I renamed this just to item. We just have two more of these to do to kind of clean it up. Let's say this is our date. And finally, this is our IP address. Remember, all of this is coming from what we processed with our Hadoop cluster earlier. So this looks kind of roughly like what I'm after. Let's rename this whole query to be our traffic query. And then what we said is, look, this represents our data in Hadoop, but now let's join that with the data we got from the database so that we can do some further ad hoc analysis on it. So let's get related data.
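The parse-into-a-table step described above, sketched in Python: comma-delimited, possibly jagged lines are truncated to the first four columns and given the names chosen in the demo. The sample rows are made up for illustration.

```python
import csv
import io

# Two hypothetical weblog lines: the first has trailing junk fields,
# the second is exactly four fields wide.
raw = io.StringIO(
    "United States,Components,10-05,203.0.113.7,extra,junk\n"
    "Canada,Bikes,10-05,198.51.100.4\n"
)

# The column names assigned interactively in the demo.
columns = ["country", "item", "date", "ip"]

# row[:4] drops any jagged tail beyond the fourth field, mirroring the
# "just truncate the rest" choice made on stage.
table = [dict(zip(columns, row[:4])) for row in csv.reader(raw)]
```

After this, every row is a uniform four-column record regardless of how ragged the source lines were.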
And this is the other data I've already got sitting in my sheet that came in from my SQL database earlier. And so I can say, well, this date matches this date, and this item up here matches this category down here. And it tells me 113 out of 200 rows are gonna be combined together. We're gonna miss out on about seven rows. So let's join that together. Now let's look at that table we joined to. And let's say we want the percent discount to be brought into this, and maybe we want the channels information to be brought in, which is what channels we were using to run the discount. So great, I brought this in. Now I've combined SQL data with Hadoop data. And you can see it joined pretty well, but now I've got these nulls here. Like, nulls are a database thing. I don't know how the heck I'm gonna visualize that. So let's clean this up a little bit and say, well, if the value was null coming out, that means I wasn't using any channel. So look for the nulls and just zero them out. So now I got rid of the nulls here. I don't want those. If I have a null in my percent discount column, then that just means I wasn't running a discount, so just do zero. That means there was no discount running; I was running that product at 100% of the price that day. So let's apply that. Great, so now I've kind of cleaned up that data. I've joined them together. It looks roughly like what I want. And let's go ahead and say done. Now, a couple things to notice here. Let me just add a new sheet so I can kind of zoom in on what we're doing. What we did is we just formulated a query. And that query is something that's gonna pull from the data that was in Hadoop, and we're gonna join it with the data we've got in Excel as it comes in. And I'm looking at a sample so that I could do all this quite interactively, right? I've got about 30,000 to 50,000 records sitting in Hadoop that I'm gonna pull in.
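The join-and-clean step just described can be sketched as follows: traffic rows from Hadoop are matched to discount rows from the SQL database on (date, item), and rows with no match get a 0% discount and no channel, mirroring the null replacement done in the demo. All values are illustrative.

```python
# Hypothetical traffic rows that came out of Hadoop.
traffic = [
    {"date": "10-05", "item": "Components", "country": "United States"},
    {"date": "10-06", "item": "Clothing", "country": "Canada"},
]

# Hypothetical discount records from the SQL database, keyed by (date, item).
discounts = {
    ("10-05", "Components"): {"pct_discount": 45, "channels": "Social"},
}

joined = []
for row in traffic:
    # No match plays the role of a null; replace it with zeros, meaning
    # "no discount, no channel": the product ran at full price that day.
    extra = discounts.get((row["date"], row["item"]),
                          {"pct_discount": 0, "channels": 0})
    joined.append({**row, **extra})
```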
And so we do it over a sample so you can see and kind of visualize that this is what you're gonna get, but you get to do it interactively. And when you're ready, you say add to sheet. Add to sheet is kind of that moniker that says, okay, looks good, go ahead and do the big pull from the Hadoop cluster. Go grab all those records, pull them in. So now I just sucked all those records in. They've been joined together. So now I've got it all together in a sheet, right? The thing I think is really neat about this: at this point, if you know Excel, you know what to do, right? Like, all those things you used to do with Excel data now just work here. And so we always joke that this is kind of the most used information worker platform, so we were really trying to get it into the tools that people are already comfortable with. Exactly, yeah. So then what we did is we said, okay, this is great, you know, I can look at a table of data, but how do I get richer visualization on top of this thing? So with that, in Excel 2013 you have all the standard Excel visualizations that you're used to, but there's also now something called PowerView. PowerView lets you do kind of ad hoc visualization, interactive visualization. So I'll take you kind of through the arc we did. It takes about a minute. What we did is we said, okay, we've got data by country, so that's often best visualized on a map. So let's click country. That brings up our country information here. Let's just make this a little bit bigger so we have some real estate to work with in a second. And great, I've got countries listed that came out of my data, but let's turn that into a map. Great, so now I've got little bubbles on a map. Okay, we're getting there. But what I wanna see is actually the hits by region. So let's go IP address and let's make that the sum of the hits. So now you can see the bubble size change: the biggest bubble is in the US, a smaller one in Canada, et cetera.
So I'm starting to see where the traffic was going per country. That's interesting, but let's make this richer: let's not just do a bubble, but let's actually superimpose a pie chart on top of the region. So now I can see, per region, the size of the bubble represents the number of hits, and the pie chart represents how the hits were dispersed across the various product categories that I have. So now I can kind of at a glance see how I'm doing day to day. If I bring date into this equation, actually let's leave this here for a second, go up, and I bring date into the equation and I say tile all this by date. Now I've got kind of a little interactive visualization where, as I click the dates, my pie chart is gonna change based on the traffic for that day. So let's get a little bit more real estate here and just enhance this one more time. So let's go like that. Last thing is, so we've got our hits, we've got our date. Now what we wanna understand is how well our discounts were running against our hits, and to see, when we were running a discount, did that affect my traffic on that day? So what we did for that is we said, let's bring item on again, and with that, bring the discount we were running. Let's maybe look at that as a card, not a table. Go here. So we've got this. I'll do an abbreviated version of yesterday; we did one more thing, but we'll leave this for now. And as you can see, on 10-1 we were running a 5% discount on components, no discount on anything else. That's what our traffic looked like over here in Canada and in the US. Now, you can see the green section here representing components is maybe a quarter of my traffic. If I go over to, like, October 5th, 10-5, all of a sudden the green section is representing over 60% of my traffic. Well, I was running a 45% discount on components that day. So hopefully there's a correlation between the sale I was running online and the traffic jump in components that I'm seeing here.
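The per-day share behind that pie chart is just each category's fraction of the day's hits. A rough version in Python; the hit counts are made up to mirror the roughly-a-quarter versus over-60% pattern described above, not the demo's real numbers.

```python
# Hypothetical daily hit counts per product category.
daily_hits = {
    "10-01": {"Components": 25, "Bikes": 45, "Clothing": 30},
    "10-05": {"Components": 65, "Bikes": 20, "Clothing": 15},
}

def category_share(day, category):
    """Fraction of a day's total hits that went to one category."""
    hits = daily_hits[day]
    return hits[category] / sum(hits.values())

# category_share("10-01", "Components") == 0.25
# category_share("10-05", "Components") == 0.65
```

Comparing those shares against the discount running each day is the correlation the demo is eyeballing on the map.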
And so our next step was to say, okay, our running hypothesis is that this sale worked, right? Where was I running that? Well, if I add this into the mix, now I can see, okay, where was I running that sale? Well, I was running it in our social channels that we use and at a local event. Okay, great, so now I've zoomed in a little more in terms of how things are working for me. So my hypothesis is that that particular campaign worked. Let's get a better understanding of the people that actually drove this traffic due to that sale. Because I think I'm hitting maybe the 20-to-30-year-old crowd with my website, that's who I'm going after, but let's validate that. And so what we did is we went back to our source data and we said, okay, our running hypothesis was that on components in the United States, just one more filter to go, on October 5th, so let's zoom in to October 5th. We were running that discount, and this is just a sampling of the IP addresses, so let's just grab a few of the people that hit me that day, right? I'm assuming these people are representative of the people that were really driving my traffic. We could do them all, but let's just do a sample. Let's copy it all over here so we can zoom in on exactly what's going on. So now, some of the people who hit me that day drove my traffic, so let's highlight this section here. And what I wanna do is enrich this data to get demographic information about these people. And what we do is we have another service that we run in the cloud that we call Data Market. Data Market is just that: it's a marketplace for data. Just like they have a marketplace for apps to go buy apps, some free, some for pay, there's a data marketplace: some data's free, some's for pay, et cetera. Inside of the marketplace, there's a data set that, given an IP address, gives you back more information than you could imagine, right?
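The enrichment step can be stubbed out like this. In the demo an Office app calls the Data Market service in the cloud; the lookup table and field names below are hypothetical stand-ins so the shape of the step is clear, not the real service's API or data.

```python
# Hypothetical demographics keyed by IP address, standing in for the
# cloud Data Market lookup the demo's Office app performs.
DEMOGRAPHICS = {
    "203.0.113.7": {"median_age": 50, "avg_household_income": 88000},
    "203.0.113.12": {"median_age": 52, "avg_household_income": 91000},
}

def enrich(rows):
    """Attach demographic fields to each row by IP, where known.

    Rows with no match pass through unchanged.
    """
    return [{**r, **DEMOGRAPHICS.get(r["ip"], {})} for r in rows]

sample = enrich([{"ip": "203.0.113.7"}])
# sample[0]["median_age"] == 50
```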
And so what I did is, in Excel 2013, there's an app marketplace for information worker apps. They're called Apps for Office. And what I did is I went and wrote an app for Office that talks to that Data Market in the cloud and says, given this IP address that I've just spelunked to get, enrich it with the data you have in the Data Market and visualize it here so I can get a better understanding. So you're matching up that data and rendering it. Exactly. So at the end of the day, we've got some enterprise data, some Hadoop data, some data from a third party. Awesome. So we went insert, Apps for Office, and I built a little weblog companion app that talks to that. I got three options to add: average household income, number of households, et cetera. I hit go, it hits that service, and it enriches back. Let me just expand these so you can see our titles. So that's public data. This is public data. Anybody can go use it. So it's private and public data, matching up together in real time on tools that everyone knows, exactly what people want. Now we're seeing, look, the median age of the people that were driving this traffic: I have a 50 and a 52. So it's somebody with some cash in their pocket to spend on my high-end bike components, right? And so you can see what they're... The baby boomers, buying it for their kids or themselves. Exactly. That's the next question. And so... Those are big questions. That's Cloudera's motto. Exactly. So now you've got this insight that says, wow, it's actually an age range that's a little bit different than we thought. Yeah, it's cool. How can we take this further, right? So that was the demo we did yesterday on stage. Mike, that's fantastic. Thanks for coming on theCUBE. I want to say I'm glad we could squeeze you in, because I think this is exactly what people want. They want the ability to do things simply. My quick question is, is it only limited to Azure? Can you use other Hadoop distributions? What's the lock-in?
Is there a lock-in, or is it only Microsoft? Good question. So basically, our whole take here is that we're pushing against things that are 100% Apache Hadoop. So all of the Hadoop stuff that we run is 100% Apache Hadoop. All the tools we build are interacting with clusters using all of the standard REST interfaces in Hadoop. So what you saw was I was pulling data with WebHDFS and issuing queries with Hive. So for anybody who is supporting those interfaces that are directly in Hadoop, these types of tools work on top of it. Okay, so I'll give you a quick question and then drill down on that, because I have a hypothetical use case. So Dave and I play with HBase a lot. We're like HBase geeks. So I've got an HBase database. I'm sucking in all this information from a public data source, and it's sitting there. It's just hard to get out, right? So what do I do? How would I use this tool as that kind of developer? Right, so today, the first integration point we built was through WebHDFS. Tomorrow, our goal is to go and do things over top of HBase as well with these tools. So today, the direct HBase query won't work; we don't have that plugged into Excel. But that's the evolution of where we're going: to support all of those projects in the Hadoop ecosystem, whether it's HBase, WebHDFS, HCatalog for metadata, et cetera, over time, so we can get an even better experience. And just to be clear, you guys have a partnership with Hortonworks, which is, again, 100% Apache Hadoop, and their distro is what you're using. Correct. Okay, cool. Well, great demo, fantastic. That's the kind of future we see. We see mere mortals using big data, and that's the end game. That's where the market's going. It's a little bit complicated right now, but it's rapidly accelerating in ease of use and simplicity, so congratulations on a great demo. Thank you. We'll be right back with our next guest right after this short break. All right, awesome.
theCUBE is this conceptual box, if you will, and we bring people inside of theCUBE and then we share ideas. theCUBE is a comfortable place. It's a place where people feel happy and are happy to share their knowledge with the world, and we're happy to be ambassadors of that knowledge transfer.