Union Square in the heart of San Francisco. It's theCUBE, covering Spark Summit 2016, brought to you by Databricks and IBM. Now, here are your hosts, John Walls and George Gilbert. Well, welcome back here on theCUBE. We continue our coverage of Spark Summit 2016 here in San Francisco. I'm John Walls, along with George Gilbert, and we're now joined by Bill Jacobs, who is the Director of Advanced Analytics Product Marketing for Microsoft. Did I get that right, Bill? Pretty close. Yeah, good mouthful. So Microsoft, very significant presence here. You're on the show floor, on the general session stage today, and you've made some pretty significant announcements here, very much of late, with regard to not only what you're doing with Hadoop, obviously, that's kind of old hat for you, but with Spark and how you're bringing that into your portfolio of services. So tell us a little bit more about that. Yeah, the backdrop is interesting. I'm actually part of the acquisition of Revolution Analytics, and I came to Microsoft with an old view of the company, with Microsoft as the Windows company, and had been involved with Microsoft on and off through the years. When we came inside of Microsoft a year ago, I found a company that wasn't what I was expecting at all, and the best evidence of my surprise is what we're exhibiting today, which is that one of the predominant offerings in the Azure cloud is a Hadoop-based infrastructure available for the price of a credit card and five minutes of your time to provision a cluster. What we're announcing is the ability to provision that cluster with Apache Spark. Spark 1.6 and a sub rev of that is now generally available in the Azure cloud, and this makes it, again, a five-minute exercise to select what kinds of machines and how many of them, punch go, get a cup of coffee, and come back, and you have a fully running Spark cluster.
That cluster also includes Jupyter notebook integration already in the product to make it very easy for data scientists to interact with that cluster. We are talking about a little bit of work we've done with Cloudera to provide some RESTful interfaces to make it easy to manage Spark. That's the first thing, and the second thing is there's an edition of that product, a premium edition, that includes the R Server technology, which Microsoft acquired, that's when I came aboard, with the acquisition of Revolution Analytics. That makes it possible for R data scientists, who are not typically involved in the deployment of large clusters, to take advantage of Spark as a very, very high performance back end for doing large data analytics. Our measures, comparing the Hadoop MapReduce engine to the very same product deploying into Spark, show about a 7x performance boost, which is very surprising. These were very well engineered, massively parallel algorithms that are highly tuned, and I didn't expect that we'd get more than a 4x or 5x performance boost. We're seeing seven. When we compare five nodes of Spark to a node of, say, Linux running the open source R, the performance boost is greater than 100x. So there's a huge amount of performance just available for the taking for data scientists who otherwise would have to become quite schooled in Spark and Hadoop in order to stand up those clusters. The last thing is simply that, and this is the part that still amazes me, is Microsoft's thorough commitment to open source. A very large measure of what we do in Azure is based on Linux, Hadoop, and now Spark and R, and other languages are coming. So a lot going on, quite an interesting time. So have you seen, I mean, now you're talking about a very rich big data infrastructure where the heavy lifting on Microsoft's part is automating the operation of that and making it self-service. Yes, exactly. So what sorts of applications are you seeing coming on board in these early stages?
Because it used to be that Windows Azure was the cloud, and they took out Windows because they didn't want to confuse people into thinking they're only supporting the Microsoft stack. So what's coming on board? We're seeing a lot of things. In my particular purview as a product marketing guy on the advanced analytics side, we're seeing massive activity in financial services and insurance particularly. When you look at the insurance companies, and I won't say their names, they offer you a dongle to plug into the OBD-II port on your car and measure how your teenager is driving. What are they doing with that data? They're landing that in Azure. Oh yeah, we were just going to say the same. Yeah, no comment. Those types of applications, the general term in insurance is usage-based insurance. But what it really is, is much more finely tailoring the risk measure to the insured. So that's one example. In financial services, we're seeing a massive uptake of large clusters being used for modeling fraud by credit card processors. We are seeing a big upswing in predictive maintenance. We have a couple of major equipment manufacturers in the building business. I won't say their names because I'm not sure of the status of the referenceability of those. But if you think about someone who operates large equipment, if they can cut their maintenance cycle and maintenance costs by only going out when the machine needs maintenance, and being there an hour before it needs maintenance, they not only get a lower cost of maintenance, but a higher uptime for their user. And this pervades a whole bunch of industries. So that blends predictive maintenance and Internet of Things technologies together. We're seeing a lot of that. We're actually, I heard we talked about airplane travel. Yes. You know, maintenance, automotive maintenance, agricultural equipment maintenance, rail transport. Gas turbines. All those things. So all these areas. I mean, is that a huge growth area?
Is that one of the key areas though? I mean, are there? You know, I think it's a sharply spiking area, and here's why. I did a little study on this. I have a history in manufacturing automation with Hewlett Packard years ago. And we started looking at the IoT business. Of course, Microsoft's involved in that business and we have other customers that are using other products. But when you split the IoT business into two areas, you come up with kind of an industrial IoT and a more consumer-oriented IoT: my thermostat and my refrigerator, or my car. When you look into the industrial IoT side, you find manufacturing automation types. This is not new for them. Yeah, it's a new data source. Maybe it's more data. But I look back. How far back do you have to go to find the first instances of statistical process control? Well, it turns out it was Toyota. 1935. Oh, were they... In their first... Quality control. In their first automotive engine casting plant is how I read the story. That's how far back it goes. So that is a spiking industry because the knowledge of how to use statistics to improve manufacturing processes is not new. The chip guys, you know, what's coming down is a million dollars a day worth of silicon. If they can spot a defect earlier in the day, they cut the loss for the day due to a defect to a half a million or a quarter of a million. So these guys in industrial IoT, they know the value of statistical prediction on their processes. And as they extend that out into the product after it's delivered, if they can spot failures starting to occur in a line of cars, and there are some great stories there, they can improve the customer experience, cut the cost of the car, cut the cost of the service, all for the application of some data science talent and big data to the problem. So that's spiking very quickly.
The longer term stuff is where we're perhaps going into combining a lot of data from a lot of sources that surround us as individuals and bringing that together to improve the customer experience at a retail store. Those kinds of things are slower growing but probably bigger in the long run. But the industrial IoT business and the predictive analytics that has to accompany that, that's taking off very quickly. So we've come at it from the capability point of view and the benefit point of view. Microsoft's got this amazing asset called an enterprise sales force, so that they can go walk the halls of the customers with an ear to the ground and identify the opportunities. Amazon, they have a few years' head start in these services. But how do they go to an insurance company or a semiconductor company and say, we know about this type of problem, we can help you take your on-prem software and build a hybrid solution that over time moves more and more to the cloud? Well, first, Microsoft is far better positioned than many other vendors because we have a strong footprint in on-prem systems. The R Server product runs both on-premises, on Hadoop and Linux and Windows and Teradata, and in the cloud, and that provides a bridging technology. But you ask about the sales force, and I spend a lot of time with our sales force and I spend a lot of time out meeting their senior executive sponsors and their customers. The problem that all of us are dealing with in this industry, particularly around data science, is identifying all of the constituencies. Tell me more, that's interesting. There are between two and four constituencies involved in every use of big data analytics. Somewhere there's a data scientist, who might be called an actuary in the insurance business, or might be called a quant if he's measuring risk for a middle office guy in Goldman Sachs or something.
The second audience is an app developer, typically, and a newer audience to this space, because typically data analytics was a product unto itself: you produced a chart, you produced a prediction. Now that's being used to automate an application, present the right ad to the client as they walk in the store, present the right rating schedule when you rate their insurance. That's becoming an end-to-end process. The third audience in the equation is the IT guys. A lot of data science has been done in the past on desktops. This is a pretty good data science machine you got here, until you tackle big data. When that big data problem hit organizations, all of a sudden CIOs said, I don't want any more shadow IT. I don't want any more appliances in closets in the marketing department. And they have taken control of Hadoop, the use of the cloud, and big data analytics. And so that's created this three-way team. And oftentimes when I go into a major account, I'll ask the IT guys, do you know who your data scientists are? And I'll ask the data scientists, do you know who's standing up your Hadoop cluster? And oftentimes the answer is, well, we're not too sure. And then the last audience is the guy with the checkbook, the vice president of marketing, VP of manufacturing, who is actually funding these big data initiatives because he knows his competitors are doing it. He's the guy with the sense of urgency who has all this stuff anew. And so the hardest challenge is not the technology. It's finding those constituencies and getting them engaged, to present many perspectives of a solution so that they can agree that it is the right solution going forward. We've spent a lot of time talking about technology here the past couple of days. You haven't thought about the internal politics, if you will. It's a human problem. Yeah, you've never really considered that. Almost take that as a given. But then everybody's pulling on the same oar.
And they are, but they just don't know maybe what boat they're supposed to be in. I don't know what the right analogy is, but there's a little disconnect internally. And you'll even get a disconnect among data scientists. There are data scientists who have essentially a life sciences heritage: mathematics, statistics, biology, population science. They will be heavy R users because they've been teaching R in universities for a long time in that world. What you see on the floor at Spark Summit here is a little bit more of a computer science bent. And that brings a different skill set. That brings guys who speak Scala and Java and hardcore programming languages. Even among the languages in use, you find a cleaving between those guys who have more of a science and liberal arts treatment of problems and are very familiar with the broad human problems of using data science, and the guys that are worried about how do I present the best ad? I mean, what's the largest app on the internet? We all think of some things that are rather unpleasant to discuss, but probably one of the biggest is actually matching up ad placements with ad presentment for a quarter of a penny. And it pops up in the upper right of your search window. That's one of the largest apps on the internet. That's a scale that requires a very heavy duty computer science treatment. There are other problems where it's a life sciences problem. And so you see this schism even between the data scientists as to which is the right approach: R versus Scala, R versus Java or Python. And in Microsoft, we intend to essentially embrace them all because we know those constituencies will all be present somewhere within the user base of Azure. You said it's an open source world at Microsoft now. Surprisingly so, from a former outsider now insider, but we appreciate the insight. Really. It's a fun space. Yeah, good stuff. Yeah, we appreciate it. Bill, thanks for being with us here on theCUBE. Enjoy it.
theCUBE continues here from the Spark Summit 2016 right after this.