Live from New York City, it's theCUBE at Big Data NYC 2014. Brought to you by headline sponsor WANdisco, with support from EMC, MarkLogic, and Teradata. Now, here is your host, Dave Vellante. Welcome back to Big Data NYC, everybody. I'm here with Jeff Frick. We've been going two days, Thursday and Friday, here at Big Data NYC in conjunction with Strata + Hadoop World. Joseph Sirosh is here. He's with Microsoft and the Machine Learning Group, which is inside of the cloud group, I understand. Yes, yes. So thanks very much for coming on theCUBE. Thank you very much. It's a pleasure to be here. So you've been at the event. Hopefully traffic wasn't too bad getting over here, but what's it like over there? What's the traffic at the Javits Center? It's amazing. I believe the conference has grown by a factor of two, you know, just from February of this year. It's amazing to see the traction big data and analytics are getting. Now, have you done other Hadoop Worlds before? Not to give a keynote. Today I did have a keynote, but, you know, Microsoft has always had a presence at these conferences. Oh, sure, right. But you personally, this is your first one? No, this is my first one. So we're celebrating five years of Hadoop World. Mike Olson was here earlier on theCUBE and we were reminiscing; it was probably 800 or 900 people. I think the very first one was 400. That's right. And then the second one we had theCUBE at, so it's really exploded. And a lot more enterprise people are coming over. So what was your keynote this morning? Yeah, the keynote was about a new data science economy.
And the point I was trying to make, you know, while software developers have had the app economy, where software developers can go publish apps, self-service, and you can have huge wins like Instagram and WhatsApp, data scientists, you know, in spite of being some of the smartest people and what I'd like to think of as some of the smartest people around, me being a data scientist, by the way, they haven't had an outlet for their creativity. Now, data scientists don't build software apps. They build machine learning models. They build analytical systems. They build visualizations. They don't have an outlet to monetize these. And what I was talking about is about creating one such like that, right? So what's the equivalent of an app store for a data scientist? Well, I believe personally that what you need to do is package up machine learning models and visualizations in cloud-hosted APIs. And then you have an API that you can use on the cloud as a service, it's a web service. Well, then you have something that people can pay for and buy. And so what I was talking about was I took this example of a website that needs to integrate recommendations in. You know, building these recommendations is a hard analytic job. It takes a data scientist to do that. But then there is another part that's equally hard. It's to build an API that hooks into that web page and then can handle the volume of traffic and reliably serve recommendations with every click. And that's a very hard part of the job. And companies that do it today, like my former company Amazon, they have a lot of effort into building some of those. Well, with the cloud, we can simplify these things dramatically. 
So I took that example and I showed how on the cloud, with the machine learning product we have, you can build a recommendations API, publish that into a marketplace where you can actually start charging for it by usage, per click, and then, you know, practically any website could very easily, with a little bit of JavaScript, integrate that API and get recommendations live in minutes. And so now you have the seed of something that can be traded in a marketplace that has intelligent services on the cloud. And so we have this great product that we launched in the summer, called Microsoft Azure Machine Learning, that allows you to publish APIs that can also be published into a marketplace, and people can start charging for them. And that's hugely empowering to data scientists. Well, the recommendation piece is quite interesting, right? Because, you know, reputation, recommendation, you know, our social graph is now cutting across all kinds of activities that we do. I have a question from a data scientist perspective. It seems like historically, industries build expertise within their domain. Retail people have retail expertise, financial services, insurance, manufacturing. We sort of stay there for 30, 40, 50 years, retire, and that's all good. Data scientists seem to not want to get put into that vertical bucket. Is that a perception that's correct? Well, there are lots of data scientists in the financial industry and so on that have actually grown up there and have accumulated a great amount of domain knowledge. Now, of course, there are lots of people who cross industries. By the way, the financial industry in many ways was the birthplace of a lot of data science, really: credit scores, fraud detection. My first job 20 years back was credit card fraud detection using neural networks. So that was a very interesting thing to do. So- But expensive. Yes, of course, yeah, yeah, yeah, absolutely.
But I think data science hits its magic when you combine the magic of software and machine learning and all of these algorithms with domain expertise. But we're actually taking a track to leverage those kinds of domain expertise as well, to create a long-tail selection of a lot of different kinds of machine learning APIs in the marketplace. See, the thing about these marketplaces, what's great is instead of just having one product that tries to fit all, you get a lot of specific things. You might get a forecasting API for forecasting Wall Street time series and another one for forecasting, hey, my sales, or forecasting something else. There are lots of variations, and when they're customized to individual domains, they become very, very useful. Yeah, so that's, I guess, my point. I didn't frame it correctly. It seems like data scientists want data inputs from outside just that one slice. That's why I brought up the example of the social graph. There are other data inputs from, maybe it's other industries, other sources, as opposed to historically having data just from that one slice. That I completely agree with. And so that is a new trend to see. It's an art. Yes, it's an art. And I wonder if you could talk about that a little bit. It's an observation that we've made, I think. Let me just take an example of forecasting demand on an online store. You have products, right? Well, I mean, previously people would just take, hey, how much that product sold in the past and then try to forecast out that demand. But today you have so many signals. You have tweets, for example. How much are people tweeting about those products? How much are purchases of those products being shared on the social graph? You have search query signals. How much are people searching for these things? So you have so many other signals that you can now incorporate. And a lot of this data comes from different sources in the cloud.
And if you're a data scientist running, say, machine learning in the cloud, like with Azure Machine Learning, you can pull in all of those kinds of data sources, integrate them in the cloud, and build these really accurate, fantastic forecasting models. And that adds a lot of power. So it's interesting, the machine learning piece is within Azure. Yeah. Talk about that a little bit. Microsoft culturally, with Satya Nadella, from the top down is really moving to the cloud and embracing the cloud. Your background is Amazon, which is quite interesting. Talk about that culturally a little bit. Well, even the culture from coming to Microsoft. You know, by the way, Satya recruited me to Microsoft, and my passion has been machine learning. So I came over there to build this. In many ways, I'm sort of betting my career on building out this machine learning marketplace and making it a big success. It's actually a new Microsoft, really. I mean, Satya and others have really changed how Microsoft sees itself. There's a change of mindset; we are moving much faster. We structured ourselves to ship in much shorter cycles. So it's changed from, for example, shipping software in the traditional way, shipping it on a disk, packaged software through these channels, to now shipping things in the cloud. And these are live services that require 24 by 7 monitoring. And the great thing about cloud services is you can take customer feedback live, as opposed to these multi-year cycles for enterprise products. You take customer feedback live and the product improves constantly, right? So building these products that constantly improve and become better for the customer, that requires a sort of new DNA and a customer focus and a way of thinking. And that's really happening under Satya's leadership. And that's really what excites me all the time. I mean, as an outside observer, I've seen a dramatic change. Yes, absolutely.
It was almost like there was some resistance and then all of a sudden that resistance broke through like a dam breaking. And it's interesting to note, I think today's companies are much more open to change, or at least recognize the need. Do you agree with that? Well, perhaps. I mean, certainly engineering companies like Microsoft, if you look at the engineers, they know technology around them is changing at such a rapid pace. At their heart, they really want to move with the technology. I mean, no technologist wants to be outdated. I mean, we're geeks, we play with stuff. We love the latest stuff that is coming out. We really want to move fast, right? And so in some ways, perhaps the change was not necessarily unexpected. It was just leadership setting the orientation, right? And then the engineers gravitating to what naturally comes to them in many ways. And perhaps that's why Microsoft seems to be changing quite fast, as opposed to perhaps people with not as much of an engineering culture. But I think Microsoft is adapting and moving very fast. So it's top-down and bottom-up. Yeah, absolutely. At least certainly the engineering folks have always felt this angst; they really want to win. They really think, you know, they should be at the cutting edge of technology. They are used to creating the path, and now they feel like others are creating the path, and they really are hungry to win. And so they're eager to learn. They're eager to have the customer focus. They really want to adapt, I think, and with the new leadership, it just became really easy. So sort of the dam burst. So Joseph, talk about this kind of two sides of the coin, where you're a data scientist, you've got a lot of data at your disposal. You're building algorithms for people to be able to rent and use and leverage. So you're smart. At the other end, there are never enough data scientists. Everyone complains there aren't enough data scientists.
I can't get them to come work for my company. And at the same time too, like Tableau's message, another great Seattle company, they want everyone to really use analytics in their decision-making process, and they always kind of go back to, you know, how Excel-like can we make it? Because that's everybody's big data tool du jour. How do those things kind of come together, where you leverage the power and the expertise of really smart guys like yourself and the algorithms and things that you can build, but at the same time enable everyday line people to be more data-centric in their decision-making and to apply some of this to make their lives better and their businesses better? Yeah, no, great question. You know, I think, Jeff, the answer is you need finished products that everybody can use. And in one of my sessions earlier today, I made this example. I mean, I asked, how many of you actually get your clothes tailored? Okay, I mean, in the olden days, people would buy cloth and go to tailors. And it's sort of like the tailors are the data scientists of today, maybe. I mean, you go get your custom-fit model that works for just you and your domain. And nobody does that anymore. Guess what? I mean, manufacturing clothes became so automated and so easy, and there's such an incredible selection of clothes available in all department stores, that you can actually go to department stores, select the sort of things that you like, and by and large it really works well on you, right? And so you need to create something like that. The finished shirts that I can buy, what's the equivalent of that for data science? So if I am a company that wants a recommendation API, you know, I should be able to go to a store, and there should be a thousand recommendation APIs from which I can select the one that suits my need the best. So I shouldn't have to really hire data scientists or have a tailor equivalent, right? I shouldn't have to have my own tailors or my own data scientists.
I should be able to go to a store and actually get what I want, right? And so now the question is, how do you create a million APIs that are finished products in the cloud, right? So I have enough of a selection, enough of a diversity that vast majority of people can get what they want. I think that's the way really we need to go and instead of making tailoring more, well, you can go two ways, right? You can make tailoring simpler and you can empower them to, you know, that's an Excel like thing. That's one way to go. Or you can say, hey, I don't really need tailors that many anymore. Tailors are in short supply anyway. Why don't we just get mass manufacturing of clothes? So I can just get the selection I want, right? And what's the monetization model? You've mentioned that before. It's by usage. So if it's a recommendations API, the recommendations API is used per click and for a certain number of clicks you charge a certain dollar amount, right? So just to pay by usage. Right, and I can cap my usage. Yeah, utility model. I'm in control. Exactly, yeah. And then I have to pay only what I use. I don't have to, I mean, otherwise you have this lots of fixed investment. In this particular case, you're just subscribing to an API and you pay for what you use. Recommendation. I mean, that's a very revolutionary model. Well, it's such a great example, right? I mean, you go to Yelp, you go to TripAdvisor, and Amazon, it's still valuable use. You know, and it's just, you can even learn to decode the recommendations and you can build your own system as a user. Exactly, and yeah, by the way, I don't think there is one recommendations algorithm or one recommendations API. I actually think there should be a thousand, right? There will be a lot of diversity in the kind of recommendations people want. I mean, I say, you know, recommending wine is not the same as recommending a book, right? It's just different in character, right? 
And people who know wine should create a recommendations API for wine. And people who know books should create a recommendations API for books. And people who know clothes should create a recommendations API for clothes. But this gets to the first question that I asked and your response. So it's the domain expertise, but the fundamental code, in this case, can traverse domains and then be applied if you want to tailor the suit. Great, you know, but this base code can go across any industry, for example, or any use case. Yeah, exactly. That's very powerful, that's different. The data becoming a transport. That's right. The horizontal transport. Yeah, and so what we want to create right in the cloud is this factory for intelligent services. Think about it that way, right? It's like the clothes factory, right? And so I've created a platform, the Microsoft Azure Machine Learning platform, and the marketplace to publish into. So this should be the factory in which data scientists come and create a large number of intelligent services that then appear in this marketplace. And that's a store where you, as customers, ordinary people, would go, get those APIs, connect them into Excel, if that's where you want to use them, connect them to a website, if that's where you want to use them, and you just go. And that's how economies take shape. Well, right. And why would anybody develop that in-house? Why wouldn't they just move up the stack and focus on their business model? Right, right. I mean, exactly. So you should hire a tailor when you have a specific need. When you're going to have a wedding, then you spend the money to get your clothes tailored. The rest of the time, you consume what's out there, and it works by and large for you. And anyway, business is so dynamic these days that you're better off tapping into that as opposed to these slower cycles of building everything from the ground up, at least for the vast majority of companies. Right.
Joseph, another thing I want to get your kind of 101 on is, again, smart data scientists dealing with a lot of data. But then there are the human factors, and people talk about visualization, but I mean, how do you make a billion of anything really cognizant to me, so I'm able to make a logical decision based on such a big data set? And how are the human factors now being applied to machine learning and data science to let regular people actually be able to make cognizant decisions based on a billion data points? And you see some of these visualizations, and they look pretty, but it's like, well, how is that telling me where I'm supposed to go with my next decision? To tell you the truth, I mean, I don't think beyond a point these visualizations are all that useful. I think you do need good hypothesis testing, and you really need the ability to understand scientifically whether there is a nugget there or what you should be seeing, because data points are like stars in the sky. You can see anything you want. The more visualizations you have, the more you can see all sorts of constellations. And in fact, there was a wonderful speaker today from NPR at Hadoop World who was talking about how data can be used in so many misleading ways that you can substantiate your position, anyone's position, using data. And that's why Mark Twain said famously, there are lies, damned lies, and statistics. And how to lie with statistics, right? Yes, exactly. I always thought that was Disraeli, okay. No, I just repeated that as Mark Twain. Yeah, I didn't realize that. Disraeli ripped off Twain, okay. Like Kennedy ripped off Cicero. That's not that. And so that's the thing about pretty pictures, right? You can make pretty pictures to substantiate any viewpoint you like sometimes. Right, right. And the other thing that I think is interesting, we had a great conversation with a professor from Cal who's a mathematician.
And he talked about how mathematics and statistics are very different from computer science because everything is wrapped with an error bar, right? It's a confidence level. There is no yes, no, right, wrong, binary switch, one, zero. It's all probability with a confidence level, and the two of those worlds are now crashing together in the paradigm of data science and actually trying to drive decisions based on correlation, not necessarily causality. That's absolutely true. And there's one example, again, from my past when we used to do forecasting. Let's say I forecast that your book, Jeff, is going to sell 20 units next week. It's never going to be exactly 20 units, right? There's going to be an error bar around it. There's going to be a variance, a distribution. And in fact, if a store, Barnes & Noble, were to stock only 20 units, maybe 50% of the time they'll be out of stock, because there's an error bar around it. What you really need is the distribution of that error bar, so you can stock to the level where you maybe sell out only 10% of the time, right? So that probability is actually an integral part of this whole thing, and it's incredibly important to use that as well in making decisions. And we're just getting used to making decisions based on probabilities. Most people find it uncertain. Right, just data-based decisions, right? Instead of gut decisions, that's still an evolving process, I think. That's right, yeah. Let's say there are some college students in the audience and they want to know what makes a good data scientist. Where should I start if I'm interested in data science? Yeah, no, that's a great question. I think if you are a good data scientist, you really have a curiosity for unearthing patterns in the data. You really love munging the data, but you're also really good at coming up with a lot of hypotheses. And you know how to test them.
You have lots of ideas, but you don't know that any particular hypothesis is right. You come up with ways of testing hypothesis. And then you come up with ways of deriving patterns that in some way predict the future better. Like, what are the patterns that are truly predictive of what will happen in the future and what's noise and you can separate them and so on. So that skill is what you need to develop. But then you need a lot of tools for it. You need to know statistics. You need to be facile with a fair bunch of coding. You need to go learn Python or R. R is the language of data science today. And you need to know a bit of SQL because you typically need to go to databases and extract data. And then you need to know how to pull data from the web. So learning a collection of these skills and then being able to derive insights from the data, that's really what a data scientist does. But you know, to students, look, this is actually not any harder than AP statistics. Data science and machine learning, I mean, they have this aura of being very new. But, you know, reality is it's something that is actually a lot simpler than many AP classes that you might take. By the way, I'll tell you another story, you know, like some of the earliest things in data science, like the least squares method, was invented by Gauss at the age of 18 in 1795. And he was trying to see patterns in the stars. Tell me more about that. Yeah, he was trying to see patterns in the stars. So the big data they had at that time were how planets were moving in the sky. And so he had to fit, you know, mathematically and find out what the patterns in the sky were and how these planets were moving. So he invented what was called the least squares method at that time at the, you know, tender age of 18. And that's how, you know, in some ways data science started. How important is it to be unbiased? Well, you have to be a scientist. I mean, everybody has to be unbiased, right? 
But data science is a scientific method. The goal is to apply science to data. And so if you are a scientist, by definition, you have to be unbiased. You have to be objective. You have to be experimenting. You have to discover insights by experimentation. And yet, you know, when you have a hypothesis, you have to be equally on the side of proving it or disproving it, right? And that's what data scientists do. I worry about data science abuse, data scientist abuse. Somebody comes down and says, I want to show this. Find me the data that shows, you know, A is related to B. But it goes back to what we said. You can find the data that supports a point of view, right? I mean, you can absolutely do that. So I don't think there's such a thing as pure unbiased. Now you can certainly test a hypothesis and then accurately decide whether that hypothesis is proven or disproven. And even again, that's a probability too, right? Point of 5.41. But you know organizations will abuse that capability and try to, you know, coerce the data scientist. Even sometimes when you have two political parties, each party can use data to substantiate their point of view, however they like. And it's actually sometimes very hard to disprove, right? And so that is the world in which we live. So we shouldn't take it for granted that data is the truth. Because at the end of the day, it is the hypothesis derived from that data. And who derives it, and what their original point of view is, that's coloring the interpretation at the end of the day. So you do have to be careful. The data doesn't lie, people do. Joseph, thanks very much for coming to theCUBE. Really a pleasure having you, really. Nice meeting you. Keep it right there, everybody. Jeff Frick and I will be right back right after this word. This is theCUBE.
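The stocking example from the interview (a forecast of 20 units, out of stock about 50% of the time if you stock exactly 20, versus stocking so you sell out only 10% of the time) can be sketched in a few lines. This is a minimal sketch, not anything shown in the interview: it assumes weekly demand is normally distributed around the forecast, and the standard deviation of 5 units and the function names `stockout_probability` and `stock_level` are hypothetical, chosen only for illustration.

```python
import math
from statistics import NormalDist


def stockout_probability(mean_demand, std_demand, stock):
    """Probability that demand exceeds the stocked quantity.

    Assumes (hypothetically) that demand is normally distributed.
    """
    demand = NormalDist(mu=mean_demand, sigma=std_demand)
    return 1.0 - demand.cdf(stock)


def stock_level(mean_demand, std_demand, max_stockout_prob):
    """Smallest whole number of units whose stockout probability
    does not exceed max_stockout_prob, under the same assumption."""
    demand = NormalDist(mu=mean_demand, sigma=std_demand)
    # The (1 - p) quantile of demand is the level exceeded with probability p.
    return math.ceil(demand.inv_cdf(1.0 - max_stockout_prob))


forecast_mean = 20.0  # point forecast from the interview example
forecast_std = 5.0    # hypothetical width of the "error bar"

# Stocking exactly the point forecast sells out about half the time.
print(stockout_probability(forecast_mean, forecast_std, 20))

# Stocking to the 90th percentile of demand sells out at most 10% of the time.
print(stock_level(forecast_mean, forecast_std, 0.10))
```

Under these assumed numbers, the point forecast alone gives a 50% stockout risk, while the 10% target calls for stocking several units above the forecast. The wider the error bar, the larger that buffer becomes, which is why the distribution matters and not just the forecast itself.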