Live from San Francisco, it's theCUBE, covering Spark Summit 2017, brought to you by Databricks.

Welcome back to theCUBE. You are watching coverage of Spark Summit 2017. It's day two. We've got so many new guests to talk to today. We've already learned a lot, right, George?

Yeah, we had some, I guess, pretty high-bandwidth conversations.

Yes, and I expect we're going to have another one here, because our guest is the founder of Predictive Analytics World. It's Eric Siegel. Eric, welcome to the show.

Hey, thanks, Dave. Thanks, George. You go by Dave or David?

Oh, you can call me sir, and that would be... I would cry. Can I bow?

Oh no, we're bowing to you. You're the author of the book Predictive Analytics. I love the subtitle: The Power to Predict Who Will Click, Buy, Lie, or Die.

Yes, that's right.

And that actually sums up the industry, right?

Right. So if people are new to the industry, that's sort of an informal definition of predictive analytics, which is basically also known as machine learning: you're trying to make predictions for each individual, whether it's a customer for marketing, a suspect for fraud or law enforcement, a voter for political campaigning, or a patient for healthcare. So in general, it's on that level. It's prediction for each individual.

So how does data help make those predictions?

And then you can only imagine just how many ways in which predicting on that level helps organizations improve all their activities.

Well, we know you were on the keynote stage this morning. Could you maybe summarize for the CUBE audience a couple of the top themes you were talking about?

Yeah, I covered two advanced topics that I wanted to make sure this pretty technical audience was aware of, because a lot of people aren't. One's called uplift modeling. That's optimizing for persuasion, for things like marketing, and also for healthcare, actually, and for political campaigning. When you do predictive analytics for targeting marketing, the traditional approach is: let's predict, will this person buy if I contact them? Because then, okay, maybe it's a good idea to spend the $2 to send them a brochure, the marketing treatment, right? But there's actually a slightly different question that would drive even better decisions, which is not "will this person buy," but "would contacting them, sending them the brochure, influence them to buy?" Will it increase the chance that we get that positive outcome? That's a different question, and it doesn't correspond to standard predictive modeling or machine learning methods. So uplift modeling, also known as net lift modeling or persuasion modeling, is a way to create a predictive model like any other, except that its target is: is it a good idea to contact this person, because doing so will increase the chances of the positive outcome? So that's the first of the two, and I crammed all this into the 20 minutes.
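A minimal sketch of what uplift modeling can look like in practice is below. This is the simple "two-model" formulation, not necessarily the method Siegel presented in the keynote; the file and column names are hypothetical, and scikit-learn is assumed.

```python
# Two-model uplift sketch: fit separate response models on customers who
# were contacted (treatment) and who were not (control), then score the
# DIFFERENCE in predicted purchase probability. Contact the people whose
# chance of buying rises the most because of the contact itself.
import pandas as pd
from sklearn.linear_model import LogisticRegression

df = pd.read_csv("campaign_history.csv")   # hypothetical historical data
features = ["age", "past_purchases", "days_since_last_visit"]

treated = df[df["was_contacted"] == 1]
control = df[df["was_contacted"] == 0]

m_treated = LogisticRegression(max_iter=1000).fit(
    treated[features], treated["bought"])
m_control = LogisticRegression(max_iter=1000).fit(
    control[features], control["bought"])

# Uplift = P(buy | contacted) - P(buy | not contacted), per individual.
df["uplift"] = (m_treated.predict_proba(df[features])[:, 1]
                - m_control.predict_proba(df[features])[:, 1])
mail_list = df.sort_values("uplift", ascending=False).head(10_000)
```

Note that the target here really is "does contacting this person change the outcome," which is exactly why, as Siegel says, it doesn't correspond to a standard single-model formulation.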
The other topic was a little more commonly known, but I think people like to revisit it. It's called p-hacking, or vast search, where you can be fooled by randomness in data relatively easily. In the era of big data, there's this all too common pitfall where you find a predictive insight in the data, and it turns out it was actually just a random perturbation. How do you know the difference?

Fake news, right?

Okay, fake news, except that in this case it was generated by a computer, right? And there's a statistical test that makes it look like it's actually statistically significant, so we lend it credibility. To avert that, you have to compensate for the fact that you're evaluating many different predictive insights, or hypotheses, whatever you want to call them, and make sure that the one you end up believing has been checked for the possibility that it was just random luck. So that's known as p-hacking.

All right, so uplift modeling and p-hacking. George, do you want to drill in on those a little bit?

Yeah, I want to start from the vocabulary of our audience, where they'll say that uplift modeling goes beyond prediction. And even for the second one, p-hacking: is that where you're essentially playing with the parameters of the model to find the difference between correlation and causation, and going from prediction to prescription?

It's not about causation. Correlation is what you get when you find a predictive insight, or some component of a predictive model, where you see these things are connected, and therefore one is predictive of the other. Now, the fact that that does not entail causation is a really good point to remind people of. But even before you address that question, the first question is: is this correlation actually legit? Is there really a correlation between these things? Is this an actual finding, or did it just happen to be the case in this particular limited sample of data that I have access to at the moment? Is it a real link or correlation in the first place, before you even start asking any question about causality?

And it does relate to what you alluded to with regard to tuning parameters, because it's closely related to this issue of overfitting, and people who do predictive modeling are very familiar with overfitting. The standard practice, built into all tools and implementations of machine learning and predictive modeling, is to hold aside an evaluation set, called the test set. So you don't get to cheat. You create the predictive model, it learns from the data, it does the number crunching, it's mostly automated, and it comes out with this beautiful model that does well predicting. Then you evaluate it, you assess it, over this held-aside set.

My thing's falling off here.

Just go for it.

So then you evaluate it over this held-aside set. It was quarantined, so you didn't get to cheat; you didn't get to look at it when you were creating the model. So it serves as an objective performance measure.

The problem is, and here's the huge irony, what happens with the simpler predictive insights we get from data. There was one famous one that was broadcast too loudly, because it's not nearly as credible as people first thought: that an orange used car is a better one to buy, that it's less likely to be a lemon. That's what it looked like in this one data set. When you have a single insight like that, it's relatively simple; we're just talking about the car's color to make the prediction. A predictive model is much more complex and deals with lots of other attributes, not just the color: the make, the model, everything on that individual car or individual person; you can imagine all the attributes. That's the point of the modeling process, the learning process: how do you consider multiple things? If it's just a really simple thing, just based on the car color, then many of even the most advanced data science practitioners kind of forget that there's still the potential to effectively overfit, that you might have found something that doesn't apply in general, that only holds over this particular set of data. That's where the trap falls in: they don't necessarily hold themselves to the high standard of having this held-aside test set. So it's just kind of ironic. The things that are most likely to make the headlines, like orange cars, are simpler and easier to understand, but it's less well understood that they could be wrong.
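A small simulation makes the vast-search pitfall concrete. This is an illustrative sketch, not Siegel's keynote example: test enough random, meaningless candidate insights against an outcome and some will look "significant" at the usual p < 0.05 threshold, unless you compensate for how many hypotheses you tried.

```python
# p-hacking / vast search sketch: 200 purely random candidate "insights"
# tested against a purely random outcome. At alpha = 0.05, about 10 will
# look significant by luck alone; a Bonferroni correction (dividing alpha
# by the number of hypotheses tried) compensates for the vast search.
import numpy as np
from scipy.stats import pearsonr

rng = np.random.default_rng(0)
n_samples, n_hypotheses = 500, 200
X = rng.normal(size=(n_samples, n_hypotheses))  # noise posing as features
y = rng.normal(size=n_samples)                  # outcome, unrelated to X

p_values = np.array([pearsonr(X[:, j], y)[1] for j in range(n_hypotheses)])

alpha = 0.05
print("naive 'discoveries':", (p_values < alpha).sum())              # ~10
print("after Bonferroni:", (p_values < alpha / n_hypotheses).sum())  # ~0
```

The held-aside test set Siegel describes guards against the same trap for full models: an "insight" that was only luck in the training sample is unlikely to repeat on quarantined data.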
You know, okay, keying off that, that's really interesting, because we've been hearing for years that what's made deep learning especially relevant over the last few years is huge compute up in the cloud and huge data sets. But we're also beginning to hear a message about generating synthetic data, so that if you don't have a lot of, I don't know if the term is organic, training data and test data, we're getting to the point where we can do high-quality models with less.

Yeah, with less of that training data. Did you interview the keynote speaker from Stanford about that?

Oh no, we only saw part of his speech.

Yeah, the speech yesterday. That's an area I'm relatively new to, but it sounds extremely important, because that is the bottleneck. He called it, you know, if data is the new oil, training data is the new new oil, which is more specific than data. All these machine learning or predictive modeling methods of which we speak are, in most cases, what's called supervised learning. The thing that makes it supervised is that you have a bunch of examples where you already know the answer. If you're trying to figure out, is this picture a cat or a dog, that means you need a whole bunch of data from which to learn, the training data, where you've already got it labeled: you already know the correct answer. In many business applications you have that just because of history. You know who did or didn't respond to your marketing; you know what did or did not turn out to be fraudulent. History is experience from which to learn. It's in the data, so you do have it labeled, yes or no. You already know the answer, so you don't need to predict it; it's in the past. But you use it as training data. So we have that in many cases. But for something like image classification, where we're trying to figure out whether there's a picture of a cat somewhere in the image, all these big image classification problems, you often need a manual effort to label the data, to have the positive and negative examples. That's what's called the training data, the learning data. So there's definitely a bottleneck, and anything that can be done to avert that bottleneck, to decrease the amount we need, or to find ways to make rough training data that can serve as a building block for the modeling process, that kind of thing, sounds really intriguing, though it's not my area of expertise.
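For readers newer to the vocabulary, the supervised setup Siegel describes, learning from historical examples whose outcomes are already known and then scoring on quarantined data, looks roughly like the sketch below. The file and column names are hypothetical; scikit-learn is assumed.

```python
# Supervised learning sketch: history is already "labeled" by what
# actually happened (responded / didn't respond), so it can serve as
# training data. A held-aside test set keeps the evaluation honest.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

history = pd.read_csv("past_campaigns.csv")    # hypothetical file
X = history[["age", "tenure_months", "emails_opened"]]
y = history["responded"]                       # known outcome = the label

# Quarantine 25% of the data; the model never sees it during training.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42)

model = RandomForestClassifier(n_estimators=200).fit(X_train, y_train)
auc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
print(f"held-out AUC: {auc:.3f}")   # objective measure, no peeking
```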
What about, and this may be further out on the horizon, the extreme shortage of data scientists, who need to be teamed up with domain experts to figure out the knobs, the variables, to create these elaborate models? We're told that even if you're doing the traditional statistical machine learning models, eventually deep learning can help identify the features, or the variables, just the way it sort of identifies ears and whiskers and the nose, and then figures out the cat from that. Is that something that can help augment what the data scientist does in the near or medium term?

It's in the near term, in the sense that that's why everyone's excited about deep learning right now. Basically, the reason we built these machines called computers is that they automate stuff. Pretty much anything you can think of and define well, you can program, and then you've got a machine that does it. And one of the things we wanted computers to do is learn from data. It's literally very analogous to what it means for a human to learn: you've got a limited number of examples, and you're trying to draw generalizations from those. Now, you go to bigger-scale problems, where the thing you're classifying isn't just a customer, with all the things you know about the customer, are they likely to commit fraud, yes or no. It becomes a level more complex when it's an image. An image is worth a thousand words, and maybe literally a lot more than a thousand words' worth of data if it's high resolution. So how do you process that? There's all sorts of research along the lines of: we can define a thing that tries to find arcs and circles and edges and that kind of thing. Or we can, once again, just let that be automatic and let the computer do it. That's what deep learning allows. I mean, Spark is a way to make it operate quickly, but there's another level of scale besides speed, which is how complex a task you can leave up to the automaton to carry out by itself. That's what deep learning does: it scales in that respect, and it has the ability to automate more layers of that complexity, finding those kinds of domain-specific features in images.

Okay, but I'm thinking of not just speech-to-text, natural language understanding, or classifying images, but anything that is a signal, where there's a high-bandwidth stream of data coming in that you want to classify. Does that extend to building a very elaborate predictive model, not on whether there's a cat in the video or the picture, so much as, I guess you called it, whether there's uplift potential, and how big that potential is, in the context of making a sale on an e-commerce site?

So what you just tapped into is that in marketing and many other business applications, you don't actually need high accuracy. What you need is prediction that's better than guessing. For example, if I get a 1% response rate to my marketing campaign, but I can find a pocket with a 3% response rate, it may very well be rocket science to learn from the data how to define that specific sub-segment with the higher response rate. But the 3% doesn't mean I have high confidence that this person is definitely going to buy; it's still just 3%. That difference can make a huge difference, though. It can improve the bottom-line profit of marketing by a factor of five, that kind of thing. So it's not necessarily about accuracy.
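To see why a 1% versus 3% response rate matters so much, here is a back-of-the-envelope worked example. The $2 brochure cost comes from earlier in the conversation; the $100 profit per sale is a purely hypothetical figure chosen for illustration.

```python
# Expected profit per brochure mailed, at $2 per contact and an assumed
# (hypothetical) $100 profit per sale. Mailing everyone at the 1% base
# rate loses money; targeting the 3% pocket flips the campaign profitable.
cost_per_contact = 2.00     # mentioned in the conversation
profit_per_sale = 100.00    # hypothetical assumption

for response_rate in (0.01, 0.03):
    expected = response_rate * profit_per_sale - cost_per_contact
    print(f"response rate {response_rate:.0%}: "
          f"expected profit per contact = ${expected:+.2f}")

# response rate 1%: expected profit per contact = $-1.00
# response rate 3%: expected profit per contact = $+1.00
```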
Now, if you've got an image and you need to know whether there's a picture of a car, or whether a traffic light somewhere in the image is green or red, then in certain application areas, self-driving cars as the case may be, it does need to be accurate. But maybe there's more potential for it to be accurate, because there's more predictability inherent to that problem. I can predict that there's a traffic light showing green somewhere in the image because I've got enough data, enough labeled data, and the nature of the problem is more tractable: it's not as challenging to find where the traffic light is and which color it is. So for certain applications you need it to scale to reach that level of classification performance, in terms of accuracy or whatever measure you use.

Are you seeing new methodologies, like reinforcement learning, or the deep learning models that are sort of adversarial, where they make big advances in terms of what they can learn without a lot of supervision?

Oh, it's more self-learning, and it's unsupervised.

Sort of: glue yourself onto this video game screen, we'll give you control of the steering wheel, and you figure out how to win.

Yeah. Having less required supervision, more self-learning: anomaly detection and clustering are some of the unsupervised ones. When it comes to vision, there are parts of the process that can be unsupervised in the sense that you don't need labels for your target, like whether there's a car in the picture. It can still learn the feature detection in a way that doesn't require that supervised data, although image classification in general, and that level of deep learning, is not my area of expertise. So that's a very up-and-coming part of machine learning, but it's only needed where you have these high-bandwidth inputs, like an entire high-resolution image, or video, or high-bandwidth audio. It's signal-processing-type problems where you start to need that kind of deep learning.

Okay. Yeah, it's a great discussion, Eric, and we have just a couple of minutes to go in the segment here.

Yeah.

I want to make sure I give you a chance to talk about Predictive Analytics World.

Sure.

What's your affiliation with that, and what do you want the CUBE audience to know?

Oh, sure, yeah. Predictive Analytics World, I'm the founder. It's the leading cross-vendor event focused on commercial deployment of predictive analytics and machine learning. Our main event, held a few times a year, is a broad-scope, business-focused event, but we also have vertically focused, specialized events just for financial services, healthcare, workforce, manufacturing, and government applications of predictive analytics and machine learning. So there are a number of them a year. We've got one two weeks from now in Chicago and one in October in New York, and you can see the full agendas at predictiveanalyticsworld.com.

All right, great short commercial there. We've got 30 seconds left, the elevator pitch, so I'm going to ask you a tough question. In 30 seconds: what is the toughest question you got after your keynote this morning, maybe in a hallway conversation?

What's the toughest question I got after my keynote, from one of the attendees?
Oh, well, the question that always comes up is: how do you get this level of complexity across to non-technical people, your boss or your colleagues or your friends and family, right? By the way, that's something I worked really hard on with the book, which is meant for all readers, although the last few chapters...

How to get executive sponsors to understand what you're doing.

Well, no, I mean, that is to say: give them the book, because the point of the book is that it's pop science. It's accessible, it's anecdotally driven, it's entertaining, and it keeps things relevant, but it does address advanced topics at the end, so it serves as sort of an industry overview. The bottom line there, in general, is that you want to focus on the business impact. Like what I mentioned briefly a second ago: if we can improve the targeting of marketing this much, it will increase profit by a factor of five, something like that. You start with that, and then answer any questions they have about how it works and what makes it credible that it really has that much bottom-line potential, this kind of thing. When you're a techie, you're inclined to go the other way: you start with the technology you're excited about, and that's my background, right? That's sort of the definition of being a geek: you're more enamored with the technology than with the value it produces, because it's amazing that it works, and it's exciting, and it's interesting, and it's scientifically challenging. But when you're talking to the decision makers, you have to start with the carrot at the end of the stick, which is the value. That's the business outcome.

Yeah, yeah. Great, well, that's going to be the last word. That might even make it onto our Cubegem segment.

Okay, cool.

We're having great sound bites. George, thanks again, great questions. Eric, the author of Predictive Analytics: The Power to Predict Who Will Click, Buy, Lie, or Die, thank you for being on the show. We appreciate your time.

Sure, yeah, thank you. It's been great to meet you.

All right, and thank you for watching theCUBE. We'll be back in just a few minutes with our next guest here at Spark Summit 2017.