All right, hello and welcome. My name is Shannon Kemp and I'm the executive editor for Data Diversity. We'd like to thank you for joining today's Data Diversity webinar, Predictive Analytics: How to Get Stuff Out of Your Crystal Ball, the latest installment in the monthly series Data Ed Online with Dr. Peter Akin, brought to you in partnership with Data Blueprint. Just a couple of points to get us started. Due to the large number of people that attend these sessions, you will be muted during the webinar. If you'd like to chat with us or with each other, we certainly encourage you to do so; just click the chat icon in the upper right for that feature. For questions, we'll be collecting them via the Q&A in the bottom right-hand corner of your screen. Or if you'd like to tweet, we encourage you to share highlights or questions via Twitter using hashtag #dataed. To answer the most commonly asked questions, as always, we will send a follow-up email to all registrants within two business days containing links to the slides. And yes, we're recording, and likewise we'll send a link to the recording of the session, as well as any additional information requested throughout the webinar. Now let me introduce our speaker for today. Peter Akin is an internationally recognized data management thought leader. Many of you already know him or have seen him at conferences worldwide. He has more than 30 years of experience and has received many awards for his outstanding contributions to the profession. Peter is also the founding director of Data Blueprint. He has written dozens of articles and eight books; the most recent is Monetizing Data Management. Peter has experience with more than 500 data management practices in 20 countries and is consistently named a top data management expert. Some of the most important and largest organizations in the world have sought out his Data Blueprint expertise. Peter has spent multi-year immersions with groups as diverse as the U.S.
Department of Defense, Deutsche Bank, Nokia, Wells Fargo, the Commonwealth of Virginia, and Walmart. He often appears at conferences and is constantly traveling. Let me turn it over to Peter to introduce today's webinar and today's guest speaker. Peter, hello, and welcome. Thank you, Shannon. Just a note today: I'm broadcasting from here in Washington, D.C. We're in the basement of the Bureau of Labor Statistics; they were kind enough to lend us their facilities for the broadcast today. I'm headed from here to Kazakhstan for my first trip out there. Let's see how that works out. We're going to start a chapter out there and have some fun. I am so thrilled to have my friend and colleague, Dr. Stephanie Burr, join us today. You see a little bit about her on the left-hand side of your screen there. She has a PhD from Columbia in Statistics and a BA from the University of Chicago in Neurobiology and Behavioral Science. So to call Stephanie a jack-of-all-trades is just getting started. She has lots and lots of things that she does. Her company is called the ANSEC Group, but she has experience with Ernst & Young and a bunch of other places. She also learns languages in her spare time and participates in strange events, which I think, Stephanie, you called your biggest mistake: cageless shark diving. And you just mentioned during the warm-up that you're going to become a foster mother for some oysters. So welcome, Stephanie, and that's a little bit about some of the crazy stuff that she does. This is a talk a little bit about data science, but at the same time a little bit about how to get stuff out of your crystal ball. So Stephanie, welcome and thank you for joining us. Thanks, Peter. So we're going to start out with a little bit here of predictive analytics gone wrong. So let's give this a try. This is Mary. May I take your order? Hi, Mary. Yes, I'd like to order. Is this Mr. Kelly? Yes. Thank you for calling again, sir.
I show your national identification number as 6102049998-45-54610. Is that correct? Yes. Thank you, Mr. Kelly. I see you live at 736 Montrose Court. You're calling from your cell phone. Are you at home? I'm just leaving work, but I'm... Oh, we can deliver to the auto supply. That's at 175, yes? No. I'm on my way home. How do you know all this stuff? We just got wired into the system, sir. Oh, well, I'd like to order a couple of your double meat special pizzas. Sure thing. There'll be a new $20 charge for this, sir. What do you mean? The system shows me that your medical records indicate that you have high blood pressure and extremely high cholesterol. Luckily, we have a new agreement with your national health care provider that allows us to sell you double meat pies as long as you agree to waive all future claims of liability. What? Do you agree, sir? You can sign the form when we deliver, but there is a charge for processing. The total is $67 even. $67? That includes the delivery surcharge of $15 to cover the added risk to our driver of traveling through an orange zone. I live in an orange zone? Now you do. Looks like there was another robbery on Montrose yesterday. Hmm. You could save $48 if you ordered our special sprout submarine combo and picked it up yourself. It's low-sodium, and those are very tasty, sir. Good value, too. I want double meat. Well, I'm sure you can afford the $67 then. You just bought those tickets to Hawaii. They weren't cheap, eh? Oh, but I see you checked out the budget beach book at the library last week. Hmm. Up to you, sir. All right. All right. I'll get the sprout subs. Good choice, sir. Gotta watch that waistline if you're hitting the beach, eh? 42 inches. How do you know that? Just between you and me, there's a $3 off coupon in this month's Total Men's Fitness magazine. Your wife, Betty, subscribes to that, right? Anyhow, that comes to $19.99 even. Looks like you're maxed out on all your credit cards. Bring cash, okay? So, Stephanie, what's wrong with that?
There are so many things wrong. From a privacy perspective, inappropriate use of the data. What is your objective with it? Pretty much everything's gone wrong there. And we're not there yet, right? I mean, you are really what we talk about when we talk about a data scientist, even though you were one before the term was cool. Yes. It is cool now, just so everyone knows. It is cool. In fact, it is the sexiest job of the 21st century, according to McKinsey. Now, I do have to point out that the McKinsey report came out at a time when any job with a paycheck associated with it was considered to be sexy. But at the same time, McKinsey called for a million and a half people to get into this profession; they're going to need a million and a half data-savvy managers. I don't think we're anywhere close to a million and a half data scientists out there. Do you think so? No. And it also depends on your definition. Yeah. And the data-savvy piece, we'll talk about that a little bit when we get into successful analytics projects. But that's a term that we can talk about. So let me jump through a couple of things here about data science in general. And I, like you, am a little skeptical of the hype. And I think that's a healthy thing for any of us in the profession to have. I heard a quip recently that calling somebody a data scientist is like calling somebody a book librarian. You know, any scientist that doesn't actually use data is not really a scientist. But we're really talking about somebody who's specialized within that. And yet we've had some interesting challenges around it. For example, I found this map on the Internet that says, oh, you need to have a bunch of different things, you know, sort of some fundamental stuff, and clearly statistics are going to be important, like programming and machine learning. I'm not going to read all of these off to you. But, you know, that's their definition of data science.
And when I got that, I started looking a little further and just sort of coming up with some other bits and pieces. So here's one with mathematics, hacking skills, business acumen. I'll say something about the business acumen in particular, because the one complaint we get for the most part, when we've taken this newest group of data scientists and put them to work in our corporations and government, is that after about three years they come back to us and say, well, they're really smart people, but they just don't seem to have an interest in solving the problems that we want to have solved. And I think that is a problem that we have. So I'm still picking pictures off of here, you know, looking at what people say you need: well, you need to be able to do complex formulas, and you need some consumer psychology, and in this one the business programming language is only 10%. I love this particular chart here because it goes back to my origins. It says data processor. Really, if we think about it, that is actually a good term. In fact, I had relatives in World War II that were hired as computers, because people did the computing back in those days instead of machines. Creative and curious, skeptical, you know, all of these are fine pictures. I like this one. It shows an artist on there. And my real problem with all of this, once we get down to the actual picture level and sort of have trouble with it, is that it reminds me of another thing that I do with groups that I work with, and that is that they're not allowed to use vague, abstract terms that don't mean anything. So if you're talking to somebody and you say "the customer," you have to say, well, we need to differentiate between something that would be a potential customer and a current customer, you know, what's really going on there. And maybe that gives us the ability then to say ex-customer or VIP customer all the way around. My answer on the data scientist part is really trying to say, oh, there it is.
I've got Eric Siegel's quote there. It's really trying to say, what type of a data scientist are you? And then again, Stephanie, just to help introduce people to you, we had a bunch of pictures up there in the front and we talked about you becoming a foster mother for oysters. When you introduce yourself, how do you tell people what you do, aside from the fact that most everybody, at least to date, thinks they know what a data scientist is? My basic way of explaining it, and this is what I do, is I analyze data to help achieve business goals, okay? So there's a number of different levels at which you can look at data science. You can look at it from the data piece, where you're processing and cleaning and managing it. You can look at it from the statistics piece, where you're either a doer or a consumer of the statistics. You can look at it in terms of research methods, right? Understanding the design for the statistics and what the outcomes are. You can also look at it as project management. If you've got a huge team, you actually have to have someone who understands what they're accomplishing and runs the project. You can also have the linkage to the business, and that's really a key piece that I think we're going to see emerging as we go along: data scientists are going to start becoming more and more data-scientist businesspeople. But it's basically helping businesses accomplish their goals using analytics. Using analytics. And you've got some particular specialties. You've been working in the security area and really in the audit and compliance types of areas as well, as particular specializations. You have lots of other things that you've done, but those are two areas that you've developed. Tell us a little bit about what got you into those areas, and how would somebody recognize the things that you do that are different from the work that other data scientists are doing in their respective areas, if that's a question that you can answer.
Yeah, and it might be useful for people who are listening in also. You know, we have a core set of skills and we have a core set of personality attributes and styles. And like Peter says, you know, we tend to be detail-oriented and tend to be very focused on achieving results. And what I really enjoy is looking at industries that are emerging and that are messy. Because if you think about it, really, if you're doing some work in a messy area, you're almost guaranteed to succeed, because you can clear up some of that complexity and you can help your clients. And data security is an awesome area for that. When we started doing credit cards and trying to help clients understand where their risk was and the potential for breaches, it was kind of like a very rudimentary research study, right? And rudimentary statistics, but with that emphasis on the link to the business. So that's kind of how I look at it. And it might be useful for everyone to look at it that way, too, for their careers. And again, of course, that's evolved over time. And what you suggested as a topic title for this is how to get stuff out of your crystal ball. We had some silly pictures at the beginning, for the few who signed on early, to look at what getting stuff out of your crystal ball is. So the first thing that you said in there was, if something's really messy, bringing any sort of order to it at all, most people say thank you, that's a help. And yet then, once you've solved the easy problems, they come back to you and say, okay, I liked that first piece that you did. Now, Dr. Burr, can you take us to the next step? And how does that process work? And maybe I should flip the chart here a little bit and talk a little bit about the way you see things in the analytical world at the moment. Yeah, well, if we think about it, about 70% of data analytics projects fail when we start looking at specific analytic projects.
We're having issues in terms of understanding and expectations, resources, and the bottom line. And you've really got three areas that you can consider in terms of analytics. A lot of times with these messy issues, we're still dealing in the descriptive area: what happened, and what is happening now. And once we start getting a little more sophisticated, maybe after one or two or maybe three or four iterations, that's when we start getting into the predictive piece. And we start having an understanding of our data and start being able to use techniques for scoring our subjects based on our profiles. And then, I haven't gotten to this yet with my work, but the prescriptive: what should I be doing, and what should I be doing with it? And that's where we're really going to start focusing on proactively managing and measuring specific groups based on our predictions. So right now, we're still kind of in descriptive, moving into predictive. Although you were describing that on a project-by-project basis. So the first time somebody tries to do something, of course, your first approach is to Google it and see if anybody's done it before. But when you do realize that they're coming to you because this problem hasn't been solved before, then you say, okay, the first thing we're going to try to do is take this sort of messy thing and make it a little bit more formal. And these are the kinds of questions that you ask: What happened? What is happening? And they give you profiles and pie charts and bar charts and narratives. And it's not all about statistics, is it? It's also about storytelling. Absolutely, Peter. And really, step one when you're looking at descriptive analytics is obtaining the organization-wide perspective. So you're getting a 360-degree view of the key data that you need. And a lot of times what we'll do is actually draw out our reports, and that includes our graphics and our action statements, before we start.
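Stephanie's descriptive questions, what happened and what is happening, amount to profiling and summarizing the data before any prediction is attempted. A minimal sketch, using entirely made-up incident records (the precincts, incident types, and response times below are hypothetical illustrations, not data from the webinar):

```python
from collections import Counter
from statistics import mean

# Hypothetical incident records: (precinct, incident_type, response_minutes).
incidents = [
    ("A", "theft", 12), ("A", "assault", 8), ("B", "theft", 15),
    ("B", "theft", 11), ("C", "vandalism", 20), ("A", "theft", 9),
]

# "What happened?" -- a frequency profile by incident type (the pie-chart view).
by_type = Counter(kind for _, kind, _ in incidents)

# "What is happening where?" -- average response time per precinct (the bar-chart view).
avg_response = {
    p: mean(minutes for q, _, minutes in incidents if q == p)
    for p in {p for p, _, _ in incidents}
}

print(by_type.most_common())
print(avg_response)
```

Summaries like these become the graphics and action statements Stephanie describes drawing out before the real analysis starts.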
So it's almost an inherent hypothesis-testing process. Sorry, "process" in America; I'm still getting used to that. So that helps us understand the situation we're in. We have a lot of volume and noise at that point. So we want to focus on the utility that we need for our client's specific issues, for answering our questions. And can you run us through a more specific example of that, that you've encountered in the past, without needing confidential information here? Yeah, here's a really good example. You know, it's a really fun example that I was going to talk about a little bit further, but it kind of gets at the nub of this. When I was an auxiliary police officer in New York, the commissioner introduced something called CompStat. And that was basically taking crime data and using it for how we would go on patrol. Okay? So a lot of people were resistant to it; nobody likes change. We went out and that was how we had our patrols. We would be assigned to specific areas that had issues. Then, years later, we've got the data linked together, and a gentleman named Scott Stringer became the Comptroller for the city. And he saw that we were losing a lot of money in terms of the legal claims against the city. It was almost $800 million in predicted payouts by 2018. So this is pretty significant, especially because the fiscal year 2015 budget for the Parks Department, the Department for the Aging, and the New York Public Library was less than that. So they went to predictive analytics and they started using something that's like CompStat. So this is, you know, 10 years later from CompStat for crime data; now we're looking at tracking the city payments for the claims. And the bizarre finding with these legal claims against the city, now that we had access to different databases, was that tree pruning in the parks budget was related to the claims. Isn't that interesting? And they found that in 2010 they... I know, it's just so bizarre.
And this is one of the beauties that we have with the data that we have access to and the ability to touch it and to be able to include it in our models. We never would have thought of these things before. And again, it's not looking for an answer. It's looking for potential solutions. So they went back and looked at 2010, when they'd cut the budget back, really cut it back, for tree pruning. And that was when the injuries soared. Several of them were multimillion-dollar claims. And so they restarted the pruning. Claims went back to normal. Now the analyst is a rock star. And we start seeing how having a system-wide view of the data is really important. Pretty neat, huh? That is pretty neat. That's what you mean by organization-wide, that whole entire piece. So you started out with crime and you ended up with tree pruning. Yeah, sort of, yes. That's interesting. You also, of course, came back and addressed crime, I know, later on in this too. But let me just take you down to another tree pruning exercise, because we did this with the data quality work that we were doing as well. And I used that same example, with no idea it was related to what you were talking about before. But we found a problem with the quality of the data that they had collected on the trees, because trees don't come with unique identifiers in New York or anywhere else. There are about two and a half million big trees in the big parks. And nobody's going to run around and put a sign on them: number one, number two, number three. But interestingly, they got to the point where the analysis could only be so sophisticated, because the tree pruning statistics were kept by street, whereas the accident statistics were kept by block. And so they weren't really able to establish that linkage at a finer grain, and this placed a limitation on their analysis. It was just fascinating to see how problematic that was.
It was interesting that they ended up being sort of the same kind of an issue there. That is interesting. And that gets into how do we define our data and how do we clean it, right? Well, let's talk about volume and noise. Now, some of you listening in know that I'm a musician; Shannon and I love music and we usually play it at the beginning of these. That's not what you're talking about for volume and noise. It's not, you know, that I'm playing too much loud music. No, no, no, no, no. This is where the recent developments, and I'm talking, you know, the last five to ten years, in being able to use technology are amazing. We now have, and big data has been a really great example of that, sheer volume available to us that, conversely, we actually have to trawl through to identify what's relevant. And that's where the noise comes in, in terms of the quantity and then also the quality. And one of the things that's really neat that we can do, instead of just looking at basic data like we used to with more of our regression techniques, is start looking at more of the machine learning, and the actual digging into the data, and seeing the variability within our data as well. So that noise comes in terms of the different sources as well as, within the sources, the variability within them. So let me just make sure I state this back to you correctly. You're talking about volume increasing. It's not just the literal number of things; it's the volume of variety and veracity and all the rest of the Vs that we talk about when we get to big data techniques. And the noise that you're describing is that sometimes the stuff isn't pristine. Nobody's handing you an already-cleaned spreadsheet with all the data lined up perfectly so you can do this regression thing. And I'm going to come back to that regression thing in just a minute to make sure everybody's with us on that one. But it really is a lot of work.
And then we've heard that same 70 to 80% statistic, that it takes you that much time and effort to get to the real analysis that you want to do. Oh, absolutely, absolutely. And that's where you, with your data management work, are just critical to the success of us moving forward. Now, we've always talked about it as a good partnership in the sense that we do the preparation work and hand you a nice clean data set, and you can be much more effective. Would that improve your 70% failure rate in predictive analytics, just out of curiosity? If we were able to give you cleaner data sooner, or would it give you the ability to fail sooner because you simply can't get the information you want from the data sometimes? I think it would separate the people who understand the business objectives and how to link to them from the people who are just sitting doing data studies. But it would definitely help. I'd be very happy with that. We'll have to explore that in another topic at some point. So let's go back to your analysis of descriptive statistics. The CompStat data described crime in New York City, and you used the word regression a couple of times. You know, both of us know what we're talking about here, but can you explain what regression really is for everybody else who's listening in? Because I'm not sure they all had good statistics teachers when they were in high school or college. Oh, yeah. And for anybody who looks at statistics and thinks, oh my gosh, you know, it's horrifying: if you can find a good teacher, it makes such a difference. It really does. In terms of regression techniques, basically what you're doing is you're taking a hypothesis-driven approach and you're saying, okay, I think this might happen, or maybe not, right? And then you collect a bunch of what we call independent variables. These are going to predict your... potentially predict your dependent variable, which is what depends on the independent variables. Ta-da!
So if you start looking at trying to predict whether someone's going to purchase a product, for example, you know, I have this pen in my hand. I want to know if people are going to purchase it. That's my dependent variable, right? And then I'll look at my independent variables. Could it be something like gender? Could it be something like people's visits to my website for my pen? Could it be their income? No, I probably won't have that, but you never know. So those are the predictors that feed into it. And those are where you hear those techniques like linear regression and probit regression. You start looking over time, like at time series models. But again, it's actually really fun if you start looking at how it applies to real life, and if you have a good teacher. Did that explain it okay? I think so. Let me venture out here on a limb, because I am not a specialist in the crime stat world. But I read some things recently about the broken windows policy of the police in New York City. I'm sure you had some exposure to that. One of the previous police commissioners believed that if you clamped down on the small elements of crime, it would actually help the overall crime rates. And I think, if I'm recalling the analysis correctly, that proved not to be the case. Do I have that right? It depends. You know what, that's a whole topic that we can discuss probably later, but there are pieces of that that worked really well and pieces that didn't. There we go. And this is what happens: now that you've described the situation, you come back and say, what worked and what didn't. And if pieces of it worked, you want to do more of those pieces, and for the things that didn't work, you want to go now and say, let's find some other things that we can try as experiments to see whether they produce good results. Absolutely. And a lot of times when you try to use your predictive models, it may fail.
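Stephanie's pen example, a dependent variable (purchases) predicted from an independent variable (website visits), can be sketched in a few lines. This is a plain least-squares fit on invented numbers, not anything from the webinar's own data:

```python
from statistics import mean

# Hypothetical data: weekly website visits (independent variable) and
# pens purchased (dependent variable).
visits    = [1, 2, 3, 4, 5, 6]
purchases = [0, 1, 1, 2, 2, 3]

# Simple linear regression: fit purchases = a + b * visits by least squares.
mx, my = mean(visits), mean(purchases)
b = sum((x - mx) * (y - my) for x, y in zip(visits, purchases)) \
    / sum((x - mx) ** 2 for x in visits)
a = my - b * mx

def predict(x):
    """Predicted purchases for a given number of visits."""
    return a + b * x

print(round(b, 2), round(predict(10), 2))
```

With these made-up numbers the slope comes out to roughly 0.54, so ten visits extrapolates to about five purchases; probit regression and time series models follow the same dependent-versus-independent pattern with different fitting machinery.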
And that's okay, because that gives you the opportunity, and we'll talk a little bit later about data and assumptions. It gives you an understanding of where you need to explore, and you do it. A lot of times this is iterative. And again, that's why you do want to have management buy-in and support, and that they understand what you can accomplish and what you can't. Can you just move this into the predictive area? So in other words, this describes this; therefore I can use that information to say that this will likely happen, given some other set of constraints. It again depends. So something that you probably will hear a lot about is the generalizability of data. And oh, do you want to talk about this really quick before we get into the meat of that? Because this is a really great way of framing it. Go, Peter. Well, I was going to say, I like the way you had set this particular piece up. So the question is, what type of problem are you trying to address? And if you know the type of problem, then you can apply the right types of statistics. And again, you're much more qualified to talk on this type of a process here, because it does involve threat detection and fraud and things like that. So why don't you take us through that? Well, basically what you're trying to do is understand: am I specifically looking at something that's occurring now? Right? That's descriptive. And then, if it's predictive, that gets into what I touched on with the generalizability of the data. So you have to make sure that what you have built your model on, and what you're applying it to, are similar enough, okay? So, we had a little issue with trying to help one of my clients understand some fraudulent behavior for an insurance company. And we were trying to help them understand what was going on. We thought it was a doctor issue, right? And he was doing some, shall we say, work.
So we found some of his anomalous behaviors, shall we say, and we applied the model to a different context, a different practice that he had, a different location. It turns out that those were different types of patients, and it actually was appropriate treatment. So that application was inappropriate. We were not able to predict using our descriptive data from the other location, okay? And what about the doctor? Oh, he still did it. That's gone legal; that's already in prosecution. But on that one I was bummed, because I was like, oh yes, here's another one. But that's an example. So really, in that case, we were not able to generalize from the findings of our model to the other practice. So obviously we couldn't go to prescriptive, or even predict what's going to happen. Yeah. Does that make sense to everybody? Hopefully so, and I'm sure they'll ask us questions if it doesn't. So this is one way of categorizing the types of problems that you have. You also used the word modeling in here. Now, most of our data management modeling audience thinks of this as a description of the data. Your models are descriptions of the problem. So basically, what we're trying to understand is partly driven by the data, like we're talking about. But if you don't have the expert input to it, you're not going to be able to understand the context and how you're going to apply it differently. So that little graphic just shows you really quickly the low and high for knowledge of data, and then the low and high for knowledge from your experts. And some of the things that we do, really with the messy areas, are in that lower left-hand quadrant. And then some of the others may be, for example, that context: if I had understood that they were different patients, I could have been in the upper left-hand quadrant. The really cool stuff is in the upper right-hand quadrant, but often we can get a lot of benefits from the other areas as well for business goals and objectives.
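The generalizability failure in the insurance story, a model built on one practice misfiring on another with a different patient mix, can be illustrated with a toy anomaly detector. Everything below (the clinics, the claim amounts, the three-sigma rule) is a hypothetical stand-in for illustration, not the actual engagement:

```python
from statistics import mean, stdev

# Hypothetical claim amounts at the practice the model was built on.
clinic_a = [100, 110, 95, 105, 100, 90, 120, 100]
# A different practice with a sicker patient mix, where higher amounts are normal.
clinic_b = [240, 260, 250, 255, 245, 265]

# "Model": flag any claim more than three standard deviations above clinic A's mean.
mu, sigma = mean(clinic_a), stdev(clinic_a)

def is_anomalous(amount):
    return amount > mu + 3 * sigma

# Applied to clinic B, the model flags every claim -- not because of fraud,
# but because the population it was built on does not generalize.
flagged = sum(is_anomalous(x) for x in clinic_b)
print(f"{flagged} of {len(clinic_b)} claims flagged")
```

The fix is not better statistics but expert context: knowing the patient populations differ tells you the model's training data and its target data are not "similar enough."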
So going back to our theme of getting stuff out of the crystal ball, you're saying that any of these four could be useful. Just describing the problem and understanding it is a low-expertise, low-data problem. That can give you some good results, but then almost assuredly you'll mature through that process and move, either, if you have more expertise, up into that expert-driven quadrant in the upper left, or, if you end up with different sets of data, saying, you know, I understand you've got this type of data, but if you could get it for me at perhaps a more granular level or with a different reliability, that could help in that area. But if you try to go to the expert-driven quadrant with just the data, you're not going to be able to get that far. No, no. And a lot of times that's not addressing the business needs of your clients or your organization. And in fact, when we named it, remember, Peter, "stuff" was a deliberate choice, because stuff is an ambiguous term, and it depends on what your business goals and objectives are, what your needs are for the analysis. So somebody might say, I just need to understand what's going on, and that's a low-low type of thing. But if somebody says, Stephanie, I need your 10-plus years in law enforcement, understanding not just the day-to-day operations of law enforcement but the crime statistics and the crime waves, that may put you in that upper right-hand quadrant. And even though you're an expert, if you don't have the data, you still can't solve it. Yep. And you can also layer it. So with one of our clients, we've been with them for a while. This is an info security engagement, and I've been with them for years doing audits. They get audited by the SEC in America, and they get audited by their European regulators, and we've been with them for a number of years.
So we are actually in that upper right-hand quadrant with their technical testing, the penetration, the hacking piece, but we just started last year in that lower left-hand quadrant, low-low, with social engineering. So we've actually got different aspects of the modeling depending on where they're at and what they're needing. Very cool. So that may be good from a technical perspective. Tell me how a data scientist gets involved in social engineering types of things, because that's not something I think most people would necessarily think of. Although you do have a couple of degrees in psychology, do you not? Yes. You know, it goes back to that messiness that I find so attractive. When you can apply your data analytics and research methods knowledge to a context that's messy and bring some order to it, I personally really derive enjoyment from that. And so when I see clients having a need, and a business-driven need as well, I just gravitate towards that. And again, it's not necessarily me; it could be other people. You've got a wonderful set of statistics on this next slide that I want to just get you to talk about here, because I think that wraps up the essence of this first half of the discussion that we've had. So take us through this. Okay. Basically, if we break down our predictive analytics, there are just three components. The first is data. And we say, oh, okay, I've got my data. Now I'm going to run some statistics on it. And oh, maybe I'll make a couple of assumptions, and I've got predictive analytics. It doesn't actually work that way. If you press the next button, I'll show you. You have to have what we call good data. And we'll get into this a little bit later. But if we talk about what is clean data, what is timely data, we're going to talk a little bit about coding, recoding, defining your variables. What are your missing variables?
And this is where, Peter, you really changed the world for a lot of us, because we have good data and inputs. And then when we get into the statistics, it's not just statistics, but it's the right statistics. And we kind of touched on that with the quadrant: what types of statistical analysis do we need to cover? We don't always have to use neural networks or something like that. Sometimes it's just something real simple. And then a key part of it is our assumptions. And this is what can really get you into problems: making sure that they are valid and appropriate. And in the next couple of slides, we'll talk about each of these. But really, the good data is a key overlooked component. We all love our statistics and get caught up in that. But really the data and the valid assumptions are key. And that leads you to, press the next button, strong predictive analytics. And that goes back to what you started out with too, which is, what is really the business problem? While you like to play with statistics, you actually derive joy out of solving business problems. That's because I know you very well and know the types of things that you like to put your time and energy into. Somebody could ask you, for example, to develop a correlation between swimming pool accidents and Nicolas Cage movies. We talked a little about that the other day. And that probably wouldn't do a whole lot for you, because you might have good data and you might even have the right statistics, but there's no real way that you would link Nicolas Cage movies with swimming pool accidents. Spurious correlation. Exactly. Didn't mean to shriek that at you, but that's what that is, yeah. No interest. But if we're looking at something that might have a correlation, particularly with the data that we can get our hands on now and the ability to analyze it, and that's part of the big data movement that's been occurring and being able to do our predictive analytics, we can do things like that.
And it's really, really intriguing to see how we can help the businesses with their key problems and their bottom line. We should also add in here, before we jump into that, the real key for this big data point, which we sort of made a couple of times: we're at the end of the big data era, in the sense that it's fallen off the Gartner Hype Cycle and has now become a routine part of organizations. It's becoming more normative in there. So what this means is that we have more routine access to it as opposed to specialized access to it, and that's actually an important aspect of all of this. So we've talked about this back and forth. This is what you were looking at. You were saying very nice things about me. That's very kind of you. I'm only the articulator of it. I literally am sitting here 20 feet from John Zachman, who is the real founder of this discipline, on all of this. But we talk about the hierarchy of needs here. And I tell people that I'm safe to be let out on the streets the next week if I can get home and ride my horse for a little bit and play a little bit of music, because they're my self-actualization pieces. And none of those would be possible, or any of these other levels of Maslow's hierarchy, if my food, clothing, and shelter needs are unmet. And consequently, it's a real important aspect of this, and we do have to pay attention to it from a data perspective as well, knowing that what we're talking about at the top of the pyramid here is really the tip of the iceberg. Whatever we throw into those things has to be based on these good foundational practices, because if we can't build on good foundational practices, we won't be able to actually derive the business value. Or if we do derive it, it will simply take us longer, cost more, and deliver less than if we did it the other way. The main reason for that is because the top part tends to be sold as technologies.
We've seen this. Casey, who's working on the data strategy book with me and was the first CDO of the Federal Reserve Bank, told an example just a few minutes ago where she had a bank that she was working with, and she's having a conversation with the CEO, and the CEO said, I don't understand why my problems aren't solved, because I bought a data warehouse. She's like, okay, you bought a data warehouse. Jump in and comment there if you'd like. I mean, it was just a very, very cogent example. He said, check the box. Buy the data warehouse. I got it. What could possibly go wrong? Well, you could fill the data warehouse with rubbish data. Again, it goes back to the things you had on your slide there. So just to briefly wrap up this one section, really what we talk about is that most organizations just aren't very good at this. Many of you have heard me do this before, so I'm not going to belabor the point, but we now have a proper formulation of this, and more to the point, it's one that your boss recognizes: it's like ISO 9000, or the capability maturity model that most bosses understand, a way of improving all of these pieces so that we can go to somebody and say, hey, here are your scores. Here are the scores that others in your competitive industry are facing, and so you're falling behind, and that's not a good thing, and you need to fix the ones and make them twos before you try to make any of the twos into threes. Again, I'm very much oversimplifying the overall process. Here's why the insurance industry has not tended to look very good: because they've had, relatively speaking, poor data management practices in here. And the last one in this area, just to wrap it up: the big data that you mentioned, Stephanie, a couple of times. We're just not getting better when we look at how we're doing.
These statistics are a little better, but they're not statistically significant, and that means something to Stephanie and me, which is to say that while the scores are a little bit higher in 2012, and the volumes are incredibly larger, the numbers are actually much more likely to be due to chance than to an actual improvement in the data that comes in there. And this is really where we get back to the data science on this. Somebody could look at this and say, oh, we're getting better, and the answer is no, the numbers are not statistically significant. So let's talk about what it means to have good data. This is your own version of the theorem right here. And yes, so basically when we have our definition of good data, the first line in our equation, we've really got two key areas. First is the foundational strategies, right? And that has two requirements. First is what we call reducing data ROT. Basically, about 80% of data in organizations is ROT, which means it's, this is Peter's term, redundant, obsolete, or trivial. So that means we're only working with about 20% of our data that's usable, because we've got codes, we've got scanned data, we've got disparate systems, we've got data for our financials, our customers, our quality, our operational purposes. There's a lot to deal with. So when you do have your data management practices strongly in place, you actually have a key to locating and using this 20%. And then you can effectively reuse, reduce, and recycle your data. The next one is your data management processes. There are the five practices, right? And that's just a different iteration of the slide you just showed, so we can blow through that if you want. I don't think everybody wants to hear it again. Okay, so we've got them in place. Again, what Peter touched on was capability maturity models. And then this data-centric development flow.
So a typical approach to IT development is that you're going to determine your organizational strategy, then you're going to identify your specific goals and your objectives to achieve that strategy. If you press the button, I think it'll go. There we are. Then the next level down is to develop our systems and our applications accordingly, which then drives our network and infrastructure requirements. And then, after everything else has been done, you go ahead and identify the data and the information. That's not the way you necessarily want to do it. You actually want to be able to view the data from an organizational rather than an application perspective. So the first two steps remain the same, with your strategy and your goals and objectives, but the data comes first, and the applications get specified later, with smaller footprints, through data architecture. So you see how that flip-flopped? That's what we're looking for. And Stephanie, I have to tell you a quick thing that happened literally just before we went online here. My wife sent me a message and said, it's no longer ROT, it's data that's redundant, incomplete, obsolete, or trivial, making the word RIOT instead of ROT. Oh, I love it! Yeah, that's what I told her. Okay, thank you. Thank you for working that in. So yeah, it is very much flipping the thinking around and trying to figure out how we can move into a more data-centric thinking environment, given that. We're just glad that the data science community is looking at it the same way we are. Now if we can just get IT to think about it that way, we'll actually probably make some progress. But the tough part is change; as you said before, changing behavior is tough. Well, that's where webinars like this are useful, because we're raising awareness, and then also with our teaching.
We're getting students while they're in school and learning at the bachelor's level, and then at the master's and doctorate level too. So I think slowly but surely it'll change. And then we'll have something else to worry about. There you go. Yeah, and this next slide is just that second piece of the equation. And it's just a quick slide showing you the two areas that we look at: typically the traditional regression techniques, which we covered off before, and then also the really cool machine learning techniques. So I don't know the average age of people on this call, but, being the 20-year-old that I am. No, but seriously, we wrote our own code for everything. And I remember when they introduced GUI interfaces for SPSS and SAS, we were like, what is that? And to look at the capabilities that we have now with our machine learning techniques is truly extraordinary. Being able to have these neural networks, the multi-layer perceptron (MLP), is just so neat. Being able to have geospatial modeling with our phones and our additional inputs of information, it's really an amazing area. But if you go to the next slide, what we really want to focus on is what we consider the valid assumptions. And this is where people get hammered all the time. Do you want me to just keep going, or do you want to comment? Yes, please, keep going. It's perfect. I'm a little excited, as you can tell. But we need to consider: will the future continue to be like the past? What timeframes am I dealing with? Are there any variables that are not included that could be useful? What would happen if my assumptions were incorrect? And what is my missing data? And the key is, when you're considering all of this, and I'm putting the word documented right up front, you want to document things.
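One lightweight way to act on that "document it up front" advice is to keep the assumptions in a structured form right next to the model code. This is only an illustrative sketch; the field names and example entries are mine, not a standard template:

```python
from dataclasses import dataclass

# Illustrative only: a tiny structure for writing assumptions down
# before the analysis starts, so they can be reviewed and shared.
@dataclass
class Assumption:
    statement: str        # what we are assuming
    rationale: str        # why we believe it holds
    breaks_if: str        # conditions that would invalidate it
    impact_if_wrong: str  # consequence for the model if it fails

assumptions = [
    Assumption(
        statement="The future will resemble the training period",
        rationale="No known regime change inside the data window",
        breaks_if="Subjects change life circumstances (children, retirement)",
        impact_if_wrong="Predictions drift; model needs retraining",
    ),
    Assumption(
        statement="Loan defaults are independent events",
        rationale="Historical defaults appeared uncorrelated",
        breaks_if="A shared driver emerges, e.g. falling housing prices",
        impact_if_wrong="Correlated failures, the 2008 domino effect",
    ),
]

# Share the list with the team before anyone starts believing the hype.
for a in assumptions:
    print(f"- {a.statement} (breaks if: {a.breaks_if})")
```

Writing this list when you still know nothing about the data, as Stephanie suggests, is exactly what keeps you from believing your own hype later.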
Okay, so when we think about the future continuing to be like the past, you'll see that a lot of subjects, particularly if you're looking at purchasing behavior or health care, will have changed behaviors because of different conditions. Someone who didn't have children having children. Someone who's working retiring, or being unemployed and then working. Okay, so you're going to want to make sure when you're doing that, you don't make it too explicit with the customers. I've heard of organizations running into problems by announcing changes in customers' lives, perhaps by sending coupons or ads in the mail, before the customers had announced it themselves. So you want to be careful there. The other piece of timely is the age of your data. I know this gets you going. When was your data pulled from the operational systems? What has changed since? One of the things we want to look at is how you get that feedback in there. Those are some key areas. Next, the key variables included. Sadly, a great example of that is the financial crisis of 2008 and 2009. They built the models. They predicted how likely customers were to repay their loans, but they did not include a variable or consideration or assumption that housing prices might drop, right, or stop rising. Another issue here was the assumption that these were independent events, okay, that each event was unrelated. So once the prices dropped, it was like a house of cards or a domino effect. So that's an example of a key variable that's missing. And then missing data. This one, hey, it's kind of like your wife with the incomplete. Okay, we love this RIOT. But what is your definition and identification of your missing data? How do you code it? I've sadly run into strife earlier in my career with coding missing data as a nine versus a nine-nine, and then inappropriately including it in a factor analysis. My boss fortunately caught it, but he made fun of me for like a year. It's like coding simple things. Yeah, I got abused. Yes and no.
Do you code a yes as a one and a no as a two? Do you code a yes as a two and a no as a one? Do you code a yes as a one and a no as a zero? When you're pulling in different data sets from different sources and you don't have one person sitting there eyeballing it with the business understanding and the intent of the data, you can really run into strife with that. So those are some things that you want to consider. Make sure that your assumptions are there. In terms of the consequences, you want to think about, how would that impact my model, right? How would it impact the application of my model to real life? One thing that I like to do is play a little game and say, what are additional variables that might be useful, and what would happen if I had fewer variables? And with all of that, document it. One of the things that a few of my friends and I have found is that we'll believe our own hype. So what we do is we document it right at the beginning, when we don't know anything about what we've been assigned, and we say, okay, this is what I think could be dodgy, this is something that I think might be useful, because once you start working in the data for a while, you start believing your own hype. So you want to document it, and you also want to share that with the people that you're working with. I'm going to just get you to elaborate on the additional and fewer. This is really one of the areas we call model sensitivity. The models are only valid within a certain set of assumptions. And you mentioned the financial crisis. If you watch the movie The Big Short now, they actually allude to an awful lot of these techniques in there. Although some of it comes from people like Margot Robbie sitting in a bathtub to try and keep your attention on these very technical details, because the makers knew that it was hard to get people to focus on these exact bits and pieces in here. So really, it's such a good exercise to say, I've got a model, it's got five variables.
What would happen if I only had two? Which ones would I be able to take out? Would I get the same kinds of results? Or what would happen if I was able to get perfect data? Would I be able to predict any better with all of that? So those are both terrific examples of that. And notice here, this is assumptions, after the right technique and the good data. So none of this is easy. This is really where it all comes together. We had a final thought on this, right? Don't let this be you. That's right. So we, of course, would love to find out what your negative equity is, but that's probably not the thing you want to sell door-to-door, right? So as we move into the last 10 minutes here, let's talk about what you see coming down the line. We've had predictive analytics work well in some industries and not so well in other industries. Gaze into your crystal ball and tell us about what you think is happening. Okay. I personally think, and this is my thought, that we're experiencing an evolution in analytics, like the evolution that manufacturing went through before, during, and after the Industrial Revolution. And related to that, I just want to come back really quick to the big data comment that I made. It's a necessary critical foundation that we have. It's like with ERPs. When I was at Ernst & Young, we had to put in the enterprise resource planning systems. And it was very lucrative. It was very popular back in the day. And then once it was done, everybody assumes that it's a key piece of having an organization's departments functioning interdependently. And the same thing when we were with the banks and we started setting up analytics to detect fraud. That was super hot for a while. Now you just assume that your credit card is not going to work if you're traveling and you forgot to tell your provider. So I just wanted to back up on that. But in some ways, I think what we're doing is going through this evolution in analytics. So in the beginning, we had these handcrafted, right?
And so I'm using the analogy of a model, sorry, of a hammer, right? So, you know, we had a stone. Then what we moved into is someone makes you a hammer. If you need it, you say, here, I need it, and they make it for you and keep going. They become handcrafted, artisanal, customized for what we want. Then the next stage is we started moving into mass production. Now it's standardized. It's scalable. We've got consistent quality. We've got lower cost. We might have lost some customization, personal touches, but in general, it's working really well. Then the same thing with mass production with features. Now what we'll find, I believe, is the same thing with analytics. We used to have handcrafted models. For my dissertation, I sat there and did it with path analysis to avoid the accumulated errors associated with regression analysis. Not exciting. Then we moved into mass production. So we're moving into the standard. I should keep pressing the buttons, if you want. And now we're looking at scalable, consistent quality and lower cost. And I think what we're moving into now is the mass production with features. And we don't even know what we're going to find next because of these technologies. I reckon we're going to have something disruptive coming again. And it should be really interesting to see what evolutionarily will hit next. So you look at that on a relative scale. You're saying that we're moving through these phases. It's also not a uniform movement. It depends on the problem space as well. Healthcare may be at one stage. Info security may be at another stage. And retail analysis and modeling, like the Target example you were describing earlier, may be at a completely different stage. So we can't assume a lot about this. We have to actually have the expertise and consult with people that are really trying to do this in the right way. Absolutely. And it does boil down partly to the carrot and the stick.
If there's money involved that can be made, or if there are regulatory requirements that people want to comply with to avoid penalties and fines, that's what drives it. Oh, and the point of that little caveman is basically not to make a joke about predictive analytics, but a caveman lugging his hammer was probably never able to predict that we would have a jackhammer, a huge automated hammer. So how do we know what our future is going to be? It's kind of exciting. You know what? You can't tell. That's a plug for your company, so absolutely get that in there for sure. And let's talk now about checklists: what types of things, if you're thinking about moving into this area, you need to consider. Okay. Basically, we're going back to our little statement: data plus statistics plus assumptions equals successful predictive analytics. So we touched on, in terms of data: what's your source? What, when, where, why, and how was it acquired? And document it, right then. And what you want to do is go through this checklist. If you're a manager, you want to go through it with your analysts and make sure that your analysts understand what they're doing and why they're doing it, before you start the project. And you can have frequent check-in points, because especially when you're in the middle of the data, sometimes you lose sight of the forest for the trees. If you're an analyst, take the initiative to start reviewing this with the manager. And again, go through it before you start, so you can make sure you understand the business goals and the manager can understand what you're working with. A lot of times they say, oh, you know, you're smart, you're going to be fine. Well, actually, no. These are the steps I'm going to have to take for cleaning my data and handling missing data. How am I going to handle my outliers? What do I have in the data and what are my variables? And also the generalizability to the population.
Your manager may assume that if you come up with a model, kind of like I talked about with the debacle with the other location for fraud, that it would apply. You want to make sure that they understand clearly what the data can and cannot do. Statistics, again: the rationale and implementation. And those assumptions, that really is key. List them and describe them. Document what the implications are if they're not valid, both individually and in combination. That's where you can really find some interesting things. Clearly articulate what conditions would make your assumptions not valid, and the variables that we talked about that you could include or remove. So that's something you want to work on collaboratively before you embark on the initiative. Okay. Yeah, this is a favorite. This is just really quick. You want to make sure that you have your data analytic factors in place, your organizational factors in place, and there are a couple of success factors for you. So basically the data analytic factors are looking at how you're dealing with and analyzing your data. Okay, so the implementation strategies. Peter loves to talk about this: the organizational thinking must change. You need to crawl, walk, and run. There unfortunately are no silver bullets for anything. Okay. And you want to develop a scalable analytical solution. You want to start small, achieve your success, and then build. For organizational factors, you have governance models. You also want to make sure that your data and IT are aligned. And do you want to talk about the CDO really quick? Absolutely. Absolutely. So if everybody's in charge of your data and data is everybody's responsibility, it's nobody's responsibility. And data tends to work well at the workgroup level, but taking all those workgroups and aligning them toward strategy has been problematic.
So that's the need for this enterprise data executive, top data job, chief data officer, whatever we're going to call it: put somebody in charge of it. Absolutely. And then the success factors. We call it SLOT. The S stands for small: you want to start with repeatable and scalable solutions to show your progress. The L is the low-hanging fruit: issues that are meaningful to your organization and your bosses, and relatively straightforward to address. The O is for outcomes: you want to make sure you keep your eye on your outcomes, and again, with the data, it's really easy to get lost in the process. What I like to do is actually draft my report first. I think I mentioned that. I'll draw pictures of what I think the data would look like, and possibly not look like, supporting my hypothesis and not. Draw the charts out. Then you can really be guided on what data you're going to need and how it needs to look. In terms of T for time, I suggest doubling. Me personally, I triple it. Okay. Sometimes you find that there's a part of your project that takes a little longer. And finally, tying it all together, as Stephen Covey always says, begin with the end in mind. So that is hopefully helpful to you. All right. Well, we are right at the top of the hour. We have included some additional references in here on this next slide. There we go. With materials to take a look at, there are a couple of conferences here. One I don't see... oh, you've got it mentioned here, Predictive Analytics World. But there are lots and lots of places that we can go for this. Unfortunately, we're out of time for the presentation part of this, but we move now into the Q&A part and see what sort of questions you guys have for Dr. Stephanie Byrd. And of course you can go to Enterprise DATAVERSITY in September for this. Oh, that's right. Yeah, I forgot, you could do a session on all that. Yes.
So, just to answer the most common question that we received: I would just remind everyone, I will be sending a follow-up email for this webinar by end of day Thursday with links to the slides, a recording of the session, and anything else requested throughout. And if you want to submit some questions, submit them in the Q&A in the bottom right-hand corner of your screen. And starting off: do you have a favorite tool for modeling and getting predictions? For example, SAP, SPSS, et cetera. I've never heard anybody suggest SAP for that, have you? Well, you know, SAP is great for enterprise resource planning. So once you've got your data, what I'll typically use, there are three that are pretty hot: SPSS, that's what I grew up on. I've been using it, and I'm not going to say since when, but it's a long time ago. So I'm extremely comfortable with SPSS. It's changed a little bit since IBM bought it. There's a little bit more marketing around it. SAS is fabulous. The way you can look at your data is just stunning in SAS. It's kind of like some people are Mac or iPhone users. This is my assessment: I'm more comfortable with SPSS than SAS. It's just, it's almost too intuitive. Setting up the data is hard for me. Tableau is something that I want to check out in the near future. But again, it goes back to: what are you comfortable with? Where do you get the best answers from? And what does your business need? And let's add another little piece in there on the SPSS piece. Have you had a chance to play with Watson yet, Stephanie? No, my account went dead. I didn't validate it within the 24 hours. We'll have to work on that, because the Watson piece is actually wrapped around SPSS. And since I know you like that so much, any of you all that are listening can go to Watson.ibm.com and sign up for your own account on Watson.
IBM's very happy for you to do that, because of course they get to look over your shoulder at whatever it is that you're doing on that. But it's a really, really cool set of technologies that kind of takes us into that next-generation stuff you were talking about. It is extraordinary. What I would love to be able to do is have the data around me, you know, where you can walk into the data. Yeah, there was a movie a while back that had Demi Moore and Michael Douglas in it that had that visualization component. I forget what it was called, but it had some really interesting theories about how people could look at data that way and really see what was happening. I'm not sure we're quite there yet, but maybe some of the VR work will get us there. Sorry? Yeah, it'll come. And our attendees are quiet today. I think... Oh, dear. I know. Uh-oh. Are you people snoring? That's why they're on mute. Ping us if you're awake. Give us a sign of life. If you're already... Oh, we have a question on your cageless shark diving. No. I didn't think we'd get back to that one. Aw. How stupid was that? All right, the answer is I clearly did not include in my variable list of predictors the terror factor and how that would impact the dependent variable of the experience of shark diving. It was a bet with one of my friends, and neither of us would back down. So we went down to South Africa and we found one place that will let you do cageless shark diving. Of course, my friends pointed out, oh, your insurance is probably not going to help. And you can't have any cuts, because of issues with blood or anything related to that. You can only go down a certain number of days after the sharks have been fed, so they're not hungry. And we did it. And it was the most horrifying experience I have ever had. So if you decide to go diving, do it in a cage. That's fun. It's safe. Cageless was not the way to go. What kind of sharks? All kinds of them. Wow. They had... it was horrifying. It was absolutely horrifying.
Well, getting back to the topic at hand. Stephanie, do you have any experience with R? It seems to be a hot topic lately. R. I do not. And that's where, when you tend to play with the messier things, sometimes you don't get to play with the hot topics. Is the person asking using it? We may not get a response on that, but we'll see. Oh, yeah, they just typed back: very limited. R is one of the... yeah, okay. So it's certainly worth investigating. Again, as Stephanie said a couple different times in the presentation, it depends on what business problem you're trying to solve. I do know that the students that I'm working with in these areas, if they're very fluent in R, will say something like, oh, well, you know, I can set up an R program that will do this, that, and the other thing, and then I point out to them that they can actually lose a lot of what they're trying to accomplish. Because remember, Stephanie talked specifically about writing things down, beginning with the end in mind, to quote Stephen Covey, et cetera, et cetera. That gives you an idea of what's supposed to happen, whereas with R, what you're often doing is manipulating the data in real time. And while there's nothing wrong with that, traceability suffers in that sort of situation. Your actual dataset becomes more fluid, and recreating your results can often become difficult. That's not to say that R is a good or bad tool, just as SAS is not a good or bad tool, or SPSS is not a good or bad tool, but what you're trying to accomplish should be driving all of these things, rather than, oh, I think I'd like to learn R because it's needed on a resume somewhere. That being said, if you find something that you are entertained by: I like to try and set aside Fridays, Friday afternoons, as my time to play. Recently it hasn't happened, but I have a whole list of things that I want to do starting in 2017, and that Friday afternoon time is when I can do that.
So if you want to learn about it, set up a time, schedule it on your calendar, and go for it. Because sometimes what you'll find is that when you're intuitively attracted to something, there's an underlying reason you like it, and it may come in handy. Can I stump the class? No, but they say they're awake. That's good. Okay. Any experience with Weka? Weka. I don't know it. That doesn't mean anything to me. I'm going to Google it real quick, but I do not know Weka. Machine learning. This is great, though, because you guys are giving me ideas for my Friday afternoon fun time. Weka is a collection of machine learning algorithms for data mining tasks. Well, obviously neither of us has any experience with it, but if you Google it, the first thing that pops up is more information about it. So y'all taught us something. That's cool. We like that. So, Stephanie, where do you see data science going? What are the next steps? I think, like I mentioned, that we're in this evolutionary stage. I think we're going to find that the basic analyses we've done are more generally accepted. We just assume now that we have data from customers, from phones. And I remember at Ernst & Young, we were doing an e-commerce transformation survey, and we had predicted, this was a while back, again, I'm not going to date myself, that m-commerce would be ubiquitous. And we said, you're going to be able to buy things from your phone. And we might even have these tablet things, something that moves, that's not connected to your desk. And people thought we were insane. Now that is ubiquitous. We know that we're going to get data from these multiple channels. So I think we're going to find our assumptions of more data increase. I think a bottleneck is going to be the cleaning and the managing, and getting it to the point where we can analyze it.
And I think we're going to find, rather, we're seeing now more applications that are addressing point solutions and then starting to work their way into the mainstream. All right. Well, those are all the questions that we have for today. Peter and Stephanie, thank you so much for this great presentation. That was just fabulous. Stephanie, thank you so much for joining us. It was exciting to have a data scientist on and talk about predictive analytics. Oh, it was my pleasure. If anybody wants to email, I'm better with email than calling, but we can set up a time if you have questions or you want to talk about the profession or anything like that. I'm absolutely delighted to help. It's a really fascinating area, so whatever we can do to help, let us know. I love it. I will make sure... Yeah, I'll get that out as well in the follow-up email. So just a reminder to everyone, I will send the follow-up email by end of day Thursday with links to the slides, links to the recording, and I will include the connection information here, the contact information. So I hope everyone has a great day. Thanks to all of our attendees for taking the time to participate and again, Stephanie and Peter, thanks so much for this great presentation. Thank you, Shannon. Thank you, Shannon. Cheers. Okay, bye-bye.