Hello and welcome. My name is Shannon Kemp and I'm the Executive Editor of DataVersity. We'd like to thank you for joining this DataVersity webinar, Enhanced Predictive Modeling with Better Data Preparation, sponsored by Alteryx. Just a couple of points to get us started. Due to the large number of people attending these sessions, you will be muted during the webinar. For questions, we will be collecting them via the Q&A in the bottom right-hand corner of your screen. Or if you'd like to tweet, we encourage you to share highlights or questions via Twitter using hashtag DataVersity. If you would like to chat with us or with each other, we certainly encourage you to do so; just click the chat icon in the upper right for that feature. As always, we will send a follow-up email within two business days containing links to the slides, the recording of this session, and additional information requested throughout the webinar.

Now let me introduce our speakers for today: Ritu Jain, Director of Industry Marketing at Alteryx, and Dan Putler, a.k.a. Dr. Dan, Chief Scientist at Alteryx. Ritu has over 18 years of experience in retail supply chain, both as a practitioner and as a technology marketer. Most recently, she spent over 12 years leading global marketing strategy for various markets and industries, including retail, supply chain, and small and mid-sized businesses, at SAS Institute. Prior to SAS, she held a number of positions in retail supply chain management, leading strategic planning, operations, and production management with William E. Connor and Associates, a global sourcing partner to prominent retailers such as Dillard's and Pottery Barn. Dr. Dan is the Chief Scientist at Alteryx, where he is responsible for developing and implementing the product roadmap for predictive analytics. He has over 30 years of experience in developing predictive analytics models for companies and organizations covering a large number of industry verticals, ranging from the performing arts to business-to-business financial services. He is co-author of the book Customer and Business Analytics: Applied Data Mining for Business Decision Making Using R, published by Chapman and Hall/CRC Press. Prior to joining Alteryx, Dan was a professor of marketing and research at the University of British Columbia's Sauder School of Business and Purdue University's Krannert School of Management. And with that, I will give the floor to Ritu to get today's webinar started.

Hello and welcome. Thank you, Shannon, I appreciate that. Good morning, everyone, and thank you for joining us today. In today's session, I'll spend about five or ten minutes on who we are and what we do, and then I'm going to pass it on to Dr. Dan, who is going to talk about the key factors you should consider when building predictive models. Within that context, he'll talk about the importance of having a very clearly defined use case, starting with the right data, the importance of data hygiene, and predictive modeling and how you choose the right modeling technique. For those of you who are not very familiar with who Alteryx is, we're a leader in self-service data analytics. We have over 300 associates across North America, Europe, and Australia, and over 1,500 customers worldwide. Some of our marquee customers include companies like Grover, McDonald's, Walmart, Ford Motor Company, JPMorgan Chase, and others.
We have a very loyal customer base, as evidenced by our 95-percent-plus renewal rates and strong backing by investors, so we're constantly funding innovation, making sure that we are providing the latest and greatest capabilities for our customers. Unlike some of the legacy analytics providers like SAS or SPSS, which is now part of IBM, whose products were designed for specialized programmers or PhD statisticians, Alteryx was designed for line-of-business users, so that they themselves can pull data from any source, irrespective of where it lives: in a cloud environment, in your Excel spreadsheets, in your databases, internal or external, or in Hadoop clusters. You can prep it, blend it, standardize it, and cleanse it in the same Alteryx workflow. You can enrich it with prepackaged data: we provide demographic and segmentation data from Experian, spatial data from TomTom, firmographic data from Dun & Bradstreet, and population data from the Census. Then you can perform analytics, spatial, predictive, or statistical, any kind of model, using the same workflow. And then you can share the output in the format you desire: in Excel or PDF, through the data visualization tool of your choice, like Tableau, Qlik, or Microsoft Power BI, or you can parameterize these workflows and share them as apps, either within your own private cloud environment or through the Alteryx Analytics Gallery. With that, I'm going to pass it on to Dr. Dan. But for those of you who have not seen the tool in action, I would strongly encourage you to go to alteryx.com/trial, download a free copy for a 14-day trial, and just play with it and see how easy it is to build these models, and to prep and cleanse your data, without writing a single line of code. With that, I'm going to pass it on and have Dr. Dan talk to you. Thank you.

There we go. Hello, everyone. As Ritu has indicated, I'm Dan Putler. I'm going to start out today by talking about how you think about running a project that involves predictive analytics. A lot of those issues revolve around the use of data: the proper use of data, determining what data to use, what you want to do for data hygiene, how the nature of the data influences the nature of your analysis, and a number of other things. To get started, probably the most important thing is that it's really critical to understand the underlying business issue that you're dealing with. Specifically, what you're trying to do is determine what specific decision needs to be made and what information is needed to inform that particular decision, and that may be raw information. If you look at what goes on in business intelligence reporting and dashboarding, it's really just looking at the initial, raw information. But in a number of instances that information isn't already available, so you have to rely on other methods to get the needed information to inform that decision, and that typically involves some element of predictive analytics. Now, when we look at predictive analytics, what we're ultimately trying to do is create an underlying model of some sort of process, a process involving customers or something along those lines.
And so in determining the appropriate data to use with that, it's typically really useful to come up with some sort of mental model of the process, because that helps in determining what the potentially relevant information is. You can live with just what you have, but in some instances you won't have the set of information you need, and you may need to add additional information. In other cases, you may be including information that's irrelevant to the process you're trying to model. It may have spurious correlation associated with it, a relationship that appears to exist but doesn't really, and that can lead you astray. So when you think about what you're doing from a modeling point of view, you really need to have developed some sort of mental model of the underlying process. The type of analysis you do also becomes critical in providing the exact information you need to inform a particular decision, so you need to look at the decision that needs to be made in order to determine the appropriate modeling methods to use. Now, having talked a little about business understanding, the way I'm going to motivate the rest of this discussion is with a couple of use cases, to see how this plugs in. These use cases come out of things I have been involved with over time. They've been disguised a little bit, but the underlying learning associated with them is still relevant. So we'll talk about things from a business use case point of view, but then we'll also move into issues of data hygiene, and how data hygiene begins to diverge for predictive analytics as opposed to business intelligence reporting and dashboarding. We'll talk about some of the gotchas in data hygiene that exist for predictive analytics but really aren't much of an issue from a business intelligence reporting point of view. The first of the two use cases I'm going to talk about today is: how much electricity does a utility need to have the capacity to supply for any given hour tomorrow? This comes out of something we worked on with one of our customers, and it provides a really nice illustration of the predictive modeling process. The second one is also based off of projects I've been involved with, in this case a number of them. The specific thing we're going to look at is a specialty outdoor sports retailer, and the question it's interested in is: to which of its customers does it want to send a paddling sports catalog? Paddling sports, for those who don't know, involve canoeing and kayaking, so it's a pretty specialized undertaking as a recreational sport. So let's jump into the electricity supply case for a moment. The question, how much electricity does the utility need to have the capacity to supply for any given hour tomorrow, actually has two underlying decisions associated with it. The first is: which of our existing power plants should we start to bring online or take offline? The idea with electricity is that you really can't easily store it; you have to have so much electricity on the grid to meet your underlying demand.
As a result, you need to plan how much electricity is going to be available for people to pull off of your local grid. You can at times have too much electricity in the grid, so electric utilities spend a lot of time figuring out how much power they want to supply at any given moment in the day. The second decision is that they don't necessarily have to produce it all themselves; they can also purchase it from the spot market. The way spot markets typically work is that they are what are known as day-ahead markets, so you have to decide the day before: do you want to bid on electricity, and how much do you want to bid on? Those are two really specific questions that are critical for running an electric utility. The critical information that needs to be known in this case is how to equate supply with demand, so the utility really needs a sense of how much electricity is going to be demanded in each hour of the following day. Now, one of the tricks in this particular case is that you need to make these decisions the day before, so you can appropriately bring power generation online, take it offline, or make bids in the spot market. By the time this information is truly known, it's past the point where you've made your decisions. As a result, you can't work with the true data; what you have to work with instead is some sort of predicted or forecasted data. You need to forecast the likely demand in any hour tomorrow to make sure you have enough capacity to supply electricity, and the only way you can really do that is through the use of some sort of predictive model. So that becomes a critical element here. The other thing that goes into this is that we're interested in what factors are likely to drive the demand for electricity in a given hour tomorrow, because that's really what we're trying to forecast. This is where a mental model comes in really handy, because it allows you to think about what factors are going to drive electricity demand the following day. If we were in a smaller venue, what I would do right now is put on the professor's hat and have a Socratic sort of discussion. That doesn't work well in this particular venue, so instead I'm going to take a few seconds and have everyone develop their own mental model of what's going to drive electricity demand, so you can compare the factors we came up with against the factors you came up with. I'm going to be quiet for 20 seconds or so as you think through the factors that are likely to drive electricity demand tomorrow. At this point, hopefully you've come up with some ideas, and this is the set that we think are important. We've actually sat down and built models of this, and it turns out these factors yield models with high levels of predictive efficacy. The first thing that matters is the day of the week, and if you think about that for a moment it makes sense: you would expect electricity demand to be lower on weekends than on weekdays.
It turns out there's a very strong pattern in a lot of instances: the peak demand days for electricity, at least for the utilities we've worked with, run between Tuesday and Thursday. It drops off a little bit on Monday and Friday, and weekends are comparatively lower, which makes sense since fewer people are in the workplace; they're more at home, they may be outdoors, and so on. The hour of the day matters a great deal. You'll see spikes in demand early in the morning as everyone is getting ready to go off to work, school, or whatever they're doing for their day, and then a huge spike in the 6 p.m. timeframe as people arrive home from work and are preparing dinner and running the dishwasher and the washer and dryer and all those sorts of things. So the hour of the day has a huge impact on electricity demand. The temperature itself matters, particularly if you're looking at space heating or space cooling; if you're in a fairly warm or fairly cold climate, the temperature in that particular hour is going to matter for your electricity use. The other thing that matters is the preceding hour's temperature, and the reason is that there's some inertia in behavior. You may be sitting at home, noticing it's suddenly getting a little warm; you can bear it for a little while, but at some magic moment you say, okay, it's time to turn on the air conditioning. So there's some latency in behavior, and the preceding hour's temperature turns out to matter as well. And then we also find month-of-the-year effects in many instances. If you compare December to January, for example, the important differences are things like people running Christmas lights. So those are the important factors that allow us to predict what electricity demand is going to be in each hour of the next day. One issue is that, like electricity demand, the temperature in that hour tomorrow, or even in the preceding hour, is not going to be known at the time decisions are made about our generation capacity or what we're doing on the spot market. What we need to do instead is predict the expected temperature in each hour of the day, given the information that is available to us at the time we need to make our decisions. And there are a number of factors that allow us to predict hourly temperatures. In most instances we're going to have a forecast high and low for the following day, coming from the National Weather Service or from a third-party organization such as the Weather Channel or private meteorologists, who can provide high and low temperature forecasts for the following day. We take those high and low temperatures and figure out how, given them, we can model what the temperature is going to be in each hour. A couple of things that begin to matter, if you think about it for a few moments, are the number of minutes since either sunrise or sunset.
The idea is that what you probably care about is the high of the day and how many minutes it has been since sunrise, because we'd expect the temperature to start from a low level and move towards the expected high of the day. The same is true for minutes from sunset: we know the high for the day, and we know that after sunset it's going to begin to cool off, so we can look at that. The other thing we can take advantage of is that there's typically a strong relationship between the weather yesterday and the weather today, so we can use the temperature in the same hour of the previous day. Actually, given that we're making a decision today about tomorrow, and we don't know all of today's temperatures yet, in the models that were created we used the day prior to that, so these are the temperatures from two days ago. In this case, then, what we need is two predictive models. The first predicts the hourly temperatures for the next day, since that's an important input to the question we care most about: predicting hourly electricity use given the temperature and other factors. So those are the things we want to look at in this particular use case.
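To make that two-model chain concrete, here is a minimal editorial sketch in Python (outside of the Alteryx workflow the speakers describe). All of the data is synthetic stand-in data and all column names are hypothetical; a real project would use measured temperatures and metered hourly load:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
n = 24 * 365  # one year of hourly records (synthetic stand-in data)
df = pd.DataFrame({
    "hour": np.tile(np.arange(24), 365),
    "day_of_week": np.repeat(np.arange(365) % 7, 24),
    "month": np.repeat(np.arange(365) // 31, 24),
    "forecast_high": rng.normal(75, 10, n),
    "forecast_low": rng.normal(55, 10, n),
    "temp_2_days_ago_same_hour": rng.normal(65, 12, n),
    "minutes_since_sunrise": rng.uniform(0, 900, n),
})
# Synthetic "truth" so the sketch runs end to end.
df["actual_temp"] = (0.5 * df["forecast_high"] + 0.3 * df["temp_2_days_ago_same_hour"]
                     + 0.01 * df["minutes_since_sunrise"] + rng.normal(0, 3, n))
df["demand_mwh"] = (600 + 6 * df["actual_temp"]
                    - 40 * (df["day_of_week"] >= 5) + rng.normal(0, 25, n))

# Stage 1: predict each hour's temperature tomorrow from information
# known the day before (forecast high/low, two-day-lag temperature, sun).
temp_features = df[["forecast_high", "forecast_low",
                    "temp_2_days_ago_same_hour", "minutes_since_sunrise"]]
temp_model = LinearRegression().fit(temp_features, df["actual_temp"])
df["predicted_temp"] = temp_model.predict(temp_features)

# Stage 2: predict hourly demand from calendar factors plus the
# *predicted* temperature, since the true value isn't known in time.
demand_features = pd.get_dummies(
    df[["hour", "day_of_week", "month"]].astype("category")
).join(df[["predicted_temp"]])
demand_model = LinearRegression().fit(demand_features, df["demand_mwh"])
```

The key design point, reflecting the discussion above, is that the second model is trained on the first model's output rather than on the true temperature, because the true temperature is never available when the supply decisions have to be made.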
Now, if we move to the second use case, we're talking about the outdoor specialty sports retailer who is deciding to which of its customers it should send a paddle sports catalog. We've indicated what the question is: to which of our customers should we send this paddle sports catalog? And it turns out there's a definitive answer to that particular question. It boils down to: if we send a catalog to customer X, how much do we expect them to spend on items from the paddle sports catalog? We need to subtract off our cost of goods sold for those items, and we want to compare that to the cost of sending them the catalog. More succinctly, we want to send it to any customer for whom the full cost of sending the catalog is less than the expected margin dollars, which are the item price less item cost, from the items the customer would purchase from that catalog. Now, while the answer to that question is definitive, it's hard to determine a priori whether a particular customer is going to meet this criterion. So what we need, again in this particular case, is a predictive model of whether or not the return from sending the catalog to the customer is going to exceed the cost, the idea being that if the return is greater than the cost, we want to send them a catalog. We don't know that information beforehand, but we can determine it at least probabilistically, using a predictive model to tell us whether the expected return from that customer exceeds the cost of sending them the catalog. To do this, typically two models are used. The first looks at whether or not a customer will buy anything from the catalog at all: a yes-no, do they use the catalog or not. The second looks at the expected margin dollars we could expect from this customer, conditional on them using the catalog at all. So we can look at this as, again, involving two models, as in the electricity demand use case, although these models are going to be fairly different from one another.
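Once the two models are in hand, the mailing decision itself is just arithmetic on their outputs. A tiny sketch, with purely hypothetical numbers:

```python
def send_catalog(p_respond: float, expected_margin: float, mail_cost: float) -> bool:
    """Mail when P(respond) * E[margin dollars | respond] exceeds the
    full cost of sending the catalog."""
    return p_respond * expected_margin > mail_cost

# A customer with a 4% response probability and $60 of expected margin,
# conditional on responding, is worth a $1.50 catalog: 0.04 * 60 = $2.40.
print(send_catalog(p_respond=0.04, expected_margin=60.0, mail_cost=1.50))  # True
```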
The other thing we need to do is come up with ideas of what variables are likely to drive these two outcomes, and the important thing I want to bring up is that this information needs to be known a priori. We alluded to that in the electricity demand case: we don't know the temperatures tomorrow, but we can forecast them using information that is known at the time we make our decisions. The same thing comes up here: you need to work with information that you know prior to sending out the catalog. When people start doing predictive modeling, they sometimes don't think that issue through, and you can run into problems. We've had situations where variables were included that could not have been known beforehand, and as a result you get problems with the underlying causality: did A cause B, or did B cause A? The way to address this is to make sure the information you use to create the model is known prior to the thing you're trying to predict. So if we come back to the catalog, some things that immediately come to mind are demographic and socioeconomic information: age, income, family status. I start with those because people frequently jump to that conclusion first. Let's take a couple of seconds to think about what else may matter, and I'll go into it in a little more detail in just a second. Okay, so hopefully you've had a chance to think about that. A couple of things really come into play, particularly when you're dealing with these sorts of items, where people are likely to show up in a showroom, and we know this is for paddle sports, so we're talking about kayaking and canoeing and things related to that. An important set of things involves information surrounding location. Very broadly, there's the state that the individual and the store are in. A critical thing for retailing is the customer's travel time to a store, the idea being that, all other things being equal, a customer that is close to you is more likely to take advantage of things than a customer that is further away. And there's proximity to the areas where you would take advantage of paddle sports: the ocean, lakes, rivers. If you live a long way from the ocean, the value of a sea kayak is greatly reduced, but if you turn out to be close to a lake, a kayak may suddenly be interesting to you. So location information matters as well. The other thing that matters a great deal is people's past purchase behavior. In a lot of marketing applications, there are three metrics in particular that turn out to be really helpful in predicting people's future behavior, and they are known as recency, frequency, and monetary value, or RFM. Recency looks at, for a given customer, when is the last time they bought from me? Was it a long time ago, which means they're probably not a good prospect, or very recently, which means they probably are? Frequency asks, within a given time period, say one year, how often have they bought from me? The idea is that the more frequently people have bought from you within that period, the more likely they are to buy from you again. And finally, monetary value: how much have they spent with you in the past? In this particular case, if people have bought from you recently, buy from you frequently, and spend a lot of money with you, they're more likely to respond to this catalog, and they're also likely to spend more on items from it. Now, there is a drawback to the recency, frequency, and monetary value metrics, and it relates to new customers: a new customer may have bought from you recently, but their frequency and monetary value are going to be low. So you have to use these variables carefully, and for new customers you probably want to give them a break-in period where you're not using the RFM measures to evaluate them. The other thing you probably want to do is not use all of the time a customer has been buying from you, but take a specific time period, say one year. If you use a really long time period, you can unintentionally shrink the set of people you send catalogs to or market to, and implicitly shrink down your customer base. So while these measures are very powerful, you have to be careful how you use them; but if you think it through, you can use them judiciously and get all the positive benefits of these really very strong predictors of behavior. The other thing that needs to happen here, since we're talking about a particular catalog, the paddle sports catalog, is that we need appropriate observations on an appropriate target variable: did people buy from a catalog of this type, and how much did they spend when they did? There are a couple of ways to get this. You can use appropriate historical data: if this is a yearly event where you send out the paddle sports catalog, probably right about this time of year, in late March or early April, and you did this last year, you can take advantage of last year's data to figure out who's likely to respond to the catalog this year.
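The RFM measures described above are straightforward to compute from a transaction log. A minimal pandas sketch, with a hypothetical table and a fixed one-year window, per the caveat about not using all of history:

```python
import pandas as pd

tx = pd.DataFrame({  # hypothetical transaction log
    "customer_id": [1, 1, 2, 3, 3, 3],
    "date": pd.to_datetime(["2016-01-10", "2016-03-02", "2014-06-15",
                            "2016-02-20", "2016-02-28", "2016-03-15"]),
    "amount": [120.0, 45.0, 300.0, 25.0, 60.0, 95.0],
})
as_of = pd.Timestamp("2016-04-01")
# A fixed one-year window rather than all of history.
recent = tx[tx["date"] >= as_of - pd.DateOffset(years=1)]
rfm = recent.groupby("customer_id").agg(
    recency_days=("date", lambda d: (as_of - d.max()).days),
    frequency=("date", "size"),
    monetary=("amount", "sum"),
)
print(rfm)  # customer 2's lone 2014 purchase falls outside the window
```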
If this is a new activity for you, then you don't have appropriate historical data to take advantage of, but you can run a test. The idea with a test approach is that you take a sample of your customers, preferably a random sample, say 10% of them, and send them the paddle sports catalog, and then you look at who responds and who doesn't. Based on that test sample of customers, you develop a set of predictive models that enable you to predict the likelihood of response and the expected spending levels of the people who did not receive the test catalog but whom you are thinking about sending the catalog to. So it allows you to come up with predictive models for the expected return from each of your customers, based on the test sample of your data. Those are the two use cases. Now I want to get into the nitty-gritty of doing predictive models, and that involves both modeling methods and the data hygiene that needs to go with those models. It turns out there are a lot of predictive methods out there, and they can be overwhelming, particularly for new users. So what I'm going to talk about here are some basic rules you can use to select an appropriate model. I'm not going to go into the modeling specifics in this particular case, but we have a number of materials we can provide you with that go into the nitty-gritty of specific models. We will, though, get an idea of the broad classifications of the underlying predictive models that are out there. There are really two criteria for selecting the final modeling method to use for a particular predictive model. The first is selecting an appropriate modeling method, and by appropriate I mean that its underlying properties are consistent with the nature of the data that you have. That, to be honest, is really related to the thing you want to predict: what type of variable is it? Is it a categorical variable, something like, are you likely to respond to the catalog, yes or no? Are you likely to buy brand A, B, or C? Something where things fall into a set of discrete categories? Or are you talking about some sort of numeric value? Most modeling methods handle nicely either categorical variables or numeric variables and don't handle the other one well. Sometimes those methods make that distinction very explicitly; sometimes it happens essentially under the hood. The second criterion, for selecting the final model, and hence the underlying method you want to use, is that you want to select the method that provides the greatest predictive efficacy among the set of appropriate models for predicting new data. Ultimately, when you're doing predictive modeling, what you really want is to be able to predict new data. There are modeling methods that can overfit the sample you use to create the model, and when they overfit, they may not do well when confronted with new data. So a standard approach is to create two samples from your data. Frequently they're known as a training sample and a test sample; we tend to call them an estimation sample and a validation sample.
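A minimal sketch of the estimation/validation idea, using scikit-learn on synthetic stand-in data; the two candidate methods here are just examples, not a recommendation:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)
X = rng.normal(size=(500, 4))                                    # stand-in predictors
y = X @ np.array([2.0, -1.0, 0.5, 0.0]) + rng.normal(0, 1, 500)  # stand-in numeric target

# Split into an estimation (training) sample and a validation (test) sample.
X_est, X_val, y_est, y_val = train_test_split(X, y, test_size=0.3, random_state=42)

# Fit each appropriate candidate on the estimation sample, then compare
# predictive efficacy on data the model has never seen.
for name, model in {"linear": LinearRegression(),
                    "forest": RandomForestRegressor(random_state=42)}.items():
    model.fit(X_est, y_est)
    err = mean_squared_error(y_val, model.predict(X_val))
    print(f"{name}: validation MSE = {err:.2f}")
```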
And the idea is you create the model on the estimation or training sample, and then you look at how well it is able to predict new data, the data that wasn't used to create the model, in the validation or test sample. Once you've done that, you've selected the model with the greatest predictive efficacy for predicting new data, and that's the one you want to take advantage of. So you work within the appropriate set of models and then winnow it down to the specific model that has the greatest predictive efficacy. As I've indicated already, the target variable is really critical in determining the appropriate set of modeling methods, and there are two broad types of models we can look at. The first are what are known as classification models, which predict the category into which a case, like a customer, falls. Are they likely to respond to the catalog or not? Are they likely to take action A, B, or C? They fit nicely when you're trying to predict a set of categories. The other, and this is a somewhat confusing term, are what are known as regression models, which are used to predict numeric quantities. Now, one of the things you run into in predictive modeling and statistics in general is that frequently the same terms show up meaning very different things. A very common method for classification problems is what is known as logistic regression, which looks at whether something is going to be in category A or category B. It's known as a binary classification model because there are only two options that can occur: A or B, yes or no. It's really a very common one. Unfortunately, it's called logistic regression, even though it's a classification model. Regression models in this context are oriented towards predicting numeric quantities of some sort. We can break those down further into types of numeric quantities, whether we're talking about integers or something that's truly continuous, but that's for another day. Linking back to the use cases we talked about before: in the case of the paddle sports catalog, whether or not someone is likely to respond to the catalog is a classification problem. There are two outcomes: they respond to the catalog, meaning they use it, or they don't. It's a yes-no response. However, when we look at how much they spent from the catalog, and once we've subtracted off the cost so that we're looking at the margin dollars from each of those customers, that's really a continuous variable, so that's a regression problem. When we look at how much electricity we need to generate in each hour tomorrow, what we really needed to do was predict both the hourly temperature, which is a quantity, and the expected electricity demand, likely in megawatt hours, which again is a quantity. So of the models in those earlier use cases, one was a classification model, the yes-no response to the catalog, and the other three were regression models, since we were trying to predict continuous quantities. So we've talked a little bit about selecting models.
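To make the classification/regression distinction concrete, a short sketch on synthetic data: logistic regression predicting a yes-no category, and a linear regression predicting a numeric quantity:

```python
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

rng = np.random.default_rng(2)
X = rng.normal(size=(200, 3))
responded = (X[:, 0] + rng.normal(0, 1, 200) > 0).astype(int)  # yes/no category
margin = 40 + 10 * X[:, 1] + rng.normal(0, 5, 200)             # numeric quantity

# Despite the name, logistic regression is a classification method:
# it predicts which of two categories a case falls into.
clf = LogisticRegression().fit(X, responded)
# A regression model, in this context, predicts a numeric quantity.
reg = LinearRegression().fit(X, margin)

print(clf.predict_proba(X[:1]))  # class probabilities for one new case
print(reg.predict(X[:1]))        # predicted margin dollars for the same case
```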
We could have spent a whole lot more time on that, and we can provide you with ways of learning more about it. But what I want to do now is move into data hygiene. It turns out data hygiene for predictive analytics is really a bit more demanding than it is for business intelligence reporting or dashboarding. Lots of things that are just fine for building a dashboard will cause havoc when you're trying to develop a predictive model. The gotchas associated with data hygiene when creating a predictive model really fall into three different areas. The first is fields with missing values. In lots of cases, people have data where a lot of values are missing from the underlying records. You can deal with that in business reporting really easily by creating a category in the summarization that says "unknown" or "missing," and you can get counts of the number of records where you had missing information. When we look at creating predictive models, the problem you run into is that with these missing values, many methods will simply drop those records. So if I have 100 records that I'm using to create a model, and I have 10 missing values in one field, those records get removed from the data and I'm left with 90. That's true for some predictive modeling methods; others are robust to missing values. So one thing you can run into is that you think you're creating two comparable models, but if they treat missing values differently, you're creating models that are really quite different from one another, because they use different data. You're suddenly making an apples-and-oranges comparison, and that can create problems. The other thing that can happen is that if a number of fields have missing values, and you're using a method that drops records with missing values, you can get unlucky: because of the missing values spread across multiple variables, you can wind up with no data at all, because every record has been removed from your analysis. We have seen that happen to our customers more times than we care to talk about. Where this can get really critical is if you're using household data provided by third-party providers such as Acxiom or Experian; in some cases they don't know information about people, so some of the fields they collect have missing values. If you include enough of those variables, and each of them has different records missing, you get into trouble: we've seen a number of instances where people started with what they thought were 10,000 records and, because of all the missing values, at the end of the day were left with nothing. So it becomes a big issue. As we'll talk about in just a moment, you can deal with it, but it's a critical thing. Now, the other thing to understand is that no method can address records with a missing target variable, the thing I want to predict.
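The record-loss arithmetic behind that "left with nothing" scenario is easy to demonstrate. A sketch with synthetic data, assuming 20 third-party fields that are each independently missing about 30% of the time:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(3)
# 10,000 records with 20 third-party fields, each ~30% missing.
df = pd.DataFrame(rng.normal(size=(10_000, 20)))
df = df.mask(rng.random(df.shape) < 0.3)

# Listwise deletion keeps only fully complete records: roughly
# 10,000 * 0.7**20, i.e. single digits, survive.
print(len(df.dropna()))
```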
So in that case, if I don't have the target variable reported for a particular customer, I really can't take advantage of that record. The next thing that comes up is that categorical variables turn out to be particularly problematic, and a lot of business data contains a lot of categorical variables, so you need to be really cognizant of this. One thing you frequently run into is categorical variables with very little variability. We've seen instances where 99% plus of all the records in a particular column of the database have exactly the same value. That usually isn't a good predictor, simply because there's no variability across your records on that category. The other place things become a problem is where people have created a really large number of categories for a particular field, and at the end of the day a lot of those categories have only one or two, or maybe a handful, of records associated with them. That's problematic because you run into statistical reliability problems: when you have so few observations, you've got a lot of random sampling variability, and that can lead you astray. The next issue arises when you go to use those models. You created your model with a sample of data, and when you go to other records in your database, you run into new values of those categories that didn't exist in the modeling data. In a lot of cases, models can't deal with that nicely; they're unable to come up with a predicted value for those cases because they don't know how to handle that particular category value on that particular variable, so they return a missing value for the record you were hoping to score. The other thing that happens with categorical variables is that frequently people identify a set of categories using a set of integers. I may have regions of the country that I didn't label north, south, east, and west; instead, I labeled them one, two, three, and four. In that case you can run into problems. For a target variable, there are modeling methods that will handle both classification and regression problems, but under the hood they use different sub-algorithms for the classification portion versus the regression portion. If you've got a set of categories identified as integers, they'll be treated as if they were numeric values, so the method will use the regression algorithms rather than the classification algorithms that are really appropriate in that case. By doing that, you can inadvertently wind up using an inappropriate modeling method. The same issue arises with how predictor variables are handled: it changes based on whether something is actually a set of categories or truly a set of integer values. In a lot of models, if I treat my regions one, two, three, and four as a set of numeric integer values, there's an expectation that as I go from one to four the response will either tend to increase or decrease.
But in reality they're a set of groups; they don't have a numeric scale to them, and they're not going to be handled properly if they're not identified as categories rather than integer numeric values. The next set of things in data hygiene is how to address the issues we just identified. If we look at missing values for predictor variables, it makes sense to impute the missing values. In the case of numeric variables, a fixed value such as the mean, median, or zero is commonly used. In addition, you can set up a separate, sidecar categorical variable that indicates whether any given observation was imputed versus actually available to you. My basic recommendation in this area is to choose a value that is not going to matter a great deal, and frequently, for a lot of model types, zero is a good choice. That's particularly true if the expected values of your variable are going to be all positive or all negative, because then a zero value sits at one of the extremes. So zero frequently is a good choice. The other thing you really want to do is include an indicator variable, because the value wasn't truly zero; to deal with the fact that I don't have information in this case, I gave it a value of zero and then created a categorical variable with the values yes and no: yes, this value was imputed for this variable, or no, it wasn't. That categorical variable can, in some sense, compensate for the fact that you have this missing information. Interestingly enough, in some applications, knowing something is missing is actually important information in itself. One classic example is people applying for credit cards: if they don't report something like their income, you can give it a value of zero, but the fact that they have hidden their income from you is probably more informative than the income value itself, because it indicates they're trying to hide something from you. In terms of missing values for categorical variables, this works pretty easily: it's simple to create a new category indicating a missing value, and that's really my recommendation. If you want an informative value, some people will argue for using the mode, the most commonly reported value, for that particular variable; but again, my recommendation is to indicate that it's a missing value and label it with its own special label. If you go to more advanced methods, there are model-based approaches for replacing missing values with predicted values based on the other available data in your data set. And finally, if you have missing data for the target variable, my recommendation is to filter out that particular set of records.
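Pulling those imputation recommendations together, a minimal pandas sketch with hypothetical fields: a zero fill plus a sidecar indicator for a numeric variable, and an explicit missing category for a categorical one:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"income": [52_000.0, np.nan, 75_000.0, np.nan],
                   "region": ["west", None, "south", "east"]})

# Numeric field: record which rows were imputed in a sidecar indicator,
# then fill with an innocuous fixed value. Zero works here because every
# true income is positive, so zero sits at one of the extremes.
df["income_imputed"] = np.where(df["income"].isna(), "yes", "no")
df["income"] = df["income"].fillna(0)

# Categorical field: give missingness its own explicit category.
df["region"] = df["region"].fillna("missing")
print(df)
```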
Addressing problematic categorical variables, really quickly: how you address a categorical variable that's dominated by a single category depends on how much data you have available. If you actually have a lot of records, then you can live with the fact that 99% of the people are in category A; if you've got a reasonable number of records in categories B and C, say at least 20 for each of them, you can take advantage of those and include that predictor variable. If you only have one or two records in categories B and C, it probably makes sense to exclude that field as a predictor, because for things outside of the dominant category there just aren't enough records to get an idea of the true underlying effect of that variable. In the case of categorical variables with very few records in some of the categories, what you really want to do is think about combining those categories together, and I'm going to argue that you want to combine them on a logical basis. Think through what those categories mean and which of them logically make sense to put together. One thing you could potentially do instead is look at categories that have a similar relationship with the target field: gee, this category had only three people and all of them said yes, and this other category had only two people and all of them said yes, so let me combine those two. The problem is random sampling variability: with a really small number of observations in each of those categories, it may be through just dumb luck that you saw a strong positive effect. So I always encourage the use of logical considerations for combining categories, as opposed to the relationship with the underlying target. And then finally, fields that use integer values to identify different categories: it's really simple, change their data type to some sort of string type, and then they'll be identified properly by most statistical and modeling software.
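A sketch of those two categorical fixes, again with hypothetical fields: converting integer codes to strings so models treat them as groups, and combining sparse categories on a logical basis:

```python
import pandas as pd

df = pd.DataFrame({"region_code": [1, 2, 3, 4, 1, 2],
                   "channel": ["web", "web", "store", "mail", "fax", "web"]})

# Integer-coded categories: change the type so models treat the values
# as unordered groups rather than a one-to-four numeric scale.
df["region_code"] = df["region_code"].astype(str)

# Sparse categories: combine on a logical basis (mail and fax are both
# paper-based remote channels), not on their relationship to the target.
df["channel"] = df["channel"].replace({"mail": "paper", "fax": "paper"})
print(df)
```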
Key takeaways from what we've talked about, so that we can hopefully get a few questions in as well. Clearly define the business issue; the easiest way to do that is to create some sort of mental model. Start with the right data, because it's critical for developing a predictive model with a high degree of predictive efficacy. In terms of data hygiene requirements, understand that they're a lot more stringent for developing predictive models than they are for business intelligence reporting and dashboarding. Variable type really matters a lot, both for selecting an appropriate modeling method, whether we're dealing with a classification model or a regression model, and when we're imputing missing values and interpreting the underlying variables appropriately. Finally, the volume of data can be critical when addressing problematic categorical variables. If most of the records fall in one particular category but you have enough records in the other categories, the field can still be a useful predictor. If too few people fall into those small categories, there's not enough statistical reliability in what you're doing, so they become really problematic, and you need to think about whether to combine them together or not use that particular predictor variable. And with that, I think we're open to questions. I should say real quickly: if you have questions outside of here, we encourage you to go to community.alteryx.com, which is where the Alteryx community gathers and where we can provide advice and information. You can also message me directly through the community, and an e-mail message will get directly to me. Within the slides themselves there will be links you can take advantage of when you receive them, so that's a good way to do it. The other thing you can do, as Ritu mentioned earlier, if you want to apply some of the things we talked about today, is download a trial version of Alteryx from alteryx.com/trial. And with that, it's probably a good time to open things up for Q&A.

Thank you, Dr. Dan and Ritu, great presentation today. We've got a lot of good questions coming in already. Just a reminder: one of the most popular questions we get is people inquiring about getting a copy of the slides and the recording. I will send a follow-up e-mail within two business days, so by end of day Friday for this webinar, with links to the slides, links to the recording, and anything else requested throughout the presentation, including some of the resources that Dan just mentioned. So Dan, specifically, can you clarify your guidance around the volume of RFM data to use?

Well, on volume, when we talk about RFM, I think the really critical guidance is this: RFM variables are really useful when you are dealing with somewhat more established customers, customers that have been with you for a while. The reason I say that is, if you think about it, a brand-new customer hasn't been around long enough to have much frequency or much monetary value in the past year. So when you base your criteria solely on models that use RFM, you begin to systematically exclude your new customers. In those particular cases, what you probably want to do is not use a predictive model to target those customers, but work with some sort of rule of thumb that says, okay, for the first three months we're always going to select this customer into a number of the things we do, or select them randomly into the things we do, so they can establish what kind of customer they are with respect to those RFM values. When someone has been with you for a while, the RFM measures become extremely predictive of their future behavior, because they've had a chance to signal to you what their nature is: are they going to be a frequent buyer or an infrequent buyer? A new buyer will appear to be infrequent even though they have the potential to be a very frequent buyer; they're just new, and they have to indicate to you what kind of buyer they are. Hopefully that addressed the question.

Certainly.
It certainly sounds like it, and the questioner can certainly submit additional questions for clarification if needed. And I love this next question, near and dear to the hearts of data scientists, certainly: do you have a good reference point for data scientists who are relatively new to the field?

Well, in terms of understanding the nature of the methods and the issues that come up, I hate to do a book pitch, but it seems like a good place to start, and some of this is going to be a little centric to us. The book that I wrote with Bob Krider, who's at Simon Fraser University in Burnaby, British Columbia, and which you referenced at the start of the talk, is a pretty good one for providing the forest-level overview of what predictive modeling is all about, and then describing in greater detail some of the specific methods we didn't cover today. So I recommend that book as a good way into the underlying subject matter. We're also doing a number of things right now, if you want to take advantage of what's going on with Alteryx to dig into it a little further: we're developing a set of what we're calling starter kits that introduce people to different predictive modeling methods in the context of a particular business situation, along with webinars like the one we did today, and we're also looking to do a nanodegree at Udacity in this particular area. So there are a number of resources that way. Beyond the Alteryx-centric and me-centric material, there are a number of books out there that describe this area. This may be a good moment to go to your favorite, likely online, book retailer, be that Amazon, Barnes & Noble, or whatever operates in your particular jurisdiction, look at the set of books available, and read the user reviews on those sites to get an idea of what's going to be a good resource. There are also a number of resources in this area that come out of the R community, and probably the Python community as well; I'm a little more familiar with the R community.

Sure. And we can certainly create and send a link to make sure that everyone knows where to purchase your book; we always love giving out additional resources. So another question, a very specific one: you mentioned at least 20 non-dominant category records, but out of how big a sample was that?

Well, in some sense, what you're looking at here is trying to minimize random sampling variability, so the total sample size isn't so critical. What you're really trying to do is get enough observations for that particular category that you can make some sort of reasonable inference from it. Twenty is a rule of thumb; there's a branch of statistics that looks at these kinds of issues, known as power analysis, but 20 is a reasonable rule of thumb, and to be honest it's fairly independent of the sample size. Now, it's true that if you have a sample of only 20 people and two categories, it's hard to get up to 20 in each; but then, with a sample of only 20 people, you probably don't have enough people to create a model in the first place, since there's so much random sampling variability.
So it gets back to this: random sampling variability doesn't really relate to the overall sample size; it relates to the number of records associated with one particular category.

Sure. And again, a question specifically for you: any guidelines on when to impute or not impute data? It seems imputing might be the wrong thing to do in some cases.

Well, it could be. Again, one of the recommendations I gave is that it's always wrong to impute a target variable, in my opinion, unless it's a very specialized case; I can imagine some pure time series cases where that begins to make sense. If we look at other cases, with predictor variables, this is where it's important to include an indicator variable for whether or not you've imputed a value, because in some sense that indicator is trying to pick up the effect of needing to do the imputation. That's why I gave the recommendation to find a relatively innocuous value, and if all your values are going to be either positive or negative, zero is a fairly innocuous value, but also to include the indicator variable that says, oh, by the way, this record has been imputed. That gets around a lot of the issue. I think I'd have to get a good understanding of a really specific use case to say, yeah, imputing doesn't make sense there. I can imagine some, but it would be a pretty special use case.

Sure. We've got time for a couple more questions; we've got a lot of great questions coming in. This one asks: I'd like to have a look at the workflows for the examples covered; are they available for view?

Yeah, I think what we can do is a follow-up where we would include examples of them. A lot of the modeling techniques we covered are, as I indicated, based off of specific things we did with customers, and so for nondisclosure reasons sharing the exact use cases we based things on will be a little bit tricky. But in other cases we have data and use cases where this begins to work that we've included within the product, where you can take advantage of them. We have a set of what we call samples within the product; they're largely oriented towards illustrating the use of the various tools within Alteryx, but they also intentionally walk through a particular business use case, so you can begin to look at it. We can definitely highlight a number of those as well, so a potential user can get their hands on them.

That sounds great, and it looks like several people are interested in that as well. I'm afraid that's all we have time for today. There are some other great questions coming in, but I will send them over to you so you can address them as appropriate. Thank you, Ritu and Dan, for this great presentation and Q&A. Just to remind everyone, we'll be posting the webinar recording and slides to the DataVersity.net site within two business days, and I will send a follow-up email with links to those, as well as the additional information requested, including Dan's book.
And thank you again to Alteryx for sponsoring today's webinar. As always, thanks for attending and for all the great interaction; we always appreciate our attendees being so involved in everything that we do. I hope everyone has a great day. Thank you.

Thank you, Dan. Thank you, Ritu. You're welcome.