Good morning. Thanks for stopping by. The keynote is full of mind-blowing demos; I enjoy it every time. What's easy to forget in the face of all of that is that underneath, when you do data science or analytics, there are actually some relatively simple concepts. So we're going to look at it from a completely different angle for this hour. There will be no math, no computers even, and the goal is just to take these concepts and make them really concrete, get them in your hands. Even if you're completely new to this, you'll at least walk out with the basic idea of how it all fits together.

We're going to take a little journey, and this is a map we'll be referring to. The beginning of any data-driven project is getting data, or getting more data. Data is just numbers and names. Numbers are things we can measure, like temperature, 38.3 degrees, or count, like the number of people in this room. There are lots of other things you can measure, like the intensity of sound at an instant or the brightness of a pixel. All of these things you can put numbers on. You can add them, subtract them, divide them; they behave like numbers. The other type of data is names: categories or types. These are things like the breed of a dog, a type of hot drink, the name on an aircraft or on a droid. If you change a name even a little bit, you could be pointing to something completely different. So these are two fundamentally different types of data.

Now, to make things extra tricky, there are names that actually look like numbers. Take your phone number and increase it by one. If it were really a number, you'd expect calling it to do something close, like ring the person next to you. But it could ring someone on the other side of the country. A phone number is just a pointer, an identifier. And there are quite a few numbers like that: zip codes, social security numbers, credit card numbers. These are actually names, and if you start adding and subtracting them, funny things will happen. Going the other way, there are names that can be turned into numbers: things with a well-defined order and approximately equal spacing in between. First place, second place, third place. Small, medium, large. These you can pretty easily translate into numbers, and you can even add and subtract them if you're careful about it.
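If you do want to see what this distinction looks like at a keyboard, here's a minimal sketch using pandas, with made-up values; the column names and data are my own illustration, not part of the talk:

```python
# Numbers behave like numbers; names (even numeric-looking ones, like
# zip codes) should be kept as categories or strings.
import pandas as pd

df = pd.DataFrame({
    "temperature": [38.3, 21.0, 25.6],        # a true number: arithmetic makes sense
    "zip_code": ["02139", "90210", "10001"],  # a name in disguise: keep it a string
    "size": ["small", "medium", "large"],     # an ordered name we can map to numbers
})

print(df["temperature"].mean())   # meaningful: an average temperature
# Averaging zip codes would be meaningless, and storing them as integers
# would even drop the leading zero in "02139".
df["size_num"] = df["size"].map({"small": 0, "medium": 1, "large": 2})
print(df)
```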
This process of taking names and numbers, measuring and collecting and storing and searching and transforming them, is sometimes referred to as data engineering. It's a very intense discipline; it requires a lot of careful thought and a lot of software. There's a whole collection of Microsoft tools listed over on the right for doing these tasks. I'm not going to talk about them in this talk, but they'll probably come up quite a bit in the other talks you visit this week. I encourage you to listen and learn all you can about them; they're very cool. We're going to set them aside for now. Similarly, there is a data science process that matches this roadmap very closely and emphasizes the individual tools, with tutorials on how to use them. A link to it will go out with these slides at the end of the talk; I encourage you to check that out as well.

Once we have data, the next step is to ask a sharp question. A sharp question is one that has to be answered with a name or a number. A vague question, on the other hand, doesn't. To keep this distinction straight in my head, I think of a genie you can ask questions. He's bound to give an accurate answer, but he's mischievous, and he'll give the least helpful answer he possibly can. If you ask a question like "How can I increase my profits?" the genie might say: work harder, motivate your employees, outplay your competitors. All great advice, none of it helpful in making an immediate decision. However, if you ask a question such as "How many times will the feature I built get used by a new user?" that cannot be answered except with a number, giving you exactly the information you want. That's the type of question we have to ask our data to get good answers back.

So, we've come up with a sharp question, in this case: what will the stock price of my company be next week? It would be really useful to know that. We can gather a bunch of relevant data: our history of sales in different regions, information about our competitors, our products, how often they're used in the first month and the first quarter and the first year by different users, and perhaps economic information from the rest of the world. What we absolutely need in order to answer this question is historical information about the stock price. For the question we're trying to answer, we need examples of answers to that question from the past. That's how machine learning works: it looks for the relationship these variables have had in the past, so that in the future, when we get all this other information about sales and competitors and markets, we can guess, based on that relationship, what our stock price will be. That's called our target variable, our target data. It's not uncommon to get a data set and then ask a question you don't have target data for, in which case you go back to the beginning and gather more data.

The next step is to take all that data and put it into a table. This merits its own stop on our journey because it's not as straightforward as you'd expect. For our question, what will our stock price be next week, we go and measure stock prices historically, and we get the closing price at the end of every day. This determines the form our table takes: we have one example of our target variable per day, so we want to transform all of our other variables to have one value per day. Every row of this table will have exactly one instance of our target variable. To do this, there are a variety of tricks. Sometimes we have to aggregate. Say we'd like to know the total number of users we have on any given day, but the form this data takes in our database is a list of usernames and the dates they joined. To know how many users we have today, we have to go back through and add them up. That's aggregation. Another common one is distribution. If I want to include information about last month's sales, that value holds throughout this whole month; every row for this month gets the same number. I have to take it and distribute it throughout the table.
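As a rough sketch of what that aggregation step can look like in code, assuming pandas and an invented users table (the usernames and join dates are made up):

```python
import pandas as pd

# One row per signup: a username and the date they joined.
users = pd.DataFrame({
    "username": ["ada", "grace", "alan", "edsger"],
    "joined": pd.to_datetime(["2017-01-02", "2017-01-02",
                              "2017-01-03", "2017-01-05"]),
})

# Aggregate: count joins per day, spread onto every calendar day,
# then accumulate to get total users to date, one value per day.
joins_per_day = users.groupby("joined").size()
days = pd.date_range("2017-01-01", "2017-01-06")
total_users = joins_per_day.reindex(days, fill_value=0).cumsum()
print(total_users)
```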
Another thing we can do is computation. If I want to include the number of days since a product release, but what I have stored in my database is the date and name of each product release, then I have to do some simple calculation: find the most recent product release and subtract to get the number of days. There are some things you can just go out and measure, bringing in new sources of data, for instance the Dow Jones Industrial Average on each of those days. And some things you may not have access to directly, you can estimate. If you have a way to make a reasonable guess, that's actually better than nothing at all, so you can fill the table in with reasonable guesses where you lack hard data. As a last resort, you can leave blanks. There are ways to deal with those, as we'll see in just a couple of minutes. Once you have your data in a table, in what's called a rectangular format, with one row for each instance of your target variable, you're ready to move on to the next step.

Next, we want to check our data for quality. If you've ever worked with real data, it's pretty messy. It's gathered by real people like you and me, and some of us are more careful than others; some of us see the world a little differently than others, and that gets reflected in the data. Here's an example data set to show what I mean. The first thing I like to do when presented with a new data set is go through each of the columns and make sure I understand what's there. Here we have an ID, which looks like some kind of identification number; a first and last name; a birth year; the individual's height; the place they were born; whether or not their identity is secret; information about whether they can fly; their alignment, which, looking at the data, seems to be whether they're good or bad; and whether or not they wear a cape. We can see this data set is nice and rectangular, and by convention, unless you're told otherwise, the target variable is often the very last column. So this data set will help us, in the future, know whether or not someone wears a cape once we know all this other information about them.

Before we can use this, we need to do some cleaning. The identification numbers all look like nice four-digit IDs: no obvious repetitions, nothing badly formatted. Same with the first and last names; they all look like plausible names. When we get to birth year, though, it looks like this was manually entered. There's stray punctuation, lots of different formats, some entries in quotes. We also have a 2287 BC, which is a valid date but difficult for a computer to interpret as a number. To clean this up, to kind of puree it and make baby food out of it for the computer, we need a uniform representation. So we go through, using the tool of your choice, and make one. We even take our 2287 BC and turn it into a nice negative integer. Now the computer can add and subtract these and do whatever it wants to do with numbers to learn about them. Height poses a different problem. These are all nicely formatted, no errors or inconsistencies, but it's a quirky format. Unless the computer automatically knows how to interpret it, it'll just treat it as a string, and it won't know that 6'1" is any closer to 6'0" than it is to 4'0". So we translate it into inches.
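A tiny sketch of that conversion, assuming the heights come in as strings like 6'1" (the parsing details here are my own illustration):

```python
def height_to_inches(text):
    """Turn a string like 6'1" into a plain number of inches (73)."""
    feet, inches = text.rstrip('"').split("'")
    return int(feet) * 12 + int(inches)

for h in ['6\'1"', '6\'0"', '4\'0"']:
    print(h, "->", height_to_inches(h))
# Now 6'1" really is closer to 6'0" than to 4'0", which the raw
# strings never told the computer.
```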
Now it's a nice number; the computer can add and subtract and do what it does with numbers. Birthplaces look good. One of them is unknown, but that's a perfectly okay thing to put. "Identity is secret" is confusing, and this is a really common thing: some entries are blank, some say NA, one says missing, and it's not at all obvious what that means. When you're dealing with data, it's a common experience to be given something like this and scratch your head. What you often have to do is look for documentation: what do these fields mean? What do these entries mean? Often you track down the person who collected the data and ask, hey, what were you thinking? What did this mean? Sometimes you have to do some research of your own to dig into it, and sometimes you just have to make an educated guess. But the goal is a nice, clean, uniform representation. After doing one of those things, we get a yes or a no all the way down the column. It's similar for "can fly." This looks like it was a free-entry field: lots of different formatting and spelling, some numeric grades, some qualitative assessments. We go through and use our judgment and our research to unify this as well. As you can see, this is a very human-intensive process. It's an unsolved problem to do all of this automatically, mostly because we use so much of what we know about the outside world to make it happen. Similarly with alignment: we go through and make a nice uniform representation, and when we're done it just so happens there are three levels. Selina Kyle doesn't really fit in the good or the bad category, so we put her somewhere in the middle. And finally the target variable gets the same treatment: a nice, clean, uniform representation. This is what it means to clean your data: to get it into a form that the machine learning algorithms, with their limitations, can process and make sense of.

Now we have nice, big, clean, rectangular, well-formatted data. Sometimes we still have to do what's called feature engineering: transforming what's there to make it useful. Here's an example of some data: column zero and column one are a couple of features, and column two will be our target variable. We'd like to predict future values of column two given future values of columns zero and one. To see whether there's a relationship, it's sometimes a useful exercise to plot your features against your target variable, so we'll go ahead and do that. On the left we've plotted feature zero, on the right feature one, and in both cases the y-axis is our target variable. We can see that this is really not helpful. If you eyeball a line through each of those clouds, it's flat, which means that knowing the value of one of those features doesn't help you at all in guessing what the target value probably is. And in fact, when I build a model out of it, the coefficient of determination, which says how good the model is at predicting what happens next, where one is perfect and zero is absolutely worthless, comes out very close to zero.

So we take one step back and think about where our data came from. Now I tell you that column zero is the time in hours since midnight that a subway train leaves Central Square Station in Boston, and column one is the time in hours since midnight that the same train arrives at Kendall Square Station, the next stop. And column two, our target variable, is the max speed of that train in kilometers per hour. We know that arrival time and departure time are related to speed, so they must interact. The default thing to do whenever you have two variables that you know interact is to multiply them together to get another variable; that's the simplest type of feature engineering, the interaction term. So let's create this multiplication and plot it against our target variable, in the right-hand plot there. It's actually still not helpful. There's still a big blob, and the best line we can eyeball through it is horizontal. Sure enough, the coefficient of determination is still really close to zero. The default approach did not work on this particular data.

We take yet another step back. Freshman physics: speed is distance divided by elapsed time. The distance is the same in all these cases, because we have our two stops and they're not going anywhere. The elapsed time we can compute: it's the arrival time minus the departure time, the difference between the two. So the features do interact, but not in a multiplicative way; they interact by subtracting one from the other. If we calculate this elapsed time, create a new variable out of it, and plot it against our target variable, we see something very different: a green swoosh on the right. If you eyeball a line through that, then given the difference between feature zero and feature one, you can make a really good guess at the peak speed. And sure enough, the coefficient of determination is now much closer to one: 0.88. This is an example of having to massage your features to get them into a useful form. Machine learning algorithms don't do this automatically. They each have their own quirks; there are certain types of relationships some are better at pulling out than others, and none are good at pulling out all possible relationships. Knowing something about where your data came from and what it means lets you do this beforehand, and it gives your algorithms a leg up.
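Here's a rough sketch of that whole feature engineering exercise in code, on synthetic data. The distances, times, and noise are invented, so the exact scores will differ from the slides, but the pattern is the same:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

def r2(feature, target):
    """Coefficient of determination for a one-feature linear fit."""
    X = feature.reshape(-1, 1)
    return LinearRegression().fit(X, target).score(X, target)

rng = np.random.default_rng(0)
departure = rng.uniform(6, 23, 200)        # hours since midnight the train leaves
trip_hours = rng.uniform(0.03, 0.10, 200)  # elapsed time between the two stops
arrival = departure + trip_hours
speed = 1.0 / trip_hours                   # speed = fixed distance / elapsed time

print(r2(departure, speed))            # ~0: departure time alone tells you nothing
print(r2(arrival, speed))              # ~0: arrival time alone is no better
print(r2(departure * arrival, speed))  # ~0: the default interaction term fails too
print(r2(arrival - departure, speed))  # high: the engineered elapsed-time feature works
```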
Now we have our features. The next step is to answer the question, and here we take a little detour, because this little number six on our map is all of machine learning. We're going to do it real quick, as an overview. There are just five different questions that machine learning can answer, five types of questions.

The first one is: how much, or how many? The answer to this question is always a number. These are questions posed in the form of, what will the temperature be next Tuesday? The answer is 38 degrees. What will my fourth-quarter sales in Portugal be? There's a really concrete numerical answer to that. This family of algorithms is called regression, and they're really common. We'll look at an example of one in just a minute.

The second family answers not with a number but with a name: which category does this belong to? Which label applies to this? For instance, is this an image of a cat or a dog? That's probably the most commonly asked question in vision research. Another is, what aircraft is causing this radar signature? I have my radar feed, I'm getting a blip, and I have a list of things it could possibly be; which category does it belong to? Or, what's the topic of this news article? This family of machine learning algorithms is called classification. It's very common to do classification between just two classes: is this an example of A or not-A? Is there a dog in this picture or not? And then there's multi-class classification: I have a whole bunch of topics; which one, or which ones, apply to this news article?

Our third family answers the question: how does this data naturally break down into groups? Imagine I give you a bag of M&Ms and say, hey, would you please separate these into groups of similar M&Ms? Most of us would separate out the red ones and the green ones and the yellow ones, breaking them out by color. But that's not the only way to do it. We could separate them into primary colors and all the other colors. We could measure them very carefully and separate them by weight, or by sugar content, or by how perfectly the M is printed on them. There are lots of different ways, and they're all valid. There are a lot of different types of algorithms here too. They're often called clustering algorithms, often referred to as unsupervised learning, but what they do is make groups of things, and usually groups we intuitively relate to. This is helpful in a lot of applications. You've probably experienced it: if you're a consumer of online streaming video, your provider has taken your viewing habits and used them to lump you together with a bunch of their other customers. Then, if there's a video that a lot of other people in that lump have watched and enjoyed that you haven't seen yet, it will probably be recommended to you as a "you might also like."
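A minimal sketch of clustering, assuming scikit-learn and three invented blobs of points standing in for groups of similar M&Ms:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(1)
# Three unlabeled blobs; the algorithm is never told there are groups.
points = np.vstack([
    rng.normal([0, 0], 0.3, (50, 2)),
    rng.normal([3, 0], 0.3, (50, 2)),
    rng.normal([0, 3], 0.3, (50, 2)),
])
labels = KMeans(n_clusters=3, n_init=10).fit_predict(points)
print(labels[:10])  # group assignments discovered without any labels
```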
The fourth family answers the question: is this data point weird? This is a really common question. When there's a problem, it's often manifested by some kind of weirdness: something unusual, unexpected, or out of spec. If you're driving a car with pressure sensors in your tires, it would be great to know if there's a really unusual reading in those sensors; there might be a problem. If you're doing internet security and you see some really odd traffic, that might indicate a problem. This is helpful when computers can't diagnose the problem themselves but a human can, and the human can only do so much. So you have the machine learning algorithm look through millions or billions of examples, funnel that down to a few hundred it can hand over to humans, and keep a human in the loop. You've probably experienced this too: our credit card companies monitor our purchases every day and ask this question about each one. Is this weird? Are the last few purchases weird? When my credit card showed sudden purchases of very expensive items from the same vendor in different locations where I'd never purchased before, it sent up a flag that said, this is weird, and it locked my credit card until I could call and straighten it out. They were able to invalidate those charges. This saved me a lot of pain in the long run, and the credit card company as well.

Finally, the last family of algorithms has a little different flavor. These are called reinforcement learning algorithms. They're becoming more common in robotics and automated systems, where a machine gets to decide, over and over again, what to do next. These are usually small decisions, usually without a terrible penalty for getting them wrong, so there's room for trial and error, and they're things that have to happen a lot. For instance, in automated temperature control, the system has to decide on a regular basis whether to raise or lower the temperature. If you have a robot vacuum at home, it has to decide every minute whether to stay plugged in or go out and vacuum the living room again. Not all such decisions are low-consequence: a self-driving car has to decide, hey, there's a yellow light, do I accelerate or do I brake? Self-driving cars don't actually use an approach like this, because trial and error is not an acceptable way to solve that problem, but that's the type of problem reinforcement learning addresses.

With that overview, we'll jump into an example. The goal here is to show that these things are not scary; the concepts behind them are not earth-shattering. In this case, I have a setting, a ring made for holding a diamond that doesn't happen to have a diamond in it. It's built to hold a diamond that weighs 1.35 carats. It belonged to my grandmother, and I'd like to restore it, so I'd like to buy a diamond for it. I go into a jewelry store and ask, how much will it cost me to buy a 1.35 carat diamond? It turns out they don't have any. So I think, huh, machine learning, data science. I write down the weight and the price of all the diamonds in the store. At the top of the list, for instance, there's a 1.01 carat diamond at just over $7,300, and so on down the list. Then I take my pen and paper and create a little number line for the weight in carats. I see that most of the weights fall between zero and two, so I size it appropriately, and I create another number line at a right angle to it for the price, covering all of those values. I start with the first diamond: I eyeball a vertical line at about 1.01 carats, eyeball a horizontal line that links up with it at about $7,366, and right where they meet, I put a dot. This is how you make a scatter plot from two columns of data. I repeat that all the way down the columns, and when I'm done, I get a picture that looks like this. Now, there's a secret in data science that we don't talk about much: sometimes this is all you need to do.
Sometimes, at the point when you have a picture, the answer jumps out at you. If you have your data plotted across the United States, you can say, whoa, my sales are way up in California and way down in Idaho; I know what I need to do next. In this case you can say, whoa, the price is a lot lower for lighter, smaller diamonds than it is for bigger diamonds. But we already knew that, and it doesn't answer our question, so we keep going. When we look at those dots, there's an obvious shape to them, like a blurry line running from bottom-left to top-right. So we very carefully take our Sharpie and draw a line right through the middle of it. This step is actually really significant. It's easy to overlook, because we do it automatically in our brains all the time, but what we've done is take the actual raw, hard, noisy data and simplify it. We've made a cartoon version of it. That lets us do a couple of things. It lets us answer questions really concretely, and it lets us answer questions for which we don't have any data. We don't have data points at every weight along that x-axis, but we do have a line. So now we can eyeball a line up from 1.35 carats, and where it crosses our diagonal line, our model, eyeball a horizontal line over to the price axis. It hits right at about $8,000. That's the answer to my question: how much will a 1.35 carat diamond cost? About $8,000. This is machine learning in action, building a model and making a prediction. It's not any more complicated than that.

Looking at this, a good follow-up question is: none of those dots lie exactly on that line, so I don't expect to pay exactly $8,000, but will it be $8,000 plus or minus $50? Plus or minus $5,000? That will also be really helpful in deciding what to do. Luckily, we can answer this as well. We eyeball a nice fat envelope around that line that captures most of our data points. It doesn't have to include all of them. We'll just call this our confidence interval: we're pretty confident that future data will lie in this region, because our past data all has. Now we just look at where our 1.35 carat line crosses that envelope on the bottom and the top, and eyeball those lines over to the price axis. Not only do I expect to pay about $8,000 for my diamond, it will very probably be more than $6,000 and less than about $10,000, somewhere in that range. Now I know what I need to know to go out and earn some money and prepare to make this purchase.
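Here's the same exercise as a rough sketch in code. The diamond weights and prices below are invented stand-ins for the store's list, so the exact dollar figures will differ from the ones above, and the "fat envelope" is approximated crudely from the spread of the residuals:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

carats = np.array([1.01, 0.51, 0.70, 1.20, 1.52, 0.33, 0.90, 1.80,
                   0.45, 1.10, 0.62, 1.41, 0.77, 1.65, 0.98])
prices = np.array([7366, 3700, 5100, 8800, 11100, 2400, 6600, 13100,
                   3300, 8000, 4500, 10300, 5600, 12000, 7200])

X = carats.reshape(-1, 1)
model = LinearRegression().fit(X, prices)     # the Sharpie line
estimate = model.predict([[1.35]])[0]         # read the line off at 1.35 carats

residuals = prices - model.predict(X)         # scatter around the line
spread = 2 * residuals.std()                  # the eyeballed fat envelope
print(f"about ${estimate:,.0f}, "
      f"likely ${estimate - spread:,.0f} to ${estimate + spread:,.0f}")
```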
I want to take just a second to give us all a collective pat on the back. You have mentally just walked through a full machine learning modeling process, and you were able to do it without any computers or any math. There were no equations; we didn't even do multiplication. Conceptually, this is about as stiff as it gets. Now, if instead of just weight and price we also wanted to consider the cut and the clarity and the color and the number of inclusions in the diamond, we'd start adding columns. We couldn't draw it on a piece of paper anymore, and it wouldn't be as simple as drawing a straight line; you'd have to draw a funky multi-dimensional plane, and your head starts to hurt if you think about it too hard. Luckily, math is the tool for doing exactly that. It can find the very best high-dimensional line to fit through the data and make a model just like the one we would have made if we could eyeball it. The other thing is, if instead of about 15 diamonds we had 15,000 or 15 million, doing this by hand would of course take a whole lot longer. That's where computers come in: they're a tool that lets us do this with bigger and bigger data sets. But keep in mind that what they're doing under the hood is not magic.

A common question when you're answering a question with data is: do I have enough data? How much do I really need? It all depends on what you want to do with it. When you look at your answer, the line we drew, if all we knew was that our diamond was going to cost about $8,000 plus or minus $8,000, that's not enough information to make a good decision on. Too fuzzy, too blurry. Here's a notional example: an image reconstructed without quite enough information. There's something there, but you can't tell what just yet. You know you have barely enough data when you can look at your answer, squint, and say, ah, I know what's going on just well enough to make my decision. In this case: I can tell that's a broad canal, buildings on either side, a beautiful sunset; I've decided I want to go there on vacation. That's when you have just barely enough data. Gathering even more data lets you make finer-grained decisions. At that point you can stand back and say, oh, of the hotels on the left bank, that first one, on the third floor, has fascinating architectural features; that's where I want to stay when I go. With more data you can make progressively finer decisions and ask finer questions. That's what it buys you.

So we've answered our question. The next step is to use the answer in some way. There are a lot of ways to do this, and it doesn't really matter how; it all depends on what your goal was going in. You can turn it into a web service, which is something Microsoft's Azure Machine Learning does exceptionally well. You can simply use it to make a decision, or to set a price for your products. You can publish the code on GitHub, write up a document sharing your results, or build a dashboard. It doesn't matter, but you have to do something with it, or else it'll disappear and won't make any difference to anyone, except perhaps to you.

Now, there are a few big gaps, gotchas, when you go to use your results. If you understand these on a conceptual level, it'll put you ahead of probably 90% of the general population, and even 50% of the technical population; they're often overlooked. The first is that almost all machine learning algorithms assume the world doesn't change. If I built a detailed model of air travel, predicting the number of passengers every day, trained on 10 years of data with excellent performance, and I fielded that model on September 10, 2001, I wouldn't expect it to behave very well on September 12, 2001, because on September 11 something happened that changed the system in a pretty big way, a way directly relevant to what I was trying to predict. This can happen to any model we build, any prediction we make. And the crazy part is, the machine learning algorithms are assuming the world doesn't change. We know that's false.
The world is changing every day, and there's nothing we can do to stop it. The question you have to ask on an ongoing basis is: is the world changing in a way that calls my results or my model into question? The only antidote is to keep your eyes wide open, and you can also occasionally rebuild your model, refreshing it with the most recent data, to help inoculate yourself against this.

Gap number two: a lot of machine learning algorithms take a lot of examples to learn. If I'm modeling something about internet traffic, say for cybersecurity, I can collect data for a day and get a hundred million or a billion examples, and train a sophisticated machine learning model, no problem. If I'm building a model of how corn grows, collecting one data point takes a year. If I need thousands or tens of thousands of data points, that's a problem: by the time I get the answer, the problem I was trying to solve will be gone, and I certainly won't be around to worry about it. So keeping in mind the time it takes to gather the data is important for some of our applications, for some of our businesses. Don't worry, there's an answer to all this.

And the third big gap: machine learning can't tell what caused what. Here's a plot of two different things over the course of ten years: the number of pounds of cheese the average person eats in a year, and the number of people who died by getting tangled in their bedsheets in each of those years. Looking at these, they track pretty closely. I wouldn't be at all surprised to see this in a magazine article with a caption reading "excessive cheese consumption leads to death by bedsheet strangulation." But what I want to point out, and this is a subtle but fundamental gap, and once you start seeing it you won't be able to stop seeing it: this graph doesn't say that at all. It's equally plausible to claim that people dying by bedsheet strangulation causes those around them to eat more cheese. The graph doesn't actually suggest either one. It just says, hey, these two things happened to move the same way over the same period; draw your own conclusions. It would take additional experimentation to establish either causal direction. This is big, and I strongly urge you to keep an eye out for it in the press, where it's not always so benign. Looking at our own business data, we might see, wow, there was a huge surge in profits in Q4 of FY17; also, I took many more business trips to Belgium during that period. So the obvious conclusion, when using this to make a decision: I need to schedule more business trips to Belgium. We make decisions like this all the time. Maybe they're right, maybe they're not, but they're not data-driven. The data does not tell us to make that jump.

Okay, so the three gaps: you have to collect a lot of data, machine learning assumes the world doesn't change, and it doesn't tell us anything about causality. These gaps keep us from being completely data-driven, from going straight from what our data tells us to the decisions we need to make. Luckily, to cross that gap, we have our human insight and judgment. We're really good at making reasonable guesses with not enough data. In fact, we're way too good at it, and we often have to rein it back and use data to do that. But where the data stops, we can follow our intuition, our insight, our experience, our gut feeling, and cross that gap to make a decision.
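To make that third gap concrete, here's a toy sketch with invented numbers: two series that merely trend together over the same ten years come out almost perfectly correlated, even though neither causes the other.

```python
import numpy as np

rng = np.random.default_rng(2)
years = np.arange(2000, 2010)
# Two made-up series that both drift upward over the decade.
cheese_lbs = 29 + 0.4 * (years - 2000) + rng.normal(0, 0.1, 10)
bedsheet_deaths = 330 + 45 * (years - 2000) + rng.normal(0, 10, 10)

# Correlation only measures "moved the same way," nothing more.
print(np.corrcoef(cheese_lbs, bedsheet_deaths)[0, 1])  # close to 1.0
```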
There's a paradox here. This gap is caused by not having enough data. If we had enough of exactly the right data, we could build the two sides together so there would be no gap whatsoever. But in most cases, because the world changes, because data takes so long to gather, because we get correlation and not causation, that gap never gets fully closed. We need to cross the gap to get more information, but we can't close the gap until we get more information. So we take the leap of intuition and judgment and cross it, a first time, a second time, a third time, and the more often we do that, the more we can use the data we gather to build the bridge a little closer. Over time, the data will often tell us: yes, what you guessed was exactly what it should have been. Sometimes it will tell us: that was actually not the right way to go; you should have done this other thing instead. Either way, it's better than standing on the far side wishing for more data, unable to do anything. Not being afraid to cross that gap is an important part of using our machine learning results, our data science results.

So, we've completed our conceptual journey through data science. I appreciate your patience, and I'm sure I've raised many more questions than I've answered. Feel free to contact me. There are a number of resources I'll share in these slides; follow me on LinkedIn or Twitter if you'd like links to the slides themselves. This video will be published in a couple of days, so keep an eye out for that. Thank you very much. We actually have a few minutes left for questions, and I will happily stick around and field those for the next 15 minutes. Because this is being recorded, if you don't speak into a microphone, it didn't happen. There's a microphone there, and one right there. I encourage you to form a single-file line behind each one, and we'll alternate, if you have any questions you'd like to ask. Maybe I did answer them all. Okay, we've got one right over here.

Thank you. First of all, excellent presentation; to take a very complex thing and make it digestible is just fabulous. My question is mostly about number four, check for quality. A lot of times the struggle there is: how do you know, as one person, whether your data has high quality or low quality? How do you know when the quality is sufficient? I'm looking for your input on that in general.

There are a lot of things you can do to check, and a lot of it is domain-specific. What determines good data and bad data depends on the problem you're trying to solve and what domain it's in. I can give an example. I worked in agriculture for a time, building models, and it turns out that when you dig up some soil and process it to measure the nitrogen in it, if you don't refrigerate that soil, lots of chemical things happen, bacteria go crazy, and the nitrogen levels change a lot. We got a whole bunch of data, and we learned that about a third of it had sat in a hot car trunk for a weekend, which of course throws those numbers off entirely. One way to handle that would be to say, okay, that data is suspect, we just don't trust it, we're going to remove it. Unfortunately, we had no way of knowing which third of our data points that was.
So that's a pretty extreme example of serious questions being raised about data quality, and at that point we actually decided to backtrack and try something else entirely. But it was completely dependent on the domain we were in, on knowing things about biology and soil that I didn't know before, and on the answers we were trying to get. One thing I've learned, painfully enough, is that in real-world data, every single data point has a story behind it. You wouldn't want that to be the case; you'd want it to be clean and reproducible and perfect and to mean exactly the same thing every time. But sensors fail, networks fail, bad things just happen, and so determining whether data is good or bad is sometimes a discipline all in itself. Thanks. We'll take this one next.

Yes, I'd like to go back to your point about closing the gap. There's an old adage that people don't let data stand in the way of their opinions. So how do you close that gap with people who really are not interested in digging into what's going on and really do want to believe that eating cheese causes strangulation? What are your views on how you can close that, without just opening the book and saying, well, there are the stats?

I love that question. There's obviously a spectrum. At one end is: I will not do or believe anything until it's incontrovertibly supported by data. Anyone at that end of the spectrum doesn't do or believe anything. At the other end is: I will do and believe what I like regardless of what data I observe, so data has no effect there. We all fall somewhere in between. So the question, as I understand it, is: if you're interacting with people leaning toward that second end, how do you convince them? How do you change their minds? If I knew the answer to that, I'd be a very wealthy man. But it's an important question to consider, and it probably depends entirely on the individual you're interacting with. It's a very unsatisfying answer, but it's an excellent question. Thanks.

Hi, I'll echo the previous sentiments: this was a really fantastic talk, thank you so much. My question piggybacks on the last one: what advice would you have, what best practices, for generating good hypotheses and keeping an appropriate sense of skepticism about the data and the models you're building, so that you can avoid these cheese-strangulation mistakes?

Yeah, this is a big thing I've been faced with quite a few times. One thing I've learned is to never trust my data. I work with what's there, but I never assume it's absolutely correct. So when a question comes up, when something's unusual, one of the possible answers is: there's a flaw in the data. The other trick is to never trust my model. Whenever it's not behaving the way I would expect, the other possibility I look at is that there's something funny going on with my model. Coming at it continually skeptical like this often helps me find that I've done something wrong, which is the case more often than not. I look for errors, try to anticipate errors, try to create errors in what I've done, and try to break it.
Only when I've tried unsuccessfully to break it or fix it do I draw what might feel like an unlikely or counterintuitive conclusion, like the cheese causing bedsheet strangulation. One big red flag is when my data firmly and soundly supports something I already believed and really want to believe. There's a strong tendency toward confirmation bias, toward stopping when you get the answer you're looking for. The only way I know to inoculate yourself against that is to distrust your own beliefs a little bit too, and be pretty ruthless in how hard you beat on your results.

Again, to echo the others, thank you for an outstanding presentation today. I don't have so much a question as a thought to throw out to the gentleman who just asked about changing minds. This is where the science in data science comes in. You have to approach your work without preconception, get rid of those a priori notions, and approach it with an open mind, because if you don't approach this objectively, you're creating an echo chamber. You're not doing science; you're not doing predictive analytics. And this bears itself out in models: you'll create a confirmation bias, but it will quickly be disproven when real data comes into play. Why is our model wrong? Because you stopped when you got the answer you wanted. What you said there, I think, exactly answered that gentleman's question. But thank you again for an outstanding talk.

Thank you. It's a really important point. One thing that has surprised me, as I've come to understand it, is how bad scientists can be at doing science, in the sense of approaching things skeptically and being willing to disprove their own hypotheses. There are quite a few examples of popular ideas that have stayed popular for a long time despite a big pile of data adequate to convince an open-minded observer otherwise.

Another echo here: great presentation. My question is a bit more specific, about technologies. You showed a slide with a lot of cloud-based tools. Are on-premises tools still valid for doing this data science and data cleansing, things like SQL Server Data Quality Services and Master Data Services, before you get the data to the cloud, or is the tendency to do all the data cleansing in the cloud?

Oh, yes. Personally, I use a mixture of tools. I find that on-premises versus cloud doesn't affect my ability to take this journey around the circle I've shown, but some applications are more suited to one than the other, and it usually depends on what you're trying to do. Sometimes the data is so large that it just doesn't fit on my laptop, in which case cloud is a great way to go. Or if I want to share my model widely, say by creating a web service so that a thousand or a million different people can query it and use it, then cloud is an obvious winner. And, outright plug here, Microsoft has done a lot of work to make these cloud tools very intuitive to use, so sometimes, just for pure ease of use, it's convenient to open a browser, pull up Azure Machine Learning, and put a model together right there. All right, thanks.

Hi, yeah. My question is about how much you find yourself trying to learn about the domain the data applies to. You talked about diamonds; if you don't know there's such a thing as clarity or inclusions, I'd expect you wouldn't be as effective at it.
If you're analyzing dropped calls on a cell phone, say, and you don't know anything about towers, or which model of phone is involved, or which frequencies it's using, as a data scientist, what's your assessment of how much time you should spend learning, maybe even becoming an expert in, the subject matter you're analyzing?

In my experience, every amount of time I've been willing to invest in learning the domain has paid off in the quality of the results. There's a fairly common toolbox of machine learning algorithms, and a few people have taken a shot at automatically taking your data set, running every single one of them on it, and finding the one that works best. For a lot of common business problems, that's not the hard part. The hard part is cleaning your data in a smart way. For instance, when I have missing values, what do I do with them? Do I substitute in a zero? Do I substitute in some estimate? Do I remove that record entirely? The right answer depends a lot on the data and exactly what it means. I have found that if I just take the data as a table of numbers and names and feed it black-box into an algorithm, it never works. I always end up having to backtrack and study up: what does this column mean, what does that one mean? I talk to people, and usually even the columns I thought I understood, I realize I was interpreting wrong; or even when I interpreted them according to the label, I learn that the label was wrong. So being willing to dive into your data set and get your hands really dirty pays off ten times over.

Thanks for the great talk. I can actually go home and sleep properly thinking that I can also do data science; it's not as daunting as it was. My question is about the categories of algorithms you mentioned in your first few slides. The first technique included basic two-class classification: yes or no, true or false. And the fourth technique, anomaly detection, whether something is weird or not, also answers with a yes or no. So how is that different from classification, and how do we identify it?

That's a really good point. You could see it as two-class classification, is this weird or is it not, and that's actually how it's implemented in some cases, though not all. What I presented was of course a very high-level simplification; there are other types of machine learning algorithms, but those five families are the broad swath down Main Street, capturing most of them. With anomaly detection, some algorithms are indeed two-class classifiers. The reason that becomes challenging is that two-class classification algorithms usually assume roughly equal numbers of yeses and noes, of haves and have-nots. By nature, with anomaly detection you don't have very many: they're anomalous, they're weird. And usually you're looking not only for weird things you've seen before; you're even more interested in weird things you haven't seen before. So it's a slightly different type of problem, and even when you do use two-class classification methods, you have to tweak them a little bit to solve it.
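As a small sketch of that tweak, assuming scikit-learn and synthetic data: with roughly 1% anomalies, an unweighted classifier is tempted to just predict "normal" every time, so one common adjustment is to reweight the rare class.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(3)
normal = rng.normal(0, 1, (2000, 2))   # ordinary examples
weird = rng.normal(4, 1, (20, 2))      # rare anomalies, ~1% of the data
X = np.vstack([normal, weird])
y = np.array([0] * 2000 + [1] * 20)

# class_weight="balanced" reweights classes inversely to their frequency.
clf = LogisticRegression(class_weight="balanced").fit(X, y)
print(clf.predict([[4.2, 3.8]]))       # a point near the weird cluster: [1]
```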
Good insight into that relationship. Thank you. There's no one over here, so we'll take the next one right here. And just a heads-up: we have one more minute, and then we'll have to wrap up.

Hi, thanks for the great talk. In one of the first slides, you said that the machine learning process and the data science process assume that the world doesn't change, but the world does change. Can you maybe give an example of a situation where it's hard to solve the problem just by putting new data in and retraining, and what tools Microsoft has to deal with that?

Yes, a two-part question, and we're just about out of time, but a very brief answer: anything having to do with your customers, if you have a business. What you learn about your customers when your company is small may not hold six months later if your company has doubled in size, so it's something very important to keep an eye on at that point. I apologize that I don't have time to dig deeper, because it's a very insightful question, but thank you very much. And thank you all very much for your attention. I appreciate it.