Hello. Welcome to Data Science for Everyone. Even if you're scared of math, or don't know how to program a computer, or you've never even heard of data science, this presentation is still for you. What I'm going to give you is a high-level description of how I do data science. Other people will give you different answers, but hopefully this will get you started. The very first step is to get data. And if you have data, get more data. When I say data, I mean numbers and names: data that's numerical and categorical. What I mean by that is, numbers are amounts or counts, things that can be measured in extent: money, pixel brightness, sound intensity. Names, or categorical variables, are things like the type of dog, a variety of hot drink, the identification of an aircraft, the model number of a droid, an ice cream flavor, or something like the contents of a text. A name is a name: if you change it a little bit, it can become a completely different name. If you change a number a little bit, it's still pretty close to the original number. Now, to confuse the issue, there are names that look like numbers. A phone number, if you change it by just one, points to an entirely different person, and that person is not necessarily closer than if you had changed it by a thousand or ten thousand. A phone number is a name that looks like a number. Similarly with a zip code: if you change it by just one or two, you might find a place that's relatively close geographically, but it will still be a completely different place. And it could be that changing a zip code by a thousand produces another zip code that's right next to the original one. The same goes for identification numbers, like 007. There are lots of types of identification numbers: serial numbers, credit card numbers, social security numbers. Change any one of those by ten or a thousand or a million, and either way it will point to something completely different.
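The talk itself stays away from code, but as an aside, here is a minimal sketch in pandas of the practical consequence of this distinction: a zip code should be stored as a name (a string or category), not a number, because numeric closeness means nothing for it. The column names and values are made up for illustration.

```python
import pandas as pd

# Hypothetical records: "salary" is a true number, "zip_code" is a
# name that merely looks like a number.
df = pd.DataFrame({
    "salary":   [52000, 61000, 58000],
    "zip_code": ["98052", "98053", "10001"],  # kept as strings
})

# Treating zip codes as categorical prevents meaningless arithmetic,
# like averaging them or measuring "distance" between them.
df["zip_code"] = df["zip_code"].astype("category")

print(df.dtypes["salary"])    # a numerical column
print(df.dtypes["zip_code"])  # a categorical column
```

Loading zip codes with a numeric parser is a common way to silently lose leading zeros and invite nonsense averages, which is why the string form matters.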
And to further confuse the issue, there are numbers that look like names, ordinals for instance, and names that can be turned into numbers. Ordinals are names that have a distinct order, like first, second, third. And it helps if the difference between each of them is similar. Small, medium, large: there's a distinct order, and a medium is about as much bigger than a small as a large is than a medium. Left, middle, right. Time zones. Train stops. Anything that has an order can be turned into a number and interpreted by a machine learning algorithm. The process of measuring and collecting and storing and searching and moving and transforming data, names and numbers, is a special discipline unto itself. Azure has a lot of tools to do this, and we're going to talk about them a lot during this conference. I encourage you to keep an eye out for them and learn as much as you can. We won't talk about them in this presentation; here we'll stick to the fundamental conceptual notions of data science, the things that you try to do with all of those tools. I would also like to point you to the Cortana Analytics Process. You see its five steps below. These are similar to the steps that I'm going to talk about today, but they place special emphasis on the tools that you use and give specific instructions on how to use those tools to get these results. They're a next step: after this presentation, go and check that out to further your education. After you have data, the next thing you need to do is ask a sharp question, and the thing you're looking for is: is your target in the data? I'll explain what I mean by that. To start with, a vague question doesn't have to be answered with a name or a number. You can imagine that there's a genie in a lamp, and it emerges from the lamp and will answer any question you ask it truthfully. But it's a mischievous genie, and it will answer as unhelpfully as it can.
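The ordinal idea above can be sketched in a few lines of pandas: record the order explicitly, then let the library assign the numbers. The size values are illustrative, not from the talk.

```python
import pandas as pd

# Ordinals are names with a distinct order, so they can be mapped to
# numbers a learning algorithm can use.
sizes = pd.Series(["small", "large", "medium", "small", "large"])

# An ordered categorical records the order explicitly...
sizes = sizes.astype(
    pd.CategoricalDtype(["small", "medium", "large"], ordered=True))

# ...and .cat.codes turns each name into its position in that order.
print(sizes.cat.codes.tolist())  # [0, 2, 1, 0, 2]
```

The key design point is that the order comes from you, not from the alphabet: a naive alphabetical encoding would put "large" before "medium" before "small", which scrambles the very ordering that makes the feature useful.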
So if you ask it a vague question like, how can I increase my profits? you may get an answer like, work harder. True as far as it goes, but not specific enough to be helpful. Whereas if you ask it a sharp question that has to be answered with a name or a number, such as, how many times will the feature that I built get used by a new user?, the genie can't help but give you the correct numerical answer to that question. There's no wiggle room. That's the type of question you want to ask your data. You need to make sure that your data includes your target. Your target is a set of examples of answers to your question that came in the past. So if your question is, what will my stock price be next week? your target is a history of stock prices. You can gather sales information by region, information on your competitors' products, information about your users and how long they've been around, and external information about markets. But none of that will do you any good in answering your question until you also have a history of your stock price, so that you can line it up with those and find the patterns. If you don't have your target in the data, go get more data. If you do, put your data in a table. This is not quite as trivial as it sounds. It has to go in your table so that there's one target value per row, one instance of your target variable for each row. In this case, with our stock price history, we recorded the stock price at the close every day, so we have one row per day. Now, a lot of the data that we would like to include doesn't naturally occur once per day. For instance, total users. The way we've stored this data is as a long list of username and date joined. In order to get total users, we have to aggregate, computing a sum over that list. This is a common method to get things to line up in rows. Another way, the complement to this, is to distribute.
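As an illustration of that aggregation step (the usernames and dates here are invented), here is one way in pandas to collapse a per-user list into one "total users" value per day, ready to line up against a once-per-day target like the closing stock price:

```python
import pandas as pd

# Stored form: one row per user (username, date_joined).
users = pd.DataFrame({
    "username":    ["ada", "ben", "cam", "dee"],
    "date_joined": pd.to_datetime(
        ["2024-01-02", "2024-01-02", "2024-01-03", "2024-01-05"]),
})

# Aggregate: how many users joined on each date...
joins_per_day = users.groupby("date_joined").size()

# ...then a running total gives "total users" for every calendar day,
# one value per row of the target table.
all_days = pd.date_range("2024-01-01", "2024-01-05")
total_users = joins_per_day.reindex(all_days, fill_value=0).cumsum()
print(total_users.tolist())  # [0, 2, 3, 3, 4]
```

The reindex step matters: days with no sign-ups still need a row, because the target table has one row per day whether or not anything happened.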
There are some quantities, like last month's sales or new users last month, that are going to be the same for this entire month. It's still the same month that it was; that number is going to be constant. In order to line it up one per row, we distribute it, copying it across all of the relevant rows. Another thing we find ourselves doing quite often is computing, for instance days since press release or days since product release. In this case, our data may be stored as a date and the name of the press release. In order to find the days since the press release, we have to take the date of the last press release and today's date and subtract them, a little calculation. Another thing we may have to do is simply measure. If we need something that we didn't create ourselves, we may have to go out and gather that data. Maybe someone else collected it for us and we can get it from them, or maybe we have to get it from scratch: something like the Dow Jones Industrial Average for each day. Another thing we can do is estimate. Sometimes, at the time we create our table, we don't have the most recent data, and we can't calculate or distribute or aggregate to get the number. So we may have to fudge it and make some educated guesses. If we have a basis for making a guess, that's better than leaving it blank; we've still added some information. But when all else fails, we can leave blanks. We don't want to leave too many, because the more we leave, the trickier it gets to do quality data science, but there are ways to handle it. So, after the data is in a table with one row per instance of the target, the next step is to check for quality. We want to make sure the data is acceptable to use. So here's a table. We can see, going through the columns, that the very last column is our target variable: Wears a Cape?
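A small sketch of the compute and distribute steps just described, with invented dates and sales figures. Computing subtracts two dates; distributing copies one monthly number across every daily row.

```python
import pandas as pd

# One row per day (matching the target), plus event data stored
# separately as a list of press-release dates.
table = pd.DataFrame({"date": pd.date_range("2024-03-10", periods=4)})
press_releases = pd.to_datetime(["2024-02-20", "2024-03-11"])

# Compute: days since the most recent press release on each date.
def days_since_last(day):
    past = press_releases[press_releases <= day]
    return (day - past.max()).days

table["days_since_press_release"] = table["date"].apply(days_since_last)

# Distribute: last month's sales is one number for the whole month,
# so we copy it across every daily row.
table["last_month_sales"] = 148_000

print(table["days_since_press_release"].tolist())  # [19, 0, 1, 2]
```

Note that the lookup only considers releases on or before each row's date; using a future release would leak information the model could not have had at the time.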
This is a bunch of data on different superheroes and supervillains, and we want to be able to predict in the future, given a new superhero or supervillain, whether or not they wear a cape, based on everything else we know about them. Now, when you go to survey data for quality, I don't know of any shortcut to the process of walking through the data. What this means is you go to the first column. You look at the title. You think about what it means. If it's not clear, you look up what it means. If it's not written down, you talk to the person who created it and find out what it means. If you still don't know, you make a guess at what it means. With that in mind, you visually scroll through the data. For large data sets, you won't be able to go through the whole data set, so you can look at aggregations. You can look at the unique values. You can look at histograms of how many times each value occurs and make sure that it makes sense. Make sure that you don't see anything that makes you cock your head and squint and think, does that look right? Because if you do, then it probably means there's an error in the data, and it may need correction or clarification. This is almost always the case in real data. So, to go through this data, first we look at the ID column. We scan through that, and it becomes pretty clear that this is just a unique numerical identifier for each row, very common in databases. We look at the next column, first name. Those look like all fine first names. They're all alphabetical. They're all capitalized. They seem well formatted. Same with the last names: all very plausible, no red flags. When we get to birth year, we assume that this is the year when each of these individuals was born, but some of these quantities don't look like years at all. Some of them have stray punctuation. Peter Parker's year looks like someone re-typed the number one several times. Some have quotes.
If you look down at 2287 BC, that's a perfectly valid year, but by appending BC to it, it becomes non-numerical, and the machine learning algorithm doesn't know what to do with it. In order to clean this up, we need to make sure these are all formatted the same way. After we go through and fix them, this is what it looks like. We even took the 2287 BC and just turned it into negative 2287, so that it could very easily be interpreted by our algorithm. Next, we look at the height. This is nice: easy to interpret, uniformly formatted. The difficulty here is that we know height is numerical. We know that 6'1" is much closer to 6'2" than it is to 5'1". That's not represented when these values are strings with symbols, the foot and inch tick marks. So we need to change them to a numerical form. Once we change them to inches, they're naturally interpretable by the algorithm. We look at the birthplace, and again, these all look like very reasonably formatted names. It looks like some cleanup has occurred already on this column. We do have an Unknown in there, but that's reasonable enough. The next column is Identity Is Secret. So this looks like a yes or no: does this person have a secret identity? In the case of Bruce Wayne, yes. In the case of Ororo Munroe, Not Applicable. It's not totally clear what that means. If we look down at Victor von Doom, Missing. Again, it's not totally obvious what that means. Janet van Dyne, nothing at all. So there's some ambiguity in the data. In order to use it to its best effect, we need to figure out what is intended here, and in some cases make some guesses, or do a little bit of research: talk to whoever recorded it, or if they've left the company, talk to someone who heard something about whoever recorded it, and make a guess. After doing that, we end up with a nice, uniform representation where every row is a Y or an N, two levels, very easily interpretable by an algorithm. Next column: Can Fly.
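Those two cleanups, BC years into signed integers and foot-and-inch strings into inches, can be sketched in plain Python. The parsing rules here are my assumptions about the formats described, not code from the talk.

```python
import re

def year_to_int(raw):
    """Turn strings like '1974' or '2287 BC' into signed integers."""
    raw = raw.strip()
    if raw.upper().endswith("BC"):
        # BC years become negative so the column stays numerical.
        return -int(raw[:-2].strip())
    return int(raw)

def height_to_inches(raw):
    """Turn strings like 6'1" into a plain number of inches."""
    feet, inches = re.match(r"(\d+)'(\d+)\"", raw).groups()
    return int(feet) * 12 + int(inches)

print(year_to_int("2287 BC"))     # -2287
print(height_to_inches("6'1\""))  # 73
```

After this pass, 6'1" and 6'2" really are numerically close, and a BC year sorts correctly before every AD year, which is the whole point of the conversion.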
Again, it appears that this column is an indicator of whether an individual can fly. But it looks like whoever entered it was using different standards. Some entries are numerical: a 3, a 9, a 12, a 1. Based on the fact that Clark Kent is a 12 and Bruce Wayne is a 3, high numbers seem to mean a good flyer. We have notes: Jet, meaning presumably can fly, but in a jet. Lev, which we do some research on and find is short for levitate. We have some No, Not Applicable, blank. So again, it takes some interpretation. When we first look at this, and we look at all of the unique values and plot a histogram of how many times each occurs, it's clear that there's a lot of noise and a lot of different standards. After we go through and clean it up, it becomes very simple: yes, no. In some cases this is an oversimplification, but it helps the machine learning algorithm interpret it well. Next, Alignment. Another case with lots of different values, and lots of different ways to say the same thing in some cases. We go through, we look at what each one means, and we unify them. In this case, we end up with three levels: mostly good, some bad, and one neutral. Selina Kyle is tough to categorize. It's okay to have more than two categories. It's okay to have more than ten categories, as long as the representation is uniform and everything that's intended to be the same is called the same. We finally get to Wears a Cape. We treat the target variable just like any other: look through it, think about what it means, and unify it. Now we have a nice, clean, rectangular, non-empty, uniform representation. This data is ready to be processed by an algorithm. So, the next step: transform features. And the question to ask is, are the new features predictive? I'll talk about what that means. Sometimes the data that you get needs a little bit of massaging before it can help you answer your question. This is called feature engineering.
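Unifying a column like Can Fly is ultimately a lookup table. A sketch with invented values: the mapping itself is a judgment call (my assumptions are in the comments), which is exactly the point the talk makes, because someone has to decide what each variant was intended to mean.

```python
import pandas as pd

# The raw column mixes scores, notes, and several kinds of "no".
raw = pd.Series(["12", "3", "jet", "lev", "no", "N/A", "", "yes"])

mapping = {
    "12": "Y", "3": "Y",     # assumed: any flyer score means yes
    "jet": "Y", "lev": "Y",  # flies by jet / levitates: yes
    "yes": "Y", "no": "N",
    "N/A": "N", "": "N",     # assumed: unknowns default to no
}

clean = raw.map(mapping)
print(clean.tolist())  # ['Y', 'Y', 'Y', 'Y', 'N', 'N', 'N', 'Y']
```

Writing the mapping down as a dictionary also documents the decisions, so the next person who surveys this column doesn't have to chase down whoever recorded it.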
It's just a fancy term for taking the features that you already have and doing something to them: changing them, combining them, breaking them down. I'll show you an example of this. So, this data right here: column 0, column 1, column 2. They're all numerical. Column 2 is our target. We want to use column 0 and column 1, which we'll call feature 0 and feature 1, to predict feature 2, our regression target. That means we're going to draw a line through a plot and try to predict, or model, feature 2. When we plot feature 0 against our target, and feature 1 against our target, and we draw the best line we can through those fuzzy blobs, it's a flat line. There's no slope to it at all. That's very unsatisfying. What it means is that whatever value of feature 0 or feature 1 you tell me, I'm going to predict the same value of the regression target every time, a constant guess, which means they're not very predictive. Sure enough, there is a measure of the goodness of that line as a model, as a description of the data, called the coefficient of determination. When it's 1, the description is perfect. When it's 0, the description is worthless. We're pretty close to 0, at 0.016. So now I tell you that feature 0 is actually the departure time, in hours since midnight, of the subway train from the Central Square stop; that feature 1 is the arrival time, in hours since midnight, of the same train at the Kendall Square stop; and that feature 2 is the maximum speed, in kilometers per hour, that the train reaches when going between the two stops. We think about that and realize that departure time and arrival time are both related to the speed. It's peak speed, not average speed, but there should still be a relation there. So they interact: alone they don't help us know what the speed is, but together they do. Well, the default way to handle interaction in data is to multiply the data points together.
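This whole experiment can be reproduced on synthetic data. Everything below is made up to mimic the subway story: departure times, trip durations, and a peak speed that behaves like distance divided by time, so a product feature tells us nothing while a difference feature works, as the talk goes on to show.

```python
import numpy as np

# Synthetic stand-in for the subway data (all numbers invented).
rng = np.random.default_rng(0)
depart = rng.uniform(6, 23, 500)            # hours since midnight
trip = rng.uniform(0.15, 0.35, 500)         # hours between stops
arrive = depart + trip
speed = 5.0 / trip + rng.normal(0, 1, 500)  # ~ distance / time, plus noise

def r2(feature):
    """Coefficient of determination of the best-fit line."""
    slope, intercept = np.polyfit(feature, speed, 1)
    pred = slope * feature + intercept
    ss_res = ((speed - pred) ** 2).sum()
    ss_tot = ((speed - speed.mean()) ** 2).sum()
    return 1 - ss_res / ss_tot

print(round(r2(depart * arrive), 3))   # product feature: near 0
print(round(r2(arrive - depart), 3))   # difference feature: near 1
```

The engineered column is just ordinary arithmetic on existing columns; the coefficient of determination is what tells you whether the transformation actually exposed the pattern.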
So, let's create a new feature, 0 times 1, and plot it on the right-hand side against our target, our peak speed, and draw a line through it. Unfortunately, again we see a flat line as the best fit to that fuzzy blob, and sure enough, our coefficient of determination is still very close to 0, at 0.014. So we step back again and we think: freshman physics. The average speed is distance divided by elapsed time, which is arrival time minus departure time. Minus, not times. So now we create a new feature, feature 1 minus feature 0, the difference of those times: the time in hours between the two stops. When we create this feature and plot it against the peak speed, we get a very different picture. We get a nice swoosh, and sure enough, the best line that fits through it is very predictive. Now, if you tell me the difference between feature 1 and feature 0, I can make a pretty good guess at what the peak speed was between the two stops, and sure enough, the coefficient of determination of our model is now 0.88, pretty close to 1. So, in order to get this performance, we had to take the original features, which were perfectly accurate and full of good information, and transform them so that the algorithm could get the information it needed out of them. There are lots of different ways to do feature engineering. Some of them are data specific: they take advantage of the fact that with certain types of data, like images, interesting information is stored in repeatable ways. Similarly with text, you can look at the frequency with which words occur and scale that by the inverse document frequency. There are other domain-specific feature engineering tricks. Depending on whether you're working in economics or agriculture or sociology, different things matter, and sometimes you have to take what you measure and do some tricks with it before you get to the thing that matters. Now, you've probably
heard about deep learning. It's been used successfully with images and text and audio to automatically learn all kinds of features. Deep learning's strength is that it learns the patterns from the data. It's outside the scope of this talk, but I'd like to use it for a teaser: there's another talk that I gave recently, and I'll share a link to it at the end of this talk, on the fundamentals of how deep learning works, at the same level: no math, no code, just some of the basic concepts. Now, you've taken your data, you've asked a sharp question, put it in a table, checked that it's high quality, and done your feature engineering, and now you're ready to answer your question. What you're looking for is whether the answer is clear, whether it lets you do what you want to do. This is where machine learning comes in. It uses statistics to find patterns in the past in order to predict patterns in the future. So, of all the questions that you can ask of your data, it might surprise you that there are really only five questions that machine learning can answer, or five types of questions. We're going to step through each of them. The first one is: how much, or how many? Examples of this are, what will the temperature be on Tuesday? Or, what will my sales be next quarter? These are questions that ask for a number. The types of algorithms that answer this are called regression algorithms. That may involve fitting a line or a curve or a surface to the data, which can be used as a simplified version, a cartoon version, of the data, and we'll show how this works in just a few minutes. But how much or how many is a very common question to ask of your data. Another very common question is: which category? If I get a picture, is this picture an image of a cat or a dog? Or, if I see a radar signature, of all of the aircraft in my library, which aircraft is probably causing it? Or, I read a news article; of all the topics that I've seen in the past, which topic or topics does this news article cover? This type of algorithm is
called classification, and it's used to assign a category, or a set of categories, to new examples. The third question that you can ask of your data is: which groups? Which groups does it naturally break down into? This is pretty common. Imagine someone gives you a bag full of M&Ms and says, hey, break this into groups that are similar. It doesn't really matter how you do it. You'll probably start doing it by color, but you could also do it by weight, or by diameter, or by sugar content. There are lots of different ways to do it, but however you do it, it can be helpful. Common examples of this are: which shoppers have similar tastes in produce? If you've ever watched movies streaming online, you've probably had movies recommended to you. This is the result of an algorithm that asks the question, which viewers like the same kinds of movies? Then it looks to see which movies your compatriots have seen that you haven't seen yet, and recommends them to you. Another very popular question to ask, surprisingly popular, is: is this weird? You can have a long history of observations and experience, and it's very useful to be able to identify when something weird happens, raise a flag, and say, hey, I haven't seen this before, or at least not very often. So, for instance, if you drive a car that has pressure sensors in the tires, it's nice to be able to answer the question, is this pressure reading unusual? If you have some kind of internet security system, it's nice to be able to answer the question, is this internet message typical? Your credit card company is always asking the question, is this combination of purchases very different from what this customer has made in the past? When things are weird, it doesn't necessarily mean that something's wrong, but it can. And when something's wrong, there's usually something weird about it, so it's a great place to focus your detective efforts. Finally, the fifth question is: which action should I take? So, in the case where a machine, especially a robot,
needs to make a decision, where it's a low-consequence decision and the machine gets to make it a lot, reinforcement learning is a way to do this and to learn from experience. So, for instance, an automatic temperature control system in a building needs to answer the question, should I raise or lower the temperature a little? A robotic vacuum cleaner in your house needs to answer the question, should I vacuum the living room again, or stay plugged into my charging station? A self-driving car may need to answer the question, should I brake or accelerate in response to that yellow light? This is a little bit different from the other questions, but I'm including it for completeness. We won't talk about it any more today. So, machine learning algorithms are kind of magical, until you dig in. We're going to pull the curtain back and show that it's not really magic at all; it's just a little bit of pattern identification. In this example, I walk into a jewelry store. I have an old ring that my grandmother used to own, and it has a setting for a 1.35 carat diamond, and I'd like to fill that setting. But first I want to know how much it's going to cost, so I can save. So what I do is I go to the case full of diamonds, and for each one I note the weight and I note the price. The first one I find is 1.01 carats, and the price is $7,366. I note it, I move on, and I fill up a list of all of the diamonds in the case. What I have here is a data set. There are two columns, and in this case, price is what I want to predict. The question that I'm asking is: how much? How much will this cost?
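The talk does this entire exercise with pen and paper, and that's the point; but for the curious, here is what the same steps look like in a few lines of numpy, on an invented diamond list (the numbers are illustrative, not the talk's actual case of diamonds): fit the line, read off the estimate for 1.35 carats, and draw the fat envelope from the leftover noise.

```python
import numpy as np

# A made-up version of the diamond list: weight in carats vs price.
rng = np.random.default_rng(1)
carats = rng.uniform(0.3, 2.0, 40)
price = 6000 * carats + rng.normal(0, 500, 40)  # a line plus noise

# Fit a straight line to the dots: a linear regression.
slope, intercept = np.polyfit(carats, price, 1)

# Answer the question for a 1.35 carat diamond...
estimate = slope * 1.35 + intercept

# ...and the "fat envelope" around the line: roughly two standard
# deviations of the leftover noise on either side.
residuals = price - (slope * carats + intercept)
half_width = 2 * residuals.std()

print(f"about ${estimate:,.0f}, between "
      f"${estimate - half_width:,.0f} and ${estimate + half_width:,.0f}")
```

Just as in the pen-and-paper version, the model is the line, the residuals are the noise, and the envelope is what turns a single guess into an interval you can plan around.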
And then my feature, my other feature, is carats. So, to go to work here, I first draw a number line that represents the weight in carats. It looks like the range is about 0 to 2, so I make sure that it covers that, and draw some intermediate tick marks. Then I draw a perpendicular line to represent price. Again, I see that the range is almost $20,000, so I put that in with some intermediate tick marks. Then I look at my very first data point: 1.01 carats. I eyeball a vertical line up from 1.01 on the carat line. Price: $7,366. I eyeball a horizontal line over from $7,366. Where they meet, I put a dot. That dot literally represents the first diamond on my list. I can do this again and again for every diamond on the list, and I get something that looks like this. Now, this is just our data set. It's another representation. We haven't done anything funny to it; we've just put it in a different form. Now, it's worth noting, on a tangent, that about half the time when you're doing data science, you can stop right here. If I owned a business and I was asked what size diamond I should choose in order to make a ring as cheaply as possible, from here it's obvious that price goes down as weight goes down, so the answer is: choose the lowest-weight diamond you possibly can. We're good at pulling out those types of insights just using our eyes, taking this picture of the data and turning it into an idea. Now, the question that we're asking is a little bit more subtle, on the next level. If we look at those dots and squint, they look like a fat, fuzzy line, and it's not too tough to eyeball: you can take a marker and draw a straight line right down the middle of it. Now, what we just did is really significant. We've created a model, which is kind of a fancy way of saying that we made a simplified version. And it is simplified; it is like a cartoon of our data. You can see that most of those data points don't lie right on the line at all. That line is just an approximation; it kind of says what the model does,
and the data scientist's interpretation of this is that the line is the ideal: it's what the baseline price is for any given weight of diamond. But because the real world is a little bit gritty, a little bit unpredictable, things happen that you just can't account for. Some points are going to be higher; some are going to be lower. That's called noise, or variance. Your data is always the model plus a little bit of noise, which is what we have here. So, to answer our question: we drew this by hand, but it can also be the result of a machine learning algorithm. The question is how much, so we know the type is regression, and because we're using a line, it's called a linear regression. A linear regression would use math to fit a line to these dots and generate something a whole lot like what we did with the pen. Now that we have a model, we can answer our question: how much will a 1.35 carat diamond cost? There aren't any diamonds on our list that are 1.35 carats, so we have to refer to our model. We eyeball a vertical line up from 1.35, and right where it crosses our model, we eyeball a horizontal line, and it hits right at 8,000. Boom, there's the answer: the diamond is going to cost us $8,000. Now, we look at those dots and we think, well, most of those dots don't lie right on the line, so is it going to be a little over or a little under $8,000? Or a lot over or under? We can figure this out too. We go back to our dots and our line, and we draw a fat envelope around the line that encompasses most of the dots. It doesn't have to capture all of them. This is called our confidence interval. It's our model extended out to embrace most of what's there, and we're pretty confident that future data points will also fall in this envelope. Now we can see where our 1.35 carat line crosses the top and the bottom of this envelope. Two more horizontal lines over, and this becomes the confidence interval of our estimate. So now we have a more complete story. We can say our 1.35 carat diamond will cost about $8,000, but
it'll almost certainly be less than $10,200 and greater than $5,800. So, I'd like to point out that what we just did was create a linear regression model to make a prediction, without using math or computer programs. That's a pretty big deal. You should feel proud of yourself. Imagine, though, if instead of just weight, we had several other characteristics of the diamond, like the color, how close it is to being white, or the quality of the cut, or how many inclusions, little cracks or carbon granules, are inside the diamond. A lot of other characteristics. Those would translate into more columns, which would in turn translate into more dimensions in our little plot. Because our paper is two-dimensional, it gets hard to represent more than two dimensions when we're drawing dots, and it gets really hard to draw lines and planes in more than two dimensions. This is where math comes in: it can automatically find the line or the plane that fits as close as possible to the middle of that slew of dots. And then imagine if, instead of 15 diamonds, you had 15 million. Then you really want to have a computer involved; otherwise it's going to take you a very long time to compute. But the basic ideas are simple enough that you can do them with pen and paper. Now, in order to get a good answer, you need to have enough data. If you don't have enough data, what you get is kind of like this: you can see something, but it's really hard to know what it is, and it's not firm enough to base any decisions on. So you add data and you try again, and then you end up with just barely enough data. And you know you have just barely enough data because, at that point, you can just barely make the decision that you're trying to make, answer the question that you're trying to answer. So with this, if you kind of lean back and squint, you can say, you know what, that looks like a canal, that looks like a gorgeous sunset with clouds in the sky, and those look like buildings. That is a beautiful place. I want to go there on
vacation. So I used the data, and I was able to interpret it to answer my question: do I want to go there on vacation? That's how I knew I had barely enough data. Now, as you gather more and more data, the picture becomes clearer and clearer, and eventually you're able to look and make finer-grained decisions, answer more questions. Now I can say, oh, you know those three hotels on the left bank? The nearest one has fascinating architectural features. That's where I want to stay. In fact, I want the room on the third floor that's second from the right. I'm going to see if I can get it. So, you need to make sure you have at least barely enough data; more data is even better. Now, we've used our data to answer our question, but we're not done yet. We have to use the answer. If a tree falls in the forest and no one's around to hear it, it might make a sound, but if you don't use the beautiful model that you built, it definitely won't delight your customers. So, ways to use the answer we found: You can make a web service. This is something that's exceptionally easy to do with Azure Machine Learning, which the other talks today will cover in depth. You can make a decision: do I want to go there on vacation? You can set a price: how much do I want to charge for this hot dog? You can publish your code on GitHub or in the Azure Machine Learning gallery. You can write a PDF showing your results and share it around. Or you can build a dashboard, say with Power BI, to show your results to potential customers or to an employer. There's no end of ways to use it, but you have to do something with it. A model or an analysis all on its own doesn't have any life. Now, there are some things to be aware of, some gaps, when using your model. Nearly all machine learning algorithms assume that the world doesn't change. If you gather your data and the world changes in some fundamental way, then unfortunately your data may all be invalid. If I had gathered information on air travel right up until September 10th, 2001, and then
tried to make predictions about September 15th, 2001, they would have been completely wrong, because the world changed fundamentally. People's attitudes changed, and their plans changed, and events changed on September 11th. So you need to make sure that your data is relevant, that the world hasn't changed out from under you. The second gap is related: most machine learning algorithms take lots of examples to learn. Now, if you're learning something about, say, internet traffic, and you're collecting a billion packets a day, then it won't take you long to collect a nice, sizeable data set and pull some conclusions out of it. But if you're studying something like global climate change, and one data point is a single year, then by the time you've collected 10,000 data points, the world may have changed out from under you, or the results of your analysis may no longer be relevant, because all of the people, and even all of the species, that cared are no longer around to benefit from the answer. The third gap is that machine learning can't tell what caused what. Take this example. This is real data: the amount of cheese consumed, in pounds per person each year, compared to the number of people who died by getting tangled up in their bedsheets. If you look at that plot, you can see there's a pretty close relationship. It would not be surprising to see this plot in a news article with the conclusion that cheese consumption causes death by bedsheets. Similarly, it wouldn't be surprising to see it in the next newspaper with the conclusion that people getting tangled in their bedsheets and dying causes their relatives to eat more cheese. Either one is plausible based on the data, but the data doesn't tell either of those stories. Machine learning can't tell what caused what. And there's a third option, which is that these things are completely unrelated causally, and just by random chance they show a similar pattern. This is more than philosophical. For an example that you might be
familiar with: you have a case of the hiccups. You're with your friends, and they start offering remedies. Eat a spoonful of sugar. Drink three glasses of water. Stand on your head. You try these things in succession, and eventually, while you're doing one of them, your hiccups stop. That friend will say, see, mine worked. If your other friends are clever, they'll say, no, no, no, mine worked; it just took a little while to kick in. In fact, you can't tell from that experience. The data just shows that they're correlated, not that they're causally related. Now, this example seems trivial, but it's very closely related to our experience with economic recessions. There are a lot of different theories about what causes them and what ends them, and usually successive leaders try different theories in different situations, and whatever they happen to be doing when the recession ends is given credit. But there's actually no way to know for certain, based on that data alone, whether it caused the end of the recession, or whether, like the other examples, the two are just a coincidence. Now, these gaps are not showstoppers for us. We can close them with human insight and judgment. We're really good at making reasonable guesses without having enough information. In fact, we're so good at it that we will usually make unreasonable guesses even in the presence of enough information. But the upside of that, the survival value of that, is that even when we don't have enough information, we'll fill in the gaps, and we'll build that bridge, and we'll make that intuitive leap until the information catches up and can close the gap for us. Often it will close it exactly the way we bridged it, but occasionally it'll surprise us and close it a different way. The important thing is that not having the information doesn't stop us. We have to gather the data very carefully, as far as it goes, and then go with our gut after that. So, this is the complete cycle of using data to answer a question. When you're done, go back to the beginning: get more data, ask
another question, and go through it all again. Now, I've put together some resources related to each of these steps. If you have any questions or comments, please email me or reach out on Twitter or LinkedIn. I would love to chat and hear what you have to say. Thanks for listening.