Welcome to the fourth video in the Data Science for Beginners series. In this one, we'll build a simple model and make a prediction. A model is a simplified story about our data. I'll show you what I mean.
Say I want to shop for a diamond. I have a ring that belonged to my grandmother with a setting for a 1.35 carat diamond, and I want to get an idea of how much it will cost. I take a notepad and a pen into the jewelry store and write down the price of all the diamonds in the case and how much they weigh in carats. Starting with the first diamond, it's 1.01 carats and $7,366. Now I go through and do this for all of the other diamonds in the store.
Notice that our list has two columns. Each column holds a different attribute: weight in carats and price. Each row is a single data point and represents a single diamond. We've actually created a small data set here, a table. Notice that it meets our criteria for quality. The data is relevant. Weight is definitely related to price. It's accurate. We double-checked the prices that we wrote down. It's connected. There are no blank spaces in either of these columns. And as we'll see, it's enough data to answer our question.
Now we'll pose our question in a sharp way. How much will it cost to buy a 1.35 carat diamond? Our list doesn't have a 1.35 carat diamond in it, so we'll have to use the rest of our data to get an answer to the question.
The first thing we'll do is draw a horizontal number line, called an axis, to chart the weights. The range of the weights is 0 to 2, so we'll draw a line that covers that range and put ticks for each half-carat. Next we'll draw a vertical axis to record the price and connect it to the horizontal weight axis. This will be in units of dollars. Now we have a set of coordinate axes.
We're going to take this data now and turn it into a scatter plot. This is a great way to visualize numerical data sets. For the first data point, we eyeball a vertical line at 1.01 carats.
Then we eyeball a horizontal line at $7,366. Where they meet, we draw a dot. This represents our first diamond. Now we go through each diamond on this list and do the same thing. When we're through, this is what we get. A bunch of dots, one for each diamond.
Now if you look at the dots and squint, the collection looks like a fat fuzzy line. We can take our marker and draw a straight line through it. By drawing a line, we created a model. Think of this as taking the real world and making a simplistic cartoon version of it. Now the cartoon is wrong. The line doesn't go through all the data points, but it's a useful simplification. The fact that the line doesn't go exactly through every dot is okay. Data scientists explain this by saying that there's the model, that's the line, and then each dot has some noise or variance associated with it. There's the underlying perfect relationship, and then there's the gritty real world that adds noise and uncertainty.
Because we're trying to answer the question how much, this is called a regression, and because we're using a straight line, it's a linear regression. Now we have a model and we ask it our question. How much will a 1.35 carat diamond cost? To answer our question, we eyeball 1.35 carats and draw a vertical line. Where it crosses the model line, we eyeball a horizontal line to the dollar axis. It hits right at $10,000. Boom, that's the answer. A 1.35 carat diamond costs about $10,000.
It's natural to wonder how precise this prediction is. It's useful to know whether the 1.35 carat diamond will be very close to $10,000 or a lot higher or lower. To figure this out, let's draw an envelope around the regression line that includes most of the dots. This envelope is called our confidence interval. We are pretty confident that prices fall within this envelope because in the past, most of them have. Now we can draw two more horizontal lines from where the 1.35 carat line crosses the top and the bottom of that envelope.
Now we can see something about our confidence interval. We can say confidently that the price of a 1.35 carat diamond is about $10,000, but it might be as low as $8,000 and it might be as high as $12,000.
We did what data scientists get paid to do, and we did it just by drawing. We asked a question that we could answer with data. We built a model using linear regression. We made a prediction complete with a confidence interval, and we didn't use math or computers to do it.
Now if we'd had more information, like the cut of the diamond, color variations, how close the diamond is to being white, or the number of inclusions in the diamond, then we would have had more columns. In that case, math becomes helpful. If you have more than two columns, it's hard to draw dots on paper. The math lets you fit that line or that plane to your data very nicely. Also, if instead of just a handful of diamonds we had 2,000 or 2 million, a computer could do that work much faster.
Today we've talked about how to do linear regression, and we made a prediction using data. Be sure to check out the other videos in Data Science for Beginners from Microsoft Azure Machine Learning.
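For anyone who wants to see what the drawing steps look like when a computer does them, here is a minimal sketch in Python. It is not from the video. It assumes an ordinary least-squares fit stands in for the marker line, and it assumes the envelope can be approximated as the vertical distance from the line that covers a chosen fraction of the dots. The function names (`fit_line`, `predict`, `envelope_half_width`) and the `coverage` parameter are made up for illustration. Only the first row of the sample table (1.01 carats, $7,366) comes from the video; every other row is an invented placeholder, so the printed numbers will not match the $10,000 read off the chart.

```python
import math
from statistics import mean


def fit_line(weights, prices):
    """Least-squares straight line: price = slope * weight + intercept."""
    w_bar, p_bar = mean(weights), mean(prices)
    sxx = sum((w - w_bar) ** 2 for w in weights)
    sxy = sum((w - w_bar) * (p - p_bar) for w, p in zip(weights, prices))
    slope = sxy / sxx
    intercept = p_bar - slope * w_bar
    return slope, intercept


def predict(slope, intercept, weight):
    """Read the price off the model line at a given weight."""
    return slope * weight + intercept


def envelope_half_width(weights, prices, slope, intercept, coverage=0.9):
    """Vertical distance from the line that contains `coverage` of the dots.

    A rough stand-in for the hand-drawn envelope (an assumption of this
    sketch), not a formal statistical interval.
    """
    distances = sorted(
        abs(p - predict(slope, intercept, w)) for w, p in zip(weights, prices)
    )
    k = max(0, math.ceil(coverage * len(distances)) - 1)
    return distances[k]


if __name__ == "__main__":
    # Only (1.01, 7366) is from the video; the other rows are invented
    # placeholders so the sketch has something to run on.
    weights = [1.01, 0.50, 0.75, 1.20, 1.50, 1.80]
    prices = [7366, 2900, 4800, 9100, 11200, 14500]

    slope, intercept = fit_line(weights, prices)
    estimate = predict(slope, intercept, 1.35)
    half_width = envelope_half_width(weights, prices, slope, intercept)

    print(f"estimated price: ${estimate:,.0f}")
    print(f"envelope: ${estimate - half_width:,.0f} to ${estimate + half_width:,.0f}")
```

Swapping in the real notepad data would just mean replacing the two lists. With more columns than weight and price, a library routine for multiple regression would be the more practical choice than this hand-rolled two-column version.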