You're doing such a wonderful job developing a comprehensive understanding of the way data is collected and analyzed to help inform business decisions. You've learned a lot about how to analyze data and use the insights data holds to help stakeholders develop a complete picture of the business situation at hand. At this point in the program, your portfolio now includes a new entry in your PACE strategy document, a tidy data set, visualizations that tell a story with data, and a simulated A/B test. As you continue to work on other projects included in this program, your portfolio will continue to grow, demonstrating your progress, learning, and skills. Keep in mind that you're producing concrete examples that you'll want to discuss with potential employers and hiring managers during future interviews.

So far, you have reinforced your understanding of the importance of following the PACE structure in a data career and observed how Python can help power data manipulation, as well as organize and analyze a data set to tell a compelling story. Plus, this section taught you how to use statistics to explore any data set you're presented with, or to analyze and interpret data you collected through sampling. As you begin preparing for future interviews, you may be asked questions like: How would you use statistics to measure the performance of our company? How have you used statistical models to solve business problems? What are the various factors that go into an experimental design for an A/B test? You might also be tasked with sharing details about a significant data project you worked on and how you used statistics to gain insights that benefited your team or organization. In that case, the project you just completed would be great to talk about. You used Python to simulate an A/B test for a specific scenario. Keep in mind, too, that A/B tests are used in many other contexts. All the same, A/B tests generally call for similar methods, since they serve to compare two versions of the same thing, whether products, websites, email campaigns, or the like. A key skill for many professions, including those in the data career space, is the ability to be flexible in applying your knowledge.

You've already come so far in the data learning journey. I'm proud of you. Coming up, you'll learn all about regression models. Then you'll have an opportunity to demonstrate your understanding and assess a business scenario that requires a multiple linear regression model by creating a blog post. By the end of this program, you will have a solid portfolio to demonstrate all you've accomplished.

Congratulations on completing the statistics portion of this program. Your knowledge of statistics gives you a solid foundation for learning more advanced methods of analysis as you progress in your career as a data professional. Your stats knowledge will also serve you well in future job interviews. A strong understanding of fundamental stats concepts will make you a more compelling job candidate and a better data professional. Earlier, I mentioned that being a data professional means continuing your learning journey throughout your career, and that's what I love about it. Every time I encounter a new concept, I'm filled with a sense of excitement and renewed curiosity. Learning about advanced data analytics is like exploring an ever-expanding universe. There are so many amazing new worlds to discover. As the amount of data continues to grow, so does our knowledge about the best ways to analyze and interpret that data.
I still spend a lot of time educating myself on the latest developments in machine learning and reading about new ways to use statistical methods for data analytics. And now you've embarked upon that same learning journey, and you've already learned so much. In this course, you've learned to calculate descriptive stats like the mean, median, and standard deviation to explore and summarize a new dataset; apply probability distributions like the binomial, Poisson, and normal distributions to model your data; use sampling distributions to make point estimates about population means and proportions; construct confidence intervals to describe the uncertainty of your estimates; and conduct hypothesis tests to determine the statistical significance of your results.

Coming up, you'll continue to build on your stats knowledge. In the next course, you'll add a powerful new tool to your analytics toolbox: a statistical method called regression. After that, you'll get to explore the fascinating world of machine learning. Having made it through stats, you're well prepared for upcoming courses. I'm excited for you to begin working with your next instructor. You may remember them from the video that introduced this program: my colleague Tiffany, who works at the intersection of data science and marketing here at Google. Tiffany will tell you all about regression and help you take your next step toward finishing this program and pursuing your future career as a data professional. It's been an honor to join you on this stage of your learning journey. I wish you all the best for the next stage and beyond. Good luck.

Hi, welcome back to the next steps of your journey as a data professional. I'm so excited to join you. We'll be discussing regression models and more hypothesis testing so that you can explore the relationships in your data. The tools we learn about together will allow you to uncover powerful insights about your data, tell compelling stories, and influence decisions and strategy making. The modeling fundamentals we explore in this course will give you an even stronger foundation to pursue entry-level jobs in the data career space and to tackle more advanced topics in the future, like machine learning. I'm eager for us to connect all the pieces you've learned so far. You've come a long way since the start of your data journey, from the foundations of data science and Python fundamentals to exploratory data analysis and statistics. You've learned so much about the data landscape. Together we'll be applying your current skills to your first model: regression.

My name is Tiffany. I'm a marketing science lead, and I work at the intersection of data science and marketing here at Google. As a child, I gravitated toward math, probably because it was my favorite subject. When I got to college, I was not exactly sure what I wanted to major in, but I was drawn toward quantitative fields. As many of you may have also experienced, I didn't love my first decision, and I had the opportunity to try out a few majors, exposing me to the variety of analytical career opportunities. Fast forward to my career: I've held positions in finance, statistical research, and data science consulting before joining Google. I've worked with a variety of data, including patent, fraud, hardware sales, and now marketing data. The great thing about being a data professional is that it's in high demand in most industries, which means that you can find interesting problems to solve across a huge variety of companies, products, and countries.
I love the flexibility to find the right fit for me while also knowing that I will be able to deliver value to whatever team I'm on. I've learned so much in my career, and I'm excited to share my experience with you.

In this course, we will talk about modeling relationships between variables. To do this, we will focus on regression analysis. Regression analysis, or regression models, are a group of statistical techniques that use existing data to estimate the relationship between a single dependent variable and one or more independent variables. We will unpack this definition together later in this program. The techniques we learn in this course will help you answer any number of questions with actionable steps to achieve your organization's goals. For example, regression models can help you understand what variables impact sales. They can also help you understand the factors that lead a customer to subscribe to a newsletter. Regression models can even help you understand why a user keeps scrolling on a company's website. As we work through the course together, we'll continue discussing the role data professionals play in making responsible decisions, from data handling to modeling. No matter how good a potential data story is, being honest and effective storytellers means prioritizing how we arrive at that story. Python programming will be critical for running and testing complex models, visualizing data, and communicating results. Rigorous exploratory data analysis, or EDA, will inform which models we choose and how we approach the modeling process. Statistics will play a major role in helping us understand how our models work and will allow us to present actionable results to stakeholders. Our statistics toolkit will allow us to build the models we talk about.

Regression analysis is a relevant and marketable skill. Regression models are flexible, so you can design them based on the data that you have. Additionally, regression model results offer opportunities for interpretation and communication. Data professionals can provide insights from these models that align clearly with actionable steps. For example, we're able to build regression models to identify what actions on a website are indicative of high-value customers. For instance, there may be a high likelihood that customers who visit the sale page, watch a video, or sign up for emails are customers who will make multiple purchases in a year. Some of these indicators may seem obvious, but having a regression analysis that identifies and quantifies the relationship is powerful. We're able to use that information in marketing campaigns to attract new customers. Together we'll be exploring regression analysis as the basis of a solid foundation for building more complex machine learning models. I'm so excited to show you each of the tools in your regression toolkit. You'll get a lot of practice in Python. We'll explain how to determine if a model is appropriate for the data, how to run the model, and how to understand the computer's results. We're also going to discuss some math and review the concepts step by step the whole time. You can also review course resources whenever you want. I'm so excited to get started on regression, so let's begin.

Hi, I'm Tiffany. I'm a marketing science lead here at Google. I have always loved math and stats, so from a very young age I was drawn to that, probably because it was one of the only things that I was good at in school.
I just remember my brother struggling with his math homework, and he was two years older than me, so probably when I was in first grade I would sit down with him and help him with it. In my role as a marketing science lead, I build a lot of regression models to predict things like who is going to be a high-value customer or how to best allocate my budget across different campaigns. Regression models are super important, powerful tools. You'll be able to answer a wide variety of questions using different types of regression models. More recently, I've been building models and trying to interpret the coefficients, so I've been building a lot of regression models to identify things like who's a high-value customer or who's going to make multiple purchases in a year by analyzing website behavior. We really want to get down to which pages or which actions people are taking on a website that are indicative of them being a high-value customer or someone who makes multiple purchases within a year. I love being able to take large amounts of data and solve massive problems. I recently worked with a client who wanted to know how to optimize their millions of dollars of marketing budget across different channels, and by building a regression model we were able to help them optimize their spend in order to increase sales.

One of the opportunities we currently face is that third-party cookies are going away. I'm sure you've all had an ad that follows you around on the internet before; that was enabled by third-party cookies. With those going away, we have a brand new opportunity to come up with different model types and different analyses to do things that we were previously able to do easily using third-party cookies and that tracking system. So there is a massive opportunity for innovation right now in the analytics field in marketing. Things are constantly changing, and there's so much information out there that you're never going to know everything you need to know, so keeping an open, always-learning mindset is going to be key. Know that there are so many resources out there for you to research and learn from, and you can do that as you go.

I can't wait to explore regression analysis with you. Together we'll learn how data professionals go from a problem to actionable insights using regression analysis techniques. We'll begin by introducing regression analysis. The focus of regression analysis is understanding relationships between variables. We'll use the PACE framework (plan, analyze, construct, and execute) to guide us through the rest of the course. Next, you'll have the chance to apply PACE to simple linear regression, the first model we'll explore comprehensively. We'll go over the entire process from beginning to end using different scenarios and data. Then we'll examine multiple linear regression closely. Multiple linear regression expands upon a lot of the concepts from simple linear regression but will allow us to problem-solve a larger variety of questions. We will focus on some more nuanced topics, like variable selection and model interpretation. We'll also consider a few hypothesis tests, such as the chi-squared test and ANOVA. These tests will help us explore different groups in the data, allowing us to glean interesting insights. Finally, we will review the fundamentals of logistic regression. This is the last and most complex model, which will set you up well to approach the rest of the program on machine learning.
Whenever you're ready, I'm excited to get started on the next video.

You may be familiar with terms like machine learning models or regression analysis, and if not, you're in the right program. Next up, we'll get started on our learning journey. To begin, we're going to discuss a few key terms and definitions. Regression models are based on a statistical foundation. Models used in the data career space are a family of techniques that rely on existing information, or data points, to inform what we might think other data points will look like. The goal is always to tell a story about the relationships between the variables in the data. The story will help stakeholders adjust their business strategy and decisions. Modeling follows an iterative process. You may be familiar with other frameworks, such as the data lifecycle or the six steps of exploratory data analysis. In this course, we will cover each step of the modeling process using PACE. Outlined earlier in this program, PACE (plan, analyze, construct, and execute) provides us with the foundation for conducting regression analysis. Let's take some time to preview how PACE works in the context of regression analysis.

In regression modeling, the plan stage is about understanding your data in the problem context. Knowledge you bring, whether from your industry or elsewhere, can be instrumental in the plan stage. By considering what data you have access to, how the data was collected, and what the business needs are, you'll be able to strategically analyze, construct, and execute the rest of your work. The plan stage will guide the other three stages of PACE.

After you plan, you have to analyze. In this stage, you examine your data more closely so you can choose a model, or a couple of models, you think might be appropriate. When working with regression analysis, this is where you use Python to perform EDA and check the model assumptions as needed. Model assumptions are statements about the data that must be true to justify the use of particular data techniques. As a data professional, you will use statistics to check whether model assumptions are met. A good understanding of statistics gives data professionals the power to construct meaningful models.

After you analyze, you must construct. For regression analysis, this is where you actually build the model in Python or your coding language of choice. This step involves selecting variables, transforming the data as needed, and writing code. Even though you checked model assumptions before you built the model, many model assumptions need to be rechecked after the model is built, so you'll do that in the construct phase as needed. The last part of the construct phase is evaluating the model results. At this point, you are answering the question: How good is my model? You'll choose metrics, compare models, and get preliminary results. Then, based on your evaluation, you can use EDA to refine your model accordingly. Of course, as a data professional, you must first and foremost be an honest storyteller. Studying the results produced by the regression will uncover relationships within your data and help you discover insights to tell the full story.

This leads to the last part of PACE: execute. You'll interpret everything you learned from analyzing and constructing to share the story. You'll prepare formal results and visualizations and share them with stakeholders. To do this, you'll convert model statistics into statements describing the relationships between the variables in the data.
These descriptions must consider the context and initial questions from the plan phase. At the center of everything is data, and the PACE framework helps data professionals stay organized. The insights data professionals produce must be data driven and accurate, and they have to make sense given the business or community context. We'll go over these steps using examples later in the course, but PACE is iterative. As you grow as a data professional, your experience will help you decide when to pivot between stages of PACE. You might switch the order or repeat steps depending on the situation. Now that we've laid out the pieces of the modeling puzzle, we can talk about how correlation and regression are related. Then we will explore two foundational regression models: linear and logistic regression. These models will be covered in more depth in later videos. This overview will provide you with a solid basis of understanding. Time to start putting together our puzzle. Don't forget your statistical grammar tools. You'll need them.

Previously, we learned that regression analysis is about estimating relationships between a single dependent variable and one or more independent variables. In this video, we will discuss our first modeling technique: linear regression. Many patterns that you've observed in daily life can be expressed using linear regression models, which is pretty cool. For example, as a version of computer software gets older, online searches for that software version may decrease. As a social media personality gains followers, their book sales increase. These relationships can be modeled using a linear regression. The linear in linear regression indicates the kind of relationship we can visualize on a graph: a line. A line is a collection of an infinite number of points extending in two opposite directions. On a graph, the individual points show up as a line, and we only see a portion of the line.

Linear regression is a technique that estimates the linear relationship between a continuous dependent variable y and one or more independent variables x. For example, we could model the relationship between the prices of a product and the number of sales. Our y variable would be the number of sales, and our x variable would be the prices. In an earlier course, you learned the difference between continuous and categorical variables. As a reminder, continuous variables are variables that can take on any real value between their minimum and maximum values. For example, product sales, vehicle speed, and time spent on a webpage are all continuous variables, whereas types of products and education level are not. These are categorical. Categorical variables have a finite number of possible values. While linear regression allows us as data analytics professionals to estimate continuous dependent variables, there are other regression models that let us estimate categorical variables. We'll learn more about those in a later course.

Throughout this course, we'll talk about dependent and independent variables. The dependent variable is the variable a given model estimates. Sometimes the dependent variable is also called a response or outcome variable and is commonly represented with the letter y. We assume that the dependent variable tends to vary based on the values of independent variables, typically represented by an x. Independent variables are also referred to as explanatory variables or predictor variables.
For example, let's say you're working at a cake shop and you're trying to understand the factors that contribute to cake sales. The dependent or y variable would be the number of cake slices sold on any given day. An independent or x variable could be how many cups of coffee are sold that day. Maybe as more coffee is purchased, more cake is also purchased. In linear regression, you might encounter two more terms: slope and intercept. The slope refers to the amount we expect y, the dependent variable, to increase or decrease per one-unit increase of x, the independent variable. The intercept is the value of y, the dependent variable, when x, the independent variable, equals zero. Going back to cake and coffee, the slope would be how many slices of cake are purchased per cup of coffee purchased. The intercept would be the number of cake slices that are sold when zero cups of coffee are sold.

When two variables x and y are related in a linear way, we say they are correlated. Using statistics, we can actually calculate how strong the linear relationship between x and y is. Pretty cool. There are two kinds of correlation: positive and negative. Positive correlation is a relationship between two variables that tend to increase or decrease together. For example, as more cups of coffee are purchased, more cake slices are purchased. Negative correlation, on the other hand, is an inverse relationship between two variables. When one variable increases, the other variable tends to decrease, and the reverse is true too. For example, let's say you're still working at the cake shop and you're estimating how often to refill the iced coffee dispenser. You can model the relationship between iced coffee and hot coffee sales. As hot coffee sales increase, you might notice iced coffee sales tend to decrease. Or perhaps you're working at a media company and you're analyzing readership. As the length of news articles increases, the number of people who finish reading the article might decrease. This is also an example of negative correlation.

Identifying these kinds of relationships can be incredibly useful in the workplace and in everyday life. Determining linear relationships helps us answer questions such as: Which factors are associated with an increase or decrease in product sales? Which factors make social service providers increase resources in a given region? Which factors lead to more or less demand for public transportation? In the cases mentioned, the size of the regression slope tells us how much sales, resource allocation, and public transportation increase or decrease. Using linear regression, you can help answer similar questions in any industry.

However, it is important to note that correlation is not causation. For example, in your cake shop, people buying coffee does not cause cake sales to increase. When modeling variable relationships, a data scientist must be mindful of the extent of their claims. Causation describes a cause-and-effect relationship where one variable directly causes the other to change in a particular way. Proving causation statistically requires much more rigorous methods and data collection than correlation. For a data professional, the distinction between correlation and causation is especially important when presenting results. For example, although we can say that as someone gets older, the number of places they have visited tends to rise, we cannot necessarily say that someone's age causes the number of places they have visited to increase.
There could be other factors causing someone to travel more that coincide with aging, such as visiting family or traveling more for work. Any of these factors could be correlated with aging, but it's hard to say whether age or other factors are causing the traveling. Articulating that correlation is not causation is part of a data professional's best practices and ethical toolbox. Both correlational and causal relationships provide useful insights. Regression analysis helps data analytics professionals tell nuanced stories without needing to prove causation.

That concludes a high-level overview of your first modeling technique. To recap: linear regression is a way to model linear relationships. Dependent variables vary according to independent variables. The slope identifies how much the dependent variable changes per one-unit change in the independent variable. Positive and negative correlation describe linear relationships between variables. Always be mindful when interpreting regression results: correlation is not causation. There are lots of linear relationships in the world and in industry. In this video, we've provided just a few examples, but there are plenty more. That's all for now. Next, we'll address the math behind linear regression. Your ever-growing statistical foundations will help you explain regression results in the clearest way possible.

In an ideal world, you would want every data point relevant to the question you are trying to answer. In a previous course, you learned that data professionals describe this as a population. Let's imagine you're working for a publishing company. You want to understand the relationship between an author's social media following and their book sales. You would need data on every book author's social media follower count and every book sale of all time. This is an impossible task, but luckily, you don't actually need an entire population to run a meaningful regression analysis. You can get a reasonable estimate with a representative sample. A sample is a part of the population, which is just the statistical way of saying a sample is some of the data you could possibly have. If you have a set of sample data, each data point can be represented with its own set of attributes, or x and y values. In that case, the sample does not contain all possible values from the population of data. The observed values, or actual values, are the existing sample of data. Each data point in this sample is represented by an observed value of the dependent variable and an observed value of the independent variable. In the publishing company example, the dependent variable is book sales, and the independent variable is how many followers the author has on social media. An observed value you might have in your data set would be x, or number of followers, equals 10,000 and y, or number of book sales, equals 500.

The goal of regression analysis is to define a relationship mathematically between the sample y's and x's to understand how the two variables interact. You can imagine that at every x value, there are many possible values that y could take on. To simplify this understanding, linear regression analysis focuses on the mean of y given a particular value of x. This mean of y is the value on the line in a linear regression and is denoted with the Greek letter mu, which looks like a lowercase m; you can remember m is for mean. As described previously, in order to define a linear relationship, we need a slope and an intercept.
In statistics, we write the intercept as beta zero, which we sometimes call beta naught, and the slope is written as beta one. Mu of y and the betas are sometimes called parameters. Parameters are properties of populations, not samples, so we can never know their true values, since we can't observe the whole population. But we can calculate estimates of the parameters using our sample data. To differentiate between the population parameters and the estimates of the parameters, we denote the estimates with a hat. So beta zero hat, beta one hat, and mu hat are all parameter estimates. Although it's valuable to recognize the mu notation, for the remainder of this course, we'll use a simplified notation: y equals beta zero plus beta one times x.

For example, let's say that we estimate beta zero as negative one and beta one as five. Now let's input some values for x. If x equals zero, we get negative one, which is the y-intercept, or beta zero. If x equals one, we get y equals four. If x equals two, we get y equals nine. If x equals three, we get y equals 14. From just four data points, a pattern emerges: for every one-unit increase in x, we get a five-unit increase in y. Remembering our equation, five is our slope, or beta one. The slope tells us how much y increases for every one-unit increase in x. These estimated betas are also called regression coefficients. So now, whenever you see the hat symbol, you'll know that you are estimating betas, also known as regression coefficients. In the prior formula, the regression coefficients were the slope and intercept. They described the linear relationship found in the sample data.

In order to enter values of x and get estimated values for y, we assumed that beta zero was negative one and that beta one was five. But how did we arrive at those regression coefficients? One of the most common ways to calculate linear regression coefficients is ordinary least squares estimation, or OLS for short. We will go over the math behind OLS later in the course. For now, though, let's discuss an overview of how OLS works. In linear regression analysis, we as data professionals are trying to minimize something called a loss function. The loss function is a function that measures the distance between the observed values and the model's estimated values. Theoretically, we could draw an infinite number of lines that model the data we have. But we don't want to find just any line; we want to find the best fit line. So we want to minimize the loss function. With this foundational understanding of how linear regression works, we will be able to talk about PACE in linear regression analysis, which we addressed earlier. Later in the course, we'll build on the concepts we covered in the video to learn more about how regression works, when to use linear regression, its variants, and OLS in Python.
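Before moving on, here is the earlier worked example as a few lines of Python, a minimal sketch using the example's estimated coefficients:

```python
# Estimated coefficients from the example: beta zero hat = -1, beta one hat = 5.
beta_0_hat = -1
beta_1_hat = 5

# Plug in a few x values: y = beta_0_hat + beta_1_hat * x.
for x in [0, 1, 2, 3]:
    y = beta_0_hat + beta_1_hat * x
    print(f"x = {x}, y = {y}")
# Prints y = -1, 4, 9, 14: every one-unit increase in x adds 5 (the slope) to y.
```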
I still remember being on a sports team, needing to determine which team first played offense or defense using a coin toss. What's the chance the coin will come up heads? What's the chance the coin will come up tails? The probability of either event occurring is 0.5, or 50%. This is a classic probability problem. But in the field of data analytics, we get to use a technique called logistic regression to model much more complex probability problems. For example, what factors lead to someone subscribing or not subscribing to a newsletter? Under what circumstances does someone comment on an online video or social media post? Given certain factors, how likely is it that someone renews their membership to an organization? All of these questions are tackling discrete events, or categorical data: there is a specific set of possible outcomes. Subscribe or don't subscribe, comment or don't comment, renew or don't renew. To answer these kinds of questions, data professionals use a model called logistic regression. Logistic regression is a technique that models a categorical variable based on one or more independent variables. The dependent variable can have two or more possible discrete values.

Let's say that your company has a newsletter and is interested in increasing readership. On the company website, users have the opportunity to subscribe to the newsletter. One factor related to newsletter subscription could be how many minutes the user spends on the webpage before leaving. Our dependent variable Y has two possibilities: users don't subscribe, which we'll represent with a zero, or users do subscribe, which we'll represent with a one. Our independent variable X is continuous and measures how many minutes the user spends on the webpage before leaving. Let's graph the hypothetical data on a scatter plot, like we would when doing EDA, or exploratory data analysis. We can observe that the data points fall roughly in two horizontal lines. The higher one indicates the user subscribed; this is when Y equals one. The lower one indicates that the user did not subscribe; this is when Y equals zero. The X-axis indicates how long the user was on the site in minutes. Since the relationship between X and Y is not just a straight line, we need a new mathematical way of expressing the relationship between X and Y. Logistic regression will allow us to model the probability that a user will subscribe to the newsletter.

The key concept is that the mean of Y given X is equal to the probability that Y equals one given X. Let's explore this idea. At this point, our observed Y values are just a bunch of zeros and ones. To find the mean, we would sum all of our observations and then divide the total by the number of observations. Because some of the observed data are zeros, when summing the observations, the zeros all add up to zero. Therefore, the sum of all the observations is equal to the total number of ones. Then we divide by the total number of observations. This is equal to the probability of Y equaling one, or someone becoming a subscriber. In this case, we know that the mean of Y given X is equal to the probability of Y equaling a certain outcome given X. Sometimes the probability of Y given X is written as P to reinforce the idea of probability. To help you remember, you can think P is for probability.

Now we want to understand what variables help explain those probabilities. Mathematically, we need a way to relate the X variables to the probability that Y equals one. Imagine we only have one independent variable. Then we need to relate the probability that Y equals one to beta zero plus beta one times X. In logistic regression, we use a link function to express the relationship between the X's and the probability that Y equals some outcome. A link function is a non-linear function that connects, or links, the dependent variable to the independent variables mathematically. We will discuss how logistic regression and link functions work in more detail later on. But for now, just know that this is an example of the differences between linear and logistic regression models.
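As a small illustration of the idea, here is a sketch of the most common link for a binary outcome, the logit link, whose inverse is the sigmoid function. The coefficients below are made-up numbers for the newsletter scenario, not estimates from real data:

```python
import numpy as np

# Made-up coefficients for illustration only: beta_0 = -4, beta_1 = 1.5.
beta_0 = -4.0
beta_1 = 1.5

def predicted_probability(x):
    """Sketch of P(Y = 1 | X = x): the linear combination beta_0 + beta_1 * x
    is passed through the sigmoid so the output lands between 0 and 1."""
    return 1.0 / (1.0 + np.exp(-(beta_0 + beta_1 * x)))

# Hypothetical minutes a user spends on the page before leaving.
minutes_on_page = np.array([0.5, 1.0, 2.0, 4.0, 6.0])
print(predicted_probability(minutes_on_page))
# Longer visits yield probabilities closer to 1 (more likely to subscribe).
```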
Let's review a few more similarities and differences between modeling approaches. First, linear regression involves a Y that is continuous, like book sales, while logistic regression involves a Y that is categorical, like newsletter subscription. Choosing one model over the other is about the kind of data you have. You can answer similar questions about what factors impact an outcome of interest. Second, since linear regression models a continuous variable, we're estimating the mean of Y. But logistic regression models a categorical variable, so we're modeling the probability of an outcome, for example, whether Y equals one or zero. Lastly, for linear regression, we can express Y directly as a function of X, but for logistic regression, we need a link function to connect the probability of Y with X. We'll continue focusing on logistic regression with just two categories, like the subscription problem. But there are more complex versions of logistic regression that can model multiple outcomes or categories, such as the types of skin care products people buy or the types of services that people receive.

In this video, we covered the basics of logistic regression. We also compared and contrasted linear and logistic regression models. Later in this course, we'll cover an estimation technique for logistic regression called maximum likelihood estimation, and we'll uncover a bit more of the math. For now, know that computers are incredibly powerful and enable us as data practitioners to focus less on the math and more on the storytelling. The rest of the course will provide you with a solid understanding, so you know what's going on in that powerful machine of yours.

Together, we've covered a lot of high-level regression concepts already. You should be proud of yourself for learning all of these new concepts. It's been incredible starting you off on your regression journey. In prior courses, you learned about the data career space, the importance of communication in data-driven work, considerations for how data is cared for, the value of Python programming as a data tool, EDA, and statistics. In this course, we've started connecting these concepts together to prepare you to build your first models. So far, we've talked about PACE in regression analysis. Planning allows you to consider how the data was collected and what the business needs are in a particular instance. In the analyze phase, you perform EDA, which helps you determine whether one model is more suitable than another. When constructing your model, you'll be amazed by the power of your computer and the Python programming language. Through model construction, you'll apply your creativity as you visualize your data and regression models as well. You'll then rely again on statistics and math to evaluate any model you encounter. Finally, in the execute phase, you'll focus on communication as you interpret the results of your model. By combining the four steps of PACE, you'll start creating data-driven stories soon.

We covered two models briefly: linear regression and logistic regression. These models are able to estimate common relationships between variables that we observe in our personal and professional lives. Regression models help us answer questions about what factors are associated with a variable of interest, and by how much. The data always leads us to the questions we can ask and answer. Linear regression needs a continuous dependent variable, so it can help model anything measurable and quantifiable.
Logistic regression needs a categorical dependent variable, so it can determine the probability of something occurring. Linear regression can capture positive correlation, where variables increase or decrease together, or negative correlation, where one variable goes up as the other goes down. Logistic regression models use a link function to relate the x's and the categorical y. Later, we will learn how to use estimation techniques in Python with real data, and we'll go through more use cases of regression. Although you will encounter many formulas along the way, keep in mind that the end goal is to figure out the story the data is trying to tell. You have many resources available to you in this course. Good luck with the rest of this section, and be patient with yourself. I'm thrilled to join you again next time.

Welcome back to Google's course on regression models. It's great to be with you again. Previously, we provided a high-level overview of regression analysis and two core foundational regression models: linear and logistic regression. In this section of the course, we'll go over how to set up, build, evaluate, and interpret our first regression model. We'll also review model assumptions, construction, evaluation, and interpretation as we've previously defined them. There are a lot of new concepts, so if you ever need to get reorganized, remember the PACE framework: plan, analyze, construct, and execute. Each part of PACE aligns with a part of the process of regression analysis. Sometimes we have to repeat some steps, but these are the phases to keep in mind, and we'll go through the process concretely together, using examples to help guide you.

To recap, linear regression is a technique that estimates the linear relationship between a continuous dependent variable and one or more independent variables. Recall that an independent variable is the variable whose trends are associated with the dependent variable. Independent variables are commonly represented by the letter x. Meanwhile, the dependent variable is the variable that a given model estimates, also known as the outcome variable. The dependent variable is commonly represented by the letter y. Now we'll learn more about simple linear regression. Simple linear regression is a technique that estimates the linear relationship between one independent variable x and one continuous dependent variable y. The model assumptions, code, evaluation metrics, and interpretation skills we review here will extend directly into more complex models, like multiple linear regression. By solidifying your foundation through simple linear regression, you'll be prepared to tackle advanced models that can answer more complex questions in different industries and business contexts.

As we go through simple linear regression together, you'll combine many of your prior skillsets, including Python programming, exploratory data analysis or EDA, and statistics. These tools will allow you to construct a simple linear regression model that can help you influence strategy and decision making in any company or organization. The key terms, activities, and learning resources that we explore together in these videos will also be important in the rest of this course, which will refer back to our discussion of simple linear regression. Together, we'll go through all the stages of PACE while learning regression modeling. We'll first review the linear regression equation and then learn about estimating parameters using ordinary least squares.
Then we'll define each of the four key assumptions of simple linear regression. In the analyze stage, we'll use Python and EDA to verify that our data meets these assumptions. In the construct stage, we'll build your first regression model in Python together. You'll also have opportunities to practice both Python and EDA on your own. We'll also learn several evaluation metrics and a technique to help you quantify how good your model is. Lastly, aligned with the execute stage, we'll use our metrics to practice interpreting our results for stakeholders and non-technical audiences. I'm so excited to get started on simple linear regression modeling, so let's begin.

My name is Jarad, and I'm a principal lead on the analytics and decision support team in YouTube. Our team focuses on building out business intelligence and analytics for the YouTube business organization. Mentorships have been a huge part of my progression in my career. I've had the fortune of many mentors over the course of my career, both at Google and outside of Google, some from schooling, some just by happenstance, but every single one of them has given me something: a motivating factor, encouragement to believe more in myself, or simply being there to listen to me vent about something not working. Look for groups that are focused on the areas you're interested in. If we're thinking about BI and analytics, there are so many groups of business intelligence professionals, and there are niche groups focused on data science or predictive analytics. I've also just reached out to people, whether via email, introductions from other folks I know, or LinkedIn, and I usually tend to look for people who are doing something I'm interested in, who have a skill I don't have, and I just reach out and say, hey, I'd love to have a conversation about XYZ and hear how you think about it. I've actually found that LinkedIn is a really powerful tool. It takes work, but it's a very simple entry point for networking. And when I say networking, I don't mean just small talk. I mean actually reaching out to someone in the hopes that they are someone who actually wants to be reached out to, and really being intentional about what it is you want to learn from people who have done something you want to do. Bet on yourself, because most likely there are not as many people who are doing the same, but the more you show and prove your capabilities, your stick-to-itiveness, whatever other adjective you can use, the more people will start to bet on you, and the more opportunities will come from that.

In prior videos, we've mentioned simple linear regression, which is a regression technique that estimates the linear relationship between one independent variable X and one continuous dependent variable Y. Recall that the linear in linear regression indicates what the data looks like when plotted on an XY coordinate plane: a line. In simple linear regression, we're only interested in two variables, one X and one Y variable. So the equation for a regression line is Y equals intercept plus slope times X, which is represented as beta zero plus beta one times X. Since we'll have a number of data points for any given problem, there are many different lines we could draw that might fit the data. However, we're looking for the best fit line: the line that fits the data best by minimizing a loss function, or error.
In order to find the best fit line, we need to measure error. We can consider error as some difference between the data we have, the observed values, and the predicted values generated by a given model. The predicted values are the estimated Y values for each X calculated by a model. The difference between the observed, or actual, values and the predicted values of the regression line is what's known as a residual. The equation for a residual is: residual equals observed value minus predicted value. Each data point has one residual. Using mathematical notation, the equation for the residual of an individual data point is epsilon i equals Y i minus Y i hat. Epsilon is a Greek letter that resembles the letter E, as in E for error. We can calculate individual residuals for each data point, but an important thing to note is that the sum of the residuals is always equal to zero for OLS estimators. For other estimators, the sum of the residuals is not necessarily zero. In order to capture a summary of total error in the model, we square each residual and then add up the squared residuals for every data point. This is called the sum of squared residuals. The sum of squared residuals, or SSR, is the sum of the squared differences between each observed value and the associated predicted value.

For linear regression, we'll be using a technique called ordinary least squares to get our best fit line. Ordinary least squares, also known as OLS, is a method that minimizes the sum of squared residuals to estimate parameters in a linear regression model. Using OLS, we can calculate beta zero hat and beta one hat using properties of the sample data. Recall that the hat symbol means it is an estimate of a parameter. We will never know the exact parameter. Remember, parameters, or betas, are characteristics of a population. Since we will only ever have sample data, our goal is to get a reasonable estimate of the parameter.

Let's say that we have a certain sample of data, and we now want to determine a line that fits the data well. For our first attempt at a best fit line, the slope is 1 and the intercept is 2.5. To calculate the sum of squared residuals, first we calculate the predicted values for each x. Next, we can find the residual for every observed value of x. On the graph, we've plotted the residuals: the difference between each observed value and what the line predicted. So this line is okay, but let's try to get a little bit closer to the points. Let's try another line. This time, we'll set the slope equal to 1.25 and the intercept equal to 3. Again, we plot the residuals and calculate the sum of squared residuals. From the graph, it seems we're getting closer, but it's hard to tell if we've got the best line. We could just keep trying different lines and, through trial and error, pick a line that we think is closest to the data points. But that's very time consuming. The good news is that in Python, the computer will use OLS, the ordinary least squares estimation technique, to evaluate candidate lines and identify the best fit line for you. With OLS estimation, we find that beta 0 hat equals 1.5 and beta 1 hat equals 3.2. The line we're estimating represents the best fit of the model to the data. Later, we'll talk about uncertainty, using p-values and confidence intervals to aid in the interpretation of results. Now that we've covered what simple linear regression is, we'll go back to PACE, our regression analysis framework.
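Before we do, here is a minimal sketch of this idea. The data points below are made up (roughly following the line y = 1.5 + 3.2x from the example), and the candidate lines echo the two attempts above; only the comparison logic matters:

```python
import numpy as np

# Made-up sample data, roughly following y = 1.5 + 3.2x with some noise.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([4.9, 7.6, 11.4, 14.1, 17.8])

def sum_of_squared_residuals(intercept, slope):
    predicted = intercept + slope * x   # predicted value for each x
    residuals = y - predicted           # observed minus predicted
    return np.sum(residuals ** 2)       # square each residual, then sum

# Compare candidate lines by their SSR -- lower means a better fit.
for intercept, slope in [(2.5, 1.0), (3.0, 1.25)]:
    print(f"intercept={intercept}, slope={slope}, "
          f"SSR={sum_of_squared_residuals(intercept, slope):.2f}")

# OLS finds the minimizing line directly rather than by trial and error.
slope_hat, intercept_hat = np.polyfit(x, y, deg=1)
print(f"OLS estimates: intercept={intercept_hat:.2f}, slope={slope_hat:.2f}")
```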
We'll examine the model assumptions that the data needs to follow for us to use this cool new tool in our regression toolbox.

All right, let's begin with the analyze stage of the PACE framework. The first task in simple linear regression analysis is checking the assumptions of the model. In addition to the technical needs of the model, you'll need to consider the business context of the problem you're working on; that comes in the plan stage. Previously, I spoke about model assumptions as statements about the data that must be true in order to justify the use of a particular modeling technique. Ensuring that we're using the right model given the data that we have allows us to be confident in the results those models produce. Think of model assumptions as the bridge between the analyze and construct phases of the PACE framework. In other words, examine the assumptions before the construct phase when possible. Certain assumptions can only be checked after model construction, so make sure you check those assumptions after you apply the model to confirm whether the model is valid or not. Data visualizations can be used as a tool to determine if model assumptions are met. Thankfully, Python will help you tremendously with generating these, and I'll be here to guide you through it all too.

There are four key assumptions of a simple linear regression: linearity, normality, independent observations, and homoscedasticity. For now, we'll focus on understanding what each of these assumptions means and how they are checked using data visualizations.

The first assumption, linearity, just so happens to be the simplest to check. You already know that the linear in linear regression comes from the way the data looks when plotted on an XY coordinate plane: a line. To detect if this assumption is met, you just have to make sure that the points on the plot appear to fall along a straight line. If the visualization looks like a random cloud or resembles a curve rather than a line, then the assumption is considered invalidated, meaning that this model does not fit the data well. You might need a different or a more complicated model for this data set. In contrast, a scatter plot that shows the data points clustering around a line indicates that linear regression would be an appropriate model to represent the relationship between X and Y.

Next up on the checklist is the normality assumption. This assumes that the residual values, or errors, are normally distributed. Since this assumption is about residuals, you cannot check the assumption until after the model is built. But once the model is built, you can create a specific plot, called a quantile-quantile or QQ plot, of the residuals. If the points on the plot appear to form a straight diagonal line, then you can assume normality and check this assumption off the list.

Next is the independent observations assumption, which states that each observation in the data set is independent. Here, it is helpful to use contextual information about data collection and the variables used to determine if this is true. If the assumption is met, we would expect a scatter plot of the fitted values versus residuals to resemble a random cloud of data points. If there are any patterns, then we might need to reexamine the data.

Last but not least, the homoscedasticity assumption is fourth on the list. This one sounds complicated, but knowing the literal meaning of the term helps: homoscedasticity means having the same scatter.
Scatter plots come to the rescue once again when checking for homoscedasticity. Returning to the scatter plot of fitted values versus residuals, there should be constant variance along the values of the dependent variable. This assumption holds if you notice no clear pattern in the scatter plot; sometimes you'll hear this described as a random cloud of data points. If, for example, you observe a cone-shaped pattern, then the assumption is invalid. Linearity, normality, independent observations, and homoscedasticity are the four assumptions of a simple linear regression. Before moving ahead, don't feel like you need to memorize all of this material right now. Remember, data analysis is an iterative process. You can go back to these concepts, check to see how the assumptions align with your data, and then move forward with the regression process. You'll have plenty of opportunities to practice and hone your skills throughout this course. Now let's apply everything you've learned here and try it out on a dataset using Python.

Are you ready to start practicing your computer programming skills now? In this video, we'll apply some of the concepts around simple linear regression to data. The move from theory to application is a big milestone, but remember that you have worked hard to get to this point. We'll go through the code in some depth today, and you will have access to the code to review in detail.

Let's begin by exploring a problem that's been affecting a local zoo where you were recently hired as part of the analytics team. The caretakers who manage the penguin habitats are having trouble keeping their population adequately fed. They're hoping to find out if certain features of the birds are related to body mass, to better manage their feeding routines. The dataset includes structural measurements of different penguins, such as bill length and body mass. Now that you have some context, you can use exploratory data analysis to start analyzing the data.

First, import some packages: pandas and seaborn. Both will be especially useful today. The amazing thing about Seaborn, and many other libraries in Python and other programming languages, is that there are built-in datasets you can work with. You don't even need to download any files. Now load the penguin dataset and save it as a variable called penguins. The load_dataset function returns a data frame, so penguins is a data frame. Now that you can access the data, use the head method to examine the first couple of rows. There are a few continuous variables: bill length, bill depth, and flipper length, all measured in millimeters, and body mass, measured in grams. There are also a few discrete variables: species, island, and sex. Since you're working with simple linear regression, you'll focus on the continuous variables. If you would like to, you can access the code to see how I cleaned the data: I subset the data to include only two species of penguin and dropped a couple of rows with missing data. The clean data is saved in a data frame called penguins_final. Now you can start creating plots to identify some linear relationships between the continuous variables. Input the data frame into Seaborn's pairplot function, which creates a scatter plot matrix. A scatter plot matrix is a series of scatter plots that show the relationship between pairs of variables. By using the pairplot function, you will observe a few linear relationships in the scatter plots.
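Here is a consolidated sketch of the steps described so far. Which two species are kept is an assumption on my part (the course notebook makes its own cleaning choices), so treat the subsetting line as illustrative:

```python
import seaborn as sns
import matplotlib.pyplot as plt

# Load Seaborn's built-in penguins dataset; load_dataset returns a DataFrame.
penguins = sns.load_dataset("penguins")
print(penguins.head())

# Keep two species and drop rows with missing values (illustrative choices).
penguins_final = (
    penguins[penguins["species"].isin(["Adelie", "Gentoo"])]
    .dropna()
    .reset_index(drop=True)
)

# Scatter plot matrix: pairwise scatter plots of the continuous variables,
# with each variable's distribution along the diagonal.
sns.pairplot(penguins_final)
plt.show()
```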
The diagonal displays the distribution of the continuous variables. First, bill length and body mass seem to be positively correlated. Next, flipper length and bill length also seem to be positively correlated. Lastly, body mass and flipper length also seem to be correlated. Observing these linear relationships assures you that the data can meet the linearity assumption for building a simple linear regression.

Let's explore the relationship between bill length and body mass further, in terms of the linear regression assumptions. We know we have met the first linear regression assumption, linearity. Luckily, the diagonal of the pair plot also shows us the distribution of each variable. We can observe that both bill length and body mass are close to being normally distributed, which suggests that we'll probably have normally distributed residuals. The third assumption is independent observations. Since each row has data on a different penguin, we have no reason to believe that one penguin's bill length or body mass is related to any other penguin's. We can confirm the last assumption, homoscedasticity, after we build our model, when we graph the residuals.

Now let's subset the data once more to isolate bill length and body mass. Note the use of double square brackets, which tells Python which columns you want to choose. Write out the regression formula in terms the computer can understand, and save it as a variable called formula. As you do this, pay careful attention to the column names. You need to specify the column names exactly so the computer knows how to run the regression model. First, type the y variable column name, which is body_mass_g, then a space, a tilde, another space, and the x variable column name, which is bill_length_mm. The tilde lets the computer know that whatever comes after is our x variable. The spaces are not necessary, but they can be helpful for clarity. Now that you have the data and the formula, you can create an OLS object using the ols function from the statsmodels module. Save the object as a variable called OLS. Input the formula variable as the formula argument of the ols function, then input the data variable as the data argument. Next, you'll use the OLS object's fit method to actually fit your linear regression model to the data. Save the results as a variable called model. Finally, print out the result of the ordinary least squares estimation, which is the technique the ols function used to build the linear regression model, by using the model's summary method, which will print out a table of many different statistics.

This table contains lots of information. We'll go through some sections of the table later in this course and leave the rest for you to explore on your own. For now, we'll focus on the bottom section, which details the coefficients the model determined would generate the best fit line. Since we're using a simple linear regression model, we have two coefficients: an intercept, or beta zero, and a slope, or beta one. You can find the y-intercept of the best fit line in the intercept row of the coefficient column, which is abbreviated to coef in the table. In this case, it's negative 1707.2919. You can find the slope in the bill_length_mm row of the coefficient column. The slope of the best fit line is 141.1904. Let's rewrite this as a linear equation, which will help us interpret the results later.
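Here is a minimal sketch of those model-building steps, continuing from the penguins_final data frame in the earlier sketch; the ols_data name below is illustrative, while formula, OLS, and model follow the names used in the walkthrough:

```python
from statsmodels.formula.api import ols

# Subset to the two columns of interest; double square brackets return a DataFrame.
ols_data = penguins_final[["bill_length_mm", "body_mass_g"]]

# Regression formula: the y variable, a tilde, then the x variable.
formula = "body_mass_g ~ bill_length_mm"

# Create the OLS object, fit the model, and print the summary table.
OLS = ols(formula=formula, data=ols_data)
model = OLS.fit()
print(model.summary())
```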
y is the penguin's body mass in grams; x is the penguin's bill length in millimeters. Next, plug in the intercept and slope the model calculated, rounding both to the nearest hundredth, or two decimal places. This means that a penguin with a one-millimeter-longer bill has, on average, 141.19 grams more body mass. Remember, you still need to examine some assumptions about the residuals to double-check the conclusion. Great! You have fit a linear regression model to the data. To finish checking the model's assumptions, calculate some fitted values using your model's predict method. Then access the residuals, meaning the differences between the actual and fitted values, using the model's resid attribute. Finally, create a couple of plots to confirm your findings. Use seaborn's regplot function to plot the data with the best fit regression line. You can observe a linear relationship between the variables, the best fit line, and a small shaded region around the line indicating the uncertainty around the model estimates. Returning to the linear regression assumptions, create a scatter plot of the fitted values against the residuals. This is a very common plot you'll encounter when working with linear regression to check various assumptions. From this plot, you can observe that the residuals seem randomly spaced, which means you can assume homoscedasticity. A random-looking scatter plot indicates that the independence assumption is not violated, but it's not the sole reason to believe it to be true. You could also examine how the data was collected and use other, more advanced statistical tests to confirm this. Lastly, create a histogram of the residuals to determine if the residuals are normally distributed. If the residuals are normally distributed, following that classic bell curve shape, then you can confirm the normality assumption has been met as well. The residuals are a little bit skewed in the histogram, so you can create a QQ plot to verify normality. You can use statsmodels' qqplot function to create the graph. There is a straight diagonal line trending upward with some slight curvature on the extremes. You may want to explore this further, but for now, this is pretty good confirmation of the normality assumption. Now that you've confirmed all the assumptions are met, you can say that the results from the regression model are likely reasonable. Wow, we went through a lot of code and many plots together. You should be so proud of what you've accomplished in this video. Keep in mind that you can always review what we've covered, along with the code and documentation. Great work! So far, we've used PACE to think like a data professional as we addressed the penguin problem at our local zoo. We planned by thinking through the problem and subsetting the available data appropriately. How can we better understand the relationship between penguin anatomy and body mass? In the analyze stage, we performed EDA and checked model assumptions. Then we moved on to the construct stage which, as a reminder, has two parts. We built our model and were able to pull out some parameter estimates. Now we're going to focus on the next step of the construct phase, model evaluation. Model evaluation is an important practice in data analysis. Careful evaluation and interpretation of your regression model helps you understand its performance and accuracy. To start, let's revisit the results from our regression model.
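Before moving on to evaluation, here's a sketch of the assumption checks just described, reusing the model and ols_data variables from the previous sketch:

```python
import matplotlib.pyplot as plt
import seaborn as sns
import statsmodels.api as sm

# Fitted values and residuals
fitted_values = model.predict(ols_data)
residuals = model.resid

# Data with the best fit line and a shaded uncertainty region
sns.regplot(x="bill_length_mm", y="body_mass_g", data=ols_data)
plt.show()

# Fitted values against residuals: look for a random cloud of points
fig, ax = plt.subplots()
ax.scatter(fitted_values, residuals)
ax.axhline(0, linestyle="--")
ax.set_xlabel("Fitted values")
ax.set_ylabel("Residuals")
plt.show()

# Histogram of residuals: look for a bell curve shape
sns.histplot(residuals)
plt.show()

# QQ plot: look for a straight diagonal line
sm.qqplot(residuals, line="s")
plt.show()
```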
Based on the summary results, you know that OLS has determined that the best fit line has an intercept of negative 1,707.29 and a slope of 141.19. But randomness and unpredictability are characteristics of every regression model that make it difficult to predict outcomes with 100% certainty. After all, there is still a difference between our observed and predicted values. You've just found the model that you are most certain about. To explore the notion of uncertainty further, let's turn our attention to the rest of the rows in the OLS summary table about the intercept and bill length. There is a column labeled P greater than the absolute value of t, written P>|t|. This indicates the p-value associated with the coefficient estimates. The two columns to the right of the p-value column indicate a 95% confidence interval around the coefficient estimates. When evaluating simple linear regression results, you'll focus less on the intercept row and more on the row involving your independent variable of interest. In this case, that's bill length. So you can say that the coefficient estimate for bill length is 141.19 with a p-value of 0.000 and a confidence interval from 131.788 to 150.592. Previously, when you learned about hypothesis tests, confidence intervals were defined as a range of values that describe the uncertainty surrounding an estimate. In the case of linear regression, we are estimating parameters. So a 95% confidence interval means that if we constructed intervals this way across many repeated samples, about 95% of them would contain the true parameter value of the slope. What if the slope and intercept were slightly different? Well, let's draw out a few different lines on our plot with slightly different slopes and intercepts, all within our confidence intervals. We get a region around the regression line that is tight around the center and fans out a bit towards either end of the line. This shape may appear familiar. You plotted it previously using seaborn's regplot function. These lines make up the shaded region that was around the regression line. Essentially, the confidence interval around the parameter estimates reveals what we call a confidence band. A confidence band is the area surrounding the line that describes the uncertainty around the predicted outcomes at every value of x, typically expressed as a shaded region around the best fit line on a scatter plot. A confidence band reveals the confidence interval for each point on a regression line. Confidence bands are simply another way to report your findings responsibly. Simple linear regression is a powerful addition to any data professional's toolbox. Whether you are analyzing the financial impact of price increases to a streaming service or you're forecasting sales for a fashion boutique, regression analysis can help you make discoveries and understand the relationships behind the data. But we must remember data is noisy and results can be uncertain. When using regression models like simple linear regression, even the best data doesn't tell a complete story. As a data analytics professional, you should always aim not only to evaluate the performance and accuracy of your models, but also to report uncertainty. Communicating about confidence intervals and confidence bands is part of being a responsible data professional. These metrics will also help you understand how well the models can tell the story behind the data. Up to this point, we've progressed through the plan and analyze phases of the PACE framework for regression modeling.
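Before moving on, note that the estimates, p-values, and confidence intervals can also be pulled directly from the fitted model rather than read off the printed table; a small sketch, reusing the model variable from earlier:

```python
# Coefficient estimates, p-values, and 95% confidence intervals
print(model.params)                 # intercept and slope estimates
print(model.pvalues)                # p-value for each coefficient
print(model.conf_int(alpha=0.05))   # lower and upper bounds of the 95% CI
```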
We've even started the construct phase by actually building a linear regression model. I'm excited to continue guiding you through model evaluation. We're getting so close to the execute phase, where you share the stories behind the data you're studying. Using a variety of evaluation metrics supports data professionals' confidence in the insights produced by their analysis. These metrics are key to responsible communication of results. If models are inaccurate or imprecise, decisions made based on those insights may also be inaccurate. Three metrics you might encounter are R-squared, mean squared error, also called MSE, and mean absolute error, or MAE. The main metric that academics, researchers, and data professionals use when evaluating regression models is called the coefficient of determination, or R-squared. So that's what we'll focus on. You may have noticed that in the output from the OLS model you previously built, there was a part of the output labeled R-squared. This is what we're talking about. R-squared, or the coefficient of determination, measures the proportion of variation in the dependent variable Y explained by the independent variables X. To explain this metric more thoroughly, let's think through the example about penguins and linear regression again. You identified a linear relationship between the penguin's bill length in millimeters and the penguin's body mass in grams. Based on your regression analysis, you found the best fit line. Body mass in grams equals negative 1,707.30 plus 141.19 times bill length. But the data points just cluster around this best fit line. Many of the data points are actually not on the line. This means that bill length only accounts for some of the changes in body mass. R-squared helps data professionals determine how much of the variation in the Y variable is explained by the variation in the X variable. At most, R-squared can equal 1, which would mean that X explains 100% of the variance in Y. If R-squared equals 0, then that would mean X explains 0% of the variance in Y. The OLS summary table shows the model has an R-squared of 0.769. This means that bill length explains about 77% of the variance in body mass. There is still about 23% of the variance of body mass that is unexplained by the model. This variance might be due to other factors or natural unexplained differences from penguin to penguin. There is no benchmark value that R-squared has to equal. But in general, the higher the R-squared, the better, because it adds validity to any recommendation you make based on your analysis. R-squared is a useful metric that can help you evaluate your model. But there are also processes that help strengthen the evaluation of a model. Typically, when we have a dataset, we use at least part of the dataset to build and test the regression model. The computer uses the data to calculate a measure of difference between the actual and predicted values, such as the sum of squared residuals. Then, based on the computer's calculations, it can find the best line. But sometimes, we want our model to be good at generating predictions for data we haven't collected, or that doesn't exist yet. For example, let's return to the penguins at the local zoo. They welcomed a new group of penguins to their flightless bird habitat. It would be helpful to have a sense of how the new penguins' structural measurements relate to their body mass.
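To connect the definition back to the output, here's a small sketch, reusing the model and ols_data variables from earlier, that reads R-squared off the fitted model and recomputes it from the residuals:

```python
import numpy as np

# R-squared as reported in the OLS summary table
print(model.rsquared)  # about 0.769 for the bill length model

# The same value from its definition:
# 1 minus (sum of squared residuals / total sum of squares)
y = ols_data["body_mass_g"]
ss_res = np.sum(model.resid ** 2)
ss_tot = np.sum((y - y.mean()) ** 2)
print(1 - ss_res / ss_tot)
```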
So, in those cases, we want to know how the model we built performs on the data it learned from and how the model performs on data it hasn't experienced yet. In this case, we'll need to save a holdout sample before we build the model. A holdout sample is a random sample of observed data that is not used to fit the model. Then, you can evaluate how well the model fits the data used to build the model, and you can evaluate how well the model fits the holdout sample. We've covered a lot of different ways you might evaluate a linear regression model. R-squared and holdout samples will serve you in many cases, allowing you to confidently share the insights you've discovered. I'm excited to discuss the execute phase and how to communicate model findings soon. Earlier in the course, you built a regression model that passed all four of the assumptions for simple linear regression. Residual plots confirmed a linear relationship, and you were able to successfully evaluate the performance of the model using a few common metrics. Next, it's on to the execute phase of the PACE framework. How exciting! This is the point of the PACE framework when your ability to communicate is crucial. Together, we're going to review interpreting the results of our regression model, explore ways in which those insights can be translated into formal visualizations, and communicate a meaningful narrative with stakeholders. In one of my previous positions, I was brought on to help a newly launched mobile phone service provider reduce the number of fraudulent orders they were receiving. After building out a model, it was time to present the model, insights, and recommendations to my stakeholders. These folks were non-technical business partners, so I knew that data-specific terminology wouldn't interest them. As a matter of fact, it would be too technical, and they would probably lose interest very quickly, which would cause a lack of buy-in. What was important to my business partners was stopping fraudulent orders, so telling the story in a way where I succinctly articulated how many fraudulent orders the model would detect, and how that translated to the bottom line, resulted in the model being implemented immediately. I remember feeling so accomplished in that moment, because after all the hard work of building the model, I was able to get the model into production. Once the model was in production, fraudulent orders dropped drastically. Telling the right story to the right people takes time and practice. This is a skill that we'll continue to work on throughout our careers. It's super important to know your audience and tailor your delivery and story in order to have the most impact. Enough about me. Let's return to the example that we've been exploring where you were working at a local zoo. You were trying to understand the relationship between penguin structural measurements and their body mass. In the regression model you created, you observed a positive correlation between bill length and body mass of a group of penguins. Let's use the results of that model to translate the statistical findings for stakeholders. Regression model interpretation depends on coefficients and p-values. Coefficients describe how changes in the independent variable are associated with changes in the dependent variable. p-values indicate whether coefficients are statistically significant. Recall that the slope is used to determine the amount you expect y to increase or decrease per one-unit increase of x.
Let's go through the numbers yielded in the OLS summary table to gain a better understanding. Body mass equals intercept plus slope times bill length. The slope is positive 141.19 and the intercept is negative 1,707.29. Next, let's review the p-value. The coefficient for bill length has a p-value of 0.000. This p-value tests the null hypothesis that the coefficient is zero. If we observe a small p-value, the null hypothesis is likely false. So we reject the null hypothesis and conclude that the coefficient is not zero, meaning that there is a statistically significant relationship between x and y. It's important to understand that the positive correlation between bill length and body mass you've observed does not necessarily reflect causation. Remember, causation describes a cause-and-effect relationship where one variable directly causes the other to change in a particular way. The positive correlation between bill size and mass could be the result of a variety of factors. So even though we cannot speak to causation, we can still provide valuable insights about the penguins. Based on our regression analysis, we can say that penguins with one millimeter more in bill length have, on average, 141.19 grams more body mass. These results are statistically significant, with a p-value of 0.000, a confidence interval of 131.79 to 150.59, and an r-squared of 0.77. By providing measures of uncertainty around our estimates, we're responsibly reporting our results. Based on the interpretation of the numbers, you can contribute to the zoo community around you. Each person on the team at the zoo brings a variety of experiences and knowledge. The input from your colleagues adds richer detail to the story the model tells. Without the informed perspectives of the other bird caretakers, the avian department would not be able to make a sound argument for an increase in food to maintain an adequate inventory that supports the penguins and other birds at the zoo. When sharing insights as a data analytics professional, you need to make sure that your findings can be quickly understood and correctly interpreted. Communicating the context of your data is one way you can report it responsibly. For example, if your results only apply to a specific penguin population, make it clear that people should be cautious about extrapolating to larger or different groups of penguins. Data visualizations are excellent ways of making statistics relatable for others. Use caution when presenting terms like coefficients and p-values in your visualizations. Consider that not everyone will fully grasp the significance of these terms immediately. There are many useful libraries like matplotlib and seaborn that can help create visualizations. Programs like Tableau, PowerPoint, or Google Slides allow for the creation of high-quality presentations to help you provide context that's relevant to the business problem. Regardless of your visualization tool of choice, communicating with those around you is critical to your success at all points of the PACE process. As a data analytics professional, clear communication allows you and your work to have impact throughout your team and organization. Well done! You've come a long way and you built your very first regression model. I'm so thrilled to continue your regression journey with you. Before we wrap up, let's recap what you've added to your data toolkit. Linear regression analysis is a foundational data science technique. By now you're familiar with PACE, Plan, Analyze, Construct, and Execute, in linear regression analysis.
You had an opportunity to use ordinary least squares estimation in Python. You used OLS to get the best fit line that minimizes the error between predicted and actual values. Next, you learned the four main model assumptions of simple linear regression: linearity, normality of residuals, independent observations, and homoscedasticity. You had some practice applying EDA in Python to check whether linear regression is appropriate based on meeting these assumptions. You built a model in Python, and you learned how to evaluate model fit using r-squared, holdout samples, and measures of uncertainty like confidence intervals and p-values. Lastly, you turned numbers and statistics into a story. You explored a penguin dataset to show how a data analytics professional can present findings to others in clear, simple terms. You learned the value of producing visualizations in Python to communicate the results of your simple linear regression model to stakeholders, a valuable skill that you'll be able to use throughout your data analysis journey. I'm proud of how far you've come, and you should be too. So far we've gone through all of PACE in simple linear regression. While we completed the circuit once, it doesn't stop there. Next, we'll extend our knowledge of simple linear regression to multiple linear regression models. Simple linear regression is great for tackling problems with a single independent variable. However, the more complex the problem, the more variables can influence what's going on. That's where multiple linear regression becomes useful. You're making amazing progress and are on your way as a future data analytics professional. Hi there. It's great to join you again on your journey towards becoming a data analytics professional. You've come a long way since the start of this course. Together we covered PACE in regression modeling and an overview of two foundational regression models, linear and logistic regression. We used your statistical knowledge to discuss how simple linear regression works and when to use it. You learned about model evaluation. You also learned how to interpret and communicate regression results effectively and accurately. All of these skills will translate to other regression and machine learning models. As you grow as a data professional, you'll be able to use these skills to discover many data-driven insights. Simple linear regression is a great foundational technique, but it can feel limiting because it only allows for one independent variable. There are many cases where you might be interested in two, three, or four independent variables. For example, there could be many factors that are associated with product sales in a given month, such as holidays, new product launches, changes to the retail website, or changes to marketing campaigns. We need a new technique to figure out how each of these variables is correlated with product sales. This is where multiple linear regression comes in. In the upcoming videos, we'll be exploring the world of multiple linear regression, often referred to as multiple regression. Just as we went through PACE, Plan, Analyze, Construct, and Execute, with simple linear regression, we'll do the same with multiple regression. To start, we'll discuss multiple regression. While simple linear regression only allows one independent variable X, multiple regression allows us to have many independent variables that are associated with changes in the one continuous dependent variable Y.
Adding more independent variables into the equation complicates the math, but everything we covered is just an extension of simple linear regression concepts. As a result, we'll go back to our statistical basis and revisit PACE. Since this isn't your first model, we're going to start with A, analyze. We'll review model assumptions for multiple linear regression. EDA will continue helping us determine if our assumptions hold true. Then we'll focus on the construct phase, where we'll learn how to build the model and you'll have a bit more practice with Python. Next, in the execute phase, we'll focus on interpretation and telling a story from multiple linear regression. With more independent variables, we have to think more carefully about the insights we derive and how we communicate our results. Then we'll iterate back to the construct phase and learn a bit more about the nuances of multiple linear regression. As a data analytics professional, one of the most important skills is figuring out the best model for each use case, which goes back to the Plan phase of PACE. In pursuit of finding the best model, and as models become more complex, we need new ways to evaluate them effectively. As a part of finding the best model, we'll learn about variable selection and regularization. In your career as a data analytics professional, you'll likely encounter some very big data sets. And it may be challenging to figure out which variables are statistically important just by looking at the data. But the tools you learn here will allow your computer to calculate a possible set of variables for a given model, output, and evaluation metric. Although the computer is performing incredible mathematical operations, you as a data professional are reading the output so that you can understand and communicate the relationships the computer has uncovered. Now it's time for us to learn more about multiple regression and how we can use different independent variables to estimate a dependent variable. Let's keep putting together the regression puzzle. Previously, we learned about simple linear regression. For example, the more advertisements a company uses, the more clicks their website receives. This is an example of a relationship between variables that can be modeled with simple linear regression because there is only one x variable and one y variable. The number of website clicks is a continuous dependent variable. The number of advertisements is the independent variable. But the company might be interested in exploring the characteristics of each advertisement to determine what kinds of advertisements are related to more website clicks. Perhaps shorter advertisements are correlated with more clicks, or maybe advertisements containing a call to action, such as donate or subscribe, are associated with more clicks. Multiple linear regression can help us answer these kinds of questions. Multiple linear regression, also known as multiple regression, is a technique that estimates the linear relationship between one continuous dependent variable and two or more independent variables. Let's reexamine the equation for a simple linear regression: y equals beta 0 plus beta 1 times x. Recall that beta 0 is also called the y-intercept and beta 1 is also called the slope. Let's say that y represents the website clicks and x represents the number of people that are in the advertisement. Now let's add the length of the advertisement into the equation: y equals beta 0 plus beta 1 times x 1 plus beta 2 times x 2.
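Written in standard notation, a plain rendering of the equations just spelled out, the step from simple to multiple regression is:

```latex
% Simple linear regression: one independent variable
y = \beta_0 + \beta_1 x

% Multiple linear regression: two independent variables
y = \beta_0 + \beta_1 x_1 + \beta_2 x_2
```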
The x 1 represents the number of people that are in the advertisement and x 2 represents the length of the advertisement. Note that because we have two x variables now, we've added subscripts to differentiate between them. You'll encounter this notation often in multivariate analysis. Multiple regression allows us, at a basic level, to add any number of independent variables that we're considering. So a full multiple regression equation might be y equals beta 0 plus beta 1 times x 1 plus beta 2 times x 2, plus as many variables as you have, up to beta n times x n, where you're interested in n independent variables. Even though we are just adding more beta coefficients and independent variables, we can still reap the benefits of linear regression. Just like simple linear regression, multiple linear regression can yield highly interpretable and communicable results. But because the underlying math is a bit more complex, we have to be mindful of how we convey our results and what the coefficients mean. To ensure we're ethically communicating our results as clearly as possible, we'll go over two concepts: one-hot encoding and interaction terms. One-hot encoding allows us to use categorical independent variables in our multiple linear regression. For example, we might have print advertisements and digital advertisements. This is a categorical variable, and one-hot encoding will allow us to incorporate it into our regression model. If we want to account for how two independent variables affect the y variable together, we can use something called an interaction term. Both of these topics are important because they will change how we interpret the coefficients of our model. We will learn how to do this together in the upcoming videos. As a data analytics professional, you might encounter scenarios in which you're interested in variables that are not continuous. They could be categorical. For example, in the case of website clicks and advertisements, some ads are black and white and some ads are all in color, or maybe some ads have a call to action while other ads don't, or perhaps some product ads are on different streaming services. These are all categorical variables that could be related to how many clicks the website receives. In the case where we have a categorical independent variable, we have to represent the categories as numbers for the computer to understand the data. There are two main ways to handle categorical data: one-hot encoding and label encoding. In this video, we will learn about one-hot encoding. One-hot encoding is a data transformation technique that turns one categorical variable into several binary variables. Let's take the example of an advertisement having a call to action or not. We would create a new variable in the data set. Let's call it action, and mathematically we'll denote it as x action. If an advertisement has a call to action, then x action equals one. If an advertisement does not have a call to action, then x action equals zero. Our equation would become y equals beta zero plus beta one times x one plus beta two times x two plus beta action times x action. x one measures the number of people in the advertisement and x two measures the length of the advertisement. Now let's assume that we have two advertisements where x one and x two are the same, but one has a call to action and one doesn't. We can then assume that the advertisement with the call to action had beta action more website clicks than the advertisement without a call to action.
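As a sketch of what this encoding might look like in pandas, using hypothetical data and column names for the advertising example:

```python
import pandas as pd

# Hypothetical advertisement data
ads = pd.DataFrame({
    "num_people": [2, 5, 3],
    "length_seconds": [15, 30, 30],
    "action": ["yes", "no", "yes"],  # call to action or not
})

# get_dummies turns the categorical column into binary columns;
# drop_first=True keeps one binary variable for a two-level category.
# (A three-level column would keep two binary columns, as discussed next.)
encoded = pd.get_dummies(ads, columns=["action"], drop_first=True, dtype=int)
print(encoded)  # the action_yes column is 1 with a call to action, 0 without
```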
Now let's remove the call to action variable. What if we are interested in which streaming platform the ad is on? Let's say the company is running ads on three services, a, b, and c. Let's also assume that ads can only run on one platform at a time. So if an ad is on service a, it's not on service b or c. Now we have a categorical x variable that has three possibilities. Let's figure out how to represent it mathematically. To represent two possibilities, has a call to action versus does not have a call to action, we used one binary variable. In order to represent three possibilities, we need two binary variables. Let's examine why this works. Let's imagine we have a binary variable x service a. If x service a equals one, we know that the advertisement is playing on service a, but we also know two other pieces of information. If the ad plays on service a, then it is not playing on service b, and the ad is also not playing on service c. If x service a equals zero, we only know that the advertisement is not on streaming service a. The advertisement could play on either service b or c. Since we have missing information, we need another binary variable to help us and the computer figure it out. Let's add a variable x service b. If x service a equals one, we already have all of our information. We know that the ad plays on service a, so x service b must equal zero. The ad does not play on service b, and we also know it does not play on service c. But if x service a equals zero, we can learn more information from x service b. If x service b equals one, then we know that the ad plays on service b. In turn, we know that the ad does not play on service c. Lastly, if x service a equals zero and x service b also equals zero, then we know the ad plays on neither service a nor service b, so the ad must play on service c. Now we have all the information we need with just two variables. So let's revise the equation one more time. We have y equals beta zero plus beta one times x one plus beta two times x two plus beta service a times x service a plus beta service b times x service b. Now that the equation is written out, you'll notice there is no variable x service c, because it would not provide us with more information. But the interpretation is a little bit different. We can think of service c as the default streaming service, so beta service a is the difference in website clicks for two advertisements that are the exact same, except one ad is played on service c and one is played on service a. Similarly, beta service b is the difference in website clicks for two advertisements that are the exact same, except one ad is played on service c and one is played on service b. In this video, you learned how one-hot encoding allows us to turn one categorical variable into several binary variables. Now we can start estimating how many clicks the website will get based on variables about the ad. Soon we'll go over how to use Python to one-hot encode categorical variables. We will revisit PACE in regression modeling through model assumptions, model construction, model evaluation, and model interpretation together. Great work so far. I can't wait to join you in the next video. In previous videos, you learned about the model assumptions of simple linear regression. In this video, we'll review the assumptions you're already familiar with and introduce a new assumption specific to multiple linear regression. Simple linear regression and multiple regression share four assumptions: linearity, independent observations, normality, and homoscedasticity.
The main diagnostic tools are the same: scatter plots and plotting the residuals after model construction. To recap, the linearity assumption states that each predictor variable x i is linearly related to the outcome variable y. Plotting scatter plots of each x variable against the y variable will inform us which variables likely have a linear relationship with y. The independent observations assumption states that each observation in the data set is independent. We can only examine this assumption by checking the data collection process. For example, if we are collecting data about income, people from the same household might not be independent from each other. The next assumption is the normality assumption, which states that the residuals are normally distributed. The last assumption we're reviewing is the homoscedasticity assumption, which states that the variation of the residuals, or errors, is constant or similar across the model. There is a completely new assumption when we are working with multiple regression: the data cannot be multicollinear. The no multicollinearity assumption states that no two independent variables, x i and x j, can be highly correlated with each other. This means that x i and x j cannot be linearly related to each other. For example, if you're working at a concert venue, you might want to predict concert ticket sales. There are many factors involved: number of social media followers, number of streams on the music platforms, year the artist debuted, cost of the tickets, how many days until the concert, and more. Although the cost of the ticket and how many days are left until the concert might both be strong predictors of concert sales, it is likely that the cost of the ticket is correlated with how many days are left until the concert. Maybe there's a sale the week before the concert, so sales spike up. If you keep both variables in the regression model, it will be unclear which factor has what kind of effect. Additionally, you might not be explaining significantly more of the variance in ticket sales. Let's do some exploratory data analysis to confirm our initial thoughts. We can create a scatterplot matrix using Python code to show the relationship between pairs of variables. The scatterplot matrix creates a scatterplot for every pair of variables. If we observe linear relationships between an independent variable and the dependent variable, we should consider including it in our multiple regression model. If we observe linear relationships between two independent variables and we include both variables in the model, we'll likely violate the no multicollinearity assumption. In the concert ticket example, you have six variables in the data set: one dependent variable, concert ticket sales, and five independent variables, number of social media followers, number of streams on music platforms, year the artist debuted, cost of the ticket, and how many days until the concert. You might expect the number of social media followers and the number of streams to be highly correlated with one another based on our context, but you can't confirm this hunch until you create scatter plots. That being said, EDA and visualizations are powerful tools, but they can't detect every relationship, so we turn back to the math. Conveniently, our computer can calculate the variance inflation factor, or VIF. The VIF quantifies how correlated each independent variable is with all the other independent variables. The minimum value of VIF is one, and it can get very large. The larger the VIF, the more multicollinearity there is in the model.
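Here's a sketch of how the VIF calculation might look with statsmodels, using hypothetical data for the concert example in which ticket cost and days left are correlated on purpose:

```python
import numpy as np
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor
from statsmodels.tools.tools import add_constant

# Hypothetical independent variables for the concert ticket example
rng = np.random.default_rng(42)
days_left = rng.uniform(1, 90, 200)
ticket_cost = 150 - days_left + rng.normal(0, 5, 200)  # correlated with days_left
followers = rng.uniform(1e4, 1e6, 200)

X = pd.DataFrame({
    "ticket_cost": ticket_cost,
    "days_left": days_left,
    "followers": followers,
})
X = add_constant(X)  # the VIF calculation expects an intercept column

# One VIF per independent variable (skipping the constant column)
vif = pd.Series(
    [variance_inflation_factor(X.values, i) for i in range(1, X.shape[1])],
    index=X.columns[1:],
)
print(vif)  # values near 1 suggest little multicollinearity; large values flag it
```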
Once we have identified multicollinearity, one solution is to drop one or more of the variables involved. Another possible solution is creating new variables using existing data. Then you can calculate the VIF again to check the multicollinearity assumption. Remember, when in doubt, explore your data. EDA and visualizations are critical to understanding underlying trends and telling a compelling story. By taking these steps, you can ensure that your model fits the data and that your results are reasonable. Great work so far. Keep it up. We've already interpreted the results of simple linear regression before. For example, if we plot some data about temperature and iced coffee sales, the scatter plot will show us a series of points in a diagonal line trending upward. We are able to interpret the results relatively simply. In this video, we'll go step by step through the process of building a multiple regression model and then interpreting the results. As an example, let's say the data follows this equation: sales equals negative 44 plus 2.2 times temperature. We can say that a one-degree increase in temperature is associated with 2.2 more iced coffee sales. This is great and probably explains a large amount of why iced coffee sales vary on any given day. But there are other factors involved, and if you work for a large coffee company, you might be interested in exploring factors under your control. You can't change the temperature, but you can be strategic about where you build your store or whether or not you have an ad on a nearby building. Thankfully, we're learning about multiple linear regression now, so as a data analytics professional, you can help your coffee company answer these questions. So let's add one more variable to the equation. Is there an ad near the store? Let's revise the equation accordingly: sales equals beta zero plus beta temperature times x temperature plus beta ad times x ad. The ad variable is a binary variable. There are two possible scenarios, when there is an ad nearby and when there isn't an ad nearby. If there is an ad posted nearby, then x ad equals one, and the temperature will take on some value. Let's say it's 75 degrees Fahrenheit. So the equation becomes sales equals beta zero plus beta temperature times 75 plus beta ad times one. We estimate that the presence of the ad is associated with some increase in sales. Now let's take the same temperature and say that there isn't an ad nearby. The equation then becomes sales equals beta zero plus beta temperature times 75 plus beta ad times zero. Simplifying this equation, we get beta zero plus beta temperature times 75. The beta ad term drops out because it's multiplied by zero. When there isn't an ad, then the sales are only a function of the temperature. Now let's remove the ad variable and introduce the question of proximity to public transportation. The variable transportation will measure how many kilometers away a given coffee shop is from a bus, train, or subway stop. The equation becomes sales equals beta zero plus beta temperature times x temperature plus beta transportation times x transportation, where temperature and distance to transportation are continuous variables. For every one-degree increase in the temperature, while holding distance to public transportation constant, we expect iced coffee sales to increase by beta temperature. For every one kilometer further a store is from public transportation, while holding temperature constant, we expect iced coffee sales to decrease by beta transportation.
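Spelled out in notation, the worked example above looks like this:

```latex
% With an ad nearby (x_{ad} = 1) at 75 degrees:
sales = \beta_0 + \beta_{temp} \cdot 75 + \beta_{ad} \cdot 1

% Without an ad (x_{ad} = 0), the ad term drops out:
sales = \beta_0 + \beta_{temp} \cdot 75

% Replacing the ad with a continuous transportation distance:
sales = \beta_0 + \beta_{temp}\, x_{temp} + \beta_{trans}\, x_{trans}
```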
Note that we have to hold the other variable constant when interpreting the results. The math will explain why. Now we've gone over what to do when we have several standalone variables, but there are cases when we might expect two variables to interact. For example, in the case of temperature and distance to transportation, we might expect that if it's cooler, distance to transportation might have a different effect. If we want to account for how two variables' values affect each other, we include an interaction term. An interaction term represents how the relationship between two independent variables is associated with changes in the mean of the dependent variable. Typically, we represent an interaction term as the product of the two independent variables in question. Going back to the example of coffee shop sales, if we suspect that distance from transportation might be associated with different changes in coffee shop sales based on the temperature, we can include the interaction term temperature times transportation. Originally, we said that sales equals beta zero plus beta temperature times x temperature plus beta transportation times x transportation. If we want to include the interaction between temperature and transportation, we can revise the equation to be sales equals beta zero plus beta temperature times x temperature plus beta transportation times x transportation, plus beta interaction times the interaction between temperature and transportation. The interaction is represented with a multiplication symbol. In this example, we took into consideration the interaction between independent variables using the interaction term. We'll continue to learn how to interpret the nuances of regression coefficients in this course. We've covered a lot so far, and practice will help you build confidence in these concepts. Spend time connecting the resources available in this course to help you use multiple regression, interpret the results, and tell the data story. Next, we'll keep digging into multiple regression. We explored some interesting and more complex questions that multiple regression can help us answer. We went through an overview of how to interpret multiple regression models. Now let's turn to Python. Just like with simple linear regression, we'll find the best fit line by minimizing the sum of squared residuals, which is a measure of error. While we could spend a lot of time performing ordinary least squares estimation one step at a time, Python has some built-in functions that help us build models. This way, we can spend our time as data analytics professionals exploring the data and communicating insights from the calculations. We're going to revisit the penguins data set from earlier to see if we can learn more about the penguins' body mass. We've already done some exploratory data analysis and data cleaning, and saved the data as a variable called penguins. Get a quick summary of the data using the pandas head function. There are four different variables in the data set: body mass in grams, bill length in millimeters, gender, and species. The data is relatively clean. For this problem, you'll try to predict body mass based on the other variables. Next, divide the data set into the independent variables, or X's, and the dependent variable, Y. This step helps prepare the data for the function you'll use to create training and holdout data sets. Use the train test split function from the scikit-learn library to create a training data set and a holdout, or testing, data set.
First, import the function. Now use the prepared data to divide the data into training and holdout data sets. Note that the test size variable is the proportion of data you're randomly assigning to the holdout data set. In this case, you're holding back 30% of the data to test the model. Depending on the context, it may be appropriate to hold back more or less of the data. The random state variable does not have to be set, but you are assigning it the value of 42 so that you can replicate our results. You'll encounter the number 42 frequently in code documentation. The number 42 is a significant number in a popular science fiction novel and has been adopted by the computer science community. If you put in a different value for the random state variable, you'll get different results. It does not matter what number you use, but setting the random state allows someone else to replicate your work. Next, we need to think about our multiple regression formula. Our independent variables are bill length, gender, and species. We'll use the statsmodels module to run the regression, like you did for simple linear regression. The statsmodels ordinary least squares function, OLS, needs to know your regression formula. Save it as a variable: OLS underscore formula equals body mass tilde bill length plus gender plus species. Note that we added a capital C in parentheses around gender and species. The notation lets the statsmodels OLS function know that gender and species are categorical variables. The function will then encode the variables. If you haven't already imported the OLS function, do so with the following line of code. The data was saved already as a data frame called OLS data. You can input your formula and the data frame into the OLS function. Lastly, use the OLS object's fit method to actually fit the model to the data. One of the benefits of using the statsmodels OLS function is that it provides us with a compact summary table of relevant statistics. We can easily find the coefficients, standard errors, t statistics, p-values, and confidence intervals for each independent variable. These values let us interpret the regression results quickly and easily. Access the OLS summary table using the summary function. Let's explore the male variable. Under the column for the coefficient, there is a row labeled C parentheses male. The way the variable was encoded was male is one and female is zero. This means that the baseline, or reference point, is female penguins. So the coefficient indicates how much body mass would differ between two penguins that only differ in gender. Assuming the male and female penguins are the same species and have the same bill length, we expect the male penguin's body mass to be about 528.95 grams more than the female penguin's. The p-value is very small, so this coefficient is statistically significant. Now let's consider the row for bill length. Assuming that two penguins are the same gender and species, if the bill length increases by one millimeter, we would expect the penguin with the longer bill to be about 35.55 grams larger in body mass. The p-value is very small, so the estimate is statistically significant. The OLS summary table also gives you model evaluation metrics like r-squared. The r-squared is 0.85, indicating that your model explains about 85% of the variance in body mass. This seems reasonable, but we will discuss the importance of metrics other than r-squared when working with multiple linear regression later.
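Here's a sketch consolidating those steps, assuming the cleaned penguins data frame from earlier; the column names follow the narration (note that in the raw seaborn dataset the gender column is called sex):

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from statsmodels.formula.api import ols

# Independent variables (X's) and the dependent variable (Y)
X = penguins[["bill_length_mm", "gender", "species"]]
y = penguins[["body_mass_g"]]

# Hold back 30% of the data; random_state makes the split reproducible
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Recombine the training data for the formula interface
ols_data = pd.concat([X_train, y_train], axis=1)

# C() tells statsmodels that gender and species are categorical
# and should be encoded before fitting
ols_formula = "body_mass_g ~ bill_length_mm + C(gender) + C(species)"
model = ols(formula=ols_formula, data=ols_data).fit()
print(model.summary())
```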
You know how to fit the model to the data and get a summary table of statistics, and you explored the coefficients and p-values. There's a lot more in the table, so you can investigate any of the metrics we have not covered yet in this course. I encourage you to peruse the readings in this lesson and try out the code on your own. All of these statistics form the basis for interpreting results and communicating with stakeholders. Great work so far. I'll join you again next time. So far we've extended simple linear regression to multiple regression, which is a powerful tool because it allows us to answer a wide variety of questions. Whenever we're estimating a continuous variable by using several different independent variables, multiple regression is a good first step. But every tool has its limitations. When we first discussed simple linear regression, we focused on r-squared as the most common metric for evaluating linear regression. To recap, r-squared is the proportion of variance of the dependent variable y explained by the independent variable or variables x. This seems like a reasonable and highly interpretable metric for determining how good of a linear regression model you have. If you recall, when we examined the output of the multiple regression model, r-squared was still one of the metrics listed. But when we start adding more independent variables into the equation, r-squared gets more complicated. Whenever you add another independent variable to a multiple regression model, the r-squared increases, or at the very least never decreases. But not all variables added to a model equally contribute to understanding changes in y. This is a problem because a high r-squared can be misleading. If we are just trying to get a high r-squared without considering what each variable contributes, the model becomes very specific to the data it was built on. Therefore, the model is no longer applicable to a larger population. As data analytics professionals, we call this overfitting. Now we're going to focus on what to do now that we know about overfitting. Overfitting in the data space is when a model fits the observed or training data too specifically and is unable to generate suitable estimates for the general population. The conclusion from the regression model no longer applies to the population for which we are trying to draw conclusions, and only applies to the data used to build the model. Overfitting tends to occur when a model is too complex or incorporates too many variables. Preventing overfitting is one of the reasons that we use model evaluation techniques like holdout sampling, which is when we set aside a part of the data we already had but did not use to fit the model. By using a holdout sample, we can observe if the model performs just as well on data it has not experienced yet. Aside from holdout sampling, we can also use another metric called adjusted r-squared to evaluate multiple regression models. Adjusted r-squared is a variation of the r-squared regression evaluation metric that penalizes unnecessary explanatory variables. Just like r-squared, adjusted r-squared varies from 0 to 1. Adjusted r-squared is used to compare multiple models of varying complexity. R-squared is more useful when interpreting the results of a regression model, as you can determine how much variation in the dependent variable is explained by the model. Now that we have reviewed the problem of overfitting and some ways to better evaluate multiple regression models, we can turn to variable selection.
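For reference, a common formula for adjusted r-squared, where n is the number of observations and p is the number of independent variables, shows how extra variables shrink the score unless they add real explanatory power:

```latex
\bar{R}^2 = 1 - (1 - R^2)\,\frac{n - 1}{n - p - 1}
```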
After all, we need a method to determine which variables to include and exclude in our models. I'm excited to explore variable selection and other techniques later in this lesson. In previous videos, we defined overfitting as what happens when a model fits the observed or training data too specifically and is unable to generate suitable estimates for the general population. One way to better evaluate how good a model is while factoring in overfitting is using adjusted r-squared, which penalizes unnecessary variables in a model. We learned that adjusted r-squared is most effective when we can compare multiple models that are using different subsets of independent variables. In this video, we'll review a couple of variable selection techniques. Thankfully, Python and our computer will take care of a lot of the work very efficiently. But it's important as a data analytics professional to have a high-level understanding of what's going on in your machine. Variable selection, also known as feature selection, is the process of determining which variables or features to include in a given model. As with many of the processes discussed in this program, variable selection is iterative. As you grow as a data analytics professional, you will develop a stronger intuition for how to go about variable selection. In this video, we'll cover forward selection and backward elimination, which are based on the extra sum of squares F-test. These simple techniques will allow you to continue exploring the world of multiple regression and prepare you for more advanced techniques covered later. Forward selection and backward elimination essentially work from opposite directions of the problem. We know that a model with zero independent variables is probably not the best choice. We know that a model with all of the possible independent variables is also probably not the best choice. Forward selection is a stepwise variable selection process. It begins with the null model, with zero independent variables, and considers all possible variables to add. It incorporates the independent variable that contributes the most explanatory power to the model, based on the chosen metric and threshold. The process continues one variable at a time until no more variables can be added to the model. Backward elimination is a stepwise variable selection process that begins with the full model, with all possible independent variables, and removes the independent variable that adds the least explanatory power to the model, based on the chosen metric and threshold. The process continues one variable at a time until no more variables can be removed from the model. Both forward selection and backward elimination require cutoff points, or thresholds, to determine when to add or remove variables, respectively. One common test is the extra sum of squares F-test. The extra sum of squares F-test quantifies how much of the variance that is left unexplained by a reduced model is explained by the full model. The reduced model can be any model that is less complex than the full model. For the F-test, like other hypothesis tests, data professionals usually evaluate it based on a p-value. Based on the p-value, we can be fairly confident about whether important variance is being explained by a given model. We'll revisit F-tests later in this course when we talk more about hypothesis testing and estimating categorical variables. Great work so far. We've covered forward selection, backward elimination, and the extra sum of squares F-test.
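To make the stepwise idea concrete, here's a rough sketch of forward selection using statsmodels. It's illustrative only: it uses individual coefficient p-values as the inclusion criterion, where a real pipeline might use the extra sum of squares F-test or adjusted r-squared, and it assumes X is a numeric DataFrame and y a numeric Series:

```python
import statsmodels.api as sm

def forward_selection(X, y, threshold=0.05):
    """Greedily add the candidate variable with the smallest p-value
    below the threshold, one variable at a time."""
    selected = []
    remaining = list(X.columns)
    while remaining:
        pvals = {}
        for candidate in remaining:
            # Fit a model with the already-selected variables plus the candidate
            exog = sm.add_constant(X[selected + [candidate]])
            model = sm.OLS(y, exog).fit()
            pvals[candidate] = model.pvalues[candidate]
        best = min(pvals, key=pvals.get)
        if pvals[best] < threshold:
            selected.append(best)
            remaining.remove(best)
        else:
            break  # no remaining variable clears the threshold
    return selected
```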
This provides a great start to variable selection and making intentional decisions when constructing a multiple regression model. Coming up, we'll go over how to perform variable selection using Python and continue exploring ways to control for overfitting. So far, we've covered a lot about the versatility of linear regression models in impacting business decisions and strategy. But no tool is perfect. Previously, we introduced the problem of overfitting, when a regression model is fit too closely to the training data and therefore has trouble properly estimating the population data. The problem of overfitting is related to the bias-variance tradeoff, a concept at the heart of statistics and machine learning. The bias-variance tradeoff balances two model qualities, bias and variance, to minimize overall error for unobserved data. The ideal model has some bias and some variance. Bias simplifies the model predictions by making assumptions about the variable relationships. A highly biased model may oversimplify the relationship, underfitting to the observed data and generating inaccurate estimates. For example, given some data, we could assume that y equals 2. This is a highly biased model. Variance in a model allows for model flexibility and complexity, so the model can learn from existing data. But a model with high variance can overfit to the observed data and generate inaccurate estimates for unseen data. Note, this variance is not to be confused with the variance of a distribution. We can think of bias and variance as two ends of a scale. We don't want there to be too much bias or too much variance. So we have to ask ourselves, as data professionals, how to balance some bias and some variance to minimize error and get the best model possible. Speaking of balancing, as data analytics professionals, we have to balance knowing that even as we gain experience, there's always going to be more to learn. I think it's super important to always keep learning and avoid complacency, like you all are currently doing. Personally, I try to attend at least two conferences a year to learn what is going on in the industry and to network with other data professionals. I'm also really fond of collaborating and asking questions, and I have found that I learn so much from my colleagues this way. Know that at this point you have a really solid foundation, and we want to continue working with you to expand your vocabulary and toolkit even further. Now it's time to learn about regularized regression. Regularization is a set of regression techniques that shrinks regression coefficient estimates towards zero, adding in bias to reduce variance. By shrinking the estimates, regularization helps avoid the risk of overfitting the model. There are three common regularization techniques: lasso regression, ridge regression, and elastic net regression. We won't go into all of the mathematical underpinnings of the techniques, but there are lots of resources in this course and online for you to explore further if you're interested. For all three kinds of regularized regression, some bias is introduced to lower variance in the model. Lasso regression completely removes variables that are not important to predicting the y variable of interest. In ridge regression, the impact of less relevant variables is minimized, but none of the variables will drop out of the equation. Ridge is a great option if you want to include all of the variables.
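As a sketch of what lasso and ridge might look like in scikit-learn, assuming numeric feature matrices X_train and y_train from an earlier split (categorical columns would need to be one-hot encoded first):

```python
from sklearn.linear_model import Lasso, Ridge

# alpha controls how hard the coefficients are shrunk toward zero.
# In practice, standardize the features first; regularization is scale-sensitive.
lasso = Lasso(alpha=1.0).fit(X_train, y_train)
ridge = Ridge(alpha=1.0).fit(X_train, y_train)

print(lasso.coef_)  # lasso can shrink coefficients exactly to zero
print(ridge.coef_)  # ridge shrinks coefficients but keeps every variable
```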
When working with large data sets, we can't always know if we want variables to drop out of the model or not. So we can use something called elastic net regression to test the benefit of lasso, ridge, and a hybrid of lasso and ridge regression all at once. Each regularized regression technique is trying to help us better fit our model. But keep in mind that the estimated parameters are much harder to interpret than with simple linear regression or multiple regression. That concludes our brief exploration into regularization and the bias-variance tradeoff. Now that you know the basics of regularization and the bias-variance tradeoff, you can continue learning about how to find the best regression model. Keep up the great work, and I'm excited for us to continue on your journey through regression analysis. Wow, we extended a lot of what we learned about simple linear regression to multiple linear regression. While simple linear regression models the linear relationship between one independent variable X and one continuous dependent variable Y, multiple regression models the linear relationship between two or more independent variables and one dependent variable. The allowance for multiple independent variables provides data analytics professionals with the ability to ask a wider variety of questions. We reviewed the model assumptions for simple linear regression and extended them to multiple regression. The model assumptions that apply to multiple linear regression are linearity, independent observations, normality, homoscedasticity, and no multicollinearity. We showed how to test each of these assumptions using either math, built-in Python functions, or visualizations during EDA. We highlighted multicollinearity, as that assumption is specific to multiple regression. We then provided some examples of how to code a multiple regression in Python. There are similarities with coding a simple linear regression, but there are some key differences. Next, we focused our time together on how to interpret and evaluate multiple regression results now that we have so many independent variables. We also discussed understanding the computer output, which is critical to providing accurate and nuanced storytelling. Lastly, as part of model interpretation and evaluation, we reviewed the problem of overfitting and ways to counteract it. Overfitting occurs when a model too closely matches a training dataset or observed data and is unable to predict unseen data or generalize to the population. To combat overfitting, we discussed two variable selection techniques: forward selection and backward elimination. Both use the extra sum of squares F-test to determine whether to add or remove a variable. To wrap up variable selection, we provided an overview of regularization, which helps prevent overfitting. To understand regularization, we defined the bias-variance tradeoff, which is core to many data science and machine learning model decisions. A model with high bias might underfit the data, oversimplifying the model. A model with high variance might overfit the data, overcomplicating the model. We introduced three regularization techniques: lasso, ridge, and elastic net regression. They all add bias to reduce variance. Regularization is particularly helpful when working with large datasets, as it can be difficult to predetermine which variables are or are not important. Wonderful work so far. We've gone over many concepts. Everything we've worked on together is here for you to review on your own time.
Good luck, and I'll join you again next time. Hey, welcome back to modeling variable relationships. We've covered a lot of material about regression so far. We worked through the assumptions, construction, evaluation, and interpretation of a simple linear regression model, and then built a more complex model using multiple regression. These tools gave us powerful ways to ask and explore questions about continuous variables. For example, if we were interested in understanding product sales, multiple regression would allow us to consider the impact of both digital video and print advertising on sales. Now we'll start focusing more on categorical variables. By expanding our toolkit to include more hypothesis testing, we can start asking and answering a different range of questions. For example, is the difference among three or more groups statistically significant? Is the distribution of observed data different from what we expected? Perhaps you're conducting user research and you want to make a change to the website you're analyzing. You can use hypothesis testing to determine if user engagement differs between two groups, where each group is using a distinct website layout. Recall that hypothesis testing is a statistical procedure that uses sample data to evaluate an assumption about a population parameter. You are testing hypotheses about a population to see if there are any significant differences. For example, in the case of user research on a website, you could test the hypothesis that changing the color of the subscribe button or the placement of images might change how much time users spend on the website. You can utilize different models and tests from your data toolbox to answer various questions. You can also use some techniques together. For example, sometimes you might want to combine a regression model with a hypothesis test or a series of hypothesis tests. The core principles of data analysis continue to apply to each technique. We want to explore and better understand the relationships between different variables. We can use what we have learned to tell compelling stories about the data. Based on the kind of data we have available, we can determine which test or model will be most appropriate. But sometimes we have to use several tests and models before we can determine the best approach to answer our questions. In this video, we'll start with the chi-squared distribution and chi-squared tests. Chi is the 22nd letter of the Greek alphabet and looks a lot like a fancy capital X in English. Our discussion will expand on prior learnings about t-tests and hypothesis testing. Chi-squared tests will help us determine whether two categorical variables are associated with one another and whether a categorical variable follows an expected distribution. For example, if we flip a two-sided coin many, many times, we would expect one particular side to come up about half of the time, and the other side to come up the other half of the time. Next, we will examine analysis of variance, or ANOVA, and its variants: analysis of covariance, or ANCOVA, and multivariate analysis of variance and covariance, MANOVA and MANCOVA respectively. By unpacking variable relationships with a focus on categorical data, we'll be able to expand the types of data we can effectively use to draw conclusions about various decisions, strategies, and practices in industry. Previously, we encountered hypothesis testing in the form of one-tailed and two-tailed t-tests.
T-tests help us answer questions about whether the means of two different groups are significantly different. In this lesson, we'll explore whether our data is what we expected it to be. We'll introduce two hypothesis tests, the chi-squared goodness-of-fit test and the chi-squared test of independence. These tests will help us compare our expected and observed data. We'll revisit null and alternative hypotheses when we define chi-squared tests, just like you might have done with t-tests. When using linear regression, we were primarily focused on continuous variables, but what if our variables aren't continuous? Chi-squared tests can address questions involving categorical variables. For example, let's say you're working at a movie theater that sells small, medium, large, and extra-large portions of popcorn. You have projections, or some expected counts, of how each size will sell. In a bar chart, it can be difficult to determine whether the expected counts are statistically the same as what you observed. But you can run a chi-squared goodness-of-fit test to answer the question. The chi-squared goodness-of-fit test determines whether an observed categorical variable follows an expected distribution. As a data practitioner for the movie theater we mentioned, you're working on the popcorn sale problem. An employee claims that 25% of people order each size on any given day. Now you can create a table of counts based on the number of popcorn buyers on a given day. So let's say 100 people bought popcorn yesterday. Then you can multiply the percentage by the total to figure out the expected counts for each cell in the table. 25% of 100 is 25, so you expect that 25 people bought each size of popcorn. For the chi-squared goodness-of-fit test, the null hypothesis states that 25 people buy each size of popcorn on any given day. Basically, the null hypothesis states that the variable follows the expected distribution. If we fail to reject the null hypothesis, then we can say that the observed distribution of popcorn sales is consistent with what the employee claims. The alternative hypothesis states that the variable doesn't follow the expected distribution. Basically, the distribution of the observed data is significantly different from what we expected it to be. This would mean that different numbers of people buy each size of popcorn on any given day. In this case, the chi-squared statistic equals the sum, over all categories, of the observed count minus the expected count, squared, divided by the expected count. Once you gather the observations about popcorn sales, you can then use the chi-squared statistic to calculate the p-value and determine whether you can reject the null hypothesis at the given significance level. The next test is called the chi-squared test for independence, sometimes called a test of homogeneity. The chi-squared test for independence determines whether or not two categorical variables are associated with each other. For example, let's say you are wondering if weather is associated with popcorn sales. Maybe when it rains, people are more likely to buy hot buttery popcorn. To state the hypotheses, you first need to determine your variables. In the popcorn case, the first variable could be whether there's rain or not. The second variable could be whether more than 100 people buy popcorn, or whether 100 people or fewer buy popcorn. Now we're ready to state our hypotheses. The null hypothesis in the test for independence is that the variables are independent and are not associated with each other.
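Before we state the alternative hypothesis for the test of independence, here's a minimal sketch of the goodness-of-fit test we just walked through. The observed counts are hypothetical:

```python
# A minimal sketch of the popcorn goodness-of-fit test with SciPy.
# The observed counts are hypothetical; the expected counts follow the
# employee's claim that 25% of 100 buyers choose each size.
from scipy import stats

observed = [33, 29, 22, 16]  # small, medium, large, extra-large (hypothetical)
expected = [25, 25, 25, 25]  # 25% of 100 buyers per size

statistic, p_value = stats.chisquare(f_obs=observed, f_exp=expected)
print(statistic, p_value)
# A small p-value (below your significance level) means you reject the
# null hypothesis that sales follow the claimed distribution.
# (stats.chi2_contingency runs the test of independence on a contingency
# table, which is the test we turn to next.)
```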
The alternative hypothesis states that the variables are not independent and are associated with each other. For the chi-squared test, it's important to construct a 2x2 table of counts to see how many observations fall under each category. Let's say we have data from the spring, summer, and fall popcorn sales. There were 275 days of data collected, including 83 days where it rained and 192 days where it did not. On 135 days, more than 100 people bought popcorn, and on 140 days, 100 people or fewer bought popcorn. You can then fill in each cell with the count of days for each category. You can then calculate the expected value for each cell in the 2x2 table using the following formula. The expected value for the cell in the i-th row and j-th column is equal to the total of row i times the total of column j, divided by the total count in the 2x2 table. The formula comes from the definition of independence, where two events are independent if the probability of both occurring, the probability of A and B, is equal to the probability of A times the probability of B. The observed value is just the count for the cell in the i-th row and j-th column. You can then use the same formula as the goodness-of-fit test to calculate the chi-squared statistic. As with the goodness-of-fit test, you can use the chi-squared statistic to calculate the p-value and determine whether the two categorical variables are independent or not. Great work! We'll cover a couple of the assumptions of chi-squared tests and how to perform chi-squared tests in more depth in upcoming readings. The key here is connecting the concepts you have learned about hypothesis testing to these chi-squared tests as well. You'll answer more questions about categorical variables soon. Knowing our data well helps us determine what we can do with the data. It helps us understand which tests we can run, which tests we can't run, and which tests we might be able to run if we transform the data a bit more. In prior videos, we showed the difference between continuous and categorical variables. We've spent most of this course so far talking about linear regression, which can estimate continuous variables of interest, but there is so much categorical data out there. So far, we've learned about the chi-squared goodness-of-fit test and test of independence. Both of these tests examine relationships involving categorical variables. Now we'll focus on analysis of variance, or ANOVA, which helps examine the relationship between categorical variables and continuous variables. Analysis of variance, commonly called ANOVA, is a group of statistical techniques that test the difference of means between three or more groups. If this sounds familiar, you might recall t-tests, which are a common statistical test. ANOVA is an extension of the t-test. While t-tests examine the difference of means between two groups, ANOVA can test means across several groups. Let's say you work at a botanical garden, and you're wondering if different species of butterflies have different lifespans. ANOVA testing can help in this situation. There are two main types of ANOVA tests, one-way and two-way. One-way ANOVA testing compares the means of one continuous dependent variable in three or more groups. We'll use a categorical variable to represent the groupings. When using ANOVA, we have a null hypothesis and an alternative hypothesis. Let's say you're measuring the lifespans of three different butterfly species: monarch, mourning cloak, and swallowtail butterflies.
Your colleagues suggest that the lifespans may be the same regardless of butterfly species. Since the null hypothesis is that the lifespans are equal, a one-way ANOVA test is appropriate. This means that the lifespan of monarch butterflies is equal to the lifespan of mourning cloak butterflies, which is also equal to the lifespan of swallowtail butterflies. We can write H0 as mu monarch equals mu mourning cloak equals mu swallowtail. To generalize, we can use one-way ANOVA when the null hypothesis states that the means of each group are equal. The alternative hypothesis would then be that the lifespans of these three butterfly species are not all the same. This is a little difficult to express using math symbols, so you can just write "not" before the null hypothesis: not mu monarch equals mu mourning cloak equals mu swallowtail. Only one of the mean lifespans has to be different to reject the null hypothesis. We represent the alternative hypothesis as H sub 1, or H1, because you may run more complex tests in the future where you are testing more than two hypotheses at once. One-way ANOVA is a great tool, but sometimes you might encounter situations in which there are two factors that are associated with the continuous dependent variable. At the botanical garden, let's say you want to study whether butterfly lifespan is related to the species of the butterfly and the size of the butterfly. Imagine that butterflies can be small, medium, or large. Now the data is varying according to two factors, species and size. Two-way ANOVA testing compares the means of one continuous dependent variable based on two categorical variables. You are now testing three pairs of null and alternative hypothesis statements at once. The first null and alternative hypothesis pair is the same as before. They focus on the relationship between species and lifespan. The null hypothesis states that there is no difference in lifespans between the three butterfly species. The alternative hypothesis states that there is a difference in lifespans between the three butterfly species. The next pair of hypotheses focuses on our second, new categorical variable, size. The null hypothesis is that there is no difference in lifespans based on butterfly size. The alternative hypothesis is that there is a difference in lifespans based on butterfly size. The last pair of hypotheses tests the interaction between the two variables. This concept of interaction may be familiar from our work on multiple regression. The null hypothesis states that the effect of species on lifespan is independent of the effect of butterfly size, and vice versa. The alternative hypothesis states that there is an interaction effect between butterfly size and species on lifespan. Regression lets you understand how independent variables impact dependent variables. ANOVA allows you to zoom in on some of those relationships to tell a complete story by unpacking relationships in a pairwise fashion. Wonderful job! In this video, we reviewed the differences between one-way and two-way ANOVA, and we were able to state the null and alternative hypotheses for each test in our butterfly scenario. Next, we'll learn how our computers and Python can help us run ANOVA tests. At the core of all of our regression analyses and statistics is storytelling. We want to understand how the different variables are related. Although regression and ANOVA can help answer similar questions, one or the other might be more useful in specific cases.
A regression analysis will help provide a holistic picture of whether, and by how much, a number of different variables impact an outcome variable. On the other hand, ANOVA helps unpack pairwise comparisons among some sets of those variables to better understand the nuances among the elements that fueled the regression analysis. ANOVA can work like a magnifying glass, zooming in on specific parts of the regression story. Now, let's use a subset of a dataset about diamonds to explore ANOVA in Python. You can load the original dataset through the seaborn library. We've already cleaned the data and transformed some variables. We have two variables, the logarithm of the price and the color grade of each diamond. Using pandas, load in the CSV file. Now that the dataset is loaded, examine the data. If you haven't already, import the seaborn package as sns. Use a box plot to determine how the price varies based on the color grade. There are a few different color grades represented: D, E, F, H, and I. There seems to be some variation in the log of the price, but it's not clear if there is a difference based on color grade. So let's test it out using a one-way ANOVA. You'll be using the statsmodels module. Start by creating a regression model using the OLS function, and then fitting the model to the data using the fit method. This will allow you to use the ANOVA function to see if there is a statistically significant difference in price between the groups. Remember that you have to add the capital C and parentheses around color in the OLS formula to indicate that color is a categorical variable. Based on the results from the model, you can observe that there is a statistically significant relationship between diamond color and price. But it's unclear whether the price differs between the various colors. To learn more, you can run a one-way ANOVA test. You can create an ANOVA table using the statsmodels anova_lm function. The table will provide you with statistics about the color variable. Recall that a one-way ANOVA test compares three or more groups of one categorical independent variable based on one continuous dependent variable. Let's state the null and alternative hypotheses. The null hypothesis is that there is no difference in diamond price based on color grade. The alternative hypothesis is that there is a difference in diamond price based on color grade. Note, there are different ANOVA types, one, two, and three, which you can read about in the documentation. The ANOVA table gives you the sum of squares and degrees of freedom for the color variable and the residuals, as well as the F-statistic and p-value for the color variable. From the results, the p-value is very small, which means that you can reject the null hypothesis that the mean price is the same for all diamond color grades. Now, let's add another categorical variable into the mix: the cut of the diamond. There are three kinds of cuts to include: ideal, premium, and very good. You can load in the dataset that has been cleaned. You'll do the same as before and first fit a regression model, but the equation will address the possible interaction effects between cut and color. The colon indicates interaction between the color and cut categorical variables. Before you run the two-way ANOVA test, let's review the three hypothesis pairs you'll be testing. First up are the null and alternative hypotheses about diamond price based on color. Next are the hypotheses about diamond price based on cut.
Lastly are the hypotheses about the interaction between color and cut on diamond price. Now get the results of a two-way ANOVA test using the same anova_lm function. The table includes a row for each of the two categorical variables and a row for the interaction between cut and color. Since the p-value is very small for all three, you can reject all three null hypotheses. In conclusion, the logarithm of the price is not the same for different colors. Additionally, the logarithm of the price is not the same for different diamond cuts. Finally, there is an interaction effect between the color and cut that impacts the price of the diamond. Wonderful job! In this video, we reviewed the difference between one-way and two-way ANOVA, the null and alternative hypotheses, as well as how to code and interpret the results in Python. We've covered a lot so far, and in the following videos and readings, we'll continue exploring the power of ANOVA testing. Previously, we talked about the effectiveness of ANOVA testing for understanding a continuous variable in relation to a categorical variable of interest. To recap, ANOVA testing can be applied to the results we get from a linear regression. If we have a categorical independent variable, such as diamond color, and we're trying to estimate the price of a diamond, we can use an ANOVA test to determine if there is a statistically significant difference in price based on diamond color. But, as we stated before, rejecting the null hypothesis only tells us that the means are not all the same. If we have three or more groups, we don't know which one is different or how many are different from each other. There might be cases where it's really important to know if one category in particular is different. For example, if you're building a roller coaster, you will be choosing among a couple of different material types based on their strengths. You really want to know if, and by how much, the materials differ in strength. In this case, an ANOVA post hoc test can be helpful. An ANOVA post hoc test performs a pairwise comparison between all available groups while controlling for the error rate. If you recall from learning about statistics, we have confidence intervals and p-values to quantify our uncertainty. There is always a small chance that we falsely reject the null hypothesis purely based on probability. Falsely rejecting the null hypothesis is sometimes referred to as Type I error. Typically, there's a 5% chance we've rejected the null hypothesis when it was actually true. But if we run a bunch of tests, all with a 5% chance that we're incorrectly rejecting the null hypothesis, the chance that we've made a mistake compounds. So the odds that we've made at least one mistake increase very rapidly the more tests we perform. Post hoc ANOVA tests control for that increasing probability. One of the most common ANOVA post hoc tests is Tukey's HSD, which stands for honestly significant difference. After performing an ANOVA test where you get statistically significant results, all you know is that at least one of the group means is different. Tukey's HSD test will then compare all the pairs of groups and determine which pairs are different from one another, while controlling for the fact that you're running multiple hypothesis tests all at once. Now let's return to the one-way ANOVA diamond example. Let's say you already have your packages imported from last time and you've saved the dataset as a data frame called diamonds. Create the linear model again.
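Here's a minimal sketch of the whole diamonds workflow, from the one-way and two-way ANOVA tests to the Tukey's HSD test we're about to run. The file name and the log_price column name are hypothetical stand-ins for the cleaned data:

```python
# A minimal sketch of the diamonds workflow. The file name and the
# log_price column name are hypothetical stand-ins for the cleaned data.
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols
from statsmodels.stats.multicomp import pairwise_tukeyhsd

diamonds = pd.read_csv("diamonds_cleaned.csv")  # hypothetical file name

# One-way ANOVA: fit the linear model with color as a categorical
# variable, then build the ANOVA table.
model = ols("log_price ~ C(color)", data=diamonds).fit()
print(sm.stats.anova_lm(model, typ=2))

# Two-way ANOVA with interaction (the colon indicates interaction).
model2 = ols("log_price ~ C(color) + C(cut) + C(color):C(cut)",
             data=diamonds).fit()
print(sm.stats.anova_lm(model2, typ=2))

# Tukey's HSD post hoc test: pairwise comparisons between color grades
# while controlling the error rate across the multiple tests.
tukey = pairwise_tukeyhsd(endog=diamonds["log_price"],
                          groups=diamonds["color"],
                          alpha=0.05)
print(tukey.summary())
```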
You already know that there is a statistically significant relationship between color and price. Now run the one-way ANOVA just like before. Here you can observe that the p-value is much smaller than 0.05, so the results are statistically significant, and therefore you can reject the null hypothesis that the mean price is the same across all diamond colors. Now that you have significant results from the ANOVA test, you can run the Tukey's HSD post hoc test. First, import the pairwise_tukeyhsd function from statsmodels. Next, run the test. The endog argument indicates which variable is being compared across groups. The groups argument tells the function which variable holds the groups you're interested in. In this case, color. The alpha argument tells the function the significance level you're testing at. You set alpha to 0.05, as you are aiming for a 95% confidence level. You can access the results of Tukey's test in two ways. One way is to use the summary function. You can also use the print function. In both cases, you can observe all of the pairwise comparisons between each pair of diamond color grades. There is an adjusted p-value column, a column with the lower and upper bounds of the confidence intervals, as well as a column letting you know whether or not you can reject the null hypothesis. If the reject column reads false, then you cannot reject the null hypothesis. If the reject column reads true, then you can reject the null hypothesis. For example, the first row compares the mean diamond price between color grades D and E. The null hypothesis is that the diamond prices between the two color grades are the same. But the Tukey's HSD test informs us that you can reject the null hypothesis with a p-value of 0.001. In conclusion, the mean diamond price is not the same for D and E grade diamonds. Awesome job today! To recap, combining tests may seem like a complex task, but it's so powerful to have a wide set of tools for running analyses. We'll continue unpacking ANOVA and post hoc tests in the course materials. As always, revisit any prior videos and readings as needed. You can do this. Good luck, and I look forward to joining you again next time. My name is Ignacio. I'm a staff economist on the chief economist team. I was born in Argentina, and being born into a middle-class family in Argentina means that I got to experience every single economic crisis that you can imagine in my most formative years. That probably got me interested in economics, which some people say is the original data science. I started working in public policy, trying to help decision makers in health and education use data to inform some hard decisions they have to make. And after five years of doing public policy, I moved into the tech sector. I started working at Google, where I used the same toolkit but to inform business decisions. If you are going to enter this field, professional development is something that will always be a part of your life. When I finished my PhD, I said, OK, I have my toolkit. I can go and answer any questions. That's not the case. Soon after, when I started my first job in policy research, I remember a very senior statistician came to me and said, well, you are doing this great, but if you were using these different tools, it would be way better. I quickly learned that there were things I could learn that I didn't learn in grad school that would help me be better at my job.
And I made the decision to invest in myself, and that basically meant that on a weekend I would wake up early and watch some YouTube lectures that people recommended to me. I would read a book that people recommended to me. And once I learned what I wanted to learn, I remember talking with my boss at the time and saying, we just did this thing for this client, but now I want to do it in a different way. And I also want to show them and see what they think. By the time I left my previous job, 100% of my work was using that new toolkit that I learned while on the job. And I expect that five years from now, the tools that I'm going to be using are probably going to be almost exclusively things that I haven't learned yet. So you're probably finishing your advanced data analytics certification and you're asking yourself what is next. My guess is that you're a self-motivated person and you can learn by yourself. And today, doing that is quite easy because resources are freely available online. You can go to a virtual conference from your home. You can watch a lecture on YouTube. You can be on Econ Twitter and learn about the latest innovation in methodology that may help you answer a business question you're working on that day. And once you start in that world, you can go down the rabbit hole and learn more and more. So far, we've learned some exciting ways to uncover stories about categorical variables. One of these tests is ANOVA, or analysis of variance. ANOVA helps us learn more about the differences in a continuous Y variable based on different groupings. Using ANOVA, we can determine whether groups are actually different from one another. You may recall that when we learned about simple linear regression, we could only take into account one independent variable to explain the variance of one dependent variable. Then we learned about multiple regression, which allowed us to incorporate lots of different factors into our analysis. By working with simple linear regression, you built a strong foundation for understanding the mechanics of regression modeling. By working with multiple regression, you were able to vastly expand the kinds of questions you could explore. For example, using simple linear regression, you could explore the relationship between the weather and iced coffee sales. But with multiple regression, you could then include temperature and distance to public transportation, or other variables you think might be related to coffee sales. There's a similar relationship between ANOVA and ANCOVA, or analysis of covariance. Analysis of covariance, or ANCOVA, is a statistical technique that tests the difference of means between three or more groups while controlling for the effects of covariates. Covariates are the variables that are not of direct interest to the question we are trying to address. By taking the covariates into account, we can better isolate the relationship between the categorical variable we are interested in and the Y variable. This allows us to draw more accurate conclusions about the relationships among the variables. For example, if examining iced coffee sales, ANCOVA allows you to analyze how the sales are different on workdays versus the weekend while controlling for the temperature of the day. Let's say we have a drop in sales on the weekend. We could assume that no one buys coffee on weekends because they're not going to work, but perhaps it was especially cold that weekend. ANCOVA allows us to double-check whether temperature was a factor.
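As a preview of what that might look like in Python (we'll cover running these tests properly later), here's a minimal ANCOVA sketch on synthetic iced coffee data. Every variable name here is hypothetical:

```python
# A minimal ANCOVA sketch for the iced coffee example. The data is
# synthetic and every variable name is hypothetical.
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

rng = np.random.default_rng(1)
n = 120
coffee = pd.DataFrame({
    "day_type": rng.choice(["workday", "weekend"], size=n),
    "temperature": rng.normal(70, 10, size=n),
})
# Sales depend on both the day type and the temperature.
coffee["sales"] = (200
                   + 15 * (coffee["day_type"] == "workday")
                   + 2 * coffee["temperature"]
                   + rng.normal(0, 10, size=n))

# Include the covariate (temperature) alongside the categorical
# variable of interest (workday versus weekend).
model = ols("sales ~ C(day_type) + temperature", data=coffee).fit()
print(sm.stats.anova_lm(model, typ=2))  # type II sums of squares
```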
You might wonder why a data analytics professional would use ANCOVA when we already have linear regression analysis we can use. There are many similarities between ANCOVA and linear regression. For example, both allow for continuous and categorical independent variables. Both focus on a continuous Y variable, and both center on understanding relationships between variables. But the use cases depend on which variable we're most interested in understanding. With ANCOVA, while we are not focused on the covariates, we are including covariates to gain a clearer understanding of the categorical variable. With regression, we might be interested in all of the independent variables, or in predicting the Y variable for unseen data. That was a lot. We've gone through one-way ANOVA, two-way ANOVA, and ANCOVA after covering various linear regression models. All of these are important concepts to be able to talk about as a data analytics professional, but don't feel like you have to remember everything we're going through verbatim. That would be really hard. When you're working on a new project, you'll be able to refresh your memory by revisiting this course, doing a quick Google search, or asking other people. I literally do this every day. For instance, I don't remember every equation off the top of my head. But what's important is knowing that there are assumptions and equations. And when I need them, I can either look at my past code or search for them. If you're anything like me, you'll get really good at fine-tuning your searches. Going back to ANCOVA, we're dealing with hypothesis tests. So it's important to state our null and alternative hypotheses before running a test. We'll cover how to run the test in Python later in this course. In this video, we'll focus on understanding what kinds of questions ANCOVA can help us answer. Let's say that you're working at a bookstore and you're interested in the relationship between book genres and sales. It seems new books tend to get more attention because authors are traveling to promote their recent work. The good news is that ANCOVA will let us control for publication year. Controlling for other variables is important so we don't draw conclusions that are not accurate. In this example, the categorical independent variable is book genre. The covariate is the number of years since the book was published. The continuous dependent variable is the number of books sold in the last month. The null hypothesis, or H0, is that book sales are equal for all genres, regardless of the number of years since publication. The alternative hypothesis, or H1, is that book sales are not equal for all genres, regardless of the number of years since publication. Just like with ANOVA, you have Python and your computer available to run the test and the math. But you need to interpret the results. Typically, if the test yields a p-value of less than 0.05, you can reject the null hypothesis that all of the means were the same, even when controlling for the covariates. Great work! We've reviewed a lot about the importance of ANCOVA, which you can add to your data analysis toolkit now. As always, remember that there are additional resources for you to keep exploring ANOVA and its variants. Good luck, and until next time, keep exploring interesting questions and telling compelling stories. Remember how far you've come on your data journey so far. You've built and extended a number of models and tests from their most basic form to include more variables and different kinds of data.
As you've explored various models, you've encountered different questions you can answer and different use cases for each model. You've made the shift from ANOVA to ANCOVA, much like you extended simple linear regression to multiple regression. You added more independent variables to help isolate the effect of each X variable on the Y variable in question. In this video, we'll add more dependent variables to allow us to perform new and different kinds of comparative analyses. The two tests we'll introduce are MANOVA and MANCOVA. Based on the names, you might be able to guess how these relate to ANOVA and ANCOVA. MANOVA, or multivariate analysis of variance, is an extension of ANOVA that compares how two or more continuous outcome variables vary according to categorical independent variables. Like ANOVA, the two most common versions of MANOVA involve one or two categorical independent variables and are referred to as one-way and two-way MANOVA. The independent variables must be categorical, and the outcome variables must be continuous. Since we're still dealing with hypothesis testing, we need some hypotheses to test. Let's return to the bookstore example to generate and test new hypotheses about factors relating to book sales. The one categorical variable will be book genre. The two continuous dependent variables could be the number of books sold per month and the profits from the book sales. Let's say you're working with a one-way MANOVA test. In this case, the null hypothesis would be that the means of both continuous variables are equal for every book genre. So the number of books sold per month is the same for each book genre, and the profit from selling books is the same for each book genre. The alternative hypothesis would be that the means of both continuous variables are not equal for every book genre. Only two genres need to differ on just one of the outcome variables we're examining for us to reject the null hypothesis. For example, perhaps the profit made from self-help books differs from the profit made from science fiction books. Or maybe the number of graphic novels sold per month differs from the number of historical fiction books sold per month. In either case, you could reject the null hypothesis. MANOVA allows us to think of each data point as having a number of characteristics, the continuous Y variables, that we want to understand based on one or two sets of groups we care about, the one or two categorical X variables. If, however, we're only interested in one categorical variable and we want to control for another variable, we can use MANCOVA. MANCOVA, or multivariate analysis of covariance, is an extension of ANCOVA and MANOVA that compares how two or more continuous outcome variables vary according to categorical independent variables while controlling for covariates. Let's say you're still interested in whether book genre is related to the number of books sold and the amount of profit, but you want to control for the popularity of the author. Then you could use MANCOVA. The categorical X variable is still book genre, but now you add a covariate, which is the follower count an author has across social media platforms, which you are controlling for. Then the two Y outcome variables remain the same, the number of books sold per month and the monthly profit. The null hypothesis is that book sales and monthly profits are equal for all genres, regardless of the author's social media following.
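Before we state the alternative hypothesis, here's a minimal sketch of how a one-way MANOVA and a MANCOVA like these might be run with statsmodels. The data is synthetic and all names are hypothetical:

```python
# A minimal sketch of the bookstore MANOVA and MANCOVA in statsmodels.
# The data is synthetic and all column names are hypothetical.
import numpy as np
import pandas as pd
from statsmodels.multivariate.manova import MANOVA

rng = np.random.default_rng(7)
n = 90
books = pd.DataFrame({
    "genre": rng.choice(["sci-fi", "self-help", "history"], size=n),
    "followers": rng.integers(0, 100_000, size=n),
})
books["books_sold"] = rng.poisson(50, size=n) + 10 * (books["genre"] == "sci-fi")
books["profit"] = books["books_sold"] * rng.normal(8, 1, size=n)

# One-way MANOVA: two continuous outcomes, one categorical factor.
manova = MANOVA.from_formula("books_sold + profit ~ C(genre)", data=books)
print(manova.mv_test())

# MANCOVA: add the covariate (the author's follower count) to control
# for it while testing the genre effect.
mancova = MANOVA.from_formula("books_sold + profit ~ C(genre) + followers",
                              data=books)
print(mancova.mv_test())
```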
The alternative hypothesis, or H1, is that book sales and monthly profits are not equal for all genres, regardless of the author's social media following. As you continue expanding your data toolkit, you'll encounter more tests, tools, and models that build off each other. Identifying the connections and use cases is important as you continue throughout your career as a data professional. I'm excited for you to continue testing your hypotheses. Everything we've covered so far has allowed us to explore categorical variables in different ways. By fully utilizing the assortment of tools available for categorical variables, you'll be able to tell a wider variety of stories with your data. After all, we are scientists, and sometimes we need to try out a few tests in order to figure out the story that the data holds. Starting with just understanding one categorical variable, we examined the chi-squared goodness-of-fit test, which determines whether a variable matches a theoretical distribution. Then the test of independence helped us determine whether two categorical variables were independent of one another. Moving beyond the chi-squared distribution, we discovered ANOVA, or analysis of variance. ANOVA forms the basis of all the other tests we covered in this section. One-way ANOVA is a powerful way to determine if there are differences in a continuous outcome variable between groups you're interested in. Two-way ANOVA allows you to gain similar insights while incorporating another set of groups as well. ANCOVA built upon the idea of ANOVA by allowing us to control for another variable to isolate the effect of the groups from covariates. MANOVA and MANCOVA built upon ANOVA and ANCOVA to allow testing of multiple outcome variables of interest. As with any hypothesis testing, all of these tests had null and alternative hypotheses. Stating these hypotheses helps you articulate what you're trying to learn from running these tests. I encourage you to continue working through scenarios with hypothesis tests. The more questions you ask, the more the data will reveal. Keep in mind the data requirements of each test, and that there is always room for error. With practice and time, you'll be telling impressive stories soon. I'll join you again next time. Hello, it's great to be with you again. When we learned about linear regression, we started thinking about different kinds of questions that regression analysis could help us answer. For example, we could consider the factors that contribute to a penguin's body mass or factors that affect click-throughs on a website. We're going to expand the kinds of questions we can ask. This time, we'll focus on modeling the probability that an event will occur. There are lots of problems that we could explore. For example, what factors influence the odds that a customer buys from a company again? What impacts the likelihood that a worker receives high performance ratings? What contributes to a user commenting or not commenting on a video? Logistic regression can help us in these situations. Remember that logistic regression is a technique that models a categorical dependent variable Y based on one or more independent variables X. The dependent variable can have two or more possible discrete values. We'll focus mostly on binomial logistic regression, which models the probability of an observation falling into one of two categories based on one or more independent variables. We use a binary variable Y to indicate the category.
For example, let's say that you're a data professional working for a basketball team, and you want to understand the probability that any given player on your team will score more than 10 points in a game. There are many variables you might want to consider. For example, how well did the player perform last season? What's their average playing time? How many points did they score this season? This might seem like a multiple linear regression problem, but consider the outcome variable: whether or not the player will score more than 10 points in a basketball game. This is a binary outcome variable. Since there are only two possible outcomes, we can't draw a best fit line in the same way we did for linear regression. For example, if you plot playing time against whether or not the player will score more than 10 points in a game, your data will look like two flat lines, one at Y equals zero and one at Y equals one. This is very different from the linear relationship you observed in prior scenarios. Later in the program, we'll learn more about binomial logistic regression and review the stages of regression in PACE. We'll analyze our data by understanding the model assumptions to the best of our ability and ensure we have a binary outcome variable. We'll construct our model and evaluate it using several different metrics. Then we'll execute and share the results with stakeholders and other team members. Logistic regression is a versatile and powerful model, and I'm excited to review it with you, so let's get started. So far, we've learned that binomial logistic regression is a method for modeling the probability of a binary outcome, such as how likely it is that a player will score more than 10 points in a basketball game, or whether a user will comment or not. But just like any other statistical method, we have to make assumptions about the data to have confidence in the results. Now we'll discuss the main assumptions of binomial logistic regression and consider how to find the best logistic regression model given a set of data. Logistic regression is a bit more complex than linear regression, so our goal is to understand the basics of how it works, not to understand every detail of the model. Recall that as we navigate the first two stages of PACE, we can figure out if a logistic regression model is the best choice to address the question we're working on. We consider the assumptions of the model as we analyze our approach. Some assumptions are similar to those of linear regression, and some are different. The first and most important assumption of binomial logistic regression is the linearity assumption, which is a bit different from the linearity assumption for linear regression. In binomial logistic regression, the linearity assumption states that there should be a linear relationship between each X variable and the logit of the probability that Y equals 1. The linearity assumption is the key assumption that explains how we can estimate a logistic regression model that fits the data best. To understand logit, we must first define the odds. The odds of a given probability P are equal to P divided by 1 minus P. We can think of the equation as the probability of P occurring divided by the probability of P not occurring. For example, let's imagine that in a package of cookies with different flavors, you know that about 60% are chocolate. You'll represent this as 0.6. Then the probability of a cookie not being chocolate is 0.4, because 1 minus 0.6 is 0.4.
The odds a given cookie is chocolate are 0.6 divided by 0.4, which is 1.5. The logit, or log odds, function is the logarithm of the odds of a given probability. So the logit of probability P is equal to the logarithm of P divided by 1 minus P. Logit is the most common link function used to linearly relate the X variables to the probability of Y. To translate this into less technical language, let's explore the basketball example further. If you're working for a basketball team as a data practitioner, you'll want to know the likelihood of your player scoring many points in a game rather than the other outcome, that they don't score many points. By assuming that there is a linear relationship between the X variables and the logit of the probability that Y equals our outcome of interest, or 1, you can then find some beta coefficients that explain the data you've observed. You can write the logit of P in terms of the X variables. So the logit of P equals beta 0 plus beta 1 times X1 plus beta 2 times X2, all the way up to beta n times Xn, where n is the number of independent variables you are considering in your model. And like linear regression, we don't want just any set of beta coefficients. We want the best set of beta coefficients to make sure our model fits the data. In linear regression, we minimize the sum of squared residuals, which is a measure of error, to figure out the best model. In logistic regression, we'll often use maximum likelihood estimation to find the best logistic regression model. Maximum likelihood estimation, or MLE, is a technique for estimating the beta parameters that maximize the likelihood of the model producing the observed data. We can think of likelihood as the probability of observing the actual data given some set of beta parameters. To understand these definitions, we need to revisit the assumptions of binomial logistic regression. Aside from linearity between each X variable and the logit of the Y variable, we assume that the observations are independent. This assumption relates to how the data was collected. Because the observations are assumed to be independent, we can say that the probability of observing data point A and observing data point B is equal to the probability of observing A times the probability of observing B. Therefore, if you have n basketball players on your team, you can calculate the likelihood of observing the outcome for each player, and then multiply all of the likelihoods together to determine the likelihood of observing all of the sample data. The best logistic regression model estimates the set of beta coefficients that maximizes the likelihood of observing all of the sample data. Now that we understand how maximum likelihood estimation works, we'll consider two other assumptions of binomial logistic regression. We assume that there is little to no multicollinearity between the independent variables. If we include multiple X variables, they should not be highly correlated with one another, just like with linear regression. Lastly, we assume that there are no extreme outliers in the dataset. Outliers are a complex topic in regression modeling and can be detected after the model is fit. Sometimes it is appropriate to transform or adjust variables to maintain model validity. Other times it can be appropriate to remove outlier data. To recap, we've defined the main assumptions of logistic regression and how to fit the best logistic regression model to the data using MLE.
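To make the odds and logit definitions concrete, here's a tiny numerical sketch using the chocolate cookie example:

```python
# A tiny sketch of odds and logit using the cookie example (p = 0.6).
import numpy as np

p = 0.6
odds = p / (1 - p)    # 0.6 / 0.4 = 1.5
logit = np.log(odds)  # the log odds, about 0.405

print(odds, logit)
# The linearity assumption says this logit is a linear function of the
# X variables: logit(p) = beta_0 + beta_1*x_1 + ... + beta_n*x_n.
```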
Coming up, we'll explore how to build and evaluate a logistic regression model in Python using real data. Meet you there. We've explored the kinds of questions logistic regression can help answer, reviewed the theory supporting logistic regression, and why it's a reliable and foundational data analysis tool. In this video, we'll build a logistic regression model in Python, which puts us in the construct phase of PACE. To illustrate how to code a binomial logistic regression in Python, we'll go through an example with data together. The dataset we'll use comes from a study about motion detection in older adults. We've selected a subset of the data to work with. We've already loaded our dataset into a data frame called activity. You can call the describe method to get more statistical information about the variables. There are 494 rows of data based on the count row. You can also observe the mean, standard deviation, minimum value, maximum value, and the value at each quartile for every variable. Now examine the first few rows using the head method. There are two variables in the dataset. The first variable measures acceleration in the vertical direction. The second variable indicates whether a person is lying down. You want to use logistic regression to predict whether the person is lying down or not. Load in scikit-learn's train_test_split and LogisticRegression functions. Now prepare the data for the Python functions to split the data into a training set and a holdout dataset. Divide the activity data frame into the X variables and Y variables, as you've done for other models. Here you'll just focus on one X variable, vertical acceleration. To follow good data practices, split the data into a training and holdout dataset using scikit-learn's train_test_split method you imported earlier. You'll use the holdout, or testing, data later when you evaluate the model you're building now. We set the random state so that the results are repeatable, but you don't always have to set the random state. If you use a different random state, your results will differ. Next, build the logistic regression model and save it as a variable called clf. This is a name you'll encounter frequently; clf stands for classifier. Call the LogisticRegression function to build a classifier. Then use the fit method of the logistic regression classifier and input the training dataset. Then call the classifier's coef_ and intercept_ attributes to get the parameter estimates. The coefficient, or parameter estimate, for beta 1 is negative 0.118, rounded to the nearest thousandth. And the estimate of the intercept, or beta 0, is 6.102, rounded to the nearest thousandth. If you want to create a plot of your model to visualize results, you can use the seaborn package. So import seaborn using the sns alias if you haven't already. Then call the regplot function. You have to tell the function which column the X variable is in, which is labeled "Acc (vertical)". Then specify the Y variable column. Next, tell the function where the data is stored. Specify that you're interested in logistic regression by setting the logistic argument to True. The plot displays a sharp S-shaped curve. The shaded region around the line indicates the confidence band. You can observe two horizontal lines of data points: one where the person is lying down, when the variable equals 1, and one where the person is not lying down, when the variable equals 0.
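Here's a minimal sketch of that walkthrough. The file name and the outcome column label are hypothetical stand-ins for the motion-detection data:

```python
# A minimal sketch of the walkthrough above. The file name and the
# LyingDown column label are hypothetical stand-ins for the real data.
import numpy as np
import pandas as pd
import seaborn as sns
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

activity = pd.read_csv("activity.csv")  # hypothetical file name

X = activity[["Acc (vertical)"]]  # the one X variable: vertical acceleration
y = activity["LyingDown"]         # binary outcome: 1 = lying down (hypothetical label)

# Split into training and holdout sets; random_state makes the split repeatable.
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Build and fit the classifier, then inspect the parameter estimates.
clf = LogisticRegression().fit(X_train, y_train)
print(clf.coef_, clf.intercept_)  # beta 1 and beta 0

# Exponentiating a coefficient gives the factor by which the odds change
# per one-unit increase in X (we'll use this when interpreting results).
print(np.exp(clf.coef_))

# Plot the fitted S-shaped curve with its confidence band.
sns.regplot(x="Acc (vertical)", y="LyingDown", data=activity, logistic=True)
```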
We'll discuss how to interpret logistic regression results next. In this video, you told the computer to find the best way to determine the likelihood of someone lying down based on vertical acceleration. You also began the work of interpreting the results and were able to plot the model. Coming up, we'll consider a variety of ways to evaluate the quality of the model and how to tell a meaningful story about the data. You have built your first binomial logistic regression. You were able to calculate the parameter estimates and graph the logistic regression. In this video, we'll use the metrics and graphs produced in Python to evaluate how good your model actually was. To recap, you're working with the activity data that measured acceleration and whether or not a person was lying down. You previously saved the data into training and holdout data sets and the logistic regression model as a variable clf. Now, input the holdout data set into the predict method to get the predicted labels from the model. Save these predictions as a variable called y_pred. Note that the fitted model predicts a probability that an observation is a 0 or 1. The predict function from scikit-learn then labels each observation with a 0 or 1. The predict function works by assuming a threshold of 0.5. So if the model predicts a probability greater than or equal to 0.5, the predict function will label that observation a 1. If the model predicts a probability less than 0.5, the predict function will label that observation a 0. The predict_proba function, on the other hand, will allow you to check what probability was predicted for each data point. Now that you have predictions about whether each observation is a 0 or 1, you can create a confusion matrix that will give you a quick overview of how well your model categorized each data point in the holdout data set. A confusion matrix is a graphical representation of how accurate a classifier is at predicting the labels for a categorical variable. The confusion matrix displays how many data points were accurately categorized by a classifier for each category. The other squares in the grid convey how many data points were misclassified. Now build your own confusion matrix, and together we'll review each part of the plot. Import scikit-learn's metrics module to create your own confusion matrix. To graph the confusion matrix, you can use two tools from the metrics module, confusion_matrix and ConfusionMatrixDisplay. Use confusion_matrix to generate the values for the matrix, and save the output as a variable called cm. Then save the graph as a variable called disp using the ConfusionMatrixDisplay method. Now show the confusion matrix by plotting the ConfusionMatrixDisplay output. The diagonal from upper left to lower right displays how many data points were accurately categorized by the classifier for each category. The other squares in the grid show how many data points were misclassified. On the axes of the confusion matrix, there are labels indicating the category of each data point as 0 or 1. 0 means the person was labeled as not lying down for that observation. Data professionals call a data point that is labeled 0 a negative data point. 1 means the person was labeled as lying down for that observation. A data point that is labeled 1 is a positive data point. In the upper left corner of the confusion matrix, we have the count of observations that our classifier predicted as not lying down and that actually were not lying down. These are called true negatives.
Moving to the lower right, we have our true positives. This is our count of observations that our classifier predicted as lying down and that actually were lying down. Now let's consider where our model's predictions were wrong. In the upper right is the count of observations where our classifier predicted the person was lying down, but they were not lying down. These are our false positives. In the lower left is the count of observations where our classifier predicted the person was not lying down, but they were lying down. These are our false negatives. In a great model, we should observe a high proportion of true positives and true negatives, and a low proportion of false positives and false negatives. In this video, you learned about the confusion matrix. This versatile tool can help you add depth to the stories you tell about logistic regression models. In the following videos, we'll discuss even more ways to evaluate the quality of a logistic regression model. I'll meet you there. Here we covered confusion matrices, which help to visually represent and quantify how well a logistic regression performed. In this video, we'll continue to work through the analyze and construct phases of PACE and discuss evaluation metrics for logistic regression. These metrics summarize the true positives, true negatives, false positives, and false negatives presented by the confusion matrix. The three metrics we'll discuss are precision, recall, and accuracy. You can access the functions used in this video through scikit-learn's metrics module. And we'll continue to consider our activity data example. Precision measures the proportion of positive predictions that were true positives. Precision is equal to the number of true positives divided by the sum of true positives and false positives. In this case, precision would tell us, among the people we predicted to be lying down, how many of them were actually lying down. Scikit-learn's metrics library has a convenient function that does the math for us. We input the Y values in our holdout data set and the Y values that our model predicted from the X values in our holdout data set. Since the range of precision is 0 to 1, a score of 0.97 is great. Recall measures the proportion of positives that the model was able to identify correctly. Recall is equal to the number of true positives divided by the sum of true positives and false negatives. Remember that false negatives are those who were lying down, but the model did not detect as lying down. Recall measures the proportion of people the model correctly identified as lying down out of the people who were actually lying down. We input the same data that we input to get precision, but use the recall_score function from scikit-learn. We get a recall of about 0.98, and since recall also ranges from 0 to 1, this model performs pretty well. Lastly, we have accuracy, which is the proportion of data points that were correctly categorized. Accuracy is equal to the sum of true positives and true negatives divided by the total number of predictions. In our activity example, accuracy will measure the proportion of people our model correctly identified as lying down and the proportion of people our model correctly identified as not lying down. We can use the accuracy_score function from scikit-learn. In this case, our model achieved an accuracy score of 0.97, but note that most of the time, precision, recall, and accuracy are not all this high. This is to be expected.
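A minimal sketch of these evaluation steps, continuing from the hypothetical clf, X_test, and y_test defined in the earlier sketch:

```python
# A minimal sketch of evaluating the classifier from the previous video.
# Continues from the hypothetical clf, X_test, and y_test defined above.
from sklearn import metrics

y_pred = clf.predict(X_test)        # 0/1 labels using the 0.5 threshold
y_prob = clf.predict_proba(X_test)  # the underlying predicted probabilities

# Confusion matrix: true negatives (upper left), false positives (upper
# right), false negatives (lower left), true positives (lower right).
cm = metrics.confusion_matrix(y_test, y_pred)
disp = metrics.ConfusionMatrixDisplay(confusion_matrix=cm)
disp.plot()

print(metrics.precision_score(y_test, y_pred))  # TP / (TP + FP)
print(metrics.recall_score(y_test, y_pred))     # TP / (TP + FN)
print(metrics.accuracy_score(y_test, y_pred))   # (TP + TN) / total
```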
Two other common evaluation techniques that may be helpful when working with classifiers are the ROC curve and AUC. These concepts are related to thresholds, true positives, and false positives. Although we used a threshold of 0.5 to generate our predictions, sometimes the threshold is determined based on the scenario. Notice that when we decrease the threshold, the true positive rate increases because we are predicting more observations to be positive, but the false positive rate also increases. The model's true positive rate and false positive rate change at every threshold. With an ideal model, there would exist a threshold where the true positive rate is high and the false positive rate is low. We can use an ROC curve and AUC to examine how the true positive and false positive rates change together at every threshold. Data professionals may use ROC curves and AUC when comparing classification models, so we'll explore them later in this program. In this video, we covered a number of important metrics. These concepts are similar and related. Don't expect to grasp all the nuances of the content on your first pass. Allow yourself some extra time and practice to understand how best to apply these measurement tools to support the stories you're telling with data. Now we'll start constructing the kinds of stories you can share with logistic regression. So far, we've discussed how to apply PACE to logistic regression. We've analyzed our data by understanding the model assumptions to the best of our ability, ensuring we have a binary outcome variable. We've constructed our model and created several evaluation metrics. Now we can begin executing and sharing the results. For example, just like linear regression, we have coefficients that the maximum likelihood estimation technique found would best model the data. But the interpretation is slightly different. Let's revisit the role of logit in logistic regression using our example. To recap, we wanted to understand how changes in vertical acceleration are related to whether a person is lying down or not. Our focus now is on what the beta coefficient means in our equation. In this case, a one-unit increase in vertical acceleration is associated with a beta 1 unit increase in the log odds of P. But if we exponentiate the log odds, then we can determine how much the odds change, as a percentage, based on changes in vertical acceleration. So e to the power of beta 1 is how many times the odds of P will increase or decrease for every one-unit increase in vertical acceleration. We know our coefficient is negative 0.118 from the output. And now we can exponentiate the coefficient. e raised to the power of negative 0.118 is 0.89. We can say that for every one-unit increase in vertical acceleration, we expect that the odds the person is lying down decrease by 11%. This makes sense, because if someone is moving faster in a vertical direction, they are probably not lying down. Given that we have found a strong predictor of whether a person is lying down or not, we can consider the larger picture. For example, classifying motion can help detect falls or suspected injuries. Our strong predictor can be a piece of a larger story about providing care to older adults. Next, let's examine how the interpretation would change in different scenarios. Let's try an example where the coefficient is positive. Imagine the coefficient is 0.25. Then we exponentiate and get e raised to the power of 0.25, which is 1.28, rounded to the nearest hundredth.
Then we could say that for every one-unit increase in x, we expect the odds of y being 1 to increase by 28%. Now imagine there are other factors in the model, such as acceleration in other directions. When there are multiple independent variables in a logistic regression model, just like in a linear regression model, we have to report coefficients while holding other variables constant. So we could say that for every one-unit increase in x, holding other variables constant, we expect the odds of y being 1 to increase by 28%. Moving beyond the coefficient, it is always helpful to state the p-value and confidence interval to give additional information about how likely it is that the result occurred just by chance. Scikit-learn does not have a built-in way of getting p-values or confidence intervals, but statsmodels does. This is a great example of how different tools and packages can give you different kinds of information. As a data practitioner, you have to choose your tools depending on your priorities. When presenting the results, it can be helpful to include a confusion matrix and some statistics on precision, recall, accuracy, and/or ROC AUC. However, depending on the situation, you might want to include all of them or focus on a subset of metrics. Consider that certain industries or organizations may have preferred metrics given how they manage their modeling process. It's important to check with your team about this as you get started as a data analytics professional. Take, for example, the case of detecting spam text messages. Spam texts are unsolicited messages sent to many recipients. Probably only a small fraction of the text messages that you receive are spam. But if your goal is to accurately classify spam text messages so you don't give away private information or click on a bad link, then you're really only focusing on how well you can detect spam messages. In fact, in this case, accuracy is not a good metric. Let's say only 3% of text messages are spam. That means 97% are text messages you want to receive from friends and family or automated messages from service providers. So if a model predicts that 100% of messages are not spam, that model would have a 97% accuracy rate, which seems great. But it's not that good in this context because the model will not have detected any spam at all, even though 3% of messages are in fact spam. In the case of spam messages, recall is probably a more meaningful metric. Recall will tell us the proportion of spam messages that the model was actually able to detect. Precision, on the other hand, would tell us the proportion of data points labeled spam that were actually spam. Precision essentially measures the quality of our spam labeling. There are other metrics that you can explore as well, such as AIC and BIC, which can help determine how good a model is while factoring in how complex the model is. As with many of the topics covered in this course, we've reviewed a number of concepts, and you now understand the terminology associated with the material. Work in the data space is dynamic. There is no one solution that works every time. Data professionals are constantly learning and thinking about better approaches. We always have to contextualize the data, the problem, and the available solutions. Next up, we'll focus on comparing some of the models and tests you've learned so far. Meet you there. You've learned a lot about binomial logistic regression and the kinds of problems it can help you solve.
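As a brief illustration of that point, here is a minimal sketch of fitting a logistic regression with statsmodels to obtain p-values and confidence intervals. The data is randomly generated for illustration only, and the variable names are hypothetical.

```python
# Minimal sketch: p-values and confidence intervals from statsmodels.
# X and y are made-up stand-ins for a predictor (e.g., vertical
# acceleration) and a binary outcome (e.g., lying down or not).
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(seed=0)
X = rng.normal(size=200)
# Simulate a binary outcome whose odds decrease as X increases.
p = 1 / (1 + np.exp(0.5 * X))
y = (rng.random(200) < p).astype(int)

X_const = sm.add_constant(X)  # statsmodels requires an explicit intercept
result = sm.Logit(y, X_const).fit(disp=False)

print(result.params)      # coefficients on the log-odds scale
print(result.pvalues)     # p-value for each coefficient
print(result.conf_int())  # 95% confidence intervals
# Exponentiate to interpret a coefficient as an odds ratio; for example,
# a coefficient of -0.118 exponentiates to about 0.89, an 11% decrease
# in the odds per one-unit increase in the predictor.
print(np.exp(result.params))
```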
Now, you'll review some of the other techniques that you've learned and focus on the kinds of questions you might encounter in your work as a data professional. So far, you've learned about linear regression models, hypothesis testing, specifically chi-squared tests and ANOVA variants, and logistic regression. You've also used the PACE framework, plan, analyze, construct, and execute, to help guide your work. You started with the plan phase to understand which needs were being prioritized. Based on the questions being asked and the data you had access to, you determined which models or tests would be most appropriate. Let's consider an example. Imagine you're a data professional at a recording studio. You have a team meeting and the conversation is about the number of times each song is streamed. Some questions that come up include, what factors influence the number of music streams? And how much does each factor influence the number of streams? Since the outcome variable, the number of music streams, is continuous, you could consider linear regression or hypothesis testing. But because the question asks about how much each factor influences music streams, linear regression is a better model to answer the question. Remember that linear regression allows for accessible interpretation of the coefficients and R squared to help explain which factors impact the outcome variable and by how much. But you want to make sure that the model assumptions are met to add validity to your conclusions and insights. Let's consider another example. Imagine you are working for a cafe. They're sampling different coffee beans from different countries and want to figure out which kinds of coffee beans sell better. The team has put together some projections about how they expect the beans to sell. But they're curious whether the bean sales are independent of their pastry sales. The country of origin of the beans is a known differentiator. The pastry sales would act more as a covariate. In this case, although the outcome variable, sales, is still continuous, the focus of the question is about comparing different groups, such as different kinds of coffee beans and different countries of origin. Therefore, you should focus more on hypothesis testing, which can be a great way to conduct A/B tests. The null hypothesis would be that the cafe sells approximately the same number of bags of each coffee bean type. The alternative hypothesis would be that the cafe does not sell the same number of bags of each coffee bean type. By conducting a series of tests, you can either reject or fail to reject the null hypothesis at a particular p-value to help understand how well the model explains the trends. And the team will be able to better understand which beans to order to sell more coffee at the cafe. Let's consider one last example in which you're working at a social media company and you're interested in exploring why some posts do or do not go viral. You decide on a question: how can I predict whether a post will go viral? Since the outcome variable is binary, either the post goes viral or it does not, binomial logistic regression might be the first model you consider. The best way to determine if a logistic regression is the right choice is to build and evaluate the model. Luckily, there are many metrics you can use, including p-values, confusion matrices, precision, recall, accuracy, ROC AUC, AIC, and BIC. Choosing the best metrics will depend on the situation.
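Returning to the cafe example for a moment, here is a minimal sketch of how a test like that might look in code, using a chi-squared test of independence from SciPy. The contingency table is made-up illustrative data, not real sales figures.

```python
# Minimal sketch: chi-squared test of independence for the cafe example.
# Rows are hypothetical coffee bean types; columns count transactions
# with and without a pastry. All numbers are made up for illustration.
import numpy as np
from scipy.stats import chi2_contingency

observed = np.array([
    [30, 20],  # bean type A: sales with pastry, sales without pastry
    [25, 25],  # bean type B
    [15, 35],  # bean type C
])

chi2, p_value, dof, expected = chi2_contingency(observed)
print(chi2, p_value, dof)

# If p_value falls below the chosen significance level (say 0.05), we
# reject the null hypothesis that bean sales are independent of pastry sales.
```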
When interpreting the coefficients for logistic regression models, remember to exponentiate them. Recall that when sharing results, logistic regression coefficients report, in percentages, how much a factor increases or decreases the odds of an outcome. As a data professional, you will encounter many different questions that require a variety of approaches to address. By focusing on the fundamentals of what each model does well, you'll be able to find ways to put these tools into practice. We've arrived at the end of the course. Congratulations. You should be proud of how much you've learned and can apply to your work as a future data analytics professional. Before we finish, let's review what you've learned. We've defined binomial logistic regression as a foundational data technique used to model the probability of one of two outcomes occurring. In learning about binomial logistic regression, we focused on the importance of the logit function, both as an assumption of logistic regression and in terms of how to interpret the results. You learned about maximum likelihood estimation, a common technique for estimating the parameters that maximize the likelihood of observing the data used to build a logistic regression model. Then you learned how to construct a logistic regression model in Python using the scikit-learn library. Lastly, you considered some evaluation metrics, including confusion matrices, precision, recall, and accuracy. Scikit-learn has many convenient functions that can help you evaluate and interpret a logistic regression. But sometimes you'll need other Python libraries or packages. For example, it can be helpful to go to statsmodels, which you used when learning about linear regression, chi-squared, and ANOVA testing. Python packages, like regression models themselves, have their own strengths and weaknesses. Your ongoing experience as a data professional will help guide you to the tools that are most appropriate for each situation. You also learned how to interpret logistic regression coefficients and what considerations to make when selecting metrics and plots based on the questions you're trying to answer. Lastly, you reviewed example scenarios to figure out why certain models or tests might be most appropriate in each situation. You've worked very hard up to this point. Celebrate all that you've learned and consider the many questions you've asked and answered throughout this learning journey. You're well on your way, so keep up the good work. Hi, I'm Tiffany. It's great to be with you again. I'm here to talk more with you about your portfolio ideas and how you can use your portfolio in your job search. Completing the project for this course is a great way to present your knowledge and experience with data-centered tasks to potential employers. This time, your project will demonstrate what you've learned about regression analysis. Ready to get started? As you learned in earlier courses, each portfolio project provides an opportunity to complete tasks that demonstrate what you've learned in a course and create artifacts that you can add to your portfolio. While completing this portfolio project, you will also develop your interview skills. When interviewing for a job, I recommend that you discuss specific experiences that apply to data professional work. You can also use this portfolio project in these instances to show and explain your transferable skills. Additionally, some employers might ask you to assess a business scenario that requires a multiple linear regression model.
This project was designed to help you practice these skills. In order to complete the project, you'll be presented with details about a business case. Use the instructions to complete a new entry in your PACE strategy document and create a multiple linear regression model with ANOVA testing. You'll also use your PACE strategy document to continue recording your methods and considerations for approaching data projects. Remember, you can go back to other videos and resources in this course if you'd like a refresher on any of the material. Ready? Let's go. Hi, I'm Leah. I'm a data architect, and I work on data models for data scientists and data analysts to use to find insights in data. I have a very untraditional background. I didn't come from computer science or anything like that. I studied sociology and philosophy in France, actually, and then I came back to the U.S. and got a job as a knowledge engineer, and that actually introduced me to the knowledge engineering space. When interviewing for a job, it's really important to talk about the data models that you've worked on and to present them in such a way that you can explain to your interviewer, in a concise manner, what you've done. When you are presenting your data models and your data modeling experience, you can talk about the problem that you're solving for, the techniques that you've used, and how exactly you presented it to the stakeholders, which is a huge step. Take the projects and artifacts that you're getting from this course, as well as other things that you're really passionate about, and build a website or a Jupyter notebook or a GitHub page, something that you can share a link to. That's really important. When I've conducted interviews in the data analytics space, I love to see websites and GitHub links and certificates and other things that show me that candidates have worked on projects outside of work that they're very passionate about and want to share. I can go to their website and see a project that is well presented, where they explain at a high level the concept that they're working on or the problems that they're solving for, and then work their way down into more detail. Find a specific domain you like, a certain data set that you like, or a type of data you like, and focus on that. I notice I get very excited about my passion projects, and I feel like I work a little harder on them, and that helps me to really push through, because I'm excited about the work. In this course, you learned about regression models, how they work, how to build them in Python, and how to interpret them. In the previous course, you practiced statistical skills to analyze and interpret data. Everything you've been learning will help you complete this new portfolio project. To complete this project, you'll be presented with a business scenario and a data set that requires a multiple linear regression, with the goal of solving a unique challenge. Your PACE strategy document will help you to complete the project and encourage you to think like a data professional. As you learned in this and other courses, data professionals solve problems by finding and explaining relationships between variables in data. They tell stories based on these relationships that help stakeholders adjust their business strategies. As a data professional, the work that you do has meaningful impact and can shape the way organizations make informed decisions.
This part of the portfolio project is a great opportunity to demonstrate your technical skills, your ability to address complex problems, and your ability to communicate effectively about solutions. And remember, these skills take time to develop. The more portfolio projects you complete, the better prepared you'll be to face difficult job interview questions and handle challenges in your future role as a data professional. Hello again. I'm back to check in on how you're doing with your progress so far. You've already accomplished so much. In addition to the A/B test you built and analyzed, the tidy data set you organized, and the data visualizations you composed, you've also built a multiple linear regression model. Taking a moment to reflect on each of your portfolio projects creates an opportunity to really acknowledge everything you've learned, practiced, and accomplished up to this point in the program. These projects are collectively preparing you for future interviews as you begin to navigate the data career space. And your PACE strategy document has notes for you to refer to about your process, considerations, and steps for completing these artifacts in your portfolio. This is all material you'll want to discuss with potential employers and hiring managers during interviews. So far, you've identified the importance of understanding assumptions for linear regression, you've practiced evaluating models, and you've learned to recognize the importance of explaining your process when working with regression models. As you begin preparing for interviews, you'll want to think about questions you may be asked, like, what kinds of assumptions do we have for linear regression? What should we do if there are outliers? How do we determine whether outliers are influential points? How do you check for multicollinearity, and what should be done if there is multicollinearity? Of course, there will be additional questions for you to answer in your interview. You might even find yourself in a situation where potential employers and hiring managers ask you to describe how you would approach a project that you have little to no domain knowledge about. This portfolio project required you to assess a unique business scenario. In response, you built a multiple linear regression model with ANOVA testing. In addition to showcasing your modeling skills, you also demonstrated your ability to communicate findings effectively in your professional writeup, which included data visualizations, evaluation and interpretation of the model, and key business insights. Coming up, you're going to learn all about machine learning models. Then you'll have an opportunity to solve a problem by building your own. You're doing a great job putting together your portfolio. Congratulations on reaching the end of this course. Incredible work. I hope you take a moment to celebrate and reflect on everything you've learned and accomplished throughout this course and the program overall. You began by applying the PACE framework, plan, analyze, construct, and execute, to modeling relationships between variables and learning about two fundamental models, linear and logistic regression. Then you learned how to check assumptions and to construct, evaluate, and interpret a simple linear regression model using examples. You were able to practice your Python and modeling skills on real data too.
You extended your understanding of simple linear regression concepts and applied it to multiple linear regression with more variables. The interpretation became more complex and the number of considerations increased. However, the scope of questions you could ask and answer also expanded. Next, you turned your focus to categorical variables with hypothesis testing. You learned about chi-squared and ANOVA testing. This allowed you to ask questions about how groups differed from one another. Finally, you learned about binomial logistic regression, which focuses on the probability of a particular outcome. I hope you're proud of every line you've coded and every question you've asked. Being able to talk about and apply the concepts we covered will serve you in whichever industry you work and the roles you take on for the rest of your data analytics career. You now have a much vaster skill set that includes regression models, evaluation metrics, and hypothesis testing. As you move through the rest of the program, you'll be able to draw connections using statistics and regression modeling. In the next course, you'll learn from Sushila, a fellow Googler, about the machine learning landscape. You'll explore supervised and unsupervised machine learning. I can't wait for you to meet Sushila and for all she has to teach you. You'll learn a lot of new and interesting techniques and solve problems with big data. I'm so grateful I could be here with you on your regression journey. I hope you feel more confident about your data professional knowledge and are prepared to keep developing your skills. You're ready to continue to the next step of your career as a data analytics professional. Have you ever wondered how your internet search browser predicts what you're about to ask? For instance, the algorithm knows that after you type "will dinosaurs," there's a pretty good chance that next you'll type "come back." But how do programmers actually build a trained algorithm to make this kind of prediction? Well, in this course, you'll learn the process for creating complex models, which can be used to offer suggestions about all kinds of things. These models work by applying previously gathered data to make educated guesses about everything from dinosaurs, to the music you listen to, to the route you might take to work. And without talented data professionals and the machine learning models they create, we wouldn't have this useful feature. Throughout this course, in addition to models, you will explore the major characteristics of machine learning and keep developing your skills in Python. But before we start, let me offer my congratulations on everything you've accomplished. You've practiced data cleaning, reviewed key statistics concepts, and explored regression models. Along the way, you've become better equipped to navigate the more complex parts of being a data professional. I will be your guide as we consider a big-picture perspective of the world of machine learning and complex models. My name is Sushila. I'm a data scientist and I work on projects for YouTube here at Google. At YouTube, we use machine learning every day to deliver content to users, from music reviews to travel vlogs to my personal favorite, cat videos. I'm excited to explore machine learning with you. As you've learned previously, machine learning is the use and development of algorithms and statistical models to teach computer systems to analyze and discover patterns in data.
Its applications are diverse, from optimizing microchip design and improving earthquake predictions to recommending videos to watch on YouTube and more. This course will help you on your journey to becoming a data professional who can use machine learning to work on applications similar to the ones I just mentioned. Before you begin, it may be helpful to review linear and logistic regression and the statistical models covered earlier in this program. The terms complex models and machine learning are often used interchangeably. In this course, when the term complex model is used, we are referring to mathematical or computational models in general, inclusive of everything from regression to deep learning. First, you'll learn the different types of machine learning, like supervised and unsupervised learning. Then you'll learn how to implement some of them. Each type will be broken down by its characteristics so that the uses and purposes of each type are clear. Next, we'll consider another perspective on the data science workflow and discover how to apply PACE to machine learning. In addition, you'll learn how exploratory data analysis, or EDA, relates to machine learning. You will also practice additional skills like feature engineering. Another part of your introduction to the machine learning landscape will be selecting relevant models and metrics. You'll determine the most appropriate model for your purpose and data. You will learn which tools to use to evaluate your model's performance. You will also continue to work in Python. You'll explore the most commonly used Python libraries and packages for specific types of machine learning. And you'll use the same resources that Google's data experts rely on in their work. By the end of this course, you will have a virtual tool belt packed with the tools you need to build and improve on complex models. Lastly, you will have the opportunity to practice your new job-ready skills in a portfolio project that's based on a common workplace scenario for data professionals. Machine learning is an exciting field for data professionals. It's evolving and developing all the time, with new applications every day. The concepts in this course will help prepare you to join the field and advance your career in data science. I'm Sushila. I'm a senior data scientist here at Google. I work with teams at YouTube to answer questions they have about their product with data. A typical day for me can look very different depending on which day of the week it is. I like to guard some days of the week where I have very few meetings. That's time for me to have really deep, focused work time. Usually during that time, I will be either working on current analyses, planning future analyses, writing up summary results, preparing presentations, any number of these things. Things where I really need to focus and work on one thing uninterrupted for long chunks of time. On the other days, I typically have a lot of meetings with software engineering partners, product managers, anybody who can really help me give context to my problems. My favorite thing about being a data scientist is really getting to have an impact on people's lives, especially with a product like YouTube. In my role, I've done a lot of measuring of comment quality. And for this, we have a lot of different kinds of data. We have logging data where we can see things like how many comments did you read? How many did you write? How many minutes of a video did you watch?
All the way to survey data where we'll ask users, how did you find this comment? Was it satisfying? Did you enjoy it? These kinds of categorical responses are super helpful; we then turn them into binary labels to classify comments as good or bad. When you go into the comment section and you see something that's just delightful, that is the result of a bunch of work from a bunch of people. And that algorithm is built on this data. When we make improvements to the comment section, it has an actual effect on people's lives. Their experience with the product gets better. And YouTube is a product that touches so many people. It's an amazing way to be able to kind of feel like you're part of somebody's life. Data science, and breaking into the data science field, can be very daunting. It can be a very intimidating field where it feels like everyone has these fancy models and these big presentations and you just don't have those things. You don't know how to do them yet. But keep at it. You will. You will learn how to do these things. Everybody had to start somewhere. Interviewing for data science roles was maybe the most unprepared I had ever felt in my life. I did not spend my whole life preparing for this exact moment. But I got through it because I just kept chipping away at it. Let's discuss some topics that we'll cover in this part of the course. First, we'll consider the main types of machine learning. Then we'll review different variable types used for machine learning, like continuous, categorical, and discrete. It's essential to have an understanding of these concepts so you can build an appropriate model. Next, we'll introduce you to recommendation systems and explore how they use content-based and collaborative filtering to make content suggestions. Both types of filtering are important techniques for building many of the recommendation systems used today. Finally, we'll learn about the ethics and applications of machine learning. We'll examine the questions that data professionals must ask as they plan, analyze, construct, and execute models. Because machine learning models are very powerful, it's important for data professionals to consider the ethical implications of their work. Coming up, we'll give you the tools to avoid some common mistakes that can mean the difference between a problematic project and a successful one. Now let's get started. Earlier in the program, you learned that logistic regression and linear regression have different purposes. For example, when you discover data points are grouped into two different categories in a dataset, a straight line may not describe the data very well or be useful as a means of predicting outcomes. A logistic regression, or sigmoid function, is a better fit for a dataset like that, particularly when new data points are introduced to it. As you've learned, linear regression is particularly good for datasets in which data points can be represented with a straight line. With some datasets, simple regression models may not be sufficient for your analysis. For example, this graph shows a distribution of three classes. A simple linear regression model wouldn't be very useful if we didn't know what class our observations belonged to. And a simple logistic regression model wouldn't be able to handle a problem with three classes. This is where data professionals need more complex machine learning. But many machine learning models use regression principles as a foundational layer to begin the process of teaching a computer model to make decisions.
Depending on the type of data available and the kind of problem you want to solve, you'll probably select one of two machine learning types, supervised or unsupervised. Because supervised learning problems occur more frequently in the workplace, data professionals use this type most often. Supervised machine learning uses labeled datasets to train algorithms to classify or predict outcomes. Data professionals use supervised machine learning for prediction. Labeled data is data that has been tagged with a label that represents a specific metric, property, or class identification. For example, imagine you need an algorithm to predict whether a bird is a penguin or an ostrich based on height. You have a data set of heights and an indicator that specifies whether each measurement came from a penguin or an ostrich. The height value is the X data. The indicator is the label, or the Y data. Here's another example. You own a restaurant and you have data on how many customers visit per month and how much revenue you generate per month. If X is the number of customers and Y is the amount of revenue, then you can use an algorithm to predict next month's revenue based on the number of projected customers. Whether you're measuring birds or predicting revenue, you need labeled data for supervised machine learning. Next, think about the terms classify and predict as they apply to supervised learning. We can use our bird and restaurant examples to help. The bird example requires an algorithm that seeks to classify, or collect different types together into categories, classes, or groups. In the restaurant example about predicting revenue, the algorithm's goal is to forecast or estimate a value, given data that is already labeled. To summarize, supervised machine learning algorithms take data that already contains answers and use it to produce more answers, either by categorizing or by estimating future data. As a data professional, you will manually adjust these types of models to meet business needs. Using your knowledge of data cleaning, statistics, and regression, you will learn to train, tune, and optimize complex models to deliver more accurate results. The other most common type of machine learning used by data professionals is unsupervised learning. Unsupervised machine learning uses algorithms to analyze and cluster unlabeled data sets. In this type, data professionals ask the model to give them information without telling the model what the answer should be. At this point, you may have an idea of what unlabeled data means. Think back to the ostrich example we discussed earlier. Unlabeled data would describe a set of flightless birds and not contain any kind of labels, tags, or categorizations. When you receive a data set like this, the goal is to group the birds by their similarity based on patterns detected by your model, without there necessarily being a correct answer. Once an algorithm is deployed, unsupervised learning will manage data as it comes in and classify or analyze it. For example, when a news aggregator categorizes an article by topic, or a media platform categorizes a video by genre, this is done by unsupervised learning algorithms. Later in the program, you will learn how these algorithms work on a conceptual level, how to implement them, and how to apply them to data sets you'll encounter on the job. There are a couple of other types of machine learning besides supervised and unsupervised.
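Before moving on to those other types, here is a minimal sketch of the supervised bird example above in scikit-learn. The heights and labels are made up for illustration; a real data set would be far larger.

```python
# Minimal sketch: supervised learning on labeled data (the bird example).
# X holds made-up heights in centimeters; y holds the labels
# (0 = penguin, 1 = ostrich).
from sklearn.linear_model import LogisticRegression

X = [[95], [110], [100], [210], [230], [220]]
y = [0, 0, 0, 1, 1, 1]

model = LogisticRegression().fit(X, y)  # train on the labeled data

# Classify new, unlabeled height measurements.
print(model.predict([[120], [205]]))  # expected: [0 1] (penguin, ostrich)
```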
Reinforcement learning is often used in robotics and is based on rewarding or punishing a computer's behaviors. The computer will take action based on a policy, or set of rules, that it has learned. If the action results in a favorable outcome, it will receive a reward. If the action results in an unfavorable outcome, the computer will receive a penalty. Based on whether it received a reward or a penalty, the computer will update its policy, trying to optimize for rewards or minimize penalties. This process will repeat until a satisfactory policy is found. Finally, there is deep learning. Deep learning models are made of layers of interconnected nodes. Each layer of nodes receives signals from its preceding layer. Nodes that are activated by the input they receive then pass transformed signals either to another layer or to a final output. Another term you often hear in connection with machine learning is artificial intelligence. Artificial intelligence includes all types of machine learning, so we won't rely on the term for the purposes of this course. Instead, we'll focus on supervised and unsupervised learning. These are the most common applications of machine learning, and having strong skills in these domains is valuable to potential employers. Supervised and unsupervised learning use many of the same principles as reinforcement learning and deep learning, so you'll have the foundation you need to explore these topics further on your own. Now you're familiar with the machine learning landscape. In this course, machine learning falls under the scope of artificial intelligence, which is illustrated on this map. Machine learning and artificial intelligence refer to the same principle: training a computer to detect patterns in data without being explicitly programmed to do so. Under the category of machine learning, you'll also find all the other types of learning we've discussed. Finally, there is one aspect of machine learning and data science that every data professional should know. Quality is more important than quantity. A small amount of diverse and representative data is often more valuable for data professionals than a large amount of biased and unrepresentative data. The concept of infinity can be difficult to comprehend. Whether we're cooking a meal or reading a book, most activities humans partake in have a beginning, middle, and end. In other words, they're finite. But as you've learned from previous courses, we data professionals deal with infinity all the time. And this is also true when building complex models. As you learned earlier in the program, continuous features can take on an infinite and uncountable set of values. Understanding this concept is critical when selecting a machine learning model and choosing the measurements to check that model's utility. Imagine you own a citrus tree farm and you want to learn the average weight of this year's yield of kumquats. The entire yield, or population, is 100 bushels. Using simple random sampling, you pull three kumquats from each bushel and weigh them individually. The recorded individual weights of all these kumquats are considered continuous data because the possible weight values are infinite and uncountable. In other words, one kumquat doesn't weigh exactly 15 grams, and out of the 300 kumquats you weighed, their weights may have been measured to two decimal places, like 15.76, 16.09, and 15.56. But the measurement is continuous because a weight could be any of the infinite number of values between those measured points, like 15.762950.
Simply put, weight is a continuous feature because it has an uncountable set of possible values. Conversely, the total number of kumquats in your 100 bushels is a fixed quantity. Because of this, the total number of kumquats is not a continuous feature. As a data professional, knowing whether the features you input into a machine learning algorithm are continuous or discrete will be essential to choosing the correct model and the evaluation metric for that model. Recognizing whether data features are continuous is not the only indicator to consider when deciding which machine learning model to use, but it is a very helpful one. Here's our machine learning map again with some new information added to it. Beneath supervised, you find a block of models used to predict continuous outcomes, including several regressors. Supervised learning models that make predictions on a continuum are called regression algorithms. Some of these models were introduced earlier in the program. Others will be defined later in the course. For now, just know that data professionals use these types of models to work with continuous data. The goal of these models is to predict outcomes or values based on the data sets provided. The nature of a data professional's job in this case is to train the model to predict the values as accurately as possible. And that's what you'll learn coming up. Meet you there. Sorting candies by size and shape and color is pretty simple. So it may seem like teaching a machine to sort wouldn't be that difficult. Whether or not a particular model is appropriate for a problem like sorting by characteristic is largely determined by what type of variable it must predict. In this video, we'll review a few types of variables that are helpful for determining the right supervised machine learning model for your data. As a reminder, continuous variables are variables that can take on an infinite and uncountable set of values. On the other hand, categorical variables and discrete variables are not continuous by nature. Rather, categorical variables contain a finite number of groups or categories. For example, you might use a categorical variable to classify a vehicle type, like car, motorbike, or bus. Next, there are discrete variables, which have a countable number of values between any two given values. In this way, discrete variables are unlike continuous variables, which are uncountable and have an infinite set of values. So the height of a tree is a continuous variable, but the number of trees in a park is a discrete variable. Discrete variables can be counted, and categorical variables can be grouped. For example, the paint color of a house is categorical, while the number of houses in a neighborhood painted lavender is discrete. Recall the definition of supervised machine learning, a category of machine learning that uses labeled data sets to train algorithms to classify or predict outcomes. Categorical and discrete variables, and the idea of classification, are all part of that supervised learning definition. Many machine learning algorithms are trained using large data sets that group data inputs into two or more groups. Knowing what types of features you have in a data set and what outcomes you're looking for will help you to determine the most applicable machine learning model. Let's consider an example from the manufacturing sector. You are the lead data scientist at a stuffed animal manufacturer.
You're lucky to have an automated system that stuffs, stitches, and tags plush cats and dogs at the same time. The system was set up that way because the stuffed animals are sold in packs of two, one cat and one dog. But now, the retailer is requesting that cats and dogs be sold separately. Rather than buying new parts to update the machine, the plant manager asks you to use a camera to identify the cats and dogs so they can be separated automatically. The algorithm for grouping the cats and dogs based on images from a camera would use categorical data as part of a supervised machine learning model. The algorithm will ask, is this a dog or a cat? You'll train the computer using the visual data to recognize and separate the incoming dogs and cats. With that problem solved, you're asked to build a model to predict how many shipping containers are needed to ship all of the stuffed animals. This has a discrete target variable because you're counting a number of containers. Now, let's revisit our machine learning map with the newly added categorical area. You'll find a few new terms. Classification is the broad category under which logistic regression, decision tree classifiers, naive Bayes classifiers, and some others reside. Notice that the decision tree, random forest, and boosting models are present in both the continuous area and the categorical area, as both regressors and classifiers. Just as in any other field of study, there are functions and applications that can't be categorized into just one group. As you continue to develop your understanding of these models, their placement in both areas will make more sense. We'll spend more time on these algorithms later in the course, and pretty soon you'll be building some models of your own. Now that you understand the different types of features used in machine learning models, let's investigate how they can be used together in a type of model you're likely very familiar with. Have you ever been streaming your favorite new album, and when you reach the end, something entirely new begins to play? You've never heard it before, but you really like it. How did your streaming service do such a good job choosing a new song for you? It used a recommendation system. Recommendation systems are a subclass of machine learning algorithms that offer relevant suggestions to users. And as you probably realize, they're everywhere. Just about any website or app that matches you with something, whether it's an outfit to wear or a recipe to cook, most likely uses a recommendation system. The main goal of a recommendation system is to quantify how similar one thing is to another and use this information to suggest a closely related option. In this way, recommendation systems make it easier for users to find and connect with information, products, and content that's relevant and enjoyable. Let's examine how this works. First, you selected a song on the music streaming service. Then, when the song ended, the service played more music related to your initial choice. This is an example of content-based filtering, in which comparisons are made based on attributes of the content itself. In this case, attributes of the music you played are compared to attributes of other music to determine similarity. To make this comparison, there must be data about each song that's a deconstruction of its attributes. In other words, everything that makes the song unique is identified and labeled, like the artist's voice type, the rhythm or beat, or whether a certain instrument is featured.
Then, when you search for a song, the content-based recommendation system will access the list of attributes for that song and every other song in its library. Finally, the system will compare them all using the same list of attributes. Good song recommendation systems compare hundreds of attributes. Content-based filtering has benefits and drawbacks. Some of the benefits are that these systems are easy to understand, they help recommend more of what a user likes, even niche things that few others are interested in, and they don't need information from any other users to work. Another advantage is that the filtering is not limited to comparing items, like songs. These systems can map users and items in the same space and then recommend things that are closest to a user's typical preferences. Interestingly, sometimes a benefit can also be a drawback, and vice versa. For instance, the fact that content-based systems always recommend more of the same type of thing can be a drawback. Users won't be introduced to something that diverges from what they've selected in the past or learn something new. Another disadvantage is that the attributes often have to be selected and mapped manually for all the items, which is an enormous amount of work. Finally, content-based filtering is ineffective at making recommendations across content types because different content types don't use the same features. For instance, a book doesn't have beats per minute, so the same streaming service won't be able to use your song preferences to recommend a new novel, so the use cases can be limited. Note that in this music streaming example, you didn't actually rate anything. You just listened to your playlist and the algorithm found similar songs. When you stream videos, on the other hand, you might rate or review something when you like it. The recommendation system can also use your feedback to suggest other videos you might like. In the video streaming example, you both viewed and actively participated in the feedback process by liking the videos you enjoyed. A drawback of this method of recommendation is that you probably like videos about various topics, but the system will use your feedback to only suggest videos similar to the ones you liked. Here's another way recommendation systems work based on your feedback, called collaborative filtering. When a user actively likes content by rating it or giving it a good review, that feedback can power collaborative filtering. A recommendation system will use collaborative filtering to make comparisons based on who else liked the content. Then it will suggest videos to someone else with similar preferences. Collaborative filtering is different from content-based filtering in that the recommendation system doesn't need to know anything about the content itself. All that matters is whether you liked it. It's a different flavor of recommendation system, which brings me to ice cream. Perhaps we want to know which ice cream flavors David will enjoy. A collaborative filtering system would compare past items that he has liked to what other people have liked. And because David's likes and dislikes are very similar to Fatima's, the algorithm will predict that he'll enjoy fudge brownie because Fatima does. Collaborative filtering works regardless of what the items are. The flavors don't matter. It doesn't even matter that it's ice cream. It could be cars or restaurants or hair products. All that matters is that David and Fatima have similar tastes.
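Here is a minimal sketch of that idea in code: represent each user as a vector of ratings and measure how similar their tastes are. The ratings matrix is made up for illustration, and treating an untried flavor as a zero is a simplification; real systems handle missing ratings more carefully.

```python
# Minimal sketch: the core of collaborative filtering, comparing users by
# their ratings. Rows are users, columns are ice cream flavors; all numbers
# are made up for illustration (0 marks a flavor David hasn't tried).
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

#                   vanilla  mint  fudge brownie  mango
ratings = np.array([
    [5, 2, 4, 1],  # Fatima
    [5, 1, 0, 2],  # David (no rating yet for fudge brownie)
    [1, 5, 2, 4],  # a third user with different tastes
])

similarity = cosine_similarity(ratings)
print(similarity[1])  # David's similarity to each user, himself included

# David's tastes are closest to Fatima's, and Fatima rates fudge brownie
# highly, so the system would recommend fudge brownie to David.
```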
The ability to recommend across content types is one of the main advantages of collaborative filtering. Other benefits are that it finds hidden correlations in the data and it doesn't require tedious manual mapping. One drawback of collaborative filtering is that these systems need lots of data from lots of users before they start producing useful results. Also, collaborative filtering data is very sparse, which means it has a lot of missing values. Let's use movies as an example. There are hundreds of thousands of movies in existence, but most people have only viewed a small fraction of them. So each person's movie data would have missing values for all the movies they haven't experienced. A recommendation system would need to use advanced filtering techniques to manage all that empty space. Most recommendation systems are highly complex and use hybridized models that make use of elements from both content-based and collaborative filtering. But even in the simplest use cases, there are different needs, resources, and strategies available for data professionals to use. And it's the data professional's job to recommend the best problem-solving approach. Recently, you learned about recommendation algorithms and how they help users discover everything from new music to stream, to new hair care products to try, or new games to play with friends and family. These algorithms can be extremely accurate, but they can also miss the mark. For example, one limitation of recommendation tools is the problem of popularity bias, which is the phenomenon of more popular items being recommended too frequently. This means the majority of other items, which might be just as pleasing to users, don't get the attention they deserve. As machine learning becomes more accessible and its power is applied to an ever-broadening set of challenges, there's greater potential for models to have unintended and even harmful consequences. So, as a data professional, it's important that you prioritize fairness in the data that you have and use. Part of this responsible data stewardship is taking steps to reduce the potential for unintended consequences of your machine learning applications. Data professionals must also consider risk. They may need to make decisions that could expose a business and the people it serves to negative consequences. Recognizing the potential for bias will help to minimize risk. Bias in machine learning is particularly deceptive because it stems from human bias, but because a computer makes the prediction, it's easy for the result to seem objective. Often, the bias is unintentional. Let's consider an example of creating an unintentionally unethical model. Suppose you're building a facial recognition model to deploy as part of a service for a sunglasses retailer. To generate a database of facial templates, you recruit people in your office to have their faces scanned. Eventually, you have several hundred scans, which you think should be plenty to generate the templates you need. You're excited to test what you've made, so you ask people from another department to act as a test group. The test results exceed your expectations. You even make the project open source so others can use it and build on it. But then the service goes live, and it's not performing nearly as well as expected. One reason might be that you didn't consider the full range of people who would be using the service.
If all the people you used to generate the templates were, say, older than 30, perhaps the service didn't work well on young adults. Or maybe you used far more people from one end of the gender spectrum. Since your work was released to the public, other people might now be using your templates in their own models without realizing that there could be a problem with the templates and the facial recognition model. None of this was caused by bad intentions. The negative consequences were the result of bias in the training data, which was inherited from bias in the data collection process. Specifically, the faces used to generate the templates didn't represent a wide enough variety of people. By using this repository to build your model, you ended up with a data set with a data class imbalance. In other words, the input data was biased before the modeling even began. This is just one example of how machine learning solutions have the potential to carry with them unintended consequences related to equity and fairness. You'll learn about other examples later in this course. You'll also learn some fundamental questions you can ask at each step of the modeling process to help reduce the risk of a model causing harm. Predictive models are central to machine learning and have the potential to do a lot of good, but with that potential comes risk. Therefore, it's critical for data professionals to ensure that their models are ethical. First, let's consider what that means. There's no simple guide for this because there are so many different kinds of models that are applied to a variety of tasks. As a data professional at any level, you'll probably have to make decisions about models that carry ethical implications, no matter what problem you're trying to solve. Nonetheless, it's always important to ask questions that help you consider the fairness of your model. Let's explore some of the questions that you should ask during the planning stage of model development. Right away, you should ask yourself what the intended purpose of the model is, how its predictions will be used, and by whom. Who is affected by the model, and how harmful or significant could the effects be? If your model uses personal information, have these people given their consent for you to collect and use this data? Is there a way for them to withdraw their consent? Are they aware of what you're doing with their information? A common application of machine learning is making predictions of merit. Who is eligible to receive something? It may be a loan, admission to a university, or access to government services. These are ethically sensitive situations because models that make these predictions affect people's lives. If you're designing a model to predict whether a bank should issue someone a loan, who will use that information? What could happen if the model is wrong? What are the long-term consequences if the loan is denied? Once you've considered these questions, the next step is to analyze the data. This step carries its own set of questions. You must ask whether the data you intend to use to build your model is appropriate, well-sourced, and representative. To continue our example, does your loan data go back many years? If so, it's likely that marginalized and oppressed classes are underrepresented in the data because previous measurements were influenced by bias or prejudice. There's a common saying in data science: garbage in, garbage out. If there are problems with your data, then there will be problems with your predictions.
After you've planned your process and analyzed the data, it's time to construct the model and ask additional questions. For instance, is it important that the model's predictions be explainable? With some modeling methodologies, it may be difficult to know where the predictions came from. This is sometimes known as a black box model. Neural networks are widely known for being difficult to explain, and therefore they're not appropriate for many applications where transparency is important. Algorithms like random forest, AdaBoost, and XGBoost aren't completely black box, but they may require additional effort to explain and justify their predictions. At the other end of the spectrum, linear and logistic regression methods are highly explainable, and so are single decision trees. Once you've planned everything out, analyzed the data, and built your model, there are more questions to ask before you complete the execute phase. First of all, ask yourself if you understand your model and its predictions. Do they make sense? Are the predictions fair? One way of evaluating model fairness is by checking to see how the model's error is distributed over a population. If the model only makes errors in particular cases that are similar, it could carry higher ethical risk. Another question to ask is whether someone is assigned the responsibility of reviewing and monitoring the model, both pre- and post-deployment, to make sure it's performing well and to assess the potential for harm. Finally, make sure you're considering the issue of consent at each stage of the PACE process. As you can appreciate, there are a lot of questions to ask to ensure ethical model development. Few data professionals will be responsible for answering all of these questions themselves, but nearly every data professional must answer some of them, so it's important for you to always keep them in mind. When you ask the right questions at each stage of the PACE workflow, you help to ensure your models are good for business and good for everyone involved. Today, there are many tools and programs that can help you perform data analytics and build machine learning models. Knowing what's available in your digital tool belt is important as you approach and solve problems as a data analytics professional. At YouTube, we need to wrangle vast quantities of data. Rather than reinventing the wheel each time, I use tools and libraries that other data professionals have already created to help me clean, validate, and visualize my data efficiently. You've been applying many of these tools throughout this program, but let's take a moment to review some of the software you've used, some that you haven't, and figure out how they relate. When creating any Python script or program, development is almost always done inside an integrated development environment, or IDE. An IDE is a piece of software with an interface to write, run, and test a piece of code. If you took the original Google Data Analytics Certificate, you coded using the language R and R's accompanying IDE, RStudio. It's possible to create and run scripts inside any standard text editor, but IDEs provide many tools to support the development of your code. You already used an IDE in this program, maybe without even knowing it. You've coded in Python and executed that code within the Jupyter Notebook interface. In this case, Jupyter Notebook is the IDE. For most coding languages, there are many IDEs available to a developer.
They all perform similarly, with differences in functionality and included tools. Selecting one will often depend on your personal preference or on your employer's preference. Later on, you'll learn more about IDEs, but for now, let's examine how they each handle different Python files. The two most common file types are Python scripts, denoted with the extension .py, and Python notebook files, denoted with the extension .ipynb. Depending on the task, a data professional might use both of these file types and may even alternate while working on one problem. Although both of these file types can execute code, they each have their own advantages. It's important to remember that Python is not just a language used in data science. It's a flexible, general-purpose language that can be used for web development, automation, cryptography, and other tasks. Many times, all you'll need is a Python script. A Python script is Python code written in a plain text file, executed by the computer without the need for human supervision. In situations when it's not necessary for a human to check the code while it's running, data professionals generally prefer to use Python scripts. Scripts are especially useful when the program incorporates several files. Scripts are also helpful when there are many errors in the program that require debugging, since scripts can take advantage of additional functionality that notebooks cannot. However, Python scripts typically aren't ideal for data science. A data analytics professional, especially during EDA, needs to use Python to interactively explore a dataset and view the outputs of their code in near real time. Often, these results are shared with colleagues and must be in a human-readable format. Python notebooks are preferable for data tasks that use code to tell a story. Notebooks can be really useful for pairing code with human-readable descriptions and outputs; non-code elements like images, links, and general text can be embedded directly into the file. They also have some nice functional advantages, such as the ability to export the file as a PDF. It might seem like .py files are preferable to .ipynb files, but that isn't necessarily true. Python notebooks are just another tool common in the data space, both for learners and industry professionals. Many employers seek candidates who have experience working with existing Python notebooks and know how to create new ones. You'll continue to use Jupyter notebooks for this part of the course. However, all the code and concepts you've learned will work just as well in a standard .py file if you're ever in a situation that requires that file type. Python scripts, notebooks, and IDEs are just part of the tool belt. Think of them as the foundation for the rest of the projects you'll be working on in this program and as a data professional. Selecting the right combination of these tools will help you successfully complete any task you're given. Earlier, you explored integrated development environments. Keeping in mind the different types of Python files that are available to you, in this video, you'll learn more about specific options for IDEs that you might use as a data professional. Knowing whether or not you want to use a Python notebook or a Python script can help you visualize the overall workflow of your project. But when you start a project using one type, it doesn't mean you need to use it throughout the whole project.
Data professionals change their development environment in the middle of their workflow more often than you might think. You can always switch if you realize that you need some functionality that is offered by a different IDE or file type. Jupyter Notebook is one of the most commonly used IDEs that supports Python notebooks. However, it only offers support for Python notebooks, not Python scripts. Other IDEs, such as Spyder, will only support Python scripts. Some IDEs, such as Visual Studio Code, can support both Python notebooks and scripts. Something else to consider when you're selecting your IDE is the tools that are built into the software. Many of them are relatively simple and make development more efficient. Code completion is a very common feature. It autocompletes what you type based on the functions and variables that are present in the code. Additionally, many IDEs include a file manager, which is helpful when you're managing a larger project. Debugging support and code testing are also available in many IDEs. However, they're more advanced and might not be something you use in your day-to-day work as a data professional. Sometimes, you might start working in an IDE only to find that it's missing some tools you need or want. Once you figure out which features you need or which changes you'd like to make, you can then either customize the IDE you're using or use a different IDE that better fits your needs. Think of your IDE as a kitchen. There are various tools and utensils that are vital for cooking a meal, but the kitchen is the physical space where all the work is going to happen. If you're the person cooking, having a kitchen where you feel comfortable and find everything you need makes the process of cooking much easier. Applying the same logic to your tools and software will make you a better data professional throughout your career.

Throughout this program, you've learned a lot about Python and its built-in functions. You've also used some Python packages and libraries for coding. If you recall, Python packages are collections of modules that include functionality that is not necessarily present in the base Python language. Modules are used to organize functions, classes, and other data in a structured way. Libraries in Python are simply collections of packages. In this video, you'll explore some common packages, including what they can do and how they can work together to help you accomplish a task. Generally, there are three types of Python packages that you'll be using as a data professional. Packages within a category can often be used to accomplish the same tasks. However, they often have small differences that can make one more useful depending on the situation. The first category we'll explore is operational packages. They're also the first packages you'll normally use in the analytical process. Operational packages load, structure, and prepare a data set for further analysis. When creating a Python file for analysis, the first thing you have to do is read in your data. The pandas package is often the most useful for doing this, but the pandas read_csv function, which reads your data into a data frame, is only a tiny fraction of what's included in the package. For efficient analysis and modeling, you can use functions that are built into the pandas package. This makes it easier to complete tasks including preliminary data inspection, cleaning data, and merging and joining data frames.
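As a quick illustration, here's a minimal sketch of what those pandas operations might look like. The file names and column names here are invented for this example:

```python
import pandas as pd

# Read a CSV file into a data frame (the file name is hypothetical)
df = pd.read_csv("sales_data.csv")

# Preliminary data inspection
print(df.head())  # first five rows
df.info()         # column types and non-null counts

# Basic cleaning: remove duplicate rows and fill missing values
df = df.drop_duplicates()
df["revenue"] = df["revenue"].fillna(0)

# Merging and joining: combine with a second data frame on a shared key
regions = pd.read_csv("regions.csv")
df = df.merge(regions, on="store_id", how="left")
```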
Other operational packages, such as NumPy and SciPy, provide functions for advanced mathematical operations. The second category is data visualization packages. There are many different packages that can help you create the perfect plots and graphs based on the needs of a project. While simple plotting functions exist across the most popular packages, there are small differences among them. You should become familiar with as many as possible. Matplotlib is usually the go-to library for basic visualizations in Python. It has a wide range of features and can be challenging to master, but it is extremely powerful, allowing developers to create almost anything they can imagine. Seaborn is another visualization package that is focused on statistical visualization. Statistical visualizations are simple to create using Seaborn, though other types of plots may not be possible or may require extra effort to create. Plotly is often used for presentations or publications, such as creating a data visualization for an interactive dashboard. It's similar to Matplotlib in the sense that it can be challenging to master, but it can create incredible graphs and even allows you to add interactive elements to the visualizations. The final category of packages used in this course is for machine learning. Scikit-learn is a machine learning library that is built upon many of the packages we've already discussed. This library enables you to build a variety of model types, both supervised and unsupervised. It also provides a great interface in which to analyze the results of a model. Packages vastly expand the functionality of Python, and experienced data professionals will use many of them in their work. As you become more familiar with all of these exciting tools and features, you'll be even more prepared for a successful data career.

Here's a little secret about the data field. Almost no one knows what they're doing 100% of the time. Even the most experienced professionals encounter issues with their code or the data analysis process itself and need to search for answers. That's why it's so important to understand the available resources that can help you find the solutions you need. So let's consider a situation. You just finished a piece of code and pressed the run button, but an error appears. It could be a problem with the way you imported your data or something not totally correct in the way you prepared the data for analysis. What can you do? The first step might be searching for the error, as most problems will have been encountered by other developers before. Python is particularly good at telling you where the error is, whether it's a simple syntax error or some other exception. Your IDE will often specifically identify the problem along with the exact line number where it was caught. So take the error output and search for it online. This usually yields some pretty helpful results. In fact, you'll often find the exact same error in the first few search results. If you're not getting the help you need there, you can search on a public platform like Stack Overflow directly. Stack Overflow is a go-to resource for coding issues, considered by many data professionals to be the definitive collection of coding questions and answers. Not only that, but the community is very responsive and helpful, so you should feel comfortable posting your own questions. Another resource that can be great is the documentation for the package or module you're working with.
Documentation is an in-depth guide written by the developers who created the package. Documentation features specific information about various functions and features and usually includes helpful examples. Kaggle is another resource that many data professionals use at some point in their journey. The online community features tens of thousands of public data sets along with Python notebooks that provide examples of how to tackle a variety of analyses. Many people who are just starting out in the data science industry use Kaggle because it offers tutorials and data sets to learn and practice machine learning techniques. Data professionals who've been in the industry for many years also use Kaggle to learn about new advances in technology and keep their skills sharp. When you encounter an error message or other coding issue, it's also a best practice to consider whether your tools are up to date. Package updates can sometimes break pre-existing code or change the way a particular function is accessed and used. Your code will usually throw an error that tells you if a certain package or piece of software is out of date, but staying proactive can prevent problems in the first place. So make sure you have the correct versions of all your software and tools. These resources are just a few to consider when you need help with your programming. You'll find developers collaborating all over the internet, and the more you learn, the easier it will become to find solutions yourself.

There's a lot of technology that helps you perform data tasks, and there's an extensive knowledge base online to reference whenever you have a problem during the process. But it's equally important, especially when working in industry, to know which team members can help you answer questions or solve problems. And this doesn't necessarily mean other data professionals. There are many stakeholders who contribute to data tasks. Information technology teams, business intelligence departments, and marketing professionals can all provide critical information. Companies often have their own specific tech, and the IT department can explain what's available for use. For example, if you realize you require specific software or hardware to complete the problem you're working on, then the IT department can help you access what you need. Business intelligence teams take in raw data and make it accessible for further analysis. This could be in the form of dashboards for quick insights, or they may provide preliminary information about a data set, which you'll then use to inform your models. There are also marketing departments, sometimes known as sales, accounts, insights, or product management teams. They can give you more of the why behind a particular analysis from a final product perspective. If you want to confirm that your data project is headed in the right direction, checking in with marketing can give you a clearer target to work toward. In addition to those we've mentioned, there are many other teams you might consult. Knowing the professionals who work adjacent to your project can help you tremendously, so learning about them is valuable whenever you start a new job. Plus, the different teams at a workplace are something you can and should ask about in an interview situation. Questions such as "What other teams will I be working with?" or "What resources are available if I encounter an issue?" can give you a great perspective on what your day-to-day work environment will be like. Remember, data work is collaborative.
Your projects will always have other stakeholders involved, and you should be taking advantage of these key resources as much as possible.

Hi, I'm Samantha, and I'm a product analyst. A product analyst is a data person who drives decision-making on a product team. This can mean working with the product managers to make sure that all of the decisions they make are data-driven and backed up by accurate data. I'm someone who's interested in a lot of things, and finding this role in product analytics was perfect for me because it combines my interest in wanting to have a technical background with my interest in people. One of the cool things about data science is that it's a new and developing field, and there's always new technology or new languages or new methodologies coming out. One of the best ways that I stay up to date is following data scientists on Twitter, looking at Reddit forums, or following data science blogs, either put up by companies or by individual contributors. It helps to find a place where you can chat with other data scientists and talk about problems that you might be facing in your own projects. I'm always referencing these blogs in case I want to find some methodology that I might not have thought of, or some technique that might be helpful, or just a little trick that you might not necessarily know just by reading a textbook. And in my day-to-day job, I'll often look for libraries that some of these data scientists actually develop on their own, and they will usually help me solve something that might not have been possible before. Data science is traditionally a very academic field, and there are a lot of academic conferences that one can attend, but one I think is great for those in the industry is the Open Data Science Conference. Here you'll find analysts, scientists, and engineers from companies all around the world who come together to learn about the latest technologies, network, and get to know each other. This is a great opportunity to really meet other data scientists in the field, especially those who don't have the traditional backgrounds or titles of data scientists that you might see on a job posting. One of the things that really helped when I was learning data science and starting out in my career was finding a community that was interested in the same things I was interested in and that could provide advice and consultation when I was working on my job. This can be anything from helping with recruiting and data science questions to answering coding problems that I might not have been able to find on the internet, or even just finding people to work on projects with. Having others to collaborate with can only help when completing your very first data science project.

Take a moment to reflect on how far you've come. We covered quite a few new and complex concepts. Let's review what you've learned. You identified and differentiated the main types of machine learning, including supervised, unsupervised, reinforcement, and deep learning. We discussed the difference between continuous and discrete features. You also learned about categorical features, which take on a finite set of discrete values. Then we provided an overview of recommendation systems and how they're useful for suggesting content, from music to ice cream flavors. You're now familiar with how recommendation systems work, and you can identify the differences between content-based and collaborative filtering as well as their benefits and drawbacks.
Finally, you learned why it's important to be a responsible steward of data and how to ask essential questions at each step of model development. By thinking through the ethical implications, you learned how to reduce the potential for your model to cause unintended or harmful consequences. All these concepts and skills are essential to a career as a data professional. Having this knowledge will be especially helpful as you continue in this course. Coming up, you'll select machine learning models, learn ways to measure them, and build the models in Python. These important skills will enable you to be a data-driven storyteller, a powerful influencer of change wherever you work. Congratulations, and I'll meet you in the next part of the program.

Welcome to another section in this course. I'm so happy to be here with you again. Together, we're about to apply the knowledge and skills you've been developing to do something really exciting: build machine learning models from start to finish. As you've been discovering, a lot goes into the process of model building. Data professionals apply many different techniques when working to achieve a business goal, and we have tons of opportunities to keep learning and improving upon what we create. Data professionals are also keenly aware that we're human. This means we're inherently imperfect. Well, just like us, the models and the datasets we build are also imperfect, but that doesn't mean they're not useful. It just means we need to be aware of their limitations. In fact, expecting imperfection can be an advantage. Plus, there are many tools to help us better understand data limitations and even turn them into effective data solutions. Coming up, we're gonna explore these benefits as we build and refine our models. This will be a valuable proficiency for you as you move forward into the data career space. I can't wait to begin. Join me in the next video to get started.

Welcome back. In this video, we're going to focus on the PACE workflow for building effective models. As a refresher, PACE consists of four stages: plan, analyze, construct, and execute. PACE helps guide the steps data professionals take as we align our data and models with the business needs. Soon, we'll use it to gain a more comprehensive understanding of the data and prepare it for modeling using feature engineering and techniques to manage class imbalances. From there, we'll build a supervised learning model in Python called naive Bayes. The goal will be to model bank customer churn rate to predict whether a customer will close their bank account. In this context, churn is the rate at which customers stop doing business with a company over a given period of time. And for continuous improvement, we'll create different model iterations. Finally, we'll use performance metrics to evaluate how successfully the model addressed the business need. Using the PACE workflow not only provides a good baseline for approaching the problem, but also helps data professionals stay focused throughout the process. By applying and referencing this framework, you'll have the skills to manage data-driven problems in your career. Ready? Let's go.

Happy to have you back. I have a plan for this video, and the first part is discussing the plan step in PACE. There are two parts: centering the business need and considering the most appropriate machine learning model. When using PACE, it's essential to align your plan with the business and data teams.
For machine learning, this means ensuring that the machine learning model you plan to construct meets the actual business needs. This may seem a bit obvious, but given the potential complexity of the problem, multiple departments will likely be involved in the output. You also have to consider the data that's available to you before getting into the rest of the process of building the model. Remember that you will most often be building models for a company or an organization. Because of that, you'll need to ensure that, throughout the other stages of PACE, your data, modeling, metrics, and optimizing strategies stay focused on what you developed during the plan stage. Let's explore an example. Suppose that you're working in the finance industry and your firm is trying to predict housing prices. You have a massive data set about houses in a certain area. This data set contains information about the houses, such as square footage, the number of bedrooms and bathrooms, and location. Most importantly, the data set also contains the most recent sale price for each house. As the second part of the plan stage of your PACE workflow, you use the context and requirements of the business need to consider what type of machine learning model would be best suited for the problem of predicting housing prices. Based on what you know of machine learning models so far, you need to create a supervised continuous model to get the desired numerical result. Continuous models are the only ones that can help you predict housing prices for this example. Now that you have your plan for the business problem and the scope for the machine learning model, you're ready for the next step: analyze. See you soon.

I'm Ganesh. I'm a group data science manager at Waymo. Waymo is a self-driving car company that's part of Alphabet. Our goal is to automate driving for users so that they can just hail a car from anywhere, have it pick them up, and have it drop them off wherever they want to go. I manage a team of data scientists whose main goal is to figure out opportunities for optimization and improving the processes at Waymo. Every day is a new challenge, because today you're working on one problem, and tomorrow it's completely different. Part of my team's charter is to run experiments for Waymo. So whenever someone thinks of a process change or wants to launch a new feature, we use experiments, or A/B tests as they're called, to figure out: is this change truly helpful? Is it working? What is the expected impact we're gonna get from these changes? There was this time when we were conducting an A/B test to measure the improvements that a process was going to have. We had a control group doing the old process, and then we had a test group doing the new process. We were expecting a lot of improvements to come from that, because we were super sure; we did some demos, and we thought that the new process was significantly better than the old process. But when we launched it, nothing really happened. We waited, and nothing happened again. This was hard, because a lot of program managers and product managers had put their effort into this. We did some field studies to go and see what was wrong. The users were so accustomed to the old process that they weren't willing to try the new process at all. They were going back to the old ways, and that's why our test wasn't performing as expected. That was a learning moment for us, and for me especially, because this is a big part of launching an A/B test.
And that's when I realized that you don't wanna launch process changes with existing users or customers. It's always good to use your new users to try these new processes, because that way you get an unbiased opinion of what is happening, and that way you can truly measure the impact of your new process. I mean, had the problem not been resolved, we would have thought that the new process was bad and not launched a thing that was truly helping the company. So resolving this and figuring out what the problem was made sure that we didn't throw it away, and we launched something that was truly important for the company. It's okay to make mistakes. You're encountering situations that many people are in for the first time. So it's okay. It's okay to ask for help. It's okay to make those mistakes. But the big part is we shouldn't be repeating those mistakes again and again. So having this learning mentality of trying and doing is gonna be the most important thing that any data scientist has to build.

Earlier, we explored the plan stage of our PACE framework. Now that you've finished preliminary planning, it's time for the analyze stage of PACE. While keeping the business need in focus is particularly important during the plan stage, it's also essential throughout the process of developing a machine learning model. The business need informs a data professional on what the model needs to produce. The required result indicates what type of model is needed. The business need also informs data professionals about the data that's necessary to train the model to achieve the desired result. The main focus of the analyze stage is to develop a deeper understanding of the data while keeping in mind what the model needs to eventually predict. For example, if you're creating a supervised learning model, the first thing you'll need to know is what your model is trying to predict. In other words, you'll need to understand your response variable. You've already done this earlier in the course when you were deciding which type of supervised learning model to use, continuous or categorical. Let's use weather as an example. If you need to predict the exact amount of precipitation in inches, you'll probably go with a continuous model. But if the business needs a model that can predict whether it will be rainy, cloudy, or sunny, that requires a categorical model. Often, as a data professional, your data isn't structured exactly the way you need it to be. This is where you can take advantage of many of the exploratory data analysis principles you learned earlier. You'll be using all the techniques you've learned so far to develop an understanding of what data you have available and how it's structured. Let's return to the continuous model example about predicting precipitation in inches. In your data set, the rainfall amounts that were recorded might not be in the exact units needed. Or the data could be split between rain, snow, and other types of precipitation. These are all details that a data professional needs to analyze before building the model itself. For the categorical model example, where you're predicting whether it's rainy, cloudy, or sunny, your data set might not be labeled with the categories you need. The individual days in the data set might be labeled with only cloud cover metrics, which is something you'll have to change to be able to analyze the results effectively.
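For instance, here's a minimal sketch of how you might derive those categorical labels from a raw cloud cover metric with pandas. The column name and the cutoff values are assumptions made purely for illustration:

```python
import pandas as pd

# Hypothetical daily weather data labeled only with a cloud cover percentage
weather = pd.DataFrame({"cloud_cover_pct": [5, 40, 95, 70, 10]})

# Derive the categorical labels the model needs from the numeric metric.
# The bin edges here are illustrative assumptions, not fixed rules.
weather["sky"] = pd.cut(
    weather["cloud_cover_pct"],
    bins=[0, 30, 80, 100],
    labels=["sunny", "cloudy", "rainy"],
    include_lowest=True,
)
print(weather)
```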
After getting a solid understanding of what your response variables are and how they're structured, the next step is exploring your predictor variables. Understanding the relationships that exist between variables in your data set is essential to building a model that will produce valuable results. Similar to the response variables, the predictor variables might not be in the format or style that you want. In those cases, the same considerations apply. You'll need to figure out how you want your data structured before building your model. This process of carefully considering the variables you have and what you need leads into the next part of the analyze stage: feature engineering. Coming up, you'll learn about feature engineering, the various techniques that are available, which scenarios to use them in, and what they can do for your data.

Welcome back. Previously, you learned that the main focus of the analyze stage is to develop a deeper understanding of the data. Carefully considering the variables you have and what you need leads into the next part of the analyze stage: feature engineering. Feature engineering techniques solve problems in how your data is structured and, if done well, can improve your model's performance. In this video, you'll learn more about feature engineering, how it works, and when to use it. Feature engineering is the process of using practical, statistical, and data science knowledge to select, transform, or extract characteristics, properties, and attributes from raw data. This definition has quite a bit to consider. First, the process of feature engineering is highly dependent on the type of data you're working with. Before we go much further, let's check out some examples. Earlier, you learned about types of features, or variables, called continuous and categorical. Remember, continuous variables are variables with values obtained by measurement. As a result, they can take on an infinite and uncountable set of values. Categorical variables, on the other hand, are variables that contain a finite number of groups, categories, or countable numerical values. Your process for feature engineering will be changing and altering these variables with the end goal of using them to train a model. This can often be a challenging process. Data sets used in the workplace can sometimes require multiple rounds of EDA and feature engineering to get everything in a suitable format to train a model. This process highlights one reason why the PACE framework is so beneficial to data professionals. The analyze stage builds directly from the plan stage, or, put more simply, the plan informs how you analyze. Without a strategically aligned business and technical plan, the analyze stage and feature engineering process would be like trying to build a skyscraper without having blueprints. The three general categories of feature engineering are feature selection, transformation, and extraction. Let's learn more about those categories. We'll start with feature selection. The goal of this type of feature engineering is to select the features in the data that contribute the most to predicting your response variable. In other words, you drop features that do not help in making a prediction. This can be done either manually or algorithmically. Let's use a simple weather data set as an example. It contains five different variables and 14 data points. The variable on the far right represents whether or not you would wanna play soccer or football outside based on the other data.
In our example, whether or not it's windy outside might not affect our playing soccer. If that's the case, then we would want to select outlook, temperature, and humidity and drop windy from our data set. Feature selection would mean selecting the variables that help the most in making a prediction. The next category is feature transformation. In feature transformation, data professionals take the raw data in the data set and create features that are suitable for modeling. This process is done by modifying the existing features in a way that improves accuracy when training the model. In our weather data set example, your data might include exact temperatures, but you might only need a feature that indicates if it's hot, cold, or temperate. To make that transformation, you could define some cutoff points for the data and create a new categorical feature from the numerical data. You might define anything above 80 degrees Fahrenheit as hot, anything below 70 as cold, and anything in between as temperate. Feature transformation would mean transforming the degrees into the temperature categories you've defined. Finally, let's discuss feature extraction. This type of feature engineering involves taking multiple features to create a new one that would improve the accuracy of the algorithm. For example, imagine we want to create a new variable called muggy that could be used to model whether or not we play soccer. If the temperature is warm and the humidity is high, the variable muggy would be true. If either temperature or humidity is low, then muggy would be false. Remember, feature engineering techniques are designed to improve your model's performance. While you can make a lot of improvements by tweaking and optimizing the model, the most sizable performance increases often come from developing your variables into a format that will best work for the model. In my work, this often appears as needing to make an outcome variable binary. For example, we may get survey ratings from users that are on a scale of one to five stars, but we need to predict whether a piece of content is good or bad. In this case, we need to change our response variable by mapping the star ratings to either the good or bad label. This is one example of how data professionals better understand their data in the analyze stage. During EDA, you're just beginning to develop an understanding of your data. Feature engineering is a step beyond EDA. You're selecting, extracting, or transforming variables or features from data sets for the construction of machine learning models. Now that you have a better idea of the concept of feature engineering, you're ready to perform some of your own feature engineering in Python, which you'll do in a later video.

Hello again. In this video, we'll continue our exploration of the analyze stage of PACE. Understanding what your variables are and how they're structured is only part of the process. It's also essential to understand the frequency with which the variables' values occur. For classification problems, you need to specifically understand the frequencies of the response variable. As a data professional, you might encounter data sets that are unequal in terms of their response variables. One example of an unequal data set is in the context of fraud detection. You could have millions of examples of non-fraudulent transactions and only a few thousand examples of actual fraudulent transactions. How can a model be built to detect fraud with such limited data to train the model?
This issue is known as class imbalance. A class imbalance is when a data set's response variable contains more instances of one outcome than another. The class with more instances is called the majority class, while the class with fewer instances is called the minority class. It's extremely rare for a data set to have a perfect 50-50 split of the outcomes. There is normally some degree of imbalance. However, this isn't necessarily a problem. Believe it or not, a 70-30 or 80-20 split can be fine. Major issues only arise when the majority class makes up 90% or more of the data set. You'll only know if there's an imbalance issue after the model is built. There are two techniques that allow us to fix any potential issues: upsampling and downsampling. Both of them involve altering the data in a way that preserves the information contained in the data while removing the imbalance. Downsampling involves altering the majority class by using less of the original data set to produce a split that's more even. The number of entries of the majority class decreases, leading to more of a balance. You can use different techniques to achieve this, but generally they are all based on this concept. One technique is to do this randomly by selecting entries to remove, or you can follow a formula. For example, you can take the mean of two data points in the majority class, remove those data points, and add the average data point. Upsampling is the opposite of downsampling. Instead of reducing the frequency of the majority class, you artificially increase the frequency of the minority class. Similar to downsampling, there are multiple ways you can achieve this. The simplest technique is called random oversampling, where random data points in the minority class are copied and added back to the data set. Or mathematical techniques can be used to generate non-identical copies, which are then also added to the data set. So if both upsampling and downsampling achieve the same result, you might be wondering which one to use. Most of the time, you won't know which one is preferred until you've built the model and observed how it performs. However, there are some general rules you can follow regarding when to upsample and when to downsample. Downsampling is normally more effective when working with extremely large data sets. If you have a data set that has 100 million points but has a class imbalance, you don't need all of that data to build a good model. You definitely don't need the additional data that would come from upsampling. Alternatively, upsampling can be better when working with a small data set. If you're working with a data set that only has 10,000 entries, removing any of that data will more than likely have a negative impact on the model's performance. Keep in mind, class balancing is a fickle process and may require some trial and error. Building models with both upsampled data and downsampled data will determine which technique is better in any given situation. Additionally, you'll have to experiment with what sort of split your rebalancing achieves. Balancing the data so the classes are split 50-50 might not always be optimal. On the other hand, turning a 99-1 split into a 70-30 split might be fine, and that is something to consider as you develop your model.

Welcome back. In this video, you'll code in Python to wrap up the analyze stage of the PACE framework. In the analyze stage, you gain a deeper understanding of your data.
This is the stage where you prepare your data so it can be used to train the model. Then you'll move on to the construct stage of the PACE workflow. For the rest of this part of the course, we'll build models that will predict customer churn at a bank. Customer churn is the business term that describes how many customers, and at what rate, stop using a product or service or stop doing business with a company altogether. The models will be supervised classification models because they'll each predict a categorical target. In this case, it's a binary one: whether each customer churned or stayed. Before we get into the data set, we need to import the packages that we'll use. All we'll need for this notebook are NumPy and pandas, because we're only preparing the data for modeling. Now, we'll read in the data set from a CSV file to a pandas data frame. We'll call the data frame df_original and inspect it using the head function. The data set that we'll be using to solve this problem contains customer data, where each entry in the data set represents one customer. For each customer, there are several features that describe the customer's relationship with the bank and information about the customer's finances. Additionally, there's metadata for each customer, like name, gender, and customer identification number. The two features that might not be completely intuitive are tenure, which represents how many years the customer has used the bank, and geography, which identifies which country the customer lives in. Additionally, we have the feature labeled exited. This indicates whether the customer left the bank. A one signifies that they stopped doing business with the bank, and a zero indicates that they are still a customer. The variable exited will be the response variable, or the variable that our model will attempt to predict. When modeling, the best practice is to perform a rigorous examination of your data before beginning feature engineering and feature selection. This process is important because not only does it help you understand your data, what it's telling you and what it's not telling you, but it can also give you clues about what new features to create. You've already learned the fundamentals of exploratory data analysis, or EDA, so this notebook will skip that essential part of the modeling process. Just remember that a good data science project will always include EDA. Let's get a quick overview of our data. We'll use the info function to inspect the data frame. From this table, we can confirm that the data has 14 features and 10,000 observations. We also know that nine features are integers, two are floats, and three are strings. Finally, we can tell that there are no null values, because there are 10,000 observations and each column has 10,000 non-null values. Next, we'll prepare this data set to be used to train the model. The first thing we're going to do is feature selection. If you recall, this is the process of picking out the features that we want the model to use to predict the outcome, and we drop any features that aren't useful to the model. In our bank data, notice that the first column is called row number. This just enumerates the rows. Since a row number shouldn't have any intrinsic correlation with our response variable, we'll remove this feature from our data set. The same is true for customer ID, which appears to be a number assigned to the customer for administrative purposes, and surname, which is the customer's last name. We'll drop these two as well.
As you complete feature selection, keep the ethical concerns you learned about earlier in the course in mind. Consider the implications of your data and the resultant model. For example, in this modeling exercise, we will not include the gender column. This data feature raises a set of complex issues, technically, culturally, and ethically. We recognize that the most rigorous approach would be to model both with and without this feature and examine how it influences predictions. Whatever the approach, it should be driven by an aim for equitable outcomes in your particular use case. We'll remove these columns by calling the drop function on our data frame. We pass to it a list of the names of the columns that we want to remove, and we indicate that we're dropping columns, not rows, by including axis=1. We'll assign the results to a new data frame called churn_df. The resultant data frame is shown by calling the head method. However, there's still more to do before we start training the model. Next, let's practice feature extraction. This is the process of taking two or more features and using them to create a brand new feature that will make the model more accurate. Normally, feature extraction is done using statistics to analyze how predictive each variable is and whether the new feature that is extracted is more predictive than the original variables on their own. For now, we're going to extract a feature as an example, without conducting the analysis we would normally perform if we were trying to build a production-ready model. Let's create a new variable and call it loyalty. We'll do this by taking the tenure of a customer and dividing it by their age. The logic behind taking tenure and dividing it by the customer's age is that it represents the percentage of a person's life that they've been a customer of the bank. People with greater percentages may be more loyal customers. Now we have a new column called loyalty, which we can verify by inspecting the data frame. Let's move on to feature transformation. There are some features in this data set that need to be transformed. Remember, feature transformation is the process of changing how a single feature is represented in the data set with the goal of improving the accuracy of the model. Many classification models require you to convert categorical features to make them numeric. Our data set has one categorical feature called geography. Let's check how many classes appear in the data for this feature by using the unique function on the series. There are three unique values: France, Spain, and Germany. Let's encode this data so it can be represented using Boolean features. We'll use a pandas function called get_dummies to do this. When we call pd.get_dummies on this feature, it will replace the geography column with three new Boolean columns, one for each possible category contained in the column being dummied. When we specify drop_first=True in the function call, it means that instead of replacing geography with three new columns, it will instead replace it with two columns. We can do this because no information is lost, and the data set is smaller and simpler. In this case, we end up with two new columns called geography Germany and geography Spain. We don't need a geography France column. Why not? Because if a customer's values in geography Germany and geography Spain are both zero, we'll know they're from France. After feature selection, extraction, and transformation, the next step is modeling.
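Pulled together, those steps might look something like this in a notebook. This is a sketch rather than the course's exact notebook; the file name and column names are assumptions based on the description above:

```python
import pandas as pd

# Read in the bank data (the file name is a placeholder)
df_original = pd.read_csv("churn_data.csv")

# Feature selection: drop columns with no predictive value,
# plus Gender for the ethical reasons discussed above
churn_df = df_original.drop(
    ["RowNumber", "CustomerId", "Surname", "Gender"], axis=1
)

# Feature extraction: fraction of a customer's life spent with the bank
churn_df["Loyalty"] = churn_df["Tenure"] / churn_df["Age"]

# Feature transformation: dummy-encode the categorical Geography column.
# drop_first=True keeps two Boolean columns; all zeros implies France.
churn_df = pd.get_dummies(churn_df, columns=["Geography"], drop_first=True)

print(churn_df.head())
```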
We'll be using this data set for most of the modeling you'll do in the remainder of this part of the course, and you now have a solid foundation for the construct and execute stages of PACE.

Hello again. You're now about halfway through the PACE workflow. During the plan stage, you developed a better understanding of the business need and the data available. And in the analyze stage, you investigated the data using exploratory data analysis practices. You applied feature engineering techniques to select, transform, and extract data into a format that was suitable for training. This video continues on to the construct stage, where you'll bring the model to life. You'll do this by building a model called naive Bayes. Naive Bayes is a supervised classification technique based on Bayes' theorem with an assumption of independence among predictors: the effect of the value of a predictor variable on a given class is not affected by the values of the other predictors. Let's break it down. Bayes' theorem gives us a method of calculating the posterior probability, which is the likelihood of an event occurring after taking into consideration new information. In other words, when you calculate the probability of something happening, you take relevant observations into account. It can be represented with an equation that calculates the posterior probability of C given X: the probability of the predictor given the class, P(X given C), is multiplied by P(C), the prior probability of the class. The product of these two terms is then divided by P(X), which is the prior probability of the predictor. The posterior probability equation can be rewritten to reveal what's going on behind the variables. The probability of the first predictor variable given the class, P(X1 given C), is multiplied by the probability of the second predictor variable given the class, P(X2 given C), and so on for all predictor variables used in the model. This is complex, so let's return to the weather-based example that we discussed in an earlier video and apply naive Bayes to gain a better understanding. The weather data set will help you build a model to decide whether to go outside and play soccer. This data set has five columns. The first four are the predictor variables, and the final column is the label of the data set. It shows whether we should play soccer. The outlook variable identifies if it is rainy, cloudy, or sunny. The humidity variable indicates the relative humidity, and of course, the windy variable determines if there's wind. Start with the outlook variable. Calculate the posterior probability of one of the features in the data set. To do this, construct a frequency table for each attribute against the target by tallying the number of times soccer is and isn't played for a given attribute. Then, transform the frequency tables into likelihood tables by calculating the proportion of times soccer is and isn't played for each attribute. Use this information to find the probability of the predictor given the class, P(X given C), the probability of the class, P(C), and the probability of the predictor, P(X). In other words, everything needed to calculate the posterior probability. As a reminder, you may pause and review the video if needed. The process of finding the posterior probability needs to be done for every possible class that is potentially being predicted. In this case, there are only two outcomes: play or don't play.
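Written out in standard notation, the equations described above are:

```latex
% Bayes' theorem: the posterior probability of class C given predictors X
P(C \mid X) = \frac{P(X \mid C)\,P(C)}{P(X)}

% With the naive independence assumption, expanded over predictors x_1, ..., x_n
P(C \mid x_1, \ldots, x_n) \propto P(C)\prod_{i=1}^{n} P(x_i \mid C)
```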
Once these values are found, the prediction is made based on the class with the highest posterior probability. When you repeat the process for the second class, don't play, you're only required to modify the calculations after finding the likelihood table. Observe that the posterior probability of playing while it is sunny is higher than the posterior probability of not playing. So if it's sunny outside, a naive Bayes model would predict that the conditions are right to play soccer. Later, you'll explore how multiple predictor variables can be used to make a prediction. All the same concepts will apply. No matter the number of variables that are used, naive Bayes calculates posterior probabilities and makes predictions based on which outcome has the highest probability. You're doing great work. Keep up the momentum, and I'll be with you again soon.

Now that we've discussed how naive Bayes works, it's time to implement it using Python. We're going to continue our work with the bank churn data frame that we prepared in the feature engineering notebook. Remember, we dropped the row number, customer ID, surname, and gender columns; dummy-encoded the geography column to convert it from categorical to Boolean; and engineered a new feature called loyalty by dividing each customer's tenure by their age. Recall that the predictor variables of this data set are of different types. For example, balance and estimated salary are continuous, while geography is categorical. Also, remember that scikit-learn has a few different implementations of the naive Bayes algorithm, and each assumes that all of your predictor variables are of a single type. As a data professional, one of the first things you'll learn on the job is that real-world data is never perfect. Sometimes the data violates the assumptions of your model. In practice, you'll have to do the best you can with what you have. For this lesson, we're going to use the GaussianNB classifier. This implementation assumes that all of your variables are continuous and that they have a Gaussian, or normal, distribution. Our data doesn't perfectly adhere to these assumptions, but a Gaussian model may still give us usable results, even with imperfect data. Let's get started. As always, the first thing to do is import any packages and libraries that you'll need. We'll begin by importing NumPy, pandas, and Matplotlib. We'll also import train_test_split to help us split our data into training and test sets. The model we'll be using is called GaussianNB, which we'll import from scikit-learn's naive Bayes module. Next, we'll import functions that we'll use to calculate our model's accuracy, precision, recall, and F1 scores. Finally, we'll import confusion_matrix and ConfusionMatrixDisplay, which will help us calculate and plot a confusion matrix of our model's results. Let's read in the data frame and call it churn_df. Before we begin modeling, let's do a couple more things. First, we'll check the class balance of the exited column, which is our target variable. We can do this by calling value_counts on the pandas series. The class is split roughly 80-20. In other words, about 20% of the people in this data set churned. This is an unbalanced data set, but it's not extreme, so we'll proceed without doing any class rebalancing of our target variable. Secondly, naive Bayes models operate best when the predictor variables are conditionally independent from each other. When we prepared our data, we engineered a feature called loyalty by dividing tenure by age.
Because this new feature is just the quotient of two existing variables, it's no longer conditionally independent of them, so we're going to drop tenure and age. This step may or may not be beneficial, but we'll do it to help adhere to the assumptions of our model. We've prepared our data, and we're ready to model. Now we need to split the data, first into features and target variable, and then into training data and test data. Let's assign our predictive features to a variable called X, and the exited column, our target, to a variable called y. Then we can split into training and test data. We do this using the train_test_split function. We'll put 25% of the data into our test set and use the remaining 75% to train the model. Notice that we include the argument stratify=y. If our master data has a class split of 80-20, stratifying ensures that this proportion is maintained in both the training and test data. The y tells the function that it should use the class ratio found in the y variable, which is our target. The less data you have overall, and the greater your class imbalance, the more important it is to stratify when you split the data. If we didn't stratify, then the function would split the data randomly, and we could get an unlucky split that doesn't get any of the minority class in the test data. In that case, we wouldn't be able to effectively evaluate our model. Worst of all, we might not even realize what went wrong without doing some detective work. Finally, we set a random seed so we and others can reproduce our work. Now it's time to build the model. Just as with linear and logistic regression, our modeling process will begin with fitting our model to the training data and then using the model to make predictions on the test data. First, we'll instantiate the GaussianNB model, assigning it to a variable called gnb. Then, we'll fit it to the X and y training data. Lastly, we'll use the predict method to make predictions on the X test data, assigning the results to a variable called y_preds. Now we can check how our model performs using the evaluation metrics we imported. For each one, we pass to it first the actual y test data, and then the predictions. This isn't very good. Our precision, recall, and F1 scores are all zero. What's going on? Well, let's consider our precision formula. There are two ways for the model to have a precision of zero. The first is if the numerator is zero, which would mean that our model didn't predict any true positives. The second is if the denominator is also zero, which would mean that our model didn't predict any positives at all. Dividing by zero results in an undefined value, but scikit-learn will return a value of zero in this case. Depending on your modeling environment, you may get a warning that tells you there's a denominator of zero. We don't have a warning, so let's check which situation is occurring here. To do this, we'll call NumPy's unique function on the predictions. Okay, the model predicted zero, or not churned, for every sample in the test data. Both the numerator and the denominator are zero. Consider why this might be. Perhaps we did something wrong in our modeling process, or maybe using GaussianNB on predictor variables of different types and distributions just doesn't make a good model. Maybe there were problems with the data. Before we give up, maybe the data can give us some insight into what might be happening, or what further steps we can take. Let's use describe to inspect the X data.
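Here's a condensed sketch of the modeling workflow just described, ending with the describe call that the next part picks up. The file name is a placeholder, and the column names are assumptions based on the earlier description:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Read in the prepared churn data and drop Tenure and Age,
# since Loyalty is their quotient and is not independent of them
churn_df = pd.read_csv("churn_data_prepared.csv")
churn_df = churn_df.drop(["Tenure", "Age"], axis=1)

# Split into features and target, then into training and test sets,
# stratifying so the roughly 80-20 class split is preserved in both
y = churn_df["Exited"]
X = churn_df.drop("Exited", axis=1)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)

# Fit the model and make predictions on the test data
gnb = GaussianNB()
gnb.fit(X_train, y_train)
y_preds = gnb.predict(X_test)

# Evaluate; on this data, precision, recall, and F1 may all come back as 0
print("Accuracy:", accuracy_score(y_test, y_preds))
print("Precision:", precision_score(y_test, y_preds, zero_division=0))
print("Recall:", recall_score(y_test, y_preds))
print("F1:", f1_score(y_test, y_preds, zero_division=0))

# Check what the model actually predicted, then inspect the features
print(np.unique(y_preds))
print(X.describe())
```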
Something that stands out is that the loyalty variable we created is on a vastly different scale than some of the other variables we have, such as balance or estimated salary. The maximum value of loyalty is 0.56, while the maximum value for balance is over 250,000, almost six orders of magnitude greater. One thing that you can try when modeling is scaling your predictor variables. Some models require you to scale the data in order for them to operate as expected, while others don't. Naive Bayes does not require data scaling. However, sometimes packages and libraries need to make assumptions and approximations in their calculations. We're already breaking some of these assumptions by using the GaussianNB classifier on this data set, and it may not be helping that some of our predictor variables are on very different scales. In general, scaling might not improve the model, but it probably won't make it worse. Let's try it. We'll use a transformer called MinMaxScaler, which we'll import from the sklearn preprocessing module. MinMaxScaler normalizes each column so every value falls in the range of zero to one. The column's maximum value would scale to one, and its minimum value would scale to zero. Everything else would fall somewhere in between. This is the formula: x_scaled = (x - x_min) / (x_max - x_min). To use a scaler, you must fit it to the training data and transform both the training data and the test data using that same scaler. Let's apply this and retrain the model. First, import the scaler. Then, we'll instantiate it and assign it to a variable called scaler. Now, we fit the scaler by passing our X train data to it. Next, we use the transform method to scale the X training data. Finally, we transform the X test data. Now, we'll repeat the steps to fit a model, only this time we'll fit it to our new scaled data. When we calculate the performance metrics for this model, we no longer get zeros. The model isn't perfect, but at least it's now predicting customers who churn. Let's examine more closely how our model classified the test data. We'll do this with a confusion matrix. Remember that a confusion matrix is a graphic that shows you your model's true and false positives and true and false negatives. We can plot this using the ConfusionMatrixDisplay and confusion_matrix functions that we imported. Here's a helper function that will allow us to plot a confusion matrix for our model. All of our model metrics can be derived from the confusion matrix, and each metric tells its own part of the story. What stands out most in the confusion matrix is that the model misses a lot of customers who will churn. In other words, there are a lot of false negatives, 355 to be exact. This is why our recall score is only 0.303. Coming up, you'll investigate the various model evaluation metrics and when to use them. You'll explore the use of several metrics to evaluate model performance. Then you will determine which of the models best satisfies the business requirements for the data project. Meet you there.

Wow, you've reached the last stage of the PACE workflow: execute. This is where final model analysis happens before the model is production-ready. You've already learned a lot about model metrics, the options available to you, and what they can demonstrate to a data professional about the model that has been built. And you've built supervised learning models, including a categorical model in the form of logistic regression. The metrics you used to evaluate those models will also make it possible to evaluate a naive Bayes model.
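For reference, here's how the four metrics reviewed below are commonly written, where TP, TN, FP, and FN are the counts of true positives, true negatives, false positives, and false negatives:

```latex
\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}
\qquad
\text{Precision} = \frac{TP}{TP + FP}

\text{Recall} = \frac{TP}{TP + FN}
\qquad
F_1 = 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}
```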
As a review, accuracy reflects the number of correct predictions divided by the total number of predictions. However, accuracy doesn't always tell the full story. Some data sets feature a strong class imbalance, which occurs when the majority of items belong to only one class. In that case, the data set is considered imbalanced. Here's an example using a binary classification problem. An IT professional wants to use a model to detect malware in the computers at their company. Perhaps there are 5,000 instances in their data set but only 500 positive instances where malware was actually present in a computer. This person would have an imbalanced data set, as the chances of finding malware among all the checks that happen are actually comparatively low. This is where the precision and recall metrics can help. Precision measures what proportion of positive predictions were correct. In other words, if the model predicted that malware was going to be present, how many times was it actually on a computer? Precision is calculated by dividing the number of true positives by the sum of the true and false positives. The recall metric, on the other hand, finds the proportion of actual positives that were identified correctly. In the context of our example, recall indicates how many of the actual malware threats were correctly classified. Recall is calculated by dividing the number of true positives by the sum of the true positives and false negatives. F1 score combines both precision and recall in one metric. Accuracy, precision, recall, and F1 score are top metrics in classification techniques. More specifically, precision, recall, and F1 score are especially useful for measuring performance on unbalanced classes. In any case, data professionals use all four metrics to evaluate categorical supervised learning models. As you've begun to discover, each model performs differently, and some algorithms work better than others. When building any model intended for production, it's essential to improve the results. You might change specific parameters to discover how the performance improves. So, you should always keep in mind that model building is an inherently iterative process. The first model that you produce will almost never be the one that gets deployed. The iterative process provides the information needed to get the model working optimally. After tweaking the parameters or changing how features are engineered in each model, the performance metrics provide a basis for comparing the models to each other and against themselves. Coming up, we'll find out what these metrics reveal about our model, use them to evaluate other models we've built, and examine how to improve model performance. Continuous improvement is a key part of being a data professional, so these exercises are preparing you to keep advancing all kinds of data processes.

Developing a machine learning model is a complex process, but having a solid framework to rely on helps set you up for success, no matter the business need. In this part of the course, the PACE workflow provided support to help you navigate the different stages of addressing an example business scenario. During the plan stage, you assessed the business need to determine what type of model is best suited for predicting bank customer churn. This decision was based on the available data. Next, in the analyze stage, you examined the data using EDA practices and feature engineering techniques. This process revealed more details about the data to help inform your plans for building the model.
From there, you went on to construct the first iteration of your naive Bayes model. You then tested the model with preliminary evaluation metrics to determine its performance against the test data. And finally, in the execute stage, you closely evaluated the model's performance and considered how it could be improved. And that is the PACE workflow for machine learning models. Having this process in your tool belt will allow you to solve many business problems. While the models may very well get more complex as you continue your journey, sticking within the framework will help you achieve the results you need. You've come a long way on your journey into the world of data, and you've been building a strong foundation for data modeling. So far, you've learned about linear regression, logistic regression, and naive Bayes models, all supervised learning techniques that make predictions on labeled data. Most machine learning applications today are based on supervised learning. But when we consider all the available data in the world, the vast majority of it is unlabeled. Photographs, voice recordings, videos, social media posts, these are all examples of unlabeled data. You may be familiar with this concept in the context of data analytics. If you earned your Google Data Analytics career certificate, you learned that any data that's not organized in an easily identifiable manner is known as unstructured. In this program, we'll sometimes refer to it as unlabeled, but the meaning is the same. So how do we make sense of all that unlabeled data? We use unsupervised learning techniques. When our data is unlabeled, these methods make it possible for data professionals to learn about the data's underlying structure and find out how different features relate to each other. Earlier in the course, you explored one very common type of unsupervised learning, recommendation systems. You learned that they're a subclass of machine learning algorithms which offer relevant suggestions to users, such as new songs for your playlist or a new coat for winter. Now, you'll get to know many other exciting methodologies and applications of unsupervised learning. In this section of the course, you'll first learn about K-means, an unsupervised modeling technique. You'll investigate how it clusters data based on each observation's similarity to others in the data. You'll also build a K-means model and learn how to evaluate it using metrics called inertia and silhouette score. I'm thrilled to have you with me as we explore unsupervised learning. There's so much great potential for future development. We've only just begun tapping into the vast amount of unstructured data in the world. Let's start modeling. In this video, we'll introduce you to the K-means algorithm. K-means is an unsupervised partitioning algorithm. It's used to organize unlabeled data into groups or clusters. It does this by creating a logical scheme to make sense of the data. With K-means, each cluster is defined by a central point, or centroid. Its position represents the center of the cluster, the mathematical mean of the points assigned to it, hence the name K-means. There are four steps to building a K-means model. Let's examine them one at a time. In step one, you choose the number of centroids and place them in the data space. K represents the number of centroids in your model, which is how many clusters you'll have. This is a decision that you make. Sometimes you'll have an idea about the number of clusters necessary for a project.
For example, if your company manufactures five different products, you might want to set your K value to five. Other times, you won't know how many clusters your data should be split into, so you try different values for K and determine which provides the best results. Here, it's apparent that our data is grouped into two clusters, one on top and the other on bottom. At step one, we'll randomly initialize two centroids, represented by the blue and red Xs. Step two is to assign each data point to its nearest centroid. The nearest centroid is the one that's closest in space. In this example, the top two observations are assigned to the blue centroid and the bottom two observations are assigned to the red centroid. As a quick refresher, in this context, an observation is simply any data point being observed. Step three is to recalculate the centroid of each cluster. Again, the centroid's location is calculated by taking the mean of all of the points in its cluster. Note that the centroids move to the midpoint of their clusters. This happens each iteration until the algorithm reaches convergence, the stable point found at the end of a sequence of solutions. Step four is to repeat steps two and three until the algorithm converges. In this case, we have very little data, so the model is simple and has already converged. If you had a lot more data, the centroids would keep shifting toward the centers of their clusters over successive iterations. You also might find that the cluster assignment of each data point changes as the centroid locations move with each iteration. Something to be mindful of is that it's important to run the model with different starting positions for the centroids. This helps avoid poor clustering caused by local minima: stable solutions that nevertheless fail to separate the clusters in a way that reflects the data. Let's explore this concept using our example. What if these had been the initial positions of our centroids? Notice what happens. In step two, we assign the points to their nearest centroid. With these particular starting positions, the two observations on the left are assigned to the red cluster, and the two observations on the right are assigned to the blue cluster. For step three, we recalculate the position of each cluster centroid. The model has converged, but the clusters aren't what we'd expect. We know that the most intuitive clustering would be for the top two observations to be in one cluster and the bottom two in another. But that's not what happened, and further iterations will not change this. This is why it's important to run the model with different centroid initializations, to avoid poor clustering due to the model converging in local minima. Note that this clustering isn't wrong. It's still a valid solution of the model. It just doesn't make much sense in this context. After all, the goal is to find a clustering scheme that makes sense of your data. Finally, note that even though K-means is a partitioning algorithm, data professionals typically talk about it as a clustering algorithm. The difference is that outlying points in clustering algorithms can exist outside of the clusters. However, for partitioning algorithms, all points must be assigned to a cluster. In other words, K-means does not allow unassigned outliers. So to recap, K-means is an unsupervised learning technique that groups unlabeled data into K clusters based on similarity. The clustering process has four steps that repeat until the model converges. The value for K is a decision that the modeler makes.
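To ground that recap, here's a bare-bones NumPy sketch of the four steps, assuming a small 2-D array X of observations. The function name and details are illustrative, not the course's code, and the sketch ignores edge cases such as a centroid ending up with an empty cluster:

```python
import numpy as np

def k_means(X, k, n_iter=100, seed=0):
    """Minimal k-means: returns final centroids and cluster labels."""
    rng = np.random.default_rng(seed)
    # Step 1: choose k observations at random as the initial centroids.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Step 2: assign each point to its nearest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 3: move each centroid to the mean of the points assigned to it.
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # Step 4: repeat steps 2 and 3 until the centroids stop moving.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids, labels

# A single run can converge on a local minimum, so in practice you'd rerun
# with several seeds and keep the best result; scikit-learn's n_init
# parameter automates exactly this.
```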
And finally, it's important to build multiple models to avoid poor clustering. Later in this section, you'll learn how to determine the best value for K. You'll also discover some of the limitations of K-means models. Lots more coming up. At this point, you're familiar with the basic intuition behind the K-means algorithm. In this video, we're going to demonstrate how to apply your knowledge of K-means to an actual example. We'll use the K-means algorithm to compress colors in a photographic image. This demonstration is intended to lead you through an application of the K-means theory to give you a deeper understanding of how it works. So for this video, focus less on the mechanics of the code itself and more on the results. Let's get started. We're going to use a photograph of some tulips as our data, which we'll read in as an array using Matplotlib's imread function and display using its imshow function. This is the image that we're going to manipulate using K-means. When we check the shape of the image in pixels, we're told that it's 320 by 240 by three. We can interpret these numbers as pixel information. Each dot on the screen is a pixel. This photograph has 320 vertical pixels and 240 horizontal pixels. But what is the dimension of three? This dimension refers to the values that encode the color of each pixel. Each pixel has three parameters: red or R, green or G, and blue or B. Together, these values are known as RGB values. The value for each color, R, G, and B, can range from zero to 255. This means that there are 256 cubed, or more than 16 million, different possible combinations of RGB, each resulting in a unique color. To prepare this data for modeling, we'll reshape it into an array where each row represents a single pixel's RGB color values. Now we have an array that is 76,800 by three. Each row is a single pixel's color values. Let's create a pandas DataFrame to help us understand and visualize this data. Each row of the DataFrame represents a single pixel, and the three columns are its R, G, and B values. Because we have only three columns, we can visualize this data in three-dimensional space. This graph plots each of the photograph's pixels in a 3D coordinate space. Each axis ranges from zero to 255, just like each value in RGB, and each dot in the graph is the color specified by its RGB values, just like in the original photograph. The more intense the color, the more dots are concentrated in that area. The most represented colors here are the most abundant colors in the photograph, mostly reds, greens, and yellows. We can examine this graph from different angles and even zoom in and out. We can also train a K-means model on this data. The algorithm would create K clusters by minimizing the squared distances from each point to its nearest centroid. Here's an experiment. What do you expect to happen if we built a K-means model with just a single centroid? In other words, with K equal to one. Let's find out. We'll first instantiate the model. As a refresher, instantiation involves creating an object from the class, which gives it access to all of the class's variables and methods. So, let's set the number of clusters to one and fit it to our data. Now we're going to copy the original image, replace each of its rows with the values of its closest cluster center, and reshape the image so we can display it. The image we get back doesn't resemble tulips at all. So what happened? Well, let's run through the K-means steps.
First, the algorithm randomly placed a centroid in the color space, where each of the three axes runs from 0 to 255. Then it assigned each point to its nearest centroid. Because there was only one centroid, all points were assigned to it and therefore to the same cluster. Next, the algorithm updated the centroid's location to the mean location of all of its points. Again, there's only a single centroid, so it updated to the mean location of every point in the image. Usually these steps would repeat until the model converges, but in this case, it took only one iteration. We updated each pixel's RGB values to be the same as those of our centroid. The result is the image of our tulips where every pixel is replaced with the average color. We can verify this for ourselves by manually calculating the average for each column in the array. This will return the mean R value, G value, and B value. Let's compare this to what the K-means model calculated as the final location of its one centroid. We'll do this by using the cluster_centers_ attribute of the fit K-means model object. They're the same. Now let's return to the 3D rendering of our color space. Only this time, we'll add the centroid. The centroid is a large circle in the middle of the color space. Notice that this is the center of gravity, so to speak, of all the points in the graph. Okay, now let's refit another K-means model to the data, only this time using K equals three. Take a moment now to consider what you might expect to result from this. Go through the steps of what the model is doing like we did above. What colors are you likely to see? So we refit the model, setting our number of clusters to three, and we get three centroid locations, which are the RGB values we can use to display the colors of each centroid. We'll use a helper function to display our color swatches, and there they are. You might have hypothesized that the three-cluster model would produce similar colors, and that's correct. The photo's dominant colors of red, green, and yellow are present here. Again, we can replace each pixel in the original image with the RGB values of the centroid to which it was assigned by the new K-means model. This is a function that will display the photo for any value of K that we choose. We'll call this function with three as its argument. We now have a photo with just three colors, the same three colors from the swatches above. Each color's RGB values correspond to the location of its nearest centroid. We can return once more to our 3D coordinate space. This time, we'll recolor each dot to correspond with the color of its centroid. This will allow us to see how the K-means algorithm clustered our data spatially. Check it out. Each pixel is now colored according to the RGB value of its centroid, and the clusters are at the vertices of the space. This whole process can be applied for any value of K. Here's the output of each photo for K equals two through 10. Notice that it becomes increasingly difficult to see the difference between the images each time a color is added. This is a visual example of something that happens with all clustering models, even if the data is not an image that you can see. As you group the data into more and more clusters, additional clusters beyond a certain point contribute less and less to your understanding of your data. This demonstration has deepened your understanding of how the K-means algorithm works. Soon, we'll explore methods for numerically determining which K value is best for particular data.
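For reference, the whole compression pipeline can be sketched in a few lines. The file name is hypothetical, and this is a simplified version of the idea rather than the notebook's exact code:

```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

# Read the image as a (height, width, 3) array of RGB values.
img = plt.imread("tulips.jpg")          # hypothetical file name
h, w, _ = img.shape
pixels = img.reshape(-1, 3)             # one row per pixel: (h*w, 3)

# Cluster the pixel colors, then recolor each pixel with its centroid's RGB.
kmeans = KMeans(n_clusters=3, random_state=42).fit(pixels)
compressed = kmeans.cluster_centers_[kmeans.labels_].astype(img.dtype)
plt.imshow(compressed.reshape(h, w, 3))
plt.show()
```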
As always, feel free to explore the notebook more on your own to keep building your skill set. You now have some familiarity with the intuition behind the K-means methodology. In some of the examples we presented, we plotted points in two-dimensional space. In those cases, it was clear if the model was correctly assigning points to clusters. We also studied an example where we were able to visualize the data in three dimensions. Unfortunately, most cases data professionals encounter on the job are not so easy. Your data will have many more than three dimensions, so you won't be able to visualize how each observation relates to those around it. You might not even know how many clusters there should be. So how do you decide the value for K? And once you do, how do you know if your model is working as intended? In linear and logistic regression, you used metrics such as R-squared, mean squared error, area under the ROC curve, precision, and recall to evaluate the effectiveness of your model. But in unsupervised learning, you don't have any labeled data to compare your model against, so those metrics aren't applicable. In fact, your model isn't predicting anything. Instead, it's grouping observations based on their similarities. It's up to you to investigate and understand the different clusters. Remember, if you have some domain knowledge or the problem you're trying to solve has its own constraints, use these things to your advantage. For example, maybe you're investigating customer segmentation for a service that offers four different subscription levels. Then you'd probably want to use four as your value for K. But if you have no way of knowing in advance what value to use for K, don't worry. There are other ways to figure it out. Before we get started with evaluation metrics, consider what makes for a good clustering model. Basically, you want clearly identifiable clusters. This means that within each cluster, or intra-cluster, the points are close to each other. It also means that between the clusters themselves, or inter-cluster, you want lots of empty space. One way to evaluate the intra-cluster space in a K-means model is to identify its inertia. This is a different concept from inertia as it's defined in physics. Here, inertia is defined as the sum of the squared distances between each observation and its nearest centroid. Essentially, this is a measurement of how closely related observations are to other observations within the same cluster. That information is then aggregated across all the clusters to produce a single score. Another important metric for evaluating your K-means model is the silhouette score. This is a more precise evaluation metric than inertia because it also takes into account the separation between clusters. Silhouette score is defined as the mean of the silhouette coefficients of all the observations in the model. We'll cover this in more depth later. For now, just know that the silhouette score helps evaluate your model, provides insight as to what the optimal value for K should be, and uses both intra-cluster and inter-cluster measurements in its calculation. Nice work. You now have a better understanding of what makes a good clustering model. You also understand more about some of the metrics to determine your model's effectiveness. Coming up, we'll further explore these metrics and learn how data professionals put them to use on the job.
Previously, you were introduced to inertia and silhouette scores as metrics to evaluate K-means models. These are indispensable tools for data professionals who work with K-means models and any other model in the clustering family. Now let's expand on these concepts. Consider, again, what makes a good clustering model. Ideally, you'd have tight clusters of closely related observations, and each cluster would be well separated from the others. Remember that inertia is the sum of the squared distances between each observation and its closest centroid. It's a measurement of intra-cluster distance, so it gauges how closely related each observation is to the other observations in its own cluster. Inertia can be represented by the first formula shown after this paragraph, where n is the total number of observations in the data and C_k is the centroid of the cluster that observation x_i belongs to. The more compact the clusters, the lower the inertia, because there's less distance between each observation and its nearest centroid. Therefore, you generally want inertia to be as close to zero as possible. Can inertia ever be zero? Well, it's possible, but this scenario wouldn't offer any new insight into the data. Here's why. In one case, if all observations were identical, this would mean all data points are in the same location. Then inertia equals zero for all values of k. The second case is when the number of clusters is equal to the number of observations: if each observation is in its own cluster, then its centroid is itself. Inertia is a great metric because it helps us to decide on the optimal k value. We do this by using the elbow method. In the elbow method, we first build models with different values of k. Then, we plot the inertia for each k value. Here's an example. Notice that the greater the value for k, the lower the inertia. So should you always select high k values? Well, no. A low inertia is great, but if it results in meaningless or inexplicable clusters, it doesn't help you at all. A good way of choosing an optimal k value is to find the elbow of the curve. This is the value of k at which the decrease in inertia starts to level off. In this example, that occurs when we use three clusters. Sometimes, it might be difficult to choose between two consecutive values of k. In that case, it's up to you to determine which is best for your particular project. The second important metric for evaluating your k-means model is the silhouette score. This is a more precise evaluation metric than inertia because it takes into account the separation between the clusters. Silhouette score is defined as the mean of the silhouette coefficients of all the observations in the model. Each observation has its own silhouette coefficient, which is calculated as b minus a divided by whichever value is greater, a or b (the second formula below), where a is the mean distance from that observation to all other observations in the same cluster, and b is the mean distance from that observation to each observation in the next closest cluster. The silhouette coefficient can be anywhere between negative one and one. Consider this schematic. If an observation has a silhouette coefficient close to one, it means that it's both nicely within its own cluster and well separated from other clusters. A value of zero indicates that the observation is on the boundary between clusters. If your observation has a silhouette coefficient close to negative one, it may be in the wrong cluster.
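Written out, the two formulas narrated above look like this, using k(i) to denote the cluster that observation x_i is assigned to, and a_i and b_i for the mean intra-cluster and nearest-cluster distances of observation i:

```latex
\text{inertia} = \sum_{i=1}^{n} \left\lVert x_i - C_{k(i)} \right\rVert^2
\qquad\qquad
s_i = \frac{b_i - a_i}{\max(a_i,\, b_i)}
```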
So as you've just experienced, when using silhouette score to help determine how many clusters your model should have, you'll generally want to opt for the k value that maximizes your silhouette score. Inertia and silhouette score are important metrics to help you determine the most appropriate number of clusters for your k-means model. Now that you're familiar with how they're derived, you'll be well prepared to continue working with them. Previously, you learned about inertia and silhouette scores. You understand that they're metrics used to help decide an effective value for k in a k-means model. And because k-means is an unsupervised learning model, it's used to find structure and relationships within data, but there are no right answers. Therefore, data professionals who use these models have to rely on these metrics to help them determine whether their model is identifying characteristics of their data that are useful for their needs. In this video, we're going to build a k-means model and evaluate it using inertia and silhouette score. We'll go over which packages to import, how to scale data, how to instantiate and fit the model, and, of course, how to use the labels_ and inertia_ attributes and the silhouette_score function to determine a final value for k. Once again, we'll return to Jupyter Notebooks as the platform where we'll build our models. In a new notebook, the first step, as always, is to begin with import statements. This will create the computing environment with the necessary packages and tools for your project. In this case, we'll import NumPy and pandas as our operational packages. We'll also import the following task-specific items from scikit-learn: KMeans, silhouette_score, and StandardScaler. Note the syntax, as each item is imported from a different module: KMeans comes from sklearn.cluster, silhouette_score comes from sklearn.metrics, and StandardScaler comes from sklearn.preprocessing. The function make_blobs is something we'll use just for this demonstration to help us create synthetic data. We'll use Seaborn for graphing. In practice, you'd have a real dataset, and you'd read in this data and perform EDA, data cleaning, and other manipulations to prepare it for modeling. For simplicity, and to help us focus on modeling and analysis, we're going to use synthetic data for this demonstration. We'll start by creating a random number generator. This is to help create reproducible synthetic data. We'll use it to generate clustered data. For now, we won't know how many clusters there are. By calling the random number generator and assigning the result to a variable, we can avoid viewing the true number of clusters our data has. This will let us use inertia and silhouette coefficients to determine it. The next step uses make_blobs and our random number generator to create data that has an unknown number of clusters, at least to us. These steps return a NumPy array, but it's usually helpful to view your data as a pandas DataFrame. This is often how your data will be organized when modeling on the job. So we'll convert our data to a pandas DataFrame. With six columns, we find that our data has six features. This is too many dimensions for us to visualize in 2D or 3D space. We can't observe how many clusters there are, so we'll need to use our detective skills to figure it out. Because K-means uses distance between observations as its measure of similarity, it's important to scale our numerical data before modeling if it's not already scaled.
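Up to this point, the setup might be sketched like this. The blob parameters and variable names are illustrative, not the notebook's exact values:

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_blobs

# A random draw hides the true number of clusters from us.
rng = np.random.default_rng(seed=44)
centers = rng.integers(low=3, high=7)

# Synthetic data: 1,000 observations, 6 features, an unknown number of clusters.
X, _ = make_blobs(n_samples=1000, n_features=6, centers=centers, random_state=42)
df = pd.DataFrame(X)
```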
For this, we'll use scikit-learn's StandardScaler. StandardScaler scales each value x_i by subtracting that feature's mean and dividing by the feature's standard deviation, so that every variable has a mean of zero and a standard deviation of one. There are a number of scaling techniques available in scikit-learn's preprocessing module, including StandardScaler, MinMaxScaler, Normalizer, and others. There's no firm rule for determining which method will work best, but with K-means models, using any scaler will almost always lead to better results than not scaling at all. We can instantiate StandardScaler and transform our data in a single step by using the fit_transform method and passing our data to it as an argument. Here's a tip: if your computer has enough memory, it's helpful to keep an unscaled copy of your data to use later, so we'll assign the scaled data to a new variable called X_scaled. Now that the data is scaled, we can start modeling. Because we don't know how many clusters exist in the data, we'll begin by examining the inertia values for different values of K. Let's start with K equals three, an arbitrary number in this case. One thing to note is that by default, scikit-learn implements an optimized version of the K-means algorithm called k-means++. This helps ensure optimal model convergence by initializing centroids far away from each other. Because we're using k-means++, we won't manually rerun the model with multiple random initializations. Now, let's instantiate the model. Because we want to build a model that puts our data into three clusters, we set the n_clusters parameter to three. We'll also set the random_state to an arbitrary number. This is only so others can reproduce our results. If we left this value blank, it's possible others could replicate our code exactly and still get different results due to the random initial placement of centroids. The next step is to fit the model to the data. We do this by using the fit method and passing in our scaled data. This returns a model object that has learned your data. You can now call its different attributes to view the inertia, locations of centroids, and class labels, among others. We can get the cluster assignments by using the labels_ attribute. Similarly, find the inertia by using the inertia_ attribute. Let's find out what happens when we check the cluster assignments and inertia for this model. The labels_ attribute returns a list of values that is the same length as the training data. Each value corresponds to the number of the cluster to which that point is assigned. Because our k-means model clustered the data into three clusters, the value assigned to each observation will be zero, one, or two. The inertia_ attribute returns the sum of the squared distances from each sample to its closest cluster center. This inertia value isn't helpful by itself. We need to compare the inertias of multiple k-values. To do this, create a function called kmeans_inertia that fits a k-means model for multiple values of k, in our case two through 10. The function calculates the inertia for each value, appends it to a list, and returns that list. Then, we plot it. The x-axis is the number of clusters, and the y-axis is the inertia. This plot contains an unambiguous elbow at five clusters. Models with more than five clusters don't appear to reduce inertia at all. Right now, it seems like a five-cluster model might be optimal. But let's check the silhouette scores.
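A sketch of the scaling, the elbow analysis, and the silhouette comparison we're about to run could look like this, assuming the df DataFrame from the earlier sketch. The helper names are illustrative, and in practice you'd plot the two curves on separate figures:

```python
import seaborn as sns
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

# Scale the features so distance-based clustering treats them all equally.
X_scaled = StandardScaler().fit_transform(df)

def kmeans_inertia(data, k_values):
    """Fit a KMeans model for each k and collect the inertia values."""
    return [KMeans(n_clusters=k, random_state=42).fit(data).inertia_
            for k in k_values]

def kmeans_silhouette(data, k_values):
    """Fit a KMeans model for each k and collect the silhouette scores."""
    scores = []
    for k in k_values:
        labels = KMeans(n_clusters=k, random_state=42).fit_predict(data)
        scores.append(silhouette_score(data, labels))
    return scores

k_values = list(range(2, 11))
# Elbow plot: look for the k where the drop in inertia levels off.
sns.lineplot(x=k_values, y=kmeans_inertia(X_scaled, k_values), marker="o")
# Silhouette plot: look for the k that maximizes the score.
sns.lineplot(x=k_values, y=kmeans_silhouette(X_scaled, k_values), marker="o")
```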
Hopefully the results will corroborate our findings. To get a silhouette score, we call the silhouette_score function and pass to it two required parameters: the training data and its assigned labels. Let's check this out for the three-cluster model we created earlier. It worked! However, this value isn't very useful if we have nothing to compare it to. Just as we did for inertia, we'll write a function that compares the silhouette score for each value of k from two through 10. Now, plot these silhouette scores. This plot indicates that the silhouette score is closest to one when our data is partitioned into five clusters. It confirms the inertia analysis. In this case, because we used synthetic data, we can review how many clusters actually existed in our data. This is the variable created by the random number generator at the beginning of the video. We called it centers. We were right! We were able to use inertia and silhouette score to correctly deduce that our data has five clusters. At this point, we'll want to do some further analysis to determine whether we can understand our clusters and whether they're appropriate for our use case. We'll instantiate a new k-means model with n_clusters=5 and fit it to our scaled data. Okay, now we can confirm that there are five unique labels, ranging from zero through four. So, we can use them to create a new column in the unscaled DataFrame. Next, we could perform analysis on the different clusters to identify what makes them different from one another. This would not have been possible with the scaled data, because the scaled values aren't in meaningful units. Note that in many cases, it's not always clear what differentiates one cluster from another, and it can take a fair bit of effort to determine whether it makes sense to cluster your data a given way. This is where domain knowledge and expertise through practice are very valuable. Congratulations! You've reached the end of another section of the course. Along the way, you've discovered that unsupervised learning is a vast field with many different applications. First, you learned about k-means models for deriving structure from your data. You were introduced to the concept of clustering similar groups of data around centroids and building a k-means model. We explored the importance of running a k-means model multiple times with different values for k. We also explained the issues with local minima and how to build models with different centroid initializations to make sure you're getting the most accurate results. You're now much more familiar with inertia and silhouette score, methods for choosing the best number of clusters and evaluating the effectiveness of your model. And you can identify the elbow of an inertia curve and use it to help determine an optimal k-value. There's a whole world of unsupervised learning models and methodologies out there. This is just the beginning, but now you're empowered with key tools for navigating the landscape and developing your talents as a data professional. Hello, and welcome back. You've come so far building models of your own using the tools and skills you've been learning throughout this program. Now, in this final part of the course, you'll revisit supervised machine learning by investigating some more advanced classification techniques. These advancements are very exciting for data professionals because they enable us to overcome some typical modeling limitations. One such method is tree-based learning.
Tree-based learning is a type of supervised machine learning that performs classification and regression tasks. It uses a decision tree as a predictive model to go from observations about an item, represented by the branches, to conclusions about the item's target value, represented by the leaves. Soon, you'll learn how single decision trees provide a foundation for more advanced approaches to all kinds of data work. Then, you'll move on to ensemble learning techniques, which enable you to use multiple decision trees simultaneously in order to produce very powerful models. In addition to learning how these new models work and their use cases, you'll be introduced to hyperparameter tuning. Knowing how and when to adjust or tune a model can help a data professional significantly increase performance. Together, we're going to build models that can be very impactful for tons of different business applications. What you're about to learn can really make you stand out to employers in the industry. Let's get started. In the world of supervised learning, there are tons of techniques that can help you make predictions. One popular tool for classification and prediction is the decision tree. It serves as a foundation for some of the most effective models used in industry today. A decision tree is a flowchart-like supervised classification model and a representation of the various solutions available to solve a given problem, based on the possible outcomes of related choices. Like all supervised learning classification techniques, decision trees enable data professionals to make predictions about future events based on the information that is currently available. They also have some very specific advantages in certain areas over other supervised learning models. Decision trees require no assumptions regarding the distribution of the underlying data. Unlike the models we've covered previously, they can handle collinearity easily. Additionally, preparing data to train a decision tree can be a much less complex process, requiring little pre-processing, if any at all. However, decision trees are not perfect. No model is. Decision trees can be particularly susceptible to overfitting. The model might get extremely good at predicting seen data, but as soon as new data is introduced, it may not work nearly as well. This is something that you'll need to keep in mind while building these types of models. A decision tree consists of nodes and edges. The edges connect the nodes, essentially directing from one node to the next along the tree. Decisions are made at each node. At each one, a single feature of the data is considered and decided on. By the end, any relevant features will have been resolved, resulting in the classification prediction. Let's explore this a little further. Here's a decision tree that will help you decide whether or not to go outside and play soccer or football on any given day. The first decision that will be made relates to the weather outlook. For this tree, there are three options: sunny, cloudy, or rainy. This node, where the first decision is made, is called the root node. It's the first node in the tree, and all decisions needed to make the prediction will stem from it. It's a special type of decision node because it has no predecessors. The nodes where a decision is made are decision nodes. Decision nodes always point to a leaf node or other decision nodes within the tree.
So for our example, if it's supposed to be sunny or rainy, the tree will continue making more decisions to arrive at a final prediction. However, if it's cloudy, the tree arrives at a prediction: soccer will be played. This brings us to a leaf node. Leaf nodes are where a final prediction is made. The whole process ends here, so no further decisions are required after this point. Now, view where the decision tree would have gone if the outlook had been sunny. We're not at a leaf node yet. There are still decisions to be made. This time, the consideration is about the humidity. If the humidity is above 75%, the tree ends at a leaf node that says don't play soccer. However, if the humidity is below 75%, the decision tree will say play soccer. The nodes that are pointed to, whether leaf nodes or other decision nodes, are child nodes. The node that is pointing to them is a parent node. The algorithm decides which variables to split and where to split them based on what will provide the most predictive power. So for example, if 90% of the time that it rains, soccer is not played, this variable would be very predictive. Splitting the data on outlook would give new groups, each of which has a clear majority of either play or don't play. Now you know the basics of decision trees. This foundation will be helpful as you continue learning about tree-based modeling. Coming up, you'll cover the aspects of building the tree and using your training data to develop the nodes and edges. Then, you'll learn how to optimize tree-based models and what you, as a data professional, can do to maximize their capabilities. Now that you've developed a solid understanding of classification models, it's time to build a model that's a bit more advanced. In this video, you'll examine how to create a single standard decision tree in Python. We're going to use a decision tree to approach the same business need as with the naive Bayes model from earlier: modeling customer bank churn. The first thing, as always, is to import any necessary packages and libraries into our notebook. You've experienced most of these before, but the packages for the decision tree itself are new. Let's import DecisionTreeClassifier, the scikit-learn implementation of a single decision tree. Additionally, import the plot_tree function to produce a visual of the decision tree after it's built. We also import the confusion_matrix and ConfusionMatrixDisplay functions to help us calculate and plot a confusion matrix for our model. And lastly, we have our four evaluation metrics. We're going to read in the original dataset as a pandas DataFrame as usual. Remember, this is where you'd normally do exploratory data analysis, or EDA. Then you would use what you learned from EDA and what you know about the use case of your model to decide on an appropriate evaluation metric. For our bank churn models, we're going to assume that a metric that balances precision and recall is best. The metric that helps us achieve this balance is F1 score, which is defined as the harmonic mean of precision and recall: twice their product divided by their sum. Again, there are many metrics to choose from. The important thing is that you make an informed decision that is based on your use case. Now that we've decided on an evaluation metric, let's prepare the data for modeling. Just as before, we'll drop the unpredictive features and the gender column, so our model doesn't predict based on gender. Then we'll dummy encode the geography column, creating Boolean columns from the categorical column.
Our last preparation is to separate our target variable from the rest of the data and then split the data into training and test sets using the train_test_split function. Don't forget to stratify based on the target. The first thing we'll do is train a baseline decision tree model. We won't tune it; it's just to give us scores that we can use as points of reference. We do this by instantiating the classifier and setting the random state. We'll assign it to a variable called decision_tree. Next, we'll fit it to the training data. This grows a decision tree on our data. It all happens behind the scenes. Finally, we'll use the predict method to use the tree we just grew to make predictions on the X_test data, assigning the results to a variable called dt_pred. Now we can get the results by using the different evaluation metric functions we imported. This model's F1 score is better than what we got from the naive Bayes model we built. Let's inspect the confusion matrix of our decision tree's predictions. First, we'll write a short helper function to help us display the matrix. Notice from this confusion matrix that the model correctly predicts many true negatives. This is to be expected because the data set is imbalanced in favor of negatives. When the model makes an error, it appears slightly more likely to predict a false positive than a false negative, but it's generally balanced. This is reflected in the precision and recall scores being very close to each other. Next, let's examine the splits of the tree. We'll do this by using the plot_tree function that we imported. We pass to it our fit model as well as some additional parameters. Note that if we did not set max_depth=2, the function would return a plot of the entire tree, all the way down to the leaf nodes, but we are most interested in the splits nearest to the root because these tell us the most predictive features. The class_names parameter displays each node's majority class, and filled=True colors the nodes according to their majority class. How do we read this plot? The first line of information in each node is the feature and split point that the model identified as being the most predictive. In other words, this is the question that's being asked at that split. For our root node, the question was: is the customer's age less than or equal to 42.5 years? At each node, if the answer to the question it asks is yes, the sample moves to the child node on the left. If the answer is no, the sample goes to the child node on the right. Gini refers to the node's Gini impurity. This is a way of measuring how pure a node is. The value can range from zero to 0.5. A Gini score of zero means there's no impurity: the node is a leaf and all of its samples are of a single class. A score of 0.5 means the classes are all equally represented in that node. Samples is how many samples are in that node, and value indicates how many of each class are in the node. Returning to the root node, we have value equals 5,972 and 1,528. Notice that these numbers sum to 7,500, which is the number of samples in the node. This tells us that 5,972 customers in this node stayed and 1,528 customers churned. Lastly, we have class. This tells us the majority class of the samples in each node. If we look at the top of the tree, this plot tells us that if we could only do a single split on a single variable, the one that would most help us predict whether a customer will churn is their age.
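Before reading further into the plot, here's a hedged sketch of the baseline workflow just described. The churn_df DataFrame and the Exited target column are assumed names for the prepared bank-churn data, not necessarily the notebook's exact identifiers:

```python
import matplotlib.pyplot as plt
from sklearn.tree import DecisionTreeClassifier, plot_tree
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score, confusion_matrix, ConfusionMatrixDisplay

# Separate the target and split the data, stratifying on the target.
y = churn_df["Exited"]
X = churn_df.drop("Exited", axis=1)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42)

# Baseline model: no tuning, just reference scores.
decision_tree = DecisionTreeClassifier(random_state=42)
decision_tree.fit(X_train, y_train)
dt_pred = decision_tree.predict(X_test)
print("F1:", f1_score(y_test, dt_pred))

# Confusion matrix of the baseline predictions.
cm = confusion_matrix(y_test, dt_pred)
ConfusionMatrixDisplay(confusion_matrix=cm).plot()

# Plot only the top of the tree: the splits nearest the root are the most predictive.
plt.figure(figsize=(15, 12))
plot_tree(decision_tree, max_depth=2, feature_names=X.columns,
          class_names=["stayed", "churned"], filled=True)
plt.show()
```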
If we look at the nodes at depth one, we notice that the number of products and whether or not the customer is an active member are both also strong predictors of whether they will churn. This is a good indication that it might be worthwhile to return to your EDA and examine these features more closely. Now that you have a basic understanding of how tree-based modeling works in Python, you have two more techniques to learn before moving on to some more powerful optimization techniques: hyperparameter tuning and cross-validation. Using these, we can optimize single decision trees even further, and you'll use these concepts to supercharge the models you'll learn later on. Meet you again soon. Recently, you've been exploring how to build a decision tree classifier model. For many of the models you've worked with throughout this course, you used evaluation metrics such as F1 score to gauge their performance. But throughout this section of the course, you'll be taking an extra step to gain some additional performance increases from your models. A very popular and widely used technique to improve performance after creation is known as hyperparameter tuning. Hyperparameters are parameters that can be set before the model is trained. They can be tuned to improve model performance, directly affecting how the model is fit to the data. Hyperparameter tuning is the process of adjusting these parameters to find the values that will result in the most optimal model. Just like a musician tuning the strings on their guitar, the idea is to achieve balance and a beautiful result. For tree-based modeling, there are many hyperparameters that can be tuned, and they can have a big impact on the model itself. You've actually already used one previously in this course when you worked with K-means. As you'll recall, when building a K-means model, you set the value of K to produce different cluster results. When you changed its value, you performed hyperparameter tuning. One of the more basic hyperparameters for a decision tree is called max_depth. Setting this hyperparameter defines a limit on how deep a decision tree can grow. The depth of a decision tree is the number of levels between the root node and the farthest node from the root node, with the root node itself being level zero. Consider our previous example. This tree has three levels. The root node is level zero, the nodes in the middle are level one, and the leaf nodes all the way at the bottom are level two. So this tree has a depth of two. However, this decision tree has a depth of four. Even though this decision tree isn't as filled out as the previous example, what matters is the distance of the farthest node from the root, and whether it's a leaf node or a decision node. This leads us back to max_depth. When working with very large data sets, you could potentially create massive trees that are very deep. But this isn't necessarily what you want for your model. So setting a value for max_depth can help reduce overfitting problems by limiting how deep the tree will go. Additionally, it can reduce the computational complexity of training and using the model in the first place. For example, if you're finding that a decision tree has the same performance with a depth of 10 versus a depth of 100, you can set max_depth to 10 and achieve the desired performance more quickly. Another very commonly used hyperparameter is called min_samples_leaf. This hyperparameter defines the minimum number of samples that must be contained in a leaf node.
It means that a split will only happen if there are enough samples in each of the resulting nodes to satisfy the required value. For example, maybe part of the way down your tree, there's a decision node that currently has 10 samples, but the min_samples_leaf hyperparameter is set to six. There would be no way to split the data so that each resulting leaf node has at least six samples, and therefore no further split can take place. There are other hyperparameters for decision trees that you'll learn, but first let's explore finding the optimal values for the parameters. Here's where something called grid search is useful. Grid search is a tool to confirm that a model achieves its intended purpose by systematically checking every combination of hyperparameters to identify which set produces the best results based on the selected metric. So at the end, you'll have values that produce optimal results for your model. When performing a grid search, the first step is to specify which hyperparameters you want to tune and the set of values that you want to search over. For example, maybe we want to tune max_depth and min_samples_leaf. We would define potential values for each of these. For max_depth, we could check depths of four, eight, 12, 20, and 30. For min_samples_leaf, we could try 10, 50, and 100. The algorithm will check every combination of values to see which pair has the best evaluation metrics. It would first check a max_depth of four with a min_samples_leaf of 10, then 50, then 100. The algorithm would then check a max_depth of eight with a min_samples_leaf of 10, then 50, then 100. This continues until every combination has been analyzed. Remember, you can try any values, and any number of values, during grid search if you believe the benefits are worth the cost in computing time. Coming up, we'll put into practice many of the tree-based modeling concepts you've learned so far, all the way from constructing the tree to optimizing it and using it to make some classification predictions. Looking forward to it. You now know about many of the tools involved with building some pretty powerful models. Now in this video, you'll explore model validation. Model validation is the set of processes and activities intended to verify that models are performing as expected. This is achieved with a validation data set, which is a sample of data that's held back during training. The validation data set is instead used to give an unbiased estimate of the skill of the final tuned model. Note that validation data is different from test data and must remain unseen until the very end of the process. In some of the models we've built so far, we haven't been doing this, but that's okay. We were just using those models to understand the process of model building and evaluation. Previously, before training the model, we took our data and split it into two sets: one training set and one testing set. These sets were used to train and test the model, respectively. With validation, the data is actually split into three sets. The first two are training and testing sets as before, but now there's an additional validation set. This validation set is used instead of the test set to evaluate the model, leaving the test set untouched. In addition, another popular method is cross-validation. Cross-validation is a process that uses different portions of the data to test and train a model across several iterations. It works like validation, but with a slight twist.
Instead of having one validation set to evaluate the model, the training data is split into multiple sections, known as folds. Then the model is trained on different combinations of these folds. For example, perhaps we want five folds. First, the data would be split into the training and test data. Then the training data would be split into the five folds. The first model iteration will train with folds one, two, three, and four, using the fifth fold to get metrics for the model. The next will train with folds one, two, three, and five, using the fourth fold to get metrics. This process repeats until every fold has served once as the validation fold, and the evaluation metrics are averaged to get final validation scores. Which validation technique you choose mainly depends on the data set you're working with. Cross-validation is particularly useful when working with smaller data sets, as it maximizes the utility of the available data, more so than standard validation. However, cross-validation is not necessary when working with very large data sets. There's so much data that maximizing its utility is not required, and cross-validation can actually be problematic depending on the computing resources at your disposal. However, if limited computing resources or constraints in the data are not issues, then cross-validation is almost always applied. Validation schemes are essential to building and selecting effective models. Data professionals working on these types of business projects are responsible for determining the best scheme to use. This comes with experience, along with an understanding of the data and the available tools. You're well on your way to developing these important skills. Hello and welcome back. In this video, we'll be taking the foundation we built when creating a decision tree classifier in Python and expanding on it to fine-tune our models. If you recall, hyperparameter tuning involves changing parameters that directly affect how the model trains before the learning process begins. Different models have different types of hyperparameters that are available for you to adjust. You've learned about two that apply to tree-based models: max_depth and min_samples_leaf. As a reminder, max_depth defines how deep a decision tree can grow, and min_samples_leaf defines the minimum number of samples for a leaf node. These are the hyperparameters that will be tuned on a single decision tree. When originally exploring hyperparameter tuning, we also considered the steps for finding the optimal values for hyperparameters. Randomly entering values won't produce the best results, which is why we use grid search. As a refresher, grid search specifies a series of values for each hyperparameter to be tuned. It systematically checks every combination of those values to determine which set produces the best results based on the selected metric. Think of it as brute-forcing the different hyperparameter values. Imagine forgetting your PIN and trying every single number between 0000 and 9999. Sure, it would take time, but eventually you'd find it. This is how grid search works. Okay, now let's get into the code. You'll work within the same framework as the other classification models you've created. We'll begin where we left off in the decision tree notebook. Remember, we've already performed feature engineering, and the data has already been split into X and y data, as well as training and test sets. But now, we're going to add a new function.
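Before we get to that new function, here's what plain five-fold cross-validation looks like on its own, as a hedged sketch that assumes the X_train and y_train variables from the earlier split:

```python
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Five-fold cross-validation: each fold takes one turn as the validation set,
# and the five F1 scores are averaged into a single validation score.
tree = DecisionTreeClassifier(random_state=42)
fold_scores = cross_val_score(tree, X_train, y_train, cv=5, scoring="f1")
print(fold_scores, fold_scores.mean())
```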
GridSearchCV is imported from scikit-learn's model_selection module, enabling the hyperparameters to be tuned. The CV in GridSearchCV stands for cross-validation. Each time a set of hyperparameters is tried, it's scored against validation folds, keeping the test data unseen. You'll use the validation scores when comparing models moving forward. Note that you won't be comparing this tuned decision tree to the existing models. All the other models were scored and compared using test data, which is actually an improper practice. When data professionals perform model selection in the workplace, the test data must always remain unseen by the models being worked on. That data is only used at the very end of the model development process. As mentioned before, the parameters you'll tune will be max_depth and min_samples_leaf. A dictionary is defined where each key is the name of a hyperparameter, and each value is a list of numbers that will be tried for that hyperparameter. While the grid search is based on F1 score, you still want to find out what the other scores are, so create a set called scoring with the names of each of the desired metrics. Next, create an instance of a decision tree classifier named tuned_decision_tree. GridSearchCV is then called; as arguments, pass in the decision tree classifier object, the parameter dictionary, the scoring methods, and the number of cross-validation folds, and specify the metric the search will focus on. Finally, fit the model to the data. Check which hyperparameters the grid search identified. By getting the best_estimator_ attribute from the grid search object, you can observe the values it found. So, a max_depth of eight and a min_samples_leaf of 20 were best when using the F1 score as a measure. Getting the best_score_ attribute confirms the best average F1 score across the different folds among all the combinations of hyperparameters. Note that this model achieved a score of about 0.5607. This final code block is a helper function to extract scores for the model. It produces a data frame that has the name of the model along with the four scores you've been using. Call this function at the very end and save the resulting data frame as a CSV file for later use. Right now, there's nothing to compare this score with. However, note the model has an F1 score of 0.5605. And soon, you'll go on to create other, more advanced tree-based models and find scores to compare with this one. With those numbers, you'll be able to determine which model is not only the best performer, but also the best in the context of the business needs. At this point, you've learned that decision trees are useful because they're easy to understand and interpret, flexible with regard to the data they use, and highly versatile. Decision trees can be good predictors, but you also know that they're prone to overfitting and very sensitive to variations in the training data. How do we solve these problems? The answer is by using the wisdom of the crowd. Perhaps you're familiar with this concept because it can apply to everyday situations as well. If I have a jar filled with jelly beans and I ask a spatial math expert to examine it and guess how many jelly beans there are, their estimate will typically be less accurate than if I ask a thousand ordinary people to do the same thing and then take the average of their guesses. We can apply this same concept to modeling using a process called ensemble learning, or simply, ensembling.
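Before moving on to ensembling, here's a hedged sketch of the tuning workflow just described. The grid values here are illustrative; the notebook's actual grid may differ:

```python
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Hypothetical search values for the two hyperparameters being tuned.
tree_para = {"max_depth": [4, 6, 8, 10, 12],
             "min_samples_leaf": [10, 20, 50]}
scoring = ["accuracy", "precision", "recall", "f1"]

tuned_decision_tree = DecisionTreeClassifier(random_state=42)
clf = GridSearchCV(tuned_decision_tree,
                   tree_para,
                   scoring=scoring,
                   cv=5,            # five cross-validation folds
                   refit="f1")      # the metric the search optimizes
clf.fit(X_train, y_train)
print(clf.best_estimator_)   # the best max_depth / min_samples_leaf found
print(clf.best_score_)       # best mean cross-validated F1 across the grid
```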
Ensemble learning involves building multiple models and then aggregating their outputs to make a final prediction. Just like in our jelly bean example, predictions from an ensemble of models can be very accurate, even when the individual models themselves are barely more accurate than a random guess. A best practice when building an ensemble is to use very different methodologies for each model it contains, such as a logistic regression, a naive Bayes model, and a decision tree classifier. In this way, when the models make errors, and they always will, the errors will be uncorrelated. The goal is for them to not all make the same errors for the same reasons. You could build an ensemble using the three models I just mentioned. You'd train each model on the same data, then use each model's individual predictions to make a final prediction, say by taking the majority vote if it's a classification task or averaging the results if it's a regression task. But there's another way to build an ensemble, a way that uses the same methodology for every contributing model. In this kind of ensemble, each individual model that comprises it is called a base learner. For this method to work, you usually need a lot of base learners, and each is trained on a unique, random subset of the training data. If the base learners were all trained on the exact same data, there would be too much correlation between their errors. A base learner whose predictions are only slightly better than a random guess is called a weak learner. So, to ensure random subsets of the data while keeping the learners strong, data professionals do something called bagging. The word bagging comes from bootstrap aggregating. Let's break this down. Remember from statistics that bootstrapping refers to sampling with replacement. That's what happens during bagging, too. Each base learner samples from the data with replacement. For bagging, this means the various base learners can all sample the same observation, and a single learner can sample that observation multiple times during training. The aggregation part of bagging refers to the fact that the predictions of all the individual models are aggregated to produce a final prediction. For regression models, this is typically the average of all the predictions. For classification models, it's often whichever class receives the most predictions, which is the mode. When bagging is used with decision trees, we get a random forest. A random forest is an ensemble of decision tree base learners that are trained on bootstrapped data. The base learners' predictions are all aggregated to determine a final prediction. Random forest takes the randomization from bagging one step further. A regular decision tree model will seek the best feature to use to split a node. A random forest model will grow each of its trees by taking a random subset of the available features in the training data and then splitting each node at the best feature available to that tree. This means that each base learner in a random forest model has different combinations of features available to it, which helps to prevent the problem of correlated errors between learners in the ensemble. Each individual base learner is a decision tree. It may be fully grown, so each leaf is a single observation, or it may be very shallow, depending on how you choose to tune your model. Ensembling many base learners helps reduce the high variance that you typically get from a single decision tree.
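As a hedged sketch of bagging with decision tree base learners, here's scikit-learn's BaggingClassifier applied to the assumed X_train/y_train split. A random forest goes one step further by also randomizing the features, as described above:

```python
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

# 100 decision tree base learners, each trained on a bootstrapped sample
# (observations drawn with replacement) of the training data.
bagged_trees = BaggingClassifier(
    estimator=DecisionTreeClassifier(random_state=42),  # named base_estimator in scikit-learn < 1.2
    n_estimators=100,
    bootstrap=True,
    random_state=42)
bagged_trees.fit(X_train, y_train)

# For classification, the final prediction is the majority vote of the trees.
print(bagged_trees.score(X_test, y_test))
```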
Ensemble learning is powerful because it combines the results of many models to help make more reliable final predictions. Plus, these predictions have less bias and lower variance than other standalone models. Coming up, we'll explore random forests in more detail. Lots to come. Now that you've been introduced to random forests, let's examine a little more closely what they are and how they function. It's important to understand this methodology because it's commonly used in data work, and many of its component steps are used by other, more advanced modeling strategies. As a refresher, a random forest is an ensemble of decision trees whose predictions are all aggregated to determine a final prediction. Each tree in a random forest model uses bootstrapping to randomly sample the observations in the training data with replacement. Remember, this means that any tree in the model can use the same observation, and the same observation can be sampled more than once by the same tree. Bootstrapping is a critical component of random forest models. It ensures that every base learner in the ensemble is trained on different data while allowing each learner to train on a dataset that's the same size as the original training data. And because each tree's training data contains duplicated observations, each tree will also be missing some of the observations from the original training dataset. One more important principle of random forest models is that all trees in the ensemble are trained on a random subset of the available features in the dataset. No single tree sees all the features. Again, this is to introduce another element of randomness and ensure that each tree is as different from the others as possible. You learned that one of the main weaknesses of decision trees is that they are very sensitive to new data, so they're prone to overfitting. Therefore, randomizing both the data and the features used by each base learner means that no single tree can overfit all the data, because no single tree sees all the data. In fact, the individual trees underfit the data. They are high-bias learners, but together they can be very powerful predictors that are more stable than a regular single decision tree. In addition, random forests are very scalable. All the base learners they rely on can be trained in parallel using different processing units, even across many different servers. Finally, just like decision trees, random forest models need to be tuned to find the combination of hyperparameter settings that results in the best predictions. After all, random forests are made up of many decision trees, and data professionals want them all to be as effective as possible. Hey, welcome back. In this video, we'll build on your understanding of how decision trees grow. This will be the basis on which we'll tune a random forest model. You've learned that random forests make predictions by sampling features and observations to grow trees. With decision trees, splits are decided by which variables and which cutoff values offer the most predictive power. Now, let's consider that decision trees continue to split until one of a certain set of conditions is met. The first condition has to do with the observations that a leaf contains. When all of the observations belong to the same class, the leaf node is pure. The second condition affecting where a tree splits is whether the minimum leaf size or maximum depth is reached. Also, a decision tree may stop growing if it achieves a certain performance threshold.
The value and metric for this threshold can both be specified by the modeler. You'll recall that settings such as these are known as hyperparameters, and they can be tuned to improve model performance, directly affecting how the model is fit to the data. We demonstrated that one of the most important hyperparameters in a decision tree is its max_depth. This specifies how many levels the tree can have, and ultimately determines how many splits it can make. Remember, every time a node splits, your data gets separated into smaller subsets; the model is drawing another decision boundary. We also introduced you to min_samples_leaf, which defines the minimum number of samples for a leaf node. With min_samples_leaf, a split can only occur if it guarantees a minimum number of observations in the resulting nodes. Now, a new hyperparameter, min_samples_split, sets the minimum number of samples a node must contain in order to split; below that threshold, the node becomes a leaf. Random forest models have these same hyperparameters because they are ensembles of decision trees. These hyperparameters control how the learner trees are grown, but random forests also have some other hyperparameters which control the ensemble itself. The first of these controls the randomness of the trees, and it's called max_features. This setting specifies the number of features that each tree selects randomly from the training data to determine its splits. For example, if you have a dataset with features a, b, c, d, and e, and you build a random forest model with max_features set to three, your first tree might use features a, c, and e to determine its splits, the next tree might use features b, d, and e, and so on. The second hyperparameter, n_estimators, controls how many decision trees your model will build for its ensemble. For example, if you set n_estimators to 300, your model will train 300 individual trees. If you're building regression trees, the model's final prediction would average the predictions of all 300 trees. If you're building classification trees, the final prediction would be determined by whichever class received the majority vote from the 300 individual trees. For random forest models, performance will typically increase as trees are added to the ensemble, but only to a certain point. After this point, improvement will level off, and adding new trees will only increase your computing time. This happens because the new trees will become very similar to existing trees, so they won't contribute anything new to the model. As a final point, many data professionals build models without hand-setting each hyperparameter. In fact, when using scikit-learn, the model might perform well with no hyperparameters specified at all. That's because it has effective default settings. And remember to make the most of grid search to help you iterate. Data professionals know how to experiment with combinations of hyperparameters in order to build the model that makes the very best predictions. Now that you're familiar with the logic behind random forests and some of their most important hyperparameters, you're ready to build a model. In this video, we'll build a random forest model that uses grid search to cross-validate and tune the hyperparameters. Let's open up a Jupyter notebook and get started. Recall that in this scenario, we're trying to predict which customers will close their bank accounts. Import NumPy, pandas, Matplotlib, GridSearchCV, and train_test_split, then all the evaluation metrics. Also import RandomForestClassifier.
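For reference, the imports for this walkthrough might look something like the sketch below; your notebook may organize them differently.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.metrics import (accuracy_score, precision_score,
                             recall_score, f1_score)
from sklearn.ensemble import RandomForestClassifier
```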
Note that we're using the classifier because we're trying to solve a classification problem, but we could also import RandomForestRegressor if we were predicting on continuous data. Remember that we've prepared our data by dropping the row number, customer ID, and surname columns because they don't have predictive value. We've also dropped the gender column because we don't want the model to predict on the basis of gender. Then we dummy encoded our categorical variables to prepare for modeling. The last step before modeling is to split the data into the training and test sets using train_test_split. We're going to compare our model's F1 score to what we got from our naive Bayes and decision tree models, so it's important to split the data in the exact same way to ensure that all models train and test on the same data. Therefore, we'll make sure that our test data is 25% of all the data. We'll also stratify based on our target column and set the random seed to 42, just as we did previously. This splits our X and y data into X_train, X_test, y_train, and y_test. Okay, we're ready to model. Let's find out what happens if we tune our model using cross-validation. One thing you'll notice when we build ensemble models is that the training time will usually be much longer than what you've experienced so far. That's because instead of building a single tree, we're now building from 75 up to 150 trees for each combination of hyperparameters we specify. It's useful to know how long it takes a model to train. You can get a cell's runtime by entering %%time at the top. This is called a magic command. Magic commands, often just called magics, are commands built into IPython to simplify common tasks. They always begin with either % or %%. Now, define the hyperparameter grid. Tune five hyperparameters: max_depth, min_samples_leaf, min_samples_split, max_features, and n_estimators. Notice that for max_depth, None is included. This means that one of the options allows the trees to grow without a specific limit on their depth. Next, instantiate the classifier and assign it a random state for reproducibility, and specify the metrics the model will capture. Finally, instantiate the grid search object. It has two positional arguments: the classifier and the parameter grid. Tell it to use the scoring metrics specified above and set cv to five. This means the model will be cross-validated using five folds. Lastly, specify refit='f1'. This is necessary when we've given multiple scoring metrics because it tells grid search that even though we want to check a few different metrics, the one we care most about is the F1 score. As a quick refresher, F1 score combines precision and recall into a single metric. In this instance, when we call the best estimator, it's the one with the highest average F1 score across all five validation folds. Now, fit the model to the training data. Depending on the processing power available, the number of hyperparameter combinations specified in the grid search, the size of the dataset, and the number of folds used to cross-validate, this could take a long time. In this example scenario, the time magic tells us it took about 20 minutes to fit. There's always a trade-off between searching over a large hyperparameter space and good runtime. The more hyperparameters you search, the better your model may be, but the longer it will take to fit.
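Building on the imports sketched earlier, here's a minimal sketch of what that tuning cell might look like, with %%time as its first line so the runtime is reported. The grid values are illustrative, and X_train and y_train are assumed to come from the split described above.

```python
%%time
# Hypothetical hyperparameter grid; None lets trees grow without a depth limit
cv_params = {'max_depth': [2, 4, 6, None],
             'min_samples_leaf': [1, 2, 3],
             'min_samples_split': [2, 3, 4],
             'max_features': [2, 3, 4],
             'n_estimators': [75, 100, 125, 150]}

scoring = {'accuracy', 'precision', 'recall', 'f1'}

rf = RandomForestClassifier(random_state=0)

# cv=5 -> five cross-validation folds; refit='f1' keeps the model with
# the best average F1 score
rf_cv = GridSearchCV(rf, cv_params, scoring=scoring, cv=5, refit='f1')
rf_cv.fit(X_train, y_train)
```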
When models take a long time to fit, you don't want to have to run them again. If your kernel disconnects or you shut down the notebook and lose the cell's output, you'll have to refit the model, which can be frustrating and time-consuming. The good news is that there's a method that enables you to save the fit model object to a specified location and then quickly read it back in. And in the next video, we'll discover how that works. Meet you there. In the last video, we began creating a random forest model that used grid search to cross-validate and tune hyperparameters. Now, we'll build on that by using a separate validation dataset to validate a model. But first, recall where we left off. We discovered a common issue in the data realm: the trade-off between searching over a large hyperparameter space and a good runtime. As we observed, the more hyperparameters searched, the better the model, but the longer it takes to fit. When models take a long time to fit, it's inefficient to have to keep running and refitting them. Once you find a model you're happy with, you don't want to start from scratch every time you open your notebook. And that's where pickling comes in. Pickle is a tool that saves the fit model object to a specified location and then quickly reads it back in. It also allows you to use models that were fit somewhere else without having to train them yourself. So let's pick up where we left off and pickle the model. First, specify a file path to the directory where the model will be saved. Then create a with open statement, passing to it the file path plus the name you want to use to save this model, followed by .pickle. This creates an empty pickle file. The second argument, 'wb', gives permission to write to the file in binary, which is how pickling works. Use as to assign the return value of open to a local variable named to_write. Call pickle.dump and pass to it the fit model object, then the to_write variable. In the next cell, read the pickled model back in from where it was saved. The only difference in syntax is using 'rb' to specify that we'll be reading binary and using pickle.load to assign a new variable which points to the fit model. Make sure you call this new variable by the same name you used for your fit model above, in this case, rf_cv. If you comment out the line of code where you fit the model and the cell where you pickle the model, you can close the notebook, reopen it, and rerun all the cells without having to wait for the model to fit. You can also send the pre-fit model to other people to use.
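For reference, the pickling steps just described might be sketched like this; the file path and file name are hypothetical, and rf_cv is assumed to be the fit grid search object.

```python
import pickle

path = '/home/jovyan/work/'                    # hypothetical save location

# Write the fit model object to an empty pickle file ('wb' = write binary)
with open(path + 'rf_cv_model.pickle', 'wb') as to_write:
    pickle.dump(rf_cv, to_write)

# Read the pickled model back in ('rb' = read binary); reuse the same name
with open(path + 'rf_cv_model.pickle', 'rb') as to_read:
    rf_cv = pickle.load(to_read)
```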
Now, use the best_params_ attribute to identify the hyperparameter values of the model that had the best average F1 score across all the cross-validation folds. To find the average F1 score of the best model, use the best_score_ attribute. Then use the make_results function to generate a table of all the results and concatenate that with the overall table to compare all the models. Interesting: the cross-validated random forest model has an average F1 score of 0.58 across all five validation folds, which is a little better than the single tuned decision tree. It also has better recall, precision, and accuracy. Nice. Okay, now let's use a separate validation dataset to validate the model. To do this, split the training data into a new training set and a validation set using train_test_split. Remember to stratify based on the y data. Use an 80-20 split. Don't forget, this is only splitting the training data, which is itself 75% of all the data. This means that our new training set will be 80% of 75% of the data, and the new validation set will be 20% of 75% of the data. The test data remains untouched. This next part is a little tricky. GridSearchCV wants to cross-validate the data. In fact, if the cv parameter were left blank, it would split the data into five folds for cross-validation by default. Because you're using a separate validation set, it's important to explicitly tell the function how to perform the validation. This means telling it which rows belong to the training set and which belong to the validation set. Use a list comprehension to generate a list of the same length as the training data being passed to the grid search, where each value is either a negative one or a zero. Use this list to indicate to GridSearchCV that each row labeled negative one is in the training set and each row labeled zero is in the validation set. Call this list split_index. Now, import a new function called PredefinedSplit from scikit-learn's model selection package. PredefinedSplit provides train/test indices to split data into training and validation sets using a predefined scheme. The next step is almost identical to what you did before. Search over all the same hyperparameters and keep the syntax the same as when cross-validating, but now pass the new split_index list to PredefinedSplit and assign the result to a variable; we'll call the variable custom_split. Finally, set the cv parameter of the grid search equal to custom_split. Now it's time to fit the model. Use the time magic to get the time it takes the model to train. Wow, in this example scenario, the model took only about four minutes to train. During cross-validation, the training data was divided into five folds. An ensemble of trees was grown with a particular combination of hyperparameters on four folds of the data, validating it against the fifth fold that was held out. This whole process happened for each of the five holdout folds. Then another ensemble was trained with the next combination of hyperparameters, repeating the whole process. This continued until there were no more combinations of hyperparameters to run. But now, there's a single separate holdout set for validation. An ensemble is built for each combination of hyperparameters. Each ensemble is trained on the new training set and validated on the validation set. But this only happens one time for each combination of hyperparameters, instead of five times with cross-validation. That's why the training time was only about a fifth as long. All right, now pickle the model again. Run the cell where the pickle is written, then go back and comment out the line of code that fits the model as well as the call that writes the pickle. Remember to keep a cell where the pickled model can be read back in. Check the results. When you call best_params_, notice that the ensemble with the best F1 score used slightly different hyperparameters than the cross-validated model. Now the F1 score is 0.576, better than the single decision tree model, but not as good as the cross-validated model. Both would likely produce similar results. Just keep in mind that the cross-validated model is a little more reliable because it was more rigorously validated. In fact, if a different random seed had been used to create the validation set, it's possible that we might have gotten lucky and even had a model that performed a little better than our cross-validated model. But note that this doesn't mean it would be expected to perform better on the test data.
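To recap the mechanics of that separate-validation-set setup, here's a minimal sketch. It reuses the rf, cv_params, and scoring objects from the earlier sketch, and the random seed and names are illustrative.

```python
from sklearn.model_selection import train_test_split, PredefinedSplit, GridSearchCV

# Split the existing training data 80/20 into new training and validation
# sets, stratifying on the target
X_tr, X_val, y_tr, y_val = train_test_split(
    X_train, y_train, test_size=0.2, stratify=y_train, random_state=10)

# -1 = row is used only for training; 0 = row belongs to the validation fold
split_index = [0 if x in X_val.index else -1 for x in X_train.index]
custom_split = PredefinedSplit(split_index)

# Same search as before, but cv now uses the single predefined validation fold
rf_val = GridSearchCV(rf, cv_params, scoring=scoring,
                      cv=custom_split, refit='f1')
rf_val.fit(X_train, y_train)
```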
This notebook's purpose was to demonstrate the different processes involved in validating a random forest model: using cross-validation, and validating with a separate dataset. In practice, it's unlikely that you'd do both. Instead, it would be more effective to choose an approach based on time requirements, the amount of data, and the number of different hyperparameter combinations to be explored. As always, the work that data professionals do requires thoughtful trade-offs and adaptability. Random forest is one methodology for building tree-based ensemble models. Now, I'm going to introduce you to another, related methodology called boosting. Boosting is one of the most powerful modeling methodologies in the field. It's used in nearly every industry that relies on predictive modeling. Many winning models from Kaggle and other competitions use boosting. It's an essential tool in any modeler's tool belt. Boosting is a supervised learning technique where you build an ensemble of weak learners. This is done sequentially, with each consecutive base learner trying to correct the errors of the one before. Remember, a weak learner is a model whose prediction is only slightly better than a random guess, and a base learner is any individual model in an ensemble. This practice is similar to random forest and bagging. Like random forest, boosting is an ensembling technique, and it also builds many weak learners and then aggregates their predictions. But there are some key differences. Unlike random forest, which builds base learners in parallel, boosting builds learners sequentially. This is because each new base learner in the sequence focuses on what the preceding learner got wrong. Another difference from random forest is that for boosting models, the methodology you choose for the weak learner isn't limited to tree-based methods. However, we will use tree-based implementations in this course because these are common and effective ways of building boosting models. There are various boosting methods available, but throughout this part of the program, we'll explore two of the most commonly used methodologies. The first is called adaptive boosting, or AdaBoost. AdaBoost is a tree-based boosting methodology where each consecutive base learner assigns greater weight to the observations incorrectly predicted by the preceding learner. Here's a demonstration. AdaBoost builds its first tree on training data that gives equal weight to each observation. Then, the algorithm evaluates which observations were incorrectly predicted by this first tree. It increases the weights for the observations that the first tree got wrong and decreases the weights for those that it got right. This process repeats until either a tree makes a perfect prediction or the ensemble reaches the maximum number of trees, which is a hyperparameter specified by the data professional. Once all the trees have been built, the ensemble makes predictions by aggregating the predictions of every model in the ensemble. Because AdaBoost can be used for both classification and regression problems, this final step is a little different depending on which type is being addressed. For classification, the ensemble uses a voting process that places weight on each vote. Base learners that make more accurate predictions are weighted more heavily in the final aggregation. For regression, the model calculates a weighted mean prediction from all the trees in the ensemble.
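For a sense of what AdaBoost looks like in code, here's a minimal sketch using scikit-learn's implementation, whose default base learner is a shallow decision tree; the settings are illustrative, and the training data is assumed to exist.

```python
from sklearn.ensemble import AdaBoostClassifier

# Each consecutive tree gives greater weight to the observations the
# previous tree predicted incorrectly; n_estimators caps the ensemble size
ada = AdaBoostClassifier(n_estimators=50, learning_rate=1.0, random_state=0)
ada.fit(X_train, y_train)      # assumes the training split already exists
```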
There is one disadvantage to note about boosting. You can't train your model in parallel across many different servers, because each model in the ensemble is dependent on the one that preceded it. This means that in terms of computational efficiency, it doesn't scale as well to very large datasets when compared to bagging. However, this generally isn't a concern unless you're working with particularly large datasets. But there are many noteworthy advantages, including being one of the more accurate methodologies available today. Also, just like random forest, the fact that it's based on an ensemble of weak learners means that the problem of high variance is reduced. This is because no single tree weighs too heavily in the ensemble. Boosting has a few other key advantages. First, unlike random forest, it reduces bias. It's also easy to understand and doesn't require the data to be scaled or normalized. Boosting can handle both numeric and categorical features, and it can still function well even with multicollinearity among the features. Plus, it's robust to outliers. Now, note that resilience to outliers is a major strength of all tree-based methodologies. This is because the model splits the data the same way regardless of how extreme a value is. Here's an example. Suppose you have six elephants. Three are females that weigh 2,000, 2,500, and 3,000 kilograms, and three are males that weigh 4,000, 4,500, and 5,000 kilograms. If you grew a decision tree using this data, it would draw a decision boundary between males and females at 3,500 kilos, the midpoint between the weights of the heaviest female and the lightest male. Now, suppose that instead of weighing 5,000 kilos, the last male elephant weighed 10,000 kilos. Your model would still divide males and females at 3,500 kilos. It doesn't matter that the last elephant doubled in size. Speaking of doubling in size, your experience and skills are growing at an enormous rate. I'm really proud to be taking this journey with you and can't wait to keep it up. Previously, you learned that boosting is an ensembling technique that builds models sequentially, with each model in the sequence focusing on the mistakes of the previous one. And you discovered that AdaBoost works by assigning greater weight in each model to the incorrect predictions of the model that preceded it. Now, we're going to explore gradient boosting. Gradient boosting is different from adaptive boosting because instead of assigning weights to incorrect predictions, each base learner in the sequence is built to predict the residual errors of the model that preceded it. Here's a demonstration. This example uses a decision tree regressor, so imagine that the target is a continuous variable. Let's start with a set of features, x, and a target variable, y. We'll train the first base learner decision tree on this data and call it learner one. Learner one makes its predictions, which we'll call y-hat one. The residual errors of learner one's prediction are found by subtracting the predicted values from the actual values. Call this set of residual errors error one. Now, train a new base learner using the same x data, but instead of the original y data, use error one as the target. That's because this learner is predicting the error made by learner one. Call this new base learner learner two. Learner two's predictions are assigned to y-hat two. Then compare learner two's predictions to the actual values and assign the difference to error two. In this case, the actual values are the errors made by learner one.
This process will continue for as many base learners as we specify. For now, repeat it just once more. Stopping here results in an ensemble that contains three base learners. To get the final prediction for any new x, add together the predictions of all three learners. If you like, pause the video now and repeat the process to review how it works. Ensembles that use gradient boosting are called gradient boosting machines, or GBMs. GBMs are among the most widely used modeling techniques today because of their many advantages. One of these is high accuracy. As we mentioned earlier, many machine learning competition winners succeeded largely because of the accuracy of their boosting models. Another advantage is that GBMs are scalable. Even though they can't be trained in parallel like random forests, because their base learners are developed sequentially, they still scale well to large datasets. GBMs also work well with missing data. The fact that a value is missing is viewed as valuable information, so GBMs treat missing values just like any other value when determining how to split a feature. This makes gradient boosting relatively easy to use with messy data. Also, because they're tree-based, GBMs don't require the data to be scaled, and they can handle outliers easily. Gradient boosting also has its drawbacks. One is that GBMs have a lot of hyperparameters, and tuning them can be a time-consuming process. Another drawback is that they can be difficult to interpret. GBMs can provide feature importance, but unlike linear models, they do not have coefficients or directionality. They only show how important each feature is relative to the other features. Because of this, they're often called black-box models. A black-box model is one whose predictions cannot be precisely explained. In some industries, such as medicine and banking, it's essential that your model's predictions be explainable. Therefore, GBMs are not well suited for some applications. GBMs can also have difficulty with extrapolation. Extrapolation is a model's ability to predict new values that fall outside of the range of values in the training data. For instance, if one loaf of bread costs $1, two loaves cost $2, and three loaves cost $3, a linear regression model would have no trouble predicting that 10 loaves cost $10. But a GBM wouldn't be able to, unless it saw the cost of 10 loaves in the training data. Finally, GBMs are prone to overfitting if not trained carefully. Usually, this is caused by tuning too many hyperparameters, which can result in the trees growing to fit the training data but not generalizing well to unseen data. You're doing a wonderful job filling up your data toolkit. Everything we're exploring together is priming you for an exciting career. Keep up the great work.
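To make the residual-fitting process concrete before we turn to full implementations, here's a toy, by-hand sketch of three rounds of gradient boosting with decision tree regressors on synthetic data. It's illustrative only, not how you'd build a GBM in practice.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(42)
X = rng.uniform(0, 10, size=(200, 1))              # toy feature
y = np.sin(X).ravel() + rng.normal(0, 0.1, 200)    # toy continuous target

# Learner one fits the original target
learner_1 = DecisionTreeRegressor(max_depth=2).fit(X, y)
error_1 = y - learner_1.predict(X)                 # residuals of learner one

# Learner two fits learner one's residual errors
learner_2 = DecisionTreeRegressor(max_depth=2).fit(X, error_1)
error_2 = error_1 - learner_2.predict(X)

# Learner three fits learner two's residual errors
learner_3 = DecisionTreeRegressor(max_depth=2).fit(X, error_2)

# The ensemble's final prediction is the sum of all three learners' outputs
X_new = np.array([[2.5]])
y_hat = (learner_1.predict(X_new) + learner_2.predict(X_new)
         + learner_3.predict(X_new))
```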
There are numerous machine learning packages that include implementations of the boosting models we examined earlier. Most of them have many of the same tunable hyperparameters as decision trees, because the most popular implementations use tree-based learners. They are also similar to random forests in that they have additional hyperparameters that control the ensemble as a whole. This video will explore some of these hyperparameters so you'll be able to assemble models that fit well to your data and make accurate predictions. The implementation of GBM modeling that we'll be using from this point on is called XGBoost. XGBoost stands for extreme gradient boosting. XGBoost is used widely in the field of predictive modeling, and as a data professional, you're likely to encounter it frequently if you work with models. Scikit-learn has its own GBM implementation, which is similar, but XGBoost is a commonly used gradient boosting package that has many useful optimizations. These optimizations include fast training, effective regularization of features, and tunable hyperparameters, which can improve model predictions. Let's return to max_depth. As you'll recall, this was used both in decision trees and random forests. It has the same functionality in XGBoost, which is that it controls how deep each base learner tree will grow. The best way to find this value is through cross-validation. The model's final max_depth value is usually low. Remember, the deeper the tree, the more a model learns feature interactions that could be very specific to the training data but may not generalize well to new information. Even short trees are powerful because of the ensemble. Typical values for max_depth are two through 10, but this depends on the number of features and observations in the data. The second hyperparameter is n_estimators, the number of estimators, which sets the maximum number of base learners that the ensemble will grow. This is best determined using grid search. For smaller datasets, more trees may be better than fewer. For very large datasets, the opposite could be true. Typical ranges are between 50 and 500. Now, let's investigate some hyperparameters that we haven't used before. This first one is very important. It's called the learning rate. You'll remember that each time an ensemble builds a new base learner, it fits the data to the error from the previous model. In a basic implementation, the predictions of all the trees could then be summed to determine a final prediction. In this case, each tree's prediction is considered equally important to the final prediction. In practice, we use the learning rate to indicate how much weight the model should give to each consecutive base learner's prediction. Lower learning rates mean that each subsequent tree contributes less to the ensemble's final prediction. This helps prevent overcorrection and overfitting. Another common name for this concept is shrinkage, because less and less weight is given to each consecutive tree's prediction in the final ensemble. Think of it like riding a bike for the first time. Before you find your balance, you might move the handlebars too far in one direction, causing you to veer off course. So you need to make an adjustment, but often you'll overcorrect, so you need to shift back the other way. Each time you correct, you move the handlebars less and less until you're traveling smoothly in the direction you want to go. This is the same idea as what happens when you slow the learning rate: each correction affects the prediction less than the one before it. Also, if you use a low learning rate, your model will often require more trees to compensate. Again, this is best determined using grid search. Typical values are from 0.01 to 0.3. The last hyperparameter we'll examine is very similar to min_samples_leaf from decision trees, but it has a different name. It's called min_child_weight. A tree will not split a node if the split would result in any child node with less weight than what you specify in this hyperparameter. Instead, the node would become a leaf. This is a regularization parameter, so values that are too high cause the model to underfit the data.
The range of this setting is zero to infinity. If it's set between zero and one, the algorithm interprets it as a percentage of your data, so 0.1 would mean that a node could not split unless each of its children would have greater than or equal to 10% of the training observations. Generally, think of values greater than one as being equivalent to the number of observations in a child node, so a value of 10 would mean no child node could contain fewer than 10 observations. We're nearing the end of this course, and you've learned so much already. As always, make the most of the course resources, and always feel free to return to any of these videos to keep practicing. In this video, we'll demonstrate how to build and tune an XGBoost classification model. We'll use this model to compare the performance of all our previous models and select a final one. Let's return to the bank churn data. Import most of the same libraries, packages, and functions used in previous models: NumPy, pandas, Matplotlib, pickle, all the model metrics, GridSearchCV, and train_test_split. We have two new imports as well: XGBClassifier and plot_importance, both of which come from the xgboost library. Remember that certain columns have been dropped, including customer ID and gender. Also, the geography column was dummy encoded. Just as before, assign the features and target data to variables X and y, then use train_test_split to split them into X_train, X_test, y_train, and y_test. Stratify based on the target, and set the same test size and the same random seed used for previous models. This helps ensure a direct comparison when evaluating model performance. Now, begin modeling. Use grid search to tune some hyperparameters. Specifically, focus on max_depth, min_child_weight, learning_rate, and n_estimators. Define the values that grid search will iterate over as a dictionary called cv_params. Next, instantiate the classifier. Note that the objective parameter was set to 'binary:logistic'. This means that the model is performing a binary classification task that outputs a logistic probability. The objective would be different for different kinds of problems, for instance, if you were trying to predict more than two classes or performing a regression on continuous data. Now, set the random state. Score in the same way as the random forest: accuracy, precision, recall, and F1 score. And finally, instantiate the grid search. Remember to set refit to 'f1', as this tells grid search to refit the model that had the best average F1 score when it finishes its search. Now, fit the model to the training data. Use the handy time magic so the cell outputs the time it takes to run. This example scenario took nine minutes and 45 seconds.
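Putting those steps together, the tuning cell might look roughly like this. The grid values echo the typical ranges discussed earlier and are illustrative, as is the assumption that the same X_train and y_train split exists.

```python
%%time
from xgboost import XGBClassifier
from sklearn.model_selection import GridSearchCV

# Hypothetical grid drawing on the typical ranges discussed earlier
cv_params = {'max_depth': [4, 6, 8],
             'min_child_weight': [1, 3, 5],
             'learning_rate': [0.01, 0.1, 0.3],
             'n_estimators': [75, 100, 150]}

scoring = {'accuracy', 'precision', 'recall', 'f1'}

# 'binary:logistic' -> binary classification that outputs a probability
xgb = XGBClassifier(objective='binary:logistic', random_state=0)

# refit='f1' refits the model with the best average F1 score
xgb_cv = GridSearchCV(xgb, cv_params, scoring=scoring, cv=5, refit='f1')
xgb_cv.fit(X_train, y_train)
```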
Okay, now pickle the model. It's also possible to use models that were built in other notebooks. For example, to use the random forest model, import it into this notebook using another with open statement, and call it rf_cv. Now, compare the models by using the best_score_ attribute for both the new XGBoost model and the random forest model from earlier. In this case, XGBoost outperformed the cross-validated random forest model by a very close margin of just three thousandths. Now, use the make_results function created previously to generate a results table for this model and append it to the overall results table. This makes it possible to compare the scores across all models. Sort the results on the F1 column in descending order. The table clearly shows that our XGBoost model outperformed all other models when measuring on F1 score. All right, now it's time to evaluate how the superior XGBoost model performs when making predictions on the test holdout data. Use the grid search object's predict method to make predictions on the X_test data and assign the results to a variable. Then, compare these predictions to the actual values contained in y_test and generate evaluation metrics. Wow, the model performed better on the test data than on the validation data for all four metrics. This is always a possibility, but don't be alarmed if your model performs slightly worse on the test data. After all, test data is completely unseen by the model. Successful data professionals know that a job isn't finished simply because they've produced a model that results in an effective performance metric. It's equally important to interpret that model and make recommendations based on the findings. A confusion matrix is very helpful when assessing a model's predictions. For instance, in our model, of the 2,500 people in our test data, 509 customers left the bank. Of those, our model captures 256. The confusion matrix indicates that when the model makes an error, it's usually a type II error: it gives a false negative by failing to predict that a customer will leave. On the other hand, it makes far fewer type I errors, which are false positives. Whether these results are acceptable depends on the cost of the measures taken to prevent a customer from leaving versus the value of retaining them. In this case, bank leaders may decide that they'd rather have more true positives, even if it means also capturing significantly more false positives. If so, perhaps optimizing the models based on their F1 scores is insufficient; maybe we'd retrain them to focus on recall instead. What is certain is that our model helps the bank. Consider the results if decision makers had done nothing. In that case, they'd expect to lose 509 customers. Alternatively, they could give everybody an incentive to stay. That would cost the bank money for each of the 2,500 customers in our test set. Finally, the bank could give incentives at random, say by flipping a coin. Doing this would incentivize about the same number of true responders as our model selects, but the bank would lose a lot of money offering the incentives to people who aren't likely to leave. Plus, our model is very good at identifying these customers. Another way to help explain our model is by checking the most important features. XGBoost gives us a very useful function called plot_importance that lets us observe the relative feature importance of our model. After we've imported the function, we can use it to output a bar graph by passing it the best estimator from the grid search. In our model, estimated salary, balance, credit score, and age were the top predictors of whether a customer will leave. It would probably be useful to return and do another EDA focused on these features. At this point, you might also want to add back in the gender column, as well as a column of your final model's predictions, to your original data. This would allow you to measure how evenly your model distributed its error across reported gender identities.
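The evaluation and feature-importance steps just described might be sketched like this, assuming xgb_cv is the fit grid search object and X_test and y_test are the holdout data.

```python
from xgboost import plot_importance
from sklearn.metrics import (accuracy_score, precision_score,
                             recall_score, f1_score, confusion_matrix)

# Predict on the held-out test data with the refit best model
y_pred = xgb_cv.predict(X_test)

print('Accuracy: ', accuracy_score(y_test, y_pred))
print('Precision:', precision_score(y_test, y_pred))
print('Recall:   ', recall_score(y_test, y_pred))
print('F1:       ', f1_score(y_test, y_pred))

# Rows are actual classes, columns are predicted classes
print(confusion_matrix(y_test, y_pred))

# Bar graph of relative feature importances from the best estimator
plot_importance(xgb_cv.best_estimator_)
```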
From linear and logistic regression to naive Bayes, decision trees, random forests, and XGBoost, you're now equipped with a powerful set of tools. They'll help you stand out as a professional in the exciting and rewarding data career space. Out of all the models you've learned throughout this program, the tree-based modeling techniques are going to be some of the ones you'll use most throughout your data journey. In this section of the course, you discovered why tree-based models are often preferred over other supervised learning techniques such as naive Bayes and logistic regression. From there, you explored decision trees. You learned how they work, how to build them, and how they're used to make predictions about future events. Then you considered hyperparameter tuning with decision trees. You learned about max_depth and min_samples_leaf, understanding how changing these hyperparameters affects how the model is trained and, in turn, affects your model's performance. You then explored some more advanced tree-based modeling topics, such as ensemble learning. Two of these techniques in particular, bagging and boosting, enable multiple decision trees to arrive at a model that works better than any one tree ever could. You implemented one of the most popular bagging methods, random forests, and observed how this technique compares to single decision trees. Then you learned about several boosting methods and came to understand two popular approaches, adaptive boosting and gradient boosting. After learning their differences and unique advantages, you implemented them in Python and again witnessed how they compare to random forests and single decision trees. And for each of these ensembling methods, you discovered how hyperparameter tuning can come into play. Whereas tuning made relatively minor differences and improvements with single decision trees, you observed how tuning hyperparameters such as the learning rate and n_estimators can take an ensemble learning model from good to great. These models are some of the most cutting-edge in data science today. The skills you gain with these tools will absolutely make you stand out to potential employers. What you've learned will also expand your data science education and serve as a launchpad for you to really boost into the data world. Hi, it's great to be with you again. You might recognize me from the last course. I'm Tiffany, and it's time again to complete a portfolio project and apply what you've learned throughout this course. Just as in the earlier courses, this portfolio project will guide you to complete several tasks and create artifacts that showcase your skills. During interviews, you may be asked questions to test your understanding of different machine learning models. Also, having projects on your resume can help you stand out to a hiring manager who may invite you to complete an interview. During the interview, you can rely on your portfolio to discuss data science in general or explain modeling strategies more specifically. In order to complete the portfolio project, you'll be presented with details about some business cases. Choose one and use the instructions to complete a new entry in your PACE strategy document and create machine learning models to solve the problem. By the time you complete this project, you'll have machine learning models that you can add to your portfolio. At this point, you're almost finished with this course. You have learned everything you need to complete this project, and you're well on your way to advancing your career as a data professional. In this project, you will solve a data problem using the models you learned in this course and then make a business recommendation.
Following the PACE workflow, you will create a plan and communicate your process for completing the project. Ready? Then let's get started. I'm Ure. I lead analytical teams at Google, leading data scientists and product analysts, so I've interviewed north of 500 people over the years. What I always look for is problem solving. Often I'll ask you about a project you did, and we're going to drill down; I'm going to ask you details. And it's going to be something you've spent time with, right? You've gone back and forth, you've looked at it from different angles. And I always want to see that I, thinking about it for a few seconds as you tell me about it, cannot come up with a better answer than you. So always pick the ones that you know very well and can almost teach me, give me some new insight about it that is not obvious. If you can teach me something new, that's a magical moment for me. What you really want to show is that you can solve problems, right? You took something, you thought it through, and you came up with something that is original and that you can communicate outward. I want to see you framing a problem. Framing is about: what is best? How do I measure it? How do I find data to help me? So you have to focus. I want to see that your project focuses on something that is interesting. And very importantly, this has to be communicated out, and the best thing in my mind is visualization. You want to show how all of this turns into something that I understand. If you can bring forward something nobody else did, or explain it in a unique way, that's always going to stand out. Here's what I always look for: simple solutions to a hard problem. If it's too complicated, I can't do anything with it. So simple solutions, not obvious ones, simple ones that are really a solution to a hard problem. Here's a pro tip: if you get comfortable feeling uncomfortable, then your growth is unlimited. That's where you want to be. That means that you're stepping out of your comfort zone and doing the stuff that matters. In this course, you learned about supervised and unsupervised machine learning models, how they work, and how to build them in Python. In the previous course, you practiced building, interpreting, and evaluating regression models. Up to this point, you've also been working hard to develop skills with Python, data visualizations, and statistics too. Now it's time to compile all of that knowledge as you complete this portfolio project. Here, you'll be presented with a business problem and a dataset. You will then go through the PACE workflow to create a plan, build machine learning models, document your work, and select the model that would be the best solution to the problem. All of the models you build will be models that you learned in this course and may include naive Bayes, decision trees, random forests, XGBoost, and even K-means. You'll also select an appropriate evaluation metric to gauge your model's performance. As you learned in this course, data professionals analyze and discover patterns in data, which inform the most appropriate models needed to solve business problems. Then they communicate about their work and recommendations to colleagues and stakeholders. And remember, building these models will require a bit of patience. You've done an excellent job developing your skills, and they will definitely be useful as you complete this project. Please feel free to return to any of the other videos and course materials if you need a refresher on any of the content.
At this point in your program progress, you've covered so many topics, everything from understanding the data career space to Python, visualizations, statistics, modeling, and more. Your portfolio includes machine learning models on a dataset to solve a problem and your PACE strategy document has a new entry where you explain your work at each stage of the process. Through every step of this program, you've created a number of artifacts to add to your portfolio that demonstrate your knowledge and skills. There is so much to be proud of. There are multiple ways you can highlight your work and explain what you've done to potential employers and hiring managers in future interviews. As I've mentioned previously, you'll want to dedicate interview time to talking about the tools you've learned about, the transferable skills you've developed, and the experiences you've had in this program. As a data professional, you may be asked to learn and adapt to new tools on the job, just as we have illustrated in this program. There are a lot of great tools out there and different businesses use different tools and skills depending on their needs. As you apply for jobs, keep in mind that you have learned a lot of transferable skills that can be applied across different tools. In this course, we discussed the importance of determining the most appropriate models to use based on the problem you are solving and data available. Along the way, you discovered you can use different machine learning models to help find a business solution. And you acknowledged that it's important to explain your process when working with machine learning models. During an interview, it may be the case that you are asked, how would you use supervised learning models to address a business problem? How can you use different machine learning models to help you find a business solution? Why is it important to explain your process when working with machine learning models? I encourage you to consider what you have learned in this program to begin answering these questions. Of course, there will likely be other points of discussion in the interview as well. In this portfolio project that you just completed, you built different machine learning models using Python. These models helped identify potential solutions to a unique business challenge. Additionally, you recorded a new entry in your PACE strategy document that details your thoughts, considerations and process steps for this project. I'll also highlight that this project built upon the knowledge you developed as you progressed through this program. Now you're prepared to perform the tasks and responsibilities of a data professional. As a reminder, your interviewers have a business challenge, just like stakeholders on data projects. They have an open job position they need to fill. Think about what they need to know about you to make a decision that solves that business challenge, just like you've been practicing in each portfolio project. Coming up in the Capstone course, you'll bring together all of the content and skills from across the program and apply them in one project. This will be an opportunity to apply skills from each of the portfolio projects in the earlier courses to solve a new business problem. This will provide you with even more artifacts to add to your portfolio. Congratulations, you just finished the final instructive portion of the program and you're ready to move on to your Capstone project. 
You've learned a whole lot throughout this final section, and you're now ready to take this new knowledge and move on in your data journey. Whether your next steps are to continue your education or take what you've learned into industry, you now have a comprehensive foundation to build on. We started this section learning about the foundations of machine learning, with a focus on the different types of models that are available to a data professional. You saw how different types of business needs require different types of models. Additionally, you learned about recommendation systems and many of the most common use cases for these types of models, along with the different popular techniques and their advantages and disadvantages. From there, you started building out your machine learning tool belt: different integrated development environments, types of Python files, and data-oriented Python packages together give you the tools you need to approach any data-driven problem. The PACE workflow is the framework in which you can put those tools to use. By taking the time to follow the steps of plan, analyze, construct, and execute, you can ensure that you stay on track to produce a model that will deliver meaningful results. The plan stage involved taking a close look at the business need and the data available and determining what type of model would be appropriate. In the analyze stage, you applied many of the exploratory data analysis principles that you learned earlier in the program. Additionally, you learned about a new subset of techniques called feature engineering that allow you to manipulate and prepare your data for modeling in a variety of ways. The construct stage introduced you to a new type of supervised classification model, naive Bayes. You learned about and built this model, along with applying evaluation metrics to gauge its performance. Finally, in the execute stage, you performed any needed validation techniques, further evaluated the model, and made any needed tweaks to get the most performance out of it. Next, you took a deeper look at unsupervised learning models. K-means is one of the most widely used unsupervised learning techniques, and in this part of the program, you built a K-means model and used common evaluation techniques to fully understand its results. Finally, you learned about tree-based modeling. Tree-based techniques are some of the most effective models that currently exist in the industry. You saw how single decision trees work, learning how they function conceptually along with building one for yourself. With this foundation, you were introduced to two ensembling techniques, bagging and boosting. Within tree-based modeling, you also saw one of the most important aspects of using advanced machine learning techniques: hyperparameter tuning, which is essential for building models in industry, allowing you to optimize models to fit your specific needs. In the next section, you'll be taking everything you learned throughout the entire program and applying it to a capstone project that will be an invaluable piece of your portfolio. See you soon. Hey there. Throughout this program, you've been developing your skills as a data professional and practicing effective communication. Additionally, you've been building a portfolio of examples to showcase your abilities. And after all of your dedication and hard work, you've reached the capstone, the final project of this program. Congratulations.
At the conclusion of each of the earlier courses, you completed a portfolio project. In each, you practiced the skills you learned in that course. Within those portfolio projects, you used a PACE strategy document. There, you outlined your goals and planned the necessary workflow to perform each task. As a result, you've already created a number of great examples for your portfolio that are similar to the projects you will complete as a data professional. Each project highlights your ability to perform specialized tasks that rely on specific skills and tools. In this capstone course, you'll begin by selecting one of the project options and reading through the project overview. Next, you will use the PACE model to guide your workflow. These capstone scenarios will bring together all of the content and skills you have applied in each of the past portfolio projects. The big difference is that instead of focusing on individual course topics, the capstone project will incorporate all of the skills you developed in this program from start to finish. By adding your capstone project to your portfolio, employers and business contacts will recognize what you have accomplished and how hard you've worked to develop your skills. As you begin this project, know that you can go at your own pace. You can always refer to your past work if you face a challenge or need help. Each lesson and activity in this program has helped you prepare for the steps you'll complete in this project. Now it's time to begin the capstone project. Good luck. Let's get started. Hi, I'm Daisy, a data science manager at Google. I lead a team of data scientists. We focus on delivering insights and machine learning solutions to improve financial analysts' productivity. In the past three years, I've conducted about 200 interviews. I typically hire mid-level data scientists with at least about five years of relevant work experience, but in the past, I've also hired entry-level data scientists. I look for candidates with experience leveraging advanced analytics or machine learning solutions to drive business impact. Having that evidence on the resume, or demonstrating that experience throughout the interview, is quite critical. And I don't really emphasize experience from a specific industry. The reason is that I see a good data scientist can leverage their knowledge and adapt it to a different business environment. The successful candidates are those who are able to relate their past projects or schoolwork to any type of problem. Our interview questions normally cover two parts. One is focused on understanding the candidate's technical knowledge. In that aspect, we usually want to understand their coding skill, particularly in R, Python, and SQL. In addition, we also want to understand their knowledge of machine learning and statistics. The second part relates more to soft skills. In this aspect, we care about whether the data scientist can work with business stakeholders, understand their problems, and then be able to recommend the analysis and insights that help them solve their business problems. Sometimes I come across candidates who get stuck on the first question and then keep thinking about that question throughout the entire interview. So that's also something I would say: definitely give your best, but also know when to stop.
If you are interested in becoming a data scientist but you don't have previous work experience in this field, I would recommend you start building up your portfolio. That can be done through capstone projects from online courses or certificate programs, and also through some pro bono work. There are also many Kaggle-type competitions that will help you understand what problems close to real-world ones look like. So I definitely recommend building up this portfolio and starting to get exposure to messy data, which is closer to real-world problems. Welcome back. Let's talk about what to expect in the capstone project. At this point, you've had a chance to review the project details and the different project options. Shortly, you'll have an opportunity to access the instructions and review an example for the capstone project. Once you have reviewed the instructions, you'll start by selecting one of the project options. Here, you will be given access to all of the necessary data and project information. Next, use the PACE model to organize your approach and outline the necessary steps you will need to take. The capstone project follows a similar structure to each of the portfolio projects. By this point in the program, none of the tasks needed to complete the capstone will be new to you. And if you get stuck, you can review previous projects. There's one major difference between the capstone project and the portfolio projects you've done so far: the capstone project is comprehensive, because it requires skills and knowledge developed across the whole program, including Python, EDA, data visualizations, stats, modeling, and PACE. Now it's time to get started. At the end of the project, you'll have a set of artifacts that you can add to your portfolio to showcase your skills to prospective employers. Good luck on your capstone. Congratulations, you completed the capstone project. This is a huge accomplishment that demonstrates all of your hard work in this course. You applied your knowledge of data science and effective communication, creating a dynamic representation of your professional skills. You can now include the capstone project in your portfolio, share it with potential employers, and discuss it during interviews. You can also share any of the materials you generated during the process, including your models and any elements of your PACE strategy document. These can help explain your workflow, thought processes, and, ultimately, the choices you made. Ideally, you'll share all of the above. Each component illustrates a step in your progression as a data professional. Let me be the first to congratulate you on completing the capstone project. Your commitment to this program has been impressive, and I can't wait to have you share your passion for data science with prospective employers. Before that, we're going to get you ready for the job market. First, we'll refine your resume and prepare you for the interview process. Then, before you leave the program, we'll take a moment to celebrate your incredible accomplishment. Hello again. It's great to be back with you. Let's turn our attention to a discussion of the ways you can use everything you've learned in this program to have a successful experience on the job market. As you've progressed through this course, there have been many opportunities to familiarize yourself with what is expected of a data professional.
Not to mention, you've also been working on your PACE strategy document, which is the foundation of your portfolio. Let's continue to discuss even more ways to promote your career advancement, like refining your resume and preparing for interviews. The job search is an important part of your learning journey. The good news is that you've already begun preparing for this experience by enhancing your skills. An important first step is learning what organizations expect of data professionals. This program has emphasized that data is everywhere due to the pervasive nature of technology. In every industry, companies need data professionals to make informed decisions. Whether you have a passion for healthcare, finance, human resources, retail, education, construction, or anything else, there's a data-related job for you. You might also recall that there are many roles for data professionals, both across the field at large and within specific industries. A search for a data position might include terms such as data scientist or data analyst, among others. Job opportunities with these titles may require anywhere from zero to three years of experience. However, if a job asks for more years of experience than you have, you should apply anyway: as long as your skills match the job description, it's worth applying. A solid portfolio and resume can give you a chance even if you don't have a lot of experience. If you're looking for a job in a specific industry, the job listing may ask for skills or knowledge relating to that field. You can research job offers in your industry of interest to find out which additional skills you should develop. But where should you apply for data jobs? You can send applications through any job search site, such as LinkedIn, Indeed, or Glassdoor. Even a quick Google search can help you find recent job opportunities. On each of these sites, you'll be able to fill out job applications and share your resume and portfolio. If you get a response to your application, you may interact with a recruiter or hiring manager. They may reach out with an email or phone call to set up an interview. If so, congratulations. You'll then prepare by researching the company, if you haven't already, and rehearsing for your interview. Some companies may contact you for multiple rounds of interviews, especially if they received a lot of applications. If this is the case, your recruiter will prepare you for each round and tell you what to expect. You may be asked to describe the projects in your portfolio or complete a short data-related exercise. Once you complete your final interview, you may not get an immediate response. At this point, you should follow up with a message thanking the interviewer for their time. You can also apply to more jobs, work on your data skills, or continue developing your portfolio. Arguably, this waiting period is the hardest part, but hopefully it leads to an interesting new opportunity. If you don't hear back for a while after your interview, reach out to the recruiter or hiring manager to check on your application status. Now that you know what to expect from the data career hiring process, it's time to prepare your application materials. This includes your projects, portfolio, and resume. Then you'll learn more about interview techniques to help you land a position as a data professional. If you've earned your Google Data Analytics certificate or completed any data projects in general, you may already have an online portfolio.
As a reminder, your portfolio is a collection of materials that can be shared with potential employers; it's the part of your job application that provides evidence of your accomplishments. If you don't have a portfolio yet, it's time to make one. A portfolio is a shareable, accessible way to showcase your work. Potential employers will be interested in the skills you've gained through this program, as well as other skills you may have from previous experiences. Having tangible artifacts in a portfolio to demonstrate these skills for prospective employers should set you up well for the interview process. You can also use it to demonstrate your background in non-data industries, so it's important to create a well-rounded portfolio. Portfolios enable you to share PACE strategy documents, GitHub repositories, links to presentation slide decks, and other assets that help demonstrate your skills. You can host your portfolio on your own custom website or use an existing data-sharing platform. Sites like GitHub or Kaggle, which you may have used for sharing data, can be used to link out to your project artifacts. Tableau also has a social platform with sharing capabilities. Once you pick a platform, or multiple platforms, to host your portfolio, you can add your projects. Choose whether to represent your projects by including your data, screenshots of your visualizations, embedded code, or all of the above. If embedding isn't possible on the platform you choose, you can always include links that allow access to your projects. When you've included all of the relevant parts of a project in your portfolio, you should also explain your process: describe what work you did, why you made the choices you did, and what you could have done differently. It also helps to include a short biography. By describing your professional goals and interests, you can personalize your portfolio and make it stand out from other applicants'. Going forward, it's important to update your portfolio as you complete projects through educational experiences, online courses, or on the job. Some projects you work on may deal with private data, so make sure you follow your employer's rules and regulations for data sharing. In cases where you cannot share any data or visuals from a project, you can still include a summary of what you did in your portfolio. Which details you share may be determined by your employer, but it's important to document the role you had on each project you were a part of. Now it's time to create or update your portfolio. You can do this now or make time to update your portfolio later. This is a process you'll go through periodically throughout your career. Demonstrating your accomplishments in a portfolio will get you ready for new opportunities, now and for years to come.

Earlier in the course, you learned about creating a portfolio. Before you apply for a job, it's also important that you prepare a resume. A good resume can directly impact your chances of becoming a candidate for a position. If you completed the Google Data Analytics program, you learned a lot about creating an effective resume. If you need a refresher, feel free to revisit those resources. As you transition into your job search, you'll want to review your resume to ensure it reflects the experiences, technical abilities, knowledge, and skills you've developed in this program. Let's discuss what organizations expect in a data professional's resume and what you can do to help yours stand out.
Job descriptions often list a wide variety of responsibilities. So one of the first things you can do to improve your chances of getting noticed by recruiters and hiring managers is to refocus your resume for the specific job you're applying for. To do this, look over the requirements for the position and take note of areas within your resume that showcase the skills listed in the job posting. You may need to revise descriptions on your resume so that they reflect the language and terms used in the posting. Since many jobs in the data career field are specialized, your resume should be as well. Most employers expect a resume to include technical and software proficiencies. This is where you'll list the languages, platforms, and software you've used to analyze data. After completing this program, I encourage you to add Python and Tableau. As you revise your resume, you can also include previous work and educational experience with other data-related software. You may encounter job descriptions listing programming languages, software, and skills you're less familiar with. In those instances, consider which tools and skills on your existing resume may be transferable to the position you're applying for. In this program, you learned many valuable skills that transfer across roles in the data space. Don't forget to list your newly developed skills, including EDA, statistics, modeling, and communication. Keep in mind that hiring managers want to see examples of past work. As we discussed previously in the program, this is most often demonstrated through an online resume or a portfolio. Portfolios and resumes work together to help potential employers and hiring managers better understand how you'll be an asset to their team. Through your hard work in this program, you've created a portfolio full of PACE strategy entries and data projects, all of which are strong evidence of your proficiency as a data professional. There may even be places in your resume where you can briefly describe what you've learned in the program. As you begin the process of applying for positions, take time to revise and update your resume. With your completed portfolio and resume, you'll be ready to move into the final part of the application process.

My name is Daniel, and I'm a technical recruiter on our product analyst and data science teams. I feel like helping candidates land an amazing job is a life-changing experience, and being a part of that journey is really amazing. I've probably reviewed thousands and thousands of resumes during my career at Google, so hopefully during this conversation I can give you some tips and tricks to stand out from the crowd. As an entry-level person entering the data analytics field for the first time, there are a number of things you can do to your resume to make yourself stand out. First, make sure your skills section is listed at the top of your resume, and call out in that section what's relevant to the role you're applying to. For data analytics, the things we really look for are areas like statistics, a coding language, software, and analytics frameworks, so being able to call those out at the top of your resume is really important. But you also want to call out the soft skills, and I think those can come through in a number of different areas.
In your past work, that could be things like collaborating, working cross-functionally, and being able to problem-solve. The second piece is making sure your resume is clear and concise and tells a specific story. The experience doesn't have to be formal work experience; it could be projects you worked on at school, or internships. But you really want to call out anything that aligns to solving analytics problems, using metrics, and solving problems with analytical frameworks. And the last piece I'll say is keep it to one page at most, make sure your experience is in reverse chronological order, and really tell a story about what you've done in analytics. If you're looking to come into the field without experience, a lot of your past experience could still be very relevant to what employers are looking for. So the first thing I'd recommend is to look at the postings in the analytical field, see what key areas they typically look for, and then look through your past experience to see how it aligns. A lot of what we look for, aside from those core, hard technical skills in data science, is really about how you're able to break down problems, apply an analytical method, provide recommendations, and work with the business. A lot of times you can pull that from your past experience, and it's very relevant to what we do here, just a little different in analytics.

Throughout the program, you've learned how the PACE model can help guide you through a variety of tasks within a project. Although we have used the model in the context of data-oriented work, you can apply PACE in a number of different ways, including to the interview process. Let's discuss this further. As you learned earlier, the P in PACE is all about planning. As you plan for an interview, there will be some general questions you'll want to be prepared to answer. Potential employers are interested in how you might approach different situations that commonly occur in their work environment. For example, they may ask you to describe how you approach a data project or how you might communicate with stakeholders. As you progressed through this program, you used PACE to provide structure and guide your workflow. As a result, you now have within your portfolio examples of how you approach a variety of tasks common to data projects. Looking back at these PACE strategy document entries, you may notice that your thought process evolved. Consider how your approach changed from the first entry to the last. With more practice, you developed ingrained habits for thinking like a data professional. Sharing some of these insights into your personal workflow during an interview will help showcase the skills you refined in this program, like your ability to thoughtfully plan, analyze, construct, and execute. In an interview, you can use the PACE strategy documents you composed to showcase your capacity for growth and your ability to adapt, since you completed projects of increasing complexity throughout this program. Let's take a look at the A in PACE for a moment. Analysis can be a repetitive process within data projects; through your coursework, you often encountered the need to undertake analysis at various stages. In preparation for an interview, you'll want to conduct some form of analysis of the company and the position they're trying to fill.
Most companies have areas on their corporate websites that offer a brief history, past accomplishments, and an overall mission statement. Additionally, there's often a wealth of information about companies on career sites like LinkedIn. Taking the time to investigate a potential employer before your interview can help you prepare questions and begin to visualize yourself in the role. The C in PACE stands for construction. As you progressed through each course in this program, you were also building on your knowledge of what it means to be a data professional and constructing a portfolio of data models, visualizations, and other deliverables. During your interview process, you'll want to provide links to your portfolio containing all of the artifacts you constructed throughout the program. This can include a variety of items, like structured data, visualizations, linear regression models, an A/B test, and machine learning models. In addition to examples of your data skills, you'll also want to carefully construct your correspondence with the hiring manager or interviewers. Through these exchanges, you'll demonstrate your ability to communicate effectively. Remember to consider your intended audience and the purpose behind your message. During the interview itself, you'll bring it all together, executing on your plan to secure a position in the organization. Here you'll share your experiences and portfolio with the goal of taking the next step in your data professional career. As you've discovered throughout the program, there is always overlap in the PACE model. Each stage requires a degree of execution, whether you're developing a plan of action, analyzing data, or sharing findings. That's where your path has been leading throughout this entire program: to a point where you are prepared to handle data-oriented challenges and communicate effectively with stakeholders. Along the way, you've practiced decoding business challenges, making complex projects actionable, and communicating key data insights through portfolio projects and a capstone that will set you apart from other candidates. You've discovered that a highly effective data professional needs a balance of technical, interpersonal, and workplace skills, and through this program you've developed in all three areas. With the PACE model, you always have a structure that will help you feel comfortable approaching new projects. It can provide a solid foundation for any project throughout your data career. I also hope you'll use the PACE model to inform the ways you showcase your transferable skills in interviews on the job market. The supplemental materials included in this course offer additional resources to assist you in your job search. I'm so proud of just how far you've come, and I'm excited about all the possibilities and opportunities that await you.

Hi, I'm Eva. I'm a product analytics lead here at Google. I manage a team of analysts, and together we help our product managers and engineering partners answer business questions so they can help advertisers better optimize their accounts on Google Ads. I started off doing mostly sales, marketing, and event-planning sorts of jobs. I kind of just said yes to everything and anything my friends were involved in, and that combination of different backgrounds and jobs ultimately taught me how to tell stories, which I thought was really important. When I first joined Google, I was actually a salesperson.
So I helped people optimize their AdWords accounts. While I was doing that, I realized I wanted to do a better job myself, to help people more effectively and at scale. So after work I would go home and watch a bunch of YouTube videos on how to write SQL, and I met with some analysts on partner teams. We went through a few SQL problems together, and I started teaching other people how to do the same. In doing that, I made friends with some people on the sales analytics side, and they took a chance on me, to be honest. Eventually I transitioned to an analytics team. Having that deep knowledge of the sales program and then switching to a team focused on helping salespeople really helped, because I had that deep context. The analysts I've known who are the best at their jobs did not have traditional backgrounds. The people I look up to the most were cellists, drummers, biologists, physicists, teachers, et cetera. And the reason you're in a particularly great position, not having had the traditional job but learning the skills now, is that no analyst is worth their salt without business context: a deep understanding of the people they're working with, so they can help answer their questions. By not being in that traditional world, you have that knowledge. So don't feel like you're so behind or anything like that. Understand that coming from a non-traditional background actually has quite a bit of value.

Hey again. Today a lot of us spend a lot of time connecting with people online. We stay in touch with family and friends we can't see every day, or post about what we're doing, eating, and watching on social media. But our presence online goes beyond the personal. A consistent and professional online presence is an important tool in building a career in data analytics. A professional online presence is important for a few key reasons. First, it can help potential employers find you. Second, it lets you make connections with other data analysts in your field, learn and share data findings, and maybe even participate in community events. Keep in mind that a lot of networking happens online now, so if you aren't keeping up your online presence, you might be missing out on great opportunities without even knowing it. There are lots of different professional sites you can take advantage of as you start building your own online presence. For now, though, we'll focus on LinkedIn and GitHub. LinkedIn is specifically designed to help people make connections with others in their field. It's a great way to follow trends in your industry, learn from industry leaders, and stay engaged with the wider professional community. And if you're actively looking for a new job, LinkedIn has job boards you can search. You can even narrow down your location to see who's hiring near you. Plus, job recruiters frequently use LinkedIn to find potential data analysts for new projects, so it's always a good idea to keep your LinkedIn profile up to date with your resume. You might find yourself being recruited. LinkedIn also lets you connect with people and build a network. You can share exciting things happening in your professional life and keep up with where your connections go. You never know when you might end up working with someone again. With LinkedIn, you can be endorsed for job skills or endorse other people.
So if you impressed someone at a previous job, they can let other people know just how awesome you are to work with. GitHub, the other website I mentioned earlier, is a little different. GitHub is part code-sharing site, part social media. It has an active community collaborating and sharing insights to build resources. You can talk with other GitHub users on the forums, use the community-driven wikis, or even use it to manage team projects. GitHub also hosts community events where you can meet other people in the field and learn new things. GitHub has a lot of features for you to check out, and the best way to learn more is to explore it for yourself. We'll also be talking more about GitHub later in the program. Sometimes if you're looking for a new career, finding someone who has something in common with you, like shared interests or the same hometown, and reaching out to them can help a lot. Just a 15-minute conversation could set you on the path to a new career, whether that starts on a professional networking site like LinkedIn or at a community event hosted by GitHub. LinkedIn has become one of the standard professional social media sites, so it's a good starting place for building your online presence. And GitHub offers a lot of really great tools for data analysts and the community. So if you don't already have accounts on these sites, challenge yourself to set them up now. Connect with other people and share some updates about what you're working on right now. And if you're already using LinkedIn and GitHub, great news: we're going to talk more about how to enhance your existing social media presence next time. See you soon.

Hello, let's talk about social media. Today there are 3.8 billion people using social media around the world, so there's a good chance you already have an online presence. That's great. It means you're already connecting with people online, maybe even professionally on websites like LinkedIn. And if you aren't, getting started is as easy as signing up today. But there are some really easy ways you can enhance your online presence even more and use your existing profiles to build your professional identity. One of the first things you should ask yourself when looking at your new or existing online presence is this: would you be okay with potential employers and colleagues seeing your social media profiles? Try putting yourself in their shoes. When a potential employer looks at your public profiles, they're asking themselves if you're the right person to represent their company and its values. Is there anything on your current accounts that could make them think otherwise? If you want to limit what you share, be sure to check the privacy settings on your accounts. If they're set to public, anyone can see everything you post. You can also make specific photos or albums private, but remember, this doesn't erase them from the internet. Keep in mind that changing your privacy settings doesn't necessarily keep all of your posts secure, so you should always think carefully before you post. The best way to make sure your posts and photos are appropriate and professional is to delete any you wouldn't want your future boss to see. And if you're getting ready to upload photos for the first time, think about how those pictures represent you before posting them. Feel free to back up these photos in your personal files, but maybe don't put them on Facebook or Instagram.
Speaking of Facebook and Instagram, there are some easy options for deleting posts on these platforms. Both Facebook and Instagram have an archive function that allows you to remove posts from your profile, and you can even mass-delete posts on Facebook. While you're at it, check your Twitter. Your social media profiles are probably connected, so it's important to make sure they're all representing you the way you want to be seen professionally. A good rule of thumb: your posts should be family friendly. This goes for photos and text posts alike; check to make sure your content and language are appropriate for the whole family. And while we're working on enhancing your online persona, a professional profile picture is a great touch. Even if your account is set to private, recruiters will likely still be able to see your profile picture. Having a photo on your LinkedIn profile is important because it significantly increases your chances of being contacted, so make your profile picture one that represents your professional side in the best way possible. Once you've gotten your profiles up and running, post mindfully. Think about the professional image you're trying to create and stick to it. This means curating posts for different platforms. Decide which platforms you want to use for family and friends, like Facebook and Instagram, and keep updates about your personal life on those platforms. Use professional platforms like LinkedIn for posts related to your work life and building professional relationships. A huge number of companies and hiring managers use online sources to identify and pick candidates, so it's important to make sure your online presence has a positive impact on your real life. Make sure your online presence is job appropriate by making your accounts private, deleting posts you wouldn't want your boss or colleagues to see, and posting mindfully. And don't be afraid to ask someone you respect professionally to take a look and give you some feedback. That can be a big help in building your online presence and using it to make connections within your professional community. Now that we've built and enhanced our online presence, let's learn more about building networks and reaching out to other professionals. See you soon.

Great, you're back. When you take a picture, you usually try to capture lots of different things in one image. Maybe you're taking a picture of the sunset and want to capture the clouds, the tree line, and the mountains. Basically, you want a snapshot of that entire moment. You can think of building your resume in the same way: you want your resume to be a snapshot of all that you've done, both in school and professionally. In this video, we'll go through the process of building a resume, which you'll be able to add your own details to. Keep in mind this is a snapshot. So when managers and recruiters look at what you've included in your resume, they should be able to tell right away what you can offer their company. The key here is to be brief. Try to keep everything to one page, and keep each description to just a few bullet points. Two to four bullet points is enough, but remember to keep them concise. Sticking to one page will help you stay focused on the details that best reflect who you are, or who you want to be, professionally. One page might also be all that hiring managers and recruiters have time to look at. They're busy people, so you want your resume to get their attention as quickly as possible.
Now let's talk about actually building your resume. This is where templates come in. They're a great way to build a brand-new resume or reformat one you already have. Programs like Microsoft Word or Google Docs, and even some job search websites, all have templates you can use. A template has placeholders for the information you'll need to enter, plus its own design elements to make your resume look inviting. You'll have a chance to explore this option a little later. For now, we'll go through the steps you can take to make your resume professional, easy to read, and error free. If you already have a resume document, you can use these steps to tweak it. There's more than one way to build a resume, but most have contact information at the top of the document. This includes your name, address, phone number, and email address. If you have multiple email addresses or phone numbers, use the ones that are most reliable and sound professional. It's also great if you can use your first and last name in your email address, like Jando17@email.com. You should also make sure your contact information matches the details you've included on professional websites. While most resumes have contact information in the same place, it's up to you how you organize that info. A format that focuses more on skills and qualifications and less on work history is great for people who have gaps in their work history. It's also good for those who are just starting their careers or making a career change, and that might be you. If you do want to highlight your work history, feel free to include details of your work experience, starting with your most recent job. If you have lots of jobs that are related to a new position you're applying for, this format makes sense. If you're editing a resume you already have, you can keep the same format and adjust the details. If you're starting a new one or building a resume for the first time, choose the format that makes the most sense for you. There are lots of resume resources online, and browsing through a variety of resumes will give you an idea of the format that works best for you. Once you've decided on a format, you can start adding your details. Some resumes begin with a summary, but this is optional. A summary can be helpful if you have experience that is not traditional for a data analyst, or if you're making a career transition. If you decide to include a summary, keep it to one or two sentences that highlight your strengths and how you can help the company you're applying to. You'll also want to make sure your summary includes positive words about yourself, like "dedicated" and "proactive." You can support those words with data, like the number of years you've worked or the tools you have experience in, like SQL and spreadsheets. A summary might start off with something like "hardworking customer service representative with over five years of experience." And once you've completed this program and have your certificate, you'll be able to include that too, which could sound like this: "Entry-level data analytics professional; recently completed the Google Data Analytics Professional Certificate." Sounds pretty good, doesn't it? Another option is leaving a placeholder for your summary while you build the rest of your resume, then writing it after you finish the other sections. This way, you can review the skills and experience you've mentioned and grab two or three of the highlights to use in your summary.
It's also good to note that your summary might change a little as you apply for different jobs. If you're including a work experience section, there are lots of different types of experience you could add. Outside of jobs with other companies, you could also include volunteer positions you've held and any freelance or side work you've done. The key here is the way you describe these experiences. Try to describe the work you did in a way that relates to the position you're applying for. Most job descriptions have minimum qualifications or requirements listed. These are the experiences, skills, and education you'll need to be considered for the job, so it's important to clearly address them in your resume. If you're a good match, the next step is checking out the preferred qualifications, which lots of job descriptions also include. These aren't required, but every additional qualification you match makes you a more competitive candidate for the role. Including any part of your skills and experience that matches the job description will help your resume rise above the competition. So if a job listing describes a responsibility as "effectively managing data resources," you'll want a description of your own that reflects that responsibility. For example, if you volunteered or worked at a local school or community center, you might say that you effectively managed resources for after-school activities. Later on, you'll learn more ways to make your work history work for you. It's helpful to describe your skills and qualifications in the same way. For example, if a listing mentions organization and partnering with others, try to think about relevant experiences you've had. Maybe you helped organize a food drive or partnered with someone to start an online business. In your descriptions, you want to highlight the impact you had in your role, as well as the impact the role had on you. If you helped a business get started or reach new heights, talk about that experience and how you played a part in it. Or if you worked at a store when it first opened, you can say that you helped launch a successful business by ensuring quality customer service. If you used data analytics in any of your jobs, you'll definitely want to include that as well. We'll cover how to add specific data analysis skills a little later. One way to do this is to follow a formula in your descriptions: accomplished X, as measured by Y, by doing Z. Here's an example of how this might read on a resume: "Selected as one of 275 participants nationwide for this 12-month professional development program for high-achieving talent, based on leadership potential and academic success." And if you've gained new skills through one of your experiences, be sure to highlight them and how they helped. This is probably as good a spot as any to bring up data analytics. Even if this program is the first time you've really thought about data analytics, now that you're equipped with some knowledge, you'll want to use it to your benefit. So if you've ever managed money, maybe that means you helped a business analyze future earnings, or maybe you created a budget based on your analysis of previous spending. Even if it was for your own or a friend's small business, it's still data that you've analyzed. Now you can reflect on when and how, and use that in your resume. After you've added work experience and skills, you should include a section for any education you've completed. And yes, this course absolutely counts.
You can add this course as part of your education, and you can also refer to it in your summary and skills sections. Depending on the format of your resume, you might want to add a section for technical skills you've acquired, both in this course and elsewhere. Besides technical skills like SQL, you could also include language proficiencies in this section. Having some ability in a language other than English can only help your job search. So now you have an idea of how to make your resume look professional and appealing. As you move forward, you'll learn even more about how to make your resume shine. By the end, you'll have a resume you can be proud of. Next up, we'll talk about how to make your resume truly unique. See you soon.

Great to see you again. Building a strong resume is a great way to find success in your job hunt. You've had the chance to start building your resume, and now we'll take the next step by showing you how to refine your resume for data analytics jobs. Let's get started. For data analysts, one of the most important things your resume should do is show that you're a clear communicator. Companies looking for analysts want to know that the people they hire can not only do the analysis but also explain it to any audience in a clear and direct way. Your first audience as a data analyst will most likely be hiring managers and recruiters, so being direct and coherent in your resume will go a long way with them as well. Let's start with the summary section. While you won't go into too much detail here about any of your work experiences, it's a good spot to point out if you're transitioning into a new career role. You might add something like "transitioning from a career in the auto industry and seeking a full-time role in the field of data analytics." One strategy you can use in your summary and throughout your resume is PAR statements. PAR stands for Problem, Action, Result. This is a great way to help you write clearly and concisely. So instead of saying something like "was responsible for writing two blogs a month," you'd say "earned little-known website over 2,000 new clicks through strategic blogging." The website being little-known is the problem, the strategic blogging is the action, and the 2,000 new clicks are the result. Adding PAR statements to your job descriptions or skills section can help with the organization and consistency of your resume. They definitely helped me when I changed jobs. Speaking of the skills section, make sure you include any skills and qualifications you've acquired through this course and on your own. You don't need to be super technical, but talking about your experience with spreadsheets, SQL, Tableau, and R, a programming language we'll get to later, will enhance your resume and your chances of getting a job. So if you're listing qualifications or skills, you might include a spot for programming languages and then list SQL and R, which are both part of the Google Data Analytics certificate. You might even add the top functions, packages, or formulas you're comfortable with in each. It also makes sense to include skills you've acquired in spreadsheets, like PivotTables. PivotTables, SQL, R, and lots of other terms we've covered here might get you noticed by hiring managers and recruiters. But you definitely want your resume to accurately represent your skills and abilities, so only add these skills after you've completed the certificate.
Once you start applying the ideas we've talked about here to your resume, you'll be well on your way to setting yourself apart from other candidates. And after you've completed your final course, you'll have the opportunity to complete a case study and link to it on your resume. This will be a great opportunity to show recruiters and hiring managers the skills you've learned while earning your certificate. Before you know it, you'll have a great resume that you can update quickly whenever you're searching for a data analyst job. Nothing wrong with that. Up next, we'll talk more about adding experience to your resume. Bye for now.

Great work. You've reached the end of the program. Remember when you made a commitment at the start of the program to build upon your data analytics knowledge and skills? Well, after all of your hard work and dedication, you fulfilled that promise. I'm honored to be the first to congratulate you on all you've accomplished, but I'm certainly not the last. There are a few other people waiting to speak to you. Congratulations on reaching the end of this course. Incredible work. Congratulations on completing your certificate. You should be proud of yourself. Congratulations on all your progress. Well done. Awesome job, everybody. Congratulations. You finished the course. Congratulations on completing this section of the program, and good luck on the rest of your journey. Congratulations. Good work. Congrats, and hopefully I'll be working with you someday after you get a job at Google. Congratulations. This was not an easy feat, and you stuck through it. Very proud of you. Congratulations. You did a great job. Good luck out there. Congratulations. You did it. Congratulations. You've made it to the end of the course. So now go ahead, get your foot in the door, and start applying. You've done it. You've finished the course. I'm looking forward to you being in my next interview and telling me about all the wonderful things you did. Congrats. Congratulations. Congratulations. You finished your certificate. And now remember that there are many more things to learn. Congratulations. You did it. Good luck with the rest of your data science journey.

It's time to celebrate your achievement. You worked very hard to complete this program and learned a tremendous amount about what it means to be a data professional. All that's left to do is collect your certificate of completion, which you can display on your resume or LinkedIn profile. I hope you're as proud of your accomplishment as I am. It's been fantastic supporting you on your learning journey throughout the program. And now it's time for you to get out there and make an impact in the data career space. Goodbye for now.

You just earned your Google career certificate. This is a huge achievement that demonstrates you're invested in learning new skills for your future. On behalf of my fellow course instructors and myself, congratulations. Now that you've earned your certificate, you can share your accomplishment during your job search. It can be displayed on job search platforms such as LinkedIn, Indeed, and Glassdoor. You can also connect with companies in your field that are eager to hire through Grow with Google's Employer Consortium. As you learned at the beginning of this program, the demand for data skills is growing at an incredible rate.
With the skills and knowledge you've gained from this program, you can begin applying for jobs or work to advance your career in this high-growth, high-impact field. The process may take some time, but you now have everything you need to get hired in the data career space. Let's review all you've learned throughout this program. You started by learning the foundations of data science. Here, you learned the roles and functions that data professionals play within an organization. Then you learned about data tools and the importance of a structured workflow. As a reminder, effective communication is crucial for successful collaboration as a data professional. And finally, you learned about careers in data-driven fields and how to prepare yourself for your future. In the next course, you learned how to use Python for data-related work. You investigated a variety of concepts within the Python language, such as syntax, variables, loops, strings, data structures, and object-oriented programming. Additionally, you discovered how to expand Python's capabilities with libraries and packages. Next, you explored how to access the stories within data through exploratory data analysis. Here, you used more tools within Python to clean and prepare data for analysis. You also learned how to create visualizations using Tableau that help present the information inside large data sets. In your statistics course, you learned about descriptive and inferential statistics, basic probability and probability distributions, sampling, confidence intervals, and hypothesis testing. You had the opportunity to conduct A/B testing using actual data. Then, you investigated regression models and learned about their assumptions, validation, construction, evaluation, and interpretation. Each of these was explored using Python and incorporated into your PACE workflow. Finally, in the last course, you focused on the machine learning landscape. Building on your knowledge, you explored different types of models, including supervised models for continuous and categorical outcomes. You were also introduced to unsupervised learning techniques and how to use them to meet business needs. It was an incredible amount of work. As you enter this new phase in your career, be sure to stay engaged by following trends within the data career space. Your learning journey doesn't end here. You can keep up to date with industry news, emerging data insights, and inventive ways to improve your skills. Also, continue updating your portfolio and resume to highlight your best work and what you've accomplished. You've already shown some serious dedication to the field, so stay curious. And with that, congratulations. It's been wonderful leading you through the final part of this program. I know you're well-prepared for a fantastic career as a data professional. Good luck.